ZenFlow: A New DeepSpeed Extension Designed as a Stall-Free Offloading Engine for Large Language Model (LLM) Training

The DeepSpeed team unveiled ZenFlow, a new offloading engine designed to overcome a major bottleneck in large language model (LLM) training: CPU-induced GPU stalls. While offloading optimizers and gradients to CPU memory reduces GPU memory pressure, traditional frameworks like ZeRO-Offload and ZeRO-Infinity often leave expensive GPUs idle for most of each training step—waiting on slow CPU updates and PCIe transfers. For example, fine-tuning Llama 2-7B on 4× A100 GPUs with full offloading can balloon step time from 0.5s to over 7s, a 14× slowdown. ZenFlow eliminates these stalls by decoupling GPU and CPU computation with importance-aware pipelining, delivering up to 5× end-to-end speedup over ZeRO-Offload and reducing GPU stalls by more than 85%.

How ZenFlow Works

Importance-Aware Gradient Updates: ZenFlow prioritizes the top-k most impactful gradients for immediate GPU updates, while deferring less important gradients to asynchronous CPU-side accumulation. This reduces per-step gradient traffic by nearly 50% and PCIe bandwidth pressure by about 2× compared to ZeRO-Offload.

Bounded-Asynchronous CPU Accumulation: Non-critical gradients are batched and updated asynchronously on the CPU, hiding CPU work behind GPU compute. This ensures GPUs are always busy, avoiding stalls and maximizing hardware utilization.

Lightweight Gradient Selection: ZenFlow replaces full gradient AllGather with a lightweight, per-column gradient norm proxy, reducing communication volume by over 4,000× with minimal impact on accuracy. This enables efficient scaling across multi-GPU clusters.
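To make the idea concrete, the following PyTorch sketch illustrates how per-column gradient norms can act as a cheap importance proxy for picking the top-k columns to update immediately on the GPU. This is an illustrative sketch of the concept, not ZenFlow's actual implementation; the topk_ratio value mirrors the configuration parameter described below.

import torch

def select_important_columns(grad: torch.Tensor, topk_ratio: float = 0.05):
    """Pick the columns with the largest L2 norm as the 'important' gradients."""
    # Per-column norms are tiny compared to the full gradient, so exchanging
    # them across ranks is far cheaper than an AllGather of the gradients.
    col_norms = grad.norm(dim=0)                      # shape: [num_columns]
    k = max(1, int(topk_ratio * col_norms.numel()))
    top_idx = torch.topk(col_norms, k).indices        # indices of important columns
    important = grad.index_select(1, top_idx)         # updated immediately on the GPU
    return top_idx, important

grad = torch.randn(4096, 4096)                        # toy 2-D gradient
idx, important = select_important_columns(grad, topk_ratio=0.05)

The remaining columns would be accumulated asynchronously on the CPU, which is the part ZenFlow hides behind GPU compute.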

Zero Code Changes, Minimal Configuration: ZenFlow is built into DeepSpeed and requires only minor JSON configuration changes. Users set parameters like topk_ratio (e.g., 0.05 for top 5% of gradients) and enable adaptive strategies with select_strategy, select_interval, and update_interval set to “auto”.

Auto-Tuned Performance: The engine adapts update intervals at runtime, eliminating the need for manual tuning and ensuring maximum efficiency as training dynamics evolve.

https://arxiv.org/abs/2505.12242

Performance Highlights

Feature | Impact
Up to 5× end-to-end speedup | Faster convergence, lower costs
>85% reduction in GPU stalls | Higher GPU utilization
≈2× lower PCIe traffic | Less cluster bandwidth pressure
No accuracy loss on GLUE benchmarks | Maintains model quality
Lightweight gradient selection | Scales efficiently to multi-GPU clusters
Auto-tuning | No manual parameter tuning required

Practical Usage

Integration: ZenFlow is a drop-in extension for DeepSpeed’s ZeRO-Offload. No code changes are needed; only configuration updates in the DeepSpeed JSON file are required.

Example Use Case: The DeepSpeedExamples repository includes a ZenFlow finetuning example on the GLUE benchmark. Users can run this with a simple script (bash finetune_gpt_glue.sh), following setup and configuration instructions in the repo’s README. The example demonstrates CPU optimizer offload with ZenFlow asynchronous updates, providing a practical starting point for experimentation.

Configuration Example:

"zero_optimization": {
  "stage": 2,
  "offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
  },
  "zenflow": {
    "topk_ratio": 0.05,
    "select_strategy": "auto",
    "select_interval": "auto",
    "update_interval": 4,
    "full_warm_up_rounds": 0,
    "overlap_step": true
  }
}
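For context, a minimal training-loop sketch is shown below. It assumes a standard DeepSpeed setup in which model and dataloader are already defined, the JSON above is saved under a hypothetical filename ds_config.json, and the model's forward pass returns the loss; no ZenFlow-specific code is needed beyond the configuration.

import deepspeed

# ZenFlow is enabled purely through the JSON config; the training loop is unchanged.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for batch in dataloader:
    loss = model_engine(**batch)       # assumes the forward pass returns the loss
    model_engine.backward(loss)
    model_engine.step()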

Getting Started: Refer to the DeepSpeed-ZenFlow finetuning example and the official tutorial for step-by-step guidance.

Summary

ZenFlow is a significant leap forward for anyone training or fine-tuning large language models on limited GPU resources. By effectively eliminating CPU-induced GPU stalls, it unlocks higher throughput and lower total cost of training, without sacrificing model accuracy. The approach is particularly valuable for organizations scaling LLM workloads across heterogeneous hardware or seeking to maximize GPU utilization in cloud or on-prem clusters.

For technical teams, the combination of automatic tuning, minimal configuration, and seamless integration with DeepSpeed makes ZenFlow both accessible and powerful. The provided examples and documentation lower the barrier to adoption, enabling rapid experimentation and deployment.

ZenFlow redefines offloading for LLM training, delivering stall-free, high-throughput fine-tuning with minimal configuration overhead—a must-try for anyone pushing the boundaries of large-scale AI.

Check out the Technical Paper, GitHub Page and Blog.

Deep Learning Framework Showdown: PyTorch vs TensorFlow in 2025

The choice between PyTorch and TensorFlow remains one of the most debated decisions in AI development. Both frameworks have evolved dramatically since their inception, converging in some areas while maintaining distinct strengths. This article explores the latest patterns from the comprehensive survey paper from Alfaisal University, Saudi Arabia, synthesizing usability, performance, deployment, and ecosystem considerations to guide practitioners in 2025.

Philosophy & Developer Experience

PyTorch burst onto the scene with a dynamic (define-by-run) paradigm, making model development feel like regular Python programming. Researchers embraced this immediacy: debugging is straightforward, and models can be altered on the fly. PyTorch’s architecture—centered around torch.nn.Module—encourages modular, object-oriented design. Training loops are explicit and flexible, giving full control over every step, which is ideal for experimentation and custom architectures.
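As a minimal illustration of this explicit style (a generic sketch, not taken from the survey), a small classifier and its hand-written training loop look like this:

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, in_dim=20, hidden=64, classes=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, classes))

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(128, 20), torch.randint(0, 3, (128,))
for epoch in range(5):              # explicit loop: every step is visible and modifiable
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()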

TensorFlow, initially a static (define-and-run) framework, pivoted with TensorFlow 2.x to embrace eager execution by default. The Keras high-level API, now deeply integrated, simplifies many standard workflows. Users can define models using tf.keras.Model and leverage one-liners like model.fit() for training, reducing boilerplate for common tasks. However, highly custom training procedures may require dropping back to TensorFlow’s lower-level APIs, which can add complexity.

Debugging in PyTorch is often easier due to Pythonic tracebacks and the ability to use standard Python tools. TensorFlow’s errors, especially when using graph compilation (@tf.function), can be less transparent. Still, TensorFlow’s integration with tools like TensorBoard provides robust visualization and logging out of the box, which PyTorch has also adopted via SummaryWriter.
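As a point of comparison with the explicit PyTorch loop above, a minimal Keras sketch (with arbitrary layer sizes) shows how model.fit() absorbs the training loop:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

x = tf.random.normal((128, 20))
y = tf.random.uniform((128,), maxval=3, dtype=tf.int32)
model.fit(x, y, epochs=5, verbose=0)   # the training loop lives inside fit()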

Performance: Training, Inference, & Memory

Training Throughput: Benchmark results are nuanced. PyTorch often trains faster on larger datasets and models, thanks to efficient memory management and optimized CUDA backends. For example, in experiments by Novac et al. (2022), PyTorch completed a CNN training run 25% faster than TensorFlow, with consistently quicker per-epoch times. On very small inputs, TensorFlow sometimes has an edge due to lower overhead, but PyTorch pulls ahead as input size grows.

Inference Latency: For small-batch inference, PyTorch frequently delivers lower latency—up to 3× faster than TensorFlow (Keras) in some image classification tasks (Bečirović et al., 2025). The advantage diminishes with larger inputs, where both frameworks are more comparable. TensorFlow’s static graph optimization historically gave it a deployment edge, but PyTorch’s TorchScript and ONNX support have closed much of this gap.

Memory Usage: PyTorch’s memory allocator is praised for handling large tensors and dynamic architectures gracefully, while TensorFlow’s default behavior of pre-allocating GPU memory can lead to fragmentation in multi-process environments. Fine-grained memory control is possible in TensorFlow, but PyTorch’s approach is generally more flexible for research workloads.

Scalability: Both frameworks now support distributed training effectively. TensorFlow retains a slight lead in TPU integration and large-scale deployments, but PyTorch’s Distributed Data Parallel (DDP) scales efficiently across GPUs and nodes. For most practitioners, the scalability gap has narrowed significantly.

Deployment: From Research to Production

TensorFlow offers a mature, end-to-end deployment ecosystem:

Mobile/Embedded: TensorFlow Lite (and Lite Micro) leads for on-device inference, with robust quantization and hardware acceleration.

Web: TensorFlow.js enables training and inference directly in browsers.

Server: TensorFlow Serving provides optimized, versioned model deployment.

Edge: TensorFlow Lite Micro is the de facto standard for microcontroller-scale ML (TinyML).

PyTorch’s deployment tooling has matured considerably as well:

Mobile: PyTorch Mobile supports Android/iOS, though with a larger runtime footprint than TFLite.

Server: TorchServe, developed with AWS, provides scalable model serving.

Cross-Platform: ONNX export allows PyTorch models to run in diverse environments via ONNX Runtime.
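As a brief illustration of that cross-platform path (a generic sketch with an arbitrary toy model, not a production recipe), a PyTorch module can be exported to ONNX and executed with ONNX Runtime:

import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3)).eval()
dummy_input = torch.randn(1, 20)

torch.onnx.export(
    model, dummy_input, "classifier.onnx",
    input_names=["features"], output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}},   # allow variable batch size
)

# The exported graph runs anywhere ONNX Runtime is available.
session = ort.InferenceSession("classifier.onnx")
logits = session.run(None, {"features": dummy_input.numpy()})[0]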

Interoperability is increasingly important. Both frameworks support ONNX, enabling model exchange. Keras 3.0 now supports multiple backends (TensorFlow, JAX, PyTorch), further blurring the lines between ecosystems.

Ecosystem & Community

PyTorch dominates academic research, with approximately 80% of NeurIPS 2023 papers using PyTorch. Its ecosystem is modular, with many specialized community packages (e.g., Hugging Face Transformers for NLP, PyTorch Geometric for GNNs). The move to the Linux Foundation ensures broad governance and sustainability.

TensorFlow remains a powerhouse in industry, especially for production pipelines. Its ecosystem is more monolithic, with official libraries for vision (TF.Image, KerasCV), NLP (TensorFlow Text), and probabilistic programming (TensorFlow Probability). TensorFlow Hub and TFX streamline model sharing and MLOps. Stack Overflow’s 2023 survey showed TensorFlow slightly ahead in industry, while PyTorch leads in research. Both have massive, active communities, extensive learning resources, and annual developer conferences.

Use Cases & Industry Applications

Computer Vision: TensorFlow’s Object Detection API and KerasCV are widely used in production. PyTorch is favored for research (e.g., Meta’s Detectron2) and innovative architectures (GANs, Vision Transformers).

NLP: The rise of transformers has seen PyTorch surge ahead in research, with Hugging Face leading the charge. TensorFlow still powers large-scale systems like Google Translate, but PyTorch is the go-to for new model development.

Recommender Systems & Beyond: Meta’s DLRM (PyTorch) and Google’s RecNet (TensorFlow) exemplify framework preferences at scale. Both frameworks are used in reinforcement learning, robotics, and scientific computing, with PyTorch often chosen for flexibility and TensorFlow for production robustness.

Conclusion: Choosing the Right Tool

There is no universal “best” framework. The decision hinges on your context:

PyTorch: Opt for research, rapid prototyping, and custom architectures. It excels in flexibility and ease of debugging, and it is the community favorite for cutting-edge work.

TensorFlow: Choose for production scalability, mobile/web deployment, and integrated MLOps. Its tooling and deployment options are unmatched for enterprise pipelines.

In 2025, the gap between PyTorch and TensorFlow continues to narrow. The frameworks are borrowing each other’s best ideas, and interoperability is improving. For most teams, the best choice is the one that aligns with your project’s requirements, team expertise, and deployment targets—not an abstract notion of technical superiority.

Both frameworks are here to stay, and the real winner is the AI community, which benefits from their competition and convergence.

Check out the Technical Paper.

Create personalized products and marketing campaigns using Amazon Nova …

This post was written with Jake Friedman from Wildlife.
Businesses are seeking innovative ways to differentiate themselves through hyper-personalization and enhanced customer experiences. At the Cannes Lions International Festival of Creativity 2025, AWS showcased The Fragrance Lab, an interactive and inspiring experience that demonstrates how generative AI can support the development of hyper-personalized consumer goods and accelerate advertising creative concept and campaign assets development. Following Cannes Lions 2025, The Fragrance Lab received a Gold and Silver Stevie Award from the International Business Awards in the Brand & Experiences category.
Built using Amazon Nova in Amazon Bedrock, The Fragrance Lab represents a comprehensive end-to-end application that illustrates the transformative power of generative AI in retail, consumer goods, advertising, and marketing. While our activation at Cannes Lions focused on personalized fragrance development and ad campaign creation, the underlying architecture and methodology can be adapted across diverse categories, from fashion to food and beverage, opening endless possibilities for customized customer experiences.
Introducing The Fragrance Lab
In this post, we explore the development of The Fragrance Lab. Our vision was to craft a unique blend of physical and digital experiences that would celebrate creativity, advertising, and consumer goods while capturing the spirit of the French Riviera. To bring this vision to life, we collaborated with Wildlife, a company that is exceptional at transforming AWS generative AI services into compelling physical experiences. Wildlife was fundamental in brainstorming ideas that would inspire customers and showcase novel use cases that AI makes possible.

Crafting the fragrance
As the first step, the experience used Amazon Nova Sonic, a speech-to-speech model that engages in intuitive dialogues with attendees to understand their personality and preferences. Nova Sonic extends its capabilities through tool integration, allowing it to manage user traits and interface actions through specialized tools such as addTraitTool, removeTraitTool, and uiActionIntentTool. These tools help maintain conversation state and a consistent flow throughout the experience. The collected conversation data and trait information are then processed through a custom Retrieval Augmented Generation (RAG) system built with Amazon Nova Pro, a highly capable multimodal model that offers our best combination of accuracy, speed, and cost. Nova Pro serves as the intelligence engine for analyzing interactions and extracting essential keywords to determine the perfect fragrance notes and composition. The application also used Amazon Bedrock Guardrails, which offers customizable safeguards and responsible AI policies to block undesirable topics—such as allergens or harmful content—to offer a seamless customer experience.
For example, a customer might share with Nova Sonic that they are interested in travel. Nova Pro picks up that exploring new places often “brings a sense of freshness and excitement,” resulting in a fragrance that feels fresh and invigorating, featuring “a burst of citrus or a floral breeze.” The customer might also share that they enjoy early morning walks across spring fields, which Nova Pro translates into a top note of fresh bergamot, a middle note featuring floral honey, and a base of lavender. The customer’s inputs guide the selection of fragrance notes—from base, to heart, to top notes—which are then expertly mixed by on-site perfumers to create truly personalized scents. Aided by AI, perfumers were able to customize and craft hundreds of unique fragrances per day. A process that would normally take a perfumer hours was accelerated to minutes, empowering both the customer and the fragrance expert.
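As a rough sketch of how such a trait-to-fragrance mapping step could be wired up with the Amazon Bedrock Converse API, the snippet below sends the collected traits to Nova Pro. The prompt, response handling, and inference settings are illustrative assumptions rather than the production Fragrance Lab code.

import boto3

bedrock = boto3.client("bedrock-runtime")

traits = ["loves travel", "early morning walks across spring fields"]   # collected by Nova Sonic
prompt = (
    "Map these personality traits to fragrance notes. "
    "Return JSON with keys 'top', 'middle', and 'base': " + ", ".join(traits)
)

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",                       # Amazon Nova Pro on Bedrock
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.3},
)
notes_json = response["output"]["message"]["content"][0]["text"]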

Creating the campaign
After the personalized fragrance formula was created and sent to the perfumer queue, Amazon Nova Canvas generated customized marketing creative, including the fragrance name, tagline, and imagery that captured the essence of the formula. Attendees were able to further customize the campaign assets using guest inputs such as moody, beachy, or playful. The resulting fragrance image was then transformed into dynamic video content through Amazon Nova Reel, which customers could further customize to meet their creative vision and download to save or share. To match the Cannes Lions atmosphere, the campaign videos were generated with a French-accented female voice using Amazon Polly. The entire experience is built in Amazon Bedrock, a fully managed service to build and scale generative AI applications with AI models.
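A simplified sketch of the image-generation call is shown below. The prompt, resolution, and output handling are assumptions for illustration, and the request fields follow the documented Nova Canvas text-to-image schema; verify the exact field names against the current Amazon Bedrock documentation.

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

request = {
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": "High-end fragrance campaign image, unbranded bottle, "
                "fresh bergamot and lavender, beachy mood"
    },
    "imageGenerationConfig": {"numberOfImages": 1, "width": 1024, "height": 1024, "cfgScale": 7.0},
}

response = bedrock.invoke_model(modelId="amazon.nova-canvas-v1:0", body=json.dumps(request))
payload = json.loads(response["body"].read())
with open("campaign.png", "wb") as f:
    f.write(base64.b64decode(payload["images"][0]))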
The following data flow diagram shows how multiple Amazon Nova models can be combined for a rich, cohesive, and personalized customer experience.

Best practices for implementation
The Fragrance Lab centers around interactions with Amazon Nova Sonic, providing users with a natural language interface to express their preferences for a custom scent. Through its tool integration capabilities, Nova Sonic orchestrates the entire experience by managing user traits and triggering appropriate workflows. These workflows seamlessly guide the experience from initial conversation to fragrance development and ultimately to campaign asset creation, driving both the visual elements and progression of the experience. The model’s ability to maintain a conversational state, while defining clear conversational flows, helps ensure a consistent and pleasant experience for every user.
A well-defined workflow and conversational assistant are pivotal in guiding these conversations to uncover the qualities that are most important to each user. And the system prompt determines the personality, style, and content of your conversational assistant.
Prompt example:

You are an AI assistant designed to help the user explore their personality and
emotional landscape in the context of creating a unique fragrance. You engage in warm,
free-flowing, playful conversation with the user to draw out their character,
preferences, moods, and desires. Your end goal is to derive a set of 3 to 5 personality
traits that best describe the user. These traits will later be used in a separate
process to match appropriate fragrance ingredients. Your tone is warm, chic, and subtly
playful.

Additional contextual information within the prompt also plays a key role in Amazon Nova Sonic effectively maintaining state, while defining the conversational flow helps ensure consistent, pleasant, and concise experiences for every user.
Prompt example:

1. **Welcoming Users**
Welcome the user to the application experience with a brief overview of the
process and ask if they are ready to continue.
2. **Assistant Turns**
Ask short and unique open ended questions to the user and choose a personality trait
that you think would suit the user best.
3. **Handling User Turns**
Acknowledge the user’s answers briefly and warmly.
Focus on one trait per turn.
Call the “addTraitTool”, “removeTraitTool”, “replaceTraitTool”, or “clearTraitsTool”
tools to manage traits.
If the user says to go back, skip, customize, or confirm/submit it means you should
call the “uiActionIntentTool”

With direct references to our tools in the conversational flow, the user interface feels reactive and connected to the user’s input while providing opportunities for the assistant to demonstrate its expertise on this subject, which comes into the spotlight when user traits and preferences are later mapped to a set of available ingredients and raw fragrance materials.
This complex fragrance recipe development is handled by Nova Pro, using its accuracy and speed to generate consistently high-quality scents. To draw from a wealth of fragrance knowledge in real time, RAG was implemented to extend Nova Pro capabilities beyond pre-trained knowledge with access to knowledge sources that include essential scent design principles, a deep understanding of each available ingredient, their profiles and potential roles within the fragrance, and their possible connections to users’ aromatic identities.
The resulting fragrances are then visualized using Nova Canvas and Nova Reel. The creative models generate original compositions that reveal the fragrance name, ingredients, and a visual identity within a high-end creative campaign asset. A set of conditioning images featuring unbranded fragrance bottles help to anchor each image (as shown in the following image).

Prompt example:

A high-end fragrance ad environment inspired by a [persona description]. A clear,
unbranded perfume bottle is visually centered and tightly framed. Key ingredients [top
note ingredient], [middle note ingredient], [base note ingredient], and [booster
ingredient] are arranged to surround the bottle in a balanced composition, appearing
behind, beside, and partially in front of the base. The scene evokes [atmospheric/mood
descriptors] using [light/color language]. The setting should feel [stylistic direction],
like a [reference style (e.g., fashion editorial, lifestyle spread, luxury campaign)].

Results
Attendees at Cannes Lions took away a physical fragrance mixed by on-site perfumers. While developing hyper-personalized consumer goods might not be scalable across all use cases, brands can innovate with artificial intelligence and achieve manufacturing outcomes that weren’t previously possible. The advertising campaign concept and asset development use case is easy to implement for brands, agencies, and media networks, allowing users to iterate and optimize campaign creative quickly. Using Amazon Bedrock, additional features could be added, such as translations and alternative asset sizes, depending on requirements.
You can watch a video walk through of The Fragrance Lab onsite at Cannes Lions 2025, and check out the following example campaign outputs.

Conclusion
The Fragrance Lab demonstrates the power of Amazon Nova in Amazon Bedrock and how customers can create fully personalized consumer experiences. This use case can be replicated across various retail and consumer goods categories including skincare and cosmetics, fashion and accessories, food and beverage, home goods, and wellness products—all benefiting from natural conversation interaction, AI-powered product development, product identity, and creative marketing campaign generation. Get started with Amazon Nova in Amazon Bedrock today.

About the authors
Raechel Frick is a Sr Product Marketing Manager at AWS. With over 20 years of experience in the tech industry, she brings a customer-first approach and growth mindset to building integrated marketing programs. Based in the greater Seattle area, Raechel balances her professional life with being a soccer mom and after-school carpool manager, demonstrating her ability to excel both in the corporate world and family life.
Gaby Ferreres is the Head of Industry Marketing for Media & Entertainment, Sports, Games, Advertising & Marketing at AWS, where she works with technology and industry leaders to accelerate innovation on behalf of customers. She is a global marketing leader and creator of experiences that elevate customer journeys. Before AWS, she held different positions at Microsoft, Telefonica, and more.
Ashley Weston is Sr. Marketing Event Manager for Global Third-Party Programs at AWS, where she partners with industry marketing to deliver the highest visibility and most business-critical events for AWS.
Tiffany Pfremmer is Sr. Industry Marketing Manager at Amazon Web Services (AWS) where she leads strategic integrated marketing initiatives across the Media & Entertainment, Games, and Sports verticals to deliver marketing campaigns that connect AWS cloud solutions with customer opportunities.
Jake Friedman is the President and Co-founder at Wildlife, where he leads a team launching interactive experiences and content campaigns for global brands. His work has been recognized with the Titanium Grand Prix at the Cannes Lions International Festival of Creativity for “boundary-busting, envy-inspiring work that marks a new direction for the industry and moves it forward”. You can find him on LinkedIn.

About Wildlife
Wildlife fuses a digitally born skillset with a future-proof mindset to deliver breakthrough products, experiences, and campaigns for daring partners. We live by a motto: Technology changes, story doesn’t.

Tyson Foods elevates customer search experience with an AI-powered con …

Tyson Foodservice operates as a critical division within Tyson Foods Inc., using its extensive protein production capabilities to supply a diverse array of foodservice clients across multiple sectors. As one of the largest protein providers in the US, Tyson Foods produces approximately 20% of the nation’s beef, pork, and chicken, which forms the foundation of its foodservice offerings.
Tyson Foodservice operates through a B2B model, selling products to distributors rather than directly to end consumers, while serving diverse foodservice operators, including restaurants, schools, healthcare facilities, and convenience stores. Until recently, Tyson had limited direct engagement with over 1 million unattended operators who purchased their products through distributors without direct company relationships. To bridge this gap, Tyson has implemented a generative AI assistant on their website, enabling them to scale sales efforts, gather customer insights, and establish direct communication channels. The company’s website now functions as a critical interface where operators can explore products, access menu trends, and discover tailored solutions for their specific foodservice segments, all enhanced by AI-driven personalization that better serves both established customers and previously unattended operators.
In this post, we explore how Tyson Foods collaborated with the AWS Generative AI Innovation Center to revolutionize their customer interaction through an intuitive AI assistant integrated into their website. The AI assistant was built using Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Solution overview
In this section, we describe the overall architecture of the solution. The workflow includes the following high-level steps:

1. A user uses the search bar on https://www.tysonfoodservice.com/. The query string is converted to embeddings using Amazon Bedrock and the Amazon Titan Text Embeddings model. The search application performs a k-nearest neighbors (k-NN) vector search to find relevant results in Amazon OpenSearch Serverless and returns those results to the website. The search application is deployed in Amazon Elastic Container Service (Amazon ECS) using AWS Fargate as the capacity provider and exposed as a REST API using an Application Load Balancer protected by AWS WAF.
2. The user uses the AI assistant interface to ask questions in natural language. The query is processed by the agent node using Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock. Depending on the subject of the query, the agent might orchestrate multiple agents to return relevant information to the user. The application is deployed using a similar architecture to the semantic search component, with the addition of an Amazon Relational Database Service (Amazon RDS) database cluster to persist the user high-value actions for analytics purposes.
3. Products, recipes, ingredients, and other relevant data are available from external sources in JSON format. These are processed using Amazon Bedrock and the Amazon Titan Text Embeddings model to create semantic search embeddings, which are then ingested into OpenSearch Serverless. The ingestion process runs in a different ECS cluster using Fargate as the capacity provider.

The following diagram illustrates this architecture.

In the following sections, we discuss the solution’s key components and benefits in more detail.
Improved semantic search
The earlier iteration of search on the Tyson Foodservice website relied on keyword-based search. Traditional keyword-based search on CPG websites like Tyson Foodservice often falters when customers search for products using industry terminology that varies from official catalog descriptions. Chefs searching for “pulled chicken” might miss relevant products labeled as “shredded chicken,” or those looking for “wings” might not see results for “party wings” or “drummettes.” This disconnect frustrates food service professionals who need specific ingredients under tight deadlines and ultimately drives them to competitors where they can more quickly find what they need, resulting in lost revenue opportunities for Tyson. Semantic search transforms this experience by understanding the conceptual relationships between culinary terms, preparation methods, and product applications. A chef searching for “buffalo-style appetizers” would receive results for wings, boneless bites, and similar products regardless of exact keyword matches. By recognizing menu trends, cooking techniques, and professional kitchen terminology, semantic search helps foodservice operators quickly find the Tyson products that meet their exact operational needs, even when using language that differs from catalog descriptions.
Tyson Foodservice implemented their semantic search capability using OpenSearch Serverless, a fully managed service that minimized the operational complexity of maintaining search infrastructure. This solution automatically scales compute and storage resources to match query volume and product catalog size without requiring dedicated administrative overhead. The serverless architecture helped Tyson rapidly deploy advanced natural language processing capabilities across their entire product database while maintaining cost-efficiency, because they only pay for the resources they actually use. With OpenSearch Serverless, Tyson incorporated vector embeddings and powerful query capabilities that understand foodservice terminology variations, preparation methods, and culinary applications, transforming how operators discover products that meet their specific needs even when their search terms don’t exactly match catalog descriptions.
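To illustrate the query path at a high level, the sketch below embeds a user query with Titan Text Embeddings V2 and runs a k-NN query against OpenSearch Serverless. The index name, field names, region, and collection endpoint are placeholders for illustration, not Tyson's production configuration.

import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

REGION = "us-east-1"
bedrock = boto3.client("bedrock-runtime", region_name=REGION)
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "aoss")  # OpenSearch Serverless

search = OpenSearch(
    hosts=[{"host": "your-collection-id.us-east-1.aoss.amazonaws.com", "port": 443}],  # placeholder
    http_auth=auth, use_ssl=True, connection_class=RequestsHttpConnection,
)

def embed(text: str) -> list:
    """Return the Titan Text Embeddings V2 vector for a piece of text."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

results = search.search(index="products", body={
    "size": 5,
    "query": {"knn": {"embedding": {"vector": embed("buffalo-style appetizers"), "k": 5}}},
})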
For indexing Tyson’s diverse content library of products, recipes, and articles, we implemented a preprocessing workflow that transforms raw metadata into optimized semantic search queries. We used large language models (LLMs) to analyze and extract only the most relevant elements from each content piece, creating meaningful search strings specifically designed for semantic indexing. This approach made sure that purely presentational website copy and non-essential informational text were filtered out, and search-critical elements like culinary applications, preparation methods, and ingredient specifications received proper emphasis in the index. By curating what content gets indexed rather than including everything verbatim, we dramatically improved search relevance while reducing index bloat, so OpenSearch Serverless delivered more precise results that truly match the intent behind chef and operator queries. For indexing the text as semantic vectors, we used Amazon Titan Text Embeddings V2 on Amazon Bedrock.
The following example prompt illustrates the transformation using only the title, description, and reasons to buy metadata. This generic strategy can be customized according to the customer’s specific needs.

SEARCH_STRING_PROMPT = """Given a product title, description, and reasons to
buy, create a single, concise search string suitable for indexing in a vector
database. This string should focus on distinguishing features, assuming all
products are for foodservice operators unless explicitly stated otherwise.
Enclose the generated search string within <search_string> XML tags.

Follow these guidelines:
1. Start with the brand name and product line (if applicable).
2. Include the main product type and specific identifying features.
3. List concrete attributes such as preparation state, packaging, or quantity.
4. Mention specific varieties or assortments included in the product.
5. Incorporate key points from the reasons to buy, focusing on unique and
specific selling points.
6. Avoid generic terms or those common to all products in the category (e.g.,
“food service”, “restaurant”, “operator”).
7. Omit cliché marketing terms (e.g., “versatile”, “high-quality”, “innovative”)
unless they have a specific, demonstrable meaning in the context of the
product.
8. Use precise descriptors that differentiate the product from others in its
category.
9. Omit articles (a, an, the) and unnecessary connecting words.
10. Use lowercase for all terms except proper nouns.
11. Separate terms with single spaces.
12. Aim for a length of 15-20 words.
13. Prioritize terms that potential buyers are most likely to use in specific
search queries.

Example input:
<title>Tyson® Heritage Valley™ IF Unbreaded 8 Piece Cut Chicken</title>
<description>Order a variety of crispy, seasoned chicken cuts with
Heritage Valley™ Uncooked, Ice Glazed 8 Piece Cut Chicken. Featuring an
assortment of breasts, drumsticks, thighs and wings, our chicken portions
are completely customizable and perfect for center-of-plate features.
Separately packaged for quick and easy preparation and portion control,
our packaging helps your staff reduce waste by allowing them to use what
they need, when they need. Ready to cook from frozen, simply fry and
serve as an assortment for a buffet protein choice.
</description>
<reasons_to_buy>
[‘Bone-in assortment of breasts, drumsticks, thighs and wings.’,
‘Individually quick frozen, locking in natural juices and tenderness.’,
‘Different cuts separately bagged for quick and easy preparation and cleanup.’,
‘Ready to cook from frozen.’]
</reasons_to_buy>

Example output: <search_string>tyson heritage valley unbreaded raw 8-piece
chicken bone-in breasts drumsticks thighs wings individually-frozen
separate-bags cook-from-frozen juicy center-of-plate</search_string>

Now, create a similar search string for the following product:
<title>{title}</title>
<description>{description}</description>
<reasons_to_buy>{reasons_to_buy}</reasons_to_buy>
"""
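Continuing that idea, the sketch below calls an LLM with this prompt, parses the generated search string out of the <search_string> tags, and indexes it with its embedding. The model choice, index name, field names, and the placeholder product variables (title, description, reasons_to_buy, gtin) are assumptions for illustration; search refers to the OpenSearch Serverless client sketched earlier.

import json
import re
import boto3

bedrock = boto3.client("bedrock-runtime")

def build_search_string(title: str, description: str, reasons_to_buy: list) -> str:
    """Ask the LLM to distill product metadata into a semantic-search string."""
    prompt = SEARCH_STRING_PROMPT.format(
        title=title, description=description, reasons_to_buy=reasons_to_buy
    )
    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # assumed model choice
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    text = resp["output"]["message"]["content"][0]["text"]
    return re.search(r"<search_string>(.*?)</search_string>", text, re.S).group(1).strip()

def embed(text: str) -> list:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

# title, description, reasons_to_buy, and gtin come from the product catalog feed (placeholders).
search_string = build_search_string(title, description, reasons_to_buy)
search.index(index="products", body={
    "gtin": gtin,
    "search_string": search_string,
    "embedding": embed(search_string),   # Titan Text Embeddings V2 vector
})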

Agentic chat built using Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock and LangGraph
Tyson Foodservice has integrated a powerful generative AI assistant into their website, using Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock and LangGraph. This AI assistant delivers a seamless conversational search experience that offers comprehensive support across Tyson’s extensive range of products, recipes, and articles, providing contextual guidance through natural conversation. Its capabilities include:

Personalized search – Uses semantic search to find relevant products, recipes, and articles. The AI assistant customizes recommendations by learning about the user’s business and role, creating a tailored experience while gathering valuable customer insights for Tyson.
Detailed product information – Provides comprehensive details about specific Tyson products, including descriptions, ingredients, preparation methods, and suggested applications.
Distributor services – Helps users locate nearby distributors and check product availability in their area.
Purchasing assistance – Offers information on how to buy Tyson products and connects customers with sales representatives when needed.
Promotion awareness – Keeps customers informed about current Tyson Foodservice promotions and special offers.
Feedback channel – Provides a streamlined way for customers to submit product and service feedback directly to Tyson.
Natural conversational flow – Maintains context throughout the interaction, allowing users to reference previous results and ask follow-up questions for a more human-like conversation experience.

The following diagram illustrates the high-level architecture of the AI assistant. The system uses the tool calling capabilities of Anthropic’s Claude to implement the AI assistant’s agentic behavior. We used LangGraph to streamline the implementation process, because it provides several convenient primitives specifically designed for building agentic systems with LLMs.

The main components of the architecture are:

Agent node – The agent node is implemented using a large prompt that directly receives the user message and responds using the conversational capabilities of the LLM. It also defines the agentic behavior by using the tool calling capability: whenever serving the user’s request requires calling a tool, the agent node issues a tool request.
Tool execution node – This node implements a generic tool executor that connects to various tools. Whenever a tool call is issued by the agent node, this node handles the execution of the tool call. The tool calling node executes the tools, which are defined as Python functions, and returns the results to the agent node to be transformed or summarized and presented to the user. LangGraph provides a generic implementation of the ToolNode that can also be extended to implement additional functionality.
Tools layer – Tools are implemented as simple programmatic functions that take inputs and return outputs. These tools augment the capabilities of LLMs by performing functions like retrieving data or submitting feedback. The tools are stateless and agnostic to the current conversation between the user and agent. The LLM agent extracts the input parameters required to execute these tools. These tools in our implementation are a thin wrapper around the services and database layer that implement the actual functionality.
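A condensed sketch of this graph using LangGraph's prebuilt primitives is shown below. The semantic_search call is a hypothetical stand-in for the services layer, and the wiring is illustrative rather than Tyson's production code.

from langchain_aws import ChatBedrockConverse
from langchain_core.tools import tool
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def search_catalog(query: str, persona: str = "") -> list:
    """Semantic search over Tyson products, recipes, and articles."""
    return semantic_search(query, persona)   # hypothetical wrapper over the services layer

tools = [search_catalog]
llm = ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0").bind_tools(tools)

def agent_node(state: MessagesState):
    # The agent node carries the large system prompt and decides when to request a tool call.
    return {"messages": [llm.invoke(state["messages"])]}

graph = StateGraph(MessagesState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", tools_condition)   # route to "tools" or end the turn
graph.add_edge("tools", "agent")
assistant = graph.compile()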

The following system prompt provides general guidance for implementing the agent node:
from datetime import date

AGENT_SYSTEM_PROMPT = """
# Tyson Foodservice (TFS) Customer Support Assistant

## Core Role and Purpose
You are a helpful customer support assistant for Tyson Foodservice a.k.a TFS
hosted on their https://www.tysonfoodservice.com/ website. You will be helpful
and answer the customers' questions. The customers are mainly interested in
learning about the products for their specific needs.
Refrain from engaging in any conversation unrelated to Tyson Foodservice searches
for products, recipes, or distributors. If the user asks any unrelated questions,
then politely decline and mention your purpose. Do not provide any additional
information or advice.

Your job is to stay factual and only provide relevant information from the
current context or retrieved using the tools. Do not offer your own suggestions.
Customers are looking for concrete information that is available in the Tyson
Foodservice database.

## About Tyson Foodservice
Tyson Foods is a major American multinational corporation and one of the world’s
largest processors and marketers of chicken, beef, and pork.

### Distributors
Tyson foods mainly sells their products through distributors and does not sell
them directly. Each distributor is identified by a unique identifier named
distributor_id which is used as parameters for the tools, do not use the
distributor name as query parameter.

### Foodservice Operators
Foodservice Operators, or simply Operators, are Tyson Foods’ primary customers.
These encompass diverse businesses in the foodservice sector, each with unique
needs. Understanding the distinct personas of various Operator types is crucial
for Tyson Foods to:
– Tailor product offerings effectively
– Develop targeted marketing strategies
– Create relevant recipe suggestions
– Address specific operational challenges
By analyzing different Operator segments (e.g., quick-service restaurants, fine
dining, educational institutions, healthcare facilities), Tyson Foods can
customize its products, offer innovative menu solutions, and provide value-added
services. This approach positions Tyson Foods as a strategic partner, driving
growth and maintaining competitiveness in the foodservice industry.

## Using Tools
You will be provided a variety of tools to perform your job; use them wisely and
ask the customer for relevant information that they have not provided. E.g. if
the search tool requires persona and the customer has not provided it then ask
the customer.
– Do not explicitly declare the tools to the users as the users are not aware of
the internal workings of the tools.
- Do not try to interpret the results of the search tool; show them as they are
to the user.
– Operators may have their preferred distributor they buy from so let them
confirm or select their distributor before checking for availability of
products.
– Customers might sometimes search for things that are not available in tyson
food catalog. If the search did not produce any results then just inform the
user and do not suggest any external sources.
– When trying to determine the parameters for a tool, do not infer them from
other parameters. E.g. do not infer the User’s name from their email.
Explicitly ask for the name.
– If the users complain or praise the chatbot then you can ask for their
feedback in the chatbot and use the `submit_feedback` tool to submit the
feedback. Ask the user to provide the relevant contact information.

## Product, Recipes, and Articles Search
Search functionality is a critical tool on Tyson’s website, allowing users to
find products, recipes, and articles. It enables searches across three main
entity types:
– **Products**: The core offerings of Tyson Foods. These are identified by a
unique GTIN (Global Trade Item Number).
– **Recipes**: Culinary ideas provided by Tyson Foods to encourage product use.
Each recipe incorporates one or more Tyson products.
– **Articles**: Informative content on various topics, created by Tyson Foods
for their customers.
– Do not provide any items or suggestions outside of the ones that are found
through search.
- When the user asks for details of a product or to compare two or more
products, retrieve the details of the products first using the tools to get
product details.
– While users of the site are mainly looking for products, they might also be
interested in recipes and articles so it’s important to not omit them when
displaying the search results.

### User Profile or Persona
In order to serve the user’s better, the search tool can accept the user’s
persona as an input. User profile or persona is a concise description of the
type of role that a user performs in the foodservice industry. A few examples
of persona are
– Restaurant owners looking to optimize costs
– Chef looking for unique ingredients
– K12 operators looking for healthy menu items
They can also be simple roles if the user has not provided any additional
information. Examples are
– Restaurant owner
– Chef
– Hotel Manager
The user persona should not include the search query that they are using for
finding products E.g. these are not good personas
– Restaurant owner looking for chicken nuggets
The above is not a good persona because it includes the product

### Search query string
Search queries should be simple and specific to the products or recipes and
should not contain the operator information
Here are some examples:
– Instead of “healthy chicken wings for K12” use “chicken wings”
– Instead of “mexican beef patties for Deli operation” use “mexican beef
patties”

### Product Results Display
When listing the product results, always display them in the following format as
a numbered list. This will be displayed in the UI using markdown.
1. **Title**
– GTIN
– description – This is a brief description
– [Product Page](Product url link)

### Recipes Results Display
When displaying recipes. Display the following
1. **Title**
– description – This is a brief description
– [Recipe Page](Recipe url link)

## Contact or provide feedback
– If the users want to reach out to Tyson foods team then they can use the form
using this link [Contact
Us](https://www.tysonfoodservice.com/connect/contact-us)
– Users can submit their feedback using the chatbot using tools. When submitting
feedback to Tyson extract user’s message verbatim and do not rephrase it.

## How to buy
If the user wants to buy a product then they have two options.
1. through distributor (preferred option)
2. reaching out to tysons sales representative by filling a form
If the user has not already indicated their preference then present these two
options.
When the user asks for ordering information you do not need to retrieve all the
product details again, only specify the title of the product and be concise with
the details.

### Order through distributor
If the user is interested in buying through a distributor, then let them
identify their preferred distributor and, for the specific product or products
they have identified, provide the ordering link obtained through the use of the
appropriate tool. Also help them check if a product is available with their
distributor.

### Find a tyson Sales Rep
If the user is not interested in a purchasing through a distributor then direct
them to submit a form through this link which will submit their information to a
sales team and someone will reach out to them. Here is the link to the form
https://www.tysonfoodservice.com/connect/find-a-sales-rep

Current date (YYYY-MM-DD): """ + date.today().strftime("%Y-%m-%d") + "\n"

Capturing high-value actions: Turning conversations into insights
In designing Tyson Foodservice’s AI assistant, we implemented an innovative solution for capturing high-value actions that transforms customer interactions into strategic business intelligence. This capability provides deeper contextual understanding of customer interests and needs than traditional web analytics. Whereas conventional analytics tools track user behavior through page views, clicks, and time-on-site metrics, our solution uses the rich conversational data generated through natural dialogue. This provides Tyson with unprecedented visibility into customer interests, pain points, and purchase intentions.
The system identifies and logs specific high-value interactions whenever users request detailed product information, inquire about specific product categories, ask about preparation methods or recipe ideas, seek distributor information in their region, or express interest in bulk purchasing or promotions. This approach creates a powerful feedback loop for Tyson Foodservice. As customers naturally express their needs and interests through conversation, the system captures these signals in an aggregate, privacy-respecting manner. Tyson can use these insights to identify trending product categories and potential gaps in their portfolio, understand regional variations in customer interests, recognize seasonal patterns in product inquiries, refine marketing strategies based on direct customer language, and improve inventory management through better demand forecasting. The technical implementation uses the tool-calling capabilities of Anthropic’s Claude 3.5 Sonnet in a straightforward but effective way. Rather than analyzing chat logs after the fact, we integrated the capture mechanism directly into the AI assistant’s operational workflow through LangGraph, allowing for real-time insight collection during customer interactions. When the LLM invokes certain tools to retrieve information requested by users, these tool calls simultaneously trigger the capture of high-value action data. We’ve designed a configurable system where specific tools are designated as high-value action triggers that record meaningful interactions while fulfilling the user’s immediate request.This dual-purpose approach makes sure that valuable business intelligence is gathered as a natural byproduct of providing excellent customer service, without requiring additional processing or analysis steps. The system includes configurable parameters that allow Tyson to adjust which user intents and actions qualify as high value based on evolving business priorities. By transforming every customer conversation into structured, actionable data, Tyson Foodservice can now measure customer interest with unprecedented precision while delivering a superior search experience that feels natural to users.
Conclusion
In this post, we demonstrated a powerful approach to implementing natural conversational AI assistants that seamlessly integrate with existing website functionalities and provide intuitive language interactions for users. By using Amazon Bedrock FMs and OpenSearch Serverless, businesses can quickly expose their website’s capabilities through conversation rather than complex interfaces. The high-value action capture mechanism further enhances this solution by gathering valuable customer insights directly from natural interactions, creating a rich source of business intelligence without additional user friction. This framework provides a flexible blueprint for implementing AI-powered assistants across retail and CPG websites. Organizations can adapt this approach to their specific needs, such as product discovery, customer support, or personalized recommendations. The combination of semantic search with conversational AI creates experiences that understand user intent while maintaining the context necessary for natural dialogue.
If you’re interested in building a similar AI assistant that orchestrates multiple tools, you can get started with Amazon Bedrock Agents, a fully managed AWS solution designed specifically for this purpose. Amazon Bedrock Agents simplifies the process of creating, testing, and deploying conversational experiences that can execute complex tasks across your business systems. With the right architecture and implementation approach demonstrated in this post, you can develop AI-powered interactions that deliver measurable business value while significantly enhancing your customer journey.
For developers exploring AI agent frameworks today, AWS recently introduced Strands Agents, an open source SDK that takes a model-driven approach to building agents with just a model, tools, and a prompt. Unlike workflow-based frameworks, Strands adopts a model-first philosophy that uses advanced reasoning capabilities, offering an interesting alternative approach to frameworks like LangGraph.
Try out these solutions for your own use case, and share your feedback in the comments.

About the authors
Anveshi Charuvaka is a Senior Applied Scientist at AWS’s Generative AI Innovation Center, where he partners with customers to turn Generative AI into solutions for mission-critical business problems. He holds a PhD in Machine Learning and brings over 10 years of experience applying innovative ML and GenAI techniques to complex, real-world challenges.
Barret Miller leads the Digital Enterprise Organization at Tyson Foods, where he spearheads progress in emerging technologies, artificial intelligence, and Smart Office initiatives. With more than 17 years of expertise in software development, data, analytics, and AI, Barret excels at leveraging innovative technology paradigms, including Agentic AI, to tackle and enhance complex business processes.
Vincil Bishop is a Senior Deep Learning Architect in the Generative AI Innovation Center. Vincil has 25 years of experience in the IT industry and holds a PhD in Systems Engineering from Colorado State University. Vincil specializes in the design and implementation of AI solutions that help solve customers’ toughest business challenges.
Tesfagabir Meharizghi is an Applied Scientist at the AWS Generative AI Innovation Center, where he leads projects and collaborates with enterprise customers across various industries to leverage cutting-edge generative AI technologies in solving complex business challenges. He specializes in identifying and prioritizing high-impact use cases, developing scalable AI solutions, and fostering knowledge-sharing partnerships with stakeholders.
Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services who helps customers solve their business problems using generative AI and machine learning. He holds an MS with thesis in Machine Learning from the University of Illinois and has extensive experience solving customer problems in the field of data science.
Angel Goni is a Principal Solutions Architect at AWS with 15+ years of IT experience across the Financial Services, Retail, and Consumer Packaged Goods sectors. Angel specializes in utilizing cloud technology to impact business KPIs, with particular expertise in multicloud strategies, SAP migrations, and supply chain improvement.

Enhance AI agents using predictive ML models with Amazon SageMaker AI …

Machine learning (ML) has evolved from an experimental phase to becoming an integral part of business operations. Organizations now actively deploy ML models for precise sales forecasting, customer segmentation, and churn prediction. While traditional ML continues to transform business processes, generative AI has emerged as a revolutionary force, introducing powerful and accessible tools that reshape customer experiences.
Despite generative AI’s prominence, traditional ML solutions remain essential for specific predictive tasks. Sales forecasting, which depends on historical data and trend analysis, is most effectively handled by established ML algorithms including random forests, gradient boosting machines (like XGBoost), autoregressive integrated moving average (ARIMA) models, long short-term memory (LSTM) networks, and linear regression techniques. Traditional ML models, such as K-means and hierarchical clustering, also excel in customer segmentation and churn prediction applications. Although generative AI demonstrates exceptional capabilities in creative tasks such as content generation, product design, and personalized customer interactions, traditional ML models maintain their superiority in data-driven predictions. Organizations can achieve optimal results by using both approaches together, creating solutions that deliver accurate predictions while maintaining cost efficiency.
To achieve this, we showcase in this post how customers can expand AI agents’ capabilities by integrating predictive ML models and Model Context Protocol (MCP)—an open protocol that standardizes how applications provide context to large language models (LLMs)—on Amazon SageMaker AI. We demonstrate a comprehensive workflow that enables AI agents to make data-driven business decisions by using ML models hosted on SageMaker AI. Through the use of the Strands Agents SDK—an open source SDK that takes a model-driven approach to building and running AI agents in only a few lines of code—and flexible integration options, including direct endpoint access and MCP, we show you how to build intelligent, scalable AI applications that combine the power of conversational AI with predictive analytics.
Solution overview
This solution enhances AI agents by integrating ML models deployed on Amazon SageMaker AI endpoints, enabling the agents to make data-driven business decisions through ML predictions. An AI agent is an LLM-powered application that uses an LLM as its core “brain” to autonomously observe its environment, plan actions, and execute tasks with minimal human input. It integrates reasoning, memory, and tool use to perform complex, multistep problem-solving by dynamically creating and revising plans, interacting with external systems, and learning from past interactions to optimize outcomes over time. This enables AI agents to go beyond simple text generation, acting as independent entities capable of decision-making and goal-directed actions in diverse real-world and enterprise scenarios.

For this solution, the AI agent is developed using the Strands Agents SDK, which allows for rapid development from simple assistants to complex workflows. Predictive ML models are hosted on Amazon SageMaker AI and will be used as tools by the AI agent. This can happen in two ways: agents can directly invoke SageMaker endpoints for more direct access to model inference capabilities or use the MCP protocol to facilitate the interaction between AI agents and the ML models. Both options are valid: direct tool invocation doesn’t require additional infrastructure by embedding the tool calling directly in the agent code itself, whereas MCP enables dynamic discovery of the tools and decoupling of agent and tool execution through the introduction of an additional architectural component, the MCP server itself. For scalable and secure implementation of the tool calling logic, we recommend the MCP approach. Although we’re recommending MCP, we discuss and implement the direct endpoint access as well, to give readers the freedom to choose the approach that they prefer.
Amazon SageMaker AI offers two methods to host multiple models behind a single endpoint: inference components and multi-model endpoints. This consolidated hosting approach enables efficient deployment of multiple models in one environment, which optimizes computing resources and minimizes response times for model predictions. For demonstration purposes, this post deploys only one model on one endpoint. If you want to learn more about inference components, refer to the Amazon SageMaker AI documentation Shared resource utilization with multiple models. To learn more about multi-model endpoints, refer to the Amazon SageMaker AI documentation Multi-model endpoints.
Architecture
In this post, we define a workflow for empowering AI agents to make data-driven business decisions by invoking predictive ML models using Amazon SageMaker AI. The process begins with a user interacting through an interface, such as a chat-based assistant or application. This input is managed by an AI agent developed using the open source Strands Agents SDK. Strands Agents adopts a model-driven approach, which means developers define agents with only a prompt and a list of tools, facilitating rapid development from simple assistants to complex autonomous workflows.
When the agent is prompted with a request that requires a prediction (for example, “what will be the sales for H2 2025?”), the LLM powering the agent decided to interact with the Amazon SageMaker AI endpoint hosting the ML model. This can happen in two ways: directly using the endpoint as a custom tool of the Strands Agents Python SDK or by calling the tool through MCP. With MCP, the client application can discover the tools exposed by the MCP server, obtain the list of required parameters, and format the request to the Amazon SageMaker inference endpoint. Alternatively, agents can directly invoke SageMaker endpoints using tool annotations (such as @tool), bypassing the MCP server for more direct access to model inference capabilities.
Finally, the prediction generated by the SageMaker hosted model is routed back through the agent and ultimately delivered to the user interface, enabling real-time, intelligent responses.
The following diagram illustrates this process. The complete code for this solution is available on GitHub.

Prerequisites
To get started with this solution, make sure you have:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, refer to AWS Identity and Access Management for Amazon SageMaker AI.
Access to Amazon SageMaker Studio and a SageMaker AI notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
Access to accelerated instances (GPUs) for hosting the LLMs.

Solution implementation
In this solution, we implement a complete workflow that demonstrates how to use ML models deployed on Amazon SageMaker AI as specialized tools for AI agents. This approach enables agents to access and use ML capabilities for enhanced decision-making without requiring deep ML expertise. We play the role of a data scientist tasked with building an agent that can predict demand for one product. To achieve this, we train a time-series forecasting model, deploy it, and expose it to an AI agent.
The first phase involves training a model using Amazon SageMaker AI. This begins with preparing training data by generating synthetic time series data that incorporates trend, seasonality, and noise components to simulate realistic demand patterns. Following data preparation, feature engineering is performed to extract relevant features from the time series data, including temporal features such as day of week, month, and quarter to effectively capture seasonality patterns. In our example, we train an XGBoost model using the XGBoost container available as 1P container in Amazon SageMaker AI to create a regression model capable of predicting future demand values based on historical patterns. Although we use XGBoost for this example because it’s a well-known model used in many use cases, you can use your preferred container and model, according to the problem you’re trying to solve. For the sake of this post, we won’t detail an end-to-end example of training a model using XGBoost. To learn more about this, we suggest checking out the documentation Use XGBoost with the SageMaker Python SDK. Use the following code:

from sagemaker.xgboost.estimator import XGBoost

xgb_estimator = XGBoost(…)
xgb_estimator.fit({‘train’: train_s3_path, ‘validation’: val_s3_path})

Then, the trained model is packaged and deployed to a SageMaker AI endpoint, making it accessible for real-time inference through API calls:

predictor = xgb_estimator.deploy(
   initial_instance_count=1,
   instance_type=instance_type,
   serializer=JSONSerializer(),
   deserializer=JSONDeserializer()
)

After the model is deployed and ready for inferences, you need to learn how to invoke the endpoint. To invoke the endpoint, write a function like this:

ENDPOINT_NAME = “serverless-xgboost”
REGION = boto3.session.Session().region_name

def invoke_endpoint(payload: list):
    “””
        Use the model deployed on the Amazon SageMaker AI endpoint to generate predictions.
        Args:
            payload: a list of lists containing the inputs to generate predictions from
        Returns:
            predictions: an NumPy array of predictions
    “””
    sagemaker_runtime = boto3.client(“sagemaker-runtime”, region_name=REGION)
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        Body=json.dumps(payload),
        ContentType=”application/json”,
        Accept=”application/json”
    )
    predictions = json.loads(response[‘Body’].read().decode(“utf-8″))
    return np.array(predictions)

Note that the function invoke_endpoint() has been written with proper docstring. This is key to making sure that it can be used as a tool by LLMs because the description is what allows them to choose the right tool for the right task. YOu can turn this function into a Strands Agents tool thanks to the @tool decorator:

from strands import tool

@tool()
def invoke_endpoint(payload: list):
    ….

And to use it, pass it to a Strands agent:

from strands import Agent

agent = Agent(
    model=”us.amazon.nova-pro-v1:0”,
    tools=[generate_prediction_with_sagemaker]
)

agent(
    “Invoke the endpoint with this input:nn”
    f”<input>{test_sample}</input>nn”
    “Provide the output in JSON format {‘predictions’:<predictions>}”
)

As you run this code, you can confirm the output from the agent, which correctly identifies the need to call the tool and executes the function calling loop:

<thinking> To fulfill the User’s request, I need to invoke the Amazon SageMaker
endpoint with the provided input data. The input is a list of lists, which is the
required format for the ‘generate_prediction_with_sagemaker’ tool. I will use this
tool to get the predictions. </thinking>

Tool #1: generate_prediction_with_sagemaker The predictions from the Amazon SageMaker
endpoint are as follows:
“`json {  “predictions”: [89.8525238, 52.51485062, 58.35247421, 62.79786301, 85.51475525] } “`

As the agent receives the prediction result from the endpoint tool, it can then use this as an input for other processes. For example, the agent could write the code to create a plot based on these predictions and show it to the user in the conversational UX. It could send these values directly to business intelligence (BI) tools such as Amazon QuickSight or Tableau and automatically update enterprise resource planning (ERP) or customer relationship management (CRM) tools such as SAP or Salesforce.
Connecting to the endpoint through MCP
You can further evolve this pattern by having an MCP server invoke the endpoint rather than the agent itself. This allows for the decoupling of agent and tool logic and an improved security pattern because the MCP server will be the one with the permission to invoke the endpoint. To achieve this, implement an MCP server using the FastMCP framework that wraps the SageMaker endpoint and exposes it as a tool with a well-defined interface. A tool schema must be specified that clearly defines the input parameters and return values for the tool, facilitating straightforward understanding and usage by AI agents. Writing the proper docstring when defining the function achieves this. Additionally, the server must be configured to handle authentication securely, allowing it to access the SageMaker endpoint using AWS credentials or AWS roles. In this example, we run the server on the same compute as the agent and use stdio as communication protocol. For production workloads, we recommend running the MCP server on its own AWS compute and using transport protocols based on HTTPS (for example, Streamable HTTP). If you want to learn how to deploy MCP servers in a serverless fashion, refer to this official AWS GitHub repository. Here’s an example MCP server:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP(“SageMaker App”)
ENDPOINT_NAME = os.environ[“SAGEMAKER_ENDPOINT_NAME”]

@mcp.tool()
async def invoke_endpoint(payload: list):
    “”” Use the model … “””
    […]
    
if __name__ == “__main__”:
    mcp.run(=”stdio”)

Finally, integrate the ML model with the agent framework. This begins with setting up Strands Agents to establish communication with the MCP server and incorporate the ML model as a tool. A comprehensive workflow must be created to determine when and how the agent should use the ML model to enhance its capabilities. The implementation includes programming decision logic that enables the agent to make informed decisions based on the predictions received from the ML model. The phase concludes with testing and evaluation, where the end-to-end workflow is validated by having the agent generate predictions for test scenarios and take appropriate actions based on those predictions.

from mcp import StdioServerParameters
from mcp.client.stdio import stdio_client
from strands.tools.mcp import MCPClient

# Create server parameters for stdio connection
server_params = StdioServerParameters(
    command=”python3″,  # Executable
    args=[“server.py”],  # Optional command line arguments
    env={“SAGEMAKER_ENDPOINT_NAME”: “<your-endpoint-name>”}
)

# Create an agent with MCP tools
with stdio_mcp_client:
    # Get the tools from the MCP server
    tools = stdio_mcp_client.list_tools_sync()
    # Create an agent with these tools
    agent = Agent(model=”us.amazon.nova-pro-v1:0″, tools=tools)
    # Invoke the agent
    agent(
        “Invoke the endpoint with this input:nn”
        f”<input>{test_sample}</input>nn”
        “Provide the output in JSON format {‘predictions’:<predictions>}”
    )

Clean up
When you’re done experimenting with the Strands Agents Python SDK and models on Amazon SageMaker AI, you can delete the endpoint you’ve created to stop incurring unwanted charges. To do that, you can use either the AWS Management Console, the SageMaker Python SDK, or the AWS SDK for Python (boto3):

# SageMaker Python SDK
predictor.delete_model()
predictor.delete_endpoint()

# Alternatively, boto3
sagemaker_runtime.delete_endpoint(EndpointName=endpoint_name)

Conclusion
In this post, we demonstrated how to enhance AI agents’ capabilities by integrating predictive ML models using Amazon SageMaker AI and the MCP. By using the open source Strands Agents SDK and the flexible deployment options of SageMaker AI, developers can create sophisticated AI applications that combine conversational AI with powerful predictive analytics capabilities. The solution we presented offers two integration paths: direct endpoint access through tool annotations and MCP-based integration, giving developers the flexibility to choose the most suitable approach for their specific use cases. Whether you’re building customer service chat assistants that need predictive capabilities or developing complex autonomous workflows, this architecture provides a secure, scalable, and modular foundation for your AI applications. As organizations continue to seek ways to make their AI agents more intelligent and data-driven, the combination of Amazon SageMaker AI, MCP, and the Strands Agents SDK offers a powerful solution for building the next generation of AI-powered applications.
For readers unfamiliar with connecting MCP servers to workloads running on Amazon SageMaker AI, we suggest Extend large language models powered by Amazon SageMaker AI using Model Context Protocol in the AWS Artificial Intelligence Blog, which details the flow and the steps required to build agentic AI solutions with Amazon SageMaker AI.
To learn more about AWS commitment to the MCP standard, we recommend reading Open Protocols for Agent Interoperability Part 1: Inter-Agent Communication on MCP in the AWS Open Source Blog, where we announced that AWS is joining the steering committee for MCP to make sure developers can build breakthrough agentic applications without being tied to one standard. To learn more about how to use MCP with other technologies from AWS, such as Amazon Bedrock Agents, we recommend reading Harness the power of MCP servers with Amazon Bedrock Agents in the AWS Artificial Intelligence Blog. Finally, a great way to securely deploy and scale MCP servers on AWS is provided in the AWS Solutions Library at Guidance for Deploying Model Context Protocol Servers on AWS.

About the authors
Saptarshi Banerjee serves as a Senior Solutions Architect at AWS, collaborating closely with AWS Partners to design and architect mission-critical solutions. With a specialization in generative AI, AI/ML, serverless architecture, Next-Gen Developer Experience tools and cloud-based solutions, Saptarshi is dedicated to enhancing performance, innovation, scalability, and cost-efficiency for AWS Partners within the cloud ecosystem.
Davide Gallitelli is a Senior Worldwide Specialist Solutions Architect for Generative AI at AWS, where he empowers global enterprises to harness the transformative power of AI. Based in Europe but with a worldwide scope, Davide partners with organizations across industries to architect custom AI agents that solve complex business challenges using AWS ML stack. He is particularly passionate about democratizing AI technologies and enabling teams to build practical, scalable solutions that drive organizational transformation.

NVIDIA AI Releases Nemotron Nano 2 AI Models: A Production-Ready Enter …

NVIDIA has unveiled the Nemotron Nano 2 family, introducing a line of hybrid Mamba-Transformer large language models (LLMs) that not only push state-of-the-art reasoning accuracy but also deliver up to 6× higher inference throughput than models of similar size. This release stands out with unprecedented transparency in data and methodology, as NVIDIA provides most of the training corpus and recipes alongside model checkpoints for the community. Critically, these models maintain massive 128K-token context capability on a single midrange GPU, significantly lowering barriers for long-context reasoning and real-world deployment.

Key Highlights

6× throughput vs. similarly sized models: Nemotron Nano 2 models deliver up to 6.3× the token generation speed of models like Qwen3-8B in reasoning-heavy scenarios—without sacrificing accuracy.

Superior accuracy for reasoning, coding & multilingual tasks: Benchmarks show on-par or better results vs. competitive open models, notably exceeding peers in math, code, tool use, and long-context tasks.

128K context length on a single GPU: Efficient pruning and hybrid architecture make it possible to run 128,000 token inference on a single NVIDIA A10G GPU (22GiB).

Open data & weights: Most of the pretraining and post-training datasets, including code, math, multilingual, synthetic SFT, and reasoning data, are released with permissive licensing on Hugging Face.

Hybrid Architecture: Mamba Meets Transformer

Nemotron Nano 2 is built on a hybrid Mamba-Transformer backbone, inspired by the Nemotron-H Architecture. Most traditional self-attention layers are replaced by efficient Mamba-2 layers, with only about 8% of the total layers using self-attention. This architecture is carefully crafted:

Model Details: The 9B-parameter model features 56 layers (out of a pre-trained 62), a hidden size of 4480, with grouped-query attention and Mamba-2 state space layers facilitating both scalability and long sequence retention.

Mamba-2 Innovations: These state-space layers, recently popularized as high-throughput sequence models, are interleaved with sparse self-attention (to preserve long-range dependencies), and large feed-forward networks.

This structure enables high throughput on reasoning tasks requiring “thinking traces”—long generations based on long, in-context input—where traditional transformer-based architectures often slow down or run out of memory.

Training Recipe: Massive Data Diversity, Open Sourcing

Nemotron Nano 2 models are trained and distilled from a 12B parameter teacher model using an extensive, high-quality corpus. NVIDIA’s unprecedented data transparency is a highlight:

20T tokens pretraining: Data sources include curated and synthetic corpora for web, math, code, multilingual, academic, and STEM domains.

Major Datasets Released:

Nemotron-CC-v2: Multilingual web crawl (15 languages), synthetic Q&A rephrasing, deduplication.

Nemotron-CC-Math: 133B tokens of math content, standardized to LaTeX, over 52B “highest quality” subset.

Nemotron-Pretraining-Code: Curated and quality-filtered GitHub source code; rigorous decontamination and deduplication.

Nemotron-Pretraining-SFT: Synthetic, instruction-following datasets across STEM, reasoning, and general domains.

Post-training Data: Includes over 80B tokens of supervised fine-tuning (SFT), RLHF, tool-calling, and multilingual datasets—most of which are open-sourced for direct reproducibility.

Alignment, Distillation, and Compression: Unlocking Cost-Effective, Long-Context Reasoning

NVIDIA’s model compression process is built on the “Minitron” and Mamba pruning frameworks:

Knowledge distillation from the 12B teacher reduces the model to 9B parameters, with careful pruning of layers, FFN dimensions, and embedding width.

Multi-stage SFT and RL: Includes tool-calling optimization (BFCL v3), instruction-following (IFEval), DPO and GRPO reinforcement, and “thinking budget” control (support for controllable reasoning-token budgets at inference).

Memory-targeted NAS: Through architecture search, the pruned models are specifically engineered so that the model and key-value cache both fit—and remain performant—within the A10G GPU memory at a 128k context length.

The result: inference speeds of up to 6× faster than open competitors in scenarios with large input/output tokens, without compromised task accuracy.

Benchmarking: Superior Reasoning and Multilingual Capabilities

In head-to-head evaluations, Nemotron Nano 2 models excel:

Task/BenchNemotron-Nano-9B-v2Qwen3-8BGemma3-12BMMLU (General)74.576.473.6MMLU-Pro (5-shot)59.456.345.1GSM8K CoT (Math)91.484.074.5MATH80.555.442.4HumanEval+58.557.636.7RULER-128K (Long Context)82.2–80.7Global-MMLU-Lite (Avg Multi)69.972.871.9MGSM Multilingual Math (Avg)84.864.557.1

Throughput (tokens/s/GPU) at 8k input/16k output:

Nemotron-Nano-9B-v2: up to 6.3× Qwen3-8B in reasoning traces.

Maintains up to 128k-context with batch size=1—previously impractical on midrange GPUs.

Conclusion

NVIDIA’s Nemotron Nano 2 release is an important moment for open LLM research: it redefines what’s possible on a single cost-effective GPU—both in speed and context capacity—while raising the bar for data transparency and reproducibility. Its hybrid architecture, throughput supremacy, and high-quality open datasets are set to accelerate innovation across the AI ecosystem.

Check out the Technical Details, Paper and Models on Hugging Face. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

Sponsorship Media Kit

The post NVIDIA AI Releases Nemotron Nano 2 AI Models: A Production-Ready Enterprise AI Model Family and 6x Faster than Similar Sized Model appeared first on MarkTechPost.

Memp: A Task-Agnostic Framework that Elevates Procedural Memory to a C …

LLM agents have become powerful enough to handle complex tasks, ranging from web research and report generation to data analysis and multi-step software workflows. However, they struggle with procedural memory, which is often rigid, manually designed, or locked inside model weights today. This makes them fragile: unexpected events like network failures or UI changes can force a complete restart. Unlike humans, who learn by reusing past experiences as routines, current LLM agents lack a systematic way to build, refine, and reuse procedural skills. Existing frameworks offer abstractions but leave the optimization of memory life-cycles largely unresolved. 

Memory plays a crucial role in language agents, allowing them to recall past interactions across short-term, episodic, and long-term contexts. While current systems use methods like vector embeddings, semantic search, and hierarchical structures to store and retrieve information, effectively managing memory, especially procedural memory, remains a challenge. Procedural memory helps agents internalize and automate recurring tasks, yet strategies for constructing, updating, and reusing it are underexplored. Similarly, agents learn from experience through reinforcement learning, imitation, or replay, but face issues like low efficiency, poor generalization, and forgetting. 

Researchers from Zhejiang University and Alibaba Group introduce Memp, a framework designed to give agents a lifelong, adaptable procedural memory. Memp transforms past trajectories into both detailed step-level instructions and higher-level scripts, while offering strategies for memory construction, retrieval, and updating. Unlike static approaches, it continuously refines knowledge through addition, validation, reflection, and discarding, ensuring relevance and efficiency. Tested on ALFWorld and TravelPlanner, Memp consistently improved accuracy, reduced unnecessary exploration, and optimized token use. Notably, memory built from stronger models transferred effectively to weaker ones, boosting their performance. This shows Memp enables agents to learn, adapt, and generalize across tasks. 

When an agent interacts with its environment executing actions, using tools, and refining behavior across multiple steps, it’s a Markov Decision Process. Each step generates states, actions, and feedback, forming trajectories that also yield rewards based on success. However, solving new tasks in unfamiliar environments often results in wasted steps and tokens, as the agent repeats exploratory actions already performed in earlier tasks. Inspired by human procedural memory, the proposed framework equips agents with a memory module that stores, retrieves, and updates procedural knowledge. This enables agents to reuse past experiences, cutting down redundant trials and improving efficiency in complex tasks.

Experiments on TravelPlanner and ALFWorld demonstrate that storing trajectories as either detailed steps or abstract scripts boosts accuracy and reduces exploration time. Retrieval strategies based on semantic similarity further refine memory use. At the same time, dynamic update mechanisms such as validation, adjustment, and reflection allow agents to correct errors, discard outdated knowledge, and continuously refine skills. Results show that procedural memory not only improves task completion rates and efficiency but also transfers effectively from stronger to weaker models, giving smaller systems significant performance gains. Moreover, scaling retrieval improves outcomes up to a point, after which excessive memory can overwhelm the context and reduce effectiveness. This highlights procedural memory as a powerful way to make agents more adaptive, efficient, and human-like in their learning. 

In conclusion, Memp is a task-agnostic framework that treats procedural memory as a central element for optimizing LLM-based agents. By systematically designing strategies for memory construction, retrieval, and updating, Memp allows agents to distill, refine, and reuse past experiences, improving efficiency and accuracy in long-horizon tasks like TravelPlanner and ALFWorld. Unlike static or manually engineered memories, Memp evolves dynamically, continuously updating and discarding outdated knowledge. Results show steady performance gains, efficient learning, and even transferable benefits when migrating memory from stronger to weaker models. Looking ahead, richer retrieval methods and self-assessment mechanisms can further strengthen agents’ adaptability in real-world scenarios. 

Check out the Technical Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

Sponsorship Details

The post Memp: A Task-Agnostic Framework that Elevates Procedural Memory to a Core Optimization Target in LLM-based Agent appeared first on MarkTechPost.

Emerging Trends in AI Cybersecurity Defense: What’s Shaping 2025? To …

The AI security arms race is in full swing. As cyber threats grow more sophisticated, organizations are reimagining defense strategies—with artificial intelligence taking center stage. Here’s a look at some of the most impactful trends you should watch in AI-powered cybersecurity defense.

1. AI-Powered Threat Detection and Automated Response

Gone are the days of siloed security appliances and slow, manual interventions. Modern cybersecurity relies on deep learning models that analyze the behavior of users, devices, and networks for anomalies in real time. These systems lower false positives and respond instantly to suspicious activity—enabling security teams to move from reactive firefighting to proactive protection.

2. The Rise of Automated SOC Operations

Security Operations Centers (SOCs) are experiencing a revolution: with agentic AI taking over routine monitoring, triage, and incident response. Mundane alerts and repetitive investigations are handed off to automated agents, freeing up human analysts for strategic work. The result? Faster mitigation and significantly more efficient resource allocation—even during high-volume attack bursts.

3. Adaptive, Context-Aware Defenses

Static rules and generic access controls are no longer enough. Today’s leading defense systems use AI to analyze real-time context—like user identity, device health, location, and recent activity—before approving access or responding to incidents. This is dramatically strengthening Zero Trust models, helping prevent privilege abuse and lateral movement in ways that conventional solutions can’t.

4. Predictive Intelligence for Next-Gen Security

Why wait for an attack when you can predict it? AI tools are now scanning global threat data to not only spot vulnerabilities but actually anticipate future tactics and attack paths. These predictive systems inform security architects about emerging risks, allowing them to reinforce defenses before threat actors even strike.

5. Spotting AI-Generated Attacks

Phishing emails, spoofed voice calls, deepfake videos—these are the new weapons of social engineering. Security teams now deploy AI-driven solutions specifically designed to identify and intercept synthetic content in multiple formats. Multi-modal verification has become standard, turning the tide against advanced fraud and impersonation attempts.

6. Zero Trust Gets Smarter

Zero Trust is not just about denying access—it’s about continuous, intelligent validation. AI is supercharging Zero Trust policies, creating dynamic access management that adapts to real-world behavior and context. This means suspicious actions are flagged in milliseconds, and trusted access is continuously reassessed rather than granted perpetually.

7. Securing LLMs With Source Traceability

Generative AI adds another layer of risk—hallucination, prompt injection, and unauthorized output. Innovations like RAG-Verification (Retrieval-Augmented Generation) are stepping in, providing source traceability and safeguards for AI-generated content. This ensures that high-stakes decisions made by or with LLMs are backed by verifiable data.

Here are the top AI focused cybersecurity tools and platforms for defense in 2025:

AccuKnox AI CoPilotSpecializes in cloud-native and Kubernetes security, leveraging eBPF runtime visibility and generative AI for automated policy generation, compliance, and zero-trust enforcement.

SentinelOne Singularity XDRDelivers AI-driven threat detection, real-time behavioral analysis, and automated response for endpoints, networks, and cloud workloads—helping reduce alert fatigue and scale SOC operations.

CrowdStrike Falcon Cloud SecurityProvides advanced AI threat protection for both endpoints and cloud environments, known for real-time detection, rapid deployment, and seamless integration.

Torq HyperSOCAn agentic, AI-powered SOC automation platform that features AI agents for enrichment, user verification, and remediation, driving hyperautomation at enterprise scale.

Microsoft Security CopilotIntegrates genAI and Microsoft’s security solutions to automate incident response, investigations, and network monitoring with natural language-driven workflows.

Fortinet FortiAIML-powered threat analysis for traffic, endpoint, and logs, delivers inline remediation, sandbox integration, and policy-triggered user controls.

Deep InstinctUses deep learning for advanced malware and ransomware prevention, focusing on zero-day threat detection and endpoint protection.

Radiant Security SOC AutomationFully autonomous SOC automation with playbook-free alert triage, investigation, remediation, and continuous learning for adaptive security.

Zscaler Cloud SecurityCloud-delivered, AI-powered secure web gateway and zero-trust network access; offers CASB, ZTNA, SWG, and SaaS protection for distributed environments.

These platforms represent the forefront of leveraging AI for detection, prevention, response, SOC automation, cloud workload defense, and Zero Trust security in 2025.

The bottom line? The future of cybersecurity is fast-moving, automated, and context-driven. As attack surfaces widen (especially around AI), defense strategies must evolve to keep pace. Integrating these AI-driven tools and techniques isn’t just an upgrade—it’s an essential shield for today’s digital enterprise.

Feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

Sponsorship Opportunities

The post Emerging Trends in AI Cybersecurity Defense: What’s Shaping 2025? Top AI Security Tools appeared first on MarkTechPost.

Simplify access control and auditing for Amazon SageMaker Studio using …

AWS supports trusted identity propagation, a feature that allows AWS services to securely propagate a user’s identity across service boundaries. With trusted identity propagation, you have fine-grained access controls based on a physical user’s identity rather than relying on IAM roles. This integration allows for the implementation of access control through services such as Amazon S3 Access Grants and maintains detailed audit logs of user actions across supported AWS services such as Amazon EMR. Furthermore, it supports long-running user background sessions for training jobs, so you can log out of your interactive ML application while the background job continues to run.
Amazon SageMaker Studio now supports trusted identity propagation, offering a powerful solution for enterprises seeking to enhance their ML system security. By integrating trusted identity propagation with SageMaker Studio, organizations can simplify access management by granting permissions to existing AWS IAM Identity Center identities.
In this post, we explore how to enable and use trusted identity propagation in SageMaker Studio, demonstrating its benefits through practical use cases and implementation guidelines. We walk through the setup process, discuss key considerations, and showcase how this feature can transform your organization’s approach to security and access controls.
Solution overview
In this section, we review the architecture for the proposed solution and the steps to enable trusted identity propagation for your SageMaker Studio domain.
The following diagram shows the interaction between the different components that allow the user’s identity to propagate from their identity provider and IAM Identity Center to downstream services such as Amazon EMR and Amazon Athena.

With a trusted identity propagation-enabled SageMaker Studio domain, users can access data across supported AWS services using their end user identity and group membership, in addition to access allowed by their domain or user execution role. In addition, API calls from SageMaker Studio notebooks and supported AWS services and Amazon SageMaker AI features log the user identity in AWS CloudTrail. For a list of supported AWS services and SageMaker AI features, see Trusted identity propagation architecture and compatibility. In the following sections, we show how to enable trusted identity propagation for your domain.
This solution applies for SageMaker Studio domains set up using IAM Identity Center as the method of authentication. If your domain is set up using IAM, see Implement user-level access control for multi-tenant ML platforms on Amazon SageMaker AI for best practices on managing and scaling access control.
Prerequisites
To follow along with this post, you must have the following:

An AWS account with an organization instance of IAM Identity Center configured through AWS Organizations
Administrator permissions (or elevated permissions allowing modification of IAM principals, and SageMaker administrator access to create and update domains)

Create or update the SageMaker execution role
For trusted identity propagation to work, the SageMaker execution role (domain and user profile execution role), should allow the sts:SetContext permissions, in addition to sts:AssumeRole, in its trust policy. For a new SageMaker AI domain, create a domain execution role by following the instructions in Create execution role. For existing domains, follow the instructions in Get your execution role to find the user or domain’s execution role.
Next, to update the trust policy for the role, complete the following steps:

In the navigation pane of the IAM console, choose Roles.
In the list of roles in your account, choose the domain or user execution role.
On the Trust relationships tab, choose Edit trust policy.
Update the trust policy with the following statement:

{
“Version”: “2012-10-17”,
“Statement”: [
…..
{
“Effect”: “Allow”,
“Principal”: {
“Service”: [
“sagemaker.amazonaws.com”,
]
},
“Action”: [
“sts:AssumeRole”,
“sts:SetContext”
],
“Condition”: {
“aws:SourceAccount”: “<account>”
}
}
}
]
}

Choose Update policy to save your changes.

Trusted identity propagation only works for private spaces at the time of launch.
Create a SageMaker AI domain with trusted identity propagation enabled
SageMaker AI domains using IAM Identity Center for authentication can only be set up in the same AWS Region as the IAM Identity Center instance. To create a new SageMaker domain, follow the steps in Use custom setup for Amazon SageMaker AI. For Trusted identity propagation, select Enable trusted identity propagation for all users on this domain, and continue with the rest of the setup to create a domain and assign users and groups, choosing the role you created in the previous step.

Update an existing SageMaker AI domain
You can also update your existing SageMaker AI domain to enable trusted identity propagation. You can enable trusted identity propagation even while the domain or user has active SageMaker Studio applications. However, for the changes to be applied, the active applications must be restarted. You can use the EffectiveTrustedIdentityPropagationStatus field in the response to the DescribeApp API for running applications to determine if the application has trusted identity propagation enabled.
To enable trusted identity propagation for the domain using the SageMaker AI console, choose Edit under Authentication and permissions on the Domain settings tab.

For Trusted identity propagation, select Enable trusted identity propagation for all users on this domain, and choose Submit to save the changes.

(Optional) Update user background session configuration in IAM Identity Center
IAM Identity Center now supports running user background sessions, and the session duration is set by default to 7 days. With background sessions, users can launch long-running SageMaker training jobs that assume the user’s identity context along with the SageMaker execution role. As an administrator, you can enable or disable user background sessions, and modify the session duration for user background sessions. As of the time of writing, the maximum session duration that you can set for user background sessions is 90 days. The user’s session is stopped at the end of the specified duration, and consequently, the training job will also fail at the end of the session duration.
To disable or update the session duration, navigate to the IAM Identity Center console, choose Settings in the navigation pane, and choose Configure under Session duration.

For User background sessions, select Enable user background sessions and use the dropdown to change the session duration. If user background sessions are disabled, the user must be logged in for the duration of the training job; otherwise, the training job will fail once the user logs out. Updating this configuration doesn’t affect current running sessions and only applies to newly created user background sessions. Choose Save to save your settings.

Use cases
Imagine you’re an enterprise with hundreds or even thousands of users, each requiring varying levels of access to data across multiple teams. You’re responsible for maintaining an AI/ML system on SageMaker AI and managing access permissions across diverse data sources such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and AWS Lake Formation. Traditionally, this has involved maintaining complex IAM policies for users, services, and resources, including bucket policies where applicable. This approach is not only tedious but also makes it challenging to track and audit data access without maintaining a separate role for each user.
This is precisely the scenario that trusted identity propagation aims to address. With trusted identity propagation support, you can now maintain service-specific roles with minimal permissions, such as s3:GetDataAccess or LakeFormation:GetDataAccess, along with additional permissions to start jobs, view job statuses, and perform other necessary tasks. For data access, you can assign fine-grained policies directly to individual users. For instance, Jane might have read access to customer data and full access to sales and pricing data, whereas Laura might only have read access to sales trends. Both Jane and Laura can assume the same SageMaker AI role to access their SageMaker Studio applications, while maintaining separate data access permissions based on their individual identities.In the following sections, we explore how this can be achieved for common use cases, demonstrating the power and flexibility of trusted identity propagation in simplifying data access management while maintaining robust security and auditability.
Scenario 1: Experiment with Amazon S3 data in notebooks
S3 Access Grants provide a simplified way to manage data access at scale. Unlike traditional IAM roles and policies that require a detailed knowledge of IAM concepts, and frequent policy updates as new resources are added, with S3 Access Grants, you can define access to data based on familiar database-like grants that automatically scale with your data. This approach significantly reduces the operational overhead of managing thousands of IAM policies and bucket policies, and overcomes the limitations of IAM permissions, while strengthening security through access patterns. If you don’t have S3 Access Grants set up, see Create an S3 Access Grant instance to get started. For detailed architecture and use cases, you can also refer to Scaling data access with Amazon S3 Access Grants. After you have set up S3 Access Grants, you can grant access to your datasets to users based on their identity in IAM Identity Center.
To use S3 Access Grants from SageMaker Studio, update the following IAM roles with policies and trust policies.
For the domain or user execution role, add the following inline policy:

{
“Version”: “2012-10-17”,
“Statement”: [
{
“Sid”: “AllowDataAccessAPI”,
“Effect”: “Allow”,
“Action”: [
“s3:GetDataAccess”
],
“Resource”: [
“arn:aws:s3:<region>:<account>:access-grants/default”
]
},
{
“Sid”: “RequiredForTIP”,
“Effect”: “Allow”,
“Action”: “sts:SetContext”,
“Resource”: “arn:aws:iam::<account>:role/<s3-access-grants-role>”
}
]
}

Make sure the S3 Access Grants role’s trust policy allows the sts:SetContext action in addition to sts:AssumeRole. The following is a sample trust policy:

{
“Version”: “2012-10-17”,
“Statement”: [
{
“Effect”: “Allow”,
“Principal”: {
“Service”: [
“access-grants.s3.amazonaws.com”
]
},
“Action”: [
“sts:AssumeRole”,
“sts:SetContext”
],
“Condition”: {
“StringEquals”: {
“aws:SourceArn”: “arn:aws:s3:<region>:<account>:access-grants/default”
}
}
}
]

 
Now, the user can access the data as allowed by S3 Access Grants for your user identity by calling the
GetDataAccess API to return temporary credentials, and by assuming the temporary credentials to read or write to their prefixes. For example, the following code shows how to use Boto3 to get temporary credentials and assume the credentials to get access to Amazon S3 locations that are allowed through S3 Access Grants:

 
import boto3
from botocore.config import Config

def get_access_grant_credentials(account_id: str, target: str,
permission: str = ‘READ’):
s3control = boto3.client(‘s3control’)
response = s3control.get_data_access(
AccountId=account_id,
Target=target,
Permission=permission
)
return response[‘Credentials’]

def create_s3_client_from_credentials(credentials) -> boto3.client:
return boto3.client(
‘s3’,
aws_access_key_id=credentials[‘AccessKeyId’],
aws_secret_access_key=credentials[‘SecretAccessKey’],
aws_session_token=credentials[‘SessionToken’]
)

# Create client
credentials = get_access_grant_credentials(‘<account>’,
“s3://<bucket>/<allowed-prefix>/”)
s3 = create_s3_client_from_credentials(credentials)

# Will succeed
s3.list_objects(Bucket=”<bucket>”, Prefix=”<allowed-prefix>”)

# Will fail
s3.list_objects(Bucket=”<bucket>”, Prefix=”<any-other-prefix>”)

Scenario 2: Access Lake Formation through Athena
Lake Formation provides centralized governance and fine-grained access control management for data stored in Amazon S3 and metadata in the AWS Glue Data Catalog. The Lake Formation permission model operates in conjunction with IAM permissions, offering granular controls at the database, table, column, row, and cell levels. This dual-layer security model provides comprehensive data governance while maintaining flexibility in access patterns.
Data governed through Lake Formation can be accessed through various AWS analytics services. In this scenario, we demonstrate using Athena, a serverless query engine that integrates seamlessly with Lake Formation’s permission model. For other services like Amazon EMR on EC2, make sure the resource is configured to support trusted identity propagation, including setting up security configurations and making sure the EMR cluster is configured with IAM roles that support trusted identity propagation.
The following instructions assume that you have already set up Lake Formation. If not, see Set up AWS Lake Formation and follow the AWS Lake Formation tutorials to set up Lake Formation and bring in your data.
Complete the following steps to access your governed data in trusted identity propagation-enabled SageMaker Studio notebooks using Athena:

Integrate Lake Formation with IAM Identity Center by following the instructions in Integrating IAM Identity Center. At a high level, this includes creating an IAM role allowing creating and updating application configurations in Lake Formation and IAM Identity Center, and providing the single sign-on (SSO) instance ID.
Grant permissions to the IAM Identity Center user to the relevant resources (database, table, row or column) using Lake Formation. See Granting permissions on Data Catalog resources instructions.
Create an Athena workgroup that supports trusted identity propagation by following instructions in Create a workgroup and choosing IAM Identity Center as the method of authentication. Make sure the user has access to write to the query results location provided here using S3 Access Grants, because Athena uses access grants by default when choosing IAM Identity Center as the authentication method.
Update the Athena workgroup’s IAM role with the following trust policy (add sts:SetContext to the existing trust policy). You can find the IAM role by choosing the workgroup you created earlier and looking for Role name.

{
“Version”: “2012-10-17”,
“Statement”: [
{
“Sid”: “AthenaTrustPolicy”,
“Effect”: “Allow”,
“Principal”: {
“Service”: “athena.amazonaws.com”
},
“Action”: [
“sts:AssumeRole”,
“sts:SetContext”
],
“Condition”: {
“StringEquals”: {
“aws:SourceAccount”: “<account-id>”
},
“ArnLike”: {
“aws:SourceArn”: “arn:aws:athena:<region>:<account-id>:workgroup/<workgroup-name>”
}
}
}
]
}

The setup is now complete. You can now launch SageMaker Studio using an IAM Identity Center user, launch a JupyterLab or Code Editor application, and query the database. See the following example code to get started:

import time
import boto3
import pandas as pd
athena_client = boto3.client(“athena”)

database = “<database-name>”
table = “<table-name>”
query = f”SELECT * FROM {database}.{table}”
output_location = “s3://<bucket-name>/queries” # bucket name and location from Step 3

response = athena_client.start_query_execution(
QueryString=query,
QueryExecutionContext={‘Database’: database},
ResultConfiguration={‘OutputLocation’: output_location}
)

# Get the query execution ID
query_execution_id = response[‘QueryExecutionId’]

# wait for query to complete
while True:
query_status = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
status = query_status[‘QueryExecution’][‘Status’][‘State’]
if status in [‘SUCCEEDED’, ‘FAILED’, ‘CANCELLED’]:
break
time.sleep(1)

# If the query succeeded, fetch and display results
if status == ‘SUCCEEDED’:
results = athena_client.get_query_results(QueryExecutionId=query_execution_id)

# Extract column names and data
columns = [col[‘Name’] for col in results[‘ResultSet’][‘ResultSetMetadata’][‘ColumnInfo’]]
data = []
for row in results[‘ResultSet’][‘Rows’][1:]: # Skip the header row
data.append([field.get(‘VarCharValue’, ”) for field in row[‘Data’]])

# Create a pandas DataFrame
df = pd.DataFrame(data, columns=columns)

# Display the first few rows
print(df.head())
else:
print(f”Query failed with status: {status}”)

Scenario 3: Create a training job supported with user background sessions
For a trusted identity propagation-enabled domain, a user background session is a session that continues to run even if the end-user has logged out of their interactive session such as JupyterLab applications in SageMaker Studio. For example, the user can initiate a training job from their SageMaker Studio space, and the job can run in the background for days or weeks regardless of the user’s activity, and use the user’s identity to access data and log audit trails. If your domain doesn’t have trusted identity propagation enabled, you can continue to run training jobs and processing jobs as before; however, if trusted identity propagation is enabled, make sure your user background session time is updated to reflect the duration of your training jobs, because the default is set automatically to 7 days. If you have enabled user background sessions, update your SageMaker Studio domain or user’s execution role with the following permissions to provide a seamless experience for data scientists:

{
“Version”: “2012-10-17”,
“Statement”: [
{
“Sid”: “AllowDataAccessAPI”,
“Effect”: “Allow”,
“Action”: [
“s3:GetDataAccess”,
“s3:GetAccessGrantsInstanceForPrefix”
],
“Resource”: [
“arn:aws:s3:<region>:<account>:access-grants/default”
]
},
{
“Sid”: “RequiredForTIP”,
“Effect”: “Allow”,
“Action”: “sts:SetContext”,
“Resource”: “arn:aws:iam::<account>:role/<s3-access-grants-role>”
}
]
}

With this setup, a data scientist can use an Amazon S3 location that they have access to through S3 Access Grants. SageMaker automatically looks for data access using S3 Access Grants and falls back to the job’s IAM role otherwise. For example, in the following SDK call to create the training job, the user provides the S3 Amazon URI where the data is stored, they have access to it through S3 Access Grants, and they can run this job without additional setup:

response = sm.create_training_job(
TrainingJobName=training_job_name,
AlgorithmSpecification={
‘TrainingImage’: ‘763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04’,
‘TrainingInputMode’: ‘File’,

RoleArn=’arn:aws:iam::<account>:role/tip-domain-role’,
InputDataConfig=[
{
‘ChannelName’: ‘training’,
‘DataSource’: {
‘S3DataSource’: {
‘S3DataType’: ‘S3Prefix’,
‘S3Uri’: ‘s3://<s3-ag-enabled-bucket>/<s3-ag-enabled-prefix>’,
‘S3DataDistributionType’: ‘FullyReplicated’
}
},
‘CompressionType’: ‘None’,
‘RecordWrapperType’: ‘None’
},

}

(Optional) View and manage user background sessions on IAM Identity Center
When training jobs are run as user background sessions, you can view these sessions as user background sessions on IAM Identity Center. The administrator can view a list of all user background sessions and optionally stop a session if the user has left the team, for example. When the user background session is ended, the training job subsequently fails.
To view a list of all user background sessions, on the IAM Identity Center console, choose Users and choose the user you want view the user background sessions for. Choose the Active sessions tab to view a list of sessions. The user background session can be identified by the Session type column, which shows if the session is interactive or a user background session. The list also shows the job’s Amazon Resource Name (ARN) under the Used by column.
To end a session, select the session and choose End sessions.

You will be prompted to confirm the action. Enter confirm to confirm that you want to end the session and choose End sessions to stop the user background session.

Scenario 4: Auditing using CloudTrail
After trusted identity propagation is enabled for your domain, you can now track the user that performed specific actions through CloudTrail. To try this out, log in to SageMaker Studio, and create and open a JupyterLab space. Open a terminal and enter aws s3 ls to list the available buckets in your Region.
On the CloudTrail console, choose Event history in the navigation pane. Update the Lookup attributes to Event name and in the search box, enter ListBuckets. You should see a list of events, as shown in the following screenshot (it might take up to 5 minutes for the logs to be available in CloudTrail).

Choose the event to view its details (verify the user name is SageMaker if you have also listed buckets through the AWS console or APIs). In the event details, you should be able to see an additional field called onBehalfOf that has the user’s identity.

Supported services and SageMaker AI features called from a trusted identity propagation-enabled SageMaker Studio domain will have the OnBehalfOf field in CloudTrail.
Clean up
If you have created a SageMaker Studio domain for the purposes of trying out trusted identity propagation, delete the domain and its associated Amazon Elastic File System (Amazon EFS) volume to avoid incurring additional charges. Before deleting a domain, you must delete all the users and their associated spaces and applications. For detailed instructions, see Stop and delete your Studio running applications and spaces.
If you created a SageMaker training job, they are ephemeral, and the compute is shut down automatically when the job is complete.
Athena is a serverless analytics service that charges per query billing. No cleanup is necessary, but for best practices, delete the workgroup to remove unused resources.
Conclusion
In this post, we showed you how to enable trusted identity propagation for SageMaker AI domains that use IAM Identity Center as the mode of authentication. With trusted identity propagation, administrators can manage user authorization to other AWS services through the user’s physical identity in conjunction with IAM roles. Administrators can streamline permissions management by maintaining a single domain execution role and manage granular access to other AWS services and data sources through the user’s identity. In addition, trusted identity propagation supports auditing, so administrators can track user activity without the need for managing a role for each user profile.
To learn more about enabling this feature and its use cases, see Trusted identity propagation use cases and Trusted identity propagation with Studio. This post covered a subset of supported applications; we encourage you to check out the documentation and choose the services that best serve your use case and share your feedback!

About the authors
Amit Shyam Jaisinghani is a Software Engineer on the SageMaker Studio team at Amazon Web Services, and he earned his Master’s degree in Computer Science from Rochester Institute of Technology. Since joining Amazon in 2019, he has built and enhanced several AWS services, including AWS WorkSpaces and Amazon SageMaker Studio. Outside of work, he explores hiking trails, plays with his two cats, Missy and Minnie, and enjoys playing Age of Empire.
Durga Sury is a Senior Solutions Architect at Amazon SageMaker, where she helps enterprise customers build secure and scalable AI/ML systems. When she’s not architecting solutions, you can find her enjoying sunny walks with her dog, immersing herself in murder mystery books, or catching up on her favorite Netflix shows.
Khushboo Srivastava is a Senior Product Manager for Amazon SageMaker. She enjoys building products that simplify machine learning workflows for customers, and loves playing with her 1-year old daughter.
Krishnan Manivannan is a Senior Software Engineer at Amazon Web Services and a founding member of the SageMaker AI API team. He has 8 years of experience in the architecture and security of large-scale machine learning services. His specialties include API design, service scalability, identity and access management, and inventing new approaches for building and operating distributed systems. Krishnan has led multiple engineering efforts from design through global launch, delivering reliable and secure systems for customers worldwide.

Benchmarking document information localization with Amazon Nova

Every day, enterprises process thousands of documents containing critical business information. From invoices and purchase orders to forms and contracts, accurately locating and extracting specific fields has traditionally been one of the most complex challenges in document processing pipelines. Although optical character recognition (OCR) can tell us what text exists in a document, determining where specific information is located has required sophisticated computer vision solutions.
The evolution of this field illustrates the complexity of the challenge. Early object detection approaches like YOLO (You Only Look Once) revolutionized the field by reformulating object detection as a regression problem, enabling real-time detection. RetinaNet advanced this further by addressing class imbalance issues through Focal Loss, and DETR introduced transformer-based architectures to minimize hand-designed components. However, these approaches shared common limitations: they required extensive training data, complex model architectures, and significant expertise to implement and maintain.
The emergence of multimodal large language models (LLMs) represents a paradigm shift in document processing. These models combine advanced vision understanding with natural language processing capabilities, offering several groundbreaking advantages:

Minimized use of specialized computer vision architectures
Zero-shot capabilities without the need for supervised learning
Natural language interfaces for specifying location tasks
Flexible adaptation to different document types

This post demonstrates how to use foundation models (FMs) in Amazon Bedrock, specifically Amazon Nova Pro, to achieve high-accuracy document field localization while dramatically simplifying implementation. We show how these models can precisely locate and interpret document fields with minimal frontend effort, reducing processing errors and manual intervention. Through comprehensive benchmarking on the FATURA dataset, we quantify performance and offer practical implementation guidance.
Understanding document information localization
Document information localization goes beyond traditional text extraction by identifying the precise spatial position of information within documents. Although OCR tells us what text exists, localization tells us where specific information resides—a crucial distinction for modern document processing workflows. This capability enables critical business operations ranging from automated quality checks and sensitive data redaction to intelligent document comparison and validation.
Traditional approaches to this challenge relied on a combination of rule-based systems and specialized computer vision models. These solutions often required extensive training data, careful template matching, and continuous maintenance to handle document variations. Financial institutions, for instance, would need separate models and rules for each type of invoice or form they processed, making scalability a significant challenge.
Multimodal models with localization capabilities, such as those available on Amazon Bedrock, fundamentally change this paradigm. Rather than requiring complex computer vision architectures or extensive training data, these multimodal LLMs can understand both the visual layout and semantic meaning of documents through natural language interactions. By using models with the capability to localize, organizations can implement robust document localization with significantly reduced technical overhead and greater adaptability to new document types.
Solution overview
We designed a simple localization solution that takes a document image and text prompt as input, processes it through selected FMs on Amazon Bedrock, and returns the field locations using either absolute or normalized coordinates. The solution implements two distinct prompting strategies for document field localization:

Image dimension strategy – Works with absolute pixel coordinates, providing explicit image dimensions and requesting bounding box locations based on the document’s actual size
Scaled coordinate strategy – Uses a normalized 0–1000 coordinate system, making it more flexible across different document sizes and formats

The solution has a modular design to allow for straightforward extension to support custom field schemas through configuration updates rather than code changes. This flexibility, combined with the scalability of Amazon Bedrock, makes the solution suitable for both small-scale document processing and enterprise-wide deployment. In the following sections, we demonstrate the setup and implementation strategies used in our solution for document field localization using Amazon Bedrock FMs. You can see more details in our GitHub repository.
Prerequisites
For this walkthrough, you should have the following prerequisites:

An AWS account with Amazon Bedrock access
Permissions to use Amazon Nova Pro
Python 3.8+ with the boto3 library installed

Initial setup
Complete the following setup steps:

Configure the Amazon Bedrock runtime client with appropriate retry logic and timeout settings:

import boto3
from botocore.config import Config

# Configure Bedrock client with retry logic
BEDROCK_CONFIG = Config(
    region_name='us-west-2',
    signature_version='v4',
    read_timeout=500,
    retries={
        'max_attempts': 10,
        'mode': 'adaptive'
    }
)

# Initialize Bedrock runtime client
bedrock_runtime = boto3.client("bedrock-runtime", config=BEDROCK_CONFIG)

Define your field configuration to specify which elements to locate in your documents:

# Sample field configuration
field_config = {
    "invoice_number": {"type": "string", "required": True},
    "total_amount": {"type": "currency", "required": True},
    "date": {"type": "date", "required": True}
}

Initialize the BoundingBoxExtractor with your chosen model and strategy:

extractor = BoundingBoxExtractor(
    model_id=NOVA_PRO_MODEL_ID,  # or other FMs on Amazon Bedrock
    prompt_template_path="path/to/prompt/template",
    field_config=field_config,
    norm=None  # Set to 1000 for the scaled coordinate strategy
)

# Process a document
bboxes, metadata = extractor.get_bboxes(
    document_image=document_image,
    document_key="invoice_001"  # Optional tracking key
)
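
Under the hood, each localization request goes through the Bedrock runtime. The following is a minimal sketch of the kind of call the extractor issues using the Converse API and the bedrock_runtime client configured earlier; the model ID, image format, and JSON parsing are simplified illustrations rather than the exact code in the repository:

import json
import re

def locate_fields(image_bytes: bytes, prompt: str) -> dict:
    """Send one document image plus a localization prompt to Amazon Nova Pro."""
    response = bedrock_runtime.converse(
        modelId="us.amazon.nova-pro-v1:0",  # adjust to the model or inference profile enabled in your account
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": prompt},
            ],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )
    # The prompt instructs the model to return only JSON with bounding box data;
    # pull out the JSON object in case the reply is wrapped in a Markdown fence.
    text = response["output"]["message"]["content"][0]["text"]
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(match.group(0)) if match else {}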

Implement prompting strategies
We test two prompt strategies in this workflow: image dimension and scaled coordinate.
The following is a sample prompt template for the image dimension strategy:

"""
Your task is to detect and localize objects in images with high precision.
Analyze each provided image (width = {w} pixels, height = {h} pixels) and return only a JSON object with bounding box data for detected objects.

Output Requirements:
1. Use absolute pixel coordinates based on provided width and height.
2. Ensure high accuracy and tight-fitting bounding boxes.

Detected Object Structure:
- "element": Use one of these labels exactly: {elements}
- "bbox": Array with coordinates [x1, y1, x2, y2] in absolute pixel values.

JSON Structure:
```json
{schema}
```

Provide only the specified JSON format without extra information.
"""

The following is a sample prompt template for the scaled coordinate strategy:

"""
Your task is to detect and localize objects in images with high precision.
Analyze each provided image and return only a JSON object with bounding box data for detected objects.

Output Requirements:
Use (x1, y1, x2, y2) format for bounding box coordinates, scaled between 0 and 1000.

Detected Object Structure:
- "element": Use one of these labels exactly: {elements}
- "bbox": Array [x1, y1, x2, y2] scaled between 0 and 1000.

JSON Structure:
```json
{schema}
```

Provide only the specified JSON format without extra information.
"""

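Because the scaled coordinate strategy returns boxes on a 0–1000 grid, a small post-processing step maps them back to pixel space before evaluation or visualization. The following is a minimal sketch of that conversion; the example dimensions correspond to an A4 page at 300 DPI and are otherwise illustrative:

def scaled_to_pixels(bbox, image_width, image_height, norm=1000):
    """Convert a [x1, y1, x2, y2] box on the 0-1000 grid to absolute pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return [
        x1 / norm * image_width,
        y1 / norm * image_height,
        x2 / norm * image_width,
        y2 / norm * image_height,
    ]

# Example: a predicted box of [120, 80, 430, 140] on a 2480 x 3508 pixel page
pixel_box = scaled_to_pixels([120, 80, 430, 140], image_width=2480, image_height=3508)
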
Evaluate performance
We implement evaluation metrics to monitor accuracy:

evaluator = BBoxEvaluator(field_config=field_config)
evaluator.set_iou_threshold(0.5)  # Adjust based on requirements
evaluator.set_margin_percent(5)   # Tolerance for position matching

# Evaluate predictions
results = evaluator.evaluate(predictions, ground_truth)
print(f"Mean Average Precision: {results['mean_ap']:.4f}")

This implementation provides a robust foundation for document field localization while maintaining flexibility for different use cases and document types. The choice between image dimension and scaled coordinate strategies depends on your specific accuracy requirements and document variation.
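The evaluation hinges on Intersection over Union (IoU) between predicted and ground-truth boxes. The following is a minimal sketch of the metric for [x1, y1, x2, y2] boxes; the BBoxEvaluator in the repository layers field matching and the margin tolerance on top of this:

def iou(box_a, box_b):
    """Intersection over Union for two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of areas minus the intersection
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# In this benchmark, a prediction counts as a hit when iou(pred, truth) >= 0.5.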
Benchmarking results
We conducted our benchmarking study using FATURA, a public invoice dataset specifically designed for document understanding tasks. The dataset comprises 10,000 single-page invoices saved as JPEG images, representing 50 distinct layout templates with 200 invoices per template. Each document is annotated with 24 key fields, including invoice numbers, dates, line items, and total amounts. The annotations provide both the text values and precise bounding box coordinates in JSON format, making it ideal for evaluating field localization tasks. The dataset has the following key characteristics:

Documents: 10,000 invoices (JPEG format)
Templates: 50 distinct layouts (200 documents each)
Fields per document: 24 annotated fields
Annotation format: JSON with bounding boxes and text values
Field types: Invoice numbers, dates, addresses, line items, amounts, taxes, totals
Image resolution: Standard A4 size at 300 DPI
Language: English

The following figure shows sample invoice templates showcasing layout variation.

The following figure is an example of annotation visualization.

Before conducting the full-scale benchmark, we performed an initial experiment to determine the optimal prompting strategy. We selected a representative subset of 50 images, comprising 5 samples from 10 different templates, and evaluated three distinct approaches:

Image dimension:

Method: Provides explicit pixel dimensions and requests absolute coordinate bounding boxes
Input: Image bytes, image dimensions, output schema

Scaled coordinate:

Method: Uses normalized 0-1000 coordinate system
Input: Image bytes, output schema

Added gridlines:

Method: Enhances image with visual gridlines at fixed intervals
Input: Modified image with gridlines bytes, image dimensions, output schema

The following figure compares the Mean Average Precision (mAP) of the different approaches.

Building on insights from our initial strategy evaluation, we conducted benchmarking using the complete FATURA dataset of 10,000 documents. We employed the scaled coordinate approach for Amazon Nova models, based on their respective optimal performance characteristics from our initial testing. Our evaluation framework assessed Amazon Nova Pro through standard metrics, including Intersection over Union (IoU) and Average Precision (AP). The evaluation spanned all 50 distinct invoice templates, using an IoU threshold of 0.5 and a 5% margin tolerance for field positioning.
The following are our sample results in JSON:

{
  "template": "template1",
  "instance": "Instance0",
  "metrics": {
    "mean_ap": 0.8421052631578947,
    "field_scores": {
      "TABLE": [0.9771107575829314, 1.0, 1.0, 1.0, 1.0],
      "BUYER": [0.3842328422050217, 0.0, 0.0, 0, 0.0],
      "DATE": [0.9415158516000428, 1.0, 1.0, 1.0, 1.0],
      "DISCOUNT": [0.8773709977744115, 1.0, 1.0, 1.0, 1.0],
      "DUE_DATE": [0.9338410331219548, 1.0, 1.0, 1.0, 1.0],
      "GSTIN_BUYER": [0.8868145680064249, 1.0, 1.0, 1.0, 1.0],
      "NOTE": [0.7926162009357707, 1.0, 1.0, 1.0, 1.0],
      "PAYMENT_DETAILS": [0.9517931284002012, 1.0, 1.0, 1.0, 1.0],
      "PO_NUMBER": [0.8454266053075804, 1.0, 1.0, 1.0, 1.0],
      "SELLER_ADDRESS": [0.9687004508445741, 1.0, 1.0, 1.0, 1.0],
      "SELLER_EMAIL": [0.8771026147909002, 1.0, 1.0, 1.0, 1.0],
      "SELLER_SITE": [0.8715647216012751, 1.0, 1.0, 1.0, 1.0],
      "SUB_TOTAL": [0.8049954543667662, 1.0, 1.0, 1.0, 1.0],
      "TAX": [0.8751563641702513, 1.0, 1.0, 1.0, 1.0],
      "TITLE": [0.850667327423512, 1.0, 1.0, 1.0, 1.0],
      "TOTAL": [0.7226784112051814, 1.0, 1.0, 1.0, 1.0],
      "TOTAL_WORDS": [0.9099353099528785, 1.0, 1.0, 1.0, 1.0],
      "GSTIN_SELLER": [0.87170328009624, 1.0, 1.0, 1.0, 1.0],
      "LOGO": [0.679425211111111, 1.0, 1.0, 1.0, 1.0]
    }
  },
  "metadata": {
    "usage": {
      "inputTokens": 2250,
      "outputTokens": 639,
      "totalTokens": 2889
    },
    "metrics": {
      "latencyMs": 17535
    }
  }
}

The following figure is an example of successful localization for Amazon Nova Pro.

The results demonstrate Amazon Nova Pro’s strong performance in document field localization. Amazon Nova Pro achieved a mAP of 0.8305. It demonstrated consistent performance across various document layouts, achieving a mAP above 0.80 across 45 of 50 templates, with the lowest template-specific mAP being 0.665. Although Amazon Nova Pro showed relatively high processing failures (170 out of 10,000 images), it still maintained high overall performance. Most low AP results were attributed to either complete processing failures (particularly over-refusal by its guardrail filters and malformed JSON output) or field misclassifications (particularly confusion between similar fields, such as buyer vs. seller addresses).
The following table summarizes the overall performance metrics.

Model              Mean IoU    Mean AP
Amazon Nova Pro    0.7423      0.8331

The following graph shows the performance distribution for each individual extraction, covering approximately 20 labels across the 10,000 documents.

Field-specific analysis reveals that Amazon Nova Pro excels at locating structured fields like invoice numbers and dates, consistently achieving precision and recall scores above 0.85. It demonstrates particularly strong performance with text fields, maintaining robust accuracy even when dealing with varying currency formats and decimal representations. This resilience to format variations makes it especially valuable for processing documents from multiple sources or regions.
The following graph summarizes field-specific performance. It shows the AP success percentage for each label across all documents for each model, sorted from highest to lowest success.

Conclusion
This benchmarking study demonstrates the significant advances that multimodal FMs bring to document field localization. Through comprehensive testing on the FATURA dataset, we’ve shown that these models can effectively locate and extract document fields with minimal setup effort, dramatically simplifying traditional computer vision workflows. Amazon Nova Pro emerges as an excellent choice for enterprise document processing, delivering a mAP of 0.8305 with consistent performance across diverse document types. Looking ahead, we see several promising directions for further optimization. Future work could explore extending the solution into agentic workflows to support more complex document types and field relationships.
To get started with your own implementation, you can find the complete solution code in our GitHub repository. We also recommend reviewing the Amazon Bedrock documentation for the latest model capabilities and best practices.

About the authors
Ryan Razkenari is a Deep Learning Architect at the AWS Generative AI Innovation Center, where he uses his expertise to create cutting-edge AI solutions. With a strong background in AI and analytics, he is passionate about building innovative technologies that address real-world challenges for AWS customers.
Harpreet Cheema is a Deep Learning Architect at the AWS Generative AI Innovation Center. He is very passionate in the field of machine learning and in tackling different problems in the ML domain. In his role, he focuses on developing and delivering Generative AI focused solutions for real-world applications.
Spencer Romo is a Senior Data Scientist with extensive experience in deep learning applications. He specializes in intelligent document processing while maintaining broad expertise in computer vision, natural language processing, and signal processing. Spencer’s innovative work in remote sensing has resulted in multiple patents. Based in Austin, Texas, Spencer loves working directly with customers to understand their unique problems and identify impactful AI solutions. Outside of work, Spencer competes in 24 Hours of Lemons racing series, embracing the challenge of high-performance driving on a budget.
Mun Kim is a Machine Learning Engineer at the AWS Generative AI Innovation Center. Mun brings expertise in building machine learning science and platform that help customers harness the power of generative AI technologies. He works closely with AWS customers to accelerate their AI adoption journey and unlock new business value.
Wan Chen is an Applied Science Manager at the Generative AI Innovation Center. As an ML/AI veteran in the tech industry, she has a wide range of expertise in traditional machine learning, recommender systems, deep learning, and generative AI. She is a strong believer in superintelligence and is passionate about pushing the boundary of AI research and application to enhance human life and drive business growth. She holds a Ph.D. in Applied Mathematics from the University of British Columbia and worked as a postdoctoral fellow at Oxford University.

How Infosys built a generative AI solution to process oil and gas dril …

Enterprises across industries like healthcare, finance, manufacturing, and legal services face escalating challenges in processing vast amounts of multimodal data that combines text, images, charts, and complex technical formats. As organizations generate multimodal content at unprecedented speed and scale, document processing methods increasingly fail to handle the intricacies of specialized domains where technical terminology, interconnected data relationships, and industry-specific formats create operational bottlenecks. These conventional (non-AI) processing approaches struggle with the unique characteristics of enterprise documents: highly technical terminology, complex multimodal data formats, and interconnected information spread across various document types. This results in inefficient data extraction, missed insights, and time-consuming manual processing that hinders organizational productivity and decision-making. One such industry example is oil and gas, which generates vast amounts of complex technical data through drilling operations, presenting significant challenges in data processing and knowledge extraction. These documents, such as detailed well completion reports, drilling logs, and intricate lithology diagrams, contain crucial information that drives operational decisions and strategic planning.
To overcome such challenges, we built an advanced RAG solution using Amazon Bedrock leveraging Infosys Topaz AI capabilities, tailored for the oil and gas sector. This solution excels in handling multimodal data sources, seamlessly processing text, diagrams, and numerical data while maintaining context and relationships between different data elements. The specialized approach helps organizations unlock valuable insights from their technical documentation, streamline their workflows, and make more informed decisions based on comprehensive data analysis.
In this post, we provide insights on the solution and walk you through different approaches and architecture patterns explored, like different chunking, multi-vector retrieval, and hybrid search during the development.
Solution overview
The solution is built using AWS services, including Amazon Nova Pro on Amazon Bedrock, Amazon Bedrock Knowledge Bases, Amazon OpenSearch Serverless as a vector database, Amazon Titan Text Embeddings, the Cohere Embed English model, and a BGE reranker, allowing for scalability and cost-effectiveness. We also used Amazon Q Developer, an AI-powered assistant for software development, for the frontend and backend development of our solution, powered by Infosys Topaz’s generative AI capabilities. The solution uses distributed processing to handle large volumes of data, so the system can handle a high volume of requests without compromising performance. The real-time indexing system allows new documents to be incorporated into the system as soon as they are available, so that the information stays up to date.
The following are some of the key components of the solution:

Document processing – PyMuPDF for PDF parsing, OpenCV for image processing.
Embedding generation – Cohere Embed English on Amazon Bedrock for generating vector embeddings of document content and user queries. A hierarchical parent-child chunking architecture that preserves document structure and contextual relationships.
Vector storage – Amazon OpenSearch Serverless for hybrid search capabilities combining semantic vector search with traditional keyword search (although Amazon Bedrock Knowledge Bases provides a managed RAG solution, this implementation uses a custom RAG architecture to deliver enhanced value and flexibility). This multi-vector retrieval mechanism with separate embedding spaces was required for maintaining the context between textual and visual data.
Model – Amazon Nova model for domain-specific response generation.
Reranking – BGE reranker, for improving search result relevance by reordering retrieved documents based on semantic similarity to the query.

The following diagram is a high-level overview of the architecture of the solution.

Many approaches were used during the build phase to get the desired accuracy. In the following sections, we discuss these approaches in detail.
RAG exploration and initial approach
The following figure shows some sample images from the oil and well drilling reports. The image on the left is a performance chart of a well drilling operation with the details of the drilling instrument. The image on the top right is of the split sections of the drilling instrument, followed below by the drilling data in a tabular form.

Image source: Wells Search | NEATS, © Commonwealth of Australia, 2018
Over a thousand such technical images (including lithology diagrams, well completion charts, and drilling visualizations) were preprocessed using Amazon Nova Pro, a multimodal language model. An iterative prompting strategy was employed to generate comprehensive descriptions:

Initial image analysis to extract basic technical elements
Refined prompting with domain-specific context to capture specialized terminology
Multiple inference iterations to provide completeness and accuracy of technical descriptions

This process converted visual technical information into detailed textual descriptions that preserve the original technical context. The process included the following key components:

Text content processing – The textual content from drilling reports was processed using Amazon Titan Text Embedding v2 model with:

Fixed-size chunking of 1,500 tokens with 100-token overlap
Preservation of original document structure and technical relationships

Image content integration – The detailed image descriptions generated were integrated without chunking to maintain complete technical context
Vector storage – The processed content (chunked text and complete image descriptions) was ingested into an OpenSearch Serverless vector database
RAG implementation – RAG-enabled semantic search and retrieval is used across both textual content and image-derived descriptions

This approach worked well with text questions but was less effective with image-related questions and finding information from images. The lack of a chunking strategy for images resulted in the entire description of each image being ingested as a single unit into the search index. This made it difficult for the embedding model to pinpoint exact locations of specific information, especially for technical terms that might be buried within longer descriptions. In the following sections, we discuss the other approaches explored to overcome the limitations presented by each of the previous approaches.
Multi-vector embeddings with ColBERT
To use a vision model, we created multi-vector embeddings for each image. We then used the ColBERT embedding model for fine-grained text representations. User queries were converted into embeddings, and similarity scores between query and document embeddings were calculated. These embeddings were stored using tensor-based storage, and no chunking was applied. We observed the following:

Outcome – We encountered difficulties in storing and managing the complex ColBERT embeddings in traditional vector stores. Debugging and analyzing retrieved documents became cumbersome. Despite context-rich queries, selecting the proper document pages remained challenging.

Limitations and key learnings – This approach demonstrated the potential of advanced embedding techniques for image-based document retrieval. However, it also highlighted the challenges in implementing and managing such a system effectively, particularly in the complex domain of oil and gas. Overall, the use of vision models enhanced document understanding, and fine-grained representation of visual and textual content was achieved.
Fixed chunking with Amazon Titan Embeddings
Adopting a more traditional text-based approach, we introduced a fixed chunking strategy. PDF pages were converted to images, and text content was extracted from these images. A fixed chunking strategy of 500 tokens per chunk was implemented. We used Amazon Bedrock Knowledge Bases with OpenSearch Serverless vector storage, upgraded to Amazon Titan Embeddings v2, and retained the Amazon Nova Pro model. We observed the following:

Outcome – The ability to find and retrieve information based on technical keyword searches improved. More focused chunks allowed for a more accurate representation of specific concepts.
Limitations and key learnings – Providing comprehensive, long-form answers was challenging. Rigid chunking sometimes split related information across different chunks. This approach underscored the importance of balancing chunk size with information coherence, improving our handling of technical terms but highlighting the need for more sophisticated chunking strategies to maintain context.

Parent-child hierarchy with Cohere Embeddings
Building on our previous learnings, we introduced a more sophisticated chunking strategy using a parent-child hierarchy. PDF pages were converted to images and text was extracted. We implemented a parent-child chunking hierarchy with parent chunks of 1,500 tokens and child chunks of 512 tokens. We switched to Cohere English embeddings, used Amazon Bedrock Knowledge Bases and OpenSearch Serverless vector storage, and continued using the Amazon Nova Pro model. We observed the following:

Outcome – This approach balanced the need for context with the ability to pinpoint specific information. It significantly improved the ability to answer a wide range of queries, maintaining context while offering precise information retrieval.
Limitations and key learnings – Careful structuring of content significantly enhanced the performance of both embedding and QnA models. The parent-child structure proved particularly effective for handling the complex, nested nature of oil and gas documentation.
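
To make the parent-child hierarchy concrete, the following is a minimal sketch of the chunking logic under simplifying assumptions (whitespace tokenization, no overlap); in the actual solution this structure was produced through Amazon Bedrock Knowledge Bases rather than hand-rolled code:

def split_chunks(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def build_parent_child_chunks(text, parent_size=1500, child_size=512):
    """Child chunks are embedded and indexed for precise retrieval, while each
    child keeps a pointer to its parent so the broader context can be returned."""
    tokens = text.split()  # simplifying assumption: whitespace tokens
    records = []
    for parent_id, parent in enumerate(split_chunks(tokens, parent_size)):
        parent_text = " ".join(parent)
        for child_id, child in enumerate(split_chunks(parent, child_size)):
            records.append({
                "parent_id": parent_id,
                "child_id": child_id,
                "parent_text": parent_text,
                "child_text": " ".join(child),
            })
    return records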

Hybrid search with optimized chunking
Our final approach retained the advanced features of the previous method while introducing a crucial change in the search methodology. PDF pages were converted to images and text was extracted. We implemented a hybrid search system within the Amazon Bedrock knowledge base. The parent-child chunking hierarchy was retained with parent chunks of 1,200 tokens and child chunks of 512 tokens. We continued using Cohere English embeddings and the Amazon Nova Pro model, and implemented a BGE reranker to refine search results. We observed the following:

Outcome – This approach combined the strengths of semantic search and traditional keyword-based search. It addressed the limitations of purely embedding-based searches and improved the handling of specific technical terms and exact phrases.
Limitations and key learnings – This final approach represents a highly evolved RAG system, offering the best of both worlds: the ability to understand context and nuance through embeddings, and the precision of keyword matching for specific technical queries.
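
For readers who want to see what a hybrid query looks like at the OpenSearch level, the following is a minimal sketch combining a BM25 keyword clause with a k-NN vector clause; the index name, field names, search pipeline, and embed() helper are illustrative assumptions, and in this solution hybrid search is configured through Amazon Bedrock Knowledge Bases rather than issued directly:

from opensearchpy import OpenSearch  # assumes the opensearch-py client; authentication omitted for brevity

client = OpenSearch(hosts=[{"host": "my-collection-endpoint", "port": 443}], use_ssl=True)

query_text = "fish left in hole at 5000 ft MD"
query_vector = embed(query_text)  # hypothetical helper returning the Cohere embedding

body = {
    "size": 10,
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"chunk_text": query_text}},  # keyword (BM25) clause
                {"knn": {"chunk_embedding": {"vector": query_vector, "k": 10}}},  # semantic clause
            ]
        }
    },
}

# A search pipeline with a normalization processor combines the two score distributions.
results = client.search(index="drilling-reports", body=body, params={"search_pipeline": "hybrid-norm-pipeline"})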

The following are some of the tangible results of the hybrid strategy:

Average query response time: Less than 2 seconds
Retrieval accuracy (measured against human expert baseline): 92%
User satisfaction rating: 4.7/5 based on feedback from field engineers and geologists

Hybrid RAG approach and optimization strategy
Let’s explore the key components and strategies that make the final approach effective for oil and gas drilling reports. Each of the following sections outlines the differentiators in the solution.
Multimodal processing capabilities
The solution is designed to handle the diverse types of information found in oil and gas documents. The system processes both textual content (technical jargon, well logs, production figures) and visual elements (well schematics, seismic charts, lithology graphs) while maintaining contextual relationships between them. For example, when processing a well completion report, the system can:

Extract key parameters from the text (such as total depth and casing sizes)
Analyze the accompanying well schematic
Link textual descriptions of formations to their visual representation in lithology charts

Domain-specific vocabulary handling
The system incorporates a comprehensive dictionary of industry terms and acronyms specific to oil and gas operations. Standard natural language processing (NLP) models often misinterpret technical terminology like “fish left in hole” or fail to recognize specialized abbreviations like “BOP” and “TVD.” By implementing domain-specific vocabulary handling, the system accurately interprets queries and maintains semantic understanding of technical concepts. This helps prevent misinterpretation of critical drilling information and provides relevant document retrieval. For example, when processing a query about “fish left in hole at 5000 ft MD,” the system understands:

“Fish” refers to lost equipment, not an actual fish
“MD” means measured depth
The relevance of this information to drilling operations and potential remediation steps

Hybrid hierarchy chunking strategy
Traditional fixed-size chunking often breaks apart related technical information, losing critical context in oil and gas documents. The solution implements a hybrid hierarchy approach with parent chunks (1,200 tokens) maintaining document-level context and child chunks (512 tokens) containing detailed technical information. Dynamic chunk sizing adjusts based on content complexity, using natural language processing to identify logical break points. This preserves the integrity of technical descriptions while enabling precise information retrieval across large, complex documents. For example, when processing a well completion report, the system will:

Create a large parent chunk for the overall well summary
Generate smaller child chunks for specific sections like casing details or perforation intervals
Dynamically adjust chunk size for the lithology description based on its complexity
Implement cross-references between the casing schedule and the well schematic description

Multi-vector retrieval implementation
Oil and gas documents contain diverse content types that require different retrieval approaches. The system creates separate embedding spaces for text, diagrams, and numerical data, implementing dense vector search for semantic similarity and sparse vector search for exact technical terminology matches. Cross-modal retrieval connects information across different content types, and contextual query expansion automatically includes relevant industry-specific terms. This hybrid approach delivers comprehensive retrieval whether users search for conceptual information or specific technical parameters. For example, for a query like “recent gas shows in Permian Basin wells,” the system will:

Use dense vector search to understand the concept of “gas shows”
Use sparse vector search to find exact matches for “Permian Basin”
Expand the query to include related terms like “hydrocarbon indicators”
Apply temporal filtering to focus on recent reports
Use spatial awareness to limit results to the Permian Basin area

Temporal and spatial awareness
Drilling operations are inherently tied to specific locations and time periods, making context crucial for accurate information retrieval. The system incorporates understanding of well locations and operational timelines, allowing for queries that consider geographical and chronological contexts. For example, searching for “recent gas shows in Permian Basin wells” uses both temporal filtering and spatial awareness to provide relevant, location-specific results. This optimization makes sure retrieved information matches the operational context of the user’s needs. For example, when generating a response about drilling fluid properties, the system will:

Retrieve relevant information from multiple sources
Cross-check numerical values for consistency
Use reflective prompting to make sure critical parameters are addressed
Apply the reranking model to prioritize the most relevant and accurate information
Present the response along with confidence scores and source citations

Reflective response generation
Technical accuracy is paramount in oil and gas operations, where incorrect information can have serious consequences. The system implements reflective prompting mechanisms that prompt the language model to critically evaluate its own responses against source documents and industry standards. Response reranking uses scoring models that evaluate technical accuracy, contextual relevance, and adherence to industry best practices. This multi-layered validation approach makes sure generated responses meet the high accuracy standards required for technical decision-making in drilling operations.
Advanced RAG strategies
To further enhance our system’s capabilities, we implemented several advanced RAG strategies:

Hypothetical document embeddings:

Generates synthetic questions based on document content
Creates embeddings for these hypothetical questions
Improves retrieval for complex, multi-part queries
Particularly effective for handling what-if scenarios in drilling operations

Recursive retrieval:

Implements multi-hop information gathering
Allows the system to follow chains of related information across multiple documents
Essential for answering complex queries that require synthesizing information from various sources

Semantic routing:

Intelligently routes queries to appropriate knowledge bases or document subsets
Optimizes search efficiency by focusing on the most relevant data sources
Crucial for handling the diverse types of documents in oil and gas operations

Query transformation:

Automatically refines and reformulates user queries for optimal retrieval
Applies industry-specific knowledge to interpret ambiguous terms
Breaks down complex queries into series of simpler, more targeted searches

For example, for a complex query like “Compare the production decline rates of horizontal wells in the Eagle Ford to those in the Bakken over the last 5 years,” the system will:

Use hypothetical document embeddings to generate relevant sub-questions about decline rates, horizontal wells, and specific formations
Apply recursive retrieval to gather data from production reports, geological surveys, and economic analyses
Route different aspects of the query to appropriate knowledge bases (such as separate databases for Eagle Ford and Bakken data)
Transform the query into a series of more specific searches, considering factors like well completion techniques and reservoir characteristics

Business outcome
The implementation of this advanced RAG solution has delivered significant business value for oil and gas operations:

Operational efficiency – Significant reduction in decision-making time for drilling and field engineers
Cost optimization – 40–50% decrease in manual document processing costs through automated information extraction
Enhanced productivity – Field engineers and geologists spend 60% less time searching for technical information, focusing instead on high-value analysis
Risk mitigation – Consistent 92% retrieval accuracy provides reliable access to critical technical knowledge, reducing operational decision risks

Conclusion
Our journey in developing this advanced RAG solution for the oil and gas industry demonstrates the power of combining AI techniques with domain-specific knowledge. By addressing the unique challenges of technical documentation in this field, we have created a system that not only retrieves information but understands and synthesizes it in a way that adds real value to operations. Amazon Bedrock is at the center of this solution, with Amazon Q Developer for the application frontend and backend development, and capabilities from Infosys Topaz – an AI-first offering that accelerates business value for enterprises using generative AI. We see significant potential for further advancements in this area, such as integration with real-time sensor data for dynamic information retrieval, enhanced visualization capabilities for complex geological and engineering data, and predictive analytics by combining historical retrieval patterns with operational data.
For more information on Amazon Bedrock and the latest Amazon Nova models, refer to the Amazon Bedrock User Guide and Amazon Nova User Guide.

About the Authors
Dhiraj Thakur is a Solutions Architect with Amazon Web Services, specializing in Generative AI and data analytics domains. He works with AWS customers and partners to architect and implement scalable analytics platforms and AI-driven solutions. With deep expertise in Generative AI services and implementation, end-to-end machine learning implementation, and cloud-native data architectures, he helps organizations harness the power of GenAI and analytics to drive business transformation. He can be reached via LinkedIn.
Meenakshi Venkatesan is a Principal Consultant at Infosys and a part of the AWS partnerships team at Infosys Topaz CoE. She helps in designing, developing, and deploying in AWS environments and has interests in exploring the new offerings and services.
Keerthi Prasad is a Senior Technology Architect at Infosys and a part of the AWS partnerships team at Infosys Topaz CoE. He provides guidance and assistance to customers in building various solutions in the AWS Cloud. He also supports AWS partners and customers in their generative AI adoption journey.
Suman Debnath is an Associate Principal at Infosys and a part of Infosys Topaz delivery. He has played multiple roles, such as architect, program manager, and data scientist, building scalable enterprise systems and AI/ML and generative AI applications on the cloud for oil and gas, healthcare, and financial clients.
Ganesh is an Enterprise Architect and Data Scientist at Infosys and part of Topaz Delivery. He has a master’s degree in computer science and machine learning. He has played multiple roles, such as architect, program manager, and data scientist, building scalable enterprise systems.
Yash Sharma is a Digital Specialist Engineer with Infosys and part of the AWS team at ICETS with a passion for emerging generative AI services. He has successfully led and contributed to numerous generative AI projects. He is always eager to expand his knowledge and stay ahead of industry trends, bringing the latest insights and techniques to work.
Karthikeyan Senthilkumar is a Senior Systems Engineer at Infosys and a part of the AWSCOE at iCETS. He specializes in AWS services with a focus on emerging technologies.

Master Vibe Coding: Pros, Cons, and Best Practices for Data Engineers

Large-language-model (LLM) tools now let engineers describe pipeline goals in plain English and receive generated code—a workflow dubbed vibe coding. Used well, it can accelerate prototyping and documentation. Used carelessly, it can introduce silent data corruption, security risks, or unmaintainable code. This article explains where vibe coding genuinely helps and where traditional engineering discipline remains indispensable, focusing on five pillars: data pipelines, DAG orchestration, idempotence, data-quality tests, and DQ checks in CI/CD.

1) Data Pipelines: Fast Scaffolds, Slow Production

LLM assistants excel at scaffolding: generating boiler-plate ETL scripts, basic SQL, or infrastructure-as-code templates that would otherwise take hours. Still, engineers must:

Review for logic holes—e.g., off-by-one date filters or hard-coded credentials frequently appear in generated code.

Refactor to project standards (naming, error handling, logging). Unedited AI output often violates style guides and DRY (don’t-repeat-yourself) principles, raising technical debt.

Integrate tests before merging. A/B comparisons show LLM-built pipelines fail CI checks ~25% more often than hand-written equivalents until manually fixed.

When to use vibe coding

Green-field prototypes, hack-days, early POCs.

Document generation—auto-extracted SQL lineage saved 30-50% doc time in a Google Cloud internal study.

When to avoid it

Mission-critical ingestion—financial or medical feeds with strict SLAs.

Regulated environments where generated code lacks audit evidence.

2) DAGs: AI-Generated Graphs Need Human Guardrails

A directed acyclic graph (DAG) defines task dependencies so steps run in the right order without cycles. LLM tools can infer DAGs from schema descriptions, saving setup time. Yet common failure modes include:

Incorrect parallelization (missing upstream constraints).

Over-granular tasks creating scheduler overhead.

Hidden circular refs when code is regenerated after schema drift.

Mitigation: export the AI-generated DAG to code (Airflow, Dagster, Prefect), run static validation, and peer-review before deployment. Treat the LLM as a junior engineer whose work always needs code review.
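
One inexpensive static check for teams on Airflow is to load the generated DAG files with DagBag in CI and fail the build on import errors; Airflow's DAG bagging also runs a cycle check, so accidental circular dependencies surface the same way. A minimal sketch, with the DAG folder path and test framework as assumptions:

from airflow.models import DagBag

def test_generated_dags_import_cleanly():
    # Load every DAG file the AI assistant produced; example DAGs are excluded.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    # Import errors include syntax problems and cycle-check failures.
    assert not dag_bag.import_errors, f"Broken DAGs: {dag_bag.import_errors}"
    assert len(dag_bag.dags) > 0, "No DAGs were loaded from dags/"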

3) Idempotence: Reliability Over Speed

Idempotent steps produce identical results even when retried. AI tools can add naïve “DELETE-then-INSERT” logic, which looks idempotent but degrades performance and can break downstream FK constraints. Verified patterns include:

UPSERT / MERGE keyed on natural or surrogate IDs.

Checkpoint files in cloud storage to mark processed offsets (good for streams).

Hash-based deduplication for blob ingestion.

Engineers must still design the state model; LLMs often skip edge cases like late-arriving data or daylight-saving anomalies.
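
As an illustration of the first pattern above, a MERGE keyed on a stable ID stays idempotent across retries because re-running it updates the same rows instead of inserting duplicates. The table and column names below are hypothetical, and the statement assumes a SQL engine that supports MERGE:

# Idempotent load: MERGE keyed on a natural ID, safe to retry.
# `conn` is assumed to be a DB-API connection to a warehouse with MERGE support.
MERGE_SQL = """
MERGE INTO analytics.orders AS target
USING staging.orders_batch AS source
  ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
  status = source.status,
  amount = source.amount,
  updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, amount, updated_at)
  VALUES (source.order_id, source.status, source.amount, source.updated_at);
"""

def load_batch(conn):
    with conn.cursor() as cur:
        cur.execute(MERGE_SQL)
    conn.commit()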

4) Data-Quality Tests: Trust, but Verify

LLMs can suggest sensors (metric collectors) and rules (thresholds) automatically—for example, “row_count ≥ 10 000” or “null_ratio < 1%”. This is useful for coverage, surfacing checks humans forget. Problems arise when:

Thresholds are arbitrary. AI tends to pick round numbers with no statistical basis.

Generated queries don’t leverage partitions, causing warehouse cost spikes.

Best practice:

Let the LLM draft checks.

Validate thresholds with historical distributions.

Commit checks to version control so they evolve with schema.
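
A simple way to apply the second point is to derive thresholds from the metric's recent history instead of guessing, for example flagging a load whose row count falls more than three standard deviations below the trailing mean. A minimal sketch; the history values are placeholders:

import statistics

def derive_min_row_count(daily_row_counts, n_sigmas=3):
    """Set the minimum acceptable row count from recent history rather than a round number."""
    mean = statistics.mean(daily_row_counts)
    stdev = statistics.stdev(daily_row_counts)
    return max(0, mean - n_sigmas * stdev)

# Example: the last week of row counts pulled from warehouse metadata (placeholder values)
history = [10_250, 10_310, 9_980, 10_120, 10_400, 10_050, 10_275]
min_rows = derive_min_row_count(history)
todays_rows = 10_180
assert todays_rows >= min_rows, f"Row count {todays_rows} is below threshold {min_rows:.0f}"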

5) DQ Checks in CI/CD: Shift-Left, Not Ship-And-Pray

Modern teams embed DQ tests in pull-request pipelines—shift-left testing—to catch issues before production. Vibe coding aids by:

Autogenerating unit tests for dbt models (e.g., expect_column_values_to_not_be_null).

Producing documentation snippets (YAML or Markdown) for each test.

But you still need:

A go/no-go policy: what severity blocks deployment?

Alert routing: AI can draft Slack hooks, but on-call playbooks must be human-defined.
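
The go/no-go policy itself can live in code so CI enforces it consistently; the severity levels and check results below are purely illustrative:

# Illustrative go/no-go gate: only high-severity failures block the deployment.
BLOCKING_SEVERITIES = {"critical", "high"}

def blocking_failures(check_results):
    """check_results: list of dicts like {"name": ..., "passed": bool, "severity": ...}."""
    return [
        check for check in check_results
        if not check["passed"] and check["severity"] in BLOCKING_SEVERITIES
    ]

failures = blocking_failures([
    {"name": "orders_id_not_null", "passed": False, "severity": "critical"},
    {"name": "row_count_drift", "passed": False, "severity": "low"},
])
if failures:
    raise SystemExit(f"Blocking DQ failures: {[f['name'] for f in failures]}")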

Controversies and Limitations

Over-hype: Independent studies call vibe coding “over-promised” and advise confinement to sandbox stages until maturity.

Debugging debt: Generated code often includes opaque helper functions; when they break, root-cause analysis can exceed hand-coded time savings.

Security gaps: Secret handling is frequently missing or incorrect, creating compliance risks, especially for HIPAA/PCI data.

Governance: Current AI assistants do not auto-tag PII or propagate data-classification labels, so data governance teams must retrofit policies.

Practical Adoption Roadmap

Pilot phase – Restrict AI agents to dev repos; measure success on time saved versus bug tickets opened.

Review and harden – Add linting, static analysis, and schema diff checks that block merges when AI output violates rules; implement idempotence tests by rerunning the pipeline in staging and asserting output equality hashes.

Gradual production roll-out – Start with non-critical feeds (analytics backfills, A/B logs); monitor cost, since LLM-generated SQL can be less efficient, doubling warehouse minutes until optimized.

Education – Train engineers on AI prompt design and manual override patterns; share failures openly to refine guardrails.

Key Takeaways

Vibe coding is a productivity booster, not a silver bullet. Use it for rapid prototyping and documentation, but pair with rigorous reviews before production.

Foundational practices—DAG discipline, idempotence, and DQ checks—remain unchanged. LLMs can draft them, but engineers must enforce correctness, cost-efficiency, and governance.

Successful teams treat the AI assistant like a capable intern: speed up the boring parts, double-check the rest.

By blending vibe coding’s strengths with established engineering rigor, you can accelerate delivery while protecting data integrity and stakeholder trust.

Qwen Team Introduces Qwen-Image-Edit: The Image Editing Version of Qwen-Image with Advanced Capabilities for Semantic and Appearance Editing

In the domain of multimodal AI, instruction-based image editing models are transforming how users interact with visual content. Just released in August 2025 by Alibaba’s Qwen Team, Qwen-Image-Edit builds on the 20B-parameter Qwen-Image foundation to deliver advanced editing capabilities. This model excels in semantic editing (e.g., style transfer and novel view synthesis) and appearance editing (e.g., precise object modifications), while preserving Qwen-Image’s strength in complex text rendering for both English and Chinese. Integrated with Qwen Chat and available via Hugging Face, it lowers barriers for professional content creation, from IP design to error correction in generated artwork.

Architecture and Key Innovations

Qwen-Image-Edit extends the Multimodal Diffusion Transformer (MMDiT) architecture of Qwen-Image, which comprises a Qwen2.5-VL multimodal large language model (MLLM) for text conditioning, a Variational AutoEncoder (VAE) for image tokenization, and the MMDiT backbone for joint modeling. For editing, it introduces dual encoding: the input image is processed by Qwen2.5-VL for high-level semantic features and the VAE for low-level reconstructive details, concatenated in the MMDiT’s image stream. This enables balanced semantic coherence (e.g., maintaining object identity during pose changes) and visual fidelity (e.g., preserving unmodified regions).

The Multimodal Scalable RoPE (MSRoPE) positional encoding is augmented with a frame dimension to differentiate pre- and post-edit images, supporting tasks like text-image-to-image (TI2I) editing. The VAE, fine-tuned on text-rich data, achieves superior reconstruction with 33.42 PSNR on general images and 36.63 on text-heavy ones, outperforming FLUX-VAE and SD-3.5-VAE. These enhancements allow Qwen-Image-Edit to handle bilingual text edits while retaining original font, size, and style.

Key Features of Qwen-Image-Edit

Semantic and Appearance Editing: Supports low-level visual appearance editing (e.g., adding, removing, or modifying elements while keeping other regions unchanged) and high-level visual semantic editing (e.g., IP creation, object rotation, and style transfer, allowing pixel changes with semantic consistency).

Precise Text Editing: Enables bilingual (Chinese and English) text editing, including direct addition, deletion, and modification of text in images, while preserving the original font, size, and style.

Strong Benchmark Performance: Achieves state-of-the-art results on multiple public benchmarks for image editing tasks, positioning it as a robust foundation model for generation and manipulation.

Training and Data Pipeline

Leveraging Qwen-Image’s curated dataset of billions of image-text pairs across Nature (55%), Design (27%), People (13%), and Synthetic (5%) domains, Qwen-Image-Edit employs a multi-task training paradigm unifying T2I, I2I, and TI2I objectives. A seven-stage filtering pipeline refines data for quality and balance, incorporating synthetic text rendering strategies (Pure, Compositional, Complex) to address long-tail issues in Chinese characters.

Training uses flow matching with a Producer-Consumer framework for scalability, followed by supervised fine-tuning and reinforcement learning (DPO and GRPO) for preference alignment. For editing-specific tasks, it integrates novel view synthesis and depth estimation, using DepthPro as a teacher model. This results in robust performance, such as correcting calligraphy errors through chained edits.

Advanced Editing Capabilities

Qwen-Image-Edit shines in semantic editing, enabling IP creation like generating MBTI-themed emojis from a mascot (e.g., Capybara) while preserving character consistency. It supports 180-degree novel view synthesis, rotating objects or scenes with high fidelity, achieving 15.11 PSNR on GSO—surpassing specialized models like CRM. Style transfer transforms portraits into artistic forms, such as Studio Ghibli, maintaining semantic integrity.

For appearance editing, it adds elements like signboards with realistic reflections or removes fine details like hair strands without altering surroundings. Bilingual text editing is precise: changing “Hope” to “Qwen” on posters or correcting Chinese characters in calligraphy via bounding boxes. Chained editing allows iterative corrections, e.g., fixing “稽” step-by-step until accurate.

Benchmark Results and Evaluations

Qwen-Image-Edit leads editing benchmarks, scoring 7.56 overall on GEdit-Bench-EN and 7.52 on CN, outperforming GPT Image 1 (7.53 EN, 7.30 CN) and FLUX.1 Kontext [Pro] (6.56 EN, 1.23 CN). On ImgEdit, it achieves 4.27 overall, excelling in tasks like object replacement (4.66) and style changes (4.81). Depth estimation yields 0.078 AbsRel on KITTI, competitive with DepthAnything v2.

Human evaluations on AI Arena position its base model third among APIs, with strong text rendering advantages. These metrics highlight its superiority in instruction-following and multilingual fidelity.

Deployment and Practical Usage

Qwen-Image-Edit is deployable via Hugging Face Diffusers:

from diffusers import QwenImageEditPipeline
import torch
from PIL import Image

# Load the pipeline and move it to the GPU in bfloat16
pipeline = QwenImageEditPipeline.from_pretrained("Qwen/Qwen-Image-Edit")
pipeline.to(torch.bfloat16).to("cuda")

image = Image.open("input.png").convert("RGB")
prompt = "Change the rabbit's color to purple, with a flash light background."
# The pipeline returns a list of images; save the first result
output = pipeline(image=image, prompt=prompt, num_inference_steps=50, true_cfg_scale=4.0).images[0]
output.save("output.png")

Alibaba Cloud’s Model Studio offers API access for scalable inference. Licensed under Apache 2.0, the GitHub repository provides training code.

Future Implications

Qwen-Image-Edit advances vision-language interfaces, enabling seamless content manipulation for creators. Its unified approach to understanding and generation suggests potential extensions to video and 3D, fostering innovative applications in AI-driven design.


Creating Dashboards Using Vizro MCP: Vizro is an Open-Source Python Toolkit by McKinsey

Vizro is an open-source Python toolkit by McKinsey that makes it easy to build beautiful, production-ready data visualization apps. With just a few lines of configuration (via JSON, YAML, or Python dictionaries), you can create multi-page dashboards that would normally take thousands of lines of code.

Built on top of Plotly, Dash, and Pydantic, Vizro combines the flexibility of open source with in-built best practices for design and scalability. It’s quick to learn, customizable for advanced users, and powerful enough to move from prototype to production seamlessly.

In this tutorial, we’ll use the Vizro MCP server to create a dashboard directly from Claude Desktop.

Setting up the dependencies

uv package manager

To run the Vizro server, we will need the uv package manager. Install it with the following commands:

For Mac/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

For Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Once uv is downloaded, run the following command to get the location of uvx

For Mac/Linux:

which uvx

For Windows:

where uvx

Keep the location of uvx handy; we will need it for the Claude configuration file.

Claude Desktop

You can download Claude Desktop from https://claude.ai/download. Next, open the claude_desktop_config.json file located in the Claude installation directory using any text editor. If the file doesn’t exist, you can create it manually. Once opened, enter the following code:

Mac/Linux:

{
  "mcpServers": {
    "vizro-mcp": {
      "command": "/placeholder-path/uvx",
      "args": [
        "vizro-mcp"
      ]
    }
  }
}

Windows:

{
  "mcpServers": {
    "vizro-mcp": {
      "command": "placeholder-path/uvx",
      "args": [
        "vizro-mcp"
      ]
    }
  }
}

Replace placeholder-path with the location of uvx you noted earlier.

Running the Server

Once the file is configured, Vizro MCP Server should be visible in the list of servers.

Vizro comes with some sample datasets as well. You can try the following prompt to get started:

“create a vizro dashboard using tips dataset”

Claude will use the vizro-mcp to generate the dashboard and open it in your browser via PyCafe, showcasing interactive charts like tip vs total bill, average tips by day, tip distribution by gender, and tips by party size, along with filters for day, gender, and smoker status for seamless cross-filtering analysis.
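
For reference, the Python that the MCP server produces corresponds to a standard Vizro configuration along the lines of the sketch below, which builds a single page over the bundled tips dataset with two charts and two filters; the specific chart and filter choices here are illustrative rather than the exact output Claude generates:

import vizro.models as vm
import vizro.plotly.express as px
from vizro import Vizro

df = px.data.tips()  # sample tips dataset bundled with Plotly Express

page = vm.Page(
    title="Tips Analysis",
    components=[
        vm.Graph(figure=px.scatter(df, x="total_bill", y="tip", color="day")),
        vm.Graph(figure=px.histogram(df, x="day", y="tip", histfunc="avg")),
    ],
    controls=[
        vm.Filter(column="day"),
        vm.Filter(column="smoker"),
    ],
)

dashboard = vm.Dashboard(pages=[page])
Vizro().build(dashboard).run()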


Create a travel planning agentic workflow with Amazon Nova

Traveling is enjoyable, but travel planning can be complex to navigate and a hassle. Travelers must book accommodations, plan activities, and arrange local transportation. All these decisions can feel overwhelming. Although travel professionals have long helped manage these complexities, recent breakthroughs in generative AI have made something entirely new possible—intelligent assistants that can understand natural conversation, access real-time data, and directly interface with booking systems and travel tools. Agentic workflows, which use large language models (LLMs) with access to external tools, are particularly promising for simplifying dynamic, multi-step processes like travel planning.
In this post, we explore how to build a travel planning solution using AI agents. The agent uses Amazon Nova, which offers an optimal balance of performance and cost compared to other commercial LLMs. By combining accurate but cost-efficient Amazon Nova models with LangGraph orchestration capabilities, we create a practical travel assistant that can handle complex planning tasks while keeping operational costs manageable for production deployments.
Solution overview
Our solution is built on a serverless AWS Lambda architecture using Docker containers and implements a comprehensive three-layer approach: frontend interaction, core processing, and integration services. In the core processing layer, we use LangGraph, a stateful orchestration framework, to create a sophisticated yet flexible agent-based system that manages the complex interactions required for travel planning.
The core of our system is a graph architecture where components (nodes) handle distinct aspects of travel planning, with the router node orchestrating the flow of information between them. We use Amazon Nova, a new generation of state-of-the-art foundation models (FMs) available exclusively on Amazon Bedrock that delivers frontier intelligence with industry-leading price-performance. The router node uses an LLM to analyze each user query and, with access to the description of our 14 action nodes, decides which ones need to be executed. The action nodes, each with their own LLM chain, powered by either Amazon Nova Pro or Amazon Nova Lite models, manage various functions, including web research, personalized recommendations, weather lookups, product searches, and shopping cart management.
We use Amazon Nova Lite for the router and simpler action nodes; it can handle query analysis and basic content generation with its lightning-fast processing while maintaining strong accuracy at a low cost. Five complex nodes use Amazon Nova Pro for tasks requiring advanced instruction following and multi-step operations, such as detailed travel planning and recommendations. Both models support a 300,000-token context window and can process text, image, and video inputs. The models support text processing across more than 200 languages, helping our travel assistant serve a global audience. The integration layer unifies multiple data sources and services behind a single interface:

Amazon Product Advertising API for travel-related product recommendations
Google Custom Search API for real-time travel information
OpenWeather API for accurate weather forecasts
Amazon Bedrock Knowledge Bases for travel destination insights
Amazon DynamoDB for persistent storage of user profiles and chat history

These integrations serve as examples, and the architecture is designed to be extensible, so organizations can quickly incorporate their own APIs and data sources based on specific requirements.
The agent keeps track of conversation state using AgentState, a Python TypedDict that declares the expected keys and their types, helping catch data errors early. It stores the information we need to know about each user's session: conversation history, profile information, processing status, and final outputs. This ensures the different action nodes can access and update session information reliably.
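A simplified version of that state definition might look like the following sketch; the field names are illustrative and may differ from the repository's actual AgentState:

# Illustrative AgentState sketch; actual fields in the repository may differ.
from typing import TypedDict
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
    messages: list[BaseMessage]   # conversation history for the session
    user_profile: dict            # preferences and upcoming trips from DynamoDB
    status: str                   # processing status for the current turn
    final_response: str           # text returned to the chat interface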
The following diagram illustrates the solution architecture.

The travel assistant processes user interactions from end to end:

Users interact with a React.js web application through a chat interface.
Their requests are authenticated using Amazon Cognito and routed through Amazon API Gateway.
Authenticated requests are sent to our backend Lambda functions, which host the core agent workflow.
API credentials are securely stored using AWS Secrets Manager, following best practices to make sure these sensitive keys are never exposed in code or configuration files, with appropriate access controls and rotation policies implemented.
The Travel Assistant Agent itself consists of several interconnected components. At the center, the agent router analyzes incoming queries and orchestrates the workflow.
The agent maintains state through three DynamoDB tables that store conversation history, shopping wishlists, and user profiles, making sure context is preserved across interactions.
For travel-specific knowledge, the system uses a combination of Amazon Bedrock Knowledge Bases, Amazon OpenSearch Serverless, and a document store in Amazon Simple Storage Service (Amazon S3). These components work together to provide accurate, relevant travel information when needed.
The agent’s action nodes handle specialized tasks by combining LLM chains with external APIs. When users need product recommendations, the system connects to the Amazon Product Advertising API. For general travel information, it uses the Google Custom Search API, and for weather-related queries, it consults the OpenWeather API. API credentials are securely managed through Secrets Manager.
The system formulates comprehensive responses based on collected information, and the final responses are returned to the user through the chat interface.

This architecture supports both simple queries that can be handled by a single node and complex multi-step interactions that require coordination across multiple components. The system can scale horizontally, and new capabilities can be added by introducing additional action nodes and API integrations.
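As a concrete illustration of the action-node pattern described above (an external API paired with credentials pulled from Secrets Manager), a weather lookup node might look roughly like this. The function is a hypothetical sketch, not code from the repository; it assumes the openweather_maps_keys secret created in a later step and the requests library being available in the Lambda environment:

# Hypothetical weather action node: reads the OpenWeather API key from
# Secrets Manager and calls the current-weather endpoint.
import json
import boto3
import requests

def get_openweather_key() -> str:
    secrets = boto3.client("secretsmanager")
    value = secrets.get_secret_value(SecretId="openweather_maps_keys")
    return json.loads(value["SecretString"])["openweather_key"]

def weather_node(state: dict) -> dict:
    city = state.get("destination", "Seattle")  # illustrative default
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"q": city, "appid": get_openweather_key(), "units": "metric"},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    state["weather"] = f"{data['weather'][0]['description']}, {data['main']['temp']}°C"
    return state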
You can deploy this solution using the AWS Cloud Development Kit (AWS CDK), which generates an AWS CloudFormation template that handles the necessary resources, including Lambda functions, DynamoDB tables, and API configurations. The deployment creates the required AWS resources and outputs the API endpoint URL for your frontend application.
Prerequisites
For this walkthrough, you must have the following prerequisites:

An active AWS account and familiarity with FMs, Amazon Bedrock, and Amazon OpenSearch Service
Access to the Amazon Nova FMs on Amazon Bedrock
Node.js v16.x or later
Python 3.9 or later
Access to the Product Advertising API (PAAPI)

Clone the repository
Start by cloning the GitHub repository containing the solution files:

git clone https://github.com/aws-samples/sample-travel-assistant-agent.git

Obtain API keys
The solution requires API keys from three services to enable its core functionalities:

OpenWeather API – Create a Free Access account at OpenWeather to obtain your API key. The free tier (60 calls per minute) is sufficient for testing and development.
Google Custom Search API – Set up the search functionality through Google Cloud Console. Create or select a project and enable the Custom Search API. Then, generate an API key from the credentials section. Create a search engine at Programmable Search and note your Search Engine ID. The free tier includes 100 queries per day.
(Optional) Amazon Product Advertising API (PAAPI) – If you want to enable product recommendations, access the PAAPI Documentation Portal to generate your API keys. You will receive both a public key and a secret key. You must have an Amazon Associates account to access these credentials. If you’re new to the Amazon Associates Program, complete the application process first. Skip this step if you don’t want to use PAAPI features.

Add API keys to Secrets Manager
Before deploying the solution, you must securely store your API keys in Secrets Manager. Create the following secrets, each with the JSON structure shown below its name. For instructions to create a secret, refer to Create an AWS Secrets Manager secret.

openweather_maps_keys
{"openweather_key": "YOUR_API_KEY"}

google_search_keys
{"cse_id": "YOUR_SEARCH_ENGINE_ID", "google_api_key": "YOUR_API_KEY"}

paapi_keys
{"paapi_public": "YOUR_PUBLIC_KEY", "paapi_secret": "YOUR_SECRET_KEY"}
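If you prefer to script this step instead of using the console, a small boto3 sketch like the following can create the three secrets (values are placeholders; it uses your default AWS credentials and Region):

# Create the three secrets programmatically with placeholder values.
import json
import boto3

secrets = boto3.client("secretsmanager")

for name, value in {
    "openweather_maps_keys": {"openweather_key": "YOUR_API_KEY"},
    "google_search_keys": {"cse_id": "YOUR_SEARCH_ENGINE_ID", "google_api_key": "YOUR_API_KEY"},
    "paapi_keys": {"paapi_public": "YOUR_PUBLIC_KEY", "paapi_secret": "YOUR_SECRET_KEY"},
}.items():
    secrets.create_secret(Name=name, SecretString=json.dumps(value))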

Configure environment variables
Create a .env file in the project root with your configuration:

STACK_NAME=TravelAssistantAgent

# Optional: Create Bedrock Knowledge Base with documents
KB_DOCS_PATH=Path/to/your/documents/folder
# Optional: Enable/disable Product Search features with PAAPI
USE_PAAPI=false

Deploy the stack
If this is your first time using the AWS CDK in your AWS account and AWS Region, bootstrap your environment:

cdk bootstrap

Deploy the solution using the provided script, which creates the required AWS resources, including Lambda functions, DynamoDB tables, and API configurations:

sh deploy.sh

Access your application
When the deployment is complete, open the AWS CloudFormation console and navigate to your stack. On the Outputs tab, note the following values:

WebAppDomain – Your application’s URL
UserPoolId – Required for user management
UserPoolClientId – Used for authentication

Create an Amazon Cognito user
Complete the following steps to create an Amazon Cognito user:

On the Amazon Cognito console, choose User pools in the navigation pane.
Choose your user pool.
Choose Users in the navigation pane, then choose Create user.
For Email address, enter an email address, and select Mark email address as verified.
For Password, enter a temporary password.
Choose Create user.

You can use these credentials to access your application at the WebAppDomain URL.
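If you prefer scripting over the console, a boto3 equivalent looks like the following sketch (values are placeholders; use the UserPoolId from the stack outputs):

# Create a Cognito user programmatically; values are placeholders.
import boto3

cognito = boto3.client("cognito-idp")
cognito.admin_create_user(
    UserPoolId="YOUR_USER_POOL_ID",        # UserPoolId from the stack outputs
    Username="traveler@example.com",
    TemporaryPassword="TempPassw0rd!",
    UserAttributes=[
        {"Name": "email", "Value": "traveler@example.com"},
        {"Name": "email_verified", "Value": "true"},
    ],
    MessageAction="SUPPRESS",              # don't send the invitation email
)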
Test the solution
To test the agent’s capabilities, we created a business traveler persona and simulated a typical travel planning conversation flow, focusing on routing, function calling accuracy, response quality, and latency metrics. The agent’s routing system directs user questions to the appropriate specialized node (for example, searching for accommodations, checking weather conditions, or suggesting travel products). Throughout the conversation, the agent maintains the context of previously discussed details, so it can build on earlier responses while providing relevant new information. For example, after discussing a travel destination, the agent can naturally incorporate it into subsequent weather and packing list recommendations.
The following screenshots demonstrate the end-user experience, while the underlying API interactions are handled seamlessly on the backend. The complete implementation details, including Lambda function code and API integration patterns, are available in our GitHub repository.

The solution demonstrates personalization capabilities using sample user profiles stored in DynamoDB, containing upcoming trips and travel preferences. In production deployments, these profiles can be integrated with existing customer databases and reservation systems to provide personalized assistance.

The product recommendations shown are live links to actual items available on Amazon.com, so the user can explore or purchase these products directly. The user can choose a link to check out the product, or choose Add to Amazon Cart to see the items in their shopping cart.

Clean up
After you are done experimenting with the travel assistant, locate the stack on the AWS CloudFormation console and delete it. This removes the resources the deployment created.
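You can also delete the stack programmatically; here is a minimal boto3 sketch, assuming the default stack name from the .env file shown earlier:

# Delete the CloudFormation stack and wait for deletion to complete.
import boto3

cfn = boto3.client("cloudformation")
cfn.delete_stack(StackName="TravelAssistantAgent")
cfn.get_waiter("stack_delete_complete").wait(StackName="TravelAssistantAgent")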
Conclusion
Our travel planning assistant agent demonstrates a practical application built with Amazon Nova and LangGraph for solving real-world business challenges. The system streamlines complex travel planning while naturally integrating product recommendations through specialized processing nodes and real-time data integration. Amazon Nova Lite models showed reasonable performance at task orchestration, and Amazon Nova Pro performed well for more complex function calling operations. Looking ahead, this framework could be extended with more dynamic orchestration approaches such as ReAct. To build your own implementation, explore our code samples in the GitHub repository.
For those looking to deepen their understanding of LLM-powered agents, AWS provides extensive resources on building intelligent systems. The Amazon Bedrock Agents documentation offers insights into automating multistep tasks with FMs, and the Amazon Bedrock Agent Samples GitHub repo provides guidance for implementing multi-agent applications using Amazon Bedrock.

About the authors
Isaac Privitera is a Principal Data Scientist with the AWS Generative AI Innovation Center, where he develops bespoke generative AI-based solutions to address customers’ business problems. His primary focus lies in building responsible AI systems, using techniques such as RAG, multi-agent systems, and model fine-tuning. When not immersed in the world of AI, Isaac can be found on the golf course, enjoying a football game, or hiking trails with his loyal canine companion, Barry.
Ryan Razkenari is a Deep Learning Architect at the AWS Generative AI Innovation Center, where he uses his expertise to create cutting-edge AI solutions. With a strong background in AI and analytics, he is passionate about building innovative technologies that address real-world challenges for AWS customers.
Sungmin Hong is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he helps expedite a variety of use cases for AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds a PhD in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading, and cooking.