Microsoft Releases Phi-4-mini-Flash-Reasoning: Efficient Long-Context Reasoning with Compact Architecture

Phi-4-mini-Flash-Reasoning, the latest addition to Microsoft’s Phi-4 model family, is an open, lightweight language model designed to excel at long-context reasoning while maintaining high inference efficiency. Released on Hugging Face, this 3.8B parameter model is a distilled version of Phi-4-mini, fine-tuned for dense reasoning tasks like math problem solving and multi-hop question answering. Built using Microsoft’s new SambaY decoder-hybrid-decoder architecture, it achieves state-of-the-art performance among compact models and operates up to 10× faster than its predecessor on long-generation tasks.

Architecture: Gated Memory Meets Hybrid Decoding

At the core of Phi-4-mini-Flash-Reasoning is the SambaY architecture, a novel decoder-hybrid-decoder model that integrates State Space Models (SSMs) with attention layers using a lightweight mechanism called the Gated Memory Unit (GMU). This structure enables efficient memory sharing between layers, significantly reducing inference latency in long-context and long-generation scenarios.

Unlike Transformer-based architectures that rely heavily on memory-intensive attention computations, SambaY leverages Samba (a hybrid SSM architecture) in the self-decoder and replaces roughly half of the cross-attention layers in the cross-decoder with GMUs. GMUs serve as cheap, element-wise gating functions that reuse the hidden state from the final SSM layer, thereby avoiding redundant computation. This results in a linear-time prefill complexity and lower decoding I/O, yielding substantial speedups during inference.
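To make the gating idea concrete, here is a minimal, illustrative PyTorch sketch of a GMU-style element-wise gate. The layer shapes, the sigmoid nonlinearity, and the projection names are assumptions for illustration only, not Microsoft's implementation.

import torch
import torch.nn as nn

class GatedMemoryUnit(nn.Module):
    """Illustrative GMU-style gate: mixes the current hidden state with a memory
    state reused from an earlier SSM layer via cheap element-wise gating."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)  # assumed projection
        self.out_proj = nn.Linear(d_model, d_model)   # assumed projection

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Element-wise gating costs O(d) per token and never attends over past tokens,
        # which is why replacing cross-attention with GMUs reduces decoding I/O.
        gate = torch.sigmoid(self.gate_proj(hidden))
        return self.out_proj(gate * memory)

# One token's hidden state gated against the reused SSM memory
gmu = GatedMemoryUnit(d_model=64)
h = torch.randn(1, 64)  # current-layer hidden state
m = torch.randn(1, 64)  # hidden state shared from the final SSM layer
print(gmu(h, m).shape)  # torch.Size([1, 64])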

Training Pipeline and Reasoning Capabilities

The Phi-4-mini-Flash model is pre-trained on 5T tokens of high-quality synthetic and filtered real data, consistent with the rest of the Phi-4-mini family. After pretraining, it undergoes multi-stage supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on reasoning-focused instruction datasets. Notably, unlike Phi-4-mini-Reasoning, its training pipeline excludes reinforcement learning from human feedback (RLHF) entirely.

Despite this, Phi-4-mini-Flash-Reasoning outperforms Phi-4-mini-Reasoning on a suite of complex reasoning tasks. On the Math500 benchmark, it achieves a pass@1 accuracy of 92.45%, outperforming Phi-4-mini-Reasoning (91.2%) and surpassing other open models like Qwen-1.5B and Bespoke-Stratos-7B. On AIME24/25, it shows strong gains as well, with over 52% accuracy on AIME24.

This performance leap is attributed to the architecture’s capacity for long Chain-of-Thought (CoT) generation. With 64K context length support and optimized inference under the vLLM framework, the model can generate and reason across multi-thousand-token contexts without bottlenecks. In latency benchmarks with 2K-token prompts and 32K-token generations, Phi-4-mini-Flash-Reasoning delivers up to 10× higher throughput than its predecessor.
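As a rough sketch of that serving setup, the snippet below uses vLLM's offline API; the repository id microsoft/Phi-4-mini-flash-reasoning and the context limit are assumptions based on the release description, so verify them against the model card.

# Hedged sketch: long-generation inference with vLLM's offline API.
# The model id and max_model_len below are assumptions; check the model card.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-4-mini-flash-reasoning", max_model_len=65536)
params = SamplingParams(temperature=0.6, max_tokens=4096)

prompt = "Solve step by step: what is the sum of the first 100 positive integers?"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)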

Efficient Long-Context Processing

Efficiency gains in Phi-4-mini-Flash-Reasoning aren’t just theoretical. Through the decoder-hybrid-decoder design, the model achieves competitive performance on long-context benchmarks like Phonebook and RULER. For instance, with a sliding window attention (SWA) size as small as 256, it maintains high retrieval accuracy, indicating that long-range token dependencies are well captured via SSMs and GMU-based memory sharing.

These architectural innovations lead to reduced compute and memory overhead. For example, during decoding, GMU layers replace attention operations that would otherwise cost O(N·d) time per token, cutting that down to O(d), where N is sequence length and d is hidden dimension. The result is real-time inference capability even in multi-turn or document-level scenarios.

Open Weights and Use Cases

Microsoft has open-sourced the model weights and configuration through Hugging Face, providing full access to the community. The model supports 64K context length, operates under standard Hugging Face and vLLM runtimes, and is optimized for fast token throughput on A100 GPUs.
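For the Hugging Face runtime mentioned above, loading the weights might look like the following minimal sketch; the repository id and the need for trust_remote_code are assumptions, so consult the model card for the exact usage.

# Hedged sketch: loading the open weights with transformers.
# The repository id is an assumption; trust_remote_code may be needed for the hybrid architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-flash-reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x? Think step by step."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))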

Potential use cases for Phi-4-mini-Flash-Reasoning include:

Mathematical Reasoning (e.g., SAT, AIME-level problems)

Multi-hop QA

Legal and Scientific Document Analysis

Autonomous Agents with Long-Term Memory

High-throughput Chat Systems

Its combination of open access, reasoning ability, and efficient inference makes it a strong candidate for deployment in environments where compute resources are constrained but task complexity is high.

Conclusion

Phi-4-mini-Flash-Reasoning exemplifies how architectural innovation—particularly hybrid models leveraging SSMs and efficient gating—can bring transformative gains in reasoning performance without ballooning model size or cost. It marks a new direction in efficient long-context language modeling, paving the way for real-time, on-device reasoning agents and scalable open-source alternatives to commercial LLMs.

Check out the Paper, Codes, Model on Hugging Face and Technical details. All credit for this research goes to the researchers of this project.
The post Microsoft Releases Phi-4-mini-Flash-Reasoning: Efficient Long-Context Reasoning with Compact Architecture appeared first on MarkTechPost.

New capabilities in Amazon SageMaker AI continue to transform how orga …

As AI models become increasingly sophisticated and specialized, the ability to quickly train and customize models can mean the difference between industry leadership and falling behind. That is why hundreds of thousands of customers use the fully managed infrastructure, tools, and workflows of Amazon SageMaker AI to scale and advance AI model development. Since launching in 2017, SageMaker AI has transformed how organizations approach AI model development by reducing complexity while maximizing performance. Since then, we’ve continued to relentlessly innovate, adding more than 420 new capabilities since launch to give customers the best tools to build, train, and deploy AI models quickly and efficiently. Today, we’re pleased to announce new innovations that build on the rich features of SageMaker AI to accelerate how customers build and train AI models.
Amazon SageMaker HyperPod: The infrastructure of choice for developing AI models
AWS launched Amazon SageMaker HyperPod in 2023 to reduce complexity and maximize performance and efficiency when building AI models. With SageMaker HyperPod, you can quickly scale generative AI model development across thousands of AI accelerators and reduce foundation model (FM) training and fine-tuning development costs by up to 40%. Many of today’s top models are trained on SageMaker HyperPod, including models from Hugging Face, Luma AI, Perplexity AI, Salesforce, Thomson Reuters, Writer, and Amazon. By training Amazon Nova FMs on SageMaker HyperPod, Amazon saved months of work and increased utilization of compute resources to more than 90%.

To further streamline workflows and make it faster to develop and deploy models, a new command line interface (CLI) and software development kit (SDK) provides a single, consistent interface that simplifies infrastructure management, unifies job submission across training and inference, and supports both recipe-based and custom workflows with integrated monitoring and control. Today, we are also adding two capabilities to SageMaker HyperPod that can help you reduce training costs and accelerate AI model development.
Reduce the time to troubleshoot performance issues from days to minutes with SageMaker HyperPod observability
To bring new AI innovations to market as quickly as possible, organizations need visibility across AI model development tasks and compute resources to optimize training efficiency and detect and resolve interruptions or performance bottlenecks as soon as possible. For example, to investigate if a training or fine-tuning job failure was the result of a hardware issue, data scientists and machine learning (ML) engineers want to quickly filter to review the monitoring data of the specific GPUs that performed the job rather than manually browsing through the hardware resources of an entire cluster to establish the correlation between the job failure and a hardware issue.
The new observability capability in SageMaker HyperPod transforms how you can monitor and optimize your model development workloads. Through a unified dashboard preconfigured in Amazon Managed Grafana, with the monitoring data automatically published to an Amazon Managed Service for Prometheus workspace, you can now see generative AI task performance metrics, resource utilization, and cluster health in a single view. Teams can now quickly spot bottlenecks, prevent costly delays, and optimize compute resources. You can define automated alerts, specify use case-specific task metrics and events, and publish them to the unified dashboard with just a few clicks.
By reducing troubleshooting time from days to minutes, this capability can help you accelerate your path to production and maximize the return on your AI investments.

DatologyAI builds tools to automatically select the best data on which to train deep learning models.

“We are excited to use Amazon SageMaker HyperPod’s one-click observability solution. Our senior staff members needed insights into how we’re utilizing GPU resources. The pre-built Grafana dashboards will give us exactly what we needed, with immediate visibility into critical metrics—from task-specific GPU utilization to file system (FSx for Lustre) performance—without requiring us to maintain any monitoring infrastructure. As someone who appreciates the power of the Prometheus Query Language, I like the fact that I can write my own queries and analyze custom metrics without worrying about infrastructure problems.” –Josh Wills, Member of Technical Staff at DatologyAI

Articul8 helps companies build sophisticated enterprise generative AI applications.

“With SageMaker HyperPod observability, we can now deploy our metric collection and visualization systems in a single click, saving our teams days of otherwise manual setup and enhancing our cluster observability workflows and insights. Our data scientists can quickly monitor task performance metrics, such as latency, and identify hardware issues without manual configuration. SageMaker HyperPod observability will help streamline our foundation model development processes, allowing us to focus on advancing our mission of delivering accessible and reliable AI-powered innovation to our customers.” –Renato Nascimento, head of technology at Articul8


Deploy Amazon SageMaker JumpStart models on SageMaker HyperPod for fast, scalable inference
After developing generative AI models on SageMaker HyperPod, many customers import these models to Amazon Bedrock, a fully managed service for building and scaling generative AI applications. However, some customers want to use their SageMaker HyperPod compute resources to speed up their evaluation and move models into production faster.
Now, you can deploy open-weights models from Amazon SageMaker JumpStart, as well as fine-tuned custom models, on SageMaker HyperPod within minutes with no manual infrastructure setup. Data scientists can run inference on SageMaker JumpStart models with a single click, simplifying and accelerating model evaluation. This straightforward, one-time provisioning reduces manual infrastructure setup, providing a reliable and scalable inference environment with minimal effort. Large model downloads are reduced from hours to minutes, accelerating model deployments and shortening the time to market.

H.AI exists to push the boundaries of superintelligence with agentic AI.

“With Amazon SageMaker HyperPod, we used the same high-performance compute to build and deploy the foundation models behind our agentic AI platform. This seamless transition from training to inference streamlined our workflow, reduced time to production, and delivered consistent performance in live environments. SageMaker HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency.” –Laurent Sifre, Co-founder & CTO at H.AI


Seamlessly access the powerful compute resources of SageMaker AI from local development environments
Today, many customers choose from the broad set of fully managed integrated development environments (IDEs) available in SageMaker AI for model development, including JupyterLab, Code Editor based on Code-OSS, and RStudio. Although these IDEs enable secure and efficient setups, some developers prefer to use local IDEs on their personal computers for their debugging capabilities and extensive customization options. However, customers using a local IDE, such as Visual Studio Code, couldn’t easily run their model development tasks on SageMaker AI until now.
With new remote connections to SageMaker AI, developers and data scientists can quickly and seamlessly connect to SageMaker AI from their local VS Code, maintaining access to the custom tools and familiar workflows that help them work most efficiently. Developers can build and train AI models using their local IDE while SageMaker AI manages remote execution, so you can work in your preferred environment while still benefiting from the performance, scalability, and security of SageMaker AI. You can now choose your preferred IDE—whether that is a fully managed cloud IDE or VS Code—to accelerate AI model development using the powerful infrastructure and seamless scalability of SageMaker AI.

CyberArk is a leader in Identity Security, which provides a comprehensive approach centered on privileged controls to protect against advanced cyber threats.
“With remote connections to SageMaker AI, our data scientists have the flexibility to choose the IDE that makes them most productive. Our teams can leverage their customized local setup while accessing the infrastructure and security controls of SageMaker AI. As a security first company, this is extremely important to us as it ensures sensitive data stays protected, while allowing our teams to securely collaborate and boost productivity.” –Nir Feldman, Senior Vice President of Engineering at CyberArk


Build generative AI models and applications faster with fully managed MLflow 3.0
As customers across industries accelerate their generative AI development, they require capabilities to track experiments, observe behavior, and evaluate performance of models and AI applications. Customers such as Cisco, SonRai, and Xometry are already using managed MLflow on SageMaker AI to efficiently manage ML model experiments at scale. The introduction of fully managed MLflow 3.0 on SageMaker AI makes it straightforward to track experiments, monitor training progress, and gain deeper insights into the behavior of models and AI applications using a single tool, helping you accelerate generative AI development.
Conclusion
In this post, we shared some of the new innovations in SageMaker AI to accelerate how you can build and train AI models.
To learn more about these new features, SageMaker AI, and how companies are using this service, refer to the following resources:

Accelerate foundation model development with one click observability in Amazon SageMaker HyperPod
Supercharge your AI workflows by connecting to SageMaker Studio from Visual Studio Code
Accelerating generative AI development with fully managed MLflow 3.0 on Amazon SageMaker AI
Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle
Amazon SageMaker AI
Amazon SageMaker AI customers

About the author
Ankur Mehrotra joined Amazon back in 2008 and is currently the General Manager of Amazon SageMaker AI. Before Amazon SageMaker AI, he worked on building Amazon.com’s advertising systems and automated pricing technology.

Accelerate foundation model development with one-click observability in Amazon SageMaker HyperPod

Amazon SageMaker HyperPod now provides a comprehensive, out-of-the-box dashboard that delivers insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and visualizes them in Amazon Managed Grafana dashboards, optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance.
With a one-click installation of the Amazon Elastic Kubernetes Service (Amazon EKS) add-on for SageMaker HyperPod observability, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter (EFA), integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators. With this unified view, you can trace model development task performance to cluster resources with aggregation of resource metrics at the task level. The solution also abstracts management of collector agents and scrapers across clusters, offering automatic scalability of collectors across nodes as the cluster grows. The dashboards feature intuitive navigation across metrics and visualizations to help users diagnose problems and take action faster. They are also fully customizable, supporting additional PromQL metric imports and custom Grafana layouts.
These capabilities save teams valuable time and resources during FM development, helping accelerate time-to-market and reduce the cost of generative AI innovations. Instead of spending hours or days configuring, collecting, and analyzing cluster telemetry systems, data scientists and machine learning (ML) engineers can now quickly identify training, tuning, and inference disruptions, underutilization of valuable GPU resources, and hardware performance issues. The pre-built, actionable insights of SageMaker HyperPod observability can be used in several common scenarios when operating FM workloads, such as:

Data scientists can monitor resource utilization of submitted training and inference tasks at the per-GPU level, with insights into GPU memory and FLOPs
AI researchers can troubleshoot sub-optimal time-to-first-token (TTFT) for their inferencing workloads by correlating the deployment metrics with the corresponding resource bottlenecks
Cluster administrators can configure customizable alerts to send notifications to multiple destinations such as Amazon Simple Notification Service (Amazon SNS), PagerDuty, and Slack when hardware falls outside of recommended health thresholds
Cluster administrators can quickly identify inefficient resource queuing patterns across teams or namespaces to reconfigure allocation and prioritization policies

In this post, we walk you through installing and using the unified dashboards of the out-of-the-box observability feature in SageMaker HyperPod. We cover the one-click installation from the Amazon SageMaker AI console, navigating the dashboard and metrics it consolidates, and advanced topics such as setting up custom alerts. If you have a running SageMaker HyperPod EKS cluster, then this post will help you understand how to quickly visualize key health and performance telemetry data to derive actionable insights.
Prerequisites
To get started with SageMaker HyperPod observability, you first need to enable AWS IAM Identity Center to use Amazon Managed Grafana. If IAM Identity Center isn’t already enabled in your account, refer to Getting started with IAM Identity Center. Additionally, create at least one user in the IAM Identity Center.
SageMaker HyperPod observability is available for SageMaker HyperPod clusters with an Amazon EKS orchestrator. If you don’t already have a SageMaker HyperPod cluster with an Amazon EKS orchestrator, refer to Amazon SageMaker HyperPod quickstart workshops for instructions to create one.
Enable SageMaker HyperPod observability
To enable SageMaker HyperPod observability, follow these steps:

On the SageMaker AI console, choose Cluster management in the navigation pane.
Open the cluster detail page from the SageMaker HyperPod clusters list.
On the Dashboard tab, in the HyperPod Observability section, choose Quick installation.

SageMaker AI will create a new Prometheus workspace, a new Grafana workspace, and install the SageMaker HyperPod observability add-on to the EKS cluster. The installation typically completes within a few minutes.

When the installation process is complete, you can view the add-on details and metrics available.

Choose Manage users to assign a user to a Grafana workspace.
Choose Open dashboard in Grafana to open the Grafana dashboard.

When prompted, sign in with IAM Identity Center with the user you configured as a prerequisite.

After signing in successfully, you will see the SageMaker HyperPod observability dashboard on Grafana.
SageMaker HyperPod observability dashboards
You can choose from multiple dashboards, including Cluster, Tasks, Inference, Training, and File system.
The Cluster dashboard shows cluster-level metrics such as Total Nodes and Total GPUs, and cluster node-level metrics such as GPU Utilization and Filesystem space available. By default, the dashboard shows metrics for the entire cluster, but you can apply filters to show metrics only for a specific hostname or GPU ID.

The Tasks dashboard is helpful if you want to see resource allocation and utilization metrics at the task level (PyTorchJob, ReplicaSet, and so on). For example, you can compare GPU utilization by multiple tasks running on your cluster and identify which task should be improved.
You can also choose an aggregation level from multiple options (Namespace, Task Name, Task Pod), and apply filters (Namespace, Task Type, Task Name, Pod, GPU ID). You can use these aggregation and filtering capabilities to view metrics at the appropriate granularity and drill down into the specific issue you are investigating.

The Inference dashboard shows inference application specific metrics such as Incoming Requests, Latency, and Time to First Byte (TTFB). The Inference dashboard is particularly useful when you use SageMaker HyperPod clusters for inference and need to monitor the traffic of the requests and performance of models.

Advanced installation
The Quick installation option will create a new workspace for Prometheus and Grafana and select default metrics. If you want to reuse an existing workspace, select additional metrics, or enable Pod logging to Amazon CloudWatch Logs, use the Custom installation option. For more information, see Amazon SageMaker HyperPod.
Set up alerts
Amazon Managed Grafana includes access to an updated alerting system that centralizes alerting information in a single, searchable view (in the navigation pane, choose Alerts to create an alert). Alerting is useful when you want to receive timely notifications, such as when GPU utilization drops unexpectedly, when disk usage of your shared file system exceeds 90%, or when multiple instances become unavailable at the same time. The HyperPod observability dashboard in Amazon Managed Grafana has pre-configured alerts for a few of these key metrics. You can create additional alert rules based on metrics or queries and set up multiple notification channels, such as email and Slack messages. For instructions on setting up alerts with Slack messages, see the Setting Up Slack Alerts for Amazon Managed Grafana GitHub page.
The number of alerts is limited to 100 per Grafana workspace. If you need a more scalable solution, check out the alerting options in Amazon Managed Service for Prometheus.
High-level overview
The following diagram illustrates the architecture of the new HyperPod observability capability.

Clean up
If you want to uninstall the SageMaker HyperPod observability feature (for example, to reconfigure it), clean up the resources in the following order:

Remove the SageMaker HyperPod observability add-on, either using the SageMaker AI console or Amazon EKS console.
Delete the Grafana workspace on the Amazon Managed Grafana console.
Delete the Prometheus workspace on the Amazon Managed Service for Prometheus console.

Conclusion
This post provided an overview and usage instructions for SageMaker HyperPod observability, a newly released observability feature for SageMaker HyperPod. This feature reduces the heavy lifting involved in setting up cluster observability and provides centralized visibility into cluster health status and performance metrics.
For more information about SageMaker HyperPod observability, see Amazon SageMaker HyperPod. Please leave your feedback on this post in the comments section.

About the authors
Tomonori Shimomura is a Principal Solutions Architect on the Amazon SageMaker AI team, where he provides in-depth technical consultation to SageMaker AI customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and he now applies his in-depth technical skills to cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.
Matt Nightingale is a Solutions Architect Manager on the AWS WWSO Frameworks team focusing on Generative AI Training and Inference. Matt specializes in distributed training architectures with a focus on hardware performance and reliability. Matt holds a bachelor's degree from the University of Virginia and is based in Boston, Massachusetts.
Eric Saleh is a Senior GenAI Specialist at AWS, focusing on foundation model training and inference. He is partnering with top foundation model builders and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions with strategic customers. Before joining AWS, Eric led product teams building enterprise AI/ML solutions, which included frontier GenAI services for fine-tuning, RAG, and managed inference. He holds a master’s degree in Business Analytics from UCLA Anderson.
Piyush Kadam is a Senior Product Manager on the Amazon SageMaker AI team, where he specializes in LLMOps products that empower both startups and enterprise customers to rapidly experiment with and efficiently govern foundation models. With a Master’s degree in Computer Science from the University of California, Irvine, specializing in distributed systems and artificial intelligence, Piyush brings deep technical expertise to his role in shaping the future of cloud AI products.
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.
Bhaskar Pratap is a Senior Software Engineer with the Amazon SageMaker AI team. He is passionate about designing and building elegant systems that bring machine learning to people’s fingertips. Additionally, he has extensive experience with building scalable cloud storage services.
Gopi Sekar is an Engineering Leader for the Amazon SageMaker AI team. He is dedicated to assisting customers and developing products that simplify the adaptation of machine learning to address real-world customer challenges.

Accelerating generative AI development with fully managed MLflow 3.0 on Amazon SageMaker AI

Amazon SageMaker now offers fully managed support for MLflow 3.0, which streamlines AI experimentation and accelerates your generative AI journey from idea to production. This release extends managed MLflow beyond experiment tracking to end-to-end observability, reducing time-to-market for generative AI development.
As customers across industries accelerate their generative AI development, they require capabilities to track experiments, observe behavior, and evaluate performance of models and AI applications. Data scientists and developers struggle to effectively analyze the performance of their models and AI applications from experimentation to production, making it hard to find root causes and resolve issues. Teams spend more time integrating tools than improving the quality of their models or generative AI applications.
With the launch of fully managed MLflow 3.0 on Amazon SageMaker AI, you can accelerate generative AI development by making it easier to track experiments and observe behavior of models and AI applications using a single tool. Tracing capabilities in fully managed MLflow 3.0 provide customers the ability to record the inputs, outputs, and metadata at every step of a generative AI application, so developers can quickly identify the source of bugs or unexpected behaviors. By maintaining records of each model and application version, fully managed MLflow 3.0 offers traceability to connect AI responses to their source components, which means developers can quickly trace an issue directly to the specific code, data, or parameters that generated it. With these capabilities, customers using Amazon SageMaker HyperPod to train and deploy foundation models (FMs) can now use managed MLflow to track experiments, monitor training progress, gain deeper insights into the behavior of models and AI applications, and manage their machine learning (ML) lifecycle at scale. This reduces troubleshooting time and enables teams to focus more on innovation.
This post walks you through the core concepts of fully managed MLflow 3.0 on SageMaker and provides technical guidance on how to use the new features to help accelerate your next generative AI application development.
Getting started
You can get started with fully managed MLflow 3.0 on Amazon SageMaker to track experiments, manage models, and streamline your generative AI/ML lifecycle through the AWS Management Console, AWS Command Line Interface (AWS CLI), or API.
Prerequisites
To get started, you need:

An AWS account with billing enabled
An Amazon SageMaker Studio AI domain. To create a domain, refer to Guide to getting set up with Amazon SageMaker AI.

Configure your environment to use SageMaker managed MLflow Tracking Server
To perform the configuration, follow these steps:

In the SageMaker Studio UI, in the Applications pane, choose MLflow and choose Create.

Enter a unique name for your tracking server and specify the Amazon Simple Storage Service (Amazon S3) URI where your experiment artifacts will be stored. When you’re ready, choose Create. By default, SageMaker will select version 3.0 to create the MLflow tracking server.
Optionally, you can choose Update to adjust settings such as server size, tags, or AWS Identity and Access Management (IAM) role.

The server will now be provisioned and started automatically, typically within 25 minutes. After setup, you can launch the MLflow UI from SageMaker Studio to start tracking your ML and generative AI experiments. For more details on tracking server configurations, refer to Machine learning experiments using Amazon SageMaker AI with MLflow in the SageMaker Developer Guide.
To begin tracking your experiments with your newly created SageMaker managed MLflow tracking server, you need to install both MLflow and the AWS SageMaker MLflow Python packages in your environment. You can use SageMaker Studio managed JupyterLab, SageMaker Studio Code Editor, a local integrated development environment (IDE), or another supported environment where your AI workloads operate to track with the SageMaker managed MLflow tracking server.
To install both Python packages using pip:

pip install mlflow==3.0 sagemaker-mlflow==0.1.0
To connect and start logging your AI experiments, parameters, and models directly to managed MLflow on SageMaker, set the tracking URI using the Amazon Resource Name (ARN) of your SageMaker MLflow tracking server:

import mlflow

# SageMaker MLflow ARN
tracking_server_arn = "arn:aws:sagemaker:<Region>:<Account_id>:mlflow-tracking-server/<Name>"  # Enter ARN
mlflow.set_tracking_uri(tracking_server_arn)
mlflow.set_experiment("customer_support_genai_app")

Now your environment is configured and ready to track your experiments with your SageMaker Managed MLflow tracking server.
Implement generative AI application tracing and version tracking
Generative AI applications have multiple components, including code, configurations, and data, which can be challenging to manage without systematic versioning. A LoggedModel entity in managed MLflow 3.0 represents your AI model, agent, or generative AI application within an experiment. It provides unified tracking of model artifacts, execution traces, evaluation metrics, and metadata throughout the development lifecycle. A trace is a log of inputs, outputs, and intermediate steps from a single application execution. Traces provide insights into application performance, execution flow, and response quality, enabling debugging and evaluation. With LoggedModel, you can track and compare different versions of your application, making it easier to identify issues, deploy the best version, and maintain a clear record of what was deployed and when.
To implement version tracking and tracing with managed MLflow 3.0 on SageMaker, you can establish a versioned model identity using a Git commit hash, set this as the active model context so all subsequent traces will be automatically linked to this specific version, enable automatic logging for Amazon Bedrock interactions, and then make an API call to Anthropic’s Claude 3.5 Sonnet that will be fully traced with inputs, outputs, and metadata automatically captured within the established model context. Managed MLflow 3.0 tracing is already integrated with various generative AI libraries and provides one-line automatic tracing experience for all the support libraries. For information about supported libraries, refer to Supported Integrations in the MLflow documentation.

import boto3
import mlflow
import subprocess

# Assumption: the current git commit hash is used to version the application
git_commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode().strip()

# 1. Define your application version using the git commit
logged_model = "customer_support_agent"
logged_model_name = f"{logged_model}-{git_commit}"

# 2. Set the active model context - traces will be linked to this
mlflow.set_active_model(name=logged_model_name)

# 3. Set auto logging for your model provider
mlflow.bedrock.autolog()

# 4. Chat with your LLM provider
# Ensure that your boto3 client has the necessary auth information
bedrock = boto3.client(
    service_name="bedrock-runtime",
    region_name="<REPLACE_WITH_YOUR_AWS_REGION>",
)

model = "anthropic.claude-3-5-sonnet-20241022-v2:0"
messages = [{"role": "user", "content": [{"text": "Hello!"}]}]
# All intermediate executions within the chat session will be logged
bedrock.converse(modelId=model, messages=messages)

After logging this information, you can track these generative AI experiments and the logged model for the agent in the managed MLflow 3.0 tracking server UI, as shown in the following screenshot.

In addition to the one-line auto tracing functionality, MLflow offers Python SDK for manually instrumenting your code and manipulating traces. Refer to the code sample notebook sagemaker_mlflow_strands.ipynb in the aws-samples GitHub repository, where we use MLflow manual instrumentation to trace Strands Agents. With tracing capabilities in fully managed MLflow 3.0, you can record the inputs, outputs, and metadata associated with each intermediate step of a request, so you can pinpoint the source of bugs and unexpected behaviors.
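As a minimal sketch of that manual instrumentation (the function names and span attributes below are illustrative and not taken from the sample notebook):

# Hedged sketch of manual trace instrumentation with the MLflow Python SDK.
import mlflow

@mlflow.trace(name="retrieve_context")
def retrieve_context(query: str) -> str:
    # In a real agent this would call a retriever or an external tool
    return f"Relevant documents for: {query}"

@mlflow.trace(name="answer_question")
def answer_question(query: str) -> str:
    context = retrieve_context(query)  # nested call is recorded as a child span
    with mlflow.start_span(name="compose_answer") as span:
        span.set_inputs({"query": query, "context": context})
        answer = f"Based on the retrieved context, here is the answer to: {query}"
        span.set_outputs({"answer": answer})
    return answer

answer_question("How do I reset my password?")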
These capabilities provide observability in your AI workload by capturing detailed information about the execution of the workload services, nodes, and tools that you can see under the Traces tab.

You can inspect each trace, as shown in the following image, by choosing the request ID in the traces tab for the desired trace.

Fully managed MLflow 3.0 on Amazon SageMaker also introduces the capability to tag traces. Tags are mutable key-value pairs you can attach to traces to add valuable metadata and context. Trace tags make it straightforward to organize, search, and filter traces based on criteria such as user session, environment, model version, or performance characteristics. You can add, update, or remove tags at any stage—during trace execution using mlflow.update_current_trace() or after a trace is logged using the MLflow APIs or UI. Managed MLflow 3.0 makes it seamless to search and analyze traces, helping teams quickly pinpoint issues, compare agent behaviors, and optimize performance. The tracing UI and Python API both support powerful filtering, so you can drill down into traces based on attributes such as status, tags, user, environment, or execution time as shown in the screenshot below. For example, you can instantly find all traces with errors, filter by production environment, or search for traces from a specific request. This capability is essential for debugging, cost analysis, and continuous improvement of generative AI applications.
The following screenshot displays the traces returned when searching for the tag ‘Production’.
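The environment tag shown above can be attached from inside the instrumented code. The snippet below is a minimal sketch; the tag keys and values are illustrative rather than the exact tags used in the screenshots.

# Hedged sketch: tagging the active trace during execution so it can be filtered later.
import mlflow

@mlflow.trace
def handle_request(question: str) -> str:
    # Attach metadata to the trace currently being recorded
    mlflow.update_current_trace(tags={"environment": "production", "model_version": "v1.2"})
    return f"Answering: {question}"

handle_request("What is my account balance?")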

The following code snippet shows how you can search for all traces in production with a successful status:

# Search for traces in production environment with successful status
traces = mlflow.search_traces(filter_string="attributes.status = 'OK' AND tags.environment = 'production'")

Generative AI use case walkthrough with MLflow tracing
Building and deploying generative AI agents such as chat-based assistants, code generators, or customer support assistants requires deep visibility into how these agents interact with large language models (LLMs) and external tools. In a typical agentic workflow, the agent loops through reasoning steps, calling LLMs and using tools or subsystems such as search APIs or Model Context Protocol (MCP) servers until it completes the user’s task. These complex, multistep interactions make debugging, optimization, and cost tracking especially challenging.
Traditional observability tools fall short in generative AI because agent decisions, tool calls, and LLM responses are dynamic and context-dependent. Managed MLflow 3.0 tracing provides comprehensive observability by capturing every LLM call, tool invocation, and decision point in your agent’s workflow. You can use this end-to-end trace data to:

Debug agent behavior – Pinpoint where an agent’s reasoning deviates or why it produces unexpected outputs.
Monitor tool usage – Discover how and when external tools are called and analyze their impact on quality and cost.
Track performance and cost – Measure latency, token usage, and API costs at each step of the agentic loop.
Audit and govern – Maintain detailed logs for compliance and analysis.

Imagine a real-world scenario using the managed MLflow 3.0 tracing UI for a sample finance customer support agent equipped with a tool to retrieve financial data from a datastore. While you're developing a generative AI customer support agent or analyzing its behavior in production, you can observe how the agent responds and whether its execution optionally calls a product database tool for more accurate recommendations. For illustration, the first trace, shown in the following screenshot, shows the agent handling a user query without invoking any tools. The trace captures the prompt, the agent response, and the agent's decision points. The agent's response lacks product-specific details, and the trace makes it clear that no external tool was called, so you can quickly identify this behavior in the agent's reasoning chain.

The second trace, shown in the following screenshot, captures the same agent, but this time it decides to call the product database tool. The trace logs the tool invocation, the returned product data, and how the agent incorporates this information into its final response. Here, you can observe improved answer quality, a slight increase in latency, and additional API cost with higher token usage.

By comparing these traces side by side, you can debug why the agent sometimes skips using the tool, optimize when and how tools are called, and balance quality against latency and cost. MLflow’s tracing UI makes these agentic loops transparent, actionable, and seamless to analyze at scale. This post’s sample agent and all necessary code is available on the aws-samples GitHub repository, where you can replicate and adapt it for your own applications.
Cleanup
After it’s created, a SageMaker managed MLflow tracking server will incur costs until you delete or stop it. Billing for tracking servers is based on the duration the servers have been running, the size selected, and the amount of data logged to the tracking servers. You can stop tracking servers when they’re not in use to save costs, or you can delete them using API or the SageMaker Studio UI. For more details on pricing, refer to Amazon SageMaker pricing.
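If you prefer to stop or delete the tracking server programmatically rather than through the Studio UI, a minimal boto3 sketch might look like the following; the tracking server name is a placeholder, and you should confirm the exact API operations in the SageMaker documentation.

# Hedged sketch: stopping or deleting a managed MLflow tracking server with boto3.
# "my-tracking-server" is a placeholder; verify the operation names in the SageMaker API reference.
import boto3

sm = boto3.client("sagemaker")

# Stop the server while it is not in use to avoid running costs
sm.stop_mlflow_tracking_server(TrackingServerName="my-tracking-server")

# Or delete it entirely once it is no longer needed
# sm.delete_mlflow_tracking_server(TrackingServerName="my-tracking-server")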
Conclusion
Fully managed MLflow 3.0 on Amazon SageMaker AI is now available. Get started with sample code in the aws-samples GitHub repository. We invite you to explore this new capability and experience the enhanced efficiency and control it brings to your ML projects. To learn more, visit Machine Learning Experiments using Amazon SageMaker with MLflow.
For more information, visit the SageMaker Developer Guide and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

About the authors
Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year-old sheep-a-doodle!
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval-Augmented Generation (RAG), GenAI agents, and scaling GenAI use cases. He also focuses on go-to-market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.
Amit Modi is the product leader for SageMaker AIOps and Governance, and Responsible AI at AWS. With over a decade of B2B experience, he builds scalable products and teams that drive innovation and deliver value to customers globally.
Rahul Easwar is a Senior Product Manager at AWS, leading managed MLflow and Partner AI Apps within the SageMaker AIOps team. With over 15 years of experience spanning startups to enterprise technology, he leverages his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations worldwide. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.

Hugging Face Releases SmolLM3: A 3B Long-Context, Multilingual Reasoning Model

Hugging Face just released SmolLM3, the latest version of its “Smol” language models, designed to deliver strong multilingual reasoning over long contexts using a compact 3B-parameter architecture. While most high-context capable models typically push beyond 7B parameters, SmolLM3 manages to offer state-of-the-art (SoTA) performance with significantly fewer parameters—making it more cost-efficient and deployable on constrained hardware, without compromising on capabilities like tool usage, multi-step reasoning, and language diversity.

Overview of SmolLM3

SmolLM3 stands out as a compact, multilingual, and dual-mode long-context language model capable of handling sequences up to 128k tokens. It was trained on 11 trillion tokens, positioning it competitively against models like Mistral, LLaMA 2, and Falcon. Despite its size, SmolLM3 achieves surprisingly strong tool usage performance and few-shot reasoning ability—traits more commonly associated with models double or triple its size.

SmolLM3 was released in two variants:

SmolLM3-3B-Base: The base language model trained on the 11T-token corpus.

SmolLM3-3B-Instruct: An instruction-tuned variant optimized for reasoning and tool use.

Both models are publicly available under the Apache 2.0 license on Hugging Face’s Model Hub.
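As a minimal sketch of pulling the instruct variant into a transformers pipeline (the repository id HuggingFaceTB/SmolLM3-3B is an assumption; check the Hugging Face Hub for the exact names):

# Hedged sketch: loading SmolLM3 with transformers.
# The repository id is an assumption; confirm it on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize the main idea of state space models in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))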

Key Features

1. Long Context Reasoning (up to 128k tokens)

SmolLM3 utilizes a modified attention mechanism to efficiently process extremely long contexts—up to 128,000 tokens. This capability is crucial for tasks involving extended documents, logs, or structured records where context length directly affects comprehension and accuracy.

2. Dual Mode Reasoning

The instruction-tuned SmolLM3-3B supports dual-mode reasoning:

Instruction-following for chat-style and tool-augmented tasks.

Multilingual QA and generation for tasks in multiple languages.

This bifurcation allows the model to excel in both open-ended generation and structured reasoning, making it suitable for applications ranging from RAG pipelines to agent workflows.

3. Multilingual Capabilities

Trained on a multilingual corpus, SmolLM3 supports six languages: English, French, Spanish, German, Italian, and Portuguese. It performs well on benchmarks like XQuAD and MGSM, demonstrating its ability to generalize across linguistic boundaries with minimal performance drop.

4. Compact Size with SoTA Performance

At just 3 billion parameters, SmolLM3 achieves performance close to or on par with larger models such as Mistral-7B on multiple downstream tasks. This is made possible by the scale and quality of its training data (11T tokens) and careful architectural tuning.

5. Tool Use and Structured Outputs

The model demonstrates impressive performance on tool-calling tasks—both in prompt-based workflows and with structured outputs. It correctly follows schema-driven input-output constraints and interfaces well with systems requiring deterministic behavior, such as autonomous agents and API-driven environments.

Technical Training Details

SmolLM3 was trained on an internal mixture curated by Hugging Face, consisting of high-quality web content, code, academic papers, and multilingual sources. The 11T-token training run was done using multi-node distributed training strategies on GPU clusters, employing optimizations like Flash Attention v2 for efficient long-sequence training. The tokenizer is a 128k-token SentencePiece model, shared across all supported languages.

For long context support, Hugging Face employed linear and grouped attention mechanisms that minimize quadratic complexity while retaining performance. This enabled the model to handle context lengths up to 128k during both training and inference—without memory bottlenecks that plague dense transformers at this scale.

The SmolLM3-3B instruction-tuned variant was further trained using Hugging Face’s trlx library for alignment with chat instructions, reasoning tasks, and tool usage demonstrations.

Performance Benchmarks

SmolLM3 performs strongly on multiple multilingual and reasoning benchmarks:

XQuAD (Multilingual QA): Competitive scores in all six supported languages.

MGSM (Multilingual Grade School Math): Outperforms several larger models in zero-shot settings.

ToolQA and MultiHopQA: Shows strong multi-step reasoning and context grounding.

ARC and MMLU: High accuracy in commonsense and professional knowledge domains.

While it does not surpass the latest 7B and 13B models on every benchmark, SmolLM3’s performance-to-parameter ratio remains one of the highest in its class.

Use Cases and Applications

SmolLM3 is particularly suited for:

Low-cost, multilingual AI deployments in chatbots, helpdesk systems, and document summarizers.

Lightweight RAG and retrieval-based systems that benefit from long-context understanding.

Tool-augmented agents requiring schema adherence and deterministic tool invocation.

Edge deployments and private environments where smaller models are necessary due to hardware or data privacy constraints.

Conclusion

SmolLM3 exemplifies a new generation of small-yet-capable language models. Its combination of multilingual support, long-context handling, and strong reasoning—all within a 3B parameter footprint—marks a significant step forward in model efficiency and accessibility. Hugging Face’s release demonstrates that with the right training recipe and architectural design, smaller models can still deliver robust performance in complex tasks traditionally reserved for much larger LLMs.

Check out the SmolLM3-3B-Base and SmolLM3-3B-Instruct. All credit for this research goes to the researchers of this project.
The post Hugging Face Releases SmolLM3: A 3B Long-Context, Multilingual Reasoning Model appeared first on MarkTechPost.

A Code Implementation for Designing Intelligent Multi-Agent Workflows …

In this tutorial, we explore the power and flexibility of the beeai-framework by building a fully functional multi-agent system from the ground up. We walk through the essential components, custom agents, tools, memory management, and event monitoring, to show how BeeAI simplifies the development of intelligent, cooperative agents. Along the way, we demonstrate how these agents can perform complex tasks, such as market research, code analysis, and strategic planning, using a modular, production-ready pattern.

import subprocess
import sys
import asyncio
import json
from typing import Dict, List, Any, Optional
from datetime import datetime
import os

def install_packages():
    packages = [
        "beeai-framework",
        "requests",
        "beautifulsoup4",
        "numpy",
        "pandas",
        "pydantic"
    ]

    print("Installing required packages...")
    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            print(f" {package} installed successfully")
        except subprocess.CalledProcessError as e:
            print(f" Failed to install {package}: {e}")
    print("Installation complete!")

install_packages()

try:
    from beeai_framework import ChatModel
    from beeai_framework.agents import Agent
    from beeai_framework.tools import Tool
    from beeai_framework.workflows import Workflow
    BEEAI_AVAILABLE = True
    print(" BeeAI Framework imported successfully")
except ImportError as e:
    print(f" BeeAI Framework import failed: {e}")
    print("Falling back to custom implementation...")
    BEEAI_AVAILABLE = False

We begin by installing all the required packages, including the beeai-framework, to ensure our environment is ready for multi-agent development. Once installed, we attempt to import BeeAI’s core modules. If the import fails, we gracefully fall back to a custom implementation to maintain workflow functionality.

class MockChatModel:
    """Mock LLM for demonstration purposes"""
    def __init__(self, model_name: str = "mock-llm"):
        self.model_name = model_name

    async def generate(self, messages: List[Dict[str, str]]) -> str:
        """Generate a mock response"""
        last_message = messages[-1]['content'] if messages else ""

        if "market" in last_message.lower():
            return "Market analysis shows strong growth in AI frameworks with 42% YoY increase. Key competitors include LangChain, CrewAI, and AutoGen."
        elif "code" in last_message.lower():
            return "Code analysis reveals good structure with async patterns. Consider adding more error handling and documentation."
        elif "strategy" in last_message.lower():
            return "Strategic recommendation: Focus on ease of use, strong documentation, and enterprise features to compete effectively."
        else:
            return f"Analyzed: {last_message[:100]}... Recommendation: Implement best practices for scalability and maintainability."

class CustomTool:
    """Base class for custom tools"""
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description

    async def run(self, input_data: str) -> str:
        """Override this method in subclasses"""
        raise NotImplementedError

We define a MockChatModel to simulate LLM behavior when BeeAI is unavailable, allowing us to test and prototype workflows without relying on external APIs. Alongside it, we create a CustomTool base class, which serves as a blueprint for task-specific tools that our agents can use, laying the foundation for modular, tool-augmented agent capabilities.

class MarketResearchTool(CustomTool):
    """Custom tool for market research and competitor analysis"""

    def __init__(self):
        super().__init__(
            name="market_research",
            description="Analyzes market trends and competitor information"
        )
        self.market_data = {
            "AI_frameworks": {
                "competitors": ["LangChain", "CrewAI", "AutoGen", "Haystack", "Semantic Kernel"],
                "market_size": "$2.8B",
                "growth_rate": "42% YoY",
                "key_trends": ["Multi-agent systems", "Production deployment", "Tool integration", "Enterprise adoption"]
            },
            "enterprise_adoption": {
                "rate": "78%",
                "top_use_cases": ["Customer support", "Data analysis", "Code generation", "Document processing"],
                "challenges": ["Reliability", "Cost control", "Integration complexity", "Governance"]
            }
        }

    async def run(self, query: str) -> str:
        """Simulate market research based on query"""
        query_lower = query.lower()

        if "competitor" in query_lower or "competition" in query_lower:
            data = self.market_data["AI_frameworks"]
            return f"""Market Analysis Results:

Key Competitors: {', '.join(data['competitors'])}
Market Size: {data['market_size']}
Growth Rate: {data['growth_rate']}
Key Trends: {', '.join(data['key_trends'])}

Recommendation: Focus on differentiating features like simplified deployment, better debugging tools, and enterprise-grade security."""

        elif "adoption" in query_lower or "enterprise" in query_lower:
            data = self.market_data["enterprise_adoption"]
            return f"""Enterprise Adoption Analysis:

Adoption Rate: {data['rate']}
Top Use Cases: {', '.join(data['top_use_cases'])}
Main Challenges: {', '.join(data['challenges'])}

Recommendation: Address reliability and cost control concerns through better monitoring and resource management features."""

        else:
            return "Market research available for: competitor analysis, enterprise adoption, or specific trend analysis. Please specify your focus area."

We implement the MarketResearchTool as a specialized extension of our CustomTool base class. This tool simulates real-world market intelligence by returning pre-defined insights on AI framework trends, key competitors, adoption rates, and industry challenges. With this, we equip our agents to make informed, data-driven recommendations during workflow execution.

class CodeAnalysisTool(CustomTool):
    """Custom tool for analyzing code patterns and suggesting improvements"""

    def __init__(self):
        super().__init__(
            name="code_analysis",
            description="Analyzes code structure and suggests improvements"
        )

    async def run(self, code_snippet: str) -> str:
        """Analyze code and provide insights"""
        analysis = {
            "lines": len(code_snippet.split('\n')),
            "complexity": "High" if len(code_snippet) > 500 else "Medium" if len(code_snippet) > 200 else "Low",
            "async_usage": "Yes" if "async" in code_snippet or "await" in code_snippet else "No",
            "error_handling": "Present" if "try:" in code_snippet or "except:" in code_snippet else "Missing",
            "documentation": "Good" if '"""' in code_snippet or "'''" in code_snippet else "Needs improvement",
            "imports": "Present" if "import " in code_snippet else "None detected",
            "classes": len([line for line in code_snippet.split('\n') if line.strip().startswith('class ')]),
            "functions": len([line for line in code_snippet.split('\n') if line.strip().startswith('def ') or line.strip().startswith('async def ')])
        }

        suggestions = []
        if analysis["error_handling"] == "Missing":
            suggestions.append("Add try-except blocks for error handling")
        if analysis["documentation"] == "Needs improvement":
            suggestions.append("Add docstrings and comments")
        if "print(" in code_snippet:
            suggestions.append("Consider using proper logging instead of print statements")
        if analysis["async_usage"] == "Yes" and "await" not in code_snippet:
            suggestions.append("Ensure proper await usage with async functions")
        if analysis["complexity"] == "High":
            suggestions.append("Consider breaking down into smaller functions")

        return f"""Code Analysis Report:

Structure:
- Lines of code: {analysis['lines']}
- Complexity: {analysis['complexity']}
- Classes: {analysis['classes']}
- Functions: {analysis['functions']}

Quality Metrics:
- Async usage: {analysis['async_usage']}
- Error handling: {analysis['error_handling']}
- Documentation: {analysis['documentation']}

Suggestions:
{chr(10).join(f"• {suggestion}" for suggestion in suggestions) if suggestions else "• Code looks good! Following best practices."}

Overall Score: {10 - len(suggestions) * 2}/10"""

class CustomAgent:
    """Custom agent implementation"""

    def __init__(self, name: str, role: str, instructions: str, tools: List[CustomTool], llm=None):
        self.name = name
        self.role = role
        self.instructions = instructions
        self.tools = tools
        self.llm = llm or MockChatModel()
        self.memory = []

    async def run(self, task: str) -> Dict[str, Any]:
        """Execute agent task"""
        print(f" {self.name} ({self.role}) processing task...")

        self.memory.append({"type": "task", "content": task, "timestamp": datetime.now()})

        task_lower = task.lower()
        tool_used = None
        tool_result = None

        for tool in self.tools:
            if tool.name == "market_research" and ("market" in task_lower or "competitor" in task_lower):
                tool_result = await tool.run(task)
                tool_used = tool.name
                break
            elif tool.name == "code_analysis" and ("code" in task_lower or "analyze" in task_lower):
                tool_result = await tool.run(task)
                tool_used = tool.name
                break

        messages = [
            {"role": "system", "content": f"You are {self.role}. {self.instructions}"},
            {"role": "user", "content": task}
        ]

        if tool_result:
            messages.append({"role": "system", "content": f"Tool {tool_used} provided: {tool_result}"})

        response = await self.llm.generate(messages)

        self.memory.append({"type": "response", "content": response, "timestamp": datetime.now()})

        return {
            "agent": self.name,
            "task": task,
            "tool_used": tool_used,
            "tool_result": tool_result,
            "response": response,
            "success": True
        }

We now implement the CodeAnalysisTool, which enables our agents to assess code snippets based on structure, complexity, documentation, and error handling. This tool generates insightful suggestions to improve code quality. We also define the CustomAgent class, equipping each agent with its own role, instructions, memory, tools, and access to an LLM. This design allows each agent to intelligently decide whether a tool is needed and then synthesize a response using both tool output and LLM reasoning, ensuring adaptable and context-aware behavior.
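As a quick illustration of this pattern, a single agent can be exercised on its own before we build the full workflow. This is a sketch that assumes the classes above and the MockChatModel fallback are already defined:

import asyncio

async def try_single_agent():
    # A standalone reviewer agent wired to the CodeAnalysisTool defined above
    reviewer = CustomAgent(
        name="CodeReviewer",
        role="Senior Code Reviewer",
        instructions="Review code for structure, documentation, and error handling.",
        tools=[CodeAnalysisTool()]
    )
    result = await reviewer.run("Please analyze this code: def add(a, b): return a + b")
    print(result["tool_used"])
    print(result["response"][:300])

asyncio.run(try_single_agent())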

class WorkflowMonitor:
    """Monitor and log workflow events"""

    def __init__(self):
        self.events = []
        self.start_time = datetime.now()

    def log_event(self, event_type: str, data: Dict[str, Any]):
        """Log workflow events"""
        timestamp = datetime.now()
        self.events.append({
            "timestamp": timestamp,
            "duration": (timestamp - self.start_time).total_seconds(),
            "event_type": event_type,
            "data": data
        })
        print(f"[{timestamp.strftime('%H:%M:%S')}] {event_type}: {data.get('agent', 'System')}")

    def get_summary(self):
        """Get monitoring summary"""
        return {
            "total_events": len(self.events),
            "total_duration": (datetime.now() - self.start_time).total_seconds(),
            "event_types": list(set([e["event_type"] for e in self.events])),
            "events": self.events
        }

class CustomWorkflow:
    """Custom workflow implementation"""

    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
        self.agents = []
        self.monitor = WorkflowMonitor()

    def add_agent(self, agent: CustomAgent):
        """Add agent to workflow"""
        self.agents.append(agent)
        self.monitor.log_event("agent_added", {"agent": agent.name, "role": agent.role})

    async def run(self, tasks: List[str]) -> Dict[str, Any]:
        """Execute workflow with tasks"""
        self.monitor.log_event("workflow_started", {"tasks": len(tasks)})

        results = []
        context = {"shared_insights": []}

        for i, task in enumerate(tasks):
            agent = self.agents[i % len(self.agents)]

            if context["shared_insights"]:
                enhanced_task = f"{task}\n\nContext from previous analysis:\n" + "\n".join(context["shared_insights"][-2:])
            else:
                enhanced_task = task

            result = await agent.run(enhanced_task)
            results.append(result)

            context["shared_insights"].append(f"{agent.name}: {result['response'][:200]}...")

            self.monitor.log_event("task_completed", {
                "agent": agent.name,
                "task_index": i,
                "success": result["success"]
            })

        self.monitor.log_event("workflow_completed", {"total_tasks": len(tasks)})

        return {
            "workflow": self.name,
            "results": results,
            "context": context,
            "summary": self._generate_summary(results)
        }

    def _generate_summary(self, results: List[Dict[str, Any]]) -> str:
        """Generate workflow summary"""
        summary_parts = []

        for result in results:
            summary_parts.append(f"• {result['agent']}: {result['response'][:150]}...")

        return f"""Workflow Summary for {self.name}:

{chr(10).join(summary_parts)}

Key Insights:
• Market opportunities identified in AI framework space
• Technical architecture recommendations provided
• Strategic implementation plan outlined
• Multi-agent collaboration demonstrated successfully"""

We implement the WorkflowMonitor to log and track events throughout the execution, giving us real-time visibility into the actions taken by each agent. With the CustomWorkflow class, we orchestrate the entire multi-agent process, assigning tasks, preserving shared context across agents, and capturing all relevant insights. This structure ensures that we not only execute tasks in a coordinated and transparent way but also generate a comprehensive summary that highlights collaboration and key outcomes.
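To see the orchestration in miniature before the full demo that follows, a two-agent workflow can be assembled from the same building blocks. This is a sketch that assumes the classes defined above:

import asyncio

async def mini_workflow():
    wf = CustomWorkflow(name="Mini BI Workflow", description="Small end-to-end check")
    wf.add_agent(CustomAgent(
        name="Analyst", role="Market Analyst",
        instructions="Summarize market findings.", tools=[MarketResearchTool()]))
    wf.add_agent(CustomAgent(
        name="Reviewer", role="Code Reviewer",
        instructions="Assess code quality.", tools=[CodeAnalysisTool()]))

    output = await wf.run([
        "Run a competitor analysis of the AI framework market.",
        "Analyze this code: async def ping(): return 'pong'"
    ])
    print(output["summary"])

asyncio.run(mini_workflow())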

async def advanced_workflow_demo():
    """Demonstrate advanced multi-agent workflow"""

    print(" Advanced Multi-Agent Workflow Demo")
    print("=" * 50)

    workflow = CustomWorkflow(
        name="Advanced Business Intelligence System",
        description="Multi-agent system for comprehensive business analysis"
    )

    market_agent = CustomAgent(
        name="MarketAnalyst",
        role="Senior Market Research Analyst",
        instructions="Analyze market trends, competitor landscape, and business opportunities. Provide data-driven insights with actionable recommendations.",
        tools=[MarketResearchTool()],
        llm=MockChatModel()
    )

    tech_agent = CustomAgent(
        name="TechArchitect",
        role="Technical Architecture Specialist",
        instructions="Evaluate technical solutions, code quality, and architectural decisions. Focus on scalability, maintainability, and best practices.",
        tools=[CodeAnalysisTool()],
        llm=MockChatModel()
    )

    strategy_agent = CustomAgent(
        name="StrategicPlanner",
        role="Strategic Business Planner",
        instructions="Synthesize market and technical insights into comprehensive strategic recommendations. Focus on ROI, risk assessment, and implementation roadmaps.",
        tools=[],
        llm=MockChatModel()
    )

    workflow.add_agent(market_agent)
    workflow.add_agent(tech_agent)
    workflow.add_agent(strategy_agent)

    tasks = [
        "Analyze the current AI framework market landscape and identify key opportunities for a new multi-agent framework targeting enterprise users.",
        """Analyze this code architecture pattern and provide technical assessment:

async def multi_agent_workflow():
    agents = [ResearchAgent(), AnalysisAgent(), SynthesisAgent()]
    context = SharedContext()

    for agent in agents:
        try:
            result = await agent.run(context.get_task())
            if result.success:
                context.add_insight(result.data)
            else:
                context.add_error(result.error)
        except Exception as e:
            logger.error(f"Agent {agent.name} failed: {e}")

    return context.synthesize_recommendations()""",
        "Based on the market analysis and technical assessment, create a comprehensive strategic plan for launching a competitive AI framework with focus on multi-agent capabilities and enterprise adoption."
    ]

    print("\n Executing Advanced Workflow...")
    result = await workflow.run(tasks)

    print("\n Workflow Completed Successfully!")
    print("=" * 50)
    print(" COMPREHENSIVE ANALYSIS RESULTS")
    print("=" * 50)
    print(result["summary"])

    print("\n WORKFLOW MONITORING SUMMARY")
    print("=" * 30)
    summary = workflow.monitor.get_summary()
    print(f"Total Events: {summary['total_events']}")
    print(f"Total Duration: {summary['total_duration']:.2f} seconds")
    print(f"Event Types: {', '.join(summary['event_types'])}")

    return workflow, result

async def simple_tool_demo():
    """Demonstrate individual tool functionality"""

    print("\n Individual Tool Demo")
    print("=" * 30)

    market_tool = MarketResearchTool()
    code_tool = CodeAnalysisTool()

    print("Available Tools:")
    print(f"• {market_tool.name}: {market_tool.description}")
    print(f"• {code_tool.name}: {code_tool.description}")

    print("\n Market Research Analysis:")
    market_result = await market_tool.run("competitor analysis in AI frameworks")
    print(market_result)

    print("\n Code Analysis:")
    sample_code = '''
import asyncio
from typing import List, Dict

class AgentManager:
    """Manages multiple AI agents"""

    def __init__(self):
        self.agents = []
        self.results = []

    async def add_agent(self, agent):
        """Add agent to manager"""
        self.agents.append(agent)

    async def run_all(self, task: str) -> List[Dict]:
        """Run task on all agents"""
        results = []
        for agent in self.agents:
            try:
                result = await agent.execute(task)
                results.append(result)
            except Exception as e:
                print(f"Agent failed: {e}")
                results.append({"error": str(e)})
        return results
'''

    code_result = await code_tool.run(sample_code)
    print(code_result)

We demonstrate two powerful workflows. First, in the individual tool demo, we directly test the capabilities of our MarketResearchTool and CodeAnalysisTool, ensuring they generate relevant insights independently. Then, we bring everything together in the advanced workflow demo, where we deploy three specialized agents, MarketAnalyst, TechArchitect, and StrategicPlanner, to tackle business analysis tasks collaboratively.

async def main():
    """Main demo function"""

    print(" Advanced BeeAI Framework Tutorial")
    print("=" * 40)
    print("This tutorial demonstrates:")
    print("• Multi-agent workflows")
    print("• Custom tool development")
    print("• Memory management")
    print("• Event monitoring")
    print("• Production-ready patterns")

    if BEEAI_AVAILABLE:
        print("• Using real BeeAI Framework")
    else:
        print("• Using custom implementation (BeeAI not available)")

    print("=" * 40)

    await simple_tool_demo()

    print("\n" + "=" * 50)
    await advanced_workflow_demo()

    print("\n Tutorial Complete!")
    print("\nNext Steps:")
    print("1. Install BeeAI Framework properly: pip install beeai-framework")
    print("2. Configure your preferred LLM (OpenAI, Anthropic, local models)")
    print("3. Explore the official BeeAI documentation")
    print("4. Build custom agents for your specific use case")
    print("5. Deploy to production with proper monitoring")

if __name__ == "__main__":
    try:
        import nest_asyncio
        nest_asyncio.apply()
        print(" Applied nest_asyncio for Colab compatibility")
    except ImportError:
        print(" nest_asyncio not available - may not work in some environments")

    asyncio.run(main())

We wrap up our tutorial with the main() function, which ties together everything we’ve built, demonstrating both tool-level capabilities and a full multi-agent business intelligence workflow. Whether we’re running BeeAI natively or using a fallback setup, we ensure compatibility with environments like Google Colab using nest_asyncio. With this structure in place, we’re ready to scale our agent systems, explore deeper use cases, and confidently deploy production-ready AI workflows.

In conclusion, we’ve built and executed a robust multi-agent workflow using the BeeAI framework (or a custom equivalent), showcasing its potential in real-world business intelligence applications. We’ve seen how easy it is to create agents with specific roles, attach tools for task augmentation, and monitor execution in a transparent way.

Check out the Codes. All credit for this research goes to the researchers of this project.
The post A Code Implementation for Designing Intelligent Multi-Agent Workflows with the BeeAI Framework appeared first on MarkTechPost.

Accelerate AI development with Amazon Bedrock API keys

Today, we’re excited to announce a significant improvement to the developer experience of Amazon Bedrock: API keys. API keys provide quick access to the Amazon Bedrock APIs, streamlining the authentication process so that developers can focus on building rather than configuration.
CamelAI is an open-source, modular framework for building intelligent multi-agent systems for data generation, world simulation, and task automation.

“As a startup with limited resources, streamlined customer onboarding is critical to our success. The Amazon Bedrock API keys enable us to onboard enterprise customers in minutes rather than hours. With Bedrock, our customers can quickly provision access to leading AI models and seamlessly integrate them into CamelAI,”
said Miguel Salinas, CTO, CamelAI.

In this post, we explore how API keys work and how you can start using them today.
API key authentication
Amazon Bedrock now provides API key access to streamline integration with tools and frameworks that expect API key-based authentication. The Amazon Bedrock and Amazon Bedrock runtime SDKs support API key authentication for methods including on-demand inference, provisioned throughput inference, model fine-tuning, distillation, and evaluation.
The diagram compares the default authentication process to Amazon Bedrock (in orange) with the API keys approach (in blue). In the default process, you must create an identity in AWS IAM Identity Center or IAM, attach IAM policies to provide permissions to perform API operations, and generate credentials, which you can then use to make API calls. The grey boxes in the diagram highlight the steps that Amazon Bedrock now streamlines when generating an API key. Developers can now authenticate and access Amazon Bedrock APIs with minimal setup overhead.

You can generate API keys in the Amazon Bedrock console, choosing between two types.
With long-term API keys, you can set expiration times ranging from 1 day to no expiration. These keys are associated with an IAM user that Amazon Bedrock automatically creates for you. The system attaches the AmazonBedrockLimitedAccess managed policy to this IAM user, and you can then modify permissions as needed through the IAM service. We recommend using long-term keys primarily for exploration of Amazon Bedrock.
Short-term API keys use the IAM permissions of your current IAM principal and expire when your session ends, or after a maximum of 12 hours. Short-term API keys use AWS Signature Version 4 for authentication. For continuous application use, you can implement API key refreshing with a script as shown in this example. We recommend that you use short-term API keys for setups that require a higher level of security.
Making Your First API Call
Once you have access to foundation models, getting started with Amazon Bedrock API keys is straightforward. Here’s how to make your first API call using the AWS SDK for Python (Boto3 SDK) and API keys:
Generate an API key
To generate an API key, follow these steps:

Sign in to the AWS Management Console and open the Amazon Bedrock console
In the left navigation panel, select API keys
Choose either Generate short-term API key or Generate long-term API key
For long-term keys, set your desired expiration time and optionally configure advanced permissions
Choose Generate and copy your API key

Set Your API Key as Environment Variable
You can set your API key as an environment variable so that it’s automatically recognized when you make API requests:

# To set the API key as an environment variable, you can open a terminal and run the following command:
export AWS_BEARER_TOKEN_BEDROCK=${api-key}

The Boto3 SDK automatically detects your environment variable when you create an Amazon Bedrock client.
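If you prefer to set the variable from within Python (for example, in a notebook) rather than in a shell, a minimal sketch looks like this; the key value is a placeholder you replace with the key generated in the console:

import os
import boto3

# Placeholder: paste the API key generated in the Amazon Bedrock console
os.environ["AWS_BEARER_TOKEN_BEDROCK"] = "<your-api-key>"

# Boto3 picks up the bearer token from the environment when the client is created
client = boto3.client("bedrock-runtime", region_name="us-east-1")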
Make Your First API Call
You can now make API calls to Amazon Bedrock in multiple ways:

Using curl

curl -X POST "https://bedrock-runtime.us-east-1.amazonaws.com/model/us.anthropic.claude-3-5-haiku-20241022-v1:0/converse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AWS_BEARER_TOKEN_BEDROCK" \
  -d '{
    "messages": [
        {
            "role": "user",
            "content": [{"text": "Hello"}]
        }
    ]
  }'

Using the Amazon Bedrock SDK:

import boto3

# Create an Amazon Bedrock client
client = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1"  # If you've configured a default region, you can omit this line
)

# Define the model and message
model_id = "us.anthropic.claude-3-5-haiku-20241022-v1:0"
messages = [{"role": "user", "content": [{"text": "Hello"}]}]

response = client.converse(
    modelId=model_id,
    messages=messages,
)

# Print the response
print(response['output']['message']['content'][0]['text'])

You can also use native libraries like Python Requests:

import requests
import os

url = "https://bedrock-runtime.us-east-1.amazonaws.com/model/us.anthropic.claude-3-5-haiku-20241022-v1:0/converse"

payload = {
    "messages": [
        {
            "role": "user",
            "content": [{"text": "Hello"}]
        }
    ]
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['AWS_BEARER_TOKEN_BEDROCK']}"
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.text)

Bridging developer experience and enterprise security requirements
Enterprise administrators can now streamline their user onboarding to Amazon Bedrock foundation models. With setups that require a higher level of security, administrators can enable short-term API keys for their users. Short-term API keys use AWS Signature Version 4 and existing IAM principals, maintaining established access controls implemented by administrators.
For audit and compliance purposes, all API calls are logged in AWS CloudTrail. API keys are passed as authorization headers to API requests and aren’t logged.
Conclusion
Amazon Bedrock API keys are available in 20 AWS Regions where Amazon Bedrock is available: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Hyderabad, Mumbai, Osaka, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Milan, Paris, Spain, Stockholm, Zurich), and South America (São Paulo). To learn more about API keys in Amazon Bedrock, visit the API Keys documentation in the Amazon Bedrock user guide.
Give API keys a try in the Amazon Bedrock console today and send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

About the Authors
Sofian Hamiti is a technology leader with over 10 years of experience building AI solutions, and leading high-performing teams to maximize customer outcomes. He is passionate about empowering diverse talent to drive global impact and achieve their career aspirations.
Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in product management, engineering, and go-to-market. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing generative AI technologies and driving real-world impact with generative AI.
Nakul Vankadari Ramesh is a Software Development Engineer with over 7 years of experience building large-scale distributed systems. He currently works on the Amazon Bedrock team, helping accelerate the development of generative AI capabilities. Previously, he contributed to Amazon Managed Blockchain, focusing on scalable and reliable infrastructure.
Huong Nguyen is a Principal Product Manager at AWS. She is a product leader at Amazon Bedrock, with 18 years of experience building customer-centric and data-driven products. She is passionate about democratizing responsible machine learning and generative AI to enable customer experience and business innovation. Outside of work, she enjoys spending time with family and friends, listening to audiobooks, traveling, and gardening.
Massimiliano Angelino is Lead Architect for the EMEA Prototyping team. For the last three and a half years, he has been an IoT Specialist Solutions Architect with a particular focus on edge computing, and he contributed to the launch of the AWS IoT Greengrass v2 service and its integration with Amazon SageMaker Edge Manager. Based in Stockholm, he enjoys skating on frozen lakes.

Accelerating data science innovation: How Bayer Crop Science used AWS …

The world’s population is expanding at a rapid rate. The growing global population requires innovative solutions to produce food, fiber, and fuel, while restoring natural resources like soil and water and addressing climate change. Bayer Crop Science estimates farmers need to increase crop production by 50% by 2050 to meet these demands. To support their mission, Bayer Crop Science is collaborating with farmers and partners to promote and scale regenerative agriculture—a future where farming can produce more while restoring the environment.
Regenerative agriculture is a sustainable farming philosophy that aims to improve soil health by incorporating nature to create healthy ecosystems. It’s based on the idea that agriculture should restore degraded soils and reverse degradation, rather than sustain current conditions. The Crop Science Division at Bayer believes regenerative agriculture is foundational to the future of farming. Their vision is to produce 50% more food by restoring nature and scaling regenerative agriculture. To make this mission a reality, Bayer Crop Science is driving model training with Amazon SageMaker and accelerating code documentation with Amazon Q.
In this post, we show how Bayer Crop Science manages large-scale data science operations by training models for their data analytics needs and maintaining high-quality code documentation to support developers. Through these solutions, Bayer Crop Science projects up to a 70% reduction in developer onboarding time and up to a 30% improvement in developer productivity.
Challenges
Bayer Crop Science faced the challenge of scaling genomic predictive modeling to increase its speed to market. It also needed data scientists to focus on building the high-value foundation models (FMs), rather than worrying about constructing and engineering the solution itself. Prior to building their solution, the Decision Science Ecosystem, provisioning a data science environment could take days for a data team within Bayer Crop Science.
Solution overview
Bayer Crop Science’s Decision Science Ecosystem (DSE) is a next-generation machine learning operations (MLOps) solution built on AWS to accelerate data-driven decision making for data science teams at scale across the organization. AWS services assist Bayer Crop Science in creating a connected decision-making system accessible to thousands of data scientists. The company is using the solution for generative AI, product pipeline advancements, geospatial imagery analytics of field data, and large-scale genomic predictive modeling that will allow Bayer Crop Science to become more data-driven and increase speed to market. This solution helps the data scientist at every step, from ideation to model output, including the entire business decision record made using DSE. Other divisions within Bayer are also beginning to build a similar solution on AWS based on the success of DSE.
Bayer Crop Science teams’ DSE integrates cohesively with SageMaker, a fully managed service that lets data scientists quickly build, train, and deploy machine learning (ML) models for different use cases so they can make data-informed decisions quickly. This boosts collaboration within Bayer Crop Science across product supply, R&D, and commercial. Their data science strategy no longer needs self-service data engineering, but rather provides an effective resource to drive fast data engineering at scale. Bayer Crop Science chose SageMaker because it provides a single cohesive experience where data scientists can focus on building high-value models, without having to worry about constructing and engineering the resource itself. With the help of AWS services, cross-functional teams can align quickly to reduce operational costs by minimizing redundancy, addressing bugs early and often, and quickly identifying issues in automated workflows. The DSE solution uses SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), AWS Lambda, and Amazon Simple Storage Service (Amazon S3) to accelerate innovation at Bayer Crop Science and to create a customized, seamless, end-to-end user experience.
The following diagram illustrates the DSE architecture.

Solution walkthrough
Bayer Crop Science had two key challenges in managing large-scale data science operations: maintaining high-quality code documentation and optimizing existing documentation across multiple repositories. With Amazon Q, Bayer Crop Science tackled both challenges, which empowered them to onboard developers more rapidly and improve developer productivity.
The company’s first use case focused on automatically creating high-quality code documentation. When a developer pushes code to a GitHub repository, a webhook—a lightweight, event-driven communication that automatically sends data between applications using HTTP—triggers a Lambda function through Amazon API Gateway. This function then uses Amazon Q to analyze the code changes and generate comprehensive documentation and change summaries. The updated documentation is then stored in Amazon S3. The same Lambda function also creates a pull request with the AI-generated summary of code changes. To maintain security and flexibility, Bayer Crop Science uses Parameter Store, a capability of AWS Systems Manager, to manage prompts for Amazon Q, allowing for quick updates without redeployment, and AWS Secrets Manager to securely handle repository tokens.
This automation significantly reduces the time developers spend creating documentation and pull request descriptions. The generated documentation is also ingested into Amazon Q, so developers can quickly answer questions they have about a repository and onboard onto projects.
The second use case addresses the challenge of maintaining and improving existing code documentation quality. An AWS Batch job, triggered by Amazon EventBridge, processes the code repository. Amazon Q generates new documentation for each code file, which is then indexed along with the source code. The system also generates high-level documentation for each module or functionality and compares the AI-generated documentation with existing human-written documentation. This process makes it possible for Bayer Crop Science to systematically evaluate and enhance their documentation quality over time.
To improve search capabilities, Bayer Crop Science added repository names as custom attributes in the Amazon Q index and prefixed them to indexed content. This enhancement improved the accuracy and relevance of documentation searches. The development team also implemented strategies to handle API throttling and variability in AI responses, maintaining robustness in production environments. Bayer Crop Science is considering developing a management plane to streamline the addition of new repositories and centralize the management of settings, tokens, and prompts. This would further enhance the scalability and ease of use of the system.
Organizations looking to replicate Bayer Crop Science’s success can implement similar webhook-triggered documentation generation, use Amazon Q Business for both generating and evaluating documentation quality, and integrate the solution with existing version control and code review processes. By using AWS services like Lambda, Amazon S3, and Systems Manager, companies can create a scalable and manageable architecture for their documentation needs. Amazon Q Developer also helps organizations further accelerate their development timelines by providing real-time code suggestions and a built-in next-generation chat experience.

“One of the lessons we’ve learned over the last 10 years is that we want to write less code. We want to focus our time and investment on only the things that provide differentiated value to Bayer, and we want to leverage everything we can that AWS provides out of the box. Part of our goal is reducing the development cycles required to transition a model from proof-of-concept phase, to production, and ultimately business adoption. That’s where the value is.”
– Will McQueen, VP, Head of CS Global Data Assets and Analytics at Bayer Crop Science.

Summary
Bayer Crop Science’s approach aligns with modern MLOps practices, enabling data science teams to focus more on high-value modeling tasks rather than time-consuming documentation processes and infrastructure management. By adopting these practices, organizations can significantly reduce the time and effort required for code documentation while improving overall code quality and team collaboration.
Learn more about Bayer Crop Science’s generative AI journey, and discover how Bayer Crop Science is redesigning sustainable practices through cutting-edge technology.
About Bayer
Bayer is a global enterprise with core competencies in the life science fields of health care and nutrition. In line with its mission, “Health for all, Hunger for none,” the company’s products and services are designed to help people and the planet thrive by supporting efforts to understand the major challenges presented by a growing and aging global population. Bayer is committed to driving sustainable development and generating a positive impact with its businesses. At the same time, Bayer aims to increase its earning power and create value through innovation and growth. The Bayer brand stands for trust, reliability, and quality throughout the world. In fiscal 2023, the Group employed around 100,000 people and had sales of 47.6 billion euros. R&D expenses before special items amounted to 5.8 billion euros. For more information, go to www.bayer.com.

About the authors
Lance Smith is a Senior Solutions Architect and part of the Global Healthcare and Life Sciences industry division at AWS. He has spent the last 2 decades helping life sciences companies apply technology in pursuit of their missions to help patients. Outside of work, he loves traveling, backpacking, and spending time with his family.
Kenton Blacutt is an AI Consultant within the Amazon Q Customer Success team. He works hands-on with customers, helping them solve real-world business problems with cutting-edge AWS technologies. In his free time, he likes to travel and run an occasional marathon.
Karthik Prabhakar is a Senior Applications Architect within the AWS Professional Services team. In this role, he collaborates with customers to design and implement cutting-edge solutions for their mission-critical business systems, focusing on areas such as scalability, reliability, and cost optimization in digital transformation and modernization projects.
Jake Malmad is a Senior DevOps Consultant within the AWS Professional Services team, specializing in infrastructure as code, security, containers, and orchestration. As a DevOps consultant, he uses this expertise to work collaboratively with customers, architecting and implementing solutions for automation, scalability, reliability, and security across a wide variety of cloud adoption and transformation engagements.
Nicole Brown is a Senior Engagement Manager within the AWS Professional Services team based in Minneapolis, MN. With over 10 years of professional experience, she has led multidisciplinary, global teams across the healthcare and life sciences industries. She is also a supporter of women in tech and currently holds a board position within the Women at Global Services affinity group.

Combat financial fraud with GraphRAG on Amazon Bedrock Knowledge Bases

Financial fraud detection isn’t just important to banks—it’s essential. With global fraud losses surpassing $40 billion annually and sophisticated criminal networks constantly evolving their tactics, financial institutions face an increasingly complex threat landscape. Today’s fraud schemes operate across multiple accounts, institutions, and channels, creating intricate webs designed specifically to evade detection systems.
Financial institutions have invested heavily in detection capabilities, but the core challenge remains: how to connect the dots across fragmented information landscapes where the evidence of fraud exists not within individual documents or transactions, but in the relationships between them.
In this post, we show how to use Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics to build a financial fraud detection solution.
The limitations of traditional RAG systems
In recent years, Retrieval Augmented Generation (RAG) has emerged as a promising approach for building AI systems grounded in organizational knowledge. However, traditional RAG-based systems have limitations when it comes to complex financial fraud detection.

The fundamental limitation lies in how conventional RAG processes information. Standard RAG retrieves and processes document chunks as isolated units, looking for semantic similarities between a query and individual text passages. This approach works well for straightforward information retrieval, but falls critically short in the following scenarios:

Evidence is distributed across multiple documents and systems
The connections between entities matter more than the entities themselves
Complex relationship chains require multi-hop reasoning
Structural context (like hierarchical document organization) provides critical clues
Entity resolution across disparate references is essential

A fraud analyst intuitively follows connection paths—linking an account to a phone number, that phone number to another customer, and that customer to a known fraud ring. Traditional RAG systems, however, lack this relational reasoning capability, leaving sophisticated fraud networks undetected until losses have already occurred.
Amazon Bedrock Knowledge Bases with GraphRAG for financial fraud detection
Amazon Bedrock Knowledge Bases GraphRAG helps financial institutions implement fraud detection systems without building complex graph infrastructure from scratch. By offering a fully managed service that seamlessly integrates knowledge graph construction, maintenance, and querying with powerful foundation models (FMs), Amazon Bedrock Knowledge Bases dramatically lowers the technical barriers to implementing relationship-aware fraud detection. Financial organizations can now use their existing transaction data, customer profiles, and risk signals within a graph context that preserves the critical connections between entities while benefiting from the natural language understanding of FMs. This powerful combination enables fraud analysts to query complex financial relationships using intuitive natural language to detect suspicious patterns that can result in financial fraud.
Example fraud detection use case
To demonstrate this use case, we use a fictitious bank (AnyCompany Bank) in Australia whose customers hold savings, checking, and credit card accounts with the bank. These customers perform transactions to buy goods and services from merchants across the country using their debit and credit cards. AnyCompany Bank is looking to use the latest advancements in GraphRAG and generative AI technologies to detect subtle patterns in fraudulent behavior that will yield higher accuracy and reduce false positives.

A fraud analyst at AnyCompany Bank wants to use natural language queries to get answers to the following types of queries:

Basic queries – For example, “Show me all the transactions processed by ABC Electronics” or “What accounts does Michael Green own?”
Relationship exploration queries – For example, “Which devices have accessed account A003?” or “Show all relationships between Jane Smith and her devices.”
Temporal pattern detection queries – For example, “Which accounts had transactions and device access on the same day?” or “Which accounts had transactions outside their usual location pattern?”
Fraud detection queries – For example, “Find unusual transaction amounts compared to account history” or “Are there any accounts with failed transactions followed by successful ones within 24 hours?”

Solution overview
To help illustrate the core GraphRAG principles, we have simplified the data model to six key tables: accounts, transactions, individuals, devices, merchants, and relationships. Real-world financial fraud detection systems are much more complex, with hundreds of entity types and intricate relationships, but this example demonstrates the essential concepts that scale to enterprise implementations. The following figure is an example of the accounts table.

The following figure is an example of the individuals table.

The following figure is an example of the devices table.

The following figure is an example of the transactions table.

The following figure is an example of the merchants table.

The following figure is an example of the relationships table.

The following diagram shows the relationships among these entities: accounts, individuals, devices, transactions, and merchants. For example, the individual John Doe uses device D001 to access account A001 to execute transaction T001, which is processed by merchant ABC Electronics.

In the following sections, we demonstrate how to upload documents to Amazon Simple Storage Service (Amazon S3), create a knowledge base using Amazon Bedrock Knowledge Bases, and test the knowledge base by running natural language queries.
Prerequisites
To follow along with this post, make sure you have an active AWS account with appropriate permissions to access Amazon Bedrock and create an S3 bucket to be the data source. Additionally, verify that you have enabled access to both Anthropic’s Claude 3.5 Haiku and an embeddings model, such as Amazon Titan Text Embeddings V2.
Upload documents to Amazon S3
In this step, you create an S3 bucket as the data source and upload the six tables (accounts, individuals, devices, transactions, merchants, and relationships) as Excel data sheets. The following screenshot shows our S3 bucket and its contents.

Create a knowledge base
Complete the following steps to create the knowledge base:

On the Amazon Bedrock console, choose Knowledge Bases under Builder tools in the navigation pane.
Choose Create, then choose Knowledge Base with vector store.

In the Knowledge Base details section, provide the following information:

Enter a meaningful name for the knowledge base.
For IAM permissions, select Create and use a new service role to create a new AWS Identity and Access Management (IAM) role.
For Choose data source, select Amazon S3.
Choose Next.

In the Configure data source section, provide the following information:

Enter a data source name.
For Data source location, select the location of your data source (for example, we select This AWS account).
For S3 source, choose Browse S3 and choose the location where you uploaded the files.
For Parsing strategy, select Amazon Bedrock default parser.
For Chunking strategy, choose Default chunking.
Choose Next.

In the Configure data storage and processing section, provide the following information:

For Embeddings model, choose Titan Text Embeddings V2.
For Vector store creation method, select Quick create a new vector store.
For Vector store type, select Amazon Neptune Analytics (GraphRAG).
Choose Next.

Amazon Bedrock automatically uses Anthropic’s Claude 3 Haiku v1 as the FM to build the graph for our knowledge base, and contextual enrichment is enabled automatically.

Choose Create knowledge base.
Choose the knowledge base when it’s in Available status.

Select the data source and choose Sync, then wait for the sync process to complete.

In the sync process, Amazon Bedrock ingests data files from Amazon S3, creates chunks and embeddings, and automatically extracts entities and relationships, creating the graph.
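The same sync can also be kicked off programmatically. As a rough sketch using the AWS SDK for Python (the knowledge base and data source IDs below are placeholders for the values shown in your console):

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Start ingestion for the S3 data source (placeholder IDs)
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="XXXXXXXXXX",
    dataSourceId="YYYYYYYYYY"
)
print(job["ingestionJob"]["status"])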

Test the knowledge base and run natural language queries
When the sync is complete, you can test the knowledge base.

In the Test Knowledge Base section, choose Select model.
Set the model as Anthropic’s Claude 3.5 Haiku (or another model of your choice) and then choose Apply.

Enter a sample query and choose Run.

Let’s start with some basic queries, such as “Show me all transactions processed by ABC Electronics” or “What accounts does Michael Green own?” The generated responses are shown in the following screenshot.

We can also run some relationship exploration queries, such as “Which devices have accessed account A003?” or “Show all relationships between Jane Smith and her devices.” The generated responses are shown in the following screenshot. To arrive at the response, the model performs multi-hop reasoning, traversing multiple files.

The model can also perform temporal pattern detection queries, such as “Which accounts had transactions and device access on the same day?” or “Which accounts had transactions outside their usual location pattern?” The generated responses are shown in the following screenshot.

Let’s try out some fraud detection queries, such as “Find unusual transaction amounts compared to account history” or “Are there any accounts with failed transactions followed by successful ones within 24 hours?” The generated responses are shown in the following screenshot.

The GraphRAG solution also enables complex relationship queries, such as “Show the complete path from Emma Brown to Pacific Fresh Market” or “Map all connections between the individuals and merchants in the system.” The generated responses are shown in the following screenshot.
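These queries can also be issued programmatically rather than through the console. A minimal sketch with the AWS SDK for Python is shown below; the knowledge base ID is a placeholder, and the model ARN should match the model you enabled:

import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.retrieve_and_generate(
    input={"text": "Which devices have accessed account A003?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "XXXXXXXXXX",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0"
        }
    }
)
print(response["output"]["text"])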

Clean up
To avoid incurring additional costs, clean up the resources you created. This includes deleting the Amazon Bedrock knowledge base, its associated IAM role, and the S3 bucket used for source documents. Additionally, you must separately delete the Neptune Analytics graph that was automatically created by Amazon Bedrock Knowledge Bases during the setup process.
Conclusion
GraphRAG in Amazon Bedrock emerges as a game-changing feature in the fight against financial fraud. By automatically connecting relationships across transaction data, customer profiles, historical patterns, and fraud reports, it significantly enhances financial institutions’ ability to detect complex fraud schemes that traditional systems might miss. Its unique capability to understand and link information across multiple documents and data sources proves invaluable when investigating sophisticated fraud patterns that span various touchpoints and time periods.

For financial institutions and fraud detection teams, GraphRAG’s intelligent document processing means faster, more accurate fraud investigations. It can quickly piece together related incidents, identify common patterns in fraud reports, and connect seemingly unrelated activities that might indicate organized fraud rings. This deeper level of insight, combined with its ability to provide comprehensive, context-aware responses, enables security teams to stay one step ahead of fraudsters who continuously evolve their tactics.

As financial crimes become increasingly sophisticated, GraphRAG in Amazon Bedrock stands as a powerful tool for fraud prevention, transforming how you can analyze, connect, and act on fraud-related information. The future of fraud detection demands tools that can think and connect like humans—and GraphRAG is leading the way in making this possible.

About the Authors
Senaka Ariyasinghe is a Senior Partner Solutions Architect at AWS. He collaborates with Global Systems Integrators to drive cloud innovation across the Asia-Pacific and Japan region. He specializes in helping AWS partners develop and implement scalable, well-architected solutions, with particular emphasis on generative AI, machine learning, cloud migration strategies, and the modernization of enterprise applications.
Senthil Nathan is a Senior Partner Solutions Architect working with Global Systems Integrators at AWS. In his role, Senthil works closely with global partners to help them maximize the value and potential of the AWS Cloud landscape. He is passionate about using the transformative power of cloud computing and emerging technologies to drive innovation and business impact.
Deependra Shekhawat is a Senior Energy and Utilities Industry Specialist Solutions Architect based in Sydney, Australia. In his role, Deependra helps energy companies across the Asia-Pacific and Japan region use cloud technologies to drive sustainability and operational efficiency. He specializes in creating robust data foundations and advanced workflows that enable organizations to harness the power of big data, analytics, and machine learning for solving critical industry challenges.
Aaron Sempf is Next Gen Tech Lead for the AWS Partner Organization in Asia-Pacific and Japan. With over 20 years in distributed system engineering design and development, he focuses on solving for large-scale complex integration and event-driven systems. In his spare time, he can be found coding prototypes for autonomous robots, IoT devices, distributed solutions, and designing agentic architecture patterns for generative AI-assisted business automation.
Ozan Eken is a Product Manager at AWS, passionate about building cutting-edge generative AI and graph analytics products. With a focus on simplifying complex data challenges, Ozan helps customers unlock deeper insights and accelerate innovation. Outside of work, he enjoys trying new foods, exploring different countries, and watching soccer.
JaiPrakash Dave is a Partner Solutions Architect working with Global Systems Integrators at AWS based in India. In his role, JaiPrakash guides AWS partners in the India region to design and scale well-architected solutions, focusing on generative AI, machine learning, DevOps, and application and data modernization initiatives.

Anthropic Proposes Targeted Transparency Framework for Frontier AI Sys …

As the development of large-scale AI systems accelerates, concerns about safety, oversight, and risk management are becoming increasingly critical. In response, Anthropic has introduced a targeted transparency framework aimed specifically at frontier AI models—those with the highest potential impact and risk—while deliberately excluding smaller developers and startups to avoid stifling innovation across the broader AI ecosystem.

Why a Targeted Approach?

Anthropic’s framework addresses the need for differentiated regulatory obligations. It argues that universal compliance requirements could overburden early-stage companies and independent researchers. Instead, the proposal focuses on a narrow class of developers: companies building models that surpass specific thresholds for computational power, evaluation performance, R&D expenditure, and annual revenue. This scope ensures that only the most capable—and potentially hazardous—systems are subject to stringent transparency requirements.

Key Components of the Framework

The proposed framework is structured into four major sections: scope, pre-deployment requirements, transparency obligations, and enforcement mechanisms.

I. Scope

The framework applies to organizations developing frontier models—defined not by model size alone, but by a combination of factors including:

Compute scale

Training cost

Evaluation benchmarks

Total R&D investment

Annual revenue

Importantly, startups and small developers are explicitly excluded, using financial thresholds to prevent unnecessary regulatory overhead. This is a deliberate choice to maintain flexibility and support innovation at the early stages of AI development.

II. Pre-Deployment Requirements

Central to the framework is the requirement for companies to implement a Secure Development Framework (SDF) before releasing any qualifying frontier model.

Key SDF requirements include:

Model Identification: Companies must specify which models the SDF applies to.

Catastrophic Risk Mitigation: Plans must be in place to assess and mitigate catastrophic risks—defined broadly to include Chemical, Biological, Radiological, and Nuclear (CBRN) threats, and autonomous actions by models that contradict developer intent.

Standards and Evaluations: Clear evaluation procedures and standards must be outlined.

Governance: A responsible corporate officer must be assigned for oversight.

Whistleblower Protections: Processes must support internal reporting of safety concerns without retaliation.

Certification: Companies must affirm SDF implementation before deployment.

Recordkeeping: SDFs and their updates must be retained for at least five years.

This structure promotes rigorous pre-deployment risk analysis while embedding accountability and institutional memory.

III. Minimum Transparency Requirements

The framework mandates public disclosure of safety processes and results, with allowances for sensitive or proprietary information.

Covered companies must:

Publish SDFs: These must be posted in a publicly accessible format.

Release System Cards: At deployment or upon adding major new capabilities, documentation (akin to model “nutrition labels”) must summarize testing results, evaluation procedures, and mitigations.

Certify Compliance: A public confirmation that the SDF has been followed, including descriptions of any risk mitigations.

Redactions are allowed for trade secrets or public safety concerns, but any omissions must be justified and flagged.

This strikes a balance between transparency and security, ensuring accountability without risking model misuse or competitive disadvantage.

IV. Enforcement

The framework proposes modest but clear enforcement mechanisms:

False Statements Prohibited: Intentionally misleading disclosures regarding SDF compliance are banned.

Civil Penalties: The Attorney General may seek penalties for violations.

30-Day Cure Period: Companies have an opportunity to rectify compliance failures within 30 days.

These provisions emphasize compliance without creating excessive litigation risk, providing a pathway for responsible self-correction.

Strategic and Policy Implications

Anthropic’s targeted transparency framework serves as both a regulatory proposal and a norm-setting initiative. It aims to establish baseline expectations for frontier model development before regulatory regimes are fully in place. By anchoring oversight in structured disclosures and responsible governance—rather than blanket rules or model bans—it provides a blueprint that could be adopted by policymakers and peer companies alike.

The framework’s modular structure could also evolve. As risk signals, deployment scales, or technical capabilities change, the thresholds and compliance requirements can be revised without upending the entire system. This design is particularly valuable in a field as fast-moving as frontier AI.

Conclusion

Anthropic’s proposal for a Targeted Transparency Framework offers a pragmatic middle ground between unchecked AI development and overregulation. It places meaningful obligations on developers of the most powerful AI systems—those with the greatest potential for societal harm—while allowing smaller players to operate without excessive compliance burdens.

As governments, civil society, and the private sector wrestle with how to regulate foundation models and frontier systems, Anthropic’s framework provides a technically grounded, proportionate, and enforceable path forward.

Check out the Technical details. All credit for this research goes to the researchers of this project.
The post Anthropic Proposes Targeted Transparency Framework for Frontier AI Systems appeared first on MarkTechPost.

Google AI Just Open-Sourced a MCP Toolbox to Let AI Agents Query Datab …

Google has released the MCP Toolbox for Databases, a new open-source module under its GenAI Toolbox aimed at simplifying the integration of SQL databases into AI agents. The release is part of Google’s broader strategy to advance the Model Context Protocol (MCP), a standardized approach that allows language models to interact with external systems—including tools, APIs, and databases—using structured, typed interfaces.

This toolbox addresses a growing need: enabling AI agents to interact with structured data repositories like PostgreSQL and MySQL in a secure, scalable, and efficient manner. Traditionally, building such integrations requires managing authentication, connection handling, schema alignment, and security controls—introducing friction and complexity. The MCP Toolbox removes much of this burden, making integration possible with less than 10 lines of Python and minimal configuration.

Why This Matters for AI Workflows

Databases are essential for storing and querying operational and analytical data. In enterprise and production contexts, AI agents need to access these data sources to perform tasks like reporting, customer support, monitoring, and decision automation. However, connecting large language models (LLMs) directly to SQL databases introduces operational and security concerns such as unsafe query generation, poor connection lifecycle management, and exposure of sensitive credentials.

The MCP Toolbox for Databases solves these problems by providing:

Built-in support for credential-based authentication

Secure and scalable connection pooling

Schema-aware tool interfaces for structured querying

MCP-compliant input/output formats for compatibility with LLM orchestration frameworks

Key Technical Highlights

Minimal Configuration, Maximum Usability

The toolbox allows developers to integrate databases with AI agents using a configuration-driven setup. Instead of dealing with raw credentials or managing individual connections, developers can simply define their database type and environment, and the toolbox handles the rest. This abstraction reduces the boilerplate and risk associated with manual integration.

Native Support for MCP-Compliant Tooling

All tools generated through the toolbox conform to the Model Context Protocol, which defines structured input/output formats for tool interactions. This standardization improves interpretability and safety by constraining LLM interactions through schemas rather than free-form text. These tools can be used directly in agent orchestration frameworks such as LangChain or Google’s own agent infrastructure.

The structured nature of MCP-compliant tools also aids in prompt engineering, allowing LLMs to reason more effectively and safely when interacting with external systems.

Connection Pooling and Authentication

The database interface includes native support for connection pooling to handle concurrent queries efficiently—especially important in multi-agent or high-traffic systems. Authentication is handled securely through environment-based configurations, reducing the need to hard-code credentials or expose them during runtime.

This design minimizes risks such as leaking credentials or overwhelming a database with concurrent requests, making it suitable for production-grade deployment.

Schema-Aware Query Generation

One of the core advantages of this toolbox is its ability to introspect database schemas and make them available to LLMs or agents. This enables safe, schema-validated querying. By mapping out the structure of tables and their relationships, the agent gains situational awareness and can avoid generating invalid or unsafe queries.

This schema grounding also enhances the performance of natural language to SQL pipelines by improving query generation reliability and reducing hallucinations.

Use Cases

The MCP Toolbox for Databases supports a broad range of applications:

Customer service agents that retrieve user information from relational databases in real time

BI assistants that answer business metric questions by querying analytical databases

DevOps bots that monitor database status and report anomalies

Autonomous data agents for ETL, reporting, and compliance verification tasks

Because it’s built on open protocols and popular Python libraries, the toolbox is easily extensible and fits into existing LLM-agent workflows.

Fully Open Source

The module is part of the fully open-source GenAI Toolbox released under the Apache 2.0 license. It builds on established packages such as sqlalchemy to ensure compatibility with a wide range of databases and deployment environments. Developers can fork, customize, or contribute to the module as needed.

Conclusion

The MCP Toolbox for Databases represents an important step in operationalizing AI agents in data-rich environments. By removing integration overhead and embedding best practices for security and performance, Google is enabling developers to bring AI to the heart of enterprise data systems. The combination of structured interfaces, lightweight setup, and open-source flexibility makes this release a compelling foundation for building production-ready AI agents with reliable database access.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.
The post Google AI Just Open-Sourced a MCP Toolbox to Let AI Agents Query Databases Safely and Efficiently appeared first on MarkTechPost.

Implementing a Tool-Enabled Multi-Agent Workflow with Python, OpenAI API, and PrimisAI Nexus

In this advanced tutorial, we aim to build a multi-agent task automation system using the PrimisAI Nexus framework, which is fully integrated with the OpenAI API. Our primary objective is to demonstrate how hierarchical supervision, intelligent tool utilization, and structured outputs can facilitate the coordination of multiple AI agents to perform complex tasks, ranging from planning and development to quality assurance and data analysis. As we walk through each phase, we don’t just build individual agents; we architect a collaborative ecosystem where each agent has a clear role, responsibilities, and smart tools to accomplish the task.

!pip install primisai openai nest-asyncio

import os
import nest_asyncio
from primisai.nexus.core import AI, Agent, Supervisor
from primisai.nexus.utils.debugger import Debugger
import json

nest_asyncio.apply()

We begin by installing the core dependencies: Primisai for agent orchestration, OpenAI for LLM access, and nest_asyncio to handle Colab’s event loop quirks. After applying nest_asyncio, we ensure the notebook is ready to execute asynchronous tasks seamlessly, a key requirement for multi-agent execution.

print("PrimisAI Nexus Advanced Tutorial with OpenAI API")
print("=" * 55)

os.environ["OPENAI_API_KEY"] = "Use Your Own API Key Here"

# llm_config = {
#     "api_key": os.environ["OPENAI_API_KEY"],
#     "model": "gpt-4o-mini",
#     "base_url": "https://api.openai.com/v1",
#     "temperature": 0.7
# }

llm_config = {
    "api_key": os.environ["OPENAI_API_KEY"],
    "model": "gpt-3.5-turbo",
    "base_url": "https://api.openai.com/v1",
    "temperature": 0.7
}

print("API Configuration:")
print(f"• Model: {llm_config['model']}")
print(f"• Base URL: {llm_config['base_url']}")
print("• Note: OpenAI has limited free tokens through April 2025")
print("• Alternative: Consider Puter.js for unlimited free access")

To power our agents, we connect to OpenAI’s models, starting with gpt-3.5-turbo for cost-efficient tasks. We store our API key in environment variables and construct a configuration dictionary specifying the model, temperature, and base URL. This section allows us to flexibly switch between models, such as gpt-4o-mini or gpt-4o, depending on task complexity and cost.

code_schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string", "description": "Code explanation"},
        "code": {"type": "string", "description": "Python code implementation"},
        "language": {"type": "string", "description": "Programming language"},
        "complexity": {"type": "string", "enum": ["beginner", "intermediate", "advanced"]},
        "test_cases": {"type": "array", "items": {"type": "string"}, "description": "Example usage"}
    },
    "required": ["description", "code", "language"]
}

analysis_schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string", "description": "Brief analysis summary"},
        "insights": {"type": "array", "items": {"type": "string"}, "description": "Key insights"},
        "recommendations": {"type": "array", "items": {"type": "string"}, "description": "Action items"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "methodology": {"type": "string", "description": "Analysis approach used"}
    },
    "required": ["summary", "insights", "confidence"]
}

planning_schema = {
    "type": "object",
    "properties": {
        "tasks": {"type": "array", "items": {"type": "string"}, "description": "List of tasks to complete"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "estimated_time": {"type": "string", "description": "Time estimate"},
        "dependencies": {"type": "array", "items": {"type": "string"}, "description": "Task dependencies"}
    },
    "required": ["tasks", "priority"]
}

We define JSON schemas for three agent types: CodeWriter, Data Analyst, and Project Planner. These schemas enforce structure in the agent’s responses, making the output machine-readable and predictable. It helps us ensure that the system returns consistent data, such as code blocks, insights, or project timelines, even when different LLMs are behind the scenes.

def calculate_metrics(data_str):
    """Calculate comprehensive statistics for numerical data"""
    try:
        data = json.loads(data_str) if isinstance(data_str, str) else data_str
        if isinstance(data, list) and all(isinstance(x, (int, float)) for x in data):
            import statistics
            return {
                "mean": statistics.mean(data),
                "median": statistics.median(data),
                "mode": statistics.mode(data) if len(set(data)) < len(data) else "No mode",
                "std_dev": statistics.stdev(data) if len(data) > 1 else 0,
                "max": max(data),
                "min": min(data),
                "count": len(data),
                "sum": sum(data)
            }
        return {"error": "Invalid data format - expecting array of numbers"}
    except Exception as e:
        return {"error": f"Could not parse data: {str(e)}"}

def validate_code(code):
    """Advanced code validation with syntax and basic security checks"""
    try:
        dangerous_imports = ['os', 'subprocess', 'eval', 'exec', '__import__']
        security_warnings = []

        for danger in dangerous_imports:
            if danger in code:
                security_warnings.append(f"Potentially dangerous: {danger}")

        compile(code, '<string>', 'exec')

        return {
            "valid": True,
            "message": "Code syntax is valid",
            "security_warnings": security_warnings,
            "lines": len(code.split('\n'))
        }
    except SyntaxError as e:
        return {
            "valid": False,
            "message": f"Syntax error: {e}",
            "line": getattr(e, 'lineno', 'unknown'),
            "security_warnings": []
        }

def search_documentation(query):
    """Simulate searching documentation (placeholder function)"""
    docs = {
        "python": "Python is a high-level programming language",
        "list": "Lists are ordered, mutable collections in Python",
        "function": "Functions are reusable blocks of code",
        "class": "Classes define objects with attributes and methods"
    }

    results = []
    for key, value in docs.items():
        if query.lower() in key.lower():
            results.append(f"{key}: {value}")

    return {
        "query": query,
        "results": results if results else ["No documentation found"],
        "total_results": len(results)
    }

Next, we add custom tools that agents could call, such as calculate_metrics for statistical summaries, validate_code for syntax and security checks, and search_documentation for simulated programming help. These tools extend the agents’ abilities, turning them from simple chatbots into interactive, utility-driven workers capable of autonomous reasoning and validation.

print("\n Setting up Multi-Agent Hierarchy with OpenAI")

main_supervisor = Supervisor(
    name="ProjectManager",
    llm_config=llm_config,
    system_message="You are a senior project manager coordinating development and analysis tasks. Delegate appropriately, provide clear summaries, and ensure quality delivery. Always consider time estimates and dependencies."
)

dev_supervisor = Supervisor(
    name="DevManager",
    llm_config=llm_config,
    is_assistant=True,
    system_message="You manage development tasks. Coordinate between coding, testing, and code review. Ensure best practices and security."
)

analysis_supervisor = Supervisor(
    name="AnalysisManager",
    llm_config=llm_config,
    is_assistant=True,
    system_message="You manage data analysis and research tasks. Ensure thorough analysis, statistical rigor, and actionable insights."
)

qa_supervisor = Supervisor(
    name="QAManager",
    llm_config=llm_config,
    is_assistant=True,
    system_message="You manage quality assurance and testing. Ensure thorough validation and documentation."
)

To simulate a real-world management structure, we create a multi-tiered hierarchy. A ProjectManager serves as the root supervisor, overseeing three assistant supervisors (DevManager, AnalysisManager, and QAManager), each in charge of domain-specific agents. This modular hierarchy allows tasks to flow down from high-level strategy to granular execution.

code_agent = Agent(
    name="CodeWriter",
    llm_config=llm_config,
    system_message="You are an expert Python developer. Write clean, efficient, well-documented code with proper error handling. Always include test cases and follow PEP 8 standards.",
    output_schema=code_schema,
    tools=[{
        "metadata": {
            "function": {
                "name": "validate_code",
                "description": "Validates Python code syntax and checks for security issues",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "Python code to validate"}
                    },
                    "required": ["code"]
                }
            }
        },
        "tool": validate_code
    }, {
        "metadata": {
            "function": {
                "name": "search_documentation",
                "description": "Search for programming documentation and examples",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string", "description": "Documentation topic to search for"}
                    },
                    "required": ["query"]
                }
            }
        },
        "tool": search_documentation
    }],
    use_tools=True
)

review_agent = Agent(
    name="CodeReviewer",
    llm_config=llm_config,
    system_message="You are a senior code reviewer. Analyze code for best practices, efficiency, security, maintainability, and potential issues. Provide constructive feedback and suggestions.",
    keep_history=True,
    tools=[{
        "metadata": {
            "function": {
                "name": "validate_code",
                "description": "Validates code syntax and security",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "Code to validate"}
                    },
                    "required": ["code"]
                }
            }
        },
        "tool": validate_code
    }],
    use_tools=True
)

analyst_agent = Agent(
    name="DataAnalyst",
    llm_config=llm_config,
    system_message="You are a data scientist specializing in statistical analysis and insights generation. Provide thorough analysis with confidence metrics and actionable recommendations.",
    output_schema=analysis_schema,
    tools=[{
        "metadata": {
            "function": {
                "name": "calculate_metrics",
                "description": "Calculates comprehensive statistics for numerical data",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "data_str": {"type": "string", "description": "JSON string of numerical data array"}
                    },
                    "required": ["data_str"]
                }
            }
        },
        "tool": calculate_metrics
    }],
    use_tools=True
)

planner_agent = Agent(
    name="ProjectPlanner",
    llm_config=llm_config,
    system_message="You are a project planning specialist. Break down complex projects into manageable tasks with realistic time estimates and clear dependencies.",
    output_schema=planning_schema
)

tester_agent = Agent(
    name="QATester",
    llm_config=llm_config,
    system_message="You are a QA specialist focused on comprehensive testing strategies, edge cases, and quality assurance.",
    tools=[{
        "metadata": {
            "function": {
                "name": "validate_code",
                "description": "Validates code for testing",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "Code to test"}
                    },
                    "required": ["code"]
                }
            }
        },
        "tool": validate_code
    }],
    use_tools=True
)

We then build a diverse set of specialized agents: CodeWriter for generating Python code, CodeReviewer for reviewing logic and security, DataAnalyst for performing structured data analysis, ProjectPlanner for task breakdown, and QATester for quality checks. Each agent has domain-specific tools, output schemas, and system instructions tailored to their role.

dev_supervisor.register_agent(code_agent)
dev_supervisor.register_agent(review_agent)
analysis_supervisor.register_agent(analyst_agent)
qa_supervisor.register_agent(tester_agent)

main_supervisor.register_agent(dev_supervisor)
main_supervisor.register_agent(analysis_supervisor)
main_supervisor.register_agent(qa_supervisor)
main_supervisor.register_agent(planner_agent)

All agents are registered under their respective supervisors, and the assistant supervisors are, in turn, registered with the main supervisor. This setup creates a fully linked agent ecosystem, where instructions could cascade from the top-level agent to any specialist agent in the network.

print("\n Agent Hierarchy:")
main_supervisor.display_agent_graph()

print("\n Testing Full Multi-Agent Communication")
print("-" * 45)

try:
    test_response = main_supervisor.chat("Hello! Please introduce your team and explain how you coordinate complex projects.")
    print(" Supervisor communication test successful!")
    print(f"Response preview: {test_response[:200]}...")
except Exception as e:
    print(f" Supervisor test failed: {str(e)}")
    print("Falling back to direct agent testing...")

We visualize the entire hierarchy using display_agent_graph() to confirm our structure. It offers a clear view of how each agent is connected within the broader task management flow, a helpful diagnostic before deployment.

print("\n Complex Multi-Agent Task Execution")
print("-" * 40)

complex_task = """Create a Python function that implements a binary search algorithm,
have it reviewed for optimization, tested thoroughly, and provide a project plan
for integrating it into a larger search system."""

print(f"Complex Task: {complex_task}")

try:
    complex_response = main_supervisor.chat(complex_task)
    print(" Complex task completed")
    print(f"Response: {complex_response[:300]}...")
except Exception as e:
    print(f" Complex task failed: {str(e)}")

We give the full system a real-world task: create a binary search function, review it, test it, and plan its integration into a larger project. The ProjectManager seamlessly coordinates agents across development, QA, and planning, demonstrating the true power of hierarchical, tool-driven agent orchestration.

print("\n Tool Integration & Structured Outputs")
print("-" * 43)

print("Testing Code Agent with tools...")
try:
    code_response = code_agent.chat("Create a function to calculate fibonacci numbers with memoization")
    print(" Code Agent with tools: Working")
    print(f"Response type: {type(code_response)}")

    if isinstance(code_response, str) and code_response.strip().startswith('{'):
        code_data = json.loads(code_response)
        print(f" - Description: {code_data.get('description', 'N/A')[:50]}...")
        print(f" - Language: {code_data.get('language', 'N/A')}")
        print(f" - Complexity: {code_data.get('complexity', 'N/A')}")
    else:
        print(f" - Raw response: {code_response[:100]}...")

except Exception as e:
    print(f" Code Agent error: {str(e)}")

print("\nTesting Analyst Agent with tools...")
try:
    analysis_response = analyst_agent.chat("Analyze this sales data: [100, 150, 120, 180, 200, 175, 160, 190, 220, 185]. What trends do you see?")
    print(" Analyst Agent with tools: Working")

    if isinstance(analysis_response, str) and analysis_response.strip().startswith('{'):
        analysis_data = json.loads(analysis_response)
        print(f" - Summary: {analysis_data.get('summary', 'N/A')[:50]}...")
        print(f" - Confidence: {analysis_data.get('confidence', 'N/A')}")
        print(f" - Insights count: {len(analysis_data.get('insights', []))}")
    else:
        print(f" - Raw response: {analysis_response[:100]}...")

except Exception as e:
    print(f" Analyst Agent error: {str(e)}")

We directly test the capabilities of two specialized agents using real prompts. We first ask the CodeWriter agent to generate a Fibonacci function with memoization and validate that it returns structured output containing a code description, language, and complexity level. Then, we evaluate the DataAnalyst agent by feeding it sample sales data to extract trends.

print("\n Manual Tool Usage")
print("-" * 22)

# Test all tools manually
sample_data = "[95, 87, 92, 88, 91, 89, 94, 90, 86, 93]"
metrics_result = calculate_metrics(sample_data)
print(f"Statistics for {sample_data}:")
for key, value in metrics_result.items():
    print(f" {key}: {value}")

print("\nCode validation test:")
test_code = """
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
"""
validation_result = validate_code(test_code)
print(f"Validation result: {validation_result}")

print("\nDocumentation search test:")
doc_result = search_documentation("python function")
print(f"Search results: {doc_result}")

We step outside the agent framework to test each tool directly. First, we use the calculate_metrics tool on a dataset of ten numbers, confirming it correctly returned statistics such as mean, median, mode, and standard deviation. Next, we run the validate_code tool on a sample binary search function, which confirms both syntactic correctness and flags no security warnings. Finally, we test the search_documentation tool with the query “python function” and receive relevant documentation snippets, verifying its ability to efficiently simulate contextual lookup.

print("\n Advanced Multi-Agent Workflow")
print("-" * 35)

workflow_stages = [
    ("Planning", "Create a project plan for building a web scraper for news articles"),
    ("Development", "Implement the web scraper with error handling and rate limiting"),
    ("Review", "Review the web scraper code for security and efficiency"),
    ("Testing", "Create comprehensive test cases for the web scraper"),
    ("Analysis", "Analyze sample scraped data: [45, 67, 23, 89, 12, 56, 78, 34, 91, 43]")
]

workflow_results = {}

for stage, task in workflow_stages:
    print(f"\n{stage} Stage: {task}")
    try:
        if stage == "Planning":
            response = planner_agent.chat(task)
        elif stage == "Development":
            response = code_agent.chat(task)
        elif stage == "Review":
            response = review_agent.chat(task)
        elif stage == "Testing":
            response = tester_agent.chat(task)
        elif stage == "Analysis":
            response = analyst_agent.chat(task)

        workflow_results[stage] = response
        print(f" {stage} completed: {response[:80]}...")

    except Exception as e:
        print(f" {stage} failed: {str(e)}")
        workflow_results[stage] = f"Error: {str(e)}"

We simulate a five-stage project lifecycle: planning, development, review, testing, and analysis. Each task is passed to the most relevant agent, and responses are collected to evaluate performance. This demonstrates the framework’s capability to manage end-to-end workflows without manual intervention.

print("\n System Monitoring & Performance")
print("-" * 37)

debugger = Debugger(name="OpenAITutorialDebugger")
debugger.log("Advanced OpenAI tutorial execution completed successfully")

print(f"Main Supervisor ID: {main_supervisor.workflow_id}")

We activate the Debugger tool to track the performance of our session and log system events. We also print the main supervisor’s workflow_id as a traceable identifier, useful when managing multiple workflows in production.

In conclusion, we have successfully built a fully automated, OpenAI-compatible multi-agent system using PrimisAI Nexus. Each agent operates with clarity, precision, and autonomy, whether writing code, validating logic, analyzing data, or breaking down complex workflows. Our hierarchical structure allows for seamless task delegation and modular scalability. PrimisAI Nexus framework establishes a robust foundation for automating real-world tasks, whether in software development, research, planning, or data operations, through intelligent collaboration between specialized agents.

Check out the Codes. All credit for this research goes to the researchers of this project.

How INRIX accelerates transportation planning with Amazon Bedrock

This post is co-written with Shashank Saraogi, Nat Gale, and Durran Kelly from INRIX.
The complexity of modern traffic management extends far beyond mere road monitoring, encompassing massive amounts of data collected worldwide from connected cars, mobile devices, roadway sensors, and major event monitoring systems. For transportation authorities managing urban, suburban, and rural traffic flow, the challenge lies in effectively processing and acting upon this vast network of information. The task requires balancing immediate operational needs, such as real-time traffic redirection during incidents, with strategic long-term planning for improved mobility and safety.
Traditionally, analyzing these complex data patterns and producing actionable insights has been a resource-intensive process requiring extensive collaboration. With recent advances in generative AI, there is an opportunity to transform how we process, understand, and act upon transportation data, enabling more efficient and responsive traffic management systems.
In this post, we partnered with Amazon Web Services (AWS) customer INRIX to demonstrate how Amazon Bedrock can be used to determine the best countermeasures for specific city locations using rich transportation data and how such countermeasures can be automatically visualized in street view images. This approach allows for significant planning acceleration compared to traditional approaches using conceptual drawings.
INRIX pioneered the use of GPS data from connected vehicles for transportation intelligence. For over 20 years, INRIX has been a leader in probe-based connected vehicle and device data and insights, powering automotive, enterprise, and public sector use cases. INRIX’s products range from tickerized datasets that inform investment decisions for the financial services sector to digital twins for the public rights-of-way in the cities of Philadelphia and San Francisco. INRIX was the first company to develop a crowd-sourced traffic network, and they continue to lead in real-time mobility operations.
In June 2024, the State of California’s Department of Transportation (Caltrans) selected INRIX for a proof of concept for a generative AI-powered solution to improve safety for vulnerable road users (VRUs). The problem statement sought to harness the combination of Caltrans’ asset, crash, and points-of-interest (POI) data and INRIX’s 50 petabyte (PB) data lake to anticipate high-risk locations and quickly generate empirically validated safety measures to mitigate the potential for crashes. Trained on real-time and historical data and industry research and manuals, the solution provides a new systemic, safety-based methodology for risk assessment, location prioritization, and project implementation.
Solution overview
INRIX announced INRIX Compass in November 2023. INRIX Compass is an application that harnesses generative AI and INRIX’s 50 PB data lake to solve transportation challenges. This solution uses INRIX Compass countermeasures as the input, AWS serverless architecture, and Amazon Nova Canvas as the image visualizer. Key components include:

Countermeasures generation:

INRIX Compass generates the countermeasures for a selected location
Amazon API Gateway and Amazon Elastic Kubernetes Service (Amazon EKS) manage API requests and responses
Amazon Bedrock Knowledge Bases and Anthropic’s Claude Models provide Retrieval Augmented Generation (RAG) implementation

Image visualization:

API Gateway and AWS Lambda receive and process visualization requests, calling Amazon Bedrock
Amazon Bedrock with model access to Amazon Nova Canvas provide image generation and in-painting

The following diagram shows the architecture of INRIX Compass.

INRIX Compass for countermeasures
By using INRIX Compass, users can ask natural language queries such as, “Where are the top five locations with the highest risk for vulnerable road users?” and “Can you recommend a suite of proven safety countermeasures at each of these locations?” Furthermore, users can probe deeper into the roadway characteristics that contribute to risk factors, and find similar locations in the roadway network that meet those conditions. Behind the scenes, Compass AI uses RAG and Amazon Bedrock powered foundation models (FMs) to query the roadway network to identify and prioritize locations with systemic risk factors and anomalous safety patterns. The solution provides prioritized recommendations for operational and design solutions and countermeasures based on industry knowledge.
The following image shows the interface of INRIX Compass.

Image visualization for countermeasures
The generation of countermeasure suggestions represents the initial phase in transportation planning. The crucial next step is preparing conceptual drawings that visualize those countermeasures. This process has traditionally been time-consuming due to the involvement of multiple specialized teams, including:

Transportation engineers who assess technical feasibility and safety standards
Urban planners who verify alignment with city development goals
Landscape architects who integrate environmental and aesthetic elements
CAD or visualization specialists who create detailed technical drawings
Safety analysts who evaluate the potential impact on road safety
Public works departments who oversee implementation feasibility
Traffic operations teams who assess impact on traffic flow and management

These teams work collaboratively, creating and iteratively refining various visualizations based on feedback from urban designers and other stakeholders. Each iteration cycle typically involves multiple rounds of reviews, adjustments, and approvals, often extending the timeline significantly. The complexity is further amplified by city-specific rules and design requirements, which often necessitate significant customization. Additionally, local regulations, environmental considerations, and community feedback must be incorporated into the design process. Consequently, this lengthy and costly process frequently leads to delays in implementing safety countermeasures.
To streamline this challenge, INRIX has pioneered an innovative approach to the visualization phase by using generative AI technology. This prototyped solution enables rapid iteration of conceptual drawings that can be efficiently reviewed by various teams, potentially reducing the design cycle from weeks to days. Moreover, the system incorporates a few-shot learning approach with reference images and carefully crafted prompts, allowing for seamless integration of city-specific requirements into the generated outputs. This approach not only accelerates the design process but also supports consistency across different projects while maintaining compliance with local standards.
The following image shows the congestion insights by INRIX Compass.

Amazon Nova Canvas for conceptual visualizations
INRIX developed and prototyped this solution using Amazon Nova models. Amazon Nova Canvas delivers advanced image processing through text-to-image generation and image-to-image transformation capabilities. The model provides sophisticated controls for adjusting color schemes and manipulating layouts to achieve desired visual outcomes. To promote responsible AI implementation, Amazon Nova Canvas incorporates built-in safety measures, including watermarking and content moderation systems.
The model supports a comprehensive range of image editing operations. These operations encompass basic image generation, object removal from existing images, object replacement within scenes, creation of image variations, and modification of image backgrounds. This versatility makes Amazon Nova Canvas suitable for a wide range of professional applications requiring sophisticated image editing.
The following sample images show an example of countermeasures visualization.

In-painting implementation in Compass AI
Amazon Nova Canvas integrates with INRIX Compass’s existing natural language analytics capabilities. The original Compass system generated text-based countermeasure recommendations based on:

Historical transportation data analysis
Current environmental conditions
User-specified requirements

The INRIX Compass visualization feature specifically uses the image generation and in-painting capabilities of Amazon Nova Canvas. In-painting enables object replacement through two distinct approaches:

A binary mask precisely defines the areas targeted for replacement.
Text prompts identify objects for replacement, allowing the model to interpret and modify the specified elements while maintaining visual coherence with the surrounding image context.

In both cases, this functionality provides seamless integration of new elements while preserving the overall image composition and contextual relevance. The developed interface accommodates both image generation and in-painting approaches, providing comprehensive image editing capabilities.

The implementation follows a two-stage process for visualizing transportation countermeasures. Initially, the system employs image generation functionality to create street-view representations corresponding to specific longitude and latitude coordinates where interventions are proposed. Following the initial image creation, the in-painting capability enables precise placement of countermeasures within the generated street view scene. This sequential approach provides accurate visualization of proposed modifications within the actual geographical context.
An Amazon Bedrock API facilitates image editing and generation through the Amazon Nova Canvas model. The responses contain the generated or modified images in base64 format, which can be decoded and processed for further use in the application. The generative AI capabilities of Amazon Bedrock enable rapid iteration and simultaneous visualization of multiple countermeasures within a single image. RAG implementation can further extend the pipeline’s capabilities by incorporating county-specific regulations, standardized design patterns, and contextual requirements. The integration of these technologies significantly streamlines the countermeasure deployment workflow. Traditional manual visualization processes that previously required extensive time and resources can now be executed efficiently through automated generation and modification. This automation delivers substantial improvements in both time-to-deployment and cost-effectiveness.
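The following is a minimal sketch of such a call with boto3, assuming the Nova Canvas request schema with taskType and inPaintingParams; the file names, mask prompt, and edit text are illustrative and do not reflect INRIX's actual implementation.

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Illustrative input: a base64-encoded street-view image of the project location.
with open("street_view.png", "rb") as f:
    source_image = base64.b64encode(f.read()).decode("utf-8")

request = {
    "taskType": "INPAINTING",
    "inPaintingParams": {
        "image": source_image,
        "maskPrompt": "the curbside parking lane",  # region to replace (assumed prompt)
        "text": "a protected bike lane with green paint and flexible bollards",
    },
    "imageGenerationConfig": {"numberOfImages": 1, "cfgScale": 8.0},
}

response = bedrock.invoke_model(
    modelId="amazon.nova-canvas-v1:0",
    body=json.dumps(request),
)
result = json.loads(response["body"].read())

# The response carries base64-encoded images that can be decoded for review.
with open("countermeasure_concept.png", "wb") as f:
    f.write(base64.b64decode(result["images"][0]))
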
Conclusion
The partnership between INRIX and AWS showcases the transformative potential of AI in solving complex transportation challenges. By using Amazon Bedrock FMs, INRIX has turned their massive 50 PB data lake into actionable insights through effective visualization solutions. This post highlighted a single specific transportation use case, but Amazon Bedrock and Amazon Nova power a wide spectrum of applications, from text generation to video creation. The combination of extensive data and advanced AI capabilities continues to pave the way for smarter, more efficient transportation systems worldwide.
For more information, check out the documentation for Amazon Nova Foundation Models, Amazon Bedrock, and INRIX Compass.

About the authors
Arun is a Senior Solutions Architect at AWS, supporting enterprise customers in the Pacific Northwest. He’s passionate about solving business and technology challenges as an AWS customer advocate, with his recent interest being AI strategy. When not at work, Arun enjoys listening to podcasts, going for short trail runs, and spending quality time with his family.
Alicja Kwasniewska, PhD, is an AI leader driving generative AI innovations in enterprise solutions and decision intelligence for customer engagements in North America, advertisement and marketing verticals at AWS. She is recognized among the top 10 women in AI and 100 women in data science. Alicja published in more than 40 peer-reviewed publications. She also serves as a reviewer for top-tier conferences, including ICML, NeurIPS, and ICCV. She advises organizations on AI adoption, bridging research and industry to accelerate real-world AI applications.
Shashank is the VP of Engineering at INRIX, where he leads multiple verticals, including generative AI and traffic. He is passionate about using technology to make roads safer for drivers, bikers, and pedestrians every day. Prior to working at INRIX, he held engineering leadership roles at Amazon and Lyft. Shashank brings deep experience in building impactful products and high-performing teams at scale. Outside of work, he enjoys traveling, listening to music, and spending time with his family.
Nat Gale is the Head of Product at INRIX, where he manages the Safety and Traffic product verticals. Nat leads the development of data products and software that help transportation professionals make smart, more informed decisions. He previously ran the City of Los Angeles’ Vision Zero program and was the Director of Capital Projects and Operations for the City of Hartford, CT.
Durran is a Lead Software Engineer at INRIX, where he designs scalable backend systems and mentors engineers across multiple product lines. With over a decade of experience in software development, he specializes in distributed systems, generative AI, and cloud infrastructure. Durran is passionate about writing clean, maintainable code and sharing best practices with the developer community. Outside of work, he enjoys spending quality time with his family and deepening his Japanese language skills.

Qwen3 family of reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

Today, we are excited to announce that Qwen3, the latest generation of large language models (LLMs) in the Qwen family, is available through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, you can deploy the Qwen3 models—available in 0.6B, 4B, 8B, and 32B parameter sizes—to build, experiment, and responsibly scale your generative AI applications on AWS.
In this post, we demonstrate how to get started with Qwen3 on Amazon Bedrock Marketplace and SageMaker JumpStart. You can follow similar steps to deploy the distilled versions of the models as well.
Solution overview
Qwen3 is the latest generation of LLMs in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

Unique support of seamless switching between thinking mode and non-thinking mode within a single model, providing optimal performance across various scenarios.
Significantly enhanced in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
Good human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open source models in complex agent-based tasks.
Support for over 100 languages and dialects with strong capabilities for multilingual instruction following and translation.

Prerequisites
To deploy Qwen3 models, make sure you have access to the recommended instance types based on the model size. You can find these instance recommendations on Amazon Bedrock Marketplace or the SageMaker JumpStart console. To verify you have the necessary resources, complete the following steps:

Open the Service Quotas console.
Under AWS Services, select Amazon SageMaker.
Check that you have sufficient quota for the required instance type for endpoint deployment.
Make sure at least one of these instance types is available in your target AWS Region.

If needed, request a quota increase and contact your AWS account team for support.
Deploy Qwen3 in Amazon Bedrock Marketplace
Amazon Bedrock Marketplace gives you access to over 100 popular, emerging, and specialized foundation models (FMs) through Amazon Bedrock. To access Qwen3 in Amazon Bedrock, complete the following steps:

On the Amazon Bedrock console, in the navigation pane under Foundation models, choose Model catalog.
Filter for Hugging Face as a provider and choose a Qwen3 model. For this example, we use the Qwen3-32B model.

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.
The page also includes deployment options and licensing information to help you get started with Qwen3-32B in your applications.

To begin using Qwen3-32B, choose Deploy.

You will be prompted to configure the deployment details for Qwen3-32B. The model ID will be pre-populated.

For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters).
For Number of instances, enter a number of instances (between 1–100).
For Instance type, choose your instance type. For optimal performance with Qwen3-32B, a GPU-based instance type like ml.g5.12xlarge is recommended.
To deploy the model, choose Deploy.

When the deployment is complete, you can test Qwen3-32B’s capabilities directly in the Amazon Bedrock playground.

Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters like temperature and maximum length.

This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results. You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with any Amazon Bedrock APIs, you must have the endpoint Amazon Resource Name (ARN).
Enable reasoning and non-reasoning responses with Converse API
The following code shows how to turn reasoning on and off with Qwen3 models using the Converse API, depending on your use case. By default, reasoning is left on for Qwen3 models, but you can streamline interactions by using the /no_think command within your prompt. When you add this to the end of your query, reasoning is turned off and the models will provide just the direct answer. This is particularly useful when you need quick information without explanations, are familiar with the topic, or want to maintain a faster conversational flow. At the time of writing, the Converse API doesn’t support tool use for Qwen3 models. Refer to the Invoke_Model API example later in this post to learn how to use reasoning and tools in the same completion.

import boto3
from botocore.exceptions import ClientError

# Create a Bedrock Runtime client in the AWS Region you want to use.
client = boto3.client("bedrock-runtime", region_name="us-west-2")

# Configuration
model_id = ""  # Replace with Bedrock Marketplace endpoint arn

# Start a conversation with the user message.
user_message = "hello, what is 1+1 /no_think"  # remove /no_think to leave default reasoning on
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

try:
    # Send the message to the model, using a basic inference configuration.
    response = client.converse(
        modelId=model_id,
        messages=conversation,
        inferenceConfig={"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
    )

    # Extract and print the response text.
    # response_text = response["output"]["message"]["content"][0]["text"]
    # reasoning_content = response["output"]["message"]["reasoning_content"][0]["text"]
    # print(response_text, reasoning_content)
    print(response)

except (ClientError, Exception) as e:
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    exit(1)
    exit(1)

The following is a response using the Converse API, without default thinking:

{'ResponseMetadata': {'RequestId': 'f7f3953a-5747-4866-9075-fd4bd1cf49c4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:34:47 GMT', 'content-type': 'application/json', 'content-length': '282', 'connection': 'keep-alive', 'x-amzn-requestid': 'f7f3953a-5747-4866-9075-fd4bd1cf49c4'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '\n\nHello! The result of 1 + 1 is **2**. 😊'}, {'reasoningContent': {'reasoningText': {'text': '\n\n'}}}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 20, 'outputTokens': 22, 'totalTokens': 42}, 'metrics': {'latencyMs': 1125}}

The following is an example with default thinking on; the <think> tokens are automatically parsed into the reasoningContent field for the Converse API:

{'ResponseMetadata': {'RequestId': 'b6d2ebbe-89da-4edc-9a3a-7cb3e7ecf066', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:32:28 GMT', 'content-type': 'application/json', 'content-length': '1019', 'connection': 'keep-alive', 'x-amzn-requestid': 'b6d2ebbe-89da-4edc-9a3a-7cb3e7ecf066'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '\n\nHello! The sum of 1 + 1 is **2**. Let me know if you have any other questions or need further clarification! 😊'}, {'reasoningContent': {'reasoningText': {'text': '\nOkay, the user asked "hello, what is 1+1". Let me start by acknowledging their greeting. They might just be testing the water or actually need help with a basic math problem. Since it\'s 1+1, it\'s a very simple question, but I should make sure to answer clearly. Maybe they\'re a child learning math for the first time, or someone who\'s not confident in their math skills. I should provide the answer in a friendly and encouraging way. Let me confirm that 1+1 equals 2, and maybe add a brief explanation to reinforce their understanding. I can also offer further assistance in case they have more questions. Keeping it conversational and approachable is key here.\n'}}}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 16, 'outputTokens': 182, 'totalTokens': 198}, 'metrics': {'latencyMs': 7805}}

Perform reasoning and function calls in the same completion using the Invoke_Model API
With Qwen3, you can stream an explicit trace and the exact JSON tool call in the same completion. Up until now, reasoning models have forced the choice to either show the chain of thought or call tools deterministically. The following code shows an example:

import json

body = json.dumps({
    "messages": [
        {
            "role": "user",
            "content": "Hi! How are you doing today?"
        },
        {
            "role": "assistant",
            "content": "I'm doing well! How can I help you?"
        },
        {
            "role": "user",
            "content": "Can you tell me what the temperature will be in Dallas, in fahrenheit?"
        }
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'San Francisco'"
                    },
                    "state": {
                        "type": "string",
                        "description": "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city", "state", "unit"]
            }
        }
    }],
    "tool_choice": "auto"
})

response = client.invoke_model(
    modelId=model_id,
    body=body
)
print(response)
model_output = json.loads(response['body'].read())
print(json.dumps(model_output, indent=2))

Response:

{'ResponseMetadata': {'RequestId': '5da8365d-f4bf-411d-a783-d85eb3966542', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:57:38 GMT', 'content-type': 'application/json', 'content-length': '1148', 'connection': 'keep-alive', 'x-amzn-requestid': '5da8365d-f4bf-411d-a783-d85eb3966542', 'x-amzn-bedrock-invocation-latency': '6396', 'x-amzn-bedrock-output-token-count': '148', 'x-amzn-bedrock-input-token-count': '198'}, 'RetryAttempts': 0}, 'contentType': 'application/json', 'body': <botocore.response.StreamingBody object at 0x7f7d4a598dc0>}
{
  "id": "chatcmpl-bc60b482436542978d233b13dc347634",
  "object": "chat.completion",
  "created": 1750186651,
  "model": "lmi",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "\nOkay, the user is asking about the weather in San Francisco. Let me check the tools available. There's a get_weather function that requires location and unit. The user didn't specify the unit, so I should ask them if they want Celsius or Fahrenheit. Alternatively, maybe I can assume a default, but since the function requires it, I need to include it. I'll have to prompt the user for the unit they prefer.\n",
        "content": "\n\nThe user hasn't specified whether they want the temperature in Celsius or Fahrenheit. I need to ask them to clarify which unit they prefer.\n\n",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-fb2f93f691ed4d8ba94cadc52b57414e",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, CA\", \"unit\": \"celsius\"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 198,
    "total_tokens": 346,
    "completion_tokens": 148,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Deploy Qwen3-32B with SageMaker JumpStart
SageMaker JumpStart is a machine learning (ML) hub with FMs, built-in algorithms, and prebuilt ML solutions that you can deploy with just a few clicks. With SageMaker JumpStart, you can customize pre-trained models to your use case, with your data, and deploy them into production using either the UI or SDK. Deploying the Qwen3-32B model through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.
Deploy Qwen3-32B through SageMaker JumpStart UI
Complete the following steps to deploy Qwen3-32B using SageMaker JumpStart:

On the SageMaker console, choose Studio in the navigation pane.
First-time users will be prompted to create a domain.
On the SageMaker Studio console, choose JumpStart in the navigation pane.

The model browser displays available models, with details like the provider name and model capabilities.

Search for Qwen3 to view the Qwen3-32B model card.

Each model card shows key information, including:

Model name
Provider name
Task category (for example, Text Generation)
Bedrock Ready badge (if applicable), indicating that this model can be registered with Amazon Bedrock, so you can use Amazon Bedrock APIs to invoke the model

Choose the model card to view the model details page.

The model details page includes the following information:

The model name and provider information
A Deploy button to deploy the model
About and Notebooks tabs with detailed information

The About tab includes important details, such as:

Model description
License information
Technical specifications
Usage guidelines

Before you deploy the model, it’s recommended to review the model details and license terms to confirm compatibility with your use case.

Choose Deploy to proceed with deployment.
For Endpoint name, use the automatically generated name or create a custom one.
For Instance type, choose an instance type (default: ml.g6.12xlarge).
For Initial instance count, enter the number of instances (default: 1).

Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.

Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
Choose Deploy to deploy the model.

The deployment process can take several minutes to complete.
When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint. You can monitor the deployment progress on the SageMaker console Endpoints page, which will display relevant metrics and status information. When the deployment is complete, you can invoke the model using a SageMaker runtime client and integrate it with your applications.
Deploy Qwen3-32B using the SageMaker Python SDK
To get started with Qwen3-32B using the SageMaker Python SDK, you must install the SageMaker Python SDK and make sure you have the necessary AWS permissions and environment set up. The following is a step-by-step code example that demonstrates how to deploy and use Qwen3-32B for inference programmatically:

!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

sagemaker_session = Session()
artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# Changed to Qwen32B model
js_model_id = "huggingface-reasoning-qwen3-32b"
gpu_instance_type = "ml.g5.12xlarge"

response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {
        "max_new_tokens": 128,
        "top_p": 0.9,
        "temperature": 0.6
    }
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

model = model_builder.build()

predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)},
    accept_eula=True
)

predictor.predict(sample_input)

You can run additional requests against the predictor:

new_input = {
    "inputs": "What is Amazon doing in Generative AI?",
    "parameters": {"max_new_tokens": 64, "top_p": 0.8, "temperature": 0.7},
}

prediction = predictor.predict(new_input)
print(prediction)

The following example adds error handling and retry logic as best practices to make the deployment code more robust:

# Enhanced deployment code with error handling
import backoff
import botocore
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@backoff.on_exception(backoff.expo,
                     (botocore.exceptions.ClientError,),
                     max_tries=3)
def deploy_model_with_retries(model_builder, model_id):
    try:
        model = model_builder.build()
        predictor = model.deploy(
            model_access_configs={model_id:ModelAccessConfig(accept_eula=True)},
            accept_eula=True
        )
        return predictor
    except Exception as e:
        logger.error(f"Deployment failed: {str(e)}")
        raise

def safe_predict(predictor, input_data):
    try:
        return predictor.predict(input_data)
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        return None

Clean up
To avoid unwanted charges, complete the steps in this section to clean up your resources.
Delete the Amazon Bedrock Marketplace deployment
If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Marketplace deployments.
In the Managed deployments section, locate the endpoint you want to delete.
Select the endpoint, and on the Actions menu, choose Delete.
Verify the endpoint details to make sure you’re deleting the correct deployment:

Endpoint name
Model name
Endpoint status

Choose Delete to delete the endpoint.
In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart predictor
The SageMaker JumpStart model you deployed will incur costs if you leave it running. Use the following code to delete the endpoint if you want to stop incurring charges. For more details, see Delete Endpoints and Resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion
In this post, we explored how you can access and deploy the Qwen3 models using Amazon Bedrock Marketplace and SageMaker JumpStart. With support for both the full parameter models and its distilled versions, you can choose the optimal model size for your specific use case. Visit SageMaker JumpStart in Amazon SageMaker Studio or Amazon Bedrock Marketplace to get started. For more information, refer to Use Amazon Bedrock tooling with Amazon SageMaker JumpStart models, SageMaker JumpStart pretrained models, Amazon SageMaker JumpStart Foundation Models, Amazon Bedrock Marketplace, and Getting started with Amazon SageMaker JumpStart.
The Qwen3 family of LLMs offers exceptional versatility and performance, making it a valuable addition to the AWS foundation model offerings. Whether you’re building applications for content generation, analysis, or complex reasoning tasks, Qwen3’s advanced architecture and extensive context window make it a powerful choice for your generative AI needs.

About the authors
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.
Avan Bala is a Solutions Architect at AWS. His area of focus is AI for DevOps and machine learning. He holds a bachelor’s degree in Computer Science with a minor in Mathematics and Statistics from the University of Maryland. Avan is currently working with the Enterprise Engaged East Team and likes to specialize in projects about emerging AI technologies.
Mohhid Kidwai is a Solutions Architect at AWS. His area of focus is generative AI and machine learning solutions for small-medium businesses. He holds a bachelor’s degree in Computer Science with a minor in Biological Science from North Carolina State University. Mohhid is currently working with the SMB Engaged East Team at AWS.
Yousuf Athar is a Solutions Architect at AWS specializing in generative AI and AI/ML. With a Bachelor’s degree in Information Technology and a concentration in Cloud Computing, he helps customers integrate advanced generative AI capabilities into their systems, driving innovation and competitive edge. Outside of work, Yousuf loves to travel, watch sports, and play football.
John Liu has 15 years of experience as a product executive and 9 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols, fintech companies and also spent 9 years as a portfolio manager at various hedge funds.
Rohit Talluri is a Generative AI GTM Specialist at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.
Varun Morishetty is a Software Engineer with Amazon SageMaker JumpStart and Bedrock Marketplace. Varun received his Bachelor’s degree in Computer Science from Northeastern University. In his free time, he enjoys cooking, baking and exploring New York City.

Build a just-in-time knowledge base with Amazon Bedrock

Software as a service (SaaS) companies managing multiple tenants face a critical challenge: efficiently extracting meaningful insights from vast document collections while controlling costs. Traditional approaches often lead to unnecessary spending on unused storage and processing resources, impacting both operational efficiency and profitability. Organizations need solutions that intelligently scale processing and storage resources based on actual tenant usage patterns while maintaining data isolation. Traditional Retrieval Augmented Generation (RAG) systems consume valuable resources by ingesting and maintaining embeddings for documents that might never be queried, resulting in unnecessary storage costs and reduced system efficiency. Systems designed to handle large numbers of small to mid-sized tenants can exceed cost structure and infrastructure limits or might need to use silo-style deployments to keep each tenant’s information and usage separate. Adding to this complexity, many projects are transitory in nature, with work being completed on an intermittent basis, leading to data occupying space in knowledge base systems that could be used by other active tenants.
To address these challenges, this post presents a just-in-time knowledge base solution that reduces unused consumption through intelligent document processing. The solution processes documents only when needed and automatically removes unused resources, so organizations can scale their document repositories without proportionally increasing infrastructure costs.
With a multi-tenant architecture with configurable limits per tenant, service providers can offer tiered pricing models while maintaining strict data isolation, making it ideal for SaaS applications serving multiple clients with varying needs. Automatic document expiration through Time-to-Live (TTL) makes sure the system remains lean and focused on relevant content, while refreshing the TTL for frequently accessed documents maintains optimal performance for information that matters. This architecture also makes it possible to limit the number of files each tenant can ingest at a specific time and the rate at which tenants can query a set of files.

This solution uses serverless technologies to alleviate operational overhead and provide automatic scaling, so teams can focus on business logic rather than infrastructure management. By organizing documents into groups with metadata-based filtering, the system enables contextual querying that delivers more relevant results while maintaining security boundaries between tenants.

The architecture’s flexibility supports customization of tenant configurations, query rates, and document retention policies, making it adaptable to evolving business requirements without significant rearchitecting.
Solution overview
This architecture combines several AWS services to create a cost-effective, multi-tenant knowledge base solution that processes documents on demand. The key components include:

Vector-based knowledge base – Uses Amazon Bedrock and Amazon OpenSearch Serverless for efficient document processing and querying
On-demand document ingestion – Implements just-in-time processing using the Amazon Bedrock CUSTOM data source type
TTL management – Provides automatic cleanup of unused documents using the TTL feature in Amazon DynamoDB
Multi-tenant isolation – Enforces secure data separation between users and organizations with configurable resource limits

The solution enables granular control through metadata-based filtering at the user, tenant, and file level. The DynamoDB TTL tracking system supports tiered pricing structures, where tenants can pay for different TTL durations, document ingestion limits, and query rates.
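For reference, the tenant tiers described later in this post can be captured in a single configuration document that the ingestion code reads from the TENANTS environment variable. The following is a minimal sketch: the Tenants and FilesTTLHours keys are the ones the ingestion code later in this post relies on, while Id, MaxFiles, and QueriesPerMinute are hypothetical key names used only for illustration.

# Illustrative tenant tier configuration, serialized to JSON and supplied to the
# ingestion function through the TENANTS environment variable.
# 'Tenants' and 'FilesTTLHours' match the ingestion code later in this post;
# 'Id', 'MaxFiles', and 'QueriesPerMinute' are hypothetical key names.
tenant_config = {
    "Tenants": [
        {"Id": "free", "FilesTTLHours": 7 * 24, "MaxFiles": 5, "QueriesPerMinute": 5},
        {"Id": "standard", "FilesTTLHours": 30 * 24, "MaxFiles": 100, "QueriesPerMinute": 10},
        {"Id": "premium", "FilesTTLHours": 90 * 24, "MaxFiles": 1000, "QueriesPerMinute": 50},
    ]
}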
The following diagram illustrates the key components and workflow of the solution.

The workflow consists of the following steps:

The user logs in to the system, which attaches a tenant ID to the current user for calls to the Amazon Bedrock knowledge base. This authentication step is crucial because it establishes the security context and makes sure subsequent interactions are properly associated with the correct tenant. The tenant ID becomes the foundational piece of metadata that enables proper multi-tenant isolation and resource management throughout the entire workflow.
After authentication, the user creates a project that will serve as a container for the files they want to query. This project creation step establishes the organizational structure needed to manage related documents together. The system generates appropriate metadata and creates the necessary database entries to track the project’s association with the specific tenant, enabling proper access control and resource management at the project level.
With a project established, the user can begin uploading files. The system manages this process by generating pre-signed URLs for secure file upload. As files are uploaded, they are stored in Amazon Simple Storage Service (Amazon S3), and the system automatically creates entries in DynamoDB that associate each file with both the project and the tenant. This three-way relationship (file-project-tenant) is essential for maintaining proper data isolation and enabling efficient querying later.
When a user requests to create a chat with a knowledge base for a specific project, the system begins ingesting the project files using the CUSTOM data source. This is where the just-in-time processing begins. During ingestion, the system applies a TTL value based on the tenant’s tier-specific TTL interval. The TTL makes sure project files remain available during the chat session while setting up the framework for automatic cleanup later. This step represents the core of the on-demand processing strategy, because files are only processed when they are needed.
Each chat session actively updates the TTL for the project files being used. This dynamic TTL management makes sure frequently accessed files remain in the knowledge base while allowing rarely used files to expire naturally. The system continually refreshes the TTL values based on actual usage, creating an efficient balance between resource availability and cost optimization. This approach maintains optimal performance for actively used content while helping to prevent resource waste on unused documents. A minimal sketch of this refresh is shown after these steps.
After the chat session ends and the TTL value expires, the system automatically removes files from the knowledge base. This cleanup process is triggered by Amazon DynamoDB Streams monitoring TTL expiration events, which activate an AWS Lambda function to remove the expired documents. This final step reduces the load on the underlying OpenSearch Serverless cluster and optimizes system resources, making sure the knowledge base remains lean and efficient.
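The TTL refresh performed during each chat session (described above) can be implemented as a single DynamoDB update per file. The following is a minimal sketch, assuming the same table layout as the ingestion code later in this post; the table name, the lastAccessedAt attribute, and the refresh_ttl_on_access helper are hypothetical.

import time

import boto3

dynamodb = boto3.resource('dynamodb')
# The table name is a placeholder; the key schema matches the ingestion code.
knowledge_base_files_table = dynamodb.Table('KnowledgeBaseFiles')

def refresh_ttl_on_access(file_ids, ttl_hours):
    """Extend the TTL for every project file used in the current chat session."""
    now = int(time.time())
    new_ttl = now + int(ttl_hours) * 3600
    for file_id in file_ids:
        knowledge_base_files_table.update_item(
            Key={'id': file_id},
            UpdateExpression='SET #ttl = :ttl, lastAccessedAt = :now',
            ExpressionAttributeNames={'#ttl': 'ttl'},  # avoid reserved word conflicts
            ExpressionAttributeValues={':ttl': new_ttl, ':now': now}
        )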

Prerequisites
You need the following prerequisites before you can proceed with the solution. For this post, we use the us-east-1 AWS Region.

An active AWS account with permissions to create resources in us-east-1
The AWS Command Line Interface (AWS CLI) installed
The AWS Cloud Development Kit (AWS CDK) installed
Git installed to clone the repository

Deploy the solution
Complete the following steps to deploy the solution:

Download the AWS CDK project from the GitHub repo.
Install the project dependencies:

npm run install:all

Deploy the solution:

npm run deploy

Create a user and log in to the system after validating your email.

Validate the knowledge base and run a query
Before allowing users to chat with their documents, the system performs the following steps:

Performs a validation check to determine if documents need to be ingested. This process happens transparently to the user and includes checking document status in DynamoDB and the knowledge base; a minimal version of this check is sketched after this list.
Validates that the required documents are successfully ingested and properly indexed before allowing queries.
Returns both the AI-generated answers and relevant citations to source documents, maintaining traceability and empowering users to verify the accuracy of responses.
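The document status check can be implemented by reading each file’s record from DynamoDB before the first query. The following is a minimal sketch under the same table layout as the ingestion code; the table name and the get_files_needing_ingestion helper are hypothetical.

import time

import boto3

dynamodb = boto3.resource('dynamodb')
# The table name is a placeholder; the attributes match the ingestion code.
knowledge_base_files_table = dynamodb.Table('KnowledgeBaseFiles')

def get_files_needing_ingestion(file_ids):
    """Return the subset of files that are missing, expired, or not yet ready."""
    now = int(time.time())
    needs_ingestion = []
    for file_id in file_ids:
        item = knowledge_base_files_table.get_item(Key={'id': file_id}).get('Item')
        if not item or item.get('documentStatus') != 'ready' or int(item.get('ttl', 0)) <= now:
            needs_ingestion.append(file_id)
    return needs_ingestion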

The following screenshot illustrates an example of chatting with the documents.

Looking at the following example method for file ingestion, note how file information is stored in DynamoDB with a TTL value for automatic expiration. The ingest_knowledge_base_documents call includes essential metadata (user ID, tenant ID, and project ID), enabling precise filtering of this tenant’s files in subsequent operations.

# Ingesting files with tenant-specific TTL values
import json
import os
import time
import uuid

# knowledge_base_files_table (a DynamoDB Table resource), bedrock_agent (the
# bedrock-agent boto3 client), KNOWLEDGE_BASE_ID, DATA_SOURCE_ID, and the
# find_tenant helper are initialized elsewhere in the Lambda module.

def ingest_files(user_id, tenant_id, project_id, files):
    # Get tenant configuration and calculate TTL
    tenants = json.loads(os.environ.get('TENANTS'))['Tenants']
    tenant = find_tenant(tenant_id, tenants)
    ttl = int(time.time()) + (int(tenant['FilesTTLHours']) * 3600)

    # For each file, create a record with TTL and start ingestion
    for file in files:
        file_id = file['id']
        s3_key = file.get('s3Key')
        bucket = file.get('bucket')

        # Create a record in the knowledge base files table with TTL
        knowledge_base_files_table.put_item(
            Item={
                'id': file_id,
                'userId': user_id,
                'tenantId': tenant_id,
                'projectId': project_id,
                'documentStatus': 'ready',
                'createdAt': int(time.time()),
                'ttl': ttl  # TTL value for automatic expiration
            }
        )

        # Start the ingestion job with tenant, user, and project metadata for filtering
        bedrock_agent.ingest_knowledge_base_documents(
            knowledgeBaseId=KNOWLEDGE_BASE_ID,
            dataSourceId=DATA_SOURCE_ID,
            clientToken=str(uuid.uuid4()),
            documents=[
                {
                    'content': {
                        'dataSourceType': 'CUSTOM',
                        'custom': {
                            'customDocumentIdentifier': {
                                'id': file_id
                            },
                            's3Location': {
                                'uri': f"s3://{bucket}/{s3_key}"
                            },
                            'sourceType': 'S3_LOCATION'
                        }
                    },
                    'metadata': {
                        'type': 'IN_LINE_ATTRIBUTE',
                        'inlineAttributes': [
                            {'key': 'userId', 'value': {'stringValue': user_id, 'type': 'STRING'}},
                            {'key': 'tenantId', 'value': {'stringValue': tenant_id, 'type': 'STRING'}},
                            {'key': 'projectId', 'value': {'stringValue': project_id, 'type': 'STRING'}},
                            {'key': 'fileId', 'value': {'stringValue': file_id, 'type': 'STRING'}}
                        ]
                    }
                }
            ]
        )

During a query, you can use the associated metadata to construct parameters that make sure you only retrieve files belonging to this specific tenant. For example:

    filter_expression = {
        "andAll": [
            {
                "equals": {
                    "key": "tenantId",
                    "value": tenant_id
                }
            },
            {
                "equals": {
                    "key": "projectId",
                    "value": project_id
                }
            },
            {
                "in": {
                    "key": "fileId",
                    "value": file_ids
                }
            }
        ]
    }

    # Create base parameters for the API call
    # (bedrock_agent_runtime is the bedrock-agent-runtime boto3 client, initialized elsewhere)
    retrieve_params = {
        'input': {
            'text': query
        },
        'retrieveAndGenerateConfiguration': {
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': knowledge_base_id,
                'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0',
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': limit,
                        'filter': filter_expression
                    }
                }
            }
        }
    }
    response = bedrock_agent_runtime.retrieve_and_generate(**retrieve_params)
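The retrieve_and_generate response contains both the generated answer and the citations that back it, which is what lets the application return sources alongside each reply. A minimal sketch of surfacing them, assuming the standard response shape of the bedrock-agent-runtime client:

# Surface the generated answer and its source documents to the user.
answer = response['output']['text']
source_uris = set()
for citation in response.get('citations', []):
    for reference in citation.get('retrievedReferences', []):
        s3_location = reference.get('location', {}).get('s3Location', {})
        if s3_location.get('uri'):
            source_uris.add(s3_location['uri'])

print(answer)
print('Sources:', sorted(source_uris))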

Manage the document lifecycle with TTL
To further optimize resource usage and costs, you can implement an intelligent document lifecycle management system using the DynamoDB Time-to-Live (TTL) feature. This consists of the following steps:

When a document is ingested into the knowledge base, a record is created with a configurable TTL value.
This TTL is refreshed when the document is accessed.
A DynamoDB stream, filtered for TTL expiration events, is used to trigger a cleanup Lambda function (an example stream filter is shown after the function code).
The Lambda function removes expired documents from the knowledge base.

See the following code:

# Lambda function triggered by DynamoDB Streams when TTL expires items
import uuid

# bedrock_agent (the bedrock-agent boto3 client), knowledge_base_id, and
# data_source_id are initialized elsewhere in the Lambda module.

def lambda_handler(event, context):
    """
    This function is triggered by DynamoDB Streams when TTL expires items.
    It removes expired documents from the knowledge base.
    """

    # Process each record in the event
    for record in event.get('Records', []):
        # Check if this is a REMOVE event from the DynamoDB stream
        if record.get('eventName') == 'REMOVE':
            # Check if the removal was performed by the DynamoDB TTL service
            user_identity = record.get('userIdentity', {})
            if user_identity.get('type') == 'Service' and user_identity.get('principalId') == 'dynamodb.amazonaws.com':
                # Extract the file ID from the record keys
                keys = record.get('dynamodb', {}).get('Keys', {})
                file_id = keys.get('id', {}).get('S')

                # Delete the document from the knowledge base
                bedrock_agent.delete_knowledge_base_documents(
                    clientToken=str(uuid.uuid4()),
                    knowledgeBaseId=knowledge_base_id,
                    dataSourceId=data_source_id,
                    documentIdentifiers=[
                        {
                            'custom': {
                                'id': file_id
                            },
                            'dataSourceType': 'CUSTOM'
                        }
                    ]
                )
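Because only TTL-driven deletions matter to this function, the stream trigger itself can be narrowed with Lambda event filtering so the function is not invoked for ordinary deletes. The following is a minimal sketch using boto3; the stream ARN and function name are placeholders, and the filter pattern follows the documented shape of TTL-expired stream records.

import json

import boto3

lambda_client = boto3.client('lambda')

# Only invoke the cleanup function for items removed by the DynamoDB TTL service.
ttl_expiry_pattern = {
    'eventName': ['REMOVE'],
    'userIdentity': {
        'type': ['Service'],
        'principalId': ['dynamodb.amazonaws.com']
    }
}

lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:dynamodb:us-east-1:111122223333:table/KnowledgeBaseFiles/stream/2025-01-01T00:00:00.000',  # placeholder stream ARN
    FunctionName='knowledge-base-cleanup',  # placeholder function name
    StartingPosition='LATEST',
    FilterCriteria={'Filters': [{'Pattern': json.dumps(ttl_expiry_pattern)}]}
)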

Multi-tenant isolation with tiered service levels
Our architecture enables sophisticated multi-tenant isolation with tiered service levels:

Tenant-specific document filtering – Each query includes user, tenant, and file-specific filters, so the system only searches the documents that belong to the requesting tenant and project.
Configurable TTL values – Different tenant tiers can have different TTL configurations. For example:

Free tier: Five documents ingested with a 7-day TTL and five queries per minute.
Standard tier: 100 documents ingested with a 30-day TTL and 10 queries per minute.
Premium tier: 1,000 documents ingested with a 90-day TTL and 50 queries per minute.
You can configure additional limits, such as total queries per month or total ingested files per day or month.
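Enforcing these tiers amounts to checking the tenant’s configuration before accepting an upload or a query. A minimal sketch of the ingestion-limit check, reusing the hypothetical MaxFiles key from the configuration example earlier in this post:

def enforce_ingestion_limit(tenant, requested_files, current_file_count):
    """Reject an upload batch that would exceed the tenant's ingestion limit."""
    max_files = int(tenant['MaxFiles'])  # hypothetical key, see the TENANTS example above
    if current_file_count + len(requested_files) > max_files:
        raise ValueError(
            f"Ingestion limit of {max_files} files exceeded: "
            f"{current_file_count} already ingested for this tenant."
        )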

Clean up
To clean up the resources created in this post, run the following command from the same location where you performed the deploy step:

npm run destroy

Conclusion
The just-in-time knowledge base architecture presented in this post transforms document management across multiple tenants by processing documents only when queried, reducing the unused consumption of traditional RAG systems. This serverless implementation uses Amazon Bedrock, OpenSearch Serverless, and the DynamoDB TTL feature to create a lean system with intelligent document lifecycle management, configurable tenant limits, and strict data isolation, which is essential for SaaS providers offering tiered pricing models.
This solution directly addresses cost structure and infrastructure limitations of traditional systems, particularly for deployments handling numerous small to mid-sized tenants with transitory projects. This architecture combines on-demand document processing with automated lifecycle management, delivering a cost-effective, scalable resource that empowers organizations to focus on extracting insights rather than managing infrastructure, while maintaining security boundaries between tenants.
Ready to implement this architecture? The full sample code is available in the GitHub repository.

About the author
Steven Warwick is a Senior Solutions Architect at AWS, where he leads customer engagements to drive successful cloud adoption and specializes in SaaS architectures and Generative AI solutions. He produces educational content including blog posts and sample code to help customers implement best practices, and has led programs on GenAI topics for solution architects. Steven brings decades of technology experience to his role, helping customers with architectural reviews, cost optimization, and proof-of-concept development.