Meet SymTorch: A PyTorch Library that Translates Deep Learning Models into Human-Readable Equations

Can symbolic regression be the key to transforming opaque deep learning models into interpretable, closed-form mathematical equations? Say you have trained your deep learning model. It works. But do you know what it has actually learned? A team of University of Cambridge researchers proposes ‘SymTorch’, a library designed to integrate symbolic regression (SR) into deep learning workflows. It enables researchers to approximate neural network components with closed-form mathematical expressions, facilitating functional interpretability and potential inference acceleration.

https://arxiv.org/pdf/2602.21307

Core Mechanism: The Wrap-Distill-Switch Workflow

SymTorch simplifies the engineering required to extract symbolic equations from trained models by automating data movement and hook management.

Wrap: Users apply the SymbolicModel wrapper to any nn.Module or callable function.

Distill: The library registers forward hooks to record input and output activations during a forward pass. These are cached and transferred from the GPU to the CPU for symbolic regression via PySR.

Switch: Once distilled, the original neural weights can be replaced with the discovered equation in the forward pass using switch_to_symbolic.
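The wrap-distill-switch flow can be sketched in plain Python. This is a toy stand-in, not SymTorch's actual API: the `SymbolicWrapper` class is hypothetical, and NumPy's `polyfit` plays the role that PySR plays in the real library.

```python
import numpy as np

class SymbolicWrapper:
    """Toy illustration of a wrap-distill-switch wrapper (hypothetical API)."""
    def __init__(self, fn):
        self.fn = fn                          # the wrapped neural component
        self.inputs, self.outputs = [], []    # activation cache (stand-in for hooks)
        self.symbolic = None                  # discovered equation, once distilled

    def __call__(self, x):
        if self.symbolic is not None:         # "switch": symbolic forward pass
            return self.symbolic(x)
        y = self.fn(x)                        # "wrap": record input/output pairs
        self.inputs.append(x)
        self.outputs.append(y)
        return y

    def distill(self):
        # "distill": fit a closed-form expression to the cached data.
        # SymTorch hands this data to PySR; polyfit is only a placeholder.
        coeffs = np.polyfit(self.inputs, self.outputs, deg=1)
        self.symbolic = np.poly1d(coeffs)

wrapped = SymbolicWrapper(lambda x: 2.0 * x + 1.0)   # pretend-trained network
for x in np.linspace(-1.0, 1.0, 50):
    wrapped(x)                                       # forward passes populate the cache
wrapped.distill()                                    # recovers y ≈ 2x + 1
print(round(float(wrapped.symbolic(3.0)), 4))        # ≈ 7.0
```

After `distill()`, every call goes through the recovered equation instead of the original network, which is the same switch that `switch_to_symbolic` performs in the library.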

The library interfaces with PySR, which uses a multi-population genetic algorithm to find equations that balance accuracy and complexity on a Pareto front. The ‘best’ equation is chosen by maximizing the fractional drop in log mean absolute error relative to an increase in complexity.
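The selection rule can be written down directly. This is a sketch of the criterion as described, not PySR's internal code; the Pareto-front values are made up.

```python
import math

# (complexity, mean-absolute-error) pairs along a Pareto front, simplest first
front = [(1, 0.90), (3, 0.30), (5, 0.25), (9, 0.249)]

def best_index(front):
    """Pick the equation maximizing the drop in log(MAE) per unit of added complexity."""
    best_i, best_score = 0, -math.inf
    for i in range(1, len(front)):
        dc = front[i][0] - front[i - 1][0]                         # complexity increase
        dlog = math.log(front[i - 1][1]) - math.log(front[i][1])   # drop in log error
        score = dlog / dc
        if score > best_score:
            best_i, best_score = i, score
    return best_i

print(front[best_index(front)])   # the big jump from MAE 0.90 to 0.30 wins: (3, 0.3)
```

The criterion favors equations that buy a large accuracy gain for a small complexity increase, rather than the absolute most accurate (and most complex) expression on the front.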

Case Study: Accelerating LLM Inference

A primary application explored in this research is replacing Multi-Layer Perceptron (MLP) layers in Transformer models with symbolic surrogates to improve throughput.

Implementation Details

Due to the high dimensionality of LLM activations, the research team employed Principal Component Analysis (PCA) to compress inputs and outputs before performing SR. For the Qwen2.5-1.5B model, they selected 32 principal components for inputs and 8 for outputs across three targeted layers.
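The PCA compression step looks roughly like the following NumPy sketch. The dimensions and data here are made up for illustration; the paper applied it to real cached activations, keeping 32 input and 8 output components.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(400, 256))        # toy stand-in for cached MLP input activations

def pca_fit(X, k):
    """Fit PCA via SVD and return the mean and the top-k principal axes."""
    mean = X.mean(axis=0)
    # Right singular vectors of the centered data are the principal components
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

mean, comps = pca_fit(acts, k=32)
compressed = (acts - mean) @ comps.T      # 256-d activations -> 32-d inputs for SR
print(compressed.shape)                    # (400, 32)
```

Symbolic regression then runs in the compressed space; the surrogate's outputs are mapped back through the output-side PCA basis at inference time.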

Performance Trade-offs

The intervention resulted in an 8.3% increase in token throughput. However, this gain came with a non-trivial increase in perplexity, primarily driven by the PCA dimensionality reduction rather than the symbolic approximation itself.

Metric                    Baseline (Qwen2.5-1.5B)    Symbolic Surrogate
Perplexity (Wikitext-2)   10.62                      13.76
Throughput (tokens/s)     4878.82                    5281.42
Avg. Latency (ms)         209.89                     193.89
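The 8.3% throughput figure follows directly from the reported numbers:

```python
baseline, surrogate = 4878.82, 5281.42   # tokens/s, from the table
gain = surrogate / baseline - 1
print(f"{gain:.1%}")                      # 8.3%
```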

GNNs and PINNs

SymTorch was validated on its ability to recover known physical laws from latent representations in scientific models.

Graph Neural Networks (GNNs): By training a GNN on particle dynamics, the research team used SymTorch to recover empirical force laws, such as gravity (1/r²) and spring forces, directly from the edge messages.

Physics-Informed Neural Networks (PINNs): The library successfully distilled the 1-D heat equation’s analytic solution from a trained PINN. The PINN’s inductive bias allowed it to achieve a Mean Squared Error (MSE) of 7.40 × 10⁻⁶.

LLM Arithmetic Analysis: Symbolic distillation was used to inspect how models like Llama-3.2-1B perform 3-digit addition and multiplication. The distilled equations revealed that while the models are often correct, they rely on internal heuristics that include systematic numerical errors.

Key Takeaways

Automated Symbolic Distillation: SymTorch is a library that automates the process of replacing complex neural network components with interpretable, closed-form mathematical equations by wrapping components and collecting their input-output behavior.

Engineering Barrier Removal: The library handles critical engineering challenges that previously hindered the adoption of symbolic regression, including GPU-CPU data transfer, input-output caching, and seamless switching between neural and symbolic forward passes.

LLM Inference Acceleration: A proof-of-concept demonstrated that replacing MLP layers in a transformer model with symbolic surrogates achieved an 8.3% throughput improvement, though with some performance degradation in perplexity.

Scientific Law Discovery: SymTorch was successfully used to recover physical laws from Graph Neural Networks (GNNs) and analytic solutions to the 1-D heat equation from Physics-Informed Neural Networks (PINNs).

Functional Interpretability of LLMs: By distilling the end-to-end behavior of LLMs, researchers could inspect the explicit mathematical heuristics used for tasks like arithmetic, revealing where internal logic deviates from exact operations.

Check out the Paper, Repo and Project Page. Also, feel free to follow us on Twitter and don’t forget to join our 120k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Meet SymTorch: A PyTorch Library that Translates Deep Learning Models into Human-Readable Equations appeared first on MarkTechPost.

How to Build a Stable and Efficient QLoRA Fine-Tuning Pipeline Using Unsloth for Large Language Models

In this tutorial, we demonstrate how to efficiently fine-tune a large language model using Unsloth and QLoRA. We focus on building a stable, end-to-end supervised fine-tuning pipeline that handles common Colab issues such as GPU detection failures, runtime crashes, and library incompatibilities. By carefully controlling the environment, model configuration, and training loop, we show how to reliably train an instruction-tuned model with limited resources while maintaining strong performance and rapid iteration speed.

import os, sys, subprocess, gc, locale

locale.getpreferredencoding = lambda: "UTF-8"

def run(cmd):
    print("\n$ " + cmd, flush=True)
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in p.stdout:
        print(line, end="", flush=True)
    rc = p.wait()
    if rc != 0:
        raise RuntimeError(f"Command failed ({rc}): {cmd}")

print("Installing packages (this may take 2-3 minutes)...", flush=True)

run("pip install -U pip")
run("pip uninstall -y torch torchvision torchaudio")
run(
    "pip install --no-cache-dir "
    "torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 "
    "--index-url https://download.pytorch.org/whl/cu121"
)
run(
    "pip install -U "
    "transformers==4.45.2 "
    "accelerate==0.34.2 "
    "datasets==2.21.0 "
    "trl==0.11.4 "
    "sentencepiece safetensors evaluate"
)
run("pip install -U unsloth")

import torch
try:
    import unsloth
    restarted = False
except Exception:
    restarted = True

if restarted:
    print("\nRuntime needs restart. After restart, run this SAME cell again.", flush=True)
    os._exit(0)

We set up a controlled and compatible environment by reinstalling PyTorch and all required libraries. We ensure that Unsloth and its dependencies align correctly with the CUDA runtime available in Google Colab. We also handle the runtime restart logic so that the environment is clean and stable before training begins.

import torch, gc

assert torch.cuda.is_available()
print("Torch:", torch.__version__)
print("GPU:", torch.cuda.get_device_name(0))
print("VRAM(GB):", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2))

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

def clean():
    gc.collect()
    torch.cuda.empty_cache()

import unsloth
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TextStreamer
from trl import SFTTrainer, SFTConfig

We verify GPU availability and configure PyTorch for efficient computation. We import Unsloth before all other training libraries to ensure that all performance optimizations are applied correctly. We also define utility functions to manage GPU memory during training.

max_seq_length = 768
model_name = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "k_proj"],  # extend with v_proj, o_proj, etc. for more capacity
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    max_seq_length=max_seq_length,
)

We load a 4-bit quantized, instruction-tuned model using Unsloth’s fast-loading utilities. We then attach LoRA adapters to the model to enable parameter-efficient fine-tuning. We configure the LoRA setup to balance memory efficiency and learning capacity.

ds = load_dataset("trl-lib/Capybara", split="train").shuffle(seed=42).select(range(1200))

def to_text(example):
    example["text"] = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return example

ds = ds.map(to_text, remove_columns=[c for c in ds.column_names if c != "messages"])
ds = ds.remove_columns(["messages"])
split = ds.train_test_split(test_size=0.02, seed=42)
train_ds, eval_ds = split["train"], split["test"]

cfg = SFTConfig(
    output_dir="unsloth_sft_out",
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_steps=150,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    eval_strategy="no",
    save_steps=0,
    fp16=True,
    optim="adamw_8bit",
    report_to="none",
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    args=cfg,
)

We prepare the training dataset by converting multi-turn conversations into a single text format suitable for supervised fine-tuning. We split the dataset to maintain training integrity. We also define the training configuration, which controls the batch size, learning rate, and training duration.

clean()
trainer.train()

FastLanguageModel.for_inference(model)

def chat(prompt, max_new_tokens=160):
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    with torch.inference_mode():
        model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            streamer=streamer,
        )

chat("Give a concise checklist for validating a machine learning model before deployment.")

save_dir = "unsloth_lora_adapters"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

We execute the training loop and monitor the fine-tuning process on the GPU. We switch the model to inference mode and validate its behavior using a sample prompt. We finally save the trained LoRA adapters so that we can reuse or deploy the fine-tuned model later.

In conclusion, we fine-tuned an instruction-following language model using Unsloth’s optimized training stack and a lightweight QLoRA setup. We demonstrated that by constraining sequence length, dataset size, and training steps, we can achieve stable training on Colab GPUs without runtime interruptions. The resulting LoRA adapters provide a practical, reusable artifact that we can deploy or extend further, making this workflow a robust foundation for future experimentation and advanced alignment techniques.

Check out the Full Codes here.
The post How to Build a Stable and Efficient QLoRA Fine-Tuning Pipeline Using Unsloth for Large Language Models appeared first on MarkTechPost.

Google Drops Gemini 3.1 Flash-Lite: A Cost-efficient Powerhouse with Adjustable Thinking Levels Designed for High-Scale Production AI

Google has released Gemini 3.1 Flash-Lite, the most cost-efficient entry in the Gemini 3 model series. Designed for ‘intelligence at scale,’ this model is optimized for high-volume tasks where low latency and cost-per-token are the primary engineering constraints. It is currently available in Public Preview via the Gemini API (Google AI Studio) and Vertex AI.

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/?

Core Feature: Variable ‘Thinking Levels’

A significant architectural update in the 3.1 series is the introduction of Thinking Levels. This feature allows developers to programmatically adjust the model’s reasoning depth based on the specific complexity of a request.

By selecting between Minimal, Low, Medium, or High thinking levels, you can optimize the trade-off between latency and logical accuracy.

Minimal/Low: Ideal for high-throughput, low-latency tasks such as classification, basic sentiment analysis, or simple data extraction.

Medium/High: Utilizes Deep Think Mini logic to handle complex instruction-following, multi-step reasoning, and structured data generation.
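In practice, a caller can route requests to a thinking level by task type. The level names come from the post; the task categories and the mapping below are illustrative assumptions to tune for your own workload, not part of the official SDK.

```python
# Illustrative routing table: task kind -> thinking level.
# Cheap, high-volume tasks get Minimal/Low; reasoning-heavy tasks get Medium/High.
THINKING_LEVEL = {
    "classification": "minimal",
    "sentiment": "low",
    "extraction": "low",
    "structured_generation": "medium",
    "multi_step_reasoning": "high",
}

def pick_thinking_level(task_kind: str) -> str:
    # Default unknown tasks to "medium": pay some latency for logical safety.
    return THINKING_LEVEL.get(task_kind, "medium")

print(pick_thinking_level("classification"))  # minimal
print(pick_thinking_level("planning"))        # medium
```

The chosen level would then be passed in the request configuration when calling the model through the Gemini API.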


Performance and Efficiency Benchmarks

Gemini 3.1 Flash-Lite is designed to replace Gemini 2.5 Flash for production workloads that require faster inference without sacrificing output quality. The model achieves a 2.5x faster Time to First Token (TTFT) and a 45% increase in overall output speed compared to its predecessor.

On the GPQA Diamond benchmark—a measure of expert-level reasoning—Gemini 3.1 Flash-Lite scored 86.9%, matching or exceeding the quality of larger models in the previous generation while operating at a significantly lower computational cost.

Comparison Table: Gemini 3.1 Flash-Lite vs. Gemini 2.5 Flash

Metric                        Gemini 2.5 Flash    Gemini 3.1 Flash-Lite
Input Cost (per 1M tokens)    Higher              $0.25
Output Cost (per 1M tokens)   Higher              $1.50
TTFT Speed                    Baseline            2.5x Faster
Output Throughput             Baseline            45% Faster
Reasoning (GPQA Diamond)      Competitive         86.9%

Technical Use Cases for Production

The 3.1 Flash-Lite model is specifically tuned for workloads that involve complex structures and long-sequence logic:

UI and Dashboard Generation: The model is optimized for generating hierarchical code (HTML/CSS, React components) and structured JSON required to render complex data visualizations.

System Simulations: It maintains logical consistency over long contexts, making it suitable for creating environment simulations or agentic workflows that require state-tracking.

Synthetic Data Generation: Due to the low input cost ($0.25/1M tokens), it serves as an efficient engine for distilling knowledge from larger models like Gemini 3.1 Ultra into smaller, domain-specific datasets.

Key Takeaways

Superior Price-to-Performance Ratio: Gemini 3.1 Flash-Lite is the most cost-efficient model in the Gemini 3 series, priced at $0.25 per 1M input tokens and $1.50 per 1M output tokens. It outperforms Gemini 2.5 Flash with a 2.5x faster Time to First Token (TTFT) and 45% higher output speed.

Introduction of ‘Thinking Levels’: A new architectural feature allows developers to programmatically toggle between Minimal, Low, Medium, and High reasoning intensities. This provides granular control to balance latency against reasoning depth depending on the task’s complexity.

High Reasoning Benchmark: Despite its ‘Lite’ designation, the model maintains high-tier logic, scoring 86.9% on the GPQA Diamond benchmark. This makes it suitable for expert-level reasoning tasks that previously required larger, more expensive models.

Optimized for Structured Workloads: The model is specifically tuned for ‘intelligence at scale,’ excelling at generating complex UI/dashboards, creating system simulations, and maintaining logical consistency across long-sequence code generation.

Seamless API Integration: Currently available in Public Preview, the model uses the gemini-3.1-flash-lite-preview endpoint via the Gemini API and Vertex AI. It supports multimodal inputs (text, image, video) while maintaining a standard 128k context window.

Check out the Public Preview via the Gemini API (Google AI Studio) and Vertex AI.
The post Google Drops Gemini 3.1 Flash-Lite: A Cost-efficient Powerhouse with Adjustable Thinking Levels Designed for High-Scale Production AI appeared first on MarkTechPost.

Building a scalable virtual try-on solution using Amazon Nova on AWS: …

In this first post in a two-part series, we examine how retailers can implement a virtual try-on to improve customer experience. In part 2, we will further explore real-world applications and benefits of this innovative technology.
Every fourth piece of clothing bought online is returned to the retailer, feeding into America’s $890 billion returns problem in 2024. Behind these numbers lies a simple truth: shoppers can’t judge fit and style through their screens. Among the top reasons for returned fashion items are poor fit, wrong size, or style mismatch.
Retailers face a critical challenge in that their most valuable customers often return the most items, forcing them to maintain generous return policies despite steep processing costs and environmental impact. Each return produces 30% more carbon emissions than the initial delivery and represents a missed sales opportunity until items are processed back into inventory. As digital shopping accelerates, virtual try-on technology has emerged as a solution to reduce returns while maintaining customer convenience, but early implementations struggled with accuracy, scalability, and preserving crucial details such as garment draping, patterns, and logos.
Amazon Nova Canvas addresses these challenges through its virtual try-on capability, which uses two two-dimensional image inputs: a source image showing a person or living space and a reference image of the product. The system offers both automatic product placement through auto-masking functionality and manual controls for precise adjustments. Throughout the process, it carefully preserves important details such as logos and textures while providing comprehensive styling controls for customization.
Virtual try-on can be deployed across multiple customer engagement channels, from ecommerce websites and mobile shopping apps to in-store kiosks, social media shopping platforms, and virtual showrooms. Imagine visiting an ecommerce website, uploading your personal image, and seeing it applied across the clothing and accessory products on that website.
The following image shows a source image, a reference image, a mask image, and the resulting try-on image.

In this post, we explore the virtual try-on capability now available in Amazon Nova Canvas, including sample code to get started quickly and tips to help get the best outputs.
Solution overview
With the virtual try-on capability, retailers and ecommerce companies can integrate garment and product visualization directly into their existing or new customer touchpoints. Using only a photo upload and a product selection, customers can see how items would look on themselves, a model, or another placement. You can experiment with virtual try-on in Amazon Nova Canvas within the Amazon Bedrock playground, and we'll guide you through implementing a complete solution around this feature in your own Amazon Web Services (AWS) environment. The following section provides detailed instructions and best practices for deployment.
At its core, the solution uses the new virtual try-on in Amazon Nova Canvas in Amazon Bedrock. This model offers fast inference speeds, making it suitable for real-time applications such as ecommerce. At the same time, it preserves high-fidelity details of reference items, including patterns, textures, and logos. The model maintains accurate semantic manipulations within scenes.
Our solution combines AWS serverless services with AI processing capabilities in an event-driven architecture. Amazon DynamoDB Streams triggers an AWS Step Functions workflow and Amazon Simple Storage Service (Amazon S3) events to manage result delivery. Amazon Nova Canvas in Amazon Bedrock manages both the mask generation and pose detection. The solution follows an asynchronous processing pipeline with real-time status updates in which WebSocket connections maintain real-time communication with clients, enabling continuous user engagement throughout the process. For detailed implementation guidance and best practices, refer to our guidance.
Detailed explanation of the architecture
The request initiation follows this flow:

1. Amazon S3 stores the uploaded customer model photos and product images.
2. Each upload generates a message sent to an Amazon Simple Queue Service (Amazon SQS) queue. An AWS Lambda function creates the corresponding metadata and S3 path and stores it in the DynamoDB product table for later retrieval.
3. Amazon API Gateway manages the WebSocket connections for real-time status updates between the client and the virtual try-on.
4. Lambda processes initial requests by retrieving product information from the DynamoDB product table and creating job entries in DynamoDB.
5. Amazon DynamoDB: the products table (vto-products) stores catalog items available for the virtual try-on, notably the Amazon S3 picture location.
6. The virtual try-on jobs DynamoDB table (vto-jobs) tracks the state of each try-on request.

The virtual try-on generation follows this flow:

1. DynamoDB Streams asynchronously triggers AWS Step Functions workflows on job creation for processing try-on requests.
2. AWS Step Functions orchestrates the virtual try-on generation. It triggers a Lambda function that calls the Amazon Nova Canvas model through Amazon Bedrock. The DynamoDB job table is updated with the virtual try-on status.

The result delivery follows this flow:

1. Amazon S3 stores the generated try-on images with job ID metadata.
2. Amazon SQS handles S3 event notifications for completed try-on images.
3. An AWS Lambda function sends the Amazon S3 URL of the result back to the user through the WebSocket.

The following diagram illustrates the solution architecture.

Solution process
This section explains the end-to-end process of the solution. The solution guidance provides further details and information on how you can replicate the solution.
When your customer initiates a try-on request, they first sign in with Amazon Cognito and then upload their photo(s), which are stored in Amazon S3. A workflow is available to auto-populate the product table in DynamoDB through Amazon S3 events. The client establishes a WebSocket connection through API Gateway, creating a persistent channel for real-time updates. The client sends the ID of the product they want to virtually try on as well as the S3 URL of the static model image they want to use. A Lambda function processes this request by retrieving the product image URL from DynamoDB and creating a job entry with both image URLs, returning a unique job ID for tracking.
A DynamoDB stream then triggers a Step Functions workflow to coordinate the different writes and updates in the DynamoDB table. The workflow also invokes the Amazon Nova Canvas virtual try-on feature. The model takes as input (1) the source image, which is the base image you would like to modify (for example, the image of the customer), and (2) the reference image, which is an image containing the product(s) you want to insert into the base image. For garments, the reference image can contain garments on or off body and can even contain multiple products representing distinct outfit components (such as a shirt, pants, and shoes in a single image).
By default, a mask is computed automatically using auxiliary inputs (maskType: “GARMENT” or maskType: “PROMPT”). Alternatively, the mask image can be provided directly by the developer (maskType: “IMAGE”).
When a mask type of “GARMENT” is specified, Amazon Nova Canvas will create a garment-aware mask based on a garmentClass input parameter value you specify. In most cases, you will use one of the following high-level garment classes:

“UPPER_BODY” – Creates a mask that includes full arm length.
“LOWER_BODY” – Creates a mask that includes full leg length with no gap between the legs.
“FOOTWEAR” – Creates a mask that fits the shoe profile demonstrated in the source image.
“FULL_BODY” – Creates a mask equivalent to the combination of “UPPER_BODY” and “LOWER_BODY”.
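Pulling these parameters together, a garment try-on request body looks roughly like the following sketch. The field names maskType and garmentClass come from this post; the surrounding payload structure (taskType value, nesting) is an assumption for illustration, so check the Amazon Nova Canvas API documentation before using it.

```python
import json

def build_tryon_request(source_image_b64: str, reference_image_b64: str,
                        garment_class: str = "UPPER_BODY") -> str:
    """Assemble an illustrative virtual try-on request body (structure assumed)."""
    body = {
        "taskType": "VIRTUAL_TRY_ON",               # assumed task-type name
        "virtualTryOnParams": {
            "sourceImage": source_image_b64,        # base64 photo of the person
            "referenceImage": reference_image_b64,  # base64 product image
            "maskType": "GARMENT",                  # auto-mask from a garment class
            "garmentBasedMask": {"garmentClass": garment_class},
        },
    }
    return json.dumps(body)

payload = build_tryon_request("<source-b64>", "<ref-b64>", "FULL_BODY")
print(json.loads(payload)["virtualTryOnParams"]["maskType"])  # GARMENT
```

In the Lambda function, this JSON string would be passed as the body of a bedrock-runtime invoke_model call against the Nova Canvas model ID.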

The following table (images not reproduced here) shows example inputs with maskType: “GARMENT” — a source image, a reference image, the garment class (“FOOTWEAR”), and the resulting output.

The following table (images not reproduced here) shows example inputs with maskType: “PROMPT” — a source image, a mask prompt, a reference image, and the resulting output.

There are also more fine-grained subclasses that can be useful in certain edge cases. By using the “PROMPT” mask type, you can use natural language to describe the item in the source image that you want to replace. This is useful for images of items other than garments. This feature uses the same auto-masking functionality that exists in the Nova Canvas “INPAINTING” task using the maskPrompt parameter.
Using the mask to determine which garment areas need to be replaced, the model inserts the product image into the user’s photo and generates the try-on image, which is stored in Amazon S3 with the job ID as metadata. Throughout this process, the system sends progress updates through the WebSocket connection. An Amazon S3 event notification triggers a Lambda function through Amazon SQS. The function generates a presigned URL for the result image and delivers it to the client through the established WebSocket connection. This completes the process, typically taking 7–11 seconds.
Implementation details
This section details the tables and schemas used in our virtual try-on solution to help you understand the role each DynamoDB table plays.
This solution uses four DynamoDB tables. The products table (vto-products) stores the catalog of available items for virtual try-on. The virtual try-on jobs table (vto-jobs) maintains the state and tracking information for each try-on request. The models table (vto-models) stores the catalog of customer images used for virtual try-on. The WebSocket connections table (vto-connections) tracks active WebSocket connections for real-time job status updates. The solution assumes the products table is prepopulated with the retailer’s inventory.
The products table (vto-products) stores the catalog of available items for virtual try-on. Products are automatically populated when images are uploaded to the /products/ S3 folder. The schema for the products table is as follows:

product_id (string, partition key) – Unique identifier for the product
product_picture_s3_url (string) – Amazon S3 URL of the original product image
name (string) – Product display name
category (string) – Product category for organization
description (string) – Product details including style, color, and size options
auto_imported (Boolean) – Flag indicating if product was imported automatically through Amazon S3 upload
created_at (string) – ISO timestamp when product was added
updated_at (string) – ISO timestamp of last modification

The models table (vto-models) stores the catalog of customer images used for virtual try-on. Models are automatically populated when images are uploaded to the /models/ S3 folder. The schema for the models table is as follows:

model_id (string, partition key) – Unique identifier for the model
model_picture_s3_url (string) – Amazon S3 URL of the model image
name (string) – Model display name
category (string) – Model category for organization
description (string) – Model details and characteristics
auto_imported (Boolean) – Flag indicating if model was imported automatically using Amazon S3 upload
created_at (string) – ISO timestamp when model was added
updated_at (string) – ISO timestamp of last modification

The virtual try-on jobs table (vto-jobs) maintains state and tracking information for each try-on request throughout the processing workflow. The schema for the virtual try-on jobs table is as follows:

id (string, partition key) – Unique identifier for each try-on job
model_id (string) – Reference to the model used
product_id (string) – Reference to the product being tried on
model_picture_s3_url (string) – Amazon S3 URL of the customer’s uploaded photo
product_picture_s3_url (string) – Amazon S3 URL of the product being tried on
result_s3_url (string) – Amazon S3 URL of the generated virtual try-on result image
status (string) – Current job status (created, processing, completed, or error)
parameters (map) – Virtual try-on API parameters (such as maskType, mergeStyle, or garmentClass)
connection_id (string) – WebSocket connection ID for real-time updates
error_message (string) – Error details if job fails
created_at (string) – ISO timestamp when job was created
updated_at (string) – ISO timestamp of last status update
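Using that schema, the Lambda function that creates a job entry builds an item like the following. This is an illustrative sketch based on the schema above, not the solution's actual code.

```python
import uuid
from datetime import datetime, timezone

def new_tryon_job(model_id: str, product_id: str, model_url: str,
                  product_url: str, connection_id: str) -> dict:
    """Build a vto-jobs item matching the schema described above."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "id": str(uuid.uuid4()),                 # partition key
        "model_id": model_id,
        "product_id": product_id,
        "model_picture_s3_url": model_url,
        "product_picture_s3_url": product_url,
        "status": "created",                     # later: processing, completed, or error
        "parameters": {"maskType": "GARMENT"},   # virtual try-on API parameters
        "connection_id": connection_id,          # WebSocket channel for status updates
        "created_at": now,
        "updated_at": now,
    }

job = new_tryon_job("m-1", "p-42", "s3://bucket/models/m-1.jpg",
                    "s3://bucket/products/p-42.jpg", "conn-abc")
print(job["status"])   # created
# In Lambda: boto3.resource("dynamodb").Table("vto-jobs").put_item(Item=job)
```

Writing this item is what fires the DynamoDB stream that starts the Step Functions workflow.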

The WebSocket connections table (vto-connections) tracks active WebSocket connections for real-time job status updates. Further information on using WebSocket APIs can be found in the Create a WebSocket chat app with a WebSocket API, Lambda, and DynamoDB tutorial. The schema is as follows:

connection_id (string, partition key) – WebSocket connection identifier
connected_at (string) – ISO timestamp when connection was established
ttl (number) – Time-to-live for automatic cleanup of stale connections

Conclusion
In this post, we covered how to implement virtual try-on at scale, covering the main building blocks. For a quick start, we provide a complete GitHub sample with prerequisites, deployment scripts, example code and a comprehensive solution guidance document with best practices and configuration details. Use this guide to get started right away in experimenting with the solution.
As ecommerce continues to grow, reducing return rates while maintaining customer satisfaction becomes increasingly critical for retailers’ profitability and sustainability. This virtual try-on solution demonstrates how AWS serverless services can be combined with generative AI to address a significant challenge. By using Amazon Nova Canvas alongside a robust serverless architecture, retailers can provide customers with accurate product visualization and pose preservation while maintaining the seamless shopping experience their most loyal customers expect.
Implementation considerations extend beyond the technical architecture. Successful deployment requires careful attention to service quotas, monitoring, and cost optimization. Our solution guidance provides further detailed recommendations for managing WebSocket connections, implementing retry strategies, and optimizing resource utilization. These operational aspects are crucial for maintaining reliable performance during peak shopping periods while managing costs effectively.

About the authors

Amandine Annoye
Amandine Annoye is a Solutions Architect at AWS, where she works with Luxury & Fashion customers in France to help them drive business value. Amandine enjoys translating customers’ business needs into concrete and effective technical solutions. Outside of work, she enjoys travelling and painting.

Kevin Polossat
Kevin Polossat is a Solutions Architect at AWS. He works with retail & CPG customers in France to help them create value through cloud adoption. Outside of work, he enjoys wine and cheese.

Leopold Cheval
Leopold Cheval is a Solutions Architect at AWS based in Paris, working with Media & Entertainment and Retail customers on their cloud journey. He focuses on modern applications, AI/ML, and Generative AI technologies. Outside of work, Leopold enjoys traveling and camping.

Rania Khemiri
Rania Khemiri is a Prototyping Architect at AWS. She focuses on agentic workflows and Generative AI applications, helping teams accelerate experimentation and adoption of AI technologies on AWS. Through hands-on prototyping, she empowers customers to transform ideas into functional proofs of concept and gain the skills to scale them into production.

How Lendi revamped the refinance journey for its customers using agent …

This post was co-written with Devesh Maheshwari from Lendi Group and Samuel Casey from Mantel Group.
Most Australians don’t know whether their home loan is still competitive. Rates shift, property values move, personal circumstances change—yet for the average homeowner, staying informed of these changes is difficult. It’s often their largest financial commitment, but it’s also the one they’re least equipped to monitor. And when they do decide to refinance, the process itself demands significant manual effort.
Lendi Group, one of Australia’s fastest growing FinTech companies, recognized this gap and set out to transform the home loan experience through innovative technology. By using the generative AI capabilities of Amazon Bedrock, Lendi Group has developed Guardian, an agentic AI-powered application that serves as an around-the-clock companion for homeowners, monitoring their loans, providing personalized insights, and simplifying the mortgage refinance process.
This post details how Lendi Group built their AI-powered Home Loan Guardian using Amazon Bedrock, the challenges they faced, the architecture they implemented, and the significant business outcomes they’ve achieved. Their journey offers valuable insights for organizations that want to use generative AI to transform customer experiences while maintaining the human touch that builds trust and loyalty.
Challenges
Lendi Group identified several persistent challenges in the home loan journey that affected both customers and brokers:

Customers struggled with limited visibility into their mortgage position. Most homeowners lacked real-time insights into whether their current rate remained competitive, how their equity position changed as property values fluctuated, or how their overall financial health impacted their mortgage options. This information gap often led to customers missing opportunities to save money or utilize their home equity effectively.
The refinancing process was cumbersome and time-consuming. Even when customers identified better rates, the paperwork and administrative burden of refinancing deterred many from acting.
Brokers spent significant time on administrative tasks rather than focusing on high-value client interactions. Post-call documentation, routine inquiries, and after-hours support diverted broker attention from complex client needs that required human expertise and empathy.
Lendi Group faced the challenge of scaling personalized service across their extensive customer base. While their digital solution provided convenience, maintaining the human touch that builds trust in financial relationships proved difficult at scale, especially outside business hours.

These challenges led Lendi Group to explore how AI could transform the mortgage experience. Rather than viewing AI as merely an efficiency tool, Lendi envisioned a reinvention of the home loan journey—one where technology could anticipate customer needs, provide around-the-clock personalized guidance, and free human experts to focus on building meaningful relationships.
Solution overview
Lendi’s Guardian represents a fundamental shift in how customers interact with their home loans. At its core, Guardian is designed to:

Monitor loan competitiveness by continuously scanning thousands of home loans daily and alerting customers when better deals become available
Track equity position in real time as property values and industry conditions change, giving customers visibility into their current financial standing
Streamline the refinancing process with journeys that adapt to the customer’s circumstances and auto-populate forms based on internal and external data sources, removing friction points that previously deterred customers from taking action
Deliver personalized insights and recommendations based on each customer’s unique financial situation and goals

Lendi used Amazon Bedrock to build their agentic solution in just 16 weeks.
The solution is built upon Amazon Bedrock foundation models and Amazon Bedrock Guardrails. Lendi chose Amazon Elastic Kubernetes Service (Amazon EKS) to deploy their AI agents at scale, providing the infrastructure needed to meet consumer demand. By using the wide range of foundation models (FMs) available on Amazon Bedrock, Lendi was able to select task-appropriate models optimized for specific use cases.
A critical component of their solution is AI guardrails powered by Amazon Bedrock Guardrails, which help make sure that the customer communications remain aligned with regulatory requirements. Additionally, Lendi developed Model Context Protocol (MCP) servers to enable AI agents to access institutional knowledge and interact with external services seamlessly.
The key components of the solution are as follows:

UI layer – Customers interact with Guardian through an intuitive chat-led interface integrated directly into their Lendi dashboard, providing seamless access to AI-powered mortgage insights and recommendations.
API layer – A RESTful API in Amazon API Gateway serves as the communication bridge between frontend applications and backend AI agents, handling request routing, authentication, and rate limiting to help maintain secure and reliable interactions.
Compute layer – Amazon EKS hosts and orchestrates the AI agents, providing auto-scaling capabilities to efficiently handle varying customer demand while maintaining consistent performance and availability.
Intelligence layer – The core AI capabilities are powered by multiple specialized agents built on Amazon Bedrock foundation models. Lendi used Agno, an open-source agentic framework, to develop these agents, with MCP servers providing integrations to internal systems, external data sources, and third-party services. Amazon Bedrock Guardrails help enforce compliance boundaries, verifying that customer interactions adhere to Lendi’s communication guidelines and remain focused on relevant mortgage-related topics.
Observability layer – Langfuse captures comprehensive agent traces, including inputs, outputs, reasoning chains, and performance metrics, providing full visibility into agent behavior and enabling continuous optimization and debugging. Amazon CloudWatch Logs is used to collect system-level logs.
Storage layer – MongoDB serves as the persistent data store for user context, conversation history, and session state, enabling customers to resume conversations across sessions while providing agents with the customer-specific context needed for personalized recommendations. Amazon S3 is used to store documents and files provided by the customer.

The following diagram illustrates the solution architecture.

This architecture pattern provides a robust and scalable system to deploy AI agents.
Agent flow for mortgage refinance
Building upon this scalable architecture, Lendi designed a multi-agent orchestration system where specialized agents collaborate to complete the mortgage refinance journey. This modular approach provides several key advantages: clear separation of concerns between agents, simplified development and maintenance of individual agent capabilities, faster response times through task-specific optimization, and straightforward troubleshooting when issues arise.
The mortgage refinance process flows through the following specialized agents, with seamless handovers preserving full context at each transition:

Mortgage Broker Associate Agent (initial engagement) – This agent serves as the customer’s first point of contact, embodying a friendly, professional persona similar to a human mortgage broker. Its primary goal is to understand the customer’s current situation and assess their interest in refinancing.
Customer Information Collection Agent (data gathering) – When a customer expresses interest in refinancing, this specialized agent systematically collects essential customer details including current loan information, employment status, income, and refinancing preferences. The agent uses conversational techniques to make data collection feel natural rather than interrogative and provides clarifications to the customer as required. The agent is context aware and asks for information not already provided by the customer.
Product Recommendation Agent (lender matching) – With complete customer information in hand, this agent analyzes the customer’s profile against Lendi’s extensive database of lenders and products. It presents suitable options with clear explanations of benefits and potential savings.
Product-Specific Information Collection Agent (application preparation) – After the customer selects their preferred product, this agent gathers the additional information required by that specific lender. Different lenders have varying requirements, and this agent adapts its questions accordingly.
Communication Agent (Linda) – Linda is the off-system engagement and re-engagement agent that keeps customers connected to their refinance journey, even when they’re not actively using the Guardian system. Although the specialized agents manage in-system tasks from initial engagement to product selection and application preparation, Linda operates across channels such as SMS, email, WhatsApp, and push notifications to bring customers back in at the right moment. She detects when progress has stalled, surfaces timely reminders or new opportunities, and re-invites customers to continue where they left off. Drawing on live data from the Aurora Digital Twin, Linda tailors messages to the customer’s specific context, tone, and goal, whether it’s encouraging them to reconnect their loan, review matched products, or complete their submission. In essence, Linda is the voice of Guardian beyond the app, helping keep customers informed, motivated, and moving forward throughout the refinance journey.
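The handover pattern behind these specialized agents can be sketched in a few lines of Python. This is an illustrative toy, not Lendi’s implementation: each agent reads and writes a shared context object and returns the name of the next agent, so full context is preserved at every transition.

```python
# Toy multi-agent orchestrator: agents hand over a shared context object.
# Agent names and fields are hypothetical, for illustration only.
from dataclasses import dataclass, field

@dataclass
class Context:
    data: dict = field(default_factory=dict)   # customer-specific state
    history: list = field(default_factory=list)  # which agents have run

def engagement_agent(ctx):
    ctx.history.append("engagement")
    ctx.data["interested"] = True
    return "collect_info"            # name of the next agent

def info_collection_agent(ctx):
    ctx.history.append("collect_info")
    # context awareness: only ask for fields not already provided
    for f in ("loan_balance", "income"):
        ctx.data.setdefault(f, "<asked customer>")
    return "recommend"

def recommendation_agent(ctx):
    ctx.history.append("recommend")
    ctx.data["product"] = "best-match product"
    return None                      # journey complete

AGENTS = {
    "engage": engagement_agent,
    "collect_info": info_collection_agent,
    "recommend": recommendation_agent,
}

def run_journey(start="engage"):
    ctx, step = Context(), start
    while step is not None:          # seamless handover: same ctx object
        step = AGENTS[step](ctx)
    return ctx

ctx = run_journey()
print(ctx.history)   # ['engagement', 'collect_info', 'recommend']
```

Because every agent mutates the same context, a downstream agent never re-asks for information an upstream agent already collected, which is the property the post highlights.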

The following graphic illustrates this workflow.

This agentic approach simplified the mortgage application process for customers by providing an intuitive, natural language interface to share information, ask clarifying questions, and receive guidance throughout their refinance journey. For brokers, it alleviated the burden of manual form filling and application submission, freeing them to focus their expertise on complex customer scenarios, relationship building, and providing strategic financial advice where human judgment and empathy are most valuable.
Business outcomes and customer impact
Lendi’s Guardian application is already delivering measurable results, having settled millions in home loans with refinance cycle times considerably faster than Lendi Group’s baseline. Guardian extends this impact with its AI-powered Rate Radar, which scans thousands of home loans daily and enables refinancing in only 10 minutes, with no paperwork or phone calls, just a single tap. By automating routine monitoring and alerting customers to better rates in real time, brokers can focus on negotiation, empathy, and complex structuring—the high-value, relationship-driven work that builds loyalty. Guardian launched in only 16 weeks following a more than 30,000-hour cross-functional sprint, demonstrating how an AI-first architecture accelerates both development velocity and customer outcomes.
Lessons learned
Lendi Group’s 16-week journey to build and deploy the AI-powered Home Loan Guardian provided invaluable insights into implementing agentic AI at scale in a regulated financial services environment. Here are the critical lessons they learned:

Prioritize early, iterative evaluation metrics to guide AI development systematically. Rely on data-driven metrics to make key decisions such as model choice. Use Amazon Bedrock prompt management for versioning prompts.
Choose models strategically by using the diverse model options of Amazon Bedrock. Recognize that the most sophisticated model isn’t always the most effective solution for your specific use case. Equally important is incorporating domain knowledge from human experts into your prompts because this contextual expertise often determines success more than model selection alone.
Take advantage of using Amazon Bedrock batch inference on tasks that don’t require immediate results to reduce cost.
Treat AI as a transformative technology requiring bold vision and rapid, strategic implementation. Use the generative AI capabilities of Amazon Bedrock and the scalable cloud infrastructure of AWS to accelerate AI-driven innovation.
Prioritize responsible AI governance in regulated environments. Use Amazon Bedrock Guardrails to help enforce content policies, filter inappropriate responses, and maintain compliance alignment requirements throughout the AI lifecycle.
Balance automation with human expertise. Design AI systems that augment—rather than replace—human judgment, maintaining a customer-centric approach where human oversight remains central to critical decisions.

Future roadmap
Lendi Group’s implementation of the AI-powered Home Loan Guardian represents only the first step in their ambitious journey to become a fully AI-based organization by June 2026. With the foundation now in place, Lendi Group aims to use agentic AI to rethink the whole mortgage and finance journey.
To support this strategic initiative, Lendi is exploring new AWS services, including Amazon Bedrock AgentCore, which enables the deployment of agents in a scalable and secure manner without the overhead of infrastructure management. This approach will further help accelerate Lendi’s pace of innovation.
“We’ve built our platform so that refinancing happens at the speed of life, not at the speed of paperwork,” says Devesh Maheshwari, CTO at Lendi. “A customer can receive a Rate Radar alert about a sharper rate or a shift in property value during their morning commute. They tap to engage with it and provide information to our Agentic platform “Guardian” and by the time they’re heading home, their refinance loan application can be lodged. That’s not magic. It’s what happens when you invest properly in intelligent automation, real-time decisioning APIs and a micro-services architecture that coordinates everything from document verification through to settlement, without manual handoffs. The real challenge wasn’t just speed. It was removing every point of friction while still meeting the highest standards of compliance and risk control. When your infrastructure can support life-changing financial decisions in minutes rather than weeks, you’re not just improving the experience. You’re resetting what customers expect from financial services.”
Conclusion
Lendi Group’s AI-powered Home Loan Guardian represents a significant leap forward in how Australians manage their home loans. By using the generative AI capabilities of Amazon Bedrock, Lendi has created a solution that helps transform the mortgage experience from a periodic, transaction-based interaction to an ongoing, proactive relationship that delivers continuous value to customers. Looking ahead, Lendi’s journey to become a fully AI-based organization by June 2026 positions them at the forefront of innovation in the Australian mortgage industry. Their vision of AI integrated into “every workflow, every decision, every customer experience, and every broker experience” presents a fundamental reimagining of how mortgage services can be delivered.

About the authors
Deepak Dalakoti, PhD, is a Deep Learning Architect at the Generative AI Innovation Centre in Sydney, Australia. With expertise in AI, he partners with clients to accelerate their generative AI adoption through customized, innovative solutions. Outside the world of AI, he enjoys exploring new activities and experiences.
James Hardman is a Senior Account Manager at AWS, partnering with Australia’s fintech and financial services organisations to navigate complex technology challenges. He works backwards from what matters most to his customers, connecting them with the right investment, tools, and specialist teams to help them move faster. James is particularly focused on helping customers explore emerging technologies like agentic AI – not for the sake of innovation, but to drive real business outcomes and better serve their end customers.
Igor Londero Gentil is a Solutions Architect at AWS, based in Sydney, helping customers design and build on the cloud with a focus on serverless and event-driven architectures. With a background spanning infrastructure engineering, cloud architecture, and AI, he brings a practitioner’s perspective to solving real-world problems — grounded in years of hands-on experience before joining AWS. Igor is a regular speaker on topics like event-driven architectures and AWS Lambda, and an active open-source contributor.
Devesh Maheshwari is the Chief Technology Officer at Lendi Group Services in Sydney, Australia, where he’s driving the company’s transition to an AI-native business. With more than 18 years of experience leading technology strategy, digital transformation and engineering teams, Dev has a strong track record in fintech and highly regulated sectors, shaping platforms that scale and deliver real business value. Before Lendi, he held senior leadership positions at DataMesh, Tyro Payments, Tabcorp, and ThoughtWorks. He’s also a trusted advisor and mentor in tech, and he’s shared his insights on AI and innovation at industry events.
Samuel Casey began his career in the startup ecosystem as the co-founder of a specialised AI consultancy. After successfully spinning out a proprietary AI product and overseeing its acquisition by Mantel Group, Samuel joined Mantel four years ago to lead high-stakes digital transformations. As an AI partner in Mantel, he has spearheaded a variety of complex projects for a broad range of enterprise and government clients. Most recently, Samuel has been at the forefront of the Generative/Agentic AI movement, dedicated to helping organisations integrate AI Solutions into their core operations as these technologies have materialised in the global market.

How Tines enhances security analysis with Amazon Quick Suite

Organizations face challenges in quickly detecting and responding to user account security events, such as repeated login attempts from unusual locations. Although security data exists across multiple applications, manually correlating information and taking corrective action often delays effective response. With Amazon Quick Suite and Tines, you can automate the investigation and remediation process by integrating data from multiple security tools and providing visual insights for faster decision-making.
Quick Suite is a digital workspace that provides business users agentic AI capabilities to quickly answer questions and turn insights into actions. Quick Suite brings AI-powered research, business intelligence (BI), and automation into a single application. You can build automated workflows where multiple AI assistants work together, using your company data and the internet to answer business questions faster and more accurately. Users connect additional applications to Quick Suite using built-in integrations and the Model Context Protocol (MCP), a protocol that standardizes how AI assistants communicate with external tools.
Tines is an intelligent workflow platform with a built-in MCP Server Builder. An MCP server is a program that exposes an application’s capabilities through a standard protocol so AI assistants can call them as tools. In Tines, you define MCP tools that read from or write to your internal or third-party applications, and Quick Suite can query those tools directly. With full audit trails in Tines, customers maintain visibility and governance across every workflow. This pattern enables Quick Suite users to bring proprietary or siloed data into their AI-driven analysis workflows without deploying new infrastructure or writing custom integration code.
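The tool-exposure pattern described above can be illustrated with a small, self-contained sketch. This is conceptual only, not the Tines MCP Server Builder or the actual MCP wire protocol: a server registers named tools with descriptions, a connecting client lists them, and the AI assistant invokes one by name with arguments.

```python
# Conceptual sketch of the MCP pattern: a registry of named tools that an
# AI assistant can discover and call. Tool names and payloads are
# hypothetical, for illustration only.
import json

TOOLS = {}

def tool(name, description):
    """Register a function as an MCP-style tool."""
    def wrap(fn):
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return wrap

@tool("lookup_user", "Read a user record from an internal system")
def lookup_user(user_id: str) -> dict:
    # In Tines, this step would call an internal or third-party API.
    return {"user_id": user_id, "status": "active"}

def list_tools():
    """What an MCP client such as Quick Suite sees when it connects."""
    return [{"name": n, "description": t["description"]}
            for n, t in TOOLS.items()]

def call_tool(name, arguments):
    """Dispatch a tool call; each invocation could be logged for audit."""
    result = TOOLS[name]["fn"](**arguments)
    return json.dumps(result)

print(list_tools())
print(call_tool("lookup_user", {"user_id": "u-123"}))
```

The audit-trail property the post mentions falls out naturally: because every call goes through one dispatch function, logging each tool name and its arguments gives full visibility into what the assistant did.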
In this post, we show you how to connect Quick Suite with Tines to securely retrieve, analyze, and visualize enterprise data from any security or IT system. We walk through an example that uses an MCP server in Tines to retrieve data from various tools, such as AWS CloudTrail, Okta, and VirusTotal, to remediate security events using Quick Suite.
Use case: Orchestrated security investigation and remediation
As a member of a security team, you stay ahead of security events with regular account security data review. This involves triaging information from multiple sources to determine if there are indicators that signal the need to dive more deeply into the data. With Quick Suite and Tines, you can investigate and remediate security events using natural language. This integrated approach leads to faster decision-making, without requiring custom scripts or manual correlation across multiple security applications.
Once connected to Quick Suite as well as your security and IT tools, Tines can:

Analyze internet protocol (IP) addresses in VirusTotal to assess event risk
Retrieve account details from Okta and BambooHR
Review authentication logs and user activity in CloudTrail
Flag suspicious IP addresses and, after analyst approval, block them in CrowdStrike
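As a rough illustration of the triage steps above, the enrich-score-act flow might look like the following. The function names, response fields, and threshold are hypothetical stand-ins, not the real VirusTotal or CrowdStrike APIs, and the approval gate mirrors the "after analyst approval" step in the list.

```python
# Hypothetical IP triage sketch: enrich an IP, compute a risk score, and
# block only after analyst approval. All values are illustrative.
def enrich_ip(ip):
    # stand-in for a VirusTotal reputation lookup
    return {"ip": ip, "malicious_votes": 12, "total_votes": 90}

def risk_score(report):
    # fraction of engines flagging the IP as malicious
    return report["malicious_votes"] / report["total_votes"]

def triage(ip, approved_by_analyst):
    report = enrich_ip(ip)
    score = risk_score(report)
    # illustrative threshold; a real playbook would tune this
    if score > 0.1 and approved_by_analyst:
        return {"ip": ip, "action": "blocked", "score": round(score, 2)}
    return {"ip": ip, "action": "flagged", "score": round(score, 2)}

print(triage("203.0.113.7", approved_by_analyst=True))
# {'ip': '203.0.113.7', 'action': 'blocked', 'score': 0.13}
```

Keeping the human approval flag explicit in the code path is the same design choice the workflow makes: enrichment and scoring are automated, but the destructive action is gated on an analyst.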

In Quick Suite, you can then visualize this data to gain immediate insights such as:

Geographic mapping of login attempts with risk scoring
Timeline of user activity before and after suspicious logins
Correlation between accounts and affected systems
Remediation status tracking for security events

This enables you to ask natural language questions, for example:

Show all login attempts from high-risk countries in the last 24 hours
Display user activity timeline
List all systems the user accessed
Generate a report of remediation actions taken for the security event

Explore additional use cases in the Tines story library.
Solution overview
Tines can help you integrate with services that expose an API, automate retrieval or transformation of that data, and provide the resulting workflow as an MCP server. The MCP client in Quick Suite can connect directly to the Tines MCP server and access the tools defined within the server.
This pattern provides the following benefits:

A simple, governed integration layer between Quick Suite and internal or external tools
The ability to connect systems that don’t currently have an MCP server
A straightforward and powerful way to create new MCP tools for custom data sources without custom engineering or development work
Consistent, secure connectivity without maintaining custom scripts or servers

For Quick Suite customers, the result is faster insight and less manual effort, with built-in control over how Quick Suite connects to enterprise data sources.
The workflow consists of four components:

Quick Suite – Connects to the Tines MCP server using the Quick Suite MCP client, retrieves the data, and enables analysis through chat and dashboards
Tines MCP Server – A published endpoint that exposes the workflow as an MCP tool
Security or IT API – Any REST API that returns network, endpoint, asset, or configuration data
Tines workflow – A sequence of actions that retrieves, normalizes, or enriches this data

The following diagram illustrates this architecture.

Prerequisites
To deploy this solution, you must have the following:

A Quick Suite account within your AWS account with a Professional subscription and an Author (or higher) user role. Refer to Model Context Protocol (MCP) integration for more information.
A Tines tenant. All plans, including the free Community Edition, support creating MCP servers.
API credentials for the chosen security or IT system.

Create MCP server in Tines
You can import an MCP server from the Tines story library into your Tines tenant. Alternatively, complete the following steps to create a custom MCP server in Tines:

Create a new Story.

Open the Templates browser and search for MCP.

Drag the MCP action to the storyboard.

Choose MCP Server in the right pane and note the MCP server URL to connect Quick Suite.

Add as many tools as required for your workflow from the list of templates, or configure your own custom tools.

Connect the tools with your account in the associated applications using standard authentication methods (such as API key or OAuth).

The following screenshot shows a custom MCP server example for user account security analysis and remediation.

Connect Quick Suite to Tines MCP server
Complete the following steps to connect Quick Suite to the Tines MCP server:

On the Quick Suite console, choose Integrations under Connections in the navigation pane.
Choose the Actions tab under Existing integrations.
Choose the plus sign next to Model Context Protocol.

On the Create integration page, enter a name and description for your Tines integration.
For MCP server endpoint, enter the MCP server URL from your MCP server in your Tines story, then choose Next.

On the next page, configure the authentication settings and choose Create and continue to see the tools from your Tines MCP server.
Choose Next to complete the connection.

Query and visualize data in Quick Suite
After you’re connected, you can use the Quick Suite chat assistant to retrieve and explore data in real time, generate visual dashboards and charts from the returned results, and combine this data with existing AWS datasets for broader analysis. Quick Suite automatically selects and retrieves data from your Tines integration based on the content of the chat messages. This gives you a simple and scalable way to operationalize security and IT data using the BI and AI capabilities in Quick Suite. The following screenshot shows a sample security query.

The following screenshot shows the query result, including a security event timeline graph.

Clean up
To avoid incurring ongoing charges, clean up the resources you created as part of this solution.
Conclusion
Connecting Quick Suite and Tines using MCP transforms how organizations analyze their security and IT data. This solution reduces the need for custom integration code and provides centralized governance of integrations, standardized data retrieval, and improved operational visibility. Security and IT teams can extend their analytics capabilities to any API-enabled system through a single, auditable layer that scales across their tooling landscape.
Get Started with Quick Suite to create a Quick Suite instance in your AWS account and visit the Tines home page to sign up for a Tines Community Edition account. Once you have access, you can create your first MCP server and connect your existing security and IT tools using the Tines prebuilt templates. Finally, configure Quick Suite to access your new data sources and start analyzing data through natural language queries.
For more details, refer to the Amazon Quick Suite User Guide and Tines MCP server documentation.

About the Authors

Yannick Gloster
Yannick Gloster is a Software Engineer based in Dublin, Ireland, originally from Santa Barbara, California. He works on AI features and infrastructure at Tines, building Workbench, AI agents, and scalable AI infrastructure for the platform powering the world’s most important workflows. Yannick has a master’s degree in computer science from Trinity College Dublin, Ireland. In his spare time, he enjoys sailing, playing Counter-Strike and Deadlock, and watching Formula 1.

Jonah Craig
Jonah Craig is a Startup Solutions Architect based in Dublin, Ireland. He works with startup customers across the UK and Ireland and focuses on developing AI/ML and generative AI solutions. Jonah has a master’s degree in computer science and regularly speaks on stage at AWS conferences, such as the annual AWS London Summit and the AWS Dublin Cloud Day. In his spare time, he enjoys creating music and releasing it on Spotify.

Ashok Mahajan
Ashok Mahajan is a Senior Solutions Architect at Amazon Web Services. Based in the NYC Metropolitan area, Ashok is a part of Global Startup team focusing on Security Startups and helps them design and develop secure, scalable, and innovative solutions and architecture using the breadth and depth of AWS services and their features to deliver measurable business outcomes.

Bobby Williams
Bobby Williams is a Senior Solutions Architect at AWS. He has decades of experience designing, building, and supporting enterprise software solutions that scale globally. He works on solutions across industry verticals and horizontals and is driven to create a delightful experience for every customer.

Alibaba just released Qwen 3.5 Small models: a family of 0.8B to 9B pa …

Alibaba’s Qwen team has released the Qwen3.5 Small Model Series, a collection of Large Language Models (LLMs) ranging from 0.8B to 9B parameters. While the industry trend has historically favored increasing parameter counts to achieve ‘frontier’ performance, this release focuses on ‘More Intelligence, Less Compute.’ These models represent a shift toward deploying capable AI on consumer hardware and edge devices without the traditional trade-offs in reasoning or multimodality.

The series is currently available on Hugging Face and ModelScope, including both Instruct and Base versions.

The Model Hierarchy: Optimization by Scale

The Qwen3.5 small series is categorized into four distinct tiers, each optimized for specific hardware constraints and latency requirements:

Qwen3.5-0.8B and Qwen3.5-2B: These models are designed for high-throughput, low-latency applications on edge devices. By optimizing the dense token training process, these models provide a reduced VRAM footprint, making them compatible with mobile chips and IoT hardware.

Qwen3.5-4B: This model serves as a multimodal base for lightweight agents. It bridges the gap between pure text models and complex visual-language models (VLMs), allowing for agentic workflows that require visual understanding—such as UI navigation or document analysis—while remaining small enough for local deployment.

Qwen3.5-9B: The flagship of the small series, the 9B variant, focuses on reasoning and logic. It is specifically tuned to close the performance gap with models significantly larger (such as 30B+ parameter variants) through advanced training techniques.

Native Multimodality vs. Visual Adapters

One of the significant technical shifts in Qwen3.5-4B and above is the move toward native multimodal capabilities. In earlier iterations of small models, multimodality was often achieved through ‘adapters’ or ‘bridges’ that connected a pre-trained vision encoder (like CLIP) to a language model.

In contrast, Qwen3.5 incorporates multimodality directly into the architecture. This native approach allows the model to process visual and textual tokens within the same latent space from the early stages of training. This results in better spatial reasoning, improved OCR accuracy, and more cohesive visual-grounded responses compared to adapter-based systems.
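The "same latent space" idea can be sketched with a toy example: text tokens and image patch tokens share one vocabulary range and one embedding lookup, and are interleaved into a single sequence, rather than passing images through a separate adapter. The ids, vocabulary sizes, and dimensions below are arbitrary and are not the Qwen3.5 architecture.

```python
# Toy illustration of native multimodality: one shared embedding table
# serves both text tokens and image patch tokens in one sequence.
import random

VOCAB_TEXT = 32_000    # text token ids: 0 .. 31_999 (mock sizes)
VOCAB_PATCH = 4_096    # patch ids: 32_000 .. 36_095 share the same table

def embed(token_id, dim=8):
    # one deterministic (mock) embedding lookup for both modalities
    rng = random.Random(token_id)
    return [rng.uniform(-1, 1) for _ in range(dim)]

text_tokens = [101, 2543, 87]                      # e.g. "describe this image"
patch_tokens = [VOCAB_TEXT + p for p in (3, 17)]   # two image patches

# native multimodality: one interleaved sequence processed by one model
sequence = text_tokens[:2] + patch_tokens + text_tokens[2:]
hidden = [embed(t) for t in sequence]

print(len(sequence), len(hidden), len(hidden[0]))  # 5 5 8
```

An adapter-based design would instead embed the patches with a separate vision encoder and project them in afterward; sharing the space from the start is what the article credits for better spatial reasoning and OCR.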

Scaled RL: Enhancing Reasoning in Compact Models

The performance of the Qwen3.5-9B is largely attributed to the implementation of Scaled Reinforcement Learning (RL). Unlike standard Supervised Fine-Tuning (SFT), which teaches a model to mimic high-quality text, Scaled RL uses reward signals to optimize for correct reasoning paths.

The benefits of Scaled RL in a 9B model include:

Improved Instruction Following: The model is more likely to adhere to complex, multi-step system prompts.

Reduced Hallucinations: By reinforcing logical consistency during training, the model exhibits higher reliability in fact-retrieval and mathematical reasoning.

Efficiency in Inference: The 9B parameter count allows for faster token generation (higher tokens-per-second) than 70B models, while maintaining competitive logic scores on benchmarks like MMLU and GSM8K.
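The contrast with SFT can be made concrete with a toy REINFORCE-style loop: instead of imitating reference text, the policy is nudged in proportion to a reward on the sampled reasoning path. This is a minimal sketch of the principle only, not Qwen's training recipe, and the two-action "policy" is deliberately trivial.

```python
# Toy reward-driven update: a two-action policy learns to prefer the
# rewarded reasoning path. Illustrative only; not Scaled RL as deployed.
import math
import random

random.seed(1)
logits = {"correct_path": 0.0, "wrong_path": 0.0}

def probs():
    z = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / z for k, v in logits.items()}

def reward(path):
    # reward signal replaces the imitation target used in SFT
    return 1.0 if path == "correct_path" else -1.0

lr = 0.5
for _ in range(200):
    p = probs()
    # sample a reasoning path from the current policy
    path = random.choices(list(p), weights=list(p.values()))[0]
    # REINFORCE-style step: grad of log-prob is (indicator - prob)
    for k in logits:
        grad = (1.0 if k == path else 0.0) - p[k]
        logits[k] += lr * reward(path) * grad

print(round(probs()["correct_path"], 2))  # approaches 1.0
```

Both sampled outcomes push probability mass toward the rewarded path (a good sample is reinforced, a bad one is penalized), which is the sense in which RL optimizes for correct reasoning paths rather than surface mimicry.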

Summary Table: Qwen3.5 Small Series Specifications

| Model Size | Primary Use Case | Key Technical Feature |
| --- | --- | --- |
| 0.8B / 2B | Edge Devices / IoT | Low VRAM, high-speed inference |
| 4B | Lightweight Agents | Native multimodal integration |
| 9B | Reasoning & Logic | Scaled RL for frontier-closing performance |

By focusing on architectural efficiency and advanced training paradigms like Scaled RL and native multimodality, the Qwen3.5 series provides a viable path for developers to build sophisticated AI applications without the overhead of massive, cloud-dependent models.

Key Takeaways

More Intelligence, Less Compute: The series (0.8B to 9B parameters) prioritizes architectural efficiency over raw parameter scale, enabling high-performance AI on consumer-grade hardware and edge devices.

Native Multimodal Integration (4B Model): Unlike models that use ‘bolted-on’ vision towers, the 4B variant features a native architecture where text and visual data are processed in a unified latent space, significantly improving spatial reasoning and OCR accuracy.

Frontier-Level Reasoning via Scaled RL: The 9B model leverages Scaled Reinforcement Learning to optimize for logical reasoning paths rather than just token prediction, effectively closing the performance gap with models 5x to 10x its size.

Optimized for Edge and IoT: The 0.8B and 2B models are developed for ultra-low latency and minimal VRAM footprints, making them ideal for local-first applications, mobile deployment, and privacy-sensitive environments.

Check out the Model Weights. Also, feel free to follow us on Twitter and don’t forget to join our 120k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Alibaba just released Qwen 3.5 Small models: a family of 0.8B to 9B parameters built for on-device applications appeared first on MarkTechPost.

Meet NullClaw: The 678 KB Zig AI Agent Framework Running on 1 MB RAM a …

In the current AI landscape, agentic frameworks typically rely on high-level managed languages like Python or Go. While these ecosystems offer extensive libraries, they introduce significant overhead through runtimes, virtual machines, and garbage collectors. NullClaw is a project that diverges from this trend, implementing a full-stack AI agent framework entirely in raw Zig.

By eliminating the runtime layer, NullClaw achieves a compiled binary size of 678 KB and operates with approximately 1 MB of RAM. For devs working in resource-constrained environments or edge computing, these metrics represent a shift in how AI orchestration can be deployed.

Performance Benchmarks and Resource Allocation

The primary distinction between NullClaw and existing frameworks lies in its resource footprint. Standard agent implementations often require significant hardware overhead to maintain the underlying language environment:

Local machine benchmark (macOS arm64, Feb 2026), normalized for 0.8 GHz edge hardware.

| | OpenClaw | NanoBot | PicoClaw | ZeroClaw | NullClaw |
| --- | --- | --- | --- | --- | --- |
| Language | TypeScript | Python | Go | Rust | Zig |
| RAM | > 1 GB | > 100 MB | < 10 MB | < 5 MB | ~1 MB |
| Startup (0.8 GHz) | > 500 s | > 30 s | < 1 s | < 10 ms | < 8 ms |
| Binary Size | ~28 MB (dist) | N/A (Scripts) | ~8 MB | 3.4 MB | 678 KB |
| Tests | — | — | — | 1,017 | 3,230+ |
| Source Files | ~400+ | — | — | ~120 | ~110 |
| Cost | Mac Mini $599 | Linux SBC ~$50 | Linux Board $10 | Any $10 hardware | Any $5 hardware |

NullClaw’s ability to boot in under 2 milliseconds is a direct result of its lack of a virtual machine or interpreter. It compiles directly to machine code with zero dependencies beyond libc, ensuring that CPU cycles are dedicated entirely to logic rather than runtime management.

Architectural Design: The Vtable Interface Pattern

The most critical aspect of NullClaw is its modularity. Despite its small size, the system is not hard-coded for specific vendors. Every major subsystem—including providers, channels, tools, and memory backends—is implemented as a vtable interface.

A vtable (virtual method table) allows for dynamic dispatch at runtime. In NullClaw, this enables users to swap components via configuration changes without modifying or recompiling the source code. This architecture supports:

22+ AI Providers: Integration for OpenAI, Anthropic, Ollama, DeepSeek, Groq, and others.

13 Communication Channels: Native support for Telegram, Discord, Slack, WhatsApp, iMessage, and IRC.

18+ Built-in Tools: Executable functions for agentic task completion.

This modularity ensures that the core engine remains lightweight while remaining extensible for complex ‘subagent’ workflows and MCP (Model Context Protocol) integration.
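The vtable pattern described above can be sketched in a few lines. This is an illustrative Python analogue, not NullClaw's Zig code: a provider is a record of function pointers, the core engine dispatches through the record, and swapping backends is a configuration change rather than a code change. The provider names and `complete` slot are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Callable

# A "vtable" in the NullClaw sense: a record of function pointers. Every
# provider fills the same slots, so the core engine dispatches through the
# table without knowing which backend it is calling.
@dataclass
class ProviderVTable:
    name: str
    complete: Callable[[str], str]

def openai_complete(prompt: str) -> str:
    return f"[openai] {prompt}"

def ollama_complete(prompt: str) -> str:
    return f"[ollama] {prompt}"

# Registry keyed by a config string; adding a provider means adding an entry,
# not touching the engine.
REGISTRY = {
    "openai": ProviderVTable("openai", openai_complete),
    "ollama": ProviderVTable("ollama", ollama_complete),
}

def run_agent(config: dict, prompt: str) -> str:
    vtable = REGISTRY[config["provider"]]  # dynamic dispatch via the table
    return vtable.complete(prompt)

print(run_agent({"provider": "ollama"}, "hello"))  # → [ollama] hello
```

In Zig the same idea is expressed as a struct of function pointers plus an opaque context pointer, which is how NullClaw keeps 22+ providers behind one interface.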

Memory Management and Security

NullClaw manages memory manually, a core feature of the Zig programming language. To maintain a 1 MB RAM footprint while handling complex data, it utilizes a hybrid vector + keyword memory search. This allows the agent to perform retrieval-augmented generation (RAG) tasks without the overhead of an external, heavy vector database.
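A hybrid vector + keyword search of this kind can be sketched as a weighted blend of dense cosine similarity and exact keyword overlap. This is a minimal stdlib-only illustration of the idea, assuming tiny hand-made embeddings; NullClaw's actual scoring and storage details are not published in this post.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_overlap(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_search(query_text, query_vec, memory, alpha=0.5):
    # Blend dense similarity with keyword overlap; alpha weights the two.
    scored = [
        (alpha * cosine(query_vec, vec)
         + (1 - alpha) * keyword_overlap(query_text, text), text)
        for text, vec in memory
    ]
    return max(scored)[1]

memory = [
    ("reboot the router", [0.9, 0.1]),
    ("water the plants", [0.1, 0.9]),
]
print(hybrid_search("how do I reboot my router", [0.8, 0.2], memory))
```

The keyword term catches exact identifiers that dense vectors blur, while the vector term catches paraphrases the keywords miss, which is why the hybrid can work without a heavyweight vector database.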

Security is integrated into the low-level design rather than added as an external layer:

Encryption: API keys are encrypted by default using ChaCha20-Poly1305, an AEAD (Authenticated Encryption with Associated Data) algorithm known for high performance on mobile and embedded CPUs.

Execution Sandboxing: When agents utilize tools or execute code, NullClaw supports multi-layer sandboxing through Landlock (a Linux security module), Firejail, and Docker.

Hardware Peripheral Support

Because NullClaw is written in Zig and lacks a heavy runtime, it is uniquely suited for hardware interaction. It provides native support for hardware peripherals across various platforms, including Arduino, Raspberry Pi, and STM32. This enables the deployment of autonomous AI agents directly onto microcontrollers, allowing them to interact with physical sensors and actuators in real-time.

Engineering Reliability

A common concern with manual memory management and low-level implementations is system stability. NullClaw addresses this through rigorous validation:

Test Suite: The codebase includes 2,738 tests to ensure logic consistency and memory safety.

Codebase Volume: The framework comprises approximately 45,000 lines of Zig.

Licensing: It is released under the MIT License, allowing for broad commercial and private utility.

Key Takeaways

Extreme Resource Efficiency: By using raw Zig and eliminating runtimes (No Python, No JVM, No Go), NullClaw reduces RAM requirements to ~1 MB and binary size to 678 KB. This is a 99% reduction in resources compared to standard managed-language agents.

Near-Instant Cold Starts: The removal of a virtual machine or interpreter allows the system to boot in under 2 milliseconds. This makes it ideal for event-driven architectures or serverless functions where latency is critical.

Modular ‘Vtable’ Architecture: Every subsystem (AI providers, chat channels, memory backends) is a vtable interface. This allows developers to swap providers like OpenAI for local DeepSeek or Groq via simple config changes with zero code modifications.

Embedded and IoT Ready: Unlike traditional frameworks requiring a PC or expensive Mac Mini, NullClaw provides native support for Arduino, Raspberry Pi, and STM32. It allows a full agent stack to run on a $5 board.

Security-First Design: Despite its small footprint, it includes high-level security features: default ChaCha20-Poly1305 encryption for API keys and multi-layer sandboxing using Landlock, Firejail, and Docker to contain agent-executed code.

Check out the Repo.
The post Meet NullClaw: The 678 KB Zig AI Agent Framework Running on 1 MB RAM and Booting in Two Milliseconds appeared first on MarkTechPost.

FireRedTeam Releases FireRed-OCR-2B Utilizing GRPO to Solve Structural …

Document digitization has long been a multi-stage problem: first detect the layout, then extract the text, and finally try to reconstruct the structure. For Large Vision-Language Models (LVLMs), this often leads to ‘structural hallucinations’—disordered rows, invented formulas, or unclosed syntax.

The FireRedTeam has released FireRed-OCR-2B, a flagship model designed to treat document parsing as a structural engineering task rather than ‘impressionist’ text generation. Built on the Qwen3-VL-2B-Instruct architecture, this model establishes a new State-of-the-Art (SOTA) for end-to-end solutions, achieving an overall score of 92.94% on the OmniDocBench v1.5 benchmark.

Shifting the Paradigm: Structural Engineering vs. Text Generation

Devs often find that even the most powerful general VLMs struggle with the dense spatial logic of a technical PDF. When a model ‘sees’ a complex table or a multi-line LaTeX equation, it frequently fails to maintain the hierarchical relationship between elements.

FireRed-OCR-2B addresses this through a specialized Progressive Training Pipeline consisting of three distinct stages:

Multi-task Pre-alignment: This stage establishes spatial grounding by training the model on detection, region recognition, and layout-to-markdown tasks.

Specialized SFT (Supervised Fine-Tuning): The model is fine-tuned on a high-quality, standardized Markdown dataset to ensure logical consistency and hierarchical expression.

Format-Constrained GRPO: The final stage uses reinforcement learning to enforce syntactic validity.

The Core Innovation: Format-Constrained GRPO

The most significant technical differentiator for FireRed-OCR is its use of Format-Constrained Group Relative Policy Optimization (GRPO). While traditional fine-tuning focuses on character accuracy, GRPO introduces a reinforcement learning loop that rewards the model for specific structural traits:

Formula Syntax: Ensuring LaTeX equations are mathematically valid.

Table Integrity: Maintaining consistent row/column counts and proper HTML/Markdown tagging.

Hierarchical Closure: Verifying that all opened structural tags (like lists or headers) are correctly closed.

Text Accuracy: Reducing character-level errors in dense text blocks.

By eliminating the need for a separate ‘critic’ model—a key benefit of the GRPO algorithm—FireRedTeam has optimized the training process to focus specifically on the high-friction areas of document parsing.
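The structural traits listed above can be expressed as a reward function over model output. The following is a toy sketch in the spirit of Format-Constrained GRPO, not FireRedTeam's implementation: it scores LaTeX brace balance and table column consistency, then applies GRPO's group-relative step, where each sample's advantage is its reward minus the group mean, so no separate critic model is needed.

```python
def format_reward(output: str) -> float:
    """Toy structural reward: 0.5 for balanced LaTeX braces, 0.5 for
    consistent table column counts (illustrative sketch only)."""
    score = 0.0
    # Formula syntax: braces must balance and never close early.
    depth, balanced = 0, True
    for ch in output:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                balanced = False
                break
    if balanced and depth == 0:
        score += 0.5
    # Table integrity: every table row must have the same column count.
    widths = {len(line.split("|")) for line in output.splitlines() if "|" in line}
    if len(widths) <= 1:
        score += 0.5
    return score

good = "| a | b |\n| 1 | 2 |\n$\\frac{a}{b}$"
bad = "| a | b |\n| 1 | 2 | 3 |\n$\\frac{a}{b$"  # ragged row, unclosed brace

# GRPO's group-relative step: advantages are rewards minus the group mean.
rewards = [format_reward(good), format_reward(bad)]
baseline = sum(rewards) / len(rewards)
advantages = [r - baseline for r in rewards]
print(rewards, advantages)  # → [1.0, 0.0] [0.5, -0.5]
```

Because the baseline is the group mean rather than a learned value function, syntactically valid samples get positive advantage and malformed ones negative, directly penalizing structural hallucinations.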

Solving the Long-Tail Layout Problem

The ‘long-tail’ of document layouts (e.g., non-standard legal forms, academic papers with overlapping figures, or handwritten annotations) is where most OCR pipelines break. FireRed-OCR utilizes a ‘Geometry + Semantics’ Data Factory.

This novel approach uses geometric feature clustering and multi-dimensional tagging to synthesize balanced datasets. By combining geometric awareness with semantic understanding, the model maintains ‘In-the-Wild Robustness,’ outperforming traditional pipeline systems like PaddleOCR on complex, non-standard layouts (benchmarked on the FireRedBench dataset).

Performance Benchmarks

In head-to-head comparisons on OmniDocBench v1.5, FireRed-OCR-2B (92.94%) significantly outperforms other end-to-end models, including:

DeepSeek-OCR 2: 91.09%

Gemini-3.0 Pro: 90.33%

Qwen3-VL-235B: 89.15%

While some ‘pipeline’ solutions (which use separate models for detection and recognition) achieve slightly higher scores, FireRed-OCR-2B represents the leading performance for a single-model, end-to-end approach. This is particularly relevant for devs looking to reduce system complexity and inference latency in production RAG (Retrieval-Augmented Generation) environments.

Key Takeaways


New End-to-End SOTA Performance: FireRed-OCR-2B has achieved a state-of-the-art (SOTA) score of 92.94% on the OmniDocBench v1.5 benchmark. This makes it the leading single-model solution for document parsing, outperforming significantly larger models like Qwen3-VL-235B and Gemini-3.0 Pro in structural accuracy.

Architectural Foundation: Built on the Qwen3-VL-2B-Instruct base, the model utilizes a Vision-Language-Model (VLM) approach. It replaces traditional multi-stage pipelines (separate detection, cropping, and OCR steps) with a unified, end-to-end transformer architecture that outputs structured Markdown directly.

Structural Integrity via GRPO: A major technical differentiator is the use of Format-Constrained GRPO (Group Relative Policy Optimization). This reinforcement learning technique rewards the model for maintaining syntactic validity—specifically ensuring that LaTeX formulas, table tags, and Markdown hierarchies are logically closed and mathematically consistent.

‘Geometry + Semantics’ Data Factory: To solve the problem of complex ‘in-the-wild’ layouts, the FireRedTeam developed a specialized data engine. This ‘factory’ synthesizes datasets by balancing geometric layout features with semantic content, enabling the model to handle overlapping figures, multi-column academic papers, and non-standard forms more reliably than previous iterations.

Check out the Model Weights and Repo.
The post FireRedTeam Releases FireRed-OCR-2B Utilizing GRPO to Solve Structural Hallucinations in Tables and LaTeX for Software Developers appeared first on MarkTechPost.

Building specialized AI without sacrificing intelligence: Nova Forge …

Large language models (LLMs) perform well on general tasks but struggle with specialized work that requires understanding proprietary data, internal processes, and industry-specific terminology. Supervised fine-tuning (SFT) adapts LLMs to these organizational contexts and can be implemented through two distinct methodologies. Parameter-Efficient Fine-Tuning (PEFT) updates only a subset of model parameters, offering faster training and lower computational costs while maintaining reasonable performance improvements. Full-rank SFT updates all model parameters and can incorporate more domain knowledge than PEFT.
Full-rank SFT often faces a challenge: catastrophic forgetting. As models learn domain-specific patterns, they lose general capabilities including instruction-following, reasoning, and broad knowledge. Organizations must choose between domain expertise and general intelligence, which limits model utility across enterprise use cases.
Amazon Nova Forge addresses this problem. Nova Forge is a new service for building your own frontier models based on Nova. Nova Forge customers can start development from early model checkpoints, blend proprietary data with Amazon Nova-curated training data, and host their custom models securely on AWS.
In this post, we share results from the AWS China Applied Science team’s comprehensive evaluation of Nova Forge using a challenging Voice of Customer (VOC) classification task, benchmarked against open-source models. Working with over 16,000 customer comment samples across a complex four-level label hierarchy containing 1,420 leaf categories, we demonstrate how Nova Forge’s data mixing approach provides two advantages:

In-domain task performance gains: achieving 17% F1 score improvements
Preserved general capabilities: maintaining near-baseline MMLU (Massive Multitask Language Understanding) scores and instruction-following abilities post-finetuning

The challenge: real-world customer feedback classification
Consider a typical scenario at a large ecommerce company. The customer experience team receives thousands of customer comments daily with detailed feedback spanning product quality, delivery experiences, payment issues, website usability, and customer service interactions. To operate efficiently, they need an LLM that can automatically classify each comment into actionable categories with high precision. Each classification must be specific enough to route the issue to the right team: logistics, finance, development, or customer service, and trigger the appropriate workflow. This requires domain specialization.
However, this same LLM doesn’t operate in isolation. Across your organization, teams need the model to:

Generate customer-facing responses that require general communication skills
Perform data analysis requiring mathematical and logical reasoning
Draft documentation following specific formatting guidelines

This requires broad general capabilities—instruction-following, reasoning, knowledge across domains, and conversational fluency.

Evaluation methodology
Test overview
To test whether Nova Forge can deliver both domain specialization and general capabilities, we designed a dual-evaluation framework measuring performance across two dimensions.
For domain-specific performance, we use a real-world Voice of Customer (VOC) dataset derived from actual customer reviews. The dataset contains 14,511 training samples and 861 test samples, reflecting production-scale enterprise data. The dataset employs a four-level taxonomy where Level 4 represents the leaf categories (final classification targets). Each category includes a descriptive explanation of its scope. Example categories:

| Level 1 | Level 2 | Level 3 | Level 4 (leaf category) |
| --- | --- | --- | --- |
| Installation – app configuration | Initial setup guidance | Setup process | Easy setup experience: Installation process characteristics and complexity level |
| Usage – hardware experience | Night vision performance | Low-light image quality | Night vision clarity: Night vision mode produces images in low-light or dark conditions |
| Usage – hardware experience | Pan-tilt-zoom functionality | Rotation capability | 360-degree rotation: The camera can rotate a full 360 degrees, providing complete panoramic coverage |
| After-sales policy and cost | Return and exchange policy | Return process execution | Product return completed: Customer initiated and completed product return due to functionality issues |
The dataset exhibits extreme class imbalance typical of real-world customer feedback environments. The following image displays the class distribution:

As a result, the class imbalance poses a significant challenge for classification accuracy.
For evaluating general-purpose capabilities, we use the public test set split of the MMLU (Massive Multitask Language Understanding) benchmark (all subsets). The test spans subjects in the humanities, social sciences, hard sciences, and other areas. In this post, MMLU serves as a proxy for general capability retention. We use it to measure whether supervised fine-tuning improves domain performance at the cost of degrading foundational model behaviors, and to assess the effectiveness of Nova data mixing in mitigating catastrophic forgetting.

| Item | Description |
| --- | --- |
| Total samples | 15,372 customer reviews |
| Label hierarchy | 4-level classification, 1,420 categories in total |
| Training set | 14,511 samples |
| Test set | 861 samples |
| MMLU benchmark (all subsets, test split) | 14,000 samples |

In-domain task evaluation: voice of customer classification
To understand how Nova Forge performs in real enterprise scenarios, we first evaluate model accuracy on the VOC classification task before and after supervised fine-tuning. With this approach, we can quantify domain adaptation gains while establishing a baseline for subsequent robustness analysis.
Base model evaluation
We begin with a base model evaluation to assess out-of-the-box performance on the VOC classification task without any task-specific fine-tuning. This setup establishes each model’s inherent capability to handle highly granular classification under strict output format constraints. The following prompt is used for the VOC classification task:
# Role Definition
You are a rigorous customer experience classification system. Your sole responsibility is to map user feedback to the existing label taxonomy at Level 1 through Level 4 (L1–L4). You must strictly follow the predefined taxonomy structure and must not create, modify, or infer any new labels.
## Operating Principles
### 1. Strict taxonomy alignment All classifications must be fully grounded in the provided label taxonomy and strictly adhere to its hierarchical structure.
### 2. Feedback decomposition using MECE principles A single piece of user feedback may contain one or multiple issues. You must carefully analyze all issues described and decompose the feedback into multiple non-overlapping segments, following the MECE (Mutually Exclusive, Collectively Exhaustive) principle:
– **Semantic singularity**: Each segment describes only one issue, function, service, or touchpoint (for example, pricing, performance, or UI). – **Independence**: Segments must not overlap in meaning. – **Complete coverage**: All information in the original feedback must be preserved without omission.
### 3. No taxonomy expansion You must not invent, infer, or modify any labels or taxonomy levels.
## Label Taxonomy
The following section provides the label taxonomy: {tag category}. Use this taxonomy to perform L1–L4 classification for the original VOC feedback. No taxonomy expansion is allowed.
## Task Instructions
You will be given a piece of user feedback: {user comment}. Users may come from different regions and use different languages. You must accurately understand the user’s language and intent before assigning labels.
Refer to the provided examples for the expected labeling format.
## Output Format
Return the classification results in JSON format only. For each feedback segment, output the original text along with the corresponding L1–L4 labels and sentiment. Do not generate or rewrite content.
```json
[
  {
    "content": "<comment_text>",
    "L1": "<L1>",
    "L2": "<L2>",
    "L3": "<L3>",
    "L4": "<L4>",
    "emotion": "<emotion>"
  }
]
```
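Since the prompt forbids taxonomy expansion, a downstream validator can reject any output whose L1–L4 path is not in the predefined taxonomy. The following sketch uses a two-entry taxonomy taken from the example categories above; a real deployment would load the full 1,420-leaf hierarchy, and the helper names are illustrative.

```python
import json

# Hypothetical two-entry subset of the taxonomy described in this post.
TAXONOMY = {
    ("Usage - hardware experience", "Night vision performance",
     "Low-light image quality", "Night vision clarity"),
    ("After-sales policy and cost", "Return and exchange policy",
     "Return process execution", "Product return completed"),
}

def validate(raw: str) -> bool:
    """Reject outputs that are not JSON or that invent labels outside
    the predefined taxonomy."""
    try:
        segments = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(
        (s.get("L1"), s.get("L2"), s.get("L3"), s.get("L4")) in TAXONOMY
        for s in segments
    )

raw = json.dumps([{
    "content": "night mode is blurry",
    "L1": "Usage - hardware experience",
    "L2": "Night vision performance",
    "L3": "Low-light image quality",
    "L4": "Night vision clarity",
    "emotion": "negative",
}])
print(validate(raw))  # → True
```

Checking the full four-level tuple (rather than each level independently) also catches outputs that mix valid labels from different branches of the hierarchy.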
For base model evaluation, we selected:

Amazon Nova 2 Lite: Evaluated on Amazon Bedrock
Qwen3-30B-A3B: Open source model deployed on Amazon Elastic Compute Cloud (Amazon EC2) with vLLM

| Model | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| Nova 2 Lite | 0.4596 | 0.3627 | 0.387 |
| Qwen3-30B-A3B | 0.4567 | 0.3864 | 0.394 |

The F1-scores reveal that Nova 2 Lite and Qwen3-30B-A3B demonstrate comparable performance on this domain-specific task, with both models achieving F1-scores near 0.39. These results also highlight the inherent difficulty of the task: even strong foundation models struggle with fine-grained label classification when no domain-specific data is provided.
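The precision, recall, and F1 figures above are computed over predicted label paths. A micro-averaged version of this metric can be sketched as follows; this is an illustration of the standard metric, not the team's exact scorer, and the toy label tuples are placeholders.

```python
def micro_f1(gold, pred):
    """Micro-averaged precision/recall/F1 over sets of (L1, L2, L3, L4)
    label paths, one set per comment."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correct paths
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious paths
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed paths
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [{("L1a", "L2a", "L3a", "L4a")}, {("L1b", "L2b", "L3b", "L4b")}]
pred = [{("L1a", "L2a", "L3a", "L4a")}, {("L1x", "L2x", "L3x", "L4x")}]
p, r, f = micro_f1(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.5 0.5 0.5
```

Micro-averaging pools counts across all comments before computing the ratios, which is the usual choice under the extreme class imbalance this dataset exhibits.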
Supervised fine-tuning
We then apply full-parameter supervised fine-tuning (SFT) using customer VOC data. All models are fine-tuned using the same dataset and comparable training configurations for a fair comparison.
Training infrastructure:

Nova 2 Lite: Fine-tuned on an Amazon SageMaker HyperPod cluster using four p5.48xlarge instances (as specified in the Nova customization SageMaker HyperPod topic in the Amazon SageMaker AI Developer Guide)
Qwen3-30B-A3B: Fine-tuned on Amazon EC2 using p6-b200.48xlarge instances

In-domain task performance comparison

| Model | Training Data | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Nova 2 Lite | None (baseline) | 0.4596 | 0.3627 | 0.387 |
| Nova 2 Lite | Customer data only | 0.6048 | 0.5266 | 0.5537 |
| Qwen3-30B | Customer data only | 0.5933 | 0.5333 | 0.5552 |

After fine-tuning on customer data alone, Nova 2 Lite achieves a substantial performance improvement, with F1 increasing from 0.387 to 0.5537—an absolute gain of 17 points. This result places the Nova model in the top tier for this task and makes its performance comparable to that of the fine-tuned Qwen3-30B open-source model. These results confirm the effectiveness of Nova full-parameter SFT for complex enterprise classification workloads.
General capabilities evaluation: MMLU benchmark
Models fine-tuned for VOC classification are often deployed beyond a single task and integrated into broader enterprise workflows. Preserving general-purpose capabilities is important. Industry-standard benchmarks such as MMLU provide an effective mechanism for evaluating general-purpose capabilities and detecting catastrophic forgetting in fine-tuned models.
For the fine-tuned Nova model, Amazon SageMaker HyperPod offers out-of-the-box evaluation recipes that streamline MMLU evaluation with minimal configuration.

| Model | Training data | VOC F1-Score | MMLU accuracy |
| --- | --- | --- | --- |
| Nova 2 Lite | None (baseline) | 0.38 | 0.75 |
| Nova 2 Lite | Customer data only | 0.55 | 0.47 |
| Nova 2 Lite | 75% customer + 25% Nova data | 0.50 | 0.74 |
| Qwen3-30B | Customer data only | 0.55 | 0.0038 |

When Nova 2 Lite is fine-tuned using customer data only, we observe a significant drop in MMLU accuracy from 0.75 to 0.47, indicating the loss of general-purpose capabilities. The degradation is even more pronounced for the Qwen model, which largely loses instruction-following ability after fine-tuning. An example of Qwen model degraded output:
{
  "prediction": "[ { \"content\": \"x^5 + 3x^3 + x^2 + 2x in Z_5\", \"A\": \"0\", \"B\": \"1\", \"C\": \"0,1\", \"D\": \"0,4\", \"emotion\": \"neutral\" } ]"
}
This behavior is also related to the VOC prompt design, where category knowledge is internalized through supervised fine-tuning—a common approach in large-scale classification systems.
Notably, when Nova data mixing is applied during fine-tuning, Nova 2 Lite retains near-baseline general performance. MMLU accuracy remains at 0.74, only 0.01 below the original baseline, while VOC F1 still improves by 12 points (0.38 → 0.50). This validates that Nova data mixing is a practical and effective mechanism for mitigating catastrophic forgetting while preserving domain performance.
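The 75/25 blend described above can be sketched as a simple dataset-mixing step. This is an illustration of the idea only; Nova Forge performs the mixing internally with Amazon-curated data, and the sample records and helper name here are hypothetical.

```python
import random

def mix_datasets(customer, curated, customer_frac=0.75, seed=0):
    """Blend customer SFT samples with general-purpose curated samples
    so the fine-tuned model retains general capabilities (sketch)."""
    n_curated = round(len(customer) * (1 - customer_frac) / customer_frac)
    rng = random.Random(seed)
    mixed = list(customer) + rng.sample(curated, min(n_curated, len(curated)))
    rng.shuffle(mixed)  # interleave so batches see both distributions
    return mixed

customer = [{"task": "voc", "id": i} for i in range(75)]
curated = [{"task": "general", "id": i} for i in range(1000)]
mixed = mix_datasets(customer, curated)
print(len(mixed))  # → 100 (75 customer + 25 curated)
```

Keeping general-purpose samples in every batch means gradient updates never optimize exclusively for the narrow task, which is the mechanism by which mixing mitigates catastrophic forgetting.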
Key findings and practical recommendations
This evaluation shows that when the base model provides a strong foundation, full-parameter supervised fine-tuning on Amazon Nova Forge can deliver substantial gains for complex enterprise classification tasks. At the same time, the results confirm that catastrophic forgetting is a real concern in production fine-tuning workflows. Fine-tuning on customer data alone can degrade general-purpose capabilities such as instruction following and reasoning, limiting a model’s usability across broader business scenarios.
The data mixing capability of Nova Forge provides an effective mitigation strategy. By blending customer data with Nova-curated datasets during fine-tuning, teams can preserve near-baseline general capabilities while continuing to achieve strong domain-specific performance.
Based on these findings, we recommend the following practices when using Nova Forge:

Use supervised fine-tuning to maximize in-domain performance for complex or highly customized tasks.
Apply Nova data mixing when models are expected to support multiple general-purpose workflows in production, to reduce the risk of catastrophic forgetting.

Together, these practices help balance model customization with production robustness, enabling more reliable deployment of fine-tuned models in enterprise environments.

Conclusion
In this post, we demonstrated how organizations can build specialized AI models without sacrificing general intelligence with Nova Forge data mixing capabilities. Depending on your use cases and business objectives, Nova Forge can deliver other benefits, including access checkpoints across all phases of model development and performing reinforcement learning with reward functions in your environment. To get started with your experiments, see the Nova Forge Developer Guide for detailed documentation.

About the authors
Yuan Wei is an Applied Scientist at Amazon Web Services, working with enterprise customers on proof-of-concepts and technical advisory. She specializes in large language models and vision-language models, with a focus on evaluating emerging techniques under real-world data, cost, and system constraints.
Xin Hao is a Senior AI/ML Go-to-Market Specialist at AWS, helping customers achieve success with Amazon Nova models and related Generative AI solutions. He has extensive hands-on experience in cloud computing, AI/ML, and Generative AI. Prior to joining AWS, Xin spent over 10 years in the industrial manufacturing sector, including industrial automation and CNC machining.
Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.

Build a serverless conversational AI agent using Claude with LangGraph …

Customer service teams face a persistent challenge. Existing chat-based assistants frustrate users with rigid responses, while direct large language model (LLM) implementations lack the structure needed for reliable business operations. When customers need help with order inquiries, cancellations, or status updates, traditional approaches either fail to understand natural language or can’t maintain context across multistep conversations.
This post explores how to build an intelligent conversational agent using Amazon Bedrock, LangGraph, and managed MLflow on Amazon SageMaker AI.
Solution overview
The conversational AI agent presented in this post demonstrates a practical implementation for handling customer order inquiries, a common but often challenging use case for existing customer service automation solutions. We implement an intelligent order management agent that addresses these challenges by helping customers find information about their orders and take actions such as cancellations through natural conversation. The system uses a graph-based conversation flow with three key stages:

Entry intent – Identifies what the customer wants and collects necessary information
Order confirmation – Presents found order details and verifies customer intentions
Resolution – Executes the customer’s request and provides closure

This agentic flow is illustrated in the following graphic.
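The three-stage flow can be sketched as a minimal graph of nodes with edges chosen at runtime. This plain-Python state machine illustrates the structure only; the actual solution builds the graph with LangGraph, and the node logic and field names below are placeholders.

```python
# Each node updates the shared conversation state and returns the name of
# the next node (None = terminal), mirroring a graph-based flow.
def entry_intent(state):
    state["intent"] = "cancel" if "cancel" in state["message"].lower() else "status"
    return "order_confirmation"

def order_confirmation(state):
    # In the real system this step verifies the order against a backend.
    state["order"] = {"id": state.get("order_id", "unknown"), "confirmed": True}
    return "resolution"

def resolution(state):
    state["reply"] = f"Done: {state['intent']} for order {state['order']['id']}"
    return None

NODES = {
    "entry_intent": entry_intent,
    "order_confirmation": order_confirmation,
    "resolution": resolution,
}

def run(state, start="entry_intent"):
    node = start
    while node is not None:  # follow edges until a terminal node
        node = NODES[node](state)
    return state

state = run({"message": "Please cancel my order", "order_id": "A123"})
print(state["reply"])  # → Done: cancel for order A123
```

Because state flows through every node, later stages can condition on what earlier stages extracted, which is exactly the multistep context that raw LLM calls lack.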

Problem statement
Most customer service automation solutions fall into two categories, each with significant limitations.
Rule-based chat assistants often fail at natural language understanding, leading to frustrating user experiences. They typically follow rigid decision trees that can’t handle the nuances of human conversation. When users deviate from expected inputs, these systems fail, forcing users to adapt to the assistant rather than the other way around. For example, a rule-based chat assistant might recognize “I want to cancel my order” but fail with “I need to return something I just bought” because it doesn’t match predefined patterns.
Meanwhile, modern LLMs excel at understanding natural language but present their own challenges when used directly. LLMs don’t inherently maintain state or follow multistep processes, making conversation management difficult. Connecting LLMs to backend systems requires careful orchestration, and monitoring their performance presents unique observability challenges. Most critically, LLMs may generate plausible but incorrect information when they lack access to domain knowledge.
To understand these limitations for a real-world example, consider a seemingly simple customer service scenario: a user needs to check on an order status or request a cancellation. This interaction requires understanding the user’s intent, extracting relevant information like order numbers and account details, verifying information against backend systems, confirming actions before execution, and maintaining context throughout the conversation. Without a structured approach, both rule-based systems and raw LLMs can’t handle these multistep processes that require memory, planning, and integration with external systems.
These fundamental limitations explain why existing approaches consistently fall short in real-world applications. Rule-based systems can’t effectively bridge natural conversation with structured business processes, while LLMs can’t maintain state across multiple interactions. Neither approach can seamlessly integrate with backend systems for data retrieval and updates, and both provide limited visibility into performance and user experience. Most critically, current solutions can’t balance the flexibility needed for natural conversation with the business rule enforcement required for reliable customer service.
This solution addresses these challenges through AI agents—systems that combine the natural language capabilities of LLMs with structured workflows, tool integration, and comprehensive observability.
Solution architecture
This solution implements a serverless conversational AI system using a WebSocket-based architecture for real-time customer interactions. Customers access a React frontend hosted on Amazon Simple Storage Service (Amazon S3) and delivered through Amazon CloudFront. When customers send messages, the system establishes a persistent WebSocket connection through Amazon API Gateway to AWS Lambda functions that orchestrate the conversation flow. The following diagram illustrates the solution architecture.

Agent architecture
This solution uses AI agents, systems where LLMs dynamically direct their own processes and tool usage while maintaining control over how they accomplish tasks. Unlike simple LLM applications, these agents maintain state and context across multiple interactions, can use external tools to gather information or perform actions, reason about their next steps based on previous outcomes, and operate with some degree of autonomy. The agent workflow follows a structured pattern of initialization, understanding user intent, planning required actions, executing tool calls when needed, generating responses, and updating conversation state for future interactions.
To build effective conversational agents, we need four core capabilities:

Intelligence to understand and respond to users
Memory to maintain context across conversations
Ability to take actions in external systems
Orchestration to manage complex multistep workflows

Our implementation addresses these requirements through specific Amazon Web Services (AWS) services and frameworks.
Amazon Bedrock serves as the intelligence layer, providing access to state-of-the-art foundation models (FMs) through a consistent API. Amazon Bedrock is used to handle intent recognition to understand what users are trying to accomplish, entity extraction to identify key information such as order numbers and customer details, natural language generation to create contextually appropriate responses, decision-making to determine the next best action in conversation flows, and coordination of tool use to interact with external systems. This intelligence layer enables our agent to understand natural language while maintaining the structured decision-making needed for reliable customer service.
State management (agent memory) is handled through Amazon DynamoDB, which provides persistent storage for conversation context even if there are interruptions or system restarts. The state includes session IDs as unique conversation identifiers, complete conversation history for context maintenance, formatted transcripts optimized for model context windows, extracted information such as order numbers and customer details, and process flags indicating confirmation status and information retrieval success. This persistent state allows our agent to maintain context across multiple interactions, addressing one of the key limitations of raw LLM implementations.
The following code snippet shows how state is saved (for the full implementation, refer to the code in backend/app.py):

# Save conversation state to DynamoDB
ttl_value = int(time.time()) + (3600 * 4)  # 4 hrs
item = {
    'conversationId': session_id,
    'state': json.dumps(state_dict),
    'chat_status': state_dict['session_end'],
    'update_ts_pst': str(datetime.now(pst)),
    'ttl': ttl_value,
    'timestamp': int(time.time())
}
ddb_table.put_item(Item=item)

The state includes a Time-To-Live (TTL) value that automatically expires conversations after a period of inactivity, helping to manage storage costs.
Function calling, also known as tool use, enables our agent to interact with external systems in a structured way. Instead of generating free-form text that attempts to describe an action, the model generates structured calls to predefined functions with specific parameters. You can think of this as giving the LLM a set of tools complete with instruction manuals, where the LLM decides when to use these tools and what information to provide. Our implementation defines specific tools that connect to an Amazon Relational Database Service (Amazon RDS) for PostgreSQL database: get_user for customer lookups, get_order_by_id for order details, get_customer_orders for listing customer orders, cancel_order for order cancellations, and update_order for order modifications.
The following code snippet handles the message sequence between the assistant and the user, passing the right tool name and input parameters. (For implementation details, refer to backend/utils/utils.py):

def use_tool(messages):
    tool_use = messages[-1]["content"][-1].get("toolUse")
    if tool_use:
        tool_name = tool_use["name"]
        tool_input = tool_use["input"]

        # Process the tool call
        tool_result = _process_tool_call(tool_name, tool_input)

        # Format response for the model
        message = {
            "role": "user",
            "content": [
                {
                    "toolResult": {
                        "toolUseId": tool_use["toolUseId"],
                        "content": [
                            {"text": json.dumps(tool_result)}
                        ],
                        "status": "success",
                    }
                }
            ],
        }
        return message

The tools are defined with JSON schemas that provide clear contracts for the model to follow:

tool_config = {
    "toolChoice": {"auto": {}},
    "tools": [
        {
            "toolSpec": {
                "name": "get_order_by_id",
                "description": "Retrieves the details of a specific order based on the order ID.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "order_id": {
                                "type": "string",
                                "description": "The unique identifier for the order.",
                            }
                        },
                        "required": ["order_id"],
                    },
                },
            },
        }
    ]
}

The previous code snippet shows only one example of a tool definition; the implementation configures three different tools. For full details, refer to backend/tools_config/entry_intent_tool.py or backend/tools_config/agent_tool.py.
This capability grounds the model to real-world data and systems, reduces hallucinations by providing factual information, extends the model’s capabilities beyond what it could do alone, and enforces consistent patterns for system interactions. The structured nature of function calling means that the model can only request specific data through well-defined interfaces rather than making assumptions.
LangGraph provides the orchestration framework for building stateful, multistep applications using a directed graph approach. It offers explicit tracking of conversation state, separation of concerns where each node handles a specific conversation phase, conditional routing for dynamic decision-making based on context, cycle detection to handle loops and repetitive patterns, and flexible architecture that’s straightforward to extend with new nodes or modify existing flows. You can think of LangGraph as creating a flowchart for your conversation, where each box represents a specific part of the conversation and the arrows show how to move between them.
The conversation flow is implemented as a directed graph using LangGraph. For reference, check the agentic flow graphic in the Solution architecture section.
The following code snippet shows the state graph, a structured context object used to collect information across user interactions and give the agent the proper context:

class State(TypedDict):
    # Messages tracked in the conversation history
    messages: list
    # Transcript tracks updates posted by the agent
    transcript: list
    # Session ID is the unique identifier of the conversation
    session_id: str
    # Order number
    order_number: str
    # Tracks whether the conversation is still active
    session_end: bool
    # Tracks the current turn in the conversation
    current_turn: int
    # Tracks the next node in the conversation
    next_node: str
    # Tracks the status of the confirmation
    order_confirmed: bool
    # Tracks whether eligible orders were found
    order_info_found: bool

This state object maintains the relevant information about the conversation, allowing the system to make informed decisions about routing and responses.
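A fresh conversation needs an initial state. A minimal sketch follows; the field defaults here are assumptions for illustration, not taken from the repository:

```python
def new_state(session_id: str) -> dict:
    """Return the starting state for a fresh conversation.

    Defaults are hypothetical: an empty history, turn zero, and routing
    to the entry intent node on the first pass.
    """
    return {
        "messages": [],
        "transcript": [],
        "session_id": session_id,
        "order_number": "",
        "session_end": False,
        "current_turn": 0,
        "next_node": "entry_intent",
        "order_confirmed": False,
        "order_info_found": False,
    }
```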
Our conversation flow uses three main nodes: the entry intent node handles initial user requests and extracts key information, the order confirmation node verifies details and confirms user intent, and the resolution node executes requested actions and provides closure. This approach offers explicit state management, conditional routing, separation of concerns, reusability across different conversation flows, and clear visualization of conversation paths:

# Define nodes and edges
graph_builder = StateGraph(State)

# Add nodes
graph_builder.add_node("entry_intent", entry_intent.node)
graph_builder.add_node("order_confirmation", order_confirmation.node)
graph_builder.add_node("resolution", resolution.node)

# Add conditional edges with routing logic
graph_builder.add_conditional_edges(
    START,
    initial_router,
    {
        'entry_intent': 'entry_intent',
        'order_confirmation': 'order_confirmation',
        'resolution': 'resolution'
    }
)
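The initial_router used above is not shown in the snippet. A hedged sketch of its likely behavior follows (the actual logic lives in the repository; resuming from the saved next_node field is an assumption):

```python
def initial_router(state: dict) -> str:
    """Pick the entry node for this turn.

    Resume from the position saved in the state on a previous turn if one
    exists; otherwise start at the entry intent node (assumed behavior).
    """
    return state.get("next_node") or "entry_intent"
```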

The edges between nodes use conditional logic to determine the flow at runtime based on the content of the State, as shown in the following code snippet:

graph_builder.add_conditional_edges(
    'entry_intent',
    lambda x: x["next_node"],
    {
        'order_confirmation': 'order_confirmation',
        '__end__': END
    }
)

Each node in the conversation graph is implemented as a Python function that processes the current state and returns an updated state. The entry intent node handles initial user requests, extracts key information such as order numbers, and determines next steps by interpreting customer queries. It uses tools to search for relevant order information, extracts key details such as order numbers or customer identifiers, and determines whether enough information is available to proceed. The order confirmation node verifies details and confirms user intent by presenting found order details to the customer, verifying this is the correct order being discussed, and confirming the customer’s intentions regarding the order. The resolution node executes requested actions and provides closure by executing necessary actions such as providing status or canceling orders, confirming successful completion of requested actions, answering follow-up questions about the order, and providing a natural conclusion to the conversation:

@mlflow.trace(span_type=SpanType.AGENT)
def node(state: Dict[str, Any]) -> Dict[str, Any]:
    """
    Entry intent node for processing chat messages and managing order information.

    This node handles:
    1. Initial message processing with the chat model
    2. Tool execution for order information retrieval
    3. State management and updates
    4. Dynamic routing based on order information
    """

For full implementation details, refer to: backend/nodes.
The nodes use a consistent pattern of extracting relevant information from the state, processing the user message using the LLM, executing the necessary tools, updating the state with new information, and determining the next node in the flow.
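That five-step pattern can be sketched as a skeleton node. This is hypothetical; the LLM call is stubbed out here, and the real implementations live in backend/nodes:

```python
from typing import Any, Dict


def fake_llm(prompt: str) -> str:
    """Stand-in for the Amazon Bedrock call made by the real nodes."""
    return f"echo: {prompt}"


def skeleton_node(state: Dict[str, Any]) -> Dict[str, Any]:
    # 1. Extract relevant information from the state
    last_message = state["messages"][-1]
    # 2. Process the user message with the LLM (stubbed here)
    reply = fake_llm(last_message)
    # 3. Tools would be executed here when the model requests them
    # 4. Update the state with new information
    state["messages"].append(reply)
    state["current_turn"] += 1
    # 5. Determine the next node in the flow (routing rule is an assumption)
    state["next_node"] = "order_confirmation" if "order" in last_message else "__end__"
    return state
```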
Observability becomes essential because LLM applications present unique challenges including nondeterministic outputs where the same input can produce different results, complex chains where multiple models and tools interact in sequence, performance monitoring where latency affects user experience, and quality assessment that requires specialized metrics. Managed MLflow on Amazon SageMaker AI addresses these challenges through specialized tracing capabilities that monitor model interactions, latency, token usage, and conversation paths.
Each conversation node is decorated with MLflow tracing:

@mlflow.trace(span_type=SpanType.AGENT)
def node(state: Dict[str, Any]) -> Dict[str, Any]:
    # Node implementation

This straightforward decorator automatically captures rich information about each node’s execution. It records model invocations, showing which models were called and with what parameters. It tracks response metrics such as latency, token usage, and completion reasons. It maps conversation paths, showing how users navigate through the conversation graph. It also logs tool usage, indicating which tools were called and their results, as well as error patterns identifying when and why failures occur.
The captured data is visualized in the MLflow UI, providing insights for production performance monitoring, optimization opportunities, debugging, and business impact measurement.
MLflow traces capture the whole agentic workflow execution, including the nodes involved in the interaction, the inputs and outputs per node, and additional metadata such as latency, tool calls, and the conversation sequence. The following screenshot shows an example of these traces in the MLflow tracking server.

This traceability is critical for continuous improvement of the agent. Developers can identify patterns in successful conversations and opportunities for optimization.
Prerequisites
To build a serverless conversational AI agent using Claude with LangGraph and managed MLflow on Amazon SageMaker AI, you need the following prerequisites.

AWS account requirements:

An AWS account with permissions to create Lambda functions, DynamoDB tables, API gateways, S3 buckets, CloudFront distributions, Amazon RDS for PostgreSQL instances, and Amazon Virtual Private Cloud (Amazon VPC) resources
Amazon Bedrock access with Claude 3.5 Sonnet by Anthropic enabled

Development environment:

The AWS Command Line Interface (AWS CLI) is installed on your local machine
Git and Docker utilities are installed on your local machine
Permission to create AWS resources
Python 3.12 or later
Node.js 20+ and npm installed
AWS Cloud Development Kit (AWS CDK) CLI installed (`npm install -g aws-cdk`)
Amazon CloudWatch Logs role Amazon Resource Name (ARN) configured in API Gateway account settings (required for API Gateway logging):

Create an AWS Identity and Access Management (IAM) role with required permissions. For guidance, refer to Permissions for CloudWatch logging.
Configure the role in the API Gateway console. Follow steps 1–3 only.

Skills and knowledge:

Familiarity with serverless architectures
Basic knowledge of Python and React
Understanding of AWS services (AWS Lambda, Amazon DynamoDB, Amazon VPC)

Deployment guide
To build a serverless conversational AI agent using Claude with LangGraph and managed MLflow on Amazon SageMaker AI, follow these steps:

Clone the repository and set up the project root:

git clone https://github.com/aws-samples/sample-aws-genai-serverless-orchestration-chatbot-mlflow.git
cd sample-aws-genai-serverless-orchestration-chatbot-mlflow
export PROJECT_ROOT=$(pwd)

Bootstrap your AWS environment (required if the environment hasn’t been bootstrapped before):

cd $PROJECT_ROOT/infra
cdk bootstrap

Install dependencies:

# Install dependencies
cd $PROJECT_ROOT
make install

Build and deploy application:

cd $PROJECT_ROOT
make deploy

This script will:

Deploy the backend infrastructure, including the VPC, Lambda function, database, and MLflow
Get the Lambda ARN from the backend stack
Deploy the frontend with integrated WebSocket API Gateway
Get the actual WebSocket API URL from the deployed stack
Create and upload config.json with runtime configuration to Amazon S3

Clean up
To avoid ongoing charges from the resources created in this post, clean up the resources when they’re no longer needed. Use the following command:

cd $PROJECT_ROOT
make clean

Conclusion
In this post, we showed how combining the reasoning capabilities of LLMs from Amazon Bedrock, orchestration capabilities of LangGraph, and observability of managed MLflow on Amazon SageMaker AI can be used to build customer service agents. The architecture enables natural, multiturn conversations while maintaining context across interactions, seamlessly integrating with backend systems to perform real-world actions such as order lookups and cancellations.
MLflow provides comprehensive observability, so developers can monitor conversation flows, track model performance, and optimize the system based on real usage patterns. By using AWS serverless services, this solution automatically scales to handle varying workloads while maintaining cost efficiency through pay-per-use pricing. You can use this blueprint to build sophisticated conversational AI solutions that bridge the gap between natural language interaction and structured business processes, delivering business value through improved customer experiences and operational efficiency.
Ready to take your conversational AI agent further? Get started with Amazon Bedrock AgentCore to accelerate your agents to production with intelligent memory and a gateway to enable secure, controlled access to tools and data. Discover how MLflow integrates with Bedrock AgentCore Runtime for comprehensive observability across your agent ecosystem.

About the Authors

Sri Potluri
Sri Potluri is a Cloud Infrastructure Architect at AWS. He is passionate about solving complex problems and delivering well-structured solutions for diverse customers. His expertise spans across a range of cloud technologies, providing scalable and reliable infrastructures tailored to each project’s unique challenges.

Luis Felipe Yepez Barrios
Luis Felipe Yepez Barrios is a Machine Learning Engineer with AWS Professional Services, focused on scalable distributed systems and automation tooling to expedite scientific innovation in the field of machine learning (ML). Furthermore, he assists enterprise clients in optimizing their machine learning solutions through AWS services.

Build safe generative AI applications like a Pro: Best Practices with …

Are you struggling to balance generative AI safety with accuracy, performance, and costs? Many organizations face this challenge when deploying generative AI applications to production. A guardrail that’s too strict blocks legitimate user requests, which frustrates customers. One that’s too lenient exposes your application to harmful content, prompt attacks, or unintended data exposure. Finding the right balance requires more than just enabling features; it demands thoughtful configuration and nearly continuous refinement.
Amazon Bedrock Guardrails gives you powerful tools for implementing responsible AI safeguards: content filtering for both text and images (including prompt attack prevention), topic classification, sensitive information protection, contextual grounding checks, and automated reasoning checks. In this post, we will show you how to configure these capabilities for more efficient performance, implement best practices to protect your applications, and monitor your deployment effectively to maintain the right balance between safety and user experience.
Let’s explore the strategies that will help you deploy guardrails confidently in production.
Best practices for using Amazon Bedrock Guardrails
For the maximum benefit from Amazon Bedrock Guardrails, we recommend that you adopt the following best practices.
1. Select the right guardrail policies
The choice of which guardrail policies to use in production workflows depends on your specific use case, but several foundational policies provide protection suitable for most implementations.
Content Policy blocks harmful content across hate speech, insults, sexual content, violence, and misconduct, helping you maintain content safety in applications. We recommend this for all production deployments.
Beyond text content, you can extend the content filters for images to apply the same content moderation policies to both text and images in your generative AI applications. This multimodal capability helps block harmful visual content across all six content filter categories: Hate, Insults, Sexual, Violence, Misconduct, and Prompt Attacks. When configuring content filters, you can choose to apply filtering to text only, images only, or both modalities.
Prompt Attack Prevention can help identify potential jailbreak attempts, prompt injection attacks, and prompt leakage attacks that might seek to weaken safety features and developer instructions. This policy is recommended for maintaining application security.
Sensitive Information Policy offers masking or removal capabilities for personally identifiable information (PII), which can help you protect customer data and support your compliance efforts.
Word Policy blocks specific words or phrases, commonly used to filter profanity, industry-specific restricted terms, or custom vocabulary restrictions.
Topic Policy helps you enforce custom Responsible AI (RAI) policies, maintain compliance with organizational guidelines, and control conversation scope and subject matter.
For specialized use cases, you can add Contextual Grounding to help validate whether responses are supported by trusted reference materials, help reduce model hallucinations during content summarization, and help maintain conversation relevance. You can use the Automated Reasoning Policy to enforce compliance with regulatory requirements, validate outputs against specific business rules, and implement sophisticated filtering beyond keyword matching.
Start with base policies that align with your core security and compliance requirements, then add specialized policies based on specific use case needs. Regular review and adjustment of your policies can help improve protection while helping to maintain desired functionality.
1.1 Choose the correct safeguard tier
Guardrails currently provides two safeguard tiers for content policy, prompt attack prevention, and topic policy: classic tier and standard tier. For most use cases, standard tier is the better choice. It offers greater robustness, better accuracy, broader language support, higher quotas, and improved availability by directing traffic across AWS Regions based on load. For more information, see Safeguard tiers for guardrails policies.
1.2. Use Guardrails detect mode to test out your guardrail’s behavior without impact
Before letting your guardrail intervene on production applications, you can test its behavior on live customer traffic using guardrails detect mode. With this mode, guardrails will evaluate all content and report what was identified in the trace response but will not take any blocking action. Through detect mode, you can see how your guardrail performs on real traffic and update configurations as necessary. After you’re satisfied with the behavior, you can update your guardrail to Block or Mask content as appropriate. For more information, see Options for handling harmful content detected by Amazon Bedrock Guardrails.
2. Configure content policy filter strength
Amazon Bedrock Guardrails content policy offers four filter strength levels to help you balance content safety with application functionality: NONE, LOW, MEDIUM, and HIGH. The different filter strengths reflect the confidence of guardrails that the input contains harmful content. If you configure a guardrail with LOW filter strength, then the guardrail will block requests where it has high confidence that the input is harmful. Analogously, if the guardrail is configured with HIGH filter strength, then the guardrail will block even inputs where it has low confidence. For example, a request containing subtle innuendos might pass through a LOW filter strength but would be blocked by a HIGH filter strength setting.

| Filter strength | Blocks content with confidence |
| --- | --- |
| NONE | No filtering |
| LOW | HIGH confidence only |
| MEDIUM | HIGH and MEDIUM confidence |
| HIGH | HIGH, MEDIUM, and LOW confidence |
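The table can be expressed as a small lookup. This is an illustrative sketch of the blocking semantics, not an AWS API:

```python
# Which detection-confidence levels each filter strength blocks,
# mirroring the filter strength table above
BLOCKED_CONFIDENCE = {
    "NONE": [],
    "LOW": ["HIGH"],
    "MEDIUM": ["HIGH", "MEDIUM"],
    "HIGH": ["HIGH", "MEDIUM", "LOW"],
}


def is_blocked(filter_strength: str, detection_confidence: str) -> bool:
    """Return True if content detected at this confidence is blocked."""
    return detection_confidence in BLOCKED_CONFIDENCE[filter_strength]
```

Note the inversion this captures: a LOW filter strength blocks only high-confidence detections, while a HIGH filter strength also blocks low-confidence ones.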

2.1 Recommended filter strength selection process

Initial configuration

Start with HIGH filter strength to establish maximum protection.

Evaluation

Test your implementation using representative sample traffic (expected user traffic) to:

Identify false positive rate
Assess impact on legitimate content
Evaluate user experience

Adjustment

If the initial configuration produces too many false positives:

Lower the filter strength to MEDIUM
Re-evaluate with sample traffic
Continue adjusting as needed, moving to LOW if necessary

3. Craft effective denied topics: golden rules
1. Be crisp and precise. Define topics clearly and unambiguously, for example, “Questions or information associated with investing, selling, transacting, or procuring cryptocurrencies” rather than vague descriptions, such as “Investment advice”.
2. Define, don’t instruct. Avoid command-style phrases like “Block all content associated with cryptocurrency”, and instead say “All content associated with cryptocurrency”. Focus on what the topic is, not what you want the system to do.
3. Stay positive. Never define topics negatively (for example, “All content except investment advice”). Guardrails should have clear, affirmative definitions of what to detect.
4. Focus on themes, not words. Denied topics capture subjects and concepts contextually—they’re not designed to catch specific names, entities, or individual words. For those use cases, use sensitive information filters or word filters instead.
5. Provide sample phrases. Add a few sample phrases that represent the types of inputs you would want to get blocked by the topic filter. For a deny topic blocking investment advice, you might put “Recommend a stock that will skyrocket” or “Can you suggest where to invest my money?”.
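A denied-topic definition following these golden rules might look like the following. The dict shape is meant to mirror the Bedrock CreateGuardrail topic policy configuration, but treat the exact field names as an assumption and check the API reference before using it:

```python
# Denied topic following the golden rules (illustrative values)
denied_topic = {
    "name": "Cryptocurrency",
    # Rules 1-3: crisp, affirmative definition; not a command, not a negation
    "definition": (
        "Questions or information associated with investing, selling, "
        "transacting, or procuring cryptocurrencies"
    ),
    # Rule 5: sample phrases representing inputs that should be blocked
    "examples": [
        "Recommend a stock that will skyrocket",
        "Can you suggest where to invest my money?",
    ],
    "type": "DENY",
}
```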
4. Customizing beyond built-in filters
For some applications, the provided content filter categories or built-in PII types might not fully cover your guardrail requirements. When this happens, you have two options:

Create a custom deny topic: if your use case requires blocking content that falls outside the existing content filter categories, you can define a deny topic tailored to your needs. For example, if you need to block political discussion, you could create a deny topic with the definition “Any content related to politics or elections.”
Create a custom regex filter: if the built-in PII types don’t cover the sensitive data patterns that you need to catch, you can define a regex filter to fill the gap. For example, to block all dates in MM/DD/YYYY format, you could add the following regex pattern: `\b(0[1-9]|1[0-2])[/-](0[1-9]|[12]\d|3[01])[/-](19|20)\d{2}\b`
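You can sanity-check such a pattern locally before adding it to a guardrail. A quick sketch using Python’s re module (note the backslashes, which are easy to lose when copying patterns around):

```python
import re

# MM/DD/YYYY (or MM-DD-YYYY) date pattern from the example above
DATE_PATTERN = re.compile(
    r"\b(0[1-9]|1[0-2])[/-](0[1-9]|[12]\d|3[01])[/-](19|20)\d{2}\b"
)


def contains_date(text: str) -> bool:
    """Return True if the text contains a date the filter would catch."""
    return DATE_PATTERN.search(text) is not None
```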

5. Choose the right implementation approach
Amazon Bedrock Guardrails offers multiple ways to protect your applications, each suited to different architectural patterns and control requirements. Understanding when to use each approach helps you build protection strategies that match your specific needs.
Standalone ApplyGuardrail API for maximum flexibility
When you need precise control over where and how guardrails will evaluate the content, you can invoke the ApplyGuardrail API at any point in your application logic. You can use ApplyGuardrail with any large language model (LLM) or LLM gateway, including models from Amazon Bedrock or outside. With this approach, you can implement guardrails at critical checkpoints: pre-processing user inputs from multiple sources, validating intermediate outputs in multi-step AI workflows, filtering retrieved documents in Retrieval Augmented Generation (RAG) pipelines, or post-processing LLM responses before delivery. For latency-sensitive applications, you can parallelize the input validation ApplyGuardrail call and the LLM inference call, then process the results together. However, this means that you will always pay for both calls—even if the guardrail would have blocked the input. With a sequential approach, you can skip the inference call entirely when the guardrail intervenes, saving that cost. You can design custom protection strategies that match your application’s specific risk profile, applying different guardrail configurations based on context, user state, or workflow stage. For more details, see Use the ApplyGuardrail API in your application.
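The parallelization trade-off described above can be sketched with concurrent.futures and stubbed calls. Both functions here are hypothetical stand-ins; real code would call ApplyGuardrail and a model inference API through boto3:

```python
from concurrent.futures import ThreadPoolExecutor


def check_guardrail(text: str) -> dict:
    """Stub for the ApplyGuardrail call (real code would use boto3)."""
    blocked = "banana" in text.lower()
    return {"action": "GUARDRAIL_INTERVENED" if blocked else "NONE"}


def invoke_model(text: str) -> str:
    """Stub for the LLM inference call."""
    return f"model answer to: {text}"


def guarded_answer(text: str) -> str:
    # Run the input guardrail check and inference concurrently to cut
    # latency; note that both calls are paid for even when the guardrail
    # intervenes, unlike the sequential approach.
    with ThreadPoolExecutor(max_workers=2) as pool:
        guard_future = pool.submit(check_guardrail, text)
        model_future = pool.submit(invoke_model, text)
        if guard_future.result()["action"] == "GUARDRAIL_INTERVENED":
            return "Sorry, I can't help with that topic."
        return model_future.result()
```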
Native integration with Bedrock inference APIs
When you use Amazon Bedrock Guardrails with inference APIs like InvokeModel, InvokeModelWithResponseStream, Converse, or ConverseStream, the system automatically handles a dual-checkpoint pattern for you. First, it sends user input to the ApplyGuardrail API to evaluate against your defined policies. If your guardrail blocks the input, it returns your configured message; if the guardrail allows the input, it proceeds to the foundation model. After the model generates a response, the system evaluates the output (including grounding sources when applicable) through guardrails again before returning results to users. For the integration with the Amazon Bedrock streaming APIs (InvokeModelWithResponseStream and ConverseStream), the guardrail will buffer the model’s streaming output and evaluate the output in chunks. These native integrations streamline implementation while maintaining comprehensive protection. For more details, see Use your guardrail with inference operations to evaluate user input.
Important: Each ApplyGuardrail API call incurs separate charges, so consider your architecture carefully. The pricing for Amazon Bedrock Guardrails is based on text units consumed or images processed per configured safeguard. For more information, see the Amazon Bedrock pricing page for details.
6. Manage guardrails in multi-turn conversations
One of the most common pitfalls in conversational AI is over-applying guardrails to conversation history. If you evaluate every message from the entire chat history on each turn, a single blocked topic early in the conversation can prevent users from moving forward. This can happen even when their new questions are perfectly valid.

Imagine this scenario with a guardrail configured to block discussions about “bananas”:
User: Do you sell bananas?
Chatbot: Sorry, the model cannot respond to your question.
User: Can I book a flight?

If your guardrails evaluate the entire conversation history, that second question gets blocked too—simply because “bananas” still exists somewhere in the chat log. Your user is now stuck, unable to recover from a single misstep.
Instead of checking the full conversation history, configure your guardrails to evaluate only the most recent user input or a limited number of recent turns. This approach allows conversations to flow naturally and lets users recover from blocked interactions. Furthermore, you can reduce both cost and latency by not having the guardrail evaluate the same content multiple times across different turns. If guardrails only evaluated the last turn (in this case “Can I book a flight?”), the conversation would continue smoothly and users could move past previous guardrail interventions without friction. With this strategy, you maintain conversation fluidity and improve user experience by keeping conversations natural.
Guardrail integrations inside tools like LiteLLM, LangChain AWS, and Strands Agents either default to only evaluating the last turn in the conversation or provide a flag to do so.
Using the Converse API with guardContent for multi-turn conversations
The following example demonstrates how to selectively evaluate only the latest user message in a multi-turn conversation using the guardContent block. In this approach, the conversation history is passed as regular text (which won’t be evaluated by guardrails), while only the most recent user input is wrapped in guardContent:
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="<aws region>")

# Conversation history (previous messages won't be evaluated by guardrails)
messages = [
    {
        "role": "user",
        "content": [
            {"text": "Do you sell bananas?"}
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"text": "I'm sorry, but I can't help with that topic."}
        ]
    },
    {
        "role": "user",
        "content": [
            {   # Only this block will be evaluated by guardrails
                "guardContent": {
                    "text": {"text": "Can I book a flight to Paris?"}
                }
            }
        ]
    }
]

response = bedrock.converse(
    modelId="<bedrock_model_id>",
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "1",
        "trace": "enabled"
    },
    messages=messages
)

# The conversation flows naturally because only "Can I book a flight to Paris?"
# is evaluated, not the earlier blocked topic about bananas
print(response['output']['message']['content'][0]['text'])
In this example, even though the conversation history contains a previously blocked topic (“bananas”), the user can continue the conversation naturally because only the latest query wrapped in guardContent is evaluated by the guardrail. The optimal number of turns to evaluate can vary based on your use case and safety requirements, as some attacks can span across several conversation turns. Consider starting with a single-turn evaluation and adjust based on your application’s needs.
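If you later want to evaluate a sliding window of recent turns rather than a single turn, the same guardContent pattern generalizes. The helper below is an illustrative sketch (the function name and structure are our own, not part of any AWS SDK): it wraps only the last n_turns user messages in guardContent blocks, leaving older history as plain text that the guardrail ignores.

```python
def wrap_recent_turns(messages, n_turns=1):
    """Return a copy of `messages` where only the text blocks of the last
    `n_turns` user messages are wrapped in guardContent, so the guardrail
    evaluates just the recent window of the conversation."""
    # Indices of user messages, oldest to newest
    user_idx = [i for i, m in enumerate(messages) if m["role"] == "user"]
    recent = set(user_idx[-n_turns:]) if n_turns > 0 else set()

    wrapped = []
    for i, msg in enumerate(messages):
        if i in recent:
            content = [
                {"guardContent": {"text": {"text": block["text"]}}}
                if "text" in block else block
                for block in msg["content"]
            ]
            wrapped.append({"role": msg["role"], "content": content})
        else:
            wrapped.append(msg)  # history passes through unevaluated
    return wrapped

history = [
    {"role": "user", "content": [{"text": "Do you sell bananas?"}]},
    {"role": "assistant", "content": [{"text": "I can't help with that."}]},
    {"role": "user", "content": [{"text": "Can I book a flight to Paris?"}]},
]
guarded = wrap_recent_turns(history, n_turns=1)
```

You would then pass `guarded` as the `messages` argument to `converse`; raising `n_turns` widens the evaluated window when multi-turn attacks are a concern.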
7. Use guardrail numerical versions in production
When you create a guardrail, Amazon Bedrock automatically creates a single version labeled DRAFT. You can create additional numerical versions (version 1, version 2, and so on) of the guardrail by using the CreateGuardrailVersion API. The version numbers are auto-incremented by the service whenever a new version is created. Each numerical version is an immutable snapshot of the DRAFT guardrail version's policies at the time of creation. Any modifications to the policies in the DRAFT version do not affect existing numerical versions. We strongly recommend using numerical versions instead of the DRAFT version in production applications. The DRAFT version is designed for development and testing purposes, and using it in production can lead to the following issues:

Service interruptions – When an operator modifies the DRAFT version using the UpdateGuardrail API, the guardrail enters an UPDATING state. During this period, any inference calls using the DRAFT guardrail will receive a ValidationException saying that the guardrail is not in a READY state.
Inconsistent protection – Changes made to the DRAFT version’s settings can immediately affect your production application, potentially compromising your intended protection controls.
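Creating a numerical version can also be scripted as part of a deployment pipeline. The sketch below is illustrative: promote_guardrail and the stub client are our own constructions (the stub only mimics the CreateGuardrailVersion response shape so the sketch runs offline); with real credentials you would pass boto3.client("bedrock") in its place.

```python
def promote_guardrail(client, guardrail_id, description):
    """Snapshot the current DRAFT policies as an immutable numerical
    version and return the new version number."""
    resp = client.create_guardrail_version(
        guardrailIdentifier=guardrail_id,
        description=description,
    )
    # The service auto-increments and returns the new version number
    return resp["version"]

class StubBedrockClient:
    """Offline stand-in that mimics the CreateGuardrailVersion response
    shape; not a real AWS client."""
    def __init__(self):
        self._next = 1
    def create_guardrail_version(self, guardrailIdentifier, description):
        v = str(self._next)
        self._next += 1
        return {"guardrailId": guardrailIdentifier, "version": v}

client = StubBedrockClient()
new_version = promote_guardrail(client, "your-guardrail-id", "tuned topic policy")
```

Pinning your production configuration to the returned version number keeps rollouts deliberate while the DRAFT stays free for iteration.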

To use a numerical version in an ApplyGuardrail call, set the value of the guardrailVersion field to be the version number:
response = bedrock.apply_guardrail(
    guardrailId="your-guardrail-id",
    guardrailVersion="47",
    content=content,
    source="your-source"
)
By using numerical versions in production, you can help maintain more consistent and predictable behavior of your guardrails while preserving the flexibility to test and iterate on new policies in the DRAFT version. For more information about guardrail versions, see Create a version of a guardrail.
Conclusion
Implementing Amazon Bedrock Guardrails effectively requires thoughtful configuration and a deep understanding of your application’s unique risk profile. By selecting the right policies and safeguard tiers, tuning the configurations through iterative testing, choosing the implementation approach that fits your architecture, and safely deploying with a numerical version, you can balance safety, cost, and user experience. Treat your guardrails as a living system—start with strong baselines, test with detect mode on real traffic, and adjust as your application evolves. Following these battle-tested practices will help your generative AI applications remain safe, performant, and ready to scale confidently into production.
To learn more about Amazon Bedrock Guardrails, refer to the Amazon Bedrock Guardrails documentation, explore safeguard tiers for tailored responsible AI, or visit the Amazon Bedrock console to create your first production-ready guardrail.

About the Authors

Daniel Khain
Daniel Khain is a Software Engineer at AWS AI, where he has worked on Amazon Bedrock AgentCore Gateway, Amazon Bedrock Guardrails, and Amazon Lex. Outside of work, Daniel likes to kayak, cross-country ski, and play classical guitar.

Bharathi Srinivasan
Bharathi Srinivasan is a Generative AI Data Scientist at the AWS Worldwide Specialist Organization. She works on developing solutions for Responsible AI, focusing on algorithmic fairness, veracity of large language models, explainability and governance of agents. Bharathi guides internal teams and AWS customers on their responsible AI journey. She has presented her work at various learning conferences.

Shyam Srinivasan
Shyam Srinivasan is on the Amazon Bedrock Guardrails product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.

Antonio Rodriguez
Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at AWS. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Google AI Introduces STATIC: A Sparse Matrix Framework Delivering 948x …

In industrial recommendation systems, the shift toward Generative Retrieval (GR) is replacing traditional embedding-based nearest neighbor search with Large Language Models (LLMs). These models represent items as Semantic IDs (SIDs)—discrete token sequences—and treat retrieval as an autoregressive decoding task. However, industrial applications often require strict adherence to business logic, such as enforcing content freshness or inventory availability. Standard autoregressive decoding cannot natively enforce these constraints, often leading the model to “hallucinate” invalid or out-of-stock item identifiers.

The Accelerator Bottleneck: Tries vs. TPUs/GPUs

To ensure valid output, developers typically use a prefix tree (trie) to mask invalid tokens during each decoding step. While conceptually straightforward, traditional trie implementations are fundamentally inefficient on hardware accelerators like TPUs and GPUs.

The efficiency gap stems from two primary issues:

Memory Latency: Pointer-chasing structures result in non-contiguous, random memory access patterns. This prevents memory coalescing and fails to utilize the High-Bandwidth Memory (HBM) burst capabilities of modern accelerators.

Compilation Incompatibility: Accelerators rely on static computation graphs for machine learning compilation (e.g., Google’s XLA). Standard tries use data-dependent control flow and recursive branching, which are incompatible with this paradigm and often force costly host-device round-trips.

https://arxiv.org/pdf/2602.22647

STATIC: Sparse Transition Matrix-Accelerated Trie Index

Google DeepMind and YouTube researchers have introduced STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding) to resolve these bottlenecks. Instead of treating the trie as a graph to be traversed, STATIC flattens it into a static Compressed Sparse Row (CSR) matrix. This transformation allows irregular tree traversals to be executed as fully vectorized sparse matrix operations.
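To make the idea concrete, here is a toy reconstruction (ours, not the paper's TPU kernel) of flattening a trie over Semantic-ID sequences into CSR-style arrays, so that "which tokens may follow this node?" becomes one contiguous array slice instead of a pointer chase:

```python
def build_csr_trie(sequences):
    """Build a trie over token sequences, then flatten it to CSR arrays."""
    # Explicit trie first: node index -> {token: child node index}
    children = [{}]
    for seq in sequences:
        node = 0
        for tok in seq:
            if tok not in children[node]:
                children[node][tok] = len(children)
                children.append({})
            node = children[node][tok]
    # Flatten: row_ptr[n]..row_ptr[n+1] indexes node n's entries in the
    # contiguous tokens/child arrays (Compressed Sparse Row layout)
    row_ptr, tokens, child = [0], [], []
    for node_children in children:
        for tok, ch in sorted(node_children.items()):
            tokens.append(tok)
            child.append(ch)
        row_ptr.append(len(tokens))
    return row_ptr, tokens, child

def allowed_tokens(csr, node):
    row_ptr, tokens, _ = csr
    return tokens[row_ptr[node]:row_ptr[node + 1]]  # one contiguous read

def step(csr, node, tok):
    row_ptr, tokens, child = csr
    for i in range(row_ptr[node], row_ptr[node + 1]):
        if tokens[i] == tok:
            return child[i]
    return -1  # invalid transition: token not in the constraint set

csr = build_csr_trie([[1, 2, 3], [1, 2, 4], [5, 6, 7]])
```

On an accelerator the contiguous `row_ptr`/`tokens`/`child` arrays are what enable coalesced HBM reads and vectorized masking, which linked-node tries cannot provide.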

The Hybrid Decoding Architecture

STATIC employs a two-phase lookup strategy to balance memory usage and speed:

Dense Masking (t-1 < d): For the first d=2 layers, where the branching factor is highest, STATIC uses a bit-packed dense boolean tensor. This allows for O(1) lookups during the most computationally expensive initial steps.

Vectorized Node Transition Kernel (VNTK): For deeper layers (l ≥ 3), STATIC utilizes a branch-free kernel. This kernel performs a ‘speculative slice’ of a fixed number of entries (Bt), corresponding to the maximum branch factor at that level. By using a fixed-size slice regardless of the actual child count, the entire decoding process remains a single, static computation graph.
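The speculative-slice idea can be sketched in miniature: pad every node at a level to the same fixed number of (token, child) entries, so the lookup always reads exactly Bt entries and selects the match arithmetically rather than with data-dependent branches. This is an illustrative toy of the principle, not the published kernel:

```python
PAD = -1  # sentinel for padded entries and invalid transitions

def pad_level(node_children, max_branch):
    """node_children: list of {token: child} dicts for one trie level.
    Returns a [num_nodes][max_branch] table of (token, child) pairs,
    padded so every node has an identical, static shape."""
    table = []
    for ch in node_children:
        entries = sorted(ch.items())[:max_branch]
        entries += [(PAD, PAD)] * (max_branch - len(entries))
        table.append(entries)
    return table

def transition(table, node, tok, max_branch):
    """Branch-free style lookup: fixed trip count, arithmetic select."""
    nxt = PAD
    for i in range(max_branch):          # always exactly Bt iterations
        t, c = table[node][i]
        hit = 1 if t == tok else 0
        nxt = hit * c + (1 - hit) * nxt  # select without control flow
    return nxt

level = [{3: 10, 4: 11}, {7: 12}]  # two nodes, max branch factor 2
table = pad_level(level, max_branch=2)
```

Because the trip count and memory shape never depend on the data, the whole loop lowers to a single static computation graph under XLA-style compilation, which is the property the VNTK exploits.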

This approach achieves an I/O complexity of O(1) relative to the constraint set size, whereas previous hardware-accelerated binary-search methods scaled logarithmically (O(log|C|)).

Performance and Scalability

Evaluated on Google TPU v6e accelerators using a 3-billion parameter model with a batch size of 2 and a beam size (M) of 70, STATIC demonstrated significant performance gains over existing methods.

| Method | Latency Overhead per Step (ms) | % of Total Inference Time |
|---|---|---|
| STATIC (Ours) | +0.033 | 0.25% |
| PPV Approximate | +1.56 | 11.9% |
| Hash Bitmap | +12.3 | 94.0% |
| CPU Trie | +31.3 | 239% |
| PPV Exact | +34.1 | 260% |

STATIC achieved a 948x speedup over CPU-offloaded tries and outperformed the exact binary-search baseline (PPV) by 1033x. Its latency remains nearly constant even as the Semantic ID vocabulary size (|V|) increases.

Memory Footprint

For a vocabulary of 20 million items, STATIC’s upper bound for HBM usage is approximately 1.5 GB. In practice, due to the non-uniform distribution and clustering of Semantic IDs, actual utilization is typically ≤75% of this bound. The rule of thumb for capacity planning is approximately 90 MB of HBM per 1 million constraints.

Deployment Results

STATIC was deployed on YouTube to enforce a ‘last 7 days’ freshness constraint for video recommendations. The system served a vocabulary of 20 million fresh items with 100% compliance.

Online A/B testing showed:

A +5.1% increase in 7-day fresh video views.

A +2.9% increase in 3-day fresh video views.

A +0.15% increase in click-through rate (CTR).

Cold-Start Performance

The framework also addresses the ‘cold-start’ limitation of generative retrieval—recommending items not seen during training. By constraining the model to a cold-start item set on Amazon Reviews datasets, STATIC significantly improved performance over unconstrained baselines, which recorded 0.00% Recall@1. For these tests, a 1-billion parameter Gemma architecture was used with L = 4 tokens and a vocabulary size of |V|=256.

Key Takeaways

Vectorized Efficiency: STATIC recasts constrained decoding from a graph traversal problem into hardware-friendly, vectorized sparse matrix operations by flattening prefix trees into static Compressed Sparse Row (CSR) matrices.

Massive Speedups: The system achieves a 0.033ms per-step latency, representing a 948x speedup over CPU-offloaded tries and a 47–1033x speedup over hardware-accelerated binary-search baselines.

Scalable O(1) Complexity: By achieving O(1) I/O complexity relative to constraint set size, STATIC maintains high performance with a low memory footprint of roughly 90 MB per 1 million items.

Production-Proven Results: Deployment on YouTube showed 100% compliance with business logic constraints, driving a 5.1% increase in fresh video views and a 0.15% boost in click-through rates.

Cold-Start Solution: The framework enables generative retrieval models to successfully recommend cold-start items, boosting Recall@1 performance from 0.00% to non-trivial levels on Amazon Reviews benchmarks.

Check out the Paper and Codes. Also, feel free to follow us on Twitter, join our 120k+ ML SubReddit, and subscribe to our Newsletter. You can also join us on Telegram.
The post Google AI Introduces STATIC: A Sparse Matrix Framework Delivering 948x Faster Constrained Decoding for LLM Based Generative Retrieval appeared first on MarkTechPost.

How to Design a Production-Grade Multi-Agent Communication System Usin …

In this tutorial, we build an advanced multi-agent communication system using a structured message bus architecture powered by LangGraph and Pydantic. We define a strict ACP-style message schema that allows agents to communicate via a shared state rather than calling each other directly, enabling modularity, traceability, and production-grade orchestration. We implement three specialized agents, a Planner, Executor, and Validator, that coordinate through structured messages, persistent state, and routing logic. We also integrate SQLite-based persistence to provide durable memory across executions and visualize the agent communication flow to understand how messages propagate through the system.

!pip -q install -U "pydantic==2.12.3"
!pip -q install -U langgraph langchain-core networkx matplotlib
!pip -q install -U langgraph-checkpoint-sqlite

import os
import json
import uuid
import sqlite3
from datetime import datetime, timezone
from typing import Any, Dict, List, Literal, Optional, Tuple

from pydantic import BaseModel, Field

import networkx as nx
import matplotlib.pyplot as plt

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.sqlite import SqliteSaver

Role = Literal["planner", "executor", "validator", "user", "system"]
MsgType = Literal["task", "plan", "result", "validation", "error", "control"]

class ACPMessage(BaseModel):
    msg_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    ts: str = Field(default_factory=lambda: datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"))
    sender: Role
    receiver: Role
    msg_type: MsgType
    content: str
    meta: Dict[str, Any] = Field(default_factory=dict)
    trace: Dict[str, Any] = Field(default_factory=dict)

def acp_log_path() -> str:
    os.makedirs("acp_logs", exist_ok=True)
    return os.path.join("acp_logs", "acp_messages.jsonl")

def append_acp_log(m: ACPMessage) -> None:
    with open(acp_log_path(), "a", encoding="utf-8") as f:
        f.write(m.model_dump_json() + "\n")

We install and import all the required libraries needed to build a structured multi-agent communication system. We define the ACP-style message schema using Pydantic, which allows us to enforce a strict and structured format for agent communication. We also implement structured logging to persist every message exchanged between agents, enabling traceability and observability of the system.

class BusState(BaseModel):
    goal: str = ""
    done: bool = False
    errors: List[str] = Field(default_factory=list)
    mailbox: List[ACPMessage] = Field(default_factory=list)
    edges: List[Tuple[str, str, str]] = Field(default_factory=list)
    active_role: Role = "user"
    step: int = 0

def bus_update(
    state: BusState,
    sender: Role,
    receiver: Role,
    msg_type: MsgType,
    content: str,
    meta: Optional[Dict[str, Any]] = None,
    trace: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    m = ACPMessage(
        sender=sender,
        receiver=receiver,
        msg_type=msg_type,
        content=content,
        meta=meta or {},
        trace=trace or {},
    )
    append_acp_log(m)
    return {
        "goal": state.goal,
        "done": state.done,
        "errors": state.errors,
        "mailbox": state.mailbox + [m],
        "edges": state.edges + [(sender, receiver, msg_type)],
        "active_role": receiver,
        "step": state.step + 1,
    }

We define the shared state structure that acts as the centralized message bus for all agents. We implement the BusState class to store the goal, mailbox, routing information, and execution progress. We also create the bus_update function, which allows us to generate structured messages, update the shared state, and consistently persist message logs.

def planner_agent(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    state = BusState.model_validate(state_dict)
    goal = state.goal.strip()
    if not goal:
        return bus_update(state, "planner", "validator", "error", "No goal provided.", meta={"reason": "empty_goal"})
    plan = [
        "Interpret the goal and extract requirements.",
        "Decide an execution strategy with clear outputs.",
        "Ask Executor to produce the result.",
        "Ask Validator to check correctness + completeness.",
    ]
    plan_text = "\n".join([f"{i+1}. {p}" for i, p in enumerate(plan)])
    return bus_update(
        state,
        "planner",
        "executor",
        "plan",
        plan_text,
        meta={"goal": goal, "plan_steps": len(plan)},
        trace={"policy": "deterministic_planner_v1"},
    )

def executor_agent(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    state = BusState.model_validate(state_dict)
    goal = state.goal.strip()
    latest_plan = None
    for m in reversed(state.mailbox):
        if m.receiver == "executor" and m.msg_type == "plan":
            latest_plan = m.content
            break
    result = {
        "goal": goal,
        "assumptions": [
            "We can produce a concise, actionable output.",
            "We can validate via rule-based checks.",
        ],
        "output": f"Executed task for goal: {goal}",
        "deliverables": [
            "A clear summary",
            "A step-by-step action list",
            "Any constraints and edge cases",
        ],
        "plan_seen": bool(latest_plan),
    }
    result_text = json.dumps(result, indent=2)
    return bus_update(
        state,
        "executor",
        "validator",
        "result",
        result_text,
        meta={"artifact_type": "json", "bytes": len(result_text.encode("utf-8"))},
        trace={"policy": "deterministic_executor_v1"},
    )

We implement the Planner and Executor agents, which handle task planning and execution. We design the Planner agent to interpret the goal and generate a structured execution plan, which is then passed through the message bus. We implement the Executor agent to read the plan, execute it, and produce a structured result artifact that downstream agents can validate.

def validator_agent(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    state = BusState.model_validate(state_dict)
    goal = state.goal.strip()
    latest_result = None
    for m in reversed(state.mailbox):
        if m.receiver == "validator" and m.msg_type in ("result", "error"):
            latest_result = m
            break
    if latest_result is None:
        upd = bus_update(state, "validator", "planner", "error", "No result to validate.", meta={"reason": "missing_result"})
        upd["done"] = True
        upd["errors"] = state.errors + ["missing_result"]
        return upd
    if latest_result.msg_type == "error":
        upd = bus_update(
            state,
            "validator",
            "planner",
            "validation",
            f"Validation failed because upstream error occurred: {latest_result.content}",
            meta={"status": "fail"},
        )
        upd["done"] = True
        upd["errors"] = state.errors + [latest_result.content]
        return upd
    try:
        parsed = json.loads(latest_result.content)
    except Exception as e:
        upd = bus_update(
            state,
            "validator",
            "planner",
            "validation",
            f"Result is not valid JSON: {e}",
            meta={"status": "fail"},
        )
        upd["done"] = True
        upd["errors"] = state.errors + [f"invalid_json: {e}"]
        return upd
    issues = []
    if parsed.get("goal") != goal:
        issues.append("Result.goal does not match input goal.")
    if "deliverables" not in parsed or not isinstance(parsed["deliverables"], list) or len(parsed["deliverables"]) == 0:
        issues.append("Missing or empty deliverables list.")
    if issues:
        upd = bus_update(
            state,
            "validator",
            "planner",
            "validation",
            "Validation failed:\n- " + "\n- ".join(issues),
            meta={"status": "fail", "issues": issues},
        )
        upd["done"] = True
        upd["errors"] = state.errors + issues
        return upd
    upd = bus_update(
        state,
        "validator",
        "user",
        "validation",
        "Validation passed. Result looks consistent and complete.",
        meta={"status": "pass"},
    )
    upd["done"] = True
    upd["errors"] = state.errors
    return upd

def route_next(state_dict: Dict[str, Any]) -> str:
    if state_dict.get("done", False):
        return END
    role = state_dict.get("active_role", "user")
    if role == "planner":
        return "planner"
    if role == "executor":
        return "executor"
    if role == "validator":
        return "validator"
    return END

We implement the Validator agent and the routing logic that controls agent execution flow. We design the Validator to inspect the execution results, verify correctness, and generate validation outcomes through structured checks. We also implement the routing function that dynamically determines which agent should execute next, enabling coordinated multi-agent orchestration.

graph = StateGraph(dict)

graph.add_node("planner", planner_agent)
graph.add_node("executor", executor_agent)
graph.add_node("validator", validator_agent)

graph.set_entry_point("planner")

graph.add_conditional_edges("planner", route_next, {"planner": "planner", "executor": "executor", "validator": "validator", END: END})
graph.add_conditional_edges("executor", route_next, {"planner": "planner", "executor": "executor", "validator": "validator", END: END})
graph.add_conditional_edges("validator", route_next, {"planner": "planner", "executor": "executor", "validator": "validator", END: END})

os.makedirs("checkpoints", exist_ok=True)
db_path = "checkpoints/langgraph_bus.sqlite"
conn = sqlite3.connect(db_path, check_same_thread=False)
checkpointer = SqliteSaver(conn)

app = graph.compile(checkpointer=checkpointer)

def run_thread(goal: str, thread_id: str) -> BusState:
    init = BusState(goal=goal, active_role="planner", done=False).model_dump()
    final_state_dict = app.invoke(init, config={"configurable": {"thread_id": thread_id}})
    return BusState.model_validate(final_state_dict)

thread_id = "demo-thread-001"
goal = "Design an ACP-style message bus where planner/executor/validator coordinate through shared state."

final_state = run_thread(goal, thread_id)
print("Done:", final_state.done)
print("Steps:", final_state.step)
print("Errors:", final_state.errors)

print("\nLast 5 messages:")
for m in final_state.mailbox[-5:]:
    print(f"- [{m.msg_type}] {m.sender} -> {m.receiver}: {m.content[:80]}")

snapshot = checkpointer.get_tuple({"configurable": {"thread_id": thread_id}})
cp = snapshot.checkpoint or {}
cv = cp.get("channel_values", {}) or {}
sv = cp.get("state", {}) or {}
vals = cv if isinstance(cv, dict) and len(cv) else sv if isinstance(sv, dict) else {}

print("\nCheckpoint keys:", list(cp.keys()))
if isinstance(cv, dict):
    print("channel_values keys:", list(cv.keys())[:30])
if isinstance(sv, dict):
    print("state keys:", list(sv.keys())[:30])

print("\nPersisted step (best-effort):", vals.get("step", "NOT_FOUND"))
print("Persisted active_role (best-effort):", vals.get("active_role", "NOT_FOUND"))

print("\nACP logs:", acp_log_path())
print("Checkpoint DB:", db_path)

G = nx.DiGraph()
G.add_edge("planner", "executor")
G.add_edge("executor", "validator")
G.add_edge("validator", "user")

plt.figure(figsize=(6, 4))
pos = nx.spring_layout(G, seed=7)
nx.draw(G, pos, with_labels=True, node_size=1800, font_size=10, arrows=True)
plt.title("Orchestration Graph: Planner → Executor → Validator")
plt.show()

comm = nx.MultiDiGraph()
for (s, r, t) in final_state.edges:
    comm.add_edge(s, r, label=t)

plt.figure(figsize=(8, 5))
pos2 = nx.spring_layout(comm, seed=11)
nx.draw(comm, pos2, with_labels=True, node_size=1800, font_size=10, arrows=True)
plt.title("Communication Graph from Structured Message Bus (Runtime Edges)")
plt.show()

def tail_jsonl(path: str, n: int = 8) -> List[Dict[str, Any]]:
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()[-n:]
    return [json.loads(x) for x in lines]

print("\nLast ACP log entries:")
for row in tail_jsonl(acp_log_path(), 6):
    print(f"{row['msg_type']:>10} | {row['sender']} -> {row['receiver']} | {row['ts']}")

We construct the LangGraph state graph, enable SQLite-based persistence, and execute the multi-agent workflow. We use a thread identifier to ensure the agent state can be saved and recovered reliably across executions. We also visualize the orchestration and communication graphs and inspect persisted logs, which allows us to understand how agents interact through the structured message bus.

In this tutorial, we successfully designed and implemented a structured multi-agent communication framework using LangGraph’s shared-state architecture and ACP-style message-bus principles. We enabled agents to operate independently while communicating through structured, persistent messages, which improves reliability, observability, and scalability. We logged every interaction, persisted agent state across executions, and visualized communication patterns to gain deep insight into agent coordination. This architecture allows us to build robust, modular, and production-ready multi-agent systems that can be extended with additional agents, LLM reasoning, memory systems, and complex routing strategies.

Check out the Full Codes here.
The post How to Design a Production-Grade Multi-Agent Communication System Using LangGraph Structured Message Bus, ACP Logging, and Persistent Shared State Architecture appeared first on MarkTechPost.

Alibaba Team Open-Sources CoPaw: A High-Performance Personal Agent Wor …

As the industry moves from simple Large Language Model (LLM) inference toward autonomous agentic systems, the challenge for developers has shifted. It is no longer just about the model; it is about the environment in which that model operates. A team of researchers from Alibaba released CoPaw, an open-source framework designed to address this by providing a standardized workstation for deploying and managing personal AI agents.

CoPaw is built on a technical stack comprising AgentScope, AgentScope Runtime, and ReMe. It functions as a bridge between high-level agent logic and the practical requirements of a personal assistant, such as persistent memory, multi-channel connectivity, and task scheduling.

The Architecture: AgentScope and ReMe Integration

CoPaw is not a standalone bot but a workstation that orchestrates multiple components to create a cohesive ‘Agentic App.’

The system relies on three primary layers:

AgentScope: The underlying framework that handles agent communication and logic.

AgentScope Runtime: The execution environment that ensures stable operation and resource management.

ReMe (Memory Management): A specialized module that handles both local and cloud-based memory. This allows agents to maintain ‘Long-Term Experience,’ solving the statelessness issue inherent in standard LLM APIs.

By leveraging ReMe, CoPaw allows users to control their data privacy while ensuring the agent retains context across different sessions and platforms. This persistent memory is what enables the workstation to adapt to a user’s specific workflows over time.

Extensibility via the Skills System

A core feature of the CoPaw workstation is its Skill Extension capability. In this framework, a ‘Skill’ is a discrete unit of functionality—essentially a tool that the agent can invoke to interact with the external world.

Adding capabilities to CoPaw does not require modifying the core engine. Instead, CoPaw supports a custom skill directory where engineers can drop Python-based functions. These skills follow a standardized specification (influenced by anthropics/skills), allowing the agent to:

Perform web scraping (e.g., summarizing Reddit threads or YouTube videos).

Interact with local files and desktop environments.

Query personal knowledge bases stored within the workstation.

Manage calendars and email via natural language.

This design allows for the creation of Agentic Apps—complex workflows where the agent uses a combination of built-in skills and scheduled tasks to achieve a goal autonomously.
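For a sense of what a drop-in skill might look like, here is a hypothetical sketch. The metadata fields, function name, and registration convention below are illustrative assumptions of ours, not CoPaw's actual specification; consult the repository for the real format.

```python
# Hypothetical shape of a drop-in skill file. Everything here is an
# illustrative assumption, not CoPaw's actual skill spec.
SKILL_META = {
    "name": "summarize_url",
    "description": "Fetch a web page and return a short summary.",
    "parameters": {"url": "string", "max_sentences": "integer"},
}

def summarize_url(url: str, max_sentences: int = 3) -> str:
    """Entry point the agent would invoke when it selects this skill."""
    # A real implementation would fetch `url`; stubbed so the sketch runs
    page_text = f"(contents of {url})"
    sentences = page_text.split(". ")[:max_sentences]
    return ". ".join(sentences)

result = summarize_url("https://example.com", max_sentences=2)
```

The point of such a convention is that the agent can discover the function from its metadata and call it like any other tool, with no changes to the core engine.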

Multi-Channel Connectivity (All-Domain Access)

One of the primary technical hurdles in personal AI is deployment across fragmented communication platforms. CoPaw addresses this through its All-Domain Access layer, which standardizes how agents interact with different messaging protocols.

Currently, CoPaw supports integration with:

Enterprise Platforms: DingTalk and Lark (Feishu).

Social/Developer Platforms: Discord, QQ, and iMessage.

This multi-channel support means that a developer can initialize a single CoPaw instance and interact with it from any of these endpoints. The workstation handles the translation of messages between the agent’s logic and the specific channel’s API, maintaining a consistent state and memory regardless of where the interaction occurs.

Key Takeaways

Shift from Model to Workstation: CoPaw moves the focus away from just the Large Language Model (LLM) and toward a structured Workstation architecture. It acts as a middleware layer that orchestrates the AgentScope framework, AgentScope Runtime, and external communication channels to turn raw LLM capabilities into a functional, persistent assistant.

Long-Term Memory via ReMe: Unlike standard stateless LLM interactions, CoPaw integrates the ReMe (Memory Management) module. This allows agents to maintain ‘Long-Term Experience’ by storing user preferences and past task data either locally or in the cloud, enabling a personalized evolution of the agent’s behavior over time.

Extensible Python-Based ‘Skills’: The framework uses a decoupled Skill Extension system based on the anthropics/skills specification. Developers can extend an agent’s utility by simply adding Python functions to a custom skill directory, allowing the agent to perform specific tasks like web scraping, file manipulation, or API integrations without modifying the core codebase.

All-Domain Multi-Channel Access: CoPaw provides a unified interface for cross-platform deployment. A single workstation instance can be connected to enterprise tools (Lark, DingTalk) and social/developer platforms (Discord, QQ, iMessage), allowing the same agent and its memory to be accessed across different environments.

Automated Agentic Workflows: By combining Scheduled Tasks with the skills system, CoPaw transitions from reactive chat to proactive automation. Devs can program ‘Agentic Apps’ that perform background operations—such as daily research synthesis or automated repository monitoring—and push results to the user’s preferred communication channel.

Check out the Repo here and Website.
The post Alibaba Team Open-Sources CoPaw: A High-Performance Personal Agent Workstation for Developers to Scale Multi-Channel AI Workflows and Memory appeared first on MarkTechPost.