Drive organizational growth with Amazon Lex multi-developer CI/CD pipe …

As your conversational AI initiatives evolve, developing Amazon Lex assistants becomes increasingly complex. Multiple developers working on the same shared Lex instance leads to configuration conflicts, overwritten changes, and slower iteration cycles. Scaling Amazon Lex development requires isolated environments, version control, and automated deployment pipelines. By adopting well-structured continuous integration and continuous delivery (CI/CD) practices, organizations can reduce development bottlenecks, accelerate innovation, and deliver smoother intelligent conversational experiences powered by Amazon Lex.
In this post, we walk through a multi-developer CI/CD pipeline for Amazon Lex that enables isolated development environments, automated testing, and streamlined deployments. We show you how to set up the solution and share real-world results from teams using this approach.
Transforming development through scalable CI/CD practices
Traditional approaches to Amazon Lex development often rely on single-instance setups and manual workflows. While these methods work for small, single-developer projects, they can introduce friction when multiple developers need to work in parallel, leading to slower iteration cycles and higher operational overhead. A modern multi-developer CI/CD pipeline changes this dynamic by enabling automated validation, streamlined deployment, and intelligent version control. The pipeline minimizes configuration conflicts, improves resource utilization, and empowers teams to deliver new features faster and more reliably. With continuous integration and delivery, Amazon Lex developers can focus less on managing processes and more on creating engaging, high-quality conversational AI experiences for customers. Let’s explore how this solution works.
Solution architecture
The multi-developer CI/CD pipeline transforms Amazon Lex from a limited, single-user development tool into an enterprise-grade conversational AI platform. This approach addresses the fundamental collaboration challenges that slow down conversational AI development. The following diagram illustrates the multi-developer CI/CD pipeline architecture:

Using infrastructure as code (IaC) with AWS Cloud Development Kit (AWS CDK), each developer runs cdk deploy to provision their own dedicated Lex assistant and AWS Lambda instances in a shared Amazon Web Services (AWS) account. This approach eliminates the overwriting issues common in traditional Amazon Lex development and enables true parallel work streams with full version control capabilities.
Developers use lexcli, a custom AWS Command Line Interface (AWS CLI) tool, to export Lex assistant configurations from the shared AWS account to their local workstations for editing. Developers then test and debug locally using lex_emulator, a custom tool providing integrated testing for both assistant configurations and AWS Lambda functions with real-time validation to catch issues before they reach cloud environments. This local capability transforms the development experience by providing immediate feedback and reducing the need for time-consuming cloud deployments during iterations.
When developers push changes to version control, this pipeline automatically deploys ephemeral test environments for each merge request through GitLab CI/CD. The pipeline runs in Docker containers, providing a consistent build environment that ensures reliable Lambda function packaging and reproducible deployments. Automated tests run against these temporary stacks, and merges are only enabled if all tests are successful. Ephemeral environments are automatically destroyed after merge, ensuring cost efficiency while maintaining quality gates. Failed tests block merges and notify developers, preventing broken code from reaching shared environments.
Changes that pass testing in ephemeral environments are promoted to shared environments (Development, QA, and Production) with manual approval gates between stages. This structured approach maintains high-quality standards while accelerating the delivery process, enabling teams to deploy new features and improvements with confidence.
The following graphic illustrates the developer workflow organized by phases: local development, version control, and automated deployment. Developers work in isolated environments before changes flow through the CI/CD pipeline to shared environments.

Business impact
By enabling parallel development workflows, this solution delivers substantial time and efficiency improvements for conversational AI teams. Internal evaluations show teams can parallelize much of their development work, driving measurable productivity gains. Results vary based on team size, project scope, and implementation approach, but some teams have reduced development cycles significantly. The acceleration has enabled teams to deliver features in weeks rather than months, improving time-to-market. The time savings allow teams to handle larger workloads within existing development cycles, freeing capacity for innovation and quality improvement.
Real-world success stories
This multi-developer CI/CD pipeline for Amazon Lex has supported enterprise teams in improving their development efficiency. One organization used it to migrate their platform to Amazon Lex, enabling multiple developers to collaborate concurrently without conflicts. Isolated environments and automated merge capabilities helped maintain consistent progress during complex development efforts.
A large enterprise adopted the pipeline as part of its broader AI strategy. By using validation and collaboration features within the CI/CD process, their teams enhanced coordination and accountability across environments. These examples illustrate how structured workflows can contribute to improved efficiency, smoother migrations, and reduced rework.
Overall, these experiences demonstrate how the multi-developer CI/CD pipeline helps organizations of varying scales strengthen their conversational AI initiatives while maintaining consistent quality and development velocity.
See the solution in action
To better understand how the multi-developer CI/CD pipeline works in practice, watch this demonstration video that walks through the key workflows. It shows how developers work in parallel on the same Amazon Lex assistant, resolve conflicts automatically, and deploy changes through the pipeline.

Getting started with the solution
The multi-developer CI/CD pipeline for Amazon Lex is available as an open source solution through our GitHub repository. Standard AWS service charges apply for the resources you deploy.
Prerequisites and environment setup
To follow along with this walkthrough, you need:

Node.js 22.9.0 or later
AWS Cloud Development Kit (AWS CDK) 2.176.0 or later
Python 3.12.8 or later
Docker (required for AWS CDK to bundle Python Lambda functions)
An AWS account with permissions to create Amazon Lex assistants, Lambda functions, and AWS Identity and Access Management (IAM) roles

Core components and architecture
The framework consists of several key components that work together to enable collaborative development: infrastructure-as-code with AWS CDK, the Amazon Lex CLI tool called lexcli, and the GitLab CI/CD pipeline configuration.
The solution uses AWS CDK to define infrastructure components as code, including:

Individual Amazon Lex instances for each developer
Lambda functions for fulfillment logic
Amazon CloudWatch logging and monitoring
Amazon Simple Storage Service (Amazon S3) buckets for configuration storage

Deploy each developer’s environment using:

cdk deploy -c environment=your-username --outputs-file ./cdk-outputs.json

This creates a complete, isolated environment that mirrors the shared configuration but allows for independent modifications.
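The isolation comes from namespacing every resource by the environment value passed with `-c environment=...`. The following is a minimal sketch of that naming idea; the prefix and resource names are illustrative assumptions, not the repository's actual conventions:

```python
# Illustrative naming helper: each developer's stack and resources are
# namespaced by the environment value passed via `-c environment=...`.
# The "lex-cicd" prefix and the key names are assumptions for this sketch.
def stack_resources(environment: str) -> dict:
    prefix = f"lex-cicd-{environment}"
    return {
        "stack_name": f"{prefix}-stack",
        "bot_name": f"{prefix}-assistant",
        "fulfillment_fn": f"{prefix}-fulfillment",
        "config_bucket": f"{prefix}-config",
    }

print(stack_resources("alice")["stack_name"])  # lex-cicd-alice-stack
```

Because two developers produce disjoint resource names, both stacks can coexist in the same shared AWS account without collisions.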
The lexcli tool exports Amazon Lex assistant configurations into version-controlled JSON files. When you run lexcli export <environment>, the tool will:

Connect to your deployed assistant using the Amazon Lex API
Download the complete assistant configuration as a .zip file
Extract and standardize identifiers to make configurations environment-agnostic
Format JSON files for review during merge requests
Provide interactive prompts to selectively export only changed intents and slots

This tool transforms the manual, error-prone process of copying assistant configurations into an automated, reliable workflow that maintains configuration integrity across environments.
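The identifier-standardization step can be sketched as follows. This is illustrative only: the real lexcli logic lives in the repository, and the field names and ID format here are assumptions:

```python
import re

def standardize(config_text: str, environment: str) -> str:
    """Replace environment-specific values so exported JSON diffs cleanly
    across developer environments (sketch; field names are assumptions)."""
    # Lex bot identifiers are 10-character alphanumeric IDs in this sketch.
    text = re.sub(r'"botId":\s*"[A-Z0-9]{10}"', '"botId": "BOT_ID"', config_text)
    # Strip the developer's environment name out of resource names.
    text = text.replace(f"-{environment}-", "-ENV-")
    return text

exported = '{"botId": "ABCDE12345", "botName": "lex-cicd-alice-assistant"}'
print(standardize(exported, "alice"))
```

With identifiers neutralized this way, the same exported file can be diffed, reviewed, and re-imported into any environment.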
The .gitlab-ci.yml file orchestrates the entire development workflow:

Ephemeral environment creation – Automatically creates and destroys a temporary environment for each merge request
Automated testing – Runs comprehensive tests including intent validation, slot verification, and performance benchmarks
Quality gates – Enforces code linting and automated testing with 40% minimum coverage; requires manual approval for all environment deployments
Environment promotion – Enables controlled deployment progression through development, QA, and production with manual approval at each stage

The pipeline ensures only validated, tested changes progress through deployment stages, maintaining quality while enabling rapid iteration.
Step-by-step implementation guide
To create a multi-developer CI/CD pipeline for Amazon Lex, complete the steps in the following sections. Implementation follows five phases:

Repository and GitLab setup
AWS authentication setup
Local development environment
Development workflow
CI/CD pipeline execution

Repository and GitLab setup
To set up your repository and configure GitLab variables, follow these steps:

Clone the sample repository and create your own project:

# Clone the sample repository
git clone https://gitlab.aws.dev/lex/sample-lex-multi-developer-cicd.git

# Navigate to the project directory
cd sample-lex-multi-developer-cicd

# Remove the original remote and add your own
git remote remove origin
git remote add origin <your-repository-url>

# Push to your new repository
git push -u origin main

To configure GitLab CI/CD variables, navigate to your GitLab project and choose Settings. Then choose CI/CD and Variables. Add the following variables:

For AWS_REGION, enter us-east-1
For AWS_DEFAULT_REGION, enter us-east-1
Add the other environment-specific secrets your application requires

Set up branch protection rules to protect your main branch. Proper workflow enforcement prevents direct commits to the production code.

AWS authentication setup
The pipeline requires appropriate permissions to deploy AWS CDK changes within your environment. This can be achieved through various methods, such as assuming a specific IAM role within the pipeline, using a hosted runner with an attached IAM role, or enabling another approved form of access. The exact setup depends on your organization’s security and access management practices. The detailed configuration of these permissions is outside the scope of this post, but it’s essential to properly authorize your runners and roles to perform CDK deployments.
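As one illustration of the assume-role approach, a CI job could build STS AssumeRole parameters before running cdk deploy. This is a hedged sketch: the account ID, role name, and session naming are placeholders, and the actual call (via boto3's STS client) depends on your runner setup:

```python
def build_assume_role_request(account_id: str, role_name: str, pipeline_id: str) -> dict:
    """Parameters for an STS AssumeRole call a CI runner could make before
    `cdk deploy`. All names here are illustrative placeholders."""
    return {
        "RoleArn": f"arn:aws:iam::{account_id}:role/{role_name}",
        "RoleSessionName": f"lex-cicd-{pipeline_id}",  # traceable per-pipeline session
        "DurationSeconds": 3600,
    }

req = build_assume_role_request("123456789012", "LexCicdDeployRole", "42")
# A runner would then call: boto3.client("sts").assume_role(**req)
print(req["RoleArn"])
```

The temporary credentials returned by AssumeRole would be exported into the job's environment so the CDK CLI picks them up automatically.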
Local development environment
To set up your local development environment, complete the following steps:

Install dependencies

pip install -r requirements.txt

Deploy your personal assistant environment:

cdk deploy -c environment=your-username --outputs-file ./cdk-outputs.json

This creates your isolated assistant instance for independent modifications.
Development workflow
To create the development workflow, complete the following steps:

Create a feature branch:

git checkout -b feature/your-feature-name

To make assistant modifications, follow these steps:

Access your personal assistant in the Amazon Lex console
Modify intents, slots, or assistant configurations as needed
Test your changes directly in the console

Export changes to code:

python lexcli.py export your-username

The tool will interactively prompt you to select which changes to export so you only commit the modifications you intended.

Review and commit changes:

git add .
git commit -m "feat: add new intent for booking flow"
git push origin feature/your-feature-name

CI/CD pipeline execution
To execute the CI/CD pipeline, complete the following steps:

Create merge request – The pipeline automatically creates an ephemeral environment for your branch
Automated testing – The pipeline runs comprehensive tests against your changes
Code review – Team members can review both the code changes and test results
Merge to main – After the changes are approved, they’re merged and automatically deployed to development
Environment promotion – Manual approval gates control promotion to QA and production

What’s next?
After implementing this multi-developer pipeline, consider these next steps:

Scale your testing – Add more comprehensive test suites for intent validation
Enhance monitoring – Integrate Amazon CloudWatch dashboards for assistant performance
Explore hybrid AI – Combine Amazon Lex with Amazon Bedrock for generative AI capabilities

For more information about Amazon Lex, refer to the Amazon Lex Developer Guide.
Conclusion
In this post, we showed how implementing multi-developer CI/CD pipelines for Amazon Lex addresses critical operational challenges in conversational AI development. By enabling isolated development environments, local testing capabilities, and automated validation workflows, teams can work in parallel without sacrificing quality, helping to accelerate time-to-market for complex conversational AI solutions.
You can start implementing this approach today using the AWS CDK prototype and Amazon Lex CLI tool available in our GitHub repository. For organizations looking to enhance their conversational AI capabilities further, consider exploring the Amazon Lex integration with Amazon Bedrock for hybrid solutions using both structured dialog management and large language models (LLMs).
We’d love to hear about your experience implementing this solution. Share your feedback in the comments or reach out to AWS Professional Services for implementation guidance.

About the authors

Grazia Russo Lassner
Grazia Russo Lassner is a Senior Delivery Consultant with AWS Professional Services. She specializes in designing and developing conversational AI solutions using AWS technologies for customers in various industries. Grazia is passionate about leveraging generative AI, agentic systems, and multi-agent orchestration to build intelligent customer experiences that modernize how businesses engage with their customers.

Ken Erwin
Ken Erwin is a Senior Delivery Consultant with AWS Professional Services. He specializes in the architecture and operationalization of frontier-scale AI infrastructure, focusing on the design and management of the world’s largest HPC clusters. Ken is passionate about leveraging gigawatt-scale compute and immutable infrastructure to build the high-performance environments required to train the world’s most powerful AI models.

Building custom model provider for Strands Agents with LLMs hosted on …

Organizations increasingly deploy custom large language models (LLMs) on Amazon SageMaker AI real-time endpoints using their preferred serving frameworks—such as SGLang, vLLM, or TorchServe—to help gain greater control over their deployments, optimize costs, and align with compliance requirements. However, this flexibility introduces a critical technical challenge: response format incompatibility with Strands agents. While these custom serving frameworks typically return responses in OpenAI-compatible formats to facilitate broad environment support, Strands agents expect model responses aligned with the Bedrock Messages API format.
The challenge is particularly significant because support for the Messages API is not guaranteed for the models hosted on SageMaker AI real-time endpoints. While Amazon Bedrock Mantle distributed inference engine has supported OpenAI messaging formats since December 2025, flexibility of SageMaker AI allows customers to host various foundation models—some requiring esoteric prompt and response formats that don’t conform to standard APIs. This creates a gap between the serving framework’s output structure and what Strands expects, preventing seamless integration despite both systems being technically functional. The solution lies in implementing custom model parsers that extend SageMakerAIModel and translate the model server’s response format into what Strands expects, enabling organizations to leverage their preferred serving frameworks without sacrificing compatibility with the Strands Agents SDK.
This post demonstrates how to build custom model parsers for Strands agents when working with LLMs hosted on SageMaker that don’t natively support the Bedrock Messages API format. We’ll walk through deploying Llama 3.1 with SGLang on SageMaker using awslabs/ml-container-creator, then implementing a custom parser to integrate it with Strands agents.
Strands Custom Parsers
Strands agents expect model responses in a specific format aligned with the Bedrock Messages API. When you deploy models using custom serving frameworks like SGLang, vLLM, or TorchServe, they typically return responses in their own formats—often OpenAI-compatible for broad environment support. Without a custom parser, you’ll encounter errors like:
TypeError: 'NoneType' object is not subscriptable
This happens because the Strands Agents default SageMakerAIModel class attempts to parse responses assuming a specific structure that your custom endpoint doesn’t provide. In this post and the companion code base, we illustrate how to extend the SageMakerAIModel class with custom parsing logic that translates your model server’s response format into what Strands expects.
Implementation Overview
Our implementation consists of three layers:

Model Deployment Layer: Llama 3.1 served by SGLang on SageMaker, returning OpenAI-compatible responses
Parser Layer: Custom LlamaModelProvider class that extends SageMakerAIModel to handle Llama 3.1’s response format
Agent Layer: Strands agent that uses the custom provider for conversational AI, appropriately parsing the model’s response

We start by using awslabs/ml-container-creator, an AWS Labs open-source Yeoman generator that automates the creation of SageMaker BYOC (Bring Your Own Container) deployment projects. It generates the artifacts needed to build LLM serving containers, including Dockerfiles, CodeBuild configurations, and deployment scripts.
Install ml-container-creator
The first step we need to take is to build the serving container for our model. We use an open-source project to build the container and generate deployment scripts for that container. The following commands illustrate how to install awslabs/ml-container-creator and its dependencies, which include npm and Yeoman. For more information, review the project’s README and Wiki to get started.

# Install Yeoman globally
npm install -g yo

# Clone and install ml-container-creator
git clone https://github.com/awslabs/ml-container-creator
cd ml-container-creator
npm install && npm link

# Verify installation
yo --generators # Should show ml-container-creator

Generate Deployment Project
Once the package is installed and linked, the yo command runs installed generators; yo ml-container-creator runs the generator we need for this exercise.

# Run the generator
yo ml-container-creator

# Configuration options:
# – Framework: transformers
# – Model Server: sglang
# – Model: meta-llama/Llama-3.1-8B-Instruct
# – Deploy Target: codebuild
# – Instance Type: ml.g6.12xlarge (GPU)
# – Region: us-east-1

The generator creates a complete project structure:

<project-directory>/
├── Dockerfile # Container with SGLang and dependencies
├── buildspec.yml # CodeBuild configuration
├── code/
│ └── serve # SGLang server startup script
├── deploy/
│ ├── submit_build.sh # Triggers CodeBuild
│ └── deploy.sh # Deploys to SageMaker
└── test/
└── test_endpoint.sh # Endpoint testing script

Build and Deploy
Projects built by awslabs/ml-container-creator include templatized build and deployment scripts. The ./deploy/submit_build.sh and ./deploy/deploy.sh scripts are used to build the image, push the image to Amazon Elastic Container Registry (ECR), and deploy to an Amazon SageMaker AI real-time endpoint.

cd llama-31-deployment

# Build container with CodeBuild (no local Docker required)
./deploy/submit_build.sh

# Deploy to SageMaker
./deploy/deploy.sh arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole

The deployment process:

CodeBuild builds the Docker image with SGLang and Llama 3.1
Image is pushed to Amazon ECR
SageMaker creates a real-time endpoint
SGLang downloads the model from HuggingFace and loads it into GPU memory
Endpoint reaches InService status (approximately 10-15 minutes)

We can test the endpoint by using ./test/test_endpoint.sh, or with a direct invocation:

import boto3
import json

runtime_client = boto3.client('sagemaker-runtime', region_name='us-east-1')

payload = {
    "messages": [
        {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
}

response = runtime_client.invoke_endpoint(
    EndpointName='llama-31-deployment-endpoint',
    ContentType='application/json',
    Body=json.dumps(payload)
)

result = json.loads(response['Body'].read().decode('utf-8'))
print(result['choices'][0]['message']['content'])

Understanding the Response Format
Llama 3.1 served by SGLang returns OpenAI-compatible responses, while Strands expects model responses that adhere to the Bedrock Messages API format. (Since December 2025, the Amazon Bedrock Mantle distributed inference engine has supported OpenAI messaging formats, but that support doesn't extend to every model hosted on SageMaker AI.) A typical OpenAI-compatible response looks like the following:

{
  "id": "cmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 12,
    "total_tokens": 35
  }
}

However, support for the Messages API is not guaranteed for the models hosted on SageMaker AI real-time endpoints. SageMaker AI allows customers to host many kinds of foundation models on managed GPU-accelerated infrastructure, some of which may require esoteric prompt/response formats. For example, the default SageMakerAIModel uses the legacy Bedrock Messages API format and attempts to access fields that don't exist in the standard OpenAI Messages format, causing TypeError-style failures.
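The gap can be illustrated with a small translation function. This is a sketch, not part of the Strands SDK: the target message shape is an assumption modeled on the event format the streaming parser in the next section emits:

```python
def openai_to_strands(openai_resp: dict) -> dict:
    """Translate a non-streaming OpenAI-style chat completion into a
    Bedrock-Messages-like shape (illustrative; not the Strands SDK API)."""
    choice = openai_resp["choices"][0]
    return {
        "role": choice["message"]["role"],
        "content": [{"text": choice["message"]["content"]}],  # content-block list
        "stopReason": choice.get("finish_reason", "stop"),
        "usage": openai_resp.get("usage", {}),
    }

resp = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "Hello!"},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 23, "completion_tokens": 12, "total_tokens": 35},
}
out = openai_to_strands(resp)
print(out["content"][0]["text"])  # Hello!
```

The key difference is structural: OpenAI nests text under `choices[0].message.content` as a plain string, while the Bedrock-style shape carries a list of content blocks, which is exactly the field the default parser fails to find.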
Implementing a Custom Model Parser
Custom model parsers are a feature of the Strands Agents SDK that provides strong compatibility and flexibility for customers building agents powered by LLMs hosted on SageMaker AI. Here, we describe how to create a custom provider that extends SageMakerAIModel:

def stream(self, messages: List[Dict[str, Any]], tool_specs: list, system_prompt: Optional[str], **kwargs):
    # Build payload messages
    payload_messages = []
    if system_prompt:
        payload_messages.append({"role": "system", "content": system_prompt})
    # Extract message content from Strands format
    for msg in messages:
        payload_messages.append({"role": "user", "content": msg['content'][0]['text']})

    # Build complete payload with streaming enabled
    payload = {
        "messages": payload_messages,
        "max_tokens": kwargs.get('max_tokens', self.max_tokens),
        "temperature": kwargs.get('temperature', self.temperature),
        "top_p": kwargs.get('top_p', self.top_p),
        "stream": True
    }

    try:
        # Invoke SageMaker endpoint with streaming
        response = self.runtime_client.invoke_endpoint_with_response_stream(
            EndpointName=self.endpoint_name,
            ContentType='application/json',
            Accept='application/json',
            Body=json.dumps(payload)
        )

        # Process streaming response
        accumulated_content = ""
        for event in response['Body']:
            chunk = event['PayloadPart']['Bytes'].decode('utf-8')
            if not chunk.strip():
                continue

            # Parse SSE format: "data: {json}\n"
            for line in chunk.split('\n'):
                if line.startswith('data: '):
                    try:
                        json_str = line.replace('data: ', '').strip()
                        if not json_str:
                            continue

                        chunk_data = json.loads(json_str)
                        if 'choices' in chunk_data and chunk_data['choices']:
                            delta = chunk_data['choices'][0].get('delta', {})

                            # Yield content delta in Strands format
                            if 'content' in delta:
                                content_chunk = delta['content']
                                accumulated_content += content_chunk
                                yield {
                                    "type": "contentBlockDelta",
                                    "delta": {"text": content_chunk},
                                    "contentBlockIndex": 0
                                }

                            # Check for completion
                            finish_reason = chunk_data['choices'][0].get('finish_reason')
                            if finish_reason:
                                yield {
                                    "type": "messageStop",
                                    "stopReason": finish_reason
                                }

                        # Yield usage metadata
                        if 'usage' in chunk_data:
                            yield {
                                "type": "metadata",
                                "usage": chunk_data['usage']
                            }

                    except json.JSONDecodeError:
                        continue

    except Exception as e:
        yield {
            "type": "error",
            "error": {
                "message": f"Endpoint invocation failed: {str(e)}",
                "type": "EndpointInvocationError"
            }
        }

The stream method overrides the behavior of the SageMakerAIModel and allows the agent to parse responses based on the requirements of the underlying model. While the vast majority of models support OpenAI's Messages API protocol, this capability enables power users to run highly specialized LLMs on SageMaker AI to power agent workloads with the Strands Agents SDK. Once the custom model response logic is built, the Strands Agents SDK makes it simple to initialize agents with custom model providers:

from strands.agent import Agent

# Initialize custom provider
provider = LlamaModelProvider(
    endpoint_name="llama-31-deployment-endpoint",
    region_name="us-east-1",
    max_tokens=1000,
    temperature=0.7
)

# Create agent with custom provider
agent = Agent(
    name="llama-assistant",
    model=provider,
    system_prompt=(
        "You are a helpful AI assistant powered by Llama 3.1, "
        "deployed on Amazon SageMaker. You provide clear, accurate, "
        "and friendly responses to user questions."
    )
)

# Test the agent
response = agent("What are the key benefits of deploying LLMs on SageMaker?")
print(response.content)

The complete implementation for this custom parser, including the Jupyter notebook with detailed explanations and the ml-container-creator deployment project, is available in the companion GitHub repository.
Conclusion
Building custom model parsers for Strands agents lets users leverage different LLM deployments on SageMaker, regardless of their response formats. By extending SageMakerAIModel and implementing the stream() method, you can integrate custom-hosted models while maintaining the clean agent interface of Strands.
Key takeaways:

awslabs/ml-container-creator simplifies SageMaker BYOC deployments with production-ready infrastructure code
Custom parsers bridge the gap between model server response formats and Strands expectations
The stream() method is the critical integration point for custom providers

About the authors
Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.

LangWatch Open Sources the Missing Evaluation Layer for AI Agents to E …

As AI development shifts from simple chat interfaces to complex, multi-step autonomous agents, the industry has encountered a significant bottleneck: non-determinism. Unlike traditional software where code follows a predictable path, agents built on LLMs introduce a high degree of variance.

LangWatch is an open-source platform designed to address this by providing a standardized layer for evaluation, tracing, simulation, and monitoring. It moves AI engineering away from anecdotal testing toward a systematic, data-driven development lifecycle.

The Simulation-First Approach to Agent Reliability

For software developers working with frameworks like LangGraph or CrewAI, the primary challenge is identifying where an agent’s reasoning fails. LangWatch introduces end-to-end simulations that go beyond simple input-output checks.

By running full-stack scenarios, the platform allows developers to observe the interaction between several critical components:

The Agent: The core logic and tool-calling capabilities.

The User Simulator: An automated persona that tests various intents and edge cases.

The Judge: An LLM-based evaluator that monitors the agent’s decisions against predefined rubrics.

This setup enables devs to pinpoint exactly which ‘turn’ in a conversation or which specific tool call led to a failure, allowing for granular debugging before production deployment.

Closing the Evaluation Loop

A recurring friction point in AI workflows is the ‘glue code’ required to move data between observability tools and fine-tuning datasets. LangWatch consolidates this into a single Optimization Studio.

The Iterative Lifecycle

The platform automates the transition from raw execution to optimized prompts through a structured loop:

Trace: Capture the complete execution path, including state changes and tool outputs.

Dataset: Convert specific traces (especially failures) into permanent test cases.

Evaluate: Run automated benchmarks against the dataset to measure accuracy and safety.

Optimize: Use the Optimization Studio to iterate on prompts and model parameters.

Re-test: Verify that changes resolve the issue without introducing regressions.

This process ensures that every prompt modification is backed by comparative data rather than subjective assessment.

Infrastructure: OpenTelemetry-Native and Framework-Agnostic

To avoid vendor lock-in, LangWatch is built as an OpenTelemetry-native (OTel) platform. By utilizing the OTLP standard, it integrates into existing enterprise observability stacks without requiring proprietary SDKs.

The platform is designed to be compatible with the current leading AI stack:

Orchestration Frameworks: LangChain, LangGraph, CrewAI, Vercel AI SDK, Mastra, and Google AI SDK.

Model Providers: OpenAI, Anthropic, Azure, AWS, Groq, and Ollama.

By remaining agnostic, LangWatch allows teams to swap underlying models (e.g., moving from GPT-4o to a locally hosted Llama 3 via Ollama) while maintaining a consistent evaluation infrastructure.

GitOps and Version Control for Prompts

One of the more practical features for devs is the direct GitHub integration. In many workflows, prompts are treated as ‘configuration’ rather than ‘code,’ leading to versioning issues. LangWatch links prompt versions directly to the traces they generate.

This enables a GitOps workflow where:

Prompts are version-controlled in the repository.

Traces in LangWatch are tagged with the specific Git commit hash.

Engineers can audit the performance impact of a code change by comparing traces across different versions.
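This tagging can be sketched generically as a helper that attaches the CI-provided commit hash to trace metadata. The attribute names and the CI_COMMIT_SHA variable (GitLab's; GitHub Actions uses GITHUB_SHA) are illustrative assumptions, not LangWatch's actual API:

```python
import os

def trace_metadata(prompt_version: str) -> dict:
    """Build metadata linking a trace to the commit that produced it.
    Sketch only: attribute names here are illustrative, not LangWatch's API."""
    commit = os.environ.get("CI_COMMIT_SHA", "unknown")  # GitLab CI; GitHub uses GITHUB_SHA
    return {
        "git.commit": commit,
        "prompt.version": prompt_version,
    }

os.environ["CI_COMMIT_SHA"] = "a1b2c3d"   # simulate a CI environment for the demo
meta = trace_metadata("checkout-flow-v3")
print(meta["git.commit"])  # a1b2c3d
```

Tagging every trace with the commit hash is what makes before/after comparisons meaningful: two sets of traces can be grouped by `git.commit` and their evaluation scores compared directly.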

Enterprise Readiness: Deployment and Compliance

For organizations with strict data residency requirements, LangWatch supports self-hosting via a single Docker Compose command. This ensures that sensitive agent traces and proprietary datasets remain within the organization’s virtual private cloud (VPC).

Key enterprise specifications include:

ISO 27001 Certification: Providing the security baseline required for regulated sectors.

Model Context Protocol (MCP) Support: Allowing full integration with Claude Desktop for advanced context handling.

Annotations & Queues: A dedicated interface for domain experts to manually label edge cases, bridging the gap between automated evals and human oversight.

Conclusion

The transition from ‘experimental AI’ to ‘production AI’ requires the same level of rigor applied to traditional software engineering. By providing a unified platform for tracing and simulation, LangWatch offers the infrastructure necessary to validate agentic workflows at scale.

Check out the GitHub repo for the project.
The post LangWatch Open Sources the Missing Evaluation Layer for AI Agents to Enable End-to-End Tracing, Simulation, and Systematic Testing appeared first on MarkTechPost.

How to Build an EverMem-Style Persistent AI Agent OS with Hierarchical Memory, FAISS Vector Retrieval, SQLite Storage, and Automated Memory Consolidation

In this tutorial, we build an EverMem-style persistent agent OS. We combine short-term conversational context (STM) with long-term vector memory using FAISS so the agent can recall relevant past information before generating each response. Alongside semantic memory, we also store structured records in SQLite to persist metadata like timestamps, importance scores, and memory signals (preference, fact, task, decision). As we interact with the agent, we see it form new memories, retrieve the most relevant ones for the current query, and maintain consistent behavior across turns.

!pip -q install -U transformers sentence-transformers faiss-cpu accelerate

import os, time, json, math, sqlite3, hashlib
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
import numpy as np
import faiss
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def _now_ts():
    return int(time.time())

def _sha(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8", errors="ignore")).hexdigest()[:16]

def _ensure_dir(p: str):
    os.makedirs(p, exist_ok=True)

def _safe_clip(text: str, max_chars: int = 1800) -> str:
    text = (text or "").strip()
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + " …"

@dataclass
class MemoryItem:
    mid: str
    role: str
    text: str
    created_ts: int
    importance: float
    tokens_est: int
    meta: Dict[str, Any]

We set up the full environment by installing the required libraries and importing all dependencies needed for memory, embeddings, generation, and persistence. We define utility helper functions for hashing, timestamps, safe clipping, and directory management to support a stable agent OS foundation. We also introduce the MemoryItem dataclass, which serves as the core structured unit representing each memory item stored in our system.

class EverMemAgentOS:
    def __init__(
        self,
        workdir: str = "/content/evermem_agent_os",
        db_name: str = "evermem.sqlite",
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        gen_model: str = "google/flan-t5-small",
        stm_max_turns: int = 10,
        ltm_topk: int = 6,
        consolidate_every: int = 8,
        consolidate_trigger_tokens: int = 1400,
        compress_target_chars: int = 420,
        seed: int = 7,
    ):
        self.workdir = workdir
        _ensure_dir(self.workdir)
        self.db_path = os.path.join(self.workdir, db_name)

        self.embedder = SentenceTransformer(embedding_model)
        self.embed_dim = self.embedder.get_sentence_embedding_dimension()

        # Select the device before moving the model onto it.
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(gen_model)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(gen_model)
        self.model.to(self.device)
        self.model.eval()

        self.stm_max_turns = stm_max_turns
        self.ltm_topk = ltm_topk
        self.consolidate_every = consolidate_every
        self.consolidate_trigger_tokens = consolidate_trigger_tokens
        self.compress_target_chars = compress_target_chars

        np.random.seed(seed)

        self._init_db()
        self._init_faiss()

        self.stm: List[Dict[str, str]] = []
        self.turns = 0

    def _init_db(self):
        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS memories (
                mid TEXT PRIMARY KEY,
                role TEXT,
                text TEXT,
                created_ts INTEGER,
                importance REAL,
                tokens_est INTEGER,
                meta_json TEXT
            )
            """
        )
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS kv_store (
                k TEXT PRIMARY KEY,
                v_json TEXT,
                updated_ts INTEGER
            )
            """
        )
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS consolidations (
                cid TEXT PRIMARY KEY,
                created_ts INTEGER,
                summary TEXT,
                source_mids_json TEXT
            )
            """
        )
        conn.commit()
        conn.close()

    def _init_faiss(self):
        self.faiss_index_path = os.path.join(self.workdir, "faiss.index")
        self.faiss_map_path = os.path.join(self.workdir, "faiss_map.json")

        if os.path.exists(self.faiss_index_path) and os.path.exists(self.faiss_map_path):
            self.index = faiss.read_index(self.faiss_index_path)
            with open(self.faiss_map_path, "r", encoding="utf-8") as f:
                self.id_map = json.load(f)
            self.id_map = {int(k): v for k, v in self.id_map.items()}
            self.next_faiss_id = (max(self.id_map.keys()) + 1) if self.id_map else 0
            return

        self.index = faiss.IndexFlatIP(self.embed_dim)
        self.id_map: Dict[int, str] = {}
        self.next_faiss_id = 0
        self._persist_faiss()

    def _persist_faiss(self):
        faiss.write_index(self.index, self.faiss_index_path)
        with open(self.faiss_map_path, "w", encoding="utf-8") as f:
            json.dump({str(k): v for k, v in self.id_map.items()}, f)

    def _embed(self, texts: List[str]) -> np.ndarray:
        vecs = self.embedder.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
        if vecs.ndim == 1:
            vecs = vecs.reshape(1, -1)
        return vecs.astype("float32")

    def _tokens_est(self, text: str) -> int:
        text = text or ""
        return max(1, int(len(text.split()) * 1.25))

    def _importance_score(self, role: str, text: str, meta: Dict[str, Any]) -> float:
        base = 0.35
        length_bonus = min(0.45, math.log1p(len(text)) / 20.0)
        role_bonus = 0.08 if role == "user" else 0.03
        pin = 0.35 if meta.get("pinned") else 0.0
        signal = meta.get("signal", "")
        signal_bonus = 0.18 if signal in {"decision", "preference", "fact", "task"} else 0.0
        q_bonus = 0.06 if "?" in text else 0.0
        number_bonus = 0.05 if any(ch.isdigit() for ch in text) else 0.0
        return float(min(1.0, base + length_bonus + role_bonus + pin + signal_bonus + q_bonus + number_bonus))

    def upsert_kv(self, k: str, v: Any):
        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO kv_store (k, v_json, updated_ts) VALUES (?, ?, ?) ON CONFLICT(k) DO UPDATE SET v_json=excluded.v_json, updated_ts=excluded.updated_ts",
            (k, json.dumps(v, ensure_ascii=False), _now_ts()),
        )
        conn.commit()
        conn.close()

    def get_kv(self, k: str, default=None):
        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute("SELECT v_json FROM kv_store WHERE k=?", (k,))
        row = cur.fetchone()
        conn.close()
        if not row:
            return default
        try:
            return json.loads(row[0])
        except Exception:
            return default

    def add_memory(self, role: str, text: str, meta: Optional[Dict[str, Any]] = None) -> str:
        meta = meta or {}
        text = (text or "").strip()
        mid = meta.get("mid") or f"m:{_sha(f'{_now_ts()}::{role}::{text[:80]}::{np.random.randint(0, 10**9)}')}"
        created_ts = _now_ts()
        tokens_est = self._tokens_est(text)
        importance = float(meta.get("importance")) if meta.get("importance") is not None else self._importance_score(role, text, meta)

        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute(
            "INSERT OR REPLACE INTO memories (mid, role, text, created_ts, importance, tokens_est, meta_json) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (mid, role, text, created_ts, importance, tokens_est, json.dumps(meta, ensure_ascii=False)),
        )
        conn.commit()
        conn.close()

        vec = self._embed([text])
        fid = self.next_faiss_id
        self.next_faiss_id += 1
        self.index.add(vec)
        self.id_map[fid] = mid
        self._persist_faiss()

        return mid

We initialize the EverMemAgentOS class and configure the embedding model, generation model, device selection, and memory hyperparameters. We create the SQLite schema for persistent storage and initialize the FAISS index for vector-based long-term memory retrieval. We also implement a memory-writing pipeline, including importance scoring and vector insertion, enabling the agent to store structured and semantic memory simultaneously.

    def _fetch_memories_by_ids(self, mids: List[str]) -> List[MemoryItem]:
        if not mids:
            return []
        placeholders = ",".join(["?"] * len(mids))
        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute(
            f"SELECT mid, role, text, created_ts, importance, tokens_est, meta_json FROM memories WHERE mid IN ({placeholders})",
            mids,
        )
        rows = cur.fetchall()
        conn.close()

        items = []
        for r in rows:
            meta = {}
            try:
                meta = json.loads(r[6]) if r[6] else {}
            except Exception:
                meta = {}
            items.append(
                MemoryItem(
                    mid=r[0],
                    role=r[1],
                    text=r[2],
                    created_ts=int(r[3]),
                    importance=float(r[4]),
                    tokens_est=int(r[5]),
                    meta=meta,
                )
            )
        mid_pos = {m: i for i, m in enumerate(mids)}
        items.sort(key=lambda x: mid_pos.get(x.mid, 10**9))
        return items

    def retrieve_ltm(self, query: str, topk: Optional[int] = None) -> List[MemoryItem]:
        topk = topk or self.ltm_topk
        qv = self._embed([query])
        scores, ids = self.index.search(qv, topk + 8)
        mids = []
        for fid in ids[0].tolist():
            if fid == -1:
                continue
            mid = self.id_map.get(int(fid))
            if mid:
                mids.append(mid)
        mids = list(dict.fromkeys(mids))[:topk]
        return self._fetch_memories_by_ids(mids)

    def _format_stm(self) -> str:
        turns = self.stm[-self.stm_max_turns:]
        chunks = []
        for t in turns:
            chunks.append(f"{t['role'].upper()}: {t['content']}")
        return "\n".join(chunks).strip()

    def _format_ltm(self, ltm_items: List[MemoryItem]) -> str:
        if not ltm_items:
            return ""
        lines = []
        for i, it in enumerate(ltm_items, 1):
            ts_age = max(1, (_now_ts() - it.created_ts) // 3600)
            imp = f"{it.importance:.2f}"
            tag = it.meta.get("signal", "")
            tag = f" | {tag}" if tag else ""
            lines.append(f"[LTM {i}] (imp={imp}, age_h={ts_age}{tag}) {it.role}: {_safe_clip(it.text, 420)}")
        return "\n".join(lines).strip()

    @torch.inference_mode()
    def _gen(self, prompt: str, max_new_tokens: int = 180) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(self.device)
        out_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.6,
            top_p=0.92,
            num_beams=1,
        )
        out = self.tokenizer.decode(out_ids[0], skip_special_tokens=True)
        return (out or "").strip()

    def _compress_memories(self, items: List[MemoryItem], max_chars: int = 520) -> str:
        raw = "\n".join([f"- {it.role}: {it.text}" for it in items])
        raw = _safe_clip(raw, 3500)
        prompt = (
            "Summarize the following notes into a compact memory that preserves decisions, preferences, facts, and tasks. "
            f"Keep it under {max_chars} characters.\n\nNOTES:\n{raw}\n\nCOMPACT MEMORY:"
        )
        summ = self._gen(prompt, max_new_tokens=170).strip()
        if len(summ) > max_chars:
            summ = summ[:max_chars].rstrip() + "…"
        return summ

    def consolidate(self) -> Optional[str]:
        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute("SELECT mid, role, text, created_ts, importance, tokens_est, meta_json FROM memories ORDER BY created_ts DESC LIMIT 160")
        rows = cur.fetchall()
        conn.close()

        items = []
        for r in rows:
            try:
                meta = json.loads(r[6]) if r[6] else {}
            except Exception:
                meta = {}
            items.append(MemoryItem(r[0], r[1], r[2], int(r[3]), float(r[4]), int(r[5]), meta))

        if not items:
            return None

        items_sorted = sorted(items, key=lambda x: (-(x.importance + 0.15 * (1.0 / (1.0 + (_now_ts() - x.created_ts) / 3600.0))), -x.created_ts))
        picked = items_sorted[:18]
        summary = self._compress_memories(picked, max_chars=520)

        cid = f"c:{_sha(f'{_now_ts()}::{summary[:120]}::{np.random.randint(0, 10**9)}')}"
        source_mids = [it.mid for it in picked]

        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute(
            "INSERT OR REPLACE INTO consolidations (cid, created_ts, summary, source_mids_json) VALUES (?, ?, ?, ?)",
            (cid, _now_ts(), summary, json.dumps(source_mids, ensure_ascii=False)),
        )
        conn.commit()
        conn.close()

        self.add_memory(
            role="system",
            text=f"Consolidated memory: {summary}",
            meta={"signal": "consolidation", "pinned": True, "source_mids": source_mids, "cid": cid, "importance": 0.95},
        )
        return cid

We implement semantic retrieval and formatting logic that enables the agent to fetch relevant long-term memories before reasoning. We define how short-term memory and retrieved long-term memory are structured and how they are injected into prompts for contextual generation. We also implement memory compression and consolidation logic, allowing the agent to periodically summarize high-value memories into durable long-term summaries.

    def _should_consolidate(self) -> bool:
        if self.turns > 0 and self.turns % self.consolidate_every == 0:
            return True
        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute("SELECT SUM(tokens_est) FROM memories")
        s = cur.fetchone()[0]
        conn.close()
        s = int(s or 0)
        return s >= self.consolidate_trigger_tokens

    def chat(self, user_text: str, user_meta: Optional[Dict[str, Any]] = None, max_answer_tokens: int = 220) -> Dict[str, Any]:
        user_meta = user_meta or {}
        self.turns += 1

        self.stm.append({"role": "user", "content": user_text})
        self.stm = self.stm[-(self.stm_max_turns * 2):]
        self.add_memory("user", user_text, meta=user_meta)

        ltm = self.retrieve_ltm(user_text, topk=self.ltm_topk)
        stm_block = self._format_stm()
        ltm_block = self._format_ltm(ltm)

        sys_rules = (
            "You are an AI agent with persistent memory. Use retrieved long-term memories to stay consistent. "
            "If a memory conflicts with the user, ask a short clarifying question. Keep answers practical."
        )

        prompt = (
            f"{sys_rules}\n\n"
            f"SHORT-TERM CONTEXT:\n{_safe_clip(stm_block, 1800)}\n\n"
            f"RETRIEVED LONG-TERM MEMORIES:\n{ltm_block if ltm_block else '(none)'}\n\n"
            f"USER REQUEST:\n{user_text}\n\n"
            f"ANSWER:"
        )
        answer = self._gen(prompt, max_new_tokens=max_answer_tokens)

        self.stm.append({"role": "assistant", "content": answer})
        self.stm = self.stm[-(self.stm_max_turns * 2):]
        self.add_memory("assistant", answer, meta={"signal": "response"})

        consolidation_id = None
        if self._should_consolidate():
            consolidation_id = self.consolidate()

        return {
            "answer": answer,
            "retrieved_ltm": [
                {"mid": it.mid, "role": it.role, "importance": it.importance, "meta": it.meta, "text": _safe_clip(it.text, 320)}
                for it in ltm
            ],
            "consolidation_id": consolidation_id,
        }

    def inspect_recent_memories(self, n: int = 12) -> List[Dict[str, Any]]:
        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute("SELECT mid, role, text, created_ts, importance, tokens_est, meta_json FROM memories ORDER BY created_ts DESC LIMIT ?", (n,))
        rows = cur.fetchall()
        conn.close()
        out = []
        for r in rows:
            try:
                meta = json.loads(r[6]) if r[6] else {}
            except Exception:
                meta = {}
            out.append({"mid": r[0], "role": r[1], "created_ts": int(r[3]), "importance": float(r[4]), "tokens_est": int(r[5]), "meta": meta, "text": _safe_clip(r[2], 520)})
        return out

    def inspect_consolidations(self, n: int = 5) -> List[Dict[str, Any]]:
        conn = sqlite3.connect(self.db_path)
        cur = conn.cursor()
        cur.execute("SELECT cid, created_ts, summary, source_mids_json FROM consolidations ORDER BY created_ts DESC LIMIT ?", (n,))
        rows = cur.fetchall()
        conn.close()
        out = []
        for r in rows:
            try:
                src = json.loads(r[3]) if r[3] else []
            except Exception:
                src = []
            out.append({"cid": r[0], "created_ts": int(r[1]), "summary": r[2], "source_mids": src})
        return out

We implement the agent’s main reasoning loop in the chat() function, combining STM, LTM retrieval, and generation into a single workflow. We ensure that every interaction updates both vector memory and structured memory while maintaining contextual coherence. We also include automatic consolidation triggers so the system behaves like a persistent memory OS rather than a simple chatbot.

agent = EverMemAgentOS()

agent.upsert_kv("profile", {"name": "User", "preferences": {"style": "concise"}})

demo_queries = [
    ("I prefer answers in bullet points and I'm working on a Colab tutorial.", {"signal": "preference", "pinned": True}),
    ("Remember that my project is about an EverMem-style agent OS with FAISS + SQLite.", {"signal": "fact", "pinned": True}),
    ("Give me a 5-step plan to add memory importance scoring and consolidation.", {"signal": "task"}),
    ("Now remind me what you know about my preferences and project, briefly.", {"signal": "task"}),
]

for q, meta in demo_queries:
    r = agent.chat(q, user_meta=meta, max_answer_tokens=180)
    print("\nUSER:", q)
    print("ASSISTANT:", r["answer"])
    if r["retrieved_ltm"]:
        print("RETRIEVED_LTM:", [(x["importance"], x["text"]) for x in r["retrieved_ltm"][:3]])
    if r["consolidation_id"]:
        print("CONSOLIDATED:", r["consolidation_id"])

print("\nRECENT MEMORIES:")
for m in agent.inspect_recent_memories(10):
    print(m["role"], m["importance"], m["text"])

print("\nRECENT CONSOLIDATIONS:")
for c in agent.inspect_consolidations(3):
    print(c["cid"], c["summary"])

We instantiate the agent and simulate multi-turn interactions to demonstrate persistent recall and memory usage. We observe how the agent retrieves relevant long-term memories and uses them to produce consistent responses. Finally, we inspect stored memories and consolidations to verify that our EverMem-style architecture actively manages and evolves its memory over time.

In conclusion, we have a working memory-centric agent that behaves less like a stateless chatbot and more like a persistent assistant that learns from interactions. We implemented importance scoring to prioritize what matters, vector retrieval to fetch the right context at the right time, and periodic consolidation to compress multiple memories into durable summaries that improve long-horizon recall. We also kept the system practical for Colab by using lightweight models, FAISS for fast similarity search, and SQLite for structured persistence.

Check out the Full Codes here.
The post How to Build an EverMem-Style Persistent AI Agent OS with Hierarchical Memory, FAISS Vector Retrieval, SQLite Storage, and Automated Memory Consolidation appeared first on MarkTechPost.

Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks

Current end-to-end robotic policies, specifically Vision-Language-Action (VLA) models, typically operate on a single observation or a very short history. This ‘lack of memory’ makes long-horizon tasks, such as cleaning a kitchen or following a complex recipe, computationally intractable or prone to failure. To address this, researchers from Physical Intelligence, Stanford, UC Berkeley, and MIT have introduced Multi-Scale Embodied Memory (MEM).

https://www.pi.website/download/Mem.pdf

The Dual-Scale Memory Architecture

MEM factorizes robotic memory into two distinct scales to balance semantic context with real-time control constraints.

(1) Short-Term Video Memory

For tasks requiring fine-grained spatial awareness—like resolving self-occlusions or adapting a grasp—dense visual data is required. MEM utilizes an efficient video encoder that extends standard Vision Transformers (ViTs). To maintain real-time inference (the 380ms ‘real-time barrier’), the architecture avoids joint attention over all patches. Instead, it uses Space-Time Separable Attention, interleaving spatial attention within frames with causal-temporal attention across frames every fourth layer.

The computational complexity is reduced from O(n²K²) to O(Kn² + nK²), where n is the number of spatial patches and K is the number of timesteps. By dropping tokens from past timesteps in upper layers, the model passes only the current observation’s representation to the VLA backbone, keeping the token count invariant compared to single-frame models.
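This is the standard factorized-attention accounting; a compact derivation of the two costs:

```latex
% Joint attention: one softmax over all nK tokens (K frames of n patches each)
\text{cost}_{\text{joint}} \;\propto\; (nK)^2 \;=\; n^2K^2

% Separable attention:
%   spatial, within each of the K frames:   K softmaxes over n tokens
%   temporal, across frames per patch slot: n softmaxes over K tokens
\text{cost}_{\text{separable}} \;\propto\; K\,n^2 \;+\; n\,K^2
```

For typical values where K is much smaller than n, the dominant term drops from n²K² to Kn², roughly a factor-of-K saving.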

(2) Long-Term Language Memory

To handle tasks spanning up to 15 minutes, MEM uses a language-based representation for semantic events. The system decomposes the action prediction as:

$$\pi(a_{t:t+H}, l_{t+1}, m_{t+1} \mid o_{t-T:t}, m_t, g) \approx \pi_{LL}(a_{t:t+H} \mid o_{t-K:t}, l_{t+1}, g)\, \pi_{HL}(l_{t+1}, m_{t+1} \mid o_t, m_t, g)$$

Here, a high-level policy (π_HL) maintains a running language summary (m_t) of past events and generates subtask instructions (l_{t+1}) for a low-level policy (π_LL). This language memory is trained using LLM-generated summaries that compress information (e.g., ‘I placed three bowls’ instead of individual attributes), reducing the risk of training-inference distribution shifts.


Implementation and Performance

The research team integrated MEM into the π0.6 VLA, which is initialized from a pre-trained Gemma 3-4B model. The model was pre-trained on a diverse mixture of robot demonstrations, vision-language tasks, and internet video data.

Key Results:

In-Context Adaptation: MEM enables robots to adapt manipulation strategies based on recent failures. In evaluation, this led to a +62% success rate increase in opening refrigerators with unknown hinge directions and a +11% increase in picking up chopsticks at variable heights.

Long-Horizon Tasks: The model successfully performed 15-minute tasks like ‘Recipe Setup’ (retrieving ingredients from multiple locations) and ‘Kitchen Cleaning’ (washing dishes and wiping counters). Memory-less VLAs failed these tasks significantly more often.

Efficiency: The video encoder allows the model to process up to 16 observation frames (spanning ~1 minute) while remaining under critical real-time inference thresholds on a single NVIDIA H100 GPU.

MEM demonstrates that combining dense, short-term visual tokens with compressed, long-term language summaries allows VLAs to scale their ‘working memory’ without incurring prohibitive computational costs.

Check out the Paper and Technical details.
The post Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks appeared first on MarkTechPost.

Embed Amazon Quick Suite chat agents in enterprise applications

Organizations can face two critical challenges with conversational AI. First, users need answers where they work—in their CRM, support console, or analytics portal—not in separate tools. Second, implementing a secure embedded chat in their applications can require weeks of development to build authentication, token validation, domain security, and global distribution infrastructure.
Amazon Quick Suite embedded chat helps solve the first challenge by bringing conversational AI directly into your applications, so users can query structured data, search documents, and trigger actions without switching tools.
In this post, we show you how to solve the second challenge with a one-click deployment solution to embed the chat agents using the Quick Suite Embedding SDK in enterprise portals.
Solution overview
The solution deploys a secure web portal for the embedded chat using Amazon CloudFront for global content delivery, Amazon Cognito for OAuth 2.0 authentication, Amazon API Gateway for REST API endpoints, AWS Lambda for serverless API processing, and OpenID Connect (OIDC) federation for identity integration with the Quick Suite.
The solution implements defense-in-depth security with multiple layers of protection: DDoS protection on CloudFront, a private Amazon Simple Storage Service (Amazon S3) bucket with origin access control helping prevent direct access to frontend assets, AWS WAF rate limiting protection on API Gateway, and JSON Web Token (JWT) signature validation using Amazon Cognito public keys before generating time-limited user-specific embed URLs with least-privilege AWS Identity and Access Management (IAM) permissions.
The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

Users access the web portal URL, which routes to CloudFront.
CloudFront uses origin access control to fetch HTML, CSS, and JavaScript files from a private S3 bucket.
The web application checks for a valid authentication token and redirects unauthenticated users to the Amazon Cognito hosted UI for OAuth 2.0 login.
Users enter credentials on the Amazon Cognito login page, which validates them and redirects back to the CloudFront URL with a single-use authorization code.
The application extracts the authorization code and makes an HTTPS API call to API Gateway, which passes through AWS WAF rate limiting.
API Gateway invokes a Lambda function with the authorization code.
The Lambda function makes a server-to-server HTTPS call to the Amazon Cognito OAuth token endpoint, exchanging the authorization code for JWT tokens (ID token, access token, refresh token).
The function validates the ID token’s cryptographic signature against the Amazon Cognito public keys published in its JSON Web Key Set (JWKS), with thread-safe caching.

The following is a decoded JWT example:

{
  "at_hash": "abcdefifB5vH2D0HEvLghi",
  "sub": "12345678-abcd-1234-efgh-123456789012",
  "email_verified": true,
  "iss": "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_EXAMPLE123",
  "cognito:username": "12345678-abcd-1234-efgh-123456789012",
  "origin_jti": "abcd1234-5678-90ef-ghij-klmnopqrstuv",
  "aud": "1a2b3c4d5e6f7g8h9i0j1k2l3m",
  "event_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "token_use": "id",
  "auth_time": 1704063600,
  "exp": 1704067200,
  "iat": 1704063600,
  "jti": "abcdef12-3456-7890-abcd-ef1234567890",
  "email": "user123@example.com"
}

The Lambda function calls the AWS Security Token Service (AWS STS) AssumeRoleWithWebIdentity API with the verified ID token to assume the IAM web identity role and receive temporary AWS credentials.
The function uses the temporary credentials to call the Quick Suite ListUsers API to verify the user exists, then calls the GenerateEmbedUrlForRegisteredUser API to help generate a secure embedded URL with domain restrictions.
The function returns the embed URL in a JSON response with cross-origin resource sharing (CORS) headers through API Gateway to CloudFront. The following is an embed URL example:

{"ChatEmbedUrl": "https://us-east-1.quicksight.aws.amazon.com/embedding/abcdefe827dd4ef8b4e1fb921db046c4/quick/chat?code=Abcdef….&identityprovider=quicksight&isauthcode=true", "user": "user123@example.com"}

The CloudFront application uses the Quick Suite Embedding SDK to create an embedding context and render the chat interface in an HTML iframe with secure cross-origin communication.
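The claim-extraction part of step 8 can be illustrated with the standard library alone; a real deployment must additionally verify the RS256 signature against the Cognito JWKS (for example with a JWT library), whereas the token below is a hand-built, unsigned example used only to show how claims are parsed:

```python
import base64, json, time

def b64url(data: bytes) -> str:
    # Base64url-encode without padding, as JWTs do.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def parse_claims(jwt_token: str) -> dict:
    # Split header.payload.signature and decode the payload segment.
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Hand-built demo token (unsigned; real Cognito ID tokens are RS256-signed).
header = b64url(json.dumps({"alg": "RS256", "typ": "JWT"}).encode())
payload = b64url(json.dumps({
    "email": "user123@example.com",
    "token_use": "id",
    "exp": int(time.time()) + 3600,
}).encode())
demo_jwt = f"{header}.{payload}.fake-signature"

claims = parse_claims(demo_jwt)
assert claims["exp"] > time.time()  # reject expired tokens
print(claims["email"])  # → user123@example.com
```

Decoding alone proves nothing about authenticity; the Lambda function in this solution only trusts claims after the JWKS signature check succeeds.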

You can deploy the solution with the following high-level steps:

Deploy the serverless infrastructure using the AWS Cloud Development Kit (AWS CDK).
Provision users in Amazon Cognito and Quick Suite.
Share the Quick Suite assets (chat agent and associated connections, knowledge base).
Access the web portal to use Quick Suite chat agents.

Prerequisites
The following prerequisites are required to deploy the solution demonstrated in this post:

An AWS account 
A Quick Suite subscription using the Password-based or Single Sign-On authentication method
The AWS CDK CLI 
The AWS SDK for Python (Boto3) 
An AWS CLI profile with appropriate permissions to deploy the solution, including list Quick Suite namespaces, create IAM roles and AWS resources including CloudFront distribution, S3 bucket, API Gateway REST API, AWS WAF web access control list, Lambda function, and Amazon Cognito user pool
Node.js 20+
jq 1.7+
Docker Desktop running

Deploy serverless infrastructure using AWS CDK
Complete the following steps to deploy the serverless infrastructure using the AWS CDK:

Clone the GitHub repository:

git clone git@github.com:aws-samples/sample-quicksuite-chat-embedding.git
cd sample-quicksuite-chat-embedding

Deploy the infrastructure:

./setup.sh

You will be prompted to enter your AWS Region code, AWS CloudFormation stack ID and portal title, and your AWS CLI profile.

Provision users in Amazon Cognito and Quick Suite
Complete the following steps to provision users in Amazon Cognito and Quick Suite:

Create an Amazon Cognito user in an Amazon Cognito user pool:

python scripts/create_cognito_user.py --profile <aws-profile> <cognito-user-email>

Create a federated user in Quick Suite:

python scripts/create_quicksuite_user.py --profile <aws-profile> <cognito-user-email>

Share Quick Suite chat agent
Complete the following steps to share your Quick Suite chat agent:

Sign in to the Quick Suite console using credentials with the Quick Suite Author Pro role.
Choose Chat agents in the navigation pane.
Select the agents you want to share (for example, AnyCompany Ecom order assistant) and choose Share.

Search for the user name (for example, user123@example.com) you created earlier.
Choose Share.

After sharing this agent, you also need to share each linked resource of the agent separately to confirm full functionality.
Access web portal to use the Quick Suite chat agents
Complete the following steps to access the web portal and start using the chat agents:

Look for the temporary password in the Amazon Cognito verification email.
Access the CloudFront URL from your web browser with the user ID and temporary password.
You will be prompted to change your password at your first login.

After the successful login, you can see My Assistant in the chat interface.

Choose the Region to connect to the custom Quick Suite chat agents.

To see the chat agents shared with you, choose Shared with me under Filter.

Choose the agent you want and start chatting.

The following screenshots show chat interactions of a customer service representative tracking an example online order and processing its return as requested by a verified customer over the phone.

Clean up
To clean up your resources, delete the AWS resources deployed:

./cleanup.sh

Conclusion
This solution addresses core challenges for embedding conversational AI at scale: securing authentication for thousands of concurrent users across global locations, maintaining enterprise-grade security with comprehensive audit trails, and simplifying deployment with automated infrastructure provisioning. You can customize the portal branding, adjust security policies, and integrate with existing identity providers. You can scale to thousands of concurrent users automatically while maintaining pay-as-you-go pricing.
To try this solution, clone the GitHub repository and deploy the complete infrastructure with one click to embed Quick Suite chat agents.

About the authors
Satyanarayana Adimula is a Senior Builder in AWS Generative AI Innovation & Delivery. Leveraging over 20 years of data and analytics expertise, he specializes in building agentic AI systems that enable large enterprises to automate complex workflows, accelerate decision-making, and achieve measurable business outcomes.

Unlock powerful call center analytics with Amazon Nova foundation models …

Call center analytics play a crucial role in improving customer experience and operational efficiency. With foundation models (FMs), you can improve the quality and efficiency of call center operations and analytics. Organizations can use generative AI to assist human customer support agents and managers of contact center teams, so they can gain insights that are more nuanced, helping redefine how and what questions can be asked from call center data.
Whereas some organizations look for turnkey solutions to introduce generative AI into their operations, such as Amazon Connect Contact Lens, others build custom customer support systems using AWS services for their microservices backend. With this comes the opportunity to integrate FMs into the system to provide AI support to human customer support agents and their managers.
One of the major decisions these organizations face is which model to use to power the AI support and analytics in their platform. For this, the Generative AI Innovation Center developed a demo application that features a collection of use cases powered by Amazon’s latest family of FMs, Amazon Nova. In this post, we discuss how Amazon Nova demonstrates capabilities in conversational analytics, call classification, and other use cases often relevant to contact center solutions. We examine these capabilities for both single-call and multi-call analytics use cases.
Amazon Nova FMs for scale
Amazon Nova FMs provide leading price-performance, making them suitable for generative AI at scale. These models are pre-trained on vast amounts of data, enabling them to perform a wide range of language tasks with remarkable accuracy and efficiency while effectively scaling to support large demand. In the context of call center analytics, Amazon Nova models can comprehend complex conversations, extract key information, and generate valuable insights that were previously difficult or impossible to obtain at scale. The demo application showcases the capabilities of Amazon Nova models for various analytical tasks, including:

Sentiment analysis
Topic identification
Vulnerable customer assessment
Protocol adherence checking
Interactive question-answering

By using these advanced AI capabilities from Amazon Nova FMs, businesses can gain a deeper understanding of their customer interactions and make data-driven decisions to improve service quality and operational efficiency.
Solution overview
The Call Center Analytics demo application is built on a simple architecture that seamlessly integrates Amazon Bedrock and Amazon Nova to enable end-to-end call center analytics for both single-call and multi-call analytics. The following diagram illustrates this architecture.

Amazon Bedrock – Provides access to the Amazon Nova FMs, enabling powerful natural language processing capabilities
Amazon Athena – Used for querying the call data stored in a structured format, allowing for efficient data retrieval and analysis
Amazon Transcribe – Fully managed, automatic speech recognition (ASR) service
Amazon Simple Storage Service (Amazon S3) – Object storage service offering industry-leading scalability, data availability, security, and performance
Streamlit – Powers the web-based UI, providing an intuitive and interactive experience for users

The application is divided into two main components: Single Call Analytics and Multi-Call Analytics. These components work together to provide a comprehensive solution that combines post-call analysis with historical data insights.
Single Call Analytics
The Single Call Analytics functionality of the application provides a detailed analysis of individual customer service calls. This feature is implemented in the Single_Call_Analytics.py script. In this section, we explore some of the key capabilities.
Sentiment analysis and vulnerable customer assessment
The solution uses Amazon Nova FMs to derive insights on both the customer and agent sentiment, as shown in the following screenshot.

Using the chatbot feature, users can ask why a sentiment was classified the way it was and get supporting references from the transcription. By quickly surfacing the phrases that support the classification, this feature makes the sentiment label easier to interpret and provides material that can feed into further analyses.

A vulnerable customer or potentially vulnerable customer is someone who, due to their personal circumstances, is particularly susceptible to financial harm or requires special consideration in financial services. The application assesses whether the customer calling in might be considered vulnerable or potentially vulnerable by passing the transcript of the selected call to the model with the following prompt:

vc_prompt = f"""You are an AI assistant for a banking call center.
Your goal is to determine if the customer in the <call_transcription> below
qualifies as a Vulnerable Customer (VC) or Potentially Vulnerable Customer (PVC).

<call_transcription>
{speaker_texts}
</call_transcription>

If the customer qualifies as a VC or PVC, return Yes and explain why.
If the customer does not qualify as a VC or PVC, return No and explain why.
"""

isVC = invoke_llm(vc_prompt, vc_model)

In this prompt, the Amazon Nova FM uses a generic definition of a vulnerable or potentially vulnerable customer to make the assessment. However, if a business has its own definition of vulnerable or potentially vulnerable customers, they can engineer the prompt to have the FM make the classification using this custom definition. This feature helps call center managers identify potentially sensitive situations and make sure vulnerable customers receive appropriate care and attention along with an explanation on why the customer was identified as such.
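The `invoke_llm` helper itself is not shown in the post. A minimal sketch of how such a helper might wrap the Amazon Bedrock Converse API is shown below; the function signature, model ID, and inference settings are assumptions based on the snippet, not the demo's actual code.

```python
def invoke_llm(prompt: str, model_id: str, region: str = "us-east-1") -> str:
    """Send a single-turn prompt to an Amazon Nova model through the
    Amazon Bedrock Converse API and return the generated text.
    (Hypothetical helper; the demo's implementation may differ.)"""
    import boto3  # imported lazily so the module loads without boto3 installed

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.converse(
        modelId=model_id,  # e.g. "amazon.nova-lite-v1:0"
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]


def build_vc_prompt(speaker_texts: str) -> str:
    """Build the vulnerable-customer assessment prompt shown above."""
    return f"""You are an AI assistant for a banking call center.
Your goal is to determine if the customer in the <call_transcription> below
qualifies as a Vulnerable Customer (VC) or Potentially Vulnerable Customer (PVC).

<call_transcription>
{speaker_texts}
</call_transcription>

If the customer qualifies as a VC or PVC, return Yes and explain why.
If the customer does not qualify as a VC or PVC, return No and explain why.
"""
```

With helpers like these, the assessment reduces to `isVC = invoke_llm(build_vc_prompt(speaker_texts), vc_model)`.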
Protocol assistance and step completion
The application uses Amazon Nova models to identify the relevant protocol for each call and check if the agent followed the prescribed steps. Protocols are currently defined in a JSON file that is ingested locally at runtime. The following code shows an example of how this is implemented:

protocol_identification_formatted = protocol_identification_prompt.format(transcript=context, protocols=protocols)
llm_protocol_key = invoke_llm(protocol_identification_formatted, protocol_model)

step_completion_formatted = step_completion_prompt.format(protocol_steps=protocol_list, context=context)
step_check = invoke_llm(step_completion_formatted, protocol_model)

This code snippet shows how the application first identifies the relevant protocol using the call transcript and a list of available protocols. After the protocol has been identified, the call transcript and protocol steps for the determined protocol are passed together to check if each step of the protocol was completed by the agent. The results are displayed in a user-friendly format, helping managers quickly assess agent performance and adherence to guidelines.
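The shape of the protocols file and the two prompt templates are not shown in the post. The sketch below illustrates one plausible structure and how the `.format(...)` calls from the snippet could be wired against it; the file layout, template wording, and protocol names are all assumptions for illustration.

```python
import json

# Hypothetical protocols.json content: protocol key -> ordered list of steps.
PROTOCOLS_JSON = """{
  "card_dispute": ["Verify customer identity",
                   "Confirm the disputed transaction details",
                   "Explain the dispute timeline",
                   "Offer a temporary card block"],
  "address_change": ["Verify customer identity",
                     "Confirm the new address",
                     "Send a confirmation letter"]
}"""

protocols = json.loads(PROTOCOLS_JSON)

# Assumed template shapes matching the .format(...) calls in the snippet above.
protocol_identification_prompt = (
    "Given the call transcript:\n{transcript}\n\n"
    "Choose the single most relevant protocol key from: {protocols}\n"
    "Return only the key."
)
step_completion_prompt = (
    "Protocol steps:\n{protocol_steps}\n\n"
    "Call transcript:\n{context}\n\n"
    "For each step, answer Completed or Not completed, with evidence."
)

context = "Customer: I don't recognize a charge on my card..."
protocol_identification_formatted = protocol_identification_prompt.format(
    transcript=context, protocols=list(protocols.keys()))

# After the model returns a key (say, "card_dispute"), its steps are checked:
protocol_list = "\n".join(protocols["card_dispute"])
step_completion_formatted = step_completion_prompt.format(
    protocol_steps=protocol_list, context=context)
```

Keeping the protocols in a data file rather than in code means new protocols can be added without touching the application logic.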

Interactive transcription view and AI assistant
The Single Call Analytics page provides an interactive transcription view, so users can read through the conversation between the agent and customer. Additionally, it includes an AI assistant feature so users can ask specific questions about the call:

user_message = call_prompt.format(query=prompt, context=context, chat_history=st.session_state.messages)
ans = invoke_llm(user_message, cb_model)

This assistant functionality, powered by Amazon Nova models, helps users gain deeper insights into specific aspects of the call without having to manually search through the transcript.

Multi-Call Analytics
The Multi-Call Analytics functionality, implemented in the Multi_Call_Analytics.py script, provides aggregate analysis across multiple calls and enables powerful business intelligence (BI) queries.
Data visualization and flexible model selection
This feature helps users quickly visualize trends and patterns across multiple calls, making it straightforward to identify areas for improvement or success.

The “Top 5 Call Topics” visual in the preceding screenshot is also powered by Amazon Nova models: the application passes each call transcript to the model, which determines the call’s main topic. Each call is then placed in the bucket for its topic, and the buckets are used to generate the visuals. By seeing the top reasons customers are calling in, businesses can focus on devising strategies to reduce call volumes for those topic categories. Additionally, the application provides flexible model selection options, so users can choose between different Amazon Nova models (such as Nova Pro, Nova Lite, and Nova Micro) for various analytical tasks. This flexibility means users can select the most appropriate model for their specific needs and use cases.
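A minimal sketch of this bucketing step, classifying each transcript into a topic and counting the top five, might look like the following; the topic labels are assumptions, and `classify_topic` stands in for a wrapped Amazon Nova call.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Assumed label set; the demo's actual topics may differ.
TOPICS = ["billing dispute", "card activation", "address change",
          "loan inquiry", "fraud report", "other"]

def top_call_topics(transcripts: Iterable[str],
                    classify_topic: Callable[[str], str],
                    n: int = 5) -> List[Tuple[str, int]]:
    """Classify each transcript into one topic bucket and return the
    n most common topics with their counts."""
    counts = Counter()
    for transcript in transcripts:
        topic = classify_topic(transcript)
        # Guard against free-form model output drifting off the label set.
        counts[topic if topic in TOPICS else "other"] += 1
    return counts.most_common(n)
```

In the demo, `classify_topic` would wrap an `invoke_llm` call whose prompt lists the allowed topics and asks the model to return exactly one.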
Analytical AI Assistant
One of the key features of the Multi-Call Analytics page is the Analytical AI Assistant, which can handle complex BI queries using SQL.

The following code demonstrates how the application uses Amazon Nova models to generate SQL queries based on natural language questions:

user_prompt = """Given the following schema:
{schema}
and a user query, generate a SQL query which can be executed in AWS Athena.
The table name is {table_name}.

Give the SQL query as a JSON response.
"""

# user_prompt is formatted with the schema, table name, and user question
# to produce final_prompt before invocation
sql_query, chart = invoke_llm(final_prompt, cb_model, "sql")

The assistant can understand complex queries, translate them into SQL, and even suggest appropriate chart types for visualizing the results. The generated SQL runs through Athena against the processed Amazon Transcribe output, and the results are surfaced in the Analytical AI Assistant.
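Since the model is asked to return its SQL as JSON, the application needs to parse that response before executing it. The sketch below shows one way this could work, assuming the JSON carries "sql" and optional "chart" keys (the key names and the Athena wiring are assumptions, not the demo's actual code):

```python
import json
import re
from typing import Optional, Tuple

def parse_sql_response(llm_output: str) -> Tuple[str, Optional[str]]:
    """Parse the model's JSON response into (sql_query, chart_type).

    Hypothetical helper: assumes the model returns JSON with "sql" and
    optional "chart" keys, possibly wrapped in a Markdown code fence.
    """
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    payload = json.loads(match.group(0))
    return payload["sql"], payload.get("chart")

def run_in_athena(sql: str, database: str, output_s3: str) -> str:
    """Submit the generated SQL to Athena and return the query execution ID."""
    import boto3  # lazy import; only needed when actually querying Athena

    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
```

Extracting the JSON with a tolerant regex guards against models that wrap their answer in commentary or a ```json fence.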
Implementation
The Call Analytics demo application is implemented using the Streamlit UI for speed and simplicity of development. The application is a mix of specific use cases and AI tasks to provide a sample of what Amazon Nova models can do for call center operations and analytics use cases. For more information about how this demo application is implemented, refer to the following GitHub repo.
Conclusion
In this post, we discussed how Amazon Nova FMs power the Call Center Analytics demo application, representing significant advancements in the field of call center analytics. By using the power of these advanced AI models, businesses can gain unique insights into their customer interactions, improve agent performance, and enhance overall operational efficiency. The application’s comprehensive features, including sentiment analysis, protocol adherence checking, vulnerable customer assessment, and powerful BI capabilities, provide call center managers the tools they need to make data-driven decisions and continuously improve their customer service operations.
As Amazon Nova FMs continue to evolve and improve, we can expect even more powerful and sophisticated analytics capabilities in the future. This demo serves as an excellent starting point for customers looking to explore the potential of AI-powered call center analytics and applying it in their own environment. We encourage readers to explore the Call Center Analytics demo to learn more details of how Amazon Nova models are integrated in the application.

About the authors

Francisco Calderon Rodriguez
Francisco Calderon Rodriguez is a Data Scientist at the Generative AI Innovation Center (GAIIC). As a member of the GAIIC, he helps discover the art of the possible with AWS customers using generative AI technologies. In his spare time, Francisco likes playing music and guitar, playing soccer with his daughters, and enjoying time with his family.

Harpreet Cheema
Harpreet Cheema is a Deep Learning Architect at the AWS Generative AI Innovation Center. He is passionate about machine learning and about tackling problems across the ML domain. In his role, he focuses on developing and delivering generative AI-focused solutions for real-world applications.

Jamal Saboune
Jamal Saboune is an Applied Science Manager with the AWS Generative AI Innovation Center. He currently leads a team focused on helping AWS customers build innovative and scalable generative AI products across several industries. Jamal holds a PhD in AI and Computer Vision from the INRIA lab in France and has extensive R&D experience designing and building AI solutions that add value to users.

How Ricoh built a scalable intelligent document processing solution on …

This post is cowritten by Jeremy Jacobson and Rado Fulek from Ricoh.
This post demonstrates how enterprises can overcome document processing scaling limits by combining generative AI, serverless architecture, and standardized frameworks. Ricoh engineered a repeatable, reusable framework using the AWS GenAI Intelligent Document Processing (IDP) Accelerator. This framework reduced customer onboarding time from weeks to days. It also increased processing capacity for new AI-intensive workflows that required complex document splitting. The capacity is projected to grow sevenfold to over 70,000 documents per month. Additionally, the solution decreased engineering hours per deployment by over 90%.
Ricoh USA, Inc. is a global technology leader serving a diverse client base in over 200 countries. Within its healthcare practice, Ricoh serves major health insurance payers, managed care organizations, and healthcare providers—processing hundreds of thousands of critical documents each month, including insurance claims, grievances, appeals, and clinical records for their clients. They faced a challenge common to enterprises modernizing document-heavy workflows: reliance on custom manual engineering. Each new healthcare customer implementation required unique development and tuning by specialized engineers. Additionally, deployment required custom prompt engineering, model fine-tuning, and integration testing that couldn’t be reused across customers. Although this provided an exceptional, bespoke experience for Ricoh customers, the time and effort involved created bottlenecks that limited expansion. With an anticipated sevenfold increase in volume, Ricoh seized the opportunity to innovate.
The challenge was not just to automate processes. It was to build a scalable solution that could deliver state-of-the-art AI for document extraction and agentic workflows. This solution needed to meet strict compliance standards, including HITRUST, HIPAA, and SOC II. These requirements often stand at odds with rapid AI innovation. Compliance frameworks typically restrict data sharing that limits model training capabilities. They also mandate rigorous security controls that can impede the agility needed for iterative AI development and deployment. Despite these challenges, Ricoh made it a priority to overcome this tension for their customers. Building upon foundation models (FMs) available through Amazon Bedrock and combining them with Amazon Textract, Ricoh made it possible for customers to benefit from cutting-edge automation that aligns with the strictest compliance standards.
This post explores how Ricoh built a standardized, multi-tenant solution for automated document classification and extraction using the AWS GenAI IDP Accelerator as a foundation, transforming their document processing from a custom-engineering bottleneck into a scalable, repeatable service.
Customer overview
Ricoh USA, Inc. is a global technology leader delivering digital workplace services, document management, and business process automation solutions to organizations in over 200 countries. Within its healthcare practice, Ricoh serves major health insurance payers, managed care organizations, and healthcare providers—processing thousands of critical documents each month, including insurance claims, grievances, appeals, and clinical records.
“Within the Ricoh Intelligent Business Platform, the workflows that required the highest levels of intelligence for key IDP tasks experienced explosive growth. We needed to move from bespoke builds to a platform,” says Jeremy Jacobson, AI Architect, Portfolio Solution Development at Ricoh. “For our customers, we integrate, operate, and evolve AI so they don’t have to. Aligning our proprietary IDP patterns and technologies with the AWS GenAI IDP accelerator amplified this advantage. So equipped, we delivered a HITRUST CSF-certified configurable IDP platform that ties our customers to the frontiers of AI.”
Healthcare documents often arrive unstructured and highly variable. A single packet might include multiple document types—fax covers, clinical notes, and appeal forms—each with different layouts and naming conventions. Documents ranged from 15–50 pages; some contained cover letters while others did not. Different healthcare providers used varying document structures, field naming conventions, and placements of critical information. Template-based extraction approaches proved ineffective.
For Ricoh’s Intelligent Business Platform services, functional requirements included capturing data attributes from scans of unstructured or semi-structured documents and assigning to each data attribute a confidence level that reliably identifies when human review is needed. Every attribute with a confidence level below a predefined threshold is reviewed by a person to verify accuracy and compliance. Human reviewers verify extracted data, correct errors, and validate that critical healthcare information—such as member IDs, diagnosis codes, and claim amounts—meets the quality standards required for regulatory compliance alignment and claims processing. This human-in-the-loop approach achieves two key business outcomes: maintaining the high accuracy levels (typically 98–99%) required by healthcare payers while reducing manual review costs by 60–70% compared to fully manual processing.
The solution needed to extract key data such as member IDs, provider information, and claim details from various sections of documents, with the capability to search through clinical notes and other sections when information was not found in cover letters. Non-functional requirements addressed several critical operational needs:

Performance and scalability – Handle traffic spikes to process up to 1,000 documents in minutes while avoiding wasted computational resources during low-traffic periods
Accuracy and quality – Meet strict service level agreements (SLAs) for delivery deadlines and data accuracy
Cost optimization – Enable configurable confidence thresholds that balance accuracy requirements with manual review costs—keeping wrongly captured attributes below the agreed SLA while minimizing expensive human review
Operational efficiency – Enable quick customer onboarding through configuration changes rather than code changes

Challenges with complex document processing workflows
For some time, the Ricoh team had combined traditional optical character recognition (OCR)—which detects and extracts text from scanned documents—with multimodal AI models that can understand both text and images simultaneously. This approach helped address complex challenges such as distinguishing between similar fields when extracting data from documents with multiple names and addresses.
After multimodal FMs became available on Amazon Bedrock, it soon became clear that a simple API call to Amazon Bedrock—that is, sending a scanned document along with a prompt—would not suffice for complex workflows. When documents are composed of multiple parts or sections, such as cover sheets, contracts, or authorization responses, extraction rules often depend upon first successfully classifying the section type.
The solution needed to handle complex document classification, distinguishing between claims, disputes, emails, and fax cover sheets without breaking down packets into granular document types. Additionally, large language models (LLMs) have context window limits and experience declining performance in following instructions as the context fills. Document page size limitations required the Ricoh team to use alternative approaches for larger documents.
The Ricoh team also required flexibility to integrate with their existing high-capacity document processing workflows—including document routing systems, case management services, and downstream business applications—while maintaining control over processing steps and model selection. This included unique requirements such as splitting documents based on healthcare provider or patient information.
To improve accuracy, the Ricoh team utilized more sophisticated means of dynamically inserting context into prompts—a technique where relevant document metadata, previously extracted fields, and document structure information are programmatically added to the AI model’s instructions based on the specific document being processed. This context-aware prompting improved extraction accuracy by 15–20% compared to static prompts, helping the model understand document relationships and field dependencies.
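Dynamic context insertion of this kind can be sketched as a prompt builder that splices document metadata and previously extracted fields into the model's instructions. The template wording and field names below are illustrative assumptions, not Ricoh's actual prompts.

```python
import json
from typing import Any, Dict, List

# Assumed template; the real prompts are customer- and document-specific.
EXTRACTION_TEMPLATE = """You are extracting fields from a {doc_type} document.

Document metadata:
{metadata}

Fields already extracted from earlier sections (use them to resolve
ambiguous references, such as matching a member to a claim):
{known_fields}

From the text below, extract: {target_fields}
Return a JSON object with one key per field.

<document_text>
{text}
</document_text>
"""

def build_extraction_prompt(doc_type: str,
                            metadata: Dict[str, Any],
                            known_fields: Dict[str, Any],
                            target_fields: List[str],
                            text: str) -> str:
    """Assemble a context-aware extraction prompt for one document section."""
    return EXTRACTION_TEMPLATE.format(
        doc_type=doc_type,
        metadata=json.dumps(metadata, indent=2),
        known_fields=json.dumps(known_fields, indent=2),
        target_fields=", ".join(target_fields),
        text=text,
    )
```

Because earlier extraction results are carried forward into later prompts, the model can disambiguate fields (such as multiple names and addresses) that a static prompt would conflate.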
Although these gains were substantial, when trying to recreate this success, the Ricoh team ran into a persistent hurdle: these workflows demanded 40–60 hours of developer time per customer to set up, for instance to incorporate newly released features of the underlying models. Ricoh coordinated with the AWS Generative AI Innovation Center on the IDP Accelerator to address these scalability challenges.
Solution overview
Ricoh partnered with AWS to implement the GenAI IDP Accelerator, a reference framework designed to help you deploy production-grade document processing solutions. The accelerator provides multiple processing patterns optimized for different document types and workflows.
The team selected Processing Pattern 2, which combines Amazon Textract for OCR—the technology that converts images of text into machine-readable text—with Amazon Bedrock FMs for intelligent classification and extraction. This pattern is specifically designed for complex, multi-part documents that require both text extraction and AI-powered understanding. The approach offered full control over model orchestration and was ideal for handling Ricoh’s multi-part healthcare documents because it supports sequential processing (classify first, then extract based on classification) and handles documents exceeding typical LLM context windows by processing them in sections.
The solution was architected to align with stringent healthcare compliance requirements. For HIPAA compliance, the Protected Health Information (PHI) is encrypted at rest using AWS Key Management Service (AWS KMS) and in transit using TLS 1.2+. Access controls follow the principle of least privilege, with AWS Identity and Access Management (IAM) policies restricting data access to authorized personnel only.
For HITRUST certification requirements, the architecture implements comprehensive audit logging through Amazon CloudWatch and AWS CloudTrail, capturing data access and processing activities. SOC 2 Type II compliance alignment is supported through the use of AWS services that maintain their own SOC 2 certifications, combined with Ricoh’s documented operational controls for change management, event response, and continuous monitoring.
The pay-per-use pricing model removes idle infrastructure costs—Ricoh only pays for actual document processing, with no charges during periods of inactivity. This cost predictability was crucial for supporting multiple customers with varying document volumes, as each customer’s costs scale proportionally with their usage rather than requiring fixed infrastructure investments.
Documents enter using Amazon Simple Storage Service (Amazon S3), triggering event-driven workflows. AWS Lambda functions invoke Amazon Bedrock models to determine document types such as claims, appeals, faxes, grievances, prior authorization requests, and clinical documentation. Amazon Textract parses text and layout, and the results are combined with Amazon Bedrock models for structured data extraction. Custom business rules—configurable logic specific to each customer’s requirements, such as field validation rules, document routing criteria, and data transformation specifications—work alongside confidence scoring to determine which fields require human review.
Confidence scores are calculated by comparing extraction results from multiple sources (Amazon Textract and Amazon Bedrock) and assigning a numerical value (0–100%) indicating the system’s certainty in each extracted field. Fields scoring below customer-defined thresholds (typically 70–85%) are flagged for human validation. Final outputs are stored in Amazon S3, with low-confidence cases routed for human validation through review queues where operators verify extracted data, correct errors, and provide feedback that improves future processing.
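One simple way to realize this dual-source scoring is to compare the two extractions field by field and combine agreement with the OCR engine's own confidence. The weighting and threshold below are illustrative assumptions, not Ricoh's production logic.

```python
from typing import Dict, Tuple

def score_fields(textract_fields: Dict[str, Tuple[str, float]],
                 bedrock_fields: Dict[str, str],
                 threshold: float = 80.0) -> Dict[str, dict]:
    """Assign each field a 0-100 confidence and flag it for human review.

    textract_fields: field -> (value, Amazon Textract confidence 0-100)
    bedrock_fields:  field -> value extracted by the FM via Amazon Bedrock
    """
    results = {}
    for field, fm_value in bedrock_fields.items():
        ocr_value, ocr_conf = textract_fields.get(field, ("", 0.0))
        if fm_value.strip().lower() == ocr_value.strip().lower():
            confidence = ocr_conf          # both extraction sources agree
        else:
            confidence = ocr_conf * 0.5    # disagreement halves confidence
        results[field] = {
            "value": fm_value,
            "confidence": confidence,
            "needs_review": confidence < threshold,  # route to review queue
        }
    return results
```

Fields that fall below the customer-defined threshold would then be routed to the human review queues described above.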
The core IDP-Common engine from the AWS GenAI IDP Accelerator served as the integration layer, helping Ricoh maintain its established workflows. The IDP Common Package is a Python library that provides shared functionality for the Accelerated Intelligent Document Processing solution on AWS. This solution helps businesses automatically extract and process information from documents using AI services, removing manual data entry and improving accuracy.
Each customer deployment is instantiated using a configurable AWS Serverless Application Model (AWS SAM) application deployed as an AWS CloudFormation stack, supporting rapid onboarding. This abstracts away infrastructure details—including Amazon Virtual Private Cloud (Amazon VPC) configuration, security group rules, IAM role policies, and service quotas—so team members can focus only on the customer-dependent parameters such as Lambda reserved concurrency or database connection details. This focused approach is valuable when onboarding a new customer.
The modular design helped Ricoh integrate specific parameters and custom functionality such as customer-defined proprietary document classification, custom data extraction for industry-specific forms, or redaction rules for personally identifiable information (PII) compliance alignment into their existing high-capacity workflow without disrupting established processes. This approach helped the team maintain operational efficiency through automated deployment that reduced customer onboarding time from weeks to days, while adding advanced AI capabilities for document processing, including intelligent document classification, and automated data extraction from unstructured forms.
Architecture details
The architecture was designed with three primary objectives: enable rapid customer onboarding through configuration rather than code changes, help align with healthcare regulations (HIPAA, HITRUST, SOC 2), and provide cost-effective scalability for variable document volumes. The serverless approach was chosen to remove infrastructure management overhead and align costs directly with usage, and the multi-tenant design with per-customer queues balances resource efficiency with workload isolation. The decision to use Processing Pattern 2 (Amazon Textract and Amazon Bedrock) rather than Amazon Bedrock alone was driven by the need to handle documents exceeding LLM context windows and the requirement for structured text extraction that could be selectively included in prompts based on document type.
The implementation used a serverless architecture in which Lambda functions are automatically invoked upon upload of scanned documents to Amazon S3. The Lambda functions handle calls to the AI services—Amazon Textract and Amazon Bedrock—and output the captured attributes along with their confidence scores to an Amazon DynamoDB database.
The architecture incorporates AWS Well-Architected Framework principles across multiple pillars. For security, the data is encrypted at rest using AWS KMS with customer-managed keys and in transit using TLS 1.2+. IAM roles enforce least-privilege access, separated by function, with separate roles for document ingestion, processing, and retrieval. CloudTrail logs the API calls for audit trails, and CloudWatch Logs captures application-level events for security monitoring.
For reliability, the serverless design removes single points of failure, with automatic retries and dead-letter queues (DLQs) handling transient errors. For performance efficiency, Lambda concurrency limits and Amazon Simple Queue Service (Amazon SQS) queue throttling helps prevent API quota exhaustion while maintaining high throughput. For cost optimization, the pay-per-use model removes idle resource costs, and Amazon S3 lifecycle policies automatically transition processed documents to lower-cost storage tiers.
For operational excellence, infrastructure as code using AWS SAM and CloudFormation enables consistent deployments, and CloudWatch dashboards and alarms provide real-time visibility into processing metrics and error rates.
A critical part of the architecture is an SQS queue that makes it possible for the team to control the rate at which they are making requests to Amazon Textract and Amazon Bedrock API endpoints by controlling message processing velocity through Lambda concurrency settings and Amazon SQS visibility timeouts. This design helps them stay within service quota limits (such as transactions per second for Amazon Textract and requests per minute for Amazon Bedrock). Furthermore, Amazon SQS seamlessly facilitates retries and sending of unprocessed messages to a DLQ.
Each customer has its own Amazon EventBridge rule and SQS queue, enabling multi-tenant isolation (helping prevent one customer’s high volume from impacting others) and independent scaling (allowing per-customer concurrency limits and throughput controls).
The architecture used Amazon S3 for document storage. Different buckets were created to manage documents from various sources, including fax, scan, and SFTP systems. DynamoDB tables stored document metadata and processing state, tracking document versions and helping prevent multiple attempts to update the same document simultaneously. CloudWatch provided comprehensive monitoring and logging of successful extraction rates and processing anomalies.
The actual interaction with AI services uses Amazon Textract to augment Amazon Bedrock prompts with structured data extracted from the scanned document. Here, the team took advantage of their previous Amazon Textract based solution and used it as another source of truth for the extracted values, which make it possible to compute reliable confidence scores by comparing results from both extraction methods. This dual-extraction approach was used during the initial deployment phase to validate accuracy, with the legacy system phased out after confidence in the new system was established.
For document processing, the solution used Amazon Textract to extract text from large healthcare documents, addressing the challenge of documents that exceeded the context window limitations of FMs when processed as images. For example, a 50-page clinical record would exceed most LLM context windows if sent as images, but Amazon Textract converts it to structured text that can be selectively included in prompts. Amazon Bedrock FMs handled the intelligent classification and extraction tasks, with tailored instructions for healthcare data designed to identify document types and extract healthcare-specific information such as member IDs, provider details, and claim information.
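The large-document handling described above can be sketched as chunking the Amazon Textract output into sections that fit a model's context budget. The token estimate and chunk size here are rough assumptions; a common heuristic treats a token as roughly four characters.

```python
from typing import List

def chunk_pages(pages: List[str], max_tokens: int = 6000) -> List[str]:
    """Group per-page Textract text into chunks that fit a model's context
    budget, keeping page boundaries intact."""
    max_chars = max_tokens * 4  # rough ~4 chars/token heuristic
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for page in pages:
        # Start a new chunk when adding this page would exceed the budget.
        if current and current_len + len(page) > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(page)
        current_len += len(page)
    chunks.append("\n".join(current))
    return chunks
```

Each chunk can then be classified and extracted in sequence, with results from earlier chunks carried forward as context for later ones.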
For document classification and splitting, the team used LLMs to intelligently identify document types and split multi-document packets based on provider or patient information.
Regarding fast onboarding for new customers, the team used a configurable AWS SAM application deployed as a CloudFormation nested stack for each customer. This abstracts away infrastructure details—such as VPC configuration, security group rules, IAM role policies, and service quotas—and so team members can focus only on the customer-dependent parameters when onboarding a new customer.
The modular architecture helped Ricoh deploy only the components they needed while maintaining the option to add additional features such as document summarization or knowledge base integration in the future.
Results and outcomes
Ricoh has been able to lower prices for an important healthcare customer by measuring and achieving significant reductions in human labor required to index documents in production. Human indexers now concentrate their time on difficult documents and extractions, with AI serving as their partner in the process rather than performing routine data entry.
Ricoh’s Intelligent Business Platform achieved significant operational improvements and potential annual savings exceeding 1,900 person-hours through automation, dramatically reducing the manual effort required for document processing.
The automated classification system successfully distinguished between insurance policy holders’ grievances and appeals claims, a critical capability for healthcare compliance and workflow management. These document types have different regulatory timelines (grievances typically require 30-day resolution, appeals require 60 days) and must be routed to different processing teams. Misclassification can result in missed deadlines, regulatory penalties, and member dissatisfaction.
The solution demonstrated extraction accuracy levels that help minimize financial penalties from processing errors, a crucial outcome in the heavily regulated healthcare industry. The confidence scoring capabilities enabled effective human-in-the-loop review processes, helping verify that documents requiring expert validation were properly flagged while allowing high-confidence extractions to proceed automatically.
Ricoh successfully created a reusable framework that can be deployed across multiple healthcare customers, providing a scalable foundation for expanding their document processing services to future use cases. The solution now processes over 10,000 healthcare documents monthly with the infrastructure in place to scale to 70,000 documents as client needs grow.
The Intelligent Business Platform achieved significant operational improvements, as detailed in the following table.

| Key Performance Indicator | Before (Legacy) | After (AWS IDP Accelerator) | Impact |
| --- | --- | --- | --- |
| Onboarding Time | 4–6 weeks | 2–3 days | >90% reduction |
| Monthly Throughput | ~10,000 documents | >70,000 documents | 7-fold increase |
| Engineering Hours per Deployment | ~80 hours | <5 hours | >90% reduction |
| Processing Capacity | Limited | 1,000 documents in minutes | Handles traffic spikes |

Best practices and lessons learned
The Ricoh implementation highlighted several best practices for deploying IDP solutions in production environments:

Choose the appropriate processing pattern – Selecting Pattern 2 from the AWS IDP Accelerator provided the flexibility needed for complex healthcare document requirements while maintaining control over model selection and processing steps. This choice was essential for handling unique document splitting requirements and integration with existing workflows.
Use a hybrid approach combining OCR with FMs – The team found that using Amazon Textract to augment Amazon Bedrock prompts with structured data provided both scalability and accuracy for documents of varying sizes and complexity. This hybrid approach of combining OCR with FMs addressed practical limitations around context window sizes when processing documents as images—Amazon Textract handles documents of different sizes, and Amazon Bedrock provides intelligent understanding of the extracted text, enabling both scalability (no document size limits) and accuracy (AI-powered field extraction and validation). Taking advantage of the previous Amazon Textract based solution as another source of truth during the validation phase helped the team compute reliable confidence scores without incurring significant additional costs, because Amazon Textract was already being used for text extraction in the new architecture.
Integrate confidence scoring from the beginning – Integrating confidence scoring from the beginning enabled effective human-in-the-loop workflows, allowing the system to automatically flag uncertain extractions for expert review. This approach balanced automation benefits with the accuracy requirements of healthcare document processing. Configurable confidence thresholds proved essential for meeting customer requirements—helping teams keep wrongly captured attributes below agreed SLAs while minimizing the cost of manual review.
Implement rate limiting with SQS queues – Implementing an SQS queue to limit the rate of API calls to Amazon Textract and Amazon Bedrock endpoints helped the team stay within quota limits while seamlessly facilitating retries and DLQ handling. This architectural decision helped prevent throttling issues and improved overall system reliability.
Standardize using configuration rather than code – Standardizing using configuration rather than code changes was a key enabler of rapid customer onboarding. The configurable AWS SAM application deployed as a CloudFormation nested stack for each customer abstracted away infrastructure details, so team members could focus only on customer-dependent parameters. This approach reduced maintenance efforts and enabled quick onboarding for new customers.
Use a modular architecture for integration – The modular architecture of the GenAI IDP Accelerator proved valuable for integration with existing systems. Rather than replacing established workflows, the core IDP-Common engine helped Ricoh enhance their current infrastructure with AI capabilities—including document classification, intelligent field extraction, confidence scoring, and natural language understanding.
Plan for scalability from the outset – Planning for scalability from the outset enabled smooth growth from proof of concept to production volumes. The serverless architecture’s automatic scaling capabilities and pay-per-use pricing model aligned infrastructure costs with business growth, providing predictable economics as document volumes increased. The architecture handled spikes in traffic to seamlessly process up to 1,000 documents in a few minutes while not wasting computational resources during periods of low or no traffic.
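The confidence-scoring practice above can be sketched as a simple routing rule. This is a minimal illustration with an assumed threshold value; Ricoh's actual thresholds, field names, and review logic are not described in this post.

```python
REVIEW_THRESHOLD = 0.85  # assumed value; in practice tuned per customer SLA

def route_extraction(field, value, confidence):
    """Route one extracted field: auto-accept when confidence clears the
    threshold, otherwise flag it for human-in-the-loop review."""
    route = "auto_accept" if confidence >= REVIEW_THRESHOLD else "human_review"
    return {"field": field, "value": value, "confidence": confidence, "route": route}

batch = [
    ("member_id", "A12345", 0.99),
    ("grievance_date", "2024-03-07", 0.62),
]
results = [route_extraction(*item) for item in batch]
print([r["route"] for r in results])  # ['auto_accept', 'human_review']
```

Making the threshold configurable per customer is what lets teams keep wrongly captured attributes below agreed SLAs while minimizing manual review cost, as noted above.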

Getting started
Ready to build your own IDP solution? The AWS GenAI IDP Accelerator provides a proven foundation for deploying production-grade document automation:

Explore the accelerator – Visit the GenAI IDP Accelerator repository to review the architecture patterns, deployment guides, and sample code
Choose your pattern – Review the multiple processing patterns available and select the one that best fits your document types and workflow requirements
Start small, scale fast – Begin with a proof of concept using your most challenging document types, then use the modular architecture to expand across your organization
Leverage AWS expertise – Connect with AWS Solutions Architects and the GenAI Innovation Center to discuss your specific use case and implementation strategy

For organizations processing high volumes of complex documents, the combination of serverless architecture, FMs, and standardized frameworks offers a path to rapid deployment and scalable growth.
Conclusion
Ricoh’s implementation of the AWS GenAI IDP Accelerator demonstrates how enterprises can overcome scaling limits by combining generative AI, serverless architecture, and compliance frameworks. The result is faster onboarding, higher accuracy, and reduced operational overhead—all without compromising compliance standards (HIPAA, HITRUST, SOC 2) or operational control. By developing a reusable framework rather than single-use solutions, Ricoh transformed document processing into a scalable service.
The Intelligent Business Platform’s ability to handle complex healthcare document variations, provide confidence scoring for human-in-the-loop workflows, and scale from 10,000 to potentially 70,000 documents monthly showcases the practical benefits of IDP powered by generative AI on AWS. The reusable framework Ricoh created can now be deployed across multiple healthcare customers, providing a foundation for expanding their document processing services.
For organizations facing similar document processing challenges, the GenAI IDP Accelerator offers a proven path from proof of concept to production-ready solutions. The combination of serverless architecture, multiple processing patterns, and integration flexibility helps teams build document automation tailored to their specific needs while using the latest advances in generative AI and AWS services. Their story is proof that with the right foundation, AI doesn’t just automate work—it can accelerate growth.
To get started with the GenAI IDP Accelerator, visit the project repository and explore the documentation and deployment guides.
Acknowledgments
Special thanks to Bob Strahan for his leadership of the GenAI IDP Accelerator project. We would also like to thank Guillermo Tantachuco, Saeideh Shahrokh Esfahani, Mofijul Islam, Suresh Konappanava, and Yingwei Yu for their contributions and guidance throughout.

About the authors

Jeremy Jacobson
Jeremy Jacobson is a lead developer and solutions architect for AI within Ricoh USA’s Intelligent Business Platform (IBP) services. His background includes experience at Emory University and the Fields Institute, which informs his approach to building production AI systems.

Rado Fulek
Rado Fulek is a software engineer at Ricoh, where he builds secure, scalable, and reliable document processing platforms. Previously, he conducted algorithmic research, publishing in top journals and discovering efficient algorithms whose existence had been an open question for decades. Rado brings a problem-solver mindset to AI, emphasizing practical, well-architected solutions that bridge the gap between cutting-edge research and real-world production systems.

Earl Bovell
Earl Bovell is a Senior Solutions Architect at Amazon Web Services (AWS), where he serves as a technical advisor and strategist helping enterprise customers solve business problems by leveraging AWS technology.

Vincil Bishop
Vincil Bishop is a Senior Deep Learning Architect in AWS’s Generative AI Innovation Center. He has over 25 years of experience in the IT industry and holds a PhD in Systems Engineering. Vincil specializes in designing and implementing AI solutions that help organizations solve their toughest business challenges.

Jordan Ratner
Jordan Ratner is a Senior Generative AI Strategist in the AWS Generative AI Innovation Center, where he partners with C-suite leaders and engineering teams to design, prototype, and deploy generative AI solutions.

Meet SymTorch: A PyTorch Library that Translates Deep Learning Models …

Can symbolic regression transform opaque deep learning models into interpretable, closed-form mathematical equations? Say you have trained your deep learning model. It works. But do you know what it has actually learned? A team of University of Cambridge researchers proposes SymTorch, a library designed to integrate symbolic regression (SR) into deep learning workflows. It enables researchers to approximate neural network components with closed-form mathematical expressions, facilitating functional interpretability and potential inference acceleration.

https://arxiv.org/pdf/2602.21307

Core Mechanism: The Wrap-Distill-Switch Workflow

SymTorch simplifies the engineering required to extract symbolic equations from trained models by automating data movement and hook management.

Wrap: Users apply the SymbolicModel wrapper to any nn.Module or callable function.

Distill: The library registers forward hooks to record input and output activations during a forward pass. These are cached and transferred from the GPU to the CPU for symbolic regression via PySR.

Switch: Once distilled, the original neural weights can be replaced with the discovered equation in the forward pass using switch_to_symbolic.
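The three steps above can be illustrated with a minimal, self-contained mock. This is not SymTorch's actual implementation or API; the class and method names are borrowed from the description to show the control flow only, and the "hook" here is a plain cache on a callable rather than a PyTorch forward hook.

```python
class SymbolicModel:
    """Mock of the wrap-distill-switch pattern: wraps a callable, records
    its input/output pairs (distillation data), and can later swap the
    original function for a symbolic surrogate."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = []       # recorded (input, output) activations
        self.symbolic = None  # discovered closed-form replacement

    def __call__(self, x):
        if self.symbolic is not None:
            return self.symbolic(x)  # symbolic forward pass
        y = self.fn(x)
        self.cache.append((x, y))    # the "forward hook" recording step
        return y

    def switch_to_symbolic(self, equation):
        # In SymTorch this equation would come from PySR; here we pass one in.
        self.symbolic = equation

wrapped = SymbolicModel(lambda x: 2 * x * x + 1)        # wrap
_ = [wrapped(x) for x in (0.0, 1.0, 2.0)]               # distill (record)
wrapped.switch_to_symbolic(lambda x: 2 * x**2 + 1)      # switch
print(wrapped(3.0), len(wrapped.cache))  # 19.0 3
```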

The library interfaces with PySR, which uses a multi-population genetic algorithm to find equations that balance accuracy and complexity on a Pareto front. The ‘best’ equation is chosen by maximizing the fractional drop in log mean absolute error relative to an increase in complexity.
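The selection rule described above can be sketched as a small scoring function over a Pareto front of (complexity, loss) candidates. The front values below are illustrative, not taken from the paper.

```python
import math

# Candidate Pareto front: (complexity, mean-absolute-error) pairs,
# sorted by increasing complexity (values are illustrative).
front = [(1, 1.00), (3, 0.40), (5, 0.38), (9, 0.05)]

def best_equation(front):
    """Pick the entry maximizing the drop in log-MAE per unit of added
    complexity relative to the previous (simpler) entry, a sketch of
    the selection rule described above."""
    best, best_score = front[0], float("-inf")
    for (c0, l0), (c1, l1) in zip(front, front[1:]):
        score = (math.log(l0) - math.log(l1)) / (c1 - c0)
        if score > best_score:
            best, best_score = (c1, l1), score
    return best

print(best_equation(front))  # (9, 0.05): big accuracy gain per added complexity
```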

Case Study: Accelerating LLM Inference

A primary application explored in this research is replacing Multi-Layer Perceptron (MLP) layers in Transformer models with symbolic surrogates to improve throughput.

Implementation Details

Due to the high dimensionality of LLM activations, the research team employed Principal Component Analysis (PCA) to compress inputs and outputs before performing SR. For the Qwen2.5-1.5B model, they selected 32 principal components for inputs and 8 for outputs across three targeted layers.
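The PCA compression step can be sketched in plain NumPy. The matrix sizes below are illustrative stand-ins for cached activations, not Qwen2.5-1.5B's actual dimensions, and the paper's pipeline uses PCA only as a pre-processing step before PySR.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for cached MLP activations: 512 tokens x 256 hidden dims.
X = rng.normal(size=(512, 256))

def pca_fit_transform(X, k):
    """Project rows of X onto the top-k principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Right singular vectors of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, components, mu

Z, components, mu = pca_fit_transform(X, 32)
print(Z.shape)  # (512, 32): inputs compressed to 32 components, as in the paper
```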

Performance Trade-offs

The intervention resulted in an 8.3% increase in token throughput. However, this gain came with a non-trivial increase in perplexity, primarily driven by the PCA dimensionality reduction rather than the symbolic approximation itself.

| Metric | Baseline (Qwen2.5-1.5B) | Symbolic Surrogate |
| --- | --- | --- |
| Perplexity (Wikitext-2) | 10.62 | 13.76 |
| Throughput (tokens/s) | 4878.82 | 5281.42 |
| Avg. Latency (ms) | 209.89 | 193.89 |

GNNs and PINNs

SymTorch was validated on its ability to recover known physical laws from latent representations in scientific models.

Graph Neural Networks (GNNs): By training a GNN on particle dynamics, the research team used SymTorch to recover empirical force laws, such as gravity (1/r²) and spring forces, directly from the edge messages.

Physics-Informed Neural Networks (PINNs): The library successfully distilled the 1-D heat equation’s analytic solution from a trained PINN. The PINN’s inductive bias allowed it to achieve a Mean Squared Error (MSE) of 7.40 × 10⁻⁶.

LLM Arithmetic Analysis: Symbolic distillation was used to inspect how models like Llama-3.2-1B perform 3-digit addition and multiplication. The distilled equations revealed that while the models are often correct, they rely on internal heuristics that include systematic numerical errors.
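As a toy illustration of what "recovering a 1/r² law" means, the exponent of a power law can be read off a log-log linear fit. This uses a least-squares fit on synthetic data rather than SymTorch's genetic search, so it only illustrates the target relationship, not the method.

```python
import math

# Synthetic force-vs-distance data following F = G / r^2 (G arbitrary).
rs = [0.5 + 0.02 * i for i in range(200)]
Fs = [3.7 / r**2 for r in rs]

# Fit log F = a*log r + b; an inverse-square law should yield slope a = -2.
xs = [math.log(r) for r in rs]
ys = [math.log(F) for F in Fs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(round(slope, 6))  # -2.0, i.e. the 1/r^2 exponent
```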

Key Takeaways

Automated Symbolic Distillation: SymTorch is a library that automates the process of replacing complex neural network components with interpretable, closed-form mathematical equations by wrapping components and collecting their input-output behavior.

Engineering Barrier Removal: The library handles critical engineering challenges that previously hindered the adoption of symbolic regression, including GPU-CPU data transfer, input-output caching, and seamless switching between neural and symbolic forward passes.

LLM Inference Acceleration: A proof-of-concept demonstrated that replacing MLP layers in a transformer model with symbolic surrogates achieved an 8.3% throughput improvement, though with some performance degradation in perplexity.

Scientific Law Discovery: SymTorch was successfully used to recover physical laws from Graph Neural Networks (GNNs) and analytic solutions to the 1-D heat equation from Physics-Informed Neural Networks (PINNs).

Functional Interpretability of LLMs: By distilling the end-to-end behavior of LLMs, researchers could inspect the explicit mathematical heuristics used for tasks like arithmetic, revealing where internal logic deviates from exact operations.

Check out the Paper, Repo and Project Page.
The post Meet SymTorch: A PyTorch Library that Translates Deep Learning Models into Human-Readable Equations appeared first on MarkTechPost.

How to Build a Stable and Efficient QLoRA Fine-Tuning Pipeline Using U …

In this tutorial, we demonstrate how to efficiently fine-tune a large language model using Unsloth and QLoRA. We focus on building a stable, end-to-end supervised fine-tuning pipeline that handles common Colab issues such as GPU detection failures, runtime crashes, and library incompatibilities. By carefully controlling the environment, model configuration, and training loop, we show how to reliably train an instruction-tuned model with limited resources while maintaining strong performance and rapid iteration speed.

import os, sys, subprocess, gc, locale

locale.getpreferredencoding = lambda: "UTF-8"

def run(cmd):
    print("\n$ " + cmd, flush=True)
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in p.stdout:
        print(line, end="", flush=True)
    rc = p.wait()
    if rc != 0:
        raise RuntimeError(f"Command failed ({rc}): {cmd}")

print("Installing packages (this may take 2-3 minutes)...", flush=True)

run("pip install -U pip")
run("pip uninstall -y torch torchvision torchaudio")
run(
    "pip install --no-cache-dir "
    "torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 "
    "--index-url https://download.pytorch.org/whl/cu121"
)
run(
    "pip install -U "
    "transformers==4.45.2 "
    "accelerate==0.34.2 "
    "datasets==2.21.0 "
    "trl==0.11.4 "
    "sentencepiece safetensors evaluate"
)
run("pip install -U unsloth")

import torch
try:
    import unsloth
    restarted = False
except Exception:
    restarted = True

if restarted:
    print("\nRuntime needs restart. After restart, run this SAME cell again.", flush=True)
    os._exit(0)

We set up a controlled and compatible environment by reinstalling PyTorch and all required libraries. We ensure that Unsloth and its dependencies align correctly with the CUDA runtime available in Google Colab. We also handle the runtime restart logic so that the environment is clean and stable before training begins.

import torch, gc

assert torch.cuda.is_available()
print("Torch:", torch.__version__)
print("GPU:", torch.cuda.get_device_name(0))
print("VRAM(GB):", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2))

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

def clean():
    gc.collect()
    torch.cuda.empty_cache()

import unsloth
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TextStreamer
from trl import SFTTrainer, SFTConfig

We verify GPU availability and configure PyTorch for efficient computation. We import Unsloth before all other training libraries to ensure that all performance optimizations are applied correctly. We also define utility functions to manage GPU memory during training.

max_seq_length = 768
model_name = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "k_proj"],
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    max_seq_length=max_seq_length,
)

We load a 4-bit quantized, instruction-tuned model using Unsloth’s fast-loading utilities. We then attach LoRA adapters to the model to enable parameter-efficient fine-tuning. We configure the LoRA setup to balance memory efficiency and learning capacity.

ds = load_dataset("trl-lib/Capybara", split="train").shuffle(seed=42).select(range(1200))

def to_text(example):
    example["text"] = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return example

ds = ds.map(to_text, remove_columns=[c for c in ds.column_names if c != "messages"])
ds = ds.remove_columns(["messages"])
split = ds.train_test_split(test_size=0.02, seed=42)
train_ds, eval_ds = split["train"], split["test"]

cfg = SFTConfig(
    output_dir="unsloth_sft_out",
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_steps=150,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    eval_strategy="no",
    save_steps=0,
    fp16=True,
    optim="adamw_8bit",
    report_to="none",
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    args=cfg,
)

We prepare the training dataset by converting multi-turn conversations into a single text format suitable for supervised fine-tuning. We split the dataset to maintain training integrity. We also define the training configuration, which controls the batch size, learning rate, and training duration.

clean()
trainer.train()

FastLanguageModel.for_inference(model)

def chat(prompt, max_new_tokens=160):
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to("cuda")
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    with torch.inference_mode():
        model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            streamer=streamer,
        )

chat("Give a concise checklist for validating a machine learning model before deployment.")

save_dir = "unsloth_lora_adapters"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

We execute the training loop and monitor the fine-tuning process on the GPU. We switch the model to inference mode and validate its behavior using a sample prompt. We finally save the trained LoRA adapters so that we can reuse or deploy the fine-tuned model later.

In conclusion, we fine-tuned an instruction-following language model using Unsloth’s optimized training stack and a lightweight QLoRA setup. We demonstrated that by constraining sequence length, dataset size, and training steps, we can achieve stable training on Colab GPUs without runtime interruptions. The resulting LoRA adapters provide a practical, reusable artifact that we can deploy or extend further, making this workflow a robust foundation for future experimentation and advanced alignment techniques.

Check out the Full Codes here.
The post How to Build a Stable and Efficient QLoRA Fine-Tuning Pipeline Using Unsloth for Large Language Models appeared first on MarkTechPost.

Google Drops Gemini 3.1 Flash-Lite: A Cost-efficient Powerhouse with A …

Google has released Gemini 3.1 Flash-Lite, the most cost-efficient entry in the Gemini 3 model series. Designed for ‘intelligence at scale,’ this model is optimized for high-volume tasks where low latency and cost-per-token are the primary engineering constraints. It is currently available in Public Preview via the Gemini API (Google AI Studio) and Vertex AI.

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/?

Core Feature: Variable ‘Thinking Levels’

A significant architectural update in the 3.1 series is the introduction of Thinking Levels. This feature allows developers to programmatically adjust the model’s reasoning depth based on the specific complexity of a request.

By selecting between Minimal, Low, Medium, or High thinking levels, you can optimize the trade-off between latency and logical accuracy.

Minimal/Low: Ideal for high-throughput, low-latency tasks such as classification, basic sentiment analysis, or simple data extraction.

Medium/High: Utilizes Deep Think Mini logic to handle complex instruction-following, multi-step reasoning, and structured data generation.
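A minimal sketch of routing tasks to the thinking levels described above. The request-config shape and the lowercase level strings are assumptions for illustration, not the official SDK schema; the model identifier is the preview endpoint name mentioned later in this article.

```python
# Map task types to thinking levels per the guidance above (assumed casing).
LEVELS = {
    "classification": "minimal",
    "sentiment": "low",
    "data_extraction": "low",
    "instruction_following": "medium",
    "multi_step_reasoning": "high",
}

def thinking_config(task: str) -> dict:
    """Build an illustrative request config choosing a thinking level."""
    level = LEVELS.get(task, "medium")  # default to a middle ground
    return {"model": "gemini-3.1-flash-lite-preview", "thinking_level": level}

print(thinking_config("classification")["thinking_level"])  # minimal
```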


Performance and Efficiency Benchmarks

Gemini 3.1 Flash-Lite is designed to replace Gemini 2.5 Flash for production workloads that require faster inference without sacrificing output quality. The model achieves a 2.5x faster Time to First Token (TTFT) and a 45% increase in overall output speed compared to its predecessor.

On the GPQA Diamond benchmark—a measure of expert-level reasoning—Gemini 3.1 Flash-Lite scored 86.9%, matching or exceeding the quality of larger models in the previous generation while operating at a significantly lower computational cost.

Comparison Table: Gemini 3.1 Flash-Lite vs. Gemini 2.5 Flash

| Metric | Gemini 2.5 Flash | Gemini 3.1 Flash-Lite |
| --- | --- | --- |
| Input Cost (per 1M tokens) | Higher | $0.25 |
| Output Cost (per 1M tokens) | Higher | $1.50 |
| TTFT Speed | Baseline | 2.5x Faster |
| Output Throughput | Baseline | 45% Faster |
| Reasoning (GPQA Diamond) | Competitive | 86.9% |
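Using the preview prices above ($0.25 per 1M input tokens, $1.50 per 1M output tokens), a quick per-request cost check:

```python
INPUT_PER_M, OUTPUT_PER_M = 0.25, 1.50  # USD per 1M tokens, from the table above

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single request at preview pricing."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# Example: a 2,000-token prompt with a 500-token response.
print(f"${request_cost(2000, 500):.5f}")  # $0.00125
```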

Technical Use Cases for Production

The 3.1 Flash-Lite model is specifically tuned for workloads that involve complex structures and long-sequence logic:

UI and Dashboard Generation: The model is optimized for generating hierarchical code (HTML/CSS, React components) and structured JSON required to render complex data visualizations.

System Simulations: It maintains logical consistency over long contexts, making it suitable for creating environment simulations or agentic workflows that require state-tracking.

Synthetic Data Generation: Due to the low input cost ($0.25/1M tokens), it serves as an efficient engine for distilling knowledge from larger models like Gemini 3.1 Ultra into smaller, domain-specific datasets.

Key Takeaways

Superior Price-to-Performance Ratio: Gemini 3.1 Flash-Lite is the most cost-efficient model in the Gemini 3 series, priced at $0.25 per 1M input tokens and $1.50 per 1M output tokens. It outperforms Gemini 2.5 Flash with a 2.5x faster Time to First Token (TTFT) and 45% higher output speed.

Introduction of ‘Thinking Levels’: A new architectural feature allows developers to programmatically toggle between Minimal, Low, Medium, and High reasoning intensities. This provides granular control to balance latency against reasoning depth depending on the task’s complexity.

High Reasoning Benchmark: Despite its ‘Lite’ designation, the model maintains high-tier logic, scoring 86.9% on the GPQA Diamond benchmark. This makes it suitable for expert-level reasoning tasks that previously required larger, more expensive models.

Optimized for Structured Workloads: The model is specifically tuned for ‘intelligence at scale,’ excelling at generating complex UI/dashboards, creating system simulations, and maintaining logical consistency across long-sequence code generation.

Seamless API Integration: Currently available in Public Preview, the model uses the gemini-3.1-flash-lite-preview endpoint via the Gemini API and Vertex AI. It supports multimodal inputs (text, image, video) while maintaining a standard 128k context window.

Check out the Public Preview via the Gemini API (Google AI Studio) and Vertex AI.
The post Google Drops Gemini 3.1 Flash-Lite: A Cost-efficient Powerhouse with Adjustable Thinking Levels Designed for High-Scale Production AI appeared first on MarkTechPost.

Building a scalable virtual try-on solution using Amazon Nova on AWS: …

In this first post in a two-part series, we examine how retailers can implement a virtual try-on to improve customer experience. In part 2, we will further explore real-world applications and benefits of this innovative technology.
Every fourth piece of clothing bought online is returned to the retailer, feeding into America’s $890 billion returns problem in 2024. Behind these numbers lies a simple truth: shoppers can’t judge fit and style through their screens. Among the top reasons for returned fashion items are poor fit, wrong size, or style mismatch.
Retailers face a critical challenge in that their most valuable customers often return the most items, forcing them to maintain generous return policies despite steep processing costs and environmental impact. Each return produces 30% more carbon emissions than the initial delivery and represents a missed sales opportunity until items are processed back into inventory. As digital shopping accelerates, virtual try-on technology has emerged as a solution to reduce returns while maintaining customer convenience, but early implementations struggled with accuracy, scalability, and preserving crucial details such as garment draping, patterns, and logos.
Amazon Nova Canvas addresses these challenges through its virtual try-on capability, which uses two two-dimensional image inputs: a source image showing a person or living space and a reference image of the product. The system offers both automatic product placement through auto-masking functionality and manual controls for precise adjustments. Throughout the process, it carefully preserves important details such as logos and textures while providing comprehensive styling controls for customization.
Virtual try-on can be deployed across multiple customer engagement channels, from ecommerce websites and mobile shopping apps to in-store kiosks, social media shopping platforms, and virtual showrooms. Imagine visiting an ecommerce website, uploading your personal image, and seeing it applied across the clothing and accessory products on that website.
The following image shows a source image, a reference image, a mask image, and the resulting try-on image.

In this post, we explore the virtual try-on capability now available in Amazon Nova Canvas, including sample code to get started quickly and tips to help get the best outputs.
Solution overview
With virtual try-on capability, retailers and ecommerce companies can integrate garment and product visualization directly into their existing or new customer touchpoints. Using only a photo upload and product selection, customers can see how items would look on themselves, a model, or other placement. You can experiment with virtual try-on in Amazon Nova Canvas within the Amazon Bedrock playground, and we’ll guide you through implementing a complete solution around this feature in your own Amazon Web Services (AWS) environment. The following section provides detailed instructions and best practices for deployment.
At its core, the solution uses the new virtual try-on in Amazon Nova Canvas in Amazon Bedrock. This model offers fast inference speeds, making it suitable for real-time applications such as ecommerce. At the same time, it preserves high-fidelity details of reference items, including patterns, textures, and logos. The model maintains accurate semantic manipulations within scenes.
Our solution combines AWS serverless services with AI processing capabilities in an event-driven architecture. Amazon DynamoDB Streams triggers an AWS Step Functions workflow, and Amazon Simple Storage Service (Amazon S3) events manage result delivery. Amazon Nova Canvas in Amazon Bedrock manages both the mask generation and pose detection. The solution follows an asynchronous processing pipeline with real-time status updates, in which WebSocket connections maintain real-time communication with clients, enabling continuous user engagement throughout the process. For detailed implementation guidance and best practices, refer to our guidance.
Detailed explanation of the architecture
The request initiation follows this flow:

Amazon S3 stores the uploaded customer model photos and product images.
Each upload generates a message sent to an Amazon Simple Queue Service (Amazon SQS) queue. The AWS Lambda function creates the corresponding metadata and S3 path and stores it in the DynamoDB product table for later retrieval.
Amazon API Gateway manages the WebSocket connections for real-time status updates between the client and the virtual try-on.
Lambda processes initial requests by retrieving product information in the DynamoDB product table and creating job entries in DynamoDB.
The DynamoDB products table (vto-products) stores catalog items available for the virtual try-on, notably the Amazon S3 picture location.
The virtual try-on jobs DynamoDB table (vto-jobs) tracks the state of each try-on request.

The virtual try-on generation follows this flow:

DynamoDB Streams asynchronously triggers AWS Step Functions workflows on job creation for processing try-on requests.
AWS Step Functions orchestrates the virtual try-on generation. It triggers a Lambda function that calls the Amazon Nova Canvas model through Amazon Bedrock. The DynamoDB job table is updated with the virtual try-on status.

The result delivery follows this flow:

Amazon S3 stores the generated try-on images with job ID metadata.
Amazon SQS handles S3 event notifications for completed try-on images.
AWS Lambda function sends the Amazon S3 URL of the result back to the user through WebSocket.
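The result-delivery step above can be sketched as a Lambda-style handler. The message field names are illustrative, not the guidance's exact schema; the boto3 calls (`generate_presigned_url`, `post_to_connection`) are real APIs but require AWS credentials, so they are kept out of the testable payload helper.

```python
import json

def result_message(job_id: str, presigned_url: str) -> bytes:
    """Build the WebSocket payload delivering the try-on result
    (field names are illustrative)."""
    return json.dumps({
        "jobId": job_id,
        "status": "COMPLETED",
        "resultUrl": presigned_url,
    }).encode("utf-8")

def deliver_result(job_id, bucket, key, connection_id, ws_endpoint):
    # Requires AWS credentials; boto3 imported lazily so the payload
    # helper above stays runnable without AWS installed.
    import boto3
    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
    )
    apigw = boto3.client("apigatewaymanagementapi", endpoint_url=ws_endpoint)
    apigw.post_to_connection(ConnectionId=connection_id, Data=result_message(job_id, url))

msg = json.loads(result_message("job-123", "https://example.com/result.png"))
print(msg["status"])  # COMPLETED
```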

The following diagram illustrates the solution architecture.

Solution process
This section explains the end-to-end process of the solution. The solution guidance provides further details and information on how you can replicate the solution.
When your customer initiates a try-on request, they first sign in with Amazon Cognito and then upload their photo(s), which are stored in Amazon S3. A workflow is available to auto-populate the product table in DynamoDB through Amazon S3 events. The client establishes a WebSocket connection through API Gateway, creating a persistent channel for real-time updates. The client sends the ID of the product they want to virtually try as well as the S3 URL of the static model they want to use. A Lambda function processes this request by retrieving the product image URL from DynamoDB and creating a job entry with both image URLs, returning a unique job ID for tracking.
A DynamoDB stream then triggers a Step Functions workflow to coordinate the different writes and updates in the DynamoDB table. The workflow also invokes the Amazon Nova Canvas virtual try-on feature. The model takes as input (1) the source image, which is the base image you would like to modify (for example, the image of the customer), and (2) the reference image, which is an image containing the product(s) you want to insert into the base image. For garments, the reference image can contain garments on or off body and can even contain multiple products representing distinct outfit components (such as a shirt, pants, and shoes in a single image).
By default, a mask is computed automatically using auxiliary inputs (maskType: “GARMENT” or maskType: “PROMPT”). Alternatively, the mask image can be provided directly by the developer (maskType: “IMAGE”).
When a mask type of “GARMENT” is specified, Amazon Nova Canvas will create a garment-aware mask based on a garmentClass input parameter value you specify. In most cases, you will use one of the following high-level garment classes:

“UPPER_BODY” – Creates a mask that includes full arm length.
“LOWER_BODY” – Creates a mask that includes full leg length with no gap between the legs.
“FOOTWEAR” – Creates a mask that fits the shoe profile demonstrated in the source image.
“FULL_BODY” – Creates a mask equivalent to the combination of “UPPER_BODY” and “LOWER_BODY”.
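
As a sketch, a request using these parameters might be assembled as follows. The taskType and nested parameter names are assumptions based on the parameters named in this post (maskType, garmentClass); check the official Amazon Nova Canvas schema before relying on them:

```python
import json

def build_vto_request(source_image_b64: str, reference_image_b64: str,
                      mask_type: str = "GARMENT",
                      garment_class: str = "UPPER_BODY") -> str:
    """Assemble a Nova Canvas virtual try-on request body.

    Field names here are illustrative assumptions, not confirmed schema.
    """
    params = {
        "sourceImage": source_image_b64,        # base64 image to modify (customer photo)
        "referenceImage": reference_image_b64,  # base64 image containing the product(s)
        "maskType": mask_type,                  # "GARMENT", "PROMPT", or "IMAGE"
    }
    if mask_type == "GARMENT":
        params["garmentBasedMask"] = {"garmentClass": garment_class}
    return json.dumps({"taskType": "VIRTUAL_TRY_ON",
                       "virtualTryOnParams": params})

# A real call would pass this body to Amazon Bedrock, for example:
# bedrock = boto3.client("bedrock-runtime")
# bedrock.invoke_model(modelId="amazon.nova-canvas-v1:0",
#                      body=build_vto_request(src_b64, ref_b64, "GARMENT", "FOOTWEAR"))
```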

The following table shows example inputs with maskType: “GARMENT” (columns: source image, reference image, garment class such as “FOOTWEAR”, and output image).

The following table shows example inputs with maskType: “PROMPT” (columns: source image, mask prompt, reference image, and output image).

There are also more fine-grained subclasses that can be useful in certain edge cases. By using the “PROMPT” mask type, you can use natural language to describe the item in the source image that you want to replace. This is useful for images of items other than garments. This feature uses the same auto-masking functionality that exists in the Nova Canvas “INPAINTING” task using the maskPrompt parameter.
Using the mask to determine which garment areas need to be replaced, the model inserts the product image onto the user’s photo. The generated try-on image is stored in Amazon S3 with the job ID as metadata. Throughout this process, the system sends progress updates through the WebSocket connection. An Amazon S3 event notification triggers a Lambda function through Amazon SQS. The function generates a presigned URL for the result image and delivers it to the client through the established WebSocket connection. This completes the process, which typically takes 7–11 seconds.
Implementation details
This section details the tables and schemas used in our virtual try-on solution to help you understand the role each DynamoDB table plays.
This solution uses four DynamoDB tables. The products table (vto-products) stores the catalog of available items for virtual try-on. The virtual try-on jobs table (vto-jobs) maintains the state and tracking information for each try-on request. The models table (vto-models) stores the catalog of customer images used for virtual try-on. The WebSocket connections table (vto-connections) tracks active WebSocket connections for real-time job status updates. The solution assumes the products table is prepopulated with the retailer’s inventory.
The products table (vto-products) stores the catalog of available items for virtual try-on. Products are automatically populated when images are uploaded to the /products/ S3 folder. The schema for the products table is as follows:

product_id (string, partition key) – Unique identifier for the product
product_picture_s3_url (string) – Amazon S3 URL of the original product image
name (string) – Product display name
category (string) – Product category for organization
description (string) – Product details including style, color, and size options
auto_imported (Boolean) – Flag indicating if product was imported automatically through Amazon S3 upload
created_at (string) – ISO timestamp when product was added
updated_at (string) – ISO timestamp of last modification
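
As a sketch, the auto-import workflow might assemble an item matching this schema when an image lands in the /products/ folder. The attribute names come from the list above; the UUID-based ID and the helper itself are illustrative, not from the sample code:

```python
import uuid
from datetime import datetime, timezone

def build_product_item(s3_url: str, name: str, category: str,
                       description: str = "") -> dict:
    """Build a vto-products item for an image uploaded under /products/.

    Attribute names mirror the schema above; the ID scheme is assumed.
    """
    now = datetime.now(timezone.utc).isoformat()
    return {
        "product_id": str(uuid.uuid4()),
        "product_picture_s3_url": s3_url,
        "name": name,
        "category": category,
        "description": description,
        "auto_imported": True,   # set by the S3 event-driven import workflow
        "created_at": now,
        "updated_at": now,
    }
```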

The models table (vto-models) stores the catalog of customer images used for virtual try-on. Models are automatically populated when images are uploaded to the /models/ S3 folder. The schema for the models table is as follows:

model_id (string, partition key) – Unique identifier for the model
model_picture_s3_url (string) – Amazon S3 URL of the model image
name (string) – Model display name
category (string) – Model category for organization
description (string) – Model details and characteristics
auto_imported (Boolean) – Flag indicating if model was imported automatically using Amazon S3 upload
created_at (string) – ISO timestamp when model was added
updated_at (string) – ISO timestamp of last modification

The virtual try-on jobs table (vto-jobs) maintains state and tracking information for each try-on request throughout the processing workflow. The schema for the virtual try-on jobs table is as follows:

id (string, partition key) – Unique identifier for each try-on job
model_id (string) – Reference to the model used
product_id (string) – Reference to the product being tried on
model_picture_s3_url (string) – Amazon S3 URL of the customer’s uploaded photo
product_picture_s3_url (string) – Amazon S3 URL of the product being tried on
result_s3_url (string) – Amazon S3 URL of the generated virtual try-on result image
status (string) – Current job status (created, processing, completed, or error)
parameters (map) – Virtual try-on API parameters (such as maskType, mergeStyle, or garmentClass)
connection_id (string) – WebSocket connection ID for real-time updates
error_message (string) – Error details if job fails
created_at (string) – ISO timestamp when job was created
updated_at (string) – ISO timestamp of last status update

The WebSocket connections table (vto-connections) tracks active WebSocket connections for real-time job status updates. For more information about using a WebSocket API, see the Create a WebSocket chat app with a WebSocket API, Lambda, and DynamoDB tutorial. The schema is as follows:

connection_id (string, partition key) – WebSocket connection identifier
connected_at (string) – ISO timestamp when connection was established
ttl (number) – Time-to-live for automatic cleanup of stale connections
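
For example, the connection-tracking Lambda could write an item like the following. DynamoDB’s TTL feature expects the ttl attribute as epoch seconds; the one-hour lifetime here is an assumed value, not taken from the sample:

```python
import time

def build_connection_item(connection_id: str, ttl_seconds: int = 3600) -> dict:
    """Build a vto-connections item. DynamoDB TTL expects an
    epoch-seconds number; the one-hour window is illustrative."""
    now = int(time.time())
    return {
        "connection_id": connection_id,
        "connected_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now)),
        "ttl": now + ttl_seconds,  # stale connections expire automatically
    }
```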

Conclusion
In this post, we covered how to implement virtual try-on at scale, including the main building blocks. For a quick start, we provide a complete GitHub sample with prerequisites, deployment scripts, and example code, along with a comprehensive solution guidance document covering best practices and configuration details. Use this guide to start experimenting with the solution right away.
As ecommerce continues to grow, reducing return rates while maintaining customer satisfaction becomes increasingly critical for retailers’ profitability and sustainability. This virtual try-on solution demonstrates how AWS serverless services can be combined with generative AI to address a significant challenge. By using Amazon Nova Canvas alongside a robust serverless architecture, retailers can provide customers with accurate product visualization and pose conservation while maintaining the seamless shopping experience their most loyal customers expect. Implementation considerations extend beyond the technical architecture. Successful deployment requires careful attention to service quotas, monitoring, and cost optimization. Our solution guidance provides further detailed recommendations for managing WebSocket connections, implementing retry strategies, and optimizing resource utilization. These operational aspects are crucial for maintaining reliable performance during peak shopping periods while managing costs effectively.

About the authors

Amandine Annoye
Amandine Annoye is a Solutions Architect at AWS who works with Luxury & Fashion customers in France to help them drive business value. Amandine enjoys translating customers’ business needs into concrete and effective technical solutions. Outside of work, she enjoys travelling and painting.

Kevin Polossat
Kevin Polossat is a Solutions Architect at AWS. He works with retail & CPG customers in France to help them create value through cloud adoption. Outside of work, he enjoys wine and cheese.

Leopold Cheval
Leopold Cheval is a Solutions Architect at AWS based in Paris, working with Media & Entertainment and Retail customers on their cloud journey. He focuses on modern applications, AI/ML, and Generative AI technologies. Outside of work, Leopold enjoys traveling and camping.

Rania Khemiri
Rania Khemiri is a Prototyping Architect at AWS. She focuses on agentic workflows and Generative AI applications, helping teams accelerate experimentation and adoption of AI technologies on AWS. Through hands-on prototyping, she empowers customers to transform ideas into functional proofs of concept and gain the skills to scale them into production.

How Lendi revamped the refinance journey for its customers using agent …

This post was co-written with Davesh Maheshwari from Lendi Group and Samuel Casey from Mantel Group.
Most Australians don’t know whether their home loan is still competitive. Rates shift, property values move, personal circumstances change—yet for the average homeowner, staying informed of these changes is difficult. It’s often their largest financial commitment, but it’s also the one they’re least equipped to monitor. And when they do decide to refinance, the process itself demands significant manual effort.
Lendi Group, one of Australia’s fastest growing FinTech companies, recognized this gap and set out to transform the home loan experience through innovative technology. By using the generative AI capabilities of Amazon Bedrock, Lendi Group has developed Guardian, an agentic AI-powered application that serves as an around-the-clock companion for homeowners, monitoring their loans, providing personalized insights, and simplifying the mortgage refinance process.
This post details how Lendi Group built their AI-powered Home Loan Guardian using Amazon Bedrock, the challenges they faced, the architecture they implemented, and the significant business outcomes they’ve achieved. Their journey offers valuable insights for organizations that want to use generative AI to transform customer experiences while maintaining the human touch that builds trust and loyalty.
Challenges
Lendi Group identified several persistent challenges in the home loan journey that affected both customers and brokers:

Customers struggled with limited visibility into their mortgage position. Most homeowners lacked real-time insights into whether their current rate remained competitive, how their equity position changed as property values fluctuated, or how their overall financial health impacted their mortgage options. This information gap often led to customers missing opportunities to save money or utilize their home equity effectively.
The refinancing process was cumbersome and time-consuming. Even when customers identified better rates, the paperwork and administrative burden of refinancing deterred many from acting.
Brokers spent significant time on administrative tasks rather than focusing on high-value client interactions. Post-call documentation, routine inquiries, and after-hours support diverted broker attention from complex client needs that required human expertise and empathy.
Lendi Group faced the challenge of scaling personalized service across their extensive customer base. While their digital solution provided convenience, maintaining the human touch that builds trust in financial relationships proved difficult at scale, especially outside business hours.

These challenges led Lendi Group to explore how AI could transform the mortgage experience. Rather than viewing AI as merely an efficiency tool, Lendi envisioned a reinvention of the home loan journey—one where technology could anticipate customer needs, provide around-the-clock personalized guidance, and free human experts to focus on building meaningful relationships.
Solution overview
Lendi’s Guardian represents a fundamental shift in how customers interact with their home loans. At its core, Guardian is designed to:

Monitor loan competitiveness by continuously scanning thousands of home loans daily and alerting customers when better deals become available
Track equity position in real time as property values and industry conditions change, giving customers visibility into their current financial standing
Streamline the refinancing process with journeys that adapt to the customer’s circumstances and auto-populate forms based on internal and external data sources, removing friction points that previously deterred customers from taking action
Deliver personalized insights and recommendations based on each customer’s unique financial situation and goals

Lendi used Amazon Bedrock to accelerate the build of their agentic solution within 16 weeks.
The solution is built upon Amazon Bedrock foundation models and Amazon Bedrock Guardrails. Lendi chose Amazon Elastic Kubernetes Service (Amazon EKS) to deploy their AI agents at scale, facilitating the necessary infrastructure to meet consumer demand. By using the wide range of foundation models (FMs) available on Amazon Bedrock, Lendi was able to select task-appropriate models optimized for specific use cases.
A critical component of their solution is AI guardrails powered by Amazon Bedrock Guardrails, which help make sure that the customer communications remain aligned with regulatory requirements. Additionally, Lendi developed Model Context Protocol (MCP) servers to enable AI agents to access institutional knowledge and interact with external services seamlessly.
The key components of the solution are as follows:

UI layer – Customers interact with Guardian through an intuitive chat-led interface integrated directly into their Lendi dashboard, providing seamless access to AI-powered mortgage insights and recommendations.
API layer – A RESTful API in Amazon API Gateway serves as the communication bridge between frontend applications and backend AI agents, handling request routing, authentication, and rate limiting to help maintain secure and reliable interactions.
Compute layer – Amazon EKS hosts and orchestrates the AI agents, providing auto-scaling capabilities to efficiently handle varying customer demand while maintaining consistent performance and availability.
Intelligence layer – The core AI capabilities are powered by multiple specialized agents built on Amazon Bedrock foundation models. Lendi used Agno, an open-source agentic framework, to develop these agents, with MCP servers providing integrations to internal systems, external data sources, and third-party services. Amazon Bedrock Guardrails helps enforce compliance boundaries, verifying that customer interactions adhere to Lendi’s communication guidelines and remain focused on relevant mortgage-related topics.
Observability layer – Langfuse captures comprehensive agent traces, including inputs, outputs, reasoning chains, and performance metrics, providing full visibility into agent behavior and enabling continuous optimization and debugging. Amazon CloudWatch Logs is used to collect system-level logs.
Storage layer – MongoDB serves as the persistent data store for user context, conversation history, and session state, enabling customers to resume conversations across sessions while providing agents with the customer-specific context needed for personalized recommendations. Amazon S3 is used to store documents and files provided by the customer.

The following diagram illustrates the solution architecture.

This architecture pattern provides a robust and scalable system to deploy AI agents.
Agent flow for mortgage refinance
Building upon this scalable architecture, Lendi designed a multi-agent orchestration system where specialized agents can collaborate to complete the mortgage refinance journey. This modular approach helps provide several key advantages: clear separation of concerns between agents, simplified development and maintenance of individual agent capabilities, faster response times through task-specific optimization, and straightforward troubleshooting when issues arise.
The mortgage refinance process flows through the following specialized agents, with seamless handovers preserving full context at each transition:

Mortgage Broker Associate Agent (initial engagement) – This agent serves as the customer’s first point of contact, embodying a friendly, professional persona similar to a human mortgage broker. Its primary goal is to understand the customer’s current situation and assess their interest in refinancing.
Customer Information Collection Agent (data gathering) – When a customer expresses interest in refinancing, this specialized agent systematically collects essential customer details including current loan information, employment status, income, and refinancing preferences. The agent uses conversational techniques to make data collection feel natural rather than interrogative and provides clarifications to the customer as required. The agent is context aware and asks for information not already provided by the customer.
Product Recommendation Agent (lender matching) – With complete customer information in hand, this agent analyzes the customer’s profile against Lendi’s extensive database of lenders and products. It presents suitable options with clear explanations of benefits and potential savings.
Product-Specific Information Collection Agent (application preparation) – After the customer selects their preferred product, this agent gathers the additional information required by that specific lender. Different lenders have varying requirements, and this agent adapts its questions accordingly.
Communication Agent (Linda) – Linda is the off-system engagement and re-engagement agent that keeps customers connected to their refinance journey, even when they’re not actively using the Guardian system. Although the specialized agents manage in-system tasks from initial engagement to product selection and application preparation, Linda operates across channels such as SMS, email, WhatsApp, and push to bring customers back in at the right moment. She detects when progress has stalled, surfaces timely reminders or new opportunities, and reinvites customers to continue where they left off. Drawing on live data from the Aurora Digital Twin, Linda tailors messages to the customer’s specific context, tone, and goal, whether it’s encouraging them to reconnect their loan, review matched products, or complete their submission. In essence, Linda is the voice of Guardian beyond the app, helping keep customers informed, motivated, and moving forward throughout the refinance journey.
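
The context-preserving handovers between these stages can be illustrated with a minimal sketch. This is a generic pattern, not Lendi’s actual implementation; the agent names are taken from the stages above:

```python
def handover(context: dict, next_agent: str, collected: dict) -> dict:
    """Pass full conversation context to the next specialized agent,
    preserving everything gathered so far (generic sketch)."""
    context = dict(context)                      # work on a copy
    context.setdefault("history", []).append(context.get("current_agent"))
    context.update(collected)                    # merge newly collected data
    context["current_agent"] = next_agent
    return context

# Example flow mirroring the stages above:
ctx = {"current_agent": "MortgageBrokerAssociate"}
ctx = handover(ctx, "CustomerInformationCollection", {"interested": True})
ctx = handover(ctx, "ProductRecommendation", {"income": 120000})
```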

The following graphic illustrates this workflow.

This agentic approach simplified the mortgage application process for customers by providing an intuitive, natural language interface to share information, ask clarifying questions, and receive guidance throughout their refinance journey. For brokers, it alleviated the burden of manual form filling and application submission, freeing them to focus their expertise on complex customer scenarios, relationship building, and providing strategic financial advice where human judgment and empathy are most valuable.
Business outcomes and customer impact
Lendi’s Guardian application is already delivering measurable results, having settled millions in home loans with refinance cycle times considerably faster than Lendi Group’s baseline. Guardian extends this impact with its AI-powered Rate Radar, which scans thousands of home loans daily and enables refinancing in only 10 minutes, with no paperwork, no phone calls, only a single tap. By automating routine monitoring and alerting customers to better rates in real time, brokers can focus on negotiation, empathy, and complex structuring—the high-value, relationship-driven work that builds loyalty. Guardian launched in only 16 weeks following a more than 30,000-hour cross-functional sprint, demonstrating how an AI-first architecture accelerates both development velocity and customer outcomes.
Lessons learned
Lendi Group’s 16-week journey to build and deploy the AI-powered Home Loan Guardian provided invaluable insights into implementing agentic AI at scale in a regulated financial services environment. Here are the critical lessons they learned:

Prioritize early, iterative evaluation metrics to guide AI development systematically. Rely on data-driven metrics to make key decisions such as model choice. Use Amazon Bedrock prompt management for versioning prompts.
Choose models strategically by using the diverse model options of Amazon Bedrock. Recognize that the most sophisticated model isn’t always the most effective solution for your specific use case. Equally important is incorporating domain knowledge from human experts into your prompts because this contextual expertise often determines success more than model selection alone.
Take advantage of using Amazon Bedrock batch inference on tasks that don’t require immediate results to reduce cost.
Treat AI as a transformative technology requiring bold vision and rapid, strategic implementation. Use the generative AI capabilities of Amazon Bedrock and the scalable cloud infrastructure of AWS to accelerate AI-driven innovation.
Prioritize responsible AI governance in regulated environments. Use Amazon Bedrock Guardrails to help enforce content policies, filter inappropriate responses, and maintain compliance alignment requirements throughout the AI lifecycle.
Balance automation with human expertise. Design AI systems that augment—rather than replace—human judgment, maintaining a customer-centric approach where human oversight remains central to critical decisions.

Future roadmap
Lendi Group’s implementation of the AI-powered Home Loan Guardian represents only the first step in their ambitious journey to become a fully AI-based organization by June 2026. With the foundation now in place, Lendi Group aims to use agentic AI to rethink the whole mortgage and finance journey.
To support this strategic initiative, Lendi is exploring new AWS services, including Amazon Bedrock AgentCore, which enables the deployment of agents in a scalable and secure manner without the overhead of infrastructure management. This approach will further help accelerate Lendi’s pace of innovation.
“We’ve built our platform so that refinancing happens at the speed of life, not at the speed of paperwork,” says Devesh Maheshwari – CTO at Lendi. “A customer can receive a Rate Radar alert about a sharper rate or a shift in property value during their morning commute. They tap to engage with it and provide information to our Agentic platform “Guardian” and by the time they’re heading home, their refinance loan application can be lodged. That’s not magic. It’s what happens when you invest properly in intelligent automation, real-time decisioning APIs and a micro-services architecture that coordinates everything from document verification through to settlement, without manual handoffs. The real challenge wasn’t just speed. It was removing every point of friction while still meeting the highest standards of compliance and risk control. When your infrastructure can support life-changing financial decisions in minutes rather than weeks, you’re not just improving the experience. You’re resetting what customers expect from financial services.”
Conclusion
Lendi Group’s AI-powered Home Loan Guardian represents a significant leap forward in how Australians manage their home loans. By using the generative AI capabilities of Amazon Bedrock, Lendi has created a solution that helps transform the mortgage experience from a periodic, transaction-based interaction to an ongoing, proactive relationship that delivers continuous value to customers. Looking ahead, Lendi’s journey to become a fully AI-based organization by June 2026 positions them at the forefront of innovation in the Australian mortgage industry. Their vision of AI integrated into “every workflow, every decision, every customer experience, and every broker experience” presents a fundamental reimagining of how mortgage services can be delivered.

About the authors
Deepak Dalakoti, PhD, is a Deep Learning Architect at the Generative AI Innovation Centre in Sydney, Australia. With expertise in AI, he partners with clients to accelerate their generative AI adoption through customized, innovative solutions. Outside the world of AI, he enjoys exploring new activities and experiences.
James Hardman is a Senior Account Manager at AWS, partnering with Australia’s fintech and financial services organisations to navigate complex technology challenges. He works backwards from what matters most to his customers, connecting them with the right investment, tools, and specialist teams to help them move faster. James is particularly focused on helping customers explore emerging technologies like agentic AI – not for the sake of innovation, but to drive real business outcomes and better serve their end customers.
Igor Londero Gentil is a Solutions Architect at AWS, based in Sydney, helping customers design and build on the cloud with a focus on serverless and event-driven architectures. With a background spanning infrastructure engineering, cloud architecture, and AI, he brings a practitioner’s perspective to solving real-world problems — grounded in years of hands-on experience before joining AWS. Igor is a regular speaker on topics like event-driven architectures and AWS Lambda, and an active open-source contributor.
Devesh Maheshwari is the Chief Technology Officer at Lendi Group Services in Sydney, Australia, where he’s driving the company’s transition to an AI-native business. With more than 18 years of experience leading technology strategy, digital transformation, and engineering teams, Dev has a strong track record in fintech and highly regulated sectors, shaping platforms that scale and deliver real business value. Before Lendi, he held senior leadership positions at DataMesh, Tyro Payments, Tabcorp, and ThoughtWorks. He’s also a trusted advisor and mentor in tech, and he’s shared his insights on AI and innovation at industry events.
Samuel Casey began his career in the startup ecosystem as the co-founder of a specialised AI consultancy. After successfully spinning out a proprietary AI product and overseeing its acquisition by Mantel Group, Samuel joined Mantel four years ago to lead high-stakes digital transformations. As an AI partner in Mantel, he has spearheaded a variety of complex projects for a broad range of enterprise and government clients. Most recently, Samuel has been at the forefront of the Generative/Agentic AI movement, dedicated to helping organisations integrate AI Solutions into their core operations as these technologies have materialised in the global market.

How Tines enhances security analysis with Amazon Quick Suite

Organizations face challenges in quickly detecting and responding to user account security events, such as repeated login attempts from unusual locations. Although security data exists across multiple applications, manually correlating information and making corrective actions often delays effective response. With Amazon Quick Suite and Tines, you can automate the investigation and remediation process by integrating data from multiple security tools, and providing visual insights for faster decision-making.
Quick Suite is a digital workspace that provides business users agentic AI capabilities to quickly answer questions and turn insights into actions. Quick Suite brings AI-powered research, business intelligence (BI), and automation into a single application. You can build automated workflows where multiple AI assistants work together, using your company data and the internet to answer business questions faster and more accurately. Users connect additional applications to Quick Suite using built-in integrations and the Model Context Protocol (MCP), a protocol that standardizes how AI assistants communicate with external tools. Tines is an intelligent workflow platform with a built-in MCP Server Builder. An MCP server is a program that exposes an application’s capabilities through a standard protocol so AI assistants can call them as tools. In Tines, you define MCP tools that read from or write to your internal or third-party applications, and Quick Suite can query those tools directly. With full audit trails in Tines, customers maintain visibility and governance across every workflow. This pattern enables Quick Suite users to bring proprietary or siloed data into their AI-driven analysis workflows without deploying new infrastructure or writing custom integration code.
In this post, we show you how to connect Quick Suite with Tines to securely retrieve, analyze, and visualize enterprise data from any security or IT system. We walk through an example that uses an MCP server in Tines to retrieve data from various tools, such as AWS CloudTrail, Okta, and VirusTotal, to remediate security events using Quick Suite.
Use case: Orchestrated security investigation and remediation
As a member of a security team, you stay ahead of security events with regular account security data review. This involves triaging information from multiple sources to determine if there are indicators that signal the need to dive more deeply into the data. With Quick Suite and Tines, you can investigate and remediate security events using natural language. This integrated approach leads to faster decision-making, without requiring custom scripts or manual correlation across multiple security applications.
Once connected to Quick Suite as well as your security and IT tools, Tines can:

Analyze internet protocol (IP) addresses in VirusTotal to assess event risk
Retrieve account details from Okta and BambooHR
Review authentication logs and user activity in CloudTrail
Flag suspicious IP addresses and, after analyst approval, block them in CrowdStrike

In Quick Suite, you can then visualize this data to gain immediate insights such as:

Geographic mapping of login attempts with risk scoring
Timeline of user activity before and after suspicious logins
Correlation between accounts and affected systems
Remediation status tracking for security events

This enables you to ask natural language questions, for example:

Show all login attempts from high-risk countries in the last 24 hours
Display user activity timeline
List all systems the user accessed
Generate a report of remediation actions taken for the security event

Explore additional use cases in the Tines story library.
Solution overview
Tines can help you integrate with services that expose an API, automate retrieval or transformation of that data, and provide the resulting workflow as an MCP server. The MCP client in Quick Suite can connect directly to the Tines MCP server and access the tools defined within the server.
This pattern provides the following benefits:

A simple, governed integration layer between Quick Suite and internal or external tools
The ability to connect systems that don’t currently have an MCP server
A straightforward and powerful way to create new MCP tools for custom data sources without custom engineering or development work
Consistent, secure connectivity without maintaining custom scripts or servers

For Quick Suite customers, the result is faster insight and less manual effort, with built-in control over how Quick Suite connects to enterprise data sources.
The workflow consists of four components:

Quick Suite – Connects to the Tines MCP server using the Quick Suite MCP client, retrieves the data, and enables analysis through chat and dashboards
Tines MCP Server – A published endpoint that exposes the workflow as an MCP tool
Security or IT API – Any REST API that returns network, endpoint, asset, or configuration data
Tines workflow – A sequence of actions that retrieves, normalizes, or enriches this data
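
As an illustration of the second component, a Tines-defined tool is surfaced to MCP clients such as Quick Suite as a tool descriptor. The name/description/inputSchema shape follows the Model Context Protocol tool-listing format; the specific IP-reputation tool below is hypothetical:

```python
# Hypothetical MCP tool descriptor for the VirusTotal IP-lookup step
# described earlier; the descriptor shape follows the MCP tool-listing
# format (name, description, JSON Schema inputSchema).
tool = {
    "name": "lookup_ip_reputation",
    "description": "Check an IP address against VirusTotal and return a risk score.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "ip": {"type": "string", "description": "IPv4 or IPv6 address to assess"}
        },
        "required": ["ip"],
    },
}

def validate_tool(t: dict) -> bool:
    """Minimal sanity check an MCP client might run on a listed tool."""
    return (isinstance(t.get("name"), str)
            and isinstance(t.get("description"), str)
            and t.get("inputSchema", {}).get("type") == "object")
```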

The following diagram illustrates this architecture.

Prerequisites
To deploy this solution, you must have the following:

A Quick Suite account within your AWS account with a Professional subscription and an Author or higher user role. Refer to Model Context Protocol (MCP) integration for more information.
A Tines tenant. All plans, including the free Community Edition, support creating MCP servers.
API credentials for the chosen security or IT system.

Create MCP server in Tines
You can import an MCP server from the Tines story library into your Tines tenant. Alternatively, complete the following steps to create a custom MCP server in Tines:

Create a new Story.

Open the Templates browser and search for MCP.

Drag the MCP action to the storyboard.

Choose MCP Server in the right pane and note the MCP server URL to connect Quick Suite.

Add as many tools as required for your workflow from the list of templates, or configure your own custom tools.

Connect the tools with your account in the associated applications using standard authentication methods (such as API key or OAuth).

The following screenshot shows a custom MCP server example for user account security analysis and remediation.

Connect Quick Suite to Tines MCP server
Complete the following steps to connect Quick Suite to the Tines MCP server:

On the Quick Suite console, choose Integrations under Connections in the navigation pane.
Choose the Actions tab under Existing integrations.
Choose the plus sign next to Model Context Protocol.

On the Create integration page, enter a name and description for your Tines integration.
For MCP server endpoint, enter the MCP server URL from your MCP server in your Tines story, then choose Next.

On the next page, configure the authentication settings and choose Create and continue to see the tools from your Tines MCP server.
Choose Next to complete the connection.

Query and visualize data in Quick Suite
After you’re connected, you can use the Quick Suite chat assistant to retrieve and explore data in real time, generate visual dashboards and charts from the returned results, and combine this data with existing AWS datasets for broader analysis. Quick Suite automatically selects and retrieves data from your Tines integration based on the content of the chat messages. This gives you a simple and scalable way to operationalize security and IT data using the BI and AI capabilities in Quick Suite. The following screenshot shows a sample security query.

The following screenshot shows the query result, including a security event timeline graph.

Clean up
To avoid incurring ongoing charges, clean up the resources you created as part of this solution.
Conclusion
Connecting Quick Suite and Tines using MCP transforms how organizations analyze their security and IT data. This solution reduces the need for custom integration code and provides centralized governance of integrations, standardized data retrieval, and improved operational visibility. Security and IT teams can extend their analytics capabilities to any API-enabled system through a single, auditable layer that scales across their tooling landscape.
Get Started with Quick Suite to create a Quick Suite instance in your AWS account and visit the Tines home page to sign up for a Tines Community Edition account. Once you have access, you can create your first MCP server and connect your existing security and IT tools using the Tines prebuilt templates. Finally, configure Quick Suite to access your new data sources and start analyzing data through natural language queries.
For more details, refer to the Amazon Quick Suite User Guide and Tines MCP server documentation.

About the Authors

Yannick Gloster
Yannick Gloster is a Software Engineer based in Dublin, Ireland, originally from Santa Barbara, California. He works on AI features and infrastructure at Tines, building Workbench, AI agents, and scalable AI infrastructure for the platform powering the world’s most important workflows. Yannick has a master’s degree in computer science from Trinity College Dublin, Ireland. In his spare time, he enjoys sailing, playing Counter-Strike and Deadlock, and watching Formula 1.

Jonah Craig
Jonah Craig is a Startup Solutions Architect based in Dublin, Ireland. He works with startup customers across the UK and Ireland and focuses on developing AI/ML and generative AI solutions. Jonah has a master’s degree in computer science and regularly speaks on stage at AWS conferences, such as the annual AWS London Summit and the AWS Dublin Cloud Day. In his spare time, he enjoys creating music and releasing it on Spotify.

Ashok Mahajan
Ashok Mahajan is a Senior Solutions Architect at Amazon Web Services. Based in the NYC metropolitan area, Ashok is part of the Global Startup team, where he focuses on security startups and helps them design and develop secure, scalable, and innovative solutions and architectures using the breadth and depth of AWS services to deliver measurable business outcomes.

Bobby Williams
Bobby Williams is a Senior Solutions Architect at AWS. He has decades of experience designing, building, and supporting enterprise software solutions that scale globally. He works on solutions across industry verticals and horizontals and is driven to create a delightful experience for every customer.

Meet NullClaw: The 678 KB Zig AI Agent Framework Running on 1 MB RAM and Booting in Two Milliseconds

In the current AI landscape, agentic frameworks typically rely on high-level managed languages like Python or Go. While these ecosystems offer extensive libraries, they introduce significant overhead through runtimes, virtual machines, and garbage collectors. NullClaw is a project that diverges from this trend, implementing a full-stack AI agent framework entirely in raw Zig.

By eliminating the runtime layer, NullClaw achieves a compiled binary size of 678 KB and operates with approximately 1 MB of RAM. For developers working in resource-constrained or edge computing environments, these metrics represent a shift in how AI orchestration can be deployed.

Performance Benchmarks and Resource Allocation

The primary distinction between NullClaw and existing frameworks lies in its resource footprint. Standard agent implementations often require significant hardware overhead to maintain the underlying language environment:

Local machine benchmark (macOS arm64, Feb 2026), normalized for 0.8 GHz edge hardware.

| Metric | OpenClaw | NanoBot | PicoClaw | ZeroClaw | NullClaw |
|---|---|---|---|---|---|
| Language | TypeScript | Python | Go | Rust | Zig |
| RAM | > 1 GB | > 100 MB | < 10 MB | < 5 MB | ~1 MB |
| Startup (0.8 GHz) | > 500 s | > 30 s | < 1 s | < 10 ms | < 8 ms |
| Binary Size | ~28 MB (dist) | N/A (Scripts) | ~8 MB | 3.4 MB | 678 KB |
| Tests | — | — | — | 1,017 | 3,230+ |
| Source Files | ~400+ | — | — | ~120 | ~110 |
| Cost | Mac Mini $599 | Linux SBC ~$50 | Linux Board $10 | Any $10 hardware | Any $5 hardware |

NullClaw’s ability to boot in under 2 milliseconds is a direct result of its lack of a virtual machine or interpreter. It compiles directly to machine code with zero dependencies beyond libc, ensuring that CPU cycles are dedicated entirely to logic rather than runtime management.

Architectural Design: The Vtable Interface Pattern

The most critical aspect of NullClaw is its modularity. Despite its small size, the system is not hard-coded for specific vendors. Every major subsystem—including providers, channels, tools, and memory backends—is implemented as a vtable interface.

A vtable (virtual method table) allows for dynamic dispatch at runtime. In NullClaw, this enables users to swap components via configuration changes without modifying or recompiling the source code. This architecture supports:

22+ AI Providers: Integration for OpenAI, Anthropic, Ollama, DeepSeek, Groq, and others.

13 Communication Channels: Native support for Telegram, Discord, Slack, WhatsApp, iMessage, and IRC.

18+ Built-in Tools: Executable functions for agentic task completion.

This modularity ensures that the core engine remains lightweight while remaining extensible for complex ‘subagent’ workflows and MCP (Model Context Protocol) integration.
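NullClaw's dispatch tables are written in Zig, but the pattern itself is language-agnostic. The following Python sketch is our illustration, not NullClaw code: a "vtable" is modeled as a record of function pointers sharing one shape, and a registry lets a configuration value select the implementation at runtime with no code changes. The provider functions are stand-ins for real API calls.

```python
from dataclasses import dataclass
from typing import Callable

# A "vtable" here is simply a record of functions with a shared signature.
@dataclass
class ProviderVtable:
    name: str
    complete: Callable[[str], str]  # prompt -> completion

# Hypothetical provider implementations (stand-ins for real API calls).
def openai_complete(prompt: str) -> str:
    return f"[openai] {prompt}"

def ollama_complete(prompt: str) -> str:
    return f"[ollama] {prompt}"

REGISTRY = {
    "openai": ProviderVtable("openai", openai_complete),
    "ollama": ProviderVtable("ollama", ollama_complete),
}

def run_agent(config: dict, prompt: str) -> str:
    # Swapping providers is a config change, not a code change.
    provider = REGISTRY[config["provider"]]
    return provider.complete(prompt)

print(run_agent({"provider": "ollama"}, "hello"))  # prints "[ollama] hello"
```

Changing `{"provider": "ollama"}` to `{"provider": "openai"}` re-routes every call through a different backend without touching or recompiling the dispatch logic, which is the property the vtable design gives NullClaw.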

Memory Management and Security

NullClaw manages memory manually, a core feature of the Zig programming language. To maintain a 1 MB RAM footprint while handling complex data, it utilizes a hybrid vector + keyword memory search. This allows the agent to perform retrieval-augmented generation (RAG) tasks without the overhead of an external, heavy vector database.
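The article doesn't publish NullClaw's retrieval code, but a hybrid vector-plus-keyword search can be sketched as a weighted blend of cosine similarity and keyword overlap. Everything below — the weight `alpha`, the toy documents, and their tiny embedding vectors — is illustrative, not taken from NullClaw.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    # Fraction of query terms that appear in the document.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_search(query, query_vec, docs, alpha=0.6):
    """Rank docs by alpha * vector similarity + (1 - alpha) * keyword overlap."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in docs
    ]
    return [text for score, text in sorted(scored, reverse=True)]

docs = [
    ("reset user password", [0.9, 0.1]),
    ("rotate api keys", [0.2, 0.8]),
]
print(hybrid_search("password reset", [0.85, 0.2], docs))  # "reset user password" ranks first
```

Blending the two signals lets exact term matches rescue queries where the embedding is ambiguous, and vice versa, which is how a small agent can do RAG-style retrieval without an external vector database.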

Security is integrated into the low-level design rather than added as an external layer:

Encryption: API keys are encrypted by default using ChaCha20-Poly1305, an AEAD (Authenticated Encryption with Associated Data) algorithm known for high performance on mobile and embedded CPUs.

Execution Sandboxing: When agents utilize tools or execute code, NullClaw supports multi-layer sandboxing through Landlock (a Linux security module), Firejail, and Docker.

Hardware Peripheral Support

Because NullClaw is written in Zig and lacks a heavy runtime, it is uniquely suited for hardware interaction. It provides native support for hardware peripherals across various platforms, including Arduino, Raspberry Pi, and STM32. This enables the deployment of autonomous AI agents directly onto microcontrollers, allowing them to interact with physical sensors and actuators in real time.

Engineering Reliability

A common concern with manual memory management and low-level implementations is system stability. NullClaw addresses this through rigorous validation:

Test Suite: The codebase includes 2,738 tests to ensure logic consistency and memory safety.

Codebase Volume: The framework comprises approximately 45,000 lines of Zig.

Licensing: It is released under the MIT License, allowing broad commercial and private use.

Key Takeaways

Extreme Resource Efficiency: By using raw Zig and eliminating runtimes (No Python, No JVM, No Go), NullClaw reduces RAM requirements to ~1 MB and binary size to 678 KB. This is a 99% reduction in resources compared to standard managed-language agents.

Near-Instant Cold Starts: The removal of a virtual machine or interpreter allows the system to boot in under 2 milliseconds. This makes it ideal for event-driven architectures or serverless functions where latency is critical.

Modular ‘Vtable’ Architecture: Every subsystem (AI providers, chat channels, memory backends) is a vtable interface. This allows developers to swap providers like OpenAI for local DeepSeek or Groq via simple config changes with zero code modifications.

Embedded and IoT Ready: Unlike traditional frameworks requiring a PC or expensive Mac Mini, NullClaw provides native support for Arduino, Raspberry Pi, and STM32. It allows a full agent stack to run on a $5 board.

Security-First Design: Despite its small footprint, it includes high-level security features: default ChaCha20-Poly1305 encryption for API keys and multi-layer sandboxing using Landlock, Firejail, and Docker to contain agent-executed code.

The post Meet NullClaw: The 678 KB Zig AI Agent Framework Running on 1 MB RAM and Booting in Two Milliseconds appeared first on MarkTechPost.