Fine-tune VLMs for multipage document-to-JSON with SageMaker AI and SWIFT

Extracting structured data from documents like invoices, receipts, and forms is a persistent business challenge. Variations in format, layout, language, and vendor make standardization difficult, and manual data entry is slow, error-prone, and unscalable. Traditional optical character recognition (OCR) and rule-based systems often fall short in handling this complexity. For instance, a regional bank might need to process thousands of disparate documents—loan applications, tax returns, pay stubs, and IDs—where manual methods create bottlenecks and increase the risk of error. Intelligent document processing (IDP) aims to solve these challenges by using AI to classify documents, extract or derive relevant information, and validate the extracted data to use it in business processes. One of its core goals is to convert unstructured or semi-structured documents into usable, structured formats such as JSON, which then contain specific fields, tables, or other structured target information. The target structure needs to be consistent, so that it can be used as part of workflows or other downstream business systems or for reporting and insights generation. The following figure shows the workflow, which involves ingesting unstructured documents (for example, invoices from multiple vendors with varying layouts) and extracting relevant information. Despite differences in keywords, column names, or formats across documents, the system normalizes and outputs the extracted data into a consistent, structured JSON format.

Vision language models (VLMs) mark a revolutionary advancement in IDP. VLMs integrate large language models (LLMs) with specialized image encoders, creating truly multi-modal AI that is capable of both textual reasoning and visual interpretation. Unlike traditional document processing tools, VLMs process documents more holistically—simultaneously analyzing text content, document layout, spatial relationships, and visual elements in a manner that more closely resembles human comprehension. This approach enables VLMs to extract meaning from documents with unprecedented accuracy and contextual understanding. For readers interested in exploring the foundations of this technology, Sebastian Raschka’s post—Understanding Multimodal LLMs—offers an excellent primer on multimodal LLMs and their capabilities.
This post has four main sections that reflect the primary contributions of our work:

An overview of the various IDP approaches available, including the option (our recommended solution) for fine-tuning as a scalable approach.
Sample code for fine-tuning VLMs for document-to-JSON conversion using Amazon SageMaker AI and the SWIFT framework, a lightweight toolkit for fine-tuning various large models.
An evaluation framework to assess model performance on structured data extraction.
A discussion of the possible deployment options, including an explicit example for deploying the fine-tuned adapter.

SageMaker AI is a fully managed service to build, train and deploy models at scale. In this post, we use SageMaker AI to fine-tune the VLMs and deploy them for both batch and real-time inference.
Prerequisites
Before you begin, make sure you have the following set up so that you can successfully follow the steps outlined in this post and the accompanying GitHub repository:

AWS account: You need an active AWS account with permissions to create and manage resources in SageMaker AI, Amazon Simple Storage Service (Amazon S3), and Amazon Elastic Container Registry (Amazon ECR).
IAM permissions: Your IAM user or role must have sufficient permissions. For production setups, follow the principle of least privilege as described in security best practices in IAM. For a sandbox setup we suggest the following roles:

Full access to Amazon SageMaker AI (for example, AmazonSageMakerFullAccess).
Read/write access to S3 buckets for storing datasets and model artifacts.
Permissions to push and pull Docker images from Amazon ECR (for example, AmazonEC2ContainerRegistryPowerUser).
If using specific SageMaker instance types, make sure your service quotas are sufficient.

GitHub repository: Clone or download the project code from our GitHub repository. This repository contains the notebooks, scripts, and Docker artifacts referenced in this post.

git clone https://github.com/aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai.git

Local environment setup:

Python: Python 3.10 or higher is recommended.
AWS CLI: Make sure the AWS Command Line Interface (AWS CLI) is installed and configured with credentials that have the necessary permissions.
Docker: Docker must be installed and running on your local machine if you plan to build the custom Docker container for deployment.
Jupyter Notebook or JupyterLab: To run the provided notebooks.
Install the required Python packages by running pip install -r requirements.txt from the cloned repository’s root directory.

Familiarity (recommended):

Basic understanding of Python programming.
Familiarity with AWS services, particularly SageMaker AI.
Conceptual knowledge of LLMs, VLMs, and container technology.

Overview of document processing and generative AI approaches
There are varying degrees of autonomy in intelligent document processing. On one end of the spectrum are fully manual processes: humans read documents and enter the information into a form using a computer system. Most systems today are semi-autonomous document processing solutions; for example, a human takes a picture of a receipt and uploads it to a computer system that automatically extracts part of the information. The goal is to reach fully autonomous intelligent document processing systems, which means reducing the error rate and assessing the use case-specific risk of errors. AI is significantly transforming document processing by enabling greater levels of automation. A variety of approaches exist, ranging in complexity and accuracy—from specialized models for OCR, to generative AI.
Specialized OCR models that don’t rely on generative AI are designed as pre-trained, task-specific ML models that excel at extracting structured information such as tables, forms, and key-value pairs from common document types like invoices, receipts, and IDs. Amazon Textract is one example of this type of service. This service offers high accuracy out of the box and requires minimal setup, making it well-suited for workloads where basic text extraction is required, and documents don’t vary significantly in structure or contain images.
However, as you increase the complexity and variability of documents, in addition to adding multimodality, using generative AI can help improve document processing pipelines.
While powerful, applying general-purpose VLMs or LLMs to document processing isn’t straightforward. Effective prompt engineering is important to guide the model. Processing large volumes of documents (scaling) requires efficient batching and infrastructure. Because LLMs are stateless, providing historical context or specific schema requirements for every document can be cumbersome.
Approaches to intelligent document processing that use LLMs or VLMs fall into four categories:

Zero-shot prompting: the foundation model (FM) receives the result of previous OCR or a PDF and the instructions to perform the document processing task.
Few-shot prompting: the FM receives the result of previous OCR or a PDF, the instructions to perform the document processing task, and some examples.
Retrieval-augmented few-shot prompting: similar to the preceding strategy, but the examples sent to the model are selected dynamically using Retrieval Augmented Generation (RAG).
Fine-tuning VLMs: the FM is further trained on labeled examples of documents and their target outputs, so the desired extraction behavior is built into the model weights.

The following figure shows the relationship between increasing effort and complexity and task accuracy, demonstrating how different techniques—from basic prompt engineering to advanced fine-tuning—impact the performance of large and small base models compared to a specialized solution (inspired by the blog post Comparing LLM fine-tuning methods).

As you move across the horizontal axis, the strategies grow in complexity, and as you move up the vertical axis, you improve overall accuracy. In general, large base models provide better performance than small base models in the strategies that require prompt engineering; however, as we explain in the results of this post, fine-tuning small base models can deliver results similar to fine-tuning large base models for a specific task.
Zero-shot prompting
Zero-shot prompting is a technique to use language models where the model is given a task without prior examples or fine-tuning. Instead, it relies solely on the prompt’s wording and its pre-trained knowledge to generate a response. In document processing, this approach involves giving the model either an image of a PDF document, the OCR-extracted text from the PDF, or a structured markdown representation of the document and providing instructions to perform the document processing task, in addition to the desired output format.
Amazon Bedrock Data Automation uses zero-shot prompting with generative AI to perform IDP. You can use Bedrock Data Automation to automate the transformation of multi-modal data—including documents containing text and complex structures, such as tables, charts and images—into structured formats. You can benefit from customization capabilities through the creation of blueprints that specify output requirements using natural language or a schema editor. Bedrock Data Automation can also extract bounding boxes for the identified entities and route documents appropriately to the correct blueprint. These features can be configured and used through a single API, making it significantly more powerful than a basic zero-shot prompting approach.
While out-of-the-box VLMs can handle general OCR tasks effectively, they often struggle with the unique structure and nuances of custom documents—such as invoices from diverse vendors. Although crafting a prompt for a single document might be straightforward, the variability across hundreds of vendor formats makes prompt iteration a labor-intensive and time-consuming process.
Few-shot prompting
Moving to a more complex approach, you have few-shot prompting, a technique used with LLMs where a small number of examples are provided within the prompt to guide the model in completing a specific task. Unlike zero-shot prompting, which relies solely on natural language instructions, few-shot prompting improves accuracy and consistency by demonstrating the desired input-output behavior through examples.
One alternative is to use the Amazon Bedrock Converse API to perform few-shot prompting. The Converse API provides a consistent way to access LLMs using Amazon Bedrock. It supports turn-based messages between the user and the generative AI model, and allows including documents as part of the content. Another option is using Amazon SageMaker JumpStart, which you can use to deploy models from providers like Hugging Face.
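To make this concrete, the following is a minimal sketch of few-shot prompting with the Amazon Bedrock Converse API using boto3: one worked example is sent as a prior user/assistant turn, followed by the new document to process. The model ID, Region, file names, and JSON keys are placeholder assumptions for illustration, not values from our project.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed Region

# Hypothetical files: one already-annotated example and one new document to process
with open("example_invoice.png", "rb") as f:
    example_bytes = f.read()
with open("new_invoice.png", "rb") as f:
    new_bytes = f.read()

messages = [
    {"role": "user", "content": [
        {"image": {"format": "png", "source": {"bytes": example_bytes}}},
        {"text": "Extract the invoice as JSON with keys: invoice_number, date, total."},
    ]},
    {"role": "assistant", "content": [
        {"text": '{"invoice_number": "INV-001", "date": "2024-01-15", "total": "132.50"}'},
    ]},
    {"role": "user", "content": [
        {"image": {"format": "png", "source": {"bytes": new_bytes}}},
        {"text": "Extract the invoice as JSON with the same keys."},
    ]},
]

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    messages=messages,
    inferenceConfig={"maxTokens": 1024, "temperature": 0},
)
print(response["output"]["message"]["content"][0]["text"])
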
However, your business most likely needs to process different types of documents (for example, invoices, contracts, and handwritten notes), and even within one document type there are many variations. For example, there is no single standardized invoice layout; instead, each vendor has their own layout that you cannot control. Finding a single example, or a few examples, that covers all the different documents you want to process is challenging.
Retrieval-augmented few-shot prompting
One way to address the challenge of finding the right examples is to dynamically retrieve previously processed documents as examples and add them to the prompt at runtime (RAG).
You can store a few annotated samples in a vector store and retrieve them based on the document that needs to be processed. Amazon Bedrock Knowledge Bases helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows.
This turns the intelligent document processing problem into a search problem, which comes with its own challenges around improving search accuracy and scaling to multiple document types. In addition, the few-shot approach is costly, because every document processed requires a longer prompt with examples, which increases the number of input tokens.

As shown in the preceding figure, the prompt context will vary based on the strategy selected (zero-shot, few-shot or few-shot with RAG), which will overall change the results obtained.
Fine-tuning VLMs
At the end of the spectrum, you have the option to fine-tune a custom model to perform document processing. This is our recommended approach and the one we focus on in this post. Fine-tuning is a method where a pre-trained LLM is further trained on a specific dataset to specialize it for a particular task or domain. In the context of document processing, fine-tuning involves using labeled examples—such as annotated invoices, contracts, or insurance forms—to teach the model exactly how to extract or interpret relevant information. Usually, the labor-intensive part of fine-tuning is acquiring a suitable, high-quality dataset. In the case of document processing, your company probably already has a historic dataset in its existing document processing system. You can export this data from your document processing system (for example, from your enterprise resource planning (ERP) system) and use it as the dataset for fine-tuning, which makes fine-tuning a scalable, high-accuracy, and cost-effective approach for intelligent document processing.
The preceding approaches represent a spectrum of strategies to improve LLM performance along two axes: LLM optimization (shaping model behavior through prompt engineering or fine-tuning) and context optimization (enhancing what the model knows at inference through techniques such as few-shot learning or RAG). These methods can be combined—for example, using RAG with few-shot prompts or incorporating retrieved data into fine-tuning—to maximize accuracy.
Fine-tuning VLMs for document-to-JSON conversion
Our approach—the recommended solution for cost-effective document-to-JSON conversion—uses a VLM and fine-tunes it using a dataset of historical documents paired with their corresponding ground-truth JSON that we consider as annotations. This allows the model to learn the specific patterns, fields, and output structure relevant to your historic data, effectively teaching it to read your documents and extract information according to your desired schema.
The following figure shows a high-level architecture of the document-to-JSON conversion process for fine-tuning VLMs by using historic data. This allows the VLM to learn from high data variations and helps ensure that the structured output matches the target system structure and format. 

Fine-tuning offers several advantages over relying solely on OCR or general VLMs:

Schema adherence: The model learns to output JSON matching a specific target structure, which is vital for integration with downstream systems like ERPs.
Implicit field location: Fine-tuned VLMs often learn to locate and extract fields without explicit bounding box annotations in the training data, simplifying data preparation significantly.
Improved text extraction quality: The model becomes more accurate at extracting text even from visually complex or noisy document layouts.
Contextual understanding: The model can better understand the relationships between different pieces of information on the document.
Reduced prompt engineering: Post fine-tuning, the model requires less complex or shorter prompts because the desired extraction behavior is built into its weights.

For our fine-tuning process, we selected the Swift framework. Swift provides a comprehensive, lightweight toolkit for fine-tuning various large language models, including VLMs like Qwen-VL and Llama-Vision.
Data preparation
To fine-tune the VLMs, you will use the Fatura2 dataset, a multi-layout invoice image dataset comprising 10,000 invoices with 50 distinct layouts.
The Swift framework expects training data in a specific JSONL (JSON Lines) format. Each line in the file is a JSON object representing a single training example. For multimodal tasks, this JSON object typically includes:

messages: A list of conversational turns (for example, system, user, assistant). The user turn contains placeholders for images (for example, <image>) and the text prompt that guides the model. The assistant turn contains the target output, which in this case is the ground-truth JSON string.
images: A list of relative paths—within the dataset directory structure—to the document page images (JPG files) relevant to this training example.

As with standard ML practice, the dataset is split into training, development (validation), and test sets to effectively train the model, tune hyperparameters, and evaluate its final performance on unseen data. Each document (which could be single-page or multi-page) paired with its corresponding ground-truth JSON annotation constitutes a single row or example in our dataset. In our use case, one training sample is the invoice image (or multiple images of document pages) and the corresponding detailed JSON extraction. This one-to-one mapping is essential for supervised fine-tuning.
The conversion process, detailed in the dataset creation notebook from the associated GitHub repo, involves several key steps:

Image handling: If the source document is a PDF, each page is rendered into a high-quality PNG image.
Annotation processing (fill missing values): We apply light pre-processing to the raw JSON annotation. Fine-tuning multiple models on an open source dataset, we observed that the performance increases when all keys are present in every JSON sample. To maintain this consistency, the target JSONs in the dataset are made to include the same set of top-level keys (derived from the entire dataset). If a key is missing for a particular document, it’s added with a null value.
Key ordering: The keys within the processed JSON annotation are sorted alphabetically. This consistent ordering helps the model learn a stable output structure.
Prompt construction: A user prompt is constructed. This prompt includes <image> tags (one for each page of the document) and explicitly lists the JSON keys the model is expected to extract. Including the JSON keys in the prompts improves the fine-tuned model’s performance.
Swift formatting: These components (prompt, image paths, target JSON) are assembled into the Swift JSONL format. Swift datasets support multimodal inputs, including images, videos, and audio.

The following is an example structure of a single training instance in Swift’s JSONL format, demonstrating how multimodal inputs are organized. This includes conversational messages, paths to images, and objects containing bounding box (bbox) coordinates for visual references within the text. For more information about how to create a custom dataset for Swift, see the Swift documentation.

{
  "messages": [
    {"role": "system", "content": "Task definition"},
    {"role": "user", "content": "<image><image>… + optional text prompt"},
    {"role": "assistant", "content": "JSON or text output with extracted data with <bbox> references."}
  ],
  "images": ["path/to/image1.png", "path/to/image2.png"],
  "objects": {"ref": [], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}  # Optional
}
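
The dataset creation notebook performs the full conversion; the following is a minimal sketch, assuming hypothetical key names and file paths, of how such a record could be assembled and appended to a JSONL file. It mirrors the steps above: one <image> tag per page, the expected keys listed in the prompt, and the ground-truth JSON with alphabetically sorted keys as the assistant turn.

import json

def build_swift_record(image_paths, target_json, json_keys):
    # One training example: prompt with one <image> tag per page plus the expected keys,
    # and the ground-truth JSON (keys sorted alphabetically) as the assistant turn.
    prompt = "<image>" * len(image_paths) + (
        "Extract the following fields as JSON: " + ", ".join(json_keys)
    )
    return {
        "messages": [
            {"role": "system", "content": "You convert documents to structured JSON."},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": json.dumps(target_json, sort_keys=True)},
        ],
        "images": image_paths,
    }

# Illustrative example with placeholder paths and keys
record = build_swift_record(
    ["images/invoice_0001_page1.png"],
    {"title": "INVOICE", "total": "132.50", "date": None},
    ["date", "title", "total"],
)
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
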

Fine-tuning frameworks and resources
In our evaluation of fine-tuning frameworks for use with SageMaker AI, we considered several prominent options highlighted in the community and relevant to our needs. These included Hugging Face Transformers, Hugging Face AutoTrain, Llama Factory, Unsloth, Torchtune, and ModelScope SWIFT (referred to simply as SWIFT in this post, aligning with the SWIFT 2024 paper by Zhao and others).
After experimenting with these, we decided to use SWIFT because of its lightweight nature, comprehensive support for various Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and DoRA, and its design tailored for efficient training of a wide array of models, including the VLMs used in this post (for example, Qwen2.5-VL). Its scripting approach integrates seamlessly with SageMaker AI training jobs, allowing for scalable and reproducible fine-tuning runs in the cloud.
There are several strategies for adapting pre-trained models: full fine-tuning, where all model parameters are updated; PEFT, which offers a more efficient alternative by updating only a small number of new parameters (adapters); and quantization, a technique that reduces model size and speeds up inference using lower-precision formats (see Sebastian Raschka’s post on fine-tuning to learn more about each technique).
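As a generic illustration of PEFT, separate from the SWIFT configuration used later in this post, the following sketch attaches a LoRA adapter with DoRA enabled to a Hugging Face model using the peft library. The model ID, target modules, and ranks are assumptions for illustration, and use_dora requires a recent peft release.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder small causal LM, chosen only to keep the sketch lightweight
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    lora_dropout=0.05,
    use_dora=True,                         # enable DoRA on top of LoRA (recent peft versions)
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()         # only the small adapter is trainable
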
Our project uses LoRA and DoRA, as configured in the fine-tuning notebook.
The following is an example of configuring and running a fine-tuning job (LoRA) as a SageMaker AI training job using SWIFT and the SageMaker remote function. When this function is called, the fine-tuning is executed remotely as a SageMaker AI training job.

from sagemaker.remote_function import remote
import json
import os

@remote(instance_type="ml.g6e.12xlarge", volume_size=200, use_spot_instances=True)
def fine_tune_document(training_data_s3, train_data_path="train.jsonl", validation_data_path="validation.jsonl"):
    from swift.llm import sft_main

    ## copy the training data from the input source to a local directory
    train_data_local_path = ...
    validation_data_local_path = ...
    checkpoint_dir = ...  # local output directory for the fine-tuned adapter

    # set and run the fine-tuning using the ms-swift framework
    os.environ["SIZE_FACTOR"] = json.dumps(8)        # can be increased, but requires more GPU memory
    os.environ["MAX_PIXELS"] = json.dumps(602112)    # can be increased, but requires more GPU memory
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"   # GPU devices to be used
    os.environ["NPROC_PER_NODE"] = "4"               # we have 4 GPUs on one instance
    os.environ["USE_HF_TRANSFER"] = json.dumps(1)

    argv = ['--model_type', 'qwen2_5_vl',
            '--model_id_or_path', 'Qwen/Qwen2.5-VL-3B-Instruct',
            '--train_type', 'lora',
            '--use_dora', 'true',
            '--output_dir', checkpoint_dir,
            '--max_length', '4096',
            '--dataset', train_data_local_path,
            '--val_dataset', validation_data_local_path,
            ]

    sft_main(argv)
    ## potentially evaluate inference on the test dataset
    return "done"

Fine-tuning VLMs typically requires GPU instances because of their computational demands. For models like Qwen2.5-VL 3B, an instance such as an Amazon SageMaker AI ml.g5.2xlarge or ml.g6.8xlarge can be suitable. Training time is a function of dataset size, model size, batch size, number of epochs, and other hyperparameters. For instance, as noted in our project readme.md, fine-tuning Qwen2.5 VL 3B on 300 Fatura2 samples took approximately 2,829 seconds (roughly 47 minutes) on an ml.g6.8xlarge instance using Spot pricing. This demonstrates how smaller models, when fine-tuned effectively, can deliver exceptional performance cost-efficiently. Larger models like Llama-3.2-11B-Vision would generally require more substantial GPU resources (for example, ml.g5.12xlarge or larger) and longer training times.
Evaluation and visualization of structured outputs (JSON)
A key aspect of any automation or machine learning project is evaluation. Without evaluating your solution, you don’t know how well it performs at solving your business problem. We wrote an evaluation notebook that you can use as a framework. Evaluating the performance of document-to-JSON models involves comparing the model-generated JSON outputs for unseen input documents (test dataset) against the ground-truth JSON annotations.
Key metrics employed in our project include:

Exact match (EM) – accuracy: This metric measures whether the extracted value for a specific field is an exact character-by-character match to the ground-truth value. It’s a strict metric, often reported as a percentage.
Character error rate (CER) – edit distance: This metric calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change the model’s predicted string into the ground-truth string, typically normalized by the length of the ground-truth string. A lower CER indicates better performance (see the sketch after this list).
Recall-Oriented Understudy for Gisting Evaluation (ROUGE): This is a suite of metrics that compares n-grams (sequences of words) and the longest common subsequence between the predicted output and the reference. While traditionally used for text summarization, ROUGE scores can also provide insights into the overall textual similarity of the generated JSON string compared to the ground truth.
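
To illustrate the first two metrics, the following is a minimal sketch of field-level exact match and character error rate; the field values are hypothetical, and the evaluation notebook in the repository contains the full implementation.

def exact_match(pred: str, truth: str) -> float:
    # 1.0 if the predicted field value matches the ground truth character for character
    return float(pred == truth)

def char_error_rate(pred: str, truth: str) -> float:
    # Levenshtein edit distance normalized by the ground-truth length
    m, n = len(pred), len(truth)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (pred[i - 1] != truth[j - 1]))
            prev = cur
    return dp[n] / max(n, 1)

# Illustrative field-level comparison between a prediction and its annotation
print(exact_match("INV-001", "INV-001"))      # 1.0
print(char_error_rate("INV-0O1", "INV-001"))  # one substitution over seven characters, about 0.14
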

Visualizations are helpful for understanding model performance nuances. The following edit distance heatmap image provides a granular view, showing how closely the predictions match the ground truth (green means the model’s output exactly matches the ground truth, and shades of yellow, orange, and red depict increasing deviations). Each model has its own chart, allowing quick comparison across models. The X-axis is the number of sample documents. In this case, we ran inference on 250 unseen samples from the Fatura2 dataset. The Y-axis shows the JSON keys that we asked the model to extract, which will be different for you depending on what structure your downstream system requires.
In the image, you can see the performance of three different models on the Fatura2 dataset. From left to right: Qwen2.5 VL 3B fine-tuned on 300 samples from the Fatura2 dataset, in the middle Qwen2.5 VL 3B without fine-tuning (labeled vanilla), and Llama 3.2 11B vision fine-tuned on 1,000 samples.
The grey color shows the samples for which the Fatura2 dataset doesn’t contain any ground truth, which is why those are the same across the three models.
For a detailed, step-by-step walk-through of how the evaluation metrics are calculated, the specific Python code used, and how the visualizations are generated, see the comprehensive evaluation notebook in our project.

The image shows that Qwen2.5 vanilla is only decent at extracting the Title and Seller Name from the document. For the other keys, it makes more than six character-edit mistakes. However, out of the box, Qwen2.5 is good at adhering to the JSON schema, with only a few predictions where the key is missing (dark blue color) and no predictions of JSON that couldn’t be parsed (for example, missing quotation marks, missing parentheses, or a missing comma). Examining the two fine-tuned models, you can see a clear improvement in performance, with most samples exactly matching the ground truth on all keys. There are only slight differences between fine-tuned Qwen2.5 and fine-tuned Llama 3.2; for example, fine-tuned Qwen2.5 slightly outperforms fine-tuned Llama 3.2 on Total, Title, Conditions, and Buyer, whereas fine-tuned Llama 3.2 slightly outperforms fine-tuned Qwen2.5 on Seller Address, Discount, and Tax.
The goal is to input a document into your fine-tuned model and receive a clean, structured JSON object that accurately maps the extracted information to predefined fields. JSON-constrained decoding enforces adherence to a specified JSON schema during inference and is useful to make sure the output is valid JSON. For the Fatura2 dataset, this approach was not necessary—our fine-tuned Qwen 2.5 model consistently produced valid JSON outputs without additional constraints. However, incorporating constrained decoding remains a valuable safeguard, particularly for production environments where output reliability is critical.
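As an illustration of what such a safeguard could look like with a vLLM-served model, the following sketch passes a JSON schema through vLLM's OpenAI-compatible API. The endpoint URL, served model name, and the guided_json parameter name are assumptions based on vLLM's structured-output support and may differ across versions; the example is text-only to keep it short.

from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is running locally (hypothetical URL)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": ["string", "null"]},
        "date": {"type": ["string", "null"]},
        "total": {"type": ["string", "null"]},
    },
    "required": ["invoice_number", "date", "total"],
}

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # placeholder served model name
    messages=[{"role": "user", "content": "Extract invoice_number, date, and total as JSON."}],
    extra_body={"guided_json": invoice_schema},  # vLLM structured-output extension (assumed parameter name)
)
print(response.choices[0].message.content)
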
Notebook 07 visualizes the input document and the extracted JSON data side-by-side.
Deploying the fine-tuned model
After you fine-tune a model and evaluate it on your dataset, you will want to deploy it to run inference to process your documents. Depending on your use case, a different deployment option might be more suitable.
Option a: vLLM container extended for SageMaker
To deploy our fine-tuned model for real-time inference, we use SageMaker endpoints. SageMaker endpoints provide fully managed hosting for real-time inference for FMs, deep learning, and other ML models, and allow managed autoscaling and cost-optimal deployment techniques. The process, detailed in our deploy model notebook, involves building a custom Docker container. This container packages the vLLM serving engine, which is highly optimized for LLM and VLM inference, along with the Swift framework components needed to load our specific model and adapter. vLLM provides an OpenAI-compatible API server by default, suitable for handling document and image inputs with VLMs. Our custom docker-artifacts and Dockerfile adapt this vLLM base for SageMaker deployment. Key steps include:

Setting up the necessary environment and dependencies.
Configuring an entry point that initializes the vLLM server.
Making sure the server can load the base VLM and dynamically apply our fine-tuned LoRA adapter. The Amazon S3 path to the adapter (model.tar.gz) is passed using the ADAPTER_URI environment variable when creating the SageMaker model.
The container, after being built and pushed to Amazon ECR, is then deployed to a SageMaker endpoint, which listens for invocation requests and routes them to the vLLM engine inside the container.
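
The following is a minimal sketch of deploying the pushed container as a SageMaker endpoint with the SageMaker Python SDK. The image URI, adapter path, instance type, and endpoint name are placeholders; the deploy model notebook in the repository contains the full, working version.

import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # or an explicit IAM role ARN

model = Model(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/vllm-swift:latest",  # placeholder ECR image
    role=role,
    env={
        "ADAPTER_URI": "s3://<bucket>/fine-tuning-output/model.tar.gz",  # placeholder adapter artifact
    },
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",   # GPU instance type (assumed)
    endpoint_name="doc-to-json-vlm",  # placeholder endpoint name
)
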

The following image shows a SageMaker vLLM deployment architecture, where a custom Docker container from Amazon ECR is deployed to a SageMaker endpoint. The container uses vLLM’s OpenAI-compatible API and Swift to serve a base VLM with a fine-tuned LoRA adapter dynamically loaded from Amazon S3.

Option b (optional): Inference components on SageMaker
For more complex inference workflows that might involve sophisticated pre-processing of input documents, post-processing of the extracted JSON, or even chaining multiple models (for example, a classification model followed by an extraction model), Amazon SageMaker inference components offer enhanced flexibility. You can use them to build a pipeline of multiple containers or models within a single endpoint, each handling a specific part of the inference logic.
Option c: Custom model inference in Amazon Bedrock
You can now import your custom models in Amazon Bedrock and then use Amazon Bedrock features to make inference calls to the model. Qwen 2.5 architecture is supported (see Supported Architectures). For more information, see Amazon Bedrock Custom Model Import now generally available.
Clean up
To avoid ongoing charges, it’s important to remove the AWS resources created for this project when you’re finished.

SageMaker endpoints and models:

In the AWS Management Console for SageMaker AI, go to Inference and then Endpoints. Select and delete endpoints created for this project.
Then, go to Inference and then Models and delete the associated models.

Amazon S3 data:

Navigate to the Amazon S3 console.
Delete the S3 buckets or specific folders or prefixes used for datasets, model artifacts (for example, model.tar.gz from training jobs), and inference results. Note: Make sure you don’t delete data needed by other projects.

Amazon ECR images and repositories:

In the Amazon ECR console, delete Docker images and the repository created for the custom vLLM container if you deployed one.

CloudWatch logs (optional):

Logs from SageMaker activities are stored in Amazon CloudWatch. You can delete relevant log groups (for example, /aws/sagemaker/TrainingJobs and /aws/sagemaker/Endpoints) if desired, though many have automatic retention policies.

Important: Always verify resources before deletion. If you experimented with Amazon Bedrock custom model imports, make sure those are also cleaned up. Use AWS Cost Explorer to monitor for unexpected charges.
Conclusion and future outlook
In this post, we demonstrated that fine-tuning VLMs provides a powerful and flexible approach to automate and significantly enhance document understanding capabilities. We have also demonstrated that using focused fine-tuning allows smaller, multi-modal models to compete effectively with much larger counterparts (98% accuracy with Qwen2.5 VL 3B). The project also highlights that fine-tuning VLMs for document-to-JSON processing can be done cost-effectively by using Spot instances and PEFT methods (approximately $1 USD to fine-tune a 3 billion parameter model on around 200 documents).
The fine-tuning task was conducted using Amazon SageMaker training jobs and the Swift framework, which proved to be a versatile and effective toolkit for orchestrating this fine-tuning process.
The potential for enhancing and expanding this work is vast. Some exciting future directions include deploying structured document models on CPU-based, serverless compute like AWS Lambda or Amazon SageMaker Serverless Inference using tools like llama.cpp or vLLM. Using quantized models can enable low-latency, cost-efficient inference for sporadic workloads. Another future direction includes improving evaluation of structured outputs by going beyond field-level metrics. This includes validating complex nested structures and tables using methods like tree edit distance for tables (TEDS).
The complete code repository, including the notebooks, utility scripts, and Docker artifacts, is available on GitHub to help you get started unlocking insights from your documents. For a similar approach using Amazon Nova, refer to the AWS blog post on optimizing document AI and structured outputs by fine-tuning Amazon Nova models and on-demand inference.

About the Authors
Arlind Nocaj is a GTM Specialist Solutions Architect for AI/ML and Generative AI for Europe central based in AWS Zurich Office, who guides enterprise customers through their digital transformation journeys. With a PhD in network analytics and visualization (Graph Drawing) and over a decade of experience as a research scientist and software engineer, he brings a unique blend of academic rigor and practical expertise to his role. His primary focus lies in using the full potential of data, algorithms, and cloud technologies to drive innovation and efficiency. His areas of expertise include Machine Learning, Generative AI and in particular Agentic systems with Multi-modal LLMs for document processing and structured insights.
Malte Reimann is a Solutions Architect based in Zurich, working with customers across Switzerland and Austria on their cloud initiatives. His focus lies in practical machine learning applications—from prompt optimization to fine-tuning vision language models for document processing. The most recent example is working in a small team to provide deployment options for Apertus on AWS. An active member of the ML community, Malte balances his technical work with a disciplined approach to fitness, preferring early morning gym sessions when the gym is empty. During summer weekends, he explores the Swiss Alps on foot and enjoys time in nature. His approach to both technology and life is straightforward: consistent improvement through deliberate practice, whether that’s optimizing a customer’s cloud deployment or preparing for the next hike in the clouds.
Nick McCarthy is a Senior Generative AI Specialist Solutions Architect on the Amazon Bedrock team, focused on model customization. He has worked with AWS clients across a wide range of industries — including healthcare, finance, sports, telecommunications, and energy — helping them accelerate business outcomes through the use of AI and machine learning. Outside of work, Nick loves traveling, exploring new cuisines, and reading about science and technology. He holds a Bachelor’s degree in Physics and a Master’s degree in Machine Learning.
Irene Marban Alvarez is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS), working with customers in the United Kingdom and Ireland. With a background in Biomedical Engineering and Masters in Artificial Intelligence, her work focuses on helping organizations leverage the latest AI technologies to accelerate their business. In her spare time, she loves reading and cooking for her friends.

How Clario automates clinical research analysis using generative AI on …

Clinical outcome assessment (COA) interviews are important instruments in clinical trials for evaluating the efficacy and safety of treatments. In studies of psychosis, anxiety, and mood disorders, these assessments often determine the success or failure of the trial, highlighting the importance of data quality and reliability. The traditional approach to evaluating the quality of these outcomes is complex and involves time-consuming, logistically challenging reviews of audio-video recordings in near real time. Interview evaluation variability, poor assessment technique, and other factors can introduce noise, leading to unreliable results and potentially to study failure.
About Clario
Clario is a leading provider of endpoint data solutions for systematic collection, management, and analysis of specific, pre-defined outcomes (endpoints) to evaluate a treatment’s safety and effectiveness in the clinical trials industry. Clario generates high-quality clinical evidence for life sciences companies seeking to bring new therapies to patients. Since its founding over 50 years ago, Clario has deployed endpoint data solutions over 30,000 times, supporting over 710 novel drug regulatory approvals across more than 100 countries.
In this post, we demonstrate how Clario has used Amazon Bedrock and other AWS services to build an AI-powered solution that automates and improves the analysis of COA interviews. We discuss how Clario:

implemented speaker diarization, multi-lingual transcription, and large language models (LLMs)
used vector databases and semantic search to evaluate interview quality
incorporated automation into complex assessment reviews while maintaining regulatory compliance

Business challenge
Clario sought to transform their COA review methodology to enhance operational effectiveness while also increasing data quality. The company required a system that could address the critical challenges of standardized review of multi-lingual data at a global scale, while reducing natural variation between different expert reviewers, and maintaining uniform assessment quality across the complex COA interview process. The solution also needed to efficiently manage large volumes of audio recordings while meeting strict regulatory and privacy requirements. Clario sought capabilities that could automatically analyze speech and dialogue in near real time during COA interviews to potentially enable:

Reduced subjectivity and variability – Delivering more consistent and reliable behavioral health assessments, minimizing site and rater bias.
Enhanced data quality and credibility – Improving the robustness of trial outcomes with objective, standardized, and repeatable interview evaluations.
Streamlined operations – Automated complex assessment review and scoring could save time and resources for geographically dispersed sites and sponsor-level clinical teams.
Accelerated decision-making – Gaining clearer insights earlier could support faster, evidence-based go or no-go decisions for the trial sponsors.

Solution
To address this challenge, Clario chose AWS for its comprehensive artificial intelligence and machine learning (AI/ML) capabilities and proven ability to deploy HIPAA-compliant services at a global scale. Clario used the power of generative AI and Amazon Bedrock, a fully managed service that provides access to a diverse range of high-performing foundation models, which offers several key advantages:

No infrastructure management – Alleviate the operational overhead of managing AI model infrastructure and updates
Multiple model access – Compare and select from leading foundation models to optimize performance for their specific COA analysis needs
Built-in compliance features – Native support for data governance, audit trails, and regulatory requirements essential for clinical research
Rapid prototyping and deployment – Accelerated time-to-market through serverless architecture and pre-built integrations
Seamless AWS system integration – Native compatibility with existing AWS services for data storage, processing, and analytics
Enterprise security and privacy controls – Advanced encryption, access controls, and data residency options to help meet stringent industry standards
Continuous model improvements – Automatic access to model updates and new capabilities, reducing migration complexity

This comprehensive approach enabled Clario to focus on their core competency—clinical research excellence—while using cutting-edge AI capabilities through a trusted, compliance-aligned system.
The solution integrates advanced AI capabilities, including speaker diarization, multi-lingual transcription, semantic search, and agentic AI, to automatically review the quality of complex COA interviews in a manner similar to expert human central reviewers. The workflow orchestrates multiple steps where audio data is first analyzed to identify the unique speakers in the interview based on their voice, followed by speech-to-text conversion, and speaker role attribution to determine which speech corresponds to the interviewer and the study participant.
This information is segmented into semantically meaningful chunks based on speaker turns and natural conversation boundaries, with each segment maintaining crucial metadata. Examples of metadata include timestamps, speaker role, and positional context. These chunks are then vectorized and stored in an Amazon OpenSearch vector database, enabling the system to overcome the context window limitations of foundation models when processing lengthy interviews. The solution implements a sophisticated retrieval strategy where:

Overlapping windows make sure that contextual information is not lost at segment boundaries (see the simplified chunking sketch after this list)
Targeted semantic searches identify specific dialogue segments relevant to each assessment criterion
A hierarchical approach preserves both local conversational flow and global interview context through interview-level summaries and speaker roles
Rolling context windows can be dynamically assembled when evaluating criteria that span multiple segments
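
To make the overlapping-window idea concrete in general terms, the following is a simplified, illustrative sketch (not Clario's implementation) of grouping diarized speaker turns into overlapping chunks that retain their metadata before vectorization. The field names are assumptions for illustration.

def chunk_turns(turns, window_size=6, overlap=2):
    # Each turn is a dict such as {"speaker": "interviewer", "text": "...", "start": 12.4}.
    # Consecutive windows share `overlap` turns so context is not lost at boundaries.
    chunks, step = [], window_size - overlap
    for start in range(0, max(len(turns) - overlap, 1), step):
        window = turns[start:start + window_size]
        chunks.append({
            "text": " ".join(t["text"] for t in window),
            "speakers": [t["speaker"] for t in window],
            "start_time": window[0]["start"],
            "position": start,  # positional context within the interview
        })
    return chunks
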

This architecture allows the system to efficiently handle multiple queries against the same interview data while maintaining contextual relationships throughout the conversation. The system uses this semantic retrieval capability to analyze the content of the dialogue between the interviewer and the participant, evaluating it against a structured interview guide and central review checklist. The output of the workflow includes a quality rating for the interview, along with structured feedback for each checklist item, specifying where the interview diverges from the established standards. The overall system provides near real-time insights into the quality and reliability of the COA interview, supporting faster evidence-based go or no-go decisions for sponsors of clinical trials.
Solution architecture
The following architecture diagram illustrates the solution implementation:

The workflow consists of the following steps:

The COA interview recordings (audio and video files) from the interviews are collected on premises (1) using a recording application. The files are uploaded using AWS Direct Connect with encryption in transit to Amazon Simple Storage Service (Amazon S3)(2). The uploaded documents are then automatically stored with server-side object-level encryption.
After the files are uploaded, Clario’s AI Orchestration Engine (3) extracts the audio and identifies speech segments of unique speakers using a custom speaker diarization model on Amazon SageMaker (4).
The Orchestration Engine also invokes the Amazon Bedrock API for automated audio transcription. Clario uses the Whisper model from the Amazon Bedrock Marketplace (5) to generate near real-time transcriptions of the COA interview recordings. The transcriptions are then annotated with speaker information and timecodes, and then vectorized using an embedding model (Amazon Titan Text Embeddings v2 model) and stored into Amazon OpenSearch (7) for semantic retrieval.
After the information has been vectorized and stored, Clario’s AI Orchestration Engine executes a graph-based agent system running on Amazon Elastic Kubernetes Service (Amazon EKS)(3) for automated COA interview review. The agent implements a multi-step workflow that: (1) retrieves the assessment’s structured interview guide from configuration, (2) loads the corresponding central review checklist criteria, and (3) systematically queries Amazon OpenSearch (7) to extract relevant interview segments. Using the pre-configured graph structure for the task at hand, the agent traverses predefined decision nodes to compare interview responses against standardized assessment criteria, identify gaps or inconsistencies, and generate structured findings with supporting evidence citations.
The agent uses advanced large language models (LLMs), such as Anthropic Claude 3.7 Sonnet from Amazon Bedrock (6), to classify the speech segment as interviewer or participant, and to determine if each interview turn meets the interview quality criteria.
Clario’s AI Orchestration Engine then compiles the overall review of the interview and persists the information in Amazon Relational Database Service (Amazon RDS)(8).
Results of the AI-powered automated review can be retrieved by a client application (9) by invoking a Rest API using Amazon API Gateway endpoints (10).

Benefits and results
The initial implementation of this AI-powered solution is showing promise in improving Clario’s clinical trial processes:

Operational efficiency

Potential to decrease manual review effort by over 90%.

Quality improvements

Up to 100% data coverage through automated review versus human-only review of a smaller subset of recordings to spot check quality.
Highly targeted interventions might be enabled with rapid turnaround, focusing only on those raters and sites that require remediation.

Business impact

Potential to shorten turn-around time by decreasing central review time from weeks to hours.
Enhanced data reliability for regulatory submissions.
Reduced risk of study failure and uninterpretable results.
Improved scalability of clinical trial operations.

Lessons learned and best practices
Throughout the development and deployment of this solution, Clario has gained valuable insights and lessons learned that can benefit other organizations looking to implement similar AI-powered systems:

Importance of responsible AI development and use – During initial testing, Clario discovered that LLMs would occasionally generate plausible sounding but inaccurate summaries. This critical finding reinforced the importance of responsible AI practices in healthcare applications. This led Clario to implement a validation system where AI outputs are cross-checked against source documents for factual accuracy before human review.
Continuous model evaluation – Clario adopted a rigorous model evaluation process to maintain the highest standards of quality and reliability in their AI-powered COA interview analysis solution. Clario regularly assessed the performance and accuracy of their AI models through multiple approaches, including comparative studies on custom datasets, across multiple models and configurations.
Scalable and more secure architecture – The serverless, cloud-based architecture of the solution–using services like Amazon Bedrock, Amazon S3, and AWS Lambda–helped Clario to scale their solution effectively while prioritizing data security and compliance.

Next steps and conclusion
Clario’s innovative solution has the potential to transform the way COAs are reviewed and rated, significantly improving the reliability of clinical trial data and reducing the time and effort required for manual review. As Clario continues to refine and expand the capabilities of this AI-powered system, it is exploring additional use cases in neuroscience studies that rely on clinical interviews for evaluating the safety and efficacy of treatments.
By using generative AI and the robust features of Amazon Bedrock, Clario has set a new standard for clinical trial data analysis. This empowers their customers to make more informed decisions and accelerate the development of life-changing therapies.

About the authors
Alex Boudreau is the Director of AI at Clario. He leads the company’s innovative Generative AI department and oversees the development of the company’s advanced multi-modal GenAI Platform, which encompasses cutting-edge cloud engineering, AI engineering, and foundational AI research. Alex previously pioneered Deep Learning speech analysis systems for automotive applications, led cloud-based enterprise fraud detection solutions, advanced conversational AI technologies, and groundbreaking projects in medical image analysis. His expertise in leading high-impact initiatives positions him uniquely to drive forward the boundaries of AI technology in the business world.
Cuong Lai is the Technical Team Lead for the Generative AI team at Clario, where he helps to drive the development and scaling of the company’s generative AI platform. With over eight years of software engineering experience, he specializes in web development, API design, and architecting cloud-native solutions. Cuong has extensive experience leveraging AWS services to build secure, reliable, and high-performance systems that support large-scale AI workloads. He is passionate about advancing generative AI technologies and delivering innovative, production-ready AI solutions.
Praveen Haranahalli is a Senior Solutions Architect at Amazon Web Services (AWS), where he architects secure, scalable cloud solutions and provides strategic guidance to diverse enterprise customers. With nearly two decades of IT experience, Praveen has delivered transformative implementations across multiple industries. As a trusted technical advisor, he partners with customers to implement robust DevSecOps pipelines, establish comprehensive security guardrails, and develop innovative AI/ML solutions. He is passionate about solving complex business challenges through cutting-edge cloud architectures and empowering organizations to achieve successful digital transformations powered by artificial intelligence and machine learning.

A Coding Implementation to Build Neural Memory Agents with Differentia …

In this tutorial, we explore how neural memory agents can learn continuously without forgetting past experiences. We design a memory-augmented neural network that integrates a Differentiable Neural Computer (DNC) with experience replay and meta-learning to adapt quickly to new tasks while retaining prior knowledge. By implementing this approach in PyTorch, we demonstrate how content-based memory addressing and prioritized replay enable the model to overcome catastrophic forgetting and maintain performance across multiple learning tasks. Check out the FULL CODES here.

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
from dataclasses import dataclass

@dataclass
class MemoryConfig:
    memory_size: int = 128
    memory_dim: int = 64
    num_read_heads: int = 4
    num_write_heads: int = 1

We begin by importing all the essential libraries and defining the configuration class for our neural memory system. Here, we set parameters such as memory size, dimensionality, and the number of read/write heads that shape how the differentiable memory behaves throughout training. This setup acts as the foundation upon which our memory-augmented architecture is built. Check out the FULL CODES here.

class NeuralMemoryBank(nn.Module):
    def __init__(self, config: MemoryConfig):
        super().__init__()
        self.memory_size = config.memory_size
        self.memory_dim = config.memory_dim
        self.num_read_heads = config.num_read_heads
        self.register_buffer('memory', torch.zeros(config.memory_size, config.memory_dim))
        self.register_buffer('usage', torch.zeros(config.memory_size))

    def content_addressing(self, key, beta):
        key_norm = F.normalize(key, dim=-1)
        mem_norm = F.normalize(self.memory, dim=-1)
        similarity = torch.matmul(key_norm, mem_norm.t())
        return F.softmax(beta * similarity, dim=-1)

    def write(self, write_key, write_vector, erase_vector, write_strength):
        write_weights = self.content_addressing(write_key, write_strength)
        erase = torch.outer(write_weights.squeeze(), erase_vector.squeeze())
        self.memory = (self.memory * (1 - erase)).detach()
        add = torch.outer(write_weights.squeeze(), write_vector.squeeze())
        self.memory = (self.memory + add).detach()
        self.usage = (0.99 * self.usage + write_weights.squeeze()).detach()

    def read(self, read_keys, read_strengths):
        reads = []
        for i in range(self.num_read_heads):
            weights = self.content_addressing(read_keys[i], read_strengths[i])
            read_vector = torch.matmul(weights, self.memory)
            reads.append(read_vector)
        return torch.cat(reads, dim=-1)

class MemoryController(nn.Module):
    def __init__(self, input_dim, hidden_dim, memory_config: MemoryConfig):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.memory_config = memory_config
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        total_read_dim = memory_config.num_read_heads * memory_config.memory_dim
        self.read_keys = nn.Linear(hidden_dim, memory_config.num_read_heads * memory_config.memory_dim)
        self.read_strengths = nn.Linear(hidden_dim, memory_config.num_read_heads)
        self.write_key = nn.Linear(hidden_dim, memory_config.memory_dim)
        self.write_vector = nn.Linear(hidden_dim, memory_config.memory_dim)
        self.erase_vector = nn.Linear(hidden_dim, memory_config.memory_dim)
        self.write_strength = nn.Linear(hidden_dim, 1)
        self.output = nn.Linear(hidden_dim + total_read_dim, input_dim)

    def forward(self, x, memory_bank, hidden=None):
        lstm_out, hidden = self.lstm(x.unsqueeze(0), hidden)
        controller_state = lstm_out.squeeze(0)
        read_k = self.read_keys(controller_state).view(self.memory_config.num_read_heads, -1)
        read_s = F.softplus(self.read_strengths(controller_state))
        write_k = self.write_key(controller_state)
        write_v = torch.tanh(self.write_vector(controller_state))
        erase_v = torch.sigmoid(self.erase_vector(controller_state))
        write_s = F.softplus(self.write_strength(controller_state))
        read_vectors = memory_bank.read(read_k, read_s)
        memory_bank.write(write_k, write_v, erase_v, write_s)
        combined = torch.cat([controller_state, read_vectors], dim=-1)
        output = self.output(combined)
        return output, hidden

We implement the Neural Memory Bank and the Memory Controller, which together form the core of the agent’s differentiable memory mechanism. The Neural Memory Bank stores and retrieves information through content-based addressing, while the controller network dynamically interacts with this memory using read and write operations. This setup enables the agent to recall relevant information and adapt to new inputs efficiently. Check out the FULL CODES here.
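
As a quick sanity check of the classes defined above, a single forward pass could look like the following sketch; the 64-dimensional input matches the vectors used later in the tutorial.

# Quick usage sketch of the memory bank and controller defined above
config = MemoryConfig()
memory_bank = NeuralMemoryBank(config)
controller = MemoryController(input_dim=64, hidden_dim=128, memory_config=config)

x = torch.randn(64)                        # one 64-dimensional input vector
output, hidden = controller(x, memory_bank)
print(output.shape)                        # torch.Size([64]): prediction in input space
print(memory_bank.memory.abs().sum() > 0)  # the forward pass has written to memory
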

class ExperienceReplay:
    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def push(self, experience, priority=1.0):
        self.buffer.append(experience)
        self.priorities.append(priority ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        if len(self.buffer) == 0:
            return [], []
        probs = np.array(self.priorities)
        probs = probs / probs.sum()
        indices = np.random.choice(len(self.buffer), min(batch_size, len(self.buffer)), p=probs, replace=False)
        samples = [self.buffer[i] for i in indices]
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights = weights / weights.max()
        return samples, torch.FloatTensor(weights)

class MetaLearner(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def adapt(self, support_x, support_y, num_steps=5, lr=0.01):
        adapted_params = {name: param.clone() for name, param in self.model.named_parameters()}
        for _ in range(num_steps):
            pred, _ = self.model(support_x, self.model.memory_bank)
            loss = F.mse_loss(pred, support_y)
            grads = torch.autograd.grad(loss, self.model.parameters(), create_graph=True)
            adapted_params = {name: param - lr * grad for (name, param), grad in zip(adapted_params.items(), grads)}
        return adapted_params

We design the Experience Replay and Meta-Learner components to strengthen the agent’s ability to learn continuously. The replay buffer enables the model to revisit past experiences through prioritized sampling, thereby reducing forgetting, while the Meta-Learner utilizes MAML-style adaptation for rapid learning on new tasks. Together, these modules bring stability and flexibility to the agent’s training process. Check out the FULL CODES here.
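
Before wiring these into the agent, a quick standalone check of the prioritized replay buffer defined above might look like the following sketch (random data, purely illustrative).

# Standalone sketch: higher-priority experiences are sampled more often
replay = ExperienceReplay(capacity=100)
for i in range(20):
    x, y = torch.randn(64), torch.randn(64)
    replay.push((x, y), priority=float(i + 1))  # later experiences get higher priority

samples, weights = replay.sample(batch_size=8)
print(len(samples), weights.shape)              # 8 samples with importance weights
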

class ContinualLearningAgent:
    def __init__(self, input_dim=64, hidden_dim=128):
        self.config = MemoryConfig()
        self.memory_bank = NeuralMemoryBank(self.config)
        self.controller = MemoryController(input_dim, hidden_dim, self.config)
        self.replay_buffer = ExperienceReplay(capacity=5000)
        self.meta_learner = MetaLearner(self.controller)
        self.optimizer = torch.optim.Adam(self.controller.parameters(), lr=0.001)
        self.task_history = []

    def train_step(self, x, y, use_replay=True):
        self.optimizer.zero_grad()
        pred, _ = self.controller(x, self.memory_bank)
        current_loss = F.mse_loss(pred, y)
        # Store the sample with a priority proportional to how surprising it was.
        self.replay_buffer.push((x.detach().clone(), y.detach().clone()), priority=current_loss.item() + 1e-6)
        total_loss = current_loss
        if use_replay and len(self.replay_buffer.buffer) > 16:
            samples, weights = self.replay_buffer.sample(8)
            for (replay_x, replay_y), weight in zip(samples, weights):
                with torch.enable_grad():
                    replay_pred, _ = self.controller(replay_x, self.memory_bank)
                    replay_loss = F.mse_loss(replay_pred, replay_y)
                    total_loss = total_loss + 0.3 * replay_loss * weight
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.controller.parameters(), 1.0)
        self.optimizer.step()
        return total_loss.item()

    def evaluate(self, test_data):
        self.controller.eval()
        total_error = 0
        with torch.no_grad():
            for x, y in test_data:
                pred, _ = self.controller(x, self.memory_bank)
                total_error += F.mse_loss(pred, y).item()
        self.controller.train()
        return total_error / len(test_data)

We construct a Continual Learning Agent that integrates memory, controller, replay, and meta-learning into a single, adaptive framework. In this step, we define how the agent trains on each batch, replays past data, and evaluates its performance. The implementation ensures that the model can retain prior knowledge while learning new information without catastrophic forgetting. Check out the FULL CODES here.

def create_task_data(task_id, num_samples=100):
    torch.manual_seed(task_id)
    x = torch.randn(num_samples, 64)
    if task_id == 0:
        y = torch.sin(x.mean(dim=1, keepdim=True).expand(-1, 64))
    elif task_id == 1:
        y = torch.cos(x.mean(dim=1, keepdim=True).expand(-1, 64)) * 0.5
    else:
        y = torch.tanh(x * 0.5 + task_id)
    return [(x[i], y[i]) for i in range(num_samples)]

def run_continual_learning_demo():
    print("Neural Memory Agent - Continual Learning Demo")
    print("=" * 60)
    agent = ContinualLearningAgent()
    num_tasks = 4
    results = {'tasks': [], 'without_memory': [], 'with_memory': []}
    for task_id in range(num_tasks):
        print(f"\nLearning Task {task_id + 1}/{num_tasks}")
        train_data = create_task_data(task_id, num_samples=50)
        test_data = create_task_data(task_id, num_samples=20)
        for epoch in range(20):
            total_loss = 0
            for x, y in train_data:
                loss = agent.train_step(x, y, use_replay=(task_id > 0))
                total_loss += loss
            if epoch % 5 == 0:
                avg_loss = total_loss / len(train_data)
                print(f"  Epoch {epoch:2d}: Loss = {avg_loss:.4f}")
        print("\nEvaluation on all tasks:")
        for eval_task_id in range(task_id + 1):
            eval_data = create_task_data(eval_task_id, num_samples=20)
            error = agent.evaluate(eval_data)
            print(f"  Task {eval_task_id + 1}: Error = {error:.4f}")
            if eval_task_id == task_id:
                results['tasks'].append(eval_task_id + 1)
                results['with_memory'].append(error)
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    ax = axes[0]
    memory_matrix = agent.memory_bank.memory.detach().numpy()
    im = ax.imshow(memory_matrix, aspect='auto', cmap='viridis')
    ax.set_title('Neural Memory Bank State', fontsize=14, fontweight='bold')
    ax.set_xlabel('Memory Dimension')
    ax.set_ylabel('Memory Slots')
    plt.colorbar(im, ax=ax)
    ax = axes[1]
    ax.plot(results['tasks'], results['with_memory'], marker='o', linewidth=2, markersize=8, label='With Memory Replay')
    ax.set_title('Continual Learning Performance', fontsize=14, fontweight='bold')
    ax.set_xlabel('Task Number')
    ax.set_ylabel('Test Error')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('neural_memory_results.png', dpi=150, bbox_inches='tight')
    print("\nResults saved to 'neural_memory_results.png'")
    plt.show()
    print("\n" + "=" * 60)
    print("Key Insights:")
    print("  • Memory bank stores compressed task representations")
    print("  • Experience replay mitigates catastrophic forgetting")
    print("  • Agent maintains performance on earlier tasks")
    print("  • Content-based addressing enables efficient retrieval")

if __name__ == "__main__":
    run_continual_learning_demo()

We conduct a comprehensive demonstration of the continual learning process, generating synthetic tasks to evaluate the agent’s adaptability across multiple environments. As we train and visualize the results, we observe how memory replay improves stability and maintains accuracy across tasks. The experiment concludes with graphical insights that highlight how differentiable memory enhances the agent’s long-term learning capability.

In conclusion, we built and trained a neural memory agent capable of continual adaptation across evolving tasks. We observed how the differentiable memory enables efficient storage and retrieval of learned representations, while the replay mechanism reinforces stability and knowledge retention. By combining these components with meta-learning, we saw how such agents pave the way for more resilient, self-adapting neural systems that can remember, reason, and evolve without losing what they’ve already mastered.

Check out the FULL CODES here.
The post A Coding Implementation to Build Neural Memory Agents with Differentiable Memory, Meta-Learning, and Experience Replay for Continual Adaptation in Dynamic Environments appeared first on MarkTechPost.

AI Interview Series #1: Explain Some LLM Text Generation Strategies Used in LLMs

Every time you prompt an LLM, it doesn’t generate a complete answer all at once — it builds the response one word (or token) at a time. At each step, the model predicts the probability of what the next token could be based on everything written so far. But knowing probabilities alone isn’t enough — the model also needs a strategy to decide which token to actually pick next.

Different strategies can completely change how the final output looks — some make it more focused and precise, while others make it more creative or varied. In this article, we’ll explore four popular text generation strategies used in LLMs: Greedy Search, Beam Search, Nucleus Sampling, and Temperature Sampling — explaining how each one works.

Greedy Search

Greedy Search is the simplest decoding strategy where, at each step, the model picks the token with the highest probability given the current context. While it’s fast and easy to implement, it doesn’t always produce the most coherent or meaningful sequence — similar to making the best local choice without considering the overall outcome. Because it only follows one path in the probability tree, it can miss better sequences that require short-term trade-offs. As a result, greedy search often leads to repetitive, generic, or dull text, making it unsuitable for open-ended text generation tasks.
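
To make this concrete, here is a minimal sketch of greedy decoding over a toy next-token distribution. The next_token_probs callback and its tiny vocabulary are hypothetical stand-ins for a real model's output, not part of any library.

def greedy_decode(next_token_probs, prompt, max_new_tokens=5):
    """Pick the single most probable token at every step (toy sketch)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)          # dict: token -> probability
        tokens.append(max(probs, key=probs.get))  # local, greedy choice
    return tokens

def next_token_probs(tokens):
    # Hypothetical toy distribution for illustration only.
    if tokens and tokens[-1] == "the":
        return {"cat": 0.5, "dog": 0.3, "the": 0.2}
    return {"the": 0.6, "cat": 0.25, "dog": 0.15}

print(greedy_decode(next_token_probs, ["a"]))
# ['a', 'the', 'cat', 'the', 'cat', 'the'] -- note how quickly greedy decoding starts looping.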

Beam Search

Beam Search is an improved decoding strategy over greedy search that keeps track of multiple possible sequences (called beams) at each generation step instead of just one. It expands the top K most probable sequences, allowing the model to explore several promising paths in the probability tree and potentially discover higher-quality completions that greedy search might miss. The parameter K (beam width) controls the trade-off between quality and computation — larger beams produce better text but are slower. 

While beam search works well in structured tasks like machine translation, where accuracy matters more than creativity, it tends to produce repetitive, predictable, and less diverse text in open-ended generation. This happens because the algorithm favors high-probability continuations, leading to less variation and “neural text degeneration,” where the model overuses certain words or phrases.
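
For comparison, a compact beam search sketch under the same kind of toy setup: it keeps the top K partial sequences at each step, scored by cumulative log probability. The next_token_probs callback and its probabilities are again hypothetical.

import math

def beam_search(next_token_probs, prompt, beam_width=2, max_new_tokens=2):
    """Keep the K best partial sequences by cumulative log probability (toy sketch)."""
    beams = [(list(prompt), 0.0)]                      # (tokens, summed log probability)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]  # prune to K
    return beams

def next_token_probs(tokens):
    # Hypothetical distribution for illustration only.
    return {"slow": 0.6, "fast": 0.4} if tokens[-1] == "The" else {"cat": 0.5, "dog": 0.5}

for tokens, score in beam_search(next_token_probs, ["The"]):
    print(" ".join(tokens), round(math.exp(score), 3))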

https://arxiv.org/pdf/1904.09751

Greedy Search (K=1) always takes the highest local probability:

T2: Chooses “slow” (0.6) over “fast” (0.4).

Resulting path: “The slow dog barks.” (Final Probability: 0.1680)

Beam Search (K=2) keeps both “slow” and “fast” paths alive:

At T3, it realizes the path starting with “fast” has a higher potential for a good ending.

Resulting path: “The fast cat purrs.” (Final Probability: 0.1800)

Beam Search successfully explores a path that had a slightly lower probability early on, leading to a better overall sentence score.

Top-p Sampling (Nucleus Sampling)

Top-p Sampling (Nucleus Sampling) is a probabilistic decoding strategy that dynamically adjusts how many tokens are considered for generation at each step. Instead of picking from a fixed number of top tokens like in top-k sampling, top-p sampling selects the smallest set of tokens whose cumulative probability adds up to a chosen threshold p (for example, 0.7). These tokens form the “nucleus,” from which the next token is randomly sampled after normalizing their probabilities. 

This allows the model to balance diversity and coherence — sampling from a broader range when many tokens have similar probabilities (flat distribution) and narrowing down to the most likely tokens when the distribution is sharp (peaky). As a result, top-p sampling produces more natural, varied, and contextually appropriate text compared to fixed-size methods like greedy or beam search.
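
A minimal sketch of the nucleus selection step, assuming the model's next-token distribution is already available as a NumPy array; the probabilities below are made up for illustration.

import numpy as np

def top_p_sample(probs, p=0.7, rng=None):
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    rng = rng or np.random.default_rng(0)
    order = np.argsort(probs)[::-1]                    # token ids sorted by probability, descending
    sorted_probs = probs[order]
    cutoff = np.searchsorted(np.cumsum(sorted_probs), p) + 1   # size of the nucleus
    nucleus_ids = order[:cutoff]
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize inside the nucleus
    return rng.choice(nucleus_ids, p=nucleus_probs)

probs = np.array([0.42, 0.30, 0.15, 0.08, 0.05])       # hypothetical next-token distribution
print(top_p_sample(probs, p=0.7))                      # samples only from the first two tokens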

Temperature Sampling

Temperature Sampling controls the level of randomness in text generation by adjusting the temperature parameter (t) in the softmax function that converts logits into probabilities. A lower temperature (t < 1) makes the distribution sharper, increasing the chance of selecting the most probable tokens — resulting in more focused but often repetitive text. At t = 1, the model samples directly from its natural probability distribution, known as pure or ancestral sampling. 

Higher temperatures (t > 1) flatten the distribution, introducing more randomness and diversity but at the cost of coherence. In practice, temperature sampling allows fine-tuning the balance between creativity and precision: low temperatures yield deterministic, predictable outputs, while higher ones generate more varied and imaginative text. 

The optimal temperature often depends on the task — for instance, creative writing benefits from higher values, while technical or factual responses perform better with lower ones.
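
A small sketch of how the temperature parameter reshapes a distribution before sampling; the logits are hypothetical.

import numpy as np

def temperature_softmax(logits, t=1.0):
    """Convert logits to probabilities, with temperature t controlling sharpness."""
    scaled = np.asarray(logits, dtype=float) / t
    scaled -= scaled.max()                     # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]                       # hypothetical next-token logits
for t in (0.5, 1.0, 2.0):
    print(t, np.round(temperature_softmax(logits, t), 3))
# Low t sharpens the distribution toward the top token, high t flattens it.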

The post AI Interview Series #1: Explain Some LLM Text Generation Strategies Used in LLMs appeared first on MarkTechPost.

StepFun AI Releases Step-Audio-EditX: A New Open-Source 3B LLM-Grade Audio Editing Model Excelling at Expressive and Iterative Audio Editing

How can speech editing become as direct and controllable as simply rewriting a line of text? StepFun AI has open sourced Step-Audio-EditX, a 3B parameter LLM based audio model that turns expressive speech editing into a token level text like operation, instead of a waveform level signal processing task.

https://arxiv.org/pdf/2511.03601

Why developers care about controllable TTS

Most zero shot TTS systems copy emotion, style, accent, and timbre directly from a short reference audio. They can sound natural, but control is weak. Style prompts in text help only for in domain voices, and the cloned voice often ignores the requested emotion or speaking style.

Past work tries to disentangle factors with extra encoders, adversarial losses, or complex architectures. Step-Audio-EditX keeps a relatively entangled representation and instead changes the data and post training objective. The model learns control by seeing many pairs and triplets where text is fixed, but one attribute changes with a large margin.

Architecture, dual codebook tokenizer plus compact audio LLM

Step-Audio-EditX reuses the Step-Audio dual codebook tokenizer. Speech is mapped into two token streams, a linguistic stream at 16.7 Hz with a 1024 entry codebook, and a semantic stream at 25 Hz with a 4096 entry codebook. Tokens are interleaved with a 2 to 3 ratio. The tokenizer keeps prosody and emotion information, so it is not fully disentangled.
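
As a rough illustration of what a 2 to 3 interleave of the two streams could look like, here is a short sketch. The exact interleaving order used by Step-Audio is not described here, so the pattern and the token ids below are assumptions made purely for illustration.

def interleave_2_to_3(linguistic_tokens, semantic_tokens):
    """Merge two token streams in a repeating 2:3 pattern (illustrative assumption).

    Per second of audio there are roughly 16.7 linguistic tokens (1024-entry codebook)
    and 25 semantic tokens (4096-entry codebook), which is the 2 to 3 ratio.
    """
    merged, li, si = [], 0, 0
    while li < len(linguistic_tokens) or si < len(semantic_tokens):
        merged.extend(linguistic_tokens[li:li + 2])   # take 2 linguistic tokens
        merged.extend(semantic_tokens[si:si + 3])     # then 3 semantic tokens
        li, si = li + 2, si + 3
    return merged

print(interleave_2_to_3(["L0", "L1", "L2", "L3"], ["S0", "S1", "S2", "S3", "S4", "S5"]))
# ['L0', 'L1', 'S0', 'S1', 'S2', 'L2', 'L3', 'S3', 'S4', 'S5']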

On top of this tokenizer, the StepFun research team builds a 3B parameter audio LLM. The model is initialized from a text LLM, then trained on a blended corpus with a 1 to 1 ratio of pure text and dual codebook audio tokens in chat style prompts. The audio LLM reads text tokens, audio tokens, or both, and always generates dual codebook audio tokens as output.

A separate audio decoder handles reconstruction. A diffusion transformer based flow matching module predicts Mel spectrograms from audio tokens, reference audio, and a speaker embedding, and a BigVGANv2 vocoder converts Mel spectrograms to waveform. The flow matching module is trained on about 200000 hours of high quality speech, which improves pronunciation and timbre similarity.

Large margin synthetic data instead of complicated encoders

The key idea is large margin learning. The model is post trained on triplets and quadruplets that keep text fixed and change only one attribute with a clear gap.

For zero shot TTS, Step-Audio-EditX uses a high quality in house dataset, mainly Chinese and English, with a small amount of Cantonese and Sichuanese, and about 60,000 speakers. The data covers wide intra speaker and inter speaker variation in style and emotion.

For emotion and speaking style editing, the team builds synthetic large margin triplets (text, audio neutral, audio emotion or style). Voice actors record about 10 second clips for each emotion and style. StepTTS zero shot cloning then produces neutral and emotional versions for the same text and speaker. A margin scoring model, trained on a small human labeled set, scores pairs on a 1 to 10 scale, and only samples with score at least 6 are kept.

Paralinguistic editing, which covers breathing, laughter, filled pauses and other tags, uses a semi synthetic strategy on top of the NVSpeech dataset. The research team builds quadruplets where the target is the original NVSpeech audio and transcript, and the input is a cloned version with tags removed from the text. This gives time domain editing supervision without a margin model.

Reinforcement learning data uses two preference sources. Human annotators rate 20 candidates per prompt on a 5 point scale for correctness, prosody, and naturalness, and pairs with margin greater than 3 are kept. A comprehension model scores emotion and speaking style on a 1 to 10 scale, and pairs with margin greater than 8 are kept.

Post training, SFT plus PPO on token sequences

Post training has two stages, supervised fine tuning followed by PPO.

In supervised fine tuning, system prompts define zero shot TTS and editing tasks in a unified chat format. For TTS, the prompt waveform is encoded to dual codebook tokens, converted to string form, and inserted into the system prompt as speaker information. The user message is the target text, and the model returns new audio tokens. For editing, the user message includes original audio tokens plus a natural language instruction, and the model outputs edited tokens.

Reinforcement learning then refines instruction following. A 3B reward model is initialized from the SFT checkpoint and trained with Bradley Terry loss on large margin preference pairs. The reward is computed directly on dual codebook token sequences, without decoding to waveform. PPO training uses this reward model, a clip threshold, and a KL penalty to balance quality and deviation from the SFT policy.
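
A minimal sketch of a Bradley Terry style preference loss over reward scores, assuming a reward model that maps a token sequence to a scalar. The reward_model below is a hypothetical stand-in built from an embedding layer and a linear head, not StepFun's actual reward model.

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, preferred_tokens, rejected_tokens):
    """Push the reward of the preferred sequence above that of the rejected one."""
    r_pref = reward_model(preferred_tokens)    # scalar reward per preferred sample
    r_rej = reward_model(rejected_tokens)      # scalar reward per rejected sample
    return -F.logsigmoid(r_pref - r_rej).mean()

# Hypothetical stand-in reward model: mean-pools token embeddings into a scalar score.
emb = torch.nn.Embedding(5120, 32)             # 5120 is an assumed combined codebook size
head = torch.nn.Linear(32, 1)
reward_model = lambda tokens: head(emb(tokens).mean(dim=-2)).squeeze(-1)

preferred = torch.randint(0, 5120, (4, 16))    # batch of preferred token sequences
rejected = torch.randint(0, 5120, (4, 16))     # batch of rejected token sequences
print(bradley_terry_loss(reward_model, preferred, rejected).item())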

Step-Audio-Edit-Test, iterative editing and generalization

To quantify control, the research team introduced Step-Audio-Edit-Test. It uses Gemini 2.5 Pro as an LLM judge to evaluate emotion, speaking style, and paralinguistic accuracy. The benchmark has 8 speakers, drawn from Wenet Speech4TTS, GLOBE V2, and Libri Light, with 4 speakers per language.

The emotion set has 5 categories with 50 Chinese and 50 English prompts per category. The speaking style set has 7 styles with 50 prompts per language per style. The paralinguistic set has 10 labels such as breathing, laughter, surprise oh, and uhm, with 50 prompts per label and language.

Editing is evaluated iteratively. Iteration 0 is the initial zero shot clone. Then the model applies 3 rounds of editing with text instructions. In Chinese, emotion accuracy rises from 57.0 at iteration 0 to 77.7 at iteration 3. Speaking style accuracy rises from 41.6 to 69.2. English shows similar behavior, and a prompt fixed ablation, where the same prompt audio is used for all iterations, still improves accuracy, which supports the large margin learning hypothesis.

The same editing model is applied to four closed source TTS systems, GPT 4o mini TTS, ElevenLabs v2, Doubao Seed TTS 2.0, and MiniMax speech 2.6 hd. For all of them, one editing iteration with Step-Audio-EditX improves both emotion and style accuracy, and further iterations continue to help.

Paralinguistic editing is scored on a 1 to 3 scale. The average score rises from 1.91 at iteration 0 to 2.89 after a single edit, in both Chinese and English, which is comparable to native paralinguistic synthesis in strong commercial systems.

Key Takeaways

Step Audio EditX uses a dual codebook tokenizer and a 3B parameter audio LLM so it can treat speech as discrete tokens and edit audio in a text like way.

The model relies on large margin synthetic data for emotion, speaking style, paralinguistic cues, speed, and noise, rather than adding extra disentangling encoders.

Supervised fine tuning plus PPO with a token level reward model aligns the audio LLM to follow natural language editing instructions for both TTS and editing tasks.

The Step Audio Edit Test benchmark with Gemini 2.5 Pro as a judge shows clear accuracy gains over 3 editing iterations for emotion, style, and paralinguistic control in both Chinese and English.

Step Audio EditX can post process and improve speech from closed source TTS systems, and the full stack, including code and checkpoints, is available as open source for developers.

Editorial Comments

Step Audio EditX is a precise step forward in controllable speech synthesis, because it keeps the Step Audio tokenizer, adds a compact 3B audio LLM, and optimizes control through large margin data and PPO. The introduction of Step Audio Edit Test with Gemini 2.5 Pro as a judge makes the evaluation story concrete for emotion, speaking style, and paralinguistic control, and the open release lowers the barrier for practical audio editing research. Overall, this release makes audio editing feel much closer to text editing.

Check out the Paper, Repo and Model Weights.
The post StepFun AI Releases Step-Audio-EditX: A New Open-Source 3B LLM-Grade Audio Editing Model Excelling at Expressive and Iterative Audio Editing appeared first on MarkTechPost.

Nested Learning: A New Machine Learning Approach for Continual Learning that Views Models as Nested Optimization Problems to Enhance Long Context Processing

How can we build AI systems that keep learning new information over time without forgetting what they learned before or retraining from scratch? Google researchers have introduced Nested Learning, a machine learning approach that treats a model as a collection of smaller nested optimization problems, instead of a single network trained by one outer loop. The goal is to attack catastrophic forgetting and move large models toward continual learning, closer to how biological brains manage memory and adaptation over time.

https://abehrouz.github.io/files/NL.pdf

What is Nested Learning?

The Google research paper ‘Nested Learning: The Illusion of Deep Learning Architectures’ models a complex neural network as a set of coherent optimization problems, nested or running in parallel, that are optimized together. Each internal problem has its own context flow, the sequence of inputs, gradients, or states that this component observes, and its own update frequency.

Instead of seeing training as a flat stack of layers plus one optimizer, Nested Learning imposes an ordering by update frequency. Parameters that update often sit at inner levels, while slowly updated parameters form outer levels. This hierarchy defines a Neural Learning Module, where every level compresses its own context flow into its parameters. The research team show that this view covers standard back-propagation on an MLP, linear attention, and common optimizers, all as instances of associative memory.

In this framework, associative memory is any operator that maps keys to values and is trained with an internal objective. The research team formalizes associative memory and then shows that back-propagation itself can be written as a one step gradient descent update that learns a mapping from inputs to local surprise signals, the gradient of the loss with respect to the output.

Deep Optimizers as Associative Memory

Once optimizers are treated as learning modules, Nested Learning suggests redesigning them with richer internal objectives. Standard momentum can be written as a linear associative memory over past gradients, trained with a dot product similarity objective. This internal objective produces a Hebbian like update rule that does not model dependencies between data samples.

The research team replaces this similarity objective with an L2 regression loss over gradient features, which yields an update rule that better manages limited memory capacity and better memorizes gradient sequences. They then generalize the momentum memory from a linear map to an MLP and define Deep Momentum Gradient Descent, where the momentum state is produced by a neural memory and can pass through a non linear function such as Newton Schulz. This perspective also recovers the Muon optimizer as a special case.

Continuum Memory System

In typical sequence models, attention acts as working memory over the current context window, while feedforward blocks store pre training knowledge as long term memory that is rarely updated after training. The Nested Learning researchers extend this binary view to a Continuum Memory System, or CMS.

CMS is defined as a chain of MLP blocks, MLP(f₁) through MLP(fₖ), where each block has its own update frequency and chunk size. For an input sequence, the output is obtained by sequentially applying these blocks. The parameters of each block are updated only every C^(ℓ) steps, so each block compresses a different time scale of context into its parameters. A standard Transformer with one feedforward block is recovered as the special case with k equal to 1.
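
A schematic sketch of this update schedule, assuming a chain of MLP blocks applied sequentially where block l is only trainable every C^(l) steps. It illustrates the idea of per-block update frequencies, not the paper's actual training code.

import torch
import torch.nn as nn

class ContinuumMemorySketch(nn.Module):
    """Chain of MLP blocks, each with its own update period (illustrative sketch)."""
    def __init__(self, dim=64, update_periods=(1, 4, 16)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in update_periods
        ])
        self.update_periods = update_periods   # C^(l): block l is trainable every C^(l) steps
        self.step = 0

    def forward(self, x):
        for block in self.blocks:              # output = sequential application of the chain
            x = block(x)
        return x

    def set_trainable_blocks(self):
        """Freeze or unfreeze each block depending on the current step."""
        for block, period in zip(self.blocks, self.update_periods):
            trainable = (self.step % period == 0)
            for p in block.parameters():
                p.requires_grad_(trainable)
        self.step += 1

cms = ContinuumMemorySketch()
opt = torch.optim.Adam(cms.parameters(), lr=1e-3)
for _ in range(8):                             # toy training loop
    cms.set_trainable_blocks()
    loss = cms(torch.randn(2, 64)).pow(2).mean()
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()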

This construction turns long term memory into a spectrum of levels across frequency, instead of a single static feedforward layer. The research connects this directly to multi time scale synaptic and system consolidation processes in the brain, where different parts of the system learn at different rates while sharing a common architecture.

HOPE, A Self Modifying Architecture Built On Titans

To show that Nested Learning is practical, the research team designed HOPE, a self referential sequence model that applies the paradigm to a recurrent architecture. HOPE is built as a variant of Titans, a long term memory architecture where a neural memory module learns to memorize surprising events at test time and helps attention attend to long past tokens.

Titans has only 2 levels of parameter update, which yields first order in context learning. HOPE extends Titans in 2 ways. First, it is self modifying, it can optimize its own memory through a self referential process and can in principle support unbounded levels of in context learning. Second, it integrates Continuum Memory System blocks so that memory updates occur at multiple frequencies and scale to longer context windows.

Understanding the Results

The research team evaluates HOPE and baselines on language modeling and common sense reasoning tasks at 3 parameter scales, 340M, 760M, and 1.3B parameters. Benchmarks include Wiki and LMB perplexity for language modeling and PIQA, HellaSwag, WinoGrande, ARC Easy, ARC Challenge, Social IQa, and BoolQ accuracy for reasoning. Table 1 in the paper reports results for HOPE, Transformer++, RetNet, Gated DeltaNet, TTT, Samba, and Titans.

Key Takeaways

Nested Learning treats a model as multiple nested optimization problems with different update frequencies, which directly targets catastrophic forgetting in continual learning.

The framework reinterprets backpropagation, attention, and optimizers as associative memory modules that compress their own context flow, giving a unified view of architecture and optimization.

Deep optimizers in Nested Learning replace simple dot product similarity with richer objectives such as L2 regression and use neural memories, which leads to more expressive and context aware update rules.

The Continuum Memory System models memory as a spectrum of MLP blocks that update at different rates, creating short, medium, and long range memory rather than one static feedforward layer.

The HOPE architecture, a self modifying variant of Titans built using Nested Learning principles, shows improved language modeling, long context reasoning, and continual learning performance compared to strong Transformer and recurrent baselines.

Editorial Comments

Nested Learning is a useful reframing of deep networks as Neural Learning Modules that integrate architecture and optimization into one system. The introduction of Deep Momentum Gradient Descent, Continuum Memory System, and the HOPE architecture gives a concrete path to richer associative memory and better continual learning. Overall, this work turns continual learning from an afterthought into a primary design axis.

Check out the Paper and Technical Details.
The post Nested Learning: A New Machine Learning Approach for Continual Learning that Views Models as Nested Optimization Problems to Enhance Long Context Processing appeared first on MarkTechPost.

Prior Labs Releases TabPFN-2.5: The Latest Version of TabPFN that Unlocks Scale and Speed for Tabular Foundation Models

Tabular data is still where many important models run in production. Finance, healthcare, energy and industry teams work with tables of rows and columns, not images or long text. Prior Labs now extends this space with TabPFN-2.5, a new tabular foundation model that scales in context learning to 50,000 samples and 2,000 features while keeping a training free workflow.

https://priorlabs.ai/technical-reports/tabpfn-2-5-model-report

From TabPFN And TabPFNv2 To TabPFN-2.5

The first TabPFN showed that a transformer can learn a Bayesian like inference procedure on synthetic tabular tasks. It handled up to about 1,000 samples and clean numerical features. TabPFNv2 extended this to messy real world data. It added support for categorical features, missing values and outliers, and was practical up to 10,000 samples and 500 features.

TabPFN-2.5 is the next generation in this line. Prior Labs describes it as best for datasets with up to 50,000 samples and 2,000 features, which is a 5 times increase in rows and a 4 times increase in columns over TabPFNv2. That gives roughly 20 times more data cells in the supported regime. The model is exposed through the tabpfn Python package and also through an API.

Aspect                       TabPFN (v1)     TabPFNv2     TabPFN-2.5
Max Rows (recommended)       1,000           10,000       50,000
Max Features (recommended)   100             500          2,000
Supported data types         Numeric only    Mixed        Mixed

In Context Learning For Tables

TabPFN-2.5 follows the same prior data fitted network idea as earlier versions. It is a transformer based foundation model that uses in context learning to solve tabular prediction problems in a forward pass. At training time, the model is meta trained on large synthetic distributions of tabular tasks. At inference time, you pass training rows and labels and the test rows together. The model runs one forward pass and outputs predictions, so there is no dataset specific gradient descent or hyperparameter search.
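
In code, the workflow looks like ordinary scikit-learn style fit and predict, except that fit only stores the labeled rows as context and the actual work happens in the forward pass. The sketch below assumes the tabpfn Python package is installed and follows its classifier interface; exact arguments may vary by package version.

# pip install tabpfn   (assumed install path; see the Prior Labs docs for the current package)
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()            # no per-dataset training or hyperparameter search
clf.fit(X_train, y_train)           # "fit" stores the labeled rows as in-context examples
preds = clf.predict(X_test)         # a single forward pass produces the predictions
print(accuracy_score(y_test, preds))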

Benchmark Results On TabArena And RealCause

The research team uses the TabArena Lite benchmark to measure medium sized tasks up to 10,000 samples and 500 features. TabPFN-2.5 in a forward pass outperforms any other model in the comparison. When the Real-TabPFN-2.5 variant is fine tuned on real datasets, the lead increases further. AutoGluon 1.4 in extreme mode is the baseline ensemble, tuned for 4 hours and even including TabPFNv2.

On industry standard benchmarks with up to 50,000 data points and 2,000 features, TabPFN-2.5 substantially outperforms tuned tree based models such as XGBoost and CatBoost. On the same benchmarks it matches the accuracy of AutoGluon 1.4, which runs a complex four hour tuned ensemble that includes previous methods.

Model Architecture And Training Setup

The model architecture follows TabPFNv2 with alternating attention and 18 to 24 layers. Alternating attention means that the network attends along the sample axis and along the feature axis in separate stages, which enforces permutation invariance over rows and columns. This design is important for tabular data where the order of rows and the order of columns do not carry information.

The training setup keeps the prior data based learning idea. TabPFN-2.5 uses synthetic tabular tasks with different priors over functions and data distributions as its meta training source. Real-TabPFN-2.5 uses continued pre training on a set of real world tabular datasets from repositories like OpenML and Kaggle, while the team carefully avoids overlap with evaluation benchmarks.

Key Takeaways

TabPFN 2.5 scales prior data fitted tabular transformers to about 50,000 samples and 2,000 features while keeping a one forward pass, no tuning workflow.

The model is trained on synthetic tabular tasks and evaluated on TabArena, internal industry benchmarks and RealCause, where it substantially outperforms tuned tree based baselines and matches AutoGluon 1.4 on benchmarks in this size range.

TabPFN 2.5 keeps the TabPFNv2 style alternating attention transformer for rows and features, which enables permutation invariance over tables and in context learning without task specific training.

A distillation engine turns TabPFN 2.5 into compact MLP or tree ensemble students that preserve most of the accuracy while giving much lower latency and plug in deployment in existing tabular stacks.

Editorial Comments

TabPFN 2.5 is an important release for tabular machine learning because it turns model selection and hyperparameter tuning into a single forward pass workflow on datasets with up to 50,000 samples and 2,000 features. It combines synthetic meta training, Real-TabPFN-2.5 fine tuning and a distillation engine into MLP and TreeEns students, with a clear non commercial license and enterprise path. Overall, this release makes prior data fitted networks practical for real tabular problems.

Check out the Paper, Model Weights, Repo and Technical Details.
The post Prior Labs Releases TabPFN-2.5: The Latest Version of TabPFN that Unlocks Scale and Speed for Tabular Foundation Models appeared first on MarkTechPost.

Anthropic Turns MCP Agents Into Code First Systems With ‘Code Execution With MCP’ Approach

Agents that use the Model Context Protocol (MCP) have a scaling problem. Every tool definition and every intermediate result is pushed through the context window, which means large workflows burn tokens and hit latency and cost limits fast. Anthropic’s new ‘code execution with MCP’ pattern restructures this pipeline by turning MCP tools into code level APIs and asking the model to write and run code instead of calling tools directly.

The problem, MCP tools as direct model calls

MCP is an open standard that lets AI applications connect to external systems through MCP servers that expose tools. These tools let a model query databases, call APIs, or work with files through a unified interface.

In the default pattern, an agent loads many tool definitions into the model context. Each tool definition contains schema information and metadata. Intermediate results from each tool call are also streamed back into the context so the model can decide the next call.

Anthropic describes a typical case where an agent uses an MCP server for Google Drive to fetch a long sales meeting transcript and then uses another MCP server for Salesforce to update a record with that transcript. The full transcript is first returned through the model, then sent back again when the Salesforce tool is called. For a long meeting this can add tens of thousands of extra tokens that do not change the logic of the task.

When there are many MCP servers and many tools, this pattern does not scale. The model pays to read large tool catalogs and to move large payloads between tools. Latency increases, costs grow, and context limits become a hard cap on system behavior.

The shift, represent MCP servers as code APIs

Anthropic’s proposal is to place MCP inside a code execution loop. Instead of letting the model call tools directly, the MCP client exposes each server as a set of code modules in a filesystem. The model writes TypeScript code that imports and composes those modules, and this code runs in a sandboxed environment.

The pattern has three main steps.

The MCP client generates a directory such as servers that mirrors the available MCP servers and tools.

For each MCP tool, it creates a thin wrapper function implemented in a source file, for example servers/google-drive/getDocument.ts, that internally calls the MCP tool with typed parameters.

The model is instructed to write TypeScript code that imports these functions, runs them, and handles control flow and data movement inside the execution environment.

The earlier Google Drive and Salesforce workflow becomes a short script. The script calls the Google Drive wrapper once, manipulates or inspects the data locally, then calls the Salesforce wrapper. The large transcript does not pass through the model, only the final status and any small samples or summaries do.

Cloudflare’s ‘Code Mode’ work uses the same idea in its Workers platform. It converts MCP tools into TypeScript APIs and runs model generated code inside an isolate with restricted bindings.

Quantitative impact, token usage drops by 98.7 percent

Anthropic reports a concrete example. A workflow that previously consumed about 150,000 tokens when tools and intermediate data were passed directly through the model was reimplemented with code execution and filesystem based MCP APIs. The new pattern used about 2,000 tokens. That is a 98.7 percent reduction in token usage for that scenario, which also reduces cost and latency.

Design benefits for agent builders

Code execution with MCP introduces several practical benefits for engineers who design agents:

Progressive tool discovery: The agent does not need all tool definitions in context. It can explore the generated filesystem, list available servers, and read specific tool modules only when needed. This shifts tool catalogs from the model context into code, so tokens are spent only on relevant interfaces.

Context efficient data handling: Large datasets remain inside the execution environment. For example, TypeScript code can read a large spreadsheet through an MCP tool, filter rows, compute aggregates, and log only small samples and summary statistics back to the model. The model sees a compact view of the data while the heavy lifting happens in code.

Privacy preserving operations: Anthropic describes a pattern where sensitive fields such as email or phone are tokenized inside the execution environment. The model sees placeholders, while the MCP client maintains a secure mapping and restores real values when calling downstream tools. This lets data move between MCP servers without exposing raw identifiers to the model.

State and reusable skills: The filesystem lets agents store intermediate files and reusable scripts. A helper script that transforms a sheet into a report can be saved in a skills directory and imported in later sessions. Anthropic connects this idea to Claude Skills, where collections of scripts and metadata define higher level capabilities.

Editorial Comments

Anthropic’s ‘code execution with MCP’ approach is a sensible next step for MCP powered agents. It directly attacks the token costs of loading tool definitions and routing large intermediate results through the context, by presenting MCP servers as code APIs and pushing work into a sandboxed TypeScript runtime. This makes agents more efficient, while also forcing teams to take code execution security seriously. This launch turns MCP from a tool list into an executable API surface.
The post Anthropic Turns MCP Agents Into Code First Systems With ‘Code Execution With MCP’ Approach appeared first on MarkTechPost.

Google AI Releases ADK Go: A New Open-Source Toolkit Designed to Empower Go Developers to Build Powerful AI Agents

How do you build reliable AI agents that plug into your existing Go services without bolting on a separate language stack? Google has just released Agent Development Kit for Go. Go developers can now build AI agents with the same framework that already supports Python and Java, while keeping everything inside a familiar Go toolchain and deployment model.

For AI devs and backend developers who already use Go for services, this closes a gap. You no longer need a separate Python based stack for agents. You can express agent logic, orchestration, and tool use directly in Go code, then move the same agents into Vertex AI Agent Builder and Agent Engine when you are ready for production.

What the Agent Development Kit Provides

Agent Development Kit, or ADK, is an open source framework for developing and deploying AI agents. It is optimized for Gemini and Google Cloud, but the design is model agnostic and deployment agnostic.

In practical terms, ADK gives you:

A code first programming model where agent behavior, tools, and orchestration live in normal source files

Workflow agents for sequential, parallel, and loop style control flow inside an agent system

A rich tool ecosystem with built in tools, custom function tools, OpenAPI tools, Google Cloud tools, and ecosystem tools

Deployment paths that cover local runs, containers, Cloud Run, and Vertex AI Agent Engine

Built in evaluation and safety patterns, integrated with Vertex AI Agent Builder

For a developer, ADK turns an agent into a normal service. You run it locally, inspect traces, and deploy it to a managed runtime, instead of treating it as a one off script that calls an LLM.

What ADK for Go Adds

The Go release keeps the same core feature set as the Python and Java SDKs but exposes it through an idiomatic Go API. The Google AI team describes ADK for Go as an idiomatic and performant way to build agents that use Go concurrency and strong typing.

Here are some key points:

ADK for Go is installed with go get google.golang.org/adk

The project is open source and hosted at github.com/google/adk-go

It supports building, evaluating, and deploying sophisticated AI agents with flexibility and control

It uses the same abstractions for agents, tools, and workflows as the other ADK languages

This means a Go service can embed agent behavior without switching languages. You can build a multi agent architecture where each agent is a Go component that composes with others using the same framework.

A2A Protocol Support in Go

ADK for Go ships with native support for the Agent2Agent protocol, or A2A.

The A2A protocol defines a way for agents to call other agents over a standard interface. In the Go release, Google highlights that a primary agent can orchestrate and delegate tasks to specialized sub agents. Those sub agents can run locally or as remote deployments. A2A keeps these interactions secure and opaque, so an agent does not need to expose internal memory or proprietary logic to participate.

Google also contributed an A2A Go SDK to the main A2A project. That gives Go developers a protocol level entry point if they want agents that interoperate with other runtimes and frameworks that also support A2A.

MCP Toolbox for Databases and Tooling

A key detail in the official Google announcement is native integration with MCP Toolbox for Databases. It states that ADK Go has out of the box support for more than 30 databases through this toolbox.

MCP Toolbox for Databases is an open source MCP server for databases. It handles connection pooling, authentication, and other concerns, and exposes database operations as tools using the Model Context Protocol.

Within ADK, that means:

You register MCP Toolbox for Databases as an MCP tool provider

The agent calls database operations through MCP tools rather than constructing raw SQL

The toolbox enforces a set of safe, predefined actions that the agent can perform

This fits the ADK model for tools in general, where agents use a mix of built in tools, Google Cloud tools, ecosystem tools, and MCP tools, all described in the Vertex AI Agent Builder documentation.

Integration with Vertex AI Agent Builder and Agent Engine

ADK is the primary framework supported in Vertex AI Agent Builder for building multi agent systems.

The latest Agent Builder updates describe a build path where you:

Develop the agent locally using ADK, now including ADK for Go

Use the ADK quickstart and dev UI to test the agent with multiple tools

Deploy the agent to Vertex AI Agent Engine as a managed runtime

For Go teams, this means the language used in services and infrastructure is now available across the full agent lifecycle, from local development to managed production deployment.

Editorial Comments

This launch positions Agent Development Kit for Go as a practical bridge between AI agents and existing Go services, using the same open source, code first toolkit that underpins Python and Java agents. It brings A2A protocol support and MCP Toolbox for Databases into a Go native environment, aligned with Vertex AI Agent Builder and Vertex AI Agent Engine for deployment, evaluation, and observability. Overall, this release makes Go a first class language for building production ready, interoperable AI agents in Google’s ecosystem.

Check out the Repo, Samples and Technical details.
The post Google AI Releases ADK Go: A New Open-Source Toolkit Designed to Empower Go Developers to Build Powerful AI Agents appeared first on MarkTechPost.

Why Spatial Supersensing is Emerging as the Core Capability for Multimodal AI Systems?

Even strong ‘long-context’ AI models fail badly when they must track objects and counts over long, messy video streams. The next competitive edge will therefore come from models that predict what comes next and selectively remember only surprising, important events, not from simply buying more compute and bigger context windows. A team of researchers from New York University and Stanford introduces Cambrian-S, a spatially grounded video multimodal large language model family, together with the VSI Super benchmark and the VSI 590K dataset to test and train spatial supersensing in long videos.

https://arxiv.org/pdf/2511.04670

From video question answering to spatial supersensing

The research team frames spatial supersensing as a progression of capabilities beyond linguistic only reasoning. The stages are semantic perception, streaming event cognition, implicit 3D spatial cognition and predictive world modeling.

Most current video MLLMs sample sparse frames and rely on language priors. They often answer benchmark questions using captions or single frames rather than continuous visual evidence. Diagnostic tests show that several popular video benchmarks are solvable with limited or text only input, so they do not strongly test spatial sensing.

Cambrian-S targets the higher stages of this hierarchy, where the model must remember spatial layouts across time, reason about object locations and counts and anticipate changes in a 3D world.

VSI Super, a stress test for continual spatial sensing

To expose the gap between current systems and spatial supersensing, the research team designed VSI Super, a two part benchmark that runs on arbitrarily long indoor videos.

VSI Super Recall, or VSR, evaluates long horizon spatial observation and recall. Human annotators take indoor walkthrough videos from ScanNet, ScanNet++ and ARKitScenes and use Gemini to insert an unusual object, such as a Teddy Bear, into four frames at different spatial locations. These edited sequences are concatenated into streams up to 240 minutes. The model must report the order of locations where the object appears, which is a visual needle in a haystack task with sequential recall.

VSI Super Count, or VSC, measures continual counting under changing viewpoints and rooms. The benchmark concatenates room tour clips from VSI Bench and asks for the total number of instances of a target object across all rooms. The model must handle viewpoint changes, revisits and scene transitions and maintain a cumulative count. Evaluation uses mean relative accuracy for durations from 10 to 120 minutes.

When Cambrian-S 7B is evaluated on VSI Super in a streaming setup at 1 frame per second, accuracy on VSR drops from 38.3 percent at 10 minutes to 6.0 percent at 60 minutes and becomes zero beyond 60 minutes. VSC accuracy is near zero across lengths. Gemini 2.5 Flash also degrades on VSI Super despite a long context window, which shows that brute force context scaling is not sufficient for continual spatial sensing.

VSI 590K, spatially focused instruction data

To test whether data scaling can help, the research team construct VSI 590K, a spatial instruction corpus with 5,963 videos, 44,858 images and 590,667 question answer pairs from 10 sources.

Sources include 3D annotated real indoor scans such as ScanNet, ScanNet++ V2, ARKitScenes, S3DIS and Aria Digital Twin, simulated scenes from ProcTHOR and Hypersim and pseudo annotated web data such as YouTube RoomTour and robot datasets Open X Embodiment and AgiBot World.

The dataset defines 12 spatial question types, such as object count, absolute and relative distance, object size, room size and appearance order. Questions are generated from 3D annotations or reconstructions so that spatial relationships are grounded in geometry rather than text heuristics. Ablations show that annotated real videos contribute the largest gains on VSI Bench, followed by simulated data and then pseudo annotated images and that training on the full mix gives the best spatial performance.

Cambrian-S model family and spatial performance

Cambrian-S builds on Cambrian-1 and uses Qwen2.5 language backbones at 0.5B, 1.5B, 3B and 7B parameters with a SigLIP2 SO400M vision encoder and a two layer MLP connector.

Training follows a four stage pipeline. Stage 1 performs vision language alignment on image text pairs. Stage 2 applies image instruction tuning, equivalent to the improved Cambrian-1 setup. Stage 3 extends to video with general video instruction tuning on a 3 million sample mixture called Cambrian-S 3M. Stage 4 performs spatial video instruction tuning on a mixture of VSI 590K and a subset of the stage 3 data.

On VSI Bench, Cambrian-S 7B reaches 67.5 percent accuracy and outperforms open source baselines like InternVL3.5 8B and Qwen VL 2.5 7B as well as proprietary Gemini 2.5 Pro by more than 16 absolute points. The model also maintains strong performance on Perception Test, EgoSchema and other general video benchmarks, so the focus on spatial sensing does not destroy general capabilities.

Predictive sensing with latent frame prediction and surprise

To go beyond static context expansion, the research team propose predictive sensing. They add a Latent Frame Prediction head, which is a two layer MLP that predicts the latent representation of the next video frame in parallel with next token prediction.

Training modifies stage 4. The model uses mean squared error and cosine distance losses between predicted and ground truth latent features, weighted against the language modeling loss. A subset of 290,000 videos from VSI 590K, sampled at 1 frame per second, is reserved for this objective. During this stage the connector, language model and both output heads are trained jointly, while the SigLIP vision encoder remains frozen.

At inference time the cosine distance between predicted and actual features becomes a surprise score. Frames with low surprise are compressed before being stored in long term memory and high surprise frames are retained with more detail. A fixed size memory buffer uses surprise to decide which frames to consolidate or drop and queries retrieve frames that are most relevant to the question.
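
A simplified sketch of the surprise computation and memory gating described above. The threshold, the pooling-based compression, and the buffer policy are assumptions for illustration, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def surprise_score(predicted_feat, actual_feat):
    """Cosine distance between the predicted and observed frame features."""
    return 1.0 - F.cosine_similarity(predicted_feat, actual_feat, dim=-1)

def update_memory(memory, frame_feat, surprise, threshold=0.3, max_frames=512):
    """Keep surprising frames in detail, compress unsurprising ones (sketch)."""
    if surprise.item() < threshold:
        frame_feat = F.avg_pool1d(frame_feat.unsqueeze(0), kernel_size=4).squeeze(0)  # compress
    memory.append((surprise.item(), frame_feat))
    if len(memory) > max_frames:
        memory.sort(key=lambda item: item[0])  # drop the least surprising frame first
        memory.pop(0)
    return memory

memory = []
prev_pred = torch.randn(256)                   # hypothetical predicted next-frame feature
for _ in range(5):
    frame = torch.randn(256)                   # hypothetical observed frame feature
    memory = update_memory(memory, frame, surprise_score(prev_pred, frame))
    prev_pred = torch.randn(256)               # the real prediction would come from the LFP head
print(len(memory), [round(s, 3) for s, _ in memory])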

For VSR, this surprise driven memory system lets Cambrian-S maintain accuracy as video length increases while keeping GPU memory usage stable. It outperforms Gemini 1.5 Flash and Gemini 2.5 Flash on VSR at all tested durations and avoids the sharp degradation seen in models that only extend context.

For VSC, the research team designed a surprise driven event segmentation scheme. The model accumulates features in an event buffer and when a high surprise frame signals a scene change, it summarizes that buffer into a segment level answer and resets the buffer. Aggregating segment answers gives the final count. In streaming evaluation, Gemini Live and GPT Realtime achieve less than 15 percent mean relative accuracy and drop near zero on 120 minute streams, while Cambrian-S with surprise segmentation reaches about 38 percent at 10 minutes and maintains around 28 percent at 120 minutes.

Key Takeaways

Cambrian-S and VSI 590K show that careful spatial data design and strong video MLLMs can significantly improve spatial cognition on VSI Bench, but they still fail on VSI Super, so scale alone does not solve spatial supersensing.

VSI Super, through VSR and VSC, is intentionally built from arbitrarily long indoor videos to stress continual spatial observation, recall and counting, which makes it resistant to brute force context window expansion and standard sparse frame sampling.

Benchmarking shows that frontier models, including Gemini 2.5 Flash and Cambrian S, degrade sharply on VSI Super even when video lengths remain within their nominal context limits, revealing a structural weakness in current long context multimodal architectures.

The Latent Frame Prediction based predictive sensing module uses next latent frame prediction error, or surprise, to drive memory compression and event segmentation, which yields substantial gains on VSI Super compared to long context baselines while keeping GPU memory usage stable.

The research work positions spatial supersensing as a hierarchy from semantic perception to predictive world modeling and argues that future video MLLMs must incorporate explicit predictive objectives and surprise driven memory, not only larger models and datasets, to handle unbounded streaming video in real applications.

Editorial Comments

Cambrian-S is a useful stress test of current video MLLMs because it shows that VSI SUPER is not just a harder benchmark, it exposes a structural failure of long context architectures that still rely on reactive perception. The predictive sensing module, based on Latent Frame Prediction and surprise driven memory, is an important step because it couples spatial sensing with internal world modeling rather than only scaling data and parameters. This research signals a shift from passive video understanding to predictive spatial supersensing as the next design target for multimodal models.

Check out the Paper.
The post Why Spatial Supersensing is Emerging as the Core Capability for Multimodal AI Systems? appeared first on MarkTechPost.

Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

Large language models are now limited less by training and more by how fast and cheaply we can serve tokens under real traffic. That comes down to three implementation details: how the runtime batches requests, how it overlaps prefill and decode, and how it stores and reuses the KV cache. Different engines make different tradeoffs on these axes, which show up directly as differences in tokens per second, P50/P99 latency, and GPU memory usage.

This article compares six runtimes that show up repeatedly in production stacks:

vLLM

TensorRT LLM

Hugging Face Text Generation Inference (TGI v3)

LMDeploy

SGLang

DeepSpeed Inference / ZeRO Inference

1. vLLM

Design

vLLM is built around PagedAttention. Instead of storing each sequence’s KV cache in a large contiguous buffer, it partitions KV into fixed size blocks and uses an indirection layer so each sequence points to a list of blocks.

This gives:

Very low KV fragmentation (reported <4% waste vs 60–80% in naïve allocators)

High GPU utilization with continuous batching

Native support for prefix sharing and KV reuse at block level

Recent versions add KV quantization (FP8) and integrate FlashAttention style kernels.
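
The core bookkeeping can be sketched in a few lines: KV memory is carved into fixed size blocks, and each sequence keeps a block table that maps logical token positions to physical blocks. This is a schematic of the indirection under simplified shapes, not vLLM's actual allocator.

import torch

BLOCK_SIZE = 16                                   # tokens per KV block
NUM_BLOCKS = 1024
HEAD_DIM = 128

kv_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, 2, HEAD_DIM)   # shared physical KV storage
free_blocks = list(range(NUM_BLOCKS))
block_tables = {}                                 # sequence id -> list of physical block ids

def append_kv(seq_id, position, k, v):
    """Write the KV for one token, allocating a new block only at block boundaries."""
    table = block_tables.setdefault(seq_id, [])
    if position % BLOCK_SIZE == 0:                # crossed a block boundary
        table.append(free_blocks.pop())           # grab any free physical block
    block_id = table[position // BLOCK_SIZE]
    kv_pool[block_id, position % BLOCK_SIZE, 0] = k
    kv_pool[block_id, position % BLOCK_SIZE, 1] = v

def read_kv(seq_id, position):
    block_id = block_tables[seq_id][position // BLOCK_SIZE]
    return kv_pool[block_id, position % BLOCK_SIZE]

for pos in range(40):                             # a 40-token sequence occupies only 3 blocks
    append_kv("seq-0", pos, torch.randn(HEAD_DIM), torch.randn(HEAD_DIM))
print(block_tables["seq-0"], read_kv("seq-0", 39).shape)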

Performance

vLLM evaluation:

vLLM achieves 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.

KV and memory behavior

PagedAttention provides a KV layout that is both GPU friendly and fragmentation resistant.

FP8 KV quantization reduces KV size and improves decode throughput when compute is not the bottleneck.

Where it fits

Default high performance engine when you need a general LLM serving backend with good throughput, good TTFT, and hardware flexibility.

2. TensorRT LLM

Design

TensorRT LLM is a compilation based engine on top of NVIDIA TensorRT. It generates fused kernels per model and shape, and exposes an executor API used by frameworks such as Triton.

Its KV subsystem is explicit and feature rich:

Paged KV cache

Quantized KV cache (INT8, FP8, with some combinations still evolving)

Circular buffer KV cache

KV cache reuse, including offloading KV to CPU and reusing it across prompts to reduce TTFT

NVIDIA reports that CPU based KV reuse can reduce time to first token by up to 14× on H100 and even more on GH200 in specific scenarios.

Performance

TensorRT LLM is highly tunable, so results vary. Common patterns from public comparisons and vendor benchmarks:

Very low single request latency on NVIDIA GPUs when engines are compiled for the exact model and configuration.

At moderate concurrency, it can be tuned either for low TTFT or for high throughput; at very high concurrency, throughput optimized profiles push P99 up due to aggressive batching.

KV and memory behavior

Paged KV plus quantized KV gives strong control over memory use and bandwidth.

Executor and memory APIs let you design cache aware routing policies at the application layer.

Where it fits

Latency critical workloads and NVIDIA only environments, where teams can invest in engine builds and per model tuning.

3. Hugging Face TGI v3

Design

Text Generation Inference (TGI) is a server focused stack with:

Rust based HTTP and gRPC server

Continuous batching, streaming, safety hooks

Backends for PyTorch and TensorRT and tight Hugging Face Hub integration

TGI v3 adds a new long context pipeline:

Chunked prefill for long inputs

Prefix KV caching so long conversation histories are not recomputed on each request

Performance

For conventional prompts, recent third party work shows:

vLLM often edges out TGI on raw tokens per second at high concurrency due to PagedAttention, but the difference is not huge on many setups.

TGI v3 processes around 3× more tokens and is up to 13× faster than vLLM on long prompts, under a setup with very long histories and prefix caching enabled.

Latency profile:

P50 for short and mid length prompts is similar to vLLM when both are tuned with continuous batching.

For long chat histories, prefill dominates in naive pipelines; TGI v3’s reuse of earlier tokens gives a large win in TTFT and P50.

KV and memory behavior

TGI uses KV caching with paged attention style kernels and reduces memory footprint through chunking of prefill and other runtime changes.

It integrates quantization through bitsandbytes and GPTQ and runs across several hardware backends.

Where it fits

Production stacks already on Hugging Face, especially for chat style workloads with long histories where prefix caching gives large real world gains.
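
For a sense of the client side, here is a small sketch that queries a running TGI server (assumed to be listening on localhost:8080, for example launched from the official Docker image) using the huggingface_hub client; the prompt and parameters are illustrative.

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# With TGI v3, repeated requests that share a long conversation prefix hit the
# prefix KV cache, so the shared history is not re-prefilled on every call.
history = "System: You are a support assistant.\nUser: Summarize our previous conversation.\n"
reply = client.text_generation(history, max_new_tokens=200, temperature=0.7)
print(reply)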

4. LMDeploy

Design

LMDeploy is a toolkit for compression and deployment from the InternLM ecosystem. It exposes two engines:

TurboMind: high performance CUDA kernels for NVIDIA GPUs

PyTorch engine: flexible fallback

Key runtime features:

Persistent, continuous batching

Blocked KV cache with a manager for allocation and reuse

Dynamic split and fuse for attention blocks

Tensor parallelism

Weight only and KV quantization (including AWQ and online INT8 / INT4 KV quant)

The LMDeploy team reports up to 1.8× higher request throughput than vLLM, attributing this to persistent batching, blocked KV cache, and optimized kernels.

Performance

Evaluations show:

For 4 bit Llama style models on A100, LMDeploy can reach higher tokens per second than vLLM under comparable latency constraints, especially at high concurrency.

It also reports that 4 bit inference is about 2.4× faster than FP16 for supported models.

Latency:

Single request TTFT is in the same ballpark as other optimized GPU engines when configured without extreme batch limits.

Under heavy concurrency, persistent batching plus blocked KV let LMDeploy sustain high throughput without TTFT collapse.

KV and memory behavior

Blocked KV cache trades contiguous per sequence buffers for a grid of KV chunks managed by the runtime, similar in spirit to vLLM’s PagedAttention but with a different internal layout.

Support for weight and KV quantization targets large models on constrained GPUs.

Where it fits

NVIDIA centric deployments that want maximum throughput and are comfortable using TurboMind and LMDeploy specific tooling.
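
The following is a minimal LMDeploy sketch using the TurboMind-backed pipeline API; the model name and the quantization settings are assumptions for illustration and should be adapted to your deployment.

from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    cache_max_entry_count=0.8,  # share of free GPU memory reserved for the blocked KV cache
    quant_policy=8,             # online INT8 KV quantization
)
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)

responses = pipe(["Why does blocked KV caching help under high concurrency?"])
print(responses[0].text)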

5. SGLang

Design

SGLang is both:

A DSL for building structured LLM programs such as agents, RAG workflows and tool pipelines

A runtime that implements RadixAttention, a KV reuse mechanism that shares prefixes using a radix tree structure rather than simple block hashes.

RadixAttention:

Stores KV for many requests in a prefix tree keyed by tokens

Enables high KV hit rates when many calls share prefixes, such as few shot prompts, multi turn chat, or tool chains

Performance

Key insights from the SGLang evaluation:

SGLang achieves up to 6.4× higher throughput and up to 3.7× lower latency than baseline systems such as vLLM, LMQL and others on structured workloads.

Improvements are largest when there is heavy prefix reuse, for example multi turn chat or evaluation workloads with repeated context.

Reported KV cache hit rates range from roughly 50% to 99%, and cache aware schedulers get close to the optimal hit rate on the measured benchmarks.

KV and memory behavior

RadixAttention sits on top of paged attention style kernels and focuses on reuse rather than just allocation.

SGLang integrates well with hierarchical context caching systems that move KV between GPU and CPU when sequences are long, although those systems are usually implemented as separate projects.

Where it fits

Agentic systems, tool pipelines, and heavy RAG applications where many calls share large prompt prefixes and KV reuse matters at the application level.
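
As a sketch of how prefix reuse shows up in practice, the SGLang frontend program below sends two calls that share the same system prefix to a local SGLang runtime (assumed to be running on localhost:30000); RadixAttention can then serve most of the second call's prefill from the cached KV of the first.

import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def classify(s, ticket):
    # The shared instruction prefix below is what lands in the KV radix tree.
    s += "You are a support triage assistant. Categories: billing, bug, feature.\n"
    s += "Ticket: " + ticket + "\n"
    s += "Category: " + sgl.gen("category", max_tokens=8)

out_a = classify.run(ticket="I was charged twice this month.")
out_b = classify.run(ticket="The export button crashes the app.")
print(out_a["category"], out_b["category"])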

6. DeepSpeed Inference / ZeRO Inference

Design

DeepSpeed provides two pieces relevant for inference:

DeepSpeed Inference: optimized transformer kernels plus tensor and pipeline parallelism

ZeRO Inference / ZeRO Offload: techniques that offload model weights, and in some setups KV cache, to CPU or NVMe to run very large models on limited GPU memory

ZeRO Inference focuses on:

Keeping little or no model weights resident in GPU

Streaming tensors from CPU or NVMe as needed

Targeting throughput and model size rather than low latency

Performance

In the ZeRO Inference OPT 30B example on a single V100 32GB:

Full CPU offload reaches about 43 tokens per second

Full NVMe offload reaches about 30 tokens per second

Both are 1.3–2.4× faster than partial offload configurations, because full offload enables larger batch sizes

These numbers are small compared to GPU resident LLM runtimes on A100 or H100, but they apply to a model that does not fit natively in 32GB.

A recent I/O characterization of DeepSpeed and FlexGen confirms that offload based systems are dominated by small 128 KiB reads and that I/O behavior becomes the main bottleneck.

KV and memory behavior

Model weights and sometimes KV blocks are offloaded to CPU or SSD to fit models beyond GPU capacity.

TTFT and P99 are high compared to pure GPU engines, but the tradeoff is the ability to run very large models that otherwise would not fit.

Where it fits

Offline or batch inference, or low QPS services where model size matters more than latency and GPU count.
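
A hedged sketch of what a ZeRO-Inference style setup looks like: ZeRO stage 3 with parameters offloaded to NVMe so a model larger than GPU memory can run. The model, paths, and config values here are placeholders, and the exact keys should be checked against the DeepSpeed config schema for your version.

import deepspeed
import torch
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Stream parameters from NVMe instead of keeping them resident in GPU memory
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
    },
}

model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b", torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()  # from here on, run throughput-oriented batch generation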

Comparison Tables

This table summarizes the main tradeoffs qualitatively:

Runtime | Main design idea | Relative strength | KV strategy | Typical use case
vLLM | PagedAttention, continuous batching | High tokens per second at a given TTFT | Paged KV blocks, FP8 KV support | General purpose GPU serving, multi hardware
TensorRT LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA only, latency sensitive
TGI v3 | HF serving layer with long prompt path | Strong long prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF centric APIs, long chat histories
LMDeploy | TurboMind kernels, blocked KV, quant | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quant | NVIDIA deployments focused on raw throughput
SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix tree KV reuse over prefixes | Agents, RAG, high prefix reuse
DeepSpeed | GPU CPU NVMe offload for huge models | Enables large models on small GPUs; throughput oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS

Choosing a runtime in practice

For a production system, the choice tends to collapse to a few simple patterns:

You want a strong default engine with minimal custom work: You can start with vLLM. It gives you good throughput, reasonable TTFT, and solid KV handling on common hardware.

You are committed to NVIDIA and want fine grained control over latency and KV: You can use TensorRT LLM, likely behind Triton or TGI. Plan for model specific engine builds and tuning.

Your stack is already on Hugging Face and you care about long chats: You can use TGI v3. Its long prompt pipeline and prefix caching are very effective for conversation style traffic.

You want maximum throughput per GPU with quantized models: You can use LMDeploy with TurboMind and blocked KV, especially for 4 bit Llama family models.

You are building agents, tool chains or heavy RAG systems: You can use SGLang and design prompts so that KV reuse via RadixAttention is high.

You must run very large models on limited GPUs: You can use DeepSpeed Inference / ZeRO Inference, accept higher latency, and treat the GPU as a throughput engine with SSD in the loop.

Overall, all these engines are converging on the same idea: KV cache is the real bottleneck resource. The winners are the runtimes that treat KV as a first class data structure to be paged, quantized, reused and offloaded, not just a big tensor slapped into GPU memory.

Connect Amazon Bedrock agents to cross-account knowledge bases

Organizations need seamless access to their structured data repositories to power intelligent AI agents. However, when these resources span multiple AWS accounts, integration challenges can arise. This post explores a practical solution for connecting Amazon Bedrock agents to knowledge bases in Amazon Redshift clusters residing in different AWS accounts.
The challenge
Organizations that build AI agents using Amazon Bedrock can maintain their structured data in Amazon Redshift clusters. When these data repositories exist in separate AWS accounts from their AI agents, they face a significant limitation: Amazon Bedrock Knowledge Bases doesn’t natively support cross-account Redshift integration.
This creates a challenge for enterprises with multi-account architectures who want to:

Leverage existing structured data in Redshift for their AI agents.
Maintain separation of concerns across different AWS accounts.
Avoid duplicating data across accounts.
Ensure proper security and access controls.

Solution overview
Our solution enables cross-account knowledge base integration through a secure, serverless architecture that maintains secure access controls while allowing AI agents to query structured data. The approach uses AWS Lambda as an intermediary to facilitate secure cross-account data access.

The action flow, as shown in the preceding diagram, is as follows:

Users enter their natural language questions in the Amazon Bedrock agent, which is configured in the agent account.
The agent invokes a Lambda function through an action group, which provides access to the Amazon Bedrock knowledge base configured in the agent-kb account.
The action group Lambda function running in the agent account assumes an IAM role created in the agent-kb account to connect to the knowledge base in that account (see the sketch after this list).
The Amazon Bedrock knowledge base in the agent-kb account uses an IAM role created in the same account to access the Amazon Redshift data warehouse and query its data.
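
The following is an illustrative sketch (not the repository's actual Lambda code) of the cross-account hop in steps 3 and 4: the function assumes bedrock_kb_access_role in the agent-kb account and then queries the knowledge base there. The account IDs, Region, knowledge base ID, and model ARN are placeholders.

import boto3

# Assume the knowledge base access role in the agent-kb account (placeholder account ID)
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::999999999999:role/bedrock_kb_access_role",
    RoleSessionName="cross-account-kb-query",
)["Credentials"]

# Create a Bedrock agent runtime client that acts in the agent-kb account
kb_runtime = boto3.client(
    "bedrock-agent-runtime",
    region_name="us-west-2",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Query the structured knowledge base (placeholder knowledge base ID and model ARN)
response = kb_runtime.retrieve_and_generate(
    input={"text": "Who are the top 5 customers in Saudi Arabia?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "XXXXXXXXXX",
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/meta.llama3-1-70b-instruct-v1:0",
        },
    },
)
print(response["output"]["text"])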

The solution follows these key components:

Amazon Bedrock agent in the agent account that handles user interactions.
Amazon Redshift serverless workgroup in VPC and private subnet in the agent-kb account containing structured data.
Amazon Bedrock Knowledge base using the Amazon Redshift serverless workgroup as structured data source.
Lambda function in the agent account.
Action group configuration to connect the agent in the agent account to the Lambda function.
IAM roles and policies that enable secure cross-account access.

Prerequisites
This solution requires you to have the following:

Two AWS accounts. Create an AWS account if you do not have one. The specific permissions required in both accounts are set up in subsequent steps.
Install the AWS CLI (2.24.22 – current version)
Set up authentication using IAM user credentials for the AWS CLI for each account
Make sure you have jq installed; jq is a lightweight command-line JSON processor. For example, on macOS you can install it with the command brew install jq (jq-1.7.1-apple at the time of writing).
Navigate to the Amazon Bedrock console and make sure you enable access to the meta.llama3-1-70b-instruct-v1:0 model in the agent-kb account and to the us.amazon.nova-pro-v1:0 model in the agent account, both in the US West (Oregon) us-west-2 AWS Region.

Assumption
Let's call the AWS account profile that has the Amazon Bedrock agent the agent profile. Similarly, let the AWS account profile that has the Amazon Bedrock knowledge base with Amazon Redshift Serverless and the structured data source be called agent-kb. We will use the US West (Oregon) us-west-2 AWS Region, but feel free to choose another Region as necessary (the prerequisites apply to whichever AWS Region you deploy this solution in). We will use the meta.llama3-1-70b-instruct-v1:0 model for the agent-kb account; this model is available on demand in us-west-2. You are free to choose other models with cross-Region inference, but that means changing the roles and policies accordingly and enabling model access in all Regions where they are available. Based on our model choice for this solution, the AWS Region must be us-west-2. For the agent, we will use an Amazon Bedrock agent-optimized model such as us.amazon.nova-pro-v1:0.
Implementation walkthrough
The following is a step-by-step implementation guide. Make sure to perform all steps in the same AWS Region in both accounts.
These steps deploy and test an end-to-end solution from scratch; if you are already running some of these components, you can skip those steps.

Make a note of the AWS account numbers of the agent and agent-kb accounts. In the implementation steps we will refer to them as follows:

Profile | AWS account | Description
agent | 111122223333 | Account for the Bedrock agent
agent-kb | 999999999999 | Account for the Bedrock knowledge base

Note: These steps use example profile names and account numbers; replace them with your actual values before running.
Create the Amazon Redshift Serverless workgroup in the agent-kb account:

Log on to the agent-kb account
Follow the workshop link to create the Amazon Redshift Serverless workgroup in private subnet
Make a note of the namespace, workgroup, and other details and follow the rest of the hands-on workshop instructions.

Set up your data warehouse in the agent-kb account.
Create your AI knowledge base in the agent-kb account. Make a note of the knowledge base ID.
Train your AI Assistant in the agent-kb account.
Test natural language queries in the agent-kb account. You can find the code in aws-samples git repository: sample-for-amazon-bedrock-agent-connect-cross-account-kb.
Create the necessary roles and policies in both accounts. Run the script create_bedrock_agent_kb_roles_policies.sh with the following input parameters.

Input parameter | Value | Description
--agent-kb-profile | agent-kb | The agent knowledge base profile that you set up with the AWS CLI using aws_access_key_id and aws_secret_access_key, as mentioned in the prerequisites
--lambda-role | lambda_bedrock_kb_query_role | The IAM role in the agent account that the Bedrock agent action group Lambda assumes to connect to Redshift cross-account
--kb-access-role | bedrock_kb_access_role | The IAM role in the agent-kb account that lambda_bedrock_kb_query_role in the agent account assumes to connect to Redshift cross-account
--kb-access-policy | bedrock_kb_access_policy | IAM policy attached to the IAM role bedrock_kb_access_role
--lambda-policy | lambda_bedrock_kb_query_policy | IAM policy attached to the IAM role lambda_bedrock_kb_query_role
--knowledge-base-id | XXXXXXXXXX | Replace with the actual knowledge base ID created in Step 4
--agent-account | 111122223333 | Replace with the 12-digit AWS account number where the Bedrock agent is running (agent account)
--agent-kb-account | 999999999999 | Replace with the 12-digit AWS account number where the Bedrock knowledge base is running (agent-kb account)

Download the script (create_bedrock_agent_kb_roles_policies.sh) from the aws-samples GitHub repository.
Open Terminal on macOS or a similar bash shell on other platforms.
Locate and change the directory to the downloaded location, provide executable permissions:

cd /my/location
chmod +x create_bedrock_agent_kb_roles_policies.sh

If you are still not clear on the script usage or inputs, you can run the script with the --help option and it will display the usage: ./create_bedrock_agent_kb_roles_policies.sh --help
Run the script with the right input parameters as described in the previous table.

./create_bedrock_agent_kb_roles_policies.sh --agent-profile agent \
  --agent-kb-profile agent-kb \
  --lambda-role lambda_bedrock_kb_query_role \
  --kb-access-role bedrock_kb_access_role \
  --kb-access-policy bedrock_kb_access_policy \
  --lambda-policy lambda_bedrock_kb_query_policy \
  --knowledge-base-id XXXXXXXXXX \
  --agent-account 111122223333 \
  --agent-kb-account 999999999999

On successful execution, the script shows a summary of the IAM roles and policies created in both accounts.
Log on to both the agent and agent-kb accounts to verify that the IAM roles and policies were created.

For the agent account: Make a note of the ARN of the lambda_bedrock_kb_query_role as that will be the value of CloudFormation stack parameter AgentLambdaExecutionRoleArn in the next step.
For the agent-kb account: Make a note of the ARN of the bedrock_kb_access_role as that will be the value of CloudFormation stack parameter TargetRoleArn in the next step.

Run the AWS CloudFormation script to create a Bedrock agent:

Download the CloudFormation script: cloudformation_bedrock_agent_kb_query_cross_account.yaml from the aws-samples GitHub repository.
Log on to the agent account and navigate to the CloudFormation console, and verify you are in the us-west-2 (Oregon) Region, choose Create stack and choose With new resources (standard).
In the Specify template section, choose Upload a template file, then Choose file, and select the file downloaded in step 1. Then, choose Next.
Enter the following stack details and choose Next.

Parameter | Value | Description
Stack name | bedrock-agent-connect-kb-cross-account-agent | You can choose any name
AgentFoundationModelId | us.amazon.nova-pro-v1:0 | Do not change
AgentLambdaExecutionRoleArn | arn:aws:iam::111122223333:role/lambda_bedrock_kb_query_role | Replace with your agent account number
BedrockAgentDescription | Agent to query inventory data from Redshift Serverless database | Keep this as default
BedrockAgentInstructions | You are an assistant that helps users query inventory data from our Redshift Serverless database using the action group. | Do not change
BedrockAgentName | bedrock_kb_query_cross_account | Keep this as default
KBFoundationModelId | meta.llama3-1-70b-instruct-v1:0 | Do not change
KnowledgeBaseId | XXXXXXXXXX | Knowledge base ID from Step 4
TargetRoleArn | arn:aws:iam::999999999999:role/bedrock_kb_access_role | Replace with your agent-kb account number

Complete the acknowledgement and choose Next.
Scroll down through the page and choose Submit.
You will see the CloudFormation stack being created, as shown by the status CREATE_IN_PROGRESS.
It will take a few minutes, and you will see the status change to CREATE_COMPLETE indicating creation of all resources. Choose the Outputs tab to make a note of the resources that were created. In summary, the CloudFormation script does the following in the agent account.

Creates a Bedrock agent
Creates an action group
Also creates a Lambda function which is invoked by the Bedrock action group
Defines the OpenAPI schema
Creates necessary roles and permissions for the Bedrock agent
Finally, it prepares the Bedrock agent so that it is ready to test.

Check for model access in Oregon (us-west-2)

Verify Nova Pro (us.amazon.nova-pro-v1:0) model access in the agent account. Navigate to the Amazon Bedrock console and choose Model access under Configure and learn. Search for Model name : Nova Pro to verify access. If not, then enable model access.
Verify access to the meta.llama3-1-70b-instruct-v1:0 model in the agent-kb account. This should already be enabled as we set up the knowledge base earlier.

Run the agent. Log on to the agent account, navigate to the Amazon Bedrock console, and choose Agents under Build.
Choose the name of the agent and choose Test. You can test the following questions, as mentioned in the workshop's Stage 4: Test Natural Language Queries page. For example:

Who are the top 5 customers in Saudi Arabia?
Who are the top parts suppliers in the United States by volume?
What is the total revenue by region for the year 1998?
Which products have the highest profit margins?
Show me orders with the highest priority from the last quarter of 1997.

Choose Show trace to investigate the agent traces.

Some recommended best practices:

Phrase your question to be more specific
Use terminology that matches your table descriptions
Try questions similar to your curated examples
Verify your question relates to data that exists in the TPCH dataset
Use Amazon Bedrock Guardrails to add configurable safeguards to questions and responses.

Clean up resources
It is recommended that you clean up any resources you do not need anymore to avoid any unnecessary charges:

Navigate to the CloudFormation console for the agent and agent-kb accounts, search for the stack, and choose Delete.
S3 buckets need to be deleted separately.
For deleting the roles and policies created in both accounts, download the script delete-bedrock-agent-kb-roles-policies.sh from the aws-samples GitHub repository.

Open Terminal on macOS or a similar bash shell on other platforms.
Locate and change the directory to the downloaded location, provide executable permissions:

cd /my/location
chmod +x delete-bedrock-agent-kb-roles-policies.sh

If you are still not clear on the script usage or inputs, you can run the script with the --help option and it will display the usage: ./delete-bedrock-agent-kb-roles-policies.sh --help
Run the delete-bedrock-agent-kb-roles-policies.sh script with the same values for the same input parameters as in Step 7 when running the create_bedrock_agent_kb_roles_policies.sh script. Note: Enter the correct account numbers for --agent-account and --agent-kb-account before running.

./delete-bedrock-agent-kb-roles-policies.sh --agent-profile agent \
  --agent-kb-profile agent-kb \
  --lambda-role lambda_bedrock_kb_query_role \
  --kb-access-role bedrock_kb_access_role \
  --kb-access-policy bedrock_kb_access_policy \
  --lambda-policy lambda_bedrock_kb_query_policy \
  --agent-account 111122223333 \
  --agent-kb-account 999999999999
The script will ask for confirmation; type yes and press Enter.

Summary
This solution demonstrates how the Amazon Bedrock agent in the agent account can query the Amazon Bedrock knowledge base in the agent-kb account.
Conclusion
This solution uses Amazon Bedrock Knowledge Bases for structured data to create a more integrated approach to cross-account data access. The knowledge base in the agent-kb account connects directly to Amazon Redshift Serverless in a private VPC. The Amazon Bedrock agent in the agent account invokes an AWS Lambda function as part of its action group to make a cross-account connection and retrieve a response from the structured knowledge base.
This architecture offers several advantages:

Uses Amazon Bedrock Knowledge Bases capabilities for structured data
Provides a more seamless integration between the agent and the data source
Maintains proper security boundaries between accounts
Reduces the complexity of direct database access code

As Amazon Bedrock continues to evolve, you can take advantage of future enhancements to knowledge base functionality while maintaining your multi-account architecture.

About the Authors
Kunal Ghosh is an expert in AWS technologies. He is passionate about building efficient and effective solutions on AWS, especially involving generative AI, analytics, data science, and machine learning. Besides family time, he likes reading, swimming, biking, and watching movies, and he is a foodie.
Arghya Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers adopt and use the AWS Cloud. He is focused on big data, data lakes, streaming and batch analytics services, and generative AI technologies.
Indranil Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers in the hi-tech and semi-conductor sectors solve complex business problems using the AWS Cloud. His special interests are in the areas of legacy modernization and migration, building analytics platforms and helping customers adopt cutting edge technologies such as generative AI.
Vinayak Datar is a Sr. Solutions Manager based in the Bay Area, helping enterprise customers accelerate their AWS Cloud journey. He focuses on helping customers convert ideas from concepts to working prototypes to production using AWS generative AI services.

Democratizing AI: How Thomson Reuters Open Arena supports no-code AI f …

This post is cowritten by Laura Skylaki, Vaibhav Goswami, Ramdev Wudali and Sahar El Khoury from Thomson Reuters.
Thomson Reuters (TR) is a leading AI and technology company dedicated to delivering trusted content and workflow automation solutions. With over 150 years of expertise, TR provides essential solutions across legal, tax, accounting, risk, trade, and media sectors in a fast-evolving world.
TR recognized early that AI adoption would fundamentally transform professional work. According to TR’s 2025 Future of Professionals Report, 80% of professionals anticipate AI significantly impacting their work within five years, with projected productivity gains of up to 12 hours per week by 2029. To unlock this immense potential, TR needed a solution to democratize AI creation across its organization.
In this blog post, we explore how TR addressed key business use cases with Open Arena, a highly scalable and flexible no-code AI solution powered by Amazon Bedrock and other AWS services such as Amazon OpenSearch Service, Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and AWS Lambda. We’ll explain how TR used AWS services to build this solution, including how the architecture was designed, the use cases it solves, and the business profiles that use it. The system demonstrates TR’s successful approach of using existing TR services for rapid launches while supporting thousands of users, showcasing how organizations can democratize AI access and support business profiles (for example, AI explorers and SMEs) to create applications without coding expertise.
Introducing Open Arena: No-code AI for all
TR introduced Open Arena so that non-technical professionals can create their own customized AI solutions. With Open Arena, users can access cutting-edge AI powered by Amazon Bedrock in a no-code environment, exemplifying TR's commitment to democratizing AI access.
Today, Open Arena supports:

High adoption: ~70% employee adoption, with 19,000 monthly active users.
Custom solutions: Thousands of customized AI solutions created without coding, used for internal workflows or integrated into TR products for customers.
Self-served functionality: 100% self-served functionality, so that users, irrespective of technical background, can develop, evaluate, and deploy generative AI solutions.

The Open Arena journey: From prototype to enterprise solution
Conceived as a rapid prototype, Open Arena was developed in under six weeks at the onset of the generative AI boom in early 2023 by TR Labs, TR's dedicated applied research division focused on the research, development, and application of AI and emerging technologies. The goal was to support internal team exploration of large language models (LLMs) and discover unique use cases by merging LLM capabilities with TR company data.
Open Arena’s introduction significantly increased AI awareness, fostered developer-SME collaboration for groundbreaking concepts, and accelerated AI capability development for TR products. The rapid success and demand for new features quickly highlighted Open Arena’s potential for AI democratization, so TR developed an enterprise version of Open Arena. Built on the TR AI Platform, Open Arena enterprise version offers secure, scalable, and standardized services covering the entire AI development lifecycle, significantly accelerating time to production.
The Open Arena enterprise version uses existing system capabilities for enhanced data access controls, standardized service access, and compliance with TR’s governance and ethical standards. This version introduced self-served capabilities so that every user, irrespective of their technical ability, can create, evaluate, and deploy customized AI solutions in a no-code environment.

“The foundation of the AI Platform has always been about empowerment; in the early days it was about empowering Data Scientists but with the rise of Gen AI, the platform adapted and evolved on empowering users of any background to leverage and create AI Solutions.”
– Maria Apazoglou, Head of AI Engineering, CoCounsel

As of July 2025, the TR Enterprise AI Platform consists of 15 services spanning the entire AI development lifecycle and user personas. Open Arena remains one of its most popular services, serving 19,000 users each month, with increasing monthly usage.
Addressing key enterprise AI challenges across user types
Using the TR Enterprise AI Platform, Open Arena helped thousands of professionals transition into using generative AI. AI-powered innovation is now readily in the hands of everyone, not just AI scientists.
Open Arena successfully addresses four critical enterprise AI challenges:

Enablement: Delivers AI solution building with consistent LLM and service provider experience and support for various user personas, including non-technical.
Security and quality: Streamlines AI solution quality tracking using evaluation and monitoring services, whilst complying with data governance and ethics policies.
Speed and reusability: Automates workflows and uses existing AI solutions and prompts.
Resources and cost management: Tracks and displays generative AI solution resource consumption, supporting transparency and efficiency.

The solution currently supports several AI experiences, including tech support, content creation, coding assistance, data extraction and analysis, proofreading, project management, content summarization, personal development, translation, and problem solving, catering to different user needs across the organization.

Figure 1. Examples of Open Arena use cases.
AI explorers use Open Arena to speed up day-to-day tasks, such as summarizing documents, engaging in LLM chat, building custom workflows, and comparing AI models. AI creators and Subject Matter Experts (SMEs) use Open Arena to build custom AI workflows and experiences and to evaluate solutions without requiring coding knowledge. Meanwhile, developers can develop and deploy new AI solutions at speed, training models, creating new AI skills, and deploying AI capabilities.
Why Thomson Reuters selected AWS for Open Arena
TR strategically chose AWS as a primary cloud provider for Open Arena based on several critical factors:

Comprehensive AI/ML capabilities: Amazon Bedrock offers easy access to a choice of high-performing foundation models from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma AI, Meta, Mistral AI, OpenAI, Qwen, Stability AI, TwelveLabs, Writer, and Amazon. It supports simple chat and complex RAG workflows, and integrates seamlessly with TR’s existing Enterprise AI Platform.
Enterprise-grade security and governance: Advanced security controls, model access using RBAC, data handling with enhanced security features, single sign-on (SSO) enabled, and clear operational and user data separation across AWS accounts.
Scalable infrastructure: Serverless architecture for automatic scaling, pay-per-use pricing for cost optimization, and global availability with low latency.
Existing relationship and expertise: Strong, established relationship between TR and AWS, existing Enterprise AI Platform on AWS, and deep AWS expertise within TR’s technical teams.

“Our long-standing partnership with AWS and their robust, flexible and innovative services made them the natural choice to power Open Arena and accelerate our AI initiatives.”
– Maria Apazoglou, Head of AI Engineering, CoCounsel

Open Arena architecture: Scalability, extensibility, and security
Designed for a broad enterprise audience, Open Arena prioritizes scalability, extensibility and security while maintaining simplicity for non-technical users to create and deploy AI solutions. The following diagram illustrates the architecture of Open Arena.

Figure 2. Architecture design of Open Arena.
The architecture design facilitates enterprise-grade performance with clear separation between capability and usage, aligning with TR’s enterprise cost and usage tracking requirements.
The following are key components of the solution architecture:

No-code interface: Intuitive UI, visual workflow builder, pre-built templates, drag-and-drop functionality.
Enterprise integration: Seamless integration with TR’s Enterprise AI Platform, SSO enabled, data handling with enhanced security, clear data separation.
Solution management: Searchable repository, public/private sharing, version control, usage analytics.

TR developed Open Arena using AWS services such as Amazon Bedrock, Amazon OpenSearch Service, Amazon DynamoDB, Amazon API Gateway, AWS Lambda, and AWS Step Functions. It uses Amazon Bedrock for foundation model interactions, supporting simple chat and complex Retrieval-Augmented Generation (RAG) tasks. Open Arena uses Amazon Bedrock Flows as the custom workflow builder, where users can drag and drop components like prompts, agents, knowledge bases, and Lambda functions to create sophisticated AI workflows without coding. The system also integrates with Amazon OpenSearch Service for knowledge bases and external APIs for advanced agent capabilities.
For data separation, orchestration is managed using the Enterprise AI Platform AWS account, capturing operational data. Flow instances and user-specific data reside in the user’s dedicated AWS account, stored in a database. Each user’s data and workflow executions are isolated within their respective AWS accounts, which is required for complying with Thomson Reuters data sovereignty and enterprise security policies with strict regional controls. The system integrates with Thomson Reuters SSO solution to automatically identify users and grant secure, private access to foundational models.
The orchestration layer, centrally hosted within the Enterprise AI Platform AWS account, manages AI workflow activities, including scheduling, deployment, resource provisioning, and governance across user environments.
The system features fully automated provisioning of Amazon Bedrock Flows directly within each user's AWS account, avoiding manual setup and accelerating time to value. Using AWS Lambda for serverless compute and DynamoDB for scalable, low-latency storage, the system dynamically allocates resources based on real-time demand. This architecture makes sure prompt flows and supporting infrastructure are deployed and scaled to match workload fluctuations, optimizing performance, cost, and user experience.
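
As an illustrative sketch (simplified relative to Open Arena's actual orchestration layer), a flow provisioned in a user's account can be invoked through the Bedrock agent runtime API; the Region, identifiers, and node names below are placeholders.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Invoke a provisioned flow; the flow and alias identifiers are placeholders
stream = client.invoke_flow(
    flowIdentifier="FLOW_ID_PLACEHOLDER",
    flowAliasIdentifier="FLOW_ALIAS_PLACEHOLDER",
    inputs=[{
        "nodeName": "FlowInputNode",  # assumed input node name
        "nodeOutputName": "document",
        "content": {"document": "Summarize the key changes in this quarter's tax guidance."},
    }],
)

# The response is an event stream; flowOutputEvent carries the flow's final output
for event in stream["responseStream"]:
    if "flowOutputEvent" in event:
        print(event["flowOutputEvent"]["content"]["document"])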

“Our decision to adopt a cross-account architecture was driven by a commitment to enterprise security and operational excellence. By isolating orchestration from execution, we make sure that each user’s data remains private and secure within their own AWS account, while still delivering a seamless, centrally-managed experience. This design empowers organizations to innovate rapidly without compromising compliance or control.”
– Thomson Reuters’ architecture team

Evolution of Open Arena: From classic to Amazon Bedrock Flows-powered chain builder
Open Arena has evolved to cater to varying levels of user sophistication:

Open Arena v1 (Classic): Features a form-based interface for simple prompt customization and basic AI workflow deployment within a single AWS account. Its simplicity appeals to novice users for straightforward use cases, though with limited advanced capabilities.
Open Arena v2 (Chain Builder): Introduces a robust, visual workflow builder interface, enabling users to design complex, multi-step AI workflows using drag-and-drop components. With support for advanced node types, parallel execution, and seamless cross-account deployment, Chain Builder dramatically expands the system’s capabilities and accessibility for non-technical users.

Thomson Reuters uses Amazon Bedrock Flows as a core feature of Chain Builder. Users can define, customize, and deploy AI-driven workflows using Amazon Bedrock models. Bedrock Flows supports advanced workflows combining multiple prompt nodes, incorporating AWS Lambda functions, and supporting sophisticated RAG pipelines. Operating seamlessly across user AWS accounts, Bedrock Flows facilitates secure, scalable execution of personalized AI solutions, serving as the fundamental engine for the Chain Builder workflows and driving TR’s ability to deliver robust, enterprise-grade automation and innovation.
What’s next?
TR continues to expand Open Arena’s capabilities through the strategic partnership with AWS, focusing on:

Driving further adoption of Open Arena’s DIY capabilities.
Enhancing flexibility for workflow creation in Chain Builder with custom components, such as inline scripts.
Developing new templates to represent common tasks and workflows.
Enhancing collaboration features within Open Arena.
Extending multimodal capabilities and model integration.
Expanding into new use cases across the enterprise.

“From innovating new product ideas to reimagining daily tasks for Thomson Reuters employees, we continue to push the boundaries of what’s possible with Open Arena.”
– Maria Apazoglou, Head of AI Engineering, CoCounsel

Conclusion
In this blog post, we explored how Thomson Reuters’ Open Arena demonstrates the successful democratization of AI across an enterprise by using AWS services, particularly Amazon Bedrock and Bedrock Flows. With 19,000 monthly active users and 70% employee adoption, the system proves that no-code AI solutions can deliver enterprise-scale impact while maintaining security and governance standards.
By combining the robust infrastructure of AWS with innovative architecture design, TR has created a blueprint for AI democratization that empowers professionals across technical skill levels to harness generative AI for their daily work.
As Open Arena continues to evolve, it exemplifies how strategic cloud partnerships can accelerate AI adoption and transform how organizations approach innovation with generative AI.

About the authors
Laura Skylaki, PhD, leads the Enterprise AI Platform at Thomson Reuters, driving the development of GenAI services that accelerate the creation, testing, and deployment of AI solutions, enhancing product value. A recognized expert with a doctorate in stem cell bioinformatics, her extensive experience in AI research and practical application spans legal, tax, and biotech domains. Her machine learning work is published in leading academic journals, and she is a frequent speaker on AI and machine learning.
Vaibhav Goswami is a Lead Software Engineer on the AI Platform team at Thomson Reuters, where he leads the development of the Generative AI Platform that empowers users to build and deploy generative AI solutions at scale. With expertise in building production-grade AI systems, he focuses on creating tools and infrastructure that democratize access to cutting-edge AI capabilities across the enterprise.
Ramdev Wudali is a Distinguished Engineer, helping architect and build the AI/ML Platform to enable enterprise users, data scientists, and researchers to develop generative AI and machine learning solutions by democratizing access to tools and LLMs. In his spare time, he loves folding paper to create origami tessellations and wearing irreverent T-shirts.
As the director of AI Platform Adoption and Training, Sahar El Khoury guides users to seamlessly onboard and successfully use the platform services, drawing on her experience in AI and data analysis across robotics (PhD), financial markets, and media.
Vu San Ha Huynh is a Solutions Architect at AWS with a PhD in Computer Science. He helps large Enterprise customers drive innovation across different domains with a focus on AI/ML and Generative AI solutions.
Paul Wright is a Senior Technical Account Manager, with over 20 years experience in the IT industry and over 7 years of dedicated cloud focus. Paul has helped some of the largest enterprise customers grow their business and improve their operational excellence. In his spare time Paul is a huge football and NFL fan.
Mike Bezak is a Senior Technical Account Manager in AWS Enterprise Support. He has over 20 years of experience in information technology, primarily disaster recovery and systems administration. Mike’s current focus is helping customers streamline and optimize their AWS Cloud journey. Outside of AWS, Mike enjoys spending time with family & friends.

Introducing structured output for Custom Model Import in Amazon Bedroc …

With Amazon Bedrock Custom Model Import, you can deploy and scale fine-tuned or proprietary foundation models in a fully managed, serverless environment. You can bring your own models into Amazon Bedrock, scale them securely without managing infrastructure, and integrate them with other Amazon Bedrock capabilities.
Today, we are excited to announce the addition of structured output to Custom Model Import. Structured output constrains a model’s generation process in real time so that every token it produces conforms to a schema you define. Rather than relying on prompt-engineering tricks or brittle post-processing scripts, you can now generate structured outputs directly at inference time.
For certain production applications, the predictability of model outputs is more important than their creative flexibility. A customer service chatbot might benefit from varied, natural-sounding responses, but an order processing system needs exact, structured data that conforms to predefined schemas. Structured output bridges this gap by maintaining the intelligence of foundation models while verifying their outputs meet strict formatting requirements.
This represents a shift from free-form text generation to outputs that are consistent, machine-readable, and designed for seamless integration with enterprise systems. While free-form text excels for human consumption, production applications require more precision. Businesses can’t afford the ambiguity of natural language variations when their systems depend on structured outputs to reliably interface with APIs, databases, and automated workflows.
In this post, you will learn how to implement structured output for Custom Model Import in Amazon Bedrock. We will cover what structured output is, how to enable it in your API calls, and how to apply it to real-world scenarios that require structured, predictable outputs.
Understanding structured output
Structured output, also known as constrained decoding, is a method that directs LLM outputs to conform to a predefined schema, such as valid JSON. Rather than allowing the model to freely select tokens based on probability distributions, it introduces constraints during generation that limit choices to only those that maintain structural validity. If a particular token would violate the schema by producing invalid JSON, inserting stray characters, or using an unexpected field name, structured output rejects it and requires the model to select another allowed option. This real-time validation helps keep the final output consistent, machine readable, and immediately usable by downstream applications without the need for additional post-processing.
Without structured output, developers often attempt to enforce structure through prompt instructions like “Respond only in JSON.” While this approach sometimes works, it remains unreliable due to the inherently probabilistic nature of LLMs. These models generate text by sampling from probability distributions, introducing natural variability that makes responses feel human but creates significant challenges for automated systems.
Consider a customer support application that classifies tickets: if responses vary between “This seems like a billing issue,” “I’d classify this as: Billing,” and “Category = BILLING,” downstream code cannot reliably interpret the results. What production systems require instead is predictable, structured output. For example:

{
  "category": "billing",
  "priority": "high",
  "sentiment": "negative"
}

With a response like this, your application can automatically route tickets, trigger workflows, or update databases without human intervention. By providing predictable, schema-aligned responses, structured output transforms LLMs from conversational tools into reliable system components that can be integrated with databases, APIs, and business logic. This capability opens new possibilities for automation while maintaining the intelligent reasoning that underpins the value of these models.
Beyond improving reliability and simplifying post-processing, structured output offers additional benefits that strengthen performance, security, and safety in production environments.

Lower token usage and faster responses: By constraining generation to a defined schema, structured output removes unnecessarily verbose, free-form text, reducing token counts. Because token generation is sequential, shorter outputs directly translate to faster responses and lower latency, improving overall performance and cost efficiency.
Enhanced security against prompt injection: Structured output narrows the model’s expression space and helps prevent it from producing arbitrary or unsafe content. Bad actors cannot inject instructions, code or unexpected text outside the defined structure. Each field must match its expected type and format, making sure outputs remain within safe boundaries.
Safety and policy controls: Structured output enables you to design schemas that inherently help prevent harmful, toxic, or policy-violating content. By limiting fields to approved values, enforcing patterns, and restricting free-form text, schemas make sure outputs align with regulatory requirements.

In the next section, we will explore how structured output works with Custom Model Import in Amazon Bedrock and walk through an example of enabling it in your API calls.
Using structured output with Custom Model Import in Amazon Bedrock
Let’s start by assuming you have already imported a Hugging Face model into Amazon Bedrock using the Custom Model Import feature.
Prerequisites
Before proceeding, make sure you have:

An active AWS account with access to Amazon Bedrock
A custom model created in Amazon Bedrock using the Custom Model Import feature
Appropriate AWS Identity and Access Management (IAM) permissions to invoke models through the Amazon Bedrock Runtime

With these prerequisites in place, let’s explore how to implement structured output with your imported model.
To start using structured output with a Custom Model Import in Amazon Bedrock, begin by configuring your environment. In Python, this involves creating a Bedrock Runtime client and initializing a tokenizer from your imported Hugging Face model.
The Bedrock Runtime client provides access to your imported model using the Bedrock InvokeModel API. The tokenizer applies the correct chat template that aligns with the imported model, which defines how user, system, and assistant messages are combined into a single prompt, how the role markers (for example, <|user|>, <|assistant|>) are inserted, and where the model’s response should begin.
By calling tokenizer.apply_chat_template(messages, tokenize=False) you can generate a prompt that matches the exact input format your model expects, which is essential for consistent and reliable inference, especially when structured output is enabled.

import json
import boto3
from transformers import AutoTokenizer
from botocore.config import Config

# HF model identifier imported into Bedrock
hf_model_id = "<<huggingface_model_id>>"  # Example: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
model_arn = "arn:aws:bedrock:<<aws-region>>:<<account-id>>:imported-model/your-model-id"
region = "<<aws-region>>"

# Initialize tokenizer aligned with your imported model
tokenizer = AutoTokenizer.from_pretrained(hf_model_id)

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name=region)

Implementing structured output
When you invoke a custom model on Amazon Bedrock, you have the option to enable structured output by adding a response_format block to the request payload. This block accepts a JSON schema that defines the structure of the model's response. During inference, the model enforces this schema in real time, making sure that each generated token conforms to the defined structure. The following is a walkthrough demonstrating how to implement structured output using a simple address extraction task.
Step 1: Define the data structure
You can define your expected output using a Pydantic model, which serves as a typed contract for the data you want to extract.

from pydantic import BaseModel, Field

class Address(BaseModel):
    street_number: str = Field(description="Street number")
    street_name: str = Field(description="Street name including type (Ave, St, Rd, etc.)")
    city: str = Field(description="City name")
    state: str = Field(description="Two-letter state abbreviation")
    zip_code: str = Field(description="5-digit ZIP code")

Step 2: Generate the JSON schema
Pydantic can automatically convert your data model into a JSON schema:

schema = Address.model_json_schema()
address_schema = {
    "name": "Address",
    "schema": schema
}

This schema defines each field’s type, description, and requirement, creating a blueprint that the model will follow during generation.
Step 3: Prepare your input messages
Format your input using the chat format expected by your model:

messages = [{
    "role": "user",
    "content": "Extract the address: 456 Tech Boulevard, San Francisco, CA 94105"
}]

Step 4: Apply the chat template
Use your model’s tokenizer to generate the formatted prompt:

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

Step 5: Build the request payload
Create your request body, including the response_format that references your schema:

request_body = {
    "prompt": prompt,
    "temperature": 0.1,
    "max_gen_len": 1000,
    "top_p": 0.9,
    "response_format": {
        "type": "json_schema",
        "json_schema": address_schema
    }
}

Step 6: Invoke the model
Send the request using the InvokeModel API:

response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps(request_body),
    accept="application/json",
    contentType="application/json"
)

Step 7: Parse the response
Extract the generated text from the response:

result = json.loads(response["body"].read().decode("utf-8"))
raw_output = result["choices"][0]["text"]
print(raw_output)

Because the schema defines required fields, the model’s response will contain them:

{
  "street_number": "456",
  "street_name": "Tech Boulevard",
  "city": "San Francisco",
  "state": "CA",
  "zip_code": "94105"
}

The output is clean, valid JSON that can be consumed directly by your application with no extra parsing, filtering, or cleanup required.
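As an optional follow-up not shown in the walkthrough above, you can validate the returned JSON against the Address model from Step 1, which turns any schema mismatch into an explicit exception:

# Raises pydantic.ValidationError if the output does not match the schema
address = Address.model_validate_json(raw_output)
print(address.city, address.zip_code)
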
Conclusion
Structured output with Custom Model Import in Amazon Bedrock provides an effective way to generate structured, schema-aligned outputs from your models. By shifting validation into model inference itself, structured output reduces the need for complex post-processing workflows and error handling code.
Structured output generates outputs that are predictable and straightforward to integrate into your systems and supports a variety of use cases, for example, building financial applications that require precise data extraction, healthcare systems that need structured clinical documentation, or customer service systems that demand consistent ticket classification.
Start experimenting with structured output with your Custom Model Import today and transform how your AI applications deliver consistent, production-ready results.

About the authors
Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps organizations design, prototype, and scale AI-powered solutions in the cloud. With expertise in deep learning, scalable cloud-native systems, and multi-agent orchestration, he focuses on turning emerging innovations into production-ready architectures that drive measurable business value. He is passionate about making complex AI concepts practical and enabling customers to innovate responsibly at scale—from early experimentation to enterprise deployment. Before joining AWS, Manoj worked in consulting, delivering data science and AI solutions for enterprise clients, building end-to-end machine learning systems supported by strong MLOps practices for training, deployment, and monitoring in production.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Revendra Kumar is a Senior Software Development Engineer at Amazon Web Services. In his current role, he focuses on model hosting and inference MLOps on Amazon Bedrock. Prior to this, he worked as an engineer on hosting Quantum computers on the cloud and developing infrastructure solutions for on-premises cloud environments. Outside of his professional pursuits, Revendra enjoys staying active by playing tennis and hiking.
Muzart Tuman is a software engineer utilizing his experience in fields like deep learning, machine learning optimization, and AI-driven applications to help solve real-world problems in a scalable, efficient, and accessible manner. His goal is to create impactful tools that not only advance technical capabilities but also inspire meaningful change across industries and communities.

Moonshot AI Releases Kimi K2 Thinking: An Impressive Thinking Model th …

How do we design AI systems that can plan, reason, and act over long sequences of decisions without constant human guidance? Moonshot AI has released Kimi K2 Thinking, an open source thinking agent model that exposes the full reasoning stream of the Kimi K2 Mixture of Experts architecture. It targets workloads that need deep reasoning, long horizon tool use, and stable agent behavior across many steps.

https://moonshotai.github.io/Kimi-K2/thinking.html

What is Kimi K2 Thinking?

Kimi K2 Thinking is described as the latest, most capable version of Moonshot’s open source thinking model. It is built as a thinking agent that reasons step by step and dynamically invokes tools during inference. The model is designed to interleave chain of thought with function calls so it can read, think, call a tool, think again, and repeat for hundreds of steps.

The model sets a new state of the art on Humanity’s Last Exam and BrowseComp, while maintaining coherent behavior across about 200 to 300 sequential tool calls without human interference.

At the same time, K2 Thinking is released as an open weights model with a 256K token context window and native INT4 inference, which reduces latency and GPU memory usage while preserving benchmark performance.

K2 Thinking is already live on kimi.com in chat mode and is accessible through the Moonshot platform API, with a dedicated agentic mode planned to expose the full tool using behavior.

Architecture, MoE design, and context length

Kimi K2 Thinking inherits the Kimi K2 Mixture of Experts design. The model uses a MoE architecture with 1T total parameters and 32B activated parameters per token. It has 61 layers including 1 dense layer, 384 experts with 8 experts selected per token, 1 shared expert, 64 attention heads, and an attention hidden dimension of 7168. The MoE hidden dimension is 2048 per expert.

The vocabulary size is 160K tokens and the context length is 256K. The attention mechanism is Multi head Latent Attention, and the activation function is SwiGLU.

Test time scaling and long horizon thinking

Kimi K2 Thinking is explicitly optimized for test time scaling. The model is trained to expand its reasoning length and tool call depth when facing harder tasks, rather than relying on a fixed short chain of thought.

https://moonshotai.github.io/Kimi-K2/thinking.html

On Humanity’s Last Exam in the no tools setting, K2 Thinking scores 23.9. With tools, the score rises to 44.9, and in the heavy setting it reaches 51.0. On AIME25 with Python, it reports 99.1, and on HMMT25 with Python it reports 95.1. On IMO AnswerBench it scores 78.6, and on GPQA it scores 84.5.

The testing protocol caps thinking token budgets at 96K for HLE, AIME25, HMMT25, and GPQA. It uses 128K thinking tokens for IMO AnswerBench, LiveCodeBench, and OJ Bench, and 32K completion tokens for Longform Writing. On HLE, the maximum step limit is 120 with a 48K reasoning budget per step. On agentic search tasks, the limit is 300 steps with a 24K reasoning budget per step.
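
Restating that protocol as data makes the budgets easier to compare when reproducing the evaluation. The dictionaries below simply mirror the numbers in the paragraph above; the key names and the string form of the token budgets are choices made for this summary.

# Evaluation budgets as reported for K2 Thinking (values copied from the article).
THINKING_TOKEN_BUDGETS = {
    "HLE": "96K", "AIME25": "96K", "HMMT25": "96K", "GPQA": "96K",
    "IMO AnswerBench": "128K", "LiveCodeBench": "128K", "OJ Bench": "128K",
}
COMPLETION_TOKEN_BUDGETS = {"Longform Writing": "32K"}
STEP_LIMITS = {
    "HLE (with tools)": {"max_steps": 120, "reasoning_budget_per_step": "48K"},
    "agentic search":   {"max_steps": 300, "reasoning_budget_per_step": "24K"},
}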

Benchmarks in agentic search and coding

On agentic search tasks with tools, K2 Thinking reports 60.2 on BrowseComp, 62.3 on BrowseComp ZH, 56.3 on Seal 0, 47.4 on FinSearchComp T3, and 87.0 on Frames.

On general knowledge benchmarks, it reports 84.6 on MMLU Pro, 94.4 on MMLU Redux, 73.8 on Longform Writing, and 58.0 on HealthBench.

For coding, K2 Thinking achieves 71.3 on SWE bench Verified with tools, 61.1 on SWE bench Multilingual with tools, 41.9 on Multi SWE bench with tools, 44.8 on SciCode, 83.1 on LiveCodeBenchV6, 48.7 on OJ Bench in the C++ setting, and 47.1 on Terminal Bench with simulated tools.

The Moonshot team also defines a Heavy Mode that runs eight trajectories in parallel and then aggregates them to produce a final answer. This mode is used on some reasoning benchmarks to squeeze extra accuracy out of the same base model.
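
The article does not specify the aggregation rule, so the sketch below assumes a simple majority vote over eight independently sampled trajectories purely for illustration, with a stubbed-out solver standing in for a full reasoning run.

# Illustrative Heavy Mode sketch: sample 8 trajectories in parallel and aggregate.
# Majority voting and the stub solver are assumptions made for this example;
# Moonshot does not document the exact aggregation rule here.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def solve_once(question: str, seed: int) -> str:
    # Placeholder for one full reasoning trajectory; a real run would issue a
    # complete model call chain with its own thinking budget and tool calls.
    return "stub-answer"

def heavy_mode(question: str, n_trajectories: int = 8) -> str:
    with ThreadPoolExecutor(max_workers=n_trajectories) as pool:
        answers = list(pool.map(lambda s: solve_once(question, s), range(n_trajectories)))
    # Aggregate by returning the most common final answer (majority vote).
    return Counter(answers).most_common(1)[0][0]

print(heavy_mode("What is the final answer to the task?"))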

Native INT4 quantization and deployment

K2 Thinking is trained as a native INT4 model. The research team applies Quantization Aware Training during the post training stage and uses INT4 weight only quantization on the MoE components. This supports INT4 inference with roughly a 2x generation speed improvement in low latency mode while maintaining state of the art performance. All reported benchmark scores are obtained under INT4 precision.

The checkpoints are saved in compressed tensors format and can be unpacked to higher precision formats such as FP8 or BF16 using the official compressed tensors tools. Recommended inference engines include vLLM, SGLang, and KTransformers.
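
As a rough illustration of what querying a self-hosted deployment could look like, the snippet below assumes the weights have already been served with vLLM's OpenAI-compatible server on localhost. The repository id, port, and the assumption that a 1T parameter MoE checkpoint fits your hardware are all illustrative, not a tested deployment recipe.

# Rough sketch of querying a self-hosted K2 Thinking deployment through an
# OpenAI-compatible vLLM server. Model id and endpoint are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",   # assumed repository id
    messages=[{"role": "user", "content": "Outline a plan to audit this codebase."}],
)
print(resp.choices[0].message.content)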

Key Takeaways

Kimi K2 Thinking is an open weights thinking agent that extends the Kimi K2 Mixture of Experts architecture with explicit long horizon reasoning and tool use, not just short chat style responses.

The model uses a trillion parameter MoE design with 32B active parameters per token, a 256K context window, and is trained as a native INT4 model with Quantization Aware Training, which gives roughly 2x faster inference while keeping benchmark performance stable.

K2 Thinking is optimized for test time scaling: it can carry out hundreds of sequential tool calls in a single task and is evaluated under large thinking token budgets and strict step caps, which matters when you try to reproduce its reasoning and agentic results.

On public benchmarks, it leads or is competitive on reasoning, agentic search, and coding tasks such as HLE with tools, BrowseComp, and SWE bench Verified with tools, showing that the thinking oriented variant delivers clear gains over the base non thinking K2 model.

Editorial Comments

Kimi K2 Thinking is a strong signal that test time scaling is now a first class design target for open source reasoning models. Moonshot AI is not only exposing a 1T parameter Mixture of Experts system with 32B active parameters and 256K context window, it is doing so with native INT4 quantization, Quantization Aware Training, and tool orchestration that runs for hundreds of steps in production like settings. Overall, Kimi K2 Thinking shows that open weights reasoning agents with long horizon planning and tool use are becoming practical infrastructure, not just research demos.

Check out the Model Weights and Technical Details.