MiniMax Releases MiniMax M2: A Mini Open Model Built for Max Coding and Agentic Workflows at 8% Claude Sonnet Price and ~2x Faster

Can an open source MoE truly power agentic coding workflows at a fraction of flagship model costs while sustaining long-horizon tool use across MCP, shell, browser, retrieval, and code? The MiniMax team has just released MiniMax-M2, a mixture of experts (MoE) model optimized for coding and agent workflows. The weights are published on Hugging Face under the MIT license, and the model is positioned for end to end tool use, multi file editing, and long horizon plans. It lists 229B total parameters with about 10B active per token, which keeps memory and latency in check during agent loops.

https://github.com/MiniMax-AI/MiniMax-M2

Architecture, and why activation size matters

MiniMax-M2 is a compact MoE that routes to about 10B active parameters per token. The smaller activations reduce memory pressure and tail latency in plan, act, and verify loops, and allow more concurrent runs in CI, browse, and retrieval chains. This is the performance budget that enables the speed and cost claims relative to dense models of similar quality.

MiniMax-M2 is an interleaved thinking model. The research team wraps internal reasoning in <think>…</think> blocks and instructs users to keep these blocks in the conversation history across turns. Removing these segments harms quality in multi step tasks and tool chains. This requirement is explicit on the Hugging Face model page.
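
For illustration, here is a minimal sketch of keeping the <think> blocks in the conversation history when calling an OpenAI-compatible chat endpoint. The base URL, key handling, and model name are assumptions for the example, not MiniMax's documented API.

# Minimal sketch: preserve <think>...</think> segments across agent turns.
# The endpoint URL, key, and model name below are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://example-minimax-endpoint/v1", api_key="YOUR_KEY")  # hypothetical endpoint

history = [{"role": "user", "content": "Refactor utils.py and run the tests."}]
for _ in range(2):  # two agent turns
    reply = client.chat.completions.create(model="MiniMax-M2", messages=history)
    content = reply.choices[0].message.content  # contains <think>...</think> plus the visible answer
    # Append the assistant message verbatim; stripping the <think> block degrades later turns.
    history.append({"role": "assistant", "content": content})
    history.append({"role": "user", "content": "Continue with the next step."})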

Benchmarks that target coding and agents

The MiniMax team reports a set of agent and code evaluations that are closer to developer workflows than static QA. On Terminal-Bench, the table shows 46.3. On Multi SWE-Bench, it shows 36.2. On BrowseComp, it shows 44.0. SWE-Bench Verified is listed at 69.4, with the scaffold detail noted as OpenHands with 128k context and 100 steps.

https://github.com/MiniMax-AI/MiniMax-M2

MiniMax’s official announcement stresses 8% of Claude Sonnet pricing and nearly 2x speed, plus a free access window. The same note provides the specific token prices and the trial deadline.

Comparison M1 vs M2

Aspect | MiniMax M1 | MiniMax M2
Total parameters | 456B total | 229B in model card metadata (model card text says 230B total)
Active parameters per token | 45.9B active | 10B active
Core design | Hybrid Mixture of Experts with Lightning Attention | Sparse Mixture of Experts targeting coding and agent workflows
Thinking format | Thinking budget variants 40k and 80k in RL training, no think tag protocol required | Interleaved thinking with <think>…</think> segments that must be preserved across turns
Benchmarks highlighted | AIME, LiveCodeBench, SWE-bench Verified, TAU-bench, long context MRCR, MMLU-Pro | Terminal-Bench, Multi SWE-Bench, SWE-bench Verified, BrowseComp, GAIA text only, Artificial Analysis intelligence suite
Inference defaults | temperature 1.0, top p 0.95 | model card shows temperature 1.0, top p 0.95, top k 40; launch page shows top k 20
Serving guidance | vLLM recommended, Transformers path also documented | vLLM and SGLang recommended, tool calling guide provided
Primary focus | Long context reasoning, efficient scaling of test time compute, CISPO reinforcement learning | Agent and code native workflows across shell, browser, retrieval, and code runners

Key Takeaways

M2 ships as open weights on Hugging Face under MIT, with safetensors in F32, BF16, and FP8 F8_E4M3.

The model is a compact MoE with 229B total parameters and ~10B active per token, which the card ties to lower memory use and steadier tail latency in plan, act, verify loops typical of agents.

Outputs wrap internal reasoning in <think>…</think> and the model card explicitly instructs retaining these segments in conversation history, warning that removal degrades multi-step and tool-use performance.

Reported results cover Terminal-Bench, (Multi-)SWE-Bench, BrowseComp, and others, with scaffold notes for reproducibility, and day-0 serving is documented for SGLang and vLLM with concrete deploy guides.

Editorial Notes

MiniMax M2 lands with open weights under MIT, a mixture of experts design with 229B total parameters and about 10B activated per token, which targets agent loops and coding tasks with lower memory and steadier latency. It ships on Hugging Face in safetensors with FP32, BF16, and FP8 formats, and provides deployment notes plus a chat template. The API documents Anthropic compatible endpoints and lists pricing with a limited free window for evaluation. vLLM and SGLang recipes are available for local serving and benchmarking. Overall, MiniMax M2 is a very solid open release.

Check out the API Doc, Weights and Repo.

Zhipu AI Releases ‘Glyph’: An AI Framework for Scaling the Context Length through Visual-Text Compression

Can we render long texts as images and use a VLM to achieve 3-4x token compression, preserving accuracy while scaling a 128K context toward 1M-token workloads? A team of researchers from Zhipu AI has released Glyph, an AI framework for scaling the context length through visual-text compression. The system renders ultra long text into page images, then a vision language model (VLM) processes those pages end to end. Each visual token encodes many characters, so the effective token sequence shortens while semantics are preserved. Glyph achieves 3-4x token compression on long text sequences without performance degradation, enabling significant gains in memory efficiency, training throughput, and inference speed.

https://arxiv.org/pdf/2510.17800

Why Glyph?

Conventional methods expand positional encodings or modify attention, but compute and memory still scale with token count. Retrieval trims inputs, but it risks missing evidence and adds latency. Glyph changes the representation: it converts text to images and shifts the burden to a VLM that already learns OCR, layout, and reasoning. This increases information density per token, so a fixed token budget covers more original context. Under extreme compression, the research team shows that a 128K context VLM can address tasks that originate from 1M token level text.
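
To make the idea concrete, here is a rough Python sketch that renders a long string onto a single page image with PIL and estimates the resulting compression. The page geometry, font, characters-per-token heuristic, and patch size are illustrative assumptions; Glyph's actual rendering pipeline and tokenizer differ.

# Illustrative only: render text to a page image and estimate token compression.
# Page size, wrapping width, and the tokens-per-patch heuristic are assumptions.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_page(text, width=800, height=1000):
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real system would tune font family and size
    wrapped = "\n".join(textwrap.wrap(text, width=110))
    draw.multiline_text((10, 10), wrapped, fill="black", font=font)
    return img

long_text = "Long report text that would normally consume many text tokens. " * 400
page = render_page(long_text)

text_tokens = len(long_text) / 4                 # rough heuristic: ~4 characters per text token
visual_tokens = (800 // 16) * (1000 // 16)       # assume one visual token per 16x16 patch
print(f"estimated compression: {text_tokens / visual_tokens:.1f}x")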

https://arxiv.org/pdf/2510.17800

System design and training

The method has three stages: continual pre training, LLM driven rendering search, and post training. Continual pre training exposes the VLM to large corpora of rendered long text with diverse typography and styles. The objective aligns visual and textual representations, and transfers long context skills from text tokens to visual tokens. The rendering search is a genetic loop driven by an LLM. It mutates page size, dpi, font family, font size, line height, alignment, indent, and spacing. It evaluates candidates on a validation set to optimize accuracy and compression jointly. Post training uses supervised fine tuning and reinforcement learning with Group Relative Policy Optimization, plus an auxiliary OCR alignment task. The OCR loss improves character fidelity when fonts are small and spacing is tight.
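
A hedged sketch of what such a rendering search loop could look like is below. The parameter ranges, mutation rule, and scoring function are assumptions; in the actual pipeline an LLM proposes mutations and candidates are scored by running the VLM on a validation set.

# Conceptual sketch of the genetic rendering search; evaluate_candidate() is a stand-in
# for scoring a rendering configuration on a validation set, not Glyph's released code.
import random

PARAM_SPACE = {
    "dpi": [72, 96, 120],
    "font_size": [8, 10, 12, 14],
    "line_height": [1.0, 1.2, 1.4],
    "alignment": ["left", "justify"],
}

def mutate(cfg):
    key = random.choice(list(PARAM_SPACE))
    return {**cfg, key: random.choice(PARAM_SPACE[key])}

def evaluate_candidate(cfg):
    # placeholder: would return (accuracy, compression) from rendered validation pages
    return random.random(), random.uniform(1.0, 5.0)

def rendering_search(generations=10, population=8):
    pool = [{k: random.choice(v) for k, v in PARAM_SPACE.items()} for _ in range(population)]
    best_cfg, best_score = None, float("-inf")
    for _ in range(generations):
        pool = [mutate(cfg) for cfg in pool]
        for cfg in pool:
            accuracy, compression = evaluate_candidate(cfg)
            score = accuracy + 0.1 * compression   # joint objective: accuracy first, then compression
            if score > best_score:
                best_cfg, best_score = cfg, score
    return best_cfg

print(rendering_search())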

https://arxiv.org/pdf/2510.17800

Results: performance and efficiency

LongBench and MRCR establish accuracy and compression under long dialogue histories and document tasks. The model achieves an average effective compression ratio of about 3.3 on LongBench, with some tasks near 5, and about 3.0 on MRCR. These gains scale with longer inputs, since every visual token carries more characters. Reported speedups versus the text backbone at 128K inputs are about 4.8 times for prefill, about 4.4 times for decoding, and about 2 times for supervised fine tuning throughput. The Ruler benchmark confirms that higher dpi at inference time improves scores, since crisper glyphs help OCR and layout parsing. The research team reports dpi 72 with average compression 4.0 and maximum 7.7 on specific sub tasks, dpi 96 with average compression 2.2 and maximum 4.4, and dpi 120 with average 1.2 and maximum 2.8. The 7.7 maximum belongs to Ruler, not to MRCR.

https://arxiv.org/pdf/2510.17800

Applications and limitations

Glyph also benefits multimodal document understanding. Training on rendered pages improves performance on MMLongBench-Doc relative to a base visual model. This indicates that the rendering objective is a useful pretext for real document tasks that include figures and layout. The main failure mode is sensitivity to aggressive typography. Very small fonts and tight spacing degrade character accuracy, especially for rare alphanumeric strings. For this reason, the research team excludes the UUID subtask on Ruler. The approach assumes server side rendering and a VLM with strong OCR and layout priors.

Key Takeaways

Glyph renders long text into images, then a vision language model processes those pages. This reframes long-context modeling as a multimodal problem and preserves semantics while reducing tokens.

The research team reports 3 to 4 times token compression with accuracy comparable to strong 8B text baselines on long-context benchmarks.

Prefill speedup is about 4.8 times, decoding speedup is about 4.4 times, and supervised fine tuning throughput is about 2 times, measured at 128K inputs.

The system uses continual pretraining on rendered pages, an LLM driven genetic search over rendering parameters, then supervised fine tuning and reinforcement learning with GRPO, plus an OCR alignment objective.

Evaluations include LongBench, MRCR, and Ruler, with an extreme case showing a 128K context VLM addressing 1M token level tasks. Code and model card are public on GitHub and Hugging Face.

Editorial Comments

Glyph treats long context scaling as visual text compression: it renders long sequences into images and lets a VLM process them, reducing tokens while preserving semantics. The research team claims 3 to 4 times token compression with accuracy comparable to Qwen3 8B baselines, about 4 times faster prefilling and decoding, and about 2 times faster SFT throughput. The pipeline is disciplined: continual pre training on rendered pages, an LLM driven genetic rendering search over typography, then post training. The approach is pragmatic for million token workloads under extreme compression, yet it depends on OCR and typography choices, which remain knobs. Overall, visual text compression offers a concrete path to scale long context while controlling compute and memory.

Check out the Paper, Weights and Repo.

Hosting NVIDIA speech NIM models on Amazon SageMaker AI: Parakeet ASR

This post was written with NVIDIA and the authors would like to thank Adi Margolin, Eliuth Triana, and Maryam Motamedi for their collaboration.
Organizations today face the challenge of processing large volumes of audio data–from customer calls and meeting recordings to podcasts and voice messages–to unlock valuable insights. Automatic Speech Recognition (ASR) is a critical first step in this process, converting speech to text so that further analysis can be performed. However, running ASR at scale is computationally intensive and can be expensive. This is where asynchronous inference on Amazon SageMaker AI comes in. By deploying state-of-the-art ASR models (like NVIDIA Parakeet models) on SageMaker AI with asynchronous endpoints, you can handle large audio files and batch workloads efficiently. With asynchronous inference, long-running requests can be processed in the background (with results delivered later); it also supports auto-scaling to zero when there’s no work and handles spikes in demand without blocking other jobs.
In this blog post, we’ll explore how to host the NVIDIA Parakeet ASR model on SageMaker AI and integrate it into an asynchronous pipeline for scalable audio processing. We’ll also highlight the benefits of Parakeet’s architecture and the NVIDIA Riva toolkit for speech AI, and discuss how to use NVIDIA NIM for deployment on AWS.
NVIDIA speech AI technologies: Parakeet ASR and Riva Framework
NVIDIA offers a comprehensive suite of speech AI technologies, combining high-performance models with efficient deployment solutions. At its core, the Parakeet ASR model family represents state-of-the-art speech recognition capabilities, achieving industry-leading accuracy with low word error rates (WERs). The model’s architecture uses the Fast Conformer encoder with a CTC or transducer decoder, enabling 2.4× faster processing than standard Conformers while maintaining accuracy.
NVIDIA speech NIM is a collection of GPU-accelerated microservices for building customizable speech AI applications. NVIDIA Speech models deliver high transcription accuracy and natural, expressive voices in over 36 languages–ideal for customer service, contact centers, accessibility, and global enterprise workflows. Developers can fine-tune and customize models for specific languages, accents, domains, and vocabularies, supporting accuracy and brand voice alignment.
Seamless integration with LLMs and the NVIDIA NeMo Retriever makes NVIDIA models ideal for agentic AI applications, helping your organization stand out with more secure, high-performing voice AI. The NIM framework delivers these services as containerized solutions, making deployment straightforward through Docker containers that include the necessary dependencies and optimizations.
This combination of high-performance models and deployment tools provides organizations with a complete solution for implementing speech recognition at scale.
Solution overview
The architecture illustrated in the diagram showcases a comprehensive asynchronous inference pipeline designed specifically for ASR and summarization workloads. The solution provides a robust, scalable, and cost-effective processing pipeline.

Architecture components
The architecture consists of five key components working together to create an efficient audio processing pipeline. At its core, the SageMaker AI asynchronous endpoint hosts the Parakeet ASR model with auto scaling capabilities that can scale to zero when idle for cost optimization.

The data ingestion process begins when audio files are uploaded to Amazon Simple Storage Service (Amazon S3), triggering AWS Lambda functions that process metadata and initiate the workflow.
For event processing, the SageMaker endpoint automatically sends Amazon Simple Notification Service (Amazon SNS) success and failure notifications to separate topics, enabling proper handling of transcriptions.
Successfully transcribed content on Amazon S3 moves to Amazon Bedrock LLMs for intelligent summarization and additional processing like classification and insights extraction.
Finally, a comprehensive tracking system using Amazon DynamoDB stores workflow status and metadata, enabling real-time monitoring and analytics of the entire pipeline.

Detailed implementation walkthrough
In this section, we will provide the detailed walkthrough of the solution implementation.
SageMaker asynchronous endpoint prerequisites
To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with least-privilege permissions to manage resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker async hosting instances. In this example, we need one ml.g5.xlarge SageMaker async hosting instance and a ml.g5.xlarge SageMaker notebook instance. You can also choose a different integrated development environment (IDE), but make sure the environment contains GPU compute resources for local testing.
SageMaker asynchronous endpoint configuration
When you deploy a custom model like Parakeet, SageMaker offers several options:

Use a NIM container provided by NVIDIA
Use a large model inference (LMI) container
Use a prebuilt PyTorch container

We’ll provide examples for all three approaches.
Using an NVIDIA NIM container
NVIDIA NIM provides a streamlined approach to deploying optimized AI models through containerized solutions. Our implementation takes this concept further by creating a unified SageMaker AI endpoint that intelligently routes between HTTP and gRPC protocols to help maximize both performance and capabilities while simplifying the deployment process.
Innovative dual-protocol architecture
The key innovation is the combined HTTP + gRPC architecture that exposes a single SageMaker AI endpoint with intelligent routing capabilities. This design addresses the common challenge of choosing between protocol efficiency and feature completeness by automatically selecting the optimal transport method. The HTTP route is optimized for simple transcription tasks with files under 5MB, providing faster processing and lower latency for common use cases. Meanwhile, the gRPC route supports larger files (SageMaker AI real-time endpoints support a max payload of 25MB) and advanced features like speaker diarization with precise word-level timing information. The system’s auto-routing functionality analyzes incoming requests to determine file size and requested features, then automatically selects the most appropriate protocol without requiring manual configuration. For applications that need explicit control, the endpoint also supports forced routing through /invocations/http for simple transcription or /invocations/grpc when speaker diarization is required. This flexibility allows both automated optimization and fine-grained control based on specific application requirements.
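
The snippet below sketches this routing decision in Python. The 5MB threshold comes from the description above, but the function signature and request fields are assumptions rather than the container's actual handler code.

# Illustrative routing logic for the combined endpoint; the request shape is assumed.
from typing import Optional

HTTP_MAX_BYTES = 5 * 1024 * 1024  # simple transcription path for files under 5MB

def choose_route(audio_bytes: bytes, want_diarization: bool, forced: Optional[str] = None) -> str:
    if forced in ("http", "grpc"):         # explicit /invocations/http or /invocations/grpc
        return forced
    if want_diarization or len(audio_bytes) > HTTP_MAX_BYTES:
        return "grpc"                      # larger payloads and speaker diarization
    return "http"                          # low-latency path for small, simple requests

print(choose_route(b"\x00" * 1024, want_diarization=False))               # -> http
print(choose_route(b"\x00" * (6 * 1024 * 1024), want_diarization=False))  # -> grpc
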
Advanced speech recognition and speaker diarization capabilities
The NIM container enables a comprehensive audio processing pipeline that seamlessly combines speech recognition with speaker identification through the NVIDIA Riva integrated capabilities. The container handles audio preprocessing, including format conversion and segmentation, while ASR and speaker diarization processes run concurrently on the same audio stream. Results are automatically aligned using overlapping time segments, with each transcribed segment receiving appropriate speaker labels (for example, Speaker_0, Speaker_1). The inference handler processes audio files through the complete pipeline, initializing both ASR and speaker diarization services, running them in parallel, and aligning transcription segments with speaker labels. The output includes the full transcription, timestamped segments with speaker attribution, confidence scores, and total speaker count in a structured JSON format.
Implementation and deployment
The implementation extends NVIDIA parakeet-1-1b-ctc-en-us NIM container as the foundation, adding a Python aiohttp server that seamlessly manages the complete NIM lifecycle by automatically starting and monitoring the service. The server handles protocol adaptation by translating SageMaker inference requests to appropriate NIM APIs, implements the intelligent routing logic that analyzes request characteristics, and provides comprehensive error handling with detailed error messages and fallback mechanisms for robust production deployment. The containerized solution streamlines deployment through standard Docker and AWS CLI commands, featuring a pre-configured Docker file with the necessary dependencies and optimizations. The system accepts multiple input formats including multipart form-data (recommended for maximum compatibility), JSON with base64 encoding for simple integration scenarios, and raw binary uploads for direct audio processing.
For detailed implementation instructions and working examples, teams can reference the complete implementation and deployment notebook in the AWS samples repository, which provides comprehensive guidance on deploying Parakeet ASR with NIM on SageMaker AI using the bring your own container (BYOC) approach. For organizations with specific architectural preferences, separate HTTP-only and gRPC-only implementations are also available, providing simpler deployment models for teams with well-defined use cases while the combined implementation offers maximum flexibility and automatic optimization.
AWS customers can deploy these models either as production-grade NVIDIA NIM containers directly from SageMaker Marketplace or JumpStart, or open source NVIDIA models available on Hugging Face, which can be deployed through custom containers on SageMaker or Amazon Elastic Kubernetes Service (Amazon EKS). This allows organizations to choose between fully managed, enterprise-tier endpoints with auto-scaling and security, or flexible open-source development for research or constrained use cases.
Using an AWS LMI container
LMI containers are designed to simplify hosting large models on AWS. These containers include optimized inference engines like vLLM, FasterTransformer, or TensorRT-LLM that can automatically handle things like model parallelism, quantization, and batching for large models. The LMI container is essentially a pre-configured Docker image that runs an inference server (for example a Python server with these optimizations) and allows you to specify model parameters by using environment variables.
To use the LMI container for Parakeet, we would typically:

Choose the appropriate LMI image: AWS provides different LMI images for different frameworks. For Parakeet, we might use the DJLServing image for efficient inference. Alternatively, NVIDIA Triton Inference Server (which Riva uses) is an option if we package the model in ONNX or TensorRT format.
Specify the model configuration: With LMI, we often provide a model_id (if pulling from Hugging Face Hub) or a path to our model, along with configuration for how to load it (number of GPUs, tensor parallel degree, quantization bits). The container then downloads the model and initializes it with the specified settings. We can also download our own model files from Amazon S3 instead of using the Hub.
Define the inference handler: The LMI container might require a small handler script or configuration to tell it how to process requests. For ASR, this might involve reading the audio input, passing it to the model, and returning text.

AWS LMI containers deliver high performance and scalability through advanced optimization techniques, including continuous batching, tensor parallelism, and state-of-the-art quantization methods. LMI containers integrate multiple inference backends (vLLM, TensorRT-LLM) through a single unified configuration, helping users seamlessly experiment and switch between frameworks to find the optimal performance stack for their specific use case.
Using a SageMaker PyTorch container
SageMaker offers PyTorch Deep Learning Containers (DLCs) that come with PyTorch and many common libraries pre-installed. In this example, we demonstrated how to extend our prebuilt container to install necessary packages for the model. You can download the model directly from Hugging Face during the endpoint creation or download the Parakeet model artifacts, packaging it with necessary configuration files into a model.tar.gz archive, and uploading it to Amazon S3. Along with the model artifacts, an inference.py script is required as the entry point script to define model loading and inference logic, including audio preprocessing and transcription handling. When using the SageMaker Python SDK to create a PyTorchModel, the SDK will automatically repackage the model archive to include the inference script under /opt/ml/model/code/inference.py, while keeping model artifacts in /opt/ml/model/ on the endpoint. Once the endpoint is deployed successfully, it can be invoked through the predict API by sending audio files as byte streams to get transcription results.
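
The sketch below shows the shape such an inference.py could take, following the SageMaker PyTorch serving convention of model_fn, input_fn, predict_fn, and output_fn. The NeMo loading call, checkpoint filename, and return types are assumptions and must be adapted to the packaged Parakeet artifacts.

# inference.py - minimal sketch of the SageMaker PyTorch serving handlers.
# The checkpoint name and NeMo calls are assumptions; adjust them to the packaged artifacts.
import json
import tempfile

def model_fn(model_dir):
    import nemo.collections.asr as nemo_asr
    # load the Parakeet checkpoint shipped inside model.tar.gz
    return nemo_asr.models.ASRModel.restore_from(f"{model_dir}/parakeet.nemo")

def input_fn(request_body, content_type="application/octet-stream"):
    # write the raw audio bytes to a temporary wav file for the model
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.write(request_body)
    tmp.close()
    return tmp.name

def predict_fn(audio_path, model):
    # depending on the NeMo version, transcribe() returns strings or hypothesis objects
    return model.transcribe([audio_path])

def output_fn(prediction, accept="application/json"):
    return json.dumps({"transcription": str(prediction[0])})
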
For the SageMaker real-time endpoint, we currently allow a maximum of 25MB for payload size. Make sure you have set up the container to also allow the maximum request size. However, if you are planning to use the same model for the asynchronous endpoint, the maximum file size that the async endpoint supports is 1GB and the response time is up to 1 hour. Accordingly, you should set up the container to be prepared for this payload size and timeout. When using the PyTorch containers, here are some key configuration parameters to consider (a hedged deployment sketch follows the list):

SAGEMAKER_MODEL_SERVER_WORKERS: Set the number of model server workers; each worker loads a copy of the model into GPU memory.
TS_DEFAULT_RESPONSE_TIMEOUT: Set the timeout for TorchServe workers; for long audio processing, set it to a higher number.
TS_MAX_REQUEST_SIZE: Set the maximum request size in bytes; use 1 GB (1073741824) for async endpoints.
TS_MAX_RESPONSE_SIZE: Set the maximum response size in bytes.
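
As referenced above, here is a hedged sketch of wiring these settings into an asynchronous deployment with the SageMaker Python SDK. The S3 paths, IAM role, framework and Python versions, and concrete values are placeholders.

# Sketch: deploy the packaged model as an async endpoint with the timeout and size settings above.
from sagemaker.pytorch import PyTorchModel
from sagemaker.async_inference import AsyncInferenceConfig

model = PyTorchModel(
    model_data="s3://your-bucket/parakeet/model.tar.gz",           # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    entry_point="inference.py",
    framework_version="2.3",
    py_version="py311",
    env={
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "TS_DEFAULT_RESPONSE_TIMEOUT": "3600",   # long audio can take a while
        "TS_MAX_REQUEST_SIZE": "1073741824",     # 1 GB async payload limit
        "TS_MAX_RESPONSE_SIZE": "1073741824",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=AsyncInferenceConfig(output_path="s3://your-bucket/async-output/"),
)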

In the example notebook, we also showcase how to leverage the SageMaker local session provided by the SageMaker Python SDK. It helps you create estimators and run training, processing, and inference jobs locally using Docker containers instead of managed AWS infrastructure, providing a fast way to test and debug your machine learning scripts before scaling to production.
CDK pipeline prerequisites
Before deploying this solution, make sure you have:

AWS CLI configured with appropriate permissions – Installation Guide
AWS Cloud Development Kit (AWS CDK) installed – Installation Guide
Node.js 18+ and Python 3.9+ installed
Docker – Installation Guide
SageMaker endpoint deployed with your ML model (Parakeet ASR models or similar)
Amazon SNS topics created for success and failure notifications

CDK pipeline setup
The solution deployment begins with provisioning the necessary AWS resources using Infrastructure as Code (IaC) principles. AWS CDK creates the foundational components including:

DynamoDB Table: Configured for on-demand capacity to track invocation metadata, processing status, and results
S3 Buckets: Secure storage for input audio files, transcription outputs, and summarization results
SNS topics: Separate topics for success and failure event handling
Lambda functions: Serverless functions for metadata processing, status updates, and workflow orchestration
IAM roles and policies: Appropriate permissions for cross-service communication and resource access

Environment setup
Clone the repository and install dependencies:

# Install degit, a library for downloading specific sub directories
npm install -g degit

# Clone just the specific folder
npx degit aws-samples/genai-ml-platform-examples/infrastructure/automated-speech-recognition-async-pipeline-sagemaker-ai/sagemaker-async-batch-inference-cdk sagemaker-async-batch-inference-cdk

# Navigate to folder
cd sagemaker-async-batch-inference-cdk

# Install Node.js dependencies
npm install

# Set up Python virtual environment
python3 -m venv .venv
source .venv/bin/activate

# On Windows:
.venv\Scripts\activate
pip install -r requirements.txt

Configuration
Update the SageMaker endpoint configuration in bin/aws-blog-sagemaker.ts:

vim bin/aws-blog-sagemaker.ts

# Change the endpoint name
sageMakerConfig: {
  endpointName: 'your-sagemaker-endpoint-name',
  enableSageMakerAccess: true
}

If you have followed the notebook to deploy the endpoint, you should have created the two SNS topics. Otherwise, make sure you create the correct SNS topics using CLI:

# Create SNS topics
aws sns create-topic --name success-inf
aws sns create-topic --name failed-inf

Build and deploy
Before you deploy the AWS CloudFormation template, make sure Docker is running.

# Compile TypeScript to JavaScript
npm run build

# Bootstrap CDK (first time only)
npx cdk bootstrap

# Deploy the stack
npx cdk deploy

Verify deployment
After successful deployment, note the output values:

DynamoDB table name for status tracking
Lambda function ARNs for processing and status updates
SNS topic ARNs for notifications

Submit audio file for processing
Processing Audio Files
Update the upload_audio_invoke_lambda.sh script with your values:

LAMBDA_ARN="YOUR_LAMBDA_FUNCTION_ARN"
S3_BUCKET="YOUR_S3_BUCKET_ARN"

Run the Script:
AWS_PROFILE=default ./scripts/upload_audio_invoke_lambda.sh
This script will:

Download a sample audio file
Upload the audio file to your s3 bucket
Send the bucket path to Lambda and trigger the transcription and summarization pipeline

Monitoring progress
You can check the result in DynamoDB table using the following command:

aws dynamodb scan --table-name YOUR_DYNAMODB_TABLE_NAME

Check processing status in the DynamoDB table:

submitted: Successfully queued for inference
completed: Transcription completed successfully
failed: Processing encountered an error

Audio processing and workflow orchestration
The core processing workflow follows an event-driven pattern:
Initial processing and metadata extraction: When audio files are uploaded to S3, the triggered Lambda function analyzes the file metadata, validates format compatibility, and creates detailed invocation records in DynamoDB. This facilitates comprehensive tracking from the moment audio content enters the system.
Asynchronous Speech Recognition: Audio files are processed through the SageMaker endpoint using optimized ASR models. The asynchronous process can handle various file sizes and durations without timeout concerns. Each processing request is assigned a unique identifier for tracking purposes.
Success path processing: Upon successful transcription, the system automatically initiates the summarization workflow. The transcribed text is sent to Amazon Bedrock, where advanced language models generate contextually appropriate summaries based on configurable parameters such as summary length, focus areas, and output format.
Error handling and recovery: Failed processing attempts trigger dedicated Lambda functions that log detailed error information, update processing status, and can initiate retry logic for transient failures. This robust error handling results in minimal data loss and provides clear visibility into processing issues.
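
As a reference point for the asynchronous invocation step above, the hedged snippet below shows how a Lambda function might submit an S3 audio object to the async endpoint with boto3; the endpoint name and S3 URI are placeholders.

# Sketch: submit an S3 audio object to the async endpoint (names are placeholders).
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_async(
    EndpointName="your-sagemaker-endpoint-name",
    InputLocation="s3://your-bucket/input/meeting-recording.wav",
    ContentType="audio/wav",
    InvocationTimeoutSeconds=3600,
)
print(response["InferenceId"], response["OutputLocation"])
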
Real-world applications
Customer service analytics: Organizations can process thousands of customer service call recordings to generate transcriptions and summaries, enabling sentiment analysis, quality assurance, and insights extraction at scale.
Meeting and conference processing: Enterprise teams can automatically transcribe and summarize meeting recordings, creating searchable archives and actionable summaries for participants and stakeholders.
Media and content processing: Media companies can process podcast episodes, interviews, and video content to generate transcriptions and summaries for improved accessibility and content discoverability.
Compliance and legal documentation: Legal and compliance teams can process recorded depositions, hearings, and interviews to create accurate transcriptions and summaries for case preparation and documentation.
Cleanup
Once you have used the solution, remove the SageMaker endpoints to prevent incurring additional costs. You can use the provided code to delete real-time and asynchronous inference endpoints, respectively:

# Delete the real-time inference endpoint
real_time_predictor.delete_endpoint()

# Delete the asynchronous inference endpoint
async_predictor.delete_endpoint()

You should also delete all the resources created by the CDK stack.

# Delete CDK Stack
cdk destroy

Conclusion
The integration of powerful NVIDIA speech AI technologies with AWS cloud infrastructure creates a comprehensive solution for large-scale audio processing. By combining Parakeet ASR’s industry-leading accuracy and speed with NVIDIA Riva’s optimized deployment framework on the Amazon SageMaker asynchronous inference pipeline, organizations can achieve both high-performance speech recognition and cost-effective scaling. The solution leverages the managed services of AWS (SageMaker AI, Lambda, S3, and Bedrock) to create an automated, scalable pipeline for processing audio content. With features like auto scaling to zero, comprehensive error handling, and real-time monitoring through DynamoDB, organizations can focus on extracting business value from their audio content rather than managing infrastructure complexity. Whether processing customer service calls, meeting recordings, or media content, this architecture delivers reliable, efficient, and cost-effective audio processing capabilities. To experience the full potential of this solution, we encourage you to explore the solution and reach out to us if you have any specific business requirements and would like to customise the solution for your use case.

About the authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Tony Trinh is a Senior AI/ML Specialist Architect at AWS. With 13+ years of experience in the IT industry, Tony specializes in architecting scalable, compliance-driven AI and ML solutions—particularly in generative AI, MLOps, and cloud-native data platforms. As part of his PhD, he’s doing research in Multimodal AI and Spatial AI. In his spare time, Tony enjoys hiking, swimming and experimenting with home improvement.
Alick Wong is a Senior Solutions Architect at Amazon Web Services, where he helps startups and digital-native businesses modernize, optimize, and scale their platforms in the cloud. Drawing on his experience as a former startup CTO, he works closely with founders and engineering leaders to drive growth and innovation on AWS.
Andrew Smith is a Sr. Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.
Derrick Choo is a Senior AI/ML Specialist Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multi-modal systems.
Tim Ma is a Principal Specialist in Generative AI at AWS, where he collaborates with customers to design and deploy cutting-edge machine learning solutions. He also leads go-to-market strategies for generative AI services, helping organizations harness the potential of advanced AI technologies.
Curt Lockhart is an AI Solutions Architect at NVIDIA, where he helps customers deploy language and vision models to build end to end AI workflows using NVIDIA’s tooling on AWS. He enjoys making complex AI feel approachable and spending his time exploring the art, music, and outdoors of the Pacific Northwest.
Francesco Ciannella is a senior engineer at NVIDIA, where he works on conversational AI solutions built around large language models (LLMs) and audio language models (ALMs). He holds a M.S. in engineering of telecommunications from the University of Rome “La Sapienza” and an M.S. in language technologies from the School of Computer Science at Carnegie Mellon University.

How to Build an Agentic Decision-Tree RAG System with Intelligent Query Routing, Self-Checking, and Iterative Refinement?

In this tutorial, we build an advanced Agentic Retrieval-Augmented Generation (RAG) system that goes beyond simple question answering. We design it to intelligently route queries to the right knowledge sources, perform self-checks to assess answer quality, and iteratively refine responses for improved accuracy. We implement the entire system using open-source tools like FAISS, SentenceTransformers, and Flan-T5. As we progress, we explore how routing, retrieval, generation, and self-evaluation combine to form a decision-tree-style RAG pipeline that mimics real-world agentic reasoning. Check out the FULL CODES here.

print("Setting up dependencies...")
import subprocess
import sys

def install_packages():
    packages = ['sentence-transformers', 'transformers', 'torch', 'faiss-cpu', 'numpy', 'accelerate']
    for package in packages:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

try:
    import faiss
except ImportError:
    install_packages()

print("✓ All dependencies installed! Importing modules...\n")
import torch
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')
print("✓ All modules loaded successfully!\n")

We begin by installing all necessary dependencies, including Transformers, FAISS, and SentenceTransformers, to ensure smooth local execution. We check whether FAISS is importable, install the packages if it is not, and then import the modules needed for embedding, retrieval, and generation, such as NumPy, PyTorch, and FAISS. We confirm that all libraries load successfully before moving ahead with the main pipeline. Check out the FULL CODES here.

class VectorStore:
    def __init__(self, embedding_model='all-MiniLM-L6-v2'):
        print(f"Loading embedding model: {embedding_model}...")
        self.embedder = SentenceTransformer(embedding_model)
        self.documents = []
        self.index = None

    def add_documents(self, docs: List[str], sources: List[str]):
        self.documents = [{"text": doc, "source": src} for doc, src in zip(docs, sources)]
        embeddings = self.embedder.encode(docs, show_progress_bar=False)
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(embeddings.astype('float32'))
        print(f"✓ Indexed {len(docs)} documents\n")

    def search(self, query: str, k: int = 3) -> List[Dict]:
        query_vec = self.embedder.encode([query]).astype('float32')
        distances, indices = self.index.search(query_vec, k)
        return [self.documents[i] for i in indices[0]]

We design the VectorStore class to store and retrieve documents efficiently using FAISS-based similarity search. We embed each document using a transformer model and build an index for fast retrieval. This allows us to quickly fetch the most relevant context for any incoming query. Check out the FULL CODES here.

class QueryRouter:
    def __init__(self):
        self.categories = {
            'technical': ['how', 'implement', 'code', 'function', 'algorithm', 'debug'],
            'factual': ['what', 'who', 'when', 'where', 'define', 'explain'],
            'comparative': ['compare', 'difference', 'versus', 'vs', 'better', 'which'],
            'procedural': ['steps', 'process', 'guide', 'tutorial', 'how to']
        }

    def route(self, query: str) -> str:
        query_lower = query.lower()
        scores = {}
        for category, keywords in self.categories.items():
            score = sum(1 for kw in keywords if kw in query_lower)
            scores[category] = score
        best_category = max(scores, key=scores.get)
        return best_category if scores[best_category] > 0 else 'factual'

We introduce the QueryRouter class to classify queries by intent, technical, factual, comparative, or procedural. We use keyword matching to determine which category best fits the input question. This routing step ensures that the retrieval strategy adapts dynamically to different query styles. Check out the FULL CODES here.

class AnswerGenerator:
    def __init__(self, model_name='google/flan-t5-base'):
        print(f"Loading generation model: {model_name}...")
        self.generator = pipeline('text2text-generation', model=model_name,
                                  device=0 if torch.cuda.is_available() else -1, max_length=256)
        device_type = "GPU" if torch.cuda.is_available() else "CPU"
        print(f"✓ Generator ready (using {device_type})\n")

    def generate(self, query: str, context: List[Dict], query_type: str) -> str:
        context_text = "\n\n".join([f"[{doc['source']}]: {doc['text']}" for doc in context])
        # The opening of this prompt template was lost in extraction; this is a reasonable reconstruction.
        prompt = f"""Answer the {query_type} question using only the context below.

Context:
{context_text}

Question: {query}

Answer:"""
        answer = self.generator(prompt, max_length=200, do_sample=False)[0]['generated_text']
        return answer.strip()

    def self_check(self, query: str, answer: str, context: List[Dict]) -> Tuple[bool, str]:
        if len(answer) < 10:
            return False, "Answer too short – needs more detail"
        context_keywords = set()
        for doc in context:
            context_keywords.update(doc['text'].lower().split()[:20])
        answer_words = set(answer.lower().split())
        overlap = len(context_keywords.intersection(answer_words))
        if overlap < 2:
            return False, "Answer not grounded in context – needs more evidence"
        query_keywords = set(query.lower().split())
        if len(query_keywords.intersection(answer_words)) < 1:
            return False, "Answer doesn't address the query – rephrase needed"
        return True, "Answer quality acceptable"

We build the AnswerGenerator class to handle answer creation and self-evaluation. Using the Flan-T5 model, we generate text responses grounded in retrieved documents. Then, we perform a self-check to assess answer length, context grounding, and relevance, ensuring our output is meaningful and accurate. Check out the FULL CODES here.

class AgenticRAG:
    def __init__(self):
        self.vector_store = VectorStore()
        self.router = QueryRouter()
        self.generator = AnswerGenerator()
        self.max_iterations = 2

    def add_knowledge(self, documents: List[str], sources: List[str]):
        self.vector_store.add_documents(documents, sources)

    def query(self, question: str, verbose: bool = True) -> Dict:
        if verbose:
            print(f"\n{'='*60}")
            print(f" Query: {question}")
            print(f"{'='*60}")
        query_type = self.router.route(question)
        if verbose:
            print(f" Route: {query_type.upper()} query detected")
        k_docs = {'technical': 2, 'comparative': 4, 'procedural': 3}.get(query_type, 3)
        iteration = 0
        answer_accepted = False
        while iteration < self.max_iterations and not answer_accepted:
            iteration += 1
            if verbose:
                print(f"\n Iteration {iteration}")
            context = self.vector_store.search(question, k=k_docs)
            if verbose:
                print(f" Retrieved {len(context)} documents from sources:")
                for doc in context:
                    print(f" - {doc['source']}")
            answer = self.generator.generate(question, context, query_type)
            if verbose:
                print(f" Generated answer: {answer[:100]}...")
            answer_accepted, feedback = self.generator.self_check(question, answer, context)
            if verbose:
                status = "✓ ACCEPTED" if answer_accepted else "✗ REJECTED"
                print(f" Self-check: {status}")
                print(f" Feedback: {feedback}")
            if not answer_accepted and iteration < self.max_iterations:
                question = f"{question} (provide more specific details)"
                k_docs += 1
        return {'answer': answer, 'query_type': query_type, 'iterations': iteration,
                'accepted': answer_accepted, 'sources': [doc['source'] for doc in context]}

We combine all components into the AgenticRAG system, which orchestrates routing, retrieval, generation, and quality checking. The system iteratively refines its answers based on self-evaluation feedback, adjusting the query or expanding context when necessary. This creates a feedback-driven decision-tree RAG that automatically improves performance. Check out the FULL CODES here.

def main():
    print("\n" + "="*60)
    print(" AGENTIC RAG WITH ROUTING & SELF-CHECK")
    print("="*60 + "\n")
    documents = [
        "RAG (Retrieval-Augmented Generation) combines information retrieval with text generation. It retrieves relevant documents and uses them as context for generating accurate answers."
    ]
    sources = ["Python Documentation", "ML Textbook", "Neural Networks Guide", "Deep Learning Paper", "Transformer Architecture", "RAG Research Paper"]
    rag = AgenticRAG()
    rag.add_knowledge(documents, sources)
    test_queries = ["What is Python?", "How does machine learning work?", "Compare neural networks and deep learning"]
    for query in test_queries:
        result = rag.query(query, verbose=True)
        print(f"\n{'='*60}")
        print(f" FINAL RESULT:")
        print(f" Answer: {result['answer']}")
        print(f" Query Type: {result['query_type']}")
        print(f" Iterations: {result['iterations']}")
        print(f" Accepted: {result['accepted']}")
        print(f"{'='*60}\n")

if __name__ == "__main__":
    main()

We finalize the demo by loading a small knowledge base and running test queries through the Agentic RAG pipeline. We observe how the model routes, retrieves, and refines answers step by step, printing intermediate results for transparency. By the end, we confirm that our system successfully delivers accurate, self-validated answers using only local computation.

In conclusion, we create a fully functional Agentic RAG framework that autonomously retrieves, reasons, and refines its answers. We witness how the system dynamically routes different query types, evaluates its own responses, and improves them through iterative feedback, all within a lightweight, local environment. Through this exercise, we deepen our understanding of RAG architectures and also experience how agentic components can transform static retrieval systems into self-improving intelligent agents.

Check out the FULL CODES here.

Meet ‘kvcached’: A Machine Learning Library to Enable Virtualized, Elastic KV Cache for LLM Serving on Shared GPUs

Large language model serving often wastes GPU memory because engines pre-reserve large static KV cache regions per model, even when requests are bursty or idle. Meet ‘kvcached‘, a library to enable virtualized, elastic KV cache for LLM serving on shared GPUs. kvcached has been developed by a research team from Berkeley’s Sky Computing Lab (University of California, Berkeley) in close collaboration with Rice University and UCLA, and with valuable input from collaborators and colleagues at NVIDIA, Intel Corporation, and Stanford University. It introduces an OS-style virtual memory abstraction for the KV cache that lets serving engines reserve contiguous virtual space first, then back only the active portions with physical GPU pages on demand. This decoupling raises memory utilization, reduces cold starts, and enables multiple models to time share and space share a device without heavy engine rewrites.

https://github.com/ovg-project/kvcached

What kvcached changes?

With kvcached, an engine creates a KV cache pool that is contiguous in the virtual address space. As tokens arrive, the library maps physical GPU pages lazily at a fine granularity using CUDA virtual memory APIs. When requests complete or models go idle, pages unmap and return to a shared pool, which other colocated models can immediately reuse. This preserves simple pointer arithmetic in kernels, and removes the need for per engine user level paging. The project targets SGLang and vLLM integration, and it is released under the Apache 2.0 license. Installation and a one command quick start are documented in the Git repository.
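
To illustrate the reserve-then-map idea only, here is a conceptual Python sketch. It is not kvcached's API; the real library performs the reservation and mapping with CUDA virtual memory calls inside the serving engine.

# Conceptual illustration, not kvcached's API: a contiguous virtual range is reserved up front,
# physical pages are attached lazily, and freed pages return to a pool shared across models.
PAGE = 2 * 1024 * 1024  # arbitrary illustrative page size

class SharedPhysicalPool:
    def __init__(self, total_pages):
        self.free_pages = total_pages
    def acquire(self, n):
        assert n <= self.free_pages, "out of physical GPU pages"
        self.free_pages -= n
    def release(self, n):
        self.free_pages += n

class ElasticKVCache:
    def __init__(self, pool, virtual_bytes):
        self.pool = pool
        self.virtual_pages = virtual_bytes // PAGE   # contiguous virtual reservation
        self.mapped_pages = 0
    def ensure_capacity(self, needed_bytes):
        needed = -(-needed_bytes // PAGE)            # ceil division
        if needed > self.mapped_pages:               # map only the delta, on demand
            self.pool.acquire(needed - self.mapped_pages)
            self.mapped_pages = needed
    def trim(self):                                  # model goes idle: give pages back
        self.pool.release(self.mapped_pages)
        self.mapped_pages = 0

pool = SharedPhysicalPool(total_pages=1000)
model_a = ElasticKVCache(pool, virtual_bytes=8 << 30)
model_a.ensure_capacity(256 * PAGE)   # tokens arrive, pages get mapped
model_a.trim()                        # freed pages are immediately usable by a colocated model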

https://yifanqiao.notion.site/Solve-the-GPU-Cost-Crisis-with-kvcached-289da9d1f4d68034b17bf2774201b141

How does it impact at scale?

Production workloads host many models with long tail traffic and spiky bursts. Static reservations leave memory stranded and slow down time to first token when models must be activated or swapped. The Prism research paper shows that multi-LLM serving requires cross model memory coordination at runtime, not just compute scheduling. Prism implements on demand mapping of physical to virtual pages and a two level scheduler, and reports more than 2 times cost savings and 3.3 times higher TTFT SLO attainment versus prior systems on real traces. kvcached focuses on the memory coordination primitive, and provides a reusable component that brings this capability to mainstream engines.

https://www.arxiv.org/pdf/2505.04021

Performance signals

The kvcached team reports 1.2 times to 28 times faster time to first token in multi model serving, due to immediate reuse of freed pages and the removal of large static allocations. These numbers come from multi-LLM scenarios where activation latency and memory headroom dominate tail latency. The research team notes kvcached’s compatibility with SGLang and vLLM, and describes elastic KV allocation as the core mechanism.

https://yifanqiao.notion.site/Solve-the-GPU-Cost-Crisis-with-kvcached-289da9d1f4d68034b17bf2774201b141

How is it related to recent research?

Recent work has moved from fixed partitioning to virtual memory based methods for KV management. Prism extends VMM based allocation to multi-LLM settings with cross model coordination and scheduling. Prior efforts like vAttention explore CUDA VMM for single model serving to avoid fragmentation without PagedAttention. The arc is clear: use virtual memory to keep KV contiguous in virtual space, then map physical pages elastically as the workload evolves. kvcached operationalizes this idea as a library, which simplifies adoption inside existing engines.

https://www.arxiv.org/pdf/2505.04021

Practical Applications for Devs

Colocation across models: Engines can colocate several small or medium models on one device. When one model goes idle, its KV pages free quickly and another model can expand its working set without restart. This reduces head of line blocking during bursts and improves TTFT SLO attainment.

Activation behavior: Prism reports activation times of about 0.7 seconds for an 8B model and about 1.5 seconds for a 70B model with streaming activation. kvcached benefits from similar principles because virtual reservations allow engines to prepare address ranges in advance, then map pages as tokens arrive.

Autoscaling for serverless LLM: Fine grained page mapping makes it feasible to scale replicas more frequently and to run cold models in a warm state with minimal memory footprint. This enables tighter autoscaling loops and reduces the blast radius of hot spots.

Offloading and future work: Virtual memory opens the door to KV offload to host memory or NVMe when the access pattern allows it. NVIDIA’s recent guide on managed memory for KV offload on GH200 class systems shows how unified address spaces can extend capacity at acceptable overheads. The kvcached maintainers also discuss offload and compaction directions in public threads. Verify throughput and latency in your own pipeline, since access locality and PCIe topology have strong effects.

https://www.arxiv.org/pdf/2505.04021

Key Takeaways

kvcached virtualizes the KV cache using GPU virtual memory, engines reserve contiguous virtual space and map physical pages on demand, enabling elastic allocation and reclamation under dynamic loads.

It integrates with mainstream inference engines, specifically SGLang and vLLM, and is released under Apache 2.0, making adoption and modification straightforward for production serving stacks.

Public benchmarks report 1.2 times to 28 times faster time to first token in multi model serving due to immediate reuse of freed KV pages and the removal of large static reservations.

Prism shows that cross model memory coordination, implemented via on demand mapping and two level scheduling, delivers more than 2 times cost savings and 3.3 times higher TTFT SLO attainment on real traces, kvcached supplies the memory primitive that mainstream engines can reuse.

For clusters that host many models with bursty, long tail traffic, virtualized KV cache allows safe colocation, faster activation, and tighter autoscaling, with reported activation around 0.7 seconds for an 8B model and 1.5 seconds for a 70B model in the Prism evaluation.

Editorial Comments

kvcached is an effective approach toward GPU memory virtualization for LLM serving, not a full operating system, and that clarity matters. The library reserves virtual address space for the KV cache, then maps physical pages on demand, which enables elastic sharing across models with minimal engine changes. This aligns with evidence that cross model memory coordination is essential for multi model workloads and improves SLO attainment and cost under real traces. Overall, kvcached advances GPU memory coordination for LLM serving; production value depends on per cluster validation.

Check out the GitHub Repo, Paper 1, Paper 2 and Technical details.

5 Common LLM Parameters Explained with Examples

Large language models (LLMs) offer several parameters that let you fine-tune their behavior and control how they generate responses. If a model isn’t producing the desired output, the issue often lies in how these parameters are configured. In this tutorial, we’ll explore some of the most commonly used ones — max_completion_tokens, temperature, top_p, presence_penalty, and frequency_penalty — and understand how each influences the model’s output.

Installing the dependencies

pip install openai pandas matplotlib

Loading OpenAI API Key

import os
from getpass import getpass

os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

Initializing the Model

from openai import OpenAI

model = "gpt-4.1"
client = OpenAI()

Max Tokens

Max Tokens is the maximum number of tokens the model can generate during a run. The model will try to stay within this limit across all turns. If it exceeds the specified number, the run will stop and be marked as incomplete.

A smaller value (like 16) limits the model to very short answers, while a higher value (like 80) allows it to generate more detailed and complete responses. Increasing this parameter gives the model more room to elaborate, explain, or format its output more naturally.

prompt = "What is the most popular French cheese?"

for tokens in [16, 30, 80]:
    print(f"\n— max_completion_tokens = {tokens} —")
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "developer", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=tokens
    )
    print(response.choices[0].message.content)

Temperature

In Large Language Models (LLMs), the temperature parameter controls the diversity and randomness of generated outputs. Lower temperature values make the model more deterministic and focused on the most probable responses — ideal for tasks that require accuracy and consistency. Higher values, on the other hand, introduce creativity and variety by allowing the model to explore less likely options. Technically, temperature scales the probabilities of predicted tokens in the softmax function: increasing it flattens the distribution (more diverse outputs), while decreasing it sharpens the distribution (more predictable outputs).
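
A small NumPy illustration of this scaling, using made-up logits, shows how lower temperatures sharpen the distribution and higher temperatures flatten it:

# Made-up logits for four candidate tokens; only the temperature changes.
import numpy as np

def softmax_with_temperature(logits, T):
    z = np.array(logits) / T
    z -= z.max()              # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 2.5, 1.0, 0.5]  # e.g. "Petra", "Kyoto", "Machu Picchu", "Reykjavik"
for T in (0.2, 1.0, 1.5):
    print(T, np.round(softmax_with_temperature(logits, T), 3))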

In this code, we’re prompting the LLM to give 10 different responses (n_choices = 10) for the same question — “What is one intriguing place worth visiting?” — across a range of temperature values. By doing this, we can observe how the diversity of answers changes with temperature. Lower temperatures will likely produce similar or repeated responses, while higher temperatures will show a broader and more varied distribution of places.

prompt = "What is one intriguing place worth visiting? Give a single-word answer and think globally."

temperatures = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]
n_choices = 10
results = {}

for temp in temperatures:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=temp,
        n=n_choices
    )

    # Collect all n responses in a list
    results[temp] = [response.choices[i].message.content.strip() for i in range(n_choices)]

# Display results
for temp, responses in results.items():
    print(f"\n— temperature = {temp} —")
    print(responses)

As we can see, once the temperature reaches 0.6 the responses become more diverse, moving beyond the repeated single answer "Petra." At a higher temperature of 1.5, the distribution shifts further, and we see responses like Kyoto and Machu Picchu as well.

Top P

Top P (also known as nucleus sampling) is a parameter that controls how many tokens the model considers based on a cumulative probability threshold. It helps the model focus on the most likely tokens, often improving coherence and output quality.

In the following visualization, we first set a temperature value and then apply Top P = 0.5 (50%), meaning only the top 50% of the probability mass is kept. Note that when temperature = 0, the output is deterministic, so Top P has no effect.

The generation process works as follows (a minimal sketch of these steps appears right after the list):

Apply the temperature to adjust the token probabilities.

Use Top P to retain only the most probable tokens that together make up 50% of the total probability mass.

Renormalize the remaining probabilities before sampling.
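The toy sketch below uses plain NumPy, not the OpenAI API, to mirror those three steps on a made-up distribution; the token names and logits are illustrative only.

import numpy as np

def sample_top_p(logits, temperature=1.0, top_p=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: apply temperature, then softmax.
    scaled = np.array(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Step 2: keep the smallest set of top tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    # Step 3: renormalize the kept probabilities and sample from them.
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

tokens = ["Petra", "Kyoto", "Machu Picchu", "Reykjavik"]  # toy vocabulary
logits = [3.0, 1.5, 1.2, 0.5]                              # toy scores
print(tokens[sample_top_p(logits, temperature=0.8, top_p=0.5)])  # only "Petra" survives the 50% cutoff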

We’ll visualize how the token probability distribution changes across different temperature values for the question: “What is one intriguing place worth visiting?”

prompt = "What is one intriguing place worth visiting? Give a single-word answer and think globally."

temperatures = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]
n_choices = 10
results_ = {}

for temp in temperatures:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=temp,
        n=n_choices,
        top_p=0.5
    )
    # Collect all n responses in a list
    results_[temp] = [response.choices[i].message.content.strip() for i in range(n_choices)]

# Display results
for temp, responses in results_.items():
    print(f"\n--- temperature = {temp} ---")
    print(responses)

Since Petra consistently accounted for more than 50% of the total response probability, applying Top P = 0.5 filters out all other options. As a result, the model only selects “Petra” as the final output in every case.

Frequency Penalty

Frequency Penalty controls how much the model avoids repeating the same words or phrases in its output.

Range: -2 to 2

Default: 0

When the frequency penalty is higher, the model gets penalized for using words it has already used before. This encourages it to choose new and different words, making the text more varied and less repetitive.

In simple terms, a higher frequency penalty means less repetition and more creativity.

We’ll test this using the prompt:

“List 10 possible titles for a fantasy book. Give the titles only and each title on a new line.”

prompt = "List 10 possible titles for a fantasy book. Give the titles only and each title on a new line."
frequency_penalties = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0]
results = {}

for fp in frequency_penalties:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        frequency_penalty=fp,
        temperature=0.2
    )
    text = response.choices[0].message.content
    items = [line.strip("- ").strip() for line in text.split("\n") if line.strip()]
    results[fp] = items

# Display results
for fp, items in results.items():
    print(f"\n--- frequency_penalty = {fp} ---")
    print(items)

Low frequency penalties (-2 to 0): Titles tend to repeat, with familiar patterns like “The Shadow Weaver’s Oath”, “Crown of Ember and Ice”, and “The Last Dragon’s Heir” appearing frequently.

Moderate penalties (0.5 to 1.5): Some repetition remains, but the model starts generating more varied and creative titles.

High penalty (2.0): The first three titles are still the same, but after that, the model produces diverse, unique, and imaginative book names (e.g., “Whisperwind Chronicles: Rise of the Phoenix Queen”, “Ashes Beneath the Willow Tree”).

Presence Penalty

Presence Penalty controls how much the model avoids repeating words or phrases that have already appeared in the text.

Range: -2 to 2

Default: 0

A higher presence penalty encourages the model to use a wider variety of words, making the output more diverse and creative.

Unlike the frequency penalty, which accumulates with each repetition, the presence penalty is applied once to any word that has already appeared, reducing the chance it will be repeated in the output. This helps the model produce text with more variety and originality.
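OpenAI’s parameter documentation describes both penalties as subtractions from the sampling logits: the frequency penalty is multiplied by how many times a token has already appeared, while the presence penalty is subtracted once as soon as the token has appeared at all. The toy function below (plain Python, not an API call) illustrates the difference:

def penalized_logit(logit, count, frequency_penalty=0.0, presence_penalty=0.0):
    # The frequency term scales with how often the token has already been generated;
    # the presence term is applied once as soon as the token has appeared.
    return logit - count * frequency_penalty - (1.0 if count > 0 else 0.0) * presence_penalty

# A token seen 0, 1, or 3 times, under each penalty type (base logit 2.0).
for count in [0, 1, 3]:
    print(count,
          round(penalized_logit(2.0, count, frequency_penalty=1.0), 2),  # keeps dropping with repetitions
          round(penalized_logit(2.0, count, presence_penalty=1.0), 2))   # flat after the first appearance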

prompt = "List 10 possible titles for a fantasy book. Give the titles only and each title on a new line."
presence_penalties = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0]
results = {}

for pp in presence_penalties:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        presence_penalty=pp,
        temperature=0.2
    )
    text = response.choices[0].message.content
    items = [line.strip("- ").strip() for line in text.split("\n") if line.strip()]
    results[pp] = items

# Display results
for pp, items in results.items():
    print(f"\n--- presence_penalty = {pp} ---")
    print(items)

Low to Moderate Penalty (-2.0 to 0.5): Titles are somewhat varied, with some repetition of common fantasy patterns like “The Shadow Weaver’s Oath”, “The Last Dragon’s Heir”, “Crown of Ember and Ice”.

Medium Penalty (1.0 to 1.5): The first few popular titles remain, while later titles show more creativity and unique combinations. Examples: “Ashes of the Fallen Kingdom”, “Secrets of the Starbound Forest”, “Daughter of Storm and Stone”.

Maximum Penalty (2.0): Top three titles stay the same, but the rest become highly diverse and imaginative. Examples: “Moonfire and Thorn”, “Veil of Starlit Ashes”, “The Midnight Blade”.

Check out the FULL CODES here.
The post 5 Common LLM Parameters Explained with Examples appeared first on MarkTechPost.

How to Build, Train, and Compare Multiple Reinforcement Learning Agents in a Custom Trading Environment Using Stable-Baselines3

In this tutorial, we explore advanced applications of Stable-Baselines3 in reinforcement learning. We design a fully functional, custom trading environment, integrate multiple algorithms such as PPO and A2C, and develop our own training callbacks for performance tracking. As we progress, we train, evaluate, and visualize agent performance to compare algorithmic efficiency, learning curves, and decision strategies, all within a streamlined workflow that runs entirely offline. Check out the FULL CODES here.

!pip install stable-baselines3[extra] gymnasium pygame
import numpy as np
import gymnasium as gym
from gymnasium import spaces
import matplotlib.pyplot as plt
from stable_baselines3 import PPO, A2C, DQN, SAC
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import torch

class TradingEnv(gym.Env):
    def __init__(self, max_steps=200):
        super().__init__()
        self.max_steps = max_steps
        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)
        self.reset()

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_step = 0
        self.balance = 1000.0
        self.shares = 0
        self.price = 100.0
        self.price_history = [self.price]
        return self._get_obs(), {}

    def _get_obs(self):
        price_trend = np.mean(self.price_history[-5:]) if len(self.price_history) >= 5 else self.price
        return np.array([
            self.balance / 1000.0,
            self.shares / 10.0,
            self.price / 100.0,
            price_trend / 100.0,
            self.current_step / self.max_steps
        ], dtype=np.float32)

    def step(self, action):
        self.current_step += 1
        trend = 0.001 * np.sin(self.current_step / 20)
        self.price *= (1 + trend + np.random.normal(0, 0.02))
        self.price = np.clip(self.price, 50, 200)
        self.price_history.append(self.price)
        reward = 0
        if action == 1 and self.balance >= self.price:  # buy
            shares_to_buy = int(self.balance / self.price)
            cost = shares_to_buy * self.price
            self.balance -= cost
            self.shares += shares_to_buy
            reward = -0.01
        elif action == 2 and self.shares > 0:  # sell
            revenue = self.shares * self.price
            self.balance += revenue
            self.shares = 0
            reward = 0.01
        portfolio_value = self.balance + self.shares * self.price
        reward += (portfolio_value - 1000) / 1000
        terminated = self.current_step >= self.max_steps
        truncated = False
        return self._get_obs(), reward, terminated, truncated, {"portfolio": portfolio_value}

    def render(self):
        print(f"Step: {self.current_step}, Balance: ${self.balance:.2f}, Shares: {self.shares}, Price: ${self.price:.2f}")

We define our custom TradingEnv, where an agent learns to make buy, sell, or hold decisions based on simulated price movements. We define the observation and action spaces, implement the reward structure, and ensure our environment reflects a realistic market scenario with fluctuating trends and noise. Check out the FULL CODES here.

class ProgressCallback(BaseCallback):
    def __init__(self, check_freq=1000, verbose=1):
        super().__init__(verbose)
        self.check_freq = check_freq
        self.rewards = []

    def _on_step(self):
        if self.n_calls % self.check_freq == 0:
            mean_reward = np.mean([ep_info["r"] for ep_info in self.model.ep_info_buffer])
            self.rewards.append(mean_reward)
            if self.verbose:
                print(f"Steps: {self.n_calls}, Mean Reward: {mean_reward:.2f}")
        return True

print("=" * 60)
print("Setting up custom trading environment...")
env = TradingEnv()
check_env(env, warn=True)
print("✓ Environment validation passed!")
env = Monitor(env)
vec_env = DummyVecEnv([lambda: env])
vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=True)

Here, we create a ProgressCallback to monitor training progress and record mean rewards at regular intervals. We then validate our custom environment using Stable-Baselines3’s built-in checker, wrap it for monitoring and normalization, and prepare it for training across multiple algorithms. Check out the FULL CODES here.

print("\n" + "=" * 60)
print("Training multiple RL algorithms...")
algorithms = {
    "PPO": PPO("MlpPolicy", vec_env, verbose=0, learning_rate=3e-4, n_steps=2048),
    "A2C": A2C("MlpPolicy", vec_env, verbose=0, learning_rate=7e-4),
}
results = {}
for name, model in algorithms.items():
    print(f"\nTraining {name}...")
    callback = ProgressCallback(check_freq=2000, verbose=0)
    model.learn(total_timesteps=50000, callback=callback, progress_bar=True)
    results[name] = {"model": model, "rewards": callback.rewards}
    print(f"✓ {name} training complete!")

print("\n" + "=" * 60)
print("Evaluating trained models...")
eval_env = Monitor(TradingEnv())
for name, data in results.items():
    mean_reward, std_reward = evaluate_policy(data["model"], eval_env, n_eval_episodes=20, deterministic=True)
    results[name]["eval_mean"] = mean_reward
    results[name]["eval_std"] = std_reward
    print(f"{name}: Mean Reward = {mean_reward:.2f} +/- {std_reward:.2f}")

We train and evaluate two different reinforcement learning algorithms, PPO and A2C, on our trading environment. We log their performance metrics, capture mean rewards, and compare how efficiently each agent learns profitable trading strategies through consistent exploration and exploitation. Check out the FULL CODES here.

print("\n" + "=" * 60)
print("Generating visualizations...")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

ax = axes[0, 0]
for name, data in results.items():
    ax.plot(data["rewards"], label=name, linewidth=2)
ax.set_xlabel("Training Checkpoints (x1000 steps)")
ax.set_ylabel("Mean Episode Reward")
ax.set_title("Training Progress Comparison")
ax.legend()
ax.grid(True, alpha=0.3)

ax = axes[0, 1]
names = list(results.keys())
means = [results[n]["eval_mean"] for n in names]
stds = [results[n]["eval_std"] for n in names]
ax.bar(names, means, yerr=stds, capsize=10, alpha=0.7, color=['#1f77b4', '#ff7f0e'])
ax.set_ylabel("Mean Reward")
ax.set_title("Evaluation Performance (20 episodes)")
ax.grid(True, alpha=0.3, axis='y')

ax = axes[1, 0]
best_model = max(results.items(), key=lambda x: x[1]["eval_mean"])[1]["model"]
obs = eval_env.reset()[0]
portfolio_values = [1000]
for _ in range(200):
    action, _ = best_model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = eval_env.step(action)
    portfolio_values.append(info.get("portfolio", portfolio_values[-1]))
    if done:
        break
ax.plot(portfolio_values, linewidth=2, color='green')
ax.axhline(y=1000, color='red', linestyle='--', label='Initial Value')
ax.set_xlabel("Steps")
ax.set_ylabel("Portfolio Value ($)")
ax.set_title(f"Best Model ({max(results.items(), key=lambda x: x[1]['eval_mean'])[0]}) Episode")
ax.legend()
ax.grid(True, alpha=0.3)

We visualize our training results by plotting learning curves, evaluation scores, and portfolio trajectories for the best-performing model. We also analyze how the agent’s actions translate into portfolio growth, which helps us interpret model behavior and assess decision consistency during simulated trading sessions. Check out the FULL CODES here.

ax = axes[1, 1]
obs = eval_env.reset()[0]
actions = []
for _ in range(200):
    action, _ = best_model.predict(obs, deterministic=True)
    actions.append(action)
    obs, _, done, truncated, _ = eval_env.step(action)
    if done:
        break
action_names = ['Hold', 'Buy', 'Sell']
action_counts = [actions.count(i) for i in range(3)]
ax.pie(action_counts, labels=action_names, autopct='%1.1f%%', startangle=90, colors=['#ff9999', '#66b3ff', '#99ff99'])
ax.set_title("Action Distribution (Best Model)")
plt.tight_layout()
plt.savefig('sb3_advanced_results.png', dpi=150, bbox_inches='tight')
print("✓ Visualizations saved as 'sb3_advanced_results.png'")
plt.show()

print("\n" + "=" * 60)
print("Saving and loading models...")
best_name = max(results.items(), key=lambda x: x[1]["eval_mean"])[0]
best_model = results[best_name]["model"]
best_model.save(f"best_trading_model_{best_name}")
vec_env.save("vec_normalize.pkl")
loaded_model = PPO.load(f"best_trading_model_{best_name}")
print(f"✓ Best model ({best_name}) saved and loaded successfully!")
print("\n" + "=" * 60)
print("TUTORIAL COMPLETE!")
print(f"Best performing algorithm: {best_name}")
print(f"Final evaluation score: {results[best_name]['eval_mean']:.2f}")
print("=" * 60)

Finally, we visualize the action distribution of the best agent to understand its trading tendencies and save the top-performing model for reuse. We demonstrate model loading, confirm the best algorithm, and complete the tutorial with a clear summary of performance outcomes and insights gained.

In conclusion, we have created, trained, and compared multiple reinforcement learning agents in a realistic trading simulation using Stable-Baselines3. We observe how each algorithm adapts to market dynamics, visualize their learning trends, and identify the most profitable strategy. This hands-on implementation strengthens our understanding of RL pipelines and demonstrates how customizable, efficient, and scalable Stable-Baselines3 can be for complex, domain-specific tasks such as financial modeling.

Check out the FULL CODES here.
The post How to Build, Train, and Compare Multiple Reinforcement Learning Agents in a Custom Trading Environment Using Stable-Baselines3 appeared first on MarkTechPost.

A New AI Research from Anthropic and Thinking Machines Lab Stress Tests Model Specs and Reveals Character Differences among Language Models

AI companies use model specifications to define target behaviors during training and evaluation. Do current specs state the intended behaviors with enough precision, and do frontier models exhibit distinct behavioral profiles under the same spec? A team of researchers from Anthropic, Thinking Machines Lab, and Constellation presents a systematic method that stress tests model specs using value tradeoff scenarios, then quantifies cross model disagreement as a signal of gaps or contradictions in the spec. The research team analyzed 12 frontier LLMs from Anthropic, OpenAI, Google, and xAI and linked high disagreement to specification violations, missing guidance on response quality, and evaluator ambiguity. The team also released a public dataset.

Model specifications are the written rules that alignment systems try to enforce. If a spec is complete and precise, models trained to follow it should not diverge widely on the same input. The research team operationalizes this intuition. It generates more than 300,000 scenarios that force a choice between two legitimate values, such as social equity and business effectiveness. It then scores responses on a 0 to 6 spectrum using value spectrum rubrics and measures disagreement as the standard deviation across models. High disagreement localizes the spec clauses that need clarification or additional examples.

https://arxiv.org/pdf/2510.07686

So, what is the method used in this research?

The research team starts from a taxonomy of 3,307 fine grained values observed in natural Claude traffic, which is more granular than typical model specs. For each pair of values, they generate a neutral query and two biased variants that lean toward one value. They build value spectrum rubrics that map positions from 0, which means strongly opposing the value, to 6, which means strongly favoring the value. They classify responses from 12 models against these rubrics and define disagreement as the maximum standard deviation across the two value dimensions. To remove near duplicates while keeping the hard cases, they use a disagreement weighted k center selection with Gemini embeddings and a 2 approximation greedy algorithm.
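As a rough, self-contained illustration of that disagreement metric (the rubric positions below are invented, not taken from the paper), each model's response is mapped to a 0 to 6 position on both value dimensions, and the scenario's disagreement is the larger of the two standard deviations:

import numpy as np

# Hypothetical rubric positions (0-6) assigned to 12 models' responses
# for one scenario, on the two values being traded off.
scores_value_a = np.array([1, 2, 1, 5, 6, 2, 3, 1, 5, 2, 6, 4])
scores_value_b = np.array([5, 4, 5, 1, 0, 4, 3, 5, 1, 4, 0, 2])

# Disagreement = maximum standard deviation across the two value dimensions.
disagreement = max(scores_value_a.std(), scores_value_b.std())
print(f"disagreement = {disagreement:.2f}")  # high values flag spec gaps or contradictions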

https://arxiv.org/pdf/2510.07686

Scale and releases

The dataset on Hugging Face shows three subsets. The default split has about 132,000 rows, the complete split has about 411,000 rows, and the judge evaluations split has about 24,600 rows. The card lists modality, format as parquet, and license as Apache 2.0.

Understanding the Results

Disagreement predicts spec violations: Testing five OpenAI models against the public OpenAI model spec, high disagreement scenarios show 5 to 13 times higher rates of frequent non-compliance. The research team interprets the pattern as evidence of contradictions and ambiguities in the spec text rather than idiosyncrasies of a single model.

Specs lack granularity on quality inside the safe region: Some scenarios produce responses that all pass compliance, yet differ in helpfulness. For instance, one model refuses and offers safe alternatives, while another only refuses. The spec accepts both, which indicates missing guidance on quality standards.

Evaluator models disagree on compliance: Three LLM judges, Claude 4 Sonnet, o3, and Gemini 2.5 Pro, show only moderate agreement with Fleiss Kappa near 0.42. The blog attributes conflicts to interpretive differences such as conscientious pushback versus transformation exceptions.

https://alignment.anthropic.com/2025/stress-testing-model-specs/

Provider level character patterns: Aggregating high disagreement scenarios reveals consistent value preferences. Claude models prioritize ethical responsibility and intellectual integrity and objectivity. OpenAI models tend to favor efficiency and resource optimization. Gemini 2.5 Pro and Grok more often emphasize emotional depth and authentic connection. Other values, such as business effectiveness, personal growth and wellbeing, and social equity and justice, show mixed patterns across providers.

Refusals and false positives: The analysis shows topic sensitive refusal spikes. It documents false positive refusals, including legitimate synthetic biology study plans and standard Rust unsafe types that are often safe in context. Claude models are the most cautious by rate of refusal and often provide alternative suggestions, and o3 most often issues direct refusals without elaboration. All models show high refusal rates on child grooming risks.

https://alignment.anthropic.com/2025/stress-testing-model-specs/

Outliers reveal misalignment and over conservatism: Grok 4 and Claude 3.5 Sonnet produce the most outlier responses, but for different reasons. Grok is more permissive on requests that others consider harmful. Claude 3.5 sometimes over rejects benign content. Outlier mining is a useful lens for locating both safety gaps and excessive filtering.

https://alignment.anthropic.com/2025/stress-testing-model-specs/

Key Takeaways

Method and scale: The study stress-tests model specs using value-tradeoff scenarios generated from a 3,307-value taxonomy, producing 300,000+ scenarios and evaluating 12 frontier LLMs across Anthropic, OpenAI, Google, and xAI.

Disagreement ⇒ spec problems: High cross-model disagreement strongly predicts issues in specs, including contradictions and coverage gaps. In tests against the OpenAI model spec, high-disagreement items show 5 to 13× higher frequent non-compliance.

Public release: The team released a dataset for independent auditing and reproduction.

Provider-level behavior: Aggregated results reveal systematic value preferences, for example Claude prioritizes ethical responsibility, Gemini emphasizes emotional depth, while OpenAI and Grok optimize for efficiency. Some values, such as business effectiveness and social equity and justice, show mixed patterns.

Refusals and outliers: High-disagreement slices expose both false-positive refusals on benign topics and permissive responses on risky ones. Outlier analysis identifies cases where one model diverges from at least 9 of the other 11, useful for pinpointing misalignment and over-conservatism.

Editorial Comments

This research turns disagreement into a measurable diagnostic for spec quality, not a vibe. The research team generates 300,000 plus value trade off scenarios, scores responses on a 0 to 6 rubric, then uses cross model standard deviation to locate specification gaps. High disagreement predicts frequent non compliance by 5 to 13 times under the OpenAI model spec. Judge models show only moderate agreement, Fleiss Kappa near 0.42, which exposes interpretive ambiguity. Provider level value patterns are clear, Claude favors ethical responsibility, OpenAI favors efficiency and resource optimization, Gemini and Grok emphasize emotional depth and authentic connection. The dataset enables reproduction. Deploy this to debug specs before deployment, not after.

Check out the Paper, Dataset, and Technical details.
The post A New AI Research from Anthropic and Thinking Machines Lab Stress Tests Model Specs and Reveal Character Differences among Language Models appeared first on MarkTechPost.

How to Build a Fully Functional Computer-Use Agent that Thinks, Plans, and Executes Virtual Actions Using Local AI Models

In this tutorial, we build an advanced computer-use agent from scratch that can reason, plan, and perform virtual actions using a local open-weight model. We create a miniature simulated desktop, equip it with a tool interface, and design an intelligent agent that can analyze its environment, decide on actions like clicking or typing, and execute them step by step. By the end, we see how the agent interprets goals such as opening emails or taking notes, demonstrating how a local language model can mimic interactive reasoning and task execution. Check out the FULL CODES here.

!pip install -q transformers accelerate sentencepiece nest_asyncio
import torch, asyncio, uuid
from transformers import pipeline
import nest_asyncio
nest_asyncio.apply()

We set up our environment by installing essential libraries such as Transformers, Accelerate, and Nest Asyncio, which enable us to run local models and asynchronous tasks seamlessly in Colab. We prepare the runtime so that the upcoming components of our agent can work efficiently without external dependencies. Check out the FULL CODES here.

class LocalLLM:
    def __init__(self, model_name="google/flan-t5-small", max_new_tokens=128):
        self.pipe = pipeline("text2text-generation", model=model_name, device=0 if torch.cuda.is_available() else -1)
        self.max_new_tokens = max_new_tokens

    def generate(self, prompt: str) -> str:
        out = self.pipe(prompt, max_new_tokens=self.max_new_tokens, temperature=0.0)[0]["generated_text"]
        return out.strip()


class VirtualComputer:
    def __init__(self):
        self.apps = {"browser": "https://example.com", "notes": "", "mail": ["Welcome to CUA", "Invoice #221", "Weekly Report"]}
        self.focus = "browser"
        self.screen = "Browser open at https://example.com\nSearch bar focused."
        self.action_log = []

    def screenshot(self):
        return f"FOCUS:{self.focus}\nSCREEN:\n{self.screen}\nAPPS:{list(self.apps.keys())}"

    def click(self, target: str):
        if target in self.apps:
            self.focus = target
            if target == "browser":
                self.screen = f"Browser tab: {self.apps['browser']}\nAddress bar focused."
            elif target == "notes":
                self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
            elif target == "mail":
                inbox = "\n".join(f"- {s}" for s in self.apps['mail'])
                self.screen = f"Mail App Inbox:\n{inbox}\n(Read-only preview)"
        else:
            self.screen += f"\nClicked '{target}'."
        self.action_log.append({"type": "click", "target": target})

    def type(self, text: str):
        if self.focus == "browser":
            self.apps["browser"] = text
            self.screen = f"Browser tab now at {text}\nPage headline: Example Domain"
        elif self.focus == "notes":
            self.apps["notes"] += ("\n" + text)
            self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
        else:
            self.screen += f"\nTyped '{text}' but no editable field."
        self.action_log.append({"type": "type", "text": text})

We define the core components, a lightweight local model, and a virtual computer. We use Flan-T5 as our reasoning engine and create a simulated desktop that can open apps, display screens, and respond to typing and clicking actions. Check out the FULL CODES here.

class ComputerTool:
    def __init__(self, computer: VirtualComputer):
        self.computer = computer

    def run(self, command: str, argument: str = ""):
        if command == "click":
            self.computer.click(argument)
            return {"status": "completed", "result": f"clicked {argument}"}
        if command == "type":
            self.computer.type(argument)
            return {"status": "completed", "result": f"typed {argument}"}
        if command == "screenshot":
            snap = self.computer.screenshot()
            return {"status": "completed", "result": snap}
        return {"status": "error", "result": f"unknown command {command}"}

We introduce the ComputerTool interface, which acts as the communication bridge between the agent’s reasoning and the virtual desktop. We define high-level operations such as click, type, and screenshot, enabling the agent to interact with the environment in a structured way. Check out the FULL CODES here.

class ComputerAgent:
    def __init__(self, llm: LocalLLM, tool: ComputerTool, max_trajectory_budget: float = 5.0):
        self.llm = llm
        self.tool = tool
        self.max_trajectory_budget = max_trajectory_budget

    async def run(self, messages):
        user_goal = messages[-1]["content"]
        steps_remaining = int(self.max_trajectory_budget)
        output_events = []
        total_prompt_tokens = 0
        total_completion_tokens = 0
        while steps_remaining > 0:
            screen = self.tool.computer.screenshot()
            prompt = (
                "You are a computer-use agent.\n"
                f"User goal: {user_goal}\n"
                f"Current screen:\n{screen}\n\n"
                "Think step-by-step.\n"
                "Reply with: ACTION <click/type/screenshot> ARG <target or text> THEN <assistant message>.\n"
            )
            thought = self.llm.generate(prompt)
            total_prompt_tokens += len(prompt.split())
            total_completion_tokens += len(thought.split())
            action = "screenshot"; arg = ""; assistant_msg = "Working..."
            # Parse the ACTION / ARG / THEN fields from the model's reply.
            for line in thought.splitlines():
                if line.strip().startswith("ACTION "):
                    after = line.split("ACTION ", 1)[1]
                    action = after.split()[0].strip()
                if "ARG " in line:
                    part = line.split("ARG ", 1)[1]
                    if " THEN " in part:
                        arg = part.split(" THEN ")[0].strip()
                    else:
                        arg = part.strip()
                if "THEN " in line:
                    assistant_msg = line.split("THEN ", 1)[1].strip()
            output_events.append({"summary": [{"text": assistant_msg, "type": "summary_text"}], "type": "reasoning"})
            call_id = "call_" + uuid.uuid4().hex[:16]
            tool_res = self.tool.run(action, arg)
            output_events.append({"action": {"type": action, "text": arg}, "call_id": call_id, "status": tool_res["status"], "type": "computer_call"})
            snap = self.tool.computer.screenshot()
            output_events.append({"type": "computer_call_output", "call_id": call_id, "output": {"type": "input_image", "image_url": snap}})
            output_events.append({"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": assistant_msg}]})
            if "done" in assistant_msg.lower() or "here is" in assistant_msg.lower():
                break
            steps_remaining -= 1
        usage = {"prompt_tokens": total_prompt_tokens, "completion_tokens": total_completion_tokens, "total_tokens": total_prompt_tokens + total_completion_tokens, "response_cost": 0.0}
        yield {"output": output_events, "usage": usage}

We construct the ComputerAgent, which serves as the system’s intelligent controller. We program it to reason about goals, decide which actions to take, execute those through the tool interface, and record each interaction as a step in its decision-making process. Check out the FULL CODES here.

async def main_demo():
    computer = VirtualComputer()
    tool = ComputerTool(computer)
    llm = LocalLLM()
    agent = ComputerAgent(llm, tool, max_trajectory_budget=4)
    messages = [{"role": "user", "content": "Open mail, read inbox subjects, and summarize."}]
    async for result in agent.run(messages):
        print("==== STREAM RESULT ====")
        for event in result["output"]:
            if event["type"] == "computer_call":
                a = event.get("action", {})
                print(f"[TOOL CALL] {a.get('type')} -> {a.get('text')} [{event.get('status')}]")
            if event["type"] == "computer_call_output":
                snap = event["output"]["image_url"]
                print("SCREEN AFTER ACTION:\n", snap[:400], "...\n")
            if event["type"] == "message":
                print("ASSISTANT:", event["content"][0]["text"], "\n")
        print("USAGE:", result["usage"])

loop = asyncio.get_event_loop()
loop.run_until_complete(main_demo())

We bring everything together by running the demo, where the agent interprets a user’s request and performs tasks on the virtual computer. We observe it generating reasoning, executing commands, updating the virtual screen, and achieving its goal in a clear, step-by-step manner.

In conclusion, we implemented the essence of a computer-use agent capable of autonomous reasoning and interaction. We witness how local language models like Flan-T5 can powerfully simulate desktop-level automation within a safe, text-based sandbox. This project helps us understand the architecture behind commercial computer-use agents, bridging natural language reasoning with virtual tool control. It lays a strong foundation for extending these capabilities toward real-world, multimodal, and secure automation systems.

Check out the FULL CODES here.
The post How to Build a Fully Functional Computer-Use Agent that Thinks, Plans, and Executes Virtual Actions Using Local AI Models appeared first on MarkTechPost.

Google vs OpenAI vs Anthropic: The Agentic AI Arms Race Breakdown

Table of contents
OpenAI: CUA for GUI Autonomy, Responses as Agent Surface, and AgentKit for Lifecycle
Google: Gemini 2.0 and Astra for Perception, Vertex AI Agent Builder for Orchestration, Gemini Enterprise for Governance
Anthropic: Computer Use and App-Builder Path via Artifacts
Benchmarks That Matter for Agent Selection
Comparative Analysis
Deployment Guidance for Technical Teams
Bottom Line by Vendor
Editorial Comments

In this article we will analyze how Google, OpenAI, and Anthropic are productizing ‘agentic’ capabilities across computer-use control, tool/function calling, orchestration, governance, and enterprise packaging.

Agent platforms, not only models, now define competitive advantage. Google is aligning Gemini 2.0 with an enterprise control plane on Vertex AI and a new ‘front door’ called Gemini Enterprise. OpenAI is consolidating developer workflows around the Responses API, packaging agent lifecycle elements as AgentKit, and deploying a general GUI controller called the Computer-Using Agent (CUA). Anthropic is expanding Computer Use while turning Artifacts into a lightweight app builder for rapid internal tools.

OpenAI: CUA for GUI Autonomy, Responses as Agent Surface, and AgentKit for Lifecycle

Computer-Using Agent (CUA)

OpenAI introduced Operator in January 2025, powered by the CUA model. CUA combines GPT-4o-class vision with reinforcement learning for GUI policies, executing tasks through human-like primitives: screen perception, mouse, and keyboard. The stated purpose is a single interface that generalizes across web and desktop tasks.

Responses API

OpenAI repositioned Responses as the primary agent-native API. The design folds chat, tool use, state, and multimodality into one interface and is marketed as the integration surface for GPT-5-era reasoning workflows. This simplifies the historical split across Chat Completions and Assistants, formalizing hosted tools and persistent reasoning in a single endpoint.

AgentKit

Launched in October 2025, AgentKit packages agent building blocks: visual design surfaces, connectors/registries, evaluation hooks, and embeddable agent UIs. The aim is to reduce orchestration sprawl and standardize agent lifecycle from design to deployment. ​

Risk Profile

Early third-party evaluations note brittleness on practical automations: flaky DOM targets, window focus loss, and recovery failure on layout changes. While not unique to OpenAI, this matters for production SLAs. Teams should instrument retries, stabilize selectors, and gate high-risk steps behind review. Pair CUA experiments with execution-based evaluation such as OSWorld tasks.​
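As one way to make that guidance concrete, the sketch below is a generic Python wrapper, not tied to any vendor SDK; run_step, is_expected_state, and needs_review are hypothetical callables you would supply for your own automation:

import time

def guarded_step(run_step, is_expected_state, needs_review, max_retries=3, backoff_s=2.0):
    """Run a brittle GUI step with retries, a state check, and a human review gate."""
    for attempt in range(1, max_retries + 1):
        try:
            result = run_step()  # execute the action (click, type, navigate, ...)
            if not is_expected_state(result):
                raise RuntimeError("unexpected UI state")  # e.g. selector drift or focus loss
            if needs_review(result):
                # Gate high-risk outcomes behind an explicit human approval.
                input(f"Review required before continuing: {result!r} (press Enter to approve)")
            return result
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(backoff_s * attempt)  # simple linear backoff before retrying
    raise RuntimeError(f"step failed after {max_retries} attempts")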

Position: OpenAI is optimizing for a programmable agent substrate: a single API surface (Responses), a lifecycle kit (AgentKit), and a universal GUI controller (CUA). For teams willing to own their evaluation harness and operations, this stack provides tight control and fast iteration loops.​

Google: Gemini 2.0 and Astra for Perception, Vertex AI Agent Builder for Orchestration, Gemini Enterprise for Governance

Models and Runtime

Google frames Gemini 2.0 as ‘built for the agentic era,’ with native tool use and multimodal I/O including image/audio output. Project Astra demonstrations highlight low-latency, always-on perception and continuous assistance patterns that map to planning plus acting loops. These capabilities are intended to feed Gemini Live and the broader agent runtime.​

Vertex AI Agent Builder

Google’s control plane for building and deploying agents on GCP is Vertex AI Agent Builder. The official documentation shows Agent Garden for templates and tools, orchestration for multi-agent experiences, and integration with other Vertex components. This serves as the platform to implement policies, logging, and evaluation pipelines for GCP users.​

Gemini Enterprise

In October 2025, Google announced Gemini Enterprise as a governed front door to ‘discover, create, share, and run AI agents’ with central policy and visibility. It emphasizes cross-suite context spanning Google Workspace and Microsoft 365/SharePoint, plus line-of-business integrations such as Salesforce and SAP. This is positioned as a fleet-level governance layer, not only a development kit.

Application Surface

Google is also pushing agentic control into end-user environments. Agent Mode in the Gemini app and Project Mariner extend consumer and prosumer workflows: teach-and-repeat, multi-task management, and autonomous execution for common tasks like search and filtering. This serves as both a data source for guardrails and a proving ground for UI-safety patterns.​

Position: Google is optimizing for governed enterprise deployment with wide surface integration. If you need centralized policy/visibility across many agents, with Workspace and cross-suite context, the Gemini Enterprise + Vertex pairing offers the most prescriptive path today.​

Anthropic: Computer Use and App-Builder Path via Artifacts

Computer Use

Anthropic introduced Computer Use for Claude 3.5 Sonnet in October 2024, explicitly as a beta capability that requires appropriate software setup to emulate human cursor and keyboard interactions. The company has been quite transparent about error profiles and the need for careful mediation. For production, expect policy-first defaults and incremental broadening rather than a hard pivot to full autonomy.​

Artifacts → App Building

In June 2025, Anthropic extended Artifacts from an inline canvas to build, host, and share interactive apps directly from Claude. The feature targets rapid internal tools and shareable mini-apps. Developers can create apps that call back into Claude via a new API, and published app usage bills the end user rather than the author.​

Position: Anthropic is optimizing for fast human-in-the-loop creation with explicit safety posture. The combination of Computer Use and Artifacts supports a design pattern where users co-pilot agents, validate actions, and graduate prototypes into shareable internal apps without heavy scaffolding.​

Benchmarks That Matter for Agent Selection

Function/Tool Calling

The Berkeley Function-Calling Leaderboard (BFCL) V4 expands beyond single calls to multi-turn planning, live/non-live settings, and hallucination measurement. You can use BFCL for tool-routing quality, argument fidelity, and sequencing under state changes.​

Computer/Web Use

OSWorld defines a benchmark of 369 real desktop tasks with execution-based evaluations across OSes and multi-app workflows. Original results showed large human–agent gaps and identified GUI grounding as a major bottleneck. You can treat OSWorld as the minimum bar for assessing GUI agents, then layer domain-specific workflows.​

Conversational Tool Agents

τ-Bench simulates dynamic conversations where an agent must follow domain rules and interact with tools; the 2025 τ²-Bench extension adds dual-control scenarios where both the user and agent can act, increasing realism for support workflows. You can use these when you care about policy adherence, user guidance, and multi-trial reliability.​

Software-Engineering Agents

SWE-Bench family leaderboards cover end-to-end issue resolution; SWE-Bench Pro (2025) raises task difficulty and adds contamination resistance with 1,865 instances across 41 repositories. For engineering assistants, you should not rely on ‘Lite’ alone—run Verified or Pro with a locked scaffold.​​

Comparative Analysis

Model Core and Modality

OpenAI currently couples GPT-5-era orchestration via Responses with a general GUI controller (CUA). This allows one integration surface for reasoning and tools plus a controller trained with RL for on-screen actions. Google pushes Gemini 2.0 and Astra for low-latency multimodal perception with tool use, then exposes agent plumbing through Vertex and Gemini Enterprise. Anthropic advances Claude 3.5 with Computer Use, while offering Artifacts to transform prompts into shareable apps that can call the model. The differences map to strategy: programmable substrate (OpenAI), governed enterprise scale (Google), and human-in-the-loop app creation (Anthropic).​

Agent Platform and Lifecycle

OpenAI’s AgentKit is an opinionated toolkit that reduces custom scaffolds and aligns with Responses. Google’s Vertex AI Agent Builder offers multi-agent orchestration plus governance hooks in a GCP-native control plane. Anthropic’s Artifacts/app-builder anchors a rapid prototyping loop for internal tools and user-validated workflows. Select based on where you want to spend engineering effort: programmable pipelines (OpenAI), centralized IT management (Google), or fastest human-supervised iteration (Anthropic).​

Governance and Policy

Google’s Gemini Enterprise is the clearest statement of fleet-level governance: central policy, visibility, cross-suite context for Workspace and Microsoft 365, and connectors for line-of-business apps. OpenAI’s consolidation into Responses reduces integration surfaces and should simplify policy attachment, but enterprise posture varies by customer architecture. Anthropic’s default stance is cautious feature rollout with explicit policy framing and human mediation.​

Evaluation Story and External Signals

OpenAI claims strong computer-/browser-use performance for CUA, but independent harnesses like OSWorld still report significant gaps across agents. Google’s agent messaging leans on demonstrations and enterprise rollouts; verify claims on BFCL, OSWorld, and domain workloads in Vertex. Anthropic’s Artifacts provides a pathway to test-and-deploy small apps quickly, then measure them against τ-Bench-style dialogue tasks and OSWorld-style GUI tasks.

Deployment Guidance for Technical Teams

1) Lock the Runner Before the Model

You can adopt execution-based, state-aware harnesses. For GUI control, use OSWorld’s verified setups and task scripts. For tool orchestration, use BFCL V4’s multi-turn and hallucination components. For policy-bound dialogues, prefer τ/τ²-Bench. For engineering assistants, add SWE-Bench Verified or Pro. Keep the runner constant while iterating on models, prompts, and retries.​

2) Decide Where Governance Lives

If you need centralized visibility across many agents plus Workspace and Microsoft 365 context, Google’s Gemini Enterprise combined with Vertex AI Agent Builder provides the most prescriptive governance plane. If you want a programmable substrate and will own policy integration yourself, OpenAI’s Responses + AgentKit stack is coherent. Anthropic’s approach favors human-in-the-loop controls with clear policy boundaries through the product surface.​

3) Design for GUI Failure and Recovery

Selectors drift, window focus changes, and visual similarity confuses detectors. You can build retries, add ‘are we on the right page’ checks, and gate irreversible actions behind review. This guidance applies to OpenAI CUA and Anthropic Computer Use alike, and the gaps are documented in OSWorld results.​

4) Optimize for Your Iteration Style

If you prototype many small internal tools, Anthropic’s Artifacts/app-builder minimizes scaffolding and lets non-specialists contribute. If you need deeply programmable pipelines with hosted tools and memory, Responses plus AgentKit offers the most consolidated primitives today. For governed, fleet-level rollouts, Google’s Vertex + Gemini Enterprise stack is designed for IT-managed scale.​

Bottom Line by Vendor

OpenAI: A programmable agent substrate: Responses as the unifying API, AgentKit for lifecycle, and CUA for GUI autonomy. This stack is attractive when you want direct control over tools, memory, and evaluation and are prepared to operate your own runners. You can validate GUI tasks on OSWorld and dialogue planning on τ-Bench.​

Google: A governed enterprise plane: Vertex AI Agent Builder for orchestration and Gemini Enterprise for organization-wide policy, visibility, and cross-suite context. This may be the clearest route to standardized agent operations in large estates using Workspace or hybrid 365 environments. You can test tool quality on BFCL and GUI reliability on OSWorld before scaling.​

Anthropic: A human-in-the-loop path: Computer Use plus Artifacts/app-builder for rapid creation and sharing of internal apps. This works well for teams that want fast iteration with explicit checkpoints and policy framing. You can use τ-Bench to assess policy adherence and user guidance, and OSWorld to check GUI action reliability.​

Editorial Comments

The agentic AI landscape of 2025 reveals three fundamentally different philosophies that will likely define the next phase of enterprise AI adoption. OpenAI’s bet on a unified, programmable substrate reflects their developer-first DNA, but risks overwhelming teams without strong engineering capabilities. Google’s enterprise governance play is strategically sound given their Workspace dominance, yet feels bureaucratic compared to the nimble iteration cycles that define successful AI deployments. Anthropic’s human-in-the-loop approach appears most aligned with current organizational realities—where trust, not just capability, remains the bottleneck for AI adoption. The real winner may not be determined by technical superiority alone, but by which vendor best navigates the gap between AI possibility and enterprise practicality. With 95% of generative AI pilots failing to reach production according to MIT research, the platform that solves deployment friction rather than just model performance will likely capture the largest share of the projected $47.1 billion AI agent market by 2030.

References: ​

https://www.fanktank.ch/en/blog/choosing-ai-models-openai-anthropic-google-2025

https://www.mindset.ai/blogs/in-the-loop-ep15-the-three-battles-to-own-all-ai

https://deeplp.com/f/xxx

https://akka.io/blog/agentic-ai-tools

https://www.alvarezandmarsal.com/thought-leadership/demystifying-ai-agents-in-2025-separating-hype-from-reality-and-navigating-market-outlook

https://www.datacamp.com/blog/best-ai-agents

https://mashable.com/article/best-ai-agents-work

https://claude.ai/public/artifacts/e7c1cf72-338c-4b70-bab2-fff4bf0ac553

OpenAI launches Operator, an AI agent that performs tasks autonomously

https://openai.com/index/introducing-agentkit/

https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise

https://www.anthropic.com/news/3-5-models-and-computer-use

https://openai.com/index/introducing-operator/

https://openai.com/index/computer-using-agent/

https://openai.com/index/new-tools-and-features-in-the-responses-api/

https://developers.openai.com/blog/responses-api/

OpenAI launches AgentKit to help developers build and ship AI agents 

OpenAI Launches AgentKit for Building AI Agents – Here Is All You Need To Know

https://www.technologyreview.com/2025/01/23/1110484/openai-launches-operator-an-agent-that-can-use-a-computer-for-you/

https://shellypalmer.com/2024/12/google-launches-gemini-2-0-ushering-in-the-agentic-era/

https://blog.google/products/gemini/google-gemini-ai-collection-2024/

https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/

Google ramps up its ‘AI in the workplace’ ambitions with Gemini Enterprise

https://www.reuters.com/business/google-launches-gemini-enterprise-ai-platform-business-clients-2025-10-09/

https://blog.google/products/google-cloud/gemini-enterprise-sundar-pichai/

https://www.anthropic.com/news/developing-computer-use

https://www.nist.gov/news-events/news/2024/11/pre-deployment-evaluation-anthropics-upgraded-claude-35-sonnet

https://www.infoq.com/news/2025/06/anthropic-artifacts-app/

https://www.anthropic.com/news/build-artifacts

https://www.anthropic.com/news/claude-powered-artifacts

https://gorilla.cs.berkeley.edu/leaderboard.html

https://gorilla.cs.berkeley.edu/blogs/15_bfcl_v4_web_search.html

https://openreview.net/forum?id=2GmDdhBdDk

https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf

The post Google vs OpenAI vs Anthropic: The Agentic AI Arms Race Breakdown appeared first on MarkTechPost.

Liquid AI’s LFM2-VL-3B Brings a 3B Parameter Vision Language Model (VLM) to Edge-Class Devices

Liquid AI released LFM2-VL-3B, a 3B parameter vision language model for image text to text tasks. It extends the LFM2-VL family beyond the 450M and 1.6B variants. The model targets higher accuracy while preserving the speed profile of the LFM2 architecture. It is available on LEAP and Hugging Face under the LFM Open License v1.0.

Model overview and interface

LFM2-VL-3B accepts interleaved image and text inputs and produces text outputs. The model exposes a ChatML like template. The processor inserts an <image> sentinel that is replaced with encoded image tokens at run time. The default text context length is 32,768 tokens. These details help devs reproduce evaluations and integrate the model with existing multimodal pipelines.

https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge

Architecture

The stack pairs a language tower with a shape aware vision tower and a projector. The language tower is LFM2-2.6B, a hybrid convolution plus attention backbone. The vision tower is SigLIP2 NaFlex at 400M parameters, it preserves native aspect ratios and avoids distortion. The connector is a 2 layer MLP with pixel unshuffle, it compresses image tokens before fusion with the language space. This design lets users cap vision token budgets without retraining the model.

The encoder processes native resolutions up to 512×512. Larger inputs are split into non overlapping 512×512 patches, and a thumbnail pathway provides global context during tiling. The token mapping is documented with concrete examples: a 256×384 image maps to 96 tokens, and a 1000×3000 image maps to 1,020 tokens. The model card exposes user controls for minimum and maximum image tokens and the tiling switch. These controls tune speed and quality at inference time.

Inference settings

The Hugging Face model card provides recommended parameters. Text generation uses temperature 0.1, min p 0.15, and a repetition penalty of 1.05. Vision settings use min image tokens 64, max image tokens 256, and image splitting enabled. The processor applies the chat template and the image sentinel automatically. The example uses AutoModelForImageTextToText and AutoProcessor with bfloat16 precision.
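Putting those settings together, the following is a minimal sketch that assumes the Hugging Face repo id LiquidAI/LFM2-VL-3B and the standard Transformers image-text-to-text interface; confirm the exact usage against the model card before relying on it.

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiquidAI/LFM2-VL-3B"  # assumed repo id, check the model card
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# ChatML-like conversation; the processor inserts the <image> sentinel for us.
messages = [{"role": "user", "content": [
    {"type": "image", "image": Image.open("photo.jpg")},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)

# Decoding settings recommended on the model card.
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.1, min_p=0.15, repetition_penalty=1.05)
print(processor.batch_decode(out, skip_special_tokens=True)[0])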

How is it trained?

Liquid AI describes a staged approach. The team performs joint mid training that adjusts the text to image ratio over time. The model then undergoes supervised fine tuning focused on image understanding. The data sources are large scale open datasets plus in house synthetic vision data for task coverage.

Benchmarks

The research team reports competitive results among lightweight open VLMs. On MM-IFEval the model reaches 51.83. On RealWorldQA it reaches 71.37. On MMBench dev en it reaches 79.81. The POPE score is 89.01. The table notes that scores for other systems were computed with VLMEvalKit. The table excludes Qwen3-VL-2B because that system was released one day earlier.

https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge

The language capability remains close to the LFM2-2.6B backbone. The research team cites 30 percent on GPQA and 63 percent on MMLU. This matters when perception tasks include knowledge queries. The team also states expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, Chinese, and Korean.

Why edge users should care?

The architecture keeps compute and memory within small device budgets. Image tokens are compressible and user constrained, so throughput is predictable. SigLIP2 400M NaFlex encoder preserves aspect ratios, which helps fine grained perception. The projector reduces tokens at the connector, which improves tokens per second. The research team also published a GGUF build for on device runtimes. These properties are useful for robotics, mobile, and industrial clients that need local processing and strict data boundaries.

Key Takeaways

Compact multimodal stack: 3B parameter LFM2-VL-3B pairs an LFM2-2.6B language tower with a 400M SigLIP2 NaFlex vision encoder and a 2-layer MLP projector for image-token fusion. NaFlex preserves native aspect ratios.

Resolution handling and token budgets: Images run natively up to 512×512, larger inputs tile into non overlapping 512×512 patches with a thumbnail pathway for global context. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.

Inference interface: ChatML-like prompting with an <image> sentinel, default text context 32,768 tokens, recommended decoding settings, and processor-level controls for image splitting enable reproducible evaluation and easy integration in multimodal pipelines.

Measured performance: Reported results include MM-IFEval 51.83, RealWorldQA 71.37, MMBench-dev-en 79.81, and POPE 89.01. Language-only signals from the backbone are about 30% GPQA and 63% MMLU, useful for mixed perception plus knowledge workloads.

Editorial Comments

LFM2-VL-3B is a practical step for edge multimodal workloads, the 3B stack pairs LFM2-2.6B with a 400M SigLIP2 NaFlex encoder and an efficient projector, which lowers image token counts for predictable latency. Native resolution processing with 512 by 512 tiling and token caps gives deterministic budgets. Reported scores on MM-IFEval, RealWorldQA, MMBench, and POPE are competitive for this size. Open weights, a GGUF build, and LEAP access reduce integration friction. Overall, this is an edge ready VLM release with clear controls and transparent benchmarks.

Check out the Model on HF and Technical details.
The post Liquid AI’s LFM2-VL-3B Brings a 3B Parameter Vision Language Model (VLM) to Edge-Class Devices appeared first on MarkTechPost.

An Implementation on Building Advanced Multi-Endpoint Machine Learning …

In this tutorial, we explore LitServe, a lightweight and powerful serving framework that allows us to deploy machine learning models as APIs with minimal effort. We build and test multiple endpoints that demonstrate real-world functionalities such as text generation, batching, streaming, multi-task processing, and caching, all running locally without relying on external APIs. By the end, we clearly understand how to design scalable and flexible ML serving pipelines that are both efficient and easy to extend for production-level applications. Check out the FULL CODES here.

!pip install litserve torch transformers -q

import litserve as ls
import torch
from transformers import pipeline
import time
from typing import List

We begin by setting up our environment on Google Colab and installing all required dependencies, including LitServe, PyTorch, and Transformers. We then import the essential libraries and modules that will allow us to define, serve, and test our APIs efficiently. Check out the FULL CODES here.

class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("text-generation", model="distilgpt2", device=0 if device == "cuda" and torch.cuda.is_available() else -1)
        self.device = device

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        result = self.model(prompt, max_length=100, num_return_sequences=1, temperature=0.8, do_sample=True)
        return result[0]['generated_text']

    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}


class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0 if device == "cuda" and torch.cuda.is_available() else -1)

    def decode_request(self, request):
        return request["text"]

    def batch(self, inputs: List[str]) -> List[str]:
        return inputs

    def predict(self, batch: List[str]):
        results = self.model(batch)
        return results

    def unbatch(self, output):
        return output

    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}

Here, we create two LitServe APIs, one for text generation using a local DistilGPT2 model and another for batched sentiment analysis. We define how each API decodes incoming requests, performs inference, and returns structured responses, demonstrating how easy it is to build scalable, reusable model-serving endpoints. Check out the FULL CODES here.

class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("text-generation", model="distilgpt2",
                              device=0 if device == "cuda" and torch.cuda.is_available() else -1)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # Simulate token-by-token generation by yielding one word at a time
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "

    def encode_response(self, output):
        for token in output:
            yield {"token": token}

In this section, we design a streaming text-generation API that emits tokens as they are generated. We simulate real-time streaming by yielding words one at a time, demonstrating how LitServe can handle continuous token generation efficiently. Check out the FULL CODES here.

class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        # Two independent pipelines behind a single endpoint, both kept on CPU
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", device=-1)
        self.device = device

    def decode_request(self, request):
        return {"task": request.get("task", "sentiment"), "text": request["text"]}

    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            if len(text.split()) < 30:
                # Very short inputs are returned as-is instead of being summarized
                return {"task": "summarize", "result": {"summary_text": text}}
            result = self.summarizer(text, max_length=50, min_length=10)[0]
            return {"task": "summarize", "result": result}
        else:
            return {"task": "unknown", "error": "Unsupported task"}

    def encode_response(self, output):
        return output

We now develop a multi-task API that handles both sentiment analysis and summarization via a single endpoint. This snippet demonstrates how we can manage multiple model pipelines through a unified interface, dynamically routing each request to the appropriate pipeline based on the specified task. Check out the FULL CODES here.

class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}   # exact-match cache keyed by input text
        self.hits = 0
        self.misses = 0

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False

    def encode_response(self, output):
        result, from_cache = output
        return {"label": result["label"], "score": float(result["score"]),
                "from_cache": from_cache,
                "cache_stats": {"hits": self.hits, "misses": self.misses}}

We implement an API that uses caching to store previous inference results, reducing redundant computation for repeated requests. We track cache hits and misses in real time, illustrating how simple caching mechanisms can drastically improve performance in repeated inference scenarios. Check out the FULL CODES here.

def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)

    api1 = TextGeneratorAPI(); api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")

    api2 = BatchedSentimentAPI(); api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")

    api3 = MultiTaskAPI(); api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")

    api4 = CachedAPI(); api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")

    print("=" * 70)
    print("All tests completed successfully!")
    print("=" * 70)

test_apis_locally()

We test all our APIs locally to verify their correctness and performance without starting an external server. We sequentially evaluate text generation, batched sentiment analysis, multi-tasking, and caching, ensuring each component of our LitServe setup runs smoothly and efficiently.
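The local tests above exercise each API in-process. To expose one of them as an HTTP endpoint, a minimal serving sketch could look like the following; it assumes recent LitServe defaults (a POST /predict route) and a locally chosen port 8000, so treat it as a starting point rather than part of the tested code above.

# Minimal serving sketch (assumes LitServe's default /predict route; port 8000 is our choice)
if __name__ == "__main__":
    api = TextGeneratorAPI()                        # any LitAPI class defined above
    server = ls.LitServer(api, accelerator="auto")  # uses the GPU when available, otherwise CPU
    server.run(port=8000)

# Example client call, run from another process:
#   import requests
#   r = requests.post("http://localhost:8000/predict", json={"prompt": "Artificial intelligence will"})
#   print(r.json()["generated_text"])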

In conclusion, we create and run diverse APIs that showcase the framework’s versatility. We experiment with text generation, sentiment analysis, multi-tasking, and caching to experience LitServe’s seamless integration with Hugging Face pipelines. As we complete the tutorial, we realize how LitServe simplifies model deployment workflows, enabling us to serve intelligent ML systems in just a few lines of Python code while maintaining flexibility, performance, and simplicity.

Check out the FULL CODES here.
The post An Implementation on Building Advanced Multi-Endpoint Machine Learning APIs with LitServe: Batching, Streaming, Caching, and Local Inference appeared first on MarkTechPost.

Salesforce AI Research Introduces WALT (Web Agents that Learn Tools): …

A team of Salesforce AI researchers introduced WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. It reframes browser automation around callable tools rather than long chains of clicks. Agents then call operations such as search, filter, sort, post_comment, and create_listing. This reduces dependence on large language model step by step reasoning and increases determinism during execution.

https://arxiv.org/pdf/2510.01524

What WALT builds?

Web agents often fail when layouts shift or when tasks require long sequences. WALT targets this failure mode by mining site functionality offline, then exposing it as tools that encapsulate navigation, selection, extraction, and optional agentic steps. Tools carry contracts in the form of schemas and examples. At runtime, an agent composes a short program with a few tool calls to complete a task. The design goal is higher success with fewer steps and less reliance on free form reasoning.

Pipeline in two phases

The pipeline has discovery and construction with validation. In discovery, WALT explores a website and proposes tool candidates that map to common goals such as discovery, content management, and communication. In construction and validation, WALT converts traces to deterministic scripts, stabilizes selectors, attempts URL promotion when possible, induces an input schema, and registers a tool only after end to end checks pass. This shifts as much work as possible into stable URL and form operations and leaves agentic grounding for the cases that truly require it.
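To make the construction step concrete, the sketch below shows one way a mined trace could end up as a callable tool once URL promotion succeeds and an input schema is induced. The Tool class, field names, and example site are hypothetical illustrations of the concept, not WALT's actual code or API.

# Hypothetical sketch of a discovered tool; NOT WALT's real implementation
from dataclasses import dataclass, field
from urllib.parse import urlencode

@dataclass
class Tool:
    name: str
    input_schema: dict            # induced schema: argument name -> type and description
    url_template: str             # deterministic URL promotion of the recorded trace
    examples: list = field(default_factory=list)

    def call(self, **kwargs) -> str:
        # Deterministic execution: fill query parameters instead of replaying clicks
        return self.url_template + "?" + urlencode(kwargs)

# A "search" tool mined from a hypothetical classifieds site
search = Tool(
    name="search",
    input_schema={"query": "str, free-text search terms", "max_price": "int, optional"},
    url_template="https://example-classifieds.test/search",
    examples=[{"query": "road bike", "max_price": 300}],
)

# At runtime the agent composes a short program of tool calls instead of a long click sequence
print(search.call(query="road bike", max_price=300))

In WALT, a tool like this would only be registered after the end to end validation checks pass, which is what keeps runtime execution deterministic.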

https://arxiv.org/pdf/2510.01524

Results on VisualWebArena and WebArena

On VisualWebArena, WALT reports an average success rate of 52.9 percent with per split results of 64.1 percent on Classifieds, 53.4 percent on Shopping, and 39.0 percent on Reddit. The table lists baselines such as SGV at 50.2 percent and ExaCT at 33.7 percent. Human performance is 88.7 percent on average.

On WebArena, WALT reaches 50.1 percent average across GitLab, Map, Shopping, CMS, Reddit, and Multi. The table shows WALT ahead of prior methods with a nine point margin over the best skill induction baseline. Human performance is 78.2 percent.

https://arxiv.org/pdf/2510.01524

Efficiency and ablations

Tools reduce action count by a factor near 1.4 on average relative to a matched agent without tools. On the Classifieds split, ablations show consistent gains when tools are used across different agent backbones. WALT with GPT 5 mini records 7 percent higher success and 27 percent fewer steps, while a human demonstration strategy yields 66.0 percent success. The fully autonomous WALT reaches 64.1 percent with 5 percent fewer steps than the human demonstration case. Multimodal DOM parsing adds 2.6 percent absolute improvement. External verification adds 3.3 percent while increasing checks. Across components, WALT records 21.3 percent fewer steps than baseline policies.

https://arxiv.org/pdf/2510.01524

Design choices that enforce determinism

WALT prefers URL level operations when the site exposes query parameters or routes for search and filtering. When pages require dynamic grounding, the tool script inserts bounded agentic steps such as content extraction or wait for page load. Selector stabilization and schema validation reduce drift when sites change. The method keeps the fraction of agentic operations low in discovered tool sets and biases toward deterministic actions like navigation, input, and click.

Key Takeaways

Approach: WALT discovers and validates website-native functions, then exposes them as callable tools with input schemas, selector stabilization, and URL promotion, reducing brittle step sequences to deterministic operations.

Results — VisualWebArena: Average success rate 52.9%, with 64.1% on Classifieds, 53.4% on Shopping, and 39.0% on Reddit, outperforming several baselines reported in the paper.

Results — WebArena: Average success rate 50.1% across GitLab, Map, Shopping, CMS, Reddit, and Multi, showing consistent gains over skill-induction and search-based baselines.

Efficiency and Ablations: Toolization cuts steps by about 1.4x, with 21.3% fewer actions on average. Multimodal DOM parsing adds +2.6% absolute success, and external verification adds +3.3%.

Editorial Comments

WALT is a useful pivot from step sequence agents to functionality grounded tools. The framework reverse engineers latent website functionality into reusable invocable tools across discovery, content management, and communication. By promoting UI traces to deterministic tools with schema validation and URL operations, WALT lifts web agent success to 52.9 percent on VisualWebArena and 50.1 percent on WebArena, while cutting actions by about 21.3 percent. The release ships a CLI with walt discover and walt agent commands, plus MCP serving for integration.

Check out the Paper and GitHub Page.
The post Salesforce AI Research Introduces WALT (Web Agents that Learn Tools): Enabling LLM agents to Automatically Discover Reusable Tools from Any Website appeared first on MarkTechPost.

Responsible AI design in healthcare and life sciences

Generative AI has emerged as a transformative technology in healthcare, driving digital transformation in essential areas such as patient engagement and care management. It has shown potential to revolutionize how clinicians deliver care through automated systems and diagnostic support tools that offer timely, personalized suggestions, ultimately leading to better health outcomes. For example, a study reported in BMC Medical Education found that medical students who received large language model (LLM)-generated feedback during simulated patient interactions significantly improved their clinical decision-making compared to those who did not.
At the center of most generative AI systems are LLMs capable of generating remarkably natural conversations, enabling healthcare customers to build products across billing, diagnosis, treatment, and research that can perform tasks and operate independently with human oversight. However, the utility of generative AI requires an understanding of the potential risks and impacts on healthcare service delivery, which necessitates the need for careful planning, definition, and execution of a system-level approach to building safe and responsible generative AI-infused applications.
In this post, we focus on the design phase of building healthcare generative AI applications, including defining system-level policies that determine the inputs and outputs. These policies can be thought of as guidelines that, when followed, help build a responsible AI system.
Designing responsibly
LLMs can transform healthcare by reducing the cost and time of care delivery, provided that considerations such as quality and reliability are addressed. As shown in the following diagram, responsible AI considerations can be successfully integrated into an LLM-powered healthcare application by accounting for quality, reliability, trust, and fairness for everyone. The goal is to promote and encourage responsible AI functionality throughout the system. Examples include the following:

Each component’s input and output is aligned with clinical priorities to maintain alignment and promote controllability
Safeguards, such as guardrails, are implemented to enhance the safety and reliability of your AI system
Comprehensive AI red-teaming and evaluations are applied to the entire end-to-end system to assess safety and privacy-impacting inputs and outputs

Conceptual architecture
The following diagram shows a conceptual architecture of a generative AI application with an LLM. The inputs (directly from an end-user) are mediated through input guardrails. After the input has been accepted, the LLM can process the user’s request using internal data sources. The output of the LLM is again mediated through guardrails and can be shared with end-users.
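As a simplified, generic illustration of this flow, the sketch below mediates a request through an input check, a model call, and an output check. The regex patterns, blocked terms, and the call_llm placeholder are illustrative stand-ins; in practice, a managed capability such as Amazon Bedrock Guardrails would typically provide these controls.

import re

# Illustrative input/output mediation for an LLM-backed healthcare app.
# The PII regexes and call_llm() are placeholders, not a production guardrail.

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN-like pattern
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),    # email address
]

BLOCKED_TOPICS = ["diagnose", "prescribe"]          # example content-policy terms for a documentation app

def input_guardrail(prompt: str) -> str:
    if any(p.search(prompt) for p in PII_PATTERNS):
        raise ValueError("Request rejected: possible PII detected in input.")
    return prompt

def output_guardrail(text: str) -> str:
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TOPICS):
        return "This assistant cannot provide diagnoses or treatment plans. Please consult a clinician."
    return text

def call_llm(prompt: str) -> str:
    # Placeholder for the actual model invocation (for example, via Amazon Bedrock)
    return f"Draft clinical note based on: {prompt}"

def handle_request(prompt: str) -> str:
    safe_prompt = input_guardrail(prompt)
    raw_output = call_llm(safe_prompt)
    return output_guardrail(raw_output)

print(handle_request("Summarize the visit: patient reports mild headache for two days."))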

Establish governance mechanisms
When building generative AI applications in healthcare, it’s essential to consider the various risks at the individual model or system level, as well as at the application or implementation level. The risks associated with generative AI can differ from or even amplify existing AI risks. Two of the most important risks are confabulation and bias:

Confabulation — The model generates confident but erroneous outputs, sometimes referred to as hallucinations. This could mislead patients or clinicians.
Bias — This refers to the risk of exacerbating historical societal biases among different subgroups, which can result from non-representative training data.

To mitigate these risks, consider establishing content policies that clearly define the types of content your applications should avoid generating. These policies should also guide how to fine-tune models and which appropriate guardrails to implement. It is crucial that the policies and guidelines are tailored and specific to the intended use case. For instance, a generative AI application designed for clinical documentation should have a policy that prohibits it from diagnosing diseases or offering personalized treatment plans.
Additionally, defining clear and detailed policies that are specific to your use case is fundamental to building responsibly. This approach fosters trust and helps developers and healthcare organizations carefully consider the risks, benefits, limitations, and societal implications associated with each LLM in a particular application.
The following are some example policies you might consider using for your healthcare-specific applications. The first table summarizes the roles and responsibilities for human-AI configurations.

Action ID
Suggested Action
Generative AI Risks

GV-3.2-001
Policies are in place to bolster oversight of generative AI systems with independent evaluations or assessments of generative AI models or systems where the type and robustness of evaluations are proportional to the identified risks.
CBRN Information or Capabilities; Harmful Bias and Homogenization

GV-3.2-002
Consider adjustment of organizational roles and components across lifecycle stages of large or complex generative AI systems, including: test and evaluation, validation, and red-teaming of generative AI systems; generative AI content moderation; generative AI system development and engineering; increased accessibility of generative AI tools, interfaces, and systems; and incident response and containment.
Human-AI Configuration; Information Security; Harmful Bias and Homogenization

GV-3.2-003
Define acceptable use policies for generative AI interfaces, modalities, and human-AI configurations (for example, for AI assistants and decision-making tasks), including criteria for the kinds of queries generative AI applications should refuse to respond to.
Human-AI Configuration

GV-3.2-004
Establish policies for user feedback mechanisms for generative AI systems that include thorough instructions and any mechanisms for recourse.
Human-AI Configuration

GV-3.2-005
Engage in threat modeling to anticipate potential risks from generative AI systems.
CBRN Information or Capabilities; Information Security

The following table summarizes policies for risk management in AI system design.

Action ID
Suggested Action
Generative AI Risks

GV-4.1-001
Establish policies and procedures that address continual improvement processes for generative AI risk measurement. Address general risks associated with a lack of explainability and transparency in generative AI systems by using ample documentation and techniques such as application of gradient-based attributions, occlusion or term reduction, counterfactual prompts and prompt engineering, and analysis of embeddings. Assess and update risk measurement approaches at regular cadences.
Confabulation

GV-4.1-002
Establish policies, procedures, and processes detailing risk measurement in context of use with standardized measurement protocols and structured public feedback exercises such as AI red-teaming or independent external evaluations.
CBRN Information and Capability; Value Chain and Component Integration

Transparency artifacts
Promoting transparency and accountability throughout the AI lifecycle can foster trust, facilitate debugging and monitoring, and enable audits. This involves documenting data sources, design decisions, and limitations through tools like model cards and offering clear communication about experimental features. Incorporating user feedback mechanisms further supports continuous improvement and fosters greater confidence in AI-driven healthcare solutions.
AI developers and DevOps engineers should be transparent about the evidence and reasons behind all outputs by providing clear documentation of the underlying data sources and design decisions so that end-users can make informed decisions about the use of the system. Transparency enables the tracking of potential problems and facilitates the evaluation of AI systems by both internal and external teams. Transparency artifacts guide AI researchers and developers on the responsible use of the model, promote trust, and help end-users make informed decisions about the use of the system.
The following are some implementation suggestions:

When building AI features with experimental models or services, it’s essential to highlight the possibility of unexpected model behavior so healthcare professionals can accurately assess whether to use the AI system.
Consider publishing artifacts such as Amazon SageMaker model cards or AWS system cards. Also, at AWS we provide detailed information about our AI systems through AWS AI Service Cards, which list intended use cases and limitations, responsible AI design choices, and deployment and performance optimization best practices for some of our AI services. AWS also recommends establishing transparency policies and processes for documenting the origin and history of training data while balancing the proprietary nature of training approaches. Consider creating a hybrid document that combines elements of both model cards and service cards, because your application likely uses foundation models (FMs) but provides a specific service.
Offer a user feedback mechanism. Gathering regular, scheduled feedback from healthcare professionals can help developers make the refinements needed to improve system performance. Also consider establishing policies that provide for user feedback mechanisms for AI systems, including thorough instructions and clear mechanisms for recourse.

Security by design
When developing AI systems, consider security best practices at each layer of the application. Generative AI systems might be vulnerable to adversarial attacks such as prompt injection, which exploits the vulnerability of LLMs by manipulating their inputs or prompts. These types of attacks can result in data leakage, unauthorized access, or other security breaches. To address these concerns, it can be helpful to perform a risk assessment and implement guardrails for both the input and output layers of the application. As a general rule, your operating model should be designed to perform the following actions:

Safeguard patient privacy and data security by implementing personally identifiable information (PII) detection and configuring guardrails that check for prompt attacks
Continually assess the benefits and risks of all generative AI features and tools and regularly monitor their performance through Amazon CloudWatch or other alerts
Thoroughly evaluate all AI-based tools for quality, safety, and equity before deploying

Developer resources
The following resources are useful when architecting and building generative AI applications:

Amazon Bedrock Guardrails helps you implement safeguards for your generative AI applications based on your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple FMs, providing a consistent user experience and standardizing safety and privacy controls across your generative AI applications.
The AWS responsible AI whitepaper serves as an invaluable resource for healthcare professionals and other developers that are developing AI applications in critical care environments where errors could have life-threatening consequences.
AWS AI Service Cards explain the use cases for which the service is intended, how machine learning (ML) is used by the service, and key considerations in the responsible design and use of the service.

Conclusion
Generative AI has the potential to improve nearly every aspect of healthcare by enhancing care quality, patient experience, clinical safety, and administrative safety through responsible implementation. When designing, developing, or operating an AI application, try to systematically consider potential limitations by establishing a governance and evaluation framework grounded by the need to maintain the safety, privacy, and trust that your users expect.
For more information about responsible AI, refer to the following resources:

NIST Trustworthy and Responsible AI
OWASP Top 10 for Large Language Model applications

About the authors
Tonny Ouma is an Applied AI Specialist at AWS, specializing in generative AI and machine learning. As part of the Applied AI team, Tonny helps internal teams and AWS customers incorporate leading-edge AI systems into their products. In his spare time, Tonny enjoys riding sports bikes, golfing, and entertaining family and friends with his mixology skills.
Simon Handley, PhD, is a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences team at Amazon Web Services. He has more than 25 years’ experience in biotechnology and machine learning and is passionate about helping customers solve their machine learning and life sciences challenges. In his spare time, he enjoys horseback riding and playing ice hockey.

Beyond pilots: A proven framework for scaling AI to production

The era of perpetual AI pilots is over. This year, 65% of AWS Generative AI Innovation Center customer projects moved from concept to production—some launching in just 45 days, as AWS VP Swami Sivasubramanian shared on LinkedIn. These results come from insights gained across more than one thousand customer implementations.
The Generative AI Innovation Center pairs organizations across industries with AWS scientists, strategists, and engineers to implement practical AI solutions that drive measurable outcomes. These initiatives transform diverse sectors worldwide. For example, through a cross-functional AWS collaboration, we supported the National Football League (NFL) in creating a generative AI-powered solution that obtains statistical game insights within 30 seconds. This helps their media and production teams locate video content six times faster. Similarly, we helped Druva’s DruAI system streamline customer support and data protection through natural language processing, reducing investigation time from hours to minutes.
These achievements reflect a broader pattern of success, driven by a powerful methodology: The Five V’s Framework for AI Implementation.

This framework takes projects from initial testing to full deployment by focusing on concrete business outcomes and operational excellence. It’s grounded in two of Amazon’s Leadership Principles, Customer Obsession and Deliver Results. By starting with what customers actually need and working backwards, we’ve helped companies across industries modernize their operations and better serve their customers.
The Five V’s Framework: A foundation for success
Every successful AI deployment begins with groundwork. In our experience, projects thrive when organizations first identify specific challenges they need to solve, align key stakeholders around these goals, and establish clear accountability for results. The Five V’s Framework helps guide organizations through a structured process:

Value: Target high-impact opportunities aligned with your strategic priorities
Visualize: Define clear success metrics that link directly to business outcomes
Validate: Test solutions against real-world requirements and constraints
Verify: Create a scalable path to production that delivers sustainable results
Venture: Secure the resources and support needed for long-term success

Value: The critical first step
The Value phase emphasizes working backwards from your most pressing business challenges. By starting with existing pain points and collaborating across technical and business teams, organizations can develop solutions that deliver meaningful return on investment (ROI). This focused approach helps direct resources where they’ll have the greatest impact.
Visualize: Defining success through measurement
The next step requires translating the potential benefits—cost reduction, revenue growth, risk mitigation, improved customer experience, and competitive advantage—into clear, measurable performance indicators. A comprehensive measurement framework starts with baseline metrics using historical data where available. These metrics should address both technical aspects like accuracy and response time, as well as business outcomes such as productivity gains and customer satisfaction.
The Visualize phase examines data availability and quality to support proper measurement while working with stakeholders to define success criteria that align with strategic objectives. This dual focus helps organizations track not just the performance of the AI solution, but its actual impact on business goals.
Validate: Where ambition meets reality
The Validate phase focuses on testing solutions against real-world conditions and constraints. Our approach integrates strategic vision with implementation expertise from day one. As Sri Elaprolu, Director of the Generative AI Innovation Center, explains: “Effective validation creates alignment between vision and execution. We unite diverse perspectives—from scientists to business leaders—so that solutions deliver both technical excellence and measurable business impact.”
This process involves systematic integration testing, stress testing for expected loads, verifying compliance requirements, and gathering end-user feedback. Security specialists shape the core architecture. Industry subject matter experts define the operational processes and decision logic that guide prompt design and model refinement. Change management strategies are integrated early to ensure alignment and adoption.
The Generative AI Innovation Center partnered with SparkXGlobal, an AI-driven marketing-technology company, to validate their new solution through comprehensive testing. Their platform, Xnurta, provides business analytics and reporting for Amazon merchants, demonstrating impressive results: report processing time dropped from 6-8 hours to just 8 minutes while maintaining 95% accuracy. This successful validation established a foundation for SparkXGlobal’s continued innovation and enhanced AI capabilities.
Working with the Generative AI Innovation Center, the U.S. Environmental Protection Agency (EPA) created an intelligent document processing solution powered by Anthropic models on Amazon Bedrock. This solution helped EPA scientists accelerate chemical risk assessments and pesticide reviews through transparent, verifiable, and human-controlled AI practices. The impact has been substantial: document processing time decreased by 85%, evaluation costs dropped by 99%, and more than 10,000 regulatory applications have advanced faster to protect public health.
Verify: The path to production
Moving from pilot to production requires more than proof of concept—it demands scalable solutions that integrate with existing systems and deliver consistent value. While demos can seem compelling, verification reveals the true complexity of enterprise-wide deployment. This critical stage maps the journey from prototype to production, establishing a foundation for sustainable success.
Building production-ready AI solutions brings together several key elements. Robust governance structures must facilitate responsible AI deployment and oversight, managing risk and compliance in an evolving regulatory landscape. Change management prepares teams and processes for new ways of working, driving organization-wide adoption. Operational readiness assessments evaluate existing workflows, integration points, and team capabilities to facilitate smooth implementation.
Architectural decisions in the verification phase balance scale, reliability, and operability, with security and compliance woven into the solution’s fabric. This often involves practical trade-offs based on real-world constraints. A simpler solution aligned to existing team capabilities may prove more valuable than a complex one requiring specialized expertise. Similarly, meeting strict latency requirements might necessitate choosing a streamlined model over a more sophisticated one, as model selection requires a balance of performance, accuracy, and computational costs based on the use case.
Generative AI Innovation Center Principal Data Scientist, Isaac Privitera, captures this philosophy: “When building a generative AI solution, we focus primarily on three things: measurable business impact, production readiness from day one, and sustained operational excellence. This trinity drives solutions that thrive in real-world conditions.”
Effective verification demands both technical expertise and practical wisdom from real-world deployments. It requires proving not just that a solution works in principle, but that it can operate at scale within existing systems and team capabilities. By systematically addressing these factors, we help make sure deployments deliver sustainable, long-term value.
Venture: Securing long-term success
Long-term success in AI also requires mindful resource planning across people, processes, and funding. The Venture phase maps the full journey from implementation through sustained organizational adoption.
Financial viability starts with understanding the total cost of ownership, from initial development through deployment, integration, training, and ongoing operations. Promising projects can stall mid-implementation due to insufficient resource planning. Success requires strategic budget allocation across all phases, with clear ROI milestones and the flexibility to scale.
Successful ventures demand organizational commitment through executive sponsorship, stakeholder alignment, and dedicated teams for ongoing optimization and maintenance. Organizations must also account for both direct and indirect costs—from infrastructure and development, to team training, process adaptation, and change management. A blend of sound financial planning and flexible resource strategies allows teams to accelerate and adjust as opportunities and challenges arise.
From there, the solution must integrate seamlessly into daily operations with clear ownership and widespread adoption. This transforms AI from a project into a core organizational capability.
Adopting the Five V’s Framework in your enterprise
The Five V’s Framework shifts AI focus from technical capabilities to business results, replacing ‘What can AI do?’ with ‘What do we need AI to do?’. Successful implementation requires both an innovative culture and access to specialized expertise.

AWS resources to support your journey
AWS offers a variety of resources to help you scale your AI to production.
Expert guidance
The AWS Partner Network (APN) offers multiple pathways to access specialized expertise, while AWS Professional Services brings proven methodologies from its own successful AI implementations. Certified partners, including Generative AI Partner Innovation Alliance members who receive direct enablement training from the Generative AI Innovation Center team, extend this expertise across industries. AWS Generative AI Competency Partners bring use case-specific success, while specialized partners focus on model customization and evaluation.
Self-service learning
For teams building internal capabilities, AWS provides technical blogs with implementation guides based on real-world experience, GitHub repositories with production-ready code, and AWS Workshop Studio for hands-on learning that bridges theory and practice.
Balancing learning and innovation
Even with the right framework and resources, not every AI project will reach production. These initiatives still provide valuable lessons that strengthen your overall program. Organizations can build lasting AI capabilities through three key principles:

Embracing a portfolio approach: Treat AI initiatives as an investment portfolio where diversification drives risk management and value creation. Balance quick wins (delivering value within months), strategic initiatives (driving longer-term transformation), and moonshot projects (potentially revolutionizing your business).
Creating a culture of safe experimentation: Organizations thrive with AI when teams can innovate boldly. In rapidly evolving fields, the cost of inaction often exceeds the risk of calculated experiments.
Learning from “productive failures”: Capture insights systematically across projects. Technical challenges reveal capability gaps, data issues expose information needs, and organizational readiness concerns illuminate broader transformation requirements – all shaping future initiatives.

The path forward
The next 12-18 months present a pivotal opportunity for organizations to harness generative AI and agentic AI to solve previously intractable problems, establish competitive advantages, and explore entirely new frontiers of business possibility. Those who successfully move from pilot to production will help define what’s possible within their industries and beyond.
Are you ready to move your AI initiatives into production?

Learn more about the AWS Generative AI Innovation Center and contact your AWS Account Manager to be connected to our expert guidance and support.
Join our AWS Builder community to connect with others on a similar AI journey.

About the authors
Sri Elaprolu serves as Director of the AWS Generative AI Innovation Center, where he leverages nearly three decades of technology leadership experience to drive artificial intelligence and machine learning innovation. In this role, he leads a global team of machine learning scientists and engineers who develop and deploy advanced generative and agentic AI solutions for enterprise and government organizations facing complex business challenges. Throughout his nearly 13-year tenure at AWS, Sri has held progressively senior positions, including leadership of ML science teams that partnered with high-profile organizations such as the NFL, Cerner, and NASA. These collaborations enabled AWS customers to harness AI and ML technologies for transformative business and operational outcomes. Prior to joining AWS, he spent 14 years at Northrop Grumman, where he successfully managed product development and software engineering teams. Sri holds a Master’s degree in Engineering Science and an MBA with a concentration in general management, providing him with both the technical depth and business acumen essential for his current leadership role.
Dr. Diego Socolinsky is currently the North America Head of the Generative AI Innovation Center at Amazon Web Services (AWS). With over 25 years of experience at the intersection of technology, machine learning, and computer vision, he has built a career driving innovation from cutting-edge research to production-ready solutions. Dr. Socolinsky holds a Ph.D. in Mathematics from The Johns Hopkins University and has been a pioneer in various fields including thermal imaging biometrics, augmented/mixed reality, and generative AI initiatives. His technical expertise spans from optimizing low-level embedded systems to architecting complex real-time deep learning solutions, with particular focus on generative AI platforms, large-scale unstructured data classification, and advanced computer vision applications. He is known for his ability to bridge the gap between technical innovation and strategic business objectives, consistently delivering transformative technology that solves complex real-world problems.
Sabine Khan is a Strategic Initiatives Leader with the AWS Generative AI Innovation Center, where she implements delivery and strategy initiatives focused on scaling enterprise-grade Generative AI solutions. She specializes in production-ready AI systems and drives agentic AI projects from concept to deployment. With over twenty years of experience in software delivery and a strong focus on AI/ML during her tenure at AWS, she has established a track record of successful enterprise implementations. Prior to AWS, she led digital transformation initiatives and held product development and software engineering leadership roles in Houston’s energy sector. Sabine holds a Master’s degree in GeoScience and an MBA.
Andrea Jimenez is a dual master’s candidate at the Massachusetts Institute of Technology, pursuing an M.S. in Computer Science from the School of Engineering and an MBA from the Sloan School of Management. As a GenAI Lead Graduate Fellow at the MIT GenAI Innovation Center, she researches agentic AI systems and the economic implications of generative AI technologies, while leveraging her background in artificial intelligence, product development, and startup innovation to lead teams at the intersection of technology and business strategy. Her work focuses on advancing human-AI collaboration and translating cutting-edge research into scalable, high-impact solutions. Prior to AWS and MIT, she led product and engineering teams in the tech industry and founded and sold a startup that helped early-stage companies build and launch SaaS products.
Randi Larson connects AI innovation with executive strategy for the AWS Generative AI Innovation Center, shaping how organizations understand and translate technical breakthroughs into business value. She combines strategic storytelling with data-driven insight through global keynotes, Amazon’s first tech-for-good podcast, and conversations with industry and Amazon leaders on AI transformation. Before Amazon, Randi refined her analytical precision as a Bloomberg journalist and advisor to economic institutions, think tanks, and family offices on technology initiatives. Randi holds an MBA from Duke University’s Fuqua School of Business and a B.S. in Journalism and Spanish from Boston University.