How msg enhanced HR workforce transformation with Amazon Bedrock and m …

This post is co-written with Stefan Walter from msg.
With more than 10,000 experts in 34 countries, msg is both an independent software vendor and a system integrator operating in highly regulated industries, with over 40 years of domain-specific expertise. msg.ProfileMap is a software as a service (SaaS) solution for skill and competency management. It’s an AWS Partner qualified software available on AWS Marketplace, currently serving more than 7,500 users. HR and strategy departments use msg.ProfileMap for project staffing and workforce transformation initiatives. By offering a centralized view of skills and competencies, msg.ProfileMap helps organizations map their workforce’s capabilities, identify skill gaps, and implement targeted development strategies. This supports more effective project execution, better alignment of talent to roles, and long-term workforce planning.
In this post, we share how msg automated data harmonization for msg.ProfileMap, using Amazon Bedrock to power its large language model (LLM)-driven data enrichment workflows, resulting in higher accuracy in HR concept matching, reduced manual workload, and improved alignment with compliance requirements under the EU AI Act and GDPR.
The importance of AI-based data harmonization
HR departments face increasing pressure to operate as data-driven organizations, but are often constrained by the inconsistent, fragmented nature of their data. Critical HR documents are unstructured, and legacy systems use mismatched formats and data models. This not only impairs data quality but also leads to inefficiencies and decision-making blind spots.

Accurate and harmonized HR data is foundational for key activities such as matching candidates to roles, identifying internal mobility opportunities, conducting skills gap analysis, and planning workforce development. msg identified that without automated, scalable methods to process and unify this data, organizations would continue to struggle with manual overhead and inconsistent results.
Solution overview
HR data is typically scattered across diverse sources and formats, ranging from relational databases to Excel files, Word documents, and PDFs. Additionally, entities such as personnel numbers or competencies have different unique identifiers as well as different text descriptions, although with the same semantics. msg addressed this challenge with a modular architecture, tailored for IT workforce scenarios. As illustrated in the following diagram, at the core of msg.ProfileMap is a robust text extraction layer, which transforms heterogeneous inputs into structured data. This is then passed to an AI-powered harmonization engine that provides consistency across data sources by avoiding duplication and aligning disparate concepts.

The harmonization process uses a hybrid retrieval approach that combines vector-based semantic similarity and string-based matching techniques. These methods align incoming data with existing entities in the system. Amazon Bedrock is used to semantically enrich data, improving cross-source compatibility and matching precision. Extracted and enriched data is indexed and stored using Amazon OpenSearch Service and Amazon DynamoDB, facilitating fast and accurate retrieval, as shown in the following diagram.

The framework is designed to be unsupervised and domain independent. Although it’s optimized for IT workforce use cases, it has demonstrated strong generalization capabilities in other domains as well.
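To make the hybrid retrieval approach concrete, the following is a minimal sketch (not msg's implementation) that combines a vector similarity obtained from an Amazon Bedrock embedding model with a simple string-similarity score; the model ID, request format, weights, and example strings are illustrative assumptions.

import json
import difflib
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")  # assumes AWS credentials and region are configured

def embed(text: str) -> np.ndarray:
    # Request an embedding from Amazon Bedrock (Titan Text Embeddings; model ID is an assumption)
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

def hybrid_score(candidate: str, existing: str, w_vec: float = 0.7, w_str: float = 0.3) -> float:
    # Combine vector-based semantic similarity with string-based matching
    a, b = embed(candidate), embed(existing)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    string_sim = difflib.SequenceMatcher(None, candidate.lower(), existing.lower()).ratio()
    return w_vec * cosine + w_str * string_sim

# A new skill is merged into an existing concept only above a confidence threshold
print(hybrid_score("K8s administration", "Kubernetes Administration"))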
msg.ProfileMap is a cloud-based application that uses several AWS services, notably Amazon Neptune, Amazon DynamoDB, and Amazon Bedrock. The following diagram illustrates the full solution architecture.

Results and technical validation
msg evaluated the effectiveness of the data harmonization framework through internal testing on IT workforce concepts and external benchmarking in the Bio-ML Track of the Ontology Alignment Evaluation Initiative (OAEI), an international, EU-funded research initiative that has evaluated ontology matching technologies since 2004.
During internal testing, the system processed 2,248 concepts across multiple suggestion types. High-probability merge recommendations reached 95.5% accuracy, covering nearly 60% of all inputs. This helped msg reduce manual validation workload by over 70%, significantly improving time-to-value for HR teams.
During OAEI 2024, msg.ProfileMap ranked at the top of the 2024 Bio-ML benchmark, outperforming other systems across multiple biomedical datasets. On NCIT-DOID, it achieved a 0.918 F1 score, with Hits@1 exceeding 92%, validating the engine’s generalizability beyond the HR domain. Additional details are available in the official test results.
Why Amazon Bedrock
msg relies on LLMs to semantically enrich data in near real time. These workloads require low-latency inference, flexible scaling, and operational simplicity. Amazon Bedrock met these needs by providing a fully managed, serverless interface to leading foundation models—without the need to manage infrastructure or deploy custom machine learning stacks.
Unlike hosting models on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon SageMaker, Amazon Bedrock abstracts away provisioning, versioning, scaling, and model selection. Its consumption-based pricing aligns directly with msg’s SaaS delivery model—resources are used (and billed) only when needed. This simplified integration reduced overhead and helped msg scale elastically as customer demand grew.
Amazon Bedrock also helped msg meet compliance goals under the EU AI Act and GDPR by enabling tightly scoped, auditable interactions with model APIs—critical for HR use cases that handle sensitive workforce data.
Conclusion
msg’s successful integration of Amazon Bedrock into msg.ProfileMap demonstrates that large-scale AI adoption doesn’t require complex infrastructure or specialized model training. By combining modular design, ontology-based harmonization, and the fully managed LLM capabilities of Amazon Bedrock, msg delivered an AI-powered workforce intelligence platform that is accurate, scalable, and compliant.

This solution improved concept match precision and achieved top marks in international AI benchmarks, demonstrating what’s possible when generative AI is paired with the right cloud-based service. With Amazon Bedrock, msg has built a platform that’s ready for today’s HR challenges—and tomorrow’s.
msg.ProfileMap is available as a SaaS offering on AWS Marketplace. If you are interested in knowing more, you can reach out to msg.hcm.backoffice@msg.group.
The content and opinions in this blog post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

About the authors
Stefan Walter is Senior Vice President of AI SaaS Solutions at msg. With over 25 years of experience in IT software development, architecture, and consulting, Stefan Walter leads with a vision for scalable SaaS innovation and operational excellence. As a BU lead at msg, Stefan has spearheaded transformative initiatives that bridge business strategy with technology execution, especially in complex, multi-entity environments.
Gianluca Vegetti is a Senior Enterprise Architect in the AWS Partner Organization, aligned to Strategic Partnership Collaboration and Governance (SPCG) engagements. In his role, he supports the definition and execution of Strategic Collaboration Agreements with selected AWS partners.
Yuriy Bezsonov is a Senior Partner Solution Architect at AWS. With over 25 years in the tech industry, Yuriy has progressed from a software developer to an engineering manager and Solutions Architect. At AWS, he assists partners and customers in developing cloud solutions, focusing on container technologies, Kubernetes, Java, application modernization, SaaS, developer experience, and GenAI. Yuriy holds AWS and Kubernetes certifications, and he is a recipient of the AWS Golden Jacket and the CNCF Kubestronaut Blue Jacket.

Top 5 No-Code Tools for AI Engineers/Developers

In today’s AI-driven world, no-code tools are transforming how people create and deploy intelligent applications. They empower anyone—regardless of coding expertise—to build solutions quickly and efficiently. From developing enterprise-grade RAG systems to designing multi-agent workflows or fine-tuning hundreds of LLMs, these platforms dramatically reduce development time and effort. In this article, we’ll explore five powerful no-code tools that make building AI solutions faster and more accessible than ever.

Sim AI

Sim AI is an open-source platform for visually building and deploying AI agent workflows—no coding required. Using its drag-and-drop canvas, you can connect AI models, APIs, databases, and business tools to create:

AI Assistants & Chatbots: Agents that search the web, access calendars, send emails, and interact with business apps.

Business Process Automation: Streamline tasks such as data entry, report creation, customer support, and content generation.

Data Processing & Analysis: Extract insights, analyze datasets, create reports, and sync data across systems.

API Integration Workflows: Orchestrate complex logic, unify services, and manage event-driven automation.

Key features:

Visual canvas with “smart blocks” (AI, API, logic, output).

Multiple triggers (chat, REST API, webhooks, schedulers, Slack/GitHub events).

Real-time team collaboration with permissions control.

80+ built-in integrations (AI models, communication tools, productivity apps, dev platforms, search services, and databases).

MCP support for custom integrations.

Deployment options:

Cloud-hosted (managed infrastructure with scaling & monitoring).

Self-hosted (via Docker, with local model support for data privacy).

RAGFlow

RAGFlow is a powerful retrieval-augmented generation (RAG) engine that helps you build grounded, citation-rich AI assistants on top of your own datasets. It runs on x86 CPUs or NVIDIA GPUs (with optional ARM builds) and provides full or slim Docker images for quick deployment. After spinning up a local server, you can connect an LLM—via API or local runtimes like Ollama—to handle chat, embedding, or image-to-text tasks. RAGFlow supports most popular language models and allows you to set defaults or customize models for each assistant.

Key capabilities include:

Knowledge base management: Upload and parse files (PDF, Word, CSV, images, slides, and more) into datasets, select an embedding model, and organize content for efficient retrieval.

Chunk editing & optimization: Inspect parsed chunks, add keywords, or manually adjust content to improve search accuracy.

AI chat assistants: Create chats linked to one or multiple knowledge bases, configure fallback responses, and fine-tune prompts or model settings.

Explainability & testing: Use built-in tools to validate retrieval quality, monitor performance, and view real-time citations.

Integration & extensibility: Leverage HTTP and Python APIs for app integration, with an optional sandbox for safe code execution inside chats.

Transformer Lab

Transformer Lab is a free, open-source workspace for Large Language Models (LLMs) and Diffusion models, designed to run on your local machine—whether that’s a GPU, TPU, or Apple M-series Mac—or in the cloud. It enables you to download, chat with, and evaluate LLMs, generate images using Diffusion models, and compute embeddings, all from one flexible environment.

Key capabilities include:

Model management: Download and interact with LLMs, or generate images using state-of-the-art Diffusion models.

Data preparation & training: Create datasets, fine-tune, or train models, including support for RLHF and preference tuning.

Retrieval-augmented generation (RAG): Use your own documents to power intelligent, grounded conversations.

Embeddings & evaluation: Calculate embeddings and assess model performance across different inference engines.

Extensibility & community: Build plugins, contribute to the core application, and collaborate via the active Discord community.

LLaMA-Factory

LLaMA-Factory is a powerful no-code platform for training and fine-tuning open-source Large Language Models (LLMs) and Vision-Language Models (VLMs). It supports over 100 models, multimodal fine-tuning, advanced optimization algorithms, and scalable resource configurations. Designed for researchers and practitioners, it offers extensive tools for pre-training, supervised fine-tuning, reward modeling, and reinforcement learning methods like PPO and DPO—along with easy experiment tracking and faster inference.

Key highlights include:

Broad model support: Works with LLaMA, Mistral, Qwen, DeepSeek, Gemma, ChatGLM, Phi, Yi, Mixtral-MoE, and many more.

Training methods: Supports continuous pre-training, multimodal SFT, reward modeling, PPO, DPO, KTO, ORPO, and more.

Scalable tuning options: Full-tuning, freeze-tuning, LoRA, QLoRA (2–8 bit), OFT, DoRA, and other resource-efficient techniques.

Advanced algorithms & optimizations: Includes GaLore, BAdam, APOLLO, Muon, FlashAttention-2, RoPE scaling, NEFTune, rsLoRA, and others.

Tasks & modalities: Handles dialogue, tool use, image/video/audio understanding, visual grounding, and more.

Monitoring & inference: Integrates with LlamaBoard, TensorBoard, Wandb, MLflow, and SwanLab, plus offers fast inference via OpenAI-style APIs, Gradio UI, or CLI with vLLM/SGLang workers.

Flexible infrastructure: Compatible with PyTorch, Hugging Face Transformers, Deepspeed, BitsAndBytes, and supports both CPU/GPU setups with memory-efficient quantization.

AutoAgent

AutoAgent is a fully automated, self-developing framework that lets you create and deploy LLM-powered agents using natural language alone. Designed to simplify complex workflows, it enables you to build, customize, and run intelligent tools and assistants without writing a single line of code.

Key features include:

High performance: Achieves top-tier results on the GAIA benchmark, rivaling advanced deep research agents.

Effortless agent & workflow creation: Build tools, agents, and workflows through simple natural language prompts—no coding required.

Agentic-RAG with native vector database: Comes with a self-managing vector database, offering superior retrieval compared to traditional solutions like LangChain.

Broad LLM compatibility: Integrates seamlessly with leading models such as OpenAI, Anthropic, DeepSeek, vLLM, Grok, Hugging Face, and more.

Flexible interaction modes: Supports both function-calling and ReAct-style reasoning for versatile use cases.

Lightweight & extensible: A dynamic personal AI assistant that’s easy to customize and extend while remaining resource-efficient.

The post Top 5 No-Code Tools for AI Engineers/Developers appeared first on MarkTechPost.

Software Frameworks Optimized for GPUs in AI: CUDA, ROCm, Triton, Tens …

Table of contents
What actually determines performance on modern GPUs
CUDA: nvcc/ptxas, cuDNN, CUTLASS, and CUDA Graphs
ROCm: HIP/Clang toolchain, rocBLAS/MIOpen, and the 6.x series
Triton: a DSL and compiler for custom kernels
TensorRT (and TensorRT-LLM): builder-time graph optimization for inference
Practical guidance: choosing and tuning the stack

Deep-learning throughput hinges on how effectively a compiler stack maps tensor programs to GPU execution: thread/block schedules, memory movement, and instruction selection (e.g., Tensor Core MMA pipelines). In this article, we focus on four dominant stacks—CUDA, ROCm, Triton, and TensorRT—from the compiler’s perspective and explain which optimizations move the needle in practice.

What actually determines performance on modern GPUs

Across vendors, the same levers recur:

Operator scheduling & fusion: reduce kernel launches and round-trips to HBM; expose longer producer→consumer chains for register/shared-memory reuse. TensorRT and cuDNN “runtime fusion engines” exemplify this for attention and conv blocks.

Tiling & data layout: match tile shapes to Tensor Core/WGMMA/WMMA native fragment sizes; avoid shared-memory bank conflicts and partition camping. CUTLASS documents warp-level GEMM tiling for both Tensor Cores and CUDA cores.

Precision & quantization: FP16/BF16/FP8 for training/inference; INT8/INT4 (calibrated or QAT) for inference. TensorRT automates calibration and kernel selection under these precisions.

Graph capture & runtime specialization: graph execution to amortize launch overheads; dynamic fusion of common subgraphs (e.g., attention). cuDNN 9 added graph support for attention fusion engines.

Autotuning: search tile sizes, unroll factors, and pipelining depths per arch/SKU. Triton and CUTLASS expose explicit autotune hooks; TensorRT performs builder-time tactic selection.

With that lens, here’s how each stack implements the above.

CUDA: nvcc/ptxas, cuDNN, CUTLASS, and CUDA Graphs

Compiler path. CUDA code compiles through nvcc into PTX, then ptxas lowers PTX to SASS (arch-specific machine code). Controlling optimization requires feeding flags to both host and device phases; for kernels the key is -Xptxas. Developers often miss that -O3 alone affects only host code.
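As a small, hedged illustration of feeding flags to the device phase, the snippet below builds a custom CUDA extension through PyTorch and forwards -Xptxas options to nvcc; the source file names are placeholders.

from torch.utils.cpp_extension import load

fused_op = load(
    name="fused_op",
    sources=["fused_op.cpp", "fused_op_kernel.cu"],  # placeholder source files
    extra_cflags=["-O3"],                            # host-side optimization only
    extra_cuda_cflags=["-O3", "-Xptxas", "-O3,-v"],  # forwarded to nvcc/ptxas for device code
    verbose=True,
)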

Kernel generation & libraries.

CUTLASS provides parametric templates for GEMM/conv, implementing warp-level tiling, Tensor Core MMA pipelines, and smem iterators designed for conflict-free access—canonical references for writing peak kernels, including Hopper’s WGMMA path.

cuDNN 9 introduced runtime fusion engines (notably for attention blocks), native CUDA Graph integration for those engines, and updates for new compute capabilities—materially reducing dispatch overheads and improving memory locality in Transformer workloads.

Performance implications.

Moving from unfused PyTorch ops to cuDNN attention fusion typically cuts kernel launches and global memory traffic; combined with CUDA Graphs, it reduces CPU bottlenecks in short-sequence inference.

On Hopper/Blackwell, aligning tile shapes to WGMMA/Tensor Core native sizes is decisive; CUTLASS tutorials quantify how mis-sized tiles waste tensor-core throughput.

When CUDA is the right tool. You need maximum control over instruction selection, occupancy, and smem choreography; or you’re extending kernels beyond library coverage while staying on NVIDIA GPUs.

ROCm: HIP/Clang toolchain, rocBLAS/MIOpen, and the 6.x series

Compiler path. ROCm uses Clang/LLVM to compile HIP (CUDA-like) into GCN/RDNA ISA. The 6.x series has focused on perf and framework coverage; release notes track component-level optimizations and HW/OS support.

Libraries and kernels.

rocBLAS and MIOpen implement GEMM/conv primitives with arch-aware tiling and algorithm selection similar in spirit to cuBLAS/cuDNN. The consolidated changelog highlights iterative perf work across these libraries.

Recent ROCm workstream includes better Triton enablement on AMD GPUs, enabling Python-level kernel authoring while still lowering through LLVM to AMD backends.

Performance implications.

On AMD GPUs, matching LDS (shared memory) bank widths and vectorized global loads to matrix tile shapes is as pivotal as smem bank alignment on NVIDIA. Compiler-assisted fusion in frameworks (e.g., attention) plus library autotuning in rocBLAS/MIOpen typically closes a large fraction of the gap to handwritten kernels, contingent on architecture/driver. Release documentation indicates continuous tuner improvements in 6.0–6.4.x.

When ROCm is the right tool. You need native support and optimization on AMD accelerators, with HIP portability from existing CUDA-style kernels and a clear LLVM toolchain.
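As a quick sanity check of the HIP portability point, a ROCm build of PyTorch reuses the familiar CUDA-style API surface, so the usual calls dispatch to AMD libraries; this is a generic snippet, not ROCm-specific tuning advice.

import torch

print(torch.version.hip)          # non-None on a ROCm build of PyTorch
print(torch.cuda.is_available())  # True when an AMD GPU is visible

x = torch.randn(4096, 4096, device="cuda")  # the "cuda" device maps to the AMD GPU on ROCm
y = x @ x                                   # GEMM dispatched to rocBLAS/hipBLAS under the hood
torch.cuda.synchronize()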

Triton: a DSL and compiler for custom kernels

Compiler path. Triton is a Python-embedded DSL that lowers via LLVM; it handles vectorization, memory coalescing, and register allocation while giving explicit control over block sizes and program IDs. Build docs show the LLVM dependency and custom builds; NVIDIA’s developer materials discuss Triton’s tuning for newer architectures (e.g., Blackwell) with FP16/FP8 GEMM improvements.

Optimizations.

Autotuning over tile sizes, num_warps, and pipelining stages; static masking for boundary conditions without scalar fallbacks; shared-memory staging and software pipelining to overlap global loads with compute.

Triton’s design aims to automate the error-prone parts of CUDA-level optimization while leaving block-level tiling choices to the author; the original announcement outlines that separation of concerns.
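A minimal autotuned kernel, in the style of the official tutorials, shows that separation of concerns: the author chooses candidate block sizes and writes masked loads/stores, while Triton handles vectorization, coalescing, and register allocation. The configs below are illustrative, not tuned values.

import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["n_elements"],
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                 # static masking for the boundary tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n)              # the autotuner picks the best config per shape
    return out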

Performance implications.

Triton shines when you need a fused, shape-specialized kernel outside library coverage (e.g., bespoke attention variants, normalization-activation-matmul chains). On modern NVIDIA parts, vendor collabs report architecture-specific improvements in the Triton backend, reducing the penalty versus CUTLASS-style kernels for common GEMMs.

When Triton is the right tool. You want near-CUDA performance for custom fused ops without writing SASS/WMMA, and you value Python-first iteration with autotuning.

TensorRT (and TensorRT-LLM): builder-time graph optimization for inference

Compiler path. TensorRT ingests ONNX or framework graphs and emits a hardware-specific engine. During the build, it performs layer/tensor fusion, precision calibration (INT8, FP8/FP16), and kernel tactic selection; best-practice docs describe these builder phases. TensorRT-LLM extends this with LLM-specific runtime optimizations.

Optimizations.

Graph-level: constant folding, concat-slice canonicalization, conv-bias-activation fusion, attention fusion.

Precision: post-training calibration (entropy/percentile/mse) and per-tensor quantization, plus smooth-quant/QAT workflows in TensorRT-LLM.

Runtime: paged-KV cache, in-flight batching, and scheduling for multi-stream/multi-GPU deployments (TensorRT-LLM docs).

Performance implications.

The largest wins typically come from: end-to-end INT8 (or FP8 on Hopper/Blackwell where supported), removing framework overhead via a single engine, and aggressive attention fusion. TensorRT’s builder produces per-arch engine plans to avoid generic kernels at runtime.
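A hedged sketch of that builder flow with the TensorRT 8.x-style Python API is shown below; the ONNX file name is a placeholder and calibration/quantization steps are omitted for brevity.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:            # placeholder model file
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)          # allow reduced-precision kernels where supported

# Layer/tensor fusion and per-architecture tactic selection happen during this build step
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)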

When TensorRT is the right tool. Production inference on NVIDIA GPUs where you can pre-compile an optimized engine and benefit from quantization and large-graph fusion.

Practical guidance: choosing and tuning the stack

Training vs. inference.

Training/experimental kernels → CUDA + CUTLASS (NVIDIA) or ROCm + rocBLAS/MIOpen (AMD); Triton for custom fused ops.

Production inference on NVIDIA → TensorRT/TensorRT-LLM for global graph-level gains.

Exploit architecture-native instructions.

On NVIDIA Hopper/Blackwell, ensure tiles map to WGMMA/WMMA sizes; CUTLASS materials show how warp-level GEMM and smem iterators should be structured.

On AMD, align LDS usage and vector widths to CU datapaths; leverage ROCm 6.x autotuners and Triton-on-ROCm for shape-specialized ops.

Fuse first, then quantize.

Kernel/graph fusion reduces memory traffic; quantization reduces bandwidth and increases math density. TensorRT’s builder-time fusions plus INT8/FP8 often deliver multiplicative gains.

Use graph execution for short sequences.

CUDA Graphs integrated with cuDNN attention fusions amortize launch overheads in autoregressive inference.
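The PyTorch capture-and-replay pattern looks roughly like the following minimal sketch with a toy model; a real deployment would capture the full fused decode step.

import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_in = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream before capture, as required by the capture rules
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)   # capture the whole forward pass once

static_in.copy_(torch.randn(8, 1024, device="cuda"))  # update inputs in place
g.replay()                                            # re-launch the captured graph with one call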

Treat compiler flags as first-class.

For CUDA, remember device-side flags: for example, -Xptxas -O3,-v (and -Xptxas -O0 when diagnosing). Host-only -O3 isn’t sufficient.

References:

https://developer.nvidia.com/blog/introducing-cudnn-9/

https://rocmdocs.amd.com/en/latest/relnotes/relnotes.html

https://rocmdocs.amd.com/en/latest/develop/performance/tuning-guides/triton.html

https://github.com/NVIDIA/cutlass

https://docs.nvidia.com/deeplearning/cudnn/latest/index.html

https://docs.nvidia.com/deeplearning/tensorrt/archives/index.html

https://github.com/ROCm/ROCm/releases

https://triton-lang.org/main/getting-started/installation.html

https://github.com/NVIDIA/cutlass/tree/main/examples

https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html

Getting Started with CUDA Graphs

https://rocmdocs.amd.com/en/latest/release/changelog.html

https://triton-lang.org/main/getting-started/tutorials/index.html

https://github.com/NVIDIA/cutlass/blob/main/media/docs/warplevel-gemm.md

https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#compiler-options

https://nvidia.github.io/TensorRT-LLM/


https://developer.nvidia.com/blog/nvidia-triton-on-blackwell-gpus

The post Software Frameworks Optimized for GPUs in AI: CUDA, ROCm, Triton, TensorRT—Compiler Paths and Performance Implications appeared first on MarkTechPost.

UT Austin and ServiceNow Research Team Releases AU-Harness: An Open-So …

Voice AI is becoming one of the most important frontiers in multimodal AI. From intelligent assistants to interactive agents, the ability to understand and reason over audio is reshaping how machines engage with humans. Yet while models have grown rapidly in capability, the tools for evaluating them have not kept pace. Existing benchmarks remain fragmented, slow, and narrowly focused, often making it difficult to compare models or test them in realistic, multi-turn settings.

To address this gap, researchers at UT Austin and ServiceNow have released AU-Harness, a new open-source toolkit built to evaluate Large Audio Language Models (LALMs) at scale. AU-Harness is designed to be fast, standardized, and extensible, enabling researchers to test models across a wide range of tasks—from speech recognition to complex audio reasoning—within a single unified framework.

Why do we need a new audio evaluation framework?

Existing audio benchmarks have focused on applications like speech-to-text or emotion recognition. Frameworks such as AudioBench, VoiceBench, and DynamicSUPERB-2.0 broadened coverage, but critical gaps remain.

Three issues stand out. The first is throughput bottlenecks: many toolkits don’t take advantage of batching or parallelism, making large-scale evaluations painfully slow. The second is prompting inconsistency, which makes results hard to compare across models. The third is restricted task scope: key areas such as diarization (who spoke when) and spoken reasoning (following instructions delivered in audio) are often missing.

These gaps limit the progress of LALMs, especially as they evolve into multimodal agents that must handle long, context-heavy, and multi-turn interactions.

https://arxiv.org/pdf/2509.08031

How does AU-Harness improve efficiency?

The research team designed AU-Harness with a focus on speed. By integrating with the vLLM inference engine, it introduces a token-based request scheduler that manages concurrent evaluations across multiple nodes. It also shards datasets so that workloads are distributed proportionally across compute resources.

This design allows near-linear scaling of evaluations and keeps hardware fully utilized. In practice, AU-Harness delivers 127% higher throughput and reduces the real-time factor (RTF) by nearly 60% compared to existing toolkits. For researchers, this means evaluations that once took days can now complete in hours.
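As a generic illustration of the two ideas at play here, proportional dataset sharding and the real-time factor, the following sketch is not AU-Harness code; the capacities and timings are invented.

def shard_dataset(items, capacities):
    # Split items across workers in proportion to each worker's capacity
    total = sum(capacities)
    shards, start = [], 0
    for cap in capacities:
        size = round(len(items) * cap / total)
        shards.append(items[start:start + size])
        start += size
    shards[-1].extend(items[start:])  # hand any rounding remainder to the last shard
    return shards

def real_time_factor(processing_seconds, audio_seconds):
    # RTF < 1.0 means the evaluation runs faster than real time
    return processing_seconds / audio_seconds

print(shard_dataset(list(range(10)), capacities=[2, 1, 1]))
print(real_time_factor(processing_seconds=1800, audio_seconds=3600))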

Can evaluations be customized?

Flexibility is another core feature of AU-Harness. Each model in an evaluation run can have its own hyperparameters, such as temperature or max token settings, without breaking standardization. Configurations allow for dataset filtering (e.g., by accent, audio length, or noise profile), enabling targeted diagnostics.

Perhaps most importantly, AU-Harness supports multi-turn dialogue evaluation. Earlier toolkits were limited to single-turn tasks, but modern voice agents operate in extended conversations. With AU-Harness, researchers can benchmark dialogue continuity, contextual reasoning, and adaptability across multi-step exchanges.

What tasks does AU-Harness cover?

AU-Harness dramatically expands task coverage, supporting 50+ datasets, 380+ subsets, and 21 tasks across six categories:

Speech Recognition: from simple ASR to long-form and code-switching speech.

Paralinguistics: emotion, accent, gender, and speaker recognition.

Audio Understanding: scene and music comprehension.

Spoken Language Understanding: question answering, translation, and dialogue summarization.

Spoken Language Reasoning: speech-to-coding, function calling, and multi-step instruction following.

Safety & Security: robustness evaluation and spoofing detection.

Two innovations stand out:

LLM-Adaptive Diarization, which evaluates diarization through prompting rather than specialized neural models.

Spoken Language Reasoning, which tests models’ ability to process and reason about spoken instructions, rather than just transcribe them.

https://arxiv.org/pdf/2509.08031

What do the benchmarks reveal about today’s models?

When applied to leading systems like GPT-4o, Qwen2.5-Omni, and Voxtral-Mini-3B, AU-Harness highlights both strengths and weaknesses.

Models excel at ASR and question answering, showing strong accuracy in speech recognition and spoken QA tasks. But they lag in temporal reasoning tasks, such as diarization, and in complex instruction-following, particularly when instructions are given in audio form.

A key finding is the instruction modality gap: when identical tasks are presented as spoken instructions instead of text, performance drops by as much as 9.5 points. This suggests that while models are adept at processing text-based reasoning, adapting those skills to the audio modality remains an open challenge.

https://arxiv.org/pdf/2509.08031

Summary

AU-Harness marks an important step toward standardized and scalable evaluation of audio language models. By combining efficiency, reproducibility, and broad task coverage—including diarization and spoken reasoning—it addresses the long-standing gaps in benchmarking voice-enabled AI. Its open-source release and public leaderboard invite the community to collaborate, compare, and push the boundaries of what voice-first AI systems can achieve.

Check out the Paper, Project and GitHub Page.
The post UT Austin and ServiceNow Research Team Releases AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs appeared first on MarkTechPost.

Top 12 Robotics AI Blogs/News Websites 2025

Robotics and artificial intelligence are converging at an unprecedented pace, driving breakthroughs in automation, perception, and human-machine collaboration. Staying current with these advancements requires following specialized sources that deliver technical depth, research updates, and industry insights. The following list highlights 12 of the most authoritative robotics and AI-focused blogs and websites to track in 2025.

IEEE Spectrum – Robotics

IEEE Spectrum’s robotics section remains one of the most respected sources for deep technical reporting on autonomy, robot design, locomotion, and control. It combines industry analysis with lab-level insights.

MarkTechPost

MarkTechPost regularly covers robotics research within the broader AI and machine learning ecosystem. It highlights cutting-edge work in robot learning, perception, simulation, and multi-agent systems.

Robohub

Robohub is a community-driven platform with contributions from robotics researchers, engineers, and practitioners worldwide. It includes interviews, technical discussions, and updates from research labs.

The Robot Report

This news platform blends robotics industry news with technical reporting. It tracks startup activity, industrial automation, and advanced robot designs across sectors.

Academic & Research Lab Blogs

Blogs from labs such as MIT CSAIL, CMU Robotics Institute, and Berkeley Artificial Intelligence Research (BAIR) often post about their latest robotics research, datasets, and open-source releases.

Specialist AI-Robotics Hybrids

AI-focused platforms like DeepMind’s blog and Meta AI Research blog frequently publish robotics-related research at the intersection of deep learning, simulation, and embodied AI.

Robotics Industries Association (RIA) – Robotics.org

The RIA offers updates on robotics standards, system integration, and industrial automation with strong technical context.

Phys.org – Robotics Section

Phys.org aggregates global robotics research news, covering new algorithms, robotic platforms, and mechanical innovations across academia and industry.

ZDNet – Robotics

ZDNet’s robotics coverage focuses on automation in enterprise settings, offering insight into emerging robotic platforms and their technical deployment.

Singularity Hub – Robots

Singularity Hub explores robotics research along with long-term societal implications. Articles often bridge lab breakthroughs with discussions on AI ethics and human-robot coexistence.

IEEE Robotics & Automation Society

The IEEE RAS blog and conference sites (e.g., IROS, RSS) share technical papers, tutorials, and summaries, making them essential for academic and applied robotics communities.

Towards Data Science – Robotics/AI Articles

Practitioners publish robotics-AI tutorials, implementations, and control algorithm discussions here, bridging applied ML with robotics systems.

Conclusion

As robotics continues to evolve across industrial, academic, and consumer domains, these platforms provide essential perspectives on research progress, engineering practices, and real-world deployment. Whether the focus is on control systems, embodied AI, or collaborative robots, these resources remain critical for understanding the trajectory of robotics and its integration with AI in 2025 and beyond.
The post Top 12 Robotics AI Blogs/NewsWebsites 2025 appeared first on MarkTechPost.

How to Build a Robust Advanced Neural AI Agent with Stable Training, A …

In this tutorial, we explore the design and implementation of an Advanced Neural Agent that combines classical neural network techniques with modern stability improvements. We build the network using Xavier initialization for balanced gradient flow and add stable activations like leaky ReLU, sigmoid, and tanh with clipping to avoid overflow. To stabilize training, we apply gradient clipping, momentum-inspired updates, and weight decay. The training loop includes mini-batches, early stopping, adaptive learning rates, and resets on instability, making the model robust for complex datasets. We also normalize targets, compute MSE, MAE, and R², and extend the agent with experience replay and exploratory decision-making, turning it into a flexible system for regression, classification-to-regression, and RL-style tasks. Check out the FULL CODES here.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

We start by importing essential libraries like NumPy, Matplotlib, and scikit-learn, which we use for data generation, preprocessing, and splitting. We also suppress warnings to keep our workflow clean and focused. Check out the FULL CODES here.

class AdvancedNeuralAgent:
    def __init__(self, input_size, hidden_layers=[64, 32], output_size=1, learning_rate=0.001):
        """Advanced AI Agent with stable training and decision making capabilities"""
        self.lr = learning_rate
        self.initial_lr = learning_rate
        self.layers = []
        self.memory = []
        self.performance_history = []
        self.epsilon = 1e-8

        layer_sizes = [input_size] + hidden_layers + [output_size]
        for i in range(len(layer_sizes) - 1):
            fan_in, fan_out = layer_sizes[i], layer_sizes[i+1]
            limit = np.sqrt(6.0 / (fan_in + fan_out))

            layer = {
                'weights': np.random.uniform(-limit, limit, (layer_sizes[i], layer_sizes[i+1])),
                'bias': np.zeros((1, layer_sizes[i+1])),
                'momentum_w': np.zeros((layer_sizes[i], layer_sizes[i+1])),
                'momentum_b': np.zeros((1, layer_sizes[i+1]))
            }
            self.layers.append(layer)

    def activation(self, x, func='relu'):
        """Stable activation functions with clipping"""
        x = np.clip(x, -50, 50)

        if func == 'relu':
            return np.maximum(0, x)
        elif func == 'sigmoid':
            return 1 / (1 + np.exp(-x))
        elif func == 'tanh':
            return np.tanh(x)
        elif func == 'leaky_relu':
            return np.where(x > 0, x, x * 0.01)
        elif func == 'linear':
            return x

    def activation_derivative(self, x, func='relu'):
        """Stable derivatives"""
        x = np.clip(x, -50, 50)

        if func == 'relu':
            return (x > 0).astype(float)
        elif func == 'sigmoid':
            s = self.activation(x, 'sigmoid')
            return s * (1 - s)
        elif func == 'tanh':
            return 1 - np.tanh(x)**2
        elif func == 'leaky_relu':
            return np.where(x > 0, 1, 0.01)
        elif func == 'linear':
            return np.ones_like(x)

    def forward(self, X):
        """Forward pass with gradient clipping"""
        self.activations = [X]
        self.z_values = []

        current_input = X
        for i, layer in enumerate(self.layers):
            z = np.dot(current_input, layer['weights']) + layer['bias']
            z = np.clip(z, -50, 50)
            self.z_values.append(z)

            if i < len(self.layers) - 1:
                a = self.activation(z, 'leaky_relu')
            else:
                a = self.activation(z, 'linear')

            self.activations.append(a)
            current_input = a

        return current_input

    def clip_gradients(self, gradients, max_norm=1.0):
        """Gradient clipping to prevent explosion"""
        grad_norm = np.linalg.norm(gradients)
        if grad_norm > max_norm:
            gradients = gradients * (max_norm / (grad_norm + self.epsilon))
        return gradients

    def backward(self, X, y, output):
        """Stable backpropagation with gradient clipping"""
        m = X.shape[0]

        dz = (output - y.reshape(-1, 1)) / m
        dz = np.clip(dz, -10, 10)

        for i in reversed(range(len(self.layers))):
            layer = self.layers[i]

            dw = np.dot(self.activations[i].T, dz)
            db = np.sum(dz, axis=0, keepdims=True)

            dw = self.clip_gradients(dw, max_norm=1.0)
            db = self.clip_gradients(db, max_norm=1.0)

            momentum = 0.9
            layer['momentum_w'] = momentum * layer['momentum_w'] + (1 - momentum) * dw
            layer['momentum_b'] = momentum * layer['momentum_b'] + (1 - momentum) * db

            weight_decay = 0.0001
            layer['weights'] -= self.lr * (layer['momentum_w'] + weight_decay * layer['weights'])
            layer['bias'] -= self.lr * layer['momentum_b']

            if i > 0:
                dz = np.dot(dz, layer['weights'].T) * self.activation_derivative(
                    self.z_values[i-1], 'leaky_relu')
                dz = np.clip(dz, -10, 10)

    def adapt_learning_rate(self, epoch, performance_history):
        """Adaptive learning rate with performance-based adjustment"""
        if epoch > 10:
            recent_performance = performance_history[-10:]
            if len(recent_performance) >= 5:
                if recent_performance[-1] >= recent_performance[-5]:
                    self.lr = max(self.lr * 0.95, self.initial_lr * 0.01)
                elif recent_performance[-1] < recent_performance[-5] * 0.98:
                    self.lr = min(self.lr * 1.02, self.initial_lr * 2)

    def calculate_loss(self, y_true, y_pred):
        """Stable loss calculation"""
        y_true = y_true.reshape(-1, 1)
        y_pred = np.clip(y_pred, -1e6, 1e6)

        mse = np.mean((y_true - y_pred) ** 2)
        mae = np.mean(np.abs(y_true - y_pred))

        if not np.isfinite(mse):
            mse = 1e6
        if not np.isfinite(mae):
            mae = 1e6

        return mse, mae

    def store_experience(self, state, action, reward, next_state):
        """Experience replay for RL aspects"""
        experience = {
            'state': state,
            'action': action,
            'reward': reward,
            'next_state': next_state,
            'timestamp': len(self.memory)
        }
        self.memory.append(experience)

        if len(self.memory) > 1000:
            self.memory.pop(0)

    def make_decision(self, X, exploration_rate=0.1):
        """Stable decision making"""
        prediction = self.forward(X)

        if np.random.random() < exploration_rate:
            noise_scale = np.std(prediction) * 0.1 if np.std(prediction) > 0 else 0.1
            noise = np.random.normal(0, noise_scale, prediction.shape)
            prediction += noise

        return np.clip(prediction, -1e6, 1e6)

    def reset_if_unstable(self):
        """Reset network if training becomes unstable"""
        print("Resetting network due to instability...")
        for layer in self.layers:
            fan_in, fan_out = layer['weights'].shape
            limit = np.sqrt(6.0 / (fan_in + fan_out))
            layer['weights'] = np.random.uniform(-limit, limit, (fan_in, fan_out))
            layer['bias'] = np.zeros((1, fan_out))
            layer['momentum_w'] = np.zeros((fan_in, fan_out))
            layer['momentum_b'] = np.zeros((1, fan_out))
        self.lr = self.initial_lr

    def train(self, X, y, epochs=500, batch_size=32, validation_split=0.2, verbose=True):
        """Robust training with stability checks"""
        y_mean, y_std = np.mean(y), np.std(y)
        y_normalized = (y - y_mean) / (y_std + self.epsilon)

        X_trn, X_val, y_trn, y_val = train_test_split(
            X, y_normalized, test_size=validation_split, random_state=42)

        best_val_loss = float('inf')
        patience = 30
        patience_counter = 0

        train_losses, val_losses = [], []
        reset_count = 0

        for epoch in range(epochs):
            if epoch > 0 and (not np.isfinite(train_losses[-1]) or train_losses[-1] > 1e6):
                if reset_count < 2:
                    self.reset_if_unstable()
                    reset_count += 1
                    continue
                else:
                    print("Training unstable, stopping...")
                    break

            indices = np.random.permutation(len(X_trn))
            X_train_shuffled = X_trn[indices]
            y_train_shuffled = y_trn[indices]

            epoch_loss = 0
            batches = 0
            for i in range(0, len(X_trn), batch_size):
                batch_X = X_train_shuffled[i:i+batch_size]
                batch_y = y_train_shuffled[i:i+batch_size]

                if len(batch_X) == 0:
                    continue

                output = self.forward(batch_X)
                self.backward(batch_X, batch_y, output)

                loss, _ = self.calculate_loss(batch_y, output)
                epoch_loss += loss
                batches += 1

            avg_train_loss = epoch_loss / max(batches, 1)

            val_output = self.forward(X_val)
            val_loss, val_mae = self.calculate_loss(y_val, val_output)

            train_losses.append(avg_train_loss)
            val_losses.append(val_loss)
            self.performance_history.append(val_loss)

            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
            else:
                patience_counter += 1

            if patience_counter >= patience:
                if verbose:
                    print(f"Early stopping at epoch {epoch}")
                break

            if epoch > 0:
                self.adapt_learning_rate(epoch, self.performance_history)

            if verbose and (epoch % 50 == 0 or epoch < 10):
                print(f"Epoch {epoch:3d}: Train Loss = {avg_train_loss:.4f}, "
                      f"Val Loss = {val_loss:.4f}, LR = {self.lr:.6f}")

        self.y_mean, self.y_std = y_mean, y_std
        return train_losses, val_losses

    def predict(self, X):
        """Make predictions with denormalization"""
        normalized_pred = self.forward(X)
        if hasattr(self, 'y_mean') and hasattr(self, 'y_std'):
            return normalized_pred * self.y_std + self.y_mean
        return normalized_pred

    def evaluate_performance(self, X, y):
        """Comprehensive performance evaluation"""
        predictions = self.predict(X)
        mse, mae = self.calculate_loss(y, predictions)

        y_mean = np.mean(y)
        ss_tot = np.sum((y - y_mean) ** 2)
        ss_res = np.sum((y.reshape(-1, 1) - predictions) ** 2)
        r2 = 1 - (ss_res / (ss_tot + self.epsilon))

        return {
            'mse': float(mse) if np.isfinite(mse) else float('inf'),
            'mae': float(mae) if np.isfinite(mae) else float('inf'),
            'r2': float(r2) if np.isfinite(r2) else -float('inf'),
            'predictions': predictions.flatten()
        }

    def visualize_training(self, train_losses, val_losses):
        """Visualize training progress"""
        plt.figure(figsize=(15, 5))

        plt.subplot(1, 3, 1)
        plt.plot(train_losses, label='Training Loss', alpha=0.8)
        plt.plot(val_losses, label='Validation Loss', alpha=0.8)
        plt.title('Training Progress')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.yscale('log')

        plt.subplot(1, 3, 2)
        if len(self.performance_history) > 0:
            plt.plot(self.performance_history)
        plt.title('Performance History')
        plt.xlabel('Epoch')
        plt.ylabel('Validation Loss')
        plt.grid(True, alpha=0.3)
        plt.yscale('log')

        plt.subplot(1, 3, 3)
        if hasattr(self, 'lr_history'):
            plt.plot(self.lr_history)
        plt.title('Learning Rate Schedule')
        plt.xlabel('Epoch')
        plt.ylabel('Learning Rate')
        plt.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

We implement an AdvancedNeuralAgent that we initialize with Xavier limits, leaky-ReLU activations, and momentum buffers to stabilize gradients and speed convergence. We train with mini-batches, gradient clipping, L2 weight decay, adaptive learning rates, early stopping, and automatic resets, and we track MSE/MAE/R² with normalization for reliable metrics. We also add experience replay and exploratory decisions for agent-like behavior, and we expose plotting utilities to visualize losses, validation history, and the LR schedule. Check out the FULL CODES here.

class AIAgentDemo:
    """Demo class for testing the AI Agent with various scenarios"""

    def __init__(self):
        self.agents = {}
        self.results = {}

    def generate_datasets(self):
        """Generate multiple test datasets"""
        datasets = {}

        X1, y1 = make_regression(n_samples=600, n_features=5, n_informative=4,
                                 noise=0.1, random_state=42)
        datasets['simple'] = (X1, y1, "Simple Regression")

        X2, y2 = make_regression(n_samples=800, n_features=10, n_informative=8,
                                 noise=0.2, random_state=123)
        datasets['complex'] = (X2, y2, "Complex Regression")

        X3, y3 = make_classification(n_samples=700, n_features=8, n_informative=6,
                                     n_classes=2, random_state=456)
        y3 = y3.astype(float) + np.random.normal(0, 0.1, len(y3))
        datasets['classification'] = (X3, y3, "Classification-to-Regression")

        return datasets

    def test_agent_configuration(self, config_name, X, y, **agent_params):
        """Test agent with specific configuration"""
        print(f"\nTesting {config_name}...")

        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)

        default_params = {
            'input_size': X_scaled.shape[1],
            'hidden_layers': [32, 16],
            'output_size': 1,
            'learning_rate': 0.005
        }
        default_params.update(agent_params)

        agent = AdvancedNeuralAgent(**default_params)

        try:
            train_losses, val_losses = agent.train(
                X_scaled, y, epochs=150, batch_size=32, verbose=False)

            X_trn, X_test, y_trn, y_test = train_test_split(
                X_scaled, y, test_size=0.2, random_state=42)

            performance = agent.evaluate_performance(X_test, y_test)

            self.agents[config_name] = agent
            self.results[config_name] = {
                'performance': performance,
                'train_losses': train_losses,
                'val_losses': val_losses,
                'data_shape': X_scaled.shape
            }

            print(f"{config_name}: R²={performance['r2']:.3f}, MSE={performance['mse']:.3f}")
            return True

        except Exception as e:
            print(f"{config_name} failed: {str(e)[:50]}...")
            return False

    def run_comprehensive_demo(self):
        """Run comprehensive testing of the AI agent"""
        print("COMPREHENSIVE AI AGENT DEMO")
        print("=" * 60)

        datasets = self.generate_datasets()

        configs = {
            'lightweight': {'hidden_layers': [16, 8], 'learning_rate': 0.01},
            'standard': {'hidden_layers': [32, 16], 'learning_rate': 0.005},
            'deep': {'hidden_layers': [64, 32, 16], 'learning_rate': 0.003},
            'wide': {'hidden_layers': [128, 64], 'learning_rate': 0.002}
        }

        success_count = 0
        total_tests = len(datasets) * len(configs)

        for dataset_name, (X, y, desc) in datasets.items():
            print(f"\nDataset: {desc} - Shape: {X.shape}")
            print(f"Target range: [{np.min(y):.2f}, {np.max(y):.2f}]")

            for config_name, config_params in configs.items():
                test_name = f"{dataset_name}_{config_name}"
                if self.test_agent_configuration(test_name, X, y, **config_params):
                    success_count += 1

        print(f"\nOVERALL RESULTS: {success_count}/{total_tests} tests successful")

        if self.results:
            self.show_best_performers()
            return self.demonstrate_agent_intelligence()
        return None

    def show_best_performers(self):
        """Show top performing configurations"""
        print(f"\nTOP PERFORMERS:")

        sorted_results = sorted(self.results.items(),
                                key=lambda x: x[1]['performance']['r2'],
                                reverse=True)

        for i, (name, result) in enumerate(sorted_results[:5]):
            perf = result['performance']
            print(f"{i+1}. {name}: R²={perf['r2']:.3f}, MSE={perf['mse']:.3f}, MAE={perf['mae']:.3f}")

    def demonstrate_agent_intelligence(self):
        """Demonstrate advanced AI capabilities"""
        if not self.agents:
            return None

        print(f"\nINTELLIGENCE DEMONSTRATION:")

        best_name = max(self.results.keys(),
                        key=lambda x: self.results[x]['performance']['r2'])
        best_agent = self.agents[best_name]

        print(f"Using best agent: {best_name}")

        print(f"Memory capacity: {len(best_agent.memory)} experiences")

        dummy_input = np.random.randn(3, best_agent.layers[0]['weights'].shape[0])
        conservative_decisions = best_agent.make_decision(dummy_input, exploration_rate=0.0)
        exploratory_decisions = best_agent.make_decision(dummy_input, exploration_rate=0.3)

        print(f"Decision making:")
        print(f"  Conservative: {conservative_decisions.flatten()[:3]}")
        print(f"  Exploratory: {exploratory_decisions.flatten()[:3]}")

        if len(best_agent.performance_history) > 10:
            initial_perf = np.mean(best_agent.performance_history[:5])
            final_perf = np.mean(best_agent.performance_history[-5:])
            improvement = ((initial_perf - final_perf) / initial_perf) * 100
            print(f"Learning improvement: {improvement:.1f}%")

        total_params = sum(layer['weights'].size + layer['bias'].size
                           for layer in best_agent.layers)
        print(f"Network complexity: {total_params} parameters")

        return best_agent

We orchestrate a comprehensive demo where we generate multiple datasets, sweep agent configurations, and train/evaluate each setup with standardized metrics (R², MSE, MAE). We log results, rank top performers, and then showcase “intelligence” by probing memory, exploration vs. exploitation decisions, learning improvement, and total parameter count. Check out the FULL CODES here.

def run_quick_demo():
    """Quick demo for immediate testing"""
    print("QUICK AI AGENT DEMO")
    print("=" * 40)

    X, y = make_regression(n_samples=500, n_features=6, noise=0.15, random_state=42)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    print(f"Dataset: {X_scaled.shape[0]} samples, {X_scaled.shape[1]} features")

    agent = AdvancedNeuralAgent(
        input_size=X_scaled.shape[1],
        hidden_layers=[24, 12],
        output_size=1,
        learning_rate=0.008
    )

    print("Training agent...")
    train_losses, val_losses = agent.train(X_scaled, y, epochs=100, verbose=False)

    X_trn, X_test, y_trn, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
    performance = agent.evaluate_performance(X_test, y_test)

    print(f"\nRESULTS:")
    print(f"R² Score: {performance['r2']:.3f}")
    print(f"MSE: {performance['mse']:.3f}")
    print(f"MAE: {performance['mae']:.3f}")

    agent.visualize_training(train_losses, val_losses)

    return agent

We add a quick demo utility that trains the agent on a simple regression dataset with six features, using a lightweight two-layer configuration. We normalize the data, train for 100 epochs, evaluate on a test split, and display R², MSE, and MAE before plotting training vs. validation loss curves for immediate feedback. Check out the FULL CODES here.

if __name__ == "__main__":
    print("Choose demo type:")
    print("1. Quick Demo (fast)")
    print("2. Comprehensive Demo (detailed)")

    demo = AIAgentDemo()
    best_agent = demo.run_comprehensive_demo()

We define the main entry point so the script can be run directly. We display demo options, initialize AIAgentDemo, and by default execute the comprehensive demo, which trains multiple configurations across datasets, evaluates performance, and highlights the best agent.

In conclusion, we demonstrate how stability-aware engineering choices, ranging from weight decay regularization to dynamic learning rate scaling based on validation loss history, play a critical role in achieving consistent performance across diverse datasets. The agent is not just a static predictor; it actively adapts by storing past experiences, injecting controlled exploration into its decisions, and resetting its parameters when instability thresholds are reached. We further validate the design through comprehensive demos across lightweight, standard, deep, and wide configurations, benchmarking performance on simple, complex, and classification-derived regression datasets. The results highlight measurable improvements in R², MSE, and MAE, while visualization tools provide insight into learning dynamics and convergence behavior.

Check out the FULL CODES here.
The post How to Build a Robust Advanced Neural AI Agent with Stable Training, Adaptive Learning, and Intelligent Decision-Making? appeared first on MarkTechPost.

Google AI Releases VaultGemma: The Largest and Most Capable Open Model …

Google AI Research and DeepMind have released VaultGemma 1B, the largest open-weight large language model trained entirely with differential privacy (DP). This development is a major step toward building AI models that are both powerful and privacy-preserving.

Why Do We Need Differential Privacy in LLMs?

Large language models trained on vast web-scale datasets are prone to memorization attacks, where sensitive or personally identifiable information can be extracted from the model. Studies have shown that verbatim training data can resurface, especially in open-weight releases.

Differential Privacy offers a mathematical guarantee that prevents any single training example from significantly influencing the model. Unlike approaches that apply DP only during fine-tuning, VaultGemma enforces full private pretraining, ensuring that privacy protection begins at the foundational level.

https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf

What Is the Architecture of VaultGemma?

VaultGemma is architecturally similar to earlier Gemma models, but optimized for private training.

Model size: 1B parameters, 26 layers.

Transformer type: Decoder-only.

Activations: GeGLU with feedforward dimension of 13,824.

Attention: Multi-Query Attention (MQA) with global span of 1024 tokens.

Normalization: RMSNorm in pre-norm configuration.

Tokenizer: SentencePiece with a 256K vocabulary.

A notable change is the reduction of sequence length to 1024 tokens, which lowers compute costs and enables larger batch sizes under DP constraints.

What Data Was Used for Training?

VaultGemma was trained on the same 13 trillion-token dataset as Gemma 2, composed primarily of English text from web documents, code, and scientific articles.

The dataset underwent several filtering stages to:

Remove unsafe or sensitive content.

Reduce personal information exposure.

Prevent evaluation data contamination.

This ensures both safety and fairness in benchmarking.

How Was Differential Privacy Applied?

VaultGemma used DP-SGD (Differentially Private Stochastic Gradient Descent) with gradient clipping and Gaussian noise addition. Implementation was built on JAX Privacy and introduced optimizations for scalability:

Vectorized per-example clipping for parallel efficiency.

Gradient accumulation to simulate large batches.

Truncated Poisson Subsampling integrated into the data loader for efficient on-the-fly sampling.

The model achieved a formal DP guarantee of (ε ≤ 2.0, δ ≤ 1.1e−10) at the sequence level (1024 tokens).
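A minimal NumPy sketch of the DP-SGD update described above (per-example clipping followed by Gaussian noise) might look like the following; the gradient layout and learning rate are illustrative, and only the noise multiplier matches the reported configuration.

import numpy as np

def dp_sgd_step(params, per_example_grads, lr=1e-3, clip_norm=1.0, noise_multiplier=0.614):
    # per_example_grads: one flattened gradient vector per example, shape (batch, dim)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    summed = clipped.sum(axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    noisy_mean = (summed + noise) / per_example_grads.shape[0]
    return params - lr * noisy_mean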

How Do Scaling Laws Work for Private Training?

Training large models under DP constraints requires new scaling strategies. The VaultGemma team developed DP-specific scaling laws with three innovations:

Optimal learning rate modeling using quadratic fits across training runs.

Parametric extrapolation of loss values to reduce reliance on intermediate checkpoints.

Semi-parametric fits to generalize across model size, training steps, and noise-batch ratios.

This methodology enabled precise prediction of achievable loss and efficient resource use on the TPUv6e training cluster.
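As a toy illustration of the quadratic-fit idea for the optimal learning rate (not the VaultGemma methodology itself), one can fit a parabola to pilot-run losses in log-learning-rate space and read off its vertex; the measurements below are invented.

import numpy as np

log_lrs = np.log10(np.array([1e-4, 3e-4, 1e-3, 3e-3, 1e-2]))
losses = np.array([3.10, 2.95, 2.88, 2.92, 3.20])  # assumed losses from short pilot runs

a, b, c = np.polyfit(log_lrs, losses, deg=2)       # fit loss ~ a*x**2 + b*x + c in log-LR space
best_log_lr = -b / (2 * a)                         # vertex of the parabola
print(f"predicted optimal learning rate: {10 ** best_log_lr:.2e}")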

What Were the Training Configurations?

VaultGemma was trained on 2048 TPUv6e chips using GSPMD partitioning and MegaScale XLA compilation.

Batch size: ~518K tokens.

Training iterations: 100,000.

Noise multiplier: 0.614.

The achieved loss was within 1% of predictions from the DP scaling law, validating the approach.

How Does VaultGemma Perform Compared to Non-Private Models?

On academic benchmarks, VaultGemma trails its non-private counterparts but shows strong utility:

ARC-C: 26.45 vs. 38.31 (Gemma-3 1B).

PIQA: 68.0 vs. 70.51 (GPT-2 1.5B).

TriviaQA (5-shot): 11.24 vs. 39.75 (Gemma-3 1B).

These results suggest that DP-trained models are currently comparable to non-private models from about five years ago. Importantly, memorization tests confirmed that no training data leakage was detectable in VaultGemma, unlike in non-private Gemma models.

https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf

Summary

In summary, VaultGemma 1B proves that large-scale language models can be trained with rigorous differential privacy guarantees without making them impractical to use. While a utility gap remains compared to non-private counterparts, the release of both the model and its training methodology provides the community with a strong foundation for advancing private AI. This work signals a shift toward building models that are not only capable but also inherently safe, transparent, and privacy-preserving.

Check out the Paper, Model on Hugging Face and Technical Details.
The post Google AI Releases VaultGemma: The Largest and Most Capable Open Model (1B-parameters) Trained from Scratch with Differential Privacy appeared first on MarkTechPost.

How to Build a Multilingual OCR AI Agent in Python with EasyOCR and OpenCV

In this tutorial, we build an Advanced OCR AI Agent in Google Colab using EasyOCR, OpenCV, and Pillow, running fully offline with GPU acceleration. The agent includes a preprocessing pipeline with contrast enhancement (CLAHE), denoising, sharpening, and adaptive thresholding to improve recognition accuracy. Beyond basic OCR, we filter results by confidence, generate text statistics, and perform pattern detection (emails, URLs, dates, phone numbers) along with simple language hints. The design also supports batch processing, visualization with bounding boxes, and structured exports for flexible usage. Check out the FULL CODES here.

!pip install easyocr opencv-python pillow matplotlib

import easyocr
import cv2
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter
import matplotlib.pyplot as plt
import os
import json
from typing import List, Dict, Tuple, Optional
import re
from google.colab import files
import io

We start by installing the required libraries, EasyOCR, OpenCV, Pillow, and Matplotlib, to set up our environment. We then import all necessary modules so we can handle image preprocessing, OCR, visualization, and file operations seamlessly. Check out the FULL CODES here.

class AdvancedOCRAgent:
    """
    Advanced OCR AI Agent with preprocessing, multi-language support,
    and intelligent text extraction capabilities.
    """

    def __init__(self, languages: List[str] = ['en'], gpu: bool = True):
        """Initialize OCR agent with specified languages."""
        print("Initializing Advanced OCR Agent...")
        self.languages = languages
        self.reader = easyocr.Reader(languages, gpu=gpu)
        self.confidence_threshold = 0.5
        print(f"OCR Agent ready! Languages: {languages}")

    def upload_image(self) -> Optional[str]:
        """Upload image file through Colab interface."""
        print("Upload your image file:")
        uploaded = files.upload()
        if uploaded:
            filename = list(uploaded.keys())[0]
            print(f"Uploaded: {filename}")
            return filename
        return None

    def preprocess_image(self, image: np.ndarray, enhance: bool = True) -> np.ndarray:
        """Advanced image preprocessing for better OCR accuracy."""
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()

        if enhance:
            # Contrast enhancement with CLAHE
            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
            gray = clahe.apply(gray)

        # Denoising followed by a sharpening kernel
        gray = cv2.fastNlMeansDenoising(gray)

        kernel = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
        gray = cv2.filter2D(gray, -1, kernel)

        # Adaptive thresholding produces a clean binary image for OCR
        binary = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
        )

        return binary

    def extract_text(self, image_path: str, preprocess: bool = True) -> Dict:
        """Extract text from image with advanced processing."""
        print(f"Processing image: {image_path}")

        image = cv2.imread(image_path)
        if image is None:
            raise ValueError(f"Could not load image: {image_path}")

        if preprocess:
            processed_image = self.preprocess_image(image)
        else:
            processed_image = image

        results = self.reader.readtext(processed_image)

        extracted_data = {
            'raw_results': results,
            'filtered_results': [],
            'full_text': '',
            'confidence_stats': {},
            'word_count': 0,
            'line_count': 0
        }

        high_confidence_text = []
        confidences = []

        # Keep only detections above the confidence threshold
        for (bbox, text, confidence) in results:
            if confidence >= self.confidence_threshold:
                extracted_data['filtered_results'].append({
                    'text': text,
                    'confidence': confidence,
                    'bbox': bbox
                })
                high_confidence_text.append(text)
                confidences.append(confidence)

        extracted_data['full_text'] = ' '.join(high_confidence_text)
        extracted_data['word_count'] = len(extracted_data['full_text'].split())
        extracted_data['line_count'] = len(high_confidence_text)

        if confidences:
            extracted_data['confidence_stats'] = {
                'mean': np.mean(confidences),
                'min': np.min(confidences),
                'max': np.max(confidences),
                'std': np.std(confidences)
            }

        return extracted_data

    def visualize_results(self, image_path: str, results: Dict, show_bbox: bool = True):
        """Visualize OCR results with bounding boxes."""
        image = cv2.imread(image_path)
        image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        plt.figure(figsize=(15, 10))

        if show_bbox:
            plt.subplot(2, 2, 1)
            img_with_boxes = image_rgb.copy()

            for item in results['filtered_results']:
                bbox = np.array(item['bbox'], dtype=np.int32)
                cv2.polylines(img_with_boxes, [bbox], True, (255, 0, 0), 2)

                x, y = bbox[0]
                cv2.putText(img_with_boxes, f"{item['confidence']:.2f}",
                            (int(x), int(y) - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)

            plt.imshow(img_with_boxes)
            plt.title("OCR Results with Bounding Boxes")
            plt.axis('off')

        plt.subplot(2, 2, 2)
        processed = self.preprocess_image(image)
        plt.imshow(processed, cmap='gray')
        plt.title("Preprocessed Image")
        plt.axis('off')

        plt.subplot(2, 2, 3)
        confidences = [item['confidence'] for item in results['filtered_results']]
        if confidences:
            plt.hist(confidences, bins=20, alpha=0.7, color='blue')
            plt.xlabel('Confidence Score')
            plt.ylabel('Frequency')
            plt.title('Confidence Score Distribution')
            plt.axvline(self.confidence_threshold, color='red', linestyle='--',
                        label=f'Threshold: {self.confidence_threshold}')
            plt.legend()

        plt.subplot(2, 2, 4)
        stats = results['confidence_stats']
        if stats:
            labels = ['Mean', 'Min', 'Max']
            values = [stats['mean'], stats['min'], stats['max']]
            plt.bar(labels, values, color=['green', 'red', 'blue'])
            plt.ylabel('Confidence Score')
            plt.title('Confidence Statistics')
            plt.ylim(0, 1)

        plt.tight_layout()
        plt.show()

    def smart_text_analysis(self, text: str) -> Dict:
        """Perform intelligent analysis of extracted text."""
        analysis = {
            'language_detection': 'unknown',
            'text_type': 'unknown',
            'key_info': {},
            'patterns': []
        }

        # Regex patterns for common structured information
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        phone_pattern = r'(\+\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
        url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'

        patterns = {
            'emails': re.findall(email_pattern, text, re.IGNORECASE),
            'phones': re.findall(phone_pattern, text),
            'urls': re.findall(url_pattern, text, re.IGNORECASE),
            'dates': re.findall(date_pattern, text)
        }

        analysis['patterns'] = {k: v for k, v in patterns.items() if v}

        if any(patterns.values()):
            if patterns.get('emails') or patterns.get('phones'):
                analysis['text_type'] = 'contact_info'
            elif patterns.get('urls'):
                analysis['text_type'] = 'web_content'
            elif patterns.get('dates'):
                analysis['text_type'] = 'document_with_dates'

        # Lightweight language hints based on character ranges
        if re.search(r'[а-яё]', text.lower()):
            analysis['language_detection'] = 'russian'
        elif re.search(r'[àáâãäåæçèéêëìíîïñòóôõöøùúûüý]', text.lower()):
            analysis['language_detection'] = 'romance_language'
        elif re.search(r'[一-龯]', text):
            analysis['language_detection'] = 'chinese'
        elif re.search(r'[ぁ-んァ-ン]', text):
            analysis['language_detection'] = 'japanese'
        elif re.search(r'[a-zA-Z]', text):
            analysis['language_detection'] = 'latin_based'

        return analysis

    def process_batch(self, image_folder: str) -> List[Dict]:
        """Process multiple images in batch."""
        results = []
        supported_formats = ('.png', '.jpg', '.jpeg', '.bmp', '.tiff')

        for filename in os.listdir(image_folder):
            if filename.lower().endswith(supported_formats):
                image_path = os.path.join(image_folder, filename)
                try:
                    result = self.extract_text(image_path)
                    result['filename'] = filename
                    results.append(result)
                    print(f"Processed: {filename}")
                except Exception as e:
                    print(f"Error processing {filename}: {str(e)}")

        return results

    def export_results(self, results: Dict, format: str = 'json') -> str:
        """Export results in specified format."""
        if format.lower() == 'json':
            # default=str handles NumPy values and bounding-box arrays in the results
            output = json.dumps(results, indent=2, ensure_ascii=False, default=str)
            filename = 'ocr_results.json'
        elif format.lower() == 'txt':
            output = results['full_text']
            filename = 'extracted_text.txt'
        else:
            raise ValueError("Supported formats: 'json', 'txt'")

        with open(filename, 'w', encoding='utf-8') as f:
            f.write(output)

        print(f"Results exported to: {filename}")
        return filename

We define an AdvancedOCRAgent that we initialize with multilingual EasyOCR and a GPU, and we set a confidence threshold to control output quality. We preprocess images (CLAHE, denoise, sharpen, adaptive threshold), extract text, visualize bounding boxes and confidence, run smart pattern/language analysis, support batch folders, and export results as JSON or TXT. Check out the FULL CODES here.

def demo_ocr_agent():
    """Demonstrate the OCR agent capabilities."""
    print("Advanced OCR AI Agent Demo")
    print("=" * 50)

    ocr = AdvancedOCRAgent(languages=['en'], gpu=True)

    image_path = ocr.upload_image()
    if image_path:
        try:
            results = ocr.extract_text(image_path, preprocess=True)

            print("\nOCR Results:")
            print(f"Words detected: {results['word_count']}")
            print(f"Lines detected: {results['line_count']}")
            print(f"Average confidence: {results['confidence_stats'].get('mean', 0):.2f}")

            print("\nExtracted Text:")
            print("-" * 30)
            print(results['full_text'])
            print("-" * 30)

            analysis = ocr.smart_text_analysis(results['full_text'])
            print("\nSmart Analysis:")
            print(f"Detected text type: {analysis['text_type']}")
            print(f"Language hints: {analysis['language_detection']}")
            if analysis['patterns']:
                print(f"Found patterns: {list(analysis['patterns'].keys())}")

            ocr.visualize_results(image_path, results)

            ocr.export_results(results, 'json')

        except Exception as e:
            print(f"Error: {str(e)}")
    else:
        print("No image uploaded. Please try again.")

if __name__ == "__main__":
    demo_ocr_agent()

We create a demo function that walks us through the full OCR workflow: we initialize the agent with English and GPU support, upload an image, preprocess it, and extract text with confidence stats. We then display the results, run smart text analysis to detect patterns and language hints, visualize bounding boxes and scores, and finally export everything into a JSON file.

In conclusion, we create a robust OCR pipeline that combines preprocessing, recognition, and analysis in a single Colab workflow. We enhance EasyOCR outputs using OpenCV techniques, visualize results for interpretability, and add confidence metrics for reliability. The agent is modular, allowing both single-image and batch processing, with results exported in JSON or text formats. This shows that open-source tools can deliver production-grade OCR without external APIs, while leaving room for domain-specific extensions like invoice or document parsing.

Check out the FULL CODES here.
The post How to Build a Multilingual OCR AI Agent in Python with EasyOCR and OpenCV appeared first on MarkTechPost.

BentoML Released llm-optimizer: An Open-Source AI Tool for Benchmarking and Optimizing LLM Inference

BentoML has recently released llm-optimizer, an open-source framework designed to streamline the benchmarking and performance tuning of self-hosted large language models (LLMs). The tool addresses a common challenge in LLM deployment: finding optimal configurations for latency, throughput, and cost without relying on manual trial-and-error.

Why is tuning LLM performance difficult?

Tuning LLM inference is a balancing act across many moving parts—batch size, framework choice (vLLM, SGLang, etc.), tensor parallelism, sequence lengths, and how well the hardware is utilized. Each of these factors can shift performance in different ways, which makes finding the right combination for speed, efficiency, and cost far from straightforward. Most teams still rely on repetitive trial-and-error testing, a process that is slow, inconsistent, and often inconclusive. For self-hosted deployments, the cost of getting it wrong is high: poorly tuned configurations can quickly translate into higher latency and wasted GPU resources.

How is llm-optimizer different?

llm-optimizer provides a structured way to explore the LLM performance landscape. It eliminates repetitive guesswork by enabling systematic benchmarking and automated search across possible configurations.

Core capabilities include:

Running standardized tests across inference frameworks such as vLLM and SGLang.

Applying constraint-driven tuning, e.g., surfacing only configurations where time-to-first-token is below 200ms.

Automating parameter sweeps to identify optimal settings.

Visualizing tradeoffs with dashboards for latency, throughput, and GPU utilization.

The framework is open-source and available on GitHub.
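Conceptually, constraint-driven tuning amounts to benchmarking a grid of configurations and keeping only those that satisfy the stated service-level constraints. The sketch below illustrates the idea in plain Python; it does not use llm-optimizer's actual API, and run_benchmark is a hypothetical stand-in for a real benchmark run:

import random
from itertools import product

def run_benchmark(framework, tensor_parallel, batch_size):
    """Hypothetical stand-in for a real benchmark run; returns dummy metrics here."""
    return {"ttft_ms": random.uniform(100, 400), "throughput_tps": random.uniform(500, 3000)}

search_space = {
    "framework": ["vllm", "sglang"],
    "tensor_parallel": [1, 2, 4],
    "batch_size": [8, 16, 32],
}

candidates = []
for framework, tp, bs in product(*search_space.values()):
    metrics = run_benchmark(framework, tp, bs)
    if metrics["ttft_ms"] <= 200:                        # constraint: time-to-first-token under 200 ms
        candidates.append((metrics["throughput_tps"], framework, tp, bs))

best = max(candidates)                                   # highest throughput among configs meeting the constraint
print(best)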

How can devs explore results without running benchmarks locally?

Alongside the optimizer, BentoML released the LLM Performance Explorer, a browser-based interface powered by llm-optimizer. It provides pre-computed benchmark data for popular open-source models and lets users:

Compare frameworks and configurations side by side.

Filter by latency, throughput, or resource thresholds.

Browse tradeoffs interactively without provisioning hardware.

How does llm-optimizer impact LLM deployment practices?

As the use of LLMs grows, getting the most out of deployments comes down to how well inference parameters are tuned. llm-optimizer lowers the complexity of this process, giving smaller teams access to optimization techniques that once required large-scale infrastructure and deep expertise.

By providing standardized benchmarks and reproducible results, the framework adds much-needed transparency to the LLM space. It makes comparisons across models and frameworks more consistent, closing a long-standing gap in the community.

Ultimately, BentoML’s llm-optimizer brings a constraint-driven, benchmark-focused method to self-hosted LLM optimization, replacing ad-hoc trial and error with a systematic and repeatable workflow.

Check out the GitHub Page.
The post BentoML Released llm-optimizer: An Open-Source AI Tool for Benchmarking and Optimizing LLM Inference appeared first on MarkTechPost.

Deepdub Introduces Lightning 2.5: A Real-Time AI Voice Model With 2.8x Throughput Gains for Scalable AI Agents and Enterprise AI

Deepdub, an Israeli Voice AI startup, has introduced Lightning 2.5, a real-time foundational voice model designed to power scalable, production-grade voice applications. The new release delivers substantial improvements in performance and efficiency, positioning it for use in live interactive systems such as contact centers, AI agents, and real-time dubbing.

Performance and Efficiency

Lightning 2.5 achieves 2.8× higher throughput compared to previous versions, alongside a 5× efficiency gain in terms of computational resource utilization. Delivering latency as low as 200 milliseconds—roughly half a second faster than typical industry benchmarks—Lightning enables true real-time performance across use cases like live conversational AI, on-the-fly voiceovers, and event-driven AI pipelines.

The model is optimized for NVIDIA GPU-accelerated environments, ensuring deployment at scale without compromising quality. By leveraging parallelized inference pipelines, Deepdub has positioned Lightning 2.5 as a high-performance solution for latency-sensitive scenarios.

Real-Time Applications

Lightning 2.5 enters a landscape where voice is core to the user experience. Deployment applications include:

Customer support platforms that require seamless multilingual conversations.

AI agents and virtual assistants delivering natural, real-time interactions.

Media localization through instant dubbing across multiple languages.

Gaming and entertainment voice chat requiring expressive and natural speech output.

In a press release, the Deepdub team emphasized that Lightning maintains voice fidelity, natural prosody, and emotional nuance while scaling across multiple languages, a challenge for most real-time TTS (text-to-speech) systems.

Summary

Lightning 2.5 underscores Deepdub’s push to make real-time, high-quality multilingual voice generation practical at scale. With notable gains in throughput and efficiency, the model positions the company to compete in enterprise voice AI, though its ultimate impact will depend on adoption, integration ease, and how it measures up against rival systems in real-world deployments.
The post Deepdub Introduces Lightning 2.5: A Real-Time AI Voice Model With 2.8x Throughput Gains for Scalable AI Agents and Enterprise AI appeared first on MarkTechPost.

Automate advanced agentic RAG pipeline with Amazon SageMaker AI

Retrieval Augmented Generation (RAG) is a fundamental approach for building advanced generative AI applications that connect large language models (LLMs) to enterprise knowledge. However, crafting a reliable RAG pipeline is rarely a one-shot process. Teams often need to test dozens of configurations (varying chunking strategies, embedding models, retrieval techniques, and prompt designs) before arriving at a solution that works for their use case. Furthermore, managing a high-performing RAG pipeline involves complex deployment; teams often rely on manual pipeline management, leading to inconsistent results, time-consuming troubleshooting, and difficulty in reproducing successful configurations. Teams struggle with scattered documentation of parameter choices, limited visibility into component performance, and the inability to systematically compare different approaches. Additionally, the lack of automation creates bottlenecks in scaling RAG solutions, increases operational overhead, and makes it challenging to maintain quality across multiple deployments and environments from development to production.
In this post, we walk through how to streamline your RAG development lifecycle from experimentation to automation, so you can operationalize your RAG solution for production deployments with Amazon SageMaker AI while your team experiments efficiently, collaborates effectively, and drives continuous improvement. By combining experimentation and automation with SageMaker AI, you can verify that the entire pipeline is versioned, tested, and promoted as a cohesive unit. This approach provides comprehensive guidance for traceability, reproducibility, and risk mitigation as the RAG system advances from development to production, supporting continuous improvement and reliable operation in real-world scenarios.
Solution overview
By streamlining both experimentation and operational workflows, teams can use SageMaker AI to rapidly prototype, deploy, and monitor RAG applications at scale. Its integration with SageMaker managed MLflow provides a unified platform for tracking experiments, logging configurations, and comparing results, supporting reproducibility and robust governance throughout the pipeline lifecycle. Automation also minimizes manual intervention, reduces errors, and streamlines the process of promoting the finalized RAG pipeline from the experimentation phase directly into production. With this approach, every stage from data ingestion to output generation operates efficiently and securely, while making it straightforward to transition validated solutions from development to production deployment.
For automation, Amazon SageMaker Pipelines orchestrates end-to-end RAG workflows from data preparation and vector embedding generation to model inference and evaluation all with repeatable and version-controlled code. Integrating continuous integration and delivery (CI/CD) practices further enhances reproducibility and governance, enabling automated promotion of validated RAG pipelines from development to staging or production environments. Promoting an entire RAG pipeline (not just an individual subsystem of the RAG solution like a chunking layer or orchestration layer) to higher environments is essential because data, configurations, and infrastructure can vary significantly across staging and production. In production, you often work with live, sensitive, or much larger datasets, and the way data is chunked, embedded, retrieved, and generated can impact system performance and output quality in ways that are not always apparent in lower environments. Each stage of the pipeline (chunking, embedding, retrieval, and generation) must be thoroughly evaluated with production-like data for accuracy, relevance, and robustness. Metrics at every stage (such as chunk quality, retrieval relevance, answer correctness, and LLM evaluation scores) must be monitored and validated before the pipeline is trusted to serve real users.
The following diagram illustrates the architecture of a scalable RAG pipeline built on SageMaker AI, with MLflow experiment tracking seamlessly integrated at every stage and the RAG pipeline automated using SageMaker Pipelines. SageMaker managed MLflow provides a unified platform for centralized RAG experiment tracking across all pipeline stages. Every MLflow execution run whether for RAG chunking, ingestion, retrieval, or evaluation sends execution logs, parameters, metrics, and artifacts to SageMaker managed MLflow. The architecture uses SageMaker Pipelines to orchestrate the entire RAG workflow through versioned, repeatable automation. These RAG pipelines manage dependencies between critical stages, from data ingestion and chunking to embedding generation, retrieval, and final text generation, supporting consistent execution across environments. Integrated with CI/CD practices, SageMaker Pipelines enable seamless promotion of validated RAG configurations from development to staging and production environments while maintaining infrastructure as code (IaC) traceability.

For the operational workflow, the solution follows a structured lifecycle: During experimentation, data scientists iterate on pipeline components within Amazon SageMaker Studio notebooks while SageMaker managed MLflow captures parameters, metrics, and artifacts at every stage. Validated workflows are then codified into SageMaker Pipelines and versioned in Git. The automated promotion phase uses CI/CD to trigger pipeline execution in target environments, rigorously validating stage-specific metrics (chunk quality, retrieval relevance, answer correctness) against production data before deployment. The other core components include:

Amazon SageMaker JumpStart for accessing the latest LLM models and hosting them on SageMaker endpoints for model inference with the embedding model huggingface-textembedding-all-MiniLM-L6-v2 and text generation model deepseek-llm-r1-distill-qwen-7b.
Amazon OpenSearch Service as a vector database to store document embeddings with the OpenSearch index configured for k-nearest neighbors (k-NN) search.
The Amazon Bedrock model anthropic.claude-3-haiku-20240307-v1:0 as an LLM-as-a-judge component for all the MLflow LLM evaluation metrics.
A SageMaker Studio notebook for a development environment to experiment and automate the RAG pipelines with SageMaker managed MLflow and SageMaker Pipelines.

You can implement this agentic RAG solution code from the GitHub repository. In the following sections, we use snippets from this code in the repository to illustrate RAG pipeline experiment evolution and automation.
Prerequisites
You must have the following prerequisites:

An AWS account with billing enabled.
A SageMaker AI domain. For more information, see Use quick setup for Amazon SageMaker AI.
Access to a running SageMaker managed MLflow tracking server in SageMaker Studio. For more information, see the instructions for setting up a new MLflow tracking server.
Access to SageMaker JumpStart to host LLM embedding and text generation models.
Access to the Amazon Bedrock foundation models (FMs) for RAG evaluation tasks. For more details, see Subscribe to a model.

SageMaker MLFlow RAG experiment
SageMaker managed MLflow provides a powerful framework for organizing RAG experiments, so teams can manage complex, multi-stage processes with clarity and precision. The following diagram illustrates the RAG experiment stages with SageMaker managed MLflow experiment tracking at every stage. This centralized tracking offers the following benefits:

Reproducibility: Every experiment is fully documented, so teams can replay and compare runs at any time
Collaboration: Shared experiment tracking fosters knowledge sharing and accelerates troubleshooting
Actionable insights: Visual dashboards and comparative analytics help teams identify the impact of pipeline changes and drive continuous improvement

The following diagram illustrates the solution workflow.

Each RAG experiment in MLflow is structured as a top-level run under a specific experiment name. Within this top-level run, nested runs are created for each major pipeline stage, such as data preparation, data chunking, data ingestion, RAG retrieval, and RAG evaluation. This hierarchical approach allows for granular tracking of parameters, metrics, and artifacts at every step, while maintaining a clear lineage from raw data to final evaluation results.
The following screenshot shows an example of the experiment details in MLflow.

The various RAG pipeline steps defined are:

Data preparation: Logs dataset version, preprocessing steps, and initial statistics
Data chunking: Records chunking strategy, chunk size, overlap, and resulting chunk counts
Data ingestion: Tracks embedding model, vector database details, and document ingestion metrics
RAG retrieval: Captures retrieval model, context size, and retrieval performance metrics
RAG evaluation: Logs evaluation metrics (such as answer similarity, correctness, and relevance) and sample results

This visualization provides a clear, end-to-end view of the RAG pipeline’s execution, so you can trace the impact of changes at any stage and achieve full reproducibility. The architecture supports scaling to multiple experiments, each representing a distinct configuration or hypothesis (for example, different chunking strategies, embedding models, or retrieval parameters). MLflow’s experiment UI displays these experiments together, enabling side-by-side comparison and analysis across runs. This structure is especially valuable in enterprise settings, where dozens or even hundreds of experiments might be conducted to optimize RAG performance.
We use MLflow experimentation throughout the RAG pipeline to log metrics and parameters, and the different experiment runs are initialized as shown in the following code snippet:

with mlflow.start_run() as run:
    main_run_id = run.info.run_id
    print("mlflow_run", main_run_id)
    with mlflow.start_run(run_name="DataPreparation", nested=True):
        ...  # data preparation steps are logged inside this nested run
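Inside each nested run, stage-specific parameters and metrics are logged so they appear under the corresponding pipeline stage in the MLflow UI. The following is a minimal sketch for the chunking stage; the values shown are illustrative, not results from the repository:

with mlflow.start_run(run_id=main_run_id):
    with mlflow.start_run(run_name="DataChunking", nested=True):
        mlflow.log_params({
            "chunking_strategy_type": "FixedSizeChunker",    # illustrative values
            "chunking_strategy_chunk_size": 500,
            "chunking_strategy_chunk_overlap": 50,
        })
        mlflow.log_metrics({
            "total_contexts_chunked": 1200,
            "total_unique_chunks_final": 1150,
        })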

RAG pipeline experimentation
The key components of the RAG workflow are ingestion, chunking, retrieval, and evaluation, which we explain in this section. The MLflow dashboard makes it straightforward to visualize and analyze these parameters and metrics, supporting data-driven refinement of the chunking stage within the RAG pipeline.

Data ingestion and preparation
In the RAG workflow, rigorous data preparation is foundational to downstream performance and reliability. Tracking detailed metrics on data quality, such as the total number of question-answer pairs, the count of unique questions, average context length, and initial evaluation predictions, provides essential visibility into the dataset’s structure and suitability for RAG tasks. These metrics help validate the dataset is comprehensive, diverse, and contextually rich, which directly impacts the relevance and accuracy of the RAG system’s responses. Additionally, logging critical RAG parameters like the data source, detected personally identifiable information (PII) types, and data lineage information is vital for maintaining compliance, reproducibility, and trust in enterprise environments. Capturing this metadata in SageMaker managed MLflow supports robust experiment tracking, auditability, efficient comparison, and root cause analysis across multiple data preparation runs, as visualized in the MLflow dashboard. This disciplined approach to data preparation lays the groundwork for effective experimentation, governance, and continuous improvement throughout the RAG pipeline. The following screenshot shows an example of the experiment run details in MLflow.

Data chunking
After data preparation, the next step is to split documents into manageable chunks for efficient embedding and retrieval. This process is pivotal, because the quality and granularity of chunks directly affect the relevance and completeness of answers returned by the RAG system. The RAG workflow in this post supports experimentation and pipeline automation with both fixed-size and recursive chunking strategies for comparison and validation (a minimal sketch of the fixed-size variant follows the list below). However, this RAG solution can be extended with many other chunking techniques.

FixedSizeChunker divides text into uniform chunks with configurable overlap
RecursiveChunker splits text along logical boundaries such as paragraphs or sentences
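For reference, here is a minimal sketch of what a fixed-size chunker with configurable overlap might look like; the actual FixedSizeChunker in the repository may differ:

from typing import List

def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks with overlap between consecutive chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks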

Tracking detailed chunking metrics such as total_source_contexts_entries, total_contexts_chunked, and total_unique_chunks_final is crucial for understanding how much of the source data is represented, how effectively it is segmented, and whether the chunking approach is yielding the desired coverage and uniqueness. These metrics help diagnose issues like excessive duplication or under-segmentation, which can impact retrieval accuracy and model performance.
Additionally, logging parameters such as chunking_strategy_type (for example, FixedSizeChunker), chunking_strategy_chunk_size (for example, 500 characters), and chunking_strategy_chunk_overlap provide transparency and reproducibility for each experiment. Capturing these details in SageMaker managed MLflow helps teams systematically compare the impact of different chunking configurations, optimize for efficiency and contextual relevance, and maintain a clear audit trail of how chunking decisions evolve over time. The MLflow dashboard makes it straightforward to visualize and analyze these parameters and metrics, supporting data-driven refinement of the chunking stage within the RAG pipeline. The following screenshot shows an example of the experiment run details in MLflow.

After the documents are chunked, the next step is to convert these chunks into vector embeddings using a SageMaker embedding endpoint, after which the embeddings are ingested into a vector database such as OpenSearch Service for fast semantic search. This ingestion phase is crucial because the quality, completeness, and traceability of what enters the vector store directly determine the effectiveness and reliability of downstream retrieval and generation stages.
Tracking ingestion metrics such as the number of documents and chunks ingested provides visibility into pipeline throughput and helps identify bottlenecks or data loss early in the process. Logging detailed parameters, including the embedding model ID, endpoint used, and vector database index, is essential for reproducibility and auditability. This metadata helps teams trace exactly which model and infrastructure were used for each ingestion run, supporting root cause analysis and compliance, especially when working with evolving datasets or sensitive information.
Retrieval and generation
For a given query, we generate an embedding and retrieve the top-k relevant chunks from OpenSearch Service. For answer generation, we use a SageMaker LLM endpoint. The retrieved context and the query are combined into a prompt, and the LLM generates an answer. Finally, we orchestrate retrieval and generation using LangGraph, enabling stateful workflows and advanced tracing:

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_with_context = graph_builder.compile()
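The retrieve and generate nodes referenced above are ordinary Python functions that read from and write to the graph state. A minimal sketch of what they might look like follows; embed_text, knn_search, build_prompt, and invoke_text_endpoint are hypothetical helpers wrapping the SageMaker embedding endpoint, the OpenSearch k-NN index, and the SageMaker text generation endpoint, and are not the repository's exact code:

from typing import List
from typing_extensions import TypedDict

class State(TypedDict):
    question: str
    context: List[str]
    answer: str

def retrieve(state: State) -> dict:
    query_vector = embed_text(state["question"])        # hypothetical: call the SageMaker embedding endpoint
    hits = knn_search(query_vector, k=5)                # hypothetical: k-NN query against the OpenSearch index
    return {"context": [hit["text"] for hit in hits]}

def generate(state: State) -> dict:
    prompt = build_prompt(state["question"], state["context"])  # combine the query with retrieved chunks
    return {"answer": invoke_text_endpoint(prompt)}             # hypothetical: call the SageMaker LLM endpoint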

With the generative AI agent defined with the LangGraph framework, the agentic layers are evaluated for each iteration of RAG development, verifying the efficacy of the RAG solution for agentic applications. Each retrieval and generation run is logged to SageMaker managed MLflow, capturing the prompt, generated response, and key metrics and parameters such as retrieval performance, top-k values, and the specific model endpoints used. Tracking these details in MLflow is essential for evaluating the effectiveness of the retrieval stage, making sure the returned documents are relevant and that the generated answers are accurate and complete. It is equally important to track the performance of the vector database during retrieval, including metrics like query latency, throughput, and scalability. Monitoring these system-level metrics alongside retrieval relevance and accuracy makes sure the RAG pipeline delivers correct and relevant answers and meets production requirements for responsiveness and scalability. The following screenshot shows an example of the LangGraph RAG retrieval tracing in MLflow.

RAG Evaluation
Evaluation is conducted on a curated test set, and results are logged to MLflow for quick comparison and analysis. This helps teams identify the best-performing configurations and iterate toward production-grade solutions. With MLflow, you can evaluate the RAG solution with heuristic metrics, content similarity metrics, and LLM-as-a-judge metrics. In this post, we evaluate the RAG pipeline using advanced LLM-as-a-judge MLflow metrics (answer similarity, correctness, relevance, faithfulness):

metrics_genai_only = [answer_correctness_aws, answer_similarity_aws, answer_relevance_aws, answer_faithfulness_aws]
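Metric objects like these can be created with MLflow's built-in LLM-as-a-judge metric factories, pointed at the judge model. The exact judge-model URI format depends on your MLflow version and how the Amazon Bedrock judge is exposed, so the value below is an assumption rather than the repository's exact code:

from mlflow.metrics.genai import answer_correctness, answer_similarity, answer_relevance, faithfulness

judge_model = "bedrock:/anthropic.claude-3-haiku-20240307-v1:0"   # assumed URI format for the Bedrock judge

answer_correctness_aws = answer_correctness(model=judge_model)
answer_similarity_aws = answer_similarity(model=judge_model)
answer_relevance_aws = answer_relevance(model=judge_model)
answer_faithfulness_aws = faithfulness(model=judge_model)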

The following screenshot shows the details of a RAG evaluation stage experiment run in MLflow.

You can use MLflow to log all metrics and parameters, enabling quick comparison of different experiment runs. See the following code for reference:

with mlflow.start_run(run_id=main_run_id) as run:
    with mlflow.start_run(run_name="RAGEvaluation", nested=True):
        results = mlflow.evaluate(
            ...,                    # Other parameters
            extra_metrics=metrics_genai_only,
            evaluator_config={
                ...                 # Config parameters
            }
        )

By using MLflow’s evaluation capabilities (such as mlflow.evaluate()), teams can systematically assess retrieval quality, identify potential gaps or misalignments in chunking or embedding strategies, and compare the performance of different retrieval and generation configurations. MLflow’s flexibility allows for seamless integration with external evaluation libraries such as RAGAS for comprehensive RAG pipeline assessment. RAGAS is an open source library that provides tools specifically for the evaluation of LLM applications and generative AI agents. RAGAS includes the method ragas.evaluate() to run evaluations for LLM agents with a choice of LLM models (evaluators) for scoring the evaluation, and an extensive list of default metrics. To incorporate RAGAS metrics into your MLflow experiments, refer to the following GitHub repository.
Comparing experiments
In the MLflow UI, you can compare runs side by side. For example, comparing FixedSizeChunker and RecursiveChunker as shown in the following screenshot reveals differences in metrics such as answer_similarity (a difference of 1 point), providing actionable insights for pipeline optimization.

Automation with Amazon SageMaker pipelines
After systematically experimenting with and optimizing each component of the RAG workflow through SageMaker managed MLflow, the next step is transforming these validated configurations into production-ready automated pipelines. Although MLflow experiments help identify the optimal combination of chunking strategies, embedding models, and retrieval parameters, manually reproducing these configurations across environments can be error-prone and inefficient.
To produce the automated RAG pipeline, we use SageMaker Pipelines, which helps teams codify their experimentally validated RAG workflows into automated, repeatable pipelines that maintain consistency from development through production. By converting the successful MLflow experiments into pipeline definitions, teams can make sure the exact same chunking, embedding, retrieval, and evaluation steps that performed well in testing are reliably reproduced in production environments.
SageMaker Pipelines offers serverless workflow orchestration for converting experimental notebook code into a production-grade pipeline, versioning and tracking pipeline configurations alongside MLflow experiments, and automating the end-to-end RAG workflow. The automated SageMaker pipeline-based RAG workflow offers dependency management, comprehensive custom testing and validation before production deployment, and CI/CD integration for automated pipeline promotion.
With SageMaker Pipelines, you can automate your entire RAG workflow, from data preparation to evaluation, as reusable, parameterized pipeline definitions. This provides the following benefits:

Reproducibility – Pipeline definitions capture all dependencies, configurations, and executions logic in version-controlled code
Parameterization – Key RAG parameters (chunk sizes, model endpoints, retrieval settings) can be quickly modified between runs
Monitoring – Pipeline executions provide detailed logs and metrics for each step
Governance – Built-in lineage tracking supports full auditability of data and model artifacts
Customization – Serverless workflow orchestration is customizable to your unique enterprise landscape, with scalable infrastructure and flexibility with instances optimized for CPU, GPU, or memory-intensive tasks, memory configuration, and concurrency optimization

To implement a RAG workflow in SageMaker pipelines, each major component of the RAG process (data preparation, chunking, ingestion, retrieval and generation, and evaluation) is included in a SageMaker processing job. These jobs are then orchestrated as steps within a pipeline, with data flowing between them, as shown in the following screenshot. This structure allows for modular development, quick debugging, and the ability to reuse components across different pipeline configurations.

The key RAG configurations are exposed as pipeline parameters, enabling flexible experimentation with minimal code changes. For example, the following code snippets showcase the modifiable parameters for RAG configurations, which can be used as pipeline configurations:

processor = PyTorchProcessor(
    ...,   # framework version, role, instance type, session, and other typical processor settings
    arguments=[
        "--experiment-name", experiment_name,
        "--mlflow-tracking-uri", mlflow_tracking_uri,
        "--embedding-endpoint-name", embedding_endpoint_name,
        "--text-endpoint-name", text_endpoint_name,
        "--domain-name", domain_name,
        "--index-name", index_name,
        "--chunking-strategy", chunking_strategy,
        "--chunk-size", chunk_size,
        "--chunk-overlap", chunk_overlap,
        "--context-retrieval-size", context_retrieval_size,
        "--embedding-model-id", embedding_model_id,
        "--text-model-id", text_model_id,
        "--output-data-path", "/opt/ml/processing/output",
        "--role-arn", role
    ],
)
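A condensed sketch of how such a processing step can be wired into a parameterized pipeline follows; it assumes the processor above was created with a PipelineSession, and the pipeline name, parameter names, and entry script are illustrative:

from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

chunking_strategy = ParameterString(name="ChunkingStrategy", default_value="FixedSizeChunker")
chunk_size = ParameterString(name="ChunkSize", default_value="500")

rag_step = ProcessingStep(
    name="RAGPipelineStep",
    step_args=processor.run(code="rag_pipeline.py"),   # hypothetical entry script
)

pipeline = Pipeline(
    name="agentic-rag-pipeline",                        # illustrative pipeline name
    parameters=[chunking_strategy, chunk_size],
    steps=[rag_step],
)
pipeline.upsert(role_arn=role)                          # create or update the pipeline definition
execution = pipeline.start()                            # start a run with the default parameters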

In this post, we provide two agentic RAG pipeline automation approaches to building the SageMaker pipeline, each with its own benefits: single-step SageMaker pipelines and multi-step pipelines.
The single-step pipeline approach is designed for simplicity, running the entire RAG workflow as one unified process. This setup is ideal for straightforward or less complex use cases, because it minimizes pipeline management overhead. With fewer steps, the pipeline can start quickly, benefitting from reduced execution times and streamlined development. This makes it a practical option when rapid iteration and ease of use are the primary concerns.
The multi-step pipeline approach is preferred for enterprise scenarios where flexibility and modularity are essential. By breaking down the RAG process into distinct, manageable stages, organizations gain the ability to customize, swap, or extend individual components as needs evolve. This design enables plug-and-play adaptability, making it straightforward to reuse or reconfigure pipeline steps for various workflows. Additionally, the multi-step format allows for granular monitoring and troubleshooting at each stage, providing detailed insights into performance and facilitating robust enterprise management. For enterprises seeking maximum flexibility and the ability to tailor automation to unique requirements, the multi-step pipeline approach is the superior choice.
CI/CD for an agentic RAG pipeline
Now we integrate the SageMaker RAG pipeline with CI/CD. CI/CD is important for making a RAG solution enterprise-ready because it provides faster, more reliable, and scalable delivery of AI-powered workflows. Specifically for enterprises, CI/CD pipelines automate the integration, testing, deployment, and monitoring of changes in the RAG system, which brings several key benefits, such as faster and more reliable updates, version control and traceability, consistency across environments, modularity and flexibility for customization, enhanced collaboration and monitoring, risk mitigation, and cost savings. This aligns with general CI/CD benefits in software and AI systems, emphasizing automation, quality assurance, collaboration, and continuous feedback essential to enterprise AI readiness.
When your SageMaker RAG pipeline definition is in place, you can implement robust CI/CD practices by integrating the development workflows and toolsets already enabled at your enterprise. This setup makes it possible to automate code promotion, pipeline deployment, and model experimentation through simple Git triggers, so changes are versioned, tested, and systematically promoted across environments. For demonstration, in this post we show the CI/CD integration using GitHub Actions as the CI/CD orchestrator. Each code change, such as refining chunking strategies or updating pipeline steps, triggers an end-to-end automation workflow, as shown in the following screenshot. You can use the same CI/CD pattern with your choice of CI/CD tool instead of GitHub Actions, if needed.
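At its core, the CI workflow only needs a small script step that starts the SageMaker pipeline with the desired parameter overrides, for example via Boto3; the pipeline and parameter names below are illustrative:

import boto3

sm_client = boto3.client("sagemaker")

response = sm_client.start_pipeline_execution(
    PipelineName="agentic-rag-pipeline",                 # illustrative pipeline name
    PipelineParameters=[
        {"Name": "ChunkingStrategy", "Value": "RecursiveChunker"},
        {"Name": "ChunkSize", "Value": "500"},
    ],
)
print("Started:", response["PipelineExecutionArn"])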

Each GitHub Actions CI/CD execution automatically triggers the SageMaker pipeline (shown in the following screenshot), allowing for seamless scaling of serverless compute infrastructure.

Throughout this cycle, SageMaker managed MLflow records every executed pipeline (shown in the following screenshot), so you can seamlessly review results, compare performance across different pipeline runs, and manage the RAG lifecycle.

After an optimal RAG pipeline configuration is determined, the new desired configuration (Git version tracking captured in MLflow as shown in the following screenshot) can be promoted to higher stages or environments directly through an automated workflow, minimizing manual intervention and reducing risk.

Clean up
To avoid unnecessary costs, delete resources such as the SageMaker managed MLflow tracking server, SageMaker pipelines, and SageMaker endpoints when your RAG experimentation is complete. You can visit the SageMaker Studio console to destroy resources that aren’t needed anymore or call the appropriate AWS API actions.
Conclusion
By integrating SageMaker AI, SageMaker managed MLflow, and Amazon OpenSearch Service, you can build, evaluate, and deploy RAG pipelines at scale. This approach provides the following benefits:

Automated and reproducible workflows with SageMaker Pipelines and MLflow, minimizing manual steps and reducing the risk of human error
Advanced experiment tracking and comparison for different chunking strategies, embedding models, and LLMs, so every configuration is logged, analyzed, and reproducible
Actionable insights from both traditional and LLM-based evaluation metrics, helping teams make data-driven improvements at every stage
Seamless deployment to production environments, with automated promotion of validated pipelines and robust governance throughout the workflow

Automating your RAG pipeline with SageMaker Pipelines brings additional benefits: it enables consistent, version-controlled deployments across environments, supports collaboration through modular, parameterized workflows, and supports full traceability and auditability of data, models, and results. With built-in CI/CD capabilities, you can confidently promote your entire RAG solution from experimentation to production, knowing that each stage meets quality and compliance standards.
Now it’s your turn to operationalize RAG workflows and accelerate your AI initiatives. Explore SageMaker Pipelines and managed MLflow using the solution from the GitHub repository to unlock scalable, automated, and enterprise-grade RAG solutions.

About the authors
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, generative AI applications like agents, and scaling generative AI use cases. He also focuses on Go-To-Market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.
Blake Shin is an Associate Specialist Solutions Architect at AWS who enjoys learning about and working with new AI/ML technologies. In his free time, Blake enjoys exploring the city and playing music.

Unlock model insights with log probability support for Amazon Bedrock …

You can use Amazon Bedrock Custom Model Import to seamlessly integrate your customized models—such as Llama, Mistral, and Qwen—that you have fine-tuned elsewhere into Amazon Bedrock. The experience is completely serverless, minimizing infrastructure management while providing your imported models with the same unified API access as native Amazon Bedrock models. Your custom models benefit from automatic scaling, enterprise-grade security, and native integration with Amazon Bedrock features such as Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases.
Understanding how confident a model is in its predictions is essential for building reliable AI applications, particularly when working with specialized custom models that might encounter domain-specific queries.
With log probability support now added to Custom Model Import, you can access information about your models’ confidence in their predictions at the token level. This enhancement provides greater visibility into model behavior and enables new capabilities for model evaluation, confidence scoring, and advanced filtering techniques.
In this post, we explore how log probabilities work with imported models in Amazon Bedrock. You will learn what log probabilities are, how to enable them in your API calls, and how to interpret the returned data. We also highlight practical applications—from detecting potential hallucinations to optimizing RAG systems and evaluating fine-tuned models—that demonstrate how these insights can improve your AI applications, helping you build more trustworthy solutions with your custom models.
Understanding log probabilities
In language models, a log probability represents the logarithm of the probability that the model assigns to a token in a sequence. These values indicate how confident the model is about each token it generates or processes. Log probabilities are expressed as negative numbers, with values closer to zero indicating higher confidence. For example, a log probability of -0.1 corresponds to approximately 90% confidence, while a value of -3.0 corresponds to about 5% confidence. By examining these values, you can identify when a model is highly certain versus when it’s making less confident predictions. Log probabilities provide a quantitative measure of how likely the model considered each generated token, offering valuable insight into the confidence of its output. By analyzing them, you can:

Gauge confidence across a response: Assess how confident the model was in different sections of its output, helping you identify where it was certain versus uncertain.
Score and compare outputs: Compare overall sequence likelihood (by adding or averaging log probabilities) to rank or filter multiple model outputs.
Detect potential hallucinations: Identify sudden drops in token-level confidence, which can flag segments that might require verification or review.
Reduce RAG costs with early pruning: Run short, low-cost draft generations based on retrieved contexts, compute log probabilities for those drafts, and discard low-scoring candidates early, avoiding unnecessary full-length generations or expensive reranking while keeping only the most promising contexts in the pipeline.
Build confidence-aware applications: Adapt system behavior based on certainty levels—for example, trigger clarifying prompts, provide fallback responses, or flag outputs for human review.

Overall, log probabilities are a powerful tool for interpreting and debugging model responses with measurable certainty—particularly valuable for applications where understanding why a model responded in a certain way can be as important as the response itself.
Prerequisites
To use log probability support with custom model import in Amazon Bedrock, you need:

An active AWS account with access to Amazon Bedrock
A custom model created in Amazon Bedrock using the Custom Model Import feature after July 31, 2025, when the log probabilities support was released
Appropriate AWS Identity and Access Management (IAM) permissions to invoke models through the Amazon Bedrock Runtime

Introducing log probabilities support in Amazon Bedrock
With this release, Amazon Bedrock now allows models imported using the Custom Model Import feature to return token-level log probabilities as part of the inference response.
When invoking a model through Amazon Bedrock InvokeModel API, you can access token log probabilities by setting “return_logprobs”: true in the JSON request body. With this flag enabled, the model’s response will include additional fields providing log probabilities for both the prompt tokens and the generated tokens, so that customers can analyze the model’s confidence in its predictions. These log probabilities let you quantitatively assess how confident your custom models are when processing inputs and generating responses. The granular metrics allow for better evaluation of response quality, troubleshooting of unexpected outputs, and optimization of prompts or model configurations.
Let’s walk through an example of invoking a custom model on Amazon Bedrock with log probabilities enabled and examine the output format. Suppose you have already imported a custom model (for instance, a fine-tuned Llama 3.2 1B model) into Amazon Bedrock and have its model Amazon Resource Name (ARN). You can invoke this model using the Amazon Bedrock Runtime SDK (Boto3 for Python in this example) as shown in the following example:

import boto3, json

bedrock_runtime = boto3.client('bedrock-runtime')
model_arn = "arn:aws:bedrock:<<aws-region>>:<<account-id>>:imported-model/your-model-id"

# Define the request payload with log probabilities enabled
request_payload = {
    "prompt": "The quick brown fox jumps",
    "max_gen_len": 50,
    "temperature": 0.5,
    "stop": [".", "\n"],
    "return_logprobs": True   # Request log probabilities
}

response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps(request_payload),
    contentType="application/json",
    accept="application/json"
)

# Parse the JSON response
result = json.loads(response["body"].read())
print(json.dumps(result, indent=2))

In the preceding code, we send a prompt—”The quick brown fox jumps”—to our custom imported model. We configure standard inference parameters: a maximum generation length of 50 tokens, a moderate temperature of 0.5 for moderate randomness, and a stop condition (either a period or a newline). The “return_logprobs”:True parameter tells Amazon Bedrock to return log probabilities in the response.
The InvokeModel API returns a JSON response containing three main components: the standard generated text output, metadata about the generation process, and now log probabilities for both prompt and generated tokens. These values reveal the model’s internal confidence for each token prediction, so you can understand not just what text was produced, but how certain the model was at each step of the process. The following is an example response from the “quick brown fox jumps” prompt, showing log probabilities (appearing as negative numbers):

{
  'prompt_logprobs': [
    None,
    {'791': -3.6223082542419434, '14924': -1.184808373451233},
    {'4062': -9.256651878356934, '220': -3.6941518783569336},
    {'14198': -4.840845108032227, '323': -1.7158453464508057},
    {'39935': -0.049946799874305725},
    {'35308': -0.2087990790605545}
  ],
  'generation': ' over the lazy dog',
  'prompt_token_count': 6,
  'generation_token_count': 5,
  'stop_reason': 'stop',
  'logprobs': [
    {'927': -0.04093993827700615},
    {'279': -0.0728893131017685},
    {'16053': -0.02005653828382492},
    {'5679': -0.03769925609230995},
    {'627': -1.194122076034546}
  ]
}

The raw API response provides token IDs paired with their log probabilities. To make this data interpretable, we need to first decode the token IDs using the appropriate tokenizer (in this case, the Llama 3.2 1B tokenizer), which maps each ID back to its actual text token. Then we convert log probabilities to probabilities by applying the exponential function, translating these values into more intuitive probabilities between 0 and 1. We have implemented these transformations using custom code (not shown here) to produce a human-readable format where each token appears alongside its probability, making the model’s confidence in its predictions immediately clear.
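One possible way to perform that transformation is sketched below; it assumes access to the Llama 3.2 1B tokenizer on the Hugging Face Hub and is not necessarily the exact code used for this post:

import math
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")   # assumed tokenizer checkpoint; may require Hub access

def to_readable(logprob_entries):
    """Decode token IDs and convert log probabilities to probabilities."""
    readable = []
    for entry in logprob_entries:
        if entry is None:                      # the first prompt token has no preceding context
            readable.append(None)
            continue
        readable.append({
            token_id: f"'{tokenizer.decode([int(token_id)])}' (p={math.exp(logprob):.4f})"
            for token_id, logprob in entry.items()
        })
    return readable

print(to_readable(result["prompt_logprobs"]))
print(to_readable(result["logprobs"]))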

{'prompt_logprobs': [None,
   {'791': "'The' (p=0.0267)", '14924': "'Question' (p=0.3058)"},
   {'4062': "' quick' (p=0.0001)", '220': "' ' (p=0.0249)"},
   {'14198': "' brown' (p=0.0079)", '323': "' and' (p=0.1798)"},
   {'39935': "' fox' (p=0.9513)"},
   {'35308': "' jumps' (p=0.8116)"}],
 'generation': ' over the lazy dog',
 'prompt_token_count': 6,
 'generation_token_count': 5,
 'stop_reason': 'stop',
 'logprobs': [{'927': "' over' (p=0.9599)"},
   {'279': "' the' (p=0.9297)"},
   {'16053': "' lazy' (p=0.9801)"},
   {'5679': "' dog' (p=0.9630)"},
   {'627': "'.\n' (p=0.3030)"}]}

Let’s break down what this tells us about the model’s internal processing:

generation: This is the actual text generated by the model (in our example, it’s a continuation of the prompt that we sent to the model). This is the same field you would get normally from any model invocation.
prompt_token_count and generation_token_count: These indicate the number of tokens in the input prompt and in the output, respectively. In our example, the prompt was tokenized into six tokens, and the model generated five tokens in its completion.
stop_reason: The reason the generation stopped (“stop” means the model naturally stopped at a stop sequence or end-of-text, “length” means it hit the max token limit, and so on). In our case it shows “stop”, indicating the model stopped on its own or because of the stop condition we provided.
prompt_logprobs: This array provides log probabilities for each token in the prompt. As the model processes your input, it continuously predicts what should come next based on what it has seen so far. These values measure which tokens in your prompt were expected or surprising to the model.

The first entry is None because the very first token has no preceding context. The model cannot predict anything without prior information. Each subsequent entry contains token IDs mapped to their log probabilities. We have converted these IDs to readable text and transformed the log probabilities into percentages for easier understanding.
You can observe the model’s increasing confidence as it processes familiar sequences. For example, after seeing The quick brown, the model predicted fox with 95.1% confidence. After seeing the full context up to fox, it predicted jumps with 81.1% confidence.
Many positions show multiple tokens with their probabilities, revealing alternatives the model considered. For instance, at the second position, the model evaluated both The (2.7%) and Question (30.6%), which means the model considered both tokens viable at that position. This added visibility helps you understand where the model weighted alternatives and can reveal when it was more uncertain or had difficulty choosing from multiple options.
Notably low probabilities appear for some tokens—quick received just 0.01%—indicating the model found these words unexpected in their context.
The overall pattern tells a clear story: individual words initially received low probabilities, but as the complete quick brown fox jumps phrase emerged, the model’s confidence increased dramatically, showing it recognized this as a familiar expression.
When multiple tokens in your prompt consistently receive low probabilities, your phrasing might be unusual for the model. This uncertainty can affect the quality of completions. Using these insights, you can reformulate prompts to better align with patterns the model encountered in its training data.

logprobs: This array contains log probabilities for each token in the model’s generated output. The format is similar: a dictionary mapping token IDs to their corresponding log probabilities.

After decoding these values, we can see that the tokens over, the, lazy, and dog all have high probabilities. This demonstrates the model recognized it was completing the well-known phrase the quick brown fox jumps over the lazy dog—a common pangram that the model appears to have strong familiarity with.
In contrast, the final period (newline) token has a much lower probability (30.3%), revealing the model’s uncertainty about how to conclude the sentence. This makes sense because the model had multiple valid options: ending the sentence with a period, continuing with additional content, or choosing another punctuation mark altogether.

Practical use cases of log probabilities
Token-level log probabilities from the Custom Model Import feature provide valuable insights into your model’s decision-making process. These metrics transform how you interact with your custom models by revealing their confidence levels for each generated token. Here are impactful ways to use these insights:
Ranking multiple completions
You can use log probabilities to quantitatively rank multiple generated outputs for the same prompt. When your application needs to choose between different possible completions—whether for summarization, translation, or creative writing—you can calculate each completion’s overall likelihood by averaging or adding the log probabilities across all its tokens.
Example:
Prompt: Translate the phrase “Battre le fer pendant qu’il est chaud”

Completion A: “Strike while the iron is hot” (Average log probability: -0.39)
Completion B: “Beat the iron while it is hot.” (Average log probability: -0.46)

In this example, Completion A receives a higher log probability score (closer to zero), indicating the model found this idiomatic translation more natural than the more literal Completion B. This numerical approach enables your application to automatically select the most probable output or present multiple candidates ranked by the model’s confidence level.
This ranking capability extends beyond translation to many scenarios where multiple valid outputs exist—including content generation, code completion, and creative writing—providing an objective quality metric based on the model’s confidence rather than relying solely on subjective human judgment.
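The following helper is a minimal sketch of this ranking, assuming each candidate completion is paired with the logprobs list from the response format shown earlier:

def mean_logprob(logprobs):
    """Average the log probabilities across all generated tokens."""
    values = [lp for entry in logprobs for lp in entry.values()]
    return sum(values) / len(values) if values else float("-inf")

def rank_completions(candidates):
    """candidates: list of (completion_text, logprobs) tuples.
    Returns them sorted from most to least confident (closer to zero is better)."""
    return sorted(candidates, key=lambda c: mean_logprob(c[1]), reverse=True)

# Hypothetical usage with the two translation candidates above:
# ranked = rank_completions([
#     ("Strike while the iron is hot", logprobs_a),
#     ("Beat the iron while it is hot.", logprobs_b),
# ])
# best_completion = ranked[0][0]
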
Detecting hallucinations and low-confidence answers
Models might produce hallucinations—plausible-sounding but factually incorrect statements—when handling ambiguous prompts, complex queries, or topics outside their expertise. Log probabilities provide a practical way to detect these instances by revealing the model’s internal uncertainty, helping you identify potentially inaccurate information even when the output appears confident.
By analyzing token-level log probabilities, you can identify which parts of a response the model was potentially uncertain about, even when the text appears confident on the surface. This capability is especially valuable in retrieval-augmented generation (RAG) systems, where responses should be grounded in retrieved context. When a model has relevant information available, it typically generates answers with higher confidence. Conversely, low confidence across multiple tokens suggests the model might be generating content without sufficient supporting information.
Example:

Prompt:

“Explain how the Portfolio Synergy Quotient (PSQ) is applied in multi-asset investment
strategies?”

Model output:

“The PSQ is a measure of the diversification benefits of combining different asset
classes in a portfolio.”

In this example, we intentionally asked about a fictional metric—Portfolio Synergy Quotient (PSQ)—to demonstrate how log probabilities reveal uncertainty in model responses. Despite producing a professional-sounding definition for this non-existent financial concept, the token-level confidence scores tell a revealing story. The confidence scores shown below are derived by applying the exponential function to the log probabilities returned by the model.

PSQ shows medium confidence (63.8%), indicating that the model recognized the acronym format but wasn’t highly certain about this specific term.
Common finance terminology like classes (98.2%) and portfolio (92.8%) exhibit high confidence, likely because these are standard concepts widely used in financial contexts.
Critical connecting concepts show notably low confidence: measure (14.0%) and diversification (31.8%) reveal the model’s uncertainty when it attempts to explain what PSQ means or does.
Functional words like is (45.9%) and of (56.6%) hover in the medium confidence levels, suggesting uncertainty about the overall structure of the explanation.

By identifying these low-confidence segments, you can implement targeted safeguards in your applications—such as flagging content for verification, retrieving additional context, generating clarifying questions, or applying confidence thresholds for sensitive information. This approach helps create more reliable AI systems that can distinguish between high-confidence knowledge and uncertain responses.
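The following is a minimal sketch of one such safeguard, assuming the generation logprobs have the format shown earlier; the 0.35 threshold is an arbitrary example value you would tune for your application:

import math

def flag_low_confidence(logprobs, threshold=0.35):
    """Return (token_id, probability) pairs whose probability falls below the threshold.
    logprobs is the list of {token_id: logprob} dicts from the generation output."""
    flagged = []
    for entry in logprobs:
        for token_id, logprob in entry.items():
            probability = math.exp(logprob)
            if probability < threshold:
                flagged.append((token_id, probability))
    return flagged

# If many tokens are flagged, the application can route the answer for
# human review or trigger additional retrieval before responding.
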
Monitoring prompt quality
When engineering prompts for your application, log probabilities reveal how well the model understands your instructions. If the first few generated tokens show unusually low probabilities, it often signals that the model struggled to interpret what you are asking.
By tracking the average log probability of the initial tokens—typically the first 5–10 generated tokens—you can quantitatively measure prompt clarity. Well-structured prompts with clear context typically produce higher probabilities because the model immediately knows what to do. Vague or underspecified prompts often yield lower initial token likelihoods as the model hesitates or searches for direction.
Example:
Prompt comparison for customer service responses:

Basic prompt:

“Write a response to this customer complaint: I ordered a laptop 2 weeks ago and it
still hasn’t arrived.”

Average log probability of first five tokens: -1.215 (lower confidence)

Optimized prompt:

“You are a senior customer service manager with expertise in conflict resolution and
customer retention. You work for a reputable electronics retailer that values
customer satisfaction above all else. Your task is to respond to the following
customer complaint with professionalism and empathy.
Customer Complaint: I ordered a laptop 2 weeks ago and it still hasn’t arrived.”

Average log probability of first five tokens: -0.333 (higher confidence)

The optimized prompt generates higher log probabilities, demonstrating that precise instructions and clear context reduce the model’s uncertainty. Rather than making absolute judgments about prompt quality, this approach lets you measure relative improvement between versions. You can directly observe how specific elements—role definitions, contextual details, and explicit expectations—increase model confidence. By systematically measuring these confidence scores across different prompt iterations, you build a quantitative framework for prompt engineering that reveals exactly when and how your instructions become unclear to the model, enabling continuous data-driven refinement.
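The following is a minimal sketch of this measurement, again assuming the logprobs format shown earlier, so that two prompt variants can be compared on the average log probability of their first few generated tokens:

def initial_token_confidence(logprobs, first_n=5):
    """Average log probability of the first N generated tokens."""
    values = [lp for entry in logprobs[:first_n] for lp in entry.values()]
    return sum(values) / len(values) if values else float("-inf")

# Hypothetical comparison of two prompt variants:
# basic_score = initial_token_confidence(basic_logprobs)
# optimized_score = initial_token_confidence(optimized_logprobs)
# if optimized_score > basic_score:
#     print("Optimized prompt produced higher initial confidence")
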
Reducing RAG costs with early pruning
In traditional RAG implementations, systems retrieve 5–20 documents and generate complete responses using these retrieved contexts. This approach drives up inference costs because every retrieved context consumes tokens regardless of actual usefulness.
Log probabilities enable a more cost-effective alternative through early pruning. Instead of immediately processing the retrieved documents in full:

Generate draft responses based on each retrieved context
Calculate the average log probability across these short drafts
Rank contexts by their average log probability scores
Discard low-scoring contexts that fall below a confidence threshold
Generate the complete response using only the highest-confidence contexts

This approach works because contexts that contain relevant information produce higher log probabilities in the draft generation phase. When the model encounters helpful context, it generates text with greater confidence, reflected in log probabilities closer to zero. Conversely, irrelevant or tangential contexts produce more uncertain outputs with lower log probabilities.
By filtering contexts before full generation, you can reduce token consumption while maintaining or even improving answer quality. This shifts the process from a brute-force approach to a targeted pipeline that directs full generation only toward contexts where the model demonstrates genuine confidence in the source material.
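The following is a minimal sketch of the pruning step, assuming a score_draft function that generates a short draft for one context and returns its average log probability (for example, by reusing the invocation pattern described in this post); the threshold and top-k values are illustrative:

def prune_contexts(contexts, score_draft, keep_top_k=3, min_avg_logprob=-1.5):
    """Score each retrieved context by the average log probability of a short
    draft answer, drop low-scoring contexts, and keep the top-k survivors."""
    scored = [(context, score_draft(context)) for context in contexts]
    survivors = [(c, s) for c, s in scored if s >= min_avg_logprob]
    survivors.sort(key=lambda pair: pair[1], reverse=True)
    return [context for context, _ in survivors[:keep_top_k]]

# The surviving contexts are then passed to a single full-length generation,
# concentrating token spend on material the model is confident about.
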
Fine-tuning evaluation
When you have fine-tuned a model for your specific domain, log probabilities offer a quantitative way to assess the effectiveness of your training. By analyzing confidence patterns in responses, you can determine if your model has developed proper calibration—showing high confidence for correct domain-specific answers and appropriate uncertainty elsewhere.
A well-calibrated fine-tuned model should assign higher probabilities to accurate information within its specialized area while maintaining lower confidence when operating outside its training domain. Problems with calibration appear in two main forms. Overconfidence occurs when the model assigns high probabilities to incorrect responses, suggesting it hasn’t properly learned the boundaries of its knowledge. Underconfidence manifests as consistently low probabilities despite generating accurate answers, indicating that training might not have sufficiently reinforced correct patterns.
By systematically testing your model across various scenarios and analyzing the log probabilities, you can identify areas needing additional training or detect potential biases in your current approach. This creates a data-driven feedback loop for iterative improvements, making sure your model performs reliably within its intended scope while maintaining appropriate boundaries around its expertise.
Getting started
Here’s how to start using log probabilities with models imported through the Amazon Bedrock Custom Model Import feature:

Enable log probabilities in your API calls: Add "return_logprobs": true to your request payload when invoking your custom imported model. This parameter works with both the InvokeModel and InvokeModelWithResponseStream APIs. Begin with familiar prompts to observe which tokens your model predicts with high confidence and which it finds surprising (see the sketch after this list).
Analyze confidence patterns in your custom models: Examine how your fine-tuned or domain-adapted models respond to different inputs. The log probabilities reveal whether your model is appropriately calibrated for your specific domain—showing high confidence where it should be certain.
Develop confidence-aware applications: Implement practical use cases such as hallucination detection, response ranking, and content verification to make your applications more robust. For example, you can flag low-confidence sections of responses for human review or select the highest-confidence response from multiple generations.
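As a starting point for the first item above, the following is a minimal sketch of the call, assuming a custom model imported from the Llama family (the model ARN is a placeholder, and the prompt and max_gen_len fields depend on the request format your imported model expects); the only feature-specific addition is the return_logprobs flag:

import json
import boto3

bedrock_runtime = boto3.client(service_name="bedrock-runtime")

# Placeholder ARN for a model imported through Custom Model Import.
model_arn = "arn:aws:bedrock:us-east-1:123456789012:imported-model/your-model-id"

body = {
    "prompt": "The quick brown fox",
    "max_gen_len": 16,
    "temperature": 0.0,
    "return_logprobs": True,  # request token-level log probabilities
}

response = bedrock_runtime.invoke_model(modelId=model_arn, body=json.dumps(body))
response_body = json.loads(response["body"].read())

print(response_body.get("generation"))
print(response_body.get("logprobs"))

The parsed response then contains the generation, prompt_logprobs, and logprobs fields described earlier in this post.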

Conclusion
Log probability support for Amazon Bedrock Custom Model Import offers enhanced visibility into model decision-making. This feature transforms previously opaque model behavior into quantifiable confidence metrics that developers can analyze and use.
Throughout this post, we have demonstrated how to enable log probabilities in your API calls, interpret the returned data, and use these insights for practical applications. From detecting potential hallucinations and ranking multiple completions to optimizing RAG systems and evaluating fine-tuning quality, log probabilities offer tangible benefits across diverse use cases.
For customers working with customized foundation models like Llama, Mistral, or Qwen, these insights address a fundamental challenge: understanding not just what a model generates, but how confident it is in its output. This distinction becomes critical when deploying AI in domains requiring high reliability—such as finance, healthcare, or enterprise applications—where incorrect outputs can have significant consequences.
By revealing confidence patterns across different types of queries, log probabilities help you assess how well your model customizations have affected calibration, highlighting where your model excels and where it might need refinement. Whether you are evaluating fine-tuning effectiveness, debugging unexpected responses, or building systems that adapt to varying confidence levels, this capability represents an important advancement in bringing greater transparency and control to generative AI development on Amazon Bedrock.
We look forward to seeing how you use log probabilities to build more intelligent and trustworthy applications with your custom imported models. This capability demonstrates the commitment from Amazon Bedrock to provide developers with tools that enable confident innovation while delivering the scalability, security, and simplicity of a fully managed service.

About the authors
Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps organizations design, prototype, and scale AI-powered solutions in the cloud. With expertise in deep learning, scalable cloud-native systems, and multi-agent orchestration, he focuses on turning emerging innovations into production-ready architectures that drive measurable business value. He is passionate about making complex AI concepts practical and enabling customers to innovate responsibly at scale—from early experimentation to enterprise deployment. Before joining AWS, Manoj worked in consulting, delivering data science and AI solutions for enterprise clients, building end-to-end machine learning systems supported by strong MLOps practices for training, deployment, and monitoring in production.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Revendra Kumar is a Senior Software Development Engineer at Amazon Web Services. In his current role, he focuses on model hosting and inference MLOps on Amazon Bedrock. Prior to this, he worked as an engineer on hosting Quantum computers on the cloud and developing infrastructure solutions for on-premises cloud environments. Outside of his professional pursuits, Revendra enjoys staying active by playing tennis and hiking.

Migrate from Anthropic’s Claude 3.5 Sonnet to Claude 4 Sonnet on Ama …

This post is co-written with Gareth Jones from Anthropic.
Anthropic’s Claude 4 Sonnet model has launched on Amazon Bedrock, marking a significant advancement in foundation model capabilities. Consequently, the deprecation timeline for Anthropic’s Claude 3.5 Sonnet (v1 and v2) was announced. This evolution creates a dual imperative for production AI applications: the opportunity to harness enhanced performance and the operational necessity to migrate before deprecation. Organizations must treat model migrations as a core component of their AI inference strategy, because poor execution can result in service disruptions, performance regressions, and cost overruns.
This post provides a systematic approach to migrating from Anthropic’s Claude 3.5 Sonnet to Claude 4 Sonnet on Amazon Bedrock. We examine the key model differences, highlight essential migration considerations, and deliver proven best practices to transform this necessary transition into a strategic advantage that drives measurable value for your organization.
Overview of model differences
Understanding the specific changes between model versions is the first step in planning a successful migration. The migration from Anthropic’s Claude 3.5 Sonnet to Claude 4 Sonnet introduces capability and behavioral shifts that you can take advantage of:

Increased context window – Anthropic’s Claude 4 Sonnet expands the context window from 200,000 tokens to 1 million tokens (beta). This allows applications to process and reason over codebases, financial reports, or lengthy legal documents in a single prompt, simplifying complex workflows.
Native reasoning mechanisms – Whereas Anthropic’s Claude 3.5 Sonnet models rely on the chain-of-thought (CoT) prompting technique, Claude 4 models introduce built-in, API-enabled reasoning features like extended thinking and interleaved thinking. These give the model dedicated computational time to reason before answering, dramatically improving performance on complex problems.
Advanced tool use – Anthropic’s Claude 4 Sonnet model significantly upgrades tool use capabilities. It can execute multiple tools in parallel and use extended thinking between tool calls, enabling more sophisticated and efficient agentic workflows compared to the sequential tool use of older models.

To find out more about the model differences, refer to the Complete Model Comparison Guide.
A successful migration hinges on a few critical technical and strategic considerations, which we discuss in the following sections. Addressing these points proactively can minimize risk and accelerate your path to production.
Prerequisites
Before you can start using Anthropic’s Claude 4 Sonnet model, you must enable access to the model in Amazon Bedrock. For instructions, see Request access to an Amazon Bedrock foundation model. Make sure you review and accept the model’s End User License Agreement (EULA) before proceeding with the model access request. It is also crucial to confirm that Claude 4 Sonnet is available in your intended AWS Region, because model support can differ depending on location. For an updated list of supported Regions, refer to Model support by AWS Region in Amazon Bedrock. Additionally, you can use cross-Region inference (CRIS) by specifying an inference profile when running model inference, which can help improve throughput and maximize resource availability. For a list of supported cross-Region inference profiles, refer to Supported Regions and models for inference profiles.
API changes and code updates
When migrating on Amazon Bedrock, you can use either the model-specific InvokeModel API or the unified Converse API.
If you use the InvokeModel API, the migration is straightforward. You only need to update the modelId in your code, because the request body structure for Anthropic’s Messages API remains consistent between versions:

Old model ID – 'anthropic.claude-3-5-sonnet-20240620-v1:0' or 'anthropic.claude-3-5-sonnet-20241022-v2:0'
New model ID – 'anthropic.claude-sonnet-4-20250514-v1:0'

If you use a CRIS profile for the model, be sure to specify the correct inference profile ID in one of the source Regions, for example, 'us.anthropic.claude-sonnet-4-20250514-v1:0'.
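For reference, the following is a minimal InvokeModel sketch using the Anthropic Messages API request format; aside from pointing modelId at the new model (here the cross-Region inference profile ID), the call is unchanged from a Claude 3.5 Sonnet integration:

import json
import boto3

bedrock_runtime = boto3.client(service_name="bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Your prompt here"}]}
    ],
}

response = bedrock_runtime.invoke_model(
    # Only the model ID changes when migrating from Claude 3.5 Sonnet
    modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",
    body=json.dumps(body),
)

response_body = json.loads(response["body"].read())
print(response_body["content"][0]["text"])
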
This migration is an excellent opportunity to switch to the Converse API. It provides a standardized request/response format for language models on Amazon Bedrock, making future migrations to different models or providers much simpler. See the following code:

import boto3

bedrock_runtime = boto3.client(service_name='bedrock-runtime')
# Simply change the model ID to switch models
response = bedrock_runtime.converse(
    modelId='us.anthropic.claude-sonnet-4-20250514-v1:0',
    messages=[{'role': 'user', 'content': [{'text': "Your prompt here"}]}],
    inferenceConfig={'maxTokens': 1024}
)

print(response['output']['message']['content'][0]['text'])

The following are some key changes:

Updated text editor tool – Developers using Anthropic’s built-in text editor tool for agentic workflows must be aware that the tool definition has been updated. The tool type is now text_editor_20250124 and the tool name is now str_replace_based_edit_tool. The computer-use-2024-10-22 flag in the anthropic_beta header can be removed because the text editor tool is generally available in Anthropic’s Claude 4. Any code using this tool must be updated to reflect these changes. For more details, see Anthropic defined tools.
Removed undo_edit command – The undo_edit command is no longer supported in Anthropic’s Claude 4 Sonnet. Any code using this command should be removed.
New refusal stop reason – The model now introduces a new refusal stop reason. This reason is returned when the model declines to generate content due to safety policies. Application logic that handles model responses must be updated to recognize and manage this new stop reason to help prevent unexpected behavior or errors. For example, when you receive a refusal stop reason in a multi-turn interaction, you must reset the conversation context by either removing or updating the turn that was refused (see the sketch after this list).
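The following is a minimal sketch of such handling for the InvokeModel path, assuming the new stop reason surfaces in the stop_reason field of the Anthropic Messages API response body; the helper name and retry policy are illustrative, not a prescribed pattern:

def handle_response(messages, response_body):
    """messages: running list of Messages API turns.
    response_body: parsed InvokeModel response (Anthropic Messages API format)."""
    if response_body.get("stop_reason") == "refusal":
        # Do not append the refused turn; the caller should rephrase or remove
        # the offending user turn before continuing the conversation.
        print("Model refused to generate content; resetting the last turn.")
        return messages
    # Normal case: append the assistant turn and continue the conversation.
    messages.append({"role": "assistant", "content": response_body["content"]})
    return messages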

Prompt engineering and behavioral shifts
Don’t assume your existing prompts will work perfectly with the new model. Adherence to model-specific best practices is key.
Refer to Claude 4 prompt engineering best practices released by Anthropic to achieve optimal results. For example, Anthropic’s Claude models perform best when prompts are structured with XML tags (for example, <document>, </document>) to clearly separate different parts of the input.
Anthropic’s Claude 4 Sonnet is designed to follow instructions more precisely. This is a major advantage, but might mean it is less verbose or elaborative than Claude 3.5 Sonnet unless explicitly prompted to be. Your system prompts defining the AI’s persona might need adjustments.
New reasoning features
Anthropic’s Claude 4 Sonnet model’s built-in extended thinking is a powerful capability. To unlock the deep reasoning capabilities of the new models, developers can enable extended thinking by including the thinking keyword argument in their API call. But make sure to use it strategically—extended thinking incurs additional costs because reasoning tokens are billed as output tokens at standard model rates. You will be charged for the full thinking process (not just the summary you see), and costs can scale with the thinking budget you set.
We recommend enabling extended thinking for tasks requiring deep, multi-step analysis where accuracy is paramount (for example, complex scientific problems or advanced coding challenges).
We recommend disabling extended thinking for simple queries or latency-sensitive applications. The feature adds to both cost (reasoning tokens are billed) and response time. For many tasks, a well-crafted CoT prompt is still the most efficient choice.
To use extended thinking, simply update the Converse API call by adding the additionalModelRequestFields parameter with the thinking configuration. You can specify the maximum thinking tokens that the model can use by setting a value for budget_tokens. Note that maxTokens should be larger than budget_tokens for extended thinking. See the following code:

import boto3

bedrock_runtime = boto3.client(service_name='bedrock-runtime')

# Add configurations to enable extended thinking
response = bedrock_runtime.converse(
    modelId='us.anthropic.claude-sonnet-4-20250514-v1:0',
    messages=[{'role': 'user', 'content': [{'text': "Your prompt here"}]}],
    inferenceConfig={'maxTokens': 2048},
    additionalModelRequestFields={"thinking": {"type": "enabled",
     "budget_tokens": 1024}}
)

reasoning_text = response['output']['message']['content'][0]['reasoningContent']['reasoningText']['text']
assistant_response = response['output']['message']['content'][1]['text']

The following comparison shows an example of extended thinking and how it can improve response accuracy. We prompt Anthropic’s Claude 3.5 Sonnet v2 and Claude 4 Sonnet (with thinking enabled) with an example prompt:

Alan, Bob, Colin, Dave and Emily are standing in a circle. Alan is on Bob’s immediate left. Bob is on Colin’s immediate left. Colin is on Dave’s immediate left. Dave is on Emily’s immediate left. Who is on Alan’s immediate right?

The comparison below shows the responses from the two models. With extended thinking, the model is able to carefully reason through the problem and arrive at the correct answer.

Model: Claude 3.5 Sonnet v2
Extended thinking: N/A
Output: Let me solve this step by step. First, let’s understand what we know: Alan is on Bob’s immediate left Bob is on Colin’s immediate left Colin is on Dave’s immediate left Dave is on Emily’s immediate left Starting with any person, let’s arrange them in order: If Dave is on Emily’s immediate left, then Emily is on Dave’s immediate right If Colin is on Dave’s immediate left, then Dave is on Colin’s immediate right If Bob is on Colin’s immediate left, then Colin is on Bob’s immediate right If Alan is on Bob’s immediate left, then Bob is on Alan’s immediate right So the order going clockwise is: Alan → Bob → Colin → Dave → Emily Since they’re in a circle, Emily must be next to Alan to complete the circle. Therefore, Emily is on Alan’s immediate right. The answer is Emily.
Correct answer: No

Model: Claude 4 Sonnet
Extended thinking: Let me work through this step by step. We have 5 people standing in a circle: Alan, Bob, Colin, Dave, and Emily. Let me trace through the given information: Alan is on Bob’s immediate left Bob is on Colin’s immediate left Colin is on Dave’s immediate left Dave is on Emily’s immediate left Since they’re in a circle, I need to figure out the order. Let me start with one person and work around. If Alan is on Bob’s immediate left, that means going clockwise: Alan, then Bob. If Bob is on Colin’s immediate left, that means going clockwise: Bob, then Colin. If Colin is on Dave’s immediate left, that means going clockwise: Colin, then Dave. If Dave is on Emily’s immediate left, that means going clockwise: Dave, then Emily. So going clockwise around the circle: Alan → Bob → Colin → Dave → Emily But wait, since it’s a circle, Emily must connect back to someone. Since we have Alan → Bob → Colin → Dave → Emily, and this is a complete circle of 5 people, Emily must be followed by Alan. So the complete circle clockwise is: Alan → Bob → Colin → Dave → Emily → Alan This means Emily is on Alan’s immediate left, and therefore Alan’s immediate right would be… let me think about this carefully. If we go clockwise: Alan → Bob → Colin → Dave → Emily → (back to Alan) So Alan’s immediate right (going clockwise from Alan) is Bob. And Alan’s immediate left (going counter-clockwise from Alan) is Emily. Wait, let me double-check this by verifying the given conditions: Alan is on Bob’s immediate left ✓ (Alan comes right before Bob going clockwise) Bob is on Colin’s immediate left ✓ (Bob comes right before Colin going clockwise) Colin is on Dave’s immediate left ✓ (Colin comes right before Dave going clockwise) Dave is on Emily’s immediate left ✓ (Dave comes right before Emily going clockwise) I confirm Bob sits directly to Alan’s right in the circular arrangement.
Output: Looking at this step by step: Given information: Alan is on Bob’s immediate left Bob is on Colin’s immediate left Colin is on Dave’s immediate left Dave is on Emily’s immediate left Let me trace the clockwise order around the circle: Alan → Bob (since Alan is on Bob’s immediate left) Bob → Colin (since Bob is on Colin’s immediate left) Colin → Dave (since Colin is on Dave’s immediate left) Dave → Emily (since Dave is on Emily’s immediate left) Since they’re standing in a circle, Emily must connect back to complete the circle: Emily → Alan So the complete clockwise order is: Alan → Bob → Colin → Dave → Emily → Alan Therefore, Bob is on Alan’s immediate right.
Correct answer: Yes

With extended thinking enabled, the API will, by default, return a summary of the model’s full thought process rather than the entire chain of reasoning. This is a safeguard designed to prevent misuse while still providing the full intelligence benefits. Developers should be aware that this can result in a chunky or delayed pattern in streaming responses and should account for this in their application’s user experience.
Anthropic’s Claude 4 models feature the interleaved thinking capability when using extended thinking with tools, allowing the model to perform intermediate reasoning between tool calls and develop more nuanced analytical conclusions based on the tool results received. To enable interleaved thinking for tool calling, add the additional parameter “anthropic_beta”: [“interleaved-thinking-2025-05-14”] to additionalModelRequestFields in the Converse API request.
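The following sketch shows where these parameters fit in a Converse request that includes tools; the get_weather tool definition is purely illustrative:

import boto3

bedrock_runtime = boto3.client(service_name="bedrock-runtime")

# Hypothetical tool definition used only to illustrate the request shape.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            }},
        }
    }]
}

response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",
    messages=[{"role": "user", "content": [{"text": "What should I pack for Sydney this week?"}]}],
    inferenceConfig={"maxTokens": 2048},
    toolConfig=tool_config,
    additionalModelRequestFields={
        "thinking": {"type": "enabled", "budget_tokens": 1024},
        # Beta flag that enables reasoning between tool calls
        "anthropic_beta": ["interleaved-thinking-2025-05-14"],
    },
)
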
For more details on using extended thinking, refer to Extended thinking.
Robust evaluation is non-negotiable
You must verify that the new model performs as well as or better than the old one on your specific tasks. Create a curated set of prompts and expected outputs that are highly representative of your production traffic. This custom benchmark is far more valuable than reported benchmark scores.
Integrate this dataset into an automated evaluation pipeline. This pipeline should become part of your continuous integration and continuous delivery (CI/CD) process, serving as a permanent regression suite for future model or prompt changes. For tooling, teams can use built-in solutions like Amazon Bedrock evaluations or open source evaluation frameworks like RAGAS and DeepEval.
Managing integrated safety and governance
A model’s safety profile changes with each version. This must be tested as part of an integrated system. Never test the new model in isolation. Your migration test plan is only valid if it evaluates the new model with the exact same guardrail configuration you use in production, for example, Amazon Bedrock Guardrails. A new model’s conversational style might trigger existing guardrails in unexpected ways, leading to a spike in blocked responses.
Implementing safe deployment strategies
When deploying Anthropic’s Claude 4 Sonnet to production, implement a phased rollout strategy to minimize risks. Consider shadow testing to compare model performance using mirrored traffic without impacting users, then conduct A/B testing to measure impact on business KPIs. For the actual rollout, use either a canary release approach by gradually exposing a small percentage of users to the new model, or a blue/green deployment strategy that maintains parallel environments for instant rollback capability. This structured approach facilitates safe validation of the new model’s performance while maintaining business continuity.
Conclusion
By treating the transition from Anthropic’s Claude 3.5 Sonnet (v1 and v2) to Claude 4 Sonnet as a structured engineering project, you can de-risk the migration process and unlock significant benefits. Understanding the key model differences, adapting your prompts and API calls, and implementing a robust, automated evaluation strategy are the pillars of a successful upgrade.
This transition is an opportunity to maintain business continuity and enhance your application with next-generation capabilities. We encourage you to begin your analysis and testing today.
For more details, refer to Migrating to Claude 4 and Anthropic’s Claude in Amazon Bedrock, and reach out to us for migration-specific inquiries.

About the authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Deepak Dalakoti, PhD, is a Deep Learning Architect at the Generative AI Innovation Centre in Sydney, Australia. With expertise in AI, he partners with clients to accelerate their generative AI adoption through customized, innovative solutions. Outside the world of AI, he enjoys exploring new activities and experiences.
Mahsa Paknezhad, PhD, is a Deep Learning Architect at the AWS Generative AI Innovation Center. With a focus on scalability and production readiness, Mahsa helps organizations across diverse industries harness advanced generative AI models to achieve meaningful business outcomes. Mahsa has a proven track record of successfully delivering projects specifically within the mining industry and the healthcare sector.
Nicholas Moore is a Solutions Architect at AWS, helping startups and businesses of all sizes turn ideas into reality. He specializes in cloud solutions with a focus on AI, analytics, and modern application development.
Derrick Choo is a Senior Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multi-modal systems.
Sovik Kumar Nath is an AI/ML and Generative AI Senior Solutions Architect with AWS. He has extensive experience designing end-to-end ML and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has master’s degrees from the University of South Florida and University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Gareth Jones is a Product Manager at Anthropic focused on the Claude API. His work includes collaboration with AWS on Amazon Bedrock to make Claude widely accessible to developers.

TwinMind Introduces Ear-3 Model: A New Voice AI Model that Sets New In …

TwinMind, a California-based voice AI startup, unveiled its Ear-3 speech-recognition model, claiming state-of-the-art performance on several key metrics and expanded multilingual support. The release positions Ear-3 as a competitive offering against existing ASR (Automatic Speech Recognition) solutions from providers like Deepgram, AssemblyAI, Eleven Labs, Otter, Speechmatics, and OpenAI.

Key Metrics

Word Error Rate (WER): 5.26%. Significantly lower than many competitors: Deepgram ~8.26%, AssemblyAI ~8.31%.
Speaker Diarization Error Rate (DER): 3.8%. A slight improvement over the previous best from Speechmatics (~3.9%).
Language support: 140+ languages. Over 40 more languages than many leading models; aims for “true global coverage.”
Cost per hour of transcription: US$0.23/hr. Positioned as the lowest among major services.

Technical Approach & Positioning

TwinMind indicates Ear-3 is a “fine-tuned blend of several open-source models,” trained on a curated dataset containing human-annotated audio sources such as podcasts, videos, and films.

Diarization and speaker labeling are improved via a pipeline that includes audio cleaning and enhancement before diarization, plus “precise alignment checks” to refine speaker boundary detections.

The model handles code-switching and mixed scripts, which are typically difficult for ASR systems due to varied phonetics, accent variance, and linguistic overlap.

Trade-offs & Operational Details

Ear-3 requires cloud deployment. Because of its model size and compute load, it cannot run fully offline. TwinMind’s Ear-2 (its earlier model) remains the fallback when connectivity is lost.

Privacy: TwinMind claims audio is not stored long-term; only transcripts are stored locally, with optional encrypted backups. Audio recordings are deleted “on the fly.”

Platform integration: API access for the model is planned in the coming weeks for developers/enterprises. For end users, Ear-3 functionality will be rolled out to TwinMind’s iPhone, Android, and Chrome apps over the next month for Pro users.

Comparative Analysis & Implications

Ear-3’s WER and DER metrics put it ahead of many established models. Lower WER translates to fewer transcription errors (mis-recognitions, dropped words, etc.), which is critical for domains like legal, medical, lecture transcription, or archival of sensitive content. Similarly, lower DER (i.e. better speaker separation + labeling) matters for meetings, interviews, podcasts — anything with multiple participants.

The price point of US$0.23/hr makes high-accuracy transcription more economically feasible for long-form audio (e.g. hours of meetings, lectures, recordings). Combined with support for over 140 languages, there is a clear push to make this usable in global settings, not just English-centric or well-resourced language contexts.

However, cloud dependency could be a limitation for users needing offline or edge-device capabilities, or where data privacy or latency requirements are stringent. The complexity of supporting 140+ languages (accent drift, dialects, code-switching) may also expose weaknesses under adverse acoustic conditions, and real-world performance may vary compared to controlled benchmarking.

Conclusion

TwinMind’s Ear-3 model represents a strong technical claim: high accuracy, speaker diarization precision, extensive language coverage, and aggressive cost reduction. If benchmarks hold in real usage, this could shift expectations for what “premium” transcription services should deliver.


What are Optical Character Recognition (OCR) Models? Top Open-Source O …

Optical Character Recognition (OCR) is the process of turning images that contain text—such as scanned pages, receipts, or photographs—into machine-readable text. What began as brittle rule-based systems has evolved into a rich ecosystem of neural architectures and vision-language models capable of reading complex, multi-lingual, and handwritten documents.

How OCR Works

Every OCR system tackles three core challenges:

Detection – Finding where text appears in the image. This step has to handle skewed layouts, curved text, and cluttered scenes.

Recognition – Converting the detected regions into characters or words. Performance depends heavily on how the model handles low resolution, font diversity, and noise.

Post-Processing – Using dictionaries or language models to correct recognition errors and preserve structure, whether that’s table cells, column layouts, or form fields.

The difficulty grows when dealing with handwriting, scripts beyond Latin alphabets, or highly structured documents such as invoices and scientific papers.
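
As a rough illustration of the detection and recognition stages, the following sketch uses the open-source EasyOCR package (covered in the comparison later); the image path is a placeholder:

import easyocr

# Loads detection and recognition models for English; add language codes as needed.
reader = easyocr.Reader(["en"])

# Each result is (bounding_box, recognized_text, confidence).
results = reader.readtext("scanned_page.jpg")

for bounding_box, text, confidence in results:
    print(f"{text} (confidence: {confidence:.2f}) at {bounding_box}")

# Post-processing (spell correction, layout reconstruction) would follow here,
# typically using dictionaries or a language model.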

From Hand-Crafted Pipelines to Modern Architectures

Early OCR: Relied on binarization, segmentation, and template matching. Effective only for clean, printed text.

Deep Learning: CNN and RNN-based models removed the need for manual feature engineering, enabling end-to-end recognition.

Transformers: Architectures such as Microsoft’s TrOCR expanded OCR into handwriting recognition and multilingual settings with improved generalization.

Vision-Language Models (VLMs): Large multimodal models like Qwen2.5-VL and Llama 3.2 Vision integrate OCR with contextual reasoning, handling not just text but also diagrams, tables, and mixed content.

Comparing Leading Open-Source OCR Models

Tesseract (LSTM-based): mature, supports 100+ languages, widely used. Best fit: bulk digitization of printed text.
EasyOCR (PyTorch CNN + RNN): easy to use, GPU-enabled, 80+ languages. Best fit: quick prototypes, lightweight tasks.
PaddleOCR (CNN + Transformer pipelines): strong Chinese/English support, table and formula extraction. Best fit: structured multilingual documents.
docTR (modular: DBNet, CRNN, ViTSTR): flexible, supports both PyTorch and TensorFlow. Best fit: research and custom pipelines.
TrOCR (Transformer-based): excellent handwriting recognition, strong generalization. Best fit: handwritten or mixed-script inputs.
Qwen2.5-VL (vision-language model): context-aware, handles diagrams and layouts. Best fit: complex documents with mixed media.
Llama 3.2 Vision (vision-language model): OCR integrated with reasoning tasks. Best fit: QA over scanned documents, multimodal tasks.

Emerging Trends

Research in OCR is moving in three notable directions:

Unified Models: Systems like VISTA-OCR collapse detection, recognition, and spatial localization into a single generative framework, reducing error propagation.

Low-Resource Languages: Benchmarks such as PsOCR highlight performance gaps in languages like Pashto, underlining the need for multilingual fine-tuning.

Efficiency Optimizations: Models such as TextHawk2 reduce visual token counts in transformers, cutting inference costs without losing accuracy.

Conclusion

The open-source OCR ecosystem offers options that balance accuracy, speed, and resource efficiency. Tesseract remains dependable for printed text, PaddleOCR excels with structured and multilingual documents, while TrOCR pushes the boundaries of handwriting recognition. For use cases requiring document understanding beyond raw text, vision-language models like Qwen2.5-VL and Llama 3.2 Vision are promising, though costly to deploy.

The right choice depends less on leaderboard accuracy and more on the realities of deployment: the types of documents, scripts, and structural complexity you need to handle, and the compute budget available. Benchmarking candidate models on your own data remains the most reliable way to decide.
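
As one way to run such a benchmark, the following sketch computes word and character error rates with the open-source jiwer package; the transcribe callables and sample list are placeholders for whichever engines and documents you are evaluating:

import jiwer

def evaluate_ocr(model_name, transcribe, samples):
    """samples: list of (image_path, ground_truth_text) pairs.
    transcribe: placeholder callable that runs the OCR engine on an image path."""
    references, hypotheses = [], []
    for image_path, ground_truth in samples:
        references.append(ground_truth)
        hypotheses.append(transcribe(image_path))
    print(f"{model_name}: WER={jiwer.wer(references, hypotheses):.3f}, "
          f"CER={jiwer.cer(references, hypotheses):.3f}")

# Hypothetical usage comparing two candidate engines on the same documents:
# evaluate_ocr("tesseract", run_tesseract, my_samples)
# evaluate_ocr("trocr", run_trocr, my_samples)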