Google AI Research Releases DeepSomatic: A New AI Model that Identifies Cancer Cell Genetic Variants

A team of researchers from Google Research and UC Santa Cruz has released DeepSomatic, an AI model that identifies genetic variants in cancer cells. In research with Children’s Mercy, it found 10 variants in pediatric leukemia cells missed by other tools. DeepSomatic is a somatic small-variant caller for cancer genomes that works across Illumina short reads, PacBio HiFi long reads, and Oxford Nanopore long reads. The method extends DeepVariant, detects single nucleotide variants and small insertions and deletions in whole genome and whole exome data, and supports tumor-normal and tumor-only workflows, including FFPE models.

https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/

How Does It Work?

DeepSomatic converts aligned reads into image-like tensors that encode pileups, base qualities, and alignment context. A convolutional neural network classifies candidate sites as somatic or not, and the pipeline emits VCF or gVCF. This design is platform agnostic because the tensor summarizes local haplotype and error patterns across technologies. Google researchers describe the approach and its focus on distinguishing inherited from acquired variants, including difficult samples such as glioblastoma and pediatric leukemia.
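To make the tensor-and-classifier design concrete, here is a minimal sketch in Python. The channel layout, window size, and class labels are illustrative assumptions for this article, not DeepSomatic's actual implementation.

import numpy as np

# Hypothetical channel layout for a pileup window around one candidate site:
CHANNELS = ["base", "base_quality", "mapping_quality", "strand", "sample_of_origin"]
WINDOW = 221      # aligned positions per window (illustrative)
MAX_READS = 100   # pileup rows (illustrative)

def encode_candidate_site(reads) -> np.ndarray:
    """Build an image-like tensor (channels x reads x positions) for one candidate."""
    tensor = np.zeros((len(CHANNELS), MAX_READS, WINDOW), dtype=np.float32)
    # ... fill each channel from the reads overlapping the candidate site ...
    return tensor

def classify_site(cnn, tensor: np.ndarray) -> str:
    """A trained CNN maps the tensor to class probabilities, e.g. germline / somatic / artifact."""
    probs = cnn(tensor[None])  # batch of one
    return ["germline", "somatic", "artifact"][int(np.asarray(probs).argmax())]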

Datasets and Benchmarking

Training and evaluation use CASTLE (Cancer Standards Long-read Evaluation). CASTLE contains 6 matched tumor and normal cell line pairs that were whole genome sequenced on Illumina, PacBio HiFi, and Oxford Nanopore. The research team released the benchmark sets and accessions for reuse, which fills a gap in multi-technology somatic training and testing resources.


Reported Results

The research team reports consistent gains over widely used methods for both single nucleotide variants and indels. On Illumina indels, the next best method reaches about 80 percent F1 while DeepSomatic reaches about 90 percent. On PacBio indels, the next best method is under 50 percent while DeepSomatic is above 80 percent. Baselines include SomaticSniper, MuTect2, and Strelka2 for short reads and ClairS for long reads. The study reports 329,011 somatic variants across the reference lines and an additional preserved sample. The Google research team reports that DeepSomatic outperforms current methods, with particular strength on indels.


Generalization to Real Samples

The research team evaluates transfer to cancers beyond the training set. A glioblastoma sample shows recovery of known drivers. Pediatric leukemia samples test the tumor-only mode, where a clean matched normal is not available. The tool recovers known calls and reports additional variants in that cohort. These studies indicate that the representation and training scheme generalize to new disease contexts and to settings without matched normals.

Key Takeaways

DeepSomatic detects somatic SNVs (single nucleotide variants) and indels across Illumina, PacBio HiFi, and Oxford Nanopore, and builds on the DeepVariant methodology.

The pipeline supports tumor-normal and tumor-only workflows, includes FFPE WGS and WES models, and is released on GitHub.

It encodes read pileups as image like tensors and uses a convolutional neural network to classify somatic sites and emit VCF or gVCF.

Training and evaluation use the CASTLE dataset with 6 matched tumor normal cell line pairs sequenced on three platforms, with benchmarks and accessions provided.

Reported results show about 90 percent indel F1 on Illumina and above 80 percent on PacBio, outperforming common baselines, with 329,011 somatic variants identified across reference samples.

Editorial Comments

DeepSomatic is a pragmatic step for somatic variant calling across sequencing platforms. The model keeps DeepVariant’s image-tensor representation and convolutional neural network, so the same architecture scales from Illumina to PacBio HiFi to Oxford Nanopore with consistent preprocessing and outputs. The CASTLE dataset is the right move: it supplies matched tumor and normal cell lines across three technologies, which strengthens training and benchmarking and aids reproducibility. Reported results emphasize indel accuracy, about 90% F1 on Illumina and more than 80% on PacBio against lower baselines, which addresses a long-running weakness in indel detection. The pipeline supports WGS and WES, tumor-normal and tumor-only workflows, and FFPE, which matches real laboratory constraints.

Check out the Technical Paper, Technical details, Dataset and GitHub Repo.

DeepSeek Just Released a 3B OCR Model: A 3B VLM Designed for High-Performance OCR and Structured Document Conversion

DeepSeek-AI released DeepSeek-OCR, a 3B end-to-end OCR and document parsing Vision-Language Model (VLM) system that compresses long text into a small set of vision tokens, then decodes those tokens with a language model. The idea is simple: images carry compact representations of text, which reduces sequence length for the decoder. The research team reports 97% decoding precision when text tokens are within 10 times the vision tokens on the Fox benchmark, and useful behavior even at 20 times compression. It also reports competitive results on OmniDocBench with far fewer tokens than common baselines.

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf

Architecture: what is actually new?

DeepSeek-OCR-3B has two components: a vision encoder named DeepEncoder and a Mixture-of-Experts decoder named DeepSeek3B-MoE-A570M. The encoder is designed for high resolution inputs with low activation cost and few output tokens. It uses a window attention stage based on SAM for local perception, a 2-layer convolutional compressor for 16× token downsampling, and a dense global attention stage based on CLIP for visual knowledge aggregation. This design keeps activation memory controlled at high resolution and keeps the vision token count low. The decoder is a 3B parameter MoE model (DeepSeek3B-MoE-A570M) with about 570M active parameters per token.
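A quick back-of-the-envelope check shows how the 16× compressor produces the reported token counts. The 16-pixel patch size below is an assumption for illustration, but the resulting numbers match the native modes listed in the next section.

def vision_tokens(side_px: int, patch_px: int = 16, compress: int = 16) -> int:
    """Patch-grid tokens after a 16x convolutional compressor (patch size assumed)."""
    patches = (side_px // patch_px) ** 2
    return patches // compress

for name, side in [("Tiny", 512), ("Small", 640), ("Base", 1024), ("Large", 1280)]:
    print(name, side, "->", vision_tokens(side), "vision tokens")  # 64, 100, 256, 400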


Multi-resolution modes, engineered for token budgets

DeepEncoder supports native modes and dynamic modes. Native modes are Tiny with 64 tokens at 512 by 512 pixels, Small with 100 tokens at 640 by 640, Base with 256 tokens at 1024 by 1024, and Large with 400 tokens at 1280 by 1280. Dynamic modes named Gundam and Gundam-Master mix tiled local views with a global view. Gundam yields n×100 plus 256 tokens, or n×256 plus 400 tokens, with n in the range 2 to 9. For padded modes, the research team gives a formula for valid tokens, which is lower than the raw token count, and depends on the aspect ratio. These modes let AI developers and researchers align token budgets with page complexity.
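As a small illustration of these budgets, the helper below computes the Gundam-style token counts; the function name is hypothetical and the padded-mode valid-token correction from the paper is omitted.

def gundam_tokens(n_tiles: int, master: bool = False) -> int:
    """Token budget for Gundam-style modes: n local tiles plus one global view.

    Gundam:        n x 100 (640x640 tiles) + 256 (1024x1024 global view)
    Gundam-Master: n x 256 (1024x1024 tiles) + 400 (1280x1280 global view)
    """
    if not 2 <= n_tiles <= 9:
        raise ValueError("n is expected to be in the range 2 to 9")
    return n_tiles * 256 + 400 if master else n_tiles * 100 + 256

print(gundam_tokens(3))               # 556
print(gundam_tokens(3, master=True))  # 1168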



Compression results: what the numbers say

The Fox benchmark study measures precision as exact text match after decoding. With 100 vision tokens, pages with 600 to 700 text tokens reach 98.5% precision at 6.7× compression. Pages with 900 to 1000 text tokens reach 96.8% precision at 9.7× compression. With 64 vision tokens, precision decreases as compression increases, for example 59.1% at about 19.7× for 1200 to 1300 text tokens. These values come directly from Table 2.
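The quoted ratios are consistent with a simple text-tokens over vision-tokens calculation; the midpoints used below are illustrative.

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return round(text_tokens / vision_tokens, 1)

print(compression_ratio(670, 100))   # ~6.7x, reported 98.5% precision
print(compression_ratio(970, 100))   # ~9.7x, reported 96.8% precision
print(compression_ratio(1260, 64))   # ~19.7x, reported 59.1% precision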


On OmniDocBench, the abstract reports that DeepSeek-OCR surpasses GOT-OCR 2.0 when using only 100 vision tokens per page, and that under 800 vision tokens it outperforms MinerU 2.0, which uses over 6000 tokens per page on average. The benchmark section presents overall performance in terms of edit distance.


Training details that matter

The research team describes a two phase training pipeline. It first trains DeepEncoder with next token prediction on OCR 1.0 and OCR 2.0 data and 100M LAION samples, then trains the full system with pipeline parallelism across 4 partitions. For hardware, the run used 20 nodes, each with 8 A100 40G GPUs, and used AdamW. The team reports a training speed of 90B tokens per day on text only data, and 70B tokens per day on multimodal data. In production, it reports the ability to generate over 200k pages per day on a single A100 40G node.

How to evaluate it in a practical stack

If your target documents are typical reports or books, start with Small mode at 100 tokens, then adjust upward only if the edit distance is unacceptable. If your pages contain dense small fonts or very high token counts, use a Gundam mode, since it combines global and local fields of view with explicit token budgeting. If your workload includes charts, tables, or chemical structures, review the “Deep parsing” qualitative section, which shows conversions to HTML tables and SMILES and structured geometry, then design outputs that are easy to validate.
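That advice can be condensed into a small, hypothetical mode-selection helper; the thresholds are assumptions chosen to keep compression near or below 10×, not recommendations from the paper.

def pick_mode(estimated_text_tokens: int, dense_layout: bool = False) -> str:
    """Heuristic mode picker targeting roughly <=10x compression (illustrative thresholds)."""
    if dense_layout or estimated_text_tokens > 4000:
        return "gundam"   # tiled local views plus a global view
    if estimated_text_tokens <= 1000:
        return "small"    # 100 vision tokens
    if estimated_text_tokens <= 2500:
        return "base"     # 256 vision tokens
    return "large"        # 400 vision tokens

print(pick_mode(800))                      # small
print(pick_mode(3000, dense_layout=True))  # gundam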


Key Takeaways

DeepSeek OCR targets token efficiency using optical context compression with near lossless decoding at about 10 times compression, and around 60 percent precision at about 20 times compression.

The HF release exposes explicit token budgets: Tiny uses 64 tokens at 512 by 512, Small uses 100 tokens at 640 by 640, Base uses 256 tokens at 1024 by 1024, Large uses 400 tokens at 1280 by 1280, and Gundam composes n views at 640 by 640 plus one global view at 1024 by 1024.

The system structure is a DeepEncoder that compresses pages into vision tokens and a DeepSeek3B MoE decoder with about 570M active parameters, as described by the research team in the technical report.

The Hugging Face model card documents a tested setup for immediate use: Python 3.12.9, CUDA 11.8, PyTorch 2.6.0, Transformers 4.46.3, Tokenizers 0.20.3, and Flash Attention 2.7.3.

Editorial Comments

DeepSeek OCR is a practical step for document AI. It treats pages as compact optical carriers that reduce decoder sequence length without discarding most information, and the model card and technical report describe 97 percent decoding precision at about 10 times compression on the Fox benchmark, which is the key claim to test in real workloads. The released model is a 3B MoE decoder with a DeepEncoder front end, packaged for Transformers, with tested versions for PyTorch 2.6.0, CUDA 11.8, and Flash Attention 2.7.3, which lowers setup cost for engineers. The repository shows a single 6.67 GB safetensors shard, which suits common GPUs. Overall, DeepSeek OCR operationalizes optical context compression with a 3B MoE decoder, reports about 97% decoding precision at 10x compression on Fox, provides explicit token budget modes, and includes a tested Transformers setup; validate the throughput claim in your own pipeline.

Check out the Technical Paper, Model on HF and GitHub Repo.

The Local AI Revolution: Expanding Generative AI with GPT-OSS-20B and the NVIDIA RTX AI PC

The landscape of AI is expanding. Today, many of the most powerful LLMs (large language models) reside primarily in the cloud, offering incredible capabilities but also raising concerns about privacy and imposing limits on how many files you can upload or how long they stay loaded. Now, a powerful new paradigm is emerging.

This is the dawn of local, private AI.

Imagine a university student preparing for finals with a semester’s overload of data: dozens of lecture recordings, scanned textbooks, proprietary lab simulations, and folders filled with handwritten notes. Uploading this massive, copyrighted, and disorganized dataset to the cloud is impractical, and most services would require you to re-upload it for every session. Instead, students are using local LLMs to load all of these files while maintaining complete control on their own laptops.

They prompt the AI: “Analyze my notes on ‘XL1 reactions,’ cross-reference the concept with Professor Dani’s lecture from October 3rd, and explain how it applies to question 5 on the practice exam.”

Seconds later, the AI generates a personalized study guide, highlights the key chemical mechanism from the slides, transcribes the relevant lecture segment, deciphers the student’s handwritten scrawl, and drafts new, targeted practice problems to solidify their understanding.

This switch to local PCs is catalyzed by the release of powerful open models like OpenAI’s new gpt-oss, and supercharged by accelerations provided by NVIDIA RTX AI PCs on LLM frameworks used to run these models locally. A new era of private, instantaneous, and hyper-personalized AI is here.

gpt-oss: the Keys to the Kingdom

OpenAI’s recent launch of gpt-oss is a seismic event for the developer community. It’s a robust 20-billion parameter LLM that is both open-source and, crucially, “open-weight.”

But gpt-oss isn’t just a powerful engine; it’s a meticulously engineered machine with several game-changing features built-in:

● A Specialized Pit Crew (Mixture-of-Experts): The model uses a Mixture-of-Experts (MoE) architecture. Instead of one giant brain doing all the work, it has a team of specialists. For any given task, it intelligently routes the problem to the relevant “experts,” making inference fast and efficient, which is perfect for powering an interactive language-tutor bot, where instant replies are needed to make a practice conversation feel natural and engaging.

● A Tunable Mind (Adjustable Reasoning): The model showcases its thinking with Chain-of-Thought and gives you direct control with adjustable reasoning levels. This allows you to manage the trade-off between speed and depth for any task. For instance, a student writing a term paper could use a “low” setting to quickly summarize a single research article, then switch to “high” to generate a detailed essay outline that thoughtfully synthesizes complex arguments from multiple sources.

● A Marathon Runner’s Memory (Long Context): With a massive 131,000-token context window, it can digest and remember entire technical documents without losing track of the plot. For example, this allows a student to load an entire textbook chapter and all of their lecture notes to prepare for an exam, asking the model to synthesize the key concepts from both sources and generate tailored practice questions.

● Lightweight Power (MXFP4): It is built using MXFP4 quantization. Think of this as building an engine from an advanced, ultra-light alloy. It dramatically reduces the model’s memory footprint, allowing it to deliver high performance. This makes it practical for a computer science student to run a powerful coding assistant directly on their personal laptop in their dorm room, getting help debugging a final project without needing a powerful server or dealing with slow Wi-Fi.

This level of access unlocks superpowers that proprietary cloud models simply can’t match:

● The ‘Air-Gapped’ Advantage (Data Sovereignty): You can analyze and fine-tune LLMs locally using your most sensitive intellectual property without a single byte leaving your secure, air-gapped environment. This is essential for AI data security and compliance (HIPAA/GDPR).

● Forging Specialized AI (Customization): Developers can inject their company’s DNA directly into the model’s brain, teaching it proprietary codebases, specialized industry jargon, or unique creative styles.

● The Zero-Latency Experience (Control): Local deployment provides immediate responsiveness, independent of network connectivity, and offers predictable operational costs.

However, running an engine of this magnitude requires serious computational muscle. To unlock the true potential of gpt-oss, you need hardware built for the job. This model requires at least 16GB of memory to run on local PCs.

The Need for Speed: Why the RTX 50 Series Accelerates Local AI

Benchmarks

When you shift AI processing to your desk, performance isn’t just a metric; it’s the entire experience. It’s the difference between waiting and creating, between a frustrating bottleneck and a seamless thought partner. If you’re waiting for your model to process, you’re losing your creative flow and your analytical edge.

To achieve this seamless experience, the software stack is just as crucial as the hardware. Open-source frameworks like Llama.cpp are essential, acting as the high-performance runtime for these LLMs. Through deep collaboration with NVIDIA, Llama.cpp is heavily optimized for GeForce RTX GPUs for maximum throughput.

The results of this optimization are staggering. Benchmarks utilizing Llama.cpp show NVIDIA’s flagship consumer GPU, the GeForce RTX 5090, running the gpt-oss-20b model at a blistering 282 tokens per second (tok/s). Tokens are the chunks of text a model processes in a single step, and this metric measures how quickly the AI can generate a response. To put this in perspective, the RTX 5090 significantly outpaces the Mac M3 Ultra (116 tok/s) and AMD’s 7900 XTX (102 tok/s). This performance lead is driven by the dedicated AI hardware, the Tensor Cores, built into the GeForce RTX 5090, specifically engineered to accelerate these demanding AI tasks.

But access isn’t just for developers comfortable with command-line tools. The ecosystem is rapidly evolving to become more user-friendly while leveraging these same NVIDIA optimizations. Applications like LM Studio, which is built on top of Llama.cpp, provide an intuitive interface for running and experimenting with local LLMs. LM Studio makes the process easy and supports advanced techniques like RAG (retrieval-augmented generation).

Ollama is another popular, open-source framework that automatically handles model downloads, environment setup, GPU acceleration, and multi-model management, with seamless application integration. NVIDIA has also collaborated with Ollama to optimize its performance, ensuring these accelerations apply to gpt-oss models. Users can interact directly through the new Ollama app or utilize third-party applications such as AnythingLLM, which offers a streamlined, local interface and also includes support for RAG.

The NVIDIA RTX AI Ecosystem: The Force Multiplier

NVIDIA’s advantage isn’t just about raw power; it’s about the robust, optimized software ecosystem acting as a force multiplier for the hardware, making advanced AI possible on local PCs.

The Democratization of Fine-Tuning: Unsloth AI and RTX

Customizing a 20B model has traditionally required extensive data center resources. However, RTX GPUs changed that, and software innovations like Unsloth AI are maximizing this potential.

Optimized for NVIDIA architecture, it leverages techniques like LoRA (Low-Rank Adaptation) to drastically reduce memory usage and increase training speed.

Critically, Unsloth is heavily optimized for the new GeForce RTX 50 Series (Blackwell architecture). This synergy means developers can rapidly fine-tune gpt-oss right on their local PC, fundamentally changing the economics and security of training models on a proprietary “IP vault.”

The Future of AI: Local, Personalized, and Powered by RTX

The release of OpenAI’s gpt-oss is a landmark moment, signaling an industry-wide pivot toward transparency and control. But harnessing this power, achieving instantaneous insights, zero-latency creativity, and ironclad security, requires the right platform. This isn’t just about faster PCs; it’s about a fundamental shift in control and the democratization of AI power. With unmatched performance and groundbreaking optimization tools like Unsloth AI, NVIDIA RTX AI PCs are essential hardware for this revolution.

Thanks to the NVIDIA AI team for the thought leadership and resources for this article. The NVIDIA AI team has supported this content.

Meet LangChain’s DeepAgents Library and a Practical Example to See How DeepAgents Actually Work in Action

While a basic Large Language Model (LLM) agent—one that repeatedly calls external tools—is easy to create, these agents often struggle with long and complex tasks because they lack the ability to plan ahead and manage their work over time. They can be considered “shallow” in their execution.

The deepagents library is designed to overcome this limitation by implementing a general architecture inspired by advanced applications like Deep Research and Claude Code.

This architecture gives agents more depth by combining four key features:

A Planning Tool: Allows the agent to strategically break down a complex task into manageable steps before acting.

Sub-Agents: Enables the main agent to delegate specialized parts of the task to smaller, focused agents.

Access to a File System: Provides persistent memory for saving work-in-progress, notes, and final outputs, allowing the agent to continue where it left off.

A Detailed Prompt: Gives the agent clear instructions, context, and constraints for its long-term objectives.

By providing these foundational components, deepagents makes it easier for developers to build powerful, general-purpose agents that can plan, manage state, and execute complex workflows effectively.

In this article, we’ll take a look at a practical example to see how DeepAgents actually work in action. Check out the FULL CODES here.

Core Capabilities of DeepAgents

1. Planning and Task Breakdown: DeepAgents come with a built-in write_todos tool that helps agents break large tasks into smaller, manageable steps. They can track their progress and adjust the plan as they learn new information.

2. Context Management: Using file tools like ls, read_file, write_file, and edit_file, agents can store information outside their short-term memory. This prevents context overflow and lets them handle larger or more detailed tasks smoothly.

3. Sub-Agent Creation: The built-in task tool allows an agent to create smaller, focused sub-agents. These sub-agents work on specific parts of a problem without cluttering the main agent’s context.

4. Long-Term Memory: With support from LangGraph’s Store, agents can remember information across sessions. This means they can recall past work, continue previous conversations, and build on earlier progress.

Setting up dependencies

!pip install deepagents tavily-python langchain-google-genai langchain-openai

Environment Variables

In this tutorial, we’ll use the OpenAI API key to power our Deep Agent. However, for reference, we’ll also show how you can use a Gemini model instead.

You’re free to choose any model provider you prefer — OpenAI, Gemini, Anthropic, or others — as DeepAgents works seamlessly with different backends. Check out the FULL CODES here.

import os
from getpass import getpass

os.environ['TAVILY_API_KEY'] = getpass('Enter Tavily API Key: ')
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
os.environ['GOOGLE_API_KEY'] = getpass('Enter Google API Key: ')

Importing the necessary libraries

import os
from typing import Literal
from tavily import TavilyClient
from deepagents import create_deep_agent

tavily_client = TavilyClient()

Tools

Just like regular tool-using agents, a Deep Agent can also be equipped with a set of tools to help it perform tasks.

In this example, we’ll give our agent access to a Tavily Search tool, which it can use to gather real-time information from the web. Check out the FULL CODES here.

from typing import Literal
from langchain.chat_models import init_chat_model
from deepagents import create_deep_agent

def internet_search(
    query: str,
    max_results: int = 5,
    topic: Literal["general", "news", "finance"] = "general",
    include_raw_content: bool = False,
):
    """Run a web search"""
    search_docs = tavily_client.search(
        query,
        max_results=max_results,
        include_raw_content=include_raw_content,
        topic=topic,
    )
    return search_docs

Sub-Agents

Subagents are one of the most powerful features of Deep Agents. They allow the main agent to delegate specific parts of a complex task to smaller, specialized agents — each with its own focus, tools, and instructions. This helps keep the main agent’s context clean and organized while still allowing for deep, focused work on individual subtasks.

In our example, we defined two subagents:

policy-research-agent — a specialized researcher that conducts in-depth analysis on AI policies, regulations, and ethical frameworks worldwide. It uses the internet_search tool to gather real-time information and produces a well-structured, professional report.

policy-critique-agent — an editorial agent responsible for reviewing the generated report for accuracy, completeness, and tone. It ensures that the research is balanced, factual, and aligned with regional legal frameworks.

Together, these subagents enable the main Deep Agent to perform research, analysis, and quality review in a structured, modular workflow. Check out the FULL CODES here.

sub_research_prompt = """
You are a specialized AI policy researcher.
Conduct in-depth research on government policies, global regulations, and ethical frameworks related to artificial intelligence.

Your answer should:
- Provide key updates and trends
- Include relevant sources and laws (e.g., EU AI Act, U.S. Executive Orders)
- Compare global approaches when relevant
- Be written in clear, professional language

Only your FINAL message will be passed back to the main agent.
"""

research_sub_agent = {
    "name": "policy-research-agent",
    "description": "Used to research specific AI policy and regulation questions in depth.",
    "system_prompt": sub_research_prompt,
    "tools": [internet_search],
}

sub_critique_prompt = """
You are a policy editor reviewing a report on AI governance.
Check the report at `final_report.md` and the question at `question.txt`.

Focus on:
- Accuracy and completeness of legal information
- Proper citation of policy documents
- Balanced analysis of regional differences
- Clarity and neutrality of tone

Provide constructive feedback, but do NOT modify the report directly.
"""

critique_sub_agent = {
    "name": "policy-critique-agent",
    "description": "Critiques AI policy research reports for completeness, clarity, and accuracy.",
    "system_prompt": sub_critique_prompt,
}

System Prompt

Deep Agents include a built-in system prompt that serves as their core set of instructions. This prompt is inspired by the system prompt used in Claude Code and is designed to be more general-purpose, providing guidance on how to use built-in tools like planning, file system operations, and subagent coordination.

However, while the default system prompt makes Deep Agents capable out of the box, it’s highly recommended to define a custom system prompt tailored to your specific use case. Prompt design plays a crucial role in shaping the agent’s reasoning, structure, and overall performance.

In our example, we defined a custom prompt called policy_research_instructions, which transforms the agent into an expert AI policy researcher. It clearly outlines a step-by-step workflow — saving the question, using the research subagent for analysis, writing the report, and optionally invoking the critique subagent for review. It also enforces best practices such as Markdown formatting, citation style, and professional tone to ensure the final report meets high-quality policy standards. Check out the FULL CODES here.

policy_research_instructions = """
You are an expert AI policy researcher and analyst.
Your job is to investigate questions related to global AI regulation, ethics, and governance frameworks.

1️⃣ Save the user's question to `question.txt`
2️⃣ Use the `policy-research-agent` to perform in-depth research
3️⃣ Write a detailed report to `final_report.md`
4️⃣ Optionally, ask the `policy-critique-agent` to critique your draft
5️⃣ Revise if necessary, then output the final, comprehensive report

When writing the final report:
- Use Markdown with clear sections (## for each)
- Include citations in [Title](URL) format
- Add a ### Sources section at the end
- Write in professional, neutral tone suitable for policy briefings
"""

Main Agent

Here we define our main Deep Agent using the create_deep_agent() function. We initialize the model with OpenAI’s gpt-4o, but as shown in the commented-out line, you can easily switch to Google’s Gemini 2.5 Flash model if you prefer. The agent is configured with the internet_search tool, our custom policy_research_instructions system prompt, and two subagents — one for in-depth research and another for critique.

By default, DeepAgents internally uses Claude Sonnet 4.5 as its model if none is explicitly specified, but the library allows full flexibility to integrate OpenAI, Gemini, Anthropic, or other LLMs supported by LangChain. Check out the FULL CODES here.

model = init_chat_model(model="openai:gpt-4o")
# model = init_chat_model(model="google_genai:gemini-2.5-flash")
agent = create_deep_agent(
    model=model,
    tools=[internet_search],
    system_prompt=policy_research_instructions,
    subagents=[research_sub_agent, critique_sub_agent],
)

Invoking the Agent

query = "What are the latest updates on the EU AI Act and its global impact?"
result = agent.invoke({"messages": [{"role": "user", "content": query}]})

Check out the FULL CODES here.

An Implementation to Build Dynamic AI Systems with the Model Context Protocol (MCP) for Real-Time Resource and Tool Integration

In this tutorial, we explore an advanced use of the Model Context Protocol (MCP) and demonstrate how to use it to address one of the most unique challenges in modern AI systems: enabling real-time interaction between AI models and external data or tools. Traditional models operate in isolation, limited to their training data, but through MCP, we create a bridge that enables models to access live resources, run specialized tools, and adapt dynamically to changing contexts. We walk through building an MCP server and client from scratch, showing how each component contributes to this powerful ecosystem of intelligent collaboration. Check out the FULL CODES here.

import json
import asyncio
from dataclasses import dataclass, asdict
from typing import Dict, List, Any, Optional, Callable
from datetime import datetime
import random

@dataclass
class Resource:
    uri: str
    name: str
    description: str
    mime_type: str
    content: Any = None

@dataclass
class Tool:
    name: str
    description: str
    parameters: Dict[str, Any]
    handler: Optional[Callable] = None

@dataclass
class Message:
    role: str
    content: str
    timestamp: str = None

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.now().isoformat()
We begin by defining the fundamental building blocks of MCP: resources, tools, and messages. We design these data structures to represent how information flows between AI systems and their external environments in a clean, structured way. Check out the FULL CODES here.

class MCPServer:
    def __init__(self, name: str):
        self.name = name
        self.resources: Dict[str, Resource] = {}
        self.tools: Dict[str, Tool] = {}
        self.capabilities = {"resources": True, "tools": True, "prompts": True, "logging": True}
        print(f"✓ MCP Server '{name}' initialized with capabilities: {list(self.capabilities.keys())}")

    def register_resource(self, resource: Resource) -> None:
        self.resources[resource.uri] = resource
        print(f"  → Resource registered: {resource.name} ({resource.uri})")

    def register_tool(self, tool: Tool) -> None:
        self.tools[tool.name] = tool
        print(f"  → Tool registered: {tool.name}")

    async def get_resource(self, uri: str) -> Optional[Resource]:
        await asyncio.sleep(0.1)
        return self.resources.get(uri)

    async def execute_tool(self, tool_name: str, arguments: Dict[str, Any]) -> Any:
        if tool_name not in self.tools:
            raise ValueError(f"Tool '{tool_name}' not found")
        tool = self.tools[tool_name]
        if tool.handler:
            return await tool.handler(**arguments)
        return {"status": "executed", "tool": tool_name, "args": arguments}

    def list_resources(self) -> List[Dict[str, str]]:
        return [{"uri": r.uri, "name": r.name, "description": r.description} for r in self.resources.values()]

    def list_tools(self) -> List[Dict[str, Any]]:
        return [{"name": t.name, "description": t.description, "parameters": t.parameters} for t in self.tools.values()]

We implement the MCP server that manages resources and tools while handling execution and retrieval operations. We ensure it supports asynchronous interaction, making it efficient and scalable for real-world AI applications. Check out the FULL CODES here.

class MCPClient:
    def __init__(self, client_id: str):
        self.client_id = client_id
        self.connected_servers: Dict[str, MCPServer] = {}
        self.context: List[Message] = []
        print(f"\n✓ MCP Client '{client_id}' initialized")

    def connect_server(self, server: MCPServer) -> None:
        self.connected_servers[server.name] = server
        print(f"  → Connected to server: {server.name}")

    async def query_resources(self, server_name: str) -> List[Dict[str, str]]:
        if server_name not in self.connected_servers:
            raise ValueError(f"Not connected to server: {server_name}")
        return self.connected_servers[server_name].list_resources()

    async def fetch_resource(self, server_name: str, uri: str) -> Optional[Resource]:
        if server_name not in self.connected_servers:
            raise ValueError(f"Not connected to server: {server_name}")
        server = self.connected_servers[server_name]
        resource = await server.get_resource(uri)
        if resource:
            self.add_to_context(Message(role="system", content=f"Fetched resource: {resource.name}"))
        return resource

    async def call_tool(self, server_name: str, tool_name: str, **kwargs) -> Any:
        if server_name not in self.connected_servers:
            raise ValueError(f"Not connected to server: {server_name}")
        server = self.connected_servers[server_name]
        result = await server.execute_tool(tool_name, kwargs)
        self.add_to_context(Message(role="system", content=f"Tool '{tool_name}' executed"))
        return result

    def add_to_context(self, message: Message) -> None:
        self.context.append(message)

    def get_context(self) -> List[Dict[str, Any]]:
        return [asdict(msg) for msg in self.context]

We create the MCP client that connects to the server, queries resources, and executes tools. We maintain a contextual memory of all interactions, enabling continuous, stateful communication with the server. Check out the FULL CODES here.

async def analyze_sentiment(text: str) -> Dict[str, Any]:
    await asyncio.sleep(0.2)
    sentiments = ["positive", "negative", "neutral"]
    return {"text": text, "sentiment": random.choice(sentiments), "confidence": round(random.uniform(0.7, 0.99), 2)}

async def summarize_text(text: str, max_length: int = 100) -> Dict[str, str]:
    await asyncio.sleep(0.15)
    summary = text[:max_length] + "..." if len(text) > max_length else text
    return {"original_length": len(text), "summary": summary, "compression_ratio": round(len(summary) / len(text), 2)}

async def search_knowledge(query: str, top_k: int = 3) -> List[Dict[str, Any]]:
    await asyncio.sleep(0.25)
    mock_results = [{"title": f"Result {i+1} for '{query}'", "score": round(random.uniform(0.5, 1.0), 2)} for i in range(top_k)]
    return sorted(mock_results, key=lambda x: x["score"], reverse=True)

We define a set of asynchronous tool handlers, including sentiment analysis, text summarization, and knowledge search. We use them to simulate how the MCP system can execute diverse operations through modular, pluggable tools. Check out the FULL CODES here.

async def run_mcp_demo():
    print("=" * 60)
    print("MODEL CONTEXT PROTOCOL (MCP) - ADVANCED TUTORIAL")
    print("=" * 60)

    print("\n[1] Setting up MCP Server...")
    server = MCPServer("knowledge-server")

    print("\n[2] Registering resources...")
    server.register_resource(Resource(uri="docs://python-guide", name="Python Programming Guide", description="Comprehensive Python documentation", mime_type="text/markdown", content="# Python Guide\nPython is a high-level programming language..."))
    server.register_resource(Resource(uri="data://sales-2024", name="2024 Sales Data", description="Annual sales metrics", mime_type="application/json", content={"q1": 125000, "q2": 142000, "q3": 138000, "q4": 165000}))

    print("\n[3] Registering tools...")
    server.register_tool(Tool(name="analyze_sentiment", description="Analyze sentiment of text", parameters={"text": {"type": "string", "required": True}}, handler=analyze_sentiment))
    server.register_tool(Tool(name="summarize_text", description="Summarize long text", parameters={"text": {"type": "string", "required": True}, "max_length": {"type": "integer", "default": 100}}, handler=summarize_text))
    server.register_tool(Tool(name="search_knowledge", description="Search knowledge base", parameters={"query": {"type": "string", "required": True}, "top_k": {"type": "integer", "default": 3}}, handler=search_knowledge))

    client = MCPClient("demo-client")
    client.connect_server(server)

    print("\n" + "=" * 60)
    print("DEMONSTRATION: MCP IN ACTION")
    print("=" * 60)

    print("\n[Demo 1] Listing available resources...")
    resources = await client.query_resources("knowledge-server")
    for res in resources:
        print(f"  • {res['name']}: {res['description']}")

    print("\n[Demo 2] Fetching sales data resource...")
    sales_resource = await client.fetch_resource("knowledge-server", "data://sales-2024")
    if sales_resource:
        print(f"  Data: {json.dumps(sales_resource.content, indent=2)}")

    print("\n[Demo 3] Analyzing sentiment...")
    sentiment_result = await client.call_tool("knowledge-server", "analyze_sentiment", text="MCP is an amazing protocol for AI integration!")
    print(f"  Result: {json.dumps(sentiment_result, indent=2)}")

    print("\n[Demo 4] Summarizing text...")
    summary_result = await client.call_tool("knowledge-server", "summarize_text", text="The Model Context Protocol enables seamless integration between AI models and external data sources...", max_length=50)
    print(f"  Summary: {summary_result['summary']}")

    print("\n[Demo 5] Searching knowledge base...")
    search_result = await client.call_tool("knowledge-server", "search_knowledge", query="machine learning", top_k=3)
    print("  Top results:")
    for result in search_result:
        print(f"    - {result['title']} (score: {result['score']})")

    print("\n[Demo 6] Current context window...")
    context = client.get_context()
    print(f"  Context length: {len(context)} messages")
    for i, msg in enumerate(context[-3:], 1):
        print(f"    {i}. [{msg['role']}] {msg['content']}")

    print("\n" + "=" * 60)
    print("✓ MCP Tutorial Complete!")
    print("=" * 60)
    print("\nKey Takeaways:")
    print("• MCP enables modular AI-to-resource connections")
    print("• Resources provide context from external sources")
    print("• Tools enable dynamic operations and actions")
    print("• Async design supports efficient I/O operations")

if __name__ == "__main__":
    import sys
    if 'ipykernel' in sys.modules or 'google.colab' in sys.modules:
        await run_mcp_demo()  # top-level await is supported in notebook environments (Colab, Jupyter)
    else:
        asyncio.run(run_mcp_demo())

We bring everything together into a complete demonstration where the client interacts with the server, fetches data, runs tools, and maintains context. We witness the full potential of MCP as it seamlessly integrates AI logic with external knowledge and computation.

In conclusion, the uniqueness of the problem we solve here lies in breaking the boundaries of static AI systems. Instead of treating models as closed boxes, we design an architecture that enables them to query, reason, and act on real-world data in structured, context-driven ways. This dynamic interoperability, achieved through the MCP framework, represents a major shift toward modular, tool-augmented intelligence. By understanding and implementing MCP, we position ourselves to build the next generation of adaptive AI systems that can think, learn, and connect beyond their original confines.

Check out the FULL CODES here.

Weak-for-Strong (W4S): A Novel Reinforcement Learning Algorithm that Trains a Weak Meta Agent to Design Agentic Workflows with Stronger LLMs

Researchers from Stanford, EPFL, and UNC introduce Weak-for-Strong Harnessing (W4S), a new reinforcement learning (RL) framework that trains a small meta-agent to design and refine code workflows that call a stronger executor model. The meta-agent does not fine-tune the strong model; it learns to orchestrate it. W4S formalizes workflow design as a multi-turn Markov decision process and trains the meta-agent with a method called Reinforcement Learning for Agentic Workflow Optimization (RLAO). The research team reports consistent gains across 11 benchmarks with a 7B meta-agent trained for about 1 GPU hour.

https://arxiv.org/pdf/2504.04785

W4S operates in turns. The state contains the task instructions, the current workflow program, and feedback from prior executions. An action has two components: an analysis of what to change and new Python workflow code that implements those changes. The environment executes the code on validation items, returns accuracy and failure cases, and provides a new state for the next turn. The meta-agent can run a quick self check on one sample; if errors arise, it attempts up to 3 repairs, and if errors persist, the action is skipped. This loop gives a learning signal without touching the weights of the strong executor.
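A schematic of one turn in Python, assuming hypothetical helpers (propose_action, self_check, repair, execute_on_validation, state.advance) in place of the actual W4S implementation:

def w4s_turn(meta_agent, state, validation_set, max_repairs: int = 3):
    """One W4S turn: propose (analysis, code), self-check, execute, return the new state."""
    analysis, code = meta_agent.propose_action(state)        # hypothetical API

    # Quick self check on one sample, with up to max_repairs repair attempts.
    error = self_check(code, validation_set[0])              # hypothetical helper
    repairs = 0
    while error is not None and repairs < max_repairs:
        code = meta_agent.repair(code, error)                # hypothetical API
        error = self_check(code, validation_set[0])
        repairs += 1
    if error is not None:
        return state                                         # errors persist, skip this action

    # Execute the workflow (which calls the strong executor) on validation items.
    accuracy, failures = execute_on_validation(code, validation_set)  # hypothetical helper
    return state.advance(analysis=analysis, workflow=code,
                         accuracy=accuracy, failures=failures)        # feedback for the next turn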


W4S runs as an iterative loop

Workflow generation: The weak meta agent writes a new workflow that leverages the strong model, expressed as executable Python code.

Execution and feedback: The strong model executes the workflow on validation samples, then returns accuracy and error cases as feedback.

Refinement: The meta agent uses the feedback to update the analysis and the workflow, then repeats the loop.

Reinforcement Learning for Agentic Workflow Optimization (RLAO)

RLAO is an offline reinforcement learning procedure over multi-turn trajectories. At each iteration, the system samples multiple candidate actions, keeps the best performing action to advance the state, and stores the others for training. The policy is optimized with reward weighted regression. The reward is sparse and compares current validation accuracy to history: a higher weight is given when the new result beats the previous best, and a smaller weight is given when it beats the last iteration. This objective favors steady progress while controlling exploration cost.
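A sketch of the reward scheme and the reward-weighted regression update; the specific weights and the exponential temperature are illustrative, not the paper's exact hyperparameters.

import math

def rlao_reward(acc_now: float, acc_best: float, acc_prev: float,
                w_best: float = 1.0, w_prev: float = 0.5) -> float:
    """Sparse reward: larger when beating the historical best, smaller when only
    beating the last iteration (weights are illustrative)."""
    if acc_now > acc_best:
        return w_best
    if acc_now > acc_prev:
        return w_prev
    return 0.0

def rwr_weight(reward: float, beta: float = 1.0) -> float:
    """Reward-weighted regression scales each stored action's log-likelihood."""
    return math.exp(reward / beta)

# Training then minimizes  -sum_i rwr_weight(r_i) * log pi(action_i | state_i)
# over the offline trajectories collected during workflow optimization.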


Understanding the Results

On HumanEval with GPT-4o-mini as executor, W4S achieves Pass@1 of 95.4, with about 33 minutes of workflow optimization, zero meta-agent API cost, an optimization execution cost of about 0.4 dollars, and about 2.7 minutes to execute the test set at about 0.5 dollars, for a total of about 0.9 dollars. Under the same executor, AFlow and ADAS trail this number. The reported average gains against the strongest automated baseline range from 2.9% to 24.6% across 11 benchmarks.

On math transfer, the meta-agent is trained on GSM Plus and MGSM with GPT-3.5-Turbo as executor, then evaluated on GSM8K, GSM Hard, and SVAMP. The paper reports 86.5 on GSM8K and 61.8 on GSM Hard, both above automated baselines. This indicates that the learned orchestration transfers to related tasks without retraining the executor.

Across seen tasks with GPT-4o-mini as executor, W4S surpasses training-free automated methods that do not learn a planner. The study also runs ablations where the meta-agent is trained by supervised fine tuning rather than RLAO; the RLAO agent yields better accuracy under the same compute budget. The research team includes a GRPO baseline on a 7B weak model for GSM Hard, and W4S outperforms it under limited compute.

Iteration budgets matter. The research team sets W4S to about 10 optimization turns on main tables, while AFlow runs about 20 turns and ADAS runs about 30 turns. Despite fewer turns, W4S achieves higher accuracy. This suggests that learned planning over code, combined with validation feedback, makes the search more sample efficient.


Key Takeaways

W4S trains a 7B weak meta agent with RLAO to write Python workflows that harness stronger executors, modeled as a multi turn MDP.

On HumanEval with GPT 4o mini as executor, W4S reaches Pass@1 of 95.4, with about 33 minutes optimization and about 0.9 dollars total cost, beating automated baselines under the same executor.

Across 11 benchmarks, W4S improves over the strongest baseline by 2.9% to 24.6%, while avoiding fine tuning of the strong model.

The method runs an iterative loop, it generates a workflow, executes it on validation data, then refines it using feedback.

ADAS and AFlow also program or search over code workflows, W4S differs by training a planner with offline reinforcement learning.

Editorial Comments

W4S targets orchestration, not model weights, and trains a 7B meta agent to program workflows that call stronger executors. W4S formalizes workflow design as a multi turn MDP and optimizes the planner with RLAO using offline trajectories and reward weighted regression. Reported results show Pass@1 of 95.4 on HumanEval with GPT 4o mini, average gains of 2.9% to 24.6% across 11 benchmarks, and about 1 GPU hour of training for the meta agent. The framing compares cleanly with ADAS and AFlow, which search agent designs or code graphs, while W4S fixes the executor and learns the planner.

Check out the Technical Paper and GitHub Repo.

Microsoft AI Proposes BitNet Distillation (BitDistill): A Lightweight Pipeline that Delivers up to 10x Memory Savings and about 2.65x CPU Speedup

Microsoft Research proposes BitNet Distillation, a pipeline that converts existing full precision LLMs into 1.58 bit BitNet students for specific tasks, while keeping accuracy close to the FP16 teacher and improving CPU efficiency. The method combines SubLN based architectural refinement, continued pre training, and dual signal distillation from logits and multi head attention relations. Reported results show up to 10× memory savings and about 2.65× faster CPU inference, with task metrics comparable to FP16 across multiple sizes.

What does BitNet Distillation change?

The community already showed that BitNet b1.58 can match full precision quality when trained from scratch, but converting a pretrained FP16 model directly to 1.58 bit often loses accuracy, and the gap grows as model size increases. BitNet Distillation targets this conversion problem for practical downstream deployment. It is designed to preserve accuracy while delivering CPU friendly ternary weights with INT8 activations.

Stage 1: Modeling refinement with SubLN

Low bit models suffer from large activation variance. The research team inserts SubLN normalization inside each Transformer block, specifically before the output projection of the MHSA module and before the output projection of the FFN. This stabilizes hidden state scales that flow into quantized projections, which improves optimization and convergence once weights are ternary. The training loss curves in the analysis section support this design.
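A minimal PyTorch-style sketch of where SubLN sits inside a block, assuming LayerNorm as the normalization and omitting the ternary BitLinear projections that a real BitNet student would use:

import torch
import torch.nn as nn

class SubLNBlock(nn.Module):
    """Sketch of a Transformer block with SubLN placed before the output projection
    of the attention module and before the output projection of the FFN."""
    def __init__(self, d_model: int, n_heads: int, d_ffn: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.attn_sub_ln = nn.LayerNorm(d_model)   # SubLN before the attention out projection
        self.attn_out = nn.Linear(d_model, d_model)
        self.ffn_up = nn.Linear(d_model, d_ffn)
        self.ffn_sub_ln = nn.LayerNorm(d_ffn)      # SubLN before the FFN down projection
        self.ffn_down = nn.Linear(d_ffn, d_model)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, t, d = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        mixed = (attn @ v).transpose(1, 2).reshape(b, t, d)
        x = x + self.attn_out(self.attn_sub_ln(mixed))      # normalize, then project out
        h = torch.relu(self.ffn_up(self.ln2(x)))
        x = x + self.ffn_down(self.ffn_sub_ln(h))           # normalize, then project down
        return x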

Stage 2: Continued pre training to adapt weight distributions

Direct task fine tuning at 1.58 bit gives the student only a small number of task tokens, which is not enough to reshape the FP16 weight distribution for ternary constraints. BitNet Distillation performs a short continued pre-training on a general corpus, using 10B tokens from the FALCON corpus, to push weights toward BitNet-like distributions. The visualization shows the mass concentrating near transition boundaries, which makes small gradients flip weights among {-1, 0, 1} during downstream task training. This improves learning capacity without a full pretraining run.
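For reference, the ternary constraint that the continued pre-training adapts toward can be written as the absmean quantizer used in BitNet b1.58; this is the standard formulation, shown as a sketch rather than the exact BitDistill code.

import torch

def ternary_quantize(weight: torch.Tensor, eps: float = 1e-5):
    """Absmean ternarization to {-1, 0, 1} with a per-tensor scale (BitNet b1.58 style)."""
    scale = weight.abs().mean().clamp(min=eps)           # gamma = mean |W|
    w_ternary = (weight / scale).round().clamp(-1, 1)    # values in {-1, 0, 1}
    return w_ternary, scale

# Straight-Through Estimator: use the quantized weight in the forward pass while
# letting gradients flow to the latent full-precision weight, e.g.
#   w_q = weight + ((w_ternary * scale) - weight).detach()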

Stage 3: Distillation based fine tuning with two signals

The student learns from the FP16 teacher using logits distillation and multi head self attention relation distillation. The logits path uses temperature softened KL between teacher and student token distributions. The attention path follows the MiniLM and MiniLMv2 formulations, which transfer relations among Q, K, V without requiring the same number of heads, and let you choose a single layer to distill. Ablations show that combining both signals works best, and that selecting one well chosen layer preserves flexibility.
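A condensed sketch of the two signals; the temperature, weighting, and the single Q-Q relation per chosen layer are simplifications of the MiniLM-style objective, and the caller is assumed to have reshaped teacher and student projections into a common number of relation heads.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_q, teacher_q, T: float = 2.0, alpha: float = 1.0):
    """Logits KL plus one attention-relation KL (illustrative simplification)."""
    # 1) Temperature-softened logits distillation.
    kl_logits = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

    # 2) Attention-relation distillation (MiniLM style, Q-Q relation only here).
    # student_q / teacher_q: (batch, relation_heads, seq, head_dim), reshaped by the
    # caller so the relation maps match even if the models' head counts differ.
    def relation(x):
        scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
        return F.log_softmax(scores, dim=-1)

    kl_rel = F.kl_div(relation(student_q), relation(teacher_q).exp(),
                      reduction="batchmean")
    return kl_logits + alpha * kl_rel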

Understanding the results

The research team evaluates classification (MNLI, QNLI, SST-2) and summarization on the CNN/DailyMail dataset. It compares three settings: FP16 task fine tuning, direct 1.58 bit task fine tuning, and BitNet Distillation. Figure 1 shows that BitNet Distillation matches FP16 accuracy for Qwen3 backbones at 0.6B, 1.7B, and 4B, while the direct 1.58 bit baseline lags more as model size grows. On CPU, tokens per second improve by about 2.65×, and memory drops by about 10× for the student. The research team quantizes activations to INT8 and uses the Straight-Through Estimator for gradients through the quantizer.

https://arxiv.org/pdf/2510.13998

The framework is compatible with post training quantization methods such as GPTQ and AWQ, which provide additional gains on top of the pipeline. Distilling from a stronger teacher helps more, which suggests pairing small 1.58 bit students with larger FP16 teachers when available.

Key Takeaways

BitNet Distillation is a 3 stage pipeline, SubLN insertion, continued pre training, and dual distillation from logits and multi head attention relations.

The research reports near FP16 accuracy with about 10× lower memory and about 2.65× faster CPU inference for 1.58 bit students.

The method transfers attention relations using MiniLM and MiniLMv2 style objectives, which do not require matching head counts.

Evaluations cover MNLI, QNLI, SST-2, and CNN/DailyMail, and include Qwen3 backbones at 0.6B, 1.7B, and 4B parameters.

Deployment targets ternary weights with INT8 activations, with optimized CPU and GPU kernels available in the official BitNet repository.

Editorial Comments

BitNet Distillation is a pragmatic step toward 1.58 bit deployment without a full retrain. The three stage design, SubLN, continued pre-training, and MiniLM-family attention distillation, maps cleanly to known failure modes in extreme quantization. The reported 10× memory reduction and about 2.65× CPU speedup at near FP16 accuracy indicate solid engineering value for on-premise and edge targets. The reliance on attention relation distillation is well grounded in prior MiniLM work, which helps explain the stability of the results. The presence of bitnet.cpp with optimized CPU and GPU kernels lowers integration risk for production teams.

Check out the Technical Paper and GitHub Repo.

Kong Releases Volcano: A TypeScript, MCP-native SDK for Building Production Ready AI Agents with LLM Reasoning and Real-World Actions

Kong has open-sourced Volcano, a TypeScript SDK that composes multi-step agent workflows across multiple LLM providers with native Model Context Protocol (MCP) tool use. The release coincides with broader MCP capabilities in Kong AI Gateway and Konnect, positioning Volcano as the developer SDK in an MCP-governed control plane.

Why Volcano SDK? Because 9 lines of code are faster to write and easier to manage than 100+.

Without Volcano SDK? You’d need 100+ lines handling tool schemas, context management, provider switching, error handling, and HTTP clients. 

With Volcano SDK: 9 lines.

import { agent, llmOpenAI, llmAnthropic, mcp } from "volcano-ai";

// Setup: two LLMs, two MCP servers
const planner = llmOpenAI({ model: "gpt-5-mini", apiKey: process.env.OPENAI_API_KEY! });
const executor = llmAnthropic({ model: "claude-4.5-sonnet", apiKey: process.env.ANTHROPIC_API_KEY! });
const database = mcp("https://api.company.com/database/mcp");
const slack = mcp("https://api.company.com/slack/mcp");

// One workflow
await agent({ llm: planner })
  .then({
    prompt: "Analyze last week's sales data",
    mcps: [database] // Auto-discovers and calls the right tools
  })
  .then({
    llm: executor, // Switch to Claude
    prompt: "Write an executive summary"
  })
  .then({
    prompt: "Post the summary to #executives",
    mcps: [slack]
  })
  .run();

What does Volcano provide?

Volcano exposes a compact, chainable API—.then(…).run()—that passes intermediate context between steps while switching LLMs per step (e.g., plan with one model, execute with another). It treats MCP as a first-class interface: developers hand Volcano a list of MCP servers, and the SDK performs tool discovery and invocation automatically. Production features include automatic retries, per-step timeouts, connection pooling for MCP servers, OAuth 2.1 authentication, and OpenTelemetry traces/metrics for distributed observability. The project is released under Apache-2.0.

Here are the Key Features of the Volcano SDK:

Chainable API: Build multi-step workflows with a concise .then(…).run() pattern; context flows between steps

MCP-native tool use: Pass MCP servers; the SDK auto-discovers and invokes the right tools in each step.

Multi-provider LLM support: Mix models (e.g., planning with one, execution with another) inside one workflow.

Streaming of intermediate and final results for responsive agent interactions.

Retries & timeouts configurable per step for reliability under real-world failures.

Hooks (before/after step) to customize behavior and instrumentation.

Typed error handling to surface actionable failures during agent execution.

Parallel execution, branching, and loops to express complex control flow.

Observability via OpenTelemetry for tracing and metrics across steps and tool calls.

OAuth support & connection pooling for secure, efficient access to MCP servers.

Where does it fit in Kong’s MCP architecture?

Kong’s Konnect platform adds multiple MCP governance and access layers that complement Volcano’s SDK surface:

AI Gateway gains MCP gateway features such as server autogeneration from Kong-managed APIs, centralized OAuth 2.1 for MCP servers, and observability over tools, workflows, and prompts in Konnect dashboards. These provide uniform policy enforcement and analytics for MCP traffic.

The Konnect Developer Portal can be turned into an MCP server so AI coding tools and agents can discover APIs, request access, and consume endpoints programmatically—reducing manual credential workflows and making API catalogs accessible through MCP.

Kong’s team also previewed MCP Composer and MCP Runner to design, generate, and operate MCP servers and integrations.

Key Takeaways

Volcano is an open-source TypeScript SDK that builds multi-step AI agents with first-class MCP tool use.

The SDK provides production features—retries, timeouts, connection pooling, OAuth, and OpenTelemetry tracing/metrics—for MCP workflows.

Volcano composes multi-LLM plans/executions and auto-discovers/invokes MCP servers/tools, minimizing custom glue code.

Kong paired the SDK with platform controls: AI Gateway/Konnect add MCP server autogeneration, centralized OAuth 2.1, and observability.

Editorial Comments

Kong’s Volcano SDK is a pragmatic addition to the MCP ecosystem: a TypeScript-first agent framework that aligns developer workflow with enterprise controls (OAuth 2.1, OpenTelemetry) delivered via AI Gateway and Konnect. The pairing closes a common gap in agent stacks—tool discovery, auth, and observability—without inventing new interfaces beyond MCP. This design prioritizes protocol-native MCP integration over bespoke glue, cutting operational drift and closing auditing gaps as internal agents scale.

Check out the GitHub Repo and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Kong Releases Volcano: A TypeScript, MCP-native SDK for Building Production Ready AI Agents with LLM Reasoning and Real-World actions appeared first on MarkTechPost.

AutoCode: A New AI Framework that Lets LLMs Create and Verify Competit …

Are your LLM code benchmarks actually rejecting wrong-complexity solutions and interactive-protocol violations, or are they passing under-specified unit tests? A team of researchers from UCSD, NYU, University of Washington, Princeton University, Canyon Crest Academy, OpenAI, UC Berkeley, MIT, University of Waterloo, and Sentient Labs introduce AutoCode, a new AI framework that lets LLMs create and verify competitive programming problems, mirroring the workflow of human problem setters. AutoCode reframes evaluation for code-reasoning models by treating problem setting (not only problem solving) as the target task. The system trains LLMs to produce competition-grade statements, test data, and verdict logic that match official online judges at high rates. On a 7,538-problem benchmark built from prior datasets, AutoCode achieves 91.1% consistency with official judgments (FPR 3.7%, FNR 14.1%). On a separate, more difficult 720 recent Codeforces problems (including interactive tasks), the full framework reports 98.7% consistency, 1.3% FPR, 1.2% FNR.

https://arxiv.org/pdf/2510.12803

Why problem setting matters for evaluation?

Public code benchmarks often rely on under-specified tests that let wrong-complexity or shortcut solutions pass. That inflates scores and pollutes reinforcement signals (rewarding fragile tactics). AutoCode’s validator-first approach and adversarial test generation aim to reduce false positives (FPR)—incorrect programs that pass—and false negatives (FNR)—correct programs rejected due to malformed inputs.
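To make these definitions concrete, here is a minimal sketch, with illustrative names rather than AutoCode's actual code, of how consistency, FPR, and FNR can be computed once each submission carries both the official judge verdict and the verdict under a generated test suite:

from dataclasses import dataclass
from typing import List

@dataclass
class Submission:
    official_accept: bool   # verdict from the official online judge
    suite_accept: bool      # verdict under the generated test suite

def consistency_metrics(subs: List[Submission]) -> dict:
    # Consistency: fraction of submissions where both verdicts agree
    agree = sum(s.official_accept == s.suite_accept for s in subs)
    # FPR: incorrect programs (officially rejected) that the suite accepts
    rejected = [s for s in subs if not s.official_accept]
    fpr = sum(s.suite_accept for s in rejected) / max(1, len(rejected))
    # FNR: correct programs (officially accepted) that the suite rejects
    accepted = [s for s in subs if s.official_accept]
    fnr = sum(not s.suite_accept for s in accepted) / max(1, len(accepted))
    return {"consistency": agree / max(1, len(subs)), "fpr": fpr, "fnr": fnr}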

https://arxiv.org/pdf/2510.12803

The core loop: Validator → Generator → Checker

AutoCode runs a closed loop that mirrors human contest workflows, but each step is selected from LLM-generated candidates using targeted in-framework tests.

1) Validator (minimize FNR by enforcing input legality)

The system first asks an LLM to synthesize 40 evaluation inputs—10 valid and 30 near-valid illegal (e.g., off-by-one boundary violations). It then prompts the LLM for three candidate validator programs and selects the one that best classifies these cases. This prevents “correct” solutions from crashing on malformed data.
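A minimal sketch of that selection step, assuming the 40 generated inputs are already labeled valid or near-valid illegal and each candidate validator is a callable returning True for legal input (the names are illustrative, not the AutoCode API):

from typing import Callable, List, Tuple

def select_validator(candidates: List[Callable[[str], bool]],
                     labeled_inputs: List[Tuple[str, bool]]) -> Callable[[str], bool]:
    """Pick the candidate that best classifies valid vs. near-valid illegal inputs."""
    def accuracy(validator: Callable[[str], bool]) -> float:
        correct = 0
        for raw_input, is_valid in labeled_inputs:
            try:
                verdict = validator(raw_input)
            except Exception:
                verdict = False  # a crashing validator counts as rejecting the input
            correct += (verdict == is_valid)
        return correct / max(1, len(labeled_inputs))
    return max(candidates, key=accuracy)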

https://arxiv.org/pdf/2510.12803

2) Generator (reduce FPR by adversarial coverage)

Three complementary strategies produce test cases:

Small-data exhaustion for boundary coverage

Randomized + extreme cases (overflows, precision, hash-collisions)

TLE-inducing structures to break wrong-complexity solutions

Invalid cases are filtered by the selected validator; then cases are deduplicated and bucket-balanced before sampling.

https://arxiv.org/pdf/2510.12803

3) Checker (verdict logic)

The checker compares contestant outputs with the reference solution under complex rules. AutoCode again generates 40 checker scenarios and three candidate checker programs, keeps only scenarios with validator-approved inputs, and selects the best checker by accuracy against the 40 labeled scenarios.

https://arxiv.org/pdf/2510.12803

4) Interactor (for interactive problems)

For tasks that require dialogue with the judge, AutoCode introduces a mutant-based interactor: it makes small logical edits (“mutants”) to the reference solution, selects interactors that accept the true solution but reject the mutants, maximizing discrimination. This addresses a gap in earlier public datasets that avoided interactives.

https://arxiv.org/pdf/2510.12803

Dual verification enables new problems (not just tests for existing ones)

AutoCode can generate novel problem variants starting from a random “seed” Codeforces problem (<2200 Elo). The LLM drafts a new statement and two solutions: an efficient reference and a simpler brute-force baseline. A problem is accepted only if the reference output matches brute force across the generated test suite (the brute force may TLE on large cases but serves as ground truth on small/exhaustive cases). This dual-verification protocol filters ~27% of error-prone items, lifting reference-solution correctness from 86% → 94% before human review.
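A simplified sketch of the dual-verification idea, with both solutions modeled as plain Python callables for illustration (the real system runs contestant-style programs under time limits):

from typing import Callable, Iterable

def dual_verify(reference: Callable[[str], str],
                brute_force: Callable[[str], str],
                tests: Iterable[str]) -> bool:
    """Accept the candidate problem only if both solutions agree on every test."""
    for test_input in tests:
        if reference(test_input).strip() != brute_force(test_input).strip():
            return False  # disagreement: the candidate problem is discarded
    return True

# Toy usage: sum of two non-negative integers, with an intentionally slow "brute force"
ref = lambda s: str(sum(map(int, s.split())))
brute = lambda s: str(len([None for _ in range(sum(map(int, s.split())))]))
print(dual_verify(ref, brute, ["1 2", "10 20", "0 5"]))  # True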

Human experts then grade the survivors on solvability, solution correctness, quality, novelty, and difficulty. After filtering, 61.6% are usable for model training, 76.3% for human training, and 3.2% are ICPC/IOI-level problems. Difficulty typically increases relative to the seed, and the difficulty gain correlates with perceived quality.

https://arxiv.org/pdf/2510.12803

Understanding the results

Existing problems (7,538 total; 195,988 human submissions). AutoCode: 91.1% consistency, 3.7% FPR, 14.1% FNR, vs 72.9–81.0% consistency for prior generators (CodeContests, CodeContests+, TACO, HardTests).

Recent Codeforces problems (720, unfiltered; includes interactives). AutoCode: 98.7% consistency, 1.3% FPR, 1.2% FNR. Ablations show all three generator strategies and prompt optimization contribute: removing prompt optimization drops consistency to 98.0% and more than doubles FNR to 2.9%.

https://arxiv.org/pdf/2510.12803

Key Takeaways

AutoCode couples a Validator–Generator–Checker (+Interactor) loop with dual verification (reference vs. brute-force) to build contest-grade test suites and new problems.

On held-out problems, AutoCode’s test suites reach ~99% consistency with official judges, surpassing prior generators like HardTests (<81%).

For recent Codeforces tasks (including interactives), the full framework reports ~98.7% consistency with ~1.3% FPR and ~1.2% FNR.

The mutant-based interactor reliably accepts the true solution while rejecting mutated variants, improving evaluation for interactive problems.

Human experts rate a sizable fraction of AutoCode-generated items as training-usable and a non-trivial share as contest-quality, aligning with the LiveCodeBench Pro program’s aims.

Editorial Comments

AutoCode is a practical fix for current code benchmarks. It centers problem setting and uses a closed-loop Validator–Generator–Checker (+Interactor) pipeline with dual verification (reference vs. brute-force). This structure reduces false positives/negatives and yields judge-aligned consistency (≈99% on held-out problems; 98.7% on recent Codeforces, including interactives). The approach standardizes constraint legality, adversarial coverage, and protocol-aware judging, which makes downstream RL reward signals cleaner. Its placement under LiveCodeBench Pro fits a hallucination-resistant evaluation program that emphasizes expert-checked rigor.

Check out the Paper and Project. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters appeared first on MarkTechPost.

Sigmoidal Scaling Curves Make Reinforcement Learning RL Post-Training …

Reinforcement Learning (RL) post-training is now a major lever for reasoning-centric LLMs, but unlike pre-training, it hasn’t had predictive scaling rules. Teams pour tens of thousands of GPU-hours into runs without a principled way to estimate whether a recipe will keep improving with more compute. New research from Meta, UT Austin, UCL, Berkeley, Harvard, and Periodic Labs provides a compute-performance framework—validated over >400,000 GPU-hours—that models RL progress with a sigmoidal curve and supplies a tested recipe, ScaleRL, that follows those predicted curves up to 100,000 GPU-hours.

Fit a sigmoid, not a power law

Pre-training often fits power laws (loss vs. compute). RL fine-tuning targets bounded metrics (e.g., pass rate/mean reward). The research team shows that sigmoidal fits of pass rate vs. training compute are empirically more robust and stable than power-law fits, especially when extrapolating from smaller runs to larger budgets. They exclude the very early, noisy regime (~first 1.5k GPU-hours) and fit the predictable portion that follows. The sigmoidal parameters have intuitive roles: one sets the asymptotic performance (ceiling), another the efficiency/exponent, and another the midpoint where gains are fastest.
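As an illustration of the fitting step (not the paper's exact parameterization or code), a saturating sigmoid can be fit to (compute, pass-rate) pairs with SciPy and then used to extrapolate; the data points below are hypothetical:

import numpy as np
from scipy.optimize import curve_fit

def sigmoid(compute, A, B, C_mid):
    # A: asymptotic pass rate (ceiling), B: efficiency exponent, C_mid: midpoint compute
    return A / (1.0 + (C_mid / compute) ** B)

# Hypothetical (GPU-hours, pass-rate) measurements from the predictable regime
compute = np.array([2e3, 4e3, 8e3, 16e3, 32e3])
pass_rate = np.array([0.31, 0.42, 0.51, 0.57, 0.60])

params, _ = curve_fit(sigmoid, compute, pass_rate, p0=[0.7, 1.0, 8e3], maxfev=10000)
A, B, C_mid = params
print(f"ceiling A={A:.2f}, efficiency B={B:.2f}, midpoint C_mid={C_mid:.0f} GPU-hours")
# Extrapolate to a larger budget before committing the compute
print(f"predicted pass rate at 100k GPU-hours: {sigmoid(1e5, A, B, C_mid):.2f}")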

https://arxiv.org/pdf/2510.13786

Why that matters: After ~1–2k GPU-hours, you can fit the curve and forecast whether pushing to 10k–100k GPU-hours is worth it—before you burn the budget. The research also shows power-law fits can produce misleading ceilings unless you only fit at very high compute, which defeats the purpose of early forecasting.

ScaleRL: a recipe that scales predictably

ScaleRL is not just a new algorithm; it’s a composition of choices that produced stable, extrapolatable scaling in the study:

Asynchronous Pipeline RL (generator–trainer split across GPUs) for off-policy throughput.

CISPO (truncated importance-sampling REINFORCE) as the RL loss.

FP32 precision at the logits to avoid numeric mismatch between generator and trainer.

Prompt-level loss averaging and batch-level advantage normalization.

Forced length interruptions to cap runaway traces.

Zero-variance filtering (drop prompts that provide no gradient signal).

No-Positive-Resampling (remove high-pass-rate prompts ≥0.9 from later epochs).

The research team validated each component with leave-one-out (LOO) ablations at 16k GPU-hours and showed that ScaleRL’s fitted curves reliably extrapolate from 8k → 16k, then hold at much larger scales—including a single run extended to 100k GPU-hours.

https://arxiv.org/pdf/2510.13786

Results and generalization

Two key demonstrations:

Predictability at scale: For an 8B dense model and a Llama-4 17B×16 MoE (“Scout”), the extended training closely followed the sigmoid extrapolations derived from smaller-compute segments.

Downstream transfer: Pass-rate improvements on an iid validation set track downstream evaluation (e.g., AIME-24), suggesting the compute-performance curve isn’t a dataset artifact.

The research also compares fitted curves for prevalent recipes (e.g., DeepSeek (GRPO), Qwen-2.5 (DAPO), Magistral, MiniMax-M1) and reports higher asymptotic performance and better compute efficiency for ScaleRL in their setup.

https://arxiv.org/pdf/2510.13786

Which knobs move the ceiling vs the efficiency?

The framework lets you classify design choices:

Ceiling movers (asymptote): scaling model size (e.g., MoE) and longer generation lengths (up to 32,768 tokens) raise the asymptotic performance but may slow early progress. Larger global batch size can also lift the final asymptote and stabilize training.

Efficiency shapers: loss aggregation, advantage normalization, data curriculum, and the off-policy pipeline mainly change how fast you approach the ceiling, not the ceiling itself.

Operationally, the research team advises fitting curves early, prioritizing interventions that raise the ceiling, and then tuning the efficiency knobs to reach it faster at fixed compute.

Key Takeaways

The research team models RL post-training progress with sigmoidal compute-performance curves (pass-rate vs. log compute), enabling reliable extrapolation—unlike power-law fits on bounded metrics.

A best-practice recipe, ScaleRL, combines PipelineRL-k (asynchronous generator–trainer), CISPO loss, FP32 logits, prompt-level aggregation, advantage normalization, interruption-based length control, zero-variance filtering, and no-positive-resampling.

Using these fits, the research team predicted and matched extended runs up to 100k GPU-hours (8B dense) and ~50k GPU-hours (17B×16 MoE “Scout”) on validation curves.

Ablations show some choices move the asymptotic ceiling (A) (e.g., model scale, longer generation lengths, larger global batch), while others mainly improve compute efficiency (B) (e.g., aggregation/normalization, curriculum, off-policy pipeline).

The framework provides early forecasting to decide whether to scale a run, and improvements on the in-distribution validation track downstream metrics (e.g., AIME-24), supporting external validity.

Editorial Comments

This work turns RL post-training from trial-and-error into forecastable engineering. It fits sigmoidal compute-performance curves (pass-rate vs. log compute) to predict returns and decide when to stop or scale. It also provides a concrete recipe, ScaleRL, that uses PipelineRL-style asynchronous generation/training, the CISPO loss, and FP32 logits for stability. The study reports >400,000 GPU-hours of experiments and a single-run extension to 100,000 GPU-hours. Results support a clean split: some choices raise the asymptote; others mainly improve compute efficiency. That separation helps teams prioritize ceiling-moving changes before tuning throughput knobs.

Check out the PAPER. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Sigmoidal Scaling Curves Make Reinforcement Learning RL Post-Training Predictable for LLMs appeared first on MarkTechPost.

A Coding Implementation to Build a Unified Tool Orchestration Framewor …

In this tutorial, we build a compact, efficient framework that demonstrates how to convert tool documentation into standardized, callable interfaces, register those tools in a central system, and execute them as part of an automated pipeline. As we move through each stage, we create a simple converter, design mock bioinformatics tools, organize them into a registry, and benchmark both individual and multi-step pipeline executions. Through this process, we explore how structured tool interfaces and automation can streamline and modularize data workflows. Check out the FULL CODES here.

import re, json, time, random
from dataclasses import dataclass
from typing import Callable, Dict, Any, List, Tuple

@dataclass
class ToolSpec:
    name: str
    description: str
    inputs: Dict[str, str]
    outputs: Dict[str, str]

def parse_doc_to_spec(name: str, doc: str) -> ToolSpec:
    desc = doc.strip().splitlines()[0].strip() if doc.strip() else name
    arg_block = "\n".join([l for l in doc.splitlines() if "--" in l or ":" in l])
    inputs = {}
    for line in arg_block.splitlines():
        m = re.findall(r"(--?\w[\w-]*|\b\w+\b)\s*[:=]?\s*(\w+)?", line)
        for key, typ in m:
            k = key.lstrip("-")
            if k and k not in inputs and k not in ["Returns", "Output", "Outputs"]:
                inputs[k] = (typ or "str")
    if not inputs: inputs = {"in": "str"}
    return ToolSpec(name=name, description=desc, inputs=inputs, outputs={"out": "json"})

We start by defining the structure for our tools and writing a simple parser that converts plain documentation into a standardized tool specification. This helps us automatically extract parameters and outputs from textual descriptions. Check out the FULL CODES here.

def tool_fastqc(seq_fasta: str, min_len: int = 30) -> Dict[str, Any]:
    seqs = [s for s in re.split(r">[^\n]*\n", seq_fasta)[1:]]
    lens = [len(re.sub(r"\s+", "", s)) for s in seqs]
    q30 = sum(l >= min_len for l in lens) / max(1, len(lens))
    gc = sum(c in "GCgc" for s in seqs for c in s) / max(1, sum(lens))
    return {"n_seqs": len(lens), "len_mean": (sum(lens) / max(1, len(lens))), "pct_q30": q30, "gc": gc}

def tool_bowtie2_like(ref: str, reads: str, mode: str = "end-to-end") -> Dict[str, Any]:
    def revcomp(s):
        t = str.maketrans("ACGTacgt", "TGCAtgca"); return s.translate(t)[::-1]
    reads_list = [r for r in re.split(r">[^\n]*\n", reads)[1:]]
    ref_seq = "".join(ref.splitlines()[1:])
    hits = []
    for i, r in enumerate(reads_list):
        rseq = "".join(r.split())
        aligned = (rseq in ref_seq) or (revcomp(rseq) in ref_seq)
        hits.append({"read_id": i, "aligned": bool(aligned), "pos": ref_seq.find(rseq)})
    return {"n": len(hits), "aligned": sum(h["aligned"] for h in hits), "mode": mode, "hits": hits}

def tool_bcftools_like(ref: str, alt: str, win: int = 15) -> Dict[str, Any]:
    ref_seq = "".join(ref.splitlines()[1:]); alt_seq = "".join(alt.splitlines()[1:])
    n = min(len(ref_seq), len(alt_seq)); vars = []
    for i in range(n):
        if ref_seq[i] != alt_seq[i]: vars.append({"pos": i, "ref": ref_seq[i], "alt": alt_seq[i]})
    return {"n_sites": n, "n_var": len(vars), "variants": vars[:win]}

FASTQC_DOC = """FastQC-like quality control for FASTA
--seq_fasta: str --min_len: int Outputs: json"""
BOWTIE_DOC = """Bowtie2-like aligner
--ref: str --reads: str --mode: str Outputs: json"""
BCF_DOC = """bcftools-like variant caller
--ref: str --alt: str --win: int Outputs: json"""

We create mock implementations of bioinformatics tools such as FastQC, Bowtie2, and Bcftools. We define their expected inputs and outputs so they can be executed consistently through a unified interface. Check out the FULL CODES here.

@dataclass
class MCPTool:
    spec: ToolSpec
    fn: Callable[..., Dict[str, Any]]

class MCPServer:
    def __init__(self): self.tools: Dict[str, MCPTool] = {}
    def register(self, name: str, doc: str, fn: Callable[..., Dict[str, Any]]):
        spec = parse_doc_to_spec(name, doc); self.tools[name] = MCPTool(spec, fn)
    def list_tools(self) -> List[Dict[str, Any]]:
        return [dict(name=t.spec.name, description=t.spec.description, inputs=t.spec.inputs, outputs=t.spec.outputs) for t in self.tools.values()]
    def call_tool(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        if name not in self.tools: raise KeyError(f"tool {name} not found")
        spec = self.tools[name].spec
        kwargs = {k: args.get(k) for k in spec.inputs.keys()}
        return self.tools[name].fn(**kwargs)

server = MCPServer()
server.register("fastqc", FASTQC_DOC, tool_fastqc)
server.register("bowtie2", BOWTIE_DOC, tool_bowtie2_like)
server.register("bcftools", BCF_DOC, tool_bcftools_like)

Task = Tuple[str, Dict[str, Any]]
PIPELINES = {
    "rnaseq_qc_align_call": [
        ("fastqc", {"seq_fasta": "{reads}", "min_len": 30}),
        ("bowtie2", {"ref": "{ref}", "reads": "{reads}", "mode": "end-to-end"}),
        ("bcftools", {"ref": "{ref}", "alt": "{alt}", "win": 15}),
    ]
}

def compile_pipeline(nl_request: str) -> List[Task]:
    key = "rnaseq_qc_align_call" if re.search(r"rna|qc|align|variant|call", nl_request, re.I) else "rnaseq_qc_align_call"
    return PIPELINES[key]

We build a lightweight server that registers tools, lists their specifications, and allows us to call them programmatically. We also define a basic pipeline structure that outlines the sequence in which tools should run. Check out the FULL CODES here.

def mk_fasta(header: str, seq: str) -> str: return f">{header}\n{seq}\n"

random.seed(0)
REF_SEQ = "".join(random.choice("ACGT") for _ in range(300))
REF = mk_fasta("ref", REF_SEQ)
READS = mk_fasta("r1", REF_SEQ[50:130]) + mk_fasta("r2", "ACGT" * 15) + mk_fasta("r3", REF_SEQ[180:240])
ALT = mk_fasta("alt", REF_SEQ[:150] + "T" + REF_SEQ[151:])

def run_pipeline(nl: str, ctx: Dict[str, str]) -> Dict[str, Any]:
    plan = compile_pipeline(nl); results = []; t0 = time.time()
    for name, arg_tpl in plan:
        args = {k: (v.format(**ctx) if isinstance(v, str) else v) for k, v in arg_tpl.items()}
        out = server.call_tool(name, args)
        results.append({"tool": name, "args": args, "output": out})
    return {"request": nl, "elapsed_s": round(time.time() - t0, 4), "results": results}

We prepare small synthetic FASTA data for testing and implement a function that runs the entire pipeline. Here, we dynamically pass tool parameters and execute each step in the sequence. Check out the FULL CODES here.

def bench_individual() -> List[Dict[str, Any]]:
    cases = [
        ("fastqc", {"seq_fasta": READS, "min_len": 25}),
        ("bowtie2", {"ref": REF, "reads": READS, "mode": "end-to-end"}),
        ("bcftools", {"ref": REF, "alt": ALT, "win": 10}),
    ]
    rows = []
    for name, args in cases:
        t0 = time.time(); ok = True; err = None; out = None
        try: out = server.call_tool(name, args)
        except Exception as e: ok = False; err = str(e)
        rows.append({"tool": name, "ok": ok, "ms": int((time.time() - t0) * 1000), "out_keys": list(out.keys()) if ok else [], "err": err})
    return rows

def bench_pipeline() -> Dict[str, Any]:
    t0 = time.time()
    res = run_pipeline("Run RNA-seq QC, align, and variant call.", {"ref": REF, "reads": READS, "alt": ALT})
    ok = all(step["output"] for step in res["results"])
    return {"pipeline": "rnaseq_qc_align_call", "ok": ok, "ms": int((time.time() - t0) * 1000), "n_steps": len(res["results"])}

print("== TOOLS =="); print(json.dumps(server.list_tools(), indent=2))
print("\n== INDIVIDUAL BENCH =="); print(json.dumps(bench_individual(), indent=2))
print("\n== PIPELINE BENCH =="); print(json.dumps(bench_pipeline(), indent=2))
print("\n== PIPELINE RUN =="); print(json.dumps(run_pipeline("Run RNA-seq QC, align, and variant call.", {"ref": REF, "reads": READS, "alt": ALT}), indent=2))

We benchmark both individual tools and the full pipeline, capturing their outputs and performance metrics. Finally, we print the results to verify that each stage of the workflow runs successfully and integrates smoothly.

In conclusion, we develop a clear understanding of how lightweight tool conversion, registration, and orchestration can work together in a single environment. We observe how a unified interface allows us to connect multiple tools seamlessly, run them in sequence, and measure their performance. This hands-on exercise helps us appreciate how simple design principles, standardization, automation, and modularity can enhance the reproducibility and efficiency of computational workflows in any domain.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post A Coding Implementation to Build a Unified Tool Orchestration Framework from Documentation to Automated Pipelines appeared first on MarkTechPost.

Baidu’s PaddlePaddle Team Releases PaddleOCR-VL (0.9B): a NaViT-styl …

How do you convert complex, multilingual documents—dense layouts, small scripts, formulas, charts, and handwriting—into faithful structured Markdown/JSON with state-of-the-art accuracy while keeping inference latency and memory low enough for real deployments? Baidu’s PaddlePaddle group has released PaddleOCR-VL, a 0.9B-parameter vision-language model designed for end-to-end document parsing across text, tables, formulas, charts, and handwriting. The core model combines a NaViT-style (Native-resolution ViT) dynamic-resolution vision encoder with the ERNIE-4.5-0.3B decoder. It supports 109 languages.

https://ernie.baidu.com/blog/publication/PaddleOCR-VL_Technical_Report.pdf

Understanding the system design

PaddleOCR-VL is deployed as a two-stage pipeline. Stage one (PP-DocLayoutV2) performs page-level layout analysis: an RT-DETR detector localizes and classifies regions; a pointer network predicts reading order. Stage two (PaddleOCR-VL-0.9B) conducts element-level recognition conditioned on the detected layout. Final outputs are aggregated to Markdown and JSON for downstream consumption. This decoupling mitigates long-sequence decoding latency and instability that end-to-end VLMs face on dense, multi-column, mixed text–graphic pages.

At the model level, PaddleOCR-VL-0.9B integrates a NaViT-style dynamic high-resolution encoder (native-resolution sequence packing) with a 2-layer MLP projector and the ERNIE-4.5-0.3B language model; 3D-RoPE is used for positional representation. The technical report attributes lower hallucinations and better text-dense performance to native-resolution processing relative to fixed-resize or tiling approaches. The NaViT idea—patch-and-pack variable-resolution inputs without destructive resizing—originates from prior work showing improved efficiency and robustness; PaddleOCR-VL adopts this encoder style directly.

Benchmarks

PaddleOCR-VL achieves state-of-the-art results on OmniDocBench v1.5 and competitive or leading scores on v1.0, covering overall quality as well as sub-tasks (text edit distances, Formula-CDM, Table-TEDS/TEDS-S, and reading-order edit), with complementary strength on olmOCR-Bench and in-house handwriting, table, formula, and chart evaluations.

https://ernie.baidu.com/blog/publication/PaddleOCR-VL_Technical_Report.pdf

Key Takeaways

0.9B-parameter PaddleOCR-VL integrates a NaViT-style dynamic-resolution encoder with ERNIE-4.5-0.3B for document parsing.

Targets end-to-end extraction across text, tables, formulas, charts, and handwriting with structured Markdown/JSON outputs.

Claims SOTA performance on public document benchmarks with fast inference suitable for deployment.

Supports 109 languages, including small scripts and complex page layouts.

Editorial Comments

This release is meaningful because it joins a NaViT-style dynamic-resolution visual encoder with the lightweight ERNIE-4.5-0.3B decoder to deliver SOTA page-level document parsing and element-level recognition at practical inference cost. The two-stage PP-DocLayoutV2 → PaddleOCR-VL-0.9B design stabilizes reading order and preserves native typography cues, which matter for small scripts, formulas, charts, and handwriting across 109 languages. Structured Markdown/JSON outputs and optional vLLM/SGLang acceleration make the system operationally clean for production document intelligence.

Check out the Technical Paper, Model on HF, and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Baidu’s PaddlePaddle Team Releases PaddleOCR-VL (0.9B): a NaViT-style + ERNIE-4.5-0.3B VLM Targeting End-to-End Multilingual Document Parsing appeared first on MarkTechPost.

How TP ICAP transformed CRM data into real-time insights with Amazon B …

This post is co-written with Ross Ashworth at TP ICAP.
The ability to quickly extract insights from customer relationship management systems (CRMs) and vast amounts of meeting notes can mean the difference between seizing opportunities and missing them entirely. TP ICAP faced this challenge, having thousands of vendor meeting records stored in their CRM. Using Amazon Bedrock, their Innovation Lab built a production-ready solution that transforms hours of manual analysis into seconds by providing AI-powered insights, using a combination of Retrieval Augmented Generation (RAG) and text-to-SQL approaches.
This post shows how TP ICAP used Amazon Bedrock Knowledge Bases and Amazon Bedrock Evaluations to build ClientIQ, an enterprise-grade solution with enhanced security features for extracting CRM insights using AI, delivering immediate business value.
The challenge
TP ICAP had accumulated tens of thousands of vendor meeting notes in their CRM system over many years. These notes contained rich, qualitative information and details about product offerings, integration discussions, relationship insights, and strategic direction. However, this data was being underutilized and business users were spending hours manually searching through records, knowing the information existed but unable to efficiently locate it. The TP ICAP Innovation Lab set out to make the information more accessible, actionable, and quickly summarized for their internal stakeholders. Their solution needed to surface relevant information quickly, be accurate, and maintain proper context.
ClientIQ: TP ICAP’s custom CRM assistant
With ClientIQ, users can interact with their Salesforce meeting data through natural language queries. For example:

Ask questions about meeting data in plain English, such as “How can we improve our relationship with customers?”, “What do our clients think about our solution?”, or “How were our clients impacted by Brexit?”
Refine their queries through follow-up questions.
Apply filters to restrict model answers to a particular time period.
Access source documents directly through links to specific Salesforce records.

ClientIQ provides comprehensive responses while maintaining full traceability by including references to the source data and direct links to the original Salesforce records. The conversational interface supports natural dialogue flow, so users can refine and explore their queries without starting over. The following screenshot shows an example interaction (examples in this post use fictitious data and AnyCompany, a fictitious company, for demonstration purposes).

ClientIQ performs multiple tasks to fulfill a user’s request:

It uses a large language model (LLM) to analyze each user query to determine the optimal processing path.
It routes requests to one of two workflows (a minimal routing sketch follows this list):

The RAG workflow for getting insights from unstructured meeting notes. For example, “Was topic A discussed with AnyCompany the last 14 days?”
The SQL generation workflow for answering analytical queries by querying structured data. For example, “Get me a report on meeting count per region for last 4 weeks.”

It then generates the responses in natural language.
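A minimal sketch of how such a router could look with the Amazon Bedrock Converse API; the prompt, model ID, and labels below are assumptions for illustration, not TP ICAP's production code:

import boto3

bedrock = boto3.client("bedrock-runtime")

ROUTER_PROMPT = (
    "Classify the user question as either 'rag' (insights from unstructured "
    "meeting notes) or 'sql' (analytical query over structured data). "
    "Answer with exactly one word.\n\nQuestion: {question}"
)

def route_query(question: str, model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> str:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": ROUTER_PROMPT.format(question=question)}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0},
    )
    label = response["output"]["message"]["content"][0]["text"].strip().lower()
    return "sql" if "sql" in label else "rag"

# Example: an analytical question should route to the text-to-SQL workflow
print(route_query("Get me a report on meeting count per region for the last 4 weeks"))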
ClientIQ respects existing permission boundaries and access controls, helping verify users only access the data they’re authorized to. For example, if a user only has access to their regional accounts in the CRM system, ClientIQ only returns information from these accounts.

Solution overview
Although the team considered using their CRM’s built-in AI assistant, they opted to develop a more customized, cost-effective solution that would precisely match their requirements. They partnered with AWS and built an enterprise-grade solution powered by Amazon Bedrock. With Amazon Bedrock, TP ICAP evaluated and selected the best models for their use case and built a production-ready RAG solution in weeks rather than months, without having to manage the underlying infrastructure. They specifically used the following Amazon Bedrock managed capabilities:

Amazon Bedrock foundation models – Amazon Bedrock provides a range of foundation models (FMs) from providers, including Anthropic, Meta, Mistral AI, and Amazon, accessible through a single API. TP ICAP experimented with different models for various tasks and selected the best model for each task, balancing latency, performance, and cost. For instance, they used Anthropic’s Claude 3.5 Sonnet for classification tasks and Amazon Nova Pro for text-to-SQL generation. Because Amazon Bedrock is fully managed, they didn’t need to spend time setting up infrastructure for hosting these models, reducing the time to delivery.
Amazon Bedrock Knowledge Bases – The FMs needed access to the information in TP ICAP’s Salesforce system to provide accurate, relevant responses. TP ICAP used Amazon Bedrock Knowledge Bases to implement RAG, a technique that enhances generative AI responses by incorporating relevant data from your organization’s knowledge sources. Amazon Bedrock Knowledge Bases is a fully managed RAG capability with built-in session context management and source attribution. The final implementation delivers precise, contextually relevant responses while maintaining traceability to source documents.
Amazon Bedrock Evaluations – For consistent quality and performance, the team wanted to implement automated evaluations. By using Amazon Bedrock Evaluations and the RAG evaluation tool for Amazon Bedrock Knowledge Bases in their development environment and CI/CD pipeline, they were able to evaluate and compare FMs with human-like quality. They evaluated different dimensions, including response accuracy, relevance, and completeness, and quality of RAG retrieval.

Since launch, their approach scales efficiently to analyze thousands of responses and facilitates data-driven decision-making about model and inference parameter selection and RAG configuration. The following diagram showcases the architecture of the solution.

The user query workflow consists of the following steps:

The user logs in through a frontend React application, hosted in an Amazon Simple Storage Service (Amazon S3) bucket and accessible only within the organization’s network through an internal-only Application Load Balancer.
After logging in, a WebSocket connection is opened between the client and Amazon API Gateway to enable real-time, bi-directional communication.
After the connection is established, an AWS Lambda function (connection handler) is invoked, which processes the payload, logs tracking data to Amazon DynamoDB, and publishes request data to an Amazon Simple Notification Service (Amazon SNS) topic for downstream processing.
Lambda functions for different types of tasks consume messages from Amazon Simple Queue Service (Amazon SQS) for scalable and event-driven processing.
The Lambda functions use Amazon Bedrock FMs to determine whether a question is best answered by querying structured data in Amazon Athena or by retrieving information from an Amazon Bedrock knowledge base.
After processing, the answer is returned to the user in real time using the existing WebSocket connection through API Gateway.

Data ingestion
ClientIQ needs to be regularly updated with the latest Salesforce data. Rather than using an off-the-shelf option, TP ICAP developed a custom connector to interface with their highly tailored Salesforce implementation and ingest the latest data to Amazon S3. This bespoke approach provided the flexibility needed to handle their specific data structures while remaining simple to configure and maintain. The connector, which employs Salesforce Object Query Language (SOQL) queries to retrieve the data, runs daily and has proven to be fast and reliable. To optimize the quality of the results during the RAG retrieval workflow, TP ICAP opted for a custom chunking approach in their Amazon Bedrock knowledge base. The custom chunking happens as part of the ingestion process, where the connector splits the data into individual CSV files, one per meeting. These files are also automatically tagged with relevant topics from a predefined list, using Amazon Nova Pro, to further increase the quality of the retrieval results. The final outputs in Amazon S3 contain a CSV file per meeting and a matching JSON metadata file containing tags such as date, division, brand, and region. The following is an example of the associated metadata file:

{
  "metadataAttributes": {
    "Tier": "Bronze",
    "Number_Date_of_Visit": 20171130,
    "Author_Region_C": "AMER",
    "Brand_C": "Credit",
    "Division_C": "Credit",
    "Visiting_City_C": "Chicago",
    "Client_Name": "AnyCompany"
  }
}

As soon as the data is available in Amazon S3, an AWS Glue job is triggered to populate the AWS Glue Data Catalog. This is later used by Athena when querying the Amazon S3 data.
The Amazon Bedrock knowledge base is also synced with Amazon S3. As part of this process, each CSV file is converted into embeddings using Amazon Titan v1 and indexed in the vector store, Amazon OpenSearch Serverless. The metadata is also ingested and available for filtering the vector store results during retrieval, as described in the following section.
Boosting RAG retrieval quality
In a RAG query workflow, the first step is to retrieve the documents that are relevant to the user’s query from the vector store and append them to the query as context. Common ways to find the relevant documents include semantic search, keyword search, or a combination of both, referred to as hybrid search. ClientIQ uses hybrid search to first filter documents based on their metadata and then perform semantic search within the filtered results. This pre-filtering provides more control over the retrieved documents and helps disambiguate queries. For example, a question such as “find notes from executive meetings with AnyCompany in Chicago” can mean meetings with any AnyCompany division that took place in Chicago or meetings with AnyCompany’s division headquartered in Chicago.
TP ICAP used the manual metadata filtering capability in Amazon Bedrock Knowledge Bases to implement hybrid search in their vector store, OpenSearch Serverless. With this approach, in the preceding example, the documents are first pre-filtered for “Chicago” as Visiting_City_C. After that, a semantic search is performed to find the documents that contain executive meeting notes for AnyCompany. The final output contains notes from meetings in Chicago, which is what is expected in this case. The team enhanced this functionality further by using the implicit metadata filtering of Amazon Bedrock Knowledge Bases. This capability relies on Amazon Bedrock FMs to automatically analyze the query, understand which values can be mapped to metadata fields, and rewrite the query accordingly before performing the retrieval.
Finally, for additional precision, users can manually specify filters through the application UI, giving them greater control over their search results. This multi-layered filtering approach significantly improves context and final response accuracy while maintaining fast retrieval speeds.
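As an illustration of metadata pre-filtering (a sketch with a placeholder knowledge base ID and the example field from above, not TP ICAP's code), the Amazon Bedrock Knowledge Bases Retrieve API accepts a vector search filter alongside the query:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="KB_ID_PLACEHOLDER",  # assumed knowledge base ID
    retrievalQuery={"text": "executive meeting notes for AnyCompany"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 10,
            # Pre-filter on metadata before semantic search runs
            "filter": {"equals": {"key": "Visiting_City_C", "value": "Chicago"}},
        }
    },
)

for result in response["retrievalResults"]:
    print(result["content"]["text"][:120], result.get("metadata"))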
Security and access control
To maintain Salesforce’s granular permissions model in the ClientIQ solution, TP ICAP implemented a security framework using Okta group claims mapped to specific divisions and regions. When a user signs in, their group claims are attached to their session. When the user asks a question, these claims are automatically matched against metadata fields in Athena or OpenSearch Serverless, depending on the path followed.
For example, if a user has access to see information for EMEA only, then the documents are automatically filtered by the EMEA region. In Athena, this is done by automatically adjusting the query to include this filter. In Amazon Bedrock Knowledge Bases, this is done by introducing an additional metadata field filter for region=EMEA in the hybrid search. This is highlighted in the following diagram.

Results that don’t match the user’s permission tags are filtered out, so that users can only access data they’re authorized to see. This unified security model maintains consistency between Salesforce permissions and ClientIQ access controls, preserving data governance across solutions.
The team also developed a custom administrative interface for admins who manage permissions in Salesforce, allowing them to add or remove users from groups using Okta’s APIs.
Automated evaluation
The Innovation Lab team faced a common challenge in building their RAG application: how to scientifically measure and improve its performance. To address this, they developed an evaluation strategy using Amazon Bedrock Evaluations that involves three phases:

Ground truth creation – They worked closely with stakeholders and testing teams to develop a comprehensive set of 100 representative question-answer pairs that mirrored real-world interactions.
RAG evaluation – In their development environment, they programmatically triggered RAG evaluations in Amazon Bedrock Evaluations to process the ground truth data in Amazon S3 and run comprehensive assessments. They evaluated different chunking strategies, including default and custom chunking, tested different embedding models for retrieval, and compared FMs for generation using a range of inference parameters.
Metric-driven optimization – Amazon Bedrock generates evaluation reports containing metrics, scores, and insights upon completion of an evaluation job. The team tracked content relevance and content coverage for retrieval and quality, and responsible AI metrics such as response relevance, factual accuracy, retrieval precision, and contextual comprehension for generation. They used the evaluation reports to make optimizations until they reached their performance goals.

The following diagram illustrates this approach.

In addition, they integrated RAG evaluation directly into their continuous integration and continuous delivery (CI/CD) pipeline, so every deployment automatically validates that changes don’t degrade response quality. The automated testing approach gives the team confidence to iterate quickly while maintaining consistently high standards for the production solution.
Business outcomes
ClientIQ has transformed how TP ICAP extracts value from their CRM data. Following the initial launch with 20 users, the results showed that the solution has driven a 75% reduction in time spent on research tasks. Stakeholders also reported an improvement in insight quality, with more comprehensive and contextual information being surfaced. Building on this success, the TP ICAP Innovation Lab plans to evolve ClientIQ into a more intelligent virtual assistant capable of handling broader, more complex tasks across multiple enterprise systems. Their mission remains consistent: to help technical and non-technical teams across the business to unlock business benefits with generative AI.
Conclusion
In this post, we explored how the TP ICAP Innovation Lab team used Amazon Bedrock FMs, Amazon Bedrock Knowledge Bases, and Amazon Bedrock Evaluations to transform thousands of meeting records from an underutilized resource into a valuable asset and accelerate time to insights while maintaining enterprise-grade security and governance. Their success demonstrates that with the right approach, businesses can implement production-ready AI solutions and deliver business value in weeks. To learn more about building similar solutions with Amazon Bedrock, visit the Amazon Bedrock documentation or discover real-world success stories and implementations on the AWS Financial Services Blog.

About the authors
Ross Ashworth works in TP ICAP’s AI Innovation Lab, where he focuses on enabling the business to harness Generative AI across a range of projects. With over a decade of experience working with AWS technologies, Ross brings deep technical expertise to designing and delivering innovative, practical solutions that drive business value. Outside of work, Ross is a keen cricket fan and former amateur player. He is now a member at The Oval, where he enjoys attending matches with his family, who also share his passion for the sport.
Anastasia Tzeveleka is a Senior Generative AI/ML Specialist Solutions Architect at AWS. Her experience spans the entire AI lifecycle, from collaborating with organizations training cutting-edge Large Language Models (LLMs) to guiding enterprises in deploying and scaling these models for real-world applications. In her spare time, she explores new worlds through fiction.

Principal Financial Group accelerates build, test, and deployment of A …

This guest post was written by Mulay Ahmed and Caroline Lima-Lane of Principal Financial Group. The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post.
With US contact centers that handle millions of customer calls annually, Principal Financial Group® wanted to modernize their customer call experience. In the post Principal Financial Group increases Voice Virtual Assistant performance using Genesys, Amazon Lex, and Amazon QuickSight, we discussed the overall Principal Virtual Assistant solution using Genesys Cloud, Amazon Lex V2, multiple AWS services, and a custom reporting and analytics solution using Amazon QuickSight.
This post focuses on the acceleration of the Virtual Assistant (VA) platform delivery processes through automated build, testing, and deployment of an Amazon Lex V2 bot (including other database and analytics resources described later in this post) using a GitHub continuous integration and delivery (CI/CD) pipeline with automated execution of the Amazon Lex V2 Test Workbench for quality assurance. This solution helps Principal® scale and maintain VA implementations with confidence and speed using infrastructure as code (IaC), configuration as code (CaC), and an automated CI/CD approach instead of testing and deploying the Amazon Lex V2 bot on the AWS Management Console.
Principal is a global financial company with nearly 20,000 employees passionate about improving the wealth and well-being of people and businesses. In business for 145 years, Principal is helping approximately 70 million customers (as of Q4 2024) plan, protect, invest, and retire, while working to support the communities where it does business. The enterprise virtual assistant engineering team at Principal, in collaboration with AWS, used Amazon Lex V2 to implement a voice virtual assistant to provide self-service and routing capabilities for contact center customers. The following engineering opportunities were recognized and prioritized:

Elimination of console-driven configuration, testing, and deployment of an Amazon Lex V2 bot
Collaboration through structured version control and parallel development workflows for multiple team members
Acceleration of development cycles with automated build, test, and deployment processes for Amazon Lex bot creation and optimization
Enhanced quality assurance controls through automated testing gates and coding standard validation for reliable releases

With the automation solutions described in the post, as of September 2024, Principal has accelerated development efforts by 50% across all environments (development, pilot, and production) through streamlined implementation and deployment processes. This solution also enhances deployment reliability through automated workflows, providing consistent updates while minimizing errors across development, pilot, and production environments, and maximizes development efficiency by integrating the Test Workbench with GitHub, enabling version control and automated testing. By automating the Test Workbench and integrating it with GitHub, the solution also strengthens the CI/CD pipeline by maintaining alignment between test files and bot versions, creating a more agile and reliable development process.
Solution overview
The solution uses the services described in Principal Financial Group increases Voice Virtual Assistant performance using Genesys, Amazon Lex, and Amazon QuickSight. The following services/APIs are also used as part of this solution:

AWS Step Functions to orchestrate the deployment workflow
The Test Workbench APIs, which are invoked within the Step Functions state machine as a sequence of tasks
AWS Lambda to process data to support some of the Test Workbench APIs inputs

VA code organization and management
The Principal VA implementation uses Genesys Cloud as the contact center application and the following AWS services organized as different stacks:

Bot stack:

The Amazon Lex V2 CDK is used for defining and deploying the bot infrastructure
Lambda functions handle the bot logic and manage routing logic (for Amazon Lex and Genesys Cloud)
AWS Secrets Manager stores secrets for calling downstream systems endpoints

Testing stack:

Step Functions orchestrates the testing workflow
Lambda functions are used in the testing process
Test files contain test cases and scenarios in Test Workbench format
Simulated data is used to simulate various scenarios for testing without connecting to downstream systems or APIs

Data stack:

Amazon DynamoDB manages and stores bot prompts
Amazon Simple Storage Service (Amazon S3) stores testing data

Analytics stack:

Amazon S3 stores logs and processed data
Amazon Data Firehose streams logs to Amazon S3
Lambda orchestrates extract, transform, and load (ETL) operations
AWS Glue manages the Data Catalog and ETL jobs
Amazon Athena is used for querying and analyzing analytics data in Amazon S3
Amazon QuickSight is used for data visualization and business intelligence

CI/CD pipeline:

GitHub serves as the source code repository
A GitHub workflow automates the CI/CD pipeline

Amazon Lex V2 configuration as code and CI/CD workflow
The following diagram illustrates how multiple developers can work on changes to the bot stack and test in parallel by deploying changes locally or using a GitHub workflow.

The process consists of the following steps:

A developer clones the repository and creates a new branch for changes.
Developer A or B makes changes to the bot configuration or Lambda functions using code.
The developer creates a pull request.
The developer deploys the Amazon Lex V2 CDK stack through one of the following methods:

Create a pull request and ensure all code quality and standards checks are passing.
Merge it with the main branch.
Deploy the Amazon Lex V2 CDK stack from their local environment.

The developer runs the Test Workbench as part of the CI/CD pipeline or from their local environment using the automation scripts.

Test results are displayed in GitHub Actions and the terminal (if run locally).
The pipeline succeeds only if defined checks such as linting, unit testing, infrastructure testing and integration, and Test Workbench functional testing pass.

After all tests and checks pass, a new pre-release can be drafted to deploy to the staging environment. After staging deployment and testing (automated and UAT) is successful, a new release can be created for production deployment (after manual review and approval).

Amazon Lex Test Workbench automation
The solution uses GitHub and AWS services, such as Step Functions state machines and Lambda functions, to orchestrate the entire Amazon Lex V2 bot testing process (instead of the existing manual testing process for Amazon Lex). The pipeline uploads the test sets, invokes Lambda functions that interact with the Amazon Lex V2 bot and the Test Workbench, and then uses another Lambda function to read the test results and report them in the pipeline.
To maintain consistent, repeatable evaluations of your Amazon Lex V2 bots, it’s essential to manage and organize your test datasets effectively. The following key practices help keep test sets up-to-date:

Test set files are version-controlled and linked to each bot and its version
Separate golden test sets are created for each intent and updated on a regular basis to include production customer utterances, increasing intent recognition rates
The versioned test data is deployed as part of each bot deployment in non-production environments

The following diagram illustrates the end-to-end automated process for testing Amazon Lex V2 bots after each deployment.

The post-deployment workflow consists of the following steps:

The developer checks the test file into the GitHub repository (or deploys directly from their local environment). After each bot deployment, GitHub triggers the test script using the GitHub workflow.
The test scripts upload the test files to an S3 bucket.
The test script invokes a Step Functions state machine, using a bot name and list of file keys as inputs.
Amazon Lex Model API calls are invoked to get the bot ID (ListBots) and alias (ListBotAliases).
Each test file key is iterated within a Map state, where the following tasks are executed:

Call Amazon Lex APIs to start import jobs:

StartImport – Creates a test set ID and stores the test set at a specified Amazon S3 location.
DescribeImport – Checks if the status of StartImport is complete.

Run the test set:

StartTestExecution – Creates a test execution ID and executes the test.
ListTestExecutions – Gathers all test executions. A Lambda function extracts the current test execution ID and its status.

Get test results.

When the test is complete:

The ListTestExecutionResultItems API is invoked to gather overall test results.
The ListTestExecutionResultItems API is invoked to fetch test failure details at the utterance level if present.

A Lambda function orchestrates the final cleanup and reporting:

DeleteTestSet cleans up test sets that are no longer needed from an S3 bucket.
The pipeline outputs the results; if there are test failures, they are listed in the GitHub Actions or local terminal job report.

Developers then manually review the test result files in the Test Workbench console.
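The following sketch outlines the core API sequence in boto3; the method names mirror the Test Workbench APIs listed above, but the parameter shapes are simplified assumptions, and the production solution wraps these calls in Step Functions tasks and Lambda functions rather than a single script:

import time
import boto3

lex = boto3.client("lexv2-models")

def run_test_set(bot_name: str, test_set_id: str, locale: str = "en_US") -> dict:
    # Resolve bot ID and alias (ListBots / ListBotAliases)
    bot = next(b for b in lex.list_bots()["botSummaries"] if b["botName"] == bot_name)
    alias = lex.list_bot_aliases(botId=bot["botId"])["botAliasSummaries"][0]

    # Start the Test Workbench execution against the bot alias (StartTestExecution)
    execution = lex.start_test_execution(
        testSetId=test_set_id,
        apiMode="NonStreaming",
        target={"botAliasTarget": {"botId": bot["botId"],
                                   "botAliasId": alias["botAliasId"],
                                   "localeId": locale}},
    )
    execution_id = execution["testExecutionId"]

    # Poll until the execution completes (ListTestExecutions)
    while True:
        status = next(e["testExecutionStatus"]
                      for e in lex.list_test_executions()["testExecutions"]
                      if e["testExecutionId"] == execution_id)
        if status in ("Completed", "Failed", "Stopped"):
            break
        time.sleep(15)

    # Gather overall results (ListTestExecutionResultItems)
    return lex.list_test_execution_result_items(
        testExecutionId=execution_id,
        resultFilterBy={"resultTypeFilter": "OverallTestResults"},
    )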

Conclusion
In this post, we presented how Principal accelerated the development, testing, and deployment of Amazon Lex V2 bots and supporting AWS services using code. In addition to the reporting and analytics solution, this provides a robust solution for the continued enhancement and maintenance of the Virtual Assistant ecosystem.
By automating Test Workbench processes and integrating them with version control and CI/CD processes, Principal was able to decrease testing and deployment time, increase test coverage, streamline their development workflows, and deliver quality conversational experience to customers. For a deeper dive into other relevant services, refer to Evaluating Lex V2 bot performance with the Test Workbench.
AWS and Amazon are not affiliates of any company of the Principal Financial Group. This communication is intended to be educational in nature and is not intended to be taken as a recommendation. Insurance products issued by Principal National Life Insurance Co (except in NY) and Principal Life Insurance Company. Plan administrative services offered by Principal Life. Principal Funds, Inc. is distributed by Principal Funds Distributor, Inc. Securities offered through Principal Securities, Inc., member SIPC and/or independent broker/dealers. Referenced companies are members of the Principal Financial Group, Des Moines, IA 50392. ©2025 Principal Financial Services, Inc. 4373397-042025

About the authors
Mulay Ahmed is a Solutions Architect at Principal with expertise in architecting complex enterprise-grade solutions, including AWS Cloud implementations.
Caroline Lima-Lane is a Software Engineer at Principal with a vast background in the AWS Cloud space.

Beyond vibes: How to properly select the right LLM for the right task

Choosing the right large language model (LLM) for your use case is becoming both increasingly challenging and essential. Many teams rely on one-time (ad hoc) evaluations based on limited samples from trending models, essentially judging quality on “vibes” alone.
This approach involves experimenting with a model’s responses and forming subjective opinions about its performance. However, relying on these informal tests of model output is risky and unscalable, often misses subtle errors, overlooks unsafe behavior, and provides no clear criteria for improvement.
A more holistic approach entails evaluating the model based on metrics around qualitative and quantitative aspects, such as quality of response, cost, and performance. This also requires the evaluation system to compare models based on these predefined metrics and give a comprehensive output comparing models across all these areas. However, these evaluations don’t scale effectively enough to help organizations take full advantage of the model choices available.
In this post, we discuss an approach that can guide you to build comprehensive and empirically driven evaluations that can help you make better decisions when selecting the right model for your task.
From vibes to metrics and why it matters
Human brains excel at pattern-matching, and models are designed to be convincing. Although a vibes-based approach can serve as a starting point, without systematic evaluation, we lack the evidence needed to trust a model in production. This limitation makes it difficult to compare models fairly or identify specific areas for improvement.
The limitations of “just trying it out” include:

Subjective bias – Human testers might favor responses based on style or tone rather than factual accuracy. Users can be swayed by “exotic words” or formatting. A model whose writing sounds confident might win on vibes while actually introducing inaccuracies.
Lack of coverage – A few interactive prompts won’t cover the breadth of real-world inputs, often missing edge cases that reveal model weaknesses.
Inconsistency – Without defined metrics, evaluators might disagree on why one model is better based on different priorities (brevity vs. factual detail), making it difficult to align model choice with business goals.
No trackable benchmarks – Without quantitative metrics, it’s impossible to track accuracy degradation during prompt optimization or model changes.

Established benchmarks like MMLU, HellaSwag, and HELM offer valuable standardized assessments across reasoning, knowledge retrieval, and factuality dimensions, efficiently helping narrow down candidate models without extensive internal resources.
However, exclusive reliance on these benchmarks is problematic: they measure generalized rather than domain-specific performance, prioritize easily quantifiable metrics over business-critical capabilities, and can’t account for your organization’s unique constraints around latency, costs, and safety requirements. A high-ranking model might excel at trivia while failing with your industry terminology or producing responses too verbose or costly for your specific implementation.
A robust evaluation framework is vital for building trust, and no single metric can capture what makes an LLM response “good.” Instead, you must evaluate across multiple dimensions:

Accuracy – Does the model produce accurate information? Does it fully answer the question or cover required points? Is the response on-topic, contextually relevant, well-structured, and logically coherent?
Latency – How fast does the model produce a response? For interactive applications, response time directly impacts user experience.
Cost-efficiency – What is the monetary cost per API call or token? Different models have varying pricing structures and infrastructure costs.

By evaluating along these facets, you can make informed decisions aligned with product requirements. For example, if robustness under adversarial inputs is crucial, a slightly slower but more aligned model might be preferable. For simple internal tasks, trading some accuracy for cost-efficiency might make sense.
Although many metrics require qualitative judgment, you can structure and quantify these with careful evaluation methods. Industry best practices combine quantitative metrics with human or AI raters for subjective criteria, moving from “I like this answer more” to “Model A scored 4/5 on correctness and 5/5 on completeness.” This detail enables meaningful discussion and improvement, and technical managers should demand such accuracy measurements before deploying any model.
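To make this concrete, here is a minimal Python sketch (illustrative only; the model names, ratings, and weights are hypothetical and not taken from any specific evaluation) of how judge ratings, latency, and cost could be folded into a single comparable score:

from dataclasses import dataclass

@dataclass
class ModelRun:
    name: str
    correctness: float           # 1-5 rating from a human or LLM judge
    completeness: float          # 1-5 rating
    latency_s: float             # mean seconds per response
    cost_per_1k_requests: float  # USD

def decision_score(run: ModelRun, weights=(0.5, 0.3, 0.1, 0.1)) -> float:
    # Blend quality, speed, and cost into one number (higher is better).
    w_corr, w_comp, w_speed, w_cost = weights
    quality = (run.correctness / 5) * w_corr + (run.completeness / 5) * w_comp
    speed = w_speed / (1 + run.latency_s)                # faster responses score higher
    frugality = w_cost / (1 + run.cost_per_1k_requests)  # cheaper models score higher
    return quality + speed + frugality

candidates = [
    ModelRun("Model-A", correctness=4.0, completeness=3.5, latency_s=1.2, cost_per_1k_requests=3.0),
    ModelRun("Model-D", correctness=4.8, completeness=4.6, latency_s=4.5, cost_per_1k_requests=12.0),
]
for run in sorted(candidates, key=decision_score, reverse=True):
    print(f"{run.name}: {decision_score(run):.3f}")

Adjusting the weights is where business priorities enter: a latency-sensitive chat application would raise the speed weight, whereas an offline batch workload might weight cost more heavily.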
Unique evaluation dimensions for LLM performance
In this post, we make the case for structured, multi-metric assessment of foundation models (FMs) and discuss the importance of creating ground truth as a prerequisite to model evaluation. We use the open source 360-Eval framework as a practical, code-first tool to orchestrate rigorous evaluations across multiple models and cloud providers.
We demonstrate the approach by comparing four LLMs within Amazon Bedrock across a spectrum of correctness, completeness, relevance, format, coherence, and instruction following, to understand how well each model’s responses match our ground truth dataset. Our evaluation measures the accuracy, latency, and cost of each model, painting a 360° picture of their strengths and weaknesses.
To evaluate FMs, it’s highly recommended that you break up model performance into distinct dimensions. The following is a sample set of criteria and what each one measures:

Correctness (accuracy) – The factual accuracy of the model’s output. For tasks with a known answer, you can measure this using exact match or cosine similarity; for open-ended responses, you might rely on human or LLM judgment of factual consistency.
Completeness – The extent to which the model’s response addresses all parts of the query or problem. In human/LLM evaluations, completeness is often scored on a scale (did the answer partly address or fully address the query).
Relevance – Measures if the content of the response is on-topic and pertinent to the user’s request. Relevance scoring looks at how well the response stays within scope. High relevance means the model understood the query and stayed focused on it.
Coherence – The logical flow and clarity of the response. Coherence can be judged by human or LLM evaluators, or approximated with metrics like coherence scores or by checking discourse structure.
Following instructions – How well the model obeys explicit instructions in the prompt (formatting, style, length, and so on). For example, if asked “List three bullet-point advantages,” does the model produce a three-item bullet list? If the system or user prompt sets a role or tone, does the model adhere to it? Instruction-following can be evaluated by programmatically checking if the output meets the specified criteria (for example, contains the required sections) or using evaluator ratings.
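
As a simple illustration of the programmatic checks mentioned in the last item (a sketch with a hypothetical helper, not part of any framework), the bullet-list instruction could be verified like this in Python:

import re

def follows_bullet_instruction(output: str, expected_items: int = 3) -> bool:
    # Count lines that look like bullet or numbered list items.
    bullets = [line for line in output.splitlines()
               if re.match(r"^\s*([-*•]|\d+\.)\s+", line)]
    return len(bullets) == expected_items

print(follows_bullet_instruction("- fast\n- cheap\n- accurate"))        # True
print(follows_bullet_instruction("The three advantages are speed..."))  # False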

Performing such comprehensive evaluations manually can be extremely time-consuming. Each model needs to be run on dozens if not hundreds of prompts, and each output must be checked against all metrics. Doing this by hand or with one-off scripts is error-prone and doesn’t scale. In practice, these dimensions can be scored automatically using LLM-as-a-judge or human feedback. This is where evaluation frameworks come into play.
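Before turning to dedicated tooling, here is a minimal LLM-as-a-judge sketch using the Amazon Bedrock Converse API through boto3. The judge model ID, rubric, and JSON output format are assumptions chosen for illustration; production code would need more robust parsing and error handling:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")  # assumes AWS credentials and Bedrock model access

JUDGE_TEMPLATE = """You are grading a model response against a reference answer.
Question: {question}
Reference answer: {reference}
Candidate response: {candidate}

Return only JSON in the form {{"correctness": 1-5, "completeness": 1-5, "rationale": "..."}}"""

def judge(question: str, reference: str, candidate: str,
          judge_model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> dict:
    # Ask the judge model for a structured verdict and parse it.
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference, candidate=candidate)
    response = bedrock.converse(
        modelId=judge_model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

Running such a judge over every prompt-response pair and aggregating the scores per model is exactly the bookkeeping an evaluation framework can take off your hands.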
After you’ve chosen an evaluation philosophy, it’s wise to invest in tooling to support it. Instead of combining ad hoc evaluation scripts, you can use dedicated frameworks to streamline the process of testing LLMs across many metrics and models.
Automating 360° model evaluation with 360-Eval
360-Eval is a lightweight solution that captures the depth and breadth of model evaluation. You can use it as an evaluation orchestrator to define the following:

Your dataset of test prompts and respective golden answers (expected answers or reference outputs)
Models you want to evaluate
The metrics and tasks the framework evaluates the models against

The tool is designed to capture relevant and user-defined dimensions of model performance in one workflow, supporting multi-model comparisons out of the box. You can evaluate models hosted in Amazon Bedrock or Amazon SageMaker, or call external APIs—the framework is flexible in integrating different model endpoints. This is ideal for a scenario where you might want to use the full power of Amazon Bedrock models without having to sacrifice performance.
The framework consists of the following key components:

Data configuration – You specify your evaluation dataset; for example, a JSONL file of prompts with optional expected outputs, the task, and a description. The framework can also work with a custom prompt CSV dataset you provide.
API gateway – Using the versatile LiteLLM framework, it abstracts the API differences so the evaluation loop can treat all models uniformly (see the sketch after this list). Inference metadata such as time-to-first-token (TTFT), time-to-last-token (TTLT), total token output, API error counts, and pricing is also captured.
Evaluation architecture – 360-Eval uses LLM-as-a-judge to score and weight model outputs on qualities like correctness or relevance. You can feed all the metrics you care about into one pipeline. Each evaluation algorithm produces a score and verdict per test case per model.
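
As a rough illustration of the API gateway idea (not 360-Eval’s actual code; the model identifiers are placeholders), a LiteLLM-based loop can call different providers through one interface while recording latency and token usage:

import time
from litellm import completion  # LiteLLM normalizes provider-specific APIs

# Placeholder model identifiers; substitute the models you want to compare.
MODELS = [
    "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
    "bedrock/amazon.nova-pro-v1:0",
]

def run_prompt(model: str, prompt: str) -> dict:
    # Call one model and record end-to-end latency plus token usage.
    start = time.perf_counter()
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    elapsed = time.perf_counter() - start
    return {
        "model": model,
        "output": response.choices[0].message.content,
        "latency_s": round(elapsed, 2),
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
    }

for model in MODELS:
    print(run_prompt(model, "Extract the dominant entity and its attributes from: <requirement text>"))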

Choosing the right model: A real-world example
For our example use case, AnyCompany is developing an innovative software as a service (SaaS) solution that streamlines database architecture for developers and businesses. Their platform accepts natural language requirements as input and uses LLMs to automatically generate PostgreSQL-specific data models. Users can describe their requirements in plain English—for example, “I need a cloud-based order management platform designed to streamline operations for small to medium businesses”—and the tool intelligently extracts the entity and attribute information and creates an optimized table structure specifically for PostgreSQL. This solution avoids hours of manual entity and database design work, reduces the expertise barrier for database modeling, and supports PostgreSQL best practices even for teams without dedicated database specialists.
In our example, we provide our model a set of requirements (as prompts) relevant to the task and ask it to extract the dominant entity and its attributes (a data extraction task) and also produce a relevant create table statement using PostgreSQL (a text-to-SQL task).
Example prompt:

Given the following requirement, extract the data model and attributes that you will
recommend. I need the output in a single line. You can provide the attributes separated
by comma: “A global manufacturing company uses a web-based supply chain management
system to track inventory across 50 locations, manage relationships with over 200
suppliers, forecast material needs, and automatically trigger purchase orders when stock
levels reach predefined thresholds……”

The following table shows our task types, criteria, and golden answers for this example prompt. We have shortened the prompt for brevity. In a real-world use case, your requirements might span multiple paragraphs.

task_type: DATA EXTRACTION
task_criteria: Check if the extracted entity and attributes match the requirements
golden_answer: Supply Chain Inventory: inventory_id, product_sku, location_id, quantity_on_hand, reorder_threshold, supplier_id, last_order_date, forecasted_demand, cost_per_unit, status, last_updated

task_type: TEXT-TO-SQL
task_criteria: Given the requirements, check if the generated CREATE TABLE statement matches the requirements
golden_answer:

CREATE TABLE supply_chain_inventory (
    inventory_id SERIAL PRIMARY KEY,
    product_sku VARCHAR(50) NOT NULL,
    location_id INTEGER NOT NULL,
    quantity_on_hand INTEGER NOT NULL,
    reorder_threshold INTEGER NOT NULL,
    supplier_id INTEGER,
    last_order_date TIMESTAMP,
    forecasted_demand NUMERIC(10,2),
    cost_per_unit NUMERIC(10,2),
    status VARCHAR(20),
    last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

AnyCompany wants to find a model that will solve the task in the fastest and most cost-effective way, without compromising on quality.
360-Eval UI
To reduce the complexity of the process, we have built a UI on top of the evaluation engine.
The UI_README.md file has instructions to launch and run the evaluation using the UI. You must also follow the instructions in the README.md to install the Python packages as prerequisites and enable Amazon Bedrock model access.
Let’s explore the different pages in the UI in more detail.
Setup page
When you launch the UI, you land on the initial Setup page. Here you select your evaluation data, define your label, describe your task as precisely as possible, and set the temperature the models will use during evaluation. You then select the models you want to evaluate against your dataset and the judges that will evaluate the models’ accuracy (using custom metrics plus the standard quality and relevance metrics), configure pricing and AWS Region options, and finally configure how the evaluation runs, such as concurrency, requests per minute, and experiment counts (unique runs).

This is where you specify the CSV file with sample prompts, task type, and task criteria according to your needs.
Monitor page
After the evaluation criteria and parameters are defined, they are displayed on the Monitor page, which you can navigate to by choosing Monitor in the Navigation section. On this page, you can monitor all your evaluations, including those currently running, those queued, and those not yet scheduled to run. You can choose the evaluation you want to run, and if any evaluation is no longer relevant, you can remove it here as well.
The workflow is as follows:

Execute the prompts in the input file against the models selected.
Capture the metrics such as input token count, output token count, and TTFT.
Use the input and output tokens to calculate the cost of running each prompt against the models (a minimal cost calculation is sketched after this list).
Use an LLM-as-a-judge to evaluate the accuracy against predefined metrics (correctness, completeness, relevance, format, coherence, following instructions) and any user-defined metrics.
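
As a back-of-the-envelope illustration of the cost step (the per-token prices below are made-up placeholders; use the published Amazon Bedrock pricing for the models you actually evaluate):

# Hypothetical per-1,000-token prices in USD.
PRICE_PER_1K = {
    "Model-A": {"input": 0.003, "output": 0.015},
    "Model-B": {"input": 0.001, "output": 0.005},
}

def prompt_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Cost of one prompt/response pair given measured token counts.
    price = PRICE_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

# Example: 1,200 input tokens and 350 output tokens on Model-A.
print(f"${prompt_cost('Model-A', 1200, 350):.4f}")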

Evaluations page
Detailed information about each evaluation, such as the evaluation configuration, the judge models used, the Regions where the models are hosted, the input and output cost, and the task and criteria the model was evaluated with, is displayed on the Evaluations page.

Reports page
Lastly, the Reports page is where you can select the completed evaluations to generate a report in HTML format. You can also delete old and irrelevant reports.

Understanding the evaluation report
The tool output is an HTML file that shows the results of the evaluation. It includes the following sections:

Executive Summary – This section provides an overall summary of the results: which model was most accurate, which model was the fastest overall, and which model provided the best success-to-cost ratio.
Recommendations – This section contains more details and a breakdown of what you see in the executive summary, in a tabular format.
Latency Metrics – In this section, you can review the performance aspect of your evaluation. We use TTFT and output tokens per second as measures of performance.
Cost Metrics – This section shows the overall cost of running the evaluation, which indicates what you can expect in your AWS billing.
Task Analysis – The tool further breaks down the performance and cost metrics by task type. In our case, there will be a section for the text-to-SQL task and one for data extraction.
Judge Scores Analysis – In this section, you can review the quality of each model based on the various metrics. You can also explore prompt optimizations to improve your model. In our case, our prompts were more biased towards the Anthropic family, but if you use the Amazon Bedrock prompt optimization feature, you might be able to address this bias.

Interpreting the evaluation results
By using the 360-Eval UI, AnyCompany ran the evaluation with their own dataset and got the following results. They chose four different LLMs in Amazon Bedrock to conduct the evaluation. For this post, the exact models used aren’t relevant. We call these models Model-A, Model-B, Model-C, and Model-D.
These results will vary in your case depending on the dataset and prompts. The results here are a reflection of our own example within a test account. As shown in the following figures, Model-A was the fastest, followed by Model-B. Model-C was 3–4 times slower than Model-A. Model-D was the slowest.

As shown in the following figure, Model-B was the cheapest. Model-A was three times more expensive than Model-B. Model-C and Model-D were both very expensive.

The next focus was the quality of the evaluation. The two most important metrics were the correctness and completeness of the response. In the following evaluation, only Model-D scored more than 3 for both task types.

Model-C was the next closest contender.

Model-B scored lowest in the correctness and completeness metrics.

Model-A missed slightly on the completeness for the text-to-SQL use case.

Evaluation summary
Let’s revisit AnyCompany’s criteria: find a model that solves the task in the fastest and most cost-effective way, without compromising on quality. There was no obvious winner.
AnyCompany then considered providing a tiered pricing model to their customers. Premium-tier customers will receive the most accurate model at a premium price, and basic-tier customers will get the model with the best price-performance.
Although Model-D was the slowest and most expensive for this use case, it scored highest on the most crucial metrics: correctness and completeness of responses. For a database modeling tool, accuracy is far more important than speed or cost, because incorrect database schemas can lead to significant downstream issues in application development. AnyCompany chose Model-D for premium-tier customers.
Cost is a major constraint for the basic-tier, so AnyCompany chose Model-A, because it scored reasonably well on correctness for both tasks and only slightly missed on completeness for one task type, while being faster and less expensive than the top performers.
AnyCompany also considered Model-B as a viable option for free-tier customers.
Conclusion
As FMs become more capable and more widely relied on, they also become more complex, and their strengths and weaknesses become harder to detect, so evaluating them requires a systematic approach. By using a data-driven, multi-metric evaluation, technical leaders can make informed decisions rooted in the model’s actual performance, including factual accuracy, user experience, compliance, and cost.
Adopting frameworks like 360-Eval can operationalize this approach. You can encode your evaluation philosophy into a standardized procedure, making sure every new model or version is judged the same way, and enabling side-by-side comparisons.
The framework handles the heavy lifting of running models on test cases and computing metrics, so your team can focus on interpreting results and making decisions. As the field of generative AI continues to evolve rapidly, having this evaluation infrastructure can help you find the right model for your use case. Furthermore, this approach can enable faster iteration on prompts and policies, and ultimately help you develop more reliable and effective AI systems in production.

About the authors
Claudio Mazzoni is a Sr Specialist Solutions Architect on the Amazon Bedrock GTM team. Claudio excels at guiding customers through their generative AI journey. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.
Anubhav Sharma is a Principal Solutions Architect at AWS with over 2 decades of experience in coding and architecting business-critical applications. Known for his strong desire to learn and innovate, Anubhav has spent the past 6 years at AWS working closely with multiple independent software vendors (ISVs) and enterprises. He specializes in guiding these companies through their journey of building, deploying, and operating SaaS solutions on AWS.