What is Agentic RAG? Use Cases and Top Agentic RAG Tools (2025)

Table of contents
  What is Agentic RAG?
  Use Cases and Applications
  Top Agentic RAG Tools & Frameworks (2025)
    Open-source frameworks
    Vendor/managed platforms
  Key Benefits of Agentic RAG
  FAQ 1: What makes Agentic RAG different from traditional RAG?
  FAQ 2: What are the main applications of Agentic RAG?
  FAQ 3: How do agentic RAG systems improve accuracy?
  FAQ 4: Can Agentic RAG be deployed on-premises or in the cloud?

What is Agentic RAG?

Agentic RAG combines the strengths of traditional RAG—where large language models (LLMs) retrieve and ground outputs in external context—with agentic decision-making and tool use. Unlike static approaches, agentic RAG features AI agents that orchestrate retrieval, generation, query planning, and iterative reasoning. These agents autonomously choose data sources, refine queries, invoke APIs/tools, validate context, and self-correct in a loop until the best output is produced. The result is deeper, more accurate, and context-sensitive answers as the agent can dynamically adapt the workflow to each query.

Why not just vanilla RAG?

Vanilla RAG struggles with underspecified questions, multi-hop reasoning, and noisy corpora. Agentic patterns address this by adding the capabilities below (a minimal code sketch follows the list):

Planning / query decomposition (plan-then-retrieve).

Conditional retrieval (decide if retrieval is needed, from which source).

Self-reflection / corrective loops (detect bad retrieval and try alternatives).

Graph-aware exploration (narrative/relational discovery instead of flat chunk search).
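The control flow behind these patterns is easy to express in code. The following is a minimal, framework-agnostic Python sketch of a corrective retrieval loop; the tiny corpus, grader, and "generator" are stand-ins so the example runs end to end, and should be swapped for your real retriever, reranker/grader, and LLM client.

# Minimal agentic RAG loop: retrieve -> grade -> rewrite & retry -> generate.
# The corpus and helpers below are trivial stand-ins so the sketch runs as-is.

CORPUS = {
    "agentic rag": "Agentic RAG adds planning, conditional retrieval, and self-correction to RAG.",
    "vanilla rag": "Vanilla RAG retrieves once and generates, with no feedback loop.",
}

def retrieve(query: str) -> list[str]:
    return [text for key, text in CORPUS.items() if key in query.lower()]

def grade_context(question: str, docs: list[str]) -> bool:
    return len(docs) > 0  # self-reflection: is the retrieved context usable at all?

def rewrite_query(query: str) -> str:
    return query.replace("?", "").lower()  # corrective step: simplify the query and retry

def generate(question: str, docs: list[str]) -> str:
    context = " ".join(docs) if docs else "no retrieved context"
    return f"Answer to '{question}' grounded in: {context}"

def agentic_rag(question: str, max_attempts: int = 3) -> str:
    query = question
    docs: list[str] = []
    for _ in range(max_attempts):
        docs = retrieve(query)             # conditional retrieval from a chosen source
        if grade_context(question, docs):  # detect bad retrieval...
            return generate(question, docs)
        query = rewrite_query(query)       # ...and try an alternative query
    return generate(question, docs)        # fall back after the retry budget is spent

print(agentic_rag("What is Agentic RAG?"))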

Use Cases and Applications

Agentic RAG is being deployed across many industries to solve complex problems that traditional RAG struggles to address.

Customer Support: Empowers AI helpdesks to adapt responses to customer context and needs, resolving issues faster and learning from past tickets for continuous improvement.

Healthcare: Assists clinicians with evidence-based recommendations by retrieving and synthesizing medical literature, patient records, and treatment guidelines, enhancing diagnostic precision and patient safety.

Finance: Automates regulatory compliance analysis, risk management, and monitoring by reasoning over real-time regulatory updates and transactional data, significantly reducing manual effort.

Education: Delivers personalized learning through adaptive content retrieval and individualized learning plans, improving student engagement and outcomes.

Internal Knowledge Management: Finds, checks, and routes internal documents, streamlining access to crucial information for enterprise teams.

Business Intelligence: Automates multi-step KPI analysis, trend detection, and report generation by leveraging external data and API integrations with intelligent query planning.

Scientific Research: Helps researchers rapidly conduct literature reviews and extract insights, cutting down manual review time.

Top Agentic RAG Tools & Frameworks (2025)

Open-source frameworks

LangGraph (LangChain) – First-class state machines for multi-actor/agent workflows; includes Agentic RAG tutorial (conditional retrieval, retries). Strong for graph-style control over steps.

LlamaIndex – “Agentic strategies / data agents” for planning and tool use atop existing query engines; courseware and cookbooks available.

Haystack (deepset) – Agents + Studio recipes for agentic RAG, including conditional routing and web fallback. Good tracing, production docs.

DSPy – Programmatic LLM engineering; ReAct-style agents with retrieval and optimization; fits teams who want declarative pipelines and tuning.

Microsoft GraphRAG – Research-backed approach that builds a knowledge graph for narrative discovery; open materials and paper. Ideal for messy corpora.

RAPTOR (Stanford) – Hierarchical summarization tree improves retrieval for long corpora; works as a pre-compute stage in agentic stacks.

Vendor/managed platforms

AWS Bedrock Agents (AgentCore) – Multi-agent runtime with security, memory, browser tool, and gateway integration; designed for enterprise deployment.

Azure AI Foundry + Azure AI Search – Managed RAG pattern, indexes, and agent templates; integrates with Azure OpenAI Assistants preview.

Google Vertex AI: RAG Engine & Agent Builder – Managed orchestration and agent tooling; hybrid retrieval and agent patterns.

NVIDIA NeMo – Retriever NIMs and Agent Toolkit for tool-connected teams of agents; integrates with LangChain/LlamaIndex.

Cohere Agents / Tools API – Tutorials and building blocks for multi-stage agentic RAG with native tools.

Key Benefits of Agentic RAG

Autonomous multi-step reasoning: Agents plan and execute the best sequence of tool use and retrieval to reach the correct answer.

Goal-driven workflows: Systems adaptively pursue user goals, overcoming limitations of linear RAG pipelines.

Self-verification and refinement: Agents verify the accuracy of retrieved context and generated outputs, reducing hallucinations.

Multi-agent orchestration: Complex queries are broken down and solved collaboratively by specialized agents.

Greater adaptability and contextual understanding: Systems learn from user interactions and adapt to diverse domains and requirements.

Example: Choosing a stack

Research copilot over long PDFs & wikis → LlamaIndex or LangGraph + RAPTOR summaries; optional GraphRAG layer.

Enterprise helpdesk → Haystack agent with conditional routing and web fallback; or AWS Bedrock Agents for managed runtime and governance.

Data/BI assistant → DSPy (programmatic agents) with SQL tool adapters; Azure/Vertex for managed RAG and monitoring.

High-security production → Managed agent services (Bedrock AgentCore, Azure AI Foundry) to standardize memory, identity, and tool gateways.

Agentic RAG is redefining what’s possible with generative AI, transforming traditional RAG into dynamic, adaptive, and deeply integrated systems for enterprise, research, and developer use.

FAQ 1: What makes Agentic RAG different from traditional RAG?

Agentic RAG adds autonomous reasoning, planning, and tool use to retrieval-augmented generation, allowing the AI to refine queries, synthesize information from multiple sources, and self-correct, instead of simply fetching and summarizing data.

FAQ 2: What are the main applications of Agentic RAG?

Agentic RAG is widely used in customer support, healthcare decision support, financial analysis, education, business intelligence, knowledge management, and research, excelling at complex tasks requiring multi-step reasoning and dynamic context integration.

FAQ 3: How do agentic RAG systems improve accuracy?

Agentic RAG agents can verify and cross-check retrieved context and responses by iteratively querying multiple data sources and refining their outputs, which helps reduce errors and hallucinations common in basic RAG pipelines.

FAQ 4: Can Agentic RAG be deployed on-premises or in the cloud?

Most frameworks offer both on-premises and cloud deployment options, supporting enterprise security needs and seamless integration with proprietary databases and external APIs for flexible architecture choices.

Meta AI Introduces DeepConf: First AI Method to Achieve 99.9% on AIME …

Large language models (LLMs) have reshaped AI reasoning, with parallel thinking and self-consistency methods often cited as pivotal advances. However, these techniques face a fundamental trade-off: sampling multiple reasoning paths boosts accuracy but at a steep computational cost. A team of researchers from Meta AI and UCSD introduces Deep Think with Confidence (DeepConf), a new AI approach that nearly eliminates this trade-off. DeepConf delivers state-of-the-art reasoning performance with dramatic efficiency gains—achieving, for example, 99.9% accuracy on the grueling AIME 2025 math competition using the open-source GPT-OSS-120B, while requiring up to 85% fewer generated tokens than conventional parallel thinking approaches.

Why DeepConf?

Parallel thinking (self-consistency with majority voting) is the de facto standard for boosting LLM reasoning: generate multiple candidate solutions, then pick the most common answer. While effective, this method has diminishing returns—accuracy plateaus or even declines as more paths are sampled, because low-quality reasoning traces can dilute the vote. Moreover, generating hundreds or thousands of traces per query is costly, both in time and compute.

DeepConf tackles these challenges by exploiting the LLM’s own confidence signals. Rather than treating all reasoning traces equally, it dynamically filters out low-confidence paths—either during generation (online) or afterward (offline)—using only the most reliable trajectories to inform the final answer. This strategy is model-agnostic, requires no training or hyperparameter tuning, and can be plugged into any existing model or serving framework with minimal code changes.

Source: https://arxiv.org/pdf/2508.15260

How DeepConf Works: Confidence as a Guide

DeepConf introduces several advancements in how confidence is measured and used:

Token Confidence: For each generated token, compute the negative average log-probability of the top-k candidates. This gives a local measure of certainty.

Group Confidence: Average token confidence over a sliding window (e.g., 2048 tokens), providing a smoothed, intermediate signal of reasoning quality.

Tail Confidence: Focus on the final segment of the reasoning trace, where the answer often resides, to catch late breakdowns.

Lowest Group Confidence: Identify the least confident segment in the trace, which often signals reasoning collapse.

Bottom Percentile Confidence: Highlight the worst segments, which are most predictive of errors.

These metrics are then used to weight votes (high-confidence traces count more) or to filter traces (only the top η% most confident traces are kept). In online mode, DeepConf stops generating a trace as soon as its confidence drops below a dynamically calibrated threshold, dramatically reducing wasted computation.
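To make the metrics concrete, here is a small Python sketch of these signals, assuming you already have per-token top-k log-probabilities from your serving stack. The window size and keep-fraction are illustrative placeholders, not the paper's calibrated settings.

from collections import Counter

def token_confidence(topk_logprobs: list[float]) -> float:
    # Negative mean log-probability of the top-k candidate tokens at one decoding step.
    # With this definition, a peaked (confident) distribution yields a larger value.
    return -sum(topk_logprobs) / len(topk_logprobs)

def group_confidences(token_confs: list[float], window: int) -> list[float]:
    # Sliding-window average of token confidence: a smoothed signal of reasoning quality.
    return [sum(token_confs[i:i + window]) / window
            for i in range(len(token_confs) - window + 1)]

def lowest_group_confidence(token_confs: list[float], window: int = 2048) -> float:
    # The least confident segment in a trace, which often signals reasoning collapse.
    window = min(window, len(token_confs))
    return min(group_confidences(token_confs, window))

def confidence_weighted_vote(traces: list[tuple[str, list[float]]], keep_frac: float = 0.5) -> str:
    # traces: (final_answer, per-token confidences). Keep only the most confident share of
    # traces by lowest-group confidence, then weight each kept trace's vote by that value.
    scored = sorted(((lowest_group_confidence(confs), answer) for answer, confs in traces), reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_frac))]
    votes = Counter()
    for conf, answer in kept:
        votes[answer] += conf
    return votes.most_common(1)[0][0]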


Key Results: Performance & Efficiency

DeepConf was evaluated across multiple reasoning benchmarks (AIME 2024/2025, HMMT 2025, BRUMO25, GPQA-Diamond) and models (DeepSeek-8B, Qwen3-8B/32B, GPT-OSS-20B/120B). The results are striking:

| Model | Dataset | Pass@1 Acc | Cons@512 Acc | DeepConf@512 Acc | Tokens Saved |
|---|---|---|---|---|---|
| GPT-OSS-120B | AIME 2025 | 91.8% | 97.0% | 99.9% | -84.7% |
| DeepSeek-8B | AIME 2024 | 83.0% | 86.7% | 93.3% | -77.9% |
| Qwen3-32B | AIME 2024 | 80.6% | 85.3% | 90.8% | -56.0% |

Performance boost: Across models and datasets, DeepConf improves accuracy by up to ~10 percentage points over standard majority voting, often saturating the benchmark’s upper limit.

Ultra-efficient: By early-stopping low-confidence traces, DeepConf reduces the total number of generated tokens by 43–85%, with no loss (and often a gain) in final accuracy.

Plug & play: DeepConf works out of the box with any model—no fine-tuning, no hyperparameter search, and no changes to the underlying architecture. You can drop it into your existing serving stack (e.g., vLLM) with ~50 lines of code.

Easy to deploy: The method is implemented as a lightweight extension to existing inference engines, requiring only access to token-level logprobs and a few lines of logic for confidence calculation and early stopping.

Simple Integration: Minimal Code, Maximum Impact

DeepConf’s implementation is quite simple. For vLLM, the changes are minimal:

Extend the logprobs processor to track sliding-window confidence.

Add an early-stop check before emitting each output.

Pass confidence thresholds via the API, with no model retraining.

This allows any OpenAI-compatible endpoint to support DeepConf with a single extra setting, making it trivial to adopt in production environments.
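As a rough illustration of the online mode (not the actual vLLM patch), the per-step logic reduces to a sliding-window confidence check against a threshold calibrated from a few warm-up traces; the class and the serving hook named in the usage comment are hypothetical.

from collections import deque

class EarlyStopMonitor:
    """Tracks sliding-window confidence during generation and signals early stopping."""

    def __init__(self, threshold: float, window: int = 2048):
        self.threshold = threshold          # calibrated offline from warm-up traces
        self.window = deque(maxlen=window)  # most recent token confidences

    def update(self, token_conf: float) -> bool:
        """Return True if generation of this trace should stop."""
        self.window.append(token_conf)
        window_conf = sum(self.window) / len(self.window)
        return window_conf < self.threshold  # low confidence -> abandon the trace

# Usage inside a (hypothetical) decoding loop:
# monitor = EarlyStopMonitor(threshold=calibrated_threshold)
# for step_logprobs in stream_topk_logprobs(request):   # hypothetical serving hook
#     if monitor.update(token_confidence(step_logprobs)):
#         break  # stop this trace early and free the slot for another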

Conclusion

Meta AI’s DeepConf represents a leap forward in LLM reasoning, delivering both peak accuracy and unprecedented efficiency. By dynamically leveraging the model’s internal confidence, DeepConf achieves what was previously out of reach for open-source models: near-perfect results on elite reasoning tasks, with a fraction of the computational cost.

FAQs

FAQ 1: How does DeepConf improve accuracy and efficiency compared to majority voting?

DeepConf’s confidence-aware filtering and voting prioritizes traces with higher model certainty, boosting accuracy by up to 10 percentage points across reasoning benchmarks compared to majority voting alone. At the same time, its early termination of low-confidence traces slashes token usage by up to 85%, offering both performance and massive efficiency gains in practical deployments.

FAQ 2: Can DeepConf be used with any language model or serving framework?

Yes. DeepConf is fully model-agnostic and can be integrated into any serving stack—including open-source and commercial models—without modification or retraining. Deployment requires only minimal changes (~50 lines of code for vLLM), leveraging token logprobs to compute confidence and handle early stopping.

FAQ 3: Does DeepConf require retraining, special data, or complex tuning?

No. DeepConf operates entirely at inference-time, requiring no additional model training, fine-tuning, or hyperparameter searches. It uses only built-in logprob outputs and works immediately with standard API settings for leading frameworks; it’s scalable, robust, and deployable on real workloads without interruption.

Check out the Paper and Project Page.

The Evolution of AI Protocols: Why Model Context Protocol (MCP) Could …

Welcome to a new era of AI interoperability, where the Model Context Protocol (MCP) stands ready to do for agents and AI assistants what HTTP did for the web. If you’re building, scaling, or analyzing AI systems, MCP is the open standard you can’t ignore—it provides a universal contract for discovering tools, fetching resources, and coordinating rich, agentic workflows in real time.

From Fragmentation to Standardization: The AI Pre‑Protocol Era

Between 2018 and 2023, integrators lived in a world of fragmented APIs, bespoke connectors, and countless hours lost to customizing every function call or tool integration. Each assistant or agent needed unique schemas, custom connectors for GitHub or Slack, and its own brittle handling of secrets. Context—whether files, databases, or embeddings—moved via one-off workarounds.

The web faced this same problem before HTTP and URIs standardized everything. AI desperately needs its own minimal, composable contract, so any capable client can plug into any server without glue code or custom hacks.

What MCP Actually Standardizes

Think of MCP as a universal bus for AI capabilities and context—connecting hosts (agents/apps), clients (connectors), and servers (capability providers) using a clear interface: JSON-RPC messaging, a set of HTTP or stdio transports, and well-defined contracts for security and negotiation.

MCP Feature Set

Tools: Typed functions exposed by servers, described in JSON Schema, that any client can list or invoke.

Resources: Addressable context (files, tables, docs, URIs) that agents can reliably list, read, subscribe to, or update.

Prompts: Reusable prompt templates and workflows you can discover, fill, and trigger dynamically.

Sampling: Agents can delegate LLM calls or requests to hosts when a server needs model interaction.

Transports: MCP runs over local stdio (for quick desktop/server processes) and streamable HTTP—POST for requests, optional SSE for server events. The choice depends on scale and deployment.

Security: Designed for explicit user consent and OAuth-style authorization with audience-bound tokens. No token passthrough—clients declare their identity, and servers enforce scopes and approvals with clear UX prompts.
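To ground this, here is a minimal Python sketch of a client driving an MCP server over stdio. The initialize, tools/list, and tools/call method names follow the MCP specification as described above, but the server command, protocol version string, tool name, and exact parameter fields are assumptions to verify against the spec and your server's docs.

import json
import subprocess

def rpc(proc, msg_id, method, params=None):
    # One JSON-RPC 2.0 request/response over the server's stdio pipes
    # (MCP's stdio transport uses newline-delimited JSON messages).
    request = {"jsonrpc": "2.0", "id": msg_id, "method": method, "params": params or {}}
    proc.stdin.write((json.dumps(request) + "\n").encode())
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

# "my-mcp-server" is a placeholder command for whatever MCP server you run locally.
server = subprocess.Popen(["my-mcp-server"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# Negotiate version/capabilities, then discover and invoke a tool.
rpc(server, 1, "initialize", {"protocolVersion": "2025-03-26",  # use the revision your server supports
                              "capabilities": {}, "clientInfo": {"name": "demo", "version": "0.1"}})
tools = rpc(server, 2, "tools/list")
result = rpc(server, 3, "tools/call", {"name": "get_weather", "arguments": {"location": "Tokyo"}})
print(tools, result)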

The HTTP Analogy

Resources ≈ URLs: AI-context blocks are now routable, listable, and fetchable.

Tools ≈ HTTP Methods: Typed, interoperable actions replace bespoke API calls.

Negotiation/versioning ≈ Headers/content-type: Capability negotiation, protocol versioning, and error handling are standardized.

The Path to Becoming “The New HTTP for AI”

What makes MCP a credible contender to become the “HTTP for AI”?

Cross‑client adoption: MCP support is rolling out widely, from Claude Desktop and JetBrains to emerging cloud agent frameworks—one connector works anywhere.

Minimal core, strong conventions: MCP is simple at its heart—core JSON-RPC plus clear APIs—allowing servers to be as simple or complex as the need demands.

Simple: A single tool, a database, or file-server.

Complex: Full-blown prompt graphs, event streaming, multi-agent orchestration.

Runs everywhere: Wrap local tools for safety, or deploy enterprise-grade servers behind OAuth 2.1 and robust logging—flexibility without sacrificing security.

Security, governance, and audit: Built to satisfy enterprise requirements—OAuth 2.1 flows, audience-bound tokens, explicit consent, and audit trails everywhere user data or tools are accessed.

Ecosystem momentum: Hundreds of open and commercial MCP servers now expose databases, SaaS apps, search, observability, and cloud services. IDEs and assistants converge on the protocol, fueling fast adoption.

MCP Architecture Deep‑Dive

MCP’s architecture is intentionally straightforward:

Initialization/Negotiation: Clients and servers establish features, negotiate versions, and set up security. Each server declares which tools, resources, and prompts it supports—and what authentication is required.

Tools: Stable names, clear descriptions, and JSON Schemas for parameters (enabling client-side UI, validation, and invocation).

Resources: Server-exposed roots and URIs, so AI agents can add, list, or browse them dynamically.

Prompts: Named, parameterized templates for consistent flows, like “summarize-doc-set” or “refactor‑PR.”

Sampling: Servers can ask hosts to call an LLM, with explicit user consent.

Transports: stdio for quick/local processes; HTTP + SSE for production or remote communication. HTTP sessions add state.

Auth & trust: OAuth 2.1 required for HTTP; tokens must be audience-bound, never reused. All tool invocation requires clear consent dialogs.
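For instance, a server might declare a tool like the following (a hypothetical descriptor shown as a Python dict; the field names mirror the tools/list response shape described above, but check them against the current spec):

# Hypothetical tool declaration as a server would return it from tools/list:
# a stable name, a human-readable description, and a JSON Schema for its inputs,
# which lets any MCP client render a form, validate arguments, and invoke the tool.
summarize_tool = {
    "name": "summarize_doc_set",
    "description": "Summarize a set of documents identified by resource URIs.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "uris": {"type": "array", "items": {"type": "string"}, "description": "Resource URIs to summarize"},
            "max_words": {"type": "integer", "minimum": 50, "default": 200},
        },
        "required": ["uris"],
    },
}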

What Changes if MCP Wins

If MCP becomes the dominant protocol:

One connector, many clients: Vendors ship a single MCP server—customers plug into any IDE or assistant supporting MCP.

Portable agent skills: “Skills” become server-side tools/prompts, composable across agents and hosts.

Centralized policy: Enterprises manage scopes, audit, DLP, and rate limits server-side—no fragmented controls.

Fast onboarding: “Add to” deep links—like protocol handlers for browsers—install a connector instantly.

No more brittle scraping: Context resources become first‑class, replacing copy-paste hacks.

Gaps and Risks: Realism Over Hype

Standards body and governance: MCP is versioned and open, but not yet a formal IETF or ISO standard.

Security supply chain: Thousands of servers need trust, signing, sandboxing; OAuth must be implemented correctly.

Capability creep: The protocol must stay minimal; richer patterns belong in libraries, not the protocol’s core.

Inter-server composition: Moving resources across servers (e.g., from Notion → S3 → indexer) requires new idempotency/retry patterns.

Observability & SLAs: Standard metrics and error taxonomies are essential for robust monitoring in production.

Migration: The Adapter‑First Playbook

Inventory use cases: Map current actions, connect CRUD/search/workflow tools and resources.

Define schemas: Concise names, descriptions, and JSON Schemas for every tool/resource.

Pick transport and auth: Stdio for quick local prototypes; HTTP/OAuth for cloud and team deployments.

Ship a reference server: Start with a single domain, then expand to more workflows and prompt templates.

Test across clients: Ensure Claude Desktop, VS Code/Copilot, Cursor, JetBrains, etc. all interoperate.

Add guardrails: Implement allow‑lists, dry‑run, consent prompts, rate limits, and invocation logs.

Observe: Emit trace logs, metrics, and errors. Add circuit breakers for external APIs.

Document/version: Publish a server README, changelog, and semver’d tool catalog, and respect version headers.

Design Notes for MCP Servers

Deterministic outputs: Structured results; return resource links for large data.

Idempotency keys: Clients supply request_id for safe retries.

Fine-grained scopes: Token scopes per tool/action (readonly vs. write).

Human-in-the-loop: Offer dryRun and plan tools so users see planned effects first.

Resource catalogs: Expose list endpoints with pagination; support eTag/updatedAt for cache refresh.
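A sketch of what those conventions can look like in a tool result is shown below; the field names are purely illustrative server-side conventions, not anything MCP itself mandates.

# Illustrative structured result for a write-style tool invoked with dryRun enabled.
dry_run_result = {
    "request_id": "req-0001",          # echoed idempotency key supplied by the client
    "dry_run": True,
    "plan": [                          # human-reviewable plan of what a real run would do
        {"action": "update_row", "table": "invoices", "id": 1042, "fields": {"status": "paid"}},
    ],
    "resource_links": ["db://invoices/1042"],  # large data returned by reference, not inline
}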

Will MCP Become “The New HTTP for AI?”

If “new HTTP” means a universal, low-friction contract letting any AI client interact safely with any capability provider—MCP is the closest we have today. Its tiny core, flexible transports, typed contracts, and explicit security all bring the right ingredients. MCP’s success depends on neutral governance, industry weight, and robust operational patterns. Given the current momentum, MCP is on a realistic path to become the default interoperability layer between AI agents and the software they act on.

FAQs

FAQ 1: What is MCP?

MCP (Model Context Protocol) is an open, standardized protocol that enables AI models—such as assistants, agents, or large language models—to securely connect and interact with external tools, services, and data sources through a common language and interface.

FAQ 2: Why is MCP important for AI?

MCP eliminates custom, fragmented integrations by providing a universal framework for connecting AI systems to real-time context—databases, APIs, business tools, and beyond—making models dramatically more accurate, relevant, and agentic while improving security and scalability for developers and enterprises.

FAQ 3: How does MCP work in practice?

MCP uses a client-server architecture with JSON-RPC messaging, supporting both local (stdio) and remote (HTTP+SSE) communication; AI hosts send requests to MCP servers, which expose capabilities and resources, and handle authentication and consent, allowing for safe, structured, cross-platform automation and data retrieval.

FAQ 4: How can I start using MCP in a project?

Deploy or reuse an MCP server for your data source, embed an MCP client in the host app, negotiate features via JSON-RPC 2.0, and secure any HTTP transport with OAuth 2.1 scopes and audience-bound tokens.

Mercury foundation models from Inception Labs are now available in Ama …

Today, we are excited to announce that Mercury and Mercury Coder foundation models (FMs) from Inception Labs are available through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, you can deploy the Mercury FMs to build, experiment, and responsibly scale your generative AI applications on AWS.
In this post, we demonstrate how to get started with Mercury models on Amazon Bedrock Marketplace and SageMaker JumpStart.
About Mercury foundation models
Mercury is the first family of commercial-scale diffusion-based language models, offering groundbreaking advancements in generation speed while maintaining high-quality outputs. Unlike traditional autoregressive models that generate text one token at a time, Mercury models use diffusion to generate multiple tokens in parallel through a coarse-to-fine approach, resulting in dramatically faster inference speeds. Mercury Coder models deliver the following key features:

Ultra-fast generation speeds of up to 1,100 tokens per second on NVIDIA H100 GPUs, up to 10 times faster than comparable models
High-quality code generation across multiple programming languages, including Python, Java, JavaScript, C++, PHP, Bash, and TypeScript
Strong performance on fill-in-the-middle tasks, making them ideal for code completion and editing workflows
Transformer-based architecture, providing compatibility with existing optimization techniques and infrastructure
Context length support of up to 32,768 tokens out of the box and up to 128,000 tokens with context extension approaches

About Amazon Bedrock Marketplace
Amazon Bedrock Marketplace plays a pivotal role in democratizing access to advanced AI capabilities through several key advantages:

Comprehensive model selection – Amazon Bedrock Marketplace offers an exceptional range of models, from proprietary to publicly available options, so organizations can find the perfect fit for their specific use cases.
Unified and secure experience – By providing a single access point for models through the Amazon Bedrock APIs, Amazon Bedrock Marketplace significantly simplifies the integration process. Organizations can use these models securely, and for models that are compatible with the Amazon Bedrock Converse API, you can use the robust toolkit of Amazon Bedrock, including Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Flows.
Scalable infrastructure – Amazon Bedrock Marketplace offers configurable scalability through managed endpoints, so organizations can select their desired number of instances, choose appropriate instance types, define custom automatic scaling policies that dynamically adjust to workload demands, and optimize costs while maintaining performance.

Deploy Mercury and Mercury Coder models in Amazon Bedrock Marketplace
Amazon Bedrock Marketplace gives you access to over 100 popular, emerging, and specialized foundation models through Amazon Bedrock. To access the Mercury models in Amazon Bedrock, complete the following steps:

On the Amazon Bedrock console, in the navigation pane under Foundation models, choose Model catalog.

You can also use the Converse API to invoke the model with Amazon Bedrock tooling.

On the Model catalog page, filter for Inception as a provider and choose the Mercury model.

The Model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

To begin using the Mercury model, choose Subscribe.

On the model detail page, choose Deploy.

You will be prompted to configure the deployment details for the model. The model ID will be prepopulated.

For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters).
For Number of instances, enter a number of instances (between 1–100).
For Instance type, choose your instance type. For optimal performance with the Mercury models, a GPU-based instance type such as ml.p5.48xlarge is recommended.
Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you might want to review these settings to align with your organization’s security and compliance requirements.
Choose Deploy to begin using the model.

When the deployment is complete, you can test its capabilities directly in the Amazon Bedrock playground. This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results. You can use these models with the Amazon Bedrock Converse API.
SageMaker JumpStart overview
SageMaker JumpStart is a fully managed service that offers state-of-the-art FMs for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.
You can now discover and deploy Mercury and Mercury Coder in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, and derive model performance and MLOps controls with Amazon SageMaker AI features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and in your VPC, helping support data security for enterprise security needs.
Prerequisites
To deploy the Mercury models, make sure you have access to the recommended instance types based on the model size. To verify you have the necessary resources, complete the following steps:

On the Service Quotas console, under AWS Services, choose Amazon SageMaker.
Check that you have sufficient quota for the required instance type for endpoint deployment.
Make sure at least one of these instance types is available in your target AWS Region.
If needed, request a quota increase and contact your AWS account team for support.

Make sure your SageMaker AWS Identity and Access Management (IAM) service role has the necessary permissions to deploy the model, including the following permissions to make AWS Marketplace subscriptions in the AWS account used:

aws-marketplace:ViewSubscriptions
aws-marketplace:Unsubscribe
aws-marketplace:Subscribe

Alternatively, confirm whether your AWS account already has a subscription to the model. If it does, you can skip the following subscription steps and proceed directly to deployment.
Subscribe to the model package
To subscribe to the model package, complete the following steps:

Open the model package listing page and choose Mercury or Mercury Coder.
On the AWS Marketplace listing, choose Continue to subscribe.
On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
Choose Continue to proceed with the configuration and then choose a Region where you have the service quota for the desired instance type.

A product Amazon Resource Name (ARN) will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.
Deploy Mercury and Mercury Coder models on SageMaker JumpStart
For those new to SageMaker JumpStart, you can use SageMaker Studio to access the Mercury and Mercury Coder models on SageMaker JumpStart.

Deployment starts when you choose the Deploy option. You might be prompted to subscribe to this model through Amazon Bedrock Marketplace. If you are already subscribed, choose Deploy. After deployment is complete, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

Deploy Mercury using the SageMaker SDK
In this section, we walk through deploying the Mercury model through the SageMaker SDK. You can follow a similar process for deploying the Mercury Coder model as well.
To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

# Create the model package (assumes role_arn, package_arn, and sagemaker_session are defined earlier in the notebook)
from sagemaker import ModelPackage
from sagemaker.utils import name_from_base

endpoint_name = name_from_base("mercury-endpoint")  # set this to your liking
model = ModelPackage(role=role_arn, model_package_arn=package_arn, sagemaker_session=sagemaker_session)

Deploy the model:

# Deploy the Model. This may take 5-10 minutes to run
from time import perf_counter

instance_type = "ml.p5.48xlarge"  # We only support ml.p5.48xlarge instances at the moment
start = perf_counter()
deployed_model = model.deploy(initial_instance_count=1, instance_type=instance_type, endpoint_name=endpoint_name)
predictor = deployed_model  # the returned Predictor handle, used for inference in the examples below
print(f"\nDeployment took {perf_counter() - start:.2f} seconds")

Use Mercury for code generation
Let’s try asking the model to generate a simple tic-tac-toe game:

payload = {
    "messages": [
        {
            "role": "user",
            "content": """
Build a simple tic-tac-toe game.

REQUIREMENTS:
1. **Game**: 3×3 grid, human vs AI, click to play
2. **AI**: Uses minimax to never lose (only win or draw)
3. **Visualization**: Show AI's move scores in a simple list
4. **Interface**: Grid + "New Game" button + move explanation

IMPLEMENTATION:
- Single HTML file with embedded CSS/JS
- Basic minimax algorithm (no pruning needed)
- Display: "AI chose position 5 (score: +10)"
- Clean, functional design

DELIVERABLE:
Working game that demonstrates perfect AI play with basic score visibility.
        """
        }
    ],
    "max_tokens": 2500,
}
start = perf_counter()
outputs = predictor.predict(payload)
eta = perf_counter() - start
print(f"Speed: {outputs['usage']['completion_tokens'] / eta:.2f} tokens / second\n")
print(outputs["choices"][0]["message"]["content"])

We get the following response:

Speed: 528.15 tokens / second

```html
<!DOCTYPE html>
<html>
<head>
<title>Tic-Tac-Toe with Unbeatable AI</title>
<style>
  body {
    font-family: sans-serif;
    display: flex;
    flex-direction: column;
    align-items: center;
    justify-content: center;
    height: 100vh;
    margin: 0;
    background-color: #f0f0f0;
  }

  .grid {
    display: grid;
    grid-template-columns: repeat(3, 100px);
    gap: 10px;
    margin-bottom: 20px;
  }

  .cell {
    width: 100px;
    height: 100px;
    border: 1px solid #ccc;
    display: flex;
    align-items: center;
    justify-content: center;
    font-size: 2em;
    cursor: pointer;
    background-color: #fff;
  }

  .cell:hover {
    background-color: #e0e0e0;
  }

  .cell.x {
    color: red;
  }

  .cell.o {
    color: blue;
  }

  #aiMoves {
    margin-top: 20px;
    font-size: 1.2em;
  }

  #newGame {
    padding: 10px 20px;
    font-size: 1.2em;
    cursor: pointer;
    background-color: #4CAF50;
    color: white;
    border: none;
    border-radius: 5px;
  }

  #newGame:hover {
    background-color: #3e8e41;
  }
</style>
</head>
<body>

<h1>Tic-Tac-Toe</h1>

<div class="grid" id="gameGrid">
  <div class="cell" data-index="0"></div>
  <div class="cell" data-index="1"></div>
  <div class="cell" data-index="2"></div>
  <div class="cell" data-index="3"></div>
  <div class="cell" data-index="4"></div>
  <div class="cell" data-index="5"></div>
  <div class="cell" data-index="6"></div>
  <div class="cell" data-index="7"></div>
  <div class="cell" data-index="8"></div>
</div>

<button id="newGame">New Game</button>

<div id="aiMoves"></div>

<script>
  const grid = document.getElementById('gameGrid');
  const cells = document.querySelectorAll('.cell');
  const aiMovesDiv = document.getElementById('aiMoves');
  const newGameButton = document.getElementById('newGame');

  let board = Array(9).fill(null);
  let currentPlayer = 'x';
  let gameActive = true;
  let aiMoveScores = [];

  function checkWin(player) {
    const winConditions = [
      [0, 1, 2], [3, 4, 5], [6, 7, 8], // Rows
      [0, 3, 6], [1, 4, 7], [2, 5, 8], // Columns
      [0, 4, 8], [2, 4, 6] // Diagonals
    ];

    return winConditions.some(condition => {
      return condition.every(index => board[index] === player);
    });
  }

  function checkDraw() {
    return board.every(cell => cell !== null);
  }

  function switchPlayer() {
    currentPlayer = currentPlayer === 'x' ? 'o' : 'x';
  }

  function updateBoard() {
    cells.forEach((cell, index) => {
      cell.textContent = board[index];
      cell.classList.remove('x', 'o');
      if (board[index] === 'x') {
        cell.classList.add('x');
      } else if (board[index] === 'o') {
        cell.classList.add('o');
      }
    });
  }

  function minimax(board, player) {
    let opponent = player === 'x' ? 'o' : 'x';

    if (checkWin(player)) {
      return { score: 10 };
    } else if (checkWin(opponent)) {
      return { score: -10 };
    } else if (checkDraw()) {
      return { score: 0 };
    }

    let moves = [];
    for (let i = 0; i < board.length; i++) {
      if (board[i] === null) {
        board[i] = player;
        let result = minimax(board, opponent);
        moves.push({ index: i, score: result.score });
        board[i] = null; // Undo move
      }
    }

    if (player === 'x') {
      let bestScore = -Infinity;
      let bestMove = null;
      for (let i = 0; i < moves.length; i++) {
        if (moves[i].score > bestScore) {
          bestScore = moves[i].score;
          bestMove = moves[i];
        }
      }
      return bestMove;
    } else {
      let bestScore = Infinity;
      let bestMove = null;
      for (let i = 0; i < moves.length; i++) {
        if (moves[i].score < bestScore) {
          bestScore = moves[i].score;
          bestMove = moves[i];
        }
      }
      return bestMove;
    }
  }

  function aiMove() {
    if (!gameActive) return;

    let bestMove = minimax(board, 'o');
    board[bestMove.index] = 'o';
    updateBoard();
    aiMoveScores.push(`AI chose position ${bestMove.index} (score: ${bestMove.score})`);
    aiMovesDiv.textContent = aiMoveScores.join('\n');

    if (checkWin('o')) {
      alert('AI wins!');
      gameActive = false;
    } else if (checkDraw()) {
      alert("It's a draw!");
      gameActive = false;
    } else {
      currentPlayer = 'x';
    }
  }

  function handleCellClick(event) {
    const cellIndex = parseInt(event.target.dataset.index);

    if (gameActive && board[cellIndex] === null && currentPlayer === 'x') {
      board[cellIndex] = 'x';
      updateBoard();

      if (checkWin('x')) {
        alert('You win!');
        gameActive = false;
      } else if (checkDraw()) {
        alert("It's a draw!");
        gameActive = false;
      } else {
        aiMove();
      }
    }
  }

  function newGame() {
    board = Array(9).fill(null);
    currentPlayer = 'x';
    gameActive = true;
    aiMoveScores = [];
    aiMovesDiv.textContent = '';
    updateBoard();
  }

  cells.forEach(cell => cell.addEventListener('click', handleCellClick));
  newGameButton.addEventListener('click', newGame);
</script>

</body>
</html>
```

From the preceding response, we can see that the Mercury model generated a complete, functional tic-tac-toe game with minimax AI implementation at 528 tokens per second, delivering working HTML, CSS, and JavaScript in a single response. The code includes proper game logic, an unbeatable AI algorithm, and a clean UI with the specified requirements correctly implemented. This demonstrates strong code generation capabilities with exceptional speed for a diffusion-based model.

Use Mercury for tool use and function calling
Mercury models support advanced tool use capabilities, enabling them to intelligently determine when and how to call external functions based on user queries. This makes them ideal for building AI agents and assistants that can interact with external systems, APIs, and databases.
Let’s demonstrate Mercury’s tool use capabilities by creating a travel planning assistant that can check weather and perform calculations:

# Define available tools for the assistant
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The unit of temperature"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform mathematical calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

# Create a travel planning query that requires multiple tools
payload = {
    "messages": [
        {
            "role": "user",
            "content": "I'm planning a trip to Tokyo. Can you check the weather there and also tell me what 1000 USD is in Japanese Yen (use 1 USD = 150 JPY for calculation)?"
        }
    ],
    "tools": tools,
    "tool_choice": "auto",  # Let the model decide which tools to use
    "max_tokens": 2000,
    "temperature": 0.15
}

# Invoke the endpoint
import json  # used below to parse the tool-call arguments

start = perf_counter()
response = predictor.predict(payload)
eta = perf_counter() - start

# Display the tool calls requested by the model
if 'choices' in response:
    message = response['choices'][0].get('message', {})
    if 'tool_calls' in message:
        print(f"Speed: {response['usage']['completion_tokens'] / eta:.2f} tokens/second\n")
        print(f"Mercury requested {len(message['tool_calls'])} tool calls:\n")

        for i, tool_call in enumerate(message['tool_calls'], 1):
            func = tool_call.get('function', {})
            tool_name = func.get('name')
            args = json.loads(func.get('arguments', '{}'))

            print(f"Tool Call {i}:")
            print(f"  Function: {tool_name}")
            print(f"  Arguments: {json.dumps(args, indent=4)}")
            print()

Expected response:

Speed: 892.34 tokens/second
Mercury requested 2 tool calls:
Tool Call 1:
  Function: get_weather
  Arguments: {
    "location": "Tokyo, Japan",
    "unit": "celsius"
}
Tool Call 2:
  Function: calculate
  Arguments: {
    "expression": "1000 * 150"
}

After receiving the tool results, you can continue the conversation to get a natural language response:

# Simulate tool execution results
tool_results = [
    {
        "role": "tool",
        "tool_call_id": message['tool_calls'][0]['id'],
        "content": "The weather in Tokyo, Japan is 18°C and partly cloudy with a chance of rain."
    },
    {
        "role": "tool",
        "tool_call_id": message['tool_calls'][1]['id'],
        "content": "The result is: 150000"
    }
]

# Continue the conversation with tool results
messages_with_results = [
    {"role": "user", "content": "I'm planning a trip to Tokyo. Can you check the weather there and also tell me what 1000 USD is in Japanese Yen (use 1 USD = 150 JPY for calculation)?"},
    message,         # Assistant's message with tool calls
    *tool_results    # Tool execution results
]

final_payload = {
    "messages": messages_with_results,
    "max_tokens": 500
}

final_response = predictor.predict(final_payload)
print(final_response['choices'][0]['message']['content'])

Expected response:

Based on the information I’ve gathered for your Tokyo trip:
**Weather in Tokyo:**
Currently, Tokyo is experiencing mild weather at 18°C (64°F) with partly cloudy skies and a chance of rain. I’d recommend bringing a light jacket and an umbrella just in case.
**Currency Conversion:**
1,000 USD converts to 150,000 Japanese Yen at the rate you specified (1 USD = 150 JPY). This should give you a good amount for expenses like meals, transportation, and shopping in Tokyo.
For your trip planning, the mild temperature is perfect for sightseeing, though you’ll want to have rain gear handy. The weather is comfortable for walking around popular areas like Shibuya, Shinjuku, or exploring temples and gardens.

Clean up
To avoid unwanted charges, complete the steps in this section to clean up your resources.
Delete the Amazon Bedrock Marketplace deployment
If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, in the navigation pane, under Foundation models, choose Marketplace deployments.
Select the endpoint you want to delete, and on the Actions menu, choose Delete.
Verify the endpoint details to make sure you’re deleting the correct deployment:

Endpoint name
Model name
Endpoint status

Choose Delete to delete the endpoint.
In the Delete endpoint confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart endpoint
The SageMaker JumpStart model you deployed will incur costs if you leave it running. Use the following code to delete the endpoint if you want to stop incurring charges. For more details, see Delete Endpoints and Resources.

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

Conclusion
In this post, we explored how you can access and deploy Mercury models using Amazon Bedrock Marketplace and SageMaker JumpStart. With support for both Mini and Small parameter sizes, you can choose the optimal model size for your specific use case. Visit SageMaker JumpStart in SageMaker Studio or Amazon Bedrock Marketplace to get started. For more information, refer to Use Amazon Bedrock tooling with Amazon SageMaker JumpStart models, Amazon SageMaker JumpStart Foundation Models, Getting started with Amazon SageMaker JumpStart, Amazon Bedrock Marketplace, and SageMaker JumpStart pretrained models.
The Mercury family of diffusion-based large language models offers exceptional speed and performance, making it a powerful choice for your generative AI workloads with latency-sensitive requirements.

About the authors
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.
John Liu has 15 years of experience as a product executive and 9 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols, fintech companies and also spent 9 years as a portfolio manager at various hedge funds.
Jonathan Evans is a Worldwide Solutions Architect for Generative AI at AWS, where he helps customers leverage cutting-edge AI technologies with Anthropic’s Claude models on Amazon Bedrock, to solve complex business challenges. With a background in AI/ML engineering and hands-on experience supporting machine learning workflows in the cloud, Jonathan is passionate about making advanced AI accessible and impactful for organizations of all sizes.
Rohit Talluri is a Generative AI GTM Specialist at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.
Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science (HCLS) customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing model adoption for first- and third-party models. Breanne is also Vice President of the Women at Amazon board with the goal of fostering inclusive and diverse culture at Amazon. Breanne holds a Bachelor’s of Science in Computer Engineering from the University of Illinois Urbana-Champaign.

NVIDIA AI Released Jet-Nemotron: 53x Faster Hybrid-Architecture Langua …

NVIDIA researchers have shattered the longstanding efficiency hurdle in large language model (LLM) inference, releasing Jet-Nemotron—a family of models (2B and 4B) that delivers up to 53.6× higher generation throughput than leading full-attention LLMs while matching, or even surpassing, their accuracy. Most importantly, this breakthrough isn’t the result of a new pre-training run from scratch, but rather a retrofit of existing, pre-trained models using a novel technique called Post Neural Architecture Search (PostNAS). The implications are transformative for businesses, practitioners, and researchers alike.

The Need for Speed in Modern LLMs

While today’s state-of-the-art (SOTA) LLMs, like Qwen3, Llama3.2, and Gemma3, have set new benchmarks for accuracy and flexibility, their O(n²) self-attention mechanism incurs exorbitant costs—both in compute and memory—especially for long-context tasks. This makes them expensive to deploy at scale and nearly impossible to run on edge or memory-constrained devices. Efforts to replace full-attention Transformers with more efficient architectures (Mamba2, GLA, RWKV, etc.) have struggled to close the accuracy gap, until now.

Source: https://arxiv.org/abs/2508.15884v1

PostNAS: A Surgical, Capital-Efficient Overhaul

The core innovation is PostNAS: a neural architecture search pipeline designed specifically for efficiently retrofitting pre-trained models. Here’s how it works:

Freeze the Knowledge: Start with a SOTA full-attention model (like Qwen2.5). Freeze its MLP layers—this preserves the model’s learned intelligence and greatly reduces training cost.

Surgical Replacement: Replace computationally expensive full-attention (Transformers) with JetBlock, a new, hardware-efficient linear attention block designed for NVIDIA’s latest GPUs.

Hybrid, Hardware-Aware Design: Use super-network training and beam search to automatically determine the optimal placement and minimal set of full-attention layers necessary to preserve accuracy on key tasks (retrieval, math, MMLU, coding, etc.). This step is task-specific and hardware-aware: the search maximizes throughput for target hardware, not just parameter count.

Scale and Deploy: The result is a hybrid-architecture LLM that inherits the backbone intelligence of the original model but slashes latency and memory footprint.
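Conceptually, the "freeze and replace" step looks like the PyTorch-style sketch below. JetBlock here is represented by a placeholder factory, the decoder is assumed to expose .layers with .mlp and .self_attn submodules (naming varies by implementation), and the real PostNAS pipeline layers super-network training and hardware-aware beam search on top of this.

import torch.nn as nn

def retrofit(model: nn.Module, keep_full_attention: set[int], make_jetblock) -> nn.Module:
    """Freeze MLP weights and swap full attention for linear-attention blocks.

    make_jetblock is a hypothetical factory returning a JetBlock-style linear
    attention module; keep_full_attention holds the layer indices that the
    hardware-aware search decided must remain full attention.
    """
    for layer_idx, layer in enumerate(model.layers):      # assumes a decoder exposing .layers
        for p in layer.mlp.parameters():                  # freeze the knowledge
            p.requires_grad_(False)
        if layer_idx not in keep_full_attention:          # surgical replacement
            layer.self_attn = make_jetblock(layer.self_attn)
    return model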

JetBlock is particularly noteworthy: it introduces dynamic causal convolution kernels conditioned on input (unlike static kernels in prior linear attention blocks) and removes redundant convolutions for streamlined efficiency. With hardware-aware hyperparameter search, it not only keeps pace with prior linear attention designs in throughput, but actually boosts accuracy.


Jet-Nemotron: Performance by the Numbers

The key metrics from NVIDIA’s technical paper are staggering:

| Model | MMLU-Pro Acc. | Generation Throughput (tokens/s, H100) | KV Cache Size (MB, 64K context) | Notes |
|---|---|---|---|---|
| Qwen3-1.7B-Base | 37.8 | 61 | 7,168 | Full-attention baseline |
| Jet-Nemotron-2B | 39.0 | 2,885 | 154 | 47× throughput, 47× smaller cache |
| Jet-Nemotron-4B | 44.2 | 1,271 | 258 | 21× throughput, still SOTA acc. |
| Mamba2-2.7B | 8.6 | 2,507 | 80 | All-linear, much lower accuracy |
| RWKV7-1.5B | 13.4 | 3,050 | 24 | All-linear, much lower accuracy |
| DeepSeek-V3-Small (MoE) | — | — | — | 2.2B activated, 15B total, lower acc. |

Jet-Nemotron-2B matches or exceeds Qwen3-1.7B-Base on every major benchmark—math, commonsense, coding, retrieval, long-context—while delivering 47× higher generation throughput.

This isn’t a small gain: a 53.6× speedup in decoding at 256K context length means a 98% reduction in inference cost for the same volume of tokens. Prefilling speedups are also dramatic: 6.14× faster at 256K context.
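The cost arithmetic follows directly: at equal accuracy and token volume, GPU time per generated token scales as the inverse of throughput, so a 53.6× decoding speedup corresponds to 1 - 1/53.6 ≈ 1 - 0.019 = 0.981, or roughly a 98% reduction in inference cost.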

Memory footprint shrinks by 47× (154MB cache vs. 7,168MB for Qwen3-1.7B-Base). This is a game-changer for edge deployment: Jet-Nemotron-2B is 8.84× and 6.5× faster than Qwen2.5-1.5B on Jetson Orin and RTX 3090, respectively.


Applications

For Business Leaders: Better ROI $$

Inference at scale is now affordable. A 53× throughput gain means dollar-for-dollar, you can serve 53× more users—or slash hosting costs by 98%.

Operational efficiency is transformed: latency drops, batch sizes grow, and memory constraints vanish. Cloud providers can offer SOTA AI at commodity prices.

The AI business model reshapes: Tasks once too expensive (real-time document AI, long-context agents, on-device copilots) suddenly become viable.

For Practitioners: SOTA on the Edge

Forget about quantization, distillation, or pruning compromises. Jet-Nemotron’s tiny KV cache (154MB) and 2B parameters fit on Jetson Orin, RTX 3090, and even mobile chips—no more offloading to the cloud.

No retraining, no data pipeline changes: Just retrofitting. Your existing Qwen, Llama, or Gemma checkpoints can be upgraded without losing accuracy.

Real-world AI services (search, copilots, summarization, coding) are now instant and scalable.

For Researchers: Lower Barrier, Higher Innovation

PostNAS slashes the cost of LLM architecture innovation. Instead of months and millions on pre-training, architecture search happens on frozen backbone models in a fraction of the time.

Hardware-aware NAS is the future: The Jet-Nemotron process considers KV cache size (not just parameters) as the critical factor for real-world speed. This is a paradigm shift in how we measure and optimize efficiency.

The community can iterate faster: PostNAS is a rapid testbed. If a new attention block works here, it’s worth pre-training; if not, it’s filtered out before the big spend.

Summary

The open-sourcing of Jet-Nemotron and JetBlock (code on GitHub) means the broader AI ecosystem can now retrofit their models for unprecedented efficiency. PostNAS is not a one-off trick: it’s a general-purpose framework for accelerating any Transformer, lowering the cost of future breakthroughs.

Check out the Paper and GitHub Page.

Google AI Introduces Gemini 2.5 Flash Image: A New Model that Allows Y …

Table of contents
  What Makes Gemini 2.5 Flash Image Impressive?
  Key Technical Features
  Benchmark Leadership and Community Reception
  Pricing, Access, and Future Roadmap
  In Summary
  FAQs

Google AI has just unveiled Gemini 2.5 Flash Image, a new generation image model designed to let users generate and edit images simply by describing them—and its true innovation is how it delivers precise, consistent, and high-fidelity edits at impressive speed and scale.

What Makes Gemini 2.5 Flash Image Impressive?

Gemini 2.5 Flash Image is built on the multimodal, advanced-reasoning foundation of Gemini 2.5 (meaning it natively understands both images and text), enabling seamless workflows for generation and editing. This architecture allows users to:

Blend multiple images into one with a single prompt

Maintain subject and character consistency across many edits

Make targeted, natural language-driven transformations (e.g. “change the shirt color,” “remove person from photo”)

Retain context and visual fidelity through iterative revisions—regardless of the complexity or diversity of edits

This is a leap beyond older image models, which often struggled to maintain identity or visual coherence when making edits or compositing scenes.

Key Technical Features

Precise visual editing: The model supports highly accurate, localized edits based on natural language prompts, from background blurring to pose adjustments and object removals.

Multimodal fusion: Accepts multiple reference images and fuses them, enabling, for instance, complex product mockups or multi-character scenes in advertising.

Template/brand consistency: Gemini 2.5 Flash Image preserves styling, branding, and character consistency across generated assets or product catalogs.

Advanced reasoning: Taps into Gemini’s semantic world knowledge for tasks like diagram understanding or educational annotation—not just photorealistic rendering.

Scalable API availability: Developers and enterprises can access the model via Gemini API, Google AI Studio, and Vertex AI—with built-in SynthID watermarking for AI provenance and regulatory compliance.
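A minimal sketch of an image edit through the Gemini API with the google-genai Python SDK is shown below; the model ID, input file name, and response-parsing details are assumptions based on the preview release, so check the current API reference before relying on them.

# Minimal image-edit sketch with the google-genai SDK (pip install google-genai pillow).
# Model ID and response layout are assumptions from the preview; verify against the docs.
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",                   # assumed preview model ID
    contents=[Image.open("product.png"), "Change the shirt color to navy blue."],
)

for part in response.candidates[0].content.parts:
    if part.inline_data is not None:                          # edited image returned as inline bytes
        Image.open(BytesIO(part.inline_data.data)).save("edited.png")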

Benchmark Leadership and Community Reception

Gemini 2.5 Flash Image has quickly led public benchmarks, topping LMArena for prompt adherence and edit quality, surpassing competitors like GPT-4o’s native image tools and FLUX AI image models. Enthusiasts and experts highlight its photorealism, but also its remarkable semantic control—making edits that look natural and true to the source material even across multiple iterations.

Source: https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/

Pricing, Access, and Future Roadmap

The model is available in preview for $0.039 per image via Gemini API, Google AI Studio, and Vertex AI, with enterprise and developer integration rising rapidly thanks to partnerships with platforms like OpenRouter and fal.ai. All generated images feature invisible SynthID watermarks for traceability and AI ethics compliance, and Google is actively improving long-form text rendering and even finer consistency.

In Summary:

Gemini 2.5 Flash Image isn’t just faster and more creative, it’s technically “a-peel-ing” because it finally solves the long-standing challenge of consistent, context-aware image editing in generative AI—unlocking powerful new workflows for creators, developers, and enterprises.

FAQs

What is Gemini 2.5 Flash Image?

Gemini 2.5 Flash Image is Google’s state-of-the-art AI model for generating and editing images with natural language prompts, supporting multimodal fusion and advanced reasoning for precise, consistent edits.

How do you edit images using Gemini 2.5 Flash Image?

Simply describe the changes needed in natural language, such as “remove a person from the photo” or “change shirt color,” and the model applies edits while preserving key visual details and scene consistency.

Where can users access the model?

Gemini 2.5 Flash Image is available in the Gemini app, Google AI Studio, Vertex AI, and via API for developers and enterprises; it’s also integrated in platforms like Adobe Firefly and Express.

Which file formats does Gemini 2.5 Flash Image support?

By default, images are generated in JPEG format rather than PNG or WebP, an optimization for broad compatibility and smaller file sizes.

Are there safeguards for image generation?

Google employs strict safety features and content filters to prevent the creation of harmful or inappropriate visuals, balancing creative control with responsible AI use.

Check out the Technical details here.

What is MLSecOps (Secure CI/CD for Machine Learning)? Top MLSecOps Tools (2025)

Machine learning (ML) is transforming industries, powering innovation in domains as varied as financial services, healthcare, autonomous systems, and e-commerce. However, as organizations operationalize ML models at scale, traditional approaches to software delivery—chiefly, Continuous Integration and Continuous Deployment (CI/CD)—have revealed critical gaps when applied to machine learning workflows. Unlike conventional software systems, ML pipelines are highly dynamic, data-driven, and exposed to unique risks such as data drift, adversarial attacks, and regulatory compliance demands. These realities have accelerated adoption of MLSecOps: a holistic discipline that fuses security, governance, and observability throughout the ML lifecycle, ensuring not only agility but also safety and trustworthiness in AI deployments.

Rethinking ML Security: Why MLSecOps is Important

Traditional CI/CD processes were built for code; they evolved to speed up integration, testing, and release cycles. In ML, however, code is only part of the picture: the pipeline is also driven by external data, model artifacts, and iterative feedback loops. This makes ML systems vulnerable to a broad spectrum of threats, including:

Data poisoning: Malicious actors may contaminate training sets, causing models to make dangerous or biased predictions.

Model inversion & extraction: Attackers may reverse-engineer models or leverage prediction APIs to recover sensitive training data (such as patient records in healthcare or financial transactions in banking).

Adversarial examples: Sophisticated inputs are crafted to deceive models, sometimes with catastrophic consequences (e.g., misclassifying road signs for autonomous vehicles).

Regulatory compliance & governance loopholes: Laws such as GDPR, HIPAA, and emerging AI-specific frameworks require traceability of training data, auditability of decision logic, and robust privacy controls.

MLSecOps is the answer—embedding security controls, monitoring routines, privacy protocols, and compliance checks at every stage of the ML pipeline, from raw data ingestion and model experimentation to deployment, serving, and continuous monitoring.

The MLSecOps Lifecycle: From Planning to Monitoring

A robust MLSecOps implementation aligns with the following lifecycle stages, each demanding attention to distinct risks and controls:

1. Planning and Threat Modeling

Security for ML pipelines must begin at the design stage. Here, teams map out objectives, assess threats (such as supply chain risks and model theft), and select tools and standards for secure development. Architectural planning also involves defining roles and responsibilities across data engineering, ML engineering, operations, and security. Failure to anticipate threats during planning can leave pipelines exposed to risks that compound downstream.

2. Data Engineering and Ingestion

Data is the lifeblood of Machine learning (ML). Pipelines must validate the provenance, integrity, and confidentiality of all datasets. This involves:

Automated data quality checks, anomaly detection, and data lineage tracking.

Hashing and digital signatures to verify authenticity, as sketched below.

Role-based access control (RBAC) and encryption for datasets, restricting access only to authorized identities.

A single compromised dataset can undermine an entire pipeline, resulting in silent failures or exploitable vulnerabilities.
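As a minimal sketch of the hashing control above, the snippet below computes a SHA-256 digest for each dataset file and compares it against a trusted manifest before ingestion; the file paths and manifest format are assumptions for illustration, not a prescribed standard.

# Verify dataset integrity against a trusted manifest before the pipeline ingests it.
# Paths and manifest layout ({"train.csv": "<hex digest>", ...}) are assumptions.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

manifest = json.loads(Path("data_manifest.json").read_text())

for filename, expected in manifest.items():
    actual = sha256_of(Path("data") / filename)
    if actual != expected:
        raise RuntimeError(f"Integrity check failed for {filename}: possible tampering")
print("All dataset digests verified")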

3. Experimentation and Development

Machine learning (ML) experimentation demands reproducibility. Secure experimentation mandates:

Isolated workspaces for testing new features or models without risking production systems.

Auditable notebooks and version-controlled model artifacts.

Enforcement of least privilege: only trusted engineers can modify model logic, hyperparameters, or training pipelines.

4. Model and Pipeline Validation

Validation is not just about accuracy—it must also include robust security checks:

Automated adversarial robustness testing to surface vulnerabilities to adversarial inputs (see the sketch after this list).

Privacy testing using differential privacy and membership inference resistance protocols.

Explainability and bias audits for ethical compliance and regulatory reporting.
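As one concrete instance of the robustness check above, here is a minimal FGSM-style probe in PyTorch; the model, data loader, epsilon, and pass/fail threshold are placeholders and not a prescribed methodology.

# Minimal FGSM-style robustness probe (sketch). Model, loader, and thresholds are assumptions.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Return inputs perturbed with the fast gradient sign method."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

def adversarial_accuracy(model, loader, epsilon=0.03):
    """Accuracy of the model on FGSM-perturbed batches."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x_adv = fgsm_attack(model, x, y, epsilon)
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / max(total, 1)

# Example gate in a validation job (threshold is an assumption):
# assert adversarial_accuracy(model, val_loader) > 0.70, "Model fails robustness budget"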

5. CI/CD Pipeline Hardening

Secure CI/CD for ML extends foundational DevSecOps principles:

Secure artifacts with signed containers or trusted model registries.

Ensure pipeline steps (data processing, training, deployment) operate under least-privilege policies, minimizing lateral movement in case of compromise.

Implement rigorous pipeline and runtime audit logs to enable traceability and facilitate incident response.

6. Secure Deployment and Model Serving

Models must be deployed in isolated production environments (e.g., Kubernetes namespaces, service meshes). Security controls include:

Automated runtime monitoring for detection of anomalous requests or adversarial inputs.

Model health checks, continuous model evaluation, and automated rollback on anomaly detection.

Secure model update mechanisms, with version tracking and rigorous access control.

7. Continuous Training

As new data arrives or user behaviors change, pipelines may retrain models automatically (continuous training). While this supports adaptability, it also introduces new risks:

Data drift detection to trigger retraining only when justified, preventing “silent degradation” (a minimal sketch follows this list).

Versioning of both datasets and models for full auditability.

Security reviews of retraining logic, ensuring no malicious data can hijack the process.
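One common way to gate retraining on drift is a two-sample statistical test per feature; the sketch below uses a Kolmogorov-Smirnov test from SciPy, with thresholds and the placeholder data being assumptions for illustration.

# Drift check with a two-sample KS test; reference/live arrays and alpha are placeholders.
import numpy as np
from scipy import stats

def feature_drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift when the live distribution differs significantly from the reference."""
    statistic, p_value = stats.ks_2samp(reference, live)
    return p_value < alpha

reference = np.random.normal(0.0, 1.0, 5000)  # distribution captured at training time
live = np.random.normal(0.3, 1.0, 5000)       # recent production data (placeholder)

if feature_drifted(reference, live):
    print("Drift detected: schedule a security-reviewed retraining run")
else:
    print("No significant drift: skip retraining")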

8. Monitoring and Governance

Ongoing monitoring is the backbone of reliable ML security:

Outlier detection systems to spot incoming data anomalies and prediction drift.

Automated compliance audits, generating evidence for internal and external reviews.

Integrated explainability modules (e.g., SHAP, LIME) tied directly into monitoring platforms for traceable, human-readable decision logic.

Regulatory reporting for GDPR, HIPAA, SOC 2, ISO 27001, and emerging AI governance frameworks.

Mapping Threats to Pipeline Stages

Every stage in the ML pipeline introduces distinct risks. For instance:

Planning failures lead to weak model protection and supply chain vulnerabilities (such as dependency confusion or package tampering).

Improper data engineering may result in unauthorized dataset exposure or poisoning.

Poor validation opens the door to adversarial testing failures or explainability gaps.

Weak deployment practices invite model theft, API abuse, and infrastructure compromise.

A credible defense requires stage-specific security controls, mapped precisely to the relevant threats.

Tools and Frameworks Powering MLSecOps

MLSecOps leverages a mix of open-source and commercial platforms. Leading examples for 2025 include:

MLflow Registry – Artifact versioning, access control, audit trails

Kubeflow Pipelines – Kubernetes-native security, pipeline isolation, RBAC

Seldon Deploy – Runtime drift/adversarial monitoring, auditability

TFX (TensorFlow Extended) – Validation at scale, secure model serving

AWS SageMaker – Integrated bias detection, governance, explainability

Jenkins X – Plug-in CI/CD security for ML workloads

GitHub Actions / GitLab CI – Embedded security scanning, dependency and artifact controls

DeepChecks / Robust Intelligence – Automated robustness/security validation

Fiddler AI / Arize AI – Model monitoring, explainability-driven compliance

Protect AI – Supply chain risk monitoring, red teaming for AI

These platforms help automate security, governance, and monitoring across every ML lifecycle stage, whether in the cloud or on-premises infrastructure.

Case Studies: MLSecOps in Action

Financial Services

Real-time fraud detection and credit scoring pipelines must withstand regulatory scrutiny and sophisticated adversarial attacks. MLSecOps enables encrypted data ingestion, role-based access control, continuous monitoring, and automated auditing—delivering compliant, trustworthy models while resisting data poisoning and model inversion attacks.

Healthcare

Medical diagnostics demand HIPAA-compliant handling of patient data. MLSecOps integrates privacy-preserving training, rigorous audit trails, explainability modules, and anomaly detection to guard sensitive data while maintaining clinical relevance.

Autonomous Systems

Autonomous vehicles and robotics require robust defenses against adversarial inputs and perception errors. MLSecOps enforces adversarial testing, secure endpoint isolation, continuous model retraining, and rollback mechanisms to ensure safety in dynamic, high-stakes environments.

Retail & E-Commerce

Recommendation engines and personalization models power modern retail. MLSecOps shields these vital systems from data poisoning, privacy leaks, and compliance failures through full-lifecycle security controls and real-time drift detection.

The Strategic Value of MLSecOps

As machine learning moves from research labs into business-critical operations, ML security and compliance have become essential—not optional. MLSecOps is an approach, architecture, and toolkit that brings together engineering, operations, and security professionals to build resilient, explainable, and trustworthy AI systems. Investing in MLSecOps enables organizations to deploy ML models rapidly, guard against adversarial threats, ensure regulatory alignment, and build stakeholder trust.

FAQs: Addressing Common MLSecOps Questions

How is MLSecOps different from MLOps?

MLOps emphasizes automation and operational efficiency, while MLSecOps treats security, privacy, and compliance as non-negotiable pillars—integrating them directly into every ML lifecycle stage.

What are the biggest threats to ML pipelines?

Data poisoning, adversarial input, model theft, privacy leaks, fragile supply chains, and compliance failures top the risk list for ML systems in 2025.

How can training data be secured in CI/CD pipelines?

Robust encryption (at rest and in transit), RBAC, automated anomaly detection, and thorough provenance tracking are essential for preventing unauthorized access and contamination.

Why is monitoring indispensable for MLSecOps?

Continuous monitoring enables early detection of adversarial activity, drift, and data leakage—empowering teams to trigger rollbacks, retrain models, or escalate incidents before they affect production systems.

Which industries benefit most from MLSecOps?

Finance, healthcare, government, autonomous systems, and any domain governed by strict regulatory or safety requirements stand to gain the greatest value from MLSecOps adoption.

Do open-source tools fulfill MLSecOps requirements?

Open-source platforms such as Kubeflow, MLflow, and Seldon deliver strong foundational security, monitoring, and compliance features—often extended by commercial enterprise tools to meet advanced needs.

Learn how Amazon Health Services improved discovery in Amazon search u …

Healthcare discovery on ecommerce domains presents unique challenges that traditional product search wasn’t designed to handle. Unlike searching for books or electronics, healthcare queries involve complex relationships between symptoms, conditions, treatments, and services, requiring sophisticated understanding of medical terminology and customer intent.
This challenge became particularly relevant for Amazon as we expanded beyond traditional ecommerce into comprehensive healthcare services. Amazon now offers direct access to prescription medications through Amazon Pharmacy, primary care through One Medical, and specialized care partnerships through Health Benefits Connector. These healthcare offerings represent a significant departure from traditional Amazon.com products, presenting both exciting opportunities and unique technical challenges.
In this post, we show you how Amazon Health Services (AHS) solved discoverability challenges on Amazon.com search using AWS services such as Amazon SageMaker, Amazon Bedrock, and Amazon EMR. By combining machine learning (ML), natural language processing, and vector search capabilities, we improved our ability to connect customers with relevant healthcare offerings. This solution is now used daily for health-related search queries, helping customers find everything from prescription medications to primary care services.
At AHS, we’re on a mission to transform how people access healthcare. We strive to make healthcare more straightforward for customers to find, choose, afford, and engage with the services, products, and professionals they need to get and stay healthy.
Challenges
Integrating healthcare services into the ecommerce business of Amazon presented two unique opportunities to enhance search for customers on healthcare journeys: understanding health search intent in queries and matching up customer query intent with the most relevant healthcare products and services.
The challenge in understanding health search intent lies in the relationships between symptoms (such as back pain or sore throat), conditions (such as a herniated disc or the common cold), treatments (such as physical therapy or medication), and the healthcare services Amazon offers. This requires sophisticated query understanding capabilities that can parse medical terminology and map it to common search terminology that a layperson outside of the medical field might use to search.
AHS offerings also present unique challenges for search matching. For example, a customer searching for “back pain treatment” might be looking for a variety of solutions, from over-the-counter pain relievers like Tylenol or prescription medications such as cyclobenzaprine (a muscle relaxant), to scheduling a doctor’s appointment or accessing virtual physical therapy. Existing search algorithms optimized for physical products might not match these service-based health offerings, potentially missing relevant results such as One Medical’s primary care services or Hinge Health’s virtual physical therapy program that helps reduce joint and muscle pain through personalized exercises and 1-on-1 support from dedicated therapists. This unique nature of healthcare offerings called for developing specialized approaches to connect customers with relevant services.
Solution overview
To address these challenges, we developed a comprehensive solution that combines ML for query understanding, vector search for product matching, and large language models (LLMs) for relevance optimization. The solution consists of three main components:

Query understanding pipeline – Uses ML models to identify and classify health-related searches, distinguishing between specific medication queries and broader health condition searches
Product knowledge base – Combines existing product metadata with LLM-enhanced health information to create comprehensive product embeddings for semantic search
Relevance optimization – Implements a hybrid approach using both human labeling and LLM-based classification to produce high-quality matches between searches and healthcare offerings

The solution is built entirely on AWS services, with Amazon SageMaker powering our ML models, Amazon Bedrock providing LLM capabilities, and Amazon EMR and Amazon Athena handling our data processing needs.
Solution architecture
Now let’s examine the technical implementation details of our architecture, exploring how each component was engineered to address the unique challenges of healthcare search on Amazon.com.
Query understanding: Identification of health searches
We approached the customer search journey by recognizing its two distinct ends of the spectrum. On one end are what we call “spearfishing queries” or lower funnel searches, where customers have a clear product search intent with specific knowledge about attributes. For Amazon Health Services, these typically include searches for specific prescription medications with precise dosages and form factors, such as “atorvastatin 40 mg” or “lisinopril 20 mg.”
On the other end are broad, upper funnel queries where customers seek inspiration, information, or recommendations with general product search intent that might encompass multiple product types. Examples include searches like “back pain relief,” “acne,” or “high blood pressure.” Building upon Amazon search capabilities, we developed additional query understanding models to serve the full spectrum of healthcare searches.
For identifying spearfishing search intent, we analyzed anonymized customer search engagement data for Amazon products and trained a classification model to understand which search keywords exclusively lead to engagement with Amazon Pharmacy Amazon Standard Identification Numbers (ASINs). This process used PySpark on Amazon EMR and Athena to collect and process Amazon search data at scale. The following diagram shows this architecture.

For identifying broad health search intent, we trained a named entity recognition (NER) model to annotate search keywords at a medical terminology level. To build this capability, we used a corpus of health ontology data sources to identify concepts such as health conditions, diseases, treatments, injuries, and medications. For health concepts where we did not have enough alternate terms in our knowledge base, we used LLMs to expand our knowledge base. For example, alternate terms for the condition “acid reflux” might be “heart burn”, “GERD”, “indigestion”, etc. We gated this NER model behind health-relevant product types predicted by Amazon search query-to-product-type models. The following diagram shows the training process for the NER model.

The following image is an example of a query identification task in practice. In the example on the left, the pharmacy classifier predicts that “atorvastatin 40 mg” is a query with intent for a prescription drug and triggers a custom search experience geared towards AHS products. In the example on the right, we detect the broad “high blood pressure” symptom but don’t know the customer’s intention. So, we trigger an experience that gives them multiple options to make the search more specific.

For those interested in implementing similar medical entity recognition capabilities, Amazon Comprehend Medical offers powerful tools for detecting medical entities in text spans.
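As a small, hedged sketch of that route, the snippet below calls Amazon Comprehend Medical through boto3 to extract medical entities from a search query; the region, IAM permissions, and query text are assumptions for illustration.

# Detect medical entities in a search query with Amazon Comprehend Medical (sketch).
# Region and credentials are assumptions; the service must be enabled in your account.
import boto3

client = boto3.client("comprehendmedical", region_name="us-east-1")

query = "back pain relief after physical therapy"
response = client.detect_entities_v2(Text=query)

for entity in response["Entities"]:
    # e.g., "back pain" / MEDICAL_CONDITION with a confidence score
    print(entity["Text"], entity["Category"], round(entity["Score"], 2))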
Building product knowledge
With our ability to identify health-related searches in place, we needed to build comprehensive knowledge bases for our healthcare products and services. We started with our existing offerings and collected all available product knowledge information that best described each product or service.
To enhance this foundation, we used an LLM with a carefully tuned prompt and few-shot examples to layer in additional relevant health conditions, symptoms, and treatment-related keywords for each product or service. We did this using the Amazon Bedrock batch inference capability. This approach significantly expanded our product knowledge with medically relevant information.
The entire knowledge base was then converted into embeddings using Facebook AI Similarity Search (FAISS), and we created an index file to enable efficient similarity searches. We maintained careful mappings from each embedding back to the original knowledge base items, making sure we could perform accurate reverse lookups when needed.
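To make the indexing step concrete, here is a minimal FAISS sketch under stated assumptions: embeddings are already computed (random placeholders below), L2-normalized so inner product approximates cosine similarity, and the similarity threshold is illustrative rather than the team's actual value.

# Build a FAISS index over knowledge-base embeddings and run a thresholded similarity search.
# Embedding values and the 0.3 threshold are placeholders/assumptions.
import numpy as np
import faiss

dim = 768
product_embeddings = np.random.rand(1000, dim).astype("float32")  # placeholder vectors
faiss.normalize_L2(product_embeddings)

index = faiss.IndexFlatIP(dim)  # exact inner-product search
index.add(product_embeddings)

query = np.random.rand(1, dim).astype("float32")  # embedding of a customer query
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)  # top-5 most similar knowledge-base items
for score, idx in zip(scores[0], ids[0]):
    if score >= 0.3:  # keep only matches above the similarity threshold
        print(f"product {idx}: score {score:.3f}")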
This process used several AWS services, including Amazon Simple Storage Service (Amazon S3) for storage of the knowledge base and the embeddings files. Note that Amazon OpenSearch Service is also a viable option for vector database capabilities. Large-scale knowledge base embedding jobs were executed with scheduled SageMaker Notebook Jobs. Through the combination of these technologies, we built a robust foundation of healthcare product knowledge that could be efficiently searched and matched to customer queries.
The following diagram illustrates how we built the product knowledge base using Amazon catalog data, and then used that to prepare a FAISS index file.

Mapping health search intent to the most relevant products and services
A core component of our solution was implementing the Retrieval Augmented Generation (RAG) design pattern. The first step in this pattern was to identify a set of known keywords and Amazon products, establishing the initial ground truth for our solution.
With our product knowledge base built from Amazon catalog metadata and ASIN attributes, we were ready to support new queries from customers. When a customer search query arrived, we converted it to an embedding and used it as a search key for matching against our index. This similarity search used FAISS with matching criteria based on the threshold against the similarity score.
To verify the quality of these query-product pairs identified for health search keywords, we needed to maintain the relevance of each pair. To achieve this, we implemented a two-pronged approach to relevance labeling based on the exact, substitute, complement, irrelevant (ESCI) framework established through academic research, tagging each offering as exact, substitute, complement, or irrelevant to the keyword. For more information, refer to the ESCI challenge and esci-data GitHub repository.
First, we worked with a human labeling team to establish ground truth on a substantial sample size, creating a reliable benchmark for our system’s performance using this scheme. The labeling team was given guidance based on the ESCI framework and tailored towards AHS products and services.
Second, we implemented LLM-based labeling using Amazon Bedrock and batch jobs. After matches were found in the previous step, we retrieved the top products and used them as prompt context for our generative model. We included few-shot examples of ESCI guidance as part of the prompt. This way, we conducted large-scale inference across the top health searches, connecting them to the most relevant offerings using similarity search. We performed this at scale for the query-product pairs identified as relevant to AHS and stored the outputs in Amazon S3.
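The sketch below illustrates what a single ESCI labeling call might look like using the Amazon Bedrock Converse API; the model ID, prompt wording, and inference settings are assumptions for illustration, not the team's production prompt or batch setup.

# LLM-based ESCI labeling via the Bedrock Converse API (sketch); model ID and prompt are assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

prompt = (
    "Label the product for the query using one of: Exact, Substitute, Complement, Irrelevant.\n"
    "Query: back pain treatment\n"
    "Product: virtual physical therapy program for joint and muscle pain\n"
    "Label:"
)

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID, an assumption
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 10, "temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])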
The following diagram shows our query retrieval, re-ranking and ESCI labeling pipeline.

Using a mix of high-confidence human and LLM-based labels, we established a true ground truth. Through this process, we successfully identified relevant product offerings for customers using only semantic data from aggregated search keywords and product metadata.
How did this help customers?
We’re on a mission to make it more straightforward for people to find, choose, afford, and engage with the services, products, and professionals they need to get and stay healthy. Today, customers searching for health solutions on Amazon—whether for acute conditions like acne, strep throat, and fever or chronic conditions such as arthritis, high blood pressure, and diabetes—will begin to see medically vetted and relevant offerings alongside other relevant products and services available on Amazon.com.
Customers can now quickly find and choose to meet with doctors, get their prescription medications, and access other healthcare services through a familiar experience. By extending the powerful ecommerce search capabilities of Amazon to address healthcare-specific opportunities, we’ve created additional discovery pathways for relevant health services.
We’ve used semantic understanding of health queries and comprehensive product knowledge to create connections that help customers find the right healthcare solutions at the right time.
Amazon Health Services Offerings
Here is a little more information about three healthcare services you can use directly through Amazon:

Amazon Pharmacy (AP) provides a full-service, online pharmacy experience with transparent medication pricing, convenient home delivery at no additional cost, ongoing delivery updates, 24/7 pharmacist support, and insurance plan acceptance, which supports access and medication adherence. Prime members enjoy special savings with Prime Rx, RxPass, and automatic coupons, making medications more affordable.
One Medical Membership and Amazon One Medical Pay Per Visit offer flexible health solutions, from in-office and virtual primary care to condition-based telehealth. Membership offers convenient access to preventive, quality primary care and the option to connect with your care team virtually in the One Medical app. Pay-per-visit is a one-time virtual visit option to find treatment for more than 30 common conditions like acne, pink eye, and sinus infections.
Health Benefits Connector matches customers to digital health companies outside of Amazon that are covered by their employer. This program has been expanding over the past year, offering access to specialized care through partners like Hinge Health for musculoskeletal care, Rula and Talkspace for mental health support, and Omada for diabetes treatment.

Key takeaways
As we reflect on our journey to enhance healthcare discovery on Amazon, several key insights stand out that might be valuable for others working on similar challenges:

Using domain-specific ontology – We began by developing a deep understanding of customer health searches, specifically identifying what kinds of conditions, symptoms, and treatments customers were seeking. By using established health ontology datasets, we enriched a NER model to detect these entities in search queries, providing a foundation for better matching.
Similarity search on product knowledge – We used existing product knowledge along with LLM-augmented real-world knowledge to build a comprehensive corpus of data that could be mapped to our offerings. Through this approach, we created semantic connections between customer queries and relevant healthcare solutions without relying on individual customer data.
Generative AI is more than just chatbots – Throughout this project, we relied on various AWS services that proved instrumental to our success. Amazon SageMaker provided the infrastructure for our ML models. However, using Amazon Bedrock batch inference was a key differentiator. It provided us with powerful LLMs for knowledge augmentation and relevance labeling, and services such as Amazon S3 and Amazon EMR supported our data storage and processing needs. Scaling this process manually would have required orders of magnitude more budget. Consider generative AI at scale for applications beyond chat assistants.

By combining these approaches, we’ve created a more intuitive and effective way for customers to discover healthcare offerings on Amazon.
Implementation considerations
If you’re looking to implement a similar solution for healthcare or search, consider the following:

Security and compliance: Make sure your solution adheres to healthcare data privacy regulations like Health Insurance Portability and Accountability Act (HIPAA). Our approach doesn’t use individual customer data.
Cost optimization:

Use Amazon EMR on EC2 Spot Instances for batch processing jobs
Implement caching for frequently searched queries
Choose appropriate instance types for your workload

Scalability:

Design your vector search infrastructure to handle peak traffic
Use auto scaling for your inference endpoints
Implement proper monitoring and alerting

Maintenance:

Regularly update your health ontology datasets
Monitor model performance and retrain as needed
Keep your product knowledge base current

Conclusion
In this post, we demonstrated how Amazon Health Services used AWS ML and generative AI services to solve the unique challenges of healthcare discovery on Amazon.com, illustrating how you can build sophisticated domain-specific search experiences using Amazon SageMaker, Amazon Bedrock, and Amazon EMR. We showed how to create a query understanding pipeline to identify health-related searches, build comprehensive product knowledge bases enhanced with LLM capabilities, and implement semantic matching using vector search and the ESCI relevance framework to connect customers with relevant healthcare offerings.
This scalable, AWS based approach demonstrates how ML and generative AI can transform specialized search experiences, advancing our mission to make healthcare more straightforward for customers to find, choose, afford, and engage with. We encourage you to explore how these AWS services can address similar challenges in your own healthcare or specialized search applications. For more information about implementing healthcare solutions on AWS, visit the AWS for Healthcare & Life Sciences page.

About the authors
K. Faryab Haye is an Applied Scientist II at Amazon Health located in Seattle, WA, where he leads search and query understanding initiatives for healthcare AI. His work spans the complete ML lifecycle from large-scale data processing to deploying production systems that serve millions of customers. Faryab earned his MS in Computer Science with a Machine Learning specialization from the University of Michigan and co-founded the Applied Science Club at Amazon Health. When not building ML systems, he can be found hiking mountains, cycling, skiing, or playing volleyball.
Vineeth Harikumar is a Principal Engineer at Amazon Health Services working on growth and engagement tech initiatives for Amazon One Medical (primary care and telehealth services), Pharmacy prescription delivery, and Health condition programs. Prior to working in healthcare, he worked on building large-scale backend systems in Amazon’s global inventory, supply chain and fulfillment network, Kindle devices, and Digital commerce businesses (such as Prime Video, Music, and eBooks).

Microsoft Released VibeVoice-1.5B: An Open-Source Text-to-Speech Model that can Synthesize up to 90 Minutes of Speech with Four Distinct Speakers

Table of contentsKey FeaturesArchitecture and Technical Deep DiveModel Limitations and Responsible UseConclusionFAQs

Microsoft’s latest open source release, VibeVoice-1.5B, redefines the boundaries of text-to-speech (TTS) technology—delivering expressive, long-form, multi-speaker generated audio that is MIT licensed, scalable, and highly flexible for research use. This model isn’t just another TTS engine; it’s a framework designed to generate up to 90 minutes of uninterrupted, natural-sounding audio, support simultaneous generation of up to four distinct speakers, and even handle cross-lingual and singing synthesis scenarios. With a streaming architecture and a larger 7B model announced for the near future, VibeVoice-1.5B positions itself as a major advance for AI-powered conversational audio, podcasting, and synthetic voice research.

Key Features

Massive Context and Multi-Speaker Support: VibeVoice-1.5B can synthesize up to 90 minutes of speech with up to four distinct speakers in a single session—far surpassing the typical 1-2 speaker limit of traditional TTS models.

Simultaneous Generation: The model isn’t just stitching together single-voice clips; it’s designed to support parallel audio streams for multiple speakers, mimicking natural conversation and turn-taking.

Cross-Lingual and Singing Synthesis: While primarily trained on English and Chinese, the model is capable of cross-lingual synthesis and can even generate singing—features rarely demonstrated in previous open source TTS models.

MIT License: Fully open source and commercially friendly, with a focus on research, transparency, and reproducibility.

Scalable for Streaming and Long-Form Audio: The architecture is designed for efficient long-duration synthesis and anticipates a forthcoming 7B streaming-capable model, further expanding possibilities for real-time and high-fidelity TTS.

Emotion and Expressiveness: The model is touted for its emotion control and natural expressiveness, making it suitable for applications like podcasts or conversational scenarios.

https://huggingface.co/microsoft/VibeVoice-1.5B

Architecture and Technical Deep Dive

VibeVoice’s foundation is a 1.5B-parameter LLM (Qwen2.5-1.5B) that integrates with two novel tokenizers—Acoustic and Semantic—both designed to operate at a low frame rate (7.5Hz) for computational efficiency and consistency across long sequences.

Acoustic Tokenizer: A σ-VAE variant with a mirrored encoder-decoder structure (each ~340M parameters), achieving 3200x downsampling from raw audio at 24kHz (see the quick check after this list).

Semantic Tokenizer: Trained via an ASR proxy task, this encoder-only architecture mirrors the acoustic tokenizer’s design (minus the VAE components).

Diffusion Decoder Head: A lightweight (~123M parameter) conditional diffusion module predicts acoustic features, leveraging Classifier-Free Guidance (CFG) and DPM-Solver for perceptual quality.

Context Length Curriculum: Training starts at 4k tokens and scales up to 65k tokens—enabling the model to generate very long, coherent audio segments.

Sequence Modeling: The LLM understands dialogue flow for turn-taking, while the diffusion head generates fine-grained acoustic details—separating semantics and synthesis while preserving speaker identity over long durations.
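As a quick consistency check on the tokenizer figures, the stated 3200x downsampling of 24kHz audio works out to exactly the 7.5Hz frame rate quoted above; the two-line sketch below only restates that arithmetic.

# Sanity check: 24 kHz audio downsampled by 3200x gives the quoted 7.5 Hz token frame rate.
sample_rate_hz = 24_000
downsampling_factor = 3_200
print(sample_rate_hz / downsampling_factor)  # 7.5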

Model Limitations and Responsible Use

English and Chinese Only: The model is trained solely on these languages; other languages may produce unintelligible or offensive outputs.

No Overlapping Speech: While it supports turn-taking, VibeVoice-1.5B does not model overlapping speech between speakers.

Speech-Only: The model does not generate background sounds, Foley, or music—audio output is strictly speech.

Legal and Ethical Risks: Microsoft explicitly prohibits use for voice impersonation, disinformation, or authentication bypass. Users must comply with laws and disclose AI-generated content.

Not for Professional Real-Time Applications: While efficient, this release is not optimized for low-latency, interactive, or live-streaming scenarios; that’s the target for the soon-to-come 7B variant.

Conclusion

Microsoft’s VibeVoice-1.5B is a breakthrough in open TTS: scalable, expressive, and multi-speaker, with a lightweight diffusion-based architecture that unlocks long-form, conversational audio synthesis for researchers and open source developers. While use is currently research-focused and limited to English/Chinese, the model’s capabilities—and the promise of upcoming versions—signal a paradigm shift in how AI can generate and interact with synthetic speech.

For technical teams, content creators, and AI enthusiasts, VibeVoice-1.5B is a must-explore tool for the next generation of synthetic voice applications—available now on Hugging Face and GitHub, with clear documentation and an open license. As the field pivots toward more expressive, interactive, and ethically transparent TTS, Microsoft’s latest offering is a landmark for open source AI speech synthesis.

FAQs

What makes VibeVoice-1.5B different from other text-to-speech models?

VibeVoice-1.5B can generate up to 90 minutes of expressive, multi-speaker audio (up to four speakers), supports cross-lingual and singing synthesis, and is fully open source under the MIT license—pushing the boundaries of long-form conversational AI audio generation.

What hardware is recommended for running the model locally?

Community tests show that generating a multi-speaker dialog with the 1.5B checkpoint consumes roughly 7 GB of GPU VRAM, so an 8 GB consumer card (e.g., RTX 3060) is generally sufficient for inference.

Which languages and audio styles does the model support today?

VibeVoice-1.5B is trained only on English and Chinese and can perform cross-lingual narration (e.g., English prompt → Chinese speech) as well as basic singing synthesis. It produces speech only—no background sounds—and does not model overlapping speakers; turn-taking is sequential.

Check out the Technical Report, Model on Hugging Face and Codes.


SEA-LION v4: Multimodal Language Modeling for Southeast Asia

AI Singapore (AISG) has released SEA-LION v4, an open-source multimodal language model developed in collaboration with Google and based on the Gemma 3 (27B) architecture. The model is designed to support Southeast Asian languages, including those with limited digital resources, and provides both text and image understanding capabilities. SEA-LION v4 uses a commercially permissive license and is intended for straightforward deployment on standard hardware platforms.

https://leaderboard.sea-lion.ai/

Benchmark Results: “Small” but State-of-the-Art

Performance evaluations on the SEA-HELM benchmark—a rigorous multilingual suite designed specifically to test Southeast Asian (SEA) languages—confirm SEA-LION v4’s capabilities. Across tasks in Burmese, Filipino, Indonesian, Malay, Tamil, Thai, and Vietnamese, v4 achieves a top ranking among models under 200B parameters, and globally places #5 out of 55 models tested.

This result is striking: the model is not only outperforming open-source peers like Llama 3, Qwen 3, and Gemma 3, but also holding its own against proprietary giants with parameter counts several times larger.

Filipino: 74.53 (v4) vs. 74.09 (Gemma 3-27B)

Malay: 71.31 (v4) vs. 71.20 (Gemma 3-27B)

Tamil: 68.47 (v4) vs. 68.45 (Gemma 3-27B)

Burmese: 57.18 (v4) just behind Gemma 3’s 57.78, outperforming Llama 4 MoE (109B).

In many languages, SEA-LION v4 performs on par with or better than models over 3–10x its size. This balance of efficiency and capability makes it one of the strongest openly available multilingual models for both research and industry use.

What’s New in SEA-LION v4

The fourth-generation model introduces several major technical advancements that make it uniquely suited for both regional and global applications:

1. Open Sourced

Unlike many closed models, SEA-LION v4 is released under the commercially permissive Gemma license, lowering adoption barriers for startups, researchers, and enterprises. Distribution is supported across multiple ecosystems:

Hugging Face (fine-tuned and base models)

Google Cloud Vertex AI

AWS SageMaker

Kaggle for lightweight experimentation

NVIDIA NIM and Ollama for edge deployment

This openness ensures SEA-LION v4 can be integrated into workflows across both cloud-scale enterprises and on-device environments.
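As an illustration of the Hugging Face route, here is a minimal sketch using the transformers text-generation pipeline; the model identifier is an assumption (check AI Singapore's Hugging Face organization for the exact name), and memory requirements for a 27B model still apply.

# Load SEA-LION v4 via the Hugging Face transformers pipeline (sketch; model ID is an assumption).
from transformers import pipeline

model_id = "aisingapore/Gemma-SEA-LION-v4-27B-IT"  # hypothetical identifier

generator = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",   # spread the 27B weights across available devices
    torch_dtype="auto",
)

prompt = "Terjemahkan ke Bahasa Inggris: Selamat pagi, apa khabar?"
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])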

2. Efficiency and Portability at Scale

Despite its 27B parameters, SEA-LION v4 is designed to run practically anywhere. With quantized versions in FP4 and FP8, users can achieve:

<0.5% performance drop vs. full precision

Up to 50% faster inference

Deployment on consumer-grade hardware (e.g., a laptop with 32GB RAM)

This efficiency democratizes access: a high-quality multimodal model that previously required extensive infrastructure is now available to researchers or developers with modest setups.

3. Multimodality: Text + Vision

SEA-LION v4 is the initiative’s first multimodal release. Beyond text generation and understanding, the model can “see,” interpret images, and combine multimodal information in responses. This makes it highly relevant for use cases such as:

Multilingual document analysis and translation with embedded images

Image-grounded question answering in local languages

Interactive agentic workflows requiring text + image context

The model also supports 128K token context windows, enabling extended reasoning over long documents, transcripts, or multi-turn prompts, a critical capability for enterprise and research applications.

4. Agentic and Structured Interactions

SEA-LION v4 includes tools beyond raw language generation, including:

Function calling—enabling integration with external APIs and agents

Structured outputs—JSON and schema-compliant generations for downstream automation

Compatibility with agentic workflows popular in enterprise adoption of LLMs

Together, these enhancements extend SEA-LION v4 beyond static Q&A into real-world applications such as workflow orchestration, research assistants, and multimodal enterprise bots.

Trained for Southeast Asia, Built for the World

A unique differentiator of SEA-LION v4 is its training foundation. The model is trained on over 1 trillion tokens, with heavy emphasis on a curated Southeast Asian dataset. This makes it particularly strong in handling low-resource regional languages, dialects, and cultural contexts, where global foundation models often fail.

In SEA-HELM’s Filipino, Malay, Tamil, and Burmese tasks, SEA-LION v4 is consistently among the best-performing models across all parameter ranges. This makes it a crucial enabler for digital equity in a region where over 600 million people rely on diverse linguistic ecosystems.

At the same time, because it inherits Gemma’s strong general-purpose reasoning, the model remains competitive in English and global tasks, making it a versatile choice for universal deployment.

Conclusion

SEA-LION v4 shows how a 27B-parameter model, when optimized and trained on domain-specific data, can achieve competitive results in multilingual tasks. It offers multilingual performance, multimodal capabilities, an open license, and deployability across various platforms, contributing to advancements in regional AI models.

Check out the Model on Hugging Face and SEA-LION Playground.

How Do GPUs and TPUs Differ in Training Large Transformer Models? Top GPUs and TPUs with Benchmarks

Both GPUs and TPUs play crucial roles in accelerating the training of large transformer models, but their core architectures, performance profiles, and ecosystem compatibility lead to significant differences in use case, speed, and flexibility.

Architecture and Hardware Fundamentals

TPUs are custom ASICs (Application-Specific Integrated Circuits) engineered by Google, purpose-built for highly efficient matrix operations required by large neural networks. Their design focuses on vector processing, matrix multiplication units, and systolic arrays—leading to exceptional throughput on Transformer layers and deep integration with TensorFlow and JAX.

GPUs, dominated by NVIDIA’s CUDA-capable chips, use thousands of general-purpose parallel cores alongside specialized tensor units, high-bandwidth memory, and complex memory management systems. While originally designed for graphics, modern GPUs now offer optimized support for large-scale ML tasks and a wider variety of model architectures.

Performance in Transformer Training

TPUs outperform GPUs for massive batch processing and models directly compatible with their architecture, including most TensorFlow-based LLMs and transformer networks. For example, Google’s v4/v5p TPUs can be up to 2.8 times faster at training models such as PaLM and Gemini compared to some previous TPUs—and consistently edge out GPUs like the A100 for these workloads at scale.

GPUs deliver strong performance for a diverse set of models, especially those using dynamic shapes, custom layers, or frameworks other than TensorFlow. GPUs excel in smaller batch sizes, unconventional model topologies, and scenarios requiring flexible debugging, custom kernel development, or non-standard operations.

Software Ecosystem and Framework Support

TPUs are tightly coupled with Google’s AI ecosystem, primarily supporting TensorFlow and JAX. PyTorch support is available but less mature and less widely adopted for production workloads.

GPUs support nearly every major AI framework—including PyTorch, TensorFlow, JAX, and MXNet—enabled by mature toolchains like CUDA, cuDNN, and ROCm.
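A quick way to see this framework/accelerator pairing in practice is to check which devices each framework reports; the sketch below assumes PyTorch is installed and JAX optionally so (on a TPU VM, jax.devices() typically lists TPU cores instead of GPUs).

# Quick environment check: which accelerator does each framework see?
import torch

if torch.cuda.is_available():
    print("PyTorch GPU:", torch.cuda.get_device_name(0))
else:
    print("PyTorch: no CUDA GPU visible")

try:
    import jax
    print("JAX devices:", jax.devices())  # e.g., GPU devices or TPU cores
except ImportError:
    print("JAX not installed")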

Scalability and Deployment Options

TPUs scale seamlessly via Google Cloud, allowing the training of ultra-large models on pod-scale infrastructure with thousands of interconnected chips for maximum throughput and minimal latency in distributed setups.

GPUs provide broad deployment flexibility on cloud, on-premises, and edge environments, with multi-vendor availability (AWS, Azure, Google Cloud, private hardware) and extensive support for containerized ML, orchestration, and distributed training frameworks (e.g., DeepSpeed, Megatron-LM).

Energy Efficiency and Cost

TPUs are engineered for high efficiency in data centers, often delivering superior performance-per-watt and lower total project costs in compatible workflows.

GPUs are catching up with greater efficiency in newer generations, but often entail higher total power consumption and costs for ultra-large production runs versus optimized TPUs.

Use Cases and Limitations

TPUs shine in training extremely large LLMs (Gemini, PaLM) within the Google Cloud ecosystem using TensorFlow. They struggle with models requiring dynamic shapes, custom operations, or advanced debugging.

GPUs are preferred for experimentation, prototyping, training/fine-tuning with PyTorch or multi-framework support, and deployments needing on-prem or diverse cloud options. Most commercial and open-source LLMs (GPT-4, LLaMA, Claude) run on high-end NVIDIA GPUs.

Summary Comparison Table

Architecture – TPU: custom ASIC with systolic arrays; GPU: general-purpose parallel processor

Performance – TPU: batch processing, TensorFlow LLMs; GPU: all frameworks, dynamic models

Ecosystem – TPU: TensorFlow, JAX (Google-centric); GPU: PyTorch, TensorFlow, JAX, wide adoption

Scalability – TPU: Google Cloud pods, up to thousands of chips; GPU: cloud/on-prem/edge, containers, multi-vendor

Energy Efficiency – TPU: optimal for data centers; GPU: improved in newer generations

Flexibility – TPU: limited, mostly TensorFlow/JAX; GPU: high, all frameworks and custom ops

Availability – TPU: Google Cloud only; GPU: global cloud and on-prem platforms

TPUs and GPUs are designed for different priorities: TPUs maximize throughput and efficiency for transformer models at scale using Google’s stack, while GPUs offer universal flexibility, mature software support, and broad hardware choice for ML practitioners and enterprise teams. For training large transformer models, select the accelerator that aligns with model framework, workflow needs, debugging and deployment requirements, and scaling ambitions for your project.

The best 2025 training benchmarks for large transformer models are currently achieved by Google’s TPU v5p and NVIDIA’s Blackwell (B200) and H200 GPUs, according to MLPerf and independent deep learning infrastructure reviews.

Top TPU Models and Benchmarks

Google TPU v5p: Delivers market-leading performance for training LLMs and dense transformer networks. TPU v5p offers substantial improvements over previous TPU versions, allowing massive scale (up to thousands of chips) within Google Cloud pods and supporting models up to and beyond 500B parameters. TPU v5p is noted for high throughput, cost-effective training, and class-leading efficiency for TensorFlow/JAX-based workloads.

Google TPU Ironwood (for inference): Optimized for inference with transformer models, achieving best-in-class speed and lowest energy consumption for production-scale deployments.

Google TPU v5e: Delivers strong price-performance, especially for training large models on a budget, with up to 70B+ parameters. TPU v5e can be 4–10× more cost-efficient than similarly sized GPU clusters for large LLMs.

Top GPU Models and Benchmarks

NVIDIA Blackwell B200: The new Blackwell architecture (GB200 NVL72 and B200) shows record-breaking throughput in MLPerf v5.0 benchmarks, achieving up to 3.4× higher per-GPU performance than the H200 for models like Llama 3.1 (405B params) and Mixtral 8x7B. System-level speedups with NVLink domains allow for 30× cluster-wide performance compared to older generations.

NVIDIA H200 Tensor Core GPU: Highly efficient for LLM training, succeeding the H100 with greater bandwidth (10TB/s), improved FP8/BF16 performance, and fine-tuned for transformer workloads. Outperformed by Blackwell B200 but still the most widely supported and available option in enterprise cloud environments.

NVIDIA RTX 5090 (Blackwell 2.0): Newly launched in 2025, offers up to 104.8 TFLOPS single-precision performance and 680 fifth-gen Tensor Cores. It’s ideal for research labs and medium-scale production, especially when price-to-performance and local deployment are primary concerns.

MLPerf and Real-World Highlights

TPU v5p and B200 demonstrate the fastest training throughput and efficiency for massive LLMs, with B200 delivering 3× speedup over prior generations and MLPerf confirming record token/second rates in multi-GPU NVLink clusters.

TPU pods retain an edge in price-per-token, energy efficiency, and scalability for Google Cloud-centric TensorFlow/JAX workflows, while Blackwell B200 dominates MLPerf for PyTorch and heterogeneous environments.

These models represent the industry standard for large transformer training in 2025, with both TPUs and GPUs delivering state-of-the-art performance, scalability, and cost-efficiency depending on framework and ecosystem.


A Coding Guide to Build Flexible Multi-Model Workflows in GluonTS with …

In this tutorial, we explore GluonTS from a practical perspective, where we generate complex synthetic datasets, prepare them, and apply multiple models in parallel. We focus on how to work with diverse estimators in the same pipeline, handle missing dependencies gracefully, and still produce usable results. By building in evaluation and visualization steps, we create a workflow that highlights how models can be trained, compared, and interpreted in a single, seamless process.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

from gluonts.dataset.pandas import PandasDataset
from gluonts.dataset.split import split
from gluonts.evaluation import make_evaluation_predictions, Evaluator
from gluonts.dataset.artificial import ComplexSeasonalTimeSeries

try:
    from gluonts.torch import DeepAREstimator
    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False

try:
    from gluonts.mx import DeepAREstimator as MXDeepAREstimator
    from gluonts.mx import SimpleFeedForwardEstimator
    MX_AVAILABLE = True
except ImportError:
    MX_AVAILABLE = False

We begin by importing the core libraries for data handling, visualization, and GluonTS utilities. We also set up conditional imports for PyTorch and MXNet estimators, allowing us to flexibly use whichever backend is available in our environment.

def create_synthetic_dataset(num_series=50, length=365, prediction_length=30):
    """Generate synthetic multi-variate time series with trends, seasonality, and noise"""
    np.random.seed(42)
    series_list = []

    for i in range(num_series):
        trend = np.cumsum(np.random.normal(0.1 + i*0.01, 0.1, length))

        daily_season = 10 * np.sin(2 * np.pi * np.arange(length) / 7)
        yearly_season = 20 * np.sin(2 * np.pi * np.arange(length) / 365.25)

        noise = np.random.normal(0, 5, length)
        values = np.maximum(trend + daily_season + yearly_season + noise + 100, 1)

        dates = pd.date_range(start='2020-01-01', periods=length, freq='D')

        series_list.append(pd.Series(values, index=dates, name=f'series_{i}'))

    return pd.concat(series_list, axis=1)

We create a synthetic dataset where each series combines trend, seasonality, and noise. We design it so every run produces consistent results, and we return a clean multi-series DataFrame ready for experimentation.

print(" Creating synthetic multi-series dataset...")
df = create_synthetic_dataset(num_series=10, length=200, prediction_length=30)

dataset = PandasDataset(df, target=df.columns.tolist())

training_data, test_gen = split(dataset, offset=-60)
test_data = test_gen.generate_instances(prediction_length=30, windows=2)

print(" Initializing forecasting models...")

models = {}

if TORCH_AVAILABLE:
    try:
        models['DeepAR_Torch'] = DeepAREstimator(
            freq='D',
            prediction_length=30
        )
        print(" PyTorch DeepAR loaded")
    except Exception as e:
        print(f" PyTorch DeepAR failed to load: {e}")

if MX_AVAILABLE:
    try:
        models['DeepAR_MX'] = MXDeepAREstimator(
            freq='D',
            prediction_length=30,
            trainer=dict(epochs=5)
        )
        print(" MXNet DeepAR loaded")
    except Exception as e:
        print(f" MXNet DeepAR failed to load: {e}")

try:
    models['FeedForward'] = SimpleFeedForwardEstimator(
        freq='D',
        prediction_length=30,
        trainer=dict(epochs=5)
    )
    print(" FeedForward model loaded")
except Exception as e:
    print(f" FeedForward failed to load: {e}")

if not models:
    print(" Using artificial dataset with built-in models...")
    artificial_ds = ComplexSeasonalTimeSeries(
        num_series=10,
        prediction_length=30,
        freq='D',
        length_low=150,
        length_high=200
    ).generate()

    training_data, test_gen = split(artificial_ds, offset=-60)
    test_data = test_gen.generate_instances(prediction_length=30, windows=2)
We generate a 10-series dataset, wrap it into a GluonTS PandasDataset, and split it into training and test windows. We then initialize multiple estimators (PyTorch DeepAR, MXNet DeepAR, and FeedForward) when available, and fall back to a built-in artificial dataset if no backends load.

trained_models = {}
all_forecasts = {}

if models:
    for name, estimator in models.items():
        print(f" Training {name} model...")
        try:
            predictor = estimator.train(training_data)
            trained_models[name] = predictor

            forecasts = list(predictor.predict(test_data.input))
            all_forecasts[name] = forecasts
            print(f" {name} training completed!")

        except Exception as e:
            print(f" {name} training failed: {e}")
            continue

print(" Evaluating model performance...")
evaluator = Evaluator(quantiles=[0.1, 0.5, 0.9])
evaluation_results = {}

for name, forecasts in all_forecasts.items():
    if forecasts:
        try:
            agg_metrics, item_metrics = evaluator(test_data.label, forecasts)
            evaluation_results[name] = agg_metrics
            print(f"\n{name} Performance:")
            print(f" MASE: {agg_metrics['MASE']:.4f}")
            print(f" sMAPE: {agg_metrics['sMAPE']:.4f}")
            print(f" Mean wQuantileLoss: {agg_metrics['mean_wQuantileLoss']:.4f}")
        except Exception as e:
            print(f" Evaluation failed for {name}: {e}")
We train each available estimator, collect probabilistic forecasts, and store the fitted predictors for reuse. We then evaluate results with MASE, sMAPE, and weighted quantile loss, giving us a consistent, comparative view of model performance.

def plot_advanced_forecasts(test_data, forecasts_dict, series_idx=0):
    """Advanced plotting with multiple models and uncertainty bands"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Advanced GluonTS Forecasting Results', fontsize=16, fontweight='bold')

    if not forecasts_dict:
        fig.text(0.5, 0.5, 'No successful forecasts to display',
                 ha='center', va='center', fontsize=20)
        return fig

    if series_idx < len(test_data.label):
        ts_label = test_data.label[series_idx]
        ts_input = test_data.input[series_idx]['target']

        colors = ['blue', 'red', 'green', 'purple', 'orange']

        ax1 = axes[0, 0]
        ax1.plot(range(len(ts_input)), ts_input, 'k-', label='Historical', alpha=0.8, linewidth=2)
        ax1.plot(range(len(ts_input), len(ts_input) + len(ts_label)),
                 ts_label, 'k--', label='True Future', alpha=0.8, linewidth=2)

        for i, (name, forecasts) in enumerate(forecasts_dict.items()):
            if series_idx < len(forecasts):
                forecast = forecasts[series_idx]
                forecast_range = range(len(ts_input), len(ts_input) + len(forecast.mean))

                color = colors[i % len(colors)]
                ax1.plot(forecast_range, forecast.mean,
                         color=color, label=f'{name} Mean', linewidth=2)

                try:
                    ax1.fill_between(forecast_range,
                                     forecast.quantile(0.1), forecast.quantile(0.9),
                                     alpha=0.2, color=color, label=f'{name} 80% CI')
                except Exception:
                    pass

        ax1.set_title('Multi-Model Forecasts Comparison', fontsize=12, fontweight='bold')
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        ax1.set_xlabel('Time Steps')
        ax1.set_ylabel('Value')

        ax2 = axes[0, 1]
        if all_forecasts:
            first_model = list(all_forecasts.keys())[0]
            if series_idx < len(all_forecasts[first_model]):
                forecast = all_forecasts[first_model][series_idx]
                ax2.scatter(ts_label, forecast.mean, alpha=0.7, s=60)

                min_val = min(min(ts_label), min(forecast.mean))
                max_val = max(max(ts_label), max(forecast.mean))
                ax2.plot([min_val, max_val], [min_val, max_val], 'r--', alpha=0.8)

                ax2.set_title(f'Prediction vs Actual - {first_model}', fontsize=12, fontweight='bold')
                ax2.set_xlabel('Actual Values')
                ax2.set_ylabel('Predicted Values')
                ax2.grid(True, alpha=0.3)

        ax3 = axes[1, 0]
        if all_forecasts:
            first_model = list(all_forecasts.keys())[0]
            if series_idx < len(all_forecasts[first_model]):
                forecast = all_forecasts[first_model][series_idx]
                residuals = ts_label - forecast.mean
                ax3.hist(residuals, bins=15, alpha=0.7, color='skyblue', edgecolor='black')
                ax3.axvline(x=0, color='r', linestyle='--', linewidth=2)
                ax3.set_title(f'Residuals Distribution - {first_model}', fontsize=12, fontweight='bold')
                ax3.set_xlabel('Residuals')
                ax3.set_ylabel('Frequency')
                ax3.grid(True, alpha=0.3)

        ax4 = axes[1, 1]
        if evaluation_results:
            metrics = ['MASE', 'sMAPE']
            model_names = list(evaluation_results.keys())
            x = np.arange(len(metrics))
            width = 0.35

            for i, model_name in enumerate(model_names):
                values = [evaluation_results[model_name].get(metric, 0) for metric in metrics]
                ax4.bar(x + i * width, values, width,
                        label=model_name, color=colors[i % len(colors)], alpha=0.8)

            ax4.set_title('Model Performance Comparison', fontsize=12, fontweight='bold')
            ax4.set_xlabel('Metrics')
            ax4.set_ylabel('Value')
            ax4.set_xticks(x + width / 2 if len(model_names) > 1 else x)
            ax4.set_xticklabels(metrics)
            ax4.legend()
            ax4.grid(True, alpha=0.3)
        else:
            ax4.text(0.5, 0.5, 'No evaluation\nresults available',
                     ha='center', va='center', transform=ax4.transAxes, fontsize=14)

    plt.tight_layout()
    return fig


if all_forecasts and test_data.label:
    print("Creating advanced visualizations...")
    fig = plot_advanced_forecasts(test_data, all_forecasts, series_idx=0)
    plt.show()

    print("\nTutorial completed successfully!")
    print(f"Trained {len(trained_models)} model(s) on {len(df.columns) if 'df' in locals() else 10} time series")
    print("Prediction length: 30 days")

    if evaluation_results:
        best_model = min(evaluation_results.items(), key=lambda x: x[1]['MASE'])
        print(f"Best performing model: {best_model[0]} (MASE: {best_model[1]['MASE']:.4f})")

    print("\nEnvironment Status:")
    # Status text reconstructed; the original status symbols were lost in extraction
    print(f"  PyTorch Support: {'Yes' if TORCH_AVAILABLE else 'No'}")
    print(f"  MXNet Support: {'Yes' if MX_AVAILABLE else 'No'}")

else:
    print("Creating demonstration plot with synthetic data...")

    fig, ax = plt.subplots(1, 1, figsize=(12, 6))

    dates = pd.date_range('2020-01-01', periods=100, freq='D')
    ts = 100 + np.cumsum(np.random.normal(0, 2, 100)) + 20 * np.sin(np.arange(100) * 2 * np.pi / 30)

    ax.plot(dates[:70], ts[:70], 'b-', label='Historical Data', linewidth=2)
    ax.plot(dates[70:], ts[70:], 'r--', label='Future (Example)', linewidth=2)
    ax.fill_between(dates[70:], ts[70:] - 5, ts[70:] + 5, alpha=0.3, color='red')

    ax.set_title('GluonTS Probabilistic Forecasting Example', fontsize=14, fontweight='bold')
    ax.set_xlabel('Date')
    ax.set_ylabel('Value')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    print("\nTutorial demonstrates advanced GluonTS concepts:")
    print("  • Multi-series dataset generation")
    print("  • Probabilistic forecasting")
    print("  • Model evaluation and comparison")
    print("  • Advanced visualization techniques")
    print("  • Robust error handling")

We train each available model, generate probabilistic forecasts, and evaluate them with consistent metrics before visualizing comparisons, residuals, and uncertainty bands. If no models are available, we still demonstrate the workflow with a synthetic example so we can inspect plots and key concepts end to end.

In conclusion, we put together a robust setup that balances data creation, model experimentation, and performance analysis. Instead of relying on a single configuration, we see how to adapt flexibly, test multiple options, and visualize results in ways that make comparison intuitive. This gives us a stronger foundation for experimenting with GluonTS and applying the same principles to real datasets, while keeping the process modular and easy to extend.

Check out the FULL CODES here.
The post A Coding Guide to Build Flexible Multi-Model Workflows in GluonTS with Synthetic Data, Evaluation, and Advanced Visualizations appeared first on MarkTechPost.

What is a Database? Modern Database Types, Examples, and Applications (2025)

In today’s data-driven world, databases form the backbone of modern applications—from mobile apps to enterprise systems. Understanding the different types of databases and their applications is crucial for selecting the right system for specific needs, whether you’re building a personal project or architecting enterprise-level solutions.

What is a Database?

A database is a structured collection of data that is stored electronically and managed by a database management system (DBMS). Databases enable efficient storage, retrieval, and management of both structured and unstructured data, providing the foundation for applications to function effectively.

The choice of database significantly impacts performance, scalability, consistency, and data integrity. Modern applications rely on databases to organize data and allow users to access information quickly and reliably.

Key Types of Modern Databases

1. Relational Databases (RDBMS)

Relational databases organize data into tables with rows and columns, enforcing schemas and relationships using keys. They are ACID-compliant (ensuring atomicity, consistency, isolation, durability) and use SQL for data querying.
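As a concrete illustration of these properties, the sketch below uses Python's standard-library sqlite3 module; the accounts table and balances are hypothetical, and production RDBMSs add richer schemas, indexes, and isolation controls:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the example
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.execute("INSERT INTO accounts (owner, balance) VALUES ('alice', 100.0), ('bob', 50.0)")

# Transfer funds atomically: either both updates commit or neither does (ACID)
try:
    with conn:  # the connection acts as a transaction context manager
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE owner = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE owner = 'bob'")
except sqlite3.Error:
    pass  # on failure the transaction is rolled back automatically

print(conn.execute("SELECT owner, balance FROM accounts").fetchall())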

Recent Innovations (2025):

MySQL 9.0: Enhanced JSON processing, vector data types for AI, Enterprise JavaScript stored procedures, SHA-3 encryption.

PostgreSQL 17: Advanced JSON query functions, vector search for ML, streaming I/O, incremental backups, and more robust replication.

Oracle Database and IBM Db2: Leading RDBMSs in security, scalability, multi-cloud deployment, and disaster recovery.

Best for: Financial systems, e-commerce, enterprise apps, analytics.

Popular Platforms: MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, IBM Db2, MariaDB.

2. NoSQL Databases

NoSQL databases break away from structured, table-based models, offering flexible data formats suited for semi-structured and unstructured data.

Key Types (illustrated with a short sketch after this list):

Document Stores: Store data as JSON/BSON documents. (e.g., MongoDB, Couchbase)

Key-Value Stores: Ultra-fast, each data item is a key-value pair. (e.g., Redis, Amazon DynamoDB)

Wide-Column Stores: Flexible columns per row; optimized for big data and analytics. (e.g., Apache Cassandra, HBase)

Graph Databases: Nodes and edges model complex relationships. (e.g., Neo4j, Amazon Neptune)

Multi-Model Databases: Support several of the above paradigms in one platform.
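To make the data-model differences concrete, here is a dependency-free Python sketch of how the same kind of record might be shaped under each paradigm. The structures are illustrative stand-ins, not the client APIs of MongoDB, Redis, Cassandra, or Neo4j:

# Document store: a self-contained JSON-like record
order_doc = {"_id": "order-42", "customer": "alice",
             "items": [{"sku": "A1", "qty": 2}], "total": 59.90}

# Key-value store: an opaque value addressed by a single key
kv = {"session:alice": '{"cart": ["A1", "A1"], "ttl": 3600}'}

# Wide-column store: rows keyed by partition, with flexible columns per row
wide = {("alice", "2025-01-01"): {"page_views": 12, "clicks": 3},
        ("alice", "2025-01-02"): {"page_views": 7}}

# Graph: nodes plus typed edges modelling relationships
nodes = {"alice": {"type": "user"}, "A1": {"type": "product"}}
edges = [("alice", "PURCHASED", "A1")]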

Notable Advances (2025):

MongoDB: Now with native enterprise SSO, DiskANN vector indexing for production AI, sharding for horizontal scaling, strong access controls.

Cassandra 5.0: Advanced vector types for AI, storage-attached indexes, dynamic data masking, and improved compaction for massive, distributed workloads.

Best for: Real-time analytics, recommendation systems, IoT, social platforms, streaming data.

3. Cloud Databases

Cloud databases are managed on cloud platforms, offering elasticity, high availability, managed services, and seamless scaling. They are optimized for modern DevOps and serverless environments, often delivering database-as-a-service (DBaaS).

Leading Platforms: Amazon RDS, Google Cloud SQL, Azure SQL Database, MongoDB Atlas, Amazon Aurora.

Why choose cloud?

Automatic failover, scaling, and backups.

Global distribution for high availability.

Streamlined DevOps with managed infrastructure.

4. In-Memory and Distributed SQL Databases

In-memory databases (e.g., SAP HANA, SingleStore, Redis) store data in RAM instead of disk for lightning-fast access—ideal for real-time analytics and financial trades.

Distributed SQL databases (e.g., CockroachDB, Google Spanner) marry relational consistency (ACID) with NoSQL-style cloud scalability, handling multi-region deployments with global replication.

5. Time-Series Databases

Purpose-built to store and analyze chronological data, such as sensor readings or financial ticks. Optimized for fast ingestion, compression, and time-series queries.

Top platforms: InfluxDB, TimescaleDB.
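As a rough illustration of the access pattern these engines optimize, the pandas-only sketch below downsamples high-frequency readings into time buckets; dedicated systems such as InfluxDB or TimescaleDB do this with purpose-built storage and query languages:

import numpy as np
import pandas as pd

# One day of simulated per-minute sensor readings (append-only, timestamp-indexed)
idx = pd.date_range("2025-01-01", periods=1440, freq="1min")
readings = pd.DataFrame({"temperature": 20 + np.random.normal(0, 0.5, len(idx))}, index=idx)

# The canonical time-series query: downsample into 15-minute buckets with aggregates
summary = readings["temperature"].resample("15min").agg(["min", "mean", "max"])
print(summary.head())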

6. Object-Oriented and Multi-Model Databases

Object-oriented databases such as ObjectDB map data directly to objects in application code, making them a good fit for multimedia and custom application logic.

Multi-model databases (e.g., ArangoDB, SingleStore) can act as document, key-value, column store, and graph database in one platform for maximum flexibility.

7. Specialized & Emerging Types

Ledger Databases: Immutable records for compliance and blockchain-like trust. (e.g., Amazon QLDB)

Search Databases: For text search and analytics (e.g., Elasticsearch, OpenSearch).

Vector Databases: Natively index and retrieve embeddings for AI and search tasks, integrating with vector search and LLMs.
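The core operation a vector database accelerates is nearest-neighbor search over embeddings. A brute-force NumPy version of that search (dedicated engines layer approximate indexes such as HNSW or DiskANN on top) might look like this sketch, where the corpus size and dimensionality are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 384))             # stored document vectors
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

query = rng.normal(size=384)
query /= np.linalg.norm(query)

# Cosine similarity reduces to a dot product on unit-normalized vectors
scores = embeddings @ query
top_k = np.argsort(-scores)[:5]                         # indices of the 5 closest vectors
print(top_k, scores[top_k])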

2025 Feature Highlights Across Top Platforms

Database | Recent Standout Features (2025) | Ideal Use Cases
MySQL (RDBMS) | JSON schema validation, vector search, SHA-3, OpenID Connect | Web apps, analytics, AI
PostgreSQL | Vector search, streaming I/O, JSON_TABLE(), enhanced replication | Analytics, machine learning, web, ERP
MongoDB | Native SSO, DiskANN indexing for high-dim vectors, robust sharding | Cloud-native, AI, content management
Cassandra | Vector types, new indexing, dynamic data masking, unified compaction | IoT, analytics, high-scale workloads
InfluxDB | Extreme time-series compression, Grafana integration, high-throughput ingestion | IoT, monitoring, time-series analytics
DynamoDB | Serverless scaling, global replication, continuous backup | Real-time apps, serverless, web-scale
CockroachDB | Cloud-native, multi-region ACID consistency, vector indexes (AI similarity search) | Global-scale SQL, fintech, compliance
MariaDB | Columnar storage, MySQL compatibility, microsecond precision, advanced replication | Web, analytics, multi-cloud
IBM Db2 | ML-powered tuning, multi-site replication, advanced compression | Enterprise, analytics, cloud/hybrid

Real-World Applications

E-commerce: Customer, catalog, orders in RDBMS/NoSQL; recommendation engine in graph/vector DB; live analytics in time-series DB.

Banking: Core ledgers in RDBMS; anti-fraud AI models rely on vector and graph DBs; caching in Redis/in-memory for transactions.

AI/ML: Modern DBs (e.g., MySQL, PostgreSQL, Cassandra, MongoDB) now support vector search and indexing for LLMs, embeddings, and retrieval-augmented generation (RAG).

IoT & Monitoring: InfluxDB, Cassandra process millions of time-stamped sensor readings per second for real-time dashboards.

The post What is a Database? Modern Database Types, Examples, and Applications (2025) appeared first on MarkTechPost.

Build vs Buy for Enterprise AI (2025): A U.S. Market Decision Framework for VPs of AI Product

Enterprise AI in the U.S. has left the experimentation phase. CFOs expect clear ROI, boards expect evidence of risk oversight, and regulators expect controls consistent with existing risk management obligations. Against this backdrop, every VP of AI faces the enduring question: Should we build this capability in-house, buy it from a vendor, or blend the two?

The truth is there is no universal winner. The right answer is context-specific and portfolio-based. The choice is not about “in-house vs outsourced” in the abstract, but about mapping each use case to strategic differentiation, regulatory scrutiny, and execution maturity.

The U.S. Context: Regulatory and Market Anchors

While the EU is defining prescriptive rules through the AI Act, the U.S. remains sector-driven and enforcement-led. For U.S. enterprises, the real references are:

NIST AI Risk Management Framework (RMF): The de facto federal guidance, shaping procurement and vendor assurance programs across agencies and now mirrored in enterprise practice.

NIST AI 600-1 (Generative AI Profile): Refines evaluation expectations on hallucination testing, monitoring, and evidence.

Banking/finance: Federal Reserve SR 11-7 (model risk), FDIC/FFIEC guidance, OCC’s continued scrutiny of models embedded in underwriting/risk.

Healthcare: HIPAA + FDA regulatory oversight of algorithms in clinical context.

FTC enforcement authority: Expect risk of “deceptive practices” citations around transparency/disclosure.

SEC disclosure expectations: Public companies must begin disclosing “material AI-related risks”, especially bias, cybersecurity, and data use.

Bottom line for U.S. leaders: there is no monolithic AI Act yet, but boards and regulators will test your oversight, model governance, and vendor risk management frameworks. That reality puts pressure on the Build vs Buy decision to be evidence-based and defensible.

Build, Buy, and Blend: The Executive Portfolio View

At a strategic level, consider:

Build when a capability underpins competitive advantage, involves sensitive U.S. regulatory data (PHI, PII, financials), or demands deep integration into proprietary systems.

Buy when the use case is commoditized, speed-to-value determines success, or vendors bring compliance coverage you lack internally.

Blend for the majority of U.S. enterprise use cases: pair proven vendor platforms (multi-model routing, safety layers, compliance artifacts) with custom “last mile” work on prompts, retrieval, orchestration, and domain evals.

A 10-Dimension Framework for Scoring Build vs Buy

To move beyond opinion-driven debates, use a structured scoring model. Each dimension is scored 1–5, weighted by strategic priorities.

Dimension | Weight | Build Bias | Buy Bias
1. Strategic differentiation | 15% | AI capability is your product moat | Commodity productivity gain
2. Data sensitivity & residency | 10% | PHI/PII/regulatory datasets | Vendor can evidence HIPAA/SOC 2
3. Regulatory exposure | 10% | SR 11-7/HIPAA/FDA obligations | Vendor provides mapped controls
4. Time-to-value | 10% | 3–6 months acceptable | Must deliver in weeks
5. Customization depth | 10% | Domain-heavy, workflow-specific | Configurable suffices
6. Integration complexity | 10% | Embedded into legacy, ERP, control plane | Standard connectors adequate
7. Talent & ops maturity | 10% | LLMOps in place with platform/SRE | Vendor hosting preferred
8. 3-year TCO | 10% | Infra amortized, reuse across teams | Vendor's unit economics win
9. Performance & scale | 7.5% | Millisecond latency or burst control required | Out-of-box SLA acceptable
10. Lock-in & portability | 7.5% | Need open weights/standards | Comfortable with exit clause

Decision rules:

Build if Build score exceeds Buy score by ≥20%.

Buy if Buy exceeds Build by ≥20%.

Blend if results are within the ±20% band.

For executives, this turns debates into numbers—and sets the stage for transparent board reporting.
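As a minimal sketch of how such a scorecard could be computed, the snippet below encodes the weights from the table and the ±20% decision band; the per-dimension scores are made-up placeholders and the function name is illustrative:

# Weights from the 10-dimension framework (fractions of 1.0)
WEIGHTS = {
    "strategic_differentiation": 0.15, "data_sensitivity": 0.10, "regulatory_exposure": 0.10,
    "time_to_value": 0.10, "customization_depth": 0.10, "integration_complexity": 0.10,
    "talent_ops_maturity": 0.10, "three_year_tco": 0.10,
    "performance_scale": 0.075, "lock_in_portability": 0.075,
}

def decide(build_scores, buy_scores, band=0.20):
    """Each score dict maps dimension -> 1..5 rating; returns (decision, build_total, buy_total)."""
    build = sum(WEIGHTS[d] * build_scores[d] for d in WEIGHTS)
    buy = sum(WEIGHTS[d] * buy_scores[d] for d in WEIGHTS)
    if build >= buy * (1 + band):
        return "Build", build, buy
    if buy >= build * (1 + band):
        return "Buy", build, buy
    return "Blend", build, buy

# Hypothetical use case scored by a review board
build_scores = dict.fromkeys(WEIGHTS, 3); build_scores["strategic_differentiation"] = 5
buy_scores = dict.fromkeys(WEIGHTS, 3); buy_scores["time_to_value"] = 5
print(decide(build_scores, buy_scores))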

Modeling TCO on a 3-Year Horizon

A common failure mode in U.S. enterprises is comparing one-year subscription costs against three-year build costs. Correct decision-making requires a like-for-like, 36-month comparison (a simple roll-up sketch follows the two cost checklists below).

Build TCO (36 months):

Internal engineering (AI platform eng, ML eng, SRE, security)

Cloud compute (training + inference with GPUs/CPUs, caching layers, autoscaling)

Data pipelines (ETL, labeling, continuous eval, red-teaming)

Observability (vector stores, eval datasets, monitoring pipelines)

Compliance (NIST RMF audit prep, SOC 2 readiness, HIPAA reviews, penetration testing)

Egress fees and replication costs across regions

Buy TCO (36 months):

Subscription/license baseline + seats

Usage fees (tokens, calls, context length)

Integration/change management uplift

Add-ons (proprietary RAG, eval, safety layers)

Vendor compliance uplift (SOC 2, HIPAA BAAs, NIST mapping deliverables)

Migration costs at exit—especially egress fees, which remain material in U.S. cloud economics
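A simple way to enforce the like-for-like comparison is to roll both options up to 36 months. The sketch below mirrors the line items above; every dollar figure is a placeholder assumption, not a benchmark:

def three_year_tco(items):
    """Sum 36-month cost line items (all values already expressed over the 3-year horizon)."""
    return sum(items.values())

build = three_year_tco({
    "engineering_team": 2_400_000,      # platform/ML/SRE/security headcount, 3 years
    "cloud_compute": 900_000,           # training + inference, caching, autoscaling
    "data_pipelines_evals": 450_000,    # ETL, labeling, continuous eval, red-teaming
    "observability": 200_000,
    "compliance_audit_prep": 300_000,   # NIST RMF, SOC 2 readiness, HIPAA reviews, pen tests
    "egress_replication": 150_000,
})

buy = three_year_tco({
    "subscription_seats": 1_500_000,
    "usage_fees": 1_200_000,            # tokens, calls, context length
    "integration_change_mgmt": 400_000,
    "addons_rag_safety": 250_000,
    "vendor_compliance_uplift": 150_000,
    "exit_migration_egress": 200_000,
})

print(f"Build 3-yr TCO: ${build:,.0f} | Buy 3-yr TCO: ${buy:,.0f}")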

When to Build (U.S. Context)

Best-fit scenarios for Build:

Strategic IP: Underwriting logic, risk scoring, financial anomaly detection—the AI model is central to revenue.

Data control: You cannot let PHI, PII, or trade secrets pass into opaque vendor pipelines. HIPAA BAAs may cover exposure, but often fall short.

Custom integration: AI must be wired into claims systems, trading platforms, or ERP workflows that outsiders cannot navigate efficiently.

Risks:

Continuous compliance overhead: auditors will demand evidence artifacts, not policies.

Talent scarcity: hiring senior LLMOps engineers in the U.S. remains highly competitive.

Predictable overspending: red-teaming, observability, and evaluation pipelines are hidden costs not fully captured in initial budgets.

When to Buy (U.S. Context)

Best-fit scenarios for Buy:

Commodity tasks: Note-taking, Q&A, ticket deflection, baseline code copilots.

Speed: Senior leadership demands deployment inside a fiscal quarter.

Vendor-provided compliance: Reputable U.S. vendors increasingly align to NIST RMF, SOC 2, and HIPAA, with some pursuing or achieving ISO/IEC 42001 certification.

Risks:

Vendor lock-in: Some providers expose embeddings or retrieval only through proprietary APIs.

Usage volatility: Token metering creates budget unpredictability unless governed by rate limits.

Exit costs: Cloud egress pricing and re-platforming can distort ROI. Always demand explicit exit clauses around data portability.

The Blended Operating Model (Default for U.S. Enterprises in 2025)

Across U.S. Fortune 500 firms, the pragmatic equilibrium is blend:

Buy platform capabilities (governance, audit trails, multi-model routing, RBAC, DLP, compliance attestations).

Build the last mile: retrieval, tool adapters, evaluation datasets, hallucination tests, and sector-specific guardrails.

This allows scale without surrendering control of sensitive IP or falling short on board-level oversight.

Due Diligence Checklist for VP of AI

If Buying Vendors:

Assurance: ISO/IEC 42001 + SOC 2 + mapping to NIST RMF.

Data Management: HIPAA BAA, retention and minimization terms, redaction, regional segregation.

Exit: Explicit portability contract language; negotiated egress fee relief.

SLAs: Latency/throughput targets, U.S. data residency guarantees, bias and safety evaluation deliverables.

If Building In-House:

Governance: Operate under NIST AI RMF categories—govern, map, measure, manage.

Architecture: Multi-model orchestration layer to avoid lock-in; robust observability pipelines (traces, cost metering, hallucination metrics).

People: Dedicated LLMOps team; embedded evaluation and security experts.

Cost Controls: Request batching, retrieval optimization, explicit egress minimization strategies.

Decision Tree for Executives

Does the capability drive a competitive advantage within 12–24 months?

Yes → Probable Build.

No → Consider Buy.

Do you have governance maturity (aligned to NIST AI RMF) in-house?

Yes → Lean Build.

No → Blend: Buy vendor guardrails, build last-mile.

Would a vendor’s compliance artifacts satisfy regulators faster?

Yes → Lean Buy/Blend.

No → Build to meet obligations.

Does 3-year TCO favor internal amortization vs subscription costs?

Internal lower → Build.

Vendor lower → Buy.

Example: U.S. Healthcare Insurer

Use Case: Automated claim review and explanation of benefits.

Strategic differentiation: Moderate—efficiency vs competitor baseline.

Data sensitivity: PHI, subject to HIPAA.

Regulation: Subject to HHS + potential FDA oversight for clinical decision support.

Integration: Tight coupling with legacy claim processing systems.

Time-to-value: 6-month tolerance.

Internal team: Mature ML pipeline, but limited LLMOps experience.

Outcome:

Blend. Use a U.S. vendor platform with HIPAA BAA and SOC 2 Type II assurance for base LLM + governance.

Build custom retrieval layers, medical CPT/ICD code adaptation, and evaluation datasets.

Map oversight to NIST AI RMF and document evidence for board audit committee.

Takeaways for VPs of AI

Use a scored, weighted framework to evaluate each AI use case—this creates audit-ready evidence for boards and regulators.

Expect blended estates to dominate. Retain last-mile control (retrieval, prompts, evaluators) as enterprise IP.

Align builds and buys to NIST AI RMF, SOC 2, ISO/IEC 42001, and U.S. sector-specific laws (HIPAA, SR 11-7).

Always model 3-year TCO including cloud egress.

Insert exit/portability clauses into contracts up front.

For U.S. enterprises in 2025, the Build vs Buy question is not about ideology. It is about strategic allocation, governance evidence, and execution discipline. VPs of AI who operationalize this decision-making framework will not just accelerate deployment—they will also build resilience against regulatory scrutiny and board risk oversight.

The post Build vs Buy for Enterprise AI (2025): A U.S. Market Decision Framework for VPs of AI Product appeared first on MarkTechPost.

Prefix-RFT: A Unified Machine Learning Framework to blend Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT)

Large language models are typically refined after pretraining using either supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), each with distinct strengths and limitations. SFT is effective in teaching instruction-following through example-based learning, but it can lead to rigid behavior and poor generalization. RFT, on the other hand, optimizes models for task success using reward signals, which can improve performance but also introduce instability and reliance on a strong starting policy. While these methods are often used sequentially, their interaction remains poorly understood. This raises an important question: how can we design a unified framework that combines SFT’s structure with RFT’s goal-driven learning? 

Research at the intersection of RL and LLM post-training has gained momentum, particularly for training reasoning-capable models. Offline RL, which learns from fixed datasets, often yields suboptimal policies due to the limited diversity of the data. This has sparked interest in combining offline and online RL approaches to improve performance. In LLMs, the dominant strategy is to first apply SFT to teach desirable behaviors, then use RFT to optimize outcomes. However, the dynamics between SFT and RFT are still not well understood, and finding effective ways to integrate them remains an open research challenge. 

Researchers from the University of Edinburgh, Fudan University, Alibaba Group, Stepfun, and the University of Amsterdam propose a unified framework, called Prefix-RFT, that combines supervised and reinforcement fine-tuning. This method guides exploration using partial demonstrations, allowing the model to continue generating solutions with flexibility and adaptability. Tested on math reasoning tasks, Prefix-RFT consistently outperforms standalone SFT, RFT, and mixed-policy methods. It integrates easily into existing frameworks and proves robust to changes in demonstration quality and quantity. Blending demonstration-based learning with exploration can lead to more effective and adaptive training of large language models.

https://arxiv.org/abs/2507.01679

The study presents Prefix Reinforcement Fine-Tuning (Prefix-RFT) as a way to blend the strengths of SFT and RFT. While SFT offers stability by mimicking expert demonstrations, RFT encourages exploration through the use of reward signals. Prefix-RFT bridges the two by using a partial demonstration (a prefix) and letting the model generate the rest. This approach guides learning without relying too heavily on full supervision. It incorporates techniques like entropy-based clipping and a cosine decay scheduler to ensure stable training and efficient learning. Compared to prior methods, Prefix-RFT offers a more balanced and adaptive fine-tuning strategy. 

Prefix-RFT is a reward fine-tuning method that improves performance using high-quality offline math datasets, such as OpenR1-Math-220K (46k filtered problems). Tested on Qwen2.5-Math-7B, 1.5B, and LLaMA-3.1-8B, it was evaluated on benchmarks including AIME 2024/25, AMC, MATH500, Minerva, and OlympiadBench. Prefix-RFT achieved the highest avg@32 and pass@1 scores across tasks, outperforming RFT, SFT, ReLIFT, and LUFFY. Using Dr. GRPO, it updated only the top 20% high-entropy prefix tokens, with the prefix length decaying from 95% to 5%. It maintained intermediate SFT loss, indicating a strong balance between imitation and exploration, especially on difficult problems (Trainhard). 
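The paper's implementation is not reproduced here, but the two mechanisms described above are straightforward to sketch: a cosine schedule that decays the demonstration-prefix ratio from 95% to 5%, and a mask that keeps only the top 20% highest-entropy prefix tokens for the update. Function names and shapes below are illustrative assumptions, not the authors' code:

import math
import torch

def prefix_ratio(step, total_steps, start=0.95, end=0.05):
    """Cosine decay of the fraction of the expert demonstration used as a prefix."""
    progress = min(step / max(total_steps, 1), 1.0)
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * progress))

def top_entropy_mask(token_entropies, keep_frac=0.20):
    """Boolean mask selecting the top-`keep_frac` highest-entropy prefix tokens for the update."""
    k = max(1, int(keep_frac * token_entropies.numel()))
    threshold = torch.topk(token_entropies, k).values.min()
    return token_entropies >= threshold

# Example: at mid-training, roughly half of each demonstration is kept as a prefix
print(prefix_ratio(step=500, total_steps=1000))          # ~0.50
entropies = torch.rand(128)                              # per-token entropy of a 128-token prefix
print(top_entropy_mask(entropies).sum().item())          # ~25 tokens selected for the update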


In conclusion, Prefix-RFT combines the strengths of SFT and RFT by utilizing sampled demonstration prefixes to guide learning. Despite its simplicity, it consistently outperforms SFT, RFT, and hybrid baselines across various models and datasets. Even with only 1% of the training data (450 prompts), it maintains strong performance (avg@32 drops only from 40.8 to 37.6), showing efficiency and robustness. Its top-20% entropy-based token update strategy proves most effective, achieving the highest benchmark scores with shorter outputs. Moreover, using a cosine decay scheduler for prefix length enhances stability and learning dynamics compared to a uniform strategy, particularly on complex tasks such as AIME. 

Check out the Paper here.
The post Prefix-RFT: A Unified Machine Learning Framework to blend Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) appeared first on MarkTechPost.