StepFun AI Introduce Step-DeepResearch: A Cost-Effective Deep Research Agent Model Built Around Atomic Capabilities

StepFun has introduced Step-DeepResearch, a 32B-parameter, end-to-end deep research agent that aims to turn web search into real research workflows with long-horizon reasoning, tool use and structured reporting. The model is built on Qwen2.5-32B-Base and is trained to act as a single agent that plans, explores sources, verifies evidence and writes reports with citations, while keeping inference cost low.

From Search to Deep Research

Most existing web agents are tuned for multi-hop question-answering benchmarks. They try to match ground-truth answers to short questions, which is closer to targeted retrieval than to real research. Deep research tasks are different. They involve latent intent recognition, long-horizon decision making, multi-turn tool use, structured reasoning and cross-source verification under uncertainty.

Step-DeepResearch reframes this as sequential decision making over a compact set of atomic capabilities. The research team defines four atomic capabilities: planning and task decomposition, deep information seeking, reflection and verification, and professional report generation. Instead of orchestrating many external agents, the system internalizes this loop into a single model that decides the next action at each step.

Data Synthesis around Atomic Capabilities

To teach these atomic capabilities, the research team builds separate data pipelines for each skill. For planning, they start from high quality technical reports, survey papers and financial analysis documents. They reverse-engineer realistic research plans and task trees from titles, abstracts and structure, then generate trajectories that follow these plans. This exposes the model to long horizon project structures, not only short question templates.

For deep information seeking, they construct graph-based queries over knowledge graphs such as Wikidata5m and CN-DBpedia. They sample subgraphs, expand them using search, and synthesize questions that require multi-hop reasoning across entities and documents. A separate pipeline uses a Wiki-style hyperlink index to force cross-document retrieval and combination of evidence. Easy questions that a strong model can already solve with a simple ReAct-style strategy are filtered out, so training focuses on hard search problems.
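
The multi-hop construction can be pictured with a small sketch. The snippet below is only an illustration of the general idea, not StepFun's pipeline: the tiny hand-made graph, the entity names and the question template are invented here, while the real system works over Wikidata5m and CN-DBpedia with search-based expansion and difficulty filtering.

# Illustrative only: a toy version of graph-based multi-hop question synthesis.
import networkx as nx

g = nx.DiGraph()
g.add_edge("Company A", "Person B", relation="founded_by")      # fictional triples
g.add_edge("Person B", "University C", relation="educated_at")
g.add_edge("University C", "City D", relation="located_in")

def multi_hop_question(graph, start, hops=2):
    """Walk `hops` edges from `start` and turn the path into a single question."""
    path, rels = [start], []
    for _ in range(hops):
        nxt = next(iter(graph.successors(path[-1])), None)
        if nxt is None:
            break
        rels.append(graph.edges[path[-1], nxt]["relation"])
        path.append(nxt)
    question = (f"Starting from '{start}', which entity do you reach by "
                f"following {' then '.join(rels)}?")
    return question, path[-1]   # answering requires resolving every hop

print(multi_hop_question(g, "Company A", hops=2))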

Reflection and verification data is generated through self-correction loops and multi-agent teacher traces. Teacher agents extract claims, plan checks, verify facts, replan if inconsistencies appear and only then write reports. The resulting trajectories are cleaned and used as supervision for a single student agent. Report generation is trained in two phases: mid-training for domain style and depth using query-report pairs, then supervised fine-tuning with strict formatting and plan-consistency constraints.

Progressive Training on Qwen2.5-32B-Base

The training pipeline has three stages: agentic mid-training, supervised fine-tuning and reinforcement learning. In mid-training stage 1, the team injects atomic capabilities without tools, using context lengths up to 32k tokens. The data covers active reading, synthetic reasoning traces, summarization and reflection. The research team reports steady gains on SimpleQA, TriviaQA and FRAMES as training scales up to about 150B tokens, with the largest gains on FRAMES, which stresses structured reasoning.

In stage-2, the context extends to 128k tokens and explicit tool calls are introduced. The model learns tasks such as URL based question-answering, deep web search, long document summarization and long dialogue reasoning. This stage aligns the model with real research scenarios where search, browsing and analysis must be mixed in one trajectory.

During supervised fine-tuning, the 4 atomic capabilities are composed into full deep search and deep research traces. Data cleaning keeps trajectories that are correct and short in terms of steps and tool calls. The pipeline injects controlled tool errors followed by correction to improve robustness, and enforces citation formats so that reports stay grounded in the retrieved sources.
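
To make the robustness idea concrete, here is a minimal sketch of injecting a controlled tool failure followed by a corrective retry into a trajectory. The trajectory format, the error message and the `search(...)` action string are assumptions for illustration, not the actual Step-DeepResearch data format.

# Illustrative only: perturb a trajectory with a failed tool call plus a retry.
import copy, random

def inject_tool_error(trajectory, error_rate=0.2, seed=0):
    """trajectory: list of {'action': ..., 'observation': ...} steps."""
    rng = random.Random(seed)
    out = []
    for step in trajectory:
        if step["action"].startswith("search(") and rng.random() < error_rate:
            # failed attempt first ...
            out.append({"action": step["action"],
                        "observation": "ToolError: request timed out"})
            # ... then the original, successful call acts as the correction
            out.append(copy.deepcopy(step))
        else:
            out.append(step)
    return out

demo = [{"action": "search('EU battery regulation 2024')",
         "observation": "3 results ..."}]
print(inject_tool_error(demo, error_rate=1.0))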

Reinforcement learning then optimizes the agent in a real tool environment. The research team builds tasks and checklists through reverse synthesis, and trains a checklist-style Rubrics Judge to score reports along fine-grained dimensions. The reward design converts ternary rubric labels into asymmetric binary rewards that capture both positive targets and violations. The policy is trained with PPO and a learned critic, using generalized advantage estimation with a near-zero discount so that long trajectories are not truncated.
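
One plausible reading of that reward design is sketched below. The label names ("met", "unmet", "violated") and the +1.0 / -2.0 weights are assumptions made for the sketch; the paper's exact label set and weighting are not reproduced here.

# Illustrative only: map per-criterion rubric labels to an asymmetric reward,
# where violations are penalized more heavily than hits are rewarded.
from typing import List

def rubric_reward(labels: List[str],
                  hit_bonus: float = 1.0,
                  violation_penalty: float = 2.0) -> float:
    score = 0.0
    for label in labels:
        if label == "met":
            score += hit_bonus            # positive target satisfied
        elif label == "violated":
            score -= violation_penalty    # violation hurts asymmetrically
        # "unmet" is deliberately neutral: missing, but not harmful
    return score / max(len(labels), 1)    # normalize by checklist length

print(rubric_reward(["met", "met", "unmet", "violated"]))   # -> 0.0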

Single Agent ReAct Architecture and Search Stack

At inference time, Step-DeepResearch runs as a single ReAct-style agent that alternates thinking, tool calls and observations until it decides to output a report. The tool set includes batch web search, a todo manager, shell commands and file operations. Execution runs in a sandbox with terminal persistence through tmux. A perception-oriented browser reduces redundant page captures by using perceptual hash distance. Tools for document parsing, audio transcription and image analysis support multimodal inputs.
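
The perceptual-hash trick for skipping redundant captures can be sketched in a few lines. The library choice (Pillow plus imagehash) and the Hamming-distance threshold are assumptions for this sketch, not details of the StepFun browser.

# Illustrative only: drop near-duplicate page screenshots by perceptual hash distance.
from PIL import Image
import imagehash

class ScreenshotDeduper:
    def __init__(self, max_distance: int = 5):
        self.max_distance = max_distance   # Hamming distance on the 64-bit pHash
        self.seen_hashes = []

    def is_new(self, image_path: str) -> bool:
        h = imagehash.phash(Image.open(image_path))
        for prev in self.seen_hashes:
            if h - prev <= self.max_distance:   # imagehash uses '-' for Hamming distance
                return False                    # visually near-identical capture, skip it
        self.seen_hashes.append(h)
        return True

# deduper = ScreenshotDeduper()
# if deduper.is_new("capture_001.png"):
#     ...  # only genuinely new pages reach the agent's context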

Information acquisition uses two related resources. The StepFun team states that its Search API is grounded in more than 20M high-quality papers and 600 premium indices. The research team then describes a curated authority-indexing strategy that isolates more than 600 trusted domains, including government, academic and institutional sites. Retrieval operates at the paragraph level and uses authority-aware ranking so that high-trust domains are preferred when relevance is similar.
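
A minimal sketch of what "authority-aware ranking" at the paragraph level could look like is shown below. The trusted-domain list and the bonus weight are invented for illustration; they are not StepFun's actual index or scoring function.

# Illustrative only: re-rank passages so trusted domains win near-ties on relevance.
from urllib.parse import urlparse

TRUSTED_DOMAINS = {"nature.com", "nih.gov", "europa.eu"}   # placeholder list

def authority_rerank(passages, authority_bonus=0.1):
    """passages: list of dicts with 'url' and 'relevance' in [0, 1]."""
    def score(p):
        domain = urlparse(p["url"]).netloc.removeprefix("www.")
        bonus = authority_bonus if domain in TRUSTED_DOMAINS else 0.0
        return p["relevance"] + bonus   # a small bonus only flips near-ties
    return sorted(passages, key=score, reverse=True)

ranked = authority_rerank([
    {"url": "https://randomblog.example/post", "relevance": 0.82},
    {"url": "https://www.nih.gov/some-study",  "relevance": 0.80},
])
print([p["url"] for p in ranked])   # the trusted domain wins the near-tie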

The file tools support patch-based editing, so the agent can update only the modified sections of a report. A summary-aware storage scheme writes full tool outputs to local files and injects only compact summaries into the context. This acts as external memory and avoids context overflow for long projects.
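
The storage pattern itself is simple to sketch. The snippet below assumes a truncation-based summary as a stand-in for whatever summarizer the real system uses, and a local directory as the external memory.

# Illustrative only: write the full tool output to disk, keep a compact summary in context.
import hashlib, pathlib

class SummaryAwareStore:
    def __init__(self, root="tool_outputs", summary_chars=400):
        self.root = pathlib.Path(root)
        self.root.mkdir(exist_ok=True)
        self.summary_chars = summary_chars

    def save(self, tool_name: str, output: str) -> dict:
        key = hashlib.sha1(output.encode()).hexdigest()[:12]
        path = self.root / f"{tool_name}_{key}.txt"
        path.write_text(output, encoding="utf-8")        # full output -> external memory
        summary = output[: self.summary_chars]            # compact view -> model context
        return {"file": str(path), "summary": summary}

store = SummaryAwareStore()
record = store.save("web_search", "...very long tool output..." * 50)
print(record["file"], len(record["summary"]))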

Evaluation, Cost and Access

To measure deep research behavior, the team introduces ADR-Bench, a Chinese benchmark with 110 open-ended tasks across 9 domains. 70 tasks cover general domains such as education, science and engineering, and social life, evaluated by expert side-by-side comparison. 40 tasks in finance and law are scored with explicit rubrics that follow atomicity and verifiability constraints.

On Scale AI Research Rubrics, Step-DeepResearch reaches 61.42 percent rubric compliance, which is comparable to OpenAI-DeepResearch and Gemini-DeepResearch, and clearly ahead of multiple open and proprietary baselines. On ADR-Bench, expert-based Elo ratings show that the 32B model outperforms larger open models such as MiniMax-M2, GLM-4.6 and DeepSeek-V3.2, and is competitive with systems like Kimi-Researcher and MiniMax-Agent-Pro.

Key Takeaways

Single agent, atomic capability design: Step-DeepResearch is a 32B-parameter single agent built on Qwen2.5-32B-Base. It internalizes four atomic capabilities, planning, deep information seeking, reflection and verification, and professional report generation, instead of relying on many external agents.

Targeted data synthesis for each skill: The research team builds separate data pipelines for planning, deep information seeking, reflection and report writing, using reverse-engineered plans from real reports, graph-based queries over Wikidata5m and CN-DBpedia, multi-agent teacher traces and strict report formatting data.

Three-stage training with long context and RL: Training uses mid-training, supervised fine-tuning and reinforcement learning. Mid-training scales to about 150B tokens at 32k and then 128k context, SFT composes full deep research trajectories, and PPO-based RL with a Rubrics Judge optimizes reports against fine-grained checklists.

ReAct architecture with curated search and external memory: At inference time the model runs a ReAct loop that calls tools for batch web search, todo, shell and file operations, uses a Search API grounded in more than 20M papers and 600 premium indices along with 600+ trusted domains, and relies on patch editing and summary-aware storage as external memory.

Competitive quality with lower cost: On Scale AI Research Rubrics the model reaches 61.42 percent rubric compliance and is competitive with OpenAI-DeepResearch and Gemini-DeepResearch; on ADR-Bench it achieves a 67.1 percent win-or-tie rate against strong baselines.

Check out the Paper and Repo.
The post StepFun AI Introduce Step-DeepResearch: A Cost-Effective Deep Research Agent Model Built Around Atomic Capabilities appeared first on MarkTechPost.

A Coding Implementation to Automating LLM Quality Assurance with DeepEval, Custom Retrievers, and LLM-as-a-Judge Metrics

We initiate this tutorial by configuring a high-performance evaluation environment, specifically focused on integrating the DeepEval framework to bring unit-testing rigor to our LLM applications. By bridging the gap between raw retrieval and final generation, we implement a system that treats model outputs as testable code and uses LLM-as-a-judge metrics to quantify performance. We move beyond manual inspection by building a structured pipeline in which every query, retrieved context, and generated response is validated against rigorous academic-standard metrics. Check out the FULL CODES here.

import sys, os, textwrap, json, math, re
from getpass import getpass

print(" Hardening environment (prevents common Colab/py3.12 numpy corruption)...")

!pip -q uninstall -y numpy || true
!pip -q install --no-cache-dir --force-reinstall "numpy==1.26.4"

!pip -q install -U deepeval openai scikit-learn pandas tqdm

print(" Packages installed.")

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    GEval,
)

print(" Imports loaded successfully.")

OPENAI_API_KEY = getpass(" Enter OPENAI_API_KEY (leave empty to run without OpenAI): ").strip()
openai_enabled = bool(OPENAI_API_KEY)

if openai_enabled:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
print(f" OpenAI enabled: {openai_enabled}")

We initialize our environment by stabilizing core dependencies and installing the deepeval framework to ensure a robust testing pipeline. Next, we import specialized metrics like Faithfulness and Contextual Recall while configuring our API credentials to enable automated, high-fidelity evaluation of our LLM responses. Check out the FULL CODES here.

DOCS = [
{
“id”: “doc_01”,
“title”: “DeepEval Overview”,
“text”: (
“DeepEval is an open-source LLM evaluation framework for unit testing LLM apps. ”
“It supports LLM-as-a-judge metrics, custom metrics like G-Eval, and RAG metrics ”
“such as contextual precision and faithfulness.”
),
},
{
“id”: “doc_02”,
“title”: “RAG Evaluation: Why Faithfulness Matters”,
“text”: (
“Faithfulness checks whether the answer is supported by retrieved context. ”
“In RAG, hallucinations occur when the model states claims not grounded in context.”
),
},
{
“id”: “doc_03”,
“title”: “Contextual Precision”,
“text”: (
“Contextual precision evaluates how well retrieved chunks are ranked by relevance ”
“to a query. High precision means relevant chunks appear earlier in the ranked list.”
),
},
{
“id”: “doc_04”,
“title”: “Contextual Recall”,
“text”: (
“Contextual recall measures whether the retriever returns enough relevant context ”
“to answer the query. Low recall means key information was missed in retrieval.”
),
},
{
“id”: “doc_05”,
“title”: “Answer Relevancy”,
“text”: (
“Answer relevancy measures whether the generated answer addresses the user’s query. ”
“Even grounded answers can be irrelevant if they don’t respond to the question.”
),
},
{
“id”: “doc_06”,
“title”: “G-Eval (GEval) Custom Rubrics”,
“text”: (
“G-Eval lets you define evaluation criteria in natural language. ”
“It uses an LLM judge to score outputs against your rubric (e.g., correctness, tone, policy).”
),
},
{
“id”: “doc_07”,
“title”: “What a DeepEval Test Case Contains”,
“text”: (
“A test case typically includes input (query), actual_output (model answer), ”
“expected_output (gold answer), and retrieval_context (ranked retrieved passages) for RAG.”
),
},
{
“id”: “doc_08”,
“title”: “Common Pitfall: Missing expected_output”,
“text”: (
“Some RAG metrics require expected_output in addition to input and retrieval_context. ”
“If expected_output is None, evaluation fails for metrics like contextual precision/recall.”
),
},
]

EVAL_QUERIES = [
{
“query”: “What is DeepEval used for?”,
“expected”: “DeepEval is used to evaluate and unit test LLM applications using metrics like LLM-as-a-judge, G-Eval, and RAG metrics.”,
},
{
“query”: “What does faithfulness measure in a RAG system?”,
“expected”: “Faithfulness measures whether the generated answer is supported by the retrieved context and avoids hallucinations not grounded in that context.”,
},
{
“query”: “What does contextual precision mean?”,
“expected”: “Contextual precision evaluates whether relevant retrieved chunks are ranked higher than irrelevant ones for a given query.”,
},
{
“query”: “What does contextual recall mean in retrieval?”,
“expected”: “Contextual recall measures whether the retriever returns enough relevant context to answer the query, capturing key missing information issues.”,
},
{
“query”: “Why might an answer be relevant but still low quality in RAG?”,
“expected”: “An answer can address the question (relevant) but still be low quality if it is not grounded in retrieved context or misses important details.”,
},
]

We define a structured knowledge base consisting of documentation snippets that serve as our ground-truth context for the RAG system. We also establish a set of evaluation queries and corresponding expected outputs to create a “gold dataset,” enabling us to assess how accurately our model retrieves information and generates grounded responses. Check out the FULL CODES here.

class TfidfRetriever:
    def __init__(self, docs):
        self.docs = docs
        self.texts = [f"{d['title']}\n{d['text']}" for d in docs]
        self.vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
        self.matrix = self.vectorizer.fit_transform(self.texts)

    def retrieve(self, query, k=4):
        qv = self.vectorizer.transform([query])
        sims = cosine_similarity(qv, self.matrix).flatten()
        top_idx = np.argsort(-sims)[:k]
        results = []
        for i in top_idx:
            results.append(
                {
                    "id": self.docs[i]["id"],
                    "score": float(sims[i]),
                    "text": self.texts[i],
                }
            )
        return results

retriever = TfidfRetriever(DOCS)

We implement a custom TF-IDF Retriever class that transforms our documentation into a searchable vector space using bigram-aware TF-IDF vectorization. This allows us to perform cosine similarity searches against the knowledge base, ensuring we can programmatically fetch the top-k most relevant text chunks for any given query. Check out the FULL CODES here.

def extractive_baseline_answer(query, retrieved_contexts):
    """
    Offline fallback: we create a short answer by extracting the most relevant sentences.
    This keeps the notebook runnable even without OpenAI.
    """
    joined = "\n".join(retrieved_contexts)
    sents = re.split(r"(?<=[.!?])\s+", joined)
    keywords = [w.lower() for w in re.findall(r"[a-zA-Z]{4,}", query)]
    scored = []
    for s in sents:
        s_l = s.lower()
        score = sum(1 for k in keywords if k in s_l)
        if len(s.strip()) > 20:
            scored.append((score, s.strip()))
    scored.sort(key=lambda x: (-x[0], -len(x[1])))
    best = [s for sc, s in scored[:3] if sc > 0]
    if not best:
        best = [s.strip() for s in sents[:2] if len(s.strip()) > 20]
    ans = " ".join(best).strip()
    if not ans:
        ans = "I could not find enough context to answer confidently."
    return ans

def openai_answer(query, retrieved_contexts, model="gpt-4.1-mini"):
    """
    Simple RAG prompt for demonstration. DeepEval metrics can still evaluate even if
    your generation prompt differs; the key is we store retrieval_context separately.
    """
    from openai import OpenAI
    client = OpenAI()

    context_block = "\n\n".join([f"[CTX {i+1}]\n{c}" for i, c in enumerate(retrieved_contexts)])
    prompt = f"""You are a concise technical assistant.
Use ONLY the provided context to answer the query. If the answer is not in context, say you don't know.

Query:
{query}

Context:
{context_block}

Answer:"""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()

def rag_answer(query, retrieved_contexts):
    if openai_enabled:
        try:
            return openai_answer(query, retrieved_contexts)
        except Exception as e:
            print(f" OpenAI generation failed, falling back to extractive baseline. Error: {e}")
            return extractive_baseline_answer(query, retrieved_contexts)
    else:
        return extractive_baseline_answer(query, retrieved_contexts)

We implement a hybrid answering mechanism that prioritizes high-fidelity generation via OpenAI while maintaining a keyword-based extractive baseline as a reliable fallback. By isolating the retrieval context from the final generation, we ensure our DeepEval test cases remain consistent regardless of whether the answer is synthesized by an LLM or extracted programmatically. Check out the FULL CODES here.

print("\n Running RAG to create test cases...")

test_cases = []
K = 4

for item in tqdm(EVAL_QUERIES):
q = item[“query”]
expected = item[“expected”]

retrieved = retriever.retrieve(q, k=K)
retrieval_context = [r[“text”] for r in retrieved]

actual = rag_answer(q, retrieval_context)

tc = LLMTestCase(
input=q,
actual_output=actual,
expected_output=expected,
retrieval_context=retrieval_context,
)
test_cases.append(tc)

print(f” Built {len(test_cases)} LLMTestCase objects.”)

print(“n Metrics configured.”)

metrics = [
AnswerRelevancyMetric(threshold=0.5, model=”gpt-4.1″, include_reason=True, async_mode=True),
FaithfulnessMetric(threshold=0.5, model=”gpt-4.1″, include_reason=True, async_mode=True),
ContextualRelevancyMetric(threshold=0.5, model=”gpt-4.1″, include_reason=True, async_mode=True),
ContextualPrecisionMetric(threshold=0.5, model=”gpt-4.1″, include_reason=True, async_mode=True),
ContextualRecallMetric(threshold=0.5, model=”gpt-4.1″, include_reason=True, async_mode=True),

GEval(
name=”RAG Correctness Rubric (GEval)”,
criteria=(
“Score the answer for correctness and usefulness. ”
“The answer must directly address the query, must not invent facts not supported by context, ”
“and should be concise but complete.”
),
evaluation_params=[
LLMTestCaseParams.INPUT,
LLMTestCaseParams.ACTUAL_OUTPUT,
LLMTestCaseParams.EXPECTED_OUTPUT,
LLMTestCaseParams.RETRIEVAL_CONTEXT,
],
model=”gpt-4.1″,
threshold=0.5,
async_mode=True,
),
]

if not openai_enabled:
print(“n You did NOT provide an OpenAI API key.”)
print(“DeepEval’s LLM-as-a-judge metrics (AnswerRelevancy/Faithfulness/Contextual* and GEval) require an LLM judge.”)
print(“Re-run this cell and provide OPENAI_API_KEY to run DeepEval metrics.”)
print(“n However, your RAG pipeline + test case construction succeeded end-to-end.”)
rows = []
for i, tc in enumerate(test_cases):
rows.append({
“id”: i,
“query”: tc.input,
“actual_output”: tc.actual_output[:220] + (“…” if len(tc.actual_output) > 220 else “”),
“expected_output”: tc.expected_output[:220] + (“…” if len(tc.expected_output) > 220 else “”),
“contexts”: len(tc.retrieval_context or []),
})
display(pd.DataFrame(rows))
raise SystemExit(“Stopped before evaluation (no OpenAI key).”)

We execute the RAG pipeline to generate LLMTestCase objects by pairing our retrieved context with model-generated answers and ground-truth expectations. We then configure a comprehensive suite of DeepEval metrics, including G-Eval and specialized RAG indicators, to evaluate the system’s performance using an LLM-as-a-judge approach. Check out the FULL CODES here.

print("\n Running DeepEval evaluate(...) ...")

results = evaluate(test_cases=test_cases, metrics=metrics)

summary_rows = []
for idx, tc in enumerate(test_cases):
row = {
“case_id”: idx,
“query”: tc.input,
“actual_output”: tc.actual_output[:200] + (“…” if len(tc.actual_output) > 200 else “”),
}
for m in metrics:
row[m.__class__.__name__ if hasattr(m, “__class__”) else str(m)] = None

summary_rows.append(row)

def try_extract_case_metrics(results_obj):
extracted = []
candidates = []
for attr in [“test_results”, “results”, “evaluations”]:
if hasattr(results_obj, attr):
candidates = getattr(results_obj, attr)
break
if not candidates and isinstance(results_obj, list):
candidates = results_obj

for case_i, case_result in enumerate(candidates or []):
item = {“case_id”: case_i}
metrics_list = None
for attr in [“metrics_data”, “metrics”, “metric_results”]:
if hasattr(case_result, attr):
metrics_list = getattr(case_result, attr)
break
if isinstance(metrics_list, dict):
for k, v in metrics_list.items():
item[f”{k}_score”] = getattr(v, “score”, None) if v is not None else None
item[f”{k}_reason”] = getattr(v, “reason”, None) if v is not None else None
else:
for mr in metrics_list or []:
name = getattr(mr, “name”, None) or getattr(getattr(mr, “metric”, None), “name”, None)
if not name:
name = mr.__class__.__name__
item[f”{name}_score”] = getattr(mr, “score”, None)
item[f”{name}_reason”] = getattr(mr, “reason”, None)
extracted.append(item)
return extracted

case_metrics = try_extract_case_metrics(results)

df_base = pd.DataFrame([{
“case_id”: i,
“query”: tc.input,
“actual_output”: tc.actual_output,
“expected_output”: tc.expected_output,
} for i, tc in enumerate(test_cases)])

df_metrics = pd.DataFrame(case_metrics) if case_metrics else pd.DataFrame([])
df = df_base.merge(df_metrics, on=”case_id”, how=”left”)

score_cols = [c for c in df.columns if c.endswith(“_score”)]
compact = df[[“case_id”, “query”] + score_cols].copy()

print(“n Compact score table:”)
display(compact)

print(“n Full details (includes reasons):”)
display(df)

print(“n Done. Tip: if contextual precision/recall are low, improve retriever ranking/coverage; if faithfulness is low, tighten generation to only use context.”)

We finalize the workflow by executing the evaluate function, which triggers the LLM-as-a-judge process to score each test case against our defined metrics. We then aggregate these scores and their corresponding qualitative reasoning into a centralized DataFrame, providing a granular view of where the RAG pipeline excels or requires further optimization in retrieval and generation.

At last, we conclude by running our comprehensive evaluation suite, in which DeepEval transforms complex linguistic outputs into actionable data using metrics such as Faithfulness, Contextual Precision, and the G-Eval rubric. This systematic approach allows us to diagnose “silent failures” in retrieval and hallucinations in generation with surgical precision, providing the reasoning necessary to justify architectural changes. With these results, we move forward from experimental prototyping to a production-ready RAG system backed by a verifiable, metric-driven safety net.

Check out the FULL CODES here.
The post A Coding Implementation to Automating LLM Quality Assurance with DeepEval, Custom Retrievers, and LLM-as-a-Judge Metrics appeared first on MarkTechPost.

How Machine Learning and Semantic Embeddings Reorder CVE Vulnerabilities Beyond Raw CVSS Scores

In this tutorial, we build an AI-assisted vulnerability scanner that goes beyond static CVSS scoring and instead learns to prioritize vulnerabilities using semantic understanding and machine learning. We treat vulnerability descriptions as rich linguistic artifacts, embed them using modern sentence transformers, and combine these representations with structural metadata to produce a data-driven priority score. Also, we demonstrate how security teams can shift from rule-based triage to adaptive, explainable, ML-driven risk assessment. Check out the FULL CODES here.

print("Installing required packages...")
import subprocess
import sys

packages = [
    'sentence-transformers',
    'scikit-learn',
    'pandas',
    'numpy',
    'matplotlib',
    'seaborn',
    'requests'
]

for package in packages:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import json
import re
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, mean_squared_error

import matplotlib.pyplot as plt
import seaborn as sns

print("✓ All packages installed successfully!\n")

We install and load all required NLP, machine learning, and visualization libraries for the end-to-end pipeline. We ensure the runtime is fully self-contained and ready to execute in Colab or similar notebook environments. It establishes a reproducible foundation for the scanner. Check out the FULL CODES here.

class CVEDataFetcher:
def __init__(self):
self.base_url = “https://services.nvd.nist.gov/rest/json/cves/2.0″

def fetch_recent_cves(self, days=30, max_results=100):
print(f”Fetching CVEs from last {days} days…”)

end_date = datetime.now()
start_date = end_date – timedelta(days=days)

params = {
‘pubStartDate’: start_date.strftime(‘%Y-%m-%dT00:00:00.000’),
‘pubEndDate’: end_date.strftime(‘%Y-%m-%dT23:59:59.999’),
‘resultsPerPage’: min(max_results, 2000)
}

try:
response = requests.get(self.base_url, params=params, timeout=30)
response.raise_for_status()
data = response.json()

cves = []
for item in data.get(‘vulnerabilities’, [])[:max_results]:
cve = item.get(‘cve’, {})
cve_id = cve.get(‘id’, ‘Unknown’)

descriptions = cve.get(‘descriptions’, [])
description = next((d[‘value’] for d in descriptions if d[‘lang’] == ‘en’), ‘No description’)

metrics = cve.get(‘metrics’, {})
cvss_v3 = metrics.get(‘cvssMetricV31’, [{}])[0].get(‘cvssData’, {})
cvss_v2 = metrics.get(‘cvssMetricV2’, [{}])[0].get(‘cvssData’, {})

base_score = cvss_v3.get(‘baseScore’) or cvss_v2.get(‘baseScore’) or 0.0
severity = cvss_v3.get(‘baseSeverity’) or ‘UNKNOWN’

published = cve.get(‘published’, ”)
references = cve.get(‘references’, [])

cves.append({
‘cve_id’: cve_id,
‘description’: description,
‘cvss_score’: float(base_score),
‘severity’: severity,
‘published’: published,
‘reference_count’: len(references),
‘attack_vector’: cvss_v3.get(‘attackVector’, ‘UNKNOWN’),
‘attack_complexity’: cvss_v3.get(‘attackComplexity’, ‘UNKNOWN’),
‘privileges_required’: cvss_v3.get(‘privilegesRequired’, ‘UNKNOWN’),
‘user_interaction’: cvss_v3.get(‘userInteraction’, ‘UNKNOWN’)
})

print(f”✓ Fetched {len(cves)} CVEsn”)
return pd.DataFrame(cves)

except Exception as e:
print(f”Error fetching CVEs: {e}”)
return self._generate_sample_data(max_results)

def _generate_sample_data(self, n=50):
print(“Using sample CVE data for demonstration…n”)

sample_descriptions = [
“A buffer overflow vulnerability in the network driver allows remote code execution”,
“SQL injection vulnerability in web application login form enables unauthorized access”,
“Cross-site scripting (XSS) vulnerability in user input validation”,
“Authentication bypass in admin panel due to weak session management”,
“Remote code execution via deserialization of untrusted data”,
“Path traversal vulnerability allows reading arbitrary files”,
“Privilege escalation through improper input validation”,
“Denial of service through resource exhaustion in API endpoint”,
“Information disclosure via error messages exposing sensitive data”,
“Memory corruption vulnerability in image processing library”,
“Command injection in file upload functionality”,
“Integer overflow leading to heap buffer overflow”,
“Use-after-free vulnerability in memory management”,
“Race condition in multi-threaded application”,
“Cryptographic weakness in password storage mechanism”
]

severities = [‘LOW’, ‘MEDIUM’, ‘HIGH’, ‘CRITICAL’]
attack_vectors = [‘NETWORK’, ‘ADJACENT’, ‘LOCAL’, ‘PHYSICAL’]
complexities = [‘LOW’, ‘HIGH’]

data = []
for i in range(n):
severity = np.random.choice(severities, p=[0.1, 0.3, 0.4, 0.2])
score_ranges = {‘LOW’: (0.1, 3.9), ‘MEDIUM’: (4.0, 6.9), ‘HIGH’: (7.0, 8.9), ‘CRITICAL’: (9.0, 10.0)}

data.append({
‘cve_id’: f’CVE-2024-{10000+i}’,
‘description’: np.random.choice(sample_descriptions),
‘cvss_score’: np.random.uniform(*score_ranges[severity]),
‘severity’: severity,
‘published’: (datetime.now() – timedelta(days=np.random.randint(1, 30))).isoformat(),
‘reference_count’: np.random.randint(1, 10),
‘attack_vector’: np.random.choice(attack_vectors),
‘attack_complexity’: np.random.choice(complexities),
‘privileges_required’: np.random.choice([‘NONE’, ‘LOW’, ‘HIGH’]),
‘user_interaction’: np.random.choice([‘NONE’, ‘REQUIRED’])
})

return pd.DataFrame(data)

We implement a robust CVE ingestion component that pulls recent vulnerabilities directly from the NVD API. We normalize raw CVE records into structured features while gracefully falling back to synthetic data when API access fails. It allows the tutorial to remain runnable while reflecting real-world challenges in data ingestion. Check out the FULL CODES here.

class VulnerabilityFeatureExtractor:
def __init__(self):
print(“Loading sentence transformer model…”)
self.model = SentenceTransformer(‘all-MiniLM-L6-v2’)
print(“✓ Model loadedn”)

self.critical_keywords = {
‘execution’: [‘remote code execution’, ‘rce’, ‘execute’, ‘arbitrary code’],
‘injection’: [‘sql injection’, ‘command injection’, ‘code injection’],
‘authentication’: [‘bypass’, ‘authentication’, ‘authorization’],
‘overflow’: [‘buffer overflow’, ‘heap overflow’, ‘stack overflow’],
‘exposure’: [‘information disclosure’, ‘data leak’, ‘exposure’],
}

def extract_semantic_features(self, descriptions):
print(“Generating semantic embeddings…”)
embeddings = self.model.encode(descriptions, show_progress_bar=True)
return embeddings

def extract_keyword_features(self, df):
print(“Extracting keyword features…”)

for category, keywords in self.critical_keywords.items():
df[f’has_{category}’] = df[‘description’].apply(
lambda x: any(kw in x.lower() for kw in keywords)
).astype(int)

df[‘desc_length’] = df[‘description’].apply(len)
df[‘word_count’] = df[‘description’].apply(lambda x: len(x.split()))

return df

def encode_categorical_features(self, df):
print(“Encoding categorical features…”)

categorical_cols = [‘attack_vector’, ‘attack_complexity’, ‘privileges_required’, ‘user_interaction’]

for col in categorical_cols:
dummies = pd.get_dummies(df[col], prefix=col)
df = pd.concat([df, dummies], axis=1)

return df

We transform unstructured vulnerability descriptions into dense semantic embeddings using a sentence-transformer model. We also extract keyword-based risk indicators and textual statistics that capture exploit intent and complexity. Together, these features bridge linguistic context with quantitative ML inputs. Check out the FULL CODES here.

class VulnerabilityPrioritizer:
def __init__(self):
self.severity_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
self.score_predictor = GradientBoostingRegressor(n_estimators=100, random_state=42)
self.scaler = StandardScaler()
self.feature_cols = None

def prepare_features(self, df, embeddings):
numeric_features = [‘reference_count’, ‘desc_length’, ‘word_count’]
keyword_features = [col for col in df.columns if col.startswith(‘has_’)]
categorical_features = [col for col in df.columns if any(col.startswith(prefix) for prefix in [‘attack_vector_’, ‘attack_complexity_’, ‘privileges_required_’, ‘user_interaction_’])]
self.feature_cols = numeric_features + keyword_features + categorical_features
X_structured = df[self.feature_cols].values
X_embeddings = embeddings
X_combined = np.hstack([X_structured, X_embeddings])
return X_combined

def train_models(self, X, y_severity, y_score):
print(“nTraining ML models…”)
X_scaled = self.scaler.fit_transform(X)
X_train, X_test, y_sev_train, y_sev_test, y_score_train, y_score_test = train_test_split(
X_scaled, y_severity, y_score, test_size=0.2, random_state=42
)
self.severity_classifier.fit(X_train, y_sev_train)
sev_pred = self.severity_classifier.predict(X_test)
self.score_predictor.fit(X_train, y_score_train)
score_pred = self.score_predictor.predict(X_test)
print(“n— Severity Classification Report —“)
print(classification_report(y_sev_test, sev_pred))
print(f”n— CVSS Score Prediction —“)
print(f”RMSE: {np.sqrt(mean_squared_error(y_score_test, score_pred)):.2f}”)
return X_scaled

def predict_priority(self, X):
X_scaled = self.scaler.transform(X)
severity_pred = self.severity_classifier.predict_proba(X_scaled)
score_pred = self.score_predictor.predict(X_scaled)
severity_weight = severity_pred[:, -1] * 0.4
score_weight = (score_pred / 10.0) * 0.6
priority_score = severity_weight + score_weight
return priority_score, severity_pred, score_pred

def get_feature_importance(self):
importance = self.score_predictor.feature_importances_
n_structured = len(self.feature_cols)
structured_importance = importance[:n_structured]
embedding_importance = importance[n_structured:]
feature_imp_df = pd.DataFrame({
‘feature’: self.feature_cols,
‘importance’: structured_importance
}).sort_values(‘importance’, ascending=False)
return feature_imp_df, embedding_importance.mean()

We train supervised models to predict both vulnerability severity classes and CVSS-like scores from learned features. We combine structured metadata with embeddings to create a hybrid feature space and derive a composite priority score. This is where the scanner learns how to rank vulnerabilities beyond static heuristics. Check out the FULL CODES here.

class VulnerabilityAnalyzer:
    def __init__(self, n_clusters=5):
        self.n_clusters = n_clusters
        self.kmeans = KMeans(n_clusters=n_clusters, random_state=42)

    def cluster_vulnerabilities(self, embeddings):
        print(f"\nClustering vulnerabilities into {self.n_clusters} groups...")
        clusters = self.kmeans.fit_predict(embeddings)
        return clusters

    def analyze_clusters(self, df, clusters):
        df['cluster'] = clusters
        print("\n--- Cluster Analysis ---")
        for i in range(self.n_clusters):
            cluster_df = df[df['cluster'] == i]
            print(f"\nCluster {i} ({len(cluster_df)} vulnerabilities):")
            print(f"  Avg CVSS Score: {cluster_df['cvss_score'].mean():.2f}")
            print(f"  Severity Distribution: {cluster_df['severity'].value_counts().to_dict()}")
            print(f"  Top keywords: ", end="")
            all_words = ' '.join(cluster_df['description'].values).lower()
            words = re.findall(r'\b[a-z]{4,}\b', all_words)
            common = Counter(words).most_common(5)
            print(', '.join([w for w, _ in common]))
        return df

We cluster vulnerabilities based on embedding similarity to uncover recurring exploit patterns. We analyze each cluster to understand dominant attack themes, severity distributions, and common exploit terminology. It helps surface systemic risks rather than isolated issues. Check out the FULL CODES here.

def visualize_results(df, priority_scores, feature_importance):
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle(‘Vulnerability Scanner – ML Analysis Dashboard’, fontsize=16, fontweight=’bold’)
axes[0, 0].hist(priority_scores, bins=30, color=’crimson’, alpha=0.7, edgecolor=’black’)
axes[0, 0].set_xlabel(‘Priority Score’)
axes[0, 0].set_ylabel(‘Frequency’)
axes[0, 0].set_title(‘Priority Score Distribution’)
axes[0, 0].axvline(np.percentile(priority_scores, 75), color=’orange’, linestyle=’–‘, label=’75th percentile’)
axes[0, 0].legend()
axes[0, 1].scatter(df[‘cvss_score’], priority_scores, alpha=0.6, c=priority_scores, cmap=’RdYlGn_r’, s=50)
axes[0, 1].set_xlabel(‘CVSS Score’)
axes[0, 1].set_ylabel(‘ML Priority Score’)
axes[0, 1].set_title(‘CVSS vs ML Priority’)
axes[0, 1].plot([0, 10], [0, 1], ‘k–‘, alpha=0.3)
severity_counts = df[‘severity’].value_counts()
colors = {‘CRITICAL’: ‘darkred’, ‘HIGH’: ‘red’, ‘MEDIUM’: ‘orange’, ‘LOW’: ‘yellow’}
axes[0, 2].bar(severity_counts.index, severity_counts.values, color=[colors.get(s, ‘gray’) for s in severity_counts.index])
axes[0, 2].set_xlabel(‘Severity’)
axes[0, 2].set_ylabel(‘Count’)
axes[0, 2].set_title(‘Severity Distribution’)
axes[0, 2].tick_params(axis=’x’, rotation=45)
top_features = feature_importance.head(10)
axes[1, 0].barh(top_features[‘feature’], top_features[‘importance’], color=’steelblue’)
axes[1, 0].set_xlabel(‘Importance’)
axes[1, 0].set_title(‘Top 10 Feature Importance’)
axes[1, 0].invert_yaxis()
if ‘cluster’ in df.columns:
cluster_counts = df[‘cluster’].value_counts().sort_index()
axes[1, 1].bar(cluster_counts.index, cluster_counts.values, color=’teal’, alpha=0.7)
axes[1, 1].set_xlabel(‘Cluster’)
axes[1, 1].set_ylabel(‘Count’)
axes[1, 1].set_title(‘Vulnerability Clusters’)
attack_vector_counts = df[‘attack_vector’].value_counts()
axes[1, 2].pie(attack_vector_counts.values, labels=attack_vector_counts.index, autopct=’%1.1f%%’, startangle=90)
axes[1, 2].set_title(‘Attack Vector Distribution’)
plt.tight_layout()
plt.show()

def main():
print(“=”*70)
print(“AI-ASSISTED VULNERABILITY SCANNER WITH ML PRIORITIZATION”)
print(“=”*70)
print()
fetcher = CVEDataFetcher()
df = fetcher.fetch_recent_cves(days=30, max_results=50)
print(f”Dataset Overview:”)
print(f” Total CVEs: {len(df)}”)
print(f” Date Range: {df[‘published’].min()[:10]} to {df[‘published’].max()[:10]}”)
print(f” Severity Breakdown: {df[‘severity’].value_counts().to_dict()}”)
print()
feature_extractor = VulnerabilityFeatureExtractor()
embeddings = feature_extractor.extract_semantic_features(df[‘description’].tolist())
df = feature_extractor.extract_keyword_features(df)
df = feature_extractor.encode_categorical_features(df)
prioritizer = VulnerabilityPrioritizer()
X = prioritizer.prepare_features(df, embeddings)
severity_map = {‘LOW’: 0, ‘MEDIUM’: 1, ‘HIGH’: 2, ‘CRITICAL’: 3, ‘UNKNOWN’: 1}
y_severity = df[‘severity’].map(severity_map).values
y_score = df[‘cvss_score’].values
X_scaled = prioritizer.train_models(X, y_severity, y_score)
priority_scores, severity_probs, score_preds = prioritizer.predict_priority(X)
df[‘ml_priority_score’] = priority_scores
df[‘predicted_score’] = score_preds
analyzer = VulnerabilityAnalyzer(n_clusters=5)
clusters = analyzer.cluster_vulnerabilities(embeddings)
df = analyzer.analyze_clusters(df, clusters)
feature_imp, emb_imp = prioritizer.get_feature_importance()
print(f”n— Feature Importance —“)
print(feature_imp.head(10))
print(f”nAverage embedding importance: {emb_imp:.4f}”)
print(“n” + “=”*70)
print(“TOP 10 PRIORITY VULNERABILITIES”)
print(“=”*70)
top_vulns = df.nlargest(10, ‘ml_priority_score’)[[‘cve_id’, ‘cvss_score’, ‘ml_priority_score’, ‘severity’, ‘description’]]
for idx, row in top_vulns.iterrows():
print(f”n{row[‘cve_id’]} [Priority: {row[‘ml_priority_score’]:.3f}]”)
print(f” CVSS: {row[‘cvss_score’]:.1f} | Severity: {row[‘severity’]}”)
print(f” {row[‘description’][:100]}…”)
print(“nnGenerating visualizations…”)
visualize_results(df, priority_scores, feature_imp)
print(“n” + “=”*70)
print(“ANALYSIS COMPLETE”)
print(“=”*70)
print(f”nResults summary:”)
print(f” High Priority (>0.7): {(priority_scores > 0.7).sum()} vulnerabilities”)
print(f” Medium Priority (0.4-0.7): {((priority_scores >= 0.4) & (priority_scores <= 0.7)).sum()}”)
print(f” Low Priority (<0.4): {(priority_scores < 0.4).sum()}”)
return df, prioritizer, analyzer

if __name__ == “__main__”:
results_df, prioritizer, analyzer = main()
print(“n✓ All analyses completed successfully!”)
print(“nYou can now:”)
print(” – Access results via ‘results_df’ DataFrame”)
print(” – Use ‘prioritizer’ to predict new vulnerabilities”)
print(” – Explore ‘analyzer’ for clustering insights”)

We generate an interactive analysis dashboard that visualizes priority distributions, feature importance, clusters, and attack vectors. We execute the complete pipeline, rank the highest-priority vulnerabilities, and summarize actionable insights. It turns raw model outputs into decision-ready intelligence.

In conclusion, we demonstrated how vulnerability management can evolve from static scoring to intelligent prioritization using machine learning and semantic analysis. By combining embeddings, metadata, clustering, and explainability, we created a system that better reflects real-world exploit risk and operational urgency. It lays the groundwork for adaptive security pipelines where prioritization improves continuously as new vulnerability data emerges.

Check out the FULL CODES here.
The post How Machine Learning and Semantic Embeddings Reorder CVE Vulnerabilities Beyond Raw CVSS Scores appeared first on MarkTechPost.

GitHub Releases Copilot-SDK to Embed Its Agentic Runtime in Any App

GitHub has opened up the internal agent runtime that powers GitHub Copilot CLI and exposed it as a programmable SDK. The GitHub Copilot-SDK, now in technical preview, lets you embed the same agentic execution loop into any application so the agent can plan, invoke tools, edit files, and run commands as part of your own workflows.

What the GitHub Copilot SDK provides

The GitHub Copilot-SDK is a multi platform SDK for integrating the GitHub Copilot Agent into applications and services. It gives programmatic access to the execution loop that already powers GitHub Copilot CLI. Instead of building your own planner and tool loop for each project, you attach your logic to this existing runtime and treat it as an execution platform.

The GitHub Copilot-SDK exposes the same production tested runtime used by Copilot CLI, with support for multi model operation, multi step planning, tools, Model Context Protocol (MCP) integration, authentication, and streaming. This gives you the same agent behavior that Copilot uses in the terminal, but callable from your own code.

Agentic execution loop as a runtime primitive

The core abstraction is the agentic execution loop. In Copilot CLI and in the SDK, interactions are not isolated prompts. The agent maintains state across turns, chooses plans, calls tools, executes commands, reads results, and repeats these steps until it reaches the goal that you provided.

The GitHub team describes the usual problems when you implement this loop yourself. You need to manage context across multiple turns, orchestrate external tools and commands, route calls across models, integrate MCP servers, and think through permissions. With the SDK handling that loop, you, as a developer, concentrate on defining domain-specific tools, describing tasks, and constraining what the agent can do.

Supported languages and core API

The Copilot-SDK is available in 4 languages in this technical preview:

Node.js and TypeScript, through the package @github/copilot-cli-sdk

Python, through the package copilot

Go, through the module github.com/github/copilot-cli-sdk-go

.NET, through the package GitHub.Copilot.SDK

All SDKs expose a consistent API surface. According to the changelog, every language binding supports multi-turn conversations with session history, custom tool execution, and programmatic control over client and session life cycles.

Tools, MCP servers, and integration with existing systems

A main feature of the Copilot agent is tool execution. Through the SDK you can register custom tools that the model can call during a conversation. The Copilot-CLI already exposes custom tool definitions and full MCP server integration, and the SDK reuses that capability.

MCP gives a standard protocol for agents to connect to external systems such as internal APIs, document stores, or operations tools. When you integrate an MCP server, the Copilot agent can discover and call its operations in a structured way with consistent metadata rather than ad hoc prompt engineering.

The pattern is straightforward. You define a tool with a clear schema and effect, you expose it through the SDK, and the Copilot planner decides when and how to call it as part of the multi-step plan.
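
The shape of that pattern can be sketched in Python. The registry class, its methods and the ticket-lookup tool below are local stand-ins defined entirely inside the sketch; they are not the Copilot SDK's actual API, which you should take from the official package documentation.

# Illustrative only: "define a tool with a schema, register it, let the planner call it".
from typing import Callable, Dict, Any

class ToolRegistry:                      # stand-in for the SDK's registration surface
    def __init__(self):
        self.tools: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, schema: Dict[str, Any], fn: Callable):
        self.tools[name] = {"schema": schema, "fn": fn}

    def call(self, name: str, **kwargs):
        return self.tools[name]["fn"](**kwargs)

def lookup_ticket(ticket_id: str) -> dict:
    # Hypothetical internal API; a real integration would hit your own system.
    return {"id": ticket_id, "status": "open", "assignee": "ops-team"}

registry = ToolRegistry()
registry.register(
    name="lookup_ticket",
    schema={"type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"]},
    fn=lookup_ticket,
)

# The agent's planner would decide when to emit a call like this one:
print(registry.call("lookup_ticket", ticket_id="OPS-1234"))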

Authentication, subscriptions, and streaming

The SDK integrates with GitHub authentication and Copilot subscriptions. You can either use an existing GitHub Copilot subscription or bring your own key when configuring the SDK. This is important when you embed the agent in enterprise environments where identity and access control are already standardized around GitHub.

Streaming is part of the contract. Copilot-CLI already supports real time streaming in the terminal, and the SDK exposes streaming so that applications can receive responses incrementally. This allows you to build user interfaces that update progressively as the agent reasons and executes, without waiting for a full completion.

Relationship to GitHub Copilot-CLI

The SDK is not a separate agent implementation. It is a layer on top of the existing Copilot CLI execution loop, and a way to reuse the planning, tool use, and multi-turn execution behavior of the CLI in any environment.

Copilot-CLI itself continues to evolve. Recent updates add persistent memory, infinite sessions and context compaction; support for explore-and-plan workflows with model selection per step; custom agents and agent skills; full MCP support; and asynchronous task delegation. The SDK benefits from this work, because it exposes that same behavior through language-specific libraries.

Key Takeaways

GitHub Copilot-SDK exposes the same agentic execution loop that powers GitHub Copilot CLI, so applications can call a production tested planner that runs multi step workflows with tools and commands.

The SDK is available for Node.js, Python, Go, and .NET, and each language binding provides a similar abstraction around clients and sessions that manage multi turn conversations and tool use.

Developers define domain specific tools and Model Context Protocol servers, then register them through the SDK, and the Copilot agent decides when and how to call them as part of the plan.

The runtime integrates with GitHub authentication and Copilot subscriptions, supports multiple AI models such as GPT based backends, and exposes real time streaming so applications can render partial responses incrementally.

Check out the GitHub Page.
The post GitHub Releases Copilot-SDK to Embed Its Agentic Runtime in Any App appeared first on MarkTechPost.

How an AI Agent Chooses What to Do Under Tokens, Latency, and Tool-Call Budget Constraints?

In this tutorial, we build a cost-aware planning agent that deliberately balances output quality against real-world constraints such as token usage, latency, and tool-call budgets. We design the agent to generate multiple candidate actions, estimate their expected costs and benefits, and then select an execution plan that maximizes value while staying within strict budgets. With this, we demonstrate how agentic systems can move beyond “always use the LLM” behavior and instead reason explicitly about trade-offs, efficiency, and resource awareness, which is critical for deploying agents reliably in constrained environments. Check out the FULL CODES here.

import os, time, math, json, random
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple, Any
from getpass import getpass

USE_OPENAI = True

if USE_OPENAI:
    if not os.getenv("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    try:
        from openai import OpenAI
        client = OpenAI()
    except Exception as e:
        print("OpenAI SDK import failed. Falling back to offline mode.\nError:", e)
        USE_OPENAI = False

We set up the execution environment and securely load the OpenAI API key at runtime without hardcoding it. We also initialize the client so the agent gracefully falls back to offline mode if the API is unavailable. Check out the FULL CODES here.

def approx_tokens(text: str) -> int:
    return max(1, math.ceil(len(text) / 4))

@dataclass
class Budget:
    max_tokens: int
    max_latency_ms: int
    max_tool_calls: int

@dataclass
class Spend:
    tokens: int = 0
    latency_ms: int = 0
    tool_calls: int = 0

    def within(self, b: Budget) -> bool:
        return (self.tokens <= b.max_tokens and
                self.latency_ms <= b.max_latency_ms and
                self.tool_calls <= b.max_tool_calls)

    def add(self, other: "Spend") -> "Spend":
        return Spend(
            tokens=self.tokens + other.tokens,
            latency_ms=self.latency_ms + other.latency_ms,
            tool_calls=self.tool_calls + other.tool_calls
        )

We define the core budgeting abstractions that enable the agent to reason explicitly about costs. We model token usage, latency, and tool calls as first-class quantities and provide utility methods to accumulate and validate spend. It gives us a clean foundation for enforcing constraints throughout planning and execution. Check out the FULL CODES here.

@dataclass
class StepOption:
    name: str
    description: str
    est_spend: Spend
    est_value: float
    executor: str
    payload: Dict[str, Any] = field(default_factory=dict)

@dataclass
class PlanCandidate:
    steps: List[StepOption]
    spend: Spend
    value: float
    rationale: str = ""

def llm_text(prompt: str, *, model: str = "gpt-5", effort: str = "low") -> str:
    if not USE_OPENAI:
        return ""
    t0 = time.time()
    resp = client.responses.create(
        model=model,
        reasoning={"effort": effort},
        input=prompt,
    )
    _ = (time.time() - t0)
    return resp.output_text or ""

We introduce the data structures that represent individual action choices and full plan candidates. We also define a lightweight LLM wrapper that standardizes how text is generated and measured. This separation allows the planner to reason about actions abstractly without being tightly coupled to execution details. Check out the FULL CODES here.

def generate_step_options(task: str) -> List[StepOption]:
base = [
StepOption(
name=”Clarify deliverables (local)”,
description=”Extract deliverable checklist + acceptance criteria from the task.”,
est_spend=Spend(tokens=60, latency_ms=20, tool_calls=0),
est_value=6.0,
executor=”local”,
),
StepOption(
name=”Outline plan (LLM)”,
description=”Create a structured outline with sections, constraints, and assumptions.”,
est_spend=Spend(tokens=600, latency_ms=1200, tool_calls=1),
est_value=10.0,
executor=”llm”,
payload={“prompt_kind”:”outline”}
),
StepOption(
name=”Outline plan (local)”,
description=”Create a rough outline using templates (no LLM).”,
est_spend=Spend(tokens=120, latency_ms=40, tool_calls=0),
est_value=5.5,
executor=”local”,
),
StepOption(
name=”Risk register (LLM)”,
description=”Generate risks, mitigations, owners, and severity.”,
est_spend=Spend(tokens=700, latency_ms=1400, tool_calls=1),
est_value=9.0,
executor=”llm”,
payload={“prompt_kind”:”risks”}
),
StepOption(
name=”Risk register (local)”,
description=”Generate a standard risk register from a reusable template.”,
est_spend=Spend(tokens=160, latency_ms=60, tool_calls=0),
est_value=5.0,
executor=”local”,
),
StepOption(
name=”Timeline (LLM)”,
description=”Draft a realistic milestone timeline with dependencies.”,
est_spend=Spend(tokens=650, latency_ms=1300, tool_calls=1),
est_value=8.5,
executor=”llm”,
payload={“prompt_kind”:”timeline”}
),
StepOption(
name=”Timeline (local)”,
description=”Draft a simple timeline from a generic milestone template.”,
est_spend=Spend(tokens=150, latency_ms=60, tool_calls=0),
est_value=4.8,
executor=”local”,
),
StepOption(
name=”Quality pass (LLM)”,
description=”Rewrite for clarity, consistency, and formatting.”,
est_spend=Spend(tokens=900, latency_ms=1600, tool_calls=1),
est_value=8.0,
executor=”llm”,
payload={“prompt_kind”:”polish”}
),
StepOption(
name=”Quality pass (local)”,
description=”Light formatting + consistency checks without LLM.”,
est_spend=Spend(tokens=120, latency_ms=50, tool_calls=0),
est_value=3.5,
executor=”local”,
),
]

if USE_OPENAI:
meta_prompt = f”””
You are a planning assistant. For the task below, propose 3-5 OPTIONAL extra steps that improve quality,
like checks, validations, or stakeholder tailoring. Keep each step short.

TASK:
{task}

Return JSON list with fields: name, description, est_value(1-10).
“””
txt = llm_text(meta_prompt, model=”gpt-5″, effort=”low”)
try:
items = json.loads(txt.strip())
for it in items[:5]:
base.append(
StepOption(
name=str(it.get(“name”,”Extra step (local)”))[:60],
description=str(it.get(“description”,””))[:200],
est_spend=Spend(tokens=120, latency_ms=60, tool_calls=0),
est_value=float(it.get(“est_value”, 5.0)),
executor=”local”,
)
)
except Exception:
pass

return base

We focus on generating a diverse set of candidate steps, including both LLM-based and local alternatives with different cost–quality trade-offs. We optionally use the model itself to suggest additional low-cost improvements while still controlling their impact on the budget. By doing so, we enrich the action space without losing efficiency. Check out the FULL CODES here.

def plan_under_budget(
    options: List[StepOption],
    budget: Budget,
    *,
    max_steps: int = 6,
    beam_width: int = 12,
    diversity_penalty: float = 0.2
) -> PlanCandidate:
    def redundancy_cost(chosen: List[StepOption], new: StepOption) -> float:
        key_new = new.name.split("(")[0].strip().lower()
        overlap = 0
        for s in chosen:
            key_s = s.name.split("(")[0].strip().lower()
            if key_s == key_new:
                overlap += 1
        return overlap * diversity_penalty

    beams: List[PlanCandidate] = [PlanCandidate(steps=[], spend=Spend(), value=0.0, rationale="")]

    for _ in range(max_steps):
        expanded: List[PlanCandidate] = []
        for cand in beams:
            for opt in options:
                if opt in cand.steps:
                    continue
                new_spend = cand.spend.add(opt.est_spend)
                if not new_spend.within(budget):
                    continue
                new_value = cand.value + opt.est_value - redundancy_cost(cand.steps, opt)
                expanded.append(
                    PlanCandidate(
                        steps=cand.steps + [opt],
                        spend=new_spend,
                        value=new_value,
                        rationale=cand.rationale
                    )
                )
        if not expanded:
            break
        expanded.sort(key=lambda c: c.value, reverse=True)
        beams = expanded[:beam_width]

    best = max(beams, key=lambda c: c.value)
    return best

We implement the budget-constrained planning logic that searches for the highest-value combination of steps under strict limits. We apply a beam-style search with redundancy penalties to avoid wasteful action overlap. This is where the agent truly becomes cost-aware by optimizing value subject to constraints. Check out the FULL CODES here.

def run_local_step(task: str, step: StepOption, working: Dict[str, Any]) -> str:
name = step.name.lower()
if “clarify deliverables” in name:
return (
“Deliverables checklist:n”
“- Executive summaryn- Scope & assumptionsn- Workplan + milestonesn”
“- Risk register (risk, impact, likelihood, mitigation, owner)n”
“- Next steps + data neededn”
)
if “outline plan” in name:
return (
“Outline:n1) Context & objectiven2) Scopen3) Approachn4) Timelinen5) Risksn6) Next stepsn”
)
if “risk register” in name:
return (
“Risk register (template):n”
“1) Data access delays | High | Mitigation: agree data list + ownersn”
“2) Stakeholder alignment | Med | Mitigation: weekly reviewn”
“3) Tooling constraints | Med | Mitigation: phased rolloutn”
)
if “timeline” in name:
return (
“Timeline (template):n”
“Week 1: discovery + requirementsnWeek 2: prototype + feedbackn”
“Week 3: pilot + metricsnWeek 4: rollout + handovern”
)
if “quality pass” in name:
draft = working.get(“draft”, “”)
return “Light quality pass done (headings normalized, bullets aligned).n” + draft
return f”Completed: {step.name}n”

def run_llm_step(task: str, step: StepOption, working: Dict[str, Any]) -> str:
kind = step.payload.get(“prompt_kind”, “generic”)
context = working.get(“draft”, “”)
prompts = {
“outline”: f”Create a crisp, structured outline for the task below.nTASK:n{task}nReturn a numbered outline.”,
“risks”: f”Create a risk register for the task below. Include: Risk | Impact | Likelihood | Mitigation | Owner.nTASK:n{task}”,
“timeline”: f”Create a realistic milestone timeline with dependencies for the task below.nTASK:n{task}”,
“polish”: f”Rewrite and polish the following draft for clarity and consistency.nDRAFT:n{context}”,
“generic”: f”Help with this step: {step.description}nTASK:n{task}nCURRENT:n{context}”,
}
return llm_text(prompts.get(kind, prompts[“generic”]), model=”gpt-5″, effort=”low”)

def execute_plan(task: str, plan: PlanCandidate) -> Tuple[str, Spend]:
working = {“draft”: “”}
actual = Spend()

for i, step in enumerate(plan.steps, 1):
t0 = time.time()
if step.executor == “llm” and USE_OPENAI:
out = run_llm_step(task, step, working)
tool_calls = 1
else:
out = run_local_step(task, step, working)
tool_calls = 0

dt_ms = int((time.time() – t0) * 1000)
tok = approx_tokens(out)

actual = actual.add(Spend(tokens=tok, latency_ms=dt_ms, tool_calls=tool_calls))
working[“draft”] += f”nn### Step {i}: {step.name}n{out}n”

return working[“draft”].strip(), actual

TASK = "Draft a 1-page project proposal for a logistics dashboard + fleet optimization pilot, including scope, timeline, and risks."
BUDGET = Budget(
    max_tokens=2200,
    max_latency_ms=3500,
    max_tool_calls=2
)

options = generate_step_options(TASK)
best_plan = plan_under_budget(options, BUDGET, max_steps=6, beam_width=14)

print("=== SELECTED PLAN (budget-aware) ===")
for s in best_plan.steps:
    print(f"- {s.name} | est_spend={s.est_spend} | est_value={s.est_value}")
print("\nEstimated spend:", best_plan.spend)
print("Budget:", BUDGET)

print("\n=== EXECUTING PLAN ===")
draft, actual = execute_plan(TASK, best_plan)

print("\n=== OUTPUT DRAFT ===\n")
print(draft[:6000])

print("\n=== ACTUAL SPEND (approx) ===")
print(actual)
print("\nWithin budget?", actual.within(BUDGET))

We execute the selected plan and track actual resource usage step by step. We dynamically choose between local and LLM execution paths and aggregate the final output into a coherent draft. By comparing estimated and actual spend, we demonstrate how planning assumptions can be validated and refined in practice.

In conclusion, we demonstrated how a cost-aware planning agent can reason about its resource consumption and adapt its behavior in real time. We executed only the steps that fit within predefined budgets and tracked actual spend to validate the planning assumptions, closing the loop between estimation and execution. Also, we highlighted how agentic AI systems can become more practical, controllable, and scalable by treating cost, latency, and tool usage as first-class decision variables rather than afterthoughts.

Check out the FULL CODES here.
The post How an AI Agent Chooses What to Do Under Tokens, Latency, and Tool-Call Budget Constraints? appeared first on MarkTechPost.

Qwen Researchers Release Qwen3-TTS: an Open Multilingual TTS Suite wit …

Alibaba Cloud’s Qwen team has open-sourced Qwen3-TTS, a family of multilingual text-to-speech models that targets three core tasks in one stack: voice cloning, voice design, and high quality speech generation.

https://arxiv.org/pdf/2601.15621v1

Model family and capabilities

Qwen3-TTS uses a 12Hz speech tokenizer and 2 language model sizes, 0.6B and 1.7B, packaged into 3 main tasks. The open release exposes 5 models: Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base for voice cloning and generic TTS; Qwen3-TTS-12Hz-0.6B-CustomVoice and Qwen3-TTS-12Hz-1.7B-CustomVoice for promptable preset speakers; and Qwen3-TTS-12Hz-1.7B-VoiceDesign for free form voice creation from natural language descriptions, along with the Qwen3-TTS-Tokenizer-12Hz codec.

All models support 10 languages, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. CustomVoice variants ship with 9 curated timbres, such as Vivian, a bright young Chinese female voice, Ryan, a dynamic English male voice, and Ono_Anna, a playful Japanese female voice, each with a short description that encodes timbre and speaking style.

The VoiceDesign model maps text instructions directly to new voices, for example, ‘speak in a nervous teenage male voice with rising intonation’. The designed voice can then be combined with the Base model by first generating a short reference clip and reusing it via create_voice_clone_prompt.


Architecture, tokenizer, and streaming path

Qwen3-TTS is a dual track language model: one track predicts discrete acoustic tokens from text, while the other handles alignment and control signals. The system is trained on more than 5 million hours of multilingual speech in 3 pre-training stages that move from general mapping, to high quality data, to long context support up to 32,768 tokens.

A key component is the Qwen3-TTS-Tokenizer-12Hz codec. It operates at 12.5 frames per second, about 80 ms per token, and uses 16 quantizers with a 2048 entry codebook. On LibriSpeech test clean it reaches PESQ wideband 3.21, STOI 0.96, and UTMOS 4.16, outperforming SpeechTokenizer, XCodec, Mimi, FireredTTS 2 and other recent semantic tokenizers, while using a similar or lower frame rate.

The tokenizer is implemented as a pure left context streaming decoder, so it can emit waveforms as soon as enough tokens are available. With 4 tokens per packet, each streaming packet carries 320 ms of audio. The non-DiT, BigVGAN-free decoder design reduces decode cost and simplifies batching.

On the language model side, the research team reports end to end streaming measurements on a single vLLM backend with torch.compile and CUDA Graph optimizations. For Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base at concurrency 1, the first packet latency is around 97 ms and 101 ms, with real time factors of 0.288 and 0.313 respectively. Even at concurrency 6, first packet latency stays around 299 ms and 333 ms.
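To put these figures in perspective, the following back-of-the-envelope calculation uses only the numbers reported above (12.5 codec tokens per second, 4 tokens per streaming packet, and a real-time factor of 0.288 for the 0.6B model); the helper functions are illustrative and not part of any Qwen3-TTS API.

TOKEN_RATE_HZ = 12.5   # codec frames per second, about 80 ms per token
TOKENS_PER_PACKET = 4  # tokens bundled into one streaming packet

def packet_duration_ms() -> float:
    # Audio carried by one streaming packet: 4 / 12.5 s = 0.32 s
    return TOKENS_PER_PACKET / TOKEN_RATE_HZ * 1000.0

def generation_time_s(audio_seconds: float, rtf: float) -> float:
    # Wall-clock compute time implied by a real-time factor (RTF below 1 means faster than real time)
    return audio_seconds * rtf

print(packet_duration_ms())                  # 320.0 ms of audio per packet
print(generation_time_s(10.0, rtf=0.288))    # about 2.9 s to synthesize 10 s of audio with the 0.6B model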


Alignment and control

Post training uses a staged alignment pipeline. First, Direct Preference Optimization aligns generated speech with human preferences on multilingual data. Then GSPO with rule based rewards improves stability and prosody. A final speaker fine tuning stage on the Base model yields target speaker variants while preserving the core capabilities of the general model.

Instruction following is implemented in a ChatML style format, where text instructions about style, emotion or tempo are prepended to the input. This same interface powers VoiceDesign, CustomVoice style prompts, and fine grained edits for cloned speakers.

Benchmarks, zero shot cloning, and multilingual speech

On the Seed-TTS test set, Qwen3-TTS is evaluated as a zero-shot voice cloning system. The Qwen3-TTS-12Hz-1.7B-Base model reaches a Word Error Rate of 0.77 on test-zh and 1.24 on test-en. The research team highlights the 1.24 WER on test-en as state of the art among the compared systems, while the Chinese WER is close to, but not lower than, the best CosyVoice 3 score.


On a multilingual TTS test set covering 10 languages, Qwen3-TTS achieves the lowest WER in 6 languages, Chinese, English, Italian, French, Korean, and Russian, and competitive performance on the remaining 4 languages, while also obtaining the highest speaker similarity in all 10 languages compared to MiniMax-Speech and ElevenLabs Multilingual v2.

Cross-lingual evaluations show that Qwen3-TTS-12Hz-1.7B-Base reduces mixed error rate for several language pairs, such as zh-to-ko, where the error drops from 14.4 for CosyVoice3 to 4.82, about a 66 percent relative reduction.

On InstructTTSEval, the Qwen3-TTS-12Hz-1.7B-VoiceDesign model sets new state of the art scores among open source models on Description-Speech Consistency and Response Precision in both Chinese and English, and is competitive with commercial systems like Hume and Gemini on several metrics.

Key Takeaways

Full open source multilingual TTS stack: Qwen3-TTS is an Apache 2.0 licensed suite that covers 3 tasks in one stack, namely high quality TTS, 3 second voice cloning, and instruction based voice design, across 10 languages using the 12Hz tokenizer family.

Efficient discrete codec and real time streaming: The Qwen3-TTS-Tokenizer-12Hz uses 16 codebooks at 12.5 frames per second, reaches strong PESQ, STOI and UTMOS scores, and supports packetized streaming with about 320 ms of audio per packet and sub 120 ms first packet latency for the 0.6B and 1.7B models in the reported setup.

Task specific model variants: The release offers Base models for cloning and generic TTS, CustomVoice models with 9 predefined speakers and style prompts, and a VoiceDesign model that generates new voices directly from natural language descriptions which can then be reused by the Base model.

Strong alignment and multilingual quality: A multi stage alignment pipeline with DPO, GSPO and speaker fine tuning gives Qwen3-TTS low word error rates and high speaker similarity, with lowest WER in 6 of 10 languages and the best speaker similarity in all 10 languages among the evaluated systems, and state of the art zero shot English cloning on Seed TTS.

Check out the Model Weights, Repo and Playground.
The post Qwen Researchers Release Qwen3-TTS: an Open Multilingual TTS Suite with Real-Time Latency and Fine-Grained Voice Control appeared first on MarkTechPost.

Build AI agents with Amazon Bedrock AgentCore using AWS CloudFormation

Agentic AI has become essential for deploying production-ready AI applications, yet many developers struggle with the complexity of manually configuring agent infrastructure across multiple environments. Infrastructure as code (IaC) facilitates the consistent, secure, and scalable infrastructure that autonomous AI systems require. It minimizes manual configuration errors through automated resource management and declarative templates, reducing deployment time from hours to minutes while facilitating infrastructure consistency across environments to help prevent unpredictable agent behavior. It provides version control and rollback capabilities for quick recovery from issues, essential for maintaining agentic system availability, and enables automated scaling and resource optimization through parameterized templates that adapt from lightweight development to production-grade deployments. For agentic applications operating with minimal human intervention, the reliability of IaC, automated validation of security standards, and seamless integration into DevOps workflows are essential for robust autonomous operations.
To streamline resource deployment and management, Amazon Bedrock AgentCore services are now supported by various IaC frameworks, including the AWS Cloud Development Kit (AWS CDK), Terraform, and AWS CloudFormation templates. This integration brings the power of IaC directly to AgentCore so developers can provision, configure, and manage their AI agent infrastructure. In this post, we use CloudFormation templates to build an end-to-end application for a weather activity planner. Examples of using the CDK and Terraform can be found in the GitHub Sample Library.
Building an activity planner agent based on weather
The sample creates a weather activity planner, demonstrating a practical application that processes real-time weather data to provide personalized activity recommendations based on a location of interest. The application consists of multiple integrated components:

Real-time weather data collection – The application retrieves current weather conditions from authoritative meteorological sources such as weather.gov, gathering essential data points including temperature readings, precipitation probability forecasts, wind speed measurements, and other relevant atmospheric conditions that influence outdoor activity suitability.
Weather analysis engine – The application processes raw meteorological data through customized logic to evaluate the suitability of a day for outdoor activities based on multiple weather factors (a rough scoring sketch follows this list):

Temperature comfort scoring – Activities receive reduced suitability scores when temperatures drop below 50°F
Precipitation risk assessment – Rain probabilities exceeding 30% trigger adjustments to outdoor activity recommendations
Wind condition impact evaluation – Wind speeds above 15 mph affect overall comfort and safety ratings for various activities

Personalized recommendation system – The application processes weather analysis results with user preferences and location-based awareness to generate tailored activity suggestions.
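As a rough illustration of how the scoring logic described above might look in code, the following sketch applies the stated thresholds; the exact weights and the function itself are assumptions for demonstration, not the sample's actual implementation.

def outdoor_suitability(temp_f: float, rain_prob_pct: float, wind_mph: float) -> float:
    # Return a 0-1 suitability score for an outdoor activity, using the thresholds from the list above.
    score = 1.0
    if temp_f < 50:           # temperature comfort scoring
        score -= 0.3
    if rain_prob_pct > 30:    # precipitation risk assessment
        score -= 0.4
    if wind_mph > 15:         # wind condition impact evaluation
        score -= 0.2
    return max(score, 0.0)

print(outdoor_suitability(temp_f=45, rain_prob_pct=20, wind_mph=10))  # 0.7, chilly but otherwise fine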

The following diagram shows this flow.

Now let’s look at how this can be implemented using AgentCore services:

AgentCore Browser – For automated browsing of weather data from sources such as weather.gov
AgentCore Code Interpreter – For executing Python code that processes weather data, performs calculations, and implements the scoring algorithms
AgentCore Runtime – For hosting an agent that orchestrates the application flow, managing data processing pipelines, and coordinating between different components
AgentCore Memory – For storing the user preferences as long term memory

The following diagram shows this architecture.

Deploying the CloudFormation template

Download the CloudFormation template End-to-End-Weather-Agent.yaml from GitHub to your local machine
Open CloudFormation from the AWS Management Console
Choose Create stack → With new resources (standard)
Choose the template source (upload a template file) and select your template
Enter a stack name and change any parameters if needed
Review the configuration and acknowledge the IAM capabilities
Choose Submit and monitor deployment progress on the Events tab (a scripted alternative using the AWS SDK for Python follows this list)
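If you prefer to script the deployment instead of working through the console, the same template can be deployed with the AWS SDK for Python (boto3). The following is a minimal sketch; the stack name and Region are placeholder assumptions.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")  # assumed Region

with open("End-to-End-Weather-Agent.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="weather-activity-planner",   # assumed stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # acknowledge IAM resource creation
)

# Block until the stack finishes deploying, then inspect its events in the console if needed.
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(StackName="weather-activity-planner")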

The following are the visual steps for the CloudFormation template deployment.

Running and testing the application

Adding observability and monitoring
AgentCore Observability provides key advantages. It offers quality and trust through detailed workflow visualizations and real-time performance monitoring. You can gain accelerated time-to-market by using Amazon CloudWatch powered dashboards that reduce manual data integration from multiple sources, making it possible to take corrective actions based on actionable insights. Integration flexibility with OpenTelemetry-compatible format supports existing tools such as CloudWatch, DataDog, Arize Phoenix, LangSmith, and LangFuse.
The service provides end-to-end traceability across frameworks and foundation models (FMs), captures critical metrics such as token usage and tool selection patterns, and supports both automatic instrumentation for AgentCore Runtime hosted agents and configurable monitoring for agents deployed on other services. This comprehensive observability approach helps organizations achieve faster development cycles, more reliable agent behavior, and improved operational visibility while building trustworthy AI agents at scale.
The following screenshot shows metrics in the AgentCore Runtime UI.

Customizing for your use case
The weather activity planner AWS CloudFormation template is designed with modular components that can be seamlessly adapted for various applications. For instance, you can customize the AgentCore Browser tool to collect information from different web applications (such as financial websites for investment guidance, social media feeds for sentiment monitoring, or ecommerce sites for price tracking), modify the AgentCore Code Interpreter algorithms to process your specific business logic (such as predictive modeling for sales forecasting, risk assessment for insurance, or quality control for manufacturing), adjust the AgentCore Memory component to store relevant user preferences or business context (such as customer profiles, inventory levels, or project requirements), and reconfigure the Strands Agents tasks to orchestrate workflows specific to your domain (such as supply chain optimization, customer service automation, or compliance monitoring).
Best practices for deployments
We recommend the following practices for your deployments:

Modular component architecture – Design AWS CloudFormation templates with separate sections for each AWS service.
Parameterized template design – Use AWS CloudFormation parameters for the configurable elements to facilitate reusable templates across environments. For example, this can help associate the same base container with multiple agent deployments, help point to two different build configurations, or parameterize the LLM of choice for powering your agents.
AWS Identity and Access Management (IAM) security and least privilege – Implement fine-grained IAM roles for each AgentCore component with specific resource Amazon Resource Names (ARNs). Refer to our documentation on AgentCore security considerations.
Comprehensive monitoring and observability – Enable CloudWatch logging, custom metrics, AWS X-Ray distributed tracing, and alerts across the components.
Version control and continuous integration and continuous delivery (CI/CD) integration – Maintain templates in GitHub with automated validation, comprehensive testing, and AWS CloudFormation StackSets for consistent multi-Region deployments.

You can find a more comprehensive set of best practices at CloudFormation best practices.
Clean up resources
To avoid incurring future charges, delete the resources used in this solution:

On the Amazon S3 console, manually delete the contents inside the bucket you created for template deployment and then delete the bucket.
On the CloudFormation console, choose Stacks in the navigation pane, select the main stack, and choose Delete.

Conclusion
In this post, we introduced an automated solution for deploying AgentCore services using AWS CloudFormation. These preconfigured templates enable rapid deployment of powerful agentic AI systems without the complexity of manual component setup. This automated approach helps save time and facilitates consistent and reproducible deployments so you can focus on building agentic AI workflows that drive business growth.
Try out some more examples from our Infrastructure as Code sample repositories:

Terraform
CloudFormation
CDK

About the authors
Chintan Patel is a Senior Solution Architect at AWS with extensive experience in solution design and development. He helps organizations across diverse industries to modernize their infrastructure, demystify Generative AI technologies, and optimize their cloud investments. Outside of work, he enjoys spending time with his kids, playing pickleball, and experimenting with AI tools.
Shreyas Subramanian is a Principal Data Scientist and helps customers by using Generative AI and deep learning to solve their business challenges using AWS services like Amazon Bedrock and AgentCore. Dr. Subramanian contributes to cutting-edge research in deep learning, Agentic AI, foundation models and optimization techniques with several books, papers and patents to his name. In his current role at Amazon, Dr. Subramanian works with various science leaders and research teams within and outside Amazon, helping to guide customers to best leverage state-of-the-art algorithms and techniques to solve business critical problems. Outside AWS, Dr. Subramanian is an expert reviewer for AI papers and funding via organizations like NeurIPS, ICML, ICLR, NASA and NSF.
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and explores the wilderness with his family.

How the Amazon.com Catalog Team built self-learning generative AI at s …

The Amazon.com Catalog is the foundation of every customer’s shopping experience—the definitive source of product information with attributes that power search, recommendations, and discovery. When a seller lists a new product, the catalog system must extract structured attributes—dimensions, materials, compatibility, and technical specifications—while generating content such as titles that match how customers search. A title isn’t a simple enumeration like color or size; it must balance seller intent, customer search behavior, and discoverability. This complexity, multiplied by millions of daily submissions, makes catalog enrichment an ideal proving ground for self-learning AI.
In this post, we demonstrate how the Amazon Catalog Team built a self-learning system that continuously improves accuracy while reducing costs at scale using Amazon Bedrock.
The challenge
In generative AI deployment environments, improving model performance calls for constant attention. Because models process millions of products, they inevitably encounter edge cases, evolving terminology, and domain-specific patterns where accuracy may degrade. The traditional approach—applied scientists analyzing failures, updating prompts, testing changes, and redeploying—works but is resource-intensive and struggles to keep pace with real-world volume and variety. The challenge isn’t whether we can improve these systems, but how to make improvement scalable and automatic rather than dependent on manual intervention. At Amazon Catalog, we faced this challenge head-on. The tradeoffs seemed impossible: large models would deliver accuracy but wouldn’t scale efficiently to our volume, while smaller models struggled with the complex, ambiguous cases where sellers needed the most help.
Solution overview
Our breakthrough came from an unconventional experiment. Instead of choosing a single model, we deployed multiple smaller models to process the same products. When these models agreed on an attribute extraction, we could trust the result. But when they disagreed—whether from genuine ambiguity, missing context, or one model making an error—we discovered something profound. These disagreements weren’t always errors, but they were almost always indicators of complexity. This led us to design a self-learning system that reimagines how generative AI scales. Multiple smaller models process routine cases through consensus, invoking larger models only when disagreements occur. The larger model is implemented as a supervisor agent with access to specialized tools for deeper investigation and analysis. But the supervisor doesn’t just resolve disputes; it generates reusable learnings stored in a dynamic knowledge base that helps prevent entire classes of future disagreements. We invoke more powerful models only when the system detects high learning value at inference time, while correcting the output. The result is a self-learning system where costs decrease and quality increases—because the system learns to handle edge cases that previously triggered supervisor calls. Error rates fell continuously, not through retraining but through accumulated learnings from resolved disagreements injected into smaller model prompts. The following figure shows the architecture of this self-learning system.

In the self-learning architecture, product data flows through generator-evaluator workers, with disagreements routed to a supervisor for investigation. Post-inference, the system also captures feedback signals from sellers (such as listing updates and appeals) and customers (such as returns and negative reviews). Learnings from the sources are stored in a hierarchical knowledge base and injected back into worker prompts, creating a continuous improvement loop.
The following describes a simplified reference architecture that demonstrates how this self-learning pattern can be implemented using AWS services. While our production system has additional complexity, this example illustrates the core components and data flows.
This system can be built with Amazon Bedrock, which provides the essential infrastructure for multi-model architectures. The ability of Amazon Bedrock to access diverse foundation models enables teams to deploy smaller, efficient models like Amazon Nova Lite as workers and more capable models like Anthropic Claude Sonnet as supervisors—optimizing both cost and performance. For even greater cost efficiency at scale, teams can also deploy open source small models on Amazon Elastic Compute Cloud (Amazon EC2) GPU instances, providing full control over worker model selection and batch throughput optimization. For productionizing a supervisor agent with its specialized tools and dynamic knowledge base, Bedrock AgentCore provides the runtime scalability, memory management, and observability needed to deploy self-learning systems reliably at scale.

Our supervisor agent integrates with Amazon’s extensive Selection and Catalog Systems. The above diagram is a simplified view showing the key features of the agent and some of the AWS services that make it possible. Product data flows through generator-evaluator workers (Amazon EC2 and Amazon Bedrock Runtime), with agreements stored directly and disagreements routed to a supervisor agent (Bedrock AgentCore). The learning aggregator and memory manager utilize Amazon DynamoDB for the knowledge base, with learnings injected back into worker prompts. Human review (Amazon Simple Queue Service (Amazon SQS)) and observability (Amazon CloudWatch) complete the architecture. Production implementations will likely require additional components for scale, reliability, and integration with existing systems.
But how did we arrive at this architecture? The key insight came from an unexpected place.
The insight: Turning disagreements into opportunities
Our perspective shifted during a debugging session. When multiple smaller models (such as Nova Lite) disagreed on product attributes—interpreting the same specification differently based on how they understood technical terminology—we initially saw this as a failure. But the data told a different story: products where our smaller models disagreed correlated with cases requiring more manual review and clarification. When models disagreed, those were precisely the products that needed additional investigation. The disagreements were surfacing learning opportunities, but we couldn’t have engineers and scientists deep-dive on every case. The supervisor agent does this automatically at scale. And crucially, the goal isn’t just to determine which model was right—it’s to extract learnings that help prevent similar disagreements in the future. This is the key to efficient scaling. Disagreements don’t just come from AI workers at inference time. Post-inference, sellers express disagreement through listing updates and appeals—signals that our original extraction might have missed important context. Customers disagree through returns and negative reviews, often indicating that product information didn’t match expectations. These post-inference human signals feed into the same learning pipeline, with the supervisor investigating patterns and generating learnings that help prevent similar issues across future products. We found a sweet spot: attributes with moderate AI worker disagreement rates yielded the richest learnings—high enough to surface meaningful patterns, low enough to indicate solvable ambiguity. When disagreement rates are too low, they typically reflect noise or fundamental model limitations rather than learnable patterns—for those, we consider using more capable workers. When disagreement rates are too high, it signals that worker models or prompts aren’t yet mature enough, triggering excessive supervisor calls that undermine the efficiency gains of the architecture. These thresholds will vary by task and domain; the key is identifying your own sweet spot where disagreements represent genuine complexity worth investigating, rather than fundamental gaps in worker capability or random noise.
Deep dive: How it works
At the heart of our system are multiple lightweight worker models operating in parallel—some as generators extracting attributes, others as evaluators assessing those extractions. These workers can be implemented in a non-agentic way with fixed inputs, making them batch-friendly and scalable. The generator-evaluator pattern creates productive tension, conceptually similar to the productive tension in generative adversarial networks (GANs), though our approach operates at inference time through prompting rather than training. We explicitly prompt evaluators to be critical, instructing them to scrutinize extractions for ambiguities, missing context, or potential misinterpretations. This adversarial dynamic surfaces disagreements that represent genuine complexity rather than letting ambiguous cases pass through undetected. When the generator and evaluator agree, we have high confidence in the result and process it at minimal computational cost. This consensus path handles most product attributes. When they disagree, we’ve identified a case worth investigating—triggering the supervisor to resolve the dispute and extract reusable learnings.
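A minimal sketch of this consensus-or-escalate loop is shown below; the generator, evaluator, and supervisor callables are placeholders for whichever Amazon Bedrock model invocations a team chooses, and the learning format is simplified for illustration.

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class ExtractionResult:
    value: str
    escalated: bool
    learning: Optional[str] = None

def extract_attribute(
    product_text: str,
    attribute: str,
    generator: Callable[[str], str],                          # lightweight worker that proposes a value
    evaluator: Callable[[str, str], bool],                    # lightweight worker that critiques the proposal
    supervisor: Callable[[str, str, str], Tuple[str, str]],   # capable model, invoked only on disagreement
    learnings: List[str],
) -> ExtractionResult:
    context = "\n".join(learnings)                            # inject accumulated learnings into the worker prompt
    proposal = generator(f"{context}\nExtract {attribute} from:\n{product_text}")

    if evaluator(product_text, proposal):
        # Consensus path: trust the result at minimal cost.
        return ExtractionResult(value=proposal, escalated=False)

    # Disagreement path: resolve the dispute and harvest a reusable learning.
    resolved, new_learning = supervisor(product_text, attribute, proposal)
    learnings.append(new_learning)
    return ExtractionResult(value=resolved, escalated=True, learning=new_learning)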
Our architecture treats disagreement as a universal learning signal. At inference time, worker-to-worker disagreements catch ambiguity. Post-inference, seller feedback catches misalignments with intent and customer feedback catches misalignments with expectations. The three channels feed the supervisor, which extracts learnings that improve accuracy across the board. When workers disagree, we invoke a supervisor agent—a more capable model that resolves the dispute and investigates why it occurred. The supervisor determines what context or reasoning the workers lacked, and these insights become reusable learnings for future cases. For example, when workers disagreed about usage classification for a product based on certain technical terms, the supervisor investigated and clarified that those terms alone were insufficient—visual context and other indicators needed to be considered together. The supervisor generated a learning about how to properly weight different signals for that product category. This learning immediately updated our knowledge base, and when injected into worker prompts for similar products, helped prevent future disagreements across thousands of items. While the workers could theoretically be the same model as the supervisor, using smaller models is crucial for efficiency at scale. The architectural advantage emerges from this asymmetry: lightweight workers handle routine cases through consensus, while the more capable supervisor is invoked only when disagreements surface high-value learning opportunities. As the system accumulates learnings and disagreement rates drop, supervisor calls naturally decline—efficiency gains are baked directly into the architecture. This worker-supervisor heterogeneity also enables richer investigation. Because supervisors are invoked selectively, they can afford to pull in additional signals—customer reviews, return reasons, seller history—that would be impractical to retrieve for every product but provide crucial context when resolving complex disagreements. When these signals yield generalizable insights about how customers want product information presented—which attributes to highlight, what terminology resonates, how to frame specifications—the resulting learnings benefit future inferences across similar products without retrieving those resource-intensive signals again. Over time, this creates a feedback loop: better product information leads to fewer returns and negative reviews, which in turn reflects improved customer satisfaction.
The knowledge base: Making learnings scalable
The supervisor investigates disagreements at the individual product level. With millions of items to process, we need a scalable way to transform these product-specific insights into reusable learnings. Our aggregation strategy adapts to context: high-volume patterns get synthesized into broader learnings, while unique or critical cases are preserved individually. We use a hierarchical structure where a large language model (LLM)-based memory manager navigates the knowledge tree to place each learning. Starting from the root, it traverses categories and subcategories, deciding at each level whether to continue down an existing path, create a new branch, merge with existing knowledge, or replace outdated information. This dynamic organization allows the knowledge base to evolve with emerging patterns while maintaining logical structure. During inference, workers receive relevant learnings in their prompts based on product category, automatically incorporating domain knowledge from past disagreements. The knowledge base also introduces traceability—when an extraction seems incorrect, we can pinpoint exactly which learning influenced it. This shifts auditing from an unscalable task to a practical one: instead of reviewing a sample of millions of outputs—where human effort grows proportionally with scale—teams can audit the knowledge base itself, which remains relatively fixed in size regardless of inference volume. Domain experts can directly contribute by adding or refining entries, no retraining required. A single well-crafted learning can immediately improve accuracy across thousands of products. The knowledge base bridges human expertise and AI capability, where automated learnings and human insights work together.
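As a simplified stand-in for the DynamoDB-backed knowledge base and LLM memory manager described above, the following sketch keys learnings by a category path and returns everything from the root down to a leaf; the in-memory structure and category tuples are illustrative assumptions.

from collections import defaultdict
from typing import List, Tuple

class KnowledgeBase:
    def __init__(self):
        # Maps a category path such as ("Electronics", "Cables") to a list of learnings.
        self._tree = defaultdict(list)

    def add_learning(self, category_path: Tuple[str, ...], learning: str) -> None:
        self._tree[category_path].append(learning)

    def learnings_for(self, category_path: Tuple[str, ...]) -> List[str]:
        # Collect learnings from the root down to the given category, most general first.
        collected = []
        for depth in range(len(category_path) + 1):
            collected.extend(self._tree[category_path[:depth]])
        return collected

kb = KnowledgeBase()
kb.add_learning((), "Preserve units exactly as written by the seller.")
kb.add_learning(("Electronics", "Cables"), "Connector type alone does not determine compatibility; check both ends.")
print(kb.learnings_for(("Electronics", "Cables")))  # root learning plus the category-specific learning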
Lessons learned and best practices
When this self-learning architecture works best:

High-volume inference where input diversity drives compounded learning
Quality-critical applications where consensus provides natural quality assurance
Evolving domains with new patterns and terminology constantly emerging

It’s less suitable for low-volume scenarios (insufficient disagreements for learning) or use cases with fixed, unchanging rules.
Critical success factors:

Defining disagreements: With a generator-evaluator pair, disagreement occurs when the evaluator flags the extraction as needing improvement. With multiple workers, scale thresholds accordingly. The key is maintaining productive tension between workers. If disagreement rates fall outside the productive range (too low or too high), consider more capable workers or refined prompts.
Tracking learning effectiveness: Disagreement rates must decrease over time—this is your primary health metric. If rates stay flat, check knowledge retrieval, prompt injection, or evaluator criticality.
Knowledge organization: Structure learnings hierarchically and keep them actionable. Abstract guidance doesn’t help; specific, concrete learnings directly improve future inferences.

Common pitfalls

Focusing on cost over intelligence: Cost reduction is a byproduct, not the goal
Rubber-stamp evaluators: Evaluators that simply approve generator outputs won’t surface meaningful disagreements—prompt them to actively challenge and critique extractions
Poor learning extraction: Supervisors must identify generalizable patterns, not just fix individual cases
Knowledge rot: Without organization, learnings become unsearchable and unusable

The key insight: treat declining disagreement rates as your north star metric—they show the system is truly learning.
Deployment strategies: Two approaches

Learn-then-deploy: Start with basic prompts and let the system learn aggressively in a pre-production environment. Domain experts then audit the knowledge base—not individual outputs—to make sure learned patterns align with desired outcomes. When approved, deploy with validated learnings. This is ideal for new use cases where you don’t yet know what good looks like—disagreements help discover the right patterns, and knowledge base auditing lets you shape them before production.
Deploy-and-learn: Start with refined prompts and good initial quality, then continuously improve through ongoing learning in production. This works best for well-understood use cases where you can define quality upfront but still want to capture domain-specific nuances over time.

Both approaches use the same architecture—the choice depends on whether you’re exploring new territory or optimizing familiar ground.
Conclusion
What started as an experiment in catalog enrichment revealed a fundamental truth: AI systems don’t have to be frozen in time. By embracing disagreements as learning signals rather than failures, we’ve built an architecture that accumulates domain knowledge through actual usage. We watched the system evolve from generic understanding to domain-specific expertise. It learned industry-specific terminology. It discovered contextual rules that vary across categories. It adapted to requirements no pre-trained model would encounter—all without retraining, through learnings stored in a knowledge base and injected back into worker prompts. For teams operationalizing similar architectures, Amazon Bedrock AgentCore offers purpose-built capabilities:

AgentCore Runtime  handles quick consensus decisions for routine cases while supporting extended reasoning when supervisors investigate complex disagreements
AgentCore Observability provides visibility into which learnings drive impact, helping teams refine knowledge propagation and maintain reliability at scale

The implications extend beyond catalog management. High-volume AI applications could benefit from this process—and the ability of Amazon Bedrock to access diverse models makes this architecture straightforward to implement. The key insight is this: we’ve shifted from asking “which model should we use?” to “how can we build systems that learn our specific patterns?” Whether you learn-then-deploy for new use cases or deploy-and-learn for established ones, the implementation is straightforward: start with workers suited to your task, choose a supervisor, and let disagreements drive learning. With the right architecture, every inference can become an opportunity to capture domain knowledge. That’s not just scaling—that’s building institutional knowledge into your AI systems.
Acknowledgement
This work wouldn’t have been possible without the contributions and support from Ankur Datta (Senior Principal Applied Scientist – leader of science in Everyday Essentials Stores), Zhu Cheng (Applied Scientist), Xuan Tang (Software Engineer), and Mohammad Ghasemi (Applied Scientist). We sincerely appreciate their contributions to designs and implementations, numerous fruitful brainstorming sessions, and all the insightful ideas and suggestions.

About the authors
Tarik Arici is a Principal Scientist at Amazon Selection and Catalog Systems (ASCS), where he pioneers self-learning generative AI systems design for catalog quality enhancement at scale. His work focuses on building AI systems that automatically accumulate domain knowledge through production usage—learning from customer reviews and returns, seller feedback, and model disagreements to improve quality while reducing costs. Tarik holds a PhD in Electrical and Computer Engineering from Georgia Institute of Technology.
Sameer Thombare is a Senior Product Manager at Amazon with over a decade of experience in Product Management, Category/P&L Management across diverse industries, including heavy engineering, telecommunications, finance, and eCommerce. Sameer is passionate about developing continuously improving closed-loop systems and leads strategic initiatives within Amazon Selection and Catalog Systems (ASCS) to build a sophisticated self-learning closed-loop system that synthesize signals from customers, sellers, and supply chain operations to optimize outcomes. Sameer holds an MBA from the Indian Institute of Management Bangalore and an engineering degree from Mumbai University.
Amin Banitalebi received his PhD in the Digital Media at the University of British Columbia (UBC), Canada, in 2014. Since then, he has taken various applied science roles spanning over areas in computer vision, natural language processing, recommendation systems, classical machine learning, and generative AI. Amin has co-authored over 90 publications and patents. He is currently an Applied Science Manager in Amazon Everyday Essentials.
Puneet Sahni is a Senior Principal Engineer at Amazon Selection and Catalog Systems (ASCS), where he has spent over 8 years improving the completeness, consistency, and correctness of catalog data. He specializes in catalog data modeling and its application to enhancing Selling Partner and customer experiences, while using ML/DL and LLM-based enrichment to drive improvements in catalog data quality.
Erdinc Basci joined Amazon in 2015 and brings over 23 years of technology industry experience. At Amazon, he has led the evolution of Catalog system architectures—including ingestion pipelines, prioritized processing, and traffic shaping—as well as catalog data architecture improvements such as segmented offers, product specifications for manufacture-on-demand products, and catalog data experimentation. Erdinc has championed a hands-on performance engineering culture across Amazon services unlocking $1B+ annualized cost savings and 20%+ latency wins across core Stores services. He is currently focused on improving generative AI application performance and GPU efficiency across Amazon. Erdinc holds a BS in Computer Science from Bilkent University, Turkey, and an MBA from Seattle University, US.
Mey Meenakshisundaram is a Director in Amazon Selection and Catalog Systems, where he leads innovative GenAI solutions to establish Amazon’s worldwide catalog as the best-in-class source for product information. His team pioneers advanced machine learning techniques, including multi-agent systems and large language models, to automatically enrich product attributes and improve catalog quality at scale. High-quality product information in the catalog is critical for delighting customers in finding the right products, empowering selling partners to list their products effectively, and enabling Amazon operations to reduce manual effort.

Microsoft Releases VibeVoice-ASR: A Unified Speech-to-Text Model Desig …

Microsoft has released VibeVoice-ASR as part of the VibeVoice family of open source frontier voice AI models. VibeVoice-ASR is described as a unified speech-to-text model that can handle 60-minute long-form audio in a single pass and output structured transcriptions that encode Who, When, and What, with support for Customized Hotwords.

VibeVoice sits in a single repository that hosts Text-to-Speech, real time TTS, and Automatic Speech Recognition models under an MIT license. VibeVoice uses continuous speech tokenizers that run at 7.5 Hz and a next-token diffusion framework where a Large Language Model reasons over text and dialogue and a diffusion head generates acoustic detail. This framework is mainly documented for TTS, but it defines the overall design context in which VibeVoice-ASR lives.

https://huggingface.co/microsoft/VibeVoice-ASR

Long form ASR with a single global context

Unlike conventional ASR (Automatic Speech Recognition) systems that first cut audio into short segments and then run diarization and alignment as separate components, VibeVoice-ASR is designed to accept up to 60 minutes of continuous audio input within a 64K token length budget. The model keeps one global representation of the full session. This means the model can maintain speaker identity and topic context across the entire hour instead of resetting every few seconds.

60-minute Single-Pass Processing

Many conventional ASR systems process long audio by cutting it into short segments, which can lose global context. The first key feature of VibeVoice-ASR is that it instead takes up to 60 minutes of continuous audio within a 64K token window, so it can maintain consistent speaker tracking and semantic context across the entire recording.

This is important for tasks like meeting transcription, lectures, and long support calls. A single pass over the complete sequence simplifies the pipeline. There is no need to implement custom logic to merge partial hypotheses or repair speaker labels at boundaries between audio chunks.
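To see why an hour of audio fits inside the 64K window, a rough calculation helps; it assumes the 7.5 Hz continuous tokenizer rate quoted for the VibeVoice family also applies to the ASR audio stream, which the model card does not state explicitly.

FRAME_RATE_HZ = 7.5      # speech tokens per second, assumed from the VibeVoice family description
AUDIO_MINUTES = 60
CONTEXT_TOKENS = 64 * 1024

audio_tokens = int(AUDIO_MINUTES * 60 * FRAME_RATE_HZ)  # 27,000 tokens for one hour of audio
remaining = CONTEXT_TOKENS - audio_tokens               # budget left for transcript text and hotwords
print(audio_tokens, remaining)                          # 27000 38536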

Customized Hotwords for domain accuracy

Customized Hotwords are the second key feature. Users can provide hotwords such as product names, organization names, technical terms, or background context. The model uses these hotwords to guide the recognition process.

This allows you to bias decoding toward the correct spelling and pronunciation for domain specific tokens without retraining the model. For example, a developer can pass internal project names or customer specific terms at inference time. This is useful when deploying the same base model across several products that share similar acoustic conditions but very different vocabularies.

Microsoft also ships a finetuning-asr directory with LoRA based fine tuning scripts for VibeVoice-ASR. Together, hotwords and LoRA fine tuning give a path for both lightweight adaptation and deeper domain specialization.

Rich Transcription, diarization, and timing

The third feature is Rich Transcription with Who, When, and What. The model jointly performs ASR, diarization, and timestamping, and returns a structured output that indicates who said what and when.

The model card reports three evaluation figures, covering DER, cpWER, and tcpWER:


DER is Diarization Error Rate; it measures how well the model assigns speech segments to the correct speaker

cpWER and tcpWER are word error rate metrics computed under conversational settings

These graphs summarize how well the model performs on multi speaker long form data, which is the primary target setting for this ASR system.

The structured output format is well suited for downstream processing like speaker specific summarization, action item extraction, or analytics dashboards. Since segments, speakers, and timestamps already come from a single model, downstream code can treat the transcript as a time aligned event log.
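The exact output schema is not shown in the model card, so the segment format below (speaker, start, end, text) is a hypothetical illustration of what such downstream processing can look like once the transcript is treated as a time-aligned event log.

from collections import defaultdict

segments = [
    {"speaker": "S1", "start": 0.0, "end": 4.2, "text": "Welcome everyone, let's review the roadmap."},
    {"speaker": "S2", "start": 4.2, "end": 9.8, "text": "The launch slips by one week because of testing."},
    {"speaker": "S1", "start": 9.8, "end": 12.5, "text": "Noted, please update the tracker."},
]

# Speaker-specific rollup: total talk time and per-speaker text, useful for summarization or analytics.
talk_time = defaultdict(float)
by_speaker = defaultdict(list)
for seg in segments:
    talk_time[seg["speaker"]] += seg["end"] - seg["start"]
    by_speaker[seg["speaker"]].append(seg["text"])

for spk, secs in talk_time.items():
    print(f"{spk}: {secs:.1f}s -> {' '.join(by_speaker[spk])}")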

Key Takeaways

VibeVoice-ASR is a unified speech to text model that handles 60 minute long form audio in a single pass within a 64K token context.

The model jointly performs ASR, diarization, and timestamping so it outputs structured transcripts that encode Who, When, and What in a single inference step.

Customized Hotwords let users inject domain specific terms such as product names or technical jargon to improve recognition accuracy without retraining the model.

Evaluation with DER, cpWER, and tcpWER focuses on multi speaker conversational scenarios which aligns the model with meetings, lectures, and long calls.

VibeVoice-ASR is released in the VibeVoice open source stack under MIT license with official weights, fine tuning scripts, and an online Playground for experimentation.

Check out the Model Weights, Repo and Playground.
The post Microsoft Releases VibeVoice-ASR: A Unified Speech-to-Text Model Designed to Handle 60-Minute Long-Form Audio in a Single Pass appeared first on MarkTechPost.

How PDI built an enterprise-grade RAG system for AI applications with …

PDI Technologies is a global leader in the convenience retail and petroleum wholesale industries. They help businesses around the globe increase efficiency and profitability by securely connecting their data and operations. With 40 years of experience, PDI Technologies assists customers in all aspects of their business, from understanding consumer behavior to simplifying technology ecosystems across the supply chain.
Enterprises face a significant challenge of making their knowledge bases accessible, searchable, and usable by AI systems. Internal teams at PDI Technologies were struggling with information scattered across disparate systems including websites, Confluence pages, SharePoint sites, and various other data sources. To address this, PDI Technologies built PDI Intelligence Query (PDIQ), an AI assistant that gives employees access to company knowledge through an easy-to-use chat interface. This solution is powered by a custom Retrieval Augmented Generation (RAG) system, built on Amazon Web Services (AWS) using serverless technologies. Building PDIQ required addressing the following key challenges:

Automatically extracting content from diverse sources with different authentication requirements
Needing the flexibility to select, apply, and interchange the most suitable large language model (LLM) for diverse processing requirements
Processing and indexing content for semantic search and contextual retrieval
Creating a knowledge foundation that enables accurate, relevant AI responses
Continuously refreshing information through scheduled crawling
Supporting enterprise-specific context in AI interactions

In this post, we walk through the PDIQ process flow and architecture, focusing on the implementation details and the business outcomes it has helped PDI achieve.
Solution architecture
In this section, we explore PDIQ’s comprehensive end-to-end design. We examine the data ingestion pipeline from initial processing through storage to user search capabilities, as well as the zero-trust security framework that protects key user personas throughout their platform interactions. The architecture consists of these elements:

Scheduler – Amazon EventBridge maintains and executes the crawler scheduler.
Crawlers – AWS Lambda invokes crawlers that are executed as tasks by Amazon Elastic Container Service (Amazon ECS).
Amazon DynamoDB – Persists crawler configurations and other metadata such as Amazon Simple Storage Service (Amazon S3) image location and captions.
Amazon S3 – All source documents are stored in Amazon S3. Amazon S3 events trigger the downstream flow for every object that is created or deleted.
Amazon Simple Notification Service (Amazon SNS) – Receives notification from Amazon S3 events.
Amazon Simple Queue Service (Amazon SQS) – Subscribed to Amazon SNS to hold the incoming requests in a queue.
AWS Lambda – Handles the business logic for chunking, summarizing, and generating vector embeddings.
Amazon Bedrock – Provides API access to foundation models (FMs) used by PDIQ:

Amazon Nova Lite to generate image caption
Amazon Nova Micro to generate document summary
Amazon Titan Text Embeddings V2 to generate vector embeddings
Amazon Nova Pro to generate responses to user inquiries

Amazon Aurora PostgreSQL-Compatible Edition – Stores vector embeddings.

The following diagram is the solution architecture.

Next, we review how PDIQ implements a zero-trust security model with role-based access control for two key personas:

Administrators configure knowledge bases and crawlers through Amazon Cognito user groups integrated with enterprise single sign-on. Crawler credentials are encrypted at rest using AWS Key Management Service (AWS KMS) and only accessible within isolated execution environments.
End users access knowledge bases based on group permissions validated at the application layer. Users can belong to multiple groups (such as human resources or compliance) and switch contexts to query role-appropriate datasets.

Process flow
In this section, we review the end-to-end process flow. We break it down by sections to dive deeper into each step and explain the functionality.

Crawlers
Crawlers are configured by administrators to collect data from a variety of sources that PDI relies on. Crawlers hydrate the data into the knowledge base so that this information can be retrieved by end users. PDIQ currently supports the following crawler configurations:

Web crawler – By using Puppeteer for headless browser automation, the crawler converts HTML web pages to markdown format using turndown. By following the embedded links on the website, the crawler can capture full context and relationships between pages. Additionally, the crawler downloads assets such as PDFs and images while preserving the original reference and offers users configuration options such as rate limiting.
Confluence crawler – This crawler uses the Confluence REST API with authenticated access to extract page content, attachments, and embedded images. It preserves page hierarchy and relationships and handles special Confluence elements such as info boxes, notes, and more.
Azure DevOps crawler – PDI uses Azure DevOps to manage its code base, track commits, and maintain project documentation in a centralized repository. PDIQ uses the Azure DevOps REST API with OAuth or personal access token (PAT) authentication to extract this information. The Azure DevOps crawler preserves project hierarchy, sprint relationships, and backlog structure, and also maps work item relationships (such as parent/child or linked items), thereby providing a complete view of the dataset.
SharePoint crawler – It uses Microsoft Graph API with OAuth authentication to extract document libraries, lists, pages, and file content. The crawler processes MS Office documents (Word, Excel, PowerPoint) into searchable text and maintains document version history and permission metadata.

By building separate crawler configurations, PDIQ offers easy extensibility into the platform to configure additional crawlers on demand. It also gives administrator users the flexibility to configure the settings for their respective crawlers (such as frequency, depth, or rate limits).
The following figure shows the PDIQ UI to configure the knowledge base.

The following figure shows the PDI UI to configure your crawler (such as Confluence).

The following figure shows the PDIQ UI to schedule crawlers.

Handling images
Crawled data is stored in Amazon S3 with proper metadata tags. If the source is in HTML format, the task converts the content into markdown (.md) files. For these markdown files, there is an additional optimization step that replaces the images in the document with the Amazon S3 reference location. Key benefits of this approach include:

PDI can use S3 object keys to uniquely reference each image, thereby optimizing the synchronization process to detect changes in source data
You can optimize storage by replacing images with captions and avoiding the need to store duplicate images
It provides the ability to make the content of the images searchable and relatable to the text content in the document
It can seamlessly inject original images when rendering a response to a user inquiry

The following is a sample markdown file where images are replaced with the S3 file location:

![image-20230113-074652](https://amzn-s3-demo-bucket.s3.amazonaws.com/kb/123/file/attachments/12133171243_image-20230113-074652.png)

Document processing
This is the most critical step of the process. The key objective of this step is to generate vector embeddings so that they can be used for similarity matching and effective retrieval based on user inquiry. The process follows several steps, starting with image captioning, then document chunking, summary generation, and embedding generation. To caption the images, PDIQ scans the markdown files to locate image tags <image>. For each of these images, PDIQ generates an image caption that explains the content of the image. This caption gets injected back into the markdown file, next to the <image> tag, thereby enriching the document content. This approach offers improved contextual searchability: PDIQ enhances content discovery by embedding insights extracted from images directly into the original markdown files, so image content becomes part of the searchable text, enabling richer and more accurate context retrieval during search and analysis.

The approach also saves costs. To avoid unnecessary LLM inference calls for the exact same images, PDIQ stores image metadata (file location and generated captions) in Amazon DynamoDB. This enables efficient reuse of previously generated captions, eliminating the need for repeated caption generation calls to the LLM.
The following is an example of an image caption prompt:

You are a professional image captioning assistant. Your task is to provide clear, factual, and objective descriptions of images. Focus on describing visible elements, objects, and scenes in a neutral and appropriate manner.

The following is a snippet of markdown file that contains the image tag, LLM-generated caption, and the corresponding S3 file location:

![image-20230818-114454: The image displays a security tip notification on a computer screen. The notification is titled “Security tip” and advises the user to use generated passwords to keep their accounts safe. The suggested password, “2m5oFX#g&tLRMhN3,” is shown in a green box. Below the suggested password, there is a section labeled “Very Strong,” indicating the strength of the password. The password length is set to 16 characters, and it includes lowercase letters, uppercase letters, numbers, and symbols. There is also a “Dismiss” button to close the notification. Below the password section, there is a link to “See password history.” The bottom of the image shows navigation icons for “Vault,” “Generator,” “Alerts,” and “Account.” The “Generator” icon is highlighted in red.]
(https://amzn-s3-demo-bucket.s3.amazonaws.com/kb/ABC/file/attachments/12133171243_image-20230818-114454.png)

Now that the markdown files are injected with image captions, the next step is to break the original document into chunks that fit into the context window of the embeddings model. PDIQ uses the Amazon Titan Text Embeddings V2 model to generate vectors and stores them in Aurora PostgreSQL-Compatible Serverless. Based on internal accuracy testing and chunking best practices from AWS, PDIQ performs chunking as follows:

70% of the tokens for content
10% overlap between chunks
20% for summary tokens

Using the document chunking logic from the previous step, the document is converted into vector embeddings. The process includes:

Calculate chunk parameters – Determine the size and total number of chunks required for the document based on the 70% calculation.
Generate document summary – Use Amazon Nova Lite to create a summary of the entire document, constrained by the 20% token allocation. This summary is reused across all chunks to provide consistent context.
Chunk and prepend summary – Split the document into overlapping chunks (10%), with the summary prepended at the top.
Generate embeddings – Use Amazon Titan Text Embeddings V2 to generate vector embeddings for each chunk (summary plus content), which are then stored in the vector store.

By designing a customized approach that places a summary section atop all chunks, PDIQ ensures that when a particular chunk is matched through similarity search, the LLM has access to a summary of the entire document and not only the chunk that matched. This approach enriches the end-user experience and increased the accuracy approval rate from 60% to 79%.
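As an illustration of the token budget and summary-prepending logic, the following sketch splits a document into overlapping chunks, prepends the shared summary to each chunk, and generates embeddings with Amazon Titan Text Embeddings V2. The 8,192-token context figure, the whitespace tokenizer, and the function names are illustrative assumptions, not PDIQ's code.

```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime")

# Token budget per chunk; the 8,192-token context figure is an assumption about
# the embeddings model limit, and the 70/10/20 split follows the text above.
CONTEXT_TOKENS = 8192
CONTENT_TOKENS = int(CONTEXT_TOKENS * 0.70)   # chunk content
OVERLAP_TOKENS = int(CONTEXT_TOKENS * 0.10)   # overlap between consecutive chunks
SUMMARY_TOKENS = int(CONTEXT_TOKENS * 0.20)   # reserved for the prepended summary

def rough_tokens(text: str) -> list[str]:
    # Whitespace split as a crude stand-in for a real tokenizer.
    return text.split()

def chunk_with_summary(document: str, summary: str) -> list[str]:
    """Split the document into overlapping chunks and prepend the shared summary."""
    tokens = rough_tokens(document)
    step = CONTENT_TOKENS - OVERLAP_TOKENS
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        body = " ".join(tokens[start:start + CONTENT_TOKENS])
        chunks.append(f"{summary}\n\n{body}")
    return chunks

def embed(text: str) -> list[float]:
    """Generate a vector with Amazon Titan Text Embeddings V2."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

# Example usage (requires AWS credentials and Bedrock model access).
vectors = [embed(chunk) for chunk in chunk_with_summary("long document text ...", "document summary ...")]
```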
The following is an example of a summarization prompt:

You are a specialized document summarization assistant with expertise in business and technical content.

Your task is to create concise, information-rich summaries that:
Preserve all quantifiable data (numbers, percentages, metrics, dates, financial figures)
Highlight key business terminology and domain-specific concepts
Extract important entities (people, organizations, products, locations)
Identify critical relationships between concepts
Maintain factual accuracy without adding interpretations
Focus on extracting information that would be most valuable for:
Answering specific business questions
Supporting data-driven decision making
Enabling precise information retrieval in a RAG system
The summary should be comprehensive yet concise, prioritizing specific facts over general descriptions.
Include any tables, lists, or structured data in a format that preserves their relationships.
Ensure all technical terms, acronyms, and specialized vocabulary are preserved exactly as written.

The following is an example of summary text, available on each chunk:

### Summary: PLC User Creation Process and Password Reset
**Document Overview:**
This document provides instructions for creating new users and resetting passwords
**Key Instructions:**

{Shortened for Blog illustration}

This summary captures the essential steps, requirements, and entities involved in the PLC user creation and password reset process using Jenkins.

Chunk 1 has a summary at the top followed by details from the source:

{Summary Text from above}
This summary captures the essential steps, requirements, and entities involved in the PLC user creation and password reset process using Jenkins.

title: 2. PLC User Creation Process and Password Reset

![image-20230818-114454: The image displays a security tip notification on a computer screen. The notification is titled “Security tip” and advises the user to use generated passwords to keep their accounts safe. The suggested password, “2m5oFX#g&tLRMhN3,” is shown in a green box. Below the suggested password, there is a section labeled “Very Strong,” indicating the strength of the password. The password length is set to 16 characters, and it includes lowercase letters, uppercase letters, numbers, and symbols. There is also a “Dismiss” button to close the notification. Below the password section, there is a link to “See password history.” The bottom of the image shows navigation icons for “Vault,” “Generator,” “Alerts,” and “Account.” The “Generator” icon is highlighted in red.](https://amzn-s3-demo-bucket.s3.amazonaws.com/kb/123/file/attachments/12133171243_image-20230818-114454.png)

Chunk 2 has a summary at the top, followed by continuation of details from the source:

{Summary Text from above}
This summary captures the essential steps, requirements, and entities involved in the PLC user creation and password reset process using Jenkins.

Maintains a menu with options such as

![image-20230904-061307: – The generated text has been blocked by our content filters.](https://amzn-s3-demo-bucket.s3.amazonaws.com/kb/123/file/attachments/12133171243_image-20230904-061307.png)

PDIQ scans each document chunk and generates vector embeddings. This data is stored in the Aurora PostgreSQL database with key attributes, including a unique knowledge base ID, the corresponding embeddings attribute, the original text (summary plus chunk plus image caption), and a JSON binary object that includes metadata fields for extensibility. To keep the knowledge base in sync, PDIQ implements the following steps:

Add – These are net new source objects that should be ingested. PDIQ implements the document processing flow described previously.
Update – If PDIQ determines the same object is present, it compares the hash key value from the source with the hash value from the JSON object; if the values differ, the document is reprocessed and re-embedded (see the sketch after this list).
Delete – If PDIQ determines that a specific source document no longer exists, it triggers a delete operation on the S3 bucket (s3:ObjectRemoved:*), which results in a cleanup job that deletes the records corresponding to the key value in the Aurora table.
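A minimal sketch of the add/update decision based on content hashing might look like the following; the metadata field names and the SHA-256 choice are assumptions, not PDIQ's actual schema.

```python
import hashlib
import boto3

s3 = boto3.client("s3")

def content_hash(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()

def sync_object(bucket: str, key: str, stored: dict | None) -> str:
    """Decide whether a crawled object is an add, an update, or unchanged.

    `stored` is the metadata JSON previously written for this key
    (None if the object has never been ingested)."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    new_hash = content_hash(body)
    if stored is None:
        return "add"            # net-new source object, run full document processing
    if stored.get("content_hash") != new_hash:
        return "update"         # hash mismatch, reprocess and re-embed the document
    return "unchanged"          # skip, embeddings are already current
```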

PDIQ uses Amazon Nova Pro to retrieve the most relevant documents and generate a response by following these key steps:

Using similarity search, PDIQ retrieves the most relevant document chunks, which include the summary, chunk data, image caption, and image link.
For the matching chunk, it retrieves the entire document.
The LLM then replaces the image link with the actual image from Amazon S3.
The LLM generates a response based on the retrieved data and the preconfigured system prompt (a minimal retrieval-and-generation sketch follows this list).
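The sketch below illustrates this retrieval-and-generation flow with a pgvector similarity search in Aurora PostgreSQL followed by an Amazon Nova Pro call through the Bedrock Converse API. The table and column names, the cosine-distance operator, and the model ID are illustrative assumptions, not PDIQ's actual implementation.

```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime")

def embed_query(question: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": question}),
    )
    return json.loads(response["body"].read())["embedding"]

def retrieve_chunks(conn, kb_id: str, question: str, top_k: int = 5) -> list[str]:
    """Cosine-distance search over a pgvector column (conn is a psycopg2 connection to Aurora)."""
    vector_literal = "[" + ",".join(str(v) for v in embed_query(question)) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT original_text
            FROM kb_chunks                -- hypothetical table and column names
            WHERE kb_id = %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (kb_id, vector_literal, top_k),
        )
        return [row[0] for row in cur.fetchall()]

def answer(conn, kb_id: str, question: str, system_prompt: str) -> str:
    context = "\n\n".join(retrieve_chunks(conn, kb_id, question))
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",   # illustrative model ID
        system=[{"text": system_prompt}],
        messages=[{"role": "user",
                   "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```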

The following is a snippet of system prompt:

Support assistant specializing in PDI’s Logistics (PLC) platform, helping staff research and resolve support cases in Salesforce. You will assist with finding solutions, summarizing case information, and recommending appropriate next steps for resolution.

Professional, clear, technical when needed while maintaining accessible language.

Resolution Process:
Response Format template:
Handle Confidential Information:

Outcomes and next steps
By building this customized RAG solution on AWS, PDI realized the following benefits:

Flexible configuration options allow data ingestion at consumer-preferred frequencies.
Scalable design enables future ingestion from additional source systems through easily configurable crawlers.
Supports crawler configuration using multiple authentication methods, including username and password, secret key-value pairs, and API keys.
Customizable metadata fields enable advanced filtering and improve query performance.
Dynamic token management helps PDI intelligently balance tokens between content and summaries, enhancing user responses.
Consolidates diverse source data formats into a unified layout for streamlined storage and retrieval.

PDIQ provides key business outcomes that include:

Improved efficiency and resolution rates – The tool empowers PDI support teams to resolve customer queries significantly faster, often automating routine issues and providing immediate, precise responses. This has led to shorter customer wait times for case resolution and more productive agents.
High customer satisfaction and loyalty – By delivering accurate, relevant, and personalized answers grounded in live documentation and company knowledge, PDIQ increased customer satisfaction scores (CSAT), net promoter scores (NPS), and overall loyalty. Customers feel heard and supported, strengthening PDI brand relationships.
Cost reduction – PDIQ handles the bulk of repetitive queries, allowing limited support staff to focus on expert-level cases, which improves productivity and morale. Additionally, PDIQ is built on serverless architecture, which automatically scales while minimizing operational overhead and cost.
Business flexibility – A single platform can serve different business units, which can curate the content by configuring their respective data sources.
Incremental value – Each new content source adds measurable value without system redesign.

PDI continues to enhance the application with several planned improvements in the pipeline, including:

Build additional crawler configuration for new data sources (for example, GitHub).
Build agentic implementation for PDIQ to be integrated into larger complex business processes.
Enhanced document understanding with table extraction and structure preservation.
Multilingual support for global operations.
Improved relevance ranking with hybrid retrieval techniques.
Ability to invoke PDIQ based on events (for example, source commits).

Conclusion
The PDIQ service has transformed how users access and use enterprise knowledge at PDI Technologies. By using Amazon serverless services, PDIQ can automatically scale with demand, reduce operational overhead, and optimize costs. The solution’s unique approach to document processing, including the dynamic token management and the custom image captioning system, represents significant technical innovation in enterprise RAG systems. The architecture successfully balances performance, cost, and scalability while maintaining security and authentication requirements. As PDI Technologies continues to expand PDIQ’s capabilities, the team is excited to see how this architecture can adapt to new sources, formats, and use cases.

About the authors
Samit Kumbhani is an Amazon Web Services (AWS) Senior Solutions Architect in the New York City area with over 18 years of experience. He currently partners with independent software vendors (ISVs) to build highly scalable, innovative, and secure cloud solutions. Outside of work, Samit enjoys playing cricket, traveling, and biking.
Jhorlin De Armas is an Architect II at PDI Technologies, where he leads the design of AI-driven platforms on Amazon Web Services (AWS). Since joining PDI in 2024, he has architected a compositional AI service that enables configurable assistants, agents, knowledge bases, and guardrails using Amazon Bedrock, Aurora Serverless, AWS Lambda, and DynamoDB. With over 18 years of experience building enterprise software, Jhorlin specializes in cloud-centered architectures, serverless platforms, and AI/ML solutions.
David Mbonu is a Sr. Solutions Architect at Amazon Web Services (AWS), helping horizontal business application ISV customers build and deploy transformational solutions on AWS. David has over 27 years of experience in enterprise solutions architecture and system engineering across software, FinTech, and public cloud companies. His recent interests include AI/ML, data strategy, observability, resiliency, and security. David and his family reside in Sugar Hill, GA.

How CLICKFORCE accelerates data-driven advertising with Amazon Bedrock …

CLICKFORCE is one of the leaders in digital advertising services in Taiwan, specializing in data-driven advertising and conversion (D4A – Data for Advertising & Action). With a mission to deliver industry-leading, trend-aligned, and innovative marketing solutions, CLICKFORCE helps brands, agencies, and media partners make smarter advertising decisions.
However, as the advertising industry rapidly evolves, traditional analysis methods and generic AI outputs are no longer sufficient to provide actionable insights. To remain competitive, CLICKFORCE turned to AWS to build Lumos, a next-generation AI-driven marketing analysis solution powered by Amazon Bedrock, Amazon SageMaker AI, Amazon OpenSearch Service, and AWS Glue.
In this post, we demonstrate how CLICKFORCE used AWS services to build Lumos and transform advertising industry analysis from weeks-long manual work into an automated, one-hour process.
Digital advertising challenges
Before adopting Amazon Bedrock, CLICKFORCE faced several roadblocks in building actionable intelligence for digital advertising. Large language models (LLMs) tend to produce generic recommendations rather than actionable industry-specific intelligence. Without an understanding of the advertising environment, these models didn’t have the industry context needed to align their suggestions with actual industry realities.
Another significant challenge was the absence of integrated internal datasets, which weakened the reliability of outputs and increased the risk of hallucinated or inaccurate insights. At the same time, marketing teams relied on disconnected tools and techniques such as vibe coding, without standardized architectures or workflows, making the processes difficult to maintain and scale.
Preparing a comprehensive industry analysis report was also a time-consuming process, typically requiring between two and six weeks. The timeline stemmed from multiple labor-intensive stages: one to three days to define objectives and set the research plan, one to four weeks to gather and validate data from different sources, one to two weeks to conduct statistical analysis and build charts, one to two weeks to extract strategic insights, and finally three to seven days to draft and finalize the report. Each stage often required back-and-forth coordination across teams, which further extended the timeline. As a result, marketing strategies were frequently delayed and based more on intuition than timely, data-backed insights.
Solution overview
To address these challenges, CLICKFORCE built Lumos, an integrated AI-powered industry analysis service, using AWS services.
The solution is designed around Amazon Bedrock Agents for contextualized reasoning and Amazon SageMaker AI for fine-tuning Text-to-SQL accuracy. CLICKFORCE chose Amazon Bedrock because it provides managed access to foundation models without the need to build or maintain infrastructure, while also offering agents that can orchestrate multi-step tasks and integrate with enterprise data sources through Knowledge Bases. This allowed the team to ground insights in real, verifiable data, minimize hallucinations, and quickly experiment with different models, while also reducing operational overhead and accelerating time-to-market.

The first step was to build a unified AI agent using Amazon Bedrock. End-users interact with a chatbot interface that runs on Amazon ECS, developed with Streamlit and fronted by an Application Load Balancer. When a user submits a query, it is routed to an AWS Lambda function that invokes an Amazon Bedrock agent. The agent retrieves relevant information from an Amazon Bedrock knowledge base, which is built from source documents (such as campaign reports, product descriptions, and industry analysis files) hosted in Amazon S3. These documents are automatically converted into vector embeddings and indexed in Amazon OpenSearch Service. By grounding model responses in this curated document set, CLICKFORCE made sure that outputs were contextualized, reduced hallucinations, and aligned with real-world advertising data.
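A minimal sketch of the Lambda handler that forwards a user query to a Bedrock agent could look like the following; the agent ID, alias ID, and event shape are placeholders, not CLICKFORCE's actual configuration.

```python
import boto3
import uuid

agent_runtime = boto3.client("bedrock-agent-runtime")

def lambda_handler(event, context):
    """Forward a chatbot query to a Bedrock agent and return the streamed answer."""
    response = agent_runtime.invoke_agent(
        agentId="AGENT_ID_PLACEHOLDER",          # placeholder
        agentAliasId="ALIAS_ID_PLACEHOLDER",     # placeholder
        sessionId=event.get("session_id", str(uuid.uuid4())),
        inputText=event["query"],
    )
    # invoke_agent streams the answer back as chunk events.
    answer = "".join(
        part["chunk"]["bytes"].decode("utf-8")
        for part in response["completion"]
        if "chunk" in part
    )
    return {"statusCode": 200, "body": answer}
```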
Next, CLICKFORCE made the workflows more action-oriented by using Text-to-SQL requests. When queries required data retrieval, the Bedrock Agent generated JSON schemas via the Agent Actions API Schema. These were passed to Lambda executor functions that translated the requests into SQL queries. With AWS Glue crawlers continuously updating SQL databases from CSV files in Amazon S3, analysts were able to run precise queries on campaign performance, audience behaviors, and competitive benchmarks.
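The executor's query step can be sketched as follows, assuming the Glue catalog tables are queried through Amazon Athena; the post does not name the query engine, so Athena, the database name, and the results bucket below are assumptions. The agent-generated SQL string would be passed in as sql.

```python
import boto3
import time

athena = boto3.client("athena")

def run_sql(sql: str,
            database: str = "clickforce_ads",                      # hypothetical Glue database
            output: str = "s3://athena-results-placeholder/") -> list[dict]:
    """Run a SQL statement against the Glue catalog and return the rows as dicts."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]

    # Poll until the query finishes.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {qid} ended in state {state}")

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    header = [c["VarCharValue"] for c in rows[0]["Data"]]
    return [dict(zip(header, [c.get("VarCharValue") for c in row["Data"]])) for row in rows[1:]]
```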
Finally, the company improved accuracy by incorporating Amazon SageMaker and MLflow into the development workflow. Initially, CLICKFORCE relied on foundation models for Text-to-SQL translation but found them to be inflexible and often inaccurate. By using SageMaker, the team processed data, evaluated different approaches, and tuned the overall Text-to-SQL pipeline. Once validated, the optimized pipeline was deployed through AWS Lambda functions and integrated back into the agent, making sure that improvements flowed directly into the Lumos application. With MLflow providing experiment tracking and evaluation, the cycle of data processing, pipeline tuning, and deployment became streamlined, allowing Lumos to achieve higher precision in query generation and deliver automated, data-driven marketing reports.
Results
The impact of adopting Amazon Bedrock Agents and SageMaker AI has been transformative for CLICKFORCE. Industry analysis that previously required two to six weeks can now be completed in under one hour, dramatically accelerating decision-making. The company also reduced its reliance on third-party industry research reports, which resulted in a 47 percent reduction in operational costs.
In addition to time and cost savings, the Lumos system has extended scalability across roles within the marketing environment. Brand owners, agencies, analysts, marketers, and media partners can now independently generate insights without waiting for centralized analyst teams. This autonomy has led to greater agility across campaigns. Moreover, by grounding outputs in both internal datasets and industry-specific context, Lumos significantly reduced the risk of hallucinations and made sure that insights aligned more closely with industry realities.

Users can generate industry analysis reports through natural language conversations and iteratively refine the content by continuing the dialogue.

These visual reports, generated through the Lumos system powered by Amazon Bedrock Agents and SageMaker AI, showcase the platform’s ability to produce comprehensive market intelligence within minutes. The charts illustrate brand sales distribution and retail and e-commerce performance, demonstrating how AI-driven analytics automate data aggregation, visualization, and insight generation with high precision and efficiency.
Conclusion
CLICKFORCE’s Lumos system represents a breakthrough in how digital marketing decisions are made. By combining Amazon Bedrock Agents, Amazon SageMaker AI, Amazon OpenSearch Service, and AWS Glue, CLICKFORCE transformed its industry analysis workflow from a slow, manual process into a fast, automated, and reliable system. In this post, we demonstrated how CLICKFORCE used these AWS services to build Lumos and transform advertising industry analysis from weeks-long manual work into an automated, one-hour process.

About the Authors
Ray Wang is a Senior Solutions Architect at AWS. With 12+ years of experience as a backend engineer and consultant, Ray is dedicated to building modern solutions in the cloud, especially in NoSQL, big data, machine learning, and generative AI. As a hungry go-getter, he passed all 12 AWS certifications to increase the breadth and depth of his technical knowledge. He loves to read and watch sci-fi movies in his spare time.
Shanna Chang is a Solutions Architect at AWS. She focuses on observability in modern architectures and cloud-native monitoring solutions. Before joining AWS, she was a software engineer.

FlashLabs Researchers Release Chroma 1.0: A 4B Real Time Speech Dialog …

Chroma 1.0 is a real time speech to speech dialogue model that takes audio as input and returns audio as output while preserving the speaker identity across multi turn conversations. It is presented as the first open source end to end spoken dialogue system that combines low latency interaction with high fidelity personalized voice cloning from only a few seconds of reference audio.

The model operates directly on discrete speech representations rather than on text transcripts. It targets the same use cases as commercial real time agents, but with a compact 4B parameter dialogue core and a design that treats speaker similarity as a primary objective, not as an auxiliary feature. Chroma achieves a reported 10.96% relative improvement in speaker similarity over a human baseline and reaches a Real Time Factor (RTF) of 0.43, so it can generate speech more than 2 times faster than playback.

https://arxiv.org/pdf/2601.11141

From cascaded ASR LLM TTS to end to end S2S

Most production assistants still use a three stage pipeline, automatic speech recognition to convert audio to text, a large language model for reasoning, and text to speech synthesis. This structure is flexible but it introduces latency and loses paralinguistic information such as timbre, emotion, speaking rate and prosody once the system collapses audio to text. In real time dialogue this loss of acoustic detail directly hurts speaker fidelity and naturalness.

Chroma follows the newer class of speech to speech systems that map between sequences of codec tokens. A speech tokenizer and neural codec produce quantized acoustic codes. A language model then reasons and responds over a sequence that interleaves text tokens and audio codes, without an explicit intermediate transcript. This keeps the model conditioned on prosody and speaker identity during the whole processing chain.

Architecture, Reasoner + speech generation stack

Chroma 1.0 has two main subsystems. The Chroma Reasoner handles multimodal understanding and text generation. The speech stack, Chroma Backbone, Chroma Decoder and Chroma Codec Decoder, converts that semantic output into personalized response audio.

The Chroma Reasoner is built on the Thinker module from the Qwen-omni series and uses the Qwen2 Audio encoding pipeline. It processes text and audio inputs with shared front ends, fuses them with cross modal attention, and aligns them over time using Time aligned Multimodal Rotary Position Embedding (TM-RoPE). The output is a sequence of hidden states that carry both linguistic content and acoustic cues, for example rhythm and emphasis.

https://arxiv.org/pdf/2601.11141

The Chroma Backbone is a 1B parameter LLaMA style model based on Llama3. It is conditioned on the target voice using CSM-1B, which encodes a short reference audio clip and its transcript into embedding prompts that are prepended to the sequence. During inference, token embeddings and hidden states from the Reasoner are fed as unified context, so the Backbone always sees the semantic state of the dialogue while it generates acoustic codes.

To support streaming, the system uses a fixed 1 to 2 interleaving schedule. For every text token from the Reasoner, the Backbone produces 2 audio code tokens. This allows the model to start emitting speech as soon as text generation begins and avoids waiting for full sentences. This interleaving is the main mechanism behind the low Time to First Token.
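The scheduling idea can be illustrated with a toy generator that emits two audio codes for every text token; the next_audio_code stub stands in for the real Backbone and is not part of Chroma's released code.

```python
from typing import Iterable, Iterator

def interleave_stream(text_tokens: Iterable[int],
                      audio_codes_per_text: int = 2) -> Iterator[tuple]:
    """Toy illustration of a fixed 1-to-2 text-to-audio interleaving schedule.

    For every text token produced by the reasoner, the backbone is asked for
    `audio_codes_per_text` audio code tokens, so speech can start streaming as
    soon as the first text token exists."""
    def next_audio_code(context: list) -> int:
        return len(context)  # placeholder, not a real model call

    emitted: list = []
    for text_token in text_tokens:
        emitted.append(("text", text_token))
        yield ("text", text_token)
        for _ in range(audio_codes_per_text):
            code = next_audio_code(emitted)
            emitted.append(("audio", code))
            yield ("audio", code)

# Example: 3 text tokens produce 6 interleaved audio codes.
print(list(interleave_stream([101, 102, 103])))
```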

The Chroma Decoder is a lightweight LLaMA variant with about 100M parameters. The Backbone predicts only the first Residual Vector Quantization codebook per frame, which is a coarse representation. The Decoder then takes the Backbone hidden state and the first code and autoregressively predicts the remaining RVQ levels inside the same frame. This factorization keeps long context temporal structure in the Backbone and restricts the Decoder to frame local refinement, which reduces compute and improves detailed prosody and articulation.

The Chroma Codec Decoder concatenates the coarse and refined codes and maps them to waveform samples. It follows the decoder design of the Mimi vocoder and uses a causal convolutional neural network so that each output sample depends only on past context, which is required for streaming. The system uses 8 codebooks, which cuts the number of autoregressive refinement steps for the Decoder while preserving enough detail for voice cloning.

Training setup and synthetic speech to speech (S2S) data

High quality speech dialogue data with strong reasoning signals is scarce. Chroma therefore uses a synthetic speech to speech (S2S) pipeline. A Reasoner like LLM first produces textual answers for user questions. A Text to Speech (TTS) system then synthesizes target speech that matches the timbre of the reference audio for those answers. These synthetic pairs train the Backbone and Decoder to perform acoustic modeling and voice cloning. The Reasoner stays frozen and acts as a provider of text embeddings and multimodal hidden states.

Voice cloning quality and comparison with existing systems

Objective evaluation uses the SEED-TTS-EVAL protocol on English CommonVoice speakers. Chroma operates at 24 kHz sampling rate and achieves a Speaker Similarity score of 0.81. The human baseline is 0.73. CosyVoice-3 reaches 0.72 and most other TTS baselines lie below the human reference. The research team report this as a 10.96% relative improvement over the human baseline, which indicates that the model captures fine paralinguistic details more consistently than human recordings in this metric.

https://arxiv.org/pdf/2601.11141

Subjective evaluation compares Chroma with the ElevenLabs eleven_multilingual_v2 model. In naturalness CMOS, listeners prefer ElevenLabs 57.2% of the time versus 24.4% for Chroma, with 18.3% ties. In speaker similarity CMOS, the scores are very close, 42.4% for ElevenLabs and 40.6% for Chroma, with 17.0% ties. A follow up test asking which audio sounds more natural between ElevenLabs and the original recordings yields 92.0% preference for ElevenLabs versus 8.0% for ground truth, which shows that perceived naturalness and speaker fidelity are not aligned.

Latency and real-time behavior

Latency is measured with one concurrent stream. For a 38.80 second response, the total generation time is 16.58 seconds, which gives a Real Time Factor (RTF) of 0.43. The Reasoner contributes 119.12 ms TTFT, the Backbone 8.48 ms and the Decoder 19.27 ms per frame on average. The Codec Decoder works on groups of 4 frames so TTFT does not apply to that component. The overall Time to First Token is 146.87 ms, which is well under one second and suitable for interactive dialogue.
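These figures are consistent with each other, as the quick check below shows using only the numbers reported above.

```python
# Real Time Factor = generation time / audio duration; values from the paper.
generation_time_s = 16.58
audio_duration_s = 38.80
rtf = generation_time_s / audio_duration_s      # ~0.43, i.e. more than 2x faster than playback

# Overall Time to First Token = Reasoner + Backbone + Decoder contributions.
ttft_ms = 119.12 + 8.48 + 19.27                 # ~146.87 ms
print(round(rtf, 2), round(ttft_ms, 2))
```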

https://arxiv.org/pdf/2601.11141

Spoken dialogue and reasoning benchmarks

Chroma is evaluated on the basic track of URO Bench. It uses only 4B parameters yet achieves an overall task accomplishment score of 57.44%. GLM-4 Voice, a 9B parameter model, leads with 69.09%. Chroma ranks second overall and outperforms several 7B and 0.5B omni baselines on many dimensions. It reaches 71.14% on Storal, 51.69% on TruthfulQA and 22.74% on GSM8K. For oral conversation metrics it attains the highest scores on MLC at 60.26% and on CommonVoice at 62.07%.

https://arxiv.org/pdf/2601.11141

Critically, Chroma is the only model in this comparison that supports personalized voice cloning. All other systems focus on spoken dialogue and reasoning only. This means Chroma provides competitive cognitive capability while also performing high fidelity voice personalization in real time.

Key Takeaways

End to end real time speech to speech: Chroma 1.0 is a 4B parameter spoken dialogue model that maps speech to speech directly using codec tokens. It avoids explicit ASR and TTS stages and preserves prosody and speaker identity through the whole pipeline.

Reasoner plus speech stack architecture: The system combines a Qwen-based Chroma Reasoner with a 1B LLaMA style Backbone, a 100M Chroma Decoder and a Mimi based Codec Decoder. It uses RVQ codebooks and an interleaved 1 to 2 text to audio token schedule to support streaming and low Time to First Token.

Strong personalized voice cloning: On SEED-TTS-EVAL with CommonVoice speakers, Chroma reaches a Speaker Similarity score of 0.81 at 24 kHz, reported as a 10.96 percent relative improvement over the human baseline of 0.73, and outperforms CosyVoice-3 and other TTS baselines.

Sub second latency and faster than real time generation: Single stream inference on an H200 GPU yields an overall Time to First Token of about 147 ms. For a 38.80 second response the model generates audio in 16.58 seconds, resulting in a Real Time Factor of 0.43, which is more than 2 times faster than playback.

Competitive dialogue and reasoning with cloning as a unique feature: On URO Bench basic track, Chroma attains 57.44 percent overall task accomplishment and competitive scores on Storal, TruthfulQA, GSM8K, MLC and CommonVoice.

Check out the Paper, Model Weights, Project and Playground.
The post FlashLabs Researchers Release Chroma 1.0: A 4B Real Time Speech Dialogue Model With Personalized Voice Cloning appeared first on MarkTechPost.

Inworld AI Releases TTS-1.5 For Realtime, Production Grade Voice Agent …

Inworld AI has introduced Inworld TTS-1.5, an upgrade to its TTS-1 family that targets realtime voice agents with strict constraints on latency, quality, and cost. TTS-1.5 is described as the top ranked text to speech system on Artificial Analysis and is designed to be more expressive and more stable than prior generations while remaining suitable for large scale consumer deployments.

Realtime latency for interactive agents

TTS-1.5 focuses on P90 time to first audio latency, which is a critical metric for user perceived responsiveness. For TTS-1.5 Max, P90 time to first audio is below 250 ms. For TTS-1.5 Mini, P90 time to first audio is below 130 ms. These values are about 4 times faster than the prior TTS generation according to Inworld.

The TTS-1.5 stack supports streaming over WebSocket so synthesis and playback can start as soon as the first audio chunk is generated. In practice this keeps end to end interaction latency in the same range as typical realtime language model responses when models run on modern GPUs, which is important when TTS is part of a full agent pipeline.

Inworld recommends TTS-1.5 Max for most applications because it balances latency near 200 ms with higher stability and quality. TTS-1.5 Mini is positioned for latency sensitive workloads such as real time gaming or ultra responsive voice agents where every millisecond is important.

Expression, stability and benchmark position

TTS-1.5 builds on TTS-1 and it delivers about 30 percent more expressive range and about 40 percent better stability than the earlier models.

Here expression refers to features such as prosody, emphasis, and emotional variation. Stability is measured by metrics such as word error rate and output consistency across long sequences and varied prompts. The lower word error rate reduces issues such as truncated sentences, unintended word substitutions, or artifacts, which is important when TTS output is driven directly from generated language model text.

Pricing and cost profile at consumer scale

TTS-1.5 is priced with two main configurations. Inworld TTS-1.5 Mini costs 5 dollars per 1 million characters, which is about 0.005 dollars per minute of speech. TTS-1.5 Max costs 10 dollars per 1 million characters, which is about 0.01 dollars per minute.
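A back-of-envelope check of the per-minute figures, assuming roughly 1,000 characters of input text per minute of synthesized speech; the characters-per-minute rate is an assumption for illustration, not an Inworld figure.

```python
# Rough characters-per-minute assumption used only to reproduce the quoted per-minute costs.
CHARS_PER_MINUTE = 1_000

def cost_per_minute(usd_per_million_chars: float) -> float:
    return usd_per_million_chars / 1_000_000 * CHARS_PER_MINUTE

print(cost_per_minute(5.0))    # TTS-1.5 Mini -> ~0.005 USD per minute
print(cost_per_minute(10.0))   # TTS-1.5 Max  -> ~0.01 USD per minute
```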

This cost profile makes it feasible to run TTS continuously in high usage products such as voice native companions, education platforms, or customer support lines without TTS becoming the dominant variable cost.

Multilingual support, voice cloning and deployment options

Inworld TTS-1.5 supports 15 languages. The list includes English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. This allows a single TTS pipeline to cover a wide set of markets without separate models per region.

The system provides instant voice cloning and professional voice cloning. Instant voice cloning can create a custom voice from about 15 seconds of audio and is exposed directly in the Inworld portal and through API. Professional voice cloning uses at least 30 minutes of clean audio, with 20 minutes or more recommended for best results, and targets branded voices and less common accents.

For deployment, TTS-1.5 is available as a cloud API and also as an on prem solution, where the full model runs inside the customer infrastructure for data sovereignty and compliance. The same quality profile is maintained across both deployment modes, and the models integrate with partner platforms such as LiveKit, Pipecat, and Vapi for end to end voice agent stacks.

Key Takeaways

Inworld TTS 1.5 delivers realtime performance, with P90 time to first audio under 250 ms for the Max model and under 130 ms for the Mini model, about 4 times faster than the prior generation.

The model increases expressiveness by about 30 percent and improves stability with about 40 percent lower word error rate.

Pricing is optimized for consumer scale: TTS 1.5 Mini costs about 5 dollars per 1 million characters and TTS 1.5 Max costs about 10 dollars per 1 million characters, which is significantly cheaper per minute than many competing systems.

TTS 1.5 supports 15 languages and offers instant and professional voice cloning, enabling custom and branded voices from short reference audio or longer recorded datasets.

The system is available as a cloud API and as an on prem deployment, and integrates with existing voice agent stacks, which makes it suitable for production realtime agents that require explicit guarantees on latency, quality, and data control.

Check out the Technical details.
The post Inworld AI Releases TTS-1.5 For Realtime, Production Grade Voice Agents appeared first on MarkTechPost.

Salesforce AI Introduces FOFPred: A Language-Driven Future Optical Flo …

The Salesforce AI research team present FOFPred, a language driven future optical flow prediction framework that connects large vision language models with diffusion transformers for dense motion forecasting in control and video generation settings. FOFPred takes one or more images and a natural language instruction such as ‘moving the bottle from right to left’ and predicts 4 future optical flow frames that describe how every pixel is expected to move over time.

https://arxiv.org/pdf/2601.10781

Future optical flow as a motion representation

Optical flow is the apparent per pixel displacement between two frames. FOFPred focuses on future optical flow, which means predicting dense displacement fields for future frames given only current observations and text, without access to future images at inference.

Future optical flow is a compact motion only representation. It removes static appearance and keeps only pixel level motion, so it is well suited as an intermediate state for robot control policies and as a conditioning signal for video diffusion models. Compared to predicting future RGB frames, it reduces the complexity of the output distribution and avoids modeling textures and high frequency details that are not required for motion planning.

To plug into existing latent diffusion infrastructure, the research team encode optical flow as RGB images. They map flow magnitude and direction from polar form into HSV channels, then convert to RGB. The scaling of each channel is tuned so that consecutive flow frames are visually smooth and resemble animated graphics. A standard Flux.1 variational autoencoder then encodes and decodes these flow images.
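The flow-to-RGB encoding can be sketched with the common HSV convention, mapping direction to hue and magnitude to value; the exact per-channel scaling tuned by the FOFPred team is not public, so the normalization below is illustrative.

```python
import cv2
import numpy as np

def flow_to_rgb(flow: np.ndarray) -> np.ndarray:
    """Encode a dense optical flow field of shape (H, W, 2) as an RGB image."""
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (angle * 180 / np.pi / 2).astype(np.uint8)   # hue from direction (OpenCV H in [0, 179])
    hsv[..., 1] = 255                                          # full saturation
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)  # value from magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

# A random flow field just to exercise the function.
rgb = flow_to_rgb(np.random.randn(64, 64, 2).astype(np.float32))
print(rgb.shape, rgb.dtype)   # (64, 64, 3) uint8
```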

Unified VLM Diffusion backbone

FOFPred uses a unified architecture that combines a frozen vision language model, a frozen VAE and a trainable diffusion transformer. The pipeline is:

Qwen2.5-VL is used as the vision language encoder to jointly encode the caption and visual inputs.

Flux.1 VAE encodes the input images and the training optical flow targets into latent tensors.

An OmniGen style diffusion transformer, DiT, takes projected visual and textual features as conditional inputs and generates latent future flow sequences.

Only the DiT and small MLP projectors are trained. The Qwen2.5-VL and Flux.1 weights stay frozen, which lets the model reuse image editing pretraining and multimodal reasoning ability from prior work. Temporal modeling is added by extending the RoPE positional encoding and attention blocks from two dimensional spatial positions to full spatio-temporal positions across input and output frame sequences. This gives full spatio-temporal attention without adding extra parameters, so the DiT can reuse OmniGen image pretraining directly.

https://arxiv.org/pdf/2601.10781

Training on noisy web videos with relative optical flow

The core model is trained on web scale human activity videos with paired captions. The research team uses the Something Something V2 dataset and the EgoDex egocentric manipulation dataset to obtain around 500,000 video caption pairs.

Training uses an end to end flow matching objective in latent space. Future optical flow sequences are first computed offline, then encoded by the VAE and used as targets in a flow matching diffusion loss for the DiT. During training the method also applies classifier free guidance on both text and visual conditions and masks some frames and viewpoints to improve robustness.

A critical contribution is the relative optical flow calculation used to build clean training targets from noisy egocentric videos. For each frame pair the method:

Computes dense optical flow with an off the shelf estimator.

Estimates camera motion via homography using deep features.

Uses projective geometry to subtract camera motion and obtain object centric relative flow vectors.

Filters frame pairs by selecting those where the top k percent flow magnitudes exceed a threshold, which focuses training on segments with meaningful motion.

These steps are run offline at lower resolution for efficiency, then recomputed at original resolution for the final targets. The ablation study shows that static frame targets or raw flow without camera motion removal harm downstream performance, while disentangled relative flow targets give the best results.
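A rough sketch of the camera-motion subtraction is shown below, using classical Farneback flow and ORB feature matching in place of the deep estimators used in the paper; only the overall structure (dense flow, homography estimation, subtraction) follows the described pipeline.

```python
import cv2
import numpy as np

def relative_flow(frame_prev: np.ndarray, frame_next: np.ndarray) -> np.ndarray:
    """Camera-motion-compensated ("relative") optical flow between two BGR frames."""
    gray_prev = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    gray_next = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)

    # 1. Dense optical flow between the two frames.
    flow = cv2.calcOpticalFlowFarneback(gray_prev, gray_next, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # 2. Estimate camera motion as a homography from sparse feature matches.
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(gray_prev, None)
    kp2, des2 = orb.detectAndCompute(gray_next, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # 3. Camera-induced flow: where each pixel would move under the homography alone.
    h, w = gray_prev.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    pts = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(pts, H).reshape(h, w, 2)
    camera_flow = warped - np.stack([xs, ys], axis=-1)

    # 4. Object-centric flow = total flow minus camera motion.
    return flow - camera_flow
```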

https://arxiv.org/pdf/2601.10781

Language driven robot manipulation

The first downstream use case is robot control. FOFPred is finetuned on robot video caption data to predict future optical flow from both fixed and wrist mounted cameras. On top of FOFPred, the research team attach a diffusion policy network that takes predicted flow, text and robot state, and outputs continuous actions. This setup follows prior diffusion policy work but uses future optical flow instead of predicted RGB frames as the core representation.

On the CALVIN ABCD benchmark, which evaluates long horizon zero shot chains of 5 language specified manipulation tasks, FOFPred reaches an average chain length of 4.48. VPP reaches 4.33 and DreamVLA reaches 4.44 under the same protocol. FOFPred also attains a Task 5 success rate of 78.7 percent, which is the best among reported methods. In a low data setting with 10 percent of CALVIN demonstrations, FOFPred still reaches 3.43 average length, higher than the 3.25 of VPP.

On RoboTwin 2.0, a dual arm manipulation benchmark with 5 tasks that require both arms, FOFPred attains an average success rate of 68.6 percent. The VPP baseline reaches 61.8 percent under identical training settings. FOFPred improves success on every task in the subset.

https://arxiv.org/pdf/2601.10781

Motion aware text to video generation

The second downstream task is motion control in text to video generation. The research team build a two stage pipeline by connecting FOFPred with the Go with the Flow video diffusion model. FOFPred takes an initial frame and a language description of motion, predicts a sequence of future flow frames, and interpolates them into a dense motion field. Go with the Flow then uses this motion field and the initial frame to synthesize the final video, enforcing the described motion pattern.

On the motion heavy Something Something V2 benchmark, the combined FOFPred and Go with the Flow pipeline improves over the CogVideoX baseline under identical conditions. The method reaches SSIM 68.4, PSNR 22.26, LPIPS 28.5, FVD 75.39, KVD 11.38, and motion fidelity 0.662, which are consistently better than CogVideoX. Importantly, FOFPred only uses language and a single frame at inference, while several controllable video baselines require hand or object masks or trajectories as extra inputs.

https://arxiv.org/pdf/2601.10781

Key Takeaways

FOFPred reframes motion prediction as language driven future optical flow, predicting 4 dense optical flow frames from one or more current images and a text instruction, which provides a compact motion only representation for downstream tasks.

The model uses a unified VLM Diffusion backbone, with Qwen2.5-VL as a frozen vision language encoder, Flux.1-VAE as a frozen latent encoder for images and flow, and an OmniGen style DiT as the only trained component with spatio temporal RoPE based attention.

Training relies on large scale web and egocentric video from Something Something-V2 and EgoDex, and builds relative optical flow targets by estimating ego-motion via homography, subtracting camera flow and filtering for high motion segments, which significantly improves downstream performance.

In robot manipulation, FOFPred acts as a motion backbone for a diffusion policy head and achieves state of the art or better results on CALVIN ABCD and RoboTwin 2.0, including 4.48 average task chain length on CALVIN and 68.6 percent average success on RoboTwin, outperforming VPP and DreamVLA variants.

For text to video generation, connecting FOFPred to Go with the Flow yields better SSv2 metrics than CogVideoX, with higher SSIM and PSNR, lower FVD and KVD, and improved motion fidelity, while requiring only language and a single frame at inference, making FOFPred a reusable motion controller for both robotics and video synthesis pipelines.

Check out the Paper, Model and Repo.
The post Salesforce AI Introduces FOFPred: A Language-Driven Future Optical Flow Prediction Framework that Enables Improved Robot Control and Video Generation appeared first on MarkTechPost.

How Thomson Reuters built an Agentic Platform Engineering Hub with Ama …

This post was co-written with Naveen Pollamreddi and Seth Krause from Thomson Reuters.
Thomson Reuters (TR) is a leading AI and technology company dedicated to delivering trusted content and workflow automation solutions. With over 150 years of expertise, TR provides essential solutions across legal, tax, accounting, risk, trade, and media sectors in a fast-evolving world. AI plays a critical role at TR. It’s embedded in how it helps create, enhance, connect, and deliver trusted information to customers. It powers the products used by professionals around the world. AI at TR empowers professionals with professional-grade AI that clarifies complex challenges.
This blog post explains how TR’s Platform Engineering team, a geographically distributed unit overseeing TR’s service availability, boosted its operational productivity by transitioning from manual processes to an automated agentic system using Amazon Bedrock AgentCore.
Business challenge
Platform engineering teams face significant challenges in providing seamless, self-service experiences to their internal customers at scale for operational activities such as database management, information security and risk management (ISRM) operations, landing zone maintenance, infrastructure provisioning, secrets management, continuous integration and deployment (CI/CD) pipeline orchestration, and compliance automation. At TR, the Platform Engineering team supports multiple lines of business by providing essential cloud infrastructure and enablement services, including cloud account provisioning and database management. However, manual processes and the need for repeated coordination between teams for operational tasks created delays that slowed down innovation.
“Our engineers were spending considerable time answering the same questions and executing identical processes across different teams,” says Naveen Pollamreddi, Distinguished Engineer at TR. “We needed a way to automate these interactions while maintaining our security and compliance standards.”
Current state
The Platform Engineering team offers services to multiple product teams within TR, including Product Engineering and Service Management. These teams consume internal home-grown solutions as a service to build and run applications at scale on AWS. Over time, these services are offered not only as tools but also through TR’s internal processes, following Information Technology Infrastructure Library (ITIL) standards and using third party software as a service (SaaS) systems.
Some of these services rely on humans to execute a predefined list of steps that are repeated many times, creating a significant dependency on engineers to execute the same tasks repeatedly for multiple applications. Current processes are semi-automated and are:

Repetitive and labor intensive – Because of the nature of the workflows and multi-team engagement model, these operational processes tend to be labor intensive and repetitive. The Platform Engineering team spent a lot of time doing work that is undifferentiated heavy lifting.
Longer time to value – Because of process interdependencies, these operational workflows aren’t fully autonomous and take a long time to realize the value compared to fully automated processes.
Resource and cost intensive – Manual execution requires dedicated engineering resources whose time could be better spent on innovation rather than repetitive tasks. Each operational request consumes engineer hours across multiple teams for coordination, execution, and validation.

The Platform Engineering team is solving this problem by building autonomous agentic solutions that use specialized agents across multiple service domains and groups. The cloud account provisioning agent automates the creation and configuration of new cloud accounts according to internal standards, handling tasks such as setting up organizational units, applying security policies, and configuring baseline networking. The database patching agent manages the end-to-end database patching lifecycle, including version upgrades. Network service agents handle network configuration requests such as VPC setup, subnet allocation, and connectivity establishment between environments. Architecture review agents assist in evaluating proposed architectures against best practices, security requirements, and compliance standards, providing automated feedback and recommendations. AgentCore serves as the foundational orchestration layer for these agents, providing the core agentic capabilities that enable intelligent decision-making, natural language understanding, tool calling, and agent-to-agent (A2A) communication.
Solution overview
TR’s Platform Engineering team built this solution with scalability, extensibility, and security as core principles and designed it so that non-technical users can quickly create and deploy AI-powered automation. Aimed at a broad enterprise audience, the architecture is designed so that business users can interact with specialized agents through basic natural language requests without needing to understand the underlying technical complexity. TR chose Amazon Bedrock AgentCore because it provides the complete foundational infrastructure needed to build, deploy, and operate enterprise-grade AI agents at scale without having to build that infrastructure from scratch. The Platform Engineering team gained the flexibility to innovate with their preferred frameworks while designing their autonomous agents to operate with enterprise-level security, reliability, and scalability, critical requirements for managing production operational workflows at scale.
The following diagram illustrates the architecture of the solution:

TR built an AI-powered platform engineering hub using AgentCore. The solution consists of:

A custom web portal for more secure agent interactions
A central orchestrator agent that routes requests and manages interactions
Multiple service-specific agents handling specialized tasks such as AWS account provisioning and database patching
A human-in-the-loop validation service for sensitive operations

TR decided to use AgentCore because it helped their developers accelerate from prototype to production with fully managed services that minimize infrastructure complexity, and to build AI agents using different frameworks, models, and tools while maintaining complete control over how agents operate and integrate with their existing systems.
Solution workflow
The team used the following workflow to develop and deploy the agentic AI system.

Discovery and architecture planning: Evaluated existing AWS resources and code base to design a comprehensive solution incorporating AgentCore, focusing on service objectives and integration requirements.
Core development and migration: Developed a dual-track approach by migrating existing solutions to AgentCore while building TRACK (deployment engine), enabling rapid agent creation. Implemented a registry system as a modular bridge between the agent and the orchestrator.
System enhancement and deployment: Refined orchestrator functionality, developed an intuitive UX, and executed a team onboarding process for the new agentic system deployment.

Building the orchestrator agent
TR’s Platform Engineering team designed their orchestrator service, named Aether, as a modular system using the LangGraph Framework. The orchestrator retrieves context from their agent registry to determine the appropriate agent for each situation. When an agent’s actions are required, the orchestrator makes a tool call that programmatically populates data from the registry, helping prevent potential prompt injection attacks and facilitating more secure communication between endpoints.
To maintain conversation context while keeping the system stateless, the orchestrator integrates with the AgentCore Memory service capabilities at both conversation and user levels. Short-term memory maintains context within individual conversations, while long-term memory tracks user preferences and interaction patterns over time. This dual-memory approach allows the system to learn from past interactions and avoid repeating previous mistakes.
Service Agent Development Framework
The Platform Engineering team developed their own framework, TR-AgentCore-Kit (TRACK), to simplify agent deployment across the organization. TRACK, a homegrown solution, utilizes a customized version of the Bedrock AgentCore Starter Toolkit. The team customized this toolkit to meet TR’s specific compliance alignment requirements, which include asset identification standards and resource tagging standards. The framework handles connection to AgentCore Runtime, tool management, AgentCore Gateway connectivity, and baseline agent setup, so developers can focus on implementing business logic rather than dealing with infrastructure concerns. AgentCore Gateway provided a straightforward and more secure way for developers to build, deploy, discover, and connect to tools at scale. TRACK also handles the registration of service agents into the Aether environment by deploying agent cards into the custom-built A2A registry. TRACK maintains a seamless flow for developers by offering deployment to AWS and registration with the custom-built services in one package. By deploying the agent cards into the registry, an agent built by a service team can complete onboarding and become available from the overarching orchestrator.
Agent discovery and registration system
To enable seamless agent discovery and communication, TR implemented a custom A2A solution using Amazon DynamoDB and Amazon API Gateway. This system supports cross-account agent calls, which was essential for their modular architecture. The registration process occurs through the TRACK project, so that teams can register their agents directly with the orchestrator service. The A2A registry maintains a comprehensive history of agent versions for auditing purposes and requires human validation before allowing new agents into the production environment. This governance model facilitates conformance with TR’s ISRM standards while providing flexibility for future expansion.
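The registration step can be sketched as a single DynamoDB write of a versioned agent card; the table name, card fields, and approval status value below are assumptions rather than TR's actual schema.

```python
import boto3
import time

dynamodb = boto3.resource("dynamodb")
registry = dynamodb.Table("a2a-agent-registry")   # hypothetical table name

def register_agent_card(card: dict) -> None:
    """Write a versioned agent card into the registry so the orchestrator can discover it."""
    registry.put_item(Item={
        "agent_name": card["name"],
        "version": card["version"],
        "description": card["description"],
        "endpoint": card["endpoint"],          # cross-account invoke URL behind API Gateway
        "capabilities": card.get("capabilities", []),
        "registered_at": int(time.time()),
        "status": "pending_approval",          # human validation required before production use
    })

if __name__ == "__main__":
    register_agent_card({
        "name": "database-patching-agent",
        "version": "1.0.0",
        "description": "Manages the end-to-end database patching lifecycle.",
        "endpoint": "https://example.execute-api.us-east-1.amazonaws.com/prod/agents/db-patching",
        "capabilities": ["patch_database", "report_patch_status"],
    })
```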
Aether web portal integration
The team developed a web portal using React, hosted on Amazon Simple Storage Service (Amazon S3), to provide a more secure and intuitive interface for agent interactions. The portal authenticates users against TR’s enterprise single sign-on (SSO) and provides access to agent flows based on user permissions. This approach helps ensure that sensitive operations, such as AWS account provisioning or database patching, are only accessible to authorized personnel.
Human-in-the-loop validation service
The system includes Aether Greenlight, a validation service that makes sure critical operations receive appropriate human oversight. This service extends beyond basic requester approval, so that team members outside the initial conversation can participate in the validation process. The system maintains a complete audit trail of approvals and actions, supporting TR’s compliance requirements.
Outcome
By building a self-service agentic system on AgentCore, TR implemented autonomous agents that use AI orchestration to handle complex operational workflows end-to-end.
Productivity and efficiency

15-fold productivity gain through intelligent automation of routine tasks
70% automation rate achieved at first launch, dramatically reducing manual workload
Continuous reliability with repeatable runbooks executed by agents around the clock

Speed and agility

Faster time to value: Accelerated product delivery by automating environment setup, policy enforcement, and day-to-day operations
Self-service workflows: Empowered teams with clear standards and paved-road tooling

Security and compliance

Stronger security posture: Applied guardrails and database patching by default
Human-in-the-loop approvals: Maintained oversight while automating verification of changes

Cost and resource optimization

Better cost efficiency: Automated infrastructure usage optimization
Strategic talent allocation: Freed engineering teams to focus on highest-priority, high-value work
Reduced operational toil: Removed repetitive tasks and variance through standardization

Developer experience

Improved satisfaction: Streamlined workflows with intuitive self-service capabilities
Consistent standards: Established repeatable patterns for other teams to adopt and scale

Conclusion
The agentic system described in this post establishes a replicable pattern that teams across the organization can use to adopt similar automation capabilities, creating a multiplier effect for operational excellence. The Aether project aims to enhance the experience of engineers by removing the need for manual execution of tasks that could be automated, supporting further innovation and creative thinking. As Aether continues to improve, the team hopes that the pattern will be adopted more broadly, assisting teams beyond Platform Engineering to achieve breakthrough productivity organization wide and solidifying TR as a front-runner in the age of artificial intelligence.
Using Amazon Bedrock AgentCore, TR transformed their platform engineering operations from manual processes to an AI-powered self-service hub. This approach not only improved efficiency but also strengthened security and compliance controls.
Ready to transform your platform engineering operations? Get started with the following resources:

Explore AgentCore
Explore AgentCore documentation
For additional use cases, explore notebook-based tutorials

About the Authors
Naveen Pollamreddi is a Distinguished Engineer in Thomson Reuters as part of the Platform Engineering team and drives the Agentic AI strategy for Cloud Infrastructure services.
Seth Krause is a Cloud Engineer on Thomson Reuters’ Platform Engineering Compute team. Since joining the company, he has contributed to architecting and implementing generative AI solutions that enhance productivity across the organization. Seth specializes in building cloud-based microservices with a current focus on integrating AI capabilities into enterprise workflows.
Pratip Bagchi is an Enterprise Solutions Architect at Amazon Web Services. He is passionate about helping customers to drive AI adoption and innovation to unlock business value and enterprise transformation.
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.