Vision-RAG vs Text-RAG: A Technical Comparison for Enterprise Search

Most RAG failures originate at retrieval, not generation. Text-first pipelines lose layout semantics, table structure, and figure grounding during PDF→text conversion, degrading recall and precision before an LLM ever runs. Vision-RAG—retrieving rendered pages with vision-language embeddings—directly targets this bottleneck and shows material end-to-end gains on visually rich corpora.

Pipelines (and where they fail)

Text-RAG. PDF → (parser/OCR) → text chunks → text embeddings → ANN index → retrieve → LLM. Typical failure modes: OCR noise, multi-column flow breakage, table cell structure loss, and missing figure/chart semantics—documented by table- and doc-VQA benchmarks created to measure exactly these gaps.

Vision-RAG. PDF → page raster(s) → VLM embeddings (often multi-vector with late-interaction scoring) → ANN index → retrieve → VLM/LLM consumes high-fidelity crops or full pages. This preserves layout and figure-text grounding; recent systems (ColPali, VisRAG, VDocRAG) validate the approach.

What current evidence supports

Document-image retrieval works and is simpler. ColPali embeds page images and uses late-interaction matching; on the ViDoRe benchmark it outperforms modern text pipelines while remaining end-to-end trainable.

End-to-end lift is measurable. VisRAG reports 25–39% end-to-end improvement over text-RAG on multimodal documents when both retrieval and generation use a VLM.

Unified image format for real-world docs. VDocRAG shows that keeping documents in a unified image format (tables, charts, PPT/PDF) avoids parser loss and improves generalization; it also introduces OpenDocVQA for evaluation.

Resolution drives reasoning quality. High-resolution support in VLMs (e.g., Qwen2-VL/Qwen2.5-VL) is explicitly tied to SoTA results on DocVQA/MathVista/MTVQA; fidelity matters for ticks, superscripts, stamps, and small fonts.

Costs: vision context is often an order of magnitude heavier—because of tokens

Vision inputs inflate token counts via tiling, not necessarily per-token price. For GPT-4o-class models, total tokens ≈ base + (tile_tokens × tiles), so a 1–2 MP page can cost roughly 10× as much as a small text chunk. Anthropic recommends capping images at about 1.15 MP (~1.6k tokens) for responsiveness. By contrast, Google Gemini 2.5 Flash-Lite prices text, image, and video at the same per-token rate, but large images still consume many more tokens. Engineering implication: adopt selective fidelity (crop > downsample > full page).
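
To make the budget concrete, here is a minimal sketch of tile-based image token accounting. The constants (512 px tiles, 85 base tokens, 170 tokens per tile) follow commonly published GPT-4o-class guidance and are assumptions to check against your provider's current docs.

import math

def image_tokens(width_px: int, height_px: int,
                 tile_px: int = 512, base_tokens: int = 85,
                 tokens_per_tile: int = 170) -> int:
    # Assumed tile-based accounting: total ≈ base + tiles * tokens_per_tile.
    tiles = math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)
    return base_tokens + tiles * tokens_per_tile

# A ~2 MP page render vs. a small text chunk (~300 tokens).
page = image_tokens(1240, 1754)   # A4 page rendered at ~150 DPI -> 12 tiles
print(page, round(page / 300, 1))  # several times the tokens of a text chunk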

Design rules for production Vision-RAG

Align modalities across embeddings. Use encoders trained for text-image alignment (CLIP-family or VLM retrievers) and, in practice, dual-index: cheap text recall for coverage plus vision rerank for precision. ColPali’s late-interaction (MaxSim-style) scoring is a strong default for page images.
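
As an illustration of late-interaction scoring, the sketch below computes a ColBERT/ColPali-style MaxSim score between multi-vector query embeddings and page-patch embeddings; the shapes and normalization are assumptions for illustration, not the exact ColPali implementation.

import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim), page_emb: (num_patches, dim)
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    p = torch.nn.functional.normalize(page_emb, dim=-1)
    sim = q @ p.T                        # (num_query_tokens, num_patches)
    return sim.max(dim=-1).values.sum()  # MaxSim: best patch per query token, summed

# Toy example: 12 query tokens, 1024 page patches, 128-dim embeddings.
score = maxsim_score(torch.randn(12, 128), torch.randn(1024, 128))
print(float(score))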

Feed high-fidelity inputs selectively. Coarse-to-fine: run BM25/DPR, take top-k pages to a vision reranker, then send only ROI crops (tables, charts, stamps) to the generator. This preserves crucial pixels without exploding tokens under tile-based accounting.
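
A minimal sketch of that coarse-to-fine flow, with stub retrieval, reranking, and cropping functions standing in for real BM25/DPR, vision-reranker, and ROI-detection components (all names and structures are illustrative assumptions):

from typing import List, Dict

def text_recall(query: str, pages: List[Dict], k: int = 50) -> List[Dict]:
    # Stub lexical recall: keyword overlap on extracted page text (replace with BM25/DPR).
    q = set(query.lower().split())
    scored = [(len(q & set(p["text"].lower().split())), p) for p in pages]
    return [p for _, p in sorted(scored, key=lambda t: -t[0])[:k]]

def vision_rerank(query: str, candidates: List[Dict], k: int = 5) -> List[Dict]:
    # Stub: replace with a page-image retriever (e.g., MaxSim over patch embeddings).
    return candidates[:k]

def roi_crops(page: Dict) -> List[Dict]:
    # Stub: replace with a layout detector returning table/chart/stamp regions.
    return [{"page_id": page["id"], "bbox": bbox} for bbox in page.get("regions", [])]

def retrieve_context(query: str, pages: List[Dict]) -> List[Dict]:
    coarse = text_recall(query, pages)    # cheap, high recall
    fine = vision_rerank(query, coarse)   # precise, page images
    return [crop for page in fine for crop in roi_crops(page)]  # only the pixels that matter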

Engineer for real documents.
• Tables: if you must parse, use table-structure models (e.g., PubTables-1M/TATR); otherwise prefer image-native retrieval.
• Charts/diagrams: expect tick- and legend-level cues; resolution must retain these. Evaluate on chart-focused VQA sets.
• Whiteboards/rotations/multilingual: page rendering avoids many OCR failure modes; multilingual scripts and rotated scans survive the pipeline.
• Provenance: store page hashes and crop coordinates alongside embeddings to reproduce the exact visual evidence used in answers (see the sketch below).
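
For the provenance point above, a minimal sketch of recording reproducible visual evidence (a SHA-256 page hash plus crop coordinates) next to each embedding; the record layout is an assumption, not a fixed schema:

import hashlib, json

def evidence_record(page_bytes: bytes, page_id: str, bbox: tuple) -> dict:
    # bbox = (x0, y0, x1, y1) in rendered-page pixel coordinates.
    return {
        "page_id": page_id,
        "page_sha256": hashlib.sha256(page_bytes).hexdigest(),
        "crop_bbox": list(bbox),
    }

record = evidence_record(b"...rendered page PNG bytes...", "doc42_p3", (120, 540, 880, 910))
print(json.dumps(record, indent=2))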

Text-RAG vs. Vision-RAG, dimension by dimension:
• Ingest pipeline. Text-RAG: PDF → parser/OCR → text chunks → text embeddings → ANN. Vision-RAG: PDF → page render(s) → VLM page/crop embeddings (often multi-vector, late interaction) → ANN; ColPali is a canonical implementation.
• Primary failure modes. Text-RAG: parser drift, OCR noise, multi-column flow breakage, table structure loss, missing figure/chart semantics; benchmarks exist because these errors are common. Vision-RAG: preserves layout/figures; failures shift to resolution/tiling choices and cross-modal alignment; VDocRAG formalizes “unified image” processing to avoid parsing loss.
• Retriever representation. Text-RAG: single-vector text embeddings; rerank via lexical or cross-encoders. Vision-RAG: page-image embeddings with late interaction (MaxSim-style) capture local regions and improve page-level retrieval on ViDoRe.
• End-to-end gains. Text-RAG: baseline. Vision-RAG: +25–39% E2E on multimodal docs when both retrieval and generation are VLM-based (VisRAG).
• Where it excels. Text-RAG: clean, text-dominant corpora; low latency/cost. Vision-RAG: visually rich/structured docs (tables, charts, stamps, rotated scans, multilingual typography); unified page context helps QA.
• Resolution sensitivity. Text-RAG: not applicable beyond OCR settings. Vision-RAG: reasoning quality tracks input fidelity (ticks, small fonts); high-resolution document VLMs (e.g., the Qwen2-VL family) emphasize this.
• Cost model (inputs). Text-RAG: tokens ≈ characters; cheap retrieval contexts. Vision-RAG: image tokens grow with tiling (OpenAI base+tiles formula; Anthropic guidance ~1.15 MP ≈ ~1.6k tokens); even when per-token price is equal (Gemini 2.5 Flash-Lite), high-res pages consume far more tokens.
• Cross-modal alignment need. Text-RAG: not required. Vision-RAG: critical; text-image encoders must share geometry for mixed queries; ColPali/ViDoRe demonstrate effective page-image retrieval aligned to language tasks.
• Benchmarks to track. Text-RAG: DocVQA (doc QA), PubTables-1M (table structure) for parsing-loss diagnostics. Vision-RAG: ViDoRe (page retrieval), VisRAG (pipeline), VDocRAG (unified-image RAG).
• Evaluation approach. Text-RAG: IR metrics plus text QA; may miss figure-text grounding issues. Vision-RAG: joint retrieval+generation on visually rich suites (e.g., OpenDocVQA under VDocRAG) to capture crop relevance and layout grounding.
• Operational pattern. Text-RAG: one-stage retrieval; cheap to scale. Vision-RAG: coarse-to-fine (text recall → vision rerank → ROI crops to generator) keeps token costs bounded while preserving fidelity; tiling math and pricing inform budgets.
• When to prefer. Text-RAG: contracts/templates, code/wikis, normalized tabular data (CSV/Parquet). Vision-RAG: real-world enterprise docs with heavy layout/graphics; compliance workflows needing pixel-exact provenance (page hash + crop coords).
• Representative systems. Text-RAG: DPR/BM25 + cross-encoder rerank. Vision-RAG: ColPali (ICLR 2025) vision retriever; VisRAG pipeline; VDocRAG unified image framework.

When is Text-RAG still the right default?

Clean, text-dominant corpora (contracts with fixed templates, wikis, code)

Strict latency/cost constraints for short answers

Data already normalized (CSV/Parquet)—skip pixels and query the table store

Evaluation: measure retrieval + generation jointly

Add multimodal RAG benchmarks to your harness—e.g., M²RAG (multi-modal QA, captioning, fact-verification, reranking), REAL-MM-RAG (real-world multi-modal retrieval), and RAG-Check (relevance + correctness metrics for multi-modal context). These catch failure cases (irrelevant crops, figure-text mismatch) that text-only metrics miss.
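
A minimal sketch of evaluating retrieval and generation jointly rather than in isolation; the retriever, generator, and judging helpers are stubs you would replace with your own components:

def evaluate_joint(samples, retrieve, generate, judge_answer, judge_evidence):
    # samples: dicts with "question", "gold_answer", and "gold_page_ids".
    hits, answer_ok, evidence_ok = 0, 0, 0
    for s in samples:
        crops = retrieve(s["question"])
        hits += any(c["page_id"] in s["gold_page_ids"] for c in crops)
        answer = generate(s["question"], crops)
        answer_ok += judge_answer(answer, s["gold_answer"])
        evidence_ok += judge_evidence(crops, s)          # catches irrelevant crops
    n = len(samples)
    return {"retrieval_hit@k": hits / n, "answer_acc": answer_ok / n,
            "evidence_precision": evidence_ok / n}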

Summary

Text-RAG remains efficient for clean, text-only data. Vision-RAG is the practical default for enterprise documents with layout, tables, charts, stamps, scans, and multilingual typography. Teams that (1) align modalities, (2) deliver selective high-fidelity visual evidence, and (3) evaluate with multimodal benchmarks consistently get higher retrieval precision and better downstream answers—now backed by ColPali (ICLR 2025), VisRAG’s 25–39% E2E lift, and VDocRAG’s unified image-format results.

References:

https://arxiv.org/abs/2407.01449

https://github.com/illuin-tech/vidore-benchmark

https://huggingface.co/vidore

https://arxiv.org/abs/2410.10594

https://github.com/OpenBMB/VisRAG

https://huggingface.co/openbmb/VisRAG-Ret

https://arxiv.org/abs/2504.09795

https://openaccess.thecvf.com/content/CVPR2025/papers/Tanaka_VDocRAG_Retrieval-Augmented_Generation_over_Visually-Rich_Documents_CVPR_2025_paper.pdf

https://cvpr.thecvf.com/virtual/2025/poster/34926

https://vdocrag.github.io/

https://arxiv.org/abs/2110.00061

https://openaccess.thecvf.com/content/CVPR2022/papers/Smock_PubTables-1M_Towards_Comprehensive_Table_Extraction_From_Unstructured_Documents_CVPR_2022_paper.pdf

https://huggingface.co/datasets/bsmock/pubtables-1m

https://arxiv.org/abs/2007.00398

https://www.docvqa.org/datasets

https://qwenlm.github.io/blog/qwen2-vl/

https://arxiv.org/html/2409.12191v1

https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct

https://arxiv.org/abs/2203.10244

https://arxiv.org/abs/2504.05506

https://aclanthology.org/2025.findings-acl.978.pdf

https://arxiv.org/pdf/2504.05506

https://openai.com/api/pricing/

https://docs.claude.com/en/docs/build-with-claude/vision

https://docs.claude.com/en/docs/build-with-claude/token-counting

https://ai.google.dev/gemini-api/docs/pricing

https://arxiv.org/abs/2502.17297

https://openreview.net/forum?id=1oCZoWvb8i

https://github.com/NEUIR/M2RAG

https://arxiv.org/abs/2502.12342

https://aclanthology.org/2025.acl-long.1528/

https://aclanthology.org/2025.acl-long.1528.pdf

https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34

https://research.ibm.com/publications/real-mm-rag-a-real-world-multi-modal-retrieval-benchmark

https://arxiv.org/abs/2501.03995

https://platform.openai.com/docs/guides/images-vision


How to Master Advanced TorchVision v2 Transforms, MixUp, CutMix, and Modern CNN Training for State-of-the-Art Computer Vision?

In this tutorial, we explore advanced computer vision techniques using TorchVision’s v2 transforms, modern augmentation strategies, and powerful training enhancements. We walk through the process of building an augmentation pipeline, applying MixUp and CutMix, designing a modern CNN with attention, and implementing a robust training loop. By running everything seamlessly in Google Colab, we position ourselves to understand and apply state-of-the-art practices in deep learning with clarity and efficiency. Check out the FULL CODES here.

!pip install torch torchvision torchaudio --quiet
!pip install matplotlib pillow numpy --quiet

import torch
import torchvision
from torchvision import transforms as T
from torchvision.transforms import v2
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import requests
from io import BytesIO

print(f"PyTorch version: {torch.__version__}")
print(f"TorchVision version: {torchvision.__version__}")

We begin by installing the libraries and importing all the essential modules for our workflow. We set up PyTorch, TorchVision v2 transforms, and supporting tools like NumPy, PIL, and Matplotlib, so we are ready to build and test advanced computer vision pipelines. Check out the FULL CODES here.

class AdvancedAugmentationPipeline:
    def __init__(self, image_size=224, training=True):
        self.image_size = image_size
        self.training = training
        base_transforms = [
            v2.ToImage(),
            v2.ToDtype(torch.uint8, scale=True),
        ]
        if training:
            self.transform = v2.Compose([
                *base_transforms,
                v2.Resize((image_size + 32, image_size + 32)),
                v2.RandomResizedCrop(image_size, scale=(0.8, 1.0), ratio=(0.9, 1.1)),
                v2.RandomHorizontalFlip(p=0.5),
                v2.RandomRotation(degrees=15),
                v2.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
                v2.RandomGrayscale(p=0.1),
                v2.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
                v2.RandomPerspective(distortion_scale=0.1, p=0.3),
                v2.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
                v2.ToDtype(torch.float32, scale=True),
                v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])
        else:
            self.transform = v2.Compose([
                *base_transforms,
                v2.Resize((image_size, image_size)),
                v2.ToDtype(torch.float32, scale=True),
                v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])

    def __call__(self, image):
        return self.transform(image)

We define an advanced augmentation pipeline that adapts to both training and validation modes. We apply powerful TorchVision v2 transforms, such as cropping, flipping, color jittering, blurring, perspective, and affine transformations, during training, while keeping validation preprocessing simple with resizing and normalization. This way, we ensure that we enrich the training data for better generalization while maintaining consistent and stable evaluation. Check out the FULL CODES here.

class AdvancedMixupCutmix:
    def __init__(self, mixup_alpha=1.0, cutmix_alpha=1.0, prob=0.5):
        self.mixup_alpha = mixup_alpha
        self.cutmix_alpha = cutmix_alpha
        self.prob = prob

    def mixup(self, x, y):
        batch_size = x.size(0)
        lam = np.random.beta(self.mixup_alpha, self.mixup_alpha) if self.mixup_alpha > 0 else 1
        index = torch.randperm(batch_size)
        mixed_x = lam * x + (1 - lam) * x[index, :]
        y_a, y_b = y, y[index]
        return mixed_x, y_a, y_b, lam

    def cutmix(self, x, y):
        batch_size = x.size(0)
        lam = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) if self.cutmix_alpha > 0 else 1
        index = torch.randperm(batch_size)
        y_a, y_b = y, y[index]
        bbx1, bby1, bbx2, bby2 = self._rand_bbox(x.size(), lam)
        x[:, :, bbx1:bbx2, bby1:bby2] = x[index, :, bbx1:bbx2, bby1:bby2]
        lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (x.size()[-1] * x.size()[-2]))
        return x, y_a, y_b, lam

    def _rand_bbox(self, size, lam):
        W = size[2]
        H = size[3]
        cut_rat = np.sqrt(1. - lam)
        cut_w = int(W * cut_rat)
        cut_h = int(H * cut_rat)
        cx = np.random.randint(W)
        cy = np.random.randint(H)
        bbx1 = np.clip(cx - cut_w // 2, 0, W)
        bby1 = np.clip(cy - cut_h // 2, 0, H)
        bbx2 = np.clip(cx + cut_w // 2, 0, W)
        bby2 = np.clip(cy + cut_h // 2, 0, H)
        return bbx1, bby1, bbx2, bby2

    def __call__(self, x, y):
        if np.random.random() > self.prob:
            return x, y, y, 1.0
        if np.random.random() < 0.5:
            return self.mixup(x, y)
        else:
            return self.cutmix(x, y)


class ModernCNN(nn.Module):
    def __init__(self, num_classes=10, dropout=0.3):
        super(ModernCNN, self).__init__()
        self.conv1 = self._conv_block(3, 64)
        self.conv2 = self._conv_block(64, 128, downsample=True)
        self.conv3 = self._conv_block(128, 256, downsample=True)
        self.conv4 = self._conv_block(256, 512, downsample=True)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.attention = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.Sigmoid()
        )
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )

    def _conv_block(self, in_channels, out_channels, downsample=False):
        stride = 2 if downsample else 1
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.gap(x)
        x = torch.flatten(x, 1)
        attention_weights = self.attention(x)
        x = x * attention_weights
        return self.classifier(x)

We strengthen our training with a unified MixUp/CutMix module, where we stochastically blend images or patch-swap regions and compute label interpolation with the exact pixel ratio. We pair this with a modern CNN that stacks progressive conv blocks, applies global average pooling, and uses a learned attention gate before a dropout-regularized classifier, so we improve generalization while keeping inference straightforward. Check out the FULL CODES here.

class AdvancedTrainer:
    def __init__(self, model, device='cuda' if torch.cuda.is_available() else 'cpu'):
        self.model = model.to(device)
        self.device = device
        self.mixup_cutmix = AdvancedMixupCutmix()
        self.optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
        self.scheduler = optim.lr_scheduler.OneCycleLR(
            self.optimizer, max_lr=1e-2, epochs=10, steps_per_epoch=100
        )
        self.criterion = nn.CrossEntropyLoss()

    def mixup_criterion(self, pred, y_a, y_b, lam):
        return lam * self.criterion(pred, y_a) + (1 - lam) * self.criterion(pred, y_b)

    def train_epoch(self, dataloader):
        self.model.train()
        total_loss = 0
        correct = 0
        total = 0
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(self.device), target.to(self.device)
            data, target_a, target_b, lam = self.mixup_cutmix(data, target)
            self.optimizer.zero_grad()
            output = self.model(data)
            if lam != 1.0:
                loss = self.mixup_criterion(output, target_a, target_b, lam)
            else:
                loss = self.criterion(output, target)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()
            self.scheduler.step()
            total_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            if lam != 1.0:
                correct += (lam * predicted.eq(target_a).sum().item() +
                            (1 - lam) * predicted.eq(target_b).sum().item())
            else:
                correct += predicted.eq(target).sum().item()
        return total_loss / len(dataloader), 100. * correct / total

We orchestrate training with AdamW, OneCycleLR, and dynamic MixUp/CutMix so we stabilize optimization and boost generalization. We compute an interpolated loss when mixing, clip gradients for safety, and step the scheduler each batch, so we track loss/accuracy per epoch in a single tight loop. Check out the FULL CODES here.

def demo_advanced_techniques():
    batch_size = 16
    num_classes = 10
    sample_data = torch.randn(batch_size, 3, 224, 224)
    sample_labels = torch.randint(0, num_classes, (batch_size,))
    transform_pipeline = AdvancedAugmentationPipeline(training=True)
    model = ModernCNN(num_classes=num_classes)
    trainer = AdvancedTrainer(model)
    print("Advanced Deep Learning Tutorial Demo")
    print("=" * 50)
    print("\n1. Advanced Augmentation Pipeline:")
    augmented = transform_pipeline(Image.fromarray((sample_data[0].permute(1, 2, 0).numpy() * 255).astype(np.uint8)))
    print(f" Original shape: {sample_data[0].shape}")
    print(f" Augmented shape: {augmented.shape}")
    print(f" Applied transforms: Resize, Crop, Flip, ColorJitter, Blur, Perspective, etc.")
    print("\n2. MixUp/CutMix Augmentation:")
    mixup_cutmix = AdvancedMixupCutmix()
    mixed_data, target_a, target_b, lam = mixup_cutmix(sample_data, sample_labels)
    print(f" Mixed batch shape: {mixed_data.shape}")
    print(f" Lambda value: {lam:.3f}")
    print(f" Technique: {'MixUp' if lam > 0.7 else 'CutMix'}")
    print("\n3. Modern CNN Architecture:")
    model.eval()
    with torch.no_grad():
        output = model(sample_data)
    print(f" Input shape: {sample_data.shape}")
    print(f" Output shape: {output.shape}")
    print(f" Features: Residual blocks, Attention, Global Average Pooling")
    print(f" Parameters: {sum(p.numel() for p in model.parameters()):,}")
    print("\n4. Advanced Training Simulation:")
    dummy_loader = [(sample_data, sample_labels)]
    loss, acc = trainer.train_epoch(dummy_loader)
    print(f" Training loss: {loss:.4f}")
    print(f" Training accuracy: {acc:.2f}%")
    print(f" Learning rate: {trainer.scheduler.get_last_lr()[0]:.6f}")
    print("\n Tutorial completed successfully!")
    print("This code demonstrates state-of-the-art techniques in deep learning:")
    print("• Advanced data augmentation with TorchVision v2")
    print("• MixUp and CutMix for better generalization")
    print("• Modern CNN architecture with attention")
    print("• Advanced training loop with OneCycleLR")
    print("• Gradient clipping and weight decay")


if __name__ == "__main__":
    demo_advanced_techniques()

We run a compact end-to-end demo where we visualize our augmentation pipeline, apply MixUp/CutMix, and double-check the ModernCNN with a forward pass. We then simulate one training epoch on dummy data to verify loss, accuracy, and learning-rate scheduling, so we confirm the full stack works before scaling to a real dataset.

In conclusion, we have successfully developed and tested a comprehensive workflow that integrates advanced augmentations, innovative CNN design, and modern training strategies. By experimenting with TorchVision v2, MixUp, CutMix, attention mechanisms, and OneCycleLR, we not only strengthen model performance but also deepen our understanding of cutting-edge techniques.



Alibaba’s Qwen3-Max: Production-Ready Thinking Mode, 1T+ Parameters, and Day-One Coding/Agentic Bench Signals

Alibaba has released Qwen3-Max, a trillion-parameter Mixture-of-Experts (MoE) model positioned as its most capable foundation model to date, with an immediate public on-ramp via Qwen Chat and Alibaba Cloud’s Model Studio API. The launch moves Qwen’s 2025 cadence from preview to production and centers on two variants: Qwen3-Max-Instruct for standard reasoning/coding tasks and Qwen3-Max-Thinking for tool-augmented “agentic” workflows.

What’s new at the model level?

Scale & architecture: Qwen3-Max crosses the 1-trillion-parameter mark with an MoE design (sparse activation per token). Alibaba positions the model as its largest and most capable to date; public briefings and coverage consistently describe it as a 1T-parameter class system rather than another mid-scale refresh.

Training/runtime posture: Qwen3-Max uses a sparse Mixture-of-Experts design and was pretrained on ~36T tokens (~2× Qwen2.5). The corpus skews toward multilingual, coding, and STEM/reasoning data. Post-training follows Qwen3’s four-stage recipe: long CoT cold-start → reasoning-focused RL → thinking/non-thinking fusion → general-domain RL. Alibaba confirms >1T parameters for Max; treat token counts/routing as team-reported until a formal Max tech report is published.

Access: Qwen Chat showcases the general-purpose UX, while Model Studio exposes inference and “thinking mode” toggles (notably, incremental_output=true is required for Qwen3 thinking models). Model listings and pricing sit under Model Studio with regioned availability.

Benchmarks: coding, agentic control, math

Coding (SWE-Bench Verified). Qwen3-Max-Instruct is reported at 69.6 on SWE-Bench Verified. That places it above some non-thinking baselines (e.g., DeepSeek V3.1 non-thinking) and slightly below Claude Opus 4 non-thinking in at least one roundup. Treat these as point-in-time numbers; SWE-Bench evaluations move quickly with harness updates.

Agentic tool use (Tau2-Bench). Qwen3-Max posts 74.8 on Tau2-Bench—an agent/tool-calling evaluation—beating named peers in the same report. Tau2 is designed to test decision-making and tool routing, not just text accuracy, so gains here are meaningful for workflow automation.

Math & advanced reasoning (AIME25, etc.). The Qwen3-Max-Thinking track (with tool use and a “heavy” runtime configuration) is described as near-perfect on key math benchmarks (e.g., AIME25) in multiple secondary sources and earlier preview coverage. Until an official technical report drops, treat “100%” claims as vendor-reported or community-replicated, not peer-reviewed.

https://qwen.ai/

Why two tracks—Instruct vs. Thinking?

Instruct targets conventional chat/coding/reasoning with tight latency, while Thinking enables longer deliberation traces and explicit tool calls (retrieval, code execution, browsing, evaluators), aimed at higher-reliability “agent” use cases. Critically, Alibaba’s API docs formalize the runtime switch: Qwen3 thinking models only operate with streaming incremental output enabled; commercial defaults are false, so callers must explicitly set it. This is a small but consequential contract detail if you’re instrumenting tools or chain-of-thought-like rollouts.
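
For illustration, here is a hedged sketch of what setting that switch could look like through Model Studio's OpenAI-compatible endpoint; the base URL, model id, and the extra_body flags (incremental_output, enable_thinking) are assumptions inferred from the description above and must be verified against the current Model Studio documentation:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # assumption: Model Studio API key
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

# Assumption: thinking models require streaming with incremental output enabled.
stream = client.chat.completions.create(
    model="qwen3-max",  # illustrative model id
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    stream=True,
    extra_body={"incremental_output": True, "enable_thinking": True},  # assumed flags
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)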

How to reason about the gains (signal vs. noise)?

Coding: A 60–70 SWE-Bench Verified score range typically reflects non-trivial repository-level reasoning and patch synthesis under evaluation harness constraints (e.g., environment setup, flaky tests). If your workloads hinge on repo-scale code changes, these deltas matter more than single-file coding toys.

Agentic: Tau2-Bench emphasizes multi-tool planning and action selection. Improvements here usually translate into fewer brittle hand-crafted policies in production agents, provided your tool APIs and execution sandboxes are robust.

Math/verification: “Near-perfect” math numbers from heavy/thinky modes underscore the value of extended deliberation plus tools (calculators, validators). Portability of those gains to open-ended tasks depends on your evaluator design and guardrails.

Summary

Qwen3-Max is not a teaser—it’s a deployable 1T-parameter MoE with documented thinking-mode semantics and reproducible access paths (Qwen Chat, Model Studio). Treat day-one benchmark wins as directionally strong but continue local evals; the hard, verifiable facts are scale (≈36T tokens, >1T params) and the API contract for tool-augmented runs (incremental_output=true). For teams building coding and agentic systems, this is ready for hands-on trials and internal gating against SWE-/Tau2-style suites.

Check out the Technical details, API and Qwen Chat.


Coding Implementation to End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization

In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch and torch.compile, ONNX Runtime, and quantized ONNX. By doing this step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all inside a Google Colab environment. Check out the FULL CODES here.

!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate

from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig

os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8

print(f"Device: {DEVICE} | torch={torch.__version__}")
We begin by installing the required libraries and setting up our environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we run on CPU or GPU. Check out the FULL CODES here.

ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")

def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]

def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))

We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we fairly compare different engines using identical data and batching. Check out the FULL CODES here.

torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()

@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")

compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))

@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")

We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark/score it on SST-2. We then attempt torch.compile for just-in-time graph optimizations and, if successful, run the same benchmarks to compare speed and accuracy under an identical setup. Check out the FULL CODES here.

provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)

@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")

Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
qconfig = QuantizationConfig(approach="dynamic", per_channel=False, reduce_range=True)
quantizer.quantize(model_input=ORT_DIR, quantization_config=qconfig, save_dir=Q_DIR)

ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)

@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")

We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum’s ORTQuantizer and benchmark both to see how latency improves while accuracy stays comparable. Check out the FULL CODES here.

pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE == "cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
    "What a fantastic movie—performed brilliantly!",
    "This was a complete waste of time.",
    "I'm not sure how I feel about this one.",
]
print("\nSample predictions (PT | ORT):")
for s in samples:
    a = pt_pipe(s)[0]["label"]
    b = ort_pipe(s)[0]["label"]
    print(f"- {s}\n  PT={a} | ORT={b}")

import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
        ["ONNX Runtime", ort_ms, ort_sd, ort_acc],
        ["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok:
    rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)

print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use QuantizationConfig(approach='static') with a calibration set.
""")

We sanity-check predictions with quick sentiment pipelines and print PyTorch vs ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting torch.compile results when available. We conclude with practical notes, allowing us to extend the workflow to other backends and quantization modes.

In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends, such as OpenVINO or TensorRT.




Google AI Introduces the Public Preview of Chrome DevTools MCP: Making Your Coding Agent Control and Inspect a Live Chrome Browser

Google has released a public preview of “Chrome DevTools MCP,” a Model Context Protocol (MCP) server that lets AI coding agents control and inspect a real Chrome instance—recording performance traces, inspecting the DOM and CSS, executing JavaScript, reading console output, and automating user flows. The launch directly targets a well-known limitation in code-generating agents: they usually cannot observe the runtime behavior of the pages they create or modify. By wiring agents into Chrome’s DevTools via MCP, Google is turning static suggestion engines into loop-closed debuggers that run measurements in the browser before proposing fixes.

What exactly is Chrome DevTools MCP?

MCP is an open protocol for connecting LLMs to tools and data. Google’s DevTools MCP acts as a specialized server that exposes Chrome’s debugging surface to MCP-compatible clients. Google’s developer blog positions this as “bringing the power of Chrome DevTools to AI coding assistants,” with concrete workflows like initiating a performance trace (e.g., performance_start_trace) against a target URL, then having the agent analyze the resulting trace to suggest optimizations (for example, diagnosing high Largest Contentful Paint).

Capabilities and tool surface

The official GitHub repository documents a broad tool set. Beyond performance tracing (performance_start_trace, performance_stop_trace, performance_analyze_insight), agents can run navigation primitives (navigate_page, new_page, wait_for), simulate user input (click, fill, drag, hover), and interrogate runtime state (list_console_messages, evaluate_script, list_network_requests, get_network_request). Screenshot and snapshot utilities provide visual and DOM-state capture to support diffs and regressions. The server uses Puppeteer under the hood for reliable automation and waiting semantics, and it speaks to Chrome via the Chrome DevTools Protocol (CDP).
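
To ground that tool surface, here is a hedged sketch of driving the server from a Python MCP client over stdio. The client calls follow the standard MCP Python SDK quickstart; the tool names come from the list above, but the exact argument schemas (e.g., the "url" key) are assumptions to verify against the repository's documentation.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="npx", args=["chrome-devtools-mcp@latest"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # inspect the exposed tool surface
            # Assumed argument schema, for illustration only:
            await session.call_tool("navigate_page", {"url": "https://example.com"})
            await session.call_tool("performance_start_trace", {})

asyncio.run(main())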

Installation

Setup is intentionally minimal for MCP clients. Google recommends adding a single config stanza that shells out to npx, always tracking the latest server build:

{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}

This server integrates with multiple agent front ends: Gemini CLI, Claude Code, Cursor, and GitHub Copilot’s MCP support. For VS Code/Copilot, the repo documents a code --add-mcp one-liner; for Claude Code, a claude mcp add command mirrors the same npx target. The package targets Node.js ≥22 and current Chrome.

Example agent workflows

Google’s announcement highlights pragmatic prompts that demonstrate end-to-end loops: verify a proposed fix in a live browser; analyze network failures (e.g., CORS or blocked image requests); simulate user behaviors like form submission to reproduce bugs; inspect layout issues by reading DOM/CSS in context; and run automated performance audits to reduce LCP and other Core Web Vitals. These are all operations agents can now validate with actual measurements rather than heuristics.

https://developer.chrome.com/blog/chrome-devtools-mcp?hl=en

Summary

Chrome DevTools MCP’s public preview is a practical inflection point for agentic frontend tooling: it grounds AI assistants in real browser telemetry—performance traces, DOM/CSS state, network and console data—so recommendations are driven by measurements rather than guesswork. The first-party server, shipped by the Chrome DevTools team, is installable via npx and targets MCP-capable clients, with Chrome/CDP under the hood. Expect shorter diagnose-fix loops for regressions and flaky UI flows, plus tighter validation of performance work.

Check out the Technical details and GitHub Page.



Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word

Real-time agents, live dubbing, and simultaneous translation die by a thousand milliseconds. Most “streaming” TTS (Text to Speech) stacks still wait for a chunk of text before they emit sound, so the human hears a beat of silence before the voice starts. VoXtream—released by KTH’s Speech, Music and Hearing group—attacks this head-on: it begins speaking after the first word, outputs audio in 80 ms frames, and reports 102 ms first-packet latency (FPL) on a modern GPU (with PyTorch compile).

What exactly is “full-stream” TTS and how is it different from “output streaming”?

Output-streaming systems decode speech in chunks but still require the entire input text upfront; the clock starts late. Full-stream systems consume text as it arrives (word-by-word from an LLM) and emit audio in lockstep. VoXtream implements the latter: it ingests a word stream and generates audio frames continuously, eliminating input-side buffering while maintaining low per-frame compute. The architecture explicitly targets first-word onset rather than only steady-state throughput.

https://arxiv.org/pdf/2509.15969

How does VoXtream start speaking without waiting for future words?

The core trick is a dynamic phoneme look-ahead inside an incremental Phoneme Transformer (PT). PT may peek up to 10 phonemes to stabilize prosody, but it does not wait for that context; generation can start immediately after the first word enters the buffer. This avoids fixed look-ahead windows that add onset delay.

What’s the model stack under the hood?

VoXtream is a single, fully-autoregressive (AR) pipeline with three transformers:

Phoneme Transformer (PT): decoder-only, incremental; dynamic look-ahead ≤ 10 phonemes; phonemization via g2pE at the word level.

Temporal Transformer (TT): AR predictor over Mimi codec semantic tokens plus a duration token that encodes a monotonic phoneme-to-audio alignment (“stay/go” and {1, 2} phonemes per frame). Mimi runs at 12.5 Hz (→ 80 ms frames).

Depth Transformer (DT): AR generator for the remaining Mimi acoustic codebooks, conditioned on TT outputs and a ReDimNet speaker embedding for zero-shot voice prompting. The Mimi decoder reconstructs the waveform frame-by-frame, enabling continuous emission.

Mimi’s streaming codec design and dual-stream tokenization are well documented; VoXtream uses its first codebook as “semantic” context and the rest for high-fidelity reconstruction.

Is it actually fast in practice—or just “fast on paper”?

The repository includes a benchmark script that measures both FPL and real-time factor (RTF). On an A100, the research team reports 171 ms / 1.00 RTF without compile and 102 ms / 0.17 RTF with compile; on an RTX 3090, 205 ms / 1.19 RTF uncompiled and 123 ms / 0.19 RTF compiled.
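
A minimal sketch of how FPL and RTF are typically measured for a streaming TTS; synthesize_stream is a stand-in for the model's frame generator, not VoXtream's actual API:

import time

def measure_streaming_tts(synthesize_stream, text, sample_rate=24000):
    t0 = time.perf_counter()
    first_packet_latency = None
    audio_samples = 0
    for frame in synthesize_stream(text):        # each frame covers ~80 ms of audio
        if first_packet_latency is None:
            first_packet_latency = time.perf_counter() - t0
        audio_samples += len(frame)
    wall_time = time.perf_counter() - t0
    rtf = wall_time / (audio_samples / sample_rate)  # < 1.0 means faster than real time
    return first_packet_latency, rtf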

How does it compare to today’s popular streaming baselines?

The research team evaluates short-form output streaming and full-stream scenarios. On LibriSpeech-long full-stream (where text arrives word-by-word), VoXtream shows lower WER (3.24 %) than CosyVoice2 (6.11 %) and a significant naturalness preference for VoXtream in listener studies (p ≤ 5e-10), while CosyVoice2 scores higher on speaker-similarity—consistent with its flow-matching decoder. In runtime, VoXtream has the lowest FPL among the compared public streaming systems, and with compile it operates >5× faster than real time (RTF ≈ 0.17).

https://arxiv.org/pdf/2509.15969

Why does this AR design beat diffusion/flow stacks on onset?

Diffusion/flow vocoders typically generate audio in chunks, so even if the text-audio interleaving is clever, the vocoder imposes a floor on first-packet latency. VoXtream keeps every stage AR and frame-synchronous—PT→TT→DT→Mimi decoder—so the first 80 ms packet emerges after one pass through the stack rather than a multi-step sampler. The introduction surveys prior interleaved and chunked approaches and explains how NAR flow-matching decoders used in IST-LM and CosyVoice2 impede low FPL despite strong offline quality.

Did they get here with huge data—or something smaller and cleaner?

VoXtream trains on a ~9k-hour mid-scale corpus: roughly 4.5k h Emilia and 4.5k h HiFiTTS-2 (22 kHz subset). The team diarized to remove multi-speaker clips, filtered transcripts using ASR, and applied NISQA to drop low-quality audio. Everything is resampled to 24 kHz, and the dataset card spells out the preprocessing pipeline and alignment artifacts (Mimi tokens, MFA alignments, duration labels, and speaker templates).

Are the headline quality metrics holding up outside cherry-picked clips?

Table 1 (zero-shot TTS) shows VoXtream is competitive on WER, UTMOS (MOS predictor), and speaker similarity across SEED-TTS test-en and LibriSpeech test-clean; the research team also runs an ablation: adding the CSM Depth Transformer and speaker encoder notably improves similarity without a significant WER penalty relative to a stripped baseline. The subjective study uses a MUSHRA-like protocol and a second-stage preference test tailored to full-stream generation.


Where does this land in the TTS landscape?

As per the research paper, it positions VoXtream among recent interleaved AR + NAR vocoder approaches and LM-codec stacks. The core contribution isn’t a new codec or a giant model—it’s a latency-focused AR arrangement plus a duration-token alignment that preserves input-side streaming. If you build live agents, the important trade-off is explicit: a small drop in speaker similarity vs. order-of-magnitude lower FPL than chunked NAR vocoders in full-stream conditions.

Check out the paper, the model on Hugging Face, the GitHub page, and the project page.



Running deep research AI agents on Amazon Bedrock AgentCore

AI agents are evolving beyond basic single-task helpers into more powerful systems that can plan, critique, and collaborate with other agents to solve complex problems. Deep Agents—a recently introduced framework built on LangGraph—bring these capabilities to life, enabling multi-agent workflows that mirror real-world team dynamics. The challenge, however, is not just building such agents but also running them reliably and securely in production. This is where Amazon Bedrock AgentCore Runtime comes in. By providing a secure, serverless environment purpose-built for AI agents and tools, Runtime makes it possible to deploy Deep Agents at enterprise scale without the heavy lifting of managing infrastructure.
In this post, we demonstrate how to deploy Deep Agents on AgentCore Runtime. As shown in the following figure, AgentCore Runtime scales any agent and provides session isolation by allocating a new microVM for each new session.

What is Amazon Bedrock AgentCore?
Amazon Bedrock AgentCore is both framework-agnostic and model-agnostic, giving you the flexibility to deploy and operate advanced AI agents securely and at scale. Whether you’re building with Strands Agents, CrewAI, LangGraph, LlamaIndex, or another framework—and running them on a large language model (LLM)—AgentCore provides the infrastructure to support them. Its modular services are purpose-built for dynamic agent workloads, with tools to extend agent capabilities and controls required for production use. By alleviating the undifferentiated heavy lifting of building and managing specialized agent infrastructure, AgentCore lets you bring your preferred framework and model and deploy without rewriting code.
Amazon Bedrock AgentCore offers a comprehensive suite of capabilities designed to transform local agent prototypes into production-ready systems. These include persistent memory for maintaining context in and across conversations, access to existing APIs using Model Context Protocol (MCP), seamless integration with corporate authentication systems, specialized tools for web browsing and code execution, and deep observability into agent reasoning processes. In this post, we focus specifically on the AgentCore Runtime component.
Core capabilities of AgentCore Runtime
AgentCore Runtime provides a serverless, secure hosting environment specifically designed for agentic workloads. It packages code into a lightweight container with a simple, consistent interface, making it equally well-suited for running agents, tools, MCP servers, or other workloads that benefit from seamless scaling and integrated identity management. AgentCore Runtime offers extended execution times of up to 8 hours for complex reasoning tasks, handles large payloads for multimodal content, and implements consumption-based pricing that charges only during active processing—not while waiting for LLM or tool responses. Each user session runs in complete isolation within dedicated micro virtual machines (microVMs), maintaining security and helping to prevent cross-session contamination between agent interactions. The runtime works with many frameworks (for example: LangGraph, CrewAI, Strands, and so on) and many foundation model providers, while providing built-in corporate authentication, specialized agent observability, and unified access to the broader AgentCore environment through a single SDK.
Real-world example: Deep Agents integration
In this post we’re going to deploy the recently released Deep Agents implementation example on AgentCore Runtime—showing just how little effort it takes to get the latest agent innovations up and running.

The sample implementation in the preceding diagram includes:

A research agent that conducts deep internet searches using the Tavily API
A critique agent that reviews and provides feedback on generated reports
A main orchestrator that manages the workflow and handles file operations

Deep Agents uses LangGraph’s state management to create a multi-agent system with:

Built-in task planning through a write_todos tool that helps agents break down complex requests
Virtual file system where agents can read/write files to maintain context across interactions
Sub-agent architecture allowing specialized agents to be invoked for specific tasks while maintaining context isolation
Recursive reasoning with high recursion limits (more than 1,000) to handle complex, multi-step workflows

This architecture enables Deep Agents to handle research tasks that require multiple rounds of information gathering, synthesis, and refinement. The key integration points in our code showcase how agents work with AgentCore. The beauty is in its simplicity—we only need to add a couple of lines of code to make an agent AgentCore-compatible:

# 1. Import the AgentCore runtime
from bedrock_agentcore.runtime import BedrockAgentCoreApp
app = BedrockAgentCoreApp()

# 2. Decorate your agent function with @app.entrypoint
@app.entrypoint
async def langgraph_bedrock(payload):
    # Your existing agent logic remains unchanged
    user_input = payload.get("prompt")

    # Call your agent as before
    stream = agent.astream(
        {"messages": [HumanMessage(content=user_input)]},
        stream_mode="values"
    )

    # Stream responses back
    async for chunk in stream:
        yield chunk

# 3. Add the runtime starter at the bottom
if __name__ == "__main__":
    app.run()

That’s it! The rest of the code—model initialization, API integrations, and agent logic—remains exactly as it was. AgentCore handles the infrastructure while your agent handles the intelligence. This integration pattern works for most Python agent frameworks, making AgentCore truly framework-agnostic.
Deploying to AgentCore Runtime: Step-by-step
Let’s walk through the actual deployment process using the AgentCore Starter ToolKit, which dramatically simplifies the deployment workflow.
Prerequisites
Before you begin, make sure you have:

Python 3.10 or higher
AWS credentials configured
Amazon Bedrock AgentCore SDK installed

Step 1: IAM permissions
There are two different AWS Identity and Access Management (IAM) permissions you need to consider when deploying an agent in an AgentCore Runtime—the role you, as a developer use to create AgentCore resources and the execution role that an agent needs to run in an AgentCore Runtime. While the latter role can now be auto-created by the AgentCore Starter Toolkit (auto_create_execution_role=True), the former must be defined as described in IAM Permissions for AgentCore Runtime.
Step 2: Add a wrapper to your agent
As shown in the preceding Deep Agents example, add the AgentCore imports and decorator to your existing agent code.
Step 3: Deploy using the AgentCore starter toolkit
The starter toolkit provides a three-step deployment process:

from bedrock_agentcore_starter_toolkit import Runtime

# Step 1: Configure
agentcore_runtime = Runtime()
config_response = agentcore_runtime.configure(
    entrypoint="hello.py",        # contains the code we showed earlier in the post
    execution_role=role_arn,      # or auto-create
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region="us-west-2",
    agent_name="deepagents-research"
)

# Step 2: Launch
launch_result = agentcore_runtime.launch()
print(f"Agent deployed! ARN: {launch_result['agent_arn']}")

# Step 3: Invoke
response = agentcore_runtime.invoke({
    "prompt": "Research the latest developments in quantum computing"
})

Step 4: What happens behind the scenes
When you run the deployment, the starter kit automatically:

Generates an optimized Docker file with Python 3.13-slim base image and OpenTelemetry instrumentation
Builds your container with the dependencies from requirements.txt
Creates an Amazon Elastic Container Registry (Amazon ECR) repository (if auto_create_ecr=True) and pushes your image
Deploys to AgentCore Runtime and monitors the deployment status
Configures networking and observability with Amazon CloudWatch and AWS X-Ray integration

The entire process typically takes 2–3 minutes, after which your agent is ready to handle requests at scale. Each new session is launched in its own fresh AgentCore Runtime microVM, maintaining complete environment isolation.
The starter kit generates a configuration file (.bedrock_agentcore.yaml) that captures your deployment settings, making it straightforward to redeploy or update your agent later.
Invoking your deployed agent
After deployment, you have two options for invoking your agent:
Option 1: Using the starter kit (shown in Step 3)

response = agentcore_runtime.invoke({
    "prompt": "Research the latest developments in quantum computing"
})

Option 2: Using boto3 SDK directly

import boto3
import json

agentcore_client = boto3.client('bedrock-agentcore', region_name='us-west-2')
response = agentcore_client.invoke_agent_runtime(
    agentRuntimeArn=agent_arn,
    qualifier="DEFAULT",
    payload=json.dumps({
        "prompt": "Analyze the impact of AI on healthcare in 2024"
    })
)

# Handle streaming response
for event in response['completion']:
    if 'chunk' in event:
        print(event['chunk']['bytes'].decode('utf-8'))

Deep Agents in action
As the code executes in Bedrock AgentCore Runtime, the primary agent orchestrates specialized sub-agents—each with its own purpose, prompt, and tool access—to solve complex tasks more effectively. In this case, the orchestrator prompt (research_instructions) sets the plan:

Write the question to question.txt
Fan out to one or more research-agent calls (each on a single sub-topic) using the internet_search tool
Synthesize findings into final_report.md
Call critique-agent to evaluate gaps and structure
Optionally loop back to more research/edits until quality is met

Here it is in action:

Clean up
When finished, don’t forget to de-allocate the provisioned AgentCore Runtime along with the container repository that was created during the process:

agentcore_control_client = boto3.client(
    'bedrock-agentcore-control', region_name=region)
ecr_client = boto3.client('ecr', region_name=region)

runtime_delete_response = agentcore_control_client.delete_agent_runtime(
    agentRuntimeId=launch_result.agent_id,
)
response = ecr_client.delete_repository(
    repositoryName=launch_result.ecr_uri.split('/')[1], force=True)

Conclusion
Amazon Bedrock AgentCore represents a paradigm shift in how we deploy AI agents. By abstracting away infrastructure complexity while maintaining framework and model flexibility, AgentCore enables developers to focus on building sophisticated agent logic rather than managing deployment pipelines. Our Deep Agents deployment demonstrates that even complex, multi-agent systems with external API integrations can be deployed with minimal code changes. The combination of enterprise-grade security, built-in observability, and serverless scaling makes AgentCore the best choice for production AI agent deployments. Specifically for deep research agents, AgentCore offers the following unique capabilities that you can explore:

AgentCore Runtime can handle asynchronous processing and long running (up to 8 hours) agents. Asynchronous tasks allow your agent to continue processing after responding to the client and handle long-running operations without blocking responses. Your background research sub-agent could be asynchronously researching for hours.
AgentCore Runtime works with AgentCore Memory, enabling capabilities such as building upon previous findings, remembering research preferences, and maintaining complex investigation context without losing progress between sessions.
You can use AgentCore Gateway to extend your deep research to include proprietary insights from enterprise services and data sources. By exposing these differentiated resources as MCP tools, your agents can quickly take advantage and combine that with publicly available knowledge.

Ready to deploy your agents to production? Here’s how to get started:

Install the AgentCore starter kit: pip install bedrock-agentcore-starter-toolkit
Experiment: Deploy your code by following this step by step guide.

The era of production-ready AI agents is here. With AgentCore, the journey from prototype to production has never been shorter.

About the authors
Vadim Omeltchenko is a Sr. AI/ML Solutions Architect who is passionate about helping AWS customers innovate in the cloud. His prior IT experience was predominantly on the ground.
Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Shreyas Subramanian is a Principal data scientist and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS Certifications, including the ML Specialty Certification.

Integrate tokenization with Amazon Bedrock Guardrails for secure data …

This post is co-written by Mark Warner, Principal Solutions Architect for Thales, Cyber Security Products.
As generative AI applications make their way into production environments, they integrate with a wider range of business systems that process sensitive customer data. This integration introduces new challenges around protecting personally identifiable information (PII) while maintaining the ability to recover original data when legitimately needed by downstream applications. Consider a financial services company implementing generative AI across different departments. The customer service team needs an AI assistant that can access customer profiles and provide personalized responses that include contact information, for example: “We’ll send your new card to your address at 123 Main Street.” Meanwhile, the fraud analysis team requires the same customer data but must analyze patterns without exposing actual PII, working only with protected representations of sensitive information.
Amazon Bedrock Guardrails helps detect sensitive information, such as PII, in standard format in input prompts or model responses. Sensitive information filters give organizations control over how sensitive data is handled, with options to block requests containing PII or mask the sensitive information with generic placeholders like {NAME} or {EMAIL}. This capability helps organizations comply with data protection regulations while still using the power of large language models (LLMs).
Although masking effectively protects sensitive information, it creates a new challenge: the loss of data reversibility. When guardrails replace sensitive data with generic masks, the original information becomes inaccessible to downstream applications that might need it for legitimate business processes. This limitation can impact workflows where both security and functional data are required.
Tokenization offers a complementary approach to this challenge. Unlike masking, tokenization replaces sensitive data with format-preserving tokens that are mathematically unrelated to the original information but maintain its structure and usability. These tokens can be securely reversed back to their original values when needed by authorized systems, creating a path for secure data flows throughout an organization’s environment.
In this post, we show you how to integrate Amazon Bedrock Guardrails with third-party tokenization services to protect sensitive data while maintaining data reversibility. By combining these technologies, organizations can implement stronger privacy controls while preserving the functionality of their generative AI applications and related systems. The solution described in this post demonstrates how to combine Amazon Bedrock Guardrails with tokenization services from Thales CipherTrust Data Security Platform to create an architecture that protects sensitive data without sacrificing the ability to process that data securely when needed. This approach is particularly valuable for organizations in highly regulated industries that need to balance innovation with compliance requirements.
Amazon Bedrock Guardrails APIs
This section describes the key components and workflow for the integration between Amazon Bedrock Guardrails and a third-party tokenization service.
Amazon Bedrock Guardrails provides two distinct approaches for implementing content safety controls:

Direct integration with model invocation through APIs like InvokeModel and Converse, where guardrails automatically evaluate inputs and outputs as part of the model inference process.
Standalone evaluation through the ApplyGuardrail API, which decouples guardrails assessment from model invocation, allowing evaluation of text against defined policies.

This post uses the ApplyGuardrail API for tokenization integration because it separates content assessment from model invocation, allowing tokenization processing to be inserted between these steps. This separation creates the necessary space in the workflow to replace guardrail masks with format-preserving tokens before model invocation, or to tokenize the model response before it is handed to downstream applications.
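For contrast, the direct-integration path attaches the guardrail to the model call itself. The following minimal sketch uses the Converse API with a guardrailConfig so that inputs and outputs are evaluated as part of inference; the guardrail ID, version, and model ID are placeholders, and the remainder of this post uses the standalone ApplyGuardrail path instead:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Direct integration: the guardrail is evaluated as part of the model call.
# The guardrail ID/version and model ID below are placeholders.
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize my recent transactions."}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "1"
    }
)
print(response["output"]["message"]["content"][0]["text"])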
The solution extends the typical ApplyGuardrail API implementation by inserting tokenization processing between guardrail evaluation and model invocation, as follows (a minimal glue-code sketch follows the list):

The application calls the ApplyGuardrail API to assess the user input for sensitive information.
If no sensitive information is detected (action = “NONE”), the application proceeds to model invocation via the InvokeModel API.
If sensitive information is detected (action = “ANONYMIZED”):

The application captures the detected PII and its positions.
It calls a tokenization service to convert these entities into format-preserving tokens.
It replaces the generic guardrail masks with these tokens.
The application then invokes the foundation model with the tokenized content.

For model responses:

The application applies guardrails to check the output from the model for sensitive information.
It tokenizes detected PII before passing the response to downstream systems.
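
To make the sequencing concrete, the following minimal sketch wires the input path together using the helper functions defined later in this post (invoke_guardrail, thales_ciphertrust_tokenizer, and process_guardrail_output). The model ID is a placeholder, the model call is shown with the Converse API for brevity, and error handling is omitted:

import boto3

def handle_user_query(user_query):
    """Illustrative glue code for the input path of the workflow described above."""
    bedrock_runtime = boto3.client("bedrock-runtime")

    # Step 1: assess the input with the guardrail
    guardrail_response = invoke_guardrail(user_query)

    if guardrail_response.get("action") == "NONE":
        # No sensitive information detected; use the input as-is
        prompt = user_query
    else:
        # Steps 2-3: tokenize detected PII and swap the generic masks for tokens
        protected = thales_ciphertrust_tokenizer(guardrail_response)
        prompt = process_guardrail_output(protected, guardrail_response)[0]["text"]

    # Step 4: invoke the foundation model with the tokenized content (placeholder model ID)
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}]
    )
    return response["output"]["message"]["content"][0]["text"]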

Solution overview
To illustrate how this workflow delivers value in practice, consider a financial advisory application that helps customers understand their spending patterns and receive personalized financial recommendations. In this example, three distinct application components work together to provide secure, AI-powered financial insights:

Customer gateway service – This trusted frontend orchestrator receives customer queries that often contain sensitive information. For example, a customer might ask: “Hi, this is j.smith@example.com. Based on my last five transactions on acme.com, and my current balance of $2,342.18, should I consider their new credit card offer?”
Financial analysis engine – This AI-powered component analyzes financial patterns and generates recommendations but doesn’t need access to actual customer PII. It works with anonymized or tokenized information.
Response processing service – This trusted service handles the final customer communication, including detokenizing sensitive information before presenting results to the customer.

The following diagram illustrates the workflow for integrating Amazon Bedrock Guardrails with tokenization services in this financial advisory application. AWS Step Functions orchestrates the sequential process of PII detection, tokenization, AI model invocation, and detokenization across the three key components (customer gateway service, financial analysis engine, and response processing service) using AWS Lambda functions.

The workflow operates as follows:

The customer gateway service (for this example, through Amazon API Gateway) receives the user input containing sensitive information.
It calls the ApplyGuardrail API to identify PII or other sensitive information that should be anonymized or blocked.
For detected sensitive elements (such as user names or merchant names), it calls the tokenization service to generate format-preserving tokens.
The input with tokenized values is passed to the financial analysis engine for processing. (For example, “Hi, this is [[TOKEN_123]]. Based on my last five transactions on [[TOKEN_456]] and my current balance of $2,342.18, should I consider their new credit card offer?”)
The financial analysis engine invokes an LLM on Amazon Bedrock to generate financial advice using the tokenized data.
The model response, potentially containing tokenized values, is sent to the response processing service.
This service calls the tokenization service to detokenize the tokens, restoring the original sensitive values.
The final, detokenized response is delivered to the customer.

This architecture maintains data confidentiality throughout the processing flow while preserving the information’s utility. The financial analysis engine works with structurally valid but cryptographically protected data, allowing it to generate meaningful recommendations without exposing sensitive customer information. Meanwhile, the trusted components at the entry and exit points of the workflow can access the actual data when necessary, creating a secure end-to-end solution.
In the following sections, we provide a detailed walkthrough of implementing the integration between Amazon Bedrock Guardrails and tokenization services.
Prerequisites
To implement the solution described in this post, you must have the following components configured in your environment:

An AWS account with Amazon Bedrock enabled in your target AWS Region.
Appropriate AWS Identity and Access Management (IAM) permissions configured following least privilege principles with specific actions enabled: bedrock:CreateGuardrail, bedrock:ApplyGuardrail, and bedrock-runtime:InvokeModel.
For AWS Organizations, verify Amazon Bedrock access is permitted by service control policies.
A Python 3.7+ environment with the boto3 library installed. For information about installing the boto3 library, refer to AWS SDK for Python (Boto3).
AWS credentials configured for programmatic access using the AWS Command Line Interface (AWS CLI). For more details, refer to Configuring settings for the AWS CLI.
This implementation requires a deployed tokenization service accessible through REST API endpoints. Although this walkthrough demonstrates integration with Thales CipherTrust, the pattern adapts to tokenization providers offering protect and unprotect API operations. Make sure network connectivity exists between your application environment and both AWS APIs and your tokenization service endpoints, along with valid authentication credentials for accessing your chosen tokenization service. For information about setting up Thales CipherTrust specifically, refer to How Thales Enables PCI DSS Compliance with a Tokenization Solution on AWS.

Configure Amazon Bedrock Guardrails
Configure Amazon Bedrock Guardrails for PII detection and masking through the Amazon Bedrock console or programmatically using the AWS SDK. Sensitive information filter policies can anonymize or redact information from model requests or responses:

import boto3

def create_bedrock_guardrail():
    """
    Create a guardrail in Amazon Bedrock for financial applications with PII protection.
    """
    bedrock = boto3.client('bedrock')

    response = bedrock.create_guardrail(
        name="FinancialServiceGuardrail",
        description="Guardrail for financial applications with PII protection",
        sensitiveInformationPolicyConfig={
            'piiEntitiesConfig': [
                {
                    'type': 'URL',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                },
                {
                    'type': 'EMAIL',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                },
                {
                    'type': 'NAME',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                }
            ]
        },
        blockedInputMessaging="I can't provide information with PII data.",
        blockedOutputsMessaging="I can't generate content with PII data."
    )

    return response
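
The response from create_guardrail includes the new guardrail’s identifier and a working draft version. To reference a stable version from the ApplyGuardrail API later, you can also publish a numbered version; the following brief sketch assumes default credentials and Region:

import boto3

bedrock = boto3.client('bedrock')

guardrail = create_bedrock_guardrail()
guardrail_id = guardrail['guardrailId']

# Publish an immutable, numbered version of the guardrail (the draft stays editable)
version_response = bedrock.create_guardrail_version(
    guardrailIdentifier=guardrail_id,
    description='Initial version for the financial services assistant'
)
print(guardrail_id, version_response['version'])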

Integrate the tokenization workflow
This section implements the tokenization workflow by first detecting PII entities with the ApplyGuardrail API, then replacing the generic masks with format-preserving tokens from your tokenization service.
Apply guardrails to detect PII entities
Use the ApplyGuardrail API to validate input text from the user and detect PII entities:

import boto3
from botocore.exceptions import ClientError

def invoke_guardrail(user_query):
    """
    Apply Amazon Bedrock Guardrails to validate input text and detect PII entities.

    Args:
        user_query (str): The user's input text to be checked.

    Returns:
        dict: The response from the ApplyGuardrail API.

    Raises:
        ClientError: If there's an error applying the guardrail.
    """
    try:
        bedrock_runtime = boto3.client('bedrock-runtime')

        response = bedrock_runtime.apply_guardrail(
            guardrailIdentifier='your-guardrail-id',    # Replace with your actual guardrail ID
            guardrailVersion='your-guardrail-version',  # Replace with your actual version
            source="INPUT",
            content=[{"text": {"text": user_query}}]
        )

        return response
    except ClientError as e:
        print(f"Error applying guardrail: {e}")
        raise

Invoke tokenization service
The response from the ApplyGuardrail API includes the list of PII entities matching the sensitive information policy. Parse those entities and invoke the tokenization service to generate the tokens.
The following example code uses the Thales CipherTrust tokenization service:

import json
import requests
from botocore.exceptions import ClientError

def thales_ciphertrust_tokenizer(guardrail_response):
    """
    Process PII entities detected by the guardrail and tokenize them using Thales CipherTrust.

    Args:
        guardrail_response (dict): The response from the ApplyGuardrail API.

    Returns:
        list: List of dictionaries containing original values, types, and tokenized responses.

    Raises:
        ClientError: If there's an error invoking Thales CipherTrust.
    """
    try:
        protected_results = []

        for assessment in guardrail_response.get("assessments", []):
            pii_entities = assessment.get("sensitiveInformationPolicy", {}).get("piiEntities", [])

            for entity in pii_entities:
                sensitive_value = entity.get("match")
                entity_type = entity.get("type")

                if sensitive_value:
                    # Prepare payload for the Thales CipherTrust tokenization service
                    crdp_payload = {
                        "protection_policy_name": "plain-alpha-internal",
                        "DATA_KEY": sensitive_value,
                    }

                    url_str = "http://your-ciphertrust-cname:8090/v1/protect"  # Replace with your actual CipherTrust URL
                    headers = {"Content-Type": "application/json"}

                    # Invoke the Thales CipherTrust tokenization service
                    response = requests.post(url_str, headers=headers, data=json.dumps(crdp_payload))
                    response.raise_for_status()
                    response_json = response.json()

                    protected_results.append({
                        "original_value": sensitive_value,
                        "type": entity_type,
                        "protection_response": response_json
                    })

        return protected_results
    except requests.RequestException as e:
        print(f"Error invoking Thales CipherTrust: {e}")
        # Re-raise as a botocore ClientError with a well-formed error response
        raise ClientError(
            {"Error": {"Code": "TokenizationError",
                       "Message": f"Error invoking Thales CipherTrust: {e}"}},
            "Protect"
        ) from e

Replace guardrail masks with tokens
Next, substitute the generic guardrail masks with the tokens generated by the Thales CipherTrust tokenization service. This enables downstream applications to work with structurally valid data while maintaining security and reversibility.

def process_guardrail_output(protected_results, guardrail_response):
    """
    Process guardrail output by replacing placeholders with protected values.

    Args:
        protected_results (list): List of protected data tokenized by Thales CipherTrust.
        guardrail_response (dict): Guardrail response dictionary.

    Returns:
        list: List of modified output items with placeholders replaced by tokens.

    Raises:
        ValueError: If input parameters are invalid.
        Exception: For any unexpected errors during processing.
    """
    try:
        # Validate input types
        if not isinstance(protected_results, list) or not isinstance(guardrail_response, dict):
            raise ValueError("Invalid input parameters")

        # Extract protection map: PII type -> token value
        protection_map = {res['type'].upper(): res['protection_response']['protected_data']
                          for res in protected_results}

        # Process outputs
        modified_outputs = []
        for output_item in guardrail_response.get('outputs', []):
            if 'text' in output_item:
                modified_text = output_item['text']

                # Replace all placeholders in one pass
                for pii_type, protected_value in protection_map.items():
                    modified_text = modified_text.replace(f"{{{pii_type}}}", protected_value)

                modified_outputs.append({"text": modified_text})

        return modified_outputs
    except (ValueError, KeyError) as e:
        print(f"Error processing guardrail output: {e}")
        raise
    except Exception as e:
        print(f"Unexpected error while processing guardrail output: {e}")
        raise

This process transforms user inputs containing information that matches the sensitive information policy defined in Amazon Bedrock Guardrails into unique, reversible tokenized versions.
The following example input contains PII elements:

“Hi, this is john.smith@example.com. Based on my last five transactions on acme.com, and my current balance of $2,342.18, should I consider their new credit card offer?”

The following is an example of the sanitized user input:

“Hi, this is 1001000GC5gDh1.D8eK71@EjaWV.lhC. Based on my last five transactions on 1001000WcFzawG.Jc9Tfc, and my current balance of $2,342.18, should I consider their new credit card offer?”

Downstream application processing
The sanitized input is ready to be used by generative AI applications, including model invocations on Amazon Bedrock. In response to the tokenized input, an LLM invoked by the financial analysis engine would produce a relevant analysis that maintains the secure token format:

“Based on your recent transactions at 1001000WcFzawG.Jc9Tfc and your current account status, I can confirm that the new credit card offer would provide approximately $33 in monthly rewards based on your spending patterns. With annual benefits of around $394 against the $55 annual fee, this card would be beneficial for your profile, 1001000GC5gDh1.D8eK71@EjaWV.lhC.”

When authorized systems need to recover original values, tokens are detokenized. With Thales CipherTrust, this is accomplished using the Detokenize API, which requires the same parameters as in the previous tokenize action. This completes the secure data flow while preserving the ability to recover original information when needed.
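For illustration, a detokenization call against the same CRDP service could look like the following sketch; the endpoint path (/v1/reveal) and payload field names mirror the protect call shown earlier and are assumptions that may differ in your CipherTrust deployment:

import json
import requests

def thales_ciphertrust_detokenize(token_value):
    # Payload mirrors the protect call; field names are assumptions for illustration
    payload = {
        "protection_policy_name": "plain-alpha-internal",
        "protected_data": token_value
    }
    url_str = "http://your-ciphertrust-cname:8090/v1/reveal"  # Replace with your actual CipherTrust URL
    headers = {"Content-Type": "application/json"}

    response = requests.post(url_str, headers=headers, data=json.dumps(payload))
    response.raise_for_status()
    return response.json()  # contains the original (detokenized) value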
Clean up
As you follow the approach described in this post, you will create new AWS resources in your account. To avoid incurring additional charges, delete these resources when you no longer need them.
To clean up your resources, complete the following steps:

Delete the guardrails you created. For instructions, refer to Delete your guardrail.
If you implemented the tokenization workflow using Lambda, API Gateway, or Step Functions as described in this post, remove the resources you created.
This post assumes a tokenization solution is already available in your account. If you deployed a third-party tokenization solution (such as Thales CipherTrust) to test this implementation, refer to that solution’s documentation for instructions to properly decommission these resources and stop incurring charges.

Conclusion
This post demonstrated how to combine Amazon Bedrock Guardrails with tokenization to enhance handling of sensitive information in generative AI workflows. By integrating these technologies, organizations can protect PII during processing while maintaining data utility and reversibility for authorized downstream applications.
The implementation illustrated uses Thales CipherTrust Data Security Platform for tokenization, but the architecture supports many tokenization solutions. To learn more about a serverless approach to building custom tokenization capabilities, refer to Building a serverless tokenization solution to mask sensitive data.
This solution provides a practical framework for builders to use the full potential of generative AI with appropriate safeguards. By combining the content safety mechanisms of Amazon Bedrock Guardrails with the data reversibility of tokenization, you can implement responsible AI workflows that align with your application requirements and organizational policies while preserving the functionality needed for downstream systems.
To learn more about implementing responsible AI practices on AWS, see Transform responsible AI from theory into practice.

About the Authors
Nizar Kheir is a Senior Solutions Architect at AWS with more than 15 years of experience spanning various industry segments. He currently works with public sector customers in France and across EMEA to help them modernize their IT infrastructure and foster innovation by harnessing the power of the AWS Cloud.
Mark Warner is a Principal Solutions Architect for Thales, Cyber Security Products division. He works with companies in various industries such as finance, healthcare, and insurance to improve their security architectures. His focus is assisting organizations with reducing risk, increasing compliance, and streamlining data security operations to reduce the probability of a breach.

Microsoft Brings MCP to Azure Logic Apps (Standard) in Public Preview, …

Microsoft has released a public preview that enables Azure Logic Apps (Standard) to run as Model Context Protocol (MCP) servers, exposing Logic Apps workflows as agent tools discoverable and callable by MCP-capable clients (e.g., VS Code + Copilot).

What’s actually shipping

Remote MCP server on Logic Apps (Standard): You configure a Standard logic app to host an MCP endpoint (/api/mcp) and surface HTTP Request/Response workflows as tools. Authentication is front-doored by Easy Auth; MCP endpoints default to OAuth 2.0. VS Code (≥1.102) includes GA MCP client support for testing.

API Center registration path (preview): You can also create/register MCP servers in Azure API Center, where selected managed connector actions become tools with cataloging and governance.

https://learn.microsoft.com/en-us/azure/logic-apps/set-up-model-context-protocol-server-standard

Key requirements and transport details

Workflow shape: Tools must be implemented as HTTP Request trigger (“When a HTTP request is received”) plus a Response action.

Auth & access control: By default, MCP uses OAuth 2.0; Easy Auth enforces client/identity/tenant restrictions. During setup, App Service authentication must allow unauthenticated requests (the MCP flow still performs OAuth).

Transports: Streamable HTTP works out of the box. SSE additionally requires VNET integration and the host.json setting Runtime.Backend.EdgeWorkflowRuntimeTriggerListener.AllowCrossWorkerCommunication=true.

Enablement switch: MCP APIs are enabled by adding extensions.workflow.McpServerEndpoints.enable=true in host.json.

API Center path: preview limitations that matter

When creating MCP servers via API Center backed by Logic Apps, the current preview imposes the following limits:

Start with an empty Standard logic app resource.

One connector per MCP server.

Built-in service-provider and custom connectors aren’t supported in this path (managed connectors only).

One action per tool.

These constraints materially affect tool granularity and server layout in larger estates.

Why Standard (single-tenant) is the target

Standard runs on the single-tenant Logic Apps runtime (on Azure Functions), supports multiple workflows per app, and integrates directly with virtual networks and private endpoints—all relevant for exposing private systems safely to agents and for predictable throughput/latency. By contrast, Consumption is multitenant, single-workflow per app, and pay-per-execution.

Tooling semantics and discoverability

Microsoft recommends adding trigger descriptions, parameter schemas/descriptions, and required markers to improve agent tool selection and invocation reliability. These annotations are read by MCP clients and influence calling behavior.

Connectors and enterprise reach

Organizations can front existing workflows and a large catalog of Logic Apps connectors (cloud and on-prem) through MCP, turning them into callable agent tools; Microsoft explicitly cites “more than 1,400 connectors.”

Operations, governance, and testing

Run history plus Application Insights/Log Analytics are available for diagnostics and auditability. VS Code provides quick client validation via MCP: Add Server, including OAuth sign-in and tool enumeration. Registering via API Center brings discovery/governance to MCP servers across teams.

Production notes (preview)

SSE requires both VNET and the cross-worker setting; without these, use streamable HTTP.

Easy Auth must be configured precisely (including the “allow unauthenticated” toggle) or client sign-in flows will fail despite OAuth expectations.

Throttling, idempotency, and schema versioning remain your responsibility when wrapping connectors as tools (not new, but now in the agent path). InfoQ highlights similar operational concerns from early adopters.

Summary

The preview cleanly MCP-enables Logic Apps (Standard): you expose HTTP-based workflows as OAuth-protected tools; you can catalog them in API Center; and you can reach private systems through single-tenant networking. For teams already invested in Logic Apps, this is a low-friction, standards-aligned route to operationalize enterprise agent tooling—just mind the API Center limits, SSE prerequisites, and Easy Auth nuances during rollout.



The post Microsoft Brings MCP to Azure Logic Apps (Standard) in Public Preview, Turning Connectors into Agent Tools appeared first on MarkTechPost.

Perplexity Launches an AI Email Assistant Agent for Gmail and Outlook, …

Perplexity introduced “Email Assistant,” an AI agent that plugs into Gmail and Outlook to draft replies in your voice, auto-label and prioritize messages, and coordinate meetings end-to-end (availability checks, time suggestions, and calendar invites). The feature is restricted to Perplexity’s Max plan and is live today.

What it does

Email Assistant adds an agent to any thread (via cc) that handles the back-and-forth typical of scheduling. It reads availability, proposes times, and issues invites, while also surfacing daily priorities and generating reply drafts aligned to the user’s tone. Launch support covers Gmail and Outlook with one-click setup links.

https://www.perplexity.ai/assistant

How it plugs into calendars and mail

Perplexity has been shipping native connectors for Google and Microsoft stacks; the current changelog notes that Gmail/Gcal/Outlook connections support email search and “create calendar invites directly within Perplexity,” which is what the Email Assistant automates from within a live thread. Practically, users enroll, then send or cc assistant@perplexity.com to delegate scheduling and triage tasks.


Security posture

Perplexity specifies SOC 2 and GDPR compliance and says user data is not used for training. For teams evaluating agents in regulated environments, that implies standard audit controls and data-handling boundaries, but as always, production rollouts should validate data-access scopes and DLP posture in the target tenant.

Competitive context

Email Assistant overlaps with Microsoft Copilot for Outlook and Google Gemini for Gmail (summaries/assists). Perplexity’s differentiator is agentic handling of the entire negotiation loop inside email threads plus cross-account connectors already present in its Comet stack. That makes it a realistic drop-in for users who prefer an external agent rather than suite-native assistants.

Early read for implementers

Integration path: Connect Gmail/Outlook, then cc the agent on threads that need scheduling; use it for triage queries and auto-drafts.

Workflow coverage: Auto-labels for “needs reply” vs. FYI; daily summaries; draft-in-your-style replies; invite creation.

Boundary conditions: Max-only; launch support limited to Gmail/Outlook; verify calendar write permissions and compliance needs per domain.

Summary

Perplexity’s Email Assistant is a concrete agentic workflow for inboxes: cc it, let it negotiate times, send invites, and keep your triage queue lean—currently gated to Max subscribers and Gmail/Outlook environments.



The post Perplexity Launches an AI Email Assistant Agent for Gmail and Outlook, Aimed at Scheduling, Drafting, and Inbox Triage appeared first on MarkTechPost.

Alibaba Qwen Team Just Released FP8 Builds of Qwen3-Next-80B-A3B (Inst …

Alibaba’s Qwen team has just released FP8-quantized checkpoints for its new Qwen3-Next-80B-A3B models in two post-training variants—Instruct and Thinking—aimed at high-throughput inference with ultra-long context and MoE efficiency. The FP8 repos mirror the BF16 releases but package “fine-grained FP8” weights (block size 128) and deployment notes for sglang and vLLM nightly builds. Benchmarks in the cards remain those of the original BF16 models; FP8 is provided “for convenience and performance,” not as a separate evaluation run.

What’s in the A3B stack

Qwen3-Next-80B-A3B is a hybrid architecture combining Gated DeltaNet (a linear/conv-style attention surrogate) with Gated Attention, interleaved with an ultra-sparse Mixture-of-Experts (MoE). The 80B total parameter budget activates ~3B params per token via 512 experts (10 routed + 1 shared). The layout is specified as 48 layers arranged into 12 blocks: 3×(Gated DeltaNet → MoE) followed by 1×(Gated Attention → MoE). Native context is 262,144 tokens, validated up to ~1,010,000 tokens using RoPE scaling (YaRN). Hidden size is 2048; attention uses 16 Q heads and 2 KV heads at head dim 256; DeltaNet uses 32 V and 16 QK linear heads at head dim 128.

Qwen team reports the 80B-A3B base model outperforms Qwen3-32B on downstream tasks at ~10% of its training cost and delivers ~10× inference throughput beyond 32K context—driven by low activation in MoE and multi-token prediction (MTP). The Instruct variant is non-reasoning (no <think> tags), whereas the Thinking variant enforces reasoning traces by default and is optimized for complex problems.

FP8 releases: what actually changed

The FP8 model cards state the quantization is “fine-grained fp8” with block size 128. Deployment differs slightly from BF16: both sglang and vLLM require current main/nightly builds, with example commands provided for 256K context and optional MTP. The Thinking FP8 card also recommends a reasoning parser flag (e.g., --reasoning-parser deepseek-r1 in sglang, deepseek_r1 in vLLM). These releases retain Apache-2.0 licensing.
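
For a rough sense of what offline serving looks like, the following Python sketch loads the FP8 Instruct checkpoint with vLLM. It is a sketch only: the repository name, context length, and parallelism settings are assumptions based on the model card, and a current main/nightly vLLM build is required as noted above.

from vllm import LLM, SamplingParams

# Assumed FP8 repo id; confirm the exact name on the official Qwen model card
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
    tensor_parallel_size=4,   # adjust to your GPU count
    max_model_len=262144      # native context per the model card
)

prompts = ["Summarize the trade-offs of FP8 quantization for sparse MoE models."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=512))
print(outputs[0].outputs[0].text)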

Benchmarks (reported on BF16 weights)

The Instruct FP8 card reproduces Qwen’s BF16 comparison table, putting Qwen3-Next-80B-A3B-Instruct on par with Qwen3-235B-A22B-Instruct-2507 on several knowledge/reasoning/coding benchmarks, and ahead on long-context workloads (up to 256K). The Thinking FP8 card lists AIME’25, HMMT’25, MMLU-Pro/Redux, and LiveCodeBench v6, where Qwen3-Next-80B-A3B-Thinking surpasses earlier Qwen3 Thinking releases (30B A3B-2507, 32B) and claims wins over Gemini-2.5-Flash-Thinking on multiple benchmarks.

Training and post-training signals

The series is trained on ~15T tokens before post-training. Qwen highlights stability additions (zero-centered, weight-decayed layer norm, etc.) and uses GSPO in RL post-training for the Thinking model to handle the hybrid attention + high-sparsity MoE combination. MTP is used to speed inference and improve pretraining signal.

Why FP8 matters

On modern accelerators, FP8 activations/weights reduce memory bandwidth pressure and resident footprint versus BF16, allowing larger batch sizes or longer sequences at similar latency. Because A3B routes only ~3B parameters per token, the combination of FP8 + MoE sparsity compounds throughput gains in long-context regimes, particularly when paired with speculative decoding via MTP as exposed in the serving flags. That said, quantization interacts with routing and attention variants; real-world acceptance rates for speculative decoding and end-task accuracy can vary with engine and kernel implementations—hence Qwen’s guidance to use current sglang/vLLM and to tune speculative settings.

Summary

Qwen’s FP8 releases make the 80B/3B-active A3B stack practical to serve at 256K context on mainstream engines, preserving the hybrid-MoE design and MTP path for high throughput. The model cards keep benchmarks from BF16, so teams should validate FP8 accuracy and latency on their own stacks, especially with reasoning parsers and speculative settings. Net outcome: lower memory bandwidth and improved concurrency without architectural regressions, positioned for long-context production workloads.

Check out the Qwen3-Next-80B-A3B models in two post-training variants—Instruct and Thinking.

The post Alibaba Qwen Team Just Released FP8 Builds of Qwen3-Next-80B-A3B (Instruct & Thinking), Bringing 80B/3B-Active Hybrid-MoE to Commodity GPUs appeared first on MarkTechPost.

Rapid ML experimentation for enterprises with Amazon SageMaker AI and …

This post was written with Sarah Ostermeier from Comet.
As enterprise organizations scale their machine learning (ML) initiatives from proof of concept to production, the complexity of managing experiments, tracking model lineage, and managing reproducibility grows exponentially. This is primarily because data scientists and ML engineers constantly explore different combinations of hyperparameters, model architectures, and dataset versions, generating massive amounts of metadata that must be tracked for reproducibility and compliance. As the ML model development scales across multiple teams and regulatory requirements intensify, tracking experiments becomes even more complex. With increasing AI regulations, particularly in the EU, organizations now require detailed audit trails of model training data, performance expectations, and development processes, making experiment tracking a business necessity and not just a best practice.
Amazon SageMaker AI provides the managed infrastructure enterprises need to scale ML workloads, handling compute provisioning, distributed training, and deployment without infrastructure overhead. However, teams still need robust experiment tracking, model comparison, and collaboration capabilities that go beyond basic logging.
Comet is a comprehensive ML experiment management platform that automatically tracks, compares, and optimizes ML experiments across the entire model lifecycle. It provides data scientists and ML engineers with powerful tools for experiment tracking, model monitoring, hyperparameter optimization, and collaborative model development. It also offers Opik, Comet’s open source platform for LLM observability and development.
Comet is available in SageMaker AI as a Partner AI App, as a fully managed experiment management capability, with enterprise-grade security, seamless workflow integration, and a straightforward procurement process through AWS Marketplace.
The combination addresses the needs of an enterprise ML workflow end-to-end, where SageMaker AI handles infrastructure and compute, and Comet provides the experiment management, model registry, and production monitoring capabilities that teams require for regulatory compliance and operational efficiency. In this post, we demonstrate a complete fraud detection workflow using SageMaker AI with Comet, showcasing reproducibility and audit-ready logging needed by enterprises today.
Enterprise-ready Comet on SageMaker AI
Before proceeding to setup instructions, organizations must identify their operating model and based on that, decide how Comet is going to be set up. We recommend implementing Comet using a federated operating model. In this architecture, Comet is centrally managed and hosted in a shared services account, and each data science team maintains fully autonomous environments. Each operating model comes with their own sets of benefits and limitations. For more information, refer to SageMaker Studio Administration Best Practices.
Let’s dive into the setup of Comet in SageMaker AI. Large enterprises generally have the following personas:

Administrators – Responsible for setting up the common infrastructure services and environment for use case teams
Users – ML practitioners from use case teams who use the environments set up by the platform team to solve their business problems

In the following sections, we go through each persona’s journey.
Comet works well with both SageMaker AI and Amazon SageMaker. SageMaker AI provides the Amazon SageMaker Studio integrated development environment (IDE), and SageMaker provides the Amazon SageMaker Unified Studio IDE. For this post, we use SageMaker Studio.
Administrator journey
In this scenario, the administrator receives a request from a team working on a fraud detection use case to provision an ML environment with a fully managed training and experimentation setup. The administrator’s journey includes the following steps:

Follow the prerequisites to set up Partner AI Apps. This sets up permissions for administrators, allowing Comet to assume a SageMaker AI execution role on behalf of the users and additional privileges for managing the Comet subscription through AWS Marketplace.
On the SageMaker AI console, under Applications and IDEs in the navigation pane, choose Partner AI Apps, then choose View details for Comet.

The details are shown, including the contract pricing model for Comet and infrastructure tier estimated costs.

Comet provides subscription options ranging from a 1-month to a 36-month contract. With this contract, users can access Comet in SageMaker. Based on the number of users, the admin can choose the appropriate instance size for the Comet dashboard server. Comet supports 5–500 users running more than 100 experiment jobs.

Choose Go to Marketplace to subscribe to be redirected to the Comet listing on AWS Marketplace.
Choose View purchase options.

In the subscription form, provide the required details.

When the subscription is complete, the admin can start configuring Comet.

While deploying Comet, add the project lead of the fraud detection use case team as an admin for the Comet dashboard.

It takes a few minutes for the Comet server to be deployed. For more details on this step, refer to Partner AI App provisioning.

Set up a SageMaker AI domain following the steps in Use custom setup for Amazon SageMaker AI. As a best practice, provide a pre-signed domain URL for the use case team member to directly access the Comet UI without logging in to the SageMaker console.
Add the team members to this domain and enable access to Comet while configuring the domain.

Now the SageMaker AI domain is ready for users to log in to and start working on the fraud detection use case.
User journey
Now let’s explore the journey of an ML practitioner from the fraud detection use case. The user completes the following steps:

Log in to the SageMaker AI domain through the pre-signed URL.

You will be redirected to the SageMaker Studio IDE. Your user name and AWS Identity and Access Management (IAM) execution role are preconfigured by the admin.

Create a JupyterLab Space following the JupyterLab user guide.
You can start working on the fraud detection use case by spinning up a Jupyter notebook.

The admin has also set up required access to the data through an Amazon Simple Storage Service (Amazon S3) bucket.

To access Comet APIs, install the comet_ml library and configure the required environment variables as described in Set up the Amazon SageMaker Partner AI Apps SDKs.
To access the Comet UI, choose Partner AI Apps in the SageMaker Studio navigation pane and choose Open for Comet.

Now, let’s walk through the use case implementation.
Solution overview
This use case highlights common enterprise challenges: working with imbalanced datasets (in this example, only 0.17% of transactions are fraudulent), requiring multiple experiment iterations, and maintaining full reproducibility for regulatory compliance. To follow along, refer to the Comet documentation and Quickstart guide for additional setup and API details.
For this use case, we use the Credit Card Fraud Detection dataset. The dataset contains credit card transactions with binary labels representing fraudulent (1) or legitimate (0) transactions. In the following sections, we walk through some of the important sections of the implementation. The entire code of the implementation is available in the GitHub repository.
Prerequisites
As a prerequisite, configure the necessary imports and environment variables for the Comet and SageMaker integration:

import os

# Comet ML for experiment tracking
import comet_ml
from comet_ml import Experiment, API, Artifact
from comet_ml.integration.sagemaker import log_sagemaker_training_job_v1

# Environment variables required for the Partner AI App integration
os.environ["AWS_PARTNER_APP_AUTH"] = "true"
os.environ["AWS_PARTNER_APP_ARN"] = "<Your_AWS_PARTNER_APP_ARN>"
# From the Comet details page, choose Open Comet; in the top-right corner, choose user -> API Key
os.environ["COMET_API_KEY"] = "<Your_Comet_API_Key>"

# Comet ML configuration
COMET_WORKSPACE = '<your-comet-workspace-name>'
COMET_PROJECT_NAME = '<your-comet-project-name>'

Prepare the dataset
One of Comet’s key enterprise features is automatic dataset versioning and lineage tracking. This capability provides full auditability of what data was used to train each model, which is critical for regulatory compliance and reproducibility. Start by loading the dataset:

# Create a Comet Artifact to track our raw dataset
dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["raw"]
)

# Add the raw dataset file to the artifact as a remote (S3) asset
dataset_artifact.add_remote(s3_data_path, metadata={
    "dataset_stage": "raw",
    "dataset_split": "not_split",
    "preprocessing": "none"
})

Start a Comet experiment
With the dataset artifact created, you can now start tracking the ML workflow. Creating a Comet experiment automatically begins capturing code, installed libraries, system metadata, and other contextual information in the background. You can log the dataset artifact created earlier in the experiment. See the following code:

# Create a new Comet experiment
experiment_1 = comet_ml.Experiment(
    project_name=COMET_PROJECT_NAME,
    workspace=COMET_WORKSPACE,
)

# Log the dataset artifact to this experiment for lineage tracking
experiment_1.log_artifact(dataset_artifact)

Preprocess the data
The next steps are standard preprocessing steps, including removing duplicates, dropping unneeded columns, splitting into train/validation/test sets, and standardizing features using scikit-learn’s StandardScaler. We wrap the processing code in preprocess.py and run it as a SageMaker Processing job. See the following code:

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Run SageMaker processing job
processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.t3.medium'
)

processor.run(
    code='preprocess.py',
    inputs=[ProcessingInput(source=s3_data_path, destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination=f's3://{bucket_name}/{processed_data_prefix}')]
)

After you submit the processing job, SageMaker AI launches the compute instances, processes and analyzes the input data, and releases the resources upon completion. The output of the processing job is stored in the S3 bucket specified.
Next, create a new version of the dataset artifact to track the processed data. Comet automatically versions artifacts with the same name, maintaining complete lineage from raw to processed data.

# Create an updated version of the 'fraud-dataset' Artifact for the preprocessed data
preprocessed_dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["preprocessed"],
    metadata={
        "description": "Credit card fraud detection dataset",
        "fraud_percentage": f"{fraud_percentage:.3f}%",
        "dataset_stage": "preprocessed",
        "preprocessing": "StandardScaler + train/val/test split",
    }
)

# Add our train, validation, and test dataset files as remote assets
preprocessed_dataset_artifact.add_remote(
    uri=f's3://{bucket_name}/{processed_data_prefix}',
    logical_path='split_data'
)

# Log the updated dataset to the experiment to track the updates
experiment_1.log_artifact(preprocessed_dataset_artifact)

The Comet and SageMaker AI experiment workflow
Data scientists prefer rapid experimentation; therefore, we organized the workflow into reusable utility functions that can be called multiple times with different hyperparameters while maintaining consistent logging and evaluation across all runs. In this section, we showcase the utility functions along with a brief snippet of the code inside the function:

train() – Spins up a SageMaker model training job using the SageMaker built-in XGBoost algorithm:

# Create SageMaker estimator
estimator = Estimator(
    image_uri=xgboost_image,
    role=execution_role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=model_output_path,
    sagemaker_session=sagemaker_session_obj,
    hyperparameters=hyperparameters_dict,
    max_run=1800  # Maximum training time in seconds
)

# Start training
estimator.fit({
    'train': train_channel,
    'validation': val_channel
})

log_training_job() – Captures the training metadata and metrics and links the model asset to the experiment for complete traceability:

# Log SageMaker training job to Comet
log_sagemaker_training_job_v1(
    estimator=training_estimator,
    experiment=api_experiment
)

log_model_to_comet() – Links model artifacts to Comet, captures the training metadata, and links the model asset to the experiment for complete traceability:

experiment.log_remote_model(
    model_name=model_name,
    uri=model_artifact_path,
    metadata=metadata
)

deploy_and_evaluate_model() – Performs model deployment and evaluation, and metric logging:

# Deploy to endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

# Log metrics and visualizations to Comet
experiment.log_metrics(metrics)
experiment.log_confusion_matrix(matrix=cm, labels=['Normal', 'Fraud'])

# Log ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob_as_np_array)
experiment.log_curve("roc_curve", x=fpr, y=tpr)

The complete prediction and evaluation code is available in the GitHub repository.
Run the experiments
Now you can run multiple experiments by calling the utility functions with different configurations and compare experiments to find the most optimal settings for the fraud detection use case.
For the first experiment, we establish a baseline using standard XGBoost hyperparameters:

# Define hyperparameters for first experiment
hyperparameters_v1 = {
    'objective': 'binary:logistic',  # Binary classification
    'num_round': 100,                # Number of boosting rounds
    'eval_metric': 'auc',            # Evaluation metric
    'learning_rate': 0.15,           # Learning rate
    'booster': 'gbtree'              # Booster algorithm
}

# Train the model
estimator_1 = train(
    model_output_path=f"s3://{bucket_name}/{model_output_prefix}/1",
    execution_role=role,
    sagemaker_session_obj=sagemaker_session,
    hyperparameters_dict=hyperparameters_v1,
    train_channel_loc=train_channel_location,
    val_channel_loc=validation_channel_location
)

# Log the training job and model artifact
log_training_job(experiment_key=experiment_1.get_key(), training_estimator=estimator_1)
log_model_to_comet(
    experiment=experiment_1,
    model_name="fraud-detection-xgb-v1",
    model_artifact_path=estimator_1.model_data,
    metadata=metadata
)

# Deploy and evaluate
deploy_and_evaluate_model(
    experiment=experiment_1,
    estimator=estimator_1,
    X_test_scaled=X_test_scaled,
    y_test=y_test
)

While running a Comet experiment from a Jupyter notebook, we need to end the experiment to make sure everything is captured and persisted in the Comet server. See the following code: experiment_1.end()
When the baseline experiment is complete, you can run additional experiments with different hyperparameters. Check out the notebook to see the details of both experiments.
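As an illustration only (the notebook contains the actual configuration used), a second run might adjust the hyperparameters to address the class imbalance and reuse the same utility functions:

# Hypothetical second configuration; see the notebook for the exact values used
hyperparameters_v2 = {
    'objective': 'binary:logistic',
    'num_round': 200,
    'eval_metric': 'auc',
    'learning_rate': 0.05,
    'max_depth': 6,
    'scale_pos_weight': 578,  # approximate ratio of legitimate to fraudulent transactions
    'booster': 'gbtree'
}

experiment_2 = comet_ml.Experiment(project_name=COMET_PROJECT_NAME, workspace=COMET_WORKSPACE)

estimator_2 = train(
    model_output_path=f"s3://{bucket_name}/{model_output_prefix}/2",
    execution_role=role,
    sagemaker_session_obj=sagemaker_session,
    hyperparameters_dict=hyperparameters_v2,
    train_channel_loc=train_channel_location,
    val_channel_loc=validation_channel_location
)
log_training_job(experiment_key=experiment_2.get_key(), training_estimator=estimator_2)
deploy_and_evaluate_model(experiment=experiment_2, estimator=estimator_2,
                          X_test_scaled=X_test_scaled, y_test=y_test)
experiment_2.end()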
When the second experiment is complete, navigate to the Comet UI to compare these two experiment runs.
View Comet experiments in the UI
To access the UI, you can locate the URL in the SageMaker Studio IDE or by executing the code provided in the notebook: experiment_2.url
The following screenshot shows the Comet experiments UI. The experiment details are for illustration purposes only and do not represent a real-world fraud detection experiment.

This concludes the fraud detection experiment.
Clean up
For the experimentation part, SageMaker processing and training infrastructure is ephemeral in nature and shuts down automatically when the job is complete. However, you must still manually clean up a few resources to avoid unnecessary costs:

Shut down the SageMaker JupyterLab Space after use. For instructions, refer to Idle shutdown.
The Comet subscription renews based on the contract chosen. Cancel the contract when there is no further requirement to renew the Comet subscription.

Advantages of SageMaker and Comet integration
Having demonstrated the technical workflow, let’s examine the broader advantages this integration provides.
Streamlined model development
The Comet and SageMaker combination reduces the manual overhead of running ML experiments. While SageMaker handles infrastructure provisioning and scaling, Comet’s automatic logging captures hyperparameters, metrics, code, installed libraries, and system performance from your training jobs without additional configuration. This helps teams focus on model development rather than experiment bookkeeping.
Comet’s visualization capabilities extend beyond basic metric plots. Built-in charts enable rapid experiment comparison, and custom Python panels support domain-specific analysis tools for debugging model behavior, optimizing hyperparameters, or creating specialized visualizations that standard tools can’t provide.
Enterprise collaboration and governance
For enterprise teams, the combination creates a mature platform for scaling ML projects across regulated environments. SageMaker provides consistent, secure ML environments, and Comet enables seamless collaboration with complete artifact and model lineage tracking. This helps avoid costly mistakes that occur when teams can’t recreate previous results.
Complete ML lifecycle integration
Unlike point solutions that only address training or monitoring, Comet paired with SageMaker supports your complete ML lifecycle. Models can be registered in Comet’s model registry with full version tracking and governance. SageMaker handles model deployment, and Comet maintains the lineage and approval workflows for model promotion. Comet’s production monitoring capabilities track model performance and data drift after deployment, creating a closed loop where production insights inform your next round of SageMaker experiments.
Conclusion
In this post, we showed how to use SageMaker and Comet together to spin up fully managed ML environments with reproducibility and experiment tracking capabilities.
To enhance your SageMaker workflows with comprehensive experiment management, deploy Comet directly in your SageMaker environment through the AWS Marketplace, and share your feedback in the comments.
For more information about the services and features discussed in this post, refer to the following resources:

Set up Partner AI Apps
Comet Quickstart
GitHub notebook
Comet Documentation
Opik open source platform for LLM observability

About the authors
Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping large financial institutions adopt and scale generative AI and ML workloads. He is the author of the book “Generative AI for financial services.” He carries more than 15 years of experience building enterprise-grade applications on generative AI/ML and related technologies. In his spare time, he plays an unnamed sport with his son that lies somewhere between football and rugby.
Naufal Mir is a Senior GenAI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning workloads to SageMaker. He previously worked at financial services institutes developing and operating systems at scale. Outside of work, he enjoys ultra endurance running and cycling.
Sarah Ostermeier is a Technical Product Marketing Manager at Comet. She specializes in bringing Comet’s GenAI and ML developer products to the engineers who need them through technical content, educational resources, and product messaging. She has previously worked as an ML engineer, data scientist, and customer success manager, helping customers implement and scale AI solutions. Outside of work she enjoys traveling off the beaten path, writing about AI, and reading science fiction.

Understanding the Universal Tool Calling Protocol (UTCP)

The Universal Tool Calling Protocol (UTCP) is a lightweight, secure, and scalable way for AI agents and applications to find and call tools directly, without the need for additional wrapper servers.

Key Features

Lightweight and secure – Allows tools to be accessed directly, avoiding unnecessary middle layers.

Scalable – Can support a large number of tools and providers without losing performance.

Modular design – Version 1.0.0 introduces a plugin-based core, making the protocol easier to extend, test, and package.

Built on Pydantic models – Provides simple, well-defined data structures that make implementation straightforward.

The Problem with Current Approaches

Traditional solutions for integrating tools often require:

Building and maintaining wrapper servers for every tool

Routing all traffic through a central protocol or service

Reimplementing authentication and security for each tool

Accepting additional latency and complexity

These steps add friction for developers and slow down execution.

The UTCP Solution

UTCP offers a better alternative by:

Defining a clear, language-agnostic standard for describing tools and their interfaces

Allowing agents to connect directly to tools using their native communication protocols

Providing an architecture that lets developers add:

New communication protocols (HTTP, SSE, CLI, etc.)

Alternative storage systems

Custom search strategies

All of this can be done without modifying the core library.

By eliminating the need for wrapper servers or other heavy middle layers, UTCP streamlines the way AI agents and applications connect with tools. It reduces latency and overall complexity, since requests no longer have to pass through extra infrastructure. Authentication and security become simpler as well, because UTCP allows agents to use the tool’s existing mechanisms rather than duplicating them in an intermediary service. This leaner approach also makes it easier to build, test, and maintain integrations, while naturally supporting growth as the number of tools and providers increases.

How It Works

UTCP makes tool integration simple and predictable. First, an AI agent discovers your tools by fetching a UTCP manual, which contains definitions and metadata for every capability you expose. Next, the agent learns how to call these tools by reading the manual and understanding the associated call templates. Once the definitions are clear, the agent can invoke your APIs directly using their native communication protocols. Finally, your API processes the request and returns a normal response. This process ensures seamless interoperability without extra middleware or custom translation layers.

Source: https://www.utcp.io/
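
The following sketch makes that flow concrete in Python. It is illustrative only: the manual structure is simplified and does not reproduce the exact UTCP schema, the endpoint URL is hypothetical, and a real client would use the UTCP library’s own discovery and call-template handling.

import requests

# Simplified, illustrative stand-in for a UTCP manual entry (not the exact schema)
manual = {
    "tools": [{
        "name": "get_weather",
        "description": "Return current weather for a city",
        "call_template": {
            "protocol": "http",
            "method": "GET",
            "url": "https://api.example.com/weather",  # hypothetical endpoint
            "query_params": ["city"]
        }
    }]
}

def call_tool(tool, **kwargs):
    """Call the tool directly over its native protocol, as UTCP intends (no wrapper server)."""
    template = tool["call_template"]
    response = requests.request(
        template["method"],
        template["url"],
        params={k: v for k, v in kwargs.items() if k in template["query_params"]}
    )
    response.raise_for_status()
    return response.json()

# result = call_tool(manual["tools"][0], city="Berlin")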

Architecture Overview

Version 1.0 of UTCP introduces a modular, plugin-based architecture designed for scalability and flexibility. At its core are manuals, which define tools and their metadata, as well as call templates that specify how to interact with each tool over different protocols. 

The UTCP Client acts as the engine for discovering tools and executing calls. Around this core is a plugin system that supports protocol adapters, custom communication methods, tool repositories, and search strategies. This separation of concerns makes it easy to extend the system or customize it for a particular environment without altering its foundation.

How is UTCP different from MCP?

UTCP and MCP both help AI agents connect with external tools, but they focus on different needs. UTCP enables direct calls to APIs, CLIs, WebSockets, and other interfaces through simple JSON manuals, keeping infrastructure light and latency low. MCP provides a more structured layer, wrapping tools behind dedicated servers and standardizing communication with JSON-RPC.

Key points:

Architecture: UTCP connects agents straight to tools; MCP uses a server layer for routing.

Performance & Overhead: UTCP minimizes hops; MCP centralizes calls but adds a layer of processing.

Infrastructure: UTCP requires only manuals and a discovery endpoint, while MCP relies on servers for wrapping and routing.

Protocol Support: UTCP works across HTTP, WebSocket, CLI, SSE, and more; MCP focuses on JSON-RPC transport.

Security & Auth: UTCP uses the tool’s existing mechanisms, while MCP manages access inside its servers.

Flexibility: UTCP supports hybrid deployments through its MCP plugin, while MCP offers centralized management and monitoring.

Both approaches are useful: UTCP is ideal for lightweight, flexible integrations, while MCP suits teams wanting a standardized gateway with built-in control.

Conclusion

UTCP is a versatile solution for both tool providers and AI developers. It lets API owners, SaaS providers, and enterprise teams expose services like REST or GraphQL endpoints to AI agents in a simple, secure way. At the same time, developers building agents or applications can use UTCP to connect effortlessly with internal or external tools. By removing complexity and overhead, it streamlines integration and makes it easier for software to access powerful capabilities.


Meta AI Proposes ‘Metacognitive Reuse’: Turning LLM Chains-of-Thought into a Procedural Handbook that Cuts Tokens by 46%

Meta researchers introduced a method that compresses repeated reasoning patterns into short, named procedures—“behaviors”—and then conditions models to use them at inference or distills them via fine-tuning. The result: up to 46% fewer reasoning tokens on MATH while matching or improving accuracy, and up to 10% accuracy gains in a self-improvement setting on AIME, without changing model weights. The work frames this as procedural memory for LLMs—how to reason, not just what to recall—implemented with a curated, searchable “behavior handbook.”

https://arxiv.org/pdf/2509.13237

What problem does this solve?

Long chain-of-thought (CoT) traces repeatedly re-derive common sub-procedures (e.g., inclusion–exclusion, base conversions, geometric angle sums). That redundancy burns tokens, adds latency, and can crowd out exploration. Meta’s idea is to abstract recurring steps into concise, named behaviors (name + one-line instruction) recovered from prior traces via an LLM-driven reflection pipeline, then reuse them during future reasoning. On math benchmarks (MATH-500; AIME-24/25), this reduces output length substantially while preserving or improving solution quality.

How does the pipeline work?

Three roles, one handbook:

Metacognitive Strategist (R1-Llama-70B):

1) solves a problem to produce a trace, 2) reflects on the trace to identify generalizable steps, and 3) emits behaviors as (behavior_name → instruction) entries. These populate a behavior handbook (procedural memory).

Teacher (LLM B): generates behavior-conditioned responses used to build training corpora.

Student (LLM C): consumes behaviors in-context (inference) or is fine-tuned on behavior-conditioned data.

Retrieval is topic-based on MATH and embedding-based (BGE-M3 + FAISS) on AIME.

Prompts: The team provides explicit prompts for solution, reflection, behavior extraction, and behavior-conditioned inference (BCI). In BCI, the model is instructed to reference behaviors explicitly in its reasoning, encouraging consistently short, structured derivations.
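A minimal sketch of the retrieval-plus-prompting side of this pipeline, assuming a generic sentence encoder (as a stand-in for BGE-M3) and a FAISS index over handbook entries; the handbook contents and prompt wording below are illustrative, not the paper's exact behaviors or prompts.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical handbook entries in the (behavior_name -> instruction) form described above.
handbook = {
    "behavior_inclusion_exclusion_principle": "Avoid double counting by subtracting intersections.",
    "behavior_translate_verbal_to_equation": "Formalize word problems as explicit equations.",
    "behavior_distance_from_point_to_line": "Use |Ax+By+C|/sqrt(A^2+B^2) for point-line distance.",
}
names = list(handbook)

# Stand-in embedder (the paper uses BGE-M3; any sentence encoder illustrates the idea).
encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
vecs = encoder.encode([f"{n}: {handbook[n]}" for n in names], normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine on normalized vectors
index.add(np.asarray(vecs, dtype="float32"))

def behavior_conditioned_prompt(question: str, k: int = 2) -> str:
    """Retrieve the top-k behaviors and prepend them to the problem (BCI)."""
    q = encoder.encode([question], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q, k)
    hints = "\n".join(f"- {names[i]}: {handbook[names[i]]}" for i in ids[0])
    return (
        "Use the following behaviors where helpful and cite them by name:\n"
        f"{hints}\n\nProblem: {question}\nSolution:"
    )

print(behavior_conditioned_prompt("How many integers below 100 are divisible by 2 or 3?"))
```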

What are the evaluation modes?

Behavior-Conditioned Inference (BCI): Retrieve K relevant behaviors and prepend them to the prompt.

Behavior-Guided Self-Improvement: Extract behaviors from a model’s own earlier attempts and feed them back as hints for revision.

Behavior-Conditioned SFT (BC-SFT): Fine-tune students on teacher outputs that already follow behavior-guided reasoning, so the behavior usage becomes parametric (no retrieval at test time).

Key results (MATH, AIME-24/25)

Token efficiency: On MATH-500, BCI reduces reasoning tokens by up to 46% versus the same model without behaviors, while matching or improving accuracy. This holds for both R1-Llama-70B and Qwen3-32B students across token budgets (2,048–16,384).

Self-improvement gains: On AIME-24, behavior-guided self-improvement beats a critique-and-revise baseline at nearly every budget, with up to 10% higher accuracy as budgets increase, indicating better test-time scaling of accuracy (not just shorter traces).

BC-SFT quality lift: Across Llama-3.1-8B-Instruct, Qwen2.5-14B-Base, Qwen2.5-32B-Instruct, and Qwen3-14B, BC-SFT consistently outperforms standard SFT and the original base model in accuracy across budgets, while remaining more token-efficient. Importantly, the advantage is not explained by an easier training corpus: teacher correctness rates in the two training sets (original vs. behavior-conditioned) are close, yet BC-SFT students generalize better on AIME-24/25.

Why does this work?

The handbook stores procedural knowledge (how-to strategies), distinct from classic RAG’s declarative knowledge (facts). By converting verbose derivations into short, reusable steps, the model skips re-derivation and reallocates compute to novel subproblems. Behavior prompts serve as structured hints that bias the decoder toward efficient, correct trajectories; BC-SFT then internalizes these trajectories so that behaviors are implicitly invoked without prompt overhead.

What’s inside a “behavior”?

Behaviors range from domain-general reasoning moves to precise mathematical tools, e.g.,

behavior_inclusion_exclusion_principle: avoid double counting by subtracting intersections;

behavior_translate_verbal_to_equation: formalize word problems systematically;

behavior_distance_from_point_to_line: apply |Ax+By+C|/√(A²+B²) for tangency checks.

During BCI, the student explicitly cites behaviors when they’re used, making traces auditable and compact.
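As a quick concrete check of the last behavior, the point-to-line distance it names is the standard formula, which a few lines of Python can verify:

```python
import math

def distance_point_to_line(A: float, B: float, C: float, x0: float, y0: float) -> float:
    """Distance from point (x0, y0) to the line Ax + By + C = 0."""
    return abs(A * x0 + B * y0 + C) / math.sqrt(A**2 + B**2)

# Tangency check example: the line 3x + 4y - 25 = 0 touches the circle x^2 + y^2 = 25
# exactly when its distance from the center (0, 0) equals the radius 5.
assert math.isclose(distance_point_to_line(3, 4, -25, 0, 0), 5.0)
```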

Retrieval and cost considerations

On MATH, behaviors are retrieved by topic; on AIME, top-K behaviors are selected via BGE-M3 embeddings and FAISS. While BCI introduces extra input tokens (the behaviors), input tokens are pre-computable and non-autoregressive, and are often billed cheaper than output tokens on commercial APIs. Since BCI shrinks output tokens, the overall cost can drop while latency improves. BC-SFT eliminates retrieval at test time entirely.
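A back-of-the-envelope illustration of that cost argument, with placeholder per-token prices (not any vendor's actual rates): even after adding behavior tokens to the input, a roughly 46% cut in output tokens can lower the total bill.

```python
# Placeholder prices (USD per 1K tokens) -- illustrative only, not real vendor pricing.
PRICE_IN, PRICE_OUT = 0.003, 0.015  # output tokens commonly cost several times more than input

def cost(input_tokens: float, output_tokens: float) -> float:
    return input_tokens / 1000 * PRICE_IN + output_tokens / 1000 * PRICE_OUT

baseline = cost(input_tokens=500,  output_tokens=4000)         # long, re-derived chain of thought
bci      = cost(input_tokens=1300, output_tokens=4000 * 0.54)  # +800 behavior tokens, ~46% fewer output tokens
print(f"baseline ${baseline:.4f}  vs  BCI ${bci:.4f}")         # BCI is cheaper despite the longer prompt
```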


Summary

Meta’s behavior-handbook approach operationalizes procedural memory for LLMs: it abstracts recurring reasoning steps into reusable “behaviors,” applies them via behavior-conditioned inference or distills them with BC-SFT, and empirically delivers up to 46% fewer reasoning tokens with accuracy that holds or improves (≈10% gains in self-correction regimes). The method is straightforward to integrate—an index, a retriever, optional fine-tuning—and surfaces auditable traces, though scaling beyond math and managing a growing behavior corpus remain open engineering problems.



IBM and ETH Zürich Researchers Unveil Analog Foundation Models to Tackle Noise in In-Memory AI Hardware

IBM researchers, together with ETH Zürich, have unveiled a new class of Analog Foundation Models (AFMs) designed to bridge the gap between large language models (LLMs) and Analog In-Memory Computing (AIMC) hardware. AIMC has long promised a radical leap in efficiency—running models with a billion parameters in a footprint small enough for embedded or edge devices—thanks to dense non-volatile memory (NVM) that combines storage and computation. But the technology’s Achilles’ heel has been noise: performing matrix-vector multiplications directly inside NVM devices yields non-deterministic errors that cripple off-the-shelf models.

Why does analog computing matter for LLMs?

Unlike GPUs or TPUs that shuttle data between memory and compute units, AIMC performs matrix-vector multiplications directly inside memory arrays. This design removes the von Neumann bottleneck and delivers massive improvements in throughput and power efficiency. Prior studies showed that combining AIMC with 3D NVM and Mixture-of-Experts (MoE) architectures could, in principle, support trillion-parameter models on compact accelerators. That could make foundation-scale AI feasible on devices well beyond data-centers.

https://arxiv.org/pdf/2505.09663

What makes Analog In-Memory Computing (AIMC) so difficult to use in practice?

The biggest barrier is noise. AIMC computations suffer from device variability, DAC/ADC quantization, and runtime fluctuations that degrade model accuracy. Unlike quantization on GPUs—where errors are deterministic and manageable—analog noise is stochastic and unpredictable. Earlier research found ways to adapt small networks like CNNs and RNNs (<100M parameters) to tolerate such noise, but LLMs with billions of parameters consistently broke down under AIMC constraints.

How do Analog Foundation Models address the noise problem?

The IBM team introduces Analog Foundation Models, which integrate hardware-aware training to prepare LLMs for analog execution. Their pipeline uses:

Noise injection during training to simulate AIMC randomness.

Iterative weight clipping to stabilize distributions within device limits.

Learned static input/output quantization ranges aligned with real hardware constraints.

Distillation from pre-trained LLMs using 20B tokens of synthetic data.

These methods, implemented with AIHWKIT-Lightning, allow models like Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct to sustain performance comparable to weight-quantized 4-bit / activation 8-bit baselines under analog noise. In evaluations across reasoning and factual benchmarks, AFMs outperformed both quantization-aware training (QAT) and post-training quantization (SpinQuant).
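A simplified PyTorch-style sketch of the first two ingredients, noise injection and weight clipping, applied inside a linear layer's forward pass. The noise model, scale, and clip bound below are illustrative stand-ins, not the calibrated behavior that AIHWKIT-Lightning simulates.

```python
import torch
import torch.nn as nn

class NoisyClippedLinear(nn.Linear):
    """Simplified hardware-aware linear layer: clip weights to a fixed range and add
    multiplicative Gaussian noise during training to mimic AIMC non-idealities.
    The noise scale and clip bound are illustrative, not the paper's calibrated values."""

    def __init__(self, in_features: int, out_features: int, noise_std: float = 0.02, clip: float = 2.5):
        super().__init__(in_features, out_features)
        self.noise_std = noise_std
        self.clip = clip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Weight clipping: keep weights inside the representable device range.
        w = self.weight.clamp(-self.clip, self.clip)
        if self.training:
            # Noise injection: stochastic perturbation of the effective weights.
            w = w * (1 + torch.randn_like(w) * self.noise_std)
        return nn.functional.linear(x, w, self.bias)

layer = NoisyClippedLinear(16, 8)
out = layer(torch.randn(4, 16))  # forward pass with analog-style noise while training
```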

Do these models work only for analog hardware?

No. An unexpected outcome is that AFMs also perform strongly on low-precision digital hardware. Because AFMs are trained to tolerate noise and clipping, they handle simple post-training round-to-nearest (RTN) quantization better than existing methods. This makes them useful not just for AIMC accelerators, but also for commodity digital inference hardware.
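Round-to-nearest quantization is simple enough to show directly. Below is a per-tensor symmetric variant in Python; it is one common formulation, not necessarily the exact configuration used in the paper's evaluation.

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Per-tensor symmetric round-to-nearest quantization, then dequantize back to float."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for INT4
    scale = w.abs().max() / qmax                    # map the largest weight to the top level
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                # dequantized weights used at inference

w = torch.randn(256, 256)
w_q = rtn_quantize(w, bits=4)
print((w - w_q).abs().mean())                       # average error RTN leaves behind
```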

Can performance scale with more compute at inference time?

Yes. The researchers tested test-time compute scaling on the MATH-500 benchmark, generating multiple answers per query and selecting the best via a reward model. AFMs showed better scaling behavior than QAT models, with accuracy gaps shrinking as more inference compute was allocated. This is consistent with AIMC’s strengths—low-power, high-throughput inference rather than training.
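Schematically, the test-time scaling setup is a best-of-N loop: sample several candidate solutions and keep the one a reward model scores highest. In the sketch below, generate and reward are placeholders for whatever sampling model and verifier are used.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Test-time compute scaling: draw n candidate solutions and return the one
    the reward model prefers. `generate` and `reward` are placeholders for the
    actual sampling and scoring models."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(prompt, answer))

# Usage: best_of_n(problem_text, generate=model_sample_fn, reward=reward_model_fn, n=16)
```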


How does this impact the future of Analog In-Memory Computing (AIMC)?

The research team provides the first systematic demonstration that large LLMs can be adapted to AIMC hardware without catastrophic accuracy loss. While training AFMs is resource-heavy and reasoning tasks like GSM8K still show accuracy gaps, the results are a milestone. The combination of energy efficiency, robustness to noise, and cross-compatibility with digital hardware makes AFMs a promising direction for scaling foundation models beyond GPU limits.

Summary

The introduction of Analog Foundation Models marks a critical milestone for scaling LLMs beyond the limits of digital accelerators. By making models robust to the unpredictable noise of analog in-memory computing, the research team shows that AIMC can move from a theoretical promise to a practical platform. While training costs remain high and reasoning benchmarks still show gaps, this work establishes a path toward energy-efficient, large-scale models running on compact hardware, pushing foundation models closer to edge deployment.

