Ant Group Releases Ling 2.0: A Reasoning-First MoE Language Model Series Built on the Principle that Each Activation Enhances Reasoning Capability

How do you build a language model that grows in capacity but keeps the computation for each token almost unchanged? The Inclusion AI team at Ant Group is pushing sparse large models in a methodical way with the release of Ling 2.0. Ling 2.0 is a reasoning-first language model family built on the idea that each activation should translate directly into stronger reasoning behavior. It is one of the latest approaches that shows how to keep activation small while moving from 16B to 1T total parameters without rewriting the recipe. The series has three versions: Ling mini 2.0 at 16B total with 1.4B activated, Ling flash 2.0 in the 100B class with 6.1B activated, and Ling 1T with 1T total and about 50B active per token.

Sparse MoE as the central design

Every Ling 2.0 model uses the same sparse Mixture of Experts layer. Each layer has 256 routed experts and one shared expert. The router picks 8 routed experts for every token, and the shared expert is always on, so about 9 of 257 experts are used per token. That is roughly 3.5 percent activation, which matches the 1/32 activation ratio. The research team reports about 7 times the efficiency of an equivalent dense model, because only a small part of the network is trained and served per token while a very large parameter pool remains available.
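To make the activation arithmetic concrete, here is a minimal sketch of top-k routing and the resulting activation ratio. The routing function and random scores are illustrative stand-ins, not Ling 2.0's actual router code; only the counts (256 routed experts, 1 shared expert, 8 selected per token) come from the description above.

import numpy as np

def route_token(router_logits, k=8):
    # Pick the top-k routed experts for one token (sigmoid scoring, in the aux-loss-free spirit)
    scores = 1.0 / (1.0 + np.exp(-router_logits))
    topk = np.argsort(scores)[-k:]
    return topk, scores[topk]

n_routed, n_shared, k = 256, 1, 8
experts, gates = route_token(np.random.randn(n_routed), k=k)

# Activation accounting: 8 routed experts plus 1 shared expert out of 257 per token
active_fraction = (k + n_shared) / (n_routed + n_shared)
print(f"experts used per token: {k + n_shared}, activation is about {active_fraction:.1%}")
print(f"routed activation ratio: {k}/{n_routed} = 1/{n_routed // k}")

Running it prints roughly 3.5 percent activation and the 1/32 routed ratio mentioned above.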

https://arxiv.org/abs/2510.22115

Ling 2.0 brings coordinated advances across four layers of the stack, covering model architecture, pre-training, post-training, and the underlying FP8 infrastructure:

Model architecture: The architecture is chosen using Ling Scaling Laws, not by trial and error. To support the Ling Scaling Laws, the team runs what they call the Ling Wind Tunnel, a fixed set of small MoE runs trained under the same data and routing rules, then fitted to power laws to predict loss, activation and expert balance at much larger sizes. This gives them a low cost way to choose 1/32 activation, 256 routed experts and 1 shared expert before committing GPUs to 1T scale. Routing is aux-loss-free with sigmoid scoring, and the stack uses QK Norm, MTP loss and partial RoPE to keep depth stable. Because the same law picked the shape, Ling mini 2.0, Ling flash 2.0 and Ling 1T share a consistent architecture across sizes (a toy power-law fit in this spirit is sketched after this list).

Pre training: The series is trained on more than 20T tokens, starting with 4K context and a mix in which reasoning heavy sources such as math and code gradually increase to almost half of the corpus. A later mid training stage extends context to about 32K on a selected 150B token slice, then injects another 600B tokens of high quality chain of thought, before finally stretching to 128K with YaRN while preserving short context quality. This pipeline ensures that long context and reasoning are introduced early, not just added at the SFT step. 

Post training: Alignment is separated into a capability pass and a preference pass. First, Decoupled Fine Tuning teaches the model to switch between quick responses and deep reasoning through different system prompts, then an evolutionary CoT stage expands and diversifies chains, and finally a sentence level policy optimization with a Group Arena Reward aligns outputs to human judgments at fine granularity. This staged alignment is what lets a non thinking base reach strong math, code and instruction performance without inflating every answer.

Infrastructure: Ling 2.0 trains natively in FP8 with safeguards, keeping the loss curve within a small gap of BF16 while gaining about 15 percent utilization on the reported hardware. The larger speedups, around 40 percent, come from heterogeneous pipeline parallelism, interleaved one forward one backward execution, and partitioning that is aware of the MTP block, not from precision alone. Together with Warmup Stable Merge, which replaces learning rate decay by merging checkpoints, this systems stack makes 1T scale runs practical on existing clusters.
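As a rough illustration of the wind tunnel idea referenced in the architecture item above, the sketch below fits a power law to a handful of hypothetical small-run results and extrapolates it. The compute and loss numbers are invented for illustration and are not Ling Scaling Law coefficients.

import numpy as np

# Hypothetical "wind tunnel" results: pairs of (training compute, validation loss) from small runs
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = np.array([3.10, 2.92, 2.74, 2.60, 2.47])

# Fit loss ~ a * compute**slope by linear regression in log-log space
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
predict = lambda c: float(np.exp(intercept) * c ** slope)

# Extrapolate to a much larger budget before committing GPUs to it
print(f"fitted exponent: {slope:.3f}")
print(f"predicted loss at 1e24 FLOPs: {predict(1e24):.3f}")

Warmup Stable Merge is described only at a high level here, so the following is a plain checkpoint-averaging sketch under that assumption; the file names and equal weights are placeholders, not the published procedure.

import torch

def merge_checkpoints(paths, weights=None):
    # Average parameter tensors from several recent checkpoints, a simple stand-in for WSM-style merging
    weights = weights or [1.0 / len(paths)] * len(paths)
    merged = None
    for w, path in zip(weights, paths):
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += w * v.float()
    return merged

# Hypothetical checkpoint files from the stable phase of training:
# merged = merge_checkpoints(["step_100000.pt", "step_101000.pt", "step_102000.pt"])
# torch.save(merged, "merged.pt")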

Understanding the Results

Evaluations are consistent in pattern: small activation MoE models deliver competitive quality while keeping per token compute low. Ling mini 2.0 has 16B total parameters, activates 1.4B per token, and is reported to perform in the 7B to 8B dense band. Ling flash 2.0 keeps the same 1/32 activation recipe, has 100B total parameters, and activates 6.1B per token. Ling 1T is the flagship non thinking model; it has 1T total parameters and about 50B active per token, preserving the 1/32 sparsity and extending the same Ling Scaling Laws to trillion scale.


Key Takeaways

Ling 2.0 is built around a 1/32 activation MoE architecture, selected using Ling Scaling Laws so that 256 routed experts plus 1 shared expert stay optimal from 16B up to 1T.

Ling mini 2.0 has 16B total parameters with 1.4B activated per token and is reported to match 7B to 8B dense models while generating at more than 300 tokens per second in simple QA on H20.

Ling flash 2.0 keeps the same recipe, has 6.1B active parameters and sits in the 100B range, giving a higher capacity option without increasing per token compute.

Ling 1T exposes the full design, 1T total parameters with about 50B active per token, 128K context, and an Evo CoT plus LPO style post training stack to push efficient reasoning.

Across all sizes, efficiency gains above 7 times over dense baselines come from the combination of sparse activation, FP8 training, and a shared training schedule, so quality scales predictably without re tuning compute.

Editorial Comments

This release demonstrates a complete sparse MoE stack. Ling Scaling Laws identify a 1/32 activation as optimal, the architecture locks in 256 routed experts plus 1 shared expert, and the same shape is used from 16B to 1T. Training, context extension and preference optimization are all aligned to that choice, so small activation does not block math, code or long context, and FP8 plus heterogeneous pipelines keep cost in a practical range. It is a clear signal that trillion scale reasoning can be organized around fixed sparsity instead of growing dense compute.

Check out the Weights on HF, Repo and Paper.
The post Ant Group Releases Ling 2.0: A Reasoning-First MoE Language Model Series Built on the Principle that Each Activation Enhances Reasoning Capability appeared first on MarkTechPost.

How to Build Ethically Aligned Autonomous Agents through Value-Guided Reasoning and Self-Correcting Decision-Making Using Open-Source Models

In this tutorial, we explore how we can build an autonomous agent that aligns its actions with ethical and organizational values. We use open-source Hugging Face models running locally in Colab to simulate a decision-making process that balances goal achievement with moral reasoning. Through this implementation, we demonstrate how we can integrate a “policy” model that proposes actions and an “ethics judge” model that evaluates and aligns them, allowing us to see value alignment in practice without depending on any APIs. Check out the FULL CODES here.

!pip install -q transformers torch accelerate sentencepiece

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM

def generate_seq2seq(model, tokenizer, prompt, max_new_tokens=128):
    # Move inputs to the same device as the model so GPU execution works
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else tokenizer.pad_token_id,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def generate_causal(model, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else tokenizer.pad_token_id,
        )
    full_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return full_text[len(prompt):].strip()

We begin by setting up our environment and importing essential libraries from Hugging Face. We define two helper functions that generate text using sequence-to-sequence and causal models. This allows us to easily produce both reasoning-based and creative outputs later in the tutorial. Check out the FULL CODES here.

policy_model_name = "distilgpt2"
judge_model_name = "google/flan-t5-small"

policy_tokenizer = AutoTokenizer.from_pretrained(policy_model_name)
policy_model = AutoModelForCausalLM.from_pretrained(policy_model_name)

judge_tokenizer = AutoTokenizer.from_pretrained(judge_model_name)
judge_model = AutoModelForSeq2SeqLM.from_pretrained(judge_model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
policy_model = policy_model.to(device)
judge_model = judge_model.to(device)

if policy_tokenizer.pad_token is None:
    policy_tokenizer.pad_token = policy_tokenizer.eos_token
if judge_tokenizer.pad_token is None:
    judge_tokenizer.pad_token = judge_tokenizer.eos_token

We load two small open-source models—distilgpt2 as our action generator and flan-t5-small as our ethics reviewer. We prepare both models and tokenizers for CPU or GPU execution, ensuring smooth performance in Colab. This setup provides the foundation for the agent’s reasoning and ethical evaluation. Check out the FULL CODES here.

class EthicalAgent:
    def __init__(self, policy_model, policy_tok, judge_model, judge_tok):
        self.policy_model = policy_model
        self.policy_tok = policy_tok
        self.judge_model = judge_model
        self.judge_tok = judge_tok

    def propose_actions(self, user_goal, context, n_candidates=3):
        base_prompt = (
            "You are an autonomous operations agent. "
            "Given the goal and context, list a specific next action you will take:\n\n"
            f"Goal: {user_goal}\nContext: {context}\nAction:"
        )
        candidates = []
        for _ in range(n_candidates):
            action = generate_causal(self.policy_model, self.policy_tok, base_prompt, max_new_tokens=40)
            action = action.split("\n")[0]
            candidates.append(action.strip())
        return list(dict.fromkeys(candidates))

    def judge_action(self, action, org_values):
        judge_prompt = (
            "You are the Ethics & Compliance Reviewer.\n"
            "Evaluate the proposed agent action.\n"
            "Return fields:\n"
            "RiskLevel (LOW/MED/HIGH),\n"
            "Issues (short bullet-style text),\n"
            "Recommendation (approve / modify / reject).\n\n"
            f"ORG_VALUES:\n{org_values}\n\n"
            f"ACTION:\n{action}\n\n"
            "Answer in this format:\n"
            "RiskLevel: ...\nIssues: ...\nRecommendation: ..."
        )
        verdict = generate_seq2seq(self.judge_model, self.judge_tok, judge_prompt, max_new_tokens=128)
        return verdict.strip()

    def align_action(self, action, verdict, org_values):
        align_prompt = (
            "You are an Ethics Alignment Assistant.\n"
            "Your job is to FIX the proposed action so it follows ORG_VALUES.\n"
            "Keep it effective but safe, legal, and respectful.\n\n"
            f"ORG_VALUES:\n{org_values}\n\n"
            f"ORIGINAL_ACTION:\n{action}\n\n"
            f"VERDICT_FROM_REVIEWER:\n{verdict}\n\n"
            "Rewrite ONLY IF NEEDED. If original is fine, return it unchanged. "
            "Return just the final aligned action:"
        )
        aligned = generate_seq2seq(self.judge_model, self.judge_tok, align_prompt, max_new_tokens=128)
        return aligned.strip()

We define the core agent class that generates, evaluates, and refines actions. Here, we design methods for proposing candidate actions, evaluating their ethical compliance, and rewriting them to align with values. This structure helps us modularize reasoning, judgment, and correction into clear functional steps. Check out the FULL CODES here.

    # (continuation of the EthicalAgent class)
    def decide(self, user_goal, context, org_values, n_candidates=3):
        proposals = self.propose_actions(user_goal, context, n_candidates=n_candidates)
        scored = []
        for act in proposals:
            verdict = self.judge_action(act, org_values)
            aligned_act = self.align_action(act, verdict, org_values)
            scored.append({"original_action": act, "review": verdict, "aligned_action": aligned_act})

        def extract_risk(vtext):
            for line in vtext.splitlines():
                if "RiskLevel" in line:
                    lvl = line.split(":", 1)[-1].strip().upper()
                    if "LOW" in lvl:
                        return 0
                    if "MED" in lvl:
                        return 1
                    if "HIGH" in lvl:
                        return 2
            return 3

        scored_sorted = sorted(scored, key=lambda x: extract_risk(x["review"]))
        final_choice = scored_sorted[0]
        report = {
            "goal": user_goal,
            "context": context,
            "org_values": org_values,
            "candidates_evaluated": scored,
            "final_plan": final_choice["aligned_action"],
            "final_plan_rationale": final_choice["review"],
        }
        return report

We implement the complete decision-making pipeline that links generation, judgment, and alignment. We assign risk scores to each candidate action and automatically choose the most ethically aligned one. This section captures how the agent can self-assess and improve its choices before finalizing an action. Check out the FULL CODES here.

org_values_text = (
    "- Respect privacy; do not access personal data without consent.\n"
    "- Follow all laws and safety policies.\n"
    "- Avoid discrimination, harassment, or harmful manipulation.\n"
    "- Be transparent and truthful with stakeholders.\n"
    "- Prioritize user well-being and long-term trust over short-term gain."
)

demo_goal = "Increase customer adoption of the new financial product."
demo_context = (
    "The agent works for a bank outreach team. The target customers are small family businesses. "
    "Regulations require honest disclosure of risks and fees. Cold-calling minors or lying about terms is illegal."
)

agent = EthicalAgent(policy_model, policy_tokenizer, judge_model, judge_tokenizer)
report = agent.decide(demo_goal, demo_context, org_values_text, n_candidates=4)

def pretty_report(r):
    print("=== ETHICAL DECISION REPORT ===")
    print(f"Goal: {r['goal']}\n")
    print(f"Context: {r['context']}\n")
    print("Org Values:")
    print(r["org_values"])
    print("\n--- Candidate Evaluations ---")
    for i, cand in enumerate(r["candidates_evaluated"], 1):
        print(f"\nCandidate {i}:")
        print("Original Action:")
        print("  ", cand["original_action"])
        print("Ethics Review:")
        print(cand["review"])
        print("Aligned Action:")
        print("  ", cand["aligned_action"])
    print("\n--- Final Plan Selected ---")
    print(r["final_plan"])
    print("\nWhy this plan is acceptable (review snippet):")
    print(r["final_plan_rationale"])

pretty_report(report)

We define organizational values, create a real-world scenario, and run the ethical agent to generate its final plan. Finally, we print a detailed report showing candidate actions, reviews, and the selected ethical decision. Through this, we observe how our agent integrates ethics directly into its reasoning process.

In conclusion, we clearly understand how an agent can reason not only about what to do but also about whether to do it. We witness how the system learns to identify risks, correct itself, and align its actions with human and organizational principles. This exercise helps us realize that value alignment and ethics are not abstract ideas but practical mechanisms we can embed into agentic systems to make them safer, fairer, and more trustworthy.

Check out the FULL CODES here.
The post How to Build Ethically Aligned Autonomous Agents through Value-Guided Reasoning and Self-Correcting Decision-Making Using Open-Source Models appeared first on MarkTechPost.

IBM AI Team Releases Granite 4.0 Nano Series: Compact and Open-Source Small Models Built for AI at the Edge

Small models are often blocked by poor instruction tuning, weak tool use formats, and missing governance. The IBM AI team released Granite 4.0 Nano, a small model family that targets local and edge inference with enterprise controls and open licensing. The family includes 8 models in two sizes, 350M and about 1B, with both hybrid SSM and transformer variants, each in base and instruct form. The Granite 4.0 Nano models are released under the Apache 2.0 license with native architecture support on popular runtimes such as vLLM, llama.cpp, and MLX.

https://huggingface.co/blog/ibm-granite/granite-4-nano

What is new in Granite 4.0 Nano series?

Granite 4.0 Nano consists of four model lines and their base counterparts. Granite 4.0 H 1B uses a hybrid SSM based architecture and is about 1.5B parameters. Granite 4.0 H 350M uses the same hybrid approach at 350M. For maximum runtime portability, IBM also provides Granite 4.0 1B and Granite 4.0 350M as pure transformer versions.

| Granite release | Sizes in release | Architecture | License and governance | Key notes |
|---|---|---|---|---|
| Granite 13B, first watsonx Granite models | 13B base, 13B instruct, later 13B chat | Decoder only transformer, 8K context | IBM enterprise terms, client protections | First public Granite models for watsonx, curated enterprise data, English focus |
| Granite Code Models (open) | 3B, 8B, 20B, 34B code, base and instruct | Decoder only transformer, 2 stage code training on 116 languages | Apache 2.0 | First fully open Granite line, for code intelligence, paper 2405.04324, available on HF and GitHub |
| Granite 3.0 Language Models | 2B and 8B, base and instruct | Transformer, 128K context for instruct | Apache 2.0 | Business LLMs for RAG, tool use, summarization, shipped on watsonx and HF |
| Granite 3.1 Language Models (HF) | 1B A400M, 3B A800M, 2B, 8B | Transformer, 128K context | Apache 2.0 | Size ladder for enterprise tasks, both base and instruct, same Granite data recipe |
| Granite 3.2 Language Models (HF) | 2B instruct, 8B instruct | Transformer, 128K, better long prompt | Apache 2.0 | Iterative quality bump on 3.x, keeps business alignment |
| Granite 3.3 Language Models (HF) | 2B base, 2B instruct, 8B base, 8B instruct, all 128K | Decoder only transformer | Apache 2.0 | Latest 3.x line on HF before 4.0, adds FIM and better instruction following |
| Granite 4.0 Language Models | 3B micro, 3B H micro, 7B H tiny, 32B H small, plus transformer variants | Hybrid Mamba 2 plus transformer for H, pure transformer for compatibility | Apache 2.0, ISO 42001, cryptographically signed | Start of hybrid generation, lower memory, agent friendly, same governance across sizes |
| Granite 4.0 Nano Language Models | 1B H, 1B H instruct, 350M H, 350M H instruct, 2B transformer, 2B transformer instruct, 0.4B transformer, 0.4B transformer instruct, total 8 | H models are hybrid SSM plus transformer, non H are pure transformer | Apache 2.0, ISO 42001, signed, same 4.0 pipeline | Smallest Granite models, made for edge, local and browser, run on vLLM, llama.cpp, MLX, watsonx |

Table created by Marktechpost.com

Architecture and training

The H variants interleave SSM layers with transformer layers. This hybrid design reduces memory growth compared with pure attention while preserving the generality of transformer blocks. The Nano models did not use a reduced data pipeline; they were trained with the same Granite 4.0 methodology on more than 15T tokens, then instruction tuned to deliver solid tool use and instruction following. This carries strengths from the larger Granite 4.0 models over to sub 2B scales.
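A quick way to see why interleaving SSM blocks with attention blocks reduces memory growth is to compare KV cache size, since SSM blocks keep a constant-size state and do not grow a cache with sequence length. The layer counts, head sizes, and mixing ratio below are illustrative assumptions, not IBM's published Granite 4.0 Nano configuration.

def kv_cache_bytes(n_attn_layers, seq_len, n_kv_heads=8, head_dim=64, bytes_per_el=2):
    # K and V caches: 2 tensors per attention layer, each of shape [seq_len, n_kv_heads, head_dim]
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_el

seq_len = 32_000
pure_attention = kv_cache_bytes(n_attn_layers=24, seq_len=seq_len)
hybrid = kv_cache_bytes(n_attn_layers=4, seq_len=seq_len)  # most layers are SSM, cache-free in this sketch
print(f"pure attention: {pure_attention / 1e9:.2f} GB, hybrid: {hybrid / 1e9:.2f} GB")

With these assumed numbers the hybrid stack carries roughly one sixth of the KV cache, which is the kind of saving that matters on edge devices.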

Benchmarks and competitive context

IBM compares Granite 4.0 Nano with other models under 2B parameters, including Qwen, Gemma, and LiquidAI LFM. Reported aggregates show a significant increase in capability across general knowledge, math, code, and safety at similar parameter budgets. On agent tasks, the models outperform several peers on IFEval and on the Berkeley Function Calling Leaderboard v3.


Key Takeaways

IBM released 8 Granite 4.0 Nano models, 350M and about 1B each, in hybrid SSM and transformer variants, in base and instruct, all under Apache 2.0.

The hybrid H models, Granite 4.0 H 1B at about 1.5B parameters and Granite 4.0 H 350M at about 350M, reuse the Granite 4.0 training recipe on more than 15T tokens, so capability is inherited from the larger family and not a reduced data branch.

IBM team reports that Granite 4.0 Nano is competitive with other sub 2B models such as Qwen, Gemma and LiquidAI LFM on general, math, code and safety, and that it outperforms on IFEval and BFCLv3 which matter for tool using agents.

All Granite 4.0 models, including Nano, are cryptographically signed, ISO 42001 certified and released for enterprise use, which gives provenance and governance that typical small community models do not provide.

The models are available on Hugging Face and IBM watsonx.ai with runtime support for vLLM, llama.cpp and MLX, which makes local, edge and browser level deployments realistic for early AI engineers and software teams.

Editorial Comments

IBM is doing the right thing here, it is taking the same Granite 4.0 training pipeline, the same 15T token scale, the same hybrid Mamba 2 plus transformer architecture, and pushing it down to 350M and about 1B so that edge and on device workloads can use the exact governance and provenance story that the larger Granite models already have. The models are Apache 2.0, ISO 42001 aligned, cryptographically signed, and already runnable on vLLM, llama.cpp and MLX. Overall, this is a clean and auditable way to run small LLMs.

Check out the Model Weights on HF and Technical details.
The post IBM AI Team Releases Granite 4.0 Nano Series: Compact and Open-Source Small Models Built for AI at the Edge appeared first on MarkTechPost.

Reduce CAPTCHAs for AI agents browsing the web with Web Bot Auth (Preview)

AI agents need to browse the web on your behalf. When your agent visits a website to gather information, complete a form, or verify data, it encounters the same defenses designed to stop unwanted bots: CAPTCHAs, rate limits, and outright blocks.
Today, we are excited to share that AWS has a solution. Amazon Bedrock AgentCore Browser, our secure, cloud-based browser for AI agents to interact with websites, now supports Web Bot Auth (in preview), a draft IETF protocol that gives agents verifiable cryptographic identities.
CAPTCHA friction
Customers tell us that CAPTCHA friction is one of the biggest obstacles to reliable browser-based agentic workflows. Your agent halts mid-task, waiting for human intervention to solve a puzzle that proves you’re not a bot – except your agent is a bot, and that’s the point. CAPTCHAs exist for good reason. Websites face constant challenges protecting their content, inventory and reviews. Web Application Firewalls (WAFs) and bot detection services protect these sites, but they treat nearly all automated traffic as suspicious because they have no reliable way to distinguish legitimate agents from malicious ones.
Some automation providers try to solve CAPTCHAs programmatically, using computer vision models to read distorted text or clicking through image grids until the puzzle clears. This approach is brittle and expensive, and it bypasses controls that domain owners intended for their content. Other approaches rely on IP allowlists or User-Agent strings. IP allowlists break when you run agents in cloud environments where addresses change frequently. User-Agent strings can be spoofed by anyone, so they provide no verification, and they create a risk of people impersonating well trusted strings. Both methods require manual coordination with every website you want to access, which does not scale.
Web Bot Auth: Cryptographic identity for agents browsing the web
Web Bot Auth is a draft IETF protocol that gives agents verifiable cryptographic identities. When you enable Web Bot Auth in AgentCore Browser, we issue cryptographic credentials that websites can verify. The agent presents these credentials with every request. The WAF may now additionally check the signature, confirm it matches a trusted directory, and allow the request through if verified bots are allowed by the domain owner and other WAF checks are clear.
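Conceptually, the agent holds a private key and attaches a detached signature over selected request components, in the spirit of the draft HTTP Message Signatures mechanism that Web Bot Auth builds on. The sketch below is a simplified illustration using Ed25519; the header contents, signature base, and key management are not the exact wire format AgentCore Browser uses, and the key ID is a made-up example.

import base64
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()  # in the real flow, credentials are issued and managed for the agent

def sign_request(method: str, authority: str, path: str, key_id: str) -> dict:
    # Build a simplified signature base over request components and sign it
    signature_base = f"@method: {method}\n@authority: {authority}\n@path: {path}"
    signature = private_key.sign(signature_base.encode())
    return {
        "Signature-Input": f'sig1=("@method" "@authority" "@path");keyid="{key_id}"',
        "Signature": "sig1=:" + base64.b64encode(signature).decode() + ":",
    }

headers = sign_request("GET", "example.com", "/", key_id="agentcore-demo-key")
print(headers)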
AgentCore is working with Cloudflare, HUMAN Security, and Akamai Technologies to support this verification flow. These providers protect millions of websites. When you create an AgentCore Browser with signing enabled in the configuration, we automatically register your agent’s signature directory with these providers. Many domains already configure their WAFs to allow verified bots by default, which means you can see an immediate reduction in CAPTCHAs on those domains without additional setup.
How domain owners control access
WAF providers give website owners three levels of control using Web Bot Auth:

Block all bots – Some sites choose to block automated traffic entirely. Web Bot Auth does not bypass this – if a domain wants no automation, that choice is respected.
Allow verified bots – Many domains configure their WAF to allow any bot that presents a valid cryptographic signature. This is the default policy for a growing number of sites protected by Cloudflare, HUMAN Security, and Akamai Technologies. When you enable signing, as a parameter in the AgentCore Browser configuration, this policy will apply to your agents.
Allow specific verified bots to conduct only specific actions – For example, a financial services company automating vendor portal access can share its unique directory with those vendors. The vendor can create rules like “allow FinCo agents at 100 requests per minute, don’t allow them to create new accounts, and block all other signed agents.” This gives websites granular control while preserving the benefits of cryptographic verification.

Today’s preview release of Web Bot Auth support in AgentCore Browser helps reduce friction with CAPTCHAs on domains that allow verified bots, by making your agent appear as a verified bot. Once the Web Bot Auth protocol is finalized, AgentCore intends to transition to customer-specific keys, so AgentCore users can use the tier of control that allows only specified verified bots.
Using the Web Bot Auth protocol
To enable the browser to sign requests using the Web Bot Auth protocol, create a browser tool with the browserSigning configuration:

import boto3

cp_client = boto3.client("bedrock-agentcore-control")
response = cp_client.create_browser(
    name="signed_browser",
    description="Browser tool with Web Bot Auth enabled",
    networkConfiguration={
        "networkMode": "PUBLIC"
    },
    executionRoleArn="arn:aws:iam::123456789012:role/AgentCoreExecutionRole",
    browserSigning={
        "enabled": True
    }
)
browserId = response["browserId"]

Pass the browser identifier to your agent framework. Here is an example using Strands Agents:

from strands import Agent
from strands_tools.browser import AgentCoreBrowser

agent_core_browser = AgentCoreBrowser(
    region="us-west-2",
    identifier=browserId
)
strands_agent = Agent(
    tools=[agent_core_browser.browser],
    model="anthropic.claude-4-5-haiku-20251001-v1:0",
    system_prompt="You are a website analyst. Use the browser tool efficiently."
)
result = strands_agent("Analyze the website at https://example.com/")

The agent is now configured to use the new browser tool that signs every HTTP request. Websites protected by Cloudflare, HUMAN Security, or Akamai Technologies can verify the signature and allow the request through without presenting a CAPTCHA, if the domain owner allows verified bots.
Protocol development
The Web Bot Auth protocol is gaining industry momentum because it solves a real problem: legitimate automation is indistinguishable from abuse without verifiable identity. You can read the draft protocol specification, HTTP Message Signatures for automated traffic Architecture. The architecture defines how agents generate signatures, how WAFs verify them, and how key directories enable discovery. Amazon is working with Cloudflare and many popular WAF providers to help finalize the customer-specific key directory format and work towards finalizing the draft.
Conclusion
Amazon Bedrock AgentCore Browser is generally available, with the Web Bot Auth feature available in preview. AgentCore Browser signing requests using the Web Bot Auth protocol help reduce friction with CAPTCHA across domains that allow verified bots. As the protocol finalizes, AgentCore Browser intends to issue customer-specific keys and directories, so you can prove your agent’s identity to specific websites and establish trust relationships directly with the domains you need to access.
Web Bot Auth enables agents to prove their identity when challenged, reduces operational friction in automated workflows, and gives website owners control over which agents access their resources. Amazon Bedrock AgentCore Browser support for Web Bot Auth (Preview) provides the infrastructure layer that makes this possible.

About the authors
Veda Raman is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Veda works with customers to help them architect efficient, secure, and scalable machine learning applications. Veda specializes in generative AI services like Amazon Bedrock and Amazon SageMaker.
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.
Joshua Samuel is a Senior AI/ML Specialist Solutions Architect at AWS who accelerates enterprise transformation through AI/ML, and generative AI solutions, based in Melbourne, Australia. A passionate disrupter, he specializes in agentic AI and coding techniques – Anything that makes builders faster and happier.

Microsoft Releases Agent Lightning: A New AI Framework that Enables Reinforcement Learning (RL)-based Training of LLMs for Any AI Agent

How do you convert real agent traces into reinforcement learning (RL) transitions to improve policy LLMs without changing your existing agent stack? The Microsoft AI team has released Agent Lightning to help optimize multi-agent systems. Agent Lightning is an open-source framework that makes reinforcement learning work for any AI agent without rewrites. It separates training from execution, defines a unified trace format, and introduces LightningRL, a hierarchical method that converts complex agent runs into transitions that standard single turn RL trainers can optimize.

What does Agent Lightning do?

The framework models an agent as a decision process. It formalizes the agent as a partially observable Markov decision process where the observation is the current input to the policy LLM, the action is the model call, and the reward can be terminal or intermediate. From each run it extracts only the calls made by the policy model, along with inputs, outputs, and rewards. This trims away other framework noise and yields clean transitions for training.

LightningRL performs credit assignment across multi step episodes, then optimizes the policy with a single turn RL objective. The research team describes compatibility with single turn RL methods. In practice, teams often use trainers that implement PPO or GRPO, such as VeRL, which fits this interface.

https://arxiv.org/pdf/2508.03680v1

System architecture

Agent Lightning uses Training Agent Disaggregation. A Lightning Server runs training and serving, and exposes an OpenAI like API for the updated model. A Lightning Client runs the agent runtime where it already lives, captures traces of prompts, tool calls, and rewards, and streams them back to the server. This keeps tools, browsers, shells, and other dependencies close to production while the GPU training stays in the server tier.


The runtime supports two tracing paths. A default path uses OpenTelemetry spans, so you can pipe agent telemetry through standard collectors. There is also a lightweight embedded tracer for teams that do not want to deploy OpenTelemetry. Both paths end up in the same store for training.


Unified data interface

Agent Lightning records each model call and each tool call as a span with inputs, outputs, and metadata. The algorithm layer adapts spans into ordered triplets of prompt, response, and reward. This selective extraction lets you optimize one agent in a multi agent workflow, or multiple agents at once, without touching orchestration code. The same traces can also drive automatic prompt optimization or supervised finetuning.
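The following is a minimal sketch of that idea: filter a trace down to policy-model spans and emit (prompt, response, reward) triplets. The span schema and field names are invented for illustration and are not Agent Lightning's actual data format.

from dataclasses import dataclass

@dataclass
class Span:
    kind: str      # "llm_call" or "tool_call", a hypothetical label
    inputs: str
    outputs: str
    reward: float = 0.0

def spans_to_transitions(spans):
    # Keep only policy-model calls and emit (prompt, response, reward) triplets
    return [(s.inputs, s.outputs, s.reward) for s in spans if s.kind == "llm_call"]

trace = [
    Span("llm_call", "User asks for total revenue by region. Write SQL.", "SELECT region, SUM(revenue) ..."),
    Span("tool_call", "execute_sql(...)", "rows: 12"),
    Span("llm_call", "Rows returned, summarize for the user.", "Revenue is highest in EMEA ...", reward=1.0),
]
print(spans_to_transitions(trace))

Only the two model calls survive the extraction, which is what lets one agent in a larger workflow be optimized without touching the orchestration code.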


Experiments and datasets

The research team reports three tasks. For text to SQL, the team uses the Spider benchmark. Spider contains more than 10,000 questions across 200 databases that span 138 domains. The policy model is Llama 3.2 3B Instruct. The implementation uses LangChain with a writer agent, a rewriter agent, and a checker. The writer and the rewriter are optimized, and the checker is left fixed. Rewards improve steadily during training and at test time.


For retrieval augmented generation, the setup uses the MuSiQue benchmark and a Wikipedia scale index with about 21 million documents. The retriever uses BGE embeddings with cosine similarity. The agent is built with the OpenAI Agents SDK. The reward is a weighted sum of a format score and an F1 correctness score. Reward curves show stable gains during training and evaluation with the same base model.
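As a sketch of such a reward, the function below combines a format flag with a token-level F1 score. The 0.2 and 0.8 weights are placeholders, since the exact weighting is not specified here, and the F1 implementation is a generic one rather than the paper's evaluation code.

def f1_score(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if not common:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def rag_reward(answer: str, reference: str, well_formatted: bool,
               w_format: float = 0.2, w_f1: float = 0.8) -> float:
    # Weighted sum of a format score and an F1 correctness score
    return w_format * float(well_formatted) + w_f1 * f1_score(answer, reference)

print(rag_reward("The capital of France is Paris", "Paris", well_formatted=True))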


For math question answering with tool use, the agent is implemented with AutoGen and calls a calculator tool. The dataset is Calc X. The base model again is Llama 3.2 3B Instruct. Training improves the ability to invoke tools correctly and integrate results into final answers.


Key Takeaways

Agent Lightning uses Training Agent Disaggregation and a unified trace interface, so existing agents in LangChain, OpenAI Agents SDK, AutoGen, or CrewAI connect with near zero code change.

LightningRL converts trajectories to transitions. It applies credit assignment to multi step runs, then optimizes the policy with single turn RL methods such as PPO or GRPO in standard trainers.

Automatic Intermediate Rewarding, AIR, supplies dense feedback. AIR turns system signals such as tool return status into intermediate rewards to reduce sparse reward issues in long workflows.

The research evaluates text to SQL on Spider, RAG on MuSiQue with a Wikipedia scale index using BGE embeddings and cosine similarity, and math tool use on Calc X, all with Llama 3.2 3B Instruct as the base model.

The runtime records traces through OpenTelemetry, streams them to the training server, and exposes an OpenAI compatible endpoint for updated models, enabling scalable rollouts without moving tools.

Editorial Comments

Agent Lightning is a practical bridge between agent execution and reinforcement learning, not another framework rewrite. It formalizes agent runs as a Markov decision process (MDP), introduces LightningRL for credit assignment, and extracts transitions that slot into single turn RL trainers. The Training Agent Disaggregation design separates a client that runs the agent from a server that trains and serves an OpenAI compatible endpoint, so teams keep existing stacks. Automatic Intermediate Rewarding converts runtime signals into dense feedback, reducing sparse rewards in long workflows. Overall, Agent Lightning is a clean, minimal-integration path to make agents learn from their own traces.

Check out the Paper and Repo.
The post Microsoft Releases Agent Lightning: A New AI Framework that Enables Reinforcement Learning (RL)-based Training of LLMs for Any AI Agent appeared first on MarkTechPost.

Liquid AI Releases LFM2-ColBERT-350M: A New Small Model that brings Late Interaction Retrieval to Multilingual and Cross-Lingual RAG

Can a compact late interaction retriever index once and deliver accurate cross lingual search with fast inference? Liquid AI released LFM2-ColBERT-350M, a compact late interaction retriever for multilingual and cross-lingual search. Documents can be indexed in one language, queries can be written in many languages, and the system retrieves with high accuracy. The Liquid AI team reports inference speed on par with models that are 2.3 times smaller, which is attributed to the LFM2 backbone. The model is available with a Hugging Face demo and a detailed model card for integration in retrieval augmented generation systems.

https://www.liquid.ai/blog/lfm2-colbert-350m-one-model-to-embed-them-all

What late interaction means and why it matters

Most production systems use bi-encoders for speed or cross encoders for accuracy. Late interaction aims to combine both advantages. Queries and documents are encoded separately at the token level. The system compares token vectors at query time using operations such as MaxSim. This preserves fine grained token interactions without the full cost of joint cross attention. It allows pre-computation for documents and improves precision at ranking time. It can serve as a first stage retriever and also as a ranker in one pass.
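A small sketch of MaxSim scoring makes the mechanics concrete: each query token takes its maximum cosine similarity over the document token vectors, and those per-token maxima are summed. The tensors below are random stand-ins; only the MaxSim operation and the 128-dimensional token embeddings mirror the model specification described below.

import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: [num_query_tokens, dim], doc_emb: [num_doc_tokens, dim]
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T                       # token-to-token cosine similarities
    return sim.max(dim=-1).values.sum() # MaxSim: best match per query token, then sum

query = torch.randn(12, 128)    # 12 query tokens, 128-dim embeddings
doc = torch.randn(300, 128)     # precomputed document token embeddings
print(maxsim_score(query, doc).item())

Because the document side can be encoded and stored ahead of time, only the query encoding and the MaxSim comparison happen at query time.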

Model specification

LFM2-ColBERT-350M has 350 million total parameters. There are 25 layers, with 18 convolution blocks, 6 attention blocks, and 1 dense layer. The context length is 32k tokens. The vocabulary size is 65,536. The similarity function is MaxSim. The output dimensionality is 128. Training precision is BF16. The license is LFM Open License v1.0.

https://huggingface.co/LiquidAI/LFM2-ColBERT-350M

Languages, supported and evaluated

The model supports 8 languages. These are English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish. The evaluation adds Italian and Portuguese, which brings the matrix to 9 languages for cross comparisons of document and query languages. This distinction is relevant when planning deployments that must cover specific customer markets.


Evaluation setup and key results

Liquid AI extends the NanoBEIR benchmark with Japanese and Korean and publishes the extension for reproducibility. On this setup, LFM2-ColBERT-350M shows stronger multilingual capability than the baseline late interaction model in this class, which is GTE-ModernColBERT-v1 at 150M parameters. The largest gains appear in German, Arabic, Korean, and Japanese, while English performance is maintained.

Key Takeaways

Token-level scoring with MaxSim preserves fine-grained interactions while keeping separate encoders, so document embeddings can be precomputed and queried efficiently.

Documents can be indexed in one language and retrieved in many. The model card lists 8 supported languages, while evaluations span 9 languages for cross-lingual pairs.

On the NanoBEIR multilingual extension, LFM2-ColBERT-350M outperforms the prior late-interaction baseline (GTE-ModernColBERT-v1 at 150M) and maintains English performance.

Inference speed is reported on par with models 2.3× smaller across batch sizes, attributed to the LFM2 backbone.

Editorial Notes

Liquid AI’s LFM2-ColBERT-350M applies late interaction ColBERT with MaxSim: it encodes queries and documents separately, then scores token vectors at query time, which preserves token level interactions and enables precomputed document embeddings at scale. It targets multilingual and cross lingual retrieval, index once and query in many languages, with evaluations described on a NanoBEIR multilingual extension. The Liquid AI team reports inference speed on par with models 2.3 times smaller, attributed to the LFM2 backbone. Overall, late interaction at the nano scale looks production ready for multilingual RAG trials.

Check out the Model Weights, Demo and Technical details.
The post Liquid AI Releases LFM2-ColBERT-350M: A New Small Model that brings Late Interaction Retrieval to Multilingual and Cross-Lingual RAG appeared first on MarkTechPost.

How Exploration Agents like Q-Learning, UCB, and MCTS Collaboratively Learn Intelligent Problem-Solving Strategies in Dynamic Grid Environments

In this tutorial, we explore how exploration strategies shape intelligent decision-making through agent-based problem solving. We build and train three agents, Q-Learning with epsilon-greedy exploration, Upper Confidence Bound (UCB), and Monte Carlo Tree Search (MCTS), to navigate a grid world and reach a goal efficiently while avoiding obstacles. Also, we experiment with different ways of balancing exploration and exploitation, visualize learning curves, and compare how each agent adapts and performs under uncertainty. Check out the FULL CODES here.

import numpy as np
import random
from collections import defaultdict, deque
import math
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict

class GridWorld:
    # Action set shared by the environment and agents: up, down, left, right
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def __init__(self, size=10, n_obstacles=15):
        self.size = size
        self.grid = np.zeros((size, size))
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        obstacles = set()
        while len(obstacles) < n_obstacles:
            obs = (random.randint(0, size - 1), random.randint(0, size - 1))
            if obs not in [self.start, self.goal]:
                obstacles.add(obs)
                self.grid[obs] = 1
        self.reset()

    def reset(self):
        self.agent_pos = self.start
        return self.agent_pos

    def step(self, action):
        # Apply the move if it stays on the grid and does not hit an obstacle
        move = self.MOVES[action]
        new_pos = (self.agent_pos[0] + move[0], self.agent_pos[1] + move[1])
        if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                and self.grid[new_pos] == 0):
            self.agent_pos = new_pos
        if self.agent_pos == self.goal:
            reward, done = 100, True
        else:
            reward, done = -1, False
        return self.agent_pos, reward, done

    def get_valid_actions(self, state):
        valid = []
        for i, move in enumerate(self.MOVES):
            new_pos = (state[0] + move[0], state[1] + move[1])
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                    and self.grid[new_pos] == 0):
                valid.append(i)
        return valid

We begin by creating a grid world environment that challenges our agent to reach a goal while avoiding obstacles. We design its structure, define movement rules, and ensure realistic navigation boundaries to simulate an interactive problem-solving space. This forms the foundation where our exploration agents will operate and learn. Check out the FULL CODES here.

class QLearningAgent:
    def __init__(self, n_actions=4, alpha=0.1, gamma=0.95, epsilon=1.0):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.q_table = defaultdict(lambda: np.zeros(n_actions))

    def get_action(self, state, valid_actions):
        if random.random() < self.epsilon:
            return random.choice(valid_actions)
        else:
            q_values = self.q_table[state]
            valid_q = [(a, q_values[a]) for a in valid_actions]
            return max(valid_q, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        current_q = self.q_table[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_table[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state][action] = new_q

    def decay_epsilon(self, decay_rate=0.995):
        self.epsilon = max(0.01, self.epsilon * decay_rate)

We implement the Q-Learning agent that learns through experience, guided by an epsilon-greedy policy. We observe how it explores random actions early on and gradually focuses on the most rewarding paths. Through iterative updates, it learns to balance exploration and exploitation effectively.

class UCBAgent:
    def __init__(self, n_actions=4, c=2.0, gamma=0.95):
        self.n_actions = n_actions
        self.c = c
        self.gamma = gamma
        self.q_values = defaultdict(lambda: np.zeros(n_actions))
        self.action_counts = defaultdict(lambda: np.zeros(n_actions))
        self.total_counts = defaultdict(int)

    def get_action(self, state, valid_actions):
        self.total_counts[state] += 1
        ucb_values = []
        for action in valid_actions:
            q = self.q_values[state][action]
            count = self.action_counts[state][action]
            if count == 0:
                return action
            exploration_bonus = self.c * math.sqrt(math.log(self.total_counts[state]) / count)
            ucb_values.append((action, q + exploration_bonus))
        return max(ucb_values, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        self.action_counts[state][action] += 1
        count = self.action_counts[state][action]
        current_q = self.q_values[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_values[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        target = reward + self.gamma * max_next_q
        self.q_values[state][action] += (target - current_q) / count

We develop the UCB agent that uses confidence bounds to guide its exploration decisions. We watch how it strategically tries less-visited actions while prioritizing those that yield higher rewards. This approach helps us understand a more mathematically grounded exploration strategy. Check out the FULL CODES here.

class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value = 0.0

    def is_fully_expanded(self, valid_actions):
        return len(self.children) == len(valid_actions)

    def best_child(self, c=1.4):
        choices = [(action, child.value / child.visits +
                    c * math.sqrt(2 * math.log(self.visits) / child.visits))
                   for action, child in self.children.items()]
        return max(choices, key=lambda x: x[1])

class MCTSAgent:
    def __init__(self, env, n_simulations=50):
        self.env = env
        self.n_simulations = n_simulations

    def search(self, state):
        root = MCTSNode(state)
        for _ in range(self.n_simulations):
            node = root
            sim_env = GridWorld(size=self.env.size)
            sim_env.grid = self.env.grid.copy()
            sim_env.agent_pos = state
            # Selection: descend through fully expanded nodes using UCT
            while node.is_fully_expanded(sim_env.get_valid_actions(node.state)) and node.children:
                action, _ = node.best_child()
                node = node.children[action]
                sim_env.agent_pos = node.state
            # Expansion: try one untried action from the current node
            valid_actions = sim_env.get_valid_actions(node.state)
            if valid_actions and not node.is_fully_expanded(valid_actions):
                untried = [a for a in valid_actions if a not in node.children]
                action = random.choice(untried)
                next_state, _, _ = sim_env.step(action)
                child = MCTSNode(next_state, parent=node)
                node.children[action] = child
                node = child
            # Simulation: random rollout up to a fixed depth
            total_reward = 0
            depth = 0
            while depth < 20:
                valid = sim_env.get_valid_actions(sim_env.agent_pos)
                if not valid:
                    break
                action = random.choice(valid)
                _, reward, done = sim_env.step(action)
                total_reward += reward
                depth += 1
                if done:
                    break
            # Backpropagation: propagate the rollout return up to the root
            while node:
                node.visits += 1
                node.value += total_reward
                node = node.parent
        if root.children:
            return max(root.children.items(), key=lambda x: x[1].visits)[0]
        return random.choice(self.env.get_valid_actions(state))

We construct the Monte Carlo Tree Search (MCTS) agent to simulate and plan multiple potential future outcomes. We see how it builds a search tree, expands promising branches, and backpropagates results to refine decisions. This allows the agent to plan intelligently before acting. Check out the FULL CODES here.

def train_agent(agent, env, episodes=500, max_steps=100, agent_type="standard"):
    rewards_history = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            valid_actions = env.get_valid_actions(state)
            if agent_type == "mcts":
                action = agent.search(state)
            else:
                action = agent.get_action(state, valid_actions)
            next_state, reward, done = env.step(action)
            total_reward += reward
            if agent_type != "mcts":
                valid_next = env.get_valid_actions(next_state)
                agent.update(state, action, reward, next_state, valid_next)
            state = next_state
            if done:
                break
        rewards_history.append(total_reward)
        if hasattr(agent, 'decay_epsilon'):
            agent.decay_epsilon()
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1}/{episodes}, Avg Reward: {avg_reward:.2f}")
    return rewards_history

if __name__ == "__main__":
    print("=" * 70)
    print("Problem Solving via Exploration Agents Tutorial")
    print("=" * 70)
    env = GridWorld(size=8, n_obstacles=10)
    agents_config = {
        'Q-Learning (ε-greedy)': (QLearningAgent(), 'standard'),
        'UCB Agent': (UCBAgent(), 'standard'),
        'MCTS Agent': (MCTSAgent(env, n_simulations=30), 'mcts')
    }
    results = {}
    for name, (agent, agent_type) in agents_config.items():
        print(f"\nTraining {name}...")
        rewards = train_agent(agent, GridWorld(size=8, n_obstacles=10),
                              episodes=300, agent_type=agent_type)
        results[name] = rewards
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    for name, rewards in results.items():
        smoothed = np.convolve(rewards, np.ones(20) / 20, mode='valid')
        plt.plot(smoothed, label=name, linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Reward (smoothed)')
    plt.title('Agent Performance Comparison')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.subplot(1, 2, 2)
    for name, rewards in results.items():
        avg_last_100 = np.mean(rewards[-100:])
        plt.bar(name, avg_last_100, alpha=0.7)
    plt.ylabel('Average Reward (Last 100 Episodes)')
    plt.title('Final Performance')
    plt.xticks(rotation=15, ha='right')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
    print("=" * 70)
    print("Tutorial Complete!")
    print("Key Concepts Demonstrated:")
    print("1. Epsilon-Greedy exploration")
    print("2. UCB strategy")
    print("3. MCTS-based planning")
    print("=" * 70)

We train all three agents in our grid world and visualize their learning progress and performance. We analyze how each strategy, Q-Learning, UCB, and MCTS, adapts to the environment over time. Finally, we compare results and gain insights into which exploration approach leads to faster, more reliable problem-solving.

In conclusion, we successfully implemented and compared three exploration-driven agents, each demonstrating a unique strategy for solving the same navigation challenge. We observe how epsilon-greedy enables gradual learning through randomness, UCB balances confidence with curiosity, and MCTS leverages simulated rollouts for foresight and planning. This exercise helps us appreciate how different exploration mechanisms influence convergence, adaptability, and efficiency in reinforcement learning.

Check out the FULL CODES here.
The post How Exploration Agents like Q-Learning, UCB, and MCTS Collaboratively Learn Intelligent Problem-Solving Strategies in Dynamic Grid Environments appeared first on MarkTechPost.

MiniMax Releases MiniMax M2: A Mini Open Model Built for Max Coding and Agentic Workflows at 8% Claude Sonnet Price and ~2x Faster

Can an open source MoE truly power agentic coding workflows at a fraction of flagship model costs while sustaining long-horizon tool use across MCP, shell, browser, retrieval, and code? The MiniMax team has just released MiniMax-M2, a mixture of experts (MoE) model optimized for coding and agent workflows. The weights are published on Hugging Face under the MIT license, and the model is positioned for end to end tool use, multi file editing, and long horizon plans. It lists 229B total parameters with about 10B active per token, which keeps memory and latency in check during agent loops.

https://github.com/MiniMax-AI/MiniMax-M2

Architecture and why activation size matters

MiniMax-M2 is a compact MoE that routes to about 10B active parameters per token. The smaller activations reduce memory pressure and tail latency in plan, act, and verify loops, and allow more concurrent runs in CI, browse, and retrieval chains. This is the performance budget that enables the speed and cost claims relative to dense models of similar quality.

MiniMax-M2 is an interleaved thinking model. It wraps internal reasoning in <think>…</think> blocks, and the model card instructs users to keep these blocks in the conversation history across turns. Removing these segments harms quality in multi step tasks and tool chains. This requirement is explicit on the model page on Hugging Face.
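A minimal illustration of that requirement, assuming a simple chat-message list: the assistant turn is appended with its <think> segment intact rather than stripped before the next request. The message structure below is a generic stand-in, not MiniMax's API schema, and the conversation content is invented.

history = [{"role": "user", "content": "Refactor utils.py and run the tests."}]

def add_assistant_turn(history, reply_with_think):
    # Keep the <think>...</think> segment; stripping it degrades multi-step and tool-use quality
    history.append({"role": "assistant", "content": reply_with_think})

add_assistant_turn(
    history,
    "<think>Plan: inspect utils.py, apply the edit, then run pytest.</think>I refactored the module and the tests pass.",
)
history.append({"role": "user", "content": "Now add type hints to the public functions."})

# The earlier assistant turn still carries its interleaved reasoning when the next request is sent
print("<think>" in history[1]["content"])  # True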

Benchmarks that target coding and agents

The MiniMax team reports a set of agent and code evaluations that are closer to developer workflows than static QA. On Terminal Bench, the table shows 46.3. On Multi SWE Bench, it shows 36.2. On BrowseComp, it shows 44.0. SWE Bench Verified is listed at 69.4 with a scaffold detail, OpenHands with 128k context and 100 steps.


MiniMax’s official announcement stresses pricing at about 8% of Claude Sonnet and near 2x speed, plus a free access window. The same note provides the specific token prices and the trial deadline.

Comparison M1 vs M2

| Aspect | MiniMax M1 | MiniMax M2 |
|---|---|---|
| Total parameters | 456B total | 229B in model card metadata, model card text says 230B total |
| Active parameters per token | 45.9B active | 10B active |
| Core design | Hybrid Mixture of Experts with Lightning Attention | Sparse Mixture of Experts targeting coding and agent workflows |
| Thinking format | Thinking budget variants 40k and 80k in RL training, no think tag protocol required | Interleaved thinking with <think>…</think> segments that must be preserved across turns |
| Benchmarks highlighted | AIME, LiveCodeBench, SWE-bench Verified, TAU-bench, long context MRCR, MMLU-Pro | Terminal-Bench, Multi SWE-Bench, SWE-bench Verified, BrowseComp, GAIA text only, Artificial Analysis intelligence suite |
| Inference defaults | temperature 1.0, top p 0.95 | model card shows temperature 1.0, top p 0.95, top k 40, launch page shows top k 20 |
| Serving guidance | vLLM recommended, Transformers path also documented | vLLM and SGLang recommended, tool calling guide provided |
| Primary focus | Long context reasoning, efficient scaling of test time compute, CISPO reinforcement learning | Agent and code native workflows across shell, browser, retrieval, and code runners |

Key Takeaways

M2 ships as open weights on Hugging Face under MIT, with safetensors in F32, BF16, and FP8 F8_E4M3.

The model is a compact MoE with 229B total parameters and ~10B active per token, which the card ties to lower memory use and steadier tail latency in plan, act, verify loops typical of agents.

Outputs wrap internal reasoning in <think>…</think> and the model card explicitly instructs retaining these segments in conversation history, warning that removal degrades multi-step and tool-use performance.

Reported results cover Terminal-Bench, (Multi-)SWE-Bench, BrowseComp, and others, with scaffold notes for reproducibility, and day-0 serving is documented for SGLang and vLLM with concrete deploy guides.

Editorial Notes

MiniMax M2 lands with open weights under MIT, a mixture of experts design with 229B total parameters and about 10B activated per token, which targets agent loops and coding tasks with lower memory and steadier latency. It ships on Hugging Face in safetensors with FP32, BF16, and FP8 formats, and provides deployment notes plus a chat template. The API documents Anthropic compatible endpoints and lists pricing with a limited free window for evaluation. vLLM and SGLang recipes are available for local serving and benchmarking. Overall, MiniMax M2 is a very solid open release.

Check out the API Doc, Weights and Repo.
The post MiniMax Releases MiniMax M2: A Mini Open Model Built for Max Coding and Agentic Workflows at 8% Claude Sonnet Price and ~2x Faster appeared first on MarkTechPost.

Zhipu AI Releases ‘Glyph’: An AI Framework for Scaling the Context Length through Visual-Text Compression

Can we render long texts as images and use a VLM to achieve 3 to 4 times token compression, preserving accuracy while scaling a 128K context toward 1M-token workloads? A team of researchers from Zhipu AI released Glyph, an AI framework for scaling the context length through visual-text compression. It renders long textual sequences into images and processes them using vision language models. The system renders ultra long text into page images, then a vision language model (VLM) processes those pages end to end. Each visual token encodes many characters, so the effective token sequence shortens while semantics are preserved. Glyph achieves 3 to 4 times token compression on long text sequences without performance degradation, enabling significant gains in memory efficiency, training throughput, and inference speed.

https://arxiv.org/pdf/2510.17800

Why Glyph?

Conventional methods expand positional encodings or modify attention, but compute and memory still scale with token count. Retrieval trims inputs, yet it risks missing evidence and adds latency. Glyph changes the representation: it converts text to images and shifts the burden to a VLM that has already learned OCR, layout, and reasoning. This increases information density per token, so a fixed token budget covers more of the original context. Under extreme compression, the research team shows that a 128K-context VLM can address tasks that originate from text at the 1M-token level.

https://arxiv.org/pdf/2510.17800

System design and training

The method has three stages: continual pre-training, LLM-driven rendering search, and post-training. Continual pre-training exposes the VLM to large corpora of rendered long text with diverse typography and styles. The objective aligns visual and textual representations and transfers long-context skills from text tokens to visual tokens. The rendering search is a genetic loop driven by an LLM. It mutates page size, dpi, font family, font size, line height, alignment, indent, and spacing, and evaluates candidates on a validation set to optimize accuracy and compression jointly. Post-training uses supervised fine-tuning and reinforcement learning with Group Relative Policy Optimization, plus an auxiliary OCR alignment task. The OCR loss improves character fidelity when fonts are small and spacing is tight.
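To make the rendering step concrete, here is a minimal sketch of turning a long string into page images with Pillow; the page size, margins, wrapping, and font are illustrative assumptions, not Glyph's searched rendering configuration.

# Minimal text-to-image rendering sketch, assuming Pillow is installed.
# Page size, margins, and font are illustrative, not Glyph's searched values.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_pages(text: str, chars_per_line: int = 90, lines_per_page: int = 55,
                 page_size=(1240, 1754)) -> list:
    """Wrap text into lines, pack lines into pages, and draw each page as an image."""
    font = ImageFont.load_default()
    lines = textwrap.wrap(text, width=chars_per_line)
    pages = []
    for start in range(0, len(lines), lines_per_page):
        page = Image.new("RGB", page_size, "white")
        draw = ImageDraw.Draw(page)
        y = 40
        for line in lines[start:start + lines_per_page]:
            draw.text((40, y), line, fill="black", font=font)
            y += 30
        pages.append(page)
    return pages

pages = render_pages("long document text " * 2000)
print(f"Rendered {len(pages)} page images")  # each page would later be fed to the VLM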

https://arxiv.org/pdf/2510.17800

Results, performance and efficiency…

LongBench and MRCR establish accuracy and compression under long dialogue histories and document tasks. The model achieves an average effective compression ratio about 3.3 on LongBench, with some tasks near 5, and about 3.0 on MRCR. These gains scale with longer inputs, since every visual token carries more characters. Reported speedups versus the text backbone at 128K inputs are about 4.8 times for prefill, about 4.4 times for decoding, and about 2 times for supervised fine tuning throughput. The Ruler benchmark confirms that higher dpi at inference time improves scores, since crisper glyphs help OCR and layout parsing. The research team reports dpi 72 with average compression 4.0 and maximum 7.7 on specific sub tasks, dpi 96 with average compression 2.2 and maximum 4.4, and dpi 120 with average 1.2 and maximum 2.8. The 7.7 maximum belongs to Ruler, not to MRCR.

https://arxiv.org/pdf/2510.17800

So, what? Applications

Glyph also benefits multimodal document understanding. Training on rendered pages improves performance on MMLongBench-Doc relative to the base visual model, which indicates that the rendering objective is a useful pretext for real document tasks that include figures and layout. The main failure mode is sensitivity to aggressive typography: very small fonts and tight spacing degrade character accuracy, especially for rare alphanumeric strings, and the research team excludes the UUID subtask on Ruler for this reason. The approach also assumes server-side rendering and a VLM with strong OCR and layout priors.

Key Takeaways

Glyph renders long text into images, then a vision language model processes those pages. This reframes long-context modeling as a multimodal problem and preserves semantics while reducing tokens.

The research team reports 3 to 4 times token compression with accuracy comparable to strong 8B text baselines on long-context benchmarks.

Prefill speedup is about 4.8 times, decoding speedup is about 4.4 times, and supervised fine tuning throughput is about 2 times, measured at 128K inputs.

The system uses continual pretraining on rendered pages, an LLM driven genetic search over rendering parameters, then supervised fine tuning and reinforcement learning with GRPO, plus an OCR alignment objective.

Evaluations include LongBench, MRCR, and Ruler, with an extreme case showing a 128K context VLM addressing 1M token level tasks. Code and model card are public on GitHub and Hugging Face.

Editorial Comments

Glyph treats long context scaling as visual text compression, it renders long sequences into images and lets a VLM process them, reducing tokens while preserving semantics. The research team claims 3 to 4 times token compression with accuracy comparable to Qwen3 8B baselines, about 4 times faster prefilling and decoding, and about 2 times faster SFT throughput. The pipeline is disciplined, continual pre training on rendered pages, an LLM genetic rendering search over typography, then post training. The approach is pragmatic for million token workloads under extreme compression, yet it depends on OCR and typography choices, which remain knobs. Overall, visual text compression offers a concrete path to scale long context while controlling compute and memory.

Check out the Paper, Weights and Repo.
The post Zhipu AI Releases ‘Glyph’: An AI Framework for Scaling the Context Length through Visual-Text Compression appeared first on MarkTechPost.

Hosting NVIDIA speech NIM models on Amazon SageMaker AI: Parakeet ASR

This post was written with NVIDIA and the authors would like to thank Adi Margolin, Eliuth Triana, and Maryam Motamedi for their collaboration.
Organizations today face the challenge of processing large volumes of audio data–from customer calls and meeting recordings to podcasts and voice messages–to unlock valuable insights. Automatic Speech Recognition (ASR) is a critical first step in this process, converting speech to text so that further analysis can be performed. However, running ASR at scale is computationally intensive and can be expensive. This is where asynchronous inference on Amazon SageMaker AI comes in. By deploying state-of-the-art ASR models (like NVIDIA Parakeet models) on SageMaker AI with asynchronous endpoints, you can handle large audio files and batch workloads efficiently. With asynchronous inference, long-running requests can be processed in the background (with results delivered later); it also supports auto-scaling to zero when there’s no work and handles spikes in demand without blocking other jobs.
In this blog post, we’ll explore how to host the NVIDIA Parakeet ASR model on SageMaker AI and integrate it into an asynchronous pipeline for scalable audio processing. We’ll also highlight the benefits of Parakeet’s architecture and the NVIDIA Riva toolkit for speech AI, and discuss how to use NVIDIA NIM for deployment on AWS.
NVIDIA speech AI technologies: Parakeet ASR and Riva Framework
NVIDIA offers a comprehensive suite of speech AI technologies, combining high-performance models with efficient deployment solutions. At its core, the Parakeet ASR model family represents state-of-the-art speech recognition, achieving industry-leading accuracy with low word error rates (WERs). The architecture pairs a Fast Conformer encoder with a CTC or transducer decoder, enabling 2.4× faster processing than standard Conformers while maintaining accuracy.
NVIDIA speech NIM is a collection of GPU-accelerated microservices for building customizable speech AI applications. NVIDIA speech models deliver accurate transcription and natural, expressive voices in over 36 languages, making them well suited for customer service, contact centers, accessibility, and global enterprise workflows. Developers can fine-tune and customize models for specific languages, accents, domains, and vocabularies, supporting accuracy and brand voice alignment.
Seamless integration with LLMs and the NVIDIA NeMo Retriever makes NVIDIA models well suited for agentic AI applications, helping your organization stand out with more secure, high-performing voice AI. The NIM framework delivers these services as containerized solutions, making deployment straightforward through Docker containers that include the necessary dependencies and optimizations.
This combination of high-performance models and deployment tools provides organizations with a complete solution for implementing speech recognition at scale.
Solution overview
The architecture illustrated in the diagram showcases a comprehensive asynchronous inference pipeline designed specifically for ASR and summarization workloads. The solution provides a robust, scalable, and cost-effective processing pipeline.

Architecture components
The architecture consists of five key components working together to create an efficient audio processing pipeline. At its core, the SageMaker AI asynchronous endpoint hosts the Parakeet ASR model with auto scaling capabilities that can scale to zero when idle for cost optimization.

The data ingestion process begins when audio files are uploaded to Amazon Simple Storage Service (Amazon S3), triggering AWS Lambda functions that process metadata and initiate the workflow.
For event processing, the SageMaker endpoint automatically sends Amazon Simple Notification Service (Amazon SNS) success and failure notifications through separate topics, enabling proper handling of completed and failed transcriptions.
Successfully transcribed content on Amazon S3 moves to Amazon Bedrock LLMs for intelligent summarization and additional processing like classification and insights extraction.
Finally, a comprehensive tracking system using Amazon DynamoDB stores workflow status and metadata, enabling real-time monitoring and analytics of the entire pipeline.
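As a rough illustration of how a Lambda function might submit work to the asynchronous endpoint in this pipeline, the snippet below uses the boto3 SageMaker runtime client; the endpoint name, bucket, and object key are placeholder assumptions.

# Rough sketch of submitting an async inference request with boto3.
# Endpoint name, bucket, and object key are placeholder assumptions.
import boto3

smr = boto3.client("sagemaker-runtime")

def submit_audio(bucket: str, key: str) -> str:
    """Queue an audio file that already sits in S3 for asynchronous transcription."""
    response = smr.invoke_endpoint_async(
        EndpointName="parakeet-asr-async-endpoint",  # assumed endpoint name
        InputLocation=f"s3://{bucket}/{key}",        # async endpoints read their input from S3
        ContentType="audio/wav",
    )
    # The InferenceId can be written to DynamoDB for status tracking.
    return response["InferenceId"]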

Detailed implementation walkthrough
In this section, we will provide the detailed walkthrough of the solution implementation.
SageMaker asynchronous endpoint prerequisites
To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role that has least-privilege permissions to manage the resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker async hosting instances. In this example, we need one ml.g5.xlarge SageMaker async hosting instance and an ml.g5.xlarge SageMaker notebook instance. You can also choose a different integrated development environment (IDE), but make sure the environment has GPU compute resources for local testing.
SageMaker asynchronous endpoint configuration
When you deploy a custom model like Parakeet, SageMaker has a couple of options:

Use a NIM container provided by NVIDIA
Use a large model inference (LMI) container
Use a prebuilt PyTorch container

We’ll provide examples for all three approaches.
Using an NVIDIA NIM container
NVIDIA NIM provides a streamlined approach to deploying optimized AI models through containerized solutions. Our implementation takes this concept further by creating a unified SageMaker AI endpoint that intelligently routes between HTTP and gRPC protocols to help maximize both performance and capabilities while simplifying the deployment process.
Innovative dual-protocol architecture
The key innovation is the combined HTTP + gRPC architecture that exposes a single SageMaker AI endpoint with intelligent routing capabilities. This design addresses the common challenge of choosing between protocol efficiency and feature completeness by automatically selecting the optimal transport method. The HTTP route is optimized for simple transcription tasks with files under 5MB, providing faster processing and lower latency for common use cases. Meanwhile, the gRPC route supports larger files (SageMaker AI real-time endpoints support a max payload of 25MB) and advanced features like speaker diarization with precise word-level timing information. The system’s auto-routing functionality analyzes incoming requests to determine file size and requested features, then automatically selects the most appropriate protocol without requiring manual configuration. For applications that need explicit control, the endpoint also supports forced routing through /invocations/http for simple transcription or /invocations/grpc when speaker diarization is required. This flexibility allows both automated optimization and fine-grained control based on specific application requirements.
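A simplified sketch of that routing decision is shown below; the 5MB threshold comes from the post, while the function and argument names are illustrative assumptions rather than the actual container code.

# Simplified routing sketch: HTTP for small, plain transcription requests,
# gRPC for large files or speaker diarization. Names are illustrative only.
from typing import Optional

HTTP_MAX_BYTES = 5 * 1024 * 1024  # 5MB threshold mentioned in the post

def choose_route(payload_size_bytes: int, wants_diarization: bool,
                 forced_route: Optional[str] = None) -> str:
    if forced_route in ("http", "grpc"):
        return forced_route                  # explicit /invocations/http or /invocations/grpc
    if wants_diarization or payload_size_bytes > HTTP_MAX_BYTES:
        return "grpc"                        # larger payloads and word-level timing features
    return "http"                            # low-latency path for simple transcription

print(choose_route(2_000_000, wants_diarization=False))  # http
print(choose_route(8_000_000, wants_diarization=True))   # grpc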
Advanced speech recognition and speaker diarization capabilities
The NIM container enables a comprehensive audio processing pipeline that seamlessly combines speech recognition with speaker identification through the NVIDIA Riva integrated capabilities. The container handles audio preprocessing, including format conversion and segmentation, while ASR and speaker diarization processes run concurrently on the same audio stream. Results are automatically aligned using overlapping time segments, with each transcribed segment receiving appropriate speaker labels (for example, Speaker_0, Speaker_1). The inference handler processes audio files through the complete pipeline, initializing both ASR and speaker diarization services, running them in parallel, and aligning transcription segments with speaker labels. The output includes the full transcription, timestamped segments with speaker attribution, confidence scores, and total speaker count in a structured JSON format.
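The structured output could look roughly like the following; the field names approximate the format described above and are not the container's exact schema.

# Illustrative shape of the diarized transcription output described above.
# Field names approximate the described JSON; they are not the exact schema.
example_output = {
    "transcription": "hello thanks for calling how can I help you today",
    "segments": [
        {"start": 0.00, "end": 1.20, "speaker": "Speaker_0",
         "text": "hello", "confidence": 0.97},
        {"start": 1.35, "end": 4.10, "speaker": "Speaker_1",
         "text": "thanks for calling how can I help you today", "confidence": 0.94},
    ],
    "speaker_count": 2,
}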
Implementation and deployment
The implementation extends NVIDIA parakeet-1-1b-ctc-en-us NIM container as the foundation, adding a Python aiohttp server that seamlessly manages the complete NIM lifecycle by automatically starting and monitoring the service. The server handles protocol adaptation by translating SageMaker inference requests to appropriate NIM APIs, implements the intelligent routing logic that analyzes request characteristics, and provides comprehensive error handling with detailed error messages and fallback mechanisms for robust production deployment. The containerized solution streamlines deployment through standard Docker and AWS CLI commands, featuring a pre-configured Docker file with the necessary dependencies and optimizations. The system accepts multiple input formats including multipart form-data (recommended for maximum compatibility), JSON with base64 encoding for simple integration scenarios, and raw binary uploads for direct audio processing.
For detailed implementation instructions and working examples, teams can reference the complete implementation and deployment notebook in the AWS samples repository, which provides comprehensive guidance on deploying Parakeet ASR with NIM on SageMaker AI using the bring your own container (BYOC) approach. For organizations with specific architectural preferences, separate HTTP-only and gRPC-only implementations are also available, providing simpler deployment models for teams with well-defined use cases while the combined implementation offers maximum flexibility and automatic optimization.
AWS customers can deploy these models either as production-grade NVIDIA NIM containers directly from SageMaker Marketplace or JumpStart, or open source NVIDIA models available on Hugging Face, which can be deployed through custom containers on SageMaker or Amazon Elastic Kubernetes Service (Amazon EKS). This allows organizations to choose between fully managed, enterprise-tier endpoints with auto-scaling and security, or flexible open-source development for research or constrained use cases.
Using an AWS LMI container
LMI containers are designed to simplify hosting large models on AWS. These containers include optimized inference engines like vLLM, FasterTransformer, or TensorRT-LLM that can automatically handle things like model parallelism, quantization, and batching for large models. The LMI container is essentially a pre-configured Docker image that runs an inference server (for example a Python server with these optimizations) and allows you to specify model parameters by using environment variables.
To use the LMI container for Parakeet, we would typically:

Choose the appropriate LMI image: AWS provides different LMI images for different frameworks. For Parakeet, we might use the DJLServing image for efficient inference. Alternatively, NVIDIA Triton Inference Server (which Riva uses) is an option if we package the model in ONNX or TensorRT format.
Specify the model configuration: With LMI, we often provide a model_id (if pulling from Hugging Face Hub) or a path to our model, along with configuration for how to load it (number of GPUs, tensor parallel degree, quantization bits). The container then downloads the model and initializes it with the specified settings. We can also download our own model files from Amazon S3 instead of using the Hub.
Define the inference handler: The LMI container might require a small handler script or configuration to tell it how to process requests. For ASR, this might involve reading the audio input, passing it to the model, and returning text.

AWS LMI containers deliver high performance and scalability through advanced optimization techniques, including continuous batching, tensor parallelism, and state-of-the-art quantization methods. LMI containers integrate multiple inference backends (vLLM and TensorRT-LLM) through a single unified configuration, helping users seamlessly experiment with and switch between frameworks to find the optimal performance stack for their specific use case.
Using a SageMaker PyTorch container
SageMaker offers PyTorch Deep Learning Containers (DLCs) that come with PyTorch and many common libraries pre-installed. In this example, we demonstrated how to extend our prebuilt container to install necessary packages for the model. You can download the model directly from Hugging Face during the endpoint creation or download the Parakeet model artifacts, packaging it with necessary configuration files into a model.tar.gz archive, and uploading it to Amazon S3. Along with the model artifacts, an inference.py script is required as the entry point script to define model loading and inference logic, including audio preprocessing and transcription handling. When using the SageMaker Python SDK to create a PyTorchModel, the SDK will automatically repackage the model archive to include the inference script under /opt/ml/model/code/inference.py, while keeping model artifacts in /opt/ml/model/ on the endpoint. Once the endpoint is deployed successfully, it can be invoked through the predict API by sending audio files as byte streams to get transcription results.
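A minimal inference.py skeleton for the PyTorch DLC path might look like the following; the handler function names follow the SageMaker PyTorch inference toolkit convention, while the NeMo model file, the soundfile dependency, and the preprocessing details are assumptions for illustration, so prefer the actual notebook's script.

# inference.py sketch for the SageMaker PyTorch container (bring-your-own script).
# The model file name, soundfile dependency, and preprocessing are illustrative assumptions.
import io
import json
import soundfile as sf  # assumed to be installed via requirements.txt

def model_fn(model_dir):
    # Load the ASR model once per worker; here we assume a NeMo-style checkpoint
    # packaged inside model.tar.gz. Replace with the actual loading logic.
    import nemo.collections.asr as nemo_asr
    return nemo_asr.models.ASRModel.restore_from(f"{model_dir}/parakeet.nemo")

def input_fn(request_body, content_type="audio/wav"):
    # SageMaker passes the raw request bytes; decode them into a waveform.
    audio, sample_rate = sf.read(io.BytesIO(request_body))
    return audio, sample_rate

def predict_fn(data, model):
    audio, _ = data
    # Recent NeMo ASR versions accept in-memory audio; older versions may
    # require writing a temporary WAV file first.
    return model.transcribe([audio])[0]

def output_fn(prediction, accept="application/json"):
    return json.dumps({"transcription": str(prediction)})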
For SageMaker real-time endpoints, the maximum payload size is currently 25MB, so make sure the container is also configured to accept requests of that size. If you plan to reuse the same model behind an asynchronous endpoint, note that async endpoints support input files up to 1GB and response times of up to 1 hour, so configure the container for that payload size and timeout accordingly. When using the PyTorch containers, here are some key configuration parameters to consider (a configuration sketch follows the list):

SAGEMAKER_MODEL_SERVER_WORKERS: Sets the number of TorchServe workers, which determines how many copies of the model are loaded into GPU memory.
TS_DEFAULT_RESPONSE_TIMEOUT: Sets the timeout for TorchServe workers; for long audio processing, set it to a higher value.
TS_MAX_REQUEST_SIZE: Sets the maximum request size in bytes; use roughly 1GB for async endpoints.
TS_MAX_RESPONSE_SIZE: Sets the maximum response size in bytes.
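
The sketch below shows how these variables might be passed when constructing the model with the SageMaker Python SDK; the image versions, IAM role, and S3 path are placeholder assumptions.

# Sketch of wiring the TorchServe environment variables into a PyTorchModel.
# Role, model data path, and framework versions are placeholder assumptions.
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data="s3://my-bucket/parakeet/model.tar.gz",  # assumed artifact location
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # assumed role
    entry_point="inference.py",
    framework_version="2.1",
    py_version="py310",
    env={
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "TS_DEFAULT_RESPONSE_TIMEOUT": "3600",  # allow long audio processing
        "TS_MAX_REQUEST_SIZE": "1000000000",    # ~1GB for async payloads
        "TS_MAX_RESPONSE_SIZE": "1000000000",
    },
)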

In the example notebook, we also showcase how to leverage the SageMaker local session provided by the SageMaker Python SDK. It helps you create estimators and run training, processing, and inference jobs locally using Docker containers instead of managed AWS infrastructure, providing a fast way to test and debug your machine learning scripts before scaling to production.
CDK pipeline prerequisites
Before deploying this solution, make sure you have:

AWS CLI configured with appropriate permissions – Installation Guide
AWS Cloud Development Kit (AWS CDK) installed – Installation Guide
Node.js 18+ and Python 3.9+ installed
Docker – Installation Guide
SageMaker endpoint deployed with your ML model (Parakeet ASR models or similar)
Amazon SNS topics created for success and failure notifications

CDK pipeline setup
The solution deployment begins with provisioning the necessary AWS resources using Infrastructure as Code (IaC) principles. AWS CDK creates the foundational components including:

DynamoDB Table: Configured for on-demand capacity to track invocation metadata, processing status, and results
S3 Buckets: Secure storage for input audio files, transcription outputs, and summarization results
SNS topics: Separate topics for success and failure event handling
Lambda functions: Serverless functions for metadata processing, status updates, and workflow orchestration
IAM roles and policies: Appropriate permissions for cross-service communication and resource access

Environment setup
Clone the repository and install dependencies:

# Install degit, a library for downloading specific sub directories
npm install -g degit

# Clone just the specific folder
npx degit aws-samples/genai-ml-platform-examples/infrastructure/automated-speech-recognition-async-pipeline-sagemaker-ai/sagemaker-async-batch-inference-cdk sagemaker-async-batch-inference-cdk

# Navigate to folder
cd sagemaker-async-batch-inference-cdk

# Install Node.js dependencies
npm install

# Set up Python virtual environment
python3 -m venv .venv
source .venv/bin/activate

# On Windows:
.venv\Scripts\activate

pip install -r requirements.txt

Configuration
Update the SageMaker endpoint configuration in bin/aws-blog-sagemaker.ts:

vim bin/aws-blog-sagemaker.ts

# Change the endpoint name
sageMakerConfig: {
  endpointName: 'your-sagemaker-endpoint-name',
  enableSageMakerAccess: true
}

If you have followed the notebook to deploy the endpoint, you should have created the two SNS topics. Otherwise, make sure you create the correct SNS topics using CLI:

# Create SNS topics
aws sns create-topic --name success-inf
aws sns create-topic --name failed-inf

Build and deploy
Before you deploy the AWS CloudFormation template, make sure Docker is running.

# Compile TypeScript to JavaScript
npm run build

# Bootstrap CDK (first time only)
npx cdk bootstrap

# Deploy the stack
npx cdk deploy

Verify deployment
After successful deployment, note the output values:

DynamoDB table name for status tracking
Lambda function ARNs for processing and status updates
SNS topic ARNs for notifications

Submit audio file for processing
Processing Audio Files
Update the upload_audio_invoke_lambda.sh

LAMBDA_ARN="YOUR_LAMBDA_FUNCTION_ARN"
S3_BUCKET="YOUR_S3_BUCKET_ARN"

Run the Script:
AWS_PROFILE=default ./scripts/upload_audio_invoke_lambda.sh
This script will:

Download a sample audio file
Upload the audio file to your s3 bucket
Send the bucket path to Lambda and trigger the transcription and summarization pipeline

Monitoring progress
You can check the result in DynamoDB table using the following command:

aws dynamodb scan --table-name YOUR_DYNAMODB_TABLE_NAME

Check processing status in the DynamoDB table:

submitted: Successfully queued for inference
completed: Transcription completed successfully
failed: Processing encountered an error

Audio processing and workflow orchestration
The core processing workflow follows an event-driven pattern:
Initial processing and metadata extraction: When audio files are uploaded to S3, the triggered Lambda function analyzes the file metadata, validates format compatibility, and creates detailed invocation records in DynamoDB. This facilitates comprehensive tracking from the moment audio content enters the system.
Asynchronous Speech Recognition: Audio files are processed through the SageMaker endpoint using optimized ASR models. The asynchronous process can handle various file sizes and durations without timeout concerns. Each processing request is assigned a unique identifier for tracking purposes.
Success path processing: Upon successful transcription, the system automatically initiates the summarization workflow. The transcribed text is sent to Amazon Bedrock, where advanced language models generate contextually appropriate summaries based on configurable parameters such as summary length, focus areas, and output format.
Error handling and recovery: Failed processing attempts trigger dedicated Lambda functions that log detailed error information, update processing status, and can initiate retry logic for transient failures. This robust error handling results in minimal data loss and provides clear visibility into processing issues.
Real-world applications
Customer service analytics: Organizations can process thousands of customer service call recordings to generate transcriptions and summaries, enabling sentiment analysis, quality assurance, and insights extraction at scale.
Meeting and conference processing: Enterprise teams can automatically transcribe and summarize meeting recordings, creating searchable archives and actionable summaries for participants and stakeholders.
Media and content processing: Media companies can process podcast episodes, interviews, and video content to generate transcriptions and summaries for improved accessibility and content discoverability.
Compliance and legal documentation: Legal and compliance teams can process recorded depositions, hearings, and interviews to create accurate transcriptions and summaries for case preparation and documentation.
Cleanup
Once you have used the solution, remove the SageMaker endpoints to prevent incurring additional costs. You can use the provided code to delete real-time and asynchronous inference endpoints, respectively:

# Delete the real-time inference endpoint
real_time_predictor.delete_endpoint()

# Delete the asynchronous inference endpoint
async_predictor.delete_endpoint()

You should also delete all the resources created by the CDK stack.

# Delete CDK Stack
cdk destroy

Conclusion
The integration of powerful NVIDIA speech AI technologies with AWS cloud infrastructure creates a comprehensive solution for large-scale audio processing. By combining Parakeet ASR’s industry-leading accuracy and speed with NVIDIA Riva’s optimized deployment framework on the Amazon SageMaker asynchronous inference pipeline, organizations can achieve both high-performance speech recognition and cost-effective scaling. The solution leverages the managed services of AWS (SageMaker AI, Lambda, S3, and Bedrock) to create an automated, scalable pipeline for processing audio content. With features like auto scaling to zero, comprehensive error handling, and real-time monitoring through DynamoDB, organizations can focus on extracting business value from their audio content rather than managing infrastructure complexity. Whether processing customer service calls, meeting recordings, or media content, this architecture delivers reliable, efficient, and cost-effective audio processing capabilities. To experience the full potential of this solution, we encourage you to explore the solution and reach out to us if you have any specific business requirements and would like to customise the solution for your use case.

About the authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Tony Trinh is a Senior AI/ML Specialist Architect at AWS. With 13+ years of experience in the IT industry, Tony specializes in architecting scalable, compliance-driven AI and ML solutions—particularly in generative AI, MLOps, and cloud-native data platforms. As part of his PhD, he’s doing research in Multimodal AI and Spatial AI. In his spare time, Tony enjoys hiking, swimming and experimenting with home improvement.
Alick Wong is a Senior Solutions Architect at Amazon Web Services, where he helps startups and digital-native businesses modernize, optimize, and scale their platforms in the cloud. Drawing on his experience as a former startup CTO, he works closely with founders and engineering leaders to drive growth and innovation on AWS.
Andrew Smith is a Sr. Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.
Derrick Choo is a Senior AI/ML Specialist Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multi-modal systems.
Tim Ma is a Principal Specialist in Generative AI at AWS, where he collaborates with customers to design and deploy cutting-edge machine learning solutions. He also leads go-to-market strategies for generative AI services, helping organizations harness the potential of advanced AI technologies.
Curt Lockhart is an AI Solutions Architect at NVIDIA, where he helps customers deploy language and vision models to build end to end AI workflows using NVIDIA’s tooling on AWS. He enjoys making complex AI feel approachable and spending his time exploring the art, music, and outdoors of the Pacific Northwest.
Francesco Ciannella is a senior engineer at NVIDIA, where he works on conversational AI solutions built around large language models (LLMs) and audio language models (ALMs). He holds a M.S. in engineering of telecommunications from the University of Rome “La Sapienza” and an M.S. in language technologies from the School of Computer Science at Carnegie Mellon University.

How to Build an Agentic Decision-Tree RAG System with Intelligent Quer …

In this tutorial, we build an advanced Agentic Retrieval-Augmented Generation (RAG) system that goes beyond simple question answering. We design it to intelligently route queries to the right knowledge sources, perform self-checks to assess answer quality, and iteratively refine responses for improved accuracy. We implement the entire system using open-source tools like FAISS, SentenceTransformers, and Flan-T5. As we progress, we explore how routing, retrieval, generation, and self-evaluation combine to form a decision-tree-style RAG pipeline that mimics real-world agentic reasoning. Check out the FULL CODES here.

print("Setting up dependencies...")
import subprocess
import sys

def install_packages():
    packages = ['sentence-transformers', 'transformers', 'torch', 'faiss-cpu', 'numpy', 'accelerate']
    for package in packages:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

try:
    import faiss
except ImportError:
    install_packages()

print("✓ All dependencies installed! Importing modules...\n")
import torch
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')
print("✓ All modules loaded successfully!\n")

We begin by installing all necessary dependencies, including Transformers, FAISS, and SentenceTransformers, to ensure smooth local execution. We verify installations and install essential modules such as NumPy, PyTorch, and FAISS for embedding, retrieval, and generation. We confirm that all libraries load successfully before moving ahead with the main pipeline. Check out the FULL CODES here.

class VectorStore:
    def __init__(self, embedding_model='all-MiniLM-L6-v2'):
        print(f"Loading embedding model: {embedding_model}...")
        self.embedder = SentenceTransformer(embedding_model)
        self.documents = []
        self.index = None

    def add_documents(self, docs: List[str], sources: List[str]):
        self.documents = [{"text": doc, "source": src} for doc, src in zip(docs, sources)]
        embeddings = self.embedder.encode(docs, show_progress_bar=False)
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(embeddings.astype('float32'))
        print(f"✓ Indexed {len(docs)} documents\n")

    def search(self, query: str, k: int = 3) -> List[Dict]:
        query_vec = self.embedder.encode([query]).astype('float32')
        distances, indices = self.index.search(query_vec, k)
        return [self.documents[i] for i in indices[0]]

We design the VectorStore class to store and retrieve documents efficiently using FAISS-based similarity search. We embed each document using a transformer model and build an index for fast retrieval. This allows us to quickly fetch the most relevant context for any incoming query. Check out the FULL CODES here.

class QueryRouter:
    def __init__(self):
        self.categories = {
            'technical': ['how', 'implement', 'code', 'function', 'algorithm', 'debug'],
            'factual': ['what', 'who', 'when', 'where', 'define', 'explain'],
            'comparative': ['compare', 'difference', 'versus', 'vs', 'better', 'which'],
            'procedural': ['steps', 'process', 'guide', 'tutorial', 'how to']
        }

    def route(self, query: str) -> str:
        query_lower = query.lower()
        scores = {}
        for category, keywords in self.categories.items():
            score = sum(1 for kw in keywords if kw in query_lower)
            scores[category] = score
        best_category = max(scores, key=scores.get)
        return best_category if scores[best_category] > 0 else 'factual'

We introduce the QueryRouter class to classify queries by intent, technical, factual, comparative, or procedural. We use keyword matching to determine which category best fits the input question. This routing step ensures that the retrieval strategy adapts dynamically to different query styles. Check out the FULL CODES here.

class AnswerGenerator:
    def __init__(self, model_name='google/flan-t5-base'):
        print(f"Loading generation model: {model_name}...")
        self.generator = pipeline('text2text-generation', model=model_name,
                                  device=0 if torch.cuda.is_available() else -1, max_length=256)
        device_type = "GPU" if torch.cuda.is_available() else "CPU"
        print(f"✓ Generator ready (using {device_type})\n")

    def generate(self, query: str, context: List[Dict], query_type: str) -> str:
        context_text = "\n\n".join([f"[{doc['source']}]: {doc['text']}" for doc in context])
        prompt = f"""Answer the question based on the context below.

Context:
{context_text}

Question: {query}

Answer:"""
        answer = self.generator(prompt, max_length=200, do_sample=False)[0]['generated_text']
        return answer.strip()

    def self_check(self, query: str, answer: str, context: List[Dict]) -> Tuple[bool, str]:
        if len(answer) < 10:
            return False, "Answer too short - needs more detail"
        context_keywords = set()
        for doc in context:
            context_keywords.update(doc['text'].lower().split()[:20])
        answer_words = set(answer.lower().split())
        overlap = len(context_keywords.intersection(answer_words))
        if overlap < 2:
            return False, "Answer not grounded in context - needs more evidence"
        query_keywords = set(query.lower().split())
        if len(query_keywords.intersection(answer_words)) < 1:
            return False, "Answer doesn't address the query - rephrase needed"
        return True, "Answer quality acceptable"

We build the AnswerGenerator class to handle answer creation and self-evaluation. Using the Flan-T5 model, we generate text responses grounded in the retrieved documents. We then perform a self-check on answer length, context grounding, and query relevance, ensuring the output is meaningful and accurate. Check out the FULL CODES here.

class AgenticRAG:
    def __init__(self):
        self.vector_store = VectorStore()
        self.router = QueryRouter()
        self.generator = AnswerGenerator()
        self.max_iterations = 2

    def add_knowledge(self, documents: List[str], sources: List[str]):
        self.vector_store.add_documents(documents, sources)

    def query(self, question: str, verbose: bool = True) -> Dict:
        if verbose:
            print(f"\n{'='*60}")
            print(f"Query: {question}")
            print(f"{'='*60}")
        query_type = self.router.route(question)
        if verbose:
            print(f"Route: {query_type.upper()} query detected")
        k_docs = {'technical': 2, 'comparative': 4, 'procedural': 3}.get(query_type, 3)
        iteration = 0
        answer_accepted = False
        while iteration < self.max_iterations and not answer_accepted:
            iteration += 1
            if verbose:
                print(f"\nIteration {iteration}")
            context = self.vector_store.search(question, k=k_docs)
            if verbose:
                print(f"Retrieved {len(context)} documents from sources:")
                for doc in context:
                    print(f" - {doc['source']}")
            answer = self.generator.generate(question, context, query_type)
            if verbose:
                print(f"Generated answer: {answer[:100]}...")
            answer_accepted, feedback = self.generator.self_check(question, answer, context)
            if verbose:
                status = "✓ ACCEPTED" if answer_accepted else "✗ REJECTED"
                print(f"Self-check: {status}")
                print(f"Feedback: {feedback}")
            if not answer_accepted and iteration < self.max_iterations:
                question = f"{question} (provide more specific details)"
                k_docs += 1
        return {'answer': answer, 'query_type': query_type, 'iterations': iteration,
                'accepted': answer_accepted, 'sources': [doc['source'] for doc in context]}

We combine all components into the AgenticRAG system, which orchestrates routing, retrieval, generation, and quality checking. The system iteratively refines its answers based on self-evaluation feedback, adjusting the query or expanding context when necessary. This creates a feedback-driven decision-tree RAG that automatically improves performance. Check out the FULL CODES here.

def main():
    print("\n" + "="*60)
    print("AGENTIC RAG WITH ROUTING & SELF-CHECK")
    print("="*60 + "\n")
    documents = [
        "RAG (Retrieval-Augmented Generation) combines information retrieval with text generation. It retrieves relevant documents and uses them as context for generating accurate answers."
    ]
    sources = ["Python Documentation", "ML Textbook", "Neural Networks Guide", "Deep Learning Paper", "Transformer Architecture", "RAG Research Paper"]
    rag = AgenticRAG()
    rag.add_knowledge(documents, sources)
    test_queries = ["What is Python?", "How does machine learning work?", "Compare neural networks and deep learning"]
    for query in test_queries:
        result = rag.query(query, verbose=True)
        print(f"\n{'='*60}")
        print("FINAL RESULT:")
        print(f"Answer: {result['answer']}")
        print(f"Query Type: {result['query_type']}")
        print(f"Iterations: {result['iterations']}")
        print(f"Accepted: {result['accepted']}")
        print(f"{'='*60}\n")

if __name__ == "__main__":
    main()

We finalize the demo by loading a small knowledge base and running test queries through the Agentic RAG pipeline. We observe how the model routes, retrieves, and refines answers step by step, printing intermediate results for transparency. By the end, we confirm that our system successfully delivers accurate, self-validated answers using only local computation.

In conclusion, we create a fully functional Agentic RAG framework that autonomously retrieves, reasons, and refines its answers. We witness how the system dynamically routes different query types, evaluates its own responses, and improves them through iterative feedback, all within a lightweight, local environment. Through this exercise, we deepen our understanding of RAG architectures and also experience how agentic components can transform static retrieval systems into self-improving intelligent agents.

Check out the FULL CODES here.
The post How to Build an Agentic Decision-Tree RAG System with Intelligent Query Routing, Self-Checking, and Iterative Refinement? appeared first on MarkTechPost.

Meet ‘kvcached’: A Machine Learning Library to Enable Virtualized, …

Large language model serving often wastes GPU memory because engines pre-reserve large static KV cache regions per model, even when requests are bursty or idle. Meet 'kvcached', a library that enables a virtualized, elastic KV cache for LLM serving on shared GPUs. kvcached was developed by researchers from Berkeley's Sky Computing Lab (University of California, Berkeley) in close collaboration with Rice University and UCLA, with valuable input from collaborators and colleagues at NVIDIA, Intel Corporation, and Stanford University. It introduces an OS-style virtual memory abstraction for the KV cache that lets serving engines reserve contiguous virtual space first, then back only the active portions with physical GPU pages on demand. This decoupling raises memory utilization, reduces cold starts, and enables multiple models to time share and space share a device without heavy engine rewrites.

https://github.com/ovg-project/kvcached

What kvcached changes?

With kvcached, an engine creates a KV cache pool that is contiguous in the virtual address space. As tokens arrive, the library maps physical GPU pages lazily at a fine granularity using CUDA virtual memory APIs. When requests complete or models go idle, pages unmap and return to a shared pool, which other colocated models can immediately reuse. This preserves simple pointer arithmetic in kernels, and removes the need for per engine user level paging. The project targets SGLang and vLLM integration, and it is released under the Apache 2.0 license. Installation and a one command quick start are documented in the Git repository.
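The idea can be pictured with a toy simulation. This is an illustrative Python sketch of the reserve-virtual, map-physical-on-demand pattern, not kvcached's actual API or its CUDA-level implementation.

# Toy simulation of an elastic KV pool: reserve a large virtual range up front,
# back only touched pages with "physical" blocks on demand, and return them to a
# shared pool when a request finishes. Illustrative only; kvcached does this with
# CUDA virtual memory APIs at the GPU level.
class SharedPhysicalPool:
    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))

    def allocate(self) -> int:
        return self.free_pages.pop()

    def release(self, page: int) -> None:
        self.free_pages.append(page)

class VirtualKVCache:
    def __init__(self, pool: SharedPhysicalPool, virtual_pages: int):
        self.pool = pool
        self.virtual_pages = virtual_pages  # contiguous virtual reservation
        self.mapping = {}                   # virtual page index -> physical page

    def touch(self, vpage: int) -> None:
        if vpage not in self.mapping:       # map lazily when tokens first land here
            self.mapping[vpage] = self.pool.allocate()

    def free_all(self) -> None:             # request finished or model went idle
        for ppage in self.mapping.values():
            self.pool.release(ppage)
        self.mapping.clear()

pool = SharedPhysicalPool(total_pages=8)
model_a, model_b = VirtualKVCache(pool, 1024), VirtualKVCache(pool, 1024)
model_a.touch(0); model_a.touch(1)
model_a.free_all()                          # pages are immediately reusable by model_b
model_b.touch(0)
print(len(pool.free_pages), "pages free in the shared pool")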

https://yifanqiao.notion.site/Solve-the-GPU-Cost-Crisis-with-kvcached-289da9d1f4d68034b17bf2774201b141

How does it impact at scale?

Production workloads host many models with long tail traffic and spiky bursts. Static reservations leave memory stranded and slow down time to first token when models must be activated or swapped. The Prism research paper shows that multi-LLM serving requires cross model memory coordination at runtime, not just compute scheduling. Prism implements on demand mapping of physical to virtual pages and a two level scheduler, and reports more than 2 times cost savings and 3.3 times higher TTFT SLO attainment versus prior systems on real traces. kvcached focuses on the memory coordination primitive, and provides a reusable component that brings this capability to mainstream engines.

https://www.arxiv.org/pdf/2505.04021

Performance signals

The kvcached team reports 1.2 times to 28 times faster time to first token in multi-model serving, due to immediate reuse of freed pages and the removal of large static allocations. These numbers come from multi-LLM scenarios where activation latency and memory headroom dominate tail latency. The research team notes kvcached's compatibility with SGLang and vLLM and describes elastic KV allocation as the core mechanism.

https://yifanqiao.notion.site/Solve-the-GPU-Cost-Crisis-with-kvcached-289da9d1f4d68034b17bf2774201b141

How is it related to recent research?

Recent work has moved from fixed partitioning to virtual memory based methods for KV management. Prism extends VMM based allocation to multi-LLM settings with cross model coordination and scheduling. Prior efforts like vAttention explore CUDA VMM for single model serving to avoid fragmentation without PagedAttention. The arc is clear, use virtual memory to keep KV contiguous in virtual space, then map physical pages elastically as the workload evolves. kvcached operationalizes this idea as a library, which simplifies adoption inside existing engines.

https://www.arxiv.org/pdf/2505.04021

Practical Applications for Devs

Colocation across models: Engines can colocate several small or medium models on one device. When one model goes idle, its KV pages free quickly and another model can expand its working set without restart. This reduces head of line blocking during bursts and improves TTFT SLO attainment.

Activation behavior: Prism reports activation times of about 0.7 seconds for an 8B model and about 1.5 seconds for a 70B model with streaming activation. kvcached benefits from similar principles because virtual reservations allow engines to prepare address ranges in advance, then map pages as tokens arrive.

Autoscaling for serverless LLM: Fine grained page mapping makes it feasible to scale replicas more frequently and to run cold models in a warm state with minimal memory footprint. This enables tighter autoscaling loops and reduces the blast radius of hot spots.

Offloading and future work: Virtual memory opens the door to KV offload to host memory or NVMe when the access pattern allows it. NVIDIA's recent guide on managed memory for KV offload on GH200-class systems shows how unified address spaces can extend capacity at acceptable overheads. The kvcached maintainers also discuss offload and compaction directions in public threads. Verify throughput and latency in your own pipeline, since access locality and PCIe topology have strong effects.

https://www.arxiv.org/pdf/2505.04021

Key Takeaways

kvcached virtualizes the KV cache using GPU virtual memory, engines reserve contiguous virtual space and map physical pages on demand, enabling elastic allocation and reclamation under dynamic loads.

It integrates with mainstream inference engines, specifically SGLang and vLLM, and is released under Apache 2.0, making adoption and modification straightforward for production serving stacks.

Public benchmarks report 1.2 times to 28 times faster time to first token in multi model serving due to immediate reuse of freed KV pages and the removal of large static reservations.

Prism shows that cross model memory coordination, implemented via on demand mapping and two level scheduling, delivers more than 2 times cost savings and 3.3 times higher TTFT SLO attainment on real traces, kvcached supplies the memory primitive that mainstream engines can reuse.

For clusters that host many models with bursty, long tail traffic, virtualized KV cache allows safe colocation, faster activation, and tighter autoscaling, with reported activation around 0.7 seconds for an 8B model and 1.5 seconds for a 70B model in the Prism evaluation.

Editorial Comments

kvcached is an effective approach toward GPU memory virtualization for LLM serving, not a full operating system, and that clarity matters. The library reserves virtual address space for the KV cache, then maps physical pages on demand, which enables elastic sharing across models with minimal engine changes. This aligns with evidence that cross model memory coordination is essential for multi model workloads and improves SLO attainment and cost under real traces. Overall, kvcached advances GPU memory coordination for LLM serving, production value depends on per cluster validation.

Check out the GitHub Repo, Paper 1, Paper 2 and Technical details.
The post Meet ‘kvcached’: A Machine Learning Library to Enable Virtualized, Elastic KV Cache for LLM Serving on Shared GPUs appeared first on MarkTechPost.

5 Common LLM Parameters Explained with Examples

Large language models (LLMs) offer several parameters that let you fine-tune their behavior and control how they generate responses. If a model isn’t producing the desired output, the issue often lies in how these parameters are configured. In this tutorial, we’ll explore some of the most commonly used ones — max_completion_tokens, temperature, top_p, presence_penalty, and frequency_penalty — and understand how each influences the model’s output.

Installing the dependencies

pip install openai pandas matplotlib

Loading OpenAI API Key

import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

Initializing the Model

from openai import OpenAI
model = "gpt-4.1"
client = OpenAI()

Max Tokens

Max Tokens is the maximum number of tokens the model can generate during a run. The model will try to stay within this limit across all turns. If it exceeds the specified number, the run will stop and be marked as incomplete.

A smaller value (like 16) limits the model to very short answers, while a higher value (like 80) allows it to generate more detailed and complete responses. Increasing this parameter gives the model more room to elaborate, explain, or format its output more naturally.

prompt = "What is the most popular French cheese?"
for tokens in [16, 30, 80]:
    print(f"\n--- max_completion_tokens = {tokens} ---")
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "developer", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=tokens
    )
    print(response.choices[0].message.content)

Temperature

In Large Language Models (LLMs), the temperature parameter controls the diversity and randomness of generated outputs. Lower temperature values make the model more deterministic and focused on the most probable responses — ideal for tasks that require accuracy and consistency. Higher values, on the other hand, introduce creativity and variety by allowing the model to explore less likely options. Technically, temperature scales the probabilities of predicted tokens in the softmax function: increasing it flattens the distribution (more diverse outputs), while decreasing it sharpens the distribution (more predictable outputs).
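As a quick numeric illustration of that scaling, the snippet below applies temperature to a few made-up logits; the values are invented for demonstration and do not come from any model.

# Numeric sketch of temperature scaling with made-up logits.
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature
    scaled -= scaled.max()              # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([4.0, 3.0, 1.0])      # hypothetical scores for three candidate tokens
for t in [0.2, 1.0, 1.5]:
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# Low temperature sharpens the distribution toward the top token;
# high temperature flattens it so less likely tokens are sampled more often.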

In this code, we’re prompting the LLM to give 10 different responses (n_choices = 10) for the same question — “What is one intriguing place worth visiting?” — across a range of temperature values. By doing this, we can observe how the diversity of answers changes with temperature. Lower temperatures will likely produce similar or repeated responses, while higher temperatures will show a broader and more varied distribution of places.

prompt = "What is one intriguing place worth visiting? Give a single-word answer and think globally."

temperatures = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]
n_choices = 10
results = {}

for temp in temperatures:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=temp,
        n=n_choices
    )

    # Collect all n responses in a list
    results[temp] = [response.choices[i].message.content.strip() for i in range(n_choices)]

# Display results
for temp, responses in results.items():
    print(f"\n--- temperature = {temp} ---")
    print(responses)

As we can see, once the temperature rises to 0.6 the responses become more diverse, moving beyond the repeated single answer "Petra." At a higher temperature of 1.5, the distribution shifts further, and we see responses like Kyoto and Machu Picchu as well.

Top P

Top P (also known as nucleus sampling) is a parameter that controls how many tokens the model considers based on a cumulative probability threshold. It helps the model focus on the most likely tokens, often improving coherence and output quality.

In the following visualization, we first set a temperature value and then apply Top P = 0.5 (50%), meaning only the top 50% of the probability mass is kept. Note that when temperature = 0, the output is deterministic, so Top P has no effect.

The generation process works as follows (a short numeric sketch follows these steps):

Apply the temperature to adjust the token probabilities.

Use Top P to retain only the most probable tokens that together make up 50% of the total probability mass.

Renormalize the remaining probabilities before sampling.
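
Here is a small numeric sketch of those three steps using made-up, already temperature-scaled probabilities rather than real model outputs.

# Numeric sketch of nucleus (top-p) filtering after temperature scaling.
# The probabilities are invented for illustration.
import numpy as np

probs = np.array([0.42, 0.25, 0.15, 0.10, 0.08])      # already temperature-scaled
order = np.argsort(probs)[::-1]                       # sort tokens by probability
cumulative = np.cumsum(probs[order])
keep = order[: np.searchsorted(cumulative, 0.5) + 1]  # smallest set covering top_p = 0.5

filtered = np.zeros_like(probs)
filtered[keep] = probs[keep]
filtered /= filtered.sum()                            # renormalize before sampling
print("kept token indices:", keep, "renormalized:", np.round(filtered, 3))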

We’ll visualize how the token probability distribution changes across different temperature values for the question:“What is one intriguing place worth visiting?”

prompt = "What is one intriguing place worth visiting? Give a single-word answer and think globally."

temperatures = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]
n_choices = 10
results_ = {}

for temp in temperatures:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=temp,
        n=n_choices,
        top_p=0.5
    )

    # Collect all n responses in a list
    results_[temp] = [response.choices[i].message.content.strip() for i in range(n_choices)]

# Display results
for temp, responses in results_.items():
    print(f"\n--- temperature = {temp} ---")
    print(responses)

Since Petra consistently accounted for more than 50% of the total response probability, applying Top P = 0.5 filters out all other options. As a result, the model only selects “Petra” as the final output in every case.

Frequency Penalty

Frequency Penalty controls how much the model avoids repeating the same words or phrases in its output.

Range: -2 to 2

Default: 0

When the frequency penalty is higher, the model gets penalized for using words it has already used before. This encourages it to choose new and different words, making the text more varied and less repetitive.

In simple terms — a higher frequency penalty = less repetition and more creativity.

We’ll test this using the prompt:

“List 10 possible titles for a fantasy book. Give the titles only and each title on a new line.”

prompt = "List 10 possible titles for a fantasy book. Give the titles only and each title on a new line."
frequency_penalties = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0]
results = {}

for fp in frequency_penalties:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        frequency_penalty=fp,
        temperature=0.2
    )

    text = response.choices[0].message.content
    items = [line.strip("- ").strip() for line in text.split("\n") if line.strip()]
    results[fp] = items

# Display results
for fp, items in results.items():
    print(f"\n--- frequency_penalty = {fp} ---")
    print(items)

Low frequency penalties (-2 to 0): Titles tend to repeat, with familiar patterns like “The Shadow Weaver’s Oath”, “Crown of Ember and Ice”, and “The Last Dragon’s Heir” appearing frequently.

Moderate penalties (0.5 to 1.5): Some repetition remains, but the model starts generating more varied and creative titles.

High penalty (2.0): The first three titles are still the same, but after that, the model produces diverse, unique, and imaginative book names (e.g., “Whisperwind Chronicles: Rise of the Phoenix Queen”, “Ashes Beneath the Willow Tree”).

Presence Penalty

Presence Penalty controls how much the model avoids repeating words or phrases that have already appeared in the text.

Range: -2 to 2

Default: 0

A higher presence penalty encourages the model to use a wider variety of words, making the output more diverse and creative.

Unlike the frequency penalty, which accumulates with each repetition, the presence penalty is applied once to any word that has already appeared, reducing the chance it will be repeated in the output. This helps the model produce text with more variety and originality.

prompt = "List 10 possible titles for a fantasy book. Give the titles only and each title on a new line."
presence_penalties = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0]
results = {}

for pp in presence_penalties:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        presence_penalty=pp,
        temperature=0.2
    )

    text = response.choices[0].message.content
    items = [line.strip("- ").strip() for line in text.split("\n") if line.strip()]
    results[pp] = items

# Display results
for pp, items in results.items():
    print(f"\n--- presence_penalty = {pp} ---")
    print(items)

Low to Moderate Penalty (-2.0 to 0.5): Titles are somewhat varied, with some repetition of common fantasy patterns like “The Shadow Weaver’s Oath”, “The Last Dragon’s Heir”, “Crown of Ember and Ice”.

Medium Penalty (1.0 to 1.5): The first few popular titles remain, while later titles show more creativity and unique combinations. Examples: “Ashes of the Fallen Kingdom”, “Secrets of the Starbound Forest”, “Daughter of Storm and Stone”.

Maximum Penalty (2.0): Top three titles stay the same, but the rest become highly diverse and imaginative. Examples: “Moonfire and Thorn”, “Veil of Starlit Ashes”, “The Midnight Blade”.

Check out the FULL CODES here.
The post 5 Common LLM Parameters Explained with Examples appeared first on MarkTechPost.

How to Build, Train, and Compare Multiple Reinforcement Learning Agents in a Custom Trading Environment Using Stable-Baselines3

In this tutorial, we explore advanced applications of Stable-Baselines3 in reinforcement learning. We design a fully functional, custom trading environment, integrate multiple algorithms such as PPO and A2C, and develop our own training callbacks for performance tracking. As we progress, we train, evaluate, and visualize agent performance to compare algorithmic efficiency, learning curves, and decision strategies, all within a streamlined workflow that runs entirely offline.

!pip install stable-baselines3[extra] gymnasium pygame
import numpy as np
import gymnasium as gym
from gymnasium import spaces
import matplotlib.pyplot as plt
from stable_baselines3 import PPO, A2C, DQN, SAC
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import torch

class TradingEnv(gym.Env):
    def __init__(self, max_steps=200):
        super().__init__()
        self.max_steps = max_steps
        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)
        self.reset()

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_step = 0
        self.balance = 1000.0
        self.shares = 0
        self.price = 100.0
        self.price_history = [self.price]
        return self._get_obs(), {}

    def _get_obs(self):
        price_trend = np.mean(self.price_history[-5:]) if len(self.price_history) >= 5 else self.price
        return np.array([
            self.balance / 1000.0,
            self.shares / 10.0,
            self.price / 100.0,
            price_trend / 100.0,
            self.current_step / self.max_steps
        ], dtype=np.float32)

    def step(self, action):
        self.current_step += 1
        trend = 0.001 * np.sin(self.current_step / 20)
        self.price *= (1 + trend + np.random.normal(0, 0.02))
        self.price = np.clip(self.price, 50, 200)
        self.price_history.append(self.price)
        reward = 0
        if action == 1 and self.balance >= self.price:
            shares_to_buy = int(self.balance / self.price)
            cost = shares_to_buy * self.price
            self.balance -= cost
            self.shares += shares_to_buy
            reward = -0.01
        elif action == 2 and self.shares > 0:
            revenue = self.shares * self.price
            self.balance += revenue
            self.shares = 0
            reward = 0.01
        portfolio_value = self.balance + self.shares * self.price
        reward += (portfolio_value - 1000) / 1000
        terminated = self.current_step >= self.max_steps
        truncated = False
        return self._get_obs(), reward, terminated, truncated, {"portfolio": portfolio_value}

    def render(self):
        print(f"Step: {self.current_step}, Balance: ${self.balance:.2f}, Shares: {self.shares}, Price: ${self.price:.2f}")

We define our custom TradingEnv, where an agent learns to make buy, sell, or hold decisions based on simulated price movements. We set up the observation and action spaces, implement the reward structure, and ensure our environment reflects a realistic market scenario with fluctuating trends and noise.

class ProgressCallback(BaseCallback):
    def __init__(self, check_freq=1000, verbose=1):
        super().__init__(verbose)
        self.check_freq = check_freq
        self.rewards = []

    def _on_step(self):
        if self.n_calls % self.check_freq == 0:
            mean_reward = np.mean([ep_info["r"] for ep_info in self.model.ep_info_buffer])
            self.rewards.append(mean_reward)
            if self.verbose:
                print(f"Steps: {self.n_calls}, Mean Reward: {mean_reward:.2f}")
        return True

print("=" * 60)
print("Setting up custom trading environment...")
env = TradingEnv()
check_env(env, warn=True)
print("✓ Environment validation passed!")
env = Monitor(env)
vec_env = DummyVecEnv([lambda: env])
vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=True)

Here, we create a ProgressCallback to monitor training progress and record mean rewards at regular intervals. We then validate our custom environment using Stable-Baselines3’s built-in checker, wrap it for monitoring and normalization, and prepare it for training across multiple algorithms.

print("\n" + "=" * 60)
print("Training multiple RL algorithms...")
algorithms = {
    "PPO": PPO("MlpPolicy", vec_env, verbose=0, learning_rate=3e-4, n_steps=2048),
    "A2C": A2C("MlpPolicy", vec_env, verbose=0, learning_rate=7e-4),
}
results = {}
for name, model in algorithms.items():
    print(f"\nTraining {name}...")
    callback = ProgressCallback(check_freq=2000, verbose=0)
    model.learn(total_timesteps=50000, callback=callback, progress_bar=True)
    results[name] = {"model": model, "rewards": callback.rewards}
    print(f"✓ {name} training complete!")

print("\n" + "=" * 60)
print("Evaluating trained models...")
eval_env = Monitor(TradingEnv())
for name, data in results.items():
    mean_reward, std_reward = evaluate_policy(data["model"], eval_env, n_eval_episodes=20, deterministic=True)
    results[name]["eval_mean"] = mean_reward
    results[name]["eval_std"] = std_reward
    print(f"{name}: Mean Reward = {mean_reward:.2f} +/- {std_reward:.2f}")

We train and evaluate two different reinforcement learning algorithms, PPO and A2C, on our trading environment. We log their performance metrics, capture mean rewards, and compare how efficiently each agent learns profitable trading strategies through consistent exploration and exploitation.

print("\n" + "=" * 60)
print("Generating visualizations...")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
ax = axes[0, 0]
for name, data in results.items():
    ax.plot(data["rewards"], label=name, linewidth=2)
ax.set_xlabel("Training Checkpoints (x1000 steps)")
ax.set_ylabel("Mean Episode Reward")
ax.set_title("Training Progress Comparison")
ax.legend()
ax.grid(True, alpha=0.3)

ax = axes[0, 1]
names = list(results.keys())
means = [results[n]["eval_mean"] for n in names]
stds = [results[n]["eval_std"] for n in names]
ax.bar(names, means, yerr=stds, capsize=10, alpha=0.7, color=['#1f77b4', '#ff7f0e'])
ax.set_ylabel("Mean Reward")
ax.set_title("Evaluation Performance (20 episodes)")
ax.grid(True, alpha=0.3, axis='y')

ax = axes[1, 0]
best_model = max(results.items(), key=lambda x: x[1]["eval_mean"])[1]["model"]
obs = eval_env.reset()[0]
portfolio_values = [1000]
for _ in range(200):
    action, _ = best_model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = eval_env.step(action)
    portfolio_values.append(info.get("portfolio", portfolio_values[-1]))
    if done:
        break
ax.plot(portfolio_values, linewidth=2, color='green')
ax.axhline(y=1000, color='red', linestyle='--', label='Initial Value')
ax.set_xlabel("Steps")
ax.set_ylabel("Portfolio Value ($)")
ax.set_title(f"Best Model ({max(results.items(), key=lambda x: x[1]['eval_mean'])[0]}) Episode")
ax.legend()
ax.grid(True, alpha=0.3)

We visualize our training results by plotting learning curves, evaluation scores, and portfolio trajectories for the best-performing model. We also analyze how the agent’s actions translate into portfolio growth, which helps us interpret model behavior and assess decision consistency during simulated trading sessions.

ax = axes[1, 1]
obs = eval_env.reset()[0]
actions = []
for _ in range(200):
    action, _ = best_model.predict(obs, deterministic=True)
    actions.append(action)
    obs, _, done, truncated, _ = eval_env.step(action)
    if done:
        break
action_names = ['Hold', 'Buy', 'Sell']
action_counts = [actions.count(i) for i in range(3)]
ax.pie(action_counts, labels=action_names, autopct='%1.1f%%', startangle=90, colors=['#ff9999', '#66b3ff', '#99ff99'])
ax.set_title("Action Distribution (Best Model)")
plt.tight_layout()
plt.savefig('sb3_advanced_results.png', dpi=150, bbox_inches='tight')
print("✓ Visualizations saved as 'sb3_advanced_results.png'")
plt.show()

print("\n" + "=" * 60)
print("Saving and loading models...")
best_name = max(results.items(), key=lambda x: x[1]["eval_mean"])[0]
best_model = results[best_name]["model"]
best_model.save(f"best_trading_model_{best_name}")
vec_env.save("vec_normalize.pkl")
# Reload with the class that matches the winning algorithm (PPO or A2C)
loaded_model = type(best_model).load(f"best_trading_model_{best_name}")
print(f"✓ Best model ({best_name}) saved and loaded successfully!")
print("\n" + "=" * 60)
print("TUTORIAL COMPLETE!")
print(f"Best performing algorithm: {best_name}")
print(f"Final evaluation score: {results[best_name]['eval_mean']:.2f}")
print("=" * 60)

Finally, we visualize the action distribution of the best agent to understand its trading tendencies and save the top-performing model for reuse. We demonstrate model loading, confirm the best algorithm, and complete the tutorial with a clear summary of performance outcomes and insights gained.
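One practical note when reusing the saved artifacts: observations were normalized by VecNormalize during training, so inference should reload the saved statistics and wrap a fresh environment the same way. A minimal sketch, assuming PPO came out on top and the file names used above:

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Rebuild the environment and reload the normalization statistics saved earlier.
inference_env = DummyVecEnv([lambda: TradingEnv()])
inference_env = VecNormalize.load("vec_normalize.pkl", inference_env)
inference_env.training = False      # freeze the running statistics at inference time
inference_env.norm_reward = False   # report raw rewards rather than normalized ones

# "best_trading_model_PPO" assumes PPO was the winning algorithm; adjust if A2C won.
model = PPO.load("best_trading_model_PPO", env=inference_env)
obs = inference_env.reset()
action, _ = model.predict(obs, deterministic=True)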

In conclusion, we have created, trained, and compared multiple reinforcement learning agents in a realistic trading simulation using Stable-Baselines3. We observe how each algorithm adapts to market dynamics, visualize their learning trends, and identify the most profitable strategy. This hands-on implementation strengthens our understanding of RL pipelines and demonstrates how customizable, efficient, and scalable Stable-Baselines3 can be for complex, domain-specific tasks such as financial modeling.

The post How to Build, Train, and Compare Multiple Reinforcement Learning Agents in a Custom Trading Environment Using Stable-Baselines3 appeared first on MarkTechPost.

A New AI Research from Anthropic and Thinking Machines Lab Stress Tests Model Specs and Reveals Character Differences among Language Models

AI companies use model specifications to define target behaviors during training and evaluation. Do current specs state the intended behaviors with enough precision, and do frontier models exhibit distinct behavioral profiles under the same spec? A team of researchers from Anthropic, Thinking Machines Lab, and Constellation presents a systematic method that stress tests model specs using value tradeoff scenarios, then quantifies cross model disagreement as a signal of gaps or contradictions in the spec. The research team analyzed 12 frontier LLMs from Anthropic, OpenAI, Google, and xAI and linked high disagreement to specification violations, missing guidance on response quality, and evaluator ambiguity. The team also released a public dataset for independent auditing.

Model specifications are the written rules that alignment systems try to enforce. If a spec is complete and precise, models trained to follow it should not diverge widely on the same input. The research team operationalizes this intuition. It generates more than 300,000 scenarios that force a choice between two legitimate values, such as social equity and business effectiveness. It then scores responses on a 0 to 6 spectrum using value spectrum rubrics and measures disagreement as the standard deviation across models. High disagreement localizes the spec clauses that need clarification or additional examples.
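As a rough illustration of the statistic (a sketch, not the authors' code), the per-scenario disagreement can be computed as the standard deviation of the 0 to 6 rubric scores across the 12 models, taking the maximum over the two value dimensions of the tradeoff:

import numpy as np

# Hypothetical rubric scores for one scenario: 12 models x 2 value dimensions,
# each on the 0-6 spectrum. A sketch of the disagreement statistic, not the paper's code.
scores = np.array([
    [5, 1], [4, 2], [6, 0], [3, 3], [5, 2], [2, 4],
    [4, 1], [5, 1], [1, 5], [3, 2], [6, 1], [4, 3],
], dtype=float)

disagreement = scores.std(axis=0).max()      # max standard deviation over the two value dimensions
print(f"disagreement = {disagreement:.2f}")  # high values flag spec clauses that need clarification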

https://arxiv.org/pdf/2510.07686

So, what is the method used in this research?

The research team starts from a taxonomy of 3,307 fine-grained values observed in natural Claude traffic, which is more granular than typical model specs. For each pair of values, they generate a neutral query and two biased variants that lean toward one value. They build value spectrum rubrics that map positions from 0, which means strongly opposing the value, to 6, which means strongly favoring the value. They classify responses from 12 models against these rubrics and define disagreement as the maximum standard deviation across the two value dimensions. To remove near duplicates while keeping the hard cases, they use a disagreement-weighted k-center selection with Gemini embeddings and a greedy 2-approximation algorithm.
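The deduplication step is a standard greedy 2-approximation to k-center, here weighted so that high-disagreement scenarios are preferentially retained. A rough sketch under stated assumptions (scenario embeddings and disagreement scores as inputs; the exact weighting used in the paper may differ):

import numpy as np

def weighted_k_center(embeddings, disagreement, k):
    """Greedy 2-approximation to k-center, weighting distances by disagreement.

    embeddings:   (n, d) array of scenario embeddings (the paper uses Gemini embeddings)
    disagreement: (n,) array of per-scenario disagreement scores
    k:            number of scenarios to keep
    Sketch only; not the authors' implementation.
    """
    selected = [int(np.argmax(disagreement))]   # start from the most contested scenario
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        # pick the scenario farthest (in disagreement-weighted distance) from the current set
        nxt = int(np.argmax(min_dist * disagreement))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Toy usage with random data
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 32))
dis = rng.uniform(0, 3, size=1000)
keep = weighted_k_center(emb, dis, k=50)
print(len(keep), "scenarios retained")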

https://arxiv.org/pdf/2510.07686

Scale and releases

The dataset on Hugging Face exposes three subsets: the default split has about 132,000 rows, the complete split about 411,000 rows, and the judge evaluations split about 24,600 rows. The dataset card lists the modality, the format (parquet), and the license (Apache 2.0).
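For independent auditing, the splits can be pulled with the Hugging Face datasets library. The repository id and subset name below are placeholders, not the actual identifiers; substitute the ones listed on the dataset card:

# Sketch of loading one of the released parquet subsets with the datasets library.
# "example-org/value-tradeoff-scenarios" and "default" are placeholder names;
# replace them with the repository id and subset listed on the Hugging Face card.
from datasets import load_dataset

ds = load_dataset("example-org/value-tradeoff-scenarios", name="default", split="train")
print(ds.num_rows)        # the default subset is reported at roughly 132,000 rows
print(ds.column_names)    # inspect the available fields before filtering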

Understanding the Results

Disagreement predicts spec violations: When five OpenAI models are tested against the public OpenAI model spec, high disagreement scenarios show 5 to 13 times higher rates of frequent non-compliance. The research team interprets the pattern as evidence of contradictions and ambiguities in the spec text rather than idiosyncrasies of a single model.

Specs lack granularity on quality inside the safe region: Some scenarios produce responses that all pass compliance, yet differ in helpfulness. For instance, one model refuses and offers safe alternatives, while another only refuses. The spec accepts both, which indicates missing guidance on quality standards.

Evaluator models disagree on compliance: Three LLM judges, Claude 4 Sonnet, o3, and Gemini 2.5 Pro, show only moderate agreement, with a Fleiss’ kappa near 0.42. The accompanying blog post attributes the conflicts to interpretive differences such as conscientious pushback versus transformation exceptions.
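Fleiss’ kappa for the three judges can be reproduced on any table of compliance labels with statsmodels; a small sketch with made-up labels (rows are scenarios, columns are the three judge models):

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical compliance labels: 0 = non-compliant, 1 = compliant.
# Real labels would come from the released judge-evaluations subset.
labels = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
])

# aggregate_raters turns per-rater labels into per-category counts for each scenario
counts, _ = aggregate_raters(labels)
print(f"Fleiss' kappa = {fleiss_kappa(counts):.2f}")  # the study reports roughly 0.42 on its real data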

https://alignment.anthropic.com/2025/stress-testing-model-specs/

Provider level character patterns: Aggregating high disagreement scenarios reveals consistent value preferences. Claude models prioritize ethical responsibility and intellectual integrity and objectivity. OpenAI models tend to favor efficiency and resource optimization. Gemini 2.5 Pro and Grok more often emphasize emotional depth and authentic connection. Other values, such as business effectiveness, personal growth and wellbeing, and social equity and justice, show mixed patterns across providers.

Refusals and false positives: The analysis shows topic sensitive refusal spikes. It documents false positive refusals, including legitimate synthetic biology study plans and standard Rust unsafe types that are often safe in context. Claude models are the most cautious by rate of refusal and often provide alternative suggestions, and o3 most often issues direct refusals without elaboration. All models show high refusal rates on child grooming risks.

https://alignment.anthropic.com/2025/stress-testing-model-specs/

Outliers reveal misalignment and over-conservatism: Grok 4 and Claude 3.5 Sonnet produce the most outlier responses, but for different reasons. Grok is more permissive on requests that others consider harmful, while Claude 3.5 sometimes over-rejects benign content. Outlier mining is a useful lens for locating both safety gaps and excessive filtering.

https://alignment.anthropic.com/2025/stress-testing-model-specs/

Key Takeaways

Method and scale: The study stress-tests model specs using value-tradeoff scenarios generated from a 3,307-value taxonomy, producing 300,000+ scenarios and evaluating 12 frontier LLMs across Anthropic, OpenAI, Google, and xAI.

Disagreement ⇒ spec problems: High cross-model disagreement strongly predicts issues in specs, including contradictions and coverage gaps. In tests against the OpenAI model spec, high-disagreement items show 5 to 13× higher rates of frequent non-compliance.

Public release: The team released a dataset for independent auditing and reproduction.

Provider-level behavior: Aggregated results reveal systematic value preferences, for example Claude prioritizes ethical responsibility, OpenAI favors efficiency and resource optimization, and Gemini and Grok emphasize emotional depth. Some values, such as business effectiveness and social equity and justice, show mixed patterns.

Refusals and outliers: High-disagreement slices expose both false-positive refusals on benign topics and permissive responses on risky ones. Outlier analysis identifies cases where one model diverges from at least 9 of the other 11, useful for pinpointing misalignment and over-conservatism.

Editorial Comments

This research turns disagreement into a measurable diagnostic for spec quality, not a vibe. The research team generates more than 300,000 value tradeoff scenarios, scores responses on a 0 to 6 rubric, then uses cross model standard deviation to locate specification gaps. High disagreement predicts frequent non-compliance at 5 to 13 times the baseline rate under the OpenAI model spec. Judge models show only moderate agreement (Fleiss’ kappa near 0.42), which exposes interpretive ambiguity. Provider level value patterns are clear: Claude favors ethical responsibility, OpenAI favors efficiency and resource optimization, and Gemini and Grok emphasize emotional depth and authentic connection. The dataset enables reproduction. Deploy this method to debug specs before deployment, not after.

Check out the Paper, Dataset, and Technical details.
The post A New AI Research from Anthropic and Thinking Machines Lab Stress Tests Model Specs and Reveal Character Differences among Language Models appeared first on MarkTechPost.