Cache-to-Cache(C2C): Direct Semantic Communication Between Large Language Models via KV-Cache Fusion

Can large language models collaborate without sending a single token of text? A team of researchers from Tsinghua University, Infinigence AI, The Chinese University of Hong Kong, Shanghai AI Laboratory, and Shanghai Jiao Tong University says yes. Cache-to-Cache (C2C) is a new communication paradigm in which large language models exchange information through their KV-Cache rather than through generated text.

https://arxiv.org/pdf/2510.03215

Text communication is the bottleneck in multi LLM systems

Current multi LLM systems mostly use text to communicate. One model writes an explanation, another model reads it as context.

This design has three practical costs:

Internal activations are compressed into short natural language messages. Much of the semantic signal in the KV-Cache never crosses the interface.

Natural language is ambiguous. Even with structured protocols, a coder model may encode structural signals, such as the role of an HTML <p> tag, that do not survive a vague textual description.

Every communication step requires token by token decoding, which dominates latency in long analytical exchanges.

The C2C work asks a direct question: can we treat the KV-Cache itself as the communication channel instead?

Oracle experiments, can KV-Cache carry communication

The research team first runs two oracle-style experiments to test whether the KV-Cache is a useful communication medium.

Cache enrichment oracle

They compare three setups on multiple choice benchmarks:

Direct: prefill on the question only.

Few-shot: prefill on exemplars plus question, longer cache.

Oracle: prefill on exemplars plus question, then discard the exemplar segment and keep only the question-aligned slice of the cache, so cache length is the same as Direct.


Oracle improves accuracy from 58.42 percent to 62.34 percent at the same cache length, while Few-shot reaches 63.39 percent. This demonstrates that enriching the question KV-Cache itself, even without more tokens, improves performance. Layer-wise analysis shows that enriching only selected layers is better than enriching all layers, which later motivates a gating mechanism.
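
To make the enrichment oracle concrete, here is a minimal sketch of the question-aligned slicing step, assuming the KV-Cache is held as a list of per-layer (key, value) tensors in the common [batch, heads, seq_len, head_dim] layout; the function name and shapes are illustrative, not the paper's code.

import torch

def question_aligned_slice(kv_cache, exemplar_len):
    """Keep only the question-aligned tail of each layer's KV-Cache.

    kv_cache: list of (key, value) tensors shaped [batch, heads, seq_len, head_dim],
              produced by a prefill over exemplars + question.
    exemplar_len: number of exemplar tokens to discard from the front.
    """
    sliced = []
    for key, value in kv_cache:
        sliced.append((key[:, :, exemplar_len:, :], value[:, :, exemplar_len:, :]))
    return sliced

# The resulting cache has the same length as a Direct prefill on the question
# alone, but its entries were computed while the exemplars were still visible,
# which is exactly what the enrichment oracle measures.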

Cache transformation oracle

Next, they test whether KV-Cache from one model can be transformed into the space of another model. A three-layer MLP is trained to map KV-Cache from Qwen3 4B to Qwen3 0.6B. t-SNE plots show that the transformed cache lies inside the target cache manifold, but only in a sub region.
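
A minimal sketch of such a translator is shown below, assuming flattened per-token K/V vectors; the dimensions and the SiLU activation are placeholders for illustration, not the paper's actual configuration.

import torch.nn as nn

# Illustrative dimensions only: per-token, per-layer K/V vectors of the
# Sharer (Qwen3 4B) and Receiver (Qwen3 0.6B) flattened to these sizes.
SHARER_DIM, RECEIVER_DIM, HIDDEN_DIM = 4096, 1024, 2048

cache_translator = nn.Sequential(
    nn.Linear(SHARER_DIM, HIDDEN_DIM),
    nn.SiLU(),
    nn.Linear(HIDDEN_DIM, HIDDEN_DIM),
    nn.SiLU(),
    nn.Linear(HIDDEN_DIM, RECEIVER_DIM),  # maps into the Receiver's KV space
)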


C2C, direct semantic communication through KV-Cache

Based on these oracles, the research team defines Cache-to-Cache communication between a Sharer and a Receiver model.

During prefill, both models read the same input and produce layer wise KV-Cache. For each Receiver layer, C2C selects a mapped Sharer layer and applies a C2C Fuser to produce a fused cache. During decoding, the Receiver predicts tokens conditioned on this fused cache instead of its original cache.

The C2C Fuser follows a residual integration principle and has three modules:

Projection module concatenates Sharer and Receiver KV-Cache vectors, applies a projection layer, then a feature fusion layer.

Dynamic weighting module modulates heads based on the input so that some attention heads rely more on Sharer information.

Learnable gate adds a per layer gate that decides whether to inject Sharer context into that layer. The gate uses a Gumbel sigmoid during training and becomes binary at inference.
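
The sketch below illustrates how these three modules could fit together for a single layer, written in PyTorch style over a flattened [batch, seq, dim] view of the cache; tensor shapes, the SiLU activation, the simplified Gumbel-sigmoid, and module names are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class C2CFuserSketch(nn.Module):
    """Hedged sketch of a per-layer C2C fuser: residual fusion of Sharer
    KV vectors into the Receiver cache. Shapes and names are illustrative."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.proj = nn.Linear(2 * dim, dim)             # projection over concatenated caches
        self.fuse = nn.Linear(dim, dim)                 # feature fusion layer
        self.head_weight = nn.Linear(dim, num_heads)    # input-conditioned head weights
        self.gate_logit = nn.Parameter(torch.zeros(1))  # learnable per-layer gate

    def forward(self, recv_kv, sharer_kv, training=True):
        # recv_kv, sharer_kv: [batch, seq, dim], already token and layer aligned
        fused = self.fuse(F.silu(self.proj(torch.cat([recv_kv, sharer_kv], dim=-1))))

        # Dynamic weighting: modulate how much each attention head uses Sharer info
        w = torch.sigmoid(self.head_weight(recv_kv))                    # [batch, seq, heads]
        w = w.repeat_interleave(fused.shape[-1] // self.num_heads, dim=-1)
        fused = w * fused

        # Per-layer gate: Gumbel-sigmoid style relaxation in training, hard 0/1 at inference
        if training:
            u = torch.rand_like(self.gate_logit)
            noise = torch.log(u) - torch.log1p(-u)  # logistic noise (temperature omitted)
            gate = torch.sigmoid(self.gate_logit + noise)
        else:
            gate = (self.gate_logit > 0).float()

        # Residual integration: Receiver cache plus gated, fused Sharer contribution
        return recv_kv + gate * fused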

Sharer and Receiver can come from different families and sizes, so C2C also defines:

Token alignment by decoding Receiver tokens to strings and re-encoding them with the Sharer tokenizer, then choosing Sharer tokens with maximal string coverage.

Layer alignment using a terminal strategy that pairs top layers first and walks backward until the shallower model is fully covered.
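
A minimal sketch of the terminal layer-alignment strategy, assuming layers are indexed from 0; this illustrates the described pairing rule, not the released code.

def terminal_layer_alignment(num_sharer_layers, num_receiver_layers):
    """Pair layers from the top down until the shallower model is covered.

    Returns a dict {receiver_layer: sharer_layer}.
    """
    mapping = {}
    depth = min(num_sharer_layers, num_receiver_layers)
    for offset in range(1, depth + 1):
        mapping[num_receiver_layers - offset] = num_sharer_layers - offset
    return mapping

# Example: a 36-layer Sharer and a 28-layer Receiver pair layer 27 with 35,
# 26 with 34, and so on, leaving the Sharer's shallowest layers unused.
print(terminal_layer_alignment(36, 28)[27])  # -> 35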


During training, both LLMs are frozen. Only the C2C module is trained, using a next token prediction loss on Receiver outputs. The main C2C fusers are trained on the first 500k samples of the OpenHermes-2.5 dataset, and evaluated on OpenBookQA, ARC-Challenge, MMLU-Redux and C-Eval.

Accuracy and latency, C2C versus text communication

Across many Sharer-Receiver combinations built from Qwen2.5, Qwen3, Llama 3.2 and Gemma 3, C2C consistently improves Receiver accuracy and reduces latency. Key results:

C2C achieves about 8.5 to 10.5 percent higher average accuracy than individual models.

C2C outperforms text communication by about 3.0 to 5.0 percent on average.

C2C delivers around 2x average speedup in latency compared with text based collaboration, and in some configurations the speedup is larger.

A concrete example uses Qwen3 0.6B as Receiver and Qwen2.5 0.5B as Sharer. On MMLU-Redux, the Receiver alone reaches 35.53 percent, text-to-text communication reaches 41.03 percent, and C2C reaches 42.92 percent. Average time per query for text-to-text is 1.52 units, while C2C stays close to the single model at 0.40. Similar patterns appear on OpenBookQA, ARC-Challenge and C-Eval.

On LongBenchV1, with the same pair, C2C outperforms text communication across all sequence length buckets. For sequences of 0 to 4k tokens, text communication reaches 29.47 while C2C reaches 36.64. Gains remain for 4k to 8k and for longer contexts.


Key Takeaways

Cache-to-Cache communication lets a Sharer model send information to a Receiver model directly via KV-Cache, so collaboration does not need intermediate text messages, which removes the token bottleneck and reduces semantic loss in multi model systems.

Two oracle studies show that enriching only the question aligned slice of the cache improves accuracy at constant sequence length, and that KV-Cache from a larger model can be mapped into a smaller model’s cache space through a learned projector, confirming cache as a viable communication medium.

C2C Fuser architecture combines Sharer and Receiver caches with a projection module, dynamic head weighting and a learnable per layer gate, and integrates everything in a residual way, which allows the Receiver to selectively absorb Sharer semantics without destabilizing its own representation.

Consistent accuracy and latency gains are observed across Qwen2.5, Qwen3, Llama3.2 and Gemma3 model pairs, with about 8.5 to 10.5 percent average accuracy improvement over a single model, 3 to 5 percent gains over text to text communication, and around 2x faster responses because unnecessary decoding is removed.

Editorial Comments

Cache-to-Cache reframes multi-LLM communication as a direct semantic transfer problem, not a prompt engineering problem. By projecting and fusing KV-Cache between Sharer and Receiver with a neural fuser and learnable gating, C2C uses the deep specialized semantics of both models while avoiding explicit intermediate text generation, which is an information bottleneck and a latency cost. With 8.5 to 10.5 percent higher average accuracy than individual models and about 2x lower latency than text communication, C2C is a strong systems-level step toward KV-native collaboration between models.

Check out the Paper and Codes.
The post Cache-to-Cache(C2C): Direct Semantic Communication Between Large Language Models via KV-Cache Fusion appeared first on MarkTechPost.

Iterate faster with Amazon Bedrock AgentCore Runtime direct code deployment

Amazon Bedrock AgentCore is an agentic platform for building, deploying, and operating effective agents securely at scale. Amazon Bedrock AgentCore Runtime is a fully managed service of Bedrock AgentCore, which provides low latency serverless environments to deploy agents and tools. It provides session isolation, supports multiple agent frameworks including popular open-source frameworks, and handles multimodal workloads and long-running agents.
AgentCore Runtime supports container based deployments where the container definition is provided in a Dockerfile, and the agent is built as a container image. Customers who have container build and deploy pipelines benefit from this method, where agent deployment can be integrated into existing pipelines. 
Today, AgentCore Runtime has launched a second method to deploy agents – direct code deployment (for Python). Agent code and its dependencies can be packaged as a zip archive, removing the need for a Dockerfile and Amazon ECR dependencies. This makes it straightforward for developers to prototype and iterate faster. This method is a good fit for customers who prefer not to manage Docker tooling and container infrastructure when deploying agents.
In this post, we’ll demonstrate how to use direct code deployment (for Python).
Introducing AgentCore Runtime direct code deployment
With the container deployment method, developers create a Dockerfile, build ARM-compatible containers, manage ECR repositories, and upload containers for code changes. This works well where container DevOps pipelines have already been established to automate deployments. 
However, customers looking for fully managed deployments can benefit from direct code deployment, which can significantly reduce deployment effort and improve developer productivity. Direct code deployment provides a secure and scalable path from rapid prototyping of agent capabilities to deploying production workloads at scale.
We’ll discuss the strengths of each deployment option to help you choose the right approach for your use case. 

With direct code deployment, developers create a zip archive of code and dependencies, upload it to Amazon S3, and configure the bucket in the agent configuration. When using the AgentCore starter toolkit, the toolkit handles dependency detection, packaging, and upload, which provides a much-simplified developer experience. Direct code deployment is also supported using the API.
Let’s compare the deployment steps at a high level between the two methods:
Container-based deployment
The container-based deployment method involves the following steps:

Create a Dockerfile
Build ARM-compatible container
Create ECR repository
Upload to ECR
Deploy to AgentCore Runtime

Direct code deployment
The direct code deployment method involves the following steps:

Package your code and dependencies into a zip archive
Upload it to S3
Configure the bucket in agent configuration
Deploy to AgentCore Runtime

How to use direct code deployment
Let’s illustrate how direct code deployment works with an agent created with Strands Agents SDK and using the AgentCore starter-toolkit to deploy the agent.
Prerequisites
Before you begin, make sure you have the following:

Any Python version from 3.10 to 3.13
Your preferred package manager installed. In this example, we use the uv package manager.
AWS account for creating and deploying agents
Amazon Bedrock model access to Anthropic Claude Sonnet 4.0

Step 1: Initialize your project
Set up a new Python project using the uv package manager, then navigate into the project directory:

uv init <project> --python 3.13
cd <project>

Step 2: Add the dependencies for the project
Install the required Bedrock AgentCore libraries and development tools for your project. In this example, dependencies are added to the project's pyproject.toml file by uv; alternatively, they can be specified in a requirements.txt file:

uv add bedrock-agentcore strands-agents strands-agents-tools
uv add --dev bedrock-agentcore-starter-toolkit
source .venv/bin/activate

Step 3: Create an agent.py file
Create the main agent implementation file that defines your AI agent’s behavior:

from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent, tool
from strands_tools import calculator
from strands.models import BedrockModel
import logging

app = BedrockAgentCoreApp(debug=True)

# Logging setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Create a custom tool
@tool
def weather():
    """ Get weather """
    return "sunny"

model_id = "us.anthropic.claude-sonnet-4-20250514-v1:0"
model = BedrockModel(
    model_id=model_id,
)

agent = Agent(
    model=model,
    tools=[calculator, weather],
    system_prompt="You're a helpful assistant. You can do simple math calculation, and tell the weather."
)

@app.entrypoint
def invoke(payload):
    """Your AI agent function"""
    user_input = payload.get("prompt", "Hello! How can I help you today?")
    logger.info("\n User input: %s", user_input)
    response = agent(user_input)
    logger.info("\n Agent result: %s ", response.message)
    return response.message['content'][0]['text']

if __name__ == "__main__":
    app.run()

Step 4: Deploy to AgentCore Runtime
Configure and deploy your agent to the AgentCore Runtime environment:

agentcore configure --entrypoint agent.py --name <agent-name>

This will launch an interactive session where you configure the S3 bucket to upload the zip deployment package to and choose a deployment configuration type (as shown in the following configuration). To opt for direct code deployment, choose option 1 – Code Zip.
Deployment Configuration
Select deployment type:

1. Code Zip (recommended) – Simple, serverless, no Docker required
2. Container – For custom runtimes or complex dependencies

agentcore launch

This command creates a zip deployment package, uploads it to the specified S3 bucket, and launches the agent in the AgentCore Runtime environment, making it ready to receive and process requests.
To test the solution, let’s prompt the agent to see how the weather is:

agentcore invoke '{"prompt":"How is the weather today?"}'

The first deployment takes approximately 30 seconds to complete, but subsequent updates to the agent benefit from the streamlined direct code deployment process and should take less than half the time, supporting faster iteration cycles during development.
When to choose direct code instead of container-based deployment
Let’s look at some of the dimensions and see how the direct code and container-based deployment options are different. This will help you choose the option that’s right for you:

Deployment process: Direct code deploys agents as zip files with no Docker, ECR, or CodeBuild required. Container-based deployment uses Docker and ECR with full Dockerfile control.
Deployment time: Although there is not much difference during first deployment of an agent, subsequent updates to the agent are significantly faster with direct code deployment (from an average of 30 seconds for containers to about 10 seconds for direct code deployment).
Artifact storage: Direct code stores ZIP packages in an S3 bucket. Container-based deployment stores Docker images in Amazon ECR. Direct code deployment incurs storage costs at standard S3 storage rates (starting February 27, 2026) as artifacts are stored in the service account. Container-based deployment incurs Amazon ECR charges in your account.
Customization: Direct code deployment supports custom dependencies through ZIP-based packaging, while container-based deployment depends on a Dockerfile.
Package size: Direct code deployment limits the package size to 250 MB, whereas container-based packages can be up to 2 GB in size.
Language support: Direct code currently supports Python 3.10, 3.11, 3.12, and 3.13. Container-based deployment supports many languages and runtimes.

Our general guidance is:
Container-based deployment is the right choice when your package exceeds 250MB, you have existing container CI/CD pipelines, or you need highly specialized dependencies and custom packaging requirements. Choose containers if you require multi-language support, custom system dependencies or direct control over artifact storage and versioning in your account.
Direct code deployment is the right choice when your package is under 250MB, you use Python 3.10-3.13 with common frameworks like LangGraph, Strands, or CrewAI, and you need rapid prototyping with fast iteration cycles. Choose direct code if your build process is straightforward without complex dependencies, and you want to remove the Docker/ECR/CodeBuild setup.
A hybrid approach works well for many teams: use direct code for rapid prototyping and experimentation where fast iteration and simple setup accelerate development, then graduate to containers for production when package size, multi-language requirements, or specialized build processes demand it.
Conclusion
Amazon Bedrock AgentCore direct code deployment makes iterative agent development cycles even faster, while still benefiting from enterprise security and scale of deployments. Developers can now rapidly prototype and iterate by deploying their code directly, without having to create a container. To get started with Amazon Bedrock AgentCore direct code deployment, visit the AWS documentation.

About the authors
Chaitra Mathur is a GenAI Specialist Solutions Architect at AWS. She works with customers across industries in building scalable generative AI platforms and operationalizing them. Throughout her career, she has shared her expertise at numerous conferences and has authored several blogs in the Machine Learning and Generative AI domains.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.

How to Build Supervised AI Models When You Don’t Have Annotated Data

One of the biggest challenges in real-world machine learning is that supervised models require labeled data—yet in many practical scenarios, the data you start with is almost always unlabeled. Manually annotating thousands of samples isn’t just slow; it’s expensive, tedious, and often impractical.

This is where active learning becomes a game-changer.

Active learning is a subset of machine learning in which the algorithm is not a passive consumer of data—it becomes an active participant. Instead of labeling the entire dataset upfront, the model intelligently selects which data points it wants labeled next. It interactively queries a human or oracle for labels on the most informative samples, allowing it to learn faster using far fewer annotations. Check out the FULL CODES here.

Here’s how the workflow typically looks:

Begin by labeling a small seed portion of the dataset to train an initial, weak model.

Use this model to generate predictions and confidence scores on the unlabeled data.

Compute a confidence metric (e.g., probability gap) for each prediction.

Select only the lowest-confidence samples—the ones the model is most unsure about.

Manually label these uncertain samples and add them to the training set.

Retrain the model and repeat the cycle of predict → rank confidence → label → retrain.

After several iterations, the model can achieve near–fully supervised performance while requiring far fewer manually labeled samples.

In this article, we’ll walk through how to apply this strategy step-by-step and show how active learning can help you build high-quality supervised models with minimal labeling effort. Check out the FULL CODES here.

Installing & Importing the libraries

pip install numpy pandas scikit-learn matplotlib

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

For this tutorial, we will generate a synthetic dataset with the make_classification utility from the sklearn library.

SEED = 42                          # For reproducibility
N_SAMPLES = 1000                   # Total number of data points
INITIAL_LABELED_PERCENTAGE = 0.10  # Your constraint: Start with 10% labeled data
NUM_QUERIES = 20                   # Number of times we ask the "human" to label a confusing sample

NUM_QUERIES = 20 represents the annotation budget in an active learning setup. In a real-world workflow, this would mean the model selects the 20 most confusing samples and sends them to human annotators to label—each annotation costing time and money. In our simulation, we replicate this process automatically: during each iteration, the model selects one uncertain sample, the code instantly retrieves its true label (acting as the human oracle), and the model is retrained with this new information. 

Thus, setting NUM_QUERIES = 20 means we’re simulating the benefit of labeling only 20 strategically chosen samples and observing how much the model improves with that limited but valuable human effort.

Data Generation and Splitting Strategy for Active Learning

This block handles data generation and the initial split that powers the entire Active Learning experiment. It first uses make_classification to create 1,000 synthetic samples for a two-class problem. The dataset is then split into a 10% held-out test set for final evaluation and a 90% pool for training. From this pool, only 10% is kept as the small initial labeled set—matching the constraint of starting with very limited annotations—while the remaining 90% becomes the unlabeled pool. This setup creates the realistic low-label scenario Active Learning is designed for, with a large pool of unlabeled samples ready for strategic querying. Check out the FULL CODES here.

X, y = make_classification(
    n_samples=N_SAMPLES, n_features=10, n_informative=5, n_redundant=0,
    n_classes=2, n_clusters_per_class=1, flip_y=0.1, random_state=SEED
)

# 1. Split into 90% Pool (samples to be queried) and 10% Test (final evaluation)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.10, random_state=SEED, stratify=y
)

# 2. Split the 90% Pool into Initial Labeled (10% of the pool) and Unlabeled (90% of the pool)
X_labeled_current, X_unlabeled_full, y_labeled_current, y_unlabeled_full = train_test_split(
    X_pool, y_pool, test_size=1.0 - INITIAL_LABELED_PERCENTAGE,
    random_state=SEED, stratify=y_pool
)

# A set to track indices in the unlabeled pool for efficient querying and removal
unlabeled_indices_set = set(range(X_unlabeled_full.shape[0]))

print(f"Initial Labeled Samples (STARTING N): {len(y_labeled_current)}")
print(f"Unlabeled Pool Samples: {len(unlabeled_indices_set)}")

Initial Training and Baseline Evaluation

This block trains the initial Logistic Regression model using only the small labeled seed set and evaluates its accuracy on the held-out test set. The labeled sample count and baseline accuracy are then stored as the first points in the performance history, establishing a starting benchmark before Active Learning begins. Check out the FULL CODES here.

labeled_size_history = []
accuracy_history = []

# Train the baseline model on the small initial labeled set
baseline_model = LogisticRegression(random_state=SEED, max_iter=2000)
baseline_model.fit(X_labeled_current, y_labeled_current)

# Evaluate performance on the held-out test set
y_pred_init = baseline_model.predict(X_test)
accuracy_init = accuracy_score(y_test, y_pred_init)

# Record the baseline point (x=90, y=0.8800)
labeled_size_history.append(len(y_labeled_current))
accuracy_history.append(accuracy_init)

print(f"INITIAL BASELINE (N={labeled_size_history[0]}): Test Accuracy: {accuracy_history[0]:.4f}")

Active Learning Loop

This block contains the heart of the Active Learning process, where the model iteratively selects the most uncertain sample, receives its true label, retrains, and evaluates performance. In each iteration, the current model predicts probabilities for all unlabeled samples, identifies the one with the highest uncertainty (least confidence), and “queries” its true label—simulating a human annotator. The newly labeled data point is added to the training set, a fresh model is retrained, and accuracy is recorded. Repeating this cycle for 20 queries demonstrates how targeted labeling quickly improves model performance with minimal annotation effort. Check out the FULL CODES here.

current_model = baseline_model  # Start the loop with the baseline model

print(f"\nStarting Active Learning Loop ({NUM_QUERIES} Queries)...")

# -----------------------------------------------
# The Active Learning Loop (Query, Annotate, Retrain, Evaluate)
# Purpose: Run 20 iterations to demonstrate strategic labeling gains.
# -----------------------------------------------
for i in range(NUM_QUERIES):
    if not unlabeled_indices_set:
        print("Unlabeled pool is empty. Stopping.")
        break

    # --- A. QUERY STRATEGY: Find the Least Confident Sample ---
    # 1. Get probability predictions from the CURRENT model for all unlabeled samples
    probabilities = current_model.predict_proba(X_unlabeled_full)
    max_probabilities = np.max(probabilities, axis=1)

    # 2. Calculate Uncertainty Score (1 - Max Confidence)
    uncertainty_scores = 1 - max_probabilities

    # 3. Identify the index of the sample with the MAXIMUM uncertainty score
    current_indices_list = list(unlabeled_indices_set)
    current_uncertainty = uncertainty_scores[current_indices_list]
    most_uncertain_idx_in_subset = np.argmax(current_uncertainty)
    query_index_full = current_indices_list[most_uncertain_idx_in_subset]
    query_uncertainty_score = uncertainty_scores[query_index_full]

    # --- B. HUMAN ANNOTATION SIMULATION ---
    # This is the single critical step where the human annotator intervenes.
    # We look up the true label (y_unlabeled_full) for the sample the model asked for.
    X_query = X_unlabeled_full[query_index_full].reshape(1, -1)
    y_query = np.array([y_unlabeled_full[query_index_full]])

    # Update the Labeled Set: Add the new annotated sample (N becomes N+1)
    X_labeled_current = np.vstack([X_labeled_current, X_query])
    y_labeled_current = np.hstack([y_labeled_current, y_query])
    # Remove the sample from the unlabeled pool
    unlabeled_indices_set.remove(query_index_full)

    # --- C. RETRAIN and EVALUATE ---
    # Train the NEW model on the larger, improved labeled set
    current_model = LogisticRegression(random_state=SEED, max_iter=2000)
    current_model.fit(X_labeled_current, y_labeled_current)

    # Evaluate the new model on the held-out test set
    y_pred = current_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Record results for plotting
    labeled_size_history.append(len(y_labeled_current))
    accuracy_history.append(accuracy)

    # Output status
    print(f"\nQUERY {i+1}: Labeled Samples: {len(y_labeled_current)}")
    print(f" > Test Accuracy: {accuracy:.4f}")
    print(f" > Uncertainty Score: {query_uncertainty_score:.4f}")

final_accuracy = accuracy_history[-1]

Final Result

The experiment successfully validated the efficiency of Active Learning. By focusing annotation efforts on only 20 strategically selected samples (increasing the labeled set from 90 to 110), the model’s performance on the unseen Test Set improved from 0.8800 (88%) to 0.9100 (91%). 

This 3 percentage point increase in accuracy was achieved with a minimal increase in annotation effort—roughly a 22% increase in the size of the training data resulted in a measurable and meaningful performance boost. 

In essence, the Active Learner acts as an intelligent curator, ensuring that every dollar or minute spent on human labeling provides the maximum possible benefit, proving that smart labeling is far more valuable than random or bulk labeling. Check out the FULL CODES here.

Plotting the results

plt.figure(figsize=(10, 6))
plt.plot(labeled_size_history, accuracy_history, marker='o', linestyle='-', color='#00796b', label='Active Learning (Least Confidence)')
plt.axhline(y=final_accuracy, color='red', linestyle='--', alpha=0.5, label='Final Accuracy')
plt.title('Active Learning: Accuracy vs. Number of Labeled Samples')
plt.xlabel('Number of Labeled Samples')
plt.ylabel('Test Set Accuracy')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

Check out the FULL CODES here.
The post How to Build Supervised AI Models When You Don’t Have Annotated Data appeared first on MarkTechPost.

Anyscale and NovaSky Team Releases SkyRL tx v0.1.0: Bringing Tinker Compatible Reinforcement Learning RL Engine To Local GPU Clusters

How can AI teams run Tinker style reinforcement learning on large language models using their own infrastructure with a single unified engine? The Anyscale and NovaSky (UC Berkeley) team has released SkyRL tx v0.1.0, which gives developers a way to run a Tinker compatible training and inference engine directly on their own hardware, while keeping the same minimal API that Tinker exposes in the managed service.

The research team describes SkyRL tx as a unified training and inference engine that implements the Tinker API and allows people to run a Tinker like service on their own infrastructure. This v0.1.0 version is the first of its series that supports reinforcement learning end to end, and it also makes sampling significantly faster.

Tinker API in brief

Tinker from Thinking Machines is a training API built around four core functions. forward_backward performs a forward pass and a backward pass and accumulates gradients. optim_step updates model weights based on those gradients. sample generates tokens for interaction, evaluation or RL actions. save_state writes checkpoints for resuming training.

Instead of a full task specific fine tuning abstraction, Tinker exposes these low level primitives so that users can implement their own supervised or reinforcement learning loops in regular Python code, while the service handles GPU scheduling and distributed execution.
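
As an illustration, a supervised loop over these primitives might look like the hedged sketch below, where `client` stands for a Tinker-style client object exposing the four calls; the argument and field names are assumptions for illustration, not the exact Tinker SDK signatures.

# Hedged sketch: `client` exposes the four primitives described above;
# `batches` is assumed to be an iterator of training batches.
def train_loop(client, batches, num_steps, checkpoint_every=100):
    for step in range(num_steps):
        batch = next(batches)

        # Forward plus backward pass, gradients accumulate on the service side
        client.forward_backward(batch)

        # Apply the accumulated gradients to the model weights
        client.optim_step()

        if step % checkpoint_every == 0:
            # Sample a completion to eyeball progress (or to score RL actions)
            sample = client.sample(prompt="2 + 2 =", max_tokens=8)
            print(step, sample)

            # Persist state so the run can be resumed later
            client.save_state(name=f"step-{step}")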

SkyRL tx targets this exact API and implements an open backend that users can deploy locally. It keeps the Tinker programming model, while removing the need to rely only on the hosted environment.

Where SkyRL tx fits inside SkyRL

SkyRL is a full stack reinforcement learning library for large language models that includes skyrl-agent for long horizon agents, skyrl-train for training, and skyrl-gym for tool use environments such as math, coding, search and SQL.

Within this stack, skyrl-tx is marked as an experimental cross platform library that exposes a local Tinker like REST API for model post training. SkyRL tx therefore becomes the system layer that connects RL logic, environments and training code to concrete GPU resources through the Tinker interface.

Architecture, inference engine that also trains

The SkyRL tx architecture is described as an inference engine that also supports backward passes. It has four main components:

REST API server that processes incoming requests from different users.

Database that tracks metadata about models, checkpoints, requests and futures, and also acts as a job queue. The current implementation uses SQLite behind an interface that also supports other SQL databases such as Postgres.

Engine that schedules and batches requests across users. Each engine instance serves a single base model and can attach many LoRA adapters.

Worker that executes forward and backward passes and holds model definitions and optimizer states. Multiple workers will enable more advanced multi-node sharding in upcoming versions.

What does v0.1.0 add?

The v0.1.0 release focuses on reinforcement learning support and performance improvements. The official release highlights several concrete changes:

Sampling is now much faster, since it is jitted and properly batched and sharded in the engine.

Different sampling parameters per request, per request seeds and stop tokens are now supported, which is useful when many experiments share a base model.

After several fixes, the RL loop now runs properly through the engine.

Gradient checkpointing support and micro batching for sampling are implemented.

Postgres is now supported as a database backend, next to SQLite.

Running RL end to end on 8 H100 GPUs

The official release contains a specific code recipe for running reinforcement learning end to end on a cluster with 8 H100 GPUs.

First, users clone the SkyRL repository and in the skyrl-tx folder start the engine with:

uv run --extra gpu --extra tinker -m tx.tinker.api \
  --base-model Qwen/Qwen3-4B \
  --max-lora-adapters 3 \
  --max-lora-rank 1 \
  --tensor-parallel-size 8 \
  --train-micro-batch-size 8 > out.log

Then they clone the Tinker Cookbook from the Thinking Machines team and in the tinker_cookbook/recipes folder run:

export TINKER_API_KEY=dummy
export WANDB_API_KEY=<your key>
uv run --with wandb --with tinker rl_loop.py \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-4B" \
  lora_rank=1 \
  max_length=1024 \
  save_every=100

This produces a reward curve that confirms the RL loop runs correctly through the local SkyRL tx backend.

Key Takeaways

SkyRL tx v0.1.0 implements a local, Tinker compatible engine that unifies training and inference for LLM post training.

The system exposes Tinker primitives, forward_backward, optim_step, sample and save_state over REST, while handling batching, LoRA adapters and device placement internally.

Architecture is split into API server, SQL database, scheduling engine and workers that execute forward and backward passes for a single base model with multiple LoRA adapters.

v0.1.0 adds end to end reinforcement learning support, faster jitted and sharded sampling, per request sampling parameters, gradient checkpointing, micro batching and Postgres support.

Editorial Comments

SkyRL tx v0.1.0 is a practical step for dev teams that want Tinker style reinforcement learning on their own clusters with a consistent Tinker API surface. The design that treats the system as an inference engine that also runs backward passes is clean and reduces stack divergence. Support for LoRA, gradient checkpointing, micro batching and Postgres is a concrete systems upgrade. Overall, this release turns Tinker compatibility into an actionable local RL backend for LLM post training.

Check out the Repo and Official Release.
The post Anyscale and NovaSky Team Releases SkyRL tx v0.1.0: Bringing Tinker Compatible Reinforcement Learning RL Engine To Local GPU Clusters appeared first on MarkTechPost.

How to Design a Persistent Memory and Personalized Agentic AI System with Decay and Self-Evaluation?

In this tutorial, we explore how to build an intelligent agent that remembers, learns, and adapts to us over time. We implement a Persistent Memory & Personalisation system using simple, rule-based logic to simulate how modern Agentic AI frameworks store and recall contextual information. As we progress, we see how the agent’s responses evolve with experience, how memory decay helps prevent overload, and how personalisation improves performance. We aim to understand, step by step, how persistence transforms a static chatbot into a context-aware, evolving digital companion. Check out the FULL CODES here.

import math, time, random
from typing import List

class MemoryItem:
    def __init__(self, kind:str, content:str, score:float=1.0):
        self.kind = kind
        self.content = content
        self.score = score
        self.t = time.time()

class MemoryStore:
    def __init__(self, decay_half_life=1800):
        self.items: List[MemoryItem] = []
        self.decay_half_life = decay_half_life

    def _decay_factor(self, item:MemoryItem):
        dt = time.time() - item.t
        return 0.5 ** (dt / self.decay_half_life)

We establish the foundation for our agent's long-term memory. We define the MemoryItem class to hold each piece of information and build a MemoryStore with an exponential decay mechanism, laying the groundwork for storing and aging information much like human memory. Check out the FULL CODES here.

    def add(self, kind:str, content:str, score:float=1.0):
        self.items.append(MemoryItem(kind, content, score))

    def search(self, query:str, topk=3):
        scored = []
        for it in self.items:
            decay = self._decay_factor(it)
            sim = len(set(query.lower().split()) & set(it.content.lower().split()))
            final = (it.score * decay) + sim
            scored.append((final, it))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [it for _, it in scored[:topk] if _ > 0]

    def cleanup(self, min_score=0.1):
        new = []
        for it in self.items:
            if it.score * self._decay_factor(it) > min_score:
                new.append(it)
        self.items = new

We expand the memory system by adding methods to insert, search, and clean old memories. We implement a simple similarity function and a decay-based cleanup routine, enabling the agent to remember relevant facts while automatically forgetting weak or outdated ones. Check out the FULL CODES here.

class Agent:
    def __init__(self, memory:MemoryStore, name="PersonalAgent"):
        self.memory = memory
        self.name = name

    def _llm_sim(self, prompt:str, context:List[str]):
        base = "OK. "
        if any("prefers short" in c for c in context):
            base = ""
        reply = base + f"I considered {len(context)} past notes. "
        if "summarize" in prompt.lower():
            return reply + "Summary: " + " | ".join(context[:2])
        if "recommend" in prompt.lower():
            if any("cybersecurity" in c for c in context):
                return reply + "Recommended: write more cybersecurity articles."
            if any("rag" in c for c in context):
                return reply + "Recommended: build an agentic RAG demo next."
            return reply + "Recommended: continue with your last topic."
        return reply + "Here's my response to: " + prompt

    def perceive(self, user_input:str):
        ui = user_input.lower()
        if "i like" in ui or "i prefer" in ui:
            self.memory.add("preference", user_input, 1.5)
        if "topic:" in ui:
            self.memory.add("topic", user_input, 1.2)
        if "project" in ui:
            self.memory.add("project", user_input, 1.0)

    def act(self, user_input:str):
        mems = self.memory.search(user_input, topk=4)
        ctx = [m.content for m in mems]
        answer = self._llm_sim(user_input, ctx)
        self.memory.add("dialog", f"user said: {user_input}", 0.6)
        self.memory.cleanup()
        return answer, ctx

We design an intelligent agent that utilizes memory to inform its responses. We create a mock language model simulator that adapts replies based on stored preferences and topics. At the same time, the perception function enables the agent to dynamically capture new insights about the user. Check out the FULL CODES here.

def evaluate_personalisation(agent:Agent):
    agent.memory.add("preference", "User likes cybersecurity articles", 1.6)
    q = "Recommend what to write next"
    ans_personal, _ = agent.act(q)
    empty_mem = MemoryStore()
    cold_agent = Agent(empty_mem)
    ans_cold, _ = cold_agent.act(q)
    gain = len(ans_personal) - len(ans_cold)
    return ans_personal, ans_cold, gain

Now we give our agent the ability to act and evaluate itself. We allow it to recall memories to shape contextual answers and add a small evaluation loop to compare personalised responses versus a memory-less baseline, quantifying how much the memory helps. Check out the FULL CODES here.

mem = MemoryStore(decay_half_life=60)
agent = Agent(mem)

print("=== Demo: teaching the agent about yourself ===")
inputs = [
    "I prefer short answers.",
    "I like writing about RAG and agentic AI.",
    "Topic: cybersecurity, phishing, APTs.",
    "My current project is to build an agentic RAG Q&A system."
]
for inp in inputs:
    agent.perceive(inp)

print("\n=== Now ask the agent something ===")
user_q = "Recommend what to write next in my blog"
ans, ctx = agent.act(user_q)
print("USER:", user_q)
print("AGENT:", ans)
print("USED MEMORY:", ctx)

print("\n=== Evaluate personalisation benefit ===")
p, c, g = evaluate_personalisation(agent)
print("With memory :", p)
print("Cold start  :", c)
print("Personalisation gain (chars):", g)

print("\n=== Current memory snapshot ===")
for it in agent.memory.items:
    print(f"- {it.kind} | {it.content[:60]}... | score~{round(it.score,2)}")

Finally, we run the full demo to see our agent in action. We feed it user inputs, observe how it recommends personalised actions, and check its memory snapshot. We witness the emergence of adaptive behaviour, proof that persistent memory transforms a static script into a learning companion.

In conclusion, we demonstrate how adding memory and personalisation makes our agent more human-like, capable of remembering preferences, adapting plans, and forgetting outdated details naturally. We observe that even simple mechanisms such as decay and retrieval significantly improve the agent’s relevance and response quality. By the end, we realize that persistent memory is the foundation of next-generation Agentic AI, one that learns continuously, tailors experiences intelligently, and maintains context dynamically in a fully local, offline setup.

Check out the FULL CODES here.
The post How to Design a Persistent Memory and Personalized Agentic AI System with Decay and Self-Evaluation? appeared first on MarkTechPost.

How Switchboard, MD automates real-time call transcription in clinical contact centers using Amazon Nova Sonic

In high-volume healthcare contact centers, every patient conversation carries both clinical and operational significance, making accurate real-time transcription necessary for automated workflows. Accurate, instant transcription enables intelligent automation without sacrificing clarity or care, so that teams can automate electronic medical record (EMR) record matching, streamline workflows, and eliminate manual data entry. By removing routine process steps, staff can stay fully focused on patient conversations, improving both the experience and the outcome. As healthcare systems seek to balance efficiency with empathy, real-time transcription has become a capability for delivering responsive, high-quality care at scale.
Switchboard, MD is a physician-led AI and data science company with a mission to prioritize the human connection in medicine. Its service improves patient engagement and outcomes, while reducing inefficiency and burnout. By designing and deploying clinically relevant solutions, Switchboard, MD helps providers and operators collaborate more effectively to deliver great experiences for both patients and staff. One of its key solutions is streamlining the contact center using AI voice automation, real-time medical record matching, and suggested next steps, which has led to significant reductions in queue times and call abandonment rates.
With more than 20,000 calls handled each month, Switchboard, MD supports healthcare providers in delivering timely, personalized communication at scale. Its AI platform is already helping reduce call queue times, improve patient engagement, and streamline contact center operations for clinics and health systems. Customers using Switchboard have seen outcomes such as:

75% reduction in queue times
59% reduction in call abandonment rate

Despite these early successes, Switchboard faced a critical challenge: their existing transcription approach couldn’t scale economically while maintaining the accuracy required for clinical workflows. Cost and word error rate (WER) weren’t just operational metrics—they were critical enablers for scaling automation and expanding Switchboard’s impact across more patient interactions.
In this post, we examine the specific challenges Switchboard, MD faced with scaling transcription accuracy and cost-effectiveness in clinical environments, their evaluation process for selecting the right transcription solution, and the technical architecture they implemented using Amazon Connect and Amazon Kinesis Video Streams. This post details the impressive results achieved and demonstrates how they were able to use this foundation to automate EMR matching and give healthcare staff more time to focus on patient care. Finally, we’ll look at the broader implications for healthcare AI automation and how other organizations can implement similar solutions using Amazon Bedrock.
Choosing an accurate, scalable, and cost-effective transcription model for contact center automation
Switchboard, MD needed a transcription solution that delivered high accuracy at a sustainable cost. In clinical settings, transcription accuracy is critical because errors can compromise EMR record matching, affect recommended treatment plans, and disrupt automated workflows. At the same time, scaling support for thousands of calls each week meant that inference costs couldn’t be ignored.
Switchboard initially explored multiple paths, including evaluating open source models such as OpenAI's Whisper model hosted locally. But these options presented tradeoffs—either in performance, cost, or integration complexity.
After testing, the team determined that Amazon Nova Sonic provided the right combination of transcription quality and efficiency needed to support their healthcare use case. The model performed reliably across live caller audio, even in noisy or variable conditions. It delivered:

80–90% lower transcription costs
A word error rate of 4% on Switchboard’s proprietary evaluation dataset
Low-latency output that aligned with their need for real-time processing

Equally important, Nova Sonic integrated smoothly into Switchboard’s existing architecture, minimizing engineering lift and accelerating deployment. With this foundation, the team reduced manual transcription steps and scaled accurate, real-time automation across thousands of patient interactions.

“Our vision is to restore the human connection in medicine by removing administrative barriers that get in the way of meaningful interaction. Nova Sonic gave us the speed and accuracy we needed to transcribe calls in real time—so our customers can focus on what truly matters: the patient conversation. By reducing our transcription costs by 80–90%, it’s also made real-time automation sustainable at scale.” – Dr. Blake Anderson, Founder, CEO, and CTO, Switchboard, MD

Architecture and implementation
Switchboard's architecture uses Amazon Connect to capture live audio from both patients and representatives. Switchboard processes audio streams through Amazon Kinesis Video Streams, which handles the real-time media conversion before routing the data to containerized AWS Lambda functions. Switchboard's Lambda functions establish bidirectional streaming connections with Amazon Nova Sonic using BedrockRuntimeClient's InvokeModelWithBidirectionalStream API. This novel architecture creates separate transcription streams for each conversation participant, which Switchboard recombines to create the complete transcription record. The entire processing pipeline runs in a serverless environment, providing scalable operation designed to handle thousands of concurrent calls while using Nova Sonic's real-time speech-to-text capabilities for immediate transcription processing.
Nova Sonic integration: Real-time speech processing
Harnessing Amazon Nova Sonic's advanced audio streaming and processing, Switchboard built the capability to separate and recombine speakers' streams and transcripts. This makes Amazon Nova Sonic particularly effective for Switchboard's healthcare applications, where accurate transcription and speaker identification are crucial.
Amazon Nova Sonic offers configurable settings that can be optimized for different healthcare use cases, with the flexibility to prioritize either transcription or speech generation based on specific needs. A key cost-optimization feature is the ability to adjust speech output tokens – organizations can set lower token values when primarily focused on transcription, resulting in significant cost savings while maintaining high accuracy. This versatility and cost flexibility makes Amazon Nova Sonic a valuable tool for healthcare organizations like Switchboard looking to implement voice-enabled solutions.
Why serverless: Strategic advantages for healthcare innovation
Switchboard’s choice of a serverless architecture using Amazon Connect, Amazon Kinesis Video Streams, and containerized Lambda functions represents a strategic decision that maximizes operational efficiency while minimizing infrastructure overhead. The serverless approach eliminates the need to provision, manage, and monitor underlying infrastructure, so that Switchboard’s engineering team can focus on developing clinical automation features rather than server management. This architecture provides built-in fault tolerance and high availability for critical healthcare communications without requiring extensive configuration from Switchboard’s team.
Switchboard’s event-driven architecture, shown in the following figure, enables the system to scale from handling dozens to thousands of concurrent calls, with AWS automatically managing capacity provisioning behind the scenes. The pay-as-you-go billing model helps Switchboard pay only for compute resources used during call processing, optimizing costs while eliminating the risk of over-provisioning servers that would sit idle during low-volume periods.

Conclusion
Switchboard, MD’s implementation of Amazon Nova Sonic demonstrates how the right transcription technology can transform healthcare operations. By achieving 80–90% cost reductions while maintaining clinical-grade accuracy, they’ve created a sustainable foundation for scaling AI-powered patient interactions across the healthcare industry.
By building on Amazon Bedrock, Switchboard now has the flexibility to expand automation across more use cases and provider networks. Their success exemplifies how healthcare innovators can combine accuracy, speed, and efficiency to transform how care teams connect with patients—one conversation at a time.
Get started with Amazon Nova on the Amazon Bedrock console. Learn more about Amazon Nova models at the Amazon Nova product page.

About the authors
Tanner Jones is a Technical Account Manager in AWS Enterprise Support, where he helps customers navigate and optimize their production applications on AWS. He specializes in helping customers develop applications that incorporate AI agents, with a particular focus on building safe multi-agent systems.
Anuj Jauhari is a Sr. Product Marketing Manager at AWS, where he helps customers innovate and drive business impact with generative AI solutions built on Amazon Nova models.
Jonathan Woods is a Solutions Architect at AWS based in Nashville currently working with SMB customers. He has a passion for communicating AWS technology to businesses in a relevant way making it easy for customers to innovate. Outside of work, he tries keeping up with his three kids.
Nauman Zulfiqar is a senior account manager based in New York working with SMB clients. He loves building and maintaining strong customer relationships, understanding their business challenges and serving as the customer’s primary business advocate within AWS.

How to Create AI-ready APIs?

Postman recently released a comprehensive checklist and developer guide for building AI-ready APIs, highlighting a simple truth: even the most powerful AI models are only as good as the data they receive—and that data comes through your APIs. If your endpoints are inconsistent, unclear, or unreliable, models waste time fixing bad inputs instead of producing insight. Postman’s playbook distills years of best practices into practical steps that help teams make their APIs predictable, machine-readable, and dependable for AI workloads.

This article summarizes the key ideas from that playbook. As we move into a world where Agents—not humans—will make purchases, compare options, and interact with services, APIs must evolve. Unlike developers, Agents can’t compensate for messy docs or ambiguous behavior. They rely on standardized patterns and automatically generated, machine-consumable documentation that stays in sync with your schema. The goal is simple: create APIs that humans and AI agents can understand instantly, so your systems can scale smarter and unlock their full potential.

Machine consumable metadata

Humans can infer missing details from vague API docs, but AI agents can't—they rely entirely on explicit, machine-readable metadata. Instead of saying "this endpoint returns user preferences," an AI-ready API must define everything: request type, parameter schema, response structure, and object definitions. Clear metadata like the sketch below removes ambiguity, ensures agents don't guess, and makes APIs fully understandable to machines.
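
For illustration, the sketch below expresses such endpoint metadata as a plain Python dict; the endpoint and field names are hypothetical and not taken from Postman's playbook or any formal specification.

# Toy endpoint descriptor; names and fields are illustrative only.
get_user_preferences = {
    "method": "GET",
    "path": "/users/{user_id}/preferences",
    "parameters": {
        "user_id": {"type": "string", "required": True, "description": "UUID of the user"},
    },
    "response": {
        "status": 200,
        "schema": {
            "theme": {"type": "string", "enum": ["light", "dark"]},
            "notifications": {"type": "boolean"},
            "language": {"type": "string", "description": "BCP 47 language tag"},
        },
    },
}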

Rich Error Semantics

Developers can interpret vague errors like “Something went wrong,” but AI agents can’t—they need precise, structured guidance. AI-ready APIs must clearly spell out what failed, why it failed, and how to fix it. Rich error metadata with fields like code, message, expected, and received removes guesswork and enables agents to self-correct instead of getting stuck.
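
A hypothetical structured error body along these lines, again sketched as a Python dict with illustrative field names that mirror the ones mentioned above:

# Illustrative structured error payload; not a real API's response format.
error_response = {
    "error": {
        "code": "INVALID_PARAMETER",
        "message": "date_of_birth must be an ISO 8601 date",
        "field": "date_of_birth",
        "expected": "YYYY-MM-DD",
        "received": "03/15/1990",
        "retryable": False,
    }
}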

Introspection Capabilities

For APIs to be AI-ready, they must move beyond human-centric, vague documentation. Unlike developers who can infer missing details using context and RESTful conventions, AI agents rely entirely on structured data for planning and execution. This means APIs must provide complete introspection through a full schema, explicitly defining all endpoints, parameters, data schemas, and error codes. Without this clarity, AI systems are forced to guess, which inevitably leads to broken workflows and unreliable, hallucinated behavior.

Consistent Naming Patterns

AI systems rely on consistent patterns, so predictable naming conventions make your API far easier for them to understand and navigate. When endpoints and fields follow clear, uniform structures—like proper REST methods and consistent casing—AI can infer relationships and behaviors without guesswork. This reduces ambiguity and enables more accurate automation, reasoning, and integration across your entire API.
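
A small illustration of what consistent naming looks like in practice; the resource paths below are hypothetical examples, not endpoints from any specific API.

# Consistent, predictable resource naming and casing (illustrative paths only)
endpoints = [
    ("GET",    "/v1/users"),                          # list users
    ("GET",    "/v1/users/{user_id}"),                # fetch one user
    ("POST",   "/v1/users"),                          # create a user
    ("DELETE", "/v1/users/{user_id}"),                # delete a user
    ("GET",    "/v1/users/{user_id}/preferences"),    # nested resource, same conventions
]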

Predictable behaviour

AI agents need strict consistency—same inputs should always produce the same structure, format, and fields. Humans can troubleshoot inconsistent responses using intuition, but AI can’t assume or investigate; it only learns from the patterns you provide. If naming, nesting, or errors vary across endpoints, the agent becomes unreliable or breaks entirely. To be AI-ready, your API must enforce predictable responses, uniform naming, consistent error handling, and zero hidden edge cases. In short: inconsistent inputs lead to inconsistent agent behavior.

Proper documentation

Humans can look things up when docs are unclear, but AI agents can’t—they only know what your API explicitly tells them. Without clear, complete documentation, an agent can’t discover endpoints, understand parameters, predict responses, or recover from errors. Good documentation isn’t optional for AI-ready APIs—it’s the only way agents can learn and reliably interact with your system.

Reliable and fast

AI agents act as orchestrators, making rapid and often parallel API calls—so your API’s speed and reliability directly impact their performance. Humans can wait out slow responses or retry manually, but agents will time out, fail, or break entire workflows. In fast, automated environments, an AI system is only as strong as the APIs it relies on. If your API can’t keep up, neither can your AI.

Discoverability

Humans can track down missing APIs through wikis, chats, code, or intuition—but AI agents can’t. If an API isn’t clearly published with structured, searchable metadata, it simply doesn’t exist to them. AI systems depend on standardized, discoverable specs and examples to understand how to use an API. Making your API visible, accessible, and well-indexed—through platforms like the Postman API Network—ensures both developers and agents can reliably find and integrate it.
The post How to Create AI-ready APIs? appeared first on MarkTechPost.

LongCat-Flash-Omni: A SOTA Open-Source Omni-Modal Model with 560B Parameters with 27B activated, Excelling at Real-Time Audio-Visual Interaction

How do you design a single model that can listen, see, read and respond in real time across text, image, video and audio without losing efficiency? Meituan’s LongCat team has released LongCat Flash Omni, an open source omni modal model with 560 billion parameters and about 27 billion active per token, built on the shortcut connected Mixture of Experts design that LongCat Flash introduced. The model extends the text backbone to vision, video and audio, and it keeps a 128K context so it can run long conversations and document level understanding in one stack.

https://github.com/meituan-longcat/LongCat-Flash-Omni?tab=readme-ov-file

Architecture and Modal Attachments

LongCat Flash Omni keeps the language model unchanged, then adds perception modules. A LongCat ViT encoder processes both images and video frames so there is no separate video tower. An audio encoder together with the LongCat Audio Codec turns speech into discrete tokens, then the decoder can output speech from the same LLM stream, which enables real time audio visual interaction.

Streaming and Feature Interleaving

The research team describes chunk wise audio visual feature interleaving, where audio features, video features and timestamps are packed into 1 second segments. Video is sampled at 2 frames per second by default, and the rate is then adjusted according to video length. The report does not tie the sampling rule to user or model speaking phases, so the correct description is duration conditioned sampling. This keeps latency low and still provides spatial context for GUI, OCR and video QA tasks.
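As a rough illustration of the idea, the sketch below packs per second audio features and duration conditioned video frames into 1 second chunks; the data structures, the 60 second threshold, and the sampling rule are assumptions for illustration, not the LongCat implementation.

# Illustrative sketch of duration conditioned, chunk wise audio visual interleaving.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AVChunk:
    start_s: int            # chunk start time in seconds
    audio_tokens: list      # audio codec tokens covering this 1 second window
    video_frames: list      # frames sampled inside this window

def interleave(audio_per_second: List[list], frames: List[Tuple[float, object]], video_len_s: float) -> List[AVChunk]:
    # Default 2 fps, reduced for longer videos (duration conditioned sampling, thresholds assumed).
    fps = 2 if video_len_s <= 60 else 1
    chunks = []
    for sec, audio_tokens in enumerate(audio_per_second):
        in_window = [frame for (t, frame) in frames if sec <= t < sec + 1]
        chunks.append(AVChunk(start_s=sec, audio_tokens=audio_tokens, video_frames=in_window[:fps]))
    return chunks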

Curriculum from Text to Omni

Training follows a staged curriculum. The research team first trains the LongCat Flash text backbone, which activates 18.6B to 31.3B parameters per token, average 27B, then applies text speech continued pretraining, then multimodal continued pretraining with image and video, then context extension to 128K, then audio encoder alignment.

Systems Design, Modality Decoupled Parallelism

Because the encoders and the LLM have different compute patterns, Meituan uses modality decoupled parallelism. Vision and audio encoders run with hybrid sharding and activation recomputation, the LLM runs with pipeline, context and expert parallelism, and a ModalityBridge aligns embeddings and gradients. The research team reports that multimodal supervised fine tuning keeps more than 90 percent of the throughput of text only training, which is the main systems result in this release.

https://github.com/meituan-longcat/LongCat-Flash-Omni?tab=readme-ov-file

Benchmarks and Positioning

LongCat Flash Omni reaches 61.4 on OmniBench, this is higher than Qwen 3 Omni Instruct at 58.5 and Qwen 2.5 Omni at 55.0, but lower than Gemini 2.5 Pro at 66.8. On VideoMME it scores 78.2, which is close to GPT 4o and Gemini 2.5 Flash, and on VoiceBench it reaches 88.7, slightly higher than GPT 4o Audio in the same table.

Key Takeaways

LongCat Flash Omni is an open source omni modal model built on Meituan’s 560B MoE backbone, it activates about 27B parameters per token through shortcut connected MoE with zero computation experts, so it keeps large capacity but inference friendly compute.

The model attaches unified vision video encoding and a streaming audio path to the existing LongCat Flash LLM, using 2 fps default video sampling with duration conditioned adjustment, and packs audio visual features into 1 second chunks for synchronized decoding, which is what enables real time any to any interaction.

LongCat Flash Omni scores 61.4 on OmniBench, above Qwen 3 Omni Instruct at 58.5, but below Gemini 2.5 Pro at 66.8.

Meituan uses modality decoupled parallelism: vision and audio encoders run with hybrid sharding, the LLM runs with pipeline, context and expert parallelism, and the team reports more than 90 percent of text only throughput for multimodal SFT, which is the main systems contribution of the release.

Editorial Comments

This release shows that Meituan is trying to make omni modal interaction practical, not experimental. It keeps the 560B Shortcut connected Mixture of Experts with 27B activated, so the language backbone stays compatible with earlier LongCat releases. It adds streaming audio visual perception with 2 fps default video sampling and duration conditioned adjustment, so latency remains low without losing spatial grounding. It reports over 90 percent text only throughput in multimodal supervised fine tuning through modality decoupled parallelism.

Check out the Paper, Model Weights and GitHub Repo. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post LongCat-Flash-Omni: A SOTA Open-Source Omni-Modal Model with 560B Parameters with 27B activated, Excelling at Real-Time Audio-Visual Interaction appeared first on MarkTechPost.

Comparing the Top 6 OCR (Optical Character Recognition) Models/Systems in 2025

Optical character recognition has moved from plain text extraction to document intelligence. Modern systems must read scanned and digital PDFs in one pass, preserve layout, detect tables, extract key value pairs, and work with more than one language. Many teams now also want OCR that can feed RAG and agent pipelines directly. In 2025, 6 systems cover most real workloads:

Google Cloud Document AI, Enterprise Document OCR

Amazon Textract

Microsoft Azure AI Document Intelligence

ABBYY FineReader Engine and FlexiCapture

PaddleOCR 3.0

DeepSeek OCR, Contexts Optical Compression

The goal of this comparison is not to rank them on a single metric, because they target different constraints. The goal is to show which system to use for a given document volume, deployment model, language set, and downstream AI stack.

Image source: Marktechpost.com

Evaluation dimensions

We compare on 6 stable dimensions:

Core OCR quality on scanned, photographed and digital PDFs.

Layout and structure: tables, key value pairs, selection marks, reading order.

Language and handwriting coverage.

Deployment model: fully managed, container, on premises, self hosted.

Integration with LLM, RAG and IDP tools.

Cost at scale.

1. Google Cloud Document AI, Enterprise Document OCR

Google’s Enterprise Document OCR takes PDFs and images, whether scanned or digital, and returns text with layout, tables, key value pairs and selection marks. It also exposes handwriting recognition in 50 languages and can detect math and font style. This matters for financial statements, educational forms and archives. Output is structured JSON that can be sent to Vertex AI or any RAG system.

Strengths

High quality OCR on business documents.

Strong layout graph and table detection.

One pipeline for digital and scanned PDFs, which keeps ingestion simple.

Enterprise grade, with IAM and data residency.

Limits

It is a metered Google Cloud service.

Custom document types still require configuration.

Use when your data is already on Google Cloud or when you must preserve layout for a later LLM stage.

2. Amazon Textract

Textract provides two API lanes, synchronous for small documents and asynchronous for large multipage PDFs. It extracts text, tables, forms, signatures and returns them as blocks with relationships. AnalyzeDocument in 2025 can also answer queries over the page, which simplifies invoice or claim extraction. The integration with S3, Lambda and Step Functions makes it easy to turn Textract into an ingestion pipeline.
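A minimal boto3 sketch of the synchronous lane with a query is shown below; the file name, query text and alias are placeholders, and pagination and error handling are omitted.

# Sketch of a synchronous AnalyzeDocument call with tables, forms and a query.
import boto3

textract = boto3.client("textract")
with open("invoice.pdf", "rb") as f:          # placeholder single page document
    doc_bytes = f.read()

response = textract.analyze_document(
    Document={"Bytes": doc_bytes},
    FeatureTypes=["TABLES", "FORMS", "QUERIES"],
    QueriesConfig={"Queries": [{"Text": "What is the invoice total?", "Alias": "total"}]},
)
# Blocks contain text, table cells, key value pairs, and QUERY_RESULT answers.
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block["Text"])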

Strengths

Reliable table and key value extraction for receipts, invoices and insurance forms.

Clear sync and batch processing model.

Tight AWS integration, good for serverless and IDP on S3.

Limits

Image quality has a visible effect, so camera uploads may need preprocessing.

Customization is more limited than Azure custom models.

Locked to AWS.

Use when the workload is already in AWS and you need structured JSON out of the box.

3. Microsoft Azure AI Document Intelligence

Azure’s service, renamed from Form Recognizer, combines OCR, generic layout, prebuilt models and custom neural or template models. The 2025 release added layout and read containers, so enterprises can run the same model on premises. The layout model extracts text, tables, selection marks and document structure and is designed for further processing by LLMs.

Strengths

Best in class custom document models for line of business forms.

Containers for hybrid and air gapped deployments.

Prebuilt models for invoices, receipts and identity documents.

Clean JSON output.

Limits

Accuracy on some non English documents can still be slightly behind ABBYY.

Pricing and throughput must be planned because it is still a cloud first product.

Use when you need to teach the system your own templates or when you are a Microsoft shop that wants the same model in Azure and on premises.

4. ABBYY FineReader Engine and FlexiCapture

ABBYY stays relevant in 2025 because of 3 things: accuracy on printed documents, very wide language coverage, and deep control over preprocessing and zoning. The current Engine and FlexiCapture products support 190 or more languages, export structured data, and can be embedded in Windows, Linux and VM workloads. ABBYY is also strong in regulated sectors where data cannot leave the premises.

Strengths

Very high recognition quality on scanned contracts, passports, old documents.

Largest language set in this comparison.

FlexiCapture can be tuned to messy recurring documents.

Mature SDKs.

Limits

License cost is higher than open source.

Deep learning based scene text is not the focus.

Scaling to hundreds of nodes needs engineering.

Use when you must run on premises, must process many languages, or must pass compliance audits.

5. PaddleOCR 3.0

PaddleOCR 3.0 is an Apache licensed open source toolkit that aims to bridge images and PDFs to LLM ready structured data. It ships with PP OCRv5 for multilingual recognition, PP StructureV3 for document parsing and table reconstruction, and PP ChatOCRv4 for key information extraction. It supports 100 plus languages, runs on CPU and GPU, and has mobile and edge variants.

Strengths

Free and open, no per page cost.

Fast on GPU, usable on edge.

Covers detection, recognition and structure in one project.

Active community.

Limits

You must deploy, monitor and update it.

For European or financial layouts you often need postprocessing or fine tuning.

Security and durability are your responsibility.

Use when you want full control, or you want to build a self hosted document intelligence service for LLM RAG.

6. DeepSeek OCR, Contexts Optical Compression

DeepSeek OCR was released in October 2025. It is not a classical OCR system. It is an LLM centric vision language model that compresses long text and documents into high resolution images, then decodes them. The public model card and blog report around 97 percent decoding accuracy at 10 times compression and around 60 percent at 20 times compression. It is MIT licensed, built around a 3B decoder, and already supported in vLLM and Hugging Face. This makes it interesting for teams that want to reduce token cost before calling an LLM.

Strengths

Self hosted, GPU ready.

Excellent for long context and mixed text plus tables because compression happens before decoding.

Open license.

Fits modern agentic stacks.

Limits

There is no standard public benchmark yet that puts it against Google or AWS, so enterprises must run their own tests.

Requires a GPU with enough VRAM.

Accuracy depends on chosen compression ratio.

Use when you want OCR that is optimized for LLM pipelines rather than for archive digitization.

Head to head comparison

Core task
Google Document AI: OCR for scanned and digital PDFs, returns text, layout, tables, KVP, selection marks.
Amazon Textract: OCR for text, tables, forms, IDs, invoices, receipts, with sync and async APIs.
Azure Document Intelligence: OCR plus prebuilt and custom models, layout, containers for on premises.
ABBYY FineReader / FlexiCapture: High accuracy OCR and document capture for large, multilingual, on premises workloads.
PaddleOCR 3.0: Open source OCR and document parsing, PP OCRv5, PP StructureV3, PP ChatOCRv4.
DeepSeek OCR: LLM centric OCR that compresses document images and decodes them for long context AI.

Text and layout
Google Document AI: Blocks, paragraphs, lines, words, symbols, tables, key value pairs, selection marks.
Amazon Textract: Text, relationships, tables, forms, query responses, lending analysis.
Azure Document Intelligence: Text, tables, KVP, selection marks, figure extraction, structured JSON, v4 layout model.
ABBYY FineReader / FlexiCapture: Zoning, tables, form fields, classification through FlexiCapture.
PaddleOCR 3.0: StructureV3 rebuilds tables and document hierarchy, KIE modules available.
DeepSeek OCR: Reconstructs content after optical compression, good for long pages, needs local evaluation.

Handwriting
Google Document AI: Printed and handwriting for 50 languages.
Amazon Textract: Handwriting in forms and free text.
Azure Document Intelligence: Handwriting supported in read and layout models.
ABBYY FineReader / FlexiCapture: Printed very strong, handwriting available via capture templates.
PaddleOCR 3.0: Supported, may need domain tuning.
DeepSeek OCR: Depends on image and compression ratio, not yet benchmarked vs cloud.

Languages
Google Document AI: 200+ OCR languages, 50 handwriting languages.
Amazon Textract: Main business languages, invoices, IDs, receipts.
Azure Document Intelligence: Major business languages, expanding in v4.x.
ABBYY FineReader / FlexiCapture: 190–201 languages depending on edition, widest in this comparison.
PaddleOCR 3.0: 100+ languages in v3.0 stack.
DeepSeek OCR: Multilingual via VLM decoder, coverage good but not exhaustively published, test per project.

Deployment
Google Document AI: Fully managed Google Cloud.
Amazon Textract: Fully managed AWS, synchronous and asynchronous jobs.
Azure Document Intelligence: Managed Azure service plus read and layout containers (2025) for on premises.
ABBYY FineReader / FlexiCapture: On premises, VM, customer cloud, SDK centric.
PaddleOCR 3.0: Self hosted, CPU, GPU, edge, mobile.
DeepSeek OCR: Self hosted, GPU, vLLM ready, license to verify.

Integration path
Google Document AI: Exports structured JSON to Vertex AI, BigQuery, RAG pipelines.
Amazon Textract: Native to S3, Lambda, Step Functions, AWS IDP.
Azure Document Intelligence: Azure AI Studio, Logic Apps, AKS, custom models, containers.
ABBYY FineReader / FlexiCapture: BPM, RPA, ECM, IDP platforms.
PaddleOCR 3.0: Python pipelines, open RAG stacks, custom document services.
DeepSeek OCR: LLM and agent stacks that want to reduce tokens first, vLLM and HF supported.

Cost model
Google Document AI: Pay per 1,000 pages, volume discounts.
Amazon Textract: Pay per page or document, AWS billing.
Azure Document Intelligence: Consumption based, container licensing for local runs.
ABBYY FineReader / FlexiCapture: Commercial license, per server or per volume.
PaddleOCR 3.0: Free, infra only.
DeepSeek OCR: Free repo, GPU cost, license to confirm.

Best fit
Google Document AI: Mixed scanned and digital PDFs on Google Cloud, layout preserved.
Amazon Textract: AWS ingestion of invoices, receipts, loan packages at scale.
Azure Document Intelligence: Microsoft shops that need custom models and hybrid.
ABBYY FineReader / FlexiCapture: Regulated, multilingual, on premises processing.
PaddleOCR 3.0: Self hosted document intelligence for LLM and RAG.
DeepSeek OCR: Long document LLM pipelines that need optical compression.

What to use when

Cloud IDP on invoices, receipts, medical forms: Amazon Textract or Azure Document Intelligence.

Mixed scanned and digital PDFs for banks and telcos on Google Cloud: Google Document AI Enterprise Document OCR.

Government archive or publisher with 150 plus languages and no cloud: ABBYY FineReader Engine and FlexiCapture.

Startup or media company building its own RAG over PDFs: PaddleOCR 3.0.

LLM platform that wants to shrink context before inference: DeepSeek OCR.

Editorial Comments

Google Document AI, Amazon Textract, and Azure AI Document Intelligence all deliver layout aware OCR with tables, key value pairs, and selection marks as structured JSON outputs, while ABBYY FineReader Engine 12 R7 and FlexiCapture export structured data in XML and the new JSON format and support 190 to 201 languages for on premises processing. PaddleOCR 3.0 provides Apache licensed PP OCRv5, PP StructureV3, and PP ChatOCRv4 for self hosted document parsing. DeepSeek OCR reports 97% decoding precision below 10x compression and about 60% at 20x, so enterprises must run local benchmarks before rollout in production workloads. Overall, OCR in 2025 is document intelligence first, recognition second.

References:

Google Cloud Document AI – Enterprise Document OCR: https://docs.cloud.google.com/document-ai/docs/enterprise-document-ocr (Google Cloud Documentation)

Google Cloud – Document AI product page: https://cloud.google.com/document-ai (Google Cloud)

Amazon Textract – product page: https://aws.amazon.com/textract/ (Amazon Web Services, Inc.)

Amazon Textract – analyzing documents (tables, forms, queries, signatures): https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html (AWS Documentation)

Microsoft Azure AI Document Intelligence – docs: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/ (Microsoft Learn)

Microsoft Azure AI Document Intelligence – product page: https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence (Microsoft Azure)

ABBYY FineReader Engine 12 R7 – release post: https://www.abbyy.com/blog/finereader-engine-12-r7-release/ (ABBYY)

ABBYY FlexiCapture – product page: https://www.abbyy.com/flexicapture/ (ABBYY)

PaddleOCR – official GitHub repo: https://github.com/PaddlePaddle/PaddleOCR (GitHub)

DeepSeek OCR – official launch blog (Contexts Optical Compression): https://deepseek.ai/blog/deepseek-ocr-context-compression (deepseek.ai)

DeepSeek OCR – GitHub repository: https://github.com/deepseek-ai/DeepSeek-OCR (GitHub)

DeepSeek OCR – coverage on compression ratios: https://venturebeat.com/ai/deepseek-drops-open-source-model-that-compresses-text-10x-through-images (venturebeat.com)

The post Comparing the Top 6 OCR (Optical Character Recognition) Models/Systems in 2025 appeared first on MarkTechPost.

A Coding Implementation of a Comprehensive Enterprise AI Benchmarking Framework to Evaluate Rule-Based, LLM, and Hybrid Agentic AI Systems Across Real-World Tasks

In this tutorial, we develop a comprehensive benchmarking framework to evaluate various types of agentic AI systems on real-world enterprise software tasks. We design a suite of diverse challenges, from data transformation and API integration to workflow automation and performance optimization, and assess how various agents, including rule-based, LLM-powered, and hybrid ones, perform across these domains. By running structured benchmarks and visualizing key performance metrics, such as accuracy, execution time, and success rate, we gain a deeper understanding of each agent’s strengths and trade-offs in enterprise environments. Check out the Full Codes here.

import json
import time
import random
from typing import Dict, List, Any, Callable
from dataclasses import dataclass, asdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

@dataclass
class Task:
    id: str
    name: str
    description: str
    category: str
    complexity: int
    expected_output: Any

@dataclass
class BenchmarkResult:
    task_id: str
    agent_name: str
    success: bool
    execution_time: float
    accuracy: float
    error_message: str = ""

class EnterpriseTaskSuite:
    def __init__(self):
        self.tasks = self._create_tasks()

    def _create_tasks(self) -> List[Task]:
        return [
            Task("data_transform", "CSV Data Transformation",
                 "Transform customer data by aggregating sales", "data_processing", 3,
                 {"total_sales": 15000, "avg_order": 750}),
            Task("api_integration", "REST API Integration",
                 "Parse API response and extract key metrics", "integration", 2,
                 {"status": "success", "active_users": 1250}),
            Task("workflow_automation", "Multi-Step Workflow",
                 "Execute data validation -> processing -> reporting", "automation", 4,
                 {"validated": True, "processed": 100, "report_generated": True}),
            Task("error_handling", "Error Recovery",
                 "Handle malformed data gracefully", "reliability", 3,
                 {"errors_caught": 5, "recovery_success": True}),
            Task("optimization", "Query Optimization",
                 "Optimize database query performance", "performance", 5,
                 {"execution_time_ms": 45, "rows_scanned": 1000}),
            Task("data_validation", "Schema Validation",
                 "Validate data against business rules", "validation", 2,
                 {"valid_records": 95, "invalid_records": 5}),
            Task("reporting", "Executive Dashboard",
                 "Generate KPI summary report", "analytics", 3,
                 {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
            Task("integration_test", "System Integration",
                 "Test end-to-end integration flow", "testing", 4,
                 {"all_systems_connected": True, "latency_ms": 120}),
        ]

    def get_task(self, task_id: str) -> Task:
        return next((t for t in self.tasks if t.id == task_id), None)

We define the core data structures for our benchmarking system. We create the Task and BenchmarkResult data classes and initialize the EnterpriseTaskSuite, which holds multiple enterprise-relevant tasks such as data transformation, reporting, and integration. We lay the foundation for consistently evaluating different types of agents across these tasks. Check out the Full Codes here.

class BaseAgent:
    def __init__(self, name: str):
        self.name = name

    def execute(self, task: Task) -> Dict[str, Any]:
        raise NotImplementedError

class RuleBasedAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        time.sleep(random.uniform(0.1, 0.3))
        if task.category == "data_processing":
            return {"total_sales": 15000 + random.randint(-500, 500),
                    "avg_order": 750 + random.randint(-50, 50)}
        elif task.category == "integration":
            return {"status": "success", "active_users": 1250}
        elif task.category == "automation":
            return {"validated": True, "processed": 98, "report_generated": True}
        else:
            return task.expected_output

We introduce the base agent structure and implement the RuleBasedAgent, which mimics traditional automation logic using predefined rules. We simulate how such agents execute tasks deterministically while maintaining speed and reliability, giving us a baseline for comparison with more advanced agents. Check out the Full Codes here.

class LLMAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        time.sleep(random.uniform(0.2, 0.5))
        accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
        result = {}
        for key, value in task.expected_output.items():
            if isinstance(value, (int, float)):
                variation = value * (1 - accuracy_boost)
                result[key] = value + random.uniform(-variation, variation)
            else:
                result[key] = value
        return result

class HybridAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        time.sleep(random.uniform(0.15, 0.35))
        if task.complexity <= 2:
            return task.expected_output
        else:
            result = {}
            for key, value in task.expected_output.items():
                if isinstance(value, (int, float)):
                    variation = value * 0.03
                    result[key] = value + random.uniform(-variation, variation)
                else:
                    result[key] = value
            return result

We develop two intelligent agent types, the LLMAgent, representing reasoning-based AI systems, and the HybridAgent, which combines rule-based precision with LLM adaptability. We design these agents to show how learning-based methods improve task accuracy, especially for complex enterprise workflows. Check out the Full Codes here.

class BenchmarkEngine:
    def __init__(self, task_suite: EnterpriseTaskSuite):
        self.task_suite = task_suite
        self.results: List[BenchmarkResult] = []

    def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
        print(f"\n{'='*60}")
        print(f"Benchmarking Agent: {agent.name}")
        print(f"{'='*60}")
        for task in self.task_suite.tasks:
            print(f"\nTask: {task.name} (Complexity: {task.complexity}/5)")
            for i in range(iterations):
                result = self._execute_task(agent, task, i+1)
                self.results.append(result)
                status = "✓ PASS" if result.success else "✗ FAIL"
                print(f"  Run {i+1}: {status} | Time: {result.execution_time:.3f}s | Accuracy: {result.accuracy:.2%}")

Here, we build the core of our benchmarking engine, which manages agent evaluation across the defined task suite. We implement methods to run each agent multiple times per task, log results, and measure key parameters like execution time and accuracy. This creates a systematic and repeatable benchmarking loop. Check out the Full Codes here.

    def _execute_task(self, agent: BaseAgent, task: Task, run_num: int) -> BenchmarkResult:
        start_time = time.time()
        try:
            output = agent.execute(task)
            execution_time = time.time() - start_time
            accuracy = self._calculate_accuracy(output, task.expected_output)
            success = accuracy >= 0.85
            return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                   execution_time=execution_time, accuracy=accuracy)
        except Exception as e:
            execution_time = time.time() - start_time
            return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                   execution_time=execution_time, accuracy=0.0, error_message=str(e))

    def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
        if not output:
            return 0.0
        scores = []
        for key, expected_val in expected.items():
            if key not in output:
                scores.append(0.0)
                continue
            actual_val = output[key]
            if isinstance(expected_val, bool):
                scores.append(1.0 if actual_val == expected_val else 0.0)
            elif isinstance(expected_val, (int, float)):
                diff = abs(actual_val - expected_val)
                tolerance = abs(expected_val * 0.1)
                score = max(0, 1 - (diff / (tolerance + 1e-9)))
                scores.append(score)
            else:
                scores.append(1.0 if actual_val == expected_val else 0.0)
        return np.mean(scores) if scores else 0.0

We define the task execution logic and the accuracy computation. We measure each agent’s performance by comparing their outputs against expected results using a scoring mechanism. This step ensures our benchmarking process is quantitative and fair, providing insights into how closely agents align with business expectations. Check out the Full Codes here.

    def generate_report(self):
        df = pd.DataFrame([asdict(r) for r in self.results])
        print(f"\n{'='*60}")
        print("BENCHMARK REPORT")
        print(f"{'='*60}\n")
        for agent_name in df['agent_name'].unique():
            agent_df = df[df['agent_name'] == agent_name]
            print(f"{agent_name}:")
            print(f"  Success Rate: {agent_df['success'].mean():.1%}")
            print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
            print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}\n")
        return df

    def visualize_results(self, df: pd.DataFrame):
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight='bold')
        success_rate = df.groupby('agent_name')['success'].mean()
        axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
        axes[0, 0].set_title('Success Rate by Agent', fontweight='bold')
        axes[0, 0].set_ylabel('Success Rate')
        axes[0, 0].set_ylim(0, 1.1)
        for i, v in enumerate(success_rate.values):
            axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')
        time_data = df.groupby('agent_name')['execution_time'].mean()
        axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
        axes[0, 1].set_title('Average Execution Time', fontweight='bold')
        axes[0, 1].set_ylabel('Time (seconds)')
        for i, v in enumerate(time_data.values):
            axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha='center', fontweight='bold')
        df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
        axes[1, 0].set_title('Accuracy Distribution', fontweight='bold')
        axes[1, 0].set_xlabel('Agent')
        axes[1, 0].set_ylabel('Accuracy')
        plt.sca(axes[1, 0])
        plt.xticks(rotation=15)
        task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
        df['complexity'] = df['task_id'].map(task_complexity)
        complexity_perf = df.groupby(['agent_name', 'complexity'])['accuracy'].mean().unstack()
        complexity_perf.plot(kind='line', ax=axes[1, 1], marker='o', linewidth=2)
        axes[1, 1].set_title('Accuracy by Task Complexity', fontweight='bold')
        axes[1, 1].set_xlabel('Task Complexity')
        axes[1, 1].set_ylabel('Accuracy')
        axes[1, 1].legend(title='Agent', loc='best')
        axes[1, 1].grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

if __name__ == "__main__":
    print("Enterprise Software Benchmarking for Agentic Agents")
    print("="*60)
    task_suite = EnterpriseTaskSuite()
    benchmark = BenchmarkEngine(task_suite)
    agents = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
    for agent in agents:
        benchmark.run_benchmark(agent, iterations=3)
    results_df = benchmark.generate_report()
    benchmark.visualize_results(results_df)
    results_df.to_csv('agent_benchmark_results.csv', index=False)
    print("\nResults exported to: agent_benchmark_results.csv")

We generate detailed reports and create visual analytics for performance comparison. We analyze metrics such as success rate, execution time, and accuracy across agents and task complexities. Finally, we export the results to a CSV file, completing a full enterprise-grade evaluation workflow.

In conclusion, we implemented a robust, extensible benchmarking system that enables us to measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We observed how different architectures excel at different levels of task complexity and how visual analytics highlight performance trends. This process enables us to evaluate existing agents and provides a strong foundation for next-generation enterprise AI agents, optimized for reliability and intelligence.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post A Coding Implementation of a Comprehensive Enterprise AI Benchmarking Framework to Evaluate Rule-Based, LLM, and Hybrid Agentic AI Systems Across Real-World Tasks appeared first on MarkTechPost.

DeepAgent: A Deep Reasoning AI Agent that Performs Autonomous Thinking, Tool Discovery, and Action Execution within a Single Reasoning Process

Most agent frameworks still run a predefined Reason, Act, Observe loop, so the agent can only use the tools that are injected in the prompt. This works for small tasks, but it fails when the toolset is large, when the task is long, and when the agent must change strategy in the middle of reasoning. The team from Renmin University of China and Xiaohongshu proposes DeepAgent as an end to end deep reasoning agent that keeps all of this inside one coherent reasoning process.

https://arxiv.org/pdf/2510.21618

Unified Reasoning With On Demand Tool Discovery

DeepAgent lets the model output four action types directly in text, internal thought, tool search, tool call, and memory fold. When the agent decides to search, it queries a dense index that contains tool descriptions from large registries, for example 16,000 plus RapidAPI tools and 3,912 ToolHop tools, then it receives only the top ranked tools back in context. This makes tool access dynamic, the model does not depend on a front loaded tool list, and it stays aligned with real environments where tools change.
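A generic sketch of the dense tool retrieval idea is shown below; this is not DeepAgent's code, and the embedding model, the toy tool registry, and top_k are placeholder choices.

# Generic dense retrieval over tool descriptions with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model
tool_descriptions = [
    "get_weather(city): returns current weather for a city",
    "search_flights(origin, destination, date): lists available flights",
    "convert_currency(amount, from_ccy, to_ccy): converts money between currencies",
]
tool_embeddings = embedder.encode(tool_descriptions, convert_to_tensor=True)

def search_tools(query: str, top_k: int = 2):
    # Return only the top ranked tool descriptions to keep the agent context small.
    query_emb = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, tool_embeddings, top_k=top_k)[0]
    return [tool_descriptions[h["corpus_id"]] for h in hits]

print(search_tools("I need to book a plane ticket from Paris to Rome"))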

Autonomous Memory Folding for Long Horizon Tasks

Long sequences of tool calls, web results, and code responses will overflow the context. DeepAgent solves this with an autonomous memory folding step. When the model emits the fold token, an auxiliary LLM compresses the full history into three memories, Episodic Memory that records task events, Working Memory that records the current sub goal and recent issues, and Tool Memory that records tool names, arguments, and outcomes. These memories are fed back as structured text, so the agent continues from a compact but information rich state.
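A hedged sketch of what the folding step could look like: aux_llm stands for any auxiliary LLM call, and the prompt wording and section headers are assumptions; only the three part memory structure follows the paper's description.

# Sketch of autonomous memory folding with an auxiliary LLM (placeholder callable).
FOLD_PROMPT = """Compress the interaction history below into three sections:
[Episodic Memory] key events and decisions so far
[Working Memory] current sub goal and recent issues
[Tool Memory] tools called, arguments used, and outcomes

History:
{history}"""

def fold_memory(history: str, aux_llm) -> str:
    # The folded text replaces the raw history in the agent's context.
    return aux_llm(FOLD_PROMPT.format(history=history))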

ToolPO, Reinforcement Learning for Tool Use

Supervised traces do not teach robust tool use, because correct tool calls are only a few tokens inside a long generation. The research team introduces Tool Policy Optimization, ToolPO, to fix this. ToolPO runs rollouts on LLM simulated APIs, so training is stable and cheap, attributes reward to the exact tool call tokens through tool call advantage attribution, and trains with a clipped PPO style objective. This is how the agent learns not only to call tools, but also to decide when to search and when to fold memory.

https://arxiv.org/pdf/2510.21618

Benchmarks, Labeled Tools vs Open Set Tools

The research team evaluates on 5 general tool use benchmarks, ToolBench, API Bank, TMDB, Spotify, ToolHop, and on 4 downstream tasks, ALFWorld, WebShop, GAIA, HLE. In the labeled tool setting, where every method is given the exact tools it needs, DeepAgent 32B RL with a QwQ 32B backbone reports 69.0 on ToolBench, 75.3 on API Bank, 89.0 on TMDB, 75.4 on Spotify, and 51.3 on ToolHop, which is the strongest 32B level result across all 5 datasets. Workflow baselines such as ReAct and CodeAct can match single datasets, for example ReAct with strong models is high on TMDB and Spotify, but none of them stay high on all 5, so the fair summary is that DeepAgent is more uniform, not that others are always low.

In the open set retrieval setting, which is the realistic one, DeepAgent must first find tools and then call them. Here DeepAgent 32B RL reaches 64.0 on ToolBench and 40.6 on ToolHop, while the strongest workflow baselines reach 55.0 on ToolBench and 36.2 on ToolHop, so the end to end agent still holds the lead. The research team also shows that autonomous tool retrieval itself lifts workflow agents, but DeepAgent gains more, which confirms that the architecture and the training are matched to large toolsets.

https://arxiv.org/pdf/2510.21618

Downstream Environments

On ALFWorld, WebShop, GAIA, and HLE, all under a 32B reasoning model, DeepAgent reports 91.8 percent success on ALFWorld, 34.4 percent success and 56.3 score on WebShop, 53.3 on GAIA, and a higher score than workflow agents on HLE. These tasks are longer and noisier, so the combination of memory folding and ToolPO is the likely source of the gap.

Key Takeaways

DeepAgent keeps the whole agent loop inside one reasoning stream, the model can think, search tools, call them, and continue, so it is not limited to a fixed ReAct style workflow.

It uses dense retrieval over large tool registries, 16,000 plus RapidAPI tools and about 3,900 ToolHop tools, so tools do not have to be pre listed in the prompt, they are discovered on demand.

The autonomous memory folding module compresses long interaction histories into episodic, working, and tool memories, which prevents context overflow and keeps long horizon reasoning stable.

Tool Policy Optimization, ToolPO, trains tool use end to end with simulated APIs and token level advantage attribution, so the agent learns to issue correct tool calls, not only to reach the final answer.

On 5 tool benchmarks and 4 downstream tasks, DeepAgent at 32B scale is more consistent than workflow baselines in both labeled tool and open set settings, especially on ToolBench and ToolHop where tool discovery matters most.

https://arxiv.org/pdf/2510.21618

Editorial Comments

DeepAgent is a practical step toward agent architectures that do not depend on fixed tool prompts, because it unifies autonomous thinking, dense tool retrieval over 16,000 plus RapidAPIs and 3,900 plus ToolHop tools, structured tool calling, and memory folding in one loop. The use of LLM simulated APIs in ToolPO is an engineering choice, but it solves the latency and instability problem that hurts prior tool agents. The evaluation shows consistent 32B level gains in both labeled tool and open set settings, not isolated peaks. This release makes large toolspaces actually usable for LLM agents. Overall, DeepAgent confirms that end to end tool agents with memory and RL are emerging as the default pattern.

Check out the Paper and GitHub Repo. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post DeepAgent: A Deep Reasoning AI Agent that Performs Autonomous Thinking, Tool Discovery, and Action Execution within a Single Reasoning Process appeared first on MarkTechPost.

Anthropic’s New Research Shows Claude can Detect Injected Concepts, but only in Controlled Layers

How do you tell whether a model is actually noticing its own internal state instead of just repeating what training data said about thinking? A recent Anthropic research study, ‘Emergent Introspective Awareness in Large Language Models‘, asks whether current Claude models can do more than talk about their abilities, and whether they can notice real changes inside their network. To remove guesswork, the research team does not test on text alone, they directly edit the model’s internal activations and then ask the model what happened. This lets them tell apart genuine introspection from fluent self description.

Method, concept injection as activation steering

The core method is concept injection, described in the Transformer Circuits write up as an application of activation steering. The researchers first capture an activation pattern that corresponds to a concept, for example an all caps style or a concrete noun, then they add that vector into the activations of a later layer while the model is answering. If the model then reports that there is an injected thought that matches X, that answer is causally grounded in the current state, not in prior internet text. The Anthropic research team reports that this works best in later layers and with tuned strength.
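For readers who want the mechanics, here is a minimal PyTorch sketch of activation steering with a forward hook; it is not Anthropic's implementation, the layer path and scale are placeholders, and concept_vector is assumed to be a precomputed activation difference vector.

# Sketch: add a steering vector to a decoder layer's output during generation.
import torch

def make_injection_hook(concept_vector: torch.Tensor, scale: float = 4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * concept_vector.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage sketch (hypothetical layer path for a Hugging Face decoder model):
# layer = model.model.layers[20]
# handle = layer.register_forward_hook(make_injection_hook(concept_vector))
# ... run model.generate(...) and ask the model what it notices ...
# handle.remove()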

https://transformer-circuits.pub/2025/introspection/index.html

Main result, about 20 percent success with zero false positives in controls

Claude Opus 4 and Claude Opus 4.1 show the clearest effect. When the injection is done in the correct layer band and with the right scale, the models correctly report the injected concept in about 20 percent of trials. On control runs with no injection, production models do not falsely claim to detect an injected thought over 100 runs, which makes the 20 percent signal meaningful.

Separating internal concepts from user text

A natural objection is that the model could be importing the injected word into the text channel. Anthropic researchers test this. The model receives a normal sentence, the researchers inject an unrelated concept such as bread on the same tokens, and then they ask the model to name the concept and to repeat the sentence. The stronger Claude models can do both, they keep the user text intact and they name the injected thought, which shows that internal concept state can be reported separately from the visible input stream. For agent style systems, this is the interesting part, because it shows that a model can talk about the extra state that tool calls or agents may depend on.

Prefill, using introspection to tell what was intended

Another experiment targets an evaluation problem. Anthropic prefilled the assistant message with content the model did not plan. By default Claude says that the output was not intended. When the researchers retroactively inject the matching concept into earlier activations, the model now accepts the prefilled output as its own and can justify it. This shows that the model is consulting an internal record of its previous state to decide authorship, not only the final text. That is a concrete use of introspection.

Key Takeaways

Concept injection gives causal evidence of introspection: Anthropic shows that if you take a known activation pattern, inject it into Claude’s hidden layers, and then ask the model what is happening, advanced Claude variants can sometimes name the injected concept. This separates real introspection from fluent roleplay.

Best models succeed only in a narrow regime: Claude Opus 4 and 4.1 detect injected concepts only when the vector is added in the right layer band and with tuned strength, the reported success rate is about 20 percent of trials, and production runs show 0 false positives in controls, so the signal is real but small.

Models can keep text and internal ‘thoughts’ separate: In experiments where an unrelated concept is injected on top of normal input text, the model can both repeat the user sentence and report the injected concept, which means the internal concept stream is not just leaking into the text channel.

Introspection supports authorship checks: When Anthropic prefilled outputs that the model did not intend, the model disavowed them, but if the matching concept was retroactively injected, the model accepted the output as its own. This shows the model can consult past activations to decide whether it meant to say something.

This is a measurement tool, not a consciousness claim: The research team frame the work as functional, limited introspective awareness that could feed future transparency and safety evaluations, including ones about evaluation awareness, but they do not claim general self awareness or stable access to all internal features.

Editorial Comments

Anthropic’s ‘Emergent Introspective Awareness in LLMs‘ research is a useful measurement advance, not a grand metaphysical claim. The setup is clean, inject a known concept into hidden activations using activation steering, then query the model for a grounded self report. Claude variants sometimes detect and name the injected concept, and they can keep injected ‘thoughts’ distinct from input text, which is operationally relevant for agent debugging and audit trails. The research team also shows limited intentional control of internal states. Constraints remain strong, effects are narrow, and reliability is modest, so downstream use should be evaluative, not safety critical.

Check out the Paper and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Anthropic’s New Research Shows Claude can Detect Injected Concepts, but only in Controlled Layers appeared first on MarkTechPost.

Google AI Unveils Supervised Reinforcement Learning (SRL): A Step Wise Framework with Expert Trajectories to Teach Small Language Models to Reason through Hard Problems

How can a small model learn to solve tasks it currently fails at, without rote imitation or relying on a correct rollout? A team of researchers from Google Cloud AI Research and UCLA has released a training framework, ‘Supervised Reinforcement Learning’ (SRL), that makes 7B scale models actually learn from very hard math and agent trajectories that normal supervised fine tuning and outcome based reinforcement learning (RL) cannot learn from.

Small open source models such as Qwen2.5 7B Instruct fail on the hardest problems in s1K 1.1, even when the teacher trace is good. If we apply supervised fine tuning on the full DeepSeek R1 style solutions, the model imitates token by token; because the sequences are long and the dataset has only 1,000 items, the final scores drop below the base model.

https://arxiv.org/pdf/2510.25992

Core idea of ‘Supervised Reinforcement Learning’ SRL

‘Supervised Reinforcement Learning’ (SRL) keeps the RL style optimization, but it injects supervision into the reward channel instead of into the loss. Each expert trajectory from s1K 1.1 is parsed into a sequence of actions. For every prefix of that sequence, the research team creates a new training example, the model first produces a private reasoning span wrapped in <think> … </think>, then it outputs the action for that step, and only this action is compared with the teacher action using a sequence similarity metric based on difflib. The reward is dense because every step has a score, even when the final answer is wrong. The rest of the text, the reasoning part, is not constrained, so the model can search its own chain without being forced to copy the teacher tokens.
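A small sketch of that step reward is shown below, assuming the action is whatever follows the </think> tag; the parsing is an assumption about the format, while difflib.SequenceMatcher is the standard library similarity the paper's reward is described as building on.

# Sketch of a dense, step wise SRL reward based on sequence similarity.
import difflib

def step_reward(model_output: str, teacher_action: str) -> float:
    # Keep only the action emitted after the private reasoning span.
    action = model_output.split("</think>", 1)[-1].strip()
    # Similarity in [0, 1], available even when the final answer is wrong.
    return difflib.SequenceMatcher(None, action, teacher_action).ratio()

# Example: a close-but-not-identical step still earns partial reward.
print(step_reward("<think>isolate x first</think> x = (y - 3) / 2", "x = (y - 3)/2"))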

Math results

All models are initialized from Qwen2.5 7B Instruct and all are trained on the same DeepSeek R1 formatted s1K 1.1 set, so comparisons are clean. The exact numbers in Table 1 are:

Base Qwen2.5 7B Instruct, AMC23 greedy 50.0, AIME24 greedy 13.3, AIME25 greedy 6.7.

SRL, AMC23 greedy 50.0, AIME24 greedy 16.7, AIME25 greedy 13.3.

SRL then RLVR, AMC23 greedy 57.5, AIME24 greedy 20.0, AIME25 greedy 10.0.

https://arxiv.org/pdf/2510.25992

This is the key improvement, SRL alone already removes the SFT degradation and raises AIME24 and AIME25, and when RLVR is run after SRL, the system reaches the best open source scores in the research. The research team is explicit that the best pipeline is SRL then RLVR, not SRL in isolation.

Software engineering results

The research team also applies SRL to Qwen2.5 Coder 7B Instruct using 5,000 verified agent trajectories generated by claude 3 7 sonnet, every trajectory is decomposed into step wise instances, and in total 134,000 step items are produced. Evaluation is on SWE Bench Verified. The base model gets 5.8 percent in the oracle file edit mode and 3.2 percent end to end. SWE Gym 7B gets 8.4 percent and 4.2 percent. SRL gets 14.8 percent and 8.6 percent, which is about 2 times the base model and clearly higher than the SFT baseline.

https://arxiv.org/pdf/2510.25992

Key Takeaways

SRL reformulates hard reasoning as step wise action generation, the model first produces an internal monologue then outputs a single action, and only that action is rewarded by sequence similarity, so the model gets signal even when the final answer is wrong.

SRL is run on the same DeepSeek R1 formatted s1K 1.1 data as SFT and RLVR, but unlike SFT it does not overfit long demonstrations, and unlike RLVR it does not collapse when no rollout is correct.

On math, the exact order that gives the strongest results in the research is, initialize Qwen2.5 7B Instruct with SRL, then apply RLVR, which pushes reasoning benchmarks higher than either method alone.

The same SRL recipe generalizes to agentic software engineering, using 5,000 verified trajectories from claude 3 7 sonnet 20250219, and it lifts SWE Bench Verified well above both the base Qwen2.5 Coder 7B Instruct and the SFT style SWE Gym 7B baseline.

Compared to other step wise RL methods that need an extra reward model, this SRL keeps a GRPO style objective and uses only actions from expert trajectories and a lightweight string similarity, so it is easy to run on small hard datasets.

Editorial Comments

‘Supervised Reinforcement Learning’ (SRL) is a practical contribution by the research team. It keeps the GRPO style reinforcement learning setup, but it replaces fragile outcome level rewards with supervised, step wise rewards that are computed directly from expert trajectories, so the model always receives informative signal, even in the hard regime where RLVR and SFT both stall. It is important that the research team shows SRL on math and on SWE Bench Verified with the same recipe, and that the strongest configuration is SRL followed by RLVR, not either one alone. This makes SRL a realistic path for open models to learn hard tasks. Overall, SRL is a clean bridge between process supervision and RL that open model teams can adopt immediately.

Check out the Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Google AI Unveils Supervised Reinforcement Learning (SRL): A Step Wise Framework with Expert Trajectories to Teach Small Language Models to Reason through Hard Problems appeared first on MarkTechPost.

OpenAI Releases Research Preview of ‘gpt-oss-safeguard’: Two Open-Weight Reasoning Models for Safety Classification Tasks

OpenAI has released a research preview of gpt-oss-safeguard, two open weight safety reasoning models that let developers apply custom safety policies at inference time. The models come in two sizes, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, both fine tuned from gpt-oss, both licensed under Apache 2.0, and both available on Hugging Face for local use.

https://openai.com/index/introducing-gpt-oss-safeguard/

Why Policy-Conditioned Safety Matters?

Conventional moderation models are trained on a single fixed policy. When that policy changes, the model must be retrained or replaced. gpt-oss-safeguard reverses this relationship. It takes the developer authored policy as input together with the user content, then reasons step by step to decide whether the content violates the policy. This turns safety into a prompt and evaluation task, which is better suited for fast changing or domain specific harms such as fraud, biology, self harm or game specific abuse.
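As a hedged illustration of policy conditioned classification, the sketch below passes a developer policy plus content through a chat style prompt; the policy text, labels, and prompt wording are placeholders, the Hugging Face model id should be checked against the actual release, and the harmony format details are not reproduced here.

# Sketch only: policy and content go in at inference time, no retraining needed.
from transformers import pipeline

POLICY = """Policy: gameplay fraud.
Violation: offers to sell in-game currency for real money, account trading, paid boosting.
Not a violation: discussing prices inside the in-game economy."""

clf = pipeline("text-generation", model="openai/gpt-oss-safeguard-20b")  # verify model id on Hugging Face
messages = [
    {"role": "system", "content": POLICY},
    {"role": "user", "content": "Classify as VIOLATION or SAFE and explain briefly:\n'selling 10k gold, paypal only, dm me'"},
]
print(clf(messages, max_new_tokens=200)[0]["generated_text"])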

Same Pattern as OpenAI’s Internal Safety Reasoner

OpenAI states that gpt-oss-safeguard is an open weight implementation of the Safety Reasoner used internally across systems like GPT 5, ChatGPT Agent and Sora 2. In production settings OpenAI already runs small high recall filters first, then escalates uncertain or sensitive items to a reasoning model, and in recent launches up to 16 percent of total compute was spent on safety reasoning. The open release lets external teams reproduce this defense in depth pattern instead of guessing how OpenAI’s stack works.

Model Sizes and Hardware Fit

The large model, gpt-oss-safeguard-120b, has 117B parameters with 5.1B active parameters and is sized to fit on a single 80GB H100 class GPU. The smaller gpt-oss-safeguard-20b has 21B parameters with 3.6B active parameters and targets lower latency or smaller GPUs, including 16GB setups. Both models were trained on the harmony response format, so prompts must follow that structure otherwise results will degrade. The license is Apache 2.0, the same as the parent gpt-oss models, so commercial local deployment is permitted.

https://openai.com/index/introducing-gpt-oss-safeguard/

Evaluation Results

OpenAI evaluated the models on internal multi policy tests and on public datasets. In multi policy accuracy, where the model must correctly apply several policies at once, gpt-oss-safeguard and OpenAI’s internal Safety Reasoner outperform gpt-5-thinking and the open gpt-oss baselines. On the 2022 moderation dataset the new models slightly outperform both gpt-5-thinking and the internal Safety Reasoner; however, OpenAI specifies that this gap is not statistically significant, so it should not be oversold. On ToxicChat, the internal Safety Reasoner still leads, with gpt-oss-safeguard close behind. This places the open models in the competitive range for real moderation tasks.

Recommended Deployment Pattern

OpenAI is explicit that pure reasoning on every request is expensive. The recommended setup is to run small, fast, high recall classifiers on all traffic, then send only uncertain or sensitive content to gpt-oss-safeguard, and when user experience requires fast responses, to run the reasoner asynchronously. This mirrors OpenAI’s own production guidance and reflects the fact that dedicated task specific classifiers can still win when there is a large high quality labeled dataset.
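The routing logic itself is simple; a sketch under stated assumptions (placeholder callables, arbitrary 0.2 and 0.8 thresholds) follows.

# Sketch of the layered moderation pattern: cheap filter first, reasoner only on uncertain items.
def moderate(content: str, fast_classifier, safety_reasoner, policy: str) -> str:
    p_violation = fast_classifier(content)   # cheap, high recall score in [0, 1]
    if p_violation < 0.2:
        return "allow"
    if p_violation > 0.8:
        return "block"
    # Only uncertain traffic pays for the slower, policy-conditioned reasoner.
    return safety_reasoner(policy=policy, content=content)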

Key Takeaways

gpt-oss-safeguard is a research preview of two open weight safety reasoning models, 120b and 20b, that classify content using developer supplied policies at inference time, so policy changes do not require retraining.

The models implement the same Safety Reasoner pattern OpenAI uses internally across GPT 5, ChatGPT Agent and Sora 2, where a first fast filter routes only risky or ambiguous content to a slower reasoning model.

Both models are fine tuned from gpt-oss, keep the harmony response format, and are sized for real deployments, the 120b model fits on a single H100 class GPU, the 20b model targets 16GB level hardware, and both are Apache 2.0 on Hugging Face.

On internal multi policy evaluations and on the 2022 moderation dataset, the safeguard models outperform gpt-5-thinking and the gpt-oss baselines, but OpenAI notes that the small margin over the internal Safety Reasoner is not statistically significant.

OpenAI recommends using these models in a layered moderation pipeline, together with community resources such as ROOST, so platforms can express custom taxonomies, audit the chain of thought, and update policies without touching weights.

Editorial Comments

OpenAI is taking an internal safety pattern and making it reproducible, which is the most important part of this launch. The models are open weight, policy conditioned and Apache 2.0, so platforms can finally apply their own taxonomies instead of accepting fixed labels. The fact that gpt-oss-safeguard matches and sometimes slightly exceeds the internal Safety Reasoner on the 2022 moderation dataset, while outperforming gpt-5-thinking on multi policy accuracy, but with a non statistically significant margin, shows the approach is already usable. The recommended layered deployment is realistic for production.
The post OpenAI Releases Research Preview of ‘gpt-oss-safeguard’: Two Open-Weight Reasoning Models for Safety Classification Tasks appeared first on MarkTechPost.

How to Design an Autonomous Multi-Agent Data and Infrastructure Strate …

In this tutorial, we build an Agentic Data and Infrastructure Strategy system using the lightweight Qwen2.5-0.5B-Instruct model for efficient execution. We begin by creating a flexible LLM agent framework and then develop specialized agents that handle different layers of data management, from ingestion and quality analysis to infrastructure optimization. We integrate these agents into an orchestrator that coordinates their interactions, ensuring smooth multi-agent collaboration across the data pipeline. Through hands-on examples like e-commerce and IoT pipelines, we explore how autonomous decision-making can streamline complex data operations. Check out the FULL CODES here.

!pip install -q transformers torch accelerate datasets huggingface_hub
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import json, time
from typing import List, Dict, Any
from dataclasses import dataclass
from datetime import datetime
import pandas as pd

class LightweightLLMAgent:
    def __init__(self, role: str, model_name: str = "Qwen/Qwen2.5-0.5B-Instruct"):
        self.role = role
        self.model_name = model_name
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Loading {model_name} for {role} agent on {self.device}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            device_map="auto"
        )
        self.conversation_history = []

    def generate_response(self, prompt: str, max_tokens: int = 150) -> str:
        messages = [
            {"role": "system", "content": f"You are a {self.role} agent in a data infrastructure system."},
            {"role": "user", "content": prompt}
        ]
        text = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.device)
        with torch.no_grad():
            generated_ids = self.model.generate(
                model_inputs.input_ids,
                max_new_tokens=max_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.95
            )
        generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        self.conversation_history.append({"prompt": prompt, "response": response})
        return response

We start by setting up the lightweight LLM agent infrastructure using the Qwen2.5-0.5B-Instruct model. We load the model and tokenizer, and define a base agent class capable of handling contextual conversations and generating intelligent responses. This forms the core foundation upon which our specialized agents operate efficiently within Colab. Check out the FULL CODES here.

class DataIngestionAgent(LightweightLLMAgent):
    def __init__(self):
        super().__init__(role="Data Ingestion Specialist")

    def analyze_data_source(self, source_info: Dict) -> Dict:
        prompt = f"""Analyze this data source and provide ingestion strategy:
        Source Type: {source_info.get('type', 'unknown')}
        Volume: {source_info.get('volume', 'unknown')}
        Frequency: {source_info.get('frequency', 'unknown')}
        Provide a brief strategy focusing on: 1) Ingestion method, 2) Key considerations."""
        strategy = self.generate_response(prompt, max_tokens=100)
        return {"source": source_info, "strategy": strategy, "timestamp": datetime.now().isoformat()}

class DataQualityAgent(LightweightLLMAgent):
    def __init__(self):
        super().__init__(role="Data Quality Analyst")

    def assess_data_quality(self, data_sample: Dict) -> Dict:
        prompt = f"""Assess data quality for this sample:
        Completeness: {data_sample.get('completeness', 'N/A')}%
        Consistency: {data_sample.get('consistency', 'N/A')}%
        Issues Found: {data_sample.get('issues', 0)}
        Provide brief quality assessment and top 2 recommendations."""
        assessment = self.generate_response(prompt, max_tokens=100)
        return {"assessment": assessment, "severity": self._calculate_severity(data_sample), "timestamp": datetime.now().isoformat()}

    def _calculate_severity(self, data_sample: Dict) -> str:
        completeness = data_sample.get('completeness', 100)
        consistency = data_sample.get('consistency', 100)
        avg_score = (completeness + consistency) / 2
        if avg_score >= 90: return "LOW"
        elif avg_score >= 70: return "MEDIUM"
        else: return "HIGH"

We design the Data Ingestion and Data Quality agents to focus on structured analysis of data pipelines. We let the ingestion agent determine the best approach to data flow, while the quality agent evaluates data completeness, consistency, and issues to provide actionable insights. Together, they establish the first two layers of autonomous data management. Check out the FULL CODES here.
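
To see what these two agents return outside the orchestrator, the short sketch below calls them directly with hand-made dictionaries. It assumes the classes above are already defined; the source and sample values are illustrative assumptions, not data from the tutorial.

# Standalone check of the two agents defined above; the sample values are illustrative.
ingestion_agent = DataIngestionAgent()
quality_agent = DataQualityAgent()

source = {"type": "CSV export", "volume": "2GB/day", "frequency": "hourly"}
sample = {"completeness": 76, "consistency": 81, "issues": 23}

print(ingestion_agent.analyze_data_source(source)["strategy"])
quality = quality_agent.assess_data_quality(sample)
print(quality["severity"])  # (76 + 81) / 2 = 78.5, so severity is "MEDIUM"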

class InfrastructureOptimizationAgent(LightweightLLMAgent):
    def __init__(self):
        super().__init__(role="Infrastructure Optimization Specialist")

    def optimize_resources(self, metrics: Dict) -> Dict:
        prompt = f"""Analyze infrastructure metrics and suggest optimizations:
        CPU Usage: {metrics.get('cpu_usage', 0)}%
        Memory Usage: {metrics.get('memory_usage', 0)}%
        Storage: {metrics.get('storage_used', 0)}GB / {metrics.get('storage_total', 0)}GB
        Query Latency: {metrics.get('query_latency', 0)}ms
        Provide 2 optimization recommendations."""
        recommendations = self.generate_response(prompt, max_tokens=100)
        return {"current_metrics": metrics, "recommendations": recommendations, "priority": self._calculate_priority(metrics), "timestamp": datetime.now().isoformat()}

    def _calculate_priority(self, metrics: Dict) -> str:
        cpu = metrics.get('cpu_usage', 0)
        memory = metrics.get('memory_usage', 0)
        if cpu > 85 or memory > 85:
            return "CRITICAL"
        elif cpu > 70 or memory > 70:
            return "HIGH"
        else:
            return "NORMAL"

We develop the Infrastructure Optimization Agent to continuously analyze key metrics like CPU, memory, and storage utilization. We use it to generate intelligent optimization suggestions, helping us maintain high performance and resource efficiency. This agent ensures that our infrastructure remains responsive and scalable during data operations. Check out the FULL CODES here.
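
The priority logic is purely threshold based, so it can be verified without waiting for any LLM output. The sketch below, assuming the class above is defined, exercises _calculate_priority directly and then makes one full call; all metric values are illustrative.

# The priority thresholds are deterministic, so they can be checked without model output.
# Note: the constructor still loads the model; only _calculate_priority itself is model-free.
opt_agent = InfrastructureOptimizationAgent()

print(opt_agent._calculate_priority({"cpu_usage": 90, "memory_usage": 60}))  # CRITICAL (cpu > 85)
print(opt_agent._calculate_priority({"cpu_usage": 72, "memory_usage": 60}))  # HIGH (cpu > 70)
print(opt_agent._calculate_priority({"cpu_usage": 40, "memory_usage": 50}))  # NORMAL

metrics = {"cpu_usage": 72, "memory_usage": 68, "storage_used": 300,
           "storage_total": 500, "query_latency": 120}  # illustrative values
print(opt_agent.optimize_resources(metrics)["priority"])  # HIGH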

class AgenticDataOrchestrator:
    def __init__(self):
        print("\n" + "=" * 70)
        print("Initializing Agentic Data Infrastructure System")
        print("=" * 70 + "\n")
        self.ingestion_agent = DataIngestionAgent()
        self.quality_agent = DataQualityAgent()
        self.optimization_agent = InfrastructureOptimizationAgent()
        self.execution_log = []

    def process_data_pipeline(self, pipeline_config: Dict) -> Dict:
        results = {"pipeline_id": pipeline_config.get("id", "unknown"), "start_time": datetime.now().isoformat(), "stages": []}

        print("\n[Stage 1] Data Ingestion Analysis")
        ingestion_result = self.ingestion_agent.analyze_data_source(pipeline_config.get("source", {}))
        print(f"Strategy: {ingestion_result['strategy'][:150]}...")
        results["stages"].append({"stage": "ingestion", "result": ingestion_result})

        print("\n[Stage 2] Data Quality Assessment")
        quality_result = self.quality_agent.assess_data_quality(pipeline_config.get("quality_metrics", {}))
        print(f"Assessment: {quality_result['assessment'][:150]}...")
        print(f"Severity: {quality_result['severity']}")
        results["stages"].append({"stage": "quality", "result": quality_result})

        print("\n[Stage 3] Infrastructure Optimization")
        optimization_result = self.optimization_agent.optimize_resources(pipeline_config.get("infrastructure_metrics", {}))
        print(f"Recommendations: {optimization_result['recommendations'][:150]}...")
        print(f"Priority: {optimization_result['priority']}")
        results["stages"].append({"stage": "optimization", "result": optimization_result})

        results["end_time"] = datetime.now().isoformat()
        results["status"] = "completed"
        self.execution_log.append(results)
        return results

    def generate_summary_report(self) -> pd.DataFrame:
        if not self.execution_log:
            return pd.DataFrame()
        summary_data = []
        for log in self.execution_log:
            summary_data.append({
                "Pipeline ID": log["pipeline_id"],
                "Start Time": log["start_time"],
                "Status": log["status"],
                "Stages Completed": len(log["stages"]),
            })
        return pd.DataFrame(summary_data)

We build an Agentic Data Orchestrator to coordinate all the specialized agents under a unified workflow. We use it to manage end-to-end pipeline execution, triggering ingestion, quality checks, and optimization sequentially. By doing this, we bring structure, collaboration, and automation to the entire multi-agent system. Check out the FULL CODES here.
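
A single minimal run shows the shape of the result the orchestrator produces before the full examples in main() below. This is a sketch under the assumption that the classes above are defined; the toy_pipeline name and its configuration values are illustrative.

# Minimal single-pipeline run (loads three agent models); configuration values are illustrative.
orchestrator = AgenticDataOrchestrator()
toy_pipeline = {
    "id": "toy_pipeline_000",
    "source": {"type": "CSV export", "volume": "1GB/day", "frequency": "daily"},
    "quality_metrics": {"completeness": 90, "consistency": 85, "issues": 3},
    "infrastructure_metrics": {"cpu_usage": 40, "memory_usage": 55, "storage_used": 100,
                               "storage_total": 500, "query_latency": 90},
}
result = orchestrator.process_data_pipeline(toy_pipeline)
print(result["status"], len(result["stages"]))  # completed 3
print(orchestrator.generate_summary_report().to_string(index=False))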

def main():
    orchestrator = AgenticDataOrchestrator()

    print("\n" + "=" * 70)
    print("EXAMPLE 1: E-commerce Data Pipeline")
    print("=" * 70)
    ecommerce_pipeline = {
        "id": "ecommerce_pipeline_001",
        "source": {"type": "REST API", "volume": "10GB/day", "frequency": "real-time"},
        "quality_metrics": {"completeness": 87, "consistency": 92, "issues": 15},
        "infrastructure_metrics": {"cpu_usage": 78, "memory_usage": 82, "storage_used": 450, "storage_total": 1000, "query_latency": 250}
    }
    result1 = orchestrator.process_data_pipeline(ecommerce_pipeline)

    print("\n\n" + "=" * 70)
    print("EXAMPLE 2: IoT Sensor Data Pipeline")
    print("=" * 70)
    iot_pipeline = {
        "id": "iot_pipeline_002",
        "source": {"type": "Message Queue (Kafka)", "volume": "50GB/day", "frequency": "streaming"},
        "quality_metrics": {"completeness": 95, "consistency": 88, "issues": 8},
        "infrastructure_metrics": {"cpu_usage": 65, "memory_usage": 71, "storage_used": 780, "storage_total": 2000, "query_latency": 180}
    }
    result2 = orchestrator.process_data_pipeline(iot_pipeline)

    print("\n\n" + "=" * 70)
    print("EXECUTION SUMMARY REPORT")
    print("=" * 70 + "\n")
    summary_df = orchestrator.generate_summary_report()
    print(summary_df.to_string(index=False))

    print("\n" + "=" * 70)
    print("Tutorial Complete!")
    print("=" * 70)
    print("\nKey Concepts Demonstrated:")
    print("✓ Lightweight LLM agent architecture")
    print("✓ Specialized agents for different data tasks")
    print("✓ Multi-agent orchestration")
    print("✓ Infrastructure monitoring and optimization")
    print("✓ Autonomous decision-making in data pipelines")


if __name__ == "__main__":
    main()

We demonstrate the complete system through two real-world examples: an e-commerce pipeline and an IoT sensor data pipeline. We observe how each agent performs its role autonomously while contributing to a shared objective. Finally, we generate a summary report confirming that every stage completed, illustrating what lightweight agentic intelligence can accomplish.

In conclusion, we design and execute an intelligent, multi-agent data infrastructure framework powered by a compact open-source model. We witness how independent yet cooperative agents can autonomously analyze, assess, and optimize real-world data systems. The entire setup demonstrates how lightweight LLMs can efficiently handle infrastructure intelligence, while also highlighting how agentic orchestration transforms traditional data workflows into adaptive, self-optimizing systems ready for scalable enterprise applications.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.