Building a Secure and Memory-Enabled Cipher Workflow for AI Agents with Dynamic LLM Selection and API Integration

In this tutorial, we walk through building a compact but fully functional Cipher-based workflow. We start by securely capturing our Gemini API key in the Colab UI without exposing it in code. We then implement a dynamic LLM selection function that can automatically switch between OpenAI, Gemini, or Anthropic based on which API key is available. The setup phase ensures Node.js and the Cipher CLI are installed, after which we programmatically generate a cipher.yml configuration to enable a memory agent with long-term recall. We create helper functions to run Cipher commands directly from Python, store key project decisions as persistent memories, retrieve them on demand, and finally spin up Cipher in API mode for external integration.

import os, getpass
os.environ["GEMINI_API_KEY"] = getpass.getpass("Enter your Gemini API key: ").strip()

import subprocess, tempfile, pathlib, textwrap, time, requests, shlex

def choose_llm():
    if os.getenv("OPENAI_API_KEY"):
        return "openai", "gpt-4o-mini", "OPENAI_API_KEY"
    if os.getenv("GEMINI_API_KEY"):
        return "gemini", "gemini-2.5-flash", "GEMINI_API_KEY"
    if os.getenv("ANTHROPIC_API_KEY"):
        return "anthropic", "claude-3-5-haiku-20241022", "ANTHROPIC_API_KEY"
    raise RuntimeError("Set one API key before running.")

We start by securely entering our Gemini API key using getpass so it stays hidden in the Colab UI. We then define a choose_llm() function that checks our environment variables and automatically selects the appropriate LLM provider, model, and key based on what is available.

def run(cmd, check=True, env=None):
    print("▸", cmd)
    p = subprocess.run(cmd, shell=True, text=True, capture_output=True, env=env)
    if p.stdout: print(p.stdout)
    if p.stderr: print(p.stderr)
    if check and p.returncode != 0:
        raise RuntimeError(f"Command failed: {cmd}")
    return p

We create a run() helper function that executes shell commands, prints both stdout and stderr for visibility, and raises an error if the command fails when check is enabled, making our workflow execution more transparent and reliable.

def ensure_node_and_cipher():
    run("sudo apt-get update -y && sudo apt-get install -y nodejs npm", check=False)
    run("npm install -g @byterover/cipher")

We define ensure_node_and_cipher() to install Node.js, npm, and the Cipher CLI globally, ensuring our environment has all the necessary dependencies before running any Cipher-related commands.

def write_cipher_yml(workdir, provider, model, key_env):
    cfg = """
llm:
  provider: {provider}
  model: {model}
  apiKey: ${key_env}
systemPrompt:
  enabled: true
  content: |
    You are an AI programming assistant with long-term memory of prior decisions.
embedding:
  disabled: true
mcpServers:
  filesystem:
    type: stdio
    command: npx
    args: ['-y', '@modelcontextprotocol/server-filesystem', '.']
""".format(provider=provider, model=model, key_env=key_env)

    (workdir / "memAgent").mkdir(parents=True, exist_ok=True)
    (workdir / "memAgent" / "cipher.yml").write_text(cfg.strip() + "\n")

We implement write_cipher_yml() to generate a cipher.yml configuration file inside a memAgent folder, setting the chosen LLM provider, model, and API key, enabling a system prompt with long-term memory, and registering a filesystem MCP server for file operations.

def cipher_once(text, env=None, cwd=None):
    cmd = f'cipher {shlex.quote(text)}'
    p = subprocess.run(cmd, shell=True, text=True, capture_output=True, env=env, cwd=cwd)
    print("Cipher says:\n", p.stdout or p.stderr)
    return p.stdout.strip() or p.stderr.strip()

We define cipher_once() to run a single Cipher CLI command with the provided text, capture and display its output, and return the response, allowing us to interact with Cipher programmatically from Python.

def start_api(env, cwd):
    proc = subprocess.Popen("cipher --mode api", shell=True, env=env, cwd=cwd,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for _ in range(30):
        try:
            r = requests.get("http://127.0.0.1:3000/health", timeout=2)
            if r.ok:
                print("API /health:", r.text)
                break
        except requests.RequestException:
            pass
        time.sleep(1)
    return proc

We create start_api() to launch Cipher in API mode as a subprocess, then repeatedly poll its /health endpoint until it responds, ensuring the API server is ready before proceeding.

def main():
    provider, model, key_env = choose_llm()
    ensure_node_and_cipher()
    workdir = pathlib.Path(tempfile.mkdtemp(prefix="cipher_demo_"))
    write_cipher_yml(workdir, provider, model, key_env)
    env = os.environ.copy()

    cipher_once("Store decision: use pydantic for config validation; pytest fixtures for testing.", env, str(workdir))
    cipher_once("Remember: follow conventional commits; enforce black + isort in CI.", env, str(workdir))

    cipher_once("What did we standardize for config validation and Python formatting?", env, str(workdir))

    api_proc = start_api(env, str(workdir))
    time.sleep(3)
    api_proc.terminate()

if __name__ == "__main__":
    main()

In main(), we select the LLM provider, install dependencies, and create a temporary working directory with a cipher.yml configuration. We then store key project decisions in Cipher’s memory, query them back, and finally start the Cipher API server briefly before shutting it down, demonstrating both CLI and API-based interactions.

In conclusion, we have a working Cipher environment that securely manages API keys, selects the right LLM provider automatically, and configures a memory-enabled agent entirely through Python automation. Our implementation includes decision logging, memory retrieval, and a live API endpoint, all orchestrated in a Notebook/Colab-friendly workflow. This makes the setup reusable for other AI-assisted development pipelines, allowing us to store and query project knowledge programmatically while keeping the environment lightweight and easy to redeploy.


NuMind AI Releases NuMarkdown-8B-Thinking: A Reasoning Breakthrough in OCR and Document-to-Markdown Conversion

NuMind AI has officially released NuMarkdown-8B-Thinking, an open-source (MIT License) reasoning OCR Vision-Language Model (VLM) that redefines how complex documents are digitized and structured. Unlike traditional OCR systems, NuMarkdown-8B-Thinking doesn’t just extract text—it thinks about a document’s layout, structure, and formatting before generating a precise, ready-to-use Markdown file.

This makes it the first reasoning VLM purpose-built for converting PDFs, scanned documents, and spreadsheets into clean, structured Markdown—ideal for Retrieval-Augmented Generation (RAG) workflows, AI-powered knowledge bases, and large-scale document archiving.

How NuMarkdown-8B-Thinking Is Different

The model introduces a reasoning-first approach to OCR. Instead of directly rendering extracted text, NuMarkdown-8B-Thinking generates “thinking tokens” — internal reasoning steps that help it understand document layouts before producing the final output.

This capability allows it to handle formats and structures that stump most conventional and even AI-powered OCR systems, including:

Multi-column layouts with complex reading orders

Tables with merged, nested, or irregular cells

Mixed visual elements (images, decorative headers, watermarks)

Historical or degraded scans where layout inference is crucial

The number of reasoning tokens varies with complexity—anywhere from 20% to 500% of the final Markdown length—showing how much the model “thinks” before it “writes.”

Training and Architecture

NuMarkdown-8B-Thinking is a fine-tuned version of Qwen 2.5-VL-7B from Alibaba—one of the strongest open-source multi-modal models available.

Its training pipeline involved two key phases:

Supervised Fine-Tuning (SFT) on synthetic document samples where each example included:

Raw document input

Intermediate reasoning steps (layout parsing, structure inference)

Final Markdown representation

Reinforcement Learning with GRPO, using a layout-centric reward that encouraged accurate reconstruction of document formatting and spatial relationships.

This two-stage process gave NuMarkdown-8B-Thinking the ability to maintain high accuracy even on challenging layouts that typically require human-level judgment.

Benchmark Results: Outperforming OCR Heavyweights

In independent evaluations and user testing, NuMarkdown-8B-Thinking demonstrates state-of-the-art reasoning for OCR-to-Markdown tasks:

Beats:

Generalist models like GPT-4o

Specialized OCR-focused models like OCRFlux

Competitive with:

Large closed-source reasoning models like Gemini 2.5

Just behind elite models like Gemini Flash Reasoning in blind, multi-model user rankings

Users particularly highlight its ability to:

Correctly infer reading order in non-linear layouts

Preserve intricate table formatting

Output clean, parsing-friendly Markdown for RAG ingestion without further post-processing

Example in Action

Imagine a scanned annual report page with:

Multi-level headings

Sidebars and multiple columns

A financial table with merged cells and uneven row spacing

A footer with legal disclaimers

NuMarkdown-8B-Thinking first produces reasoning tokens outlining the structure (“Column 1: Intro paragraph… Column 2: Continue paragraph… Footer text at bottom… Table spans two columns…”), then outputs Markdown that accurately reflects both content and layout.

This transparent reasoning layer makes the model’s decisions auditable—a major plus in enterprise, legal, and archival contexts.

Deployment Options

Whether you’re a researcher, developer, or enterprise AI engineer, NuMarkdown-8B-Thinking is ready to slot into your workflow:

Hugging Face: Available for direct testing and integration.

Local Execution: Model weights and quantized GGUF versions are published for CPU/GPU-friendly deployment.

API-friendly: Compatible with OpenAI-style APIs and Hugging Face Transformers for rapid integration into pipelines.
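For local experimentation, a minimal inference sketch with Hugging Face Transformers might look like the following. The repo ID, prompt wording, and processing calls are assumptions based on the model's Qwen2.5-VL lineage; check the official model card for the exact identifier, chat template, and recommended generation settings.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText  # requires a recent transformers release

# Hypothetical repo ID; verify against the official NuMind model card before use.
model_id = "numind/NuMarkdown-8B-Thinking"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A scanned page to convert; the prompt text here is illustrative.
image = Image.open("scanned_page.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this document page to Markdown."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=4096)

# The decoded text contains the reasoning ("thinking") tokens followed by the final Markdown.
new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))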

Its MIT License ensures full freedom for commercial, academic, or personal projects—no vendor lock-in or costly API gates.

Why This Matters

For industries that rely on accurate document digitization—finance, legal, healthcare, government archives—layout fidelity is as important as textual accuracy. Most OCR systems treat layout as an afterthought; NuMarkdown-8B-Thinking treats it as a reasoning problem.

By combining open-sourcing, layout reasoning, and RAG-optimized Markdown output, NuMarkdown-8B-Thinking offers a transparent, verifiable, and high-performance alternative to proprietary document AI solutions.

Check out the Model on Hugging Face and GitHub Page.

Genie Envisioner: A Unified Video-Generative Platform for Scalable, Instruction-Driven Robotic Manipulation

Embodied AI agents that can perceive, think, and act in the real world mark a key step toward the future of robotics. A central challenge is building scalable, reliable robotic manipulation, the skill of deliberately interacting with and controlling objects through selective contact. While progress spans analytic methods, model-based approaches, and large-scale data-driven learning, most systems still operate in disjoint stages of data collection, training, and evaluation. These stages often require custom setups, manual curation, and task-specific tweaks, creating friction that slows progress, hides failure patterns, and hampers reproducibility. This highlights the need for a unified framework to streamline learning and assessment. 

Robotic manipulation research has progressed from analytical models to neural world models that learn dynamics directly from sensory inputs, using both pixel and latent spaces. Large-scale video generation models can produce realistic visuals but often lack action conditioning, long-term temporal consistency, and multi-view reasoning needed for control. Vision-language-action models follow instructions but are limited by imitation-based learning, preventing error recovery and planning. Policy evaluation remains challenging, as physics simulators require heavy tuning, and real-world testing is resource-intensive. Existing evaluation metrics often emphasize visual quality over task success, highlighting the need for benchmarks that better capture real-world manipulation performance. 

The Genie Envisioner (GE), developed by researchers from AgiBot Genie Team, NUS LV-Lab, and BUAA, is a unified platform for robotic manipulation that combines policy learning, simulation, and evaluation in a video-generative framework. Its core, GE-Base, is a large-scale, instruction-driven video diffusion model capturing spatial, temporal, and semantic dynamics of real-world tasks. GE-Act maps these representations to precise action trajectories, while GE-Sim offers fast, action-conditioned video-based simulation. The EWMBench benchmark evaluates visual realism, physical accuracy, and instruction-action alignment. Trained on over a million episodes, GE generalizes across robots and tasks, enabling scalable, memory-aware, and physically grounded embodied intelligence research. 

GE’s design unfolds in three key parts. GE-Base is a multi-view, instruction-conditioned video diffusion model trained on over 1 million robotic manipulation episodes. It learns latent trajectories that capture how scenes evolve under given commands. Building on that, GE-Act translates these latent video representations into real action signals via a lightweight, flow-matching decoder, offering quick, precise motor control even on robots not in the training data. GE-Sim repurposes GE-Base’s generative power into an action-conditioned neural simulator, enabling closed-loop, video-based rollout at speeds far beyond real hardware. The EWMBench suite then evaluates the system holistically across video realism, physical consistency, and alignment between instructions and resulting actions.

In evaluations, Genie Envisioner showed strong real-world and simulated performance across varied robotic manipulation tasks. GE-Act achieved rapid control generation (54-step trajectories in 200 ms) and consistently outperformed leading vision-language-action baselines in both step-wise and end-to-end success rates. It adapted to new robot types, like Agilex Cobot Magic and Dual Franka, with only an hour of task-specific data, excelling in complex deformable object tasks. GE-Sim delivered high-fidelity, action-conditioned video simulations for scalable, closed-loop policy testing. The EWMBench benchmark confirmed GE-Base’s superior temporal alignment, motion consistency, and scene stability over state-of-the-art video models, aligning closely with human quality judgments. 

In conclusion, Genie Envisioner is a unified, scalable platform for dual-arm robotic manipulation that merges policy learning, simulation, and evaluation into one video-generative framework. Its core, GE-Base, is an instruction-guided video diffusion model capturing the spatial, temporal, and semantic patterns of real-world robot interactions. GE-Act builds on this by converting these representations into precise, adaptable action plans, even on new robot types with minimal retraining. GE-Sim offers high-fidelity, action-conditioned simulation for closed-loop policy refinement, while EWMBench provides rigorous evaluation of realism, alignment, and consistency. Extensive real-world tests highlight the system’s superior performance, making it a strong foundation for general-purpose, instruction-driven embodied intelligence. 

Check out the Paper and GitHub Page.

Demystifying Amazon Bedrock Pricing for a Chatbot Assistant

“How much will it cost to run our chatbot on Amazon Bedrock?” This is one of the most frequent questions we hear from customers exploring AI solutions. And it’s no wonder — calculating costs for AI applications can feel like navigating a complex maze of tokens, embeddings, and various pricing models. Whether you’re a solution architect, technical leader, or business decision-maker, understanding these costs is crucial for project planning and budgeting. In this post, we’ll look at Amazon Bedrock pricing through the lens of a practical, real-world example: building a customer service chatbot. We’ll break down the essential cost components, walk through capacity planning for a mid-sized call center implementation, and provide detailed pricing calculations across different foundation models. By the end of this post, you’ll have a clear framework for estimating your own Amazon Bedrock implementation costs and understanding the key factors that influence them.
For those that aren’t familiar, Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Amazon Bedrock provides a comprehensive toolkit for powering AI applications, including pre-trained large language models (LLMs), Retrieval Augmented Generation (RAG) capabilities, and seamless integration with existing knowledge bases. This powerful combination enables the creation of chatbots that can understand and respond to customer queries with high accuracy and contextual relevance.
Solution overview
For this example, our Amazon Bedrock chatbot will use a curated set of data sources and use Retrieval-Augmented Generation (RAG) to retrieve relevant information in real time. With RAG, the output from the chatbot is enriched with contextual information from our data sources, giving our users a better customer experience. When estimating Amazon Bedrock pricing, it's important to familiarize yourself with several key terms that significantly influence the expected cost. These components not only form the foundation of how your chatbot functions but also directly impact your pricing calculations. Let's explore these key components.

Key Components

Data Sources – The documents, manuals, FAQs, and other information artifacts that form your chatbot’s knowledge base.
Retrieval-Augmented Generation (RAG) – The process of optimizing the output of a large language model by referencing an authoritative knowledge base outside of its training data sources before generating a response. RAG extends the already powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, without the need to retrain the model. It is a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.
Tokens – A sequence of characters that a model can interpret or predict as a single unit of meaning. For example, with text models, a token could correspond not just to a word, but also to a part of a word with grammatical meaning (such as “-ed”), a punctuation mark (such as “?”), or a common phrase (such as “a lot”). Amazon Bedrock prices are based on the number of input and output tokens processed.
Context Window – The maximum amount of text (measured in tokens) that an LLM can process in one request. This includes both the input text and additional context needed to generate a response. A larger context window allows the model to consider more information when generating responses, enabling more comprehensive and contextually appropriate outputs.
Embeddings – Dense vector representations of text that capture semantic meaning. In a RAG system, embeddings are created for both knowledge base documents and user queries, enabling semantic similarity searches to retrieve the most relevant information from your knowledge base to augment the LLM’s responses.
Vector Store – A vector store contains the embeddings for your data sources and acts as your knowledge base.
Embeddings Model – Embedding models are machine learning models that convert data (text, images, code, etc.) into fixed-size numerical vectors. These vectors capture the semantic meaning of the input in a format that can be used for similarity search, clustering, classification, recommendation systems, and retrieval-augmented generation (RAG).
Large Language Models (LLMs) – Models trained on vast volumes of data that use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. Amazon Bedrock offers a diverse selection of these foundation models (FMs), each with different capabilities and specialized strengths.

The figure below demonstrates the architecture of a fully managed RAG solution on AWS.

Estimating Pricing
One of the most challenging aspects of implementing an AI solution is accurately predicting your capacity needs. Without proper capacity estimation, you might either over-provision (leading to unnecessary costs) or under-provision (resulting in performance issues). Let’s walk through how to approach this crucial planning step for a real-world scenario. Before we dive into the numbers, let’s understand the key factors that affect your capacity and costs:

Embeddings: Vector representations of your text that enable semantic search capabilities. Each document in your knowledge base needs to be converted into embeddings, which impacts both processing costs and storage requirements.
User Queries: The incoming questions or requests from your users. Understanding your expected query volume and complexity is crucial, as each query consumes tokens and requires processing power.
LLM Responses: The AI-generated answers to user queries. The length and complexity of these responses directly affect your token usage and processing costs.
Concurrency: The number of simultaneous users your system needs to handle. Higher concurrency requirements may necessitate additional infrastructure and can affect your choice of pricing model.

To make this concrete, let’s examine a typical call center implementation. Imagine you’re planning to deploy a customer service chatbot for a mid-sized organization handling product inquiries and support requests. Here’s how we’d break down the capacity planning: First, consider your knowledge base. In our scenario, we’re working with 10,000 support documents, each averaging 500 tokens in length. These documents need to be chunked into smaller pieces for effective retrieval, with each document typically splitting into 5 chunks. This gives us a total of 5 million tokens for our knowledge base. For the embedding process, those 10,000 documents will generate approximately 50,000 embeddings when we account for chunking and overlapping content. This is important because embeddings affect both your initial setup costs and ongoing storage needs.
Now, let’s look at the operational requirements. Based on typical call center volumes, we’re planning for:

10,000 customer queries per month
Query lengths varying from 50 to 200 tokens (depending on complexity)
Average response length of 100 tokens per interaction
Peak usage of 100 simultaneous users

When we aggregate these numbers, our monthly capacity requirements shape up to:

5 million tokens for processing our knowledge base
50,000 embeddings for semantic search
500,000 tokens for handling user queries
1 million tokens for generating responses
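As a quick sanity check, the arithmetic behind these capacity numbers can be reproduced in a few lines of Python. Every figure below is one of the illustrative assumptions stated above, not a measured value.

# Illustrative capacity estimate for the example call center chatbot.
DOCS = 10_000                 # support documents in the knowledge base
TOKENS_PER_DOC = 500          # average document length in tokens
CHUNKS_PER_DOC = 5            # chunks created per document for retrieval
QUERIES_PER_MONTH = 10_000    # expected customer queries
AVG_QUERY_TOKENS = 50         # lower end of the 50-200 token query range
AVG_RESPONSE_TOKENS = 100     # average generated answer length

kb_tokens = DOCS * TOKENS_PER_DOC                          # 5,000,000 tokens to embed
embeddings = DOCS * CHUNKS_PER_DOC                         # 50,000 vectors to store
query_tokens = QUERIES_PER_MONTH * AVG_QUERY_TOKENS        # 500,000 input tokens per month
response_tokens = QUERIES_PER_MONTH * AVG_RESPONSE_TOKENS  # 1,000,000 output tokens per month

print(kb_tokens, embeddings, query_tokens, response_tokens)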

Understanding these numbers is crucial because they directly impact your costs in several ways:

Initial setup costs for processing and embedding your knowledge base
Ongoing storage costs for maintaining your vector database and document storage
Monthly processing costs for handling user interactions
Infrastructure costs to support your concurrency requirements

This gives us a solid foundation for our cost calculations, which we’ll explore in detail in the next section.
Calculating total cost of ownership (TCO)
Amazon Bedrock offers flexible pricing modes. With Amazon Bedrock, you are charged for model inference and customization. You have a choice of two pricing plans for inference: On-Demand and Batch, which lets you use FMs on a pay-as-you-go basis without time-based term commitments; and Provisioned Throughput, which lets you provision sufficient throughput to meet your application's performance requirements in exchange for a time-based term commitment.

On-demand – Ideal for infrequent or unpredictable usage
Batch – Designed for processing large volumes of data in a single operation
Provisioned throughput – Tailored for applications with consistent and predictable workloads

To calculate the TCO for this scenario as a one-time cost, we'll consider the foundation model, the volume of data in the knowledge base, the estimated number of queries and responses, and the concurrency level mentioned above. For this scenario, we'll use an on-demand pricing model and show how the pricing works out for some of the foundation models available on Amazon Bedrock.
The on-demand pricing formula is:

Total Cost Incurred = (Input Tokens + Context Size) / 1,000 × Price per 1,000 Input Tokens + Output Tokens / 1,000 × Price per 1,000 Output Tokens + Embeddings Cost

The cost of this setup is the sum of the cost of LLM inferences and the cost of the vector store. To estimate the cost of inferences, you can obtain the number of input tokens, context size, and output tokens from the response metadata returned by the LLM. For input tokens, we add an additional context size of about 150 tokens per user query; with our assumption of 10,000 user queries, the total context size is 1,500,000 tokens.
The following is a comparison of estimated monthly costs for various models on Amazon Bedrock based on our example use case using the on-demand pricing formula:
Embeddings Cost:
For text embeddings on Amazon Bedrock, we can choose from Amazon Titan Embeddings V2 model or Cohere Embeddings Model. In this example we are calculating a one-time cost for the embeddings.

Amazon Titan Text Embeddings V2:

Price per 1,000 input tokens – $0.00002
Cost of Embeddings – (Data Sources + User Queries) * Embeddings cost per 1000 tokens

(5,000,000 + 500,000) tokens / 1,000 × $0.00002 = $0.11

Cohere Embeddings:

Price per 1,000 input tokens – $0.0001
Cost of Embeddings – (5,000,000 + 500,000) tokens / 1,000 × $0.0001 = $0.55

The usual cost of a vector store has two components: the size of the vector data and the number of requests to the store. You can choose whether to let the Amazon Bedrock console set up a vector store in Amazon OpenSearch Serverless for you or to use one that you have created in a supported service and configured with the appropriate fields. If you're using OpenSearch Serverless as part of your setup, you'll need to consider its costs. Pricing details can be found on the OpenSearch Service Pricing page.
Using the on-demand pricing formula, the overall cost is calculated below for several foundation models (FMs) available on Amazon Bedrock, including the embeddings cost.
• Anthropic Claude:

Claude 4 Sonnet: (500,000 + 1,500,000) tokens / 1,000 × $0.003 + 1,000,000 tokens / 1,000 × $0.015 = $21.00; $21.00 + $0.11 = $21.11
Claude 3 Haiku: (500,000 + 1,500,000) tokens / 1,000 × $0.00025 + 1,000,000 tokens / 1,000 × $0.00125 = $1.75; $1.75 + $0.11 = $1.86

• Amazon Nova:

Amazon Nova Pro: (500,000 + 1,500,000) tokens / 1,000 × $0.0008 + 1,000,000 tokens / 1,000 × $0.0032 = $4.80; $4.80 + $0.11 = $4.91
Amazon Nova Lite: (500,000 + 1,500,000) tokens / 1,000 × $0.00006 + 1,000,000 tokens / 1,000 × $0.00024 = $0.36; $0.36 + $0.11 = $0.47

• Meta Llama:

Llama 4 Maverick (17B): (500,000 + 1,500,000) tokens / 1,000 × $0.00024 + 1,000,000 tokens / 1,000 × $0.00097 = $1.45; $1.45 + $0.11 = $1.56
Llama 3.3 Instruct (70B): (500,000 + 1,500,000) tokens / 1,000 × $0.00072 + 1,000,000 tokens / 1,000 × $0.00072 = $2.16; $2.16 + $0.11 = $2.27
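These per-model figures all follow the same on-demand formula, so they can be reproduced with a short helper like the sketch below. The prices are the example per-1,000-token rates quoted above; always confirm current rates on the Amazon Bedrock pricing page before budgeting.

def on_demand_cost(input_tokens, context_tokens, output_tokens,
                   price_in_per_1k, price_out_per_1k, embeddings_cost):
    """Monthly on-demand estimate: inference token costs plus the one-time embeddings cost."""
    inference = ((input_tokens + context_tokens) / 1000) * price_in_per_1k \
                + (output_tokens / 1000) * price_out_per_1k
    return round(inference + embeddings_cost, 2)

# Example rates used in this post (USD per 1,000 input/output tokens); verify before budgeting.
models = {
    "Claude 4 Sonnet":  (0.003,   0.015),
    "Claude 3 Haiku":   (0.00025, 0.00125),
    "Amazon Nova Pro":  (0.0008,  0.0032),
    "Amazon Nova Lite": (0.00006, 0.00024),
}

for name, (p_in, p_out) in models.items():
    cost = on_demand_cost(
        input_tokens=500_000, context_tokens=1_500_000, output_tokens=1_000_000,
        price_in_per_1k=p_in, price_out_per_1k=p_out,
        embeddings_cost=0.11,  # Amazon Titan Text Embeddings V2, one-time
    )
    print(f"{name}: ${cost}")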

Evaluate models not just on their natural language understanding (NLU) and generation (NLG) capabilities, but also on their price-per-token ratios for both input and output processing. Consider whether premium models with higher per-token costs deliver proportional value for your specific use case, or if more cost-effective alternatives like Amazon Nova Lite or Meta Llama models can meet your performance requirements at a fraction of the cost.
Conclusion
Understanding and estimating Amazon Bedrock costs doesn’t have to be overwhelming. As we’ve demonstrated through our customer service chatbot example, breaking down the pricing into its core components – token usage, embeddings, and model selection – makes it manageable and predictable.
Key takeaways for planning your Bedrock implementation costs:

Start with a clear assessment of your knowledge base size and expected query volume
Consider both one-time costs (initial embeddings) and ongoing operational costs
Compare different foundation models based on both performance and pricing
Factor in your concurrency requirements when choosing between on-demand, batch, or provisioned throughput pricing

By following this systematic approach to cost estimation, you can confidently plan your Amazon Bedrock implementation and choose the most cost-effective configuration for your specific use case. Remember that the cheapest option isn’t always the best – consider the balance between cost, performance, and your specific requirements when making your final decision.
Getting Started with Amazon Bedrock
With Amazon Bedrock, you have the flexibility to choose the most suitable model and pricing structure for your use case. We encourage you to explore the AWS Pricing Calculator for more detailed cost estimates based on your specific requirements.
To learn more about building and optimizing chatbots with Amazon Bedrock, check out the workshop Building with Amazon Bedrock.
We’d love to hear about your experiences building chatbots with Amazon Bedrock. Share your success stories or challenges in the comments!

About the authors
Srividhya Pallay is a Solutions Architect II at Amazon Web Services (AWS) based in Seattle, where she supports small and medium-sized businesses (SMBs) and specializes in Generative Artificial Intelligence and Games. Srividhya holds a Bachelor’s degree in Computational Data Science from Michigan State University College of Engineering, with a minor in Computer Science and Entrepreneurship. She holds 6 AWS Certifications.
Prerna Mishra is a Solutions Architect at Amazon Web Services(AWS) supporting Enterprise ISV customers. She specializes in Generative AI and MLOPs as part of Machine Learning and Artificial Intelligence community. She graduated from New York University in 2022 with a Master’s degree in Data Science and Information Systems.
Brian Clark is a Solutions Architect at Amazon Web Services (AWS) supporting Enterprise customers in the financial services vertical. He is a part of the Machine Learning and Artificial Intelligence community and specializes in Generative AI and Agentic workflows. Brian has over 14 years of experience working in technology and holds 8 AWS certifications.

Fine-tune OpenAI GPT-OSS models on Amazon SageMaker AI using Hugging F …

Released on August 5, 2025, OpenAI’s GPT-OSS models, gpt-oss-20b and gpt-oss-120b, are now available on AWS through Amazon SageMaker AI and Amazon Bedrock. These pre-trained, text-only Transformer models are built on a Mixture-of-Experts (MoE) architecture that activates only a subset of parameters per token, delivering high reasoning performance while reducing compute costs. They specialize in coding, scientific analysis, and mathematical reasoning, and support a 128,000 context length, adjustable reasoning levels (low/medium/high), chain-of-thought (CoT) reasoning with audit-friendly traces, structured outputs, and tool use to support agentic-AI workflows. As discussed in OpenAI’s documentation, both models have undergone safety-focused training and adversarial fine-tuning evaluations to assess and strengthen robustness against misuse. The following table summarizes the model specifications.

Model | Layers | Total Parameters | Active Parameters Per Token | Total Experts | Active Experts Per Token | Context Length
openai/gpt-oss-120b | 36 | 117 billion | 5.1 billion | 128 | 4 | 128,000
openai/gpt-oss-20b | 24 | 21 billion | 3.6 billion | 32 | 4 | 128,000

The GPT-OSS models are deployable using Amazon SageMaker JumpStart and also accessible through Amazon Bedrock APIs. Both options provide developers the flexibility to deploy and integrate GPT-OSS models into your production-grade AI workflows. Beyond out-of-the-box deployment, these models can be fine-tuned to align with specific domains and use cases, using open source tools from the Hugging Face ecosystem and running on the fully managed infrastructure of SageMaker AI.
Fine-tuning large language models (LLMs) is the process of adjusting a pre-trained model’s weights using a smaller, task-specific dataset to tailor its behavior to a particular domain or application. Fine-tuning large models like GPT-OSS transforms them from a broad generalist into a domain-specific expert without the cost of training from scratch. Adapting the model to your data and terminology can deliver more accurate, context-aware outputs, improves reliability, and reduces hallucinations. The result is a specialized GPT-OSS that excels at targeted tasks while retaining the scalability, flexibility, and open-weight benefits ideal for secure, enterprise-grade deployment.
In this post, we walk through the process of fine-tuning a GPT-OSS model in a fully managed training environment using SageMaker AI training jobs. The workflow uses the Hugging Face TRL library for fine-tuning, the Hugging Face Accelerate library to simplify distributed training across multiple GPUs and nodes, and the DeepSpeed ZeRO-3 optimization technique to reduce memory usage by partitioning model states across devices for efficient training of billion-parameter models. We then apply this setup to fine-tune the GPT-OSS model on a multilingual reasoning dataset, HuggingFaceH4/Multilingual-Thinking, enabling GPT-OSS to handle structured, CoT reasoning across multiple languages.
Solution overview
SageMaker AI is a managed machine learning (ML) service that streamlines the entire foundation model (FM) lifecycle. It provides hosted, interactive notebooks for rapid exploration, fully managed ephemeral training jobs for large-scale and distributed fine-tuning, and Amazon SageMaker HyperPod clusters that offer granular control over persistent training infrastructure for large-scale model training and fine-tuning workloads. By using managed hosting in SageMaker, you can serve models reliably in production, and the suite of AIOps-ready tools, such as reusable pipelines and fully managed MLflow, support experiment tracking, model registration, and seamless deployment. With built-in governance and enterprise-grade security, SageMaker AI provides data engineers, data scientists, and ML engineers with a unified, fully managed platform to build, train, deploy, and govern FMs end-to-end.
GPT-OSS can be fine-tuned on SageMaker using the latest Hugging Face TRL library, which can be written as recipes for fine-tuning LLMs using Hugging Face SFTTrainer. These recipes can also be adapted to fine-tune other open-weight language or vision models such as Qwen, Mistral, Meta, and many more. In this post, we show how to fine-tune GPT-OSS in a distributed setup either on a single node multi-GPU setup or across multi-node multi-GPU setup, using Hugging Face Accelerate to manage multi-device training and DeepSpeed ZeRO-3 to train large models more efficiently. Together, they help you fine-tune faster and scale to larger datasets.
We also highlight MXFP4 (Microscaling FP4), a 4-bit floating-point quantization format from the Open Compute Project. It groups tensors into small blocks, each sharing a scaling factor, which reduces memory and compute needs while helping preserve model accuracy—making it well-suited for efficient model training. Complementing quantization, we explore Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA, for the adaptation of large models by learning a small set of additional parameters instead of modifying all weights. This approach is memory- and compute-efficient, highly compatible with quantized models, and supports fine-tuning even on constrained hardware environments.
The following diagram illustrates this configuration.

By using MXFP4 quantization, PEFT fine-tuning methods like LoRA, and distributed training with Hugging Face Accelerate and DeepSpeed ZeRO-3 together, we can efficiently and scalably fine-tune large models like gpt-oss-120b and gpt-oss-20b for high-performance customization while keeping infrastructure and compute costs manageable.
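As a concrete illustration of the LoRA approach described above, the following minimal sketch attaches LoRA adapters to gpt-oss-20b with the peft library, mirroring the rank and target-module settings used in the recipe later in this post. It is illustrative only; in the actual workflow, TRL's SFTTrainer applies these settings from the recipe rather than calling peft directly.

# Minimal LoRA sketch with peft (illustrative; the recipe/TRL wire this up in the real training job).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto")

lora_config = LoraConfig(
    r=8,                         # low-rank dimension, as in the recipe
    lora_alpha=16,               # scaling factor
    target_modules="all-linear", # adapt all linear layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable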
Prerequisites
To fine-tune GPT-OSS models on SageMaker AI, you must have the following prerequisites:

An AWS account that will contain your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, see AWS Identity and Access Management for Amazon SageMaker AI.
You can run the notebook provided in this post from your preferred development environment, including integrated development environments (IDEs) such as PyCharm or Visual Studio Code, provided your AWS credentials are properly set up and configured to access your AWS account. To set up your local environment, refer to Configuring settings for the AWS CLI. Optionally, we recommend using Amazon SageMaker Studio for a straightforward development process on SageMaker AI.
If you’re following along with this post, we use the ml.p5en.48xlarge instance for fine-tuning the 120B model and the ml.p4de.24xlarge instance for the 20B model. You will need access to these SageMaker compute instances to run the example notebook presented in this post. If you’re unsure, you can review the AWS service quotas on the AWS Management Console:

Choose Amazon SageMaker as the AWS service under Manage Quotas.
Select ml.p4de.24xlarge for training job usage or ml.p5en.48xlarge for training job usage based on the model you’re interested in fine-tuning and request an increase at account level.

Access to the GitHub repo.

Business outcomes for fine-tuning GPT-OSS
Global enterprises increasingly need AI tools that support complex reasoning across multiple languages—whether for multilingual virtual assistants, cross-location support desks, or international knowledge systems. Although FMs offer a powerful starting point, their effectiveness in diverse linguistic contexts hinges on structured reasoning inputs—datasets that surface logic steps explicitly and across languages. That’s why testing with a multilingual, CoT-style dataset is a valuable first step. It lets you verify how well a model holds reasoning coherence when switching between languages and reasoning patterns, laying a robust foundation before scaling to larger, domain-specific multilingual datasets. GPT-OSS is particularly well-suited for this task, with its native CoT capabilities, long 128,000 context window, and adjustable reasoning levels, making it ideal for evaluating and refining multilingual reasoning performance before production deployment.
Fine-tune GPT-OSS models for multi-lingual reasoning on SageMaker AI
In this section, we walk through how to fine-tune OpenAI’s GPT-OSS models on SageMaker AI using training jobs. SageMaker training jobs support distributed multi-GPU and multi-node configurations, so you can spin up high-performance clusters on demand, train billion-parameter models faster, and automatically shut down resources when the job finishes.
Set up your environment
In the following sections, we run the code from SageMaker Studio JupyterLab notebook instances. You can also use your preferred IDE, such as VS Code or PyCharm, but make sure your local environment is configured to work with AWS, as discussed in the prerequisites.
Complete the following steps:

On the SageMaker AI console, choose Domains in the navigation pane, then open your domain.
In the navigation pane under Applications and IDEs, choose Studio.
On the User profiles tab, locate your user profile, then choose Launch and Studio.

In SageMaker Studio, launch an ml.t3.medium JupyterLab notebook instance with at least 50 GB of storage.

A large notebook instance isn’t required, because the fine-tuning job will run on a separate ephemeral training job instance with NVIDIA accelerators.

To begin fine-tuning, start by cloning the GitHub repo and navigating to the 3_distributed_training/models/openai--gpt-oss directory, then launch the finetune_gpt_oss.ipynb notebook with a Python 3.12 or higher kernel:

# clone github repo
git clone https://github.com/aws-samples/amazon-sagemaker-generativeai.git

Dataset for fine-tuning
Selecting and curating the right dataset is a critical first step in fine-tuning any LLM. In this post, we use the HuggingFaceH4/Multilingual-Thinking dataset, which is a multilingual reasoning dataset containing CoT examples translated into languages such as French, Spanish, and German. Its combination of diverse languages, varied reasoning tasks, and explicit step-by-step thought processes makes it well-suited for evaluating how a model handles structured reasoning, adapts to multilingual inputs, and maintains logical consistency across different linguistic contexts. With around 1,000 examples, it's small enough for quick experimentation yet sufficient to demonstrate fine-tuning and evaluation of large pre-trained models like GPT-OSS. The dataset can be loaded in just a few lines of code using the Hugging Face Datasets library:

# load datasets in memory
from datasets import load_dataset

dataset_name = 'HuggingFaceH4/Multilingual-Thinking'
dataset = load_dataset(dataset_name, split="train")

The following code is some sample data:

{
  "reasoning_language": "French",
  "developer": "You are a recipe suggestion bot, …",
  "user": "Can you provide me with a step-by-step …",
  "analysis": "D'accord, l'utilisateur souhaite une recette …",
  "final": "Certainly! Here's a classic homemade chocolate …",
  "messages": [
    {
      "content": "reasoning language: French\n\nYou are a …",
      "role": "system",
      "thinking": null
    },
    {
      "content": "Can you provide me with a step-by-step …",
      "role": "user",
      "thinking": null
    },
    {
      "content": "Certainly! Here's a classic homemade chocolate …",
      "role": "assistant",
      "thinking": "D'accord, l'utilisateur souhaite une recette …"
    }
  ]
}

For supervised fine-tuning, we use only the data in the messages key to train our GPT-OSS model. Because TRL’s SFTTrainer natively supports this format, it can be used as-is. We extract all rows containing only the messages key, save them in JSONL format, and upload the file to Amazon Simple Storage Service (Amazon S3). This makes sure the dataset is readily accessible to SageMaker training jobs at runtime.

# preserve only messages key
dataset = dataset.remove_columns(
    [col for col in dataset.column_names if col != "messages"]
)
# save as JSONL format
dataset_filename = os.path.join(dataset_parent_path, f"{dataset_name.replace('/', '--').replace('.', '-')}.jsonl")
dataset.to_json(dataset_filename, lines=True)

from sagemaker.s3 import S3Uploader

# select a data destination bucket
data_s3_uri = f"s3://{sess.default_bucket()}/dataset"

# upload to S3
uploaded_s3_uri = S3Uploader.upload(
    local_path=dataset_filename,
    desired_s3_uri=data_s3_uri
)
print(f"Uploaded {dataset_filename} to > {uploaded_s3_uri}")

Experimentation tracking with MLflow (Optional)
SageMaker AI offers the fully managed MLflow capability, so you can track multiple training runs within experiments, compare results with visualizations, evaluate models, and register the best ones in the model registry. MLflow also supports integration with agentic workflows.
TRL's SFTTrainer natively integrates with experimentation tracking tools such as MLflow, TensorBoard, Weights & Biases, and more. With SFTTrainer, you can log training parameters, hyperparameters, loss metrics, system metrics, and more to a centralized location, providing you with audit trails, governance, and streamlined experiment tracking. This step is optional; if you choose not to use SageMaker managed MLflow, you can set the SFTTrainer report_to parameter to tensorboard, which will log all metrics locally to disk for visualization using a local or remote TensorBoard service.

# set none to log to local disk
MLFLOW_TRACKING_SERVER_ARN = None  # or "arn:aws:sagemaker:us-west-2:<account-id>:mlflow-tracking-server/<server-name>"

if MLFLOW_TRACKING_SERVER_ARN:
    reports_to = "mlflow"
else:
    reports_to = "tensorboard"
print("reports to:", reports_to)

Experiments logged from TRL’s SFTTrainer to an MLflow tracking server in SageMaker automatically capture key metrics and parameters. The SageMaker managed MLflow service renders real-time visualizations, profiles training hardware with minimal setup, enables side-by-side run comparisons, and provides built-in evaluation tools to track, train, and assess your fine-tuning jobs end-to-end.

Fine-tune GPT-OSS on training jobs
The following example demonstrates how to fine-tune the gpt-oss-20b model. To switch to gpt-oss-120b, simply update the model_name. The model-to-instance mapping shown in this section has been tested as part of this notebook workflow. You can adjust the instance type and instance count to fit your specific use case.
The following table summarizes the different model specifications.

GPT-OSS Model | SageMaker Instance | GPU Specifications
openai/gpt-oss-120b | ml.p5en.48xlarge | 8× NVIDIA H200 GPUs, 141 GB HBM3e each
openai/gpt-oss-20b | ml.p4de.24xlarge | 8× NVIDIA A100 GPUs, 80 GB HBM2e each

# User-defined variables
model_name = "openai/gpt-oss-20b"
tokenizer_name = "openai/gpt-oss-20b"

# dataset path inside a sagemaker container
dataset_path = "/opt/ml/input/data/training/HuggingFaceH4--Multilingual-Thinking.jsonl"
output_path = "/opt/ml/model/openai-gpt-oss-20b-HuggingFaceH4-Multilingual-Thinking/"

# support only for Ampere, Hopper and Grace Blackwell
bf16_flag = "true"

SageMaker training jobs automatically download datasets from the specified S3 prefix or file into the training container, mapping them to /opt/ml/input. Training artifacts and logs are stored in /opt/ml/output, and the final trained or fine-tuned model is saved to /opt/ml/model. Saving the model to this path allows SageMaker to automatically detect it for downstream workflows such as model registration, deployment, and other automation. You can set or unset the bf16_flag to choose between float16 and bfloat16. float16 uses less memory but has a smaller numeric range, whereas bfloat16 provides a wider range with similar memory savings, making it more stable for training large models. bfloat16 is supported on newer GPU architectures such as NVIDIA Ampere, Hopper, and Grace Blackwell.
Fine-tuning with open source Hugging Face recipes
With Hugging Face’s TRL library, you can define Supervised Fine-Tuning (SFT) recipes, which are essentially preconfigured training workflows that streamline fine-tuning FMs like Meta, Qwen, Mistral, and now OpenAI GPT‑OSS with minimal setup. These recipes simplify the process of adapting models to new datasets using TRL’s SFTTrainer and configuration tools.

yaml_template = """# Model arguments
model_name_or_path: {{ model_name }}
tokenizer_name_or_path: {{ tokenizer_name }}
model_revision: main
torch_dtype: bfloat16
attn_implementation: kernels-community/vllm-flash-attn3
bf16: {{ bf16_flag }}
tf32: false
output_dir: {{ output_dir }}

# Dataset arguments
dataset_id_or_path: {{ dataset_path }}
max_seq_length: 2048
packing: true
packing_strategy: wrapped

# LoRA arguments
use_peft: true
lora_target_modules: "all-linear"
### Specific to GPT-OSS
lora_modules_to_save: ["7.mlp.experts.gate_up_proj", "7.mlp.experts.down_proj", "15.mlp.experts.gate_up_proj", "15.mlp.experts.down_proj", "23.mlp.experts.gate_up_proj", "23.mlp.experts.down_proj"]
lora_r: 8
lora_alpha: 16

# Training arguments
num_train_epochs: 1.
per_device_train_batch_size: 6
per_device_eval_batch_size: 6
gradient_accumulation_steps: 3
gradient_checkpointing: true
optim: adamw_torch_fused
gradient_checkpointing_kwargs:
  use_reentrant: true
learning_rate: 1.0e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1
max_grad_norm: 0.3
bf16: {{ bf16_flag }}
bf16_full_eval: {{ bf16_flag }}
tf32: false

# Logging arguments
logging_strategy: steps
logging_steps: 2
report_to:
  - {{ reports_to }}
save_strategy: "epoch"
seed: 42
"""

config_filename = "openai-gpt-oss-20b-qlora.yaml"

The recipe.yaml file contains the following key parameters:

Model arguments:

model_name_or_path or tokenizer_name_or_path – Path or identifier for the base model and tokenizer to fine-tune. Models can be loaded locally from disk or the Hugging Face Hub.
torch_dtype – Sets training precision. bfloat16 offers float16-level memory savings with a wider numeric range for better stability, and is supported on NVIDIA Ampere, Hopper, and Grace Blackwell GPUs. Alternatively, set to float16 for older versions of NVIDIA GPUs.
attn_implementation – Uses vLLM FlashAttention 3 (kernels-community/vllm-flash-attn3) kernels for faster attention computation, supported for newer Hopper GPUs. Alternatively, set eager for older NVIDIA GPUs.

Dataset arguments:

dataset_id_or_path – Local dataset location as JSONL file or Hugging Face Hub ID for the dataset.
max_seq_length – Maximum token length per sequence (for example, 2048). Provide longer sequence lengths for datasets that require longer reasoning output tokens. Longer sequence lengths consume more GPU memory.

LoRA arguments:

use_peft – Enables PEFT using LoRA. Set to true for PEFT or false for full fine-tuning.
lora_target_modules – Target layers for LoRA adaptation (for example, all-linear, which is the default for most dense and MoE models).
lora_modules_to_save – GPT-OSS-specific layers to keep in full precision during LoRA training.
lora_r or lora_alpha – Rank and scaling factor for LoRA updates.

Logging and saving arguments:

report_to – Experiment tracking integration (such as MLflow or TensorBoard).

After a recipe is defined and tested, you can seamlessly swap configurations such as the model name, dataset, number of epochs, or PEFT settings and run or rerun the fine-tuning workflow with minimal or no code changes.
SageMaker estimators
As a next step, we use a SageMaker training job estimator to spin up a training cluster and run the model fine-tuning. The SageMaker AI estimators API provide a high-level API to define and run training jobs on fully managed infrastructure, handling environment setup, scaling, and artifact management. You can specify training scripts, input data, and compute resources without manually provisioning servers. SageMaker also offers prebuilt Hugging Face and PyTorch estimators, which come optimized for their respective frameworks, making it straightforward to train and fine-tune models with minimal setup.
It's recommended to use Python 3.12 or higher to fine-tune GPT-OSS with the following packages installed. Add or update the requirements.txt file in your script's root directory with the following packages. SageMaker estimators will automatically detect this file and install the listed dependencies at runtime.

%%writefile code/requirements.txt
transformers>=4.55.0
kernels>=0.9.0
datasets==4.0.0
bitsandbytes==0.46.1
trl>=0.20.0
peft>=0.17.0
lighteval==0.10.0
hf-transfer==0.1.8
hf_xet
tensorboard
liger-kernel==0.6.1
deepspeed==0.17.4
lm-eval[api]==0.4.9
Pillow
mlflow
sagemaker-mlflow==0.1.0
triton
git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Define a SageMaker estimator and point it to your local training script directory. SageMaker will package the contents and place them in /opt/ml/code inside the training container. This includes your training script, additional modules in the directory, and if a requirements.txt file is present, SageMaker will automatically install the listed packages at runtime.

pytorch_estimator = PyTorch(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker",
    entry_point="accelerate_sagemaker_train.sh", # Adapted bash script to train using accelerate on SageMaker - Multi-GPU
    source_dir="code",
    instance_type=training_instance_type,
    instance_count=1, # multi-node training support
    base_job_name=f"{job_name}-pytorch",
    role=role,
    …
    hyperparameters={
        "num_process": NUM_GPUS, # define the number of GPUs to run distributed training, per instance
        "config": f"recipes/{config_filename}",
    }
)

The following is the directory structure for fine-tuning GPT-OSS on SageMaker AI training jobs:

code/
├── accelerate/                       # Accelerate configuration files
├── accelerate_sagemaker_train.sh      # Launch script for distributed training with Accelerate on SageMaker training jobs
├── gpt_oss_sft.py                     # Main training script for supervised fine-tuning (SFT) of GPT-OSS
├── recipes/                           # Predefined training configuration recipes (YAML)
└── requirements.txt                   # Python dependencies installed at runtime

To fine-tune across multiple GPUs, we use Hugging Face Accelerate and DeepSpeed ZeRO-3, which work together to train large models more efficiently. Hugging Face Accelerate simplifies launching distributed training by automatically handling device placement, process management, and mixed precision settings. DeepSpeed ZeRO-3 reduces memory usage by partitioning optimizer states, gradients, and parameters across devices—allowing billion-parameter models to fit and train faster.
You can run your SFTTrainer script with Hugging Face Accelerate using a simple command like the following:

accelerate launch \
    --config_file accelerate/zero3.yaml \
    --num_processes 8 gpt_oss_sft.py \
    --config recipes/openai-gpt-oss-20b-qlora.yaml

SageMaker executes this command inside the training container because we set entry_point="accelerate_sagemaker_train.sh" when initializing the SageMaker estimator. The accelerate_sagemaker_train.sh script is defined as follows:

#!/bin/bash
set -e

# Launch fine-tuning with Accelerate + DeepSpeed (ZeRO-3)
accelerate launch \
  --config_file accelerate/zero3.yaml \
  --num_processes "$NUM_GPUS" \
  gpt_oss_sft.py \
  --config "$CONFIG_PATH"

PEFT vs. full fine-tuning
The gpt_oss_sft.py script lets you choose between PEFT and full fine-tuning by setting use_peft to true or false. Full fine-tuning gives you greater control over the base model weights, enabling broader adaptability and expressiveness. However, it also carries the risk of catastrophic forgetting and higher resource consumption during the training process.
At the end of training, you will have the fully adapted model weights, which can be deployed to a SageMaker endpoint for inference. You can then run predictions against the deployed endpoint using the SageMaker Predictor.
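As a sketch of that final step, invoking a deployed endpoint with the SageMaker Predictor looks roughly like the following. The endpoint name and payload shape are placeholders, not values from this post; adjust them to match your deployment and the serving container you choose.

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Hypothetical endpoint name; use the one created when you deploy your fine-tuned artifacts.
predictor = Predictor(
    endpoint_name="gpt-oss-20b-multilingual-sft",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

response = predictor.predict({
    "inputs": "Résous 3x + 5 = 20 et explique ton raisonnement étape par étape.",
    "parameters": {"max_new_tokens": 512},
})
print(response)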
Conclusion
In this post, we demonstrated how to fine-tune OpenAI’s GPT-OSS models (gpt-oss-120b and gpt-oss-20b) on SageMaker AI using SageMaker training jobs, the Hugging Face TRL library, and distributed training with Hugging Face Accelerate and DeepSpeed ZeRO-3. By combining the fully managed, ephemeral infrastructure of SageMaker with TRL’s streamlined fine-tuning recipes, you can adapt GPT-OSS to your domain quickly and efficiently, using either PEFT for cost-effective customization or full fine-tuning for maximum model control. With the resulting model artifacts, you can deploy to SageMaker endpoints for secure, scalable inference and bring advanced reasoning capabilities directly into your enterprise workflows.
If you’re interested in exploring further, the GitHub repo contains all the resources used in this walkthrough. It’s a great starting point for experimenting with fine-tuning GPT-OSS on your own datasets and deploying the resulting models to SageMaker for real-world applications. You can get set up with a notebook in minutes using the SageMaker Studio domain quick setup and start experimenting right away.

About the authors
Pranav Murthy is a Senior Generative AI Data Scientist at AWS, specializing in helping organizations innovate with Generative AI, Deep Learning, and Machine Learning on Amazon SageMaker AI. Over the past 10+ years, he has developed and scaled advanced computer vision (CV) and natural language processing (NLP) models to tackle high-impact problems—from optimizing global supply chains to enabling real-time video analytics and multilingual search. When he’s not building AI solutions, Pranav enjoys playing strategic games like chess, traveling to discover new cultures, and mentoring aspiring AI practitioners. You can find Pranav on LinkedIn.
Sumedha Swamy is a Senior Manager of Product Management at Amazon Web Services (AWS), where he leads several areas of Amazon SageMaker, including SageMaker Studio (the industry-leading integrated development environment for machine learning), developer and administrator experiences, AI infrastructure, and the SageMaker SDK.

AI-Driven Antitrust and Competition Law: Algorithmic Collusion, Self-Learning Pricing Tools, and Legal Challenges in the US and EU

AI in Market Economics and Pricing Algorithms

AI-driven pricing models, particularly those utilizing reinforcement learning (RL), can lead to outcomes resembling traditional collusion, fundamentally altering market dynamics. Unlike human-set strategies in oligopoly models, AI agents, such as Q-learning agents, autonomously learn pricing strategies from data, often resulting in supra-competitive pricing because they can detect rivals’ actions and adjust in real time. Such algorithms can mimic tacit collusion without direct coordination, often producing more stable, high-price outcomes than human actors could.

However, skepticism persists. In complex, noisy markets, economists argue that independent AI agents may struggle to form stable collusive strategies unless there’s direct coordination, like shared data. When AI-based coordination occurs via shared pricing data, it could violate antitrust laws. Algorithms often use large datasets to adjust pricing, and when non-public data is shared, it can subtly coordinate behavior.

One of the primary issues with AI-based pricing is its opacity—many deep learning models are black boxes, making it difficult for regulators to discern whether pricing outcomes are due to collusion or legitimate optimization. This complexity, combined with feedback loops between agents, complicates the identification of collusive behavior.

Antitrust Law Perspectives:

U.S. Law: Under the Sherman Act, price-fixing or conspiracies to restrain trade are prohibited. Courts require direct evidence of coordination, but using algorithms to coordinate pricing can still be seen as a violation if it results in cartel-like behavior.

EU Law: The EU’s competition law also prohibits anti-competitive agreements or practices under Articles 101 and 102 of the TFEU. If algorithms signal or align pricing systematically, it may be considered a concerted practice, akin to tacit collusion.

UK Law: Post-Brexit, the UK mirrors EU law and applies strict antitrust standards to algorithmic collusion. Algorithmic pricing without explicit coordination could still violate competition law.

Forms of Algorithmic Collusion:

Explicit Cartels: Algorithms intentionally coordinate prices, as seen in the Topkins case.

Tacit Learning Collusion: Independent AI agents autonomously settle on collusive pricing through self-learning, without direct communication.

Hub-and-Spoke Collusion: A third-party vendor’s software aggregates data from multiple firms to align pricing, leading to indirect coordination.

Algorithmic Signaling: Algorithms may deduce rivals’ pricing from publicly available data and adjust accordingly, resulting in coordinated pricing patterns.

Legal Frameworks:

Predictable Agent Model: Firms are responsible for algorithmic behavior if they can predict and control pricing outcomes.

Digital Eye Model: If algorithms are highly autonomous and opaque, determining firm responsibility becomes more complex. The EU’s draft AI Act addresses these concerns by ensuring firms can detect and intervene in anticompetitive effects.

Graphical and Mathematical Models: Multi-agent reinforcement learning (MARL) underpins algorithmic collusion, where agents optimize long-term profits through repeated interactions. Whether tacit collusion occurs depends on the algorithm’s design and the market’s complexity.
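
To illustrate the mechanism the economists describe, the toy simulation below sketches two independent, memoryless Q-learning price setters in a repeated duopoly. The payoffs and parameters are assumptions chosen only to show the incentive structure, not values from any cited study; notably, this memoryless version converges to competitive pricing, which is exactly why the ability to observe rivals and condition on price history is what makes tacit algorithmic collusion possible.

import random

# Toy repeated-price game: two memoryless Q-learners choose a low or high price each round.
PRICES = ["low", "high"]

def profit(own, rival):
    if own == "high" and rival == "high":
        return 10   # jointly profitable "collusive" outcome
    if own == "low" and rival == "high":
        return 12   # undercutting the rival pays off once
    if own == "high" and rival == "low":
        return 2    # being undercut is costly
    return 5        # competitive outcome

q = [{p: 0.0 for p in PRICES} for _ in range(2)]  # one value table per agent
alpha, epsilon = 0.1, 0.1

for _ in range(20000):
    acts = [max(q[i], key=q[i].get) if random.random() > epsilon else random.choice(PRICES)
            for i in range(2)]
    for i in range(2):
        reward = profit(acts[i], acts[1 - i])
        q[i][acts[i]] += alpha * (reward - q[i][acts[i]])  # simple bandit-style update

print(q)  # without memory of rivals' past prices both agents learn to prefer "low";
          # the collusive outcomes in the literature require agents that condition on price history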

Legal Challenges in Detecting and Prosecuting AI-Facilitated Collusion

Agreement and Intent: U.S. antitrust law under Section 1 requires proof of an intentional, concerted agreement. However, when AI agents independently learn from market conditions, no explicit agreement or human coordination may exist. In cases like Topkins, where direct communication occurred, collusion was clear. For AI-driven collusion, courts must determine if firms “implicitly agreed” through their algorithms, possibly using agency doctrines. If AI autonomously leads to collusion, it could be seen as the firm’s decision, as the company “knew” the likely outcomes.

Meeting of Minds for Non-humans: Traditional antitrust requires human agreement (e.g., U.S. Interstate Circuit case), but with AI, it’s unclear if an algorithm can “understand” collusion. Courts may adapt this doctrine: if firms independently use the same algorithm, could it imply collusion? In Duffy v. Yardi, the court found that landlords using the same AI tool for pricing could form a conspiracy, even without direct communication.

Mens Rea and Corporate Liability: AI lacks criminal intent, but liability can be ascribed to firms or human agents. Courts may treat AI behavior as the firm’s action, inferring liability if companies knew or should have known what their algorithm would do. This could be framed as “willful blindness” or responsibility for AI decisions under the doctrine of respondeat superior (liability for employees’ actions).

Evidence and Proof: Detecting algorithmic collusion is difficult due to the lack of traditional evidence like emails or meetings. Investigators might reverse-engineer algorithms or subpoena training data. In cases like RealPage, circumstantial evidence like user-interface design and marketing materials helped show intent. Data science tools may also be used to spot collusive price patterns, though distinguishing natural market behavior from coordinated action remains a challenge.

Per Se vs Rule-of-Reason Analysis: Should algorithmic pricing be automatically deemed illegal (per se)? Some courts apply per se rules to traditional cartels, but with AI, there’s uncertainty. In RealPage and Yardi, courts debated whether novelty of AI should prevent per se treatment, with some preferring a rule-of-reason analysis to assess the competitive effects. In Europe, the focus is on whether AI-facilitated pricing constitutes an “agreement” or “concerted practice,” with no need for criminal intent under Article 101 of the TFEU.

Regulatory Uncertainty and Enforcement Limits: Both U.S. and EU regulators face challenges in monitoring AI-driven markets, especially in detecting tacit collusion. While studies on dynamic pricing and AI’s impact are ongoing, formal enforcement often starts only after significant evidence emerges. The tension between preventing collusion and avoiding stifling innovation is a key issue. Authorities must apply traditional antitrust doctrines creatively, ensuring that AI’s competitive effects are captured without overextending rules that could limit beneficial AI use.

In conclusion, detecting and prosecuting AI-facilitated collusion requires adapting traditional antitrust frameworks to address the complexities of AI. Challenges include proving intent, adapting “meeting of minds” concepts, and handling opaque AI logic, with regulators increasingly turning to hybrid approaches to prove collusion in algorithmic contexts.

Enforcement and Legislative Responses to Algorithmic Collusion

Case Enforcement (U.S.):

Topkins (2015): The first criminal case involving algorithmic price-fixing; an executive instructed his company’s algorithm to set specific prices, and the conduct was recognized as an antitrust violation because of the direct human coordination involved.

RealPage (2024): DOJ filed a case against RealPage’s RENTmaximizer for enabling price-fixing in rental housing. Landlords using the software aligned rents, violating Sherman Act Sections 1 (price-fixing) and 2 (monopolization). A private class action and state lawsuits followed.

Duffy v. Yardi (2024): Tenants sued apartment complexes and Yardi for using RENTmaximizer to fix rents. The court found the use of the algorithm could be seen as per se illegal price-fixing due to mutual understanding among participants.

Caution in Courts: Some courts have been cautious, noting that per se illegality may not always apply to algorithmic collusion. For instance, in RealPage, a judge suggested that a reasoned analysis of competitive impact may be more appropriate.

Regulatory Guidance and Private Enforcement (EU/UK):

EU: The European Commission has yet to bring a confirmed case but has expressed concern over algorithmic collusion. Its 2023 Horizontal Guidelines warn that AI-driven tacit collusion may be treated as a concerted practice under Article 101.

UK: The CMA has warned businesses about algorithmic pricing risks. It penalized Amazon resellers for using software to coordinate prices, treating algorithmic price coordination as illegal. The CMA continues to issue guidance on avoiding price-fixing via software.

Legislative Efforts (U.S. and States):

PAC Act (2025): The U.S. Preventing Algorithmic Collusion Act would presume that exchanging sensitive information via pricing algorithms constitutes an agreement under the Sherman Act. It would also require disclosure of algorithmic use and allow for audits of algorithmic pricing practices.

California Legislation (2025): California’s SB295 would criminalize the use of pricing algorithms trained on non-public competitor data to coordinate prices. Violations would carry penalties and treble damages. Critics argue this may stifle innovation, but supporters argue it addresses specific misuse.

Proposed Reforms (EU and Others):

EU AI Act: If passed, the AI Act would impose transparency and record-keeping requirements for high-risk AI systems, potentially covering pricing algorithms. The idea is to ensure algorithmic accountability and transparency.

Global Coordination: The OECD recommends re-examining the concept of agreement in the context of algorithmic collusion. Agencies globally are exploring the regulation of algorithmic coordination with research and policy roundtables.

Industry and Compliance Responses:

Firms are adopting a multidisciplinary approach to compliance, combining legal, data science, and engineering teams to audit algorithms and perform impact assessments. Automated tools are being piloted by regulators to detect suspicious pricing patterns.

Global Jurisdictions:

Canada: The Competition Bureau is consulting on algorithmic pricing, emphasizing the need for updated laws to address AI-driven collusion.

Australia: The ACCC has issued guidance on dynamic pricing but hasn’t prosecuted algorithmic collusion yet.

Japan and China: Both have issued guidelines and concerns about AI-driven collusion and are focusing on regulating algorithmic coordination.

In conclusion, U.S. authorities are actively pursuing algorithmic collusion cases (e.g., Topkins, RealPage), while EU/UK regulators are emphasizing that traditional competition laws apply to algorithmic schemes. Legislative efforts like the PAC Act and California’s SB295 aim to adapt antitrust laws to the digital age. Globally, there is a growing consensus on the need for enhanced scrutiny and international cooperation in addressing algorithmic collusion.

Proposed Reforms and Forward-Looking Frameworks for AI-Driven Collusion

Given the complexity of AI-driven collusion, various proposals aim to adapt antitrust law and policy:

Revisiting the Agreement Requirement: Some scholars propose modifying the law to treat certain algorithmic behaviors as inherently collusive. A legislative example, like the PAC Act’s presumption, could treat using competitor-trained algorithms as an agreement. Proposals suggest that coordinated algorithmic outcomes (identified through data analysis) should be presumed illegal unless firms prove independent justifications.

Algorithmic Transparency and Auditing: Transparency is a key theme, requiring firms to disclose and allow scrutiny of their pricing algorithms. The EU AI Act’s “data governance” provisions would mandate transparency in training data and decision logic. Proposals suggest regulators should be able to demand algorithmic logs during investigations and consider data access during mergers that might enable algorithmic collusion.

Enhanced Competition Compliance: Extending compliance programs to algorithm design is suggested. Firms could be required to certify that AI pricing systems incorporate antitrust safeguards, such as avoiding competitors’ private data. The idea of “compliance by design” (advocated by Commissioner Vestager) would require firms to demonstrate that algorithms don’t have collusive features.

Structural Remedies and Merger Review: Proposals call for scrutiny of mergers involving data or technology sharing that could enable algorithmic coordination. Mergers where one firm acquires another for access to pricing data or machine-learning models could be challenged on collusion grounds. This approach treats algorithms and data as part of market structure, but regulators caution that blocking mergers alone may not suffice if algorithmic collusion spreads.

Global Cooperation and Standards: International cooperation is essential, given the borderless nature of digital markets. The 2025 OECD report advocates for sharing insights on detecting algorithmic collusion and potentially harmonizing evidentiary standards across jurisdictions. Proposals suggest a “digital chapter” in competition law and even an international convention on algorithmic competition fairness to avoid divergent standards.

Adaptive Enforcement Tools: Enforcement agencies are exploring new techniques. Some are experimenting with economic detection algorithms to scan price data for collusion patterns, known as “computational antitrust.” Others suggest setting up specialized data science units (e.g., the DOJ’s Technology and Financial Investigations Unit) to audit algorithms. Joint research projects between DG COMP and AI experts in the EU may help develop methodologies for evaluating algorithmic markets.

Using Existing Tools: While these reforms are discussed, agencies emphasize using existing antitrust tools creatively. Complex economic effects, like in hub-and-spoke or parallel pricing cases, have been tackled before, and algorithmic collusion could similarly be addressed under current doctrines with innovative evidence.

References

Calvano, E., Calzolari, G., Denicolò, V., & Pastorello, S. (2020). Artificial Intelligence, Algorithmic Pricing, and Collusion. American Economic Review, 110(10): 3267–3297.

Competition and Markets Authority (UK). Online sales of posters and frames (Case CE/98023). CMA Infringement Decision (August 2016).

Competition and Markets Authority (UK). “Pricing algorithms and competition law: what you need to know.” CMA Blog (Nov. 2024).

European Commission. Guidelines on the application of Article 101 TFEU (2023), para. 379 (“collusion by code”).

Giacalone, M. (2024). “Algorithmic Collusion: Corporate Accountability and the Application of Art. 101 TFEU,” European Papers: Insight 9(3), pp. 1048–1061.

OECD (2017). Algorithms and Collusion: Competition Policy in the Digital Age. OECD Publishing, Paris.

United States v. Topkins, No. 15-cr-00201 (N.D. Cal. Apr. 6, 2015).

United States v. RealPage, Inc., Case No. 1:24-cv-00710-WLO-JLW (M.D.N.C. 2024). DOJ Complaint (Aug. 23, 2024).

Duffy v. Yardi Systems, Inc., 64 F.4th 326 (9th Cir. 2023) (trial court ruling).

Calzolari, G. et al. (2020). American Economic Review (as above).

Klein, T. (2020). Autonomous Algorithmic Collusion: Q-Learning Under Sequential Pricing. (Am. Econ. Review Working Paper).

Lepore, N. (2021). AI Pricing Collusion: Multi-Agent RL in Bertrand Competition. (Senior Thesis, Harvard College).

DOJ Press Release, “Justice Department Sues RealPage for Algorithmic Pricing Scheme” (Aug. 23, 2024).

Wick, R.F. & Kalema, W.E. (2025). “Mandatory vs. Suggested Pricing: Algorithmic Price Setting and the Sherman Act.” Cohen & Gresser Client Advisory (Feb. 11, 2025).

Morgan Lewis (2024). “US District Court Denies Motion to Dismiss Algorithmic Pricing Antitrust Claims” (Dec. 2024).

Competition Bureau Canada (2025). Algorithmic pricing and competition: Discussion paper (June 10, 2025).

Additional sources include legal commentaries, law review essays, and press coverage as cited in the body.

The post AI-Driven Antitrust and Competition Law: Algorithmic Collusion, Self-Learning Pricing Tools, and Legal Challenges in the US and EU appeared first on MarkTechPost.

Using RouteLLM to Optimize LLM Usage

RouteLLM is a flexible framework for serving and evaluating LLM routers, designed to maximize performance while minimizing cost.

Key features:

Seamless integration — Acts as a drop-in replacement for the OpenAI client or runs as an OpenAI-compatible server, intelligently routing simpler queries to cheaper models.

Pre-trained routers out of the box — Proven to cut costs by up to 85% while preserving 95% of GPT-4 performance on widely used benchmarks like MT-Bench.

Cost-effective excellence — Matches the performance of leading commercial offerings while being over 40% cheaper.

Extensible and customizable — Easily add new routers, fine-tune thresholds, and compare performance across multiple benchmarks.

Source: https://github.com/lm-sys/RouteLLM/tree/main

In this tutorial, we’ll walk through how to:

Load and use a pre-trained router.

Calibrate it for your own use case.

Test routing behavior on different types of prompts.

Check out the Full Codes here.

Installing the dependencies

!pip install "routellm[serve,eval]"

Loading OpenAI API Key

To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you’re a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

RouteLLM leverages LiteLLM to support chat completions from a wide range of both open-source and closed-source models. You can check out the list of providers at https://litellm.vercel.app/docs/providers if you want to use some other model. Check out the Full Codes here.

import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

Downloading Config File

RouteLLM uses a configuration file to locate pretrained router checkpoints and the datasets they were trained on. This file tells the system where to find the models that decide whether to send a query to the strong or weak model. Check out the Full Codes here.

Do I need to edit it?

For most users — no. The default config already points to well-trained routers (mf, bert, causal_llm) that work out of the box. You only need to change it if you plan to:

Train your own router on a custom dataset.

Replace the routing algorithm entirely with a new one.

For this tutorial, we’ll keep the config as is and simply:

Set our strong and weak model names in code.

Add our API keys for the chosen providers.

Use a calibrated threshold to balance cost and quality.

Check out the Full Codes here.

!wget https://raw.githubusercontent.com/lm-sys/RouteLLM/main/config.example.yaml

Initializing the RouteLLM Controller

In this code block, we import the necessary libraries and initialize the RouteLLM Controller, which will manage how prompts are routed between models. We specify routers=[“mf”] to use the Matrix Factorization router, a pretrained decision model that predicts whether a query should be sent to the strong or weak model.

The strong_model parameter is set to “gpt-5”, a high-quality but more expensive model, while the weak_model parameter is set to “o4-mini”, a faster and cheaper alternative. For each incoming prompt, the router evaluates its complexity against a threshold and automatically chooses the most cost-effective option—ensuring that simple tasks are handled by the cheaper model while more challenging ones get the stronger model’s capabilities.

This configuration allows you to balance cost efficiency and response quality without manual intervention. Check out the Full Codes here.

import os
import pandas as pd
from routellm.controller import Controller

client = Controller(
    routers=["mf"],        # "mf" = matrix factorization router
    strong_model="gpt-5",
    weak_model="o4-mini"
)

!python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.1 --config config.example.yaml

This command runs RouteLLM’s threshold calibration process for the Matrix Factorization (mf) router. The --strong-model-pct 0.1 argument tells the system to find the threshold value that routes roughly 10% of queries to the strong model (and the rest to the weak model).

Using the --config config.example.yaml file for model and router settings, the calibration determined:

For 10% strong model calls with mf, the optimal threshold is 0.24034.

This means that any query with a router-assigned complexity score above 0.24034 will be sent to the strong model, while those below it will go to the weak model, aligning with your desired cost–quality trade-off.
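
Conceptually, the calibrated threshold turns the router’s win-rate score into a simple decision rule. The sketch below is illustrative only (the actual routing happens inside the Controller):

def route(win_rate, threshold=0.24034, strong="gpt-5", weak="o4-mini"):
    """Pick the model a calibrated router would choose for a given win-rate score."""
    return strong if win_rate >= threshold else weak

print(route(0.30))   # gpt-5   (complex query, above the calibrated threshold)
print(route(0.12))   # o4-mini (simple query, below the threshold)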

Defining the threshold & prompts variables

Here, we define a diverse set of test prompts designed to cover a range of complexity levels. They include simple factual questions (likely to be routed to the weak model), medium reasoning tasks (borderline threshold cases), and high-complexity or creative requests (more suited for the strong model), along with code generation tasks to test technical capabilities. Check out the Full Codes here.

threshold = 0.24034

prompts = [
    # Easy factual (likely weak model)
    "Who wrote the novel 'Pride and Prejudice'?",
    "What is the largest planet in our solar system?",

    # Medium reasoning (borderline cases)
    "If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?",
    "Explain why the sky appears blue during the day and red/orange during sunset.",

    # High complexity / creative (likely strong model)
    "Write a 6-line rap verse about climate change using internal rhyme.",
    "Summarize the differences between supervised, unsupervised, and reinforcement learning with examples.",

    # Code generation
    "Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces.",
    "Generate SQL to find the top 3 highest-paying customers from a 'sales' table."
]

Evaluating Win Rate

The following code calculates the win rate for each test prompt using the mf router, showing the likelihood that the strong model will outperform the weak model.

Based on the calibrated threshold of 0.24034, two prompts —

“If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?” (0.303087)

“Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces.” (0.272534)

— exceed the threshold and would be routed to the strong model.

All other prompts remain below the threshold, meaning they would be served by the weaker, cheaper model. Check out the Full Codes here.

win_rates = client.batch_calculate_win_rate(prompts=pd.Series(prompts), router="mf")

# Store results in a DataFrame
_df = pd.DataFrame({
    "Prompt": prompts,
    "Win_Rate": win_rates
})

# Show full text without truncation
pd.set_option('display.max_colwidth', None)

These results also help in fine-tuning the routing strategy — by analyzing the win rate distribution, we can adjust the threshold to better balance cost savings and performance.
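
For example, if we wanted roughly 20% of our own traffic to reach the strong model instead of 10%, we could approximate a new threshold directly from the observed win-rate distribution. This is a rough, illustrative shortcut; the calibrate_threshold script above remains the canonical way to do this:

import numpy as np

target_strong_pct = 0.20  # suppose we want ~20% of prompts routed to the strong model
approx_threshold = float(np.quantile(win_rates, 1 - target_strong_pct))
print(f"Approximate threshold for ~{target_strong_pct:.0%} strong-model calls: {approx_threshold:.5f}")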

Routing Prompts Through Calibrated Model Fusion (MF) Router

This code iterates over the list of test prompts and sends each one to the RouteLLM controller using the calibrated mf router with the specified threshold (router-mf-{threshold}).

For each prompt, the router decides whether to use the strong or weak model based on the calculated win rate.

The response includes both the generated output and the actual model that was selected by the router.

These details — the prompt, model used, and generated output — are stored in the results list for later analysis. Check out the Full Codes here.

results = []
for prompt in prompts:
    response = client.chat.completions.create(
        model=f"router-mf-{threshold}",
        messages=[{"role": "user", "content": prompt}]
    )
    message = response.choices[0].message["content"]
    model_used = response.model  # RouteLLM returns the model actually used

    results.append({
        "Prompt": prompt,
        "Model Used": model_used,
        "Output": message
    })

df = pd.DataFrame(results)

In the results, prompts 2 and 6 exceeded the threshold win rate and were therefore routed to the gpt-5 strong model, while the rest were handled by the weaker model.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

Star us on GitHub

Join our ML Subreddit

Sponsor us

The post Using RouteLLM to Optimize LLM Usage appeared first on MarkTechPost.

From 100,000 to Under 500 Labels: How Google AI Cuts LLM Training Data by Orders of Magnitude

Google Research has unveiled a groundbreaking method for fine-tuning large language models (LLMs) that slashes the amount of required training data by up to 10,000x while maintaining or even improving model quality. The approach centers on active learning, focusing expert labeling effort on the most informative examples—the “boundary cases” where model uncertainty peaks.

The Traditional Bottleneck

Fine-tuning LLMs for tasks demanding deep contextual and cultural understanding—like ad content safety or moderation—has typically required massive, high-quality labeled datasets. Most data is benign, meaning that for policy violation detection, only a small fraction of examples matter, driving up the cost and complexity of data curation. Standard methods also struggle to keep up when policies or problematic patterns shift, necessitating expensive retraining.

Google’s Active Learning Breakthrough

How It Works:

LLM-as-Scout: The LLM is used to scan a vast corpus (hundreds of billions of examples) and identify cases it’s least certain about.

Targeted Expert Labeling: Instead of labeling thousands of random examples, human experts only annotate those borderline, confusing items.

Iterative Curation: This process repeats, with each batch of new “problematic” examples informed by the latest model’s confusion points.

Rapid Convergence: Models are fine-tuned in multiple rounds, and the iteration continues until the model’s output aligns closely with expert judgment—measured by Cohen’s Kappa, which compares agreement between annotators beyond chance.

Image source: https://research.google/blog/achieving-10000x-training-data-reduction-with-high-fidelity-labels/

Impact:

Data Needs Plummet: In experiments with Gemini Nano-1 and Nano-2 models, alignment with human experts reached parity or better using 250–450 well-chosen examples rather than ~100,000 random crowdsourced labels—a reduction of three to four orders of magnitude.

Model Quality Rises: For more complex tasks and larger models, performance improvements reached 55–65% over baseline, demonstrating more reliable alignment with policy experts.

Label Efficiency: For reliable gains using tiny datasets, high label quality was consistently necessary (Cohen’s Kappa > 0.8).
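
The workflow above is, at its core, standard uncertainty-based active learning with an agreement check. The sketch below is a schematic recreation on synthetic data with a simple classifier standing in for the fine-tuned model; it is not Google’s pipeline, only an illustration of the selection-and-kappa loop:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a large unlabeled pool with a hidden "policy violation" boundary.
pool_X = rng.normal(size=(5000, 8))
pool_y = (pool_X[:, 0] + 0.5 * pool_X[:, 1] > 0).astype(int)  # hidden "expert" ground truth

labeled_idx = list(rng.choice(len(pool_X), size=20, replace=False))  # tiny seed set
model = LogisticRegression()

for _ in range(5):  # iterative curation rounds
    model.fit(pool_X[labeled_idx], pool_y[labeled_idx])
    proba = model.predict_proba(pool_X)[:, 1]
    uncertainty = -np.abs(proba - 0.5)          # closest to the decision boundary = most informative
    ranked = np.argsort(uncertainty)[::-1]      # most uncertain first
    already = set(labeled_idx)
    new = [i for i in ranked if i not in already][:50]
    labeled_idx += new                          # targeted "expert" labeling of ~50 items per round

kappa = cohen_kappa_score(pool_y, model.predict(pool_X))  # agreement with the "experts"
print(f"Labels used: {len(labeled_idx)}, Cohen's kappa vs. ground truth: {kappa:.2f}")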

Why It Matters

This approach flips the traditional paradigm. Rather than drowning models in vast pools of noisy, redundant data, it leverages both LLMs’ ability to identify ambiguous cases and the domain expertise of human annotators where their input is most valuable. The benefits are profound:

Cost Reduction: Vastly fewer examples to label, dramatically lowering labor and capital expenditure.

Faster Updates: The ability to retrain models on a handful of examples makes adaptation to new abuse patterns, policy changes, or domain shifts rapid and feasible.

Societal Impact: Enhanced capacity for contextual and cultural understanding increases the safety and reliability of automated systems handling sensitive content.

In Summary

Google’s new methodology enables LLM fine-tuning on complex, evolving tasks with just hundreds (not hundreds of thousands) of targeted, high-fidelity labels—ushering in far leaner, more agile, and cost-effective model development.

Check out the technical article from Google blog. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

Star us on GitHub

Join our ML Subreddit

Sponsor us

The post From 100,000 to Under 500 Labels: How Google AI Cuts LLM Training Data by Orders of Magnitude appeared first on MarkTechPost.

9 Agentic AI Workflow Patterns Transforming AI Agents in 2025


AI agents are at a pivotal moment: simply calling a language model is no longer enough for production-ready solutions. In 2025, intelligent automation depends on orchestrated, agentic workflows—modular coordination blueprints that transform isolated AI calls into systems of autonomous, adaptive, and self-improving agents. Here’s how nine workflow patterns can unlock the next generation of scalable, robust AI agents.

Why Classic AI Agent Workflows Fail

Most failed agent implementations rely on “single-step thinking”—expecting one model call to solve complex, multi-part problems. AI agents succeed when their intelligence is orchestrated across multi-step, parallel, routed, and self-improving workflows. According to Gartner, by 2028, at least 33% of enterprise software will depend on agentic AI, but overcoming the 85% failure rate requires these new paradigms.

The 9 Agentic Workflow Patterns for 2025

Sequential Intelligence

(1) Prompt Chaining:

Tasks are decomposed into step-by-step subgoals where each LLM’s output becomes the next step’s input. Ideal for complex customer support agents, assistants, and pipelines that require context preservation throughout multi-turn conversations.
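
A minimal sketch of prompt chaining follows, assuming a generic call_llm(prompt) helper as a stand-in for whatever model client you use:

def call_llm(prompt: str) -> str:
    """Stand-in for your model client (OpenAI, Gemini, etc.); replace with a real API call."""
    return f"[model output for: {prompt[:40]}...]"

def chained_support_reply(ticket: str) -> str:
    # Step 1: extract the customer's actual issue.
    issue = call_llm(f"Summarize the core issue in this support ticket:\n{ticket}")
    # Step 2: propose a resolution, conditioned on step 1's output.
    plan = call_llm(f"Given this issue: {issue}\nPropose a step-by-step resolution.")
    # Step 3: draft the reply, conditioned on both previous steps.
    return call_llm(f"Write a friendly reply to the customer.\nIssue: {issue}\nPlan: {plan}")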

(2) Plan and Execute:

Agents autonomously plan multi-step workflows, execute each stage sequentially, review outcomes, and adjust as needed. This adaptive “plan–do–check–act” loop is vital for business process automation and data orchestration, providing resilience against failures and offering granular control over progress.

Parallel Processing

(3) Parallelization:

Splitting a large task into independent sub-tasks for concurrent execution by multiple agents or LLMs. Popular for code review, candidate evaluation, A/B testing, and building guardrails, parallelization drastically reduces time to resolution and improves consensus accuracy.

(4) Orchestrator–Worker:

A central “orchestrator” agent breaks tasks down, assigns work to specialized “workers,” then synthesizes results. This pattern powers retrieval-augmented generation (RAG), coding agents, and sophisticated multi-modal research by leveraging specialization.

Intelligent Routing

(5) Routing:

Input classification decides which specialized agent should handle each part of a workflow, achieving separation of concerns and dynamic task assignment. This is the backbone of multi-domain customer support and debate systems, where routing enables scalable expertise.

(6) Evaluator–Optimizer:

Agents collaborate in a continuous loop: one generates solutions, the other evaluates and suggests improvements. This enables real-time data monitoring, iterative coding, and feedback-driven design—improving quality with every cycle.
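
The evaluator–optimizer pattern reduces to a generate–critique–revise loop; a minimal sketch, reusing the assumed call_llm helper from the prompt-chaining example, and with the stop phrase being an assumed convention:

def evaluator_optimizer(task: str, max_rounds: int = 3) -> str:
    draft = call_llm(f"Produce a first solution for: {task}")
    for _ in range(max_rounds):
        critique = call_llm(f"Critique this solution and list concrete improvements:\n{draft}")
        if "no further improvements" in critique.lower():  # simple stop condition (assumed convention)
            break
        draft = call_llm(f"Revise the solution using this critique:\n{critique}\n\nSolution:\n{draft}")
    return draft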

Self-Improving Systems

(7) Reflection:

Agents self-review their performance after each run, learning from errors, feedback, and changing requirements. Reflection elevates agents from static performers to dynamic learners, essential for long-term automation in data-centric environments, such as app building or regulatory compliance.

(8) ReWOO:

Extensions of ReAct allow agents to plan, substitute strategies, and compress workflow logic—reducing computational overhead and aiding fine-tuning, especially in deep search and multi-step Q&A domains.

(9) Autonomous Workflow:

Agents continuously operate in loops, leveraging tool feedback and environmental signals for perpetual self-improvement. This is at the heart of autonomous evaluations and dynamic guardrail systems, allowing agents to operate reliably with minimal intervention.

How These Patterns Revolutionize AI Agents

Orchestrated Intelligence: These patterns unite isolated model calls into intelligent, context-aware agentic systems, each optimized for different problem structures (sequential, parallel, routed, and self-improving).

Complex Problem Solving: Collaborative agent workflows tackle problems that single LLM agents cannot address, dividing and conquering complexity for reliable business outcomes.

Continuous Improvement: By learning from feedback and failures at every step, agentic workflows evolve—offering a path to truly autonomous, adaptive intelligence.

Scalability & Flexibility: Agents can be specialized, added, or swapped, yielding modular pipelines that scale from simple automation to enterprise-grade orchestrations.

Real-World Impact & Implementation Best Practices

Design for Modularity: Build agents as composable, specialized entities. Orchestration patterns manage timing, data flow, and dependencies.

Leverage Tool Integration: Success depends on seamless interplay between agents and external systems (APIs, cloud, RPA), enabling dynamic adaptation to evolving requirements.

Focus on Feedback Loops: Reflection and evaluator–optimizer workflows keep agents improving, boosting precision and reliability in dynamic environments like healthcare, finance, and customer service.

Conclusion

Agentic workflows are no longer a future concept—they are the cornerstone of today’s leading AI teams. By mastering these nine patterns, developers and architects can unlock scalable, resilient, and adaptive AI systems that thrive in real-world production. The shift from single-step execution to orchestrated intelligence marks the dawn of enterprise-wide automation, making agentic thinking a required skill for the age of autonomous AI.

Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

Star us on GitHub

Join our ML Subreddit

Sponsor us

The post 9 Agentic AI Workflow Patterns Transforming AI Agents in 2025 appeared first on MarkTechPost.

Building an Advanced PaperQA2 Research Agent with Google Gemini for Scientific Literature Analysis

In this tutorial, we walk through building an advanced PaperQA2 AI Agent powered by Google’s Gemini model, designed specifically for scientific literature analysis. We set up the environment in Google Colab/Notebook, configure the Gemini API, and integrate it seamlessly with PaperQA2 to process and query multiple research papers. By the end of the setup, we have an intelligent agent capable of answering complex questions, performing multi-question analyses, and conducting comparative research across papers, all while providing clear answers with evidence from source documents. Check out the Full Codes here.

!pip install "paper-qa>=5" google-generativeai requests pypdf2 -q

import os
import asyncio
import tempfile
import requests
from pathlib import Path
from paperqa import Settings, ask, agent_query
from paperqa.settings import AgentSettings
import google.generativeai as genai

GEMINI_API_KEY = "Use Your Own API Key Here"
os.environ["GEMINI_API_KEY"] = GEMINI_API_KEY

genai.configure(api_key=GEMINI_API_KEY)
print(" Gemini API key configured successfully!")

We begin by installing the required libraries, including PaperQA2 and Google’s Generative AI SDK, and then import the necessary modules for our project. We set our Gemini API key as an environment variable and configure it, ensuring the integration is ready for use. Check out the Full Codes here.

def download_sample_papers():
    """Download sample AI/ML research papers for demonstration"""
    papers = {
        "attention_is_all_you_need.pdf": "https://arxiv.org/pdf/1706.03762.pdf",
        "bert_paper.pdf": "https://arxiv.org/pdf/1810.04805.pdf",
        "gpt3_paper.pdf": "https://arxiv.org/pdf/2005.14165.pdf"
    }

    papers_dir = Path("sample_papers")
    papers_dir.mkdir(exist_ok=True)

    print(" Downloading sample research papers...")
    for filename, url in papers.items():
        filepath = papers_dir / filename
        if not filepath.exists():
            try:
                response = requests.get(url, stream=True, timeout=30)
                response.raise_for_status()
                with open(filepath, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                print(f" Downloaded: {filename}")
            except Exception as e:
                print(f" Failed to download {filename}: {e}")
        else:
            print(f" Already exists: {filename}")

    return str(papers_dir)

papers_directory = download_sample_papers()

def create_gemini_settings(paper_dir: str, temperature: float = 0.1):
    """Create optimized settings for PaperQA2 with Gemini models"""

    return Settings(
        llm="gemini/gemini-1.5-flash",
        summary_llm="gemini/gemini-1.5-flash",

        agent=AgentSettings(
            agent_llm="gemini/gemini-1.5-flash",
            search_count=6,
            timeout=300.0,
        ),

        embedding="gemini/text-embedding-004",

        temperature=temperature,
        paper_directory=paper_dir,

        answer=dict(
            evidence_k=8,
            answer_max_sources=4,
            evidence_summary_length="about 80 words",
            answer_length="about 150 words, but can be longer",
            max_concurrent_requests=2,
        ),

        parsing=dict(
            chunk_size=4000,
            overlap=200,
        ),

        verbosity=1,
    )

We download a set of well-known AI/ML research papers for our analysis and store them in a dedicated folder. We then create optimized PaperQA2 settings configured to use Gemini for all LLM and embedding tasks, fine-tuning parameters like search count, evidence retrieval, and parsing for efficient and accurate literature processing. Check out the Full Codes here.

class PaperQAAgent:
    """Advanced AI Agent for scientific literature analysis using PaperQA2"""

    def __init__(self, papers_directory: str, temperature: float = 0.1):
        self.settings = create_gemini_settings(papers_directory, temperature)
        self.papers_dir = papers_directory
        print(f" PaperQA Agent initialized with papers from: {papers_directory}")

    async def ask_question(self, question: str, use_agent: bool = True):
        """Ask a question about the research papers"""
        print(f"\n Question: {question}")
        print(" Searching through research papers...")

        try:
            if use_agent:
                response = await agent_query(query=question, settings=self.settings)
            else:
                response = ask(question, settings=self.settings)

            return response

        except Exception as e:
            print(f" Error processing question: {e}")
            return None

    def display_answer(self, response):
        """Display the answer with formatting"""
        if response is None:
            print(" No response received")
            return

        print("\n" + "="*60)
        print(" ANSWER:")
        print("="*60)

        answer_text = getattr(response, 'answer', str(response))
        print(f"\n{answer_text}")

        contexts = getattr(response, 'contexts', getattr(response, 'context', []))
        if contexts:
            print("\n" + "-"*40)
            print(" SOURCES USED:")
            print("-"*40)
            for i, context in enumerate(contexts[:3], 1):
                context_name = getattr(context, 'name', getattr(context, 'doc', f'Source {i}'))
                context_text = getattr(context, 'text', getattr(context, 'content', str(context)))
                print(f"\n{i}. {context_name}")
                print(f" Text preview: {context_text[:150]}...")

    async def multi_question_analysis(self, questions: list):
        """Analyze multiple questions in sequence"""
        results = {}
        for i, question in enumerate(questions, 1):
            print(f"\n Processing question {i}/{len(questions)}")
            response = await self.ask_question(question)
            results[question] = response  # keep each answer keyed by its question

            if response:
                print(f" Completed: {question[:50]}...")
            else:
                print(f" Failed: {question[:50]}...")

        return results

    async def comparative_analysis(self, topic: str):
        """Perform comparative analysis across papers"""
        questions = [
            f"What are the key innovations in {topic}?",
            f"What are the limitations of current {topic} approaches?",
            f"What future research directions are suggested for {topic}?",
        ]

        print(f"\n Starting comparative analysis on: {topic}")
        return await self.multi_question_analysis(questions)

async def basic_demo():
    """Demonstrate basic PaperQA functionality"""
    agent = PaperQAAgent(papers_directory)

    question = "What is the transformer architecture and why is it important?"
    response = await agent.ask_question(question)
    agent.display_answer(response)

print(" Running basic demonstration...")
await basic_demo()

async def advanced_demo():
    """Demonstrate advanced multi-question analysis"""
    agent = PaperQAAgent(papers_directory, temperature=0.2)

    questions = [
        "How do attention mechanisms work in transformers?",
        "What are the computational challenges of large language models?",
        "How has pre-training evolved in natural language processing?"
    ]

    print(" Running advanced multi-question analysis...")
    results = await agent.multi_question_analysis(questions)

    for question, response in results.items():
        print(f"\n{'='*80}")
        print(f"Q: {question}")
        print('='*80)
        if response:
            answer_text = getattr(response, 'answer', str(response))
            display_text = answer_text[:300] + "..." if len(answer_text) > 300 else answer_text
            print(display_text)
        else:
            print(" No answer available")

print("\n Running advanced demonstration...")
await advanced_demo()

async def research_comparison_demo():
    """Demonstrate comparative research analysis"""
    agent = PaperQAAgent(papers_directory)

    results = await agent.comparative_analysis("attention mechanisms in neural networks")

    print("\n" + "="*80)
    print(" COMPARATIVE ANALYSIS RESULTS")
    print("="*80)

    for question, response in results.items():
        print(f"\n {question}")
        print("-" * 50)
        if response:
            answer_text = getattr(response, 'answer', str(response))
            print(answer_text)
        else:
            print(" Analysis unavailable")
        print()

print(" Running comparative research analysis...")
await research_comparison_demo()

We define a PaperQAAgent that uses our Gemini-tuned PaperQA2 settings to search papers, answer questions, and cite sources with clean display helpers. We then run basic, advanced multi-question, and comparative demos so we can interrogate literature end-to-end and summarize findings efficiently. Check out the Full Codes here.

def create_interactive_agent():
    """Create an interactive agent for custom queries"""
    agent = PaperQAAgent(papers_directory)

    async def query(question: str, show_sources: bool = True):
        """Interactive query function"""
        response = await agent.ask_question(question)

        if response:
            answer_text = getattr(response, 'answer', str(response))
            print(f"\n Answer:\n{answer_text}")

            if show_sources:
                contexts = getattr(response, 'contexts', getattr(response, 'context', []))
                if contexts:
                    print(f"\n Based on {len(contexts)} sources:")
                    for i, ctx in enumerate(contexts[:3], 1):
                        ctx_name = getattr(ctx, 'name', getattr(ctx, 'doc', f'Source {i}'))
                        print(f"   {i}. {ctx_name}")
        else:
            print(" Sorry, I couldn't find an answer to that question.")

        return response

    return query

interactive_query = create_interactive_agent()

print("\n Interactive agent ready! You can now ask custom questions:")
print("Example: await interactive_query('How do transformers handle long sequences?')")

def print_usage_tips():
    """Print helpful usage tips"""
    tips = """
 USAGE TIPS FOR PAPERQA2 WITH GEMINI:

1. Question Formulation:
   - Be specific about what you want to know
   - Ask about comparisons, mechanisms, or implications
   - Use domain-specific terminology

2. Model Configuration:
   - Gemini 1.5 Flash is free and reliable
   - Adjust temperature (0.0-1.0) for creativity vs precision
   - Use smaller chunk_size for better processing

3. Document Management:
   - Add PDFs to the papers directory
   - Use meaningful filenames
   - Mix different types of papers for better coverage

4. Performance Optimization:
   - Limit concurrent requests for free tier
   - Use smaller evidence_k values for faster responses
   - Cache results by saving the agent state

5. Advanced Usage:
   - Chain multiple questions for deeper analysis
   - Use comparative analysis for research reviews
   - Combine with other tools for complete workflows

Example Questions to Try:
- "Compare the attention mechanisms in BERT vs GPT models"
- "What are the computational bottlenecks in transformer training?"
- "How has pre-training evolved from word2vec to modern LLMs?"
- "What are the key innovations that made transformers successful?"
"""
    print(tips)

print_usage_tips()

def save_analysis_results(results: dict, filename: str = "paperqa_analysis.txt"):
    """Save analysis results to a file"""
    with open(filename, 'w', encoding='utf-8') as f:
        f.write("PaperQA2 Analysis Results\n")
        f.write("=" * 50 + "\n\n")

        for question, response in results.items():
            f.write(f"Question: {question}\n")
            f.write("-" * 30 + "\n")
            if response:
                answer_text = getattr(response, 'answer', str(response))
                f.write(f"Answer: {answer_text}\n")

                contexts = getattr(response, 'contexts', getattr(response, 'context', []))
                if contexts:
                    f.write(f"\nSources ({len(contexts)}):\n")
                    for i, ctx in enumerate(contexts, 1):
                        ctx_name = getattr(ctx, 'name', getattr(ctx, 'doc', f'Source {i}'))
                        f.write(f"  {i}. {ctx_name}\n")
            else:
                f.write("Answer: No response available\n")
            f.write("\n" + "="*50 + "\n\n")

    print(f" Results saved to: {filename}")

print(" Tutorial complete! You now have a fully functional PaperQA2 AI Agent with Gemini.")

We create an interactive query helper that allows us to ask custom questions on demand and optionally view cited sources. We also print practical usage tips and add a saver that writes every Q&A with source names to a results file, wrapping up the tutorial with a ready-to-use workflow.

In conclusion, we successfully created a fully functional AI research assistant that leverages the speed and versatility of Gemini with the robust paper processing capabilities of PaperQA2. We can now interactively explore scientific papers, run targeted queries, and even perform in-depth comparative analyses with minimal effort. This setup enhances our ability to digest complex research and also streamlines the entire literature review process, enabling us to focus on insights rather than manual searching.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

Discuss on Hacker News

Join our ML Subreddit

Sponsor us

The post Building an Advanced PaperQA2 Research Agent with Google Gemini for Scientific Literature Analysis appeared first on MarkTechPost.

Graph-R1: An Agentic GraphRAG Framework for Structured, Multi-Turn Reasoning with Reinforcement Learning

Introduction

Large Language Models (LLMs) have set new benchmarks in natural language processing, but their tendency for hallucination—generating inaccurate outputs—remains a critical issue for knowledge-intensive applications. Retrieval-Augmented Generation (RAG) frameworks attempt to solve this by incorporating external knowledge into language generation. However, traditional RAG approaches rely on chunk-based retrieval, which limits their ability to represent complex semantic relationships. Entity-relation graph-based RAG methods (GraphRAG) address some structural limitations, but still face high construction cost, one-shot retrieval inflexibility, and dependence on long-context reasoning and carefully crafted prompts.

Researchers from Nanyang Technological University, National University of Singapore, Beijing Institute of Computer Technology and Application, and Beijing Anzhen Hospital have introduced Graph-R1, an agentic GraphRAG framework powered by end-to-end reinforcement learning.

Image source: https://arxiv.org/pdf/2507.21892v1

Core Innovations of Graph-R1

1. Lightweight Knowledge Hypergraph Construction

Graph-R1 constructs knowledge as a hypergraph, where each knowledge segment is extracted using LLM-driven n-ary relation extraction. This approach encodes richer and more semantically grounded relationships, boosting agentic reasoning capabilities while maintaining manageable cost and computational requirements.

Efficiency: Only 5.69s and $2.81 per 1,000 tokens for construction (vs. $3.35 for GraphRAG and $4.14 for HyperGraphRAG), while generating semantically rich graphs with 120,499 nodes and 98,073 edges.

2. Multi-Turn Agentic Retrieval Process

Graph-R1 models retrieval as a multi-turn interaction loop (“think-retrieve-rethink-generate”), allowing the agent to adaptively query and refine its knowledge path, unlike previous methods that use one-shot retrieval.

Dynamic Reasoning: The agent decides at each step whether to continue exploring or terminate with an answer. Entity-based and direct hyperedge retrieval are fused through reciprocal rank aggregation, improving the chances of retrieving the most relevant knowledge.
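
The fusion step mentioned above is a form of reciprocal rank aggregation. The sketch below shows the generic reciprocal-rank-fusion idea; the constant k=60 is the common default in the RRF literature and the item names are hypothetical, not values from the paper:

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists; items ranked highly in any list rise to the top."""
    scores = {}
    for ranked in rankings:                      # e.g., [entity_path_results, hyperedge_path_results]
        for rank, item in enumerate(ranked, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

entity_hits    = ["fact_A", "fact_B", "fact_C"]   # hypothetical results from the entity path
hyperedge_hits = ["fact_B", "fact_D", "fact_A"]   # hypothetical results from the hyperedge path
print(reciprocal_rank_fusion([entity_hits, hyperedge_hits]))
# ['fact_B', 'fact_A', 'fact_D', 'fact_C'] -- facts found by both paths rank first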

3. End-to-End Reinforcement Learning Optimization

Graph-R1 uses Group Relative Policy Optimization (GRPO) for end-to-end RL, integrating rewards for format adherence, relevance, and answer correctness. This unified reward guides agents to develop generalizable reasoning strategies tightly aligned with both the knowledge structure and output quality.

Outcome-directed reward mechanism: Combines format rewards (structural coherence) and answer rewards (semantic accuracy) for effective optimization, only rewarding answers embedded in structurally valid reasoning trajectories.

Key Findings

Benchmarking on RAG QA Tasks

Graph-R1 was evaluated across six standard QA datasets (2WikiMultiHopQA, HotpotQA, Musique, Natural Questions, PopQA, TriviaQA).

Method             Avg. F1 (Qwen2.5-7B)
NaiveGeneration    13.87
StandardRAG        15.89
GraphRAG           24.87
HyperGraphRAG      29.40
Search-R1          46.19
R1-Searcher        42.29
Graph-R1           57.82

Graph-R1 achieves up to 57.82 average F1 with Qwen2.5-7B, surpassing all previous baselines by a wide margin. Larger base models amplify its performance gains.

Ablation Analysis

Component ablation demonstrates that removing hypergraph construction, multi-turn reasoning, or RL optimization dramatically reduces performance, validating the necessity of each module within Graph-R1.

Retrieval and Efficiency

Graph-R1 retrieval is more concise and effective. It achieves high F1 scores with moderate average content lengths (~1200-1500 tokens per exchange), and supports more interaction turns (average 2.3-2.5), facilitating stable and accurate knowledge extraction.

Generation cost is minimal: Despite the richer representation, Graph-R1’s response time per query (7.0s) and per-query cost ($0) outperform graph-based competitors such as HyperGraphRAG (9.6s, $8.76).

Generation Quality

Graph-R1’s generation quality is evaluated across seven dimensions—comprehensiveness, knowledgeability, correctness, relevance, diversity, logical coherence, factuality—and consistently outperforms all RL-based and graph-based baselines, achieving top scores in correctness (86.9), relevance (95.2), and coherence (88.5).

Generalizability

Cross-validation on out-of-distribution (O.O.D.) settings reveals that Graph-R1 maintains robust performance across datasets, with O.O.D./I.I.D. ratios often above 85%, demonstrating strong domain generalization properties.

Theoretical Guarantees

Graph-R1 is supported by information-theoretic analyses:

Graph-structured knowledge provides higher information density per retrieval and faster convergence to correct answers compared to chunk-based retrieval.

Multi-turn interaction enables the agent to achieve higher retrieval efficiency by dynamically focusing on high-impact graph regions.

End-to-end RL optimization bridges graph-structured evidence and language generation, reducing output entropy and error rates.

Algorithmic Workflow (High-Level)

Knowledge Hypergraph Extraction: LLM extracts n-ary relations to build entity and hyperedge sets.

Multi-turn Agentic Reasoning: The agent alternates between reflective thinking, querying, hypergraph retrieval (entity and hyperedge dual paths), and synthesis.

GRPO Optimization: RL policy is updated using sampled trajectories and reward normalization, enforcing structure and answer correctness.
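
The group-relative part of GRPO can be illustrated with the reward-normalization step alone: rewards for a group of sampled trajectories are centered and scaled within the group before being used as advantages. The sketch below shows only that normalization, with an assumed combined format-plus-answer reward, not the full policy update:

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its sampling group (the 'group-relative' idea in GRPO)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Assumed combined rewards (format adherence + answer correctness) for 4 sampled reasoning trajectories.
rewards = [0.2, 0.9, 0.5, 0.9]
print(group_relative_advantages(rewards))  # above-average trajectories get positive advantages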

Conclusion

Graph-R1 demonstrates that integrating hypergraph-based knowledge representation, agentic multi-turn reasoning, and end-to-end RL delivers unprecedented gains in factual QA performance, retrieval efficiency, and generation quality, charting the path for next-generation agentic and knowledge-driven LLM systems.

FAQ 1: What is the key innovation of Graph-R1 compared to earlier GraphRAG and RAG systems?

Graph-R1 introduces an agentic framework where retrieval is modeled as a multi-turn interaction rather than a single one-shot process. Its main innovations are:

Hypergraph Knowledge Representation: Instead of simple entity-relation graphs or text chunks, Graph-R1 constructs a semantic hypergraph that enables more expressive, n-ary relationships between entities.

Multi-Turn Reasoning Loop: The agent operates in repeated cycles of “think–retrieve–rethink–generate” over the hypergraph, dynamically focusing queries rather than retrieving everything at once.

End-to-End Reinforcement Learning (RL): The agent is trained with a reward function that simultaneously optimizes for step-wise logical reasoning and final answer correctness, enabling tighter alignment between structured knowledge and natural language answers.

FAQ 2: How does Graph-R1’s retrieval and generation efficiency compare to previous methods?

Graph-R1 is significantly more efficient and effective in both retrieval and answer generation:

Lower Construction & Retrieval Cost: For building the knowledge hypergraph, Graph-R1 takes only 5.69 seconds and costs $2.81 per 1,000 tokens (on the 2Wiki dataset), outperforming similar graph-based methods.

Faster and Cheaper Generation: Query response times (average 7 seconds per query) and generation costs ($0 per query) are better than prior graph-RAG systems, such as HyperGraphRAG.

Conciseness & Robustness: Graph-R1 answers are both more concise (usually 1,200–1,500 tokens) and more accurate due to the multi-turn interaction, with state-of-the-art F1 scores across six QA datasets.

FAQ 3: In which scenarios or domains is the Graph-R1 framework most applicable?

Graph-R1 is ideal for complex knowledge-intensive applications demanding both factual accuracy and reasoning transparency, such as:

Healthcare and Medical AI: Where multi-hop reasoning, traceability, and reliability are essential.

Legal and Regulatory Domains: That require precise grounded answers and interpretable multi-step reasoning.

Enterprise Knowledge Automation: For tasks needing scalable, dynamic querying and retrieval across large document or data corpora. The model’s architecture also allows for easy adaptation to other fields that benefit from agentic, multi-turn knowledge search anchored in structured representations.

Check out the Paper here and GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. 

Discuss on Hacker News

Join our ML Subreddit

Sponsor us

The post Graph-R1: An Agentic GraphRAG Framework for Structured, Multi-Turn Reasoning with Reinforcement Learning appeared first on MarkTechPost.

A Developer’s Guide to OpenAI’s GPT-5 Model Capabilities

In this tutorial, we’ll explore the new capabilities introduced in OpenAI’s latest model, GPT-5. The update brings several powerful features, including the Verbosity parameter, Free-form Function Calling, Context-Free Grammar (CFG), and Minimal Reasoning. We’ll look at what they do and how to use them in practice. Check out the Full Codes here.

Installing the libraries

Copy CodeCopiedUse a different Browser!pip install pandas openai

To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you’re a new user, you may need to add billing details and make a minimum payment of $5 to activate API access. Check out the Full Codes here.

Copy CodeCopiedUse a different Browserimport os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

Verbosity Parameter

The Verbosity parameter lets you control how detailed the model’s replies are without changing your prompt.

low → Short and concise, minimal extra text.

medium (default) → Balanced detail and clarity.

high → Very detailed, ideal for explanations, audits, or teaching. Check out the Full Codes here.

Copy CodeCopiedUse a different Browserfrom openai import OpenAI
import pandas as pd
from IPython.display import display

client = OpenAI()

question = "Write a poem about a detective and his first solve"

data = []

for verbosity in ["low", "medium", "high"]:
    response = client.responses.create(
        model="gpt-5-mini",
        input=question,
        text={"verbosity": verbosity}
    )

    # Extract text
    output_text = ""
    for item in response.output:
        if hasattr(item, "content"):
            for content in item.content:
                if hasattr(content, "text"):
                    output_text += content.text

    usage = response.usage
    data.append({
        "Verbosity": verbosity,
        "Sample Output": output_text,
        "Output Tokens": usage.output_tokens
    })

Copy CodeCopiedUse a different Browser# Create DataFrame
df = pd.DataFrame(data)

# Display nicely with centered headers
pd.set_option('display.max_colwidth', None)
styled_df = df.style.set_table_styles(
    [
        {'selector': 'th', 'props': [('text-align', 'center')]},  # Center column headers
        {'selector': 'td', 'props': [('text-align', 'left')]}     # Left-align table cells
    ]
)

display(styled_df)

The output tokens scale roughly linearly with verbosity: low (731) → medium (1017) → high (1263).

Free-Form Function Calling

Free-form function calling lets GPT-5 send raw text payloads—like Python scripts, SQL queries, or shell commands—directly to your tool, without the JSON formatting used in GPT-4. Check out the Full Codes here.

This makes it easier to connect GPT-5 to external runtimes such as:

Code sandboxes (Python, C++, Java, etc.)

SQL databases (outputs raw SQL directly)

Shell environments (outputs ready-to-run Bash)

Config generators

Copy CodeCopiedUse a different Browserfrom openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5-mini",
    input="Please use the code_exec tool to calculate the cube of the number of vowels in the word 'pineapple'",
    text={"format": {"type": "text"}},
    tools=[
        {
            "type": "custom",
            "name": "code_exec",
            "description": "Executes arbitrary python code",
        }
    ]
)

Copy CodeCopiedUse a different Browserprint(response.output[1].input)

This output shows GPT-5 generating raw Python code that counts the vowels in the word pineapple, calculates the cube of that count, and prints both values. Instead of returning a structured JSON object (like GPT-4 typically would for tool calls), GPT-5 delivers plain executable code. This makes it possible to feed the result directly into a Python runtime without extra parsing.
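
To close the loop, the raw payload can be handed straight to a Python runtime. The cell below is a minimal sketch, assuming the previous cell has been run and that the second output item is the custom tool call (as in the print above); in practice you would execute the code in a proper sandbox rather than with exec().

Copy CodeCopiedUse a different Browser# Minimal sketch: run the free-form payload returned by the code_exec tool call.
# Assumes `response` from the previous cell; response.output[1] is the tool call.
code_str = response.output[1].input  # raw Python emitted by GPT-5

namespace = {}
exec(code_str, namespace)  # WARNING: exec() runs arbitrary code; sandbox it in production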

Context-Free Grammar (CFG)

A Context-Free Grammar (CFG) is a set of production rules that define valid strings in a language. Each rule rewrites a non-terminal symbol into terminals and/or other non-terminals, without depending on the surrounding context.

CFGs are useful when you want to strictly constrain the model’s output so it always follows the syntax of a programming language, data format, or other structured text — for example, ensuring generated SQL, JSON, or code is always syntactically correct.
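
As a toy illustration of what “production rules” means (our own example, unrelated to the OpenAI API), the grammar below generates strings of one or more digits, and the small checker verifies whether a string can be derived from the start symbol S.

Copy CodeCopiedUse a different Browser# Toy context-free grammar for unsigned integers (illustrative only):
#   S     -> digit S | digit
#   digit -> "0" | "1" | ... | "9"
def derives_S(s: str) -> bool:
    """Return True if s can be derived from the start symbol S."""
    if not s:
        return False
    if len(s) == 1:
        return s.isdigit()                      # S -> digit
    return s[0].isdigit() and derives_S(s[1:])  # S -> digit S

print(derives_S("2024"))  # True
print(derives_S("20a4"))  # False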

For comparison, we’ll run the same script using GPT-4 and GPT-5 with an identical CFG to see how both models adhere to the grammar rules and how their outputs differ in accuracy and speed. Check out the Full Codes here.

Copy CodeCopiedUse a different Browserfrom openai import OpenAI
import re

client = OpenAI()

email_regex = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

prompt = "Give me a valid email address for John Doe. It can be a dummy email"

# No grammar constraints — model might give prose or invalid format
response = client.responses.create(
    model="gpt-4o",  # or earlier
    input=prompt
)

output = response.output_text.strip()
print("GPT Output:", output)
print("Valid?", bool(re.match(email_regex, output)))

Copy CodeCopiedUse a different Browserfrom openai import OpenAI

client = OpenAI()

email_regex = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

prompt = "Give me a valid email address for John Doe. It can be a dummy email"

response = client.responses.create(
    model="gpt-5",  # grammar-constrained model
    input=prompt,
    text={"format": {"type": "text"}},
    tools=[
        {
            "type": "custom",
            "name": "email_grammar",
            "description": "Outputs a valid email address.",
            "format": {
                "type": "grammar",
                "syntax": "regex",
                "definition": email_regex
            }
        }
    ],
    parallel_tool_calls=False
)

print("GPT-5 Output:", response.output[1].input)

This example shows how GPT-5 can adhere more closely to a specified format when using a Context-Free Grammar.

With the same grammar rules, GPT-4 produced extra text around the email address (“Sure, here’s a test email you can use for John Doe: johndoe@example.com”), which makes it invalid according to the strict format requirement.

GPT-5, however, output exactly john.doe@example.com, matching the grammar and passing validation. This demonstrates GPT-5’s improved ability to follow CFG constraints precisely. Check out the Full Codes here.

Minimal Reasoning

Minimal reasoning mode runs GPT-5 with very few or no reasoning tokens, reducing latency and delivering a faster time-to-first-token.

It’s ideal for deterministic, lightweight tasks such as:

Data extraction

Formatting

Short rewrites

Simple classification

Because the model skips most intermediate reasoning steps, responses are quick and concise. If not specified, the reasoning effort defaults to medium. Check out the Full Codes here.

Copy CodeCopiedUse a different Browserimport time
from openai import OpenAI

client = OpenAI()

prompt = "Classify the given number as odd or even. Return one word only."

start_time = time.time()  # Start timer

response = client.responses.create(
    model="gpt-5",
    input=[
        {"role": "developer", "content": prompt},
        {"role": "user", "content": "57"}
    ],
    reasoning={
        "effort": "minimal"  # Faster time-to-first-token
    },
)

latency = time.time() - start_time  # End timer

# Extract model's text output
output_text = ""
for item in response.output:
    if hasattr(item, "content"):
        for content in item.content:
            if hasattr(content, "text"):
                output_text += content.text

print("--------------------------------")
print("Output:", output_text)
print(f"Latency: {latency:.3f} seconds")


Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post A Developer’s Guide to OpenAI’s GPT-5 Model Capabilities appeared first on MarkTechPost.

Cloudflare vs Perplexity: The Battle Over AI Web Scraping Heats Up

Reading through Cloudflare’s detailed exposé and the extensive media coverage, the controversy surrounding Perplexity AI’s web scraping practices is deeper — and more polarizing — than it first appears. Cloudflare accuses Perplexity of systematically ignoring website blocks and masking its identity to scrape data from sites that have opted out, raising serious questions about ethics, transparency, and the future of the Internet’s business model.

What Cloudflare Observed

Cloudflare’s report and independent investigations show that Perplexity, an AI startup, allegedly crawls and scrapes content from websites that explicitly signal (through robots.txt and direct blocks) that AI tools are not welcome. The technical evidence includes changing user agents to impersonate browsers like Google Chrome on macOS and rotating Autonomous System Numbers (ASNs) — sophisticated tactics intended to evade detection and blocks. Cloudflare claims it detected this covert scraping across tens of thousands of domains, generating millions of requests daily, and fingerprinted the crawler using machine learning and other network signals.

Why the Accusations Matter

For decades, websites have used robots.txt as a “gentleman’s agreement” to tell bots what’s allowed. Ignoring it is illegal in very few jurisdictions, but the norm among leaders like OpenAI and Anthropic is to respect these signals. Perplexity’s alleged approach undermines this unwritten contract, suggesting a willingness to bypass website owners’ wishes in pursuit of training data.
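
For readers unfamiliar with how these signals are consumed in code, the short sketch below uses Python's standard urllib.robotparser to check whether a crawler is allowed to fetch a URL; the user-agent string and URL are illustrative placeholders, not Perplexity's or Cloudflare's actual identifiers.

from urllib.robotparser import RobotFileParser

# Minimal sketch: how a compliant crawler consults robots.txt before fetching.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

# Returns False if the site disallows this user agent for the given path.
print(rp.can_fetch("ExampleAI-Bot", "https://example.com/articles/some-post"))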

This issue exploded just as Cloudflare launched its new “Pay Per Crawl” marketplace, which lets publishers charge for AI bot access and blocks most crawlers by default. Major outlets — The Atlantic, BuzzFeed, Time Inc., and O’Reilly — have signed up, and over 2.5 million websites now disallow AI training outright.

Perplexity Responds

Perplexity’s spokesperson dismissed Cloudflare’s blog post as little more than a “sales pitch,” claiming the screenshots “show that no content was accessed” and denying ownership of the bot in question. Perplexity later argued that much of what Cloudflare saw was user-driven fetching (an AI agent acting on direct user requests) rather than automated crawling — a key distinction in ongoing debates about what “scraping” really means. They also mentioned that similar incidents had happened before, notably accusations of plagiarism from outlets like Wired, and the company has struggled to define its own standards for content use.

Divided Reactions & Broader Implications

Cloudflare’s stance: Protect publishers’ business models, enforce block signals, and charge for “AI access” to content.

Perplexity’s defense: AI web agents, when acting for users, shouldn’t be distinguished from human browsing.

Community Debate: Some argue on social platforms that if a user requests a public site via Perplexity, it’s akin to opening it in Firefox. Others counter that this hurts site owners’ ad-driven revenue and control over their data.

The Big Picture: The Internet’s Business Model Is Changing

Content monetization is rapidly shifting. Publishers are moving from ads to access fees, and scraping is becoming a pay-to-play market.

Transparency and compliance are no longer optional. AI firms face mounting reputational and legal risks if caught evading blocks or misusing content.

Data partnerships will define the future. Major AI players are investing in licensing deals with publishers rather than relying on stealth scraping.

Conclusion

Whether Perplexity is being singled out unfairly or genuinely violating web norms, this is a watershed moment. The era of “free data” for AI is ending. Ethics, economics, and new gatekeeping platforms like Cloudflare are pushing a shift toward paid data, greater accountability, and sustainable content partnerships. Unless AI companies adapt, they’ll face locked gates and a fragmented, paywalled Internet — and that ultimately reshapes the foundation of the digital world.


Check out the Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. 

The post Cloudflare vs Perplexity: The Battle Over AI Web Scraping Heats Up appeared first on MarkTechPost.

A Code Implementation to Build a Multi-Agent Research System with OpenAI Agents, Function Tools, Handoffs, and Session Memory

In this tutorial, we begin by showcasing the power of OpenAI Agents as the driving force behind our multi-agent research system. We set up our Colab environment with the OpenAI API key, install the OpenAI Agents SDK, and then define custom function tools (web_search, analyze_data, and save_research) to harness the agents’ capabilities. We instantiate three specialized OpenAI Agents (Research Specialist, Data Analyst, and Research Coordinator), each with clear, role-specific instructions and tool access. We demonstrate how these agents collaborate asynchronously and synchronously, maintain session memory for continuity, and allow rapid experimentation through helper functions. Check out the Full Codes here.

Copy CodeCopiedUse a different Browser!pip install openai-agents python-dotenv

import asyncio
import json
from datetime import datetime
from agents import Agent, Runner, function_tool, SQLiteSession
import os

os.environ['OPENAI_API_KEY'] = 'Use Your Own API Key'

We install openai-agents and python-dotenv, then import asyncio, json, datetime, and the core SDK primitives (Agent, Runner, function_tool, SQLiteSession). We set OPENAI_API_KEY in the environment so we can immediately run our agents in this runtime. Check out the Full Codes here.

Copy CodeCopiedUse a different Browser@function_tool
def web_search(query: str, max_results: int = 3) -> str:
    """Simulate web search results for demonstration"""
    results = [
        f"Result 1 for '{query}': Latest findings show significant developments...",
        f"Result 2 for '{query}': Research indicates new approaches in this field...",
        f"Result 3 for '{query}': Expert analysis suggests important implications..."
    ]
    return f"Search results for '{query}':\n" + "\n".join(results[:max_results])

@function_tool
def analyze_data(data: str, analysis_type: str = "summary") -> str:
    """Analyze provided data with different analysis types"""
    analyses = {
        "summary": f"Summary: The data contains {len(data.split())} key points with main themes around innovation and efficiency.",
        "detailed": f"Detailed Analysis: Breaking down the {len(data)} characters of data reveals patterns in methodology and conclusions.",
        "trends": "Trend Analysis: Current data suggests upward trajectory with 3 major inflection points identified."
    }
    return analyses.get(analysis_type, "Analysis complete: Standard evaluation performed.")

@function_tool
def save_research(title: str, content: str, category: str = "general") -> str:
    """Save research findings to a structured format"""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    research_entry = {
        "title": title,
        "content": content,
        "category": category,
        "timestamp": timestamp,
        "id": f"research_{len(content) % 1000}"
    }
    return f"Research saved: '{title}' in category '{category}' at {timestamp}"

We define three function tools for our agents: web_search simulates quick results, analyze_data returns summary/detailed/trend insights, and save_research stores findings with a timestamped ID. We use them to gather signals, turn text into insights, and persist outputs for later steps. Check out the Full Codes here.

Copy CodeCopiedUse a different Browserresearch_agent = Agent(
    name="Research Specialist",
    instructions="""You are an expert researcher who:
    - Conducts thorough web searches on any topic
    - Analyzes information critically and objectively
    - Identifies key insights and patterns
    - Always uses tools to gather and analyze data before responding""",
    tools=[web_search, analyze_data]
)

analyst_agent = Agent(
    name="Data Analyst",
    instructions="""You are a senior data analyst who:
    - Takes research findings and performs deep analysis
    - Identifies trends, patterns, and actionable insights
    - Creates structured summaries and recommendations
    - Uses analysis tools to enhance understanding""",
    tools=[analyze_data, save_research]
)

coordinator_agent = Agent(
    name="Research Coordinator",
    instructions="""You are a research coordinator who:
    - Manages multi-step research projects
    - Delegates tasks to appropriate specialists
    - Synthesizes findings from multiple sources
    - Makes final decisions on research direction
    - Handoff to research_agent for initial data gathering
    - Handoff to analyst_agent for detailed analysis""",
    handoffs=[research_agent, analyst_agent],
    tools=[save_research]
)

We define three OpenAI Agents with clear roles: the Research Specialist gathers and synthesizes information, the Data Analyst deep-dives and saves structured outputs, and the Research Coordinator orchestrates handoffs and final decisions. Together, we delegate, analyze with tools, and produce actionable summaries end-to-end. Check out the Full Codes here.

Copy CodeCopiedUse a different Browserasync def run_advanced_research_workflow():
    """Demonstrates a complete multi-agent research workflow"""

    session = SQLiteSession("research_session_001")

    print("Starting Advanced Multi-Agent Research System")
    print("=" * 60)

    research_topic = "artificial intelligence in healthcare 2024"

    print(f"\nPHASE 1: Initiating research on '{research_topic}'")
    result1 = await Runner.run(
        coordinator_agent,
        f"I need comprehensive research on '{research_topic}'. Please coordinate a full research workflow including data gathering, analysis, and final report generation.",
        session=session
    )
    print(f"Coordinator Response: {result1.final_output}")

    print("\nPHASE 2: Requesting detailed trend analysis")
    result2 = await Runner.run(
        coordinator_agent,
        "Based on the previous research, I need a detailed trend analysis focusing on emerging opportunities and potential challenges. Save the final analysis for future reference.",
        session=session
    )
    print(f"Analysis Response: {result2.final_output}")

    print("\nPHASE 3: Direct specialist analysis")
    result3 = await Runner.run(
        analyst_agent,
        "Perform a detailed analysis of the healthcare AI market, focusing on regulatory challenges and market opportunities. Categorize this as 'market_analysis'.",
        session=session
    )
    print(f"Specialist Response: {result3.final_output}")

    print("\nResearch workflow completed successfully!")
    return result1, result2, result3

async def run_focused_analysis():
    """Shows focused single-agent capabilities"""

    print("\nFOCUSED ANALYSIS DEMO")
    print("-" * 40)

    result = await Runner.run(
        research_agent,
        "Research in quantum computing and analyze the key breakthroughs from 2024.",
        max_turns=5
    )

    print(f"Focused Analysis Result: {result.final_output}")
    return result

def quick_research_sync(topic: str):
    """Synchronous research for quick queries"""

    print(f"\nQUICK SYNC RESEARCH: {topic}")
    print("-" * 40)

    result = Runner.run_sync(
        research_agent,
        f"Quickly research {topic} and provide 3 key insights."
    )

    print(f"Quick Result: {result.final_output}")
    return result

We run a full multi-agent workflow with session memory (three phases coordinated by the coordinator and analyst). We perform a focused single-agent analysis with a turn cap, and finally, we trigger a quick synchronous research helper for fast, three-insight summaries. Check out the Full Codes here.

Copy CodeCopiedUse a different Browserasync def main():
    """Main function demonstrating all capabilities"""

    print("OpenAI Agents SDK - Advanced Tutorial")
    print("Building a Multi-Agent Research System")
    print("=" * 60)

    try:
        await run_advanced_research_workflow()

        await run_focused_analysis()

        quick_research_sync("blockchain adoption in enterprise")

        print("\nTutorial completed successfully!")
        print("\nKey Features Demonstrated:")
        print("- Multi-agent coordination with handoffs")
        print("- Custom function tools")
        print("- Session memory for conversation continuity")
        print("- Async and sync execution patterns")
        print("- Structured workflows with max_turns control")
        print("- Specialized agent roles and capabilities")

    except Exception as e:
        print(f"Error: {e}")
        print("\nTroubleshooting tips:")
        print("- Ensure OPENAI_API_KEY is set correctly")
        print("- Check internet connection")
        print("- Verify openai-agents package is installed")

if __name__ == "__main__":
    import nest_asyncio
    nest_asyncio.apply()

    asyncio.run(main())

def create_custom_agent(name: str, role: str, tools_list: list = None):
    """Helper function to create custom agents quickly"""
    return Agent(
        name=name,
        instructions=f"You are a {role} who provides expert assistance.",
        tools=tools_list or []
    )

custom_agent = create_custom_agent("Code Reviewer", "senior software engineer", [analyze_data])
result = Runner.run_sync(custom_agent, "Review this Python code for best practices")

print("\nTutorial Notes:")
print("- Modify research topics and agent instructions to explore different use cases")
print("- Add your own custom tools using the @function_tool decorator")
print("- Experiment with different agent handoff patterns")
print("- Use sessions for multi-turn conversations")
print("- Perfect for Colab - just add your OpenAI API key and run!")

We orchestrate the end-to-end demo with main(), running the multi-agent workflow, a focused analysis, and a quick sync task, while handling errors and logging key features. We also provide a helper to spin up custom agents and show a synchronous “Code Reviewer” example for immediate feedback.

In conclusion, we wrap up the Advanced OpenAI Agents tutorial by highlighting the core strengths of this framework: coordinated multi-agent collaboration, extensible custom tools, persistent session memory, and flexible execution modes. We encourage you to expand on these foundations by adding new tools, crafting custom agent roles, and experimenting with different handoff strategies. We emphasize that this modular architecture empowers you to build sophisticated AI-driven research pipelines with minimal boilerplate.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post A Code Implementation to Build a Multi-Agent Research System with OpenAI Agents, Function Tools, Handoffs, and Session Memory appeared first on MarkTechPost.

Meet CoAct-1: A Novel Multi-Agent System that Synergistically Combines GUI-based Control with Direct Programmatic Execution

A team of researchers from USC, Salesforce AI, and the University of Washington has introduced CoAct-1, a pioneering multi-agent computer-using agent (CUA) that marks a significant leap in autonomous computer operation. By elevating coding to a first-class action—on par with traditional GUI manipulation—CoAct-1 overcomes longstanding challenges of efficiency and reliability in complex, long-horizon computer tasks. On the demanding OSWorld benchmark, CoAct-1 sets a new gold standard, achieving a state-of-the-art (SOTA) success rate of 60.76%, making it the first CUA agent to surpass the 60% mark.

Why CoAct-1? Bridging the Efficiency Gap in Computer-Using Agents

Conventional CUA agents rely solely on pixel-based GUI interaction—emulating human users by clicking, typing, and navigating interfaces. While this approach mimics user workflows, it proves fragile and inefficient for intricate, multi-step tasks, especially those involving dense UI layouts, multi-app pipelines, or complex OS operations. Single errors such as a mis-click can derail entire workflows, and sequence lengths balloon as tasks increase in complexity.

Efforts to mitigate these issues have included augmenting GUI agents with high-level planners, as seen in systems like GTA-1 and modular multi-agent frameworks. However, these methods cannot escape the bottleneck of GUI-centric action spaces, ultimately limiting both efficiency and robustness.

CoAct-1: Hybrid Architecture with Coding as Action

CoAct-1 takes a fundamentally different approach by integrating three specialized agents:

Orchestrator: The high-level planner that decomposes complex tasks and dynamically delegates each subtask either to the Programmer or the GUI Operator based on task requirements.

Programmer: Executes backend operations—file management, data processing, environment configuration—directly via Python or Bash scripts, bypassing cumbersome GUI action sequences.

GUI Operator: Uses a vision-language model to interact with visual interfaces when human-like UI navigation is indispensable.

This hybrid model enables CoAct-1 to strategically substitute brittle and lengthy mouse-keyboard operations with concise, reliable code execution, while still leveraging GUI interactions where necessary.
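
The paper's code is not reproduced here, but the control flow can be pictured with a small sketch; the class and function names below are our own illustrative stand-ins, not CoAct-1's actual implementation.

from dataclasses import dataclass

@dataclass
class Subtask:
    description: str
    needs_gui: bool  # e.g., toggling a settings switch vs. batch-renaming files

def programmer(sub: Subtask) -> None:
    print(f"[programmer] writing and running a script for: {sub.description}")

def gui_operator(sub: Subtask) -> None:
    print(f"[gui operator] clicking/typing through the UI for: {sub.description}")

def orchestrate(subtasks: list) -> None:
    # The orchestrator decomposes the task and routes each subtask to the
    # cheapest reliable executor: code when possible, GUI when unavoidable.
    for sub in subtasks:
        gui_operator(sub) if sub.needs_gui else programmer(sub)

orchestrate([
    Subtask("resize every screenshot in ~/Pictures", needs_gui=False),
    Subtask("enable dark mode in the settings dialog", needs_gui=True),
])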

Evaluation on OSWorld: Record-Setting Performance

OSWorld—a leading benchmark featuring 369 tasks spanning office productivity, IDEs, browsers, file managers, and multi-app workflows—proves an exacting testbed for agentic systems. Each task mirrors real-world language goals and is assessed by a granular rule-based scoring system.

Results

Overall SOTA Success Rate: CoAct-1 achieves 60.76% on the 100+ step category—the first CUA agent to cross the 60-point threshold. This outpaces GTA-1 (53.10%), OpenAI CUA 4o (31.40%), UI-TARS-1.5 (29.60%), and other leading frameworks.

Stepped Allowance Performance: At a 100-step budget, CoAct-1 scores 59.93%, again leading all competitors.

Efficiency: CoAct-1 completes successful tasks in an average of 10.15 steps, versus 15.22 for GTA-1 and 14.90 for UI-TARS; OpenAI CUA 4o uses fewer steps (6.14) but reaches only 31.40% success.

Breakdown

CoAct-1 dominates across task types, with especially large gains in workflows benefitting from code execution:

Multi-App: 47.88% (vs. GTA-1’s 38.34%)

OS Tasks: 75.00%

VLC: 66.07%

In productivity and IDE domains (LibreOffice Calc, Writer, VSCode), it consistently leads or ties with the SOTA.

Key Insights: What Drives CoAct-1’s Gains?

Coding Actions Replace Redundant GUI Sequences: For operations like batch image resizing or advanced file manipulations, single scripts replace dozens of error-prone clicks, reducing both steps and risk of failure.

Dynamic Delegation: The Orchestrator’s flexible task assignment ensures optimal use of coding vs. GUI actions.

Improvement with Stronger Backbones: The best configuration uses OpenAI CUA 4o for the GUI Operator, OpenAI o3 for the Orchestrator, and o4-mini for the Programmer, reaching the top 60.76% score. Systems using only smaller or less capable backbones score significantly lower.

Efficiency Correlates with Reliability: Fewer steps directly reduce opportunities for error—the single strongest predictor of successful completion.

Conclusion: A Leap Forward in Generalized Computer Automation

By making coding a first-class system action alongside GUI manipulation, CoAct-1 delivers both a quantum leap in success and efficiency, and illustrates the practical path forward for scalable, reliable autonomous computer agents. Its hybrid architecture and dynamic execution logic set a new high-water mark for the CUA field, heralding robust advances in real-world computer automation.

Check out the Paper and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Meet CoAct-1: A Novel Multi-Agent System that Synergistically Combines GUI-based Control with Direct Programmatic Execution appeared first on MarkTechPost.