Anthropic Introduces Code Review via Claude Code to Automate Complex Security Research Using Advanced Agentic Multi-Step Reasoning Loops

In the frantic arms race of ‘AI for code,’ we’ve moved past the era of the glorified autocomplete. Today, Anthropic is doubling down on a more ambitious vision: the AI agent that doesn’t just write your boilerplate, but actually understands why your Kubernetes cluster is screaming at 3:00 AM.

With the recent launch of Claude Code and its high-octane Code Review capabilities, Anthropic is signaling a shift from ‘chatbot’ to ‘collaborator.’ For devs drowning in legacy technical debt, the message is clear: the bar for ‘good enough’ code just got a lot higher.

The Agentic Leap: Beyond Static Analysis

The core of this update is the transition to agentic coding. Unlike traditional Static Application Security Testing (SAST) tools that rely on rigid pattern matching, Claude Code operates as a stateful agent. According to Anthropic’s latest internal benchmarks, the model can now chain together an average of 21.2 independent tool calls—such as editing files, running terminal commands, and navigating directories—without needing human intervention. That’s a 116% increase in autonomy over the last six months.

This means Claude isn’t just looking at a single file; it’s reasoning across your entire repository. It uses a specialized CLAUDE.md file—a ‘manual’ for the AI—to understand project-specific conventions, data pipeline dependencies, and infrastructure quirks.
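A CLAUDE.md file is ordinary markdown. The sketch below is a hypothetical illustration of the kinds of conventions a team might record in it; the commands, paths, and service names are invented for the example, not taken from any real project:

```markdown
# Project conventions for Claude

## Build & test
- Run `make test` before proposing any patch; CI mirrors this target.

## Code style
- New Python modules use type hints and `ruff` formatting.

## Infrastructure quirks
- The `payments` service reads config from Vault, not from `.env` files.
- Migrations in `db/migrations/` must be reversible.
```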

Inside the ‘Code Review’ Engine

When you run a review via Claude Code, the model isn’t just checking for missing semicolons. It’s performing what Anthropic calls frontier cybersecurity reasoning.

Take the recent pilot with Mozilla’s Firefox. In just two weeks, Claude Opus 4.6 scanned the browser’s massive codebase and surfaced 22 vulnerabilities. More impressively, 14 of those were classified as high-severity. To put that in perspective: the entire global security research community typically reports about 70 such bugs for Firefox in a full year.

How does it do it?

Logical Reasoning over Pattern Matching: Instead of looking for a ‘known bad’ string, Claude reasons about algorithms. In the CGIF library, it discovered a heap buffer overflow by analyzing the LZW compression logic—a bug that had evaded traditional coverage-guided fuzzing for decades.

Multi-Stage Verification: Every finding goes through a self-correction loop. Claude attempts to ‘disprove’ its own vulnerability report to filter out the false positives that typically plague AI-generated reviews.

Remediation Directives: It doesn’t just point at the fire; it hands you the extinguisher. The tool suggests targeted patches that engineers can approve or iterate on in real-time within the CLI.
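Anthropic hasn’t published the internals of this pipeline, but the control flow it describes — propose findings, try to disprove each one, attach a patch to the survivors — can be sketched in a few lines. Everything here is a stand-in: `Finding`, `attempt_disproof`, and `suggest_patch` are hypothetical names, not a real Claude Code API:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    description: str
    patch: str = ""

def review(findings, attempt_disproof, suggest_patch):
    """Sketch of the propose/disprove/patch loop (all names illustrative)."""
    confirmed = []
    for f in findings:                  # stage 1: candidate findings from the model
        if attempt_disproof(f):         # stage 2: self-correction filters false positives
            continue
        f.patch = suggest_patch(f)      # stage 3: attach a remediation directive
        confirmed.append(f)
    return confirmed

# Toy stand-ins for the model calls:
findings = [Finding("heap overflow in lzw_decode"), Finding("benign lint warning")]
kept = review(findings,
              attempt_disproof=lambda f: "lint" in f.description,
              suggest_patch=lambda f: "bounds-check the write index")
print([f.description for f in kept])  # only the surviving finding remains
```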

The Technical Stack: MCP and ‘Auto-Accept’ Mode

Anthropic is pushing the Model Context Protocol (MCP) as the standard for how these agents interact with your data. By using MCP servers instead of raw CLI access for sensitive databases (like BigQuery), dev teams can maintain granular security logging while letting Claude perform complex data migrations or infrastructure debugging.
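In Claude Code, project-scoped MCP servers are typically declared in a `.mcp.json` file at the repository root. The shape below follows that convention, but the server name and npm package are placeholders for illustration, not a specific BigQuery connector:

```json
{
  "mcpServers": {
    "analytics-db": {
      "command": "npx",
      "args": ["-y", "example-bigquery-mcp-server"],
      "env": { "PROJECT_ID": "my-gcp-project" }
    }
  }
}
```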

One of the key features making waves is Auto-Accept Mode (triggered by Shift+Tab). This allows devs to set up autonomous loops where Claude writes code, runs tests, and iterates until the tests pass. It’s high-velocity ‘vibe coding’ for the enterprise, though Anthropic warns that humans should still be the final gatekeepers for critical business logic.
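The test-until-green loop this enables is ordinary orchestration logic. In the sketch below, `tests_pass`, `next_patch`, and `apply_patch` are hypothetical stand-ins for running the suite and invoking the agent; nothing here is Claude Code’s actual implementation:

```python
def auto_accept_loop(tests_pass, next_patch, apply_patch, max_rounds=5):
    """Sketch of an Auto-Accept loop: edit, test, iterate until green.
    The three callables are illustrative stand-ins, not a real API."""
    for round_no in range(max_rounds):
        if tests_pass():
            return round_no                  # patches needed before the suite went green
        apply_patch(next_patch())            # ask the agent for a fix and apply it
    raise RuntimeError("still red after max_rounds; escalate to a human gatekeeper")

# Toy run: the "suite" goes green after two patches are applied.
state = {"patches": 0}
rounds = auto_accept_loop(
    tests_pass=lambda: state["patches"] >= 2,
    next_patch=lambda: "fix",
    apply_patch=lambda p: state.update(patches=state["patches"] + 1),
)
print(rounds)  # 2
```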

Key Takeaways

The Shift to Agentic Autonomy: We have moved beyond simple code completion to agentic coding. Claude Code can now chain an average of 21.2 independent tool calls (editing files, running terminal commands, and navigating directories) without human intervention—a 116% increase in autonomy over the last six months.

Superior Vulnerability Detection: In a landmark pilot with Mozilla, Claude surfaced 22 unique vulnerabilities in Firefox in just two weeks. 14 were high-severity, representing nearly 20% of the high-severity bugs typically found by the entire global research community in a full year.

Logical Reasoning vs. Pattern Matching: Unlike traditional SAST tools that look for ‘known bad’ code strings, Claude uses frontier cybersecurity reasoning. It identified a decades-old heap buffer overflow in the CGIF library by logically analyzing LZW compression algorithms, a feat that had previously evaded expert human review and automated fuzzing.

Standardized Context with CLAUDE.md and MCP: Professional integration now relies on the CLAUDE.md file to provide the AI with project-specific ‘manuals’ and the Model Context Protocol (MCP) to allow the agent to interact securely with external data sources like BigQuery or Snowflake without compromising sensitive credentials.

The ‘Auto-Accept’ Workflow: For high-velocity development, the Shift+Tab shortcut allows devs to toggle into Auto-Accept Mode. This enables an autonomous loop where the agent writes code, runs tests, and iterates until the task is solved, transforming the developer’s role from a ‘writer’ to an ‘editor/director.’

Check out the Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 120k+ ML SubReddit and subscribe to our Newsletter. And if you’re on Telegram, you can now join us there as well.
The post Anthropic Introduces Code Review via Claude Code to Automate Complex Security Research Using Advanced Agentic Multi-Step Reasoning Loops appeared first on MarkTechPost.

Andrew Ng’s Team Releases Context Hub: An Open Source Tool that Gives Your Coding Agent the Up-to-Date API Documentation It Needs

In the fast-moving world of agentic workflows, the most powerful AI model is still only as good as its documentation. Today, Andrew Ng and his team at DeepLearning.AI officially launched Context Hub, an open-source tool designed to bridge the gap between an agent’s static training data and the rapidly evolving reality of modern APIs.

The pain point is familiar: you ask an agent like Claude Code to build a feature, but it hallucinates a parameter that was deprecated six months ago or fails to use a more efficient, newer endpoint. Context Hub provides a simple CLI-based solution to ensure your coding agent always has the ‘ground truth’ it needs to perform.

The Problem: When LLMs Live in the Past

Large Language Models (LLMs) are frozen in time the moment their training ends. While Retrieval-Augmented Generation (RAG) has helped ground models in private data, the ‘public’ documentation they rely on is often a mess of outdated blog posts, legacy SDK examples, and deprecated StackOverflow threads.

The result is what developers are calling ‘Agent Drift.’ Consider a hypothetical but highly plausible scenario: a dev asks an agent to call OpenAI’s GPT-5.2. Even if the newer responses API has been the industry standard for a year, the agent—relying on its core training—might stubbornly stick to the older chat completions API. This leads to broken code, wasted tokens, and hours of manual debugging.

Coding agents often use outdated APIs and hallucinate parameters. Context Hub is designed to intervene at the exact moment an agent starts guessing.

chub: The CLI for Agent Context

At its core, Context Hub is built around a lightweight CLI tool called chub. It functions as a curated registry of up-to-date, versioned documentation, served in a format optimized for LLM consumption.

Instead of an agent scraping the web and getting lost in noisy HTML, it uses chub to fetch precise markdown docs. The workflow is straightforward: you install the tool and then prompt your agent to use it.

The standard chub toolset includes:

chub search: Allows the agent to find the specific API or skill it needs.

chub get: Fetches the curated documentation, often supporting specific language variants (e.g., --lang py or --lang js) to minimize token waste.

chub annotate: This is where the tool begins to differentiate itself from a standard search engine.

The Self-Improving Agent: Annotations and Workarounds

One of the most compelling features is the ability for agents to ‘remember’ technical hurdles. Historically, if an agent discovered a specific workaround for a bug in a beta library, that knowledge would vanish the moment the session ended.

With Context Hub, an agent can use the chub annotate command to save a note to the local documentation registry. For example, if an agent realizes that a specific webhook verification requires a raw body rather than a parsed JSON object, it can run:

chub annotate stripe/api "Needs raw body for webhook verification"

In the next session, when the agent (or any agent on that machine) runs chub get stripe/api, that note is automatically appended to the documentation. This effectively gives coding agents a “long-term memory” for technical nuances, preventing them from rediscovering the same wheel every morning.
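The append-on-get behavior is simple to picture. The sketch below is our own toy model of a local annotation registry, not Context Hub’s actual implementation (the on-disk layout and function names are invented):

```python
import tempfile
from pathlib import Path

REGISTRY = Path(tempfile.mkdtemp())  # stand-in for the local docs registry

def annotate(doc_id, note):
    """Append a note to the local copy of a doc (toy model of `chub annotate`)."""
    path = REGISTRY / f"{doc_id}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(f"\n> NOTE: {note}\n")

def get(doc_id, base_text=""):
    """Return the doc with saved annotations appended (toy model of `chub get`)."""
    path = REGISTRY / f"{doc_id}.md"
    saved = path.read_text() if path.exists() else ""
    return base_text + saved

annotate("stripe/api", "Needs raw body for webhook verification")
print("raw body" in get("stripe/api", base_text="# Stripe API docs\n"))  # True
```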

Crowdsourcing the ‘Ground Truth’

While annotations remain local to the developer’s machine, Context Hub also introduces a feedback loop designed to benefit the entire community. Through the chub feedback command, agents can rate documentation with up or down votes and apply specific labels like accurate, outdated, or wrong-examples.

This feedback flows back to the maintainers of the Context Hub registry. Over time, the most reliable documentation surfaces to the top, while outdated entries are flagged and updated by the community. It’s a decentralized approach to maintaining documentation that evolves as fast as the code it describes.

Key Takeaways

Solves ‘Agent Drift’: Context Hub addresses the critical issue where AI agents rely on their static training data, causing them to use outdated APIs or hallucinate parameters that no longer exist.

CLI-Driven Ground Truth: Through the chub CLI, agents can instantly fetch curated, LLM-optimized markdown documentation for specific APIs, ensuring they build with the most modern standards (e.g., using the newer OpenAI Responses API instead of Chat Completions).

Persistent Agent Memory: The chub annotate feature allows agents to save specific technical workarounds or notes to a local registry. This prevents the agent from having to ‘rediscover’ the same solution in future sessions.

Collaborative Intelligence: By using chub feedback, agents can vote on the accuracy of documentation. This creates a crowdsourced ‘ground truth’ where the most reliable and up-to-date resources surface for the entire developer community.

Language-Specific Precision: The tool minimizes ‘token waste’ by allowing agents to request documentation specifically tailored to their current stack (using flags like --lang py or --lang js), making the context both dense and highly relevant.

Check out the GitHub Repo.
The post Andrew Ng’s Team Releases Context Hub: An Open Source Tool that Gives Your Coding Agent the Up-to-Date API Documentation It Needs appeared first on MarkTechPost.

The ‘Bayesian’ Upgrade: Why Google AI’s New Teaching Method is the Key to LLM Reasoning

Large Language Models (LLMs) are the world’s best mimics, but when it comes to the cold, hard logic of updating beliefs based on new evidence, they are surprisingly stubborn. A team of researchers from Google argue that the current crop of AI agents falls far short of ‘probabilistic reasoning’—the ability to maintain and update a ‘world model’ as new information trickles in.

The solution? Stop trying to give them the right answers and start teaching them how to guess like a mathematician.

The Problem: The ‘One-and-Done’ Plateau

While LLMs like Gemini-1.5 Pro and GPT-4.1 Mini can write code or summarize emails, they struggle as interactive agents. Imagine a flight booking assistant: it needs to infer your preferences (price vs. duration) by watching which flights you pick over several rounds.

The research team found that off-the-shelf LLMs—including heavyweights like Llama-3-70B and Qwen-2.5-32B—showed ‘little or no improvement’ after the first round of interaction. While a ‘Bayesian Assistant’ (a symbolic model using Bayes’ rule) gets more accurate with every data point, standard LLMs plateaued almost immediately, failing to adapt their internal ‘beliefs’ to the user’s specific reward function.

Meet Bayesian Teaching

The research team introduced a technique called Bayesian Teaching. Instead of fine-tuning a model on ‘correct’ data (what they call an Oracle Teacher), they fine-tuned it to mimic a Bayesian Assistant—a model that explicitly uses Bayes’ rule to update a probability distribution over possible user preferences.

Here is the technical breakdown:

The Task: A five-round flight recommendation interaction. Flights are defined by features like price, duration, and stops.

The Reward Function: A vector representing user preferences (e.g., a strong preference for low prices).

The Posterior Update: After each round, the Bayesian Assistant updates its posterior distribution based on the prior (initial assumptions) and the likelihood (the probability the user would pick a certain flight given a specific reward function).

By using Supervised Fine-Tuning (SFT) on these Bayesian interactions, the research team forced the LLMs to adopt the process of reasoning under uncertainty, not just the final result.
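The paper’s exact likelihood model isn’t reproduced here, but the posterior update the Bayesian Assistant performs can be illustrated with a discrete hypothesis space over reward functions and a softmax choice likelihood (a common assumption for preference inference; the flight features and numbers below are invented):

```python
import math

# Candidate reward functions: weights on (cheapness, shortness) of a flight.
hypotheses = {"price-sensitive": (0.9, 0.1), "time-sensitive": (0.1, 0.9)}
prior = {h: 0.5 for h in hypotheses}

def choice_likelihood(chosen, options, weights):
    """Softmax probability the user picks `chosen` given a reward vector."""
    def utility(flight):  # flight = (cheapness, shortness), each in [0, 1]
        return weights[0] * flight[0] + weights[1] * flight[1]
    z = sum(math.exp(utility(o)) for o in options)
    return math.exp(utility(chosen)) / z

def update(prior, chosen, options):
    """One round of Bayes' rule: posterior proportional to prior x likelihood."""
    unnorm = {h: p * choice_likelihood(chosen, options, hypotheses[h])
              for h, p in prior.items()}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# The user repeatedly picks the cheap-but-slow flight over the fast-but-pricey one.
options = [(1.0, 0.0), (0.0, 1.0)]
posterior = prior
for _ in range(5):
    posterior = update(posterior, chosen=options[0], options=options)
print(posterior["price-sensitive"] > 0.9)  # belief concentrates round by round
```

This per-round concentration of belief is exactly what the off-the-shelf LLMs in the study failed to do, and what Bayesian Teaching distills into them.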

Why ‘Educated Guesses’ Beat Correct Answers

The most counter-intuitive finding of the research is that Bayesian Teaching consistently outperformed Oracle Teaching.

In ‘Oracle Teaching,’ the model is trained on a teacher that already knows exactly what the user wants. In ‘Bayesian Teaching,’ the teacher is often wrong in early rounds because it is still learning. However, those ‘educated guesses’ provide a much stronger learning signal. By watching the Bayesian Assistant struggle with uncertainty and then update its beliefs after receiving feedback, the LLM learns the ‘skill’ of belief updating.

The results were stark: Bayesian-tuned models (like Gemma-2-9B or Llama-3-8B) were not only more accurate but agreed with the ‘gold standard’ Bayesian strategy roughly 80% of the time—significantly higher than their original versions.

Generalization: Beyond Flights to Web Shopping

For devs, the ‘holy grail’ is generalization. A model trained on flight data shouldn’t just be good at flights; it should understand the concept of learning from a user.

The research team tested their fine-tuned models on:

Increased Complexity: Moving from four flight features to eight.

New Domains: Hotel recommendations.

Real-World Scenarios: A web shopping task using real products (titles and descriptions) from a simulated environment.

Even though the models were only fine-tuned on synthetic flight data, they successfully transferred those probabilistic reasoning skills to hotel booking and web shopping. In fact, the Bayesian LLMs even outperformed human participants in some rounds, as humans often deviate from normative reasoning standards due to biases or inattention.

The Neuro-Symbolic Bridge

This research highlights a unique strength of deep learning: the ability to distill a classic, symbolic model (the Bayesian Assistant) into a neural network (the LLM).

While symbolic models are great for simple, codified tasks, they are notoriously difficult to build for ‘messy’ real-world domains like web shopping. By teaching the LLM to mimic the symbolic model’s strategy, it is possible to get the best of both worlds: the rigorous reasoning of a Bayesian and the flexible, natural-language understanding of a transformer.

Key Takeaways

LLMs Struggle with Belief Updating: Off-the-shelf LLMs, including state-of-the-art models like Gemini-1.5 Pro and GPT-4.1 Mini, fail to effectively update their beliefs as they receive new information, with performance often plateauing after a single interaction.

Bayesian Teaching Outperforms Direct Training: Teaching an LLM to mimic the ‘educated guesses’ and uncertainty of a normative Bayesian model is more effective than training it directly on correct answers (oracle teaching).

Probabilistic Skills Generalize Across Domains: LLMs fine-tuned on simple synthetic tasks (e.g., flight recommendations) can successfully transfer their belief-updating skills to more complex, real-world scenarios like web shopping and hotel recommendations.

Neural Models Are More Robust to Human Noise: While a purely symbolic Bayesian model is optimal for consistent simulated users, fine-tuned LLMs demonstrate greater robustness when interacting with humans, whose choices often deviate from their stated preferences due to noise or bias.

Effective Distillation of Symbolic Strategies: The research proves that LLMs can learn to approximate complex symbolic reasoning strategies through supervised fine-tuning, allowing them to apply these strategies in domains too messy or complex to be codified explicitly in a classic symbolic model.

Check out the Paper and Technical details.
The post The ‘Bayesian’ Upgrade: Why Google AI’s New Teaching Method is the Key to LLM Reasoning appeared first on MarkTechPost.

Run NVIDIA Nemotron 3 Nano as a fully managed serverless model on Amazon Bedrock

This post is cowritten with Abdullahi Olaoye, Curtice Lockhart, Nirmal Kumar Juluru from NVIDIA.
We are excited to announce that NVIDIA’s Nemotron 3 Nano is now available as a fully managed and serverless model in Amazon Bedrock. This follows our earlier announcement at AWS re:Invent supporting NVIDIA Nemotron 2 Nano 9B and NVIDIA Nemotron 2 Nano VL 12B models.
With NVIDIA Nemotron open models on Amazon Bedrock, you can accelerate innovation and deliver tangible business value without having to manage infrastructure complexities. You can power your generative AI applications with Nemotron’s capabilities through the inference capabilities of Amazon Bedrock and harness the benefit of its extensive features and tooling.
This post explores the technical characteristics of the NVIDIA Nemotron 3 Nano model and discusses potential application use cases. Additionally, it provides technical guidance to help you get started using this model for your generative AI applications within the Amazon Bedrock environment.
About Nemotron 3 Nano
NVIDIA Nemotron 3 Nano is a small language model (SLM) with a hybrid Mixture-of-Experts (MoE) architecture that delivers high compute efficiency and accuracy, which developers can use to build specialized agentic AI systems. The model is fully open, with open weights, datasets, and training recipes, facilitating transparency and confidence for developers and enterprises. Compared to other similarly sized models, Nemotron 3 Nano excels in coding and reasoning tasks, taking the lead on benchmarks such as SWE Bench Verified, AIME 2025, Arena Hard v2, and IFBench.
Model overview:

Architecture:

Mixture-of-Experts (MoE) with Hybrid Transformer-Mamba Architecture
Supports a token budget, maintaining accuracy while avoiding overthinking

Accuracy:

Leading accuracy on coding, scientific reasoning, math, tool calling, instruction following, and chat
Nemotron 3 Nano leads on benchmarks such as SWE Bench, AIME 2025, Humanity’s Last Exam, IFBench, RULER, and Arena Hard (compared to other open MoE language models with 30 billion or fewer parameters)

Model size: 30B total parameters, 3B active
Context length: 256K
Model input: Text
Model output: Text

Nemotron 3 Nano combines Mamba, Transformer, and Mixture-of-Experts layers into a single backbone to help balance efficiency, reasoning accuracy, and scale. Mamba enables long-range sequence modeling with low memory overhead, while Transformer layers help add precise attention for structured reasoning tasks like code, math, and planning. MoE routing further boosts scalability by activating only a subset of experts per token, helping to improve latency and throughput. This makes Nemotron 3 Nano especially well-suited for agent clusters running many concurrent, lightweight workflows.
To learn more about Nemotron 3 Nano’s architecture and how it is trained, see Inside NVIDIA Nemotron 3: Techniques, Tools, and Data That Make It Efficient and Accurate.
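The MoE idea — only a few experts fire per token — can be sketched in a few lines. This is a generic top-k router for illustration, not Nemotron’s actual routing code:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, expert_fns, x, k=2):
    """Generic top-k MoE routing: run only the k highest-scoring experts
    and mix their outputs by renormalized gate weights."""
    gates = softmax(router_logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    z = sum(gates[i] for i in top)
    return sum(gates[i] / z * expert_fns[i](x) for i in top)

# 4 experts, but only 2 run per token — analogous to 3B active of 30B total.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
y = route([2.0, 1.0, 0.1, -1.0], experts, x=3.0, k=2)
print(round(y, 3))
```

Because only the selected experts execute, compute per token scales with the active parameter count rather than the total, which is what drives the latency and throughput gains described above.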
Model benchmarks
The following image shows that Nemotron 3 Nano leads in the most attractive quadrant in Artificial Analysis Openness Index vs. Intelligence Index. Why openness matters: It builds trust through transparency. Developers and enterprises can confidently build on Nemotron with clear visibility into the model, data pipeline, and data characteristics, enabling straightforward auditing and governance.

Title: Chart showing Nemotron 3 Nano in the most attractive quadrant in Artificial Analysis Openness vs Intelligence Index (Source: Artificial Analysis)
As shown in the following image, Nemotron 3 Nano provides leading accuracy with the highest efficiency among the open models and scores an impressive 52 points, a significant jump over the previous Nemotron 2 Nano model. Token demand is increasing due to agentic AI, so the ability to ‘think fast’ (arrive at the correct answer quickly while using fewer tokens) is critical. Nemotron 3 Nano delivers high throughput with its efficient Hybrid Transformer-Mamba and MoE architecture.

Title: NVIDIA Nemotron 3 Nano provides highest efficiency with leading accuracy among open models with an impressive 52 points score on Artificial Analysis Intelligence vs. Output Speed Index. (Source: Artificial Analysis)
NVIDIA Nemotron 3 Nano use cases
Nemotron 3 Nano helps power various use cases for different industries. Some of the use cases include:

Finance – Accelerate loan processing by extracting data, analyzing income patterns, and detecting fraudulent operations, reducing cycle times and risk.
Cybersecurity – Automatically triage vulnerabilities, perform in-depth malware analysis, and proactively hunt for security threats.
Software development – Assist with tasks like code summarization.
Retail – Optimize inventory management and help enhance in-store service with real-time, personalized product recommendations and support.

Get started with NVIDIA Nemotron 3 Nano in Amazon Bedrock
To test NVIDIA Nemotron 3 Nano in Amazon Bedrock, complete the following steps:

Navigate to the Amazon Bedrock console and select Chat/Text playground from the left menu (under the Test section).
Choose Select model in the upper-left corner of the playground.
Choose NVIDIA from the category list, then select NVIDIA Nemotron 3 Nano.
Choose Apply to load the model.

After selection, you can test the model immediately. Let’s use the following prompt to generate a unit test in Python code using the pytest framework:
Write a pytest unit test suite for a Python function called calculate_mortgage(principal, rate, years). Include test cases for: 1) A standard 30-year fixed loan 2) An edge case with 0% interest 3) Error handling for negative input values.
Complex tasks like this prompt can benefit from a chain of thought approach to help produce a precise result based on the reasoning capabilities built natively into the model.
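For reference, here is one way the function and tests from that prompt could look. This is our own illustrative implementation of the standard amortization formula, not output from the model; the assertions are written plainly here, where a pytest suite would wrap each case in a test_* function and use pytest.raises for the error case:

```python
import math

def calculate_mortgage(principal, rate, years):
    """Monthly payment for a fixed-rate loan; `rate` is the annual rate (e.g. 0.06)."""
    if principal < 0 or rate < 0 or years <= 0:
        raise ValueError("principal and rate must be non-negative, years positive")
    n = years * 12
    if rate == 0:
        return principal / n                 # 0% interest edge case: straight division
    r = rate / 12
    return principal * r * (1 + r) ** n / ((1 + r) ** n - 1)

# The three cases the prompt asks for:
assert math.isclose(calculate_mortgage(300_000, 0.06, 30), 1798.65, abs_tol=0.01)
assert calculate_mortgage(120_000, 0.0, 10) == 1000.0
try:
    calculate_mortgage(-1, 0.06, 30)
    raise AssertionError("expected ValueError")
except ValueError:
    pass
print("all cases pass")
```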

Using the AWS CLI and SDKs
You can access the model programmatically using the model ID nvidia.nemotron-nano-3-30b. The model supports both the InvokeModel and Converse APIs through the AWS Command Line Interface (AWS CLI) and the AWS SDKs. Additionally, it supports the Amazon Bedrock OpenAI-compatible API.
Run the following command to invoke the model directly from your terminal using the AWS Command Line Interface (AWS CLI) and the InvokeModel API:

aws bedrock-runtime invoke-model \
  --model-id nvidia.nemotron-nano-3-30b \
  --region us-west-2 \
  --body '{"messages": [{"role": "user", "content": "Type_Your_Prompt_Here"}], "max_tokens": 512, "temperature": 0.5, "top_p": 0.9}' \
  --cli-binary-format raw-in-base64-out \
  invoke-model-output.txt

To invoke the model through the AWS SDK for Python (boto3), use the following script to send a prompt to the model, in this case by using the Converse API:

import boto3
from botocore.exceptions import ClientError

# Create a Bedrock Runtime client in the AWS Region you want to use.
client = boto3.client("bedrock-runtime", region_name="us-west-2")

# Set the model ID
model_id = "nvidia.nemotron-nano-3-30b"

# Start a conversation with the user message.
user_message = "Type_Your_Prompt_Here"
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

try:
    # Send the message to the model using a basic inference configuration.
    response = client.converse(
        modelId=model_id,
        messages=conversation,
        inferenceConfig={"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
    )

    # Extract and print the response text.
    response_text = response["output"]["message"]["content"][0]["text"]
    print(response_text)

except ClientError as e:
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    exit(1)

To invoke the model through the Amazon Bedrock OpenAI-compatible ChatCompletions endpoint, you can do so by using the OpenAI SDK:

# Import OpenAI SDK
import os
from openai import OpenAI

# Set environment variables
os.environ["OPENAI_API_KEY"] = "<insert your bedrock API key>"
os.environ["OPENAI_BASE_URL"] = "https://bedrock-runtime.<AWS region>.amazonaws.com/openai/v1"

# Create the client (reads the environment variables above)
client = OpenAI()

# Set the model ID
model_id = "nvidia.nemotron-nano-3-30b"

# Set prompts
system_prompt = "Type_Your_System_Prompt_Here"
user_message = "Type_Your_User_Prompt_Here"

# Use the Chat Completions API
response = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ],
    temperature=0,
    max_completion_tokens=1000
)

# Extract and print the response text
print(response.choices[0].message.content)

Use NVIDIA Nemotron 3 Nano with Amazon Bedrock features
You can enhance your generative AI applications by combining Nemotron 3 Nano with the Amazon Bedrock managed tools. Use Amazon Bedrock Guardrails to implement safeguards and Amazon Bedrock Knowledge Bases to create robust Retrieval Augmented Generation (RAG) workflows.
Amazon Bedrock Guardrails
Guardrails is a managed safety layer that helps enforce responsible AI by filtering harmful content, redacting sensitive information (PII), and blocking specific topics across prompts and responses. It works across multiple models to help detect prompt injection attacks and hallucinations.
Example use case: If you’re building a mortgage assistant, you can help prevent it from offering general investment advice. By configuring a filter for the word “stocks”, user prompts containing that term can be immediately blocked and receive a custom message.
To set up a guardrail, complete the following steps:

In the Amazon Bedrock console, navigate to the Build section on the left and select Guardrails.
Create a new guardrail and configure the necessary filters for your use case.

Once configured, test the guardrail with various prompts to verify its performance. You can then fine-tune settings, such as denied topics, word filters, and PII redaction, to match your specific safety requirements. For a deep dive, see Create your guardrail.
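You can also exercise a guardrail independently of any model invocation through the bedrock-runtime ApplyGuardrail API. The sketch below builds the request payload (the guardrail ID is a placeholder); the commented lines show where the actual call would go:

```python
def build_apply_guardrail_request(guardrail_id, version, user_prompt):
    """Assemble an ApplyGuardrail request for screening a user prompt."""
    return {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": version,
        "source": "INPUT",                       # screen the prompt, not the model response
        "content": [{"text": {"text": user_prompt}}],
    }

request = build_apply_guardrail_request(
    "<INSERT GUARDRAIL ID>", "1", "Should I buy stocks instead of a house?"
)
# import boto3
# client = boto3.client("bedrock-runtime", region_name="us-west-2")
# response = client.apply_guardrail(**request)
# response["action"] is "GUARDRAIL_INTERVENED" when a filter (e.g. the word "stocks") fires.
print(request["source"])
```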
Amazon Bedrock Knowledge Bases
Amazon Bedrock Knowledge Bases automates the complete RAG workflow. It handles ingesting content from your data sources, chunking it into searchable segments, converting them into vector embeddings, and storing them in a vector database. Then, when a user submits a query, the system matches the input against stored vectors to find semantically similar content, which is then used to augment the prompt sent to the foundation model.
For this example, we uploaded PDFs (for example, Buying a New Home, Home Loan Toolkit, Shopping for a Mortgage) to Amazon Simple Storage Service (Amazon S3) and selected Amazon OpenSearch Serverless as the vector store. The following code demonstrates how to query this knowledge base using the RetrieveAndGenerate API, while automatically facilitating safety compliance alignment through a specific Guardrail ID.

import boto3

bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": "I am interested in purchasing a home. What steps should I take to make sure I am prepared to take on a mortgage?"
    },
    retrieveAndGenerateConfiguration={
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<INSERT KNOWLEDGE BASE ID>",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/nvidia.nemotron-nano-3-30b",
            "generationConfiguration": {
                "guardrailConfiguration": {
                    "guardrailId": "<INSERT GUARDRAIL ID>",
                    "guardrailVersion": "1"
                },
                "promptTemplate": {
                    "textPromptTemplate": (
                        "You are a helpful assistant that answers questions about mortgages "
                        "using the provided search results.\n\n"
                        "Search results:\n$search_results$\n\n"
                        "User query:\n$query$\n\n"
                        "Answer clearly and concisely."
                    )
                },
            },
            "orchestrationConfiguration": {
                "promptTemplate": {
                    "textPromptTemplate": (
                        "You are very knowledgeable on mortgages.\n\n"
                        "Conversation so far:\n$conversation_history$\n\n"
                        "User query:\n$query$\n\n"
                        "$output_format_instructions$"
                    )
                }
            },
        },
        "type": "KNOWLEDGE_BASE"
    }
)
print(response)

It directs the NVIDIA Nemotron 3 Nano model to synthesize the retrieved documents into a clear, grounded answer using your custom prompt template. To set up your own pipeline, review the full walkthrough in the Amazon Bedrock User Guide.
Conclusion
In this post, we showed you how to get started with NVIDIA Nemotron 3 Nano on Amazon Bedrock for fully managed serverless inference. We also showed you how to use the model with Amazon Bedrock Knowledge Bases and Amazon Bedrock Guardrails. The model is now available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Mumbai), South America (Sao Paulo), Europe (London), and Europe (Milan) AWS Regions. Check the full Region list for future updates. To learn more, check out NVIDIA Nemotron and give NVIDIA Nemotron 3 Nano a try in the Amazon Bedrock console today.

About the authors

Antonio Rodriguez
Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of different sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Aris Tsakpinis
Aris Tsakpinis is a Senior Specialist Solutions Architect for Generative AI focusing on open weight models on Amazon Bedrock and the broader generative AI open-source environment. Alongside his professional role, he is pursuing a PhD in Machine Learning Engineering at the University of Regensburg, where his research focuses on applied generative AI in scientific domains.

Abdullahi Olaoye
Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with cloud providers to help enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Curtice Lockhart
Curtice Lockhart is an AI Solutions Architect at NVIDIA, where he helps customers deploy language and vision models to build end-to-end AI workflows using NVIDIA’s tooling on AWS. He enjoys making complex AI concepts feel approachable and spending his time exploring the art, music, and being outdoors.

Nirmal Kumar Juluru
Nirmal Kumar Juluru is a product marketing manager at NVIDIA driving the adoption of Nemotron and NeMo. He previously worked as a software developer. Nirmal holds an MBA from Carnegie Mellon University and a bachelors in computer science from BITS Pilani.

Access Anthropic Claude models in India on Amazon Bedrock with Global cross-Region inference

The adoption of generative AI inference has accelerated as organizations move more AI-powered workloads into production at scale. To help customers reach that scale, Amazon Bedrock offers cross-Region inference (CRIS) profiles, a powerful feature that organizations can use to seamlessly distribute inference processing across multiple AWS Regions. This capability delivers higher throughput as you build at scale and helps keep your generative AI applications responsive and reliable even under heavy load.
We are excited to introduce Global cross-Region inference for Amazon Bedrock and bring Anthropic's Claude models to India. Amazon Bedrock now offers Anthropic's Claude Opus 4.6, Claude Sonnet 4.6, and Claude Haiku 4.5 through Global cross-Region inference (Global CRIS) for customers operating in India. These frontier models deliver a 1-million-token context window and advanced agentic capabilities, so your applications can process vast datasets and complex workflows with speed and intelligence. With this launch, customers using ap-south-1 (Mumbai) and ap-south-2 (Hyderabad) can access Anthropic's latest Claude models on Amazon Bedrock while benefiting from global inference capacity and highly available inference managed by Amazon Bedrock. With Global CRIS, customers can scale inference workloads seamlessly, improve resiliency, and reduce operational complexity. In this post, you will learn how to use Amazon Bedrock's Global cross-Region inference for Claude models in India. We walk through the capabilities of each Claude model variant and provide a code example to help you start building generative AI applications immediately.
Core functionality of Global cross-Region inference
Global cross-Region inference helps organizations manage unplanned traffic bursts by drawing on inference capacity across commercial AWS Regions (Regions other than the AWS GovCloud (US) Regions and the China Regions) globally. This section explores how the Global cross-Region inference feature works and the technical mechanisms that power it.
Understanding inference profiles
Global cross-Region inference is offered through Inference profiles. Inference profiles operate on two key concepts:

Source Region – The Region from which the API request is made
Destination Region – A Region to which Amazon Bedrock can route the request for inference

To use Anthropic models, Amazon Bedrock offers out-of-the-box global inference profiles. For example:

Opus 4.6: global.anthropic.claude-opus-4-6-v1
Sonnet 4.6: global.anthropic.claude-sonnet-4-6
Opus 4.5: global.anthropic.claude-opus-4-5-20251101-v1:0
Sonnet 4.5: global.anthropic.claude-sonnet-4-5-20250929-v1:0
Haiku 4.5: global.anthropic.claude-haiku-4-5-20251001-v1:0

For customers in India using BOM (ap-south-1) and HYD (ap-south-2), the respective source and destinations would be as follows:

Source -> Destination

BOM (ap-south-1) -> AWS commercial Regions
HYD (ap-south-2) -> AWS commercial Regions

For information about considerations for choosing between Geographic and Global cross-Region inference, see Choosing between Geographic and Global cross-Region inference in the Amazon Bedrock User Guide.
Implementing global cross-Region inference
As of today, you can implement global CRIS for the following models.

Name
Model
Inference profile ID
Inference Processing Destination Regions

Global Anthropic Claude Opus 4.6
Claude Opus 4.6
global.anthropic.claude-opus-4-6-v1
Commercial AWS Regions

Global Anthropic Claude Sonnet 4.6
Claude Sonnet 4.6
global.anthropic.claude-sonnet-4-6
Commercial AWS Regions

Global Anthropic Claude Haiku 4.5
Claude Haiku 4.5
global.anthropic.claude-haiku-4-5-20251001-v1:0
Commercial AWS Regions

Global Claude Sonnet 4.5
Claude Sonnet 4.5
global.anthropic.claude-sonnet-4-5-20250929-v1:0
Commercial AWS Regions

Global Anthropic Claude Opus 4.5
Claude Opus 4.5
global.anthropic.claude-opus-4-5-20251101-v1:0
Commercial AWS Regions

For example, to use Global cross-Region inference with Anthropic’s Claude Opus 4.5, complete the following key steps:

Use the global inference profile ID – When making API calls to Amazon Bedrock, specify the global Claude Opus 4.5 inference profile ID (global.anthropic.claude-opus-4-5-20251101-v1:0) instead of a Region-specific model ID. This works with the InvokeModel, InvokeModelWithResponseStream, Converse, and ConverseStream APIs.
Configure IAM permissions – To enable Global cross-Region inference for your users, apply a three-part AWS Identity and Access Management (IAM) policy to the role. For more information, see Configuring IAM policy for global cross-Region inference.
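As a minimal, hedged sketch of the first step, the snippet below shows the global profile ID passed where a model ID normally goes in an InvokeModel call. It assumes the Anthropic Messages request schema on Bedrock; `build_request_body` and `invoke` are illustrative helper names, not part of any SDK.

```python
import json

# Global inference profile ID used in place of a Region-specific model ID
MODEL_ID = "global.anthropic.claude-opus-4-5-20251101-v1:0"

def build_request_body(prompt, max_tokens=256):
    # Anthropic models on Bedrock accept a Messages-API-style JSON body
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    })

def invoke(prompt, region="ap-south-1"):
    import boto3  # imported lazily so the helper above stays testable offline
    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(modelId=MODEL_ID, body=build_request_body(prompt))
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```

The same profile ID works unchanged with the streaming and Converse variants of the API.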

Global cross-Region inference for India’s peak demand seasons
Indian enterprises face unique challenges during high-traffic periods: Diwali shopping surges, Dussehra ecommerce spikes, Eid celebrations, Christmas festivities, tax filing deadlines, cricket tournaments, and festival seasons when customer engagement peaks dramatically. Global cross-Region inference provides customers in India with the throughput elasticity needed to handle these demand surges without degradation. During such peak periods, ecommerce platforms, fintech applications, and customer service chatbots experience 3-5x normal traffic. With Global CRIS, your applications automatically access inference capacity across AWS commercial Regions, helping with:

Uninterrupted service during festival shopping peaks – When millions of customers simultaneously interact with AI-powered product recommendations and customer support, for example, during Diwali, Eid, Christmas, and regional celebrations
Seamless tax season processing – Handles the surge in document analysis, form processing, and compliance queries during the July-September tax filing periods
Festival campaign scalability – Deploys AI-driven marketing campaigns during festivals without capacity planning overhead
Business continuity during cultural and sporting events – Maintains performance during cricket tournaments, election periods, and other events that drive massive engagement spikes

By routing requests globally, customers in India gain access to a significantly larger capacity pool—transforming from Regional Tokens per Minute (TPM) limits to global-scale throughput. This means that your generative AI applications remain responsive and reliable when your business needs them most, without the operational complexity of manual multi-Region orchestration or the risk of customer-facing throttling errors.
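Even with the larger global capacity pool, production clients should still tolerate transient throttling gracefully. A minimal sketch, assuming standard botocore behavior (the adaptive retry mode is a generic client-side feature, not specific to Bedrock; `make_bedrock_client` is an illustrative name):

```python
def make_bedrock_client(region="ap-south-1"):
    # Lazy imports keep this helper self-contained
    import boto3
    from botocore.config import Config
    # "adaptive" retry mode adds client-side rate limiting on top of
    # exponential backoff, softening ThrottlingException bursts
    cfg = Config(retries={"max_attempts": 10, "mode": "adaptive"})
    return boto3.client("bedrock-runtime", region_name=region, config=cfg)
```

A client built this way can be dropped in wherever `boto3.client("bedrock-runtime", ...)` is used in the examples that follow.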
Step-by-step guidance for getting started with inference using Global cross-Region inference on Amazon Bedrock
There are two approaches to invoking the Global cross-Region supported models. Let's look at each in detail.
Approach 1: Invoking global CRIS models in the AWS Console with Amazon Bedrock playgrounds – Asia Pacific (Mumbai) Region
The Amazon Bedrock playgrounds provide a visual interface for experimenting with different models and configuration parameters. You can use playgrounds to test and compare models by experimenting with prompts before integrating them into your application.
To get started using playgrounds, complete the following steps:

Log in to the AWS Management Console and select the Asia Pacific (Mumbai) Region.
Navigate to the Amazon Bedrock console.
On the side navigation menu, under Infer, select Inference profiles.
Global inference profiles start with the keyword global.
The model ID to invoke models via Global CRIS is in the Inference profile ID column of the System-defined inference profiles table. At the time of writing, Anthropic Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, and other foundation models are supported.
Select the name of a model, for example Global Anthropic Claude Opus 4.6, to open its details page.
Review the Inference profile overview for the chosen model. To perform inference in the Chat/Text playground, choose Open in Playground.
You can now try out prompts against the model in the Amazon Bedrock chat/text playground.

Approach 2: Invoking global cross-Region inference models programmatically
To invoke the global models programmatically, you can use the InvokeModel and Converse APIs for real-time requests, and the InvokeModelWithResponseStream and ConverseStream APIs for streaming workloads. The full source code demonstrating these invocation APIs is available in the GitHub repository aws-samples/sample-amazon-bedrock-global-cris.
Invoke Anthropic Claude model with global cross-Region inference using Converse API
Let's walk through Global cross-Region inference on the global.anthropic.claude-opus-4-6-v1 model using the Converse API. We recommend the Converse API for conversational applications because it provides a unified interface across models.
import boto3

# Initialize Bedrock client for India region (Mumbai)
bedrock = boto3.client("bedrock-runtime", region_name="ap-south-1")

# Global CRIS model ID for Claude Opus 4.6
MODEL_ID = "global.anthropic.claude-opus-4-6-v1"

try:
    print("Invoking Claude Opus 4.6 via Global CRIS...")

    # Use Converse API for simplified interaction
    response = bedrock.converse(
        messages=[
            {
                "role": "user",
                "content": [{"text": "Explain cloud computing in 2 sentences."}],
            }
        ],
        modelId=MODEL_ID,
    )

    # Extract and display response
    response_text = response["output"]["message"]["content"][0]["text"]
    print("Response:", response_text)

    # Display token usage information
    usage = response.get("usage", {})
    print("Tokens used:", usage)

    if usage:
        print(f"Input tokens: {usage.get('inputTokens', 'N/A')}")
        print(f"Output tokens: {usage.get('outputTokens', 'N/A')}")
        print(f"Total tokens: {usage.get('totalTokens', 'N/A')}")

except Exception as e:
    print(f"Error: {e}")
    print("Please check your AWS credentials and region configuration.")
Code samples to invoke the Anthropic Claude model with global cross-Region inference with different API types
The code samples for InvokeModel, InvokeModelWithResponseStream, and Converse and ConverseStream APIs for Global CRIS models can be referenced as follows:

Model name
Inference profile ID
Invocation API
GitHub sample code

Global Anthropic Claude Opus 4.6
global.anthropic.claude-opus-4-6-v1
Converse
Code

ConverseStream
Code

InvokeModel
Code, Advanced Usage

InvokeModelWithResponseStream
Code, Advanced Usage

Global Anthropic Claude Sonnet 4.6
global.anthropic.claude-sonnet-4-6
Converse
Code

ConverseStream
Code

InvokeModel
Code

InvokeModelWithResponseStream
Code

Global Anthropic Claude Haiku 4.5
global.anthropic.claude-haiku-4-5-20251001-v1:0
Converse
Code

ConverseStream
Code

InvokeModel
Code

InvokeModelWithResponseStream
Code

Global Claude Sonnet 4.5
global.anthropic.claude-sonnet-4-5-20250929-v1:0
Converse
Code

ConverseStream
Code

InvokeModel
Code

InvokeModelWithResponseStream
Code

Global Anthropic Claude Opus 4.5
global.anthropic.claude-opus-4-5-20251101-v1:0
Converse
Code

ConverseStream
Code

InvokeModel
Code

InvokeModelWithResponseStream
Code

You can also work with global CRIS models using Application inference profiles. The sample is available in the GitHub repository at application-inference-profile/multi_tenant_inference_profile_example.py. You can learn more about cross-Region (system-defined) inference profiles and application inference profiles in Set up a model invocation resource using inference profiles.
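The table above links to the full samples. As a hedged sketch of the streaming path, a minimal ConverseStream helper might look like the following; the profile ID is taken from the table, the event shapes follow the Bedrock Runtime ConverseStream response, and `stream_text` is an illustrative name.

```python
MODEL_ID = "global.anthropic.claude-sonnet-4-6"

def stream_text(prompt, region="ap-south-1"):
    """Yield text chunks as they arrive from a ConverseStream response."""
    import boto3  # imported lazily inside the generator
    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.converse_stream(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # The stream is an event iterator; contentBlockDelta events carry text
    for event in response["stream"]:
        if "contentBlockDelta" in event:
            yield event["contentBlockDelta"]["delta"]["text"]
```

Because this is a generator, tokens can be printed or forwarded to a UI as they arrive rather than after the full response completes.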
Monitoring and logging with Global cross-Region inference
When using global cross-Region inference, Amazon CloudWatch and AWS CloudTrail continue to record log entries only in the source Region where the request originated. This streamlines monitoring and logging by maintaining the records in a single Region regardless of where the inference request is ultimately processed.
So far, we have learned how to invoke Global cross-Region inference supported models using the InvokeModel, InvokeModelWithResponseStream, Converse, and ConverseStream APIs, and how to capture usage from their responses. Next, we will capture logs and metrics to improve the overall observability and traceability of requests and responses made to the global endpoints. We will build dashboards using several key features:

Model invocation logging to push logs to CloudWatch Logs
Generative AI observability
Querying CloudTrail event data store to track which destination Region executes each inference request

This phased approach will help you set up monitoring, understand where your requests are being processed, and track your model performance effectively.
Phase 1: CloudWatch metrics enablement, graphs report snapshot
To gain comprehensive visibility into your global CRIS usage, you will need to enable model invocation logging and set up monitoring dashboards. This will help you track performance metrics, token usage, and identify which Regions are processing your requests.
Step 1: Enabling model invocation logging
To enable model invocation logging, navigate to the Amazon Bedrock service page in the AWS console and perform the following actions as shown.

Select Settings under the Configure and learn section of the side navigation on the left.
At the top of the settings page, toggle on Model invocation logging. This starts publishing invocation logs for your Amazon Bedrock model usage.
Select where you want your logs to be stored:

S3 only – Store logs in Amazon Simple Storage Service (Amazon S3)
CloudWatch Logs only – Store logs in CloudWatch (recommended for most users)
Both S3 and CloudWatch Logs – Store logs in both locations
Choose either CloudWatch Logs only or Both S3 and CloudWatch Logs, because the CloudWatch Gen AI observability dashboard requires logs in CloudWatch.

If you selected CloudWatch as a destination:

Refer to Create a log group in CloudWatch Logs to create a log group.
Enter the Log group name where invocation logs will be published. Note that logs and model input/output data up to 100 KB will be stored in this group.

Under the section Choose a method to authorize Bedrock, select one of the following options to indicate how Amazon Bedrock should be authorized to write logs:

Use an existing service role – Select a pre-existing IAM role.
Create and use a new role – Let AWS create a new role automatically. For simplicity, we will select Create and use a new role.

Choose Save settings to apply your logging configuration.

Step 2: Generative AI observability for Model Invocations
You can use the CloudWatch Generative AI Observability dashboard to monitor model invocation performance. You can track metrics such as invocation count, token usage, and errors using out-of-the-box views. Because we already enabled model invocation logging, navigate to the CloudWatch service page in the AWS Console and choose Model Invocations under Gen AI Observability.
With Gen AI observability, you have complete visibility into your Gen AI workload’s performance with key metrics, end-to-end prompt tracing and step-by-step analysis of large language model (LLM) interactions. You can quickly diagnose issues and gain real-time insights into the performance and reliability of your entire AI stack.

Phase 2: Understanding destination Region inference using CloudTrail for CRIS
To track which Region processed a request, CloudTrail events include an additionalEventData field with an inferenceRegion key that specifies the destination Region. Organizations can monitor and analyze the distribution of their inference requests across the AWS Global Infrastructure.
Step 1: Create event data store in CloudTrail

In the AWS Console, navigate to CloudTrail service page, on the side navigation menu for Lake, select Event data stores.

Provide an Event data store name of your choice and keep everything as default. Choose the Next button to proceed to Choose events.

In the Choose events section, keep the default selections and select Next to proceed to Enrich events and enable large events.

Optionally, you can enrich CloudTrail events by adding resource tag keys and IAM global condition keys and increase the event size. You can use this information to efficiently categorize, search, and analyze CloudTrail events.

Finally, choose Create event data store to create your event data store.

Step 2: Query the Event data store
To track which Region processed an Amazon Bedrock model invocation, CloudTrail events include an additionalEventData field with an inferenceRegion key that specifies the destination Region.

In the CloudTrail console, navigate to the Query section under Lake.
Select the event data store from the dropdown list.
Enter your natural language prompt in the Query generator text box to describe what you want to query.

Example prompt: For Amazon Bedrock model invocations, show which Region processed each request using the inferenceRegion key in the additionalEventData field.

Select the Generate query button to convert your prompt into SQL syntax. The query will appear similar to the following:

SELECT eventTime,
       awsRegion,
       element_at(additionalEventData, 'inferenceRegion') AS inferenceRegion,
       eventName,
       userIdentity.arn AS userArn,
       requestId
FROM <REPLACE_WITH_YOUR_EVENT_DATA_STORE_ID>
WHERE eventSource IN (
    'bedrock.amazonaws.com'
)
AND eventName IN ('InvokeModel', 'InvokeModelWithResponseStream', 'Converse', 'ConverseStream')
AND eventTime >= '2025-11-06 00:00:00'
AND eventTime <= '2026-03-05 23:59:59'

Choose Run to execute the generated SQL query and view the results.

The query results show awsRegion, the Region where the inference request originated, and inferenceRegion, the destination Region where the request was actually processed.
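The same Lake query can also be run programmatically. Below is a minimal sketch using the CloudTrail Lake StartQuery, DescribeQuery, and GetQueryResults APIs; the event data store ID placeholder is left for you to fill in, and `run_inference_region_query` is an illustrative helper name.

```python
import time

LAKE_QUERY = """
SELECT eventTime, awsRegion,
       element_at(additionalEventData, 'inferenceRegion') AS inferenceRegion
FROM <REPLACE_WITH_YOUR_EVENT_DATA_STORE_ID>
WHERE eventSource IN ('bedrock.amazonaws.com')
  AND eventName IN ('InvokeModel', 'InvokeModelWithResponseStream',
                    'Converse', 'ConverseStream')
"""

def run_inference_region_query(event_data_store_id, poll_seconds=2):
    import boto3  # imported lazily inside the helper
    client = boto3.client("cloudtrail")
    statement = LAKE_QUERY.replace("<REPLACE_WITH_YOUR_EVENT_DATA_STORE_ID>",
                                   event_data_store_id)
    query_id = client.start_query(QueryStatement=statement)["QueryId"]
    # Lake queries are asynchronous: poll until the query reaches a final state
    while True:
        status = client.describe_query(QueryId=query_id)["QueryStatus"]
        if status in ("FINISHED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    return client.get_query_results(QueryId=query_id)["QueryResultRows"]
```

This makes it straightforward to feed the origin/destination Region distribution into your own dashboards.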
Phase 3: Understanding Query results from Event data store
From the query results in the event data store for Global cross-Region inference requests:

awsRegion indicates the origin of the request, and inferenceRegion indicates the destination Region where the request was processed.
eventName captures which invocation API was used: InvokeModel, InvokeModelWithResponseStream, Converse, or ConverseStream.
You might also see results with a blank inferenceRegion, which indicates that the request was processed in the Region where it originated; the destination Region is then the same as awsRegion.

Take your AI applications global
Global cross-Region inference on Amazon Bedrock empowers Indian organizations to build resilient, high-performing AI applications with intelligent request routing across the worldwide infrastructure of AWS. With the monitoring capabilities demonstrated in this post, you gain visibility into your application's performance and can track exactly where your inference requests are processed.
Start your journey with Global CRIS today by implementing the code examples provided for Anthropic's Claude Haiku 4.5, Sonnet 4.6, and Opus 4.6. Enable model invocation logging to gain insights through CloudWatch Gen AI observability, and use CloudTrail to track cross-Region request routing. Whether you're optimizing for performance with global profiles or maintaining compliance with geography-specific profiles, Amazon Bedrock provides the flexibility your organization needs.
For more information about Global cross-Region inference for Anthropic’s Claude Opus 4.6 and Claude Sonnet 4.6 in Amazon Bedrock, see Increase throughput with cross-Region inference, Supported Regions and models for inference profiles, and Use an inference profile in model invocation.
We and our partners are excited to see what customers build with this inference capability.

About the authors

Pavan Kumar Rao Navule
Pavan Kumar Rao Navule is a Senior Solutions Architect at Amazon Web Services, where he works with ISVs in India to help them innovate on AWS. He specializes in architecting AI/ML and generative AI services at AWS. Pavan is a published author of the book Getting Started with V Programming. In his free time, Pavan enjoys listening to the great magical voices of Sia and Rihanna.

Sudhanshu Hate
Sudhanshu Hate brings nearly three decades of AI/ML innovation to his role as Principal AI/ML Specialist at AWS. He partners with organizations worldwide—from Fortune 1000 enterprises to India’s digital-native startups—to accelerate their MLOps, FMOps, and Generative AI initiatives. Before Amazon, Sudhanshu built open-source AI and Gamification systems from the ground up, leading teams that successfully released these solutions for over 100 clients. A recognized thought leader, he has contributed multiple patents, two books, and numerous technical publications to the field, and regularly shares his insights at industry conferences.

Melanie Li
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Saurabh Trikande
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Jared Dean
Jared Dean is a Principal AI/ML Solutions Architect at AWS. Jared works with customers across industries to develop machine learning applications that improve efficiency. He is interested in all things AI, technology, and BBQ.

Andrej Karpathy Open-Sources ‘Autoresearch’: A 630-Line Python Too …

Andrej Karpathy released autoresearch, a minimalist Python tool that enables AI agents to autonomously conduct machine learning experiments. The project is a stripped-down version of the nanochat LLM training core, condensed into a single-file repository of roughly 630 lines of code and optimized for execution on a single NVIDIA GPU.

The Autonomous Iteration Loop

The framework establishes a specific division of labor between the human researcher and the AI agent. The system operates on a continuous feedback loop where progress is tracked via git commits on a feature branch.

Component – Responsibility – File format
Human – Iterates on high-level research instructions and constraints – .md (Markdown)
AI agent – Proposes and implements modifications to the training script – .py (Python)
Execution – Conducts a fixed-length training run to evaluate the changes – Shell/Python

The agent reads the human-provided instructions, modifies the training code—adjusting neural network architecture, optimizers, or hyperparameters—and executes a training run that lasts exactly five minutes.

Evaluation Metrics and Validation

To ensure the agent only retains beneficial changes, the system uses bits-per-byte (BPB) as the primary validation metric. BPB measures the compression efficiency of the model on a validation dataset; a lower score indicates a more accurate model.
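Concretely, BPB converts the model's cross-entropy on held-out text into bits and normalizes by the byte count, which keeps comparisons independent of tokenizer choice. A minimal sketch of the conversion (`bits_per_byte` is an illustrative helper, not autoresearch's code):

```python
import math

def bits_per_byte(total_loss_nats, n_bytes):
    # Cross-entropy measured in nats, converted to bits (divide by ln 2),
    # then normalized by the number of raw bytes in the validation set
    return total_loss_nats / math.log(2) / n_bytes
```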

Validation Protocol: The agent only commits code changes to the git branch if the final BPB score is lower than the previous best.

Observed Performance: In initial runs, Karpathy demonstrated the agent successfully reducing validation loss from 1.0 to 0.97 BPB through autonomous code iteration.

Granularity: Every completed 5-minute training run is represented as a data point, allowing researchers to compare the effectiveness of different prompts or agent configurations over time.
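The loop described above can be sketched conceptually as follows. This is a deliberate simplification, not Karpathy's implementation: the real tool edits training code and commits to a git feature branch, whereas here `run_experiment` stands in for one fixed-length training run that returns a BPB score.

```python
def should_commit(new_bpb, best_bpb):
    # Retain a change only when it strictly improves validation BPB
    return new_bpb < best_bpb

def research_loop(run_experiment, steps, best_bpb=float("inf")):
    """run_experiment() stands in for one 5-minute training run returning BPB."""
    history = []
    for _ in range(steps):
        bpb = run_experiment()
        if should_commit(bpb, best_bpb):
            best_bpb = bpb  # in the real tool: git commit on the feature branch
        history.append((bpb, best_bpb))
    return best_bpb, history
```

Tracking `(bpb, best_bpb)` per run mirrors how each completed sprint becomes a data point for comparing prompts or agent configurations.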

Case Study: Implementation by Shopify’s Tobi Lutke

Following the release, Shopify CEO Tobi Lutke adapted the autoresearch framework for an internal project. By allowing the agent to iterate on a smaller model architecture, Lutke reported a 19% improvement in validation scores. Notably, the agent-optimized smaller model eventually outperformed a larger model that had been configured through standard manual methods.

OK this thing is totally insane. Before going to bed I…* used try to make a new qmdresearcher directory* told my pi to read this github repo and make a version of that for the qmd query-expansion model with the goal of highest quality score and speed. Get training data from… https://t.co/hbCfD62ElJ— tobi lutke (@tobi) March 8, 2026

Karpathy noted that the specific code tweaks discovered by the agent were later integrated back into his broader nanochat framework, demonstrating that the tool can discover optimizations applicable to larger-scale production systems.

I packaged up the “autoresearch” project into a new self-contained minimal repo if people would like to play over the weekend. It’s basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then:– the human iterates on the… pic.twitter.com/3tyOq2P9c6— Andrej Karpathy (@karpathy) March 7, 2026

Technical Significance for Devs

For developers, autoresearch represents a shift toward ‘agentic’ workflows in model development. Rather than manually tuning hyperparameters, the engineering task shifts to prompt-engineering the agent to navigate the search space more effectively. The ~630-line constraint ensures that the entire codebase fits within the context window of modern LLMs, minimizing errors in code generation and allowing the agent to maintain a ‘holistic’ understanding of the training script.

Key Takeaways

Autonomous Research Loop: The framework enables AI agents to autonomously iterate on ML experiments by reading a human-provided Markdown (.md) instruction file and modifying a Python (.py) training script without manual intervention.

~630-Line Core: By stripping the nanochat LLM training core down to a single-file, ~630-line repository, the codebase is small enough to fit entirely within an LLM’s context window, reducing code generation errors.

Efficiency-Driven Metrics: The agent runs fixed 5-minute training sprints on a single NVIDIA GPU and only commits code changes to a git feature branch if they result in a lower bits-per-byte (BPB) validation score.

Proven Performance Gains: In a real-world test shared in a tweet, Shopify CEO Tobi Lutke used the tool to achieve a 19% improvement in model scores, resulting in a smaller, agent-optimized model that outperformed a larger, manually configured one.

Shift in Engineering Focus: The project moves the developer’s role from manual hyperparameter tuning to agent engineering, where the goal is to optimize the prompts that direct the AI to find the most efficient neural architectures and training settings.

Check out the Repo here.
The post Andrej Karpathy Open-Sources ‘Autoresearch’: A 630-Line Python Tool Letting AI Agents Run Autonomous ML Experiments on Single GPUs appeared first on MarkTechPost.

Beyond Accuracy: Quantifying the Production Fragility Caused by Excess …

At first glance, adding more features to a model seems like an obvious way to improve performance. If a model can learn from more information, it should be able to make better predictions. In practice, however, this instinct often introduces hidden structural risks. Every additional feature creates another dependency on upstream data pipelines, external systems, and data quality checks. A single missing field, schema change, or delayed dataset can quietly degrade predictions in production.

The deeper issue is not computational cost or system complexity — it is weight instability. In regression models, especially when features are correlated or weakly informative, the optimizer struggles to assign credit in a meaningful way. Coefficients can shift unpredictably as the model attempts to distribute influence across overlapping signals, and low-signal variables may appear important simply due to noise in the data. Over time, this leads to models that look sophisticated on paper but behave inconsistently when deployed.
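This instability is easy to demonstrate. In the hedged sketch below (plain NumPy least squares rather than the Ridge setup used later in the article), x2 is a near-copy of x1; refitting on bootstrap resamples shows the individual coefficients swinging while their sum stays pinned near the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear duplicate of x1
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

coefs = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)       # bootstrap resample
    X = np.column_stack([x1[idx], x2[idx]])
    beta, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    coefs.append(beta)
coefs = np.array(coefs)

# Each individual coefficient is unstable, but their sum (the true signal) is not
print("std of coef on x1 :", round(coefs[:, 0].std(), 3))
print("std of coef sum   :", round((coefs[:, 0] + coefs[:, 1]).std(), 3))
```

The optimizer can split the influence of x1 between the two columns in many nearly equivalent ways, so which split you get depends on the noise in the sample, exactly the credit-assignment problem described above.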

In this article, we will examine why adding more features can make regression models less reliable rather than more accurate. We will explore how correlated features distort coefficient estimates, how weak signals get mistaken for real patterns, and why each additional feature increases production fragility. To make these ideas concrete, we will walk through examples using a property pricing dataset and compare the behavior of large “kitchen-sink” models with leaner, more stable alternatives.

Importing the dependencies

pip install seaborn scikit-learn pandas numpy matplotlib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

plt.rcParams.update({
    "figure.facecolor": "#FAFAFA",
    "axes.facecolor": "#FAFAFA",
    "axes.spines.top": False,
    "axes.spines.right": False,
    "axes.grid": True,
    "grid.color": "#E5E5E5",
    "grid.linewidth": 0.8,
    "font.family": "monospace",
})

SEED = 42
np.random.seed(SEED)

This code sets a clean, consistent Matplotlib style by adjusting background colors, grid appearance, and removing unnecessary axis spines for clearer visualizations. It also sets a fixed NumPy random seed (42) to ensure that any randomly generated data remains reproducible across runs.

Synthetic Property Dataset

N = 800  # training samples

# ── True signal features ────────────────────────────────────
sqft = np.random.normal(1800, 400, N)  # strong signal
bedrooms = np.round(sqft / 550 + np.random.normal(0, 0.4, N)).clip(1, 6)
neighborhood = np.random.choice([0, 1, 2], N, p=[0.3, 0.5, 0.2])  # categorical

# ── Derived / correlated features (multicollinearity) ───────
total_rooms = bedrooms + np.random.normal(2, 0.3, N)        # ≈ bedrooms
floor_area_m2 = sqft * 0.0929 + np.random.normal(0, 1, N)   # ≈ sqft in m²
lot_sqft = sqft * 1.4 + np.random.normal(0, 50, N)          # ≈ sqft scaled

# ── Weak / spurious features ────────────────────────────────
door_color_code = np.random.randint(0, 10, N).astype(float)
bus_stop_age_yrs = np.random.normal(15, 5, N)
nearest_mcdonalds_m = np.random.normal(800, 200, N)

# ── Pure noise features (simulate 90 random columns) ────────
noise_features = np.random.randn(N, 90)
noise_df = pd.DataFrame(
    noise_features,
    columns=[f"noise_{i:03d}" for i in range(90)]
)

# ── Target: house price ─────────────────────────────────────
price = (
    120 * sqft
    + 8_000 * bedrooms
    + 30_000 * neighborhood
    - 15 * bus_stop_age_yrs            # tiny real effect
    + np.random.normal(0, 15_000, N)   # irreducible noise
)

# ── Assemble DataFrames ──────────────────────────────────────
signal_cols = ["sqft", "bedrooms", "neighborhood",
               "total_rooms", "floor_area_m2", "lot_sqft",
               "door_color_code", "bus_stop_age_yrs",
               "nearest_mcdonalds_m"]

df_base = pd.DataFrame({
    "sqft": sqft,
    "bedrooms": bedrooms,
    "neighborhood": neighborhood,
    "total_rooms": total_rooms,
    "floor_area_m2": floor_area_m2,
    "lot_sqft": lot_sqft,
    "door_color_code": door_color_code,
    "bus_stop_age_yrs": bus_stop_age_yrs,
    "nearest_mcdonalds_m": nearest_mcdonalds_m,
    "price": price,
})

df_full = pd.concat([df_base.drop("price", axis=1), noise_df,
                     df_base[["price"]]], axis=1)

LEAN_FEATURES = ["sqft", "bedrooms", "neighborhood"]
NOISY_FEATURES = [c for c in df_full.columns if c != "price"]

print(f"Lean model features : {len(LEAN_FEATURES)}")
print(f"Noisy model features: {len(NOISY_FEATURES)}")
print(f"Dataset shape       : {df_full.shape}")

This code constructs a synthetic dataset designed to mimic a real-world property pricing scenario, where only a small number of variables truly influence the target while many others introduce redundancy or noise. The dataset contains 800 training samples. Core signal features such as square footage (sqft), number of bedrooms, and neighborhood category represent the primary drivers of house prices. In addition to these, several derived features are intentionally created to be highly correlated with the core variables—such as floor_area_m2 (a unit conversion of square footage), lot_sqft, and total_rooms. These variables simulate multicollinearity, a common issue in real datasets where multiple features carry overlapping information.

The dataset also includes weak or spurious features—such as door_color_code, bus_stop_age_yrs, and nearest_mcdonalds_m—which have little or no meaningful relationship with property price. To further replicate the “kitchen-sink model” problem, the script generates 90 completely random noise features, representing irrelevant columns that often appear in large datasets. The target variable price is constructed using a known formula where square footage, bedrooms, and neighborhood have the strongest influence, while bus stop age has a very small effect and random noise introduces natural variability.

Finally, two feature sets are defined: a lean model containing only the three true signal features (sqft, bedrooms, neighborhood) and a noisy model containing every available column except the target. This setup allows us to directly compare how a minimal, high-signal feature set performs against a large, feature-heavy model filled with redundant and irrelevant variables.
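Before examining the individual failure modes, it helps to see why the noisy model can look "fine" on headline accuracy. The sketch below is a minimal, self-contained check using a fresh synthetic generator that mirrors the one above; the seed and exact scores are illustrative and will not match the article's run.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 800
sqft = rng.normal(1800, 400, n)
bedrooms = np.round(sqft / 550 + rng.normal(0, 0.4, n)).clip(1, 6)
neighborhood = rng.choice([0, 1, 2], n, p=[0.3, 0.5, 0.2])
price = (120 * sqft + 8_000 * bedrooms + 30_000 * neighborhood
         + rng.normal(0, 15_000, n))

X_lean = np.column_stack([sqft, bedrooms, neighborhood])
X_noisy = np.hstack([X_lean, rng.standard_normal((n, 90))])  # pad with 90 noise cols

# Cross-validated R^2 for both feature sets
r2_lean = cross_val_score(Ridge(alpha=1.0), X_lean, price, cv=5, scoring="r2").mean()
r2_noisy = cross_val_score(Ridge(alpha=1.0), X_noisy, price, cv=5, scoring="r2").mean()
print(f"lean  CV R^2 = {r2_lean:.3f}")
print(f"noisy CV R^2 = {r2_noisy:.3f}")
```

The two scores typically land close together, which is exactly the trap: headline accuracy hides the fragility that the following sections quantify.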

Weight Dilution via Multicollinearity

print("\n── Correlation between correlated feature pairs ──")
corr_pairs = [
    ("sqft", "floor_area_m2"),
    ("sqft", "lot_sqft"),
    ("bedrooms", "total_rooms"),
]
for a, b in corr_pairs:
    r = np.corrcoef(df_full[a], df_full[b])[0, 1]
    print(f"  {a:20s} {b:20s} r = {r:.3f}")

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
fig.suptitle("Weight Dilution: Correlated Feature Pairs",
             fontsize=13, fontweight="bold", y=1.02)

for ax, (a, b) in zip(axes, corr_pairs):
    ax.scatter(df_full[a], df_full[b],
               alpha=0.25, s=12, color="#3B6FD4")
    r = np.corrcoef(df_full[a], df_full[b])[0, 1]
    ax.set_title(f"r = {r:.3f}", fontsize=11)
    ax.set_xlabel(a); ax.set_ylabel(b)

plt.tight_layout()
plt.savefig("01_multicollinearity.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved → 01_multicollinearity.png")

This section demonstrates multicollinearity, a situation where multiple features contain nearly identical information. The code computes correlation coefficients for three intentionally correlated feature pairs: sqft vs floor_area_m2, sqft vs lot_sqft, and bedrooms vs total_rooms. As the printed results show, these relationships are extremely strong (r ≈ 1.0, 0.996, and 0.945), meaning the model receives multiple signals describing the same underlying property characteristic.

The scatter plots visualize this overlap. Because these features move almost perfectly together, the regression optimizer struggles to determine which feature should receive credit for predicting the target. Instead of assigning a clear weight to one variable, the model often splits the influence across correlated features in arbitrary ways, leading to unstable and diluted coefficients. This is one of the key reasons why adding redundant features can make a model less interpretable and less stable, even if predictive performance initially appears similar.
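A standard way to put a number on this overlap is the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 - R^2). The sketch below is not part of the article's code; it uses a fresh synthetic pair mimicking the sqft / floor_area_m2 relationship to show how redundant features produce extreme VIF values while an independent feature stays near 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
sqft = rng.normal(1800, 400, n)
floor_area_m2 = sqft * 0.0929 + rng.normal(0, 1, n)  # near-duplicate of sqft
bedrooms = rng.normal(3, 1, n)                       # independent feature
X = np.column_stack([sqft, floor_area_m2, bedrooms])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on all other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

for j, name in enumerate(["sqft", "floor_area_m2", "bedrooms"]):
    print(f"{name:15s} VIF = {vif(X, j):,.1f}")
```

A common rule of thumb treats VIF above 5 to 10 as a sign that a feature is largely redundant and a candidate for removal.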

Weight Instability Across Retraining Cycles

N_CYCLES = 30
SAMPLE_SZ = 300  # size of each retraining slice

scaler_lean = StandardScaler()
scaler_noisy = StandardScaler()

# Fit scalers on full data so units are comparable
X_lean_all = scaler_lean.fit_transform(df_full[LEAN_FEATURES])
X_noisy_all = scaler_noisy.fit_transform(df_full[NOISY_FEATURES])
y_all = df_full["price"].values

lean_weights = []   # shape: (N_CYCLES, 3)
noisy_weights = []  # shape: (N_CYCLES, 3), first 3 cols only for comparison

for cycle in range(N_CYCLES):
    idx = np.random.choice(N, SAMPLE_SZ, replace=False)

    X_l = X_lean_all[idx]; y_c = y_all[idx]
    X_n = X_noisy_all[idx]

    m_lean = Ridge(alpha=1.0).fit(X_l, y_c)
    m_noisy = Ridge(alpha=1.0).fit(X_n, y_c)

    lean_weights.append(m_lean.coef_)
    noisy_weights.append(m_noisy.coef_[:3])  # sqft, bedrooms, neighborhood

lean_weights = np.array(lean_weights)
noisy_weights = np.array(noisy_weights)

print("\n── Coefficient Std Dev across 30 retraining cycles ──")
print(f"{'Feature':<18} {'Lean σ':>10} {'Noisy σ':>10} {'Amplification':>14}")
for i, feat in enumerate(LEAN_FEATURES):
    sl = lean_weights[:, i].std()
    sn = noisy_weights[:, i].std()
    print(f"  {feat:<16} {sl:>10.1f} {sn:>10.1f} ×{sn/sl:.1f}")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
fig.suptitle("Weight Instability: Lean vs. Noisy Model (30 Retraining Cycles)",
             fontsize=13, fontweight="bold", y=1.02)

colors = {"lean": "#2DAA6E", "noisy": "#E05C3A"}

for i, feat in enumerate(LEAN_FEATURES):
    ax = axes[i]
    ax.plot(lean_weights[:, i], color=colors["lean"],
            linewidth=2, label="Lean (3 features)", alpha=0.9)
    ax.plot(noisy_weights[:, i], color=colors["noisy"],
            linewidth=2, label="Noisy (100+ features)", alpha=0.9, linestyle="--")
    ax.set_title(f'Coefficient: "{feat}"', fontsize=11)
    ax.set_xlabel("Retraining Cycle")
    ax.set_ylabel("Standardised Weight")
    if i == 0:
        ax.legend(fontsize=9)

plt.tight_layout()
plt.savefig("02_weight_instability.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved → 02_weight_instability.png")

This experiment simulates what happens in real production systems where models are periodically retrained on fresh data. Over 30 retraining cycles, the code randomly samples subsets of the dataset and fits two models: a lean model using only the three core signal features, and a noisy model using the full feature set containing correlated and random variables. By tracking the coefficients of the key features across each retraining cycle, we can observe how stable the learned weights remain over time.

The results show a clear pattern: the noisy model exhibits significantly higher coefficient variability. For example, the standard deviation of the sqft coefficient increases by 2.6×, while bedrooms becomes 2.2× more unstable compared to the lean model. The plotted lines make this effect visually obvious: the lean model's coefficients remain relatively smooth and consistent across retraining cycles, whereas the noisy model's weights fluctuate much more. This instability arises because correlated and irrelevant features force the optimizer to redistribute credit unpredictably, making the model's behavior less reliable even if overall accuracy appears similar.

Signal-to-Noise Ratio (SNR) Degradation

correlations = df_full[NOISY_FEATURES + ["price"]].corr()["price"].drop("price")
correlations = correlations.abs().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(14, 5))
bar_colors = [
    "#2DAA6E" if f in LEAN_FEATURES
    else "#E8A838" if f in ["total_rooms", "floor_area_m2", "lot_sqft",
                            "bus_stop_age_yrs"]
    else "#CCCCCC"
    for f in correlations.index
]

ax.bar(range(len(correlations)), correlations.values,
       color=bar_colors, width=0.85, edgecolor="none")

# Legend patches
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor="#2DAA6E", label="High-signal (lean set)"),
    Patch(facecolor="#E8A838", label="Correlated / low-signal"),
    Patch(facecolor="#CCCCCC", label="Pure noise"),
]
ax.legend(handles=legend_elements, fontsize=10, loc="upper right")
ax.set_title("Signal-to-Noise Ratio: |Correlation with Price| per Feature",
             fontsize=13, fontweight="bold")
ax.set_xlabel("Feature rank (sorted by |r|)")
ax.set_ylabel("|Pearson r| with price")
ax.set_xticks([])

plt.tight_layout()
plt.savefig("03_snr_degradation.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved → 03_snr_degradation.png")

This section measures the signal strength of each feature by computing its absolute correlation with the target variable (price). The bar chart ranks all features by their correlation, highlighting the true high-signal features in green, correlated or weak features in orange, and the large set of pure noise features in gray.

The visualization shows that only a small number of variables carry meaningful predictive signal, while the majority contribute little to none. When many low-signal or noisy features are included in a model, they dilute the overall signal-to-noise ratio, making it harder for the optimizer to consistently identify the features that truly matter.
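The same ranking idea can drive a cheap univariate screen before model fitting. The sketch below is not from the article; it uses scikit-learn's `f_regression` on a fresh synthetic sample (independent features, so the generator is simplified) to show that the three true signals surface at the top of the F-score ranking while the 90 noise columns sink.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(7)
n = 800
sqft = rng.normal(1800, 400, n)
bedrooms = rng.integers(1, 6, n).astype(float)
neighborhood = rng.choice([0.0, 1.0, 2.0], n)
noise = rng.standard_normal((n, 90))

X = np.column_stack([sqft, bedrooms, neighborhood, noise])
names = ["sqft", "bedrooms", "neighborhood"] + [f"noise_{i}" for i in range(90)]
y = 120 * sqft + 8_000 * bedrooms + 30_000 * neighborhood + rng.normal(0, 15_000, n)

# Univariate F-test of each column against the target
F, p = f_regression(X, y)
ranked = sorted(zip(names, F), key=lambda t: -t[1])
top3 = [name for name, _ in ranked[:3]]
print("Top 3 by F-score:", top3)
```

Univariate screens like this are blind to interactions, but they are a fast first pass for pruning columns that carry no marginal signal at all.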

Feature Drift Simulation

def predict_with_drift(model, scaler, X_base, drift_col_idx,
                       drift_magnitude, feature_cols):
    """Inject drift into one feature column and measure prediction shift."""
    X_drifted = X_base.copy()
    X_drifted[:, drift_col_idx] += drift_magnitude
    return model.predict(scaler.transform(X_drifted))

# Re-fit both models on the full dataset
sc_lean = StandardScaler().fit(df_full[LEAN_FEATURES])
sc_noisy = StandardScaler().fit(df_full[NOISY_FEATURES])

m_lean_full = Ridge(alpha=1.0).fit(
    sc_lean.transform(df_full[LEAN_FEATURES]), y_all)
m_noisy_full = Ridge(alpha=1.0).fit(
    sc_noisy.transform(df_full[NOISY_FEATURES]), y_all)

X_lean_raw = df_full[LEAN_FEATURES].values
X_noisy_raw = df_full[NOISY_FEATURES].values
base_lean = m_lean_full.predict(sc_lean.transform(X_lean_raw))
base_noisy = m_noisy_full.predict(sc_noisy.transform(X_noisy_raw))

# Drift the "bus_stop_age_yrs" feature (low-signal, yet in noisy model)
drift_col_noisy = NOISY_FEATURES.index("bus_stop_age_yrs")
drift_range = np.linspace(0, 20, 40)  # up to 20-year drift in bus stop age

rmse_lean_drift, rmse_noisy_drift = [], []
for d in drift_range:
    preds_noisy = predict_with_drift(
        m_noisy_full, sc_noisy, X_noisy_raw,
        drift_col_noisy, d, NOISY_FEATURES)
    # Lean model doesn't even have this feature → unaffected
    rmse_lean_drift.append(
        np.sqrt(mean_squared_error(base_lean, base_lean)))  # 0 by design
    rmse_noisy_drift.append(
        np.sqrt(mean_squared_error(base_noisy, preds_noisy)))

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(drift_range, rmse_lean_drift, color="#2DAA6E",
        linewidth=2.5, label="Lean model (feature not present)")
ax.plot(drift_range, rmse_noisy_drift, color="#E05C3A",
        linewidth=2.5, linestyle="--",
        label='Noisy model ("bus_stop_age_yrs" drifts)')
ax.fill_between(drift_range, rmse_noisy_drift,
                alpha=0.15, color="#E05C3A")
ax.set_xlabel("Feature Drift Magnitude (years)", fontsize=11)
ax.set_ylabel("Prediction Shift RMSE ($)", fontsize=11)
ax.set_title("Feature Drift Sensitivity:\nEach Extra Feature = Extra Failure Point",
             fontsize=13, fontweight="bold")
ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig("05_drift_sensitivity.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved → 05_drift_sensitivity.png")

This experiment illustrates how feature drift can silently affect model predictions in production. The code introduces gradual drift into a weak feature (bus_stop_age_yrs) and measures how much the model’s predictions change. Since the lean model does not include this feature, its predictions remain completely stable, while the noisy model becomes increasingly sensitive as the drift magnitude grows.

The resulting plot shows prediction error steadily increasing as the feature drifts, highlighting an important production reality: every additional feature becomes another potential failure point. Even low-signal variables can introduce instability if their data distribution shifts or upstream pipelines change.
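In production, this kind of silent drift is usually caught by monitoring input distributions rather than predictions. One common metric is the Population Stability Index (PSI); the sketch below is a minimal NumPy version written for illustration (thresholds like 0.25 are conventional rules of thumb, not from the article).

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a live sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], actual.min()) - 1e-6   # widen ends to cover both samples
    edges[-1] = max(edges[-1], actual.max()) + 1e-6
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)            # avoid log(0) in empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(15, 5, 5_000)  # bus_stop_age_yrs at training time
stable = rng.normal(15, 5, 5_000)     # same distribution later
drifted = rng.normal(25, 5, 5_000)    # after a 10-year upward shift

print(f"PSI (stable) : {psi(reference, stable):.3f}")
print(f"PSI (drifted): {psi(reference, drifted):.3f}")
```

A PSI near zero means the feature's distribution is unchanged; values above roughly 0.25 are conventionally treated as a signal to investigate or retrain. Crucially, every extra feature in the model is one more PSI series to watch.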

Check out the Full Codes here.
The post Beyond Accuracy: Quantifying the Production Fragility Caused by Excessive, Redundant, and Low-Signal Features in Regression appeared first on MarkTechPost.

Building Next-Gen Agentic AI: A Complete Framework for Cognitive Bluep …

In this tutorial, we build a complete cognitive blueprint and runtime agent framework. We define structured blueprints for identity, goals, planning, memory, validation, and tool access, and use them to create agents that not only respond but also plan, execute, validate, and systematically improve their outputs. Throughout the tutorial, we show how the same runtime engine can support multiple agent personalities and behaviors through blueprint portability, making the overall design modular, extensible, and practical for advanced agentic AI experimentation.

import json, yaml, time, math, textwrap, datetime, getpass, os
from typing import Any, Callable, Dict, List, Optional
from dataclasses import dataclass, field
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich.tree import Tree

try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
except Exception:
    OPENAI_API_KEY = getpass.getpass("Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
client = OpenAI(api_key=OPENAI_API_KEY)
console = Console()

class PlanningStrategy(str, Enum):
    SEQUENTIAL = "sequential"
    HIERARCHICAL = "hierarchical"
    REACTIVE = "reactive"

class MemoryType(str, Enum):
    SHORT_TERM = "short_term"
    EPISODIC = "episodic"
    PERSISTENT = "persistent"

class BlueprintIdentity(BaseModel):
    name: str
    version: str = "1.0.0"
    description: str
    author: str = "unknown"

class BlueprintMemory(BaseModel):
    type: MemoryType = MemoryType.SHORT_TERM
    window_size: int = 10
    summarize_after: int = 20

class BlueprintPlanning(BaseModel):
    strategy: PlanningStrategy = PlanningStrategy.SEQUENTIAL
    max_steps: int = 8
    max_retries: int = 2
    think_before_acting: bool = True

class BlueprintValidation(BaseModel):
    require_reasoning: bool = True
    min_response_length: int = 10
    forbidden_phrases: List[str] = []

class CognitiveBlueprint(BaseModel):
    identity: BlueprintIdentity
    goals: List[str]
    constraints: List[str] = []
    tools: List[str] = []
    memory: BlueprintMemory = BlueprintMemory()
    planning: BlueprintPlanning = BlueprintPlanning()
    validation: BlueprintValidation = BlueprintValidation()
    system_prompt_extra: str = ""

def load_blueprint_from_yaml(yaml_str: str) -> CognitiveBlueprint:
    return CognitiveBlueprint(**yaml.safe_load(yaml_str))

RESEARCH_AGENT_YAML = """
identity:
  name: ResearchBot
  version: 1.2.0
  description: Answers research questions using calculation and reasoning
  author: Auton Framework Demo
goals:
  - Answer user questions accurately using available tools
  - Show step-by-step reasoning for all answers
  - Cite the method used for each calculation
constraints:
  - Never fabricate numbers or statistics
  - Always validate mathematical results before reporting
  - Do not answer questions outside your tool capabilities
tools:
  - calculator
  - unit_converter
  - date_calculator
  - search_wikipedia_stub
memory:
  type: episodic
  window_size: 12
  summarize_after: 30
planning:
  strategy: sequential
  max_steps: 6
  max_retries: 2
  think_before_acting: true
validation:
  require_reasoning: true
  min_response_length: 20
  forbidden_phrases:
    - "I don't know"
    - "I cannot determine"
"""

DATA_ANALYST_YAML = """
identity:
  name: DataAnalystBot
  version: 2.0.0
  description: Performs statistical analysis and data summarization
  author: Auton Framework Demo
goals:
  - Compute descriptive statistics for given data
  - Identify trends and anomalies
  - Present findings clearly with numbers
constraints:
  - Only work with numerical data
  - Always report uncertainty when sample size is small (< 5 items)
tools:
  - calculator
  - statistics_engine
  - list_sorter
memory:
  type: short_term
  window_size: 6
planning:
  strategy: hierarchical
  max_steps: 10
  max_retries: 3
  think_before_acting: true
validation:
  require_reasoning: true
  min_response_length: 30
  forbidden_phrases: []
"""

We set up the core environment and define the cognitive blueprint, which structures how an agent thinks and behaves. We create strongly typed models for identity, memory configuration, planning strategy, and validation rules using Pydantic and enums. We also define two YAML-based blueprints, allowing us to configure different agent personalities and capabilities without changing the underlying runtime system.
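The payoff of typed blueprints is that a malformed config fails loudly at load time instead of mid-run. PyYAML and Pydantic do that work above; the stdlib-only sketch below (my own simplification, not the tutorial's code) mimics the same validate-on-load pattern with dataclasses, so the mechanism is visible without the dependencies.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Identity:
    name: str
    description: str
    version: str = "1.0.0"

@dataclass
class Planning:
    strategy: str = "sequential"
    max_steps: int = 8

    def __post_init__(self):
        # Reject unknown strategies at construction time
        if self.strategy not in ("sequential", "hierarchical", "reactive"):
            raise ValueError(f"unknown strategy: {self.strategy}")

@dataclass
class Blueprint:
    identity: Identity
    goals: List[str]
    planning: Planning = field(default_factory=Planning)

def load_blueprint(cfg: dict) -> Blueprint:
    """Validate a parsed config dict (e.g. from yaml.safe_load) into typed objects."""
    return Blueprint(
        identity=Identity(**cfg["identity"]),
        goals=list(cfg["goals"]),
        planning=Planning(**cfg.get("planning", {})),
    )

bp = load_blueprint({
    "identity": {"name": "ResearchBot", "description": "demo agent"},
    "goals": ["answer accurately"],
    "planning": {"strategy": "sequential", "max_steps": 6},
})
print(bp.identity.name, bp.planning.max_steps)  # → ResearchBot 6
```

A config with `strategy: bogus` raises `ValueError` immediately, which is exactly the behavior the Pydantic models provide for the full blueprint.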

@dataclass
class ToolSpec:
    name: str
    description: str
    parameters: Dict[str, str]
    function: Callable
    returns: str

class ToolRegistry:
    def __init__(self):
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, name: str, description: str,
                 parameters: Dict[str, str], returns: str):
        def decorator(fn: Callable) -> Callable:
            self._tools[name] = ToolSpec(name, description, parameters, fn, returns)
            return fn
        return decorator

    def get(self, name: str) -> Optional[ToolSpec]:
        return self._tools.get(name)

    def call(self, name: str, **kwargs) -> Any:
        spec = self._tools.get(name)
        if not spec:
            raise ValueError(f"Tool '{name}' not found in registry")
        return spec.function(**kwargs)

    def get_tool_descriptions(self, allowed: List[str]) -> str:
        lines = []
        for name in allowed:
            spec = self._tools.get(name)
            if spec:
                params = ", ".join(f"{k}: {v}" for k, v in spec.parameters.items())
                lines.append(
                    f"• {spec.name}({params})\n"
                    f"  → {spec.description}\n"
                    f"  Returns: {spec.returns}"
                )
        return "\n".join(lines)

    def list_tools(self) -> List[str]:
        return list(self._tools.keys())

registry = ToolRegistry()

@registry.register(
    name="calculator",
    description="Evaluates a safe mathematical expression",
    parameters={"expression": "A math expression string, e.g. '2 ** 10 + 5 * 3'"},
    returns="Numeric result as float"
)
def calculator(expression: str) -> str:
    try:
        allowed = {k: v for k, v in math.__dict__.items() if not k.startswith("_")}
        allowed.update({"abs": abs, "round": round, "pow": pow})
        return str(eval(expression, {"__builtins__": {}}, allowed))
    except Exception as e:
        return f"Error: {e}"

@registry.register(
    name="unit_converter",
    description="Converts between common units of measurement",
    parameters={
        "value": "Numeric value to convert",
        "from_unit": "Source unit (km, miles, kg, lbs, celsius, fahrenheit, liters, gallons, meters, feet)",
        "to_unit": "Target unit"
    },
    returns="Converted value as string with units"
)
def unit_converter(value: float, from_unit: str, to_unit: str) -> str:
    conversions = {
        ("km", "miles"): lambda x: x * 0.621371,
        ("miles", "km"): lambda x: x * 1.60934,
        ("kg", "lbs"): lambda x: x * 2.20462,
        ("lbs", "kg"): lambda x: x / 2.20462,
        ("celsius", "fahrenheit"): lambda x: x * 9/5 + 32,
        ("fahrenheit", "celsius"): lambda x: (x - 32) * 5/9,
        ("liters", "gallons"): lambda x: x * 0.264172,
        ("gallons", "liters"): lambda x: x * 3.78541,
        ("meters", "feet"): lambda x: x * 3.28084,
        ("feet", "meters"): lambda x: x / 3.28084,
    }
    key = (from_unit.lower(), to_unit.lower())
    if key in conversions:
        return f"{conversions[key](float(value)):.4f} {to_unit}"
    return f"Conversion from {from_unit} to {to_unit} not supported"

@registry.register(
    name="date_calculator",
    description="Calculates days between two dates, or adds/subtracts days from a date",
    parameters={
        "operation": "'days_between' or 'add_days'",
        "date1": "Date string in YYYY-MM-DD format",
        "date2": "Second date for days_between (YYYY-MM-DD), or number of days for add_days"
    },
    returns="Result as string"
)
def date_calculator(operation: str, date1: str, date2: str) -> str:
    try:
        d1 = datetime.datetime.strptime(date1, "%Y-%m-%d")
        if operation == "days_between":
            d2 = datetime.datetime.strptime(date2, "%Y-%m-%d")
            return f"{abs((d2 - d1).days)} days between {date1} and {date2}"
        elif operation == "add_days":
            result = d1 + datetime.timedelta(days=int(date2))
            return f"{result.strftime('%Y-%m-%d')} (added {date2} days to {date1})"
        return f"Unknown operation: {operation}"
    except Exception as e:
        return f"Error: {e}"

@registry.register(
    name="search_wikipedia_stub",
    description="Returns a stub summary for well-known topics (demo; no live internet)",
    parameters={"topic": "Topic to look up"},
    returns="Short text summary"
)
def search_wikipedia_stub(topic: str) -> str:
    stubs = {
        "openai": "OpenAI is an AI research company founded in 2015. It created GPT-4 and the ChatGPT product.",
    }
    for key, val in stubs.items():
        if key in topic.lower():
            return val
    return f"No stub found for '{topic}'. In production, this would query Wikipedia's API."

We implement the tool registry that allows agents to discover and use external capabilities dynamically. We design a structured system in which tools are registered with metadata, including parameters, descriptions, and return values. We also implement several practical tools, such as a calculator, unit converter, date calculator, and a Wikipedia search stub that the agents can invoke during execution.
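The registration pattern is easiest to see in isolation. The sketch below is a stripped-down mirror of the ToolRegistry above (names like `MiniRegistry` and `adder` are illustrative only): a decorator stores the function under a name, and `call` dispatches keyword arguments to it.

```python
from typing import Any, Callable, Dict

class MiniRegistry:
    """Stripped-down version of the ToolRegistry pattern above."""
    def __init__(self):
        self._tools: Dict[str, Callable] = {}

    def register(self, name: str):
        def decorator(fn: Callable) -> Callable:
            self._tools[name] = fn   # store under the given name
            return fn                # leave the function usable directly too
        return decorator

    def call(self, name: str, **kwargs) -> Any:
        if name not in self._tools:
            raise ValueError(f"Tool '{name}' not found in registry")
        return self._tools[name](**kwargs)

reg = MiniRegistry()

@reg.register("adder")
def adder(a: float, b: float) -> float:
    return a + b

print(reg.call("adder", a=2, b=3))  # → 5
```

Because tools are addressed by string name, the planner can select them from LLM-generated JSON without ever holding a Python reference to the function.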

@registry.register(
    name="statistics_engine",
    description="Computes descriptive statistics on a list of numbers",
    parameters={"numbers": "Comma-separated list of numbers, e.g. '4,8,15,16,23,42'"},
    returns="JSON with mean, median, std_dev, min, max, count"
)
def statistics_engine(numbers: str) -> str:
    try:
        nums = [float(x.strip()) for x in numbers.split(",")]
        n = len(nums)
        mean = sum(nums) / n
        sorted_nums = sorted(nums)
        mid = n // 2
        median = sorted_nums[mid] if n % 2 else (sorted_nums[mid-1] + sorted_nums[mid]) / 2
        std_dev = math.sqrt(sum((x - mean) ** 2 for x in nums) / n)
        return json.dumps({
            "count": n, "mean": round(mean, 4), "median": round(median, 4),
            "std_dev": round(std_dev, 4), "min": min(nums),
            "max": max(nums), "range": max(nums) - min(nums)
        }, indent=2)
    except Exception as e:
        return f"Error: {e}"

@registry.register(
    name="list_sorter",
    description="Sorts a comma-separated list of numbers",
    parameters={"numbers": "Comma-separated numbers", "order": "'asc' or 'desc'"},
    returns="Sorted comma-separated list"
)
def list_sorter(numbers: str, order: str = "asc") -> str:
    nums = [float(x.strip()) for x in numbers.split(",")]
    nums.sort(reverse=(order == "desc"))
    return ", ".join(str(n) for n in nums)

@dataclass
class MemoryEntry:
    role: str
    content: str
    timestamp: float = field(default_factory=time.time)
    metadata: Dict = field(default_factory=dict)

class MemoryManager:
    def __init__(self, config: BlueprintMemory, llm_client: OpenAI):
        self.config = config
        self.client = llm_client
        self._history: List[MemoryEntry] = []
        self._summary: str = ""

    def add(self, role: str, content: str, metadata: Dict = None):
        self._history.append(MemoryEntry(role=role, content=content, metadata=metadata or {}))
        if (self.config.type == MemoryType.EPISODIC and
                len(self._history) > self.config.summarize_after):
            self._compress_memory()

    def _compress_memory(self):
        to_compress = self._history[:-self.config.window_size]
        self._history = self._history[-self.config.window_size:]
        text = "\n".join(f"{e.role}: {e.content[:200]}" for e in to_compress)
        try:
            resp = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content":
                           f"Summarize this conversation history in 3 sentences:\n{text}"}],
                max_tokens=150
            )
            self._summary += " " + resp.choices[0].message.content.strip()
        except Exception:
            self._summary += f" [compressed {len(to_compress)} messages]"

    def get_messages(self, system_prompt: str) -> List[Dict]:
        messages = [{"role": "system", "content": system_prompt}]
        if self._summary:
            messages.append({"role": "system",
                             "content": f"[Memory Summary]: {self._summary.strip()}"})
        for entry in self._history[-self.config.window_size:]:
            messages.append({
                "role": entry.role if entry.role != "tool" else "assistant",
                "content": entry.content
            })
        return messages

    def clear(self):
        self._history = []
        self._summary = ""

    @property
    def message_count(self) -> int:
        return len(self._history)

We extend the tool ecosystem and introduce the memory management layer that stores conversation history and compresses it when necessary. We implement statistical tools and sorting utilities that enable the data analysis agent to perform structured numerical operations. At the same time, we design a memory system that tracks interactions, summarizes long histories, and provides contextual messages to the language model.
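The compression trigger is the subtle part of the memory design: once the history exceeds `summarize_after`, everything outside the last `window_size` turns is folded away. The stdlib sketch below mirrors that trigger without the LLM call (the real MemoryManager asks gpt-4o-mini for the summary; this stub just records a count), so the windowing logic can be verified in isolation.

```python
class WindowMemory:
    """Sliding-window memory sketch mirroring MemoryManager's compression trigger.
    The article's version asks an LLM for a summary; this stub records a count."""
    def __init__(self, window: int = 3, compress_after: int = 6):
        self.window = window
        self.compress_after = compress_after
        self.history = []
        self.summary = ""

    def add(self, role: str, content: str):
        self.history.append((role, content))
        if len(self.history) > self.compress_after:
            dropped = self.history[:-self.window]      # everything outside the window
            self.history = self.history[-self.window:]
            self.summary += f"[compressed {len(dropped)} messages]"

    def messages(self):
        out = [("system", self.summary)] if self.summary else []
        return out + self.history

mem = WindowMemory(window=3, compress_after=6)
for i in range(8):
    mem.add("user", f"turn {i}")
print(len(mem.history), repr(mem.summary))
```

After eight turns, the seventh add trips the trigger: four old turns collapse into the summary, and subsequent turns accumulate again until the threshold is next exceeded.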

@dataclass
class PlanStep:
    step_id: int
    description: str
    tool: Optional[str]
    tool_args: Dict[str, Any]
    reasoning: str

@dataclass
class Plan:
    task: str
    steps: List[PlanStep]
    strategy: PlanningStrategy

class Planner:
    def __init__(self, blueprint: CognitiveBlueprint,
                 registry: ToolRegistry, llm_client: OpenAI):
        self.blueprint = blueprint
        self.registry = registry
        self.client = llm_client

    def _build_planner_prompt(self) -> str:
        bp = self.blueprint
        return textwrap.dedent(f"""
        You are {bp.identity.name}, version {bp.identity.version}.
        {bp.identity.description}

        ## Your Goals:
        {chr(10).join(f'  - {g}' for g in bp.goals)}

        ## Your Constraints:
        {chr(10).join(f'  - {c}' for c in bp.constraints)}

        ## Available Tools:
        {self.registry.get_tool_descriptions(bp.tools)}

        ## Planning Strategy: {bp.planning.strategy}
        ## Max Steps: {bp.planning.max_steps}

        Given a user task, produce a JSON execution plan with this exact structure:
        {{
          "steps": [
            {{
              "step_id": 1,
              "description": "What this step does",
              "tool": "tool_name or null if no tool needed",
              "tool_args": {{"arg1": "value1"}},
              "reasoning": "Why this step is needed"
            }}
          ]
        }}

        Rules:
        - Only use tools listed above
        - Set tool to null for pure reasoning steps
        - Keep steps <= {bp.planning.max_steps}
        - Return ONLY valid JSON, no markdown fences
        {bp.system_prompt_extra}
        """).strip()

    def plan(self, task: str, memory: MemoryManager) -> Plan:
        system_prompt = self._build_planner_prompt()
        messages = memory.get_messages(system_prompt)
        messages.append({"role": "user", "content":
                         f"Create a plan to complete this task: {task}"})
        resp = self.client.chat.completions.create(
            model="gpt-4o-mini", messages=messages,
            max_tokens=1200, temperature=0.2
        )
        raw = resp.choices[0].message.content.strip()
        raw = raw.replace("```json", "").replace("```", "").strip()
        data = json.loads(raw)
        steps = [
            PlanStep(
                step_id=s["step_id"], description=s["description"],
                tool=s.get("tool"), tool_args=s.get("tool_args", {}),
                reasoning=s.get("reasoning", "")
            )
            for s in data["steps"]
        ]
        return Plan(task=task, steps=steps, strategy=self.blueprint.planning.strategy)

@dataclass
class StepResult:
    step_id: int
    success: bool
    output: str
    tool_used: Optional[str]
    error: Optional[str] = None

@dataclass
class ExecutionTrace:
    plan: Plan
    results: List[StepResult]
    final_answer: str

class Executor:
    def __init__(self, blueprint: CognitiveBlueprint,
                 registry: ToolRegistry, llm_client: OpenAI):
        self.blueprint = blueprint
        self.registry = registry
        self.client = llm_client

We implement the planning system that transforms a user task into a structured execution plan composed of multiple steps. We design a planner that instructs the language model to produce a JSON plan containing reasoning, tool selection, and arguments for each step. This planning layer allows the agent to break complex problems into smaller executable actions before performing them.
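Even with "no markdown fences" in the rules, models routinely wrap JSON in them, which is why `Planner.plan` strips fences before `json.loads`. The self-contained sketch below isolates that parsing step (the fence string is built with `chr(96)` only so this snippet can itself live inside a fenced block; the logic matches the planner's `replace` calls).

```python
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    step_id: int
    description: str
    tool: Optional[str]
    tool_args: dict

FENCE = chr(96) * 3  # three backticks, built indirectly to keep this snippet fence-safe

def parse_plan(raw: str) -> List[Step]:
    """Strip the markdown fences models often add, then parse the JSON plan."""
    cleaned = raw.replace(FENCE + "json", "").replace(FENCE, "").strip()
    data = json.loads(cleaned)
    return [Step(s["step_id"], s["description"], s.get("tool"), s.get("tool_args", {}))
            for s in data["steps"]]

# Simulated model output, fences and all
model_output = (FENCE + "json\n"
                '{"steps": [{"step_id": 1, "description": "convert units",\n'
                '            "tool": "unit_converter",\n'
                '            "tool_args": {"value": 10, "from_unit": "km", "to_unit": "miles"}}]}\n'
                + FENCE)
steps = parse_plan(model_output)
print(steps[0].tool)  # → unit_converter
```

In production you would also guard the `json.loads` with a retry loop, since a malformed plan is the most common planner failure.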

    def execute_plan(self, plan: Plan, memory: MemoryManager,
                     verbose: bool = True) -> ExecutionTrace:
        results: List[StepResult] = []
        if verbose:
            console.print(f"\n[bold yellow]Executing:[/] {plan.task}")
            console.print(f"  Strategy: {plan.strategy} | Steps: {len(plan.steps)}")

        for step in plan.steps:
            if verbose:
                console.print(f"\n  [cyan]Step {step.step_id}:[/] {step.description}")
            try:
                if step.tool and step.tool != "null":
                    if verbose:
                        console.print(f"    Tool: [green]{step.tool}[/] | Args: {step.tool_args}")
                    output = self.registry.call(step.tool, **step.tool_args)
                    result = StepResult(step.step_id, True, str(output), step.tool)
                    if verbose:
                        console.print(f"    Result: {output}")
                else:
                    context_text = "\n".join(
                        f"Step {r.step_id} result: {r.output}" for r in results)
                    prompt = (
                        f"Previous results:\n{context_text}\n\n"
                        f"Now complete this step: {step.description}\n"
                        f"Reasoning hint: {step.reasoning}"
                    ) if context_text else (
                        f"Complete this step: {step.description}\n"
                        f"Reasoning hint: {step.reasoning}"
                    )
                    sys_prompt = (
                        f"You are {self.blueprint.identity.name}. "
                        f"{self.blueprint.identity.description}. "
                        f"Constraints: {'; '.join(self.blueprint.constraints)}"
                    )
                    resp = self.client.chat.completions.create(
                        model="gpt-4o-mini",
                        messages=[
                            {"role": "system", "content": sys_prompt},
                            {"role": "user", "content": prompt}
                        ],
                        max_tokens=500, temperature=0.3
                    )
                    output = resp.choices[0].message.content.strip()
                    result = StepResult(step.step_id, True, output, None)
                    if verbose:
                        preview = output[:120] + "..." if len(output) > 120 else output
                        console.print(f"    Reasoning: {preview}")
            except Exception as e:
                result = StepResult(step.step_id, False, "", step.tool, str(e))
                if verbose:
                    console.print(f"    Error: {e}")
            results.append(result)

        final_answer = self._synthesize(plan, results, memory)
        return ExecutionTrace(plan=plan, results=results, final_answer=final_answer)

    def _synthesize(self, plan: Plan, results: List[StepResult],
                    memory: MemoryManager) -> str:
        steps_summary = "\n".join(
            f"Step {r.step_id} ({'ok' if r.success else 'failed'}): {r.output[:300]}"
            for r in results
        )
        synthesis_prompt = (
            f"Original task: {plan.task}\n\n"
            f"Step results:\n{steps_summary}\n\n"
            f"Provide a clear, complete final answer. Integrate all step results."
        )
        sys_prompt = (
            f"You are {self.blueprint.identity.name}. "
            + ("Always show your reasoning. " if self.blueprint.validation.require_reasoning else "")
            + f"Goals: {'; '.join(self.blueprint.goals)}"
        )
        messages = memory.get_messages(sys_prompt)
        messages.append({"role": "user", "content": synthesis_prompt})
        resp = self.client.chat.completions.create(
            model="gpt-4o-mini", messages=messages,
            max_tokens=600, temperature=0.3
        )
        return resp.choices[0].message.content.strip()

@dataclass
class ValidationResult:
    passed: bool
    issues: List[str]
    score: float

class Validator:
    def __init__(self, blueprint: CognitiveBlueprint, llm_client: OpenAI):
        self.blueprint = blueprint
        self.client = llm_client

    def validate(self, answer: str, task: str,
                 use_llm_check: bool = False) -> ValidationResult:
        issues = []
        v = self.blueprint.validation

        if len(answer) < v.min_response_length:
            issues.append(f"Response too short: {len(answer)} chars (min: {v.min_response_length})")

        answer_lower = answer.lower()
        for phrase in v.forbidden_phrases:
            if phrase.lower() in answer_lower:
                issues.append(f"Forbidden phrase detected: '{phrase}'")

        if v.require_reasoning:
            indicators = ["because", "therefore", "since", "step", "first",
                          "result", "calculated", "computed", "found that"]
            if not any(ind in answer_lower for ind in indicators):
                issues.append("Response lacks visible reasoning or explanation")

        if use_llm_check:
            issues.extend(self._llm_quality_check(answer, task))

        return ValidationResult(passed=len(issues) == 0,
                                issues=issues,
                                score=max(0.0, 1.0 - len(issues) * 0.25))

    def _llm_quality_check(self, answer: str, task: str) -> List[str]:
        prompt = (
            f"Task: {task}\n\nAnswer: {answer[:500]}\n\n"
            f'Does this answer address the task? Reply JSON: {{"on_topic": true/false, "issue": "..."}}'
        )
        try:
            resp = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=100
            )
            raw = resp.choices[0].message.content.strip().replace("```json", "").replace("```", "")
            data = json.loads(raw)
            if not data.get("on_topic", True):
                return [f"LLM quality check: {data.get('issue', 'off-topic')}"]
        except Exception:
            pass
        return []

We build the executor and validation logic that actually performs the steps generated by the planner. We implement a system that can either call registered tools or perform reasoning through the language model, depending on the step definition. We also add a validator that checks the final response against blueprint constraints such as minimum length, reasoning requirements, and forbidden phrases.
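The rule checks are cheap enough to test directly. The function below mirrors the Validator's three rule checks in a standalone form (defaults shortened for the demo; the real class reads them from the blueprint), so you can see exactly which answers pass and why.

```python
def validate(answer: str, min_len: int = 20,
             forbidden=("i don't know",), require_reasoning: bool = True):
    """Rule checks mirroring the Validator above; returns (passed, issues)."""
    issues = []
    if len(answer) < min_len:
        issues.append("too short")
    low = answer.lower()
    issues += [f"forbidden: {p}" for p in forbidden if p in low]
    # Same idea as the indicator list in Validator.validate, trimmed for the demo
    indicators = ("because", "therefore", "step", "result")
    if require_reasoning and not any(w in low for w in indicators):
        issues.append("no visible reasoning")
    return (len(issues) == 0, issues)

ok, issues = validate("The result is 42 because 2 ** 5 + 10 = 42.")
print(ok, issues)  # → True []
ok2, issues2 = validate("I don't know.")
print(ok2, issues2)
```

The second answer fails on all three rules at once, which is what drives the retry loop in the runtime engine below.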

@dataclass
class AgentResponse:
    agent_name: str
    task: str
    final_answer: str
    trace: ExecutionTrace
    validation: ValidationResult
    retries: int
    total_steps: int

class RuntimeEngine:
    def __init__(self, blueprint: CognitiveBlueprint,
                 registry: ToolRegistry, llm_client: OpenAI):
        self.blueprint = blueprint
        self.memory = MemoryManager(blueprint.memory, llm_client)
        self.planner = Planner(blueprint, registry, llm_client)
        self.executor = Executor(blueprint, registry, llm_client)
        self.validator = Validator(blueprint, llm_client)

    def run(self, task: str, verbose: bool = True) -> AgentResponse:
        bp = self.blueprint
        if verbose:
            console.print(Panel(
                f"[bold]Agent:[/] {bp.identity.name} v{bp.identity.version}\n"
                f"[bold]Task:[/] {task}\n"
                f"[bold]Strategy:[/] {bp.planning.strategy} | "
                f"Max Steps: {bp.planning.max_steps} | "
                f"Max Retries: {bp.planning.max_retries}",
                title=" Runtime Engine Starting", border_style="blue"
            ))

        self.memory.add("user", task)
        retries, trace, validation = 0, None, None

        for attempt in range(bp.planning.max_retries + 1):
            if attempt > 0 and verbose:
                console.print(f"\n[yellow]⟳ Retry {attempt}/{bp.planning.max_retries}[/]")
                console.print(f"   Issues: {', '.join(validation.issues)}")

            if verbose:
                console.print("\n[bold magenta] Phase 1: Planning...[/]")
            try:
                plan = self.planner.plan(task, self.memory)
                if verbose:
                    tree = Tree(f"[bold]Plan ({len(plan.steps)} steps)[/]")
                    for s in plan.steps:
                        icon = "" if s.tool else ""
                        branch = tree.add(f"{icon} Step {s.step_id}: {s.description}")
                        if s.tool:
                            branch.add(f"[green]Tool:[/] {s.tool}")
                            branch.add(f"[yellow]Args:[/] {s.tool_args}")
                    console.print(tree)
            except Exception as e:
                if verbose: console.print(f"[red]Planning failed:[/] {e}")
                break

            if verbose:
                console.print("\n[bold magenta] Phase 2: Executing...[/]")
            trace = self.executor.execute_plan(plan, self.memory, verbose=verbose)

            if verbose:
                console.print("\n[bold magenta] Phase 3: Validating...[/]")
            validation = self.validator.validate(trace.final_answer, task)

            if verbose:
                status = "[green]PASSED[/]" if validation.passed else "[red]FAILED[/]"
                console.print(f"   Validation: {status} | Score: {validation.score:.2f}")
                for issue in validation.issues:
                    console.print(f"   {issue}")

            if validation.passed:
                break

            retries += 1
            self.memory.add("assistant", trace.final_answer)
            self.memory.add("user",
                f"Your previous answer had issues: {'; '.join(validation.issues)}. "
                f"Please improve."
            )

        if trace:
            self.memory.add("assistant", trace.final_answer)

        if verbose:
            console.print(Panel(
                trace.final_answer if trace else "No answer generated",
                title=f" Final Answer — {bp.identity.name}",
                border_style="green"
            ))

        return AgentResponse(
            agent_name=bp.identity.name, task=task,
            final_answer=trace.final_answer if trace else "",
            trace=trace, validation=validation,
            retries=retries,
            total_steps=len(trace.results) if trace else 0
        )

    def reset_memory(self):
        self.memory.clear()

def build_engine(blueprint_yaml: str, registry: ToolRegistry,
                 llm_client: OpenAI) -> RuntimeEngine:
    return RuntimeEngine(load_blueprint_from_yaml(blueprint_yaml), registry, llm_client)

if __name__ == "__main__":

    print("\n" + "=" * 60)
    print("DEMO 1: ResearchBot")
    print("=" * 60)
    research_engine = build_engine(RESEARCH_AGENT_YAML, registry, client)
    research_engine.run(
        task=(
            "how many steps of 20cm height would that be? Also, if I burn 0.15 "
            "calories per step, what's the total calorie burn? Show all calculations."
        )
    )

    print("\n" + "=" * 60)
    print("DEMO 2: DataAnalystBot")
    print("=" * 60)
    analyst_engine = build_engine(DATA_ANALYST_YAML, registry, client)
    analyst_engine.run(
        task=(
            "Analyze this dataset of monthly sales figures (in thousands): "
            "142, 198, 173, 155, 221, 189, 203, 167, 244, 198, 212, 231. "
            "Compute key statistics, identify the best and worst months, "
            "and calculate growth from first to last month."
        )
    )

    print("\n" + "=" * 60)
    print("PORTABILITY DEMO: Same task → 2 different blueprints")
    print("=" * 60)
    SHARED_TASK = "Calculate 15% of 2,500 and tell me the result."

    responses = {}
    for name, yaml_str in [
        ("ResearchBot", RESEARCH_AGENT_YAML),
        ("DataAnalystBot", DATA_ANALYST_YAML),
    ]:
        eng = build_engine(yaml_str, registry, client)
        responses[name] = eng.run(SHARED_TASK, verbose=False)

    table = Table(title=" Blueprint Portability", show_header=True, show_lines=True)
    table.add_column("Agent", style="cyan", width=18)
    table.add_column("Steps", style="yellow", width=6)
    table.add_column("Valid?", width=7)
    table.add_column("Score", width=6)
    table.add_column("Answer Preview", width=55)

    for name, r in responses.items():
        table.add_row(
            name, str(r.total_steps),
            "" if r.validation.passed else "",
            f"{r.validation.score:.2f}",
            r.final_answer[:140] + "…"
        )
    console.print(table)

We assemble the runtime engine that orchestrates planning, execution, memory updates, and validation into a complete autonomous workflow. We run multiple demonstrations showing how different blueprints produce different behaviors while using the same core architecture. Finally, we illustrate blueprint portability by running the same task across two agents and comparing their results.

In conclusion, we created a fully functional Auton-style runtime system that integrates cognitive blueprints, tool registries, memory management, planning, execution, and validation into a cohesive framework. We demonstrated how different agents can share the same underlying architecture while behaving differently through customized blueprints, highlighting the design’s flexibility and power. Through this implementation, we not only explored how modern runtime agents operate but also built a strong foundation that we can extend further with richer tools, stronger memory systems, and more advanced autonomous behaviors.

Check out the Full Codes here and Related Paper.
The post Building Next-Gen Agentic AI: A Complete Framework for Cognitive Blueprint Driven Runtime Agents with Memory Tools and Validation appeared first on MarkTechPost.

Google Launches TensorFlow 2.21 And LiteRT: Faster GPU Performance, Ne …

Google has officially released TensorFlow 2.21. The most significant update in this release is the graduation of LiteRT from its preview stage to a fully production-ready stack. Moving forward, LiteRT serves as the universal on-device inference framework, officially replacing TensorFlow Lite (TFLite).

This update streamlines the deployment of machine learning models to mobile and edge devices while expanding hardware and framework compatibility.

LiteRT: Performance and Hardware Acceleration

When deploying models to edge devices (like smartphones or IoT hardware), inference speed and battery efficiency are primary constraints. LiteRT addresses this with updated hardware acceleration:

GPU Improvements: LiteRT delivers 1.4x faster GPU performance compared to the previous TFLite framework.

NPU Integration: The release introduces state-of-the-art NPU acceleration with a unified, streamlined workflow for both GPU and NPU across edge platforms.

This infrastructure is specifically designed to support cross-platform GenAI deployment for open models like Gemma.

Lower Precision Operations (Quantization)

To run complex models on devices with limited memory, developers use a technique called quantization. This involves lowering the precision—the number of bits—used to store a neural network’s weights and activations.
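As a minimal illustration of the idea (not the TFLite implementation), symmetric per-tensor int8 quantization maps each float tensor onto 8-bit integers plus a single scale factor:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]."""
    scale = float(np.max(np.abs(weights))) / 127.0  # one scale per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

w = np.array([0.82, -1.27, 0.003, 0.5], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q, scale)  # 8-bit weights use a quarter of float32 storage
```

Each weight now occupies one byte instead of four, and the worst-case rounding error per weight is bounded by half the scale; int4 and int2 push the same trade-off further with 15 and 3 representable non-zero levels respectively.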

TensorFlow 2.21 significantly expands the tf.lite operators’ support for lower-precision data types to improve efficiency:

The SQRT operator now supports int8 and int16x8.

Comparison operators now support int16x8.

tfl.cast now supports conversions involving INT2 and INT4.

tfl.slice has added support for INT4.

tfl.fully_connected now includes support for INT2.

Expanded Framework Support

Historically, converting models from different training frameworks into a mobile-friendly format could be difficult. LiteRT simplifies this by offering first-class PyTorch and JAX support via seamless model conversion.

Developers can now train their models in PyTorch or JAX and convert them directly for on-device deployment without needing to rewrite the architecture in TensorFlow first.

Maintenance, Security, and Ecosystem Focus

Google is shifting its TensorFlow Core resources to focus heavily on long-term stability. The development team will now exclusively focus on:

Security and bug fixes: Quickly addressing security vulnerabilities and critical bugs by releasing minor and patch versions as required.

Dependency updates: Releasing minor versions to support updates to underlying dependencies, including new Python releases.

Community contributions: Continuing to review and accept critical bug fixes from the open-source community.

These commitments apply to the broader enterprise ecosystem, including: TF.data, TensorFlow Serving, TFX, TensorFlow Data Validation, TensorFlow Transform, TensorFlow Model Analysis, TensorFlow Recommenders, TensorFlow Text, TensorBoard, and TensorFlow Quantum.

Key Takeaways

LiteRT Officially Replaces TFLite: LiteRT has graduated from preview to full production, officially becoming Google’s primary on-device inference framework for deploying machine learning models to mobile and edge environments.

Major GPU and NPU Acceleration: The updated runtime delivers 1.4x faster GPU performance compared to TFLite and introduces a unified workflow for NPU (Neural Processing Unit) acceleration, making it easier to run heavy GenAI workloads (like Gemma) on specialized edge hardware.

Aggressive Model Quantization (INT4/INT2): To maximize memory efficiency on edge devices, tf.lite operators have expanded support for extreme lower-precision data types. This includes int8/int16 for SQRT and comparison operations, alongside INT4 and INT2 support for cast, slice, and fully_connected operators.

Seamless PyTorch and JAX Interoperability: Developers are no longer locked into training with TensorFlow for edge deployment. LiteRT now provides first-class, native model conversion for both PyTorch and JAX, streamlining the pipeline from research to production.

Check out the Technical details and Repo.
The post Google Launches TensorFlow 2.21 And LiteRT: Faster GPU Performance, New NPU Acceleration, And Seamless PyTorch Edge Deployment Upgrades appeared first on MarkTechPost.

A Production-Style NetworKit 11.2.1 Coding Tutorial for Large-Scale Gr …

In this tutorial, we implement a production-grade, large-scale graph analytics pipeline in NetworKit, focusing on speed, memory efficiency, and version-safe APIs in NetworKit 11.2.1. We generate a large scale-free network, extract the largest connected component, and then compute structural backbone signals via k-core decomposition and centrality ranking. We also detect communities with PLM and quantify quality using modularity; estimate distance structure using effective and estimated diameters; and, finally, sparsify the graph to reduce cost while preserving key properties. We export the sparsified graph as an edgelist so we can reuse it in downstream workflows, benchmarking, or graph ML preprocessing.

!pip -q install networkit pandas numpy psutil

import gc, time, os
import numpy as np
import pandas as pd
import psutil
import networkit as nk

print("NetworKit:", nk.__version__)
nk.setNumberOfThreads(min(2, nk.getMaxNumberOfThreads()))
nk.setSeed(7, False)

def ram_gb():
    p = psutil.Process(os.getpid())
    return p.memory_info().rss / (1024**3)

def tic():
    return time.perf_counter()

def toc(t0, msg):
    print(f"{msg}: {time.perf_counter()-t0:.3f}s | RAM~{ram_gb():.2f} GB")

def report(G, name):
    print(f"\n[{name}] nodes={G.numberOfNodes():,} edges={G.numberOfEdges():,} directed={G.isDirected()} weighted={G.isWeighted()}")

def force_cleanup():
    gc.collect()

PRESET = "LARGE"

if PRESET == "LARGE":
    N = 120_000
    M_ATTACH = 6
    AB_EPS = 0.12
    ED_RATIO = 0.9
elif PRESET == "XL":
    N = 250_000
    M_ATTACH = 6
    AB_EPS = 0.15
    ED_RATIO = 0.9
else:
    N = 80_000
    M_ATTACH = 6
    AB_EPS = 0.10
    ED_RATIO = 0.9

print(f"\nPreset={PRESET} | N={N:,} | m={M_ATTACH} | approx-betweenness epsilon={AB_EPS}")

We set up the Colab environment with NetworKit and monitoring utilities, and we lock in a stable random seed. We configure thread usage to match the runtime and define timing and RAM-tracking helpers for each major stage. We choose a scale preset that controls graph size and approximation knobs so the pipeline stays large but manageable.

t0 = tic()
G = nk.generators.BarabasiAlbertGenerator(M_ATTACH, N).generate()
toc(t0, "Generated BA graph")
report(G, "G")

t0 = tic()
cc = nk.components.ConnectedComponents(G)
cc.run()
toc(t0, "ConnectedComponents")
print("components:", cc.numberOfComponents())

if cc.numberOfComponents() > 1:
    t0 = tic()
    G = nk.graphtools.extractLargestConnectedComponent(G, compactGraph=True)
    toc(t0, "Extracted LCC (compactGraph=True)")
    report(G, "LCC")

force_cleanup()

We generate a large Barabási–Albert graph and immediately log its size and runtime footprint. We compute connected components to understand fragmentation and quickly diagnose topology. We extract the largest connected component and compact it to improve the rest of the pipeline’s performance and reliability.

t0 = tic()
core = nk.centrality.CoreDecomposition(G)
core.run()
toc(t0, "CoreDecomposition")
core_vals = np.array(core.scores(), dtype=np.int32)
print("degeneracy (max core):", int(core_vals.max()))
print("core stats:", pd.Series(core_vals).describe(percentiles=[0.5, 0.9, 0.99]).to_dict())

k_thr = int(np.percentile(core_vals, 97))

t0 = tic()
nodes_backbone = [u for u in range(G.numberOfNodes()) if core_vals[u] >= k_thr]
G_backbone = nk.graphtools.subgraphFromNodes(G, nodes_backbone)
toc(t0, f"Backbone subgraph (k>={k_thr})")
report(G_backbone, "Backbone")

force_cleanup()

t0 = tic()
pr = nk.centrality.PageRank(G, damp=0.85, tol=1e-8)
pr.run()
toc(t0, "PageRank")

pr_scores = np.array(pr.scores(), dtype=np.float64)
top_pr = np.argsort(-pr_scores)[:15]
print("Top PageRank nodes:", top_pr.tolist())
print("Top PageRank scores:", pr_scores[top_pr].tolist())

t0 = tic()
abw = nk.centrality.ApproxBetweenness(G, epsilon=AB_EPS)
abw.run()
toc(t0, "ApproxBetweenness")

abw_scores = np.array(abw.scores(), dtype=np.float64)
top_abw = np.argsort(-abw_scores)[:15]
print("Top ApproxBetweenness nodes:", top_abw.tolist())
print("Top ApproxBetweenness scores:", abw_scores[top_abw].tolist())

force_cleanup()

We compute the core decomposition to measure degeneracy and identify the network’s high-density backbone. We extract a backbone subgraph using a high core-percentile threshold to focus on structurally important nodes. We run PageRank and approximate betweenness to rank nodes by influence and bridge-like behavior at scale.

t0 = tic()
plm = nk.community.PLM(G, refine=True, gamma=1.0, par="balanced")
plm.run()
toc(t0, "PLM community detection")

part = plm.getPartition()
num_comms = part.numberOfSubsets()
print("communities:", num_comms)

t0 = tic()
Q = nk.community.Modularity().getQuality(part, G)
toc(t0, "Modularity")
print("modularity Q:", Q)

sizes = np.array(list(part.subsetSizeMap().values()), dtype=np.int64)
print("community size stats:", pd.Series(sizes).describe(percentiles=[0.5, 0.9, 0.99]).to_dict())

t0 = tic()
eff = nk.distance.EffectiveDiameter(G, ED_RATIO)
eff.run()
toc(t0, f"EffectiveDiameter (ratio={ED_RATIO})")
print("effective diameter:", eff.getEffectiveDiameter())

t0 = tic()
diam = nk.distance.EstimatedDiameter(G)
diam.run()
toc(t0, "EstimatedDiameter")
print("estimated diameter:", diam.getDiameter().distance)

force_cleanup()

We detect communities using PLM and record the number of communities found on the large graph. We compute modularity and summarize community-size statistics to validate the structure rather than simply trusting the partition. We estimate global distance behavior using effective diameter and estimated diameter in an API-safe way for NetworKit 11.2.1.

t0 = tic()
sp = nk.sparsification.LocalSimilaritySparsifier(G, 0.7)
G_sparse = sp.getSparsifiedGraph()
toc(t0, "LocalSimilarity sparsification (alpha=0.7)")
report(G_sparse, "Sparse")

t0 = tic()
pr2 = nk.centrality.PageRank(G_sparse, damp=0.85, tol=1e-8)
pr2.run()
toc(t0, "PageRank on sparse")
pr2_scores = np.array(pr2.scores(), dtype=np.float64)
print("Top PR nodes (sparse):", np.argsort(-pr2_scores)[:15].tolist())

t0 = tic()
plm2 = nk.community.PLM(G_sparse, refine=True, gamma=1.0, par="balanced")
plm2.run()
toc(t0, "PLM on sparse")
part2 = plm2.getPartition()
Q2 = nk.community.Modularity().getQuality(part2, G_sparse)
print("communities (sparse):", part2.numberOfSubsets(), "| modularity (sparse):", Q2)

t0 = tic()
eff2 = nk.distance.EffectiveDiameter(G_sparse, ED_RATIO)
eff2.run()
toc(t0, "EffectiveDiameter on sparse")
print("effective diameter (orig):", eff.getEffectiveDiameter(), "| (sparse):", eff2.getEffectiveDiameter())

force_cleanup()

out_path = "/content/networkit_large_sparse.edgelist"
t0 = tic()
nk.graphio.EdgeListWriter("\t", 0).write(G_sparse, out_path)
toc(t0, "Wrote edge list")
print("Saved:", out_path)

print("\nAdvanced large-graph pipeline complete.")

We sparsify the graph using local similarity to reduce the number of edges while retaining useful structure for downstream analytics. We rerun PageRank, PLM, and effective diameter on the sparsified graph to check whether key signals remain consistent. We export the sparsified graph as an edgelist so we can reuse it across sessions, tools, or additional experiments.

In conclusion, we developed an end-to-end, scalable NetworKit workflow that mirrors real large-network analysis: we started from generation, stabilized the topology with LCC extraction, characterized the structure through cores and centralities, discovered communities and validated them with modularity, and captured global distance behavior through diameter estimates. We then applied sparsification to shrink the graph while keeping it analytically meaningful and saving it for repeatable pipelines. The tutorial provides a practical template we can reuse for real datasets by replacing the generator with an edgelist reader, while keeping the same analysis stages, performance tracking, and export steps.

Check out the Full Codes here.
The post A Production-Style NetworKit 11.2.1 Coding Tutorial for Large-Scale Graph Analytics, Communities, Cores, and Sparsification appeared first on MarkTechPost.

Microsoft Releases Phi-4-Reasoning-Vision-15B: A Compact Multimodal Mo …

Microsoft has released Phi-4-reasoning-vision-15B, a 15 billion parameter open-weight multimodal reasoning model designed for image and text tasks that require both perception and selective reasoning. It is a compact model built to balance reasoning quality, compute efficiency, and training-data requirements, with particular strength in scientific and mathematical reasoning and understanding user interfaces.

https://arxiv.org/pdf/2603.03975

What is the model built on?

Phi-4-reasoning-vision-15B combines the Phi-4-Reasoning language backbone with the SigLIP-2 vision encoder using a mid-fusion architecture. In this setup, the vision encoder first converts images into visual tokens, then those tokens are projected into the language model embedding space and processed by the pretrained language model. This design acts as a practical trade-off: it preserves strong cross-modal reasoning while keeping training and inference costs manageable compared with heavier early-fusion designs.
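The mid-fusion wiring can be illustrated with plain NumPy; the dimensions and random values below are placeholders for illustration, not the real SigLIP-2 or Phi-4 sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes only; actual encoder/LM dimensions differ.
VIS_DIM, LM_DIM = 768, 4096
visual_tokens = rng.standard_normal((256, VIS_DIM))  # output of the vision encoder
text_embeds = rng.standard_normal((32, LM_DIM))      # embedded text tokens
W_proj = rng.standard_normal((VIS_DIM, LM_DIM)) * 0.01  # learned projection layer

# Project visual tokens into the language model's embedding space,
# then concatenate with the text embeddings into one fused sequence.
projected = visual_tokens @ W_proj
sequence = np.concatenate([projected, text_embeds], axis=0)
print(sequence.shape)  # one sequence the pretrained LM processes end to end
```

The key property of this design is that only the projection is new: the vision encoder and language backbone are reused largely as trained, which is what keeps the approach cheaper than early-fusion architectures that retrain the stack jointly.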


Why did Microsoft take the smaller-model route?

Many recent vision-language models have grown in parameter count and token usage, which raises both latency and deployment cost. Phi-4-reasoning-vision-15B was built as a smaller alternative that still handles common multimodal workloads without relying on extremely large training datasets or excessive inference-time token generation. The model was trained on 200 billion multimodal tokens, building on Phi-4-Reasoning, which was trained on 16 billion tokens, and ultimately on the Phi-4 base model, which was trained on 400 billion unique tokens. Microsoft contrasts that with the more than 1 trillion tokens used to train several recent multimodal models such as Qwen 2.5 VL, Qwen 3 VL, Kimi-VL, and Gemma 3.


High-resolution perception was a core design choice

The Microsoft team highlights one of the more useful technical lessons in its report: multimodal reasoning often fails because perception fails first. Models can miss the answer not because they lack reasoning ability, but because they fail to extract the relevant visual details from dense images such as screenshots, documents, or interfaces with small interactive elements.

Phi-4-reasoning-vision-15B uses a dynamic resolution vision encoder with up to 3,600 visual tokens, which is intended to support high-resolution understanding for tasks such as GUI grounding and fine-grained document analysis. The Microsoft team states that high-resolution, dynamic-resolution encoders yield consistent improvements, and explicitly notes that accurate perception is a prerequisite for high-quality reasoning.
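To see why a token budget matters, here is a back-of-the-envelope sketch; the patch size and downscaling policy are assumptions for illustration, and only the 3,600-token cap comes from the report:

```python
import math

MAX_VISUAL_TOKENS = 3600  # cap reported for Phi-4-reasoning-vision-15B

def visual_token_count(width: int, height: int, patch: int = 16) -> int:
    """Estimate tokens for a dynamic-resolution encoder: one token per patch,
    downscaling the image when the budget would be exceeded.
    (patch=16 is an illustrative assumption, not the model's actual value.)"""
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    if tokens <= MAX_VISUAL_TOKENS:
        return tokens
    # Scale both sides down so the patch grid roughly fits the budget.
    s = math.sqrt(MAX_VISUAL_TOKENS / tokens)
    return math.ceil(width * s / patch) * math.ceil(height * s / patch)

print(visual_token_count(640, 480))    # small screenshot: fits without resizing
print(visual_token_count(1920, 1080))  # full-HD screenshot: downscaled toward the cap
```

Under these assumptions a 640×480 image fits comfortably, while a 1920×1080 screenshot must be downscaled; a lower cap would force more aggressive downscaling and lose the small GUI details the report says matter most.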

Mixed reasoning instead of forcing reasoning everywhere

A second important design decision is the model's mixed reasoning and non-reasoning training strategy. Rather than forcing chain-of-thought-style reasoning for all tasks, the Microsoft team trained the model to switch between two modes. Reasoning samples include <think>…</think> traces, while non-reasoning samples begin with <nothink> and are used for perception-focused tasks such as captioning, grounding, OCR, and simple VQA. The reasoning data makes up about 20% of the overall training mixture.

The goal of this hybrid setup is to let the model respond directly on tasks where longer reasoning adds latency without improving accuracy, while still invoking structured reasoning on tasks such as math and science. The Microsoft team also notes an important limitation: the boundary between these modes is learned implicitly, so switching is not always optimal. Users can override the default behavior through explicit prompting with <think> or <nothink> tokens.
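Overriding the mode from client code is just a matter of prefixing the prompt; the token names come from the report, while the exact prompt layout below is an illustrative assumption:

```python
def build_prompt(user_task: str, mode: str = "auto") -> str:
    """Prefix a task with an explicit mode token to override the model's
    implicitly learned reasoning switch. <think>/<nothink> are the tokens
    described in the report; the layout here is an assumption."""
    if mode == "reason":
        return "<think>\n" + user_task   # force structured reasoning
    if mode == "direct":
        return "<nothink>\n" + user_task  # force direct, perception-style output
    return user_task  # "auto": let the model pick a mode itself

print(build_prompt("Read the total from this receipt.", mode="direct"))
print(build_prompt("Solve the handwritten equation in this photo.", mode="reason"))
```

In practice, forcing "direct" on captioning or OCR tasks avoids paying reasoning latency, while forcing "reason" on math and science inputs guards against the model under-reasoning when the implicit switch misfires.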

What areas are stronger?

The Microsoft team highlights two main application areas. The first is scientific and mathematical reasoning over visual inputs, including handwritten equations, diagrams, charts, tables, and quantitative documents. The second is computer-use agent tasks, where the model interprets screen content, localizes GUI elements, and supports interaction with desktop, web, or mobile interfaces.


Benchmark results

The Microsoft team reports the following benchmark scores for Phi-4-reasoning-vision-15B: 84.8 on AI2DTEST, 83.3 on ChartQATEST, 44.9 on MathVerseMINI, 36.2 on MathVisionMINI, 75.2 on MathVistaMINI, 54.3 on MMMUVAL, 64.5 on MMStar, 76.0 on OCRBench, and 88.2 on ScreenSpotv2. The technical report also notes that these results were generated using Eureka ML Insights and VLMEvalKit, with fixed evaluation settings, and that Microsoft presents them as comparison results rather than leaderboard claims.

Key Takeaways

Phi-4-reasoning-vision-15B is a 15B open-weight multimodal model built by combining Phi-4-Reasoning with the SigLIP-2 vision encoder in a mid-fusion architecture.

The Microsoft team designed the model for compact multimodal reasoning, with a focus on math, science, document understanding, and GUI grounding, rather than scaling to a much larger parameter count.

High-resolution visual perception is a core part of the system, with support for dynamic resolution encoding and up to 3,600 visual tokens, which helps on dense screenshots, documents, and interface-heavy tasks.

The model uses mixed reasoning and non-reasoning training, allowing it to switch between <think> and <nothink> modes depending on whether a task needs explicit reasoning or direct perception-based output.

Microsoft’s reported benchmarks show strong performance for its size, including results on AI2DTEST, ChartQATEST, MathVistaMINI, OCRBench, and ScreenSpotv2, which supports its positioning as a compact but capable vision-language reasoning model.

Check out the Paper, Repo and Model Weights.
The post Microsoft Releases Phi-4-Reasoning-Vision-15B: A Compact Multimodal Model for Math, Science, and GUI Understanding appeared first on MarkTechPost.

OpenAI Introduces Codex Security in Research Preview for Context-Aware …

OpenAI has introduced Codex Security, an application security agent that analyzes a codebase, validates likely vulnerabilities, and proposes fixes that developers can review before patching. The product is now rolling out in research preview to ChatGPT Enterprise, Business, and Edu customers through Codex web.

Why OpenAI Built Codex Security

The product is designed for a problem that most engineering teams already know well: security tools often generate too many weak findings, while software teams are shipping code faster with AI-assisted development. In its announcement, the OpenAI team argues that the main issue is not just detection quality, but lack of system context. A vulnerability that looks severe in a generic scan may be low impact in the actual application, while a subtle issue tied to architecture or trust boundaries may be missed entirely. Codex Security is positioned as a context-aware system that tries to reduce that gap.

How Codex Security Works

Codex Security works in 3 stages:

Step 1: Building a Project-Specific Threat Model

The first step is to analyze the repository and generate a project-specific threat model. The system examines the security-relevant structure of the codebase to model what the application does, what it trusts, and where it may be exposed. That threat model is editable, which matters in practice because real systems usually include organization-specific assumptions that automated tooling cannot infer reliably on its own. Allowing teams to refine the model helps keep the analysis aligned with the actual architecture instead of a generic security template.

Step 2: Finding and Validating Vulnerabilities

The second step is vulnerability discovery and validation. Codex Security uses the threat model as context to search for issues and classify findings by their likely real-world impact within that system. Where possible, it pressure-tests findings in sandboxed validation environments. If users configure an environment tailored to the project, the system can validate potential issues in the context of the running application. This deeper validation can reduce false positives further and may allow the system to generate working proof-of-concepts. For engineering teams, that distinction is important: a proof that a flaw is exploitable in the actual system is more useful than a raw static warning because it gives clearer evidence for prioritization and remediation.

Step 3: Proposing Fixes with System Context

The third step is remediation. Codex Security proposes fixes using the full surrounding system context, with the goal of producing patches that improve security while minimizing regressions. Users can filter findings to focus on issues with the highest impact for their team. In addition, Codex Security can learn from feedback over time. When a user changes the criticality of a finding, that feedback can be used to refine the threat model and improve precision in later scans.

https://openai.com/index/codex-security-now-in-research-preview/

A Shift from Pattern Matching to Context-Aware Review

This workflow reflects a broader shift in application security tooling. Traditional scanners are effective at finding known classes of unsafe patterns, but they often struggle to distinguish between code that is theoretically risky and code that is actually exploitable in a specific deployment. The OpenAI team is effectively treating security review as a reasoning problem over repository structure, runtime assumptions, and trust boundaries, rather than as a pure pattern-matching task. That does not remove the need for human review, but it can make the review process narrower and more evidence-driven if the validation step works as described. This framing is an inference from the product design, not a benchmarked independent conclusion.

Beta Metrics Reported by OpenAI

OpenAI also shared beta results. Scans on the same repositories over time showed increasing precision, and in one case noise was reduced by 84% since the initial rollout. The rate of findings with over-reported severity decreased by more than 90%, while false positive rates on detections fell by more than 50% across all repositories. Over the last 30 days, Codex Security reportedly scanned more than 1.2 million commits across external repositories in its beta cohort, identifying 792 critical findings and 10,561 high-severity findings. The OpenAI team adds that critical issues appeared in under 0.1% of scanned commits. These are vendor-reported metrics, but they indicate that OpenAI is optimizing for higher-confidence findings rather than maximum alert volume.

Open-Source Security Work and CVE Reporting

The release also includes an open-source component, Codex for OSS. OpenAI has been using Codex Security on open-source repositories it depends on and sharing high-impact findings with maintainers. It lists OpenSSH, GnuTLS, GOGS, Thorium, libssh, PHP, and Chromium among the projects where it reported critical vulnerabilities, and says 14 CVEs have been assigned, with dual reporting on 2 of them.

Key Takeaways

OpenAI launched Codex Security in research preview for ChatGPT Enterprise, Business, and Edu customers through Codex web, with free usage for the next month.

Codex Security is an application security agent, not just a scanner. OpenAI says it analyzes project context to identify vulnerabilities, validate them, and propose patches developers can review.

The system works in 3 stages: it builds an editable threat model, then prioritizes and validates issues in sandboxed environments where possible, and finally proposes fixes with full system context.

The product is designed to reduce security triage noise. In beta, it reports 84% less noise in one case, more than 90% reduction in over-reported severity, and more than 50% lower false positive rates across repositories.

OpenAI is also extending the product to open source through Codex for OSS, which offers eligible maintainers 6 months of ChatGPT Pro with Codex, conditional access to Codex Security, and API credits.

Check out the Technical details.
The post OpenAI Introduces Codex Security in Research Preview for Context-Aware Vulnerability Detection, Validation, and Patch Generation Across Codebases appeared first on MarkTechPost.

Google AI Releases a CLI Tool (gws) for Workspace APIs: Providing a Un …

Integrating Google Workspace APIs—such as Drive, Gmail, Calendar, and Sheets—into applications and data pipelines typically requires writing boilerplate code to handle REST endpoints, pagination, and OAuth 2.0 flows. The Google AI team has released a CLI tool (gws) for Google Workspace to address this. The open-source googleworkspace/cli (invoked via the gws command) provides a unified, dynamic command-line interface to manage these services.

Designed for both human developers and AI agents, gws eliminates the need for custom wrapper scripts by providing structured JSON outputs, native Model Context Protocol (MCP) support, and automated authentication workflows.

Dynamic API Discovery Architecture

Unlike traditional CLI tools that compile a static list of commands, gws builds its command surface dynamically at runtime.

When executed, gws uses a two-phase parsing strategy: it first identifies the target service, then resolves the rest of the command against that service's API definition. Concretely:

It reads the first argument to identify the target service (e.g., drive).

It fetches that service’s Google Discovery Document (cached for 24 hours).

It builds a command tree from the document’s resources and methods.

It parses the remaining arguments, authenticates, and executes the HTTP request.

Because of this architecture, gws automatically supports new Google Workspace API endpoints the moment they are added to the Discovery Service.
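As a rough illustration of this dispatch model, the sketch below builds a command tree from a simplified, made-up discovery document; the real Discovery Service JSON is considerably richer, and the structure here is only a stand-in.

```python
# Simplified stand-in for a Google Discovery document; real service
# definitions are fetched from the Discovery Service and are far richer.
discovery_doc = {
    "name": "drive",
    "resources": {
        "files": {
            "methods": {
                "list": {"httpMethod": "GET", "path": "files"},
                "get": {"httpMethod": "GET", "path": "files/{fileId}"},
            }
        }
    },
}

def build_command_tree(doc):
    """Flatten resources/methods into 'service resource method' commands."""
    tree = {}
    for resource, spec in doc["resources"].items():
        for method, meta in spec["methods"].items():
            tree[f"{doc['name']} {resource} {method}"] = meta
    return tree

tree = build_command_tree(discovery_doc)
print(sorted(tree))
print(tree["drive files list"]["httpMethod"])
```

Because the tree is derived from data rather than compiled in, swapping in a newer discovery document is enough to surface new commands.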

Core Features for Software Engineers and Data Scientists

The CLI can be installed via npm (npm install -g @googleworkspace/cli) or built from source (cargo install --path .). Once installed, it offers several built-in utilities for data extraction and automation:

Introspection and Preview: Every resource includes --help documentation generated from the Discovery API. You can view the schema of any method (e.g., gws schema drive.files.list) or use the --dry-run flag to preview the exact HTTP request before execution.

Structured Data Extraction: By default, every response—including errors and metadata—is returned as structured JSON.

Auto-Pagination: For devs pulling large datasets, the --page-all flag automatically handles API cursors. It streams paginated results as NDJSON (Newline Delimited JSON), which can be piped directly into command-line JSON processors:

gws drive files list --params '{"pageSize": 100}' --page-all | jq -r '.files[].name'
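For consumers who prefer Python over jq, NDJSON is trivial to parse line by line. The snippet below uses a fabricated two-page payload; in practice you would read the stdout of the gws process (e.g., via subprocess) rather than a string.

```python
import io
import json

# Fabricated two-page NDJSON stream, shaped like drive.files.list responses.
ndjson_stream = io.StringIO(
    '{"files": [{"name": "report.pdf"}, {"name": "notes.txt"}]}\n'
    '{"files": [{"name": "budget.xlsx"}]}\n'
)

names = []
for line in ndjson_stream:          # one complete JSON document per line
    page = json.loads(line)
    names.extend(f["name"] for f in page.get("files", []))

print(names)  # ['report.pdf', 'notes.txt', 'budget.xlsx']
```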

Integration with AI Agents and MCP

A primary use case for gws is serving as a tool-calling backend for Large Language Models (LLMs).

Model Context Protocol (MCP) Server: By running gws mcp -s drive,gmail,calendar, the CLI starts an MCP server over stdio. This exposes Workspace APIs as structured tools that any MCP-compatible client (like Claude Desktop or VS Code) can natively call.

Pre-built Agent Skills: The repository includes over 100 Agent Skills covering all supported APIs and common workflows. AI Engineers can install these directly into agent environments using npx skills add github:googleworkspace/cli.

Gemini CLI Extension: Developers using the Gemini CLI can install the gws extension (gemini extensions install https://github.com/googleworkspace/cli), allowing the local Gemini agent to inherit gws credentials and manage Workspace resources natively.

Model Armor (Response Sanitization): To mitigate prompt injection risks when feeding API data to an LLM, gws supports Google Cloud Model Armor. Passing the --sanitize flag scans API responses for malicious payloads before the data reaches your agent.

Authentication Workflows

The CLI handles authentication securely across different environments, replacing the need for manual token management in custom scripts. Precedence is given to explicit tokens, followed by credentials files, and finally local keyring storage.

Local Desktop: Running gws auth setup initiates an interactive flow to configure a Google Cloud project, enable necessary APIs, and handle OAuth login. Credentials are encrypted at rest using AES-256-GCM and stored in the OS keyring.

Headless / CI/CD: For server environments, developers can complete the interactive auth locally and export the plaintext credentials:

gws auth export --unmasked > credentials.json

On the headless machine, point the CLI to this file using an environment variable: export GOOGLE_WORKSPACE_CLI_CREDENTIALS_FILE=/path/to/credentials.json.

Service Accounts: gws natively supports server-to-server Service Account key files and Domain-Wide Delegation via the GOOGLE_WORKSPACE_CLI_IMPERSONATED_USER variable.
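The precedence order can be sketched as a simple resolution function. This is an illustration of the documented ordering only: the GOOGLE_WORKSPACE_CLI_TOKEN variable name and the function shape are invented for the example, while GOOGLE_WORKSPACE_CLI_CREDENTIALS_FILE is the variable described above.

```python
def resolve_credentials(env, read_file=None, keyring_lookup=None):
    """Return (source, credential) following the documented precedence:
    explicit token, then credentials file, then OS keyring."""
    # Hypothetical variable name -- stands in for any explicit token input.
    if env.get("GOOGLE_WORKSPACE_CLI_TOKEN"):
        return "token", env["GOOGLE_WORKSPACE_CLI_TOKEN"]
    path = env.get("GOOGLE_WORKSPACE_CLI_CREDENTIALS_FILE")
    if path and read_file:
        return "file", read_file(path)
    if keyring_lookup:
        return "keyring", keyring_lookup("gws")
    raise RuntimeError("no credentials found; run `gws auth setup`")

src, cred = resolve_credentials(
    {"GOOGLE_WORKSPACE_CLI_CREDENTIALS_FILE": "/tmp/creds.json"},
    read_file=lambda p: {"path": p},
    keyring_lookup=lambda service: "keyring-secret",
)
print(src)  # file
```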

Check out the Repo here.
The post Google AI Releases a CLI Tool (gws) for Workspace APIs: Providing a Unified Interface for Humans and AI Agents appeared first on MarkTechPost.

A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pi …

In this tutorial, we explore how to use Daft as a high-performance, Python-native data engine to build an end-to-end analytical pipeline. We start by loading a real-world MNIST dataset, then progressively transform it using UDFs, feature engineering, aggregations, joins, and lazy execution, demonstrating how Daft seamlessly combines structured data processing, numerical computation, and machine learning. By the end, we are not just manipulating data; we are building a complete model-ready pipeline powered by Daft's scalable execution engine.

!pip -q install daft pyarrow pandas numpy scikit-learn

import os
os.environ["DO_NOT_TRACK"] = "true"

import numpy as np
import pandas as pd
import daft
from daft import col

print("Daft version:", getattr(daft, "__version__", "unknown"))

URL = "https://github.com/Eventual-Inc/mnist-json/raw/master/mnist_handwritten_test.json.gz"

df = daft.read_json(URL)
print("\nSchema (sampled):")
print(df.schema())

print("\nPeek:")
df.show(5)

We install Daft and its supporting libraries directly in Google Colab, then load the real-world MNIST JSON dataset straight from a remote URL using Daft's native reader. We inspect the schema and preview the first rows to understand the column structure before applying any transformations.

def to_28x28(pixels):
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)

df2 = (
    df
    .with_column(
        "img_28x28",
        col("image").apply(to_28x28, return_dtype=daft.DataType.python())
    )
    .with_column(
        "pixel_mean",
        col("img_28x28").apply(lambda x: float(np.mean(x)) if x is not None else None,
                               return_dtype=daft.DataType.float32())
    )
    .with_column(
        "pixel_std",
        col("img_28x28").apply(lambda x: float(np.std(x)) if x is not None else None,
                               return_dtype=daft.DataType.float32())
    )
)

print("\nAfter reshaping + simple features:")
df2.select("label", "pixel_mean", "pixel_std").show(5)

We reshape the raw pixel arrays into 28×28 images with a row-wise UDF and compute simple statistical features, the per-image mean and standard deviation. These columns turn raw pixel lists into structured, model-friendly signals we can aggregate and join on later.

@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()), batch_size=512)
def featurize(images_28x28):
    out = []
    for img in images_28x28.to_pylist():
        if img is None:
            out.append(None)
            continue
        img = np.asarray(img, dtype=np.float32)
        row_sums = img.sum(axis=1) / 255.0
        col_sums = img.sum(axis=0) / 255.0
        total = img.sum() + 1e-6
        ys, xs = np.indices(img.shape)
        cy = float((ys * img).sum() / total) / 28.0
        cx = float((xs * img).sum() / total) / 28.0
        vec = np.concatenate([row_sums, col_sums,
                              np.array([cy, cx, img.mean() / 255.0, img.std() / 255.0],
                                       dtype=np.float32)])
        out.append(vec.astype(np.float32).tolist())
    return out

df3 = df2.with_column("features", featurize(col("img_28x28")))

print("\nFeature column created (list[float]):")
df3.select("label", "features").show(2)

We implement a batch UDF that extracts a richer feature vector from each reshaped image: normalized row and column sums plus the intensity-weighted centroid and global mean and standard deviation. Batching 512 images per call keeps the per-row Python overhead low while the pipeline stays lazy.
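To sanity-check the centroid arithmetic used in featurize, we can run the same intensity-weighted computation on a tiny synthetic image (the UDF additionally divides by 28 to normalize; that step is skipped here):

```python
import numpy as np

# 3x3 "image" with all mass in a single pixel at row 2, column 1
img = np.zeros((3, 3), dtype=np.float32)
img[2, 1] = 255.0

total = img.sum() + 1e-6
ys, xs = np.indices(img.shape)
cy = float((ys * img).sum() / total)  # intensity-weighted centroid row
cx = float((xs * img).sum() / total)  # intensity-weighted centroid column

print(round(cy), round(cx))  # 2 1
```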

label_stats = (
    df3.groupby("label")
    .agg(
        col("label").count().alias("n"),
        col("pixel_mean").mean().alias("mean_pixel_mean"),
        col("pixel_std").mean().alias("mean_pixel_std"),
    )
    .sort("label")
)

print("\nLabel distribution + summary stats:")
label_stats.show(10)

df4 = df3.join(label_stats, on="label", how="left")

print("\nJoined label stats back onto each row:")
df4.select("label", "n", "mean_pixel_mean", "mean_pixel_std").show(5)

We aggregate per-label counts and feature means with a group-by, then left-join those summary statistics back onto every row. This enriches each example with context about its class while staying inside Daft's lazy, scalable execution model.

small = df4.select("label", "features").collect().to_pandas()

small = small.dropna(subset=["label", "features"]).reset_index(drop=True)

X = np.vstack(small["features"].apply(np.array).values).astype(np.float32)
y = small["label"].astype(int).values

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)

print("\nBaseline accuracy (feature-engineered LogisticRegression):", round(acc, 4))
print("\nClassification report:")
print(classification_report(y_test, pred, digits=4))

out_df = df4.select("label", "features", "pixel_mean", "pixel_std", "n")
out_path = "/content/daft_mnist_features.parquet"
out_df.write_parquet(out_path)

print("\nWrote parquet to:", out_path)

df_back = daft.read_parquet(out_path)
print("\nRead-back check:")
df_back.show(3)

We materialize the selected columns into pandas, split the data into train and test sets, and fit a baseline Logistic Regression model to confirm the engineered features carry real signal. Finally, we persist the processed dataset to Parquet and read it back as a sanity check, completing the pipeline from raw ingestion to production-ready storage.

In this tutorial, we built a production-style data workflow using Daft, moving from raw JSON ingestion to feature engineering, aggregation, model training, and Parquet persistence. We demonstrated how to integrate advanced UDF logic, perform efficient groupby and join operations, and materialize results for downstream machine learning, all within a clean, scalable framework. Through this process, we saw how Daft enables us to handle complex transformations while remaining Pythonic and efficient. We finished with a reusable, end-to-end pipeline that showcases how we can combine modern data engineering and machine learning workflows in a unified environment.

Check out the Full Codes here.
The post A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft for High-Performance Structured and Image Data Processing appeared first on MarkTechPost.

OpenAI Releases Symphony: An Open Source Agentic Framework for Orchest …

OpenAI has released Symphony, an open-source framework designed to manage autonomous AI coding agents through structured ‘implementation runs.’ The project provides a system for automating software development tasks by connecting issue trackers to LLM-based agents.

System Architecture: Elixir and the BEAM

Symphony is built using Elixir and the Erlang/BEAM runtime. The choice of stack focuses on fault tolerance and concurrency. Since autonomous agents often perform long-running tasks that may fail or require retries, the BEAM’s supervision trees allow Symphony to manage hundreds of isolated implementation runs simultaneously.

The system uses PostgreSQL (via Ecto) for state persistence and is designed to run as a persistent daemon. It operates by polling an issue tracker—currently defaulting to Linear—to identify tasks that are ready for an agent to address.

The Implementation Run Lifecycle

The core unit of work in Symphony is the implementation run. The lifecycle of a run follows a specific sequence:

Polling and Triggering: Symphony monitors a specific state in the issue tracker (e.g., ‘Ready for Agent’).

Sandbox Isolation: For each issue, the framework creates a deterministic, per-issue workspace. This ensures the agent’s actions are confined to a specific directory and do not interfere with other concurrent runs.

Agent Execution: An agent (typically using OpenAI’s models) is initialized to perform the task described in the issue.

Proof of Work: Before a task is considered complete, the agent must provide ‘proof of work.’ This includes generating CI status reports, passing unit tests, providing PR review feedback, and creating a walkthrough of the changes.

Landing: If the proof of work is verified, the agent ‘lands’ the code by submitting or merging a Pull Request (PR) into the repository.
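The lifecycle above can be sketched as a small state machine. Symphony itself is written in Elixir, so this Python version is purely illustrative; the state names, workspace path scheme, and callbacks are invented for the example.

```python
import hashlib
from enum import Enum

class RunState(Enum):
    POLLED = 1
    SANDBOXED = 2
    EXECUTED = 3
    PROVEN = 4
    LANDED = 5

def workspace_for(issue_id):
    """Deterministic per-issue workspace, as in the sandbox-isolation step."""
    digest = hashlib.sha256(issue_id.encode()).hexdigest()[:12]
    return f"/var/symphony/runs/{issue_id}-{digest}"

def run_issue(issue_id, execute, proof_of_work):
    states = [RunState.POLLED]
    ws = workspace_for(issue_id)        # sandbox isolation
    states.append(RunState.SANDBOXED)
    execute(ws)                         # agent performs the task
    states.append(RunState.EXECUTED)
    if not proof_of_work(ws):           # CI status, tests, walkthrough
        raise RuntimeError("proof of work failed; run not landed")
    states.append(RunState.PROVEN)
    states.append(RunState.LANDED)      # PR submitted or merged
    return states, ws

states, ws = run_issue("LIN-42", execute=lambda w: None,
                       proof_of_work=lambda w: True)
print(states[-1].name)                  # LANDED
print(ws == workspace_for("LIN-42"))    # True (deterministic)
```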

Configuration via WORKFLOW.md

Symphony utilizes an in-repo configuration file named WORKFLOW.md. This file serves as the technical contract between the developer team and the agent. It contains:

The agent’s primary system instructions and prompts.

Runtime settings for the implementation environment.

Specific rules for how the agent should interact with the codebase.

By keeping these instructions in the repository, teams can version-control their agent policies alongside their source code, ensuring that the agent’s behavior remains consistent with the specific version of the codebase it is modifying.
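As a rough illustration, a WORKFLOW.md might take a shape like the following; the section names and contents here are invented, not Symphony's documented schema:

```
# WORKFLOW.md

## Agent Instructions
You are the implementation agent for this repository. Keep diffs small,
write tests for every change, and never push directly to main.

## Runtime
build: mix deps.get && mix compile
test: mix test

## Rules
- Do not modify files under priv/repo/migrations without a linked issue.
- Every PR description must include a walkthrough of the change.
```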

Harness Engineering Requirements

The documentation specifies that Symphony is most effective in environments that practice harness engineering. This refers to a repository structure that is optimized for machine interaction. Key requirements include:

Hermetic Testing: Tests that can run locally and reliably without external dependencies.

Machine-Readable Docs: Documentation and scripts that allow an agent to discover how to build, test, and deploy the project autonomously.

Modular Architecture: Codebases where side effects are minimized, allowing agents to make changes with high confidence.

Key Takeaways

Fault-Tolerant Orchestration via Elixir: Symphony utilizes Elixir and the Erlang/BEAM runtime to manage agent lifecycles. This architectural choice provides the high concurrency and fault tolerance necessary for supervising long-running, independent ‘implementation runs’ without system-wide failures.

State-Managed Implementation Runs: The framework transitions AI coding from manual prompting to an automated loop: it polls issue trackers (like Linear), creates isolated sandboxed workspaces, executes the agent, and requires ‘Proof of Work’ (CI passes and walkthroughs) before code is merged.

Version-Controlled Agent Contracts: Through the WORKFLOW.md specification, agent prompts and runtime configurations are stored directly in the repository. This treats the AI’s operating instructions as code, ensuring that agent behavior is versioned and synchronized with the specific branch it is modifying.

Dependency on Harness Engineering: For the system to be effective, repositories must adopt harness engineering. This involves structuring codebases for machine legibility, including hermetic (self-contained) test suites and modular architectures that allow agents to verify their own work autonomously.

Focused Scheduler Scope: Symphony is defined strictly as a scheduler, runner, and tracker reader. It is designed specifically to bridge the gap between project management tools and code execution, rather than serving as a general-purpose multi-tenant platform or a broad workflow engine.

Check out the Repo here.
The post OpenAI Releases Symphony: An Open Source Agentic Framework for Orchestrating Autonomous AI Agents through Structured, Scalable Implementation Runs appeared first on MarkTechPost.