This AI Paper Introduces VLM-R³: A Multimodal Framework for Region Recognition, Reasoning, and Refinement in Visual-Linguistic Tasks

Multimodal reasoning ability helps machines perform tasks such as solving math problems embedded in diagrams, reading signs from photographs, or interpreting scientific charts. The integration of both visual and linguistic information enables these systems to more closely mirror human thought processes, making them suitable for tasks that require visual interpretation combined with logical progression.

A major challenge in this area is the inability of current systems to revisit specific parts of an image while reasoning dynamically. Traditional models usually begin by analyzing an image once and then proceed with the rest of the reasoning in pure text. This approach limits accuracy in situations that require revisiting the image to confirm a detail or extract new visual cues during mid-reasoning. These shortcomings are particularly pronounced in tasks that require fine-grained spatial awareness, such as identifying small labels in scientific documents or resolving ambiguities in visually complex scenes.

Some tools and models have been introduced to address this gap, but they often treat visual grounding as a one-time operation. For example, existing systems like LLaVA-CoT or Qwen2.5-VL offer some visual-text integration. Still, they don’t let the model repeatedly and selectively query parts of an image based on the evolving reasoning process. The grounding, if performed, is generally static and lacks the flexibility to adapt based on intermediate reasoning steps. Moreover, these methods do not train models to determine the importance of specific image regions, leading to limitations in complex problem-solving.

Researchers from Peking University, Alibaba Group, and ZEEKR Intelligent Technology have introduced a model called VLM-R³. This model tackles the challenge by allowing a more interactive connection between vision and reasoning. It equips the model with the capacity to determine when visual clarification is needed, identify the exact image region for analysis, and re-integrate this visual content into the reasoning process. This approach mimics human problem-solving, where one might zoom into a chart or revisit a paragraph to verify a detail before making a decision. The model’s structure emphasizes refining its decisions iteratively by relying on visual evidence throughout the reasoning process.

To accomplish this, the researchers built a dataset named Visuo-Lingual Interleaved Rationale (VLIR), designed to train models in a stepwise interaction between images and text. VLM-R³ incorporates this dataset and operates using a method called Region-Conditioned Reinforcement Policy Optimization (R-GRPO). This training strategy encourages the model to selectively focus on informative parts of an image, perform transformations such as cropping or zooming, and incorporate those changes into subsequent logical steps. It simulates how humans shift their attention across different visual elements in response to their thoughts. The architecture integrates a pipeline that loops reasoning with visual inspection in real time, enhancing the system’s ability to interact with visual data during inference.
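The paper's implementation is not reproduced here, but the interleaved loop is easy to picture in code. The minimal Python sketch below assumes a hypothetical multimodal generate() call and a simple dictionary format for region requests; the function names, crop size, and stopping rule are illustrative, not the authors' API.

from PIL import Image

def interleaved_reasoning(model, image: Image.Image, question: str, max_steps: int = 6):
    """Hypothetical sketch of a region-grounded reasoning loop.

    The model is assumed to emit either a final answer or a bounding-box
    request of the form {"bbox": [x0, y0, x1, y1]} when it needs to look again.
    """
    context = [("image", image), ("text", question)]
    step = {}
    for _ in range(max_steps):
        step = model.generate(context)           # assumed multimodal generate() API
        if "bbox" in step:                        # model asks to re-examine a region
            x0, y0, x1, y1 = step["bbox"]
            crop = image.crop((x0, y0, x1, y1)).resize((448, 448))  # zoom into the region
            context += [("text", step["thought"]), ("image", crop)]
        else:                                     # model is confident enough to answer
            return step["answer"]
    return step.get("answer", "")

The key idea is that each crop re-enters the context, so later reasoning steps can condition on the zoomed-in evidence rather than on the original image alone.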

The results demonstrate a strong performance across multiple benchmarks. On MathVista, the model reached 70.4%, an increase from 68.2% in the baseline. For MathVision, the improvement was from 25.1% to 30.2%. On ScienceQA, it posted a 14.3% improvement, reaching 87.9% over the baseline’s 73.6%. On the hallucination test (HallusionBench), the model achieved 62.0%, outperforming others like Mulberry, which scored 54.1%. VLM-R³ also showed superior results on document understanding in DocVQA with a 96.8% score. Comparisons showed that even though it uses fewer parameters than closed-source models like Gemini-2 Flash or GPT-4o, it delivers competitive accuracy, particularly in tasks requiring detailed visual analysis and interleaved reasoning.

This work clearly outlines a problem that exists in how models handle vision during reasoning and presents a well-structured solution. By integrating a method for ongoing image analysis, researchers from the Alibaba Group, Peking University, and ZEEKR have advanced a powerful idea—models that look again, think, and refine. The proposed framework significantly improves accuracy in complex tasks and provides a blueprint for more robust, visually aware AI systems.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post This AI Paper Introduces VLM-R³: A Multimodal Framework for Region Recognition, Reasoning, and Refinement in Visual-Linguistic Tasks appeared first on MarkTechPost.

Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning

Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future state prediction, and zero-shot planning. Building upon the joint-embedding predictive architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield a modular foundation for intelligent physical agents.

Scalable Self-Supervised Pretraining from 1M Hours of Video

V-JEPA 2 is pretrained on over 1 million hours of internet-scale video combined with 1 million images. Using a visual mask denoising objective, the model learns to reconstruct masked spatiotemporal patches in a latent representation space. This approach avoids the inefficiencies of pixel-level prediction by focusing on predictable scene dynamics while disregarding irrelevant noise.
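As a rough picture of that objective (not Meta's released code), one training step of latent mask-denoising might look like the PyTorch-style sketch below; the encoder, EMA target encoder, predictor, and mask are stand-ins for the real modules.

import torch
import torch.nn.functional as F

def masked_latent_loss(encoder, ema_encoder, predictor, video, mask):
    """Sketch: predict latent features of masked spatiotemporal patches.

    video: (B, T, C, H, W) clip; mask: boolean tensor over the N patch tokens.
    ema_encoder is a frozen, exponentially averaged copy that provides targets.
    """
    with torch.no_grad():
        targets = ema_encoder(video)             # (B, N, D) latent tokens for all patches
    context = encoder(video, drop_mask=mask)     # encode only the visible patches
    preds = predictor(context, mask)             # predict latent tokens at all positions (B, N, D)
    # Regress predicted latents onto the target latents of the masked patches only
    return F.l1_loss(preds[mask], targets[mask])

Because the loss lives in representation space rather than pixel space, the model is pushed to capture predictable scene dynamics instead of reconstructing irrelevant detail.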

To scale JEPA pretraining to this level, Meta researchers introduced four key techniques:

Data scaling: Constructed a 22M-sample dataset (VideoMix22M) from public sources like SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.

Model scaling: Expanded the encoder capacity to over 1B parameters using ViT-g.

Training schedule: Adopted a progressive resolution strategy and extended pretraining to 252K iterations.

Spatial-temporal augmentation: Trained on progressively longer and higher-resolution clips, reaching 64 frames at 384×384 resolution.

These design choices led to an 88.2% average accuracy across six benchmark tasks—including SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet—surpassing previous baselines.

Understanding via Masked Representation Learning

V-JEPA 2 exhibits strong motion understanding capabilities. On the Something-Something v2 benchmark, it achieves 77.3% top-1 accuracy, outperforming models like InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pretraining models like DINOv2 and PEcoreG.

The encoder’s representations were evaluated using attentive probes, verifying that self-supervised learning alone can yield transferable and domain-agnostic visual features applicable across diverse classification tasks.
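An attentive probe is essentially a small cross-attention pooling layer with a linear classifier trained on top of frozen features. A minimal PyTorch sketch follows; the dimensions and hyperparameters are arbitrary placeholders rather than the configuration used in the paper.

import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Minimal attentive probe: a learned query pools frozen patch tokens."""
    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) frozen encoder outputs; only the probe's parameters are trained
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)      # (B, 1, D) attention-pooled summary
        return self.head(pooled.squeeze(1))           # (B, num_classes) logits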

Temporal Reasoning via Video Question Answering

To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on multiple video question-answering tasks. Despite lacking language supervision during pretraining, the model achieves:

84.0% on PerceptionTest

76.9% on TempCompass

44.5% on MVP

36.7% on TemporalBench

40.3% on TOMATO

These results challenge the assumption that visual-language alignment requires co-training from the start, demonstrating that a pretrained video encoder can be aligned post hoc with strong generalization.

V-JEPA 2-AC: Learning Latent World Models for Robotic Planning

A key innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. Fine-tuned using only 62 hours of unlabeled robot video from the Droid dataset, V-JEPA 2-AC learns to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M parameter transformer with block-causal attention, trained using a teacher-forcing and rollout objective.

This allows zero-shot planning through model-predictive control. The model infers action sequences by minimizing the distance between imagined future states and visual goals using the Cross-Entropy Method (CEM). It achieves high success in tasks such as reaching, grasping, and pick-and-place on unseen robot arms in different labs—without any reward supervision or additional data collection.
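Conceptually, the planner optimizes an action sequence against the frozen world model with CEM and executes only the first action before replanning. The NumPy sketch below shows the shape of that loop; the rollout call, action dimensionality, and population sizes are illustrative assumptions rather than the released planner.

import numpy as np

def cem_plan(world_model, state_emb, goal_emb, horizon=5, act_dim=7,
             pop=256, elites=32, iters=6):
    """Cross-Entropy Method over action sequences in embedding space (sketch)."""
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian
        actions = mu + sigma * np.random.randn(pop, horizon, act_dim)
        costs = []
        for a in actions:
            pred = world_model.rollout(state_emb, a)        # assumed: predicted future embedding
            costs.append(np.linalg.norm(pred - goal_emb))   # distance to the visual goal
        best = actions[np.argsort(costs)[:elites]]
        mu, sigma = best.mean(axis=0), best.std(axis=0) + 1e-4  # refit Gaussian on elites
    return mu[0]  # execute the first action, then replan (model-predictive control)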

Benchmarks: Robust Performance and Planning Efficiency

Compared to baselines like Octo (behavior cloning) and Cosmos (latent diffusion world models), V-JEPA 2-AC:

Executes plans in ~16 seconds per step (versus 4 minutes for Cosmos).

Reaches a 100% success rate on reach tasks.

Outperforms others in grasp and manipulation tasks across object types.

Notably, it operates using a monocular RGB camera without calibration or environment-specific fine-tuning, reinforcing the generalization capability of the learned world model.

Conclusion

Meta’s V-JEPA 2 represents a significant advancement in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 demonstrates that general-purpose visual representations can be harnessed for both perception and control in the real world.

Check out the Paper, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
The post Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning appeared first on MarkTechPost.

Run Multiple AI Coding Agents in Parallel with Container-Use from Dagger

In AI-driven development, coding agents have become indispensable collaborators. These autonomous or semi-autonomous tools can write, test, and refactor code, dramatically accelerating development cycles. However, as the number of agents working on a single codebase grows, so do the challenges: dependency conflicts, state leakage between agents, and the difficulty of tracking each agent’s actions. The container-use project from Dagger addresses these challenges by offering containerized environments tailored for coding agents. By isolating each agent in its own container, developers can run multiple agents concurrently without interference, inspect their activities in real time, and intervene directly when necessary.

Traditionally, when a coding agent executes tasks, such as installing dependencies, running build scripts, or launching servers, it does so within the developer’s local environment. This approach quickly leads to conflicts: one agent may upgrade a shared library that breaks another agent’s workflow, or an errant script may leave behind artifacts that obscure subsequent runs. Containerization elegantly solves these issues by encapsulating each agent’s environment. Rather than babysitting agents one by one, you can spin up entirely fresh environments, experiment safely, and discard failures instantly, all while maintaining visibility into exactly what each agent executed.

Moreover, because containers can be managed through familiar tools (Docker, Git, and standard CLI utilities), container-use integrates seamlessly into existing workflows. Instead of locking into a proprietary solution, teams can leverage their preferred tech stack, whether that means Python virtual environments, Node.js toolchains, or system-level packages. The result is a flexible architecture that empowers developers to harness the full potential of coding agents, without sacrificing control or transparency.

Installation and Setup

Getting started with container-use is straightforward. The project provides a Go-based CLI tool, ‘cu’, which you build and install via a simple ‘make’ command. By default, the build targets your current platform, but cross-compilation is supported through standard ‘TARGETPLATFORM’ environment variables.

# Build the CLI tool
make

# (Optional) Install into your PATH
make install && hash -r

After running these commands, the ‘cu’ binary becomes available in your shell, ready to launch containerized sessions for any MCP-compatible agent. If you need to compile for a different architecture, say, ARM64 for a Raspberry Pi, simply prefix the build with the target platform:

TARGETPLATFORM=linux/arm64 make

This flexibility ensures that whether you’re developing on macOS, Windows Subsystem for Linux, or any flavor of Linux, you can generate an environment-specific binary with ease.

Integrating with Your Favorite Agents

One of container-use’s strengths is its compatibility with any agent that speaks the Model Context Protocol (MCP). The project provides example integrations for popular tools like Claude Code, Cursor, GitHub Copilot, and Goose. Integration typically involves adding ‘container-use’ as an MCP server in your agent’s configuration and enabling it:

Claude Code uses an NPM helper to register the server. You can merge Dagger’s recommended instructions into your ‘CLAUDE.md’ so that running ‘claude’ automatically spawns agents in isolated containers:

npx @anthropic-ai/claude-code mcp add container-use -- $(which cu) stdio
curl -o CLAUDE.md https://raw.githubusercontent.com/dagger/container-use/main/rules/agent.md

Goose, a browser-based agent framework, reads from ‘~/.config/goose/config.yaml’. Adding a ‘container-use’ section there directs Goose to launch each browsing agent inside its own container:

extensions:
  container-use:
    name: container-use
    type: stdio
    enabled: true
    cmd: cu
    args:
      - stdio
    envs: {}

Cursor, the AI code assistant, can be hooked by dropping a rule file into your project. With ‘curl’ you fetch the recommended rule and place it in ‘.cursor/rules/container-use.mdc’.

VSCode and GitHub Copilot users can update their ‘settings.json’ and ‘.github/copilot-instructions.md’ respectively, pointing to the ‘cu’ command as the MCP server. Copilot then executes its code completions inside the encapsulated environment. Kilo Code integrates through a JSON-based settings file, letting you specify the ‘cu’ command and any required arguments under ‘mcpServers’. Each of these integrations ensures that, regardless of which assistant you choose, your agents operate in their sandbox, thereby removing the risk of cross-contamination and simplifying cleanup after each run.

Hands-On Examples

To illustrate how container-use can revolutionize your development workflow, the Dagger repository includes several ready-to-run examples. These demonstrate typical use cases and highlight the tool’s flexibility:

Hello World: In this minimal example, an agent scaffolds a simple HTTP server, say, using Flask or Node’s ‘http’ module, and launches it within its container. You can hit ‘localhost’ in your browser to confirm that the code generated by the agent runs as expected, entirely isolated from your host system.

Parallel Development: Here, two agents spin up distinct variations of the same app, one using Flask and another using FastAPI, each in its own container and on separate ports. This scenario demonstrates how to evaluate multiple approaches side by side without worrying about port collisions or dependency conflicts.

Security Scanning: In this pipeline, an agent performs routine maintenance, updating vulnerable dependencies, rerunning the build to ensure nothing broke, and generating a patch file that captures all changes. The entire process unfolds in a throwaway container, leaving your repository in its original state unless you decide to merge the patches.

Running these examples is as simple as piping the example file into your agent command. For instance, with Claude Code:

cat examples/hello_world.md | claude

Or with Goose:

goose run -i examples/hello_world.md -s

After execution, you’ll see each agent commit its work to a dedicated git branch that represents its container. Inspecting these branches via ‘git checkout’ lets you review, test, or merge changes on your terms.

One common concern when delegating tasks to agents is knowing what they did, not just what they claim. container-use addresses this through a unified logging interface. When you start a session, the tool records every command, output, and file change into your repository’s ‘.git’ history under a special remote called ‘container-use’. You can follow along as the container spins up, the agent runs commands, and the environment evolves.

If an agent encounters an error or goes off track, you don’t have to watch logs in a separate window. A simple command brings up an interactive view:

cu watch

This live view shows you which container branch is active, the latest outputs, and even gives you the option to drop into the agent’s shell. From there, you can debug manually: inspect environment variables, run your own commands, or edit files on the fly. This direct intervention capability ensures that agents remain collaborators rather than inscrutable black boxes.

While the default container images provided by container-use cover many Node.js, Python, and system-level use cases, you might have specialized needs, say, custom compilers or proprietary libraries. Fortunately, you can control the Dockerfile that underpins each container. By placing a ‘Containerfile’ (or ‘Dockerfile’) at the root of your project, the ‘cu’ CLI will build a tailor-made image before launching the agent. This approach enables you to pre-install system packages, clone private repositories, or configure complex toolchains, all without affecting your host environment.

A typical custom Dockerfile might start from an official base, add OS-level packages, set environment variables, and install language-specific dependencies:

FROM ubuntu:22.04
RUN apt-get update && apt-get install -y git build-essential python3 python3-pip
WORKDIR /workspace
COPY requirements.txt .
RUN pip3 install -r requirements.txt

Once you’ve defined your container, any agent you invoke will operate within that context by default, inheriting all the pre-configured tools and libraries you need.

In conclusion, as AI agents undertake increasingly complex development tasks, the need for robust isolation and transparency grows in parallel. container-use from Dagger offers a pragmatic solution: containerized environments that ensure reliability, reproducibility, and real-time visibility. By building on standard tools, including Docker, Git, and shell scripts, and offering seamless integrations with popular MCP-compatible agents, it lowers the barrier to safe, scalable, multi-agent workflows.
The post Run Multiple AI Coding Agents in Parallel with Container-Use from Dagger appeared first on MarkTechPost.

Accelerating Articul8’s domain-specific model development with Amazon SageMaker HyperPod

This post was co-written with Renato Nascimento, Felipe Viana, Andre Von Zuben from Articul8.
Generative AI is reshaping industries, offering new efficiencies, automation, and innovation. However, generative AI requires powerful, scalable, and resilient infrastructures that optimize large-scale model training, providing rapid iteration and efficient compute utilization with purpose-built infrastructure and automated cluster management.
In this post, we share how Articul8 is accelerating their training and deployment of domain-specific models (DSMs) by using Amazon SageMaker HyperPod and achieving over 95% cluster utilization and a 35% improvement in productivity.
What is SageMaker HyperPod?
SageMaker HyperPod is an advanced distributed training solution designed to accelerate scalable, reliable, and secure generative AI model development. Articul8 uses SageMaker HyperPod to efficiently train large language models (LLMs) on diverse, representative data and uses its observability and resiliency features to keep the training environment stable over the long duration of training jobs. SageMaker HyperPod provides the following features:

Fault-tolerant compute clusters with automated faulty node replacement during model training
Efficient cluster utilization through observability and performance monitoring
Seamless model experimentation with streamlined infrastructure orchestration using Slurm and Amazon Elastic Kubernetes Service (Amazon EKS)

Who is Articul8?
Articul8 was established to address the gaps in enterprise generative AI adoption by developing autonomous, production-ready products. For instance, they found that most general-purpose LLMs often fall short in delivering the accuracy, efficiency, and domain-specific knowledge needed for real-world business challenges. They are pioneering a set of DSMs that offer twofold better accuracy and completeness, compared to general-purpose models, at a fraction of the cost. (See their recent blog post for more details.)
The company’s proprietary ModelMesh technology serves as an autonomous layer that decides, selects, executes, and evaluates the right models at runtime. Think of it as a reasoning system that determines what to run, when to run it, and in what sequence, based on the task and context. It evaluates responses at every step to refine its decision-making, enabling more reliable and interpretable AI solutions while dramatically improving performance.
Articul8’s ModelMesh supports:

LLMs for general tasks
Domain-specific models optimized for industry-specific applications
Non-LLMs for specialized reasoning tasks or established domain-specific tasks (for example, scientific simulation)

Articul8’s domain-specific models are setting new industry standards across supply chain, energy, and semiconductor sectors. The A8-SupplyChain model, built for complex workflows, achieves 92% accuracy and threefold performance gains over general-purpose LLMs in sequential reasoning. In energy, A8-Energy models were developed with EPRI and NVIDIA as part of the Open Power AI Consortium, enabling advanced grid optimization, predictive maintenance, and equipment reliability. The A8-Semicon model has set a new benchmark, outperforming top open-source (DeepSeek-R1, Meta Llama 3.3/4, Qwen 2.5) and proprietary models (GPT-4o, Anthropic’s Claude) by twofold in Verilog code accuracy, all while running at 50–100 times smaller model sizes for real-time AI deployment.
Articul8 develops some of their domain-specific models using Meta’s Llama family as a flexible, open-weight foundation for expert-level reasoning. Through a rigorous fine-tuning pipeline with reasoning trajectories and curated benchmarks, general Llama models are transformed into domain specialists. To tailor models for areas like hardware description languages, Articul8 applies Reinforcement Learning with Verifiable Rewards (RLVR), using automated reward pipelines to specialize the model’s policy. In one case, a dataset of 50,000 documents was automatically processed into 1.2 million images, 360,000 tables, and 250,000 summaries, clustered into a knowledge graph of over 11 million entities. These structured insights fuel A8-DSMs across research, product design, development, and operations.
How SageMaker HyperPod accelerated the development of Articul8’s DSMs
Cost and time to train DSMs is critical for success for Articul8 in a rapidly evolving ecosystem. Training high-performance DSMs requires extensive experimentation, rapid iteration, and scalable compute infrastructure. With SageMaker HyperPod, Articul8 was able to:

Rapidly iterate on DSM training – SageMaker HyperPod resiliency features enabled Articul8 to train and fine-tune its DSMs in a fraction of the time required by traditional infrastructure
Optimize model training performance – By using the automated failure recovery feature in SageMaker HyperPod, Articul8 provided stable and resilient training processes
Reduce AI deployment time by four times and lower total cost of ownership by five times – The orchestration capabilities of SageMaker HyperPod alleviated the manual overhead of cluster management, allowing Articul8’s research teams to focus on model optimization rather than infrastructure upkeep

These advantages contributed to record-setting benchmark results by Articul8, proving that domain-specific models deliver superior real-world performance compared to general-purpose models.
Distributed training challenges and the role of SageMaker HyperPod
Distributed training across hundreds of nodes faces several critical challenges beyond basic resource constraints. Managing massive training clusters requires robust infrastructure orchestration and careful resource allocation for operational efficiency. SageMaker HyperPod offers both managed Slurm and Amazon EKS orchestration experience that streamlines cluster creation, infrastructure resilience, job submission, and observability. The following details focus on the Slurm implementation for reference:

Cluster setup – Although setting up a cluster is a one-time effort, the process is streamlined with a setup script that walks the administrator through each step of cluster creation. This post shows how this can be done in discrete steps.
Resiliency – Fault tolerance becomes paramount when operating at scale. SageMaker HyperPod handles node failures and network interruptions by replacing faulty nodes automatically. You can add the flag --auto-resume=1 with the Slurm srun command, and the distributed training job will recover from the last checkpoint.
Job submission – SageMaker HyperPod managed Slurm orchestration is a powerful way for data scientists to submit and manage distributed training jobs. Refer to the following example in the AWS-samples distributed training repo for reference. For instance, a distributed training job can be submitted with a Slurm sbatch command: sbatch 1.distributed-training-llama2.sbatch. You can use squeue and scancel to view and cancel jobs, respectively.
Observability – SageMaker HyperPod uses Amazon CloudWatch and open source managed Prometheus and Grafana services for monitoring and logging. Cluster administrators can view the health of the infrastructure (network, storage, compute) and utilization.

Solution overview
The SageMaker HyperPod platform enables Articul8 to efficiently manage high-performance compute clusters without requiring a dedicated infrastructure team. The service automatically monitors cluster health and replaces faulty nodes, making the deployment process frictionless for researchers.
To enhance their experimental capabilities, Articul8 integrated SageMaker HyperPod with Amazon Managed Grafana, providing real-time observability of GPU resources through a single-pane-of-glass dashboard. They also used SageMaker HyperPod lifecycle scripts to customize their cluster environment and install required libraries and packages. This comprehensive setup empowers Articul8 to conduct rapid experimentation while maintaining high performance and reliability—they reduced their customers’ AI deployment time by four times and lowered their total cost of ownership by five times.
The following diagram illustrates the observability architecture.

The platform’s efficiency in managing computational resources with minimum downtime has been particularly valuable for Articul8’s research and development efforts, empowering them to quickly iterate on their generative AI solutions while maintaining enterprise-grade performance standards. The following sections describe the setup and results in detail.
For the setup for this post, we begin with the AWS published workshop for SageMaker HyperPod, and adjust it to suit our workload.
Prerequisites
The following two AWS CloudFormation templates address the prerequisites of the solution setup.
For SageMaker HyperPod
This CloudFormation stack addresses the prerequisites for SageMaker HyperPod:

VPC and two subnets – A public subnet and a private subnet are created in an Availability Zone (provided as a parameter). The virtual private cloud (VPC) contains two CIDR blocks with 10.0.0.0/16 (for the public subnet) and 10.1.0.0/16 (for the private subnet). An internet gateway and NAT gateway are deployed in the public subnet.
Amazon FSx for Lustre file system – An Amazon FSx for Lustre volume is created in the specified Availability Zone, with a default of 1.2 TB storage, which can be overridden by a parameter. For this case study, we increased the storage size to 7.2 TB.
Amazon S3 bucket – The stack deploys endpoints for Amazon Simple Storage Service (Amazon S3) to store lifecycle scripts.
IAM role – An AWS Identity and Access Management (IAM) role is also created to help execute SageMaker HyperPod cluster operations.
Security group – The script creates a security group to enable EFA communication for multi-node parallel batch jobs.

For cluster observability
To get visibility into cluster operations and make sure workloads are running as expected, an optional CloudFormation stack has been used for this case study. This stack includes:

Node exporter – Supports visualization of CPU load averages, memory and disk usage, network traffic, file system, and disk I/O metrics
NVIDIA DCGM – Supports visualization of GPU utilization, temperatures, power usage, and memory usage
EFA metrics – Supports visualization of EFA network and error metrics, EFA RDMA performance, and so on.
FSx for Lustre – Supports visualization of file system read/write operations, free capacity, and metadata operations

Observability can be configured through YAML scripts to monitor SageMaker HyperPod clusters on AWS. Amazon Managed Service for Prometheus and Amazon Managed Grafana workspaces with associated IAM roles are deployed in the AWS account. Prometheus and exporter services are also set up on the cluster nodes.
Using Amazon Managed Grafana with SageMaker HyperPod helps you create dashboards to monitor GPU clusters and make sure they operate efficiently with minimum downtime. In addition, dashboards have become a critical tool to give you a holistic view of how specialized workloads consume different resources of the cluster, helping developers optimize their implementation.
Cluster setup
The cluster is set up with the following components (results might vary based on customer use case and deployment setup):

Head node and compute nodes – For this case study, we use a head node and SageMaker HyperPod compute nodes. The head node has an ml.m5.12xlarge instance, and the compute queue consists of ml.p4de.24xlarge instances.
Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes.
Local storage – Each node has an 8 TB local NVMe volume attached for local storage.
Scheduler – Slurm is used as an orchestrator. Slurm is an open source and highly scalable cluster management tool and job scheduling system for high-performance computing (HPC) clusters.
Accounting – As part of cluster configuration, a local MariaDB is deployed that keeps track of job runtime information.

Results
During this project, Articul8 was able to confirm the expected performance of A100 GPUs, with the added benefit of creating a cluster using Slurm and providing observability metrics to monitor the health of various components (storage, GPU nodes, fiber). The primary validation was on the ease of use and rapid ramp-up of data science experiments. Furthermore, they were able to demonstrate near-linear scaling with distributed training, achieving a 3.78 times reduction in time to train Meta Llama-2 13B with 4x nodes. Having the flexibility to run multiple experiments without losing development time to infrastructure overhead was an important accomplishment for the Articul8 data science team.
Clean up
If you run the cluster as part of the workshop, you can follow the cleanup steps to delete the CloudFormation resources after deleting the cluster.
Conclusion
This post demonstrated how Articul8 AI used SageMaker HyperPod to overcome the scalability and efficiency challenges of training multiple high-performing DSMs across key industries. By alleviating infrastructure complexity, SageMaker HyperPod empowered Articul8 to focus on building AI systems with measurable business outcomes. From semiconductor and energy to supply chain, Articul8’s DSMs are proving that the future of enterprise AI is not general—it’s purpose-built. Key takeaways include:

DSMs significantly outperform general-purpose LLMs in critical domains
SageMaker HyperPod accelerated the development of Articul8’s A8-Semicon, A8-SupplyChain, and Energy DSM models
Articul8 reduced AI deployment time by four times and lowered total cost of ownership by five times using the scalable, automated training infrastructure of SageMaker HyperPod

Learn more about SageMaker HyperPod by following this workshop. Reach out to your account team on how you can use this service to accelerate your own training workloads.

About the Authors
Yashesh A. Shroff, PhD. is a Sr. GTM Specialist in the GenAI Frameworks organization, responsible for scaling customer foundational model training and inference on AWS using self-managed or specialized services to meet cost and performance requirements. He holds a PhD in Computer Science from UC Berkeley and an MBA from Columbia Graduate School of Business.
Amit Bhatnagar is a Sr Technical Account Manager with AWS, in the Enterprise Support organization, with a focus on generative AI startups. He is responsible for helping key AWS customers with their strategic initiatives and operational excellence in the cloud. When he is not chasing technology, Amit loves to cook vegan delicacies and hit the road with his family to chase the horizon.
Renato Nascimento is the Head of Technology at Articul8, where he leads the development and execution of the company’s technology strategy. With a focus on innovation and scalability, he ensures the seamless integration of cutting-edge solutions into Articul8’s products, enabling industry-leading performance and enterprise adoption.
Felipe Viana is the Head of Applied Research at Articul8, where he leads the design, development, and deployment of innovative generative AI technologies, including domain-specific models, new model architectures, and multi-agent autonomous systems.
Andre Von Zuben is the Head of Architecture at Articul8, where he is responsible for designing and implementing scalable generative AI platform elements, novel generative AI model architectures, and distributed model training and deployment pipelines.

How VideoAmp uses Amazon Bedrock to power their media analytics interface

This post was co-written with Suzanne Willard and Makoto Uchida from VideoAmp.
In this post, we illustrate how VideoAmp, a media measurement company, worked with the AWS Generative AI Innovation Center (GenAIIC) team to develop a prototype of the VideoAmp Natural Language (NL) Analytics Chatbot to uncover meaningful insights at scale within media analytics data using Amazon Bedrock. The AI-powered analytics solution involved the following components:

A natural language to SQL pipeline, with a conversational interface, that works with complex queries and media analytics data from VideoAmp
An automated testing and evaluation tool for the pipeline

VideoAmp background
VideoAmp is a tech-first measurement company that empowers media agencies, brands, and publishers to precisely measure and optimize TV, streaming, and digital media. With a comprehensive suite of measurement, planning, and optimization solutions, VideoAmp offers clients a clear, actionable view of audiences and attribution across environments, enabling them to make smarter media decisions that help them drive better business outcomes. VideoAmp has seen incredible adoption for its measurement and currency solutions with 880% YoY growth, 98% coverage of the TV publisher landscape, 11 agency groups, and more than 1,000 advertisers. VideoAmp is headquartered in Los Angeles and New York with offices across the United States. To learn more, visit www.videoamp.com.
VideoAmp’s AI journey
VideoAmp has embraced AI to enhance its measurement and optimization capabilities. The company has integrated machine learning (ML) algorithms into its infrastructure to analyze vast amounts of viewership data across traditional TV, streaming, and digital services. This AI-driven approach allows VideoAmp to provide more accurate audience insights, improve cross-environment measurement, and optimize advertising campaigns in real time. By using AI, VideoAmp has been able to offer advertisers and media owners more precise targeting, better attribution models, and increased return on investment for their advertising spend. The company’s AI journey has positioned it as a leader in the evolving landscape of data-driven advertising and media measurement.
To take their innovations a step further, VideoAmp is building a brand-new analytics solution powered by generative AI, which will provide their customers with accessible business insights. Their goal for a beta product is to create a conversational AI assistant powered by large language models (LLMs) that allows VideoAmp’s data analysts and non-technical users such as content researchers and publishers to perform data analytics using natural language queries.
Use case overview
VideoAmp is undergoing a transformative journey by integrating generative AI into its analytics. The company aims to revolutionize how customers, including publishers, media agencies, and brands, interact with and derive insights from VideoAmp’s vast repository of data through a conversational AI assistant interface.
Presently, analysis by data scientists and analysts is done manually, requires technical SQL knowledge, and can be time-consuming for complex and high-dimensional datasets. Acknowledging the necessity for streamlined and accessible processes, VideoAmp worked with the GenAIIC to develop an AI assistant capable of comprehending natural language queries, generating and executing SQL queries on VideoAmp’s data warehouse, and delivering natural language summaries of retrieved information. The assistant allows non-technical users to surface data-driven insights, and it reduces research and analysis time for both technical and non-technical users.
Key success criteria for the project included:

The ability to convert natural language questions into SQL statements, connect to VideoAmp’s provided database, execute statements on VideoAmp performance metrics data, and create a natural language summary of results
A UI to ask natural language questions and view assistant output, which includes generated SQL queries, reasoning for the SQL statements, retrieved data, and natural language data summaries
Conversational support for the user to iteratively refine and filter asked questions
Low latency and cost-effectiveness
An automated evaluation pipeline to assess the quality and accuracy of the assistant

The team overcame a few challenges during the development process:

Adapting LLMs to understand the domain aspects of VideoAmp’s dataset – The dataset included highly industry-specific fields and metrics, and required complex queries to effectively filter and analyze. The queries often involved multiple specialized metric calculations, filters selecting from over 30 values, and extensive grouping and ordering.
Developing an automated evaluation pipeline – The pipeline is able to correctly identify if generated outputs are equivalent to ground truth data, even if they have different column aliasing, ordering, and metric calculations.

Solution overview
The GenAIIC team worked with VideoAmp to create an AI assistant that used Anthropic’s Claude 3 LLMs through Amazon Bedrock. Amazon Bedrock was chosen for this project because it provides access to high-quality foundation models (FMs), including Anthropic’s Claude 3 series, through a unified API. This allowed the team to quickly integrate the most suitable models for different components of the solution, such as SQL generation and data summarization.
Additional features in Amazon Bedrock, including Amazon Bedrock Prompt Management, native support for Retrieval Augmented Generation (RAG) and structured data retrieval through Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and fine-tuning, enable VideoAmp to quickly expand the analytics solution and take it to production. Amazon Bedrock also offers robust security and adheres to compliance certifications, allowing VideoAmp to confidently expand their AI analytics solution while maintaining data privacy and adhering to industry standards.
The solution is connected to a data warehouse. It supports a variety of database connections, such as Snowflake, SingleStore, PostgreSQL, Excel and CSV files, and more. The following diagram illustrates the high-level workflow of the solution.

The workflow consists of the following steps:

The user navigates to the frontend application and asks a question in natural language.
A Question Rewriter LLM component uses previous conversational context to augment the question with additional details if applicable. This allows follow-up questions and refinements to previous questions.
A Text-to-SQL LLM component creates a SQL query that corresponds to the user question.
The SQL query is executed in the data warehouse.
A Data-to-Text LLM component summarizes the retrieved data for the user.

The rewritten question, generated SQL, reasoning, and retrieved data are returned at each step.
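To make the flow concrete, the sketch below chains the three LLM stages around a SQL execution step using the Amazon Bedrock Converse API via boto3. The prompts, model IDs, and the run_sql helper are simplified placeholders rather than VideoAmp's production implementation.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask_llm(model_id: str, prompt: str) -> str:
    """Single-turn call through the Bedrock Converse API."""
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

def answer_question(question: str, history: list[str], schema: str, run_sql) -> str:
    # 1. Rewrite the question so it stands alone given prior conversation turns
    rewritten = ask_llm("anthropic.claude-3-sonnet-20240229-v1:0",
                        f"Conversation so far: {history}\nRewrite as a standalone question: {question}")
    # 2. Generate SQL from the rewritten question and the schema/context
    sql = ask_llm("anthropic.claude-3-sonnet-20240229-v1:0",
                  f"Schema:\n{schema}\nWrite a SQL query answering: {rewritten}")
    # 3. Execute against the warehouse (read-only connector supplied by the caller)
    rows = run_sql(sql)
    # 4. Summarize the retrieved data in natural language
    return ask_llm("anthropic.claude-3-haiku-20240307-v1:0",
                   f"Question: {rewritten}\nData: {rows}\nSummarize the answer concisely.")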
AI assistant workflow details
In this section, we discuss the components of the AI assistant workflow in more detail.
Rewriter
After the user asks the question, the current question and the previous questions the user asked in the current session are sent to the Question Rewriter component, which uses Anthropic’s Claude 3 Sonnet model. If deemed necessary, the LLM uses context from the previous questions to augment the current user question to make it a standalone question with context included. This enables multi-turn conversational support for the user, allowing for natural interactions with the assistant.
For example, if a user first asked, “For the week of 09/04/2023 – 09/10/2023, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”, followed by, “Can I have the same data for one year later”, the rewriter would rewrite the latter question as “For the week of 09/03/2024 – 09/09/2024, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”
Text-to-SQL
The rewritten user question is sent to the Text-to-SQL component, which also uses Anthropic’s Claude 3 Sonnet model. The Text-to-SQL component uses information about the database in its prompt to generate a SQL query corresponding to the user question. It also generates an explanation of the query.
The text-to-SQL prompt addressed several challenges, such as industry-specific language in user questions, complex metrics, and several rules and defaults for filtering. The prompt was developed through several iterations, based on feedback and guidance from the VideoAmp team, and manual and automated evaluation.
The prompt consisted of four overarching sections: context, SQL instructions, task, and examples. During the development phase, database schema and domain- or task-specific knowledge were found to be critical, so one major part of the prompt was designed to incorporate them in the context. To make this solution reusable and scalable, a modularized design of the prompt/input system is employed, making it generic so it can be applied to other use cases and domains. The solution can support Q&A with multiple databases by dynamically switching/changing the corresponding context with an orchestrator if needed.
The context section contains the following details:

Database schema
Sample categories for relevant data fields such as television networks to aid the LLM in understanding what fields to use for identifiers in the question
Industry term definitions
How to calculate different types of metrics or aggregations
Default values or fields to select when not specified
Other domain- or task-specific knowledge

The SQL instructions contain the following details:

Dynamic insertion of today’s date as a reference for terms, such as “last 3 quarters”
Instructions on usage of sub-queries
Instructions on when to retrieve additional informational columns not specified in the user question
Known SQL syntax and database errors to avoid and potential fixes

In the task section, the LLM is given a detailed step-by-step process to formulate SQL queries based on the context. A step-by-step process is required for the LLM to correctly think through and assimilate the required context and rules. Without the step-by-step process, the team found that the LLM wouldn’t adhere to all instructions provided in the previous sections.
In the examples section, the LLM is given several examples of user questions, corresponding SQL statements, and explanations.
In addition to iterating on the prompt content, different content organization patterns were tested due to the long context. The final prompt was organized with markdown and XML.
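A simplified illustration of the modular prompt design described earlier is shown below: each section is kept as a swappable block and assembled per database, so supporting a new domain mostly means supplying a new context dictionary. This is a sketch of the pattern, not VideoAmp's actual prompt.

def build_text_to_sql_prompt(context: dict, question: str) -> str:
    """Assemble the four prompt sections: context, SQL instructions, task, examples."""
    sections = [
        "<context>\n"
        f"Schema:\n{context['schema']}\n"
        f"Term definitions:\n{context['definitions']}\n"
        f"Metric rules and defaults:\n{context['metric_rules']}\n"
        "</context>",
        "<sql_instructions>\n"
        f"Today's date: {context['today']}\n"
        f"{context['sql_rules']}\n"
        "</sql_instructions>",
        "<task>\nThink step by step: identify entities, pick fields, apply filters "
        "and defaults, then write one SQL query and explain it.\n</task>",
        "<examples>\n" + "\n\n".join(context["examples"]) + "\n</examples>",
        f"Question: {question}",
    ]
    return "\n\n".join(sections)

# Swapping `context` for another database's schema and rules reuses the same template.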
SQL execution
After the Text-to-SQL component outputs a query, the query is executed against VideoAmp’s data warehouse using database connector code. For this use case, only read queries for analytics are executed to protect the database from unexpected operations like updates or deletes. The credentials for the database are securely stored and accessed using AWS Secrets Manager and AWS Key Management Service (AWS KMS).
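A minimal version of that execution step, with credentials pulled from Secrets Manager and the session forced to read-only, might look like the following sketch; the psycopg2 driver and secret layout are assumptions, since the solution supports several warehouse connectors.

import json
import boto3
import psycopg2  # assumed warehouse driver for illustration

def run_read_only_query(sql: str, secret_id: str):
    """Fetch credentials from AWS Secrets Manager and execute a read-only query."""
    secret = boto3.client("secretsmanager").get_secret_value(SecretId=secret_id)
    creds = json.loads(secret["SecretString"])
    conn = psycopg2.connect(host=creds["host"], dbname=creds["dbname"],
                            user=creds["username"], password=creds["password"])
    conn.set_session(readonly=True)          # guard against updates or deletes
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()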
Data-to-Text
The data retrieved by the SQL query is sent to the Data-to-Text component, along with the rewritten user question. The Data-to-Text component, which uses Anthropic’s Claude 3 Haiku model, produces a concise summary of the retrieved data and answers the user question.
The final outputs are displayed on the frontend application as shown in the following screenshots (protected data is hidden).

Evaluation framework workflow details
The GenAIIC team developed a sophisticated automated evaluation pipeline for VideoAmp’s NL Analytics Chatbot, which directly informed prompt optimization and solution improvements and was a critical component in providing high-quality results.
The evaluation framework comprises two categories:

SQL query evaluation – Generated SQL queries are evaluated for overall closeness to the ground truth SQL query. A key feature of the SQL evaluation component was the ability to account for column aliasing and ordering differences when comparing statements and determine equivalency.
Retrieved data evaluation – The retrieved data is compared to ground truth data to determine an exact match, after a few processing steps to account for column, formatting, and system differences.

The evaluation pipeline also produces detailed reports of the results and discrepancies between generated data and ground truth data.
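A much-simplified version of the retrieved-data comparison can normalize away column aliases, column order, and row order before checking equality, as in the pandas sketch below; the production pipeline applies additional formatting and system-specific processing.

import pandas as pd

def frames_match(generated: pd.DataFrame, truth: pd.DataFrame, decimals: int = 4) -> bool:
    """Compare two result sets ignoring column aliases, column order, and row order."""
    if generated.shape != truth.shape:
        return False

    def normalize(df: pd.DataFrame) -> pd.DataFrame:
        out = df.round(decimals).astype(str)                      # tolerate float formatting differences
        cols = sorted(out.columns, key=lambda c: tuple(out[c]))   # ignore aliases: order columns by content
        out = out[cols]
        out.columns = range(len(cols))
        return out.sort_values(by=list(out.columns)).reset_index(drop=True)  # canonical row order

    return normalize(generated).equals(normalize(truth))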
Dataset
The dataset used for the prototype solution was hosted in a data warehouse and consisted of performance metrics data such as viewership, ratings, and rankings for television networks and programs. The field names were industry-specific, so a data dictionary was included in the text-to-SQL prompt as part of the schema. The credentials for the database are securely stored and accessed using Secrets Manager and AWS KMS.
Results
A set of test questions were evaluated by the GenAIIC and VideoAmp teams, focusing on three metrics:

Accuracy – Different accuracy metrics were analyzed, but exact matches between retrieved data and ground truth data were prioritized
Latency – LLM generation latency, excluding the time taken to query the database
Cost – Average cost per user question

Both the evaluation pipeline and human review reported high accuracies on the dataset, whereas costs and latencies remained low. Overall, the results were well-aligned with VideoAmp expectations. VideoAmp anticipates this solution will make it simple for users to handle complex data queries with confidence through intuitive natural language interactions, reducing the time to business insights.
Conclusion
In this post, we shared how the GenAIIC team worked with VideoAmp to build a prototype of the VideoAmp NL Analytics Chatbot, an end-to-end generative AI data analytics interface using Amazon Bedrock and Anthropic’s Claude 3 LLMs. The solution is equipped with a variety of state-of-the-art LLM-based techniques, such as question rewriting, text-to-SQL query generation, and summarization of data in natural language. It also includes an automated evaluation module for evaluating the correctness of generated SQL statements and retrieved data. The solution achieved high accuracy on VideoAmp’s evaluation samples. Users can interact with the solution through an intuitive AI assistant interface with conversational capabilities.
VideoAmp will soon be launching their new generative AI-powered analytics interface, which enables customers to analyze data and gain business insights through natural language conversation. Their successful work with the GenAIIC team will allow VideoAmp to use generative AI technology to swiftly deliver valuable insights for both technical and non-technical customers.
This is just one of the ways AWS enables builders to deliver generative AI-based solutions. You can get started with Amazon Bedrock and see how it can be integrated in example code bases. The GenAIIC is a group of science and strategy experts with comprehensive expertise spanning the generative AI journey, helping you prioritize use cases, build a roadmap, and move solutions into production. If you’re interested in working with the GenAIIC, reach out to them today.

About the authors
Suzanne Willard is the VP of Engineering at VideoAmp where she founded and leads the GenAI program, establishing the strategic vision and execution roadmap. With over 20 years experience she is driving innovation in AI technologies, creating transformative solutions that align with business objectives and set the company apart in the market.
Makoto Uchida is a senior architect at VideoAmp in the AI domain, acting as area technical lead of AI portfolio, responsible for defining and driving AI product and technical strategy in the content and ads measurement platform PaaS product. Previously, he was a software engineering lead in generative and predictive AI Platform at a major hyperscaler public Cloud service. He has also engaged with multiple startups, laying the foundation of Data/ML/AI infrastructures.
Shreya Mohanty is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she partners with customers across industries to design and implement high-impact GenAI-powered solutions. She specializes in translating customer goals into tangible outcomes that drive measurable impact.
Long Chen is a Sr. Applied Scientist at AWS Generative AI Innovation Center. He holds a Ph.D. in Applied Physics from University of Michigan – Ann Arbor. With more than a decade of experience for research and development, he works on innovative solutions in various domains using generative AI and other machine learning techniques, ensuring the success of AWS customers. His interest includes generative models, multi-modal systems and graph learning.
Amaran Asokkumar is a Deep Learning Architect at AWS, specializing in infrastructure, automation, and AI. He leads the design of GenAI-enabled solutions across industry segments. Amaran is passionate about all things AI and helping customers accelerate their GenAI exploration and transformation efforts.
Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

CURE: A Reinforcement Learning Framework for Co-Evolving Code and Unit Test Generation in LLMs

Introduction

Large Language Models (LLMs) have shown substantial improvements in reasoning and precision through reinforcement learning (RL) and test-time scaling techniques. Despite outperforming traditional unit test generation methods, most existing approaches such as O1-Coder and UTGEN require supervision from ground-truth code. This supervision increases data collection costs and limits the scale of usable training data.

Limitations of Existing Approaches

Conventional unit test generation relies on:

Software analysis methods, which are rule-based and rigid.

Neural machine translation techniques, which often lack semantic alignment.

While recent prompt-based and agentic methods improve performance, they still depend heavily on labeled code for fine-tuning. This reliance restricts adaptability and scalability, particularly in real-world, large-scale deployment scenarios.

CURE: A Self-Supervised Co-Evolutionary Approach

Researchers from the University of Chicago, Princeton University, Peking University, and ByteDance Seed introduce CURE, a self-supervised reinforcement learning framework that jointly trains a code generator and a unit test generator without any ground-truth code.

CURE operates using a self-play mechanism in which:

The LLM generates both correct and incorrect code.

The unit test generator learns to distinguish failure modes and refines itself accordingly.

This bidirectional co-evolution enhances both code generation and verification without external supervision.

Architecture and Methodology

Base Models and Sampling Strategy

CURE is built on Qwen2.5-7B and 14B Instruct models, with Qwen3-4B used for long-chain-of-thought (CoT) variants. Each training step samples:

16 candidate code completions.

16 task-derived unit tests.

Sampling is performed using vLLM with temperature 1.0 and top-p 1.0. For long-CoT models, a response-length-aware transformation penalizes lengthy outputs, improving inference-time efficiency.

Reward Function and Optimization

CURE introduces a mathematically grounded reward formulation to:

Maximize reward precision, defined as the likelihood that correct code scores higher than incorrect code across generated unit tests.

Apply response-based reward adjustments for long responses to reduce latency.

Optimization proceeds via policy gradient methods, jointly updating the coder and unit tester to improve their mutual performance.
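As a rough illustration of the reward-precision objective above (the paper's formulation is more precise), the quantity can be estimated empirically from sampled code completions and generated unit tests as the fraction of correct/incorrect pairs ranked in the right order:

from itertools import product

def reward_precision(scores_correct, scores_incorrect):
    """Estimate P(score(correct code) > score(incorrect code)) over sampled pairs.

    scores_* are lists of per-sample scores, e.g. the number of generated unit
    tests each code completion passes.
    """
    pairs = list(product(scores_correct, scores_incorrect))
    if not pairs:
        return 0.0
    wins = sum(1 for c, w in pairs if c > w)
    return wins / len(pairs)

# Example: correct completions pass more generated tests than buggy ones
print(reward_precision([15, 14, 16], [9, 11, 6]))  # -> 1.0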

Benchmark Datasets and Evaluation Metrics

CURE is evaluated on five standard coding datasets:

LiveBench

MBPP

LiveCodeBench

CodeContests

CodeForces

Performance is measured across:

Unit test accuracy

One-shot code generation accuracy

Best-of-N (BoN) accuracy using 16 code and test samples.

Performance and Efficiency Gains

The ReasonFlux-Coder models derived via CURE achieve:

+37.8% in unit test accuracy.

+5.3% in one-shot code generation accuracy.

+9.0% in BoN accuracy.

Notably, ReasonFlux-Coder-4B achieves a 64.8% reduction in average unit test response length—substantially improving inference speed. Across all benchmarks, these models outperform traditional coding-supervised fine-tuned models (e.g., Qwen2.5-Coder-Instruct).

Application to Commercial LLMs

When ReasonFlux-Coder-4B is paired with GPT-series models:

GPT-4o-mini gains +5.5% BoN accuracy.

GPT-4.1-mini improves by +1.8%.

API costs are reduced while performance is enhanced, indicating a cost-effective solution for production-level inference pipelines.

Use as Reward Model for Label-Free Fine-Tuning

CURE-trained unit test generators can be repurposed as reward models in RL training. Using ReasonFlux-Coder-4B’s generated unit tests yields comparable improvements to human-labeled test supervision—enabling fully label-free reinforcement learning pipelines.
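In practice, using a unit test generator as a reward model reduces to scoring each candidate program by the fraction of generated tests it passes. The sketch below illustrates the reward computation only; real pipelines execute model-generated code inside a sandbox rather than calling exec() directly.

def pass_rate_reward(program: str, unit_tests: list[str]) -> float:
    """Reward = fraction of generated unit tests the candidate program passes.

    WARNING: exec() on model-generated code is unsafe outside a sandbox; this is
    purely illustrative of the reward computation, not a production harness.
    """
    passed = 0
    for test in unit_tests:
        scope: dict = {}
        try:
            exec(program, scope)   # define the candidate function(s)
            exec(test, scope)      # run assert-style test code against them
            passed += 1
        except Exception:
            pass
    return passed / max(len(unit_tests), 1)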

Broader Applicability and Future Directions

Beyond BoN, ReasonFlux-Coder models integrate seamlessly with agentic coding frameworks like:

MPSC (Multi-Perspective Self-Consistency)

AlphaCodium

S*

These systems benefit from CURE’s ability to refine both code and tests iteratively. CURE also boosts agentic unit test generation accuracy by over 25.1%, reinforcing its versatility.

Conclusion

CURE represents a significant advancement in self-supervised learning for code generation and validation, enabling large language models to jointly evolve their coding and unit test generation capabilities without reliance on ground-truth code. By leveraging a co-evolutionary reinforcement learning framework, CURE not only enhances core performance metrics such as one-shot accuracy and Best-of-N selection but also improves inference efficiency through response-length-aware optimization. Its compatibility with existing agentic coding pipelines and ability to function as a label-free reward model make it a scalable and cost-effective solution for both training and deployment scenarios.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post CURE: A Reinforcement Learning Framework for Co-Evolving Code and Unit Test Generation in LLMs appeared first on MarkTechPost.

Develop a Multi-Tool AI Agent with Secure Python Execution using Riza …

In this tutorial, we'll harness Riza's secure Python execution as the cornerstone of a powerful, tool-augmented AI agent in Google Colab. Beginning with seamless API key management through Colab secrets, environment variables, or hidden prompts, we'll configure your Riza credentials to enable sandboxed, audit-ready code execution. We'll integrate Riza's ExecPython tool into a LangChain agent alongside Google's Gemini generative model, define an AdvancedCallbackHandler that captures both tool invocations and Riza execution logs, and build custom utilities for complex math and in-depth text analysis.

%pip install --upgrade --quiet langchain-community langchain-google-genai rizaio python-dotenv

import os
from typing import Dict, Any, List
from datetime import datetime
import json
import getpass
from google.colab import userdata

We will install and upgrade the core libraries, LangChain Community extensions, Google Gemini integration, Riza’s secure execution package, and dotenv support, quietly in Colab. We then import standard utilities (e.g., os, datetime, json), typing annotations, secure input via getpass, and Colab’s user data API to manage environment variables and user secrets seamlessly.

def setup_api_keys():
    """Set up API keys using multiple secure methods."""

    try:
        os.environ['GOOGLE_API_KEY'] = userdata.get('GOOGLE_API_KEY')
        os.environ['RIZA_API_KEY'] = userdata.get('RIZA_API_KEY')
        print("API keys loaded from Colab secrets")
        return True
    except Exception:
        pass

    if os.getenv('GOOGLE_API_KEY') and os.getenv('RIZA_API_KEY'):
        print("API keys found in environment")
        return True

    try:
        if not os.getenv('GOOGLE_API_KEY'):
            google_key = getpass.getpass("Enter your Google Gemini API key: ")
            os.environ['GOOGLE_API_KEY'] = google_key

        if not os.getenv('RIZA_API_KEY'):
            riza_key = getpass.getpass("Enter your Riza API key: ")
            os.environ['RIZA_API_KEY'] = riza_key

        print("API keys set securely via input")
        return True
    except Exception:
        print("Failed to set API keys")
        return False

if not setup_api_keys():
    print("Please set up your API keys using one of these methods:")
    print("  1. Colab Secrets: open the Secrets panel on the left and add GOOGLE_API_KEY and RIZA_API_KEY")
    print("  2. Environment: set GOOGLE_API_KEY and RIZA_API_KEY before running")
    print("  3. Manual input: run the cell and enter keys when prompted")
    exit()

The above cell defines a setup_api_keys() function that securely retrieves your Google Gemini and Riza API keys by first attempting to load them from Colab secrets, then falling back to existing environment variables, and finally prompting you to enter them via hidden input if needed. If none of these methods succeed, it prints instructions on how to provide your keys and exits the notebook.

from langchain_community.tools.riza.command import ExecPython
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import HumanMessage, AIMessage
from langchain.memory import ConversationBufferWindowMemory
from langchain.tools import Tool
from langchain.callbacks.base import BaseCallbackHandler

We import Riza’s ExecPython tool alongside LangChain’s core components for building a tool‐calling agent, namely the Gemini LLM wrapper (ChatGoogleGenerativeAI), the agent executor and creation functions (AgentExecutor, create_tool_calling_agent), the prompt and message templates, conversation memory buffer, generic Tool wrapper, and the base callback handler for logging and monitoring agent actions. These building blocks let you assemble, configure, and track a memory-enabled, multi-tool AI agent in Colab.

class AdvancedCallbackHandler(BaseCallbackHandler):
    """Enhanced callback handler for detailed logging and metrics."""

    def __init__(self):
        self.execution_log = []
        self.start_time = None
        self.token_count = 0

    def on_agent_action(self, action, **kwargs):
        timestamp = datetime.now().strftime("%H:%M:%S")
        tool_input = str(action.tool_input)
        self.execution_log.append({
            "timestamp": timestamp,
            "action": action.tool,
            "input": tool_input[:100] + "..." if len(tool_input) > 100 else tool_input
        })
        print(f"[{timestamp}] Using tool: {action.tool}")

    def on_agent_finish(self, finish, **kwargs):
        timestamp = datetime.now().strftime("%H:%M:%S")
        print(f"[{timestamp}] Agent completed successfully")

    def get_execution_summary(self):
        return {
            "total_actions": len(self.execution_log),
            "execution_log": self.execution_log
        }


class MathTool:
    """Advanced mathematical operations tool."""

    @staticmethod
    def complex_calculation(expression: str) -> str:
        """Evaluate complex mathematical expressions safely."""
        try:
            import math
            import numpy as np

            safe_dict = {
                "__builtins__": {},
                "abs": abs, "round": round, "min": min, "max": max,
                "sum": sum, "len": len, "pow": pow,
                "math": math, "np": np,
                "sin": math.sin, "cos": math.cos, "tan": math.tan,
                "log": math.log, "sqrt": math.sqrt, "pi": math.pi, "e": math.e
            }

            result = eval(expression, safe_dict)
            return f"Result: {result}"
        except Exception as e:
            return f"Math Error: {str(e)}"


class TextAnalyzer:
    """Advanced text analysis tool."""

    @staticmethod
    def analyze_text(text: str) -> str:
        """Perform comprehensive text analysis."""
        try:
            char_freq = {}
            for char in text.lower():
                if char.isalpha():
                    char_freq[char] = char_freq.get(char, 0) + 1

            words = text.split()
            word_count = len(words)
            avg_word_length = sum(len(word) for word in words) / max(word_count, 1)

            specific_chars = {}
            for char in set(text.lower()):
                if char.isalpha():
                    specific_chars[char] = text.lower().count(char)

            analysis = {
                "total_characters": len(text),
                "total_words": word_count,
                "average_word_length": round(avg_word_length, 2),
                "character_frequencies": dict(sorted(char_freq.items(), key=lambda x: x[1], reverse=True)[:10]),
                "specific_character_counts": specific_chars
            }

            return json.dumps(analysis, indent=2)
        except Exception as e:
            return f"Analysis Error: {str(e)}"

The above cell brings together three essential pieces: an AdvancedCallbackHandler that captures every tool invocation with a timestamped log and can summarize the total actions taken; a MathTool class that safely evaluates complex mathematical expressions in a restricted environment to prevent unwanted operations; and a TextAnalyzer class that computes detailed text statistics, such as character frequencies, word counts, and average word length, and returns the results as formatted JSON.

def validate_api_keys():
    """Validate API keys before creating agents."""
    try:
        test_llm = ChatGoogleGenerativeAI(
            model="gemini-1.5-flash",
            temperature=0
        )
        test_llm.invoke("test")
        print("Gemini API key validated")

        test_tool = ExecPython()
        print("Riza API key validated")

        return True
    except Exception as e:
        print(f"API key validation failed: {str(e)}")
        print("Please check your API keys and try again")
        return False

if not validate_api_keys():
    exit()

python_tool = ExecPython()
math_tool = Tool(
    name="advanced_math",
    description="Perform complex mathematical calculations and evaluations",
    func=MathTool.complex_calculation
)
text_analyzer_tool = Tool(
    name="text_analyzer",
    description="Analyze text for character frequencies, word statistics, and specific character counts",
    func=TextAnalyzer.analyze_text
)

tools = [python_tool, math_tool, text_analyzer_tool]

try:
    llm = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        temperature=0.1,
        max_tokens=2048,
        top_p=0.8,
        top_k=40
    )
    print("Gemini model initialized successfully")
except Exception as e:
    print(f"Gemini initialization with custom settings failed, falling back to defaults: {e}")
    llm = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        temperature=0.1,
        max_tokens=2048
    )

In this cell, we first define and run validate_api_keys() to ensure that both the Gemini and Riza credentials work, attempting a dummy LLM call and instantiating the Riza ExecPython tool. We exit the notebook if validation fails. We then instantiate python_tool for secure code execution, wrap our MathTool and TextAnalyzer methods into LangChain Tool objects, and collect them into the tools list. Finally, we initialize the Gemini Flash model with custom settings (temperature, max_tokens, top_p, top_k), and if that configuration fails, we gracefully fall back to a simpler initialization of the same model.

prompt_template = ChatPromptTemplate.from_messages([
    ("system", """You are an advanced AI assistant with access to powerful tools.

Key capabilities:
- Python code execution for complex computations
- Advanced mathematical operations
- Text analysis and character counting
- Problem decomposition and step-by-step reasoning

Instructions:
1. Always break down complex problems into smaller steps
2. Use the most appropriate tool for each task
3. Verify your results when possible
4. Provide clear explanations of your reasoning
5. For text analysis questions (like counting characters), use the text_analyzer tool first, then verify with Python if needed

Be precise, thorough, and helpful."""),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

memory = ConversationBufferWindowMemory(
    k=5,
    return_messages=True,
    memory_key="chat_history"
)

callback_handler = AdvancedCallbackHandler()

agent = create_tool_calling_agent(llm, tools, prompt_template)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    memory=memory,
    callbacks=[callback_handler],
    max_iterations=10,
    early_stopping_method="generate"
)

This cell constructs the agent’s “brain” and workflow: it defines a structured ChatPromptTemplate that instructs the system on its toolset and reasoning style, sets up a sliding-window conversation memory to retain the last five exchanges, and instantiates the AdvancedCallbackHandler for real-time logging. It then creates a tool‐calling agent by binding the Gemini LLM, custom tools, and prompt template, and wraps it in an AgentExecutor that manages execution (up to ten steps), leverages memory for context, streams verbose output, and halts cleanly once the agent generates a final response.

def ask_question(question: str) -> Dict[str, Any]:
    """Ask a question to the advanced agent and return detailed results."""
    print(f"\nProcessing: {question}")
    print("=" * 50)

    try:
        result = agent_executor.invoke({"input": question})

        output = result.get("output", "No output generated")

        print("\nExecution Summary:")
        summary = callback_handler.get_execution_summary()
        print(f"Tools used: {summary['total_actions']}")

        return {
            "question": question,
            "answer": output,
            "execution_summary": summary,
            "success": True
        }

    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            "question": question,
            "error": str(e),
            "success": False
        }

test_questions = [
    "How many r's are in strawberry?",
    "Calculate the compound interest on $1000 at 5% for 3 years",
    "Analyze the word frequency in the sentence: 'The quick brown fox jumps over the lazy dog'",
    "What's the fibonacci sequence up to the 10th number?"
]

print("Advanced Gemini Agent with Riza - Ready!")
print("API keys configured securely")
print("Testing with sample questions...\n")

results = []
for question in test_questions:
    result = ask_question(question)
    results.append(result)
    print("\n" + "=" * 80 + "\n")

print("FINAL SUMMARY:")
successful = sum(1 for r in results if r["success"])
print(f"Successfully processed: {successful}/{len(results)} questions")

Finally, we define a helper function, ask_question(), that sends a user query to the agent executor, prints the question header, captures the agent’s response (or error), and then outputs a brief execution summary (showing how many tool calls were made). It then supplies a list of sample questions, covering counting characters, computing compound interest, analyzing word frequency, and generating a Fibonacci sequence, and iterates through them, invoking the agent on each and collecting the results. After running all tests, it prints a concise “FINAL SUMMARY” indicating how many queries were processed successfully, confirming that your Advanced Gemini + Riza agent is up and running in Colab.

In conclusion, by centering the architecture on Riza’s secure execution environment, we’ve created an AI agent that generates insightful responses via Gemini while also running arbitrary Python code in a fully sandboxed, monitored context. The integration of Riza’s ExecPython tool ensures that every computation, from advanced numerical routines to dynamic text analyses, is executed with rigorous security and transparency. With LangChain orchestrating tool calls and a memory buffer maintaining context, we now have a modular framework ready for real-world tasks such as automated data processing, research prototyping, or educational demos.

Check out the Notebook. All credit for this research goes to the researchers of this project.
The post Develop a Multi-Tool AI Agent with Secure Python Execution using Riza and Gemini appeared first on MarkTechPost.

How Do LLMs Really Reason? A Framework to Separate Logic from Knowledg …

Unpacking Reasoning in Modern LLMs: Why Final Answers Aren’t Enough

Recent advancements in reasoning-focused LLMs like OpenAI's o1/o3 and DeepSeek-R1 have led to notable improvements on complex tasks. However, the step-by-step reasoning behind these models remains unclear. Most evaluations focus on final-answer accuracy, which hides the reasoning process and doesn't reveal how models combine knowledge and logic. Some earlier methods attempt to measure reasoning by comparing answers to the original question, but this approach is flawed since models often rely on prior deductions or internal knowledge. Domains such as math and medicine differ in their reasoning needs, highlighting the importance of developing better, domain-aware evaluation methods for building trustworthy AI.

The Shortcomings of Final-Answer Evaluations in Math and Medicine

Recent LLMs have made impressive strides in reasoning tasks, especially in math and medicine, thanks to better training data and reward strategies. However, most of this progress focuses on boosting final answer accuracy rather than understanding how the model reasons step-by-step. Past work has flagged factual errors in reasoning chains or measured similarity between reasoning steps and the original question. But such similarity doesn’t guarantee logical soundness or factual correctness, since LLMs often draw on internal knowledge or earlier reasoning.

A New Framework for Separating Knowledge and Logic in LLM Reasoning

Researchers from UC Santa Cruz, Stanford, and Tongji University go beyond final-answer evaluation by breaking down LLM reasoning into two key parts: factual knowledge and logical steps. They introduce a detailed framework that utilizes two metrics: the Knowledge Index (KI) for factual accuracy and Information Gain (InfoGain) for reasoning quality. Their analysis of Qwen models across math and medical tasks reveals that reasoning skills don’t easily transfer between domains. While supervised fine-tuning improves accuracy, it often harms reasoning depth. Reinforcement learning, however, helps refine reasoning by removing irrelevant information. This work highlights the importance of evaluating and training LLMs more thoughtfully.

Assessing Reasoning with Qwen2.5-7B and DeepSeek-R1 Models

The researchers evaluate reasoning in LLMs by analyzing Qwen2.5-7B and its DeepSeek-R1-distilled version, trained with SFT and RL. Using tasks from both math and medical domains, they decompose responses into logical steps and assess them using two key metrics: Information Gain (how much uncertainty is reduced with each reasoning step) and Knowledge Index (how factually accurate each step is, verified against expert sources). While InfoGain tracks the informativeness of each step, KI checks whether the knowledge aligns with real-world facts. This approach reveals how models reason and where they may falter in accuracy or logic.
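
As a rough illustration only, and not the paper's exact estimators, the two metrics can be thought of as follows: InfoGain as the drop in the model's uncertainty about the final answer after each added reasoning step, and KI as the share of steps whose factual claims a verifier confirms. The helpers answer_uncertainty and is_factually_correct below are hypothetical stand-ins for the paper's actual estimators.

def info_gain_per_step(question, steps, answer_uncertainty):
    # Larger gain = the step reduced uncertainty about the final answer more.
    gains, context = [], question
    for step in steps:
        before = answer_uncertainty(context)
        context = context + "\n" + step
        after = answer_uncertainty(context)
        gains.append(before - after)
    return gains

def knowledge_index(steps, is_factually_correct):
    # Fraction of steps whose factual content a verifier confirms.
    checked = [is_factually_correct(step) for step in steps]
    return sum(checked) / max(len(checked), 1)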

Supervised Fine-Tuning vs. Reinforcement Learning in Domain-Specific Tasks

The study evaluates two variants of Qwen-2.5-7B, Qwen-Base and the distilled Qwen-R1, on medical tasks. Results show that Qwen-Base consistently outperforms Qwen-R1 in accuracy, knowledge retention, and reasoning, especially after SFT and RL. The distilled model likely struggles due to prior training focused on math and code, resulting in a domain mismatch. Interestingly, SFT enhances medical knowledge more effectively than RL, although it may slightly compromise reasoning efficiency. RL, on the other hand, improves both reasoning and knowledge when applied post-SFT. Medical benchmarks tend to rely more on factual knowledge than abstract reasoning, unlike math-focused tasks.

Conclusion: Toward More Interpretable and Trustworthy LLMs

In conclusion, the study introduces a framework that separates knowledge from reasoning to better evaluate how LLMs think, particularly in high-stakes areas like medicine and math. Using Qwen models trained with SFT and RL, the researchers found that while SFT improves factual accuracy, essential in medicine, it often weakens reasoning. RL, however, enhances reasoning by trimming out incorrect information. The framework could be extended to fields such as law or finance, where structured thinking is crucial. Overall, this approach helps clarify how LLMs make decisions and suggests ways to tailor their training for specific domains.

Check out the Paper, Code and Project Page. All credit for this research goes to the researchers of this project.
The post How Do LLMs Really Reason? A Framework to Separate Logic from Knowledge appeared first on MarkTechPost.

Adobe enhances developer productivity using Amazon Bedrock Knowledge B …

Adobe Inc. excels in providing a comprehensive suite of creative tools that empower artists, designers, and developers across various digital disciplines. Their product landscape is the backbone of countless creative projects worldwide, ranging from web design and photo editing to vector graphics and video production.
Adobe’s internal developers use a vast array of wiki pages, software guidelines, and troubleshooting guides. Recognizing the challenge developers faced in efficiently finding the right information for troubleshooting, software upgrades, and more, Adobe’s Developer Platform team sought to build a centralized system. This led to the initiative Unified Support, designed to help thousands of the company’s internal developers get immediate answers to questions from a centralized place and reduce time and cost spent on developer support. For instance, a developer setting up a continuous integration and delivery (CI/CD) pipeline in a new AWS Region or running a pipeline on a dev branch can quickly access Adobe-specific guidelines and best practices through this centralized system.
The initial prototype for Adobe’s Unified Support provided valuable insights and confirmed the potential of the approach. This early phase highlighted key areas requiring further development to operate effectively at Adobe’s scale, including addressing scalability needs, simplifying resource onboarding, improving content synchronization mechanisms, and optimizing infrastructure efficiency. Building on these learnings, improving retrieval precision emerged as the next critical step.
To address these challenges, Adobe partnered with the AWS Generative AI Innovation Center, using Amazon Bedrock Knowledge Bases and the Vector Engine for Amazon OpenSearch Serverless. This solution dramatically improved their developer support system, resulting in a 20% increase in retrieval accuracy. Metadata filtering empowers developers to fine-tune their search, helping them surface more relevant answers across complex, multi-domain knowledge sources. This improvement not only enhanced the developer experience but also contributed to reduced support costs.
In this post, we discuss the details of this solution and how Adobe enhances their developer productivity.
Solution overview
Our project aimed to address two key objectives:

Document retrieval engine enhancement – We developed a robust system to improve search result accuracy for Adobe developers. This involved creating a pipeline for data ingestion, preprocessing, metadata extraction, and indexing in a vector database. We evaluated retrieval performance against Adobe’s ground truth data to produce high-quality, domain-specific results.
Scalable, automated deployment – To support Unified Support across Adobe, we designed a reusable blueprint for deployment. This solution accommodates large-scale data ingestion of various types and offers flexible configurations, including embedding model selection and chunk size adjustment.

Using Amazon Bedrock Knowledge Bases, we created a customized, fully managed solution that improved the retrieval effectiveness. Key achievements include a 20% increase in accuracy metrics for document retrieval, seamless document ingestion and change synchronization, and enhanced scalability to support thousands of Adobe developers. This solution provides a foundation for improved developer support and scalable deployment across Adobe’s teams. The following diagram illustrates the solution architecture.

Let’s take a closer look at our solution:

Amazon Bedrock Knowledge Bases index – The backbone of our system is Amazon Bedrock Knowledge Bases. Data is indexed through the following stages:

Data ingestion – We start by pulling data from Amazon Simple Storage Service (Amazon S3) buckets. This could be anything from resolutions to past issues or wiki pages.
Chunking – Amazon Bedrock Knowledge Bases breaks data down into smaller pieces, or chunks, defining the specific units of information that can be retrieved. This chunking process is configurable, allowing for optimization based on the specific needs of the business.
Vectorization – Each chunk is passed through an embedding model (in this case, Amazon Titan V2 on Amazon Bedrock) creating a 1,024-dimension numerical vector. This vector represents the semantic meaning of the chunk, allowing for similarity searches.
Storage – These vectors are stored in the Amazon OpenSearch Serverless vector database, creating a searchable repository of information.

Runtime – When a user poses a question, our system completes the following steps:

Query vectorization – With the Amazon Bedrock Knowledge Bases Retrieve API, the user’s question is automatically embedded using the same embedding model used for the chunks during data ingestion.
Similarity search and retrieval – The system retrieves the most relevant chunks in the vector database based on similarity scores to the query.
Ranking and presentation – The corresponding documents are ranked based on the semantic similarity of their most relevant chunks to the query, and the top-ranked information is presented to the user.

Multi-tenancy through metadata filtering
As developers, we often find ourselves seeking help across various domains. Whether it’s tackling CI/CD issues, setting up project environments, or adopting new libraries, the landscape of developer challenges is vast and varied. Sometimes, our questions even span multiple domains, making it crucial to have a system for retrieving relevant information. Metadata filtering empowers developers to retrieve not just semantically relevant information, but a well-defined subset of that information based on specific criteria. This powerful tool enables you to apply filters to your retrievals, helping developers narrow the search results to a limited set of documents based on the filter, thereby improving the relevancy of the search.
To use this feature, metadata files are provided alongside the source data files in an S3 bucket. To enable metadata-based filtering, each source data file needs to be accompanied by a corresponding metadata file. These metadata files used the same base name as the source file, with a .metadata.json suffix. Each metadata file included relevant attributes—such as domain, year, or type—to support multi-tenancy and fine-grained filtering in OpenSearch Service. The following code shows what an example metadata file looks like:

{
  "metadataAttributes": {
    "domain": "project A",
    "year": 2016,
    "type": "wiki"
  }
}

Retrieve API
The Retrieve API allows querying a knowledge base to retrieve relevant information. You can use it as follows:

Send a POST request to /knowledgebases/knowledgeBaseId/retrieve.
Include a JSON body with the following:

retrievalQuery – Contains the text query.
retrievalConfiguration – Specifies search parameters, such as number of results and filters.
nextToken – For pagination (optional).

The following is an example request syntax:

POST /knowledgebases/knowledgeBaseId/retrieve HTTP/1.1
Content-type: application/json

{
   "nextToken": "string",
   "retrievalConfiguration": {
      "vectorSearchConfiguration": {
         "filter": { ... },
         "numberOfResults": number,
         "overrideSearchType": "string"
      }
   },
   "retrievalQuery": {
      "text": "string"
   }
}

Additionally, you can set up the retriever with ease using the langchain-aws package:

from langchain_aws import AmazonKnowledgeBasesRetriever

retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="YOUR-ID",
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)
retriever.get_relevant_documents(query="What is the meaning of life?")

This approach enables semantic querying of the knowledge base to retrieve relevant documents based on the provided query, simplifying the implementation of search.
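The same retriever can also carry the metadata filter discussed earlier. The following is a hedged sketch, not production code: the filter shape mirrors the vectorSearchConfiguration filter in the Retrieve API request above, and the attribute names match the example metadata file, so adjust them to your own data.

from langchain_aws import AmazonKnowledgeBasesRetriever

# Restrict retrieval to wiki documents tagged for a single domain.
filtered_retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="YOUR-ID",
    retrieval_config={
        "vectorSearchConfiguration": {
            "numberOfResults": 4,
            "filter": {
                "andAll": [
                    {"equals": {"key": "domain", "value": "project A"}},
                    {"equals": {"key": "type", "value": "wiki"}},
                ]
            },
        }
    },
)
docs = filtered_retriever.get_relevant_documents(query="How do I set up the CI/CD pipeline?")
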
Experimentation
To deliver the most accurate and efficient knowledge retrieval system, the Adobe and AWS teams put the solution to the test. The team conducted a series of rigorous experiments to fine-tune the system and find the optimal settings.
Before we dive into our findings, let’s discuss the metrics and evaluation process we used to measure success. We used the open source model evaluation framework Ragas to evaluate the retrieval system across two metrics: document relevance and mean reciprocal rank (MRR). Although Ragas comes with many metrics for evaluating model performance out of the box, we needed to implement these metrics by extending the Ragas framework with custom code.

Document relevance – Document relevance offers a qualitative approach to assessing retrieval accuracy. This metric uses a large language model (LLM) as an impartial judge to compare retrieved chunks against user queries. It evaluates how effectively the retrieved information addresses the developer’s question, providing a score between 1–10.
Mean reciprocal rank – On the quantitative side, we have the MRR metric. MRR evaluates how well a system ranks the first relevant item for a query. For each query, find the rank k of the highest-ranked relevant document. The score for that query is 1/k. MRR is the average of these 1/k scores over the entire set of queries. A higher score (closer to 1) signifies that the first relevant result is typically ranked high.
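
A minimal implementation of this definition looks as follows (illustrative code, not the project's evaluation harness):

def mean_reciprocal_rank(ranked_results, relevant_sets):
    """ranked_results: list of ranked document-ID lists, one per query.
    relevant_sets: list of sets of relevant document IDs, one per query."""
    scores = []
    for ranked, relevant in zip(ranked_results, relevant_sets):
        reciprocal = 0.0  # stays 0 if no relevant document is retrieved
        for k, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                reciprocal = 1.0 / k
                break
        scores.append(reciprocal)
    return sum(scores) / max(len(scores), 1)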

These metrics provide complementary insights: document relevance offers a content-based assessment, and MRR provides a ranking-based evaluation. Together, they offer a comprehensive view of the retrieval system's effectiveness in finding and prioritizing relevant information.
In our recent experiments, we explored various data chunking strategies to optimize the performance of retrieval. We tested several approaches, including fixed-size chunking as well as more advanced semantic chunking and hierarchical chunking. Semantic chunking focuses on preserving the contextual relationships within the data by segmenting it based on semantic meaning. This approach aims to improve the relevance and coherence of retrieved results. Hierarchical chunking organizes data into a hierarchical parent-child structure, allowing for more granular and efficient retrieval based on the inherent relationships within your data.
For more information on how to set up different chunking strategies, refer to Amazon Bedrock Knowledge Bases now supports advanced parsing, chunking, and query reformulation giving greater control of accuracy in RAG based applications.
We tested the following chunking methods with Amazon Bedrock Knowledge Bases:

Fixed-size short chunking – 400-token chunks with a 20% overlap (shown as the blue variant in the following figure)
Fixed-size long chunking – 1,000-token chunks with a 20% overlap
Hierarchical chunking – Parent chunks of 1,500 tokens and child chunks of 300 tokens, with a 60-token overlap
Semantic chunking – 400-token chunks with a 95% similarity percentile threshold

For reference, a paragraph of approximately 1,000 characters typically translates to around 200 tokens. To assess performance, we measured document relevance and MRR across different context sizes, ranging from 1–5. This comparison aims to provide insights into the most effective chunking strategy for organizing and retrieving information for this use case. The following figures illustrate the MRR and document relevance metrics, respectively.

As a result of these experiments, we found that MRR is a more sensitive metric for evaluating the impact of chunking strategies, particularly when varying the number of retrieved chunks (top-k from 1 to 5). Among the approaches tested, the fixed-size 400-token strategy—shown in blue—proved to be the simplest and most effective, consistently yielding the highest accuracy across different retrieval sizes.
Conclusion
In the journey to design Adobe's developer Unified Support search and retrieval system, we've successfully harnessed the power of Amazon Bedrock Knowledge Bases to create a robust, scalable, and efficient solution. By configuring fixed-size chunking and using the Amazon Titan V2 embedding model, we achieved a remarkable 20% increase in accuracy metrics for document retrieval compared to Adobe's existing solution, by running evaluations on the customer's testing system and provided dataset. The integration of metadata filtering emerged as a game-changing feature, allowing for seamless navigation across diverse domains and enabling customized retrieval. This capability proved invaluable for Adobe, given the complexity and breadth of their information landscape. Our comprehensive comparison of retrieval accuracy for different configurations of the Amazon Bedrock Knowledge Bases index has yielded valuable insights. The metrics we developed provide an objective framework for assessing the quality of retrieved context, which is crucial for applications demanding high-precision information retrieval. As we look to the future, this customized, fully managed solution lays a solid foundation for continuous improvement in developer support at Adobe, offering enhanced scalability and seamless support infrastructure in tandem with evolving developer needs.
For those interested in working with AWS on similar projects, visit Generative AI Innovation Center. To learn more about Amazon Bedrock Knowledge Bases, see Retrieve data and generate AI responses with knowledge bases.

About the Authors
Kamran Razi is a Data Scientist at the Amazon Generative AI Innovation Center. With a passion for delivering cutting-edge generative AI solutions, Kamran helps customers unlock the full potential of AWS AI/ML services to solve real-world business challenges. With over a decade of experience in software development, he specializes in building AI-driven solutions, including AI agents. Kamran holds a PhD in Electrical Engineering from Queen’s University.
Nay Doummar is an Engineering Manager on the Unified Support team at Adobe, where she’s been since 2012. Over the years, she has contributed to projects in infrastructure, CI/CD, identity management, containers, and AI. She started on the CloudOps team, which was responsible for migrating Adobe’s infrastructure to the AWS Cloud, marking the beginning of her long-term collaboration with AWS. In 2020, she helped build a support chatbot to simplify infrastructure-related assistance, sparking her passion for user support. In 2024, she joined a project to Unify Support for the Developer Platform, aiming to streamline support and boost productivity.
Varsha Chandan Bellara is a Software Development Engineer at Adobe, specializing in AI-driven solutions to boost developer productivity. She leads the development of an AI assistant for the Unified Support initiative, using Amazon Bedrock, implementing RAG to provide accurate, context-aware responses for technical support and issue resolution. With expertise in cloud-based technologies, Varsha combines her passion for containers and serverless architectures with advanced AI to create scalable, efficient solutions that streamline developer workflows.
Jan Michael Ong is a Senior Software Engineer at Adobe, where he supports the developer community and engineering teams through tooling and automation. Currently, he is part of the Developer Experience team at Adobe, working on AI projects and automation contributing to Adobe’s internal Developer Platform.
Justin Johns is a Deep Learning Architect at Amazon Web Services who is passionate about innovating with generative AI and delivering cutting-edge solutions for customers. With over 5 years of software development experience, he specializes in building cloud-based solutions powered by generative AI.
Gaurav Dhamija is a Principal Solutions Architect at Amazon Web Services, where he helps customers design and build scalable, reliable, and secure applications on AWS. He is passionate about developer experience, containers, and serverless technologies, and works closely with engineering teams to modernize application architectures. Gaurav also specializes in generative AI, using AWS generative AI services to drive innovation and enhance productivity across a wide range of use cases.
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.
Anila Joshi has more than a decade of experience building AI solutions. As a Senior Manager, Applied Science at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerate the adoption of AWS services with customers by helping customers ideate, identify, and implement secure generative AI solutions.

Amazon Nova Lite enables Bito to offer a free tier option for its AI-p …

This post is co-written by Amar Goel, co-founder and CEO of Bito.
Meticulous code review is a critical step in the software development process, one that helps deliver high-quality code that's ready for enterprise use. However, it can be a time-consuming process at scale, when experts must review thousands of lines of code, looking for bugs or other issues. Traditional reviews can be subjective and inconsistent, because humans are human. As generative AI becomes more and more integrated into the software development process, a significant number of new AI-powered code review tools have entered the marketplace, helping software development groups raise quality and ship clean code faster—while boosting productivity and job satisfaction among developers.
Bito is an innovative startup that creates AI agents for a broad range of software developers. It emerged as a pioneer in its field in 2023, when it launched its first AI-powered developer agents, which brought large language models (LLMs) to code-writing environments. Bito's executive team sees developers and AI as key elements of the future and works to bring them together in powerful new ways with its expanding portfolio of AI-powered agents. Its flagship product, AI Code Review Agent, speeds pull request (PR) time-to-merge by up to 89%—accelerating development times and allowing developers to focus on writing code and innovating. Intended to assist, not replace, review by senior software engineers, Bito's AI Code Review Agent focuses on summarizing, organizing, then suggesting line-level fixes with code base context, freeing engineers to focus on the more strategic aspects of code review, such as evaluating business logic.
With more than 100,000 active developers using its products globally, Bito has proven successful at using generative AI to streamline code review—delivering impressive results and business value to its customers, which include Gainsight, Privado, PubMatic, On-Board Data Systems (OBDS), and leading software enterprises. Bito’s customers have found that AI Code Review Agent provides the following benefits:

Accelerates PR merges by 89%
Reduces regressions by 34%
Delivers 87% of a PR’s feedback necessary for review
Has a 2.33 signal-to-noise ratio
Works in over 50 programming languages

Although these results are compelling, Bito needed to overcome inherent wariness among developers evaluating a flood of new AI-powered tools. To accelerate and expand adoption of AI Code Review Agent, Bito leaders wanted to launch a free tier option, which would let these prospective users experience the capabilities and value that AI Code Review Agent offered—and encourage them to upgrade to the full, for-pay Teams Plan. Its Free Plan would offer AI-generated PR summaries to provide an overview of changes, and its Teams Plan would provide more advanced features, such as downstream impact analysis and one-click acceptance of line-level code suggestions.
In this post, we share how Bito is able to offer a free tier option for its AI-powered code reviews using Amazon Nova.
Choosing a cost-effective model for a free tier offering
To offer a free tier option for AI Code Review Agent, Bito needed a foundation model (FM) that would provide the right level of performance and results at a reasonable cost. Offering code review for free to its potential customers would not be free for Bito, of course, because it would be paying inference costs. To identify a model for its Free Plan, Bito carried out a 2-week evaluation process across a broad range of models, including the high-performing FMs on Amazon Bedrock, as well as OpenAI GPT-4o mini. The Amazon Nova models—fast, cost-effective models that were recently introduced on Amazon Bedrock—were particularly interesting to the team.
At the end of its evaluation, Bito determined that Amazon Nova Lite delivered the right mix of performance and cost-effectiveness for its use cases. Its speed provided fast creation of code review summaries. However, cost—a key consideration for Bito's Free Plan—proved to be the deciding factor. Ultimately, Amazon Nova Lite met Bito's criteria for speed, cost, and quality. The combination of Amazon Nova Lite and Amazon Bedrock also makes it possible for Bito to offer the reliability and security that their customers need when entrusting Bito with their code. After all, careful control of code is one of Bito's core promises to its customers. It doesn't store code or use it for model training. And its products are SOC 2 Type 2-certified to provide data security, processing integrity, privacy, and confidentiality.
Implementing the right models for different tiers of offerings
Bito has now adopted Amazon Bedrock as its standardized platform to explore, add, and run models. Bito uses Amazon Nova Lite as the primary model for its Free Plan, and Anthropic's Claude 3.7 Sonnet powers its for-pay Teams Plan, all accessed and integrated through the unified Amazon Bedrock API and controls. Amazon Bedrock provides seamless shifting from Amazon Nova Lite to Anthropic's Sonnet when customers upgrade, with minimal code changes. Bito leaders are quick to point out that Amazon Nova Lite doesn't just power its Free Plan—it inspired it. Without the very low cost of Amazon Nova Lite, they wouldn't have been able to offer a free tier of AI Code Review Agent, which they viewed as a strategic move to expand Bito's enterprise customer base. This strategy quickly generated results, attracting three times more prospective customers to its Free Plan than anticipated. At the end of the 14-day trial period, a significant number of users convert to the full AI Code Review Agent to access its full array of capabilities.
Encouraged by the success with AI Code Review Agent, Bito is now using Amazon Nova Lite to power the chat capabilities of its offering for Bito Wingman, its latest AI agentic technology—a full-featured developer assistant in the integrated development environment (IDE) that combines code generation, error handling, architectural advice, and more. Again, the combination of quality and low cost of Amazon Nova Lite made it the right choice for Bito.
Conclusion
In this post, we shared how Bito—an innovative startup that offers a growing portfolio of AI-powered developer agents—chose Amazon Nova Lite to power its free tier offering of AI Code Review Agent, its flagship product. Its AI-powered agents are designed specifically to make developers’ lives easier and their work more impactful:

Amazon Nova Lite enabled Bito to meet one of its core business challenges—attracting enterprise customers. By introducing a free tier, Bito attracted three times more prospective new customers to its generative AI-driven flagship product—AI Code Review Agent.
Amazon Nova Lite outperformed other models during rigorous internal testing, providing the right level of performance at the very low cost Bito needed to launch a free tier of AI Code Review Agent.
Amazon Bedrock empowers Bito to seamlessly switch between models as needed for each tier of AI Code Review Agent—Amazon Nova Lite for its Free Plan and Anthropic’s Claude 3.7 Sonnet for its for-pay Teams Plan. Amazon Bedrock also provided security and privacy, critical considerations for Bito customers.
Bito shows how innovative organizations can use the combination of quality, cost-effectiveness, and speed in Amazon Nova Lite to deliver value to their customers—and to their business.

“Our challenge is to push the capabilities of AI to deliver new value to developers, but at a reasonable cost,” shares Amar Goel, co-founder and CEO of Bito. “Amazon Nova Lite gives us the very fast, low-cost model we needed to power the free offering of our AI Code Review Agent—and attract new customers.”
Get started with Amazon Nova on the Amazon Bedrock console. Learn more about Amazon Nova Lite on the Amazon Nova product page.

About the authors
Eshan Bhatnagar is the Director of Product Management for Amazon AGI at Amazon Web Services.
Amar Goel is Co-Founder and CEO of Bito. A serial entrepreneur, Amar previously founded PubMatic (went public in 2020), and formerly worked at Microsoft, McKinsey, and was a software engineer at Netscape, the original browser company. Amar attended Harvard University. He is excited about using GenAI to power the next generation of how software gets built!

How Gardenia Technologies helps customers create ESG disclosure report …

This post was co-written with Federico Thibaud, Neil Holloway, Fraser Price, Christian Dunn, and Frederica Schrager from Gardenia Technologies
“What gets measured gets managed” has become a guiding principle for organizations worldwide as they begin their sustainability and environmental, social, and governance (ESG) journeys. Companies are establishing baselines to track their progress, supported by an expanding framework of reporting standards, some mandatory and some voluntary. However, ESG reporting has evolved into a significant operational burden. A recent survey shows that 55% of sustainability leaders cite excessive administrative work in report preparation, while 70% indicate that reporting demands inhibit their ability to execute strategic initiatives. This environment presents a clear opportunity for generative AI to automate routine reporting tasks, allowing organizations to redirect resources toward more impactful ESG programs.
Gardenia Technologies, a data analytics company, partnered with the AWS Prototyping and Cloud Engineering (PACE) team to develop Report GenAI, a fully automated ESG reporting solution powered by the latest generative AI models on Amazon Bedrock. This post dives deep into the technology behind an agentic search solution using tooling with Retrieval Augmented Generation (RAG) and text-to-SQL capabilities to help customers reduce ESG reporting time by up to 75%.
In this post, we demonstrate how AWS serverless technology, combined with agents in Amazon Bedrock, are used to build scalable and highly flexible agent-based document assistant applications.
Scoping the challenge: Growing ESG reporting requirements and complexity
Sustainability disclosures are now a standard part of corporate reporting, with 96% of the 250 largest companies reporting on their sustainability progress based on government and regulatory frameworks. To meet reporting mandates, organizations must overcome many data collection and process-based barriers. Data for a single report includes thousands of data points from a multitude of sources including official documentation, databases, unstructured document stores, utility bills, and emails. The EU Corporate Sustainability Reporting Directive (CSRD) framework, for example, comprises 1,200 individual data points that need to be collected across an enterprise. Even voluntary disclosures like the CDP, which includes approximately 150 questions, cover a wide range of questions related to climate risk and impact, water stewardship, land use, and energy consumption. Collecting this information across an organization is time consuming.
A secondary challenge is that many organizations with established ESG programs need to report to multiple disclosure frameworks, such as SASB, GRI, and TCFD, each using different reporting and disclosure standards. To complicate matters, reporting requirements are continually evolving, leaving organizations struggling just to keep up with the latest changes. Today, much of this work is highly manual and leaves sustainability teams spending more time on managing data collection and answering questionnaires rather than developing impactful sustainability strategies.
Solution overview: Automating undifferentiated heavy lifting with AI agents
Gardenia’s approach to strengthen ESG data collection for enterprises is Report GenAI, an agentic framework using generative AI models on Amazon Bedrock to automate large chunks of the ESG reporting process. Report GenAI pre-fills reports by drawing on existing databases, document stores and web searches. The agent then works collaboratively with ESG professionals to review and fine-tune responses. This workflow has five steps to help automate ESG data collection and assist in curating responses. These steps include setup, batch-fill, review, edit, and repeat. Let’s explore each step in more detail.

Setup: The Report GenAI agent is configured and authorized to access an ESG and emissions database, client document stores (emails, previous reports, data sheets), and document searches over the public internet. Client data is stored within specified AWS Regions using encrypted Amazon Simple Storage Service (Amazon S3) buckets with VPC endpoints for secure access, while relational data is hosted in Amazon Relational Database Service (Amazon RDS) instances deployed within Gardenia’s virtual private cloud (VPC). This architecture helps make sure data residency requirements can be fulfilled, while maintaining strict access controls through private network connectivity. The agent also has access to the relevant ESG disclosure questionnaire including questions and expected response format (we refer to this as a report specification). The following figure is an example of the Report GenAI user interface at the agent configuration step. As shown in the figure, the user can choose which databases, documents, or other tools the agent will use to answer a given question.

Batch-fill: The agent then iterates through each question and data point to be disclosed and then retrieves relevant data from the client document stores and document searches. This information is processed to produce a response in the expected format depending on the disclosure report requirements.
Review: Each response includes cited sources and—if the response is quantitative—calculation methodology. This enables users to maintain a clear audit trail and verify the accuracy of batch-filled responses quickly.
Edit: While the agentic workflow is automated, our approach allows for a human-in-the-loop to review, validate, and iterate on batch-filled facts and figures. In the following figure, we show how users can chat with the AI assistant to request updates or manually refine responses. When the user is satisfied, the final answer is recorded. The agent will show references from which responses were sourced and allow the user to modify answers either directly or by providing an additional prompt.

Repeat: Users can batch-fill multiple reporting frameworks to simplify and expand their ESG disclosure scope while avoiding extra effort to manually complete multiple questionnaires. After a report has been completed, it can then be added to the client document store so future reports can draw on it for knowledge. Report GenAI also supports bring your own report, which allows users to develop their own reporting specification (question and response model), which can then be imported into the application, as shown in the following figure.

Now that you have a description of the Report GenAI workflow, let’s explore how the architecture is built.
Architecture deep-dive: A serverless generative AI agent
The Report GenAI architecture consists of six components as illustrated in the following figure: A user interface (UI), the generative AI executor, the web search endpoint, a text-to-SQL tool, the RAG tool, and an embedding generation pipeline. The UI, generative AI executor, and generation pipeline components help orchestrate the workflow. The remaining three components function together to generate responses to perform the following actions:

Web search tool: Uses an internet search engine to retrieve content from public web pages.
Text-to-SQL tool: Generates and executes SQL queries to the company’s emissions database hosted by Gardenia Technologies. The tool uses natural language requests, such as “What were our Scope 2 emissions in 2024,” as input and returns the results from the emissions database.
Retrieval Augmented Generation (RAG) tool: Accesses information from the corporate document store (such as procedures, emails, and internal reports) and uses it as a knowledge base. This component acts as a retriever to return relevant text from the document store as a plain text query.

Let’s take a look at each of the components.
1: Lightweight UI hosted on auto-scaled Amazon ECS Fargate
Users access Report GenAI by using the containerized Streamlit frontend. Streamlit offers an appealing UI for data and chat apps, allowing data scientists and ML engineers to build convincing user experiences with relatively limited effort. While not typically used for large-scale deployments, Streamlit proved to be a suitable choice for the initial iteration of Report GenAI.
The frontend is hosted on a load-balanced and auto-scaled Amazon Elastic Container Service (Amazon ECS) cluster with the Fargate launch type. This implementation of the frontend not only reduces the management overhead but also suits the expected intermittent usage pattern of Report GenAI, which is anticipated to be spiky, with high-usage periods around the times when new reports must be generated (typically quarterly or yearly) and lower usage outside these windows. User authentication and authorization is handled by Amazon Cognito.
2: Central agent executor
The executor is an agent that uses reasoning capabilities of leading text-based foundation models (FMs) (for example, Anthropic’s Claude Sonnet 3.5 and Haiku 3.5) to break down user requests, gather information from document stores, and efficiently orchestrate tasks. The agent uses Reason and Act (ReAct), a prompt-based technique that enables large language models (LLMs) to generate both reasoning traces and task-specific actions in an interleaved manner. Reasoning traces help the model develop, track, and update action plans, while actions allow it to interface with a set of tools and information sources (also known as knowledge bases) that it can use to fulfil the task. The agent is prompted to think about an optimal sequence of actions to complete a given task with the tools at its disposal, observe the outcome, and iterate and improve until satisfied with the answer.
In combination, these tools provide the agent with capabilities to iteratively complete complex ESG reporting templates. The expected questions and response format for each questionnaire is captured by a report specification (ReportSpec) using Pydantic to enforce the desired output format for each reporting standard (for example, CDP, or TCFD). This ReportSpec definition is inserted into the task prompt. The first iteration of Report GenAI used Claude Sonnet 3.5 on Amazon Bedrock. As more capable and more cost effective LLMs become available on Amazon Bedrock (such as the recent release of Amazon Nova models), foundation models in Report GenAI can be swapped to remain up to date with the latest models.
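As an illustration of the ReportSpec idea, the following is a minimal Pydantic sketch; the field names are assumptions rather than Gardenia's actual schema, and serve only to show how a reporting standard's questions and expected answer structure can be enforced before expert review.

from typing import List, Optional
from pydantic import BaseModel, Field

class QuestionSpec(BaseModel):
    question_id: str
    question_text: str
    response_format: str = Field(description="e.g. 'numeric_tCO2e', 'free_text', 'yes_no'")

class QuestionResponse(BaseModel):
    question_id: str
    answer: str
    sources: List[str] = []                # citations for the audit trail
    methodology: Optional[str] = None      # required for quantitative answers

class ReportSpec(BaseModel):
    framework: str                         # e.g. "CDP" or "TCFD"
    questions: List[QuestionSpec]
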
The agent executor is hosted on AWS Lambda and uses the open-source LangChain framework to implement the ReAct orchestration logic and the needed integrations with memory, LLMs, tools, and knowledge bases. LangChain offers deep integration with AWS using the first-party langchain-aws module. The langchain-aws module provides useful one-line wrappers to call tools using AWS Lambda, draw from a chat memory backed by DynamoDB, and call LLM models on Amazon Bedrock. LangChain also provides fine-grained visibility into each step of the ReAct decision-making process to provide decision transparency.

3: Web-search tool
The web search tool is hosted on Lambda and calls an internet search engine through an API. The agent executor retrieves the information returned from the search engine to formulate a response. Web searches can be used in combination with the RAG tool to retrieve public context needed to formulate responses for certain generic questions, such as providing a short description of the reporting company or entity.
4: Text-to-SQL tool
A large portion of ESG reporting requirements is analytical in nature and requires processing of large amounts of numerical or tabular data. For example, a reporting standard might ask for total emissions in a certain year or quarter. LLMs are ill-equipped for questions of this nature. The Lambda-hosted text-to-SQL tool provides the agent with the required analytical capabilities. The tool uses a separate LLM to generate a valid SQL query given a natural language question along with the schema of an emissions database hosted on Gardenia. The generated query is then executed against this database and the results are passed back to the agent executor. SQL linters and error-correction loops are used for added robustness.
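One way such a tool can be wired up is sketched below. This is an illustrative pattern using the Amazon Bedrock Converse API, not Gardenia's production code, and it omits the SQL linting and error-correction loops mentioned above; the prompt wording and helper names are assumptions.

import boto3

bedrock = boto3.client("bedrock-runtime")

def text_to_sql(question: str, schema_ddl: str, model_id: str) -> str:
    # Ask the model for a single SQL query, given the schema and the question.
    prompt = (
        "Given this database schema:\n" + schema_ddl +
        "\nWrite one valid SQL query that answers: " + question +
        "\nReturn only the SQL."
    )
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"].strip()

def run_emissions_query(question: str, schema_ddl: str, conn, model_id: str):
    # conn is any DB-API connection to the emissions database.
    sql = text_to_sql(question, schema_ddl, model_id)
    return conn.execute(sql).fetchall()
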
5: Retrieval Augmented Generation (RAG) tool
Much of the information required to complete ESG reporting resides in internal, unstructured document stores and can consist of PDF or Word documents, Excel spreadsheets, and even emails. Given the size of these document stores, a common approach is to use knowledge bases with vector embeddings for semantic search. The RAG tool enables the agent executor to retrieve only the relevant parts to answer questions from the document store. The RAG tool is hosted on Lambda and uses an in-memory Faiss index as a vector store. The index is persisted on Amazon S3 and loaded on demand whenever required. This workflow is advantageous for the given workload because of the intermittent usage of Report GenAI. The RAG tool accepts a plain text query from the agent executor as input and uses an embedding model on Amazon Bedrock to perform a vector search against the vector database. The retrieved text is returned to the agent executor.
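The following hedged sketch shows the general shape of such a Lambda-hosted retriever: it loads a persisted FAISS index from Amazon S3, embeds the query with a Bedrock embedding model (Amazon Titan Text Embeddings V2 is assumed here), and returns the best-matching chunks; the assumption that chunk texts are stored alongside the index is ours, not a detail from the post.

import json
import boto3
import faiss
import numpy as np

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")

def load_index(bucket: str, key: str):
    # Download the persisted index from S3 into Lambda's writable /tmp space.
    s3.download_file(bucket, key, "/tmp/index.faiss")
    return faiss.read_index("/tmp/index.faiss")

def embed(text: str) -> np.ndarray:
    # Amazon Titan Text Embeddings V2 is an assumption; swap in your model ID.
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    vector = json.loads(response["body"].read())["embedding"]
    return np.array([vector], dtype="float32")

def retrieve(index, chunks, query: str, k: int = 4):
    # chunks[i] holds the plain text for the vector stored at index position i.
    _, ids = index.search(embed(query), k)
    return [chunks[i] for i in ids[0] if i != -1]
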
6: Asynchronous embedding generation pipeline
To make text searchable, it must be indexed in a vector database. AWS Step Functions provides a straightforward orchestration framework to manage this process: extracting plain text from the various document types, chunking it into manageable pieces, embedding the text, and loading the embeddings into a vector database. Amazon Textract can be used as the first step for extracting text from visually rich documents like presentations or PDFs. An embedding model such as Amazon Titan Text Embeddings can then be used to embed the text and store it in a vector database such as LanceDB. Note that Amazon Bedrock Knowledge Bases provides an end-to-end retrieval service that automates most of these steps; for this application, however, Gardenia Technologies opted for a fully flexible implementation to retain control over each design choice of the RAG pipeline (text extraction approach, embedding model, and vector database) at the expense of higher management and development overhead.
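Each Step Functions state in such a pipeline typically wraps a small function like the ones sketched below (text extraction, chunking, embedding). The chunk sizes and model IDs are illustrative assumptions, and multi-page documents would use Textract's asynchronous APIs rather than the synchronous call shown.

import json
import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")

def extract_text(bucket: str, key: str) -> str:
    """Pull plain text out of a visually rich document with Amazon Textract.
    (Synchronous call; multi-page PDFs would use StartDocumentTextDetection.)"""
    resp = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return "\n".join(b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE")

def chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so each embedding stays focused."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(chunk_text: str) -> list[float]:
    """Embed one chunk with Amazon Titan Text Embeddings; the vector and its
    chunk text are then loaded into the vector database."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": chunk_text}),
    )
    return json.loads(resp["body"].read())["embedding"]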
Evaluating agent performance
Accuracy and reliability are paramount in ESG reporting, given the regulatory and business implications of these disclosures. Report GenAI implements a dual-layer evaluation framework that combines human expertise with AI-based validation.
Validation is performed both at a high level (evaluating full question responses) and at the sub-component level (the RAG, text-to-SQL, and agent trajectory modules individually). Each level has its own evaluation sets in addition to specific metrics of interest.
Human expert validation
The solution’s human-in-the-loop approach allows ESG experts to review and validate the AI-generated responses. This expert oversight serves as the primary quality control mechanism, making sure that generated reports align with both regulatory requirements and organization-specific context. The interactive chat interface enables experts to:

Verify factual accuracy of automated responses
Validate calculation methodologies
Verify proper context interpretation
Confirm regulatory compliance
Flag potential discrepancies or areas requiring additional review

A key feature in this process is the AI reasoning module, which displays the agent’s decision-making process, providing transparency into not only what answers were generated but how the agent arrived at those conclusions.

These expert reviews provide valuable training data that can be used to enhance system performance through refinements to RAG implementations, agent prompts, or underlying language models.
AI-powered quality assessment
Complementing human oversight, Report GenAI uses state-of-the-art LLMs on Amazon Bedrock as LLM judges. These models are prompted to evaluate:

Response accuracy relative to source documentation
Completeness of answers against question requirements
Consistency with provided context
Alignment with reporting framework guidelines
Mathematical accuracy of calculations

The LLM judge operates by the following steps (a minimal sketch follows this list):

Analyzing the original question and context
Reviewing the generated response and its supporting evidence
Comparing the response against retrieved data from structured and unstructured sources
Providing a confidence score and detailed assessment of the response quality
Flagging potential issues or areas requiring human review
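
A minimal sketch of such a judge call against Bedrock is shown below. The prompt wording, scoring scale, and model ID are assumptions for illustration, and production code would validate the returned JSON rather than trusting it.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """You are evaluating an ESG report answer.
Question: {question}
Retrieved evidence: {evidence}
Generated response: {response}

Score the response from 1-5 for accuracy, completeness, consistency with the
evidence, and alignment with the reporting framework. Return JSON:
{{"confidence": <1-5>, "assessment": "<short explanation>", "needs_human_review": <true|false>}}"""

def judge(question: str, evidence: str, response: str) -> dict:
    """Ask an LLM judge on Bedrock to score a generated answer against its evidence."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, evidence=evidence, response=response
            ),
        }],
    })
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0", body=body
    )
    # Assumes the model returns clean JSON; real code would parse defensively.
    return json.loads(json.loads(resp["body"].read())["content"][0]["text"])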

This dual-validation approach creates a robust quality assurance framework that combines the pattern recognition capabilities of AI with human domain expertise. The system continuously improves through feedback loops, where human corrections and validations help refine the AI’s understanding and response generation capabilities.
How Omni Helicopters International cut its reporting time by 75%
Omni Helicopters International cut their CDP reporting time by 75% using Gardenia’s Report GenAI solution. In previous years, OHI’s CDP reporting required one month of dedicated effort from their sustainability team. Using Report GenAI, OHI tracked their GHG inventory and relevant KPIs in real time and then prepared their 2024 CDP submission in just one week. Read the full story in Preparing Annual CDP Reports 75% Faster.
“In previous years we needed one month to complete the report, this year it took just one week,” said Renato Souza, Executive Manager QSEU at OTA. “The ‘Ask the Agent’ feature made it easy to draft our own answers. The tool was a great support and made things much easier compared to previous years.”
Conclusion
In this post, we stepped through how AWS and Gardenia collaborated to build Report GenAI, an automated ESG reporting solution that relieves ESG experts of the undifferentiated heavy lifting of data gathering and analysis associated with a growing ESG reporting burden. This frees up time for more impactful, strategic sustainability initiatives. Report GenAI is available on the AWS Marketplace today. To dive deeper and start developing your own generative AI app to fit your use case, explore this workshop on building an Agentic LLM assistant on AWS.

About the Authors
Federico Thibaud is the CTO and Co-Founder of Gardenia Technologies, where he leads the data and engineering teams, working on everything from data acquisition and transformation to algorithm design and product development. Before co-founding Gardenia, Federico worked at the intersection of finance and tech — building a trade finance platform as lead developer and developing quantitative strategies at a hedge fund.
Neil Holloway is Head of Data Science at Gardenia Technologies, where he is focused on leveraging AI and machine learning to build and enhance software products. Neil holds a master's degree in Theoretical Physics, for which he designed and built programs to simulate high-energy collisions in particle physics.
Fraser Price is a GenAI-focused Software Engineer at Gardenia Technologies in London, where he focuses on researching, prototyping and developing novel approaches to automation in the carbon accounting space using GenAI and machine learning. He received his MEng in Computing: AI from Imperial College London.
Christian Dunn is a Software Engineer based in London building ETL pipelines, web-apps, and other business solutions at Gardenia Technologies.
Frederica Schrager is a Marketing Analyst at Gardenia Technologies.
Karsten Schroer is a Senior ML Prototyping Architect at AWS. He supports customers in leveraging data and technology to drive sustainability of their IT infrastructure and build cloud-native data-driven solutions that enable sustainable operations in their respective verticals. Karsten joined AWS following his PhD studies in applied machine learning & operations management. He is truly passionate about technology-enabled solutions to societal challenges and loves to dive deep into the methods and application architectures that underlie these solutions.
Mohamed Ali Jamaoui is a Senior ML Prototyping Architect with over 10 years of experience in production machine learning. He enjoys solving business problems with machine learning and software engineering, and helping customers extract business value with ML. As part of AWS EMEA Prototyping and Cloud Engineering, he helps customers build business solutions that leverage innovations in MLOps, NLP, CV, and LLMs.
Marco Masciola is a Senior Sustainability Scientist at AWS. In his role, Marco leads the development of IT tools and technical products to support AWS’s sustainability mission. He’s held various roles in the renewable energy industry, and leans on this experience to build tooling to support sustainable data center operations.
Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies.

The Visitor ID War Report with Larry Kim & Sanjay Jenkins

What happens when you plug dirty website visitor ID data into your ad stack and email flows? 

What happens when marketers push magical-sounding tech without disclosing how wrong it really is?

In this no-holds-barred conversation, I sat down with Sanjay Jenkins to discuss the shocking truth about most visitor ID solutions — and what we’re doing differently at Customers.ai.

Key Moments in “Customer Acquisition Magic: Visitor ID”

00:46 — What is Visitor ID? “It’s a magical promise: tell me who’s on my site, even if they’ve never filled out a form. But it only works if the IDs are accurate.”

02:22 — The Big Problem with Visitor ID “If this tech is so magical, why isn’t every major brand using it? Because most of it doesn’t work. Accuracy rates are horrifically low — 70-95% wrong.”

04:13 — Bad Data is Worse Than No Data “Marketing malpractice: sending spam to strangers based on false positives. The impact is catastrophic.”

07:04 — Email Deliverability Carnage “Industry benchmark: 0.3% spam complaint rate. These ID vendors? Try 3%. That’s 10x higher than what’s acceptable.”

09:18 — Real-World Comparison “In a head-to-head test against Opensend, they had a 1.5% spam rate. We had 0.1%. That’s 15x lower. And we did 5x more revenue per email.”

11:39 — How We Test for Website Visitor ID Accuracy “You compare emails generated by IDs with actual purchasers. If the guessed identity isn’t the buyer, the Website Visitor ID vendor was wrong.” (A minimal sketch of this comparison follows the key moments below.)

18:28 — Why Trust is So Damaged in This Space “Every brand has tried this tech and gotten burned. They’re reluctant to try again — but we’re showing results and accuracy that are impossible to ignore.”

20:35 — What Makes Customers.ai Different “We’re a second-generation visitor ID solution. We built our own identity graph. We don’t just license third-party junk or suck your CRM dry.”

27:40 — The Taste Test Approach “We don’t ask for trust. We ask for a side-by-side trial. And the difference isn’t subtle — it’s 5x, 10x, sometimes 20x.”

42:23 — The Spam Test from Hell “We sent 10,000 emails from a brand to inboxes we controlled. 70% landed in Promotions or Spam. Up to 40% were in Spam alone.”

45:36 — Why Some Vendors Still Thrive Despite Bad Data “The worse their data, the better your deliverability looks — because no one opens the emails, so no one marks them as spam.”

47:48 — What’s Next: Universal Form Prefill “We’re close to exposing identity on-site via form prefill. But we won’t ship it until we’re 95%+ accurate.”

52:39 — Why Visitor ID Matters “It’s a growth hack. And if we can clean up the category, it becomes a growth engine for the entire DTC and SaaS ecosystem.”
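
As referenced above, the accuracy test boils down to comparing the identities a vendor claims against the emails real purchasers used at checkout. The sketch below is a minimal illustration of that comparison, not Customers.ai's production methodology; the field names are assumptions.

def visitor_id_accuracy(id_claims: dict[str, str], purchases: dict[str, str]) -> float:
    """Fraction of converting sessions where the vendor's guessed email
    matches the email the buyer actually used at checkout.

    id_claims: session_id -> email the visitor-ID vendor resolved
    purchases: session_id -> email captured at checkout
    """
    scored = [sid for sid in purchases if sid in id_claims]
    if not scored:
        return 0.0
    correct = sum(
        1 for sid in scored
        if id_claims[sid].lower() == purchases[sid].lower()
    )
    return correct / len(scored)

# Example: 2 of 3 converting sessions were resolved to the right person.
claims = {"s1": "ann@x.com", "s2": "bob@y.com", "s3": "eve@z.com"}
actual = {"s1": "ann@x.com", "s2": "bob@y.com", "s3": "zoe@q.com"}
print(visitor_id_accuracy(claims, actual))  # ~0.67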

Visitor ID War Report: Conclusion

The Visitor ID space has been filled with hype, overpromises, and data malpractice. But there is a better way — and we’re building it.

Watch the full interview with me and Sanjay Jenkins on Replo’s “Shop Talk” podcast: https://www.youtube.com/watch?v=0Wzjs_sxwmI.

DM me or book a call if you want to run a side-by-side test and see the truth for yourself.

See the Anonymous Visitors Hiding on Your Site

Book a demo of Customers.ai’s U.S. website visitor identification, customer journey insights and remarketing platform to skyrocket conversions and sales.

Book a Demo

The post The Visitor ID War Report with Larry Kim & Sanjay Jenkins appeared first on Customers.ai.

ether0: A 24B LLM Trained with Reinforcement Learning RL for Advanced Chemical Reasoning Tasks

LLMs have primarily improved accuracy by scaling pre-training data and compute. With available data finite, attention has shifted toward alternative scaling axes, including test-time training and inference-time compute scaling. Reasoning models improve performance by emitting a thought process before the answer, initially through CoT prompting and, more recently, through reinforcement learning (RL) post-training. Scientific domains are an ideal fit for reasoning models because they involve “inverse problems,” in which assessing solution quality is straightforward while generating solutions remains challenging. Despite this conceptual alignment between structured scientific reasoning and model capabilities, current methods lack detailed approaches for scientific reasoning beyond multiple-choice benchmarks.

Technical Evolution of Reasoning Architectures

Reasoning models have evolved from early prompt-based methods such as CoT, zero-shot CoT, and Tree of Thought to more complex RL approaches such as Group Relative Policy Optimization (GRPO) and inference-time scaling. In chemistry, however, reasoning models have focused on knowledge-based benchmarks rather than complex reasoning tasks such as retrosynthesis or molecular design. While datasets such as GPQA-D and MMLU assess chemical knowledge, they fail to evaluate complex chemical reasoning capabilities. Scientific reasoning efforts also remain fragmented: limited attempts include OmniScience for general science, Med-R1 for medical vision-language tasks, and BioReason for genomic reasoning, but no comprehensive framework exists for training chemical reasoning models at scale.

ether0 Architecture and Design Principles

Researchers from FutureHouse have proposed ether0, a model that reasons in natural language and outputs molecular structures as SMILES strings, demonstrating the efficacy of reasoning models on chemical tasks: it outperforms frontier LLMs, human experts, and general chemistry models. The training approach adds several optimizations over vanilla RL, including distillation of reasoning behavior, a dynamic curriculum, and expert model initialization, to improve efficiency and effectiveness. The authors also analyze data efficiency, failure modes, and reasoning behavior to better understand how reasoning helps in solving chemistry problems.

Training Pipeline: Distillation and GRPO Integration

The model employs a multi-stage training procedure that alternates between distillation and GRPO phases. The architecture introduces four special tokens that demarcate reasoning and answer boundaries. Training begins with SFT on long CoT sequences generated by DeepSeek-R1, filtered for valid SMILES format and reasoning quality. Specialist RL then optimizes task-specific policies for different problem categories using GRPO. Next, distillation merges the specialist models into a generalist through SFT on correct responses collected throughout training. The final phase applies generalist GRPO to the merged model, with continuous quality filtering to remove low-quality reasoning and undesirable molecular substructures.
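As a rough illustration of the kind of reward signal such a pipeline can use, the sketch below checks that a completion contains a parseable SMILES string (via RDKit), compares it to a reference answer, and then normalizes rewards within a group of samples in the group-relative style GRPO relies on. This is an assumption-level sketch based on the description above, not FutureHouse's actual reward code; the partial-credit value is invented for illustration.

from rdkit import Chem  # RDKit assumed available for SMILES validation
import numpy as np

def smiles_reward(completion: str, reference_smiles: str) -> float:
    """0.0 if the answer is not valid SMILES, 0.2 (illustrative partial credit)
    for a valid but wrong molecule, 1.0 for a canonical-form match."""
    mol = Chem.MolFromSmiles(completion.strip())
    if mol is None:
        return 0.0
    ref = Chem.MolFromSmiles(reference_smiles)
    if Chem.MolToSmiles(mol) == Chem.MolToSmiles(ref):
        return 1.0
    return 0.2

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantages: standardize rewards within the group of
    completions sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four sampled completions for one prompt ("OCC" is the same molecule as "CCO").
rewards = [smiles_reward(c, "CCO") for c in ["CCO", "OCC", "C1CC1", "not a molecule"]]
print(group_relative_advantages(rewards))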

Performance Evaluation and Comparative Benchmarks

Ether0 demonstrates superior performance against both general-purpose LLMs such as Claude and o1 and chemistry-specific models such as ChemDFM and TxGemma, achieving the highest accuracy across all open-answer categories while maintaining competitive performance on multiple-choice questions. It is also data-efficient: trained on only 60,000 reactions rather than full USPTO datasets, it reaches 70% accuracy after seeing 46,000 training examples, whereas molecular transformers trained on complete datasets reach 64.1%. Under one-shot prompting conditions, ether0 surpasses all evaluated frontier models, and safety alignment procedures successfully filter 80% of unsafe questions without degrading performance on core chemistry tasks.

Conclusion: Implications for Future Scientific LLMs

In conclusion, the researchers introduced ether0, a 24B-parameter model trained on ten challenging molecular tasks, which significantly outperforms frontier LLMs, domain experts, and specialized models through its interleaved RL and behavior-distillation pipeline. The model shows exceptional data efficiency and reasoning capability, excelling in open-answer chemistry tasks involving molecular design, completion, modification, and synthesis. Limitations include potential generalization challenges beyond organic chemistry, some loss of general instruction-following, and the absence of tool-calling integration. The release of model weights, benchmark data, and reward functions establishes a foundation for advancing scientific reasoning models across diverse domains.

Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.

The post ether0: A 24B LLM Trained with Reinforcement Learning RL for Advanced Chemical Reasoning Tasks appeared first on MarkTechPost.

Meta Introduces LlamaRL: A Scalable PyTorch-Based Reinforcement Learning RL Framework for Efficient LLM Training at Scale

Reinforcement Learning’s Role in Fine-Tuning LLMs

Reinforcement learning has emerged as a powerful approach to fine-tune large language models (LLMs) for more intelligent behavior. These models are already capable of performing a wide range of tasks, from summarization to code generation. RL helps by adapting their outputs based on structured feedback. As demand grows for models to be not just accurate but also aligned with complex preferences or rules, RL provides a crucial mechanism to enhance their performance. Consequently, RL has become a central component in the post-training process of many advanced LLM systems.

The Infrastructure Challenges of Scaling RL for LLMs

A major challenge in applying RL to large-scale LLMs lies in its significant resource requirements. Training these models involves not just massive computation but also coordination between components such as policy models, reward scorers, and critics. With model sizes scaling into hundreds of billions of parameters, memory usage, data-communication latency, and GPU idle time become difficult engineering problems. Without efficient design, these limitations hinder the ability to apply RL to newer, larger models; achieving high GPU utilization and minimizing inter-process bottlenecks are vital for scalable and timely training.

Limitations of Previous RL Frameworks for LLMs

Prior solutions have struggled with being either too rigid or too inefficient at scale. Traditional synchronous frameworks execute generation and training in sequential steps, often causing GPU idle time due to mismatched task durations. Tools like DeepSpeed-Chat employ hybrid memory strategies but require models to share memory space, which creates performance bottlenecks during generation. Some distributed methods try to decouple components but still rely on heavy orchestration tools, limiting flexibility. Additionally, earlier frameworks often fail to optimize memory use for the different parallelism needs of training and inference.

Meta’s LlamaRL: A PyTorch-Based Distributed Asynchronous RL Framework

Meta researchers introduced LlamaRL, a fully asynchronous and distributed reinforcement learning framework tailored for training massive LLMs on clusters ranging from a few to thousands of GPUs. They built LlamaRL entirely in PyTorch with a single-controller design that simplifies coordination and enables modular customization. Separate executors manage each RL component (generator, trainer, and reward model) and operate in parallel, reducing waiting time throughout the RL pipeline and allowing model parallelism and memory usage to be optimized independently.
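The decoupling described above can be illustrated with a toy asyncio sketch in which a generation executor streams rollouts into a queue while the trainer consumes them without blocking generation. This is a stand-in for the pattern, not Meta's implementation.

import asyncio
import random

async def generator(queue: asyncio.Queue, n_rollouts: int):
    """Stands in for the generation executor: produces rollouts independently."""
    for i in range(n_rollouts):
        await asyncio.sleep(random.uniform(0.01, 0.05))  # simulated decode time
        await queue.put({"rollout_id": i, "reward": random.random()})
    await queue.put(None)  # sentinel: no more rollouts

async def trainer(queue: asyncio.Queue):
    """Stands in for the training executor: consumes rollouts as they arrive,
    so neither side waits for the other to finish a full step."""
    while True:
        item = await queue.get()
        if item is None:
            break
        # A policy update would happen here; rollouts may be slightly off-policy,
        # which is what importance weighting (AIPO) corrects for.
        await asyncio.sleep(0.02)  # simulated optimizer step

async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)
    await asyncio.gather(generator(queue, 32), trainer(queue))

asyncio.run(main())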

Key Features: Offloading, Memory Efficiency, and Asynchronous Execution

LlamaRL’s architecture prioritizes flexible execution and efficient memory usage. It offloads generation to dedicated executors, allowing the trainer to focus exclusively on model updates. Distributed Direct Memory Access (DDMA) supports this offloading, using NVIDIA NVLink to synchronize weights in under two seconds, even for models with 405 billion parameters. The framework applies Asynchronous Importance-weighted Policy Optimization (AIPO) to correct for the off-policyness caused by asynchronous execution. Each executor operates independently, leverages fine-grained parallelism, and applies quantization to inference models to further reduce compute and memory demands.
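Since the exact AIPO objective is specific to the paper, the PyTorch sketch below shows only the generic importance-weighting idea it builds on: stale rollouts are reweighted by the ratio of current-policy to behavior-policy probabilities, with clipping to bound variance. The clip value and loss form are assumptions for illustration, not the paper's formulation.

import torch

def importance_weighted_policy_loss(
    logp_new: torch.Tensor,      # log-probs of actions under the current policy
    logp_behavior: torch.Tensor, # log-probs under the (stale) policy that generated them
    advantages: torch.Tensor,    # per-token or per-sequence advantages
    clip: float = 2.0,           # illustrative cap on the importance ratio
) -> torch.Tensor:
    """Importance-weighted policy-gradient loss: reweights off-policy samples by
    pi_new / pi_behavior so asynchronous rollouts still give an approximately
    unbiased update, with clipping to keep the variance bounded."""
    ratio = torch.exp(logp_new - logp_behavior)
    ratio = torch.clamp(ratio, max=clip)
    return -(ratio * advantages).mean()

# Toy usage with random tensors standing in for a batch of stale rollouts.
logp_new = torch.randn(64, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(64)
adv = torch.randn(64)
loss = importance_weighted_policy_loss(logp_new, logp_old, adv)
loss.backward()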

Real-World Performance Benchmarks: 10.7x Speedup on 405B Models

LlamaRL delivers significant improvements in training speed without compromising quality. On an 8B-parameter model with 256 GPUs, it cuts the training step time from 22.45 seconds to 8.90 seconds; for the 70B model, the reduction is from 82.32 to 20.67 seconds. Most impressively, on a 405B-parameter model across 1,024 GPUs, LlamaRL slashes the RL step time from 635.8 to just 59.5 seconds, a 10.7× speedup over the synchronous baseline. These gains result not only from asynchronous execution but also from its decoupled memory and compute strategies. Benchmark evaluations on MATH and GSM8K confirm that LlamaRL maintains consistent performance, with some metrics even showing slight improvements.

Final Thoughts: LlamaRL as a Scalable Path Forward in LLM Training

This research presents a practical and scalable solution to one of the most significant bottlenecks in training large language models (LLMs) with reinforcement learning. The introduction of asynchronous training through LlamaRL marks a substantial shift from traditional RL pipelines. By addressing memory constraints, communication delays, and GPU inefficiencies, the framework provides a well-integrated solution for future developments in language model training.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Meta Introduces LlamaRL: A Scalable PyTorch-Based Reinforcement Learning RL Framework for Efficient LLM Training at Scale appeared first on MarkTechPost.

Top 15 Vibe Coding Tools Transforming AI-Driven Software Development in 2025

As AI-first development redefines how software is built, “vibe coding” has emerged as a paradigm-shifting approach where developers simply say what they want, and an agent builds it. Coined by Andrej Karpathy, the term reflects a shift from code-heavy workflows to natural language-driven software prototyping.

Here’s a list of reliable tools that support vibe coding workflows:

Cursor: An AI-native IDE that supports multi-agent prompting and iterative development. Known for its “Agent Mode,” Cursor combines GPT-4o and Claude 3 for full-project generation.

Replit: Browser-based IDE that includes Replit AI Agent for natural language code generation. Great for fast web app prototyping and sharing without setup.

Claude Code (Anthropic): A terminal-style interface that lets users talk to an AI to build and edit code. Retains project memory, enabling multi-step coding via natural language.

GitHub Copilot: Now offers “Agent Mode” for full-code task execution via prompts. Seamlessly integrated into VS Code and GitHub workflows.

Cascade by Windsurf: AI-driven code agent for real-time collaboration and autonomous code generation. Built for iterative dev flow with minimal input overhead.

Junie (JetBrains): JetBrains AI plugin designed for language-aware development. Offers prompt-based interaction and smart debugging workflows.

Augment Code: Chat‑based coding across various editors. Take the lead with local or remote agents to complete tasks end-to-end. Agents can plan, build, and open PRs for you to review.

Zed Editor: Zed is a next-generation code editor designed for high-performance collaboration with humans and AI.

Cody by Sourcegraph: AI assistant for reading, understanding, and updating large codebases. Ideal for tech debt refactoring and legacy system queries.

Tabnine: Context-aware autocompletion powered by on-device LLMs. Offers secure and private AI coding for enterprises.

Codex (OpenAI): OpenAI’s foundational code model that powers many vibe coding tools. Used for command-line, IDE, and app-level code gen.

Lovable: No-code platform with AI design and app-building capabilities. Perfect for product designers and non-technical founders.

Bolt: Generative backend builder that connects AI-driven frontend, APIs, and data stores. Great for MVPs and early-stage products.

Softr: No-code builder that turns Airtable or Google Sheets into responsive web apps. Natural language prompting speeds up UI generation.

Devin by Cognition AI: A fully autonomous AI software engineer capable of planning, coding, debugging, testing, and deploying apps end-to-end.

Conclusion

Vibe coding tools are not just a trend—they’re a functional shift in how software is designed, prototyped, and deployed. As AI agents become more capable, platforms like Cursor, Replit, Claude, and GitHub Copilot are redefining developer velocity. Adopting these vibe coding tools can help you ideate, build, and iterate—at the speed of thought.

The post Top 15 Vibe Coding Tools Transforming AI-Driven Software Development in 2025 appeared first on MarkTechPost.