Getting Started with Agent Communication Protocol (ACP): Build a Weather Agent with Python

The Agent Communication Protocol (ACP) is an open standard designed to enable seamless communication between AI agents, applications, and humans. As AI systems are often developed using diverse frameworks and infrastructures, they can end up isolated and incompatible, limiting their ability to collaborate. ACP addresses this fragmentation by offering a unified RESTful API that facilitates:

Multimodal communication

Both synchronous and asynchronous messaging

Real-time streaming

Support for stateful and stateless agent interactions

Discovery of agents, whether online or offline

Execution of long-running tasks

In this tutorial, we’ll take our first steps with ACP by building a basic server that provides London’s weather information and a simple client that can interact with it.

Setting up the dependencies

Installing the libraries

pip install acp acp-sdk beeai-framework httpx

Creating the ACP Server

We’ll begin by setting up the ACP server, starting with the creation of an agent.py file.

First, we import the necessary libraries. To fetch London's weather data, we use the httpx library to send a request to the Open-Meteo API.

import asyncio
from collections.abc import AsyncGenerator
import httpx

from acp_sdk.models import Message, MessagePart
from acp_sdk.server import Context, RunYield, RunYieldResume, Server

server = Server()

Next, we’ll define an asynchronous helper function called get_london_weather that retrieves the current weather in London using the Open‑Meteo API. This function sends a request with London’s coordinates and returns a formatted weather summary including temperature, wind speed, and weather condition code.

async def get_london_weather() -> str:
    """Fetch current London weather from the free Open-Meteo API."""
    params = {
        "latitude": 51.5072,   # London coordinates
        "longitude": -0.1276,
        "current_weather": True,
        "timezone": "Europe/London",
    }
    url = "https://api.open-meteo.com/v1/forecast"

    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(url, params=params)
        resp.raise_for_status()
        cw = resp.json()["current_weather"]

    return (
        f"Weather in London: {cw['temperature']} °C, "
        f"wind {cw['windspeed']} km/h, code {cw['weathercode']}."
    )

This code defines an ACP-compatible agent using the @server.agent() decorator. The london_weather_agent function handles incoming messages by first yielding a thought message, then asynchronously fetching the current weather in London using the get_london_weather() helper. The weather data is then returned as a plain-text message. Finally, server.run() starts the ACP server and makes the agent available to handle requests.

@server.agent()
async def london_weather_agent(
    input: list[Message], context: Context
) -> AsyncGenerator[RunYield, RunYieldResume]:
    """Returns current London weather."""
    for _ in input:
        yield {"thought": "Fetching London weather…"}
        weather = await get_london_weather()
        yield Message(
            role="agent",
            parts=[MessagePart(content=weather, content_type="text/plain")]
        )

server.run()

Running the server

Next, we run the agent.py file to start the server. Once running, the ACP agent will be available to handle requests at http://localhost:8000.

python agent.py

To verify that your agent is up and running, open a new terminal and execute the following curl command:

curl http://localhost:8000/agents

If everything is working correctly, you’ll receive a JSON response listing your agent, confirming that it’s available and ready to handle requests.

{
  "agents": [
    {
      "name": "london_weather_agent",
      "description": "Returns current London weather.",
      "metadata": {
        "annotations": null,
        "documentation": null,
        "license": null,
        "programming_language": null,
        "natural_languages": null,
        "framework": null,
        "capabilities": null,
        "domains": null,
        "tags": null,
        "created_at": null,
        "updated_at": null,
        "author": null,
        "contributors": null,
        "links": null,
        "dependencies": null,
        "recommended_models": null
      },
      "input_content_types": [
        "*/*"
      ],
      "output_content_types": [
        "*/*"
      ]
    }
  ]
}

Creating the ACP Client

We will now create an ACP client (client.py) to interact with our server. 

This client script uses the ACP SDK to connect to the locally running london_weather_agent via the ACP server at http://localhost:8000. It sends a synchronous message asking for the weather using the run_sync method. Once the agent responds, the script prints out the returned weather details.

import asyncio

from acp_sdk.client import Client
from acp_sdk.models import Message, MessagePart

async def call_london_weather_agent() -> None:
    async with Client(base_url="http://localhost:8000") as client:
        run = await client.run_sync(
            agent="london_weather_agent",
            input=[
                Message(
                    parts=[MessagePart(content="Tell me the weather", content_type="text/plain")]
                )
            ],
        )

        print("Response from london_weather_agent:")
        for message in run.output:
            for part in message.parts:
                print("-", part.content)

if __name__ == "__main__":
    asyncio.run(call_london_weather_agent())

Running the Client

In another terminal, run the following command to send a request to our ACP server:

python client.py

You should see a response from the server containing the current weather in London, returned by the london_weather_agent.

Response from london_weather_agent:
- Weather in London: 20.8 °C, wind 10.1 km/h, code 3.

Check out the Codes. All credit for this research goes to the researchers of this project.
The post Getting Started with Agent Communication Protocol (ACP): Build a Weather Agent with Python appeared first on MarkTechPost.

SynPref-40M and Skywork-Reward-V2: Scalable Human-AI Alignment for State-of-the-Art Reward Models

Understanding Limitations of Current Reward Models

Although reward models play a crucial role in Reinforcement Learning from Human Feedback (RLHF), many of today’s top-performing open models still struggle to reflect the full range of complex human preferences. Even with sophisticated training techniques, meaningful progress has been limited. A major reason appears to be the shortcomings in current preference datasets, which are often too narrow, artificially generated, or poorly vetted. While some rule-based systems are effective for clear tasks like math or coding, they usually fail to capture nuanced human judgment. Moreover, common benchmarks like RewardBench are becoming less reliable indicators of real-world RM performance, showing poor correlation with downstream task success.

Challenges in Preference Data Creation and New Approaches

Creating high-quality preference data has traditionally relied on human annotators, but this method is time-consuming, costly, and sometimes inconsistent. To address this, recent techniques like RLAIF use LLMs to automate annotations, sometimes even outperforming humans. Newer approaches aim to combine the strengths of both by integrating LLM-generated data with human-verified labels. Meanwhile, reward models have evolved from simple scoring systems, such as the Bradley-Terry model, to more complex frameworks, including generative and direct optimization methods. Despite the availability of numerous robust open models and datasets, challenges persist in accurately capturing nuanced human preferences across diverse tasks and languages.

Introducing SynPref-40M: Large-Scale Human-AI Preference Dataset

Researchers from 2050 Research and Skywork AI introduce SynPref-40M, a massive dataset of 40 million preference pairs curated through a two-stage human-AI pipeline. Human annotators ensure quality through strict verification, while LLMs scale up data curation under human guidance. From this, they develop Skywork-Reward-V2, a family of eight reward models (0.6B–8B parameters) trained on a high-quality subset of 26 million preference pairs. These models achieve state-of-the-art results across seven leading benchmarks, excelling in alignment, safety, objectivity, and robustness. The study highlights that success comes not just from data volume, but from careful, iterative curation that blends human expertise with AI scalability.

Scalable Two-Stage Human-AI Curation Pipeline

Current open reward models often suffer from overfitting to narrow benchmarks, such as RewardBench, which limits their real-world usefulness. To address this, the researchers introduce a two-stage, human-AI pipeline for curating large-scale preference data. Stage 1 starts with human-verified annotations to guide LLMs in labeling diverse preference attributes, followed by iterative training and error analysis to refine the reward model. Stage 2 scales this process using consistency checks between the best and a human-trained “gold” reward model, filtering reliable samples without further human input. This approach strikes a balance between quality and scalability, ultimately enabling the creation of tens of millions of high-quality preference pairs.
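To make Stage 2 concrete, here is a minimal Python sketch of the consistency-filtering idea, assuming the gold and current-best reward models are exposed as simple scoring callables (gold_rm, best_rm); it illustrates the mechanism rather than the authors' actual pipeline code.

def consistency_filter(pairs, gold_rm, best_rm):
    """Keep only preference pairs on which the gold and current-best RMs agree."""
    kept = []
    for prompt, chosen, rejected in pairs:
        gold_prefers_chosen = gold_rm(prompt, chosen) > gold_rm(prompt, rejected)
        best_prefers_chosen = best_rm(prompt, chosen) > best_rm(prompt, rejected)
        if gold_prefers_chosen and best_prefers_chosen:
            kept.append((prompt, chosen, rejected))  # reliable sample, no extra human review
    return kept

Pairs on which the two models disagree are the ones most likely to carry spurious cues, so dropping them preserves quality without additional annotation cost.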

Benchmarking Skywork-Reward-V2: Compact Yet Powerful Models

The Skywork-Reward-V2 series demonstrates strong performance across multiple benchmarks, outperforming both larger models (e.g., 70B parameters) and emerging generative reward models. Trained using Qwen3 (0.6B–8B) and Llama 3.1/3.2 (1B–8B) backbones, these models achieve high scores on RewardBench, PPE, RM-Bench, and JudgeBench, with the best-performing variant (Llama-3.1-8B-40M) surpassing all others with an average score of 88.6. Despite smaller model sizes, Skywork-Reward-V2 models benefit from high-quality preference data (SynPref-40M) and efficient training setups, enabling them to generalize better in real-world RLHF scenarios. Notably, even mid-sized models like the Qwen3-1.7B outperform some 70B models, emphasizing the impact of training data quality and methodology over sheer parameter count.

Conclusion and Future Outlook: Scaling with Precision

In conclusion, the researchers present SynPref-40M, a large-scale preference dataset built through a two-stage human-AI collaboration that combines human judgment with LLM-based scalability. Using a curated subset of 26 million preference pairs, the team developed Skywork-Reward-V2, a suite of eight reward models (0.6B–8B parameters) that outperform existing models across seven key benchmarks. These models show strong generalization in aligning with human values, ensuring correctness, safety, and robustness to bias. Extensive studies confirm that both the data quality and the curation method are key drivers of performance. Looking forward, the researchers aim to explore new training strategies, as reward models become central to LLM development and alignment.

Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
The post SynPref-40M and Skywork-Reward-V2: Scalable Human-AI Alignment for State-of-the-Art Reward Models appeared first on MarkTechPost.

New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning

Optimizing LLMs for Human Alignment Using Reinforcement Learning

Large language models often require a further alignment phase to optimize them for human use. In this phase, reinforcement learning plays a central role by enabling models to make decisions based on human feedback or task-based correctness. This fine-tuning allows the models to align more closely with user expectations, making them more suitable for instruction-based applications or precise mathematical tasks.

Challenges in Choosing Offline vs. Online Reinforcement Learning Strategies

A major difficulty arises when choosing the most effective way to conduct this fine-tuning. Training methods fall into two extremes—offline approaches that depend on static, pre-generated data and fully online approaches that continuously update with each new interaction. Each method has distinct challenges. Offline models can’t adapt during training, which limits performance, while online models often demand more computational resources. Moreover, ensuring that models perform well across both mathematical (verifiable) and open-ended (non-verifiable) tasks adds further complexity to this choice.

Overview of Alignment Algorithms: DPO and GRPO

Historically, tools like Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have been employed for model alignment. DPO operates offline and is designed to work with preference-based data pairs. It is valued for its simplicity and data efficiency but lacks the adaptability of online methods. GRPO is based on the PPO algorithm and handles online fine-tuning by comparing groups of outputs to compute relative advantages. While GRPO adapts in real-time and suits dynamic reward systems, its on-policy nature increases computational load and makes experimentation more demanding.

A Balanced Alternative for LLM Alignment

Research introduced by Meta and NYU explored a method to overcome these limitations through a semi-online training setup. This technique modulates how frequently the model’s generation and training components are synchronized, rather than updating at every training step, as in fully online methods, or not at all, as in offline setups. The semi-online method strikes a middle ground by adjusting the synchronization rate. Researchers designed this approach to reduce training time and maintain high model adaptability. The modular setup also allowed them to apply either DPO or GRPO with task-specific reward models in a flexible manner.
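As a rough illustration of how the synchronization rate modulates training (not the authors' actual code), the loop below re-synchronizes the generation model with the trained policy only every s steps; s = 1 recovers fully online training, while a very large s approaches the offline setting. generate_pairs and dpo_step are placeholders for the rollout and update routines.

import copy

def semi_online_train(policy, prompts, s=100, num_steps=1000):
    generator = copy.deepcopy(policy)              # frozen copy used to sample responses
    for step in range(num_steps):
        if step % s == 0:
            generator = copy.deepcopy(policy)      # periodic re-synchronization
        batch = generate_pairs(generator, prompts) # preference pairs sampled off the generator
        dpo_step(policy, batch)                    # policy update (DPO or GRPO variant)
    return policy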

Instruction Following and Mathematical Reasoning

The methodology involved fine-tuning the Llama-3.1-8B-Instruct model on two types of tasks: open-ended instruction following and math problem-solving. For non-verifiable tasks, user prompts were sampled from the WildChat-1M dataset and evaluated using the Athene-RM-8B reward model, which assigns scalar scores to each response. For verifiable tasks, the team utilized the NuminaMath dataset in conjunction with the Math-Verify toolkit, which checks whether generated answers align with expected outputs. Experiments used 32 NVIDIA H200 GPUs for training and 8 GPUs for inference, with different setups comparing offline, semi-online, and online synchronization intervals.

Performance Gains Across Both Verifiable and Non-Verifiable Tasks

Clear performance differences were observed. On Math500, offline DPO reached 53.7% accuracy, whereas semi-online DPO with a synchronization interval of s = 100 achieved 58.9%. Online DPO and GRPO showed similar results at 58.7% and 58.1%, respectively. Similar trends were observed on the NuminaMath benchmark, where offline DPO achieved 36.4% and semi-online variants increased this to 39.4% (s = 10). The gains were not limited to math tasks: when non-verifiable tasks were evaluated with the AlpacaEval 2.0 and Arena-Hard benchmarks, models trained with mixed reward types performed consistently better. Combining verifiable and non-verifiable rewards in a single training setup resulted in stronger average scores, indicating that the method generalizes effectively.

A Flexible, Scalable Approach for Reinforcement Learning in LLMs

This study demonstrates that fine-tuning large language models does not require strict adherence to either offline or online setups. By introducing a flexible synchronization scheme, the research team from Meta and NYU effectively increased training efficiency while maintaining or improving performance. The results show that carefully balancing reward types and training synchronization frequency leads to models that perform well across task types without incurring high computational costs.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning appeared first on MarkTechPost.

AbstRaL: Teaching LLMs Abstract Reasoning via Reinforcement to Boost Robustness on GSM Benchmarks

Recent research indicates that LLMs, particularly smaller ones, frequently struggle with robust reasoning. They tend to perform well on familiar questions but falter when those same problems are slightly altered, such as changing names or numbers, or adding irrelevant but related information. This weakness, known as poor out-of-distribution (OOD) generalization, results in notable accuracy drops, even in simple math tasks. One promising solution is to create synthetic variations of reasoning problems, helping models learn to focus on the underlying logic rather than surface details. Strengthening reasoning in this manner is crucial for developing more general and reliable AI systems.

Abstracting the Core Logic of LLM Reasoning Failures

LLMs have demonstrated impressive reasoning capabilities, yet they often falter when exposed to distribution shifts, such as changes in phrasing, numerical values, or the introduction of distractions. This vulnerability is evident across benchmarks in logic, mathematics, and commonsense reasoning. Prior solutions have relied on data augmentation to expose models to a broader variety of inputs, improving robustness but increasing computational demands. Researchers have also explored formats such as abstraction-of-thought and chain-of-abstraction to teach abstract reasoning, while planning techniques like chain-of-thought and tree-of-thought aid step-by-step problem-solving. Reinforcement learning and preference-based methods provide additional support for reasoning skill development beyond pattern memorization.

AbstRaL’s Symbolic Learning Method to Improve Reasoning Consistency

Researchers from Apple and EPFL propose AbstRaL, a method that teaches LLMs to understand abstract reasoning patterns rather than memorizing surface details. Instead of generating many varied training examples, which is computationally costly, AbstRaL helps LLMs learn the underlying structure of reasoning problems using reinforcement learning. This method connects these abstract patterns to symbolic tools, enabling more reliable problem-solving. Tested on GSM benchmarks, AbstRaL significantly improves LLM performance, especially when faced with input changes or distracting information. It outperforms models trained only with supervised learning by promoting more consistent and context-independent reasoning.

Four Steps to Abstract Symbolic Reasoning via AbstRaL

AbstRaL is a four-step framework designed to teach LLMs to reason abstractly rather than rely on surface patterns. First, it identifies key variables in a question and replaces them with symbolic placeholders. Then, using specially crafted data (GranulAR), the model learns to reason step-by-step with these abstract symbols. Next, it retrieves the general reasoning structure (abstraction) from the symbolic answer. Finally, it uses this abstraction with the original values to compute the correct answer. Reinforcement learning with two rewards, one for correctness and another for symbolic similarity, further improves the model’s ability to generate accurate, context-independent reasoning patterns.
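The first and last steps can be illustrated with a toy example; abstract_question and the eval-based re-grounding below are simplifications with made-up helper logic, not AbstRaL's GranulAR pipeline.

import re

def abstract_question(question):
    """Replace concrete numbers with symbolic placeholders x0, x1, ..."""
    numbers = re.findall(r"\d+", question)
    abstracted = question
    for i, n in enumerate(numbers):
        abstracted = abstracted.replace(n, f"x{i}", 1)
    return abstracted, [int(n) for n in numbers]

question = "Tom has 3 apples and buys 5 more. How many apples does he have?"
abstracted, values = abstract_question(question)
# abstracted -> "Tom has x0 apples and buys x1 more. How many apples does he have?"

symbolic_answer = "x0 + x1"   # the abstraction the model is trained to produce
result = eval(symbolic_answer, {f"x{i}": v for i, v in enumerate(values)})  # re-ground -> 8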

GSM8K Variations Reveal AbstRaL’s Robustness Across LLM Sizes

The researchers evaluate AbstRaL on math reasoning tasks using models such as Llama-3 and Qwen2, training them with a dataset called GranulAR that rewrites math problems in an abstract symbolic form. This helps models focus on structure rather than surface details. They test robustness using altered versions of GSM8K problems, changing numbers, names, and phrasing. Compared to baselines like standard Chain-of-Thought prompting, AbstRaL shows stronger consistency and less accuracy drop on these variations. Especially for smaller models, it improves reliability across reworded inputs. The results suggest that teaching models to reason abstractly makes them more adaptable and less reliant on memorized patterns.

Teaching LLMs Abstract Thinking through Reinforcement Yields Robust Reasoning

In conclusion, AbstRaL is a method designed to enhance abstract reasoning in LLMs, making them more resilient to superficial changes in problems. Unlike traditional fine-tuning or data augmentation, AbstRaL uses reinforcement learning to train models on GranulAR rationales that mix Socratic chain-of-thought with detailed abstraction. This approach helps models strip away surface-level distractions and better connect with symbolic tools. Tested on challenging GSM8K perturbation benchmarks, AbstRaL notably reduces performance drops under distribution shifts, particularly in smaller models. The study shows that learning to abstract improves reasoning robustness more effectively than relying solely on direct supervision.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post AbstRaL: Teaching LLMs Abstract Reasoning via Reinforcement to Boost Robustness on GSM Benchmarks appeared first on MarkTechPost.

Kyutai Releases 2B Parameter Streaming Text-to-Speech TTS with 220ms Latency and 2.5M Hours of Training

Kyutai, an open AI research lab, has released a groundbreaking streaming Text-to-Speech (TTS) model with ~2 billion parameters. Designed for real-time responsiveness, this model delivers ultra-low latency audio generation (220 milliseconds) while maintaining high fidelity. It’s trained on an unprecedented 2.5 million hours of audio and is licensed under the permissive CC-BY-4.0, reinforcing Kyutai’s commitment to openness and reproducibility. This advancement redefines the efficiency and accessibility of large-scale speech generation models, particularly for edge deployment and agentic AI.

Unpacking the Performance: Sub-350ms Latency for 32 Concurrent Users on a Single L40 GPU

The model’s streaming capability is its most distinctive feature. On a single NVIDIA L40 GPU, the system can serve up to 32 concurrent users while keeping the latency under 350ms. For individual use, the model maintains a generation latency as low as 220ms, enabling nearly real-time applications such as conversational agents, voice assistants, and live narration systems. This performance is enabled through Kyutai’s novel Delayed Streams Modeling approach, which allows the model to generate speech incrementally as text arrives.

Key Technical Metrics:

Model size: ~2B parameters

Training data: 2.5 million hours of speech

Latency: 220ms single-user, <350ms with 32 users on one L40 GPU

Language support: English and French

License: CC-BY-4.0 (open source)

Delayed Streams Modeling: Architecting Real-Time Responsiveness

Kyutai’s innovation is anchored in Delayed Streams Modeling, a technique that allows speech synthesis to begin before the full input text is available. This approach is specifically designed to balance prediction quality with response speed, enabling high-throughput streaming TTS. Unlike conventional autoregressive models that suffer from response lag, this architecture maintains temporal coherence while achieving faster-than-real-time synthesis.
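Conceptually, the approach can be pictured as a generator that lets the audio stream trail the incoming text stream by a small fixed delay instead of waiting for the full sentence; this is an illustrative sketch only, with synthesize_step standing in for the real model call, not Kyutai's implementation.

def streaming_tts(text_tokens, delay=3):
    """Yield audio chunks while text is still arriving."""
    buffer = []
    for token in text_tokens:              # text arrives incrementally
        buffer.append(token)
        if len(buffer) > delay:            # audio lags the text by `delay` tokens
            yield synthesize_step(buffer)  # placeholder for the real synthesis call
    yield synthesize_step(buffer)          # final flush once the text stream ends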

The codebase and training recipe for this architecture are available at Kyutai’s GitHub repository, supporting full reproducibility and community contributions.

Model Availability and Open Research Commitment

Kyutai has released the model weights and inference scripts on Hugging Face, making it accessible for researchers, developers, and commercial teams. The permissive CC-BY-4.0 license encourages unrestricted adaptation and integration into applications, provided proper attribution is maintained.

This release supports both batch and streaming inference, making it a versatile foundation for voice cloning, real-time chatbots, accessibility tools, and more. With pretrained models in both English and French, Kyutai sets the stage for multilingual TTS pipelines.

Implications for Real-Time AI Applications

By reducing the speech generation latency to the 200ms range, Kyutai’s model narrows the human-perceptible delay between intent and speech, making it viable for:

Conversational AI: Human-like voice interfaces with low turnaround

Assistive Tech: Faster screen readers and voice feedback systems

Media Production: Voiceovers with rapid iteration cycles

Edge Devices: Optimized inference for low-power or on-device environments

The ability to serve 32 users on a single L40 GPU without quality degradation also makes it attractive for scaling speech services efficiently in cloud environments.

Conclusion: Open, Fast, and Ready for Deployment

Kyutai’s streaming TTS release is a milestone in speech AI. With high-quality synthesis, real-time latency, and generous licensing, it addresses critical needs for both researchers and real-world product teams. The model’s reproducibility, multilingual support, and scalable performance make it a standout alternative to proprietary solutions.

For more details, you can explore the official model card on Hugging Face, technical explanation on Kyutai’s site, and implementation specifics on GitHub.
The post Kyutai Releases 2B Parameter Streaming Text-to-Speech TTS with 220ms Latency and 2.5M Hours of Training appeared first on MarkTechPost.

Can We Improve Llama 3's Reasoning Through Post-Training Alone? ASTRO Shows +16% to +20% Benchmark Gains

Improving the reasoning capabilities of large language models (LLMs) without architectural changes is a core challenge in advancing AI alignment and usability. Researchers at Meta AI and the University of Washington have introduced ASTRO—Autoregressive Search-Taught Reasoner—a novel post-training framework designed to enhance reasoning in Llama-3.1-70B-Instruct. ASTRO is unique in teaching models to perform in-context search, self-reflection, and backtracking, mechanisms often associated with human problem-solving and traditional symbolic search algorithms. Through this approach, ASTRO boosts Llama 3’s math performance on several competitive benchmarks with significant improvements:

MATH 500: 65.8% ➝ 81.8%

AMC 2023: 37.5% ➝ 64.4%

AIME 2024: 10.0% ➝ 30.0%

Search-Guided Chain-of-Thought Generation

ASTRO’s methodology begins with a Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. This search explores both correct and incorrect reasoning paths. The key innovation is procedure cloning: entire search trees are linearized into long chain-of-thoughts (CoT) that naturally encode both failures and recoveries via self-reflection and backtracking. These linearized traces are rewritten in natural language and used as the basis for supervised fine-tuning (SFT).
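A schematic version of this linearization step is sketched below; the Node structure and the backtracking phrase are stand-ins for illustration, not ASTRO's actual data format.

class Node:
    def __init__(self, text, correct, children=None):
        self.text, self.correct, self.children = text, correct, children or []

def linearize(node, trace=None):
    """Depth-first walk that keeps failed branches and inserts backtracking cues."""
    trace = trace if trace is not None else []
    trace.append(node.text)
    for child in node.children:
        linearize(child, trace)
        if not child.correct:              # a dead end was explored
            trace.append("Wait, that approach fails. Let's go back and try another way.")
    return " ".join(trace)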

This results in a model that doesn’t just solve problems step-by-step but reevaluates its trajectory—often backtracking after self-assessment to correct intermediate reasoning mistakes. For instance, the model may interject with phrases like “Let’s go back to where we set up the equation” when its internal confidence drops.

Supervised Fine-Tuning: Injecting Search Priors

ASTRO fine-tunes Llama-3.1-70B-Instruct on 36.1K curated CoT solutions from MATH, AMC/AIME, and AoPS-style datasets. The model trained with ASTRO-SFT achieves:

MATH 500: 69.6%

AMC 2023: 51.9%

AIME 2024: 16.3%

These scores are competitive with or exceed those of baseline and SPOC/Step-KTO variants trained without explicit search priors. Importantly, even SFT alone—without reinforcement learning—yields performance boosts by exposing the model to search-structured reasoning data.

Reinforcement Learning with Search-Aware Initialization

ASTRO proceeds to reinforcement learning (RL) by initializing with the SFT checkpoint and running an RL loop using a modified Group Relative Policy Optimization (GRPO). Unlike standard preference-based RL, ASTRO employs verifiable reward signals (+1 for correct, -1 for incorrect) on 8.7K moderately difficult prompts. During training, the model’s CoT generation grows longer—from ~1.8K to ~6K tokens—demonstrating deeper internal exploration.
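The reward signal and the group-relative advantage are simple to write down; the sketch below assumes placeholder extract_final_answer and check_answer helpers and is not the paper's training code.

def verifiable_reward(generated_cot, reference_answer):
    """+1 if the final answer verifies as correct, -1 otherwise."""
    predicted = extract_final_answer(generated_cot)   # placeholder answer parser
    return 1.0 if check_answer(predicted, reference_answer) else -1.0

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]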

The resulting ASTRO-RL model achieves:

MATH 500: 81.8%

AMC 2023: 64.4%

AIME 2024: 30.0%

These results rival or exceed models with larger parameter counts and confirm the importance of ASTRO’s search-aware initialization.

Backtracking Behavior Correlates with Reasoning Success

A striking empirical observation is the positive correlation between backtracking frequency and performance. As training progresses, ASTRO-RL exhibits more self-corrective actions and deeper exploration. Pearson correlation coefficients across benchmarks exceed 0.8, indicating that self-reflection and backtracking are not merely cosmetic behaviors but functionally tied to better accuracy.
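The correlation itself is straightforward to compute; the numbers below are illustrative placeholders, not values from the paper.

import numpy as np

backtracks_per_solution = np.array([0.5, 1.2, 2.0, 2.8, 3.5])   # hypothetical values
accuracy = np.array([0.55, 0.62, 0.70, 0.76, 0.81])             # hypothetical values

r = np.corrcoef(backtracks_per_solution, accuracy)[0, 1]
print(f"Pearson r = {r:.2f}")   # values above 0.8 would mirror the paper's finding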

Comparative Insights and Broader Impact

Control experiments comparing ASTRO with models trained on direct CoT solutions (no search priors) reveal that even when trained on the same problem sets and search trees, ASTRO consistently outperforms. For instance, ASTRO-RL beats Direct-RL by:

+2% on MATH 500

+3.9% on AMC 2023

+2.9% on AIME 2024

Moreover, ASTRO’s outputs can be visualized as directed graphs, with nodes as reasoning steps and edges capturing transitions, reflections, and corrections—facilitating better interpretability.


Conclusion

ASTRO demonstrates that LLMs like Llama 3 can learn to reason more effectively—not through larger models or longer pretraining, but via principled post-training techniques. By mimicking search algorithms in natural language, ASTRO enables models to think before answering, doubt their own steps, and correct themselves mid-reasoning. This framework sets a new benchmark for fine-tuning open LLMs to approach human-like reasoning through search-inspired behaviors.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Can We Improve Llama 3’s Reasoning Through Post-Training Alone? ASTRO Shows +16% to +20% Benchmark Gains appeared first on MarkTechPost.

A Tutorial on Using OpenAI Codex with GitHub Repositories for Seamless AI-Powered Development

When we first land in the Codex environment, it feels like stepping into a co-pilot’s seat for coding. Codex is designed to take over much of the routine or overwhelming parts of software engineering, like understanding massive codebases, drafting PRs, and finding bugs, and help us focus on higher-level thinking. In this guided setup, we explore how to connect a GitHub repository, configure a smart environment, and utilize Codex to kick-start useful engineering tasks.

As we begin, we start with this blank workspace. At this point, we haven’t linked any code or given the assistant any instructions, so it’s patiently waiting for us to define the first step. It feels clean, open, and ready for us to steer the direction of our development work.

We then proceed to select the GitHub organization and repository with which Codex will work. In this case, we chose the “teammmtp” organization and linked it to the private `ai-scribe-stories` repo. Codex smartly filters only the repositories we have access to, ensuring we don’t accidentally link the wrong one. We’re also asked whether we want to allow the agent to use the internet. We chose to leave it off for now, meaning Codex will rely solely on local dependencies and scripts. This setting is ideal when we want to maintain a secure and fully deterministic environment.

Now, we get introduced to the actual powers of Codex as a software engineering agent. It outlines four main capabilities: drafting GitHub pull requests automatically, navigating our codebase to identify bugs and suggest improvements, running lint and tests to ensure code quality, and being powered by a fine-tuned model specifically designed for understanding large repositories. At this point, we also have access to the GitHub push menu where we can choose between actions like creating PRs, copying patch code, or applying git commands, just by clicking a dropdown. This interface makes our workflow seamless and gives us fine control over how we want to ship code.

With our repo and features ready, Codex recommends a set of initial tasks to get us started. We select suggestions that include explaining the overall code structure, identifying and fixing bugs, and reviewing for minor issues such as typos or broken tests. What’s great here is that Codex helps break the ice for us, even if we’re unfamiliar with the project. These cards serve as bite-sized onboarding challenges, enabling us to quickly understand and improve the codebase while seeing Codex in action. We checked all three, signaling that we’re ready for the assistant to begin analyzing and working alongside us.

In this task dashboard, we’re asked, “What are we coding next?”, a gentle nudge that we’re now in control of what the AI focuses on. We can either create a completely custom task or select from one of the three predefined options. We notice that Codex has also enabled “Best-of-N,” a feature that generates multiple implementation suggestions for a task, allowing us to pick the one we like most. We’ve linked the agent to the `main` branch of our repository and configured the task to run in a 1x container. It’s like telling a teammate, “Here’s the branch, here’s the task, go to work.”

Now Codex starts digging into the codebase. We see a command running in the terminal that’s grepping for the word “react” in `vite.config.ts`. This step demonstrates how Codex doesn’t just make blind assumptions; it actively searches through our files, identifies references to libraries and components, and builds a picture of the tools our project is using. Watching this in real time makes the experience feel dynamic, like having an assistant that’s not just smart but also curious and methodical in its approach.

Finally, Codex delivers a detailed breakdown of the codebase and some well-thought-out suggestions for improvement. We learn that the project is built using Vite, React, TypeScript, Tailwind CSS, and shadcn-ui. It identifies our routing, styling configurations, and toast logic. It also tells us what’s missing, such as automated testing and realistic data fetching. These insights go beyond basic code reading; they help us prioritize tasks that matter and create a roadmap for evolving the project. Codex also utilizes specific file names and components in its report, demonstrating that it truly understands our structure, not just superficially, but functionally.

In conclusion, we’ve connected a GitHub repository and also unlocked an AI-powered engineering assistant that reads our code, interprets its design, and proactively suggests ways to improve it. We experienced Codex transitioning from a passive helper to an active co-developer, offering guidance, running commands, and generating summaries just like a skilled teammate would. Whether we’re improving tests, documenting logic, or cleaning up structure, Codex provides the clarity and momentum we often need when diving into unfamiliar code. With this setup, we’re now ready to build faster, debug smarter, and collaborate more efficiently with AI as our coding partner.
The post A Tutorial on Using OpenAI Codex with GitHub Repositories for Seamless AI-Powered Development appeared first on MarkTechPost.

Crome: Google DeepMind's Causal Framework for Robust Reward Modeling in LLM Alignment

Reward models are fundamental components for aligning LLMs with human feedback, yet they are prone to reward hacking: they latch onto superficial attributes such as response length or formatting rather than identifying true quality indicators like factuality and relevance. This problem arises because standard training objectives fail to differentiate between spurious correlations present in the training data and genuine causal drivers of response quality. The failure to separate these factors leads to brittle reward models (RMs) that generate misaligned policies. What is needed is a method that uses a causal understanding of preference formation to train RMs that are sensitive to causal quality attributes and invariant to spurious cues.

Limitations of Existing RM Approaches and the Need for Causal Robustness

Existing methods try to solve reward hacking in standard RLHF systems that rely on Bradley-Terry or pairwise ranking objectives. These include architectural modifications such as Odin, policy-level adjustments, and data-centric methods involving ensembles or consistency checks. Recent causal-inspired methods use MMD regularization against pre-specified spurious factors or estimate causal effects through corrected rewrites. However, these methods target only predetermined spurious factors and miss unknown correlates, their augmentation strategies remain coarse, and evaluation-focused methods fail to equip reward models with robust training mechanisms against diverse spurious variations.

Introducing Crome: Causally Robust Reward Modeling for LLMs

Researchers from Google DeepMind, McGill University, and MILA – Quebec AI Institute have proposed Crome (Causally Robust Reward Modeling), a framework built on an explicit causal model of answer generation. Crome trains RMs to differentiate genuine quality drivers from superficial cues by adding preference datasets with targeted, LLM-generated counterfactual examples. Moreover, it creates two types of synthetic training pairs: (a) Causal Augmentations, which introduce changes along specific causal attributes, such as factuality to enforce sensitivity to true quality shifts, and (b) Neutral Augmentations that enforce invariance along spurious attributes like style using tie-labels. Crome enhances robustness, increasing RewardBench accuracy by up to 4.5%, enhancing safety and reasoning.
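A minimal sketch of how such augmented pairs might be assembled is shown below, assuming an llm_rewrite helper that performs the requested counterfactual edit; Crome's actual prompting and labeling scheme are more involved.

def build_augmentations(prompt, chosen, rejected):
    pairs = [(prompt, chosen, rejected, "chosen")]        # original preference pair

    # Causal augmentation: degrade a genuine quality attribute (e.g., factuality)
    # of the preferred answer, so the degraded version must now lose.
    degraded = llm_rewrite(chosen, "introduce a factual error, keep the style identical")
    pairs.append((prompt, chosen, degraded, "chosen"))

    # Neutral augmentation: change only a spurious attribute (e.g., style or length);
    # the pair receives a tie label to enforce invariance to that attribute.
    restyled = llm_rewrite(chosen, "rewrite in a different style, keep the content identical")
    pairs.append((prompt, chosen, restyled, "tie"))

    return pairs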

Technical Approach: Counterfactual Augmentation and Composite Loss Optimization

Crome operates through two main phases: generating attribute-aware counterfactual data based on a causal model, and training the reward model with a specialized loss on the combined data. The paper provides a theoretical analysis of how causal augmentation isolates true reward drivers from spurious correlates under an idealized model. Crome utilizes the UltraFeedback dataset with counterfactuals generated using Gemini 2.0 Flash, and evaluates performance on RewardBench and reWordBench. The researchers use diverse base LLMs in their experiments, including Gemma-2-9B-IT, Qwen2.5-7B, and Gemma-2-2B, for both Pairwise Preference and Bradley-Terry reward models, and measure downstream alignment impact through Best-of-N selection on multiple tasks.

Performance Gains: From RewardBench to WildGuardTest

On RewardBench, Crome achieves improvements in ranking accuracy over RRM across diverse base models, with significant gains in Safety (up to 13.18%) and Reasoning (up to 7.19%) categories. Crome shows aggregate accuracy gains of up to 9.1% on reWordBench with Gemma-2-9B-IT in PairPM settings and superior performance on 21 out of 23 transformations. Moreover, it shows a smaller decrease in ranking accuracy from RewardBench to reWordBench compared to RRM (19.78% versus 21.54%). Crome shows excellent safety improvements on WildGuardTest with Best-of-N selection, achieving lower attack success ratios on harmful prompts while maintaining similar refusal rates on benign prompts.

Conclusion and Future Directions in Causal Data Augmentation

In conclusion, the researchers introduced Crome, a causal framework that addresses reward hacking during RM training. It employs two targeted synthetic data augmentation strategies: Causal Augmentations and Neutral Augmentations. Crome outperforms strong baselines across multiple base models and reward modeling techniques on RewardBench, and shows superior robustness to spurious correlations on reWordBench. This curation-centered training method opens new research directions in synthetic data generation for base model training, where causal attribute verification could prove highly beneficial for future work on robust language model alignment.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Crome: Google DeepMind’s Causal Framework for Robust Reward Modeling in LLM Alignment appeared first on MarkTechPost.

Thought Anchors: A Machine Learning Framework for Identifying and Measuring Key Reasoning Steps in Large Language Models with Precision

Understanding the Limits of Current Interpretability Tools in LLMs

AI models, such as DeepSeek and GPT variants, rely on billions of parameters working together to handle complex reasoning tasks. Despite their capabilities, one major challenge is understanding which parts of their reasoning have the greatest influence on the final output. This is especially crucial for ensuring the reliability of AI in critical areas, such as healthcare or finance. Current interpretability tools, such as token-level importance or gradient-based methods, offer only a limited view. These approaches often focus on isolated components and fail to capture how different reasoning steps connect and impact decisions, leaving key aspects of the model’s logic hidden.

Thought Anchors: Sentence-Level Interpretability for Reasoning Paths

Researchers from Duke University and Aiphabet introduced a novel interpretability framework called “Thought Anchors.” This methodology specifically investigates sentence-level reasoning contributions within large language models. To facilitate widespread use, the researchers also developed an accessible, detailed open-source interface at thought-anchors.com, supporting visualization and comparative analysis of internal model reasoning. The framework comprises three primary interpretability components: black-box measurement, white-box method with receiver head analysis, and causal attribution. These approaches uniquely target different aspects of reasoning, providing comprehensive coverage of model interpretability. Thought Anchors explicitly measure how each reasoning step affects model responses, thus delineating meaningful reasoning flows throughout the internal processes of an LLM.

Evaluation Methodology: Benchmarking on DeepSeek and the MATH Dataset

The research team detailed three interpretability methods clearly in their evaluation. The first approach, black-box measurement, employs counterfactual analysis by systematically removing sentences within reasoning traces and quantifying their impact. For instance, the study demonstrated sentence-level accuracy assessments by running analyses over a substantial evaluation dataset, encompassing 2,000 reasoning tasks, each producing 19 responses. They utilized the DeepSeek Q&A model, which features approximately 67 billion parameters, and tested it on a specifically designed MATH dataset comprising around 12,500 challenging mathematical problems. Second, receiver head analysis measures attention patterns between sentence pairs, revealing how previous reasoning steps influence subsequent information processing. The study found significant directional attention, indicating that certain anchor sentences significantly guide subsequent reasoning steps. Third, the causal attribution method assesses how suppressing the influence of specific reasoning steps impacts subsequent outputs, thereby clarifying the precise contribution of internal reasoning elements. Combined, these techniques produced precise analytical outputs, uncovering explicit relationships between reasoning components.
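The black-box measurement reduces to a resampling loop of the following shape; run_model and is_correct are placeholders for the actual generation and grading routines, so this is a schematic rather than the study's evaluation code.

def sentence_importance(question, sentences, answer, n_samples=19):
    """Rough counterfactual importance: accuracy drop when sentence i is removed."""
    scores = []
    for i, _ in enumerate(sentences):
        ablated = sentences[:i] + sentences[i + 1:]              # trace without sentence i
        correct = sum(
            is_correct(run_model(question, reasoning_prefix=ablated), answer)
            for _ in range(n_samples)                            # resample continuations
        )
        scores.append(1 - correct / n_samples)                   # larger drop = more influential
    return scores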

Quantitative Gains: High Accuracy and Clear Causal Linkages

Applying Thought Anchors, the research group demonstrated notable improvements in interpretability. Black-box analysis achieved robust performance metrics: for each reasoning step within the evaluation tasks, the research team observed clear variations in impact on model accuracy. Specifically, correct reasoning paths consistently achieved accuracy levels above 90%, significantly outperforming incorrect paths. Receiver head analysis provided evidence of strong directional relationships, measured through attention distributions across all layers and attention heads within DeepSeek. These directional attention patterns consistently guided subsequent reasoning, with receiver heads demonstrating correlation scores averaging around 0.59 across layers, confirming the interpretability method’s ability to effectively pinpoint influential reasoning steps. Moreover, causal attribution experiments explicitly quantified how reasoning steps propagated their influence forward. Analysis revealed that causal influences exerted by initial reasoning sentences resulted in observable impacts on subsequent sentences, with a mean causal influence metric of approximately 0.34, further solidifying the precision of Thought Anchors.

Also, the research addressed another critical dimension of interpretability: attention aggregation. Specifically, the study analyzed 250 distinct attention heads within the DeepSeek model across multiple reasoning tasks. Among these heads, the research identified that certain receiver heads consistently directed significant attention toward particular reasoning steps, especially during mathematically intensive queries. In contrast, other attention heads exhibited more distributed or ambiguous attention patterns. The explicit categorization of receiver heads by their interpretability provided further granularity in understanding the internal decision-making structure of LLMs, potentially guiding future model architecture optimizations.

Key Takeaways: Precision Reasoning Analysis and Practical Benefits

Thought Anchors enhance interpretability by focusing specifically on internal reasoning processes at the sentence level, substantially outperforming conventional activation-based methods.

Combining black-box measurement, receiver head analysis, and causal attribution, Thought Anchors deliver comprehensive and precise insights into model behaviors and reasoning flows.

The application of the Thought Anchors method to the DeepSeek Q&A model (with 67 billion parameters) yielded compelling empirical evidence, characterized by a strong correlation (mean attention score of 0.59) and a causal influence (mean metric of 0.34).

The open-source visualization tool at thought-anchors.com provides significant usability benefits, fostering collaborative exploration and improvement of interpretability methods.

The study’s extensive attention head analysis (250 heads) further refined the understanding of how attention mechanisms contribute to reasoning, offering potential avenues for improving future model architectures.

Thought Anchors’ demonstrated capabilities establish strong foundations for utilizing sophisticated language models safely in sensitive, high-stakes domains such as healthcare, finance, and critical infrastructure.

The framework proposes opportunities for future research in advanced interpretability methods, aiming to refine the transparency and robustness of AI further.

Check out the Paper and Interaction. All credit for this research goes to the researchers of this project.
The post Thought Anchors: A Machine Learning Framework for Identifying and Measuring Key Reasoning Steps in Large Language Models with Precision appeared first on MarkTechPost.

Transforming network operations with AI: How Swisscom built a network …

In the telecommunications industry, managing complex network infrastructures requires processing vast amounts of data from multiple sources. Network engineers often spend considerable time manually gathering and analyzing this data, taking away valuable hours that could be spent on strategic initiatives. This challenge led Swisscom, Switzerland’s leading telecommunications provider, to explore how AI can transform their network operations.
Swisscom’s Network Assistant, built on Amazon Bedrock, represents a significant step forward in automating network operations. This solution combines generative AI capabilities with a sophisticated data processing pipeline to help engineers quickly access and analyze network data. Swisscom used AWS services to create a scalable solution that reduces manual effort and provides accurate and timely network insights.
In this post, we explore how Swisscom developed their Network Assistant. We discuss the initial challenges and how they implemented a solution that delivers measurable benefits. We examine the technical architecture, discuss key learnings, and look at future enhancements that can further transform network operations. We highlight best practices for handling sensitive data for Swisscom to comply with the strict regulations governing the telecommunications industry. This post provides telecommunications providers or other organizations managing complex infrastructure with valuable insights into how you can use AWS services to modernize operations through AI-powered automation.
The opportunity: Improve network operations
Network engineers at Swisscom faced the daily challenge of managing complex network operations while maintaining optimal performance and compliance. These skilled professionals were tasked with monitoring and analyzing vast amounts of data from multiple, decoupled sources. The process was repetitive and demanded considerable time and attention to detail; in certain scenarios, fulfilling the assigned tasks consumed more than 10% of an engineer's availability. The manual nature of the work presented several critical pain points. Consolidating data from multiple network entities into a coherent overview was particularly challenging, because engineers had to navigate various tools and systems to retrieve telemetry information about data sources and network parameters from extensive documentation, verify KPIs through complex calculations, and identify potential issues of diverse nature. This fragmented approach consumed valuable time and introduced the risk of human error in data interpretation and analysis. The situation called for a solution that addressed three primary concerns:

Efficiency in data retrieval and analysis
Accuracy in calculations and reporting
Scalability to accommodate growing data sources and use cases

The team required a streamlined approach to access and analyze network data, maintain compliance with defined metrics and thresholds, and deliver fast and accurate responses to events while maintaining the highest standards of data security and sovereignty.
Solution overview
Swisscom’s approach to develop the Network Assistant was methodical and iterative. The team chose Amazon Bedrock as the foundation for their generative AI application and implemented a Retrieval Augmented Generation (RAG) architecture using Amazon Bedrock Knowledge Bases to enable precise and contextual responses to engineer queries. The RAG approach is implemented in three distinct phases:

Retrieval – User queries are matched with relevant knowledge base content through embedding models
Augmentation – The context is enriched with retrieved information
Generation – The large language model (LLM) produces informed responses

The following diagram illustrates the solution architecture.
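In code, the three phases amount to a small piece of orchestration; the sketch below uses hypothetical embed, vector_search, and call_llm helpers rather than the actual Amazon Bedrock Knowledge Bases API calls used in the solution.

def answer_query(query, knowledge_base):
    # Retrieval: match the query against knowledge base content via embeddings
    query_vector = embed(query)
    documents = vector_search(knowledge_base, query_vector, top_k=5)

    # Augmentation: enrich the context with the retrieved information
    context = "\n\n".join(doc.text for doc in documents)
    prompt = f"Use the following network documentation to answer.\n{context}\n\nQuestion: {query}"

    # Generation: the large language model produces an informed response
    return call_llm(prompt)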

The solution architecture evolved through several iterations. The initial implementation established basic RAG functionality by feeding the Amazon Bedrock knowledge base with tabular data and documentation. However, the Network Assistant struggled to manage large input files containing thousands of rows with numerical values across multiple parameter columns. This complexity highlighted the need for a more selective approach that could identify only the rows relevant for specific KPI calculations. At that point, the retrieval process wasn’t returning the precise number of vector embeddings required to calculate the formulas, prompting the team to refine the solution for greater accuracy.
Subsequent iterations enhanced the assistant with agent-based processing and action groups. The team implemented AWS Lambda functions using Pandas or Spark for data processing, enabling accurate numerical results to be computed from the user's natural language prompt.
A significant advancement was introduced with the implementation of a multi-agent approach, using Amazon Bedrock Agents, where specialized agents handle different aspects of the system:

Supervisor agent – Orchestrates interactions between documentation management and calculator agents to provide comprehensive and accurate responses.
Documentation management agent – Helps the network engineers access information in large volumes of data efficiently and extract insights about data sources, network parameters, configuration, or tooling.
Calculator agent – Supports the network engineers to understand complex network parameters and perform precise data calculations out of telemetry data. This produces numerical insights that help perform network management tasks; optimize performance; maintain network reliability, uptime, and compliance; and assist in troubleshooting.

The following diagram illustrates how the enhanced extract, transform, and load (ETL) pipeline interacts with Amazon Bedrock.

To achieve the desired accuracy in KPI calculations, the data pipeline was refined to achieve consistent and precise performance, which leads to meaningful insights. The team implemented an ETL pipeline with Amazon Simple Storage Service (Amazon S3) as the data lake to store input files following a daily batch ingestion approach, AWS Glue for automated data crawling and cataloging, and Amazon Athena for SQL querying. At this point, it became possible for the calculator agent to forego the Pandas or Spark data processing implementation. Instead, by using Amazon Bedrock Agents, the agent translates natural language user prompts into SQL queries. In a subsequent step, the agent runs the relevant SQL queries selected dynamically through analysis of various input parameters, providing the calculator agent an accurate result. This serverless architecture supports scalability, cost-effectiveness, and maintains high accuracy in KPI calculations. The system integrates with Swisscom’s on-premises data lake through daily batch data ingestion, with careful consideration of data security and sovereignty requirements.
To enhance data security and appropriate ethics in the Network Assistant responses, a series of guardrails were defined in Amazon Bedrock. The application implements a comprehensive set of data security guardrails to protect against malicious inputs and safeguard sensitive information. These include content filters that block harmful categories such as hate, insults, violence, and prompt-based threats like SQL injection. Specific denied topics and sensitive identifiers (for example, IMSI, IMEI, MAC address, or GPS coordinates) are filtered through manual word filters and pattern-based detection, including regular expressions (regex). Sensitive data such as personally identifiable information (PII), AWS access keys, and serial numbers are blocked or masked. The system also uses contextual grounding and relevance checks to verify model responses are factually accurate and appropriate. In the event of restricted input or output, standardized messaging notifies the user that the request can’t be processed. These guardrails help prevent data leaks, reduce the risk of DDoS-driven cost spikes, and maintain the integrity of the application’s outputs.
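As a simplified illustration of the pattern-based detection described above, the snippet below masks a few identifier formats with regular expressions; these patterns are examples for illustration only, not Swisscom's production guardrail definitions (which are configured in Amazon Bedrock Guardrails).

import re

PATTERNS = {
    "MAC_ADDRESS": re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"),
    "IMSI": re.compile(r"\b\d{15}\b"),                          # 15-digit subscriber identity
    "GPS": re.compile(r"-?\d{1,3}\.\d{3,},\s*-?\d{1,3}\.\d{3,}"),
}

def mask_sensitive(text):
    """Replace sensitive identifiers with placeholder labels before model input."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} MASKED]", text)
    return text

print(mask_sensitive("Device 00:1A:2B:3C:4D:5E reported at 47.3769, 8.5417"))
# -> "Device [MAC_ADDRESS MASKED] reported at [GPS MASKED]"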
Results and benefits
The implementation of the Network Assistant is set to deliver substantial and measurable benefits to Swisscom's network operations. The most significant impact is time savings: network engineers are estimated to see a 10% reduction in time spent on routine data retrieval and analysis tasks. This efficiency gain translates to nearly 200 hours saved per engineer annually and represents a significant improvement in operational efficiency. The financial impact is equally impressive. The solution is projected to provide substantial cost savings per engineer annually, with operational costs at less than 1% of the total value generated. The return on investment increases as additional teams and use cases are incorporated into the system, demonstrating strong scalability potential.
Beyond the quantifiable benefits, the Network Assistant is expected to transform how engineers interact with network data. The enhanced data pipeline supports accuracy in KPI calculations, which is critical for network health tracking, and the multi-agent approach provides orchestrated, comprehensive responses to complex natural language queries.
As a result, engineers have instant access to a wide range of network parameters, data source information, and troubleshooting guidance from a single personalized endpoint that they can query through natural language. This enables them to focus on strategic tasks rather than routine data gathering and analysis, leading to a significant reduction in toil that aligns with Swisscom’s SRE principles.
Lessons learned
Throughout the development and implementation of the Swisscom Network Assistant, several learnings emerged that shaped the solution. The team needed to address data sovereignty and security requirements, particularly when processing data on AWS. This led to careful consideration of data classification and compliance with the regulatory requirements applicable to the telecommunications sector, to make sure that sensitive data is handled appropriately. The application also underwent a strict threat model evaluation, verifying the robustness of its interfaces against vulnerabilities and proactively hardening them. The threat model was applied to assess worst-case scenarios, and data flow diagrams were created to depict the major data flows inside and beyond the application boundaries. The AWS architecture was specified in detail, and trust boundaries were set to indicate which portions of the application trust each other. Threats were identified following the STRIDE methodology (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege), and countermeasures, including Amazon Bedrock Guardrails, were defined to avoid or mitigate threats in advance.
A critical technical insight was that complex calculations involving significant data volume management required a different approach than mere AI model interpretation. The team implemented an enhanced data processing pipeline that combines the contextual understanding of AI models with direct database queries for numerical calculations. This hybrid approach facilitates both accuracy in calculations and richness in contextual responses.
The choice of a serverless architecture proved particularly beneficial: it minimized the need to manage compute resources and provided automatic scaling capabilities. The pay-per-use model of AWS services helped keep operational costs low while maintaining high performance. Additionally, the team’s decision to implement a multi-agent approach provided the flexibility needed to handle diverse types of queries and use cases effectively.
Next steps
Swisscom has ambitious plans to further enhance the Network Assistant’s capabilities. A key upcoming feature is a network health tracker agent that provides proactive monitoring of network KPIs. This agent will automatically generate reports that categorize issues based on criticality, enable faster response times, and improve the quality of resolution for potential network issues. The team is also exploring the integration of Amazon Simple Notification Service (Amazon SNS) to enable proactive alerting for critical network status changes. This can include direct integration with operational tools that alert on-call engineers, to further streamline the incident response process. The enhanced notification system will help engineers address potential issues before they critically impact network performance, providing a detailed action plan that includes the affected network entities, the severity of the event, and precisely what went wrong.
The roadmap also includes expanding the system’s data sources and use cases. Integration with additional internal network systems will provide more comprehensive network insights. The team is also working on developing more sophisticated troubleshooting features, using the growing knowledge base and agentic capabilities to provide increasingly detailed guidance to engineers.
Additionally, Swisscom is adopting infrastructure as code (IaC) principles by implementing the solution using AWS CloudFormation. This approach introduces automated and consistent deployments while providing version control of infrastructure components, facilitating simpler scaling and management of the Network Assistant solution as it grows.
Conclusion
The Network Assistant represents a significant advancement in how Swisscom can manage its network operations. By using AWS services and implementing a sophisticated AI-powered solution, they have successfully addressed the challenges of manual data retrieval and analysis. As a result, they have boosted both accuracy and efficiency so network engineers can respond quickly and decisively to network events. The solution’s success lies not only in the quantifiable benefits in time and cost savings but also in its potential for future expansion. The serverless architecture and multi-agent approach provide a solid foundation for adding new capabilities and scaling across different teams and use cases. As organizations worldwide grapple with similar challenges in network operations, Swisscom’s implementation serves as a valuable blueprint for using cloud services and AI to transform traditional operations. The combination of Amazon Bedrock with careful attention to data security and accuracy demonstrates how modern AI solutions can help solve real-world engineering challenges.
As the complexity of managing network operations continues to grow, the lessons from Swisscom’s journey can be applied to many engineering disciplines. We encourage you to consider how Amazon Bedrock and similar AI solutions might help your organization overcome its own comprehension and process improvement barriers. To learn more about implementing generative AI in your workflows, explore Amazon Bedrock Resources or contact AWS.
Additional resources
For more information about Amazon Bedrock Agents and its use cases, refer to the following resources:

Generative AI for telecom
Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 1
Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 2

About the authors
Pablo García Benedicto is an experienced Data & AI Cloud Engineer with strong expertise in cloud hyperscalers and data engineering. With a background in telecommunications, he currently works at Swisscom, where he leads and contributes to projects involving Generative AI applications and agents using Amazon Bedrock. Aiming for AI and data specialization, his latest projects focus on building intelligent assistants and autonomous agents that streamline business information retrieval, leveraging cloud-native architectures and scalable data pipelines to reduce toil and drive operational efficiency.
Rajesh Sripathi is a Generative AI Specialist Solutions Architect at AWS, where he partners with global Telecommunication and Retail & CPG customers to develop and scale generative AI applications. With over 18 years of experience in the IT industry, Rajesh helps organizations use cutting-edge cloud and AI technologies for business transformation. Outside of work, he enjoys exploring new destinations through his passion for travel and driving.
Ruben Merz is a Principal Solutions Architect at AWS. With a background in distributed systems and networking, his work with customers at AWS focuses on digital sovereignty, AI, and networking.
Jordi Montoliu Nerin is a Data & AI Leader currently serving as Senior AI/ML Specialist at AWS, where he helps worldwide telecommunications customers implement AI strategies after previously driving Data & Analytics business across EMEA regions. He has over 10 years of experience, where he has led multiple Data & AI implementations at scale, led executions of data strategy and data governance frameworks, and has driven strategic technical and business development programs across multiple industries and continents. Outside of work, he enjoys sports, cooking and traveling.

End-to-End model training and deployment with Amazon SageMaker Unified …

Although rapid generative AI advancements are revolutionizing natural language processing tasks across organizations, developers and data scientists face significant challenges customizing these large models. These hurdles include managing complex workflows, efficiently preparing large datasets for fine-tuning, implementing sophisticated fine-tuning techniques while optimizing computational resources, consistently tracking model performance, and achieving reliable, scalable deployment. The fragmented nature of these tasks often leads to reduced productivity, increased development time, and potential inconsistencies in the model development pipeline. Organizations need a unified, streamlined approach that simplifies the entire process from data preparation to model deployment.
To address these challenges, AWS has expanded Amazon SageMaker with a comprehensive set of data, analytics, and generative AI capabilities. At the heart of this expansion is Amazon SageMaker Unified Studio, a centralized service that serves as a single integrated development environment (IDE). SageMaker Unified Studio streamlines access to familiar tools and functionality from purpose-built AWS analytics and artificial intelligence and machine learning (AI/ML) services, including Amazon EMR, AWS Glue, Amazon Athena, Amazon Redshift, Amazon Bedrock, and Amazon SageMaker AI. With SageMaker Unified Studio, you can discover data through Amazon SageMaker Catalog, access it from Amazon SageMaker Lakehouse, select foundation models (FMs) from Amazon SageMaker JumpStart or build them through JupyterLab, train and fine-tune them with SageMaker AI training infrastructure, and deploy and test models directly within the same environment. SageMaker AI is a fully managed service to build, train, and deploy ML models—including FMs—for different use cases by bringing together a broad set of tools to enable high-performance, low-cost ML. It’s available as a standalone service on the AWS Management Console, or through APIs. Model development capabilities from SageMaker AI are available within SageMaker Unified Studio.
In this post, we guide you through the stages of customizing large language models (LLMs) with SageMaker Unified Studio and SageMaker AI, covering the end-to-end process starting from data discovery to fine-tuning FMs with SageMaker AI distributed training, tracking metrics using MLflow, and then deploying models using SageMaker AI inference for real-time inference. We also discuss best practices to choose the right instance size and share some debugging best practices while working with JupyterLab notebooks in SageMaker Unified Studio.
Solution overview
The following diagram illustrates the solution architecture. There are three personas: admin, data engineer, and user, who can be a data scientist or an ML engineer.

AWS SageMaker Unified Studio ML workflow showing data processing, model training, and deployment stages

Setting up the solution consists of the following steps:

The admin sets up the SageMaker Unified Studio domain for the user and sets the access controls. The admin also publishes the data to SageMaker Catalog in SageMaker Lakehouse.
Data engineers can create and manage extract, transform, and load (ETL) pipelines directly within Unified Studio using Visual ETL. They can transform raw data sources into datasets ready for exploratory data analysis. The admin can then manage the publication of these assets to the SageMaker Catalog, making them discoverable and accessible to other team members or users such as data engineers in the organization.
Users or data engineers can log in to the Unified Studio web-based IDE using the login provided by the admin to create a project and create a managed MLflow server for tracking experiments. Users can discover available data assets in the SageMaker Catalog and request a subscription to an asset published by the data engineer. After the data engineer approves the subscription request, the user performs an exploratory data analysis of the content of the table with the query editor or with a JupyterLab notebook, then prepares the dataset by connecting with SageMaker Catalog through an AWS Glue or Athena connection.
You can explore models from SageMaker JumpStart, which hosts over 200 models for various tasks, and fine-tune them directly in the UI, or develop a training script for fine-tuning the LLM in the JupyterLab IDE. SageMaker AI provides distributed training libraries and supports various distributed training options for deep learning tasks. For this post, we use the PyTorch framework and Hugging Face open source FMs for fine-tuning. We show how you can use parameter-efficient fine-tuning (PEFT) with Low-Rank Adaptation (LoRA), where you freeze the base model weights, train low-rank adapter matrices, and then merge these LoRA adapters back into the base model after distributed training.
You can track and monitor fine-tuning metrics directly in SageMaker Unified Studio using MLflow, by analyzing metrics such as loss to make sure the model is correctly fine-tuned.
You can deploy the model to a SageMaker AI endpoint after the fine-tuning job is complete and test it directly from SageMaker Unified Studio.

Prerequisites
Before starting this tutorial, make sure you have the following:

An AWS account with permissions to create SageMaker resources. For setup instructions, see Set up an AWS account and create an administrator user.
Familiarity with Python and PyTorch for distributed training and model customization.

Set up SageMaker Unified Studio and configure user access
SageMaker Unified Studio is built on top of Amazon DataZone capabilities such as domains to organize your assets and users, and projects to collaborate with other users, securely share artifacts, and seamlessly work across compute services.
To set up Unified Studio, complete the following steps:

As an admin, create a SageMaker Unified Studio domain, and note the URL.
On the domain’s details page, on the User management tab, choose Configure SSO user access. For this post, we recommend setting up single sign-on (SSO) access using the URL.

For more information about setting up user access, see Managing users in Amazon SageMaker Unified Studio.
Log in to SageMaker Unified Studio
Now that you have created your new SageMaker Unified Studio domain, complete the following steps to access SageMaker Unified Studio:

On the SageMaker console, open the details page of your domain.
Choose the link for the SageMaker Unified Studio URL.
Log in with your SSO credentials.

Now you’re signed in to SageMaker Unified Studio.
Create a project
The next step is to create a project. Complete the following steps:

In SageMaker Unified Studio, choose Select a project on the top menu, and choose Create project.
For Project name, enter a name (for example, demo).
For Project profile, choose your profile capabilities. A project profile is a collection of blueprints, which are configurations used to create projects. For this post, we choose All capabilities, then choose Continue.

Creating a project in Amazon SageMaker Unified Studio

Create a compute space
SageMaker Unified Studio provides compute spaces for IDEs that you can use to code and develop your resources. By default, it creates a space for you to get started with your project. You can find the default space by choosing Compute in the navigation pane and choosing the Spaces tab. You can then choose Open to go to the JupyterLab environment and add members to this space. You can also create a new space by choosing Create space on the Spaces tab.

To use SageMaker Studio notebooks cost-effectively, use smaller, general-purpose instances (like the T or M families) for interactive data exploration and prototyping. For heavy lifting like training, large-scale processing, or deployment, use SageMaker AI training jobs and SageMaker AI inference to offload the work to separate, more powerful instances such as the P5 family. We show in the notebook how you can run training jobs and deploy LLMs from the notebook with APIs. Running distributed workloads (for both data processing and ML training) directly in notebook instances is not recommended, because the likelihood of kernel failures is high.
The following screenshot shows the configuration options for your space. You can change the instance size for the JupyterLab IDE from the default (ml.t3.medium) to a larger instance such as ml.m5.xlarge. You can also increase the Amazon Elastic Block Store (Amazon EBS) volume capacity from 16 GB to 50 GB for training LLMs.

Configure space in Amazon SageMaker Unified Studio

Set up MLflow to track ML experiments
You can use MLflow in SageMaker Unified Studio to create, manage, analyze, and compare ML experiments. Complete the following steps to set up MLflow:

In SageMaker Unified Studio, choose Compute in the navigation pane.
On the MLflow Tracking Servers tab, choose Create MLflow Tracking Server.
Provide a name and create your tracking server.
Choose Copy ARN to copy the Amazon Resource Name (ARN) of the tracking server.

You will need this MLflow ARN in your notebook to set up distributed training experiment tracking.
Set up the data catalog
For model fine-tuning, you need access to a dataset. After you set up the environment, the next step is to find the relevant data in the SageMaker Unified Studio data catalog and prepare it for model tuning. For this post, we use the Stanford Question Answering Dataset (SQuAD). This is a reading comprehension dataset consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Download the SQuAD dataset and upload it to SageMaker Lakehouse by following the steps in Uploading data.
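
If you prefer to fetch the dataset programmatically before uploading it, one possible approach (assuming the Hugging Face datasets library is installed) is the following:

from datasets import load_dataset

# Download SQuAD from the Hugging Face Hub and flatten it to CSV files that
# can then be uploaded to SageMaker Lakehouse through the Unified Studio UI.
squad = load_dataset("squad")
squad["train"].to_csv("squad_train.csv")
squad["validation"].to_csv("squad_validation.csv")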

Adding data to Catalog in Amazon SageMaker Unified Studio

To make this data discoverable by the users or ML engineers, the admin needs to publish this data to the Data Catalog. For this post, you can directly download the SQuAD dataset and upload it to the catalog. To learn how to publish the dataset to SageMaker Catalog, see Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory.
Query data with the query editor and JupyterLab
In many organizations, data preparation is a collaborative effort. A data engineer might prepare an initial raw dataset, which a data scientist then refines and augments with feature engineering before using it for model training. In the SageMaker Lakehouse data and model catalog, publishers configure subscriptions for automatic or manual approval (the latter waits for admin approval). Because you already set up the data in the previous section, you can skip the following step showing how to subscribe to a dataset.
To subscribe to another dataset like SQuAD, open the data and model catalog in Amazon SageMaker Lakehouse, choose SQuAD, and subscribe.

Subscribing to any asset or dataset published by Admin

Next, let’s use the data explorer to explore the dataset you subscribed to. Complete the following steps:

On the project page, choose Data.
Under Lakehouse, expand AwsDataCatalog.
Expand your database starting from glue_db_.
Choose the dataset you created (starting with squad) and choose Query with Athena.

Querying the data using the query editor in Amazon SageMaker Unified Studio

Process your data through a multi-compute JupyterLab IDE notebook
SageMaker Unified Studio provides a unified JupyterLab experience across different languages, including SQL, PySpark, Python, and Scala Spark. It also supports unified access across different compute runtimes such as Amazon Redshift and Athena for SQL, Amazon EMR Serverless, Amazon EMR on EC2, and AWS Glue for Spark.
Complete the following steps to get started with the unified JupyterLab experience:

Open your SageMaker Unified Studio project page.
On the top menu, choose Build, and under IDE & APPLICATIONS, choose JupyterLab.
Wait for the space to be ready.
Choose the plus sign and for Notebook, choose Python 3.
Open a new terminal and enter git clone https://github.com/aws-samples/amazon-sagemaker-generativeai.
Go to the folder amazon-sagemaker-generativeai/3_distributed_training/distributed_training_sm_unified_studio/ and open the distributed training in unified studio.ipynb notebook to get started.
Enter the MLflow server ARN you created in the following code:

import os
os.environ["mlflow_uri"] = ""
os.environ["mlflow_experiment_name"] = "deepseek-r1-distill-llama-8b-sft"

Now you can visualize the data through the notebook.

On the project page, choose Data.
Under Lakehouse, expand AwsDataCatalog.
Expand your database starting from glue_db, copy the name of the database, and enter it in the following code:

db_name = "<enter your db name>"
table = "sqad"

You can now access the entire dataset directly by using the in-line SQL query capabilities of JupyterLab notebooks in SageMaker Unified Studio. You can follow the data preprocessing steps in the notebook.

%%sql project.athena
SELECT * FROM "<DATABASE_NAME>"."sqad";

The following screenshot shows the output.

We are going to split the dataset into a test set and a training set for model training. When the data processing is done and we have split the data, the next step is to fine-tune the model using SageMaker distributed training.
Fine-tune the model with SageMaker Distributed training
You’re now ready to fine-tune your model using SageMaker AI capabilities for training. Amazon SageMaker Training is a fully managed ML service offered by SageMaker that helps you efficiently train a wide range of ML models at scale. The core of SageMaker AI jobs is the containerization of ML workloads and the capability of managing AWS compute resources. SageMaker Training takes care of the heavy lifting associated with setting up and managing infrastructure for ML training workloads.
We select one model directly from the Hugging Face Hub, DeepSeek-R1-Distill-Llama-8B, and develop our training script in the JupyterLab space. Because we want to distribute the training across all the available GPUs in our instance using PyTorch Fully Sharded Data Parallel (FSDP), we use the Hugging Face Accelerate library to run the same PyTorch code across distributed configurations. You can start the fine-tuning job directly in your JupyterLab notebook or use the SageMaker Python SDK to start the training job. We use the Trainer from transformers to fine-tune our model. We prepared the script train.py, which loads the dataset from disk, prepares the model and tokenizer, and starts the training.
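
The full train.py is available in the repository; the following stripped-down sketch only indicates the general shape of such a script (tokenization, MLflow logging, FSDP, and TrlParser wiring are omitted), so treat the names and defaults as assumptions rather than the exact implementation.

import torch
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

# Stripped-down sketch of a train.py like the one described above; it assumes
# the train dataset on disk is already tokenized (input_ids, attention_mask, labels).
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Freeze the base weights and attach low-rank LoRA adapters.
peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)

train_dataset = load_from_disk("/opt/ml/input/data/train/")  # SageMaker train channel

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="/opt/ml/model",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        gradient_checkpointing=True,
    ),
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("/opt/ml/model")
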
For configuration, we use TrlParser, and provide hyperparameters in a YAML file. You can upload this file and provide it to SageMaker similar to your datasets. The following is the config file for fine-tuning the model on ml.g5.12xlarge. Save the config file as args.yaml and upload it to Amazon Simple Storage Service (Amazon S3).

cat > ./args.yaml <<EOF
model_id: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"       # Hugging Face model id
mlflow_uri: "${mlflow_uri}"
mlflow_experiment_name: "${mlflow_experiment_name}"
# sagemaker specific parameters
output_dir: "/opt/ml/model"                       # path to where SageMaker will upload the model
train_dataset_path: "/opt/ml/input/data/train/"   # path where SageMaker mounts the train dataset channel
test_dataset_path: "/opt/ml/input/data/test/"     # path where SageMaker mounts the test dataset channel
# training parameters
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
learning_rate: 2e-4                    # learning rate
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 2         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
gradient_checkpointing: true           # use gradient checkpointing
bf16: true                             # use bfloat16 precision
tf32: false                            # use tf32 precision
fsdp: "full_shard auto_wrap offload"
fsdp_config:
    backward_prefetch: "backward_pre"
    cpu_ram_efficient_loading: true
    offload_params: true
    forward_prefetch: false
    use_orig_params: true
merge_weights: true                    # merge weights in the base model
EOF

Use the following code to use the native PyTorch container image, pre-built for SageMaker:

image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.6.0",
    instance_type=instance_type,
    image_scope="training"
)

image_uri

Define the trainer as follows:

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    distributed=Torchrun(),
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=7200
    ),
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml"  # path to the TRL config uploaded to S3
    },
    output_data_config=OutputDataConfig(
        s3_output_path=output_path
    ),
)

Run the trainer with the following:

# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)

You can follow the steps in the notebook.
You can explore the job execution in SageMaker Unified Studio. The training job runs on the SageMaker training cluster by distributing the computation across the four available GPUs on the selected instance type ml.g5.12xlarge. We choose to merge the LoRA adapter with the base model. This decision was made during the training process by setting the merge_weights parameter to True in our train_fn() function. Merging the weights provides a single, cohesive model that incorporates both the base knowledge and the domain-specific adaptations we’ve made through fine-tuning.
Track training metrics and model registration using MLflow
You created an MLflow server in an earlier step to track experiments and registered models, and provided the server ARN in the notebook.
You can log MLflow models and automatically register them with Amazon SageMaker Model Registry using either the Python SDK or directly through the MLflow UI. Use mlflow.register_model() to automatically register a model with SageMaker Model Registry during model training. You can explore the MLflow tracking code in train.py and the notebook. The training code tracks MLflow experiments and registers the model to the MLflow model registry. To learn more, see Automatically register SageMaker AI models with SageMaker Model Registry.
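For reference, the core MLflow calls look roughly like the following sketch; the run ID placeholder, parameter values, and metric are illustrative, and the actual tracking logic lives in train.py.

import os
import mlflow

# Point MLflow at the tracking server created earlier (ARN stored in the environment).
mlflow.set_tracking_uri(os.environ["mlflow_uri"])
mlflow.set_experiment(os.environ["mlflow_experiment_name"])

with mlflow.start_run():
    mlflow.log_params({"lora_r": 8, "learning_rate": 2e-4})
    mlflow.log_metric("train_loss", 0.42, step=100)  # illustrative value
    # Registering the model makes it appear in the MLflow model registry and,
    # with the SageMaker integration enabled, in SageMaker Model Registry.
    mlflow.register_model("runs:/<run_id>/model", "deepseek-r1-distill-llama-8b-sft")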
To see the logs, complete the following steps:

Choose Build, then choose Spaces.
Choose Compute in the navigation pane.
On the MLflow Tracking Servers tab, choose Open to open the tracking server.

You can see both the experiments and registered models.

Deploy and test the model using SageMaker AI Inference
When deploying a fine-tuned model on AWS, SageMaker AI Inference offers multiple deployment strategies. In this post, we use SageMaker real-time inference. A real-time inference endpoint gives you full control over the inference resources: you can choose from the available instances and deployment options for hosting your model. By using the SageMaker built-in container DJL Serving, you can take advantage of the inference script and optimization options available directly in the container. In this post, we deploy the fine-tuned model to a SageMaker endpoint for running inference, which will be used for testing the model.
In SageMaker Unified Studio, in JupyterLab, we create the Model object, which is a high-level SageMaker model class for working with multiple container options. The image_uri parameter specifies the container image URI for the model, and model_data points to the Amazon S3 location containing the model artifact (automatically uploaded by the SageMaker training job). We also specify a set of environment variables to configure the specific inference backend option (OPTION_ROLLING_BATCH), the degree of tensor parallelism based on the number of available GPUs (OPTION_TENSOR_PARALLEL_DEGREE), and the maximum allowable length of input sequences (in tokens) for models during inference (OPTION_MAX_MODEL_LEN).

model = Model(
    image_uri=image_uri,
    model_data=f"s3://{bucket_name}/{job_prefix}/{job_name}/output/model.tar.gz",
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model",
        'OPTION_TRUST_REMOTE_CODE': 'true',
        'OPTION_ROLLING_BATCH': "vllm",
        'OPTION_DTYPE': 'bf16',
        'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
        'OPTION_MAX_ROLLING_BATCH_SIZE': '1',
        'OPTION_MODEL_LOADING_TIMEOUT': '3600',
        'OPTION_MAX_MODEL_LEN': '4096'
    }
)

After you create the model object, you can deploy it to an endpoint using the deploy method. The initial_instance_count and instance_type parameters specify the number and type of instances to use for the endpoint. We selected the ml.g5.4xlarge instance for the endpoint. The container_startup_health_check_timeout and model_data_download_timeout parameters set the timeout values for the container startup health check and model data download, respectively.

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
endpoint_name = f"{model_id.split('/')[-1].replace('.', '-')}-sft-djl"
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=1800,
    model_data_download_timeout=3600
)

It takes a few minutes to deploy the model before it becomes available for inference and evaluation. You can test the endpoint invocation in JupyterLab by using the AWS SDK with the boto3 client for sagemaker-runtime, or by using the SageMaker Python SDK and the previously created predictor’s predict API, as shown in the following example (an equivalent boto3 invocation is sketched after it).

base_prompt = f"""<s> [INST] {{question}} [/INST] """

prompt = base_prompt.format(
    question="What statue is in front of the Notre Dame building?"
)

predictor.predict({
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 300,
        "temperature": 0.2,
        "top_p": 0.9,
        "return_full_text": False,
        "stop": ['</s>']
    }
})
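
Alternatively, the same endpoint can be invoked with the low-level boto3 sagemaker-runtime client, reusing the endpoint_name and prompt defined above; this sketch is useful outside the notebook where the predictor object is not available.

import json
import boto3

# Equivalent invocation through the low-level runtime client.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": 300, "temperature": 0.2, "top_p": 0.9},
    }),
)
print(json.loads(response["Body"].read()))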

You can also test the model invocation in SageMaker Unified Studio, on the Inference endpoint page and Text inference tab.
Troubleshooting
You might encounter some of the following errors while running your model training and deployment:

Training job fails to start – If a training job fails to start, make sure your IAM role AmazonSageMakerDomainExecution has the necessary permissions, verify the instance type is available in your AWS Region, and check your S3 bucket permissions. This role is created when an admin creates the domain, and you can ask the admin to check your IAM access permissions associated with this role.
Out-of-memory errors during training – If you encounter out-of-memory errors during training, try reducing the batch size, use gradient accumulation to simulate larger batches, or consider using a larger instance.
Slow model deployment – For slow model deployment, make sure model artifacts aren’t excessively large, use appropriate instance types for inference, and confirm that capacity is available for that instance type in your Region.

For more troubleshooting tips, refer to Troubleshooting guide.
Clean up
SageMaker Unified Studio by default shuts down idle resources such as JupyterLab spaces after 1 hour. However, you must delete the S3 bucket and the hosted model endpoint to stop incurring costs. You can delete the real-time endpoints you created using the SageMaker console or programmatically, as sketched below. For instructions, see Delete Endpoints and Resources.
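
If you prefer to clean up from the notebook instead of the console, a call along the following lines removes the model and endpoint created earlier (assuming the predictor object is still in scope):

# Delete the hosted model and endpoint to stop incurring inference costs.
predictor.delete_model()
predictor.delete_endpoint()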
Conclusion
This post demonstrated how SageMaker Unified Studio serves as a powerful centralized service for data and AI workflows, showcasing its seamless integration capabilities throughout the fine-tuning process. With SageMaker Unified Studio, data engineers and ML practitioners can efficiently discover and access data through SageMaker Catalog, prepare datasets, fine-tune models, and deploy them—all within a single, unified environment. The service’s direct integration with SageMaker AI and various AWS analytics services streamlines the development process, alleviating the need to switch between multiple tools and environments. The solution highlights the service’s versatility in handling complex ML workflows, from data discovery and preparation to model deployment, while maintaining a cohesive and intuitive user experience. Through features like integrated MLflow tracking, built-in model monitoring, and flexible deployment options, SageMaker Unified Studio demonstrates its capability to support sophisticated AI/ML projects at scale.
To learn more about SageMaker Unified Studio, see An integrated experience for all your data and AI with Amazon SageMaker Unified Studio.
If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!

About the authors
Mona Mona currently works as a Sr World Wide Gen AI Specialist Solutions Architect at Amazon focusing on Gen AI Solutions. She was a Lead Generative AI specialist in Google Public Sector at Google before joining Amazon. She is a published author of two books – Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 19 blogs on AI/ML and cloud technology and a co-author on a research paper on CORD19 Neural Search which won an award for Best Research Paper at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.
Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them to deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.
Lauren Mullennex is a Senior GenAI/ML Specialist Solutions Architect at AWS. She has a decade of experience in DevOps, infrastructure, and ML. Her areas of focus include MLOps/LLMOps, generative AI, and computer vision.

Together AI Releases DeepSWE: A Fully Open-Source RL-Trained Coding Ag …

Together AI has released DeepSWE, a state-of-the-art, fully open-sourced software engineering agent that is trained entirely through reinforcement learning (RL). Built on top of the Qwen3-32B language model, DeepSWE achieves 59% accuracy on the SWEBench-Verified benchmark and 42.2% Pass@1, topping the leaderboard among open-weight models. This launch represents a significant shift for Together AI, from traditional pretraining pipelines toward creating autonomous language agents that continuously learn and improve via real-world feedback.

Reinforcement Learning Meets Code Generation

DeepSWE is the result of post-training the Qwen3-32B foundation model using rLLM, Agentica’s modular reinforcement learning framework tailored for language agents. Unlike conventional supervised fine-tuning approaches, rLLM enables agents to adapt to real-world workflows through experience. DeepSWE has been specifically trained to solve complex software engineering tasks using a feedback-driven loop rather than static datasets.

The training pipeline incorporates Agentica’s R2EGym dataset—a software engineering benchmark designed for RL-style agent development. The framework focuses on training language models with action-oriented objectives, such as fixing bugs, completing functions, and editing code, rather than merely predicting next-token distributions. This aligns DeepSWE more closely with how human engineers iterate and learn from outcomes.

Performance Benchmarks and Capabilities

On SWEBench-Verified, the most rigorous benchmark for software engineering agents, DeepSWE scores 59% with test-time scaling. This significantly outperforms previous open-weight models. In Pass@1 evaluations—which measure the probability that the agent solves a problem correctly on the first attempt—DeepSWE reaches an impressive 42.2%.

These results underscore the power of RL-based training in enhancing agentic behavior, particularly in domains requiring iterative reasoning and precise outputs, such as code synthesis. The model’s architecture, inherited from Qwen3-32B, enables it to scale effectively while remaining suitable for real-world applications.

Open Source and Reproducibility at Its Core

One of the standout features of this release is its full transparency. Together AI and Agentica have open-sourced not only the DeepSWE model but also the entire training recipe, including the rLLM framework, the R2EGym dataset, and training configuration scripts. This promotes reproducibility and invites the broader research and developer communities to extend or build upon DeepSWE without restrictions.

Developers can access DeepSWE and rLLM via the following:

Model Weights: Hugging Face – DeepSWE

Training Framework: rLLM GitHub Repository

Training Documentation: DeepSWE Training Overview

From Language Reasoners to Language Agents

DeepSWE marks a philosophical and practical shift: from building models that reason about language to building agents that learn through interaction. Traditional LLMs have shown strong reasoning capabilities, but often lack the ability to adapt to feedback or improve with use. Reinforcement learning enables these models to not only perform well at launch but to get better over time, adapting to new problem distributions and domains.

This approach also opens the door for local deployment. Because DeepSWE is fully open-source and modular, it can be extended and retrained for organization-specific use cases. Developers and researchers can build their own agents on top of DeepSWE using rLLM to serve diverse domains such as web navigation, robotics, or autonomous research assistance.
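
Because the weights are published on Hugging Face, local experimentation can start with a standard transformers loading pattern such as the sketch below; the repository ID shown is a placeholder, so use the actual ID from the model weights link above, and note that a 32B model requires substantial GPU memory.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository ID; confirm against the Hugging Face link above.
repo_id = "agentica-org/DeepSWE-Preview"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto", torch_dtype="auto")

prompt = "Fix the off-by-one error in the following function:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))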

Conclusion

DeepSWE is a milestone in the evolution of generative AI for software engineering. By applying reinforcement learning to large language models like Qwen3-32B and releasing the entire training infrastructure, Together AI is enabling a future where agents are not just pretrained and deployed, but continually trained and improved. This leap from language understanding to action-oriented agency has significant implications across programming, automation, and intelligent system design.

All credit for this research goes to the researchers of this project.
The post Together AI Releases DeepSWE: A Fully Open-Source RL-Trained Coding Agent Based on Qwen3-32B and Achieves 59% on SWEBench appeared first on MarkTechPost.

Shanghai Jiao Tong Researchers Propose OctoThinker for Reinforcement L …

Introduction: Reinforcement Learning Progress through Chain-of-Thought Prompting

LLMs have shown excellent progress in complex reasoning tasks through CoT prompting combined with large-scale reinforcement learning (RL). Models like Deepseek-R1-Zero have shown strong reasoning capabilities by applying RL directly to base models. Similarly, methods such as SimpleRL and Open-ReasonerZero show improvements in smaller models like the Qwen series. However, achieving success across different base model families remains a challenge. Moreover, applying R1-Zero-style training to base models such as the Llama series faces difficulty, posing a fundamental question about the underlying factors that lead different base models to behave inconsistently during reinforcement learning.

Limitations of RL Scaling on Llama Models

Large-scale RL has driven advances in models like OpenAI’s o1 and o3 and DeepSeek’s R1 on competition-level mathematics problems, motivating the exploration of RL on smaller models with fewer than 100B parameters. However, these successes are largely limited to the Qwen model family, and replicating the results on families such as Llama is difficult. The lack of transparency in pre-training pipelines has made it difficult to understand how pre-training influences RL scaling. This has prompted unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Efforts to curate high-quality mathematical pre-training corpora through projects like OpenWebMath, MathPile, InfiMM-Web-Math, and FineMath have made progress but remain limited in scale, staying under 100B tokens.

Exploring Mid-Training with Stable-then-Decay Strategy

Researchers from Shanghai Jiao Tong University investigate how mid-training strategies shape RL dynamics, focusing on Qwen and Llama. The study presents several insights: First, high-quality mathematical corpora such as MegaMath-Web-Pro boost both base model and RL outcomes. Second, using QA-style data, especially those with long CoT reasoning, further enhances RL results. Third, long CoT introduces verbosity and instability in RL training. Lastly, applying scaling during mid-training results in stronger downstream RL performance. Researchers introduce a two-stage mid-training strategy called Stable-then-Decay, where base models are first trained on 200B tokens, followed by 20B tokens across three CoT-focused branches, resulting in OctoThinker models that show strong RL compatibility.

RL Configuration and Benchmark Evaluation

Researchers use the MATH8K dataset for RL training prompts. The configuration includes a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, with experiments conducted on Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, few-shot prompting is used for base language models, and zero-shot for RL-tuned models across indicator tasks, including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models exhibit increasing response lengths that remain reasonable throughout, whereas Llama displays abnormal behavior, with average response lengths escalating to 4,096 tokens. Evaluation further reveals that RL-tuned Qwen2.5-3B achieves improvements across benchmarks, while Llama-3.2-3B shows only marginal gains.

OctoThinker Outperforms Llama in RL Compatibility

Each OctoThinker branch demonstrates a 10-20% improvement over the original Llama base model and consistent gains over the stable-stage model across all sizes when evaluated on 13 mathematical benchmarks. The OctoThinker-Zero families reveal diverse thinking behaviors during RL scaling, with strong performance from the OctoThinker-Long variant. When comparing three 3B-scale base models during RL training, OctoThinker-Long-3B outperforms the original Llama-3.2-3B model and reaches performance parity with Qwen2.5-3B, a model known for strong reasoning capabilities and extensive pre-training. The hybrid and short branches show slightly lower performance, especially on challenging benchmarks.

Conclusion and Future Work: Toward RL-Ready Foundation Models

This paper investigates why base models such as Llama and Qwen exhibit divergent behaviors during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited for RL, resulting in OctoThinker models. Future research directions include:

Curating higher-quality mathematical corpora to improve mid-training.

Creating RL-friendly base models using open recipes without distillation from long CoT reasoning models.

Separating the QA format and content to understand their contributions individually.

Expanding the OctoThinker family with new branches, such as tool-integrated reasoning.

Check out the Paper, Hugging Face Page and GitHub Page. All credit for this research goes to the researchers of this project.
The post Shanghai Jiao Tong Researchers Propose OctoThinker for Reinforcement Learning-Scalable LLM Development appeared first on MarkTechPost.

ReasonFlux-PRM: A Trajectory-Aware Reward Model Enhancing Chain-of-Tho …

Understanding the Role of Chain-of-Thought in LLMs

Large language models are increasingly being used to solve complex tasks such as mathematics and scientific reasoning through structured chain-of-thought approaches. These models do not just jump to answers—they reason through intermediate steps that simulate logical thought processes. This technique allows for improved reasoning accuracy and clearer error tracing. As models become more sophisticated, it has become essential to evaluate not just final responses but also the reasoning steps that lead to them.

Limitations of Traditional PRMs in Reasoning Evaluation

One pressing issue is that most current reward models only assess final answers, ignoring how those conclusions were reached. However, frontier models like Deepseek-R1 now output extensive reasoning paths before delivering final responses. These trajectory-response pairs are being reused to train smaller models. The problem is that current Process Reward Models (PRMs) are not built to evaluate these full trajectories. This mismatch leads to unreliable supervision, which can degrade the performance of smaller models trained on trajectory-response data.

Challenges in Handling Disorganized Reasoning Chains

Traditional PRMs are primarily calibrated for structured, clean outputs rather than the lengthy and sometimes disorganized reasoning chains generated by advanced LLMs. Even advanced PRMs, such as Qwen2.5-Math-PRM-72B, show a limited ability to distinguish between high- and low-quality intermediate reasoning. When applied to trajectory-response outputs from Gemini or Deepseek-R1, these models often produce overlapping reward scores, indicating weak discrimination. Their limited sensitivity leads to poor data selection for downstream fine-tuning, and experiments confirm that models trained on PRM-selected data perform worse than those trained on human-curated datasets.

Introducing ReasonFlux-PRM for Trajectory-Level Supervision

Researchers from the University of Illinois Urbana-Champaign (UIUC), Princeton University, Cornell University, and ByteDance Seed introduced ReasonFlux-PRM, a trajectory-aware reward model that evaluates both intermediate reasoning steps and final answers. It integrates step-level and trajectory-level scoring, enabling a more nuanced understanding of reasoning quality. ReasonFlux-PRM is trained on a 10,000-sample dataset of carefully curated math and science problems explicitly designed to mirror real-world trajectory-response formats.

Technical Framework of ReasonFlux-PRM

Technically, ReasonFlux-PRM operates by scoring each intermediate step in a trajectory with respect to its contribution to the final answer. It uses a reference reward function that considers the prompt, prior reasoning steps, and final output to assign step-level scores. These are then aggregated to produce a total trajectory reward. The model supports multiple applications, including offline filtering of high-quality training data, dense reward provision during reinforcement learning using GRPO-based policy optimization, and Best-of-N test-time response selection to enhance inference quality. These capabilities make ReasonFlux-PRM more flexible and comprehensive than prior PRMs.

Empirical Results on Reasoning Benchmarks

In performance evaluations across tasks like AIME, MATH500, and GPQA-Diamond, ReasonFlux-PRM-7B outperformed Qwen2.5-Math-PRM-72B and human-curated data in several key metrics. Specifically, it achieved a 12.1% accuracy gain in supervised fine-tuning, a 4.5% improvement during reinforcement learning, and a 6.3% increase during test-time scaling. These gains are particularly considerable given that ReasonFlux-PRM is smaller in model size. Table 1 shows that the Qwen2.5-14B-Instruct model, when trained on data selected by ReasonFlux-PRM, achieved performance levels close to or exceeding human-curated baselines. In contrast, other PRMs resulted in significant drops of up to 26.6% in certain benchmarks.

Impact and Future Direction of ReasonFlux-PRM

This research addresses a crucial limitation in the training and evaluation of modern reasoning models. By enabling supervision over both thinking trajectories and final answers, ReasonFlux-PRM enhances the quality of training data and the reliability of model responses. It sets a new direction for systematically evaluating and improving reasoning processes in large models.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post ReasonFlux-PRM: A Trajectory-Aware Reward Model Enhancing Chain-of-Thought Reasoning in LLMs appeared first on MarkTechPost.

Optimize RAG in production environments using Amazon SageMaker JumpSta …

Generative AI has revolutionized customer interactions across industries by offering personalized, intuitive experiences powered by unprecedented access to information. This transformation is further enhanced by Retrieval Augmented Generation (RAG), a technique that allows large language models (LLMs) to reference external knowledge sources beyond their training data. RAG has gained popularity for its ability to improve generative AI applications by incorporating additional information, often preferred by customers over techniques like fine-tuning due to its cost-effectiveness and faster iteration cycles.
The RAG approach excels in grounding language generation with external knowledge, producing more factual, coherent, and relevant responses. This capability proves invaluable in applications such as question answering, dialogue systems, and content generation, where accuracy and informative outputs are crucial. For businesses, RAG offers a powerful way to use internal knowledge by connecting company documentation to a generative AI model. When an employee asks a question, the RAG system retrieves relevant information from the company’s internal documents and uses this context to generate an accurate, company-specific response. This approach enhances the understanding and usage of internal company documents and reports. By extracting relevant context from corporate knowledge bases, RAG models facilitate tasks like summarization, information extraction, and complex question answering on domain-specific materials, enabling employees to quickly access vital insights from vast internal resources. This integration of AI with proprietary information can significantly improve efficiency, decision-making, and knowledge sharing across the organization.
A typical RAG workflow consists of four key components: input prompt, document retrieval, contextual generation, and output. The process begins with a user query, which is used to search a comprehensive knowledge corpus. Relevant documents are then retrieved and combined with the original query to provide additional context for the LLM. This enriched input allows the model to generate more accurate and contextually appropriate responses. RAG’s popularity stems from its ability to use frequently updated external data, providing dynamic outputs without the need for costly and compute-intensive model retraining.
To implement RAG effectively, many organizations turn to platforms like Amazon SageMaker JumpStart. This service offers numerous advantages for building and deploying generative AI applications, including access to a wide range of pre-trained models with ready-to-use artifacts, a user-friendly interface, and seamless scalability within the AWS ecosystem. By using pre-trained models and optimized hardware, SageMaker JumpStart enables rapid deployment of both LLMs and embedding models, minimizing the time spent on complex scalability configurations.
In the previous post, we showed how to build a RAG application on SageMaker JumpStart using Facebook AI Similarity Search (Faiss). In this post, we show how to use Amazon OpenSearch Service as a vector store to build an efficient RAG application.
Solution overview
To implement our RAG workflow on SageMaker, we use a popular open source Python library known as LangChain. With LangChain, the RAG components are simplified into independent blocks that you can bring together using a chain object that will encapsulate the entire workflow. The solution consists of the following key components:

LLM (inference) – We need an LLM that will do the actual inference and answer the end-user’s initial prompt. For our use case, we use Meta Llama3 for this component. LangChain comes with a default wrapper class for SageMaker endpoints with which we can simply pass in the endpoint name to define an LLM object in the library.
Embeddings model – We need an embeddings model to convert our document corpus into textual embeddings. This is necessary for when we’re doing a similarity search on the input text to see what documents share similarities or contain the information to help augment our response. For this post, we use the BGE Hugging Face Embeddings model available in SageMaker JumpStart.
Vector store and retriever – To house the different embeddings we have generated, we use a vector store. In this case, we use OpenSearch Service, which allows for similarity search using k-nearest neighbors (k-NN) as well as traditional lexical search. Within our chain object, we define the vector store as the retriever. You can tune this depending on how many documents you want to retrieve.
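
To make the relationship between these blocks concrete, the following is a minimal LangChain sketch under stated assumptions: the endpoint names, request/response payload keys in the content handlers, index name, and credentials are placeholders, and the notebook referenced later defines its own handlers for the specific Llama 3 and BGE endpoints.

import json
from typing import Dict, List

from langchain.chains import RetrievalQA
from langchain_community.embeddings.sagemaker_endpoint import (
    EmbeddingsContentHandler, SagemakerEndpointEmbeddings)
from langchain_community.llms.sagemaker_endpoint import (
    LLMContentHandler, SagemakerEndpoint)
from langchain_community.vectorstores import OpenSearchVectorSearch

REGION = "us-east-1"  # placeholder Region

class BGEContentHandler(EmbeddingsContentHandler):
    """Serializes embedding requests; the payload keys depend on the deployed model."""
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs: List[str], model_kwargs: Dict) -> bytes:
        return json.dumps({"text_inputs": inputs, **model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
        return json.loads(output.read().decode("utf-8"))["embedding"]

class Llama3ContentHandler(LLMContentHandler):
    """Serializes LLM requests; the response key depends on the deployed container."""
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        return json.loads(output.read().decode("utf-8"))["generated_text"]

embeddings = SagemakerEndpointEmbeddings(
    endpoint_name="bge-embeddings-endpoint",   # placeholder endpoint name
    region_name=REGION,
    content_handler=BGEContentHandler(),
)
llm = SagemakerEndpoint(
    endpoint_name="meta-llama3-endpoint",      # placeholder endpoint name
    region_name=REGION,
    model_kwargs={"max_new_tokens": 512, "temperature": 0.1},
    content_handler=Llama3ContentHandler(),
)
vectorstore = OpenSearchVectorSearch(
    opensearch_url="https://<domain-endpoint>",  # from the CloudFormation outputs
    index_name="rag-index",
    embedding_function=embeddings,
    http_auth=("admin", "<master-password>"),
)

# The chain object encapsulates the retrieve-then-generate workflow.
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever(search_kwargs={"k": 3}))
print(qa_chain.invoke({"query": "What does the internal runbook say about failovers?"}))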

The following diagram illustrates the solution architecture.

In the following sections, we walk through setting up OpenSearch, followed by exploring the notebook that implements a RAG solution with LangChain, Amazon SageMaker AI, and OpenSearch Service.
Benefits of using OpenSearch Service as a vector store for RAG
In this post, we showcase how you can use a vector store such as OpenSearch Service as a knowledge base and embedding store. OpenSearch Service offers several advantages when used for RAG in conjunction with SageMaker AI:

Performance – Efficiently handles large-scale data and search operations
Advanced search – Offers full-text search, relevance scoring, and semantic capabilities
AWS integration – Seamlessly integrates with SageMaker AI and other AWS services
Real-time updates – Supports continuous knowledge base updates with minimal delay
Customization – Allows fine-tuning of search relevance for optimal context retrieval
Reliability – Provides high availability and fault tolerance through a distributed architecture
Analytics – Provides analytical features for data understanding and performance improvement
Security – Offers robust features such as encryption, access control, and audit logging
Cost-effectiveness – Serves as an economical solution compared to proprietary vector databases
Flexibility – Supports various data types and search algorithms, offering versatile storage and retrieval options for RAG applications

You can use SageMaker AI with OpenSearch Service to create powerful and efficient RAG systems. SageMaker AI provides the machine learning (ML) infrastructure for training and deploying your language models, and OpenSearch Service serves as an efficient and scalable knowledge base for retrieval.
OpenSearch Service optimization strategies for RAG
Based on our learnings from the hundreds of RAG applications deployed using OpenSearch Service as a vector store, we’ve developed several best practices:

If you are starting from a clean slate and want to move quickly with something simple, scalable, and high-performing, we recommend using an Amazon OpenSearch Serverless vector store collection. With OpenSearch Serverless, you benefit from automatic scaling of resources, decoupling of storage, indexing compute, and search compute, with no node or shard management, and you only pay for what you use.
If you have a large-scale production workload and want to take the time to tune for the best price-performance and the most flexibility, you can use an OpenSearch Service managed cluster. In a managed cluster, you pick the node type, node size, number of nodes, and number of shards and replicas, and you have more control over when to scale your resources. For more details on best practices for operating an OpenSearch Service managed cluster, see Operational best practices for Amazon OpenSearch Service.
OpenSearch supports both exact k-NN and approximate k-NN. Use exact k-NN if the number of documents or vectors in your corpus is less than 50,000 for the best recall. For use cases where the number of vectors is greater than 50,000, exact k-NN will still provide the best recall but might not provide sub-100 millisecond query performance. Use approximate k-NN in use cases above 50,000 vectors for the best performance.
OpenSearch uses algorithms from the NMSLIB, Faiss, and Lucene libraries to power approximate k-NN search. There are pros and cons to each k-NN engine, but we find that most customers choose Faiss due to its overall performance in both indexing and search as well as the variety of different quantization and algorithm options that are supported and the broad community support.
Within the Faiss engine, OpenSearch supports both Hierarchical Navigable Small World (HNSW) and Inverted File System (IVF) algorithms. Most customers find HNSW to have better recall than IVF and choose it for their RAG use cases. To learn more about the differences between these engine algorithms, see Vector search.
To reduce the memory footprint and lower the cost of the vector store while keeping recall high, you can start with Faiss HNSW 16-bit scalar quantization (see the index mapping sketch after this list). This can also reduce search latencies and improve indexing throughput when used with SIMD optimization.
If using an OpenSearch Service managed cluster, refer to Performance tuning for additional recommendations.
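To make the last two recommendations concrete, the following is a minimal sketch of creating a k-NN index that uses the Faiss engine with HNSW and 16-bit scalar quantization; the domain endpoint, credentials, index name, field name, and dimension are illustrative assumptions rather than values from this post's notebook:

from opensearchpy import OpenSearch, RequestsHttpConnection

# Placeholder domain endpoint and credentials; production setups typically use SigV4 auth
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "YourStrongPasswordHere"),
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "vector_field": {
                "type": "knn_vector",
                "dimension": 1024,  # BGE large embeddings are 1,024-dimensional
                "method": {
                    "name": "hnsw",      # HNSW algorithm
                    "engine": "faiss",   # Faiss engine
                    "space_type": "l2",
                    "parameters": {
                        "m": 16,
                        "ef_construction": 128,
                        # 16-bit scalar quantization to roughly halve vector memory
                        "encoder": {"name": "sq", "parameters": {"type": "fp16"}},
                    },
                },
            }
        }
    },
}

client.indices.create(index="rag-documents", body=index_body)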

Prerequisites
Make sure you have access to one ml.g5.4xlarge instance and one ml.g5.2xlarge instance in your account. You also need a secret in the same AWS Region where you deploy the stack. Complete the following prerequisite steps to create the secret using AWS Secrets Manager (for a scripted alternative, see the sketch after these steps):

On the Secrets Manager console, choose Secrets in the navigation pane.
Choose Store a new secret.

For Secret type, select Other type of secret.
For Key/value pairs, choose the Plaintext tab and enter the password you want to store.
Choose Next.

For Secret name, enter a name for your secret.
Choose Next.

Under Configure rotation, keep the settings as default and choose Next.

Choose Store to save your secret.

On the secret details page, note the secret Amazon Resource Name (ARN) to use in the next step.
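If you prefer to script this step, the following is a minimal sketch of creating the secret with boto3; the Region, secret name, and password are placeholders:

import boto3

# Use the Region where you plan to deploy the CloudFormation stack
secrets = boto3.client("secretsmanager", region_name="us-east-1")

response = secrets.create_secret(
    Name="rag-opensearch-password",         # placeholder secret name
    SecretString="YourStrongPasswordHere",  # placeholder password
)
print(response["ARN"])  # note this ARN for the next step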

Create an OpenSearch Service cluster and SageMaker notebook
We use AWS CloudFormation to deploy our OpenSearch Service cluster, SageMaker notebook, and other resources. Complete the following steps:

Launch the following CloudFormation template.
Provide the ARN of the secret you created as a prerequisite and keep the other parameters as default.

Choose Create to create your stack, and wait about 20 minutes for stack creation to complete.
When the status of the stack is CREATE_COMPLETE, note the value of OpenSearchDomainEndpoint on the stack Outputs tab.
Locate SageMakerNotebookURL in the outputs and choose the link to open the SageMaker notebook.

Run the SageMaker notebook
After you have launched the notebook in JupyterLab, complete the following steps:

Go to genai-recipes/RAG-recipes/llama3-RAG-Opensearch-langchain-SMJS.ipynb.

You can also clone the notebook from the GitHub repo.

Update the value of OPENSEARCH_URL in the notebook with the value copied from OpenSearchDomainEndpoint in the previous step (look for os.environ['OPENSEARCH_URL'] = ""). The port needs to be 443.
Run the cells in the notebook.

The notebook provides a detailed explanation of all the steps. We explain some of the key cells in the notebook in this section.
For the RAG workflow, we deploy the huggingface-sentencesimilarity-bge-large-en-v1-5 embedding model and meta-textgeneration-llama-3-8b-instruct LLM from Hugging Face. SageMaker JumpStart simplifies this process because the model artifacts, data, and container specifications are all prepackaged for optimal inference. These are then exposed using the SageMaker Python SDK high-level API calls, which let you specify the model ID for deployment to a SageMaker real-time endpoint:

from sagemaker.jumpstart.model import JumpStartModel

# Deploy the Meta Llama 3 8B Instruct LLM from SageMaker JumpStart (EULA acceptance required)
model_id = "meta-textgeneration-llama-3-8b-instruct"
accept_eula = True
model = JumpStartModel(model_id=model_id)
llm_predictor = model.deploy(accept_eula=accept_eula)

# Deploy the BGE large English embedding model
model_id = "huggingface-sentencesimilarity-bge-large-en-v1-5"
text_embedding_model = JumpStartModel(model_id=model_id)
embedding_predictor = text_embedding_model.deploy()

Content handlers are crucial for formatting data for SageMaker endpoints. They transform inputs into the format expected by the model and handle model-specific parameters like temperature and token limits. These parameters can be tuned to control the creativity and consistency of the model’s responses.

import json

from langchain_community.llms.sagemaker_endpoint import LLMContentHandler

class Llama38BContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Build the JSON request body expected by the Llama 3 endpoint
        payload = {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 1000,
                "top_p": 0.9,
                "temperature": 0.6,
                "stop": ["<|eot_id|>"],
            },
        }
        input_str = json.dumps(payload)
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Parse the endpoint response; the JumpStart Llama 3 endpoint returns a JSON
        # body with a "generated_text" field (adjust if your model's schema differs)
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_text"]

We use PyPDFLoader from LangChain to load PDF files, attach metadata to each document fragment, and then use RecursiveCharacterTextSplitter to break the documents into smaller, manageable chunks. The text splitter is configured with a chunk size of 1,000 characters and an overlap of 100 characters, which helps maintain context between chunks. This preprocessing step is crucial for effective document retrieval and embedding generation, because it makes sure the text segments are appropriately sized for the embedding model and the language model used in the RAG system.

import numpy as np
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
documents = []
for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]
    documents += document
# In our testing, character-based splitting works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # 1,000-character chunks with 100-character overlap to preserve context
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)
print(docs[100])

The following block initializes a vector store using OpenSearch Service for the RAG system. It converts preprocessed document chunks into vector embeddings using a SageMaker model and stores them in OpenSearch Service. The process is configured with security measures like SSL and authentication to provide secure data handling. The bulk insertion is optimized for performance with a sizeable batch size. Finally, the vector store is wrapped with VectorStoreIndexWrapper, providing a simplified interface for operations like querying and retrieval. This setup creates a searchable database of document embeddings, enabling quick and relevant context retrieval for user queries in the RAG pipeline.

import os

from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain_community.vectorstores import OpenSearchVectorSearch
from opensearchpy import RequestsHttpConnection

# Initialize OpenSearchVectorSearch; sagemaker_embeddings and awsauth are defined
# earlier in the notebook
vectorstore_opensearch = OpenSearchVectorSearch.from_documents(
    docs,
    sagemaker_embeddings,
    opensearch_url=os.environ["OPENSEARCH_URL"],  # domain endpoint set earlier in the notebook
    http_auth=awsauth,  # auth uses the IAM role
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    bulk_size=2000,  # increase this to accommodate the number of documents you have
)
# Wrap the OpenSearch vector store with the VectorStoreIndexWrapper
wrapper_store_opensearch = VectorStoreIndexWrapper(vectorstore=vectorstore_opensearch)

Next, we use the wrapper from the previous step along with the prompt template. We define the prompt template for interacting with the Meta Llama 3 8B Instruct model in the RAG system. The template uses specific tokens to structure the input in a way that the model expects. It sets up a conversation format with system instructions, user query, and a placeholder for the assistant’s response. The PromptTemplate class from LangChain is used to create a reusable prompt with a variable for the user’s query. This structured approach to prompt engineering helps maintain consistency in the model’s responses and guides it to act as a helpful assistant.

from langchain.prompts import PromptTemplate

# Llama 3 Instruct chat template with system instructions and a {query} placeholder
prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{query}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["query"]
)
query = "How did AWS perform in 2021?"

answer = wrapper_store_opensearch.query(question=PROMPT.format(query=query), llm=llm)
print(answer)

Similarly, the notebook also shows how to use RetrievalQA, where you can customize how the retrieved documents are added to the prompt using the chain_type parameter, as in the following sketch.
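The following is a minimal sketch of such a RetrievalQA chain, reusing the llm and vector store defined earlier; the chain_type value, the number of retrieved documents, and the sample query are illustrative choices rather than the notebook's exact cell:

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # insert the retrieved documents directly into the prompt
    retriever=vectorstore_opensearch.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "How did AWS perform in 2021?"})
print(result["result"])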
Clean up
Delete your SageMaker endpoints from the notebook to avoid incurring costs:

# Delete resources
llm_predictor.delete_model()
llm_predictor.delete_endpoint()
embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()

Next, delete your OpenSearch Service cluster by deleting the CloudFormation stack to stop incurring additional charges:

aws cloudformation delete-stack --stack-name rag-opensearch
Conclusion
RAG has revolutionized how businesses use AI by enabling general-purpose language models to work seamlessly with company-specific data. The key benefit is the ability to create AI systems that combine broad knowledge with up-to-date, proprietary information without expensive model retraining. This approach transforms customer engagement and internal operations by delivering personalized, accurate, and timely responses based on the latest company data.
The RAG workflow, comprising input prompt, document retrieval, contextual generation, and output, allows businesses to tap into their vast repositories of internal documents, policies, and data, making this information readily accessible and actionable. For businesses, this means enhanced decision-making, improved customer service, and increased operational efficiency. Employees can quickly access relevant information, while customers receive more accurate and personalized responses.
Moreover, RAG's cost-efficiency and ability to rapidly iterate make it an attractive solution for businesses looking to stay competitive in the AI era without constant, expensive updates to their AI systems. By making general-purpose LLMs work effectively on proprietary data, RAG empowers businesses to create dynamic, knowledge-rich AI applications that evolve with their data, potentially transforming how companies operate, innovate, and engage with both employees and customers.
SageMaker JumpStart has streamlined the process of developing and deploying generative AI applications. It offers pre-trained models, user-friendly interfaces, and seamless scalability within the AWS ecosystem, making it straightforward for businesses to harness the power of RAG.
Furthermore, using OpenSearch Service as a vector store facilitates swift retrieval from vast information repositories. This approach not only enhances the speed and relevance of responses, but also helps manage costs and operational complexity effectively.
By combining these technologies, you can create robust, scalable, and efficient RAG systems that provide up-to-date, context-aware responses to customer queries, ultimately enhancing user experience and satisfaction.
To get started with implementing this RAG solution using Amazon SageMaker JumpStart and Amazon OpenSearch Service, check out the example notebook on GitHub. You can also learn more about Amazon OpenSearch Service in the developer guide.

About the authors
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.
Raghu Ramesha is an ML Solutions Architect. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
Sohaib Katariwala is a Sr. Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. His interests are in all things data and analytics. More specifically he loves to help customers use AI in their data strategy to solve modern day challenges.
Karan Jain is a Senior Machine Learning Specialist at AWS, where he leads the worldwide Go-To-Market strategy for Amazon SageMaker Inference. He helps customers accelerate their generative AI and ML journey on AWS by providing guidance on deployment, cost-optimization, and GTM strategy. He has led product, marketing, and business development efforts across industries for over 10 years, and is passionate about mapping complex service features to customer solutions.