i-genie, Author at i-genie.co.uk

A Coding Guide to Unlock mem0 Memory for Anthropic Claude Bot: Enablin …

Posted on May 11, 2025 by i-genie

In this tutorial, we walk you through setting up a fully functional bot in Google Colab that leverages Anthropic’s Claude model alongside mem0 for seamless memory recall. Combining LangGraph’s intuitive state-machine orchestration with mem0’s powerful vector-based memory store will empower our assistant to remember past conversations, retrieve relevant details on demand, and maintain natural continuity across sessions. Whether you’re building support bots, virtual assistants, or interactive demos, this guide will equip you with a robust foundation for memory-driven AI experiences.

Copy CodeCopiedUse a different Browser!pip install -qU langgraph mem0ai langchain langchain-anthropic anthropic

First, we install and upgrade LangGraph, the Mem0 AI client, LangChain with its Anthropic connector, and the core Anthropic SDK, ensuring we have all the latest libraries required for building a memory-driven Claude chatbot in Google Colab. Running it upfront will avoid dependency issues and streamline the setup process.

Copy CodeCopiedUse a different Browserimport os
from typing import Annotated, TypedDict, List

from langgraph.graph import StateGraph, START
from langgraph.graph.message import add_messages
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
from langchain_anthropic import ChatAnthropic
from mem0 import MemoryClient

We bring together the core building blocks for our Colab chatbot: it loads the operating-system interface for API keys, Python’s typed dictionaries and annotation utilities for defining conversational state, LangGraph’s graph and message decorators to orchestrate chat flow, LangChain’s message classes for constructing prompts, the ChatAnthropic wrapper to call Claude, and Mem0’s client for persistent memory storage.

Copy CodeCopiedUse a different Browseros.environ[“ANTHROPIC_API_KEY”] = “Use Your Own API Key”
MEM0_API_KEY = “Use Your Own API Key”

We securely inject our Anthropic and Mem0 credentials into the environment and a local variable, ensuring that the ChatAnthropic client and Mem0 memory store can authenticate properly without hard-coding sensitive keys throughout our notebook. Centralizing our API keys here, we maintain a clean separation between code and secrets while enabling seamless access to the Claude model and persistent memory layer.

Copy CodeCopiedUse a different Browserllm = ChatAnthropic(
model=”claude-3-5-haiku-latest”,
temperature=0.0,
max_tokens=1024,
anthropic_api_key=os.environ[“ANTHROPIC_API_KEY”]
)
mem0 = MemoryClient(api_key=MEM0_API_KEY)

We initialize our conversational AI core: first, it creates a ChatAnthropic instance configured to talk with Claude 3.5 Sonnet at zero temperature for deterministic replies and up to 1024 tokens per response, using our stored Anthropic key for authentication. Then it spins up a Mem0 MemoryClient with our Mem0 API key, giving our bot a persistent vector-based memory store to save and retrieve past interactions seamlessly.

Copy CodeCopiedUse a different Browserclass State(TypedDict):
messages: Annotated[List[HumanMessage | AIMessage], add_messages]
mem0_user_id: str

graph = StateGraph(State)

def chatbot(state: State):
messages = state[“messages”]
user_id = state[“mem0_user_id”]

memories = mem0.search(messages[-1].content, user_id=user_id)

context = “n”.join(f”- {m[‘memory’]}” for m in memories)
system_message = SystemMessage(content=(
“You are a helpful customer support assistant. ”
“Use the context below to personalize your answers:n” + context
))

full_msgs = [system_message] + messages
ai_resp: AIMessage = llm.invoke(full_msgs)

mem0.add(
f”User: {messages[-1].content}nAssistant: {ai_resp.content}”,
user_id=user_id
)

return {“messages”: [ai_resp]}

We define the conversational state schema and wire it into a LangGraph state machine: the State TypedDict tracks the message history and a Mem0 user ID, and graph = StateGraph(State) sets up the flow controller. Within the chatbot, the most recent user message is used to query Mem0 for relevant memories, a context-enhanced system prompt is constructed, Claude generates a reply, and that new exchange is saved back into Mem0 before returning the assistant’s response.

Copy CodeCopiedUse a different Browsergraph.add_node(“chatbot”, chatbot)
graph.add_edge(START, “chatbot”)
graph.add_edge(“chatbot”, “chatbot”)
compiled_graph = graph.compile()

We plug our chatbot function into LangGraph’s execution flow by registering it as a node named “chatbot,” then connecting the built-in START marker to that node. Hence, the conversation begins there, and finally creates a self-loop edge so each new user message re-enters the same logic. Calling graph.compile() then transforms this node-and-edge setup into an optimized, runnable graph object that will manage each turn of our chat session automatically.

Copy CodeCopiedUse a different Browserdef run_conversation(user_input: str, mem0_user_id: str):
config = {“configurable”: {“thread_id”: mem0_user_id}}
state = {“messages”: [HumanMessage(content=user_input)], “mem0_user_id”: mem0_user_id}
for event in compiled_graph.stream(state, config):
for node_output in event.values():
if node_output.get(“messages”):
print(“Assistant:”, node_output[“messages”][-1].content)
return

if __name__ == “__main__”:
print(“Welcome! (type ‘exit’ to quit)”)
mem0_user_id = “customer_123”
while True:
user_in = input(“You: “)
if user_in.lower() in [“exit”, “quit”, “bye”]:
print(“Assistant: Goodbye!”)
break
run_conversation(user_in, mem0_user_id)

We tie everything together by defining run_conversation, which packages our user input into the LangGraph state, streams it through the compiled graph to invoke the chatbot node, and prints out Claude’s reply. The __main__ guard then launches a simple REPL loop, prompting us to type messages, routing them through our memory-enabled graph, and gracefully exiting when we enter “exit”.

In conclusion, we’ve assembled a conversational AI pipeline that combines Anthropic’s cutting-edge Claude model with mem0’s persistent memory capabilities, all orchestrated via LangGraph in Google Colab. This architecture allows our bot to recall user-specific details, adapt responses over time, and deliver personalized support. From here, consider experimenting with richer memory-retrieval strategies, fine-tuning Claude’s prompts, or integrating additional tools into your graph.

Check out Colab Notebook here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

The post A Coding Guide to Unlock mem0 Memory for Anthropic Claude Bot: Enabling Context-Rich Conversations appeared first on MarkTechPost.

Huawei Introduces Pangu Ultra MoE: A 718B-Parameter Sparse Language Mo …

Posted on May 11, 2025 by i-genie

Sparse large language models (LLMs) based on the Mixture of Experts (MoE) framework have gained traction for their ability to scale efficiently by activating only a subset of parameters per token. This dynamic sparsity allows MoE models to retain high representational capacity while limiting computation per token. However, with their increasing complexity and model size approaching trillions of parameters, training them efficiently requires algorithmic innovation and a tightly integrated hardware-software optimization. These challenges are especially relevant when deploying models on non-standard AI accelerators like Ascend NPUs, which require specific architectural alignment to deliver optimal performance.

A major technical challenge lies in the inefficient utilization of hardware resources while training sparse LLMs. Since only a portion of parameters are active for each token, workloads across devices become unbalanced, leading to synchronization delays and underused processing power. This imbalance also affects memory utilization as different experts process different numbers of tokens, sometimes exceeding capacity. These inefficiencies are compounded at a large scale, such as across thousands of AI chips, where communication and memory management bottlenecks significantly hinder throughput. The inability to fully harness the computational promise of sparsity in practice restricts the deployment of such models on hardware systems like Ascend NPUs.

Several strategies have been proposed to tackle these challenges. These include auxiliary losses to balance token distribution across experts and drop-and-pad strategies that limit expert overload by discarding tokens exceeding capacity. However, these techniques either reduce model performance or introduce inefficiencies in memory and computation. Other efforts include heuristic expert placement and traditional communication patterns like All-to-All dispatching, but these often fail to scale well or maintain high throughput. Moreover, standard memory-saving techniques like recomputation are usually coarse-grained, targeting whole layers instead of specific operations, leading to increased runtime without proportional memory savings.

Researchers from the Pangu team at Huawei Cloud introduced a highly structured and optimized training approach for large MoE models tailored to Ascend NPUs. They developed Pangu Ultra MoE, a sparse LLM with 718 billion parameters, focusing on aligning model architecture and system design with the capabilities of the Ascend hardware. Their approach begins with a simulation-based model configuration process that evaluates thousands of architecture variants using metrics grounded in actual hardware behavior. These simulations inform design decisions before any physical training is undertaken, thus saving substantial computational resources and enabling informed tuning of model hyperparameters.

The simulation method analyzes combinations of parameters such as the number of layers, hidden size, and expert count using a five-dimensional parallelism strategy that includes Pipeline Parallelism, Tensor Parallelism, Expert Parallelism, Data Parallelism, and Context Parallelism. The final model configuration adopted by Huawei included 256 experts, a hidden size 7680, and 61 transformer layers. To further optimize performance, researchers integrated an Adaptive Pipe Overlap mechanism to mask communication costs and used hierarchical All-to-All communication to reduce inter-node data transfer. They employed fine-grained recomputation, such as recomputing only key-value vectors in attention modules, and introduced tensor swapping to offload activation memory to host devices dynamically.

Pangu Ultra MoE achieved a Model Flops Utilization (MFU) of 30.0% and processed tokens at a rate of 1.46 million per second using 6,000 Ascend NPUs. The baseline MFU was 18.9% with 0.61 million tokens per second on 4,000 NPUs. The researchers also introduced dynamic expert placement strategies, improving device-level load balance and achieving a relative 10% MFU improvement. The model performed competitively on benchmark evaluations, attaining 81.3% on AIME2024, 97.4% on MATH500, 94.8% on CLUEWSC, and 91.5% on MMLU. In the healthcare domain, it outperformed DeepSeek R1 by scoring 87.1% on MedQA and 80.8% on MedMCQA, confirming its strength in domain-specific applications.

This study illustrates how the Pangu team at Huawei effectively tackled the core difficulties of training massive MoE models on specialized hardware. Their systematic architecture search, efficient communication techniques, and tailored memory optimizations represent a strong framework for scalable AI training. The work demonstrates practical ways to unlock the performance potential of sparse models and sets a direction for future system-aware AI design.

Check out Paper here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

The post Huawei Introduces Pangu Ultra MoE: A 718B-Parameter Sparse Language Model Trained Efficiently on Ascend NPUs Using Simulation-Driven Architecture and System-Level Optimization appeared first on MarkTechPost.

ZeroSearch from Alibaba Uses Reinforcement Learning and Simulated Docu …

Posted on May 11, 2025 by i-genie

Large language models are now central to various applications, from coding to academic tutoring and automated assistants. However, a critical limitation persists in how these models are designed; they are trained on static datasets that become outdated over time. This creates a fundamental challenge because the language models cannot update their knowledge or validate responses against fresh, real-world data. As a result, while these models demonstrate strong performance on reasoning tasks or structured queries, their answers can still include fabricated or obsolete information, reducing their reliability in real-world usage. To maintain credibility, especially for applications requiring updated knowledge such as news, research, or product reviews, models must interact with external data sources in a timely and cost-efficient manner.

The core problem lies in teaching these models to effectively retrieve and incorporate external information. While fine-tuned pretraining helps develop a strong baseline understanding, the capacity to conduct meaningful, dynamic searches is missing. Equipping language models with this ability introduces practical constraints. Search engines used for external information retrieval provide varying document quality that introduces inconsistency in model training. Moreover, integrating reinforcement learning to simulate real-world searching requires large-scale interactions with live APIs, running up hundreds of thousands of calls, which becomes prohibitively expensive. This results in a bottleneck for academic research and commercial deployment, where cost and training scalability are critical.

Various methods have been developed to enhance language models’ search and retrieval capabilities. Some early techniques relied on prompt-based instructions that guided the model through processes like generating sub-queries or managing multi-step searches. These methods, however, heavily relied on manual tuning and often required extensive computational resources to ensure consistent outputs. Other approaches leaned on supervised fine-tuning for smaller models to perform more targeted retrieval, with models like Self-RAG and RetroLLM emerging in this space. There have also been experiments with techniques like Monte Carlo Tree Search to expand possible answer paths during inference dynamically. Reinforcement learning-based solutions like Search-R1 and DeepResearcher allowed models to interact directly with real search engines, offering a training experience closer to how users behave. However, these innovations still suffer from either complexity, high computational demand, or financial cost due to live interaction constraints.

Researchers from Tongyi Lab at Alibaba Group introduced an innovative solution called ZeroSearch. This reinforcement learning framework removes the need for live API-based search entirely. Instead, it uses another language model to simulate the behavior of a search engine. The simulation model is fine-tuned through supervised training to generate documents that either help or mislead the policy model, depending on whether the content is designed to be relevant or noisy. This allows complete control over the document quality and cost while enabling a realistic retrieval training experience. A key innovation lies in using curriculum-based learning during training, which means gradually introducing harder retrieval tasks by adjusting how much noise is present in the generated documents. This progression helps the policy model develop resilience and better reasoning skills over time without ever making a real search query.

The structure of ZeroSearch involves distinct phases in the reasoning process. The model first thinks internally using designated tags, then generates queries if it determines that additional information is needed. Finally, it outputs an answer only when sufficient context is acquired. This structured approach enforces clarity in decision-making and has been shown to improve transparency and answer quality. A minimal change in prompts guides document generation for the simulated search engine that controls whether the document appears helpful or misleading. The simulated LLM is fine-tuned using interaction data where each retrieval trajectory is labeled based on the correctness of the final answer. The policy model is taught to handle straightforward and complex search conditions by systematically varying document quality. A performance scaling function determines how much noise is introduced at each training stage, increasing the model’s ability to navigate uncertainty over time.

A 3-billion parameter model was able to simulate the retrieval process for training purposes effectively. The results became particularly notable with larger models. A 7B retrieval module was performed at a level comparable to Google Search regarding response quality. A 14B model even surpassed Google Search benchmarks. ZeroSearch also showed flexibility, functioning effectively across base and instruction-tuned LLMs of different sizes. It integrates well with a range of reinforcement learning algorithms, including PPO, GRPO, and Reinforce++, and it uses a reward design based on the F1 score rather than exact match to discourage the model from generating excessively long answers just to increase keyword overlap. Furthermore, ZeroSearch uses a masking mechanism during backpropagation to ensure that gradients are only computed on the policy model’s outputs, stabilizing training without sacrificing performance.

The research demonstrates a clear and efficient alternative to real-time search engine reliance. Using simulation-driven document generation removes the need for high-cost APIs, and the quality of training input is controlled with precision. The method also boosts model reasoning capability by introducing progressive noise and uncertainty, effectively mimicking how real-world data retrieval might fail or mislead. The policy model is trained to extract the most useful information. These traits make ZeroSearch a scalable and practical solution for commercial-grade applications.

This approach successfully identifies and addresses the twin challenges of document quality variability and economic cost that have limited real-time search integration in language model training. It combines document simulation, structured interaction, and reinforcement learning to ensure effectiveness and scalability. By relying solely on simulated data generation, the researchers achieved superior or comparable results to existing methods while removing all dependency on costly APIs.

Several Key Takeaways from the Research include the following:

A 3B model simulated realistic document retrieval effectively with zero API cost.

A 7B retrieval module matched Google Search performance in benchmark tests.

The 14B model exceeded real search engine performance.

Reinforcement learning was performed with a curriculum-based rollout that gradually introduced noise.

A simulation LLM generated both relevant and noisy documents via lightweight supervised fine-tuning.

Structured interaction phases (<think>, <search>, <answer>) improved model clarity and accuracy.

F1-based rewards discouraged reward hacking by penalizing irrelevant answer length.

Compatible with major RL algorithms including PPO, GRPO, and Reinforce++.

Training was stabilized using a gradient masking mechanism to prevent instability from simulated tokens.

Check out the Paper and Model on Hugging Face. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

Partner with us

The post ZeroSearch from Alibaba Uses Reinforcement Learning and Simulated Documents to Teach LLMs Retrieval Without Real-Time Search appeared first on MarkTechPost.

AI That Teaches Itself: Tsinghua University’s ‘Absolute Zero’ Tr …

Posted on May 10, 2025 by i-genie

LLMs have shown advancements in reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR), which relies on outcome-based feedback rather than imitating intermediate reasoning steps. Current RLVR works face critical scalability challenges as they heavily depend on manually curated collections of questions and answers for training. As reasoning models advance, constructing large-scale, high-quality datasets becomes increasingly unsustainable, similar to bottlenecks identified in LLM pretraining. Moreover, exclusive dependency on human-designed tasks may constrain AI systems’ capacity for autonomous learning and development, especially as they evolve beyond human intellectual capabilities.

Researchers have explored various approaches to enhance LLM reasoning capabilities. STaR pioneered self-bootstrapping using expert iteration and rejection sampling of outcome-verified responses to improve CoT reasoning. The o1 model deployed this concept at scale, achieving state-of-the-art results, and R1 later became the first open-weight model to match or surpass o1’s performance by introducing the “zero” setting where RL is applied directly to the base LLM. Further, self-play paradigms have evolved from Schmidhuber’s early two-agent setups to more complex implementations like AlphaGo and AlphaZero. Recent methods such as SPIN, Self-Rewarding Language Models, SPC, and SPAG have applied self-play to language models for alignment and reasoning.

Researchers from Tsinghua University, Beijing Institute for General Artificial Intelligence, and Pennsylvania State University have proposed an RLVR paradigm called Absolute Zero to enable a single model to autonomously generate and solve tasks that maximize its own learning progress without relying on any external data. Under this method, researchers have introduced the Absolute Zero Reasoner (AZR) that self-evolves its training curriculum and reasoning ability through a code executor that validates proposed code reasoning tasks and verifies answers, providing a unified source of verifiable reward to guide open-ended yet grounded learning. AZR can be effectively implemented across different model scales and remains compatible with various model classes, suggesting broad applicability.

LLMs provide an ideal framework for implementing AZR in multitask learning contexts. During each online rollout iteration in the absolute zero setting’s objective equation, AZR proposes new reasoning tasks based on task type and past self-generated examples, with explicit prompting to generate diverse tasks and then attempts to solve them, receiving grounded feedback for its model responses. AZR utilizes a code executor as both a flexible interface and verifiable environment, enabling automatic construction, execution, and validation of code reasoning tasks. Lastly, the AZR Algorithm includes buffer initialization, Task Proposal Inputs and Buffer Management, valid task construction, solution validation, and advantage estimator calculation through Task-Relative REINFORCE++.

The Absolute Zero Reasoner-Coder-7B has achieved state-of-the-art performance in the 7B overall average and coding average categories, surpassing previous best models by 1.8 absolute percentage points despite being entirely out-of-distribution for both math and code reasoning benchmarks. It outperforms models trained with expert-curated human data in coding by 0.3 absolute percentage points while never accessing such data itself. Scaling analysis reveals that AZR delivers greater gains on larger models, with the 7B and 14B models continuing to improve beyond 200 training steps while the 3B model plateaus. Out-of-distribution performance gains increase with model size: +5.7, +10.2, and +13.2 for 3B, 7B, and 14B, respectively.

In conclusion, researchers introduced the Absolute Zero paradigm to address data limitations in existing RLVR frameworks. Under this method, researchers present AZR, which trains models to propose and solve code-related reasoning tasks grounded by a code executor. However, there is a limitation regarding safety management in self-improving systems. The team observed several instances of safety-concerning CoT reasoning from the Llama-3.1-8B model, termed “uh-oh moments.” The findings indicate that while the Absolute Zero paradigm reduces human intervention needs in task curation, ongoing oversight remains necessary to address lingering safety concerns, highlighting a critical direction for future research.

Check out the Paper, Model on Hugging Face and GitHub Page. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

Partner with us

The post AI That Teaches Itself: Tsinghua University’s ‘Absolute Zero’ Trains LLMs With Zero External Data appeared first on MarkTechPost.

Google Redefines Computer Science R&D: A Hybrid Research Model tha …

Posted on May 10, 2025 by i-genie

Computer science research has evolved into a multidisciplinary effort involving logic, engineering, and data-driven experimentation. With computing systems now deeply embedded in everyday life, research increasingly focuses on large-scale, real-time systems capable of adapting to diverse user needs. These systems often learn from massive datasets and must handle unpredictable interactions. As the scope of computer science broadens, so does the methodology, requiring tools and approaches that accommodate scalability, responsiveness, and empirical validation over purely theoretical models.

The difficulty arises when connecting innovative ideas to practical applications without losing the depth and risk inherent in true research. Rapid development cycles, product deadlines, and user expectations often overlap with the uncertain timelines and exploratory nature of research. The challenge is enabling meaningful innovation while maintaining relevance and practical outcomes. Finding a structure where exploration and implementation coexist is essential to making real progress in this demanding and high-impact field.

Traditionally, the division between research and engineering has led to inefficiencies. Research teams create conceptual models or prototypes, which are later handed over to engineering teams for scaling and integration. This separation often results in delays, failures in technology transfer, and difficulty adapting ideas to real-world use. Even when research has academic value, the lack of immediate relevance or scalable deployment options limits its broader impact. Conventional dissemination methods, such as peer-reviewed papers, don’t always align with the fast-moving demands of technology development.

Google introduced a hybrid research model integrating researchers directly into product and engineering teams. This approach was designed to reduce delays between ideation and implementation, enabling faster and more relevant outcomes. Researchers at Google, a company that runs at the intersection of massive computing infrastructure and billions of users, operate within small teams that remain involved from concept to deployment. By embedding development research, the risk of failure is offset by iterative learning and empirical data gathered from actual user interactions. This model promotes cross-functional innovation where knowledge flows seamlessly between domains.

The methodology adopted by Google supports research through robust infrastructure and real-time experimentation. Teams write production-ready code early and rely on continuous feedback from deployed services. Elaborate prototypes are avoided, as they slow the path to real user impact. Google’s services model allows even small teams to access powerful computing resources and integrate complex features quickly. Their projects are modularized, breaking long-term goals into smaller, achievable components. This structure keeps motivation high and provides frequent opportunities for measurable progress. Research is not isolated from engineering but rather supported by it, ensuring that practical constraints and user behavior shape every line of code and every experiment.

The results of this model are substantial. Google published 279 research papers in 2011, a steep rise from 13 in 2003, showing an increased emphasis on sharing its scientific advancements. High-impact systems such as MapReduce, BigTable, and the Google File System originated within this hybrid structure and have become foundational to modern computing. Over 1,000 open-source projects and hundreds of public APIs have emerged from this integrated approach. Google Translate and Voice Search are examples of small research teams that transitioned ideas into large-scale products. Contributions extend to global standards, with team members shaping specifications like HTML5.

By deeply connecting research with product development, Google has built a model that fosters innovation and delivers it at scale. Its hybrid research system empowers teams to work on difficult problems without being detached from practical realities. Projects are designed with user impact and academic relevance in mind, allowing teams to adjust direction quickly when goals are unmet. This has led to projects such as Google Health being re-evaluated when they did not yield the expected outcomes, showing the model’s flexibility and pragmatism.

Combining experimentation, real-world data, and scalable engineering, Google has built a framework that makes research outcomes more tangible and impactful. This paper clearly shows how a unified approach to research and engineering can bridge the gap between innovation and usability, offering a potential blueprint for other technology-driven organizations.

Check out the Paper. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

The post Google Redefines Computer Science R&D: A Hybrid Research Model that Merges Innovation with Scalable Engineering appeared first on MarkTechPost.

ServiceNow AI Released Apriel-Nemotron-15b-Thinker: A Compact Yet Powe …

Posted on May 10, 2025 by i-genie

AI models today are expected to handle complex tasks such as solving mathematical problems, interpreting logical statements, and assisting with enterprise decision-making. Building such models demands the integration of mathematical reasoning, scientific understanding, and advanced pattern recognition. As the demand for intelligent agents in real-time applications, like coding assistants and business automation tools, continues to grow, there is a pressing need for models that combine strong performance with efficient memory and token usage, making them viable for deployment in practical hardware environments.

A central challenge in AI development is the resource intensity of large-scale reasoning models. Despite their strong capabilities, these models often require significant memory and computational resources, limiting their real-world applicability. This creates a gap between what advanced models can achieve and what users can realistically deploy. Even well-resourced enterprises may find running models demanding dozens of gigabytes of memory or high inference costs unsustainable. The issue is not just about building smarter models, but ensuring they are efficient and deployable in real-world platforms. High-performing models such as QWQ‑32b, o1‑mini, and EXAONE‑Deep‑32b excel at tasks involving mathematical reasoning and academic benchmarks. However, their dependence on high-end GPUs and high token consumption limits their use in production settings. These models highlight the ongoing trade-off in AI deployment: achieving high accuracy at the cost of scalability and efficiency.

Addressing this gap, researchers at ServiceNow introduced Apriel-Nemotron-15b-Thinker. This model consists of 15 billion parameters, a relatively modest size compared to its high-performing counterparts, yet it demonstrates performance on par with models almost twice its size. The primary advantage lies in its memory footprint and token efficiency. While delivering competitive results, it requires nearly half the memory of QWQ‑32b and EXAONE‑Deep‑32b. This directly contributes to improved operational efficiency in enterprise environments, making it feasible to integrate high-performance reasoning models into real-world applications without large-scale infrastructure upgrades.

The development of Apriel-Nemotron-15b-Thinker followed a structured three-stage training approach, each designed to enhance a specific aspect of the model’s reasoning capabilities. In the initial phase, termed Continual Pre-training (CPT), the model was exposed to over 100 billion tokens. These tokens were not generic text but carefully selected examples from domains requiring deep reasoning, mathematical logic, programming challenges, scientific literature, and logical deduction tasks. This exposure provided the foundational reasoning capabilities that distinguish the model from others. The second stage involved Supervised Fine-Tuning (SFT) using 200,000 high-quality demonstrations. These examples further calibrated the model’s responses to reasoning challenges, enhancing performance on tasks that require accuracy and attention to detail. The final tuning stage, GRPO (Guided Reinforcement Preference Optimization), refined the model’s outputs by optimizing alignment with expected results across key tasks. This pipeline ensures the model is intelligent, precise, structured, and scalable.

In enterprise-specific tasks such as MBPP, BFCL, Enterprise RAG, MT Bench, MixEval, IFEval, and Multi-Challenge, the model delivered competitive or superior performance compared to larger models. Regarding production efficiency, it consumed 40% fewer tokens than QWQ‑32b, significantly lowering inference costs. From a memory standpoint, it achieves all this with approximately 50% of the memory needed by QWQ‑32b and EXAONE-Deep‑32b, indicating a substantial improvement in deployment feasibility. Even in academic benchmarks, such as AIME-24, AIME-25, AMC-23, MATH-500, and GPQA, the model held its own, often equaling or surpassing the performance of other larger models, all while being significantly lighter in computational demand.

Several Key Takeaways from the Research on Apriel-Nemotron-15b-Thinker:

Apriel-Nemotron-15b-Thinker has 15 billion parameters, significantly smaller than QWQ-32b or EXAONE-Deep-32b, but performs competitively.

Uses a 3-phase training, 100B+ tokens in CPT, 200K fine-tuning demos in SFT, and final GRPO refinement.

Consumes around 50% less memory than QWQ-32b, allowing for easier deployment on enterprise hardware.

Uses 40% fewer tokens in production tasks than QWQ-32b, reducing inference cost and increasing speed.

Outperforms or equals larger models on MBPP, BFCL, Enterprise RAG, and academic tasks like GPQA and MATH-500.

Optimized for Agentic and Enterprise tasks, suggesting utility in corporate automation, coding agents, and logical assistants.

Designed specifically for real-world use, avoiding over-reliance on lab-scale compute environments.

Check out the Model on Hugging Face. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

The post ServiceNow AI Released Apriel-Nemotron-15b-Thinker: A Compact Yet Powerful Reasoning Model Optimized for Enterprise-Scale Deployment and Efficiency appeared first on MarkTechPost.

Elevate marketing intelligence with Amazon Bedrock and LLMs for conten …

Posted on May 10, 2025 by i-genie

In the media and entertainment industry, understanding and predicting the effectiveness of marketing campaigns is crucial for success. Marketing campaigns are the driving force behind successful businesses, playing a pivotal role in attracting new customers, retaining existing ones, and ultimately boosting revenue. However, launching a campaign isn’t enough; to maximize their impact and help achieve a favorable return on investment, it’s important to understand how these initiatives perform.
This post explores an innovative end-to-end solution and approach that uses the power of generative AI and large language models (LLMs) to transform marketing intelligence. We use Amazon Bedrock, a fully managed service that provides access to leading foundation models (FMs) through a unified API, to demonstrate how to build and deploy this marketing intelligence solution. By combining sentiment analysis from social media data with AI-driven content generation and campaign effectiveness prediction, businesses can make data-driven decisions that optimize their marketing efforts and drive better results.
The challenge
Marketing teams in the media and entertainment sector face several challenges:

Accurately gauging public sentiment towards their brand, products, or campaigns
Creating compelling, targeted content for various marketing channels
Predicting the effectiveness of marketing campaigns before execution
Reducing marketing costs while maximizing impact

To address these challenges, we explore a solution that harnesses the power of generative AI and LLMs. Our solution integrates sentiment analysis, content generation, and campaign effectiveness prediction into a unified architecture, allowing for more informed marketing decisions.
Solution overview
The following diagram illustrates the logical data flow for our solution by using sentiment analysis and content generation to enhance marketing strategies.

In this pattern, social media data flows through a streamlined data ingestion and processing pipeline for real-time handling. At its core, the system uses Amazon Bedrock LLMs to perform three key AI functions:

Analyzing the sentiment of social media content
Generating tailored content based on the insights obtained
Evaluating campaign effectiveness

The processed data is stored in databases or data warehouses, then made available for reporting through interactive dashboards and generated detailed performance reports, enabling businesses to visualize trends and extract meaningful insights about their social media performance using customizable metrics and KPIs. This pattern creates a comprehensive solution that transforms raw social media data into actionable business intelligence (BI) through advanced AI capabilities. By integrating LLMs such as Anthropic’s Claude 3.5 Sonnet, Amazon Nova Pro, and Meta Llama 3.2 3B Instruct Amazon Bedrock, the system provides tailored marketing content that adds business value.
The following is a breakdown of each step in this solution.
Prerequisites
This solution requires you to have an AWS account with the appropriate permissions.
Ingest social media data
The first step involves collecting social media data that is relevant to your marketing campaign, for example from platforms such as Bluesky:

Define hashtags and keywords to track hashtags related to your brand, product, or campaign.
Connect to social media platform APIs.
Set up your data storage system.
Configure real-time data streaming.

Conduct sentiment analysis with social media data
The next step involves conducting sentiment analysis on social media data. Here’s how it works:

Collect posts using relevant hashtags related to your brand, product, or campaign.
Feed the collected posts into an LLM using a prompt for sentiment analysis.
The LLM processes the textual content and outputs classifications (for example, positive, negative, or neutral) and explanations.

The following code is an example using the AWS SDK for Python (Boto3) that prompts the LLM for sentiment analysis:

import boto3
import json

# Initialize Bedrock Runtime client
bedrock = boto3.client(‘bedrock-runtime’)

def analyze_sentiment(text, model_id= {selected_model}):
# Construct the prompt
prompt = f”””You are an expert AI sentiment analyst with advanced natural language processing capabilities. Your task is to perform a sentiment analysis on a given social media post, providing a classification of positive, negative, or neutral, and detailed rationale.

Inputs:
Post: “{text}”

Instructions:
1. Carefully read and analyze the provided post content.
2. Consider the following aspects in your analysis:
– Overall tone of the message
– Choice of words and phrases
– Presence of emotional indicators (such as emojis, punctuation)
– Context and potential sarcasm or irony
– Balance of positive and negative elements, if any
3. Classify the sentiment as one of the following:
– Positive: The post expresses predominantly favorable or optimistic views
– Negative: The post expresses predominantly unfavorable or pessimistic views
– Neutral: The post lacks strong emotion or balances positive and negative elements.
4. Explain your classification with specific references to the post

Provide your response in the following format:
Sentiment: [Positive/Negative/Neutral]
Explanation: [Detailed explanation of your classification, including:
– Key words or phrases that influenced your decision
– Analysis of any emotional indicators
– Discussion of context and tone
– Explanation of any ambiguities or mixed signals]

Remember to be objective and base your analysis solely on the content of the post. If the sentiment is ambiguous or context-dependent, acknowledge this in your explanation.
“””

# Create the request body
body = json.dumps({
“prompt”: prompt,
“max_tokens_to_sample”: 500,
“temperature”: 0.5,
“top_p”: 1
})

# Invoke the model
response = bedrock.invoke_model(
modelId=model_id,
body=body
)

return json.loads(response[‘body’].read())

This analysis provides valuable insights into public perception, providing marketers the information they need to understand how their brand or campaign is resonating with the audience in real time.
The following output examples were obtained using Amazon Bedrock:

Sentiment analysis for Large Language Models:
=================================================

EXAMPLE PROMPT:
————————————————-
Analyze the sentiment of the following post. Classify it as positive, negative, or neutral, and provide a brief explanation for your classification.

Post: “Avoid [AnyCompany] printer ink refills at all costs! Leaked everywhere, and barely lasted a week. Overpriced. #PrinterInkNightmare #AnyCompanyFail”

Sentiment:
Explanation:
————————————————-

Using Amazon Nova Pro:
=====================================
Sentiment: Negative

Explanation: The post expresses strong dissatisfaction with [AnyCompany] printer ink refills. The author uses phrases like “avoid at all costs,” “leaked everywhere,” “barely lasted a week,” and “overpriced,” all of which convey a highly negative experience. Additionally, the hashtags “#PrinterInkNightmare” and “#AnyCompanyFail” further emphasize the negative sentiment by highlighting the frustration and failure associated with the product.

Using Meta Llama 3.2 3B Instruct:
================================
Classification: Negative

Explanation: The post expresses strong dissatisfaction and warning to others about purchasing printer ink refills from “AnyCompany”. The language used is emphatic, with words like “Avoid”, “Leaked everywhere”, and “Overpriced”, which convey a sense of frustration and disappointment. The hashtags “#PrinterInkNightmare” and “#AnyCompanyFail” further emphasize the negative sentiment, suggesting that the author has had a terrible experience with the product.

Using Anthropic Claude 3.5 Sonnet:
==================================
Sentiment: Negative

Explanation: This post expresses a strongly negative sentiment towards [AnyCompany]’s printer ink refills. Several factors contribute to this classification:

The use of the phrase “Avoid… at all costs” indicates a strong recommendation against the product.

Multiple negative aspects of the product are mentioned: leaking, short lifespan, and high price.

The use of the hashtag “#PrinterInkNightmare” emphasizes the user’s frustration and negative experience.

The hashtag “#AnyCompanyFail” directly criticizes the company, associating it with failure.

The overall tone is one of warning and disappointment, with no positive aspects mentioned.

The language used is emphatic and critical, clearly conveying the user’s dissatisfaction with the product and the company. This post is likely intended to warn others away from the product, further reinforcing its negative sentiment.
===============================

Analyze campaign effectiveness and generate content
The next step focuses on using AI for content creation and campaign effectiveness prediction:

Input campaign data points (target audience, messaging, channels, and so on) into an LLM tailored for generating marketing content.
The LLM generates relevant content such as ad copy, social media posts, or email campaigns based on the provided data.
Another LLM, designed for campaign effectiveness analysis, evaluates the generated content.
This analysis model outputs a score or measure of the content’s potential effectiveness, considering the campaign objectives and insights from the social media sentiment analysis.

Content generation
The following is an example that prompts a selected LLM for content generation:

import boto3
import json

# Initialize Bedrock Runtime client
bedrock = boto3.client(‘bedrock-runtime’)

def generate_marketing_content(
product,
target_audience,
key_message,
tone,
platform,
char_limit,
model_id= {selected_model}
):
prompt = f”””You are an expert AI social media copywriter with extensive experience in creating engaging, platform-specific content for marketing campaigns. Your task is to craft a compelling social media post based on the provided campaign details.

Inputs:
Product: {product}
Target Audience: {target_audience}
Key Message: {key_message}
Tone: {tone}
Platform: {platform}
Character Limit: {char_limit}

Instructions:
1. Carefully review all provided information.
2. Craft a social media post that:
– Accurately represents the product
– Resonates with the target audience
– Clearly conveys the key message
– Matches the specified tone
– Is optimized for the given platform
– Adheres to the character limit
3. Incorporate platform-specific best practices (i.e. hashtags for Twitter/Instagram, emojis if appropriate)
4. Make sure the post is attention-grabbing and encourage engagement (likes, shares, comments)
5. Include a call-to-action if appropriate for the campaign

Provide your response in the following format:
Generated Post: [Your social media post here, ensuring it’s within the character limit]

Remember to be creative, concise, and impactful. Ensure your post aligns perfectly with the provided campaign details and platform requirements.
“””

body = json.dumps({
“prompt”: prompt,
“max_tokens_to_sample”: 300,
“temperature”: 0.7,
“top_p”: 0.9
})

response = bedrock.invoke_model(
modelId=model_id,
body=body
)

return json.loads(response[‘body’].read())

The following output examples were obtained using Amazon Bedrock:

Text generation Prompt for Large Language Models:
=================================================
Create a social media post for the following marketing campaign:

Product: [AnyCompany printer ink cartridge refills]
Target Audience: [Home Office or small business users]
Key Message: [lower cost with same or similar results as original branded ink cartridges]
Tone: [Desired tone, e.g., friendly, professional, humorous]
Platform: [Social media platform, e.g., Bluesky]
Character Limit: [75]

Using Amazon Nova Pro:
=====================================
🖨 Save big on printing! Try [AnyCompany] ink cartridge refills for your home office or small biz. Enjoy lower costs with quality that matches the originals. Print smart, print affordable. 💼💰
#PrintSmart #CostSaving #AnyCompanyInk

Using Meta Llama 3.2 3B Instruct:
================================
“Ditch the expensive original ink cartridges! Our refill cartridges are made to match your printer’s original quality, at a fraction of the cost. Save up to 50% on your ink needs!”
#InkSavers #PrintOnABudget

Using Anthropic Claude 3.5 Sonnet:
===============================
“Print more, pay less! AnyCompany refills: OEM quality, half the price.”
#SmartOffice

Campaign effectiveness analysis
The following is an example of code that prompts the selected LLM for campaign effectiveness analysis:

import boto3
import json

# Initialize Bedrock Runtime client
bedrock = boto3.client(‘bedrock-runtime’)

def analyze_campaign_effectiveness(
campaign_objectives,
sentiment_summary,
marketing_content,
model_id= {selected_model}
):
prompt = f”””You are an expert AI marketing analyst with extensive experience in evaluating marketing campaigns. Your task is to assess a marketing campaign based on its content and alignment with objectives. Provide a thorough, impartial analysis using the information given.

Inputs:
Campaign Objectives: {campaign_objectives}
Positive Sentiments: {sentiment_summary[‘praises’]}
Negative Sentiments: {sentiment_summary[‘flaws’]}
Marketing Content: {marketing_content}

Instructions:
1. Carefully review all provided information.
2. Analyze how well the marketing content aligns with the campaign objectives.
3. Consider the positive and negative sentiments in your evaluation.
4. Provide an Effectiveness Score on a scale of 1-10, where 1 is completely ineffective and 10 is extremely effective.
5. Give a detailed explanation of your evaluation, including:
– Strengths of the campaign
– Areas for improvement
– How well the content addresses the objectives
– Impact of positive and negative sentiments
– Suggestions for enhancing campaign effectiveness

Provide your response in the following format:
1. Effectiveness Score: [Score]/10
2. Detailed explanation of the evaluation: [Your detailed explanation here, structured in clear paragraphs or bullet points]

Remember to be objective, specific, and constructive in your analysis. Base your evaluation solely on the provided information.
“””

body = json.dumps({
“prompt”: prompt,
“max_tokens_to_sample”: 800,
“temperature”: 0.3,
“top_p”: 1
})

response = bedrock.invoke_model(
modelId=model_id,
body=body
)

return json.loads(response[‘body’].read())

Let’s examine a step-by-step process for evaluating how effectively the generated marketing content aligns with campaign goals using audience feedback to enhance impact and drive better results.
The following diagram shows the logical flow of the application, which is executed in multiple steps, both within the application itself and through services like Amazon Bedrock.

The LLM takes several key inputs (shown in the preceding figure):

Campaign objectives – A textual description of the goals and objectives for the marketing campaign.
Positive sentiments (praises) – A summary of positive sentiments and themes extracted from the social media sentiment analysis.
Negative sentiments (flaws) – A summary of negative sentiments and critiques extracted from the social media sentiment analysis.
Generated marketing content – The content generated by the content generation LLM, such as ad copy, social media posts, and email campaigns.

The process involves the following underlying key steps (shown in the preceding figure):

Text vectorization – The campaign objectives, sentiment analysis results (positive and negative sentiments), and generated marketing content are converted into numerical vector representations using techniques such as word embeddings or Term Frequency-Inverse Document Frequency (TF-IDF).
Similarity calculation – The system calculates the similarity between the vector representations of the generated content and the campaign objectives, positive sentiments, and negative sentiments. Common similarity measures include cosine similarity or advanced transformer-based models.
Component scoring – Individual scores are computed to measure the alignment between the generated content and the campaign objectives (objective alignment score), the incorporation of positive sentiments (positive sentiment score), and the avoidance of negative sentiments (negative sentiment score).
Weighted scoring – The individual component scores are combined using a weighted average or scoring function to produce an overall effectiveness score. The weights are adjustable based on campaign priorities.
Interpretation and explanation – In addition to the numerical score, the system provides a textual explanation highlighting the content’s alignment with objectives and sentiments, along with recommendations for improvements.

The following is example output for the marketing campaign evaluation:

1. Effectiveness Score: 8/10
2. Detailed explanation of the evaluation:

Campaign Objectives:
• Increase brand awareness by 20%.
• Drive a 15% increase in website traffic.
• Boost social media engagement by 25%.
• Successfully launch the ink refill product.

Positive Sentiments:
• Creative and resonant content.
• Clear messaging on cost savings and quality.
• Effective use of hashtags and emojis.
• Generated positive buzz.

Negative Sentiments:
• Tone too casual for brand image.
• Weak call to action.
• Overly focused on cost savings.

Marketing Content:
• Social media posts, email campaigns, and a website landing page.

Strengths:
• Engaging and shareable content.
• Clear communication of benefits.
• Strong initial market interest.

Areas for Improvement:
• Align tone with brand image.
• Strengthen call to action.
• Balance cost focus with value proposition.

The campaign effectiveness analysis uses advanced natural language processing (NLP) and machine learning (ML) models to evaluate how well the generated marketing content aligns with the campaign objectives while incorporating positive sentiments and avoiding negative ones. By combining these steps, marketers can create data-driven content that is more likely to resonate with their audience and achieve campaign goals.
Impact and benefits
This AI-powered approach to marketing intelligence provides several key advantages:

Cost-efficiency – By predicting campaign effectiveness upfront, companies can optimize resource allocation and minimize spending on underperforming campaigns.
Monetizable insights – The data-driven insights gained from this analysis can be valuable not only internally but also as a potential offering for other businesses in the industry.
Precision marketing – A deeper understanding of audience sentiment and content alignment allows for more targeted campaigns tailored to audience preferences.
Competitive edge – AI-driven insights enable companies to make faster, more informed decisions, staying ahead of market trends.
Enhanced ROI – Ultimately, better campaign targeting and optimization lead to higher ROI, increased revenue, and improved financial outcomes.

Additional considerations
Though the potential of this approach is significant, there are several challenges to consider:

Data quality – High-quality, diverse input data is key to effective model performance.
Model customization – Adapting pre-trained models to specific industry needs and company voice requires careful adjustment. This might involve iterative prompt engineering and model adjustments.
Ethical use of AI – Responsible AI use involves addressing issues such as privacy, bias, and transparency when analyzing public data.
System integration – Seamlessly incorporating AI insights into existing workflows can be complex and might require changes to current processes.
Prompt engineering – Crafting effective prompts for LLMs requires continuous experimentation and refinement for best results. Learn more about prompt engineering techniques.

Clean up
To avoid incurring ongoing charges, clean up your resources when you’re done with this solution.
Conclusion
The integration of generative AI and large LLMs into marketing intelligence marks a transformative advancement for the media and entertainment industry. By combining real-time sentiment analysis with AI-driven content creation and campaign effectiveness prediction, companies can make data-driven decisions, reduce costs, and enhance the impact of their marketing efforts.
Looking ahead, the evolution of generative AI—including image generation models like Stability AI’s offerings on Amazon Bedrock and Amazon Nova’s creative content generation capabilities—will further expand possibilities for personalized and visually compelling campaigns. These advancements empower marketers to generate high-quality images, videos, and text that align closely with campaign objectives, offering more engaging experiences for target audiences.
Success in this new landscape requires not only adoption of AI tools but also developing the ability to craft effective prompts, analyze AI-driven insights, and continuously optimize both content and strategy. Those who use these cutting-edge technologies will be well-positioned to thrive in the rapidly evolving digital marketing environment.

About the Authors
Arghya Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers adopt and use the AWS Cloud. He is focused on big data, data lakes, streaming and batch analytics services, and generative AI technologies.
Dhara Vaishnav is Solution Architecture leader at AWS and provides technical advisory to enterprise customers to use cutting-edge technologies in generative AI, data, and analytics. She provides mentorship to solution architects to design scalable, secure, and cost-effective architectures that align with industry best practices and customers’ long-term goals.
Mayank Agrawal is a Senior Customer Solutions Manager at AWS in San Francisco, dedicated to maximizing enterprise cloud success through strategic transformation. With over 20 years in tech and a computer science background, he transforms businesses through strategic cloud adoption. His expertise in HR systems, digital transformation, and previous leadership at Accenture helps organizations across healthcare and professional services modernize their technology landscape.
Namita Mathew is a Solutions Architect at AWS, where she works with enterprise ISV customers to build and innovate in the cloud. She is passionate about generative AI and IoT technologies and how to solve emerging business challenges.
Wesley Petry is a Solutions Architect based in the NYC area, specialized in serverless and edge computing. He is passionate about building and collaborating with customers to create innovative AWS-powered solutions that showcase the art of the possible. He frequently shares his expertise at trade shows and conferences, demonstrating solutions and inspiring others across industries.

Multimodal LLMs Without Compromise: Researchers from UCLA, UW–Madiso …

Posted on May 9, 2025 by i-genie

LLMs have made significant strides in language-related tasks such as conversational AI, reasoning, and code generation. However, human communication extends beyond text, often incorporating visual elements to enhance understanding. To create a truly versatile AI, models need the ability to process and generate text and visual information simultaneously. Training such unified vision-language models from scratch using methods like autoregressive token prediction or a hybrid approach combining diffusion and language losses has shown strong performance. Still, it requires vast computational resources and retraining for each new modality. An alternative approach adapts pretrained LLMs with vision capabilities, which offers a more efficient path but often compromises the language model’s original performance.

Current research has focused on three main strategies: merging LLMs with standalone image generation models, training large multimodal models end-to-end, or using a combination of diffusion and autoregressive losses. While these methods have achieved state-of-the-art results, they either require retraining large models or result in degradation of the LLM’s core capabilities. Despite these challenges, leveraging pretrained LLMs with added vision components has demonstrated significant potential, particularly in tasks involving image understanding and generation. However, these methods still face limitations in terms of efficiency and flexibility.

Researchers from UCLA, the University of Wisconsin-Madison, and Adobe Research propose X-Fusion, which adapts pretrained LLMs for multimodal tasks while preserving language capabilities. X-Fusion utilizes a dual-tower architecture, freezing the LLM’s language weights while adding a vision-specific tower to process visual information. The approach aligns text and vision features at multiple levels, improving performance in image-to-text and text-to-image tasks. Through ablation studies, the researchers emphasize the importance of clean image data for training and show that aligning vision features with pre-trained representations accelerates convergence, especially for smaller models.

X-Fusion is a unified framework that adapts pretrained LLMs for vision tasks while retaining their language capabilities. It uses a dual-tower design, freezing the LLM’s text weights while introducing a separate vision tower for processing visual information. Images are tokenized using a pretrained encoder, and image and text tokens are jointly optimized. The model incorporates an optional X-Fuse operation to merge features from both towers for enhanced performance. X-Fusion is trained with autoregressive and image denoising losses, and its performance is evaluated on image generation (text-to-image) and image understanding (image-to-text) tasks.

The study evaluates the Dual Tower architecture against alternative transformer variants for multimodal integration. It compares the Single Tower, Gated Tower, and Dual Projection designs, highlighting the flexibility of the Dual Tower for image and text tasks. The Dual Tower performs best in image generation and understanding, outperforming other designs by 23% in FID without increasing training parameters. The study also investigates the effects of noise and data ratios on performance, finding that clean images improve understanding and generation. Additionally, aligning vision features with a pretrained encoder like CLIP boosts performance, especially for smaller models.

In conclusion, X-Fusion is a framework that adapts pretrained LLMs to multimodal tasks, such as image understanding and generation, while preserving language capabilities. It introduces a Dual Tower architecture where language weights remain fixed, and a separate trainable vision tower processes visual features. Experimental results show that X-Fusion outperforms alternative designs in image and text-to-image tasks. Key findings include the benefits of incorporating understanding-focused data, reducing noise in image data, and the positive impact of feature alignment, especially for smaller models. The research contributes valuable insights into building efficient multimodal models.

Check out the Paper. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

ML News Community – r/machinelearningnews (92k+ members)

The post Multimodal LLMs Without Compromise: Researchers from UCLA, UW–Madison, and Adobe Introduce X-Fusion to Add Vision to Frozen Language Models Without Losing Language Capabilities appeared first on MarkTechPost.

NVIDIA Open-Sources Open Code Reasoning Models (32B, 14B, 7B)

Posted on May 9, 2025 by i-genie

NVIDIA continues to push the boundaries of open AI development by open-sourcing its Open Code Reasoning (OCR) model suite — a trio of high-performance large language models purpose-built for code reasoning and problem-solving. The 32B, 14B, and 7B variants, all released under the Apache 2.0 license.

Benchmarked to Beat the Best

The Open Code Reasoning (OCR) models come with notable benchmark achievements, outperforming OpenAI’s o3-Mini and o1 (low) models on the LiveCodeBench benchmark. LiveCodeBench is a comprehensive evaluation suite for code reasoning tasks such as debugging, code generation, and logic completion in real-world developer environments. In direct comparison, NVIDIA’s 32B OCR model tops the leaderboard in reasoning capability for open models.

This leap in performance is attributed not only to model architecture, but to NVIDIA’s custom “OCR dataset” — a high-quality, code-centric training corpus designed to emphasize instruction-following, reasoning, and multi-step code problem solving. According to NVIDIA, this results in a 30% improvement in token efficiency, allowing the models to produce accurate code and logical outputs with fewer tokens.

A Model Lineup for Every Use Case

The Open Code Reasoning suite comes in three parameter scales:

OpenCodeReasoning-Nemotron-32B

OpenCodeReasoning-Nemotron-14B

OpenCodeReasoning-Nemotron-7B

Each model balances scale with performance. The 32B variant delivers state-of-the-art results for high-performance inference and research; the 14B model provides strong reasoning capabilities with reduced compute requirements, and the 7B variant is ideal for resource-constrained environments while retaining competitive performance on benchmarks.

All models are trained using the Nemotron architecture, NVIDIA’s transformer-based backbone optimized for multilingual, multi-task learning. The model weights and configurations are available on Hugging Face:

32B Model

14B Model

7B Model

32B Instruction-Tuned Variant

Compatible with Open Inference Ecosystems

A key feature of these models is out-of-the-box compatibility with popular inference frameworks:

llama.cpp for lightweight CPU/GPU inference

vLLM for optimized GPU serving and speculative decoding

Transformers by Hugging Face for training and evaluation pipelines

TGI (Text Generation Inference) for scalable API deployment

This flexibility allows developers, researchers, and enterprises to plug these models into existing code AI infrastructure with minimal overhead.

A Step Forward for Open Code Intelligence

With this release, NVIDIA contributes significantly to the growing ecosystem of open code models. By targeting code reasoning — a domain historically dominated by proprietary models — and releasing under a fully open and permissive license, NVIDIA empowers the broader AI and developer community to build, fine-tune, and deploy advanced reasoning models in production.

The Open Code Reasoning suite adds to NVIDIA’s growing portfolio of open LLMs and strengthens its stance on accessible, transparent AI development. Whether you’re building developer copilots, automated code review agents, or code generation services, these models offer a high-performing, cost-effective, and community-friendly alternative to closed solutions.

Check out the 32B Model, 14B Model, 7B Model and 32B Instruction-Tuned Variant. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

ML News Community – r/machinelearningnews (92k+ members)

The post NVIDIA Open-Sources Open Code Reasoning Models (32B, 14B, 7B) appeared first on MarkTechPost.

Hugging Face Releases nanoVLM: A Pure PyTorch Library to Train a Visio …

Posted on May 9, 2025 by i-genie

In a notable step toward democratizing vision-language model development, Hugging Face has released nanoVLM, a compact and educational PyTorch-based framework that allows researchers and developers to train a vision-language model (VLM) from scratch in just 750 lines of code. This release follows the spirit of projects like nanoGPT by Andrej Karpathy—prioritizing readability and modularity without compromising on real-world applicability.

nanoVLM is a minimalist, PyTorch-based framework that distills the core components of vision-language modeling into just 750 lines of code. By abstracting only what’s essential, it offers a lightweight and modular foundation for experimenting with image-to-text models, suitable for both research and educational use.

Technical Overview: A Modular Multimodal Architecture

At its core, nanoVLM combines together a visual encoder, a lightweight language decoder, and a modality projection mechanism to bridge the two. The vision encoder is based on SigLIP-B/16, a transformer-based architecture known for its robust feature extraction from images. This visual backbone transforms input images into embeddings that can be meaningfully interpreted by the language model.

On the textual side, nanoVLM uses SmolLM2, a causal decoder-style transformer that has been optimized for efficiency and clarity. Despite its compact nature, it is capable of generating coherent, contextually relevant captions from visual representations.

The fusion between vision and language is handled via a straightforward projection layer, aligning the image embeddings into the language model’s input space. The entire integration is designed to be transparent, readable, and easy to modify—perfect for educational use or rapid prototyping.

Performance and Benchmarking

While simplicity is a defining feature of nanoVLM, it still achieves surprisingly competitive results. Trained on 1.7 million image-text pairs from the open-source the_cauldron dataset, the model reaches 35.3% accuracy on the MMStar benchmark—a metric comparable to larger models like SmolVLM-256M, but using fewer parameters and significantly less compute.

The pre-trained model released alongside the framework, nanoVLM-222M, contains 222 million parameters, balancing scale with practical efficiency. It demonstrates that thoughtful architecture, not just raw size, can yield strong baseline performance in vision-language tasks.

This efficiency also makes nanoVLM particularly suitable for low-resource settings—whether it’s academic institutions without access to massive GPU clusters or developers experimenting on a single workstation.

Designed for Learning, Built for Extension

Unlike many production-level frameworks which can be opaque and over-engineered, nanoVLM emphasizes transparency. Each component is clearly defined and minimally abstracted, allowing developers to trace data flow and logic without navigating a labyrinth of interdependencies. This makes it ideal for educational purposes, reproducibility studies, and workshops.

nanoVLM is also forward-compatible. Thanks to its modularity, users can swap in larger vision encoders, more powerful decoders, or different projection mechanisms. It’s a solid base to explore cutting-edge research directions—whether that’s cross-modal retrieval, zero-shot captioning, or instruction-following agents that combine visual and textual reasoning.

Accessibility and Community Integration

In keeping with Hugging Face’s open ethos, both the code and the pre-trained nanoVLM-222M model are available on GitHub and the Hugging Face Hub. This ensures integration with Hugging Face tools like Transformers, Datasets, and Inference Endpoints, making it easier for the broader community to deploy, fine-tune, or build on top of nanoVLM.

Given Hugging Face’s strong ecosystem support and emphasis on open collaboration, it’s likely that nanoVLM will evolve with contributions from educators, researchers, and developers alike.

Conclusion

nanoVLM is a refreshing reminder that building sophisticated AI models doesn’t have to be synonymous with engineering complexity. In just 750 lines of clean PyTorch code, Hugging Face has distilled the essence of vision-language modeling into a form that’s not only usable, but genuinely instructive.

As multimodal AI becomes increasingly important across domains—from robotics to assistive technology—tools like nanoVLM will play a critical role in onboarding the next generation of researchers and developers. It may not be the largest or most advanced model on the leaderboard, but its impact lies in its clarity, accessibility, and extensibility.

Check out the Model and Repo. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

ML News Community – r/machinelearningnews (92k+ members)

The post Hugging Face Releases nanoVLM: A Pure PyTorch Library to Train a Vision-Language Model from Scratch in 750 Lines of Code appeared first on MarkTechPost.

Google Launches Gemini 2.5 Pro I/O: Outperforms GPT-4 in Coding, Suppo …

Posted on May 8, 2025 by i-genie

Just ahead of its annual I/O developer conference, Google has released an early preview of Gemini 2.5 Pro (I/O Edition)—a substantial update to its flagship AI model focused on software development and multimodal reasoning and understanding. This latest version delivers marked improvements in coding accuracy, web application generation, and video-based understanding, placing it at the forefront of large model evaluation leaderboards.

With top rankings in LM Arena’s WebDev and Coding categories, Gemini 2.5 Pro I/O emerges as a serious contender in applied AI programming assistance and multimodal intelligence.

Leading in Web App Development: Top of WebDev Arena

The I/O Edition distinguishes itself in frontend software development, achieving the top spot on the WebDev Arena leaderboard—a benchmark based on human evaluation of generated web applications. Compared to its predecessor, the model improves by +147 Elo points, underscoring meaningful progress in quality and consistency.

Key capabilities include:

End-to-End Frontend GenerationGemini 2.5 Pro I/O generates complete browser-ready applications from a single prompt. Outputs include well-structured HTML, responsive CSS, and functional JavaScript—reducing the need for iterative prompts or post-processing.

High-Fidelity UI GenerationThe model interprets structured UI prompts with precision, producing readable and modular code components that are suitable for direct deployment or integration into existing codebases.

Consistency Across ModalitiesOutputs remain consistent across various frontend tasks, enabling developers to use the model for layout prototyping, styling, and even component-level rendering.

This makes Gemini particularly valuable in streamlining frontend workflows, from mockup to functional prototype.

General Coding Performance: Outpacing GPT-4 and Claude 3.7

Beyond web development, Gemini 2.5 Pro I/O shows strong general-purpose coding capabilities. It now ranks first in LM Arena’s coding benchmark, ahead of competitors such as GPT-4 and Claude 3.7 Sonnet.

Notable enhancements include:

Multi-Step Programming SupportThe model can perform chained tasks such as code refactoring, optimization, and cross-language translation with increased accuracy.

Improved Tool UseGoogle reports a reduction in tool-calling errors during internal testing—an important milestone for real-time development scenarios where tool invocation is tightly coupled with model output.

Structured Instructions via Vertex AIIn enterprise environments, the model supports structured system instructions, giving teams greater control over execution flow, especially in multi-agent or workflow-based systems.

Together, these improvements make the I/O Edition a more reliable assistant for tasks that go beyond single-function completions—supporting real-world software development practices.

Native Video Understanding and Multimodal Contexts

In a notable leap toward generalist AI, Gemini 2.5 Pro I/O introduces built-in support for video understanding. The model scores 84.8% on the VideoMME benchmark, indicating robust performance in spatial-temporal reasoning tasks.

Key features include:

Direct Video-to-Structure UnderstandingDevelopers can feed video inputs into AI Studio and receive structured outputs—eliminating the need for manual intermediate steps or model switching.

Unified Multimodal Context WindowThe model accepts extended, multimodal sequences—text, image, and video—within a single context. This simplifies the development of cross-modal workflows where continuity and memory retention are essential.

Application ReadinessVideo understanding is integrated into AI Studio today, with extended capabilities available through Vertex AI, making the model immediately usable for enterprise-facing tools.

This makes Gemini suitable for a range of new use cases, from video content summarization and instructional QA to dynamic UI adaptation based on video feeds.

Deployment and Integration

Gemini 2.5 Pro I/O is now available across key Google platforms:

Google AI Studio: For interactive experimentation and rapid prototyping

Vertex AI: For enterprise-grade deployment with support for system-level configuration and tool use

Gemini App: For general access via natural language interfaces

While the model does not yet support fine-tuning, it accepts prompt-based customization and structured input/output, making it adaptable for task-specific pipelines without retraining.

Conclusion

Gemini 2.5 Pro I/O marks a significant step forward in making large language models practically useful for developers and enterprises alike. Its leadership on both WebDev and coding leaderboards, combined with native support for multimodal input, illustrates Google’s growing emphasis on real-world applicability.

Rather than focusing solely on raw language modeling benchmarks, this release prioritizes functional quality—offering developers structured, accurate, and context-aware outputs across a diverse range of tasks. With Gemini 2.5 Pro I/O, Google continues to shape the future of developer-centric AI systems.

Check out the Technical details and Try it here. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

ML News Community – r/machinelearningnews (92k+ members)

The post Google Launches Gemini 2.5 Pro I/O: Outperforms GPT-4 in Coding, Supports Native Video Understanding and Leads WebDev Arena appeared first on MarkTechPost.

Researchers from Fudan University Introduce Lorsa: A Sparse Attention …

Posted on May 8, 2025 by i-genie

Large Language Models (LLMs) have gained significant attention in recent years, yet understanding their internal mechanisms remains challenging. When examining individual attention heads in Transformer models, researchers have identified specific functionalities in some heads, such as induction heads that predict tokens like ‘Potter’ following ‘Harry’ when the phrase appears in context. Ablation studies confirm these heads’ causal relationship to model behaviours. However, most attention heads distribute focus across diverse contexts without clear functionality. The challenge lies in interpreting these complex attention patterns, as inter-head collaboration often occurs rather than isolated functionality. This phenomenon resembles feature superposition in neural interpretation, suggesting the existence of attention superposition in Multi-Head Self-Attention (MHSA) mechanisms. Understanding these complex interactions is crucial for developing more transparent and controllable language models.

Previous research has made significant strides in explaining individual attention head functionality using techniques like activation patching and path patching. These approaches have identified several specialised attention heads in transformer models, including composition heads, induction heads, name mover heads, number comparison heads, copy suppression heads, successor heads, and long context retrieval heads. However, the superposition hypothesis suggests that neurons relate to multiple non-orthogonal underlying features rather than single functionalities. Sparse Autoencoders have emerged as a promising method to extract overcomplete sets of sparse, linearly comprehensible features from neural networks. The success of these autoencoders demonstrates the universality of superposition across various dimensions, including model size, architecture types, and even different modalities. These methods, while valuable, still struggle to fully explain the complex interactions between attention heads and their collaborative behaviour in language models.

The research from the Shanghai Innovation Institute, OpenMOSS Team, School of Computer Science, Fudan University introduce Low-Rank Sparse Attention (Lorsa), a robust approach to disentangle atomic attention units from attention superposition. Lorsa replaces standard Multi-Head Self-Attention with an overcomplete set of attention heads that feature single-dimensional OV circuits and sparsity constraints. To evaluate Lorsa, researchers developed an exploration interface that provides comprehensive information on each Lorsa head, quantitatively assessing interpretability through top activations and attribution patterns. Results demonstrate that Lorsa’s monosemanticity compares favorably to Sparse Autoencoder features. The method was tested on both Pythia-160M and Llama-3.1-8B models, successfully identifying known attention mechanisms such as induction heads, name mover heads, successor heads, and attention sinks. Further analysis revealed arithmetic-specific Lorsa heads in Llama-3.1-8B and identified thematic anchor heads exhibiting long-range, topic-specific attention patterns. This approach provides unprecedented visibility into transformer attention mechanisms.

Attention superposition in Transformer models parallels how neurons represent more features than their dimensions. The research hypothesises that MHSA comprises multiple attention units in superposition, each attending between specific token pairs with interpretable read/write operations on the residual stream. This hypothesis suggests atomic attention units spread across multiple MHSA heads, while individual heads contain multiple units.

Three key pieces of evidence support attention superposition: First, polysemantic heads respond to unrelated inputs, like successor heads that increment days, numbers, and exhibit acronym/copying behaviours simultaneously. Second, most attention heads lack clear interpretation patterns, with studies showing failed interpretation attempts for over 90% of GPT-2 heads. Third, direct observations show attention output features collectively contributed by multiple heads, with approximately 25% of learned attention units spread across multiple MHSA heads.

Understanding attention superposition matters significantly for two key reasons. First, attribution-based circuit tracing becomes challenging when features compute collectively, as individual Query-Key patterns may be misled due to interference from other features within the same heads. Second, the structure of attention superposition may reveal important model biology motifs, raising questions about why certain attention units, like induction heads, are implemented by single MHSA heads while others exist in superposition.

The Lorsa architecture addresses these challenges through several innovative design elements. Lorsa is trained to predict MHSA outputs by minimising mean square error. It employs one-dimensional OV circuits that restrict read/write operations to specific residual stream features, aligning with the linear representation hypothesis. For Query and Key weights, Lorsa implements parameter sharing across every DLorsa QK head, maintaining parameter efficiency while preserving performance. This strategy makes Lorsa QK circuits similar to MHSA but with sparsity constraints on each OV dimension.

Lorsa employs orders of magnitude more heads than standard MHSA while activating only a small subset per token. For each position, Lorsa’s output aggregates only the top-K heads with the largest activation values, with the active head subset varying dynamically across token positions. This approach resembles TopK-SAEs, selecting the most salient linear components. While similar to attention Sparse Autoencoders, Lorsa differs in that its head activations derive from attention patterns of previous tokens rather than simple linear encoders with ReLU.

Lorsa’s interpretability assessment employs several key metrics to understand individual head functionality. Top activations help identify patterns by examining the 16 highest-activating tokens for each Lorsa head across 100 million samples from held-out data. The z pattern analysis decomposes activations linearly into token-wise contributions from preceding positions, revealing which previous tokens contribute to current activations. This approach parallels direct feature attribution analysis used for attention Sparse Autoencoders, but with simpler attribution involving just one one-dimensional OV circuit and a single QK circuit.

A visualisation dashboard provides comprehensive information about each Lorsa head. For example, a “you”-specific induction head shows several important patterns: it primarily reads from features indicating the current token is “you”/”your” through its weight vector, strongly activates a “say you” feature that amplifies the logit of “you,” and increases prediction probabilities for various “you” tokens. The QK attention pattern computation involves current token features at the query position and previous token features where the current token is “you,” with the previous token often being words like “with,” “thank,” or “do.” Interestingly, this particular Lorsa head is almost equally distributed between two MHSA heads (5.0 and 5.7), demonstrating how Lorsa successfully disentangles attention units that exist across multiple standard attention heads.

Results confirm Lorsa’s effectiveness in identifying known attention mechanisms across different models. Using path patching, researchers rediscovered previously documented monosemantic heads in Pythia-160M, including induction heads, name mover heads, copy suppression heads, successor heads, and attention sinks. In Llama-3.1-8B, they identified arithmetic-specific Lorsa heads that activate during simple arithmetic operations, with each head using distinct heuristics to fetch operands. In addition to this, they discovered “thematic anchor” heads that exhibit long-range attention to topically related tokens, suggesting a mechanism for maintaining persistent topic representations that bias subsequent token predictions toward domain-appropriate vocabulary and structures.

Low-Rank Sparse Attention successfully disentangles atomic attention units from attention superposition in Transformer models. The method effectively recovers known attention mechanisms while uncovering new interpretable behaviours, demonstrating its value for neural network interpretability. Despite these advances, significant challenges remain in unbinding QK circuits to achieve fully independent heads and reducing superposition effects. Future research directions include exploring low-dimensional QK structures, cross-layer superposition, and systematic Q/K/V composition.

Check out the Paper, Model on Hugging Face and GitHub Page. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

ML News Community – r/machinelearningnews (92k+ members)

The post Researchers from Fudan University Introduce Lorsa: A Sparse Attention Mechanism That Recovers Atomic Attention Units Hidden in Transformer Superposition appeared first on MarkTechPost.

A Step-by-Step Guide to Implement Intelligent Request Routing with Cla …

Posted on May 8, 2025 by i-genie

This article demonstrates how to build an intelligent routing system powered by Anthropic’s Claude models. This system improves response efficiency and quality by automatically classifying user requests and directing them to specialised handlers. The workflow analyses incoming queries, determines their intent, and routes them to appropriate processing pipelines—whether for customer support, technical assistance, or other domain-specific responses.

Step 1: Install the required Python packages

Copy CodeCopiedUse a different Browser!pip install anthropic pandas scikit-learn

Step 2: Import the necessary libraries for the project

Copy CodeCopiedUse a different Browserimport os
import json
import time
import pandas as pd
import numpy as np
from anthropic import Anthropic
from IPython.display import display, Markdown
from sklearn.metrics import classification_report

Step 3: Set up the Anthropic API authentication by defining your API key and initialising the Anthropic client

Copy CodeCopiedUse a different BrowserANTHROPIC_API_KEY = “{Your API KEY}”
client = Anthropic(api_key=ANTHROPIC_API_KEY)

Step 4: Create a sample dataset of customer queries with associated categories for training and testing the routing system.

Copy CodeCopiedUse a different Browsercustomer_queries = [
{“id”: 1, “query”: “What are your business hours?”, “category”: “General Question”},
{“id”: 2, “query”: “How do I reset my password?”, “category”: “Technical Support”},
{“id”: 3, “query”: “I want a refund for my purchase.”, “category”: “Refund Request”},
{“id”: 4, “query”: “Where can I find your privacy policy?”, “category”: “General Question”},
{“id”: 5, “query”: “The app keeps crashing when I try to upload photos.”, “category”: “Technical Support”},
{“id”: 6, “query”: “I ordered the wrong size, can I get my money back?”, “category”: “Refund Request”},
{“id”: 7, “query”: “Do you ship internationally?”, “category”: “General Question”},
{“id”: 8, “query”: “My account is showing incorrect information.”, “category”: “Technical Support”},
{“id”: 9, “query”: “I was charged twice for my order.”, “category”: “Refund Request”},
{“id”: 10, “query”: “What payment methods do you accept?”, “category”: “General Question”}
]

Step 5: Convert the customer queries list into a pandas DataFrame for easier manipulation and analysis. Then, display the DataFrame in the notebook to visualise the training dataset structure.

Copy CodeCopiedUse a different Browserdf = pd.DataFrame(customer_queries)
display(df)

Step 6: Define the core routing function that uses Claude 3.7 Sonnet to classify customer queries into predefined categories.

Copy CodeCopiedUse a different Browserdef route_query(query, client):
“””
Route a customer query to the appropriate category using Claude 3.5 Haiku.

Args:
query (str): The customer query to classify
client: Anthropic client

Returns:
str: The classified category
“””
system_prompt = “””
You are a query classifier for a customer service system.
Your job is to categorize customer queries into exactly one of these categories:
1. General Question – Basic inquiries about the company, products, policies, etc.
2. Refund Request – Any query related to refunds, returns, or billing issues
3. Technical Support – Questions about technical problems, bugs, or how to use products

Respond with ONLY the category name, nothing else.
“””

try:
response = client.messages.create(
model=”claude-3-7-sonnet-20250219″,
max_tokens=1024,
system=system_prompt,
messages=[{“role”: “user”, “content”: query}]
)

category = response.content[0].text.strip()

valid_categories = [“General Question”, “Refund Request”, “Technical Support”]
for valid_cat in valid_categories:
if valid_cat.lower() in category.lower():
return valid_cat

return “General Question”

except Exception as e:
print(f”Error in routing: {e}”)
return “General Question”

Step 7: Define three specialised handler functions for each query category, each using Claude 3.5 Sonnet with a category-specific system prompt.

Copy CodeCopiedUse a different Browserdef handle_general_question(query, client):
“””Handle general inquiries using Claude 3.5 Haiku.”””
system_prompt = “””
You are a customer service representative answering general questions about our company.
Be helpful, concise, and friendly. Provide direct answers to customer queries.
“””

try:
response = client.messages.create(
model=”claude-3-7-sonnet-20250219″,
max_tokens=1024,
system=system_prompt,
messages=[{“role”: “user”, “content”: query}]
)
return response.content[0].text.strip()
except Exception as e:
print(f”Error in general question handler: {e}”)
return “I apologize, but I’m having trouble processing your request. Please try again later.”

Copy CodeCopiedUse a different Browserdef handle_refund_request(query, client):
“””Handle refund requests using Claude 3.5 Sonnet for more nuanced responses.”””
system_prompt = “””
You are a customer service representative specializing in refunds and billing issues.
Respond to refund requests professionally and helpfully.
For any refund request, explain the refund policy clearly and provide next steps.
Be empathetic but follow company policy.
“””

try:
response = client.messages.create(
model=”claude-3-7-sonnet-20250219″,
max_tokens=1024,
system=system_prompt,
messages=[{“role”: “user”, “content”: query}]
)
return response.content[0].text.strip()
except Exception as e:
print(f”Error in refund request handler: {e}”)
return “I apologize, but I’m having trouble processing your refund request. Please contact our support team directly.”

Copy CodeCopiedUse a different Browserdef handle_technical_support(query, client):
“””Handle technical support queries using Claude 3.5 Sonnet for more detailed technical responses.”””
system_prompt = “””
You are a technical support specialist.
Provide clear, step-by-step solutions to technical problems.
If you need more information to resolve an issue, specify what information you need.
Prioritize simple solutions first before suggesting complex troubleshooting.
“””

try:
response = client.messages.create(
model=”claude-3-7-sonnet-20250219″,
max_tokens=1024,
system=system_prompt,
messages=[{“role”: “user”, “content”: query}]
)
return response.content[0].text.strip()
except Exception as e:
print(f”Error in technical support handler: {e}”)
return “I apologize, but I’m having trouble processing your technical support request. Please try our knowledge base or contact our support team.”

Step 8: Create the main workflow function that orchestrates the entire routing process. This function first classifies a query, tracks timing metrics, directs it to the appropriate specialised handler based on category, and returns a comprehensive results dictionary with performance statistics.

Copy CodeCopiedUse a different Browserdef process_customer_query(query, client):
“””
Process a customer query through the complete routing workflow.

Args:
query (str): The customer query
client: Anthropic client

Returns:
dict: Information about the query processing, including category and response
“””
start_time = time.time()
category = route_query(query, client)
routing_time = time.time() – start_time

start_time = time.time()
if category == “General Question”:
response = handle_general_question(query, client)
model_used = “claude-3-5-haiku-20240307”
elif category == “Refund Request”:
response = handle_refund_request(query, client)
model_used = “claude-3-5-sonnet-20240620”
elif category == “Technical Support”:
response = handle_technical_support(query, client)
model_used = “claude-3-5-sonnet-20240620”
else:
response = handle_general_question(query, client)
model_used = “claude-3-5-haiku-20240307”

handling_time = time.time() – start_time
total_time = routing_time + handling_time

return {
“query”: query,
“routed_category”: category,
“response”: response,
“model_used”: model_used,
“routing_time”: routing_time,
“handling_time”: handling_time,
“total_time”: total_time
}

Step 9: Process each query in the sample dataset through the routing workflow, collect the results with actual vs. predicted categories, and evaluate the system’s performance.

Copy CodeCopiedUse a different Browserresults = []

for _, row in df.iterrows():
query = row[‘query’]
result = process_customer_query(query, client)
result[“actual_category”] = row[‘category’]
results.append(result)

results_df = pd.DataFrame(results)
display(results_df[[“query”, “actual_category”, “routed_category”, “model_used”, “total_time”]])

accuracy = (results_df[“actual_category”] == results_df[“routed_category”]).mean()
print(f”Routing Accuracy: {accuracy:.2%}”)

from sklearn.metrics import classification_report
print(classification_report(results_df[“actual_category”], results_df[“routed_category”]))

Step 10: Simulated results.

Copy CodeCopiedUse a different Browsersimulated_results = []
for _, row in df.iterrows():
query = row[‘query’]
actual_category = row[‘category’]

if “hours” in query.lower() or “policy” in query.lower() or “ship” in query.lower() or “payment” in query.lower():
routed_category = “General Question”
model_used = “claude-3-5-haiku-20240307”
elif “refund” in query.lower() or “money back” in query.lower() or “charged” in query.lower():
routed_category = “Refund Request”
model_used = “claude-3-5-sonnet-20240620”
else:
routed_category = “Technical Support”
model_used = “claude-3-5-sonnet-20240620”

simulated_results.append({
“query”: query,
“actual_category”: actual_category,
“routed_category”: routed_category,
“model_used”: model_used,
“routing_time”: np.random.uniform(0.2, 0.5),
“handling_time”: np.random.uniform(0.5, 2.0)
})

simulated_df = pd.DataFrame(simulated_results)
simulated_df[“total_time”] = simulated_df[“routing_time”] + simulated_df[“handling_time”]
display(simulated_df[[“query”, “actual_category”, “routed_category”, “model_used”, “total_time”]])

Output

Step 11: Calculate and display the accuracy of the simulated routing system by comparing predicted categories with actual categories.

Copy CodeCopiedUse a different Browseraccuracy = (simulated_df[“actual_category”] == simulated_df[“routed_category”]).mean()
print(f”Simulated Routing Accuracy: {accuracy:.2%}”)

print(classification_report(simulated_df[“actual_category”], simulated_df[“routed_category”]))

Step 12: Create an interactive demo interface using IPython widgets.

Copy CodeCopiedUse a different Browserfrom IPython.display import HTML, display, clear_output
from ipywidgets import widgets

def create_demo_interface():
query_input = widgets.Textarea(
value=”,
placeholder=’Enter your customer service query here…’,
description=’Query:’,
disabled=False,
layout=widgets.Layout(width=’80%’, height=’100px’)
)

output = widgets.Output()

button = widgets.Button(
description=’Process Query’,
disabled=False,
button_style=’primary’,
tooltip=’Click to process the query’,
icon=’check’
)

def on_button_clicked(b):
with output:
clear_output()
query = query_input.value

if not query.strip():
print(“Please enter a query.”)
return

if “hours” in query.lower() or “policy” in query.lower() or “ship” in query.lower() or “payment” in query.lower():
category = “General Question”
model = “claude-3-5-haiku-20240307”
response = “Our standard business hours are Monday through Friday, 9 AM to 6 PM Eastern Time. Our customer service team is available during these hours to assist you.”
elif “refund” in query.lower() or “money back” in query.lower() or “charged” in query.lower():
category = “Refund Request”
model = “claude-3-5-sonnet-20240620”
response = “I understand you’re looking for a refund. Our refund policy allows returns within 30 days of purchase with a valid receipt. To initiate your refund, please provide your order number and the reason for the return.”
else:
category = “Technical Support”
model = “claude-3-5-sonnet-20240620”
response = “I’m sorry to hear you’re experiencing technical issues. Let’s troubleshoot this step by step. First, try restarting the application. If that doesn’t work, please check if the app is updated to the latest version.”

print(f”Routed to: {category}”)
print(f”Using model: {model}”)
print(“nResponse:”)
print(response)

button.on_click(on_button_clicked)

return widgets.VBox([query_input, button, output])

Output

Step 13: Implement an advanced routing function that not only classifies queries but also provides confidence scores and reasoning for each classification.

Copy CodeCopiedUse a different Browserdef advanced_route_query(query, client):
“””
An advanced routing function that includes confidence scores and fallback mechanisms.

Args:
query (str): The customer query to classify
client: Anthropic client

Returns:
dict: Classification result with category and confidence
“””
system_prompt = “””
You are a query classifier for a customer service system.
Your job is to categorize customer queries into exactly one of these categories:
1. General Question – Basic inquiries about the company, products, policies, etc.
2. Refund Request – Any query related to refunds, returns, or billing issues
3. Technical Support – Questions about technical problems, bugs, or how to use products

Respond in JSON format with:
1. “category”: The most likely category
2. “confidence”: A confidence score between 0 and 1
3. “reasoning”: A brief explanation of your classification

Example response:
{
“category”: “General Question”,
“confidence”: 0.85,
“reasoning”: “The query asks about business hours, which is basic company information.”
}
“””

try:
response = client.messages.create(
model=”claude-3-5-sonnet-20240620″,
max_tokens=150,
system=system_prompt,
messages=[{“role”: “user”, “content”: query}]
)

response_text = response.content[0].text.strip()

try:
result = json.loads(response_text)
if “category” not in result or “confidence” not in result:
raise ValueError(“Incomplete classification result”)

return result
except json.JSONDecodeError:
print(“Failed to parse JSON response. Using simple classification.”)
if “general” in response_text.lower():
return {“category”: “General Question”, “confidence”: 0.6, “reasoning”: “Fallback classification”}
elif “refund” in response_text.lower():
return {“category”: “Refund Request”, “confidence”: 0.6, “reasoning”: “Fallback classification”}
else:
return {“category”: “Technical Support”, “confidence”: 0.6, “reasoning”: “Fallback classification”}

except Exception as e:
print(f”Error in advanced routing: {e}”)
return {“category”: “General Question”, “confidence”: 0.3, “reasoning”: “Error fallback”}

Step 14: Create an enhanced query processing workflow with confidence-based routing that escalates low-confidence queries to specialised handling, incorporating simulated classification for demonstration purposes.

Copy CodeCopiedUse a different Browserdef advanced_process_customer_query(query, client, confidence_threshold=0.7):
“””
Process a customer query with confidence-based routing.

Args:
query (str): The customer query
client: Anthropic client
confidence_threshold (float): Minimum confidence score for automated routing

Returns:
dict: Information about the query processing
“””
start_time = time.time()

if “hours” in query.lower() or “policy” in query.lower() or “ship” in query.lower() or “payment” in query.lower():
classification = {
“category”: “General Question”,
“confidence”: np.random.uniform(0.7, 0.95),
“reasoning”: “Query related to business information”
}
elif “refund” in query.lower() or “money back” in query.lower() or “charged” in query.lower():
classification = {
“category”: “Refund Request”,
“confidence”: np.random.uniform(0.7, 0.95),
“reasoning”: “Query mentions refunds or billing issues”
}
elif “password” in query.lower() or “crash” in query.lower() or “account” in query.lower():
classification = {
“category”: “Technical Support”,
“confidence”: np.random.uniform(0.7, 0.95),
“reasoning”: “Query mentions technical problems”
}
else:
categories = [“General Question”, “Refund Request”, “Technical Support”]
classification = {
“category”: np.random.choice(categories),
“confidence”: np.random.uniform(0.4, 0.65),
“reasoning”: “Uncertain classification”
}

routing_time = time.time() – start_time

start_time = time.time()

if classification[“confidence”] >= confidence_threshold:
category = classification[“category”]
if category == “General Question”:
response = “SIMULATED GENERAL QUESTION RESPONSE: I’d be happy to help with your question about our business.”
model_used = “claude-3-5-haiku-20240307”
elif category == “Refund Request”:
response = “SIMULATED REFUND REQUEST RESPONSE: I understand you’re looking for a refund. Let me help you with that process.”
model_used = “claude-3-5-sonnet-20240620”
elif category == “Technical Support”:
response = “SIMULATED TECHNICAL SUPPORT RESPONSE: I see you’re having a technical issue. Let’s troubleshoot this together.”
model_used = “claude-3-5-sonnet-20240620”
else:
response = “I apologize, but I’m not sure how to categorize your request.”
model_used = “claude-3-5-sonnet-20240620”
else:
response = “SIMULATED ESCALATION RESPONSE: Your query requires special attention. I’ll have our advanced support system help you with this complex request.”
model_used = “claude-3-5-sonnet-20240620”
category = “Escalated (Low Confidence)”

handling_time = time.time() – start_time
total_time = routing_time + handling_time

return {
“query”: query,
“routed_category”: classification[“category”],
“confidence”: classification[“confidence”],
“reasoning”: classification[“reasoning”],
“final_category”: category,
“response”: response,
“model_used”: model_used,
“routing_time”: routing_time,
“handling_time”: handling_time,
“total_time”: total_time
}

Step 15: Test the advanced routing system with diverse sample queries.

Copy CodeCopiedUse a different Browsertest_queries = [
“What are your business hours?”,
“I need a refund for my order #12345”,
“My app keeps crashing when I try to save photos”,
“I received the wrong item in my shipment”,
“How do I change my shipping address?”,
“I’m not sure if my payment went through”,
“The product description was misleading”
]

advanced_results = []
for query in test_queries:
result = advanced_process_customer_query(query, None, 0.7)
advanced_results.append(result)

advanced_df = pd.DataFrame(advanced_results)
display(advanced_df[[“query”, “routed_category”, “confidence”, “final_category”, “model_used”]])

print(“nRouting Distribution:”)
print(advanced_df[“final_category”].value_counts())

print(f”nAverage Confidence: {advanced_df[‘confidence’].mean():.2f}”)

escalated = (advanced_df[“final_category”] == “Escalated (Low Confidence)”).sum()
print(f”Escalated Queries: {escalated} ({escalated/len(advanced_df):.1%})”)

Output

Step 16: Define a utility function to calculate key performance metrics for the routing system, including processing times, confidence levels, escalation rates, and category distribution statistics.

Copy CodeCopiedUse a different Browserdef calculate_routing_metrics(results_df):
“””
Calculate key metrics for routing performance.

Args:
results_df (DataFrame): Results of routing tests

Returns:
dict: Key performance metrics
“””
metrics = {
“total_queries”: len(results_df),
“avg_routing_time”: results_df[“routing_time”].mean(),
“avg_handling_time”: results_df[“handling_time”].mean(),
“avg_total_time”: results_df[“total_time”].mean(),
“avg_confidence”: results_df[“confidence”].mean(),
“escalation_rate”: (results_df[“final_category”] == “Escalated (Low Confidence)”).mean(),
}

category_distribution = results_df[“routed_category”].value_counts(normalize=True).to_dict()
metrics[“category_distribution”] = category_distribution

return metrics

Step 17: Generate and display a comprehensive performance report for the routing system.

Copy CodeCopiedUse a different Browsermetrics = calculate_routing_metrics(advanced_df)

print(“Routing System Performance Metrics:”)
print(f”Total Queries: {metrics[‘total_queries’]}”)
print(f”Average Routing Time: {metrics[‘avg_routing_time’]:.3f} seconds”)
print(f”Average Handling Time: {metrics[‘avg_handling_time’]:.3f} seconds”)
print(f”Average Total Time: {metrics[‘avg_total_time’]:.3f} seconds”)
print(f”Average Confidence: {metrics[‘avg_confidence’]:.2f}”)
print(f”Escalation Rate: {metrics[‘escalation_rate’]:.1%}”)
print(“nCategory Distribution:”)
for category, percentage in metrics[“category_distribution”].items():
print(f” {category}: {percentage:.1%}”)

Output

This intelligent request routing system demonstrates how Claude models can efficiently classify and handle diverse customer queries. By implementing category-specific handlers with appropriate model selection, the system delivers tailored responses while maintaining high accuracy. The confidence-based routing with escalation paths ensures complex queries receive specialised attention, creating a robust, scalable customer service solution.

Check out the Colab Notebook here. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)

The post A Step-by-Step Guide to Implement Intelligent Request Routing with Claude appeared first on MarkTechPost.

How Deutsche Bahn redefines forecasting using Chronos models – Now a …

Posted on May 8, 2025 by i-genie

This post is co-written with Kilian Zimmerer and Daniel Ringler from Deutsche Bahn.
Every day, Deutsche Bahn (DB) moves over 6.6 million passengers across Germany, requiring precise time series forecasting for a wide range of purposes. However, building accurate forecasting models traditionally required significant expertise and weeks of development time.
Today, we’re excited to explore how the time series foundation model Chronos-Bolt, recently launched on Amazon Bedrock Marketplace and available through Amazon SageMaker JumpStart, is revolutionizing time series forecasting by enabling accurate predictions with minimal effort. Whereas traditional forecasting methods typically rely on statistical modeling, Chronos treats time series data as a language to be modeled and uses a pre-trained FM to generate forecasts — similar to how large language models (LLMs) generate texts. Chronos helps you achieve accurate predictions faster, significantly reducing development time compared to traditional methods.
In this post, we share how Deutsche Bahn is redefining forecasting using Chronos models, and provide an example use case to demonstrate how you can get started using Chronos.
Chronos: Learning the language of time series
The Chronos model family represents a breakthrough in time series forecasting by using language model architectures. Unlike traditional time series forecasting models that require training on specific datasets, Chronos can be used for forecasting immediately. The original Chronos model quickly became the number #1 most downloaded model on Hugging Face in 2024, demonstrating the strong demand for FMs in time series forecasting.
Building on this success, we recently launched Chronos-Bolt, which delivers higher zero-shot accuracy compared to original Chronos models. It offers the following improvements:

Up to 250 times faster inference
20 times better memory efficiency
CPU deployment support, making hosting costs up to 10 times less expensive

Now, you can use Amazon Bedrock Marketplace to deploy Chronos-Bolt. Amazon Bedrock Marketplace is a new capability in Amazon Bedrock that enables developers to discover, test, and use over 100 popular, emerging, and specialized FMs alongside the current selection of industry-leading models in Amazon Bedrock.
The challenge
Deutsche Bahn, Germany’s national railway company, serves over 1.8 billion passengers annually in long distance and regional rail passenger transport, making it one of the world’s largest railway operators. For more than a decade, Deutsche Bahn has been innovating together with AWS. AWS is the primary cloud provider for Deutsche Bahn and a strategic partner of DB Systel, a wholly owned subsidiary of DB AG that drives digitalization across all group companies.
Previously, Deutsche Bahn’s forecasting processes were highly heterogeneous across teams, requiring significant effort for each new use case. Different data sources required using multiple specialized forecasting methods, resulting in cost- and time-intensive manual effort. Company-wide, Deutsche Bahn identified dozens of different and independently operated forecasting processes. Smaller teams found it hard to justify developing customized forecasting solutions for their specific needs.
For example, the data analysis platform for passenger train stations of DB InfraGO AG integrates and analyzes diverse data sources, from weather data and SAP Plant Maintenance information to video analytics. Given the diverse data sources, a forecast method that was designed for one data source was usually not transferable to the other data sources.
To democratize forecasting capabilities across the organization, Deutsche Bahn needed a more efficient and scalable approach to handle various forecasting scenarios. Using Chronos, Deutsche Bahn demonstrates how cutting-edge technology can transform enterprise-scale forecasting operations.
Solution overview
A team enrolled in Deutsche Bahn’s accelerator program Skydeck, the innovation lab of DB Systel, developed a time series FM forecasting system using Chronos as the underlying model, in partnership with DB InfraGO AG. This system offers a secured internal API that can be used by Deutsche Bahn teams across the organization for efficient and simple-to-use time series forecasts, without the need to develop customized software.
The following diagram shows a simplified architecture of how Deutsche Bahn uses Chronos.

In the solution workflow, a user can pass timeseries data to Amazon API Gateway which serves as a secure front door for API calls, handling authentication and authorization. For more information on how to limit access to an API to authorized users only, refer to Control and manage access to REST APIs in API Gateway. Then, an AWS Lambda function is used as serverless compute for processing and passing requests to the Chronos model for inference. The fastest way to host a Chronos model is by using Amazon Bedrock Marketplace or SageMaker Jumpstart.
Impact and future plans
Deutsche Bahn tested the service on multiple use cases, such as predicting actual costs for construction projects and forecasting monthly revenue for retail operators in passenger stations. The implementation with Chronos models revealed compelling outcomes. The following table depicts the achieved results. In the first use case, we can observe that in zero-shot scenarios (meaning that the model has never seen the data before), Chronos models can achieve accuracy superior to established statistical methods like AutoARIMA and AutoETS, even though these methods were specifically trained on the data. Additionally, in both use cases, Chronos inference time is up to 100 times faster, and when fine-tuned, Chronos models outperform traditional approaches in both scenarios. For more details on fine-tuning Chronos, refer to Forecasting with Chronos – AutoGluon.

.
Model
Error (Lower is Better)
Prediction Time (seconds)
Training Time (seconds)

Deutsche Bahn test use case 1
AutoArima
0.202
40
.

AutoETS
0.2
9.1
.

Chronos Bolt Small (Zero Shot)
0.195
0.4
.

Chronos Bolt Base (Zero Shot)
0.198
0.6
.

Chronos Bolt Small (Fine-Tuned)
0.181
0.4
650

Chronos Bolt Base (Fine-Tuned)
0.186
0.6
1328

Deutsche Bahn test use case 2
AutoArima
0.13
100
.

AutoETS
0.136
18
.

Chronos Bolt Small (Zero Shot)
0.197
0.7
.

Chronos Bolt Base (Zero Shot)
0.185
1.2
.

Chronos Bolt Small (Fine-Tuned)
0.134
0.7
1012

Chronos Bolt Base (Fine-Tuned)
0.127
1.2
1893

Error is measured in SMAPE. Finetuning was stopped after 10,000 steps.
Based on the successful prototype, Deutsche Bahn is developing a company-wide forecasting service accessible to all DB business units, supporting different forecasting scenarios. Importantly, this will democratize the usage of forecasting across the organization. Previously resource-constrained teams are now empowered to generate their own forecasts, and forecast preparation time can be reduced from weeks to hours.
Example use case
Let’s walk through a practical example of using Chronos-Bolt with Amazon Bedrock Marketplace. We will forecast passenger capacity utilization at German long-distance and regional train stations using publicly available data.
Prerequisites
For this, you will use the AWS SDK for Python (Boto3) to programmatically interact with Amazon Bedrock. As prerequisites, you need to have the Python libraries boto3, pandas, and matplotlib installed. In addition, configure a connection to an AWS account such that Boto3 can use Amazon Bedrock. For more information on how to setup Boto3, refer to Quickstart – Boto3. If you are using Python inside an Amazon SageMaker notebook, the necessary packages are already installed.
Forecast passenger capacity
First, load the data with the historical passenger capacity utilization. For this example, focus on train station 239:

import pandas as pd

# Load data
df = pd.read_csv(
“https://mobilithek.info/mdp-api/files/aux/573351169210855424/benchmark_personenauslastung_bahnhoefe_training.csv”
)
df_train_station = df[df[“train_station”] == 239].reset_index(drop=True)

Next, deploy an endpoint on Amazon Bedrock Marketplace containing Chronos-Bolt. This endpoint acts as a hosted service, meaning that it can receive requests containing time series data and return forecasts in response.
Amazon Bedrock will assume an AWS Identity and Access Management (IAM) role to provision the endpoint. Modify the following code to reference your role. For a tutorial on creating an execution role, refer to How to use SageMaker AI execution roles.

import boto3
import time

def describe_endpoint(bedrock_client, endpoint_arn):
return bedrock_client.get_marketplace_model_endpoint(endpointArn=endpoint_arn)[
“marketplaceModelEndpoint”
]

def wait_for_endpoint(bedrock_client, endpoint_arn):
endpoint = describe_endpoint(bedrock_client, endpoint_arn)
while endpoint[“endpointStatus”] in [“Creating”, “Updating”]:
print(
f”Endpoint {endpoint_arn} status is still {endpoint[‘endpointStatus’]}.”
“Waiting 10 seconds before continuing…”
)
time.sleep(10)
endpoint = describe_endpoint(bedrock_client, endpoint_arn)
print(f”Endpoint status: {endpoint[‘status’]}”)

bedrock_client = boto3.client(service_name=”bedrock”)
region_name = bedrock_client.meta.region_name
executionRole = “arn:aws:iam::account-id:role/ExecutionRole” # Change to your role

# Deploy Endpoint
body = {
“modelSourceIdentifier”: f”arn:aws:sagemaker:{region_name}:aws:hub-content/SageMakerPublicHub/Model/autogluon-forecasting-chronos-bolt-base/2.0.0″,
“endpointConfig”: {
“sageMaker”: {
“initialInstanceCount”: 1,
“instanceType”: “ml.m5.xlarge”,
“executionRole”: executionRole,
}
},
“endpointName”: “brmp-chronos-endpoint”,
“acceptEula”: True,
}
response = bedrock_client.create_marketplace_model_endpoint(**body)
endpoint_arn = response[“marketplaceModelEndpoint”][“endpointArn”]

# Wait until the endpoint is created. This will take a few minutes.
wait_for_endpoint(bedrock_client, endpoint_arn)

Then, invoke the endpoint to make a forecast. Send a payload to the endpoint, which includes historical time series values and configuration parameters, such as the prediction length and quantile levels. The endpoint processes this input and returns a response containing the forecasted values based on the provided data.

import json

# Query endpoint
bedrock_runtime_client = boto3.client(service_name=”bedrock-runtime”)
body = json.dumps(
{
“inputs”: [
{“target”: df_train_station[“capacity”].values.tolist()},
],
“parameters”: {
“prediction_length”: 64,
“quantile_levels”: [0.1, 0.5, 0.9],
}
}
)
response = bedrock_runtime_client.invoke_model(modelId=endpoint_arn, body=body)
response_body = json.loads(response[“body”].read())

Now you can visualize the forecasts generated by Chronos-Bolt.

import matplotlib.pyplot as plt

# Plot forecast
forecast_index = range(len(df_train_station), len(df_train_station) + 64)
low = response_body[“predictions”][0][“0.1”]
median = response_body[“predictions”][0][“0.5”]
high = response_body[“predictions”][0][“0.9”]

plt.figure(figsize=(8, 4))
plt.plot(df_train_station[“capacity”], color=”royalblue”, label=”historical data”)
plt.plot(forecast_index, median, color=”tomato”, label=”median forecast”)
plt.fill_between(
forecast_index,
low,
high,
color=”tomato”,
alpha=0.3,
label=”80% prediction interval”,
)
plt.legend(loc=’upper left’)
plt.grid()
plt.show()

The following figure shows the output.

As we can see on the right-hand side of the preceding graph in red, the model is able to pick up the pattern that we can visually recognize on the left part of the plot (in blue). The Chronos model predicts a steep decline followed by two smaller spikes. It is worth highlighting that the model successfully predicted this pattern using zero-shot inference, that is, without being trained on the data. Going back to the original prediction task, we can interpret that this particular train station is underutilized on weekends.
Clean up
To avoid incurring unnecessary costs, use the following code to delete the model endpoint:

bedrock_client.delete_marketplace_model_endpoint(endpointArn=endpoint_arn)

# Confirm that endpoint is deleted
time.sleep(5)
try:
endpoint = describe_endpoint(bedrock_client, endpoint_arn=endpoint_arn)
print(endpoint[“endpointStatus”])
except ClientError as err:
assert err.response[‘Error’][‘Code’] ==’ResourceNotFoundException’
print(f”Confirmed that endpoint {endpoint_arn} was deleted”)

Conclusion
The Chronos family of models, particularly the new Chronos-Bolt model, represents a significant advancement in making accurate time series forecasting accessible. Through the simple deployment options with Amazon Bedrock Marketplace and SageMaker JumpStart, organizations can now implement sophisticated forecasting solutions in hours rather than weeks, while achieving state-of-the-art accuracy.
Whether you’re forecasting retail demand, optimizing operations, or planning resource allocation, Chronos models provide a powerful and efficient solution that can scale with your needs.

About the authors
Kilian Zimmerer is an AI and DevOps Engineer at DB Systel GmbH in Berlin. With his expertise in state-of-the-art machine learning and deep learning, alongside DevOps infrastructure management, he drives projects, defines their technical vision, and supports their successful implementation within Deutsche Bahn.
Daniel Ringler is a software engineer specializing in machine learning at DB Systel GmbH in Berlin. In addition to his professional work, he is a volunteer organizer for PyData Berlin, contributing to the local data science and Python programming community.
Pedro Eduardo Mercado Lopez is an Applied Scientist at Amazon Web Services, where he works on time series forecasting for labor planning and capacity planning with a focus on hierarchical time series and foundation models. He received a PhD from Saarland University, Germany, doing research in spectral clustering for signed and multilayer graphs.
Simeon Brüggenjürgen is a Solutions Architect at Amazon Web Services based in Munich, Germany. With a background in Machine Learning research, Simeon supported Deutsche Bahn on this project.
John Liu has 15 years of experience as a product executive and 9 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols, fintech companies and also spent 9 years as a portfolio manager at various hedge funds.
Michael Bohlke-Schneider is an Applied Science Manager at Amazon Web Services. At AWS, Michael works on machine learning and forecasting, with a focus on foundation models for structured data and AutoML. He received his PhD from the Technical University Berlin, where he worked on protein structure prediction.
Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research supporting science teams like the graph machine learning group, and ML Systems teams working on large scale distributed training, inference, and fault resilience. Before joining AWS, Florian lead technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist—a field in which he holds a PhD.

LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researche …

Posted on May 7, 2025 by i-genie

Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. This research introduces a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates in an end-to-end pipeline while retaining modular interpretability and low training cost.

Overview of the LLaMA-Omni2 Architecture

LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

Speech Encoder: Utilizes Whisper-large-v3 to transform input speech into token-level acoustic representations.

Speech Adapter: Processes encoder outputs using a downsampling layer and a feed-forward network to align with the language model’s input space.

Core LLM: The Qwen2.5 models serve as the main reasoning engine.

Streaming TTS Decoder: Converts LLM outputs into speech tokens using an autoregressive Transformer and then generates mel spectrograms through a causal flow matching model inspired by CosyVoice2.

A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.

Streaming Generation with Read-Write Scheduling

The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).

Training Approach

Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus—200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.

Training is executed in two stages:

Stage I: Independently optimizes the speech-to-text and text-to-speech modules.

Stage II: Fine-tunes the speech-to-speech generation path, including the gating and autoregressive decoding components.

Benchmark Results

The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

ModelLlama Q (S2S)Web Q (S2S)GPT-4o ScoreASR-WERLatency (ms)GLM-4-Voice (9B)50.715.94.093.481562.8LLaMA-Omni (8B)49.023.73.523.67346.7LLaMA-Omni2-7B60.731.34.153.26582.9

The performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, even with substantially less training data than native SpeechLMs such as GLM-4-Voice.

Component Analyses

Gate Fusion Module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.

TTS Pretraining: Initializing the TTS model from Qwen2.5 and fine-tuning in a streaming setup yields the best performance. Training from scratch fails to converge effectively.

Read/Write Strategies: Adjusting the R:W ratio impacts latency and quality. Larger W improves UTMOS but at the cost of response delay.

Additionally, the study demonstrates that multi-turn dialogue data is more effective than single-turn data in training speech interaction capabilities, and that performance plateaus around 200K samples.

Conclusion

LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.

Check out the Paper, Model on Hugging Face and GitHub Page. Also, don’t forget to follow us on Twitter.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)

Newsletter– airesearchinsights.com/(30k+ subscribers)

miniCON AI Events – minicon.marktechpost.com

AI Reports & Magazines – magazine.marktechpost.com

AI Dev & Research News – marktechpost.com (1M+ monthly readers)
The post LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model appeared first on MarkTechPost.