This AI Paper from IBM and MIT Introduces SOLOMON: A Neuro-Inspired Reasoning Network for Enhancing LLM Adaptability in Semiconductor Layout Design

Adapting large language models for specialized domains remains challenging, especially in fields that require spatial reasoning and structured problem-solving, even though these models excel at general-purpose reasoning. Semiconductor layout design is a prime example: AI tools must interpret geometric constraints and ensure precise component placement. Researchers are developing advanced AI architectures to enhance LLMs' ability to process and apply domain-specific knowledge effectively.

A major limitation of general-purpose LLMs is their inability to convert theoretical knowledge into practical solutions. While these models can accurately define technical concepts, they often fail when solving real-world tasks that require spatial reasoning and structured logic. In semiconductor layout design, AI must go beyond text-based knowledge to ensure accurate placement of vias, metal layers, and circuit components. Without precise geometric relationships, layout designs may fail due to misalignment or incorrect spacing. Current models often require multiple rounds of human correction, making their deployment inefficient.

Several approaches have been developed to improve LLMs’ adaptability for domain-specific applications. Fine-tuning involves training LLMs with domain-specific data, but this process is time-intensive and requires significant computational resources. Retrieval-augmented generation (RAG) retrieves external knowledge to guide LLM outputs, but it does not fully address challenges related to structured problem-solving. In-context learning helps guide LLM reasoning by providing task-specific examples, yet it does not overcome spatial reasoning limitations. These methods offer incremental improvements but fail to deliver a comprehensive solution for applications requiring geometric logic.

Researchers at IBM T.J. Watson Research Center and MIT-IBM Watson AI Lab introduced SOLOMON, a neuro-inspired LLM reasoning network, to enhance domain-specific adaptability. Unlike conventional approaches, SOLOMON employs a multi-agent reasoning system that dynamically processes spatial constraints and geometric relationships. The framework integrates thought assessment mechanisms to refine outputs iteratively, improving problem-solving accuracy. SOLOMON leverages prompt engineering techniques to guide LLM-generated solutions, allowing it to adapt to semiconductor layout tasks with minimal retraining.

The architecture of SOLOMON is inspired by neuroscience and incorporates the Free Energy Principle, which optimizes reasoning by reducing discrepancies between expected and observed outcomes. The framework consists of three primary components: Thought Generators, Thought Assessors, and a Steering Subsystem. Thought Generators utilize diverse LLMs to produce multiple reasoning pathways, ensuring a broad range of solutions for complex tasks. The Thought Assessor evaluates these outputs, selecting the most logical and structured approach. The Steering Subsystem allows researchers to modify objectives dynamically, enabling more precise domain adaptation. Unlike fine-tuning, this architecture does not require continuous retraining, making it more efficient for specialized applications.
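To make the division of labor concrete, here is a minimal Python sketch of a generate-assess-steer loop. The component names follow the article's description of SOLOMON, but the call_llm helper, the prompts, and the control flow are illustrative assumptions, not the authors' implementation.

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: swap in any chat-completion client (OpenAI, Anthropic, local, ...).
    return f"[{model}] response to: {prompt[:40]}..."

def solomon_step(task: str, objective: str, generator_models: list[str],
                 assessor_model: str) -> str:
    # Steering Subsystem: the objective is injected into every prompt, so the
    # domain goal can be changed without retraining any model.
    gen_prompt = f"Objective: {objective}\nTask: {task}\nPropose a layout plan with explicit coordinates."

    # Thought Generators: diverse LLMs produce multiple candidate solutions.
    candidates = [call_llm(m, gen_prompt) for m in generator_models]

    # Thought Assessor: a separate pass evaluates the candidates and returns a
    # single refined answer, checking spatial constraints and arithmetic.
    assess_prompt = (
        f"Objective: {objective}\nTask: {task}\n"
        + "\n---\n".join(candidates)
        + "\nCheck each candidate for misaligned vias, spacing violations, and "
          "arithmetic errors, then return the best corrected layout."
    )
    return call_llm(assessor_model, assess_prompt)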

Researchers conducted experiments on 25 semiconductor layout tasks to evaluate SOLOMON’s effectiveness. The framework was compared to five baseline LLMs, including GPT-4o, Claude-3.5-Sonnet, and Llama-3 models. Each task assessed the models’ ability to generate geometric structures while maintaining spatial accuracy. SOLOMON demonstrated improvements in reducing runtime errors and scaling inaccuracies. The framework exhibited better spatial reasoning capabilities, improving placement precision and reducing mistakes in generated designs. SOLOMON instances also matched or exceeded the performance of o1-preview in multiple test categories, with the Claude-based SOLOMON performing strongly in certain complex tasks.

A key advantage of SOLOMON is its ability to correct logical inconsistencies and arithmetic errors in geometric designs. The Thought Assessor continuously refines generated layouts by analyzing previous iterations, mitigating common hallucination issues in traditional LLMs. The system effectively reduces misinterpretations and enhances the reliability of AI-generated designs. SOLOMON synchronizes reasoning across multiple LLMs when presented with ambiguous layout specifications, ensuring consistent and precise output. By incorporating hierarchical assessment mechanisms, the framework significantly improves AI-driven design accuracy.

This research highlights the importance of enhancing LLM reasoning capabilities rather than increasing model size. SOLOMON offers a structured and efficient approach for applying AI to domain-specific problem-solving, particularly in semiconductor layout design. Future research will focus on expanding the framework to other engineering applications, refining multimodal reasoning capabilities, and introducing iterative learning mechanisms to enhance AI decision-making. The introduction of SOLOMON represents a substantial advancement in making AI-driven tools more precise, adaptive, and effective for real-world industrial challenges.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


KAIST and DeepAuto AI Researchers Propose InfiniteHiP: A Game-Changing Long-Context LLM Framework for 3M-Token Inference on a Single GPU

In large language models (LLMs), processing extended input sequences demands significant computational and memory resources, leading to slower inference and higher hardware costs. The attention mechanism, a core component, further exacerbates these challenges due to its quadratic complexity relative to sequence length. Also, maintaining the previous context using a key-value (KV) cache results in high memory overheads, limiting scalability. 

A key limitation of LLMs is their inability to handle sequences longer than their trained context window. Most models degrade in performance when faced with extended inputs due to inefficient memory management and growing attention computation costs. Existing solutions often rely on fine-tuning, which is resource-intensive and requires high-quality long-context datasets. Without an efficient method for context extension, tasks like document summarization, retrieval-augmented generation, and long-form text generation remain constrained.

Several approaches have been proposed to tackle the problem of long-context processing. FlashAttention2 (FA2) optimizes memory consumption by minimizing redundant operations during attention computation, yet it does not address computational inefficiency. Some models employ selective token attention, either statically or dynamically, to reduce processing overhead. KV cache eviction strategies have been introduced to remove older tokens selectively, but they risk permanently discarding important contextual information. HiP Attention is another approach that attempts to offload infrequently used tokens to external memory; however, it lacks efficient cache management, leading to increased latency. Despite these advances, no method has effectively addressed all three key challenges: 

Long-context generalization

Efficient memory management

Computational efficiency

Researchers from KAIST and DeepAuto.ai introduced InfiniteHiP, an advanced framework that enables efficient long-context inference while mitigating memory bottlenecks. The model achieves this through a hierarchical token pruning algorithm, which dynamically removes less relevant context tokens. This modular pruning strategy selectively retains the tokens that contribute most to attention computations, significantly reducing processing overhead. The framework also incorporates adaptive RoPE (Rotary Positional Embeddings) adjustments, allowing models to generalize to longer sequences without additional training. In addition, InfiniteHiP employs a novel KV cache offloading mechanism, transferring less frequently accessed tokens to host memory while ensuring efficient retrieval. These techniques enable the model to process up to 3 million tokens on a 48GB GPU, making it one of the most scalable long-context inference methods reported to date.

The core innovation of InfiniteHiP is its multi-stage pruning mechanism, which progressively refines context selection across stages. Tokens are first divided into fixed-length chunks, and each chunk is scored by its contribution to the attention computation. A top-K selection step retains only the most critical tokens and drops the rest. Unlike other hierarchical pruning approaches, InfiniteHiP's method is fully parallelized, which makes it computationally efficient. The KV cache management system optimizes memory utilization by dynamically offloading less important context tokens while maintaining retrieval flexibility. The model also applies different RoPE interpolation methods at different attention layers, facilitating smooth adaptation to long sequences.
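As a rough illustration of the chunked top-K selection described above, the following PyTorch sketch scores fixed-length chunks of cached keys against the current query and keeps only the best ones. The single-stage scoring rule here is a simplification of InfiniteHiP's actual multi-stage algorithm.

import torch

def prune_context(keys, query, chunk_size=64, top_k_chunks=32):
    """Illustrative single-stage version of hierarchical token pruning:
    split cached keys into fixed-length chunks, score each chunk by its
    best dot product with the current query, and keep only the top-K
    chunks for full attention. Scoring a whole chunk by its strongest
    token is a simplification of the paper's algorithm."""
    n, d = keys.shape
    n_chunks = n // chunk_size
    chunks = keys[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)
    # score = maximum attention logit of any token in the chunk
    scores = torch.einsum("ncd,d->nc", chunks, query).amax(dim=-1)
    keep = scores.topk(min(top_k_chunks, n_chunks)).indices
    return chunks[keep].reshape(-1, d)   # retained keys for sparse attention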

The model demonstrates an 18.95× speedup in attention decoding for a one million-token context compared to traditional methods without additional training. The KV cache offloading technique reduces GPU memory consumption by up to 96%, making it practical for large-scale applications. In benchmark evaluations such as LongBench and ∞Bench, InfiniteHiP consistently outperforms state-of-the-art methods, achieving a 9.99% higher relative score than InfLLM. Also, decoding throughput is increased by 3.2× on consumer GPUs (RTX 4090) and 7.25× on enterprise-grade GPUs (L40S).

In conclusion, the research team successfully addressed the major bottlenecks of long-context inference with InfiniteHiP. The framework enhances LLM capabilities by integrating hierarchical token pruning, KV cache offloading, and RoPE generalization. This breakthrough enables pre-trained models to process extended sequences without losing context or increasing computational costs. The method is scalable, hardware-efficient, and applicable to various AI applications requiring long-memory retention.

Check out the Paper, Source Code and Live Demo. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


Nous Research Released DeepHermes 3 Preview: A Llama-3-8B Based Model Combining Deep Reasoning, Advanced Function Calling, and Seamless Conversational Intelligence

AI has seen rapid advances in NLP in recent years, yet many existing models still struggle to balance intuitive responses with deep, structured reasoning. While proficient in conversational fluency, traditional AI chat models often fall short when faced with complex logical queries that require step-by-step analysis. Conversely, models optimized for reasoning tend to lose the ability to engage in smooth, natural interaction. This gap has challenged developers, researchers, and enterprises seeking an AI that can transition seamlessly between these cognitive styles.

DeepHermes 3 Preview (DeepHermes-3-Llama-3-8B-Preview) is the latest iteration in Nous Research’s series of LLMs. As one of the first models to integrate both reasoning-based long-chain thought processing and conventional LLM response mechanisms, DeepHermes 3 marks a significant step in AI model sophistication. This preview version of the model refines AI annotation, judgment capabilities, and function-calling, offering a more advanced, flexible AI tool for researchers, developers, and enterprises.  

The core feature of DeepHermes 3 is its ability to switch between intuitive and deep reasoning, allowing users to customize how the model processes and delivers information. The model is an upgrade from its predecessor, Hermes 3, which brought agentic capabilities, richer roleplay dialogue, increased multi-turn conversational depth, and enhanced coherence over a longer context. The overall goal of the Hermes series has always been to make AI output consistent with user intent, thereby giving the end user significant control over response generation. This version is a departure from previous models, with its dual-processing mode allowing it to perform normal conversational responses and support complex reasoning. A system prompt can trigger the deep reasoning feature, allowing extended logical processing to improve response accuracy.


DeepHermes 3 has undergone rigorous benchmarking to validate its reasoning capabilities. Using the Hugging Face Open-R1 evaluation suite, the model demonstrated significantly improved performance over standard instruction-tuned models. Benchmarks for reasoning mode “ON” revealed notable gains in complex problem-solving, particularly in mathematical reasoning tasks, compared to models that do not incorporate deep thought mechanisms. Compared to Meta’s Llama-3.1-8B, the DeepHermes 3 model displayed competitive or superior results in multiple test categories, showing improvements in contextual coherence, multi-step reasoning, and conversational memory retention.

DeepHermes 3 has adopted the Llama-Chat format for system prompts, a structured method that enhances its ability to process multi-turn conversations and context-driven responses. System prompts introduce new possibilities for user engagement, allowing individuals to guide the model’s stylistic choices, role assignment, and interactive rules. With its enhanced deep reasoning mode, the model can handle long-chain logic that extends across thousands of tokens. This mode ensures greater response accuracy in tasks requiring extensive contextual understanding, such as complex programming queries, mathematical problem-solving, and detailed analytical reasoning.  


The model can be deployed using the Hugging Face Transformers library, which allows developers to customize the implementations for various tasks. Due to its flexible API integration, DeepHermes 3 can be used in enterprise systems, chatbot applications, and research systems where structured and unstructured queries must be processed. Further, the model has an improved function-calling feature that facilitates efficient processing of JSON-structured outputs. This feature makes it ideal for structured data extraction applications, such as automated financial reporting, customer service automation, and real-time AI-based decision-making systems. 
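A minimal sketch of such a deployment with the Transformers library is shown below. The repository id and the exact system prompt that enables deep reasoning are assumptions here and should be taken from the official model card.

# Minimal sketch of running DeepHermes 3 Preview with Hugging Face Transformers.
# Both the repository id and the "deep thinking" system prompt are assumed and
# should be confirmed against Nous Research's model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/DeepHermes-3-Llama-3-8B-Preview"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    # Placeholder system prompt: the real reasoning-mode prompt is specified by Nous Research.
    {"role": "system", "content": "You are a deep-thinking AI. Reason step by step inside <think> tags before answering."},
    {"role": "user", "content": "A train travels 120 km in 90 minutes. What is its average speed in km/h?"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))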

In conclusion, this version brings together the intuitive, human-like responses of traditional chat models with extended chains of reasoning, improving both response accuracy and the overall efficacy of the model. With advances in agentic functionality, roleplay, multi-turn dialogue, and function calling, DeepHermes 3 is consistent with the series' overall emphasis on user-focused control and steerability. Though presented as an early version with preliminary reasoning capabilities, it shows promise for tasks that benefit from deliberate reasoning. Users can activate its deep-thinking mode with a special system prompt that induces the model to reason extensively before responding.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


DeepSeek AI Introduces CODEI/O: A Novel Approach that Transforms Code-based Reasoning Patterns into Natural Language Formats to Enhance LLMs' Reasoning Capabilities

Large Language Models (LLMs) have advanced significantly in natural language processing, yet reasoning remains a persistent challenge. While tasks such as mathematical problem-solving and code generation benefit from structured training data, broader reasoning tasks—like logical deduction, scientific inference, and symbolic reasoning—suffer from sparse and fragmented data. Traditional approaches, such as continual pretraining on code, often embed reasoning signals implicitly, making it difficult for models to generalize. Even text-to-code generation methods remain constrained by syntax-specific learning, limiting their applicability beyond programming-related tasks. A more structured approach is needed to expose LLMs to fundamental reasoning patterns while preserving logical rigor.

DeepSeek AI Research presents CODEI/O, an approach that converts code-based reasoning into natural language. By transforming raw code into an input-output prediction format and expressing reasoning steps through Chain-of-Thought (CoT) rationales, CODEI/O allows LLMs to internalize core reasoning processes such as logic flow planning, decision tree traversal, and modular decomposition. Unlike conventional methods, CODEI/O separates reasoning from code syntax, enabling broader applicability while maintaining logical structure.

Technical Overview and Benefits

CODEI/O follows a structured data processing pipeline:

Collecting Raw Code Files: Over 450K functions were gathered from multiple sources, including algorithm repositories and educational programming datasets.

Standardizing the Data: The collected code was refined using DeepSeek-V2.5, ensuring clarity and execution compatibility.

Generating Input-Output Pairs: Functions were executed with varying inputs to create structured training examples across diverse reasoning tasks.

Generating Chain-of-Thought Reasoning: Using models like DeepSeek-V2.5, natural language explanations were generated to provide structured reasoning.

Verification and Refinement: Predictions were validated through execution, with incorrect responses revised iteratively to improve reasoning accuracy.
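The snippet below illustrates, under assumptions about the record schema and prompt wording, how one such training example could be constructed from a raw function: execute it on a sampled input, store the execution-verified output, and ask the model to predict that output with step-by-step reasoning.

# Illustrative construction of one CODEI/O-style training example. The prompt
# wording and record schema are assumptions, not the paper's exact format.
import json, random

def reference_function(nums):
    """Return the length of the longest strictly increasing run."""
    best = cur = 1
    for a, b in zip(nums, nums[1:]):
        cur = cur + 1 if b > a else 1
        best = max(best, cur)
    return best

sample_input = [random.randint(0, 9) for _ in range(8)]
record = {
    "code": "def reference_function(nums): ...",     # the standardized source
    "query": "Return the length of the longest strictly increasing run.",
    "input": sample_input,
    "output": reference_function(sample_input),      # ground truth via execution
    "task": "predict the output and explain your reasoning step by step",
}
print(json.dumps(record, indent=2))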

Key Features of CODEI/O:

Transformative Learning: Converts diverse code patterns into natural language CoT rationales, making reasoning transferable beyond programming contexts.

Syntax-Decoupled Learning: Separates logical reasoning from code syntax, improving adaptability across reasoning tasks.

Multi-Task Improvement: Enhances performance across symbolic, scientific, logical, mathematical, and commonsense reasoning domains.

Verifiability: Predictions can be validated through cached ground-truth matching or re-execution.

Iterative Refinement: A refined version, CODEI/O++, employs multi-turn revision to enhance reasoning accuracy.

Empirical Results and Performance

The impact of CODEI/O was tested across four base models (ranging from 7B to 30B parameters) on 14 reasoning benchmarks covering logic, symbolic inference, mathematics, scientific deduction, and commonsense reasoning.

Findings:

Consistent Improvements: CODEI/O training led to higher scores across reasoning benchmarks compared to traditional pretraining methods.

Generalization Across Tasks: Unlike existing approaches that improve specific tasks but degrade performance elsewhere, CODEI/O showed balanced enhancements.

Comparison to Baselines: CODEI/O outperformed datasets such as OpenMathInstruct2, OpenCoder-SFT-Stage1, and WebInstruct.

Effectiveness of Multi-Turn Refinement: CODEI/O++ further improved results by iteratively refining incorrect responses, leveraging execution feedback for better reasoning quality.

For instance, in logical and symbolic reasoning benchmarks such as BBH and CruxEval, CODEI/O led to notable performance gains. In math reasoning tasks (GSM8K, MATH, and MMLU-STEM), it demonstrated improvements over existing baselines. Even in commonsense reasoning, where code-based methods typically struggle, CODEI/O maintained robust results.

Conclusion

CODEI/O presents a structured way to enhance LLMs’ reasoning by leveraging input-output transformations from real-world code. Instead of focusing on isolated reasoning tasks, it extracts universal reasoning patterns and translates them into natural language explanations. This structured learning approach ensures that models acquire robust reasoning skills across different domains.

The introduction of multi-turn revision (CODEI/O++) further refines reasoning accuracy, demonstrating that iterative learning from execution feedback enhances model reliability. By making predictions verifiable, CODEI/O provides a scalable and reliable method for improving LLM reasoning.

By bridging code-based and natural language reasoning, CODEI/O offers a promising direction for enhancing LLMs’ cognitive abilities beyond programming-related tasks.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


ReasonFlux: Elevating LLM Reasoning with Hierarchical Template Scaling

Large language models (LLMs) have demonstrated exceptional problem-solving abilities, yet complex reasoning tasks, such as competition-level mathematics or intricate code generation, remain challenging. These tasks demand precise navigation through vast solution spaces and meticulous step-by-step deliberation. Existing methods, while improving accuracy, often suffer from high computational costs, rigid search strategies, and difficulty generalizing across diverse problems. In their paper, the researchers introduce ReasonFlux, a framework that addresses these limitations by reimagining how LLMs plan and execute reasoning steps using hierarchical, template-guided strategies.

Recent approaches to enhance LLM reasoning fall into two categories: deliberate search and reward-guided methods. Techniques like Tree of Thoughts (ToT) enable LLMs to explore multiple reasoning paths, while Monte Carlo Tree Search (MCTS) decomposes problems into steps guided by process reward models (PRMs). Though effective, these methods scale poorly due to excessive sampling and manual search design. For instance, MCTS requires iterating through thousands of potential steps, making it computationally prohibitive for real-world applications. Meanwhile, retrieval-augmented generation (RAG) methods like Buffer of Thoughts (BoT) leverage stored problem-solving templates but struggle to integrate multiple templates adaptively, limiting their utility in complex scenarios.

ReasonFlux introduces a structured framework that combines a curated library of high-level thought templates with hierarchical reinforcement learning (HRL) to dynamically plan and refine reasoning paths. Instead of optimizing individual steps, it focuses on configuring optimal template trajectories—sequences of abstract problem-solving strategies retrieved from a structured knowledge base. This approach simplifies the search space and enables efficient adaptation to sub-problems. The framework consists of three main components:

Structured Template Library:  The research team constructed a library of 500 thought templates, each encapsulating a problem-solving strategy (e.g., “Trigonometric Substitution for Integral Optimization”). Templates include metadata—names, tags, descriptions, and application steps—enabling efficient retrieval. For example, a template tagged “Irrational Function Optimization” might guide an LLM to apply specific algebraic substitutions.  

Hierarchical Reinforcement Learning:

Structure-Based Fine-Tuning: A base LLM (e.g., Qwen2.5-32B) is fine-tuned to associate template metadata with their functional descriptions, ensuring it understands when and how to apply each template.  

Template Trajectory Optimization: Using preference learning, the model learns to rank template sequences by their effectiveness. For a given problem, multiple trajectories are sampled, and their success rates on similar problems determine rewards. This trains the model to prioritize high-reward sequences, refining its planning capability.  

Adaptive Inference Scaling:  During inference, ReasonFlux acts as a “navigator,” analyzing the problem to retrieve relevant templates and dynamically adjusting the trajectory based on intermediate results. For instance, if a step involving “Polynomial Factorization” yields unexpected constraints, the system might pivot to a “Constraint Propagation” template. This iterative interplay between planning and execution mirrors human problem-solving, where partial solutions inform subsequent steps.  
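A highly simplified sketch of this navigator loop is shown below. The template library, the retrieval heuristic, and the LLM calls are placeholders: the real system ranks template trajectories with hierarchical reinforcement learning rather than the naive retrieval used here.

# Conceptual sketch of ReasonFlux's planner loop with placeholder components.

TEMPLATES = {
    "Trigonometric Substitution for Integral Optimization": "When the integrand contains sqrt(a^2 - x^2), substitute x = a*sin(t) ...",
    "Constraint Propagation": "Propagate each new constraint through the remaining unknowns ...",
}

def retrieve(problem: str, k: int = 3) -> list[str]:
    # Stand-in for retrieval over template names, tags, and descriptions.
    return list(TEMPLATES)[:k]

def solve_step(problem: str, state: str, template: str) -> tuple[str, bool]:
    # Stand-in for prompting the reasoning LLM with one template; returns the
    # updated partial solution and whether the step passed a self-check.
    return state + f"\n[applied: {template}]", True

def reasonflux(problem: str, max_steps: int = 6) -> str:
    state = ""
    trajectory = retrieve(problem)          # planned template sequence
    for _ in range(max_steps):
        if not trajectory:
            break
        template = trajectory.pop(0)
        state, ok = solve_step(problem, state, template)
        if not ok:                          # adaptive scaling: re-plan on failure
            trajectory = retrieve(problem + state)
    return state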

ReasonFlux was evaluated on competition-level benchmarks like MATH, AIME, and OlympiadBench, outperforming both frontier models (GPT-4o, Claude) and specialized open-source models (DeepSeek-V3, Mathstral). Key results include:  

91.2% accuracy on MATH, surpassing OpenAI’s o1-preview by 6.7%.  

56.7% on AIME 2024, exceeding DeepSeek-V3 by 45% and matching o1-mini.  

63.3% on OlympiadBench, a 14% improvement over prior methods.  

Moreover, the structured template library demonstrated strong generalization: when applied to variant problems, it boosted smaller models (e.g., 7B parameters) to outperform larger counterparts using direct reasoning. Additionally, ReasonFlux achieved a superior exploration-exploitation balance, requiring 40% fewer computational steps than MCTS and Best-of-N on complex tasks.

In summary, ReasonFlux redefines how LLMs approach complex reasoning by decoupling high-level strategy from step-by-step execution. Its hierarchical template system reduces computational overhead while improving accuracy and adaptability, addressing critical gaps in existing methods. By leveraging structured knowledge and dynamic planning, the framework sets a new standard for efficient, scalable reasoning—proving that smaller, well-guided models can rival even the largest frontier systems. This innovation opens avenues for deploying advanced reasoning in resource-constrained environments, from education to automated code generation.  

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


Google DeepMind Researchers Propose Matryoshka Quantization: A Technique to Enhance Deep Learning Efficiency by Optimizing Multi-Precision Models without Sacrificing Accuracy

Quantization is a crucial technique in deep learning for reducing computational costs and improving model efficiency. Large-scale language models demand significant processing power, which makes quantization essential for minimizing memory usage and enhancing inference speed. By converting high-precision weights to lower-bit formats such as int8, int4, or int2, quantization reduces storage requirements. However, standard techniques often degrade accuracy, especially at low precisions like int2. Researchers must compromise accuracy for efficiency or maintain multiple models with different quantization levels. New strategies are strongly needed to preserve model quality while optimizing computational efficiency. 

The fundamental problem with quantization is maintaining accuracy as precision is reduced. Existing approaches either train a separate model for each precision or fail to exploit the hierarchical nature of integer data types. Accuracy loss is most severe at int2, which limits its widespread use despite the large memory savings it offers. LLMs like Gemma-2 9B and Mistral 7B are highly compute-intensive, so a technique that lets a single model operate at multiple precision levels would significantly improve efficiency. The need for a high-performance, flexible quantization method has prompted researchers to look beyond conventional approaches.

Several quantization techniques exist, each balancing accuracy and efficiency. Learning-free methods like MinMax and GPTQ use statistical scaling to map model weights to lower bit widths without modifying parameters, but they lose accuracy at low precisions. Learning-based methods like Quantization Aware Training (QAT) and OmniQuant optimize quantization parameters using gradient descent. QAT updates model parameters to reduce post-quantization accuracy loss, while OmniQuant learns to scale and shift parameters without modifying core weights. However, both methods still require separate models for different precisions, complicating deployment.

Researchers at Google DeepMind introduced Matryoshka Quantization (MatQuant) to create a single model that functions across multiple precision levels. Unlike conventional methods that treat each bit-width separately, MatQuant optimizes a model for int8, int4, and int2 using a shared bit representation. This allows models to be deployed at different precisions without retraining, reducing computational and storage costs. MatQuant extracts lower-bit models from a high-bit model while preserving accuracy by leveraging the hierarchical structure of integer data types. Testing on Gemma-2 2B, Gemma-2 9B, and Mistral 7B models showed that MatQuant improves int2 accuracy by up to 10% over standard quantization techniques like QAT and OmniQuant.

MatQuant represents model weights at different precision levels using shared most significant bits (MSBs) and optimizes them jointly to maintain accuracy. The training process incorporates co-training and co-distillation, ensuring that the int2 representation retains critical information typically lost in conventional quantization. Instead of discarding lower-bit structures, MatQuant integrates them into a multi-scale optimization framework for efficient compression without performance loss. 
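The nested-integer idea can be illustrated with a few lines of NumPy: lower-precision weights are read out of the int8 codes by keeping their most significant bits. This shows only the slicing; MatQuant's co-training and co-distillation of all bit widths are not reproduced here.

import numpy as np

def slice_msbs(w_int8: np.ndarray, bits: int) -> np.ndarray:
    """Keep the `bits` most significant bits of an unsigned int8 code.
    This illustrates the nested structure MatQuant exploits: int4 and int2
    weights are read out of the int8 weights by dropping low-order bits.
    (MatQuant additionally co-trains all widths; that part is omitted.)"""
    shift = 8 - bits
    return (w_int8.astype(np.uint8) >> shift).astype(np.uint8)

w8 = np.array([0b11010110, 0b00101001, 0b11111111], dtype=np.uint8)
w4 = slice_msbs(w8, 4)   # [0b1101, 0b0010, 0b1111]
w2 = slice_msbs(w8, 2)   # [0b11,   0b00,   0b11]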

Experimental evaluations of MatQuant demonstrate its ability to mitigate accuracy loss from quantization. Researchers tested the method on Transformer-based LLMs, focusing on quantizing Feed-Forward Network (FFN) parameters, a key factor in inference latency. Results show that MatQuant’s int8 and int4 models achieve comparable accuracy to independently trained baselines while outperforming them at int2 precision. On the Gemma-2 9B model, MatQuant improved int2 accuracy by 8.01%, while the Mistral 7B model saw a 6.35% improvement over traditional quantization methods. The study also found that MatQuant’s right-shifted quantized weight distribution enhances accuracy across all bit-widths, particularly benefiting lower-precision models. Also, MatQuant enables seamless bit-width interpolation and layer-wise Mix’n’Match configurations, allowing flexible deployment based on hardware constraints.

Several key takeaways emerge from the research on MatQuant:

Multi-Scale Quantization: MatQuant introduces a novel approach to quantization by training a single model that can operate at multiple precision levels (e.g., int8, int4, int2).

Nested Bit Structure Exploitation: The technique leverages the inherent nested structure within integer data types, allowing smaller bit-width integers to be derived from larger ones.

Enhanced Low-Precision Accuracy: MatQuant significantly improves the accuracy of int2 quantized models, outperforming traditional quantization methods like QAT and OmniQuant by up to 8%.

Versatile Application: MatQuant is compatible with existing learning-based quantization techniques such as Quantization Aware Training (QAT) and OmniQuant.

Demonstrated Performance: The method was successfully applied to quantize the FFN parameters of LLMs like Gemma-2 2B, 9B, and Mistral 7B, showcasing its practical utility.

Efficiency Gains: MatQuant enables the creation of models that offer a better trade-off between accuracy and computational cost, making it ideal for resource-constrained environments.

Pareto-Optimal Trade-Offs: It allows for seamless extraction of interpolative bit-widths, such as int6 and int3, and admits a dense accuracy-vs-cost Pareto-optimal trade-off by enabling layer-wise Mix’n’Match of different precisions.

In conclusion, MatQuant presents a solution to managing multiple quantized models by utilizing a multi-scale training approach that exploits the nested structure of integer data types. This provides a flexible, high-performance option for low-bit quantization in efficient LLM inference. This research demonstrates that a single model can be trained to operate at multiple precision levels without significantly declining accuracy, particularly at very low bit widths, marking an important advancement in model quantization techniques.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


Salesforce AI Research Introduces Reward-Guided Speculative Decoding (RSD): A Novel Framework that Improves the Efficiency of Inference in Large Language Models (LLMs) Up To 4.4× Fewer FLOPs

In recent years, the rapid scaling of large language models (LLMs) has led to extraordinary improvements in natural language understanding and reasoning capabilities. However, this progress comes with a significant caveat: the inference process—generating responses one token at a time—remains a computational bottleneck. As LLMs grow in size and complexity, the latency and energy demands for sequential token generation become substantial. These challenges are particularly acute in real-world deployments, where cost, speed, and scalability are critical. Traditional decoding approaches, such as greedy or beam search methods, often require repeated evaluations of large models, leading to high computational overhead. Moreover, even with parallel decoding techniques, maintaining both the efficiency and the quality of generated outputs can be elusive. This scenario has spurred a search for novel techniques that can reduce inference costs without sacrificing accuracy. Researchers have therefore been exploring hybrid approaches that combine lightweight models with more powerful counterparts, striving for an optimal balance between speed and performance—a balance that is essential for real-time applications, interactive systems, and large-scale deployment in cloud environments.

Salesforce AI Research Introduces Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). At its core, RSD leverages a dual-model strategy: a fast, lightweight "draft" model works in tandem with a more robust "target" model. The draft model generates preliminary candidate outputs rapidly, while a process reward model (PRM) evaluates the quality of these outputs in real time. Unlike traditional speculative decoding, which insists on strict unbiased token matching between the draft and target models, RSD introduces a controlled bias. This bias is carefully engineered to favor high-reward outputs (those deemed more likely to be correct or contextually relevant), thus significantly reducing unnecessary computations. The approach is grounded in a mathematically derived threshold strategy that determines when the target model should intervene. By dynamically mixing outputs from both models based on a reward function, RSD not only accelerates the inference process but also enhances the overall quality of the generated responses. Detailed in the accompanying paper, this methodology represents a significant step toward addressing the inherent inefficiencies of sequential token generation in LLMs.

Technical Details and Benefits of RSD

Delving into the technical aspects, RSD operates by integrating two models in a sequential yet collaborative manner. Initially, the draft model produces candidate tokens or reasoning steps at a low computational cost. Each candidate is then evaluated using a reward function, which acts as a quality gate. If a candidate token's reward exceeds a predetermined threshold, the output is accepted; if not, the system calls upon the more computationally intensive target model to generate a refined token. This process is guided by a weighting function, typically a binary step function, that adjusts the reliance on the draft versus the target model. The dynamic quality control afforded by the process reward model (PRM) ensures that only the most promising outputs bypass the target model, thereby saving on computation. One of the standout benefits of this approach is "biased acceleration," where the controlled bias is not a detriment but rather a strategic choice to prioritize high-reward outcomes. This results in two key benefits: first, the overall inference process can be up to 4.4× faster compared to running the target model alone; second, it often yields an average accuracy improvement of +3.5 points over conventional parallel decoding baselines. In essence, RSD harmonizes efficiency with accuracy, allowing for a substantial reduction in the number of floating-point operations (FLOPs) while still delivering outputs that meet or even exceed the performance of the target model. The theoretical underpinnings and algorithmic details, such as the mixture distribution P_RSD and the adaptive acceptance criterion, provide a robust framework for practical deployment in diverse reasoning tasks.
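The control flow can be summarized in a short sketch, assuming placeholder callables for the draft model, the target model, and the PRM, and a simple binary-step weighting function; the real algorithm operates on token distributions rather than the appended strings used here.

# Minimal sketch of Reward-Guided Speculative Decoding's control flow: accept the
# draft model's step when the process reward clears a threshold, otherwise fall
# back to the target model. The three callables are placeholders for real backends.

def rsd_generate(prompt: str, draft_step, target_step, reward, n_steps: int = 16,
                 threshold: float = 0.7) -> str:
    out = prompt
    for _ in range(n_steps):
        candidate = draft_step(out)          # cheap proposal (token / reasoning step)
        if reward(out, candidate) >= threshold:
            out += candidate                 # high-reward draft accepted, target skipped
        else:
            out += target_step(out)          # low reward: pay for the large model
    return out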

Insights

The empirical validation of RSD is compelling. Experiments detailed in the paper demonstrate that, on challenging benchmarks such as GSM8K, MATH500, OlympiadBench, and GPQA, RSD consistently delivers superior performance. For instance, on the MATH500 benchmark, a dataset designed to test mathematical reasoning, RSD achieved an accuracy of 88.0 when configured with a 72B target model and a 7B PRM, compared to 85.6 for the target model running alone. This configuration not only reduces the computational load, requiring nearly 4.4× fewer FLOPs, but also improves reasoning accuracy. The results underscore the potential of RSD to outperform traditional methods, such as speculative decoding (SD) and even advanced search-based techniques like beam search or Best-of-N strategies.

Conclusion: A New Paradigm for Efficient LLM Inference

In conclusion, Reward-Guided Speculative Decoding (RSD) marks a significant milestone in the quest for more efficient LLM inference. By intelligently combining a lightweight draft model with a powerful target model, and by introducing a reward-based acceptance criterion, RSD effectively addresses the dual challenges of computational cost and output quality. The innovative approach of biased acceleration allows the system to selectively bypass expensive computations for high-reward outputs, thereby streamlining the inference process. The dynamic quality control mechanism—anchored by a process reward model—ensures that computational resources are allocated judiciously, engaging the target model only when necessary. With empirical results showing up to 4.4× faster inference and an average accuracy improvement of +3.5 over traditional methods, RSD not only paves the way for more scalable LLM deployments but also sets a new standard in the design of hybrid decoding frameworks.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


Layer Parallelism: Enhancing LLM Inference Efficiency Through Parallel Execution of Transformer Layers

LLMs have demonstrated exceptional capabilities, but their substantial computational demands pose significant challenges for large-scale deployment. While previous studies indicate that intermediate layers in deep neural networks can be reordered or removed without severely impacting performance, these insights have not been systematically leveraged to reduce inference costs. Given the rapid expansion of LLMs, which often contain hundreds of billions of parameters, optimizing inference is critical for improving efficiency, reducing latency, and reducing operational expenses. High-traffic applications relying on cloud-based LLM inference can incur monthly costs in the millions, making efficiency-driven solutions essential. Furthermore, the ability to deploy these models on resource-constrained devices necessitates strategies that maintain performance while minimizing computational overhead. Despite architectural similarities between modern transformers and deep residual networks, where layer depth can sometimes be redundant, research has yet to explore these redundancies to fully optimize inference efficiency.

Several approaches exist for improving the computational efficiency of LLMs, including pruning, quantization, and parallelization. Pruning eliminates redundant parameters to introduce sparsity, improving memory utilization and processing speed. Quantization, on the other hand, reduces precision by converting floating-point computations to lower-bit integer formats like INT8 or INT4, enhancing hardware efficiency and energy savings. Additionally, parallelization techniques, such as tensor and pipeline parallelism, distribute workloads across multiple processing units to accelerate inference while addressing communication overhead. Recent innovations have also explored architectural modifications at the layer level, including layer fusion and dynamic recurrent execution, to streamline computational graphs. However, research has yet to focus on fusing consecutive layers through tensor parallelism, presenting an open avenue for optimizing inference further.

Researchers from the University of Geneva, EPFL, and Meta FAIR propose a method to reduce the depth of pre-trained LLMs while preserving performance. Modifying the computational graph enables parallel execution of grouped layer pairs, improving inference speed by approximately 1.20× without requiring retraining. Their approach maintains 95%-99% accuracy across perplexity and In-Context Learning (ICL) benchmarks. Additionally, fine-tuning helps recover minor performance losses. This method significantly enhances efficiency for large-scale LLM deployment, demonstrating that structural transformations, such as layer merging and reordering, can optimize computational workload while sustaining model effectiveness.

The study examines the effective depth of LLMs by applying transformations such as shuffling, merging, and pruning layers. Results indicate weak dependencies between intermediary layers, enabling certain layers to be reordered or parallelized with minimal perplexity loss. Running contiguous layers in parallel reduces depth while preserving performance, highlighting layer independence. Further, Layer Parallelism distributes computations across GPUs, optimizing efficiency through tensor parallelism. Modifications to attention and feed-forward networks ensure effective parallel execution. Adjustments to layer normalization help maintain stability. These findings suggest that transformer models can leverage parallelism to enhance computational efficiency without requiring substantial architectural modifications.
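Conceptually, running two adjacent blocks in parallel can be sketched as below, where both blocks see the same input and their residual updates are summed. The exact merge rule and the paper's treatment of attention and layer normalization may differ from this simplification.

import torch
import torch.nn as nn

class ParallelPair(nn.Module):
    """Run two pretrained transformer blocks on the same input and sum their
    residual updates, instead of feeding one block's output into the next.
    This is a conceptual reading of Layer Parallelism; the precise merge rule
    used in the paper is not reproduced here."""
    def __init__(self, block_a: nn.Module, block_b: nn.Module):
        super().__init__()
        self.block_a, self.block_b = block_a, block_b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both blocks can be dispatched to different GPUs or CUDA streams,
        # since neither depends on the other's output.
        # Assumes each block returns x + delta(x), so subtracting x recovers delta.
        return x + (self.block_a(x) - x) + (self.block_b(x) - x)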

The study evaluates Layer Parallelism in terms of inference speed, ICL accuracy, and fine-tuning for performance recovery. Experiments use Llama2 7B and Llama3.2 3B on dual A100 GPUs. Layer Parallelism is applied to merged layers, with Tensor Parallelism elsewhere. Results show that ICL accuracy declines beyond 14 parallelized layers for Llama2 7B and 10 for Llama3.2 3B. Speed improves proportionally, reaching a 1.38x boost at aggressive parallelism. Fine-tuning parallelized layers on RedPajama data significantly restores accuracy, improving MMLU from 83.6% to 94.4% while maintaining speed gains, demonstrating the viability of Layer Parallelism with targeted adjustments.

In conclusion, the study introduces Layer Parallelism (LP), which restructures transformer computation by executing layer pairs in parallel, improving inference speed without retraining. Applied to Llama2 7B and Llama3.2 3B, LP reduced model depth by 21% and 18%, yielding speed-ups of 1.29x and 1.22x, respectively. Fine-tuning recovered 10.8% of lost accuracy, proving its effectiveness. These findings challenge the notion that transformer layers must process sequentially, suggesting selective parallelization is viable. LP enhances LLM efficiency in production, with future work exploring optimal layer grouping, interactions with quantization, and deeper theoretical insights into layer independence and computational efficiency.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


ByteDance Introduces UltraMem: A Novel AI Architecture for High-Performance, Resource-Efficient Language Models

Large Language Models (LLMs) have revolutionized natural language processing (NLP) but face significant challenges in practical applications due to their large computational demands. While scaling these models improves performance, it creates substantial resource constraints in real-time applications. Current solutions like Mixture of Experts (MoE) enhance training efficiency through selective parameter activation but suffer from slower inference due to increased memory access requirements. Another solution, Product Key Memory (PKM), maintains consistent memory access with fewer value embeddings but delivers subpar performance compared to MoE. MoE models, despite having 12 times more parameters than dense models, operate 2 to 6 times slower during inference.

Various approaches have emerged to address the computational challenges in LLMs. Researchers have focused on enhancing MoE’s gating functions through improved token choice mechanisms and expert selection strategies to combat expert imbalance. Recent developments involve slicing experts into smaller segments while activating multiple experts per token. PKM represents another approach, implementing the smallest possible expert configuration, with subsequent improvements including parallel operation with MLPs and modified value activation methods. Lastly, tensor decomposition techniques have been explored to break down large tensors into smaller components, with product quantization enabling vector reconstruction using fewer sub-vectors to reduce model parameters.

A team from Seed-Foundation-Model at ByteDance has proposed UltraMem, a novel architecture that revolutionizes the implementation of large-scale memory layers in language models. It is built upon the foundation of PKM while introducing ultra-sparse memory layers that dramatically improve computational efficiency and reduce inference latency. UltraMem achieves superior performance compared to both PKM and MoE models at equivalent scales, making it particularly suitable for resource-constrained environments. UltraMem demonstrates remarkable scaling capabilities, outperforming MoE in inference speed by up to 6 times under common batch sizes, while maintaining computational efficiency comparable to dense models.

UltraMem adopts a Pre-LayerNorm Transformer architecture with significant modifications to address the limitations of traditional PKM structures. The architecture distributes multiple smaller memory layers at fixed intervals throughout the transformer layers, replacing the single large memory layer used in PKM. This distribution tackles the difficulty in finding correct values when value size increases and the unbalanced computation across multiple GPUs during large-scale training. The design also addresses the inherent bias in product key decomposition, where traditional top-k retrieval is constrained by row and column positions. Moreover, the skip-layer structure optimizes the memory-bound operations during training and improves overall computational efficiency.
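For orientation, the product-key lookup that UltraMem builds on can be sketched as follows. This reflects the original PKM retrieval scheme; UltraMem's own modifications, such as distributing multiple smaller memory layers and correcting the row/column retrieval bias, are omitted.

import torch

def product_key_lookup(query, keys_row, keys_col, values, top_k=4):
    """Minimal product-key memory lookup in the spirit of PKM, the building
    block UltraMem starts from. The query is split in two halves, each half is
    scored against a small key table, and the Cartesian combination of the best
    rows and columns indexes a huge value table without scoring all of it."""
    d = query.shape[-1] // 2
    q1, q2 = query[:d], query[d:]
    s1, i1 = (keys_row @ q1).topk(top_k)          # best rows
    s2, i2 = (keys_col @ q2).topk(top_k)          # best columns
    scores = (s1[:, None] + s2[None, :]).reshape(-1)
    idx = (i1[:, None] * keys_col.shape[0] + i2[None, :]).reshape(-1)
    best = scores.topk(top_k)
    weights = torch.softmax(best.values, dim=-1)
    return weights @ values[idx[best.indices]]     # weighted sum of top values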

The performance evaluation of UltraMem across various model sizes shows impressive results against existing architectures. With equivalent parameters and computation costs, UltraMem outperforms PKM and MoE models as capacity increases. An UltraMem model with 12 times the parameters matches the performance of a 6.5B dense model while maintaining the computational efficiency of a 1.6B dense model. Scaling experiments reveal that UltraMem maintains stable inference times even with exponential parameter growth, provided the activated parameters remain constant. This contrasts sharply with MoE models, which show significant performance degradation, highlighting UltraMem's superior efficiency in managing sparse parameters.

This paper introduces UltraMem, which represents a significant advancement in LLM architecture, showing superior performance characteristics compared to existing approaches. It achieves up to six times faster processing speeds than MoE models while maintaining minimal memory access requirements. UltraMem exhibits enhanced scaling capabilities as model capacity increases, outperforming MoE models with equivalent parameters and computational resources. These impressive results establish UltraMem as a promising foundation for developing more efficient and scalable language models, advancing the field of NLP by enabling more powerful models with practical resource requirements.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


Meta AI Introduces CoCoMix: A Pretraining Framework Integrating Token Prediction with Continuous Concepts

The dominant approach to pretraining large language models (LLMs) relies on next-token prediction, which has proven effective in capturing linguistic patterns. However, this method comes with notable limitations. Language tokens often convey surface-level information, requiring models to process vast amounts of data to develop deeper reasoning capabilities. Additionally, token-based learning struggles with capturing long-term dependencies, making tasks that require planning and abstraction more difficult. Researchers have explored alternative strategies, such as knowledge distillation and structured input augmentation, but these approaches have not fully addressed the limitations of token-based learning. This raises an important question: Can LLMs be trained in a way that combines token-level processing with conceptual understanding? Meta AI introduces Continuous Concept Mixing (CoCoMix) as a potential solution.

CoCoMix: A Different Approach to Pretraining

CoCoMix integrates token prediction with the modeling of continuous concepts derived from hidden states of a pretrained model. The method employs a Sparse Autoencoder (SAE) to extract high-level semantic representations, which are then incorporated into the training process by interleaving them with token embeddings. This design allows the model to maintain the benefits of token-based learning while enhancing its ability to recognize and process broader conceptual structures. By enriching the token-based paradigm with concept-level information, CoCoMix aims to improve reasoning efficiency and model interpretability.

Technical Details and Benefits

CoCoMix operates through three main components:

Concept Extraction via Sparse Autoencoders (SAEs): A pretrained SAE identifies latent semantic features from a model’s hidden states, capturing information that extends beyond individual tokens.

Concept Selection with Attribution Scoring: Not all extracted concepts contribute equally to predictions. CoCoMix employs attribution methods to determine which concepts are most influential and should be retained.

Interleaving Continuous Concepts with Token Representations: The selected concepts are compressed into a continuous vector and integrated into the hidden states alongside token embeddings, allowing the model to utilize both token-level and conceptual information.

This approach improves sample efficiency, enabling models to achieve comparable performance with fewer training tokens. Additionally, CoCoMix enhances interpretability by making it possible to inspect and adjust the extracted concepts, offering a clearer view of how the model processes information.
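The mixing step can be sketched as follows, with assumed dimensions, a top-k concept selection rule, and a simple insertion scheme standing in for the interleaving described in the paper.

import torch
import torch.nn as nn

class ConceptMixer(nn.Module):
    """Toy version of CoCoMix's mixing step: encode a hidden state with a sparse
    autoencoder's encoder, keep the strongest concept activations, project them
    back to a continuous vector, and insert that vector into the hidden-state
    sequence next to the token representations. Dimensions, the top-k rule, and
    the insertion scheme are illustrative assumptions."""
    def __init__(self, d_model=768, n_concepts=4096, k=32):
        super().__init__()
        self.sae_enc = nn.Linear(d_model, n_concepts)   # a pretrained SAE encoder in practice
        self.proj = nn.Linear(n_concepts, d_model)      # compress concepts to one vector
        self.k = k

    def forward(self, hidden):                          # hidden: (batch, seq, d_model)
        acts = torch.relu(self.sae_enc(hidden[:, -1]))  # concept activations for the latest position
        topk = acts.topk(self.k, dim=-1)
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        concept_vec = self.proj(sparse).unsqueeze(1)    # (batch, 1, d_model)
        return torch.cat([hidden, concept_vec], dim=1)  # interleave with token states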

Performance and Evaluation

Meta AI evaluated CoCoMix across multiple benchmarks, including OpenWebText, LAMBADA, WikiText-103, HellaSwag, PIQA, SIQA, Arc-Easy, and WinoGrande. The findings indicate:

Improved Sample Efficiency: CoCoMix matches the performance of next-token prediction while requiring 21.5% fewer training tokens.

Enhanced Generalization: Across various model sizes (69M, 386M, and 1.38B parameters), CoCoMix demonstrated consistent improvements in downstream task performance.

Effective Knowledge Transfer: CoCoMix supports knowledge transfer from smaller models to larger ones, outperforming traditional knowledge distillation techniques.

Greater Interpretability: The integration of continuous concepts allows for greater control and transparency in model decision-making, providing a clearer understanding of its internal processes.

Conclusion

CoCoMix presents an alternative approach to LLM pretraining by combining token prediction with concept-based reasoning. By incorporating structured representations extracted via SAEs, CoCoMix enhances efficiency and interpretability without disrupting the underlying next-token prediction framework. Experimental results suggest that this method provides a balanced way to improve language model training, particularly in areas requiring structured reasoning and transparent decision-making. Future research may focus on refining concept extraction methods and further integrating continuous representations into pretraining workflows.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


Anthropic AI Launches the Anthropic Economic Index: A Data-Driven Look …

Artificial Intelligence is increasingly integrated into various sectors, yet there is limited empirical evidence on its real-world application across industries. Traditional research methods—such as predictive modeling and user surveys—struggle to capture AI’s evolving role in workplaces. This makes it difficult to assess its influence on productivity, labor markets, and economic structures. A more data-driven approach is necessary to gain meaningful insights into how AI is being utilized and its broader implications.

Anthropic AI has launched the Anthropic Economic Index, an initiative designed to track AI’s role in economic activities. The first report, based on millions of anonymized Claude conversations, maps AI usage across various job categories using the U.S. Department of Labor’s O*NET Database. The findings suggest that AI is primarily used in software development and writing tasks, with these categories accounting for nearly half of all AI interactions. Furthermore, 36% of occupations incorporate AI for at least a quarter of their associated tasks, indicating its growing presence in diverse industries. This framework provides a structured approach to observing AI’s economic footprint over time.

Technical Approach and Key Benefits

The Anthropic Economic Index leverages Clio, a privacy-preserving analysis tool, to study over four million conversations from Claude.ai users. By categorizing AI interactions according to occupational tasks defined in O*NET, the research highlights patterns in AI adoption. Some key observations include:

AI is widely used in software engineering and content creation, reflecting its strength in technical and creative domains.

The depth of AI usage varies by occupation, with 4% of professions using AI for at least 75% of their tasks.

Cognitive tasks, such as reading comprehension, writing, and critical thinking, dominate AI interactions, whereas physical and managerial tasks see lower engagement.

AI adoption is highest in mid-to-high wage occupations, particularly in the technology sector, while its presence in lower-wage or highly specialized fields remains limited.

One of the primary benefits of this approach is its ability to continuously track AI’s economic role, helping businesses, policymakers, and researchers understand how AI is reshaping the workforce.

Key Insights and Patterns

The study distinguishes between AI augmentation—where AI enhances human capabilities—and automation, where AI independently completes tasks. Findings indicate that 57% of AI interactions involve augmentation, such as refining ideas or generating drafts, while 43% are automation-driven, where AI executes tasks with minimal human intervention.

Other key observations include:

Software development and data science are the most AI-intensive fields, accounting for 37.2% of AI-related conversations.

Writing, education, and business operations also show significant AI usage, particularly in content creation and analytical tasks.

Occupations requiring physical labor, such as construction and healthcare support, demonstrate lower AI adoption.

AI use is most common in jobs requiring a bachelor’s degree, especially those in Job Zone 4 (substantial preparation required), whereas highly specialized fields (Job Zone 5), such as medicine and law, show lower adoption due to professional and regulatory constraints.

Conclusion

The Anthropic Economic Index provides a structured way to examine AI’s impact on various occupations. While AI adoption is growing, its role differs across professions—enhancing work in some areas while automating tasks in others. By offering a data-backed perspective on how AI is integrated into the economy, this initiative enables better-informed discussions on the future of work. As AI evolves, continued analysis will be essential to understanding its long-term economic effects and guiding thoughtful policy decisions.

Check out the Paper and Technical Details. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.

The post Anthropic AI Launches the Anthropic Economic Index: A Data-Driven Look at AI’s Economic Role appeared first on MarkTechPost.

Can 1B LLM Surpass 405B LLM? Optimizing Computation for Small LLMs to …

Test-Time Scaling (TTS) is a crucial technique for enhancing the performance of LLMs by leveraging additional computational resources during inference. Despite its potential, there has been little systematic analysis of how policy models, Process Reward Models (PRMs), and problem complexity influence TTS, limiting its practical application. TTS can be categorized into Internal TTS, which encourages step-by-step reasoning through extended Chain-of-Thought (CoT) processes, and External TTS, which enhances performance using sampling or search-based methods with fixed models. The key challenge in External TTS lies in optimizing computational allocation for different tasks. Current methods employ PRMs to guide answer selection and scale test-time computation efficiently. However, a comprehensive evaluation of how these factors impact TTS strategies remains unexplored, restricting the community’s understanding of optimal computation scaling for LLMs.

Prior research has explored multiple strategies to enhance LLM performance, including majority voting, search-based approaches, and self-refinement techniques. Test-time methods such as CoT prompting, self-verification, and external tool integration have proven effective in improving reasoning without modifying model parameters. PRMs, which outperform Output Reward Models (ORMs), significantly refine LLM-generated outputs. Recent advancements in PRMs focus on efficient data collection methods, implicit rewards, and advanced ranking techniques to improve mathematical reasoning. Tools like ProcessBench and PRMBench have been developed to facilitate benchmarking and evaluate PRM effectiveness. The evolution of PRMs and TTS strategies underscores the need for systematic research to optimize inference-time computation and enhance LLM capabilities across diverse tasks.

Researchers from Shanghai AI Laboratory, Tsinghua University, Harbin Institute of Technology, and BUPT investigate the impact of policy models, PRMs, and problem complexity on TTS through extensive experiments on MATH-500 and AIME24 tasks. Their findings show that compute-optimal TTS strategies depend on these factors, allowing smaller models (e.g., 1B, 3B, 7B) to outperform larger ones (e.g., 405B, GPT-4o, DeepSeek-R1) with greater efficiency. The study emphasizes the importance of reward-aware TTS for optimal scaling, demonstrating that strategic test-time computation significantly enhances LLM reasoning abilities across different architectures and task complexities.

Compute-optimal TTS optimally distributes computational resources for each problem. Prior approaches rely on PRMs as verifiers, either trained on the same policy model (on-policy) or a different one (offline). On-policy PRMs yield more accurate rewards, while offline PRMs face out-of-distribution challenges. Given the high cost of training PRMs per model, a general approach is needed. Experiments show that rewards significantly influence TTS performance. Thus, a reward-aware strategy is proposed, integrating rewards into compute allocation. Additionally, problem difficulty is better assessed using absolute thresholds rather than quantiles for more effective scaling strategies.
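As a concrete illustration of reward-aware selection, the TypeScript sketch below performs best-of-N sampling in which a process reward model scores each reasoning trace and the highest-scoring candidate is kept. The generator and PRM here are trivial placeholders standing in for real models, and the mean step reward is only one of several aggregation strategies discussed in the literature:

// Sketch of reward-aware best-of-N test-time scaling with placeholder models.
type Candidate = { answer: string; steps: string[] };

// Placeholder policy model: in practice this samples a reasoning trace from an LLM.
async function generateCandidate(problem: string): Promise<Candidate> {
  const guess = Math.floor(Math.random() * 10).toString();
  return { answer: guess, steps: [`consider ${problem}`, `guess ${guess}`] };
}

// Placeholder PRM: in practice a learned reward model scores each reasoning step.
async function scoreSteps(_problem: string, steps: string[]): Promise<number[]> {
  return steps.map(() => Math.random());
}

// Best-of-N selection: sample N traces, score them with the PRM, return the top one.
async function bestOfN(problem: string, n: number): Promise<Candidate> {
  const candidates = await Promise.all(Array.from({ length: n }, () => generateCandidate(problem)));
  let best = candidates[0];
  let bestReward = -Infinity;
  for (const c of candidates) {
    const rewards = await scoreSteps(problem, c.steps);
    // Mean step reward is one aggregation choice; min or last-step rewards are common alternatives.
    const reward = rewards.reduce((s, r) => s + r, 0) / rewards.length;
    if (reward > bestReward) {
      bestReward = reward;
      best = c;
    }
  }
  return best;
}

bestOfN("2 + 3 * 4", 8).then((c) => console.log(c.answer));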

The study examines the effectiveness of Compute-Optimal TTS in enhancing the performance of small policy models compared to larger ones. Experiments assess whether TTS allows smaller models to outperform larger ones, improve upon CoT and majority voting, and surpass long-CoT methods. Findings reveal that small models using compute-optimal TTS can outperform significantly larger models on MATH-500 and AIME24 tasks. TTS improves efficiency by up to 256× compared to majority voting and boosts reasoning by 154.6% over CoT. Moreover, TTS outperforms several long-CoT-based methods, demonstrating its effectiveness in enhancing LLM reasoning capabilities.

In conclusion, the study examines compute-optimal TTS across various policy models, PRMs, and task complexities. Findings highlight that smaller models can surpass larger ones using optimized TTS, with a 1B model outperforming a 405B model. A 7B PRM also effectively supervises a 72B policy model, emphasizing a shift towards “weak-to-strong” supervision. Future work should focus on improving supervision methods for enhanced reasoning. While results are based on mathematical tasks, expanding TTS to coding and chemistry remains unexplored. These insights underscore TTS’s potential to refine LLM efficiency and adaptability across diverse challenges.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.

The post Can 1B LLM Surpass 405B LLM? Optimizing Computation for Small LLMs to Outperform Larger Models appeared first on MarkTechPost.

Build a dynamic, role-based AI agent using Amazon Bedrock inline agent …

AI agents continue to gain momentum, as businesses use the power of generative AI to reinvent customer experiences and automate complex workflows. We are seeing Amazon Bedrock Agents applied in investment research, insurance claims processing, root cause analysis, advertising campaigns, and much more. Agents use the reasoning capability of foundation models (FMs) to break down user-requested tasks into multiple steps. They use developer-provided instructions to create an orchestration plan and carry out that plan by securely invoking company APIs and accessing knowledge bases using Retrieval Augmented Generation (RAG) to accurately handle the user’s request.
Although organizations see the benefit of agents that are defined, configured, and tested as managed resources, we have increasingly seen the need for an additional, more dynamic way to invoke agents. Organizations need solutions that adjust on the fly—whether to test new approaches, respond to changing business rules, or customize solutions for different clients. This is where the new inline agents capability in Amazon Bedrock Agents becomes transformative. It allows you to dynamically adjust your agent’s behavior at runtime by changing its instructions, tools, guardrails, knowledge bases, prompts, and even the FMs it uses—all without redeploying your application.
In this post, we explore how to build an application using Amazon Bedrock inline agents, demonstrating how a single AI assistant can adapt its capabilities dynamically based on user roles.
Inline agents in Amazon Bedrock Agents
This runtime flexibility enabled by inline agents opens powerful new possibilities, such as:

Rapid prototyping – Inline agents minimize the time-consuming create/update/prepare cycles traditionally required for agent configuration changes. Developers can instantly test different combinations of models, tools, and knowledge bases, dramatically accelerating the development process.
A/B testing and experimentation – Data science teams can systematically evaluate different model-tool combinations, measure performance metrics, and analyze response patterns in controlled environments. This empirical approach enables quantitative comparison of configurations before production deployment.
Subscription-based personalization – Software companies can adapt features based on each customer’s subscription level, providing more advanced tools for premium users.
Persona-based data source integration – Institutions can adjust content complexity and tone based on the user’s profile, providing persona-appropriate explanations and resources by changing the knowledge bases associated with the agent on the fly.
Dynamic tool selection – Developers can create applications with hundreds of APIs, and quickly and accurately carry out tasks by dynamically choosing a small subset of APIs for the agent to consider for a given request. This is particularly helpful for large software as a service (SaaS) platforms needing multi-tenant scaling.

Inline agents expand your options for building and deploying agentic solutions with Amazon Bedrock Agents. For workloads needing managed and versioned agent resources with a pre-determined and tested configuration (specific model, instructions, tools, and so on), developers can continue to use InvokeAgent on resources created with CreateAgent. For workloads that need dynamic runtime behavior changes for each agent invocation, you can use the new InvokeInlineAgent API. With either approach, your agents will be secure and scalable, with configurable guardrails, a flexible set of model inference options, native access to knowledge bases, code interpretation, session memory, and more.
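As a rough illustration, the following TypeScript sketch invokes an inline agent with a configuration assembled at request time. The field names follow our reading of the InvokeInlineAgent request shape in @aws-sdk/client-bedrock-agent-runtime and should be verified against the current API reference; the Lambda ARNs, model ID, and instruction text are placeholders, and a functionSchema or apiSchema describing each tool’s operations would also be supplied in a real application:

import { randomUUID } from "node:crypto";
import {
  BedrockAgentRuntimeClient,
  InvokeInlineAgentCommand,
} from "@aws-sdk/client-bedrock-agent-runtime";

const client = new BedrockAgentRuntimeClient({ region: "us-east-1" });

export async function askInlineAgent(userRole: "employee" | "manager", inputText: string): Promise<string> {
  // Assemble the tool list per request based on the caller's role (Lambda ARNs are placeholders).
  const actionGroups =
    userRole === "manager"
      ? [{ actionGroupName: "CompensationManagement", actionGroupExecutor: { lambda: "arn:aws:lambda:region:account:function:compensation" } }]
      : [{ actionGroupName: "ApplyVacation", actionGroupExecutor: { lambda: "arn:aws:lambda:region:account:function:vacation" } }];

  const response = await client.send(
    new InvokeInlineAgentCommand({
      sessionId: randomUUID(),
      foundationModel: "anthropic.claude-3-5-sonnet-20240620-v1:0",
      instruction: `You are an HR assistant. The user is a(n) ${userRole}.`,
      inputText,
      actionGroups,
    })
  );

  // The answer comes back as an event stream of chunks.
  let completion = "";
  for await (const event of response.completion ?? []) {
    if (event.chunk?.bytes) {
      completion += new TextDecoder().decode(event.chunk.bytes);
    }
  }
  return completion;
}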
Solution overview
Our HR assistant example shows how to build a single AI assistant that adapts to different user roles using the new inline agent capabilities in Amazon Bedrock Agents. When users interact with the assistant, the assistant dynamically configures agent capabilities (such as model, instructions, knowledge bases, action groups, and guardrails) based on the user’s role and their specific selections. This approach creates a flexible system that adjusts its functionality in real time, making it more efficient than creating separate agents for each user role or tool combination. The complete code for this HR assistant example is available on our GitHub repo.
This dynamic tool selection enables a personalized experience. When an employee logs in without direct reports, they see a set of tools that they have access to based on their role. They can select from options like requesting vacation time, checking company policies using the knowledge base, using a code interpreter for data analysis, or submitting expense reports. The inline agent assistant is then configured with only these selected tools, allowing it to assist the employee with their chosen tasks. In a real-world example, the user would not need to make the selection, because the application would make that decision and automatically configure the agent invocation at runtime. We make it explicit in this application to demonstrate the impact.
Similarly, when a manager logs in to the same system, they see an extended set of tools reflecting their additional permissions. In addition to the employee-level tools, managers have access to capabilities like running performance reviews. They can select which tools they want to use for their current session, instantly configuring the inline agent with their choices.
The inclusion of knowledge bases is also adjusted based on the user’s role. Employees and managers see different levels of company policy information, with managers getting additional access to confidential data like performance review and compensation details. For this demo, we’ve implemented metadata filtering to retrieve only the appropriate level of documents based on the user’s access level, further enhancing efficiency and security.
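For reference, the following sketch shows what a role-driven knowledge base configuration might look like. The equals-on-metadata-key filter mirrors the Knowledge Bases retrieval filter syntax, while the knowledge base ID, the access_level key, and its values are assumptions specific to this demo:

// Illustrative knowledge base configuration with a role-based metadata filter.
const knowledgeBaseConfig = (userRole: "employee" | "manager") => ({
  knowledgeBaseId: "KB_ID_PLACEHOLDER",
  description: "Company policies, filtered to the caller's access level.",
  retrievalConfiguration: {
    vectorSearchConfiguration: {
      numberOfResults: 5,
      // Only documents whose metadata lists the caller's access level are retrieved.
      filter: { equals: { key: "access_level", value: userRole } },
    },
  },
});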
Let’s look at how the interface adapts to different user roles.
The employee view provides access to essential HR functions like vacation requests, expense submissions, and company policy lookups. Users can select which of these tools they want to use for their current session.

The manager view extends these options to include supervisory functions like compensation management, demonstrating how the inline agent can be configured with a broader set of tools based on user permissions.

Without inline agents, we would need to build and maintain two separate agents to support these two roles.
As shown in the preceding screenshots, the same HR assistant offers different tool selections based on the user’s role. An employee sees options like Knowledge Base, Apply Vacation Tool, and Submit Expense, whereas a manager has additional options like Performance Evaluation. Users can select which tools they want to add to the agent for their current interaction.
This flexibility allows for quick adaptation to user needs and preferences. For instance, if the company introduces a new policy for creating business travel requests, the tool catalog can be quickly updated to include a Create Business Travel Reservation tool. Employees can then choose to add this new tool to their agent configuration when they need to plan a business trip, or the application could automatically do so based on their role.
With Amazon Bedrock inline agents, you can create a catalog of actions that is dynamically selected by the application or by users of the application. This increases the level of flexibility and adaptability of your solutions, making them a perfect fit for navigating the complex, ever-changing landscape of modern business operations. Users have more control over their AI assistant’s capabilities, and the system remains efficient by only loading the necessary tools for each interaction.
Technical foundation: Dynamic configuration and action selection
Inline agents allow dynamic configuration at runtime, enabling a single agent to effectively perform the work of many. By specifying action groups and modifying instructions on the fly, even within the same session, you can create versatile AI applications that adapt to various scenarios without multiple agent deployments.
The following are key points about inline agents:

Runtime configuration – Change the agent’s configuration, including its FM, at runtime. This enables rapid experimentation and adaptation without redeploying the application, reducing development cycles.
Governance at tool level – Apply governance and access control at the tool level. With agents changing dynamically at runtime, tool-level governance helps maintain security and compliance regardless of the agent’s configuration.
Agent efficiency – Provide only the necessary tools and instructions at runtime to reduce token usage and improve the agent’s accuracy. With fewer tools to choose from, it’s less complicated for the agent to select the right one, reducing hallucinations in the tool selection process. This approach can also lead to lower costs and improved latency compared to static agents because removing unnecessary tools, knowledge bases, and instructions reduces the number of input and output tokens being processed by the agent’s large language model (LLM).
Flexible action catalog – Create reusable actions for dynamic selection based on specific needs. This modular approach simplifies maintenance, updates, and scalability of your AI applications.

The following are examples of reusable actions:

Enterprise system integration – Connect with systems like Salesforce, GitHub, or databases
Utility tools – Perform common tasks such as sending emails or managing calendars
Team-specific API access – Interact with specialized internal tools and services
Data processing – Analyze text, structured data, or other information
External services – Fetch weather updates, stock prices, or perform web searches
Specialized ML models – Use specific machine learning (ML) models for targeted tasks

When using inline agents, you configure parameters for the following:

Contextual tool selection based on user intent or conversation flow
Adaptation to different user roles and permissions
Switching between communication styles or personas
Model selection based on task complexity

The inline agent uses the configuration you provide at runtime, allowing for highly flexible AI assistants that efficiently handle various tasks across different business contexts.
Building an HR assistant using inline agents
Let’s look at how we built our HR Assistant using Amazon Bedrock inline agents:

Create a tool catalog – We developed a demo catalog of HR-related tools, including:

Knowledge Base – Using Amazon Bedrock Knowledge Bases for accessing company policies and guidelines based on the role of the application user. To filter the knowledge base content by the user’s role, you also need to provide a metadata file specifying the employee roles that can access each file.
Apply Vacation – For requesting and tracking time off.
Expense Report – For submitting and managing expense reports.
Code Interpreter – For performing calculations and data analysis.
Compensation Management – For conducting and reviewing employee compensation assessments (manager-only access).

Set conversation tone – We defined multiple conversation tones to suit different interaction styles:

Professional – For formal, business-like interactions.
Casual – For friendly, everyday support.
Enthusiastic – For upbeat, encouraging assistance.

Implement access control – We implemented role-based access control. The application backend checks the user’s role (employee or manager), provides access to the appropriate tools and information, and passes this information to the inline agent. The role information is also used to configure metadata filtering in the knowledge bases to generate relevant responses. The system allows for dynamic tool use at runtime: users can switch personas or add and remove tools during their session, allowing the agent to adapt to different conversation needs in real time (a minimal sketch of this role-based configuration appears after this list).
Integrate the agent with other services and tools – We connected the inline agent to:

Amazon Bedrock Knowledge Bases for company policies, with metadata filtering for role-based access.
AWS Lambda functions for executing specific actions (such as submitting vacation requests or expense reports).
A code interpreter tool for performing calculations and data analysis.

Create the UI – We created a Flask-based UI that performs the following actions:

Displays available tools based on the user’s role.
Allows users to select different personas.
Provides a chat window for interacting with the HR assistant.
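The following TypeScript sketch (with illustrative tool names) shows the kind of role-based catalog filtering the backend performs before configuring the inline agent:

// Sketch of the role-based tool catalog used by the demo backend (names are illustrative).
type Role = "employee" | "manager";

interface Tool {
  name: string;
  minimumRole: Role;
}

const toolCatalog: Tool[] = [
  { name: "KnowledgeBase", minimumRole: "employee" },
  { name: "ApplyVacation", minimumRole: "employee" },
  { name: "SubmitExpense", minimumRole: "employee" },
  { name: "CodeInterpreter", minimumRole: "employee" },
  { name: "CompensationManagement", minimumRole: "manager" },
];

// Return the tools the user may select, restricted to what they actually picked for this session.
function configureSession(role: Role, selected: string[]): Tool[] {
  const allowed = toolCatalog.filter((t) => t.minimumRole === "employee" || role === "manager");
  return allowed.filter((t) => selected.includes(t.name));
}

// A manager who only wants two tools for this session:
console.log(configureSession("manager", ["KnowledgeBase", "CompensationManagement"]));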

To understand how this dynamic role-based functionality works under the hood, let’s examine the following system architecture diagram.

As shown in the preceding architecture diagram, the system works as follows:

The end-user logs in and is identified as either a manager or an employee.
The user selects the tools that they have access to and makes a request to the HR assistant.
The agent breaks down the problem and uses the available tools to address the query in steps, which may include:

Amazon Bedrock Knowledge Bases (with metadata filtering for role-based access).
Lambda functions for specific actions.
Code interpreter tool for calculations.
Compensation tool (accessible only to managers to submit base pay raise requests).

The application uses the Amazon Bedrock inline agent to dynamically pass in the appropriate tools based on the user’s role and request.
The agent uses the selected tools to process the request and provide a response to the user.

This approach provides a flexible, scalable solution that can quickly adapt to different user roles and changing business needs.
Conclusion
In this post, we introduced the Amazon Bedrock inline agent functionality and highlighted its application to an HR use case. We dynamically selected tools based on the user’s roles and permissions, adapted instructions to set a conversation tone, and selected different models at runtime. With inline agents, you can transform how you build and deploy AI assistants. By dynamically adapting tools, instructions, and models at runtime, you can:

Create personalized experiences for different user roles
Optimize costs by matching model capabilities to task complexity
Streamline development and maintenance
Scale efficiently without managing multiple agent configurations

For organizations demanding highly dynamic behavior—whether you’re an AI startup, SaaS provider, or enterprise solution team—inline agents offer a scalable approach to building intelligent assistants that grow with your needs. To get started, explore our GitHub repo and HR assistant demo application, which demonstrate key implementation patterns and best practices.
To learn more about how to be most successful in your agent journey, read our two-part blog series:

Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 1
Best practices for building robust generative AI applications with Amazon Bedrock Agents – Part 2

To get started with Amazon Bedrock Agents, check out the following GitHub repository with example code.

About the authors
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.
Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.
Ashrith Chirutani is a Software Development Engineer at Amazon Web Services (AWS). He specializes in backend system design, distributed architectures, and scalable solutions, contributing to the development and launch of high-impact systems at Amazon. Outside of work, he spends his time playing ping pong and hiking through Cascade trails, enjoying the outdoors as much as he enjoys building systems.
Shubham Divekar is a Software Development Engineer at Amazon Web Services (AWS), working in Agents for Amazon Bedrock. He focuses on developing scalable systems on the cloud that enable AI applications frameworks and orchestrations. Shubham also has a background in building distributed, scalable, high-volume-high-throughput systems in IoT architectures.
Vivek Bhadauria is a Principal Engineer for Amazon Bedrock. He focuses on building deep learning-based AI and computer vision solutions for AWS customers. Outside of work, Vivek enjoys trekking and following cricket.

Use language embeddings for zero-shot classification and semantic sear …

In this post, we discuss what embeddings are, show how to practically use language embeddings, and explore how to use them to add functionality such as zero-shot classification and semantic search. We then use Amazon Bedrock and language embeddings to add these features to a Really Simple Syndication (RSS) aggregator application.
Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. Amazon Bedrock offers a serverless experience, so you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using Amazon Web Services (AWS) services without having to manage infrastructure. For this post, we use the Cohere v3 Embed model on Amazon Bedrock to create our language embeddings.
Use case: RSS aggregator
To demonstrate some of the possible uses of these language embeddings, we developed an RSS aggregator website. RSS is a web feed that allows publications to publish updates in a standardized, computer-readable way. On our website, users can subscribe to an RSS feed and have an aggregated, categorized list of the new articles. We use embeddings to add the following functionalities:

Zero-shot classification – Articles are classified between different topics. There are some default topics, such as Technology, Politics, and Health & Wellbeing, as shown in the following screenshot. Users can also create their own topics.

Semantic search – Users can search their articles using semantic search, as shown in the following screenshot. Users can not only search for a specific topic but also narrow their search by factors such as tone or style.

This post uses this application as a reference point to discuss the technical implementation of the semantic search and zero-shot classification features.
Solution overview
This solution uses the following services:

Amazon API Gateway – The API is accessible through Amazon API Gateway. Caching is performed on Amazon CloudFront for certain topics to ease the database load.
Amazon Bedrock with Cohere v3 Embed – The articles and topics are converted into embeddings with the help of Amazon Bedrock and Cohere v3 Embed.
Amazon CloudFront and Amazon Simple Storage Service (Amazon S3) – The single-page React application is hosted using Amazon S3 and Amazon CloudFront.
Amazon Cognito – Authentication is done using Amazon Cognito user pools.
Amazon EventBridge – Amazon EventBridge and EventBridge schedules are used to coordinate new updates.
AWS Lambda – The API is a Fastify application written in TypeScript. It’s hosted on AWS Lambda.
Amazon Aurora PostgreSQL-Compatible Edition and pgvector – Amazon Aurora PostgreSQL-Compatible is used as the database, both for the functionality of the application itself and as a vector store using pgvector.
Amazon RDS Proxy – Amazon RDS Proxy is used for connection pooling.
Amazon Simple Queue Service (Amazon SQS) – Amazon SQS is used to queue events. It consumes one event at a time so it doesn’t hit the rate limit of Cohere in Amazon Bedrock.

The following diagram illustrates the solution architecture.

What are embeddings?
This section offers a quick primer on what embeddings are and how they can be used.
Embeddings are numerical representations of concepts or objects, such as language or images. In this post, we discuss language embeddings. By reducing these concepts to numerical representations, we can then use them in a way that a computer can understand and operate on.
Let’s take Berlin and Paris as an example. As humans, we understand the conceptual links between these two words. Berlin and Paris are both cities, they’re capitals of their respective countries, and they’re both in Europe. We understand their conceptual similarities almost instinctively, because we can create a model of the world in our head. However, computers have no built-in way of representing these concepts.
To represent these concepts in a way a computer can understand, we convert them into language embeddings. Language embeddings are high dimensional vectors that learn their relationships with each other through the training of a neural network. During training, the neural network is exposed to enormous amounts of text and learns patterns based on how words are colocated and relate to each other in different contexts.
Embedding vectors allow computers to model the world from language. For instance, if we embed “Berlin” and “Paris,” we can now perform mathematical operations on these embeddings. We can then observe some fairly interesting relationships. For instance, we could do the following: Paris – France + Germany ~= Berlin. This is because the embeddings capture the relationships between the words “Paris” and “France” and between “Germany” and “Berlin”—specifically, that Paris and Berlin are both capital cities of their respective countries.
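The following TypeScript toy example makes this arithmetic explicit. The three-dimensional vectors are invented purely for illustration, since real embeddings have hundreds or thousands of dimensions:

// Toy vectors to illustrate the "Paris - France + Germany ~= Berlin" relationship.
// The numbers are invented for illustration only.
const vec = {
  paris:   [0.9, 0.8, 0.1],
  france:  [0.9, 0.1, 0.1],
  germany: [0.1, 0.1, 0.9],
  berlin:  [0.1, 0.8, 0.9],
};

const add = (a: number[], b: number[]) => a.map((v, i) => v + b[i]);
const sub = (a: number[], b: number[]) => a.map((v, i) => v - b[i]);

// paris - france + germany
const result = add(sub(vec.paris, vec.france), vec.germany);
console.log(result); // [0.1, 0.8, 0.9], identical to vec.berlin in this toy example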
The following graph shows the word vector distance between countries and their respective capitals.

Subtracting “France” from “Paris” removes the country semantics, leaving a vector representing the concept of a capital city. Adding “Germany” to this vector, we are left with something closely resembling “Berlin,” the capital of Germany. The vectors for this relationship are shown in the following graph.

For our use case, we use the pre-trained Cohere Embeddings model in Amazon Bedrock, which embeds entire texts rather than a single word. The embeddings represent the meaning of the text and can be operated on using mathematical operations. This property can be useful to map relationships such as similarity between texts.
Zero-shot classification
One way in which we use language embeddings is by using their properties to calculate how similar an article is to one of the topics.
To do this, we break down a topic into a series of different and related embeddings. For instance, for culture, we have a set of embeddings for sports, TV programs, music, books, and so on. We then embed the incoming title and description of the RSS articles, and calculate the similarity against the topic embeddings. From this, we can assign topic labels to an article.
The following figure illustrates how this works. The embeddings that Cohere generates are high-dimensional, containing 1,024 values (or dimensions). However, to demonstrate how this system works, we use t-distributed Stochastic Neighbor Embedding (t-SNE), an algorithm designed to reduce the dimensionality of the embeddings, so that we can view them in two dimensions. The following image uses these embeddings to visualize how topics are clustered based on similarity and meaning.

You can use the embedding of an article and check the similarity of the article against the preceding embeddings. You can then say that if an article is clustered closely to one of these embeddings, it can be classified with the associated topic.
This is the k-nearest neighbor (k-NN) algorithm. This algorithm is used to perform classification and regression tasks. In k-NN, you can make assumptions around a data point based on its proximity to other data points. For instance, you can say that an article that has proximity to the music topic shown in the preceding diagram can be tagged with the culture topic.
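The sketch below shows this nearest-topic idea in TypeScript using cosine similarity. Note that the application itself ranks articles by Euclidean distance in pgvector; the topic structure and the similarity threshold here are illustrative assumptions:

// Sketch of nearest-topic classification; names and the threshold are illustrative.
const cosineSimilarity = (a: number[], b: number[]): number => {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (x: number[]) => Math.sqrt(x.reduce((s, v) => s + v * v, 0));
  return dot / (norm(a) * norm(b));
};

// Each topic is represented by the embeddings of a few related terms.
function classifyArticle(
  articleEmbedding: number[],
  topics: Record<string, number[][]>,
  threshold = 0.5
): string[] {
  const labels: string[] = [];
  for (const [topic, termEmbeddings] of Object.entries(topics)) {
    const best = Math.max(...termEmbeddings.map((e) => cosineSimilarity(articleEmbedding, e)));
    if (best >= threshold) labels.push(topic); // close enough to at least one term in the cluster
  }
  return labels;
}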
The following figure demonstrates this with an Ars Technica article. We plot the embedding of the article’s title and description against the topic embeddings: (The climate is changing so fast that we haven’t seen how bad extreme weather could get: Decades-old statistics no longer represent what is possible in the present day).

The advantage of this approach is that you can add custom, user-generated topics. You can create a topic by first creating a series of embeddings of conceptually related items. For instance, an AI topic would be similar to the embeddings for AI, Generative AI, LLM, and Anthropic, as shown in the following screenshot.

In a traditional classification system, we’d be required to train a classifier—a supervised learning task where we’d need to provide a series of examples to establish whether an article belongs to its respective topic. Doing so can be quite an intensive task, requiring labeled data and training the model. For our use case, we can provide examples, create a cluster, and tag articles without having to provide labeled examples or train additional models. This is shown in the following screenshot of the results page of our website.
In our application, we ingest new articles on a schedule. We use EventBridge schedules to periodically call a Lambda function, which checks if there are new articles. If there are, it creates an embedding from them using Amazon Bedrock and Cohere.
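A minimal sketch of that embedding call in TypeScript, assuming the Cohere Embed v3 model ID and request shape documented for Amazon Bedrock (verify against the current model reference; region and error handling are omitted), might look like the following:

// Sketch of creating embeddings with Amazon Bedrock and Cohere Embed v3.
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({ region: "us-east-1" });

export async function embedTexts(texts: string[]): Promise<number[][]> {
  const response = await bedrock.send(
    new InvokeModelCommand({
      modelId: "cohere.embed-english-v3",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({ texts, input_type: "search_document" }),
    })
  );
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  return payload.embeddings; // one 1,024-dimensional vector per input text
}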
We calculate the article’s distance to the different topic embeddings, and can then determine whether the article belongs to that category. This is done with Aurora PostgreSQL with pgvector. We store the embeddings of the topics and then calculate their distance using the following SQL query:

const topics = await sqlClient.then((it) =>
  it.query(
    `SELECT name, embedding_description, similarity
     FROM (
       SELECT topic_id AS name,
              embedding_description,
              (1 - ABS(1 - (embed.embedding <-> $1))) AS "similarity"
       FROM topic_embedding_link embed
     ) topics
     ORDER BY similarity DESC`,
    [toSql(articleEmbedding)]
  )
);

The <-> operator in the preceding code calculates the Euclidean distance between the article and the topic embedding. This number allows us to understand how close an article is to one of the topics. We can then determine the appropriateness of a topic based on this ranking.
We then tag the article with the topic. We do this so that the subsequent request for a topic is as computationally light as possible; we do a simple join rather than calculating the Euclidean distance.

const formattedTopicInsert = pgformat(
`INSERT INTO feed_article_topic_link(topic_id, feed_article_id) VALUES %L ON CONFLICT DO NOTHING`,
topicLinks
)

We also cache a specific topic/feed combination because these are calculated hourly and aren’t expected to change in the interim.
Semantic search
As previously discussed, the embeddings produced by Cohere contain a multitude of features; they embed the meanings and semantics of a word or phrase. We’ve also found that we can perform mathematical operations on these embeddings to do things such as calculate the similarity between two phrases or words.
We can use these embeddings and calculate the similarity between a search term and an embedding of an article with the k-NN algorithm to find articles that have similar semantics and meanings to the search term we’ve provided.
For example, in one of our RSS feeds, we have a lot of different articles that rate products. In a traditional search system, we’d rely on keyword matches to provide relevant results. Although it might be simple to find a specific article (for example, by searching “best digital notebooks”), we would need a different method to capture multiple product list articles.
In a semantic search system, we first transform the term “Product list” into an embedding. We can then use the properties of this embedding to perform a search within our embedding space. Using the k-NN algorithm, we can find articles that are semantically similar. As shown in the following screenshot, despite not containing the text “Product list” in either the title or description, we’ve been able to find articles that contain a product list. This is because we were able to capture the semantics of the query and match it to the existing embeddings we have for each article.

In our application, we store these embeddings using pgvector on Aurora PostgreSQL. pgvector is an open source extension that enables vector similarity search in PostgreSQL. We transform our search term into an embedding using Amazon Bedrock and Cohere v3 Embed.
After we’ve converted the search term to an embedding, we can compare it with the article embeddings that were saved during the ingestion process. We can then use pgvector to find articles that are clustered together. The SQL code for that is as follows:

SELECT *
FROM (
  SELECT feed_articles.id AS id, title, feed_articles.feed_id AS feed, feedName, slug, description,
         url, author, image, published_at AS published,
         1 - ABS(1 - (embedding <-> $2)) AS "similarity"
  FROM feed_articles
  INNER JOIN (
    SELECT feed_id, name AS feedName
    FROM feed_user_subscription fus
    WHERE fus.user_id = $1
  ) sub ON feed_articles.feed_id = sub.feed_id
  ${feedId != undefined ? `WHERE feed_articles.feed_id = $4` : ""}
) AS ranked_articles
WHERE similarity > 0.95
ORDER BY similarity DESC
LIMIT $3;

This query computes the similarity between the search term’s embedding ($2) and the embedding of each article the user is subscribed to, keeps only articles whose similarity exceeds the threshold, and returns them ordered by similarity so the most relevant results appear first.
Prerequisites
To deploy this application in your own account, you need the following prerequisites:

An active AWS account.
Model access for Cohere Embed English. On the Amazon Bedrock console, choose Model access in the navigation pane, then choose Manage model access. Select the FMs of your choice and request access.

The AWS Cloud Development Kit (AWS CDK) set up. For installation instructions, refer to Getting started with the AWS CDK.
A virtual private cloud (VPC) set up with access to private subnets. For more information, refer to Create a VPC.

Deploy the AWS CDK stack
When the prerequisite steps are complete, you’re ready to set up the solution:

Clone the GitHub repository containing the solution files: git clone https://github.com/aws-samples/rss-aggregator-using-cohere-embeddings-bedrock

Navigate to the solution directory: cd infrastructure

In your terminal, export the AWS credentials for a role or user in the target account (ACCOUNT_ID). The role needs all necessary permissions for AWS CDK deployment:

export AWS_REGION="<region>" – The AWS Region you want to deploy the application to
export AWS_ACCESS_KEY_ID="<access-key>" – The access key of your role or user
export AWS_SECRET_ACCESS_KEY="<secret-key>" – The secret key of your role or user

If you’re deploying the AWS CDK for the first time, run the following command: cdk bootstrap

To synthesize the AWS CloudFormation template, run the following command: cdk synth -c vpc_id=<ID of your VPC>

To deploy, use the following command: cdk deploy -c vpc_id=<ID of your VPC>

When deployment is finished, you can check these deployed stacks by visiting the AWS CloudFormation console, as shown in the following screenshot.

Clean up
Run the following command in the terminal to delete the CloudFormation stack provisioned using the AWS CDK:
cdk destroy --all
Conclusion
In this post, we explored what language embeddings are and how they can be used to enhance your application. We’ve learned how, by using the properties of embeddings, we can implement a real-time zero-shot classifier and can add powerful features such as semantic search.
The code for this application can be found on the accompanying GitHub repo. We encourage you to experiment with language embeddings and find out what powerful features they can enable for your applications!

About the Author
Thomas Rogers is a Solutions Architect based in Amsterdam, the Netherlands. He has a background in software engineering. At AWS, Thomas helps customers build cloud solutions, focusing on modernization, data, and integrations.

Convergence Labs Introduces the Large Memory Model (LM2): A Memory-Aug …

Transformer-based models have significantly advanced natural language processing (NLP), excelling in various tasks. However, they struggle with reasoning over long contexts, multi-step inference, and numerical reasoning. These challenges arise from their quadratic complexity in self-attention, making them inefficient for extended sequences, and their lack of explicit memory, which limits their ability to synthesize dispersed information effectively. Existing solutions, such as recurrent memory transformers (RMT) and retrieval-augmented generation (RAG), offer partial improvements but often sacrifice either efficiency or generalization.

Introducing the Large Memory Model (LM2)

Convergence Labs introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module to address the shortcomings of conventional models in long-context reasoning. Unlike standard Transformers, which rely solely on attention mechanisms, LM2 incorporates a structured memory system that interacts with input embeddings through cross-attention. The model’s memory updates are regulated by gating mechanisms, allowing it to selectively retain relevant information while preserving generalization capabilities. This design enables LM2 to maintain coherence across long sequences, facilitating improved relational reasoning and inference.

Technical Overview and Benefits

LM2 builds upon standard Transformer architecture by introducing three key innovations:

Memory-Augmented Transformer: A dedicated memory bank acts as an explicit long-term storage system, retrieving relevant information through cross-attention.

Hybrid Memory Pathway: Unlike previous models that modify the Transformer’s core structure, LM2 maintains the original information flow while integrating an auxiliary memory pathway.

Dynamic Memory Updates: The memory module selectively updates its stored information using learnable input, forget, and output gates, ensuring long-term retention without unnecessary accumulation of irrelevant data.

These enhancements allow LM2 to process long sequences more effectively while maintaining computational efficiency. By selectively incorporating relevant memory content, the model mitigates the gradual performance decline often observed in traditional architectures over extended contexts.
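The following TypeScript toy example illustrates the gated update idea in isolation. The two-slot memory, the fixed gate values, and the element-wise update rule are illustrative simplifications, not LM2’s actual parameterization:

// Toy sketch of a gated memory update in the spirit of LM2 (not the paper's implementation).
// Gates are per-dimension values in [0, 1]; in the real model they are learned functions of the input.
function updateMemory(
  memory: number[],
  candidate: number[],
  inputGate: number[],
  forgetGate: number[]
): number[] {
  return memory.map((m, i) => forgetGate[i] * m + inputGate[i] * candidate[i]);
}

// Example: mostly keep slot 0, overwrite slot 1 with new information.
const updated = updateMemory(
  [0.8, 0.2], // current memory
  [0.1, 0.9], // candidate content derived from the current input via cross-attention
  [0.1, 0.9], // input gate
  [0.9, 0.1]  // forget gate
);
console.log(updated); // [0.73, 0.83]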

Experimental Results and Insights

To evaluate LM2’s effectiveness, it was tested on the BABILong dataset, designed to assess memory-intensive reasoning capabilities. The results indicate substantial improvements:

Short-context performance (0K context length): LM2 achieves an accuracy of 92.5%, surpassing RMT (76.4%) and vanilla Llama-3.2 (40.7%).

Long-context performance (1K–4K context length): As context length increases, all models experience some degradation, but LM2 maintains a higher accuracy. At 4K context length, LM2 achieves 55.9%, compared to 48.4% for RMT and 36.8% for Llama-3.2.

Extreme long-context performance (≥8K context length): While all models decline in accuracy, LM2 remains more stable, outperforming RMT in multi-step inference and relational argumentation.

Beyond memory-specific benchmarks, LM2 was tested on the MMLU dataset, which covers a broad range of academic subjects. The model demonstrated a 5.0% improvement over a pre-trained vanilla Transformer, particularly excelling in Humanities and Social Sciences, where contextual reasoning is crucial. These results indicate that LM2’s memory module enhances reasoning capabilities without compromising general task performance.

Conclusion

The introduction of LM2 offers a thoughtful approach to addressing the limitations of standard Transformers in long-context reasoning. By integrating an explicit memory module, LM2 improves multi-step inference, relational argumentation, and numerical reasoning while maintaining efficiency and adaptability. Experimental results demonstrate its advantages over existing architectures, particularly in tasks requiring extended context retention. Furthermore, LM2 performs well in general reasoning benchmarks, suggesting that memory integration does not hinder versatility. As memory-augmented models continue to evolve, LM2 represents a step toward more effective long-context reasoning in language models.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.

The post Convergence Labs Introduces the Large Memory Model (LM2): A Memory-Augmented Transformer Architecture Designed to Address Long Context Reasoning Challenges appeared first on MarkTechPost.