TxAgent: An AI Agent that Delivers Evidence-Grounded Treatment Recommendations by Combining Multi-Step Reasoning with Real-Time Biomedical Tool Integration

Precision therapy has emerged as a critical approach in healthcare, tailoring treatments to individual patient profiles to optimize outcomes while reducing risks. However, determining the appropriate medication involves a complex analysis of numerous factors: patient characteristics, comorbidities, potential drug interactions, contraindications, current clinical guidelines, drug mechanisms, and disease biology. While Large Language Models (LLMs) have demonstrated therapeutic task capabilities through pretraining and fine-tuning on medical data, they face significant limitations. These models lack access to up-to-date biomedical knowledge, frequently generate hallucinations, and struggle to reason reliably across multiple clinical variables. Moreover, retraining LLMs with new medical information is computationally prohibitive and risks catastrophic forgetting. The models also risk incorporating unverified or deliberately misleading medical content from their extensive training data, further compromising their reliability in clinical applications.

Tool-augmented LLMs have been developed to address knowledge limitations through external retrieval mechanisms like retrieval-augmented generation (RAG). These systems attempt to overcome hallucination issues by fetching drug and disease information from external databases. However, they still fall short in executing the multi-step reasoning process essential for effective treatment selection. Precision therapy would benefit significantly from iterative reasoning capabilities where models could access verified information sources, systematically evaluate potential interactions, and dynamically refine treatment recommendations based on comprehensive clinical analysis.

Researchers from Harvard Medical School, MIT Lincoln Laboratory, Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Broad Institute of MIT and Harvard, and Harvard Data Science Initiative introduce TXAGENT, representing an innovative AI system delivering evidence-grounded treatment recommendations by integrating multi-step reasoning with real-time biomedical tools. The agent generates natural language responses while providing transparent reasoning traces that document its decision-making process. It employs goal-driven tool selection, accessing external databases and specialized machine learning models to ensure accuracy. Supporting this framework is TOOLUNIVERSE, a comprehensive biomedical toolbox containing 211 expert-curated tools covering drug mechanisms, interactions, clinical guidelines, and disease annotations. These tools incorporate trusted sources like openFDA, Open Targets, and the Human Phenotype Ontology. To optimize tool selection, TXAGENT implements TOOLRAG, an ML-based retrieval system that dynamically identifies the most relevant tools from TOOLUNIVERSE based on query context.

TXAGENT’s architecture integrates three core components: TOOLUNIVERSE, comprising 211 diverse biomedical tools; a specialized LLM fine-tuned for multi-step reasoning and tool execution; and the TOOLRAG model for adaptive tool retrieval. Tool compatibility is enabled through TOOLGEN, a multi-agent system that generates tools from API documentation. The agent undergoes fine-tuning with TXAGENT-INSTRUCT, an extensive dataset containing 378,027 instruction-tuning samples derived from 85,340 multi-step reasoning traces, encompassing 177,626 reasoning steps and 281,695 function calls. This dataset is generated by QUESTIONGEN and TRACEGEN, multi-agent systems that create diverse therapeutic queries and stepwise reasoning traces covering treatment information and drug data from FDA labels dating back to 1939.
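
The paper describes TOOLRAG only at a high level, but the general idea of goal-driven tool retrieval can be illustrated with a short sketch: embed each tool's description, embed the query, and keep the most similar tools. The snippet below is a minimal, hypothetical example built on the sentence-transformers library with invented tool names; it is not the actual TOOLRAG model or the real TOOLUNIVERSE schema.

# Minimal sketch of embedding-based tool retrieval (illustrative only; not TOOLRAG).
from sentence_transformers import SentenceTransformer, util

# Hypothetical tool descriptions standing in for TOOLUNIVERSE entries.
tools = {
    "get_drug_interactions": "Look up known interactions between two drugs.",
    "get_fda_label": "Retrieve the FDA label and approved indications for a drug.",
    "get_disease_phenotypes": "List phenotypes associated with a disease.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
tool_names = list(tools)
tool_embs = model.encode([tools[n] for n in tool_names], convert_to_tensor=True)

def retrieve_tools(query: str, k: int = 2):
    """Return the k tools whose descriptions are most similar to the query."""
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, tool_embs)[0]
    top = scores.topk(k).indices.tolist()
    return [(tool_names[i], float(scores[i])) for i in top]

print(retrieve_tools("Can warfarin be taken together with aspirin?"))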

TXAGENT demonstrates exceptional capabilities in therapeutic reasoning through its multi-tool approach. The system utilizes numerous verified knowledge bases, including FDA-approved drug labels and Open Targets, to ensure accurate and reliable responses with transparent reasoning traces. It excels in four key areas: knowledge grounding using tool calls, retrieving verified information from trusted sources; goal-oriented tool selection through the TOOLRAG model; multi-step therapeutic reasoning for complex problems requiring multiple information sources; and real-time retrieval from continuously updated knowledge sources. Importantly, TXAGENT successfully identified indications for Bizengri, a drug approved in December 2024, well after its base model’s knowledge cutoff, by querying the openFDA API directly rather than relying on outdated internal knowledge.

TXAGENT represents a significant advancement in AI-assisted precision medicine, addressing critical limitations of traditional LLMs through multi-step reasoning and targeted tool integration. By generating transparent reasoning trails alongside recommendations, the system provides interpretable decision-making processes for therapeutic problems. The integration of TOOLUNIVERSE enables real-time access to verified biomedical knowledge, allowing TXAGENT to make recommendations based on current data rather than static training information. This approach enables the system to stay current with newly approved medications, assess appropriate indications, and deliver evidence-based prescriptions. By grounding all responses in verified sources and providing traceable decision steps, TXAGENT establishes a new standard for trustworthy AI in clinical decision support.

Check out the Paper, Project Page and GitHub Page. All credit for this research goes to the researchers of this project.

The post TxAgent: An AI Agent that Delivers Evidence-Grounded Treatment Recommendations by Combining Multi-Step Reasoning with Real-Time Biomedical Tool Integration appeared first on MarkTechPost.

Meet LocAgent: Graph-Based AI Agents Transforming Code Localization for Scalable Software Maintenance

Software maintenance is an integral part of the software development lifecycle, where developers frequently revisit existing codebases to fix bugs, implement new features, and optimize performance. A critical task in this phase is code localization, pinpointing specific locations in a codebase that must be modified. This process has gained significance with modern software projects’ increasing scale and complexity. The growing reliance on automation and AI-driven tools has led to integrating large language models (LLMs) in supporting tasks like bug detection, code search, and suggestion. However, despite the advancement of LLMs in language tasks, enabling these models to understand the semantics and structures of complex codebases remains a technical challenge researchers strive to overcome.

One of the most persistent problems in software maintenance is accurately identifying the relevant parts of a codebase that need changes based on user-reported issues or feature requests. Often, issue descriptions in natural language mention symptoms but not the actual root cause in code. This disconnect makes it difficult for developers and automated tools to link descriptions to the exact code elements needing updates. Furthermore, traditional methods struggle with complex code dependencies, especially when the relevant code spans multiple files or requires hierarchical reasoning. Poor code localization contributes to inefficient bug resolution, incomplete patches, and longer development cycles.

Prior methods for code localization mostly depend on dense retrieval models or agent-based approaches. Dense retrieval requires embedding the entire codebase into a searchable vector space, which is difficult to maintain and update for large repositories. These systems often perform poorly when issue descriptions lack direct references to relevant code. On the other hand, some recent approaches use agent-based models that simulate a human-like exploration of the codebase. However, they often rely on directory traversal and lack an understanding of deeper semantic links like inheritance or function invocation. This limits their ability to handle complex relationships between code elements not explicitly linked.

A team of researchers from Yale University, University of Southern California, Stanford University, and All Hands AI developed LocAgent, a graph-guided agent framework to transform code localization. Rather than depending on lexical matching or static embeddings, LocAgent converts entire codebases into directed heterogeneous graphs. These graphs include nodes for directories, files, classes, and functions and edges to capture relationships like function invocation, file imports, and class inheritance. This structure allows the agent to reason across multiple levels of code abstraction. The system then applies tools like SearchEntity, TraverseGraph, and RetrieveEntity to allow LLMs to explore the system step-by-step. The use of sparse hierarchical indexing ensures rapid access to entities, and the graph design supports multi-hop traversal, which is essential for finding connections across distant parts of the codebase.
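
To make the graph representation concrete, the sketch below builds a tiny heterogeneous code graph with networkx, using typed nodes (file, class, function) and typed edges (contains, invokes), plus a bounded multi-hop traversal in the spirit of the TraverseGraph tool. The paths, entity names, and traversal routine are invented for illustration and do not reflect LocAgent's actual implementation.

# Illustrative sketch of a heterogeneous code graph (not LocAgent's actual code).
import networkx as nx

g = nx.MultiDiGraph()
# Nodes carry a "kind" attribute: directory, file, class, or function.
g.add_node("src/auth.py", kind="file")
g.add_node("src/auth.py::AuthService", kind="class")
g.add_node("src/auth.py::AuthService.login", kind="function")
g.add_node("src/db.py::get_user", kind="function")

# Typed edges capture containment and invocation relationships.
g.add_edge("src/auth.py", "src/auth.py::AuthService", type="contains")
g.add_edge("src/auth.py::AuthService", "src/auth.py::AuthService.login", type="contains")
g.add_edge("src/auth.py::AuthService.login", "src/db.py::get_user", type="invokes")

def traverse(start, edge_type, hops=2):
    """Collect nodes reachable from start within a bounded number of hops,
    following only edges of the given type (a stand-in for TraverseGraph)."""
    frontier, seen = {start}, set()
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            for _, dst, data in g.out_edges(node, data=True):
                if data.get("type") == edge_type and dst not in seen:
                    seen.add(dst)
                    nxt.add(dst)
        frontier = nxt
    return seen

print(traverse("src/auth.py", "contains"))
print(traverse("src/auth.py::AuthService.login", "invokes", hops=1))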

LocAgent performs indexing within seconds and supports real-time usage, making it practical for developers and organizations. The researchers fine-tuned two open-source models, Qwen2.5-7B and Qwen2.5-32B, on a curated set of successful localization trajectories. These models performed impressively on standard benchmarks. For instance, on the SWE-Bench-Lite dataset, LocAgent achieved 92.7% file-level accuracy using Qwen2.5-32B, compared to 86.13% with Claude-3.5 and lower scores from other models. On the newly introduced Loc-Bench dataset, which contains 660 examples across bug reports (282), feature requests (203), security issues (31), and performance problems (144), LocAgent again showed competitive results, achieving 84.59% Acc@5 and 87.06% Acc@10 at the file level. Even the smaller Qwen2.5-7B model delivered performance close to high-cost proprietary models while costing only $0.05 per example, a stark contrast to the $0.66 cost of Claude-3.5.

The core mechanism relies on a detailed graph-based indexing process. Each node, whether representing a class or function, is uniquely identified by a fully qualified name and indexed using BM25 for flexible keyword search. The model enables agents to simulate a reasoning chain that begins with extracting issue-relevant keywords, proceeds through graph traversals, and concludes with code retrievals for specific nodes. These actions are scored using a confidence estimation approach based on prediction consistency over multiple iterations. Notably, when the researchers disabled tools like TraverseGraph or SearchEntity, performance dropped by up to 18%, highlighting their importance. Further, multi-hop reasoning was critical; fixing traversal hops to one led to a decline in function-level accuracy from 71.53% to 66.79%.
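
As a rough illustration of the BM25-based keyword search described above, the snippet below indexes a few invented entity names and docstrings with the rank_bm25 library and retrieves the closest matches for issue keywords. It is a minimal sketch of sparse keyword indexing in general, not LocAgent's actual sparse hierarchical index.

# Minimal sketch of BM25 keyword search over code entity names/descriptions
# (illustrative only; entities are invented).
from rank_bm25 import BM25Okapi

entities = [
    "src/auth.py::AuthService.login validate user credentials and create session",
    "src/db.py::get_user fetch a user record from the database",
    "src/payments.py::charge_card charge a credit card and record the invoice",
]
tokenized = [e.lower().replace("::", " ").replace("/", " ").split() for e in entities]
bm25 = BM25Okapi(tokenized)

issue_keywords = "login fails with invalid session".lower().split()
print(bm25.get_top_n(issue_keywords, entities, n=2))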

When applied to downstream tasks like GitHub issue resolution, LocAgent increased the issue pass rate (Pass@10) from 33.58% in baseline Agentless systems to 37.59% with the fine-tuned Qwen2.5-32B model. The framework’s modularity and open-source nature make it a compelling solution for organizations looking for in-house alternatives to commercial LLMs. The introduction of Loc-Bench, with its broader representation of maintenance tasks, ensures fair evaluation without contamination from pre-training data.

Some key takeaways from the research on LocAgent include the following:

LocAgent transforms codebases into heterogeneous graphs for multi-level code reasoning.  

It achieved up to 92.7% file-level accuracy on SWE-Bench-Lite with Qwen2.5-32B.  

Reduced code localization cost by approximately 86% compared to proprietary models.

Introduced the Loc-Bench dataset with 660 examples: 282 bugs, 203 features, 31 security, 144 performance.

Fine-tuned models (Qwen2.5-7B, Qwen2.5-32B) performed comparably to Claude-3.5.  

Tools like TraverseGraph and SearchEntity proved essential, with accuracy drops when disabled.  

Demonstrated real-world utility by improving GitHub issue resolution rates.

It offers a scalable, cost-efficient, and effective alternative to proprietary LLM solutions.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post Meet LocAgent: Graph-Based AI Agents Transforming Code Localization for Scalable Software Maintenance appeared first on MarkTechPost.

A Unified Acoustic-to-Speech-to-Language Embedding Space Captures the Neural Basis of Natural Language Processing in Everyday Conversations

Language processing in the brain presents a challenge due to its inherently complex, multidimensional, and context-dependent nature. Psycholinguists have attempted to construct well-defined symbolic features and processes for domains, such as phonemes for speech analysis and part-of-speech units for syntactic structures. Despite acknowledging some cross-domain interactions, research has focused on modeling each linguistic subfield in isolation through controlled experimental manipulations. This divide-and-conquer strategy shows limitations, as a significant gap has emerged between natural language processing and formal psycholinguistic theories. These models and theories struggle to capture the subtle, non-linear, context-dependent interactions occurring within and across levels of linguistic analysis.

Recent advances in LLMs have dramatically improved conversational language processing, summarization, and generation. These models excel in handling syntactic, semantic, and pragmatic properties of written text and in recognizing speech from acoustic recordings. Multimodal, end-to-end models represent a significant theoretical advancement over text-only models by providing a unified framework for transforming continuous auditory input into speech and word-level linguistic dimensions during natural conversations. Unlike traditional approaches, these deep acoustic-to-speech-to-language models shift to multidimensional vectorial representations where all elements of speech and language are embedded into continuous vectors across a population of simple computing units by optimizing straightforward objectives.

Researchers from Hebrew University, Google Research, Princeton University, Maastricht University, Massachusetts General Hospital and Harvard Medical School, New York University School of Medicine, and Harvard University have presented a unified computational framework that connects acoustic, speech, and word-level linguistic structures to investigate the neural basis of everyday conversations in the human brain. They utilized electrocorticography to record neural signals across 100 hours of natural speech production and comprehension as participants engaged in open-ended, real-life conversations. The team extracted several types of embeddings (low-level acoustic, mid-level speech, and contextual word embeddings) from a multimodal speech-to-text model called Whisper. Their model predicts neural activity at each level of the language processing hierarchy across hours of previously unseen conversations.

The internal workings of the Whisper acoustic-to-speech-to-language model are examined to model and predict neural activity during daily conversations. Three types of embeddings are extracted from the model for every word patients speak or hear: acoustic embeddings from the auditory input layer, speech embeddings from the final speech encoder layer, and language embeddings from the decoder’s final layers. For each embedding type, electrode-wise encoding models are constructed to map the embeddings to neural activity during speech production and comprehension. The encoding models show a remarkable alignment between human brain activity and the model’s internal population code, accurately predicting neural responses across hundreds of thousands of words in conversational data.
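
The paper's exact estimator and preprocessing are not detailed here, but electrode-wise encoding models of this kind are commonly fit as regularized linear regressions from embeddings to neural activity. The sketch below illustrates that general setup on synthetic data; the array sizes, the ridge penalty, and the synthetic signals are assumptions, not the study's actual pipeline.

# Sketch of an electrode-wise linear encoding model on synthetic data.
# Real analyses use recorded ECoG signals and Whisper embeddings; this only
# illustrates the general mapping from embeddings to per-electrode activity.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, emb_dim, n_electrodes = 2000, 384, 8

word_embeddings = rng.normal(size=(n_words, emb_dim))            # one embedding per word
true_weights = rng.normal(size=(emb_dim, n_electrodes)) * 0.1
neural_activity = word_embeddings @ true_weights + rng.normal(size=(n_words, n_electrodes))

X_tr, X_te, y_tr, y_te = train_test_split(
    word_embeddings, neural_activity, test_size=0.2, random_state=0
)

for electrode in range(n_electrodes):
    model = Ridge(alpha=10.0).fit(X_tr, y_tr[:, electrode])
    pred = model.predict(X_te)
    r = np.corrcoef(pred, y_te[:, electrode])[0, 1]   # encoding performance per electrode
    print(f"electrode {electrode}: r = {r:.2f}")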

The Whisper model's acoustic, speech, and language embeddings show exceptional predictive accuracy for neural activity across hundreds of thousands of words during speech production and comprehension throughout the cortical language network. During speech production, hierarchical processing is observed: articulatory areas (preCG, postCG, STG) are better predicted by speech embeddings, while higher-level language areas (IFG, pMTG, AG) align with language embeddings. The encoding models also show temporal specificity, with performance peaking more than 300 ms before word onset during production and 300 ms after onset during comprehension; speech embeddings better predict activity in perceptual and articulatory areas, while language embeddings excel in higher-order language areas.

In summary, the acoustic-to-speech-to-language model offers a unified computational framework for investigating the neural basis of natural language processing. This integrated approach is a paradigm shift toward non-symbolic models based on statistical learning and high-dimensional embedding spaces. As these models evolve to process natural speech better, their alignment with cognitive processes may similarly improve. Some advanced models like GPT-4o incorporate visual modality alongside speech and text, while others integrate embodied articulation systems mimicking human speech production. The fast improvement of these models supports a shift to a unified linguistic paradigm that emphasizes the role of usage-based statistical learning in language acquisition as it is materialized in real-life contexts.

Check out the Paper and Google Blog. All credit for this research goes to the researchers of this project.

The post A Unified Acoustic-to-Speech-to-Language Embedding Space Captures the Neural Basis of Natural Language Processing in Everyday Conversations appeared first on MarkTechPost.

Achieving Critical Reliability in Instruction-Following with LLMs: How to Achieve AI Customer Service That's 100% Reliable

Ensuring reliable instruction-following in LLMs remains a critical challenge. This is particularly important in customer-facing applications, where mistakes can be costly. Traditional prompt engineering techniques fail to deliver consistent results. A more structured and managed approach is necessary to improve adherence to business rules while maintaining flexibility.

This article explores key innovations, including granular atomic guidelines, dynamic evaluation and filtering of instructions, and Attentive Reasoning Queries (ARQs), while acknowledging implementation limitations and trade-offs.

The Challenge: Inconsistent AI Performance in Customer Service

LLMs are already providing tangible business value when used as assistants to human representatives in customer service scenarios. However, their reliability as autonomous customer-facing agents remains a challenge. Traditional approaches to developing conversational LLM applications often fail in real-world use cases. The two most common approaches are:

Iterative prompt engineering, which leads to inconsistent, unpredictable behavior.

Flowchart-based processing, which sacrifices the real magic of LLM-powered interactions: dynamic, free-flowing, human-like interactions.

In high-stakes customer-facing applications, such as banking, even minor errors can have serious consequences. For instance, an incorrectly executed API call (like transferring money) can lead to lawsuits and reputational damage. Conversely, mechanical interactions that lack naturalness and rapport hurt customer trust and engagement, limiting containment rates (cases resolved without human intervention).

For LLMs to reach their full potential as dynamic, autonomous agents in real-world cases, we must make them follow business-specific instructions consistently and at scale, while maintaining the flexibility of natural, free-flowing interactions.

How to Create a Reliable, Autonomous Customer Service Agent with LLMs

To address these gaps in LLMs and current approaches, and achieve a level of reliability and control that works well in real-world cases, we must question the approaches that failed. One of the first questions I had when I started working on Parlant (an open-source framework for customer-facing AI agents) was, “If an AI agent is found to mishandle a particular customer scenario, what would be the optimal process for fixing it?” Adding additional demands to an already-lengthy prompt, like “Here’s how you should approach scenario X…” would quickly become complicated to manage, and the results weren’t consistent anyhow. Besides that, adding those instructions unconditionally posed an alignment risk since LLMs are inherently biased by their input. It was therefore important that instructions for scenario X did not leak into other scenarios which potentially required a different approach.

We thus realized that instructions needed to apply only in their intended context. This made sense because, in real-life, when we catch unsatisfactory behavior in real-time in a customer-service interaction, we usually know how to correct it: We’re able to specify both what needs to improve as well as the context in which our feedback should apply. For example, “Be concise and to the point when discussing premium-plan benefits,” but “Be willing to explain our offering at length when comparing it to other solutions.”

In addition to this contextualization of instructions, in training a highly capable agent that can handle many use cases, we’d clearly need to tweak many instructions over time as we shaped our agent’s behavior to business needs and preferences. We needed a systematic approach.

Stepping back and rethinking, from first principles, our ideal expectations from modern AI-based interactions and how to develop them, this is what we understood about how such interactions should feel to customers:

Empathetic and coherent: Customers should feel in good hands when using AI.

Fluid, like Instant Messaging (IM): Allowing customers to switch topics back and forth, express themselves using multiple messages, and ask about multiple topics at a time.

Personalized: You should feel that the AI agent knows it’s speaking to you and understands your context.

From a developer perspective, we also realized that:

Crafting the right conversational UX is an evolutionary process. We should be able to confidently modify agent behavior in different contexts, quickly and easily, without worrying about breaking existing behavior.

Instructions should be respected consistently. This is hard to do with LLMs, which are inherently unpredictable creatures. An innovative solution was required.

Agent decisions should be transparent. The spectrum of possible issues related to natural language and behavior is too wide. Resolving issues in instruction-following without clear indications of how an agent interpreted our instructions in a given scenario would be highly impractical in production environments with deadlines.

Implementing Parlant’s Design Goals

Our main challenge was how to control and adjust an AI agent’s behavior while ensuring that instructions are not spoken in vain—that the AI agent implements them accurately and consistently. This led to a strategic design decision: granular, atomic guidelines.

1. Granular Atomic Guidelines

Complex prompts often overwhelm LLMs, leading to incomplete or inconsistent outputs with respect to the instructions they specify. We solved this in Parlant by dropping broad prompts for self-contained, atomic guidelines. Each guideline consists of:

Condition: A natural-language query that determines when the instruction should apply (e.g., “The customer inquires about a refund…”)

Action: The specific instruction the LLM should follow (e.g., “Confirm order details and offer an overview of the refund process.”)

By segmenting instructions into manageable units and systematically focusing the model's attention on each one at a time, we could get the LLM to evaluate and enforce them with higher accuracy.
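
The condition/action structure above can be pictured with a short sketch. The following is illustrative only and does not reflect Parlant's actual API: the data class and the condition check are invented placeholders, and in a real system an LLM call would judge whether a condition applies.

# Minimal sketch of the condition/action guideline structure described above
# (illustrative only; not Parlant's actual API).
from dataclasses import dataclass

@dataclass
class Guideline:
    condition: str   # natural-language query, e.g. "The customer inquires about a refund"
    action: str      # instruction to follow when the condition holds

guidelines = [
    Guideline(
        condition="The customer inquires about a refund",
        action="Confirm order details and offer an overview of the refund process.",
    ),
    Guideline(
        condition="The customer compares our offering to other solutions",
        action="Be willing to explain the offering at length.",
    ),
]

def condition_holds(condition: str, conversation: str) -> bool:
    """Placeholder check: in a real system, an LLM would judge whether the
    condition applies to the current conversation state."""
    return condition.split()[-1].lower().rstrip(".") in conversation.lower()

conversation = "Hi, I'd like a refund for my last order."
active = [g.action for g in guidelines if condition_holds(g.condition, conversation)]
print(active)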

2. Filtering and Supervision Mechanism

LLMs are highly influenced by the content of their prompts, even if parts of the prompt are not directly relevant to the conversation at hand.

Instead of presenting all guidelines at once, we made Parlant dynamically match and apply only the relevant set of instructions at each step of the conversation. This real-time matching can then be leveraged for:

Reduced cognitive overload for the LLM: We’d avoid prompt leaks and increase the model’s focus on the right instructions, leading to higher consistency. 

Supervision: We added a mechanism to highlight each guideline’s impact and enforce its application, increasing conformance across the board.

Explainability: Every evaluation and decision generated by the system includes a rationale detailing how guidelines were interpreted and the reasoning behind skipping or activating them at each point in the conversation.

Continuous improvement: By monitoring guideline effectiveness and agent interpretation, developers could easily refine their AI’s behavior over time. Because guidelines are atomic and supervised, you could easily make structured changes without breaking fragile prompts. 

3. Attentive Reasoning Queries (ARQs)

While “Chain of Thought” (CoT) prompting improves reasoning, it remains limited in its ability to maintain consistent, context-sensitive responses over time. Parlant introduces Attentive Reasoning Queries (ARQs)—a technique we’ve devised to ensure that multi-step reasoning stays effective, accurate, and predictable, even across thousands of runs. You can find our research paper on ARQs vs. CoT on parlant.io and arxiv.org.

ARQs work by directing the LLM’s attention back to high-priority instructions at key points in the response generation process, getting the LLM to attend to those instructions and reason about them right before it needs to apply them. We found that “localizing” the reasoning around the part of the response where a specific instruction needs to be applied provided significantly greater accuracy and consistency than a preliminary, nonspecific reasoning process like CoT.
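
A rough way to picture this idea is a prompt-construction step that, right before response generation, asks the model targeted questions about the specific guidelines in play. The prompt text and function below are invented for illustration and are not Parlant's actual ARQ format.

# Rough illustration of the idea behind Attentive Reasoning Queries (ARQs):
# just before generating the reply, the model is asked targeted questions
# about the guidelines that apply at this point in the conversation.
def build_arq_prompt(conversation: str, active_guidelines: list[str]) -> str:
    queries = "\n".join(
        f"- Does guideline {i + 1} apply to the next reply, and how will you honor it? "
        f"(Guideline: {g})"
        for i, g in enumerate(active_guidelines)
    )
    return (
        f"Conversation so far:\n{conversation}\n\n"
        f"Before answering, reason explicitly about these points:\n{queries}\n\n"
        "Now write the reply, consistent with your reasoning above."
    )

print(build_arq_prompt(
    "Customer: I'd like a refund for my last order.",
    ["Confirm order details and offer an overview of the refund process."],
))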

Acknowledging Limitations

While these innovations improve instruction-following, there are challenges to consider:

Computational overhead: Implementing filtering and reasoning mechanisms increases processing time. However, with hardware and LLMs improving by the day, we saw this as a possibly controversial, yet strategic design choice.

Alternative approaches: In some low-risk applications, such as assistive AI co-pilots, simpler methods like prompt-tuning or workflow-based approaches often suffice.

Why Consistency Is Crucial for Enterprise-Grade Conversational AI

In regulated industries like finance, healthcare, and legal services, even 99% accuracy poses significant risk. A bank handling millions of monthly conversations cannot afford thousands of potentially critical errors. Beyond accuracy, AI systems must be constrained such that errors, even when they occur, remain within strict, acceptable bounds.

In response to the demand for greater accuracy in such applications, AI solution vendors often argue that humans also make mistakes. While this is true, the difference is that, with human employees, correcting them is usually straightforward. You can ask them why they handled a situation the way they did. You can provide direct feedback and monitor their results. But relying on “best-effort” prompt-engineering, while being blind to why an AI agent even made some decision in the first place, is an approach that simply doesn’t scale beyond basic demos.

This is why a structured feedback mechanism is so important. It allows you to pinpoint what changes need to be made, and how to make them while keeping existing functionality intact. It’s this realization that put us on the right track with Parlant early on.

Handling Millions of Customer Interactions with Autonomous AI Agents

For enterprises to deploy AI at scale, consistency and transparency are non-negotiable. A financial chatbot providing unauthorized advice, a healthcare assistant misguiding patients, or an e-commerce agent misrepresenting products can all have severe consequences.

Parlant redefines AI alignment by enabling:

Enhanced operational efficiency: Reducing human intervention while ensuring high-quality AI interactions.

Consistent brand alignment: Maintaining coherence with business values.

Regulatory compliance: Adhering to industry standards and legal requirements.

This methodology represents a shift in how AI alignment is approached in the first place. Using modular guidelines with intelligent filtering instead of long, complex prompts; adding explicit supervision and validation mechanisms to ensure things go as planned—these innovations mark a new standard for achieving reliability with LLMs. As AI-driven automation continues to expand in adoption, ensuring consistent instruction-following will become an accepted necessity, not an innovative luxury.

If your company is looking to deploy robust AI-powered customer service or any other customer-facing application, you should look into Parlant, an agent framework for controlled, explainable, and enterprise-ready AI interactions.

The post Achieving Critical Reliability in Instruction-Following with LLMs: How to Achieve AI Customer Service That’s 100% Reliable appeared first on MarkTechPost.

Fin-R1: A Specialized Large Language Model for Financial Reasoning and Decision-Making

LLMs are advancing rapidly across multiple domains, yet their effectiveness in tackling complex financial problems remains an area of active investigation. The iterative development of LLMs has significantly driven the evolution of artificial intelligence toward artificial general intelligence (AGI). OpenAI’s o1 series and similar models like QwQ and Marco-o1 have improved complex reasoning capabilities by extending “chain-of-thought” reasoning through an iterative “exploration-reflection” approach. In finance, models such as XuanYuan-FinX1-Preview and Fino1 have showcased the potential of LLMs in cognitive reasoning tasks. Meanwhile, DeepSeek-R1 adopts a different strategy, relying solely on RL with multi-stage training to enhance reasoning and inference abilities. By combining thousands of unsupervised RL training steps with a small cold-start dataset, DeepSeek-R1 demonstrates strong emergent reasoning performance and readability, highlighting the effectiveness of RL-based methodologies in improving large-scale language models.

Despite these advancements, general-purpose LLMs struggle to adapt to specialized financial reasoning tasks. Financial decision-making requires interdisciplinary knowledge, including legal regulations, economic indicators, and mathematical modeling, while also demanding logical, step-by-step reasoning. Several challenges arise when deploying LLMs in financial applications. First, fragmented financial data complicates knowledge integration, leading to inconsistencies that hinder comprehensive understanding. Second, the black-box nature of LLMs makes their reasoning process difficult to interpret, conflicting with regulatory requirements for transparency and accountability. Finally, LLMs often struggle with generalization across financial scenarios, producing unreliable outputs in high-risk applications. These limitations pose significant barriers to their adoption in real-world financial systems, where accuracy and traceability are critical.

Researchers from Shanghai University of Finance & Economics, Fudan University, and FinStep have developed Fin-R1, a specialized LLM for financial reasoning. With a compact 7-billion-parameter architecture, Fin-R1 reduces deployment costs while addressing key challenges in financial AI: fragmented data, lack of reasoning control, and weak generalization. It is trained on Fin-R1-Data, a high-quality dataset containing 60,091 chain-of-thought (CoT) examples sourced from authoritative financial data. Through a two-stage training approach, Supervised Fine-Tuning (SFT) followed by RL, Fin-R1 enhances accuracy and interpretability. It performs well on financial benchmarks, excelling in financial compliance and robo-advisory applications.

The study presents a two-stage framework for constructing Fin-R1. The data generation phase involves creating a high-quality financial reasoning dataset, Fin-R1-Data, through data distillation with DeepSeek-R1 and filtering using an LLM-as-judge approach. In the model training phase, Fin-R1 is fine-tuned on Qwen2.5-7B-Instruct using SFT and Group Relative Policy Optimization (GRPO) to enhance reasoning and output consistency. The dataset combines open-source and proprietary financial data, refined through rigorous filtering. Training integrates supervised learning and reinforcement learning, incorporating structured prompts and reward mechanisms to improve financial reasoning accuracy and standardization.
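
A core step in GRPO is computing group-relative advantages: each sampled response's reward is normalized against the mean and standard deviation of its group. The sketch below shows only that step; the full GRPO objective also involves a clipped policy ratio and KL regularization, which are omitted, and the reward values are invented for illustration.

# Sketch of the group-relative advantage computation used in GRPO.
# The clipped policy-ratio objective and KL penalty are omitted here.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards for a group of responses sampled for one financial query,
# e.g. 1.0 for a correct, well-formatted answer and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0]))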

The reasoning abilities of Fin-R1 in financial scenarios were evaluated through a comparative analysis against several state-of-the-art models, including DeepSeek-R1, Fin-R1-SFT, and various Qwen and Llama-based architectures. Despite its compact 7B parameter size, Fin-R1 achieved a notable average score of 75.2, ranking second overall. It outperformed all models of similar scale and exceeded DeepSeek-R1-Distill-Llama-70B by 8.7 points. Fin-R1 ranked highest in FinQA and ConvFinQA with scores of 76.0 and 85.0, respectively, demonstrating strong financial reasoning and cross-task generalization, particularly in benchmarks like Ant_Finance, TFNS, and Finance-Instruct-500K.

In conclusion, Fin-R1 is a large financial reasoning language model designed to tackle key challenges in financial AI, including fragmented data, inconsistent reasoning logic, and limited business generalization. It delivers state-of-the-art performance by utilizing a two-stage training process—SFT and RL—on the high-quality Fin-R1-Data dataset. With a compact 7B parameter scale, it achieves scores of 85.0 in ConvFinQA and 76.0 in FinQA, outperforming larger models. Future work aims to enhance financial multimodal capabilities, strengthen regulatory compliance, and expand real-world applications, driving innovation in fintech while ensuring efficient and intelligent financial decision-making.

Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post Fin-R1: A Specialized Large Language Model for Financial Reasoning and Decision-Making appeared first on MarkTechPost.

Meta AI Researchers Introduced SWEET-RL and CollaborativeAgentBench: A Step-Wise Reinforcement Learning Framework to Train Multi-Turn Language Agents for Realistic Human-AI Collaboration Tasks

Large language models (LLMs) are rapidly transforming into autonomous agents capable of performing complex tasks that require reasoning, decision-making, and adaptability. These agents are deployed in web navigation, personal assistance, and software development. To act effectively in real-world settings, these agents must handle multi-turn interactions that span several steps or decision points. This introduces the need for training methods beyond simple response generation and instead focuses on optimizing the entire trajectory of interactions. Reinforcement learning (RL) has emerged as a compelling approach to train such agents by refining their decision-making based on long-term rewards.

Despite their potential, LLM-based agents struggle with multi-turn decision-making. A major challenge lies in assigning proper credit to actions taken at earlier stages of interaction, which influence later outcomes. Traditional training methods rely on next-token prediction or imitate high-probability actions, which do not account for long-term dependencies or cumulative goals. As a result, these methods fail to address the high variance and inefficiency of long-horizon tasks, particularly in collaborative scenarios where understanding human intent and reasoning across multiple steps is critical.

Various reinforcement learning techniques have been adapted to fine-tune LLMs, especially from single-turn human feedback scenarios. Tools like PPO, RAFT, and DPO have been explored but exhibit significant limitations when applied to sequential interactions. These methods often fail at effective credit assignment across turns, making them less effective for multi-turn decision-making tasks. Benchmarks used to evaluate such tools lack the diversity and complexity required to assess performance in collaborative, real-world settings robustly. Value-based learning approaches are another alternative, but their need for custom heads and large amounts of task-specific fine-tuning data limit their generalization capabilities.

Researchers from FAIR at Meta and UC Berkeley proposed a new reinforcement learning method called SWEET-RL (Step-WisE Evaluation from Training-time Information). They also introduced a benchmark known as CollaborativeAgentBench, or ColBench. This benchmark is central to the study, providing over 10,000 training tasks and over 1,000 test cases across two domains: backend programming and frontend design. ColBench simulates real collaboration between an AI agent and a human partner, where agents must ask questions, refine their understanding, and provide iterative solutions. For programming, agents are required to write functions in Python, asking clarifying questions to fill in missing specifications. In frontend tasks, agents must generate HTML code that matches a visual target through feedback-based corrections. Each task is designed to stretch the reasoning ability of the agent and mimic real-world constraints like limited interactions, capped at 10 turns per session.

SWEET-RL is built around an asymmetric actor-critic structure. The critic has access to additional information during training, such as the correct solution, which is not visible to the actor. This information allows the critic to evaluate each decision made by the agent with a much finer resolution. Instead of training a value function that estimates overall reward, SWEET-RL directly models an advantage function at each turn, using the Bradley-Terry optimization objective. The advantage function determines how much better or worse a particular action is compared to alternatives, helping the agent learn precise behaviors. For example, if an action aligns better with the human partner’s expectation, it receives a higher advantage score. This method simplifies credit assignment and aligns better with the pre-training architecture of LLMs, which rely on token-level prediction.
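
The Bradley-Terry objective mentioned above can be illustrated with a toy loss that pushes the advantage of a preferred action above that of a dispreferred one at the same turn. The tensors below are invented; in SWEET-RL itself the advantages come from an LLM critic that also sees training-time information, which this sketch does not model.

# Toy sketch of a Bradley-Terry style objective over turn-level advantages.
import torch
import torch.nn.functional as F

def bradley_terry_loss(adv_chosen: torch.Tensor, adv_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the log-probability that the chosen action outranks the rejected one.
    return -F.logsigmoid(adv_chosen - adv_rejected).mean()

adv_chosen = torch.tensor([1.2, 0.4, 0.9])     # advantages of preferred actions (invented)
adv_rejected = torch.tensor([0.1, 0.5, -0.3])  # advantages of dispreferred actions (invented)
print(bradley_terry_loss(adv_chosen, adv_rejected))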

SWEET-RL achieved a 6% absolute improvement over other multi-turn reinforcement learning methods across both programming and design tasks. On backend programming tasks, it passed 48.0% of tests and achieved a success rate of 34.4%, compared to 28.2% for Multi-Turn DPO and 22.4% for zero-shot performance. On frontend design tasks, it reached a cosine similarity score of 76.9% and a win rate of 40.4%, improving from 38.6% with DPO and 33.8% with fine-tuning. Even when evaluated against top proprietary models like GPT-4o and O1-Mini, SWEET-RL closed the performance gap significantly, enabling the open-source Llama-3.1-8B model to match or exceed GPT-4o’s frontend win rate of 40.4%.

This research demonstrates that effective training of interactive agents hinges on precise, turn-by-turn feedback rather than generalized value estimations or broad supervision. SWEET-RL significantly improves credit assignment by leveraging training-time information and an architecture-aligned optimization approach. It enhances generalization, reduces training variance, and shows strong scalability, achieving better results with increased data. The algorithm also remains effective when applied to off-policy datasets, underlining its practicality in real-world scenarios with imperfect data. The research team created a meaningful evaluation framework by introducing ColBench as a benchmark tailored for realistic, multi-turn tasks. This combination with SWEET-RL provides a strong foundation for developing agents that can reason, adapt, and collaborate effectively over extended interactions.

Several key takeaways from this research include:

SWEET-RL improved backend programming success rates from 28.2% (DPO) to 34.4% and frontend win rates from 38.6% to 40.4%.  

It allowed Llama-3.1-8B to match the performance of GPT-4o, reducing dependency on proprietary models.  

The critic uses training-time information (e.g., correct solutions) that is invisible to the actor, creating an asymmetric training setup.  

Tasks in ColBench are capped at 10 rounds per session and include over 10,000 procedurally generated training examples.  

ColBench measures outcomes using unit test pass rates (for code) and cosine similarity (for web design), providing reliable evaluation.  

SWEET-RL directly learns a turn-wise advantage function, improving credit assignment without needing an intermediate value function.  

The model scales effectively with more data and performs well even on off-policy datasets from weaker models.  

Compared to traditional fine-tuning methods, SWEET-RL delivers higher performance with less overfitting and greater generalization.

Check out the Paper, GitHub Page and Dataset. All credit for this research goes to the researchers of this project.

The post Meta AI Researchers Introduced SWEET-RL and CollaborativeAgentBench: A Step-Wise Reinforcement Learning Framework to Train Multi-Turn Language Agents for Realistic Human-AI Collaboration Tasks appeared first on MarkTechPost.

Microsoft AI Releases RD-Agent: An AI-Driven Tool for Performing R&D with LLM-based Agents

Research and development (R&D) is crucial in driving productivity, particularly in the AI era. However, conventional automation methods in R&D often lack the intelligence to handle complex research challenges and innovation-driven tasks, making them less effective than human experts. Conversely, researchers leverage deep domain knowledge to generate ideas, test hypotheses, and refine processes through iterative experimentation. The rise of LLMs offers a potential solution by introducing advanced reasoning and decision-making capabilities, allowing them to function as intelligent agents that enhance efficiency in data-driven R&D workflows.

Despite their potential, LLMs must overcome key challenges to deliver meaningful industrial impact in R&D. A major limitation is their inability to evolve beyond their initial training, restricting their capacity to adapt to emerging developments. Additionally, while LLMs possess broad general knowledge, they often lack the depth required for specialized domains, limiting their effectiveness in solving industry-specific problems. To maximize their impact, LLMs must continuously acquire specialized knowledge through practical industry applications, ensuring they remain relevant and capable of addressing complex R&D challenges.

Researchers at Microsoft Research Asia have developed RD-Agent, an AI-powered tool designed to automate R&D processes using LLMs. RD-Agent operates through an autonomous framework with two key components: Research, which generates and explores new ideas, and Development, which implements them. The system continuously improves through iterative refinement. RD-Agent functions as both a research assistant and a data-mining agent, automating tasks like reading papers, identifying financial and healthcare data patterns, and optimizing feature engineering. Now open-source on GitHub, RD-Agent is actively evolving to support more applications and enhance industry productivity.

In R&D, two primary challenges must be addressed: enabling continuous learning and acquiring specialized knowledge. Traditional LLMs, once trained, struggle to expand their expertise, limiting their ability to tackle industry-specific problems. To overcome this, RD-Agent employs a dynamic learning framework that integrates real-world feedback, allowing it to refine hypotheses and accumulate domain knowledge over time. RD-Agent continuously proposes, tests, and improves ideas by automating the research process, linking scientific exploration with real-world validation. This iterative feedback loop ensures that knowledge is systematically acquired and applied like human experts refine their understanding through experience.

In the development phase, RD-Agent enhances efficiency by prioritizing tasks and optimizing execution strategies through Co-STEER, a data-driven approach that evolves via continuous learning. This system begins with simple tasks and refines its development methods based on real-world feedback. To evaluate R&D capabilities, researchers have introduced RD2Bench, a benchmarking system that assesses LLM agents on model and data development tasks. Looking ahead, automating feedback comprehension, task scheduling, and cross-domain knowledge transfer remains a major challenge. By integrating research and development processes through continuous feedback, RD-Agent aims to revolutionize automated R&D, boosting innovation and efficiency across disciplines.
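
The propose-implement-evaluate-refine cycle described above can be pictured with a small loop. The function names and placeholder logic below are hypothetical and do not correspond to RD-Agent's actual interfaces; the sketch only shows how feedback from each round can inform the next proposal.

# Conceptual sketch of the propose -> implement -> evaluate -> refine loop
# (hypothetical placeholders; not RD-Agent's actual interfaces).
import random

def propose_hypothesis(history):
    return f"hypothesis {len(history) + 1}: try an alternative feature transformation"

def implement(hypothesis):
    return {"artifact": f"pipeline implementing {hypothesis}"}

def evaluate(artifact):
    return {"score": round(random.random(), 3)}  # stand-in for a benchmark metric

history = []
for _ in range(3):
    hypothesis = propose_hypothesis(history)   # Research: generate an idea
    artifact = implement(hypothesis)           # Development: implement it
    feedback = evaluate(artifact)              # feedback from experiments or benchmarks
    history.append((hypothesis, feedback))     # feedback shapes the next proposal

for hypothesis, feedback in history:
    print(hypothesis, feedback)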

In conclusion, RD-Agent is an open-source AI-driven framework designed to automate and enhance R&D processes. It integrates two core components, Research for idea generation and Development for implementation, to ensure continuous improvement through iterative feedback. By incorporating real-world data, RD-Agent evolves dynamically and acquires specialized knowledge. The system employs Co-STEER, a data-centric approach, and RD2Bench, a benchmarking tool, to refine development strategies and evaluate AI-driven R&D capabilities. This integrated approach enhances innovation, fosters cross-domain knowledge transfer, and improves efficiency, marking a significant step toward intelligent and automated research and development.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post Microsoft AI Releases RD-Agent: An AI-Driven Tool for Performing R&D with LLM-based Agents appeared first on MarkTechPost.

Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About Images

​Artificial intelligence has made significant strides in recent years, yet integrating real-time speech interaction with visual content remains a complex challenge. Traditional systems often rely on separate components for voice activity detection, speech recognition, textual dialogue, and text-to-speech synthesis. This segmented approach can introduce delays and may not capture the nuances of human conversation, such as emotions or non-speech sounds. These limitations are particularly evident in applications designed to assist visually impaired individuals, where timely and accurate descriptions of visual scenes are essential.​

Addressing these challenges, Kyutai has introduced MoshiVis, an open-source Vision Speech Model (VSM) that enables natural, real-time speech interactions about images. Building upon their earlier work with Moshi—a speech-text foundation model designed for real-time dialogue—MoshiVis extends these capabilities to include visual inputs. This enhancement allows users to engage in fluid conversations about visual content, marking a noteworthy advancement in AI development.

Technically, MoshiVis augments Moshi by integrating lightweight cross-attention modules that infuse visual information from an existing visual encoder into Moshi’s speech token stream. This design ensures that Moshi’s original conversational abilities remain intact while introducing the capacity to process and discuss visual inputs. A gating mechanism within the cross-attention modules enables the model to selectively engage with visual data, maintaining efficiency and responsiveness. Notably, MoshiVis adds approximately 7 milliseconds of latency per inference step on consumer-grade devices, such as a Mac Mini with an M4 Pro Chip, resulting in a total of 55 milliseconds per inference step. This performance stays well below the 80-millisecond threshold for real-time latency, ensuring smooth and natural interactions.
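
The gated cross-attention idea can be illustrated with a small PyTorch module in which a learnable gate, initialized near zero, controls how much visual information is injected into the token stream. The dimensions and details below are invented for illustration and this is not the actual MoshiVis implementation.

# Illustrative gated cross-attention block in the spirit of the design above
# (invented dimensions; not the actual MoshiVis architecture).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed: behaves like the base speech model

    def forward(self, speech_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # Speech tokens attend to visual tokens; the gate scales the injected signal.
        attended, _ = self.attn(speech_tokens, visual_tokens, visual_tokens)
        return speech_tokens + torch.tanh(self.gate) * attended

block = GatedCrossAttention(dim=256)
speech = torch.randn(1, 20, 256)   # (batch, speech tokens, dim)
visual = torch.randn(1, 50, 256)   # (batch, visual tokens from an image encoder, dim)
print(block(speech, visual).shape)  # torch.Size([1, 20, 256])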

In practical applications, MoshiVis demonstrates its ability to provide detailed descriptions of visual scenes through natural speech. For instance, when presented with an image depicting green metal structures surrounded by trees and a building with a light brown exterior, MoshiVis articulates:​

“I see two green metal structures with a mesh top, and they’re surrounded by large trees. In the background, you can see a building with a light brown exterior and a black roof, which appears to be made of stone.”

This capability opens new avenues for applications such as providing audio descriptions for the visually impaired, enhancing accessibility, and enabling more natural interactions with visual information. By releasing MoshiVis as an open-source project, Kyutai invites the research community and developers to explore and expand upon this technology, fostering innovation in vision-speech models. The availability of the model weights, inference code, and visual speech benchmarks further supports collaborative efforts to refine and diversify the applications of MoshiVis.

In conclusion, MoshiVis represents a significant advancement in AI, merging visual understanding with real-time speech interaction. Its open-source nature encourages widespread adoption and development, paving the way for more accessible and natural interactions with technology. As AI continues to evolve, innovations like MoshiVis bring us closer to seamless integration of multimodal understanding, enhancing user experiences across various domains.

Check out the Technical details and Try it here. All credit for this research goes to the researchers of this project.

The post Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About Images appeared first on MarkTechPost.

NVIDIA AI Open Sources Dynamo: An Open-Source Inference Library for Accelerating and Scaling AI Reasoning Models in AI Factories

​The rapid advancement of artificial intelligence (AI) has led to the development of complex models capable of understanding and generating human-like text. Deploying these large language models (LLMs) in real-world applications presents significant challenges, particularly in optimizing performance and managing computational resources efficiently.​

Challenges in Scaling AI Reasoning Models

As AI models grow in complexity, their deployment demands increase, especially during the inference phase—the stage where models generate outputs based on new data. Key challenges include:​

Resource Allocation: Balancing computational loads across extensive GPU clusters to prevent bottlenecks and underutilization is complex.​

Latency Reduction: Ensuring rapid response times is critical for user satisfaction, necessitating low-latency inference processes.​

Cost Management: The substantial computational requirements of LLMs can lead to escalating operational costs, making cost-effective solutions essential.​

Introducing NVIDIA Dynamo

In response to these challenges, NVIDIA has introduced Dynamo, an open-source inference library designed to accelerate and scale AI reasoning models efficiently and cost-effectively. As the successor to the NVIDIA Triton Inference Server, Dynamo offers a modular framework tailored for distributed environments, enabling seamless scaling of inference workloads across large GPU fleets. ​

Technical Innovations and Benefits

Dynamo incorporates several key innovations that collectively enhance inference performance:​

Disaggregated Serving: This approach separates the context (prefill) and generation (decode) phases of LLM inference, allocating them to distinct GPUs. By allowing each phase to be optimized independently, disaggregated serving improves resource utilization and increases the number of inference requests served per GPU. ​

GPU Resource Planner: Dynamo’s planning engine dynamically adjusts GPU allocation in response to fluctuating user demand, preventing over- or under-provisioning and ensuring optimal performance. ​

Smart Router: This component efficiently directs incoming inference requests across large GPU fleets, minimizing costly recomputations by leveraging knowledge from prior requests, known as the KV cache (a conceptual sketch of this idea follows the list below).

Low-Latency Communication Library (NIXL): NIXL accelerates data transfer between GPUs and across diverse memory and storage types, reducing inference response times and simplifying data exchange complexities.

KV Cache Manager: By offloading less frequently accessed inference data to more cost-effective memory and storage devices, Dynamo reduces overall inference costs without impacting user experience. ​
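
As a conceptual illustration of the KV-cache-aware routing idea behind the Smart Router, the sketch below sends a request to a worker that has already processed the same prompt prefix so its cached prefill can be reused. This is only an illustration of the concept, with invented worker names and hashing logic; it is not Dynamo's actual router or its APIs.

# Conceptual sketch of KV-cache-aware request routing (not Dynamo's Smart Router).
import hashlib

workers = {"worker-0": set(), "worker-1": set()}  # prefix hashes cached per worker

def prefix_hash(prompt: str, prefix_tokens: int = 32) -> str:
    # Hash the first few whitespace tokens as a stand-in for a KV-cache prefix key.
    return hashlib.sha256(" ".join(prompt.split()[:prefix_tokens]).encode()).hexdigest()

def route(prompt: str) -> str:
    h = prefix_hash(prompt)
    for name, cached in workers.items():
        if h in cached:          # cache hit: avoid recomputing the prefill
            return name
    # Cache miss: pick the least-loaded worker and remember the new prefix.
    name = min(workers, key=lambda w: len(workers[w]))
    workers[name].add(h)
    return name

print(route("System: you are a helpful assistant. User: summarize this report ..."))
print(route("System: you are a helpful assistant. User: summarize this report ..."))  # same worker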

Performance Insights

Dynamo’s impact on inference performance is substantial. When serving the open-source DeepSeek-R1 671B reasoning model on NVIDIA GB200 NVL72, Dynamo increased throughput—measured in tokens per second per GPU—by up to 30 times. Additionally, serving the Llama 70B model on NVIDIA Hopper resulted in more than a twofold increase in throughput. ​

These enhancements enable AI service providers to serve more inference requests per GPU, accelerate response times, and reduce operational costs, thereby maximizing returns on their accelerated compute investments. ​

Conclusion

NVIDIA Dynamo represents a significant advancement in the deployment of AI reasoning models, addressing critical challenges in scaling, efficiency, and cost-effectiveness. Its open-source nature and compatibility with major AI inference backends, including PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM, empower enterprises, startups, and researchers to optimize AI model serving across disaggregated inference environments. By leveraging Dynamo’s innovative features, organizations can enhance their AI capabilities, delivering faster and more efficient AI services to meet the growing demands of modern applications.

Check out the Technical details and GitHub Page. All credit for this research goes to the researchers of this project.

The post NVIDIA AI Open Sources Dynamo: An Open-Source Inference Library for Accelerating and Scaling AI Reasoning Models in AI Factories appeared first on MarkTechPost.

A Step-by-Step Guide to Building a Semantic Search Engine with Sentence Transformers

Semantic search goes beyond traditional keyword matching by understanding the contextual meaning of search queries. Instead of simply matching exact words, semantic search systems capture the intent and contextual definition of the query and return relevant results even when they don’t contain the same keywords.

In this tutorial, we’ll implement a semantic search system using Sentence Transformers, a powerful library built on top of Hugging Face’s Transformers that provides pre-trained models specifically optimized for generating sentence embeddings. These embeddings are numerical representations of text that capture semantic meaning, allowing us to find similar content through vector similarity. We’ll create a practical application: a semantic search engine for a collection of scientific abstracts that can answer research queries with relevant papers, even when the terminology differs between the query and relevant documents.

First, let’s install the necessary libraries in our Colab notebook:

!pip install sentence-transformers faiss-cpu numpy pandas matplotlib datasets

Now, let’s import the libraries we’ll need:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
import faiss
from typing import List, Dict, Tuple
import time
import re
import torch

For our demonstration, we’ll use a collection of scientific paper abstracts. Let’s create a small dataset of abstracts from various fields:

abstracts = [
    {
        "id": 1,
        "title": "Deep Learning for Natural Language Processing",
        "abstract": "This paper explores recent advances in deep learning models for natural language processing tasks. We review transformer architectures including BERT, GPT, and T5, and analyze their performance on various benchmarks including question answering, sentiment analysis, and text classification."
    },
    {
        "id": 2,
        "title": "Climate Change Impact on Marine Ecosystems",
        "abstract": "Rising ocean temperatures and acidification are severely impacting coral reefs and marine biodiversity. This study presents data collected over a 10-year period, demonstrating accelerated decline in reef ecosystems and proposing conservation strategies to mitigate further damage."
    },
    {
        "id": 3,
        "title": "Advancements in mRNA Vaccine Technology",
        "abstract": "The development of mRNA vaccines represents a breakthrough in immunization technology. This review discusses the mechanism of action, stability improvements, and clinical efficacy of mRNA platforms, with special attention to their rapid deployment during the COVID-19 pandemic."
    },
    {
        "id": 4,
        "title": "Quantum Computing Algorithms for Optimization Problems",
        "abstract": "Quantum computing offers potential speedups for solving complex optimization problems. This paper presents quantum algorithms for combinatorial optimization and compares their theoretical performance with classical methods on problems including traveling salesman and maximum cut."
    },
    {
        "id": 5,
        "title": "Sustainable Urban Planning Frameworks",
        "abstract": "This research proposes frameworks for sustainable urban development that integrate renewable energy systems, efficient public transportation networks, and green infrastructure. Case studies from five cities demonstrate reductions in carbon emissions and improvements in quality of life metrics."
    },
    {
        "id": 6,
        "title": "Neural Networks for Computer Vision",
        "abstract": "Convolutional neural networks have revolutionized computer vision tasks. This paper examines recent architectural innovations including residual connections, attention mechanisms, and vision transformers, evaluating their performance on image classification, object detection, and segmentation benchmarks."
    },
    {
        "id": 7,
        "title": "Blockchain Applications in Supply Chain Management",
        "abstract": "Blockchain technology enables transparent and secure tracking of goods throughout supply chains. This study analyzes implementations across food, pharmaceutical, and retail industries, quantifying improvements in traceability, reduction in counterfeit products, and enhanced consumer trust."
    },
    {
        "id": 8,
        "title": "Genetic Factors in Autoimmune Disorders",
        "abstract": "This research identifies key genetic markers associated with increased susceptibility to autoimmune conditions. Through genome-wide association studies of 15,000 patients, we identified novel variants that influence immune system regulation and may serve as targets for personalized therapeutic approaches."
    },
    {
        "id": 9,
        "title": "Reinforcement Learning for Robotic Control Systems",
        "abstract": "Deep reinforcement learning enables robots to learn complex manipulation tasks through trial and error. This paper presents a framework that combines model-based planning with policy gradient methods to achieve sample-efficient learning of dexterous manipulation skills."
    },
    {
        "id": 10,
        "title": "Microplastic Pollution in Freshwater Systems",
        "abstract": "This study quantifies microplastic contamination across 30 freshwater lakes and rivers, identifying primary sources and transport mechanisms. Results indicate correlation between population density and contamination levels, with implications for water treatment policies and plastic waste management."
    }
]

papers_df = pd.DataFrame(abstracts)
print(f"Dataset loaded with {len(papers_df)} scientific papers")
papers_df[["id", "title"]]

Now we’ll load a pre-trained Sentence Transformer model from Hugging Face. We’ll use the all-MiniLM-L6-v2 model, which provides a good balance between performance and speed:

model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
print(f"Loaded model: {model_name}")

Next, we’ll convert our text abstracts into dense vector embeddings:

documents = papers_df['abstract'].tolist()
document_embeddings = model.encode(documents, show_progress_bar=True)

print(f"Generated {len(document_embeddings)} embeddings with dimension {document_embeddings.shape[1]}")

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search. We’ll use it to index our document embeddings:

dimension = document_embeddings.shape[1]

index = faiss.IndexFlatL2(dimension)
index.add(np.array(document_embeddings).astype('float32'))

print(f"Created FAISS index with {index.ntotal} vectors")

Now let’s implement a function that takes a query, converts it to an embedding, and retrieves the most similar documents:

def semantic_search(query: str, top_k: int = 3) -> List[Dict]:
    """
    Search for documents similar to the query.

    Args:
        query: Text to search for
        top_k: Number of results to return

    Returns:
        List of dictionaries containing document info and similarity score
    """
    query_embedding = model.encode([query])

    distances, indices = index.search(np.array(query_embedding).astype('float32'), top_k)

    results = []
    for i, idx in enumerate(indices[0]):
        results.append({
            'id': papers_df.iloc[idx]['id'],
            'title': papers_df.iloc[idx]['title'],
            'abstract': papers_df.iloc[idx]['abstract'],
            # IndexFlatL2 returns squared L2 distances; for normalized embeddings
            # (as produced by all-MiniLM-L6-v2) this equals 2 - 2*cosine, so
            # 1 - distance/2 recovers the cosine similarity.
            'similarity_score': 1 - distances[0][i] / 2
        })

    return results

Let’s test our semantic search with various queries that demonstrate its ability to understand meaning beyond exact keywords:

test_queries = [
    "How do transformers work in natural language processing?",
    "What are the effects of global warming on ocean life?",
    "Tell me about COVID vaccine development",
    "Latest algorithms in quantum computing",
    "How can cities reduce their carbon footprint?"
]

for query in test_queries:
    print("\n" + "="*80)
    print(f"Query: {query}")
    print("="*80)

    results = semantic_search(query, top_k=3)

    for i, result in enumerate(results):
        print(f"\nResult #{i+1} (Score: {result['similarity_score']:.4f}):")
        print(f"Title: {result['title']}")
        print(f"Abstract snippet: {result['abstract'][:150]}...")

Let’s visualize the document embeddings to see how they cluster by topic:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(document_embeddings)

plt.figure(figsize=(12, 8))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], s=100, alpha=0.7)

for i, (x, y) in enumerate(reduced_embeddings):
    plt.annotate(papers_df.iloc[i]['title'][:20] + "...",
                 (x, y),
                 fontsize=9,
                 alpha=0.8)

plt.title('Document Embeddings Visualization (PCA)')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Let’s create a more interactive search interface:

from IPython.display import display, HTML, clear_output
import ipywidgets as widgets

def run_search(query_text):
    clear_output(wait=True)

    display(HTML(f"<h3>Query: {query_text}</h3>"))

    start_time = time.time()
    results = semantic_search(query_text, top_k=5)
    search_time = time.time() - start_time

    display(HTML(f"<p>Found {len(results)} results in {search_time:.4f} seconds</p>"))

    for i, result in enumerate(results):
        html = f"""
        <div style="margin-bottom: 20px; padding: 15px; border: 1px solid #ddd; border-radius: 5px;">
            <h4>{i+1}. {result['title']} <span style="color: #007bff;">(Score: {result['similarity_score']:.4f})</span></h4>
            <p>{result['abstract']}</p>
        </div>
        """
        display(HTML(html))

search_box = widgets.Text(
    value='',
    placeholder='Type your search query here...',
    description='Search:',
    layout=widgets.Layout(width='70%')
)

search_button = widgets.Button(
    description='Search',
    button_style='primary',
    tooltip='Click to search'
)

def on_button_clicked(b):
    run_search(search_box.value)

search_button.on_click(on_button_clicked)

display(widgets.HBox([search_box, search_button]))

In this tutorial, we’ve built a complete semantic search system using Sentence Transformers. This system can understand the meaning behind user queries and return relevant documents even when there isn’t exact keyword matching. We’ve seen how embedding-based search provides more intelligent results than traditional methods.

Here is the Colab Notebook.

The post A Step-by-Step Guide to Building a Semantic Search Engine with Sentence Transformers, FAISS, and all-MiniLM-L6-v2 appeared first on MarkTechPost.

Build a generative AI enabled virtual IT troubleshooting assistant usi …

Today’s organizations face a critical challenge with the fragmentation of vital information across multiple environments. As businesses increasingly rely on diverse project management and IT service management (ITSM) tools such as ServiceNow, Atlassian Jira and Confluence, employees find themselves navigating a complex web of systems to access crucial data.
This isolated approach leads to several challenges for IT leaders, developers, program managers, and new employees. For example:

Inefficiency: Employees need to access multiple systems independently to gather data insights and remediation steps during incident troubleshooting
Lack of integration: Information is isolated across different environments, making it difficult to get a holistic view of ITSM activities
Time-consuming: Searching for relevant information across multiple systems slows employees down and reduces productivity
Potential for inconsistency: Using multiple systems increases the risk of inconsistent data and processes across the organization

Amazon Q Business is a fully managed, generative artificial intelligence (AI) powered assistant that can address challenges such as inefficient, inconsistent information access within an organization by providing 24/7 support tailored to individual needs. It handles a wide range of tasks such as answering questions, providing summaries, generating content, and completing tasks based on data in your organization. Amazon Q Business offers over 40 data source connectors that connect to your enterprise data sources and help you create a generative AI solution with minimal configuration. Amazon Q Business also supports over 50 actions across popular business applications and platforms. Additionally, Amazon Q Business offers enterprise-grade data security, privacy, and built-in guardrails that you can configure.
This blog post explores an innovative solution that harnesses the power of generative AI to bring value to your organization and ITSM tools with Amazon Q Business.
Solution overview
The solution architecture shown in the following figure demonstrates how to build a virtual IT troubleshooting assistant by integrating with multiple data sources such as Atlassian Jira, Confluence, and ServiceNow. This solution helps streamline information retrieval, enhance collaboration, and significantly boost overall operational efficiency, offering a glimpse into the future of intelligent enterprise information management.

This solution integrates with ITSM tools such as ServiceNow Online and project management software such as Atlassian Jira and Confluence using the Amazon Q Business data source connectors. You can use a data source connector to combine data from different places into a central index for your Amazon Q Business application. For this demonstration, we use the Amazon Q Business native index and retriever. We also configure an application environment and grant access to users to interact with an application environment using AWS IAM Identity Center for user management. Then, we provision subscriptions for IAM Identity Center users and groups.
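For readers who prefer to script this setup rather than use the console, the sketch below shows roughly how the application, native index, and retriever could be created with the boto3 qbusiness client. This is a hedged illustration only: the role ARN, IAM Identity Center instance ARN, and display names are placeholders, and the parameter names should be verified against the current boto3 documentation.

import boto3

# Hedged sketch: field names follow the Amazon Q Business API; verify against the current SDK docs.
qbusiness = boto3.client("qbusiness", region_name="us-east-1")

# Create the Amazon Q Business application (ARNs below are placeholders)
app = qbusiness.create_application(
    displayName="it-troubleshooting-assistant",
    roleArn="arn:aws:iam::123456789012:role/QBusinessAppRole",
    identityCenterInstanceArn="arn:aws:sso:::instance/ssoins-EXAMPLE",
)

# Create the native index that the data source connectors will populate
index = qbusiness.create_index(
    applicationId=app["applicationId"],
    displayName="itsm-native-index",
)

# Create the native-index retriever used by the application
retriever = qbusiness.create_retriever(
    applicationId=app["applicationId"],
    type="NATIVE_INDEX",
    displayName="itsm-retriever",
    configuration={"nativeIndexConfiguration": {"indexId": index["indexId"]}},
)

The data source connectors for Atlassian Jira, Confluence, and ServiceNow would then be attached to this index, which is the step walked through in the console below.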
Authorized users interact with the application environment through a web experience. You can share the web experience endpoint URL with your users so they can open the URL and authenticate themselves to start chatting with the generative AI application powered by Amazon Q Business.
Deployment
Start by setting up the architecture and data needed for the demonstration.

1. We've provided an AWS CloudFormation template in our GitHub repository that you can use to set up the environment for this demonstration. If you don't have existing Atlassian Jira, Confluence, and ServiceNow accounts, follow these steps to create trial accounts for the demonstration.
2. Once step 1 is complete, open the AWS Management Console for Amazon Q Business. On the Applications tab, open your application to see the data sources. See Best practices for data source connector configuration in Amazon Q Business to understand best practices.
3. To improve retrieved results and customize the end user chat experience, use Amazon Q to map document attributes from your data sources to fields in your Amazon Q index. Choose the Atlassian Jira, Confluence Cloud, and ServiceNow Online links to learn more about their document attributes and field mappings. Select the data source to edit its configurations under Actions. Select the fields that you think would be important for your search needs. Repeat the process for all of the data sources. The following figure is an example of some of the Atlassian Jira project field mappings that we selected.
4. Sync mode lets you choose how to update your index when your data source content changes. Sync run schedule sets how often you want Amazon Q Business to synchronize your index with the data source. For this demonstration, we set the Sync mode to Full Sync and the Frequency to Run on demand. Update Sync mode with your changes and choose Sync Now to start syncing data sources. When you initiate a sync, Amazon Q crawls the data source to extract relevant documents, then syncs them to the Amazon Q index, making them searchable.
5. After syncing data sources, you can configure the metadata controls in Amazon Q Business. An Amazon Q Business index has fields that you can map your document attributes to. After the index fields are mapped to document attributes and are search-enabled, admins can use the index fields to boost results from specific sources, and end users can filter and scope their chat results to specific data. Boosting chat responses based on document attributes helps you rank sources that are more authoritative higher than other sources in your application environment. See Boosting chat responses using metadata boosting to learn more about metadata boosting and metadata controls. The following figure is an example of some of the metadata controls that we selected.
6. For the purposes of the demonstration, use the Amazon Q Business web experience. Select your application under Applications and then select the Deployed URL link in the web experience settings.
7. Enter the same username, password, and multi-factor authentication (MFA) code for the user that you created previously in IAM Identity Center to sign in to the Amazon Q Business web experience generative AI assistant.

Demonstration
Now that you’ve signed in to the Amazon Q Business web experience generative AI assistant (shown in the previous figure), let’s try some natural language queries.
IT leaders: You’re an IT leader and your team is working on a critical project that needs to hit the market quickly. You can now ask questions in natural language to Amazon Q Business to get answers based on your company data.

Developers: Developers who want to know information such as the tasks that are assigned to them, specific tasks details, or issues in a particular sub segment. They can now get these questions answered from Amazon Q Business without necessarily signing in to either Atlassian Jira or Confluence.

Project and program managers: Project and program managers can monitor the activities or developments in their projects or programs from Amazon Q Business without having to contact various teams to get individual status updates.

New employees or business users: A newly hired employee who’s looking for information to get started on a project or a business user who needs tech support can use the generative AI assistant to get the information and support they need.

Benefits and outcomes
From the demonstrations, you saw that various users, whether they are leaders, managers, developers, or business users, can benefit from using a generative AI solution like our virtual IT assistant built with Amazon Q Business. It removes the undifferentiated heavy lifting of navigating multiple solutions and cross-referencing multiple items and data points to get answers. Amazon Q Business can use generative AI to provide responses with actionable insights in just a few seconds. Now, let's dive deeper into some of the additional benefits that this solution provides.

Increased efficiency: Centralized access to information from ServiceNow, Atlassian Jira, and Confluence saves time and reduces the need to switch between multiple systems.
Enhanced decision-making: Comprehensive data insights from multiple systems leads to better-informed decisions in incident management and problem-solving for various users across the organization.
Faster incident resolution: Quick access to enterprise data sources and knowledge and AI-assisted remediation steps can significantly reduce mean time to resolutions (MTTR) for cases with elevated priorities.
Improved knowledge management: Access to Confluence’s architectural documents and other knowledge bases such as ServiceNow’s Knowledge Articles promotes better knowledge sharing across the organization. Users can now get responses based on information from multiple systems.
Seamless integration and enhanced user experience: Better integration between ITSM processes, project management, and software development streamlines operations. This is helpful for organizations and teams that incorporate agile methodologies.
Cost savings: Reduction in time spent searching for information and resolving incidents can lead to significant cost savings in IT operations.
Scalability: Amazon Q Business can grow with the organization, accommodating future needs and additional data sources as required. Organizations can create more Amazon Q Business applications and share purpose-built Amazon Q Business apps within their organizations to manage repetitive tasks.

Clean up
After completing your exploration of the virtual IT troubleshooting assistant, delete the CloudFormation stack from your AWS account. This action terminates all resources created during deployment of this demonstration and prevents unnecessary costs from accruing in your AWS account.
Conclusion
By integrating Amazon Q Business with enterprise systems, you can create a powerful virtual IT assistant that streamlines information access and improves productivity. The solution presented in this post demonstrates the power of combining AI capabilities with existing enterprise systems to create powerful unified ITSM solutions and more efficient and user-friendly experiences.
We provide the sample virtual IT assistant using an Amazon Q Business solution as open source—use it as a starting point for your own solution and help us make it better by contributing fixes and features through GitHub pull requests. Visit the GitHub repository to explore the code, choose Watch to be notified of new releases, and check the README for the latest documentation updates.
Learn more:

Amazon Q Business
Generative AI on AWS

For expert assistance, AWS Professional Services, AWS Generative AI partner solutions, and AWS Generative AI Competency Partners are here to help.
We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the GitHub repository.

About the Authors
Jasmine Rasheed Syed is a Senior Customer Solutions manager at AWS, focused on accelerating time to value for customers on their cloud journey by adopting best practices and mechanisms to transform their business at scale. Jasmine is a seasoned, results-oriented leader with 20+ years of progressive experience in Insurance, Retail & CPG, with an exemplary track record spanning Business Development, Cloud/Digital Transformation, Delivery, Operational & Process Excellence, and Executive Management.
Suprakash Dutta is a Sr. Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs Generative AI and Intelligent Document Processing(IDP) solutions.
Joshua Amah is a Partner Solutions Architect at Amazon Web Services, specializing in supporting SI partners with a focus on AI/ML and generative AI technologies. He is passionate about guiding AWS Partners in using cutting-edge technologies and best practices to build innovative solutions that meet customer needs. Joshua provides architectural guidance and strategic recommendations for both new and existing workloads.
Brad King is an Enterprise Account Executive at Amazon Web Services specializing in translating complex technical concepts into business value and making sure that clients achieve their digital transformation goals efficiently and effectively through long term partnerships.
Joseph Mart is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS). His core competence and interests lie in machine learning applications and generative AI. Joseph is a technology addict who enjoys guiding AWS customers on architecting their workload in the AWS Cloud. In his spare time, he loves playing soccer and visiting nature.

Process formulas and charts with Anthropic’s Claude on Amazon Bedroc …

Research papers and engineering documents often contain a wealth of information in the form of mathematical formulas, charts, and graphs. Navigating these unstructured documents to find relevant information can be a tedious and time-consuming task, especially when dealing with large volumes of data. However, by using Anthropic’s Claude on Amazon Bedrock, researchers and engineers can now automate the indexing and tagging of these technical documents. This enables the efficient processing of content, including scientific formulas and data visualizations, and the population of Amazon Bedrock Knowledge Bases with appropriate metadata.
Amazon Bedrock is a fully managed service that provides a single API to access and use various high-performing foundation models (FMs) from leading AI companies. It offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI practices. Anthropic’s Claude 3 Sonnet offers best-in-class vision capabilities compared to other leading models. It can accurately transcribe text from imperfect images—a core capability for retail, logistics, and financial services, where AI might glean more insights from an image, graphic, or illustration than from text alone. The latest of Anthropic’s Claude models demonstrate a strong aptitude for understanding a wide range of visual formats, including photos, charts, graphs and technical diagrams. With Anthropic’s Claude, you can extract more insights from documents, process web UIs and diverse product documentation, generate image catalog metadata, and more.
In this post, we explore how you can use these multi-modal generative AI models to streamline the management of technical documents. By extracting and structuring the key information from the source materials, the models can create a searchable knowledge base that allows you to quickly locate the data, formulas, and visualizations you need to support your work. With the document content organized in a knowledge base, researchers and engineers can use advanced search capabilities to surface the most relevant information for their specific needs. This can significantly accelerate research and development workflows, because professionals no longer have to manually sift through large volumes of unstructured data to find the references they need.
Solution overview
This solution demonstrates the transformative potential of multi-modal generative AI when applied to the challenges faced by scientific and engineering communities. By automating the indexing and tagging of technical documents, these powerful models can enable more efficient knowledge management and accelerate innovation across a variety of industries.
In addition to Anthropic’s Claude on Amazon Bedrock, the solution uses the following services:

Amazon SageMaker JupyterLab – The SageMaker JupyterLab application is a web-based interactive development environment (IDE) for notebooks, code, and data. The JupyterLab application's flexible and extensive interface can be used to configure and arrange machine learning (ML) workflows. We use JupyterLab to run the code for processing formulae and charts.
Amazon Simple Storage Service (Amazon S3) – Amazon S3 is an object storage service built to store and protect any amount of data. We use Amazon S3 to store sample documents that are used in this solution.
AWS Lambda – AWS Lambda is a compute service that runs code in response to triggers such as changes in data, changes in application state, or user actions. Because services such as Amazon S3 and Amazon Simple Notification Service (Amazon SNS) can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.

The solution workflow contains the following steps:

Split the PDF into individual pages and save them as PNG files.
For each page:

Extract the original text.
Render the formulas in LaTeX.
Generate a semantic description of each formula.
Generate an explanation of each formula.
Generate a semantic description of each graph.
Generate an interpretation for each graph.
Generate metadata for the page.

Generate metadata for the full document.
Upload the content and metadata to Amazon S3.
Create an Amazon Bedrock knowledge base.

The following diagram illustrates this workflow.

Prerequisites

If you’re new to AWS, you first need to create and set up an AWS account.
Additionally, in your account under Amazon Bedrock, request access to anthropic.claude-3-5-sonnet-20241022-v2:0 if you don’t have it already.

Deploy the solution
Complete the following steps to set up the solution:

Launch the AWS CloudFormation template by choosing Launch Stack (this creates the stack in the us-east-1 AWS Region):

When the stack deployment is complete, open the Amazon SageMaker AI console.
Choose Notebooks in the navigation pane.
Locate the notebook claude-scientific-docs-notebook and choose Open JupyterLab.

In the notebook, navigate to notebooks/process_scientific_docs.ipynb.

Choose conda_python3 as the kernel, then choose Select.

Walk through the sample code.

Explanation of the notebook code
In this section, we walk through the notebook code.
Load data
We use example research papers from arXiv to demonstrate the capability outlined here. arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
We download the documents and store them under a samples folder locally. Multi-modal generative AI models work well with text extraction from image files, so we start by converting the PDF to a collection of images, one for each page.
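The notebook's exact conversion code isn't reproduced in this post, but a minimal sketch of this step using PyMuPDF (one of several libraries that can rasterize PDF pages; the file paths below are placeholders mirroring the samples folder) looks like the following.

import os
import fitz  # PyMuPDF

def pdf_to_pngs(pdf_path, out_dir, dpi=150):
    """Render each page of a PDF as a PNG file and return the output paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc):
            pixmap = page.get_pixmap(dpi=dpi)  # rasterize the page
            out_path = os.path.join(out_dir, f"page_{page_number}.png")
            pixmap.save(out_path)
            paths.append(out_path)
    return paths

# Example with placeholder paths
pages = pdf_to_pngs("./samples/2003.10304/2003.10304.pdf", "./samples/2003.10304")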
Get metadata from formulas
After the image documents are available, you can use Anthropic’s Claude to extract formulas and metadata with the Amazon Bedrock Converse API. Additionally, you can use the Amazon Bedrock Converse API to obtain an explanation of the extracted formulas in plain language. By combining the formula and metadata extraction capabilities of Anthropic’s Claude with the conversational abilities of the Amazon Bedrock Converse API, you can create a comprehensive solution for processing and understanding the information contained within the image documents.
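The stream_conversation helper used in the snippets below comes from the accompanying notebook and isn't shown in this post. As a rough, non-streaming stand-in, a call to the Amazon Bedrock Converse API with an image attachment might look like the following sketch; the function name is hypothetical and the model ID and return handling are simplified assumptions.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def converse_with_image(prompt, image_path,
                        model_id="anthropic.claude-3-5-sonnet-20241022-v2:0"):
    """Send a text prompt plus a PNG image to Claude via the Converse API and return the text reply."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()

    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [
                {"text": prompt},
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            ],
        }],
    )
    # The Converse API returns the assistant message as a list of content blocks
    return response["output"]["message"]["content"][0]["text"]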
We start with the following example PNG file.

We use the following request prompt:

sample_prompt = """
Evaluate this page line by line.
For each line, if it is a formula, convert this math expression to LaTeX format.
Next, describe the formula in plain language. Be sure to enclose LaTeX formulas in double dollar signs, for example: $$ <math expression> $$. Use markdown syntax to format your output.
"""

file = "./samples/2003.10304/page_2.png"

display(Image(filename=file, width=600))
output, result = stream_conversation(message=sample_prompt, file_paths=[file])
response_text = result["content"]
display(Markdown(response_text))
print(output)

We get the following response, which shows the extracted formula converted to LaTeX format and described in plain language, enclosed in double dollar signs.

Get metadata from charts
Another useful capability of multi-modal generative AI models is the ability to interpret graphs and generate summaries and metadata. The following is an example of how you can obtain metadata for charts and graphs using a simple natural language conversation with the model. We use the following graph.

We provide the following request:

sample_prompt = f"""
You are an expert data scientist who has perfect vision and pays close attention to detail.
Interpret the graph on this page.
Provide the answer in markdown format.
"""

file = "./samples/2003.10304/page_5.png"

display(Image(filename=file, width=600))
output, result = stream_conversation(message=sample_prompt, file_paths=[file])
response_text = result["content"]
display(Markdown(response_text))
print(output)

The response provides an interpretation of the graph, explaining the color-coded lines and suggesting that, overall, the DSC model is performing well on the training data, achieving a high Dice coefficient of around 0.98. However, the lower and fluctuating validation Dice coefficient indicates potential overfitting and room for improvement in the model's generalization performance.

Generate metadata
Using natural language processing, you can generate metadata for the paper to aid in searchability.
We use the following request:

sample_prompt = f"""
Generate a metadata json object for this research paper.

{{
    "title": "",
    "authors": [],
    "institutions": [],
    "topics": [],
    "funding-sources": [],
    "algorithms": [],
    "data_sets": []
}}
"""

file = './samples/2003.10304/page_0.png'

We get the following response, a metadata JSON object describing the paper.

{
    "title": "Attention U-Net Based Adversarial Architectures for Chest X-ray Lung Segmentation",
    "authors": ["Gusztáv Gaál", "Balázs Maga", "András Lukács"],
    "institutions": ["AI Research Group, Institute of Mathematics, Eötvös Loránd University, Budapest, Hungary"],
    "topics": ["Chest X-ray segmentation", "Medical imaging", "Deep learning", "Computer-aided detection", "Lung segmentation"],
    "funding-sources": [],
    "algorithms": ["U-Net", "Adversarial architectures", "Fully Convolutional Neural Networks (FCN)", "Mask R-CNN"],
    "data_sets": ["JSRT dataset"]
}

Use your extracted data in a knowledge base
Now that we’ve prepared our data with formulas, analyzed charts, and metadata, we will create an Amazon Bedrock knowledge base. This will make the information searchable and enable question-answering capabilities.
Prepare your Amazon Bedrock knowledge base
To create a knowledge base, first upload the processed files and metadata to Amazon S3:

markdown_file_key = "2003.10304/kb/2003.10304.md"

s3.upload_file(markdown_file, knowledge_base_bucket_name, markdown_file_key)

print(f"File {markdown_file} uploaded successfully.")

metadata_file_key = "2003.10304/kb/2003.10304.md.metadata.json"

s3.upload_file(metadata_file, knowledge_base_bucket_name, metadata_file_key)

print(f"File {metadata_file} uploaded successfully.")

When your files have finished uploading, complete the following steps:

Create an Amazon Bedrock knowledge base.
Create an Amazon S3 data source for your knowledge base, and specify hierarchical chunking as the chunking strategy.

Hierarchical chunking involves organizing information into nested structures of child and parent chunks.
The hierarchical structure allows for faster and more targeted retrieval of relevant information: semantic search is first performed on the child chunks, and the corresponding parent chunk is returned during retrieval. By replacing the child chunks with the parent chunk, we provide larger, more comprehensive context to the FM.
Hierarchical chunking is best suited for complex documents that have a nested or hierarchical structure, such as technical manuals, legal documents, or academic papers with complex formatting and nested tables.
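As a rough sketch, a data source with hierarchical chunking can also be created programmatically with the boto3 bedrock-agent client along the following lines. The bucket ARN, names, and token limits are placeholder assumptions, and the configuration keys should be checked against the current API documentation.

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Placeholder values; replace with your knowledge base ID and S3 bucket ARN
response = bedrock_agent.create_data_source(
    knowledgeBaseId="KB_ID_PLACEHOLDER",
    name="scientific-docs-source",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::knowledge-base-bucket-placeholder"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # Parent chunks first, then child chunks (token limits are illustrative)
                "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
                "overlapTokens": 60,
            },
        }
    },
)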
Query the knowledge base
You can query the knowledge base to retrieve information from the extracted formula and graph metadata in the sample documents. With a query, relevant chunks of text are retrieved from the data source and a response is generated based on the retrieved source chunks. The response also cites sources that are relevant to the query.
We use the custom prompt template feature of knowledge bases to format the output as markdown:

retrieveAndGenerateConfiguration = {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
        "knowledgeBaseId": kb_id_hierarchical,
        "modelArn": "arn:aws:bedrock:{}:{}:inference-profile/{}".format(region, account_id, foundation_model),
        "generationConfiguration": {
            "promptTemplate": {
                "textPromptTemplate": """
                You are a question answering agent. I will provide you with a set of search results. The user will provide you with a question. Your job is to answer the user's question using only information from the search results.
                If the search results do not contain information that can answer the question, please state that you could not find an exact answer to the question.
                Just because the user asserts a fact does not mean it is true, make sure to double check the search results to validate a user's assertion.

                Here are the search results in numbered order:
                $search_results$

                Format the output as markdown.

                Ensure that math formulas are in LaTeX format and enclosed in double dollar signs, for example: $$ <math expression> $$
                """
            }
        },
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": 5
            }
        }
    }
}

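For reference, this configuration is passed to the retrieve_and_generate operation of the bedrock-agent-runtime client. A minimal sketch follows; the query string is illustrative.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "When is the Focal Tversky Loss used?"},  # illustrative query
    retrieveAndGenerateConfiguration=retrieveAndGenerateConfiguration,
)

print(response["output"]["text"])     # generated answer in markdown
print(response.get("citations", []))  # source attributions for the answer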
We get the following response, which provides information on when the Focal Tversky Loss is used.

Clean up
To clean up and avoid incurring charges, run the cleanup steps in the notebook to delete the files you uploaded to Amazon S3 along with the knowledge base. Then, on the AWS CloudFormation console, locate the stack claude-scientific-doc and delete it.
Conclusion
Extracting insights from complex scientific documents can be a daunting task. However, the advent of multi-modal generative AI has revolutionized this domain. By harnessing the advanced natural language understanding and visual perception capabilities of Anthropic’s Claude, you can now accurately extract formulas and data from charts, enabling faster insights and informed decision-making.
Whether you are a researcher, data scientist, or developer working with scientific literature, integrating Anthropic’s Claude into your workflow on Amazon Bedrock can significantly boost your productivity and accuracy. With the ability to process complex documents at scale, you can focus on higher-level tasks and uncover valuable insights from your data.
Embrace the future of AI-driven document processing and unlock new possibilities for your organization with Anthropic’s Claude on Amazon Bedrock. Take your scientific document analysis to the next level and stay ahead of the curve in this rapidly evolving landscape.
For further exploration and learning, we recommend checking out the following resources:

Prompt engineering techniques and best practices: Learn by doing with Anthropic’s Claude 3 on Amazon Bedrock
Intelligent document processing using Amazon Bedrock and Anthropic Claude
Automate document processing with Amazon Bedrock Prompt Flows (preview)

About the Authors
Erik Cordsen is a Solutions Architect at AWS serving customers in Georgia. He is passionate about applying cloud technologies and ML to solve real life problems. When he is not designing cloud solutions, Erik enjoys travel, cooking, and cycling.
Renu Yadav is a Solutions Architect at Amazon Web Services (AWS), where she works with enterprise-level AWS customers, providing them with technical guidance and helping them achieve their business objectives. Renu has a strong passion for learning, with her area of specialization in DevOps. She leverages her expertise in this domain to assist AWS customers in optimizing their cloud infrastructure and streamlining their software development and deployment processes.
Venkata Moparthi is a Senior Solutions Architect at AWS who empowers financial services organizations and other industries to navigate cloud transformation with specialized expertise in Cloud Migrations, Generative AI, and secure architecture design. His customer-focused approach combines technical innovation with practical implementation, helping businesses accelerate digital initiatives and achieve strategic outcomes through tailored AWS solutions that maximize cloud potential.

Automate IT operations with Amazon Bedrock Agents

IT operations teams face the challenge of providing smooth functioning of critical systems while managing a high volume of incidents filed by end-users. Manual intervention in incident management can be time-consuming and error prone because it relies on repetitive tasks, human judgment, and potential communication gaps. Using generative AI for IT operations offers a transformative solution that helps automate incident detection, diagnosis, and remediation, enhancing operational efficiency.
AI for IT operations (AIOps) is the application of AI and machine learning (ML) technologies to automate and enhance IT operations. AIOps helps IT teams manage and monitor large-scale systems by automatically detecting, diagnosing, and resolving incidents in real time. It combines data from various sources—such as logs, metrics, and events—to analyze system behavior, identify anomalies, and recommend or execute automated remediation actions. By reducing manual intervention, AIOps improves operational efficiency, accelerates incident resolution, and minimizes downtime.
This post presents a comprehensive AIOps solution that combines various AWS services such as Amazon Bedrock, AWS Lambda, and Amazon CloudWatch to create an AI assistant for effective incident management. This solution also uses Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents. The solution uses the power of Amazon Bedrock to enable the deployment of intelligent agents capable of monitoring IT systems, analyzing logs and metrics, and invoking automated remediation processes.
Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through a single API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using AWS tools without having to manage the infrastructure. Amazon Bedrock Knowledge Bases is a fully managed capability with built-in session context management and source attribution that helps you implement the entire Retrieval Augmented Generation (RAG) workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows. Amazon Bedrock Agents is a fully managed capability that makes it straightforward for developers to create generative AI-based applications that can complete complex tasks for a wide range of use cases and deliver up-to-date answers based on proprietary knowledge sources.
Generative AI is rapidly transforming businesses and unlocking new possibilities across industries. This post highlights the transformative impact of large language models (LLMs). With the ability to encode human expertise and communicate in natural language, generative AI can help augment human capabilities and allow organizations to harness knowledge at scale.
Challenges in IT operations with runbooks
Runbooks are detailed, step-by-step guides that outline the processes, procedures, and tasks needed to complete specific operations, typically in IT and systems administration. They are commonly used to document repetitive tasks, troubleshooting steps, and routine maintenance. By standardizing responses to issues and facilitating consistency in task execution, runbooks help teams improve operational efficiency and streamline workflows. Most organizations rely on runbooks to simplify complex processes, making it straightforward for teams to handle routine operations and respond effectively to system issues. For organizations, managing hundreds of runbooks, monitoring their status, keeping track of failures, and setting up the right alerting can become difficult. This creates visibility gaps for IT teams. When you have multiple runbooks for various processes, managing the dependencies and run order between them can become complex and tedious. It’s challenging to handle failure scenarios and make sure everything runs in the right sequence.
The following are some of the challenges that most organizations face with manual IT operations:

Manual diagnosis through run logs and metrics
Runbook dependency and sequence mapping
No automated remediation processes
No real-time visibility into runbook progress

Solution overview
Amazon Bedrock is the foundation of this solution, empowering intelligent agents to monitor IT systems, analyze data, and automate remediation. The solution provides sample AWS Cloud Development Kit (AWS CDK) code to deploy this solution. The AIOps solution provides an AI assistant using Amazon Bedrock Agents to help with operations automation and runbook execution.
The following architecture diagram explains the overall flow of this solution.

The agent uses Anthropic’s Claude LLM available on Amazon Bedrock as one of the FMs to analyze incident details and retrieve relevant information from the knowledge base, a curated collection of runbooks and best practices. This equips the agent with business-specific context, making sure responses are precise and backed by data from Amazon Bedrock Knowledge Bases. Based on the analysis, the agent dynamically generates a runbook tailored to the specific incident and invokes appropriate remediation actions, such as creating snapshots, restarting instances, scaling resources, or running custom workflows.
Amazon Bedrock Knowledge Bases creates an Amazon OpenSearch Serverless vector search collection to store and index incident data, runbooks, and run logs, enabling efficient search and retrieval of information. Lambda functions are employed to run specific actions, such as sending notifications, making API calls, or starting automated workflows. The solution also integrates with Amazon Simple Email Service (Amazon SES) for timely notifications to stakeholders.
The solution workflow consists of the following steps:

Existing runbooks in various formats (such as Word documents, PDFs, or text files) are uploaded to Amazon Simple Storage Service (Amazon S3).
Amazon Bedrock Knowledge Bases converts these documents into vector embeddings using a selected embedding model, configured as part of the knowledge base setup.
These vector embeddings are stored in OpenSearch Serverless for efficient retrieval, also configured during the knowledge base setup.
Agents and action groups are then set up with the required APIs and prompts for handling different scenarios.
The OpenAPI specification defines which APIs need to be called, along with their input parameters and expected output, allowing Amazon Bedrock Agents to make informed decisions.
When a user prompt is received, Amazon Bedrock Agents uses RAG, action groups, and the OpenAPI specification to determine the appropriate API calls. If more details are needed, the agent prompts the user for additional information.
Amazon Bedrock Agents can iterate and call multiple functions as needed until the task is completed successfully.

Prerequisites
To implement this AIOps solution, you need an active AWS account and basic knowledge of the AWS CDK and the following AWS services:

Amazon Bedrock
Amazon CloudWatch
AWS Lambda
Amazon OpenSearch Serverless
Amazon SES
Amazon S3

Additionally, you need to provision the required infrastructure components, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (Amazon EBS) volumes, and other resources specific to your IT operations environment.
Build the RAG pipeline with OpenSearch Serverless
This solution uses a RAG pipeline to find relevant content and best practices from operations runbooks to generate responses. The RAG approach helps make sure the agent generates responses that are grounded in factual documentation, which avoids hallucinations. The relevant matches from the knowledge base guide Anthropic’s Claude 3 Haiku model so it focuses on the relevant information. The RAG process is powered by Amazon Bedrock Knowledge Bases, which stores information that the Amazon Bedrock agent can access and use. For this use case, our knowledge base contains existing runbooks from the organization with step-by-step procedures to resolve different operational issues on AWS resources.
The pipeline has the following key tasks:

Ingest documents in an S3 bucket – The first step ingests existing runbooks into an S3 bucket to create a searchable index with the help of OpenSearch Serverless.
Monitor infrastructure health using CloudWatch – An Amazon Bedrock action group is used to invoke Lambda functions that fetch CloudWatch metrics and alerts for EC2 instances from an AWS account. These checks are then provided as inputs to Anthropic's Claude 3 Haiku model to form a health status overview of the account (a minimal sketch of such a check follows this list).
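As a hedged illustration of the second task, a Lambda handler that gathers active CloudWatch alarms and recent CPU metrics for EC2 instances might look like the sketch below. The metric window and output shape are assumptions for illustration, not the repository's actual code.

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

def lambda_handler(event, context):
    """Collect active alarms and recent CPU utilization to summarize account health."""
    # Alarms currently in the ALARM state
    alarms = cloudwatch.describe_alarms(StateValue="ALARM")["MetricAlarms"]

    # Average CPU utilization across EC2 instances over the last hour (illustrative window)
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(hours=1)
    cpu = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )

    return {
        "active_alarms": [a["AlarmName"] for a in alarms],
        "cpu_datapoints": cpu["Datapoints"],
    }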

Configure Amazon Bedrock Agents
Amazon Bedrock Agents augment the user request with the right information from Amazon Bedrock Knowledge Bases to generate an accurate response. For this use case, our knowledge base contains existing runbooks from the organization with step-by-step procedures to resolve different operational issues on AWS resources.
By configuring the appropriate action groups and populating the knowledge base with relevant data, you can tailor the Amazon Bedrock agent to assist with specific tasks or domains and provide accurate and helpful responses within its intended scopes.
Amazon Bedrock agents empower Anthropic’s Claude 3 Haiku to use tools, overcoming LLM limitations like knowledge cutoffs and hallucinations, for enhanced task completion through API calls and other external interactions.
The agent’s workflow is to check for resource alerts using an API, then if found, fetch and execute the relevant runbook’s steps (for example, create snapshots, restart instances, and send emails).
The overall system enables automated detection and remediation of operational issues on AWS while enforcing adherence to documented procedures through the runbook approach.
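To make the remediation side concrete, the hedged sketch below shows the kind of Lambda-backed actions an action group could expose: creating an EBS snapshot, rebooting an instance, and sending a notification email. The resource IDs and email addresses are placeholders, and this illustrates the pattern rather than the repository's implementation.

import boto3

ec2 = boto3.client("ec2")
ses = boto3.client("ses")

def create_snapshot(volume_id):
    """Snapshot an EBS volume before any disruptive remediation step."""
    snapshot = ec2.create_snapshot(VolumeId=volume_id, Description="AIOps pre-remediation snapshot")
    return snapshot["SnapshotId"]

def restart_instance(instance_id):
    """Reboot an EC2 instance as a remediation action."""
    ec2.reboot_instances(InstanceIds=[instance_id])

def notify(subject, body, sender, recipient):
    """Send a status email to stakeholders via Amazon SES (addresses are placeholders)."""
    ses.send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={"Subject": {"Data": subject}, "Body": {"Text": {"Data": body}}},
    )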
To set up this solution using Amazon Bedrock Agents, refer to the GitHub repo that provisions the following resources. Make sure to verify the AWS Identity and Access Management (IAM) permissions and follow IAM best practices while deploying the code. It is advised to apply least-privilege permissions for IAM policies.

S3 bucket
Amazon Bedrock agent
Action group
Amazon Bedrock agent IAM role
Amazon Bedrock agent action group
Lambda function
Lambda service policy permission
Lambda IAM role

Benefits
With this solution, organizations can automate their operations and save a lot of time. The automation is also less prone to errors compared to manual execution. It offers the following additional benefits:

Reduced manual intervention – Automating incident detection, diagnosis, and remediation helps minimize human involvement, reducing the likelihood of errors, delays, and inconsistencies that often arise from manual processes.
Increased operational efficiency – By using generative AI, the solution speeds up incident resolution and optimizes operational workflows. The automation of tasks such as runbook execution, resource monitoring, and remediation allows IT teams to focus on more strategic initiatives.
Scalability – As organizations grow, managing IT operations manually becomes increasingly complex. Automating operations using generative AI can scale with the business, managing more incidents, runbooks, and infrastructure without requiring proportional increases in personnel.

Clean up
To avoid incurring unnecessary costs, it’s recommended to delete the resources created during the implementation of this solution when not in use. You can do this by deleting the AWS CloudFormation stacks deployed as part of the solution, or manually deleting the resources on the AWS Management Console or using the AWS Command Line Interface (AWS CLI).
Conclusion
The AIOps pipeline presented in this post empowers IT operations teams to streamline incident management processes, reduce manual interventions, and enhance operational efficiency. With the power of AWS services, organizations can automate incident detection, diagnosis, and remediation, enabling faster incident resolution and minimizing downtime.
Through the integration of Amazon Bedrock, Anthropic’s Claude on Amazon Bedrock, Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, and other supporting services, this solution provides real-time visibility into incidents, automated runbook generation, and dynamic remediation actions. Additionally, the solution provides timely notifications and seamless collaboration between AI agents and human operators, fostering a more proactive and efficient approach to IT operations.
Generative AI is rapidly transforming how businesses can take advantage of cloud technologies with ease. This solution using Amazon Bedrock demonstrates the immense potential of generative AI models to enhance human capabilities. By providing developers with expert guidance grounded in AWS best practices, this AI assistant enables DevOps teams to review and optimize cloud architecture across AWS accounts.
Try out the solution yourself and leave any feedback or questions in the comments.

About the Authors
Upendra V is a Sr. Solutions Architect at Amazon Web Services, specializing in Generative AI and cloud solutions. He helps enterprise customers design and deploy production-ready Generative AI workloads, implement Large Language Models (LLMs) and Agentic AI systems, and optimize cloud deployments. With expertise in cloud adoption and machine learning, he enables organizations to build and scale AI-driven applications efficiently.
Deepak Dixit is a Solutions Architect at Amazon Web Services, specializing in Generative AI and cloud solutions. He helps enterprises architect scalable AI/ML workloads, implement Large Language Models (LLMs), and optimize cloud-native applications.

NVIDIA AI Just Open Sourced Canary 1B and 180M Flash – Multilingual …

In the realm of artificial intelligence, multilingual speech recognition and translation have become essential tools for facilitating global communication. However, developing models that can accurately transcribe and translate multiple languages in real-time presents significant challenges. These challenges include managing diverse linguistic nuances, maintaining high accuracy, ensuring low latency, and deploying models efficiently across various devices.​

To address these challenges, NVIDIA AI has open-sourced two models: Canary 1B Flash and Canary 180M Flash. These models are designed for multilingual speech recognition and translation, supporting languages such as English, German, French, and Spanish. Released under the permissive CC-BY-4.0 license, these models are available for commercial use, encouraging innovation within the AI community.​

Technically, both models utilize an encoder-decoder architecture. The encoder is based on FastConformer, which efficiently processes audio features, while the Transformer Decoder handles text generation. Task-specific tokens, including <target language>, <task>, <toggle timestamps>, and <toggle PnC> (punctuation and capitalization), guide the model’s output. The Canary 1B Flash model comprises 32 encoder layers and 4 decoder layers, totaling 883 million parameters, whereas the Canary 180M Flash model consists of 17 encoder layers and 4 decoder layers, amounting to 182 million parameters. This design ensures scalability and adaptability to various languages and tasks. ​
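For readers who want to try the models, the released checkpoints are designed to load through NVIDIA's NeMo toolkit. The snippet below is a hedged sketch based on typical NeMo ASR usage; the class name, checkpoint identifier, and transcribe arguments are assumptions that may differ across NeMo releases and should be checked against the model cards.

# Hedged sketch: assumes the NeMo toolkit is installed and that the checkpoint is
# published as 'nvidia/canary-1b-flash'; verify names against the official model card.
from nemo.collections.asr.models import EncDecMultiTaskModel

canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")

# Transcribe a local English audio file (path is a placeholder)
transcripts = canary.transcribe(["sample_audio.wav"], batch_size=1)
print(transcripts[0])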

Performance metrics indicate that the Canary 1B Flash model achieves an inference speed exceeding 1000 RTFx on open ASR leaderboard datasets, enabling real-time processing. In English automatic speech recognition (ASR) tasks, it attains a word error rate (WER) of 1.48% on the Librispeech Clean dataset and 2.87% on the Librispeech Other dataset. For multilingual ASR, the model achieves WERs of 4.36% for German, 2.69% for Spanish, and 4.47% for French on the MLS test set. In automatic speech translation (AST) tasks, the model demonstrates robust performance with BLEU scores of 32.27 for English to German, 22.6 for English to Spanish, and 41.22 for English to French on the FLEURS test set. ​

Data as of March 20, 2025.

The smaller Canary 180M Flash model also delivers impressive results, with an inference speed surpassing 1200 RTFx. It achieves a WER of 1.87% on the Librispeech Clean dataset and 3.83% on the Librispeech Other dataset for English ASR. For multilingual ASR, the model records WERs of 4.81% for German, 3.17% for Spanish, and 4.75% for French on the MLS test set. In AST tasks, it achieves BLEU scores of 28.18 for English to German, 20.47 for English to Spanish, and 36.66 for English to French on the FLEURS test set. ​

Both models support word-level and segment-level timestamping, enhancing their utility in applications requiring precise alignment between audio and text. Their compact sizes make them suitable for on-device deployment, enabling offline processing and reducing dependency on cloud services. Moreover, their robustness leads to fewer hallucinations during translation tasks, ensuring more reliable outputs. The open-source release under the CC-BY-4.0 license encourages commercial utilization and further development by the community.​

In conclusion, NVIDIA’s open-sourcing of the Canary 1B and 180M Flash models represents a significant advancement in multilingual speech recognition and translation. Their high accuracy, real-time processing capabilities, and adaptability for on-device deployment address many existing challenges in the field. By making these models publicly available, NVIDIA not only demonstrates its commitment to advancing AI research but also empowers developers and organizations to build more inclusive and efficient communication tools.

Check out the Canary 1B Model and Canary 180M Flash. All credit for this research goes to the researchers of this project.

The post NVIDIA AI Just Open Sourced Canary 1B and 180M Flash – Multilingual Speech Recognition and Translation Models appeared first on MarkTechPost.

Microsoft AI Introduces Claimify: A Novel LLM-based Claim-Extraction M …

The widespread adoption of Large Language Models (LLMs) has significantly changed the landscape of content creation and consumption. However, it has also introduced critical challenges regarding accuracy and factual reliability. The content generated by LLMs often includes claims that lack proper verification, potentially leading to misinformation. Therefore, accurately extracting claims from these outputs for effective fact-checking has become essential, albeit challenging due to inherent ambiguities and context dependencies.

Microsoft AI Research has recently developed Claimify, an advanced claim-extraction method based on LLMs, specifically designed to enhance accuracy, comprehensiveness, and context-awareness in extracting claims from LLM outputs. Claimify addresses the limitations of existing methods by explicitly dealing with ambiguity. Unlike other approaches, it identifies sentences with multiple possible interpretations and only proceeds with claim extraction when the intended meaning is clearly determined within the given context. This careful approach ensures higher accuracy and reliability, particularly benefiting subsequent fact-checking efforts.

From a technical standpoint, Claimify employs a structured pipeline comprising three key stages: Selection, Disambiguation, and Decomposition. During the Selection stage, Claimify leverages LLMs to identify sentences that contain verifiable information, filtering out those without factual content. In the Disambiguation stage, it uniquely focuses on detecting and resolving ambiguities, such as unclear references or multiple plausible interpretations. Claims are extracted only if ambiguities can be confidently resolved. The final stage, Decomposition, involves converting each clarified sentence into precise, context-independent claims. This structured process enhances both the accuracy and completeness of the resulting claims.
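Microsoft has not released Claimify's code in this announcement, but the three-stage structure can be illustrated with a small conceptual sketch. Everything below is hypothetical: the llm callable stands in for any chat-completion client, and the prompts are simplified paraphrases of the stages described above, not the actual Claimify prompts.

from typing import Callable, List

def extract_claims(answer: str, question: str, llm: Callable[[str], str]) -> List[str]:
    """Conceptual three-stage claim extraction: Selection -> Disambiguation -> Decomposition."""
    claims: List[str] = []
    for sentence in answer.split(". "):
        # Stage 1 - Selection: keep only sentences that contain verifiable information
        if llm(f"Does this sentence contain verifiable factual content? Answer yes/no.\n{sentence}").lower().startswith("no"):
            continue

        # Stage 2 - Disambiguation: skip sentences whose intended meaning cannot be
        # confidently resolved from the surrounding context
        verdict = llm(
            "Given the question and full answer below, can this sentence be interpreted "
            f"unambiguously? Answer yes/no.\nQuestion: {question}\nAnswer: {answer}\nSentence: {sentence}"
        )
        if verdict.lower().startswith("no"):
            continue

        # Stage 3 - Decomposition: rewrite the sentence as standalone, context-independent claims
        decomposed = llm(
            "Rewrite this sentence as a list of precise, self-contained factual claims, "
            f"one per line.\nSentence: {sentence}"
        )
        claims.extend(line.strip("- ").strip() for line in decomposed.splitlines() if line.strip())
    return claims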

In evaluations using the BingCheck dataset—which covers a broad range of topics and complex LLM-generated responses—Claimify demonstrated notable improvements over previous methods. It achieved a high entailment rate of 99%, indicating a strong consistency between the extracted claims and the original content. Regarding coverage, Claimify captured 87.6% of verifiable content while maintaining a high precision rate of 96.7%, outperforming comparable approaches. Its systematic approach to decontextualization also ensured that essential contextual details were retained, resulting in better-grounded claims compared to prior methods.

Overall, Claimify represents a meaningful advancement in the automated extraction of reliable claims from LLM-generated content. By methodically addressing ambiguity and contextuality through a structured and careful evaluation framework, Claimify establishes a new standard for accuracy and reliability. As reliance on LLM-produced content continues to grow, tools like Claimify will play an increasingly crucial role in ensuring the trustworthiness and factual integrity of this content.

Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.

The post Microsoft AI Introduces Claimify: A Novel LLM-based Claim-Extraction Method that Outperforms Prior Solutions to Produce More Accurate, Comprehensive, and Substantiated Claims from LLM Outputs appeared first on MarkTechPost.