Salesforce AI Research Proposes PerfCodeGen: A Training-Free Framework that Enhances the Performance of LLM-Generated Code with Execution Feedback

Large Language Models (LLMs) have become essential tools in software development, offering capabilities such as generating code snippets, automating unit tests, and debugging. However, these models often fall short in producing code that is not only functionally correct but also efficient in runtime. Overlooking runtime efficiency can lead to software that performs poorly, increases operational costs, and impacts user experience. This issue is particularly pronounced for less experienced developers, who may rely on AI-suggested code without fully understanding its implications. Salesforce Research addresses these challenges with PerfCodeGen, a framework that aims to improve both the correctness and performance of LLM-generated code.

Salesforce AI’s PerfCodeGen is a training-free framework designed to enhance the runtime efficiency of LLM-generated code. It achieves this by using execution feedback in an iterative self-refinement process. Unlike approaches requiring fine-tuning with extensive training data, PerfCodeGen employs a feedback loop that evaluates and refines code based on runtime metrics during test execution. The framework operates in two key phases: refining correctness and optimizing performance. Initially, it ensures the generated code meets functional requirements by addressing issues identified in unit tests. Once correctness is established, the framework focuses on runtime efficiency, optimizing the code by targeting and refining the most resource-intensive test cases. This iterative process results in solutions that are both correct and efficient.

Technical Insights and Benefits

PerfCodeGen integrates with existing LLM workflows and begins by generating multiple candidate solutions using nucleus sampling. In the first phase, these candidates are assessed for correctness through unit tests. Feedback from failed tests is used to refine the solutions. Once functional correctness is ensured, the framework moves to the second phase, analyzing runtime metrics to identify bottlenecks. This information is then used to optimize the code further, focusing on the most time-consuming test cases.
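The loop below is a minimal sketch of this two-phase process; sample_candidates, run_unit_tests, profile_tests, and refine_with_llm are hypothetical helpers standing in for the LLM calls and the test harness, not the authors' implementation.

# Minimal sketch of a two-phase correctness-then-performance refinement loop.
# sample_candidates, run_unit_tests, profile_tests, and refine_with_llm are hypothetical
# helpers standing in for the LLM calls and the test harness.

def two_phase_refine(task_prompt, unit_tests, num_candidates=5, max_rounds=3):
    # Phase 1: refine for functional correctness using unit-test feedback.
    candidates = sample_candidates(task_prompt, n=num_candidates)   # nucleus sampling
    correct = []
    for code in candidates:
        for _ in range(max_rounds):
            failures = run_unit_tests(code, unit_tests)             # failing tests + error messages
            if not failures:
                correct.append(code)
                break
            code = refine_with_llm(code, feedback=failures)
    if not correct:
        return None

    # Phase 2: refine for runtime efficiency using execution profiles.
    def total_runtime(program):
        return sum(profile_tests(program, unit_tests).values())     # per-test wall-clock times

    best = min(correct, key=total_runtime)
    for _ in range(max_rounds):
        profile = profile_tests(best, unit_tests)
        slowest_test = max(profile, key=profile.get)                # most resource-intensive test
        candidate = refine_with_llm(best, feedback={"slowest_test": slowest_test})
        if not run_unit_tests(candidate, unit_tests) and total_runtime(candidate) < total_runtime(best):
            best = candidate                                        # accept only if still correct and faster
    return best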

This two-phase process increases the likelihood of producing optimally efficient programs. PerfCodeGen’s methodology mirrors human debugging and optimization practices, making it both effective and intuitive. Additionally, the framework’s reliance on feedback rather than retraining allows it to scale across various LLMs and application domains. It has shown consistent improvements in runtime efficiency and correctness across models such as Phi-3-mini, Llama 3, and GPT-4.

PerfCodeGen has been tested on benchmarks such as HumanEval, MBPP, and APPS, demonstrating its effectiveness:

Runtime Efficiency: On HumanEval, GPT-4’s optimization rate (%Opt) increased from 24.54% to 28.83% with PerfCodeGen, with similar improvements observed across other models.

Correctness Improvement: On MBPP, GPT-3.5’s correctness rate (%Correct) rose from 66.38% to 73.36% with a single sample (Best@1).

Outperforming Ground Truth: PerfCodeGen enabled LLMs to generate more efficient solutions than the ground truth in approximately 55% of HumanEval tasks and 67% of MBPP tasks.

Scalability: Open models such as Phi-3-mini and Mixtral achieved performance comparable to closed models like GPT-3.5 and GPT-4.

These results highlight PerfCodeGen’s ability to balance correctness and runtime efficiency effectively, making it a valuable addition to LLM-driven code generation workflows.

Conclusion:

PerfCodeGen offers a practical solution to a key limitation of current LLMs: their focus on correctness at the expense of runtime efficiency. By incorporating execution feedback into an iterative refinement process, PerfCodeGen enables the generation of code that is both correct and efficient. This approach enhances the usability of LLMs in software development, providing developers with tools to produce higher-quality code without extensive retraining. The framework’s success across diverse benchmarks demonstrates its potential as a step forward in creating efficient, reliable, and accessible AI-driven programming solutions.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


ChemAgent: Enhancing Large Language Models for Complex Chemical Reasoning with Dynamic Memory Frameworks

Chemical reasoning involves intricate, multi-step processes requiring precise calculations, where small errors can lead to significant issues. LLMs often struggle with domain-specific challenges, such as accurately handling chemical formulas, reasoning through complex steps, and integrating code effectively. Despite advancements in scientific reasoning, benchmarks like SciBench reveal LLMs’ limitations in solving chemical problems, highlighting the need for innovative approaches. Recent frameworks, such as StructChem, attempt to address these challenges by structuring problem-solving into stages like formula generation and confidence-based reviews. Other techniques, including advanced prompting strategies and Python-based reasoning tools, have also been explored. For instance, ChemCrow leverages function calling and precise code generation for tackling chemistry-specific tasks, while combining LLMs with external tools like Wolfram Alpha shows potential for improving accuracy in scientific problem-solving, though integration remains a challenge.

Decomposing complex problems into smaller tasks has enhanced model reasoning and accuracy, particularly in multi-step chemical problems. Studies emphasize the benefits of breaking down queries into manageable components, improving understanding and performance in domains like reading comprehension and complex question answering. Additionally, self-evolution techniques, where LLMs refine their outputs through iterative improvement and prompt evolution, have shown promise. Memory-enhanced frameworks, tool-assisted critiquing, and self-verification methods strengthen LLM capabilities by enabling error correction and refinement. These advancements provide a foundation for developing scalable systems capable of handling the complexities of chemical reasoning while maintaining accuracy and efficiency.

Researchers from Yale University, UIUC, Stanford University, and Shanghai Jiao Tong University introduced ChemAgent, a framework that enhances LLM performance through a dynamic, self-updating library. ChemAgent decomposes chemical tasks into sub-tasks, storing these and their solutions in a structured memory system. This system includes Planning Memory for strategies, Execution Memory for task-specific solutions, and Knowledge Memory for foundational principles. When solving new problems, ChemAgent retrieves, refines, and updates relevant information, enabling iterative learning. Tested on SciBench datasets, ChemAgent improved accuracy by up to 46% (GPT-4), outperforming state-of-the-art methods and demonstrating potential for applications like drug discovery.

ChemAgent is a system designed to improve LLMs for solving complex chemical problems. It organizes tasks into a structured memory with three components: Planning Memory (strategies), Execution Memory (solutions), and Knowledge Memory (chemical principles). Problems are broken into smaller sub-tasks in a library built from verified solutions. Relevant tasks are retrieved, refined, and dynamically updated during inference to enhance adaptability. ChemAgent outperforms baseline models (Few-shot, StructChem) on four datasets, achieving high accuracy through structured memory and iterative refinement. Its hierarchical approach and memory integration establish an effective framework for advanced chemical reasoning tasks.
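A minimal sketch of this memory design, using hypothetical helpers (embed, retrieve_top_k, llm_plan, llm_solve, verify) in place of the paper's actual components:

# Illustrative sketch of a ChemAgent-style structured memory (not the released code).
# embed, retrieve_top_k, llm_plan, llm_solve, and verify are hypothetical helpers.

class ChemMemory:
    def __init__(self):
        self.planning = []    # (problem embedding, strategy) pairs
        self.execution = []   # (sub-task embedding, verified solution) pairs
        self.knowledge = []   # (embedding, chemical principle or formula) pairs

    def retrieve(self, problem, k=3):
        query = embed(problem)
        return {
            "strategies": retrieve_top_k(self.planning, query, k),
            "solutions": retrieve_top_k(self.execution, query, k),
            "principles": retrieve_top_k(self.knowledge, query, k),
        }

    def update(self, sub_task, solution):
        # Only verified solutions are written back, so the library improves over time.
        if verify(sub_task, solution):
            self.execution.append((embed(sub_task), solution))

def solve(problem, memory):
    context = memory.retrieve(problem)
    sub_tasks = llm_plan(problem, context["strategies"])          # decompose the problem
    answers = []
    for sub_task in sub_tasks:
        answer = llm_solve(sub_task, context["solutions"], context["principles"])
        memory.update(sub_task, answer)                           # dynamic self-updating step
        answers.append(answer)
    return answers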

The study evaluates ChemAgent’s memory components (Mp, Me, Mk) to identify their contributions, with GPT-4 as the base model. Results show that removing any component reduces performance, with Mk being the most impactful, particularly in datasets like ATKINS with limited memory pools. Memory quality is crucial, as GPT-4-generated memories outperform GPT-3.5, while hybrid memories degrade accuracy due to conflicting inputs. ChemAgent demonstrates consistent performance improvement across different LLMs, with the most notable gains on powerful models like GPT-4. The self-updating memory mechanism enhances problem-solving capabilities, particularly in complex datasets requiring specialized chemical knowledge and logical reasoning.

In conclusion, ChemAgent is a framework that enhances LLMs in solving complex chemical problems through self-exploration and a dynamic, self-updating memory library. By decomposing tasks into planning, execution, and knowledge components, ChemAgent builds a structured library to improve task decomposition and solution generation. Experiments on datasets like SciBench show significant performance gains, up to a 46% improvement using GPT-4. The framework effectively addresses challenges in chemical reasoning, such as handling domain-specific formulas and multi-step processes. It holds promise for broader applications in drug discovery and materials science.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


NVIDIA AI Introduces Omni-RGPT: A Unified Multimodal Large Language Model for Seamless Region-level Understanding in Images and Videos

Multimodal large language models (MLLMs) bridge vision and language, enabling effective interpretation of visual content. However, achieving precise and scalable region-level comprehension for static images and dynamic videos remains challenging. Temporal inconsistencies, scaling inefficiencies, and limited video comprehension hinder progress, particularly in maintaining consistent object and region representations across video frames. Temporal drift, caused by motion, scaling, or perspective changes, coupled with reliance on computationally heavy methods like bounding boxes or Region of Interest (RoI)-aligned features, increases complexity and limits real-time and large-scale video analysis.

Recent strategies, such as textual region coordinates, visual markers, and RoI-based features, have attempted to address these issues. However, they often fail to ensure temporal consistency across frames or efficiently process large datasets. Bounding boxes lack robustness for multi-frame tracking, and static frame analysis misses intricate temporal relationships. While innovations like embedding coordinates into textual prompts and using image-based markers have advanced the field, a unified solution for image and video domains remains out of reach.

Researchers from NVIDIA and Yonsei University developed Omni-RGPT, a novel multimodal large language model designed to achieve seamless region-level comprehension in images and videos to address these challenges. This model introduces Token Mark, a groundbreaking method that embeds region-specific tokens into visual and text prompts, establishing a unified connection between the two modalities. The Token Mark system replaces traditional RoI-based approaches by defining a unique token for each target region, which remains consistent across frames in a video. This strategy prevents temporal drift and reduces computational costs, enabling robust reasoning for static and dynamic inputs. Including a Temporal Region Guide Head further enhances the model’s performance on video data by classifying visual tokens to avoid reliance on complex tracking mechanisms.
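In spirit, Token Mark assigns one learnable embedding per target region and adds it to the visual tokens that region covers, while the same token is referenced in the text prompt. The following PyTorch-style sketch is a simplified illustration with assumed shapes and names, not NVIDIA's released implementation:

import torch
import torch.nn as nn

# Simplified illustration of the Token Mark idea (assumed shapes and names, not NVIDIA's code).
class TokenMark(nn.Module):
    def __init__(self, num_region_tokens=16, dim=1024):
        super().__init__()
        # A shared pool of learnable region tokens, reused for images, videos, and text prompts.
        self.region_tokens = nn.Parameter(torch.randn(num_region_tokens, dim))

    def forward(self, visual_tokens, region_masks):
        # visual_tokens: (num_patches, dim); region_masks: (num_regions, num_patches) with 0/1 entries
        marked = visual_tokens
        for r, mask in enumerate(region_masks):
            # Add the r-th region token to every visual token inside region r; the same token
            # is reused across all frames, so the region identity stays consistent over time.
            marked = marked + mask.unsqueeze(-1).float() * self.region_tokens[r]
        return marked  # the text prompt references the same tokens, e.g. a "<region_0>" placeholder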

Omni-RGPT leverages a newly created large-scale dataset called RegVID-300k, which contains 98,000 unique videos, 214,000 annotated regions, and 294,000 region-level instruction samples. This dataset was constructed by combining data from ten public video datasets, offering diverse and fine-grained instructions for region-specific tasks. The dataset supports visual commonsense reasoning, region-based captioning, and referring expression comprehension. Unlike other datasets, RegVID-300k includes detailed captions with temporal context and mitigates visual hallucinations through advanced validation techniques.

Omni-RGPT achieved state-of-the-art results on several benchmarks, including 84.5% accuracy on the Causal-VidQA dataset, which evaluates temporal and spatial reasoning across video sequences. The model outperformed existing methods like MotionEpic by over 5% in some sub-tasks, demonstrating superior performance in prediction and counterfactual reasoning. Similarly, the model excelled in video captioning tasks, achieving high METEOR scores on challenging datasets like Vid-STG and BenSMOT. For image-based tasks, the model achieved remarkable accuracy on the Visual Commonsense Reasoning (VCR) dataset, outperforming methods specifically optimized for image domains.

Several key takeaways from the research on Omni-RGPT include:

Token Mark enables consistent and scalable region-level understanding by embedding predefined region tokens into visual and text inputs. This prevents temporal drift and supports seamless reasoning across frames.

The RegVID-300k dataset provides detailed, fine-grained, and diverse annotations, enabling the model to excel in complex video tasks. It includes 294,000 region-level instructions and addresses gaps in existing datasets.

Omni-RGPT demonstrated superior performance across benchmarks such as Causal-VidQA and VCR, achieving accuracy improvements of up to 5% compared to leading models.

The model’s design reduces computational overhead by avoiding dependency on bounding box coordinates or full video tracklets, making it suitable for real-world applications.

The framework seamlessly integrates image and video tasks under a single architecture, achieving exceptional performance without compromising efficiency.

In conclusion, Omni-RGPT addresses critical challenges in region-specific multimodal learning by introducing Token Mark and a novel dataset to support detailed comprehension in images and videos. The model’s scalable design and state-of-the-art performance across diverse tasks set a new benchmark for the field. Omni-RGPT provides a robust foundation for future research and practical applications in AI by eliminating temporal drift, reducing computational complexity, and leveraging large-scale data.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.


This AI Paper from Alibaba Unveils WebWalker: A Multi-Agent Framework for Benchmarking Multistep Reasoning in Web Traversal

Enabling artificial intelligence to navigate the web and retrieve contextually rich, multi-faceted information is important for enhancing AI capabilities. Traditional search engines are limited to superficial results, failing to capture the nuances required to investigate deeply integrated content across a network of related web pages. This constraint limits LLMs in tasks that require reasoning across hierarchical information, which negatively impacts domains such as education, organizational decision-making, and the resolution of complex inquiries. Current benchmarks do not adequately assess the intricacies of multi-step interactions, resulting in a considerable deficit in evaluating and improving LLMs’ capabilities in web traversal.

Though Mind2Web and WebArena focus on action-oriented interactions that contain HTML directives, they suffer from important limitations such as noise, poor understanding of wider context, and weak support for multi-step reasoning. RAG systems are useful for retrieving real-time data but are largely limited to horizontal searches that often miss key content buried within the deeper layers of websites. The limitations of current methodologies make them inadequate for addressing complex, data-driven issues that require concurrent reasoning and planning across numerous web pages.

Researchers from the Alibaba Group introduced WebWalker, a multi-agent framework designed to emulate human-like web navigation. This dual-agent system consists of the Explorer Agent, tasked with methodical page navigation, and the Critic Agent, which aggregates and assesses information to facilitate query resolution. By combining horizontal and vertical exploration, this explore-critic system overcomes the limitations of traditional RAG systems. The dedicated benchmark, WebWalkerQA, with single-source and multi-source queries, evaluates whether the AI can handle layered, multi-step tasks. This coupling of vertical exploration with reasoning allows WebWalker to improve the depth and quality of retrieved information by leaps and bounds.
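A rough sketch of this explore-critic loop, using hypothetical helpers for page fetching and the two LLM agents (an illustration, not Alibaba's implementation):

# Sketch of a WebWalker-style explorer/critic loop (illustration only).
# fetch_page, explorer_choose_link, critic_extract, and critic_answer are hypothetical helpers
# standing in for the page loader and the two LLM agents.

def webwalker(query, start_url, max_steps=10):
    url, evidence = start_url, []
    for _ in range(max_steps):
        page = fetch_page(url)                         # text/HTML of the current page
        evidence += critic_extract(query, page)        # critic keeps query-relevant snippets
        answer = critic_answer(query, evidence)        # critic decides whether evidence suffices
        if answer is not None:
            return answer
        url = explorer_choose_link(query, page)        # explorer picks the next subpage (vertical exploration)
        if url is None:
            break
    return critic_answer(query, evidence) or "Unable to answer with the pages visited."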

The benchmark supporting WebWalker, WebWalkerQA, comprises 680 question-answer pairs derived from 1,373 web pages in domains related to education, organizations, conferences, and games. Most queries mimic realistic tasks and require inferring information spread over several subpages. Evaluation of accuracy is in terms of correct answers, along with the number of actions, or steps taken by the system to resolve it, for single-source and multi-source reasoning. Evaluated with different model architectures, including GPT-4o and Qwen-2.5 series, WebWalker showed robustness when dealing with complex and dynamic queries. It used HTML metadata to navigate correctly and had a thought-action-observation framework to engage proficiently with structured web hierarchies.

The results show that WebWalker has an important advantage in managing complex web navigation tasks compared with ReAct and Reflexion, significantly surpassing them in accuracy in both single-source and multi-source scenarios. The system also demonstrated outstanding performance in layered reasoning tasks while keeping action counts optimized, striking an effective balance between accuracy and resource usage. Such results confirm the scalability and adaptability of the system and make it a benchmark for AI-enhanced web navigation frameworks.

WebWalker solves the problems of navigation and reasoning over highly integrated web content with a dual-agent framework based on an explore-critic paradigm. Its companion benchmark, WebWalkerQA, systematically tests these capabilities and provides a challenging evaluation for web navigation tasks. It is an important step toward AI systems that can access and manage dynamic, layered information efficiently, marking a milestone in AI-enhanced information retrieval. Moreover, by rethinking web traversal metrics and enhancing retrieval-augmented generation systems, WebWalker lays a more robust foundation for increasingly intricate real-world applications, reinforcing its significance in the field of artificial intelligence.

Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project.


Chat with Your Documents Using Retrieval-Augmented Generation (RAG)

Imagine having a personal chatbot that can answer questions directly from your documents—be it PDFs, research papers, or books. With Retrieval-Augmented Generation (RAG), this is not only possible but also straightforward to implement. In this tutorial, we’ll learn how to build a chatbot that interacts with your documents, like PDFs, using Retrieval-Augmented Generation (RAG). We’ll use Groq for language model inference, Chroma as the vector store, and Gradio for the user interface.

By the end, you’ll have a chatbot capable of answering questions directly from your documents, keeping context of your conversation, and providing concise, accurate answers.

What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is an AI architecture that enhances the capabilities of Large Language Models (LLMs) by integrating an information retrieval system. This system fetches relevant data from external sources, providing the LLM with grounded information to generate more accurate and contextually appropriate responses. By combining the generative abilities of LLMs with real-time data retrieval, RAG reduces inaccuracies and ensures up-to-date information in AI-generated content.

Prerequisites

Python Installation: Ensure Python 3.9+ is installed on your system.

Groq API Key: Sign up for a Groq account and generate an API key:

Visit Groq Console.

Navigate to API Keys and create a new key.

Copy your API key for use in the project.

Dependencies: Install the required libraries:

pip install langchain langchain-community langchain-groq gradio sentence-transformers PyPDF2 chromadb

These libraries will help with language processing, building the user interface, model integration, PDF handling, and vector database management.

Downloading the PDF Resource

For this tutorial, we’ll use a publicly available PDF containing information about diseases, their symptoms, and cures. Download the PDF and save it in your project directory (you are free to use any PDF).

Step 1: Extracting Text from the PDF

We’ll use PyPDF2 to extract text from the PDF:

from PyPDF2 import PdfReader

def extract_text_from_pdf(pdf_path):
    # Read each page of the PDF and concatenate its text
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""  # extract_text() can return None for image-only pages
    return text

pdf_path = 'diseases.pdf'  # Replace with your PDF path
pdf_text = extract_text_from_pdf(pdf_path)

Step 2: Split the Text into Chunks

Long documents are divided into smaller, manageable chunks for processing.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_text_into_chunks(text, chunk_size=2000, chunk_overlap=200):
    # Split the document into overlapping chunks so retrieval stays within context limits
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    return text_splitter.split_text(text)

text_chunks = split_text_into_chunks(pdf_text)

Step 3: Create a Vector Store with Chroma

We’ll embed the text chunks using a pre-trained model and store them in a Chroma vector database.

from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

# Embed the chunks with a sentence-transformer model and store them in a persistent Chroma collection
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

vector_store = Chroma(
    collection_name="disease_info",
    embedding_function=embedding_model,
    persist_directory="./chroma_db"
)

vector_store.add_texts(texts=text_chunks)

Step 4: Initialize the Groq Language Model

To use Groq’s language model, set your API key and initialize the ChatGroq instance.

import os
from langchain_groq import ChatGroq

os.environ["GROQ_API_KEY"] = 'your_groq_api_key_here'  # Replace with your API key

# A low temperature keeps answers factual and concise
llm = ChatGroq(model="mixtral-8x7b-32768", temperature=0.1)

Step 5: Create the Conversational Retrieval Chain

With LangChain’s ConversationalRetrievalChain, we can link the language model and the vector database.

from langchain.chains import ConversationalRetrievalChain

retrieval_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),  # retrieve the 3 most relevant chunks
    return_source_documents=True
)

Step 6: Implement the Chatbot Logic

We define the logic for maintaining conversation history and generating responses.

conversation_history = []

def get_response(user_query):
    # Pass the running chat history so follow-up questions keep their context
    response = retrieval_chain({
        "question": user_query,
        "chat_history": conversation_history
    })
    conversation_history.append((user_query, response['answer']))
    return response['answer']

Step 7: Build the User Interface with Gradio

Finally, create a Gradio interface to interact with the chatbot.

import gradio as gr

def chat_interface(user_input, history):
    response = get_response(user_input)
    history.append((user_input, response))
    return history, history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    state = gr.State([])
    with gr.Row():
        user_input = gr.Textbox(show_label=False, placeholder="Enter your question…")
        submit_btn = gr.Button("Send")
    submit_btn.click(chat_interface, inputs=[user_input, state], outputs=[chatbot, state])

demo.launch()  # launch the web UI when the script is run

Running the Code

Save the script as app.py and run

python app.py

Hurray! You are done. The Gradio interface will launch, allowing you to chat with your document.

But why stop here? You can go further by trying to build any of the following functionalities in the chatbot.

Enhanced Vector Store: Use other vector databases like Milvus or Pinecone for scalability.

Fine-tuned Models: Experiment with fine-tuned Groq models for domain-specific accuracy.

Multi-Document Support: Extend the system to handle multiple documents.

Better Context Handling: Refine conversational logic to better manage longer chat histories.

Custom UI: Design a more polished user interface with advanced styling and features.

Congratulations! You’ve successfully built a document-based chatbot using Groq and LangChain. Experiment with improvements and build something amazing!

Resources:

https://nios.ac.in/media/documents/SrSec314NewE/Lesson-29.pdf

LangChain (https://www.langchain.com/)

Groq (https://groq.com/)



CoAgents: A Frontend Framework Reshaping Human-in-the-Loop AI Agents for Building Next-Generation Interactive Applications with Agent UI and LangGraph Integration

With AI agents being the talk of the town, CopilotKit is an open-source framework designed to give you hands-on exposure to that experience. It facilitates the integration of AI copilots into applications, enabling developers to create interactive AI-driven functionality easily. It provides a robust infrastructure for rapidly deploying production-ready AI experiences, ranging from a simple chatbot to a complex multi-agent system.

CopilotKit offers multiple core experiences, the most recent of which is CoAgents, which provides an Agent UI when building agentic applications. Imagine a system where you can collaboratively build complex projects alongside an AI that understands context, responds to your feedback, and adapts to evolving requirements in real-time. That’s precisely what CoAgents offers. Also, combining the strengths of CopilotKit and LangGraph through CoAgents allows users to build agent-native applications that can think, adapt, and collaborate with users in real time.

CoAgents provide users with five core strengths:

Seamless State Sync: With just one line of code, your app and agent stay perfectly in sync, ensuring that the agent instantly knows what the app knows.

Agentic Generative UI or Agent UI: Build real-time, dynamic user interfaces that update based on your agent’s thinking. This feature promotes trust through transparency by showing intermediate agent states.

Intermediate Agent State Streaming: This feature lets you peek into your agent’s processing steps in real-time, offering engaging and transparent experiences as progress unfolds.

Human-in-the-loop (HITL): Implement smart checkpoints where humans can intervene and guide the agents. This is ideal for tasks requiring a human touch.

Real-Time Frontend Actions: Integrate backend and frontend workflows to enable your agent to execute context-aware actions seamlessly within your application.

Let’s look into a demonstration covered by the CEO of CopilotKit, Atai Barkai, and team – CoAgents integrated with the powerful LangChain framework to create an AI agent capable of writing a complete children’s book. This AI agent can chat, create a story outline, generate characters, write chapters, and generate image descriptions, which can be utilized to create illustrations with DALL-E 3. Combining all these steps results in a fully fleshed-out children’s story, complete with narrative structure, compelling characters, and AI-generated artwork. Looking at how it works, there are five main steps:

Story Outline Creation: We ask the AI agent to produce an outline for a children’s story. Our example features a kid from Earth traveling to Mars for space exploration. Within moments, the AI provides a structured outline in our web app, giving us a bird’s-eye view of the upcoming narrative.


Dynamic Customization: The real power of CoAgents shines when changes are introduced. Instead of one kid going to Mars, we can seamlessly shift gears and ask for two kids—Alex and John—to travel to the Moon. After confirming the updates with the AI, the story outline instantly adjusts to the new requirements. This two-way communication between the application and the AI makes it easy to iterate on the creative process.

Real-Time Story and Character Creation: With the outline set, we instruct the AI to generate character profiles and write the actual chapters. Because CoAgents is fully integrated with LangChain, the story-writing process happens in real-time. As the AI works, each chapter appears in the app’s interface, allowing you to follow the story’s progress as it unfolds.


Streaming Intermediate States: A standout feature of CopilotKit is the ability to stream intermediate states. You can watch each phase of the AI’s work in the chat window, from brainstorming ideas to polishing the final text. This transparency provides deeper insights into the AI’s reasoning and can help identify moments when human intervention is beneficial.

State Control: Another advantage of CoAgents is the granular control over data visibility. Developers can decide which processes are exposed in the front end and which remain hidden for security or proprietary reasons. So, while the AI might generate style parameters for illustrations behind the scenes, the user only sees the final creative output.


This example demonstrates the unique possibilities and aspects that can be impacted directly in the frontend with CoAgents. You can explore other samples on the CopilotKit page, like the Agent-Native Travel Planner and the Agent-Native Research Canvas, both built as Agent-Native Applications (ANAs), which are an interesting exploration in themselves. ANAs combine domain-specific agents, direct application integration, and user collaboration to deliver truly interactive, adaptive workflows. They extend beyond simple chat interfaces, using transparency and guided interactions to give users control while leveraging AI-driven assistance. This hybrid approach ensures context-awareness, intelligent recommendations, and seamless task execution within an app’s native environment. Rather than working in isolation, ANAs utilize human oversight at every stage to build trust, reduce errors, and streamline operations. This results in an engaging, efficient user experience that outperforms standalone copilots and fully autonomous systems, charting a new path for modern SaaS innovation and growth.

Now, let’s look into the quickstart on CoAgents; this guide assumes you’re familiar with using LangGraph to build agent workflows. If you need a quick introduction, check out this brief example from the LangGraph docs. 

Getting started with CoAgents requires three prerequisites: familiarity with LangGraph for building agent workflows, a valid LangSmith API key, and a LangGraph agent implementation in Python or JavaScript. The system offers two deployment paths: the recommended LangGraph Platform, which supports local and cloud deployments, or the Copilot Remote Endpoint, which allows Python-only self-hosting via FastAPI. 


Integration can be achieved through either Copilot Cloud or self-hosted runtime. The cloud integration process requires a LangGraph deployment URL and LangSmith API key. Users need to register their LangGraph agent through cloud.copilotkit.ai and configure Remote Endpoints for backend connections. Self-hosted runtime requires manual backend configuration and follows separate documentation.


The implementation can be verified by testing the chatbot Agent UI interface and confirming agent responses. For troubleshooting, users should verify the validity of their LangSmith API key, check the accessibility of the deployment URL, ensure proper environment configuration, and validate Remote Endpoint connections. These steps ensure a functional CoAgents implementation with proper backend communication.


In conclusion, CoAgents is a frontend framework developed by CopilotKit that enables companies to build agent-native applications with robust Agent UI features, ensuring complete real-time visibility into the agent’s actions. Its integrated “UI for your Agent” component provides transparent monitoring to foster user trust and prevent confusion during execution. CoAgents also supports advanced human-in-the-loop capabilities through shared state management between agents and applications, allowing developers to create agentic generative interfaces that dynamically respond to the agent’s evolving state. As a result, CoAgents stands out as the go-to solution for teams seeking to leverage powerful, dynamic Agent UI elements in their agent-native applications.

Sources

https://docs.copilotkit.ai/?ref=github_readme 

https://docs.copilotkit.ai/coagents?utm_source=newsletter&utm_medium=marktechpost&utm_campaign=coagents-release 

https://github.com/CopilotKit/CopilotKit?utm_source=newsletter&utm_medium=marktechpost&utm_campaign=coagents-release 

https://www.youtube.com/watch?v=nBephBv4zr0 

https://dev.to/copilotkit 

https://blog.dailydoseofds.com/p/copilotkit-build-deploy-and-operate

https://www.marktechpost.com/2024/10/02/copilotkits-coagents-the-missing-link-that-makes-it-easy-to-connect-langgraph-agents-to-humans-in-the-loop/ 

https://www.copilotkit.ai/blog/build-full-stack-apps-with-langgraph-and-copilotkit 

https://www.copilotkit.ai/blog/new-wave-of-agent-native-apps 

Thanks to the Tawkit team for the thought leadership and resources behind this article; the Tawkit team supported us in producing this content.

Enhancing Retrieval-Augmented Generation: Efficient Quote Extraction for Scalable and Accurate NLP Systems

LLMs have significantly advanced natural language processing, excelling in tasks like open-domain question answering, summarization, and conversational AI. However, their growing size and computational demands highlight inefficiencies in managing extensive contexts, particularly in functions requiring complex reasoning and retrieving specific information. To address this, Retrieval-Augmented Generation (RAG) combines retrieval systems with generative models, allowing access to external knowledge for improved domain-specific performance without extensive retraining. Despite its promise, RAG faces challenges, especially for smaller models, which often struggle with reasoning in large or noisy contexts, limiting their effectiveness in handling complex scenarios.

Researchers from TransLab, University of Brasilia, have introduced LLMQuoter, a lightweight model designed to enhance RAG by implementing a “quote-first-then-answer” strategy. Built on the LLaMA-3B architecture and fine-tuned with Low-Rank Adaptation (LoRA) using a subset of the HotpotQA dataset, LLMQuoter identifies key textual evidence before reasoning, reducing cognitive load and improving accuracy. Leveraging knowledge distillation from high-performing teacher models achieves significant accuracy gains—over 20 points—compared to full-context methods like RAFT while maintaining computational efficiency. LLMQuoter offers a scalable, resource-friendly solution for advanced RAG workflows, streamlining complex reasoning tasks for researchers and practitioners.
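Conceptually, the quote-first-then-answer strategy is a two-step pipeline; the sketch below is an illustration with hypothetical model wrappers and prompts, not the paper's code:

# Illustration of a quote-first-then-answer pipeline (not the LLMQuoter release).
# quoter_llm and reader_llm are hypothetical wrappers around a small fine-tuned quoter
# and any downstream answering model.

def quote_then_answer(question, context):
    quotes = quoter_llm(
        "Extract the minimal verbatim quotes from the context needed to answer the question.\n"
        f"Question: {question}\nContext: {context}"
    )
    # The reader reasons only over the short quotes, not the full noisy context.
    return reader_llm(
        f"Answer the question using only these quotes.\nQuotes: {quotes}\nQuestion: {question}"
    )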

Reasoning remains a fundamental challenge for LLMs, with both large and small models facing unique limitations. While large models excel at generalization, they often struggle with complex logical reasoning and multi-step problem-solving, primarily due to their reliance on pattern replication from training data. Smaller models, although more resource-efficient, suffer from capacity constraints, leading to difficulties in maintaining context for reasoning-intensive tasks. Techniques such as split-step reasoning, task-specific fine-tuning, and self-correction mechanisms have emerged to address these challenges. These methods break reasoning tasks into manageable phases, enhance inference efficiency, and improve accuracy by leveraging strategies like Generative Context Distillation (GCD) and domain-specific approaches. Additionally, frameworks like RAFT (Retrieval-Augmented Fine-Tuning) combine reasoning and retrieval to enable more context-aware and accurate responses, especially in domain-specific applications.

Knowledge distillation plays a critical role in making LLMs more efficient by transferring capabilities from large teacher models to smaller, resource-efficient student models. This process allows compact models to perform complex tasks like reasoning and recommendation with reduced computational demands. Techniques such as rationale-based distillation, temperature scaling, and collaborative embedding distillation bridge gaps between teacher and student models, enhancing their generalization and semantic alignment. Evaluation frameworks like DSpy and LLM-based judges provide nuanced assessments of LLM-generated outputs, incorporating metrics tailored to semantic relevance and creative aspects. Experiments demonstrate that models trained to extract relevant quotes instead of processing full contexts deliver superior performance. This highlights the advantages of integrating quote extraction with reasoning for more efficient and accurate AI systems.

The study demonstrates the effectiveness of using quote extraction to enhance RAG systems. Fine-tuning a compact quoter model with minimal resources, such as 5 minutes on an NVIDIA A100 GPU, led to significant improvements in recall, precision, and F1 scores, with the F1 score reaching 69.1%. Experiments showed that using extracted quotes instead of full context significantly boosted accuracy across models, with LLAMA 1B achieving 62.2% accuracy using quotes compared to 24.4% with full context. This “divide and conquer” strategy streamlined reasoning by reducing the cognitive load on larger models, enabling even non-optimized models to perform efficiently.

Future research could expand the methodology by testing diverse datasets, incorporating reinforcement learning techniques like Proximal Policy Optimization (PPO), and fine-tuning larger models to explore scalability. Advancing prompt engineering could further improve both quote extraction and reasoning processes. Additionally, the approach holds potential for broader applications, such as memory-augmented RAG systems, where lightweight mechanisms retrieve relevant information from large external knowledge bases. These efforts aim to refine the quote-based RAG pipeline, making high-performing NLP systems more scalable and resource-efficient.

Check out the Paper. All credit for this research goes to the researchers of this project.


How Kyndryl integrated ServiceNow and Amazon Q Business

This post is co-written with Sujith R Pillai from Kyndryl.
In this post, we show you how Kyndryl, an AWS Premier Tier Services Partner and IT infrastructure services provider that designs, builds, manages, and modernizes complex, mission-critical information systems, integrated Amazon Q Business with ServiceNow in a few simple steps. You will learn how to configure Amazon Q Business and ServiceNow, how to create a generative AI plugin for your ServiceNow incidents, and how to test and interact with ServiceNow using the Amazon Q Business web experience. By the end of this post, you will be able to enhance your ServiceNow experience with Amazon Q Business and enjoy the benefits of a generative AI–powered interface.
Solution overview
Amazon Q Business has three main components: a front-end chat interface, a data source connector and retriever, and a ServiceNow plugin. Amazon Q Business uses AWS Secrets Manager secrets to store the ServiceNow credentials securely. The following diagram shows the architecture for the solution.

Chat
Users interact with ServiceNow through the generative AI–powered chat interface using natural language.
Data source connector and retriever
A data source connector is a mechanism for integrating and synchronizing data from multiple repositories into one container index. Amazon Q Business has two types of retrievers: native retrievers and existing retrievers using Amazon Kendra. The native retrievers support a wide range of Amazon Q Business connectors, including ServiceNow. The existing retriever option is for those who already have an Amazon Kendra retriever and would like to use that for their Amazon Q Business application. For the ServiceNow integration, we use the native retriever.
ServiceNow plugin
Amazon Q Business provides a plugin feature for performing actions such as creating incidents in ServiceNow.
The following high-level steps show how to configure the Amazon Q Business – ServiceNow integration:

Create a user in ServiceNow for Amazon Q Business to communicate with ServiceNow
Create knowledge base articles in ServiceNow if they do not exist already
Create an Amazon Q Business application and configure the ServiceNow data source and retriever in Amazon Q Business
Synchronize the data source
Create a ServiceNow plugin in Amazon Q Business

Prerequisites
To run this application, you must have an Amazon Web Services (AWS) account, an AWS Identity and Access Management (IAM) role, and a user that can create and manage the required resources. If you are not an AWS account holder, see How do I create and activate a new Amazon Web Services account?
You need an AWS IAM Identity Center set up in the AWS Organizations organizational unit (OU) or AWS account in which you are building the Amazon Q Business application. You should have a user or group created in IAM Identity Center. You will assign this user or group to the Amazon Q Business application during the application creation process. For guidance, refer to Manage identities in IAM Identity Center.
You also need a ServiceNow user with incident_manager and knowledge_admin permissions to create and view knowledge base articles and to create incidents. We use a developer instance of ServiceNow for this post as an example. You can find out how to get the developer instance in Personal Developer Instances.
Solution walkthrough
To integrate ServiceNow and Amazon Q Business, use the steps in the following sections.
Create a knowledge base article
Follow these steps to create a knowledge base article:

Sign in to ServiceNow and navigate to Self-Service > Knowledge
Choose Create an Article
On the Create new article page, select a knowledge base and choose a category. Optionally, you may create a new category.
Provide a Short description and type in the Article body
Choose Submit to create the article, as shown in the following screenshot

Repeat these steps to create a couple of knowledge base articles. In this example, we created a hypothetical enterprise named Example Corp for demonstration purposes.

Create an Amazon Q Business application
Amazon Q offers three subscription plans: Amazon Q Business Lite, Amazon Q Business Pro, and Amazon Q Developer Pro. Read the Amazon Q Documentation for more details. For this example, we used Amazon Q Business Lite.
Create application
Follow these steps to create an application:

In the Amazon Q Business console, choose Get started, then choose Create application to create a new Amazon Q Business application, as shown in the following screenshot

Name your application in Application name. In Service access, select Create and use a new service-linked role (SLR). For more information about example service roles, see IAM roles for Amazon Q Business. For information on service-linked roles, including how to manage them, see Using service-linked roles for Amazon Q Business. We named our application ServiceNow-Helpdesk. Next, select Create, as shown in the following screenshot.

Choose a retriever and index provisioning
To choose a retriever and index provisioning, follow these steps in the Select retriever screen, as shown in the following screenshot:

For Retrievers, select Use native retriever
For Index provisioning, choose Starter
Choose Next

Connect data sources
Amazon Q Business has ready-made connectors for common data sources and business systems.

Enter “ServiceNow” to search and select ServiceNow Online as the data source, as shown in the following screenshot

Enter the URL and the version of your ServiceNow instance. We used the ServiceNow version Vancouver for this post.

Scroll down the page to provide additional details about the data source. Under Authentication, select Basic authentication. Under AWS Secrets Manager secret, select Create and add a new secret from the dropdown menu as shown in the screenshot.

Provide the Username and Password you created in ServiceNow to create an AWS Secrets Manager secret. Choose Save.

Under Configure VPC and security group, keep the setting as No VPC because you will be connecting to ServiceNow over the internet. You may choose to create a new service role under IAM role. This will create a role specifically for this application.

In the example, we synchronize the ServiceNow knowledge base articles and incidents. Provide the information as shown in the following image. Notice that for Filter query, the example uses the following code.

workflow_state=published^kb_knowledge_base=dfc19531bf2021003f07e2c1ac0739ab^article_type=text^active=true^EQ

This filter query aims to sync the articles that meet the following criteria:

workflow_state = published
kb_knowledge_base = dfc19531bf2021003f07e2c1ac0739ab (This is the default Sys ID for the knowledge base named “Knowledge” in ServiceNow).
article_type = text (This field filters on the article type so that text-based articles are synced).
active = true (This field filters the articles to sync only the ones that are active).

The filter fields are separated by ^, and the end of the query is represented by EQ. You can find more details about the Filter query and other parameters in Connecting Amazon Q Business to ServiceNow Online using the console.
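If you want to preview which articles such a filter will match before syncing, one option is to pass it as sysparm_query to the standard ServiceNow Table API. The instance URL, credentials, and knowledge base Sys ID below are placeholders:

import requests

# Placeholders: replace the instance URL, credentials, and knowledge base Sys ID with your own.
instance = "https://your-instance.service-now.com"
query = ("workflow_state=published"
         "^kb_knowledge_base=dfc19531bf2021003f07e2c1ac0739ab"
         "^article_type=text^active=true")

response = requests.get(
    f"{instance}/api/now/table/kb_knowledge",
    params={"sysparm_query": query, "sysparm_fields": "number,short_description", "sysparm_limit": 10},
    auth=("amazonq_user", "your_password"),
    headers={"Accept": "application/json"},
)
for record in response.json().get("result", []):
    print(record["number"], record["short_description"])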

Provide the Sync scope for the Incidents, as shown in the following screenshot

You may select Full sync initially so that a complete synchronization is performed. You need to select the frequency of the synchronization as well. For this post, we chose Run on demand. If you need to keep the knowledge base and incident data more up-to-date with the ServiceNow instance, choose a shorter window.

A field mapping will be provided for you to validate. You won’t be able to change the field mapping at this stage. Choose Add data source to proceed.

This completes the data source configuration for Amazon Q Business. The configuration takes a few minutes to complete. Watch the screen for any errors and updates. Once the data source is created, you will be greeted with the message “You successfully created the following data source: ‘ServiceNow-Datasource’”.
Add users and groups
Follow these steps to add users and groups:

Choose Next
In the Add groups and users page, click Add groups and users. You will be presented with the option of Add and assign new users or Assign existing users and groups. Select Assign existing users and groups. Choose Next, as shown in the following image.

Search for an existing user or group in your IAM Identity Center, select one, and choose Assign. After selecting the right user or group, choose Done.

This completes the activity of assigning the user and group access to the Amazon Q Business application.

Create a web experience
Follow these steps to create a web experience in the Add groups and users screen, as shown in the following screenshot.

Choose Create and use a new service role in the Web experience service access section
Choose Create application

The deployed application with the application status will be shown in the Amazon Q Business > Applications console as shown in the following screenshot.

Synchronize the data source
Once the data source is configured successfully, it’s time to start the synchronization. To begin this process, the ServiceNow fields that require synchronization must be updated. Because we intend to get answers from the knowledge base content, the text field needs to be synchronized. To do so, follow these steps:

In the Amazon Q Business console, select Applications in the navigation pane
Select ServiceNow-Helpdesk and then ServiceNow-Datasource
Choose Actions. From the dropdown, choose Edit, as shown in the following screenshot.

Scroll down to the bottom of the page to the Field mappings section. Select text and description.

Choose Update. After the update, choose Sync now.

The synchronization takes a few minutes to complete, depending on the amount of data to be synchronized. Make sure that the Status is Completed, as shown in the following screenshot, before proceeding further. If you notice any error, you can choose the error hyperlink, which will take you to Amazon CloudWatch Logs, where you can examine the logs for further troubleshooting.

Create ServiceNow plugin
A ServiceNow plugin in Amazon Q Business helps you create incidents in ServiceNow through Amazon Q Business chat. To create one, follow these steps:

In the Amazon Q Business console, select Enhancements from the navigation pane
Under Plugins, choose Add plugin, as shown in the following screenshot

On the Add Plugin page, shown in the following screenshot, select the ServiceNow plugin

Provide a Name for the plugin
Enter the ServiceNow URL and use the previously created AWS Secrets Manager secret for the Authentication
Select Create and use a new service role
Choose Add plugin

The status of the plugin will be shown in the Plugins list. If the Plugin status is Active, the plugin is configured and ready to use.

Use the Amazon Q Business chat interface
To use the Amazon Q Business chat interface, follow these steps:

In the Amazon Q Business console, choose Applications from the navigation pane. The web experience URL will be provided for each Amazon Q Business application.

Choose the Web experience URL to open the chat interface. Enter the IAM Identity Center username and password that were assigned to this application. The following screenshot shows the Sign in page.

You can now ask questions and receive responses, as shown in the following image. The answers will be specific to your organization and are retrieved from the knowledge base in ServiceNow.

You can ask the chat interface to create incidents as shown in the next screenshot.
A new pop-up window will appear, providing additional information related to the incident. In this window, you can provide more information related to the ticket and choose Create.

This will create a ServiceNow incident using the web experience of Amazon Q Business without signing in to ServiceNow. You may verify the ticket in the ServiceNow console as shown in the next screenshot.

Conclusion
In this post, we showed how Kyndryl is using Amazon Q Business to enable natural language conversations with ServiceNow using the ServiceNow connector provided by Amazon Q Business. We also showed how to create a ServiceNow plugin that allows users to create incidents in ServiceNow directly from the Amazon Q Business chat interface. We hope that this tutorial will help you take advantage of the power of Amazon Q Business for your ServiceNow needs.

About the authors
Asif Fouzi is a Principal Solutions Architect leading a team of seasoned technologists supporting Global Service Integrators (GSI) such as Kyndryl in their cloud journey. When he is not innovating on behalf of users, he likes to play guitar, travel, and spend time with his family.
Sujith R Pillai is a cloud solution architect in the Cloud Center of Excellence at Kyndryl with extensive experience in infrastructure architecture and implementation across various industries. With his strong background in cloud solutions, he has led multiple technology transformation projects for Kyndryl customers.

What is Deep Learning?

The growth of data in the digital age presents both opportunities and challenges. An immense volume of text, images, audio, and video is generated daily across platforms. Traditional machine learning models, while effective in many scenarios, often struggle to process high-dimensional and unstructured data without extensive preprocessing and feature engineering. This approach is not only time-consuming but can also miss subtle patterns in the data. These limitations are particularly significant in fields like medical imaging, autonomous driving, and natural language processing, where understanding complex patterns is essential. This gap has led to the evolution of deep learning models, designed to learn directly from raw data.

What is Deep Learning?

Deep learning, a subset of machine learning, is inspired by the structure and functioning of the human brain. It employs artificial neural networks with multiple layers—hence the term “deep”—to model intricate patterns in data. Unlike traditional machine learning, which relies heavily on manual feature extraction, deep learning models learn hierarchical representations on their own. Each layer in a neural network extracts progressively abstract features from the data, enabling these models to understand and process complex patterns. As noted by IBM, deep learning excels in handling unstructured data, making it valuable for tasks like image recognition, speech synthesis, and language translation.

Technical Details and Benefits

Deep learning relies on artificial neural networks composed of layers of interconnected nodes. Notable architectures include:

Convolutional Neural Networks (CNNs): Designed for image and video data, CNNs detect spatial patterns through convolutional operations (see the minimal example after this list).

Recurrent Neural Networks (RNNs): Well-suited for sequential data like time series and text, RNNs retain context through loops.

Transformers: Widely used in natural language processing, transformers leverage self-attention mechanisms to capture contextual relationships within text.
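To make the CNN item above concrete, here is a minimal PyTorch sketch of a small image classifier. It is an illustrative example with arbitrary hyperparameters, not drawn from any specific system mentioned in this article:

import torch
import torch.nn as nn

# A minimal CNN image classifier (illustrative only; hyperparameters are arbitrary).
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 RGB input

    def forward(self, x):
        x = self.features(x)                  # each layer extracts progressively more abstract features
        return self.classifier(x.flatten(1))

model = SmallCNN()
logits = model(torch.randn(4, 3, 32, 32))     # a batch of four 32x32 RGB images
print(logits.shape)                           # torch.Size([4, 10])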

These models are fueled by large datasets and advanced hardware, such as GPUs and TPUs. NVIDIA highlights how GPUs enable deep learning by accelerating computations through parallel processing. Key benefits of deep learning include:

Automatic Feature Extraction: Minimizes the need for manual data preprocessing.

High Accuracy: Delivers superior performance in many tasks.

Scalability: Effectively utilizes large-scale datasets.

Versatility: Adapts to a wide range of applications, from healthcare to finance.

Various Deep Learning Frameworks

PyTorch

TensorFlow

JAX

PaddlePaddle

MATLAB

Theano

Keras

Deeplearning4j (DL4J)

Results, Applications, and Examples

Deep learning has had a transformative impact across many fields by extracting valuable insights from complex data. Prominent applications include:

Healthcare: AI models analyze medical images to detect diseases like cancer early. Deep learning algorithms can identify tumors with high precision, reducing false positives and improving diagnostic accuracy.

Autonomous Vehicles: CNNs enable self-driving cars to interpret road conditions, detect obstacles, and make real-time decisions.

Natural Language Processing: Models such as OpenAI’s GPT and Google’s BERT have advanced applications like chatbots, sentiment analysis, and machine translation.

Finance: Fraud detection systems leverage deep learning to identify irregularities in transaction data.

As AWS reports, businesses that incorporate deep learning often experience enhanced efficiency. For instance, Netflix uses deep learning to power its recommendation system, improving user satisfaction and retention.

Conclusion

Deep learning is changing the way machines learn and make decisions. By mimicking the brain’s approach to processing information, deep learning models have significantly impacted various industries. However, challenges like computational costs and data privacy concerns persist, emphasizing the need for continued research and innovation.


Revolutionizing Vision-Language Tasks with Sparse Attention Vectors: A …

Generative Large Multimodal Models (LMMs), such as LLaVA and Qwen-VL, excel in vision-language (VL) tasks like image captioning and visual question answering (VQA). However, these models face challenges when applied to foundational discriminative VL tasks, such as image classification or multiple-choice VQA, which require discrete label predictions. The primary obstacle is the difficulty in extracting useful features from generative models for discriminative tasks.

Current methods for adapting LMMs to discriminative tasks often rely on prompt engineering, finetuning, or specialized architectures. Despite their promise, these approaches are limited by their dependency on large-scale training data, modality-specific features, or a lack of flexibility. To tackle this problem, a team of researchers from Carnegie Mellon University, University of California, Berkeley, IBM Research, and MIT-IBM Watson AI Lab proposes a novel solution: Sparse Attention Vectors (SAVs). SAVs are a finetuning-free method that leverages sparse attention head activations in LMMs as features for discriminative VL tasks. Inspired by neuroscience’s concept of functional specificity (how different parts of the brain are specialized for different functions) and recent work on transformer interpretability, this method uses fewer than 1% of the attention heads to extract discriminative features effectively. SAVs achieve state-of-the-art performance with only a handful of few-shot examples and demonstrate robustness across diverse tasks.

Diving deeper into how the method works, the following steps were followed to identify and utilize Sparse Attention Vectors as shown in Figure 2:

Extracting Attention Vectors: For a frozen LMM and a few-shot labeled dataset (e.g., 20 examples per label), attention vectors are extracted from each attention head in every layer. Specifically, for the final token of each input sequence, the attention vector is computed as an output of the dot-product attention mechanism.

Identifying Relevant Vectors: The discriminative capability of each attention vector is evaluated using a nearest class centroid classifier. For each class, the mean (centroid) attention vector is computed across the few-shot examples. Cosine similarity is calculated between each input’s attention vector and the class centroids, and attention heads are ranked based on their classification accuracy. The top-performing heads (e.g., the top 20) are selected as the sparse attention vector set (HSAV).

Classification Using SAVs: Given a query input, predictions are made using the selected sparse attention heads. For each head in HSAV, the query’s similarity to class centroids is computed, and the final class label is determined by a majority vote across all heads. This approach allows the use of less than 1% of the total attention heads, making the method lightweight and efficient.
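
As an illustration of steps 2 and 3, the sketch below shows nearest-class-centroid classification with cosine similarity and a majority vote over a set of selected attention heads. The tensor shapes and helper names are assumptions for illustration, not the authors' code.

import torch
import torch.nn.functional as F

# Assumed shapes: feats_h is an (N, D) tensor of attention-head activations for one head,
# labels is an (N,) tensor of class ids, and sav_heads is the list of selected head indices.
def class_centroids(feats_h, labels, num_classes):
    # mean (centroid) activation per class for one attention head
    return torch.stack([feats_h[labels == c].mean(dim=0) for c in range(num_classes)])

def predict_with_savs(query_feats, centroids_per_head, sav_heads):
    votes = []
    for h in sav_heads:
        # cosine similarity between the query's activation on head h and each class centroid
        sims = F.cosine_similarity(query_feats[h].unsqueeze(0), centroids_per_head[h])
        votes.append(int(sims.argmax()))      # each head votes for its nearest class centroid
    return max(set(votes), key=votes.count)   # majority vote across the sparse heads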

For evaluation, SAVs were tested on two state-of-the-art LMMs—LLaVA-OneVision and Qwen2-VL—and compared against multiple baselines, including zero-shot (ZS) methods, few-shot methods, and finetuning approaches like LoRA. Evaluations spanned a wide range of discriminative VL tasks. SAVs outperformed baselines in detecting hallucinations (e.g., distinguishing “hallucinating” from “not hallucinating”) and harmful content in datasets like LMM-Hallucination and VLGuard. SAVs demonstrated superior performance on challenging datasets like BLINK and NaturalBench, which require visual and compositional reasoning. SAVs were also highly effective on datasets like EuroSAT (satellite image classification) and Oxford-IIIT-Pets (fine-grained classification of pet breeds). Results showed that SAVs consistently outperformed zero-shot and few-shot baselines, closing the gap with discriminative vision-language models (VLMs) like CLIP and SigLIP. For instance, SAVs achieved higher accuracy on safety tasks and demonstrated robust performance across VQA and image classification benchmarks.

Moreover, SAVs require only a few labeled examples per class, making them practical for tasks with limited training data. The method identifies specific attention heads that contribute to classification, offering insights into the model’s inner workings. SAVs are adaptable to a variety of discriminative tasks, including those involving interleaved image-text inputs. By leveraging a sparse subset of attention heads, the method is computationally lightweight and scalable.

While SAVs provide a significant advancement, they rely on accessing the internal architecture of the LMM. This dependency limits their applicability to open-source models and poses challenges for broader usage. Additionally, tasks like image-text retrieval might benefit from finer-grained confidence metrics than the current majority voting mechanism. Future research could explore enhancing SAVs for tasks like multimodal retrieval, data compression, and feature embedding for downstream applications.

Check out the Paper and GitHub Page.

MiniMax-Text-01 and MiniMax-VL-01 Released: Scalable Models with Light …

Large Language Models (LLMs) and Vision-Language Models (VLMs) transform natural language understanding, multimodal integration, and complex reasoning tasks. Yet, one critical limitation remains: current models cannot efficiently handle extremely large contexts. This challenge has prompted researchers to explore new methods and architectures to improve these models’ scalability, efficiency, and performance.

Existing models typically support token context lengths between 32,000 and 256,000, which limits their ability to handle scenarios requiring larger context windows, such as extended programming instructions or multi-step reasoning tasks. Increasing context sizes is computationally expensive due to the quadratic complexity of traditional softmax attention mechanisms. Researchers have explored alternative attention methods, such as sparse attention, linear attention, and state-space models, to address these challenges, but large-scale implementation remains limited.

Sparse attention focuses on relevant inputs to reduce computational overhead, while linear attention simplifies the attention matrix for scalability. However, adoption has been slow due to compatibility issues with existing architectures and suboptimal real-world performance. For example, state-space models effectively process long sequences but often lack the robustness and accuracy of transformer-based systems in complex tasks.

Researchers from MiniMax have introduced the MiniMax-01 series, including two variants to address these limitations:

MiniMax-Text-01: MiniMax-Text-01 comprises 456 billion total parameters, with 45.9 billion activated per token. It leverages a hybrid attention mechanism for efficient long-context processing. Its context window extends to 1 million tokens during training and 4 million tokens during inference.

MiniMax-VL-01: MiniMax-VL-01 integrates a lightweight Vision Transformer (ViT) module and processes 512 billion vision-language tokens through a four-stage training pipeline.

The models employ a novel lightning attention mechanism, reducing the computational complexity of processing long sequences. Also, integrating a Mixture of Experts (MoE) architecture enhances scalability and efficiency. The MiniMax models feature 456 billion parameters, of which 45.9 billion are activated for each token. This combination allows the models to process context windows of up to 1 million tokens during training and extrapolate to 4 million tokens during inference. By leveraging advanced computational strategies, the MiniMax-01 series offers unprecedented capabilities in long-context processing while maintaining performance on par with state-of-the-art models such as GPT-4 and Claude-3.5.

The lightning attention mechanism achieves linear computational complexity, enabling the model to scale effectively. The hybrid attention architecture alternates between lightning and softmax attention layers, ensuring a balance between computational efficiency and retrieval capabilities. The models also incorporate an enhanced Linear Attention Sequence Parallelism (LASP+) algorithm, efficiently handling extensive sequences. Also, the vision-language model MiniMax-VL-01 integrates a lightweight vision transformer module, enabling it to process 512 billion vision-language tokens through a four-stage training process. These innovations are complemented by optimized CUDA kernels and parallelization strategies, achieving over 75% Model Flops Utilization on Nvidia H20 GPUs.
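
MiniMax's exact lightning attention kernel is not reproduced here; the following is a generic, non-causal linear-attention sketch that illustrates why such mechanisms scale linearly in sequence length: a (d x d) key-value summary is built once instead of an (n x n) attention matrix. All names and the feature map are illustrative assumptions.

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, dim). A simple positive feature map stands in for the
    # kernelized attention used by linear-attention variants.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)               # (d x d) summary, no (n x n) matrix
    normalizer = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps
    return torch.einsum("bnd,bde->bne", q, kv) / normalizer.unsqueeze(-1)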

Performance evaluations reveal that the MiniMax models achieve groundbreaking results across various benchmarks: 

For instance, MiniMax-Text-01 is 88.5% accurate on MMLU and performs competitively against models like GPT-4. 

The vision-language model MiniMax-VL-01 surpasses many peers, with a 96.4% accuracy rate on DocVQA and 91.7% on AI2D benchmarks. 

These models also offer a 20–32 times longer context window than traditional counterparts, significantly enhancing their utility for long-context applications.

In conclusion, the MiniMax-01 series, comprising MiniMax-Text-01 and MiniMax-VL-01, represents a breakthrough in addressing scalability and long-context challenges. It combines innovative techniques like lightning attention with a hybrid architecture. By leveraging advanced computational frameworks and optimization strategies, researchers have introduced a solution that extends context capabilities to an unprecedented 4 million tokens and matches or surpasses the performance of leading models like GPT-4.

Check out the Paper and Models on Hugging Face.

HCLTech’s AWS powered AutoWise Companion: A seamless experience for …

This post introduces HCLTech’s AutoWise Companion, a transformative generative AI solution designed to enhance customers’ vehicle purchasing journey. By tailoring recommendations based on individuals’ preferences, the solution guides customers toward the best vehicle model for them. Simultaneously, it empowers vehicle manufacturers (original equipment manufacturers (OEMs)) by using real customer feedback to drive strategic decisions, boosting sales and company profits. Powered by generative AI services on AWS and large language models’ (LLMs’) multi-modal capabilities, HCLTech’s AutoWise Companion provides a seamless and impactful experience.
In this post, we analyze the current industry challenges and guide readers through the AutoWise Companion solution functional flow and architecture design using built-in AWS services and open source tools. Additionally, we discuss the design from security and responsible AI perspectives, demonstrating how you can apply this solution to a wider range of industry scenarios.
Opportunities
Purchasing a vehicle is a crucial decision that can induce stress and uncertainty for customers. The following are some of the real-life challenges customers and manufacturers face:

Choosing the right brand and model – Even after narrowing down the brand, customers must navigate through a multitude of vehicle models and variants. Each model has different features, price points, and performance metrics, making it difficult to make a confident choice that fits their needs and budget.
Analyzing customer feedback – OEMs face the daunting task of sifting through extensive quality reporting tool (QRT) reports. These reports contain vast amounts of data, which can be overwhelming and time-consuming to analyze.
Aligning with customer sentiments – OEMs must align their findings from QRT reports with the actual sentiments of customers. Understanding customer satisfaction and areas needing improvement from raw data is complex and often requires advanced analytical tools.

HCLTech’s AutoWise Companion solution addresses these pain points, benefiting both customers and manufacturers by simplifying the decision-making process for customers and enhancing data analysis and customer sentiment alignment for manufacturers.
The solution extracts valuable insights from diverse data sources, including OEM transactions, vehicle specifications, social media reviews, and OEM QRT reports. By employing a multi-modal approach, the solution connects relevant data elements across various databases. Based on the customer query and context, the system dynamically generates text-to-SQL queries, summarizes knowledge base results using semantic search, and creates personalized vehicle brochures based on the customer’s preferences. This seamless process is facilitated by Retrieval Augmented Generation (RAG) and a text-to-SQL framework.
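As a simplified illustration of the text-to-SQL portion of this flow, the sketch below prompts an LLM with a table schema and the customer question, runs the generated SQL against a relational store, and asks the LLM to summarize the result. The llm() helper, the schema, and the SQLite stand-in are hypothetical placeholders, not the actual AutoWise implementation.

import sqlite3  # stand-in for the solution's relational store (Amazon RDS for PostgreSQL)

SCHEMA = "vehicles(model TEXT, variant TEXT, price REAL, overall_rating REAL)"

def text_to_sql(question: str, llm) -> str:
    prompt = (
        f"Given the table {SCHEMA}, write a single SQL query that answers:\n"
        f"{question}\nReturn only SQL."
    )
    return llm(prompt)  # hypothetical LLM call (e.g., a model hosted on Amazon Bedrock)

def answer(question: str, llm, conn: sqlite3.Connection) -> str:
    sql = text_to_sql(question, llm)
    rows = conn.execute(sql).fetchall()       # execute the generated query
    summary_prompt = (
        f"Question: {question}\nSQL result: {rows}\nSummarize this for the customer."
    )
    return llm(summary_prompt)                # LLM turns raw rows into a readable answer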
Solution overview
The overall solution is divided into functional modules for both customers and OEMs.
Customer assist
Every customer has unique preferences, even when considering the same vehicle brand and model. The solution is designed to provide customers with a detailed, personalized explanation of their preferred features, empowering them to make informed decisions. The solution presents the following capabilities:

Natural language queries – Customers can ask questions in plain language about vehicle features, such as overall ratings, pricing, and more. The system is equipped to understand and respond to these inquiries effectively.
Tailored interaction – The solution allows customers to select specific features from an available list, enabling a deeper exploration of their preferred options. This helps customers gain a comprehensive understanding of the features that best suit their needs.
Personalized brochure generation – The solution considers the customer’s feature preferences and generates a customized feature explanation brochure (with specific feature images). This personalized document helps the customer gain a deeper understanding of the vehicle and supports their decision-making process.

OEM assist
OEMs in the automotive industry must proactively address customer complaints and feedback regarding various automobile parts. This comprehensive solution enables OEM managers to analyze and summarize customer complaints and reported quality issues across different categories, thereby empowering them to formulate data-driven strategies efficiently. This enhances decision-making and competitiveness in the dynamic automotive industry. The solution enables the following:

Insight summaries – The system presents OEMs with insightful summaries generated by integrating and aggregating data from various sources, such as QRT reports, vehicle transaction sales data, and social media reviews.
Detailed view – OEMs can seamlessly access specific details about issues, reports, complaints, or data points in natural language, with the system providing the relevant information from the referenced reviews data, transaction data, or unstructured QRT reports.

To better understand the solution, we use the seven steps shown in the following figure to explain the overall function flow.

The overall function flow consists of the following steps:

The user (customer or OEM manager) interacts with the system through a natural language interface to ask various questions.
The system’s natural language interpreter, powered by a generative AI engine, analyzes the query’s context, intent, and relevant persona to identify the appropriate data sources.
Based on the identified data sources, the respective multi-source query execution plan is generated by the generative AI engine.
The query agent parses the execution plan and sends queries to the respective query executors.
Requested information is intelligently fetched from multiple sources such as company product metadata, sales transactions, OEM reports, and more to generate meaningful responses.
The system seamlessly combines the collected information from the various sources, applying contextual understanding and domain-specific knowledge to generate a well-crafted, comprehensive, and relevant response for the user.
The system generates the response for the original query and empowers the user to continue the interaction, either by asking follow-up questions within the same context or exploring new areas of interest, all while benefiting from the system’s ability to maintain contextual awareness and provide consistently relevant and informative responses.

Technical architecture
The overall solution is implemented using AWS services and LangChain. Multiple LangChain functions, such as CharacterTextSplitter and embedding vectors, are used for text handling and embedding model invocations. In the application layer, the GUI for the solution is created using Streamlit in Python. The app container is deployed in a cost-optimal, microservices-based AWS architecture using Amazon Elastic Container Service (Amazon ECS) clusters and AWS Fargate.
The solution contains the following processing layers:

Data pipeline – The various data sources, such as sales transactional data, unstructured QRT reports, social media reviews in JSON format, and vehicle metadata, are processed, transformed, and stored in the respective databases.
Vector embedding and data cataloging – To support natural language query similarity matching, the respective data is vectorized and stored as vector embeddings. Additionally, to enable the natural language to SQL (text-to-SQL) feature, the corresponding data catalog is generated for the transactional data.
LLM (request and response formation) – The system invokes LLMs at various stages to understand the request, formulate the context, and generate the response based on the query and context.
Frontend application – Customers or OEMs interact with the solution using an assistant application designed to enable natural language interaction with the system.

The solution uses the following AWS data stores and analytics services:

Unstructured data – Amazon Simple Storage Service (Amazon S3) buckets are used to store the JSON-based social media feedback data, quality report PDFs (specific to OEMs), and the vehicle and its features images.
Transactional sales data – Amazon Relational Database Service (Amazon RDS) for PostgreSQL is used to hold transactional reports of vehicles on a quarterly or monthly basis.
Vehicle specification data – Amazon DynamoDB is used to store the vehicle metadata (its features and specifications).
Amazon Athena – Amazon Athena is used to query the JSON-based social media feedback data stored in an S3 bucket.
AWS Glue – AWS Glue is used for data cataloging.

The following figure depicts the technical flow of the solution.

The workflow consists of the following steps:

The user’s query, expressed in natural language, is processed by an orchestrator AWS Lambda function.
The Lambda function tries to find the query match from the LLM cache. If a match is found, the response is returned from the LLM cache. If no match is found, the function invokes the respective LLMs through Amazon Bedrock. This solution uses LLMs (Anthropic’s Claude 2 and Claude 3 Haiku) on Amazon Bedrock for response generation. The Amazon Titan Embeddings G1 – Text LLM is used to convert the knowledge documents and user queries into vector embeddings.
Based on the context of the query and the available catalog, the LLM identifies the relevant data sources:

The transactional sales data, social media reviews, vehicle metadata, and more, are transformed and used for customers and OEM interactions.
The data in this step is restricted and is only accessible to OEM personas to help diagnose quality-related issues and provide insights on the QRT reports. This solution uses Amazon Textract as a data extraction tool to extract text from PDFs (such as quality reports).

The LLM generates queries (text-to-SQL) to fetch data from the respective data channels according to the identified sources.
The responses from each data channel are assembled to generate the overall context.
Additionally, to generate a personalized brochure, relevant images (described as text-based embeddings) are fetched based on the query context. Amazon OpenSearch Serverless is used as a vector database to store the embeddings of text chunks extracted from quality report PDFs and image descriptions.
The overall context is then passed to a response generator LLM to generate the final response to the user. The cache is also updated.
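Step 2 of this workflow checks an LLM cache before invoking a model on Amazon Bedrock. The following is a minimal sketch of that cache-then-invoke pattern, using an exact-match cache keyed on a hash of the normalized query; the real solution may use a persistent store and embedding-based (semantic) matching, and the model ID shown is only an example.

import hashlib
import boto3

bedrock = boto3.client("bedrock-runtime")
llm_cache = {}  # in practice, a persistent cache rather than a process-local dict

def cached_llm_response(query: str, model_id: str = "anthropic.claude-3-haiku-20240307-v1:0"):
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key in llm_cache:                      # cache hit: skip the model invocation
        return llm_cache[key]
    response = bedrock.converse(              # cache miss: invoke the model on Amazon Bedrock
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": query}]}],
    )
    answer = response["output"]["message"]["content"][0]["text"]
    llm_cache[key] = answer                   # update the cache for future requests
    return answer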

Responsible generative AI and security considerations
Customers implementing generative AI projects with LLMs are increasingly prioritizing security and responsible AI practices. This focus stems from the need to protect sensitive data, maintain model integrity, and enforce ethical use of AI technologies. The AutoWise Companion solution uses AWS services to enable customers to focus on innovation while maintaining the highest standards of data protection and ethical AI use.
Amazon Bedrock Guardrails
Amazon Bedrock Guardrails provides configurable safeguards that can be applied to user input and foundation model output as safety and privacy controls. By incorporating guardrails, the solution proactively steers users away from potential risks or errors, promoting better outcomes and adherence to established standards. In the automobile industry, OEM vendors usually apply safety filters for vehicle specifications. For example, they want to validate the input to make sure that the queries are about legitimate existing models. Amazon Bedrock Guardrails provides denied topics and contextual grounding checks to make sure the queries about non-existent automobile models are identified and denied with a custom response.
Security considerations
The system employs a RAG framework that relies on customer data, making data security the foremost priority. By design, Amazon Bedrock provides a layer of data security by making sure that customer data stays encrypted and protected and is neither used to train the underlying LLM nor shared with the model providers. Amazon Bedrock is in scope for common compliance standards, including ISO, SOC, and CSA STAR Level 2; it is HIPAA eligible, and customers can use it in compliance with the GDPR.
For raw document storage on Amazon S3, transactional data storage, and retrieval, these data sources are encrypted, and respective access control mechanisms are put in place to maintain restricted data access.
Key learnings
The solution offered the following key learnings:

LLM cost optimization – In the initial stages of the solution, based on the user query, multiple independent LLM calls were required, which led to increased costs and execution time. By using the AWS Glue Data Catalog, we have improved the solution to use a single LLM call to find the best source of relevant information.
LLM caching – We observed that a significant percentage of queries received were repetitive. To optimize performance and cost, we implemented a caching mechanism that stores the request-response data from previous LLM model invocations. This cache lookup allows us to retrieve responses from the cached data, thereby reducing the number of calls made to the underlying LLM. This caching approach helped minimize cost and improve response times.
Image to text – Generating personalized brochures based on customer preferences was challenging. However, the latest vision-capable multimodal LLMs, such as Anthropic’s Claude 3 models (Haiku and Sonnet), have significantly improved accuracy.

Industrial adoption
The aim of this solution is to help customers make an informed decision while purchasing vehicles and to empower OEM managers to analyze factors contributing to sales fluctuations and formulate corresponding targeted sales-boosting strategies, all based on data-driven insights. The solution can also be adopted in other sectors, as shown in the following examples.

Retail and ecommerce – By closely monitoring customer reviews, comments, and sentiments expressed on social media channels, the solution can assist customers in making informed decisions when purchasing electronic devices.

Hospitality and tourism – The solution can assist hotels, restaurants, and travel companies to understand customer sentiments, feedback, and preferences and offer personalized services.

Entertainment and media – It can assist television networks, movie studios, and music companies to analyze and gauge audience reactions and plan content strategies for the future.

Conclusion
The solution discussed in this post demonstrates the power of generative AI on AWS by empowering customers to use natural language conversations to obtain personalized, data-driven insights to make informed decisions during the purchase of their vehicle. It also supports OEMs in enhancing customer satisfaction, improving features, and driving sales growth in a competitive market.
Although the focus of this post has been on the automotive domain, the presented approach holds potential for adoption in other industries to provide a more streamlined and fulfilling purchasing experience.
Overall, the solution demonstrates the power of generative AI to provide accurate information based on various structured and unstructured data sources governed by guardrails to help avoid unauthorized conversations. For more information, see the HCLTech GenAI Automotive Companion in AWS Marketplace.

About the Authors
Bhajan Deep Singh leads the AWS Gen AI/AIML Center of Excellence at HCL Technologies. He plays an instrumental role in developing proof-of-concept projects and use cases utilizing AWS’s generative AI offerings. He has successfully led numerous client engagements to deliver data analytics and AI/machine learning solutions. He holds AWS AI/ML Specialty and AI Practitioner certifications and authors technical blogs on AI/ML services and solutions. With his expertise and leadership, he enables clients to maximize the value of AWS generative AI.
Mihir Bhambri works as an AWS Senior Solutions Architect at HCL Technologies. He specializes in tailored generative AI solutions, driving industry-wide innovation in sectors such as Financial Services, Life Sciences, Manufacturing, and Automotive. He leverages AWS cloud services and diverse large language models (LLMs) to develop multiple proofs of concept that support business improvements. He also holds the AWS Solutions Architect certification and has contributed to the research community by co-authoring papers and winning multiple AWS generative AI hackathons.
Yajuvender Singh is an AWS Senior Solution Architect at HCLTech, specializing in AWS Cloud and Generative AI technologies. As an AWS-certified professional, he has delivered innovative solutions across insurance, automotive, life science and manufacturing industries and also won multiple AWS GenAI hackathons in India and London. His expertise in developing robust cloud architectures and GenAI solutions, combined with his contributions to the AWS technical community through co-authored blogs, showcases his technical leadership.
Sara van de Moosdijk, simply known as Moose, is an AI/ML Specialist Solution Architect at AWS. She helps AWS partners build and scale AI/ML solutions through technical enablement, support, and architectural guidance. Moose spends her free time figuring out how to fit more books in her overflowing bookcase.
Jerry Li is a Senior Partner Solutions Architect at AWS Australia, collaborating closely with HCLTech in APAC for over four years. He also works with the HCLTech Data & AI Center of Excellence team, focusing on AWS data analytics and generative AI skills development, solution building, and go-to-market (GTM) strategy.

About HCLTech
HCLTech is at the vanguard of generative AI technology, using the robust AWS Generative AI tech stack. The company offers cutting-edge generative AI solutions that are poised to revolutionize the way businesses and individuals approach content creation, problem-solving, and decision-making. HCLTech has developed a suite of readily deployable generative AI assets and solutions, encompassing the domains of customer experience, software development life cycle (SDLC) integration, and industrial processes.

Mitigating risk: AWS backbone network traffic prediction using GraphSt …

The AWS global backbone network is the critical foundation enabling reliable and secure service delivery across AWS Regions. It connects our 34 launched Regions (with 108 Availability Zones), our more than 600 Amazon CloudFront POPs, and 41 Local Zones and 29 Wavelength Zones, providing high-performance, ultralow-latency connectivity for mission-critical services across 245 countries and territories.
This network requires continuous management through planning, maintenance, and real-time operations. Although most changes occur without incident, the dynamic nature and global scale of this system introduce the potential for unforeseen impacts on performance and availability. The complex interdependencies between network components make it challenging to predict the full scope and timing of these potential impacts, necessitating advanced risk assessment and mitigation strategies.
In this post, we show how you can use our enterprise graph machine learning (GML) framework GraphStorm to solve prediction challenges on large-scale complex networks inspired by our practices of exploring GML to mitigate the AWS backbone network congestion risk.
Problem statement
At its core, the problem we are addressing is how to safely manage and modify a complex, dynamic network while minimizing service disruptions (such as the risk of congestion, site isolation, or increased latency). Specifically, we need to predict how changes to one part of the AWS global backbone network might affect traffic patterns and performance across the entire system. In the case of congestive risk for example, we want to determine whether taking a link out of service is safe under varying demands. Key questions include:

Can the network handle customer traffic with remaining capacity?
How long before congestion appears?
Where will congestion likely occur?
How much traffic is at risk of being dropped?

This challenge of predicting and managing network disruptions is not unique to telecommunication networks. Similar problems arise in various complex networked systems across different industries. For instance, supply chain networks face comparable challenges when a key supplier or distribution center goes offline, necessitating rapid reconfiguration of logistics. In air traffic control systems, the closure of an airport or airspace can lead to complex rerouting scenarios affecting multiple flight paths. In these cases, the fundamental problem remains similar: how to predict and mitigate the ripple effects of localized changes in a complex, interconnected system where the relationships between components are not always straightforward or immediately apparent.
Today, teams at AWS operate a number of safety systems that maintain a high operational readiness bar, and work relentlessly on improving safety mechanisms and risk assessment processes. We conduct a rigorous planning process on a recurring basis to inform how we design and build our network, and maintain resiliency under various scenarios. We rely on simulations at multiple levels of detail to eliminate risks and inefficiencies from our designs. In addition, every change (no matter how small) is thoroughly tested before it is deployed into the network.
However, at the scale and complexity of the AWS backbone network, simulation-based approaches face challenges in real-time operational settings (such as expensive and time-consuming computational process), which impact the efficiency of network maintenance. To complement simulations, we are therefore investing in data-driven strategies that can scale to the size of the AWS backbone network without a proportional increase in computational time. In this post, we share our progress along this journey of model-assisted network operations.
Approach
In recent years, GML methods have achieved state-of-the-art performance in traffic-related tasks, such as routing, load balancing, and resource allocation. In particular, graph neural networks (GNNs) demonstrate an advantage over classical time series forecasting, due to their ability to capture structure information hidden in network topology and their capacity to generalize to unseen topologies when networks are dynamic.
In this post, we frame the physical network as a heterogeneous graph, where nodes represent entities in the networked system, and edges represent both demands between endpoints and actual traffic flowing through the network. We then apply GNN models to this heterogeneous graph for an edge regression task.
Unlike common GML edge regression that predicts a single value for an edge, we need to predict a time series of traffic on each edge. For this, we adopt the sliding-window prediction method. During training, we start from a time point T and use historical data in a time window of size W to predict the value at T+1. We then slide the window one step ahead to predict the value at T+2, and so on. During inference, we use predicted values rather than actual values to form the inputs in a time window as we slide the window forward, making the method an autoregressive sliding-window one. For a more detailed explanation of the principles behind this method, please refer to this link.
We train GNN models with historical demand and traffic data, along with other features (network incidents and maintenance events) by following the sliding-window method. We then use the trained model to predict future traffic on all links of the backbone network using the autoregressive sliding-window method because in a real application, we can only use the predicted values for next-step predictions.
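The following toy sketch contrasts the two regimes on a 1-D series: during training, each window of W ground-truth values predicts the next point; during autoregressive inference, previously predicted values are fed back into the window as it slides forward. The predict_next() function is a placeholder for the trained GNN model.

def sliding_window_pairs(series, window):
    """Training-time pairs: (history window of actual values, next actual value)."""
    return [
        (series[t:t + window], series[t + window])
        for t in range(len(series) - window)
    ]

def autoregressive_forecast(history, window, steps, predict_next):
    """Inference: feed predictions back into the window as it slides forward."""
    buffer = list(history[-window:])
    preds = []
    for _ in range(steps):
        y_hat = predict_next(buffer)      # placeholder for the trained GNN model
        preds.append(y_hat)
        buffer = buffer[1:] + [y_hat]     # slide the window using the predicted value
    return preds
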
In the next section, we show the result of adapting this method to AWS backbone traffic forecasting, for improving operational safety.
Applying GNN-based traffic prediction to the AWS backbone network
For the backbone network traffic prediction application at AWS, we need to ingest a number of data sources into the GraphStorm framework. First, we need the network topology (the graph). In our case, this is composed of devices and physical interfaces that are logically grouped into individual sites. One site may contain dozens of devices and hundreds of interfaces. The edges of the graph represent the fiber connections between physical interfaces on the devices (these are the OSI layer 2 links). For each interface, we measure the outgoing traffic utilization in bps and as a percentage of the link capacity. Finally, we have a traffic matrix that holds the traffic demands between any two pairs of sites. This is obtained using flow telemetry.
The ultimate goal of our application is to improve safety on the network. For this purpose, we measure the performance of traffic prediction along three dimensions:

First, we look at the absolute percentage error between the actual and predicted traffic on each link. We want this error metric to be low to make sure that our model actually learned the routing pattern of the network under varying demands and a dynamic topology.
Second, we quantify the model’s propensity for under-predicting traffic. It is critical to limit this behavior as much as possible because predicting traffic below its actual value can lead to increased operational risk.
Third, we quantify the model’s propensity for over-predicting traffic. Although this is not as critical as the second metric, it’s nonetheless important to address over-predictions because they slow down maintenance operations.

We share some of our results for a test conducted on 85 backbone segments, over a 2-week period. Our traffic predictions are at a 5-minute time resolution. We trained our model on 2 weeks of data and ran the inference on a 6-hour time window. Using GraphStorm, training took less than 1 hour on an m8g.12xlarge instance for the entire network, and inference took under 2 seconds per segment, for the entire 6-hour window. In contrast, simulation-based traffic prediction requires dozens of instances for a similar network sample, and each simulation takes more than 100 seconds to go through the various scenarios.
In terms of the absolute percentage error, we find that our p90 (90th percentile) to be on the order of 13%. This means that 90% of the time, the model’s prediction is less than 13% away from the actual traffic. Because this is an absolute metric, the model’s prediction can be either above or below the network traffic. Compared to classical time series forecasting with XGBoost, our approach yields a 35% improvement.
Next, we consider all the time intervals in which the model under-predicted traffic. We find the p90 in this case to be below 5%. This means that, in 90% of the cases when the model under-predicts traffic, the deviation from the actual traffic is less than 5%.
Finally, we look at all the time intervals in which the model over-predicted traffic (again, this is to evaluate permissiveness for maintenance operations). We find the p90 in this case to be below 14%. This means that, in 90% of the cases when the model over-predicted traffic, the deviation from the actual traffic was less than 14%.
These measurements demonstrate how we can tune the performance of the model to value safety above the pace of routine operations.
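A minimal sketch of the three metrics above, computed with NumPy over aligned arrays of actual and predicted traffic (assuming strictly positive actual values); variable names are illustrative.

import numpy as np

def traffic_error_metrics(actual: np.ndarray, predicted: np.ndarray, q: float = 90):
    pct_err = (predicted - actual) / actual * 100           # signed percentage error
    abs_p90 = np.percentile(np.abs(pct_err), q)              # overall absolute error (p90)
    under = -pct_err[pct_err < 0]                             # magnitude of under-predictions
    over = pct_err[pct_err > 0]                               # magnitude of over-predictions
    under_p90 = np.percentile(under, q) if under.size else 0.0
    over_p90 = np.percentile(over, q) if over.size else 0.0
    return abs_p90, under_p90, over_p90
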
Finally, in this section, we provide a visual representation of the model output around a maintenance operation. This operation consists of removing a segment of the network out of service for maintenance. As shown in the following figure, the model is able to predict the changing nature of traffic on two different segments: one where traffic increases sharply as a result of the operation (left) and the second referring to the segment that was taken out of service and where traffic drops to zero (right).

An example for GNN-based traffic prediction with synthetic data
Unfortunately, we can’t share the details about the AWS backbone network, including the data we used to train the model. To still provide you with some code that makes it straightforward to get started on your own network prediction problems, we share a synthetic traffic prediction problem instead. We have created a Jupyter notebook that generates synthetic airport traffic data. This dataset simulates a global air transportation network using major world airports, creating fictional airlines and flights with predefined capacities. The following figure illustrates these major airports and the simulated flight routes derived from our synthetic data.

Our synthetic data includes: major world airports, simulated airlines and flights with predefined capacities for cargo demands, and generated air cargo demands between airport pairs, which will be delivered by simulated flights.
We employ a simple routing policy to distribute these demands evenly across all shortest paths between two airports. This policy is intentionally hidden from our model, mimicking the real-world scenarios where the exact routing mechanisms are not always known. If flight capacity is insufficient to meet incoming demands, we simulate the excess as inventory stored at the airport. The total inventory at each airport serves as our prediction target. Unlike real air transportation networks, we didn’t follow a hub-and-spoke topology. Instead, our synthetic network uses a point-to-point structure. Using this synthetic air transportation dataset, we now demonstrate a node time series regression task, predicting the total inventory at each airport every day. As illustrated in the following figure, the total inventory amount at an airport is influenced by its own local demands, the traffic passing through it, and the capacity that it can output. By design, the output capacity of an airport is limited to make sure that most airport-to-airport demands require multiple-hop fulfillment.

In the remainder of this section, we cover the data preprocessing steps necessary for using the GraphStorm framework, before customizing a GNN model for our application. Towards the end of the post, we also provide an architecture for an operational safety system built using GraphStorm and in an environment of AWS services.
Data preprocessing for graph time series forecasting
To use GraphStorm for node time series regression, we need to structure our synthetic air traffic dataset according to GraphStorm’s input data format requirements. This involves preparing three key components: a set of node tables, a set of edge tables, and a JSON file describing the dataset.
We abstract the synthetic air traffic network into a graph with one node type (airport) and two edge types. The first edge type, (airport, demand, airport), represents demand between any pair of airports. The second one, (airport, traffic, airport), captures the amount of traffic sent between connected airports.
The following diagram illustrates this graph structure.

Our airport nodes have two types of associated features: static features (longitude and latitude) and time series features (daily total inventory amount). For each edge, the src_code and dst_code capture the source and destination airport codes. The edge features also include a demand and a traffic time series. Finally, edges for connected airports also hold the capacity as a static feature.
The synthetic data generation notebook also creates a JSON file, which describes the air traffic data and provides instructions for GraphStorm’s graph construction tool to follow. Using these artifacts, we can employ the graph construction tool to convert the air traffic graph data into a distributed DGL graph. In this format:

Demand and traffic time series data is stored as E*T tensors in edges, where E is the number of edges of a given type, and T is the number of days in our dataset.
Inventory amount time series data is stored as an N*T tensor in nodes, where N is the number of airport nodes.

This preprocessing step makes sure our data is optimally structured for time series forecasting using GraphStorm.
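As a hedged illustration of the table layout (not GraphStorm's exact construction schema), the node and edge tables can be thought of as the following pandas frames, with the T daily values stored per row; the column names and sample values are assumptions for illustration.

import numpy as np
import pandas as pd

T = 30  # number of days in the synthetic dataset

# Node table: one row per airport, static features plus a length-T inventory series.
airports = pd.DataFrame({
    "airport_code": ["JFK", "LHR", "NRT"],
    "longitude": [-73.78, -0.45, 140.39],
    "latitude": [40.64, 51.47, 35.77],
    "inventory_ts": [np.random.rand(T).tolist() for _ in range(3)],
})

# Edge table for the (airport, traffic, airport) relation: static capacity plus a
# length-T traffic series; a similar table holds the (airport, demand, airport) relation.
traffic_edges = pd.DataFrame({
    "src_code": ["JFK", "LHR"],
    "dst_code": ["LHR", "NRT"],
    "capacity": [120.0, 80.0],
    "traffic_ts": [np.random.rand(T).tolist() for _ in range(2)],
})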
Model
To predict the next total inventory amount for each airport, we employ GNN models, which are well-suited for capturing these complex relationships. Specifically, we use GraphStorm’s Relational Graph Convolutional Network (RGCN) module as our GNN model. This allows us to effectively pass information (demands and traffic) among airports in our network. To support the sliding-window prediction method we described earlier, we created a customized RGCN model.
The detailed implementation of the node time series regression model can be found in the Python file. In the following sections, we explain a few key implementation points.
Customized RGCN model
The GraphStorm v0.4 release adds support for edge features. This means that we can use a for-loop to iterate along the T dimension of the time series tensor, thereby implementing the sliding-window method in the forward() function during model training, as shown in the following pseudocode:

def forward(self, ...):
    ...
    # ---- Process time series data step by step using sliding windows ---- #
    step_losses = []
    for step in range(0, (self._ts_size - self._window_size)):
        # extract one step of time series features based on the time window arguments
        ts_feats = get_one_step_ts_feats(..., self._ts_size, self._window_size, step)
        ...
        # extract the corresponding one-step time series labels
        new_labels = get_ts_labels(labels, self._ts_size, self._window_size, step)
        ...
        # compute the loss for this window
        step_loss = self.model(ts_feats, new_labels)
        step_losses.append(step_loss)
    # sum all step losses and average them
    ts_loss = sum(step_losses) / len(step_losses)

The actual code of the forward() function is in the following code snippet.
In contrast, because the inference step needs to use the autoregressive sliding-window method, we implement a one-step prediction function in the predict() routine:

def predict(self, ..., use_ar=False, predict_step=-1):
    ...
    # ---- Use the autoregressive method during inference ---- #
    # It is the inferrer's responsibility to provide the `predict_step` value.
    if use_ar:
        # extract one step of time series features based on the given predict_step
        ts_feats = get_one_step_ts_feats(..., self._ts_size, self._window_size,
                                         predict_step)
        ...
        # compute the prediction only
        pred = self.model(ts_feats)
    else:
        # ---- Same as the forward() method ---- #
        ...

The actual code of the predict() function is in the following code snippet.
Customized node trainer
GraphStorm’s default node trainer (GSgnnNodePredictionTrainer), which handles the model training loop, can’t process the time series feature requirement. Therefore, we implement a customized node trainer by inheriting from GSgnnNodePredictionTrainer and using our own customized node_mini_batch_gnn_predict() method. This is shown in the following code snippet.
Customized node_mini_batch_predict() method
The customized node_mini_batch_predict() method calls the customized model’s predict() method, passing the two additional arguments that are specific to our use case. These are used to determine whether the autoregressive property is used or not, along with the current prediction step for appropriate indexing (see the following code snippet).
Customized node predictor (inferrer)
Similar to the node trainer, GraphStorm’s default node inference class, which drives the inference pipeline (GSgnnNodePredictionInferrer), can’t handle the time series feature processing we need in this application. We therefore create a customized node inferrer by inheriting from GSgnnNodePredictionInferrer and adding two specific arguments. In this customized inferrer, we use a for-loop to iterate over the T dimension of the time series feature tensor. Unlike the for-loop we used in model training, the inference loop uses the predicted values in subsequent prediction steps (this is shown in the following code snippet).
So far, we have focused on the node prediction example with our dataset and modeling. However, our approach allows for various other prediction tasks, such as:

Forecasting traffic between specific airport pairs.
More complex scenarios like predicting potential airport congestion or increased utilization of alternative routes when reducing or eliminating flights between certain airports.

With the customized model and pipeline classes, we can use the following Jupyter notebook to run the overall training and inference pipeline for our airport inventory amount prediction task. We encourage you to explore these possibilities, adapt the provided example to your specific use cases or research interests, and refer to our Jupyter notebooks for a comprehensive understanding of how to use GraphStorm APIs for various GML tasks.
System architecture for GNN-based network traffic prediction
In this section, we propose a system architecture for enhancing operational safety within a complex network, such as the ones we discussed earlier. Specifically, we employ GraphStorm within an AWS environment to build, train, and deploy graph models. The following diagram shows the various components we need to achieve the safety functionality.

The complex system in question is represented by the network shown at the bottom of the diagram, overlaid on the map of the continental US. This network emits telemetry data that can be stored on Amazon Simple Storage Service (Amazon S3) in a dedicated bucket. The evolving topology of the network should also be extracted and stored.
On the top right of the preceding diagram, we show how Amazon Elastic Compute Cloud (Amazon EC2) instances can be configured with the necessary GraphStorm dependencies using direct access to the project’s GitHub repository. After they’re configured, we can build GraphStorm Docker images on them. These images can then be pushed to Amazon Elastic Container Registry (Amazon ECR) and made available to other services (for example, Amazon SageMaker).
During training, SageMaker jobs use those instances along with the network data to train a traffic prediction model such as the one we demonstrated in this post. The trained model can then be stored on Amazon S3. It might be necessary to repeat this training process periodically, to make sure that the model’s performance keeps up with changes to the network dynamics (such as modifications to the routing schemes).
Above the network representation, we show two possible actors: operators and automation systems. These actors call on a network safety API implemented in AWS Lambda to make sure that the actions they intend to take are safe for the anticipated time horizon (for example, 1 hour, 6 hours, 24 hours). To provide an answer, the Lambda function uses the on-demand inference capabilities of SageMaker. During inference, SageMaker uses the pre-trained model to produce the necessary traffic predictions. These predictions can also be stored on Amazon S3 to continuously monitor the model’s performance over time, triggering training jobs when significant drift is detected.
Conclusion
Maintaining operational safety for the AWS backbone network, while supporting the dynamic needs of our global customer base, is a unique challenge. In this post, we demonstrated how the GML framework GraphStorm can be effectively applied to predict traffic patterns and potential congestion risks in such complex networks. By framing our network as a heterogeneous graph and using GNNs, we’ve shown that it’s possible to capture the intricate interdependencies and dynamic nature of network traffic. Our approach, tested on both synthetic data and the actual AWS backbone network, has demonstrated significant improvements over traditional time series forecasting methods, with a 35% reduction in prediction error compared to classical approaches like XGBoost.
The proposed system architecture, integrating GraphStorm with various AWS services like Amazon S3, Amazon EC2, SageMaker, and Lambda, provides a scalable and efficient framework for implementing this approach in production environments. This setup allows for continuous model training, rapid inference, and seamless integration with existing operational workflows.
We will keep you posted about our progress in taking our solution to production, and share the benefits for AWS customers.
We encourage you to explore the provided Jupyter notebooks, adapt our approach to your specific use cases, and contribute to the ongoing development of graph-based ML techniques for managing complex networked systems. To learn how to use GraphStorm to solve a broader class of ML problems on graphs, see the GitHub repo.

About the Authors
Jian Zhang is a Senior Applied Scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning solutions, particularly graph neural networks, for customers in China, the US, and Singapore. As an advocate of AWS graph capabilities, Zhang has given many public presentations about GraphStorm, GNNs, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.
Fabien Chraim is a Principal Research Scientist in AWS networking. Since 2017, he’s been researching all aspects of network automation, from telemetry and anomaly detection to root causing and actuation. Before Amazon, he co-founded and led research and development at Civil Maps (acquired by Luminar). He holds a PhD in electrical engineering and computer sciences from UC Berkeley.
Patrick Taylor is a Senior Data Scientist in AWS networking. Since 2020, he has focused on impact reduction and risk management in networking software systems and operations research in networking operations teams. Previously, Patrick was a data scientist specializing in natural language processing and AI-driven insights at Hyper Anna (acquired by Alteryx) and holds a Bachelor’s degree from the University of Sydney.
Xiang Song is a Senior Applied Scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a new capability of Neptune that uses graph neural networks for graphs stored in a graph database. He is now leading the development of GraphStorm, an open source graph machine learning framework for enterprise use cases. He received his PhD in computer systems and architecture from Fudan University, Shanghai, in 2014.
Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research, supporting science teams like the graph machine learning group and ML systems teams working on large-scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist, a field in which he holds a PhD.

What is Machine Learning (ML)?

In today’s digital age, we are surrounded by enormous amounts of data, from social media interactions to e-commerce transactions and medical records. Making sense of this data to derive meaningful insights is a significant challenge. Traditional programming methods often fall short when dealing with complex and dynamic datasets, making manual rule-based systems inefficient. For instance, how can we accurately predict customer preferences or identify potential fraud in real-time? These challenges highlight the need for systems that can adapt and learn—problems that Machine Learning (ML) is designed to address. ML has become integral to many industries, supporting data-driven decision-making and innovations in fields like healthcare, finance, and transportation.

Explaining Machine Learning

Machine Learning is a branch of Artificial Intelligence (AI) that allows systems to learn and improve from data without being explicitly programmed. At its core, ML involves analyzing data to identify patterns, make predictions, and automate processes. Rather than relying on predefined rules, ML models learn from historical data to adapt to new situations. For example, streaming platforms use ML to recommend movies, email providers use it to filter spam, and healthcare systems use it to assist in diagnosing diseases. IBM describes Machine Learning as “training algorithms to process and analyze data to make predictions or decisions with minimal human intervention.”

Technical Details and Benefits

Machine Learning operates on three key components: data, algorithms, and computational power. Data serves as the foundation, providing the information needed to train models. Algorithms, including supervised, unsupervised, and reinforcement learning techniques, determine how the system interprets and processes this data. Supervised learning relies on labeled datasets, unsupervised learning identifies hidden patterns in unlabeled data, and reinforcement learning optimizes decision-making through trial and error. Cloud platforms like AWS, Google Cloud, and Microsoft Azure provide the computational infrastructure necessary for training and deploying ML models.
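To make the distinction between these learning paradigms concrete, the short sketch below trains a supervised classifier on labeled data and then clusters the same data with the labels removed. scikit-learn is used purely as an illustrative choice (the article does not prescribe any library), and reinforcement learning is omitted for brevity.

```python
# A minimal sketch contrasting supervised and unsupervised learning,
# assuming scikit-learn as the (illustrative) library of choice.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy labeled dataset: feature matrix X, labels y.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised learning: fit on labeled examples, then score on held-out data.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Unsupervised learning: discard the labels and look for structure (clusters).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("points per cluster:", {int(c): int((clusters == c).sum()) for c in set(clusters)})
```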

The benefits of ML are wide-ranging. Organizations using ML often achieve greater efficiency, reduced costs, and better decision-making. In healthcare, ML algorithms help detect anomalies in medical images, facilitating early diagnosis and treatment. Retailers use ML to tailor customer experiences, increasing sales and loyalty. ML also enables improvements in sectors such as finance, manufacturing, and agriculture by predicting market trends, optimizing supply chains, and boosting crop yields. These capabilities make ML a valuable tool for businesses of all sizes.

Insights

Numerous real-world applications highlight the impact of Machine Learning. According to a study by SAS, organizations adopting ML report up to a 30% improvement in operational efficiency. In healthcare, IBM Watson’s ML technologies have contributed to identifying new drug treatments. Meanwhile, e-commerce platforms leveraging ML have experienced a 20-40% increase in conversion rates through personalized recommendations.

The data underscores the value of ML in transforming raw information into actionable insights. A recent article by Databricks notes that ML models often achieve higher predictive accuracy compared to traditional statistical methods. Additionally, businesses utilizing ML report significant cost savings, with AWS highlighting reductions of up to 25% in operational expenses. For more insights into ML’s capabilities, resources such as IBM, MIT Sloan, and AWS provide valuable perspectives.

Conclusion

Machine Learning represents a practical and effective approach to solving complex problems, analyzing data, and making informed decisions. By leveraging data, algorithms, and computational power, ML provides tools to address challenges that traditional programming cannot. Its applications range from improving efficiency in businesses to advancing healthcare and personalizing customer experiences. As industries continue to explore ML’s potential, its role in shaping the future of technology and innovation will only grow.

Sources:

https://www.ibm.com/think/topics/machine-learning

https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained

https://aws.amazon.com/what-is/machine-learning/

https://www.geeksforgeeks.org/ml-machine-learning/

https://www.datacamp.com/blog/what-is-machine-learning

https://cloud.google.com/learn/what-is-machine-learning

https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-machine-learning-platform

https://www.sas.com/en_us/insights/analytics/machine-learning.html

https://www.techtarget.com/searchenterpriseai/definition/machine-learning-ML 

https://www.databricks.com/glossary/machine-learning-models

https://www.coursera.org/articles/what-is-machine-learning



OpenBMB Just Released MiniCPM-o 2.6: A New 8B Parameters, Any-to-Any Multimodal Model that can Understand Vision, Speech, and Language and Runs on Edge Devices

Artificial intelligence has made significant strides in recent years, but challenges remain in balancing computational efficiency and versatility. State-of-the-art multimodal models, such as GPT-4, often require substantial computational resources, limiting their use to high-end servers. This creates accessibility barriers and leaves edge devices like smartphones and tablets unable to leverage such technologies effectively. Additionally, real-time processing for tasks like video analysis or speech-to-text conversion continues to face technical hurdles, further highlighting the need for efficient, flexible AI models that can function seamlessly on limited hardware.

OpenBMB Releases MiniCPM-o 2.6: A Flexible Multimodal Model

OpenBMB’s MiniCPM-o 2.6 addresses these challenges with its 8-billion-parameter architecture. This model offers comprehensive multimodal capabilities, supporting vision, speech, and language processing while running efficiently on edge devices such as smartphones, tablets, and iPads. MiniCPM-o 2.6 incorporates a modular design with:

SigLip-400M for visual understanding.

Whisper-300M for multilingual speech processing.

ChatTTS-200M for conversational capabilities.

Qwen2.5-7B for advanced text comprehension.

The model achieves a 70.2 average score on the OpenCompass benchmark, outperforming GPT-4V on visual tasks. Its multilingual support and ability to function on consumer-grade devices make it a practical choice for diverse applications.

Technical Details and Benefits

MiniCPM-o 2.6 integrates advanced technologies into a compact and efficient framework:

Parameter Optimization: Despite its size, the model is optimized for edge devices through frameworks like llama.cpp and vLLM, maintaining accuracy while minimizing resource demands.

Multimodal Processing: It processes images up to 1.8 million pixels (1344×1344 resolution) and includes OCR capabilities that lead benchmarks like OCRBench.

Streaming Support: The model supports continuous video and audio processing, enabling real-time applications like surveillance and live broadcasting.

Speech Features: It offers bilingual speech understanding, voice cloning, and emotion control, facilitating natural, real-time interactions.

Ease of Integration: Compatibility with platforms like Gradio simplifies deployment, and its commercial-friendly nature supports applications with fewer than one million daily active users.

These features make MiniCPM-o 2.6 accessible to developers and businesses, enabling them to deploy sophisticated AI solutions without relying on extensive infrastructure.
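As a hedged illustration of that integration story, the sketch below loads the model from Hugging Face with the transformers library. The repository id `openbmb/MiniCPM-o-2_6` and the chat-style call are assumptions based on the usage pattern published for the MiniCPM-V/o model family (the model ships custom code enabled via `trust_remote_code`); consult the model card for the exact, current signature.

```python
# Minimal loading sketch, assuming the chat-style interface documented on the
# MiniCPM model cards (custom code loaded via trust_remote_code). Verify the
# exact repo id and call signature on Hugging Face before relying on this.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-2_6"  # assumed repository id
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Single-turn visual question answering; the msgs format and model.chat() are
# assumptions carried over from the published MiniCPM-V usage examples.
image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Describe this image."]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```

For on-device deployment, the same weights can instead be served through the llama.cpp and vLLM paths mentioned above, which is how the edge-efficiency claims are realized in practice.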

Performance Insights and Real-World Applications

MiniCPM-o 2.6 has delivered notable performance results:

Visual Tasks: Outperforming GPT-4V on OpenCompass with a 70.2 average score underscores its capability in visual reasoning.

Speech Processing: Real-time English/Chinese conversation, emotion control, and voice cloning provide advanced natural language interaction capabilities.

Multimodal Efficiency: Continuous video/audio processing supports use cases such as live translation and interactive learning tools.

OCR Excellence: High-resolution processing ensures accurate document digitization and other OCR tasks.

These capabilities can impact industries ranging from education to healthcare. For example, real-time speech and emotion recognition could enhance accessibility tools, while continuous video and audio processing opens new opportunities in content creation and media.

Conclusion

MiniCPM-o 2.6 represents a significant development in AI technology, addressing long-standing challenges of resource-intensive models and edge-device compatibility. By combining advanced multimodal capabilities with efficient operation on consumer-grade devices, OpenBMB has created a model that is both powerful and accessible. As AI becomes increasingly integral to daily life, MiniCPM-o 2.6 highlights how innovation can bridge the gap between performance and practicality, empowering developers and users across industries to leverage cutting-edge technology effectively.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project.
