Optimizing enterprise AI assistants: How Crypto.com uses LLM reasoning …

This post is co-written with Jessie Jiao from Crypto.com. Crypto.com is a crypto exchange and comprehensive trading service serving 140 million users in 90 countries. To improve its service quality, Crypto.com implemented generative AI-powered assistant services on AWS.
Modern AI assistants—artificial intelligence systems designed to interact with users through natural language, answer questions, and even perform tasks—face increasingly complex challenges in production environments. Beyond handling basic FAQs, they must now execute meaningful actions, adhere to company policies, implement content filtering, escalate to human operators when needed, and manage follow-up tasks. These requirements demand sophisticated systems capable of handling diverse scenarios while maintaining consistency and compliance.
To address these challenges, a modular subsystem architecture proves invaluable. This architectural approach divides an AI system into separate, specialized components that can function independently while working together as a cohesive whole. Such design allows for flexible integration of different processing logics, such as intelligent routing between knowledge bases, dynamic prioritization of information sources, and seamless incorporation of business rules and policies. Each subsystem can be independently developed and optimized for specific tasks while maintaining overall system coherence.
As AI assistant systems grow in complexity, with multiple subsystems handling various workloads, prompt engineering emerges as a critical discipline. This art of carefully crafting input text guides language model responses and facilitates consistent behavior across interconnected components. Crafting effective prompts that work across different subsystems while maintaining consistency and accuracy is both critical and time-intensive. This challenge is particularly acute in enterprise environments where precision and reliability are paramount.
In this post, we explore how we used user and system feedback to continuously improve and optimize our instruction prompts. This feedback-driven approach has enabled us to create more effective prompts that adapt to various subsystems while maintaining high performance across different use cases.
Feedback and reasoning: The key to LLM performance improvement
Although large language models (LLMs) have demonstrated remarkable capabilities, they can sometimes struggle with complex or ambiguous inputs. This is where feedback mechanisms become essential. By incorporating feedback loops, LLMs can learn from their mistakes, refine the instruction, and adapt to challenging scenarios.
One powerful approach is critiquing, where LLMs are paired with an external feedback mechanism that provides critiques or feedback. For instance, when processing documents, if an LLM generates an incorrect summary, a fact-checking tool can identify inaccuracies and provide feedback. The model can then revise its output, leading to improved accuracy and reliability. This iterative process mirrors human learning, where feedback drives continuous improvement. Consider an example where a customer asks an enterprise AI assistant, “I need to increase my credit limit immediately for an emergency purchase.” The assistant might initially respond with approval steps without verification, but a critique system would flag: “Response bypasses required identity verification protocol and fails to assess qualification criteria per company policy.” With this feedback, the assistant can revise its response to include proper authentication steps, eligibility checking, and alternative options for emergency situations—demonstrating how critiquing facilitates adherence to business rules while maintaining helpful customer service.
Unlike traditional machine learning (ML) processes where feedback serves as a loss function to update model weights, these feedback mechanisms operate differently in inference-time LLM applications. Rather than modifying the underlying model parameters, feedback provides supplementary instructions that dynamically guide the model’s behavior. This approach allows for behavioral adaptation without the computational expense of retraining, effectively creating a flexible instruction layer that shapes model outputs while preserving the core capabilities of the pre-trained model. Such runtime adaptability represents a significant advancement in making LLMs more responsive to specific requirements without architectural modifications.
The effectiveness of feedback mechanisms extends beyond simple error correction, enabling LLMs to develop a nuanced understanding of task requirements. Through iterative feedback cycles, models can learn to interpret ambiguous instructions more effectively, identify implicit context, and adapt their processing strategies accordingly. This capability is particularly valuable in enterprise settings where complex, domain-specific tasks require precise interpretation of instructions. By analyzing feedback patterns over time, LLMs can even anticipate potential misunderstandings and proactively adjust their approach, leading to more efficient and accurate outcomes. In our research implementing this approach for financial services classification tasks, we observed substantial performance improvements—from initial accuracy rates of 60% to eventually achieving 100% through systematic feedback incorporation. Each iteration addressed specific weaknesses identified in previous rounds, demonstrating how structured critique leads to continuous model improvement.
For deeper insights into these mechanisms, we recommend two key research papers: CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing and Reflexion: Language Agents with Verbal Reinforcement Learning. The following figure provides a visual representation of this feedback process.

Recent developments in reasoning capabilities have made this feedback process even more powerful. Modern LLMs can now engage in sophisticated analysis of their own outputs, breaking down complex problems into manageable components and systematically evaluating each aspect of their performance. To learn more, see Anthropic’s Claude 3.7 Sonnet hybrid reasoning model is now available in Amazon Bedrock and DeepSeek-R1 now available as a fully managed serverless model in Amazon Bedrock. This self-analysis capability, combined with external feedback, creates a robust framework for continuous improvement.
Consider a scenario where an LLM is tasked with sentiment analysis. Initially, when classifying a mixed review like “The product worked as advertised, but customer service was disappointing,” the model might incorrectly label it as positive. Through error analysis and verification, a critique mechanism (powered by a separate reasoning model) can provide targeted feedback, explaining that negative statements about service quality significantly impact overall sentiment. This feedback doesn’t modify the model’s weights but instead serves as supplementary instruction that enriches the original prompt template, helping the model properly weigh contrasting sentiments within the same text.
Over multiple feedback iterations, the LLM employs reasoning capabilities to incorporate this external feedback and develop more sophisticated classification heuristics. With the critique system continuously verifying outputs and providing constructive guidance, the model learns to identify why certain patterns lead to misclassifications and refines its approach accordingly. When encountering new ambiguous reviews, it can now apply these learned insights to correctly interpret subtle emotional nuances. This demonstrates how reasoning-based feedback effectively modifies the instruction context without requiring parameter adjustments, allowing for continuous improvement through analytical understanding rather than mechanical optimization.
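To make this mechanism concrete, the following minimal sketch (illustrative only, with hypothetical names rather than production code) shows how critique feedback can be accumulated as supplementary instructions appended to a base prompt template, leaving the model weights untouched:

# Minimal sketch: critique feedback accumulates as supplementary instructions,
# so behavior changes without touching model weights.
base_template = (
    "Classify the sentiment of the review as POSITIVE, NEGATIVE, or MIXED.\n"
    "Review: {review}\n"
)

feedback_notes = []  # filled by the critique mechanism over iterations

def build_prompt(review: str) -> str:
    """Combine the base instructions with any critique-derived guidance."""
    guidance = ""
    if feedback_notes:
        guidance = "Additional guidance from prior critiques:\n- " + "\n- ".join(feedback_notes) + "\n"
    return base_template.format(review=review) + guidance

# After the critique flags the misclassified mixed review from the example above:
feedback_notes.append(
    "Negative statements about service quality significantly affect overall sentiment; "
    "label such reviews MIXED rather than POSITIVE."
)
print(build_prompt("The product worked as advertised, but customer service was disappointing."))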
In the next section, we explore how these feedback mechanisms and reasoning capability can be operationalized to enhance workflows.
Solution overview
The integration of feedback and reasoning creates a powerful learning loop: feedback identifies areas for improvement, reasoning capabilities analyze the root causes of issues, and the resulting insights drive specific, actionable changes. This systematic approach to improvement makes sure that each iteration brings the model closer to optimal performance, while maintaining transparency and accountability in the development process.

For practical examples and complete implementation code of this process, check out our GitHub repository. This repository includes sample datasets, evaluation frameworks, and ready-to-use templates for each step of the optimization workflow.
Our proposed solution uses two foundation models (FMs) through Amazon Bedrock: Amazon Nova for executing instructional tasks and optimizing the instruction prompt, and Anthropic’s Claude 3.7 or DeepSeek-R1 for error analysis and feedback generation. Amazon Bedrock, a fully managed service, provides access to high-performance FMs from leading AI companies, enabling flexible model selection and testing. You can explore illustration_notebook_optimization_prompt.ipynb for a quick walkthrough of the high-level process for LLM optimization, which demonstrates key concepts and implementation details in an accessible format.
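To illustrate how the two models can be invoked side by side, the following sketch uses the Amazon Bedrock Converse API through boto3. The model IDs shown are examples only; confirm the identifiers (or cross-Region inference profiles) enabled in your own account and Region:

import boto3

# Sketch of calling the task model and the reasoning model through the
# Amazon Bedrock Converse API. Model IDs are examples and may differ by Region.
bedrock = boto3.client("bedrock-runtime")

TASK_MODEL_ID = "amazon.nova-pro-v1:0"  # executes the classification task
REASONING_MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"  # generates critiques

def invoke(model_id: str, prompt: str) -> str:
    """Send a single-turn prompt and return the text of the first response block."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

classification = invoke(TASK_MODEL_ID, "Classify this inquiry: 'I forgot my password.'")
critique = invoke(REASONING_MODEL_ID, f"Review this classification and explain any errors: {classification}")

Switching the critique step to DeepSeek-R1 only requires changing the reasoning model identifier.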
LLM optimization workflow
The following is the high-level process for LLM optimization:

The process begins with a precise articulation of task requirements and success criteria. This crucial first step involves three key components: defining specific task objectives, crafting a well-structured prompt template with clear instructions, and assembling a comprehensive evaluation dataset with verified ground truth labels. During this phase, we establish quantifiable success metrics and acceptance criteria to measure improvement effectively. The Amazon Nova Pro understanding model is configured to provide both task outputs and detailed explanations for its decisions, enabling transparency in the evaluation process.

For illustration, we started with a simple prompt template to categorize customer inquiries into multiple classes, such as PASSWORD_RESET, ESCALATION, and OUT_OF_SCOPE. This initial template provided only basic category definitions without detailed guidance on edge cases or classification priorities, serving as our baseline for improvement. You can refer to the test case dataset and initial template.
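For readers who prefer not to open the repository right away, a baseline template along the following lines (illustrative only; the actual starting template is in the repository) captures the spirit of our starting point:

# Illustrative baseline template; the real one lives in the repository.
initial_template = """
Classify the customer inquiry into exactly one category:
- PASSWORD_RESET: requests to reset or recover account credentials
- ESCALATION: urgent issues that require human or security review
- OUT_OF_SCOPE: anything unrelated to supported services

Inquiry: ${customer_inquiry}

Return the category name on the first line and a one-sentence explanation on the next line.
"""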

Following the setup, we conduct rigorous testing against ground truth data to evaluate model performance. This evaluation focuses on both successful and failed cases, with particular emphasis on analyzing misclassifications. The model’s generated explanations for each decision serve as valuable insights into its reasoning process. We collect both quantitative performance metrics (accuracy, precision, recall) and qualitative insights into error patterns, creating a comprehensive performance baseline.

During this step, we compare model predictions to ground truth labels and record both quantitative metrics and detailed error cases. For example, when a customer urgently reports unauthorized account changes with “Someone must have accessed my account…I need this fixed immediately”, the model might incorrectly classify it as CARD_DISPUTE instead of the correct ESCALATION category. Each prediction is logged with its success status (true/false), the model’s explanation, and the correct label. This comprehensive analysis creates a structured dataset of both successful classifications and failure cases, providing critical input for the reasoning-based optimization in the next step.
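A condensed sketch of this evaluation step follows. The classify helper, the two inline test cases, and the reuse of initial_template and invoke from the earlier sketches are assumptions for illustration; the production evaluation runs over the full labeled dataset:

from string import Template

# Sketch of the evaluation step: run the task model on every labeled case,
# record accuracy, and keep the failures for the critique step.
test_cases = [
    {"inquiry": "I forgot my password and cannot log in.", "label": "PASSWORD_RESET"},
    {"inquiry": "Someone accessed my account without my permission, fix this now!", "label": "ESCALATION"},
]

def classify(inquiry: str) -> tuple[str, str]:
    """Ask the task model for a category and a one-sentence explanation."""
    prompt = Template(initial_template).substitute(customer_inquiry=inquiry)
    reply = invoke(TASK_MODEL_ID, prompt)
    category, _, explanation = reply.partition("\n")
    return category.strip(), explanation.strip()

error_cases, correct = [], 0
for case in test_cases:
    prediction, explanation = classify(case["inquiry"])
    success = prediction == case["label"]
    correct += success
    record = {
        "inquiry": case["inquiry"],
        "prediction": prediction,
        "expected": case["label"],
        "explanation": explanation,
        "success": success,
    }
    if not success:
        error_cases.append(record)

accuracy = correct / len(test_cases)
print(f"Accuracy: {accuracy:.2%}, errors collected: {len(error_cases)}")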

The key step of our optimization process lies in systematic error analysis using a dedicated reasoning framework. This framework examines the model’s explanations for each error case, identifying root causes and pattern recognition failures. Beyond individual error analysis, we employ pattern recognition to identify systemic issues across multiple cases. The reasoning model, in our case Anthropic’s Claude 3.7, incorporates historical feedback and learning patterns to generate specific, actionable feedback for prompt improvement. This critical step produces structured, detailed recommendations for prompt optimization.

The reasoning model analyzed classification performance through a structured framework that identified error patterns, investigated prompt-specific root causes, considered historical context from previous iterations, and suggested targeted improvements. This methodical approach focused exclusively on enhancing prompt clarity, structure, and precision—avoiding model or data modifications outside the scope of prompt engineering. By systematically addressing ambiguities and refining classification criteria, we achieved progressively better performance with each iteration. See the following code:

critique_prompt_template = """
Analyze classification performance and provide reasoning for prompt improvements:
Current Template: ${input_current_template}
Evaluation Results: ${evaluation_results}

Follow these thinking steps:
1. Error Pattern Analysis:
2. Root Cause Investigation:
3. Historical Context Review:
   • Previous suggestions: ${suggestion_history}
4. Prompt Improvement Ideas:

Output final suggestions between <suggestion> </suggestion> tags
"""

You can see the detailed implementation in error_analysis_with_reasoning.py.
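Conceptually, the critique step fills this template, sends it to the reasoning model, and extracts the text between the <suggestion> tags. The following condensed sketch (simplified from the repository code, reusing the hypothetical invoke helper from the earlier Bedrock example) illustrates the idea:

import re
from string import Template

def generate_critique(current_template: str, evaluation_results: str, suggestion_history: str) -> str:
    """Fill the critique template, call the reasoning model, and return its suggestions."""
    prompt = Template(critique_prompt_template).substitute(
        input_current_template=current_template,
        evaluation_results=evaluation_results,
        suggestion_history=suggestion_history,
    )
    raw = invoke(REASONING_MODEL_ID, prompt)
    match = re.search(r"<suggestion>(.*?)</suggestion>", raw, re.DOTALL)
    return match.group(1).strip() if match else raw.strip()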

Using the structured feedback from the reasoning framework, we implement targeted modifications to the prompt template. These refinements might include enhancing instruction clarity, adjusting classification parameters, or restructuring the prompt format. Each modification directly addresses specific issues identified in the analysis phase, making sure changes are evidence-based and purposeful. The focus remains on improving the instruction layer rather than modifying the underlying model architecture.

To implement these structured improvements, we developed a systematic prompt rewriting mechanism encoded in our prompt_rewrite.py module. This component transforms analytical feedback into concrete prompt enhancements through a dedicated template-based approach. The rewriting process follows a methodical workflow: it preserves essential components like placeholders, incorporates specific improvements identified in the analysis, and makes sure modifications directly address root causes from the feedback. This systematic rewriting approach helps ensure that each iteration builds upon previous learnings rather than making arbitrary changes.

rewrite_prompt_template = """
TASK: Improve the prompt template based on critique feedback.
INPUT:
- Current Template: ${input_current_template}
- Critique Analysis: ${critique_feedbacks}
INSTRUCTIONS:
1. Preserve the current template structure and all placeholders
2. Implement specific improvements identified in the critique
3. Focus on addressing root causes of errors
4. Create a complete, ready-to-use improved template
OUTPUT FORMAT:
- Root cause summary
- Improved template incorporating all recommended changes
The improved template should directly address identified issues while remaining concise and effective.
"""

The optimization process concludes each iteration by testing the refined prompt against the evaluation dataset. We measure performance improvements through comparative analysis of key metrics and conduct quality assessments of new outputs. This phase initiates the next iteration cycle, where successful changes are incorporated into the baseline, and newly identified challenges inform the next round of optimization. This creates a sustainable improvement loop that progressively enhances prompt effectiveness while maintaining detailed documentation of successful strategies.
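Putting the steps together, the whole cycle can be expressed as a short loop. This is a simplified sketch rather than the exact repository code: evaluate is a hypothetical wrapper around the evaluation step shown earlier, and rewrite_template is a hypothetical wrapper that fills rewrite_prompt_template and calls the task model:

# Simplified optimization loop: evaluate, critique, rewrite, and repeat
# until the accuracy target is met or the iteration budget runs out.
template = initial_template
history = []

for iteration in range(10):
    accuracy, error_cases = evaluate(template, test_cases)  # hypothetical evaluation helper
    print(f"Iteration {iteration}: accuracy {accuracy:.2%}")
    if accuracy >= 0.95:
        break
    suggestions = generate_critique(template, str(error_cases), "\n".join(history))
    history.append(suggestions)
    template = rewrite_template(template, suggestions)  # hypothetical wrapper around rewrite_prompt_template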

Through our iterative refinement process, we transformed a basic prompt into a highly effective instruction set for LLMs. Each iteration strategically addressed specific weaknesses identified through our structured analysis framework. For complete documentation of each iteration’s analysis and improvements, see iteration_log.
What began as a simple prompt evolved into a comprehensive set of instructions incorporating nuanced task boundaries, explicit priority rules for edge cases, hierarchical decision criteria, and precise handling instructions for corner cases. Rather than modify model weights or architecture, our approach used targeted feedback from a critique mechanism to enhance the instruction layer, effectively guiding model behavior without retraining. Each iteration built upon lessons from previous rounds, systematically addressing error patterns revealed through our critique framework. The feedback served as supplementary instructions that enriched the original prompt template, allowing the model to develop increasingly sophisticated processing heuristics over time.
Results
Through these iterative approaches, we benchmarked the solution on the production system. Our comparative analysis between the initial and final prompts revealed several important patterns:

Boundary confusion was resolved by adding explicit prioritization rules between overlapping categories
Edge case handling improved by incorporating specific examples that defined thresholds for categorization
Decision transparency increased through structured reasoning requirements in the output format
Classification consistency was enhanced by adding counterexamples to help prevent overcategorization in sensitive areas

Through 10 deliberate iterations and the incorporation of detailed task-specific instructions, we achieved a remarkable 34-percentage-point improvement in task effectiveness, transforming a basic prompt with 60% accuracy into a robust classification system with 94% accuracy on challenging cases. This not only validates our iterative optimization strategy but also demonstrates how systematic prompt refinement can dramatically enhance LLM performance without modifying the underlying model architecture.
Conclusion
The integration of feedback mechanisms into AI assistant systems represents a significant leap forward in conversational AI capabilities. By implementing robust feedback loops, we’ve demonstrated how AI assistants can evolve from static question-answering systems to dynamic, self-improving resources. The modular subsystem architecture, combined with continuous prompt optimization through feedback, enables AI assistants to handle increasingly complex tasks while maintaining compliance and accuracy.
As we’ve shown through practical examples and research insights, feedback-driven systems not only produce better outputs but also allow for more effective and streamlined input instructions over time. This efficiency gain is particularly valuable in enterprise environments where precision and adaptability are crucial, and where model retraining is costly or impractical. Each iteration builds upon lessons from previous rounds, systematically addressing error patterns revealed through our critique framework.
Looking ahead, the continued refinement of feedback mechanisms and prompt engineering techniques will be essential for developing next-generation AI assistant systems. By embracing these approaches, organizations can create AI assistants that not only meet current demands but also adapt to future challenges, delivering increasingly sophisticated and reliable interactions. We invite you to try our proposed feedback-driven prompt optimization approach in your own applications. For those interested in implementing these techniques, Amazon Bedrock provides an ideal landscape for exploring these methods in your specific business contexts, offering a selection of FMs with flexible deployment options.

About the authors
Jessie Jiao is a Senior Software Engineer at crypto.com, where she leverages her extensive experience in designing, building, and implementing enterprise applications with LLM models and AI technologies. She is passionate about harnessing the power of AI to drive business transformation and enhance operational efficiency.
Gary Lo is a Solutions Architect at AWS based in Hong Kong. He is a highly passionate IT professional with over 10 years of experience in designing and implementing critical and complex solutions for distributed systems, web applications, and mobile platforms for startups and enterprise companies. Outside of the office, he enjoys cooking and sharing the latest technology trends and insights on his social media platforms with thousands of followers.
Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.
Michelle Hong, PhD, works as a Prototyping Solutions Architect at Amazon Web Services, where she helps customers build innovative applications using a variety of AWS components. She applies her expertise in machine learning, particularly in natural language processing, to develop data-driven solutions that optimize business processes and improve customer experiences.

The U.S. White House Releases AI Playbook: A Bold Strategy to Lead th …

The White House just released the U.S. AI Playbook—formally titled “America’s AI Action Plan”—a sweeping, high-impact federal strategy that clarifies one thing: the United States is going all in on artificial intelligence. Whether you’re in Silicon Valley, leading a Fortune 500, or managing a critical government agency, the message is unambiguous: scale AI fast, dismantle barriers, and secure American leadership across the technology, industrial, and geopolitical fronts.

Clear Ambition: Outpace Competitors, Build Relentlessly, Lead Globally

The core objective is explicit. Drawing inspiration from the space race, the plan calls on all sectors to treat AI as both an economic and national security imperative. Whoever builds the largest AI ecosystem will set global standards, reap economic rewards, and dictate the technological future. The document is a roadmap, backed by presidential executive orders, that envisions AI as the engine of an industrial revolution, an information revolution, and an intellectual renaissance—all happening at once.

Pillar I: Accelerate AI Innovation

Remove Regulatory Bottlenecks: The playbook begins by eliminating “onerous regulation”—with a clear warning: federal funding, grants, and contracts may be restricted in states imposing heavy-handed AI rules. Deregulation is described not just as policy, but as a key competitive advantage; for example, Executive Order 14179 rescinds previous strictures, emphasizing that innovation must not be “smothered in bureaucracy” at any level.

Federal Action Points:

Review and repeal federal and state regulations that hinder AI deployment.

Funding decisions will now consider a state’s AI regulatory climate.

Agencies like the OSTP and OMB are tasked with fast-tracking the deregulation agenda.

What it Means for You: If your operations or customer base span multiple states, closely monitor how local regulations may impact access to federal opportunities and partner eligibility.

Open-Source and Open-Weight Models: The U.S. commits to making open models, with accessible “weights,” a national priority. These models are seen as crucial for research, SME adoption, and independence from closed vendor systems. The plan also sees open models as “geostrategic assets” vital to American influence abroad, with specific steps to create financial markets for AI compute and to expand public-private partnerships under the National AI Research Resource.

Pillar II: Build American AI Infrastructure

Unleashing National Capacity: Perhaps the most data-intensive section, the plan notes a dramatic need for new data centers, semiconductor fabs, and—above all—energy. The U.S. electrical grid, stagnant since the 1970s, is flagged as an urgent chokepoint. Rivals like China have aggressively scaled up, and the plan’s not subtle: “Build, Baby, Build” is the refrain.

Key Infrastructure Actions:

Streamlined permitting: NEPA reforms, categorical exclusions, and FAST-41 expansions will cut red tape for data centers and energy projects.

Grid overhaul: Stabilize the existing grid, boost new generation (including nuclear and geothermal), and reform markets to incentivize reliability.

Semiconductor renaissance: The CHIPS Act will focus purely on returns and capacity—minus prior “ideological” add-ons—to bring advanced chip fabrication back to the U.S.

What it Means for You: Expect rising demand and volatility in data center space and energy pricing. Secure, diversify, and lock in partnerships for compute, cloud, and hardware. Organizations slow to secure infrastructure could find themselves paying premium rates or left out when bottlenecks hit.

Pillar III: International Leadership, Security, and Diplomacy

Allies, Adversaries, and American Standards: The Playbook envisions not just a domestic transformation, but a global campaign. Exporting “the full U.S. AI technology stack”—chips, models, applications, and standards—to allied nations becomes a top priority, with new mechanisms for economic diplomacy and standards-setting at the international level. There is substantial focus on:

Countering Chinese tech and regulatory influence in multilateral bodies.

Aggressive tightening of export controls on semiconductors and AI technologies, with advanced location verification for chips.

Ensuring rigorous national security reviews of both domestic and foreign (especially adversarial) AI models.

Workforce: AI Upskilling and Job Protection as Core Policy

Federal Upskilling Campaigns: The plan mandates career-long AI skill development as a central federal funding objective—supported by new guidance making AI training tax-advantaged for employers under IRC Section 132. Two executive orders from April 2025 reinforce commitments to K-12 AI literacy and “skilled trades for the future”.

Actionable Initiatives:

Priority on integrating AI training into apprenticeships, CTE, and higher education.

New AI Workforce Research Hub to track and predict job displacement trends and uncover actionable insights.

Funding rapid retraining and upskilling for workers in industries facing automation pressures.

What it Means for You: Companies must prioritize internal AI talent pipelines or risk falling behind. Federal incentives will favor organizations embracing rapid upskilling, especially in operations, engineering, and cybersecurity roles.

Institutional AI Adoption: Government as AI Buyer and Benchmark

Turnkey Federal AI Use: The plan accelerates government AI procurement and deployments, particularly in defense, public health, and infrastructure. This includes formalizing the Chief Artificial Intelligence Officer Council, building government AI procurement toolkits, and deploying classified, high-security AI data centers for sensitive workloads.

Market Signal: Federal adoption isn’t just for show; it sets compliance, risk, and procurement benchmarks the private sector will need to meet to access government contracts and regulated markets.

Strategic Takeaways

Regulatory “Climate” Now Determines Eligibility: Organizations in stricter regulatory states may lose access to billions in federal funding and contracts—assess and adapt your compliance playbook now.

Infrastructure is King: Anticipated surge in compute and energy demand will reshape market price dynamics—early movers lock in advantages.

Open-Source AI Now a Federal Priority: Leverage open models for transparency, innovation, and access to upcoming government programs.

Talent and Training Aren’t Optional: Investing in workforce retraining and AI literacy is now federally incentivized and frequently a condition for funding.

Federal Procurement Will Shape Industry Standards: Many U.S. agencies—including military and healthcare—will soon demand compliance with new federal toolkits and risk frameworks.

Conclusion

The U.S. AI Playbook is a bold, all-in bet. It frames AI not as incremental progress, but as a directional bet on America’s future—economically, geopolitically, and scientifically. For organizations of all sizes, ignoring this federal roadmap is not a viable option. Aligning with these five strategic moves—auditing regulatory risk, planning for infrastructure volatility, leveraging open-source AI, investing in upskilling, and tracking federal adoption—is now the price of entry to America’s next industrial age.

Check out the Full Report.

Building a Context-Aware Multi-Agent AI System Using Nomic Embeddings …

In this tutorial, we walk through the complete implementation of an advanced AI agent system powered by Nomic Embeddings and Google’s Gemini. We design the architecture from the ground up, integrating semantic memory, contextual reasoning, and multi-agent orchestration into a single intelligent framework. Using LangChain, Faiss, and LangChain-Nomic, we equip our agents with the ability to store, retrieve, and reason over information using natural language queries. The goal is to demonstrate how we can build a modular and extensible AI system that supports both analytical research and friendly conversation.

!pip install -qU langchain-nomic langchain-core langchain-community langchain-google-genai faiss-cpu numpy matplotlib

import os
import getpass
import numpy as np
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from langchain_nomic import NomicEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.documents import Document
from langchain_google_genai import ChatGoogleGenerativeAI
import json

if not os.getenv("NOMIC_API_KEY"):
    os.environ["NOMIC_API_KEY"] = getpass.getpass("Enter your Nomic API key: ")

if not os.getenv("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API key (for Gemini): ")

We begin by installing all the required libraries, including langchain-nomic, langchain-google-genai, and faiss-cpu, to support our agent’s embedding, reasoning, and vector search capabilities. We then import the necessary modules and securely set our Nomic and Google API keys using getpass to ensure smooth integration with the embedding and LLM services. Check out the full Codes.

@dataclass
class AgentMemory:
    """Agent's episodic and semantic memory"""
    episodic: List[Dict[str, Any]]
    semantic: Dict[str, Any]
    working: Dict[str, Any]

class IntelligentAgent:
    """Advanced AI Agent with Nomic Embeddings for semantic reasoning"""

    def __init__(self, agent_name: str = "AIAgent", personality: str = "helpful"):
        self.name = agent_name
        self.personality = personality

        self.embeddings = NomicEmbeddings(
            model="nomic-embed-text-v1.5",
            dimensionality=384,
            inference_mode="remote"
        )

        self.llm = ChatGoogleGenerativeAI(
            model="gemini-1.5-flash",
            temperature=0.7,
            max_tokens=512
        )

        self.memory = AgentMemory(
            episodic=[],
            semantic={},
            working={}
        )

        self.knowledge_base = None
        self.vector_store = None

        self.capabilities = {
            "reasoning": True,
            "memory_retrieval": True,
            "knowledge_search": True,
            "context_awareness": True,
            "learning": True
        }

        print(f"{self.name} initialized with Nomic embeddings + Gemini LLM")

    def add_knowledge(self, documents: List[str], metadata: List[Dict] = None):
        """Add knowledge to agent's semantic memory"""
        if metadata is None:
            metadata = [{"source": f"doc_{i}"} for i in range(len(documents))]

        docs = [Document(page_content=doc, metadata=meta)
                for doc, meta in zip(documents, metadata)]

        if self.vector_store is None:
            self.vector_store = InMemoryVectorStore.from_documents(docs, self.embeddings)
        else:
            self.vector_store.add_documents(docs)

        print(f"Added {len(documents)} documents to knowledge base")

    def remember_interaction(self, user_input: str, agent_response: str, context: Dict = None):
        """Store interaction in episodic memory"""
        memory_entry = {
            "timestamp": len(self.memory.episodic),
            "user_input": user_input,
            "agent_response": agent_response,
            "context": context or {},
            "embedding": self.embeddings.embed_query(f"{user_input} {agent_response}")
        }
        self.memory.episodic.append(memory_entry)

    def retrieve_similar_memories(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve similar past interactions"""
        if not self.memory.episodic:
            return []

        query_embedding = self.embeddings.embed_query(query)
        similarities = []

        for memory in self.memory.episodic:
            similarity = np.dot(query_embedding, memory["embedding"])
            similarities.append((similarity, memory))

        similarities.sort(reverse=True, key=lambda x: x[0])
        return [mem for _, mem in similarities[:k]]

    def search_knowledge(self, query: str, k: int = 3) -> List[Document]:
        """Search knowledge base for relevant information"""
        if self.vector_store is None:
            return []
        return self.vector_store.similarity_search(query, k=k)

    def reason_and_respond(self, user_input: str) -> str:
        """Main reasoning pipeline with context integration"""

        similar_memories = self.retrieve_similar_memories(user_input, k=2)

        relevant_docs = self.search_knowledge(user_input, k=3)

        context = {
            "similar_memories": similar_memories,
            "relevant_knowledge": [doc.page_content for doc in relevant_docs],
            "working_memory": self.memory.working
        }

        response = self._generate_contextual_response(user_input, context)

        self.remember_interaction(user_input, response, context)

        self.memory.working["last_query"] = user_input
        self.memory.working["last_response"] = response

        return response

    def _generate_contextual_response(self, query: str, context: Dict) -> str:
        """Generate response using Gemini LLM with context"""

        context_info = ""

        if context["relevant_knowledge"]:
            context_info += f"Relevant Knowledge: {' '.join(context['relevant_knowledge'][:2])}\n"

        if context["similar_memories"]:
            memory = context["similar_memories"][0]
            context_info += f"Similar Past Interaction: User asked '{memory['user_input']}', I responded '{memory['agent_response'][:100]}...'\n"

        prompt = f"""You are {self.name}, an AI agent with personality: {self.personality}.

Context Information:
{context_info}

User Query: {query}

Please provide a helpful response based on the context. Keep it concise (under 150 words) and maintain your personality."""

        try:
            response = self.llm.invoke(prompt)
            return response.content.strip()
        except Exception as e:
            if context["relevant_knowledge"]:
                knowledge_summary = " ".join(context["relevant_knowledge"][:2])
                return f"Based on my knowledge: {knowledge_summary[:200]}..."
            elif context["similar_memories"]:
                last_memory = context["similar_memories"][0]
                return f"I recall a similar question. Previously: {last_memory['agent_response'][:150]}..."
            else:
                return "I need more information to provide a comprehensive answer."

We define the core structure of our intelligent agent by creating a memory system that mimics episodic and semantic recall. We integrate Nomic embeddings for semantic understanding and use Gemini LLM to generate contextual, personality-driven responses. With built-in capabilities like memory retrieval, knowledge search, and reasoning, we enable the agent to interact intelligently and learn from each conversation. Check out the full Codes.

class ResearchAgent(IntelligentAgent):
    """Specialized agent for research and analysis tasks"""

    def __init__(self):
        super().__init__("ResearchBot", "analytical and thorough")
        self.research_domains = []

    def analyze_topic(self, topic: str) -> Dict[str, Any]:
        """Analyze a topic using semantic similarity and Gemini reasoning"""

        related_docs = self.search_knowledge(topic, k=5)

        if not related_docs:
            return {"analysis": "No relevant information found", "confidence": 0.0}

        topic_embedding = self.embeddings.embed_query(topic)
        doc_embeddings = [self.embeddings.embed_query(doc.page_content)
                          for doc in related_docs]

        similarities = [np.dot(topic_embedding, doc_emb)
                        for doc_emb in doc_embeddings]

        context = " ".join([doc.page_content for doc in related_docs[:3]])
        analysis_prompt = f"""As a research analyst, analyze the topic: {topic}

Available information:
{context}

Provide a structured analysis including:
1. Key insights (2-3 points)
2. Confidence level assessment
3. Research gaps or limitations
4. Practical implications

Keep response under 200 words."""

        try:
            gemini_analysis = self.llm.invoke(analysis_prompt)
            detailed_analysis = gemini_analysis.content.strip()
        except Exception:
            detailed_analysis = f"Analysis of {topic} based on available documents with {len(related_docs)} relevant sources."

        analysis = {
            "topic": topic,
            "related_documents": len(related_docs),
            "max_similarity": max(similarities),
            "avg_similarity": np.mean(similarities),
            "key_insights": [doc.page_content[:100] + "..." for doc in related_docs[:3]],
            "confidence": max(similarities),
            "detailed_analysis": detailed_analysis
        }

        return analysis

class ConversationalAgent(IntelligentAgent):
    """Agent optimized for natural conversations"""

    def __init__(self):
        super().__init__("ChatBot", "friendly and engaging")
        self.conversation_history = []

    def maintain_conversation_context(self, user_input: str) -> str:
        """Maintain conversation flow with context awareness"""

        self.conversation_history.append({"role": "user", "content": user_input})

        recent_context = " ".join([msg["content"] for msg in self.conversation_history[-3:]])

        response = self.reason_and_respond(recent_context)

        self.conversation_history.append({"role": "assistant", "content": response})

        return response

We extend our intelligent agent into two specialized versions: a ResearchAgent for structured topic analysis and a ConversationalAgent for natural dialogue. The research agent leverages semantic similarity and Gemini LLM to generate confident, insight-rich analyses, while the conversational agent maintains a history-aware chat experience that feels coherent and engaging. This modular design enables us to tailor AI behaviors to meet specific user needs. Check out the full Codes.

def demonstrate_agent_capabilities():
    """Comprehensive demonstration of agent capabilities"""

    print("Creating and testing AI agents...")

    research_agent = ResearchAgent()
    chat_agent = ConversationalAgent()

    knowledge_documents = [
        "Artificial intelligence is transforming industries through automation and intelligent decision-making systems.",
        "Machine learning algorithms require large datasets to identify patterns and make predictions.",
        "Natural language processing enables computers to understand and generate human language.",
        "Computer vision allows machines to interpret and analyze visual information from images and videos.",
        "Robotics combines AI with physical systems to create autonomous machines.",
        "Deep learning uses neural networks with multiple layers to solve complex problems.",
        "Reinforcement learning teaches agents to make decisions through trial and error.",
        "Quantum computing promises to solve certain problems exponentially faster than classical computers."
    ]

    research_agent.add_knowledge(knowledge_documents)
    chat_agent.add_knowledge(knowledge_documents)

    print("\nTesting Research Agent...")

    topics = ["machine learning", "robotics", "quantum computing"]

    for topic in topics:
        analysis = research_agent.analyze_topic(topic)
        print(f"\nAnalysis of '{topic}':")
        print(f"  Confidence: {analysis['confidence']:.3f}")
        print(f"  Related docs: {analysis['related_documents']}")
        print(f"  Detailed Analysis: {analysis.get('detailed_analysis', 'N/A')[:200]}...")
        print(f"  Key insight: {analysis['key_insights'][0] if analysis['key_insights'] else 'None'}")

    print("\nTesting Conversational Agent...")

    conversation_inputs = [
        "Tell me about artificial intelligence",
        "How does machine learning work?",
        "What's the difference between AI and machine learning?",
        "Can you explain neural networks?"
    ]

    for user_input in conversation_inputs:
        response = chat_agent.maintain_conversation_context(user_input)
        print(f"\nUser: {user_input}")
        print(f"Agent: {response}")

    print("\nMemory Analysis...")
    print(f"Research Agent memories: {len(research_agent.memory.episodic)}")
    print(f"Chat Agent memories: {len(chat_agent.memory.episodic)}")

    similar_memories = chat_agent.retrieve_similar_memories("artificial intelligence", k=2)
    if similar_memories:
        print("\nSimilar memory found:")
        print(f"  Query: {similar_memories[0]['user_input']}")
        print(f"  Response: {similar_memories[0]['agent_response'][:100]}...")

We run a comprehensive demonstration of our AI agents by loading a shared knowledge base and evaluating both research and conversational tasks. We test the ResearchAgent’s ability to generate insightful analyses on key topics and validate the ConversationalAgent’s performance across multi-turn queries. Through introspection, we confirm that the agents effectively retain and retrieve relevant past interactions. Check out the full Codes.

class MultiAgentSystem:
    """Orchestrate multiple specialized agents"""

    def __init__(self):
        self.agents = {
            "research": ResearchAgent(),
            "chat": ConversationalAgent()
        }
        self.coordinator_embeddings = NomicEmbeddings(model="nomic-embed-text-v1.5", dimensionality=256)

    def route_query(self, query: str) -> str:
        """Route query to most appropriate agent"""

        agent_descriptions = {
            "research": "analysis, research, data, statistics, technical information",
            "chat": "conversation, questions, general discussion, casual talk"
        }

        query_embedding = self.coordinator_embeddings.embed_query(query)
        best_agent = "chat"
        best_similarity = 0

        for agent_name, description in agent_descriptions.items():
            desc_embedding = self.coordinator_embeddings.embed_query(description)
            similarity = np.dot(query_embedding, desc_embedding)

            if similarity > best_similarity:
                best_similarity = similarity
                best_agent = agent_name

        return best_agent

    def process_query(self, query: str) -> Dict[str, Any]:
        """Process query through appropriate agent"""

        selected_agent, confidence = self.route_query_with_confidence(query)
        agent = self.agents[selected_agent]

        if selected_agent == "research":
            if "analyze" in query.lower() or "research" in query.lower():
                topic = query.replace("analyze", "").replace("research", "").strip()
                result = agent.analyze_topic(topic)
                response = f"Research Analysis: {result.get('detailed_analysis', str(result))}"
            else:
                response = agent.reason_and_respond(query)
        else:
            response = agent.maintain_conversation_context(query)

        return {
            "query": query,
            "selected_agent": selected_agent,
            "response": response,
            "confidence": confidence
        }

    def route_query_with_confidence(self, query: str) -> tuple[str, float]:
        """Route query to most appropriate agent and return confidence"""

        agent_descriptions = {
            "research": "analysis, research, data, statistics, technical information",
            "chat": "conversation, questions, general discussion, casual talk"
        }

        query_embedding = self.coordinator_embeddings.embed_query(query)
        best_agent = "chat"
        best_similarity = 0.0

        for agent_name, description in agent_descriptions.items():
            desc_embedding = self.coordinator_embeddings.embed_query(description)
            similarity = np.dot(query_embedding, desc_embedding)

            if similarity > best_similarity:
                best_similarity = similarity
                best_agent = agent_name

        return best_agent, best_similarity

We built a multi-agent system that intelligently routes queries to either the research or conversational agent based on semantic similarity. By embedding both the user query and agent specialties using Nomic embeddings, we ensure that the most relevant expert is assigned to each request. This architecture allows us to scale intelligent behavior while maintaining specialization and precision. Check out the full Codes.

if __name__ == "__main__":
    print("\nAdvanced AI Agent System with Nomic Embeddings + Gemini LLM")
    print("=" * 70)
    print("Note: This uses Google's Gemini 1.5 Flash (free tier) for reasoning")
    print("Get your free Google API key at: https://makersuite.google.com/app/apikey")
    print("Get your Nomic API key at: https://atlas.nomic.ai/")
    print("=" * 70)

    demonstrate_agent_capabilities()

    print("\nTesting Multi-Agent System...")
    multi_system = MultiAgentSystem()

    knowledge_docs = [
        "Python is a versatile programming language used in AI development.",
        "TensorFlow and PyTorch are popular machine learning frameworks.",
        "Data preprocessing is crucial for successful machine learning projects."
    ]

    for agent in multi_system.agents.values():
        agent.add_knowledge(knowledge_docs)

    test_queries = [
        "Analyze the impact of AI on society",
        "How are you doing today?",
        "Research machine learning trends",
        "What's your favorite color?"
    ]

    for query in test_queries:
        result = multi_system.process_query(query)
        print(f"\nQuery: {query}")
        print(f"Routed to: {result['selected_agent']} agent")
        print(f"Response: {result['response'][:150]}...")

    print("\nAdvanced AI Agent demonstration complete!")

We conclude by running a comprehensive demonstration of our AI system, initializing the agents, loading knowledge, and testing real-world queries. We observe how the multi-agent system intelligently routes each query based on its content, showcasing the strength of our modular design. This final execution confirms the agents’ capabilities in reasoning, memory, and adaptive response generation.

In conclusion, we now have a powerful and flexible AI agent framework that leverages Nomic embeddings for semantic understanding and Gemini LLM for contextual response generation. We demonstrate how agents can independently manage memory, retrieve knowledge, and reason intelligently, while the multi-agent system ensures that user queries are routed to the most capable agent. By walking through both research-focused and conversational interactions, we showcase how this setup can serve as a foundation for building truly intelligent and responsive AI assistants.

Check out the Codes.

VLM2Vec-V2: A Unified Computer Vision Framework for Multimodal Embeddi …

Embedding models act as bridges between different data modalities by encoding diverse multimodal information into a shared dense representation space. Embedding models have advanced in recent years, driven by progress in large foundation models. However, existing multimodal embedding models are trained on datasets such as MMEB and M-BEIR, most of which focus only on natural images and photographs sourced from MSCOCO, Flickr, and ImageNet. These datasets fail to cover broader forms of visual information, including documents, PDFs, websites, videos, and slides, which causes existing embedding models to underperform on realistic tasks such as article search, website search, and YouTube video search.

Multimodal embedding benchmarks such as MSCOCO, Flickr30K, and Conceptual Captions initially focused on static image-text pairs for tasks like image captioning and retrieval. More recent benchmarks, such as M-BEIR and MMEB, introduced multi-task evaluations, but remain limited to static images and short contexts. Video representation learning has evolved through models like VideoCLIP and VideoCoCa, integrating contrastive learning with captioning objectives. Visual document representation learning advanced through models like ColPali and VisRAG, which use VLMs for document retrieval. Unified modality retrieval methods like GME and Uni-Retrieval achieve strong performance on universal benchmarks. However, none can unify image, video, and visual document retrieval within a single framework.

Researchers from Salesforce Research, UC Santa Barbara, University of Waterloo, and Tsinghua University have proposed VLM2Vec-V2 to unify image, video, and visual document retrieval within a single framework. Firstly, researchers developed MMEB-V2, a benchmark that extends MMEB with five new task types, including visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering. Secondly, VLM2Vec-V2 serves as a general-purpose embedding model that supports multiple input modalities while demonstrating strong performance on both newly introduced tasks and original image benchmarks. This establishes a foundation for more scalable and flexible representation learning in both research and practical applications.

VLM2Vec-V2 utilizes Qwen2-VL as its backbone, selected for its specialized capabilities in multimodal processing. Qwen2-VL offers three critical features that support unified embedding learning: Naive Dynamic Resolution, Multimodal Rotary Position Embedding (M-RoPE), and a unified framework that combines 2D and 3D convolutions. To enable effective multi-task training across diverse data sources, VLM2Vec-V2 introduces a flexible data sampling pipeline with two key components: (a) on-the-fly batch mixing based on predefined sampling weight tables that control the relative probabilities of each dataset, and (b) an interleaved sub-batching strategy that splits full batches into independently sampled sub-batches, improving the stability of contrastive learning.

VLM2Vec-V2 achieves the highest overall average score of 58.0 across 78 datasets covering image, video, and visual document tasks, outperforming strong baselines including GME, LamRA, and VLM2Vec built on the same Qwen2-VL backbone. On image tasks, VLM2Vec-V2 outperforms most baselines by significant margins and achieves performance comparable to VLM2Vec-7B despite being only 2B parameters in size. For video tasks, the model achieves competitive performance despite training on relatively small amounts of video data. In visual document retrieval, VLM2Vec-V2 outperforms all VLM2Vec variants, but still lags behind ColPali, which is specifically optimized for visual document tasks.

In conclusion, researchers introduced VLM2Vec-V2, a strong baseline model trained through contrastive learning across diverse tasks and modality combinations. VLM2Vec-V2 is built upon MMEB-V2 and uses Qwen2-VL as its backbone model. MMEB-V2 is a benchmark designed by researchers to assess multimodal embedding models across various modalities, including text, images, videos, and visual documents. The experimental evaluation demonstrates the effectiveness of VLM2Vec-V2 in achieving balanced performance across multiple modalities while highlighting the diagnostic value of MMEB-V2 for future research.

Check out the Paper, GitHub Page and Model on Hugging Face.

REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasonin …

Large Reasoning Models (LRMs) have rapidly advanced, exhibiting impressive performance in complex problem-solving tasks across domains like mathematics, coding, and scientific reasoning. However, current evaluation approaches primarily focus on single-question testing, which reveals significant limitations. This article introduces REST (Reasoning Evaluation through Simultaneous Testing) — a novel multi-problem stress-testing framework designed to push LRMs beyond isolated problem-solving and better reflect their real-world multi-context reasoning capabilities.

Why Current Evaluation Benchmarks Fall Short for Large Reasoning Models

Most current benchmarks, such as GSM8K and MATH, evaluate LRMs by asking one question at a time. While effective for initial model development, this isolated question approach faces two critical drawbacks:

Decreasing Discriminative Power: Many state-of-the-art LRMs now achieve near-perfect scores on popular benchmarks (e.g., DeepSeek-R1 reaching 97% accuracy on MATH500). These saturated results make it increasingly difficult to distinguish true model improvements, forcing the expensive, continuous creation of harder datasets to differentiate capabilities.

Lack of Real-World Multi-Context Evaluation: Real-world applications — like educational tutoring, technical support, or multitasking AI assistants — require reasoning across multiple, potentially interfering questions simultaneously. Single-question testing does not capture these dynamic, multi-problem challenges that reflect true cognitive load and reasoning robustness.

Introducing REST: Stress-Testing LRMs with Multiple Problems at Once

To address these challenges, researchers from Tsinghua University, OpenDataLab, Shanghai AI Laboratory, and Renmin University developed REST, a simple yet powerful evaluation method that simultaneously tests LRMs on multiple questions bundled into a single prompt.

Multi-Question Benchmark Reconstruction: REST repurposes existing benchmarks by concatenating multiple questions into one prompt, adjusting the stress level parameter that controls how many questions are presented simultaneously (see the sketch after this list).

Comprehensive Evaluation: REST evaluates critical reasoning competencies beyond basic problem-solving — including contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management.

Wide Applicability: The framework is validated on 34 advanced LRMs ranging from 1.5 billion to 671 billion parameters, tested on 7 diverse benchmarks across varying difficulty levels (from simple GSM8K to challenging AIME and GPQA).
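To make the benchmark reconstruction concrete, here is a minimal sketch (not the authors' code) of how several benchmark questions can be bundled into a single REST-style prompt, with the stress level controlling how many questions appear at once:

def build_rest_prompt(questions: list[str], stress_level: int) -> str:
    """Bundle `stress_level` questions into one prompt and ask for separated answers."""
    selected = questions[:stress_level]
    numbered = "\n".join(f"Question {i + 1}: {q}" for i, q in enumerate(selected))
    return (
        "Solve all of the following problems. "
        "Answer each one separately, labeled 'Answer 1', 'Answer 2', and so on.\n\n"
        + numbered
    )

benchmark_questions = [
    "If 3x + 5 = 20, what is x?",
    "A train travels 120 km in 2 hours. What is its average speed?",
    "What is the sum of the first 10 positive integers?",
]
print(build_rest_prompt(benchmark_questions, stress_level=3))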

REST Reveals Key Insights About LRM Reasoning Abilities

The REST evaluation uncovers several groundbreaking findings:

1. Significant Performance Degradation Under Multi-Problem Stress

Even state-of-the-art LRMs like DeepSeek-R1 show notable accuracy drops when handling multiple questions together. For example, DeepSeek-R1’s accuracy on challenging benchmarks like AIME24 falls by nearly 30% under REST compared to isolated question testing. This contradicts prior assumptions that large language models are inherently capable of effortlessly multitasking across problems.

2. Enhanced Discriminative Power Among Similar Models

REST dramatically amplifies the differences between models with near-identical single-question scores. On MATH500, for instance:

R1-7B and R1-32B achieve close single-question accuracies of 93% and 94.6%, respectively.

Under REST, R1-7B’s accuracy plummets to 66.75% while R1-32B maintains a high 88.97%, revealing a stark gap of more than 22 percentage points.

Similarly, among same-sized models like AReaL-boba-RL-7B and OpenThinker2-7B, REST captures significant differences in multi-problem handling abilities that single-question evaluations mask.

3. Post-Training Methods May Not Guarantee Robust Multi-Problem Reasoning

Models fine-tuned with reinforcement learning or supervised tuning on single-problem reasoning often fail to preserve their advantages in REST’s multi-question setting. This calls for rethinking training strategies to optimize reasoning robustness under realistic multi-context scenarios.

4. “Long2Short” Training Enhances Performance Under Stress

Models trained with “long2short” techniques — which encourage concise and efficient reasoning chains — maintain higher accuracy under REST. This suggests a promising avenue for designing models better suited to simultaneous multi-problem reasoning.

How REST Simulates Realistic Reasoning Challenges

By increasing the cognitive load on LRMs through simultaneous problem presentation, REST simulates real-world demands where reasoning systems must dynamically prioritize, avoid overthinking one problem, and resist interference from concurrent tasks.

REST also systematically analyzes error types, revealing common failure modes such as:

Question Omission: Ignoring later questions in a multi-question prompt.

Summary Errors: Incorrectly summarizing answers across problems.

Reasoning Errors: Logical or calculation mistakes within the reasoning process.

These nuanced insights are largely invisible in single-question assessments.
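As a hedged illustration of how one of these failure modes can be surfaced automatically, the snippet below flags question omission by checking which answer labels appear in a model's output. The Q1/Q2 labeling convention is an assumption; the paper applies its own error taxonomy, which this sketch does not reproduce.

import re

def detect_omissions(model_output, n_questions):
    """Return the 1-based indices of questions that received no labeled answer."""
    answered = {int(m) for m in re.findall(r"\bQ(\d+)\s*:", model_output)}
    return [i for i in range(1, n_questions + 1) if i not in answered]

missing = detect_omissions("Q1: 4\nQ2: 8", n_questions=3)  # -> [3], i.e., the third question was omitted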

Practical Evaluation Setup and Benchmark Coverage

REST evaluated 34 LRMs spanning sizes from 1.5B to 671B parameters.

Benchmarks tested include:

Simple: GSM8K

Medium: MATH500, AMC23

Challenging: AIME24, AIME25, GPQA Diamond, LiveCodeBench

Model generation parameters are set according to official guidelines, with output token limits of 32K for reasoning models.

Using the standardized OpenCompass toolkit ensures consistent, reproducible results.

Conclusion: REST as a Future-Proof, Realistic LRM Evaluation Paradigm

REST constitutes a significant leap forward in evaluating large reasoning models by:

Addressing Benchmark Saturation: Revitalizes existing datasets without expensive full replacements.

Reflecting Real-World Multi-Task Demands: Tests models under realistic, high cognitive load conditions.

Guiding Model Development: Highlights the importance of training methods like Long2Short to mitigate overthinking and encourage adaptive reasoning focus.

In sum, REST paves the way for more reliable, robust, and application-relevant benchmarking of next-generation reasoning AI systems.

Check out the Paper, Project Page and Code. All credit for this research goes to the researchers of this project.

URBAN-SIM: Advancing Autonomous Micromobility with Scalable Urban Simu …

Micromobility solutions—such as delivery robots, mobility scooters, and electric wheelchairs—are rapidly transforming short-distance urban travel. Despite their growing popularity as flexible, eco-friendly transport alternatives, most micromobility devices still rely heavily on human control. This dependence limits operational efficiency and raises safety concerns, especially in complex, crowded city environments filled with dynamic obstacles like pedestrians and cyclists.

The Need for Autonomous Micromobility in Urban Spaces

Traditional transportation methods like cars and buses are ideal for long-distance travel but often struggle with last-mile connectivity—the final leg in urban journeys. Micromobility fills this gap by offering lightweight, low-speed devices that excel in short urban trips. However, true autonomy in micromobility remains elusive: current AI solutions tend to focus narrowly on specific tasks such as obstacle avoidance or simple navigation, failing to address the multifaceted challenges posed by real urban environments that include uneven terrain, stairs, and dense crowds.

Limitations of Existing Robot Learning and Simulation Platforms

Most simulation platforms for robot training are tailored for indoor environments or vehicle-centric road networks and lack the contextual richness and complexity found in urban sidewalks, plazas, and alleys. Meanwhile, highly efficient platforms often provide simplified scenes unsuitable for deep learning in environments with diverse obstacles and unpredictable pedestrian movements. This gap restricts the ability of AI agents to effectively learn critical skills for autonomous micromobility.

Introducing URBAN-SIM: High-Performance Simulation for Urban Micromobility

To address these challenges, researchers from the University of California, Los Angeles, and the University of Washington developed URBAN-SIM, a scalable, high-fidelity urban simulation platform designed explicitly for autonomous micromobility research.

Key Features of URBAN-SIM:

Hierarchical Urban Scene Generation: Procedurally creates infinitely diverse, large-scale urban environments—from street blocks to detailed terrain features—that include sidewalks, ramps, stairs, and uneven surfaces. This layered pipeline ensures a realistic and varied setting for robot training.

Interactive Dynamic Agent Simulation: Simulates responsive pedestrians, cyclists, and vehicles in real time on GPUs, enabling complex multi-agent interactions that mimic true urban dynamics.

Asynchronous Scene Sampling for Scalability: Enables parallel training of AI agents across hundreds of unique and complex urban scenes on a single GPU, dramatically boosting training speed and promoting robust policy learning.

Built on NVIDIA’s Omniverse and PhysX physics engine, URBAN-SIM combines realistic visual rendering with precision physics for authentic embodied AI training.

URBAN-BENCH: Comprehensive Benchmark Suite for Real-World Skills

Complementing URBAN-SIM, the team created URBAN-BENCH, a task suite and benchmark framework that captures essential autonomous micromobility capabilities grounded in actual urban usage scenarios. URBAN-BENCH includes:

Urban Locomotion Tasks: Traversing flat surfaces, slopes, stairs, and rough terrain to ensure stable and efficient robot movement.

Urban Navigation Tasks: Navigating clear pathways, avoiding static obstacles like benches and trash bins, and managing dynamic obstacles such as moving pedestrians and cyclists.

Urban Traverse Task: A challenging kilometer-scale journey combining complex terrains, obstacles, and dynamic agents, designed to test long-horizon navigation and decision-making.

Human-AI Shared Autonomy Approach

For the long-distance urban traverse task, URBAN-BENCH introduces a human-AI shared autonomy model. This flexible control architecture decomposes the robot’s control system into layers—high-level decision making, mid-level navigation, and low-level locomotion—allowing humans to intervene in complex or risky scenarios while enabling AI to manage routine navigation and movement. This collaboration balances safety and efficiency in dynamic urban settings.

Evaluating Diverse Robots in Realistic Tasks

URBAN-SIM and URBAN-BENCH support a wide range of robotic platforms, including wheeled, quadruped, wheeled-legged, and humanoid robots. Benchmarks reveal unique strengths and weaknesses for each robot type across locomotion and navigation challenges, illustrating the platform’s generalizability.

For example:

Quadruped robots excel in stability and stair traversal.

Wheeled robots perform best on clear, flat paths.

Wheeled-legged robots leverage their hybrid design for combined terrain adaptability.

Humanoid robots effectively navigate narrow, crowded urban spaces by sidestepping.

Scalability and Training Efficiency

The asynchronous scene sampling strategy enables training across diverse urban scenes, demonstrating up to a 26.3% performance improvement over synchronous training methods. Increasing the diversity of training environments directly correlates with higher success rates in navigation tasks, highlighting the necessity of large-scale, varied simulation for robust autonomous micromobility.

Conclusion

URBAN-SIM and URBAN-BENCH represent vital steps toward enabling safe, efficient, and scalable autonomous micromobility in complex urban settings. Future work aims to bridge simulation and real-world deployment through ROS 2 integration and sim-to-real transfer techniques. Additionally, the platform will evolve to incorporate multi-modal perception and manipulation capabilities necessary for comprehensive urban robot applications such as parcel delivery and assistive robotics.

By enabling scalable training and benchmarking of embodied AI agents in authentic urban scenarios, this research catalyzes progress in autonomous micromobility—promoting sustainable urban development, enhancing accessibility, and improving safety in public spaces.

Check out the Paper and Code. All credit for this research goes to the researchers of this project.

How Memory Transforms AI Agents: Insights and Leading Solutions in 202 …

The importance of memory in AI agents cannot be overstated. As artificial intelligence matures from simple statistical models to autonomous agents, the ability to remember, learn, and adapt becomes a foundational capability. Memory distinguishes basic reactive bots from truly interactive, context-aware digital entities capable of supporting nuanced, humanlike interactions and decision-making.

Why Is Memory Vital in AI Agents?

Context Retention: Memory enables AI agents to hold onto conversation history, user preferences, and goal states across multiple interactions. This ability delivers personalized, coherent, and contextually correct responses even during extended or multi-turn conversations.

Learning and Adaptation: With memory, agents can learn from both successes and failures, refining behavior continuously without retraining. Remembering past outcomes, errors, or exceptional user requests helps them become more accurate and reliable over time.

Predictive and Proactive Behavior: Recalling historical patterns allows AI to anticipate user needs, detect anomalies, or even prevent potential problems before they occur.

Long-term Task Continuity: For workflows or projects spanning multiple sessions, memory lets agents pick up where they left off and maintain continuity across complex, multi-step processes.

Types of Memory in AI Agents

Short-Term Memory (Working/Context Window): Temporarily retains recent interactions or data for immediate reasoning.

Long-Term Memory: Stores knowledge, facts, and experiences over extended periods. Forms include:

Episodic Memory: Remembers specific events, cases, or conversations.

Semantic Memory: Holds general knowledge such as rules, facts, or domain expertise.

Procedural Memory: Encodes learned skills and complex routines, often through reinforcement learning or repeated exposure.
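As a toy sketch (not the API of any of the platforms discussed below), the class here shows how these memory types often map onto simple data structures: a bounded window for short-term context, an append-only episodic log, a key-value semantic store, and a procedural registry.

from collections import deque

class AgentMemory:
    """Toy illustration of short-term, episodic, semantic, and procedural memory."""
    def __init__(self, window=10):
        self.short_term = deque(maxlen=window)  # working memory: only the most recent turns
        self.episodic = []                      # specific events and past conversations
        self.semantic = {}                      # facts, rules, and domain knowledge
        self.procedural = {}                    # learned routines keyed by task name

    def observe(self, turn):
        self.short_term.append(turn)
        self.episodic.append(turn)

    def remember_fact(self, key, value):
        self.semantic[key] = value

mem = AgentMemory()
mem.observe({"role": "user", "text": "My name is Dana."})
mem.remember_fact("user_name", "Dana")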

4 Prominent AI Agent Memory Platforms (2025)

A flourishing ecosystem of memory solutions has emerged, each with unique architectures and strengths. Here are four leading platforms:

1. Mem0

Architecture: Hybrid—combines vector stores, knowledge graphs, and key-value models for flexible and adaptive recall.

Strengths: High accuracy (+26% over OpenAI’s in recent tests), rapid response, deep personalization, powerful search and multi-level recall capabilities.

Use Case Fit: For agent builders demanding fine-tuned control and bespoke memory structures, especially in complex (multi-agent or domain-specific) workflows.

2. Zep

Architecture: Temporal knowledge graph with structured session memory.

Strengths: Designed for scale; easy integration with frameworks like LangChain and LangGraph. Dramatic latency reductions (90%) and improved recall accuracy (+18.5%).

Use Case Fit: For production pipelines needing robust, persistent context and rapid deployment of LLM-powered features at enterprise scale.

3. LangMem

Architecture: Summarization-centric; minimizes memory footprint via smart chunking and selective recall, prioritizing essential info.

Strengths: Ideal for conversational agents with limited context windows or API call constraints.

Use Case Fit: Chatbots, customer support agents, or any AI that operates with constrained resources.

4. Memary

Architecture: Knowledge-graph focus, designed to support reasoning-heavy tasks and cross-agent memory sharing.

Strengths: Persistent modules for preferences, conversation “rewind,” and knowledge graph expansion.

Use Case Fit: Long-running, logic-intensive agents (e.g., in legal, research, or enterprise knowledge management).

Memory as the Foundation for Truly Intelligent AI

Today, memory is a core differentiator in advanced agentic AI systems. It unlocks authentic, adaptive, and goal-driven behavior. Platforms like Mem0, Zep, LangMem, and Memary represent the new standard in endowing AI agents with robust, efficient, and contextually relevant memory—paving the way for agents that aren’t just “intelligent,” but continuously evolving partners in work and life.

Check out the Paper, Project and GitHub Page. All credit for this research goes to the researchers of this project.

EraRAG: A Scalable, Multi-Layered Graph-Based Retrieval System for Dyn …

Large Language Models (LLMs) have revolutionized many areas of natural language processing, but they still face critical limitations when dealing with up-to-date facts, domain-specific information, or complex multi-hop reasoning. Retrieval-Augmented Generation (RAG) approaches aim to address these gaps by allowing language models to retrieve and integrate information from external sources. However, most existing graph-based RAG systems are optimized for static corpora and struggle with efficiency, accuracy, and scalability when the data is continually growing—such as in news feeds, research repositories, or user-generated online content.

Introducing EraRAG: Efficient Updates for Evolving Data

Recognizing these challenges, researchers from Huawei, The Hong Kong University of Science and Technology, and WeBank have developed EraRAG, a novel retrieval-augmented generation framework purpose-built for dynamic, ever-expanding corpora. Rather than rebuilding the entire retrieval structure whenever new data arrives, EraRAG relies on localized, selective updates that touch only those parts of the retrieval graph affected by the changes.

Core Features:

Hyperplane-Based Locality-Sensitive Hashing (LSH): The corpus is chunked into small text passages, which are embedded as vectors. EraRAG then uses randomly sampled hyperplanes to project these vectors into binary hash codes—a process that groups semantically similar chunks into the same “bucket.” This LSH-based approach maintains both semantic coherence and efficient grouping (see the sketch after this list).

Hierarchical, Multi-Layered Graph Construction: The core retrieval structure in EraRAG is a multi-layered graph. At each layer, segments (or buckets) of similar text are summarized using a language model. Segments that are too large are split, while those too small are merged—ensuring both semantic consistency and balanced granularity. Summarized representations at higher layers enable efficient retrieval for both fine-grained and abstract queries.

Incremental, Localized Updates: When new data arrives, its embedding is hashed using the original hyperplanes—ensuring consistency with the initial graph construction. Only the buckets/segments directly impacted by new entries are updated, merged, split, or re-summarized, while the rest of the graph remains untouched. The update propagates up the graph hierarchy, but always remains localized to the affected region, saving significant computation and token costs.

Reproducibility and Determinism: Unlike standard LSH clustering, EraRAG preserves the set of hyperplanes used during initial hashing. This makes bucket assignment deterministic and reproducible, which is crucial for consistent, efficient updates over time.
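For intuition, a minimal sketch of hyperplane LSH bucketing is shown below. The embedding dimension, number of hyperplanes, and fixed random seed are illustrative assumptions rather than EraRAG's actual settings; the key point is that keeping the same hyperplanes makes later insertions hash into consistent buckets.

import numpy as np

rng = np.random.default_rng(seed=0)

def make_hyperplanes(dim, n_planes):
    """Sample hyperplanes once and keep them, so later insertions hash consistently."""
    return rng.normal(size=(n_planes, dim))

def lsh_bucket(embedding, hyperplanes):
    """Project the embedding onto each hyperplane; the sign pattern is the binary bucket code."""
    bits = (hyperplanes @ embedding) > 0
    return "".join("1" if b else "0" for b in bits)

planes = make_hyperplanes(dim=384, n_planes=8)  # 8 bits -> at most 256 buckets (illustrative numbers)
bucket = lsh_bucket(np.ones(384) / np.sqrt(384), planes)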

Performance and Impact

Comprehensive experiments on a variety of question answering benchmarks demonstrate that EraRAG:

Reduces Update Costs: Achieves up to 95% reduction in graph reconstruction time and token usage compared to leading graph-based RAG methods (e.g., GraphRAG, RAPTOR, HippoRAG).

Maintains High Accuracy: EraRAG consistently outperforms other retrieval architectures in both accuracy and recall—across static, growing, and abstract question answering tasks—with minimal compromise in retrieval quality or multi-hop reasoning capabilities.

Supports Versatile Query Needs: The multi-layered graph design allows EraRAG to efficiently retrieve fine-grained factual details or high-level semantic summaries, tailoring its retrieval pattern to the nature of each query.

Practical Implications

EraRAG offers a scalable and robust retrieval framework ideal for real-world settings where data is continuously added—such as live news, scholarly archives, or user-driven platforms. It strikes a balance between retrieval efficiency and adaptability, making LLM-backed applications more factual, responsive, and trustworthy in fast-changing environments.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

FEEDER: A Pre-Selection Framework for Efficient Demonstration Selectio …

LLMs have demonstrated exceptional performance across multiple tasks by utilizing few-shot inference, also known as in-context learning (ICL). The main problem lies in selecting the most representative demonstrations from large training datasets. Early methods selected demonstrations based on relevance using similarity scores between each example and the input question. Current methods suggest using additional selection rules, along with similarity, to enhance the efficiency of demonstration selection. These improvements introduce significant computational overhead when the number of shots increases. The effectiveness of selected demonstrations should also consider the specific LLM in use, as different LLMs exhibit varying capabilities and knowledge domains.

Researchers from Shanghai Jiao Tong University, Xiaohongshu Inc., Carnegie Mellon University, Peking University, University College London, and the University of Bristol have proposed FEEDER (FEw yet Essential Demonstration prE-selectoR), a method to identify a core subset of demonstrations containing the most representative examples in the training data, tailored to specific LLMs. To construct this subset, “sufficiency” and “necessity” metrics are introduced in the pre-selection stage, along with a tree-based algorithm. Moreover, FEEDER reduces training data size by 20% while maintaining performance and integrates seamlessly with various downstream demonstration selection techniques in ICL, across LLMs ranging from 300M to 8B parameters.
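To give a flavor of the sufficiency/necessity idea, here is a deliberately rough sketch that prunes demonstrations the target LLM does not actually need. It is not the paper's tree-based algorithm; the helper llm_answers_correctly is a hypothetical wrapper around your own LLM call, and a real implementation would be far more sample-efficient.

def preselect_demos(candidates, dev_set, llm_answers_correctly):
    """Greedy sketch: drop a demo if dev-set accuracy does not fall without it ('unnecessary')."""
    core = list(candidates)
    for demo in list(core):
        without = [d for d in core if d is not demo]
        acc_with = sum(llm_answers_correctly(core, ex) for ex in dev_set)
        acc_without = sum(llm_answers_correctly(without, ex) for ex in dev_set)
        if acc_without >= acc_with:
            core = without  # the remaining demos are still 'sufficient' for the dev set
    return core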

FEEDER is evaluated on 6 text classification datasets: SST-2, SST-5, COLA, TREC, SUBJ, and FPB, covering tasks from sentiment classification and linguistic analysis to textual entailment. It is also evaluated on the reasoning dataset GSM8K, the semantic-parsing dataset SMCALFlow, and the scientific question-answering dataset GPQA. The official splits for each dataset are directly followed to get the training and test data. Moreover, multiple LLM variants are utilized to evaluate the performance of the method, including two GPT-2 variants, GPT-neo with 1.3B parameters, GPT-3 with 6B parameters, Gemma-2 with 2B parameters, Llama-2 with 7B parameters, Llama-3 with 8B parameters, and Qwen-2.5 with 32B parameters as the LLM base.

Results regarding in-context learning performance show that FEEDER enables retention of almost half the training samples while achieving superior or comparable performance. Evaluation of few-shot performance on complex tasks using LLMs like Gemma-2 shows that FEEDER improves performance even when LLMs struggle with challenging tasks. It performs effectively with large numbers of shots, handling situations where LLM performance usually drops as the number of examples increases from 5 to 10 due to noisy or repeated demonstrations. Moreover, FEEDER minimizes negative impact on LLM performance by evaluating the sufficiency and necessity of each demonstration, and it helps stabilize LLM performance.

On bi-level optimization, FEEDER achieves improved performance by utilizing a small yet high-quality dataset for fine-tuning while simultaneously reducing computational expenses, aligning with the core-set selection principle. Results indicate that fine-tuning LLMs provides greater performance improvements compared to augmenting LLMs with contexts, with FEEDER achieving even better performance gains in fine-tuning settings. Performance analysis reveals that FEEDER’s effectiveness first rises and then drops as the number of runs or rounds (R and K, respectively) increases, confirming that identifying representative subsets from the training data enhances LLM performance. However, overly narrow subsets may limit potential performance gains.

In conclusion, researchers introduced FEEDER, a demonstration pre-selector designed to use LLM capabilities and domain knowledge to identify high-quality demonstrations through an efficient discovery approach. It reduces training data requirements while maintaining comparable performance, offering a practical solution for efficient LLM deployment. Future research directions include exploring applications with larger LLMs and extending FEEDER’s capabilities to areas such as data safety and data management. FEEDER makes a valuable contribution to demonstration selection, providing researchers and practitioners with an effective tool for optimizing LLM performance while reducing computational overhead.

Check out the Paper. All credit for this research goes to the researchers of this project.


Alibaba Qwen Introduces Qwen3-MT: Next-Gen Multilingual Machine Transl …

Alibaba has introduced Qwen3-MT (qwen-mt-turbo) via Qwen API, its latest and most advanced machine translation model, designed to break language barriers with unprecedented accuracy, speed, and flexibility. Trained on trillions of multilingual tokens, Qwen3-MT supports over 92 languages—covering more than 95% of the global population. Leveraging cutting-edge architecture, reinforcement learning, and rich customization options, it delivers top-tier translation quality at a fraction of the cost and latency of traditional systems.

Model Architecture and Training Data

Qwen3-MT is built on Alibaba’s sophisticated Qwen3 transformer architecture, enhanced with a lightweight Mixture-of-Experts (MoE) backbone. This design balances computational efficiency with deep contextual understanding to optimize translation quality.

Scale: Trained on trillions of tokens spanning diverse languages, domains, and registers, ranging from formal legal texts to colloquial dialogue and technical literature.

Multilinguality: The expansive dataset ensures nuanced grasp of syntax, semantics, idioms, and cultural context across language pairs.

Reinforcement Learning: Continuous fine-tuning via reinforcement learning allows the model to adapt dynamically for greater fluency, accuracy, and idiomatic expression based on real-world feedback.

Multilingual Coverage and Population Reach

Supporting 92+ languages, Qwen3-MT addresses a vast global audience across numerous language families including:

Indo-European: English, French, Spanish, Russian, Hindi, Bengali, German
Sino-Tibetan: Chinese (Simplified, Traditional, Cantonese), Burmese
Afro-Asiatic: Arabic (with dialectal variations), Hebrew, Maltese
Austronesian: Indonesian, Malay, Tagalog
Dravidian: Tamil, Telugu, Kannada
Turkic: Turkish, Kazakh, Uzbek
Others: Japanese, Korean, Thai, Vietnamese, Swahili, Basque

These supported languages collectively cover over 95% of the world’s population, empowering enterprises and developers to build truly global multilingual experiences.

Benchmark and Evaluation Performance

Automatic Metrics

Qwen3-MT achieves leading BLEU scores on prominent benchmarks such as:

Chinese-English and English-German test sets, outperforming models like GPT-4.1-mini and Gemini-2.5-Flash.

The WMT24 multilingual benchmark, delivering comparable translation fidelity to massive models like GPT-4.1 and Gemini-2.5-Pro, but operating at significantly lower computational cost.

Its MoE architecture enables this efficiency by activating only specialized subsets of the model per request, reducing inference time and cost.

Human Evaluation

Triple-blind human assessments covering ten major languages (e.g., English, Chinese, Japanese, Arabic, Spanish) demonstrate that Qwen3-MT leads in:

Acceptance Rate: Higher frequency of useable translations accepted by professional translators.

Excellence Rate: More translations rated “excellent” for fluency, semantic precision, and contextual fidelity.

These metrics confirm real-world translation quality beyond automated scoring.

Performance, Scalability, and Cost Efficiency

Ultra-fast Inference: Thanks to MoE and optimized routing, Qwen3-MT delivers low latency that supports real-time applications such as live chat and streaming translation.

High Concurrency: It can serve thousands of simultaneous translation requests efficiently, suitable for large-scale SaaS, e-commerce, and media platforms.

Cost-effective Pricing: Starting at $0.5 per million tokens, it dramatically reduces costs compared to dense, fully-activated large models.

Visual comparisons indicate that Qwen3-MT maintains a leading position in balancing speed, cost, and translation quality.

Customization and Domain Adaptability

Qwen3-MT offers advanced options for domain-specific customization:

Terminology Control: Users can enforce consistent translation of brand names, technical terms, or jargon via direct glossary injection.

Domain Prompts: Custom prompts tailor translation style and tone—legal, medical, conversational, or technical—enhancing contextual appropriateness.

Translation Memory Integration: Adaptive reuse of user corrections and past translations accelerates workflows and boosts consistency, especially across lengthy projects.

Such extensibility makes Qwen3-MT an excellent fit for enterprises with specialized language requirements.

Reinforcement Learning: Enhancing Translation Fluency

By continuously incorporating post-editing feedback and user interaction data, Qwen3-MT’s reinforcement learning pipeline iteratively refines:

Context preservation and idiomatic correctness across languages.

Reduction of critical errors tailored to domain complexity.

Real-time adaptation to evolving linguistic trends and user preferences.

This lifelong learning approach ensures translation relevance and accuracy over time.

API Access and Deployment

Qwen API: Provides RESTful endpoints and SDKs for seamless integration into web, mobile, and backend systems.

Flexible Deployment: Supports cloud, edge, and hybrid architectures, alongside batch translation mode for high-volume processing.

Highly Reliable: Engineered for enterprise-level SLAs with robust monitoring and uptime guarantees.
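The announcement does not include client code, but as one hedged illustration, qwen-mt-turbo is typically reachable through an OpenAI-compatible endpoint. The base URL, the translation_options fields, and the glossary format below are assumptions to verify against the official API documentation before use.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # assumption: key issued by Alibaba Cloud Model Studio/DashScope
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumption: OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="qwen-mt-turbo",
    messages=[{"role": "user", "content": "我们的新品下周发布。"}],
    extra_body={
        "translation_options": {  # assumption: translation controls are passed this way
            "source_lang": "auto",
            "target_lang": "English",
            "terms": [{"source": "新品", "target": "new product line"}],  # glossary/terminology injection
        }
    },
)
print(resp.choices[0].message.content)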

Application Scenarios

Qwen3-MT is powering:

E-commerce Localization: Translating product descriptions, reviews, and customer inquiries in real time.

Content Management: Automated news, documentation, and educational content localization.

Customer Service: Multilingual automation for ticketing, chatbots, and virtual assistants, improving customer experience worldwide.

Competitive Positioning

Feature | Qwen3-MT | Google Translate | Azure Translator | AWS Translate
Languages Supported | 92+ | 100+ | 90+ | 75+
Context Awareness | High | Medium | Medium | Medium
Reinforcement Learning | Yes | Limited | No | No
Batch Processing | Yes | Yes | Yes | Yes
Real-time Capability | Yes | Yes | Yes | Yes
Custom Models | Yes | Yes | Yes | Yes
Starting Price | $0.5/million tokens | Pay-per-use | Pay-per-use | Pay-per-use

Qwen3-MT’s combination of translation quality, cost-effectiveness, and extensibility places it firmly among the top-tier MT solutions available today.

Conclusion

Alibaba’s Qwen3-MT represents a remarkable advance in machine translation technology, delivering broad multilingual reach, superior translation fidelity validated by both automatic and human evaluations, and enterprise-ready speed and cost-efficiency. Its novel Mixture-of-Experts architecture paired with reinforcement learning ensures that Qwen3-MT is adaptable, scalable, and future-proof—empowering developers and businesses to communicate seamlessly across languages at global scale.

Check out the Hugging Face Demo, ModelScope Demo, API Doc and Technical Details. All credit for this research goes to the researchers of this project.


Build an intelligent eDiscovery solution using Amazon Bedrock Agents

Legal teams spend the bulk of their time manually reviewing documents during eDiscovery. This process involves analyzing electronically stored information across emails, contracts, financial records, and collaboration systems for legal proceedings. This manual approach creates significant bottlenecks: attorneys must identify privileged communications, assess legal risks, extract contractual obligations, and maintain regulatory compliance across thousands of documents per case. The process is not only resource-intensive and time-consuming, but also prone to human error when dealing with large document volumes.
Amazon Bedrock Agents with multi-agent collaboration directly addresses these challenges by helping organizations deploy specialized AI agents that process documents in parallel while maintaining context across complex legal workflows. Instead of sequential manual review, multiple agents work simultaneously—one extracts contract terms while another identifies privileged communications, all coordinated by a central orchestrator. This approach can reduce document review time by 60–70% while maintaining the accuracy and human oversight required for legal proceedings, though actual performance varies based on document complexity and foundation model (FM) selection.
In this post, we demonstrate how to build an intelligent eDiscovery solution using Amazon Bedrock Agents for real-time document analysis. We show how to deploy specialized agents for document classification, contract analysis, email review, and legal document processing, all working together through a multi-agent architecture. We walk through the implementation details, deployment steps, and best practices to create an extensible foundation that organizations can adapt to their specific eDiscovery requirements.
Solution overview
This solution demonstrates an intelligent document analysis system using Amazon Bedrock Agents with multi-agent collaboration functionality. The system uses multiple specialized agents to analyze legal documents, classify content, assess risks, and provide structured insights. The following diagram illustrates the solution architecture.

The architecture diagram shows three main workflows for eDiscovery document analysis:

Real-time document analysis workflow – Attorneys and clients (authenticated users) can upload documents and interact through mobile/web clients and chat. Documents are processed in real time for immediate analysis without persistent storage—uploaded documents are passed directly to the Amazon Bedrock Collaborator Agent endpoint.
Case research document analysis workflow – This workflow is specifically for attorneys (authenticated users). It allows document review and analysis through mobile/web clients and chat. It’s focused on the legal research aspects of previously processed documents.
Document upload workflow – Law firm clients (authenticated users) can upload documents through mobile/web clients. Documents are transferred by using AWS Transfer Family web apps to an Amazon Simple Storage Service (Amazon S3) bucket for storage.

Although this architecture supports all three workflows, this post focuses specifically on implementing the real-time document analysis workflow for two key reasons: it represents the core functionality that delivers immediate value to legal teams, and it provides the foundational patterns that can be extended to support the other workflows. The real-time processing capability demonstrates the multi-agent coordination that makes this solution transformative for eDiscovery operations.
Real-time document analysis workflow
This workflow processes uploaded documents through coordinated AI agents, typically completing analysis within 1–2 minutes of upload. The system accelerates early case assessment by providing structured insights immediately, compared to traditional manual review that can take hours per document. The implementation coordinates five specialized agents that process different document aspects in parallel, listed in the following table.

Agent Type | Primary Function | Processing Time* | Key Outputs
Collaborator Agent | Central orchestrator and workflow manager | 2–5 seconds | Document routing decisions, consolidated results
Document Classification Agent | Initial document triage and sensitivity detection | 5–10 seconds | Document type, confidence scores, sensitivity flags
Email Analysis Agent | Communication pattern analysis | 10–20 seconds | Participant maps, conversation threads, timelines
Legal Document Analysis Agent | Court filing and legal brief analysis | 15–30 seconds | Case citations, legal arguments, procedural dates
Contract Analysis Agent | Contract terms and risk assessment | 20–40 seconds | Party details, key terms, obligations, risk scores

*Processing times are estimates based on testing with Anthropic’s Claude 3.5 Haiku on Amazon Bedrock and might vary depending on document complexity and size. Actual performance in your environment may differ.
Let’s explore an example of processing a sample legal settlement agreement. The workflow consists of the following steps:

The Collaborator Agent identifies the document as requiring both contract and legal analysis.
The Contract Analysis Agent extracts parties, payment terms, and obligations (40 seconds).
The Legal Document Analysis Agent identifies case references and precedents (30 seconds).
The Document Classification Agent flags confidentiality levels (10 seconds).
The Collaborator Agent consolidates findings into a comprehensive report (15 seconds).

Total processing time is approximately 95 seconds for the sample document, compared to 2–4 hours of manual review for similar documents. In the following sections, we walk through deploying the complete eDiscovery solution, including Amazon Bedrock Agents, the Streamlit frontend, and necessary AWS resources.
Prerequisites
Make sure you have the following prerequisites:

An AWS account with appropriate permissions for Amazon Bedrock, AWS Identity and Access Management (IAM), and AWS CloudFormation.
Amazon Bedrock model access for Anthropic’s Claude 3.5 Haiku v1 in your deployment AWS Region. You can use a different supported model of your choice for this solution. If you use a different model than the default (Anthropic’s Claude 3.5 Haiku v1), you must modify the CloudFormation template to reflect your chosen model’s specifications before deployment. At the time of writing, Anthropic’s Claude 3.5 Haiku is available in US East (N. Virginia), US East (Ohio), and US West (Oregon). For current model availability, see Model support by AWS Region.
The AWS Command Line Interface (AWS CLI) installed and configured with appropriate credentials.
Python 3.8+ installed.
Terminal or command prompt access.

Deploy the AWS infrastructure
You can deploy the following CloudFormation template, which creates the five Amazon Bedrock agents, inference profile, and supporting IAM resources. (Costs will be incurred for the AWS resources used). Complete the following steps:

Launch the CloudFormation stack.

You will be redirected to the AWS CloudFormation console. In the stack parameters, the template URL will be prepopulated.

For EnvironmentName, enter a name for your deployment (default: LegalBlogSetup).
Review and create the stack.

After successful deployment, note the following values from the CloudFormation stack’s Outputs tab:

CollabBedrockAgentId
CollabBedrockAgentAliasId
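If you prefer the command line, one way to read these two outputs (assuming the default stack name LegalBlogSetup; substitute the name you chose) is shown below.

aws cloudformation describe-stacks \
  --stack-name LegalBlogSetup \
  --query "Stacks[0].Outputs[?OutputKey=='CollabBedrockAgentId' || OutputKey=='CollabBedrockAgentAliasId'].[OutputKey,OutputValue]" \
  --output table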

Configure AWS credentials
Test whether your AWS credentials are working:

aws sts get-caller-identity

If you need to configure credentials, use the following command:

aws configure

Set up the local environment
Complete the following steps to set up your local environment:

Create a new directory for your project:

mkdir bedrock-document-analyzer
cd bedrock-document-analyzer

Set up a Python virtual environment:

python -m venv venv
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

Download the Streamlit application:

curl -O https://aws-blogs-artifacts-public.s3.us-east-1.amazonaws.com/ML-18253/eDiscovery-LegalBlog-UI.py

Install dependencies:

pip install streamlit boto3 PyPDF2 python-docx

Configure and run the application
Complete the following steps:

Run the downloaded Streamlit frontend UI file eDiscovery-LegalBlog-UI.py by executing the following command in your terminal or command prompt:

streamlit run eDiscovery-LegalBlog-UI.py

This command will start the Streamlit server and automatically open the application in your default web browser.

Under Agent configuration, provide the following values:

For AWS_REGION, enter your Region.
For AGENT_ID, enter the Amazon Bedrock Collaborator Agent ID.
For AGENT_ALIAS_ID, enter the Amazon Bedrock Collaborator Agent Alias ID.

Choose Save Configuration.

Now you can upload documents (TXT, PDF, and DOCX) to analyze and interact with.
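Behind the scenes, the frontend calls the Collaborator Agent through the Amazon Bedrock Agents runtime API. The following minimal boto3 sketch shows the same call pattern outside Streamlit; the agent ID and alias ID are the stack outputs noted earlier, and the document text and prompt are illustrative.

import uuid
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")  # use your deployment Region

document_text = "SETTLEMENT AGREEMENT between Acme Corp and Beta LLC ..."  # illustrative content

response = client.invoke_agent(
    agentId="YOUR_COLLAB_AGENT_ID",      # CollabBedrockAgentId output
    agentAliasId="YOUR_AGENT_ALIAS_ID",  # CollabBedrockAgentAliasId output
    sessionId=str(uuid.uuid4()),
    inputText="Classify this document and summarize the key obligations:\n\n" + document_text,
)

# The agent's response streams back as chunks of bytes.
completion = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(completion)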
Test the solution
The following is a demonstration of testing the application.

Implementation considerations
Although Amazon Bedrock Agents significantly streamlines eDiscovery workflows, organizations should consider several key factors when implementing AI-powered document analysis solutions. Consider the following legal industry requirements for compliance and governance:

Attorney-client privilege protection – AI systems must maintain confidentiality boundaries and can’t expose privileged communications during processing
Cross-jurisdictional compliance – GDPR, CCPA, and industry-specific regulations vary by region and case type
Audit trail requirements – Legal proceedings demand comprehensive processing documentation for all AI-assisted decisions
Professional responsibility – Lawyers remain accountable for AI outputs and must demonstrate competency in deployed tools

You might encounter technical implementation challenges, such as document processing complexity:

Variable document quality – Scanned PDFs, handwritten annotations, and corrupted files require preprocessing strategies
Format diversity – Legal documents span emails, contracts, court filings, and multimedia content requiring different processing approaches
Scale management – Large cases involving over 100,000 documents require careful resource planning and concurrent processing optimization

The system integration also has specific requirements:

Legacy system compatibility – Most law firms use established case management systems that need seamless integration
Authentication workflows – Multi-role access (attorneys, paralegals, clients) with different permission levels
AI confidence thresholds – Determining when human review is required based on processing confidence scores

Additionally, consider your human/AI collaboration framework. The most successful eDiscovery implementations maintain human oversight at critical decision points. Although Amazon Bedrock Agents excels at automating routine tasks like document classification and metadata extraction, legal professionals remain essential for the following factors:

Complex legal interpretations requiring contextual understanding
Privilege determinations that impact case strategy
Quality control of AI-generated insights
Strategic analysis of document relationships and case implications

This collaborative approach optimizes the eDiscovery process—AI handles time-consuming data processing while legal professionals focus on high-stakes decisions requiring human judgment and expertise. For your implementation strategy, consider a phased deployment approach. Organizations should implement staged rollouts to minimize risk while building confidence:

Pilot programs using lower-risk document categories (routine correspondence, standard contracts)
Controlled expansion with specialized agents and broader user base
Full deployment enabling complete multi-agent collaboration organization-wide

Lastly, consider the following success planning best practices:

Establish clear governance frameworks for model updates and version control
Create standardized testing protocols for new agent deployments
Develop escalation procedures for edge cases requiring human intervention
Implement parallel processing during validation periods to maintain accuracy

By addressing these considerations upfront, legal teams can facilitate smoother implementation and maximize the benefits of AI-powered document analysis while maintaining the accuracy and oversight required for legal proceedings.
Clean up
If you decide to discontinue using the solution, complete the following steps to remove it and its associated resources deployed using AWS CloudFormation:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Locate the stack you created during the deployment process (you assigned a name to it).
Select the stack and choose Delete.

Results
Amazon Bedrock Agents transforms eDiscovery from time-intensive manual processes into efficient AI-powered operations, delivering measurable operational improvements across business services organizations. With a multi-agent architecture, organizations can process documents in 1–2 minutes compared to 2–4 hours of manual review for similar documents, achieving a 60–70% reduction in review time while maintaining accuracy and compliance requirements.
A representative implementation from the financial services sector demonstrates this transformative potential: a major institution transformed their compliance review process from a 448-page manual workflow requiring over 10,000 hours to an automated system that reduced external audit times from 1,000 to 300–400 hours and internal audits from 800 to 320–400 hours. The institution now conducts 30–40 internal reviews annually with existing staff while achieving greater accuracy and consistency across assessments.
These results demonstrate the potential across implementations: organizations implementing this solution can progress from initial efficiency gains in pilot phases to a 60–70% reduction in review time at full deployment. Beyond time savings, the solution delivers strategic advantages, including resource optimization that helps legal professionals focus on high-value analysis rather than routine document processing, improved compliance posture through systematic identification of privileged communications, and future-ready infrastructure that adapts to evolving legal technology requirements.
Conclusion
The combination of Amazon Bedrock multi-agent collaboration, real-time processing capabilities, and the extensible architecture provided in this post offers legal teams immediate operational benefits while positioning them for future AI advancements—creating the powerful synergy of AI efficiency and human expertise that defines modern legal practice.
To learn more about Amazon Bedrock, refer to the following resources:

GitHub repo: Amazon Bedrock Workshop
Amazon Bedrock User Guide
Workshop: GenAI for AWS Cloud Operations
Workshop: Using generative AI on AWS for diverse content types

About the authors
Puneeth Ranjan Komaragiri is a Principal Technical Account Manager at AWS. He is particularly passionate about monitoring and observability, cloud financial management, and generative AI domains. In his current role, Puneeth enjoys collaborating closely with customers, using his expertise to help them design and architect their cloud workloads for optimal scale and resilience.
Pramod Krishna is a Senior Solutions Architect at AWS. He works as a trusted advisor for customers, helping customers innovate and build well-architected applications in AWS Cloud. Outside of work, Krishna enjoys reading, music, and traveling.
Sean Gifts is a Senior Technical Account Manager at AWS. He is excited about helping customers with application modernization, specifically event-driven architectures that use serverless frameworks. Sean enjoys helping customers improve their architecture with simple, scalable solutions. Outside of work, he enjoys exercising, trying new foods, and traveling.

How PerformLine uses prompt engineering on Amazon Bedrock to detect co …

This post is co-written with Bogdan Arsenie and Nick Mattei from PerformLine.
PerformLine operates within the marketing compliance industry, a specialized subset of the broader compliance software market, which includes various compliance solutions like anti-money laundering (AML), know your customer (KYC), and others. Specifically, marketing compliance refers to adhering to regulations and guidelines set by government agencies that make sure a company’s marketing, advertising, and sales content and communications are truthful, accurate, and not misleading for consumers. PerformLine is the leading service providing comprehensive compliance oversight across marketing, sales, and partner channels. As pioneers of the marketing compliance industry, PerformLine has conducted over 1.1 billion compliance observations over the past 10+ years, automating the entire compliance process—from pre-publication review of materials to continuous monitoring of consumer-facing channels such as websites, emails, and social media. Trusted by consumer finance brands and global organizations, PerformLine uses AI-driven solutions to protect brands and their consumers, transforming compliance efforts into a competitive advantage.
“Discover. Monitor. Act. This isn’t just our tagline—it’s the foundation of our innovation at PerformLine,” says PerformLine’s CTO Bogdan Arsenie. PerformLine’s engineering team brings these principles to life by developing AI-powered technology solutions. In this post, PerformLine and AWS explore how PerformLine used Amazon Bedrock to accelerate compliance processes, generate actionable insights, and provide contextual data—delivering the speed and accuracy essential for large-scale oversight.
The problem
One of PerformLine’s enterprise customers needed a more efficient process for running compliance checks on newly launched product pages, particularly those that integrate multiple products within the same visual and textual framework. These complex pages often feature overlapping content that can apply to one product, several products, or even all of them at once, necessitating a context-aware interpretation that mirrors how a typical consumer would view and interact with the content. By adopting AWS and the architecture discussed in this post, PerformLine can retrieve and analyze these intricate pages through AI-driven processing, generating detailed insights and contextual data that capture the nuanced interplay between various product elements. After the relevant information is extracted and structured, it’s fed directly into their rules engine, enabling robust compliance checks. This accomplishes a seamless flow, from data ingestion to rules-based analysis. It not only preserves the depth of each product’s presentation but also delivers the speed and accuracy critical to large-scale oversight.
Monitoring millions of webpages daily for compliance demands a system that can intelligently parse, extract, and analyze content at scale—much like the approach PerformLine has developed for their enterprise customers. In this dynamic landscape, the ever-evolving nature of web content challenges traditional static parsing, requiring a context-aware and adaptive solution. This architecture not only processes bulk data offline but also delivers near real-time performance for one-time requests, dynamically scaling to manage the diverse complexity of each page. By using AI-powered inference, PerformLine provides comprehensive coverage of every product and marketing element across the web, while striking a careful balance between accuracy, performance, and cost.
Solution overview
With this flexible, adaptable solution, PerformLine can tackle even the most challenging webpages, providing comprehensive coverage when extracting and analyzing web content with multiple products. At the same time, by combining consistency with the adaptability of foundation models (FMs), PerformLine can maintain reliable performance across the diverse range of products and websites their customers monitor. This dual focus on agility and operational consistency makes sure their customers benefit from robust compliance checks and data integrity, without sacrificing the speed or scale needed to remain competitive.
PerformLine’s upstream ingestion pipeline efficiently collects millions of web pages and their associated metadata in a batch process. Downstream assets are submitted to PerformLine’s rules engine and compliance review processes. It was imperative that they not disrupt those processes or introduce cascading changes for this solution.
PerformLine decided to use generative AI and Amazon Bedrock to address their core challenges. Amazon Bedrock allows for a broad selection of models, including Amazon Nova. Amazon Bedrock is continuously expanding feature sets around using FMs at scale. This provides a reliable foundation to build a highly available and efficient content processing system.
PerformLine’s solution incorporates the following key components:

AI inference with Amazon Bedrock – Provides seamless access to FMs for content extraction and analysis
Application inference profiles – Enables precise tracking and optimization of inference costs
Event-driven serverless processing pipeline – Provides a lightweight, scalable approach to handling dynamic workloads using Amazon EventBridge, Amazon Simple Queue Service (Amazon SQS), AWS Lambda, Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB
Prompt management in Bedrock – Supports versioning, testing, and deployment of prompts for improved AI consistency and control
Task orchestration – Uses Amazon SQS to manage work queues efficiently, facilitating smooth and scalable task execution

PerformLine implemented a scalable, serverless event-driven architecture (shown in the following diagram) that seamlessly integrates with their existing system, requiring less than a day to develop and deploy. This made it possible to focus on prompt optimization, evaluation, and cost management rather than infrastructure overhead. This architecture allows PerformLine to dynamically parse, extract, and analyze web content with high reliability, flexibility, and cost-efficiency.

The system implements multiple queue types (Incoming, DLQ, Results) and includes error handling mechanisms. Data flows through various AWS services, including:

Amazon RDS for initial data storage
Amazon MQ (RabbitMQ) for message handling
Amazon S3 for asset storage
Amazon EventBridge for event management
Amazon SQS for queue management
AWS Lambda for serverless processing
Amazon DynamoDB for NoSQL data storage

PerformLine’s process consists of several steps, including processing (Step 1), event trigger and storage (Steps 2–6), structured output and storage (Step 7), and downstream processing and compliance checks (Steps 8–9):

Millions of pages are processed by an upstream extract, transform, and load (ETL) process from PerformLine’s core systems running on the AWS Cloud.
When a page is retrieved, it triggers an event in the compliance check system.
Amazon S3 allows for storage of the data from a page according to metadata.
EventBridge uses event-driven processing to route Amazon S3 events to Amazon SQS.
Amazon SQS queues messages for processing and enables messages to be retried on failure.
A Lambda Function consumes SQS messages and also scales dynamically to handle even unpredictable workloads:

This function uses Amazon Bedrock to perform extraction and generative AI analysis of the content pulled from Amazon SQS (a minimal handler sketch follows this list). Amazon Bedrock offers the greatest flexibility to choose the right model for the job: for PerformLine’s use case, Amazon Nova Pro was best suited for complex requests that require a powerful model while still allowing a high performance-to-cost ratio, and Anthropic’s Claude Haiku model handles optimized quick calls where a fast response is paramount for additional processing if needed. Amazon Bedrock features, including Amazon Bedrock Prompt Management and inference profiles, are used to vary prompts and model configuration without affecting outputs and to reduce the complexity of using FMs through Amazon Bedrock.
The function stores customer-defined product schemas in Amazon DynamoDB, enabling dynamic large language model (LLM) targeting and schema-driven output generation.

Amazon S3 stores the extracted data, which is formatted as structured JSON adhering to the target schema.
EventBridge forwards Amazon S3 events to Amazon SQS, making extracted data available for downstream processing.
Compliance checks and business rules, running on other PerformLine’s systems, are applied to validate and enforce regulatory requirements.
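The following is a minimal sketch of what such an SQS-triggered Lambda handler calling the Amazon Bedrock Converse API could look like. It is not PerformLine's production code: the model ID is a placeholder (swap in your inference profile or model of choice), the prompt is illustrative, and the downstream write to Amazon S3 is omitted.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

MODEL_ID = "us.amazon.nova-lite-v1:0"  # placeholder: use your own inference profile or model ID

def handler(event, context):
    """Consume SQS records and run one Converse call per page payload."""
    for record in event["Records"]:
        body = json.loads(record["body"])
        page_text = body.get("page_text", "")
        response = bedrock.converse(
            modelId=MODEL_ID,
            messages=[{
                "role": "user",
                "content": [{"text": "Extract the products on this page as structured JSON:\n" + page_text}],
            }],
            inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
        )
        extracted = response["output"]["message"]["content"][0]["text"]
        # Next step (omitted): write `extracted` to Amazon S3 so EventBridge/SQS can route it downstream.
    return {"statusCode": 200}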

Cost optimizations
The solution offers several cost optimizations, including change data capture (CDC) on the web and strategic multi-pass inference. After a page’s content has been analyzed and formatted, it’s written back to a partition that includes a metadata hash of the asset. This enables upstream processes to determine whether a page has already been processed and whether its content has changed (a minimal hashing sketch follows the list below). The key benefits of this approach include:

Alleviating redundant processing of the same pages, contributing to a 15% reduction in PerformLine’s human evaluation workload. This frees time for human evaluators and allows them to focus on critical pages rather than every page.
Avoiding reprocessing unchanged pages, dynamically reducing PerformLine’s analysts’ workload by over 50% in addition to deduplication gains.
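PerformLine's exact fingerprinting scheme isn't described in detail; as a minimal sketch of the idea, a stable hash of a page's content and metadata lets the pipeline skip pages whose fingerprint hasn't changed. SHA-256 and the field layout below are assumptions.

import hashlib

def content_hash(page_text, metadata):
    """Stable fingerprint of a page; if it matches the stored value, skip reprocessing."""
    payload = page_text + "|" + "|".join(f"{k}={metadata[k]}" for k in sorted(metadata))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

fingerprint = content_hash("<html>...</html>", {"url": "https://example.com/p/1", "fetched": "2025-07-01"})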

LLM inference costs can escalate at scale, but context and carefully structured prompts are critical for accuracy. To optimize costs while maintaining precision, PerformLine implemented a multi-pass approach using Amazon Bedrock:

Initial filtering with Amazon Nova Micro – This lightweight model efficiently identifies relevant products with minimal cost.
Targeted extraction with Amazon Nova Lite – Identified products are batched into smaller groups and passed to Amazon Nova Lite for deeper analysis. This keeps PerformLine within token limits while improving extraction accuracy.
Increased accuracy through context-aware processing – By first identifying the target content and then processing it in smaller batches, PerformLine significantly improved accuracy while minimizing token consumption.

Use of Amazon Bedrock
During initial testing, PerformLine quickly realized the need for a more scalable approach to prompt management. Manually tracking multiple prompt versions and templates became inefficient as PerformLine iterated and collaborated.
Amazon Bedrock’s Prompt Management service provided a centralized solution, enabling them to version, manage, and seamlessly deploy prompts to production. After the prompts are deployed, they can be dynamically referenced in AWS Lambda, allowing for flexible configuration. Additionally, by using Amazon Bedrock application inference profiles, PerformLine can dynamically adjust the models the Lambda function invokes, track cost per invocation, and attribute costs to specific application instances by setting up cost tags.
To streamline model interactions, PerformLine chose the Amazon Bedrock Converse API which provides a developer-friendly, standardized interface for model invocation. When combined with inference endpoints and prompt management, a Lambda function using the Amazon Bedrock Converse API becomes highly configurable—PerformLine developers can rapidly test new models and prompts, evaluate results, and iterate without needing to rebuild or redeploy. The simplification of prompt management and ability to deploy various models through Amazon Bedrock is shown in the following diagram.

The model configuration architecture in the diagram highlights three main components:

Inference system – model ID integration, profile configuration, content management, and inference settings
Prompt management – version control (V1 and draft versions), publish ID tracking, model specifications, and store configurations
Environment control – separate PROD and DEV paths, environment-specific parameter stores, invoke ID management, and engineering iteration tracking

Future plans and enhancements
PerformLine is excited to dive into additional Amazon Bedrock features, including prompt caching and Amazon Bedrock Flows.
With prompt caching, users can checkpoint prompt tokens, effectively caching context for reuse in subsequent API calls. Prompt caching on Amazon Bedrock offers up to 85% latency improvements and 90% cost reduction in comparison to calls without prompt caching. PerformLine sees prompt caching as a feature that will become the standard moving forward. They have a number of use cases for their data, and having the ability to apply further analysis on the same content at a lower cost creates new opportunities for feature expansion and development.
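
As a rough illustration of the pattern (a sketch based on our reading of the Bedrock prompt caching documentation, not PerformLine’s implementation), a cache checkpoint can be placed after the large, reused page context so that follow-up analyses of the same content reuse the cached prefix:

import boto3

bedrock = boto3.client("bedrock-runtime")

def analyze_with_cached_context(page_text: str, question: str) -> str:
    """Ask a follow-up question about already-analyzed content, reusing cached prompt tokens."""
    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"text": "Page content:\n" + page_text},
                # Cache checkpoint: tokens up to this block are cached for reuse
                # in subsequent calls (syntax per Bedrock prompt caching docs).
                {"cachePoint": {"type": "default"}},
                {"text": question},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
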
Amazon Bedrock Flows is a visual workflow builder that enables users to orchestrate multi-step generative AI tasks by connecting FMs and APIs without extensive coding. Amazon Bedrock Flows is a next step in simplifying PerformLine’s orchestration of knowledge bases, prompt caching, and even Amazon Bedrock agents in the future. Creating flows can help reduce time to feature deployment and maintenance.
Summary
PerformLine has implemented a highly scalable, serverless, AI-driven architecture that enhances efficiency, cost-effectiveness, and compliance in its web content processing pipeline. By using Amazon Bedrock, EventBridge, Amazon SQS, Lambda, and DynamoDB, they have built a solution that can dynamically scale, optimize AI inference costs, and reduce redundant processing, all while maintaining operational flexibility and compliance integrity. Based on current volume and workflow, PerformLine expects to process between 1.5 and 2 million pages daily, from which they expect to extract approximately 400,000 to 500,000 products. Applying rules to each asset is anticipated to yield about 500,000 rule observations requiring review each day. Throughout the design process, PerformLine made sure the solution remains as simple as possible while still delivering operational flexibility and integrity. This approach minimizes complexity, enhances maintainability, and accelerates deployment, empowering them to adapt quickly to evolving business needs without unnecessary overhead.
By using a serverless AI-driven architecture built on Amazon Bedrock, PerformLine helps their customers tackle even the most complex, multi-product webpages with unparalleled accuracy and efficiency. This holistic approach interprets visual and textual elements as a typical consumer would, verifying that every product variant is accurately assessed for compliance. The resulting insights are then fed directly into a rules engine, enabling rapid, data-driven decisions. For PerformLine’s customers, this means less redundant processing, lower operational costs, and a dramatically simplified compliance workflow, all without compromising on speed or accuracy. By reducing the overhead of large-scale data analysis and streamlining compliance checks, PerformLine’s solution ultimately frees teams to focus on driving innovation and delivering value.

About the authors
Bogdan Arsenie is the Chief Technology Officer at PerformLine, with over two decades of experience leading technological innovation across digital advertising, big data, mobile gaming, and social engagement. Bogdan began programming at age 13, customizing bulletin board software to fund his passion for Star Trek memorabilia. He served as PerformLine’s founding CTO from 2007–2009, pioneering their initial compliance platform. Later, as CTO at the Rumie Initiative, he helped scale a global education initiative recognized by Google’s Impact Challenge.
Nick Mattei is a Senior Software Engineer at PerformLine. He is focused on solutions architecture and distributed application development in AWS. Outside of work, Nick is an avid cyclist and skier, always looking for the next great climb or powder day.
Shervin Suresh is a Generative AI Solutions Architect at AWS. He supports generative AI adoption both internally at AWS and externally with fast-growing startup customers. He is passionate about using technology to help improve the lives of people in all aspects. Outside of work, Shervin loves to cook, build LEGO, and collaborate with people on things they are passionate about.
Medha Aiyah is a Solutions Architect at AWS. She graduated from the University of Texas at Dallas with an MS in Computer Science, with a focus on AI/ML. She supports ISV customers in a wide variety of industries, by empowering customers to use AWS optimally to achieve their business goals. She is especially interested in guiding customers on ways to implement AI/ML solutions and use generative AI. Outside of work, Medha enjoys hiking, traveling, and dancing.
Michael Zhang is a generalist Solutions Architect at AWS working with small to medium businesses. He has been with Amazon for over 3 years and uses his background in computer science and machine learning to support customers on AWS. In his free time, Michael loves to hike and explore other cultures.

A Coding Guide to Build a Tool-Calling ReAct Agent Fusing Prolog Logic with Gemini and LangGraph

In this tutorial, we are walking through a hands-on fusion of symbolic logic and generative AI. We set up PySwip to embed a Prolog knowledge base, wrap its predicates as LangChain tools, and then wire everything into a ReAct-style agent. Along the way, we are crafting family-relationship rules, mathematical predicates like factorial, and list utilities, then letting the agent plan, call tools, and reason over the results. By the end of the setup, we can issue natural-language questions and watch the agent translate them into precise Prolog queries, stitch together multi-step answers, and return structured JSON-backed insights.


!apt-get install swi-prolog -y
!pip install pyswip langchain-google-genai langgraph langchain-core

We install SWI-Prolog with apt-get and then add pyswip, LangChain’s Google GenAI wrapper, LangGraph, and core LangChain packages via pip so we can bridge Prolog logic with our Gemini-powered agent. With these dependencies in place, we’re ready to code, query, and orchestrate reasoning end to end.

import os
from pyswip import Prolog
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
import json

GOOGLE_API_KEY = "Use Your Own API Key Here"
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0)

We load our core stack, including PySwip for Prolog, LangChain and LangGraph for tooling, and Gemini 1.5 Flash for LLM power. We then set the GOOGLE_API_KEY environment variable so the model can authenticate. With the LLM initialized at zero temperature, we’re primed to get deterministic, logic-grounded answers from our agent.

class AdvancedPrologInterface:
    def __init__(self):
        self.prolog = Prolog()
        self._load_knowledge_base()

    def _load_knowledge_base(self):
        """Load comprehensive Prolog knowledge base"""
        rules = [
            "parent(john, mary, alice)",
            "parent(john, mary, bob)",
            "parent(bob, susan, charlie)",
            "parent(alice, david, emma)",
            "parent(charlie, lisa, frank)",

            "male(john)", "male(bob)", "male(david)", "male(charlie)", "male(frank)",
            "female(mary)", "female(alice)", "female(susan)", "female(emma)", "female(lisa)",

            "grandparent(X, Z) :- parent(X, _, Y), parent(Y, _, Z)",
            "sibling(X, Y) :- parent(P1, P2, X), parent(P1, P2, Y), X \\= Y",
            "uncle(X, Y) :- sibling(X, Z), parent(Z, _, Y), male(X)",
            "aunt(X, Y) :- sibling(X, Z), parent(Z, _, Y), female(X)",
            "cousin(X, Y) :- parent(P1, _, X), parent(P2, _, Y), sibling(P1, P2)",

            "factorial(0, 1)",
            "factorial(N, F) :- N > 0, N1 is N - 1, factorial(N1, F1), F is N * F1",

            "list_member(X, [X|_])",
            "list_member(X, [_|T]) :- list_member(X, T)",
            "list_length([], 0)",
            "list_length([_|T], N) :- list_length(T, N1), N is N1 + 1",

            "animal(dog)", "animal(cat)", "animal(whale)", "animal(eagle)",
            "mammal(dog)", "mammal(cat)", "mammal(whale)",
            "bird(eagle)", "bird(sparrow)",
            "can_fly(eagle)", "can_fly(sparrow)",
            "can_swim(whale)", "can_swim(fish)",
            "aquatic_mammal(X) :- mammal(X), can_swim(X)"
        ]

        for rule in rules:
            try:
                self.prolog.assertz(rule)
            except Exception as e:
                print(f"Warning: Could not assert rule '{rule}': {e}")

    def query(self, query_string):
        """Execute Prolog query and return results"""
        try:
            results = list(self.prolog.query(query_string))
            return results if results else [{"result": "No solutions found"}]
        except Exception as e:
            return [{"error": f"Query failed: {str(e)}"}]

We wrap SWI-Prolog in an AdvancedPrologInterface, load a rich rule/fact base on init, and assert each clause safely. We then expose a query() method that runs any Prolog goal and returns JSON-friendly results (or a clear error/no-solution message), allowing us to drive logic queries directly from Python.


prolog_interface = AdvancedPrologInterface()

@tool
def family_relationships(query: str) -> str:
    """
    Query family relationships in Prolog format.
    Examples: 'parent(john, mary, X)', 'sibling(X, Y)', 'grandparent(X, charlie)'
    """
    results = prolog_interface.query(query)
    return json.dumps(results, indent=2)

@tool
def mathematical_operations(operation: str, number: int) -> str:
    """
    Perform mathematical operations using Prolog.
    Supported operations: 'factorial'
    Example: operation='factorial', number=5
    """
    if operation == "factorial":
        query = f"factorial({number}, Result)"
        results = prolog_interface.query(query)
        return json.dumps(results, indent=2)
    else:
        return json.dumps([{"error": f"Operation '{operation}' not supported"}])

@tool
def advanced_queries(query_type: str, entity: str = "") -> str:
    """
    Perform advanced relationship queries.
    Types: 'all_children', 'all_grandchildren', 'all_siblings', 'all_cousins'
    """
    queries = {
        'all_children': f"parent(_, _, {entity})" if entity else "parent(_, _, X)",
        'all_grandchildren': f"grandparent(_, {entity})" if entity else "grandparent(_, X)",
        'all_siblings': f"sibling({entity}, X)" if entity else "sibling(X, Y)",
        'all_cousins': f"cousin({entity}, X)" if entity else "cousin(X, Y)"
    }

    if query_type in queries:
        results = prolog_interface.query(queries[query_type])
        return json.dumps(results, indent=2)
    else:
        return json.dumps([{"error": f"Query type '{query_type}' not supported"}])

We instantiate AdvancedPrologInterface and then wrap its queries as LangChain tools, such as family_relationships, mathematical_operations, and advanced_queries, so that we can call precise Prolog goals from natural language. We define each tool to format and dispatch the right query (such as factorial/2 or cousin lookups) and return clean JSON, allowing our agent to orchestrate logic calls seamlessly.

tools = [family_relationships, mathematical_operations, advanced_queries]
agent = create_react_agent(llm, tools)

def run_family_analysis():
    """Comprehensive family relationship analysis"""
    print(" Family Relationship Analysis")
    print("=" * 50)

    queries = [
        "Who are all the parents in the family database?",
        "Find all grandparent-grandchild relationships",
        "Show me all the siblings in the family",
        "Who are John and Mary's children?",
        "Calculate the factorial of 6 using Prolog"
    ]

    for i, query in enumerate(queries, 1):
        print(f"\n Query {i}: {query}")
        print("-" * 30)

        try:
            response = agent.invoke({"messages": [("human", query)]})
            answer = response["messages"][-1].content
            print(f" Response: {answer}")
        except Exception as e:
            print(f" Error: {str(e)}")

def demonstrate_complex_reasoning():
    """Show advanced multi-step reasoning"""
    print("\n Complex Multi-Step Reasoning")
    print("=" * 40)

    complex_query = """
    I want a complete family tree analysis. Please:
    1. List all parent-child relationships
    2. Identify all grandparent relationships
    3. Find any uncle/aunt relationships
    4. Show cousin relationships
    5. Calculate factorial of 4 as a bonus math operation
    """

    print(f"Complex Query: {complex_query}")
    print("-" * 40)

    try:
        response = agent.invoke({"messages": [("human", complex_query)]})
        print(f" Comprehensive Analysis:\n{response['messages'][-1].content}")
    except Exception as e:
        print(f" Error in complex reasoning: {str(e)}")

def interactive_prolog_session():
    """Interactive Prolog knowledge base exploration"""
    print("\n Interactive Prolog Explorer")
    print("Ask about family relationships, math operations, or general queries!")
    print("Type 'examples' to see sample queries, 'quit' to exit")
    print("-" * 50)

    examples = [
        "Who are Bob's children?",
        "Find all grandparents in the family",
        "Calculate factorial of 5",
        "Show me all cousin relationships",
        "Who are Alice's siblings?"
    ]

    while True:
        user_input = input("\n You: ")

        if user_input.lower() == 'quit':
            print(" Goodbye!")
            break
        elif user_input.lower() == 'examples':
            print(" Example queries:")
            for ex in examples:
                print(f" • {ex}")
            continue

        try:
            response = agent.invoke({"messages": [("human", user_input)]})
            print(f" AI: {response['messages'][-1].content}")
        except Exception as e:
            print(f" Error: {str(e)}")

We register our three Prolog tools, spin up a ReAct agent around Gemini, and then script helper routines, run_family_analysis, demonstrate_complex_reasoning, and an interactive loop, to fire natural-language queries that the agent translates into Prolog calls. This way, we test simple prompts, multi-step reasoning, and live Q&A, all while keeping the logic layer transparent and debuggable.

def test_direct_queries():
    """Test direct Prolog queries for verification"""
    print("\n Direct Prolog Query Testing")
    print("=" * 35)

    test_queries = [
        ("parent(john, mary, X)", "Find John and Mary's children"),
        ("grandparent(X, charlie)", "Find Charlie's grandparents"),
        ("sibling(alice, X)", "Find Alice's siblings"),
        ("factorial(4, X)", "Calculate 4 factorial"),
        ("cousin(X, Y)", "Find all cousin pairs")
    ]

    for query, description in test_queries:
        print(f"\n {description}")
        print(f"Query: {query}")
        results = prolog_interface.query(query)
        print(f"Results: {json.dumps(results, indent=2)}")

def main():
    """Main demonstration runner"""
    # This check must match the placeholder string assigned to GOOGLE_API_KEY above
    if GOOGLE_API_KEY == "Use Your Own API Key Here":
        print(" Please set your Gemini API key in Cell 3!")
        print("Get it from: https://aistudio.google.com/app/apikey")
        return

    print(" Advanced Prolog + Gemini Integration")
    print("Using PySwip for stable Prolog integration")
    print("=" * 55)

    test_direct_queries()
    run_family_analysis()
    demonstrate_complex_reasoning()

def show_mathematical_capabilities():
    """Demonstrate mathematical reasoning with Prolog"""
    print("\n Mathematical Reasoning with Prolog")
    print("=" * 40)

    math_queries = [
        "Calculate factorial of 3, 4, and 5",
        "What is the factorial of 7?",
        "Show me how factorial calculation works step by step"
    ]

    for query in math_queries:
        print(f"\n Math Query: {query}")
        try:
            response = agent.invoke({"messages": [("human", query)]})
            print(f" Result: {response['messages'][-1].content}")
        except Exception as e:
            print(f" Error: {str(e)}")

if __name__ == "__main__":
    main()
    show_mathematical_capabilities()

    print("\n Tutorial completed successfully!")
    print(" Key achievements:")
    print(" • Integrated PySwip with Gemini AI")
    print(" • Created advanced Prolog reasoning tools")
    print(" • Demonstrated complex family relationship queries")
    print(" • Implemented mathematical operations in Prolog")
    print(" • Built interactive AI agent with logical reasoning")
    print("\n Try extending with your own Prolog rules and facts!")

We wire everything together in main() to verify our Prolog goals, run the family analysis, and showcase multi-step reasoning, then show_mathematical_capabilities() stresses factorial queries from natural language. We conclude by printing a quick recap of what we’ve built so far, allowing us to confidently extend the stack with new rules or swap models next.

In conclusion, we have demonstrated that symbolic reasoning and LLMs complement each other beautifully: Prolog guarantees correctness on well-defined logic, while Gemini handles flexible language understanding and orchestration. We are leaving with a working scaffold, direct Prolog queries for verification, tool-wrapped predicates for agents, and demo functions for complex family tree and mathematical analyses. From here, we are ready to expand the knowledge base, add new domains (such as finance rules, game logic, and knowledge graphs), or swap in different LLMs. We are also positioned to expose this stack via an interactive UI or API, allowing others to explore logic-guided AI in real-time.

Check out the Full Codes. All credit for this research goes to the researchers of this project.


The post A Coding Guide to Build a Tool-Calling ReAct Agent Fusing Prolog Logic with Gemini and LangGraph appeared first on MarkTechPost.

GitHub Introduces Vibe Coding with Spark: Revolutionizing Intelligent App Development in a Flash

GitHub has introduced Spark, a groundbreaking addition to its suite of developer tools, aimed at revolutionizing the way full-stack intelligent applications are built and deployed. With Spark, available in public preview for Copilot Pro+ subscribers, developers can go from idea to a fully deployed app in minutes—all using natural language prompts and without the usual hassle of setup or configuration.

Key Features

Natural Language App Creation

Spark leverages state-of-the-art AI, powered by Claude Sonnet 4, to transform simple descriptions into complete applications. Developers can describe their app ideas in plain English, and Spark handles the generation of both frontend and backend code, streamlining what traditionally took weeks into mere minutes.

Zero Configuration Overhead

Spark delivers an out-of-the-box experience by integrating essential components such as:

Data management

Large language model (LLM) inference

Hosting and deployments

GitHub authentication

This means users don’t need to spend time managing infrastructure, API keys, or security settings.

AI Integration Without API Hassles

Adding intelligent features to applications is simplified. Spark supports leading LLMs from platforms like OpenAI, Meta, DeepSeek, and xAI. No API key management is necessary—everything is managed through GitHub’s unified interface.

One-Click Deployment

Developers can deploy their applications with a single click. Spark automates the entire build and publication process, minimizing time to production and reducing opportunity for configuration errors.

Flexible Development Workflow

Spark adapts to diverse development styles:

Natural language prompts for rapid prototyping.

Visual editing controls for UI adjustments without code.

Direct code editing with Copilot Completions for those who prefer a hands-on approach.

Seamless repository creation with GitHub Actions and Dependabot pre-integrated.

No sandboxing — everything stays synchronized with your real project repos.

Expansion With Copilot Agents

Beyond initial app creation, Spark enables deeper development through:

Opening a Codespace directly from Spark for interactive, agent-powered coding.

Assigning issues to Copilot’s coding agents for automated problem resolution and feature development.


Getting Started

Spark is currently available to Copilot Pro+ subscribers at no additional cost. To try it out:

Visit github.com/spark to begin building your app.

If not already a Copilot Pro+ user, sign up for access.

All Spark messages use premium requests included within existing GitHub Copilot plans.

The platform is expected to roll out to more users in the near future, with further UI and feature updates anticipated as part of the ongoing public preview.

Conclusion

GitHub Spark marks a major step forward in democratizing application development, allowing both seasoned developers and newcomers to rapidly build, deploy, and iterate on sophisticated, AI-powered applications—no setup, no configuration, and no operational headaches. As Spark matures, it promises to further blur the line between idea and implementation, accelerating the path from concept to deployment at scale.

Frequently Asked Questions (FAQs)

1. What is GitHub Spark and who is it for?

GitHub Spark is an all-in-one, AI-powered platform designed to help users create full-stack intelligent applications using natural language, visual controls, or direct code editing. It is built for everyone—from complete beginners to experienced developers—enabling users to turn ideas into functional apps rapidly and deploy them with a single click, all without the need for complex setup or configuration. Spark offers deep integration with GitHub’s trusted tools, supporting secure collaboration, rapid prototyping, and effortless scaling.

2. Do I need coding experience to use Spark?

No, coding experience is not required to use Spark. The platform is designed to be accessible to users of all technical backgrounds. You can simply describe what you want to build in plain English, and Spark will handle both frontend and backend generation, as well as AI features and database connections. For those with programming experience, Spark also allows direct code editing, app refinement in the Spark editor, and powerful integrations with GitHub Copilot and Codespaces for greater control and customization.

3. How do I build and deploy an app with Spark?

To build and deploy an app using Spark:

Visit the Spark homepage: github.com/spark

Describe your vision in natural language; Spark generates a working app with all the necessary components.

Refine your app using natural language, visual controls, or code in the live editor. Changes appear instantly in the live preview.

When you’re satisfied, publish your app with a single click. Your app is then securely hosted with built-in GitHub authentication and is immediately accessible to your chosen audience.

Spark handles all necessary infrastructure, utilizing Microsoft Azure for hosting and reliable performance, so there is no additional setup required.


The post GitHub Introduces Vibe Coding with Spark: Revolutionizing Intelligent App Development in a Flash appeared first on MarkTechPost.

Google Researchers Introduced LSM-2 with Adaptive and Inherited Masking (AIM): Enabling Direct Learning from Incomplete Wearable Data

Introduction

Wearable devices are transforming health monitoring by enabling continuous collection of physiological and behavioral signals such as heart rate, activity, temperature, and skin conductance. However, the real-world data that these devices generate is highly prone to missingness due to sensor failures, device removal, charging, motion artifacts, battery-saving modes, and other interruptions. This presents a significant challenge for self-supervised learning (SSL) and foundation models, which typically expect complete, regular data streams. Past solutions often relied on data imputation or discarding incomplete instances, which risks introducing bias or wasting valuable information.

A team of researchers from Google DeepMind introduced the LSM-2 (Large Sensor Model 2) framework, accompanied by a new Adaptive and Inherited Masking (AIM) strategy, which addresses these issues directly by learning robust representations from incomplete wearable sensor data without explicit imputation. Below, we examine the technical innovations, empirical results, and key insights from this advancement.

The Challenge: Wearable Data Missingness

Data Fragmentation: In a large-scale dataset of 1.6 million day-long (1440-minute) wearable data samples, 0% of the samples were fully complete; missingness is ubiquitous and often structured into long gaps, not simple random dropouts.

Missingness Modes: Common causes include:

Device off (charging or not worn)

Selective sensor deactivation (power-saving or operation-specific)

Motion artifacts or environmental noise

Out-of-range or physiologically impossible readings filtered out during preprocessing

Impact on Modeling: Many clinically-relevant physiological patterns (e.g., circadian rhythms, heart rate variability) require analysis of long sequences—where missingness is nearly guaranteed.

Adaptive and Inherited Masking (AIM): Technical Approach

Key Concepts

AIM integrates two masking types for robust learning:

Inherited Mask: Marks tokens corresponding to real missingness in the sensor data

Artificial Mask: Randomly masks observed tokens to provide reconstruction targets for self-supervised pretraining

These masks are unioned and handled by a transformer-based encoder-decoder structure, enabling the model to:

Learn directly from non-imputed, incomplete data

Adjust dynamically to real-world missingness during inference

Produce representations robust to both partial and systematic data gaps

Masking Strategies for Pretraining

Random Imputation: Dropping 80% of tokens simulating sensor noise

Temporal Slices: Dropping 50% of temporal windows (all sensors missing during random periods)

Sensor Slices: Dropping 50% of sensor channels across the entire day (modeling selective sensor off periods)

AIM combines the efficiency of dropout masking (removal from computation) and the flexibility of attention masking (support for dynamically-varying missingness), allowing the model to scale to long input sequences (day-long, >3,000 tokens).
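
To make the mechanism concrete, the following is a minimal, illustrative sketch (plain NumPy, not the paper’s code) of how an inherited mask derived from real gaps and a randomly drawn artificial mask are unioned, with masked tokens dropped from the encoder input and their positions retained for reconstruction:

import numpy as np

def build_aim_masks(sample, artificial_ratio=0.8, seed=0):
    """Sketch of AIM-style masking for one day-long sample of shape (tokens, channels).

    NaN entries mark real (inherited) missingness; a fraction of the observed
    tokens is additionally masked to create reconstruction targets.
    """
    rng = np.random.default_rng(seed)

    inherited = np.isnan(sample).any(axis=-1)      # real gaps in the sensor stream
    observed = np.flatnonzero(~inherited)
    n_artificial = int(artificial_ratio * observed.size)

    artificial = np.zeros_like(inherited)
    artificial[rng.choice(observed, size=n_artificial, replace=False)] = True

    combined = inherited | artificial              # union of both masks

    encoder_input = sample[~combined]              # dropout-style removal: only visible tokens enter the encoder
    masked_positions = np.flatnonzero(combined)    # attention-style bookkeeping for the decoder
    target_positions = np.flatnonzero(artificial)  # only artificially masked tokens have ground truth to reconstruct

    return encoder_input, masked_positions, target_positions

# Toy usage: 10 time steps x 3 sensor channels with two genuinely missing time steps.
sample = np.arange(30, dtype=float).reshape(10, 3)
sample[[2, 7]] = np.nan
enc, masked, targets = build_aim_masks(sample, artificial_ratio=0.5)
print(enc.shape, masked, targets)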

Dataset and Pretraining Details

Scale: 40 million hours of day-long, multimodal sensor data, collected from 60,440 participants between March and May 2024.

Sensors: Photoplethysmography (PPG), accelerometer, electrodermal activity (EDA), skin temperature, and altimeter. Each device contributed minutely aggregated features across a 24-hour window.

Demographic Diversity: Participants across a wide range of ages (18–96), genders, and BMI classes.

Downstream Labeled Data:

Metabolic Study (hypertension, anxiety prediction; n=1,250 labeled users)

Activity Recognition (20 activity classes, 104,086 events).

Evaluation and Results

Downstream Tasks

AIM-based LSM-2 was assessed on:

Classification: Binary hypertension, anxiety, and 20-class activity recognition

Regression: Age and BMI

Generative: Recovery of missing sensor data (random imputation, temporal/signal gaps)

Quantitative Results

Best LSM-1 vs. LSM-2 with AIM, by task and metric:

Hypertension (F1): 0.640 vs. 0.651, a +1.7% improvement
Activity Recognition (F1): 0.470 vs. 0.474, a +0.8% improvement
BMI regression (correlation): 0.667 vs. 0.673, a +1.0% improvement
Random Imputation at 80% missingness (MSE, lower is better): 0.30 vs. 0.20, 33% lower error
2-signal Recovery (MSE, lower is better): 0.73 vs. 0.17, 77% lower error

Robustness to Targeted Missingness: When specific sensors or time windows were artificially removed, LSM-2 with AIM experienced 73% smaller performance drops (on average) compared to LSM-1. For example, F1 loss after removing accelerometry for activity recognition was -57% for LSM-2, as opposed to -71% for LSM-1, and LSM-2 retained +47% higher absolute F1 after ablation.

Clinical Coherence: The model’s performance drop matched domain expectations. Nighttime biosignal removal significantly reduced hypertension/anxiety prediction accuracy (reflecting real-world diagnostic value of nocturnal data).

Scaling: LSM-2 exhibited better scaling than LSM-1 in terms of subjects, data, compute, and model size, with no saturation observed in performance gains.

Technical Insights

Direct Handling of Real-World Missingness: LSM-2 is the first wearable foundation model trained and evaluated directly on incomplete data, without explicit imputation.

Hybrid Masking Mechanism: Adaptive and inherited masking achieves both computational efficiency (via dropout removal) and flexibility (via attention masking).

Generalizable Embeddings: Even with a frozen backbone and simple linear probes, LSM-2 achieves state-of-the-art results in both clinical/person-level and event-level tasks, outperforming supervised and contrastive SSL baselines.

Generative and Discriminative Power: LSM-2 is the only evaluated model capable of both reconstructing missing signals and generating embeddings applicable across various downstream tasks, suggesting utility for real-world medical and behavioral monitoring applications.

Conclusion

LSM-2 with Adaptive and Inherited Masking presents a major step forward for deploying AI-driven health insights using real-world wearable sensor data. By directly embracing ubiquitous, structured missingness, and unifying generative and discriminative capabilities under one efficient and robust foundation model, this approach lays crucial groundwork for the future of wearable and health AI in realistic, imperfect data environments.

Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.


The post Google Researchers Introduced LSM-2 with Adaptive and Inherited Masking (AIM): Enabling Direct Learning from Incomplete Wearable Data appeared first on MarkTechPost.