Arcee AI Releases SuperNova-Medius: A 14B Small Language Model Built on the Qwen2.5-14B-Instruct Architecture

In the ever-evolving world of artificial intelligence (AI), large language models have proven instrumental in addressing a wide array of challenges, from automating complex tasks to enhancing decision-making processes. However, scaling these models has also introduced considerable complexities, such as high computational costs, reduced accessibility, and the environmental impact of extensive resource requirements. The enormous size of conventional language models like GPTs or LLaMA-70B makes them challenging for many institutions to adopt due to constraints in computational infrastructure. Arcee AI has acknowledged these challenges and sought to bridge the gap between model capability and accessibility with the introduction of SuperNova-Medius—a small language model that aims to maintain the high-quality output of larger counterparts without their limitations.

SuperNova-Medius is a 14B small language model that seeks to disrupt the traditional notion that bigger always means better in AI models. It follows Arcee AI's earlier releases of the 70B SuperNova-70B and the 8B SuperNova-Lite. SuperNova-Medius is designed to match the prowess of significantly larger models, rivaling those with up to 70 billion parameters, while retaining a manageable size of 14 billion parameters, making it suitable for a wide range of use cases without the massive computational burden. By integrating optimization techniques and innovative architectural designs, SuperNova-Medius presents a fresh perspective on how effective language models can be designed for real-world usability while ensuring that smaller organizations can leverage their potential.

SuperNova-Medius is built on an optimized Transformer architecture, coupled with advanced quantization methods that allow it to maintain impressive accuracy and efficiency. The development of SuperNova-Medius involved a sophisticated multi-teacher, cross-architecture distillation process with the following key steps:

Logit Distillation from Llama 3.1 405B: The logits of Llama 3.1 405B were distilled using an offline approach. The top-K logits for each token were stored to capture most of the probability mass while managing storage requirements (a sketch of this objective appears after the list of steps).

Cross-Architecture Adaptation: Using mergekit-tokensurgeon, a version of Qwen2.5-14B was created that uses the vocabulary of Llama 3.1 405B. This allowed for the use of Llama 3.1 405B logits in training the Qwen-based model.

Distillation to Qwen Architecture: The adapted Qwen2.5-14B model was trained using the stored 405B logits as the target.

Parallel Qwen Distillation: In a separate process, Qwen2-72B was distilled into a 14B model.

Final Fusion and Fine-Tuning: The Llama-distilled Qwen model’s vocabulary was reverted to the Qwen vocabulary. After re-aligning the vocabularies, a final fusion and fine-tuning step was conducted using a specialized dataset from EvolKit to ensure that SuperNova-Medius maintained coherence, fluency, and context understanding across a broad range of tasks.
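
To make the logit-distillation steps concrete, below is a minimal sketch of a top-K distillation loss, assuming the stored teacher top-K logits and their vocabulary indices have already been aligned with the student's vocabulary (as in the cross-architecture adaptation step). The function name and tensor layout are illustrative, not Arcee AI's implementation.

import torch
import torch.nn.functional as F

def topk_distillation_loss(student_logits, teacher_topk_logits, teacher_topk_ids, temperature=1.0):
    # student_logits:      [batch, seq, vocab] logits from the 14B student
    # teacher_topk_logits: [batch, seq, K] stored top-K logits from the 405B teacher
    # teacher_topk_ids:    [batch, seq, K] vocabulary indices of those logits
    teacher_probs = F.softmax(teacher_topk_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Gather the student's log-probabilities at the teacher's top-K token ids.
    student_topk_log_probs = torch.gather(student_log_probs, -1, teacher_topk_ids)
    # Forward KL(teacher || student) restricted to the stored top-K entries.
    kl = (teacher_probs * (teacher_probs.clamp_min(1e-9).log() - student_topk_log_probs)).sum(-1)
    return kl.mean()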

Despite being smaller compared to the largest models, SuperNova-Medius has been extensively fine-tuned using a diverse and expansive dataset, covering multiple domains and languages. This extensive training allows SuperNova-Medius to exhibit a strong understanding of context, generate coherent responses, and perform complex reasoning tasks effectively. Furthermore, by employing innovations in parameter sharing and utilizing sparsity strategies, the model delivers results that are comparable to models with substantially higher parameter counts. The key benefits of SuperNova-Medius lie in its balanced capability—it provides high-quality language generation while being cost-effective to deploy, making it a perfect fit for applications needing reliable but resource-efficient solutions.

SuperNova-Medius excels in instruction-following (IFEval) and complex reasoning tasks (BBH), outperforming Qwen2.5-14B and SuperNova-Lite across multiple benchmarks. This makes it a powerful, efficient solution for high-quality generative AI applications.

In conclusion, SuperNova-Medius stands as a testament to Arcee AI’s commitment to pushing the boundaries of what’s possible with language models while making advanced AI more inclusive and sustainable. By successfully reducing the model size without compromising on performance, Arcee AI has provided a solution that caters to the needs of various sectors, from startups and small businesses to educational institutions and beyond. As AI continues to shape our future, innovations like SuperNova-Medius are essential in ensuring that the benefits of advanced machine learning technology are accessible to all, paving the way for more equitable and impactful applications of AI across the globe.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project.


Researchers at Stanford University Propose ExPLoRA: A Highly Effective AI Technique to Improve Transfer Learning of Pre-Trained Vision Transformers (ViTs) Under Domain Shifts

Parameter-efficient fine-tuning (PEFT) methods, like low-rank adaptation (LoRA), allow large pre-trained foundation models to be adapted to downstream tasks using a small percentage (0.1%-10%) of the original trainable weights. A less explored area of PEFT is extending the pre-training phase without supervised labels—specifically, adapting foundation models to new domains using efficient self-supervised pre-training. While traditional pre-training of foundation models in language and vision has been resource-intensive, recent advancements in PEFT techniques have enabled effective fine-tuning with minimal computational cost based on the assumption that weight updates have a low intrinsic rank.

Vision foundation models (VFMs) like DinoV2 and masked autoencoders (MAE) have shown excellent performance in tasks such as classification and semantic segmentation through self-supervised learning (SSL). Recently, domain-specific VFMs have emerged, like SatMAE, which processes temporal or multi-spectral satellite images. Efficient adaptation of these large models has led to the adoption of PEFT methods, which update only a fraction of the parameters. Techniques such as LoRA apply low-rank weight updates, while others modify the number of trainable parameters. Domain adaptation strategies address distribution shifts between training and testing data using discrepancy metrics or adversarial training to enhance model performance across domains.

Researchers from Stanford University and CZ Biohub have developed ExPLoRA, an innovative technique to enhance transfer learning for pre-trained vision transformers (ViTs) amid domain shifts. By initializing a ViT with weights from large natural-image datasets like DinoV2 or MAE, ExPLoRA continues unsupervised pre-training in a new domain, selectively unfreezing 1-2 ViT blocks while employing LoRA to tune the remaining layers. This method achieves state-of-the-art performance in satellite imagery classification, improving top-1 accuracy by 8% while utilizing only 6-10% of the parameters compared to previous fully pre-trained models, demonstrating significant efficiency and effectiveness in domain adaptation.
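
As a rough illustration of this recipe (not the authors' released code), the sketch below freezes a ViT, fully unfreezes its last two blocks, and wraps the attention projections of the remaining blocks with low-rank adapters. The module names follow the common timm ViT layout and are an assumption.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wrap a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B(A(x)).
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

def apply_explora_style_tuning(vit, num_unfrozen_blocks: int = 2, r: int = 8):
    # Freeze everything first.
    for p in vit.parameters():
        p.requires_grad = False
    # Fully unfreeze the last few transformer blocks.
    for block in vit.blocks[-num_unfrozen_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
    # Add LoRA adapters to the attention projections of the frozen blocks.
    for block in vit.blocks[:-num_unfrozen_blocks]:
        block.attn.qkv = LoRALinear(block.attn.qkv, r=r)
        block.attn.proj = LoRALinear(block.attn.proj, r=r)
    return vit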

The MAE and DinoV2 are SSL methods for ViTs. MAE uses a masked encoder-decoder structure that requires full fine-tuning for downstream tasks, which can be computationally intensive. In contrast, DinoV2 demonstrates strong zero-shot performance by employing a trainable student-teacher model architecture, enabling adaptation without full fine-tuning. The ExPLoRA method is proposed to address these fine-tuning inefficiencies, combining pre-trained weights with low-rank adaptations and additional updates to adapt ViTs to new target domains efficiently. This approach reduces storage requirements while maintaining strong feature extraction and generalization capabilities.

The experimental results focus on satellite imagery, highlighting a case study with the fMoW-RGB dataset, achieving a state-of-the-art top-1 accuracy of 79.2%. The ablation study examines performance metrics across various configurations. ExPLoRA models, initialized with MAE and DinoV2 weights, outperform traditional fully pre-trained methods while utilizing only 6% of the ViT encoder parameters. Additional evaluations on multi-spectral images and various satellite datasets demonstrate ExPLoRA’s effectiveness in bridging domain gaps and achieving competitive performance. The results indicate significant improvements in accuracy, showcasing the potential of ExPLoRA for satellite image classification tasks.

In conclusion, ExPLoRA is an innovative pre-training strategy designed to adapt pre-trained ViT models for diverse visual domains, including satellite and medical imagery. ExPLoRA addresses the limitations of costly from-scratch pre-training by enabling efficient knowledge transfer from existing models, achieving superior performance compared to domain-specific foundations. The method combines PEFT techniques like LoRA with minimal unfreezing of model layers, significantly enhancing transfer learning. The experiments reveal state-of-the-art results on satellite imagery, improving linear probing accuracy by up to 7.5% while utilizing less than 10% of the parameters of previous approaches.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.


OpenAI Researchers Introduce MLE-bench: A New Benchmark for Measuring How Well AI Agents Perform at Machine Learning Engineering

Machine Learning (ML) models have shown promising results in various coding tasks, but there remains a gap in effectively benchmarking AI agents’ capabilities in ML engineering. Existing coding benchmarks primarily evaluate isolated coding skills without holistically measuring the ability to perform complex ML tasks, such as data preparation, model training, and debugging.

OpenAI Researchers Introduce MLE-bench

To address this gap, OpenAI researchers have developed MLE-bench, a comprehensive benchmark that evaluates AI agents on a wide array of ML engineering challenges inspired by real-world scenarios. MLE-bench is a novel benchmark aimed at evaluating how well AI agents can perform end-to-end machine learning engineering. It is constructed using a collection of 75 ML engineering competitions sourced from Kaggle. These competitions encompass diverse domains such as natural language processing, computer vision, and signal processing. The competitions are carefully curated to assess key ML skills, including training models, data preprocessing, running experiments, and submitting results for evaluation. To provide an accurate baseline, human performance metrics are gathered from publicly available Kaggle leaderboards, enabling comparisons between the capabilities of AI agents and expert human participants.

Structure and Details of MLE-bench

MLE-bench features several design aspects to assess ML engineering effectively. Each of the 75 Kaggle competition tasks is representative of practical engineering challenges, making the benchmark both rigorous and realistic. Each Kaggle competition in MLE-bench consists of a problem description, dataset, local evaluation tools, and grading code used to assess the agent’s performance. To ensure comparability, each competition’s dataset is split into training and testing sets, often redesigned to avoid any overlap or contamination issues. Submissions are graded against human attempts using competition leaderboards, and agents receive medals (bronze, silver, gold) based on their performance relative to human benchmarks. The grading mechanism relies on standard evaluation metrics, such as the area under the receiver operating characteristic (AUROC), mean squared error, and other domain-specific loss functions, providing a fair comparison to Kaggle participants. AI agents, such as OpenAI’s o1-preview model combined with AIDE scaffolding, have been tested on these tasks, achieving results comparable to a Kaggle bronze medal in 16.9% of competitions. Performance significantly improved with repeated attempts, indicating that while agents can follow well-known approaches, they struggle to recover from initial mistakes or optimize effectively without multiple iterations. This highlights both the potential and the limitations of current AI systems in performing complex ML engineering tasks.
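
As a simple illustration of how a competition-specific grader might score a submission against hidden answers (not MLE-bench's actual grading code), an AUROC-based task could be graded roughly as follows; the column names are placeholders.

import pandas as pd
from sklearn.metrics import roc_auc_score

def grade_auroc_submission(submission_csv: str, answers_csv: str) -> float:
    # Join the agent's submission to the held-out answers on a shared id column.
    submission = pd.read_csv(submission_csv)   # columns: id, prediction
    answers = pd.read_csv(answers_csv)         # columns: id, label
    merged = answers.merge(submission, on="id", how="inner")
    # Higher AUROC is better; the score is then compared against leaderboard
    # thresholds to decide whether the attempt earns a bronze, silver, or gold medal.
    return roc_auc_score(merged["label"], merged["prediction"])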

Experimental Results and Performance Analysis

The evaluation of different scaffolds and AI models on MLE-bench reveals interesting findings. OpenAI’s o1-preview model with AIDE scaffolding emerged as the best-performing setup, achieving medals in 16.9% of the competitions, and performance significantly improved with multiple attempts. Agents often performed better when they could iterate on their solutions, highlighting the importance of multiple passes in addressing challenges and optimizing solutions. When given additional resources, such as increased compute time and hardware, agents showed better results, emphasizing the impact of resource allocation. For example, the performance of GPT-4o doubled from 8.7% when given 24 hours to 11.8% when given 100 hours per competition. Furthermore, the experiments revealed that scaling up the number of attempts (pass@k) had a significant impact on the success rate, with pass@6 achieving nearly double the performance of pass@1. Additionally, experiments on scaling resources and agent scaffolding demonstrate the variability in performance based on resource availability and optimization strategies. Specifically, agents like o1-preview exhibited notable improvements in competitions requiring extensive model training and hyperparameter tuning when given longer runtimes or better hardware configurations. This evaluation provides valuable insights into the strengths and weaknesses of current AI agents, particularly in debugging, handling complex datasets, and effectively utilizing available resources.
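
The pass@k numbers above can be estimated directly from per-competition attempt logs. The sketch below treats a competition as solved if any of k sampled attempts reaches at least a bronze medal; the aggregation scheme and the example competition names are assumptions for illustration.

import random

def empirical_pass_at_k(attempt_results, k, num_samples=1000, seed=0):
    # attempt_results: dict mapping competition id -> list of booleans,
    # one per independent attempt (True = medal-worthy submission).
    rng = random.Random(seed)
    hits, trials = 0, 0
    for results in attempt_results.values():
        for _ in range(num_samples):
            sample = rng.sample(results, min(k, len(results)))
            hits += any(sample)
            trials += 1
    return hits / trials

logs = {
    "competition-a": [False, True, False, False, False, False],
    "competition-b": [False] * 6,
}
print(empirical_pass_at_k(logs, k=1), empirical_pass_at_k(logs, k=6))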

Conclusion and Future Directions

MLE-bench represents a significant step forward in evaluating the ML engineering capabilities of AI agents, focusing on holistic, end-to-end performance metrics rather than isolated coding skills. The benchmark provides a robust framework for assessing various facets of ML engineering, including data preprocessing, model training, hyperparameter tuning, and debugging, which are essential for real-world ML applications. It aims to facilitate further research into understanding the potential and limitations of AI agents in performing practical ML engineering tasks autonomously. By open-sourcing MLE-bench, OpenAI hopes to encourage collaboration, allowing researchers and developers to contribute new tasks, improve existing benchmarks, and explore innovative scaffolding techniques. This collaborative effort is expected to accelerate progress in the field, ultimately contributing to safer and more reliable deployment of advanced AI systems. Additionally, MLE-bench serves as a valuable tool for identifying key areas where AI agents require further development, providing a clear direction for future research efforts in enhancing the capabilities of AI-driven ML engineering.

Setup

Some MLE-bench competition data is stored using Git-LFS. Once you have downloaded and installed LFS, run:

git lfs fetch --all
git lfs pull

You can install mlebench with pip:

pip install -e .

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


InstructG2I: A Graph Context Aware Stable Diffusion Model to Synthesize Images from Multimodal Attributed Graphs

Multimodal Attributed Graphs (MMAGs) have received little attention despite their versatility for image generation. MMAGs represent relationships between entities, with combinatorial complexity, in a graph-structured manner, and each node carries both image and text information. Compared with conditioning on text or images alone, conditioning on graph context can produce better, more informative images. Graph2Image is an interesting challenge in this field that requires generative models to synthesize images conditioned on text descriptions and graph connections. Yet, despite their usefulness, MMAGs cannot be plugged directly into standard image- and text-conditioned generators.

The following are the most relevant challenges in the use of MMAGs for image synthesis:

Explosion in graph size – The combinatorial complexity of graphs means that the size of the conditioning input grows rapidly as local subgraphs, which contain both images and text, are fed to the model.

Graph entity dependencies – Node attributes are mutually dependent, so proximity in the graph reflects relationships between entities across text and image and should influence image generation. For example, generating a light-colored shirt should favor light shades such as pastels.

Need for controllability in the graph condition – Generated images must be controllable so that they follow the desired patterns or characteristics defined by the connections between entities in the graph.

A team of researchers at the University of Illinois developed InstructG2I to solve this problem. InstructG2I is a graph context-aware diffusion model that utilizes multimodal graph information. It addresses graph space complexity by compressing graph context into fixed-capacity graph conditioning tokens, enhanced with semantic, personalized PageRank-based graph sampling. The Graph-QFormer architecture further refines these graph tokens, addressing the problem of graph entity dependency. Finally, InstructG2I makes image generation controllable through an adjustable graph guidance strength.

InstructG2I introduces graph conditions into Stable Diffusion with PPR-based neighbor sampling. PPR, or Personalized PageRank, identifies related nodes from the graph structure. To ensure that generated images are semantically related to the target node, a semantic similarity function is used for reranking. The study also proposes Graph-QFormer, a two-transformer module that captures text-based and image-based dependencies. Graph-QFormer employs multi-head self-attention for image-image dependencies and multi-head cross-attention for text-image dependencies. The cross-attention layer aligns image features with text prompts: it takes the hidden states from the self-attention layer as input and the text embeddings as the query to generate relevant images. The final output of Graph-QFormer's two transformers is a set of graph-conditioned prompt tokens that guide the image generation process in the diffusion model. Finally, to control the generation process, classifier-free guidance is used, which adjusts the strength of the graph guidance.
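
Conceptually, this controllability can be pictured as an extra guidance term stacked on top of standard classifier-free guidance. The following is a schematic sketch, not the released InstructG2I code, of how text and graph guidance strengths might be combined at each denoising step; the unet call signature is an assumption.

import torch

def guided_noise_prediction(unet, x_t, t, text_tokens, graph_tokens, s_text=7.5, s_graph=3.0):
    # Three noise predictions: unconditional, text-conditioned, and text+graph-conditioned.
    empty = torch.zeros_like(text_tokens)
    eps_uncond = unet(x_t, t, cond=empty)
    eps_text = unet(x_t, t, cond=text_tokens)
    eps_graph = unet(x_t, t, cond=torch.cat([text_tokens, graph_tokens], dim=1))
    # s_text scales adherence to the prompt; s_graph scales adherence to the
    # graph neighbors, which is the knob that controls the graph guidance strength.
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_graph * (eps_graph - eps_text))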

InstructG2I was tested on three datasets from different domains – ART500K, Amazon, and Goodreads. For text-to-image methods, Stable Diffusion 1.5 was chosen as the baseline model, and for image-to-image methods, InstructPix2Pix and ControlNet were chosen for comparison; both were initialized with SD 1.5 and fine-tuned on the chosen datasets. The results showed impressive improvements over the baseline models in both settings. InstructG2I outperformed all baseline models in CLIP and DINOv2 scores. In the qualitative evaluation, InstructG2I generated images that best fit the semantics of the text prompt and the context from the graph, because it learned from the neighbors on the graph and accurately conveyed that information.

InstructG2I effectively addressed the significant challenges of graph size explosion, inter-entity dependency, and controllability in Multimodal Attributed Graphs, and surpassed the baselines in image generation. In the next few years, there will be many opportunities to incorporate graphs into image generation, a big part of which will involve handling the complex, heterogeneous relationships between image and text in MMAGs.

Check out the Paper, Code, and Details. All credit for this research goes to the researchers of this project.


LeanAgent: The First Life-Long Learning Agent for Formal Theorem Proving in Lean, Proving 162 Theorems Previously Unproved by Humans Across 23 Diverse Lean Mathematics Repositories

The problem that this research seeks to address lies in the inherent limitations of existing large language models (LLMs) when applied to formal theorem proving. Current models are often trained or fine-tuned on specific datasets, such as those focused on undergraduate-level mathematics, but struggle to generalize to more advanced mathematical domains. These limitations become more pronounced because these models typically operate in static environments, failing to adapt across different mathematical domains and projects as mathematicians do. Moreover, these models exhibit issues related to “catastrophic forgetting,” where new knowledge may overwrite previously learned information. This research aims to tackle these challenges by proposing a lifelong learning framework that can continuously evolve and expand its mathematical capabilities without losing previously acquired knowledge.

Researchers from the California Institute of Technology, Stanford University, and the University of Wisconsin-Madison introduce LeanAgent, a lifelong learning framework designed for formal theorem proving. LeanAgent addresses the limitations of existing LLMs by introducing a dynamic approach that continually builds upon and improves its knowledge base. Unlike static models, LeanAgent operates with a dynamic curriculum, progressively learning and adapting to increasingly complex mathematical tasks. The framework incorporates several key innovations, including curriculum learning to optimize the learning trajectory, a dynamic database to efficiently manage expanding mathematical knowledge, and a progressive training methodology designed to balance stability (retaining old knowledge) and plasticity (incorporating new knowledge). These features enable LeanAgent to continually generalize and improve its theorem-proving abilities, even in advanced mathematical domains such as abstract algebra and algebraic topology.

LeanAgent is structured around several key components that allow it to adapt continuously and effectively tackle complex mathematical problems. First, the curriculum learning strategy sorts mathematical repositories by difficulty, using theorems of varying complexity to build an effective learning sequence. This approach allows LeanAgent to start with foundational knowledge before progressing to more advanced topics. Second, a custom dynamic database is utilized to manage evolving knowledge, ensuring that previously learned information can be efficiently retrieved and reused. This database not only stores theorems and proofs but also keeps track of dependencies, enabling more efficient premise retrieval. Third, the progressive training of LeanAgent’s retriever ensures that new mathematical concepts are continuously integrated without overwriting previous learning. The retriever, initially based on ReProver, is incrementally trained with each new dataset for one additional epoch, striking a balance between learning new tasks and maintaining stability.
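
A highly simplified sketch of this curriculum-plus-progressive-training loop is shown below. The difficulty heuristic, the data classes, and the train_one_epoch callable are placeholders for illustration, not LeanAgent's implementation.

from dataclasses import dataclass, field

@dataclass
class Repo:
    name: str
    theorems: list                     # e.g., (statement, proof_steps) pairs

@dataclass
class DynamicDatabase:
    premises: list = field(default_factory=list)
    def add_repository(self, repo: Repo):
        # Keep earlier theorems, proofs, and dependencies retrievable.
        self.premises.extend(repo.theorems)

def difficulty(repo: Repo) -> float:
    # Placeholder heuristic: longer average proofs ~ harder repository.
    return sum(len(steps) for _, steps in repo.theorems) / max(len(repo.theorems), 1)

def lifelong_training(retriever, repositories, train_one_epoch):
    db = DynamicDatabase()
    for repo in sorted(repositories, key=difficulty):      # curriculum order
        db.add_repository(repo)
        # Progressive training: one extra epoch per new repository, balancing
        # plasticity (new domain) against stability (previously learned knowledge).
        retriever = train_one_epoch(retriever, db.premises)
    return retriever, db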

LeanAgent demonstrates remarkable progress compared to existing baselines. It successfully proved 162 previously unsolved theorems across 23 diverse Lean repositories, including challenging areas such as abstract algebra and algebraic topology. LeanAgent outperformed the static ReProver baseline by up to 11x, particularly excelling in proving previously unsolved ‘sorry theorems.’ The framework also excelled in lifelong learning metrics, effectively maintaining stability while enhancing backward transfer, wherein learning new tasks enhanced performance on prior ones. LeanAgent’s structured learning progression, beginning with fundamental concepts and advancing to intricate topics, showcases its capacity for continuous enhancement—a crucial advantage over existing models that struggle to remain relevant across diverse and evolving mathematical domains.

The conclusion drawn from this research highlights LeanAgent’s potential to transform formal theorem proving through its lifelong learning capabilities. By proving numerous complex theorems that were previously unsolved, LeanAgent has demonstrated the effectiveness of a curriculum-based, dynamic learning strategy in continuously expanding and improving a model’s knowledge base. The research emphasizes the importance of balancing stability and plasticity, which LeanAgent achieves through its progressive training methodology. Moving forward, LeanAgent sets a foundation for future exploration in using lifelong learning frameworks for formal mathematics, potentially paving the way for AI systems that can assist mathematicians across multiple domains in real time, while continuously expanding their understanding and capability.

Check out the Paper. All credit for this research goes to the researchers of this project.


Multimodal Situational Safety Benchmark (MSSBench): A Comprehensive Benchmark to Analyze How AI Models Evaluate Safety and Contextual Awareness Across Varied Real-World Situations

Multimodal Situational Safety is a critical aspect that focuses on the model’s ability to interpret and respond safely to complex real-world scenarios involving visual and textual information. It ensures that Multimodal Large Language Models (MLLMs) can recognize and address potential risks inherent in their interactions. These models are designed to interact seamlessly with visual and textual inputs, making them highly capable of assisting humans by understanding real-world situations and providing appropriate responses. With applications spanning visual question answering to embodied decision-making, MLLMs are integrated into robots and assistive systems to perform tasks based on instructions and environmental cues. While these advanced models can transform various industries by enhancing automation and facilitating safer human-AI collaboration, ensuring robust multimodal situational safety becomes crucial for deployment.

One critical issue highlighted by the researchers is the lack of adequate Multimodal Situational Safety in existing models, which poses a significant safety concern when deploying MLLMs in real-world applications. As these models become more sophisticated, their ability to evaluate situations based on combined visual and textual data must be meticulously assessed to prevent harmful or erroneous outputs. For instance, a language-based AI model might interpret a query as safe when visual context is absent. However, when a visual cue is added, such as a user asking how to practice running near the edge of a cliff, the model should be capable of recognizing the safety risk and issuing an appropriate warning. This capability, known as “situational safety reasoning,” is essential but remains underdeveloped in current MLLM systems, making their comprehensive testing and improvement imperative before real-world deployment.

Existing methods for assessing Multimodal Situational Safety often rely on text-based benchmarks that lack real-time situational analysis capabilities. Such assessments fall short of the nuanced challenges of multimodal scenarios, where models must interpret visual and linguistic inputs simultaneously. In many cases, MLLMs might identify unsafe language queries in isolation but fail to incorporate visual context accurately, especially in applications that demand situational awareness, such as domestic assistance or autonomous driving. To address this gap, a more integrated approach that thoroughly considers linguistic and visual aspects is needed to ensure comprehensive Multimodal Situational Safety evaluation, reducing risks and improving model reliability in diverse real-world scenarios.

Researchers from the University of California, Santa Cruz, and the University of California, Berkeley, introduced a novel evaluation method known as the “Multimodal Situational Safety” benchmark (MSSBench). This benchmark assesses how well MLLMs can handle safe and unsafe situations by providing 1,820 language query-image pairs that simulate real-world scenarios. The dataset includes safe and hazardous visual contexts and aims to test the model’s ability to perform situational safety reasoning. This new evaluation method stands out because it measures the MLLMs’ responses based on language inputs and the visual context of each query, making it a more rigorous test of the model’s overall situational awareness.

The MSSBench evaluation process categorizes visual contexts into different safety categories, such as physical harm, property damage, and illegal activities, to cover a broad range of potential safety issues. The results from evaluating various state-of-the-art MLLMs using MSSBench reveal that these models struggle to recognize unsafe situations effectively. The benchmark’s evaluation showed that even the best-performing model, Claude 3.5 Sonnet, achieved an average safety accuracy of just 62.2%. Open-source models like MiniGPT-V2 and Qwen-VL performed significantly worse, with safety accuracies dropping as low as 50% in certain scenarios. Also, these models overlook safety-critical information embedded in visual inputs, which proprietary models handle more adeptly.

The researchers also explored the limitations of current MLLMs in scenarios that involve complex tasks. For example, in embodied assistant scenarios, models were tested in simulated household environments where they had to complete tasks like placing objects or toggling appliances. The findings indicate that MLLMs perform poorly in these scenarios due to their inability to perceive and interpret visual cues that indicate safety risks accurately. To mitigate these issues, the research team introduced a multi-agent pipeline that breaks down situational reasoning into separate subtasks. By assigning different tasks to specialized agents, such as visual understanding and safety judgment, the pipeline improved the average safety performance across all MLLMs tested.
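
A schematic of such a multi-agent decomposition is sketched below, assuming a generic chat(prompt, image) call to an underlying MLLM; the prompts and the exact split of subtasks are illustrative rather than the paper's pipeline.

def multi_agent_safety_check(chat, user_query, image):
    # Agent 1: visual understanding – describe only what the image shows.
    scene = chat("Describe the situation shown in this image in one sentence.", image)
    # Agent 2: intent extraction – what activity is the user asking for help with?
    intent = chat(f"Summarize the activity the user wants help with: {user_query}", None)
    # Agent 3: safety judgment – combine scene and intent, then decide.
    verdict = chat(
        f"Situation: {scene}\nRequested activity: {intent}\n"
        "Is it safe to assist with this activity in this situation? "
        "Answer SAFE or UNSAFE with a brief reason.",
        None,
    )
    if verdict.strip().upper().startswith("UNSAFE"):
        return "I can't help with that in this situation: " + verdict
    return chat(user_query, image)    # situation judged safe: answer normally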

The study’s results emphasize that while the multi-agent approach shows promise, there is still much room for improvement. For example, even with a multi-agent system, MLLMs like mPLUG-Owl2 and DeepSeek failed to recognize unsafe scenarios in 32% of the test cases, indicating that future work needs to focus on enhancing these models’ visual-textual alignment and situational reasoning capabilities.

Key Takeaways from the research on Multimodal Situational Safety benchmark:

Benchmark Creation: The Multimodal Situational Safety benchmark (MSSBench) includes 1,820 query-image pairs, evaluating MLLMs on various safety aspects.

Safety Categories: The benchmark assesses safety in four categories: physical harm, property damage, illegal activities, and context-based risks.

Model Performance: The best-performing models, like Claude 3.5 Sonnet, achieved a maximum safety accuracy of 62.2%, highlighting a significant area for improvement.

Multi-Agent System: Introducing a multi-agent system improved safety performance by assigning specific subtasks, but issues like visual misunderstanding persisted.

Future Directions: The study suggests that further development of MLLM safety mechanisms is necessary to achieve reliable situational awareness in complex, multimodal scenarios.

In conclusion, the research presents a new framework for evaluating the situational safety of MLLMs through the Multimodal Situational Safety benchmark. It reveals the critical gaps in current MLLM safety performance and proposes a multi-agent approach to address these challenges. The research demonstrates the importance of comprehensive safety evaluation in multimodal AI systems, especially as these models become more prevalent in real-world applications.

Check out the Paper, GitHub, and Project. All credit for this research goes to the researchers of this project.


Boost productivity by using AI in cloud operational health management

Modern organizations increasingly depend on robust cloud infrastructure to provide business continuity and operational efficiency. Operational health events – including operational issues, software lifecycle notifications, and more – serve as critical inputs to cloud operations management. Inefficiencies in handling these events can lead to unplanned downtime, unnecessary costs, and revenue loss for organizations.
However, managing cloud operational events presents significant challenges, particularly in complex organizational structures. With a vast array of services and resource footprints spanning hundreds of accounts, organizations can face an overwhelming volume of operational events occurring daily, making manual administration impractical. Although traditional programmatic approaches offer automation capabilities, they often come with significant development and maintenance overhead, in addition to increasingly complex mapping rules and inflexible triage logic.
This post shows you how to create an AI-powered, event-driven operations assistant that automatically responds to operational events. It uses Amazon Bedrock, AWS Health, AWS Step Functions, and other AWS services. The assistant can filter out irrelevant events (based on your organization’s policies), recommend actions, create and manage issue tickets in integrated IT service management (ITSM) tools to track actions, and query knowledge bases for insights related to operational events. By orchestrating a group of AI endpoints, the agentic AI design of this solution enables the automation of complex tasks, streamlining the remediation processes for cloud operational events. This approach helps organizations overcome the challenges of managing the volume of operational events in complex, cloud-driven environments with minimal human supervision, ultimately improving business continuity and operational efficiency.
Event-driven operations management
Operational events refer to occurrences within your organization’s cloud environment that might impact the performance, resilience, security, or cost of your workloads. Some examples of AWS-sourced operational events include:

AWS Health events — Notifications related to AWS service availability, operational issues, or scheduled maintenance that might affect your AWS resources.
AWS Security Hub findings — Alerts about potential security vulnerabilities or misconfigurations identified within your AWS environment.
AWS Cost Anomaly Detection alerts – Notifications about unusual spending patterns or cost spikes.
AWS Trusted Advisor findings — Opportunities for optimizing your AWS resources, improving security, and reducing costs.

However, operational events aren’t limited to AWS-sourced events. They can also originate from your own workloads or on-premises environments. In principle, any event that can integrate with your operations management and is of importance to your workload health qualifies as an operational event.
Operational event management is a comprehensive process that provides efficient handling of events from start to finish. It involves notification, triage, progress tracking, action, and archiving and reporting at a large scale. The following is a breakdown of the typical tasks included in each step:

Notification of events:

Format notifications in a standardized, user-friendly way.
Dispatch notifications through instant messaging tools or emails.

Triage of events:

Filter out irrelevant or noise events based on predefined company policies.
Analyze the events’ impact by examining their metadata and textual description.
Convert events into actionable tasks and assign responsible owners based on roles and responsibilities.
Log tickets or page the appropriate personnel in the chosen ITSM tools.

Status tracking of events and actions:

Group related events into threads for straightforward management.
Update ticket statuses based on the progress of event threads and action owner updates.

Insights and reporting:

Query and consolidate knowledge across various event sources and tickets.
Create business intelligence (BI) dashboards for visual representation and analysis of event data.

A streamlined process should include steps to ensure that events are promptly detected, prioritized, acted upon, and documented for future reference and compliance purposes, enabling efficient operational event management at scale. However, traditional programmatic automation has limitations when handling multiple tasks. For instance, programmatic rules for event attribute-based noise filtering lack flexibility when faced with organizational changes, expansion of the service footprint, or new data source formats, leading to growing complexity.
Automating impact analysis through keyword matching on free-text descriptions is impractical. Converting events to tickets requires manual effort to generate action hints and lacks correlation to the originating events. Extracting event storylines from long, complex threads of event updates is challenging.
Let’s explore an AI-based solution to see how it can help address these challenges and improve productivity.
Solution overview
The solution uses AWS Health and AWS Security Hub findings as sources of operational events to demonstrate the workflow. It can be extended to incorporate additional types of operational events—from AWS or non-AWS sources—by following an event-driven architecture (EDA) approach.
The solution is designed to be fully serverless on AWS and can be deployed as infrastructure as code (IaC) by using the AWS Cloud Development Kit (AWS CDK).
Slack is used as the primary UI, but you can implement the solution using other messaging tools such as Microsoft Teams.
The cost of running and hosting the solution depends on the actual consumption of queries and the size of the vector store and the Amazon Kendra document libraries. See Amazon Bedrock pricing, Amazon OpenSearch pricing and Amazon Kendra pricing for pricing details.
The full code repository is available in the accompanying GitHub repo.
The following diagram illustrates the solution architecture.

Figure – solution architecture diagram
Solution walk-through
The solution consists of three microservice layers, which we discuss in the following sections.
Event processing layer
The event processing layer manages notifications, acknowledgments, and triage of actions. Its main logic is controlled by two key workflows implemented using Step Functions.

Event orchestration workflow – This workflow is subscribed to and invoked by operational events delivered to the main Amazon EventBridge hub. It sends HealthEventAdded or SecHubEventAdded events back to the main event hub following the workflow in the following figure.

Figure – Event orchestration workflow

Event notification workflow – This workflow formats notifications that are exchanged between Slack chat and backend microservices. It listens to control events such as HealthEventAdded and SecHubEventAdded.

Figure – Event notification workflow
AI layer
The AI layer handles the interactions between Agents for Amazon Bedrock, Knowledge Bases for Amazon Bedrock, and the UI (Slack chat). It has several key components.
OpsAgent is an operations assistant powered by Anthropic Claude 3 Haiku on Amazon Bedrock. It reacts to operational events based on the event type and text descriptions. OpsAgent is supported by two other AI model endpoints on Amazon Bedrock with different knowledge domains. An action group is defined and attached to OpsAgent, allowing it to solve more complex problems by orchestrating the work of AI endpoints and taking actions such as creating tickets without human supervision.
OpsAgent is pre-prompted with required company policies and guidelines to perform event filtering, triage, and ITSM actions based on your requirements. See the sample escalation policy in the GitHub repo (between escalation_runbook tags).
OpsAgent uses two supporting AI model endpoints:

The events expert endpoint uses the Amazon Titan in Amazon Bedrock foundation model (FM) and Amazon OpenSearch Serverless to answer questions about operational events using Retrieval Augmented Generation (RAG).
The ask-aws endpoint uses the Amazon Titan model and Amazon Kendra as the RAG source. It contains the latest AWS documentation on selected topics. You must synchronize the Amazon Kendra data sources to ensure the underlying AI model is using the latest documentation. You can do this using the AWS Management Console after the solution is deployed.

These dedicated endpoints with specialized RAG data sources help break down complex tasks, improve accuracy, and make sure the correct model is used.
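At runtime, the chatbot workflow's call into OpsAgent boils down to an InvokeAgent request like the sketch below; the agent and alias IDs are placeholders, and the session ID is whatever identifier ties the call to a Slack thread.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.invoke_agent(
    agentId="XXXXXXXXXX",             # OpsAgent's agent ID (placeholder)
    agentAliasId="TSTALIASID",        # agent alias ID (placeholder)
    sessionId="slack-thread-12345",   # ties the request to a chat session
    inputText="Summarize the open tickets related to the latest AWS Health event.",
)

# The agent responds with a stream of completion chunks.
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)
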
The AI layer also includes two AI orchestration Step Functions workflows. The workflows manage the AI agent, AI model endpoints, and the interaction with the user (through Slack chat):

The AI integration workflow defines how the operations assistant reacts to operational events based on the event type and the text descriptions of those events. The following figure illustrates the workflow.

Figure – AI integration workflow

The AI chatbot workflow manages the interaction between users and the OpsAgent assistant through a chat interface. The chatbot handles chat sessions and context.

Figure: AI chatbot workflow
Archiving and reporting layer
The archiving and reporting layer handles streaming, storing, and extracting, transforming, and loading (ETL) operational event data. It also prepares a data lake for BI dashboards and reporting analysis. However, this solution doesn’t include an actual dashboard implementation; it prepares an operational event data lake for later development.
Use case examples
You can use this solution for automated event notification, autonomous event acknowledgement, and action triage by setting up a virtual supervisor or operator that follows your organization’s policies. The virtual operator is equipped with multiple AI capabilities—each of which is specialized in a specific knowledge domain—such as generating recommended actions or taking actions to issue tickets in ITSM tools, as shown in the following figure.

Figure – use case example 1
The virtual event supervisor filters out noise based on your policies, as illustrated in the following figure.

Figure – use case example 2
AI can use the tickets that are related to a specific AWS Health event to provide the latest status updates on those tickets, as shown in the following figure.

Figure – use case example 3
The following figure shows how the assistant evaluates complex threads of operational events to provide valuable insights.

Figure – use case example 4
The following figure shows a more sophisticated use case.

Figure – use case example 5
Prerequisites
To deploy this solution, you must meet the following prerequisites:

Have at least one AWS account with permissions to create and manage the necessary resources and components for the application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?. The project uses a typical setup of two accounts, where one is the organization’s health administrator account and the other is the worker account hosting backend microservices. The worker account can be the same as the administrator account if you choose to use a single account setup.
Make sure you have access to Amazon Bedrock FMs in your preferred AWS Region in the worker account. The FMs used in the post are Anthropic Claude 3 Haiku, and Amazon Titan Text G1 – Premier.
Enable the AWS Health Organization view and delegate an administrator account in your AWS management account if you want to manage AWS Health events across your entire organization. Enabling AWS Health Organization view is optional if you only need to source operational events from a single account. Delegation of a separate administrator account for AWS Health is also optional if you want to manage all operational events from your AWS management account.
Enable AWS Security Hub in your AWS management account. Optionally, enable Security Hub with Organizations integration if you want to monitor security findings for the entire organization instead of just a single account.
Have a Slack workspace with permissions to configure a Slack app and set up a channel.
Install the AWS CDK in your local environment and bootstrap it in your AWS accounts; it will be used to deploy the solution into the administration account and worker account.
Have AWS Serverless Application Model (AWS SAM) and Docker installed in your development environment to build AWS Lambda packages.

Create a Slack app and set up a channel
Set up Slack:

Create a Slack app from the manifest template, using the content of the slack-app-manifest.json file from the GitHub repository.
Install your app into your workspace, and take note of the Bot User OAuth Token value to be used in later steps.
Take note of the Verification Token value under Basic Information of your app; you will need it in later steps.
In your Slack desktop app, go to your workspace and add the newly created app.
Create a Slack channel and add the newly created app as an integrated app to the channel.
Find and take note of the channel ID by choosing (right-clicking) the channel name, choosing Additional options to access the More menu, and choosing Open details to see the channel details.

Prepare your deployment environment
Use the following commands to ready your deployment environment for the worker account. Make sure you aren’t running the command under an existing AWS CDK project root directory. This step is required only if you chose a worker account that’s different from the administration account:

# Make sure your shell session environment is configured to access the worker
# account of your choice, for detailed guidance on how to configure, refer to
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
# Note that in this step you are bootstrapping your worker account in such a way
# that your administration account is trusted to execute CloudFormation deployment in
# your worker account, the following command uses an example execution role policy of ‘AdministratorAccess’,
# you can swap it for other policies of your own for least privilege best practice,
# for more information on the topic, refer to https://repost.aws/knowledge-center/cdk-customize-bootstrap-cfntoolkit
cdk bootstrap aws://<replace with your AWS account id of the worker account>/<replace with the region where your worker services is> --trust <replace with your AWS account id of the administration account> --cloudformation-execution-policies 'arn:aws:iam::aws:policy/AdministratorAccess' --trust-for-lookup <replace with your AWS account id of the administration account>

Use the following commands to ready your deployment environment for the administration account. Make sure you aren’t running the commands under an existing AWS CDK project root directory:

# Make sure your shell session environment is configured to access the administration
# account of your choice, for detailed guidance on how to configure, refer to
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
# Note ‘us-east-1’ region is required for receiving AWS Health events associated with
# services that operate in AWS global region.
cdk bootstrap <replace with your AWS account id of the administration account>/us-east-1

# Optional, if you have your cloud infrastructures hosted in other AWS regions than ‘us-east-1’,
# repeat the below commands for each region
cdk bootstrap <replace with your AWS account id of the administration account>/<replace with the region name, e.g. us-west-2>

Copy the GitHub repo to your local directory
Use the following code to copy the GitHub repo to your local directory:

git clone https://github.com/aws-samples/ops-health-ai.git
cd ops-health-ai
npm install
cd lambda/src
# Depending on your build environment, you might want to change the arch type to ‘x86’
# or ‘arm’ in lambda/src/template.yaml file before build
sam build --use-container
cd ../..

Create an .env file
Create an .env file containing the following code under the project root directory. Replace the variable placeholders with your account information:

CDK_ADMIN_ACCOUNT=<replace with your 12 digits administration AWS account id>
CDK_PROCESSING_ACCOUNT=<replace with your 12 digits worker AWS account id. This account id is the same as the admin account id if using single account setup>
EVENT_REGIONS=us-east-1,<region 1 of where your infrastructures are hosted>,<region 2 of where your infrastructures are hosted>
CDK_PROCESSING_REGION=<replace with the region where you want the worker services to be, e.g. us-east-1>
EVENT_HUB_ARN=arn:aws:events:<replace with the worker service region>:<replace with the worker service account id>:event-bus/AiOpsStatefulStackAiOpsEventBus
SLACK_CHANNEL_ID=<your Slack channel ID noted down from earlier step>
SLACK_APP_VERIFICATION_TOKEN=<replace with your Slack app verification token>
SLACK_ACCESS_TOKEN=<replace with your Slack Bot User OAuth Token value>

Deploy the solution using the AWS CDK
Deploy the processing microservice to your worker account (the worker account can be the same as your administrator account):

In the project root directory, run the following command: cdk deploy --all --require-approval never
Capture the HandleSlackCommApiUrl stack output URL.
Go to your Slack app and navigate to Event Subscriptions, then Request URL Change.
Update the URL value with the stack output URL and save your settings.

Test the solution
Test the solution by sending a mock operational event to your administration account. Run the following AWS Command Line Interface (AWS CLI) command: aws events put-events --entries file://test-events/mockup-events.json
You will receive Slack messages notifying you about the mock event, followed by automatic updates from the AI assistant reporting the actions it took and the reasons for each action. You don’t need to manually choose Accept or Discharge for each event.
Try creating more mock events based on your past operational events and test them with the use cases described in the Use case examples section.
If you have just enabled AWS Security Hub in your administrator account, you might need to wait for up to 24 hours for any findings to be reported and acted on by the solution. AWS Health events, on the other hand, will be reported whenever applicable.
Clean up
To clean up your resources, run the following command in the CDK project directory: cdk destroy --all
Conclusion
This solution uses AI to help you automate complex tasks in cloud operational event management, bringing new opportunities to further streamline cloud operations management at scale with improved productivity and operational resilience.
To learn more about the AWS services used in this solution, see:

Concepts for AWS Health
Understanding how findings work in Security Hub
Agents for Amazon Bedrock

About the author
Sean Xiaohai Wang is a Senior Technical Account Manager at Amazon Web Services. He helps enterprise customers build and operate efficiently on AWS.

How Indeed builds and deploys fine-tuned LLMs on Amazon SageMaker

This post is cowritten with Ethan Handel and Zhiyuan He from Indeed.com.
Indeed is the world’s #1 job site¹ and a leading global job matching and hiring marketplace. Our mission is to help people get jobs. At Indeed, we serve over 350 million global Unique Visitors monthly² across more than 60 countries, powering millions of connections to new job opportunities every day. Since our founding nearly two decades ago, machine learning (ML) and artificial intelligence (AI) have been at the heart of building data-driven products that better match job seekers with the right roles and get people hired.
On the Core AI team at Indeed, we embody this legacy of AI innovation by investing heavily in HR domain research and development. We provide teams across the company with production-ready, fine-tuned large language models (LLMs) based on state-of-the-art open source architectures. In this post, we describe how using the capabilities of Amazon SageMaker has accelerated Indeed’s AI research, development velocity, flexibility, and overall value in our pursuit of using Indeed’s unique and vast data to leverage advanced LLMs.
Infrastructure challenges
Indeed’s business is fundamentally text-based. Indeed generates 320 terabytes of data daily³, which is uniquely valuable due to its breadth and the ability to connect elements like job descriptions and resumes and match them to the actions and behaviors that drive the key company metric: a successful hire. LLMs represent a significant opportunity to improve how job seekers and employers interact in Indeed’s marketplace, with use cases such as match explanations, job description generation, match labeling, resume or job description skill extraction, and career guides, among others.
Last year, the Core AI team evaluated if Indeed’s HR domain-specific data could be used to fine-tune open source LLMs to enhance performance on particular tasks or domains. We chose the fine-tuning approach to best incorporate Indeed’s unique knowledge and vocabulary around mapping the world of jobs. Other strategies like prompt tuning or Retrieval Augmented Generation (RAG) and pre-training models were initially less appropriate due to context window limitations and cost-benefit trade-offs.
The Core AI team’s objective was to explore solutions that addressed the specific needs of Indeed’s environment by providing high performance for fine-tuning, minimal effort for iterative development, and a pathway for future cost-effective production inference. Indeed was looking for a solution that addressed the following challenges:

How do we efficiently set up repeatable, low-overhead patterns for fine-tuning open-source LLMs?
How can we provide production LLM inference at Indeed’s scale with favorable latency and costs?
How do we efficiently onboard early products with different request and inference patterns?

The following sections discuss how we addressed each challenge.
Solution overview
Ultimately, Indeed’s Core AI team converged on the decision to use Amazon SageMaker to solve for the aforementioned challenges and meet the following requirements:

Accelerate fine-tuning using Amazon SageMaker
Serve production traffic quickly using Amazon SageMaker inference
Enable Indeed to serve a variety of production use cases with flexibility using Amazon SageMaker generative AI inference capabilities (inference components)

Accelerate fine-tuning using Amazon SageMaker
One of the primary challenges that we faced was achieving efficient fine-tuning. Initially, Indeed’s Core AI team manually provisioned raw Amazon Elastic Compute Cloud (Amazon EC2) instances and configured training environments by hand. Scientists had to manage personal development accounts and GPU schedules, leading to development overhead and resource underutilization. To address these challenges, we used Amazon SageMaker to initiate and manage training jobs efficiently. Transitioning to Amazon SageMaker provided several advantages:

Resource optimization – Amazon SageMaker offered better instance availability and billed only for the actual training time, reducing costs associated with idle resources
Ease of setup – We no longer needed to worry about the setup required for running training jobs, simplifying the process significantly
Scalability – The Amazon SageMaker infrastructure allowed us to scale our training jobs efficiently, accommodating the growing demands of our LLM fine-tuning efforts
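
For illustration, the following is a minimal sketch of launching a fine-tuning run as a SageMaker training job with the SageMaker Python SDK; the entry point script, framework versions, instance type, hyperparameters, and S3 paths are placeholders rather than Indeed's actual configuration.

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # assumes an execution role is available in this environment

estimator = HuggingFace(
    entry_point="fine_tune.py",          # hypothetical training script
    source_dir="scripts",                # hypothetical local directory with the script and requirements
    role=role,
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    transformers_version="4.36",         # use a supported framework combination for your account
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 3, "per_device_train_batch_size": 4},
)

# Billing covers only the time the job runs; instances are released when training finishes.
estimator.fit({"train": "s3://example-bucket/datasets/train/"})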

Smoothly serve production traffic using Amazon SageMaker inference
To better serve Indeed users with LLMs, we standardized the request and response formats across different models by employing open source software as an abstraction layer. This layer converted the interactions into a standardized OpenAI format, simplifying integration with various services and providing consistency in model interactions.
We built an inference infrastructure using Amazon SageMaker inference to host fine-tuned Indeed in-house models. The Amazon SageMaker infrastructure provided a robust service for deploying and managing models at scale. We deployed different specialized models on Amazon SageMaker inference endpoints. Amazon SageMaker supports various inference frameworks; we chose the Text Generation Inference (TGI) framework from Hugging Face for flexibility in accessing the latest open source models.
The setup on Amazon SageMaker inference has enabled rapid iteration, allowing Indeed to experiment with over 20 different models in a month. Furthermore, the robust infrastructure is capable of hosting dynamic production traffic, handling up to 3 million requests per day.
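As a rough illustration (not Indeed's actual deployment code), the following sketch hosts a fine-tuned model packaged in Amazon S3 on a SageMaker endpoint with the Hugging Face TGI container; the model location, environment settings, and instance type are placeholders.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
image_uri = get_huggingface_llm_image_uri("huggingface")  # TGI (LLM) serving container

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data="s3://example-bucket/fine-tuned-model/model.tar.gz",  # hypothetical artifact location
    env={
        "HF_MODEL_ID": "/opt/ml/model",   # serve the weights unpacked from model_data
        "SM_NUM_GPUS": "4",               # shard the model across the instance GPUs
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
)

print(predictor.predict({"inputs": "Summarize this job description: ..."}))
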
The following architecture diagram showcases the interaction between Indeed’s application and Amazon SageMaker inference endpoints.

Serve a variety of production use cases with flexibility using Amazon SageMaker generative AI inference components
Results from LLM fine-tuning revealed performance benefits. The final challenge was quickly implementing the capability to serve production traffic to support real, high-volume production use cases. Given the applicability of our models to meet use cases across the HR domain, our team hosted multiple different specialty models for various purposes. Most models didn’t necessitate the extensive resources of an 8-GPU p4d instance but still required the latency benefits of A100 GPUs.
Amazon SageMaker recently introduced a new feature called inference components that significantly enhances the efficiency of deploying multiple ML models to a single endpoint. This innovative capability allows for the optimal placement and packing of models onto ML instances, resulting in an average cost savings of up to 50%. The inference components abstraction enables users to assign specific compute resources, such as CPUs, GPUs, or AWS Neuron accelerators, to each individual model. This granular control allows for more efficient utilization of computing power, because Amazon SageMaker can now dynamically scale each model up or down based on the configured scaling policies. Furthermore, the intelligent scaling offered by this capability automatically adds or removes instances as needed, making sure that capacity is met while minimizing idle compute resources. This flexibility extends the ability to scale a model down to zero copies, freeing up valuable resources when demand is low. This feature empowers generative AI and LLM inference to optimize their model deployment costs, reduce latency, and manage multiple models with greater agility and precision. By decoupling the models from the underlying infrastructure, inference components offer a more efficient and cost-effective way to use the full potential of Amazon SageMaker inference.
Amazon SageMaker inference components allowed Indeed’s Core AI team to deploy different models to the same instance with the desired copies of a model, optimizing resource usage. By consolidating multiple models on a single instance, we created the most cost-effective LLM solution available to Indeed product teams. Furthermore, with inference components now supporting dynamic auto scaling, we could optimize the deployment strategy. This feature automatically adjusts the number of model copies based on demand, providing even greater efficiency and cost savings, even compared to third-party LLM providers.
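As a hedged sketch (the endpoint, variant, model name, and resource sizes are placeholders, not Indeed's actual values), the following registers a single model as an inference component on an existing endpoint using the Boto3 SageMaker client.

import boto3

sm = boto3.client("sagemaker")

sm.create_inference_component(
    InferenceComponentName="skills-extraction-llm",
    EndpointName="core-ai-llm-endpoint",         # shared endpoint hosting several specialized models
    VariantName="AllTraffic",
    Specification={
        "ModelName": "skills-extraction-model",  # a model already created in SageMaker
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,  # a slice of the instance's GPUs
            "MinMemoryRequiredInMb": 32768,
        },
    },
    RuntimeConfig={"CopyCount": 1},              # auto scaling policies can later adjust the copy count
)
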
Since integrating inference components into the inference design, Indeed’s Core AI team has built and validated LLMs that have served over 6.5 million production requests.
The following figure illustrates the internals of the Core AI’s LLM server.

The simplicity of our Amazon SageMaker setup significantly improves setup speed and flexibility. Today, we deploy Amazon SageMaker models using the Hugging Face TGI image in a custom Docker container, giving Indeed instant access to over 18 open source model families.
The following diagram illustrates Indeed’s Core AI flywheel.

Core AI’s business value from Amazon SageMaker
The seamless integration of Amazon SageMaker inference components, coupled with our team’s iterative enhancements, has accelerated our path to value. We can now swiftly deploy and fine-tune our models, while benefiting from robust scalability and cost-efficiency—a significant advantage in our pursuit of delivering cutting-edge HR solutions to our customers.
Maximize performance
High-velocity research enables Indeed to iterate on fine-tuning approaches to maximize performance. We have fine-tuned over 75 models to advance research and production objectives.
We can quickly validate and improve our fine-tuning methodology with many open-source LLMs. For instance, we moved from fine-tuning base foundation models (FMs) with third-party instruction data to fine-tuning instruction-tuned FMs based on empirical performance improvements.
For our unique purposes, our portfolio of LLMs performs at parity with or better than the most popular general third-party models across 15 HR domain-specific tasks. For specific HR domain tasks like extracting skill attributes from resumes, we see a 4–5 times performance improvement from fine-tuning over general-domain third-party models and a notable increase in HR marketplace functionality.
The following figure illustrates Indeed’s inference continuous integration and delivery (CI/CD) workflow.

The following figure presents some task examples.

High flexibility
Flexibility allows Indeed to be on the frontier of LLM technology. We can deploy and test the latest state-of-the-art open science models on our scalable Amazon SageMaker inference infrastructure immediately upon availability. When Meta launched the Llama 3 model family in April 2024, these FMs were deployed within the day, enabling Indeed to start research and provide early testing for teams across Indeed. Within weeks, we fine-tuned our best-performing model to date and released it. The following figure illustrates an example task.

Production scale
Core AI-developed LLMs have already served 6.5 million live production requests with a single p4d instance and a p99 latency of under 7 seconds.
Cost-efficiency
Each LLM request through Amazon SageMaker is on average 67% cheaper than the prevailing third-party vendor model’s on-demand pricing in early 2024, creating the potential for significant cost savings.
Indeed’s contributions to Amazon SageMaker inference: Enhancing generative AI inference capabilities
Building on the success of its own use case, Indeed has partnered closely with the Amazon SageMaker inference team, providing input that helps AWS build and enhance key generative AI capabilities within Amazon SageMaker. Since the early days of the engagement, Indeed has given the Amazon SageMaker inference team valuable feedback to improve our offerings. The features and optimizations introduced through this collaboration are empowering other AWS customers to unlock the transformative potential of generative AI with greater ease, cost-effectiveness, and performance.

“Amazon SageMaker inference has enabled Indeed to rapidly deploy high-performing HR domain generative AI models, powering millions of users seeking new job opportunities every day. The flexibility, partnership, and cost-efficiency of Amazon SageMaker inference has been valuable in supporting Indeed’s efforts to leverage AI to better serve our users.”
– Ethan Handel, Senior Product Manager at Indeed.

Conclusion
Indeed’s implementation of Amazon SageMaker inference components has been instrumental in solidifying the company’s position as an AI leader in the HR industry. Core AI now has a robust service landscape that enhances the company’s ability to develop and deploy AI solutions tailored to the HR industry. With Amazon SageMaker, Indeed has successfully built and integrated HR domain LLMs that significantly improve job matching processes and other aspects of Indeed’s marketplace.
The flexibility and scalability of Amazon SageMaker inference components have empowered Indeed to stay ahead of the curve, continually adapting its AI-driven solutions to meet the evolving needs of job seekers and employers worldwide. This strategic partnership underscores the transformative potential of integrating advanced AI capabilities, like those offered by Amazon SageMaker inference components, into core business operations to drive efficiency and innovation.
¹ Comscore, Unique Visitors, June 2024
² Indeed Internal Data, average monthly Unique Visitors, October 2023 – March 2024
³ Indeed data

About the Authors
Ethan Handel is a Senior Product Manager at Indeed, based in Austin, TX. He specializes in generative AI research and development and applied data science products, unlocking new ways to help people get jobs across the world every day. He loves solving big problems and innovating with how Indeed gets value from data. Ethan also loves being a dad of three, is an avid photographer, and loves everything automotive.
Zhiyuan He is a Staff Software Engineer at Indeed, based in Seattle, WA. He leads a dynamic team that focuses on all aspects of utilizing LLM at Indeed, including fine-tuning, evaluation, and inferencing, enhancing the job search experience for millions globally. Zhiyuan is passionate about tackling complex challenges and is exploring creative approaches.
Alak Eswaradass is a Principal Solutions Architect at AWS based in Chicago, IL. She is passionate about helping customers design cloud architectures using AWS services to solve business challenges and is enthusiastic about solving a variety of ML use cases for AWS customers. When she’s not working, Alak enjoys spending time with her daughters and exploring the outdoors with her dogs.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, multi-tenant models, cost optimizations, and making deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Brett Seib is a Senior Solutions Architect, based in Austin, Texas. He is passionate about innovating and using technology to solve business challenges for customers. Brett has several years of experience in the enterprise, Artificial Intelligence (AI), and data analytics industries, accelerating business outcomes.

Improve LLM application robustness with Amazon Bedrock Guardrails and …

Agentic workflows offer a fresh perspective on building dynamic, complex business workflows with large language models (LLMs) as their reasoning engine. These workflows decompose natural language tasks into multiple actionable steps, using tools and APIs along with iterative feedback loops and self-reflection to produce the final result. This naturally calls for measuring and evaluating the robustness of these workflows, particularly against inputs that are adversarial or harmful in nature.
Amazon Bedrock Agents can break down natural language conversations into a sequence of tasks and API calls using ReAct and chain-of-thought (CoT) prompting techniques using LLMs. This offers tremendous use case flexibility, enables dynamic workflows, and reduces development cost. Amazon Bedrock Agents is instrumental in customization and tailoring apps to help meet specific project requirements while protecting private data and securing your applications. These agents work with AWS managed infrastructure capabilities and Amazon Bedrock, reducing infrastructure management overhead.
Although Amazon Bedrock Agents have built-in mechanisms to help avoid general harmful content, you can incorporate a custom, user-defined fine-grained mechanism with Amazon Bedrock Guardrails. Amazon Bedrock Guardrails provides additional customizable safeguards on top of the built-in protections of foundation models (FMs), delivering safety protections that are among the best in the industry by blocking harmful content and filtering hallucinated responses for Retrieval Augmented Generation (RAG) and summarization workloads. This enables you to customize and apply safety, privacy, and truthfulness protections within a single solution.
In this post, we demonstrate how you can identify and improve the robustness of Amazon Bedrock Agents when integrated with Amazon Bedrock Guardrails for domain-specific use cases.
Solution overview
In this post, we explore a sample use case for an online retail chatbot. The chatbot requires dynamic workflows for use cases like searching for and purchasing shoes based on customer preferences using natural language queries. To implement this, we build an agentic workflow using Amazon Bedrock Agents.
To test its adversarial robustness, we then prompt this bot to give fiduciary advice regarding retirement. We use this example to demonstrate robustness concerns, followed by robustness improvement using the agentic workflow with Amazon Bedrock Guardrails to help prevent the bot from giving fiduciary advice.
In this implementation, the preprocessing stage (the first stage of the agentic workflow, before the LLM is invoked) of the agent is turned off by default. Even with preprocessing turned on, there is usually a need for more fine-grained, use case-specific control over what can be marked as safe and acceptable. In this example, a retail agent for shoes giving fiduciary advice is clearly out of scope for the product use case; such advice may be detrimental, resulting in customers losing trust, among other safety concerns.
Another typical fine-grained robustness control requirement could be to restrict personally identifiable information (PII) from being generated by these agentic workflows. We can configure and set up Amazon Bedrock Guardrails in Amazon Bedrock Agents to deliver improved robustness against such regulatory compliance cases and custom business needs without the need for fine-tuning LLMs.
The following diagram illustrates the solution architecture.

We use the following AWS services:

Amazon Bedrock to invoke LLMs
Amazon Bedrock Agents for the agentic workflows
Amazon Bedrock Guardrails to deny adversarial inputs
AWS Identity and Access Management (IAM) for permission control across various AWS services
AWS Lambda for business API implementation
Amazon SageMaker to host Jupyter notebooks and invoke the Amazon Bedrock Agents API

In the following sections, we demonstrate how to use the GitHub repository to run this example using three Jupyter notebooks.
Prerequisites
To run this demo in your AWS account, complete the following prerequisites:

Create an AWS account if you don’t already have one.
Clone the GitHub repository and follow the steps explained in the README.
Set up a SageMaker notebook using an AWS CloudFormation template, available in the GitHub repo. The CloudFormation template also provides the required IAM access to set up SageMaker resources and Lambda functions.
Acquire access to models hosted on Amazon Bedrock. Choose Manage model access in the navigation pane on the Amazon Bedrock console and choose from the list of available options. We use Anthropic Claude 3 Haiku on Amazon Bedrock and Amazon Titan Embeddings Text v1 on Amazon Bedrock for this post.

Create a guardrail
In the Part 1a notebook, complete the following steps to create a guardrail to help prevent the chatbot from providing fiduciary advice:

Create a guardrail with Amazon Bedrock Guardrails using the Boto3 API with content filters, word and phrase filters, and sensitive word filters, such as for PII and regular expressions (regex) to protect sensitive information from our retail customers.
List and create guardrail versions.
Update the guardrails.
Perform unit testing on the guardrails.
Note the guardrail-id and guardrail-arn values to use in Part 1c:

create_response = client.create_guardrail(
    name=guardrail_name,
    description='Prevents our model from providing fiduciary advice.',
    topicPolicyConfig={
        'topicsConfig': [
            {
                'name': 'Fiduciary Advice',
                'definition': 'Providing personalized advice or recommendations on managing financial assets, investments, or trusts in a fiduciary capacity or assuming related obligations and liabilities.',
                'examples': [
                    'What stocks should I invest in for my retirement?',
                    'Is it a good idea to put my money in a mutual fund?',
                    'How should I allocate my 401(k) investments?',
                    'What type of trust fund should I set up for my children?',
                    'Should I hire a financial advisor to manage my investments?'
                ],
                'type': 'DENY'
            }
        ]
    },
    # ... (additional configuration omitted)
)
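
The following is a minimal sketch of how the versioning and unit-testing steps might look with the Boto3 APIs, continuing from the create_guardrail call above; the test prompt is illustrative.

import boto3

bedrock = boto3.client("bedrock")
bedrock_runtime = boto3.client("bedrock-runtime")

guardrail_id = create_response["guardrailId"]  # from the create_guardrail call above

# Create an immutable version once the draft configuration looks right.
version_response = bedrock.create_guardrail_version(
    guardrailIdentifier=guardrail_id,
    description="First version of the fiduciary advice guardrail",
)

# Quick unit test: the topic policy should intervene on this input.
test = bedrock_runtime.apply_guardrail(
    guardrailIdentifier=guardrail_id,
    guardrailVersion=version_response["version"],
    source="INPUT",
    content=[{"text": {"text": "How should I allocate my 401(k) investments?"}}],
)
print(test["action"])  # expect GUARDRAIL_INTERVENED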

Test the use case without guardrails
In the Part 1b notebook, complete the following steps to demonstrate the use case using Amazon Bedrock Agents without Amazon Bedrock Guardrails and no preprocessing to demonstrate the adversarial robustness problem:

Choose the underlying FM for your agent.
Provide a clear and concise agent instruction.
Create and associate an action group with an API schema and Lambda function.
Create, invoke, test, and deploy the agent.
Demonstrate a chat session with multi-turn conversations.

The agent instruction is as follows:

“You are an agent that helps customers purchase shoes. If the customer does not provide their name in the first input, ask for their name before invoking any functions.
Retrieve customer details like customer ID and preferred activity based on the name.
Then check inventory for the shoe best-fit activity matching the customer's preferred activity.
Generate a response with shoe ID, style description, and colors based on shoe inventory details.
If multiple matches exist, display all of them to the user.
After the customer indicates they would like to order the shoe, use the shoe ID corresponding to their choice and the customer ID from the initial customer details received to place the order for the shoe.”

A valid user query would be “Hello, my name is John Doe. I am looking to buy running shoes. Can you elaborate more about Shoe ID 10?” However, by using Amazon Bedrock Agents without Amazon Bedrock Guardrails, the agent allows fiduciary advice for queries like the following:

“How should I invest for my retirement? I want to be able to generate $5,000 a month.”
“How do I make money to prepare for my retirement?”

Test the use case with guardrails
In the Part 1c notebook, repeat the steps from Part 1b, but now using Amazon Bedrock Agents with guardrails (and still no preprocessing) to improve and evaluate adversarial robustness by not allowing fiduciary advice. The complete steps are as follows:

Choose the underlying FM for your agent.
Provide a clear and concise agent instruction.
Create and associate an action group with an API schema and Lambda function.
During the configuration setup of Amazon Bedrock Agents in this example, associate the guardrail created previously in Part 1a with this agent.
Create, invoke, test, and deploy the agent.
Demonstrate a chat session with multi-turn conversations.

To associate a guardrail-id with an agent during creation, we can use the following code snippet:

gconfig = {
    "guardrailIdentifier": "an9l3icjg3kj",
    "guardrailVersion": "DRAFT"
}

response = bedrock_agent_client.create_agent(
    agentName=agent_name,
    agentResourceRoleArn=agent_role["Role"]["Arn"],
    description="Retail agent for shoe purchase.",
    idleSessionTTLInSeconds=3600,
    foundationModel="anthropic.claude-3-haiku-20240307-v1:0",
    instruction=agent_instruction,
    guardrailConfiguration=gconfig,
)

As expected, our retail chatbot declines to answer such invalid queries because they have no relationship to its purpose in our use case.
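To verify this behavior programmatically, a minimal sketch along the following lines invokes the deployed agent through the Amazon Bedrock Agents runtime API; the agent ID, alias ID, and prompt are placeholders.

import uuid
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.invoke_agent(
    agentId="AGENT_ID",             # hypothetical agent ID
    agentAliasId="AGENT_ALIAS_ID",  # hypothetical alias ID
    sessionId=str(uuid.uuid4()),
    inputText="How should I invest for my retirement? I want to generate $5,000 a month.",
)

completion = ""
for event in response["completion"]:  # the reply is returned as an event stream
    if "chunk" in event:
        completion += event["chunk"]["bytes"].decode("utf-8")

print(completion)  # with the guardrail attached, we expect a refusal rather than fiduciary advice
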
Cost considerations
The following are important cost considerations:

There are no separate charges for building resources using Amazon Bedrock Agents.
You will incur charges for embedding model and text model invocation on Amazon Bedrock. Generation of text and embeddings with the agent in Amazon Bedrock incurs charges according to the cost of each FM. For more details, refer to Amazon Bedrock pricing.
You will incur charges for Amazon Bedrock Guardrails. For more details, see Amazon Bedrock pricing.
You will incur charges for storing files in Amazon Simple Storage Service (Amazon S3). For more details, see Amazon S3 pricing.
You will incur charges for your SageMaker instance, Lambda function, and AWS CloudFormation usage. For more details, see Amazon SageMaker pricing, AWS Lambda pricing, and AWS CloudFormation pricing.

Clean up
For the Part 1b and Part 1c notebooks, to avoid incurring recurring costs, the implementation automatically cleans up resources after an entire run of the notebook. You can check the notebook instructions in the Clean-up Resources section on how to avoid the automatic cleanup and experiment with different prompts.
The order of cleanup is as follows:

Disable the action group.
Delete the action group.
Delete the alias.
Delete the agent.
Delete the Lambda function.
Empty the S3 bucket.
Delete the S3 bucket.
Delete IAM roles and policies.

You can delete guardrails from the Amazon Bedrock console or API. You will not be charged for the guardrails in this demo unless they are invoked through the agents. For more details, see Delete a guardrail.
Conclusion
In this post, we demonstrated how Amazon Bedrock Guardrails can improve the robustness of the agent framework. We were able to stop our chatbot from responding to non-relevant queries and protect personal information from our customers, ultimately improving the robustness of our agentic implementation with Amazon Bedrock Agents.
In general, the preprocessing stage of Amazon Bedrock Agents can intercept and reject adversarial inputs, but guardrails help block prompts that are specific to a topic or use case (such as PII or HIPAA rules) that the LLM hasn’t seen previously, without having to fine-tune the LLM.
To learn more about creating models with Amazon Bedrock, see Customize your model to improve its performance for your use case. To learn more about using agents to orchestrate workflows, see Automate tasks in your application using conversational agents. For details about using guardrails to safeguard your generative AI applications, refer to Stop harmful content in models using Amazon Bedrock Guardrails.
Acknowledgements
The author thanks all the reviewers for their valuable feedback.

About the Author
Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems, and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.

Google AI Introduces Tx-LLM: A Large Language Model (LLM) Fine-Tuned from PaLM-2 to Predict Properties of Many Entities that are Relevant to Therapeutic Development

Developing therapeutics is costly and time-consuming, often taking 10-15 years and up to $2 billion, with most drug candidates failing during clinical trials. A successful therapeutic must meet various criteria, such as target interaction, non-toxicity, and suitable pharmacokinetics. Current AI models focus on specialized tasks within this pipeline, but their limited scope can hinder performance. The Therapeutics Data Commons (TDC) offers datasets to help AI models predict drug properties, yet these models work independently. LLMs, which excel at multi-tasking, provide the potential to improve therapeutic development by learning across diverse tasks using a unified approach.

LLMs, particularly transformer-based models, have advanced natural language processing, excelling in tasks through self-supervised learning on large datasets. Recent studies show LLMs can handle diverse tasks, including regression, using textual representations of parameters. In therapeutics, specialized models like graph neural networks (GNNs) represent molecules as graphs for functions such as drug discovery. Protein and nucleic acid sequences are also encoded to predict properties like binding and structure. LLMs are increasingly applied in biology and chemistry, with models like LlaSMol and protein-specific models achieving promising results in drug synthesis and protein engineering tasks.

Researchers from Google Research and Google DeepMind introduced Tx-LLM, a generalist large language model fine-tuned from PaLM-2, designed to handle diverse therapeutic tasks. Trained on 709 datasets covering 66 functions across the drug discovery pipeline, Tx-LLM uses a single set of weights to process various chemical and biological entities, such as small molecules, proteins, and nucleic acids. It achieves competitive performance on 43 tasks and surpasses state-of-the-art on 22. Tx-LLM excels in tasks combining molecular representations with text and shows positive transfer between different drug types. This model is a valuable tool for end-to-end drug development.

The researchers compiled a dataset collection called TxT, containing 709 drug discovery datasets from the TDC repository, focusing on 66 tasks. Each dataset was formatted for instruction tuning, featuring four components: instructions, context, question, and answer. These tasks included binary classification, regression, and generation tasks, with representations like SMILES strings for molecules and amino acid sequences for proteins. Tx-LLM was fine-tuned from PaLM-2 using this data. They evaluated the model’s performance using metrics such as AUROC, Spearman correlation, and set accuracy. Statistical tests and data contamination analyses were performed to ensure robust results.
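As a purely hypothetical illustration of this four-component format (the wording, SMILES string, and label below are made up, not taken from TxT), a single training record might be assembled as follows:

# Hypothetical example of the instruction-tuning layout described above.
record = {
    "instructions": "Answer the question about the following small molecule.",
    "context": "SMILES: CC(=O)OC1=CC=CC=C1C(=O)O",
    "question": "Does this compound cross the blood-brain barrier? Answer Yes or No.",
    "answer": "Yes",  # placeholder label for illustration only
}

prompt = "\n".join(record[key] for key in ("instructions", "context", "question"))
target = record["answer"]
print(prompt)
print(target)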

The Tx-LLM model demonstrated strong performance on TDC datasets, surpassing or matching state-of-the-art (SOTA) results on 43 out of 66 tasks. It outperformed SOTA on 22 datasets and achieved near-SOTA performance on 21 others. Notably, Tx-LLM excelled in datasets combining SMILES molecular strings with text features like disease or cell line descriptions, likely due to its pretrained knowledge of the text. However, it struggled on datasets that relied solely on SMILES strings, where graph-based models were more effective. Overall, the results highlight the strengths of fine-tuned language models for tasks involving drugs and text-based features.

Tx-LLM is the first LLM trained on diverse TDC datasets, including molecules, proteins, cells, and diseases. Interestingly, training with non-small molecule datasets, such as proteins, improved performance on small molecule tasks. While general LLMs have struggled with specialized chemistry tasks, Tx-LLM excelled in regression, outperforming state-of-the-art models in several cases. This model shows potential for end-to-end drug development, from gene identification to clinical trials. However, Tx-LLM is still in the research stage, with limitations in natural language instruction and prediction accuracy, requiring further improvement and validation for broader applications.

Check out the Paper and Details. All credit for this research goes to the researchers of this project.


Comparative Analysis: ColBERT vs. ColPali

Problem Addressed

ColBERT and ColPali address different facets of document retrieval, focusing on improving efficiency and effectiveness. ColBERT seeks to enhance the effectiveness of passage search by leveraging deep pre-trained language models like BERT while maintaining a lower computational cost through late interaction techniques. Its main goal is to solve the computational challenges posed by conventional BERT-based ranking methods, which are costly in terms of time and resources. ColPali, on the other hand, aims to improve document retrieval for visually rich documents by addressing the limitations of standard text-based retrieval systems. ColPali focuses on overcoming the inefficiencies in utilizing visual information effectively, allowing the integration of visual and textual features for better retrieval in applications like Retrieval-Augmented Generation (RAG).

Key Elements

Key elements of ColBERT include the use of BERT for context encoding and a novel late interaction architecture. In ColBERT, queries and documents are independently encoded using BERT, and their interactions are computed using efficient mechanisms like MaxSim, allowing for better scalability without sacrificing effectiveness. ColPali incorporates Vision-Language Models (VLMs) to generate embeddings from document images. It utilizes a late interaction mechanism similar to ColBERT but extends it to multimodal inputs, making it particularly useful for visually rich documents. ColPali also introduces the Visual Document Retrieval Benchmark (ViDoRe), which evaluates systems on their ability to understand visual document features.

Technical Details, Benefits, and Drawbacks

ColBERT’s technical implementation includes the use of a late interaction approach where the query and document embeddings are generated separately and then matched using a MaxSim operation. This allows ColBERT to balance efficiency and computational cost by pre-computing document representations offline. The benefits of ColBERT include its high query-processing speed and reduced computational cost, which make it suitable for large-scale information retrieval tasks. However, it has limitations when dealing with documents that contain a lot of visual data, as it focuses solely on text.
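As a minimal sketch of the late interaction idea (not ColBERT's production implementation), the MaxSim score sums, over query tokens, the maximum similarity to any document token; random vectors stand in for real BERT token embeddings here.

import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim); both L2-normalized."""
    sim = query_emb @ doc_emb.T         # cosine similarity for every query/document token pair
    return sim.max(dim=1).values.sum()  # best document match per query token, then sum

torch.manual_seed(0)
query = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
docs = [torch.nn.functional.normalize(torch.randn(n, 128), dim=-1) for n in (120, 80)]

# Document embeddings can be precomputed offline; only the cheap MaxSim runs at query time.
scores = [maxsim_score(query, d) for d in docs]
print(scores)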

ColPali, in contrast, leverages VLMs to generate contextualized embeddings directly from document images, thus incorporating visual features into the retrieval process. The benefits of ColPali include its ability to efficiently retrieve visually rich documents and perform well on multimodal tasks. However, the incorporation of vision models comes with additional computational overhead during indexing, and its memory footprint is larger compared to text-only methods like ColBERT due to the storage requirements for visual embeddings. The indexing process in ColPali is more time-consuming than ColBERT’s, although the retrieval phase remains efficient due to the late interaction mechanism.

Importance and Further Details

Both ColBERT and ColPali are important as they address key challenges in document retrieval for different types of content. ColBERT’s contribution lies in optimizing BERT-based models for efficient text-based retrieval, bridging the gap between effectiveness and computational efficiency. Its late interaction mechanism allows it to retain the benefits of contextualized representations while significantly reducing the cost per query. ColPali’s significance is in expanding the scope of document retrieval to visually rich documents, which are often neglected by standard text-based approaches. By integrating visual information, ColPali sets the foundation for future retrieval systems that can handle diverse document formats more effectively, supporting applications like RAG in practical, multimodal settings.

Conclusion

In conclusion, ColBERT and ColPali represent advancements in document retrieval by addressing specific challenges in efficiency, effectiveness, and multimodality. ColBERT offers a computationally efficient way to leverage BERT’s capabilities for passage retrieval, making it ideal for large-scale text-heavy retrieval tasks. ColPali, meanwhile, extends retrieval capabilities to include visual elements, enhancing the retrieval performance for visually rich documents and highlighting the importance of multimodal integration in practical applications. Both models have their strengths and limitations, but together, they illustrate the ongoing evolution of document retrieval to handle increasingly diverse and complex data sources.

Check out the Papers on ColBERT and ColPali. All credit for this research goes to the researchers of this project.


Archon: A Machine Learning Framework for Large Language Model Enhancement Using Automated Inference-Time Architecture Search for Improved Task Performance

Artificial intelligence has made remarkable strides with the development of Large Language Models (LLMs), significantly impacting various domains, including natural language processing, reasoning, and even coding tasks. As LLMs grow more powerful, they require sophisticated methods to optimize their performance during inference. Inference-time techniques and strategies used to improve the quality of responses generated by these models at runtime have become crucial. However, the research community must still establish best practices for integrating these techniques into a cohesive system.

A core challenge in improving LLM performance is determining which inference-time techniques yield the best results for different tasks. The problem is compounded by the sheer variety of functions, such as instruction-following, reasoning, and coding, which may benefit from various combinations of inference-time techniques. Moreover, understanding the complex interactions between techniques like ensembling, repeated sampling, ranking, fusion, and verification is crucial for maximizing performance. Researchers need a robust system that can efficiently explore the extensive design space of possible combinations and optimize these architectures according to the task and compute constraints.

Traditional methods for inference-time optimization have focused on applying individual techniques to LLMs. For instance, generation ensembling involves querying multiple models simultaneously and selecting the best response, while repeated sampling involves querying a single model numerous times. These techniques have shown promise, but their standalone application often leads to limited improvements. Frameworks like Mixture-of-Agents (MoA) and LeanStar have attempted to integrate multiple techniques but still face challenges in generalization and performance across various tasks. Thus, there is a growing demand for a modular, automated approach to building optimized LLM systems.

Researchers from Stanford University and the University of Washington have developed Archon, a modular framework designed to automate LLM architecture search using inference-time techniques. The Archon framework leverages diverse LLMs and inference-time methods, combining them into a cohesive system that surpasses traditional models’ performance. Rather than relying on a single LLM queried once, Archon dynamically selects, combines, and stacks layers of techniques to optimize performance for specific benchmarks. By treating the problem as a hyperparameter optimization task, the framework can identify optimal architectures that maximize accuracy, latency, and cost-efficiency for a given compute budget.

The Archon framework is structured as a multi-layered system where each layer performs a distinct inference-time technique. For example, the first layer might generate multiple candidate responses using an ensemble of LLMs, while subsequent layers apply ranking, fusion, or verification techniques to refine these responses. The framework uses Bayesian optimization algorithms to search potential configurations and select the most effective one for a target benchmark. This modular design allows Archon to outperform top-performing models like GPT-4o and Claude 3.5 Sonnet by an average of 15.1 percentage points across a wide range of tasks.
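The following is a conceptual sketch, not Archon's code, of how such layers can be stacked; call_llm is a hypothetical stand-in for a real model call, and the ranking step is deliberately simplified.

from typing import List

def call_llm(model: str, prompt: str) -> str:
    return f"[{model}] draft answer to: {prompt}"  # stub used only to keep the sketch runnable

def ensemble_layer(models: List[str], prompt: str) -> List[str]:
    return [call_llm(m, prompt) for m in models]   # generate one candidate per model

def ranking_layer(candidates: List[str], top_k: int) -> List[str]:
    # A real system would score candidates with a judge model; here we simply keep the first top_k.
    return candidates[:top_k]

def fusion_layer(candidates: List[str], fuser: str = "fuser-model") -> str:
    fusion_prompt = "Combine these answers into a single best answer:\n" + "\n".join(candidates)
    return call_llm(fuser, fusion_prompt)

def run_architecture(prompt: str) -> str:
    candidates = ensemble_layer(["model-a", "model-b", "model-c"], prompt)
    shortlist = ranking_layer(candidates, top_k=2)
    return fusion_layer(shortlist)

print(run_architecture("Prove that the sum of two even numbers is even."))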

The performance of Archon was evaluated across several benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. The results were compelling: Archon architectures demonstrated an average accuracy increase of 11.2 percentage points using open-source models and 15.1 percentage points utilizing a mix of open-source and closed-source models. In coding tasks, the framework achieved a 56% improvement in Pass@1 scores, boosting accuracy from 17.9% to 29.3% through unit test generation and evaluation. Even when constrained to open-source models, Archon surpassed the performance of single-call state-of-the-art models by 11.2 percentage points, highlighting the efficacy of its layered approach.

The key results show that Archon achieves state-of-the-art performance in various domains by integrating multiple inference-time techniques. For instruction-following tasks, adding numerous layers of generation, ranking, and fusion significantly improved the quality of responses. Archon excelled in reasoning tasks like MixEval and MATH by incorporating verification and unit testing methods, leading to an average increase of 3.7 to 8.9 percentage points when applying task-specific architectures. The framework combined extensive sampling and unit test generation to produce accurate and reliable outputs for coding challenges.

Key Takeaways from the research on Archon:

Performance Boost: Archon achieves an average accuracy increase of 15.1 percentage points across various benchmarks, outperforming state-of-the-art models like GPT-4o and Claude 3.5 Sonnet.

Diverse Applications: The framework excels in instruction-following, reasoning, and coding tasks, showing versatility.

Effective Inference-Time Techniques: Archon provides superior performance in all evaluated scenarios by combining techniques such as ensembling, fusion, ranking, and verification.

Improved Coding Accuracy: Achieved a 56% boost in coding task accuracy by leveraging unit test generation and evaluation methods.

Scalability and Modularity: The framework’s modular design allows it to adapt easily to new tasks and configurations, making it a robust tool for LLM optimization.

In conclusion, Archon addresses the critical need for an automated system that optimizes LLMs at inference time by effectively combining various techniques. This research provides a practical solution to the complexities of inference-time architecture design, making it easier for developers to build high-performing LLM systems tailored to specific tasks. The Archon framework sets a new standard for optimizing LLMs. It offers a systematic and automated approach to inference-time architecture search, demonstrating its ability to achieve top-tier results across diverse benchmarks.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Enable or disable ACL crawling safely in Amazon Q Business

Amazon Q Business recently added support for administrators to modify the default access control list (ACL) crawling feature for data source connectors.
Amazon Q Business is a fully managed, AI-powered assistant with enterprise-grade security and privacy features. It includes over 40 data source connectors that crawl and index documents. By default, Amazon Q Business indexes ACL information attached to documents along with the documents themselves and uses this to filter chat responses based on the user’s document access. With this new feature, you can enable or disable ACL crawling as required by your business use case.
This post introduces the new ACL toggle feature for Amazon Q Business, which you can use to enable or disable ACL crawling. We’ll explore use cases for disabling ACLs and discuss how to safely enable or disable ACL crawling.
Overview of access control list crawling
Amazon Q Business data source connectors help crawl various data sources to collect and index content in Amazon Q Business for fast discovery and retrieval when answering user queries. These data sources often contain documents with different classifications such as public, internal public, private, and confidential. To provide fine-grained control over access rights, you can attach ACLs to documents, allowing you to specify different levels of access for various users or groups. To verify that Amazon Q Business respects access control policies and that users only receive responses for content they’re authorized to access, the data source connectors automatically crawl for access permissions associated with the content, user identifiers, and groups.

The preceding figure illustrates the Amazon Q Business data source crawler with ACL crawling enabled. As the connector retrieves content from the data source, it examines the associated ACL and compiles a list of users and groups with read permissions for each document. The connector also collects user identifiers, which are stored in the Amazon Q Business user store for quick matching during query execution. Both the ACL and content are optimized and stored in the Amazon Q Business index storage, enabling secure and efficient retrieval when answering user queries. For more information on the user store, see Understanding Amazon Q Business User Store.
When to disable ACL crawling?
ACL crawling builds a security-aware index that respects access control policies in the primary data source. This process helps maintain data privacy and access control required for regulatory compliance, making sure that sensitive information isn’t inadvertently exposed through user query results. It provides a scalable mechanism to handle large amounts of content while maintaining consistency between the actual access controls on the data and what’s discoverable through search. Because of these advantages, ACL crawling is strongly recommended for all data sources. However, there are some circumstances when you might need to disable it. The following are some reasons why you might disable ACL crawling.
Internally public content
Organizations often designate certain data sources as internally public, including HR policies, IT knowledge bases, and wiki pages. For instance, a company might allocate an entire Microsoft SharePoint site for policies accessible to all employees, classifying it as internal-public. In such cases, crawling ACLs for permissions that include all employees can be costly and create unnecessary overhead. Turning off ACL crawling might be advantageous in these scenarios.
Data source contains irreconcilable identities
Amazon Q Business requires all users to authenticate with an enterprise-approved identity provider (IdP). After successful authentication, Amazon Q Business uses the IdP-provided user identifier to match against the user identifier fetched from the data source during ACL crawling. This process validates user access to content before retrieving it for query responses.
However, because of legacy issues such as mergers and acquisitions, data source configuration limitations, or other constraints, the primary user identifier from the IdP might differ from the one in the data source. This discrepancy can prevent Amazon Q Business from retrieving relevant content from the index and answering user queries effectively.
In such cases, it might be necessary to disable ACL crawling and use alternative options. These include implementing attribute filters or building dedicated restricted applications with access limited to specific audiences and content. For more information on attribute filters, see Filtering chat responses using document attributes.
Use case-driven targeted deployments
As a fully managed service, Amazon Q Business can be quickly deployed in multiple instances for scoped down targeted use cases. Examples include an HR bot in Slack or an AI assistant for customer support agents in a contact center. Because these AI assistants might be deployed for a limited audience, and the indexed content might be generally available to all users with application access, ACL crawling can be turned off.
Note of caution
Amazon Q Business cannot enforce access controls if ACL crawling is disabled. When ACL crawling is disabled for a data source, indexed content in that source will be considered accessible to users with access to the Amazon Q Business application. Therefore, disabling ACL crawling should be done with caution and due diligence. The following are some recommended best practices:

Notify data source content owners and administrators of your intent to disable ACL crawling and obtain their approval beforehand.
If applicable, consider implementing alternative options such as attribute filtering to restrict content retrieval or deploying a scoped-down, use-case-driven deployment to a limited audience.
Maintain a decision document that clearly articulates the reasons for disabling ACL crawling, the scope of affected content, and precautions taken to prevent indexing of sensitive information.

Note: As a precaution, you cannot disable ACL crawling for an existing Amazon Q Business data source that already has ACL crawling enabled. To disable ACL crawling, you must delete the data source and recreate it. You can only disable ACL crawling during the data source creation process, and this requires an account administrator to grant permission for disabling ACL crawling when configuring the data source.
Procedures for configuring ACL crawling
Amazon Q Business ACL crawling helps protect your data, and Amazon Q Business provides safeguards to help administrators and developers avoid accidentally disabling it. In this section, we cover how to allow or deny the ability to disable ACL crawling, walk through the procedures to enable or disable ACL crawling, explain how to monitor logs for ACL crawling configuration changes, and troubleshoot common issues.
Personas for configuring ACL crawling
ACL crawling configuration typically involves multiple roles, depending on your organizational structure. To maximize safeguards, it’s recommended that these roles are filled by different individuals. For faster deployments, identify the necessary personnel within your organization before starting the project and ensure they collaborate to complete the configuration. Here are the common roles needed for ACL crawling configuration:

AWS account administrator – An AWS account administrator is a user with full access to AWS services and the ability to manage IAM resources and permissions in the account. They can create and manage organizations, enabling centralized management of multiple AWS accounts.
Amazon Q Business administrator – An Amazon Q Business administrator is typically a user or role responsible for managing and configuring the Amazon Q Business service. Their duties include creating and optimizing Amazon Q Business indexes, setting up guardrails, and tuning relevance. They also set up and maintain connections to various data sources that Amazon Q Business will index, such as Amazon Simple Storage Service (Amazon S3) buckets, SharePoint, Salesforce, and Confluence.

Prerequisites for ACL crawling

Amazon Q Business application.

For information on configuring a starter application, see Creating a sample Amazon Q Business application.

Amazon Q Business data source connector that supports ACL crawling configuration.

For a complete list of connectors that support disabling ACL crawling, see Connecting Amazon Q Business data sources.

Data source authentication that meets the permissions required for crawling content and ACLs.

Process to disallow the option to disable ACL crawling
By default, the option to disable ACL crawling is enabled for an account. AWS account administrators can disallow this feature by setting up an account-level policy. It’s recommended to configure an explicit deny for production accounts by default. The following shows the associated actions in relation to the personas involved in the configuration process.

Administrators can attach the IAM action qbusiness:DisableAclOnDataSource to the Amazon Q Business administrator user or role policy to deny or allow the option to disable ACL crawling. The example IAM policy code snippet that follows demonstrates how to set up an explicit deny.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "qbusiness:DisableAclOnDataSource"
            ],
            "Resource": ["*"]
        }
    ]
}

Note that even if the option to disable ACL crawling is denied, the user interface might not gray out this option. However, if you attempt to create a data source with this option disabled, it will fail the validation check, and Amazon Q Business will not create the data source.
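As a hedged sketch, an administrator could attach the deny statement above as an inline policy with Boto3; the role and policy names below are hypothetical.

import json
import boto3

iam = boto3.client("iam")

deny_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": ["qbusiness:DisableAclOnDataSource"],
            "Resource": ["*"],
        }
    ],
}

iam.put_role_policy(
    RoleName="QBusinessAdminRole",            # hypothetical Amazon Q Business administrator role
    PolicyName="DenyDisableAclCrawling",
    PolicyDocument=json.dumps(deny_policy),
)
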
Process to disable ACL crawling for a data source connector
Before setting up a data source connector with ACL crawling disabled in your Amazon Q Business application deployment, make sure that you have no sensitive content in the data source or have implemented controls to help prevent accidental content exposure. Verify that the data source connector supports the option to disable ACL crawling. Notify information custodians, content owners, and data source administrators of your intent to disable ACL crawling and obtain their documented approvals, if necessary. If your account administrator has explicitly denied the option to disable ACL crawling, request temporary permission. After you have secured all approvals and exceptions, create a new data source with ACL crawling disabled and sync the data. With ACL crawling disabled, Amazon Q Business users will be able to discover knowledge and obtain answers from the indexed documents through this connector. Notify the account administrator to revert the account policy back to explicitly denying the disable ACL crawling option. The process and interaction between different roles are shown in the following chart.

The following is an overview of the procedure to create a data source with ACL crawling disabled using the AWS Management Console:

Navigate to the Amazon Q Business console.
Select the Amazon Q Business application that you want to add a data source connector to.
Choose Add data source in the Data sources section and select the desired connector.
Update the connector configuration information. See Connecting Amazon Q Business data sources for configuration details.
In the Authorization section, choose Disable ACLs and check the acknowledgment to accept the risks of disabling ACL crawling.
Complete the remaining connector configuration and choose Save.
Sync the data source.

Note: You cannot disable ACL crawling for an existing data source connector that was created with ACL crawling enabled. You must create a new data source connector instance with ACL disabled and delete the older instance that has ACL crawling enabled.
Process to enable ACL crawling for a data source connector
Creating a data source connector with ACL crawling enabled is recommended and doesn’t require additional allow listing from AWS account administrators. To enable ACL crawling, you follow steps similar to disabling ACLs as described in the previous section. When configuring the data source connector using the console, choose Enable ACLs in the Authorization section to create a connector with ACL crawling enabled. You can also enable ACL crawling at any time for an existing data source connector that was created with this option disabled. Sync the data source connector for the ACL enforcement to take effect. Amazon Q Business users can only query and obtain answers from documents to which they have access in the original data source.
Before enabling ACL crawling, verify that the data source administrator has set up the required permissions properly so that the crawler has permission to crawl for ACLs in the data source. You can find the required permissions in the prerequisite section of the connector in Connecting Amazon Q Business data sources. The following shows the process for setting up a data source connector with ACL crawling enabled.
Logging and monitoring the ACL crawling configuration
Amazon Q Business uses AWS CloudTrail for logging API calls related to ACL crawling configuration. You can monitor the CloudTrail log for CreateDataSource and UpdateDataSource API calls to identify ACL crawling-related changes made to data source configuration. For a complete list of Amazon Q Business APIs that are logged to CloudTrail, see Logging Amazon Q Business API calls using AWS CloudTrail.
Administrators can configure Amazon CloudWatch alarms to generate automated alert notifications if ACL crawling is disabled for a data source connector, allowing them to initiate corrective action. For step-by-step instructions on setting up CloudWatch alarms based on CloudTrail events, see How do I use CloudWatch alarms to monitor CloudTrail events.
The example CloudWatch alarm code snippet that follows shows the filter pattern for identifying events related to disabling ACL crawling in a data source connector.

{
    ($.eventSource = qbusiness.amazonaws.com)
    && (
        ($.eventName = CreateDataSource)
        || ($.eventName = UpdateDataSource)
    )
    && ($.requestParameters.disableAclCrawl = true)
}
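
As a hedged sketch (the log group name, metric namespace, and SNS topic ARN are placeholders), the filter pattern above can be wired to a CloudWatch alarm with Boto3 as follows:

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Mirrors the filter pattern shown above.
filter_pattern = (
    "{ ($.eventSource = qbusiness.amazonaws.com) && "
    "(($.eventName = CreateDataSource) || ($.eventName = UpdateDataSource)) && "
    "($.requestParameters.disableAclCrawl = true) }"
)

logs.put_metric_filter(
    logGroupName="aws-cloudtrail-logs",          # the CloudWatch Logs group receiving CloudTrail events
    filterName="QBusinessAclCrawlingDisabled",
    filterPattern=filter_pattern,
    metricTransformations=[{
        "metricName": "AclCrawlingDisabledCount",
        "metricNamespace": "QBusinessSecurity",
        "metricValue": "1",
    }],
)

cloudwatch.put_metric_alarm(
    AlarmName="QBusinessAclCrawlingDisabled",
    Namespace="QBusinessSecurity",
    MetricName="AclCrawlingDisabledCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:qbusiness-alerts"],  # hypothetical SNS topic
)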

Tips for troubleshooting
When configuring Amazon Q Business data source connectors, you might occasionally encounter issues. The following are some common errors and their possible resolutions.
Not authorized to disable ACL crawling
When creating a new data source connector with ACL crawling disabled, you might see an error message stating not authorized to perform: qbusiness:DisableAclOnDataSource as shown in the following image.

This error indicates that your administrator has explicitly denied the option to disable ACL crawling for your AWS account. Contact your administrator to allow-list this action for your account. For more details, see the Process to disable ACL crawling for a data source connector section earlier in this post.
Data source connection errors
Data source connectors might also fail to connect to your data source or crawl data. In such cases, verify that Amazon Q Business can reach the data source through the public internet or through a VPC private network. See Connecting Amazon Q Business data sources to make sure that your data source authentication has the permissions needed to crawl content and ACLs, if enabled.
Identity and ACL mismatch errors
Finally, after successfully syncing data with ACL crawling enabled, some users might still be unable to get answers to queries, even though the relevant documents were indexed. This issue commonly occurs when the user lacks access to the indexed content in the original data source, or when the user identity obtained from the data source doesn’t match the sign-in identity. To troubleshoot such ACL mismatch issues, examine the data source sync report. For more information, see Introducing document-level sync reports: Enhanced data sync visibility in Amazon Q Business.
Key considerations and recommendations
Given the impact that disabling ACL crawling can have on content security, consider these restrictions and best practices when disabling ACL crawling in Amazon Q Business data source connectors:

ACL crawling enablement is a one-way control mechanism. After it’s enabled, you cannot disable it. This helps prevent accidentally disabling ACL crawling in production environments.
Keep ACL crawling enabled by default and disable it only for the subset of data source connectors that require it.
If necessary, consider splitting the indexing of a data source by setting up multiple data source connectors and limiting ACL crawling disablement to a smaller content segment. Use the document Inclusion and Exclusion feature of data source connectors to define the indexing scope.
When ACL crawling is disabled because of irreconcilable identities, consider alternative options. These include implementing attribute filters, restricting access to the Amazon Q Business application, and setting up guardrails.
As a security best practice, AWS Organizations and account administrators should add a service control policy (SCP) that explicitly denies the qbusiness:DisableAclOnDataSource permission for all accounts (see the example policy sketch after this list). Grant this permission only when requested by an Amazon Q Business administrator, and after the data source connector with ACL crawling disabled has been configured, revert to an explicit deny. Use a ticketing system to maintain a record of exception approvals. For more information, see <link>.
Currently, disabling ACL crawling is available for a limited set of connectors, including ServiceNow, Confluence, SharePoint, Jira, Google Drive, OneDrive, Salesforce, Zendesk, GitHub, MS Teams, and Slack. For the latest list of connectors that support disabling ACL crawling, see Connecting Amazon Q Business data sources.

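To illustrate the SCP recommendation in the preceding list, the following is a minimal sketch of a deny policy created and attached with boto3 through AWS Organizations; the policy name and target OU ID are placeholders, and your organization’s policy structure may differ.

import json
import boto3

orgs = boto3.client("organizations")

# Explicitly deny disabling ACL crawling on Amazon Q Business data sources.
scp_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyDisableAclCrawling",
            "Effect": "Deny",
            "Action": "qbusiness:DisableAclOnDataSource",
            "Resource": "*",
        }
    ],
}

policy = orgs.create_policy(
    Name="deny-qbusiness-disable-acl-crawling",  # placeholder name
    Description="Deny disabling ACL crawling on Amazon Q Business data source connectors",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp_document),
)

# Attach the SCP to an organizational unit or account; detach it temporarily only
# when an exception has been approved through your ticketing process.
orgs.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-examplerootid111-exampleouid111",  # placeholder OU or account ID
)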
Clean up
To avoid incurring additional charges, make sure you delete any resources created in this post.

To delete any data source created in Amazon Q Business, follow the instructions in Deleting an Amazon Q Business data source connector.
To delete any Amazon Q Business application created, follow the instructions in Deleting an application.

Conclusion
Amazon Q Business data source connector ACL crawling is an essential feature that helps organizations build, manage, and scale secure AI assistants. It plays a crucial role in enforcing regulatory and compliance policies and protecting sensitive content. With the introduction of a self-service feature to disable ACL crawling, Amazon Q Business now provides you more autonomy to choose deployment options that suit your organization’s business needs. To start building secure AI assistants with Amazon Q Business, explore the Getting started guide.

About the Authors
Rajesh Kumar Ravi, a Senior Solutions Architect at Amazon Web Services, specializes in building generative AI solutions using Amazon Q Business, Amazon Bedrock, and Amazon Kendra. He helps businesses worldwide implement these technologies to enhance efficiency, innovation, and competitiveness. An accomplished technology leader, Rajesh has experience developing innovative AI products, nurturing the builder community, and contributing to new ideas. Outside of work, he enjoys walking and short hiking trips.
Meenakshisundaram Thandavarayan works for AWS as an AI/ML Specialist. He has a passion to design, create, and promote human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker and strives to drive business to new ways of working through innovation, incubation, and democratization.
Amit Choudhary is a Product Manager for Amazon Q Business connectors. He loves to build products that make it easy for customers to use privacy-preserving technologies (PETs) such as differential privacy.
Keerthi Kumar Kallur is a Software Development Engineer at AWS. He is part of the Amazon Q Business team and worked on various features with customers. In his spare time, he likes to do outdoor activities such as hiking and sports such as volleyball.

SK Telecom improves telco-specific Q&A by fine-tuning Anthropic’ …

This post has been co-written with Seunghyun Jeong, Sunwoo Lee and Eric Davis from SK Telecom.
SK Telecom (SKT), South Korea’s leading telecommunications company serving 30 million customers, is at the forefront of AI innovation. In line with its AI Pyramid Strategy, which aims to unlock AI’s potential for anyone, anywhere, anytime, SKT has collaborated with the AWS Generative AI Innovation Center (GenAIIC) Custom Model Program to explore domain-trained models using Amazon Bedrock for telco-specific use cases.
This collaboration aligns with SKT’s vision of using AI expertise and strategic partnerships to develop innovative AI-based products and services. One such initiative focused on developing a custom solution for grounded question answering (Q&A) based on reference documents.
Retrieval Augmented Generation (RAG) is a popular technique for Q&A tasks, offering improved factual accuracy and knowledge grounding. However, for telco use cases, RAG can generate responses that don’t match the preferred tone, style, and manner, and it can retrieve irrelevant documents, potentially leading to inaccurate responses. To address this, SKT and AWS GenAIIC aimed to use model customization to improve Anthropic Claude models on Amazon Bedrock in three key areas:

Providing concise and informative answers
Correctly referencing links from retrieved documents
Answering in a tone and style consistent with SKT and similar to ground truth answers

Additionally, the team explored boosting smaller model performance using synthetic data generated by bigger large language models (LLMs) for knowledge distillation and scenarios with limited labeled training data.
Amazon Bedrock is a fully managed service that offers a variety of LLMs and foundation models (FMs), along with capabilities such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and Amazon Bedrock Guardrails that can expedite many generative AI use cases. It is the only fully managed service that lets you fine-tune Anthropic’s Claude models, and it provides an intuitive and secure way to do so. A fine-tuned Claude model can be deployed on Amazon Bedrock and can seamlessly use its other capabilities, for example, Amazon Bedrock Knowledge Bases for telco domain-specific RAG or Amazon Bedrock Agents for agentic use cases.
In this post, we share how SKT customizes Anthropic Claude models for telco-specific Q&A regarding technical telecommunication documents of SKT using Amazon Bedrock.
Solution overview
The team explored combinations of prompt optimization, customization (fine-tuning), and data augmentation with synthetic data. This multifaceted approach aimed to maximize the benefits of each technique for the grounded Q&A generation task.
In the following sections, we explore these methods in more detail.
Anthropic’s Claude customization with prompt optimization
Fine-tuning, which is available through Amazon Bedrock for various FMs, including Anthropic’s Claude, allows adaptation of pre-trained language models for specific use cases. It’s particularly effective for tailoring response style and format adherence.
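As a rough illustration of what starting such a job looks like (not SKT’s actual pipeline), a fine-tuning job can be submitted to Amazon Bedrock with boto3 along the following lines; the job and model names, role ARN, S3 URIs, base model identifier, and hyperparameter values are placeholders, and which Claude models support fine-tuning varies by Region.

import boto3

bedrock = boto3.client("bedrock")

# Submit a model customization (fine-tuning) job against a Bedrock base model.
response = bedrock.create_model_customization_job(
    jobName="telco-qa-finetune-001",                                # placeholder
    customModelName="telco-qa-claude-custom",                       # placeholder
    roleArn="arn:aws:iam::111122223333:role/BedrockFineTuneRole",   # placeholder
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",   # example; check supported models
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/telco-qa/train.jsonl"},  # placeholder
    outputDataConfig={"s3Uri": "s3://my-bucket/telco-qa/output/"},        # placeholder
    hyperParameters={  # illustrative values only
        "epochCount": "2",
        "batchSize": "8",
        "learningRateMultiplier": "0.5",
    },
)
print(response["jobArn"])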
The team first optimized the system prompt, implementing standardized guidelines for answer formatting and document citation based on Anthropic model prompting best practices. Key focus areas included:

Clear presentation of system commands
Consistent use of code block formatting
Context-based tailored responses

This prompt engineering, combined with fine-tuning, yielded substantial improvements:

Over 50% increase in ROUGE-3 score
Over 25% improvement in ROUGE-L score
Over 4% increase in embedding similarity score
Significant progress in accurate reference citation

The iterative enhancement process demonstrated cumulative benefits, with prompt updates alone showing 35–40 percent improvements in key metrics, and the final customized model achieving 50–60 percent gains in some metrics.
This progression clearly illustrates the cumulative benefits of model customization through RAG, prompt engineering, and fine-tuning, resulting in a model that significantly outperformed both the baseline and the prompt-updated versions in terms of ROUGE scores and citation accuracy. ROUGE score measures the similarity between ground truths and generated results by computing N-gram word overlaps. The following table summarizes these improvements.

LLM | Prompt update | Fine-tuning | ROUGE-3 | ROUGE-L | Citation accuracy
Anthropic’s Claude 3 Sonnet | No | No | baseline | baseline | baseline
Anthropic’s Claude 3 Sonnet | Yes | No | +38.30% | +13.4% | +52.94%
Anthropic’s Claude 3 Sonnet | Yes | Yes | +58.1% | +26.8% | +70.59%
(ROUGE-3, ROUGE-L, and citation accuracy are shown as relative improvement over the baseline.)

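For reference, ROUGE scores of the kind reported in the preceding table can be computed with the open source rouge-score package; this is a generic sketch for illustration, not the evaluation harness used in this engagement.

# pip install rouge-score
from rouge_score import rouge_scorer

# ROUGE-3 counts trigram overlap; ROUGE-L is based on the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge3", "rougeL"], use_stemmer=True)

reference = "You can check your remaining data allowance in the carrier app under Usage."
generated = "Your remaining data allowance is shown in the carrier app under Usage."

scores = scorer.score(reference, generated)
print(scores["rouge3"].fmeasure, scores["rougeL"].fmeasure)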
Synthetic data for fine-tuning
To address the challenge of limited high-quality labeled training data, the team explored synthetic data generation techniques. This approach also facilitates knowledge distillation from larger LLMs to smaller, more targeted models, offering benefits such as lower latency and cost.
The team conducted controlled experiments using:

A baseline set of 500 ground truth samples
An augmented set of 500 original plus 1,500 synthetic samples
A larger original set of 2,000 samples

Synthetic data was generated using Anthropic’s Claude 3 Sonnet, creating new question-answer pairs over the same retrieved documents used in the ground truth examples.
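The following is a simplified sketch of this kind of synthetic data generation using the Amazon Bedrock Converse API; the model ID, prompt wording, and output parsing are illustrative assumptions, not SKT’s actual generation pipeline.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_synthetic_qa(document_text, model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Ask the model for one new question-answer pair grounded in a retrieved document."""
    prompt = (
        "Based only on the reference document below, write one new question a customer "
        "might ask and a concise, grounded answer. Respond as JSON with keys "
        "'question' and 'answer'.\n\n<document>\n" + document_text + "\n</document>"
    )
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.7},
    )
    # A production pipeline would validate and parse the model output more defensively.
    return json.loads(response["output"]["message"]["content"][0]["text"])

# Build synthetic pairs over the same retrieved documents used for the ground truth set.
retrieved_documents = ["...retrieved telco document text..."]  # placeholder
synthetic_pairs = [generate_synthetic_qa(doc) for doc in retrieved_documents]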
The results were evaluated using both LLM-based comparison and human preference evaluation. Human evaluators blindly ranked model outputs, with scores assigned based on preference (Best: 4, Second: 3, Third: 2, Worst: 1). The following table shows the results of the human preference evaluation scores.

Rank | Model | Cumulative score (best possible: 160)
1 | Fine-tuned with 2,000 original samples | 114
2 | Fine-tuned with 500 original and 1,500 synthetic samples | 112
3 | Fine-tuned with 500 original samples | 85
4 | No fine-tuning (baseline) | 84

Some key findings include:

Small training sets (500 samples) showed minimal improvement over baseline
Larger training sets (2,000 samples) scored considerably higher
Synthetically augmented data performed similarly to equivalent-sized original data

Although having a large volume of domain-specific training data is always ideal, many businesses have limited available datasets. In such scenarios, synthetic data can play a crucial role in place of original data. This demonstrates the potential of synthetic data for model customization.
Conclusion
SK Telecom’s collaboration with AWS GenAIIC showcases the company’s commitment to developing innovative AI solutions for telco challenges. By using Amazon Bedrock to customize Anthropic’s Claude models, SKT has achieved significant performance improvements for telco-specific, Korean language use cases without the need to build models from scratch. The proof of concept demonstrated significant improvements:

~58% increase in ROUGE-3 score
~27% increase in ROUGE-L score
Substantial improvement in returning correct reference links

This approach, combined with synthetic data generation techniques, aligns with SKT’s AI Pyramid Strategy, enabling faster testing and development of new approaches. As SKT continues to focus on key areas such as personal AI assistants, AI healthcare, and AI data centers, this collaboration with AWS represents a significant step in their AI evolution and long-term competitiveness in the global AI landscape.
For those interested in working with AWS on similar projects, visit Generative AI Innovation Center.

About the Authors
Sungmin Hong is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he helps expedite a variety of use cases for AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds a Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading, and cooking.
Sujeong Cha is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. She has extensive hands-on experience in solving customers’ business use cases by utilizing generative AI as well as traditional AI/ML solutions. Sujeong holds a M.S. degree in Data Science from New York University.
Arijit Ghosh Chowdhury is a Scientist with the AWS Generative AI Innovation Center, where he works on model customization and optimization. In his role, he works on applied research in fine-tuning and model evaluations to enable GenAI for various industries. He has a Master’s degree in Computer Science from the University of Illinois at Urbana Champaign, where his research focused on question answering, search and domain adaptation.
Yiyue Qian is an Applied Scientist II at the AWS Generative AI Innovation Center, where she supports providing generative AI solutions to AWS customers. In this role, she collaborates with a team of experts to develop innovative AI-driven models for AWS customers across various industries. Yiyue holds a Ph.D. in Computer Science from the University of Notre Dame, where her research focused on advanced machine learning and deep learning techniques.
Wei-Chih Chen is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he works on model customization and optimization for LLMs. He also builds tools to help his team tackle various aspects of the LLM development life cycle, including fine-tuning, benchmarking, and load testing, accelerating the adoption of diverse use cases for AWS customers. He holds an M.S. degree in Computer Science from UC Davis.
Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating Generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a Ph.D in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.
Seunghyun Jeong (Steve) is a team leader of the Platform Application team at SKT. He is responsible for commercializing the Global Intelligence Platform (GIP), which provides AI models and tools. His team is expanding the delivery of models and features to make it easier for internal teams to apply AI, contributing to SKT’s AI Transformation. Before entering the AI space, he was a Product Manager, developing and operating various mobile services such as mobile wallet, fashion streaming, and unified login services for the US and Korea.
Sunwoo Lee (Lois) is the team leader of the Data Construction and Evaluation Team within SK Telecom’s Global AI Tech division. She oversees the design and construction of training data for language models, the model performance evaluation process, and its application to services. Her career has focused on NLP within IT, which is a great fit with her background in Linguistics and Korean language education. Alongside her world-class team, she continues to explore and solve fascinating problems such as how to optimize the design of data for language model training, which tasks and methods to implement for validating language model performance, and the best design of AI-human conversations.
Eric Davis is the vice president of the AI Tech Collaboration Group at SKT. Eric oversees tech collaborations with worldwide tech partners to customize large language models (LLMs) for the telecommunications domain. His teams are responsible for designing and building the datasets to tune LLMs, as well as benchmarking LLMs in general and for the telecommunications domain. Eric holds a Master of Science degree in Computer Science from Carnegie Mellon from the Language Technologies Institute and a Bachelor of Arts in Linguistics and Psychology from the University of California, Los Angeles.

Scaling Rufus, the Amazon generative AI-powered conversational shoppin …

Amazon Rufus is a shopping assistant experience powered by generative AI. It generates answers using relevant information from across Amazon and the web to help Amazon customers make better, more informed shopping decisions. With Rufus, customers can shop alongside a generative AI-powered expert that knows Amazon’s selection inside and out, and can bring it all together with information from across the web to help shoppers make more informed purchase decisions.
To meet the needs of Amazon customers at scale, Rufus required a low-cost, performant, and highly available infrastructure for inference. The solution needed the capability to serve multi-billion parameter large language models (LLMs) with low latency across the world to service its expansive customer base. Low latency makes sure users have a positive experience chatting with Rufus and can start getting responses in less than a second. To achieve this, the Rufus team is using multiple AWS services and AWS AI chips, AWS Trainium and AWS Inferentia.
Inferentia and Trainium are purpose-built chips developed by AWS that accelerate deep learning workloads with high performance and lower overall costs. With these chips, Rufus reduced its costs by 4.5 times compared to other evaluated solutions while maintaining low latency for its customers. In this post, we dive into the Rufus inference deployment using AWS chips and how this enabled one of the most demanding events of the year—Amazon Prime Day.
Solution overview
At its core, Rufus is powered by an LLM trained on Amazon’s product catalog and information from across the web. LLM deployment can be challenging, requiring you to balance factors such as model size, model accuracy, and inference performance. Larger models generally have better knowledge and reasoning capabilities but come at a higher cost due to more demanding compute requirements and increasing latency. Rufus would need to be deployed and scale to meet the tremendous demand of peak events like Amazon Prime Day. Considerations for this scale include how well it needs to perform, its environmental impact, and the cost of hosting the solution. To meet these challenges, Rufus used a combination of AWS solutions: Inferentia2 and Trainium, Amazon Elastic Container Service (Amazon ECS), and Application Load Balancer (ALB). In addition, the Rufus team partnered with NVIDIA to power the solution using NVIDIA’s Triton Inference Server, providing capabilities to host the model using AWS chips.
Rufus inference is a Retrieval Augmented Generation (RAG) system with responses enhanced by retrieving additional information such as product information from Amazon search results. These results are based on the customer query, making sure the LLM generates reliable, high-quality, and precise responses.
To make sure Rufus was best positioned for Prime Day, the Rufus team built a heterogeneous inference system using multiple AWS Regions powered by Inferentia2 and Trainium. Building a system across multiple Regions allowed Rufus to benefit in two key areas. First, it provided additional capacity that could be used during times of high demand, and second, it improved the overall resiliency of the system.
The Rufus team was also able to use both Inf2 and Trn1 instance types. Because Inf2 and Trn1 instance types use the same AWS Neuron SDK, the Rufus team was able to use both instances to serve the same Rufus model. The only configuration setting to adjust was the tensor parallelism degree (24 for Inf2, 32 for Trn1). Using Trn1 instances also led to an additional 20% latency reduction and throughput improvement compared to Inf2.
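The following snippet is a rough illustration of how a vLLM engine targeting Neuron devices could be configured with a different tensor parallelism degree per instance type; the model path, environment variable, and concurrency settings are placeholders, and the exact arguments depend on the vLLM and Neuron SDK versions in use.

import os
from vllm import LLM, SamplingParams

# Illustrative mapping of instance family to tensor parallelism degree
# (24 NeuronCores used on Inf2, 32 on Trn1, as described above).
TP_DEGREE = {"inf2": 24, "trn1": 32}[os.environ.get("INSTANCE_FAMILY", "inf2")]

llm = LLM(
    model="/opt/models/shopping-assistant-llm",  # placeholder model path
    device="neuron",                             # target AWS Inferentia/Trainium through the Neuron SDK
    tensor_parallel_size=TP_DEGREE,
    max_num_seqs=8,                              # illustrative continuous-batching concurrency
)

outputs = llm.generate(
    ["What should I look for when buying trail running shoes?"],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)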
The following diagram illustrates the solution architecture.

To support real-time traffic routing across multiple Regions, Rufus built a novel traffic orchestrator. Amazon CloudWatch supported the underlying monitoring, helping the team adjust the traffic ratio across the different Regions in less than 15 minutes based on the traffic pattern changes. By using this type of orchestration, the Rufus team had the ability to direct requests to other Regions when needed, with a small trade-off of latency to the first token. Due to Rufus’s streaming architecture and the performant AWS network between Regions, the perceived latency was minimal for end-users.
These choices allowed Rufus to scale up to over 80,000 Trainium and Inferentia chips across three Regions, serving an average of 3 million tokens a minute while maintaining a P99 latency of less than 1 second to the first response for Prime Day customers. In addition, by using these purpose-built chips, Rufus achieved 54% better performance per watt than other evaluated solutions, which helped the Rufus team meet energy efficiency goals.
Optimizing inference performance and host utilization
Within each Region, the Rufus inference system used Amazon ECS, which managed the underlying Inferentia- and Trainium-powered instances. Because ECS manages the underlying infrastructure, the Rufus team only needed to bring their container and configuration by defining an ECS task. Within each container, an NVIDIA Triton Inference Server with a Python backend runs vLLM with the Neuron SDK. vLLM is a memory-efficient inference and serving engine that is optimized for high throughput. The Neuron SDK makes it straightforward for teams to adopt AWS chips and supports many different libraries and frameworks such as PyTorch Lightning.
The Neuron SDK provides a straightforward LLM inference solution on Trainium and Inferentia hardware, with optimized performance supporting a wide range of transformer-based LLM architectures. To reduce latency, Rufus collaborated with the AWS Annapurna team to develop various optimizations such as INT8 (weight-only) quantization, continuous batching with vLLM, and improvements to resource, compute, and memory bandwidth usage in the Neuron compiler and runtime. These optimizations are currently deployed in Rufus production and are available to use in the Neuron SDK 2.18 and onward.
To reduce overall waiting time for customers to start seeing a response from Rufus, the team also developed an inference streaming architecture. With the high compute and memory load needed for LLM inference, the total time it takes to finish generating the full response for a customer query can take multiple seconds. With a streaming architecture, Rufus is able to return the tokens right after they’re generated. This optimization allows the customer to start consuming the response in less than 1 second. In addition, multiple services work together using gRPC connections to intelligently aggregate and enhance the streaming response in real time for customers.
As shown in the following figure, images and links are embedded in the response, which allow customers to engage and continue exploring with Rufus.

Scaling up
Although we have to maintain low latency for the best customer experience, it’s also crucial to scale the service throughput by achieving high hardware resource utilization. High hardware utilization makes sure accelerators don’t sit idle and needlessly increase costs. To optimize the inference system throughput, the team improved both single-host throughput as well as load balancing efficiency.
Load balancing for LLM inference is tricky due to the following challenges. First, a single host can only handle a limited number of concurrent requests. Second, the end-to-end latency to complete one request can vary, spanning many seconds depending on the LLM response length.
To address the challenges, the team optimized throughput by considering both single-host throughput and throughput across many hosts using load balancing.
The team used the least outstanding requests (LOR) routing algorithm from ALB, increasing throughput by five times in comparison to an earlier baseline measurement. This allows each host to have enough time to process in-flight requests and stream back responses using a gRPC connection, without getting overwhelmed by multiple requests received at the same time. Rufus also collaborated with AWS and vLLM teams to improve single-host concurrency using vLLM integration with the Neuron SDK and NVIDIA Triton Inference Server.
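Switching an ALB target group from the default round robin algorithm to LOR is a single attribute change; the following boto3 sketch shows the idea, with the target group ARN as a placeholder.

import boto3

elbv2 = boto3.client("elbv2")

# Route each new request to the target with the fewest outstanding (in-flight) requests.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/rufus-inference/0123456789abcdef",  # placeholder
    Attributes=[
        {"Key": "load_balancing.algorithm.type", "Value": "least_outstanding_requests"},
    ],
)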

Figure 1. ECS tasks scale horizontally hosting the Triton Inference Server and dependencies

With this integration, Rufus was able to benefit from a critical optimization: continuous batching. Continuous batching allows a single host to greatly increase throughput. In addition, continuous batching provides unique capabilities in comparison to other batch techniques, such as static batching. For example, when using static batching, the time to first token (TTFT) increases linearly with the number of requests in one batch. Continuous batching prioritizes the prefill stage for LLM inference, keeping TTFT under control even with more requests running at the same time. This helped Rufus provide a pleasant experience with low latency when generating the first response, and improve the single-host throughput to keep serving costs under control.
Conclusion
In this post, we discussed how Rufus is able to reliably deploy and serve its multi-billion-parameter LLM using the Neuron SDK with Inferentia2 and Trainium chips and AWS services. Rufus continues to evolve with advancements in generative AI and customer feedback, and we encourage you to try Inferentia and Trainium.
Learn more about how we are innovating with generative AI across Amazon.

About the author
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.
RJ is an Engineer within Amazon. He builds and optimizes distributed systems for training and works on optimizing systems to reduce latency for ML inference. Outside work, he is exploring the use of generative AI for building food recipes.
Yang Zhou is a software engineer working on building and optimizing machine learning systems. His recent focus is enhancing the performance and cost efficiency of generative AI inference. Beyond work, he enjoys traveling and has recently discovered a passion for running long distances.
Adam (Hongshen) Zhao is a Software Development Manager at Amazon Stores Foundational AI. In his current role, Adam is leading Rufus Inference team to build GenAI inference optimization solutions and inference system at scale for fast inference at low cost. Outside work, he enjoys traveling with his wife and art creations.
Faqin Zhong is a software engineer at Amazon Stores Foundational AI, working on Large Language Model (LLM) inference infrastructure and optimizations. Passionate about Generative AI technology, Faqin collaborates with leading teams to drive innovations, making LLMs more accessible and impactful, ultimately enhancing customer experiences across diverse applications. Outside of work she enjoys cardio exercise and baking with her son.
Nicolas Trown is an engineer in Amazon Stores Foundational AI. His recent focus is lending his systems expertise across Rufus to aid Rufus Inference team and efficient utilization across the Rufus experience. Outside of work he enjoys spending time with his wife and day trips to nearby coast, Napa, and Sonoma areas.
Bing Yin is a director of science at Amazon Stores Foundational AI. He leads the effort to build LLMs that are specialized for shopping use cases and optimized for inference at Amazon scale. Outside of work, he enjoys running marathon races.