Google AI Introduces Tx-LLM: A Large Language Model (LLM) Fine-Tuned from PaLM-2 to Predict Properties of Many Entities that are Relevant to Therapeutic Development

Developing therapeutics is costly and time-consuming, often taking 10-15 years and up to $2 billion, with most drug candidates failing during clinical trials. A successful therapeutic must meet various criteria, such as target interaction, non-toxicity, and suitable pharmacokinetics. Current AI models focus on specialized tasks within this pipeline, but their limited scope can hinder performance. The Therapeutics Data Commons (TDC) offers datasets to help AI models predict drug properties, yet these models work independently. LLMs, which excel at multi-tasking, provide the potential to improve therapeutic development by learning across diverse tasks using a unified approach.

LLMs, particularly transformer-based models, have advanced natural language processing, excelling in tasks through self-supervised learning on large datasets. Recent studies show LLMs can handle diverse tasks, including regression, using textual representations of parameters. In therapeutics, specialized models like graph neural networks (GNNs) represent molecules as graphs for functions such as drug discovery. Protein and nucleic acid sequences are also encoded to predict properties like binding and structure. LLMs are increasingly applied in biology and chemistry, with models like LlaSMol and protein-specific models achieving promising results in drug synthesis and protein engineering tasks.

Researchers from Google Research and Google DeepMind introduced Tx-LLM, a generalist large language model fine-tuned from PaLM-2, designed to handle diverse therapeutic tasks. Trained on 709 datasets covering 66 tasks across the drug discovery pipeline, Tx-LLM uses a single set of weights to process various chemical and biological entities, such as small molecules, proteins, and nucleic acids. It achieves competitive performance on 43 tasks and surpasses state-of-the-art on 22. Tx-LLM excels in tasks combining molecular representations with text and shows positive transfer between different drug types. This makes it a valuable tool for end-to-end drug development.

The researchers compiled a dataset collection called TxT, containing 709 drug discovery datasets from the TDC repository and covering 66 tasks. Each dataset was formatted for instruction tuning with four components: instructions, context, question, and answer. The tasks included binary classification, regression, and generation, with representations such as SMILES strings for molecules and amino acid sequences for proteins. Tx-LLM was fine-tuned from PaLM-2 using this data, and performance was evaluated with metrics such as AUROC, Spearman correlation, and set accuracy. Statistical tests and data contamination analyses were performed to ensure robust results.
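To make the four-component format concrete, the following is a minimal sketch of how a single TDC-style record might be rendered into an instruction-tuning prompt. The field names and example molecule are illustrative assumptions, not the exact schema used by the Tx-LLM team.

def format_txt_example(record: dict) -> str:
    # Render one record into the four-part format: instructions, context, question, answer.
    return (
        f"Instructions: {record['instructions']}\n"
        f"Context: {record['context']}\n"
        f"Question: {record['question']}\n"
        f"Answer: {record['answer']}"
    )

example = {
    "instructions": "Answer the question about the following molecule.",
    "context": "SMILES: CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin as a SMILES string
    "question": "Does this molecule cross the blood-brain barrier? Answer Yes or No.",
    "answer": "Yes",  # illustrative label for a binary classification task
}

print(format_txt_example(example))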

The Tx-LLM model demonstrated strong performance on TDC datasets, surpassing or matching state-of-the-art (SOTA) results on 43 out of 66 tasks. It outperformed SOTA on 22 datasets and achieved near-SOTA performance on 21 others. Notably, Tx-LLM excelled in datasets combining SMILES molecular strings with text features like disease or cell line descriptions, likely due to its pretrained knowledge of the text. However, it struggled on datasets that relied solely on SMILES strings, where graph-based models were more effective. Overall, the results highlight the strengths of fine-tuned language models for tasks involving drugs and text-based features.

Tx-LLM is the first LLM trained on diverse TDC datasets, including molecules, proteins, cells, and diseases. Interestingly, training with non-small molecule datasets, such as proteins, improved performance on small molecule tasks. While general LLMs have struggled with specialized chemistry tasks, Tx-LLM excelled in regression, outperforming state-of-the-art models in several cases. This model shows potential for end-to-end drug development, from gene identification to clinical trials. However, Tx-LLM is still in the research stage, with limitations in natural language instruction and prediction accuracy, requiring further improvement and validation for broader applications.

Check out the Paper and Details. All credit for this research goes to the researchers of this project.


Comparative Analysis: ColBERT vs. ColPali

Problem Addressed

ColBERT and ColPali address different facets of document retrieval, focusing on improving efficiency and effectiveness. ColBERT seeks to enhance the effectiveness of passage search by leveraging deep pre-trained language models like BERT while maintaining a lower computational cost through late interaction techniques. Its main goal is to solve the computational challenges posed by conventional BERT-based ranking methods, which are costly in terms of time and resources. ColPali, on the other hand, aims to improve document retrieval for visually rich documents by addressing the limitations of standard text-based retrieval systems. ColPali focuses on overcoming the inefficiencies in utilizing visual information effectively, allowing the integration of visual and textual features for better retrieval in applications like Retrieval-Augmented Generation (RAG).

Key Elements

Key elements of ColBERT include the use of BERT for context encoding and a novel late interaction architecture. In ColBERT, queries and documents are independently encoded using BERT, and their interactions are computed using efficient mechanisms like MaxSim, allowing for better scalability without sacrificing effectiveness. ColPali incorporates Vision-Language Models (VLMs) to generate embeddings from document images. It utilizes a late interaction mechanism similar to ColBERT but extends it to multimodal inputs, making it particularly useful for visually rich documents. ColPali also introduces the Visual Document Retrieval Benchmark (ViDoRe), which evaluates systems on their ability to understand visual document features.

Technical Details, Benefits, and Drawbacks

ColBERT’s technical implementation includes the use of a late interaction approach where the query and document embeddings are generated separately and then matched using a MaxSim operation. This allows ColBERT to balance efficiency and computational cost by pre-computing document representations offline. The benefits of ColBERT include its high query-processing speed and reduced computational cost, which make it suitable for large-scale information retrieval tasks. However, it has limitations when dealing with documents that contain a lot of visual data, as it focuses solely on text.
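As a concrete illustration of the late interaction idea shared by ColBERT and ColPali, the following sketch scores a query against a document with the MaxSim operation: each query token embedding is matched to its most similar document token (or image patch, in ColPali’s case), and the maxima are summed. The dimensions and random embeddings are placeholders for real BERT or VLM outputs.

import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # Late interaction (MaxSim): for each query token, take the maximum similarity
    # over all document tokens/patches, then sum over query tokens.
    # Embeddings are assumed L2-normalized, so dot product equals cosine similarity.
    sim = query_emb @ doc_emb.T              # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(300, 128))
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))

Because document representations are encoded independently of the query, the document (or page image) embeddings can be precomputed offline, and only the cheap MaxSim matching runs at query time.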

ColPali, in contrast, leverages VLMs to generate contextualized embeddings directly from document images, thus incorporating visual features into the retrieval process. The benefits of ColPali include its ability to efficiently retrieve visually rich documents and perform well on multimodal tasks. However, the incorporation of vision models comes with additional computational overhead during indexing, and its memory footprint is larger compared to text-only methods like ColBERT due to the storage requirements for visual embeddings. The indexing process in ColPali is more time-consuming than ColBERT’s, although the retrieval phase remains efficient due to the late interaction mechanism.

Importance and Further Details

Both ColBERT and ColPali are important as they address key challenges in document retrieval for different types of content. ColBERT’s contribution lies in optimizing BERT-based models for efficient text-based retrieval, bridging the gap between effectiveness and computational efficiency. Its late interaction mechanism allows it to retain the benefits of contextualized representations while significantly reducing the cost per query. ColPali’s significance is in expanding the scope of document retrieval to visually rich documents, which are often neglected by standard text-based approaches. By integrating visual information, ColPali sets the foundation for future retrieval systems that can handle diverse document formats more effectively, supporting applications like RAG in practical, multimodal settings.

Conclusion

In conclusion, ColBERT and ColPali represent advancements in document retrieval by addressing specific challenges in efficiency, effectiveness, and multimodality. ColBERT offers a computationally efficient way to leverage BERT’s capabilities for passage retrieval, making it ideal for large-scale text-heavy retrieval tasks. ColPali, meanwhile, extends retrieval capabilities to include visual elements, enhancing the retrieval performance for visually rich documents and highlighting the importance of multimodal integration in practical applications. Both models have their strengths and limitations, but together, they illustrate the ongoing evolution of document retrieval to handle increasingly diverse and complex data sources.

Check out the Papers on ColBERT and ColPali. All credit for this research goes to the researchers of this project.


Archon: A Machine Learning Framework for Large Language Model Enhancement Using Automated Inference-Time Architecture Search for Improved Task Performance

Artificial intelligence has made remarkable strides with the development of Large Language Models (LLMs), significantly impacting various domains, including natural language processing, reasoning, and even coding tasks. As LLMs grow more powerful, they require sophisticated methods to optimize their performance during inference. Inference-time techniques, strategies used to improve the quality of responses generated by these models at runtime, have become crucial. However, the research community has yet to establish best practices for integrating these techniques into a cohesive system.

A core challenge in improving LLM performance is determining which inference-time techniques yield the best results for different tasks. The problem is compounded by the sheer variety of functions, such as instruction-following, reasoning, and coding, which may benefit from various combinations of inference-time techniques. Moreover, understanding the complex interactions between techniques like ensembling, repeated sampling, ranking, fusion, and verification is crucial for maximizing performance. Researchers need a robust system that can efficiently explore the extensive design space of possible combinations and optimize these architectures according to the task and compute constraints.

Traditional methods for inference-time optimization have focused on applying individual techniques to LLMs. For instance, generation ensembling involves querying multiple models simultaneously and selecting the best response, while repeated sampling involves querying a single model numerous times. These techniques have shown promise, but their standalone application often leads to limited improvements. Frameworks like Mixture-of-Agents (MoA) and LeanStar have attempted to integrate multiple techniques but still face challenges in generalization and performance across various tasks. Thus, there is a growing demand for a modular, automated approach to building optimized LLM systems.
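As a minimal sketch of the standalone techniques mentioned above, the snippet below contrasts generation ensembling (querying several models once each) with repeated sampling (querying one model several times), followed by a placeholder ranking step; the stand-in models and the length-based scorer are assumptions for illustration only.

from typing import Callable, List

LLM = Callable[[str], str]  # placeholder type: in practice, a call to a real LLM endpoint

def ensemble_generate(models: List[LLM], prompt: str) -> List[str]:
    # Generation ensembling: query multiple models once each.
    return [model(prompt) for model in models]

def repeated_sample(model: LLM, prompt: str, n: int) -> List[str]:
    # Repeated sampling: query a single model n times (with sampling enabled).
    return [model(prompt) for _ in range(n)]

def rank_and_select(candidates: List[str], scorer: Callable[[str], float]) -> str:
    # Rank candidates with a scorer (e.g., a reward model or verifier) and keep the best.
    return max(candidates, key=scorer)

dummy_models = [lambda p, i=i: f"answer-from-model-{i}" for i in range(3)]
candidates = ensemble_generate(dummy_models, "What is 2 + 2?")
candidates += repeated_sample(dummy_models[0], "What is 2 + 2?", n=3)
print(rank_and_select(candidates, scorer=len))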

Researchers from Stanford University and the University of Washington have developed Archon, a modular framework designed to automate LLM architecture search using inference-time techniques. The Archon framework leverages diverse LLMs and inference-time methods, combining them into a cohesive system that surpasses traditional models’ performance. Rather than relying on a single LLM queried once, Archon dynamically selects, combines, and stacks layers of techniques to optimize performance for specific benchmarks. By treating the problem as a hyperparameter optimization task, the framework can identify optimal architectures that maximize accuracy, latency, and cost-efficiency for a given compute budget.

The Archon framework is structured as a multi-layered system where each layer performs a distinct inference-time technique. For example, the first layer might generate multiple candidate responses using an ensemble of LLMs, while subsequent layers apply ranking, fusion, or verification techniques to refine these responses. The framework uses Bayesian optimization algorithms to search potential configurations and select the most effective one for a target benchmark. This modular design allows Archon to outperform top-performing models like GPT-4o and Claude 3.5 Sonnet by an average of 15.1 percentage points across a wide range of tasks.
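The layered design can be pictured as a list of stages, each applying one inference-time technique, with an outer search loop scoring candidate configurations on a development benchmark. The sketch below uses random search as a simple stand-in for the Bayesian optimization Archon describes, and the stage options and scoring function are placeholders.

import random
from typing import Dict, List

TECHNIQUES = ["ensemble", "rank", "fuse", "verify"]  # illustrative stage options

def sample_architecture(max_layers: int = 4) -> List[Dict]:
    # A candidate architecture is an ordered list of layers, each with a technique
    # and a candidate count; this search space is a simplification.
    return [
        {"technique": random.choice(TECHNIQUES), "num_candidates": random.choice([1, 5, 10])}
        for _ in range(random.randint(1, max_layers))
    ]

def evaluate(architecture: List[Dict]) -> float:
    # Placeholder: in practice this would run the layered pipeline on a dev set
    # and return accuracy under the given compute budget.
    return random.random()

def search(num_trials: int = 20) -> List[Dict]:
    # Random search stand-in for Archon's Bayesian optimization over configurations.
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture()
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch

print(search())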

The performance of Archon was evaluated across several benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. The results were compelling: Archon architectures demonstrated an average accuracy increase of 11.2 percentage points using open-source models and 15.1 percentage points utilizing a mix of open-source and closed-source models. In coding tasks, the framework achieved a 56% improvement in Pass@1 scores, boosting accuracy from 17.9% to 29.3% through unit test generation and evaluation. Even when constrained to open-source models, Archon surpassed the performance of single-call state-of-the-art models by 11.2 percentage points, highlighting the efficacy of its layered approach.

The key results show that Archon achieves state-of-the-art performance in various domains by integrating multiple inference-time techniques. For instruction-following tasks, adding numerous layers of generation, ranking, and fusion significantly improved the quality of responses. Archon excelled in reasoning tasks like MixEval and MATH by incorporating verification and unit testing methods, leading to an average increase of 3.7 to 8.9 percentage points when applying task-specific architectures. The framework combined extensive sampling and unit test generation to produce accurate and reliable outputs for coding challenges.

Key Takeaways from the research on Archon:

Performance Boost: Archon achieves an average accuracy increase of 15.1 percentage points across various benchmarks, outperforming state-of-the-art models like GPT-4o and Claude 3.5 Sonnet.

Diverse Applications: The framework excels in instruction-following, reasoning, and coding tasks, showing versatility.

Effective Inference-Time Techniques: Archon provides superior performance in all evaluated scenarios by combining techniques such as ensembling, fusion, ranking, and verification.

Improved Coding Accuracy: Achieved a 56% boost in coding task accuracy by leveraging unit test generation and evaluation methods.

Scalability and Modularity: The framework’s modular design allows it to adapt easily to new tasks and configurations, making it a robust tool for LLM optimization.

In conclusion, Archon addresses the critical need for an automated system that optimizes LLMs at inference time by effectively combining various techniques. This research provides a practical solution to the complexities of inference-time architecture design, making it easier for developers to build high-performing LLM systems tailored to specific tasks. The Archon framework sets a new standard for optimizing LLMs. It offers a systematic and automated approach to inference-time architecture search, demonstrating its ability to achieve top-tier results across diverse benchmarks.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Enable or disable ACL crawling safely in Amazon Q Business

Amazon Q Business recently added support for administrators to modify the default access control list (ACL) crawling feature for data source connectors.
Amazon Q Business is a fully managed, AI-powered assistant with enterprise-grade security and privacy features. It includes over 40 data source connectors that crawl and index documents. By default, Amazon Q Business indexes ACL information attached to documents along with the documents themselves and uses this to filter chat responses based on the user’s document access. With this new feature, you can enable or disable ACL crawling as required by your business use case.
This post introduces the new ACL toggle feature for Amazon Q Business, which you can use to enable or disable ACL crawling. We’ll explore use cases for disabling ACLs and discuss how to safely enable or disable ACL crawling.
Overview of access control list crawling
Amazon Q Business data source connectors help crawl various data sources to collect and index content in Amazon Q Business for fast discovery and retrieval when answering user queries. These data sources often contain documents with different classifications such as public, internal public, private, and confidential. To provide fine-grained control over access rights, you can attach ACLs to documents, allowing you to specify different levels of access for various users or groups. To verify that Amazon Q Business respects access control policies and that users only receive responses for content they’re authorized to access, the data source connectors automatically crawl for access permissions associated with the content, user identifiers, and groups.

The preceding figure illustrates the Amazon Q Business data source crawler with ACL crawling enabled. As the connector retrieves content from the data source, it examines the associated ACL and compiles a list of users and groups with read permissions for each document. The connector also collects user identifiers, which are stored in the Amazon Q Business user store for quick matching during query execution. Both the ACL and content are optimized and stored in the Amazon Q Business index storage, enabling secure and efficient retrieval when answering user queries. For more information on the user store, see Understanding Amazon Q Business User Store.
When to disable ACL crawling?
ACL crawling builds a security-aware index that respects access control policies in the primary data source. This process helps maintain data privacy and access control required for regulatory compliance, making sure that sensitive information isn’t inadvertently exposed through user query results. It provides a scalable mechanism to handle large amounts of content while maintaining consistency between the actual access controls on the data and what’s discoverable through search. Because of these advantages, ACL crawling is strongly recommended for all data sources. However, there are some circumstances when you might need to disable it. The following are some reasons why you might disable ACL crawling.
Internally public content
Organizations often designate certain data sources as internally public, including HR policies, IT knowledge bases, and wiki pages. For instance, a company might allocate an entire Microsoft SharePoint site for policies accessible to all employees, classifying it as internal-public. In such cases, crawling ACLs for permissions that include all employees can be costly and create unnecessary overhead. Turning off ACL crawling might be advantageous in these scenarios.
Data source contains irreconcilable identities
Amazon Q Business requires all users to authenticate with an enterprise-approved identity provider (IdP). After successful authentication, Amazon Q Business uses the IdP-provided user identifier to match against the user identifier fetched from the data source during ACL crawling. This process validates user access to content before retrieving it for query responses.
However, because of legacy issues such as mergers and acquisitions, data source configuration limitations, or other constraints, the primary user identifier from the IdP might differ from the one in the data source. This discrepancy can prevent Amazon Q Business from retrieving relevant content from the index and answering user queries effectively.
In such cases, it might be necessary to disable ACL crawling and use alternative options. These include implementing attribute filters or building dedicated restricted applications with access limited to specific audiences and content. For more information on attribute filters, see Filtering chat responses using document attributes.
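If attribute filtering is used as an alternative, a filter can be passed at query time so that responses are generated only from documents carrying a given attribute. The boto3 sketch below shows the general shape of such a call; the application ID, attribute name, and filter structure are illustrative assumptions, so check the current Amazon Q Business API reference for the exact fields.

import boto3

qbusiness = boto3.client("qbusiness")

response = qbusiness.chat_sync(
    applicationId="your-application-id",              # placeholder
    userMessage="What is the travel reimbursement policy?",
    attributeFilter={                                  # restrict retrieval by a document attribute
        "equalsTo": {
            "name": "department",                      # hypothetical document attribute
            "value": {"stringValue": "HR"},
        }
    },
)
print(response["systemMessage"])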
Use case-driven targeted deployments
As a fully managed service, Amazon Q Business can be quickly deployed in multiple instances for scoped down targeted use cases. Examples include an HR bot in Slack or an AI assistant for customer support agents in a contact center. Because these AI assistants might be deployed for a limited audience, and the indexed content might be generally available to all users with application access, ACL crawling can be turned off.
Note of caution
Amazon Q Business cannot enforce access controls if ACL crawling is disabled. When ACL crawling is disabled for a data source, indexed content in that source will be considered accessible to users with access to the Amazon Q Business application. Therefore, disabling ACL crawling should be done with caution and due diligence. The following are some recommended best practices:

Notify data source content owners and administrators of your intent to disable ACL crawling and obtain their approval beforehand.
If applicable, consider implementing alternative options such as attribute filtering to restrict content retrieval or deploying a scoped-down, use-case-driven deployment to a limited audience.
Maintain a decision document that clearly articulates the reasons for disabling ACL crawling, the scope of affected content, and precautions taken to prevent indexing of sensitive information.

Note: As a precaution, you cannot disable ACL crawling for an existing Amazon Q Business data source that already has ACL crawling enabled. To disable ACL crawling, you must delete the data source and recreate it. You can only disable ACL crawling during the data source creation process, and this requires an account administrator to grant permission for disabling ACL crawling when configuring the data source.
Procedures for configuring ACL crawling
Amazon Q Business ACL crawling helps protect your data, and Amazon Q Business provides safeguards to help administrators and developers avoid accidentally disabling it. In this section, we cover how you can allow or deny the option to disable ACL crawling, explore procedures to enable or disable ACL crawling, explain how to monitor logs for ACL crawling configuration changes, and troubleshoot common issues.
Personas for configuring ACL crawling
ACL crawling configuration typically involves multiple roles, depending on your organizational structure. To maximize safeguards, it’s recommended that these roles are filled by different individuals. For faster deployments, identify the necessary personnel within your organization before starting the project and ensure they collaborate to complete the configuration. Here are the common roles needed for ACL crawling configuration:

AWS account administrator – An AWS account administrator is a user with full access to AWS services and the ability to manage IAM resources and permissions in the account. They can create and manage organizations, enabling centralized management of multiple AWS accounts.
Amazon Q Business administrator – An Amazon Q Business administrator is typically a user or role responsible for managing and configuring the Amazon Q Business service. Their duties include creating and optimizing Amazon Q Business indexes, setting up guardrails, and tuning relevance. They also set up and maintain connections to various data sources that Amazon Q Business will index, such as Amazon Simple Storage Service (Amazon S3) buckets, SharePoint, Salesforce, and Confluence.

Prerequisites for ACL crawling

Amazon Q Business application. For information on configuring a starter application, see Creating a sample Amazon Q Business application.
Amazon Q Business data source connector that supports ACL crawling configuration. For a complete list of connectors that support disabling ACL crawling, see Connecting Amazon Q Business data sources.
Data source authentication that meets the permissions required for crawling content and ACLs.

Process to disallow the option to disable ACL crawling
By default, the option to disable ACL crawling is enabled for an account. AWS account administrators can disallow this feature by setting up an account-level policy. It’s recommended to configure an explicit deny for production accounts by default. The following shows the associated actions in relation to the personas involved in the configuration process.

Administrators can attach the IAM action qbusiness:DisableAclOnDataSource to the Amazon Q Business administrator user or role policy to deny or allow the option to disable ACL crawling. The example IAM policy code snippet that follows demonstrates how to set up an explicit deny.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "qbusiness:DisableAclOnDataSource"
      ],
      "Resource": ["*"]
    }
  ]
}

Note that even if the option to disable ACL crawling is denied, the user interface might not gray out this option. However, if you attempt to create a data source with this option disabled, it will fail the validation check, and Amazon Q Business will not create the data source.
Process to disable ACL crawling for a data source connector
Before setting up a data source connector with ACL crawling disabled in your Amazon Q Business application deployment, make sure that you have no sensitive content in the data source or have implemented controls to help prevent accidental content exposure. Verify that the data source connector supports the option to disable ACL crawling. Notify information custodians, content owners, and data source administrators of your intent to disable ACL crawling and obtain their documented approvals, if necessary. If your account administrator has explicitly denied the option to disable ACL crawling, request temporary permission. After you have secured all approvals and exceptions, create a new data source with ACL crawling disabled and sync the data. With ACL crawling disabled, Amazon Q Business users will be able to discover knowledge and obtain answers from the indexed documents through this connector. Notify the account administrator to revert the account policy back to explicitly denying the disable ACL crawling option. The process and interaction between different roles are shown in the following chart.

The following is an overview of the procedure to create a data source with ACL crawling disabled using AWS Console:

Navigate to the Amazon Q Business console.
Select the Amazon Q Business application that you want to add a data source connector to.
Choose Add data source in the Data sources section and select the desired connector.
Update the connector configuration information. See Connecting Amazon Q Business data sources for configuration details.
In the Authorization section, choose Disable ACLs and check the acknowledgment to accept the risks of disabling ACL crawling.
Complete the remaining connector configuration and choose Save.
Sync the data source.

Note: You cannot disable ACL crawling for an existing data source connector that was created with ACL crawling enabled. You must create a new data source connector instance with ACL disabled and delete the older instance that has ACL crawling enabled.
Process to enable ACL crawling for a data source connector
Creating a data source connector with ACL crawling enabled is recommended and doesn’t require additional allow listing from AWS account administrators. To enable ACL crawling, you follow steps similar to disabling ACLs as described in the previous section. When configuring the data source connector using the console, choose Enable ACLs in the Authorization section to create a connector with ACL crawling enabled. You can also enable ACL crawling at any time for an existing data source connector that was created with this option disabled. Sync the data source connector for the ACL enforcement to take effect. Amazon Q Business users can only query and obtain answers from documents to which they have access in the original data source.
It’s important to review that the data source administrator has set up the required permissions properly, making sure that the crawler has permission to crawl for ACLs in the data source before enabling ACL crawling. You can find the required permissions in the prerequisite section of the connector in Connecting Amazon Q Business data sources. The following shows the process for setting up a data source connector with ACL crawling enabled.
Logging and monitoring the ACL crawling configuration
Amazon Q Business uses AWS CloudTrail for logging API calls related to ACL crawling configuration. You can monitor the CloudTrail log for CreateDataSource and UpdateDataSource API calls to identify ACL crawling-related changes made to data source configuration. For a complete list of Amazon Q Business APIs that are logged to CloudTrail, see Logging Amazon Q Business API calls using AWS CloudTrail.
Administrators can configure Amazon CloudWatch alarms to generate automated alert notifications if ACL crawling is disabled for a data source connector, allowing them to initiate corrective action. For step-by-step instructions on setting up CloudWatch alarms based on CloudTrail events, see How do I use CloudWatch alarms to monitor CloudTrail events.
The example CloudWatch alarm code snippet that follows shows the filter pattern for identifying events related to disabling ACL crawling in a data source connector.

{
($.eventSource = qbusiness.amazonaws.com)
&& (
($.eventName = CreateDataSource)
|| ($.eventName = UpdateDataSource)
)
&& ($.requestParameters.disableAclCrawl = true)
}
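As a sketch of how this filter pattern could be wired into an alert, the following boto3 snippet creates a metric filter on a CloudTrail log group and an alarm on the resulting metric. The log group name, metric namespace, and SNS topic ARN are placeholders for your own resources.

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

FILTER_PATTERN = (
    "{ ($.eventSource = qbusiness.amazonaws.com) && "
    "(($.eventName = CreateDataSource) || ($.eventName = UpdateDataSource)) && "
    "($.requestParameters.disableAclCrawl = true) }"
)

# Metric filter on the CloudTrail log group (log group name is a placeholder).
logs.put_metric_filter(
    logGroupName="CloudTrail/DefaultLogGroup",
    filterName="QBusinessAclCrawlingDisabled",
    filterPattern=FILTER_PATTERN,
    metricTransformations=[{
        "metricName": "AclCrawlingDisabledCount",
        "metricNamespace": "QBusinessSecurity",
        "metricValue": "1",
    }],
)

# Alarm that fires whenever the metric is nonzero (SNS topic ARN is a placeholder).
cloudwatch.put_metric_alarm(
    AlarmName="QBusinessAclCrawlingDisabled",
    Namespace="QBusinessSecurity",
    MetricName="AclCrawlingDisabledCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:security-alerts"],
)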

Tips for troubleshooting
When configuring Amazon Q Business data source connectors, you might occasionally encounter issues. The following are some common errors and their possible resolutions.
Not authorized to disable ACL crawling
When creating a new data source connector with ACL crawling disabled, you might see an error message stating “not authorized to perform: qbusiness:DisableAclOnDataSource”, as shown in the following image.

This error indicates that your administrator has explicitly denied the option to disable ACL crawling for your AWS account. Contact your administrator to allow-list this action for your account. For more details, see the Process to disable ACL crawling for a data source connector section earlier in this post.
Data source connection errors
Data source connectors might also fail to connect to your data source or crawl data. In such cases, verify that Amazon Q Business can reach the data source through the public internet or through a VPC private network. See Connecting Amazon Q Business data sources to make sure that your data source authentication has the permissions needed to crawl content and ACLs, if enabled.
Identity and ACL mismatch errors
Finally, after successfully syncing data with ACL crawling enabled, some users might still be unable to get answers to queries, even though the relevant documents were indexed. This issue commonly occurs when the user lacks access to the indexed content in the original data source, or when the user identity obtained from the data source doesn’t match the sign-in identity. To troubleshoot such ACL mismatch issues, examine the data source sync report. For more information, see Introducing document-level sync reports: Enhanced data sync visibility in Amazon Q Business.
Key considerations and recommendations
Given the impact that disabling ACL crawling can have on content security, consider these restrictions and best practices when disabling ACL crawling in Amazon Q Business data source connectors:

ACL crawling enablement is a one-way control mechanism. After it’s enabled, you cannot disable it. This helps prevent accidentally disabling ACL crawling in production environments.
Keep ACL crawling enabled by default and disable it only for the subset of data source connectors that require it.
If necessary, consider splitting the indexing of a data source by setting up multiple data source connectors and limiting ACL crawling disablement to a smaller content segment. Use the document Inclusion and Exclusion feature of data source connectors to define the indexing scope.
When ACL crawling is disabled because of irreconcilable identities, consider alternative options. These include implementing attribute filters, restricting access to the Amazon Q Business application, and setting up guardrails.
As a security best practice, AWS Organizations and account administrators should add a service control policy to explicitly deny the qbusiness:DisableAclOnDataSource permission for all accounts. Grant this permission only when requested by an Amazon Q Business administrator. After configuring a data source connector with ACL crawling disabled, revert to an explicit deny. Use a ticketing system to maintain a record of exception approvals.
Currently, disabling ACL crawling is available for limited connectors, including ServiceNow, Confluence, SharePoint, Jira, Google Drive, OneDrive, Salesforce, Zendesk, GitHub, MS Teams, and Slack. For the latest list of connectors that support disabling ACL crawling, see Connecting Amazon Q Business data sources.

Clean up
To avoid incurring additional charges, make sure you delete any resources created in this post.

To delete any data source created in Amazon Q Business, follow the instructions in Deleting an Amazon Q Business data source connector.
To delete any Amazon Q Business application created, follow the instructions in Deleting an application.

Conclusion
Amazon Q Business data source connector ACL crawling is an essential feature that helps organizations build, manage, and scale secure AI assistants. It plays a crucial role in enforcing regulatory and compliance policies and protecting sensitive content. With the introduction of a self-service feature to disable ACL crawling, Amazon Q Business now provides you more autonomy to choose deployment options that suit your organization’s business needs. To start building secure AI assistants with Amazon Q Business, explore the Getting started guide.

About the Authors
Rajesh Kumar Ravi, a Senior Solutions Architect at Amazon Web Services, specializes in building generative AI solutions using Amazon Q Business, Amazon Bedrock, and Amazon Kendra. He helps businesses worldwide implement these technologies to enhance efficiency, innovation, and competitiveness. An accomplished technology leader, Rajesh has experience developing innovative AI products, nurturing the builder community, and contributing to new ideas. Outside of work, he enjoys walking and short hiking trips.
Meenakshisundaram Thandavarayan works for AWS as an AI/ML Specialist. He has a passion to design, create, and promote human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker and strives to drive business to new ways of working through innovation, incubation, and democratization.
Amit Choudhary is a Product Manager for Amazon Q Business connectors. He loves to build products that make it easy for customers to use privacy-preserving technologies (PETs) such as differential privacy.
Keerthi Kumar Kallur is a Software Development Engineer at AWS. He is part of the Amazon Q Business team and worked on various features with customers. In his spare time, he likes to do outdoor activities such as hiking and sports such as volleyball.

SK Telecom improves telco-specific Q&A by fine-tuning Anthropic’s Claude models in Amazon Bedrock

This post has been co-written with Seunghyun Jeong, Sunwoo Lee and Eric Davis from SK Telecom.
SK Telecom (SKT), South Korea’s leading telecommunications company serving 30 million customers, is at the forefront of AI innovation. In line with its AI Pyramid Strategy, which aims to unlock AI’s potential for anyone, anywhere, anytime, SKT has collaborated with the AWS Generative AI Innovation Center (GenAIIC) Custom Model Program to explore domain-trained models using Amazon Bedrock for telco-specific use cases.
This collaboration aligns with SKT’s vision of using AI expertise and strategic partnerships to develop innovative AI-based products and services. One such initiative focused on developing a custom solution for grounded question answering (Q&A) based on reference documents.
Retrieval Augmented Generation (RAG) is a popular technique for Q&A tasks, offering improved factual accuracy and knowledge grounding. However, RAG faces two challenges in telco use cases: generated responses may not match the preferred tone, style, and manner, and irrelevant documents may be retrieved, potentially leading to inaccurate responses. To address this, SKT and AWS GenAIIC aimed to use model customization to improve Anthropic Claude models on Amazon Bedrock in three key areas:

Providing concise and informative answers
Correctly referencing links from retrieved documents
Answering in a tone and style consistent with SKT and similar to ground truth answers

Additionally, the team explored boosting smaller model performance using synthetic data generated by bigger large language models (LLMs) for knowledge distillation and scenarios with limited labeled training data.
Amazon Bedrock is a fully managed service that offers a variety of LLMs and foundation models (FMs), along with capabilities such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and Amazon Bedrock Guardrails that can expedite many generative AI use cases. It is the only fully managed service that lets you fine-tune Claude models, offering an intuitive and secure way to customize Anthropic’s Claude models and more. The fine-tuned Claude model can be deployed on Amazon Bedrock and seamlessly use its capabilities, for example, Amazon Bedrock Knowledge Bases for telco domain-specific RAG or Amazon Bedrock Agents for agentic use cases.
In this post, we share how SKT customizes Anthropic Claude models for telco-specific Q&A regarding technical telecommunication documents of SKT using Amazon Bedrock.
Solution overview
The team explored combinations of prompt optimization, customization (fine-tuning), and data augmentation with synthetic data. This multifaceted approach aimed to maximize the benefits of each technique for the grounded Q&A generation task.
In the following sections, we explore these methods in more detail.
Anthropic’s Claude customization with prompt optimization
Fine-tuning, which is available through Amazon Bedrock for various FMs, including Anthropic’s Claude, allows adaptation of pre-trained language models for specific use cases. It’s particularly effective for tailoring response style and format adherence.
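At a high level, a fine-tuning (model customization) job on Amazon Bedrock is created by pointing the service at prepared training data in Amazon S3. The boto3 sketch below shows the general shape of that call; the base model identifier, IAM role, S3 paths, and hyperparameters are placeholders, not SKT’s actual configuration.

import boto3

bedrock = boto3.client("bedrock")

# Illustrative sketch: identifiers, ARNs, and hyperparameters are placeholders.
response = bedrock.create_model_customization_job(
    jobName="telco-qa-finetune-demo",
    customModelName="telco-qa-claude-demo",
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",
    baseModelIdentifier="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder base model ID
    trainingDataConfig={"s3Uri": "s3://your-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://your-bucket/output/"},
    hyperParameters={"epochCount": "2", "learningRateMultiplier": "1.0"},
)
print(response["jobArn"])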
The team first optimized the system prompt, implementing standardized guidelines for answer formatting and document citation based on Anthropic model prompting best practices (a hypothetical prompt sketch follows this list). Key focus areas included:

Clear presentation of system commands
Consistent use of code block formatting
Context-based tailored responses
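A minimal sketch of what such a system prompt might look like is shown below; the wording is hypothetical and not SKT’s actual prompt.

SYSTEM_PROMPT = """You are a customer support assistant for a telecommunications company.
Answer only from the reference documents provided between <documents> tags.
Keep answers concise and informative, in a polite, professional tone.
When you use a document, cite its link at the end of the answer in the form [title](url).
If the documents do not contain the answer, say that you cannot find it."""

def build_messages(documents: str, question: str) -> list:
    # Message structure is illustrative; adapt it to the model API you call.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"<documents>\n{documents}\n</documents>\n\n{question}"},
    ]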

This prompt engineering, combined with fine-tuning, yielded substantial improvements:

Over 50% increase in ROUGE-3 score
Over 25% improvement in ROUGE-L score
Over 4% increase in embedding similarity score
Significant progress in accurate reference citation

The iterative enhancement process demonstrated cumulative benefits, with prompt updates alone showing 35–40 percent improvements in key metrics, and the final customized model achieving 50–60 percent gains in some metrics.
This progression clearly illustrates the cumulative benefits of model customization through RAG, prompt engineering, and fine-tuning, resulting in a model that significantly outperformed both the baseline and the prompt-updated versions in terms of ROUGE scores and citation accuracy. ROUGE score measures the similarity between ground truths and generated results by computing N-gram word overlaps. The following table summarizes these improvements.

LLM | Prompt update | Fine-tuning | ROUGE-3 | ROUGE-L | Citation accuracy
Anthropic's Claude 3 Sonnet | No | No | baseline | baseline | baseline
Anthropic's Claude 3 Sonnet | Yes | No | +38.30% | +13.4% | +52.94%
Anthropic's Claude 3 Sonnet | Yes | Yes | +58.1% | +26.8% | +70.59%

(ROUGE-3, ROUGE-L, and citation accuracy values are relative improvements over the baseline.)

Synthetic data for fine-tuning
To address the challenge of limited high-quality labeled training data, the team explored synthetic data generation techniques. This approach also facilitates knowledge distillation from larger LLMs to smaller, more targeted models, offering benefits such as lower latency and cost.
The team conducted controlled experiments using:

A baseline set of 500 ground truth samples
An augmented set of 500 original plus 1,500 synthetic samples
A larger original set of 2,000 samples

Synthetic data was generated using Anthropic’s Claude 3 Sonnet, creating new question-answer pairs over the same retrieved documents used in ground truth examples.
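A minimal sketch of this kind of synthetic Q&A generation with the Amazon Bedrock Converse API is shown below; the model ID, prompt wording, and inference settings are illustrative assumptions rather than the team’s actual pipeline.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_synthetic_qa(document: str, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
    # Ask a larger model to write a new question-answer pair grounded in a retrieved document.
    prompt = (
        "Based only on the reference document below, write one new question a telecom "
        "customer might ask and a concise, grounded answer.\n\n"
        f"<document>\n{document}\n</document>\n\n"
        "Format the output as:\nQuestion: ...\nAnswer: ..."
    )
    response = bedrock_runtime.converse(
        modelId=model_id,  # placeholder model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.7},
    )
    return response["output"]["message"]["content"][0]["text"]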
The results were evaluated using both LLM-based comparison and human preference evaluation. Human evaluators blindly ranked model outputs, with scores assigned based on preference (Best: 4, Second: 3, Third: 2, Worst: 1). The following table shows the results of the human preference evaluation scores.

Rank | Model | Cumulative score (best possible: 160)
1 | Fine-tuned with 2,000 original samples | 114
2 | Fine-tuned with 500 original and 1,500 synthetic samples | 112
3 | Fine-tuned with 500 original samples | 85
4 | No fine-tuning (baseline) | 84

Some key findings include:

Small training sets (500 samples) showed minimal improvement over baseline
Larger training sets (2,000 samples) scored considerably higher
Synthetically augmented data performed similarly to equivalent-sized original data

Although having a large volume of domain-specific training data is always ideal, many businesses have limited available datasets. In such scenarios, synthetic data can play a crucial role in place of original data. This demonstrates the potential of synthetic data for model customization.
Conclusion
SK Telecom’s collaboration with AWS GenAIIC showcases the company’s commitment to developing innovative AI solutions for telco challenges. By using Amazon Bedrock to customize Anthropic’s Claude models, SKT has achieved significant performance improvements for telco-specific, Korean language use cases without the need to build models from scratch. The proof of concept demonstrated significant improvements:

~58% increase in ROUGE-3 score
~27% increase in ROUGE-L score
Substantial improvement in returning correct reference links

This approach, combined with synthetic data generation techniques, aligns with SKT’s AI Pyramid Strategy, enabling faster testing and development of new approaches. As SKT continues to focus on key areas such as personal AI assistants, AI healthcare, and AI data centers, this collaboration with AWS represents a significant step in their AI evolution and long-term competitiveness in the global AI landscape.
For those interested in working with AWS on similar projects, visit Generative AI Innovation Center.

About the Authors
Sungmin Hong is a Senior Applied Scientist at AWS Generative AI Innovation Center where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading and cooking.
Sujeong Cha is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. She has extensive hands-on experience in solving customers’ business use cases by utilizing generative AI as well as traditional AI/ML solutions. Sujeong holds a M.S. degree in Data Science from New York University.
Arijit Ghosh Chowdhury is a Scientist with the AWS Generative AI Innovation Center, where he works on model customization and optimization. In his role, he works on applied research in fine-tuning and model evaluations to enable GenAI for various industries. He has a Master’s degree in Computer Science from the University of Illinois at Urbana Champaign, where his research focused on question answering, search and domain adaptation.
Yiyue Qian is an Applied Scientist II at the AWS Generative AI Innovation Center, where she supports providing generative AI solutions to AWS customers. In this role, she collaborates with a team of experts to develop innovative AI-driven models for AWS customers across various industries. Yiyue holds a Ph.D. in Computer Science from the University of Notre Dame, where her research focused on advanced machine learning and deep learning techniques.
Wei-Chih Chen is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he works on model customization and optimization for LLMs. He also builds tools to help his team tackle various aspects of the LLM development life cycle, including fine-tuning, benchmarking, and load testing, accelerating the adoption of diverse use cases for AWS customers. He holds an M.S. degree in Computer Science from UC Davis.
Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating Generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a Ph.D in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.
Seunghyun Jeong (Steve) is a team leader of the Platform Application team at SKT. He is responsible for commercializing the Global Intelligence Platform (GIP), which provides AI models and tools. His team is expanding the delivery of models and features to make it easier for internal teams to apply AI, contributing to SKT’s AI Transformation. Before entering the AI space, he was a Product Manager, developing and operating various mobile services such as mobile wallet, fashion streaming, and unified login services for the US and Korea.
Sunwoo Lee (Lois) is the team leader of the Data Construction and Evaluation Team within SK Telecom’s Global AI Tech division. She oversees the design and construction of training data for language models, the model performance evaluation process, and its application to services. Her career has focused on NLP within IT, which is a great fit with her background in Linguistics and Korean language education. Alongside her world-class team, she continues to explore and solve fascinating problems such as how to optimize the design of data for language model training, which tasks and methods to implement for validating language model performance, and the best design of AI-human conversations.
Eric Davis is the vice president of the AI Tech Collaboration Group at SKT. Eric oversees tech collaborations with worldwide tech partners to customize large language models (LLMs) for the telecommunications domain. His teams are responsible for designing and building the datasets to tune LLMs, as well as benchmarking LLMs in general and for the telecommunications domain. Eric holds a Master of Science degree in Computer Science from Carnegie Mellon from the Language Technologies Institute and a Bachelor of Arts in Linguistics and Psychology from the University of California, Los Angeles.

Scaling Rufus, the Amazon generative AI-powered conversational shopping assistant with over 80,000 AWS Inferentia and AWS Trainium chips, for Prime Day

Amazon Rufus is a shopping assistant experience powered by generative AI. It generates answers using relevant information from across Amazon and the web to help Amazon customers make better, more informed shopping decisions. With Rufus, customers can shop alongside a generative AI-powered expert that knows Amazon’s selection inside and out, and can bring it all together with information from across the web to help shoppers make more informed purchase decisions.
To meet the needs of Amazon customers at scale, Rufus required a low-cost, performant, and highly available infrastructure for inference. The solution needed the capability to serve multi-billion parameter large language models (LLMs) with low latency across the world to service its expansive customer base. Low latency makes sure users have a positive experience chatting with Rufus and can start getting responses in less than a second. To achieve this, the Rufus team is using multiple AWS services and AWS AI chips, AWS Trainium and AWS Inferentia.
Inferentia and Trainium are purpose-built chips developed by AWS that accelerate deep learning workloads with high performance and lower overall costs. With these chips, Rufus reduced its costs by 4.5 times compared to other evaluated solutions while maintaining low latency for its customers. In this post, we dive into the Rufus inference deployment using AWS chips and how this enabled one of the most demanding events of the year—Amazon Prime Day.
Solution overview
At its core, Rufus is powered by an LLM trained on Amazon’s product catalog and information from across the web. LLM deployment can be challenging, requiring you to balance factors such as model size, model accuracy, and inference performance. Larger models generally have better knowledge and reasoning capabilities but come at a higher cost due to more demanding compute requirements and increasing latency. Rufus would need to be deployed and scale to meet the tremendous demand of peak events like Amazon Prime Day. Considerations for this scale include how well it needs to perform, its environmental impact, and the cost of hosting the solution. To meet these challenges, Rufus used a combination of AWS solutions: Inferentia2 and Trainium, Amazon Elastic Container Service (Amazon ECS), and Application Load Balancer (ALB). In addition, the Rufus team partnered with NVIDIA to power the solution using NVIDIA’s Triton Inference Server, providing capabilities to host the model using AWS chips.
Rufus inference is a Retrieval Augmented Generation (RAG) system with responses enhanced by retrieving additional information such as product information from Amazon search results. These results are based on the customer query, making sure the LLM generates reliable, high-quality, and precise responses.
To make sure Rufus was best positioned for Prime Day, the Rufus team built a heterogeneous inference system using multiple AWS Regions powered by Inferentia2 and Trainium. Building a system across multiple Regions allowed Rufus to benefit in two key areas. First, it provided additional capacity that could be used during times of high demand, and second, it improved the overall resiliency of the system.
The Rufus team was also able to use both Inf2 and Trn1 instance types. Because Inf2 and Trn1 instance types use the same AWS Neuron SDK, the Rufus team was able to use both instances to serve the same Rufus model. The only configuration setting to adjust was the tensor parallelism degree (24 for Inf2, 32 for Trn1). Using Trn1 instances also led to an additional 20% latency reduction and throughput improvement compared to Inf2.
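Because the Neuron SDK abstracts the hardware differences, switching between the two instance families conceptually comes down to one setting. The sketch below illustrates that with vLLM’s Neuron support; the model name and other arguments are placeholders, not the Rufus team’s serving code.

# Illustrative sketch of serving one model on either instance family by changing
# only the tensor parallelism degree; not the Rufus production configuration.
from vllm import LLM, SamplingParams

INSTANCE_TP = {"inf2": 24, "trn1": 32}   # tensor parallelism degree per instance type
instance_type = "trn1"

llm = LLM(
    model="your-org/your-multi-billion-parameter-model",  # placeholder model
    device="neuron",                                       # requires the Neuron-enabled build of vLLM
    tensor_parallel_size=INSTANCE_TP[instance_type],
    max_num_seqs=8,                                        # continuous batching concurrency (illustrative)
)

outputs = llm.generate(["What headphones are good for running?"],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)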
The following diagram illustrates the solution architecture.

To support real-time traffic routing across multiple Regions, Rufus built a novel traffic orchestrator. Amazon CloudWatch supported the underlying monitoring, helping the team adjust the traffic ratio across the different Regions in less than 15 minutes based on the traffic pattern changes. By using this type of orchestration, the Rufus team had the ability to direct requests to other Regions when needed, with a small trade-off of latency to the first token. Due to Rufus’s streaming architecture and the performant AWS network between Regions, the perceived latency was minimal for end-users.
These choices allowed Rufus to scale up to over 80,000 Trainium and Inferentia chips across three Regions, serving an average of 3 million tokens a minute while maintaining a P99 latency of less than 1 second to the first response for Prime Day customers. In addition, by using these purpose-built chips, Rufus achieved 54% better performance per watt than other evaluated solutions, which helped the Rufus team meet energy efficiency goals.
Optimizing inference performance and host utilization
Within each Region, the Rufus inference system used Amazon ECS, which managed the underlying Inferentia- and Trainium-powered instances. By managing the underlying infrastructure, the Rufus team only needed to bring their container and configuration by defining an ECS task. Within each container, an NVIDIA Triton Inference Server with a Python backend runs vLLM with the Neuron SDK. vLLM is a memory-efficient inference and serving engine that is optimized for high throughput. The Neuron SDK makes it straightforward for teams to adopt AWS chips and supports many different libraries and frameworks such as PyTorch Lightning.
The Neuron SDK provides a straightforward LLM inference solution on Trainium and Inferentia hardware, with optimized performance supporting a wide range of transformer-based LLM architectures. To reduce latency, Rufus collaborated with the AWS Annapurna team to develop various optimizations such as INT8 (weight-only) quantization, continuous batching with vLLM, and improvements to resource, compute, and memory bandwidth usage in the Neuron compiler and runtime. These optimizations are currently deployed in Rufus production and are available to use in the Neuron SDK 2.18 and onward.
To reduce overall waiting time for customers to start seeing a response from Rufus, the team also developed an inference streaming architecture. With the high compute and memory load needed for LLM inference, the total time it takes to finish generating the full response for a customer query can take multiple seconds. With a streaming architecture, Rufus is able to return the tokens right after they’re generated. This optimization allows the customer to start consuming the response in less than 1 second. In addition, multiple services work together using gRPC connections to intelligently aggregate and enhance the streaming response in real time for customers.
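Conceptually, the streaming path returns tokens to the customer as soon as each one is generated instead of waiting for the full response. The generator-based sketch below illustrates the idea with a stand-in token source; the real system streams over gRPC across multiple services.

import time
from typing import Iterator

def generate_tokens(prompt: str) -> Iterator[str]:
    # Stand-in for the model: yields tokens one at a time as they are "generated".
    for token in ["Wireless ", "earbuds ", "with ", "good ", "battery ", "life..."]:
        time.sleep(0.05)   # simulated per-token generation latency
        yield token

def stream_response(prompt: str) -> None:
    # The first token reaches the customer as soon as it exists, rather than
    # after the entire multi-second generation finishes.
    for token in generate_tokens(prompt):
        print(token, end="", flush=True)   # in production, sent over a gRPC stream
    print()

stream_response("What headphones are good for running?")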
As shown in the following figure, images and links are embedded in the response, which allow customers to engage and continue exploring with Rufus.

Scaling up
Although we have to maintain low latency for the best customer experience, it’s also crucial to scale the service throughput by achieving high hardware resource utilization. High hardware utilization makes sure accelerators don’t sit idle and needlessly increase costs. To optimize the inference system throughput, the team improved both single-host throughput as well as load balancing efficiency.
Load balancing for LLM inference is tricky due to the following challenges. First, a single host can only handle a limited number of concurrent requests. Second, the end-to-end latency to complete one request can vary, spanning many seconds depending on the LLM response length.
To address the challenges, the team optimized throughput by considering both single-host throughput and throughput across many hosts using load balancing.
The team used the least outstanding requests (LOR) routing algorithm from ALB, increasing throughput by five times compared to an earlier baseline measurement. This allows each host to have enough time to process in-flight requests and stream back responses using a gRPC connection, without getting overwhelmed by multiple requests received at the same time. Rufus also collaborated with AWS and vLLM teams to improve single-host concurrency using vLLM integration with the Neuron SDK and NVIDIA Triton Inference Server.
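Switching an ALB target group from the default round robin to least outstanding requests routing is a single attribute change. The boto3 sketch below shows the call; the target group ARN is a placeholder.

import boto3

elbv2 = boto3.client("elbv2")

# Switch the routing algorithm from the default round robin to least outstanding requests.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/rufus-demo/abc123",
    Attributes=[{
        "Key": "load_balancing.algorithm.type",
        "Value": "least_outstanding_requests",
    }],
)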

Figure 1. ECS tasks scale horizontally hosting the Triton Inference Server and dependencies

With this integration, Rufus was able to benefit from a critical optimization: continuous batching. Continuous batching allows a single host to greatly increase throughput. In addition, it provides capabilities that other batching techniques, such as static batching, lack. For example, with static batching, the time to first token (TTFT) increases linearly with the number of requests in a batch. Continuous batching prioritizes the prefill stage of LLM inference, keeping TTFT under control even with more requests running at the same time. This helped Rufus deliver a pleasant, low-latency experience when generating the first response and improve single-host throughput to keep serving costs under control.
Conclusion
In this post, we discussed how Rufus reliably deploys and serves its multi-billion-parameter LLM using the Neuron SDK with Inferentia2 and Trainium chips and AWS services. Rufus continues to evolve with advancements in generative AI and customer feedback, and we encourage you to use Inferentia and Trainium.
Learn more about how we are innovating with generative AI across Amazon.

About the author
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.
RJ is an Engineer within Amazon. He builds and optimizes distributed systems for training and works on optimizing serving systems to reduce latency for ML inference. Outside work, he is exploring using generative AI for building food recipes.
Yang Zhou is a software engineer working on building and optimizing machine learning systems. His recent focus is enhancing the performance and cost efficiency of generative AI inference. Beyond work, he enjoys traveling and has recently discovered a passion for running long distances.
Adam (Hongshen) Zhao is a Software Development Manager at Amazon Stores Foundational AI. In his current role, Adam leads the Rufus Inference team to build GenAI inference optimization solutions and an inference system at scale for fast inference at low cost. Outside work, he enjoys traveling with his wife and creating art.
Faqin Zhong is a software engineer at Amazon Stores Foundational AI, working on Large Language Model (LLM) inference infrastructure and optimizations. Passionate about Generative AI technology, Faqin collaborates with leading teams to drive innovations, making LLMs more accessible and impactful, ultimately enhancing customer experiences across diverse applications. Outside of work she enjoys cardio exercise and baking with her son.
Nicolas Trown is an engineer in Amazon Stores Foundational AI. His recent focus is lending his systems expertise across Rufus to aid the Rufus Inference team and improve efficient utilization across the Rufus experience. Outside of work he enjoys spending time with his wife and taking day trips to the nearby coast, Napa, and Sonoma areas.
Bing Yin is a director of science at Amazon Stores Foundational AI. He leads the effort to build LLMs that are specialized for shopping use cases and optimized for inference at Amazon scale. Outside of work, he enjoys running marathon races.

Differential Transformer: A Foundation Architecture for Large Language …

Transformer architecture has enabled large language models (LLMs) to perform complex natural language understanding and generation tasks. At the core of the Transformer is an attention mechanism designed to assign importance to various tokens within a sequence. However, this mechanism distributes attention unevenly, often allocating focus to irrelevant contexts. This phenomenon, known as “attention noise,” hinders the model’s ability to identify and utilize key information from lengthy sequences accurately. It becomes especially problematic in applications such as question answering, summarization, and in-context learning, where a clear and precise understanding of the context is critical.

One of the main challenges researchers face is ensuring that these models can correctly identify and focus on the most relevant segments of the text without being distracted by the surrounding context. This problem becomes more pronounced when scaling up the models regarding size and training tokens. The attention noise hampers the retrieval of key information and leads to issues such as hallucination, where models generate factually incorrect information or fail to follow logical coherence. As models grow larger, these problems become more challenging to address, making it crucial to develop new methods to eliminate or minimize attention noise.

Previous methods to tackle attention noise have included modifications to the architecture, training regimen, or normalization strategies. However, these solutions often have trade-offs regarding increased complexity or reduced model efficiency. For instance, some techniques rely on dynamic attention mechanisms that adjust focus based on context but struggle with maintaining consistent performance in long-context scenarios. Others incorporate advanced normalization strategies, but they add computational overhead and complexity. As a result, researchers have been looking for simpler yet effective ways to enhance the performance of LLMs without compromising on scalability or efficiency.

Microsoft Research and Tsinghua University researchers have introduced a new architecture called the Differential Transformer (DIFF Transformer). This novel architecture addresses the problem of attention noise by introducing a differential attention mechanism that effectively filters out irrelevant context while amplifying attention to meaningful segments. The differential attention mechanism operates by splitting the query and key vectors into two groups and computing two separate softmax attention maps. The difference between these maps serves as the final attention score, canceling common-mode noise and enabling the model to pivot more accurately on the intended information. This approach is inspired by concepts from electrical engineering, such as differential amplifiers, where common noise is canceled by taking the difference between two signals.
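To make the split-and-subtract idea concrete, here is a minimal single-head PyTorch sketch. It follows the mechanism described above, but the scalar lam and the dimensions are simplified assumptions; the paper's implementation is multi-head and includes additional details such as per-head normalization.

import torch
import torch.nn.functional as F

def differential_attention(x, Wq, Wk, Wv, lam):
    # x: (batch, seq_len, d_model); Wq and Wk project to 2*d_head, Wv to d_head.
    q = x @ Wq
    k = x @ Wk
    v = x @ Wv
    q1, q2 = q.chunk(2, dim=-1)  # split queries into two groups
    k1, k2 = k.chunk(2, dim=-1)  # split keys into two groups
    scale = q1.shape[-1] ** -0.5
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
    # The difference of the two maps is the final attention score,
    # cancelling common-mode noise shared by both maps.
    return (a1 - lam * a2) @ v

B, T, d_model, d_head = 2, 16, 64, 32
x = torch.randn(B, T, d_model)
out = differential_attention(
    x,
    Wq=torch.randn(d_model, 2 * d_head),
    Wk=torch.randn(d_model, 2 * d_head),
    Wv=torch.randn(d_model, d_head),
    lam=0.8,
)
print(out.shape)  # torch.Size([2, 16, 32])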

The DIFF Transformer consists of several layers containing a differential attention module and a feed-forward network. It retains the macrostructure of the original Transformer, ensuring compatibility with existing architectures while introducing innovations at the micro level. The model incorporates improvements like pre-RMSNorm and SwiGLU, borrowed from the LLaMA architecture, contributing to enhanced stability and efficiency during training.

The DIFF Transformer outperforms traditional Transformers in several key areas. For instance, it achieves comparable language modeling performance using only 65% of the model size and training tokens required by conventional Transformers. This translates into a 38% reduction in the number of parameters and a 36% decrease in the number of training tokens needed, directly resulting in a more resource-efficient model. When scaled up, a DIFF Transformer with 7.8 billion parameters achieves a language modeling loss similar to a 13.1 billion parameter Transformer, thereby matching performance while using only 59.5% of the parameters. This demonstrates the scalability of the DIFF Transformer, allowing for effective handling of large-scale NLP tasks with significantly lower computational costs.

In a series of tests, the DIFF Transformer demonstrated a remarkable capability for key information retrieval, outperforming the traditional Transformer by up to 76% in tasks where key information was embedded within the first half of a long context. In a “Needle-In-A-Haystack” experiment, where relevant answers were placed at varying positions within contexts of up to 64,000 tokens, the DIFF Transformer consistently maintained high accuracy, even when distractors were present. The traditional Transformer, in comparison, saw a steady decline in accuracy as the context length increased, highlighting the superior ability of the DIFF Transformer to maintain focus on relevant content.

The DIFF Transformer significantly reduced hallucination rates compared to conventional models. In a detailed evaluation using question-answering datasets such as Qasper, HotpotQA, and 2WikiMultihopQA, the DIFF Transformer achieved a 13% higher accuracy in single-document question answering and a 21% improvement in multi-document question answering. It achieved an average accuracy gain of 19% on text summarization tasks, effectively reducing the generation of factually incorrect or misleading summaries. These results underscore the robustness of the DIFF Transformer in diverse NLP applications.

The differential attention mechanism also improves the stability of the DIFF Transformer when dealing with permutations of context order, whereas traditional Transformers exhibit high variance in performance when the order of the context changes. The DIFF Transformer showed minimal performance fluctuation, indicating greater robustness to order sensitivity. In a comparative evaluation, the standard deviation of the DIFF Transformer's accuracy across multiple order permutations was less than 2%, while the traditional Transformer's variance was over 10%. This stability makes the DIFF Transformer particularly suitable for applications involving in-context learning, where the model's ability to utilize information from a changing context is crucial.

In conclusion, the DIFF Transformer introduces a groundbreaking approach to addressing attention noise in large language models. By implementing a differential attention mechanism, the model can achieve superior accuracy and robustness with fewer resources, positioning it as a promising solution for academic research and real-world applications.

Check out the Paper and Code Implementation. All credit for this research goes to the researchers of this project.

The post Differential Transformer: A Foundation Architecture for Large Language Models that Reduces Attention Noise and Achieves Significant Gains in Efficiency and Accuracy appeared first on MarkTechPost.

AutoArena: An Open-Source AI Tool that Automates Head-to-Head Evaluati …

Evaluating generative AI systems can be a complex and resource-intensive process. As the landscape of generative models evolves rapidly, organizations, researchers, and developers face significant challenges in systematically evaluating different models, including LLMs (Large Language Models), retrieval-augmented generation (RAG) setups, or even variations in prompt engineering. Traditional methods for evaluating these systems can be cumbersome, time-consuming, and highly subjective, especially when comparing the nuances of outputs across models. These challenges result in slower iteration cycles and increased cost, often hampering innovation. To address these issues, Kolena AI has introduced a new tool called AutoArena—a solution designed to automate the evaluation of generative AI systems effectively and consistently.

Overview of AutoArena

AutoArena is specifically developed to provide an efficient solution for evaluating the comparative strengths and weaknesses of generative AI models. It allows users to perform head-to-head evaluations of different models using LLM judges, thus making the evaluation process more objective and scalable. By automating the process of model comparison and ranking, AutoArena accelerates decision-making and helps identify the best model for any specific task. The open-source nature of the tool also opens it up for contributions and refinements from a broad community of developers, enhancing its capability over time.

Features and Technical Details

AutoArena has a streamlined and user-friendly interface designed for both technical and non-technical users. The tool automates head-to-head comparisons between generative AI models—be it LLMs, different RAG configurations, or prompt tweaks—using LLM judges. These judges are capable of evaluating various outputs based on pre-set criteria, removing the need for manual evaluations, which are both labor-intensive and prone to bias. AutoArena allows users to set up their desired evaluation tasks easily and then leverages LLMs to provide consistent and replicable evaluations. This automation significantly reduces the cost and human effort typically required for such tasks while ensuring that each model is objectively assessed under the same conditions. AutoArena also provides visualization features to help users interpret the evaluation results, thus offering clear and actionable insights.
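As a generic illustration of head-to-head judging (not AutoArena's actual API), the sketch below shows two candidate responses being compared by a judge for each pair of systems, with wins tallied into a ranking. The judge_with_llm function is a placeholder for a real call to whichever model serves as the judge.

from collections import Counter
from itertools import combinations

def judge_with_llm(prompt: str, answer_a: str, answer_b: str) -> str:
    # Placeholder: a real implementation would send a judging prompt to an LLM
    # and parse its verdict. A toy heuristic stands in for that call here.
    return "A" if len(answer_a) >= len(answer_b) else "B"

def head_to_head(prompt: str, outputs: dict[str, str]) -> Counter:
    wins: Counter = Counter()
    for model_a, model_b in combinations(outputs, 2):
        verdict = judge_with_llm(prompt, outputs[model_a], outputs[model_b])
        wins[model_a if verdict == "A" else model_b] += 1
    return wins

outputs = {
    "model-x": "Short answer.",
    "model-y": "A longer, more detailed answer with citations.",
    "model-z": "A medium-length answer.",
}
print(head_to_head("Summarize our refund policy.", outputs).most_common())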

One of the major reasons why AutoArena is important lies in its potential to streamline the evaluation process and bring consistency to it. Evaluating generative AI models often involves a level of subjectivity that can lead to variability in results—AutoArena addresses this issue by using standardized LLM judges to assess model quality consistently. By doing so, it provides a structured evaluation framework that minimizes bias and subjective variations that typically affect evaluations. This consistency is crucial for organizations that need to benchmark multiple models before deploying AI solutions. Furthermore, the open-source nature of AutoArena fosters transparency and community-driven innovation, allowing researchers and developers to contribute and adapt the tool to evolving requirements in the AI space. As AI becomes increasingly integral to various industries, the need for reliable benchmarking tools like AutoArena becomes essential for building trustworthy AI systems.

Conclusion

In conclusion, AutoArena by Kolena AI represents a significant advancement in the automation of generative AI evaluations. The tool addresses the challenges of labor-intensive and subjective evaluations by introducing an automated, scalable approach that utilizes LLM judges. Its capabilities are not only beneficial for researchers and organizations seeking objective assessments but also for the broader community contributing to its open-source development. By facilitating a streamlined evaluation process, AutoArena helps accelerate innovation in generative AI, ultimately enabling more informed decision-making and improving the quality of AI systems being developed.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.

The post AutoArena: An Open-Source AI Tool that Automates Head-to-Head Evaluations Using LLM Judges to Rank GenAI Systems appeared first on MarkTechPost.

ZODIAC: Bridging LLMs and Cardiological Diagnostics for Enhanced Clini …

LLMs are advancing healthcare by offering new possibilities in clinical support, especially through tools like Microsoft’s BioGPT and Google’s Med-PaLM. Despite these innovations, LLMs in healthcare face a significant challenge: aligning with the professionalism and precision required for real-world diagnostics. This gap is particularly crucial under FDA regulations for Software-as-a-Medical-Device (SaMD), where LLMs must demonstrate specialized expertise. Current models, designed for general tasks, often fall short of the clinical standards required for life-critical healthcare environments, making their professional integration an ongoing challenge.

LLMs have advanced in processing unstructured medical data. However, concerns about their domain-specific expertise in critical clinical settings must be addressed. Recent work, like ZODIAC, aims to address these limitations by focusing on cardiological diagnostics. Multi-agent frameworks, widely used in healthcare for managing complex workflows, show promise in optimizing tasks like patient care coordination. However, cardiological diagnostic systems have mostly relied on rule-based or single-agent models, with deep learning models making recent strides. Incorporating LLMs into cardiology remains an underexplored area that this work seeks to advance.

Researchers from ZBeats Inc., New York University, and other institutions present ZODIAC, an LLM-powered system designed to achieve cardiologist-level professionalism in cardiological diagnostics. ZODIAC assists by extracting key patient data, detecting arrhythmias, and generating preliminary reports for expert review. Built on a multi-agent framework, ZODIAC processes multimodal data and is fine-tuned with real-world, cardiologist-verified inputs. Rigorous clinical validation shows ZODIAC outperforms leading models like GPT-4o and BioGPT. Successfully integrated into electrocardiography devices, ZODIAC sets a new standard for aligning LLMs with SaMD regulations, ensuring safety and accuracy in medical practice.

The ZODIAC framework is designed for cardiologist-level diagnostics using a multi-agent system that processes multimodal patient data. It collects biostatistics, tabular metrics, and ECG tracings, which different agents analyze. One agent interprets tabular metrics, while another evaluates ECG images, generating clinical findings. A third agent synthesizes these findings with clinical guidelines to create a diagnostic report. The process, validated by cardiologists, aligns with real-world medical practices and adheres to regulatory standards for SaMD, ensuring professional accuracy and compliance during hospital deployments.
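The paper does not ship code, but the agent hand-off described above can be pictured with a small conceptual sketch. Every function below is a placeholder standing in for an LLM agent, not ZODIAC's actual implementation.

def metrics_agent(tabular_metrics: dict) -> str:
    # An LLM agent would interpret the tabular metrics here.
    return f"Findings from tabular metrics: {tabular_metrics}"

def ecg_agent(ecg_image_path: str) -> str:
    # A multimodal LLM agent would analyze the ECG tracing image here.
    return f"Findings from ECG tracing at {ecg_image_path}"

def synthesis_agent(findings: list[str], guidelines: str) -> str:
    # A third agent combines the findings with clinical guidelines into a
    # preliminary report for cardiologist review.
    return f"Preliminary report (cross-checked against {guidelines}):\n" + "\n".join(findings)

findings = [metrics_agent({"heart_rate": "...", "qtc": "..."}), ecg_agent("lead_II.png")]
print(synthesis_agent(findings, guidelines="institutional arrhythmia guidelines"))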

The clinical validation experiments follow real-world settings, focusing on eight evaluation metrics. Five metrics assess clinical output quality, while three focus on security. Cardiologists were engaged to evaluate the ZODIAC framework, rating it on a scale of one to five using anonymized models to prevent bias. ZODIAC outperformed general and medical-specialist models, excelling in clinical professionalism and security. Subgroup analysis revealed ZODIAC’s consistent diagnostic performance across diverse populations. An ablation study confirmed the importance of fine-tuning and in-context learning, with ZODIAC also demonstrating high stability in repeated diagnostic outputs.

In conclusion, the study introduces ZODIAC, an advanced framework powered by LLMs for cardiology diagnostics, aimed at enhancing the collaboration between clinicians and LLMs. Utilizing cardiologist-validated data, ZODIAC employs instruction tuning, in-context learning, and fact-checking to deliver diagnoses comparable to human specialists. Clinical validation reveals ZODIAC’s superior performance across various patient demographics and arrhythmia types, outperforming leading models such as OpenAI’s GPT-4o and Microsoft’s BioGPT. The framework’s multi-agent collaboration processes diverse patient data, leading to accurate arrhythmia detection and preliminary report generation, marking a significant advancement in integrating LLMs into medical devices, including electrocardiography equipment.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post ZODIAC: Bridging LLMs and Cardiological Diagnostics for Enhanced Clinical Precision appeared first on MarkTechPost.

Unlock the knowledge in your Slack workspace with Slack connector for …

Amazon Q Business is a fully managed, generative AI-powered assistant that you can configure to answer questions, provide summaries, generate content, and complete tasks based on your enterprise data. Amazon Q Business offers over 40 built-in connectors to popular enterprise applications and document repositories, including Amazon Simple Storage Service (Amazon S3), Salesforce, Google Drive, Microsoft 365, ServiceNow, Gmail, Slack, Atlassian, and Zendesk and can help you create your generative AI solution with minimal configuration.
Nearly 100 thousand organizations use Slack to bring the right people together to securely collaborate with each other. A Slack workspace captures invaluable organizational knowledge in the form of the information that flows through it as the users communicate on it. Hence, it is valuable to make this knowledge quickly and securely available to the users.
In this post, we will demonstrate how to set up Slack connector for Amazon Q Business to sync communications from both public and private channels, reflective of user permissions. We will also guide you through the configurations needed on your Slack workspace. Additionally, you will learn how to configure the Amazon Q Business application and enable user authentication through AWS IAM Identity Center, which is a recommended service for managing a workforce’s access to AWS applications.
Data source overview
Amazon Q Business uses large language models (LLMs) to build a unified solution that connects multiple data sources. Typically, you’d need to use a natural language processing (NLP) technique called Retrieval Augmented Generation (RAG) for this. With RAG, generative AI enhances its responses by incorporating relevant information retrieved from a curated dataset. Amazon Q Business has a built-in managed RAG capability designed to reduce the undifferentiated heavy lifting involved in creating these systems. Typical of a RAG model, Amazon Q Business has two components: A retrieval component that retrieves relevant documents for the user query and a generation component that takes the query and the retrieved documents and then generates an answer to the query using an LLM.
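Amazon Q Business manages all of this for you, but the two RAG components are easy to picture with a toy sketch; the scoring function and the final LLM call below are illustrative placeholders only.

def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    # Toy lexical retriever: rank documents by word overlap with the query.
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(documents, key=score, reverse=True)[:top_k]

def generate_answer(query: str, context: list[str]) -> str:
    # In a real RAG system, an LLM consumes the query plus retrieved context;
    # here we only assemble the prompt that would be sent to it.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

docs = [
    "Notes from #customerwork on migrating Apache Kafka to Amazon MSK.",
    "Quarterly sales summary for the retail team.",
    "VPN setup guide for new employees.",
]
print(generate_answer("How do we migrate to Amazon MSK?", retrieve("migrate Kafka to Amazon MSK", docs)))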
A Slack workspace has multiple elements. It has public channels where workspace users can participate and private channels where only channel members can communicate with each other. Individuals can also directly communicate with each other in one-on-one conversations and in user groups. This communication is in the form of messages and threads of replies, with optional document attachments. Slack workspaces of active organizations are highly dynamic, with the content and collaboration evolving and growing in volume continuously.

The preceding figure shows the process flow of the solution. When you connect Amazon Q Business to a data source (in this case, Slack), what Amazon Q considers and crawls as a document varies by connector. For the Amazon Q Business Slack connector, each message, message attachment, and channel post is considered a single document. Slack conversation threads, which help you create organized discussions around specific messages, are also ingested as a single document each, regardless of the number of participants or messages they contain.
Amazon Q Business crawls access control list (ACL) information attached to a document (user and group information) from your Slack instance. This information can be used to filter chat responses to the user’s document access level. The Slack connector supports token-based authentication. This could be a Slack bot user OAuth token or Slack user OAuth token. See the Slack connector overview to get the list of entities that are extracted, supported filters, sync modes, and file types.
User IDs (_user_id) exist in Slack on messages and in channels where access permissions are set. They are mapped from user emails to the corresponding IDs in Slack.
To connect your data source connector to Amazon Q Business, you must give Amazon Q Business an IAM role that has the following permissions:

Permission to access the BatchPutDocument and BatchDeleteDocument operations to ingest documents.
Permission to access the User Store API operations to ingest user and group access control information from documents.
Permission to access your AWS Secrets Manager secret to authenticate your data source connector instance.
(Optional) If you’re using Amazon Virtual Private Cloud (Amazon VPC), permission to access your Amazon VPC.

Solution overview
In this solution, we will show you how to create a Slack workspace with users who perform various roles within the organization. We will then show you how to configure this workspace to define the set of scopes that are required by the Amazon Q Business Slack connector to index the user communication. This is followed by the configuration of the Amazon Q Business application and a Slack data source. Based on the configuration, when the data source is synchronized, the connector crawls and indexes the content from the workspace that was created on or after the configured crawl start date. The connector also collects and ingests ACL information for each indexed message and document. Thus, the search results of a query made by a user include results only from those documents that the user is authorized to read.
Prerequisites
To build the Amazon Q Business connector for Slack, you need the following:
In Slack:

Create a Slack bot user OAuth token or Slack user OAuth token. You can choose either token to connect Amazon Q Business to your Slack data source. See the Slack documentation on access tokens for more information.
Note your Slack workspace team ID from your Slack workspace main page URL. For example, https://app.slack.com/client/T0123456789/… where T0123456789 is the team ID.
Add the OAuth scopes and read permissions.

In your AWS account:

Create an AWS Identity and Access Management (IAM) role for your data source and, if using the Amazon Q Business API, note the ARN of the IAM role.
Store your Slack authentication credentials in an AWS Secrets Manager secret and, if using the Amazon Q Business API, note the ARN of the secret (a scripted example follows this list).
Enable and configure an IAM Identity Center instance. Amazon Q Business integrates with IAM Identity Center as a gateway to manage user access to your Amazon Q Business application. We recommend enabling and pre-configuring an Identity Center instance before you begin to create your Amazon Q Business application. Identity Center is the recommended AWS service for managing human user access to AWS resources. Amazon Q Business supports both organization and account level Identity Center instances. See Setting up for Amazon Q Business for more information.
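If you prefer to script the Secrets Manager step from the preceding list, the following is a minimal boto3 sketch. The secret name and JSON key are illustrative assumptions; align them with whatever your data source configuration expects.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Hypothetical secret name and key; the token value is the user OAuth token
# copied from your Slack app.
response = secretsmanager.create_secret(
    Name="AmazonQ-slack-connector-secret",
    SecretString=json.dumps({"slackToken": "xoxp-your-user-oauth-token"}),
)
print("Secret ARN:", response["ARN"])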

Configure your Slack workspace
You will create one user for each of the following roles: Administrator, Data Scientist, Database Administrator, Solutions Architect, and Generic.

User name | Role
arnav_desai | Admin
jane_doe | Data Scientist
pat_candella | DB Admin
mary_major | Solutions Architect
john_stiles | Generic User

To showcase ACL propagation, you will create three public channels, #general, #customerwork, and #random, that any member, including the Generic user, can access. You will also create one private channel, #anydepartment-project-private, that can be accessed only by the users arnav_desai, john_stiles, mary_major, and pat_candella.
To create a Slack app:

Navigate to the Slack API Your Apps page and choose Create New App.
Select From scratch. Give the Slack app a name, select the workspace to develop your app in, and then choose Create App.
After you’ve created your app, select it and navigate to Features and choose OAuth & Permissions.
Scroll down to Scopes > User Token Scopes and set the OAuth scope based on the user token scopes in Prerequisites for connecting Amazon Q Business to Slack.

Note: You can configure two types of scopes in a Slack workspace:

Bot token scope: The bot token crawls only the messages to which it has been explicitly added. It is used to grant restricted access to specific messages.
User token scope: Only the data shared with the member is accessible to the user token, which acts as a representative of a Slack user.

For this example, you will use the user token scope so that you can search the conversations between users.

After the OAuth scope for the user token has been set up as described in the Slack prerequisites, scroll up to the OAuth Tokens for Your Workspace section, choose Install to Workspace, and then choose Allow.
This will generate a user OAuth token. Copy this token to use when configuring the Amazon Q Business Slack connector.

Configure the data source using the Amazon Q Business Slack connector
In this section, you will create an Amazon Q Business application using the console.
To create an Amazon Q Business application

In the AWS Management Console for Amazon Q Business, choose Create Application.
Enter an Application Name, such as my-slack-workspace. Leave Service access as the default value, and select AWS IAM Identity Center for Access Management. Enter a new Tag value as required and choose Create to create the Amazon Q Business application.
Leave the default option of Use Native retriever selected for Retrievers, leave Enterprise as the Index provisioning and leave the default value of 1 as the Number of units. Each unit in Amazon Q Business index is 20,000 documents or 200 MB of extracted text (whichever comes first). Choose Next.
Scroll down the list of available connectors and select Slack and then choose Next.

Enter a Data source name and a Description to identify your data source and then enter the Slack workspace team ID to connect with Amazon Q Business.
In the Authentication section, select Create and add a new secret.
On the dialog box that appears, enter a Secret name followed by the User OAuth Slack token that was copied from the Slack workspace.
For the IAM role, select Create a new service role (Recommended).
In Sync scope, choose the following:

For select type of content to crawl, select All channels.
Select an appropriate date for Select crawl start date.
Leave the default value selected for Maximum file size as 50.
You can include specific Messages, such as bot messages or archived messages to sync.
Additionally, you can include up to 100 patterns to include or exclude filenames, types, or file paths to sync.

For Sync mode, leave Full sync selected and for the Sync run schedule, select Run on demand.
Leave the field mapping as is and choose Add data source.
On the next page, choose Next.

Add the five users you created earlier, who are a part of IAM Identity Center and the Slack workspace to the Amazon Q Business application. To add users to Identity Center, follow the instructions in Add users to your Identity Center directory. When done, choose Add groups and users and choose Assign.
When a user is added, each user is assigned the default Q Business Pro subscription. For more information on different pricing tiers, see the Amazon Q Business pricing page.
Choose Create application to finish creating the Amazon Q Business application.
After the application and the data source are created, select the data source and then choose Sync now to start syncing documents from your data source.
The sync process ingests the documents from your Slack workspace to your selections in the Slack connector configuration in Amazon Q Business. The following screenshot shows the results of a successful sync, indicated by the status of Completed.

Search with Amazon Q Business
Now, you’re ready to make a few queries in Amazon Q Business.
To search using Amazon Q Business:

Navigate to the Web experience settings tab and click on the Deployed URL.
For this demonstration, sign in as pat_candella who has the role of DB Admin.
Enter the password for pat_candella and choose Sign in
Upon successful sign-in, you will be signed in to Amazon Q Business.
In the Slack workspace, there is a public channel, the #customerwork channel that all users are members of. The #customerwork Slack channel is being used to communicate about an upcoming customer engagement, as shown in the following figure.
Post the first question to Amazon Q Business.

I am currently using Apache Kafka. Can you list high level steps involved in migration to Amazon MSK?

Note that the response includes citations that refer to the conversation as well as the content of the PDF that was attached to the conversation.
Security and privacy options with Slack data connector
Next, you will create a private channel called #anydepartment-project-private with four out of the five users—arnav_desai, john_stiles, mary_major and pat_candella—and verify that the messages exchanged in a private channel are not available to non-members like jane_doe. Note that after you create a new private channel, you need to manually re-run the sync on the data source.
The following screenshot shows the private Slack channel with four of the five users and the Slack conversation.
Testing security and privacy options with Slack data connector

While signed in as pat_candella, who is part of the private #anydepartment-project-private channel, execute the following query:

What is Amazon Kendra and which API do I use to query a Kendra index?

Now, sign in as jane_doe, who is not a member of the #anydepartment-project-private channel and execute the same query.
Amazon Q Business prevents jane_doe from getting insights from information within the private channels that they aren’t part of, based on the synced ACL information.

Indexing aggregated Slack threads
Slack organizes conversations into threads, which can involve multiple users and messages. The Amazon Q Business Slack connector treats each thread as a single document, regardless of the number of participants or messages it contains. This approach allows Amazon Q Business to ingest entire conversation threads as individual units, maximizing the amount of data that can be processed within a single index unit. As a result, you can efficiently incorporate more comprehensive conversational context into your Amazon Q Business system.
The figure that follows shows a conversation between pat_candella and jane_doe that includes six messages in a thread. The Slack connector aggregates this message thread as a single message, thus maximizing the use of an index unit.

Because the conversation thread is aggregated as a single document within the Amazon Q Business index, you can ask questions that pertain to a single conversation thread as shown in the following figure.

Troubleshooting the sync process

Why isn’t Amazon Q Business answering any of my questions?

If you aren’t getting answers to your questions from Amazon Q Business, verify the following:

Permissions – Document ACLs indexed by Amazon Q Business may not allow you to query certain data entities as demonstrated in our example. If this is the case, please reach out to your Slack workspace administrator to make sure that your user has access to required documents and repeat the sync process.
Data connector sync – A failed data source sync may prevent the documents from being indexed, meaning that Amazon Q Business would be unable to answer questions about the documents that failed to sync. Please refer to the official documentation to troubleshoot data source connectors.

I’m receiving access errors on Amazon Q Business application. What causes this?

See Troubleshooting Amazon Q Business identity and access to diagnose and fix common issues that you might encounter when working with Amazon Q and IAM.

How can I sync documents without ACLs?

Amazon Q Business supports crawling ACLs for document security by default. Turning off ACLs and identity crawling is no longer supported. If you want to index documents without ACLs, ensure that the documents are marked as public in your data source. Please refer to the official documentation, How the Amazon Q Business connector crawls Slack ACLs.

My connector is unable to sync. How can I monitor data source sync progress?

Amazon Q Business provides visibility into the data sync operations. Learn more about this feature in the AWS Machine Learning blog.
Additionally, as the sync process runs, you can monitor progress or debug failures by monitoring the Amazon CloudWatch logs that can be accessed from the Details section of the Sync run history.
A sample query to determine which documents or messages were indexed from a specific Slack channel, C12AB34578, with a logStream of SYNC_RUN_HISTORY_REPORT/xxxxxxxxxxxxxxxxxxxxxxxx, would look like the following:

fields LogLevel, DocumentId, DocumentTitle, CrawlAction, ConnectorDocumentStatus.Status as ConnectorDocumentStatus, ErrorMsg, CrawlStatus.Status as CrawlStatus, SyncStatus.Status as SyncStatus, IndexStatus.Status as IndexStatus, SourceUri, Acl, Metadata, HashedDocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/xxxxxxxxxxxxxxxxxxxxxxxx' and Metadata like /"stringValue":"C12AB34578"/
| sort @timestamp desc
| limit 10000

Choosing Run query displays the list of messages as the Amazon Q Business Index sync runs, as shown in the following figure.

Cleanup
To delete an Amazon Q Business application, you can use the console or the DeleteApplication API operation.
To delete an Amazon Q Business application using the console

Sign in to the Amazon Q Business console.
Select the respective Amazon Q Business application and choose Delete.
In the dialog box that opens, enter Delete to confirm deletion, and then choose Delete.
You are returned to the service console while your application is deleted. When the deletion process is complete, the console displays a message confirming successful deletion.
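If you prefer the DeleteApplication API route mentioned earlier, a minimal boto3 sketch follows, assuming a recent boto3 version that includes the qbusiness client; the application ID is a placeholder.

import boto3

qbusiness = boto3.client("qbusiness")

# Replace with the ID of the Amazon Q Business application you want to remove.
qbusiness.delete_application(applicationId="your-application-id")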

To delete the IAM Identity Center instance, see Delete your IAM Identity Center instance.
Conclusion
This blog post provides a step-by-step guide on setting up the Slack connector for Amazon Q Business, enabling you to seamlessly integrate data from your Slack workspace. Moreover, we highlighted the importance of data privacy and security, demonstrating how the connector adheres to the ACLs within your Slack workspace. This feature helps ensure that private channel conversations remain confidential and inaccessible to individuals who aren’t members of those channels. By following these steps and understanding the built-in security measures, you can use the power of Amazon Q Business while maintaining the integrity and privacy of your Slack workspace.
To learn more about the Amazon Q Business connector for Slack, see Connecting Slack to Amazon Q Business. You can automate all of the showcased console operations through the Amazon Q Business APIs, the AWS CLI, and other applicable AWS SDKs.
If you want to converse with Amazon Q Business through Slack direct messages (DMs), for example to ask questions and get answers based on company data, get help drafting content such as emails, summarize attached files, or perform tasks, see Deploy a Slack gateway for Amazon Q to learn how to bring Amazon Q, your business expert, to users in Slack.

About the Authors
Akshara Shah is a Senior Solutions Architect at Amazon Web Services. She provides strategic technical guidance to help customers design and build cloud solutions. She is currently focused on machine learning and AI technologies.
Roshan Thomas is a Senior Solutions Architect at Amazon Web Services. He is based in Melbourne, Australia and works closely with enterprise customers to accelerate their journey in the cloud. He is passionate about technology and helping customers architect and build solutions on AWS.

Transitioning off Amazon Lookout for Metrics 

Amazon Lookout for Metrics is a fully managed service that uses machine learning (ML) to detect anomalies in virtually any time-series business or operational metrics—such as revenue performance, purchase transactions, and customer acquisition and retention rates—with no ML experience required. The service, which was launched in March 2021, predates several popular AWS offerings that have anomaly detection, such as Amazon OpenSearch, Amazon CloudWatch, AWS Glue Data Quality, Amazon Redshift ML, and Amazon QuickSight.
After careful consideration, we have made the decision to end support for Amazon Lookout for Metrics, effective October 10, 2025. In addition, as of today, new customer sign-ups are no longer available. Existing customers will be able to use the service as usual until October 10, 2025, when we will end support for Amazon Lookout for Metrics.
In this post, we provide an overview of the alternate AWS services that offer anomaly detection capabilities for customers to consider transitioning their workloads to.
AWS services with anomaly detection capabilities
We recommend customers use Amazon OpenSearch, Amazon CloudWatch, Amazon Redshift ML, Amazon QuickSight, or AWS Glue Data Quality services for their anomaly detection use cases as an alternative to Amazon Lookout for Metrics. These AWS services offer generally available, ML-powered anomaly detection capabilities that can be used out of the box without requiring any ML expertise. Following is a brief overview of each service.
Using Amazon OpenSearch for anomaly detection
Amazon OpenSearch Service features a highly performant, integrated anomaly detection engine that enables the real-time identification of anomalies in streaming data as well as in historical data. You can pair anomaly detection with built-in alerting in OpenSearch to send notifications when there is an anomaly. To start using OpenSearch for anomaly detection, you first must index your data into OpenSearch; from there, you can enable anomaly detection in OpenSearch Dashboards. To learn more, see the documentation.
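As an illustration, an anomaly detector can also be created programmatically through the OpenSearch Anomaly Detection plugin's REST API. The sketch below is a hedged example: the endpoint path follows the plugin's documented convention, while the domain, index, field names, and credentials are placeholders for your own cluster.

import requests

detector = {
    "name": "orders-detector",
    "description": "Detect anomalies in order counts",
    "time_field": "timestamp",
    "indices": ["orders-*"],
    "feature_attributes": [{
        "feature_name": "order_count",
        "feature_enabled": True,
        "aggregation_query": {"order_count": {"sum": {"field": "orders"}}},
    }],
    "detection_interval": {"period": {"interval": 10, "unit": "Minutes"}},
}

resp = requests.post(
    "https://your-opensearch-domain/_plugins/_anomaly_detection/detectors",
    json=detector,
    auth=("admin", "your-password"),  # placeholder credentials
    timeout=30,
)
print(resp.json())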
Using Amazon CloudWatch for anomaly detection
Amazon CloudWatch supports creating anomaly detectors on specific Amazon CloudWatch Log Groups by applying statistical and ML algorithms to CloudWatch metrics. Anomaly detection alarms can be created based on a metric’s expected value. These types of alarms don’t have a static threshold for determining alarm state. Instead, they compare the metric’s value to the expected value based on the anomaly detection model. To start using CloudWatch anomaly detection, you first must ingest data into CloudWatch and then enable anomaly detection on the log group.
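For example, the expected-value alarms described above can be created with a few lines of boto3 using the ANOMALY_DETECTION_BAND metric math expression; the namespace, metric name, and alarm name below are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the metric rises above the upper bound of a band two standard
# deviations wide around the model's expected value.
cloudwatch.put_metric_alarm(
    AlarmName="orders-anomaly-alarm",  # placeholder
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=2,
    ThresholdMetricId="ad1",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {"Namespace": "MyApp", "MetricName": "OrdersPerMinute"},  # placeholders
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": True,
        },
        {"Id": "ad1", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},
    ],
)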
Using Amazon Redshift ML for anomaly detection
Amazon Redshift ML makes it easy to create, train, and apply machine learning models using familiar SQL commands in Amazon Redshift data warehouses. Anomaly detection can be done on your analytics data through Redshift ML by using the included XGBoost model type, local models, or remote models with Amazon SageMaker. With Redshift ML, you don’t have to be a machine learning expert and you pay only for the training cost of the SageMaker models. There are no additional costs to using Redshift ML for anomaly detection. To learn more, see the documentation.
Using Amazon QuickSight for anomaly detection
Amazon QuickSight is a fast, cloud-powered, business intelligence service that delivers insights to everyone in the organization. As a fully managed service, QuickSight lets customers create and publish interactive dashboards that include ML insights. QuickSight supports a highly performant, integrated anomaly detection engine that uses proven Amazon technology to continuously run ML-powered anomaly detection across millions of metrics to discover hidden trends and outliers in customers’ data. This tool allows customers to get deep insights that are often buried in the aggregates and not scalable with manual analysis. With ML-powered anomaly detection, customers can find outliers in their data without the need for manual analysis, custom development, or ML domain expertise. To learn more, see the documentation.
Using AWS Glue Data Quality for anomaly detection
Data engineers and analysts can use AWS Glue Data Quality to measure and monitor their data. AWS Glue Data Quality uses a rule-based approach that works well for known data patterns and offers ML-based recommendations to help you get started. You can review the recommendations and augment rules from over 25 included data quality rules. To capture unanticipated, less obvious data patterns, you can enable anomaly detection. To use this feature, you can write rules or analyzers and then turn on anomaly detection in AWS Glue ETL. AWS Glue Data Quality collects statistics for columns specified in rules and analyzers, applies ML algorithms to detect anomalies, and generates visual observations explaining the detected issues. Customers can use recommended rules to capture the anomalous patterns and provide feedback to tune the ML model for more accurate detection. To learn more, see the blog post, watch the introductory video, or see the documentation.
Using Amazon SageMaker Canvas for anomaly detection (a beta feature)
The Amazon SageMaker Canvas team plans to provide support for anomaly detection use cases in Amazon SageMaker Canvas. We’ve created an AWS CloudFormation template-based solution to give customers early access to the underlying anomaly detection feature. Customers can use the CloudFormation template to bring up an application stack that receives time-series data from an Amazon Managed Streaming for Apache Kafka (Amazon MSK) streaming source and performs near-real-time anomaly detection in the streaming data. To learn more about the beta offering, see Anomaly detection in streaming time series data with online learning using Amazon Managed Service for Apache Flink.
Frequently asked questions

What is the cutoff point for current customers?

We created an allow list of account IDs that have used Amazon Lookout for Metrics in the last 30 days and have active Amazon Lookout for Metrics resources, including detectors, within the service. If you are an existing customer and are having difficulties using the service, please reach out to us via AWS Customer Support for help.

How will access change before the sunset date?

Current customers can continue to do everything they could previously. The only change is that accounts that are not existing customers can no longer create new resources in Amazon Lookout for Metrics.

What happens to my Amazon Lookout for Metrics resources after the sunset date?

After October 10, 2025, all references to Amazon Lookout for Metrics models and resources will be deleted from Amazon Lookout for Metrics. You will not be able to discover or access Amazon Lookout for Metrics from your AWS Management Console, and applications that call the Amazon Lookout for Metrics API will no longer work.

Will I be billed for Amazon Lookout for Metrics resources remaining in my account after October 10, 2025?

Resources created by Amazon Lookout for Metrics internally will be deleted after October 10, 2025. Customers will be responsible for deleting the input data sources created by them, such as Amazon Simple Storage Service (Amazon S3) buckets, Amazon Redshift clusters, and so on.

How do I delete my Amazon Lookout for Metrics resources?

Open the Amazon Lookout for Metrics console and choose Detectors.
Choose the detector from the list.
Choose Delete.
Repeat these steps for every detector.

How can I export anomalies data before deleting the resources?

Anomalies data for each measure of a particular detector can be downloaded by using the Amazon Lookout for Metrics APIs. Exporting Anomalies explains how to connect to a detector, query for anomalies, and download them in a format for later use.
Conclusion
In this blog post, we have outlined methods to create anomaly detectors using alternatives such as Amazon OpenSearch, Amazon CloudWatch, and a CloudFormation template-based solution.
Resource links:

Anomaly detection using Amazon OpenSearch: Create an anomaly detector, configure the model, set up detector jobs, and observe the results using Amazon OpenSearch.
Anomaly detection using Amazon CloudWatch: Explore Amazon CloudWatch anomaly detection and set it up using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or AWS CloudFormation.
Create a CloudWatch alarm based on anomaly detection: Create a CloudWatch alarm based on anomaly detection and modify or delete an anomaly detection model.
Anomaly detection in AWS Glue Data Quality: Detect unanticipated issues with your data using powerful ML-based anomaly detection algorithms. Use AWS Glue Data Quality to understand the anomaly and provide feedback to tune the ML model for accurate detection.

About the Author
Nirmal Kumar is Sr. Product Manager for the Amazon SageMaker service. Committed to broadening access to AI/ML, he steers the development of no-code and low-code ML solutions. Outside work, he enjoys travelling and reading non-fiction.

Hex-LLM: A New LLM Serving Framework Designed for Efficiently Serving …

In the rapidly evolving world of artificial intelligence, large language models (LLMs) have become essential tools for a variety of applications, ranging from natural language understanding to content generation. While the capabilities of these models continue to expand, efficiently serving and deploying them remains a challenge, particularly when it comes to balancing cost, throughput, and latency. Recent advancements by Google and the introduction of Hex-LLM, a specialized serving framework, offer promising solutions for efficiently deploying open LLMs from Hugging Face on Google TPUs.

Hex-LLM: A Game-Changer for Serving Open LLMs on TPUs

Hex-LLM is Vertex AI’s in-house LLM serving framework that is designed and optimized for Google’s Cloud TPU hardware, which is available as part of AI Hypercomputer. It provides a high-performance, low-cost solution for deploying open-source models from Hugging Face. Developed to address the challenges of serving large models at scale, Hex-LLM stands out due to its advanced optimization techniques, which allow it to handle significant workloads with impressive efficiency.

Key Features and Innovations of Hex-LLM

To efficiently serve LLMs on TPUs, Hex-LLM integrates a variety of key features and optimization techniques, which significantly enhance performance:

Token-Based Continuous Batching: One of the standout features of Hex-LLM is token-based continuous batching. This method allows for efficient utilization of TPU resources by processing incoming tokens in a continuous stream. By handling requests in this manner, Hex-LLM maximizes throughput, significantly reducing the cost per token served. This approach ensures that no TPU cycles are wasted, resulting in an overall boost in efficiency.

XLA-Optimized PagedAttention Kernels: Hex-LLM employs XLA (Accelerated Linear Algebra) optimized PagedAttention kernels, which are crucial for managing the attention mechanism of transformer models. These kernels are tailored to exploit the full potential of TPU hardware, minimizing the latency and computational load associated with the attention calculations. By leveraging XLA-optimized kernels, Hex-LLM achieves low-latency inference, which is essential for applications requiring real-time or near-real-time responses.

Tensor Parallelism: Another critical feature of Hex-LLM is tensor parallelism, which enables the distribution of model computations across multiple TPU cores. This parallelism is particularly beneficial for serving large models like Llama 2 70B, as it allows for the workload to be split effectively, ensuring that the TPUs operate at peak efficiency without being bottlenecked by single-threaded tasks.

Dynamic LoRA Adapters and Quantization: Hex-LLM supports the use of dynamic Low-Rank Adaptation (LoRA) adapters, which offer a flexible way to fine-tune models for specific tasks without retraining the entire model. Additionally, Hex-LLM supports quantization techniques, including BNB (bitsandbytes) and AWQ (Activation-aware Weight Quantization), allowing models to run at lower precision, thereby reducing memory usage and increasing inference speed without compromising performance.

Integration with Hugging Face Hub

Hex-LLM integrates directly with the Hugging Face Hub, allowing developers to easily load and serve models from the extensive library of open LLMs available. This seamless integration simplifies the process of deploying models on Google TPUs, making it more accessible for those who may not have extensive experience with TPU infrastructure. By directly pulling models from Hugging Face, users can quickly experiment with different LLMs and deploy them in production environments without the need for extensive manual configuration.

Performance Metrics: Speed and Cost

The performance of Hex-LLM is impressive, particularly when serving large models. For instance, Hex-LLM achieves a throughput of 1510 output tokens per second for Llama 2 70B in int8 precision on a single TPU v5e-8, with an approximate cost of $9.60 per hour. This translates to a latency of 26 milliseconds per token, which is remarkable for a model of this size. These metrics demonstrate that Hex-LLM is not only capable of serving large models with high efficiency but also does so at a cost that is feasible for many applications.
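A quick back-of-the-envelope calculation using only the figures quoted above shows what that throughput and hourly price imply per token; actual pricing may of course differ.

# Derived purely from the quoted numbers: 1510 output tokens/s on a TPU v5e-8
# at roughly $9.60 per hour.
tokens_per_second = 1510
dollars_per_hour = 9.60

tokens_per_hour = tokens_per_second * 3600
cost_per_million_tokens = dollars_per_hour / tokens_per_hour * 1_000_000
print(f"{tokens_per_hour:,} output tokens per hour")                 # 5,436,000
print(f"${cost_per_million_tokens:.2f} per million output tokens")   # about $1.77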

Availability in Vertex AI Model Garden

Hex-LLM is available as part of the Vertex AI Model Garden, a platform that offers a wide variety of pre-trained models and tools for machine learning. By including Hex-LLM in the Model Garden, Google provides users with a straightforward way to access and deploy open LLMs on TPUs, complete with the optimizations offered by the Hex-LLM framework. This availability ensures that users can leverage the power of TPUs for LLM deployment without needing to set up the infrastructure from scratch.

Conclusion

Hex-LLM represents a significant step forward in the efficient serving of open LLMs, particularly for users looking to deploy large models on Google TPUs. With features like token-based continuous batching, XLA-optimized PagedAttention kernels, tensor parallelism, and direct integration with Hugging Face, Hex-LLM offers a powerful and cost-effective solution for LLM deployment. While its current status as a closed-source framework may limit its accessibility, the performance gains and cost reductions it provides make it an attractive option for organizations seeking to leverage the power of large language models in their applications.

Check out the details and the LinkedIn post. All credit for this research goes to the researchers of this project.

The post Hex-LLM: A New LLM Serving Framework Designed for Efficiently Serving Open LLMs on Google Cloud TPUs appeared first on MarkTechPost.

Evaluating the Planning Capabilities of Large Language Models: Feasibi …

New developments in Large Language Models (LLMs) have shown how well these models perform sophisticated reasoning tasks like coding, language comprehension, and math problem-solving. However, there is less information about how effectively these models work in terms of planning, especially in situations where a goal must be attained through a sequence of interconnected actions. Because planning frequently calls for models to comprehend constraints, manage sequential decisions, function in dynamic contexts, and retain recollection of previous activities, it is a more difficult topic for LLMs to handle.

In recent research, a team of researchers from the University of Texas at Austin assessed the planning capabilities of OpenAI’s o1 model, a newcomer to the LLM field created with improved reasoning capabilities. The study tested the model’s performance along three primary dimensions: feasibility, optimality, and generalizability, using a variety of benchmark tasks.

Feasibility refers to the model’s ability to produce a plan that can actually be carried out and that complies with the requirements and constraints of the task. For instance, tasks in settings like Barman and Tyreworld are heavily constrained, requiring resources or actions to be used in a specified order, and failing to follow these constraints makes the plan infeasible. In this regard, the o1-preview model demonstrated notable strengths, especially in its capacity to self-evaluate its plans and adhere to task-specific constraints. This capacity for self-evaluation increases its likelihood of success by enabling it to more accurately determine whether the steps it generates comply with the task’s requirements.

While producing workable plans is a vital first step, optimality, meaning how efficiently the model completes the task, is also essential. In many real-world scenarios, finding a solution alone is not enough; the solution must also be efficient in the time, resources, and number of steps it requires. The study found that although the o1-preview model outperformed GPT-4 at respecting constraints, it frequently produced suboptimal plans, often including pointless or redundant actions that led to inefficient solutions.

For example, in environments like Floortile and Grippers, which demand strong spatial reasoning and careful task sequencing, the model’s answers were workable but included needless repetitions that a more optimized approach would have avoided.

Generalization is the capacity of a model to apply learned planning techniques to novel or unfamiliar problems for which it has not received explicit training. This is a crucial requirement in real-world applications, since tasks are frequently dynamic and call for flexible, adaptive planning. The o1-preview model had trouble generalizing in spatially complex environments like Termes, where tasks involve managing 3D spaces or many interacting objects. Its performance declined sharply on new, spatially dynamic tasks, even though it could maintain structure in more familiar settings.

The study’s findings demonstrate the o1-preview model’s strengths and weaknesses with respect to planning. On the one hand, its advantages over GPT-4 are evident in its ability to adhere to constraints, manage state transitions, and assess the feasibility of its own plans, which makes it more dependable in structured settings where rule-following is essential. On the other hand, the model still has substantial limitations in decision-making and memory management. For tasks requiring strong spatial reasoning in particular, o1-preview often produces suboptimal plans and has difficulty generalizing to unfamiliar environments.

This pilot study lays the groundwork for future research aimed at overcoming the stated limitations of LLMs in planning tasks. The crucial areas in need of development are as follows:

Memory Management: Improving the model’s capacity to remember and make effective use of previous actions could reduce unnecessary steps and increase efficiency.

Decision-Making: More work is required to improve the sequential decisions made by LLMs, making sure that each action advances the model towards the objective in the best possible way.

Generalization: Improving abstract thinking and generalization methods could improve LLM performance in unique situations, especially those involving symbolic reasoning or spatial complexity.

Check out the Paper. All credit for this research goes to the researchers of this project.


Researchers at Stanford University Introduce Tutor CoPilot: A Human-AI Collaborative System that Significantly Improves Real-Time Tutoring Quality for Students

Integrating Artificial Intelligence (AI) tools in education has shown great potential to enhance teaching methods and learning experiences, especially where access to experienced educators is limited. One prominent AI-based approach is using Language Models (LMs) to support tutors in real time. Such systems can provide expert-like suggestions that help tutors improve student engagement and performance. By equipping novice educators with real-time guidance, AI tools have the potential to bridge the expertise gap in education and create a more equitable learning environment. This is particularly crucial in classrooms with diverse student abilities and educational backgrounds.

The fundamental problem in education is the high cost and limited scalability of traditional tutor training programs. Comprehensive professional development sessions can cost up to $3,300 per teacher annually, making it challenging for schools with tight budgets to offer quality training. These programs often require tutors to invest significant time outside their teaching hours, making them impractical for part-time educators. Also, many professional development programs are not aligned with the specific needs of novice tutors, which means they fail to address the dynamic, real-time challenges faced during live tutoring sessions. Consequently, many tutors develop their skills on the job, leading to inconsistent teaching quality and missed student learning opportunities.

Educators have relied on professional development workshops and training seminars to improve their skills. However, these methods are not always effective due to their static nature, which doesn’t cater to the real-time interaction needs of teachers. To address this, some educators have tried using online forums and support networks, but these lack the structured feedback necessary for professional growth. Also, adapting generic training programs for specific educational settings remains challenging, and many tutors, particularly those working in under-served communities, find it difficult to implement these strategies effectively.

Researchers from Stanford University developed Tutor CoPilot, a human-AI collaborative system designed to provide real-time guidance to tutors during live tutoring sessions. Tutor CoPilot aims to replicate expert educators’ decision-making process by providing actionable and context-specific expert-like suggestions. The system uses think-aloud protocols captured from experienced tutors to train the AI model to deliver feedback in real-time. This innovative approach enables less experienced tutors to deliver high-quality instruction that closely aligns with best practices in teaching.

Tutor CoPilot works by embedding itself within a virtual tutoring platform, where tutors can activate it during sessions for immediate assistance. The AI system then analyzes the conversation context and the lesson topic to offer suggestions that the tutor can implement instantly. Suggestions include asking guiding questions to encourage student reasoning, providing hints to support problem-solving, and affirming correct responses. Tutor CoPilot also allows tutors to personalize these suggestions, making it easy to adapt them to the unique needs of each student. The platform includes a safety mechanism that de-identifies student and tutor names, ensuring user privacy during interactions.
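As a rough illustration of the de-identification idea (not the actual Tutor CoPilot implementation, whose internals are not described here), a simple approach replaces known participant names with neutral placeholders before any text reaches the language model:

import re

def deidentify(message: str, tutor_name: str, student_name: str) -> str:
    # Replace the known tutor and student names with neutral placeholder tokens.
    message = re.sub(re.escape(tutor_name), "[tutor]", message, flags=re.IGNORECASE)
    message = re.sub(re.escape(student_name), "[student]", message, flags=re.IGNORECASE)
    return message

print(deidentify("Great job, Maria! Mr. Lee will check your work.", "Mr. Lee", "Maria"))
# -> "Great job, [student]! [tutor] will check your work."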

The performance of Tutor CoPilot was tested in a large-scale, randomized, controlled trial involving 900 tutors and 1,800 students from Title I schools. The results were significant: students working with tutors who used Tutor CoPilot were four percentage points more likely to master mathematics topics than the control group, where only 62% of students achieved mastery. Interestingly, the positive impact was even greater for tutors initially rated less effective. For these tutors, the mastery rate increased by nine percentage points, closing the gap between less experienced and more experienced educators. The study also found that Tutor CoPilot costs only $20 per tutor annually, making it a cost-effective alternative to traditional training programs.

Key findings revealed that Tutor CoPilot frequently encouraged tutors to employ high-quality pedagogical strategies. For example, tutors using the system were more likely to prompt students to explain their reasoning, use guiding questions to promote deeper understanding, and avoid simply giving away the answers. Such strategies are aligned with best practices in effective teaching and have been shown to significantly improve student outcomes. Interviews with tutors also indicated that they found the system helpful in breaking down complex concepts, though the tool occasionally offered suggestions that were not appropriate for the student’s grade level.

Key Takeaways from the research on Tutor CoPilot:

The study involved 900 tutors and 1,800 K-12 students from under-served communities.

Students working with Tutor CoPilot were four percentage points more likely to achieve topic mastery.

Tutors rated as less effective showed the most improvement, with their students’ mastery rates increasing by nine percentage points.

Tutor CoPilot costs only $20 per tutor annually, compared to traditional training programs, which cost over $3,300 per teacher.

The system encourages using high-quality teaching strategies, such as prompting students to explain their reasoning and asking guiding questions.

In conclusion, the study’s results show that integrating human-AI collaborative systems like Tutor CoPilot in education can significantly improve the quality of teaching, particularly in underserved communities. The research team demonstrated that Tutor CoPilot enhances novice tutors’ effectiveness and provides a scalable solution for improving educational outcomes across diverse student populations. At a fraction of the cost of traditional training programs, Tutor CoPilot offers a promising pathway for making high-quality education accessible to all students.

Check out the Paper. All credit for this research goes to the researchers of this project.


Efficient Pre-training of Llama 3-like model architectures using torchtitan on Amazon SageMaker

This post is co-written with Less Wright and Wei Feng from Meta
Pre-training large language models (LLMs) is the first step in developing powerful AI systems that can understand and generate human-like text. By exposing models to vast amounts of diverse data, pre-training lays the groundwork for LLMs to learn general language patterns, world knowledge, and reasoning capabilities. This foundational process enables LLMs to perform a wide range of tasks without task-specific training, making them highly versatile and adaptable. Pre-training is essential for building a strong base of knowledge, which can then be refined and specialized through fine-tuning, transfer learning, or few-shot learning approaches.
In this post, we collaborate with the team working on PyTorch at Meta to showcase how the torchtitan library accelerates and simplifies the pre-training of Meta Llama 3-like model architectures. We showcase the key features and capabilities of torchtitan such as FSDP2, torch.compile integration, and FP8 support that optimize the training efficiency. We pre-train a Meta Llama 3 8B model architecture using torchtitan on Amazon SageMaker on p5.48xlarge instances, each equipped with 8 Nvidia H100 GPUs. We demonstrate a 38.23% performance speedup in the training throughput compared to the baseline without applying the optimizations (as shown in the following figure). Amazon SageMaker Model Training reduces the time and cost to train and tune machine learning (ML) models at scale without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available, and SageMaker can automatically scale infrastructure up or down, from one to thousands of GPUs.

To learn more, you can find our complete code sample on GitHub.
Introduction to torchtitan
torchtitan is a reference architecture for large-scale LLM training using native PyTorch. It aims to showcase PyTorch’s latest distributed training features in a clean, minimal code base. The library is designed to be simple to understand, use, and extend for different training purposes, with minimal changes required to the model code when applying various parallel processing techniques.
torchtitan offers several key features, including FSDP2 with per-parameter sharding, tensor parallel processing, selective layer and operator activation checkpointing, and distributed checkpointing. It supports pre-training of Meta Llama 3-like and Llama 2-like model architectures of various sizes and includes configurations for multiple datasets. The library provides straightforward configuration through TOML files and offers performance monitoring through TensorBoard. In the following sections, we highlight some of the key features of torchtitan.
Transitioning from FSDP1 to FSDP2
FSDP1 and FSDP2 are two approaches to fully sharded data parallel training. FSDP1 uses flat-parameter sharding, which flattens all parameters to 1D, concatenates them into a single tensor, pads it, and then chunks it across workers. This method offers bounded padding and efficient unsharded storage, but might not always allow optimal sharding for individual parameters. FSDP2, on the other hand, represents sharded parameters as DTensors sharded on dim-0, handling each parameter individually. This approach enables easier manipulation of parameters, for example per-weight learning rate, communication-free sharded state dicts, and simpler meta-device initialization. The transition from FSDP1 to FSDP2 reflects a shift towards more flexible and efficient parameter handling in distributed training, addressing limitations of the flat-parameter approach while potentially introducing new optimization opportunities.
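For readers unfamiliar with the two APIs, the following minimal sketch contrasts the call patterns. It assumes a recent PyTorch build that exposes the composable fully_shard API (the exact import path varies across versions) and an initialized process group, for example one launched with torchrun; it is an illustration, not torchtitan’s internal code.

import torch.distributed as dist
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=512, nhead=8)

if dist.is_initialized():   # both APIs require a process group, e.g. one set up by torchrun
    # FSDP1: wrap the module; its parameters are flattened into a single 1D FlatParameter.
    # from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    # model = FSDP(model)

    # FSDP2: shard each parameter individually as a DTensor along dim-0;
    # the module keeps its original type and is modified in place.
    from torch.distributed._composable.fsdp import fully_shard
    fully_shard(model)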
torchtitan support for torch.compile
torch.compile is a key feature in PyTorch that significantly boosts model performance with minimal code changes. Through its just-in-time (JIT) compilation, it analyzes and transforms PyTorch code into more efficient kernels. torchtitan supports torch.compile, which delivers substantial speedups, especially for large models and complex architectures, by using techniques like operator fusion, memory planning, and automatic kernel selection. This is enabled by setting compile = true in the model’s TOML configuration file.
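Outside of torchtitan, where the TOML flag handles this, the code change for an arbitrary PyTorch module is a single call; a minimal example:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# torch.compile traces the model and generates fused kernels; the first call pays a
# compilation cost, and subsequent calls with compatible shapes reuse the compiled graph.
compiled_model = torch.compile(model)
out = compiled_model(torch.randn(8, 1024))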
torchtitan support for FP8 linear operations
torchtitan provides support for FP8 (8-bit floating point) computation that significantly reduces memory footprint and enhances performance in LLM training. FP8 has two formats, E4M3 and E5M2, each optimized for different aspects of training. E4M3 offers higher precision, making it ideal for forward propagation, whereas E5M2, with its larger dynamic range, is better suited for backpropagation. When operating at a lower precision, FP8 has no impact on model accuracy, which we demonstrate by convergence comparisons of the Meta Llama 3 8B pre-training at 2,000 steps. FP8 support on torchtitan is through the torchao library, and we enable FP8 by setting enable_float8_linear = true in the model’s TOML configuration file.
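The precision/range trade-off between the two formats can be inspected directly, assuming a PyTorch build that exposes the float8 dtypes:

import torch

# E4M3 offers finer granularity but a smaller maximum value; E5M2 trades precision for range.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny)
# E4M3 tops out around 448, while E5M2 reaches about 57344, which is why E4M3 suits
# forward-pass weights and activations and E5M2 suits gradients in backpropagation.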
torchtitan support for FP8 all-gather
This feature enables efficient communication of FP8 tensors across multiple GPUs, significantly reducing network bandwidth compared to bfloat16 all-gather operations. FP8 all-gather performs float8 casting before the all-gather operation, reducing the message size. Key to its efficiency is the combined absolute maximum (AMAX) AllReduce, which calculates AMAX for all float8 parameters in a single operation after the optimizer step, avoiding multiple small all-reduces. Similar to FP8 support, this also has no impact on model accuracy, which we demonstrate by convergence comparisons of the Meta Llama 3 8B pre-training.
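The combined-AMAX idea can be sketched as follows; this illustrates the pattern rather than torchtitan’s implementation, and the collective call is shown commented out because it needs an initialized process group:

import torch

params = [torch.randn(1024, 1024) for _ in range(4)]          # stand-ins for float8-scaled weights
amaxes = torch.stack([p.abs().max() for p in params])         # one AMAX per parameter, in one tensor

# In a distributed run, a single collective covers every parameter at once:
# torch.distributed.all_reduce(amaxes, op=torch.distributed.ReduceOp.MAX)

scales = torch.finfo(torch.float8_e4m3fn).max / amaxes        # per-parameter scaling factors
fp8_params = [(p * s).to(torch.float8_e4m3fn) for p, s in zip(params, scales)]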
Pre-training Meta Llama 3 8B with torchtitan on Amazon SageMaker
SageMaker training jobs offer several key advantages that enhance the pre-training process of Meta Llama 3-like model architectures with torchtitan. It provides a fully managed environment that simplifies large-scale distributed training across multiple instances, which is crucial for efficiently pre-training LLMs. SageMaker supports custom containers, which allows seamless integration of the torchtitan library and its dependencies, so all necessary components are readily available.
The built-in distributed training capabilities of SageMaker streamline the setup of multi-GPU and multi-node jobs, reducing the complexity typically associated with such configurations. Additionally, SageMaker integrates with TensorBoard, enabling real-time monitoring and visualization of training metrics and providing valuable insights into the pre-training process. With these features, researchers and practitioners can focus more on model development and optimization rather than infrastructure management, ultimately accelerating the iterative process of creating and refining custom LLMs.
Solution overview
In the following sections, we walk you through how to prepare a custom image with the torchtitan library, then configure a training job estimator function to launch a Meta Llama 3 8B model pre-training with the c4 dataset (Colossal Clean Crawled Corpus) on SageMaker. The c4 dataset is a large-scale web text corpus that has been cleaned and filtered to remove low-quality content. It is frequently used for pre-training language models.
Prerequisites
Before you begin, make sure you have the following requirements in place:

An AWS account.
A SageMaker domain and Amazon SageMaker Studio. For instructions to create these, refer to Quick setup to Amazon SageMaker.
A Hugging Face access token so you can download the Meta Llama 3 models and tokenizer to use later.
You need to request a quota increase of at least 1 ml.p5.48xlarge instance for training job usage on SageMaker.

Build the torchtitan custom image
SageMaker BYOC (Bring Your Own Container) allows you to use custom Docker containers to train and deploy ML models. Typically, SageMaker provides built-in algorithms and preconfigured environments for popular ML frameworks. However, there may be cases where you have unique or proprietary algorithms, dependencies, or specific requirements that aren’t available in the built-in options, necessitating custom containers. In this case, we need to use the nightly versions of torch, torchdata, and the torchao package to train with FP8 precision.
We use the Amazon SageMaker Studio Image Build convenience package, which offers a command line interface (CLI) to simplify the process of building custom container images directly from SageMaker Studio notebooks. This tool eliminates the need for manual setup of Docker build environments, streamlining the workflow for data scientists and developers. The CLI automatically manages the underlying AWS services required for image building, such as Amazon Simple Storage Service (Amazon S3), AWS CodeBuild, and Amazon Elastic Container Registry (Amazon ECR), allowing you to focus on your ML tasks rather than infrastructure setup. It offers a simple command interface, handles packaging of Dockerfiles and container code, and provides the resulting image URI for use in SageMaker training and hosting.
Before getting started, make sure your AWS Identity and Access Management (IAM) execution role has the required IAM permissions and policies to use the Image Build CLI. For more information, see Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks. We have provided the Jupyter notebook to build the custom container in the GitHub repo.
Complete the following steps to build the custom image:

Install the Image Build package with the following command:

! pip install sagemaker-studio-image-build

To extend the pre-built image, you can use the included deep learning libraries and settings without having to create an image from scratch:

FROM 763104351884.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker

Next, specify the libraries to install. You need the nightly versions of torch, torchdata, and the torchao libraries:

RUN pip install --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu121

RUN pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly

#install torchtitan dependencies
RUN pip install --no-cache-dir \
    "datasets>=2.19.0" \
    "tomli>=1.1.0" \
    tensorboard \
    sentencepiece \
    tiktoken \
    blobfile \
    tabulate

#install torchao package for FP8 support
RUN pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
#Display installed packages for reference
RUN pip freeze

Use the Image Build CLI to build and push the image to Amazon ECR:

!sm-docker build --repository torchtitan:latest .

You’re now ready to use this image for pre-training models with torchtitan in SageMaker.
Prepare your dataset (optional)
By default, the torchtitan library uses the allenai/c4 “en” dataset in its training configuration. This is streamed directly during training using the HuggingFaceDataset class. However, you may want to pre-train the Meta Llama 3 models on your own dataset residing in Amazon S3. For this purpose, we have prepared a sample Jupyter notebook to download the allenai/c4 “en” dataset from the Hugging Face dataset hub to an S3 bucket. We use the SageMaker InputDataConfiguration to load the dataset to our training instances in the later section. You can download the dataset with a SageMaker processing job available in the sample Jupyter notebook.
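If you prefer to stage the data yourself rather than run the provided processing job, a hedged sketch of the idea looks like the following; the bucket and prefix names are placeholders, and only a small streamed slice of the corpus is materialized for illustration:

import json
import boto3
from datasets import load_dataset

bucket, prefix = "my-training-bucket", "datasets/c4-en"        # placeholder names
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

local_path = "c4_en_sample.jsonl"
with open(local_path, "w") as f:
    for row in ds.take(1000):                                  # small illustrative subset
        f.write(json.dumps(row) + "\n")

boto3.client("s3").upload_file(local_path, bucket, f"{prefix}/{local_path}")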
Launch your training with torchtitan
Complete the following steps to launch your training:

Import the necessary SageMaker modules and retrieve your work environment details, such as AWS account ID and AWS Region. Make sure to upgrade the SageMaker SDK to the latest version. This might require a SageMaker Studio kernel restart.

%pip install --upgrade "sagemaker>=2.224"
%pip install sagemaker-experiments

import os
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

role = get_execution_role()
print(f"SageMaker Execution Role: {role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account: {account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region: {region}")

sm_boto_client = boto3.client("sagemaker")
sagemaker_session = sagemaker.session.Session(boto_session=session)

default_bucket = sagemaker_session.default_bucket()
print("Default bucket for this session: ", default_bucket)

Clone the torchtitan repository and prepare the training environment. Create a source directory and move the necessary dependencies from the torchtitan directory. This step makes sure you have all the required files for your training process.

!git clone https://github.com/pytorch/torchtitan.git
!mkdir torchtitan/src
!mv torchtitan/torchtitan/ torchtitan/train_configs/ torchtitan/train.py torchtitan/src/

Use the following command to download the Meta Llama 3 tokenizer, which is essential for preprocessing your dataset. Provide your Hugging Face token.

python torchtitan/src/torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3-8B --tokenizer_path "original" --hf_token="YOUR_HF_TOKEN"

One of the key advantages of torchtitan is its straightforward configuration through TOML files. We modify the Meta Llama-3-8b TOML configuration file to enable monitoring and optimization features.

Enable TensorBoard profiling for better insights into the training process:

[metrics]
log_freq = 10
enable_tensorboard = true
save_tb_folder = "/opt/ml/output/tensorboard"

Enable torch.compile for improved performance:

compile = true

Enable FP8 for more efficient computations:

[float8]
enable_float8_linear = true

Activate FP8 all-gather for optimized distributed training:

enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true

To monitor the training progress, set up TensorBoard output. This allows you to visualize the training metrics in real time, providing valuable insights into how the model is learning.

from sagemaker.debugger import TensorBoardOutputConfig

LOG_DIR = "/opt/ml/output/tensorboard"
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f"s3://sagemaker-{region}-{account}/tensorboard/",
    container_local_output_path=LOG_DIR,
)

Set up the data channels for SageMaker training. Create TrainingInput objects that point to the preprocessed dataset in Amazon S3, so your model has access to the training data it needs.

# Update the path below with the S3 dataset path from running the previous Jupyter notebook in Step 2
training_dataset_location = "<PATH-TO-DATASET>"

s3_train_bucket = training_dataset_location

if s3_train_bucket is not None:
    train = sagemaker.inputs.TrainingInput(
        s3_train_bucket, distribution="FullyReplicated", s3_data_type="S3Prefix"
    )
    data_channels = {"train": train}

With all the pieces in place, you’re ready to create the SageMaker PyTorch estimator. This estimator encapsulates all the configurations, including the custom container, hyperparameters, and resource allocations.

import os
from time import gmtime, strftime

hyperparameters = {
    "config_file": "train_configs/llama3_8b.toml"
}
timestamp = strftime("%Y-%m-%d-%H-%M", gmtime())

estimator = PyTorch(
    base_job_name=f"llama3-8b-{timestamp}",
    entry_point="train.py",
    image_uri="<PATH-TO-IMAGE-URI>",
    source_dir=os.path.join(os.getcwd(), "src"),
    role=role,
    instance_type="ml.p5.48xlarge",
    volume_size=800,
    instance_count=4,
    hyperparameters=hyperparameters,
    use_spot_instances=False,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
    distribution={
        "torch_distributed": {"enabled": True},
    },
)

Initiate the model training on SageMaker:

estimator.fit(inputs=data_channels)
Performance numbers
The following table summarizes the performance numbers for the various training runs with different optimizations.

Setup: Llama 3 8B pre-training on 4 x p5.48xlarge instances (32 NVIDIA H100 GPUs)

| Configuration | TOML Configuration | Throughput (Tokens per Second) | Speedup Over Baseline |
| --- | --- | --- | --- |
| Baseline | Default configuration | 6,475 | (baseline) |
| torch.compile | compile = true | 7,166 | 10.67% |
| FP8 linear | compile = true, enable_float8_linear = true | 8,624 | 33.19% |
| FP8 all-gather | compile = true, enable_float8_linear = true, enable_fsdp_float8_all_gather = true, precompute_float8_dynamic_scale_for_fsdp = true | 8,950 | 38.23% |
The performance results show clear optimization progress in Meta Llama 3 8B pre-training. torch.compile() delivered a 10.67% speedup, FP8 linear operations roughly tripled that to 33.19%, and adding FP8 all-gather further increased the speedup to 38.23% over the baseline. This progression demonstrates how combining optimization strategies significantly enhances training efficiency.
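As a quick sanity check, the reported speedups follow directly from the throughput numbers in the table; the short snippet below reproduces them:

baseline = 6475  # tokens/s for the default configuration
for label, tps in [("torch.compile", 7166), ("FP8 linear", 8624), ("FP8 all-gather", 8950)]:
    print(f"{label}: {(tps - baseline) / baseline:.2%}")
# Prints roughly 10.67%, 33.19%, and 38.22%, matching the table up to rounding.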
The following figure illustrates the stepwise performance gains for Meta Llama 3 8B pre-training on torchtitan with the optimizations.

These optimizations didn’t affect the model’s training quality. The loss curves for all optimization levels, including the baseline, torch.compile(), FP8 linear, and FP8 all-gather configurations, remained consistent throughout the training process, as shown in the following figure.

The following table showcases the consistent loss value with the different configurations.

| Configuration | Loss After 2,000 Steps |
| --- | --- |
| Baseline | 3.602 |
| Plus torch.compile | 3.601 |
| Plus FP8 | 3.612 |
| Plus FP8 all-gather | 3.607 |

Clean up
After you complete your training experiments, clean up your resources to avoid unnecessary charges. You can start by deleting any unused SageMaker Studio resources. Next, remove the custom container image from Amazon ECR by deleting the repository you created. If you ran the optional step to use your own dataset, delete the S3 bucket where this data was stored.
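For reference, the ECR and S3 cleanup can also be scripted with boto3; this is a hedged sketch in which the repository name "torchtitan" and the bucket name are placeholders for the resources you actually created:

import boto3

# Delete the custom container repository created earlier (repository name assumed here).
boto3.client("ecr").delete_repository(repositoryName="torchtitan", force=True)

# If you staged your own dataset, empty and remove its bucket ("my-training-bucket" is a placeholder).
bucket = boto3.resource("s3").Bucket("my-training-bucket")
bucket.objects.all().delete()   # a bucket must be empty before it can be deleted
bucket.delete()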
Conclusion
In this post, we demonstrated how to efficiently pre-train Meta Llama 3 models using the torchtitan library on SageMaker. With torchtitan’s advanced optimizations, including torch.compile, FP8 linear operations, and FP8 all-gather, we achieved a 38.23% acceleration in Meta Llama 3 8B pre-training without compromising the model’s accuracy.
SageMaker simplified the large-scale training by offering seamless integration with custom containers, effortless scaling across multiple instances, built-in support for distributed training, and integration with TensorBoard for real-time monitoring.
Pre-training is a crucial step in developing powerful and adaptable LLMs that can effectively tackle a wide range of tasks and applications. By combining the latest PyTorch distributed training features in torchtitan with the scalability and flexibility of SageMaker, organizations can use their proprietary data and domain expertise to create robust and high-performance AI models. Get started by visiting the GitHub repository for the complete code example and optimize your LLM pre-training workflow.
Special thanks
Special thanks to Gokul Nadathur (Engineering Manager at Meta), Gal Oshri (Principal Product Manager Technical at AWS) and Janosch Woschitz (Sr. ML Solution Architect at AWS) for their support to the launch of this post.

About the Authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers—from small startups to large enterprises—train and deploy foundation models efficiently on AWS. He is passionate about computational optimization problems and improving the performance of AI workloads.
Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. He serves as a voting member of the PyTorch Foundation Governing Board, where he contributes to the strategic advancement of open-source deep learning frameworks. At AWS, Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Less Wright is an AI/Partner Engineer in PyTorch. He works on Triton/CUDA kernels (Accelerating Dequant with SplitK work decomposition); paged, streaming, and quantized optimizers; and PyTorch Distributed (PyTorch FSDP).
Wei Feng is a Software Engineer on the PyTorch distributed team. He has worked on float8 all-gather for FSDP2, TP (Tensor Parallel) in TorchTitan, and 4-bit quantization for distributed QLoRA in TorchTune. He is also a core maintainer of FSDP2.