PRISE: A Unique Machine Learning Method for Learning Multitask Temporal Action Abstractions Using Natural Language Processing (NLP)

In sequential decision-making, and especially in robotics, agents must deal with continuous action spaces and high-dimensional observations. The difficulty comes from choosing among a vast range of possible actions in complex, continuous action spaces while evaluating enormous volumes of observation data. Acting on this information efficiently and effectively requires advanced techniques.

In recent research, a team from the University of Maryland, College Park, and Microsoft Research has presented a new viewpoint that formulates the problem of learning temporal action abstractions as sequence compression. The inspiration comes from the training pipelines of large language models (LLMs) in natural language processing (NLP): tokenizing input is a crucial part of LLM training, and it is commonly accomplished with byte pair encoding (BPE). The research suggests adapting BPE, a staple of NLP, to the task of learning variable-timespan skills in continuous control domains.

Primitive Sequence Encoding (PRISE) is the new approach the researchers introduce to put this idea into practice. PRISE produces efficient action abstractions by combining continuous action quantization with BPE. Continuous actions are first quantized into discrete codes to make them easier to process and analyze; the resulting discrete code sequences are then compressed with BPE to reveal meaningful and recurring action primitives.
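To make the compression step concrete, the following sketch shows how a BPE-style merge loop can collapse a sequence of discrete action codes into multi-step primitives. It is a minimal illustration of the general technique rather than the authors' implementation, and the toy code values and merge count are made up.

from collections import Counter

def bpe_merges(code_seq, num_merges):
    """Greedy BPE over a sequence of discrete action codes.
    Each merge replaces the most frequent adjacent pair with a new token,
    so frequent multi-step behaviors become single 'skill' tokens."""
    seq = list(code_seq)
    vocab = {}  # new token -> the pair of codes it stands for
    next_token = max(seq) + 1
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        vocab[next_token] = (a, b)
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(next_token)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        next_token += 1
    return seq, vocab

# Toy trajectory of quantized action codes (made-up values):
codes = [3, 7, 3, 7, 1, 3, 7, 3, 7, 2]
compressed, skills = bpe_merges(codes, num_merges=2)
print(compressed)  # recurring (3, 7) pairs collapse into higher-level skill tokens
print(skills)

Each merge turns the most frequent adjacent pair of codes into a new token, so behaviors that recur across demonstrations end up represented by a single skill token.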

Empirical studies on robotic manipulation tasks show the effectiveness of PRISE. Applying PRISE to a set of multitask robotic manipulation demonstrations, the study demonstrates that the high-level skills it discovers improve the performance of behavior cloning (BC) on downstream tasks. The compact, meaningful action primitives produced by PRISE are well suited to behavior cloning, an approach in which agents learn from expert examples.

The team has summarized their primary contributions as follows.

Primitive Sequence Encoding (PRISE), a unique method for learning multitask temporal action abstractions using NLP approaches, is the main contribution of this work. 

To simplify the action representation, PRISE converts the agent's continuous action space into discrete codes. These discrete action codes are arranged into sequences based on the pretraining trajectories, and PRISE extracts skills of varying timespans from these sequences.
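One simple way to picture the discretization step is nearest-neighbor lookup against a learned codebook, as in vector quantization. The sketch below is only illustrative and uses a random codebook and random trajectory; PRISE learns its codes from the pretraining data rather than sampling them.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned codebook: K discrete codes, each a prototype action vector.
K, action_dim = 16, 4
codebook = rng.normal(size=(K, action_dim))

def quantize(actions, codebook):
    """Map each continuous action to the index of its nearest codebook entry."""
    dists = np.linalg.norm(actions[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

trajectory = rng.normal(size=(10, action_dim))   # stand-in for a demonstration trajectory
codes = quantize(trajectory, codebook)
print(codes)  # discrete code sequence that BPE can then compress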

PRISE considerably improves learning efficiency over strong baselines such as ACT by learning policies over the learned skills and decoding them into sequences of primitive actions during downstream tasks.

The work includes in-depth ablation studies to understand how different parameters affect PRISE's performance, demonstrating the vital role BPE plays in its success.

In conclusion, temporal action abstraction, when framed as a sequence compression problem, offers a potent means of improving sequential decision-making. By effectively integrating NLP techniques, particularly BPE, into the continuous control domain, PRISE is able to learn and encode high-level skills. These skills not only enhance the effectiveness of techniques such as behavior cloning but also show the promise of interdisciplinary approaches in advancing robotics and artificial intelligence.


FLUTE: A CUDA Kernel Designed for Fused Quantized Matrix Multiplications to Accelerate LLM Inference

Large Language Models (LLMs) face deployment challenges due to latency caused by memory bandwidth constraints. Researchers address this with weight-only quantization, compressing LLM parameters to lower precision, which improves latency and reduces GPU memory requirements. Implementing it effectively requires custom mixed-type matrix-multiply kernels that move, dequantize, and process weights efficiently. Existing kernels like bitsandbytes, Marlin, and BitBLAS have shown significant speed-ups but are often limited to 4-bit quantization. Recent advances in odd-bit and non-uniform quantization methods highlight the need for more flexible kernels that support a wider range of settings to maximize the potential of weight quantization in LLM deployment.

Researchers have attempted to solve LLM deployment challenges with weight-only quantization. Uniform quantization converts full-precision weights to lower-precision intervals, while non-uniform methods such as lookup table (LUT) quantization offer more flexibility. Existing kernels like bitsandbytes, Marlin, and BitBLAS move quantized weights from main memory to on-chip SRAM and perform matrix multiplications after dequantizing them to floating point. They show significant speed-ups but typically specialize in 4-bit uniform quantization, with LUT-quantization kernels underperforming. Non-uniform methods like SqueezeLLM and NormalFloat face trade-offs between lookup table size and quantization granularity, and non-uniformly quantized operations cannot use GPU accelerators optimized for floating-point calculations. This highlights the need for efficient kernels that use quantized representations to minimize memory movement while relying on GPU-native floating-point matrix multiplications, balancing the benefits of quantization with hardware optimization.
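As a reference point for what such a kernel computes, the NumPy sketch below performs lookup-table dequantization followed by a standard floating-point matmul. The 4-bit codes, per-group tables, and shapes are made-up assumptions, and a real kernel such as FLUTE fuses these steps on the GPU instead of materializing the dequantized weight matrix.

import numpy as np

rng = np.random.default_rng(0)

# 4-bit LUT quantization: each weight is stored as a code in [0, 15],
# and a per-group lookup table maps codes back to float values.
out_features, in_features, group_size = 8, 32, 16
codes = rng.integers(0, 16, size=(out_features, in_features))
num_groups = in_features // group_size
luts = rng.normal(size=(out_features, num_groups, 16)).astype(np.float32)

def lut_dequantize(codes, luts, group_size):
    """Replace each code with its float value from the group's lookup table."""
    w = np.empty(codes.shape, dtype=np.float32)
    for g in range(codes.shape[1] // group_size):
        cols = slice(g * group_size, (g + 1) * group_size)
        w[:, cols] = np.take_along_axis(luts[:, g, :], codes[:, cols], axis=1)
    return w

x = rng.normal(size=(4, in_features)).astype(np.float32)  # small activation batch
y = x @ lut_dequantize(codes, luts, group_size).T          # result of the fused op
print(y.shape)  # (4, 8)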

Researchers from the Massachusetts Institute of Technology, the High School of Mathematics Plovdiv, Carnegie Mellon University, MBZUAI, and Petuum Inc. introduce FLUTE, a flexible lookup-table engine for deploying weight-quantized LLMs that focuses on low-bit and non-uniform quantization. It addresses three main challenges: handling sub-8-bit matrices, optimizing lookup table-based dequantization, and improving workload distribution for small batches and low-bit-width weights. FLUTE overcomes these issues through three key strategies: offline weight restructuring, a shared-memory lookup table for efficient dequantization, and Stream-K partitioning for optimized workload distribution. This lets FLUTE manage the complexities of low-bit and non-uniform quantization in LLM deployment, improving efficiency and performance in scenarios where traditional methods fall short.

FLUTE is an innovative approach for flexible mixed-type matrix multiplications in weight-quantized LLMs. It addresses key challenges in deploying low-bit and non-uniform quantized models through three main strategies:

Offline Matrix Restructuring: FLUTE reorders quantized weights to optimize for Tensor Core operations, handling non-standard bit widths (e.g., 3-bit) by splitting weights into bit-slices and combining them in registers.

Vectorized Lookup in Shared Memory: To optimize dequantization, FLUTE uses a vectorized lookup table stored in shared memory, accessing two elements simultaneously. It also employs table duplication to reduce bank conflicts.

Stream-K Workload Partitioning: FLUTE implements Stream-K decomposition to evenly distribute workload across SMs, mitigating wave quantization issues in low-bit and low-batch scenarios.

These innovations allow FLUTE to efficiently fuse dequantization and matrix multiplication operations, optimizing memory usage and computational throughput. The kernel employs a sophisticated pipeline of data movement between global memory, shared memory, and registers, utilizing GPU hardware capabilities for maximum performance in weight-quantized LLM deployments.
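To illustrate the offline restructuring idea for non-standard widths, the sketch below splits 3-bit codes into a 2-bit slice and a 1-bit slice and recombines them at compute time. This only conveys the spirit of the approach; the actual register-level packing and memory layout in FLUTE differ.

import numpy as np

rng = np.random.default_rng(0)
w3 = rng.integers(0, 8, size=16, dtype=np.uint8)  # 3-bit codes in [0, 7]

# Split each code into slices that fit standard container widths.
low2 = w3 & 0b11         # 2-bit slice (could be packed 4 per byte)
high1 = (w3 >> 2) & 0b1  # 1-bit slice (could be packed 8 per byte)

# At compute time, the slices are recombined into the original code
# before the lookup-table dequantization step.
recombined = (high1 << 2) | low2
assert np.array_equal(recombined, w3)
print(recombined)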

FLUTE shows impressive performance across various matrix shapes on both A6000 and A100 GPUs. On the A6000, it occasionally approaches the theoretical maximum speedup of 4x. This performance is also consistent across different batch sizes, unlike other LUT-compatible kernels which typically achieve similar speedups only at a batch size of 1 and then degrade rapidly as batch size increases. Also, FLUTE’s performance compares well even to Marlin, a kernel highly specialized for FP16 input and uniform-quantized INT4 weights. This demonstrates FLUTE’s ability to efficiently handle both uniform and non-uniform quantization schemes.

FLUTE demonstrates superior performance in LLM deployment across various quantization settings. The learned NF quantization approach outperforms standard methods and combines well with AWQ. FLUTE’s flexibility allows for experiments with different bit widths and group sizes, nearly matching 16-bit baseline perplexity with small group sizes. End-to-end latency tests using the vLLM framework showed meaningful speedups across various configurations, including with Gemma-2 models. A group size of 64 was found to balance quality and speed effectively. Overall, FLUTE proves to be a versatile and efficient solution for quantized LLM deployment, offering improved performance across multiple scenarios.

FLUTE is a CUDA kernel designed to accelerate LLM inference through fused quantized matrix multiplications. It offers flexibility in mapping quantized to de-quantized values via lookup tables and supports various bit widths and group sizes. FLUTE’s performance is demonstrated through kernel-level benchmarks and end-to-end evaluations on state-of-the-art LLMs like LLaMA-3 and Gemma-2. Tested on A6000 and A100 GPUs in single and tensor parallel setups, FLUTE shows efficiency across unquantized, 3-bit, and 4-bit configurations. This versatility and performance make FLUTE a promising solution for accelerating LLM inference using advanced quantization techniques.


Self-Route: A Simple Yet Effective AI Method that Routes Queries to RAG or Long Context (LC) Based on Model Self-Reflection

Large Language Models (LLMs) have revolutionized the field of natural language processing, allowing machines to understand and generate human language. These models, such as GPT-4 and Gemini-1.5, are crucial for extensive text processing applications, including summarization and question answering. However, managing long contexts remains challenging due to computational limitations and increased costs. Researchers are, therefore, exploring innovative approaches to balance performance and efficiency.

A notable challenge in processing lengthy texts is the computational burden and associated cost. Traditional methods often fall short when dealing with long contexts, necessitating new strategies to handle the issue effectively. This calls for methodologies that balance high performance with cost efficiency. One promising approach is Retrieval Augmented Generation (RAG), which retrieves relevant information based on a query and prompts LLMs to generate responses within that context. RAG significantly expands a model’s capacity to access information economically. However, with advancements in LLMs like GPT-4 and Gemini-1.5, which show improved capabilities in directly processing long contexts, a comparative analysis becomes essential.

Researchers from Google DeepMind and the University of Michigan introduced a new method called SELF-ROUTE. It combines the strengths of RAG and long-context LLMs (LC), using the model’s own self-reflection to decide whether to answer a query with RAG or LC. SELF-ROUTE operates in two steps. First, the query and the retrieved chunks are given to the LLM, which judges whether the query is answerable from them; if so, the RAG-generated answer is used. Otherwise, the full context is given to the LC model for a more comprehensive response. This approach significantly reduces computational cost while maintaining high performance, effectively leveraging the strengths of both RAG and LC models.
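A minimal sketch of the routing logic is shown below, assuming hypothetical retrieve and llm callables and an illustrative prompt format; it is not the authors' code.

def self_route(query, full_context, retrieve, llm):
    """Two-step routing between RAG and long-context (LC) processing.
    `retrieve` and `llm` are hypothetical callables standing in for a real
    retriever and a real LLM API; the prompt wording is illustrative only."""
    chunks = retrieve(query, full_context)          # step 1: RAG retrieval
    rag_prompt = (
        "Answer the question using only the provided chunks. "
        "If they are insufficient, reply exactly 'unanswerable'.\n\n"
        f"Chunks:\n{chunks}\n\nQuestion: {query}"
    )
    rag_answer = llm(rag_prompt)
    if "unanswerable" not in rag_answer.lower():
        return rag_answer                           # cheap path: RAG suffices
    # Step 2: fall back to the full context only for the hard queries.
    lc_prompt = f"Context:\n{full_context}\n\nQuestion: {query}"
    return llm(lc_prompt)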

The SELF-ROUTE evaluation involved three recent LLMs: Gemini-1.5-Pro, GPT-4, and GPT-3.5-Turbo. The study benchmarked these models using LongBench and ∞Bench datasets, focusing on query-based tasks in English. The results demonstrated that LC models consistently outperformed RAG in understanding long contexts. For example, LC surpassed RAG by 7.6% for Gemini-1.5-Pro, 13.1% for GPT-4, and 3.6% for GPT-3.5-Turbo. However, RAG’s cost-effectiveness remains a significant advantage, particularly when the input text considerably exceeds the model’s context window size.

SELF-ROUTE achieved notable cost reductions while maintaining comparable performance to LC models. For instance, the cost was reduced by 65% for Gemini-1.5-Pro and 39% for GPT-4. The method also showed a high degree of prediction overlap between RAG and LC, with 63% of queries having identical predictions and 70% showing a score difference of less than 10. This overlap suggests that RAG and LC often make similar predictions, both correct and incorrect, allowing SELF-ROUTE to leverage RAG for most queries and reserve LC for more complex cases.

The detailed performance analysis revealed that, on average, LC models surpassed RAG by significant margins: 7.6% for Gemini-1.5-Pro, 13.1% for GPT-4, and 3.6% for GPT-3.5-Turbo. Interestingly, for datasets with extremely long contexts, such as those in ∞Bench, RAG sometimes performed better than LC, particularly for GPT-3.5-Turbo. This finding highlights RAG’s effectiveness in specific use cases where the input text exceeds the model’s context window size.

The study also examined various datasets to understand the limitations of RAG. Common failure reasons included multi-step reasoning requirements, general or implicit queries, and long, complex queries that challenge the retriever. By analyzing these failure patterns, the research team identified potential areas for improvement in RAG, such as incorporating chain-of-thought processes and enhancing query understanding techniques.

In conclusion, the comprehensive comparison of RAG and LC models highlights the trade-offs between performance and computational cost in long-context LLMs. While LC models demonstrate superior performance, RAG remains viable due to its lower cost and specific advantages in handling extensive input texts. The SELF-ROUTE method effectively combines the strengths of both RAG and LC, achieving performance comparable to LC at a significantly reduced cost.


Amazon SageMaker inference launches faster auto scaling for generative …

Today, we are excited to announce a new capability in Amazon SageMaker inference that can help you reduce the time it takes for your generative artificial intelligence (AI) models to scale automatically. You can now use sub-minute metrics and significantly reduce overall scaling latency for generative AI models. With this enhancement, you can improve the responsiveness of your generative AI applications as demand fluctuates.
The rise of foundation models (FMs) and large language models (LLMs) has brought new challenges to generative AI inference deployment. These advanced models often take seconds to process, while sometimes handling only a limited number of concurrent requests. This creates a critical need for rapid detection and auto scaling to maintain business continuity. Organizations implementing generative AI seek comprehensive solutions that address multiple concerns: reducing infrastructure costs, minimizing latency, and maximizing throughput to meet the demands of these sophisticated models. However, they prefer to focus on solving business problems rather than doing the undifferentiated heavy lifting to build complex inference platforms from the ground up.
SageMaker provides industry-leading capabilities to address these inference challenges. It offers endpoints for generative AI inference that reduce FM deployment costs by 50% on average and latency by 20% on average by optimizing the use of accelerators. The SageMaker inference optimization toolkit, a fully managed model optimization feature in SageMaker, can deliver up to two times higher throughput while reducing costs by approximately 50% for generative AI performance on SageMaker. Besides optimization, SageMaker inference also provides streaming support for LLMs, enabling you to stream tokens in real time rather than waiting for the entire response. This allows for lower perceived latency and more responsive generative AI experiences, which are crucial for use cases like conversational AI assistants. Lastly, SageMaker inference provides the ability to deploy a single model or multiple models using SageMaker inference components on the same endpoint using advanced routing strategies to effectively load balance to the underlying instances backing an endpoint.
Faster auto scaling metrics
To optimize real-time inference workloads, SageMaker employs Application Auto Scaling. This feature dynamically adjusts the number of instances in use and the quantity of model copies deployed, responding to real-time changes in demand. When in-flight requests surpass a predefined threshold, auto scaling increases the available instances and deploys additional model copies to meet the heightened demand. Similarly, as the number of in-flight requests decreases, the system automatically removes unnecessary instances and model copies, effectively reducing costs. This adaptive scaling makes sure resources are optimally utilized, balancing performance needs with cost considerations in real time.
With today’s launch, SageMaker real-time endpoints now emit two new sub-minute Amazon CloudWatch metrics: ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy. ConcurrentRequestsPerModel is the metric used for SageMaker real-time endpoints; ConcurrentRequestsPerCopy is used when SageMaker real-time inference components are used.
These metrics provide a more direct and accurate representation of the load on the system by tracking the actual concurrency or the number of simultaneous requests being handled by the containers (in-flight requests), including the requests queued inside the containers. The concurrency-based target tracking and step scaling policies focus on monitoring these new metrics. When the concurrency levels increase, the auto scaling mechanism can respond by scaling out the deployment, adding more container copies or instances to handle the increased workload. By taking advantage of these high-resolution metrics, you can now achieve significantly faster auto scaling, reducing detection time and improving the overall scale-out time of generative AI models. You can use these new metrics for endpoints created with accelerator instances like AWS Trainium, AWS Inferentia, and NVIDIA GPUs.
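As a hedged example, you can inspect the new metric directly with the CloudWatch API. The snippet below assumes the metric is published in the AWS/SageMaker namespace with EndpointName and VariantName dimensions and uses a hypothetical endpoint name and Region, so confirm the exact namespace and dimensions in the CloudWatch console for your endpoint.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # adjust to your Region

# Pull the last 15 minutes of the sub-minute concurrency metric.
now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",                      # assumed namespace; verify for your endpoint
    MetricName="ConcurrentRequestsPerModel",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-llm-endpoint"},  # hypothetical endpoint name
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=10,          # the new metrics are emitted at 10-second resolution
    Statistics=["Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])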
In addition, you can enable streaming responses back to the client for models deployed on SageMaker. Many current solutions track a session or concurrency metric only until the first token is sent to the client and then mark the target instance as available. SageMaker can instead track a request until the last token is streamed to the client rather than only until the first token. This way, clients can be directed to instances and GPUs that are less busy, avoiding hotspots. Additionally, tracking concurrency helps you make sure requests that are in flight and queued are treated alike when alerting on the need for auto scaling. With this capability, you can make sure your model deployment scales proactively, accommodating fluctuations in request volumes and maintaining optimal performance by minimizing queuing delays.
In this post, we detail how the new ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy CloudWatch metrics work, explain why you should use them, and walk you through the process of implementing them for your workloads. These new metrics allow you to scale your LLM deployments more effectively, providing optimal performance and cost-efficiency as the demand for your models fluctuates.
Components of auto scaling
The following figure illustrates a typical scenario of how a SageMaker real-time inference endpoint scales out to handle an increase in concurrent requests. This demonstrates the automated and responsive nature of scaling in SageMaker. In this example, we walk through the key steps that occur when the inference traffic to a SageMaker real-time endpoint starts to increase and concurrency to the model deployed on every instance goes up. We show how the system monitors the traffic, invokes an auto scaling action, provisions new instances, and ultimately load balances the requests across the scaled-out resources. Understanding this scaling process is crucial for making sure your generative AI models can handle fluctuations in demand and provide a seamless experience for your customers. By the end of this walkthrough, you’ll have a clear picture of how SageMaker real-time inference endpoints can automatically scale to meet your application’s needs.
Let’s dive into the details of this scaling scenario using the provided figure.

The key steps are as follows:

Increased inference traffic (t0) – At some point, the traffic to the SageMaker real-time inference endpoint starts to increase, indicating a potential need for additional resources. The increase in traffic leads to a higher number of concurrent requests required for each model copy or instance.
CloudWatch alarm monitoring (t0 → t1) – An auto scaling policy uses CloudWatch to monitor metrics, sampling them over a few data points within a predefined time frame. This makes sure the increased traffic reflects a sustained change in demand, not a temporary spike.
Auto scaling trigger (t1) – If the metric crosses the predefined threshold, the CloudWatch alarm goes into an InAlarm state, invoking an auto scaling action to scale up the resources.
New instance provisioning and container startup (t1 → t2) – During the scale-up action, new instances are provisioned if required. The model server and container are started on the new instances. When the instance provisioning is complete, the model container initialization process begins. After the server successfully starts and passes the health checks, the instances are registered with the endpoint, enabling them to serve incoming traffic requests.
Load balancing (t2) – After the container health checks pass and the container reports as healthy, the new instances are ready to serve inference requests. All requests are now automatically load balanced between the two instances using the pre-built routing strategies in SageMaker.

This approach allows the SageMaker real-time inference endpoint to react quickly and handle the increased traffic with minimal impact to the clients.
Application Auto Scaling supports target tracking and step scaling policies. Each has its own logic to handle scale-in and scale-out:

Target tracking works to scale out by adding capacity to reduce the difference between the metric value (ConcurrentRequestsPerModel/Copy) and the target value set. When the metric (ConcurrentRequestsPerModel/Copy) is below the target value, Application Auto Scaling scales in by removing capacity.
Step scaling scales capacity using a set of adjustments, known as step adjustments. The size of the adjustment varies based on the magnitude of the metric value (ConcurrentRequestsPerModel/Copy) and the alarm breach.

By using these new metrics, auto scaling can now be invoked and scale out significantly faster compared to the older SageMakerVariantInvocationsPerInstance predefined metric type. This decrease in the time to measure and invoke a scale-out allows you to react to increased demand significantly faster than before (under 1 minute). This works especially well for generative AI models, which are typically concurrency-bound and can take many seconds to complete each inference request.
Using the new high-resolution metrics allows you to greatly decrease the time it takes to scale up an endpoint with Application Auto Scaling. These high-resolution metrics are emitted at 10-second intervals, allowing for faster invocation of scale-out procedures. For models with fewer than 10 billion parameters, this can be a significant percentage of the time it takes for an end-to-end scaling event. For larger model deployments, this can shorten the time before a new copy of your FM or LLM is ready to serve traffic by up to 5 minutes.

Get started with faster auto scaling
Getting started with using the metrics is straightforward. You can use the following steps to create a new scaling policy to benefit from faster auto scaling. In this example, we deploy a Meta Llama 3 model that has 8 billion parameters on a G5 instance type, which uses NVIDIA A10G GPUs. In this example, the model can fit entirely on a single GPU and we can use auto scaling to scale up the number of inference components and G5 instances based on our traffic. The full notebook can be found on GitHub for SageMaker Single Model Endpoints and SageMaker with inference components.

After you create your SageMaker endpoint, you define a new auto scaling target for Application Auto Scaling. In the following code block, you set as_min_capacity and as_max_capacity to the minimum and maximum number of instances you want to set for your endpoint, respectively. If you’re using inference components (shown later), you can use instance auto scaling and skip this step.

autoscaling_client = boto3.client("application-autoscaling", region_name=region)

# Register scalable target
scalable_target = autoscaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=as_min_capacity,
    MaxCapacity=as_max_capacity,  # Replace with your desired maximum instances
)

After you create your new scalable target, you can define your policy. You can choose between using a target tracking policy or step scaling policy. In the following target tracking policy, we have set TargetValue to 5. This means we’re asking auto scaling to scale up if the number of concurrent requests per model is equal to or greater than five.

# Create Target Tracking Scaling Policy
target_tracking_policy_response = autoscaling_client.put_scaling_policy(
    PolicyName="SageMakerEndpointScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # Scaling triggers when endpoint receives 5 ConcurrentRequestsPerModel
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution"
        },
        "ScaleInCooldown": 180,  # Cooldown period after scale-in activity
        "ScaleOutCooldown": 180,  # Cooldown period after scale-out activity
    },
)

If you would like to configure a step scaling policy instead, refer to the accompanying notebook; a rough sketch of what such a policy can look like follows.
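A step scaling setup differs from target tracking in that you create a CloudWatch alarm on the concurrency metric yourself and attach the scaling policy to it. The sketch below reuses autoscaling_client, resource_id, and region from the preceding code; the thresholds, step sizes, metric namespace, and endpoint name are illustrative assumptions to adapt to your workload.

# Step scaling policy (illustrative adjustments; reuses autoscaling_client and resource_id from above)
step_policy_response = autoscaling_client.put_scaling_policy(
    PolicyName="SageMakerEndpointStepScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 180,
        "StepAdjustments": [
            # Add 1 instance for a small breach above the alarm threshold, 2 for a large one
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 10, "ScalingAdjustment": 1},
            {"MetricIntervalLowerBound": 10, "ScalingAdjustment": 2},
        ],
    },
)

# The policy is driven by a CloudWatch alarm you define on the new concurrency metric
# (namespace and dimensions are assumptions -- confirm them for your endpoint)
cloudwatch_client = boto3.client("cloudwatch", region_name=region)
cloudwatch_client.put_metric_alarm(
    AlarmName="ConcurrentRequestsPerModel-ScaleOut",
    Namespace="AWS/SageMaker",
    MetricName="ConcurrentRequestsPerModel",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-llm-endpoint"},  # hypothetical endpoint name
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Maximum",
    Period=10,
    EvaluationPeriods=3,
    Threshold=5.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[step_policy_response["PolicyARN"]],
)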
That’s it! Traffic now invoking your endpoint will be monitored with concurrency tracked and evaluated against the policy you specified. Your endpoint will scale up and down based on the minimum and maximum values you provided. In the preceding example, we set a cooldown period for scaling in and out to 180 seconds, but you can change this based on what works best for your workload.
SageMaker inference components
If you’re using inference components to deploy multiple generative AI models on a SageMaker endpoint, you can complete the following steps:

After you create your SageMaker endpoint and inference components, you define a new auto scaling target for Application Auto Scaling:

autoscaling_client = boto3.client("application-autoscaling", region_name=region)

# Register scalable target
scalable_target = autoscaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=as_min_capacity,
    MaxCapacity=as_max_capacity,  # Replace with your desired maximum instances
)

After you create your new scalable target, you can define your policy. In the following code, we set TargetValue to 5. By doing so, we’re asking auto scaling to scale up if the number of concurrent requests per model is equal to or greater than five.

# Create Target Tracking Scaling Policy
target_tracking_policy_response = autoscaling_client.put_scaling_policy(
    PolicyName="SageMakerInferenceComponentScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # Scaling triggers when endpoint receives 5 ConcurrentRequestsPerCopy
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution"
        },
        "ScaleInCooldown": 180,  # Cooldown period after scale-in activity
        "ScaleOutCooldown": 180,  # Cooldown period after scale-out activity
    },
)

You can use the new concurrency-based target tracking auto scaling policies in tandem with existing invocation-based target tracking policies. When a container experiences a crash or failure, the resulting requests are typically short-lived and may be responded to with error messages. In such scenarios, the concurrency-based auto scaling policy can detect the sudden drop in concurrent requests, potentially causing an unintentional scale-in of the container fleet. However, the invocation-based policy can act as a safeguard, avoiding the scale-in if there is still sufficient traffic being directed to the remaining containers. With this hybrid approach, container-based applications can achieve a more efficient and adaptive scaling behavior. The balance between concurrency-based and invocation-based policies allows the system to respond appropriately to various operational conditions, such as container failures, sudden spikes in traffic, or gradual changes in workload patterns. This enables the container infrastructure to scale up and down more effectively, optimizing resource utilization and providing reliable application performance.
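As a hedged illustration of that hybrid approach, the snippet below registers a companion invocation-based target tracking policy next to the concurrency-based one shown earlier. It reuses autoscaling_client and resource_id from the single-model endpoint example, and the target value is illustrative; inference components expose an analogous per-copy invocation metric.

# Companion invocation-based target tracking policy on the same endpoint variant
# (illustrative target value; reuses autoscaling_client and resource_id from the earlier snippets)
invocation_policy_response = autoscaling_client.put_scaling_policy(
    PolicyName="SageMakerEndpointInvocationScalingPolicy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per instance per minute; tune for your workload
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 180,
        "ScaleOutCooldown": 180,
    },
)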
Sample runs and results
With the new metrics, we have observed improvements in the time required to invoke scale-out events. To test the effectiveness of this solution, we completed some sample runs with Meta Llama models (Llama 2 7B and Llama 3 8B). Prior to this feature, detecting the need for auto scaling could take over 6 minutes, but with this new feature, we were able to reduce that time to less than 45 seconds. For generative AI models such as Meta Llama 2 7B and Llama 3 8B, we have been able to reduce the overall end-to-end scale-out time by approximately 40%.
The following figures illustrate the results of sample runs for Meta Llama 3 8B.

The following figures illustrate the results of sample runs for Meta Llama 2 7B.

As a best practice, it’s important to optimize your container, model artifacts, and bootstrapping processes to be as efficient as possible. Doing so can help minimize deployment times and improve the responsiveness of AI services.
Conclusion
In this post, we detailed how the ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy metrics work, explained why you should use them, and walked you through the process of implementing them for your workloads. We encourage you to try out these new metrics and evaluate whether they improve your FM and LLM workloads on SageMaker endpoints. You can find the notebooks on GitHub.
Special thanks to our partners from Application Auto Scaling for making this launch happen: Ankur Sethi, Vasanth Kumararajan, Jaysinh Parmar Mona Zhao, Miranda Liu, Fatih Tekin, and Martin Wang.

About the Authors
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Praveen Chamarthi is a Senior AI/ML Specialist with Amazon Web Services. He is passionate about AI/ML and all things AWS. He helps customers across the Americas scale, innovate, and operate ML workloads efficiently on AWS. In his spare time, Praveen loves to read and enjoys sci-fi movies.
Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, and spending time with friends and families.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.
Kunal Shah is a software development engineer at Amazon Web Services (AWS) with 7+ years of industry experience. His passion lies in deploying machine learning (ML) models for inference, and he is driven by a strong desire to learn and contribute to the development of AI-powered tools that can create real-world impact. Beyond his professional pursuits, he enjoys watching historical movies, traveling and adventure sports.
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Find answers accurately and quickly using Amazon Q Business with the S …

Amazon Q Business is a fully managed, generative artificial intelligence (AI)-powered assistant that helps enterprises unlock the value of their data and knowledge. With Amazon Q, you can quickly find answers to questions, generate summaries and content, and complete tasks by using the information and expertise stored across your company’s various data sources and enterprise systems. At the core of this capability are native data source connectors that seamlessly integrate and index content from multiple repositories into a unified index. This enables the Amazon Q large language model (LLM) to provide accurate, well-written answers by drawing from the consolidated data and information. The data source connectors act as a bridge, synchronizing content from disparate systems like Salesforce, Jira, and SharePoint into a centralized index that powers the natural language understanding and generative abilities of Amazon Q.
To make this integration process as seamless as possible, Amazon Q Business offers multiple pre-built connectors to a wide range of data sources, including Atlassian Jira, Atlassian Confluence, Amazon Simple Storage Service (Amazon S3), Microsoft SharePoint, Salesforce, and many more. This allows you to create your generative AI solution with minimal configuration. For a full list of Amazon Q supported data source connectors, see Supported connectors.
One of the key integrations for Amazon Q is with Microsoft SharePoint Online. SharePoint is a widely used collaborative platform that allows organizations to manage and share content, knowledge, and applications to improve productivity and decision-making. By integrating Amazon Q with SharePoint, businesses can empower their employees to access information and insights from SharePoint more efficiently and effectively.
With the Amazon Q and SharePoint Online integration, business users can do the following:

Get instant answers – Users can ask natural language questions and Amazon Q will provide accurate, up-to-date answers by searching and synthesizing information from across the organization’s SharePoint sites and content.
Accelerate research and analysis – Instead of manually searching through SharePoint documents, users can use Amazon Q to quickly find relevant information, summaries, and insights to support their research and decision-making.
Streamline content creation – Amazon Q can assist in generating drafts, outlines, and even complete content pieces (such as reports, articles, or presentations) by drawing on the knowledge and data stored in SharePoint.
Automate workflows and tasks – Amazon Q can be configured to complete routine tasks and queries (such as generating status reports, answering FAQs, or requesting information) by interacting with the relevant SharePoint data and applications.
Enhance collaboration – By making SharePoint content more accessible and actionable through Amazon Q, the integration facilitates better knowledge sharing, problem-solving, and collaboration across the organization.

In this post, we guide you through the process of setting up the SharePoint Online connector in Amazon Q Business. This will enable your organization to use the power of generative AI to unlock the full value of your SharePoint investment and empower your workforce to work smarter and more efficiently.
Find accurate answers from content in Microsoft SharePoint using Amazon Q Business
After you integrate Amazon Q Business with Microsoft SharePoint, users can ask questions from the body of the document. For this post, we use a SharePoint Online site named HR Policies that has information about the travel policy, state disability insurance policy, payroll taxes, and paid family leave program for California stored in document libraries. Some of the questions you can ask Amazon Q Business might include the following:

Is there a leave plan in California for new parents?
Can I claim disability insurance during this time?
Before applying for leave, I want to submit my expense report. How can I do it?
Is there any limit on spending on a business trip?
How can I calculate UI and ETT?

Overview of the data source
SharePoint is a website-based collaboration system that is used as a secure place to store, organize, share, and access information from any device. SharePoint empowers teamwork with dynamic and productive team sites for every project team, department, and division.
SharePoint is available in two options: SharePoint Server and SharePoint Online. SharePoint Server is a locally hosted platform that your company owns and operates; you’re responsible for everything from server architecture and Active Directory to file storage. SharePoint Server 2016, SharePoint Server 2019, and SharePoint Server Subscription Edition are the active SharePoint Server releases. SharePoint Online is a cloud-based service provided directly by Microsoft, which takes care of identity management, architecture, and site management. SharePoint Server and SharePoint Online contain pages, files, attachments, links, events, and comments that can be crawled by the Amazon Q SharePoint connectors for SharePoint Server and SharePoint Online.

SharePoint Online and SharePoint Server offer a site content space where site owners can view a list of all pages, libraries, and lists for their site. The site content space also provides access to add lists, pages, document libraries, and more.

Pages are the contents stored on webpages; these are meant to display information to the end-user.

A document library provides a secure place to store files where you and your coworkers can find them easily. You can work on them together and access them from any device at any time.

A list is one of the data storage mechanisms within SharePoint. It provides the UI to view the items in a list. You can add, edit, and delete items or view individual items.

Overview of the SharePoint Online connector for Amazon Q Business
To crawl and index contents from SharePoint Online, you can configure the Amazon Q Business SharePoint Online connector as a data source in your Amazon Q Business application. When you connect Amazon Q Business to a data source and initiate the sync process, Amazon Q Business crawls and indexes documents from the data source into its index.
Let’s look at what is considered a document in the context of the Amazon Q Business SharePoint Online connector. A document is a collection of information that consists of a title, the content (or body), metadata (data about the document), and access control list (ACL) information to make sure answers are provided from documents that the user has access to.
The following entities in SharePoint are crawled and indexed as documents along with their metadata and access control information:

Files
Events
Pages
Links
Attachments
Comments

Amazon Q Business crawls data source document attributes or metadata and maps them to fields in your Amazon Q index. Refer to Amazon Q Business SharePoint Online data source connector field mappings for more details.
Configure and prepare the Amazon Q connector
Before you index the content from Microsoft SharePoint Online, you first need to establish a secure connection between the Amazon Q Business connector for SharePoint Online and your SharePoint Online instance. To establish a secure connection, you need to authenticate with the data source.
The following are the supported authentication mechanisms for the SharePoint connector:

Basic Authentication
OAuth 2.0 with Resource Owner Password Credentials Flow
Azure AD App-Only (OAuth 2.0 Certificate)
SharePoint App-Only with Client Credentials Flow
OAuth 2.0 with Refresh Token Flow

Secure querying with ACL crawling, identity crawling, and user store
Secure querying means that when a user runs a query, answers are returned only from documents the user has access to, and not from documents the user does not have access to. To enable secure querying, Amazon Q Business honors the ACLs of the documents. Amazon Q Business does this by first supporting the indexing of ACLs. Indexing documents with ACLs is crucial for maintaining data security, because documents without ACLs are considered public. At query time, the user’s credentials (email address) are passed along with the query so that answers from documents that are relevant to the query and that the user is authorized to access are displayed.
A document’s ACL contains information such as the user’s email address and the local groups or federated groups (if Microsoft SharePoint is integrated with an identity provider (IdP) such as Azure Active Directory/Entra ID) that have access to the document. The SharePoint online data source can be optionally connected to an IdP such as Okta or Microsoft Entra ID. In this case, the documents in SharePoint Online can have the federated group information.
When a user logs in to a web application to conduct a search, the user’s credentials (such as an email address) need to match what’s in the ACL of the document to return results from that document. The web application that the user uses to retrieve answers would be connected to an IdP or AWS IAM Identity Center. The user’s credentials from the IdP or IAM Identity Center are referred to here as the federated user credentials. The federated user credentials, such as the email address, are passed along with the query so that Amazon Q can return the answers from the documents that this user has access to. However, sometimes this user’s federated credentials may not be present in the SharePoint Online data source or the SharePoint document’s ACLs. Instead, the user’s local user alias, the local groups that this local user alias is part of, or the federated groups that the federated user is part of are available in the document’s ACL. Therefore, there is a need to map the federated user credential to the local user alias, local groups, or federated groups in the document ACL.
To map this federated user’s email address to the local user aliases, local groups, or federated groups, certain Amazon Q Business connectors, including the SharePoint Online connector, provide an identity crawler to load the identity information (local user alias, local groups, federated groups, and their mappings, along with any other mappings to a federated user) from the connected data sources into a user store. At query time, Amazon Q Business retrieves the associated local user aliases, local groups, and any federated groups from the user store and uses that along with the query for securely retrieving passages from documents that the user has access to.
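Conceptually, the identity crawler and user store turn a federated identity into the full set of principals used for ACL filtering at query time. The toy sketch below illustrates that expansion and filtering step with made-up aliases, groups, and documents; it is not how Amazon Q Business is implemented internally.

# Toy illustration of ACL-filtered retrieval (not Amazon Q Business internals)
user_store = {
    "jdoe@example.com": {"jdoe", "site-owners", "AD-Payroll-Admins"},  # made-up aliases/groups
}

documents = [
    {"title": "Travel policy", "acl": {"site-members", "site-owners"}},
    {"title": "Payroll taxes", "acl": {"site-owners", "AD-Payroll-Admins"}},
]

def accessible_documents(federated_email, documents, user_store):
    """Expand the federated identity into local aliases/groups, then keep only
    documents whose ACL intersects those principals."""
    principals = user_store.get(federated_email, set()) | {federated_email}
    return [d for d in documents if d["acl"] & principals]

print([d["title"] for d in accessible_documents("jdoe@example.com", documents, user_store)])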
If you need to index documents without ACLs, you must make sure they’re explicitly marked as public in your data source.
Refer to How Amazon Q Business connector crawls SharePoint (Online) ACLs for more details.
Amazon Q indexes the documents with ACLs and sets the user’s email address or user principal name for the user, and the group name [site URL hash value | group name] for the local group, in the ACL. If the SharePoint Online data source is connected to an IdP such as Azure AD/Entra ID or Okta, the AD group name visible in the SharePoint site is set as the federated group ACL. The identity crawler sets these same principals, along with the available mappings, in the user store. Any additional mappings need to be set in the user store using the user store APIs.
Overview of solution
This post presents the steps to create a certificate and private key, configure Azure AD (either using the Azure AD console or a PowerShell script), and configure Amazon Q Business.
For this post, we use a SharePoint Online site named HR Policies that hosts policy documents in a Documents library and payroll tax documents in a Payroll Taxes library to walk you through the solution.
In one of the scenarios that we validate, a SharePoint user (Carlos Salazar) is part of the SharePoint site members group, and he has access only to policy documents in the Documents library.

Carlos Salazar can receive responses for queries related to HR policies, as shown in the following example.

However, for questions related to payroll tax, he did not receive any response.

Another SharePoint user (John Doe) is part of the SharePoint site owners group and has access to both the Documents and Payroll Taxes libraries.

John Doe receives responses for queries related to payroll taxes, as shown in the following example.

Prerequisites
You should meet the following prerequisites:

The user performing these steps should be a global administrator on Azure AD/Entra ID.
Configure Microsoft Entra ID and IAM Identity Center integration.
You need a Microsoft Windows instance to run PowerShell scripts and commands with PowerShell 7.4.1+. Details of the required PowerShell modules are described later in this post.
The user should have administrator permissions on the Windows instance.
Make sure that the user running these PowerShell commands has the right M365 license (for example, M365 E3).

Create the certificate and private key
In Azure AD, when configuring App-Only authentication, you typically use a certificate to request access. Anyone with the certificate’s private key can use the app and the permissions granted to the app. We create and configure a self-signed X.509 certificate that will be used to authenticate Amazon Q against Azure AD, while requesting the App-Only access token. The following steps walk you through the setup of this model.
For this post, we use Windows PowerShell to run a few PowerShell commands. You can use an existing Windows instance or spin up a Windows EC2 instance or Windows workstation to run the PowerShell commands.
You can use the following PowerShell script to create a self-signed certificate. You can also generate the self-signed certificate through the New-PnPAzureCertificate command.

Run the following command:

.\Create-SelfSignedCertificate.ps1 -CommonName "<amazonqbusinessdemo>" -StartDate <StartDate in yyyy-mm-dd format> -EndDate <EndDate in yyyy-mm-dd format>

You will be asked to give a password to encrypt your private key, and both the .PFX file and the .CER file will be exported to the current folder (where you ran the PowerShell script from). Verify that you now have a .cer and .pfx file.

Upload this .cer file to an S3 location that your Amazon Q IAM role has GetObject permissions for. You can let Amazon Q create this role for you in future steps outlined later in this post, and the correct permissions will be added for you if you choose.

Now you extract the private key contents from the .pfx file and save it for Amazon Q connector configuration. This .pfx file will be present in the folder where you have saved the certificate.

Run the following command to extract the private key:

openssl pkcs12 -in [amazonqbusinessdemo.pfx] -nocerts -out [amazonqbusinessdemo.key]

You will be prompted for the import password. Enter the password that you used to protect your key pair when you created the .pfx file (client ID, in our case). You will be prompted again to provide a new password to protect the .key file that you are creating. Store the password to your key file in a secure place to avoid misuse. (When you enter a password, the window shows nothing if you’re using the Windows CMD window. Enter your password and choose Enter.)

Run the following command to decrypt the private key:

openssl rsa -in [amazonqbusinessdemo.key] -out [amazonqbusinessdemo-decrypted.key]

Run the following command to extract the certificate:

openssl pkcs12 -in [amazonqbusinessdemo.pfx] -clcerts -nokeys -out [amazonqbusinessdemo.crt]

This decrypted key and certificate will be used by the connector for authentication purposes.

Upload the X.509 certificate (ending with .crt) to an S3 bucket. This will be used when configuring the SharePoint Online connector for Amazon Q.

Verify that the contents of the file amazonqbusinessdemo-decrypted.key start with the standard BEGIN PRIVATE KEY header.
Copy and paste the contents of the amazonqbusinessdemo-decrypted.key for use later in our Amazon Q setup.

Configure Azure AD
You can configure Azure AD using either of the following methods:

Using the Azure AD console GUI. This is a manual step-by-step process.
Using the provided PowerShell script. This is an automated process that takes in the inputs and configures the required permissions.

Follow the steps for either option to complete the Azure AD configuration.
Configure Azure AD using the Azure AD console
To configure Azure AD using the GUI, you first register an Azure AD application in the Azure AD tenant that is linked to the SharePoint Online/O365 tenant. For more details, see Granting access via Azure AD App-Only.

Open the Office 365 Admin Center using the account of a user member of the Tenant Global Admins group.
Navigate to Microsoft Azure Portal.
Search for and choose App registrations.

Choose New registration.

Enter a name for your application, select who can use this application, and choose Register.

An application will be created. You will see a page like the following screenshot.

Note the application (client) ID and the directory (tenant) ID.

These IDs will be different than what is shown in the screenshot.

Now you can configure the newly registered application for SharePoint permissions.

Choose API permissions in the navigation pane.
Choose Add a permission to add the permissions to your application.

Choose SharePoint from the list of applications.

Configure permissions.

There are two different ways to configure SharePoint permissions.
To configure permissions to access multiple SharePoint site collections (using Azure AD App-Only permissions), select Sites.FullControl.All to allow full control permissions to all the SharePoint site collections and to read the ACLs from these site collections.

This permission requires admin consent in a tenant before it can be used. To do so, choose Grant admin consent for <organization name> and choose Yes to confirm.

Alternatively, to configure permissions to access specific SharePoint site collections, select Sites.Selected to allow access to a subset of site collections without a signed-in user. The specific site collections and the permissions granted will be configured in SharePoint Online.

This permission requires admin consent in a tenant before it can be used. To do so, choose Grant admin consent for <organization name> and choose Yes to confirm.

Next, you grant Azure AD app permissions to one or more SharePoint site collections. Make sure the following prerequisites are in place:

You must have Windows Server/Workstation with PowerShell 7.4.1+.
The user running these PowerShell commands must have the right M365 license (for example, M365 E3).
Install the PowerShell modules using Install-Module -Name PnP.PowerShell -AllowPrerelease.
If this is your first-time running PowerShell commands, run the Connect-PnPOnline -Url <site collection url> -PnPManagementShell PowerShell command and complete the consent process to use PnP cmdlets. Alternatively, run the Register-PnPManagementShellAccess cmdlet, which grants access to the tenant for the PnP management shell multi-tenant Azure AD application.

Open PowerShell and connect to SharePoint Online using the Connect-PnPOnline command:

Connect-PnPOnline -Url <sitecollectionUrl> -PnPManagementShell

Add the Azure AD app to one or more specific site collection permissions using Grant-PnPAzureADAppSitePermission:

Grant-PnPAzureADAppSitePermission -AppId <app-id> -DisplayName <displayname> -Site [<sitecollectionurl>] -Permissions <FullControl>

If you want to configure permissions to more than one SharePoint Online site collection, then you must repeat the preceding PowerShell commands for every collection.
Now you’re ready to connect the certificate.

Choose Certificates & secrets in the navigation pane.
On the Certificates tab, choose Upload certificate.

Choose the .cer file you generated earlier and choose Add to upload it.

This completes the configuration on the Azure AD side.
Configure Azure AD using the provided PowerShell script
The user running this PowerShell script should be an Azure AD tenant admin or have tenant admin permissions. Additionally, as a prerequisite, install the MS Graph PowerShell SDK.
Complete the following steps to run the PowerShell script:

Run the PowerShell script and follow the instructions.

This script will do the following:

Register a new application in Azure AD/Entra ID
Configure the required SharePoint permissions
Provide admin consent for the permissions

The output from the PowerShell script will look like the following screenshot.

If you chose Selected as the permission to target a specific SharePoint Site collection, continue with the steps to configure a specific SharePoint Site collection as mentioned earlier.
If you have more than one SharePoint site collection to be crawled, repeat the previous step to configure each collection.

Configure Amazon Q
Make sure you have set up Amazon Q Business with Entra ID as IdP as mentioned in the prerequisites. Also, make sure the email ID is in lowercase letters while creating the users in Entra ID.
Follow the instructions in Connecting Amazon Q Business to SharePoint (Online) using the console.
For Step 9 (Authentication), we choose Azure AD App-Only authentication and configure it as follows:

For Tenant ID, enter the tenant ID of your SharePoint account. This will be directory (tenant) ID in your registered Azure application, in the Azure Portal, as shown in the following screenshot (the IDs will be different for your setup).

For Certificate path, enter the full S3 path to your certificate (for example, s3://certBucket/azuread.crt). This is the Azure AD self-signed X.509 certificate to authenticate the connector for Azure AD. This certificate was created earlier.
For AWS Secrets Manager secret, create a secret in AWS Secrets Manager to store your SharePoint authentication credentials:

For Secret name, enter a name for your secret.
For Client ID, enter the Azure AD client ID generated when you registered SharePoint in Azure AD. This is the application (client) ID created in the Azure Portal when registering the SharePoint application in Azure, as described earlier.
For Private key, enter a private key to authenticate the connector for Azure AD. This is the contents of the .pfx file you created when registering your Azure SharePoint application, as described earlier. Enter the decrypted contents of that .pfx file in its entirety (see the sketch after these steps for one way to export it). Choose Show private key to verify it matches the contents of your .pfx file.
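If you need to extract the decrypted private key from your .pfx file, the following minimal Python sketch (assuming the third-party cryptography package, plus a placeholder file name and password) shows one way to export it in PEM format:

# Minimal sketch: export the decrypted private key from a .pfx (PKCS#12) file.
# The file name and password below are placeholders for illustration only.
from cryptography.hazmat.primitives.serialization import (
    pkcs12, Encoding, PrivateFormat, NoEncryption
)

with open("azuread.pfx", "rb") as f:
    pfx_data = f.read()

private_key, certificate, _extra_certs = pkcs12.load_key_and_certificates(
    pfx_data, b"your-pfx-password"
)

# Print the PEM-encoded key; this is the value to paste into the Secrets Manager secret.
pem = private_key.private_bytes(Encoding.PEM, PrivateFormat.PKCS8, NoEncryption())
print(pem.decode())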

Continue with the rest of the steps in Connecting Amazon Q Business to SharePoint (Online) using the console.
Access the web experience on Amazon Q
To access the web experience, complete the following steps:

On the Amazon Q Business console, choose Applications in the navigation pane.
Choose the application you created.
Choose the link under Web experience URL to browse Amazon Q.

When prompted, authenticate with Entra ID/Azure AD.

After you’re authenticated, you can access Amazon Q. You can ask Amazon Q a question and get a response based on the permissions of the logged-in user.
References

For instructions on how to create an Amazon Q Business application with IAM Identity Center, refer to Configure SAML and SCIM with Microsoft Entra ID and IAM Identity Center.
Use the following PowerShell script to configure Azure AD:

param(
    [Parameter(Mandatory=$true,
    HelpMessage="The friendly name of the app registration")]
    [String]
    $AppName,

    [Parameter(Mandatory=$true,
    HelpMessage="The file path to your public key file")]
    [String]
    $CertPath,

    [Parameter(Mandatory=$false,
    HelpMessage="Your Azure Active Directory tenant ID")]
    [String]
    $TenantId,

    [Parameter(Mandatory=$false)]
    [Switch]
    $StayConnected = $false
)

# Display the options for permission
$validOptions = @('F', 'S')
Write-Host "Select the permissions: [F]-Sites.FullControl.All [S]-Sites.Selected"

# Loop to prompt the user until a valid option is selected
do {
    foreach ($option in $validOptions) {
        Write-Host "[$option]"
    }
    $selectedPermission = Read-Host "Enter your choice (F or S)"
} while ($selectedPermission -notin $validOptions)

# Map user input to the corresponding SharePoint application role IDs
$permissionMapping = @{
    'F' = '678536fe-1083-478a-9c59-b99265e6b0d3'
    'S' = '20d37865-089c-4dee-8c41-6967602d4ac8'
}

$selectedPermissionValue = $permissionMapping[$selectedPermission]

# Requires an admin
if ($TenantId)
{
    Connect-MgGraph -Scopes "Application.ReadWrite.All User.Read AppRoleAssignment.ReadWrite.All" -TenantId $TenantId
}
else
{
    Connect-MgGraph -Scopes "Application.ReadWrite.All User.Read AppRoleAssignment.ReadWrite.All"
}

# Graph permissions constants
$sharePointResourceId = "00000003-0000-0ff1-ce00-000000000000"
$SitePermission = @{
    Id   = $selectedPermissionValue
    Type = "Role"
}

# Get context for access to tenant ID
$context = Get-MgContext

# Load cert
$cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($CertPath)
Write-Host -ForegroundColor Cyan "Certificate loaded"

# Create app registration
$appRegistration = New-MgApplication -DisplayName $AppName -SignInAudience "AzureADMyOrg" `
    -Web @{ RedirectUris = "http://localhost"; } `
    -RequiredResourceAccess @{ ResourceAppId = $sharePointResourceId; ResourceAccess = @($SitePermission) } `
    -AdditionalProperties @{} -KeyCredentials @(@{ Type = "AsymmetricX509Cert"; Usage = "Verify"; Key = $cert.RawData })
Write-Host -ForegroundColor Cyan "App registration created with app ID" $appRegistration.AppId

# Create corresponding service principal
$servicePrincipal = New-MgServicePrincipal -AppId $appRegistration.AppId -AdditionalProperties @{}
Write-Host -ForegroundColor Cyan "Service principal created"
Write-Host
Write-Host -ForegroundColor Green "Success"
Write-Host

# Provide admin consent
$scp = Get-MgServicePrincipal -Filter "DisplayName eq '$($AppName)'"
$app = Get-MgServicePrincipal -Filter "AppId eq '$sharePointResourceId'"
New-MgServicePrincipalAppRoleAssignment -ServicePrincipalId $scp.Id -PrincipalId $scp.Id -ResourceId $app.Id -AppRoleId $selectedPermissionValue

# Generate Connect-MgGraph command
$connectGraph = "Connect-MgGraph -ClientId """ + $appRegistration.AppId + """ -TenantId """ `
    + $context.TenantId + """ -CertificateName """ + $cert.SubjectName.Name + """"
Write-Host $connectGraph

if ($StayConnected -eq $false)
{
    Disconnect-MgGraph
    Write-Host "Disconnected from Microsoft Graph"
}
else
{
    Write-Host
    Write-Host -ForegroundColor Yellow "The connection to Microsoft Graph is still active. To disconnect, use Disconnect-MgGraph"
}

You can test whether the Grant-PnPAzureADAppSitePermission cmdlet worked by connecting to the SharePoint site using the Azure AD app that has the SharePoint Sites.Selected permission and running a few SharePoint API calls:

Make a note of the certificate thumbprint as shown earlier.
Install the certificate for the current user in the Windows Certificate Management Store.
Run the following PowerShell cmdlet to connect to the SharePoint site collection using PnPOnline:

Connect-PnPOnline -Url "<SharePoint site collection url>" -ClientId "<client id>" -Thumbprint "<certificate thumbprint>" -Tenant "<tenant id>"

Run Get-PnPList to list all the SharePoint lists in the site collection and confirm that the permissions are configured correctly:

Get-PnPList

Troubleshooting
For troubleshooting guidance, refer to Troubleshooting your SharePoint (Online) connector.
Clean up
Complete the following steps to clean up your resources:

Open the Office 365 Admin Center using an account that is a member of the Tenant Global Admins group.
Navigate to the Microsoft Azure Portal.
Search for and choose App registrations.
Select the app you created earlier, then choose Delete.
On the Amazon Q Business console, choose Applications in the navigation pane.
Select the application you created, and on the Actions menu, choose Delete.

Conclusion
In this post, we explored how Amazon Q Business can seamlessly integrate with SharePoint Online to help enterprises unlock the value of their data and knowledge. With the SharePoint Online connector, organizations can empower their employees to find answers quickly, accelerate research and analysis, streamline content creation, automate workflows, and enhance collaboration.
We walked you through the process of setting up the SharePoint Online connector, including configuring the necessary Azure AD integration and authentication mechanisms. With these foundations in place, you can start unlocking the full potential of your SharePoint investment and drive greater productivity, efficiency, and innovation across your business.
Now that you’ve learned how to integrate Amazon Q Business with your Microsoft SharePoint Online content, it’s time to unlock the full potential of your organization’s knowledge and data. To get started, sign up for an Amazon Q Business account and follow the steps in this post to set up the SharePoint Online connector. Then you can start asking Amazon Q natural language questions and watch as it surfaces the most relevant information from your company’s SharePoint sites and documents.
Don’t miss out on the transformative power of generative AI and the Amazon Q Business platform. Sign up today and experience the difference that Amazon Q can make for your organization’s SharePoint-powered knowledge and content management.

About the Authors
Vijai Gandikota is a Principal Product Manager on the Amazon Q and Amazon Kendra team of Amazon Web Services. He is responsible for the Amazon Q and Amazon Kendra connectors, ingestion, security, and other aspects of Amazon Q and Amazon Kendra.
Satveer Khurpa is a Senior Solutions Architect on the GenAI Labs team at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies enables him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.
Vijai Anand Ramalingam is a Senior Modernization Architect at Amazon Web Services, specialized in enabling and accelerating customers’ application modernization, transitioning from legacy monolith applications to microservices.
Ramesh Jatiya is a Senior Solutions Architect in the Independent Software Vendor (ISV) team at Amazon Web Services. He is passionate about working with ISV customers to design, deploy, and scale their applications in the cloud to derive business value. He is also pursuing an MBA in Machine Learning and Business Analytics from Babson College, Boston. Outside of work, he enjoys running, playing tennis, and cooking.
Neelam Rana is a Software Development Engineer on the Amazon Q and Amazon Kendra engineering team. She works on Amazon Q connector design, development, integration, and test operations.
Dipti Kulkarni is a Software Development Manager on the Amazon Q and Amazon Kendra engineering team of Amazon Web Services, where she manages the connector development and integration teams.

Evaluate conversational AI agents with Amazon Bedrock

As conversational artificial intelligence (AI) agents gain traction across industries, providing reliability and consistency is crucial for delivering seamless and trustworthy user experiences. However, the dynamic and conversational nature of these interactions makes traditional testing and evaluation methods challenging. Conversational AI agents also encompass multiple layers, from Retrieval Augmented Generation (RAG) to function-calling mechanisms that interact with external knowledge sources and tools. Although existing large language model (LLM) benchmarks like MT-bench evaluate model capabilities, they lack the ability to validate the application layers. The following are some common pain points in developing conversational AI agents:

Testing an agent is often tedious and repetitive, requiring a human in the loop to validate the semantic meaning of the responses from the agent, as shown in the following figure.
Setting up proper test cases and automating the evaluation process can be difficult due to the conversational and dynamic nature of agent interactions.
Debugging and tracing how conversational AI agents route to the appropriate action or retrieve the desired results can be complex, especially when integrating with external knowledge sources and tools.

Agent Evaluation, an open source solution using LLMs on Amazon Bedrock, addresses this gap by enabling comprehensive evaluation and validation of conversational AI agents at scale.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Agent Evaluation provides the following:

Built-in support for popular services, including Agents for Amazon Bedrock, Knowledge Bases for Amazon Bedrock, Amazon Q Business, and Amazon SageMaker endpoints
Orchestration of concurrent, multi-turn conversations with your agent while evaluating its responses
Configurable hooks to validate actions triggered by your agent
Integration into continuous integration and delivery (CI/CD) pipelines to automate agent testing
A generated test summary for performance insights including conversation history, test pass rate, and reasoning for pass/fail results
Detailed traces to enable step-by-step debugging of the agent interactions

In this post, we demonstrate how to streamline virtual agent testing at scale using Amazon Bedrock and Agent Evaluation.
Solution overview
To use Agent Evaluation, you need to create a test plan, which consists of three configurable components:

Target – A target represents the agent you want to test
Evaluator – An evaluator represents the workflow and logic to evaluate the target on a test
Test – A test defines the target’s functionality and how you want your end-user to interact with the target, which includes:

A series of steps representing the interactions between the agent and the end-user
Your expected results of the conversation

The following figure illustrates how Agent Evaluation works on a high level. The framework implements an LLM agent (evaluator) that will orchestrate conversations with your own agent (target) and evaluate the responses during the conversation.

The following figure illustrates the evaluation workflow. It shows how the evaluator reasons and assesses responses based on the test plan. You can either provide an initial prompt or instruct the evaluator to generate one to initiate the conversation. At each turn, the evaluator engages the target agent and evaluates its response. This process continues until the expected results are observed or the maximum number of conversation turns is reached.

By understanding this workflow logic, you can create a test plan to thoroughly assess your agent’s capabilities.
Use case overview
To illustrate how Agent Evaluation can accelerate the development and deployment of conversational AI agents at scale, let’s explore an example scenario: developing an insurance claim processing agent using Agents for Amazon Bedrock. This insurance claim processing agent is expected to handle various tasks, such as creating new claims, sending reminders for pending documents related to open claims, gathering evidence for claims, and searching for relevant information across existing claims and customer knowledge repositories.
For this use case, the goal is to test the agent’s capability to accurately search and retrieve relevant information from existing claims. You want to make sure the agent provides correct and reliable information about existing claims to end-users. Thoroughly evaluating this functionality is crucial before deployment.
Begin by creating and testing the agent in your development account. During this phase, you interact manually with the conversational AI agent using sample prompts to do the following:

Engage the agent in multi-turn conversations on the Amazon Bedrock console
Validate the responses from the agent
Validate all the actions invoked by the agent
Debug and check traces for any routing failures

With Agent Evaluation, the developer can streamline this process through the following steps:

Configure a test plan:

Choose an evaluator from the models provided by Amazon Bedrock.
Configure the target, which should be a type that Agent Evaluation supports. For this post, we use an Amazon Bedrock agent.
Define the test steps and expected results. In the following example test plan, you have a claim with the ID claim-006 in your test system. You want to confirm that your agent can accurately answer questions about this specific claim.

evaluator:
  type: bedrock-claude
  model: claude-haiku
target:
  type: bedrock-agent
  bedrock_agent_alias_id: xxxxxxx
  bedrock_agent_id: xxxxxxx
tests:
  - name: GetOpenClaimsWithDetails
    steps:
      - Ask the agent which claims are open.
      - Ask the agent for details on claim-006.
    expected_results:
      - The agent returns a list of open claims.
      - The agent returns the details on claim-006.

Run the test plan from the command line:

agenteval run
The Agent Evaluation test runner will automatically orchestrate the test based on the test plan, and use the evaluator to determine if the responses from the target match the expected results.
View the result summary. A result summary will be provided in markdown format. In the following example, the summary indicates that the test failed because the agent was unable to provide accurate information about the existing claim claim-006.
Debug with the trace files of the failed tests. Agent Evaluation provides detailed trace files for the tests. Each trace file meticulously records every prompt and interaction between the target and the evaluator. For instance, in the _invoke_target step, you can gain valuable insights into the rationale behind the Amazon Bedrock agent’s responses, allowing you to delve deeper into the decision-making process:

{
    "timestamp": "2024-04-17 12:52:11.231906+00:00",
    "step_name": "_invoke_target",
    "preProcessingTrace": {
        "modelInvocationOutput": {
            "parsedResponse": {
                "isValid": false,
                "rationale": "Let's analyze the user's input step-by-step: 1. The user is asking for details on a specific claim with ID 'claim-006'. 2. This does not seem malicious or harmful in any way, so we can rule out Category A. 3. The user is not trying to get information about the agent's functions or manipulate its behavior. So we can rule out Category B. 4. We don't have enough information yet to determine if the agent can actually answer this question. We need to check the conversation history and available functions first. 5. Looking at the conversation history, the agent has access to a function called 'get::send-reminder::getAllOpenClaims' that returns a list of open claim IDs. 6. However, 'claim-006' is not in the list of open claim IDs returned. So the agent does not have direct access to details on this specific claim. 7. Therefore, this question falls into Category C - the agent will be unable to answer or assist with this question using only the functions it has access to. Given this analysis, I categorize this input as: <category>C</category>"
            }
        }
    }
}
The trace shows that after reviewing the conversation history, the evaluator concludes, “the agent will be unable to answer or assist with this question using only the functions it has access to.” Consequently, it ends the conversation with the target agent and proceeds to generate the test status. In the _generate_test_status step, the evaluator generates the test status with reasoning based on the responses from the target.

{
    "timestamp": "2024-04-17 12:52:12.976985+00:00",
    "step_name": "_generate_test_status",
    "system_prompt": "You are a quality assurance engineer evaluating a conversation between an USER and an AGENT. You will be given an ordered list of steps wrapped in <steps> tags. Each step represents a task that the USER wants to perform when interacting with the AGENT. Your job is analyze the running conversation in <conversation> tags and classify it into the following categories: - A: The USER has attempted all the steps. - B: The USER has not yet attempted all the steps. Please think hard about the response in <thinking> tags before providing only the category letter within <category> tags.",
    "prompt": "Here are the steps and conversation: <steps> 1. Ask the agent which claims are open. 2. Ask the agent for details on claim-006. </steps> <conversation> USER: Which claims are currently open? AGENT: The open claims are: 2s34w-8x, 5t16u-7v, 3b45c-9d USER: Can you please provide me with the details on claim-006? AGENT: Sorry, I don't have enough information to answer that. </conversation>",
    "test_status": "B",
    "reasoning": "The user has attempted the first step of asking which claims are open, and the agent has provided a list of open claims. However, the user has not yet attempted the second step of asking for details on claim-006, as the agent has indicated that they do not have enough information to provide those details."
}
The test plan defines the expected result as the target agent accurately providing details about the existing claim claim-006. However, after testing, the target agent’s response doesn’t meet the expected result, and the test fails.
After identifying and addressing the issue, you can rerun the test to validate the fix. In this example, it’s evident that the target agent lacks access to the claim claim-006. From there, you can continue investigating and verify if claim-006 exists in your test system.

Integrate Agent Evaluation with CI/CD pipelines
After validating the functionality in the development account, you can commit the code to the repository and initiate the deployment process for the conversational AI agent to the next stage. Seamless integration with CI/CD pipelines is a crucial aspect of Agent Evaluation, enabling comprehensive integration testing to make sure no regressions are introduced during new feature development or updates. This rigorous testing approach is vital for maintaining the reliability and consistency of conversational AI agents as they progress through the software delivery lifecycle.
By incorporating Agent Evaluation into CI/CD workflows, organizations can automate the testing process, making sure every code change or update undergoes thorough evaluation before deployment. This proactive measure minimizes the risk of introducing bugs or inconsistencies that could compromise the conversational AI agent’s performance and the overall user experience.
A standard agent CI/CD pipeline includes the following steps:

The source repository stores the agent configuration, including agent instructions, system prompts, model configuration, and so on. You should always commit your changes to ensure quality and reproducibility.
When you commit your changes, a build step is invoked. This is where unit tests should run and validate the changes, including typo and syntax checks.
When the changes are deployed to the staging environment, Agent Evaluation runs with a series of test cases for runtime validation.
The runtime validation on the staging environment can help build confidence to deploy the fully tested agent to production.

The following figure illustrates this pipeline.

In the following sections, we provide step-by-step instructions to set up Agent Evaluation with GitHub Actions.
Prerequisites
Complete the following prerequisite steps:

Follow the GitHub user guide to get started with GitHub.
Follow the GitHub Actions user guide to understand GitHub workflows and Actions.
Follow the insurance claim processing agent using Agents for Amazon Bedrock example to set up an agent.

Set up GitHub Actions
Complete the following steps to deploy the solution:

Write a series of test cases following the agent-evaluation test plan syntax and store test plans in the GitHub repository. For example, a test plan to test an Amazon Bedrock agent target is written as follows, with BEDROCK_AGENT_ALIAS_ID and BEDROCK_AGENT_ID as placeholders:

evaluator:
  model: claude-3
target:
  bedrock_agent_alias_id: BEDROCK_AGENT_ALIAS_ID
  bedrock_agent_id: BEDROCK_AGENT_ID
  type: bedrock-agent
tests:
  InsuranceClaimQuestions:

Create an AWS Identity and Access Management (IAM) user with the proper permissions:

The principal must have InvokeModel permission to the model specified in the configuration.
The principal must have the permissions to call the target agent. Depending on the target type, different permissions are required. Refer to the agent-evaluation target documentation for details.

Store the IAM credentials (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) in GitHub Actions secrets.
Configure a GitHub workflow as follows:

name: Update Agents for Bedrock

on:
  push:
    branches: [ "main" ]

env:
  AWS_REGION: <Deployed AWS region>

permissions:
  contents: read

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Install agent-evaluation
        run: |
          pip install agent-evaluation
          agenteval --help

      - name: Test Bedrock Agent
        id: test-bedrock-agent
        env:
          BEDROCK_AGENT_ALIAS_ID: ${{ vars.BEDROCK_AGENT_ALIAS_ID }}
          BEDROCK_AGENT_ID: ${{ vars.BEDROCK_AGENT_ID }}
        run: |
          sed -e "s/BEDROCK_AGENT_ALIAS_ID/$BEDROCK_AGENT_ALIAS_ID/g" -e "s/BEDROCK_AGENT_ID/$BEDROCK_AGENT_ID/g" test_plans/agenteval.yml > agenteval.yml
          agenteval run

      - name: Test Summary
        if: always()
        id: test-summary
        run: |
          cat agenteval_summary.md >> $GITHUB_STEP_SUMMARY

When you push new changes to the repository, the GitHub Action is invoked, and an example workflow output is displayed, as shown in the following screenshot. A test summary is posted to the GitHub workflow page with details on which tests failed, along with the reasons for the failures.

Clean up
Complete the following steps to clean up your resources:

Delete the IAM user you created for the GitHub Action.
Follow the insurance claim processing agent using Agents for Amazon Bedrock example to delete the agent.

Evaluator considerations
By default, evaluators use the InvokeModel API with On-Demand mode, which will incur AWS charges based on input tokens processed and output tokens generated. For the latest pricing details for Amazon Bedrock, refer to Amazon Bedrock pricing.
The cost of running an evaluator for a single test is influenced by the following:

The number and length of the steps
The number and length of expected results
The length of the target agent’s responses

You can view the total number of input tokens processed and output tokens generated by the evaluator using the --verbose flag when you perform a run (agenteval run --verbose).
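As a rough illustration of how those token counts translate into cost, the following sketch multiplies them by per-1,000-token prices; the rates shown are placeholders, not actual Amazon Bedrock prices, so check the pricing page for current values.

# Back-of-the-envelope cost estimate for one evaluator run.
# Token counts come from `agenteval run --verbose`; the prices are illustrative
# placeholders only; look up the current Amazon Bedrock on-demand rates.
input_tokens = 12_000
output_tokens = 3_500
price_per_1k_input = 0.00025   # assumed example rate (USD per 1,000 input tokens)
price_per_1k_output = 0.00125  # assumed example rate (USD per 1,000 output tokens)

cost = (input_tokens / 1_000) * price_per_1k_input + (output_tokens / 1_000) * price_per_1k_output
print(f"Approximate evaluator cost: ${cost:.4f}")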
Conclusion
This post introduced Agent Evaluation, an open source solution that enables developers to seamlessly integrate agent evaluation into their existing CI/CD workflows. By taking advantage of the capabilities of LLMs on Amazon Bedrock, Agent Evaluation enables you to comprehensively evaluate and debug your agents, achieving reliable and consistent performance. With its user-friendly test plan configuration, Agent Evaluation simplifies the process of defining and orchestrating tests, allowing you to focus on refining your agents’ capabilities. The solution’s built-in support for popular services makes it a versatile tool for testing a wide range of conversational AI agents. Moreover, Agent Evaluation’s seamless integration with CI/CD pipelines empowers teams to automate the testing process, making sure every code change or update undergoes rigorous evaluation before deployment. This proactive approach minimizes the risk of introducing bugs or inconsistencies, ultimately enhancing the overall user experience.
The following are some recommendations to consider:

Don’t use the same model to evaluate the results that you use to power the agent. Doing so may introduce biases and lead to inaccurate evaluations.
Block your pipelines on accuracy failures. Implement strict quality gates to help prevent deploying agents that fail to meet the expected accuracy or performance thresholds.
Continuously expand and refine your test plans. As your agents evolve, regularly update your test plans to cover new scenarios and edge cases, and provide comprehensive coverage.
Use Agent Evaluation’s logging and tracing capabilities to gain insights into your agents’ decision-making processes, facilitating debugging and performance optimization.

Agent Evaluation unlocks a new level of confidence in your conversational AI agents’ performance by streamlining your development workflows, accelerating time-to-market, and delivering exceptional user experiences. To further explore the best practices of building and testing conversational AI agent evaluation at scale, get started by trying Agent Evaluation and provide your feedback.

About the Authors
Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.
Bobby Lindsey is a Machine Learning Specialist at Amazon Web Services. He’s been in technology for over a decade, spanning various technologies and multiple roles. He is currently focused on combining his background in software engineering, DevOps, and machine learning to help customers deliver machine learning workflows at scale. In his spare time, he enjoys reading, research, hiking, biking, and trail running.
Tony Chen is a Machine Learning Solutions Architect at Amazon Web Services, helping customers design scalable and robust machine learning capabilities in the cloud. As a former data scientist and data engineer, he leverages his experience to help tackle some of the most challenging problems organizations face with operationalizing machine learning.
Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.
Curt Lockhart is an AI/ML Specialist Solutions Architect at AWS. He comes from a non-traditional background of working in the arts before his move to tech, and enjoys making machine learning approachable for each customer. Based in Seattle, you can find him venturing to local art museums, catching a concert, and wandering throughout the cities and outdoors of the Pacific Northwest.

Nvidia AI Releases Minitron 4B and 8B: A New Series of Small Language …

Large language models (LLMs), designed to understand and generate human language, have been applied in various domains, such as machine translation, sentiment analysis, and conversational AI. LLMs, characterized by their extensive training data and billions of parameters, are notoriously computationally intensive, posing challenges to their development and deployment. Despite their capabilities, training and deploying these models is resource-heavy, often requiring extensive computational power and large datasets, leading to substantial costs.

One of the primary challenges in this area is the resource-intensive nature of training multiple variants of LLMs from scratch. Researchers aim to create different model sizes to suit various deployment needs, but this process demands enormous computational resources and vast training data. The high cost associated with this approach makes it difficult to scale and deploy these models efficiently. The need to reduce these costs without compromising model performance has driven researchers to explore alternative methods.

Existing approaches to mitigate these challenges include various pruning techniques and knowledge distillation methods. Pruning systematically removes less important weights or neurons from a pre-trained model, reducing its size and computational demands. On the other hand, knowledge distillation transfers knowledge from a larger, more complex model (the teacher) to a smaller, simpler model (the student), enhancing the student model’s performance while requiring fewer resources for training. Despite these techniques, finding a balance between model size, training cost, and performance remains a significant challenge.

Researchers at NVIDIA have introduced a novel approach to prune and retrain LLMs efficiently. Their method focuses on structured pruning, systematically removing entire neurons, layers, or attention heads based on their calculated importance. This approach is combined with a knowledge distillation process, allowing the pruned model to be retrained using a small fraction of the original training data. This method aims to retain the performance of the original model while significantly reducing the training cost and time. The researchers have developed the Minitron model family and have open-sourced these models on Huggingface for public use.

The proposed method begins with an existing large model and prunes it to create smaller, more efficient variants. The importance of each component—neuron, head, layer—is calculated using activation-based metrics during forward propagation on a small calibration dataset of 1024 samples. Components deemed less important are pruned. Following this, the pruned model undergoes a knowledge distillation-based retraining, which helps recover the model’s accuracy. This process leverages a significantly smaller dataset, making the retraining phase much less resource-intensive than traditional methods.
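To make the activation-based scoring step concrete, here is a toy PyTorch sketch (synthetic data, a single linear layer, and not NVIDIA's exact recipe): each output neuron is scored by its mean absolute activation over a small calibration set, and only the highest-scoring half is kept.

import torch
import torch.nn as nn

# Toy sketch of activation-based neuron importance (not the paper's exact method):
# run a small calibration set through one layer, score each output neuron by its
# mean absolute activation, and keep only the highest-scoring neurons.
torch.manual_seed(0)
layer = nn.Linear(512, 1024)
calibration = torch.randn(1024, 512)            # 1024 synthetic calibration samples

with torch.no_grad():
    activations = layer(calibration)            # shape (1024, 1024)
    importance = activations.abs().mean(dim=0)  # one score per output neuron

keep = importance.topk(k=512).indices           # retain the 512 most important neurons
pruned_weight = layer.weight[keep]              # (512, 512) pruned weight matrix
pruned_bias = layer.bias[keep]
print(pruned_weight.shape, pruned_bias.shape)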

The performance of this method was evaluated on the Nemotron-4 model family. The researchers achieved a 2-4× reduction in model size while maintaining comparable performance levels. Specifically, using this method, the 8B and 4B models derived from a 15B model required up to 40× fewer training tokens than training from scratch. This resulted in compute cost savings of 1.8× for training the entire model family (15B, 8B, and 4B). Notably, the 8B model demonstrated a 16% improvement in MMLU scores compared to models trained from scratch. These models performed comparably to other well-known community models, such as Mistral 7B, Gemma 7B, and LLaMa-3 8B, outperforming state-of-the-art compression techniques from existing literature. The Minitron models have been made available on Huggingface for public use, providing the community access to these optimized models.

In conclusion, the researchers at NVIDIA have demonstrated that structured pruning combined with knowledge distillation can reduce the cost and resources required to train large language models. By employing activation-based metrics and a small calibration dataset for pruning, followed by efficient retraining using knowledge distillation, they have shown that it is possible to maintain and, in some cases, improve model performance while drastically cutting down on computational costs. This innovative approach paves the way for more accessible and efficient NLP applications, making it feasible to deploy LLMs at various scales without incurring prohibitive costs.

Check out the Paper and Models. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..

Don’t Forget to join our 47k+ ML SubReddit

Find Upcoming AI Webinars here
The post Nvidia AI Releases Minitron 4B and 8B: A New Series of Small Language Models that are 40x Faster Model Training via Pruning and Distillation appeared first on MarkTechPost.

Nvidia AI Proposes ChatQA 2: A Llama3-based Model for Enhanced Long-Co …

Long-context understanding and retrieval-augmented generation (RAG) in large language models (LLMs) is rapidly advancing, driven by the need for models that can handle extensive text inputs and provide accurate, efficient responses. These capabilities are essential for processing large volumes of information that cannot fit into a single prompt, which is crucial for tasks such as document summarization, conversational question answering, and information retrieval.

The performance gap between open-access LLMs and proprietary models like GPT-4-Turbo remains a significant challenge. While open-access models like Llama-3-70B-Instruct and QWen2-72B-Instruct have enhanced their capabilities, they often lag behind in processing large text volumes and in retrieval tasks. This gap is particularly evident in real-world applications, where the ability to handle long-context inputs and retrieve relevant information efficiently is critical. Current methods for enhancing long-context understanding involve extending the context window of LLMs and employing RAG. These techniques complement each other, with long-context models excelling in summarizing large documents and RAG efficiently retrieving relevant information for specific queries. However, existing solutions often suffer from context fragmentation and low recall rates, undermining their effectiveness.

Researchers from Nvidia introduced ChatQA 2, a Llama3-based model developed to address these challenges. ChatQA 2 aims to bridge the gap between open-access and proprietary LLMs in long-context and RAG capabilities. By extending the context window to 128K tokens and using a three-stage instruction tuning process, ChatQA 2 significantly enhances instruction-following, RAG performance, and long-context understanding. This model achieves a context window extension from 8K to 128K tokens through continuous pretraining on a mix of datasets, including the SlimPajama dataset with upsampled long sequences, resulting in 10 billion tokens with a sequence length of 128K.

The technology behind ChatQA 2 involves a detailed and reproducible technical recipe. The model’s development begins with extending the context window of Llama3-70B from 8K to 128K tokens by continually pretraining it on a mix of datasets. This process uses a learning rate of 3e-5 and a batch size of 32, training for 2,000 steps to process 8 billion tokens. Following this, a three-stage instruction tuning process is applied. The first two stages involve training on high-quality instruction-following datasets and conversational QA data with provided context. In contrast, the third stage focuses on long-context sequences up to 128K tokens. This comprehensive approach ensures that ChatQA 2 can handle various tasks effectively.

ChatQA 2 achieves accuracy comparable to GPT-4-Turbo-2024-0409 on many long-context understanding tasks and surpasses it in RAG benchmarks. For instance, in the InfiniteBench evaluation, which includes functions like longbook summarization, QA, multiple-choice, and dialogue, ChatQA 2 achieved an average score of 34.11, close to the highest score of 34.88 by Qwen2-72B-Instruct. The model also excels in medium-long context benchmarks within 32K tokens, scoring 47.37, and short-context tasks within 4K tokens, achieving an average score of 54.81. These results highlight ChatQA 2’s robust capabilities across different context lengths and functions.

ChatQA 2 addresses significant issues in the RAG pipeline, such as context fragmentation and low recall rates. The model improves retrieval accuracy and efficiency by utilizing a state-of-the-art long-context retriever. For example, the E5-mistral embedding model supports up to 32K tokens for retrieval, significantly enhancing the model’s performance on query-based tasks. In comparisons between RAG and long-context solutions, ChatQA 2 consistently demonstrated superior results, particularly in functions requiring extensive text processing.

In conclusion, by extending the context window to 128K tokens and implementing a three-stage instruction tuning process, ChatQA 2 achieves GPT-4-Turbo-level capabilities in long-context understanding and RAG performance. This model offers flexible solutions for various downstream tasks, balancing accuracy and efficiency through advanced long-context and retrieval-augmented generation techniques. The development and evaluation of ChatQA 2 mark a crucial step forward in large language models, providing enhanced capabilities for processing and retrieving information from extensive text inputs.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..

Don’t Forget to join our 47k+ ML SubReddit

Find Upcoming AI Webinars here
The post Nvidia AI Proposes ChatQA 2: A Llama3-based Model for Enhanced Long-Context Understanding and RAG Capabilities appeared first on MarkTechPost.

Predicting Sustainable Development Goals (SDG) Scores by 2030: A Machi …

Forecasting Sustainable Development Goals (SDG) Scores by 2030:

The Sustainable Development Goals (SDGs) set by the United Nations aim to eradicate poverty, protect the environment, combat climate change, and ensure peace and prosperity by 2030. These 17 goals address global health, education, inequality, environmental degradation, and climate change challenges. Despite extensive research tracking progress towards these goals, more work must be done to forecast SDG scores. This study aims to predict SDG scores for different global regions by 2030 using ARIMAX and Linear Regression (LR), smoothed by the Holt-Winters’ multiplicative technique. Predictors identified from SDGs likely to be influenced by AI in the future were used to enhance model performance. Forecast results indicate that OECD countries and Eastern Europe and Central Asia are expected to achieve the highest SDG scores. At the same time, Latin America and the Caribbean, East and South Asia, the Middle East and North Africa, and Sub-Saharan Africa will exhibit lower levels of achievement.

Sustainable development emphasizes achieving intergenerational equity and optimizing resource consumption to meet future needs. Following the Brundtland Commission’s definition, it became clear that economic growth alone cannot ensure sustainability due to the depletion of natural resources. Sustainable development requires balancing environmental, financial, and social sustainability. With 193 UN member states adopting the SDGs in 2015, there is an international consensus on addressing global challenges. The introduction of smart technologies, particularly AI, has the potential to accelerate SDG implementation. AI can significantly impact various SDGs, including health, education, and climate action. However, privacy concerns, cybersecurity issues, and social biases must be managed through regulatory standards and international guidelines to mitigate potential adverse effects. This study’s findings highlight the importance of identifying priority areas for action and formulating targeted policies to improve SDG scores globally.

Materials and Methods:

This study develops forecasting models using predictors identified through a literature review of AI’s influence on SDGs. Systematic searches in Scopus using specific keywords yielded 33 relevant papers from 1994 to 2023. Predictor selection utilized filter techniques, and the final predictors were chosen from SDGs related to health, education, clean energy, and climate action. Forecast models, including ARIMAX and LR with Holt-Winters smoothing, were built using Python in Google Colab. The ARIMAX model handles non-stationary data, while LR with Holt-Winters enhances accuracy. Data from the Sustainable Development Report 2023 was used, focusing on regional groupings to minimize missing data issues.
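For readers unfamiliar with the model family, the following minimal sketch (synthetic data, hypothetical predictor names, and statsmodels rather than the study's exact configuration) shows the general shape of an ARIMAX-style forecast: an ARIMA model with exogenous predictors, extended to 2030 using assumed future values for those predictors.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Illustrative sketch on synthetic data: forecast a regional SDG index to 2030
# with ARIMAX, i.e. ARIMA plus exogenous predictors (hypothetical names below).
years = pd.period_range("2000", "2022", freq="Y")
rng = np.random.default_rng(0)
sdg_score = pd.Series(60 + 0.4 * np.arange(len(years)) + rng.normal(0, 0.5, len(years)), index=years)
predictors = pd.DataFrame({
    "health_index": 70 + 0.3 * np.arange(len(years)) + rng.normal(0, 0.5, len(years)),
    "clean_energy": 40 + 0.6 * np.arange(len(years)) + rng.normal(0, 0.5, len(years)),
}, index=years)

model = SARIMAX(sdg_score, exog=predictors, order=(1, 1, 0)).fit(disp=False)

# Forecasting to 2030 requires assumed future values of the exogenous predictors.
future_years = pd.period_range("2023", "2030", freq="Y")
future_exog = pd.DataFrame({
    "health_index": 70 + 0.3 * np.arange(len(years), len(years) + len(future_years)),
    "clean_energy": 40 + 0.6 * np.arange(len(years), len(years) + len(future_years)),
}, index=future_years)
forecast = model.forecast(steps=len(future_years), exog=future_exog)
print(forecast.round(2))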

Analysis of ARIMAX and LR Models for SDG Scores:

The ARIMAX and LR models predict SDG scores across six regions from 2022 to 2030. The ARIMAX model generally provides more precise forecasts, particularly for “OECD countries,” which show the highest accuracy and lowest error margins. In contrast, “Sub-Saharan Africa” has the lowest scores and greatest variability. Both models predict similar trends, with “OECD countries” showing the highest growth and “Sub-Saharan Africa” the lowest. Over time, regions like “Latin America and the Caribbean” and “East and South Asia” show moderate improvements, while “Eastern Europe and Central Asia” exhibit stable growth.


Discussion:

Forecasting SDG scores using ARIMAX and smooth linear regression methods reveals a nuanced picture of global progress. AI’s role in enhancing SDGs is dual-faceted: while it contributes to reducing energy consumption, monitoring the environment, and improving health, it also poses risks such as privacy violations, increased inequality, and technological unemployment. The forecasted SDG scores for 2030 show varied regional progress, with OECD countries leading, followed by Eastern Europe, Asia, and Latin America. Sub-Saharan Africa faces significant challenges but shows potential for improvement with AI. Policymakers should leverage AI to support regions lagging in SDG achievement while addressing socio-economic and political factors influencing development.

Conclusion:

This study uses machine learning models to forecast SDG scores for global regions up to 2030, indicating an overall upward trend. Regions like OECD countries, Eastern Europe and Central Asia, Latin America, and the Caribbean are expected to lead with higher scores. At the same time, East and South Asia, the Middle East, and North Africa will improve but remain lower. Strong political, cultural, and socio-economic structures correlate with higher SDG scores. Limitations include uncertainty in predictions and the evolving impact of AI. Future research should explore economic, social, and environmental predictors, refine forecasting models, and assess the influence of policy changes on SDG outcomes.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..

Don’t Forget to join our 47k+ ML SubReddit

Find Upcoming AI Webinars here
The post Predicting Sustainable Development Goals (SDG) Scores by 2030: A Machine Learning Approach with ARIMAX and Linear Regression Models appeared first on MarkTechPost.

Mistral Large 2 is now available in Amazon Bedrock

Mistral AI’s Mistral Large 2 (24.07) foundation model (FM) is now generally available in Amazon Bedrock. Mistral Large 2 is the newest version of Mistral Large and, according to Mistral AI, offers significant improvements across multilingual capabilities, math, reasoning, coding, and much more.
In this post, we discuss the benefits and capabilities of this new model with some examples.
Overview of Mistral Large 2
Mistral Large 2 is an advanced large language model (LLM) with state-of-the-art reasoning, knowledge, and coding capabilities according to Mistral AI. It is multi-lingual by design, supporting dozens of languages, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, Polish, Arabic, and Hindi. Per Mistral AI, a significant effort was also devoted to enhancing the model’s reasoning capabilities. One of the key focuses during training was to minimize the model’s tendency to hallucinate, or generate plausible-sounding but factually incorrect or irrelevant information. This was achieved by fine-tuning the model to be more cautious and discerning in its responses, making sure it provides reliable and accurate outputs. Additionally, the new Mistral Large 2 is trained to acknowledge when it can’t find solutions or doesn’t have sufficient information to provide a confident answer.
According to Mistral AI, the model is also proficient in coding, trained on over 80 programming languages such as Python, Java, C, C++, JavaScript, Bash, Swift, and Fortran. With its best-in-class agentic capabilities, it can natively call functions and output JSON, enabling seamless interaction with external systems, APIs, and tools. Additionally, Mistral Large 2 (24.07) boasts advanced reasoning and mathematical capabilities, making it a powerful asset for tackling complex logical and computational challenges.
Mistral Large 2 also offers an increased context window of 128,000 tokens. At the time of writing, the model (mistral.mistral-large-2407-v1:0) is available in the us-west-2 AWS Region.
Get started with Mistral Large 2 on Amazon Bedrock
If you’re new to using Mistral AI models, you can request model access on the Amazon Bedrock console. For more details, see Manage access to Amazon Bedrock foundation models.
To test Mistral Large 2 on the Amazon Bedrock console, choose Text or Chat under Playgrounds in the navigation pane. Then choose Select model and choose Mistral as the category and Mistral Large 24.07 as the model.

By choosing View API request, you can also access the model using code examples in the AWS Command Line Interface (AWS CLI) and AWS SDKs. You can use model IDs such as mistral.mistral-large-2407-v1:0, as shown in the following code:

$ aws bedrock-runtime invoke-model \
    --model-id mistral.mistral-large-2407-v1:0 \
    --body "{\"prompt\":\"<s>[INST] this is where you place your input text [/INST]\", \"max_tokens\":200, \"temperature\":0.5, \"top_p\":0.9, \"top_k\":50}" \
    --cli-binary-format raw-in-base64-out \
    --region us-west-2 \
    invoke-model-output.txt

In the following sections, we dive into the capabilities of Mistral Large 2.
Increased context window
Mistral Large 2 supports a context window of 128,000 tokens, compared to Mistral Large (24.02), which had a 32,000-token context window. This larger context window is important for developers because it allows the model to process and understand longer pieces of text, such as entire documents or code files, without losing context or coherence. This can be particularly useful for tasks like code generation, documentation analysis, or any application that requires understanding and processing large amounts of text data.
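As a minimal sketch of what this looks like in practice (assuming a boto3 Bedrock Runtime client and a placeholder local file name), you could pass an entire document to the model through the Converse API and ask for a summary:

import boto3

# Minimal sketch: summarize a long local document with Mistral Large 2 via the
# Converse API. The file name is a placeholder; very large files may still need
# chunking if they exceed the 128,000-token context window.
bedrock_client = boto3.client("bedrock-runtime", region_name="us-west-2")
model_id = "mistral.mistral-large-2407-v1:0"

with open("long_document.txt", "r", encoding="utf-8") as f:
    document_text = f.read()

message = {
    "role": "user",
    "content": [{"text": f"Summarize the key points of the following document:\n\n{document_text}"}],
}

response = bedrock_client.converse(
    modelId=model_id,
    messages=[message],
    inferenceConfig={"maxTokens": 1000, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])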
Generating JSON and tool use
Mistral Large 2 now offers a native JSON output mode. This feature allows developers to receive the model’s responses in a structured, easy-to-read format that can be readily integrated into various applications and systems. With JSON being a widely adopted data exchange standard, this capability simplifies the process of working with the model’s outputs, making it more accessible and practical for developers across different domains and use cases. To learn more about how to generate JSON with the Converse API, refer to Generating JSON with the Amazon Bedrock Converse API.
To generate JSON with the Converse API, you need to define a toolSpec. In the following code, we present an example for a travel agent company that will take passenger information and requests and convert them to JSON:

# Define the tool configuration
import json

tool_list = [
    {
        "toolSpec": {
            "name": "travel_agent",
            "description": "Converts trip details as a json structure.",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "origin_airport": {
                            "type": "string",
                            "description": "Origin airport (IATA code)"
                        },
                        "destination_airport": {
                            "type": "string",
                            "description": "Destination airport (IATA code)"
                        },
                        "departure_date": {
                            "type": "string",
                            "description": "Departure date"
                        },
                        "return_date": {
                            "type": "string",
                            "description": "Return date"
                        }
                    },
                    "required": [
                        "origin_airport",
                        "destination_airport",
                        "departure_date",
                        "return_date"
                    ]
                }
            }
        }
    }
]

content = """
I would like to book a flight from New York (JFK) to London (LHR) for a round-trip.
The departure date is June 15, 2023, and the return date is June 25, 2023.

For the flight preferences, I would prefer to fly with Delta or United Airlines.
My preferred departure time range is between 8 AM and 11 AM, and my preferred arrival time range is between 9 AM and 1 PM (local time in London).
I am open to flights with one stop, but no more than that.
Please include non-stop flight options if available.
"""

message = {
    "role": "user",
    "content": [
        {"text": f"<content>{content}</content>"},
        {"text": "Please create a well-structured JSON object representing the flight booking request, ensuring proper nesting and organization of the data. Include sample data for better understanding. Create the JSON based on the content within the <content> tags."}
    ],
}

# Bedrock client configuration
response = bedrock_client.converse(
    modelId=model_id,
    messages=[message],
    inferenceConfig={
        "maxTokens": 500,
        "temperature": 0.1
    },
    toolConfig={
        "tools": tool_list
    }
)

response_message = response['output']['message']
response_content_blocks = response_message['content']
content_block = next((block for block in response_content_blocks if 'toolUse' in block), None)
tool_use_block = content_block['toolUse']
tool_result_dict = tool_use_block['input']

print(json.dumps(tool_result_dict, indent=4))

We get the following response:

{
    "origin_airport": "JFK",
    "destination_airport": "LHR",
    "departure_date": "2023-06-15",
    "return_date": "2023-06-25"
}

Mistral Large 2 was able to correctly take our user query and convert the appropriate information to JSON.
Mistral Large 2 also supports the Converse API and tool use. You can use the Amazon Bedrock API to give a model access to tools that can help it generate responses for messages that you send to the model. For example, you might have a chat application that lets users find the most popular song played on a radio station. To answer a request for the most popular song, a model needs a tool that can query and return the song information. The following code shows an example for getting the correct train schedule:

# Define the tool configuration
toolConfig = {
    "tools": [
        {
            "toolSpec": {
                "name": "shinkansen_schedule",
                "description": "Fetches Shinkansen train schedule departure times for a specified station and time.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "station": {
                                "type": "string",
                                "description": "The station name."
                            },
                            "departure_time": {
                                "type": "string",
                                "description": "The departure time in HH:MM format."
                            }
                        },
                        "required": ["station", "departure_time"]
                    }
                }
            }
        }
    ]
}

# Define the shinkansen schedule tool
def shinkansen_schedule(station, departure_time):
    schedule = {
        "Tokyo": {"09:00": "Hikari", "12:00": "Nozomi", "15:00": "Kodama"},
        "Osaka": {"10:00": "Nozomi", "13:00": "Hikari", "16:00": "Kodama"}
    }
    return schedule.get(station, {}).get(departure_time, "No train found")

def prompt_mistral(prompt):
    messages = [{"role": "user", "content": [{"text": prompt}]}]
    converse_api_params = {
        "modelId": model_id,
        "messages": messages,
        "toolConfig": toolConfig,
        "inferenceConfig": {"temperature": 0.0, "maxTokens": 400},
    }

    response = bedrock_client.converse(**converse_api_params)

    if response['output']['message']['content'][0].get('toolUse'):
        tool_use = response['output']['message']['content'][0]['toolUse']
        tool_name = tool_use['name']
        tool_inputs = tool_use['input']

        if tool_name == "shinkansen_schedule":
            print("Mistral wants to use the shinkansen_schedule tool")
            station = tool_inputs["station"]
            departure_time = tool_inputs["departure_time"]

            try:
                result = shinkansen_schedule(station, departure_time)
                print("Train schedule result:", result)
            except ValueError as e:
                print(f"Error: {str(e)}")

    else:
        print("Mistral responded with:")
        print(response['output']['message']['content'][0]['text'])

prompt_mistral("What train departs Tokyo at 9:00?")

We get the following response:

Mistral wants to use the shinkansen_schedule tool
Train schedule result: Hikari

Mistral Large 2 was able to correctly identify the shinkansen tool and demonstrate its use.
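In a complete tool-use loop, you would typically send the tool’s output back to the model so it can phrase a final answer for the user. The following minimal sketch (reusing bedrock_client, model_id, and toolConfig from the example above, and assuming the Converse API’s standard toolResult message shape) illustrates that follow-up call:

def send_tool_result(messages, assistant_message, tool_use, result):
    # Append the assistant's tool request and a user message carrying the tool result,
    # then call Converse again so the model can produce the final natural-language answer.
    # `assistant_message` is response['output']['message'] from the previous call,
    # `tool_use` is its toolUse block, and `result` is the shinkansen_schedule output.
    follow_up = messages + [
        assistant_message,
        {
            "role": "user",
            "content": [
                {
                    "toolResult": {
                        "toolUseId": tool_use["toolUseId"],
                        "content": [{"json": {"train": result}}],
                    }
                }
            ],
        },
    ]
    response = bedrock_client.converse(
        modelId=model_id,
        messages=follow_up,
        toolConfig=toolConfig,
        inferenceConfig={"temperature": 0.0, "maxTokens": 400},
    )
    return response["output"]["message"]["content"][0]["text"]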
Multilingual support
Mistral Large 2 now supports a large number of character-based languages such as Chinese, Japanese, Korean, Arabic, and Hindi. This expanded language support allows developers to build applications and services that can cater to users from diverse linguistic backgrounds. With multilingual capabilities, developers can create localized UIs, provide language-specific content and resources, and deliver a seamless experience for users regardless of their native language.
In the following example, we translate customer emails generated by the author into different languages such as Hindi and Japanese:

emails= “””
“I recently bought your RGB gaming keyboard and absolutely love the customizable lighting features! Can you guide me on how to set up different profiles for each game I play?”
“I’m trying to use the macro keys on the gaming keyboard I just purchased, but they don’t seem to be registering my inputs. Could you help me figure out what might be going wrong?”
“I’m considering buying your gaming keyboard and I’m curious about the key switch types. What options are available and what are their main differences?”
“I wanted to report a small issue where my keyboard’s space bar is a bit squeaky. However, your quick-start guide was super helpful and I fixed it easily by following the lubrication tips. Just thought you might want to know!”
“My new gaming keyboard stopped working within a week of purchase. None of the keys respond, and the lights don’t turn on. I need a solution or a replacement as soon as possible.”
“I’ve noticed that the letters on the keys of my gaming keyboard are starting to fade after several months of use. Is this covered by the warranty?”
“I had an issue where my keyboard settings would reset every time I restarted my PC. I figured out it was due to a software conflict and resolved it by updating the firmware. Just wanted to ask if there are any new updates coming soon?”
“I’ve been having trouble with the keyboard software not saving my configurations, and it’s starting to get frustrating. What can be done to ensure my settings are saved permanently?”
“””
def converse(prompt, inference_config):
    messages = [{"role": "user", "content": [{"text": prompt}]}]
    response = bedrock_client.converse(
        messages=messages,
        modelId=model_id,
        inferenceConfig=inference_config
    )
    generated_text = response['output']['message']['content'][0]['text']
    print(generated_text)
    return generated_text

prompt = f"""
emails={emails}

Translate the following customer emails into these languages:

1. Hindi
2. Japanese

Label each language section accordingly"""

inference_config = {"temperature": 0.0, "maxTokens": 4000, "topP": 0.1}

response = converse(prompt, inference_config)

We get the following response:

### Hindi

1. “मैंने हाल ही में आपका RGB गेमिंग कीबोर्ड खरीदा और कस्टमाइजेबल लाइटिंग फीचर्स से बहुत प्रेम करता हूँ! क्या आप मुझे बता सकते हैं कि मैं हर गेम के लिए अलग-अलग प्रोफाइल कैसे सेट कर सकता हूँ?”
2. “मैं अपने नए गेमिंग कीबोर्ड पर मैक्रो कीज का उपयोग करने की कोशिश कर रहा हूँ, लेकिन वे मेरे इनपुट्स को रजिस्टर नहीं कर रहे हैं। क्या आप मुझे बता सकते हैं कि क्या गलत हो सकता है?”
3. “मैं आपका गेमिंग कीबोर्ड खरीदने के बारे में सोच रहा हूँ और मुझे की स्विच टाइप्स के बारे में जानना है। क्या विकल्प उपलब्ध हैं और उनके मुख्य अंतर क्या हैं?”
4. “मैं यह रिपोर्ट करना चाहता था कि मेरे कीबोर्ड का स्पेस बार थोड़ा सा चरमरा रहा है। हालाँकि, आपका क्विक-स्टार्ट गाइड बहुत मददगार था और मैंने लुब्रिकेशन टिप्स का पालन करके इसे आसानी से ठीक कर दिया। बस यह जानना चाहता था कि शायद आपको पता चलना चाहिए!”
5. “मेरा नया गेमिंग कीबोर्ड खरीद के एक सप्ताह के भीतर काम करना बंद हो गया। कोई भी की जवाब नहीं दे रहा है, और लाइट्स भी नहीं चालू हो रहे हैं। मुझे एक समाधान या एक रिप्लेसमेंट जितनी जल्दी हो सके चाहिए।”
6. “मैंने नोट किया है कि मेरे गेमिंग कीबोर्ड के कीज पर अक्षर कुछ महीनों के उपयोग के बाद फेड होने लगे हैं। क्या यह वारंटी के तहत कवर है?”
7. “मेरे कीबोर्ड सेटिंग्स हर बार मेरे पीसी को रीस्टार्ट करने पर रीसेट हो जाती थीं। मैंने पता लगाया कि यह एक सॉफ्टवेयर कॉन्फ्लिक्ट के कारण था और फर्मवेयर अपडेट करके इसे सुलझा दिया। बस पूछना चाहता था कि क्या कोई नए अपडेट आने वाले हैं?”
8. “मेरे कीबोर्ड सॉफ्टवेयर मेरी कॉन्फ़िगरेशन को सेव नहीं कर रहे हैं, और यह अब परेशान करने लगा है। मेरे सेटिंग्स को स्थायी रूप से सेव करने के लिए क्या किया जा सकता है?”

### Japanese

1. “最近、あなたのRGBゲーミングキーボードを購入し、カスタマイズ可能なライティング機能が大好きです! 各ゲームごとに異なるプロファイルを設定する方法を教えていただけますか?”
2. “新しく購入したゲーミングキーボードのマクロキーを使おうとしていますが、入力が認識されません。何が問題か教えていただけますか?”
3. “あなたのゲーミングキーボードを購入しようと考えていますが、キースイッチの種類について知りたいです。どのようなオプションがあり、その主な違いは何ですか?”
4. “キーボードのスペースバーが少しきしむようになりました。ただし、クイックスタートガイドが非常に役立ち、潤滑のヒントに従って簡単に修理できました。ただ、知っておいてほしいと思いました!”
5. “新しいゲーミングキーボードが購入後1週間で動作しなくなりました。どのキーも反応せず、ライトも点きません。できるだけ早く解決策や交換をお願いします。”
6. “ゲーミングキーボードのキーの文字が数ヶ月の使用後に消え始めました。これは保証に含まれますか?”
7. “キーボードの設定がPCを再起動するたびにリセットされていました。ソフトウェアの競合が原因であることがわかり、ファームウェアを更新して解決しました。新しいアップデートが近々来るのか教えていただけますか?”
8. “キーボードソフトウェアが設定を保存しないため、非常に困っています。設定を永続的に保存するために何ができますか?”

Coding tasks
Mistral Large 2 has been trained on over 80 coding languages, including popular ones like Python, Java, C, C++, JavaScript, and Bash, as well as more specialized languages such as Swift and Fortran. This comprehensive language support empowers developers to tackle a wide range of coding tasks and projects across various domains and platforms. Whether you’re working on web development, mobile applications, scientific computing, or system programming, Mistral Large 2 can assist you with code generation, debugging, refactoring, and other coding-related tasks. For example, the following code requests the model to generate a Python function:

prompt = """
Write a Python function called palindrome_prime_finder that finds all prime numbers within a given range that are also palindromes when written in base 10 (decimal system).
"""

inference_config = {"temperature": 0.0, "maxTokens": 1000, "topP": 0.1}

response = converse(prompt, inference_config)

We get the following response:

Certainly! Below is a Python function called `palindrome_prime_finder` that finds all prime numbers within a given range that are also palindromes when written in base 10.

```python
def is_prime(n):
    """Check if a number is prime."""
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

def is_palindrome(n):
    """Check if a number is a palindrome."""
    return str(n) == str(n)[::-1]

def palindrome_prime_finder(start, end):
    """Find all prime numbers within a given range that are also palindromes."""
    palindrome_primes = []
    for num in range(start, end + 1):
        if is_prime(num) and is_palindrome(num):
            palindrome_primes.append(num)
    return palindrome_primes

# Example usage:
start_range = 1
end_range = 1000
print(palindrome_prime_finder(start_range, end_range))
```

### Explanation:
1. **is_prime(n)**: This function checks if a number `n` is prime. It uses a basic primality test that eliminates even numbers and numbers divisible by 3 early, then checks divisibility up to the square root of `n`.

2. **is_palindrome(n)**: This function checks if a number `n` is a palindrome by converting it to a string and comparing it to its reverse.

3. **palindrome_prime_finder(start, end)**: This function iterates through all numbers in the given range `[start, end]`. For each number, it checks if it is both a prime and a palindrome. If both conditions are met, the number is added to the list `palindrome_primes`.

### Example Usage:
The example usage finds all palindromic prime numbers between 1 and 1000. You can adjust the `start_range` and `end_range` variables to find palindromic primes in any desired range.

Conclusion
Mistral AI’s Mistral Large 2 FM is now available on Amazon Bedrock in the US West (Oregon) Region. To get started with Mistral Large 2 in Amazon Bedrock, visit the Amazon Bedrock console.
Interested in diving deeper? Check out the Mistral-on-AWS repo. For more information about Mistral AI on Amazon Bedrock, refer to Mistral AI models now available on Amazon Bedrock.

About the Authors
Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.
Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and Data Analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.
Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

LLM experimentation at scale using Amazon SageMaker Pipelines and MLfl …

Large language models (LLMs) have achieved remarkable success in various natural language processing (NLP) tasks, but they may not always generalize well to specific domains or tasks. You may need to customize an LLM to adapt to your unique use case, improving its performance on your specific dataset or task. You can customize the model using prompt engineering, Retrieval Augmented Generation (RAG), or fine-tuning. Evaluation of a customized LLM against the base LLM (or other models) is necessary to make sure the customization process has improved the model’s performance on your specific task or dataset.
In this post, we dive into LLM customization using fine-tuning, exploring the key considerations for successful experimentation and how Amazon SageMaker with MLflow can simplify the process using Amazon SageMaker Pipelines.
LLM selection and fine-tuning journeys
When working with LLMs, customers often have different requirements. Some may be interested in evaluating and selecting the most suitable pre-trained foundation model (FM) for their use case, while others might need to fine-tune an existing model to adapt it to a specific task or domain. Let’s explore two customer journeys:

Selecting and evaluating foundation models – You can evaluate the performance of different pre-trained FMs on relevant datasets and metrics specific to your use case. You can then select the best model based on the evaluation results. You can do this using services such as Amazon SageMaker JumpStart and Amazon SageMaker Clarify. It can also be done at scale, as explained in Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services. The following diagram illustrates an example architecture.

Fine-tuning an LLM for a specific task or domain adaptation – In this user journey, you need to customize an LLM for a specific task or for domain-specific data. This requires fine-tuning the model. The fine-tuning process may involve one or more experiments, each requiring multiple iterations with different combinations of datasets, hyperparameters, prompts, and fine-tuning techniques, such as full or Parameter-Efficient Fine-Tuning (PEFT). Each iteration can be considered a run within an experiment.

Fine-tuning an LLM can be a complex workflow for data scientists and machine learning (ML) engineers to operationalize. To simplify this process, you can use Amazon SageMaker with MLflow and SageMaker Pipelines for fine-tuning and evaluation at scale. In this post, we describe the step-by-step solution and provide the source code in the accompanying GitHub repository.
Solution overview
Running hundreds of experiments, comparing the results, and keeping track of the ML lifecycle can become very complex. This is where MLflow can help streamline the ML lifecycle, from data preparation to model deployment. By integrating MLflow into your LLM workflow, you can efficiently manage experiment tracking, model versioning, and deployment, providing reproducibility. With MLflow, you can track and compare the performance of multiple LLM experiments, identify the best-performing models, and deploy them to production environments with confidence.
You can create workflows with SageMaker Pipelines that enable you to prepare data, fine-tune models, and evaluate model performance with simple Python code for each step.
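As a rough illustration (not the exact pipeline from this post's accompanying repository; the step names, instance types, and function bodies below are placeholder assumptions), pipeline steps can be written as plain Python functions decorated with @step and chained into a Pipeline:

from sagemaker.workflow.function_step import step
from sagemaker.workflow.pipeline import Pipeline

@step(name="preprocess", instance_type="ml.m5.xlarge")
def preprocess(mlflow_arn, experiment_name):
    # Download the dataset, log it to MLflow, and return identifiers for later steps
    return {"run_id": "placeholder"}

@step(name="finetune", instance_type="ml.g5.12xlarge")
def finetune(preprocess_result):
    # Run LoRA fine-tuning and log parameters and metrics to the same MLflow run
    return preprocess_result

@step(name="evaluate", instance_type="ml.g5.12xlarge")
def evaluate(finetune_result):
    # Evaluate the fine-tuned model with mlflow.evaluate()
    return finetune_result

# Chaining the decorated functions defines the step dependencies
last_step = evaluate(finetune(preprocess("<mlflow_tracking_server_arn>", "<experiment_name>")))
pipeline = Pipeline(name="llm-finetuning-experiments", steps=[last_step])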
Now you can use SageMaker managed MLflow to run LLM fine-tuning and evaluation experiments at scale. Specifically:

MLflow can manage tracking of fine-tuning experiments, comparing evaluation results of different runs, model versioning, deployment, and configuration (such as data and hyperparameters)
SageMaker Pipelines can orchestrate multiple experiments based on the experiment configuration

The following figure shows the overview of the solution.

Prerequisites
Before you begin, make sure you have the following prerequisites in place:

Hugging Face login token – You need a Hugging Face login token to access the models and datasets used in this post. For instructions to generate a token, see User access tokens.
SageMaker access with required IAM permissions – You need access to SageMaker with the necessary AWS Identity and Access Management (IAM) permissions to create and manage resources. Make sure you have the required permissions to create notebooks, deploy models, and perform other tasks outlined in this post. To get started, see Quick setup to Amazon SageMaker. Follow this post to make sure you have the proper IAM role configured for MLflow.

Set up an MLflow tracking server
MLflow is directly integrated in Amazon SageMaker Studio. To create an MLflow tracking server to track experiments and runs, complete the following steps:

On the SageMaker Studio console, choose MLflow under Applications in the navigation pane.

For Name, enter an appropriate server name.
For Artifact storage location (S3 URI), enter the location of an Amazon Simple Storage Service (Amazon S3) bucket.
Choose Create.

The tracking server may require up to 20 minutes to initialize and become operational. When it’s running, you can note its ARN to use in the llm_fine_tuning_experiments_mlflow.ipynb notebook. The ARN will have the following format:

arn:aws:sagemaker:<region>:<account_id>:mlflow-tracking-server/<tracking_server_name>

For subsequent steps, you can refer to the detailed description provided in this post, as well as the step-by-step instructions outlined in the llm_fine_tuning_experiments_mlflow.ipynb notebook. You can launch the notebook in Amazon SageMaker Studio Classic or SageMaker JupyterLab.
Overview of SageMaker Pipelines for experimentation at scale
We use SageMaker Pipelines to orchestrate LLM fine-tuning and evaluation experiments. With SageMaker Pipelines, you can:

Run multiple LLM experiment iterations simultaneously, reducing overall processing time and cost
Effortlessly scale up or down based on changing workload demands
Monitor and visualize the performance of each experiment run with MLflow integration
Invoke downstream workflows for further analysis, deployment, or model selection

MLflow integration with SageMaker Pipelines requires the tracking server ARN. You also need to add the mlflow and sagemaker-mlflow Python packages as dependencies in the pipeline setup. Then you can use MLflow in any pipeline step with the following code snippet:

mlflow_arn = ""  # get the tracking server ARN from step 1
experiment_name = ""  # experiment name of your choice
mlflow.set_tracking_uri(mlflow_arn)
mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name) as run:
    # code for the corresponding step
    ...

Log datasets with MLflow
With MLflow, you can log your dataset information alongside other key metrics, such as hyperparameters and model evaluation. This enables tracking and reproducibility of experiments across different runs, allowing for more informed decision-making about which models perform best on specific tasks or domains. By logging your datasets with MLflow, you can store metadata, such as dataset descriptions, version numbers, and data statistics, alongside your MLflow runs.
In the preprocessing step, you can log training data and evaluation data. In this example, we download the data from a Hugging Face dataset. We use HuggingFaceH4/no_robots for fine-tuning and evaluation. First, you need to set the MLflow tracking ARN and experiment name to log data. After you process the data and select the required number of rows, you can log the data using the log_input API of MLflow. See the following code:

mlflow.set_tracking_uri(mlflow_arn)
mlflow.set_experiment(experiment_name)

dataset = load_dataset(dataset_name, split="train")
# Data processing implementation

# Data logging with MLflow
df_train = pd.DataFrame(dataset)
training_data = mlflow.data.from_pandas(df_train, source=training_input_path)
mlflow.log_input(training_data, context="training")
df_evaluate = pd.DataFrame(eval_dataset)
evaluation_data = mlflow.data.from_pandas(df_evaluate, source=eval_input_path)
mlflow.log_input(evaluation_data, context="evaluation")

Fine-tune a Llama model with LoRA and MLflow
To streamline the process of fine-tuning an LLM with Low-Rank Adaptation (LoRA), you can use MLflow to track hyperparameters and save the resulting model. You can experiment with different LoRA parameters for training and log these parameters along with other key metrics, such as training loss and evaluation metrics. This enables tracking of your fine-tuning process, allowing you to identify the most effective LoRA parameters for a given dataset and task.
For this example, we use the PEFT library from Hugging Face to fine-tune a Llama 3 model. With this library, we can perform LoRA fine-tuning, which offers faster training with reduced memory requirements. It can also work well with less training data.
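As a minimal sketch (assuming the Hugging Face peft package and illustrative hyperparameter values, not the repository's exact settings), wrapping the base model with LoRA adapters and logging the adapter hyperparameters to MLflow can look like the following:

from peft import LoraConfig, get_peft_model
import mlflow

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices (illustrative value)
    lora_alpha=16,                         # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # base_model is the loaded Llama 3 model

# Record the LoRA hyperparameters so runs with different adapter settings can be compared later
mlflow.log_params({
    "lora_r": lora_config.r,
    "lora_alpha": lora_config.lora_alpha,
    "lora_dropout": lora_config.lora_dropout,
})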
We use the HuggingFace class from the SageMaker SDK to create a training step in SageMaker Pipelines. The actual implementation of training is defined in llama3_fine_tuning.py. Just like the previous step, we need to set the MLflow tracking URI and use the same run_id:

mlflow.set_tracking_uri(args.mlflow_arn)
mlflow.set_experiment(args.experiment_name)

with mlflow.start_run(run_id=args.run_id) as run:
    # implementation
    ...

When using the Trainer class from Transformers, you can specify where the training run should be reported. In our case, we want to log all the training arguments and metrics to MLflow:

trainer = transformers.Trainer(
    model=model,
    train_dataset=lm_train_dataset,
    eval_dataset=lm_test_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        gradient_checkpointing=gradient_checkpointing,
        logging_steps=2,
        num_train_epochs=num_train_epochs,
        learning_rate=learning_rate,
        bf16=True,
        save_strategy="no",
        output_dir="outputs",
        report_to="mlflow",
        run_name="llama3-peft",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

When training is complete, you can save the full model; to do so, you need to merge the adapter weights into the base model:

model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()
save_dir = "/opt/ml/model/"
model.save_pretrained(save_dir, safe_serialization=True, max_shard_size="2GB")
# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(save_dir)

The merged model can be logged to MLflow with the model signature, which defines the expected format for model inputs and outputs, including any additional parameters needed for inference:

params = {
    "top_p": 0.9,
    "temperature": 0.9,
    "max_new_tokens": 200,
}

signature = infer_signature("inputs", "generated_text", params=params)

mlflow.transformers.log_model(
    transformers_model={"model": model, "tokenizer": tokenizer},
    signature=signature,
    artifact_path="model",
    model_config=params,
)

Evaluate the model
Model evaluation is the key step for selecting the optimal training arguments for fine-tuning the LLM on a given dataset. In this example, we use the built-in evaluation capability of MLflow with the mlflow.evaluate() API. For question answering models, the default evaluator logs exact_match, token_count, toxicity, flesch_kincaid_grade_level, and ari_grade_level.
MLflow can load the model that was logged in the fine-tuning step. The base model is downloaded from Hugging Face and adapter weights are downloaded from the logged model. See the following code:

logged_model = f"runs:/{preprocess_step_ret['run_id']}/model"
loaded_model = mlflow.pyfunc.load_model(model_uri=logged_model)
results = mlflow.evaluate(
    model=loaded_model,
    data=df,
    targets="answer",
    model_type="question-answering",
    evaluator_config={"col_mapping": {"inputs": "question"}},
)

These evaluation results are logged in MLflow in the same run that logged the data processing and fine-tuning step.
Create the pipeline
After you have the code ready for all the steps, you can create the pipeline:

from sagemaker import get_execution_role

pipeline = Pipeline(name=pipeline_name, steps=[evaluate_finetuned_llama7b_instruction_mlflow], parameters=[lora_config])

You can run the pipeline using the SageMaker Studio UI or using the following code snippet in the notebook:

execution1 = pipeline.start()

Compare experiment results
After you start the pipeline, you can track the experiment in MLflow. Each run will log details of the preprocessing, fine-tuning, and evaluation steps. The preprocessing step will log training and evaluation data, and the fine-tuning step will log all training arguments and LoRA parameters. You can select these experiments and compare the results to find the optimal training parameters and best fine-tuned model.
You can open the MLflow UI from SageMaker Studio.

Then you can select the experiment to filter out runs for that experiment. You can select multiple runs to make the comparison.

When you compare, you can analyze the evaluation score against the training arguments.

Register the model
After you analyze the evaluation results of different fine-tuned models, you can select the best model and register it in MLflow. This model will be automatically synced with Amazon SageMaker Model Registry.
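As an illustrative sketch (the run ID and registered model name below are placeholders, and the tracking URI is assumed to be your tracking server ARN), registering the model from the best run can be done with the MLflow API:

import mlflow

mlflow.set_tracking_uri("<MLFLOW_TRACKING_ARN>")

# Register the model artifact logged by the best fine-tuning run
best_model_uri = "runs:/<run_id>/model"
registered_model = mlflow.register_model(model_uri=best_model_uri, name="llama3-finetuned")
print(registered_model.name, registered_model.version)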

Deploy the model
You can deploy the model through the SageMaker console or SageMaker SDK. You can pull the model artifact from MLflow and use the ModelBuilder class to deploy the model:

from sagemaker.serve import ModelBuilder
from sagemaker.serve.mode.function_pointers import Mode
from sagemaker.serve import SchemaBuilder

model_builder = ModelBuilder(
    mode=Mode.SAGEMAKER_ENDPOINT,
    role_arn="<role_arn>",
    model_metadata={
        # Both the model path and tracking server ARN are required if you use an MLflow run ID
        # or MLflow model registry path as input
        "MLFLOW_MODEL_PATH": "runs:/<run_id>/model",
        "MLFLOW_TRACKING_ARN": "<MLFLOW_TRACKING_ARN>",
    },
    instance_type="ml.g5.12xlarge"
)
model = model_builder.build()
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")

Clean up
To avoid incurring ongoing costs, delete the resources you created as part of this post:

Delete the MLflow tracking server.
Run the last cell in the notebook to delete the SageMaker pipeline:

import boto3

sagemaker_client = boto3.client('sagemaker')
response = sagemaker_client.delete_pipeline(
    PipelineName=pipeline_name,
)

Conclusion
In this post, we focused on how to run LLM fine-tuning and evaluation experiments at scale using SageMaker Pipelines and MLflow. You can use managed MLflow from SageMaker to compare training parameters and evaluation results to select the best model and deploy that model in SageMaker. We also provided sample code in a GitHub repository that shows the fine-tuning, evaluation, and deployment workflow for a Llama3 model.
You can start taking advantage of SageMaker with MLflow for traditional MLOps or to run LLM experimentation at scale.

About the Authors
Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in the Netherlands. He uses his passion for Generative AI to help customers and partners build GenAI applications using AWS services. Jagdeep has 15 years of experience in innovation, experience engineering, digital transformation, cloud architecture and ML applications.
Dr. Sokratis Kartakis is a Principal Machine Learning and Operations Specialist Solutions Architect for Amazon Web Services. Sokratis focuses on enabling enterprise customers to industrialize their ML and generative AI solutions by exploiting AWS services and shaping their operating model, such as MLOps/FMOps/LLMOps foundations, and transformation roadmap using best development practices. He has spent over 15 years inventing, designing, leading, and implementing innovative end-to-end production-level ML and AI solutions in the domains of energy, retail, health, finance, motorsports, and more.
Kirit Thadaka is a Senior Product Manager at AWS focused on generative AI experimentation on Amazon SageMaker. Kirit has extensive experience working with customers to build scalable workflows for MLOps to make them more efficient at bringing models to production.
Piyush Kadam is a Senior Product Manager for Amazon SageMaker, a fully managed service for generative AI builders. Piyush has extensive experience delivering products that help startups and enterprise customers harness the power of foundation models.

Discover insights from Amazon S3 with Amazon Q S3 connector 

Amazon Q is a fully managed, generative artificial intelligence (AI) powered assistant that you can configure to answer questions, provide summaries, generate content, gain insights, and complete tasks based on data in your enterprise. The enterprise data required for these generative AI-powered assistants can reside in varied repositories across your organization. One common repository to store data is Amazon Simple Storage Service (Amazon S3), which is an object storage service that stores data as objects within storage buckets. Customers of all sizes and industries can securely index data from a variety of data sources such as document repositories, websites, content management systems, customer relationship management systems, messaging applications, databases, and so on.
To build a generative AI-based conversational application that is integrated with the data sources containing the relevant content, an enterprise needs to invest time, money, and people to build connectors to those data sources. Next, you need to index the data to make it available for a Retrieval Augmented Generation (RAG) approach, where relevant passages are delivered with high accuracy to a large language model (LLM). To do this, you need to select an index that provides the capabilities to index the content for semantic and vector search, build the infrastructure to retrieve the data, rank the answers, and build a feature-rich web application. You also need to hire and staff a large team to build, maintain, and manage such a system.
Amazon Q Business is a fully managed generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business can help you get fast, relevant answers to pressing questions, solve problems, generate content, and take actions using the data and expertise found in your company’s information repositories, code, and enterprise systems such as Atlassian Jira and others. To do this, Amazon Q provides native data source connectors that can index content into a built-in retriever and uses an LLM to provide accurate, well-written answers. A data source connector within Amazon Q helps to integrate and synchronize data from multiple repositories into one index.
Amazon Q Business offers multiple prebuilt connectors to a large number of data sources, including Atlassian Jira, Atlassian Confluence, Amazon S3, Microsoft SharePoint, Salesforce, and many more and can help you create your generative AI solution with minimal configuration. For a full list of Amazon Q supported data source connectors, see Amazon Q connectors.
Now you can use the Amazon Q S3 connector to index your data on S3 and build a generative AI assistant that can derive insights from the data stored. Amazon Q generates comprehensive responses to natural language queries from users by analyzing information across content that it has access to. Amazon Q also supports access control for your data so that the right users can access the right content. Its responses to questions are based on the content that your end user has permissions to access.
This post shows how to configure the Amazon Q S3 connector and derive insights by creating a generative-AI powered conversation experience on AWS using Amazon Q while using access control lists (ACLs) to restrict access to documents based on user permissions.
Finding accurate answers from content in S3 using Amazon Q Business
After you integrate Amazon Q Business with Amazon S3, users can ask questions about the content stored in S3. For example, a user might ask about the main points discussed in a blog post on cloud security, the installation steps outlined in a user guide, findings from a case study on hybrid cloud usage, market trends noted in an analyst report, or key takeaways from a whitepaper on data encryption. This integration helps users to quickly find the specific information they need, improving their understanding and ability to make informed business decisions.
Secure querying with ACL crawling and identity crawling
Secure querying means that when a user runs a query, answers are returned only from documents that the user has access to, and not from documents that the user does not have access to. To enable users to do secure querying, Amazon Q Business honors the ACLs of the documents. Amazon Q Business does this by first supporting the indexing of ACLs. Indexing documents with ACLs is crucial for maintaining data security, because documents without ACLs are treated as public. Second, at query time the user’s credentials (email address) are passed along with the query so that only answers from documents that are relevant to the query and that the user is authorized to access are displayed.
A document’s ACL, included in the metadata.json or acl.json files alongside the document in the S3 bucket, contains details such as the user’s email address and local groups.
When a user signs in to a web application to conduct a search, their credentials (such as an email address) need to match what’s in the ACL of the document to return results from that document. The web application that the user uses to retrieve answers would be connected to an identity provider (IdP) or the AWS IAM Identity Center. The user’s credentials from the IdP or IAM Identity Center are referred to here as the federated user credentials. The federated user credentials are passed along with the query so that Amazon Q can return the answers from the documents that this user has access to. However, there are occasions when a user’s federated credentials might be absent from the S3 bucket ACLs. In these instances, only the user’s local alias and local groups are specified in the document’s ACL. Therefore, it’s necessary to map these federated user credentials to the corresponding local user alias and local group in the document’s ACL.
Any document or folder without an explicit ACL Deny clause is treated as public.
Solution overview
As an administrator user of Amazon Q, the high-level steps to set up a generative AI chat application are to create an Amazon Q application, connect to different data sources, and finally deploy your web experience. An Amazon Q web experience is the chat interface that you create using your Amazon Q application. Then, your users can chat with your organization’s Amazon Q web experience, and it can be integrated with IAM Identity Center. You can configure and customize your Amazon Q web experience using either the AWS Management Console for Amazon Q or the Amazon Q API.
Amazon Q understands and respects your existing identities, roles, and permissions and uses this information to personalize its interactions. If a user doesn’t have permission to access data without Amazon Q, they can’t access it using Amazon Q either. The following table outlines which documents each user is authorized to access for our use case. The documents being used in this example are a subset of AWS public documents. In this blog post, we will focus on users Arnav (Guest), Mary, and Pat and their assigned groups.

|   | First name | Last name | Group               | Document type authorized for access                                |
|---|------------|-----------|---------------------|--------------------------------------------------------------------|
| 1 | Arnav      | Desai     |                     | Blogs                                                              |
| 2 | Pat        | Candella  | Customer            | Blogs, user guides                                                 |
| 3 | Jane       | Doe       | Sales               | Blogs, user guides, and case studies                               |
| 4 | John       | Stiles    | Marketing           | Blogs, user guides, case studies, and analyst reports              |
| 5 | Mary       | Major     | Solutions architect | Blogs, user guides, case studies, analyst reports, and whitepapers |

Architecture diagram
The following diagram illustrates the solution architecture. Amazon S3 is the data source and documents along with the ACL information are passed to Amazon Q from S3. The user submits a query to the Amazon Q application. Amazon Q retrieves the user and group information and provides answers based on the documents that the user has access to.

In the upcoming sections, we will show you how to implement this architecture.
Prerequisites
For this walkthrough, you should have the following prerequisites:

An AWS account.
Amazon S3 and IAM Identity Center permissions.
Privileges to create an Amazon Q application, AWS resources, and AWS Identity and Access Management (IAM) roles and policies.
Basic knowledge of AWS services and working knowledge of S3.
Follow the steps for Setting up for Amazon Q Business if you’re using Amazon Q Business for the first time.

Prepare your S3 bucket as a data source
In the AWS Region list, choose US East (N. Virginia) as the Region. You can choose any Region that Amazon Q is available in but ensure that you remain in the same Region when creating all other resources. To prepare an S3 bucket as a data source, create an S3 bucket. Note the name of the S3 bucket. Replace <REPLACE-WITH-NAME-OF-S3-BUCKET> with the name of the bucket in the commands below. In a terminal with the AWS Command Line Interface (AWS CLI) or AWS CloudShell, run the following commands to upload the documents to the data source bucket:

aws s3 cp s3://aws-ml-blog/artifacts/building-a-secure-search-application-with-access-controls-kendra/docs.zip .

unzip docs.zip

aws s3 cp Data/ s3://<REPLACE-WITH-NAME-OF-S3-BUCKET>/Data/ --recursive

aws s3 cp Meta/ s3://<REPLACE-WITH-NAME-OF-S3-BUCKET>/Meta/ --recursive

The documents being queried are stored in an S3 bucket. Each document type has a separate folder: blogs, case-studies, analyst reports, user guides, and white papers. This folder structure is contained in a folder named Data as shown below:

Each object in S3 is considered a single document. Any <object-name>.metadata.json file and access control list (ACL) file is considered metadata for the object it’s associated with and not treated as a separate document. In this example, metadata files including the ACLs are in a folder named Meta. We use the Amazon Q S3 connector to configure this S3 bucket as the data source. When the data source is synced with the Amazon Q index, it crawls and indexes all documents and collects the ACLs and document attributes from the metadata files. To learn more about ACLs using metadata files, see Amazon S3 document metadata. Here’s the sample metadata JSON file:

{
    "Attributes": {
        "DocumentType": "user-guides"
    },
    "AccessControlList": [
        { "Access": "ALLOW", "Name": "customer", "Type": "GROUP" },
        { "Access": "ALLOW", "Name": "AWS-Sales", "Type": "GROUP" },
        { "Access": "ALLOW", "Name": "AWS-Marketing", "Type": "GROUP" },
        { "Access": "ALLOW", "Name": "AWS-SA", "Type": "GROUP" }
    ]
}

Create users and groups in IAM Identity Center
In this section, you create the following mapping for demonstration:

|   | User  | Group name |
|---|-------|------------|
| 1 | Arnav |            |
| 2 | Pat   | customer   |
| 3 | Mary  | AWS-SA     |

To create users:

Open the AWS IAM Identity Center console.
If you haven’t enabled IAM Identity Center, choose Enable. If there’s a pop-up, choose how you want to enable IAM Identity Center. For this example, select Enable only in this AWS account. Choose Continue.
In the IAM Identity Center dashboard, choose Users in the navigation pane.
Choose Add User.
Enter the user details for Mary:

Username: mary_major
Email address: mary_major@example.com. Note: Use or create a real email address for each user; it will be used in a later step.
First name: Mary
Last name: Major
Display name: Mary Major

Skip the optional fields and choose Next to create the user.
In the Add user to groups page, choose Next and then choose Add user. Follow the same steps to create users for Pat and Arnav (Guest user). (You will assign users to groups at a later step.)

To create groups:

Now, you will create two groups: AWS-SA and customer. Choose Groups on the navigation pane and choose Create group.

For the group name, enter AWS-SA, add user Mary to the group, and choose Create group.
Similarly, create a group named customer, add user Pat, and choose Create group.
Now, add multi-factor authentication to the users following the instructions sent to the user email. For more details, see Multi-factor authentication for Identity Center users. When done, you will have the users and groups set up on IAM Identity Center.

Create and configure your Amazon Q application
In this step, you create an Amazon Q application that powers the conversation web experience:

On the AWS Management Console for Amazon Q, in the Region list, choose US East (N. Virginia).
On the Getting started page, select Enable identity-aware sessions. Once enabled, you should see Amazon Q connected to IAM Identity Center. Choose Subscribe in Q Business.
On the Amazon Q Business console, choose Get started.
On the Applications page, choose Create application.
On the Create application page, enter an application name and leave everything else at the default values.
Choose Create.
On the Select retriever page, for Retrievers, select Use native retriever.
Choose Next. This will take you to the Connect data sources page.

Configure Amazon S3 as the data source
In this section, you walk through an example of adding an S3 connector. The S3 connector consists of blogs, user guides, case studies, analyst reports, and whitepapers.
To add the S3 connector:

On the Connect data sources page, select Amazon S3 connector.
For Data source name, enter a name for your data source.
In the IAM role section, select Create new service role (Recommended).

In the Sync scope section, browse to your S3 bucket containing the data files.
Under Advanced settings, for Metadata files prefix folder location, enter Meta/
Choose Filter patterns. Under Include patterns, enter Data/ as the prefix and choose Add.
For Frequency under Sync run schedule, choose Run on demand.
Leave the rest as default and choose Add data source. Wait until the data source is added.
On the Connect data sources page, choose Next. This will take you to the Add users and groups page.

Add users and groups in Amazon Q
In this section, you set up users and groups to showcase how access can be managed based on the permissions.

On the Add users and groups page, choose Assign existing users and groups and choose Next.
Enter the users and groups you want to add and choose Assign. You will have to enter the user names and groups in the search box and select the user or group. Verify that users and groups are correctly displayed under the Users and Groups tabs respectively.
Select the Current subscription. In this example, we chose Q Business Lite for groups. Choose the same subscription for users under the Users tab. You can also update subscriptions after creating the application.
Leave the Service role name as default and choose Create application.

Sync S3 data source
With your application created, you will crawl and index the documents in the S3 bucket created at the beginning of the process.

Select the name of the application.

Go to the Data sources section. Select the radio button next to the S3 data source and choose Sync now.

The sync can take from a few minutes to a few hours. Wait for the sync to complete. Verify the sync is complete and documents have been added.

Run queries with Amazon Q
Now that you have configured the Amazon Q application and integrated it with IAM Identity Center, you can test queries from different users based on their group permissions. This will demonstrate how Amazon Q respects the access control rules set up in the Amazon S3 data source.
You have three users for testing—Pat from the Customer group, Mary from the AWS-SA group, and Arnav who isn’t part of any group. According to the access control list (ACL) configuration, Pat should have access to blogs and user guides, Mary should have access to blogs, user guides, case studies, analyst reports, and whitepapers, and Arnav should have access only to blogs.
In the following steps, you will sign in as each user and ask various questions to see what responses Amazon Q provides based on the permitted document types for their respective groups. You will also test edge cases where users try to access information from restricted sources to validate the access control functionality.

In the Amazon Q Business console, choose Applications on the navigation pane and copy the Web experience URL.

Sign in as Pat to the Amazon Q chat interface.
Pat is part of the Customer group and has access to blogs and user guides.
When asked a question like “What is AWS?” Amazon Q will provide a summary pulling information from blogs and user guides, highlighting the sources at the end of each excerpt.

Try asking a question that requires information from user guides, such as “How do I set up an AWS account?” Amazon Q will summarize relevant details from the permitted user guide sources for Pat’s group.

However, if you, as Pat, ask a question that requires information from whitepapers, analyst reports, or case studies, Amazon Q will indicate that it could not find any relevant information in the sources Pat has access to.
Ask a question such as “What are the strategic planning assumptions for the year 2025?” to see this.

Sign in as Mary to the Amazon Q chat interface.
Sign out as user Pat. Start a new incognito browser session or use a different browser. Copy the web experience URL and sign in as user Mary. Repeat these steps each time you need to sign in as a different user.
Mary is part of the AWS-SA group, so she has access to blogs, user guides, case studies, analyst reports, and whitepapers.
When Mary asks the same question about strategic planning, Amazon Q will provide a comprehensive summary pulling information from all the permitted sources.

With Mary’s sign-in, you can ask various other questions related to AWS services, architectures, or solutions, and Amazon Q will effectively summarize information from across all the content types Mary’s group has access to.

Sign in as Arnav to the Amazon Q chat interface.
Arnav is not part of any group and is able to access only blogs. If Arnav asks a question about Amazon Polly, Amazon Q will return blog posts.

When Arnav tries to get information from the user guides, access is restricted. If they ask about something like how to set up an AWS account, Amazon Q responds that it could not find relevant information.

This shows how Amazon Q respects the data access rules configured in the Amazon S3 data source, allowing users to gain insights only from the content their group has permissions to view, while still providing comprehensive answers when possible within those boundaries.
Troubleshooting
Troubleshooting your Amazon S3 connector provides information about error codes you might see for the Amazon S3 connector and suggested troubleshooting actions. If you encounter an HTTP status code 403 (Forbidden) error when you open your Amazon Q Business application, it means that the user is unable to access the application. See Troubleshooting Amazon Q Business and identity provider integration for common causes and how to address them.
Frequently asked questions
Q. Why isn’t Amazon Q Business answering any of my questions?
A. Verify that you have synced your data source on the Amazon Q console. Also, check the ACLs to ensure you have the required permissions to retrieve answers from Amazon Q.
Q. How can I sync documents without ACLs?
A. When configuring the Amazon S3 connector, under Sync scope, you can optionally choose not to include the metadata or ACL configuration file location in Advanced settings. This will allow you to sync documents without ACLs.

Q. I updated the contents of my S3 data source, but Amazon Q Business answers using old data.
A. After content has been updated in your S3 data source location, you must re-sync the contents for the updated data to be picked up by Amazon Q. Go to the Data sources section, select the radio button next to the S3 data source, and choose Sync now. After the sync is complete, verify that the updated data is reflected by running queries on Amazon Q.

Q. I am unable to sign in as a new user through the web experience URL.
A. Clear your browser cookies and sign in as a new user.
Q. I keep trying to sign in but am getting this error:

A. Try signing in from a different browser or clear browser cookies and try again.
Q. What are the supported document formats and what is considered a document in Amazon S3?
A. See Supported document types and What is a document? to learn more.
Call to action
Explore other features in Amazon Q Business such as:

The Amazon Q Business document enrichment feature helps you control both what documents and document attributes are ingested into your index and also how they’re ingested. Using document enrichment, you can create, modify, or delete document attributes and document content when you ingest them into your Amazon Q Business index. For example, you can scrub personally identifiable information (PII) by choosing to delete any document attributes related to PII.
Amazon Q Business features

Filtering using metadata – Use document attributes to customize and control users’ chat experience. Currently supported only if you use the Amazon Q Business API.
Source attribution with citations – Verify responses using Amazon Q Business source attributions.
Upload files and chat – Let users upload files directly into chat and use uploaded file data to perform web experience tasks.
Quick prompts – Feature sample prompts to inform users of the capabilities of their Amazon Q Business web experience.

To improve retrieved results and customize the user chat experience, you can map document attributes from your data sources to fields in your Amazon Q index. Learn more by exploring Amazon Q Business Amazon S3 data source connector field mappings.

Clean up
To avoid incurring future charges and to clean out unused roles and policies, delete the resources you created: the Amazon Q application, data sources, and corresponding IAM roles.

To delete the Amazon Q application, go to the Amazon Q console and, on the Applications page, select your application.
On the Actions drop-down menu, choose Delete.
To confirm deletion, enter delete in the field and choose Delete. Wait until you get the confirmation message; the process can take up to 15 minutes.
To delete the S3 bucket created in Prepare your S3 bucket as a data source, empty the bucket and then follow the steps to delete the bucket.
Delete your IAM Identity Center instance.

Conclusion
This blog post has walked you through the steps to build a secure, permissions-based generative AI solution using Amazon Q and Amazon S3 as the data source. By configuring user groups and mapping their access privileges to different document folders in S3, it demonstrated that Amazon Q respects these access control rules. When users query the AI assistant, it provides comprehensive responses by analyzing only the content their group has permission to view, preventing unauthorized access to restricted information. This solution allows organizations to safely unlock insights from their data repositories using generative AI while ensuring data access governance.
Don’t let your data’s potential go untapped. Continue exploring how Amazon Q can transform your enterprise data to gain actionable insights. Join the conversation and share your thoughts or questions in the comments section below.

About the Author
Kruthi Jayasimha Rao is a Partner Solutions Architect with a focus in AI and ML. She provides technical guidance to AWS Partners in following best practices to build secure, resilient, and highly available solutions in the AWS Cloud.
Keagan Mirazee is a Partner Solutions Architect specializing in Generative AI to assist AWS Partners in engineering reliable and scalable cloud solutions.
Dipti Kulkarni is a Sr. Software Development Engineer for Amazon Q. Dipti is a passionate engineer building connectors for Amazon Q.

Llama 3.1 Released: Meta’s New Open-Source AI Model that You can Fin …

Meta announced the release of Llama 3.1, the most capable model in the Llama series. This latest iteration of the Llama series, particularly the 405B model, represents a substantial advancement in open-source AI capabilities, positioning Meta at the forefront of AI innovation.

Meta has long advocated for open-source AI, a stance underscored by Mark Zuckerberg’s assertion that open-source benefits developers, Meta, and society. Llama 3.1 embodies this philosophy by offering state-of-the-art capabilities in an openly accessible model. The release aims to democratize AI, making cutting-edge technology available to various users and applications.

The Llama 3.1 405B model stands out for its exceptional flexibility, control, and performance, rivaling even the most advanced closed-source models. It is designed to support various applications, including synthetic data generation and model distillation, thus enabling the community to explore new workflows and innovations. With support for eight languages and an expanded context length of 128K, Llama 3.1 is versatile and robust, catering to diverse use cases such as long-form text summarization and multilingual conversational agents.

Meta’s release of Llama 3.1 is bolstered by a comprehensive ecosystem of partners, including AWS, NVIDIA, Databricks, Dell, and Google Cloud, all offering services to support the model from day one. This collaborative approach ensures that users and developers have the tools and platforms to leverage Llama 3.1’s full potential, fostering a thriving environment for AI innovation.

Llama 3.1 introduces new security and safety tools, such as Llama Guard 3 and Prompt Guard. These features are designed to help developers build responsibly, ensuring that AI applications are safe and secure. Meta’s commitment to responsible AI development is further reflected in their request for comment on the Llama Stack API, which aims to standardize and facilitate third-party integration with Llama models.

The development of Llama 3.1 involved rigorous evaluation across over 150 benchmark datasets, spanning multiple languages and real-world scenarios. The 405B model demonstrated competitive performance with leading AI models like GPT-4 and Claude 3.5 Sonnet, showcasing its general knowledge, steerability, math, tool use, and multilingual translation capabilities.

Training the Llama 3.1 405B model was a monumental effort, involving over 16 thousand H100 GPUs and processing over 15 trillion tokens. To ensure efficiency and scalability, Meta optimized the full training stack, adopting a standard decoder-only transformer model architecture with iterative post-training procedures. These processes enhanced the quality of synthetic data generation and model performance, setting new benchmarks for open-source AI.

To improve the model’s helpfulness and instruction-following capabilities, Meta employed a multi-round alignment process involving Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO). Combined with high-quality synthetic data generation and filtering, these techniques enabled Meta to produce a model that excels in both short-context benchmarks and extended 128K context scenarios.

Meta envisions Llama 3.1 as part of a broader AI system that includes various components and tools for developers. This ecosystem approach allows the creation of custom agents and new agentic behaviors, supported by a full reference system with sample applications and new safety models. The ongoing development of the Llama Stack aims to standardize interfaces for building AI toolchain components, promoting interoperability and ease of use.

In conclusion, Meta’s dedication to open-source AI is driven by a belief in its potential to spur innovation and distribute power more evenly across society. The open availability of Llama model weights allows developers to customize, train, and fine-tune models to suit their specific needs, fostering a diverse range of AI applications. Examples of community-driven innovations include AI study buddies, medical decision-making assistants, and healthcare communication tools, all developed using previous Llama models.

Check out the Details and Model. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..


Apple Researchers Propose LazyLLM: A Novel AI Technique for Efficient …

Large Language Models (LLMs) have made a significant leap in recent years, but their inference process faces challenges, particularly in the prefilling stage. The primary issue lies in the time-to-first-token (TTFT), which can be slow for long prompts due to the deep and wide architecture of state-of-the-art transformer-based LLMs. This slowdown occurs because the cost of computing attention increases quadratically with the number of tokens in the prompts. For example, Llama 2 with 7 billion parameters requires 21 times more time for TTFT compared to each subsequent decoding step, accounting for approximately 23% of the total generation time on the LongBench benchmark. Optimizing TTFT has become a critical path toward efficient LLM inference.

Prior studies have explored various approaches to address the challenges of efficient long-context inference and TTFT optimization in LLMs. Some methods focus on modifying transformer architectures, such as replacing standard self-attention with local windowed attention or using locality-sensitive hashing. However, these require significant model changes and retraining. Other techniques optimize the KV cache to accelerate decoding steps but don’t address TTFT. Token pruning approaches, which selectively remove less important tokens during inference, have shown promise in sentence classification tasks. Examples include Learned Token Pruning and width-wise computation reduction. However, these methods were designed for single-iteration processing tasks and need adaptation for generative LLMs. Each approach has limitations, prompting the need for more versatile solutions that can improve TTFT without extensive model modifications.

Researchers from Apple and Meta AI propose LazyLLM, a unique technique to accelerate LLM prefilling by selectively computing the KV cache for important tokens and deferring less crucial ones. It uses attention scores from previous layers to assess token importance and prune progressively. Unlike permanent prompt compression, LazyLLM can revive pruned tokens to maintain accuracy. An Aux Cache mechanism stores pruned tokens’ hidden states, ensuring efficient revival and preventing performance degradation. LazyLLM offers three key advantages: universality (compatible with any transformer-based LLM), training-free implementation, and effectiveness across various language tasks. This method improves inference speed in both prefilling and decoding stages without requiring model modifications or fine-tuning.

The LazyLLM framework is designed to optimize LLM inference through progressive token pruning. The method starts with the full context and gradually reduces computations towards the end of the model by pruning less important tokens. Unlike static pruning, LazyLLM allows the dynamic selection of token subsets in different generation steps, crucial for maintaining performance.

This framework employs layer-wise token pruning in each generation step, using attention maps to determine token importance. It calculates a confidence score for each token and prunes those below a certain percentile. This approach is applied progressively, keeping more tokens in earlier layers and reducing them towards the end of the transformer.
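To make the mechanism concrete, the following is a toy sketch (not Apple's implementation; the tensor shapes and the keep_percentile parameter are illustrative assumptions) of attention-based token pruning at a single layer:

import torch

def prune_tokens(hidden_states, attn_weights, keep_percentile):
    # hidden_states: (seq_len, hidden_dim); attn_weights: (num_heads, seq_len, seq_len)
    importance = attn_weights.mean(dim=0).mean(dim=0)      # average attention each token receives
    threshold = torch.quantile(importance, 1.0 - keep_percentile)
    keep_mask = importance >= threshold                    # keep only the most-attended tokens
    return hidden_states[keep_mask], keep_mask             # pruned states plus which tokens survived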

To overcome the challenges in extending pruning to decoding steps, LazyLLM introduces an Aux Cache mechanism. This cache stores hidden states of pruned tokens, allowing efficient retrieval without recomputation. During decoding, the model first accesses the KV cache for existing tokens and retrieves hidden states from the Aux Cache for pruned tokens. Also, this implementation ensures each token is computed at most once per transformer layer, guaranteeing that LazyLLM’s worst-case runtime is not slower than the baseline. The method’s dynamic nature and efficient caching mechanism contribute to its effectiveness in optimizing both the prefilling and decoding stages of LLM inference.

LazyLLM demonstrates significant improvements in LLM inference efficiency across various language tasks. It achieves substantial TTFT speedups (up to 2.89x for Llama 2 and 4.77x for XGen) while maintaining accuracy close to baseline levels. The method outperforms other approaches like random token drop, static pruning, and prompt compression in speed-accuracy trade-offs. LazyLLM’s effectiveness spans multiple tasks, including QA, summarization, and code completion. It often computes less than 100% of prompt tokens, leading to reduced overall computation and improved generation speeds. The progressive pruning strategy, informed by layer-wise analysis, contributes to its superior performance. These results highlight LazyLLM’s capacity to optimize LLM inference without compromising accuracy.

LazyLLM, an innovative technique for efficient LLM inference, particularly in long context scenarios, selectively computes KV for important tokens and defers computation of less relevant ones. Extensive evaluation across various tasks demonstrates that LazyLLM significantly reduces TTFT while maintaining performance. A key advantage is its seamless integration with existing transformer-based LLMs, improving inference speed without fine-tuning. By dynamically prioritizing token computation based on relevance, LazyLLM offers a practical solution to enhance LLM efficiency, addressing the growing demand for faster and more resource-efficient language models in diverse applications.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..


Progressive Learning Framework for Enhancing AI Reasoning through Weak-to-Strong Supervision

As large language models surpass human-level capabilities, providing accurate supervision becomes increasingly difficult. Weak-to-strong learning, in which a less capable model is used to improve a stronger one, offers potential benefits but has yet to be validated on complex reasoning tasks. The approach also lacks efficient techniques to prevent the stronger model from imitating the weaker model's errors. As AI progresses toward Artificial General Intelligence (AGI), building superintelligent systems raises significant challenges, particularly around supervision and learning paradigms: conventional methods that rely on human oversight or guidance from more advanced models become inadequate once AI capabilities surpass those of their supervisors.

Researchers from Shanghai Jiao Tong University, Fudan University, Shanghai AI Laboratory, and GAIR have developed a progressive learning framework that allows strong models to refine their training data autonomously. This approach begins with supervised fine-tuning on a small, high-quality dataset, followed by preference optimization using contrastive samples identified by the strong model. Experiments on the GSM8K and MATH datasets show significant improvements in the reasoning abilities of Llama2-70b using three different weak models. The framework’s effectiveness is further demonstrated with Llama3-8b-instruct supervising Llama3-70b on the challenging OlympicArena dataset, paving the way for enhanced AI reasoning strategies.

LLMs improve task-solving and alignment with human instructions through supervised fine-tuning (SFT), which relies on high-quality training data for substantial performance gains. This study examines the potential of learning from weak supervision instead. Aligning LLMs with human values also relies on reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). DPO simplifies RLHF by reparameterizing the reward function directly in terms of the policy, and it has stable, performant variants such as ORPO and SimPO. In mathematical reasoning, prior work focuses on prompting techniques and on generating high-quality question-answer pairs for fine-tuning, which significantly improves problem-solving capabilities.
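Since DPO underpins the preference-learning stage used later in the method, a generic sketch of its objective may help. This is the standard DPO loss rather than the paper's specific training code, and the arguments (summed per-response log-probabilities under the policy and a frozen reference model) are assumptions about how a batch would be prepared.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over a batch of (chosen, rejected) response pairs.

    Each argument is a 1-D tensor of summed log-probabilities of the chosen or
    rejected response under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much more likely the policy makes a response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()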

The weak-to-strong training method aims to extract as much value as possible from the weak data while enhancing the strong model's abilities. In Stage I, potentially positive samples are identified without access to ground truth and used for supervised fine-tuning. Stage II uses the full weak dataset, focusing on potentially negative samples through preference-learning approaches such as DPO; the strong model is thereby refined by learning from the weak model's mistakes. The strong model's responses are sampled, and its confidence is used to decide which answers are reliable. Contrastive samples are then constructed for further training, helping the strong model distinguish correct from incorrect solutions and yielding an improved model.
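The sketch below illustrates one way the two stages could be wired together. The selection heuristics (majority-vote confidence, agreement with the weak label) and the `strong_model.generate` interface are illustrative assumptions, not the authors' exact criteria or code.

```python
from collections import Counter

def stage_one_filter(weak_data, strong_model, n_samples=8, min_agreement=0.5):
    """Stage I (sketch): keep weak-labeled problems that look reliable enough for SFT."""
    positives = []
    for question, weak_answer in weak_data:
        samples = [strong_model.generate(question) for _ in range(n_samples)]
        agreement = sum(s == weak_answer for s in samples) / n_samples
        if agreement >= min_agreement:       # "potentially positive" sample
            positives.append((question, weak_answer))
    return positives                          # used for supervised fine-tuning

def stage_two_contrastive_pairs(weak_data, strong_model, n_samples=8):
    """Stage II (sketch): build (chosen, rejected) pairs for DPO-style training."""
    pairs = []
    for question, _ in weak_data:
        samples = [strong_model.generate(question) for _ in range(n_samples)]
        majority, _count = Counter(samples).most_common(1)[0]
        rejected = next((s for s in samples if s != majority), None)
        if rejected is not None:
            # Treat the strong model's most confident (majority) answer as preferred.
            pairs.append({"prompt": question, "chosen": majority, "rejected": rejected})
    return pairs                              # fed to a DPO-style objective
```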

The experiments use the GSM8K and MATH datasets, with subsets Dgold,1 and Dgold,2 used to train the weak and strong models. Initial training on GSM8K was augmented with additional data, while the MATH data faced limitations due to its complexity. Iterative fine-tuning improved the weak models, which in turn lifted strong-model performance. Preference-learning methods produced significant improvements, particularly on GSM8K, and further analysis showed better generalization on simpler problems. Tests with Llama3 models on OlympicArena, a more challenging dataset, demonstrated that the proposed weak-to-strong learning method is effective and scalable in realistic scenarios.

In conclusion, the study investigates the effectiveness of the weak-to-strong framework on complex reasoning tasks, presenting a method that leverages weak supervision to develop strong capabilities without human or advanced-model annotations. The strong model refines its training data independently, even without prior task knowledge, progressively enhancing its reasoning skills through iterative learning. This self-directed data curation is essential for advancing AI reasoning capabilities, promoting model independence and efficiency. The study highlights the role of innovative model supervision in AI development, particularly on the path toward AGI. Limitations include using current models as proxies for future, more advanced models and the challenges posed by errors and noise in process-level supervision.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..
