MaPO: The Memory-Friendly Maestro – A New Standard for Aligning Generative Models with Diverse Preferences

Machine learning has achieved remarkable advancements, particularly in generative models like diffusion models. These models are designed to handle high-dimensional data, including images and audio. Their applications span various domains, such as art creation and medical imaging, showcasing their versatility. The primary focus has been on enhancing these models to better align with human preferences, ensuring that their outputs are useful and safe for broader applications.

Despite significant progress, current generative models often struggle to align with human preferences. This misalignment can produce outputs that are useless or potentially harmful. The central challenge is to fine-tune these models so they consistently produce desirable and safe outputs without compromising their generative abilities.

Existing research includes reinforcement learning techniques and preference optimization strategies such as Diffusion-DPO and supervised fine-tuning (SFT). Methods like Proximal Policy Optimization (PPO) and models like Stable Diffusion XL (SDXL) have been employed, and frameworks such as Kahneman-Tversky Optimization (KTO) have been adapted for text-to-image diffusion models. While these approaches improve alignment with human preferences, they often struggle to handle diverse stylistic discrepancies and to manage memory and computational resources efficiently.

Researchers from the Korea Advanced Institute of Science and Technology (KAIST), Korea University, and Hugging Face have introduced a novel method called Margin-Aware Preference Optimization (MaPO). The method fine-tunes diffusion models more effectively by integrating preference data directly into the training process. The research team conducted extensive experiments to validate their approach, showing that it surpasses existing methods in both alignment and efficiency.

MaPO enhances diffusion models by incorporating a preference dataset during training. This dataset captures the human preferences the model must align with, such as safety and stylistic choices. The method uses a loss function that prioritizes preferred outcomes while penalizing less desirable ones, so the fine-tuned model generates outputs that closely match human expectations, making it a versatile tool across domains. Unlike traditional methods, MaPO does not rely on any reference model. By maximizing the likelihood margin between preferred and dispreferred image sets, it learns general stylistic features and preferences without overfitting to the training data, which keeps the method memory-friendly and efficient and suitable for a range of applications.
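While the exact objective in the paper is more elaborate, the core idea of widening a likelihood margin between preferred and dispreferred images without a reference model can be sketched roughly as follows (an illustrative simplification in PyTorch, not the authors' implementation; tensor shapes and the beta weight are assumptions):

import torch
import torch.nn.functional as F

def margin_preference_loss(eps_pred_w, eps_w, eps_pred_l, eps_l, beta=1.0):
    """Sketch of a reference-free margin loss for a diffusion model.

    eps_pred_w / eps_pred_l: noise predicted for the preferred / dispreferred
    images at the same timestep; eps_w / eps_l: the actual injected noise.
    """
    # Per-sample denoising errors serve as negative log-likelihood proxies.
    err_w = F.mse_loss(eps_pred_w, eps_w, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(eps_pred_l, eps_l, reduction="none").mean(dim=(1, 2, 3))
    # Margin term: push the preferred error below the dispreferred one.
    margin = -F.logsigmoid(beta * (err_l - err_w)).mean()
    # Keep fitting the preferred samples so generative quality is preserved.
    return err_w.mean() + margin

Because no frozen reference copy of the model has to be kept in memory, an objective of this shape is what makes the approach comparatively memory-friendly.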

The performance of MaPO has been evaluated on several benchmarks. It demonstrated superior alignment with human preferences, achieving higher scores in safety and stylistic adherence. MaPO scored 6.17 on the Aesthetics benchmark and reduced training time by 14.5%, highlighting its efficiency. Moreover, the method surpassed the base Stable Diffusion XL (SDXL) and other existing methods, proving its effectiveness in generating preferred outputs consistently.

The MaPO method represents a significant advancement in aligning generative models with human preferences. Researchers have developed a more efficient and effective solution by integrating preference data directly into the training process. This method enhances the safety and usefulness of model outputs and sets a new standard for future developments in this field.

Overall, the research underscores the importance of direct preference optimization in generative models. MaPO’s ability to handle reference mismatches and adapt to diverse stylistic preferences makes it a valuable tool for various applications. The study opens new avenues for further exploration in preference optimization, paving the way for more personalized and safe generative models in the future.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post MaPO: The Memory-Friendly Maestro – A New Standard for Aligning Generative Models with Diverse Preferences appeared first on MarkTechPost.

Leveraging Machine Learning and Process-Based Models for Soil Organic Carbon Prediction: A Comparative Study and the Role of ChatGPT in Soil Science

In recent years, ML algorithms have increasingly been recognized in ecological modeling, including for predicting soil organic carbon (SOC). However, their application to the smaller datasets typical of long-term soil research has yet to be extensively evaluated, particularly in comparison to traditional process-based models. A study conducted in Austria compared ML algorithms such as Random Forest and Support Vector Machines against process-based models such as RothC and ICBM, using data from five long-term experimental sites. The findings revealed that ML algorithms performed better when large datasets were available, but their accuracy declined with smaller training sets or under more rigorous cross-validation schemes such as leave-one-site-out. Process-based models, while requiring careful calibration, better represent the biophysical and biochemical mechanisms underlying SOC dynamics. The study thus recommended combining ML algorithms with process-based models to leverage their respective strengths for robust SOC predictions across different scales and conditions.

SOC is vital for soil health, so maintaining and increasing SOC levels are essential for boosting soil fertility, improving resilience to climate change, and reducing carbon emissions. We need dependable monitoring systems and predictive models to achieve these objectives, especially in light of changing environmental conditions and land-use practices. ML and process-based models both play critical roles in this endeavor. ML is particularly useful with large datasets, while process-based models provide comprehensive insights into soil mechanisms. By combining these approaches, we can mitigate the shortcomings of each and achieve more precise and adaptable predictions, which are crucial for effective soil management and environmental conservation worldwide.

Methods and Materials:

The study utilized data from five long-term field experiments across Austria, spanning various management practices aimed at SOC accumulation. These experiments covered 53 treatment variants and provided detailed information on soil characteristics, climate data, and management practices. Soil samples were collected from depths of 0-25 cm, depending on the site. Daily climate data, including temperature, precipitation, and evaporation, were sourced from high-quality datasets. Process-based SOC models like RothC, AMG.v2, ICBM, and C-TOOL were employed alongside machine learning algorithms (Random Forest, SVMs, and Gaussian process regression) for predicting SOC dynamics.
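To make the evaluation setup concrete, a minimal sketch of a leave-one-site-out comparison for one of the ML algorithms could look like the following (illustrative only; the file name, feature columns, and hyperparameters are placeholders, not the study's actual code):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Hypothetical table of treatment-year observations from the five sites.
df = pd.read_csv("soc_long_term_experiments.csv")
features = ["clay", "ph", "mean_temperature", "precipitation", "carbon_input"]
X, y, sites = df[features], df["soc"], df["site"]

model = RandomForestRegressor(n_estimators=500, random_state=0)

# Leave-one-site-out: train on four sites, predict the held-out fifth.
logo = LeaveOneGroupOut()
scores = cross_val_score(model, X, y, groups=sites, cv=logo,
                         scoring="neg_root_mean_squared_error")
print("RMSE per held-out site:", -scores)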

Research Methodology Overview:

The research, conducted between February 25 and March 5, 2023, evaluated ChatGPT's ability to answer fundamental questions in modern soil science. Four sets of ChatGPT responses were assessed: free ChatGPT-3.5, short and long answers from paid ChatGPT-3.5 (Pro-a and Pro-b), and answers from paid ChatGPT-4.0. Responses were initiated with a prompt to "Act as a soil scientist" and, if a response timed out, followed by "Continue." The expert evaluation involved five specialists rating answers on a scale of 0 to 100, with final scores averaged. Additionally, a Likert-scale survey gathered perceptions from 73 soil scientists regarding ChatGPT's knowledge and reliability, yielding responses from 50 participants for analysis.

Summary of SOC Sequestration and Modeling Approaches:

The observed annual sequestration rates at five Austrian sites align with other studies and cover a range of soil and climate conditions typical for Central-Eastern Europe. The study found that certain ML algorithms, like Random Forest and SVM with a polynomial kernel, outperformed process-based models due to their ability to capture non-linear relationships. Combining ML with process-based models improved predictions. For robust SOC modeling, uncalibrated models are recommended when data is scarce, calibrated models with cross-validation when data is adequate, and ML models when data is abundant. Accurate SOC modeling necessitates comprehensive, long-term datasets encompassing various agricultural practices and conditions.

Perceptions and Contributions of ChatGPT in Soil Science:

A study exploring the perceptions of Indonesian soil scientists towards ChatGPT revealed significant findings. The surveyed community consists of 64% males and 36% females, with the majority (88%) having formal education in soil science. Most respondents (76%) are aware of ChatGPT, and 60% have used it, primarily valuing its potential to aid research and academic writing. While 86% do not consider ChatGPT fraudulent, they agree its output requires verification and paraphrasing before use in scientific contexts. ChatGPT-4.0 was rated highly for the accuracy of its answers, particularly in English. Despite confidence in ChatGPT's potential to advance soil science, respondents emphasize the necessity of human oversight to ensure the tool's responsible and effective use.


Conclusions on the Use of ChatGPT in Soil Science and Machine Learning for SOC Prediction:

The research highlights the valuable role of ChatGPT and ML in soil science. Indonesian soil scientists express over 80% trust in ChatGPT, favoring ChatGPT-4.0 for its superior accuracy in aiding research and education, though the free and paid versions of ChatGPT-3.5 are also considered reliable. However, the perceived accuracy of ChatGPT responses is only around 55%, indicating room for improvement. Concurrently, non-linear ML models such as Random Forest, especially when combined with process-based models, show promise in predicting SOC dynamics, particularly on datasets from long-term agricultural studies. Integrating ML with expert knowledge could enhance the precision of SOC forecasts, underlining the importance of human oversight and model refinement.

Sources:

https://www.sciencedirect.com/science/article/pii/S2666544124000194

https://www.sciencedirect.com/science/article/pii/S1871678424000086

The post Leveraging Machine Learning and Process-Based Models for Soil Organic Carbon Prediction: A Comparative Study and the Role of ChatGPT in Soil Science appeared first on MarkTechPost.

Microsoft Releases Florence-2: A Novel Vision Foundation Model with a Unified, Prompt-based Representation for a Variety of Computer Vision and Vision-Language Tasks

There has been a marked movement in the field of AGI systems towards using pretrained, adaptable representations known for their task-agnostic benefits in various applications. Natural language processing (NLP) is a clear example of this tendency, since sophisticated models can adapt to new tasks and domains with only basic instructions. The success of NLP inspires a similar strategy in computer vision.

One of the main obstacles to universal representation for various vision-related tasks is the requirement for broad perceptual ability. In contrast to natural language processing (NLP), computer vision works with complex visual data such as object locations, masked contours, and attributes. Achieving universal representation in computer vision therefore requires mastery of a wide range of challenging tasks. The lack of thorough visual annotations is a major obstacle to building a foundation model that can capture the subtleties of spatial hierarchy and semantic granularity. A further obstacle is the current lack of a unified pretraining framework in computer vision that uses a single network architecture to integrate semantic granularity and spatial hierarchy seamlessly.

A team of Microsoft researchers introduces Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. It addresses the lack of a consistent architecture and of comprehensive annotated data by creating a single, prompt-based representation for all vision tasks. Multitask learning requires annotated data of high quality and broad scale, so the team built a data engine that produces FLD-5B, a comprehensive visual dataset with a total of 5.4B annotations for 126M images, a significant improvement over labor-intensive manual annotation. The engine's two processing modules are highly efficient. Instead of relying on a single person to annotate each image, as was done in the past, the first module employs specialized models that annotate automatically and in collaboration. When numerous models collaborate to reach a consensus, the result is a more trustworthy and objective interpretation of each image, reminiscent of the wisdom-of-crowds idea.

The Florence-2 model stands out for its unique design. It integrates an image encoder and a multi-modality encoder-decoder into a sequence-to-sequence (seq2seq) architecture, following the NLP community's goal of developing flexible models within a consistent framework. This architecture handles a variety of vision tasks without task-specific architectural alterations. Standardizing all annotations in the FLD-5B dataset into textual outputs enables unified multitask learning with consistent optimization under a single loss function. As a result, Florence-2 is a multi-purpose vision foundation model that can caption, detect, and ground objects using a single model and set of parameters, activated by textual prompts.
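For readers who want to try the released checkpoints, the prompt-based interface looks roughly like the following, based on the public Hugging Face model card for microsoft/Florence-2-large (the task prompt, placeholder image URL, and post-processing call should be treated as assumptions and checked against the current model card):

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/street_scene.jpg",  # placeholder URL
                                stream=True).raw)

prompt = "<OD>"  # the task (here, object detection) is selected purely by the text prompt
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(input_ids=inputs["input_ids"],
                               pixel_values=inputs["pixel_values"],
                               max_new_tokens=512)
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(text, task=prompt,
                                            image_size=(image.width, image.height))
print(result)  # bounding boxes and labels for the "<OD>" task

Swapping the prompt for a captioning or grounding tag reuses the same weights, which is the point of the unified representation.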

Despite its compact size, Florence-2 competes with much larger specialized models. After fine-tuning on publicly available human-annotated data, Florence-2 achieves new state-of-the-art performance on the RefCOCO/+/g benchmarks. As a pre-trained backbone, it also outperforms supervised and self-supervised models on downstream tasks, including COCO object detection and instance segmentation and ADE20K semantic segmentation, with gains of 6.9, 5.5, and 5.9 points under the Mask R-CNN, DINO, and UperNet frameworks, respectively, while training is 4 times more efficient than with ImageNet pre-trained models. This performance is a testament to the effectiveness and reliability of Florence-2.

Florence-2, with its pre-trained universal representation, has proven to be highly effective. The experimental results demonstrate its prowess in improving a multitude of downstream tasks, instilling confidence in its capabilities. 

Check out the Paper and Model Card. All credit for this research goes to the researchers of this project.
The post Microsoft Releases Florence-2: A Novel Vision Foundation Model with a Unified, Prompt-based Representation for a Variety of Computer Vision and Vision-Language Tasks appeared first on MarkTechPost.

Open-Sora 1.2 by HPC AI Tech: Transforming Video Generation With Advanced, Open-Source Video Generation and Compression

Open-Sora, an initiative by HPC AI Tech, is a significant step toward democratizing efficient video production. By embracing open-source principles, Open-Sora aims to make advanced video generation techniques accessible to everyone, fostering innovation, creativity, and inclusivity in content creation.

Open-Sora 1.0 and 1.1

Open-Sora 1.0 laid the groundwork for this project, offering a full pipeline for video data preprocessing, training, and inference. It supports generating videos up to 2 seconds long at 512×512 resolution with a minimal training cost. Following this, Open-Sora 1.1 expanded capabilities to support 2-15 second videos, ranging from 144p to 720p, and various aspect ratios. It introduced a comprehensive video processing pipeline, including scene cutting, filtering, and captioning, making it easier for users to build their video datasets.

Key Features of Open-Sora

Open-Sora aims to simplify the complexities of video generation by providing a streamlined and user-friendly platform. Its primary features include:

Text-to-Video Generation: Users can generate videos based on textual descriptions.

Image-to-Video Generation: This feature allows images to be transformed into video sequences.

Video-to-Video Translation: Users can convert one video format to another with ease.

Open-Sora 1.2 Enhancements

Open-Sora 1.2 introduces several notable improvements over its predecessors. It includes a 3D-VAE model, rectified flow, and score conditioning, significantly enhancing video quality. The update also focuses on better data handling and multi-stage training, ensuring the model can handle more complex tasks efficiently.

Video Compression Network: Following the approach outlined for OpenAI's Sora, the new version introduces a video compression network that reduces the temporal dimension of the video latents without sacrificing frame rate. This results in smoother, high-quality video output.

Rectified Flow Training: Adopting techniques from the latest diffusion models, Open-Sora 1.2 includes rectified flow training, enhancing the performance and quality of generated videos.
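Rectified flow itself is a generic training recipe: the model learns a velocity field along the straight path between data and noise. A minimal sketch of one training step is shown below (not Open-Sora's actual code; the model signature is an assumption):

import torch

def rectified_flow_step(model, x0, optimizer):
    """x0: a batch of clean video latents, e.g. shape [B, C, T, H, W]."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)      # uniform timesteps in [0, 1)
    t_b = t.view(-1, *([1] * (x0.dim() - 1)))           # broadcastable shape

    x_t = (1.0 - t_b) * x0 + t_b * noise                # straight-line interpolation
    target_velocity = noise - x0                         # d x_t / d t along that line

    pred_velocity = model(x_t, t)                        # assumed model signature
    loss = torch.mean((pred_velocity - target_velocity) ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()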

Evaluation Metrics: Open-Sora 1.2 supports advanced evaluation metrics like validation loss, VBench score, and VBench-i2v score, ensuring comprehensive assessment during the training process. The improvements in evaluation can be seen in the higher quality and semantic scores compared to previous versions.


The training process for Open-Sora 1.2 remains similar to earlier versions but with enhanced configurations. The model is trained on over 30 million data points, utilizing 80,000 GPU hours, and supports various video resolutions and aspect ratios. The command line for inference supports multiple configurations, including text-to-video and image-to-video generation.


Open-Sora 1.2 provides model weights and a detailed installation guide, ensuring users can deploy the system easily. The installation process supports various CUDA versions and includes dependencies for data preprocessing, VAE, and model evaluation.

Conclusion

Open-Sora 1.2 by HPC AI Tech is a robust and innovative solution for video generation, incorporating state-of-the-art techniques and open-source accessibility. With its continuous improvements and community-driven approach, Open-Sora is poised to revolutionize content creation.

Sources

https://huggingface.co/spaces/hpcai-tech/open-sora

https://github.com/hpcaitech/Open-Sora

https://x.com/AdeenaY8/status/1803006922674557108

https://github.com/hpcaitech/Open-Sora/tree/main

The post Open-Sora 1.2 by HPC AI Tech: Transforming Video Generation With Advanced, Open-Source Video Generation and Compression appeared first on MarkTechPost.

Manage Amazon SageMaker JumpStart foundation model access with private hubs

Amazon SageMaker JumpStart is a machine learning (ML) hub offering pre-trained models and pre-built solutions. It provides access to hundreds of foundation models (FMs). A private hub is a feature in SageMaker JumpStart that allows an organization to share its models and notebooks so as to centralize model artifacts, facilitate discoverability, and increase reuse within the organization. With new models released daily, many enterprise admins want more control over the FMs that can be discovered and used by users within their organization (for example, only allowing models based on the PyTorch framework to be discovered).
Now enterprise admins can effortlessly configure granular access control over the FMs that SageMaker JumpStart provides out of the box so that only allowed models can be accessed by users within their organizations. In this post, we discuss the steps required for an administrator to configure granular access control of models in SageMaker JumpStart using a private hub, as well as the steps for users to access and consume models from the private hub.
Solution overview
Starting today, with SageMaker JumpStart and its private hub feature, administrators can create repositories for a subset of models tailored to different teams, use cases, or license requirements using the Amazon SageMaker Python SDK. Admins can also set up multiple private hubs with different lists of models discoverable for different groups of users. Users are then only able to discover and use models within the private hubs they have access to through Amazon SageMaker Studio and the SDK. This level of control empowers enterprises to consume the latest in open weight generative artificial intelligence (AI) development while enforcing governance guardrails. Finally, admins can share access to private hubs across multiple AWS accounts, enabling collaborative model management while maintaining centralized control. SageMaker JumpStart uses AWS Resource Access Manager (AWS RAM) to securely share private hubs with other accounts in the same organization. The new feature is available in the us-east-2 AWS Region as of writing, and will be available to more Regions soon.
The following diagram shows an example architecture of SageMaker JumpStart with its public and private hub features. The diagram illustrates how SageMaker JumpStart provides access to different model repositories, with some users accessing the public SageMaker JumpStart hub and others using private curated hubs.

In the following section, we demonstrate how admins can configure granular access control of models in SageMaker JumpStart using a private hub. Then we show how users can access and consume allowlisted models in the private hub using SageMaker Studio and the SageMaker Python SDK. Finally, we look at how an admin user can share the private hub with users in another account.
Prerequisites
To use the SageMaker Python SDK and run the code associated with this post, you need the following prerequisites:

An AWS account that contains all your AWS resources
An AWS Identity and Access Management (IAM) role with access to SageMaker Studio notebooks
SageMaker JumpStart enabled in a SageMaker Studio domain

Create a private hub, curate models, and configure access control (admins)
This section provides a step-by-step guide for administrators to create a private hub, curate models, and configure access control for your organization’s users.

Because the feature has been integrated in the latest SageMaker Python SDK, to use the model granular access control feature with a private hub, let's first update the SageMaker Python SDK: !pip3 install sagemaker --force-reinstall --quiet
Next, import the SageMaker and Boto3 libraries: import boto3
from sagemaker import Session
from sagemaker.jumpstart.hub.hub import Hub
Configure your private hub: HUB_NAME="CompanyHub"
HUB_DISPLAY_NAME="Allowlisted Models"
HUB_DESCRIPTION="These are allowlisted models taken from the JumpStart Public Hub."
REGION="<your_region_name>" # for example, "us-west-2" In the preceding code, HUB_NAME specifies the name of your Hub. HUB_DISPLAY_NAME is the display name for your hub that will be shown to users in UI experiences. HUB_DESCRIPTION is the description for your hub that will be shown to users.
Set up a Boto3 client for SageMaker: sm_client = boto3.client('sagemaker')
session = Session(sagemaker_client=sm_client)
session.get_caller_identity_arn()
Check if the following policies have already been added to your admin IAM role; if not, you can add them as inline policies: {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
            ],
            "Effect": "Allow"
        }
    ]
} Replace the <REGION> placeholder using the configurations in Step 3. In addition to setting up IAM permissions to the admin role, you need to scope down permissions for your users so they can't access public contents.
Use the following policy to deny access to the public hub for your users. These can be added as inline policies in the user's IAM role: {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "s3:*",
            "Effect": "Deny",
            "Resource": [
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
            ],
            "Condition": {
                "StringNotLike": {"s3:prefix": ["*.ipynb", "*/eula.txt"]}
            }
        },
        {
            "Action": "sagemaker:*",
            "Effect": "Deny",
            "Resource": [
                "arn:aws:sagemaker:<REGION>:aws:hub/SageMakerPublicHub",
                "arn:aws:sagemaker:<REGION>:aws:hub-content/SageMakerPublicHub/*/*"
            ]
        }
    ]
}
Replace the <REGION> placeholder in the policy using the configurations in Step 3. After you have set up the private hub configuration and permissions, you’re ready to create the private hub.
Use the following code to create the private hub within your AWS account in the Region you specified earlier: hub = Hub(hub_name=HUB_NAME, sagemaker_session=session)

try:
    hub.create(
        description=HUB_DESCRIPTION,
        display_name=HUB_DISPLAY_NAME
    )
    print(f"Successfully created Hub with name {HUB_NAME} in {REGION}")
except Exception as e:
    if "ResourceInUse" in str(e):
        print(f"A hub with the name {HUB_NAME} already exists in your account.")
    else:
        raise e

Use hub.describe() to verify the configuration of your hub. After your private hub is set up, you can add a reference to models from the SageMaker JumpStart public hub to your private hub. No model artifacts need to be managed by the customer. The SageMaker team will manage any version or security updates. For a list of available models, refer to Built-in Algorithms with pre-trained Model Table.
To search programmatically, run the command filter_value = "framework == meta"
response = hub.list_sagemaker_public_hub_models(filter=filter_value)
models = response["hub_content_summaries"]
while response["next_token"]:
    response = hub.list_sagemaker_public_hub_models(filter=filter_value,
                                                    next_token=response["next_token"])
    models.extend(response["hub_content_summaries"])

print(models)
The filter argument is optional. For a list of filters you can apply, refer to SageMaker Python SDK.
Use the retrieved models from the preceding command to create model references for your private hub: for model in models:
    print(f"Adding {model.get('hub_content_name')} to Hub")
    hub.create_model_reference(model_arn=model.get("hub_content_arn"),
                               model_name=model.get("hub_content_name")) The SageMaker JumpStart private hub offers other useful features for managing and interacting with the curated models. Administrators can check the metadata of a specific model using the hub.describe_model(model_name=<model_name>) command. To list all available models in the private hub, you can use a simple loop: response = hub.list_models()
models = response["hub_content_summaries"]
while response["next_token"]:
    response = hub.list_models(next_token=response["next_token"])
    models.extend(response["hub_content_summaries"])

for model in models:
    print(model.get('HubContentArn'))
If you need to remove a specific model reference from the private hub, use the following command: hub.delete_model_reference("<model_name>") If you want to delete the private hub from your account and Region, you'll need to delete all the HubContents first, then delete the private hub. Use the following code: for model in models:
    hub.delete_model_reference(model_name=model.get('HubContentName'))

hub.delete()

Interact with allowlisted models (users)
This section offers a step-by-step guide for users to interact with allowlisted models in SageMaker JumpStart. We demonstrate how to list available models, identify a model from the public hub, and deploy the model to endpoints from SageMaker Studio as well as the SageMaker Python SDK.
User experience in SageMaker Studio
Complete the following steps to interact with allowlisted models using SageMaker Studio:

 On the SageMaker Studio console, choose JumpStart in the navigation pane or in the Prebuilt and automated solutions section.
Choose one of the model hubs you have access to. If you have access to multiple hubs, you'll see a list of hubs, as shown in the following screenshot. If you have access to only one hub, you'll go straight to the model list. You can view the model details and supported actions like train, deploy, and evaluate.
To deploy a model, choose Deploy.
Modify your model configurations like instances and deployment parameters, and choose Deploy.

User experience using the SageMaker Python SDK
To interact with your models using the SageMaker Python SDK, complete the following steps:

Just like the admin process, the first step is to force reinstall the SageMaker Python SDK: !pip3 install sagemaker --force-reinstall --quiet
Import the SageMaker and Boto3 libraries: import boto3
from sagemaker import Session
from sagemaker.jumpstart.hub.hub import Hub
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.jumpstart.estimator import JumpStartEstimator
To access the models in your private hub, you need the Region and the name of the hub on your account. Fill out the HUB_NAME and REGION fields with the information provided by your administrator: HUB_NAME="CompanyHub"
REGION="<your_region_name>" # for example, "us-west-2"
sm_client = boto3.client('sagemaker')
sm_runtime_client = boto3.client('sagemaker-runtime')
session = Session(sagemaker_client=sm_client,
                  sagemaker_runtime_client=sm_runtime_client)
hub = Hub(hub_name=HUB_NAME, sagemaker_session=session)
List the models available in your private hub using the following command: response = hub.list_models()
models = response["hub_content_summaries"]
while response["next_token"]:
    response = hub.list_models(next_token=response["next_token"])
    models.extend(response["hub_content_summaries"])

print(models)
To get more information about a particular model, use the describe_model method: model_name = "huggingface-llm-phi-2"
response = hub.describe_model(model_name=model_name)
print(response)
You can deploy models in a hub with the Python SDK by using JumpStartModel. To deploy a model from the hub to an endpoint and invoke the endpoint with the default payloads, run the following code. To select which model from your hub you want to use, pass in a model_id and version. If you pass in * for the version, it will take the latest version available for that model_id in the hub. If you're using a model gated behind a EULA agreement, pass in accept_eula=True. model_id, version = "huggingface-llm-phi-2", "1.0.0"
model = JumpStartModel(model_id, version, hub_name=HUB_NAME,
                       region=REGION, sagemaker_session=session)
predictor = model.deploy(accept_eula=False)
To invoke your deployed model with the default payloads, use the following code: example_payloads = model.retrieve_all_examples()
for payload in example_payloads:
    response = predictor.predict(payload.body)
    print("\nInput\n", payload.body, "\n\nOutput\n",
          response[0]["generated_text"], "\n\n===============")
To delete the model endpoints that you created, use the following code: predictor.delete_model()
predictor.delete_endpoint()

Cross-account sharing of private hubs
SageMaker JumpStart private hubs support cross-account sharing, allowing you to extend the benefits of your curated model repository beyond your own AWS account. This feature enables collaboration across different teams or departments within your organization, even when they operate in separate AWS accounts. By using AWS RAM, you can securely share your private hubs while maintaining control over access.
To share your private hub across accounts, complete the following steps:

On the AWS RAM console, choose Create resource share.
When specifying resource share details, choose the SageMaker hub resource type and select one or more private hubs that you want to share. When you share a hub with any other account, all of its contents are also shared implicitly.
Associate permissions with your resource share.
Use AWS account IDs to specify the accounts to which you want to grant access to your shared resources.
Review your resource share configuration and choose Create resource share.

It may take a few minutes for the resource share and principal associations to complete.
Admins that want to perform the preceding steps programmatically can enter the following command to initiate the sharing:
# create a resource share using the private hub
aws ram create-resource-share \
    --name test-share \
    --resource-arns arn:aws:sagemaker:<region>:<resource_owner_account_id>:hub/<hub_name> \
    --principals <consumer_account_id> \
    --region <region>
Replace the <resource_owner_account_id>, <consumer_account_id>, <hub_name>, and <region> placeholders with the appropriate values for the resource owner account ID, consumer account ID, name of the hub, and Region to use.
After you set up the resource share, the specified AWS account will receive an invitation to join. They must accept this invitation through AWS RAM to gain access to the shared private hub. This process makes sure access is granted only with explicit consent from both the hub owner and the recipient account. For more information, refer to Using shared AWS resources.
You can also perform this step programmatically:
# list resource shares
aws ram get-resource-share-invitations \
    --region <region>

# accept resource share
# using the arn from the previous response
aws ram accept-resource-share-invitation \
    --resource-share-invitation-arn <arn_from_previous_request> \
    --region <region>

For detailed instructions on creating resource shares and accepting invitations, refer to Creating a resource share in AWS RAM. By extending your private hub across accounts, you can foster collaboration and maintain consistent model governance across your entire organization.
Conclusion
SageMaker JumpStart allows enterprises to adopt FMs while maintaining granular control over model access and usage. By creating a curated repository of approved models in private hubs, organizations can align their AI initiatives with corporate policies and regulatory requirements. The private hub decouples model curation from model consumption, enabling administrators to manage the model inventory while data scientists focus on developing AI solutions.
This post explained the private hub feature in SageMaker JumpStart and provided steps to set up and use a private hub, with minimal additional configuration required. Administrators can select models from the public SageMaker JumpStart hub, add them to the private hub, and manage user access through IAM policies. Users can then deploy these preapproved models, fine-tune them on custom datasets, and integrate them into their applications using familiar SageMaker interfaces. The private hub uses the SageMaker underlying infrastructure, allowing it to scale with enterprise-level ML demands.
For more information about SageMaker JumpStart, refer to SageMaker JumpStart. To get started using SageMaker JumpStart, access it through SageMaker Studio.
About the Authors
Raju Rangan is a Senior Solutions Architect at AWS. He works with government-sponsored entities, helping them build AI/ML solutions using AWS. When not tinkering with cloud solutions, you’ll catch him hanging out with family or smashing birdies in a lively game of badminton with friends.
Sherry Ding is a senior AI/ML specialist solutions architect at AWS. She has extensive experience in machine learning with a PhD in computer science. She mainly works with public sector customers on various AI/ML-related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.
June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.
Bhaskar Pratap is a Senior Software Engineer with the Amazon SageMaker team. He is passionate about designing and building elegant systems that bring machine learning to people’s fingertips. Additionally, he has extensive experience with building scalable cloud storage services.

eSentire delivers private and secure generative AI interactions to cus …

eSentire is an industry-leading provider of Managed Detection & Response (MDR) services protecting users, data, and applications of over 2,000 organizations globally across more than 35 industries. These security services help their customers anticipate, withstand, and recover from sophisticated cyber threats, prevent disruption from malicious attacks, and improve their security posture.
In 2023, eSentire was looking for ways to deliver differentiated customer experiences by continuing to improve the quality of its security investigations and customer communications. To accomplish this, eSentire built AI Investigator, a natural language query tool for their customers to access security platform data by using AWS generative artificial intelligence (AI) capabilities.
In this post, we share how eSentire built AI Investigator using Amazon SageMaker to provide private and secure generative AI interactions to their customers.
Benefits of AI Investigator
Before AI Investigator, customers would engage eSentire’s Security Operation Center (SOC) analysts to understand and further investigate their asset data and associated threat cases. This involved manual effort for customers and eSentire analysts, forming questions and searching through data across multiple tools to formulate answers.
eSentire’s AI Investigator enables users to complete complex queries using natural language by joining multiple sources of data from each customer’s own security telemetry and eSentire’s asset, vulnerability, and threat data mesh. This helps customers quickly and seamlessly explore their security data and accelerate internal investigations.
Providing AI Investigator internally to the eSentire SOC workbench has also accelerated eSentire’s investigation process by improving the scale and efficacy of multi-telemetry investigations. The LLM models augment SOC investigations with knowledge from eSentire’s security experts and security data, enabling higher-quality investigation outcomes while also reducing time to investigate. Over 100 SOC analysts are now using AI Investigator models to analyze security data and provide rapid investigation conclusions.
Solution overview
eSentire customers expect rigorous security and privacy controls for their sensitive data, which requires an architecture that doesn’t share data with external large language model (LLM) providers. Therefore, eSentire decided to build their own LLM using Llama 1 and Llama 2 foundational models. A foundation model (FM) is an LLM that has undergone unsupervised pre-training on a corpus of text. eSentire tried multiple FMs available in AWS for their proof of concept; however, the straightforward access to Meta’s Llama 2 FM through Hugging Face in SageMaker for training and inference (and their licensing structure) made Llama 2 an obvious choice.
eSentire has over 2 TB of signal data stored in their Amazon Simple Storage Service (Amazon S3) data lake. eSentire used gigabytes of additional human investigation metadata to perform supervised fine-tuning on Llama 2. This further step updates the FM by training with data labeled by security experts (such as Q&A pairs and investigation conclusions).
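As a rough illustration of how such a supervised fine-tuning job can be launched on SageMaker, the hypothetical sketch below uses the Hugging Face estimator; the training script, S3 path, instance type, and framework versions are placeholders, not eSentire's actual pipeline:

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

estimator = HuggingFace(
    entry_point="train_llama2_sft.py",   # your own fine-tuning script (placeholder)
    source_dir="scripts",
    instance_type="ml.g5.12xlarge",       # multi-GPU instance for larger models
    instance_count=1,
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={
        "model_id": "meta-llama/Llama-2-13b-hf",
        "epochs": 3,
        "per_device_train_batch_size": 2,
    },
)

# Labeled Q&A pairs and investigation conclusions staged in S3 (placeholder path).
estimator.fit({"train": "s3://your-bucket/ai-investigator/sft-data/"})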
eSentire used SageMaker on several levels, ultimately facilitating their end-to-end process:

They used SageMaker notebook instances extensively to spin up GPU instances, giving them the flexibility to swap high-power compute in and out when needed. eSentire used instances with CPU for data preprocessing and post-inference analysis and GPU for the actual model (LLM) training.
The additional benefit of SageMaker notebook instances is its streamlined integration with eSentire’s AWS environment. Because they have vast amounts of data (terabyte scale, over 1 billion total rows of relevant data in preprocessing input) stored across AWS—in Amazon S3 and Amazon Relational Database Service (Amazon RDS) for PostgreSQL clusters—SageMaker notebook instances allowed secure movement of this volume of data directly from the AWS source (Amazon S3 or Amazon RDS) to the SageMaker notebook. They needed no additional infrastructure for data integration.
SageMaker real-time inference endpoints provide the infrastructure needed for hosting their custom self-trained LLMs. This was very useful in combination with SageMaker integration with Amazon Elastic Container Registry (Amazon ECR), SageMaker endpoint configuration, and SageMaker models to provide the entire configuration required to spin up their LLMs as needed. The fully featured end-to-end deployment capability provided by SageMaker allowed eSentire to effortlessly and consistently update their model registry as they iterate and update their LLMs. All of this was entirely automated with the software development lifecycle (SDLC) using Terraform and GitHub, which is only possible through the SageMaker ecosystem.

The following diagram visualizes the architecture diagram and workflow.

The application’s frontend is accessible through Amazon API Gateway, using both edge and private gateways. To emulate intricate thought processes akin to those of a human investigator, eSentire engineered a system of chained agent actions. This system uses AWS Lambda and Amazon DynamoDB to orchestrate a series of LLM invocations. Each LLM call builds upon the previous one, creating a cascade of interactions that collectively produce high-quality responses. This intricate setup makes sure that the application’s backend data sources are seamlessly integrated, thereby providing tailored responses to customer inquiries.
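Conceptually, each chained step is just another invocation of the self-hosted endpoint, with the previous answer folded into the next prompt. A simplified sketch of such a Lambda handler follows (not eSentire's code; the endpoint name and payload format are assumptions):

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT = "ai-investigator-llm"  # hypothetical SageMaker endpoint name

def ask_llm(prompt: str) -> str:
    # Assumes a text-generation container that accepts {"inputs": ...} payloads.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 512}}),
    )
    return json.loads(response["Body"].read())[0]["generated_text"]

def handler(event, context):
    question = event["question"]
    # Step 1: turn the natural-language question into an investigation plan.
    plan = ask_llm(f"Plan the data sources needed to answer: {question}")
    # Step 2: feed the plan (plus retrieved telemetry, omitted here) back in
    # to draft the final answer; each call builds on the previous one.
    answer = ask_llm(f"Question: {question}\nInvestigation plan: {plan}\nAnswer:")
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}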
When a SageMaker endpoint is constructed, an S3 URI to the bucket containing the model artifact and Docker image is shared using Amazon ECR.
For their proof of concept, eSentire selected the NVIDIA A10G Tensor Core GPU available in an ml.g5.2xlarge instance for its balance of performance and cost. For LLMs with significantly larger numbers of parameters, which demand greater computational power for both training and inference, eSentire used ml.g5.12xlarge instances equipped with four GPUs. This was necessary because the computational complexity and the memory required for LLMs can increase exponentially with the number of parameters. eSentire plans to harness P4 and P5 instance types for scaling their production workloads.
Additionally, a monitoring framework that captures the inputs and outputs of AI Investigator was necessary to enable threat hunting visibility to LLM interactions. To accomplish this, the application integrates with an open sourced eSentire LLM Gateway project to monitor the interactions with customer queries, backend agent actions, and application responses. This framework enables confidence in complex LLM applications by providing a security monitoring layer to detect malicious poisoning and injection attacks while also providing governance and support for compliance through logging of user activity. The LLM gateway can also be integrated with other LLM services, such as Amazon Bedrock.
Amazon Bedrock enables you to customize FMs privately and interactively, without the need for coding. Initially, eSentire’s focus was on training bespoke models using SageMaker. As their strategy evolved, they began to explore a broader array of FMs, evaluating their in-house trained models against those provided by Amazon Bedrock. Amazon Bedrock offers a practical environment for benchmarking and a cost-effective solution for managing workloads due to its serverless operation. This serves eSentire well, especially when customer queries are sporadic, making serverless an economical alternative to persistently running SageMaker instances.
From a security perspective as well, Amazon Bedrock doesn't share users' inputs and model outputs with any model providers. Additionally, eSentire has custom guardrails for NL2SQL applied to its models.
Results
The following screenshot shows an example of eSentire’s AI Investigator output. As illustrated, a natural language query is posed to the application. The tool is able to correlate multiple datasets and present a response.

Dustin Hillard, CTO of eSentire, shares: “eSentire customers and analysts ask hundreds of security data exploration questions per month, which typically take hours to complete. AI Investigator is now with an initial rollout to over 100 customers and more than 100 SOC analysts, providing a self-serve immediate response to complex questions about their security data. eSentire LLM models are saving thousands of hours of customer and analyst time.”
Conclusion
In this post, we shared how eSentire built AI Investigator, a generative AI solution that provides private and secure self-serve customer interactions. Customers can get near real-time answers to complex questions about their data. AI Investigator has also saved eSentire significant analyst time.
The aforementioned LLM gateway project is eSentire's own product, and AWS bears no responsibility for it.
If you have any comments or questions, share them in the comments section.

About the Authors
Aishwarya Subramaniam is a Sr. Solutions Architect in AWS. She works with commercial customers and AWS partners to accelerate customers’ business outcomes by providing expertise in analytics and AWS services.
Ilia Zenkov is a Senior AI Developer specializing in generative AI at eSentire. He focuses on advancing cybersecurity with expertise in machine learning and data engineering. His background includes pivotal roles in developing ML-driven cybersecurity and drug discovery platforms.
Dustin Hillard is responsible for leading product development and technology innovation, systems teams, and corporate IT at eSentire. He has deep ML experience in speech recognition, translation, natural language processing, and advertising, and has published over 30 papers in these areas.

Anthropic AI Releases Claude 3.5: A New AI Model that Surpasses GPT-4o …

Anthropic AI has launched Claude 3.5 Sonnet, marking the first release in its new Claude 3.5 model family. This latest iteration of Claude brings significant advancements in AI capabilities, setting a new benchmark in the industry for intelligence and performance.

Introduction to Claude 3.5 Sonnet

Anthropic AI introduced Claude 3.5 Sonnet, which is available for free on Claude.ai and the Claude iOS app. The model is accessible via the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. Enhanced rate limits are provided for Claude Pro and Team plan subscribers. The pricing structure is set at $3 per million input tokens and $15 per million output tokens, with a 200K token context window, making it cost-effective and highly efficient.
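For developers, a minimal call through the Anthropic Python SDK (pip install anthropic) looks like the following; the client reads ANTHROPIC_API_KEY from the environment, and the prompt is only illustrative:

import anthropic

client = anthropic.Anthropic()  # uses the ANTHROPIC_API_KEY environment variable
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # model ID published at launch
    max_tokens=1024,
    messages=[{"role": "user",
               "content": "Summarize the main ideas of rectified flow in two sentences."}],
)
print(message.content[0].text)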


Performance and Capabilities

Claude 3.5 Sonnet boasts twice the speed of its predecessor, Claude 3 Opus, while maintaining mid-tier model costs. It excels in graduate-level reasoning, undergraduate-level knowledge, and coding proficiency, with significantly improved understanding of nuance, humor, and complex instructions. Its ability to write high-quality content in a natural and relatable tone further solidifies its position as a leading AI model.


In internal coding evaluations, Claude 3.5 Sonnet outperformed previous models by solving 64% of problems, compared to 38% solved by Claude 3 Opus. This evaluation tested the model’s ability to fix bugs or add functionalities to an open-source codebase based on natural language descriptions. Claude 3.5 Sonnet demonstrated sophisticated reasoning and troubleshooting capabilities, making it particularly effective for updating legacy applications and migrating codebases.


Visual and Interactive Enhancements

Claude 3.5 Sonnet also improves visual reasoning, surpassing its predecessor in standard vision benchmarks. It can accurately transcribe text from imperfect images, a crucial capability for industries like retail, logistics, and financial services, where visual data interpretation is essential. This enhancement makes Claude 3.5 Sonnet highly effective in tasks requiring visual reasoning, such as interpreting charts and graphs.

Anthropic AI introduced “Artifacts,” a new feature on Claude.ai that allows users to generate and interact with content like code snippets, text documents, or website designs within a dynamic workspace. This feature marks Claude’s evolution from a conversational AI to a collaborative work environment, paving the way for team collaboration and centralized knowledge management.

Safety and Privacy

Safety and privacy remain paramount in Claude 3.5 Sonnet’s development. The model has undergone rigorous testing to minimize misuse, with safety mechanisms evaluated by external experts, including the UK’s Artificial Intelligence Safety Institute (UK AISI). These evaluations ensure the model’s robustness against misuse while maintaining user privacy. Anthropic AI does not train its generative models on user-submitted data without explicit permission, reinforcing its commitment to data privacy.

Future Developments

Anthropic AI aims to continually improve the tradeoff between intelligence, speed, and cost. Later this year, the company plans to release Claude 3.5 Haiku and Claude 3.5 Opus, completing the Claude 3.5 model family. Future developments will also include new modalities and features to support more business use cases, including integrations with enterprise applications. The team is exploring features like Memory, which will enable Claude to remember user preferences and interaction history, enhancing personalization and efficiency.

Conclusion

Claude 3.5 Sonnet represents a significant leap in AI capabilities, offering advanced reasoning, coding proficiency, and visual understanding. With its introduction, Anthropic AI continues to push the boundaries of what AI can achieve, setting new standards for performance and safety. As the Claude 3.5 model family expands, users can look forward to powerful tools to support projects and workflows.


The post Anthropic AI Releases Claude 3.5: A New AI Model that Surpasses GPT-4o on Multiple Benchmarks While Being 2x Faster than Claude 3 Opus appeared first on MarkTechPost.

StreamSpeech: A Direct Simul-S2ST Speech-to-Speech Translation Model t …

Large Language Models (LLMs) have gained significant attention in the field of simultaneous speech-to-speech translation (SimulS2ST). This technology has become crucial for low-latency communication in various scenarios, such as international conferences, live broadcasts, and online subtitles. The primary challenge in SimulS2ST lies in producing high-quality translated speech with minimal delay. This requires a sophisticated policy to determine the optimal moments to initiate translation within streaming speech inputs (READ action) and subsequently generate coherent target speech outputs (WRITE action).

Current methodologies face several challenges. Existing simultaneous translation methods primarily focus on text-to-text (Simul-T2TT) and speech-to-text translation (Simul-S2TT). These approaches typically rely on cascading external modules like speech recognition (ASR) and text-to-speech synthesis (TTS) to achieve SimulS2ST. However, this cascaded approach tends to amplify inference errors progressively between modules and impedes the joint optimization of various components, highlighting the need for a more integrated solution.

Researchers have made several attempts to address the challenges in simultaneous speech-to-speech translation, primarily focusing on Simul-T2TT and Simul-S2TT translation methods. In Simul-T2TT, approaches are categorized into fixed and adaptive methods. Fixed methods, such as the wait-k policy, employ a predetermined strategy of waiting for a set number of tokens before alternating between READ and WRITE actions. Adaptive methods utilize techniques like monotonic attention, alignments, non-autoregressive architecture, or language models to dynamically perform Simul-T2TT. For Simul-S2TT, the focus has been on speech segmentation. Fixed pre-decision methods divide speech into equal-length segments, while adaptive methods split speech inputs into words or segments before applying Simul-T2TT policies. Some researchers have also explored applying offline models to Simul-S2TT tasks. Despite these advancements, these methods still rely heavily on cascading external modules, which can lead to error propagation and hinder joint optimization of the translation process.
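To make the fixed wait-k policy concrete, the toy sketch below shows the READ/WRITE alternation for the text-to-text case; translate_prefix stands in for a real incremental translation model, and the termination rule is deliberately simplified:

from typing import Callable, Iterator, List

def wait_k_decode(source_stream: Iterator[str], k: int,
                  translate_prefix: Callable[[List[str], int], str]) -> List[str]:
    """Toy wait-k policy: READ the first k source tokens, then alternate
    one WRITE per additional READ."""
    source: List[str] = []
    target: List[str] = []
    for token in source_stream:
        source.append(token)                                      # READ action
        if len(source) >= k:
            target.append(translate_prefix(source, len(target)))  # WRITE action
    # Once the source has ended, keep writing (here, simplistically,
    # until the target matches the source length).
    while len(target) < len(source):
        target.append(translate_prefix(source, len(target)))
    return target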

Researchers from the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS), the Key Laboratory of AI Safety, Chinese Academy of Sciences, the University of Chinese Academy of Sciences, and the School of Future Science and Engineering, Soochow University present StreamSpeech, which addresses SimulS2ST challenges by introducing textual information for both source and target speech, providing intermediate supervision, and guiding the policy through text-based alignments. This direct SimulS2ST model employs a two-pass architecture, first translating source speech to target text hidden states and then converting these to target speech. Multiple CTC decoders, optimized via ASR and S2TT auxiliary tasks, provide intermediate supervision and learn alignments for policy guidance. By jointly optimizing all modules through multi-task learning, StreamSpeech enables concurrent learning of translation and policy, potentially overcoming the limitations of previous cascaded approaches.

StreamSpeech’s architecture comprises three main components: a streaming speech encoder, a simultaneous text decoder, and a synchronized text-to-unit generation module. The streaming speech encoder utilizes a chunk-based Conformer design, which enables it to process streaming inputs while maintaining bi-directional encoding within local chunks. The simultaneous text decoder generates target text by attending to the source speech hidden states, guided by a policy that determines when to generate each target token. This policy is informed by alignments learned through multiple CTC decoders, which are optimized via auxiliary tasks of ASR and S2TT. The text-to-unit generation module employs a non-autoregressive architecture to synchronously generate units corresponding to the decoded text. Finally, a HiFi-GAN vocoder synthesizes the target speech from these units.

StreamSpeech demonstrates superior performance in both offline and simultaneous S2ST tasks. In offline S2ST, it outperforms the state-of-the-art UnitY model with an average improvement of 1.5 BLEU. The model's architecture, combining autoregressive speech-to-text translation with non-autoregressive text-to-unit generation, proves effective in balancing modeling capability and alignment capture. In simultaneous S2ST, StreamSpeech significantly outperforms the wait-k baseline, showing approximately 10 BLEU improvement under low-latency conditions across French, Spanish, and German to English translations. The model's alignment-derived policy enables more appropriate translation timing and coherent target speech generation. StreamSpeech also shows advantages over cascaded systems, highlighting the benefits of its direct approach in reducing error accumulation and improving overall performance in Simul-S2ST tasks.

StreamSpeech represents a significant advancement in simultaneous speech-to-speech translation technology. This innovative “All in One” seamless model effectively handles streaming ASR, simultaneous translation, and real-time speech synthesis within a unified framework. Its comprehensive approach allows for improved performance across multiple tasks, including offline speech-to-speech translation, streaming ASR, simultaneous speech-to-text translation, and simultaneous speech-to-speech translation.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
The post StreamSpeech: A Direct Simul-S2ST Speech-to-Speech Translation Model that Jointly Learns Translation and Simultaneous Policy in a Unified Framework of Multi-Task Learning appeared first on MarkTechPost.

Firecrawl: A Powerful Web Scraping Tool for Turning Websites into Large Language Model (LLM) Ready Markdown or Structured Data

In the rapidly advancing field of Artificial Intelligence (AI), effective use of web data can lead to unique applications and insights. A recent tweet has drawn attention to Firecrawl, a web scraping tool created by the Mendable AI team. Firecrawl is built to tackle the practical problems involved in extracting data from the web. Web scraping is useful, but it frequently requires working around proxies, caching, rate limits, and JavaScript-generated content. Firecrawl addresses these issues head-on, which makes it a valuable tool for data scientists.

Even without a sitemap, Firecrawl crawls every accessible page on a website, so no important data is missed during extraction. Many modern websites render content dynamically with JavaScript, which trips up traditional scraping techniques, but Firecrawl collects data from such sites as well, giving users access to the full range of available information.

Firecrawl returns extracted data as clean, well-formatted Markdown. This format is especially useful for Large Language Model (LLM) applications because it makes the scraped data easy to integrate and use. Crawl time matters in web scraping, and Firecrawl addresses it by orchestrating concurrent crawling, which significantly accelerates data extraction and delivers the required data promptly.

Firecrawl uses a caching mechanism to optimize efficiency further. Scraped content is cached, so full re-scrapes are only needed when fresh content is found. This reduces the load on target websites and saves time. Firecrawl delivers clean data in a format that is ready for immediate use, catering to the specific requirements of AI applications.

The tweet also highlights the use of generative feedback loops for cleaning data chunks. In this procedure, generative models review and refine the scraped data to make sure it is valid and useful: the models comment on data chunks, pointing out errors and suggesting improvements.

This iterative process improves the data and makes it more dependable for downstream analysis and applications. Generative feedback loops can greatly improve the quality of the resulting datasets, keeping the data both clean and contextually correct, which is important for making sound decisions and training AI models.

To begin using Firecrawl, users register on the website to receive an API key. The service provides an intuitive API with SDKs for Python and Node and integrations for LangChain and LlamaIndex. For a self-hosted setup, users can run Firecrawl locally. Submitting a crawl job returns a job ID that can be used to monitor the crawl’s progress, keeping the process simple and effective.
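As a rough sketch of what this looks like in practice, the snippet below assumes the firecrawl-py Python SDK and its FirecrawlApp client with scrape_url and crawl_url methods; the exact method names, parameters, and response shapes vary between SDK versions, so treat this as illustrative and check the repository for the current interface:

from firecrawl import FirecrawlApp  # assumed package: firecrawl-py

app = FirecrawlApp(api_key="fc-...")  # key obtained by registering on the website

# Scrape a single page; recent SDK versions return LLM-ready Markdown in the result.
result = app.scrape_url("https://example.com/docs")
print(result)

# Kick off a crawl of a whole site; depending on the SDK version this either
# returns the crawled pages directly or a job ID you can poll for progress.
job = app.crawl_url("https://example.com")
print(job)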

In conclusion, with its strong capabilities and smooth integrations, Firecrawl is a significant step forward in web scraping and data collection. Combined with the creative method of cleaning data via generative feedback loops, it offers a complete solution for users who want to tap the abundance of data available online.


Imperva optimizes SQL generation from natural language using Amazon Bedrock

This is a guest post co-written with Ori Nakar from Imperva.
Imperva Cloud WAF protects hundreds of thousands of websites against cyber threats and blocks billions of security events every day. Counters and insights based on security events are calculated daily and used by users from multiple departments. Millions of counters are added daily, together with 20 million insights updated daily to spot threat patterns.
Our goal was to improve the user experience of an existing application used to explore the counters and insights data. The data is stored in a data lake and retrieved by SQL using Amazon Athena.
As part of our solution, we replaced multiple search fields with a single free text field. We used a large language model (LLM) with query examples to make the search work using the language used by Imperva internal users (business analysts).
The following figure shows a search query that was translated to SQL and run. The results were later formatted as a chart by the application. We have many types of insights—global, industry, and customer level insights used by multiple departments such as marketing, support, and research. Data was made available to our users through a simplified user experience powered by an LLM.

Figure 1: Insights search by natural language
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon within a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Studio is a new single sign-on (SSO)-enabled web interface that provides a way for developers across an organization to experiment with LLMs and other FMs, collaborate on projects, and iterate on generative AI applications. It offers a rapid prototyping environment and streamlines access to multiple FMs and developer tools in Amazon Bedrock.
Read more to learn about the problem, and how we obtained quality results using Amazon Bedrock for our experimentation and deployment.
The problem
Making data accessible to users through applications has always been a challenge. Data is normally stored in databases, and can be queried using the most common query language, SQL. Applications use different UI components to allow users to filter and query the data. There are applications with tens of different filters and other options–all created to make the data accessible.
Querying databases through applications cannot be as flexible as running SQL queries on a known schema. Giving more power to the user comes at the expense of a simple user experience (UX). Natural language can solve this problem: it’s possible to support complex yet readable natural language queries without SQL knowledge. When the schema changes, the application UX and code remain the same, or require only minor changes, which saves development time and keeps the application user interface (UI) stable for the users.
Constructing SQL queries from natural language isn’t a simple task. SQL queries must be accurate both syntactically and logically. Using an LLM with the right examples can make this task less difficult.

Figure 2: High level database access using an LLM flow
The challenge
An LLM can construct SQL queries based on natural language. The challenge is to assure quality. The user can enter any text, and the application constructs a query based on it. There isn’t an option, like in traditional applications, to cover all options and make sure the application functions correctly. Adding an LLM to an application adds another layer of complexity. The response by the LLM is not deterministic. Examples sent to the LLM are based on the database data, which makes it even harder to control the requests sent to the LLM and assure quality.
The solution: A data science approach
In data science, it’s common to develop a model and fine tune it using experimentation. The idea is to use metrics to compare experiments during development. Experiments might differ from each other in many ways, such as the input sent to the model, the model type, and other parameters. The ability to compare different experiments makes it possible to make progress. It’s possible to know how each change contributes to the model.
A test set is a static set of records that includes the expected result for each record. Running predictions on the test set records yields the metrics needed to compare experiments. A common metric is accuracy, the percentage of correct results.
In our case, the results generated by the LLM are SQL statements. The SQL statements generated by the LLM are not deterministic and are hard to measure directly; however, running the SQL statements on a static test database is deterministic and can be measured. We used a test database and a list of questions with known answers as a test set. This allowed us to run experiments and fine-tune our LLM-based application.
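As an illustration of this approach, the sketch below runs generated SQL against a static test database and compares each result set with that of the known-good SQL, reporting accuracy and SQL error rate. It uses SQLite as a stand-in for the test database (the actual system queries a data lake through Amazon Athena), and the function and variable names are hypothetical:

import sqlite3

def evaluate(generated_sql: dict, expected_sql: dict, db_path: str) -> dict:
    """Compare generated SQL against known-good SQL on a static test database.

    generated_sql and expected_sql both map each test question to a SQL statement.
    """
    conn = sqlite3.connect(db_path)
    correct = errors = 0
    for question, candidate in generated_sql.items():
        expected_rows = conn.execute(expected_sql[question]).fetchall()
        try:
            candidate_rows = conn.execute(candidate).fetchall()
        except sqlite3.Error:
            errors += 1  # counts toward the SQL error rate
            continue
        # Order-insensitive comparison of the two result sets.
        if sorted(candidate_rows, key=repr) == sorted(expected_rows, key=repr):
            correct += 1
    total = len(generated_sql)
    return {"accuracy": correct / total, "sql_error_rate": errors / total}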
Database access using LLM: Question to answer flow
Given a question, we defined the following flow. The question is sent through a retrieval-augmented generation (RAG) process, which finds similar documents. Each document holds an example question and information about it. The relevant documents are assembled into a prompt and sent to the LLM, which builds a SQL statement. This flow is used both for development and at application runtime:

Figure 3: Question to answer flow
As an example, consider a database schema with two tables: orders and items. The following figure is a question to SQL example flow:

Figure 4: Question to answer flow example
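The figure itself is not reproduced here, but an entry in the question-to-SQL examples set for such a schema might look like the following; the question, column names, and date syntax are purely illustrative assumptions:

example = {
    "question": "How many items were sold per order in the last 7 days?",
    "sql": (
        "SELECT o.order_id, COUNT(i.item_id) AS items_sold "
        "FROM orders o JOIN items i ON i.order_id = o.order_id "
        "WHERE o.order_date >= DATE('now', '-7 days') "  # date syntax varies by SQL engine
        "GROUP BY o.order_id"
    ),
    # Optionally include sample returned data so the LLM can also answer
    # questions about the data, not only about the schema.
    "sample_result": [{"order_id": 1001, "items_sold": 3}],
}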
Database access using LLM: Development process
To develop and fine-tune the application we created the following data sets:

A static test database: Contains the relevant tables and a sample copy of the data.
A test set: Includes questions and test database result answers.
Question to SQL examples: A set of questions and their translations to SQL. For some examples, the returned data is included so that questions can be asked about the data itself, not only about the schema.

Development of the application is done by adding new questions and updating the different datasets, as shown in the following figure.

Figure 5: Adding a new question
Datasets and other parameter updates are tracked as part of adding new questions and fine-tuning of the application. We used a tracking tool to track information about the experiments such as:

Parameters such as the number of questions, number of examples, LLM type, RAG search method
Metrics such as accuracy and SQL error rate
Artifacts such as a list of the wrong results including generated SQL, data returned, and more

Figure 6: Experiment flow
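The post does not name the tracking tool; as one possible example, an open source tracker such as MLflow could record the parameters, metrics, and artifacts listed above roughly as follows (the experiment name, parameter values, metric numbers, and file path are placeholders, not results from the post):

import json
import mlflow

# Placeholder artifact: the list of wrong results for this experiment run.
wrong_results = [{"question": "...", "generated_sql": "...", "returned_data": "..."}]

mlflow.set_experiment("nl-to-sql")
with mlflow.start_run(run_name="example-run"):
    # Parameters of the experiment (placeholder values)
    mlflow.log_param("num_questions", 150)
    mlflow.log_param("num_examples", 20)
    mlflow.log_param("llm_type", "anthropic.claude-instant-v1")
    mlflow.log_param("rag_search_method", "vector")
    # Metrics (placeholder values)
    mlflow.log_metric("accuracy", 0.86)
    mlflow.log_metric("sql_error_rate", 0.04)
    # Artifacts, such as the list of wrong results
    with open("wrong_results.json", "w") as f:
        json.dump(wrong_results, f)
    mlflow.log_artifact("wrong_results.json")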
Using a tracking tool, we were able to make progress by comparing experiments. The following figure shows the accuracy and error rate metrics for the different experiments we did:

Figure 7: Accuracy and error rate over time
When there’s a mistake or an error, we drill down into the incorrect results and the experiment details to understand the source of the error and fix it.
Experiment and deploy using Amazon Bedrock
Amazon Bedrock is a managed service that offers a choice of high-performing foundation models. You can experiment with and evaluate top FMs for your use case and customize them with your data.
By using Amazon Bedrock, we were able to switch between models and embedding options easily. The following is example code using the LangChain Python library, which allows using different models and embeddings:

import boto3
from langchain_community.llms.bedrock import Bedrock
from langchain_community.embeddings import BedrockEmbeddings

def get_llm(model_id: str, args: dict):
    # Return a LangChain LLM wrapper for the chosen Amazon Bedrock model.
    return Bedrock(model_id=model_id,
                   model_kwargs=args,
                   client=boto3.client("bedrock-runtime"))

def get_embeddings(model_id: str):
    # Return a LangChain embeddings wrapper for the chosen Bedrock embeddings model.
    return BedrockEmbeddings(model_id=model_id,
                             client=boto3.client("bedrock-runtime"))
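A short usage example follows; the model IDs and generation parameters shown here are illustrative, so use whichever Amazon Bedrock models are enabled in your account and Region:

# Claude Instant via Bedrock, with Claude-style generation parameters.
llm = get_llm("anthropic.claude-instant-v1",
              {"temperature": 0.0, "max_tokens_to_sample": 1024})
embeddings = get_embeddings("amazon.titan-embed-text-v1")

sql = llm.invoke("Write a SQL statement that counts the rows in the orders table.")
query_vector = embeddings.embed_query("How many orders were created last week?")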

We used multiple models and embeddings with different hyperparameters to improve accuracy and decide which model is the best fit for us. We also ran experiments on smaller models to determine whether we could get the same quality with improved performance and reduced costs. We started using Anthropic Claude 2.1 and experimented with the Anthropic Claude Instant model. Accuracy dropped by 20 percent, but after adding a few additional examples, we achieved the same accuracy as Claude 2.1 with lower cost and faster response time.

Conclusion
We used the same approach used in data science projects to construct SQL queries from natural language. The solution shown can be applied to other LLM-based applications, and not only for constructing SQL. For example, it can be used for API access, building JSON data, and more. The key is to create a test set together with measurable results and progress using experimentation.
Amazon Bedrock lets you use different models and switch between them to find the right one for your use case. You can compare different models, including small ones for better performance and costs. Because Amazon Bedrock is serverless, you don’t have to manage any infrastructure. We were able to test multiple models quickly, and finally integrate and deploy generative AI capabilities into our application.
You can start experimenting with natural language to SQL by running the code samples in this GitHub repository. This workshop is divided into modules that each build on the previous one while introducing a new technique to solve this problem. Many of these approaches are based on existing work from the community and are cited accordingly.

About the Authors
Ori Nakar is a Principal cyber-security researcher, a data engineer, and a data scientist at Imperva Threat Research group.
Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect at AWS. He works with AWS customers to provide guidance and technical assistance, helping them build and operate Generative AI and Machine Learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.
Elad Eizner is a Solutions Architect at Amazon Web Services. He works with AWS enterprise customers to help them architect and build solutions in the cloud and achieve their goals.

Create natural conversations with Amazon Lex QnAIntent and Knowledge Bases for Amazon Bedrock

Customer service organizations today face an immense opportunity. As customer expectations grow, brands have a chance to creatively apply new innovations to transform the customer experience. Although meeting rising customer demands poses challenges, the latest breakthroughs in conversational artificial intelligence (AI) empower companies to meet these expectations.
Customers today expect timely responses to their questions that are helpful, accurate, and tailored to their needs. The new QnAIntent, powered by Amazon Bedrock, can meet these expectations by understanding questions posed in natural language and responding conversationally in real time using your own authorized knowledge sources. Our Retrieval Augmented Generation (RAG) approach allows Amazon Lex to harness both the breadth of knowledge available in repositories as well as the fluency of large language models (LLMs).
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
In this post, we show you how to add generative AI question answering capabilities to your bots. This can be done using your own curated knowledge sources, and without writing a single line of code.
Read on to discover how QnAIntent can transform your customer experience.
Solution overview
Implementing the solution consists of the following high-level steps:

Create an Amazon Lex bot.
Create an Amazon Simple Storage Service (Amazon S3) bucket and upload a PDF file that contains the information used to answer questions.
Create a knowledge base that will split your data into chunks and generate embeddings using the Amazon Titan Embeddings model. As part of this process, Knowledge Bases for Amazon Bedrock automatically creates an Amazon OpenSearch Serverless vector search collection to hold your vectorized data.
Add a new QnAIntent intent that will use the knowledge base to find answers to customers’ questions and then use the Anthropic Claude model to generate answers to questions and follow-up questions.

Prerequisites
To follow along with the features described in this post, you need access to an AWS account with permissions to access Amazon Lex, Amazon Bedrock (with access to Anthropic Claude models and Amazon Titan embeddings or Cohere Embed), Knowledge Bases for Amazon Bedrock, and the OpenSearch Serverless vector engine. To request access to models in Amazon Bedrock, complete the following steps:

On the Amazon Bedrock console, choose Model access in the navigation pane.
Choose Manage model access.
Select the Amazon and Anthropic models. (You can also choose to use Cohere models for embeddings.)
Choose Request model access.

Create an Amazon Lex bot
If you already have a bot you want to use, you can skip this step.

On the Amazon Lex console, choose Bots in the navigation pane.
Choose Create bot
Select Start with an example and choose the BookTrip example bot.
For Bot name, enter a name for the bot (for example, BookHotel).
For Runtime role, select Create a role with basic Amazon Lex permissions.
In the Children’s Online Privacy Protection Act (COPPA) section, you can select No because this bot is not targeted at children under the age of 13.
Keep the Idle session timeout setting at 5 minutes.
Choose Next.
When using the QnAIntent to answer questions in a bot, you may want to increase the intent classification confidence threshold so that your questions are not accidentally interpreted as matching one of your intents. We set this to 0.8 for now. You may need to adjust this up or down based on your own testing.
Choose Done.
Choose Save intent.

Upload content to Amazon S3
Now you create an S3 bucket to store the documents you want to use for your knowledge base.

On the Amazon S3 console, choose Buckets in the navigation pane.
Choose Create bucket.
For Bucket name, enter a unique name.
Keep the default values for all other options and choose Create bucket.

For this post, we created an FAQ document for the fictitious hotel chain called Example Corp FictitiousHotels. Download the PDF document to follow along.

On the Buckets page, navigate to the bucket you created.

If you don’t see it, you can search for it by name.

Choose Upload.
Choose Add files.
Choose the ExampleCorpFicticiousHotelsFAQ.pdf that you downloaded.
Choose Upload.

The file will now be accessible in the S3 bucket.
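If you prefer to script the bucket creation and upload instead of using the console, a minimal boto3 sketch looks like the following; the bucket name is a placeholder and must be globally unique:

import boto3

s3 = boto3.client("s3")
bucket_name = "my-knowledge-base-docs-123456"  # placeholder; bucket names must be globally unique

# In Regions other than us-east-1, also pass a CreateBucketConfiguration
# with a LocationConstraint matching your Region.
s3.create_bucket(Bucket=bucket_name)

# Upload the FAQ document downloaded earlier.
s3.upload_file("ExampleCorpFicticiousHotelsFAQ.pdf",
               bucket_name,
               "ExampleCorpFicticiousHotelsFAQ.pdf")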
Create a knowledge base
Now you can set up the knowledge base:

On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
Choose Create knowledge base.
For Knowledge base name, enter a name.
For Knowledge base description, enter an optional description.
Select Create and use a new service role.
For Service role name, enter a name or keep the default.
Choose Next.
For Data source name, enter a name.
Choose Browse S3 and navigate to the S3 bucket you uploaded the PDF file to earlier.
Choose Next.
Choose an embeddings model.
Select Quick create a new vector store to create a new OpenSearch Serverless vector store to store the vectorized content.
Choose Next.
Review your configuration, then choose Create knowledge base.

After a few minutes, the knowledge base is created.

Choose Sync to chunk the documents, calculate the embeddings, and store them in the vector store.

This may take a while. You can proceed with the rest of the steps, but the syncing needs to finish before you can query the knowledge base.

Copy the knowledge base ID. You will reference this when you add this knowledge base to your Amazon Lex bot.

Add QnAIntent to the Amazon Lex bot
To add the QnAIntent, complete the following steps:

On the Amazon Lex console, choose Bots in the navigation pane.
Choose your bot.
In the navigation pane, choose Intents.
On the Add intent menu, choose Use built-in intent.
For Built-in intent, choose AMAZON.QnAIntent.
For Intent name, enter a name.
Choose Add.
Choose the model you want to use to generate the answers (in this case, Anthropic Claude 3 Sonnet, but you can select Anthropic Claude 3 Haiku for a cheaper option with less latency).
For Choose knowledge store, select Knowledge base for Amazon Bedrock.
For Knowledge base for Amazon Bedrock Id, enter the ID you noted earlier when you created your knowledge base.
Choose Save Intent.
Choose Build to build the bot.
Choose Test to test the new intent.

The following screenshot shows an example conversation with the bot.

In the second question about the Miami pool hours, you refer back to the previous question about pool hours in Las Vegas and still get a relevant answer based on the conversation history.
It’s also possible to ask questions that require the bot to reason a bit around the available data. When we asked about a good resort for a family vacation, the bot recommended the Orlando resort based on the availability of activities for kids, proximity to theme parks, and more.
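You can also exercise the new intent programmatically through the Amazon Lex V2 runtime API. The following is a minimal sketch; the bot ID, alias ID, session ID, and question are placeholders for your own values:

import boto3

lex = boto3.client("lexv2-runtime")

response = lex.recognize_text(
    botId="ABCDEFGHIJ",        # your bot ID
    botAliasId="TSTALIASID",   # the test alias, or your own alias ID
    localeId="en_US",
    sessionId="test-session-1",
    text="What are the pool hours at the Miami hotel?",
)

for message in response.get("messages", []):
    print(message["content"])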
Update the confidence threshold
You may have some questions accidentally match your other intents. If you run into this, you can adjust the confidence threshold for your bot. To modify this setting, choose the language of your bot (English) and in the Language details section, choose Edit.
After you update the confidence threshold, rebuild the bot for the change to take effect.

Add additional steps
By default, the next step in the conversation for the bot is set to Wait for user input after a question has been answered. This keeps the conversation in the bot and allows a user to ask follow-up questions or invoke any of the other intents in your bot.
If you want the conversation to end and return control to the calling application (for example, Amazon Connect), you can change this behavior to End conversation. To update the setting, complete the following steps:

On the Amazon Lex console, navigate to the QnAIntent.
In the Fulfillment section, choose Advanced options.
On the Next step in conversation dropdown menu, choose End conversation.

If you would like the bot to add a specific message after each response from the QnAIntent (such as “Can I help you with anything else?”), you can add a closing response to the QnAIntent.
Clean up
To avoid incurring ongoing costs, delete the resources you created as part of this post:

Amazon Lex bot
S3 bucket
OpenSearch Serverless collection (This is not automatically deleted when you delete your knowledge base)
Knowledge bases

Conclusion
The new QnAIntent in Amazon Lex enables natural conversations by connecting customers with curated knowledge sources. Powered by Amazon Bedrock, the QnAIntent understands questions in natural language and responds conversationally, keeping customers engaged with contextual, follow-up responses.
QnAIntent puts the latest innovations in reach to transform static FAQs into flowing dialogues that resolve customer needs. This helps scale excellent self-service to delight customers.
Try it out for yourself. Reinvent your customer experience!

About the Author
Thomas Rinfuss is a Sr. Solutions Architect on the Amazon Lex team. He invents, develops, prototypes, and evangelizes new technical features and solutions for Language AI services that improve the customer experience and ease adoption.

Evaluate the reliability of Retrieval Augmented Generation applications using Amazon Bedrock

Retrieval Augmented Generation (RAG) is a technique that enhances large language models (LLMs) by incorporating external knowledge sources. It allows LLMs to reference authoritative knowledge bases or internal repositories before generating responses, producing output tailored to specific domains or contexts while providing relevance, accuracy, and efficiency. RAG achieves this enhancement without retraining the model, making it a cost-effective solution for improving LLM performance across various applications. The following diagram illustrates the main steps in a RAG system.

Although RAG systems are promising, they face challenges such as retrieving the most relevant knowledge, avoiding hallucinations that are inconsistent with the retrieved context, and efficiently integrating the retrieval and generation components. In addition, the RAG architecture can lead to issues like retrieval collapse, where the retrieval component learns to return the same documents regardless of the input. A similar problem occurs for tasks like open-domain question answering: there are often multiple valid answers in the training data, so the LLM may generate an answer from its training data rather than from the retrieved context. Another challenge is the need for an effective mechanism to handle cases where no useful information can be retrieved for a given input. Current research aims to improve these aspects for more reliable and capable knowledge-grounded generation.
Given these challenges faced by RAG systems, monitoring and evaluating generative artificial intelligence (AI) applications powered by RAG is essential. Moreover, tracking and analyzing the performance of RAG-based applications is crucial, because it helps assess their effectiveness and reliability when deployed in real-world scenarios. By evaluating RAG applications, you can understand how well the models are using and integrating external knowledge into their responses, how accurately they can retrieve relevant information, and how coherent the generated outputs are. Additionally, evaluation can identify potential biases, hallucinations, inconsistencies, or factual errors that may arise from the integration of external sources or from sub-optimal prompt engineering. Ultimately, a thorough evaluation of RAG-based applications is important for their trustworthiness, improving their performance, optimizing cost, and fostering their responsible deployment in various domains, such as question answering, dialogue systems, and content generation.
In this post, we show you how to evaluate the performance, trustworthiness, and potential biases of your RAG pipelines and applications on Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
RAG evaluation and observability challenges in real-world scenarios
Evaluating a RAG system poses significant challenges due to its complex architecture consisting of multiple components, such as the retrieval module and the generation component represented by the LLMs. Each module operates differently and requires distinct evaluation methodologies, making it difficult to assess the overall end-to-end performance of the RAG architecture. The following are some of the challenges you may encounter:

Lack of ground truth references – In many open-ended generation tasks, there is no single correct answer or reference text against which to evaluate the system’s output. This makes it difficult to apply standard evaluation metrics like BERTScore (Zhang et al. 2020), BLEU, or ROUGE that are used for machine translation and summarization.
Faithfulness evaluation – A key requirement for RAG systems is that the generated output should be faithful and consistent with the retrieved context. Evaluating this faithfulness, which also serves to measure the presence of hallucinated content, in an automated manner is non-trivial, especially for open-ended responses.
Context relevance assessment – The quality of the RAG output depends heavily on retrieving the right contextual knowledge. Automatically assessing the relevance of the retrieved context to the input prompt is an open challenge.
Factuality vs. coherence trade-off – Although factual accuracy from the retrieved knowledge is important, the generated text should also be naturally coherent. Evaluating and balancing factual consistency with language fluency is difficult.
Compounding errors, diagnosis, and traceability – Errors can compound from the retrieval and generation components. Diagnosing whether errors stem from retrieval failures or generation inconsistencies is hard without clear intermediate outputs. Given the complex interplay between various components of the RAG architecture, it’s also difficult to provide traceability of the problem in the evaluation process.
Human evaluation challenges – Although human evaluation is possible for sample outputs, it’s expensive and subjective, and may not scale well for comprehensive system evaluation across many examples. The need for a domain expert to create and evaluate against a dataset is essential, because the evaluation process requires specialized knowledge and expertise. The labor-intensive nature of the human evaluation process is time-consuming, because it often involves manual effort.
Lack of standardized benchmarks – There are no widely accepted and standardized benchmarks yet for holistically evaluating different capabilities of RAG systems. Without such benchmarks, it can be challenging to compare the various capabilities of different RAG techniques, models, and parameter configurations. Consequently, you may face difficulties in making informed choices when selecting the most appropriate RAG approach that aligns with your unique use case requirements.

Addressing these evaluation and observability challenges is an active area of research, because robust metrics are critical for iterating on and deploying reliable RAG systems for real-world applications.
RAG evaluation concepts and metrics
As mentioned previously, a RAG-based generative AI application is composed of two main processes: retrieval and generation. Retrieval is the process where the application uses the user query to retrieve the relevant documents from a knowledge base before adding them as context to augment the final prompt. Generation is the process of producing the final response from the LLM. It’s important to monitor and evaluate both processes because they impact the performance and reliability of the application.
Evaluating RAG systems at scale requires an automated approach to extract metrics that are quantitative indicators of its reliability. Generally, the metrics to look for are grouped by main RAG components or by domains. Aside from the metrics discussed in this section, you can incorporate tailored metrics that align with your business objectives and priorities.
Retrieval metrics
You can use the following retrieval metrics:

Context relevance – This measures whether the passages or chunks retrieved by the RAG system are relevant for answering the given query, without including extraneous or irrelevant details. The values range from 0–1, with higher values indicating better context relevancy.
Context recall – This evaluates how well the retrieved context matches the annotated answer, treated as the ground truth. It’s computed based on the ground truth answer and the retrieved context. The values range between 0–1, with higher values indicating better performance.
Context precision – This measures if all the truly relevant pieces of information from the given context are ranked highly or not. The preferred scenario is when all the relevant chunks are placed at the top ranks. This metric is calculated by considering the question, the ground truth (correct answer), and the context, with values ranging from 0–1, where higher scores indicate better precision.

Generation metrics
You can use the following generation metrics:

Faithfulness – This measures whether the answer generated by the RAG system is faithful to the information contained in the retrieved passages. The aim is to avoid hallucinations and make sure the output is justified by the context provided as input to the RAG system. The metric ranges from 0–1, with higher values indicating better performance.
Answer relevance – This measures whether the generated answer is relevant to the given query. It penalizes cases where the answer contains redundant information or doesn’t sufficiently answer the actual query. Values range between 0–1, where higher scores indicate better answer relevancy.
Answer semantic similarity – It compares the meaning and content of a generated answer with a reference or ground truth answer. It evaluates how closely the generated answer matches the intended meaning of the ground truth answer. The score ranges from 0–1, with higher scores indicating greater semantic similarity between the two answers. A score of 1 means that the generated answer conveys the same meaning as the ground truth answer, whereas a score of 0 suggests that the two answers have completely different meanings.
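As a concrete illustration of the answer semantic similarity metric, the sketch below computes a cosine similarity in the 0–1 range between a generated answer and a ground truth answer. A simple bag-of-words vector stands in for a real embedding model so the example runs anywhere; in practice you would embed both answers with a model such as Amazon Titan Embeddings and compute the same cosine similarity. The two example answers are hypothetical:

import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; replace with a real embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

generated = "Dry-clean silk garments; do not put silk in a clothes dryer."
ground_truth = "Silk may require dry cleaning and should not be dried in a clothes dryer."
print(round(cosine_similarity(embed(generated), embed(ground_truth)), 2))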

Aspects evaluation
Aspects are evaluated as follows:

Harmfulness (Yes, No) – If the generated answer carries the risk of causing harm to people, communities, or more broadly to society
Maliciousness (Yes, No) – If the submission intends to harm, deceive, or exploit users
Coherence (Yes, No) – If the generated answer presents ideas, information, or arguments in a logical and organized manner
Correctness (Yes, No) – If the generated answer is factually accurate and free from errors
Conciseness (Yes, No) – If the submission conveys information or ideas clearly and efficiently, without unnecessary or redundant details

The RAG Triad proposed by TruLens consists of three distinct assessments, as shown in the following figure: evaluating the relevance of the context, examining the grounding of the information, and assessing the relevance of the answer provided. Achieving satisfactory scores across all three evaluations provides confidence that the corresponding RAG application is not generating hallucinated or fabricated content.

The RAGAS paper proposes automated metrics to evaluate these three quality dimensions in a reference-free manner, without needing human-annotated ground truth answers. This is done by prompting a language model and analyzing its outputs appropriately for each aspect.
To automate the evaluation at scale, metrics are computed using machine learning (ML) models called judges. Judges can be LLMs with reasoning capabilities, lightweight language models that are fine-tuned for evaluation tasks, or transformer models that compute similarities between text chunks such as cross-encoders.
Metric outcomes
When metrics are computed, they need to be examined to further optimize the system in a feedback loop:

Low context relevance means that the retrieval process isn’t fetching the relevant context. Therefore, data parsing, chunk sizes, and embedding models need to be optimized.
Low answer faithfulness means that the generation process is likely subject to hallucination, where the answer is not fully based on the retrieved context. In this case, the model choice needs to be revisited or further prompt engineering needs to be done.
Low answer relevance means that the answer generated by the model doesn’t correspond to the user query, and further prompt engineering or fine-tuning needs to be done.

Solution overview
You can use Amazon Bedrock to evaluate your RAG-based applications. In the following sections, we go over the steps to implement this solution:

Set up observability.
Prepare the evaluation dataset.
Choose the metrics and prepare the evaluation prompts.
Aggregate and review the metric results, then optimize the RAG system.

The following diagram illustrates the continuous process for optimizing a RAG system.

Set up observability
In a RAG system, multiple components (input processing, embedding, retrieval, prompt augmentation, generation, and output formatting) interact to generate answers assisted by external knowledge sources. Monitoring arriving user queries, search results, metadata, and component latencies helps developers identify performance bottlenecks, understand system interactions, monitor for issues, and conduct root cause analysis, all of which are essential for maintaining, optimizing, and scaling the RAG system effectively.
In addition to metrics and logs, tracing is essential for setting up observability for a RAG system due to its distributed nature. The first step to implement tracing in your RAG system is to instrument your application. Instrumenting your application involves adding code to your application, automatically or manually, to send trace data for incoming and outbound requests and other events within your application, along with metadata about each request. There are several different instrumentation options you can choose from or combine, based on your particular requirements:

Auto instrumentation – Instrument your application with zero code changes, typically through configuration changes, adding an auto-instrumentation agent, or other mechanisms
Library instrumentation – Make minimal application code changes to add pre-built instrumentation targeting specific libraries or frameworks, such as the AWS SDK, LangChain, or LlamaIndex
Manual instrumentation – Add instrumentation code to your application at each location where you want to send trace information

To store and analyze your application traces, you can use AWS X-Ray or third-party tools like Arize Phoenix.
Prepare the evaluation dataset
To evaluate the reliability of your RAG system, you need a dataset that evolves with time, reflecting the state of your RAG system. Each evaluation record contains at least the first three of the following elements:

Human query – The user query that arrives in the RAG system
Reference document – The document content retrieved and added as a context to the final prompt
AI answer – The generated answer from the LLM
Ground truth – Optionally, you can add ground truth information:

Context ground truth – The documents or chunks relevant to the human query
Answer ground truth – The correct answer to the human query

If you have set up tracing, your RAG system traces already contain these elements, so you can either use them to prepare your evaluation dataset, or you can create a custom curated synthetic dataset specific for evaluation purposes based on your indexed data. In this post, we use Anthropic’s Claude 3 Sonnet, available in Amazon Bedrock, to evaluate the reliability of sample trace data of a RAG system that indexes the FAQs from the Zappos website.
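Put together, an evaluation record can be as simple as the following; the query, document, and answer shown are illustrative, and the two ground truth fields are optional:

evaluation_record = {
    "human_query": "Can I return shoes after 30 days?",
    "reference_document": "Returns FAQ excerpt retrieved by the RAG system ...",
    "ai_answer": "The generated answer produced by the LLM for this query.",
    # Optional ground truth
    "context_ground_truth": ["Returns FAQ, section on the return window"],
    "answer_ground_truth": "The known correct answer to the query.",
}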
Choose your metrics and prepare the evaluation prompts
Now that the evaluation dataset is prepared, you can choose the metrics that matter most to your application and your use case. In addition to the metrics we’ve discussed, you can create your own metrics to evaluate the aspects that matter to you most. If your evaluation dataset provides answer ground truth, n-gram comparison metrics like ROUGE or embedding-based metrics like BERTScore can be relevant before using an LLM as a judge. For more details, refer to the AWS Foundation Model Evaluations Library and Model evaluation.
When using an LLM as a judge to evaluate the metrics associated with a RAG system, the evaluation prompts play a crucial role in providing accurate and reliable assessments. The following are some best practices when designing evaluation prompts:

Give a clear role – Explicitly state the role the LLM should assume, such as “evaluator” or “judge,” to make sure it understands its task and what it is evaluating.
Give clear indications – Provide specific instructions on how the LLM should evaluate the responses, such as criteria to consider or rating scales to use.
Explain the evaluation procedure – Outline the parameters that need to be evaluated and the evaluation process step by step, including any necessary context or background information.
Deal with edge cases – Anticipate and address potential edge cases or ambiguities that may arise during the evaluation process. For example, determine whether an answer based on irrelevant context should be evaluated as factual or hallucinated.

In this post, we show how to create three custom binary metrics that don’t need ground truth data and that are inspired from some of the metrics we’ve discussed: faithfulness, context relevance, and answer relevance. We created three evaluation prompts.
The following is our faithfulness evaluation prompt template:

You are an AI assistant trained to evaluate interactions between a Human and an AI Assistant. An interaction is composed of a Human query, a reference document, and an AI answer. Your goal is to classify the AI answer using a single lower-case word among the following : “hallucinated” or “factual”.
“hallucinated” indicates that the AI answer provides information that is not found in the reference document.
“factual” indicates that the AI answer is correct relative to the reference document, and does not contain made up information.
Here is the interaction that needs to be evaluated:
Human query: {query} Reference document: {reference} AI answer: {response} Classify the AI’s response as: “factual” or “hallucinated”. Skip the preamble or explanation, and provide the classification.

We also created the following context relevance prompt template:

You are an AI assistant trained to evaluate a knowledge base search system. A search request is composed of a Human query and a reference document. Your goal is to classify the reference document using one of the following classifications in lower-case: “relevant” or “irrelevant”.
“relevant” means that the reference document contains the necessary information to answer the Human query.
“irrelevant” means that the reference document doesn’t contain the necessary information to answer the Human query.
Here is the search request that needs to be evaluated:
Human query: {query} Reference document: {reference}
Classify the reference document as: “relevant” or “irrelevant”. Skip any preamble or explanation, and provide the classification.

The following is our answer relevance prompt template:

You are an AI assistant trained to evaluate interactions between a Human and an AI Assistant. An interaction is composed of a Human query, a reference document, and an AI answer that should be based on the reference document. Your goal is to classify the AI answer using a single lower-case word among the following : “relevant” or “irrelevant”.
“relevant” means that the AI answer answers the Human query and stays relevant to the Human query, even if the reference document lacks full information.
“irrelevant” means that the Human query is not correctly or only partially answered by the AI.
Here is the interaction that needs to be evaluated:
Human query: {query} Reference document: {reference} AI answer: {response}
Classify the AI’s response as: “relevant” or “irrelevant”. Skip the preamble or explanation, and provide the classification.
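To compute these binary metrics at scale, each filled-in prompt is sent to the judge model. The following is a minimal sketch using the Amazon Bedrock runtime API with Anthropic Claude 3 Sonnet as the judge; the model ID shown is the one commonly used for Claude 3 Sonnet on Bedrock, so confirm the IDs available in your Region, and note that FAITHFULNESS_TEMPLATE is a placeholder for the template text above with {query}, {reference}, and {response} fields:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def judge(template: str, **fields) -> str:
    """Fill an evaluation prompt template and return the judge's classification."""
    prompt = template.format(**fields)
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "temperature": 0,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body)
    result = json.loads(response["body"].read())
    return result["content"][0]["text"].strip().lower()

# Example call (FAITHFULNESS_TEMPLATE, q, doc, and answer are placeholders):
# label = judge(FAITHFULNESS_TEMPLATE, query=q, reference=doc, response=answer)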

Aggregate and review your metric results and then optimize your RAG system
After you obtain the evaluation results, you can store the metrics in your observability systems alongside the stored traces to identify areas for improvement based on their values or aggregates.
As indicated in the following diagram, every aspect of a RAG system has cascading impact on what follows; for instance, suboptimal document parsing impacts how reliably chunks are created, impacting embeddings quality, retrieval, and model output. When reviewing reliability metrics of your RAG system to find out what needs to be optimized, you should start by optimizing and reviewing what is earlier in the chain—from the left side of the following diagram.

In the following table, we present 3 of the 15 queries we used from the Zappos FAQs, together with the corresponding LLM answers, the reference documents, and the calculated metrics for faithfulness, context relevance, and answer relevance.

Each example below lists the question, the retrieved document, the generated answer, and the resulting classifications for faithfulness, context relevance, and answer relevance.

Question: What are the specific care instructions for different fabric types like cashmere, silk, and wool?

Retrieved Document: Silk: Dry cleaning may be required. Some silks are hand- or machine-washable. Do not dry silk in a clothes dryer. Garments usually look best when professionally dry-cleaned.

Answer: Cashmere: Generally, hand wash in cold water with a mild detergent. Do not wring or twist. Lay flat to dry. Silk: Dry cleaning is often recommended. Some silks can be hand- or machine-washed, but avoid using a clothes dryer.
Wool: Check the label for specific instructions. Many wools can be machine-washed on a gentle cycle with cold water. Avoid high temperatures, as they can cause the wool to shrink or felt.

Faithfulness: hallucinated
Context Relevance: irrelevant
Answer Relevance: relevant

Question: How can I identify if an email claiming to be from Zappos is legitimate or a phishing attempt?

Retrieved Document: recommends you update your online passwords, including the password to your Zappos.com Account, and report the email to the Federal Trade Commission (FTC). Visit the FTC website here to report the email and find more information on how to identify and prevent phishing attacks.

Answer: To identify if an email claiming to be from Zappos is legitimate or a phishing attempt, you should check for the following: 1. Look for the correct sender email address. Zappos emails should come from an address ending in “@zappos.com”.
2. Check for spelling and grammatical errors in the email. Phishing emails often contain mistakes.
3. Be wary of emails asking for personal information, especially passwords or credit card details. Zappos will never ask for this information via email.
4. Hover over the links in the email without clicking to see if they lead to the correct Zappos URL.
5. If you’re unsure, contact Zappos directly through their official website or customer service number to verify the email’s authenticity. Report any suspicious emails to the Federal Trade Commission (FTC) through their website.

Faithfulness: hallucinated
Context Relevance: relevant
Answer Relevance: relevant

Question: What is Amazon Pay, and how do I access my transaction history with it on Zappos?

Retrieved Document: How do I find my Amazon Pay transactions?

Answer: Amazon Pay is a digital wallet that allows you to make payments online using the information stored in your Amazon account, including your shipping addresses, payment methods, and default delivery settings. To view your Amazon Pay transaction history on Zappos or any other merchant site, you should log into your Amazon account. Once logged in, navigate to the “Your Orders” page, where you can view all of your past and present orders, including those made using Amazon Pay. If you’re unable to find a specific transaction, you can use the search bar to look up the order number or the date of the transaction.

Faithfulness: hallucinated
Context Relevance: irrelevant
Answer Relevance: relevant


(The remaining 12 queries are omitted here.) Aggregate scores over all 15 queries: Faithfulness 87%, Context Relevance 67%, Answer Relevance 100%.

From the preceding aggregates, we can see that answer relevance has a high score; however, context relevance in this example RAG system is 67%. In addition to that, the system is demonstrating a level of hallucination in some cases. Therefore, we should start optimizing earlier in the chain to improve context relevance. If we look at sample questions where context relevance is classified as irrelevant, we can see that text is well parsed; however, we can also see that chunks may start or end in the middle of a sentence or just include the FAQ question without the answer. Therefore, we start by optimizing the chunking method.
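The post does not show the updated chunking code, but a minimal sketch of an FAQ-aware chunker that respects sentence boundaries and keeps each question with its answer could look like the following; the question-detection heuristic (a line ending with a question mark starts a new block) is an assumption about the FAQ layout:

def chunk_faq(text: str, max_chars: int = 1200) -> list:
    """Group an FAQ document into chunks without splitting a question from
    its answer, and therefore without cutting any sentence in half."""
    # 1. Collect (question + answer) blocks: a new block starts at each
    #    non-empty line that ends with a question mark.
    blocks, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.endswith("?") and current:
            blocks.append(" ".join(current))
            current = []
        current.append(stripped)
    if current:
        blocks.append(" ".join(current))

    # 2. Pack whole blocks into chunks up to max_chars, never splitting a block.
    chunks, buffer = [], ""
    for block in blocks:
        if buffer and len(buffer) + len(block) + 1 > max_chars:
            chunks.append(buffer)
            buffer = block
        else:
            buffer = (buffer + " " + block).strip()
    if buffer:
        chunks.append(buffer)
    return chunks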
After we update the chunking mechanism to prevent starting or ending a chunk in the middle of a sentence and to include the FAQ question and answer pairs, we redo the evaluation over the same 15 questions. The following table shows a sample of our results.

As before, each example lists the question, the retrieved document, the generated answer, and the classifications for faithfulness, context relevance, and answer relevance.

Question: What are the specific care instructions for different fabric types like cashmere, silk, and wool?

Retrieved Document: How do I care for XYZ fabric?
Acetate: While most items made of acetate are dry-clean only, some may be hand-washed in lukewarm water. Do not twist the fabric. Iron while the fabric is damp. Press the inside of the garment using a low-temperature setting. Use a pressing cloth when ironing the outside.
Acrylic knit: Most acrylic knit garments can be machine-washed. Read the label, and check for the proper drying option. Some knits retain their shapes best if reshaped and dried flat.
Cashmere: Check the care label before laundering cashmere. To hand-wash, use a mild detergent or shampoo. Gently squeeze the water through the garment, then rinse until the water runs clear. Do not wring or twist. Squeeze out excess water. To dry, lay flat on a towel, away from sunlight or heat.
Cotton: Cotton holds up well to home laundering. Remove the garment from the dryer promptly to reduce wrinkling. Press using spray starch for the crispness of a laundered shirt.
Cotton blend: Dry cotton-blend garments using your dryer’s permanent-press or low-heat cycle, and remove immediately to reduce wrinkling. Touch up with a steam iron; starch for a professionally laundered look.
Linen: Most linen garments need to be dry-cleaned or hand-washed. Follow the instructions on the care label. To touch up or press, use a steam iron on a linen setting for a crisp look.
Nylon: Machine-wash nylon garments in warm water. Use a low-temperature setting when tumble-drying, and include a dryer sheet to reduce static electricity. Use a warm iron to press, if necessary.
Polyester: Read the label. Usually polyester items can be machine-washed (cool) and dried (low). Check the label to see if air-drying is recommended. Touch up with a cool – never hot – iron, if necessary.
Rayon: Make sure to read the care label for rayon clothing. Launder in cool water and a mild detergent. When machine-washing, use the gentle cycle and promptly remove garments after washing. When hand-washing, do not wring or twist the garment. To dry, roll the item in a towel to squeeze out excess water. Lay flat to dry. Iron when still damp and with the garment inside out to prevent the fabric from becoming shiny. Use a pressing cloth when ironing the right side of the garment. Pressing cloths can be purchased, or use something as simple as a piece of white cotton fabric.
Silk: Dry cleaning may be required. Some silks are hand- or machine-washable. Do not dry silk in a clothes dryer. Garments usually look best when professionally dry-cleaned.
Spandex: Use warm water when hand- or machine-washing items made with spandex. Line-dry or use a low temperature setting when machine-drying to preserve the elasticity of the garment.
Wool knit: Typically wool knits need to be dry-cleaned, but check the label. If hand-washable, use cool water and a detergent for fine washables. Squeeze out excess water, then reshape and dry flat.
Wool (lined): Lined garments generally look best when professionally dry-cleaned once a season. Between dry cleanings, remove surface soil with a brush or damp cloth. Refresh the item by hanging it from a padded hanger in a steamy bathroom; moisture will help garment shed wrinkles. When touching up with an iron, use steam in an up-and-down motion (rather than sliding the iron along the fabric).

Answer: Cashmere: Check the care label before laundering. To hand-wash, use a mild detergent or shampoo. Gently squeeze the water through the garment, then rinse until the water runs clear. Do not wring or twist. Squeeze out excess water and lay flat to dry, away from sunlight or heat.
Silk: Dry cleaning may be required. Some silks are hand- or machine-washable. Do not dry silk in a clothes dryer. For hand-washing, use a mild detergent and cool water. Squeeze out excess water and lay flat to dry. Iron when still damp and with the garment inside out to prevent the fabric from becoming shiny.
Wool: The care instructions for wool depend on whether it’s a knit or not. For wool knits, they typically need to be dry-cleaned, but check the label. If hand-washable, use cool water and a detergent for fine washables. Squeeze out excess water, then reshape and dry flat. For lined wool garments, they generally look best when professionally dry-cleaned once a season. Between dry cleanings, remove surface soil with a brush or damp cloth and refresh in a steamy bathroom. When touching up with an iron, use steam in an up-and-down motion.

Faithfulness: factual
Context Relevance: relevant
Answer Relevance: relevant

Question: How can I identify if an email claiming to be from Zappos is legitimate or a phishing attempt?

Retrieved Document: How do I identify whether an Email is or is not from Zappos?
Suspicious or fraudulent emails not from Zappos.com may contain:
A reference to a payment source that Zappos.com does not accept, such as PayPal. Zappos.com currently only accepts credit cards, debit cards, and Zappos.com Gift Cards for payment. A request to update payment information that is not linked to an order you placed on Zappos.com or your Zappos.com account. Links, including “Here” links, to websites not affiliated with Zappos.com. You should be able to hover over a link without clicking it to show the website address associated with the link. If the link shows a site other than Zappos.com or a site that is not a legitimate Zappos domain, then it is likely phishing. An order confirmation for an item you didn’t purchase or an attachment to an order confirmation. If you are a Zappos.com customer, go to your Zappos Account and view Your Order History to see if there is an order that matches the details in the correspondence. If you are not a Zappos.com customer, feel free to contact us to verify whether the email is associated with an actual order. Typos, grammatical errors, or unconventional formatting. A generic email address with Zappos in the email, for example advert@gmail.com or an email address from a non-US country or site not associated with Zappos. Zappos will never send communications from personal or generic email accounts. Order-related emails typically come from cs-noreply@zappos.com or another zappos.com email address. Forged email addresses to make it look like the email is coming from Zappos.com. Your email program may allow you to hover over or click on an email address to view its original source.
What should I do if I believe I’ve received a phishing email or if I responded to what may be a phishing email?
If you believe you received a phishing email or responded to what may be a phishing email, Zappos recommends you update your online passwords, including the password to your Zappos.com Account, and report the email to the Federal Trade Commission (FTC). Visit the FTC website here to report the email and find more information on how to identify and prevent phishing attacks.
Payment Information
Do you accept international credit cards?
Currently Zappos.com can accept international credit cards but we can only ship to an address within the United States and its territories. Please note that the credit card must be issued and contain a logo from either Visa, Mastercard, Discover, or American Express. On the checkout page, there will be a billing and shipping section. Please do the following:
For the billing information
Enter your street address on Address Line 1. Enter your City, County or Province, and Postal Code on Address Line 2. Enter your Country for the city. Enter AA for the state. Enter 11111 for the zip code.
For the shipping information
Please enter the correct information in the appropriate field.
Orders placed with the use of an International Credit Card may take up to 72 hours to finalize prior to shipping. We apologize for any inconvenience this may cause.

Answer: To identify if an email claiming to be from Zappos is legitimate or a phishing attempt, check for the following:
1. Zappos only accepts certain payment methods: credit cards, debit cards, and Zappos.com Gift Cards. 2. No unsolicited requests for payment information or updates. 3. Links lead to Zappos.com or legitimate Zappos domains. 4. Match the order details in the email with your Zappos Account’s order history. 5. Look for typos, grammatical errors, or unconventional formatting. 6. Check for generic email addresses not affiliated with Zappos. 7. Verify the sender’s email address is from cs-noreply@zappos.com or another zappos.com email address. 8. Be wary of forged email addresses.
If you believe you received a phishing email or responded to what may be a phishing email, update your passwords and report the email to the Federal Trade Commission (FTC).

Faithfulness: factual
Context Relevance: relevant
Answer Relevance: relevant

Question: What is Amazon Pay, and how do I access my transaction history with it on Zappos?

Retrieved Document: How do I find my Amazon Pay transactions?
Go to pay.amazon.com and sign in as a “shopper” using your Amazon username and password. One-time purchases are shown on the “Activity” tab, and subscription agreements are on the “Merchant agreements” tab. You can also view your transactions on Amazon.com. If you are on a desktop, your Amazon Pay transactions can be found under “Amazon Orders” within the Amazon Pay tab or in “Your Account” under “Transactions”. If you are in the Amazon app, all transactions are listed together under “Your Orders” or in “Your Account”.

Amazon Pay is a digital wallet that allows you to make payments online using your Amazon account. However, the context does not provide information on how to access your transaction history with Amazon Pay on Zappos. The instructions given in the context are for accessing Amazon Pay transactions on Amazon’s own platform.

Evaluation labels: factual, relevant, relevant (factuality, context relevance, and answer relevance, respectively).


[Evaluation results table (truncated in the source): aggregate scores of 100%, 93%, and 100%, with the 93% corresponding to the context relevance score discussed below.]

After we changed the chunking mechanism to prevent mid-sentence chunking and to keep each FAQ and its corresponding answer in the same chunk, context relevance improved from 67% to 93%. Improving context relevance also resolved the previous hallucinations without any change to the prompt template. We can continue iterating on the optimization process by investigating the questions that still produce irrelevant retrievals and adjusting the indexing or retrieval mechanism, for example by retrieving a larger number of chunks or by using hybrid search to combine lexical and semantic search.
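As an illustration of the chunking change described above, here is a minimal Python sketch (not the implementation used in this post) that keeps each FAQ question together with its answer and falls back to sentence-boundary splitting for free text; the helper names and the character budget are assumptions for illustration.

```python
import re

def chunk_faq_entries(faq_pairs, max_chars=1000):
    """Greedily pack whole FAQ entries (question + answer) into chunks,
    never splitting an entry, so a question and its answer stay together."""
    chunks, current = [], ""
    for question, answer in faq_pairs:
        entry = f"Q: {question}\nA: {answer}"
        # Start a new chunk if adding this entry would exceed the size budget.
        if current and len(current) + len(entry) + 2 > max_chars:
            chunks.append(current)
            current = entry
        else:
            current = f"{current}\n\n{entry}" if current else entry
    if current:
        chunks.append(current)
    return chunks

def split_on_sentences(text, max_chars=1000):
    """Fallback for non-FAQ text: split on sentence boundaries so that no
    chunk ends mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```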
Sample references
To further explore and experiment with different RAG evaluation techniques, you can delve deeper into the sample notebooks available in the Knowledge Bases section of the Amazon Bedrock Samples GitHub repo.
Conclusion
In this post, we described the importance of evaluating and monitoring RAG-based generative AI applications. We showcased the metrics and frameworks for RAG system evaluation and observability, then we went over how you can use FMs in Amazon Bedrock to compute RAG reliability metrics. It’s important to choose the metrics that matter most to your organization and that impact the aspect or configuration you want to optimize.
If RAG is not sufficient for your use case, you can opt for fine-tuning or continued pre-training in Amazon Bedrock or Amazon SageMaker to build custom models that are specific to your domain, organization, and use case. Most importantly, keeping a human in the loop is essential to align AI systems, as well as their evaluation mechanisms, with their intended uses and objectives.

About the Authors
Oussama Maxime Kandakji is a Senior Solutions Architect at AWS focusing on data science and engineering. He works with enterprise customers on solving business challenges and building innovative functionalities on top of AWS. He enjoys contributing to open source and working with data.
Ioan Catana is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He helps customers develop and scale their ML solutions and generative AI applications in the AWS Cloud. Ioan has over 20 years of experience, mostly in software architecture design and cloud engineering.

Key Metrics for Evaluating Large Language Models (LLMs)

Evaluating Large Language Models (LLMs) is a challenging problem in language modeling, as real-world tasks are complex and variable. Conventional benchmarks frequently fail to capture the full breadth of an LLM’s capabilities. A recent LinkedIn post highlighted a number of key metrics that are essential for understanding how well new models perform, summarized below.

MixEval

Evaluating LLMs requires balancing comprehensive, realistic user queries against efficient and reliable grading. Both conventional ground-truth-based benchmarks and LLM-as-judge benchmarks run into difficulties such as grading bias and possible contamination over time.

MixEval addresses these problems by combining real-world user queries with existing benchmarks. The technique builds a solid evaluation framework by matching web-mined questions with similar queries from current benchmarks. A variant of this approach, MixEval-Hard, focuses on more difficult queries and offers more headroom for model improvement.

Because of its unbiased question distribution and grading scheme, MixEval achieves a 0.96 model-ranking correlation with Chatbot Arena. It also requires only about 6% of the time and cost of MMLU, making it fast and economical. Its usefulness is further increased by dynamic evaluation capabilities, backed by a stable and rapid data refresh pipeline.
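MixEval’s matching step, pairing web-mined user queries with the most similar questions from existing benchmarks, can be illustrated with a simple embedding-similarity search. This is a sketch of the idea rather than the actual MixEval pipeline, and the `embed` function is a random placeholder standing in for a real sentence-embedding model.

```python
import numpy as np

def embed(texts):
    """Placeholder embedder: swap in a real sentence-embedding model in practice."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def match_web_queries_to_benchmark(web_queries, benchmark_questions):
    """For each web-mined query, return the most similar benchmark question
    by cosine similarity, mirroring MixEval's matching step in spirit."""
    q_emb = embed(web_queries)
    b_emb = embed(benchmark_questions)
    q_emb /= np.linalg.norm(q_emb, axis=1, keepdims=True)
    b_emb /= np.linalg.norm(b_emb, axis=1, keepdims=True)
    sims = q_emb @ b_emb.T                  # pairwise cosine similarities
    best = sims.argmax(axis=1)              # closest benchmark question per query
    return [(web_queries[i], benchmark_questions[j], float(sims[i, j]))
            for i, j in enumerate(best)]
```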

IFEval (Instruction-Following Evaluation)

The ability to follow natural-language instructions is one of the fundamental skills of LLMs. However, the absence of standardized criteria has made this skill difficult to evaluate: human evaluations are costly and time-consuming, while LLM-based auto-evaluations can be biased or limited by the evaluator’s own abilities.

IFEval is a simple, reproducible benchmark that assesses this important capability by focusing on verifiable instructions. The benchmark consists of roughly 500 prompts, each containing one or more of 25 types of verifiable instructions. IFEval offers quantifiable, easily interpreted indicators that make it straightforward to assess model performance in practical situations.
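The point of “verifiable” instructions is that compliance can be checked with deterministic code instead of a human or an LLM judge. The checks below are illustrative examples in that spirit, not IFEval’s exact instruction definitions.

```python
import json

def check_min_words(response: str, n: int) -> bool:
    """Instruction of the form 'answer in at least n words'."""
    return len(response.split()) >= n

def check_num_bullets(response: str, n: int) -> bool:
    """Instruction of the form 'use exactly n bullet points'."""
    bullets = [line for line in response.splitlines()
               if line.strip().startswith(("-", "*"))]
    return len(bullets) == n

def check_valid_json(response: str) -> bool:
    """Instruction of the form 'wrap the entire answer in valid JSON'."""
    try:
        json.loads(response)
        return True
    except ValueError:
        return False

def prompt_passes(response: str, checks) -> bool:
    """Prompt-level strict accuracy: every attached instruction must be satisfied."""
    return all(check(response) for check in checks)
```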

Arena-Hard

Arena-Hard-Auto-v0.1 is an automatic evaluation tool for instruction-tuned LLMs. It consists of 500 challenging user questions and compares each model’s answers against a baseline model (by default GPT-4-0314), using GPT-4-Turbo as the judge. While it is comparable to Chatbot Arena’s Category Hard, Arena-Hard-Auto relies on automatic judgment, making it a faster and more affordable option.
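The pairwise-judging loop behind this kind of benchmark can be sketched as follows. The `judge` callable is a hypothetical wrapper around an LLM judge (for example, GPT-4-Turbo behind an API), and the tie-handling convention is an assumption for illustration.

```python
from typing import Callable, List

def pairwise_win_rate(
    questions: List[str],
    model_answers: List[str],
    baseline_answers: List[str],
    judge: Callable[[str, str, str], str],
) -> float:
    """Compare a candidate model against a baseline, question by question.
    judge(question, answer_a, answer_b) is assumed to return 'A', 'B', or 'tie';
    ties count as half a win."""
    wins = 0.0
    for question, answer, baseline in zip(questions, model_answers, baseline_answers):
        verdict = judge(question, answer, baseline)
        if verdict == "A":
            wins += 1.0
        elif verdict == "tie":
            wins += 0.5
    return wins / len(questions)
```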

Among widely used open-ended LLM benchmarks, Arena-Hard offers the strongest correlation with Chatbot Arena and the best separability between models. This makes it a useful tool for forecasting Chatbot Arena performance, helping researchers assess quickly and affordably how well their models perform in real-world scenarios.

MMLU (Massive Multitask Language Understanding)

MMLU is designed to assess a model’s multitask accuracy across a wide variety of fields, including computer science, law, US history, and elementary mathematics. The benchmark spans 57 tasks and requires models to possess broad world knowledge and problem-solving ability.

When MMLU was introduced, most models performed close to random-chance accuracy on it, indicating a large amount of room for improvement. MMLU surfaces these weaknesses and provides a thorough assessment of a model’s professional and academic understanding.
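Scoring MMLU reduces to multiple-choice accuracy: compare the letter the model picks with the gold answer for each question, then aggregate per subject and overall. A minimal sketch, where the example layout and the `predict` callable are assumptions:

```python
from collections import defaultdict

def mmlu_accuracy(examples, predict):
    """examples: iterable of dicts with 'subject', 'question', 'choices' (four
    options), and 'answer' (gold letter 'A'-'D').
    predict(question, choices): stand-in for the model, returning a letter."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        prediction = predict(ex["question"], ex["choices"])
        total[ex["subject"]] += 1
        if prediction == ex["answer"]:
            correct[ex["subject"]] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject
```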

GSM8K

Modern language models often find multi-step mathematical reasoning difficult. GSM8K addresses this challenge by offering a collection of 8.5K high-quality, linguistically diverse grade school math word problems. Even the largest transformer models struggle to achieve strong results on this dataset.

To improve performance, the researchers propose training verifiers to judge the correctness of model completions: the model produces several candidate solutions and the highest-ranked one is selected, which dramatically improves accuracy on GSM8K. This strategy supports work aimed at strengthening models’ capacity for mathematical reasoning.
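The verification idea amounts to sample-then-rank: draw several candidate solutions, score each with a trained verifier, and keep the highest-scoring one. In the sketch below, `generate_solutions` and `verifier_score` are hypothetical stand-ins for the generator and verifier models.

```python
from typing import Callable, List

def best_of_n(
    problem: str,
    generate_solutions: Callable[[str, int], List[str]],
    verifier_score: Callable[[str, str], float],
    n: int = 100,
) -> str:
    """Generate n candidate solutions for a word problem and return the one
    the verifier ranks highest."""
    candidates = generate_solutions(problem, n)
    return max(candidates, key=lambda solution: verifier_score(problem, solution))
```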

HumanEval

HumanEval was introduced alongside Codex, a GPT language model fine-tuned on publicly available code from GitHub, to assess Python code-writing ability. Codex solves 28.8% of the problems on the HumanEval benchmark, outperforming GPT-3 and GPT-J. With repeated sampling of 100 candidates per problem, performance improves further, with at least one correct solution found for 70.2% of the problems.

HumanEval uses hand-written programming tasks and unit tests to assess code generation models, shedding light on their strengths and weaknesses and offering insight into their potential and areas for improvement.
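Results on HumanEval are commonly reported with the unbiased pass@k estimator introduced in the Codex paper: generate n samples per problem, count the c samples that pass the unit tests, compute 1 - C(n-c, k) / C(n, k), and average over problems. The numbers in the example below are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem, where n is the number of
    generated samples and c is the number that pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average pass@10 over a few problems, with 100 samples per problem.
results = [(100, 23), (100, 0), (100, 71)]   # (n, c) pairs, illustrative numbers
score = sum(pass_at_k(n, c, 10) for n, c in results) / len(results)
```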

Note: This article is inspired by this LinkedIn post.
The post Key Metrics for Evaluating Large Language Models (LLMs) appeared first on MarkTechPost.

Transcending Human Expertise: Achieving Superior Performance in Generative AI Models through Low-Temperature Sampling and Diverse Data

Generative models are designed to replicate the patterns in the data they are trained on, typically mirroring human actions and outputs. Since these models learn to minimize the difference between their predictions and human-generated data, they aim to match the quality of human expertise in various tasks, such as answering questions or creating art. This raises a question: can these models exceed the proficiency of the expert sources they learn from, given their goal is merely to imitate human performance rather than innovate beyond it?

Researchers from Harvard University, UC Santa Barbara, Apple, the Kempner Institute, Princeton University, and Google DeepMind explored “transcendence” in generative models, where a model surpasses the abilities of its expert data sources. Using an autoregressive transformer trained on chess game transcripts, they demonstrated that the model could outperform the maximum rating of players in the dataset through low-temperature sampling. This process aligns with the “wisdom of the crowd,” where the collective decision-making of diverse experts often surpasses individual performance. The study provides a theoretical framework and empirical evidence showing that such generative models can enhance performance.

Chess has been integral to AI development since its inception, with early explorations by Claude Shannon and Alan Turing. The game continues to inspire advances, leading to the defeat of world champion Garry Kasparov by IBM’s Deep Blue in 1997 and the dominance of AlphaZero’s RL-based approach over previous engines like Stockfish. The study connects with AI diversity research, showing that models trained on diverse datasets outperform individual expert-based models through ensemble methods and low-temperature sampling. Additionally, the concept is tied to Offline Reinforcement Learning, where training on varied behavior can lead to policies surpassing the original training data’s performance.

Transcendence in generative models occurs when a model outperforms the experts on which it was trained; it is defined mathematically by comparing the model’s average reward on a test distribution to the rewards of the experts. Low-temperature sampling is a key factor enabling transcendence: it concentrates probability mass on high-reward actions, effectively simulating a majority vote among expert predictions. This denoising effect can surpass individual expert performance, especially in settings with multiple experts who excel in different areas. Additionally, even a model trained on a single noisy expert can achieve transcendence through careful sampling that emphasizes the expert’s optimal outputs.
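Temperature enters through the softmax used to sample the next move or token: logits are divided by a temperature T before normalization, so T < 1 sharpens the distribution toward the highest-scoring actions and T approaching 0 behaves like greedy argmax. A minimal NumPy sketch of this mechanism (not the paper’s code):

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Sample an action index from softmax(logits / temperature).
    Lower temperature concentrates probability mass on high-scoring actions."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-8)   # guard against division by zero
    scaled -= scaled.max()                     # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_with_temperature(logits, temperature=1.0))   # fairly diverse choices
print(sample_with_temperature(logits, temperature=0.1))   # almost always the top action
```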

To evaluate the theoretical results on transcendence in chess-playing models, various autoregressive transformer models were trained on a dataset of one billion games from lichess.org. The models operating without direct access to the board state were tested against the Stockfish chess engine under different temperature sampling settings. Results demonstrated that low-temperature sampling significantly improved the model’s play by enhancing its move selection during critical game states. The study found that models trained on more diverse datasets, such as those with lower rating caps, were better at transcending their training limitations, highlighting the importance of dataset diversity for achieving transcendence.

In conclusion, the study introduces transcendence, where generative models trained on expert data outperform the best individual experts. Theoretical analysis indicates that low-temperature sampling achieves transcendence by denoising expert biases and consolidating diverse knowledge, validated through chess model training. The study underscores the importance of dataset diversity for transcendence and suggests future research in other domains like NLP and computer vision to assess generalizability. Ethical considerations in deploying generative models and their broader impact are also highlighted, noting that the study does not imply models can create novel solutions beyond human expert capability.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Transcending Human Expertise: Achieving Superior Performance in Generative AI Models through Low-Temperature Sampling and Diverse Data appeared first on MarkTechPost.

Together AI Introduces Mixture of Agents (MoA): An AI Framework that Leverages the Collective Strengths of Multiple LLMs to Improve State-of-the-Art Quality

In a significant leap forward for AI, Together AI has introduced an innovative Mixture of Agents (MoA) approach, Together MoA. This new model harnesses the collective strengths of multiple large language models (LLMs) to enhance state-of-the-art quality and performance, setting new benchmarks in AI. 

MoA employs a layered architecture, with each layer comprising several LLM agents. These agents utilize outputs from the previous layer as auxiliary information to generate refined responses. This method allows MoA to integrate diverse capabilities and insights from various models, resulting in a more robust and versatile combined model. The implementation has proven successful, achieving a remarkable score of 65.1% on the AlpacaEval 2.0 benchmark, surpassing the previous leader, GPT-4o, which scored 57.5%.


A critical insight driving the development of MoA is the concept of “collaborativeness” among LLMs. This phenomenon suggests that an LLM tends to generate better responses when presented with outputs from other models, even if those models are less capable. By leveraging this insight, MoA’s architecture categorizes models into “proposers” and “aggregators.” Proposers generate initial reference responses, offering nuanced and diverse perspectives, while aggregators synthesize these responses into high-quality outputs. This iterative process continues through several layers until a comprehensive and refined response is achieved.
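The layered proposer/aggregator flow can be sketched as below. The `call_llm` callable is a hypothetical wrapper around whatever inference API is used, and the prompt wording is illustrative rather than Together’s actual aggregation template.

```python
from typing import Callable, List

def mixture_of_agents(
    prompt: str,
    proposer_layers: List[List[str]],        # model names, one inner list per layer
    aggregator: str,                         # model name for the final synthesis
    call_llm: Callable[[str, str], str],     # (model_name, prompt) -> response
) -> str:
    """Each layer's proposers see the previous layer's responses as auxiliary
    context; a final aggregator model synthesizes the last layer's outputs."""
    previous: List[str] = []
    for layer in proposer_layers:
        augmented = prompt
        if previous:
            refs = "\n\n".join(f"[{i + 1}] {r}" for i, r in enumerate(previous))
            augmented = f"{prompt}\n\nResponses from other models:\n{refs}"
        previous = [call_llm(model, augmented) for model in layer]
    refs = "\n\n".join(f"[{i + 1}] {r}" for i, r in enumerate(previous))
    final_prompt = (f"{prompt}\n\nSynthesize the following candidate responses "
                    f"into a single high-quality answer:\n{refs}")
    return call_llm(aggregator, final_prompt)
```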

The Together MoA framework has been rigorously tested on multiple benchmarks, including AlpacaEval 2.0, MT-Bench, and FLASK. The results are impressive, with Together MoA achieving top positions on the AlpacaEval 2.0 and MT-Bench leaderboards. Notably, on AlpacaEval 2.0, Together MoA achieved a 7.6% absolute improvement margin from 57.5% (GPT-4o) to 65.1% using only open-source models. This demonstrates the model’s superior performance compared to closed-source alternatives.


In addition to its technical success, Together MoA is designed with cost-effectiveness in mind. By analyzing the cost-performance trade-offs, the research indicates that the Together MoA configuration provides the best balance, offering high-quality results at a reasonable cost. This is particularly evident in the Together MoA-Lite configuration, which, despite having fewer layers, matches GPT-4o in cost while achieving superior quality.

MoA’s success is attributed to the collaborative efforts of several organizations in the open-source AI community, including Meta AI, Mistral AI, Microsoft, Alibaba Cloud, and Databricks. Their contributions to developing models like Meta Llama 3, Mixtral, WizardLM, Qwen, and DBRX have been instrumental in this achievement. Additionally, benchmarks like AlpacaEval, MT-Bench, and FLASK, developed by Tatsu Labs, LMSYS, and KAIST AI, played a crucial role in evaluating MoA’s performance.

Looking ahead, Together AI plans to further optimize the MoA architecture by exploring various model choices, prompts, and configurations. One key focus will be reducing time-to-first-token latency, an exciting direction for future research. The team also aims to enhance MoA’s capabilities on reasoning-focused tasks, further solidifying its position as a leader in AI innovation.


In conclusion, Together MoA represents a significant advancement in leveraging the collective intelligence of open-source models. Its layered approach and collaborative ethos exemplify the potential for enhancing AI systems, making them more capable, robust, and aligned with human reasoning. The AI community eagerly anticipates this groundbreaking technology’s continued evolution and application.

Check out the Paper, GitHub, and Blog. All credit for this research goes to the researchers of this project.
The post Together AI Introduces Mixture of Agents (MoA): An AI Framework that Leverages the Collective Strengths of Multiple LLMs to Improve State-of-the-Art Quality appeared first on MarkTechPost.