Detect and protect sensitive data with Amazon Lex and Amazon CloudWatch

In today’s digital landscape, the protection of personally identifiable information (PII) is not just a regulatory requirement, but a cornerstone of consumer trust and business integrity. Organizations use advanced natural language detection services like Amazon Lex for building conversational interfaces and Amazon CloudWatch for monitoring and analyzing operational data.
One risk many organizations face is the inadvertent exposure of sensitive data through logs, voice chat transcripts, and metrics. This risk is exacerbated by the increasing sophistication of cyber threats and the stringent penalties associated with data protection violations. Dealing with massive datasets is not just about identifying and categorizing PII. The challenge also lies in implementing robust mechanisms to obfuscate and redact this sensitive data. At the same time, it’s crucial to make sure these security measures don’t undermine the functionality and analytics critical to business operations.
This post addresses this pressing pain point, offering prescriptive guidance on safeguarding PII through detection and masking techniques specifically tailored for environments using Amazon Lex and CloudWatch Logs.
Solution overview
To address this critical challenge, our solution uses the slot obfuscation feature in Amazon Lex and the data protection capabilities of CloudWatch Logs, tailored specifically for detecting and protecting PII in logs.
In Amazon Lex, slots are used to capture and store user input during a conversation. Slots are placeholders within an intent that capture the information the bot needs to fulfill the user's request. For example, in a flight booking bot, slots might include departure city, destination city, and travel dates. Slot obfuscation makes sure any information collected through Amazon Lex conversational interfaces, such as names, addresses, or any other PII entered by users, is obfuscated at the point of capture. This method reduces the risk of sensitive data exposure in chat logs and playbacks.
In CloudWatch Logs, data protection and custom identifiers add an additional layer of security by enabling the masking of PII within session attributes, input transcripts, and other sensitive log data that is specific to your organization.
This approach minimizes the footprint of sensitive information across these services and helps with compliance with data protection regulations.
In the following sections, we demonstrate how to identify and classify your data, locate your sensitive data, and finally monitor and protect it, both in transit and at rest, especially in areas where it may inadvertently appear. The following are the four ways to do this:

Amazon Lex – Monitor and protect data with Amazon Lex using slot obfuscation and selective conversation log capture
CloudWatch Logs – Monitor and protect data with CloudWatch Logs using playbacks and log group policies
Amazon S3 – Monitor and protect data with Amazon Simple Storage Service (Amazon S3) using bucket security and encryption
Service Control Policies – Monitor and protect with data governance controls and risk management policies using Service Control Policies (SCPs) to prevent changes to Amazon Lex chatbots and CloudWatch Logs groups, and restrict unmasked data viewing in CloudWatch Logs Insights

Identify and classify your data
The first step is to identify and classify the data flowing through your systems. This involves understanding the types of information processed and determining their sensitivity level.
To determine all the slots in an intent in Amazon Lex, complete the following steps:

On the Amazon Lex console, choose Bots in the navigation pane.
Choose your preferred bot.
In the navigation pane, choose the locale under All Languages and choose Intents.
Choose the required intent from the list.
In the Slots section, make note of all the slots within the intent.

After you identify the slots within the intent, it’s important to classify them according to their sensitivity level and the potential impact of unauthorized access or disclosure. For example, you may have the following data types:

Name
Address
Phone number
Email address
Account number

Email address and physical mailing address are often considered a medium classification level. Sensitive data, such as name, account number, and phone number, should be tagged with a high classification level, indicating the need for stringent security measures. These guidelines can help with systematically evaluating data.
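To make this classification actionable in later steps, it can help to keep a simple machine-readable map of slot names to sensitivity levels. The following is a minimal Python sketch; the slot names and levels are illustrative and not tied to any AWS API:

SLOT_CLASSIFICATION = {
    "Name": "high",
    "AccountNumber": "high",
    "PhoneNumber": "high",
    "EmailAddress": "medium",
    "Address": "medium",
}

def slots_requiring_obfuscation(classification, minimum_level="high"):
    """Return the slots that should be obfuscated based on their sensitivity level."""
    levels = {"low": 0, "medium": 1, "high": 2}
    return [slot for slot, level in classification.items() if levels[level] >= levels[minimum_level]]

print(slots_requiring_obfuscation(SLOT_CLASSIFICATION))
# ['Name', 'AccountNumber', 'PhoneNumber']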
Locate your data stores
After you classify the data, the next step is to locate where this data resides or is processed in your systems and applications. For services involving Amazon Lex and CloudWatch, it’s crucial to identify all data stores and their roles in handling PII.
CloudWatch captures logs generated by Amazon Lex, including interaction logs that might contain PII. Regular audits and monitoring of these logs are essential to detect any unauthorized access or anomalies in data handling.
Amazon S3 is often used in conjunction with Amazon Lex for storing call recordings or transcripts, which may contain sensitive information. Making sure these storage buckets are properly configured with encryption, access controls, and lifecycle policies is vital to protect the stored data.
Organizations can create a robust framework for protection by identifying and classifying data, along with pinpointing the data stores (like CloudWatch and Amazon S3). This framework should include regular audits, access controls, and data encryption to prevent unauthorized access and comply with data protection laws.
Monitor and protect data with Amazon Lex
In this section, we demonstrate how to protect your data with Amazon Lex using slot obfuscation and selective conversation log capture.
Slot obfuscation in Amazon Lex
Sensitive information can appear in the input transcripts of conversation logs. It’s essential to implement mechanisms that detect and mask or redact PII in these transcripts before they are stored or logged.
In the development of conversational interfaces using Amazon Lex, safeguarding PII is crucial to maintain user privacy and comply with data protection regulations. Slot obfuscation provides a mechanism to automatically obscure PII within conversation logs, making sure sensitive information is not exposed. When configuring an intent within an Amazon Lex bot, developers can mark specific slots—placeholders for user-provided information—as obfuscated. This setting tells Amazon Lex to replace the actual user input for these slots with a placeholder in the logs. For instance, enabling obfuscation for slots designed to capture sensitive information like account numbers or phone numbers makes sure any matching input is masked in the conversation log. Slot obfuscation allows developers to significantly reduce the risk of inadvertently logging sensitive information, thereby enhancing the privacy and security of the conversational application. It’s a best practice to identify and mark all slots that could potentially capture PII during the bot design phase to provide comprehensive protection across the conversation flow.
To enable obfuscation for a slot from the Amazon Lex console, complete the following steps:

On the Amazon Lex console, choose Bots in the navigation pane.
Choose your preferred bot.
In the navigation pane, choose the locale under All Languages and choose Intents.
Choose your preferred intent from the list.
In the Slots section, expand the slot details.
Choose Advanced options to access additional settings.
Select Enable slot obfuscation.
Choose Update slot to save the changes.
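If you manage bots programmatically, the same setting can be applied with the Lex V2 model-building API. The following is a minimal boto3 sketch under the assumption that you first read the slot's current configuration with describe_slot and then re-apply it with update_slot; all IDs are placeholders:

import boto3

lex_models = boto3.client("lexv2-models")

# Placeholder identifiers for the bot, locale, intent, and slot you noted earlier.
ids = dict(botId="BOT_ID", botVersion="DRAFT", localeId="en_US", intentId="INTENT_ID", slotId="SLOT_ID")

# Read the slot's current configuration so the update preserves its elicitation settings.
slot = lex_models.describe_slot(**ids)

update_args = dict(
    **ids,
    slotName=slot["slotName"],
    valueElicitationSetting=slot["valueElicitationSetting"],
    obfuscationSetting={"obfuscationSettingType": "DefaultObfuscation"},  # mask this slot in logs
)
if "slotTypeId" in slot:
    update_args["slotTypeId"] = slot["slotTypeId"]

lex_models.update_slot(**update_args)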

Selective conversation log capture
Amazon Lex lets you choose how conversation logs capture text and audio data from live conversations, so you can filter certain types of information out of the logs. By capturing only the data that's necessary, businesses can minimize the risk of exposing private or confidential information. This feature also helps organizations comply with data privacy regulations, because it gives them more control over the data that is collected and stored. You can capture text logs, audio logs, or both.
When selective conversation log capture is enabled for text and audio logs, it disables logging for all intents and slots in the conversation. To generate text and audio logs for particular intents and slots, set the text and audio selective conversation log capture session attributes for those intents and slots to "true". When selective conversation log capture is enabled, any slot values in SessionState, Interpretations, and Transcriptions for which logging is not enabled using session attributes will be obfuscated in the generated text log.
To enable selective conversation log capture, complete the following steps:

On the Amazon Lex console, choose Bots in the navigation pane.
Choose your preferred bot.
Choose Aliases under Deployment and choose the bot’s alias.
Choose Manage conversation logs.
Select Selectively log utterances.

For text logs, choose a CloudWatch log group.
For audio logs, choose an S3 bucket to store the logs and assign an AWS Key Management Service (AWS KMS) key for added security.

Save the changes.

Selective conversation log capture is now enabled at the alias level. Next, specify the intents and slots to capture.

Choose Intents in the navigation pane and choose your intent.
Under Initial responses, choose Advanced options and expand Set values.
For Session attributes, set the following attributes based on the intents and slots for which you want to enable selective conversation log capture. This will capture utterances that contain only a specific slot in the conversation.

x-amz-lex:enable-audio-logging:<intent>:<slot> = "true"
x-amz-lex:enable-text-logging:<intent>:<slot> = "true"

Choose Update options and rebuild the bot.

Replace <intent> and <slot> with the respective intent and slot names.
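You can also set these session attributes programmatically at runtime. The following is a minimal boto3 sketch using the Lex V2 runtime API; the bot ID, alias ID, and the BookFlight and PhoneNumber names are placeholders for your own intent and slot:

import boto3

lex_runtime = boto3.client("lexv2-runtime")

response = lex_runtime.recognize_text(
    botId="BOT_ID",              # placeholder
    botAliasId="BOT_ALIAS_ID",   # placeholder
    localeId="en_US",
    sessionId="user-session-001",
    text="My phone number is 555-0100",
    sessionState={
        "sessionAttributes": {
            # Log only the PhoneNumber slot of the BookFlight intent (illustrative names).
            "x-amz-lex:enable-text-logging:BookFlight:PhoneNumber": "true",
            "x-amz-lex:enable-audio-logging:BookFlight:PhoneNumber": "true",
        }
    },
)
print(response["sessionState"])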

Monitor and protect data with CloudWatch Logs
In this section, we demonstrate how to protect your data with CloudWatch using playbacks and log group policies.
Playbacks in CloudWatch Logs
When Amazon Lex engages in interactions, delivering prompts or messages from the bot to the customer, there’s a potential risk for PII to be inadvertently included in these communications. This risk extends to CloudWatch Logs, where these interactions are recorded for monitoring, debugging, and analysis purposes. The playback of prompts or messages designed to confirm or clarify user input can inadvertently expose sensitive information if not properly handled. To mitigate this risk and protect PII within these interactions, a strategic approach is necessary when designing and deploying Amazon Lex bots.
The solution lies in carefully structuring how slot values, which may contain PII, are referenced and used in the bot’s response messages. Adopting a prescribed format for passing slot values, specifically by encapsulating them within curly braces (for example, {slotName}), allows developers to control how this information is presented back to the user and logged in CloudWatch. This method makes sure that when the bot constructs a message, it refers to the slot by its name rather than its value, thereby preventing any sensitive information from being directly included in the message content. For example, instead of the bot saying, “Is your phone number 123-456-7890? ” it would use a generic placeholder, “Is your phone number {PhoneNumber}? ” with {PhoneNumber} being a reference to the slot that captured the user’s phone number. This approach allows the bot to confirm or clarify information without exposing the actual data.
When these interactions are logged in CloudWatch, the logs will only contain the slot name references, not the actual PII. This technique significantly reduces the risk of sensitive information being exposed in logs, enhancing privacy and compliance with data protection regulations. Organizations should make sure all personnel involved in bot design and deployment are trained on these practices to consistently safeguard user information across all interactions.
The following is sample AWS Lambda function code in Python that references the slot value of a phone number provided by the user. SSML tags are used to format the slot value for slow and clear speech output, and the function returns a response confirming whether the captured phone number is correct:

def lambda_handler(event, context):
    # Extract the intent name from the event
    intent_name = event['sessionState']['intent']['name']
    # Extract the slots from the event
    slots = event['sessionState']['intent']['slots']

    # Check if the intent name is 'INTENT_NAME'
    if intent_name == 'INTENT_NAME':
        # Retrieve the phone number from the 'SLOT_NAME' slot
        phone_number = slots['SLOT_NAME']['value']['interpretedValue']

        # Create an SSML-formatted message with the phone number
        msg = f'''<speak>
        Thank you for providing your phone number. Is
        <prosody rate="slow">
        <say-as interpret-as="telephone">{phone_number}</say-as>
        </prosody> correct?
        </speak>'''

        # Create a message array
        message_array = [
            {
                'contentType': 'SSML',
                'content': msg
            }
        ]

        # Response with the dialog action, intent state, and the message array
        response = {
            'sessionState': {
                'dialogAction': {
                    'type': 'Close'
                },
                'intent': {
                    'name': intent_name,
                    'state': 'Fulfilled'
                }
            },
            'messages': message_array
        }
    else:
        # Generic response for unhandled intents
        response = {
            'sessionState': {
                'dialogAction': {
                    'type': 'Close'
                },
                'intent': {
                    'name': intent_name,
                    'state': 'Fulfilled'
                }
            },
            'messages': [
                {
                    'contentType': 'PlainText',
                    'content': 'I apologize, but I am unable to assist.'
                }
            ]
        }
    return response

Replace INTENT_NAME and SLOT_NAME with your preferred intent and slot names, respectively.
CloudWatch data protection log group policies for data identifiers
Sensitive data that’s ingested by CloudWatch Logs can be safeguarded by using log group data protection policies. These policies allow you to audit and mask sensitive data that appears in log events ingested by the log groups in your account.
CloudWatch Logs supports both managed and custom data identifiers.
Managed data identifiers offer preconfigured data types to protect financial data, personal health information (PHI), and PII. For some types of managed data identifiers, the detection depends on also finding certain keywords in proximity with the sensitive data.
Each managed data identifier is designed to detect a specific type of sensitive data, such as name, email address, account numbers, AWS secret access keys, or passport numbers for a particular country or region. When creating a data protection policy, you can configure it to use these identifiers to analyze logs ingested by the log group, and take actions when they are detected.
By using managed data identifiers, CloudWatch Logs data protection can automatically detect these categories of sensitive data as log events are ingested.
To configure managed data identifiers on the CloudWatch console, complete the following steps:

On the CloudWatch console, under Logs in the navigation pane, choose Log groups.
Select your log group and on the Actions menu, choose Create data protection policy.
Under Auditing and masking configuration, for Managed data identifiers, select all the identifiers that the data protection policy should apply to.
Choose the data store to apply the policy to and save the changes.

Custom data identifiers let you define your own custom regular expressions that can be used in your data protection policy. With custom data identifiers, you can target business-specific PII use cases that managed data identifiers don’t provide. For example, you can use custom data identifiers to look for a company-specific account number format.
To create a custom data identifier on the CloudWatch console, complete the following steps:

On the CloudWatch console, under Logs in the navigation pane, choose Log groups.
Select your log group and on the Actions menu, choose Create data protection policy.
Under Custom Data Identifier configuration, choose Add custom data identifier.
Create your own regex patterns to identify sensitive information that is unique to your organization or specific use case.
After you add your data identifier, choose the data store to apply this policy to.
Choose Activate data protection.
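You can also attach the same kind of policy programmatically with the PutDataProtectionPolicy API. The following is a minimal boto3 sketch that combines a managed identifier with a custom identifier; the log group name and the regex pattern are placeholders, and you should verify the policy document fields against the CloudWatch Logs data protection policy syntax for your use case:

import json
import boto3

logs = boto3.client("logs")

policy = {
    "Name": "lex-pii-protection-policy",
    "Description": "Audit and mask PII in Lex conversation logs",
    "Version": "2021-06-01",
    "Configuration": {
        "CustomDataIdentifier": [
            # Illustrative company-specific account number format.
            {"Name": "CompanyAccountNumber", "Regex": "ACCT-[0-9]{8}"}
        ]
    },
    "Statement": [
        {
            "Sid": "audit-policy",
            "DataIdentifier": [
                "arn:aws:dataprotection::aws:data-identifier/EmailAddress",
                "CompanyAccountNumber",
            ],
            "Operation": {"Audit": {"FindingsDestination": {}}},
        },
        {
            "Sid": "redact-policy",
            "DataIdentifier": [
                "arn:aws:dataprotection::aws:data-identifier/EmailAddress",
                "CompanyAccountNumber",
            ],
            "Operation": {"Deidentify": {"MaskConfig": {}}},
        },
    ],
}

logs.put_data_protection_policy(
    logGroupIdentifier="YOUR_LEX_CONVERSATION_LOG_GROUP",  # placeholder
    policyDocument=json.dumps(policy),
)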

For details about the types of data that can be protected, refer to Types of data that you can protect.
Monitor and protect data with Amazon S3
In this section, we demonstrate how to protect your data in S3 buckets.
Encrypt audio recordings in S3 buckets
PII can often be captured in audio recordings, especially in sectors like customer service, healthcare, and financial services, where sensitive information is frequently exchanged over voice interactions. To comply with domain-specific regulatory requirements, organizations must adopt stringent measures for managing PII in audio files.
One approach is to disable the recording feature entirely if it poses too high a risk of non-compliance or if the value of the recordings doesn’t justify the potential privacy implications. However, if audio recordings are essential, streaming the audio data in real time using Amazon Kinesis provides a scalable and secure method to capture, process, and analyze audio data. This data can then be exported to a secure and compliant storage solution, such as Amazon S3, which can be configured to meet specific compliance needs including encryption at rest. You can use AWS KMS or AWS CloudHSM to manage encryption keys, offering robust mechanisms to encrypt audio files at rest, thereby securing the sensitive information they might contain. Implementing these encryption measures makes sure that even if data breaches occur, the encrypted PII remains inaccessible to unauthorized parties.
Configuring these AWS services allows organizations to balance the need for audio data capture with the imperative to protect sensitive information and comply with regulatory standards.
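As a small illustration of the encryption-at-rest piece, the following boto3 sketch uploads an audio recording to Amazon S3 with SSE-KMS; the bucket name, object key, and KMS key alias are placeholders:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket, object key, and KMS key alias.
with open("recording-001.wav", "rb") as audio_file:
    s3.put_object(
        Bucket="YOUR_LEX_DATA_BUCKET",
        Key="audio/recording-001.wav",
        Body=audio_file,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/YOUR_AUDIO_KMS_KEY",
    )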
S3 bucket security configurations
You can use an AWS CloudFormation template to configure various security settings for an S3 bucket that stores Amazon Lex data like audio recordings and logs. For more information, see Creating a stack on the AWS CloudFormation console. See the following example code:

AWSTemplateFormatVersion: '2010-09-09'
Description: Create a secure S3 bucket with KMS encryption to store Lex Data
Resources:
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: YOUR_LEX_DATA_BUCKET
      AccessControl: Private
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: alias/aws/s3
      VersioningConfiguration:
        Status: Enabled
      ObjectLockConfiguration:
        ObjectLockEnabled: Enabled
        Rule:
          DefaultRetention:
            Mode: GOVERNANCE
            Years: 5
      LoggingConfiguration:
        DestinationBucketName: !Ref YOUR_SERVER_ACCESS_LOG_BUCKET
        LogFilePrefix: lex-bucket-logs/

The template defines the following properties:

BucketName – Specifies your bucket. Replace YOUR_LEX_DATA_BUCKET with your preferred bucket name.
AccessControl – Sets the bucket access control to Private, denying public access by default.
PublicAccessBlockConfiguration – Explicitly blocks all public access to the bucket and its objects.
BucketEncryption – Enables server-side encryption using the default KMS encryption key ID, alias/aws/s3, managed by AWS for Amazon S3. You can also create custom KMS keys. For instructions, refer to Creating symmetric encryption KMS keys.
VersioningConfiguration – Enables versioning for the bucket, allowing you to maintain multiple versions of objects.
ObjectLockConfiguration – Enables object lock with a governance mode retention period of 5 years, preventing objects from being deleted or overwritten during that period.
LoggingConfiguration – Enables server access logging for the bucket, directing log files to a separate logging bucket for auditing and analysis purposes. Replace YOUR_SERVER_ACCESS_LOG_BUCKET with your preferred bucket name.

This is just an example; you may need to adjust the configurations based on your specific requirements and security best practices.
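If you prefer to deploy the template outside the console, a minimal boto3 call looks like the following, assuming you have saved the template locally and replaced its placeholders:

import boto3

cloudformation = boto3.client("cloudformation")

# Placeholder stack name and local template file.
with open("secure-lex-bucket.yaml") as template_file:
    cloudformation.create_stack(
        StackName="secure-lex-data-bucket",
        TemplateBody=template_file.read(),
    )

# Wait until the bucket and its security settings are fully provisioned.
cloudformation.get_waiter("stack_create_complete").wait(StackName="secure-lex-data-bucket")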
Monitor and protect with data governance controls and risk management policies
In this section, we demonstrate how to protect your data by using a Service Control Policy (SCP). To create an SCP, see Creating an SCP.
Prevent changes to an Amazon Lex chatbot using an SCP
To prevent changes to an Amazon Lex chatbot using an SCP, create one that denies the specific actions related to modifying or deleting the chatbot. For example, you could use the following SCP:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "lex:DeleteBot",
        "lex:DeleteBotAlias",
        "lex:DeleteBotChannelAssociation",
        "lex:DeleteBotVersion",
        "lex:DeleteIntent",
        "lex:DeleteSlotType",
        "lex:DeleteUtterances",
        "lex:PutBot",
        "lex:PutBotAlias",
        "lex:PutIntent",
        "lex:PutSlotType"
      ],
      "Resource": [
        "arn:aws:lex:*:YOUR_ACCOUNT_ID:bot:YOUR_BOT_NAME",
        "arn:aws:lex:*:YOUR_ACCOUNT_ID:intent:YOUR_BOT_NAME:*",
        "arn:aws:lex:*:YOUR_ACCOUNT_ID:slottype:YOUR_BOT_NAME:*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_IAM_ROLE"
        }
      }
    }
  ]
}

The code defines the following:

Effect – This is set to Deny, which means that the specified actions will be denied.
Action – This contains a list of actions related to modifying or deleting Amazon Lex bots, bot aliases, intents, and slot types.
Resource – This lists the Amazon Resource Names (ARNs) for your Amazon Lex bot, intents, and slot types. Replace YOUR_ACCOUNT_ID with your AWS account ID and YOUR_BOT_NAME with the name of your Amazon Lex bot.
Condition – This exempts the specified IAM role from the deny (using StringNotEquals on aws:PrincipalArn), so only that role can perform these actions. Replace YOUR_ACCOUNT_ID with your AWS account ID and YOUR_IAM_ROLE with the name of the AWS Identity and Access Management (IAM) provisioned role you want to exempt from this policy.

When this SCP is attached to an AWS Organizations organizational unit (OU) or an individual AWS account, it will allow only the specified provisioning role while preventing all other IAM entities (users, roles, or groups) within that OU or account from modifying or deleting the specified Amazon Lex bot, intents, and slot types.
This SCP only prevents changes to the Amazon Lex bot and its components. It doesn’t restrict other actions, such as invoking the bot or retrieving its configuration. If more actions need to be restricted, you can add them to the Action list in the SCP.
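To roll the policy out across accounts, you can create and attach it with the AWS Organizations API from your organization's management account. The following is a minimal boto3 sketch; it assumes the SCP JSON is saved locally as lex-protection-scp.json, and the target OU ID is a placeholder:

import boto3

organizations = boto3.client("organizations")

# Placeholder file containing the SCP JSON shown above.
with open("lex-protection-scp.json") as policy_file:
    created = organizations.create_policy(
        Content=policy_file.read(),
        Description="Prevent changes to the Amazon Lex bot except by the provisioning role",
        Name="ProtectLexBot",
        Type="SERVICE_CONTROL_POLICY",
    )

# Attach the SCP to an organizational unit (placeholder OU ID).
organizations.attach_policy(
    PolicyId=created["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-xxxx-xxxxxxxx",
)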
Prevent changes to a CloudWatch Logs log group using an SCP
To prevent changes to a CloudWatch Logs log group using an SCP, create one that denies the specific actions related to modifying or deleting the log group. The following is an example SCP that you can use:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "logs:DeleteLogGroup",
        "logs:PutRetentionPolicy"
      ],
      "Resource": "arn:aws:logs:*:YOUR_ACCOUNT_ID:log-group:/aws/YOUR_LOG_GROUP_NAME*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_IAM_ROLE"
        }
      }
    }
  ]
}

The code defines the following:

Effect – This is set to Deny, which means that the specified actions will be denied.
Action – This includes logs:DeleteLogGroup and logs:PutRetentionPolicy actions, which prevent deleting the log group and modifying its retention policy, respectively.
Resource – This lists the ARN for your CloudWatch Logs log group. Replace YOUR_ACCOUNT_ID with your AWS account ID and YOUR_LOG_GROUP_NAME with the name of your log group.
Condition – This exempts the specified IAM role from the deny, so only that role can perform these actions. Replace YOUR_ACCOUNT_ID with your AWS account ID and YOUR_IAM_ROLE with the name of the IAM provisioned role you want to exempt from this policy.

Similar to the preceding chatbot SCP, when this SCP is attached to an Organizations OU or an individual AWS account, it will allow only the specified provisioning role to delete the specified CloudWatch Logs log group or modify its retention policy, while preventing all other IAM entities (users, roles, or groups) within that OU or account from performing these actions.
This SCP only prevents changes to the log group itself and its retention policy. It doesn’t restrict other actions, such as creating or deleting log streams within the log group or modifying other log group configurations. To restrict additional actions, add them to the Action list in the SCP.
Also, this SCP will apply to all log groups that match the specified resource ARN pattern. To target a specific log group, modify the Resource value accordingly.
Restrict viewing of unmasked sensitive data in CloudWatch Logs Insights using an SCP
When you create a data protection policy, by default, any sensitive data that matches the data identifiers you’ve selected is masked at all egress points, including CloudWatch Logs Insights, metric filters, and subscription filters. Only users who have the logs:Unmask IAM permission can view unmasked data. The following is an SCP you can use:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictUnmasking",
      "Effect": "Deny",
      "Action": "logs:Unmask",
      "Resource": "arn:aws:logs:*:YOUR_ACCOUNT_ID:log-group:YOUR_LOG_GROUP:*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_IAM_ROLE"
        }
      }
    }
  ]
}

It defines the following:

Effect – This is set to Deny, which means that the specified actions will be denied.
Action – This includes logs:Unmask, which prevents viewing of masked data.
Resource – This lists the ARN for your CloudWatch Logs log group. Replace YOUR_ACCOUNT_ID with your AWS account ID and YOUR_LOG_GROUP with the name of your log group.
Condition – This exempts the specified IAM role from the deny, so only that role can unmask data. Replace YOUR_ACCOUNT_ID with your AWS account ID and YOUR_IAM_ROLE with the name of the IAM provisioned role you want to exempt from this policy.

Similar to the previous SCPs, when this SCP is attached to an Organizations OU or an individual AWS account, it will allow only the specified provisioning role while preventing all other IAM entities (users, roles, or groups) within that OU or account from unmasking sensitive data from the CloudWatch Logs log group.
This SCP only restricts the logs:Unmask action. It doesn’t restrict other actions, such as creating or deleting log streams within the log group or modifying other log group configurations. To restrict additional actions, add them to the Action list in the SCP.
Also, this SCP will apply to all log groups that match the specified resource ARN pattern. To target a specific log group, modify the Resource value accordingly.
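For principals that do hold the logs:Unmask permission, masked findings can be reviewed with CloudWatch Logs Insights. The following is a minimal boto3 sketch that runs a query using the unmask function; the log group name is a placeholder:

import time
import boto3

logs = boto3.client("logs")

# Placeholder log group; unmask(@message) returns original values for callers with logs:Unmask.
query = logs.start_query(
    logGroupName="YOUR_LEX_CONVERSATION_LOG_GROUP",
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString="fields @timestamp, unmask(@message) | limit 20",
)

# Poll until the query finishes, then print the results.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)
print(results["results"])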
Clean up
To avoid incurring additional charges, clean up your resources:

Delete the Amazon Lex bot:

On the Amazon Lex console, choose Bots in the navigation pane.
Select the bot to delete and on the Action menu, choose Delete.

Delete the associated Lambda function:

On the Lambda console, choose Functions in the navigation pane.
Select the function associated with the bot and on the Action menu, choose Delete.

Delete the account-level data protection policy. For instructions, see DeleteAccountPolicy.
Delete the CloudWatch Logs log group data protection policy:

On the CloudWatch console, under Logs in the navigation pane, choose Log groups.
Choose your log group.
On the Data protection tab, under Log group policy, choose the Actions menu and choose Delete policy.

Delete the S3 bucket that stores the Amazon Lex data:

On the Amazon S3 console, choose Buckets in the navigation pane.
Select the bucket you want to delete, then choose Delete.
To confirm that you want to delete the bucket, enter the bucket name and choose Delete bucket.

Delete the CloudFormation stack. For instructions, see Deleting a stack on the AWS CloudFormation console.
Delete the SCP. For instructions, see Deleting an SCP.
Delete the KMS key. For instructions, see Deleting AWS KMS keys.

Conclusion
Securing PII within AWS services like Amazon Lex and CloudWatch requires a comprehensive and proactive approach. By following the steps in this post—identifying and classifying data, locating data stores, monitoring and protecting data in transit and at rest, and implementing SCPs for Amazon Lex and Amazon CloudWatch—organizations can create a robust security framework. This framework not only protects sensitive data, but also complies with regulatory standards and mitigates potential risks associated with data breaches and unauthorized access.
Emphasizing the need for regular audits, continuous monitoring, and updating security measures in response to emerging threats and technological advancements is crucial. Adopting these practices allows organizations to safeguard their digital assets, maintain customer trust, and build a reputation for strong data privacy and security in the digital landscape.

About the Authors
Rashmica Gopinath is a software development engineer with Amazon Lex. Rashmica is responsible for developing new features, improving the service’s performance and reliability, and ensuring a seamless experience for customers building conversational applications. Rashmica is dedicated to creating innovative solutions that enhance human-computer interaction. In her free time, she enjoys winding down with the works of Dostoevsky or Kafka.

Dipkumar Mehta is a Principal Consultant with the Amazon ProServe Natural Language AI team. He focuses on helping customers design, deploy, and scale end-to-end Conversational AI solutions in production on AWS. He is also passionate about improving customer experience and driving business outcomes by leveraging data. Additionally, Dipkumar has a deep interest in Generative AI, exploring its potential to revolutionize various industries and enhance AI-driven applications.
David Myers is a Sr. Technical Account Manager with AWS Enterprise Support. With over 20 years of technical experience, observability has been part of his career from the start. David loves improving customers’ observability experiences at Amazon Web Services.
Sam Patel is a Security Consultant specializing in safeguarding Generative AI (GenAI), Artificial Intelligence systems, and Large Language Models (LLM) for Fortune 500 companies. Serving as a trusted advisor, he invents and spearheads the development of cutting-edge best practices for secure AI deployment, empowering organizations to leverage transformative AI capabilities while maintaining stringent security and privacy standards.

AWS AI chips deliver high performance and low cost for Llama 3.1 models

Today, we are excited to announce AWS Trainium and AWS Inferentia support for fine-tuning and inference of the Llama 3.1 models. The Llama 3.1 family of multilingual large language models (LLMs) is a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes. In a previous post, we covered how to deploy Llama 3 models on AWS Trainium and Inferentia based instances in Amazon SageMaker JumpStart. In this post, we outline how to get started with fine-tuning and deploying the Llama 3.1 family of models on AWS AI chips, to realize their price-performance benefits.
Overview of Llama 3.1 models
The Llama 3.1 family of multilingual LLMs are a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes (text in/text and code out). All models support long context length (128k) and are optimized for inference with support for grouped query attention (GQA).
The Llama 3.1 instruction tuned models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the publicly available chat models on common industry benchmarks. They have been trained to generate tool calls for a few specific tools for capabilities like search, image generation, code execution, and mathematical reasoning. In addition, they support zero-shot tool use.
Llama 3.1 405B is the world’s largest publicly available LLM according to Meta. The model sets a new standard for artificial intelligence (AI) and is ideal for enterprise-level applications and research and development. It’s ideal for tasks like synthetic data generation, where the outputs of the model can be used to improve smaller Llama models after fine-tuning, and model distillations to transfer knowledge to smaller models from the 405B model. This model excels at general knowledge, long-form text generation, multilingual translation, machine translation, coding, math, tool use, enhanced contextual understanding, and advanced reasoning and decision-making.
Architecturally, the core LLM for Llama 3 and Llama 3.1 has the same dense architecture. They are auto-regressive language models that use an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
The responsible use guide from Meta can assist you in implementing additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.
Trainium powers Llama 3.1 on Amazon Bedrock and Amazon SageMaker
The fastest way to get started with Llama 3.1 on AWS is through Amazon Bedrock, which is powered by our purpose-built AI infrastructure including AWS Trainium. Through its fully managed API, Amazon Bedrock delivers the benefits of our purpose-built AI infrastructure and simplifies access to these powerful models so you can focus on building differentiated AI applications.
If you need greater control over the underlying resources, you can fine-tune and deploy Llama 3.1 models with SageMaker. Trainium support for Llama 3.1 in SageMaker JumpStart is coming soon.
AWS Trainium and AWS Inferentia2 enable high performance and low cost for Llama 3.1 models
If you want to build your own ML pipelines for training and inference for greater flexibility and control, you can get started with Llama 3.1 on AWS AI chips using Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances. Let’s see how you can get started with the new Llama 3.1 8/70B models on Trainium using the AWS Neuron SDK.
Fine-tune Llama 3.1 on Trainium
To get started with fine-tuning either Llama 3.1 8B or Llama 3.1 70B, you can use the NeuronX Distributed library. NeuronX Distributed provides implementations of some of the more popular distributed training and inference techniques. To start fine-tuning, you can use the following samples:

Training Llama 3.1 8B
Training Llama 3.1 70B

Both samples are built on top of AWS ParallelCluster to manage the Trainium cluster infrastructure and Slurm for workload management. The following is an example Slurm command to initiate training for Llama 3.1 70B:

sbatch --exclusive \
    --nodes 32 \
    --cpus-per-task 128 \
    --wrap="srun bash $(pwd)/run_llama3_70B_tp_pp.sh"

Inside the Slurm script, we launch a distributed training process on our cluster. In the runner scripts, we load the pre-trained weights and configuration provided by Meta, and launch the training process: 

torchrun $DISTRIBUTED_ARGS run_llama_nxd.py \
    --train_batch_size $BS \
    --use_meta_device_init 1 \
    --training_dir $DATA_PATH \
    --training_config $SCRIPT_DIR/${MODEL_SIZE}_config_llama${LLAMA_VERSION} \
    --max_steps $max_steps \
    --seq_len $SEQ_LEN \
    --pipeline_parallel_size $PP_DEGREE \
    --tensor_parallel_size $TP_DEGREE \
    --num_microbatches $NUM_MICROBATCHES \
    --lr 0.000015 \
    --min_lr 1e-06 \
    --beta1 0.9 \
    --beta2 0.95 \
    --weight_decay 0.1 \
    --warmup_steps 2000 \
    --constant_steps 0 \
    --use_zero1_optimizer 1 \
    --use_selective_checkpoint 1 \
    --use_flash_attention 1 \
    --qkv_linear 1 \
    --kv_replicator 4 \
    --pretrained_weight 1 \
    --save_load_xser 1 \
    --checkpoint_dir "/shared/llama${LLAMA_VERSION}${MODEL_SIZE}/" \
    --checkpoint_freq $checkpoint_freq \
    --num_kept_checkpoint -1 \
    --loading_step -1 \
    --tb_dir $tb_dir |& tee $LOG_PATH/log
exit ${PIPESTATUS[0]}

Deploy Llama 3.1 on Trainium or Inferentia
When your model is ready to deploy, you can do so by updating the model ID in the previous Llama 3 8B Neuron sample code. For example, the following code deploys the model on an inf2.48xlarge instance.

model_id = "meta-llama/Meta-Llama-3.1-8B"
neuron_model = LlamaForSampling.from_pretrained(model_id, neuron_config=neuron_config, batch_size=1, tp_degree=24, amp='bf16', n_positions=4096)
neuron_model.to_neuron()

You can use the same sample inference code:

tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Hello, I'm a language model and I like to"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

For step-by-step details, refer to the new Llama 3.1 examples:

Meta Llama 3.1 8B
Meta Llama 3.1 70B
Meta Llama 3.1 8B 32k
Meta Llama 3.1 405B on Trainium is coming soon

You can also use Hugging Face’s Optimum Neuron library to quickly deploy models directly from SageMaker through the Hugging Face Model Hub. From the Llama 3.1 model card hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium. Copy the example code into a SageMaker notebook, then choose Run.

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
    "HF_TOKEN": "<REPLACE WITH YOUR TOKEN>",
}

assert hub["HF_TOKEN"] != "<REPLACE WITH YOUR TOKEN>", "Please replace '<REPLACE WITH YOUR TOKEN>' with your Hugging Face Hub API token"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.23"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)

Additionally, if you want to use vLLM to deploy the models, you can refer to the continuous batching guide to create the environment. After you create the environment, you can use vLLM to deploy Llama 3.1 8/70B models on AWS Trainium or Inferentia. The following is an example of deploying Llama 3.1 8B:

from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_num_seqs=8,
    # The max_model_len and block_size arguments are required to be same as max sequence length,
    # when targeting neuron device. Currently, this is a known limitation in continuous batching
    # support in transformers-neuronx.
    max_model_len=128,
    block_size=128,
    # The device can be automatically detected when AWS Neuron SDK is installed.
    # The device argument can be either unspecified for automated detection, or explicitly assigned.
    device="neuron",
    tensor_parallel_size=8)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Conclusion
AWS Trainium and Inferentia deliver high performance and low cost for fine-tuning and deploying Llama 3.1 models. We are excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, refer to Model Samples and Tutorials in AWS Neuron Documentation.

About the Authors
John Gray is a Sr. Solutions Architect in Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build a scalable prototype using AWS AI chips.
Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.
Kamran Khan is the Head of Business Development for AWS Inferentia and AWS Trainium at AWS. He has over a decade of experience helping customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium.
Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Use Llama 3.1 405B for synthetic data generation and distillation to fine-tune smaller models

Today, we are excited to announce the availability of the Llama 3.1 405B model on Amazon SageMaker JumpStart, and Amazon Bedrock in preview. The Llama 3.1 models are a collection of state-of-the-art pre-trained and instruct fine-tuned generative artificial intelligence (AI) models in 8B, 70B, and 405B sizes. Amazon SageMaker JumpStart is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. Amazon Bedrock offers a straightforward way to build and scale generative AI applications with Meta Llama models, using a single API.
In this post, we show how to use Llama 3.1 405B to generate data (labels for a sample dataset), and how to use the generated data for distillation to fine-tune a smaller model like Llama 3 8B to generate better responses compared to the non-fine-tuned model. We also provide the code notebook that you can use to run and test the solution.
Overview of Llama 3.1 405B
The Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes (text in/text out). All models support long context length (128,000) and are optimized for inference with support for grouped query attention (GQA). The Llama 3.1 instruction tuned text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the publicly available chat models on common industry benchmarks.
Llama 3.1 405B is the first publicly available model that rivals the top models in AI when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation. There are some unique ways to use it—in addition to direct inference, you can use the Llama 3.1 405B model to generate synthetic data to improve smaller models, and it can be a powerful domain-specific model by acting as the base model for domain-specific fine-tuning.
Llama 3.1 models are available today for inferencing on SageMaker JumpStart and Amazon Bedrock. On SageMaker JumpStart, they are rolling out to all AWS Regions where SageMaker JumpStart is available and support the required instance types. Llama 3.1 405B will require P5 instances on Amazon SageMaker. The Llama 3.1 models are also available today in the us-west-2 Region on Amazon Bedrock, with planned future expanded Regional availability.
Prerequisites
The following prerequisites are needed to implement the steps outlined in this post:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker and Amazon Bedrock. For more information, refer to Identity and Access Management for Amazon SageMaker and Identity and access management for Amazon Bedrock.
Access to Amazon SageMaker Studio or a SageMaker notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code.

Responses from the Llama 3 8B Instruct model
First, we perform inference with the Llama 3 8B model, either directly through Amazon Bedrock or through an endpoint deployed using SageMaker JumpStart. With Llama 3 Instruct models, which are optimized for dialogue use cases, the input to the model endpoints is the previous history between the chat assistant and the user. We can ask context-aware questions about conversations that have happened so far, using specific formatting for the input text (described in our earlier Llama 3 release posts, Meta Llama 3 models are now available in Amazon Bedrock and Meta Llama 3 models are now available in Amazon SageMaker JumpStart).
In the following example, the user has a conversation with the assistant about tourist sites in Paris. The assistant generated four different recommendation options, and then the user inquires about the first option:

Input: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nParis, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:\n\n1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.\n2. The Louvre Museum: The Louvre is one of the world’s largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.\n3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.\n\nThese are just a few of the many attractions that Paris has to offer. With so much to see and do, it’s no wonder that Paris is one of the most popular tourist destinations in the world.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is so great about #1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

Output: The Eiffel Tower! It’s an iconic symbol of Paris and one of the most recognizable landmarks in the world. Here are some reasons why it’s so great:\n\n1. **Engineering marvel**: When it was built for the 1889 World’s Fair, the Eiffel Tower was a groundbreaking feat of engineering. It was the tallest structure in the world at the time, standing at 324 meters (1,063 feet).\n2. **Panoramic views**: The Eiffel Tower offers breathtaking 360-degree views of the city from its observation decks. You can see many of Paris’ famous landmarks, such as the Seine River, the Louvre, and Montmartre.\n3. **Romantic atmosphere**: The Eiffel Tower is often associated with romance and is a popular spot for proposals, weddings, and romantic dinners. The tower’s iron latticework and twinkling lights at night add to its romantic ambiance.\n4. **Historical significance**: The Eiffel Tower has played a significant role in French history, serving as a communication tower during World War I and a symbol of French culture and engineering.\n5. **Accessibility**: The Eiffel Tower is easily accessible by public transportation or on foot, and there are various ways to experience it, including taking the stairs, elevator, or a guided tour.\n6. **Nighttime spectacle**: The Eiffel Tower is even more magical at night, when it’s illuminated with thousands of twinkling lights. The light show, known as the “Eiffel Tower Sparkle,” takes place every hour on the hour from dusk to midnight.\n7. **Iconic status**: The Eiffel Tower has been featured in countless films, literature, and art, making it an instantly recognizable symbol of Paris and France.\n\nOverall, the Eiffel Tower is a must-visit attraction in Paris, offering a unique combination of history, engineering, romance, and stunning views.

The Llama 3 8B model is able to generate answers for the questions without issues.
Next, let’s test the ability of Llama 3 8B to answer logical and arithmetic questions (derived from Hugging Face’s AQUA-RAT dataset—instead of multiple choice options, we ask for full answers) as follows:

Input:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
A board 7ft. 9 inches long is divided into 3 equal parts .
What is the length of each part??<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

Output:
A nice simple problem! Let’s convert the length of the board to inches:
7 ft 9 inches = 95 inches (since 1 foot = 12 inches). Since the board is
divided into 3 equal parts, each part will have the same length.
To find the length of each part, we’ll divide the total length (95 inches)
by 3: 95 inches ÷ 3 = 31.67 inches.
So each part of the board is approximately 31.67 inches long.

This answer looks almost correct but not quite. The correct answer is 31 inches long. Similar logical questions are not answered correctly by the Llama 3 8B model.
In order for the Llama 3 8B model to improve its logical question answering capability, we want to fine-tune the model with data from the AQUA-RAT dataset. As we already mentioned, the AQUA-RAT dataset contains multiple choice options for the LLM to choose from. Because we don’t have the full answers for this dataset, we use the Llama 3.1 405B model to generate the verbal answer to the questions, and use that dataset to fine-tune the Llama 3 8B model.
Generate label data using Llama 3.1 405B
Because Llama 3.1 405B is the most capable of the Llama 3.1 collection of models, and because of its state-of-the-art math and general knowledge capabilities, we run direct inference of the questions in the AQUA-RAT dataset on Llama 3.1 405B using either SageMaker JumpStart or Amazon Bedrock. This helps us generate the answers we want to use to fine-tune the smaller Llama 3 8B models. In essence, we’re using Llama 3.1 405B as an alternative to human annotation to generate labels for the dataset. The following are example inference outputs from the 405B model:

Input:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
A board 7ft. 9 inches long is divided into 3 equal parts .
What is the length of each part??<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

Output:
A nice simple problem! Let’s break it down: The board is 7ft 9in long.
We can convert this to inches only: 7ft = 7 × 12 = 84in (since there are
12 inches in a foot). Add the 9 inches: 84in + 9in = 93in.
Now, the board is divided into 3 equal parts, so we need to divide the
total length (93in) by 3: 93in ÷ 3 = 31in. So, each part is 31 inches long.

We can clearly see that the 405B answer is logically and mathematically correct, and we can use this answer in the place of human annotation for fine-tuning purposes. We concatenate the answers from the 405B model for approximately 2,000 questions from the AQUA-RAT dataset, which becomes our training dataset. Our test dataset comes from extracted answers for approximately 4,000 more questions.
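The following is a minimal sketch of how these labels can be generated with the Amazon Bedrock runtime API; the Llama 3.1 405B model ID and the prompt template are assumptions you should adapt to your Region and setup (you could equally invoke a SageMaker JumpStart endpoint):

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

def generate_label(question: str) -> str:
    """Ask Llama 3.1 405B for a full worked answer to one AQUA-RAT question."""
    prompt = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    response = bedrock_runtime.invoke_model(
        modelId="meta.llama3-1-405b-instruct-v1:0",  # assumed model ID; confirm availability in your Region
        body=json.dumps({"prompt": prompt, "max_gen_len": 512, "temperature": 0.1}),
    )
    return json.loads(response["body"].read())["generation"]

# Build instruction/response training records from the AQUA-RAT questions.
questions = ["A board 7ft. 9 inches long is divided into 3 equal parts. What is the length of each part?"]
with open("train.jsonl", "w") as f:
    for question in questions:
        f.write(json.dumps({"instruction": question, "response": generate_label(question)}) + "\n")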
The training data is constructed as a JSON object as follows:

{"instruction": "The entrance fee for a fair is $5 for persons under the age of 18, and 20% more for persons older. Each ride at the fair costs $0.50. If Joe goes with her 6 years old twin brothers, and they each took 3 rides in total. How much money does Joe end up spending at the fair?", "response": "The entrance fee for the 2 brothers is 5 x 2 = 10 dollars. The entrance fee for Joe is 5 x 1.2 = 6 dollars. So the total entrance fee is 10 + 6 = 16 dollars. The cost of the rides is 3 x 0.5 x 3 = 4.5 dollars. So the total cost is 16 + 4.5 = 20.5 dollars. The answer is 20.5."}

{"instruction": "John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of number is?", "response": "The average of 15 numbers is 40. So the sum of the 15 numbers is 15 x 40 = 600. If 10 is added to each number, then the sum becomes 600 + 15 x 10 = 600 + 150 = 750. The average of the new numbers is 750 / 15 = 50.\n#### 50"}

{"instruction": "A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?", "response": "A simple problem!\n\nLet's use the formula:\n\nDistance = Speed \u00d7 Time\n\nGiven:\nSpeed (v) = 20 km/hr\nTime (t) = 2.5 hr\n\nPlugging in the values, we get:\n\nDistance = 20 km/hr \u00d7 2.5 hr\n= 50 km\n\nSo, the person traveled a distance of 50 km."}

The instruction and response dataset are then used for distillation by fine-tuning the Llama 3 8B model in SageMaker JumpStart.
Fine-tune the Llama 3 8B model with the generated labels
For instructions on fine-tuning this model, refer to Fine-tune Code Llama on Amazon SageMaker JumpStart. We used a g5.12xlarge instance with the instruction fine-tuning option for five epochs, and observed that the loss improved on both the training and test sets.
When the fine-tuning was complete, we deployed the fine-tuned model for inference using SageMaker JumpStart on a g5.12xlarge instance (you can deploy on any other supported instance based on your throughput and latency requirements). Please refer to Meta’s End User License Agreement (EULA) for requirements when using synthetic-fine-tuned models for any use case.
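For reference, this fine-tune-and-deploy flow can be expressed with the SageMaker JumpStart SDK roughly as follows; the JumpStart model ID, training data location, and hyperparameters are illustrative and should be adapted from the fine-tuning post linked above:

from sagemaker.jumpstart.estimator import JumpStartEstimator

# Illustrative JumpStart model ID; accept_eula acknowledges the Meta license before training.
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-8b",
    instance_type="ml.g5.12xlarge",
    environment={"accept_eula": "true"},
)
estimator.set_hyperparameters(instruction_tuned="True", epoch="5")

# train.jsonl (uploaded to S3) holds the instruction/response pairs labeled by Llama 3.1 405B.
estimator.fit({"training": "s3://YOUR_BUCKET/aqua-rat-labels/"})

# Deploy the fine-tuned model to an endpoint for evaluation.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")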
Then we performed inference on another 1,000 unseen queries from the AQUA-RAT dataset. The following table shows a few of the queries for comparison.

Query
Answer
Llama 3 8B Responses Before Fine-Tuning
Llama 3 8B Responses After Fine-Tuning

Tim has 350 pounds of cement in 100, 50, and 25 pound bags. He has an equal number of each size bag. How many bags of cement does Tim have?
6
We know there are 3 types of bags. 100, 50, and 25 pounds. He has an equal number of each. So if x is the number of each type of bag, then you know 100x + 50x + 25x = 350. This is because each type of bag has x number of bags, and there is 100 pounds of cement per bag in the first case, 50 in the second, and 25 in the third. 175x = 350. So if we divide both sides by 175 we get x = 2. The answer is 2 bags of each.
Let’s assume Tim has X 100 pound bags, X 50 pound bags, and X 25 pound bags. He has in total 100X + 50X + 25X = 175X pounds of cement. Since he has 350 pounds total, we know 175X = 350, so X = 2. That means he has 2 bags of each type, so in total he has 2 + 2 + 2 = 6 bags. The answer is 6.

The purchase price of an article is $48. In order to include 30% of cost for overhead and to provide $12 of net profit, what should the markup be?
$55
To solve this problem, we can start by identifying the cost of the article, which is the purchase price of $48. We are also given that the markup should be enough to include 30% of the cost for overhead and provide a net profit of $12. Let’s break this down step by step: 1. Calculate the cost of the article: $48 (purchase price) 2. Calculate the overhead cost: 30% of the cost = 0.3 x $48 = $14.40 3. Calculate the total cost: cost + overhead cost = $48 + $14.40 = $62.40 4. Calculate the selling price: total cost + net profit = $62.40 + $12 = $74.40 Therefore, the markup should be $74.40 – $48 = $26.40. This means that the selling price should be $26.40 more than the purchase price to include 30% of the cost for overhead and provide a net profit of $12.
Step 1: Calculate the cost price of the article. The cost price of the article is $48. Step 2: Calculate the overhead cost as a percentage of the cost price. The overhead cost is 30% of the cost price, which is 0.3 * $48 = $14.40. Step 3: Calculate the selling price of the article. The selling price of the article is the cost price plus the overhead cost plus the net profit. The net profit is $12. Step 4: Calculate the selling price. The selling price is $48 + $14.40 + $12 = $74.40. Step 5: Calculate the markup as a percentage of the cost price. The markup is the difference between the selling price and the cost price, divided by the cost price. The markup is ($74.40 - $48) / $48 = $26.40 / $48 = 55%. The final answer is: $55

The slant height of a cone is 35 cm and radius of the base is 14cm, find the curved surface of the cone.
1540
A nice problem! The slant height of a cone is the distance from the center of the base to the vertex of the cone. Let’s call this distance “l”. We are given that the slant height is 35 cm, so: l = 35 cm The radius of the base is given as 14 cm, so the diameter of the base is 28 cm. The formula to find the curved surface area of a cone is: Curved Surface Area = πr(l + r) where r is the radius of the base and l is the slant height. Plugging in the values, we get: Curved Surface Area = π(14)(35 + 14) = π(14)(49) = 686π Now, we can simplify this expression by multiplying the numbers: Curved Surface Area = 686 × 3.14 (approximately) = 2153.24 cm² So, the curved surface area of the cone is approximately 2153.24 cm².
The formula for the curved surface area of a cone is: Curved Surface Area = πrl Where r is the radius of the base and l is the slant height. Given: r = 14 cm  l = 35 cm Plugging in the values, we get: Curved Surface Area = π(14)(35) = 14π(35) = 490π = 490 × 3.14 = 1537.4 cm² Therefore, the curved surface area of the cone is approximately 1537.4 cm².

The speed at which a man can row a boat in still water is 25 kmph. If he rows downstream, where the speed of current is 11 kmph, what time will he take to cover 80 metres?
8 seconds
The speed of the man in still water is 25 kmph. The speed of the current is 11 kmph. So the speed downstream is 25 + 11 = 36 kmph. 80 meters is 0.02 km. So the time taken is 0.02 / 36 = 0.0056 hours. The answer is 5.6 minutes.
Speed of the boat in still water = 25 kmph Speed of the current = 11 kmph Speed downstream = 25 + 11 = 36 kmph Speed downstream in m/s = (36 * 5/18) = 10 m/s Distance = 80 m Time = Distance / Speed = 80 / 10 = 8 seconds.

We can qualitatively see that the answers for the math problems and logical questions are improved after fine-tuning the model. We will perform additional quantitative metric evaluation in future posts. This exercise demonstrates how you can use the Llama 3.1 405B model to efficiently generate datasets in an accelerated fashion and then use those datasets to significantly improve the task-specific capabilities of smaller models.
Conclusion
In this post, we showed how you can use the new Llama 3.1 405B model to synthesize and generate data labels to improve the performance of a much smaller model through distillation (Llama 3 8B in this case). We showed that the responses generated by the fine-tuned model are much improved compared to those of the model without fine-tuning, and we provided the code notebook that you can use to run and test the solution.
As a next step, we encourage you to apply this idea with the Llama 3.1 405B model to generate labels, or even unlabeled data, that a smaller model can then use downstream to help solve your use case.

About the Authors
Sebastian Bustillo is an Enterprise Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through cloud technologies and AI/ML. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and riding his MTB.
Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.
Dr. Natarajan Chennimalai Kumar is a Principal Solutions Architect in the 3rd Party Model Provider team at AWS, working closely with the Llama partner engineering team at Meta to enable AWS customers to use Meta’s Llama models. He holds a PhD from the University of Illinois at Urbana-Champaign. He is based in the Bay Area in California. Outside of work, he enjoys watching shows with his kids, playing tennis, and traveling with his family.
Madhur Prashant is an AI and ML Solutions Architect at Amazon Web Services. He is passionate about the intersection of human thinking and generative AI. His interests lie in generative AI, specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, writing blogs, hiking, spending time with his twin, and playing the guitar.
Dr. Nikita Ivkin is a Senior Applied Scientist for Amazon SageMaker. He focuses on inference acceleration for foundation models and scalable ML algorithms in general. His research interests are in the area of inference acceleration, streaming algorithms, and federated learning, with publishing in a variety of machine learning and computer science venues such as NeurIPS, ICML, ICLR, STOC, PODS, and others.
Supriya Puragundla is a Senior Solutions Architect at AWS. She has over 15 years of IT experience in software development, design, and architecture. She helps key customer accounts on their data, generative AI, and AI/ML journeys. She is passionate about data-driven AI and the area of depth in ML and generative AI.
Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, and KDD conferences, and Royal Statistical Society: Series A.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.
Karl Albertsen leads the product management and partnership teams for Amazon SageMaker. He is focused on making AI accessible, cost-effective, and high-performing for business applications.
Christopher Whitten is an SDE with the SageMaker JumpStart team leading model onboarding and deeper integration with SageMaker services. Chris is passionate about accelerating the ubiquity of AI in practical business applications. His technical interests include agentic workflows and MLOps.
Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his master’s from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.
Evan Kravitz is a software engineer at Amazon Web Services, working on SageMaker JumpStart. He is interested in the confluence of machine learning with cloud computing. Evan received his undergraduate degree from Cornell University and master’s degree from the University of California, Berkeley. In 2021, he presented a paper on adversarial neural networks at the ICLR conference. In his free time, Evan enjoys cooking, traveling, and going on runs in New York City.

Open Artificial Knowledge (OAK) Dataset: A Large-Scale Resource for AI Research Derived from Wikipedia’s Main Categories

The rapid advancement of Artificial Intelligence (AI) and Machine Learning (ML) has highlighted the critical need for large, diverse, and high-quality datasets to train and evaluate foundation models. However, acquiring such datasets presents significant challenges, including data scarcity, privacy concerns, and high data collection and annotation costs. Artificial (synthetic) data has emerged as a promising solution to these challenges, offering a way to generate data that mimics real-world patterns and characteristics. The importance of artificial data in AI research has grown substantially due to several factors: scalability, privacy preservation, diversity and representation, and cost-effectiveness. Synthetic data can be generated at scale, address privacy issues, cover a wide range of scenarios to mitigate biases, and provide a more economical alternative to collecting and annotating real-world data.

Recent work in training state-of-the-art large language models (LLMs) has increasingly incorporated synthetic datasets, as seen in models like Llama-3. While handcrafted human data has shown significant improvements in supervised fine-tuning (SFT), especially for tasks like code generation and mathematical reasoning, the scarcity and cost of such data have led to increased use of synthetic data. This approach utilizes capable LLMs, like the GPT family, to produce high-quality synthetic data. Recent research has highlighted LLMs’ ability to rephrase and boost synthetic data for effective SFT, suggesting continued growth in synthetic data use for improving LLM performance and alignment.

Artificial data generation has several key challenges. These include ensuring diversity and generalization, maintaining quality, preserving privacy, addressing bias, and adhering to ethical and legal considerations. Diversity in artificial data is crucial for model generalization, while quality directly impacts the performance of models trained on it. Privacy concerns must be addressed to prevent revealing sensitive information. Bias in artificial data can arise from underlying algorithms and training data, potentially leading to unfair or inaccurate model predictions. Ethical and legal considerations involve adhering to guidelines and regulations such as GDPR and CCPA. Also, practical challenges include scalability, cost-effectiveness, developing robust evaluation metrics, ensuring factual accuracy, and maintaining and updating synthetic data to reflect current trends and linguistic changes.

Vadim Borisov and Richard H. Schreiber introduce The Open Artificial Knowledge (OAK) dataset that addresses the challenges of artificial data generation by providing a large-scale resource of over 500 million tokens. OAK utilizes an ensemble of state-of-the-art LLMs, including GPT4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B, to generate high-quality text across diverse domains. The data generation pipeline begins by querying knowledge databases to gather topics, which are then expanded using LLMs. These topics are transformed into prompts used to generate texts with advanced models. The OAK dataset is continuously evaluated and updated to ensure its effectiveness and reliability for training advanced language models. By systematically addressing each challenge, OAK provides a robust resource for developing more accurate and aligned language models.

The OAK dataset generation follows a structured approach designed to address key challenges in artificial data creation. The process involves four main steps: subject extraction, subtopic expansion, prompt generation, and text generation with open-source LLMs. This approach tackles challenges such as diversity and generalization, quality, bias, and factual accuracy. The dataset also addresses privacy concerns by using only publicly available data and open-source models. 
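To make the four-step pipeline more concrete, the following is a minimal sketch of how such a generation loop could be wired together. The `generate` callable stands in for whichever LLM backend is used (GPT-4o, Llama 3, Mixtral, and so on), and the prompt wording and templates are assumptions for illustration, not the authors’ actual templates.

```python
# Illustrative sketch of the four OAK generation steps: subject extraction,
# subtopic expansion, prompt generation, and text generation.
from typing import Callable, List

def expand_subtopics(subject: str, generate: Callable[[str], str], n: int = 5) -> List[str]:
    # Step 2: expand a subject (e.g., a Wikipedia category) into subtopics via an LLM.
    reply = generate(f"List {n} specific subtopics of '{subject}', one per line.")
    return [line.strip("-• ").strip() for line in reply.splitlines() if line.strip()]

def build_prompts(subtopic: str) -> List[str]:
    # Step 3: "programming prompt engineering" -- fill fixed templates with the subtopic.
    templates = [
        "Write an encyclopedic article about {t}.",
        "Explain {t} to a graduate student, with examples.",
    ]
    return [tpl.format(t=subtopic) for tpl in templates]

def generate_corpus(subjects: List[str], generate: Callable[[str], str]) -> List[str]:
    texts = []
    for subject in subjects:                                  # step 1: subjects from knowledge bases
        for sub in expand_subtopics(subject, generate):       # step 2: subtopic expansion
            for prompt in build_prompts(sub):                 # step 3: prompt generation
                texts.append(generate(prompt))                # step 4: text generation
    return texts

if __name__ == "__main__":
    echo = lambda p: f"[model output for: {p}]"   # stand-in backend for a dry run
    print(len(generate_corpus(["Mathematics"], echo)))
```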

To ensure ethical and legal compliance, the OAK team implements a comprehensive strategy, including code publication for transparency and a commitment to content removal upon request. Toxicity and harmful content are mitigated through automated filtering techniques and fine-tuned models. The dataset’s effectiveness is evaluated using common benchmarks, and regular updates are planned to maintain relevance.

The OAK dataset has two main techniques for prompt generation: programming prompt engineering and meta prompt engineering. These methods ensure diversity in prompts while maintaining quality and addressing potential biases. The resulting dataset provides a robust resource for developing more accurate and aligned language models, with its use intended primarily for research purposes in areas such as model alignment, bias mitigation, and prompt engineering.

OAK dataset offers a comprehensive resource for AI research, derived from Wikipedia’s main categories. Utilizing advanced models like GPT4o, LLaMa3, Mixtral, Gemma, and Gemma2, OAK addresses data scarcity, privacy concerns, and diversity issues. With over 500 million tokens, this freely available dataset supports model alignment, fine-tuning, and benchmarking across various AI tasks and applications. OAK’s creation process involves sophisticated techniques to ensure quality, diversity, and ethical considerations, making it a valuable resource for advancing AI technologies while addressing critical challenges in the field of artificial data generation and utilization.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Open Artificial Knowledge (OAK) Dataset: A Large-Scale Resource for AI Research Derived from Wikipedia’s Main Categories appeared first on MarkTechPost.

Stability AI Open-Sources Stable Audio Open: An Audio Generation Model with Variable-Length (up to 47s) Stereo Audio at 44.1kHz from Text Prompts

In the field of Artificial Intelligence, open, generative models stand out as a cornerstone for progress. These models are vital for advancing research and fostering creativity by allowing fine-tuning and serving as benchmarks for new innovations. However, a significant challenge persists as many state-of-the-art text-to-audio models remain proprietary, limiting their accessibility for researchers. 

Recently, a team of researchers from Stability AI has introduced a new open-weight text-to-audio model that is trained exclusively on Creative Commons data. This paradigm is intended to guarantee openness and moral data use while offering the AI community a potent tool. Its key features are as follows:

This new model has open weights, in contrast to numerous proprietary models. This enables researchers and developers to examine, alter, and expand upon the model because its design and parameters are made available to the general public. 

Only audio files with Creative Commons licenses have been used to train the model. This decision guarantees the training materials’ ethical and legal soundness. The developers have encouraged openness in data methods and steered clear of possible copyright issues by using data that is available under Creative Commons.

The architecture of the new model is intended to provide accessible, high-quality audio synthesis, which is as follows:

The model makes use of a sophisticated architecture that provides remarkable fidelity in text-to-audio generation. At a sampling rate of 44.1kHz, it can generate high-quality stereo sound, guaranteeing that the resulting audio satisfies strict requirements for clarity and realism.

A variety of audio files with Creative Commons licenses have been used in the training process. This method guarantees that the model can produce realistic and varied audio outputs while also assisting it in learning from a wide variety of soundscapes.

To make sure the new model matches or exceeds the standards set by previous models, its performance has been thoroughly assessed. FDopenl3, which measures the realism of the generated audio, is one of the primary assessment metrics employed. This metric’s findings showcased the model’s capacity to generate high-quality audio by showing that it performs on par with the industry’s top models. To evaluate the model’s capabilities and pinpoint areas for development, its performance has been compared to that of other well-performing models. This comparative study attests to the new model’s superior quality and usability.

In conclusion, the development of generative audio technology has advanced significantly with the release of this open-weight text-to-audio model. The model addresses many of the existing problems in the field by emphasizing openness, ethical data utilization, and high-quality audio synthesis. It sets new standards for text-to-audio production and is a significant resource for scholars, artists, and developers.

Check out the Paper, Model, and GitHub. All credit for this research goes to the researchers of this project.
The post Stability AI Open-Sources Stable Audio Open: An Audio Generation Model with Variable-Length (up to 47s) Stereo Audio at 44.1kHz from Text Prompts appeared first on MarkTechPost.

This AI Paper by Alibaba Introduces Data-Juicer Sandbox: A Probe-Analyze-Refine Approach to Co-Developing Multi-Modal Data and Generative AI Models

Multi-modal generative models integrate various data types, such as text, images, and videos, expanding AI applications across different fields. However, optimizing these models presents complex challenges related to data processing and model training. The need for cohesive strategies to refine both data and models is crucial for achieving superior AI performance.

A major issue in multi-modal generative model development is the isolated progression of data-centric and model-centric approaches. Researchers often struggle to integrate data processing and model training, leading to inefficiencies and suboptimal results. This separation hampers the ability to enhance data and models simultaneously, which is essential for improving AI capabilities.

Current methods for developing multi-modal generative models typically focus either on refining algorithms and model architectures or enhancing data processing techniques. These methods operate independently, relying on heuristic approaches and human intuition. Consequently, they lack systematic guidance for collaborative optimization of data and models, resulting in fragmented and less effective development efforts.

Researchers from Alibaba Group have introduced the Data-Juicer Sandbox, an open-source suite, to address these challenges. This sandbox facilitates the co-development of multi-modal data and generative models by integrating a variety of customizable components. It offers a flexible platform for systematic exploration and optimization, bridging the gap between data processing and model training. The suite is designed to streamline the development process and enhance the synergy between data and models.

The Data-Juicer Sandbox employs a “Probe-Analyze-Refine” workflow, allowing researchers to test and refine different data processing operators (OPs) and model configurations systematically. This method involves creating equal-size data pools, each processed uniquely by a single OP. Models are trained on these data pools, enabling in-depth analysis of OP effectiveness and its correlation with model performance across various quantitative and qualitative indicators. This systematic approach improves both data quality and model performance, providing valuable insights into the complex interplay between data preprocessing and model behavior.

In their methodology, the researchers implemented a hierarchical data pyramid, categorizing data pools based on their ranked model metric scores. This stratification helps identify the most effective OPs, which are then combined into data recipes and scaled up. By maintaining consistent hyperparameters and using cost-effective strategies like downsizing data pools and limiting training iterations, the researchers ensured an efficient and resource-conscious development process. The sandbox’s compatibility with existing model-centric infrastructures makes it a versatile tool for AI development.
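The following is a minimal, library-agnostic sketch of the loop described above, intended only to illustrate the probe-analyze-refine idea; the operator, training, and metric functions are placeholders, not the actual Data-Juicer Sandbox API.

```python
# Minimal sketch of "Probe-Analyze-Refine": build equal-size data pools, one per
# operator (OP), train a small model on each, rank OPs by the resulting metric,
# and chain the best OPs into a data recipe. All names here are placeholders.
import random
from typing import Callable, Dict, List

def probe_analyze_refine(
    dataset: List[dict],
    ops: Dict[str, Callable[[List[dict]], List[dict]]],   # name -> data-processing OP
    train_and_eval: Callable[[List[dict]], float],        # trains a model, returns a metric
    pool_size: int = 10_000,
    top_k: int = 3,
) -> List[dict]:
    base_pool = random.sample(dataset, min(pool_size, len(dataset)))
    # Probe: process the same pool with each OP and train a model on the result.
    scores = {name: train_and_eval(op(base_pool)) for name, op in ops.items()}
    # Analyze: rank OPs (the "data pyramid") by their model metric scores.
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Refine: chain the top-k OPs into a recipe and apply it to the full dataset.
    refined = dataset
    for name in ranked[:top_k]:
        refined = ops[name](refined)
    return refined
```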

The Data-Juicer Sandbox achieved significant performance improvements in several tasks. For image-to-text generation, the average performance on TextVQA, MMBench, and MME increased by 7.13%. In text-to-video generation, using the EasyAnimate model, the sandbox achieved the top spot on the VBench leaderboard, outperforming strong competitors. The experiments also demonstrated a 59.9% increase in aesthetic scores and a 49.9% improvement in language scores when using high-quality data pools. These results highlight the sandbox’s effectiveness in optimizing multi-modal generative models.

Moreover, the sandbox facilitated practical applications in two distinct scenarios: image-to-text generation and text-to-video generation. In the image-to-text task, using the Mini-Gemini model, the sandbox achieved top-tier performance in understanding image content. For the text-to-video task, the EasyAnimate model demonstrated the sandbox’s capability to generate high-quality videos from textual descriptions. These applications exemplified the sandbox’s versatility and effectiveness in enhancing multi-modal data-model co-development.

In conclusion, the Data-Juicer Sandbox addresses the critical problem of integrating data processing and model training in multi-modal generative models. By providing a systematic and flexible platform for co-development, it enables researchers to achieve significant improvements in AI performance. This innovative approach represents a major advancement in the field of AI, offering a comprehensive solution to the challenges of optimizing multi-modal generative models.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
The post This AI Paper by Alibaba Introduces Data-Juicer Sandbox: A Probe-Analyze-Refine Approach to Co-Developing Multi-Modal Data and Generative AI Models appeared first on MarkTechPost.

Athene-Llama3-70B Released: An Open-Weight LLM Trained through RLHF based on Llama-3-70B-Instruct

Nexusflow has released Athene-Llama3-70B, an open-weight chat model fine-tuned from Meta AI’s Llama-3-70B. Athene-70B has achieved an Arena-Hard-Auto score of 77.8%, rivaling proprietary models like GPT-4o and Claude-3.5-Sonnet. This marks a significant improvement from its predecessor, Llama-3-70B-Instruct, which scored 46.6%. The enhancement stems from Nexusflow’s targeted post-training pipeline, designed to improve specific model behaviors. Athene-70B is currently undergoing public testing on Chatbot Arena.

To maximize Llama-3-70B’s potential, Nexusflow developed internal benchmarks evaluating LLM capabilities in instruction following, coding, creative writing, and multilingual tasks. Based on these evaluations, high-quality preference data was curated for targeted Reinforcement Learning from Human Feedback (RLHF). This pipeline resulted in substantial performance improvements compared to Llama-3-70B-Instruct. The enhancements span key aspects such as precise instruction following, math and reasoning, comprehensive coding assistance, inspired creative writing, and multilingual mastery.

Athene-70B demonstrates Nexusflow’s capability to customize models for specific enterprise requirements through targeted post-training. Building on previous successes with Starling-7B and NexusRaven-V2, Nexusflow aims to advance its models to meet enterprise-grade application standards. The company offers tailored solutions to help businesses excel in GenAI copilot and agent technologies. Nexusflow invites organizations to explore how Athene-70B can enhance their AI initiatives by contacting them for further information and collaboration opportunities.

Athene-Llama3-70B, an open-weights chat model developed by Nexusflow, demonstrates significant improvements over its predecessor. The model achieves competitive performance compared to proprietary models in the Arena-Hard-Auto benchmark. Nexusflow’s targeted post-training pipeline, utilizing internal benchmarks and Reinforcement Learning from Human Feedback, has enhanced the model’s capabilities across various domains, including instruction following, math and reasoning, coding, creative writing, and multilingual tasks. This advancement showcases Nexusflow’s ability to tailor models for enterprise needs, building on their previous successes. The company positions itself as a provider of customized enterprise-grade AI solutions, inviting organizations to explore the potential of Athene-70B for their AI initiatives.

Check out the Model Card. All credit for this research goes to the researchers of this project.
The post Athene-Llama3-70B Released: An Open-Weight LLM Trained through RLHF based on Llama-3-70B-Instruct appeared first on MarkTechPost.

Apple AI Released a 7B Open-Source Language Model Trained on 2.5T Tokens on Open Datasets

Language models (LMs) have become fundamental in natural language processing (NLP), enabling text generation, translation, and sentiment analysis tasks. These models demand vast amounts of training data to function accurately and efficiently. However, the quality and curation of these datasets are critical to the performance of LMs. This field focuses on refining the data collection and preparation methods to enhance the models’ effectiveness.

A significant challenge in developing effective language models is improving training datasets. High-quality datasets are essential for training models that generalize well across various tasks, but creating such datasets is complex. It involves filtering out irrelevant or harmful content, removing duplicates, and selecting the most useful data sources.

Existing methods for dataset curation typically involve heuristic-based filtering, deduplication, and data sourcing from extensive web crawls. While these methods have provided some success, they often lack standardized benchmarks, leading to inconsistency in the performance evaluation of language models. This variability makes it difficult to determine the most effective data curation strategies, which hinders progress in the field.

Researchers from Apple, the University of Washington, and many other institutions have introduced DataComp for Language Models (DCLM) to address these issues. They have recently open-sourced the DCLM models and datasets on the Hugging Face Platform. The open source release comprises DCLM-7B, DCLM-1B, dclm-7b-it, DCLM-7B-8k, dclm-baseline-1.0, and dclm-baseline-1.0-parquet. This innovative testbed allows controlled experiments with large datasets to improve language models. The DCLM framework includes a comprehensive corpus of 240 trillion tokens from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. This setup provides a standardized approach to dataset curation, enabling consistent and comparable experiments.

DCLM offers a structured workflow for researchers. Participants can choose scales ranging from 412M to 7B parameters and experiment with data curation strategies such as deduplication, filtering, and data mixing. Researchers can train models on curated datasets using a standardized training recipe and specific hyperparameters. The performance of these models is then evaluated on a suite of downstream tasks, providing a clear measure of dataset quality. This systematic approach helps identify the most effective data curation strategies.

The introduction of DCLM has led to notable improvements in language model training. For instance, a baseline dataset created using DCLM enabled the training of a 7B parameter language model from scratch. This model achieved a 64% 5-shot accuracy on the MMLU benchmark with 2.6 trillion training tokens. This performance represents a 6.6 percentage point improvement over the previous state-of-the-art open-data language model, MAP-Neo, while using 40% less computing. The DCLM baseline model also performed comparably to Mistral-7B-v0.3 and Llama 3 8B, which required significantly more computational resources.

The DCLM framework’s effectiveness is further demonstrated by its scalability. Researchers conducted extensive experiments at different scales, from 400M to over 7B parameters, using DCLM-Pool, a corpus of 240 trillion tokens derived from Common Crawl. These experiments highlighted the critical role of model-based filtering in assembling high-quality training sets. The DCLM baseline dataset, created through this rigorous process, consistently outperformed other open-source datasets like RefinedWeb and RedPajama in various evaluations.

The research team also explored the impact of various data curation techniques. They compared text extraction methods, such as resiliparse and trafilatura, and found that these approaches significantly improved downstream performance compared to Common Crawl’s pre-extracted text. The team investigated several model-based quality filtering strategies, ultimately determining that the fastText OH-2.5 + ELI5 classifier was the most effective, providing a substantial lift in accuracy.
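For readers unfamiliar with model-based quality filtering, the following sketch shows the general shape of a fastText-based filter in the spirit of the OH-2.5 + ELI5 classifier mentioned above. The model path, label name, and threshold are assumptions; substitute the classifier and settings you actually use.

```python
# Sketch of model-based quality filtering with a fastText classifier.
# The model path and label name are illustrative assumptions.
import fasttext

classifier = fasttext.load_model("quality_classifier.bin")  # assumed local path

def keep(document: str, threshold: float = 0.9) -> bool:
    # fastText's predict() expects a single line of text, so collapse whitespace.
    text = " ".join(document.split())
    labels, probs = classifier.predict(text, k=1)
    return labels[0] == "__label__hq" and probs[0] >= threshold  # assumed label name

docs = ["example web document one", "example web document two"]
filtered = [doc for doc in docs if keep(doc)]
```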

In conclusion, the introduction of DCLM enables researchers to conduct controlled experiments and identify the most effective strategies for improving language models by providing a standardized and systematic approach to dataset curation. The DCLM framework sets a new benchmark for dataset quality and demonstrates the potential for significant performance improvements with reduced computational resources.
The post Apple AI Released a 7B Open-Source Language Model Trained on 2.5T Tokens on Open Datasets appeared first on MarkTechPost.

Nephilim v3 8B Released: An Innovative AI Approach to Merging Models for Enhanced Roleplay and Creativity

Llama-3-Nephilim-v3-8B and llama-3-Nephilim-v3-8B-GGUF are two innovative models released on Hugging Face. Although these models were never explicitly trained for roleplay, they exhibit remarkable capability in this domain, highlighting the potential of “found art” approaches in AI development. 

The creation of these models involved merging several pre-trained language models using mergekit, a tool designed to combine the strengths of different models. The llama-3-Nephilim-v3-8B model, with 8.03 billion parameters and utilizing BF16 tensor types, was tested with a temperature setting of one and a minimum probability (minP) of 0.01. This configuration allowed the model to lean towards creative outputs, which can be adjusted as desired. Despite initial format consistency issues, the model’s performance can be enhanced through prompt steering and proper instruct prompts, ensuring more consistent and varied text generation outputs.

The llama-3-Nephilim-v3-8B-GGUF variant, also boasting 8.03 billion parameters, features multiple quantization options, including 4-bit, 5-bit, 6-bit, and 8-bit quantizations. This model was tested with its counterpart’s temperature and minP settings. The GGUF quantizations of the merged model are aimed at maintaining creativity while optimizing performance for roleplay scenarios.

The research utilized the task arithmetic merge method, which allowed for the combination of the strengths of several models. The base model for this merge was the grimjim/Llama-3-Instruct-8B-SPPO-Iter3-SimPO, complemented by the tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1 model at a lower weight. This combination aimed to enhance chain-of-thought capabilities critical for roleplay and narrative consistency.
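For orientation, the snippet below shows what a task-arithmetic merge configuration along these lines could look like when written out for mergekit. The weight value is an assumption (the write-up only says the Swallow model was added at a lower weight), and the output path is arbitrary.

```python
# Illustrative mergekit configuration for a task-arithmetic merge of the two
# models named above. Save the YAML and run the mergekit CLI, for example:
#   mergekit-yaml merge.yaml ./merged-model
config = """\
merge_method: task_arithmetic
base_model: grimjim/Llama-3-Instruct-8B-SPPO-Iter3-SimPO
models:
  - model: tokyotech-llm/Llama-3-Swallow-8B-Instruct-v0.1
    parameters:
      weight: 0.3        # assumed value; the article only says "a lower weight"
dtype: bfloat16
"""

with open("merge.yaml", "w") as f:
    f.write(config)
```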

During testing, it was found that none of the merged models’ components were initially designed for roleplay. However, through rigorous testing, including RP interactions and ad hoc tests, the study identified three models that performed exceptionally well in roleplay scenarios. These included models trained with SPPO (Self-Play Preference Optimization) and SimPO (Simple Preference Optimization with a Reference-Free Reward). Despite not being benchmarked on the Open LLM Leaderboard, these models demonstrated strong performance in maintaining narrative coherence and character consistency.

The methodology also highlighted the potential of prompt steering in the instruction system. This approach can improve text generation’s readability and stylistic appeal and bypass censorship limitations during roleplay. While some glitches, such as misattribution of utterances and spontaneous gender flips, were observed, the overall performance of the merged models was impressive.

In conclusion, the release of these models on Hugging Face marks a significant contribution by merging models not initially intended for roleplay. The research demonstrated that innovative approaches could yield highly effective results. The llama-3-Nephilim-v3-8B and llama-3-Nephilim-v3-8B-GGUF models stand as a testament to the potential of AI models to adapt and excel in unforeseen applications.
The post Nephilim v3 8B Released: An Innovative AI Approach to Merging Models for Enhanced Roleplay and Creativity appeared first on MarkTechPost.

ZebraLogic: A Logical Reasoning AI Benchmark Designed for Evaluating LLMs with Logic Puzzles

Large language models (LLMs) demonstrate proficiency in information retrieval and creative writing, with notable improvements in mathematics and coding. ZebraLogic, a benchmark consisting of Logic Grid Puzzles, assesses LLMs’ logical reasoning capabilities. Each puzzle presents N houses with M features, requiring unique value assignments based on given clues. This task, a Constraint Satisfaction Problem (CSP), evaluates deductive reasoning skills and is commonly employed in assessments like the Law School Admission Test (LSAT) to measure human logical aptitude.

The study presents an example, a 2×3 Logic Grid Puzzle with two houses and three features: names, car models, and animals. Clues provide essential information for deductive reasoning. Through logical analysis, the solution emerges: Eric occupies House 1, owns a Ford F150, and keeps horses. Arnold resides in House 2, drives a Tesla Model 3, and owns a cat. This example demonstrates the step-by-step reasoning process required to solve such puzzles, illustrating the logical deduction skills evaluated by the ZebraLogic benchmark.

The ZebraLogic benchmark comprises 1,000 programmatically generated puzzles, ranging from 2×2 to 6×6 in size, with 40 puzzles per size category. Large language models undergo testing using a one-shot example approach. This method includes providing reasoning steps and a JSON-formatted solution. Models are instructed to output their reasoning process first, followed by presenting their answers in the same JSON format as the provided example. This standardized approach enables consistent evaluation of LLMs’ logical reasoning abilities across various puzzle complexities.

The evaluation employs two primary metrics: puzzle-level accuracy and cell-wise accuracy. For NxM puzzles, cell-wise accuracy measures the proportion of correctly filled cells out of NxM total cells. Puzzle-level success requires all cells to be correct. The 1,000 puzzles are categorized into easy and hard subsets based on size.

The probability of randomly guessing a single feature’s assignment correctly is 1/(N!), and the probability of guessing all cells correctly is (1/N!)^M. Logarithmic values of these probabilities are presented in a table, illustrating the increasing difficulty with puzzle size. This approach quantifies the complexity and assesses LLM performance against random chance.
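A small sketch of these two metrics and the random-guess baseline is shown below; the (house, feature) dictionary representation is an assumption for illustration, and the benchmark’s actual JSON schema may differ.

```python
# Sketch of ZebraLogic's cell-wise accuracy, puzzle-level accuracy, and the
# random-guess baseline. Solutions are dicts mapping (house, feature) -> value.
import math

def cell_accuracy(pred: dict, gold: dict) -> float:
    correct = sum(pred.get(cell) == value for cell, value in gold.items())
    return correct / len(gold)

def puzzle_accuracy(pred: dict, gold: dict) -> bool:
    # Puzzle-level success requires every cell to be correct.
    return all(pred.get(cell) == value for cell, value in gold.items())

def random_guess_log_prob(n_houses: int, m_features: int) -> float:
    # One feature is assigned correctly by chance with probability 1/N!;
    # all M features (all N*M cells) with probability (1/N!)**M.
    return m_features * -math.log(math.factorial(n_houses))

print(math.exp(random_guess_log_prob(2, 3)))  # 2x3 puzzle: (1/2!)**3 = 0.125
```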

Humans solve these puzzles using strategic reasoning, employing techniques like reductio ad absurdum and elimination. LLMs demonstrate weakness in logical reasoning tasks, with Claude 3.5 Sonnet achieving 33.4% overall accuracy and 12.4% on hard puzzles. DeepSeek-v2-Chat (0628) outperforms other open-weight models. Smaller models (7-10 billion parameters) struggle significantly with hard puzzles.

Results indicate LLMs lack crucial abilities for complex logical reasoning: counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization. Greedy decoding generally outperforms sampling for most models in hard reasoning tasks. Gemini-1.5 models show unexpected performance patterns across different decoding methods.

Human performance varies by puzzle size, with solving times ranging from 15 seconds for 2×2 puzzles to 10-15 minutes for 4×4 puzzles. A demo on HuggingFace allows for the exploration of the data and leaderboard.

The puzzle creation process involves several systematic steps:

1. Define features and possible values for each.

2. Establish clue types with language templates containing placeholders.

3. Generate solutions by randomly assigning values to a sampled grid.

4. Enumerate all possible clues describing variable relationships.

5. Iteratively remove clues through weighted sampling, ensuring remaining clues lead to a unique solution.

6. Format puzzles using prompting templates for LLM input.

Clue types include: Found_At, Not_At, Same_House, Direct_Left/Right, Side_By_Side, Left/Right_Of, and One/Two_between. Each type represents a specific logical constraint, allowing for diverse and challenging puzzle configurations.

ZebraLogic, a benchmark of 1,000 Logic Grid Puzzles, evaluates LLMs’ logical reasoning abilities. These puzzles require assigning unique values to features across N houses based on given clues. The study uses puzzle-level and cell-wise accuracy metrics, comparing LLM performance to random guessing probabilities. Results show LLMs struggle with complex logical reasoning, with the best model (Claude 3.5 Sonnet) solving only 33.4% of all puzzles and 12.4% of hard puzzles. The research highlights LLMs’ deficiencies in counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization. The article details the puzzle creation process and various clue types used, providing insights into the challenges of logical reasoning for AI systems.

Check out the Benchmark, GitHub, and Dataset Card. All credit for this research goes to the researchers of this project.
The post ZebraLogic: A Logical Reasoning AI Benchmark Designed for Evaluating LLMs with Logic Puzzles appeared first on MarkTechPost.

DeepSeek-V2-0628 Released: An Improved Open-Source Version of DeepSeek-V2

DeepSeek has recently released its latest open-source model on Hugging Face, DeepSeek-V2-Chat-0628. This release marks a significant advancement in AI-driven text generation and chatbot technology capabilities, positioning DeepSeek at the forefront of the industry.

DeepSeek-V2-Chat-0628 is an enhanced iteration of the previous DeepSeek-V2-Chat model. This new version has been meticulously refined to deliver superior performance across various benchmarks. According to the LMSYS Chatbot Arena Leaderboard, DeepSeek-V2-Chat-0628 has secured an impressive overall ranking of #11, outperforming all other open-source models. This achievement underscores DeepSeek’s commitment to advancing the field of artificial intelligence and providing top-tier solutions for conversational AI applications.

The improvements in DeepSeek-V2-Chat-0628 are extensive, covering various critical aspects of the model’s functionality. Notably, the model exhibits substantial enhancements in several benchmark tests:

HumanEval: The score improved from 81.1 to 84.8, reflecting a 3.7-point increase.

MATH: A remarkable leap from 53.9 to 71.0, indicating a 17.1-point improvement.

BBH: The performance score rose from 79.7 to 83.4, marking a 3.7-point enhancement.

IFEval: A significant increase from 63.8 to 77.6, a 13.8-point improvement.

Arena-Hard: Demonstrated the most dramatic improvement, with the score jumping from 41.6 to 68.3, a 26.7-point rise.

JSON Output (Internal): Improved from 78 to 85, showing a 7-point enhancement.

The DeepSeek-V2-Chat-0628 model also features optimized instruction-following capabilities within the “system” area, significantly enhancing the user experience. This optimization benefits tasks such as immersive translation and Retrieval-Augmented Generation (RAG), providing users with a more intuitive and efficient interaction with the AI.

For those interested in deploying DeepSeek-V2-Chat-0628, the model requires 80GB*8 GPUs for inference in BF16 format. Users can utilize Huggingface’s Transformers for model inference, which involves importing the necessary libraries and setting up the model and tokenizer with appropriate configurations. Compared to previous versions, the complete chat template has been updated, enhancing the model’s response generation and interaction capabilities. The new template includes specific formatting and token settings that ensure more accurate and relevant outputs based on user inputs.
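The following is a rough sketch of what Transformers-based inference could look like for this model. The repository ID, dtype, and generation settings are assumptions; treat the model card’s own snippet, including its updated chat template, as authoritative.

```python
# Sketch of Transformers inference for DeepSeek-V2-Chat-0628. The repo ID and
# settings below are assumptions; the model needs roughly 8x80GB GPUs in BF16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Chat-0628"   # assumed Hugging Face repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Summarize the benefits of RAG in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```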

vLLM is recommended for model inference, which offers a streamlined approach for integrating the model into various applications. The vLLM setup involves merging a pull request into the vLLM codebase and configuring the model and tokenizer to handle the desired tasks efficiently.

The DeepSeek-V2-Chat-0628 model is available under the MIT License for the code repository, with the model itself subject to the Model License. This allows for commercial use of the DeepSeek-V2 series, including both Base and Chat models, making it accessible for businesses and developers aiming to integrate advanced AI capabilities into their products & services.

In conclusion, the release of DeepSeek-V2-Chat-0628 showcases DeepSeek’s ongoing dedication to innovation in artificial intelligence. With impressive performance metrics and an enhanced user experience, this model is poised to set new standards in conversational AI.

Check out the Model Card and API. All credit for this research goes to the researchers of this project.
The post DeepSeek-V2-0628 Released: An Improved Open-Source Version of DeepSeek-V2 appeared first on MarkTechPost.

UT Austin Researchers Introduce PUTNAMBENCH: A Comprehensive AI Benchmark for Evaluating the Capabilities of Neural Theorem-Provers with Putnam Mathematical Problems

Automating mathematical reasoning has long been a goal in artificial intelligence, with formal frameworks like Lean 4, Isabelle, and Coq playing a significant role. These frameworks enable users to write machine-verifiable proofs of mathematical theorems, providing a structured environment for proving complex problems. Developing neural theorem-provers, which aim to automate this process, requires rigorous benchmarks to evaluate their effectiveness and drive further research.

A critical issue in AI-driven theorem proving is the lack of comprehensive benchmarks that challenge these systems with advanced mathematical problems. Existing benchmarks, such as MINI F2F and FIMO, primarily focus on high-school-level mathematics and do not sufficiently test the capabilities of neural theorem provers on more complex, undergraduate-level problems. This gap necessitates the creation of a more robust benchmark encompassing a wider range of mathematical challenges.

Researchers from UT Austin have introduced PUTNAMBENCH, a new benchmark designed to evaluate neural theorem-provers using problems from the William Lowell Putnam Mathematical Competition. This competition is renowned in North America for its challenging college-level mathematics problems, making it an ideal source for a rigorous benchmark. PUTNAMBENCH includes 1697 formalizations of 640 problems, each available in Lean 4 and Isabelle, with a significant subset also available in Coq. This multilingual approach ensures comprehensive evaluation across different theorem-proving environments.

PUTNAMBENCH’s methodology involves manually constructing formalizations of Putnam competition problems, ensuring each problem is carefully debugged and available in multiple formal proof languages. These formalizations cover various topics taught in undergraduate mathematics courses, such as algebra, analysis, number theory, and combinatorics. The problems are designed to test significant problem-solving abilities and proficiency in various mathematical concepts, making PUTNAMBENCH a challenging benchmark for neural theorem provers.
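To make the notion of a formalization concrete, here is a small hypothetical Lean 4 statement in the style of a PUTNAMBENCH entry (it is not an actual benchmark problem): the benchmark supplies such statements, and a neural theorem prover must replace the `sorry` placeholder with a machine-checked proof.

```lean
import Mathlib

-- Hypothetical Putnam-style statement (not an actual PUTNAMBENCH problem).
-- A theorem prover is asked to replace `sorry` with a verified proof.
theorem example_inequality (x : ℝ) (hx : 0 < x) : 2 ≤ x + 1 / x := by
  sorry
```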

The evaluation of PUTNAMBENCH utilized several neural and symbolic theorem-provers, including Draft-Sketch-Prove, COPRA, GPT-4, Sledgehammer, and Coqhammer. These methods were tested on the 1697 formalizations, with each technique attempting to solve the problems using its own approach. The results showed that current methods could solve only a handful of the PUTNAMBENCH problems. For instance, GPT-4 solved only one out of 640 problems in Lean 4 and Coq, while Sledgehammer solved three out of 640 problems in Isabelle.

One of the key challenges highlighted by the PUTNAMBENCH evaluations is the difficulty of synthesizing new lemmas and orchestrating them into intricate proofs. While current theorem provers can effectively stitch together standard proof steps well-represented in their training corpus, they often struggle to create new, innovative proof strategies. This limitation underscores the need for more advanced neural models that can leverage deep mathematical knowledge and reasoning.

PUTNAMBENCH’s multilingual nature sets it apart from previous benchmarks. By including problems in Lean 4, Isabelle, and Coq, PUTNAMBENCH allows for a more comprehensive evaluation of theorem-proving methods. This approach ensures that the benchmark can test theorem-provers’ robustness across different formal proof environments, providing a complete picture of their capabilities and limitations.

In conclusion, PUTNAMBENCH, by providing a diverse set of 1697 formalizations of Putnam competition problems across multiple formal proof languages, addresses the limitations of existing benchmarks. It sets a new standard for rigor and comprehensiveness. The results from current evaluations indicate that while progress has been made, there is still a long way to go in developing neural theorem provers capable of solving complex mathematical problems. PUTNAMBENCH will undoubtedly be crucial in driving future research and innovation.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post UT Austin Researchers Introduce PUTNAMBENCH: A Comprehensive AI Benchmark for Evaluating the Capabilities of Neural Theorem-Provers with Putnam Mathematical Problems appeared first on MarkTechPost.

Q-Sparse: A New Artificial Intelligence AI Approach to Enable Full Sparsity of Activations in LLMs

LLMs excel in natural language processing tasks but face deployment challenges due to high computational and memory demands during inference. Recent research [MWM+24, WMD+23, SXZ+24, XGZC23, LKM23] aims to enhance LLM efficiency through quantization, pruning, distillation, and improved decoding. Sparsity, a key approach, reduces computation by omitting zero elements and lessens I/O transfer between memory and computation units. While weight sparsity saves computation, it struggles with GPU parallelization and accuracy loss. Activation sparsity, achieved via techniques like the mixture-of-experts (MoE) mechanism, still falls short of full efficiency and requires further study of its scaling laws compared to dense models.

Researchers from Microsoft and the University of Chinese Academy of Sciences have developed Q-Sparse, an efficient approach for training sparsely-activated LLMs. Q-Sparse enables full activation sparsity by applying top-K sparsification to activations and using a straight-through estimator during training, significantly enhancing inference efficiency. Key findings include achieving baseline LLM performance with lower inference costs, establishing an optimal scaling law for sparsely-activated LLMs, and demonstrating effectiveness in various training settings. Q-Sparse works with full-precision and 1-bit models, offering a path to more efficient, cost-effective, and energy-saving LLMs.

Q-Sparse enhances the Transformer architecture by enabling full sparsity in activations through top-K sparsification and the straight-through estimator (STE). This approach applies a top-K function to the activations during matrix multiplication, reducing computational costs and memory footprint. It supports full-precision and quantized models, including 1-bit models like BitNet b1.58. Additionally, Q-Sparse uses squared ReLU for feed-forward layers to improve activation sparsity. For training, it overcomes gradient vanishing by using STE. Q-Sparse is effective for training from scratch, continue-training, and fine-tuning, maintaining efficiency and performance across various settings.
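A minimal PyTorch sketch of the core idea, top-K activation sparsification with a straight-through estimator, is shown below; the kept fraction and shapes are illustrative, not the paper’s settings.

```python
# Sketch of Q-Sparse-style activation sparsity: keep only the top-K activations
# (by magnitude) before a matrix multiplication, and use a straight-through
# estimator (STE) so gradients flow as if no sparsification had happened.
import torch

def topk_sparsify_ste(x: torch.Tensor, keep_ratio: float = 0.4) -> torch.Tensor:
    k = max(1, int(keep_ratio * x.shape[-1]))
    # Build a mask that keeps the k largest-magnitude activations per row.
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    x_sparse = x * mask
    # STE: forward pass uses the sparse activations; backward pass treats the
    # sparsification as the identity, so gradients reach all activations.
    return x + (x_sparse - x).detach()

# Usage inside a linear layer: sparsify activations, then multiply by weights.
x = torch.randn(2, 8, requires_grad=True)
w = torch.randn(8, 4)
y = topk_sparsify_ste(x) @ w
y.sum().backward()
print(x.grad.shape)   # gradients flow to every activation via the STE
```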

Recent studies show that LLM performance scales with model size and training data following a power law. The researchers explore this for sparsely-activated LLMs, finding that their performance also follows a power law with model size and an exponential law with respect to the sparsity ratio. Experiments reveal that, with a fixed sparsity ratio, the performance of sparsely-activated models scales similarly to that of dense models. The performance gap between sparse and dense models diminishes with increasing model size. An inference-optimal scaling law indicates that sparse models can efficiently match or outperform dense models with proper sparsity, with optimal sparsity ratios of 45.58% for full precision and 61.25% for 1.58-bit models.

The researchers evaluated Q-Sparse LLMs in various settings, including training from scratch, continue-training, and fine-tuning. When training from scratch with 50B tokens, Q-Sparse matched dense baselines at 40% sparsity. BitNet b1.58 models with Q-Sparse outperformed dense baselines with the same compute budget. Continue-training of Mistral 7B showed that Q-Sparse achieved comparable performance to dense baselines but with higher efficiency. Fine-tuning results demonstrated that Q-Sparse models with around 4B activated parameters matched or exceeded the performance of dense 7B models, proving Q-Sparse’s efficiency and effectiveness across training scenarios.

In conclusion, the results show that combining BitNet b1.58 with Q-Sparse offers significant efficiency gains, particularly in inference. The researchers plan to scale up training with more model sizes and tokens and integrate YOCO to optimize KV cache management. Q-Sparse complements MoE and will be adapted for batch processing to enhance its practicality. Q-Sparse performs comparably to dense baselines, enhancing inference efficiency through top-K sparsification and the straight-through estimator. It is effective across various settings and compatible with full-precision and 1-bit models, making it a pivotal approach for improving LLM efficiency and sustainability.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Q-Sparse: A New Artificial Intelligence AI Approach to Enable Full Sparsity of Activations in LLMs appeared first on MarkTechPost.

Snowflake-Arctic-Embed-m-v1.5 Released: A 109M Parameters Groundbreaking Text Embedding Model with Enhanced Compression and Performance Capabilities

Snowflake recently announced the release of its updated text embedding model, snowflake-arctic-embed-m-v1.5. This model generates highly compressible embedding vectors while maintaining high performance. The model’s most noteworthy feature is its ability to produce embedding vectors compressed to as small as 128 bytes per vector without significantly losing quality. This is achieved through Matryoshka Representation Learning (MRL) and uniform scalar quantization. These techniques enable the model to retain most of its retrieval quality even at this high compression level, a critical advantage for applications requiring efficient storage and fast retrieval.

The snowflake-arctic-embed-m-v1.5 model builds upon its predecessors by incorporating improvements in the architecture and training process. Originally released on April 16, 2024, the snowflake-arctic-embed family of models has been designed to improve embedding vector compressibility while achieving slightly higher overall performance. The updated version, v1.5, continues this trend with enhancements that make it particularly suitable for resource-constrained environments where storage and computational efficiency are paramount.

Evaluation results of snowflake-arctic-embed-m-v1.5 show that it maintains high-performance metrics across various benchmarks. For instance, the model achieves a mean retrieval score of 55.14 on the MTEB (Massive Text Embedding Benchmark) Retrieval benchmark when using 256-dimensional vectors, surpassing several other models trained with similar objectives. Compressed to 128 bytes, it still retains a commendable retrieval score of 53.7, demonstrating its robustness even under significant compression.

The model’s technical specifications reveal a design that emphasizes efficiency and compatibility. It consists of 109 million parameters and utilizes 256-dimensional vectors by default, which can be further truncated and quantized for specific use cases. This adaptability makes it an attractive option for applications, from search engines to recommendation systems, where efficient text processing is crucial.

Snowflake Inc. has also provided comprehensive usage instructions for the snowflake-arctic-embed-m-v1.5 model. Users can implement the model using popular frameworks like Hugging Face’s Transformers and Sentence Transformers libraries. Example code snippets illustrate how to load the model, generate embeddings, and compute similarity scores between text queries and documents. These instructions facilitate easy integration into existing NLP pipelines, allowing users to leverage the model’s capabilities with minimal overhead.
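As a rough sketch of the truncation-plus-quantization idea, the snippet below encodes a query, truncates the embedding to 256 dimensions in Matryoshka style, and applies a simple uniform int8 scalar quantization. The Hugging Face repo ID is an assumption, and the int8 scheme is only illustrative of the concept; the 128-byte figure quoted above corresponds to a more aggressive 4-bit representation, and the model card’s instructions should be followed for production use.

```python
# Sketch: MRL-style truncation plus uniform scalar quantization for compact vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")  # assumed repo ID

emb = model.encode(["what is scalar quantization?"], normalize_embeddings=True)
emb256 = emb[:, :256]                                              # Matryoshka-style truncation
emb256 = emb256 / np.linalg.norm(emb256, axis=1, keepdims=True)    # re-normalize after truncation

# Uniform scalar quantization: map each float to an int8 code with one shared scale.
scale = np.abs(emb256).max() / 127.0
codes = np.round(emb256 / scale).astype(np.int8)   # 256 bytes per vector at int8
approx = codes.astype(np.float32) * scale           # dequantize for similarity search
print(codes.nbytes, float(np.dot(emb256[0], approx[0])))
```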

In terms of deployment, snowflake-arctic-embed-m-v1.5 can be used in various environments, including serverless inference APIs and dedicated inference endpoints. This flexibility ensures that the model can be scaled according to the specific needs and infrastructure of the user, whether they are operating on a small-scale or a large enterprise-level application.

In conclusion, as Snowflake Inc. continues to refine and expand its offerings in text embeddings, the snowflake-arctic-embed-m-v1.5 model stands out as a testament to its expertise and vision. Addressing the critical needs for compression and text embedding performance underscores the company’s commitment to advancing state-of-the-art text embedding technology, providing powerful tools for efficient and effective text processing. The model’s innovative design and high performance make it a valuable asset for developers & researchers seeking to enhance their applications with cutting-edge NLP capabilities.

Check out the Paper and HF Model Card. All credit for this research goes to the researchers of this project.
The post Snowflake-Arctic-Embed-m-v1.5 Released: A 109M Parameters Groundbreaking Text Embedding Model with Enhanced Compression and Performance Capabilities appeared first on MarkTechPost.

From Diagrams to Solutions: MAVIS’s Three-Stage Framework for Mathem …

Large Language Models (LLMs) and their multi-modal counterparts (MLLMs) have made significant strides in advancing artificial general intelligence (AGI) across various domains. However, these models face a significant challenge in the realm of visual mathematical problem-solving. While MLLMs have demonstrated impressive capabilities in diverse tasks, they struggle to fully utilize their potential when confronted with mathematical problems presented in visual contexts. This limitation is particularly evident in scenarios where models must interpret geometric figures, understand spatial relationships, and integrate complex mathematical concepts with visual information.

The difficulty lies in the unique demands of visual mathematical problem-solving, which requires a seamless integration of analytical reasoning from textual questions with the contextual information provided by visual diagrams. Unlike text-only mathematical problems, where LLMs have shown considerable progress due to abundant training data and their inherent language proficiency, visual mathematics introduces an additional layer of complexity. Models must not only comprehend the mathematical concepts but also accurately interpret visual elements such as geometric shapes, angles, measurements, and spatial relationships represented in diagrams.

Visual instruction tuning for MLLMs has seen significant advancements through approaches like LLaMA-Adapter, LLaVA, Flamingo, SPHINX, and InternVL, each introducing efficient techniques for vision-language integration. Simultaneously, text-based mathematical problem-solving has progressed with projects like MAmmoTH, MetaMATH, and MathCoder. However, in the multi-modal mathematical domain, efforts remain limited. Datasets such as Geometry3K and UniMath have emerged, but their scope and scale are insufficient. G-LLaVA shows promise in graphical geometry but struggles in other mathematical areas, highlighting the need for more robust, comprehensive approaches to visual mathematical problem-solving.

Researchers from CUHK, Peking University, Shanghai AI Laboratory, and Oracle introduce MAVIS (MAthematical VISual instruction tuning), a framework that addresses the limitations of MLLMs in visual mathematical problem-solving. It tackles three critical issues: unsatisfactory math diagram embeddings from vision encoders, diagram-language misalignment between vision encoders and LLMs, and inaccurate mathematical reasoning over visual elements. MAVIS introduces two extensive datasets, MAVIS-Caption and MAVIS-Instruct, covering various mathematical domains, and employs a progressive three-stage training pipeline to enhance diagram visual encoding and reasoning capabilities. The result is MAVIS-7B, a specialized MLLM optimized for visual mathematical tasks that demonstrates superior performance on evaluation benchmarks compared to existing open-source MLLMs, highlighting the effectiveness of this targeted approach in advancing visual mathematical problem-solving.

MAVIS introduces an innovative data engine to generate high-quality mathematical diagrams efficiently, addressing the scarcity of visual mathematics datasets. The engine covers three main diagram types: plane geometry, analytic geometry, and function. For plane geometry, it employs multi-hop data curation principles, iteratively combining basic shapes to create diverse configurations. Analytic geometry diagrams are constructed on a Cartesian coordinate system, incorporating various geometric elements without overlap. Function diagrams focus on seven fundamental types, using parameterized equations to generate diverse graphs. All diagrams are rendered using Matplotlib, with additional features like vertex labeling and key point plotting to enhance mathematical understanding and reasoning capabilities.
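
As a toy illustration of what such a data engine might do for the function category (this is not the MAVIS code itself), the sketch below samples a parameterized quadratic, renders it with Matplotlib, and annotates its vertex as a key point.

```python
# Hypothetical sketch: render one parameterized function diagram with
# Matplotlib and label a key point, loosely mirroring the described engine.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
a, b, c = rng.integers(-3, 4, size=3)  # random quadratic coefficients

x = np.linspace(-5, 5, 400)
y = a * x**2 + b * x + c

fig, ax = plt.subplots()
ax.plot(x, y, label=f"y = {a}x^2 + {b}x + {c}")
ax.axhline(0, linewidth=0.8)  # coordinate axes
ax.axvline(0, linewidth=0.8)

# Mark the parabola's vertex as a "key point" when the curve is quadratic.
if a != 0:
    vx = -b / (2 * a)
    vy = a * vx**2 + b * vx + c
    ax.scatter([vx], [vy])
    ax.annotate(f"vertex ({vx:.1f}, {vy:.1f})", (vx, vy))

ax.legend()
fig.savefig("function_diagram.png")
```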

MAVIS-Caption, a crucial component of the MAVIS framework, is a large-scale dataset comprising 588,000 diagram-caption pairs. It covers three mathematical domains: plane geometry (299K pairs), analytic geometry (77K pairs), and function (212K pairs). The dataset’s captions are detailed, with an average length of 61.48 words and a vocabulary size of 149. Caption generation strategies vary by diagram type, utilizing GPT-4-created templates and specific rules for each domain. Plane geometry captions are built iteratively, analytic geometry captions use coordinate-based descriptions, and function captions detail various properties of the graphed functions. All captions are refined by ChatGPT for natural language expression, ensuring high-quality, diverse, and mathematically accurate descriptions of visual mathematical content.

MAVIS-Instruct is a comprehensive dataset of 834,000 visual math problems designed to enhance MLLMs’ visual mathematical reasoning capabilities. It covers plane geometry and function problems, each accompanied by a Chain-of-Thought (CoT) rationale averaging 150 words. The dataset’s questions are streamlined to minimize textual redundancy, encouraging MLLMs to extract critical information from visual inputs. MAVIS-Instruct is compiled from four sources: manually collected problems augmented by GPT-4 (84K), existing datasets expanded by GPT-4 (80K), data engine captions annotated by GPT-4 (51K), and problems directly generated by the data engine. This diverse approach ensures broad coverage of mathematical concepts and problem types, while maintaining high-quality, detailed solutions and rationales for each problem.

MAVIS-7B demonstrates superior performance across multiple mathematical benchmarks, showcasing its effectiveness in visual mathematical problem-solving. On the comprehensive MathVerse benchmark, MAVIS-7B achieves the highest overall accuracy among open-source models, surpassing larger models and specialized mathematical MLLMs. It outperforms InternLM-XComposer2 (7B) by 11.0% and ShareGPT4V (13B) by 10.1%. In specific domains, MAVIS-7B excels on GeoQA for plane geometry, achieving 66.7% accuracy, and on FunctionQA, reaching 40.3% accuracy, outperforming both traditional methods and other MLLMs. Qualitative analysis reveals MAVIS-7B’s superior understanding of geometric elements, function curves, and coordinate axes, leading to higher-quality Chain-of-Thought reasoning compared to GPT-4V.

This study introduces MAVIS, an efficient approach to mathematical visual instruction tuning for MLLMs. The framework comprises two key components: high-quality datasets (MAVIS-Caption and MAVIS-Instruct) generated by a sophisticated data engine, and a three-stage training pipeline. This process sequentially enhances the math-specific vision encoder, improves diagram-language alignment, and develops mathematical reasoning capabilities. The resulting specialist model, MAVIS-7B, demonstrates exceptional performance across various mathematical visual benchmarks. MAVIS’s innovative approach sets a new standard in visual mathematical problem-solving, paving the way for future advancements in this critical area of artificial intelligence and education technology.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

The post From Diagrams to Solutions: MAVIS’s Three-Stage Framework for Mathematical AI appeared first on MarkTechPost.