Use Amazon Bedrock to generate, evaluate, and understand code in your …

Generative artificial intelligence (AI) models have opened up new possibilities for automating and enhancing software development workflows. In particular, the emergent capability of generative models to produce code from natural language prompts is changing how developers and DevOps professionals approach their work and improve their efficiency. In this post, we provide an overview of how to take advantage of the advancements of large language models (LLMs) using Amazon Bedrock to assist developers at various stages of the software development lifecycle (SDLC).
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
The following process architecture proposes an example SDLC flow that incorporates generative AI in key areas to improve the efficiency and speed of development.

The intent of this post is to focus on how developers can create their own systems to augment, write, and audit code by using models within Amazon Bedrock instead of relying on out-of-the-box coding assistants. We discuss the following topics:

A coding assistant use case to help developers write code faster by providing suggestions
How to use the code understanding capabilities of LLMs to surface insights and recommendations
An automated application generation use case to generate functioning code and automatically deploy changes into a working environment

Considerations
It’s important to consider some technical options when choosing your model and approach to implementing this functionality at each step. One such option is the base model to use for the task. Because each model has been trained on a different corpus of data, each will inherently perform differently per task. For example, Anthropic’s Claude 3 models on Amazon Bedrock write code effectively out of the box in many common coding languages, whereas others may not reach that performance without further customization. Customization, however, is another technical choice to make. For instance, if your use case includes a less common language or framework, customizing the model through fine-tuning or using Retrieval Augmented Generation (RAG) may be necessary to achieve production-quality performance, but it involves more complexity and engineering effort to implement effectively.
There is an abundance of literature breaking down these trade-offs, each of which deserves exploration in its own right; in this post, we simply lay out the context that informs a builder’s initial steps in implementing their generative AI-powered SDLC journey.
Coding assistant
Coding assistants are a very popular use case, with an abundance of examples from which to choose. AWS offers several services that can assist developers, either through in-line completion from tools like Amazon CodeWhisperer, or through natural language interaction using Amazon Q. Amazon Q for builders has several implementations of this functionality, such as:

Amazon Q AWS expert interface
Amazon Q Developer in IDEs
Amazon EC2 instance type selection
Generative SQL for Amazon Redshift Query Editor

Nearly all of the use cases described can integrate with a chat interface and assistants. The use cases here focus on more direct code generation from natural language prompts; this is not to be confused with in-line generation tools that autocomplete a coding task.
The key benefit of an assistant over in-line generation is that you can start new projects based on simple descriptions. For instance, you can describe that you want a serverless website that will allow users to post in blog fashion, and Amazon Q can start building the project by providing sample code and making recommendations on which frameworks to use to do this. This natural language entry point can give you a template and framework to operate within so you can spend more time on the differentiating logic of your application rather than the setup of repeatable and commoditized components.
Code understanding
Companies that begin experimenting with generative AI to augment the productivity of their individual developers commonly move on to using LLMs to infer the meaning and functionality of code, improving the reliability, efficiency, security, and speed of the development process. Code understanding by humans is a central part of the SDLC: creating documentation, performing code reviews, and applying best practices. Onboarding new developers can be a challenge even for mature teams. Instead of a more senior developer taking time to respond to questions, an LLM with awareness of the code base and the team’s coding standards could explain sections of code and design decisions to the new team member. The onboarding developer gets everything they need with a rapid response time, and the senior developer can focus on building. In addition to user-facing behaviors, this same mechanism can be repurposed to work completely behind the scenes to augment existing continuous integration and continuous delivery (CI/CD) processes as an additional reviewer.
For instance, you can use prompt engineering techniques to guide and automate the application of coding standards, or include the existing code base as reference material so the model uses your custom APIs. You can also take proactive measures by prefixing each prompt with a reminder to follow the coding standards, making a call to fetch them from document storage and passing them to the model as context with the prompt. As a retroactive measure, you can add a step during the review process that checks the written code against the standards to enforce adherence, similar to how a team code review would work. For example, let’s say that one of the team’s standards is to reuse components. During the review step, the model can read over a new code submission, note that the component already exists in the code base, and suggest that the reviewer reuse the existing component instead of recreating it.
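The following is a minimal sketch of how that prefixing step might look using the Amazon Bedrock Converse API with boto3. The bucket, key, file name, and model ID are illustrative placeholders rather than values from this post.

import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")

# Fetch the team's coding standards from document storage (placeholder bucket and key).
standards = s3.get_object(
    Bucket="my-team-docs", Key="standards/python-coding-standards.md"
)["Body"].read().decode("utf-8")

# The new code submission to review (placeholder file name).
with open("new_component.py") as f:
    code_to_review = f.read()

response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # example model ID
    system=[{"text": f"You are a code reviewer. Enforce these coding standards:\n{standards}"}],
    messages=[{
        "role": "user",
        "content": [{
            "text": "Review the following submission against the standards and flag any "
                    f"components that already exist in our code base:\n{code_to_review}"
        }],
    }],
)
print(response["output"]["message"]["content"][0]["text"])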
The following diagram illustrates this type of workflow.

Application generation
You can extend the concepts from the use cases described in this post to create a full application generation implementation. In the traditional SDLC, a human creates a set of requirements, makes a design for the application, writes code to implement that design, builds tests, and receives feedback on the system from external sources or people, and then the process repeats. The bottleneck in this cycle typically comes at the implementation and testing phases. An application builder needs substantive technical skills to write code effectively, and numerous iterations are typically required to debug and perfect code, even for the most skilled builders. In addition, foundational knowledge of a company’s existing code base, APIs, and IP is fundamental to implementing an effective solution, which can take humans a long time to learn. This can slow down the time to innovation for new teammates or teams with technical skills gaps. As mentioned earlier, if models can both create and interpret code, pipelines can be created that perform the developer iterations of the SDLC by feeding the model’s outputs back in as inputs.
The following diagram illustrates this type of workflow.

For example, you can use natural language to ask a model to write an application that prints all the prime numbers between 1 and 100. It returns a block of code that can be run with applicable tests defined. If the program doesn’t run or some tests fail, the error and failing code can be fed back into the model, asking it to diagnose the problem and suggest a solution. The next step in the pipeline would be to take the original code, along with the diagnosis and suggested solution, and stitch the code snippets together to form a new program. The SDLC restarts in the testing phase to get new results, and either iterates again or a working application is produced. With this basic framework, an increasing number of components can be added in the same manner as in a traditional human-based workflow. This modular approach can be continuously improved until there is a robust and powerful application generation pipeline that simply takes in a natural language prompt and outputs a functioning application, handling all of the error correction and best practice adherence behind the scenes.
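As a rough sketch of such a generate-test-repair loop, the following assumes a hypothetical generate_code helper that wraps an Amazon Bedrock model call and returns Python source as a string; the file name, test command, and retry limit are illustrative.

import subprocess

MAX_ITERATIONS = 5

def run_tests(path: str) -> subprocess.CompletedProcess:
    # Run the generated program's tests and capture the output.
    return subprocess.run(["python", "-m", "pytest", path], capture_output=True, text=True)

prompt = "Write a Python module that prints all prime numbers between 1 and 100, with pytest tests."
source = generate_code(prompt)  # hypothetical wrapper around an Amazon Bedrock model call

for attempt in range(MAX_ITERATIONS):
    with open("generated_app.py", "w") as f:
        f.write(source)
    result = run_tests("generated_app.py")
    if result.returncode == 0:
        print("All tests passed")
        break
    # Feed the failing code and the error back into the model and ask for a fix.
    source = generate_code(
        "The following program failed its tests.\n\n"
        f"Code:\n{source}\n\nTest output:\n{result.stdout}\n{result.stderr}\n\n"
        "Diagnose the problem and return a corrected version of the full program."
    )
else:
    print("Did not converge within the retry limit")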
The following diagram illustrates this advanced workflow.

Conclusion
We are at the point in the adoption curve of generative AI where teams are able to get real productivity gains from the variety of techniques and tools available. In the near future, it will be imperative to take advantage of these productivity gains to stay competitive. One thing we do know is that the landscape will continue to progress and change rapidly, so building a system that tolerates change and stays flexible is key. Developing your components in a modular fashion allows for stability in the face of an ever-changing technical landscape while being ready to adopt the latest technology at each step of the way.
For more information about how to get started building with LLMs, see these resources:

How Q4 Inc. used Amazon Bedrock, RAG, and SQLDatabaseChain to address numerical and structured dataset challenges building their Q&A chatbot
Boosting RAG-based intelligent document assistants using entity extraction, SQL querying, and agents with Amazon Bedrock
Create summaries of recordings using generative AI with Amazon Bedrock and Amazon Transcribe

About the Authors
Ian Lenora is an experienced software development leader who focuses on building high-quality cloud native software, and exploring the potential of artificial intelligence. He has successfully led teams in delivering complex projects across various industries, optimizing efficiency and scalability. With a strong understanding of the software development lifecycle and a passion for innovation, Ian seeks to leverage AI technologies to solve complex problems and create intelligent, adaptive software solutions that drive business value.
Cody Collins is a New York-based Solutions Architect at Amazon Web Services, where he collaborates with ISV customers to build cutting-edge solutions in the cloud. He has extensive experience in delivering complex projects across diverse industries, optimizing for efficiency and scalability. Cody specializes in AI/ML technologies, enabling customers to develop ML capabilities and integrate AI into their cloud applications.
Samit Kumbhani is an AWS Senior Solutions Architect in the New York City area with over 18 years of experience. He currently collaborates with Independent Software Vendors (ISVs) to build highly scalable, innovative, and secure cloud solutions. Outside of work, Samit enjoys playing cricket, traveling, and biking.

Inference AudioCraft MusicGen models using Amazon SageMaker

Music generation models have emerged as powerful tools that transform natural language text into musical compositions. Originating from advancements in artificial intelligence (AI) and deep learning, these models are designed to understand and translate descriptive text into coherent, aesthetically pleasing music. Their ability to democratize music production allows individuals without formal training to create high-quality music by simply describing their desired outcomes.
Generative AI models are revolutionizing music creation and consumption. Companies can take advantage of this technology to develop new products, streamline processes, and explore untapped potential, yielding significant business impact. Such music generation models enable diverse applications, from personalized soundtracks for multimedia and gaming to educational resources for students exploring musical styles and structures. They also assist artists and composers by providing new ideas and compositions, fostering creativity and collaboration.
One prominent example of a music generation model is AudioCraft MusicGen by Meta. The MusicGen code is released under the MIT license, and the model weights are released under CC-BY-NC 4.0. MusicGen can create music based on text or melody inputs, giving you better control over the output. The following diagram shows how MusicGen, a single-stage auto-regressive Transformer model, can generate high-quality music based on text descriptions or audio prompts.

MusicGen uses cutting-edge AI technology to generate diverse musical styles and genres, catering to various creative needs. Unlike traditional approaches that cascade several models (for example, hierarchically or with upsampling), MusicGen is a single language model that operates over several streams of compressed, discrete music representations (tokens). This streamlined approach gives users precise control over generating high-quality mono and stereo samples tailored to their preferences, revolutionizing AI-driven music composition.
MusicGen models can be used across education, content creation, and music composition. They can enable students to experiment with diverse musical styles, generate custom soundtracks for multimedia projects, and create personalized music compositions. Additionally, MusicGen can assist musicians and composers, fostering creativity and innovation.
This post demonstrates how to deploy MusicGen, a music generation model, on Amazon SageMaker using asynchronous inference. We specifically focus on text-conditioned generation of music samples using MusicGen models.
Solution overview
Generative AI models that produce audio, music, or video can be computationally intensive and time-consuming. SageMaker asynchronous inference, which queues incoming requests and processes them asynchronously, is a good fit for such workloads. Our solution involves deploying the AudioCraft MusicGen model, sourced from the Hugging Face Model Hub, to a SageMaker asynchronous inference endpoint.
The following solution architecture diagram shows how a user can generate music using natural language text as an input prompt by using AudioCraft MusicGen models deployed on SageMaker.

The following steps detail the sequence happening in the workflow from the moment the user enters the input to the point where music is generated as output:

The user invokes the SageMaker asynchronous endpoint using an Amazon SageMaker Studio notebook.
The input payload is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket for inference. The payload consists of both the prompt and the music generation parameters. The generated music will be downloaded from the S3 bucket.
The facebook/musicgen-large model is deployed to a SageMaker asynchronous endpoint. This endpoint is used to run music generation inference.
The Hugging Face Inference Containers image is used as the base image. We use an image that supports PyTorch 2.1.0 with the Hugging Face Transformers framework.
The SageMaker HuggingFaceModel is deployed to a SageMaker asynchronous endpoint.
The Hugging Face model (facebook/musicgen-large) is uploaded to Amazon S3 during deployment. Also, during inference, the generated outputs are uploaded to Amazon S3.
We use Amazon Simple Notification Service (Amazon SNS) topics to notify success and failure, as defined in the SageMaker asynchronous inference configuration.

Prerequisites
Make sure you have the following prerequisites in place:

Confirm you have access to the AWS Management Console to create and manage resources in SageMaker, AWS Identity and Access Management (IAM), and other AWS services.
If you’re using SageMaker Studio for the first time, create a SageMaker domain. Refer to Quick setup to Amazon SageMaker to create a SageMaker domain with default settings.
Obtain the AWS Deep Learning Containers image for large model inference from the pre-built Hugging Face Inference Containers.

Deploy the solution
To deploy the AudioCraft MusicGen model to a SageMaker asynchronous inference endpoint, complete the following steps:

Create a model serving package for MusicGen.
Create a Hugging Face model.
Define asynchronous inference configuration.
Deploy the model on SageMaker.

We detail each of these steps and show how to deploy the MusicGen model onto SageMaker. For the sake of brevity, only significant code snippets are included. The full source code for deploying the MusicGen model is available in the GitHub repo.
Create a model serving package for MusicGen
To deploy MusicGen, we first create a model serving package. The model package contains a requirements.txt file that lists the necessary Python packages to be installed to serve the MusicGen model. The model package also contains an inference.py script that holds the logic for serving the MusicGen model.
Let’s look at the key functions used in serving the MusicGen model for inference on SageMaker:

from transformers import MusicgenForConditionalGeneration

def model_fn(model_dir):
    """Loads the model."""
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")
    return model

The model_fn function loads the MusicGen model facebook/musicgen-large from the Hugging Face Model Hub. We rely on the MusicgenForConditionalGeneration Transformers module to load the pre-trained MusicGen model.
You can also refer to musicgen-large-load-from-s3/deploy-musicgen-large-from-s3.ipynb, which demonstrates the best practice of downloading the model from the Hugging Face Hub to Amazon S3 and reusing the model artifacts for future deployments. Instead of downloading the model from Hugging Face every time we deploy or scale, we download it to Amazon S3 once and reuse it for deployment and during scaling activities. Doing so can improve the download speed, especially for large models, and avoids downloads over the internet from a website outside of AWS. This best practice also maintains consistency, which means the same model from Amazon S3 can be deployed across various staging and production environments.
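The following is a minimal sketch of that pattern, assuming the huggingface_hub package and the SageMaker Python SDK are installed; the destination bucket and prefix are placeholders.

from huggingface_hub import snapshot_download
from sagemaker.s3 import S3Uploader

# Download the model artifacts from the Hugging Face Hub once...
local_dir = snapshot_download(repo_id="facebook/musicgen-large")

# ...and keep a copy in Amazon S3 so future deployments and scaling events
# pull from S3 instead of the public internet (placeholder bucket and prefix).
model_artifacts_s3 = S3Uploader.upload(
    local_path=local_dir,
    desired_s3_uri="s3://my-sagemaker-bucket/musicgen_large/hf-model",
)
print(model_artifacts_s3)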
The predict_fn function uses the data provided during the inference request and the model loaded through model_fn:

texts, generation_params = _process_input(data)
processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
inputs = processor(
    text=texts,
    padding=True,
    return_tensors="pt",
)

Using the information available in the data dictionary, we process the input data to obtain the prompt and generation parameters used to generate the music. We discuss the generation parameters in more detail later in this post.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
audio_values = model.generate(**inputs.to(device), **generation_params)

We load the model to the device and then send the inputs and generation parameters as inputs to the model. This process generates the music in the form of a three-dimensional Torch tensor of shape (batch_size, num_channels, sequence_length).

sampling_rate = model.config.audio_encoder.sampling_rate
disk_wav_locations = _write_wavs_to_disk(sampling_rate, audio_values)
# Upload the .wav files to S3
result_dict["generated_outputs_s3"] = _upload_wav_files(disk_wav_locations, bucket_name)
# Clean up the files on disk
for wav_on_disk in disk_wav_locations:
    _delete_file_on_disk(wav_on_disk)

We then use the tensor to write .wav audio files, upload them to Amazon S3, and clean up the copies saved on disk. Finally, we include the S3 URIs of the .wav files in the response.
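_write_wavs_to_disk and _upload_wav_files are helper functions in the repository’s inference.py; a simplified sketch of what they might look like follows. It assumes scipy and boto3 are available, and the temporary directory and key prefix are illustrative.

import uuid
import boto3
import scipy.io.wavfile

def _write_wavs_to_disk(sampling_rate, audio_values):
    # Write each generated sample in the batch to a local .wav file.
    paths = []
    for sample in audio_values.cpu().numpy():  # sample shape: (num_channels, sequence_length)
        path = f"/tmp/{uuid.uuid4()}.wav"
        scipy.io.wavfile.write(path, rate=sampling_rate, data=sample[0])  # first (mono) channel
        paths.append(path)
    return paths

def _upload_wav_files(disk_wav_locations, bucket_name):
    # Upload the local .wav files to Amazon S3 and return their URIs.
    s3 = boto3.client("s3")
    uris = []
    for path in disk_wav_locations:
        key = f"musicgen_large/generated/{path.split('/')[-1]}"
        s3.upload_file(path, bucket_name, key)
        uris.append(f"s3://{bucket_name}/{key}")
    return uris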
We now create the archive of the inference scripts and upload those to the S3 bucket:

musicgen_prefix = "musicgen_large"
s3_model_key = f"{musicgen_prefix}/model/model.tar.gz"
s3_model_location = f"s3://{sagemaker_session_bucket}/{s3_model_key}"
s3 = boto3.resource("s3")
s3.Bucket(sagemaker_session_bucket).upload_file("model.tar.gz", s3_model_key)

The S3 URI of this uploaded object will later be used to create the Hugging Face model.
Create the Hugging Face model
Now we initialize HuggingFaceModel with the necessary arguments. During deployment, the model serving artifacts, stored in s3_model_location, will be deployed. Before the model serving, the MusicGen model will be downloaded from Hugging Face as per the logic in model_fn.

huggingface_model = HuggingFaceModel(
    name=async_endpoint_name,
    model_data=s3_model_location,  # path to your model serving artifacts
    role=role,  # IAM role with permissions to create an endpoint
    env={
        "TS_MAX_REQUEST_SIZE": "100000000",
        "TS_MAX_RESPONSE_SIZE": "100000000",
        "TS_DEFAULT_RESPONSE_TIMEOUT": "3600",
    },
    transformers_version="4.37",  # Transformers version used
    pytorch_version="2.1",        # PyTorch version used
    py_version="py310",           # Python version used
)

The env argument accepts a dictionary of parameters such as TS_MAX_REQUEST_SIZE and TS_MAX_RESPONSE_SIZE, which define the byte size values for request and response payloads to the asynchronous inference endpoint. The TS_DEFAULT_RESPONSE_TIMEOUT key in the env dictionary represents the timeout in seconds after which the asynchronous inference endpoint stops responding.
You can run MusicGen with the Hugging Face Transformers library from version 4.31.0 onwards; here we set transformers_version to 4.37. MusicGen requires PyTorch version 2.1 or later, so we set pytorch_version to 2.1.
Define asynchronous inference configuration
Music generation using a text prompt as input can be both computationally intensive and time-consuming. Asynchronous inference in SageMaker is designed to address these demands. When working with music generation models, it’s important to note that the process can often take more than 60 seconds to complete.
SageMaker asynchronous inference queues incoming requests and processes them asynchronously, making it ideal for requests with large payload sizes (up to 1 GB), long processing times (up to 1 hour), and near real-time latency requirements. By queuing incoming requests and processing them asynchronously, this capability efficiently handles the extended processing times inherent in music generation tasks. Moreover, asynchronous inference enables seamless auto scaling, making sure that resources are allocated only when needed, leading to cost savings.
Before we proceed with the asynchronous inference configuration, we create SNS topics for success and failure that can be used to perform downstream tasks:

import time
import boto3
from utils.sns_client import SnsClient

sns_client = SnsClient(boto3.client("sns"))
timestamp = time.time_ns()
topic_names = [
    f"musicgen-large-topic-SuccessTopic-{timestamp}",
    f"musicgen-large-topic-ErrorTopic-{timestamp}",
]

topic_arns = []
for topic_name in topic_names:
    print(f"Creating topic {topic_name}.")
    response = sns_client.create_topic(topic_name)
    topic_arns.append(response.get("TopicArn"))

We now create an asynchronous inference endpoint configuration by specifying the AsyncInferenceConfig object:

# Create the asynchronous endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join(
        "s3://", sagemaker_session_bucket, "musicgen_large/async_inference/output"
    ),  # where the inference results will be stored
    # Add SNS notification topics if needed
    notification_config={
        "SuccessTopic": topic_arns[0],
        "ErrorTopic": topic_arns[1],
    },
)

The arguments to the AsyncInferenceConfig are detailed as follows:

output_path – The location where the output of the asynchronous inference endpoint will be stored. The files in this location will have an .out extension and will contain the details of the asynchronous inference performed by the MusicGen model.
notification_config – Optionally, you can associate success and error SNS topics. Dependent workflows can poll these topics to make informed decisions based on the inference outcomes.

Deploy the model on SageMaker
With the asynchronous inference configuration defined, we can deploy the Hugging Face model, setting initial_instance_count to 1:

# Deploy the endpoint
async_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    async_inference_config=async_config,
    endpoint_name=async_endpoint_name,
)

After successfully deploying, you can optionally configure automatic scaling for the asynchronous endpoint. With asynchronous inference, you can also scale your asynchronous endpoint’s instances down to zero.
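As one possible approach (not covered in detail in this post), the following sketch registers the endpoint variant with Application Auto Scaling and scales on the ApproximateBacklogSizePerInstance CloudWatch metric; the capacity limits, target value, and cooldowns are illustrative.

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = f"endpoint/{async_endpoint_name}/variant/AllTraffic"

# Allow the asynchronous endpoint to scale between 0 and 2 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=2,
)

# Scale based on the backlog of queued requests per instance.
autoscaling.put_scaling_policy(
    PolicyName="musicgen-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # illustrative backlog threshold
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": async_endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)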
We now dive into inferencing the asynchronous endpoint for music generation.
Inference
In this section, we show how to perform inference using an asynchronous inference endpoint with the MusicGen model. For the sake of brevity, only significant code snippets are included. The full source code for inferencing the MusicGen model is available in the GitHub repo. The following diagram explains the sequence of steps to invoke the asynchronous inference endpoint.

We detail the steps to invoke the SageMaker asynchronous inference endpoint for MusicGen by describing the desired mood in natural language (English). We then demonstrate how to download and play the .wav files generated from the user prompt. Finally, we cover the process of cleaning up the resources created as part of this deployment.
Prepare prompt and instructions
For controlled music generation using MusicGen models, it’s important to understand various generation parameters:

generation_params = {
    "guidance_scale": 3,
    "max_new_tokens": 1200,
    "do_sample": True,
    "temperature": 1,
}

From the preceding code, let’s understand the generation parameters:

guidance_scale – The guidance_scale is used in classifier-free guidance (CFG), setting the weighting between the conditional logits (predicted from the text prompts) and the unconditional logits (predicted from an unconditional or ‘null’ prompt). A higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer audio quality. CFG is enabled by setting guidance_scale > 1. For best results, use guidance_scale = 3. Our deployment defaults to 3.
max_new_tokens – The max_new_tokens parameter specifies the number of new tokens to generate. Generation is limited by the sinusoidal positional embeddings to 30-second inputs, meaning MusicGen can’t generate more than 30 seconds of audio (1,503 tokens). Our deployment defaults to 256.
do_sample – The model can generate an audio sample conditioned on a text prompt through use of the MusicgenProcessor to preprocess the inputs. The preprocessed inputs can then be passed to the .generate method to generate text-conditional audio samples. Our deployment defaults to True.
temperature – This is the softmax temperature parameter. A higher temperature increases the randomness of the output, making it more diverse. Our deployment defaults to 1.

Let’s look at how to build a prompt to infer the MusicGen model:

data = {
    "texts": [
        "Warm and vibrant weather on a sunny day, feeling the vibes of hip hop and synth",
    ],
    "bucket_name": sagemaker_session_bucket,
    "generation_params": generation_params,
}

The preceding code is the payload, which will be saved as a JSON file and uploaded to an S3 bucket. We then provide the URI of the input payload during the asynchronous inference endpoint invocation along with other arguments as follows.
The texts key accepts an array of texts, which may contain the mood you want to reflect in your generated music. You can include musical instruments in the text prompt to the MusicGen model to generate music featuring those instruments.
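A minimal sketch of serializing this payload and uploading it to Amazon S3 before invocation might look like the following; the key prefix is illustrative, and the resulting input_s3_location is what we pass to the invocation shown next.

import json
import boto3

s3 = boto3.client("s3")

# The asynchronous endpoint reads its input payload from Amazon S3.
input_key = "musicgen_large/async_inference/input/payload.json"
s3.put_object(
    Bucket=sagemaker_session_bucket,
    Key=input_key,
    Body=json.dumps(data).encode("utf-8"),
    ContentType="application/json",
)
input_s3_location = f"s3://{sagemaker_session_bucket}/{input_key}"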
The response from invoke_endpoint_async is a dictionary of various parameters:

response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_s3_location,
    ContentType="application/json",
    InvocationTimeoutSeconds=3600,
)

OutputLocation in the response metadata represents the Amazon S3 URI where the inference response payload is stored.
Asynchronous music generation
As soon as the response metadata is sent to the client, the asynchronous inference begins the music generation. The music generation happens on the instance chosen during the deployment of the MusicGen model on the SageMaker asynchronous inference endpoint, as detailed in the deployment section.
Continuous polling and obtaining music files
While the music generation is in progress, we continuously poll for the response metadata parameter OutputLocation:

from utils.inference_utils import get_output
output = get_output(sm_session, response.get("OutputLocation"))

The get_output function keeps polling for the presence of the object at OutputLocation and returns the inference output, which contains the S3 URIs of the generated .wav files.
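A simplified version of such a polling helper might look like the following; the real implementation lives in utils/inference_utils.py in the repository, and the wait interval here is illustrative.

import json
import time
import urllib.parse
import boto3
from botocore.exceptions import ClientError

def get_output(sm_session, output_location, poll_interval=15):
    # sm_session is accepted for parity with the repository helper but is unused in this sketch.
    parsed = urllib.parse.urlparse(output_location)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    s3 = boto3.client("s3")
    while True:
        try:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            return json.loads(body)  # the .out payload produced by the async endpoint
        except ClientError as error:
            if error.response["Error"]["Code"] in ("NoSuchKey", "404"):
                time.sleep(poll_interval)  # inference still in progress
            else:
                raise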
Audio output
Lastly, we download the files from Amazon S3 and play the output using the following logic:

from utils.inference_utils import play_output_audios
music_files = []
for s3_url in output.get("generated_outputs_s3"):
    if s3_url is not None:
        music_files.append(download_from_s3(s3_url))
play_output_audios(music_files, data.get("texts"))

You now have access to the .wav files and can try changing the generation parameters to experiment with various text prompts.

Audio-File-1

The following is another music sample based on the following generation parameters:

generation_params = {
    "guidance_scale": 5,
    "max_new_tokens": 1503,
    "do_sample": True,
    "temperature": 0.9,
}
data = {
    "texts": [
        "Catchy funky beats with drums and bass, synthesized pop for an upbeat pop game",
    ],
    "bucket_name": sagemaker_session_bucket,
    "generation_params": generation_params,
}

Audio-File-2

Clean up
To avoid incurring unnecessary charges, you can clean up using the following code:

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")
sm_client = boto3.client("sagemaker")

cleanup = False  # <-- Set this to True to clean up resources.
endpoint_name = "<Endpoint_Name>"  # replace with the name of the deployed MusicGen endpoint

endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
endpoint_config_name = endpoint["EndpointConfigName"]
endpoint_config = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
model_name = endpoint_config["ProductionVariants"][0]["ModelName"]
notification_config = endpoint_config["AsyncInferenceConfig"]["OutputConfig"].get("NotificationConfig", None)

print(f"""
About to delete the following SageMaker resources:
Endpoint: {endpoint_name}
Endpoint config: {endpoint_config_name}
Model: {model_name}
""")
if notification_config:
    for k, v in notification_config.items():
        print(f"About to delete the SNS topic for {k} with ARN: {v}")

if cleanup:
    # Delete the endpoint
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    # Delete the endpoint configuration
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    # Delete the model
    sm_client.delete_model(ModelName=model_name)
    print("Deleted the model, endpoint configuration, and endpoint")

The preceding cleanup routine deletes the SageMaker endpoint, endpoint configuration, and model associated with the MusicGen deployment so that you avoid incurring unnecessary charges. Make sure to set the cleanup variable to True, and replace <Endpoint_Name> with the actual name of the MusicGen endpoint deployed on SageMaker. Alternatively, you can use the console to delete the endpoint and its associated resources that were created while running the code in this post.
Conclusion
In this post, we learned how to use SageMaker asynchronous inference to deploy the AudioCraft MusicGen model. We started by exploring how the MusicGen models work and covered various use cases for deploying MusicGen models. We also explored how you can benefit from capabilities such as auto scaling and the integration of asynchronous endpoints with Amazon SNS to power downstream tasks. We then took a deep dive into the deployment and inference workflow of MusicGen models on SageMaker, using the AWS Deep Learning Containers for HuggingFace inference and the MusicGen model sourced from the Hugging Face Hub.
Get started with generating music using your creative prompts by signing up for AWS. The full source code is available on the official GitHub repository.
References

Inference Audiocraft Musicgen on Amazon SageMaker
Open sourcing AudioCraft: Generative AI for audio made simple and available to all
MusicGen – Large – 3.3B
MusicGen: Generation
AudioCraft
Hugging Face Model
Predictors
Transcription inference on Amazon SageMaker Inference
MusicGen

About the Authors
Pavan Kumar Rao Navule is a Solutions Architect at Amazon Web Services, where he works with ISVs in India to help them innovate on the AWS platform. He is specialized in architecting AI/ML and generative AI services at AWS. Pavan is a published author for the book “Getting Started with V Programming.” In his free time, Pavan enjoys listening to the great magical voices of Sia and Rihanna.
David John Chakram is a Principal Solutions Architect at AWS. He specializes in building data platforms and architecting seamless data ecosystems. With a profound passion for databases, data analytics, and machine learning, he excels at transforming complex data challenges into innovative solutions and driving businesses forward with data-driven insights.
Sudhanshu Hate is a principal AI/ML specialist with AWS and works with clients to advise them on their MLOps and generative AI journey. In his previous role before Amazon, he conceptualized, created, and led teams to build ground-up open source-based AI and gamification platforms, and successfully commercialized it with over 100 clients. Sudhanshu has to his credit a couple of patents, has written two books and several papers and blogs, and has presented his points of view in various technical forums. He has been a thought leader and speaker, and has been in the industry for nearly 25 years. He has worked with Fortune 1000 clients across the globe and most recently with digital native clients in India.
Rupesh Bajaj is a Solutions Architect at Amazon Web Services, where he collaborates with ISVs in India to help them leverage AWS for innovation. He specializes in providing guidance on cloud adoption through well-architected solutions and holds seven AWS certifications. With 5 years of AWS experience, Rupesh is also a Gen AI Ambassador. In his free time, he enjoys playing chess.

Bytedance Researchers Present Cross Language Agent – Simultaneous In …

One of the most difficult challenges in translation is simultaneous speech translation (SiST): translating spoken words into another language in real time, which paves the way for instantaneous communication across language barriers. There has been a lot of buzz about machine-assisted autonomous interpretation in natural language processing (NLP). Traditional simultaneous translation systems typically employ streaming Automatic Speech Recognition (ASR), punctuation, and Machine Translation (MT) models in a cascaded pipeline. Unfortunately, the ASR module is a common source of latency and error propagation in such cascaded systems.

Academic SiST models and commercial SiST engines have come a long way, yet translation quality still needs to improve. Studies using human evaluators have assessed the SiST systems available today. From a user-centered standpoint, these systems significantly impair the efficacy of communication, since they provide listeners with less than 42% of the correct information. A human translator, on the other hand, can convey at least 70% of the intended meaning and often more than 95%. As a result, the researchers use 80% to denote highly qualified human interpreters in this work. Given their enormous success with machine and speech translation, LLMs are a natural fit for the SiST task.

Integrating an LLM into SiST takes work. First, there is the read-write policy, which requires the LLM to offer only partial translations as the input speech streams in. Second, LLMs cannot easily learn rare terms or terminology from training data, so reaching human-equivalent performance is challenging. Finally, performance on the SiST task is still hindered by the shortage of training data. In response to these challenges, researchers from ByteDance have introduced CLASI, a unique Cross-Lingual Agent that achieves Simultaneous Interpretation through the repeated execution of various operations.

CLASI overcomes the first obstacle by emulating human interpreters’ approach of segmenting full sentences into smaller, more manageable pieces based on syntactic markers and contextual meaning. This is achieved through a data-driven policy learning method, enabling CLASI to learn and apply a rigorous read-write policy for SiST. To address the second obstacle, the CLASI agent was enhanced with two additional modules: a memory that records speech context and an external knowledge database with terminologies and matched translations. However, the external knowledge database can introduce noise and slow down the technique. To mitigate this, the researchers propose a new method called Multi-Modal Retrieval Augmented Generation (MM-RAG). This method uses a multi-modal retriever to search an external database for relevant information, thereby improving the efficiency of the CLASI agent. 

They add the retrieved information and memory context to the LLM agent’s prompt to improve the translation using in-context learning. They use a three-stage training methodology (pretraining, continued training, and fine-tuning) to tackle the data scarcity of the SiST task. The LLM and audio encoder are pretrained separately on massive internal datasets. The team then continues training the model on billions of tokens of low-quality synthetic speech translation data to achieve modal alignment between speech and text. To help the LLM make greater use of contextual information from the retriever and from preceding translations, they also incorporate several tasks to improve its in-context learning capability. Finally, they use a small quantity of human-annotated data to fine-tune the model, making it more resilient and producing better translations by mimicking the actions of human professionals. Because SiST frequently involves compaction, abstraction, and paraphrasing, traditional automatic evaluation criteria for simultaneous interpretation may not accurately reflect its performance.

Valid Information Proportion (VIP) is a new evaluation metric they offer that aligns with how human interpreters are assessed. The primary goal of SiST is real-time communication, and VIP indicates the proportion of information that can be transmitted precisely. In human evaluations on challenging, diverse real-world long-speech datasets, the researchers found that the proposed method significantly beats other available systems. For example, in the Chinese-to-English direction, CLASI achieves an 81.3% VIP score, exceeding the 80% benchmark used for highly qualified human interpreters. This promising result indicates a bright future for SiST.

The results on Chinese-to-English and English-to-Chinese tasks were much better than those of commercial systems, but the team highlights that language coverage should be expanded in the future. In the presented implementation of CLASI, each translation round triggers a full action sequence. Because the model can translate simple passages accurately without any external knowledge, some of these actions are optional in simple translation scenarios, and the model could be trained to skip the extra steps in the future.

Therefore, the Valid Information Proportion (VIP) metric is suggested for enhanced human evaluation. This underscores the need for more reliable automated quality and latency measurements in the future. The evidence also points to the potential of reinforcement learning from human feedback (RLHF) to enhance LLM performance. While CLASI outperforms prior state-of-the-art systems, there is a clear need for additional research into improving multi-modal reward models, as well as RL approaches for SiST. Promising areas of study include multi-modal integration, such as end-to-end video-to-video or speech-to-speech production.  

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Bytedance Researchers Present Cross Language Agent – Simultaneous Interpretation (CLASI): A High-Quality And Human-Like Simultaneous Speech Translation (SiST) System appeared first on MarkTechPost.

The Evolution of Artificial Intelligence (AI) Agents: Workflow, Planni …

Artificial intelligence (AI) is developing at a pace that is transforming many different industries. AI agents created to automate and simplify many parts of business processes are among the most intriguing recent advances. These agents, which can be divided into three categories (Planning Agents, Workflow Agents, and Matrix Agents), are the next wave of AI-powered automation technologies, and they hold great potential for companies looking to increase productivity and efficiency.

Planning Agents

Planning agents are designed to draw up a plan for completing particular activities. In contrast to standard automation technologies, these agents map out the entire process rather than merely carrying out pre-defined activities, enabling a more dynamic and flexible approach to task execution. After a plan is created, human operators usually review and approve it to make sure the agent’s suggested course of action is in line with the organization’s overall goals.

Code creation is one of the most common uses for planning agents. These agents are capable of problem analysis, solution design, and even generating the code required to put the solution into practice. Businesses can expedite the creation of software tools and other AI agents by automating this planning process, which will cut down on the time and effort needed for complicated jobs. The capacity of these agents to plan and carry out activities on their own will probably increase as they develop further, thereby decreasing the need for human interaction in many aspects of business operations.

Workflow Agents

By carrying out preset workflows in response to particular circumstances, workflow agents elevate automation. These agents are designed to carry out a set of functions, each of which is represented as a workflow node. The agent can modify its activities in response to the inputs and outputs at each node, guaranteeing a smooth and efficient workflow progression.

Due to their adaptability, workflow agents are useful for automating many business processes. These agents are capable of handling the complexity of contemporary enterprise processes, whether they are running code, transforming data, establishing connections to different systems, or formatting outputs. Workflow agents increase productivity and creativity in organizations by automating repetitive processes, freeing up human workers to concentrate on more strategic endeavors.

One of workflow agents’ main features is their smooth integration with current systems. This enables companies to use the sophisticated automation capabilities these agents provide while still utilizing their existing infrastructure. As more companies adopt workflow agents, a major shift toward more streamlined and effective business processes is anticipated.

In AI agentic workflows, automated sequences controlled by AI agents carry out tasks autonomously, reducing the requirement for human intervention. These workflows use AI technologies, such as predictive analytics, natural language processing, and machine learning, to perform complicated tasks effectively. Unlike traditional workflows, AI agentic workflows are dynamic, enabling agents to make decisions and carry out activities based on data that is updated in real time. Across many industries, this autonomy improves accuracy, scalability, and productivity.

Matrix Agents

Matrix agents represent a distinct approach to AI automation; they are designed to manage tasks that require recurring processing or analysis across many sets of inputs. Consider a venture capitalist (VC) who needs to assess the performance of the businesses in their portfolio regularly. A matrix agent can carry out this analysis automatically by applying a series of operations to a collection of inputs.

Matrix agents are useful for more than financial analysis. They can also produce ad material, optimize pages for search engines, and carry out a number of other repetitive activities on diverse data sets. By automating these repetitive, monotonous chores, matrix agents free knowledge workers to concentrate on higher-level strategic operations.

In conclusion, the swift advancement of AI agents is creating the conditions for a time when intelligent systems will perform the majority of the labor-intensive tasks in businesses. Businesses can achieve new heights of innovation and growth by adopting these innovations in addition to streamlining operations.

AI agents will play an increasingly important role in businesses as they develop. These agents will probably take on a larger portion of the workload over time, which will enable firms to run more successfully and efficiently. The introduction of planning, workflow, and matrix agents represents a major turning point in the development of AI, and in the years to come, their influence on business automation will only increase.

The post The Evolution of Artificial Intelligence (AI) Agents: Workflow, Planning, and Matrix Agents Leading Enterprise Automation appeared first on MarkTechPost.

BRAG Released: High-Performance SLMs (Small Language Models) Specifica …

BRAG is a series of high-performance Retrieval Augmented Generation (RAG) models developed by Maximalists AI Researcher. The BRAG models are a family of small language models (SLMs) designed to offer cost-effective, high-performance alternatives in AI-driven language processing. These models have been trained at an impressively low cost of under $25 each, positioning them as efficient and economical solutions in artificial intelligence.

The BRAG models were created in response to the need for efficient and high-performing language models that do not require the extensive computational resources typically associated with large-scale models like those from Nvidia and OpenAI. The primary motivation behind BRAG was to develop a series of models that could match or exceed the performance of leading models such as Cohere’s Command R+, Qwen2, Llama3.1, and Llama3 Instruct while keeping the training costs minimal.

The BRAG series includes four models: 

BRAG-Qwen2-7b-v0.1

BRAG-Llama-3.1-8b-v0.1

BRAG-Llama-3-8b-v0.1

BRAG-Qwen2-1.5b-v0.1

These models were chosen based on their performance in open benchmarks and their ability to balance efficiency and capability. The models underwent a two-stage fine-tuning process inspired by Nvidia’s ChatQA approach, which involves initial training on general instruction datasets followed by RAG-specific datasets.


The BRAG models are particularly noteworthy for their performance relative to their size. The 1.5B models offer an excellent balance of performance and efficiency. In comparison, the 7B and 8B models can handle more complex tasks, such as long context understanding, tabular data interpretation, and mathematical reasoning. This strategic selection of models and training methodology allowed Maximalists to optimize performance while managing costs effectively.

The BRAG model training involved LoRA (Low-Rank Adaptation) and QLoRA (quantized LoRA) techniques. LoRA enables faster training with reduced computational demands by learning small low-rank adaptation matrices instead of updating the full model weights. QLoRA additionally compresses the frozen base weights to 4-bit precision, significantly reducing the memory footprint and facilitating training on consumer-grade GPUs.
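For illustration, the following is a sketch of a 4-bit QLoRA setup using the Hugging Face peft and bitsandbytes integrations; the base model, rank, and target modules are illustrative and are not necessarily those used for BRAG.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit (NF4) quantized weights to shrink the memory footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach low-rank adaptation matrices; only these small adapters are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()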


The models were evaluated using the ChatRAG-Bench, a benchmark designed to assess conversational QA and RAG capabilities across various document types and question formats. The evaluation metrics included F1-Score and Exact Match Accuracy, which provided insights into the models’ ability to generate precise and contextually relevant responses.


During the training process, several challenges were encountered, including handling long documents, interpreting tabular data, and addressing domain-specific queries. These issues were mitigated through careful dataset selection and experimentation with various data combinations. For instance, including datasets like DROP, Quoref, and SQuAD helped improve the models’ capabilities in handling complex and diverse data types. The F1 score metric, while widely accepted, was noted to have limitations in capturing semantic nuances and context. This highlighted the need for more holistic and context-aware evaluation metrics to better gauge model performance.

In conclusion, the Maximalists plan to enhance BRAG models by improving RAG performance and tabular data handling and introducing citation generation for better interpretability. They also aim to refine query rewriting techniques to improve search accuracy and relevance. The development of BRAG was supported by credits from Modal Labs, which facilitated cost-effective experimentation. By leveraging innovative training techniques and strategic model selection, BRAG has demonstrated that top-tier performance can be achieved with minimal resource expenditure, paving the way for more accessible and efficient AI solutions.

Check out the Models and Details. All credit for this research goes to the researchers of this project.

The post BRAG Released: High-Performance SLMs (Small Language Models) Specifically Trained for RAG Tasks Under $25 Each appeared first on MarkTechPost.

Build an end-to-end RAG solution using Knowledge Bases for Amazon Bedr …

Retrieval Augmented Generation (RAG) is a state-of-the-art approach to building question answering systems that combines the strengths of retrieval and foundation models (FMs). RAG models first retrieve relevant information from a large corpus of text and then use an FM to synthesize an answer based on the retrieved information.
An end-to-end RAG solution involves several components, including a knowledge base, a retrieval system, and a generation system. Building and deploying these components can be complex and error-prone, especially when dealing with large-scale data and models.
This post demonstrates how to seamlessly automate the deployment of an end-to-end RAG solution using Knowledge Bases for Amazon Bedrock and AWS CloudFormation, enabling organizations to quickly and effortlessly set up a powerful RAG system.
Solution overview
The solution provides an automated end-to-end deployment of a RAG workflow using Knowledge Bases for Amazon Bedrock. We use AWS CloudFormation to set up the necessary resources, including:

An AWS Identity and Access Management (IAM) role
An Amazon OpenSearch Serverless collection and index
A knowledge base with its associated data source

The RAG workflow enables you to use your document data stored in an Amazon Simple Storage Service (Amazon S3) bucket and integrate it with the powerful natural language processing capabilities of FMs provided in Amazon Bedrock. The solution simplifies the setup process, allowing you to quickly deploy and start querying your data using the selected FM.
Prerequisites
To implement the solution provided in this post, you should have the following:

An active AWS account and familiarity with FMs, Amazon Bedrock, and OpenSearch Serverless.
An S3 bucket where your documents are stored in a supported format (.txt, .md, .html, .doc/docx, .csv, .xls/.xlsx, .pdf).
The Amazon Titan Embeddings G1-Text model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If the Amazon Titan Embeddings G1-Text model is enabled, the access status will show as Access granted, as shown in the following screenshot.

Set up the solution
When the prerequisite steps are complete, you’re ready to set up the solution:

Clone the GitHub repository containing the solution files:

git clone https://github.com/aws-samples/amazon-bedrock-samples.git

Navigate to the solution directory:

cd knowledge-bases/features-examples/04-infrastructure/e2e-rag-deployment-using-bedrock-kb-cfn

Run the deploy.sh script, which creates the deployment bucket, prepares the CloudFormation templates, and uploads the prepared CloudFormation templates and required artifacts to the deployment bucket:

bash deploy.sh

While running deploy.sh, if you provide a bucket name as an argument to the script, it will create a deployment bucket with the specified name. Otherwise, it will use the default name format: e2e-rag-deployment-${ACCOUNT_ID}-${AWS_REGION}
As shown in the following screenshot, if you complete the preceding steps in an Amazon SageMaker notebook instance, you can run bash deploy.sh in the terminal, which creates the deployment bucket in your account (the account number has been redacted).

After the script is complete, note the S3 URL of the main-template-out.yml.

On the AWS CloudFormation console, create a new stack.
For Template source, select Amazon S3 URL and enter the URL you copied earlier.
Choose Next.

Provide a stack name and specify the RAG workflow details according to your use case and then choose Next.

Leave everything else as default and choose Next on the following pages.

Review the stack details and select the acknowledgement check boxes.

Choose Submit to start the deployment process.

You can monitor the stack deployment progress on the AWS CloudFormation console.

Test the solution
When the deployment is successful (which may take 7–10 minutes to complete), you can start testing the solution.

On the Amazon Bedrock console, navigate to the created knowledge base.
Choose Sync to initiate the data ingestion job.

After data synchronization is complete, select the desired FM to use for retrieval and generation (model access must be granted to this FM in Amazon Bedrock before use).

Start querying your data using natural language queries.

That’s it! You can now interact with your documents using the RAG workflow powered by Amazon Bedrock.
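You can also query the knowledge base programmatically. The following minimal sketch uses the Amazon Bedrock RetrieveAndGenerate API; the knowledge base ID (available in the CloudFormation stack outputs or on the Amazon Bedrock console) and the model ARN are placeholders.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Summarize the key points of the onboarding guide."},  # example query
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "XXXXXXXXXX",  # placeholder: your knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])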
Clean up
To avoid incurring future charges, delete the resources used in this solution:

On the Amazon S3 console, manually delete the contents inside the bucket you created for template deployment, then delete the bucket.
On the AWS CloudFormation console, choose Stacks in the navigation pane, select the main stack, and choose Delete.

Your created knowledge base will be deleted when you delete the stack.

Conclusion
In this post, we introduced an automated solution for deploying an end-to-end RAG workflow using Knowledge Bases for Amazon Bedrock and AWS CloudFormation. By using the power of AWS services and the preconfigured CloudFormation templates, you can quickly set up a powerful question answering system without the complexities of building and deploying individual components for RAG applications. This automated deployment approach not only saves time and effort, but also provides a consistent and reproducible setup, enabling you to focus on utilizing the RAG workflow to extract valuable insights from your data.
Try it out and see firsthand how it can streamline your RAG workflow deployment and enhance efficiency. Please share your feedback with us!

About the Authors
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. With a keen interest in exploring new frontiers in the field, she continuously strives to push boundaries. Outside of work, she loves traveling, working out, and exploring new things.
Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Faster LLMs with speculative decoding and AWS Inferentia2

In recent years, we have seen a big increase in the size of large language models (LLMs) used to solve natural language processing (NLP) tasks such as question answering and text summarization. Larger models with more parameters, which are on the order of hundreds of billions at the time of writing, tend to produce better results. For example, Llama-3-70B scores better than its smaller 8B-parameter version on metrics like reading comprehension (SQuAD 85.6 compared to 76.4). Thus, customers often experiment with larger and newer models to build ML-based products that bring value.
However, the larger the model, the more computationally demanding it is, and the higher the cost to deploy. For example, on AWS Trainium, Llama-3-70B has a median per-token latency of 21.4 ms, while Llama-3-8B takes 4.7 ms. Similarly, Llama-2-70B has a median per-token latency of 20.6 ms, while Llama-2-7B takes 3.7 ms. Customers have to consider performance to ensure they meet their users’ needs. In this blog post, we will explore how speculative sampling can help make large language model inference more compute efficient and cost-effective on AWS Inferentia and Trainium. This technique improves LLM inference throughput and output token latency (TPOT).
Introduction
Modern language models are based on the transformer architecture. The input prompts are processed first using a technique called context encoding, which runs fast because it is parallelizable. Next, we perform auto-regressive token generation where the output tokens are generated sequentially. Note that we cannot generate the next token until we know the previous one, as depicted in Figure 1. Therefore, to generate N output tokens we need N serial runs through the decoder. A run takes longer through a larger model, like Llama-3-70B, than through a smaller model, like Llama-3-8B.

Figure 1: Sequential token generation in LLMs

From a computational perspective, token generation in LLMs is a memory bandwidth-bound process. The larger the model, the more likely it is that we will wait on memory transfers. This results in underutilizing the compute units and not fully benefiting from the floating-point operations (FLOPS) available.
Speculative sampling
Speculative sampling is a technique that improves the computational efficiency for running inference with LLMs, while maintaining accuracy. It works by using a smaller, faster draft model to generate multiple tokens, which are then verified by a larger, slower target model. This verification step processes multiple tokens in a single pass and is therefore more compute efficient than generating them one at a time. Increasing the number of tokens processed in parallel increases the compute intensity because a larger number of tokens can be multiplied with the same weight tensor. This provides better performance compared with the non-speculative run, which is usually memory bandwidth-bound, and thus leads to better hardware resource utilization.
The speculative process involves an adjustable window k, where the target model provides one guaranteed correct token, and the draft model speculates on the next k-1 tokens. If the draft model’s tokens are accepted, the process speeds up. If not, the target model takes over, ensuring accuracy.

Figure 2: Case when all speculated tokens are accepted

Figure 2 illustrates a case where all speculated tokens are accepted, resulting in faster processing. The target model provides a guaranteed output token, and the draft model runs multiple times to produce a sequence of possible output tokens. These are verified by the target model and subsequently accepted by a probabilistic method.

Figure 3: Case when some speculated tokens are rejected

On the other hand, Figure 3 shows a case where some of the tokens are rejected. The time it takes to run this speculative sampling loop is the same as in Figure 2, but we obtain fewer output tokens. This means we will be repeating this process more times to complete the response, resulting in slower overall processing.
By adjusting the window size k and understanding when the draft and target models are likely to produce similar results, we can maximize the benefits of speculative sampling.
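To make the accept/reject mechanics concrete, the following is a minimal, framework-agnostic sketch of one speculative window. It is not the Neuron implementation (that is encapsulated in the SpeculativeGenerator used later in this post); draft_probs, target_probs, and draft_tokens are hypothetical inputs holding the per-position probability distributions and the draft model’s proposed tokens.

import numpy as np

def speculative_step(draft_probs, target_probs, draft_tokens, rng=np.random.default_rng()):
    # Accept each draft token with probability min(1, p_target / p_draft);
    # on the first rejection, resample from the residual target distribution and stop.
    accepted = []
    for t, token in enumerate(draft_tokens):
        p_draft, p_target = draft_probs[t][token], target_probs[t][token]
        if rng.random() < min(1.0, p_target / p_draft):
            accepted.append(token)                      # verified by the target model
        else:
            residual = np.clip(target_probs[t] - draft_probs[t], 0.0, None)
            accepted.append(rng.choice(len(residual), p=residual / residual.sum()))
            break
    return accepted                                     # between 1 and k tokens per window

A full implementation also samples one extra token from the target model’s distribution when every draft token is accepted, which is how the target model contributes its one guaranteed token per window.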
A Llama-2-70B/7B demonstration
We will show how speculative sampling works on Inferentia2-powered Amazon EC2 Inf2 instances and Trainium-powered EC2 Trn1 instances. We will be using a sample where we generate text faster with Llama-2-70B by using a Llama-2-7B model as a draft model. The example walk-through is based on Llama-2 models, but you can follow a similar process for Llama-3 models as well.
Loading models
You can load the Llama-2 models using the bfloat16 data type. The draft model is loaded in the standard way, as shown in the following example. The parameter n_positions is adjustable and represents the maximum sequence length you want to allow for generation. The only batch_size supported for speculative sampling at the time of writing is 1. We will explain tp_degree later in this section.

from transformers_neuronx.llama.model import LlamaForSampling  # from the transformers-neuronx package
draft_model = LlamaForSampling.from_pretrained('Llama-2-7b', n_positions=128, batch_size=1, tp_degree=32, amp='bf16')

The target model should be loaded in a similar way, but with speculative sampling functionality enabled. The value k was described previously.

target_model = LlamaForSampling.from_pretrained('Llama-2-70b', n_positions=128, batch_size=1, tp_degree=32, amp='bf16')
target_model.enable_speculative_decoder(k)

Combined, the two models need almost 200 GB of device memory for the weights, with additional memory on the order of GBs needed for key-value (KV) caches. If you prefer to use the models with float32 parameters, they will need around 360 GB of device memory. Note that the KV caches grow linearly with sequence length (input tokens + tokens yet to be generated). Use neuron-top to see the memory utilization live. To accommodate these memory requirements, we’ll need either the largest Inf2 instance (inf2.48xlarge) or the largest Trn1 instance (trn1.32xlarge).
Because of the size of the models, their weights need to be distributed amongst the NeuronCores using a technique called tensor parallelism. Notice that in the sample provided, tp_degree is set per model to specify how many NeuronCores that model should use. This, in turn, affects the memory bandwidth utilization, which is critical for token generation performance. A higher tp_degree can lead to better bandwidth utilization and improved throughput. The topology for Trn1 requires that tp_degree is set to 1, 2, 8, 16, or a multiple of 32. For Inf2, it needs to be 1 or a multiple of 2.
The order in which you load the models also matters. After a set of NeuronCores has been initialized and allocated for one model, you cannot use the same NeuronCores for another model unless it’s the exact same set. If you try to use only some of the NeuronCores that were previously initialized, you will get an nrt_load_collectives – global nec_comm is already init’d error.
Let’s go through two examples on trn1.32xlarge (32 NeuronCores) to understand this better. We will calculate how many NeuronCores we need per model. The formula used is the observed model size in memory (using neuron-top) divided by 16 GB, which is the device memory per NeuronCore.

If we run the models using bfloat16, we need more than 10 NeuronCores for Llama-2-70B and more than 2 NeuronCores for Llama-2-7B. Because of topology constraints, this means we need at least tp_degree=16 for Llama-2-70B. We can use the remaining 16 NeuronCores for Llama-2-7B. However, because both models fit in memory across 32 NeuronCores, we should set tp_degree=32 for both, to speed up inference for each model.
If we run the models using float32, we need more than 18 NeuronCores for Llama-2-70B and more than 3 NeuronCores for Llama-2-7B. Because of topology constraints, we have to set tp_degree=32 for Llama-2-70B. That means Llama-2-7B needs to reuse the same set of NeuronCores, so you need to set tp_degree=32 for Llama-2-7B too.
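The arithmetic behind these tp_degree choices can be captured in a short helper. This is only an illustrative sketch: the 16 GB per NeuronCore figure and the Trn1 topology constraint come from the text above, while the model sizes passed in are rough placeholders for what you would read from neuron-top.

import math

DEVICE_MEMORY_PER_CORE_GB = 16                 # device memory per NeuronCore (see above)
TRN1_VALID_TP_DEGREES = [1, 2, 8, 16, 32]      # allowed tp_degree values on trn1.32xlarge

def min_tp_degree(observed_model_size_gb, valid_degrees=TRN1_VALID_TP_DEGREES):
    # Smallest allowed tp_degree whose pooled device memory can hold the model weights
    cores_needed = math.ceil(observed_model_size_gb / DEVICE_MEMORY_PER_CORE_GB)
    return next(tp for tp in valid_degrees if tp >= cores_needed)

# Placeholder sizes consistent with the "more than 10" and "more than 2" core estimates above
print(min_tp_degree(170))   # Llama-2-70B in bfloat16 -> 16
print(min_tp_degree(35))    # Llama-2-7B in bfloat16  -> 8 (in practice, use the remaining 16 or all 32 cores)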

Walkthrough
The decoder we’ll use from transformers-neuronx is LlamaForSampling, which is suitable for loading and running Llama models. You can also use NeuronAutoModelForCausalLM which will attempt to auto-detect which decoder to use. To perform speculative sampling, we need to create a speculative generator first which takes two models and the value k described previously.

spec_gen = SpeculativeGenerator(draft_model, target_model, k)

We invoke the inferencing process by calling the following function:

spec_gen.sample(input_ids=input_token_ids, sequence_length=total_output_length)

During sampling, there are several hyperparameters (for example, temperature, top_p, and top_k) that affect whether the output is deterministic across multiple runs. At the time of writing, the speculative sampling implementation sets default values for these hyperparameters. With these values, expect randomness in results when you run a model multiple times, even with the same prompt. This is normal, intended behavior for LLMs because it improves the quality of their responses.
When you run the sample, you will use the default token acceptor, which is based on the DeepMind paper that introduced speculative sampling and uses a probabilistic method to accept tokens. However, you can also implement a custom token acceptor, which you can pass as the acceptor parameter when you initialize the SpeculativeGenerator. You would do this if you wanted more deterministic responses, for example. See the implementation of the DefaultTokenAcceptor class in transformers-neuronx to understand how to write your own.
Conclusion
As more developers look to incorporate LLMs into their applications, they face a choice: use larger, more costly, and slower models that deliver higher-quality results, or use smaller, less expensive, and faster models that might reduce the quality of answers. Now, with AWS artificial intelligence (AI) chips and speculative sampling, developers don’t have to make that choice. They can take advantage of the high-quality outputs of larger models and the speed and responsiveness of smaller models.
In this blog post, we have shown that we can accelerate the inference of large models, such as Llama-2-70B, by using a new feature called speculative sampling.
To try it yourself, check out the speculative sampling example, and tweak the input prompt and k parameter to see the results you get. For more advanced use cases, you can develop your own token acceptor implementation. To learn more about running your models on Inferentia and Trainium instances, see the AWS Neuron documentation. You can also visit repost.aws AWS Neuron channel to discuss your experimentations with the AWS Neuron community and share ideas.

About the Authors
Syl Taylor is a Specialist Solutions Architect for Efficient Compute. She advises customers across EMEA on Amazon EC2 cost optimization and improving application performance using AWS-designed chips. Syl previously worked in software development and AI/ML for AWS Professional Services, designing and implementing cloud native solutions. She’s based in the UK and loves spending time in nature.
Emir Ayar is a Senior Tech Lead Solutions Architect with the AWS Prototyping team. He specializes in assisting customers with building ML and generative AI solutions, and implementing architectural best practices. He supports customers in experimenting with solution architectures to achieve their business objectives, emphasizing agile innovation and prototyping. He lives in Luxembourg and enjoys playing synthesizers.

Catalog, query, and search audio programs with Amazon Transcribe and K …

Information retrieval systems have powered the information age through their ability to crawl and sift through massive amounts of data and quickly return accurate and relevant results. These systems, such as search engines and databases, typically work by indexing on keywords and fields contained in data files.
However, much of our data in the digital age also comes in non-text format, such as audio and video files. Finding relevant content usually requires searching through text-based metadata such as timestamps, which need to be manually added to these files. This can be hard to scale as the volume of unstructured audio and video files continues to grow.
Fortunately, the rise of artificial intelligence (AI) solutions that can transcribe audio and provide semantic search capabilities now offer more efficient solutions for querying content from audio files at scale. Amazon Transcribe is an AWS AI service that makes it straightforward to convert speech to text. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
In this post, we show how Amazon Transcribe and Amazon Bedrock can streamline the process to catalog, query, and search through audio programs, using an example from the AWS re:Think podcast series.
Solution overview
The following diagram illustrates how you can use AWS services to deploy a solution for cataloging, querying, and searching through content stored in audio files.

In this solution, audio files stored in mp3 format are first uploaded to Amazon Simple Storage Service (Amazon S3) storage. Video files (such as mp4) that contain audio in supported languages can also be uploaded to Amazon S3 as part of this solution. Amazon Transcribe will then transcribe these files and store the entire transcript in JSON format as an object in Amazon S3.
To catalog these files, each JSON file in Amazon S3 should be tagged with the corresponding episode title. This allows us to later retrieve the episode title for each query result.
Next, we use Amazon Bedrock to create numerical representations of the content inside each file. These numerical representations are also called embeddings, and they’re stored as vectors inside a vector database that we can later query.
Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API. Included with Amazon Bedrock is Knowledge Bases for Amazon Bedrock. As a fully managed service, Knowledge Bases for Amazon Bedrock makes it straightforward to set up a Retrieval Augmented Generation (RAG) workflow.
With Knowledge Bases for Amazon Bedrock, we first set up a vector database on AWS. Knowledge Bases for Amazon Bedrock can then automatically split the data files stored in Amazon S3 into chunks and then create embeddings of each chunk using Amazon Titan on Amazon Bedrock. Amazon Titan is a family of high-performing FMs from Amazon. Included with Amazon Titan is Amazon Titan Text Embeddings, which we use to create the numerical representation of the text inside each chunk and store them in a vector database.
When a user queries the contents of the audio files through a generative AI application or AWS Lambda function, it makes an API call to Knowledge Bases for Amazon Bedrock. Knowledge Bases for Amazon Bedrock then orchestrates a call to the vector database to perform a semantic search, which returns the most relevant results. Next, Knowledge Bases for Amazon Bedrock augments the user’s original query with these results into a prompt, which is sent to the large language model (LLM). The LLM returns results that are more accurate and relevant to the user query.
Let’s walk through an example of how you can catalog, query, and search through a library of audio files using these AWS AI services. For this post, we use episodes of the re:Think podcast series, which has over 20 episodes. Each episode is an audio program recorded in mp3 format. As we continue to add new episodes, we will want to use AI services to make the task of querying and searching for specific content more scalable without the need to manually add metadata for each episode.
Prerequisites
In addition to having access to AWS services through the AWS Management Console, you need a few other resources to deploy this solution.
First, you need a library of audio files to catalog, query, and search. For this post, we use episodes of the AWS re:Think podcast series.
To make API calls to Amazon Bedrock from our generative AI application, we use Python version 3.11.4 and the AWS SDK for Python (Boto3).
Transcribe audio files
The first task is to transcribe each mp3 file using Amazon Transcribe. For instructions on transcribing with the AWS Management Console or AWS CLI, refer to the Amazon Transcribe Developer Guide. Amazon Transcribe can create a transcript for each episode and store it as an S3 object in JSON format.
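You can also start a batch transcription job programmatically with Boto3. The following sketch assumes a hypothetical bucket layout; the job name, bucket, and object keys are placeholders.

import boto3

transcribe = boto3.client("transcribe")

# Start a batch transcription job for one episode (names and S3 locations are placeholders)
transcribe.start_transcription_job(
    TranscriptionJobName="rethink-episode-20",
    Media={"MediaFileUri": "s3://my-podcast-bucket/audio/AI-Accelerators.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
    OutputBucketName="my-podcast-bucket",
    OutputKey="text/transcripts/AI-Accelerators.json",
)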
Catalog audio files using tagging
To catalog each episode, we tag the S3 object for each episode with the corresponding episode title. For instructions on tagging objects in S3, refer to the Amazon Simple Storage Service User Guide. For example, for the S3 object AI-Accelerators.json, we tag it with key = “title” and value = “Episode 20: AI Accelerators in the Cloud.”

The title is the only metadata we need to manually add for each audio file. There is no need to manually add timestamps for each chapter or section in order to later search for specific content.
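As a minimal sketch, the tag can be applied with a single Boto3 call (the bucket and key match the example transcript location used later in this post):

import boto3

s3 = boto3.client("s3")

# Tag the transcript object with its episode title
s3.put_object_tagging(
    Bucket="rethinkpodcast",
    Key="text/transcripts/AI-Accelerators.json",
    Tagging={"TagSet": [{"Key": "title", "Value": "Episode 20: AI Accelerators in the Cloud"}]},
)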
Set up a vector database using Knowledge Bases for Amazon Bedrock
Next, we set up our fully managed RAG workflow using Knowledge Bases for Amazon Bedrock. For instructions on creating a knowledge base, refer to the Amazon Bedrock User Guide. We begin by specifying a data source. In our case, we choose the S3 bucket location where our transcripts in JSON format are stored.

Next, we select an embedding model. The embedding model converts each chunk of our transcript into embeddings. Embeddings are vectors of numbers, and the meaning of each embedding depends on the model. In our example, we select Titan Text Embeddings v2 with a dimension size of 1024.

The embeddings are stored as vectors in a vector database. You can either specify an existing vector database you have already created or have Knowledge Bases for Amazon Bedrock create one for you. For our example, we have Knowledge Bases for Amazon Bedrock create a vector database using Amazon OpenSearch Serverless.

Before you can query the vector database, you must first sync it with the data source. During each sync operation, Knowledge Bases for Amazon Bedrock will split the data source into chunks and then use the selected embedding model to embed each chunk as a vector. Knowledge Bases for Amazon Bedrock will then store these vectors in the vector database.
The sync operation as well as other Amazon Bedrock operations described so far can be performed either using the console or API calls.
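For example, a sync can be triggered from code through the Amazon Bedrock Agents API; the knowledge base and data source IDs below are placeholders you would replace with your own values.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Start an ingestion (sync) job for the knowledge base; IDs are placeholders
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KB1234567890",
    dataSourceId="DS1234567890",
)
print(response["ingestionJob"]["status"])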
Query the audio files
Now we’re ready to query and search for specific content from our library of podcast episodes. In episode 20, titled “AI Accelerators in the Cloud,” our guest Matthew McClean, a senior manager from AWS’s Annapurna team, shared why AWS decided to buy Annapurna Labs in 2015. For our first query, we ask, “Why did AWS acquire Annapurna Labs?”
We entered this query into Knowledge Bases for Amazon Bedrock using Anthropic Claude and got the following response:
“AWS acquired Annapurna Labs in 2015 because Annapurna was providing AWS with nitro cards that offloaded virtualization, security, networking and storage from EC2 instances to free up CPU resources.”
This is an exact quote from Matthew McClean in the podcast episode. You wouldn’t get this quote if you had entered the same prompt into other publicly available generative AI chatbots because they don’t have the vector database with embeddings of the podcast transcript to provide more relevant context.
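The same query can be issued programmatically through the Bedrock Agents runtime API. The following is a sketch; the knowledge base ID and model ARN are placeholders you would replace with your own values.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Query the knowledge base and generate a grounded answer; the ID and ARN are placeholders
response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "Why did AWS acquire Annapurna Labs?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])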
Retrieve an episode title
Now let’s suppose that in addition to getting more relevant responses, we also want to retrieve the correct podcast episode title that was relevant to this query from our catalog of podcast episodes.
To retrieve the episode title, we first use the most relevant data chunk from the query. Whenever Knowledge Bases for Amazon Bedrock responds to a query, it also provides one or more chunks of data that it retrieved from the vector database that were most relevant to the query in order of relevance. We can take the first chunk that was returned. These chunks are returned as JSON documents. Nested inside the JSON is the S3 location of the transcript object. In our example, the S3 location is s3://rethinkpodcast/text/transcripts/AI-Accelerators.json.
The first words in the chunk text are: “Yeah, sure. So maybe I can start with the history of Annapurna…”
Because we have already tagged this transcript object in Amazon S3 with the episode title, we can retrieve the title by retrieving the value of the tag where key = “title”. In this case, the title is “Episode 20: AI Accelerators in the Cloud.”
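Programmatically, this lookup is a single Boto3 call using the S3 location returned in the chunk metadata:

import boto3

s3 = boto3.client("s3")

# Read back the title tag from the transcript object identified in the chunk metadata
tags = s3.get_object_tagging(Bucket="rethinkpodcast", Key="text/transcripts/AI-Accelerators.json")
title = next(tag["Value"] for tag in tags["TagSet"] if tag["Key"] == "title")
print(title)  # Episode 20: AI Accelerators in the Cloud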
Search the start time
What if we also want to search and find the start time inside the episode where the relevant content begins? We want to do so without having to manually read through the transcript or listen to the episode from the beginning, and without manually adding timestamps for every chapter.
We can find the start time much faster by having our generative AI application make a few more API calls. We start by treating the chunk text as a substring of the entire transcript. We then search for the start time of the first word in the chunk text.
In our example, the first words returned were “Yeah, sure. So maybe I can start with the history of Annapurna…” We now need to search the entire transcript for the start time of the word “Yeah.”
Amazon Transcribe outputs the start time of every word in the transcript. However, any word can appear more than once. The word “Yeah” occurs 28 times in the transcript, and each occurrence has its own start time. So how do we determine the correct start time for “Yeah” in our example?
There are multiple approaches an application developer can use to find the correct start time. For our example, we use the Python string find() method to find the position of the chunk text within the entire transcript.
For the chunk text that begins with “Yeah, sure. So maybe I can start with the history of Annapurna…” the find() method returned the position as 2047. If we treat the transcript as one long text string, the chunk “Yeah, sure. So maybe…” starts at character position 2047.
Finding the start time now becomes a matter of counting the character position of each word in the transcript and using it to look up the correct start time from the transcript file generated by Amazon Transcribe. This may be tedious for a person to do manually, but trivial for a computer.
In our example Python code, we loop through the array that contains the start time for each token while tracking the character position at which each token starts. Because we’re looping through the tokens, we can build a new array that stores the start time for each character position.
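The following is a condensed sketch of that loop. It assumes the Amazon Transcribe output JSON has been downloaded locally and uses the standard results.items structure, where each pronunciation item carries a start_time; punctuation handling and edge cases are simplified.

import json

with open("AI-Accelerators.json") as f:
    transcript = json.load(f)

full_text = transcript["results"]["transcripts"][0]["transcript"]
items = transcript["results"]["items"]

# Map the character position where each word starts to that word's start time
start_time_by_char = {}
pos = 0
for item in items:
    token = item["alternatives"][0]["content"]
    pos = full_text.find(token, pos)            # position where this token begins
    if item["type"] == "pronunciation":         # punctuation items have no start_time
        start_time_by_char[pos] = float(item["start_time"])
    pos += len(token)

chunk_pos = full_text.find("Yeah, sure. So maybe I can start with the history of Annapurna")
print(start_time_by_char.get(chunk_pos))        # 160.0 seconds in this example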
In this example query, the start time for the word “Yeah” at position 2047 is 160 seconds, or 2 minutes and 40 seconds into the podcast. You can check the recording starting at 2 minutes 40 seconds.
Clean up
This solution incurs charges based on the services you use:

Amazon Transcribe operates under a pay-as-you-go pricing model. For more details, see Amazon Transcribe Pricing.
Amazon Bedrock uses an on-demand quota, so you only pay for what you use. For more information, refer to Amazon Bedrock pricing.
With OpenSearch Serverless, you only pay for the resources consumed by your workload.
If you’re using Knowledge Bases for Amazon Bedrock with other vector databases besides OpenSearch Serverless, you may continue to incur charges even when not running any queries. It is recommended you delete your knowledge base and its associated vector store along with audio files stored in Amazon S3 to avoid unnecessary costs when you’re done testing this solution.

Conclusion
Cataloging, querying, and searching through large volumes of audio files can be difficult to scale. In this post, we showed how Amazon Transcribe and Knowledge Bases for Amazon Bedrock can help automate and make the process of retrieving relevant information from audio files more scalable.
You can begin transcribing your own library of audio files with Amazon Transcribe. To learn more on how Knowledge Bases for Amazon Bedrock can then orchestrate a RAG workflow for your transcripts with vector stores, refer to Knowledge Bases now delivers fully managed RAG experience in Amazon Bedrock.
With the help of these AI services, we can now expand the frontiers of our knowledge bases.

About the Author

Nolan Chen is a Partner Solutions Architect at AWS, where he helps startup companies build innovative solutions using the cloud. Prior to AWS, Nolan specialized in data security and helping customers deploy high-performing wide area networks. Nolan holds a bachelor’s degree in Mechanical Engineering from Princeton University.

Protein Annotation-Improved Representations (PAIR): A Flexible Fine-Tu …

Protein language models (PLMs) are trained on large protein databases to predict amino acid sequences and generate feature vectors representing proteins. These models have proven useful in various applications, such as predicting protein folding and mutation effects. A key reason for their success is their ability to capture conserved sequence motifs, which are often important for protein fitness. However, evolutionary and environmental factors can influence the relationship between sequence conservation and fitness, making it complex. PLMs rely on pseudo-likelihood objectives, but incorporating additional data sources, such as text annotations describing protein functions and structures, could improve their accuracy.

Researchers from the University of Toronto and the Vector Institute conducted a study that enhanced PLMs by fine-tuning them with text annotations from UniProt, focusing on nineteen types of expert-curated data. They introduced the Protein Annotation-Improved Representations (PAIR) framework, which uses a text decoder to guide the model’s training. PAIR significantly improved the models’ performance on function prediction tasks, even outperforming the BLAST search algorithm, especially for proteins with low sequence similarity to training data. This approach highlights the potential of incorporating diverse text-based annotations to advance protein representation learning.

The field of protein labeling traditionally relies on methods like BLAST, which detects protein sequence homology through sequence alignment, and Hidden Markov Models (HMMs) that incorporate additional data such as protein family and evolutionary information. These classical approaches perform well with sequences of high similarity but struggle with remote homology detection. This challenge has led to the development of PLMs, which apply deep learning techniques to learn protein representations from large-scale sequence data inspired by natural language processing models. Recent advancements also integrate text annotations, with models like ProtST leveraging diverse data sources to improve protein function prediction.

The model utilizes an attention-based sequence-to-sequence architecture, initialized with pretrained models and enhanced by adding cross-attention between the encoder and decoder. The encoder processes protein sequences into continuous representations using self-attention, while the decoder generates text annotations in an auto-regressive manner. Pretrained protein models from the ProtT5 and ESM families serve as the encoder, while SciBERT is the text decoder. The model is trained on multiple annotation types using a specialized sampling approach, with training conducted on an HPC cluster using multi-node training with bfloat16 precision.

The PAIR framework enhances protein function prediction by fine-tuning pre-trained transformer models, like ESM and ProtT5, on high-quality annotations from databases like Swiss-Prot. By integrating a cross-attention module, PAIR allows text tokens to attend to amino acid sequences, improving the relationship between protein sequences and their annotations. PAIR significantly outperforms traditional methods like BLAST, especially for proteins with low sequence similarity, and shows strong generalization to new tasks. Its ability to handle limited data scenarios makes it a valuable tool in bioinformatics and protein function prediction.

The PAIR framework enhances protein representations by utilizing diverse text annotations that capture essential functional properties. By combining these annotations, PAIR significantly improves the prediction of various functional properties, including those of previously uncharacterized proteins. PAIR consistently outperforms base protein language models and traditional methods like BLAST, especially for sequences with low similarity to training data. The results suggest incorporating additional data modalities, such as 3D structural information or genomic data, could enrich protein representations. PAIR’s flexible design also has potential applications for representing other biological entities, such as small molecules and nucleic acids.

Check out the Paper and Model. All credit for this research goes to the researchers of this project.

The Kolmogorov-Arnold Theorem Revisited: Why Averaging Functions Work …

Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to traditional Multi-Layer Perceptrons (MLPs). Inspired by the Kolmogorov-Arnold representation theorem, these networks utilize neurons that perform simple summation operations. However, the current implementation of KANs poses some challenges in practical applications. Currently, researchers are investigating the possibility of identifying alternative multivariate functions for KAN neurons that could offer enhanced practical utility across several benchmarks related to machine-learning tasks.

Research has highlighted the potential of KANs in various fields, like computer vision, time series analysis, and quantum architecture search. Some studies show that KANs can outperform MLPs in data fitting and PDE tasks while using fewer parameters. However, some research has raised concerns about the robustness of KANs to noise and their performance compared to MLPs. Variations and improvements to the standard KAN architecture are also explored, such as graph-based designs, convolutional KANs, and transformer-based KANs to solve the issues. Moreover, alternative activation functions like wavelets, radial basis functions, and sinusoidal functions are investigated to improve KAN efficiency. Despite these works, there is a need for further improvements to enhance KAN performance.

A Researcher from the Center for Applied Intelligent Systems Research at Halmstad University, Sweden, has proposed a novel approach to enhance the performance of Kolmogorov-Arnold Networks (KANs). This method aims to identify the optimal multivariate function for KAN neurons across various machine learning classification tasks. The traditional use of addition as the node-level function is often non-ideal, especially for high-dimensional datasets with multiple features. This can cause the inputs to exceed the effective range of subsequent activation functions, leading to training instability and reduced generalization performance. To solve this problem, the researcher suggests using the mean instead of the sum as the node function. 
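The difference between the two node functions can be illustrated with a short, schematic sketch. This is not the paper’s implementation: edge_outputs is a stand-in for the per-edge spline activation outputs that feed one KAN neuron.

import torch

def kan_node(edge_outputs: torch.Tensor, use_mean: bool = True) -> torch.Tensor:
    # edge_outputs: (batch, n_incoming_edges), each value already passed through its
    # learnable spline activation. Standard KANs sum these values; the proposed variant
    # averages them so the aggregate stays near the spline's effective input range of
    # [-1, 1] even when many features feed the neuron.
    return edge_outputs.mean(dim=-1) if use_mean else edge_outputs.sum(dim=-1)

x = torch.rand(4, 50) * 2 - 1                       # 50 incoming edges with values in [-1, 1]
print(kan_node(x, use_mean=False).abs().max())      # summed: can far exceed 1
print(kan_node(x, use_mean=True).abs().max())       # averaged: stays within [-1, 1]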

To evaluate the proposed KAN modifications, 10 popular datasets from the UCI Machine Learning Database Repository are utilized, covering multiple domains and varying sizes. These datasets are divided into training (60%), validation (20%), and testing (20%) partitions. A standardized preprocessing method is applied across all datasets, which includes categorical feature encoding, missing value imputation, and instance randomization. Models are trained for 2000 iterations using the Adam optimizer with a learning rate of 0.01 and a batch size of 32. Model accuracy on the testing set serves as the primary evaluation metric. The parameter count is managed by setting the grid to 3 and using default hyperparameters for the KAN models.

The results support the hypothesis that using the mean function in KAN neurons is more effective than the traditional sum function. This enhancement is due to the mean’s ability to keep input values within the optimal range of the spline activation function, which is [-1.0, +1.0]. Standard KANs struggled to keep values within this range in intermediate layers as the number of features increased. However, adopting the mean function in neurons leads to enhanced performance, keeping values within the desired range across datasets with 20 or more features. For datasets with fewer features, values stayed within the range more than 99.0% of the time, except for the ‘abalone’ dataset, which had a slightly lower adherence rate of 96.51%.

In this paper, a researcher from the Center for Applied Intelligent Systems Research at Halmstad University, Sweden, has proposed a method to enhance the performance of KANs. An important modification to KANs is introduced by replacing the traditional summation in KAN neurons with an averaging function. Experimental results show that this change leads to more stable training processes and keeps inputs within the effective range of spline activations. This adjustment to the KAN architecture solves previous challenges related to input range and training stability. This work offers a promising direction for future KAN implementations, potentially enhancing their performance and applicability in various machine learning tasks.

Check out the Paper. All credit for this research goes to the researchers of this project.

Magpie-Ultra Dataset Released: Harnessing Llama 3.1 405B for Diverse A …

Magpie-ultra, a new dataset by the Argilla team for supervised fine-tuning, has been released, featuring 50,000 instruction-response pairs. This synthetically generated dataset utilizes the advanced Llama 3.1 405B-Instruct model and other Llama models like Llama-Guard-3-8B and Meta-Llama-3.1-8B-Instruct. The dataset covers various tasks, including coding, mathematics, data analysis, creative writing, advice-seeking, and brainstorming, offering challenging instructions and responses to enhance AI model training.

The dataset was created with distilabel, following the Magpie recipe outlined in the paper “Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing.” This iteration differs from the original Magpie release by employing the new Llama 3.1 family of models and generating a more focused set of 50,000 instruction-response pairs, compared to the previous 1 million. The pipeline utilizes various models for instruction generation, response creation, quality assessment, and safety classification.

The generation process involved a single 8xH100 machine, with the instruction-response pair creation taking approximately 60 hours. Additional steps, such as generating responses with the base model, computing embeddings, assessing quality and difficulty, and classifying instructions, required about 51 hours combined. This efficient process resulted in a comprehensive dataset with multiple data points for each entry.

The dataset’s structure includes various columns providing rich information about each instruction-response pair. Key columns include the instruction itself, responses from both instruct and base models, intent, required knowledge, difficulty level, quality assessment, and category classification. Also, the dataset incorporates safety checks using Llama-Guard-3-8B and provides embedding information for each instruction.

One of the dataset’s strengths lies in its potential applications. It can be used for Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), depending on the score difference between instruct and base model responses. This flexibility allows researchers and developers to tailor the dataset to their specific needs in AI model training and optimization.

While this release marks a significant step forward in AI training data, it’s important to note its limitations. This version is unfiltered, with a filtered version planned for future release. Also, the dataset may need to be more balanced, an issue that will be addressed in upcoming iterations. Despite these limitations, Magpie-ultra represents a valuable resource for advancing AI capabilities across various domains.

Check out the Pipeline and Dataset. All credit for this research goes to the researchers of this project.

This AI Paper by Meta FAIR Introduces MoMa: A Modality-Aware Mixture-o …

Multimodal artificial intelligence focuses on developing models capable of processing and integrating diverse data types, such as text and images. These models are essential for answering visual questions and generating descriptive text for images, highlighting AI’s ability to understand and interact with a multifaceted world. Blending information from different modalities allows AI to perform complex tasks more effectively, demonstrating significant promise in research and practical applications.

One of the primary challenges in multimodal AI is optimizing model efficiency. Traditional methods fusing modality-specific encoders or decoders often limit the model’s ability to integrate information across different data types effectively. This limitation results in increased computational demands and reduced performance efficiency. Researchers have been striving to develop new architectures that seamlessly integrate text and image data from the outset, aiming to enhance the model’s performance and efficiency in handling multimodal inputs.

Existing methods for handling mixed-modal data include architectures that preprocess and encode text and image data separately before integrating them. These approaches, while functional, can be computationally intensive and may only partially exploit the potential of early data fusion. The separation of modalities often leads to inefficiencies and an inability to adequately capture the complex relationships between different data types. Therefore, innovative solutions are required to overcome these challenges and achieve better performance.

To address these challenges, researchers at Meta introduced MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed to pre-train mixed-modal, early-fusion language models. MoMa processes text and images in arbitrary sequences by dividing expert modules into modality-specific groups. Each group exclusively handles designated tokens, employing learned routing within each group to maintain semantically informed adaptivity. This architecture significantly improves pre-training efficiency, with empirical results showing substantial gains. The research, conducted by a team at Meta, showcases the potential of MoMa to advance mixed-modal language models.

The technology behind MoMa involves a combination of mixture-of-experts (MoE) and mixture-of-depths (MoD) techniques. In MoE, tokens are routed across a set of feed-forward blocks (experts) at each layer. These experts are divided into text-specific and image-specific groups, allowing for specialized processing pathways. This approach, termed modality-aware sparsity, enhances the model’s ability to capture features specific to each modality while maintaining cross-modality integration through shared self-attention mechanisms. Furthermore, MoD allows tokens to selectively skip computations at certain layers, further optimizing the processing efficiency.
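The following is a highly simplified, illustrative sketch of modality-aware routing. It is not Meta’s implementation: the expert counts, the gating network, and the token layout are placeholders meant only to show how text and image tokens are dispatched to separate expert groups.

import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    # Route text tokens only to text experts and image tokens only to image experts,
    # with learned top-1 gating inside each modality group (schematic sketch).
    def __init__(self, d_model=64, n_text_experts=4, n_image_experts=4):
        super().__init__()
        self.text_experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_text_experts)])
        self.image_experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_image_experts)])
        self.text_gate = nn.Linear(d_model, n_text_experts)
        self.image_gate = nn.Linear(d_model, n_image_experts)

    def forward(self, tokens, is_image):
        out = torch.zeros_like(tokens)
        for mask, experts, gate in [(~is_image, self.text_experts, self.text_gate),
                                    (is_image, self.image_experts, self.image_gate)]:
            if mask.any():
                x = tokens[mask]
                choice = gate(x).argmax(dim=-1)                 # learned top-1 routing per token
                out[mask] = torch.stack([experts[int(c)](t) for c, t in zip(choice, x)])
        return out

tokens = torch.randn(10, 64)                                    # a mixed-modal sequence
is_image = torch.tensor([False] * 6 + [True] * 4)               # modality of each token
print(ModalityAwareMoE()(tokens, is_image).shape)               # torch.Size([10, 64])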

The performance of MoMa was evaluated extensively, showing substantial improvements in efficiency and effectiveness. Under a 1-trillion-token training budget, the MoMa 1.4B model, which includes 4 text experts and 4 image experts, achieved a 3.7× overall reduction in floating-point operations (FLOPs) compared to a dense baseline. Specifically, it achieved a 2.6× reduction for text and a 5.2× reduction for image processing. When combined with MoD, the overall FLOPs savings increased to 4.2×, with text processing improving by 3.4× and image processing by 5.3×. These results highlight MoMa’s potential to significantly enhance the efficiency of mixed-modal, early-fusion language model pre-training.

MoMa’s innovative architecture represents a significant advancement in multimodal AI. By integrating modality-specific experts and advanced routing techniques, the researchers have developed a more resource-efficient AI model that maintains high performance across diverse tasks. This innovation addresses critical computational efficiency issues, paving the way for developing more capable and resource-effective multimodal AI systems. The team’s work demonstrates the potential for future research to build upon these foundations, exploring more sophisticated routing mechanisms and extending the approach to additional modalities and tasks.

In summary, the MoMa architecture, developed by Meta researchers, offers a promising solution to the computational challenges in multimodal AI. The approach leverages modality-aware mixture-of-experts and mixture-of-depths techniques to achieve significant efficiency gains while maintaining robust performance. This breakthrough paves the way for the next generation of multimodal AI models, which can process and integrate diverse data types more effectively and efficiently, enhancing AI’s capability to understand and interact with the complex, multimodal world we live in.

Check out the Paper. All credit for this research goes to the researchers of this project.

MLPs vs KANs: Evaluating Performance in Machine Learning, Computer Vis …

Multi-layer perceptrons (MLPs) have become essential components in modern deep learning models, offering versatility in approximating nonlinear functions across various tasks. However, these neural networks face challenges in interpretation and scalability. The difficulty in understanding learned representations limits their transparency, while expanding the network scale often proves complex. Also, MLPs rely on fixed activation functions, potentially constraining their adaptability. Researchers have identified these limitations as significant hurdles in advancing neural network capabilities. Consequently, there is a growing need for alternative architectures that can address these challenges while maintaining or improving the performance of traditional MLPs in tasks such as classification, regression, and feature extraction.

Researchers have made considerable advancements in Kolmogorov-Arnold Networks (KANs) to address the limitations of MLPs. Various approaches have been explored, including replacing B-spline functions with alternative mathematical representations such as Chebyshev polynomials, wavelet functions, and orthogonal polynomials. These modifications aim to enhance KANs’ properties and performance. Furthermore, KANs have been integrated with existing network architectures like convolutional networks, vision transformers, U-Net, Graph Neural Networks (GNNs), and Neural Radiance Fields (NeRF). These hybrid approaches seek to utilize the strengths of KANs in diverse applications, ranging from image classification and medical image processing to graph-related tasks and 3D reconstruction. However, despite these improvements, a comprehensive and fair comparison between KANs and MLPs is still needed to fully understand their relative capabilities and potential.

Researchers from the National University of Singapore conducted a fair and comprehensive comparison between KANs and MLPs. The researchers control parameters and FLOPs for both network types, evaluating their performance across diverse domains, including symbolic formula representation, machine learning, computer vision, natural language processing, and audio processing. This approach ensures a balanced assessment of the two architectures’ capabilities. The study also investigates the impact of activation functions on network performance, particularly the B-spline activation. The research extends to examining the networks’ behavior in continual learning scenarios, challenging previous findings on KAN’s superiority in this area. By providing a thorough and equitable comparison, the study seeks to offer valuable insights for future research on KAN and potential MLP alternatives.

The study aims to provide a comprehensive comparison between KANs and MLPs across diverse domains. The researchers designed experiments to evaluate performance under controlled conditions, ensuring either equal parameter counts or FLOPs for both network types. The assessment covers a wide range of tasks, including machine learning, computer vision, natural language processing, audio processing, and symbolic formula representation. This broad scope allows for a thorough examination of each architecture’s strengths and weaknesses in various applications. To maintain consistency, all experiments utilized the Adam optimizer with a batch size of 128 and learning rates of either 1e-3 or 1e-4. The use of a single RTX3090 GPU for all experiments further ensures the comparability of results across different tasks.

In machine learning tasks across eight datasets, MLPs generally outperformed KANs. The study used varied configurations for both architectures, including different hidden layer widths, activation functions, and normalization techniques. KANs were tested with various B-spline parameters and expanded input ranges. After 20-epoch training runs, MLPs showed superior performance on six datasets, while KANs matched or exceeded MLPs on two. This suggests MLPs maintain an overall advantage in machine learning applications, though KANs’ occasional superiority warrants further investigation through architecture ablation studies.

In computer vision experiments across eight datasets, MLPs consistently outperformed KANs. Both architectures were tested with various configurations, including different hidden layer widths and activation functions. KANs used varying B-spline parameters. After 20-epoch training runs, MLPs showed superior performance on all datasets, whether compared by equal parameter counts or FLOPs. The conductive bias from KAN’s spline functions proved ineffective for visual tasks. This suggests MLPs maintain a significant advantage in computer vision applications, indicating that KAN’s architectural differences may not be well-suited for processing visual data.

In audio and text classification tasks across four datasets, MLPs generally outperformed KANs. Various configurations were tested for both architectures. MLPs consistently excelled in audio tasks and on the AG News dataset. Results were mixed for the CoLA dataset, with KANs showing an advantage when controlling for parameters, but not when controlling for FLOPs due to their higher computational requirements. Overall, MLPs emerged as the preferred choice for audio and text tasks, demonstrating more consistent performance across datasets and evaluation metrics. This suggests MLPs remain more effective for processing audio and textual data compared to KANs.

In symbolic formula representation tasks across eight datasets, KANs generally outperformed MLPs. With equal parameter counts, KANs excelled in 7 out of 8 datasets. When controlling for FLOPs, KANs’ performance was comparable to MLPs due to higher computational complexity, outperforming on two datasets and underperforming on one. Overall, KANs demonstrated superior capability in representing symbolic formulas compared to traditional MLPs.

This comprehensive study compared KANs and MLPs across various tasks. KANs, viewed as a special type of MLP with learnable B-spline activation functions, only showed advantages in symbolic formula representation. MLPs outperformed KANs in machine learning, computer vision, natural language processing, and audio tasks. Interestingly, MLPs with B-spline activations matched or surpassed KAN performance across all tasks. In class-incremental learning, KANs exhibited more severe forgetting issues than MLPs. These findings provide valuable insights for future research on neural network architectures and their applications.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

Whisper-Medusa Released: aiOla’s New Model Delivers 50% Faster Speec …

Israeli AI startup aiOla has unveiled a groundbreaking innovation in speech recognition with the launch of Whisper-Medusa. This new model, which builds upon OpenAI’s Whisper, has achieved a remarkable 50% increase in processing speed, significantly advancing automatic speech recognition (ASR). aiOla’s Whisper-Medusa incorporates a novel “multi-head attention” architecture that allows for the simultaneous prediction of multiple tokens. This development promises to revolutionize how AI systems translate and understand speech.

The introduction of Whisper-Medusa represents a significant leap forward from the widely used Whisper model developed by OpenAI. While Whisper has set the standard in the industry with its ability to process complex speech, including various languages and accents, in near real-time, Whisper-Medusa takes this capability a step further. The key to this enhancement lies in its multi-head attention mechanism; this enables the model to predict ten tokens at each pass instead of the standard one. This architectural change results in a 50% increase in speech prediction speed and generation runtime without compromising accuracy.

aiOla emphasized the importance of releasing Whisper-Medusa as an open-source solution. By doing so, aiOla aims to foster innovation and collaboration within the AI community, encouraging developers and researchers to contribute to and build upon their work. This open-source approach will lead to further speed improvements and refinements, benefiting various applications across various sectors such as healthcare, fintech, and multimodal AI systems.

The unique capabilities of Whisper-Medusa are particularly significant in the context of compound AI systems, which aim to understand and respond to user queries in almost real time. Whisper-Medusa’s enhanced speed and efficiency make it a valuable asset when quick and accurate speech-to-text conversion is crucial. This is especially relevant in conversational AI applications, where real-time responses can greatly enhance user experience and productivity.

The development process of Whisper-Medusa involved modifying Whisper’s architecture to incorporate the multi-head attention mechanism. This approach allows the model to jointly attend to information from different representation subspaces at different positions, using multiple “attention heads” in parallel. This innovative technique not only speeds up the prediction process but also maintains the high level of accuracy that Whisper is known for. The aiOla team pointed out that improving the speed and latency of large language models (LLMs) is easier than for ASR systems due to the complexity of processing continuous audio signals and handling noise or accents. However, aiOla’s novel approach has successfully addressed these challenges, resulting in a model that nearly doubles the prediction speed.

Training Whisper-Medusa involved a machine-learning approach called weak supervision. aiOla froze the main components of Whisper and used audio transcriptions generated by the model as labels to train additional token prediction modules. The initial version of Whisper-Medusa employs a 10-head model, with plans to expand to a 20-head version capable of predicting 20 tokens at a time. This scalability further enhances the model’s speed and efficiency without compromising accuracy.
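Conceptually, this looks like attaching several lightweight prediction modules to the frozen decoder’s hidden state. The sketch below is illustrative only and is not aiOla’s code; the dimensions, vocabulary size, and use of simple linear heads are placeholders for the additional token prediction modules described above.

import torch
import torch.nn as nn

class MultiTokenPredictionHeads(nn.Module):
    # Predict the next k tokens from a single decoder pass by attaching k lightweight
    # heads to the (frozen) decoder hidden state; only these heads are trained.
    def __init__(self, d_model=512, vocab_size=50000, n_heads=10):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n_heads)])

    def forward(self, hidden_state):                      # hidden_state: (batch, d_model)
        # Each head speculates one future token position; all heads run in parallel
        return torch.stack([head(hidden_state) for head in self.heads], dim=1)

logits = MultiTokenPredictionHeads()(torch.randn(2, 512))
print(logits.shape)                                       # torch.Size([2, 10, 50000])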

Whisper-Medusa has been tested on real enterprise data use cases to ensure its performance in real-world scenarios; the company is still exploring early access opportunities with potential partners. The ultimate goal is to enable faster turnaround times in speech applications, paving the way for real-time responses. Imagine a virtual assistant like Alexa recognizing and responding to commands in seconds, significantly enhancing user experience and productivity.

In conclusion, aiOla’s Whisper-Medusa is poised to impact speech recognition substantially. By combining innovative architecture with an open-source approach, aiOla is driving the capabilities of ASR systems forward, making them faster and more efficient. The potential applications of Whisper-Medusa are vast, promising improvements in various sectors and paving the way for more advanced and responsive AI systems.

Check out the Model and GitHub. All credit for this research goes to the researchers of this project.

Black Forest Labs Open-Source FLUX.1: A 12 Billion Parameter Rectified …

Black Forest Labs has emerged as a new player in the generative AI landscape. With deep roots in the research community, the company aims to advance generative deep learning models for media such as images and videos. Its mission is clear: to push the boundaries of creativity, efficiency, and diversity in AI-generated content. Black Forest Labs views generative AI as a cornerstone of future technologies and is committed to making its models accessible to a broad audience, hoping to educate the public and foster trust in the safety of these advanced models. As its inaugural offering, Black Forest Labs has unveiled the FLUX.1 suite, a collection of cutting-edge models designed to redefine the possibilities of text-to-image synthesis.

Image Source: https://blackforestlabs.ai/announcing-black-forest-labs/

The FLUX.1 suite represents a significant leap forward in text-to-image synthesis. This innovative collection of models sets new benchmarks in several key areas:

• Image detail: Producing stunningly crisp and intricate visuals

• Prompt adherence: Accurately translating text descriptions into visual representations

• Style diversity: Offering a wide range of artistic and stylistic options

• Scene complexity: Handling intricate and multifaceted image compositions

To cater to various user needs, FLUX.1 is available in three distinct variants:

• FLUX.1 [pro]: The flagship model, offering top-tier performance for professional applications

• FLUX.1 [dev]: An open-weight model for non-commercial use, balancing quality and efficiency

• FLUX.1 [schnell]: A swift model designed for local development and personal projects

Image Source: https://blackforestlabs.ai/announcing-black-forest-labs/

Each variant is accessible through different platforms and licensing options, ensuring that users from various backgrounds can harness the power of FLUX.1 for their specific requirements.

Image Source: https://blackforestlabs.ai/announcing-black-forest-labs/

Building on the foundation of flow matching, FLUX.1 models employ a sophisticated hybrid architecture. This design incorporates multimodal and parallel diffusion transformer blocks, scaled to an impressive 12 billion parameters. The integration of rotary positional embeddings and parallel attention layers enhances both performance and hardware efficiency, setting FLUX.1 apart from previous state-of-the-art diffusion models in the field of generative AI.
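
As a hedged illustration of those two ingredients, the sketch below shows a generic “parallel” transformer block in PyTorch, with rotary positional embeddings applied to queries and keys and with the attention and MLP branches computed side by side from the same normalized input. It illustrates the general pattern only, not Black Forest Labs’ FLUX.1 implementation; all dimensions and names are assumptions.

import torch
import torch.nn as nn

def rope(x: torch.Tensor) -> torch.Tensor:
    """Rotary positional embedding (simplified): rotate channel pairs by a
    position-dependent angle so position information is encoded in q and k."""
    b, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))       # (half,)
    angles = torch.arange(t)[:, None] * freqs[None, :]         # (time, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class ParallelBlock(nn.Module):
    """Attention and MLP branches computed in parallel from the same normalized
    input and summed, rather than the usual sequential attention-then-MLP layout."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        q = k = rope(h)                          # rotary embeddings on queries and keys
        attn_out, _ = self.attn(q, k, h)
        return x + attn_out + self.mlp(h)        # parallel residual branches

tokens = torch.randn(1, 128, 256)                # e.g. image-latent tokens plus text tokens
print(ParallelBlock()(tokens).shape)             # torch.Size([1, 128, 256])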

FLUX.1 has established itself as a frontrunner in image synthesis technology, setting new benchmarks across various model classes. The FLUX.1 [pro] and [dev] variants have surpassed popular competitors like Midjourney v6.0, DALL·E 3 (HD), and SD3-Ultra in critical aspects such as visual quality, prompt adherence, size and aspect ratio flexibility, typography, and output diversity. Even the FLUX.1 [schnell] model, designed for rapid processing, outperforms not only its direct competitors but also robust non-distilled models. A key strength of the FLUX.1 suite is its ability to maintain the full spectrum of output diversity from pretraining, offering significantly enhanced creative possibilities compared to existing state-of-the-art models in the field.

Image Source: https://blackforestlabs.ai/announcing-black-forest-labs/

FLUX.1 boasts several key features that set it apart in the generative AI landscape:

• Premium output quality and precise prompt adherence, rivaling closed-source alternatives

• FLUX.1 [schnell] employs latent adversarial diffusion distillation, enabling high-quality image generation in just 1-4 steps

• FLUX.1 [schnell] is released under the Apache 2.0 license, allowing versatile use across personal, scientific, and commercial applications

These features combine to make FLUX.1 a powerful and accessible tool for a wide range of image synthesis needs.

To facilitate adoption and development, Black Forest Labs has provided a reference implementation and sampling code for FLUX.1 [schnell] in a dedicated GitHub repository. This resource serves as an excellent starting point for developers and creatives looking to utilize the capabilities of FLUX.1 [schnell] in their projects, encouraging innovation and experimentation with this advanced text-to-image model.

Building on the accessible nature of FLUX.1, Black Forest Labs has streamlined the local setup process, so those eager to experiment with the model on their own machines can get it running with a short, straightforward installation.
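
As one hedged illustration of what local experimentation can look like, the minimal Python sketch below assumes the Hugging Face diffusers integration of FLUX.1 [schnell]; the FluxPipeline class and the black-forest-labs/FLUX.1-schnell model ID come from that integration, and the official GitHub repository also ships its own reference sampling scripts.

import torch
from diffusers import FluxPipeline   # assumes a recent diffusers release with FLUX support

# Load the open FLUX.1 [schnell] weights (Apache 2.0) from the Hugging Face Hub.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()      # optional: trade speed for lower GPU memory use

prompt = "A tiny astronaut hatching from an egg on the moon"
image = pipe(
    prompt,
    guidance_scale=0.0,              # schnell is distilled; classifier-free guidance is off
    num_inference_steps=4,           # few-step generation enabled by the distillation
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-schnell.png")

The guidance scale of zero and the four sampling steps reflect schnell’s distilled, few-step design noted earlier.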

A setup along these lines lets developers and enthusiasts quickly integrate FLUX.1 into their local environments, facilitating hands-on exploration and development with this cutting-edge text-to-image model.

While FLUX.1 represents a significant advancement in text-to-image synthesis, it’s important to acknowledge its limitations and intended use. The model is not designed to provide factual information and may inadvertently amplify societal biases. Its output quality can vary depending on prompting style. Users must adhere to strict ethical guidelines, avoiding any illegal activities, exploitation of minors, dissemination of false information, harassment, non-consensual content creation, or automated decision-making that impacts individuals’ rights. The model should not be used for large-scale disinformation campaigns or to generate personally identifiable information that could harm others. These restrictions ensure responsible use of this powerful AI tool.

Black Forest Labs has introduced FLUX.1, a suite of cutting-edge text-to-image synthesis models. Available in three variants ([pro], [dev], and [schnell]), FLUX.1 sets new benchmarks in image detail, prompt adherence, style diversity, and scene complexity. The models use a hybrid architecture with 12 billion parameters, surpassing competitors like Midjourney v6.0 and DALL·E 3 in various aspects. FLUX.1 [schnell] is released under the Apache 2.0 license, while the other variants are offered under their own access and licensing terms. While powerful, users must adhere to ethical guidelines to ensure responsible use. Black Forest Labs aims to revolutionize generative AI and make it accessible to a broad audience.

Check out the Details, GitHub, FLUX.1 [pro], FLUX.1 [dev], and FLUX.1 [schnell]. All credit for this research goes to the researchers of this project.
