Meet Guardrails: An Open-Source Python Package for Specifying Structure and Type, Validating and Correcting the Outputs of Large Language Models (LLMs)

In the vast world of artificial intelligence, developers face a common challenge – ensuring the reliability and quality of outputs generated by large language models (LLMs). The outputs, like generated text or code, must be accurate, structured, and aligned with specified requirements. These outputs may contain biases, bugs, or other usability issues without proper validation.

While developers often rely on LLMs to generate various outputs, there is a need for a tool that can add a layer of assurance, validating and correcting the results. Existing solutions are limited, often requiring manual intervention or lacking a comprehensive approach to ensure both structure and type guarantees in the generated content. This gap in the existing tools prompted the development of Guardrails, an open-source Python package designed to address these challenges.

Guardrails introduces the concept of a “rail spec,” a human-readable file format (.rail) that allows users to define the expected structure and types of LLM outputs. This spec also includes quality criteria, such as checking for biases in generated text or bugs in code. The tool utilizes validators to enforce these criteria and takes corrective actions, such as reasking the LLM when validation fails.
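For illustration, a rail spec for a single constrained string field might look roughly like the following. The element names, format string, and on-fail attribute below follow the conventions shown in Guardrails’ documentation and may differ between versions:

```xml
<rail version="0.1">
<output>
    <string
        name="pet_name"
        description="A name for the pet"
        format="length: 1 10"
        on-fail-length="reask"
    />
</output>
<prompt>
Suggest a name for a new pet.
</prompt>
</rail>
```

When the generated `pet_name` fails the length check, the `on-fail-length="reask"` action tells Guardrails to reprompt the LLM with the failure details.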

One of Guardrails’ notable features is its compatibility with various LLMs, including popular ones like OpenAI’s GPT and Anthropic’s Claude, as well as any language model available on Hugging Face. This flexibility allows developers to integrate Guardrails seamlessly into their existing workflows.

Guardrails offers Pydantic-style validation, ensuring that outputs conform to the specified structure and predefined variable types. The tool goes beyond simple structuring, allowing developers to set up corrective actions when the output fails to meet the specified criteria. For example, if a generated pet name exceeds the defined length, Guardrails triggers a reask, prompting the LLM to generate a new, valid name.
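The validate-then-reask control flow itself is easy to sketch without any dependencies. The snippet below is a stdlib-only illustration of the pattern; the function names and the mock LLM are hypothetical stand-ins, not the Guardrails API:

```python
def validate_pet_name(name: str, max_len: int = 10):
    """Return a list of validation errors (an empty list means valid)."""
    errors = []
    if len(name) > max_len:
        errors.append(f"name longer than {max_len} characters")
    return errors

def generate_with_reask(llm, prompt: str, max_retries: int = 2) -> str:
    """Call the LLM, validate the output, and reask on failure."""
    output = llm(prompt)
    for _ in range(max_retries):
        errors = validate_pet_name(output)
        if not errors:
            return output
        # Reask: feed the failure reason back to the model.
        output = llm(f"{prompt}\nPrevious answer was invalid ({'; '.join(errors)}). Try again.")
    return output

# A mock LLM that first returns an invalid (too long) name, then a valid one.
_responses = iter(["Sir Fluffington the Third", "Fluffy"])
print(generate_with_reask(lambda prompt: next(_responses), "Suggest a pet name."))  # Fluffy
```

A real integration would replace the mock with an actual LLM call and derive the validators from the rail spec rather than hard-coding them.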

Guardrails also supports streaming, enabling users to receive validations in real time rather than waiting for the entire generation to complete. This improves efficiency and offers a more dynamic way to interact with the LLM during the generation process.

In conclusion, Guardrails addresses a crucial aspect of AI development by providing a reliable solution to validate and correct the outputs of LLMs. Its rail spec, Pydantic-style validation, and corrective actions make it a valuable tool for developers striving to enhance AI-generated content’s accuracy, relevance, and quality. With Guardrails, developers can navigate the challenges of ensuring reliable AI outputs with greater confidence and efficiency.
The post Meet Guardrails: An Open-Source Python Package for Specifying Structure and Type, Validating and Correcting the Outputs of Large Language Models (LLMs) appeared first on MarkTechPost.

Cornell Researchers Introduce Graph Mamba Networks (GMNs): A General Framework for a New Class of Graph Neural Networks Based on Selective State Space Models

Graph-based machine learning is undergoing a significant transformation, largely propelled by the introduction of Graph Neural Networks (GNNs). These networks have been pivotal in harnessing the complexity of graph-structured data, offering innovative solutions across various domains. Despite their initial success, traditional GNNs face critical challenges, particularly those relying on local message-passing mechanisms. They struggle to manage long-range dependencies within graphs and often encounter over-squashing, where information from distant nodes is compressed excessively as it passes through the network layers.

Graph Mamba Networks (GMNs) by researchers from Cornell University emerge as a groundbreaking solution to these challenges. By integrating the principles of State Space Models (SSMs), widely celebrated for their efficiency and effectiveness across different data modalities, GMNs offer a novel approach to graph learning. This innovative framework is designed to overcome the limitations of both traditional GNNs and their more recent advancements, such as Graph Transformers, which, despite their promise, grapple with scalability due to their quadratic computational requirements.

At the heart of GMNs lies a meticulously crafted architecture that embraces neighborhood tokenization, token ordering, and a bidirectional selective SSM encoder, among other features. This structure enhances the network’s ability to capture and model long-range dependencies effectively and addresses the computational and structural constraints that have hampered previous models. GMNs adopt a selective approach to SSM application on graph data, enabling more nuanced and efficient handling of the inherent complexities of graph-structured information.
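As a loose, stdlib-only sketch of that pipeline: scalar token features and a fixed decay stand in for the learned, input-dependent parameters of the actual selective SSM, but the shape of the computation (ordered neighborhood tokens scanned in both directions) is the same:

```python
def ssm_scan(xs, decay=0.5):
    """Toy linear state-space recurrence: h_t = decay * h_{t-1} + x_t."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out

def bidirectional_encode(tokens):
    """Encode an ordered token sequence with a forward and a backward scan,
    then combine the two directions position-wise -- a stand-in for the
    paper's bidirectional selective SSM encoder."""
    fwd = ssm_scan(tokens)
    bwd = ssm_scan(tokens[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

# Hypothetical pipeline: a node's neighborhood is tokenized into an ordered
# sequence of scalar features (e.g. one value per hop), then encoded.
neighborhood_tokens = [1.0, 0.0, 2.0]
print(bidirectional_encode(neighborhood_tokens))
```

The linear recurrence is what keeps the cost proportional to sequence length, in contrast to the quadratic attention of Graph Transformers.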

The introduction of GMNs into the landscape of graph-based machine learning is not without empirical validation. Rigorous testing across a spectrum of benchmarks reveals that GMNs excel in tasks requiring modeling long-range interactions within graphs. This exceptional performance is not just a testament to the architectural ingenuity of GMNs but also highlights the strategic leverage of SSMs’ strengths in a graph-learning context. GMNs distinguish themselves through their computational efficiency, setting a new standard in the field.

GMNs stand out as a beacon of progress. They signify a major leap in our capacity to learn from graph-structured data and open up a myriad of possibilities for exploration and application. From analyzing complex social networks to deciphering the intricate molecular structures that define life, GMNs offer a robust and efficient framework for understanding how data connects and interacts.

In conclusion, the advent of Graph Mamba Networks marks a pivotal moment in graph-based machine learning:

GMNs adeptly incorporate state space models to address the limitations of traditional GNNs and Graph Transformers, paving the way for more efficient graph learning.

The unique architecture of GMNs, featuring neighborhood tokenization and a bidirectional selective SSM encoder, enables the nuanced handling of graph-structured data.

Demonstrated through extensive benchmarks, GMNs excel in capturing long-range dependencies within graphs, showcasing superior performance and remarkable computational efficiency.

GMNs open new avenues for research and application across various domains by enhancing our ability to model and understand graph-structured data.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and Google News. Join our 37k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.

If you like our work, you will love our newsletter.

Don’t Forget to join our Telegram Channel
The post Cornell Researchers Introduce Graph Mamba Networks (GMNs): A General Framework for a New Class of Graph Neural Networks Based on Selective State Space Models appeared first on MarkTechPost.

LAION Presents BUD-E: An Open-Source Voice Assistant that Runs on a Gaming Laptop with Low Latency without Requiring an Internet Connection

In the fast-paced world of technology, where innovation often outpaces human interaction, LAION and its collaborators at the ELLIS Institute Tübingen, Collabora, and the Tübingen AI Center are taking a giant leap towards revolutionizing how we converse with artificial intelligence. Their brainchild, BUD-E (Buddy for Understanding and Digital Empathy), seeks to break down the barriers of stilted, mechanical responses that have long hindered our immersive experiences with AI voice assistants.

The journey began with a mission to create a baseline voice assistant that not only responded in real time but also embraced natural voices, empathy, and emotional intelligence. The team recognized the shortcomings of existing models and focused on reducing latency and enhancing overall conversational quality. The result? A carefully evaluated model that boasts response times as low as 300 to 500 ms, setting the stage for a more seamless and responsive interaction.

However, the developers acknowledge that a truly empathic and natural voice assistant is still a work in progress. Their open-source initiative invites contributions from a global community, emphasizing the need to tackle immediate problems while working towards a shared vision.

One key area of focus is the reduction of latency and system requirements. The team aims to achieve response times below 300 ms through sophisticated quantization techniques and fine-tuning streaming models, even with larger models. This dedication to real-time interaction lays the groundwork for an AI companion that mirrors the fluidity of human conversation.

The quest for naturalness extends to speech and responses. Leveraging a dataset of natural human dialogues, the developers are fine-tuning BUD-E to respond similarly to humans, incorporating interruptions, affirmations, and thinking pauses. The goal is to create an AI voice assistant that not only understands language but also mirrors the nuances of human expression.

BUD-E’s memory is another remarkable feature in development. With tools like Retrieval Augmented Generation (RAG) and Conversation Memory, the model aims to keep track of conversations over extended periods, unlocking a new level of context familiarity.
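The retrieval step behind such memory can be illustrated with a toy sketch, where bag-of-words cosine similarity stands in for a real embedding model (the memory contents here are invented examples):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memory, query, k=1):
    """Return the k past utterances most similar to the query."""
    q = Counter(query.lower().split())
    scored = sorted(memory,
                    key=lambda m: cosine(Counter(m.lower().split()), q),
                    reverse=True)
    return scored[:k]

memory = [
    "my dog is named Bruno",
    "I work as a baker in Lyon",
    "I prefer tea over coffee",
]
print(retrieve(memory, "what is my dog called?"))
```

In a RAG pipeline, the retrieved snippets would then be prepended to the model's context before generating a response.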

The developers are not stopping there. BUD-E is envisioned to be a multi-modal assistant, incorporating visual input through a lightweight vision encoder. The incorporation of webcam images to evaluate user emotions adds a layer of emotional intelligence, bringing the AI voice assistant closer to understanding and responding to human feelings.

Building a user-friendly interface is also a priority. The team plans to implement LLamaFile for easy cross-platform installation and deployment, introducing an animated avatar akin to Meta’s Audio2Photoreal. A chat-based interface capturing conversations in writing and providing ways to capture user feedback aims to make the interaction intuitive and enjoyable.

Furthermore, BUD-E is not limited by language or the number of speakers. The developers are extending streaming Speech-to-Text to more languages, including low-resource ones, and plan to accommodate multi-speaker environments seamlessly.

In conclusion, the development of BUD-E represents a collective effort to create AI voice assistants that engage in natural, intuitive, and empathetic conversations. The future of conversational AI looks promising as BUD-E stands as a beacon, lighting the way for the next era of human-technology interaction.

Check out the Code and Blog. All credit for this research goes to the researchers of this project.

The post LAION Presents BUD-E: An Open-Source Voice Assistant that Runs on a Gaming Laptop with Low Latency without Requiring an Internet Connection appeared first on MarkTechPost.

How to Use Restore by to Boost Your Ad Campaigns

Digital marketing’s getting trickier by the day, isn’t it? Between all the new privacy laws, the end of those handy third-party cookies, and the ever-annoying bots tricking us into wasting our ad budgets, it feels like we’re running an obstacle course. And let’s not even get started on trying to keep up with how fast everything changes, including what our audiences are looking for. It’s a lot.

But, imagine if we didn’t have to stress so much about these things. What if there was a way to get ahead of the game, to really understand and reach the people we want to talk to, without stepping on any privacy landmines? And what if we could make sure our hard-earned ad dollars were actually going towards real, live human beings interested in what we’ve got to offer?

Enter Restore. It’s our new tool that’s all about making your life a whole lot easier. We’ve figured out a way to track what matters—real interest and intent—without relying on those disappearing cookies. And bots? We can help you spot those so you’re not throwing your budget into the digital void.

In this blog post, we’re diving deep into what’s been making digital advertising such a headache lately and how Restore is changing the game. It’s not just about keeping up anymore; it’s about setting the pace. We’ll talk about how Restore’s tech can give you a clearer view of who’s actually interested in what you’re selling, all while playing nice with privacy rules and keeping your ad spend in check.


Common Digital Advertising Problems

Let’s take a step back and explore how the digital advertising landscape has evolved, leading us to navigate through some challenging waters today. It’s been quite a journey from the simpler times of online advertising to the complex scenario we’re dealing with now.

The Rise of Privacy Concerns

One of the most significant shifts has been the heightened focus on privacy. With regulations like the GDPR in Europe and the CCPA in California coming into effect, the importance of respecting user privacy and handling data responsibly has taken center stage. These changes have fundamentally altered how we can collect and use data, emphasizing the need for transparency and user consent. It’s a necessary evolution, ensuring that trust and respect form the foundation of our interactions online.

Navigating a World Without Cookies

The gradual elimination of third-party cookies is another pivotal change. Major browsers are moving away from cookies, pushing the industry towards more privacy-friendly ways of understanding user behavior. This transition challenges us to innovate and find new methods to reach our audience without compromising on privacy.

The Challenge of Bots and Fraud

Bots have also become a more pronounced issue, creating a landscape where ad fraud can easily eat into budgets without delivering any real engagement. Identifying and mitigating the impact of bots is crucial in ensuring that our advertising efforts reach real people and drive genuine interactions.

Ad Saturation and Consumer Fatigue

With the digital space becoming increasingly crowded, capturing and maintaining audience attention has become more difficult. Consumers are bombarded with ads, leading to ad fatigue and making it harder for messages to stand out. This saturation calls for more creative and engaging approaches to advertising, prioritizing quality and relevance.

The Increasing Complexity of Digital Advertising

Finally, the sheer complexity of digital advertising today, with its myriad platforms, formats, and strategies, demands a more sophisticated approach. Staying ahead requires not just keeping up with current trends but anticipating future shifts, all while maintaining an ethical and responsible stance towards privacy and data use.

In this evolving landscape, the need for tools that can navigate these complexities while upholding the principles of privacy and engagement has never been clearer. Enter a solution designed to meet these modern challenges with integrity and innovation. Let’s dive into how we can adapt and thrive in this new era of digital advertising, making every connection count.

Navigating the current digital advertising landscape requires a blend of innovation, strategy, and a keen understanding of the evolving challenges. As we’ve explored the shifts and turns in this environment, the next step is to delve into effective strategies that can help advertisers thrive amidst these changes. Three key strategies stand out for their ability to enhance engagement, optimize spend, and ultimately drive better results: accurate retargeting and lookalike audiences, understanding the full customer journey, and avoiding wasteful spending on bots.

Overcoming Challenges with Restore by

We’ve designed our Restore tool to help our customers vault over these pernicious hurdles. Here’s how: 

Accurate Retargeting and Lookalike Audiences

Retargeting and lookalike audiences are the two most crucial audience building tools in your digital advertising arsenal. 

But because Facebook’s pixel is blocked by so many browsers, these crucial targeting audiences are built with bad data, meaning your campaigns won’t be reaching who they should. That’s bad! 

Our Restore tool gathers data from our Website Visitor ID X-Ray Pixel, which isn’t blocked by any browsers. That means that you can build your retargeting and lookalike audience off of people who are actually visiting your site. 

Identifying High-Intent Users

Zooming in on high-intent users is a game-changer. It’s all about spotting those who are just a nudge away from making a purchase or signing up. By tracking key behaviors—like how often they visit your site, what they’re checking out, or how long they linger on certain pages—you get a clearer picture of who’s really interested. This isn’t just about following the crowd; it’s about understanding who’s ready to take the leap. With smarter analytics, you can tailor your messages to catch them at just the right moment, turning warm leads into solid conversions. This strategy sharpens your focus, ensuring your ad dollars target the people most likely to act, making every penny count.

Our Restore tool allows you to target people who’ve visited specific high-intent pages or viewed the same product several times. This means that you don’t waste money on badly targeted campaigns to low-intent visitors! 

Creating these audiences is very simple! 

In your account, navigate to your My Leads tab and click on “Audiences.” 

You’ll see a big list of all your contacts. Then select “Add Filter.”

In the Attribute drop-down, select “Landing Page URL” and in the Operator drop-down select “Equals.” Paste the landing page URL in the Value section. 

Then just save it as an audience and you’re good to send it to your Facebook account! 

Understanding the Full Customer Journey

The digital touchpoints a customer interacts with on their journey are like pieces of a puzzle. Having visibility into the entire picture is crucial for crafting campaigns that not only reach the customer at the right time but also with the right message. This holistic view goes beyond the last click, acknowledging that earlier interactions, though they may not directly lead to a conversion, play a significant role in influencing the customer’s decision-making process.

Our Customer Journey tools allow you to understand your customers’ journeys across devices and retarget ads to them more effectively. 

Avoiding Spending on Bots

The digital ad space is fraught with inefficiencies, one of the most glaring being the expenditure on non-human traffic, namely bots. Bots can skew analytics, drain budgets, and dilute the effectiveness of campaigns. Implementing strategies to identify and exclude bot traffic is not just about saving money; it’s about ensuring that every dollar spent is an investment in reaching real, engaged users.

Because our audiences are built on who actually visits your site–and we verify your contacts–you don’t have to worry about chasing a bot who’ll never purchase from you (because they don’t really exist!). 

Interested in getting started? See how many contacts we could pull for your ad audiences.


The post How to Use Restore by to Boost Your Ad Campaigns appeared first on

Google AI Introduces ScreenAI: A Vision-Language Model for User Interfaces (UI) and Infographics Understanding

The capacity of infographics to strategically arrange and use visual signals to clarify complicated concepts has made them essential for efficient communication. Infographics encompass visual elements such as charts, diagrams, illustrations, maps, tables, and document layouts, and they have long been used to make material easier to understand. In the modern digital world, user interfaces (UIs) on desktop and mobile platforms share design concepts and visual languages with infographics.

Though there is a lot of overlap between UIs and infographics, creating a cohesive model is made more difficult by the complexity of each. It is difficult to develop a single model that can efficiently analyze and interpret the visual information encoded in pixels because of the intricacy required in understanding, reasoning, and engaging with the various aspects of infographics and user interfaces.

To address this, a team of researchers at Google Research proposed ScreenAI, a Vision-Language Model (VLM) that can fully comprehend both UIs and infographics. Its scope includes tasks like graphical question-answering (QA), which may involve charts, pictures, maps, and more.

The team has shared that ScreenAI can also handle tasks such as element annotation, summarization, navigation, and other UI-specific QA. To accomplish this, the model combines the flexible patching method from Pix2struct with the PaLI architecture, which lets it tackle vision tasks by recasting them as text or image-to-text problems.
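The variable-resolution patching idea can be sketched loosely as choosing a patch grid that fits a fixed budget while respecting the image’s aspect ratio. This is a simplified stand-in for Pix2struct’s actual strategy, and `patch_grid` is a hypothetical helper, not an API from either paper:

```python
def patch_grid(height: int, width: int, max_patches: int):
    """Pick the rows x cols grid with the most patches that (a) stays within
    the patch budget and (b) best matches the image's aspect ratio."""
    best, best_key = (1, 1), None
    for rows in range(1, max_patches + 1):
        for cols in range(1, max_patches // rows + 1):
            # Prefer more patches; break ties toward less aspect distortion.
            distortion = abs(rows / cols - height / width)
            key = (rows * cols, -distortion)
            if best_key is None or key > best_key:
                best, best_key = (rows, cols), key
    return best

print(patch_grid(768, 1024, 12))   # a wide image gets more columns than rows
```

The point of this flexibility is that screenshots and infographics with very different shapes can all be fed to the same model without destructive resizing.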

Several tests have been carried out to demonstrate how these design decisions affect the model’s functionality. Upon evaluation, ScreenAI produced new state-of-the-art results on tasks like Multipage DocVQA, WebSRC, MoTIF, and Widget Captioning with under 5 billion parameters. It achieved remarkable performance on tasks including DocVQA, InfographicVQA, and Chart QA, outperforming models of comparable size. 

The team has made available three additional datasets: Screen Annotation, ScreenQA Short, and Complex ScreenQA. One of these datasets specifically focuses on the screen annotation task for future research, while the other two datasets are focused on question-answering, thus further expanding the resources available to advance the field. 

The team has summarized their primary contributions as follows:

ScreenAI, a Vision-Language Model (VLM), is a step towards a holistic solution focused on infographic and user interface comprehension. By utilizing the common visual language and sophisticated design of these components, ScreenAI offers a comprehensive method for understanding digital material.

One significant advancement is the development of a textual representation for UIs. During the pretraining stage, this representation has been used to teach the model how to comprehend user interfaces, improving its capacity to comprehend and process visual data.

To automatically create training data at scale, ScreenAI has used LLMs and the new UI representation, making training more effective and comprehensive.

Three new datasets, Screen Annotation, ScreenQA Short, and Complex ScreenQA, have been released. These datasets allow for thorough model benchmarking for screen-based question answering and the suggested textual representation.

ScreenAI has outperformed models more than ten times its size on four public infographics QA benchmarks, despite its modest 4.6 billion parameters.

Check out the Paper. All credit for this research goes to the researchers of this project.


The post Google AI Introduces ScreenAI: A Vision-Language Model for User interfaces (UI) and Infographics Understanding appeared first on MarkTechPost.

What is Fine Tuning and Best Methods for Large Language Model (LLM) Fine-Tuning

Large Language Models (LLMs) such as GPT, PaLM, and LLaMa have made major advancements in the field of Artificial Intelligence (AI) and Natural Language Processing (NLP) by enabling machines to comprehend and produce content that is similar to that of humans. These models possess an extensive comprehension of language and its subtleties, having been trained on massive amounts of data. However, their generalist character frequently proves inadequate when used for specialized activities or domains. This is where finetuning enters the picture, which is a crucial procedure that greatly improves the model’s performance.

What is Fine Tuning?

Finetuning is a way to adapt a pre-trained language model so that it performs well in a specific area. Even though LLMs have remarkable comprehension and generation skills, they are not naturally suited to tackling specialized tasks accurately. By retraining the model on a smaller, domain-specific dataset, finetuning overcomes this constraint and enables the model to acquire the nuances and distinctive features of the intended field.

A pre-trained model with a broad grasp of language is the starting point for finetuning. This model is finetuned by subjecting it to a carefully selected dataset. The model modifies its internal parameters, such as weights and biases, through this exposure to better match the data’s characteristics. This specialized training phase greatly enhances the model’s performance on tasks linked to the domain, which helps the model understand the intricacies, vocabulary, and context.

Fine Tuning Approaches

1. Parameter Efficient Fine Tuning (PEFT)

The main notion underlying PEFT is that reducing the number of trainable parameters in a neural network makes the training process more computationally efficient. LoRA and QLoRA are two prominent PEFT approaches.

a) LoRA 

Low-Rank Adaptation, or LoRA, is an adapter-based PEFT method. LoRA adds a small set of new trainable parameters during the training phase without permanently changing the model architecture; the learned update can later be merged back into the original weights, so the deployed model gains no extra parameters. This is what makes the finetuning parameter-efficient.

LoRA achieves this efficiency by factorizing the weight update matrix into two smaller matrices, A and B, whose shared inner dimension is the rank parameter ‘r.’ The rank parameter determines the size of these smaller matrices. The full update matrix, which represents the modifications learned through backpropagation, has the same shape as the weight matrix being finetuned, but it is never materialized directly: only the two small factors are trained, using standard backpropagation.
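A dependency-free sketch of the idea, using toy dimensions and plain-list matrices. Note that write-ups differ on which factor is called A and which B, and real implementations scale the update by alpha/r; a single alpha is used here for brevity:

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_update(W, A, B, alpha=1.0):
    """Apply the low-rank update: W' = W + alpha * (A @ B).
    A is (d x r), B is (r x d); only A and B are trained."""
    delta = matmul(A, B)
    return [[w + alpha * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

d, r = 4, 2                                 # model dim 4, rank 2 (toy sizes)
W = [[0.0] * d for _ in range(d)]           # frozen base weight
A = [[random.gauss(0, 0.02) for _ in range(r)] for _ in range(d)]
B = [[0.0] * d for _ in range(r)]           # one factor starts at zero, as in LoRA

# With the zero-initialised factor, the update is a no-op at the start of
# training, so finetuning begins from exactly the pre-trained behavior.
assert lora_update(W, A, B) == W
```

The parameter saving comes from scale: for a 4096x4096 layer with r=8, the factors hold 65,536 trainable values versus 16.8 million in the full matrix.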

b) QLoRA

Quantized LoRA, or QLoRA, is an improvement on LoRA that combines low-precision storage with high-precision computation. The goal of this combination is to keep the model’s memory footprint small while maintaining good accuracy and performance.

To accomplish this, QLoRA introduces two crucial concepts: 4-bit NormalFloat (NF4), a 4-bit data type whose quantization levels are spaced to match normally distributed weights, and double quantization, which quantizes the quantization constants themselves to save additional memory.
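The storage side can be illustrated with simple absmax quantization to 4-bit integers. This is a simplification: NF4 additionally spaces its 16 levels to match a normal distribution, and double quantization would further compress the per-block scale constants; both are omitted here for brevity:

```python
def quantize_4bit(values):
    """Absmax-quantize floats to 4-bit signed integers (-8..7) plus one
    per-block scale constant."""
    scale = max(abs(v) for v in values) / 7 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize_4bit(q, scale):
    """Recover approximate floats for high-precision computation."""
    return [x * scale for x in q]

weights = [0.31, -0.07, 0.70, 0.02]
q, scale = quantize_4bit(weights)
restored = dequantize_4bit(q, scale)
# Each restored value is within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 + 1e-12 for a, b in zip(weights, restored))
```

In QLoRA, the frozen base weights are stored in this compressed form and dequantized on the fly, while the small LoRA factors remain in higher precision for training.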

2. Supervised Fine Tuning

Supervised finetuning is a method of optimizing LLMs using task-specific labeled datasets. Every input in these datasets is paired with an accurate label or response, which serves as a reference for the model during its learning phase. Through supervised finetuning, the model adjusts its internal parameters to predict these labels with high accuracy, refining the broad knowledge base it acquired from large datasets during pre-training to the particulars and demands of the intended task.

a) Basic Hyperparameter Tuning

Using this fundamental method, the model’s hyperparameters, the key variables that control the training process (such as learning rate, batch size, and number of training epochs), are carefully adjusted. The essence of basic hyperparameter tuning is finding the combination of these values that enables the model to learn from the task-specific data most effectively. This significantly increases learning efficacy, improving the model’s task-specific performance while reducing the likelihood of overfitting.
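A minimal grid-search sketch over these hyperparameters; `fake_train` is a stand-in for an actual finetune-and-validate run, and the grid values are illustrative:

```python
import itertools

def grid_search(train_and_eval, grid):
    """Try every hyperparameter combination and keep the best validation score."""
    best_score, best_params = float("-inf"), None
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid, combo))
        score = train_and_eval(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Stand-in for "finetune the model and return validation accuracy":
# peaks at lr=1e-4 and batch_size=16, and rewards more epochs.
def fake_train(lr, batch_size, epochs):
    return -abs(lr - 1e-4) * 1e4 - abs(batch_size - 16) / 100 + epochs / 10

grid = {"lr": [1e-5, 1e-4, 1e-3], "batch_size": [8, 16, 32], "epochs": [1, 3]}
params, score = grid_search(fake_train, grid)
print(params)   # {'lr': 0.0001, 'batch_size': 16, 'epochs': 3}
```

Exhaustive grids grow multiplicatively with each added hyperparameter, which is why random or Bayesian search is often preferred for larger spaces.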

b) Transfer Learning

Transfer learning is particularly useful when there is a shortage of task-specific data. It begins with a pre-trained model on a large-scale, widely-used dataset. The smaller, task-specific dataset is then used to refine this model. Utilizing the model’s previously gained, broad information and tailoring it to the new task is the essence of transfer learning. In addition to saving time and training resources, this method frequently produces better outcomes than creating a model from scratch.

c) Few-shot learning

Few-shot learning enables a model to rapidly adjust to a new task using the least amount of task-specific data possible. By utilizing the model’s vast pre-trained knowledge base, it can understand the new task from a handful of instances. This approach is helpful when gathering a sizable labeled dataset for the new task is not feasible. The foundation of few-shot learning is the idea that a limited number of examples given during inference can successfully direct the model’s comprehension and execution of the new task.
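In practice this often amounts to packing labeled demonstrations into the prompt itself. A minimal sketch, where the review/sentiment task is purely illustrative:

```python
def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: demonstrations first, then the new input."""
    demos = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    return f"{demos}\nReview: {query}\nSentiment:"

examples = [
    ("The battery lasts for days.", "positive"),
    ("Screen cracked within a week.", "negative"),
]
print(few_shot_prompt(examples, "Fast shipping and great build quality."))
```

The prompt ends at "Sentiment:" so that the model’s continuation is the predicted label; no weights are updated at any point.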

3. Reinforcement Learning from Human Feedback (RLHF) 

RLHF is an approach to language model training that integrates human judgment and evaluation into machine learning. This technique allows language models to be improved dynamically, resulting in outputs that are accurate and socially and contextually appropriate. The key to RLHF is its capacity to combine the algorithmic learning powers of models with the subjective assessments of human feedback, allowing the models to develop more naturally and responsively.

a) Reward modeling

Reward modeling involves exposing the model to a range of possible responses and assessing them through human evaluation. Evaluators rate or rank these outputs on factors such as appropriateness, coherence, and relevance. A separate reward model is then trained on this human input, learning to predict the reward for a given output based on the human evaluations. The language model uses this learned reward function as a guide, modifying its outputs over time to maximize the predicted reward.
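The training signal for the reward model is commonly a pairwise, Bradley-Terry-style loss on preferred versus dispreferred outputs (this is the standard formulation in RLHF write-ups, shown here as a one-function sketch):

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry-style loss: -log sigmoid(r_chosen - r_rejected).
    It shrinks as the reward model scores the human-preferred output higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss rewards a larger margin between preferred and dispreferred outputs.
assert pairwise_reward_loss(2.0, 0.0) < pairwise_reward_loss(0.5, 0.0) < pairwise_reward_loss(0.0, 0.5)
```

Minimizing this loss over many human-ranked pairs teaches the reward model to assign higher scalar scores to the kinds of responses people prefer.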

b) Proximal Policy Optimization (PPO)

Within the RLHF paradigm, Proximal Policy Optimization (PPO) is the more technical step: it iteratively improves the model’s decision-making policy to increase the expected reward. The key to PPO’s effectiveness is its deliberate approach to policy updates, which makes incremental rather than dramatic changes to the model’s policy so as not to destabilize the learning trajectory.

This is accomplished with an objective function that incorporates a clipping mechanism to control the rate of policy updates. The clip ensures that each update, while still large enough to contribute to learning, does not deviate too far from the previous policy iteration, preserving controlled and steady progress. This constraint mechanism is essential to PPO’s effectiveness because it fosters a stable, balanced learning process that is less vulnerable to the dangers of unpredictable policy changes.
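The clipped surrogate objective for a single action can be written in a few lines (scalar inputs for clarity; a real implementation operates on batched tensors):

```python
def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO's surrogate objective:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    where ratio = pi_new(a|s) / pi_old(a|s). The clip caps how much a single
    update can exploit a large policy shift."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, gains are capped once the ratio exceeds 1 + eps.
assert abs(ppo_clipped_objective(1.5, 1.0) - 1.2) < 1e-9
# With a negative advantage, the objective cannot look better than the
# clipped value either, discouraging large moves in that direction too.
assert abs(ppo_clipped_objective(0.5, -1.0) - (-0.8)) < 1e-9
```

Taking the minimum of the clipped and unclipped terms makes the objective pessimistic: the policy never benefits from moving outside the trust region around the old policy.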



The post What is Fine Tuning and Best Methods for Large Language Model (LLM) Fine-Tuning appeared first on MarkTechPost.

Unlocking AI’s Potential: A Comprehensive Survey of Prompt Engineeri …

Prompt engineering has burgeoned into a pivotal technique for augmenting the capabilities of large language models (LLMs) and vision-language models (VLMs), utilizing task-specific instructions or prompts to amplify model efficacy without altering core model parameters. These prompts range from natural language instructions that provide context to guide the model to learned vector representations that activate relevant knowledge, fostering success in myriad applications like question-answering and commonsense reasoning. Despite its growing use, a systematic organization and understanding of the diverse prompt engineering methods is still lacking.

This survey by researchers from the Indian Institute of Technology Patna, Stanford University, and Amazon AI endeavors to bridge this gap by offering a structured overview of the recent advancements in prompt engineering, categorized by application area. It meticulously analyzes over 29 distinct techniques, delving into their methodologies, applications, models involved, and datasets utilized. This examination extends from foundational methods like zero-shot and few-shot prompting to more intricate approaches such as chain of code prompting, showcasing the field’s breadth and depth.

The survey highlights the transformative impact of prompt engineering on the adaptability of LLMs and VLMs, enabling these models to excel across diverse tasks and domains with a finesse previously unattainable through traditional model training paradigms. Prompt engineering pushes the boundaries of AI by sidestepping the need for model retraining or extensive fine-tuning, paving the way for a future teeming with possibilities.

The survey underscores the importance of prompt engineering in steering model responses, thus enhancing the adaptability and applicability of LLMs across various sectors. It presents a comprehensive taxonomy and summarizes key points, datasets, models, and the critical features of each prompting technique, providing a clearer understanding of this rapidly developing field. This systematic analysis aims to illuminate open challenges and opportunities for prompt engineering, facilitating future research in this dynamic arena.

In conclusion, the domain of artificial intelligence witnesses prompt engineering as a transformative force, unlocking the vast potential of LLMs. This survey serves as a foundational resource, categorizing distinct prompt engineering techniques based on their functionalities, inspiring further research, and empowering innovators in the evolving landscape of prompt engineering. Despite its successes, challenges such as biases, factual inaccuracies, and interpretability gaps persist, necessitating continued investigation and mitigation strategies. With emerging trends like meta-learning and hybrid prompting architectures, the future of prompt engineering holds immense potential, yet ethical considerations remain paramount to ensure its responsible development and deployment.


The post Unlocking AI’s Potential: A Comprehensive Survey of Prompt Engineering Techniques appeared first on MarkTechPost.

Streamline diarization using AI as an assistive technology: ZOO Digita …

ZOO Digital provides end-to-end localization and media services to adapt original TV and movie content to different languages, regions, and cultures. It makes globalization easier for the world’s best content creators. Trusted by the biggest names in entertainment, ZOO Digital delivers high-quality localization and media services at scale, including dubbing, subtitling, scripting, and compliance.
Typical localization workflows require manual speaker diarization, wherein an audio stream is segmented based on the identity of the speaker. This time-consuming process must be completed before content can be dubbed into another language. With manual methods, a 30-minute episode can take between 1–3 hours to localize. Through automation, ZOO Digital aims to achieve localization in under 30 minutes.
In this post, we discuss deploying scalable machine learning (ML) models for diarizing media content using Amazon SageMaker, with a focus on the WhisperX model.
ZOO Digital’s vision is to provide a faster turnaround of localized content. This goal is bottlenecked by the manually intensive nature of the exercise, compounded by the small workforce of skilled people who can localize content manually. ZOO Digital works with over 11,000 freelancers and localized over 600 million words in 2022 alone. However, the supply of skilled people is being outstripped by the increasing demand for content, requiring automation to assist with localization workflows.
With an aim to accelerate the localization of content workflows through machine learning, ZOO Digital engaged AWS Prototyping, an investment program by AWS to co-build workloads with customers. The engagement focused on delivering a functional solution for the localization process, while providing hands-on training to ZOO Digital developers on SageMaker, Amazon Transcribe, and Amazon Translate.
Customer challenge
After a title (a movie or an episode of a TV series) has been transcribed, speakers must be assigned to each segment of speech so that they can be correctly assigned to the voice artists that are cast to play the characters. This process is called speaker diarization. ZOO Digital faces the challenge of diarizing content at scale while being economically viable.
Solution overview
In this prototype, we stored the original media files in a specified Amazon Simple Storage Service (Amazon S3) bucket. This S3 bucket was configured to emit an event when new files are detected within it, triggering an AWS Lambda function. For instructions on configuring this trigger, refer to the tutorial Using an Amazon S3 trigger to invoke a Lambda function. Subsequently, the Lambda function invoked the SageMaker endpoint for inference using the Boto3 SageMaker Runtime client.
The WhisperX model, based on OpenAI’s Whisper, performs transcriptions and diarization for media assets. It’s built upon the Faster Whisper reimplementation, offering up to four times faster transcription with improved word-level timestamp alignment compared to Whisper. Additionally, it introduces speaker diarization, not present in the original Whisper model. WhisperX utilizes the Whisper model for transcriptions, the Wav2Vec2 model to enhance timestamp alignment (ensuring synchronization of transcribed text with audio timestamps), and the pyannote model for diarization. FFmpeg is used for loading audio from source media, supporting various media formats. The transparent and modular model architecture allows flexibility, because each component of the model can be swapped out as needed in the future. However, it’s essential to note that WhisperX lacks full management features and isn’t an enterprise-level product. Without maintenance and support, it may not be suitable for production deployment.
In this collaboration, we deployed and evaluated WhisperX on SageMaker, using an asynchronous inference endpoint to host the model. SageMaker asynchronous endpoints support upload sizes up to 1 GB and incorporate auto scaling features that efficiently mitigate traffic spikes and save costs during off-peak times. Asynchronous endpoints are particularly well-suited for processing large files, such as movies and TV series in our use case.
The following diagram illustrates the core elements of the experiments we conducted in this collaboration.

In the following sections, we delve into the details of deploying the WhisperX model on SageMaker, and evaluate the diarization performance.
Download the model and its components
WhisperX is a system that includes multiple models for transcription, forced alignment, and diarization. For smooth SageMaker operation without the need to fetch model artifacts during inference, it’s essential to pre-download all model artifacts. These artifacts are then loaded into the SageMaker serving container during initiation. Because these models aren’t directly accessible, we offer descriptions and sample code from the WhisperX source, providing instructions on downloading the model and its components.
WhisperX uses six models:

A Faster Whisper model
A Voice Activity Detection (VAD) model
A Wav2Vec2 model
pyannote’s Speaker Diarization model
pyannote’s Segmentation model
SpeechBrain’s Speaker Embedding model

Most of these models can be obtained from Hugging Face using the huggingface_hub library. We use the following download_hf_model() function to retrieve these model artifacts. An access token from Hugging Face, generated after accepting the user agreements for the following pyannote models, is required:

Speaker Diarization
Voice Activity Detection

import huggingface_hub
import yaml
import torchaudio
import urllib.request
import os

CONTAINER_MODEL_DIR = "/opt/ml/model"
WHISPERX_MODEL = "guillaumekln/faster-whisper-large-v2"
DIARIZATION_MODEL = "pyannote/speaker-diarization"

def download_hf_model(model_name: str, hf_token: str, local_model_dir: str) -> str:
    """
    Fetches the provided model from HuggingFace and returns the subdirectory it is downloaded to
    :param model_name: HuggingFace model name (and an optional version, appended with @[version])
    :param hf_token: HuggingFace access token authorized to access the requested model
    :param local_model_dir: The local directory to download the model to
    :return: The subdirectory within local_model_dir that the model is downloaded to
    """
    model_subdir = model_name.split('@')[0]
    huggingface_hub.snapshot_download(model_subdir, token=hf_token,
                                      local_dir=f"{local_model_dir}/{model_subdir}",
                                      local_dir_use_symlinks=False)
    return model_subdir

The VAD model is fetched from Amazon S3, and the Wav2Vec2 model is retrieved from the torchaudio.pipelines module. Based on the following code, we can retrieve all the models’ artifacts, including those from Hugging Face, and save them to the specified local model directory:
# Note: VAD_MODEL_URL and WAV2VEC2_MODEL are assumed to be defined alongside the
# constants above. VAD_MODEL_URL points to WhisperX's VAD checkpoint in S3, and
# WAV2VEC2_MODEL names a torchaudio pipeline (for example, "WAV2VEC2_ASR_BASE_960H").

def fetch_models(hf_token: str, local_model_dir="./models"):
    """
    Fetches all required models to run WhisperX locally without downloading models every time
    :param hf_token: A huggingface access token to download the models
    :param local_model_dir: The directory to download the models to
    """
    # Fetch Faster Whisper's Large V2 model from HuggingFace
    download_hf_model(model_name=WHISPERX_MODEL, hf_token=hf_token, local_model_dir=local_model_dir)

    # Fetch WhisperX's VAD Segmentation model from S3
    vad_model_dir = "whisperx/vad"
    if not os.path.exists(f"{local_model_dir}/{vad_model_dir}"):
        os.makedirs(f"{local_model_dir}/{vad_model_dir}")
    urllib.request.urlretrieve(VAD_MODEL_URL, f"{local_model_dir}/{vad_model_dir}/pytorch_model.bin")

    # Fetch the Wav2Vec2 alignment model
    torchaudio.pipelines.__dict__[WAV2VEC2_MODEL].get_model(dl_kwargs={"model_dir": f"{local_model_dir}/wav2vec2/"})

    # Fetch pyannote's Speaker Diarization model from HuggingFace
    download_hf_model(model_name=DIARIZATION_MODEL, hf_token=hf_token, local_model_dir=local_model_dir)

    # Read in the Speaker Diarization model config to fetch models and update with their local paths
    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'r') as file:
        diarization_config = yaml.safe_load(file)

    embedding_model = diarization_config['pipeline']['params']['embedding']
    embedding_model_dir = download_hf_model(model_name=embedding_model,
                                            hf_token=hf_token,
                                            local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['embedding'] = f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}"

    segmentation_model = diarization_config['pipeline']['params']['segmentation']
    segmentation_model_dir = download_hf_model(model_name=segmentation_model,
                                               hf_token=hf_token,
                                               local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['segmentation'] = f"{CONTAINER_MODEL_DIR}/{segmentation_model_dir}/pytorch_model.bin"

    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'w') as file:
        yaml.safe_dump(diarization_config, file)

    # Read in the Speaker Embedding model config to update it with its local path
    speechbrain_hyperparams_path = f"{local_model_dir}/{embedding_model_dir}/hyperparams.yaml"
    with open(speechbrain_hyperparams_path, 'r') as file:
        speechbrain_hyperparams =

    speechbrain_hyperparams = speechbrain_hyperparams.replace(embedding_model_dir, f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}")

    with open(speechbrain_hyperparams_path, 'w') as file:
        file.write(speechbrain_hyperparams)
Select the appropriate AWS Deep Learning Container for serving the model
After the model artifacts are saved using the preceding sample code, you can choose pre-built AWS Deep Learning Containers (DLCs) from the following GitHub repo. When selecting the Docker image, consider the following settings: framework (Hugging Face), task (inference), Python version, and hardware (for example, GPU). We recommend using the following image: 763104351884.dkr.ecr.[REGION] This image has all the necessary system packages pre-installed, such as ffmpeg. Remember to replace [REGION] with the AWS Region you are using.
For other required Python packages, create a requirements.txt file with a list of packages and their versions. These packages will be installed when the AWS DLC is built. The following are the additional packages needed to host the WhisperX model on SageMaker:


Create an inference script to load the models and run inference
Next, we create a custom script to outline how the WhisperX model and its components are loaded into the container and how the inference process should be run. The script contains two functions: model_fn and transform_fn. The model_fn function is invoked to load the models from their respective locations. Subsequently, these models are passed to the transform_fn function during inference, where transcription, alignment, and diarization processes are performed. The following is a code sample for the inference script:
import io
import json
import logging
import tempfile
import time

import torch
import whisperx

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

def model_fn(model_dir: str) -> dict:
    """
    Deserialize and return the models
    """"Loading WhisperX model")
    model = whisperx.load_model(whisper_arch=f"{model_dir}/guillaumekln/faster-whisper-large-v2",
                                device=DEVICE,
                                vad_options={'model_fp': f"{model_dir}/whisperx/vad/pytorch_model.bin"})"Loading alignment model")
    align_model, metadata = whisperx.load_align_model(language_code="en",
                                                      device=DEVICE,
                                                      model_dir=f"{model_dir}/wav2vec2")"Loading diarization model")
    diarization_model = whisperx.DiarizationPipeline(model_name=f"{model_dir}/pyannote/speaker-diarization/config.yaml",
                                                     device=DEVICE)

    return {
        'model': model,
        'align_model': align_model,
        'metadata': metadata,
        'diarization_model': diarization_model
    }

def transform_fn(model: dict, request_body: bytes, request_content_type: str,
                 response_content_type="application/json") -> (str, str):
    """
    Load in audio from the request, transcribe and diarize, and return JSON output
    """
    # Start a timer so that we can log how long inference takes
    start_time = time.time()

    # Unpack the models
    whisperx_model = model['model']
    align_model = model['align_model']
    metadata = model['metadata']
    diarization_model = model['diarization_model']

    # Load the media file (the request_body as bytes) into a temporary file,
    # then use WhisperX to load the audio from it"Loading audio")
    with io.BytesIO(request_body) as file:
        tfile = tempfile.NamedTemporaryFile(delete=False)
        tfile.write(
        audio = whisperx.load_audio(

    # Run transcription"Transcribing audio")
    result = whisperx_model.transcribe(audio, batch_size=16)

    # Align the outputs for better timings"Aligning outputs")
    result = whisperx.align(result["segments"], align_model, metadata, audio, DEVICE,
                            return_char_alignments=False)

    # Run diarization"Running diarization")
    diarize_segments = diarization_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    # Calculate the time it took to perform the transcription and diarization
    end_time = time.time()
    elapsed_time = end_time - start_time"Transcription and Diarization took {int(elapsed_time)} seconds")

    # Return the results to be stored in S3
    return json.dumps(result), response_content_type

Within the model’s directory, place the inference script alongside the requirements.txt file in a code subdirectory. The model directory should resemble the following:

├── code
│   ├── <inference script>.py
│   └── requirements.txt
├── guillaumekln
│   └── faster-whisper-large-v2
├── pyannote
│   ├── segmentation
│   │   └── …
│   └── speaker-diarization
│       └── …
├── speechbrain
│   └── spkrec-ecapa-voxceleb
│       └── …
├── wav2vec2
│   └── …
└── whisperx
    └── vad
        └── …

Create a tarball of the models
After you create the models and code directories, you can use the following command lines to compress the model into a tarball (.tar.gz file) and upload it to Amazon S3. At the time of writing, using the faster-whisper Large V2 model, the resulting tarball representing the SageMaker model is 3 GB in size. For more information, refer to Model hosting patterns in Amazon SageMaker, Part 2: Getting started with deploying real time models on SageMaker.

# Save the model artifacts to the 'model' directory and create a tarball
tar cvzf model.tar.gz -C model/ .
# Upload the model to S3
aws s3 cp model.tar.gz s3://<target_bucket>

Create a SageMaker model and deploy an endpoint with an asynchronous predictor
Now you can create the SageMaker model, endpoint config, and asynchronous endpoint with AsyncPredictor using the model tarball created in the previous step. For instructions, refer to Create an Asynchronous Inference Endpoint.
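As a rough sketch of that step (the instance type, framework versions, and S3 paths below are illustrative assumptions, to be matched to the DLC selected earlier), the SageMaker Python SDK can create the model and asynchronous endpoint from the tarball:

```python
def deploy_whisperx_async(model_data_uri: str, role_arn: str, output_s3_uri: str):
    """Create a SageMaker model from the tarball and deploy an async endpoint.

    Returns a predictor that can be used to submit asynchronous requests.
    """
    # Imported inside the function so the sketch reads top-to-bottom with its dependencies.
    from sagemaker.huggingface import HuggingFaceModel
    from sagemaker.async_inference import AsyncInferenceConfig

    model = HuggingFaceModel(
        model_data=model_data_uri,       # e.g. s3://<target_bucket>/model.tar.gz
        role=role_arn,
        transformers_version="4.26",     # illustrative; align with the chosen DLC
        pytorch_version="1.13",
        py_version="py39",
    )
    # Asynchronous endpoints write results to S3 instead of returning them inline.
    return model.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",  # GPU instance; size to your media workloads
        async_inference_config=AsyncInferenceConfig(output_path=output_s3_uri),
    )
```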
Evaluate diarization performance
To assess the diarization performance of the WhisperX model in various scenarios, we selected three episodes each from two English titles: one drama title consisting of 30-minute episodes, and one documentary title consisting of 45-minute episodes. We utilized pyannote’s metrics toolkit, pyannote.metrics, to calculate the diarization error rate (DER). In the evaluation, manually transcribed and diarized transcripts provided by ZOO served as the ground truth.
We defined the DER as follows:

DER = (FA + Miss + Error) / Total

Total is the length of the ground truth video. FA (False Alarm) is the length of segments that are considered speech in the prediction but not in the ground truth. Miss is the length of segments that are considered speech in the ground truth but not in the prediction. Error, also called Confusion, is the length of segments that are assigned to different speakers in the prediction and the ground truth. All units are measured in seconds. Typical DER values vary depending on the specific application, the dataset, and the quality of the diarization system. Note that DER can be larger than 1.0. A lower DER is better.
To be able to calculate the DER for a piece of media, a ground truth diarization is required as well as the WhisperX transcribed and diarized outputs. These must be parsed and result in lists of tuples containing a speaker label, speech segment start time, and speech segment end time for each segment of speech in the media. The speaker labels don’t need to match between the WhisperX and ground truth diarizations. The results are based mostly on the time of the segments. pyannote.metrics takes these tuples of ground truth diarizations and output diarizations (referred to in the pyannote.metrics documentation as reference and hypothesis) to calculate the DER. The following table summarizes our results.

(Results table: Video Type, False Alarm, Miss, Error, and DER per title; the numeric values were not preserved in this extraction.)
These results reveal a significant performance difference between the drama and documentary titles, with the model achieving notably better results (using DER as an aggregate metric) for the drama episodes compared to the documentary title. A closer analysis of the titles provides insights into potential factors contributing to this performance gap. One key factor could be the frequent presence of background music overlapping with speech in the documentary title. Although preprocessing media to enhance diarization accuracy, such as removing background noise to isolate speech, was beyond the scope of this prototype, it opens avenues for future work that could potentially enhance the performance of WhisperX.
In this post, we explored the collaborative partnership between AWS and ZOO Digital, employing machine learning techniques with SageMaker and the WhisperX model to enhance the diarization workflow. The AWS team played a pivotal role in assisting ZOO in prototyping, evaluating, and understanding the effective deployment of custom ML models, specifically designed for diarization. This included incorporating auto scaling for scalability using SageMaker.
Harnessing AI for diarization will lead to substantial savings in both cost and time when generating localized content for ZOO. By aiding transcribers in swiftly and precisely creating and identifying speakers, this technology addresses the traditionally time-consuming and error-prone nature of the task. The conventional process often involves multiple passes through the video and additional quality control steps to minimize errors. The adoption of AI for diarization enables a more targeted and efficient approach, thereby increasing productivity within a shorter timeframe.
We’ve outlined key steps to deploy the WhisperX model on the SageMaker asynchronous endpoint, and encourage you to try it yourself using the provided code. For further insights into ZOO Digital’s services and technology, visit ZOO Digital’s official site. For details on deploying the OpenAI Whisper model on SageMaker and various inference options, refer to Host the Whisper Model on Amazon SageMaker: exploring inference options. Feel free to share your thoughts in the comments.

About the Authors
Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her primary areas of interest encompass Deep Learning, with a focus on GenAI, Computer Vision, NLP, and time series data prediction. In her spare time, she relishes spending quality moments with her family, immersing herself in novels, and hiking in the national parks of the UK.
Ethan Cumberland is an AI Research Engineer at ZOO Digital, where he works on using AI and Machine Learning as assistive technologies to improve workflows in speech, language, and localisation. He has a background in software engineering and research in the security and policing domain, focusing on extracting structured information from the web and leveraging open-source ML models for analysing and enriching collected data.
Gaurav Kaila leads the AWS Prototyping team for UK & Ireland. His team works with customers across diverse industries to ideate & co-develop business critical workloads with a mandate to accelerate adoption of AWS services.

The Evolution of Email Deliverability: From Basics to AI-Driven Insigh …


Achieving that coveted spot in the inbox requires more than just sending messages; it requires navigating spam filters and privacy laws, all while writing copy that drives opens and keeps engagement high. No easy feat.

Email deliverability has long been a challenge, and unfortunately, without the help of technology, it’s not going to get any easier.

That’s where AI comes in. 

AI isn’t just helping to improve email deliverability. It’s turning the concept on its head, offering new strategies to help marketers deal with new changes and reach the inbox. 

Let’s explore the transformation from basic deliverability tactics to cutting-edge, AI-enhanced strategies that are setting new benchmarks in email marketing success.


Looking Back at Email Deliverability

In the early days of email marketing, deliverability hinged on one thing – avoiding the spam folder. 

There weren’t promotions tabs or super-sophisticated filters. You didn’t have to worry about having too many links in your message or using characters incorrectly and ending up in the spam folder.

The biggest challenges were crafting non-spammy subject lines and managing bounce rates while the strategies were straightforward – focus on list hygiene and mass distribution without much nuance.

Ah, simpler times. 

But as the saying goes, all good things must come to an end.

As email became ubiquitous, the landscape began to grow more complicated. 

The introduction of Sender Policy Framework (SPF), DomainKeys Identified Mail (DKIM), and Domain-based Message Authentication, Reporting, and Conformance (DMARC) marked significant milestones and new challenges for email marketers.

These authentication protocols, designed to verify the sender’s identity and combat phishing, also made deliverability much more difficult. 
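For illustration, these protocols are published as DNS TXT records along the following lines (the domain names, DKIM selector, and truncated public key are placeholders):

```
; SPF: which servers may send mail for the domain
example.com.               TXT  "v=spf1 include:_spf.example-esp.com ~all"

; DKIM: public key used to verify message signatures (selector "s1")
s1._domainkey.example.com. TXT  "v=DKIM1; k=rsa; p=MIGfMA0GCSq..."

; DMARC: policy for mail failing SPF/DKIM alignment, plus a reporting address
_dmarc.example.com.        TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"
```

A receiving server checks the sending IP against the SPF record, verifies the DKIM signature against the published key, and then applies the DMARC policy to anything that fails both.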

At the same time as new protocols were being put in place, ISPs were evolving and adopting more sophisticated algorithms to filter out spam. 

All of this together led to one thing – the need for an email deliverability strategy.

These changes really forced email marketers to rethink how they got messages to the inbox and placed a greater emphasis on sender reputation and engagement metrics. 

What began as a simple battle to reach the inbox has become a complex endeavor to ensure emails are welcomed by recipients and trusted by email providers.

Contemporary Challenges in Email Deliverability

With the glory days of email marketing behind us, today’s email marketers have to navigate a whole new set of challenges to achieve deliverability. 

The sophistication of spam filters has reached unprecedented levels, employing AI and machine learning to scrutinize every aspect of an email, from technology to content to sender behavior. 

These filters aren’t fooled by basic tactics. They demand genuine engagement and relevant content. 

They also demand a clean sender reputation and adherence to privacy guidelines. 

Sender Reputation

Sender reputation is a huge part of deliverability.

ISPs and email services meticulously score senders on various metrics, including open rates, click-through rates, and spam complaints. 

This scrutiny means that maintaining a clean, engaged email list is more crucial than ever. 

A drop in sender reputation can lead to emails being sent to the spam folder, or worse, blocked entirely.

Privacy Regulations

GDPR and CCPA have really reshaped the email marketing landscape by mandating compliance.

These regulations don’t just suggest consent for data collection, they require it. In fact, violations risk hefty fines.

With these new challenges, email marketers must employ more nuanced and sophisticated strategies. 

It’s a balancing act – ensuring compliance with privacy laws, maintaining a positive sender reputation, and navigating the intricate algorithms of spam filters – that requires constant innovation and adaptability.

The Role of AI in Enhancing Email Deliverability

AI has ushered in a new era for email marketing.

From predictive analytics to personalization, AI is helping email marketers navigate all of the crazy changes taking place.

Predictive Analytics

I think we can all agree that predictive analytics has revolutionized email marketing, particularly when it comes to send time and frequency. 

Instead of having to manually analyze data and test assumptions, AI can take your data and predict the optimal moment for email engagement. The results? Higher open rates and improved deliverability.
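As a toy illustration of the idea (synthetic data and a scikit-learn classifier standing in for a production model), a send-time optimizer learns open behavior from past sends and scores each candidate hour:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic send history: features are [send_hour, day_of_week], label is "opened".
# We fabricate a pattern where mid-morning sends (9-11 am) are opened most often.
hours = rng.integers(0, 24, size=500)
days = rng.integers(0, 7, size=500)
opened = ((hours >= 9) & (hours <= 11)).astype(int) | (rng.random(500) < 0.1).astype(int)

X = np.column_stack([hours, days])
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, opened)

# Score every candidate hour for a given context (here: Wednesday) and pick the best.
candidates = np.column_stack([np.arange(24), np.full(24, 2)])
open_probs = model.predict_proba(candidates)[:, 1]
best_hour = int(np.argmax(open_probs))
```

A real system would fold in many more signals (recipient time zone, device, past engagement recency), but the shape of the problem, learn from history and score candidate send times, is the same.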


Personalization

Personalization is another area where AI is making its mark.

Through NLP, AI email writers can help craft email content that resonates with readers while steering clear of spam filters. 

They can analyze the effectiveness of subject lines, call-to-actions, and overall content, suggesting improvements that enhance engagement rates. 

Sender Reputation Management

The ability to predict and adapt content for better engagement and compliance is invaluable not just for getting your messages delivered but also for maintaining a healthy sender reputation.

Machine learning models can forecast potential deliverability issues before they even happen. 

By analyzing email interactions, AI can identify patterns that may lead to blacklisting or spam complaints, allowing marketers to adjust strategies proactively.

Way better than being blacklisted right?

AI is already making waves when it comes to email deliverability and it’s only going to get better. 

Marketers leveraging AI-driven platforms report significant improvements in open rates, reduced spam complaints, and enhanced overall campaign performance. 

In fact, during a 2023 survey carried out among email marketers from the United States, the United Kingdom, and other European countries, it was found that ~51% of respondents believed that AI-supported email marketing was more effective than traditional email marketing approaches.

These advancements are not just about overcoming challenges; they’re about setting new benchmarks in email marketing effectiveness.

AI for Improved Email Deliverability 

The evolution of email deliverability, from its simplest forms to the sophisticated, AI-driven landscape of today, shows us that email marketing is in a constant state of flux. 

The challenges that once seemed impossible have given way to innovative solutions and opportunities for growth and engagement.

The role of AI in this evolution cannot be overstated. 

It is now a pivotal tool, giving marketers the ability to navigate the many, many complexities of modern email marketing with unprecedented precision. 

AI-driven insights and technologies are not just enhancing deliverability, they are reshaping the very foundations of email marketing strategies, offering a glimpse into a future where personalization, engagement, and compliance are seamlessly integrated.

For strategic email marketers, the message is clear – embracing AI-driven insights and technologies is no longer an option but a necessity for staying ahead. 

The future of email deliverability lies in the ability to adapt, innovate, and harness the full potential of AI to create meaningful, engaging, and successful email campaigns.


The post The Evolution of Email Deliverability: From Basics to AI-Driven Insights appeared first on

Checkmate with Scale: Google DeepMind’s Revolutionary Leap in Chess …

The intersection of artificial intelligence and the ancient game of chess has long captivated researchers, offering a fertile ground for testing the limits of computational strategy and intelligence. The journey from IBM’s Deep Blue, which in 1997 famously defeated the reigning world champion, to today’s highly sophisticated engines like Stockfish and AlphaZero underscores a continuous quest to refine and redefine machine intellect. These advancements have primarily been anchored in explicit search algorithms and intricate heuristics tailored to dissect and dominate the chessboard.

In an era where AI’s prowess is increasingly measured by its capacity to learn and adapt, a groundbreaking study shifts the narrative by harnessing the power of large-scale data and advanced neural architectures. This research by Google DeepMind revolves around a bold experiment: training a transformer model equipped with 270 million parameters, purely through supervised learning techniques, on an extensive dataset comprised of 10 million chess games. This model stands apart by not leaning on the conventional crutches of domain-specific adaptations or the explicit navigation of the decision tree that chess inherently represents.

Rather than concocting a labyrinth of search paths and handcrafted heuristics, the model learns to predict the most advantageous moves directly from the positions on the chessboard. This methodological pivot is not just a departure from tradition but a testament to the transformative potential of large-scale attention-based learning. By annotating each game state with action values derived from the formidable Stockfish 16 engine, the research taps into a deep well of strategic insight, distilling this knowledge into a neural network capable of grandmaster-level decision-making.
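The recipe described above — annotate board states with engine action values, then train a classifier to predict directly from them — is easiest to picture if the continuous evaluations are discretized into classes. A minimal sketch, assuming uniform bins over win percentages (the bin count and edges here are illustrative, not the study's exact configuration):

```python
def win_percent_to_bin(win_percent: float, k: int = 128) -> int:
    """Map an engine win percentage in [0, 100] to one of k uniform class
    bins, turning value prediction into a classification target."""
    idx = int(win_percent / 100.0 * k)
    return min(idx, k - 1)  # clamp 100% into the top bin
```

A transformer trained this way never searches; it only ranks candidate moves by the predicted value bin of the resulting positions.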

The performance metrics of this transformer model are nothing short of revolutionary. Achieving a Lichess blitz Elo rating of 2895 not only sets a new benchmark in human-computer chess confrontations but also demonstrates a remarkable proficiency in solving intricate chess puzzles that have historically been the domain of the most advanced search-based engines. A comparative analysis with existing field giants further underscores this performance leap. The model not only outperforms the policy and value networks of AlphaZero, the program that itself redefined AI’s approach to chess through self-play and deep learning, but also eclipses the capabilities of GPT-3.5-turbo-instruct in understanding and executing chess strategy.

This paradigm-shifting success story is underpinned by meticulously examining the factors contributing to AI excellence in chess. The study delineates a direct correlation between the scale of the training data and the model’s effectiveness, revealing that the depth of strategic understanding and the ability to generalize across unseen board configurations only emerge at a certain magnitude of dataset and model complexity. This insight reinforces the significance of scale in AI’s conquest of intellectual domains and illustrates the nuanced balance between data diversity and computational heuristics.

In conclusion, this research not only redefines the boundaries of AI in chess but also illuminates a path forward for artificial intelligence. The key takeaways include:

The feasibility of achieving grandmaster-level chess play without explicit search algorithms, relying solely on the predictive power of transformer models trained on large-scale datasets.

This demonstrates that the traditional reliance on complex heuristics and domain-specific adjustments can be bypassed, paving the way for more generalized and scalable approaches to AI problem-solving.

The critical role of dataset and model size in unlocking the full potential of AI suggests a broader applicability of these findings beyond the chessboard.

These revelations propel further exploration into the capabilities of neural networks, suggesting that the future of AI may well lie in its ability to distill complex patterns and strategies from vast oceans of data across diverse domains without the need for explicitly programmed guidance.

Check out the Paper. All credit for this research goes to the researchers of this project.


The post Checkmate with Scale: Google DeepMind’s Revolutionary Leap in Chess AI appeared first on MarkTechPost.

Huawei Researchers Try to Rewrite the Rules with PanGu-π Pro: The Dawn of Ultra-Efficient, Tiny Language Models Is Here!

A groundbreaking study conducted by researchers from Huawei Noah’s Ark Lab, in collaboration with Peking University and Huawei Consumer Business Group, presents a transformative approach to developing tiny language models (TLMs) suitable for mobile devices. Despite their reduced size, these compact models aim to deliver performance on par with their larger counterparts, addressing the crucial need for efficient AI applications in resource-constrained environments.

The research team tackled the pressing challenge of optimizing language models for mobile deployment. Traditional large language models, while powerful, are impractical for mobile use due to their substantial computational and memory requirements. This study introduces an innovative tiny language model, PanGu-π Pro, which leverages a meticulously designed architecture and advanced training methodologies to achieve remarkable efficiency and effectiveness.

At the core of their methodology is a strategic optimization of the model’s components. The team embarked on a series of empirical studies to dissect the impact of various elements on the model’s performance. A notable innovation is the compression of the tokenizer, significantly reducing the model’s size without compromising its ability to understand and generate language. Furthermore, architectural adjustments were made to streamline the model, including parameter inheritance from larger models and a multi-round training strategy that enhances learning efficiency.

The introduction of PanGu-π Pro in 1B and 1.5B parameter versions marks a significant leap forward. Following the newly established optimization protocols, the models were trained on a 1.6T multilingual corpus. The results were astounding; PanGu-π-1B Pro demonstrated an average improvement of 8.87 on benchmark evaluation sets. More impressively, PanGu-π-1.5B Pro surpassed several state-of-the-art models with larger sizes, establishing new benchmarks for performance in compact language models.

The implications of this research extend far beyond the realm of mobile devices. By achieving such a delicate balance between size and performance, the Huawei team has opened new avenues for deploying AI technologies in various scenarios where computational resources are limited. Their work not only paves the way for more accessible AI applications but also sets a precedent for future research in optimizing language models.

This study’s findings are a testament to the possibilities inherent in AI, showcasing how innovative approaches can overcome the limitations of current technologies. The Huawei team’s contributions are poised to revolutionize how we think about and interact with AI, making it more ubiquitous and integrated into our daily lives. As we progress, the principles and methodologies developed in this research will undoubtedly influence the evolution of AI technologies, making them more adaptable, efficient, and accessible to all.

Check out the Paper and Github. All credit for this research goes to the researchers of this project.


The post Huawei Researchers Try to Rewrite the Rules with PanGu-π Pro: The Dawn of Ultra-Efficient, Tiny Language Models Is Here! appeared first on MarkTechPost.

Run ML inference on unplanned and spiky traffic using Amazon SageMaker multi-model endpoints

Amazon SageMaker multi-model endpoints (MMEs) are a fully managed capability of SageMaker inference that allows you to deploy thousands of models on a single endpoint. Previously, MMEs statically pre-allocated CPU computing power to models regardless of their traffic load, using Multi Model Server (MMS) as the model server. In this post, we discuss a solution in which an MME can dynamically adjust the compute power assigned to each model based on the model’s traffic pattern. This solution enables you to use the underlying compute of MMEs more efficiently and save costs.
MMEs dynamically load and unload models based on incoming traffic to the endpoint. When utilizing MMS as the model server, MMEs allocate a fixed number of model workers for each model. For more information, refer to Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints.
However, this can lead to a few issues when your traffic pattern is variable. Let’s say you have a singular or few models receiving a large amount of traffic. You can configure MMS to allocate a high number of workers for these models, but this gets assigned to all the models behind the MME because it’s a static configuration. This leads to a large number of workers using hardware compute—even the idle models. The opposite problem can happen if you set a small value for the number of workers. The popular models won’t have enough workers at the model server level to properly allocate enough hardware behind the endpoint for these models. The main issue is that it’s difficult to remain traffic pattern agnostic if you can’t dynamically scale your workers at the model server level to allocate the necessary amount of compute.
The solution we discuss in this post uses DJLServing as the model server, which helps mitigate some of these issues, enables per-model scaling, and allows MMEs to be traffic pattern agnostic.
MME architecture
SageMaker MMEs enable you to deploy multiple models behind a single inference endpoint that may contain one or more instances. Each instance is designed to load and serve multiple models up to its memory and CPU/GPU capacity. With this architecture, a software as a service (SaaS) business can break the linearly increasing cost of hosting multiple models and achieve reuse of infrastructure consistent with the multi-tenancy model applied elsewhere in the application stack. The following diagram illustrates this architecture.

A SageMaker MME dynamically loads models from Amazon Simple Storage Service (Amazon S3) when invoked, instead of downloading all the models when the endpoint is first created. As a result, an initial invocation to a model might see higher inference latency than the subsequent inferences, which are completed with low latency. If the model is already loaded on the container when invoked, then the download step is skipped and the model returns the inferences with low latency. For example, assume you have a model that is only used a few times a day. It’s automatically loaded on demand, whereas frequently accessed models are retained in memory and invoked with consistently low latency.
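The lazy-loading behavior described above can be pictured as a simple in-memory cache keyed by model name; the function names below are illustrative stand-ins, not the SageMaker internals:

```python
import time

loaded_models = {}  # models currently held in container memory

def fetch_model_from_s3(name):
    """Stand-in for the real S3 download + deserialization step."""
    time.sleep(0.01)  # simulate download latency on a cold start
    return object()   # placeholder for a loaded model

def invoke(name):
    """First invocation of a model pays the download cost; subsequent
    invocations are served from the cached copy with low latency."""
    if name not in loaded_models:
        loaded_models[name] = fetch_model_from_s3(name)  # cold start
    return loaded_models[name]
```

In the real system, infrequently used models are also evicted from memory under pressure, which this sketch omits.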
Behind each MME are model hosting instances, as depicted in the following diagram. These instances load and evict multiple models to and from memory based on the traffic patterns to the models.

SageMaker continues to route inference requests for a model to the instance where the model is already loaded such that the requests are served from a cached model copy (see the following diagram, which shows the request path for the first prediction request vs. the cached prediction request path). However, if the model receives many invocation requests, and there are additional instances for the MME, SageMaker routes some requests to another instance to accommodate the increase. To take advantage of automated model scaling in SageMaker, make sure you have instance auto scaling set up to provision additional instance capacity. Set up your endpoint-level scaling policy with either custom parameters or invocations per minute (recommended) to add more instances to the endpoint fleet.

Model server overview
A model server is a software component that provides a runtime environment for deploying and serving machine learning (ML) models. It acts as an interface between the trained models and client applications that want to make predictions using those models.
The primary purpose of a model server is to allow effortless integration and efficient deployment of ML models into production systems. Instead of embedding the model directly into an application or a specific framework, the model server provides a centralized platform where multiple models can be deployed, managed, and served.
Model servers typically offer the following functionalities:

Model loading – The server loads the trained ML models into memory, making them ready for serving predictions.
Inference API – The server exposes an API that allows client applications to send input data and receive predictions from the deployed models.
Scaling – Model servers are designed to handle concurrent requests from multiple clients. They provide mechanisms for parallel processing and managing resources efficiently to ensure high throughput and low latency.
Integration with backend engines – Model servers have integrations with backend frameworks like DeepSpeed and FasterTransformer to partition large models and run highly optimized inference.

DJL architecture
DJL Serving is an open source, high performance, universal model server. DJL Serving is built on top of DJL, a deep learning library written in the Java programming language. It can take a deep learning model, several models, or workflows and make them available through an HTTP endpoint. DJL Serving supports deploying models from multiple frameworks like PyTorch, TensorFlow, Apache MXNet, ONNX, TensorRT, Hugging Face Transformers, DeepSpeed, FasterTransformer, and more.
DJL Serving offers many features that allow you to deploy your models with high performance:

Ease of use – DJL Serving can serve most models out of the box. Just bring the model artifacts, and DJL Serving can host them.
Multiple device and accelerator support – DJL Serving supports deploying models on CPU, GPU, and AWS Inferentia.
Performance – DJL Serving runs multithreaded inference in a single JVM to boost throughput.
Dynamic batching – DJL Serving supports dynamic batching to increase throughput.
Auto scaling – DJL Serving will automatically scale workers up and down based on the traffic load.
Multi-engine support – DJL Serving can simultaneously host models using different frameworks (such as PyTorch and TensorFlow).
Ensemble and workflow models – DJL Serving supports deploying complex workflows comprised of multiple models, and runs parts of the workflow on CPU and parts on GPU. Models within a workflow can use different frameworks.

In particular, the auto scaling feature of DJL Serving makes it straightforward to ensure the models are scaled appropriately for the incoming traffic. By default, DJL Serving determines the maximum number of workers for a model that can be supported based on the hardware available (CPU cores, GPU devices). You can set lower and upper bounds for each model to make sure that a minimum traffic level can always be served, and that a single model doesn’t consume all available resources.
DJL Serving uses a Netty frontend on top of backend worker thread pools. The frontend uses a single Netty setup with multiple HttpRequestHandlers. Different request handlers will provide support for the Inference API, Management API, or other APIs available from various plugins.
The backend is based around the WorkLoadManager (WLM) module. The WLM takes care of multiple worker threads for each model along with the batching and request routing to them. When multiple models are served, WLM checks the inference request queue size of each model first. If the queue size is greater than two times a model’s batch size, WLM scales up the number of workers assigned to that model.
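The WLM scale-up rule just described — grow a model's worker pool when its request queue exceeds twice its batch size — can be sketched as a single predicate. The max_workers clamp reflects the per-model bounds mentioned earlier; the actual WLM internals may differ:

```python
def should_scale_up(queue_size: int, batch_size: int,
                    current_workers: int, max_workers: int) -> bool:
    """Simplified sketch of the WLM decision: scale up this model's
    workers when its queue exceeds 2x its batch size, within bounds."""
    return queue_size > 2 * batch_size and current_workers < max_workers
```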
Solution overview
The implementation of DJL with an MME differs from the default MMS setup. For DJL Serving with an MME, we compress the following files in the model.tar.gz format that SageMaker Inference is expecting:

model.joblib – For this implementation, we directly push the model artifact into the tarball. In this case, we are working with a .joblib file, so we provide that file in our tarball for our inference script to read. If the artifact is too large, you can also push it to Amazon S3 and point towards that in the serving configuration you define for DJL.
serving.properties – Here you can configure any model server-related environment variables. The power of DJL here is that you can configure minWorkers and maxWorkers for each model tarball, which allows each model to scale up and down at the model server level. For instance, if a singular model is receiving the majority of the traffic for an MME, the model server will scale its workers up dynamically. In this example, we don’t configure these variables and let DJL determine the necessary number of workers depending on our traffic pattern.
model.py – This is the inference script for any custom preprocessing or postprocessing you would like to implement. The model.py file expects your logic to be encapsulated in a handle method by default.
requirements.txt (optional) – By default, DJL comes installed with PyTorch, but any additional dependencies you need can be pushed here.
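The shape of such an inference script can be sketched as follows. The FakeInput class and the linear stand-in model are purely illustrative (DJL Serving supplies its own input object, and the real script would load model.joblib); the point is the handle method contract and the lazy model load:

```python
class FakeInput:
    """Illustrative stand-in for the model server's request input object."""
    def __init__(self, data=None):
        self.data = data

    def is_empty(self):
        return self.data is None

    def get_as_json(self):
        return self.data

model = None

def load_model():
    # In a real script this would be joblib.load("model.joblib");
    # a y = 2x + 1 stand-in keeps the sketch self-contained.
    return lambda xs: [2 * x + 1 for x in xs]

def handle(inputs):
    """DJL Serving looks for inference logic wrapped in a handle method."""
    global model
    if model is None:
        model = load_model()  # lazy-load on the first request
    if inputs.is_empty():
        return None  # warm-up call from the model server
    return model(inputs.get_as_json())
```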

For this example, we showcase the power of DJL with an MME by taking a sample SKLearn model. We run a training job with this model and then create 1,000 copies of this model artifact to back our MME. We then showcase how DJL can dynamically scale to handle any type of traffic pattern that your MME may receive. This can include an even distribution of traffic across all models or even a few popular models receiving the majority of the traffic. You can find all the code in the following GitHub repo.
For this example, we use a SageMaker notebook instance with a conda_python3 kernel and ml.c5.xlarge instance. To perform the load tests, you can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or a larger SageMaker notebook instance. In this example, we scale to over a thousand transactions per second (TPS), so we suggest testing on a heavier EC2 instance such as an ml.c5.18xlarge so that you have more compute to work with.
Create a model artifact
We first need to create our model artifact and data that we use in this example. For this case, we generate some artificial data with NumPy and train using an SKLearn linear regression model with the following code snippet:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import joblib

# Generate dummy data
X = np.random.rand(100, 1)
y = 2 * X + 1 + 0.1 * np.random.randn(100, 1)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)

# Create serialized model artifact
model_filename = "model.joblib"
joblib.dump(model, model_filename)

After you run the preceding code, you should have a model.joblib file created in your local environment.
Pull the DJL Docker image
The Docker image djl-inference:0.23.0-cpu-full-v1.0 is our DJL serving container used in this example. You can adjust the following URL depending on your Region:
inference_image_uri = ""
Optionally, you can also use this image as a base image and extend it to build your own Docker image on Amazon Elastic Container Registry (Amazon ECR) with any other dependencies you need.
Create the model file
First, we create a file called serving.properties. This instructs DJL Serving to use the Python engine. We also define the max_idle_time of a worker to be 600 seconds, which ensures that we are slower to scale down the number of workers we have per model. We don’t adjust the minWorkers and maxWorkers settings, and instead let DJL dynamically compute the number of workers needed depending on the traffic each model is receiving. To see the complete list of configuration options, refer to Engine Configuration.
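A minimal serving.properties consistent with this description might look like the following (only the Python engine and the idle-time setting are implied by the text; any other options are left to their defaults):

```properties
engine=Python
max_idle_time=600
```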


Next, we create our model.py file, which defines the model loading and inference logic. For MMEs, each model.py file is specific to a model. Models are stored in their own paths under the model store (usually /opt/ml/model/). When loading models, they will be loaded under the model store path in their own directory. The full example in this demo can be seen in the GitHub repo.
We create a model.tar.gz file that includes our model (model.joblib), serving.properties, and model.py:

import subprocess

# Build tar file with model data + inference code; replace model.joblib with your own model artifact
bashCommand = "tar -cvpzf model.tar.gz model.joblib serving.properties model.py requirements.txt"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

For demonstration purposes, we make 1,000 copies of the same model.tar.gz file to represent the large number of models to be hosted. In production, you need to create a model.tar.gz file for each of your models.
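One way to organize the copies is to repeat the same tarball under 1,000 object keys whose names match the TargetModel values used later in the load-test scripts; a sketch of the naming (the copy and upload steps themselves are omitted):

```python
# sklearn-0.tar.gz through sklearn-999.tar.gz, matching the TargetModel
# names the Locust scripts will invoke
model_keys = [f"sklearn-{i}.tar.gz" for i in range(1000)]
```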
Lastly, we upload these models to Amazon S3.
Create a SageMaker model
We now create a SageMaker model. We use the ECR image defined earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure Mode as MultiModel. This tells DJLServing that we’re creating an MME.

mme_model_name = "sklearn-djl-mme" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + mme_model_name)

create_model_response = sm_client.create_model(
    ModelName=mme_model_name,
    ExecutionRoleArn=role,  # your SageMaker execution role ARN
    PrimaryContainer={"Image": inference_image_uri, "Mode": "MultiModel", "ModelDataUrl": mme_artifacts},
)

Create a SageMaker endpoint
In this demo, we use 20 ml.c5d.18xlarge instances to scale to a TPS in the thousands range. Make sure to get a limit increase on your instance type, if necessary, to achieve the TPS you are targeting.

mme_epc_name = "sklearn-djl-mme-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=mme_epc_name,
    ProductionVariants=[
        {
            "VariantName": "sklearnvariant",
            "ModelName": mme_model_name,
            "InstanceType": "ml.c5d.18xlarge",
            "InitialInstanceCount": 20,
        }
    ],
)

Load testing
At the time of writing, the SageMaker in-house load testing tool Amazon SageMaker Inference Recommender doesn’t natively support testing for MMEs. Therefore, we use the open source Python tool Locust. Locust is straightforward to set up and can track metrics such as TPS and end-to-end latency. For a full understanding of how to set it up with SageMaker, see Best practices for load testing Amazon SageMaker real-time inference endpoints.
In this use case, we have three different traffic patterns we want to simulate with MMEs, so we have the following three Python scripts that align with each pattern. Our goal here is to prove that, regardless of what our traffic pattern is, we can achieve the same target TPS and scale appropriately.

Evenly distributed traffic across all models
90% of traffic to 10 popular models
90% of traffic to a single hot model

We can specify a weight in our Locust script to assign traffic across different portions of our models. For instance, with our single hot model, we implement two methods as follows:

# popular model
def sendPopular(self):
    request_meta = {
        "request_type": "InvokeEndpoint",
        "name": "SageMaker",
        "start_time": time.time(),
        "response_length": 0,
        "response": None,
        "context": {},
        "exception": None,
    }
    start_perf_counter = time.perf_counter()
    # endpoint name, payload, and content type are set up in the Locust user class
    response = self.sagemaker_client.invoke_endpoint(
        EndpointName=self.endpoint_name,
        Body=self.payload,
        ContentType=self.content_type,
        TargetModel="sklearn-0.tar.gz",
    )

# rest of the models
def sendRest(self):
    request_meta = {
        "request_type": "InvokeEndpoint",
        "name": "SageMaker",
        "start_time": time.time(),
        "response_length": 0,
        "response": None,
        "context": {},
        "exception": None,
    }
    start_perf_counter = time.perf_counter()
    response = self.sagemaker_client.invoke_endpoint(
        EndpointName=self.endpoint_name,
        Body=self.payload,
        ContentType=self.content_type,
        TargetModel=f"sklearn-{random.randint(1, 989)}.tar.gz",
    )
    response_body = response["Body"].read()

We can then assign a weight to each method, so that each method receives a specific percentage of the traffic:

# assign weights to models
class MyUser(BotoUser):

    # 90% of traffic to the singular hot model
    @task(9)
    def send_request(self):
        self.sendPopular()

    # remaining 10% of traffic spread across the other models
    @task(1)
    def send_request_major(self):
        self.sendRest()

For 20 ml.c5d.18xlarge instances, we see the following invocation metrics on the Amazon CloudWatch console. These values remain fairly consistent across all three traffic patterns. To understand CloudWatch metrics for SageMaker real-time inference and MMEs better, refer to SageMaker Endpoint Invocation Metrics.

You can find the rest of the Locust scripts in the locust-utils directory in the GitHub repository.
In this post, we discussed how an MME can dynamically adjust the compute power assigned to each model based on the model’s traffic pattern. This newly launched feature is available in all AWS Regions where SageMaker is available. Note that at the time of announcement, only CPU instances are supported. To learn more, refer to Supported algorithms, frameworks, and instances.

About the Authors
Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.
Xu Deng is a Software Engineer Manager with the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and snowboarding.
Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focuses on building solutions for large model inference. Prior to AWS he worked in the Amazon Grocery org building new payment features for customers worldwide. Outside of work, he enjoys skiing, the outdoors, and watching sports.
Rohith Nallamaddi is a Software Development Engineer at AWS. He works on optimizing deep learning workloads on GPUs, building high performance ML inference and serving solutions. Prior to this, he worked on building microservices based on AWS for Amazon F3 business. Outside of work he enjoys playing and watching sports.

Use Amazon Titan models for image generation, editing, and searching

Amazon Bedrock provides a broad range of high-performing foundation models from Amazon and other leading AI companies, including Anthropic, AI21, Meta, Cohere, and Stability AI, and covers a wide range of use cases, including text and image generation, searching, chat, reasoning and acting agents, and more. The new Amazon Titan Image Generator model allows content creators to quickly generate high-quality, realistic images using simple English text prompts. The advanced AI model understands complex instructions with multiple objects and returns studio-quality images suitable for advertising, ecommerce, and entertainment. Key features include the ability to refine images by iterating on prompts, automatic background editing, and generating multiple variations of the same scene. Creators can also customize the model with their own data to output on-brand images in a specific style. Importantly, Titan Image Generator has in-built safeguards, like invisible watermarks on all AI-generated images, to encourage responsible use and mitigate the spread of disinformation. This innovative technology makes producing custom images in large volume for any industry more accessible and efficient.
The new Amazon Titan Multimodal Embeddings model  helps build more accurate search and recommendations by understanding text, images, or both. It converts images and English text into semantic vectors, capturing meaning and relationships in your data. You can combine text and images like product descriptions and photos to identify items more effectively. The vectors power speedy, accurate search experiences. Titan Multimodal Embeddings is flexible in vector dimensions, enabling optimization for performance needs. An asynchronous API and Amazon OpenSearch Service connector make it easy to integrate the model into your neural search applications.
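Once descriptions and photos are converted to semantic vectors, search reduces to comparing those vectors. A minimal sketch using cosine similarity, one common choice for ranking embeddings (the service's internal ranking may differ):

```python
import math

def cosine_similarity(u, v):
    """Similarity between two embedding vectors: 1.0 for identical
    directions, 0.0 for orthogonal (unrelated) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A search application would embed the query, score it against all stored item vectors, and return the top-scoring items.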
In this post, we walk through how to use the Titan Image Generator and Titan Multimodal Embeddings models via the AWS Python SDK.
Image generation and editing
In this section, we demonstrate the basic coding patterns for using the AWS SDK to generate new images and perform AI-powered edits on existing images. Code examples are provided in Python, and JavaScript (Node.js) is also available in this GitHub repository.
Before you can write scripts that use the Amazon Bedrock API, you need to install the appropriate version of the AWS SDK in your environment. For Python scripts, you can use the AWS SDK for Python (Boto3). Python users may also want to install the Pillow module, which facilitates image operations like loading and saving images. For setup instructions, refer to the GitHub repository.
Additionally, enable access to the Amazon Titan Image Generator and Titan Multimodal Embeddings models. For more information, refer to Model access.
Helper functions
The following function sets up the Amazon Bedrock Boto3 runtime client and generates images by taking payloads of different configurations (which we discuss later in this post):

import boto3
import json, base64, io
from random import randint
from PIL import Image

bedrock_runtime_client = boto3.client("bedrock-runtime")

def titan_image(
    payload: dict,
    num_image: int = 2,
    cfg: float = 10.0,
    seed: int = None,
    modelId: str = "amazon.titan-image-generator-v1",
) -> list:
    # ImageGenerationConfig Options:
    # - numberOfImages: Number of images to be generated
    # - quality: Quality of generated images, can be standard or premium
    # - height: Height of output image(s)
    # - width: Width of output image(s)
    # - cfgScale: Scale for classifier-free guidance
    # - seed: The seed to use for reproducibility
    seed = seed if seed is not None else randint(0, 2147483647)
    body = json.dumps(
        {
            **payload,
            "imageGenerationConfig": {
                "numberOfImages": num_image,  # Range: 1 to 5
                "quality": "premium",  # Options: standard/premium
                "height": 1024,  # Height of output image(s)
                "width": 1024,  # Width of output image(s)
                "cfgScale": cfg,  # Range: 1.0 (exclusive) to 10.0
                "seed": seed,  # Range: 0 to 2147483647
            },
        }
    )

    response = bedrock_runtime_client.invoke_model(body=body, modelId=modelId)

    response_body = json.loads(response.get("body").read())
    images = [
        Image.open(io.BytesIO(base64.b64decode(base64_image)))
        for base64_image in response_body.get("images")
    ]
    return images

Generate images from text
Scripts that generate a new image from a text prompt follow this implementation pattern:

Configure a text prompt and optional negative text prompt.
Use the BedrockRuntime client to invoke the Titan Image Generator model.
Parse and decode the response.
Save the resulting images to disk.

The following is a typical image generation script for the Titan Image Generator model:

# Text Variation
# textToImageParams Options:
#   text: prompt to guide the model on how to generate variations
#   negativeText: prompts to guide the model on what you don’t want in image
images = titan_image(
    {
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {
            "text": "two dogs walking down an urban street, facing the camera",  # Required
            "negativeText": "cars",  # Optional
        },
    }
)

This will produce images similar to the following.

Response Image 1
Response Image 2

Image variants
Image variation provides a way to generate subtle variants of an existing image. The following code snippet uses one of the images generated in the previous example to create variant images:

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

# Image Variation
# ImageVariationParams Options:
#   text: prompt to guide the model on how to generate variations
#   negativeText: prompts to guide the model on what you don’t want in image
#   images: base64 string representation of the input image, only 1 is supported
images = titan_image(
    {
        "taskType": "IMAGE_VARIATION",
        "imageVariationParams": {
            "text": "two dogs walking down an urban street, facing the camera",  # Required
            "images": [input_image],  # One image is required
            "negativeText": "cars",  # Optional
        },
    }
)

This will produce images similar to the following.

Original Image
Response Image 1
Response Image 2

Edit an existing image
The Titan Image Generator model allows you to add, remove, or replace elements or areas within an existing image. You specify which area to affect by providing one of the following:

Mask image – A mask image is a binary image in which the 0-value pixels represent the area you want to affect and the 255-value pixels represent the area that should remain unchanged.
Mask prompt – A mask prompt is a natural language text description of the elements you want to affect, which is interpreted by an in-house text-to-segmentation model.

For more information, refer to Prompt Engineering Guidelines.
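As an aside, a binary mask image following this 0/255 convention can be produced with Pillow. This is an illustrative sketch, not code from the post; the image size and rectangle coordinates are arbitrary placeholders:

```python
from PIL import Image, ImageDraw

# Start with an all-255 (keep unchanged) mask matching the source image size
mask = Image.new("L", (1024, 1024), color=255)
draw = ImageDraw.Draw(mask)
# Paint the area to affect with 0-value pixels
draw.rectangle([256, 384, 640, 960], fill=0)
mask.save("mask.png")
```

You would then base64-encode mask.png and pass it as the maskImage parameter, as shown in the inpainting script later in this section.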
Scripts that apply an edit to an image follow this implementation pattern:

Load the image to be edited from disk.
Convert the image to a base64-encoded string.
Configure the mask through one of the following methods:

Load a mask image from disk, encoding it as base64 and setting it as the maskImage parameter.
Set the maskText parameter to a text description of the elements to affect.

Specify the new content to be generated using one of the following options:

To add or replace an element, set the text parameter to a description of the new content.
To remove an element, omit the text parameter completely.

Use the BedrockRuntime client to invoke the Titan Image Generator model.
Parse and decode the response.
Save the resulting images to disk.

Object editing: Inpainting with a mask image
The following is a typical image editing script for the Titan Image Generator model using maskImage. We take one of the images generated earlier and provide a mask image, where 0-value pixels are rendered as black and 255-value pixels as white. We also replace one of the dogs in the image with a cat using a text prompt.

with open("<YOUR_MASK_IMAGE_FILE_PATH>", "rb") as image_file:
    mask_image = base64.b64encode(image_file.read()).decode("utf8")

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_ORIGINAL_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

# Inpainting
# inPaintingParams Options:
#   text: prompt to guide inpainting
#   negativeText: prompts to guide the model on what you don’t want in image
#   image: base64 string representation of the input image
#   maskImage: base64 string representation of the input mask image
#   maskPrompt: prompt used for auto editing to generate mask

images = titan_image(
    {
        "taskType": "INPAINTING",
        "inPaintingParams": {
            "text": "a cat",  # Optional
            "negativeText": "bad quality, low res",  # Optional
            "image": input_image,  # Required
            "maskImage": mask_image,  # One of "maskImage" or "maskPrompt" is required
        },
    }
)

This will produce images similar to the following.

Original Image
Mask Image
Edited Image

Object removal: Inpainting with a mask prompt
In another example, we use maskPrompt to specify an object to edit in an image taken from the earlier steps. By omitting the text prompt, the object will simply be removed:

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

images = titan_image(
    {
        "taskType": "INPAINTING",
        "inPaintingParams": {
            "negativeText": "bad quality, low res",  # Optional
            "image": input_image,  # Required
            "maskPrompt": "white dog",  # One of "maskImage" or "maskPrompt" is required
        },
    }
)

This will produce images similar to the following.

Original Image
Response Image

Background editing: Outpainting
Outpainting is useful when you want to replace the background of an image. You can also extend the bounds of an image for a zoom-out effect. In the following example script, we use maskPrompt to specify which object to keep; you can also use maskImage. The parameter outPaintingMode specifies whether to allow modification of the pixels inside the mask. If set as DEFAULT, pixels inside of the mask are allowed to be modified so that the reconstructed image will be consistent overall. This option is recommended if the maskImage provided doesn’t represent the object with pixel-level precision. If set as PRECISE, the modification of pixels inside of the mask is prevented. This option is recommended if using a maskPrompt or a maskImage that represents the object with pixel-level precision.

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

# OutPaintingParams Options:
#   text: prompt to guide outpainting
#   negativeText: prompts to guide the model on what you don’t want in image
#   image: base64 string representation of the input image
#   maskImage: base64 string representation of the input mask image
#   maskPrompt: prompt used for auto editing to generate mask
#   outPaintingMode: DEFAULT | PRECISE
images = titan_image(
    {
        "taskType": "OUTPAINTING",
        "outPaintingParams": {
            "text": "forest",  # Required
            "image": input_image,  # Required
            "maskPrompt": "dogs",  # One of "maskImage" or "maskPrompt" is required
            "outPaintingMode": "PRECISE",  # One of "PRECISE" or "DEFAULT"
        },
    }
)

This will produce images similar to the following.

Original Image
Response Image



In addition, the effects of different values for outPaintingMode, with a maskImage that doesn’t outline the object with pixel-level precision, are as follows.

Original Image
Mask Image
Response Image



This section has given you an overview of the operations you can perform with the Titan Image Generator model. Specifically, these scripts demonstrate text-to-image, image variation, inpainting, and outpainting tasks. You should be able to adapt the patterns for your own applications by referencing the parameter details for those task types detailed in Amazon Titan Image Generator documentation.
Multimodal embedding and searching
You can use the Amazon Titan Multimodal Embeddings model for enterprise tasks such as image search and similarity-based recommendation, and it has built-in mitigation that helps reduce bias in search results. Multiple embedding dimension sizes offer latency/accuracy trade-offs for different needs, and all can be customized with a simple API to adapt to your own data while preserving data security and privacy. Amazon Titan Multimodal Embeddings is provided as a set of simple APIs for real-time or asynchronous batch search and recommendation applications, and can be connected to different vector databases, including Amazon OpenSearch Service.
Helper functions
The following function converts an image, and optionally text, into multimodal embeddings:

def titan_multimodal_embedding(
    image_path: str = None,  # maximum 2048 x 2048 pixels
    description: str = None,  # English only and max input tokens 128
    dimension: int = 1024,  # 1,024 (default), 384, 256
    model_id: str = "amazon.titan-embed-image-v1",
):
    payload_body = {}
    embedding_config: dict = {"embeddingConfig": {"outputEmbeddingLength": dimension}}

    # You can specify either text or image or both
    if image_path:
        # Maximum image size supported is 2048 x 2048 pixels
        with open(image_path, "rb") as image_file:
            payload_body["inputImage"] = base64.b64encode(image_file.read()).decode("utf8")
    if description:
        payload_body["inputText"] = description

    assert payload_body, "please provide either an image and/or a text description"

    response = bedrock_runtime_client.invoke_model(
        body=json.dumps({**payload_body, **embedding_config}),
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )

    return json.loads(response.get("body").read())

The following function returns the top similar multimodal embeddings given a query embedding. Note that in practice, you would use a managed vector database, such as OpenSearch Service; the following is for illustration purposes only:

from scipy.spatial.distance import cdist
import numpy as np

def search(query_emb: np.array, indexes: np.array, top_k: int = 1):
    dist = cdist(query_emb, indexes, metric="cosine")
    return dist.argsort(axis=-1)[0, :top_k], np.sort(dist, axis=-1)[0, :top_k]
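As a quick sanity check, here is a toy example with hypothetical 3-D vectors (not from the post). The second indexed vector points in the same direction as the query, so it should be returned first with a cosine distance of zero:

```python
import numpy as np
from scipy.spatial.distance import cdist

def search(query_emb: np.array, indexes: np.array, top_k: int = 1):
    dist = cdist(query_emb, indexes, metric="cosine")
    return dist.argsort(axis=-1)[0, :top_k], np.sort(dist, axis=-1)[0, :top_k]

# Two hypothetical 3-D "embeddings" in the index; the second is aligned with the query
indexes = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
query = np.array([[0.0, 2.0, 0.0]])  # same direction as the second index vector

idx, dist = search(query, indexes)
print(idx)  # → [1]: the aligned vector, at cosine distance 0.0
```

Cosine distance ignores vector magnitude, which is why the query matches the unit vector it is parallel to despite being twice as long.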

Synthetic dataset
For illustration purposes, we use Anthropic’s Claude 2.1 model in Amazon Bedrock to randomly generate seven different products, each with three variants, using the following prompt:
Generate a list of 7 items description for an online e-commerce shop, each comes with 3 variants of color or type. All with separate full sentence description.
The following is the list of returned outputs:

1. T-shirt
– A red cotton t-shirt with a crew neck and short sleeves.
– A blue cotton t-shirt with a v-neck and short sleeves.
– A black polyester t-shirt with a scoop neck and cap sleeves.

2. Jeans
– Classic blue relaxed fit denim jeans with a mid-rise waist.
– Black skinny fit denim jeans with a high-rise waist and ripped details at the knees.
– Stonewash straight leg denim jeans with a standard waist and front pockets.

3. Sneakers
– White leather low-top sneakers with an almond toe cap and thick rubber outsole.
– Gray mesh high-top sneakers with neon green laces and a padded ankle collar.
– Tan suede mid-top sneakers with a round toe and ivory rubber cupsole.

4. Backpack
– A purple nylon backpack with padded shoulder straps, front zipper pocket and laptop sleeve.
– A gray canvas backpack with brown leather trims, side water bottle pockets and drawstring top closure.
– A black leather backpack with multiple interior pockets, top carry handle and adjustable padded straps.

5. Smartwatch
– A silver stainless steel smartwatch with heart rate monitor, GPS tracker and sleep analysis.
– A space gray aluminum smartwatch with step counter, phone notifications and calendar syncing.
– A rose gold smartwatch with activity tracking, music controls and customizable watch faces.

6. Coffee maker
– A 12-cup programmable coffee maker in brushed steel with removable water tank and keep warm plate.
– A compact 5-cup single serve coffee maker in matt black with travel mug auto-dispensing feature.
– A retro style stovetop percolator coffee pot in speckled enamel with stay-cool handle and glass knob lid.

7. Yoga mat
– A teal 4mm thick yoga mat made of natural tree rubber with moisture-wicking microfiber top.
– A purple 6mm thick yoga mat made of eco-friendly TPE material with integrated carrying strap.
– A patterned 5mm thick yoga mat made of PVC-free material with towel cover included.

We assign the above response to the variable response_cat. Then we use the Titan Image Generator model to create product images for each item:

import re

def extract_text(input_string):
    pattern = r"- (.*?)($|\n)"
    matches = re.findall(pattern, input_string)
    extracted_texts = [match[0] for match in matches]
    return extracted_texts

product_description = extract_text(response_cat)

titles = []
for prompt in product_description:
    images = titan_image(
        {
            "taskType": "TEXT_IMAGE",
            "textToImageParams": {
                "text": prompt,  # Required
            },
        }
    )
    title = "_".join(prompt.split()[:4]).lower()
    titles.append(title)
    images[0].save(f"{title}.png", format="png")

All the generated images can be found in the appendix at the end of this post.
Multimodal dataset indexing
Use the following code for multimodal dataset indexing:

multimodal_embeddings = []
for image_filename, description in zip(titles, product_description):
    embedding = titan_multimodal_embedding(f"{image_filename}.png", dimension=1024)["embedding"]
    multimodal_embeddings.append(embedding)

Multimodal searching
Use the following code for multimodal searching:

query_prompt = "<YOUR_QUERY_TEXT>"
query_embedding = titan_multimodal_embedding(description=query_prompt, dimension=1024)["embedding"]
# If searching via image instead:
# query_image_filename = "<YOUR_QUERY_IMAGE>"
# query_embedding = titan_multimodal_embedding(image_path=query_image_filename, dimension=1024)["embedding"]
idx_returned, dist = search(np.array(query_embedding)[None], np.array(multimodal_embeddings))

The following are some search results.



“white sneaker”

“leather backpack”

“purple backpack”

The post introduces the Amazon Titan Image Generator and Amazon Titan Multimodal Embeddings models. Titan Image Generator enables you to create custom, high-quality images from text prompts. Key features include iterating on prompts, automatic background editing, and data customization. It has safeguards like invisible watermarks to encourage responsible use. Titan Multimodal Embeddings converts text, images, or both into semantic vectors to power accurate search and recommendations. We then provided Python code samples for using these services, and demonstrated generating images from text prompts and iterating on those images; editing existing images by adding, removing, or replacing elements specified by mask images or mask text; creating multimodal embeddings from text, images, or both; and searching for similar multimodal embeddings to a query. We also demonstrated using a synthetic e-commerce dataset indexed and searched using Titan Multimodal Embeddings. The aim of this post is to enable developers to start using these new AI services in their applications. The code patterns can serve as templates for custom implementations.
All the code is available on the GitHub repository. For more information, refer to the Amazon Bedrock User Guide.

About the Authors
Rohit Mittal is a Principal Product Manager at Amazon AI building multi-modal foundation models. He recently led the launch of the Amazon Titan Image Generator model as part of the Amazon Bedrock service. Experienced in AI/ML, NLP, and Search, he is interested in building products that solve customer pain points with innovative technology.
Dr. Ashwin Swaminathan is a Computer Vision and Machine Learning researcher, engineer, and manager with 12+ years of industry experience and 5+ years of academic research experience. Strong fundamentals and proven ability to quickly gain knowledge and contribute to newer and emerging areas.
Dr. Yusheng Xie is a Principal Applied Scientist at Amazon AGI. His work focuses on building multi-modal foundation models. Before joining AGI, he led various multi-modal AI development efforts at AWS, such as Amazon Titan Image Generator and Amazon Textract Queries.
Dr. Hao Yang is a Principal Applied Scientist at Amazon. His main research interests are object detection and learning with limited annotations. Outside work, Hao enjoys watching films, photography, and outdoor activities.
Dr. Davide Modolo is an Applied Science Manager at Amazon AGI, working on building large multimodal foundational models. Before joining Amazon AGI, he was a manager/lead for 7 years in AWS AI Labs (Amazon Bedrock and Amazon Rekognition). Outside of work, he enjoys traveling and playing any kind of sport, especially soccer.
Dr. Baichuan Sun is currently serving as a Sr. AI/ML Solutions Architect at AWS, focusing on generative AI, and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends.
Dr. Kai Zhu currently works as Cloud Support Engineer at AWS, helping customers with issues in AI/ML related services like SageMaker, Bedrock, etc. He is a SageMaker Subject Matter Expert. Experienced in data science and data engineering, he is interested in building generative AI powered projects.
Kris Schultz has spent over 25 years bringing engaging user experiences to life by combining emerging technologies with world class design. In his role as Senior Product Manager, Kris helps design and build AWS services to power Media & Entertainment, Gaming, and Spatial Computing.

In the following sections, we demonstrate challenging sample use cases like text insertion, hands, and reflections to highlight the capabilities of the Titan Image Generator model. We also include the sample output images produced in earlier examples.
The Titan Image Generator model excels at complex workflows like inserting readable text into images. This example demonstrates Titan’s ability to clearly render uppercase and lowercase letters in a consistent style within an image.

a corgi wearing a baseball cap with text “genai”
a happy boy giving a thumbs up, wearing a tshirt with text “generative AI”

The Titan Image Generator model also has the ability to generate detailed AI images. The following images show realistic hands and fingers with visible detail, going beyond more basic AI image generation that may lack such specificity; notice the precise depiction of the pose and anatomy.

a person’s hand viewed from above
a close look at a person’s hands holding a coffee mug

The images generated by the Titan Image Generator model spatially arrange objects and accurately reflect mirror effects, as demonstrated in the following examples.

A cute fluffy white cat stands on its hind legs, peering curiously into an ornate golden mirror. In the reflection the cat sees itself
beautiful sky lake with reflections on the water

Synthetic product images
The following are the product images generated earlier in this post for the Titan Multimodal Embeddings model.

Build a contextual chatbot application using Knowledge Bases for Amazo …

Modern chatbots can serve as digital agents, providing a new avenue for delivering 24/7 customer service and support across many industries. Their popularity stems from the ability to respond to customer inquiries in real time and handle multiple queries simultaneously in different languages. Chatbots also offer valuable data-driven insights into customer behavior while scaling effortlessly as the user base grows; therefore, they present a cost-effective solution for engaging customers. Chatbots use the advanced natural language capabilities of large language models (LLMs) to respond to customer questions. They can understand conversational language and respond naturally. However, chatbots that merely answer basic questions have limited utility. To become trusted advisors, chatbots need to provide thoughtful, tailored responses.
One way to enable more contextual conversations is by linking the chatbot to internal knowledge bases and information systems. Integrating proprietary enterprise data from internal knowledge bases enables chatbots to contextualize their responses to each user’s individual needs and interests. For example, a chatbot could suggest products that match a shopper’s preferences and past purchases, explain details in language adapted to the user’s level of expertise, or provide account support by accessing the customer’s specific records. The ability to intelligently incorporate information, understand natural language, and provide customized replies in a conversational flow allows chatbots to deliver real business value across diverse use cases.
The popular architecture pattern of Retrieval Augmented Generation (RAG) is often used to augment user query context and responses. RAG combines the capabilities of LLMs with the grounding in facts and real-world knowledge that comes from retrieving relevant texts and passages from a corpus of data. These retrieved texts are then used to inform and ground the output, reducing hallucination and improving relevance.
In this post, we illustrate contextually enhancing a chatbot by using Knowledge Bases for Amazon Bedrock, a fully managed serverless service. The Knowledge Bases for Amazon Bedrock integration allows our chatbot to provide more relevant, personalized responses by linking user queries to related information data points. Internally, Amazon Bedrock uses embeddings stored in a vector database to augment user query context at runtime and enable a managed RAG architecture solution. We use the Amazon letters to shareholders dataset to develop this solution.
Retrieval Augmented Generation
RAG is an approach to natural language generation that incorporates information retrieval into the generation process. RAG architecture involves two key workflows: data preprocessing through ingestion, and text generation using enhanced context.
The data ingestion workflow uses LLMs to create embedding vectors that represent the semantic meaning of texts. Documents are split into chunks, and embeddings of those chunks are stored as indexes in a vector database; embeddings are likewise created for user questions. The text generation workflow then takes a question's embedding vector and uses it to retrieve the most similar document chunks based on vector similarity. It augments the prompt with these relevant chunks to generate an answer using the LLM. For more details, refer to the Primer on Retrieval Augmented Generation, Embeddings, and Vector Databases section in Preview – Connect Foundation Models to Your Company Data Sources with Agents for Amazon Bedrock.
The following diagram illustrates the high-level RAG architecture.

Although the RAG architecture has many advantages, it involves multiple components, including a database, retrieval mechanism, prompt, and generative model. Managing these interdependent parts can introduce complexities in system development and deployment. The integration of retrieval and generation also requires additional engineering effort and computational resources. Some open source libraries provide wrappers to reduce this overhead; however, changes to libraries can introduce errors and add additional overhead of versioning. Even with open source libraries, significant effort is required to write code, determine optimal chunk size, generate embeddings, and more. This setup work alone can take weeks depending on data volume.
Therefore, a managed solution that handles these undifferentiated tasks could streamline and accelerate the process of implementing and managing RAG applications.
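To make the two workflows concrete, here is a minimal, illustrative sketch. The bag-of-words "embeddings" and the two sample chunks are toy stand-ins, not the real embedding model, vector database, or dataset used by the service:

```python
from collections import Counter
import math

def embed(text):
    # Toy "embedding": a bag-of-words vector (a real system uses an embedding model)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion: split documents into chunks and index their embeddings
chunks = [
    "AWS revenue grew 29% year-over-year in 2022.",
    "Amazon asked corporate employees back to the office.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Generation: retrieve the most similar chunk and prepend it to the prompt
query = "What was AWS revenue growth in 2022?"
best = max(index, key=lambda item: cosine(embed(query), item[1]))[0]
prompt = f"Context: {best}\n\nQuestion: {query}"
```

Even this toy version shows the moving parts a production RAG system must manage: chunking, embedding, indexing, retrieval, and prompt augmentation.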
Knowledge Bases for Amazon Bedrock
Knowledge Bases for Amazon Bedrock is a serverless option to build powerful conversational AI systems using RAG. It offers fully managed data ingestion and text generation workflows.
For data ingestion, it handles creating, storing, managing, and updating text embeddings of document data in the vector database automatically. It splits the documents into manageable chunks for efficient retrieval. The chunks are then converted to embeddings and written to a vector index, while allowing you to see the source documents when answering a question.
For text generation, Amazon Bedrock provides the RetrieveAndGenerate API to create embeddings of user queries, and retrieves relevant chunks from the vector database to generate accurate responses. It also supports source attribution and short-term memory needed for RAG applications.
This enables you to focus on your core business applications and removes the undifferentiated heavy lifting.
Solution overview
The solution presented in this post uses a chatbot created using a Streamlit application and includes the following AWS services:

Amazon Simple Storage Service (Amazon S3) as source
Knowledge Bases for Amazon Bedrock for data ingestion
An Amazon OpenSearch Serverless vector store to save text embeddings
AWS Lambda as an API function to invoke the Knowledge Bases API

The following diagram is a common solution architecture pattern you can use to integrate any chatbot application to Knowledge Bases for Amazon Bedrock.

This architecture includes the following steps:

A user interacts with the Streamlit chatbot interface and submits a query in natural language
This triggers a Lambda function, which invokes the Knowledge Bases RetrieveAndGenerate API. Internally, Knowledge Bases uses an Amazon Titan embedding model to convert the user query to a vector and finds chunks that are semantically similar to it. The user prompt is then augmented with the chunks retrieved from the knowledge base. The prompt, along with the additional context, is sent to an LLM for response generation. In this solution, we use Anthropic Claude Instant as our LLM to generate user responses using the additional context. Note that this solution is supported in Regions where Anthropic Claude on Amazon Bedrock is available.
A contextually relevant response is sent back to the chatbot application and user.

Amazon Bedrock users need to request access to foundation models before they are available for use. This is a one-time action and takes less than a minute. For this solution, you’ll need to enable access to the Titan Embeddings G1 – Text and Claude Instant – v1.2 model in Amazon Bedrock. For more information, refer to Model access.
Clone the GitHub repo
The solution presented in this post is available in the following GitHub repo. You need to clone the GitHub repository to your local machine. Open a terminal window and run the following command. Note this is one single git clone command.

git clone --depth 2 --filter=blob:none --no-checkout https://github.com/aws-samples/amazon-bedrock-samples && cd amazon-bedrock-samples && git checkout main rag-solutions/contextual-chatbot-using-knowledgebase

Upload your knowledge dataset to Amazon S3
We download the dataset for our knowledge base and upload it into an S3 bucket. This dataset will feed and power the knowledge base. Complete the following steps:

Navigate to the Annual reports, proxies and shareholder letters data repository and download the last few years of Amazon shareholder letters.
On the Amazon S3 console, choose Buckets in the navigation pane.
Choose Create bucket.
Name the bucket knowledgebase-<your-awsaccount-number>.
Leave all other bucket settings as default and choose Create.
Navigate to the knowledgebase-<your-awsaccount-number> bucket.
Choose Create folder and name it dataset.
Leave all other folder settings as default and choose Create.
Navigate back to the bucket home and choose Create folder to create a new folder and name it lambdalayer.
Leave all other settings as default and choose Create.
Navigate to the dataset folder.
Upload the annual reports, proxies and shareholder letters dataset files you downloaded earlier to this folder and choose Upload.
Navigate to the lambdalayer folder.
Upload the file available under the /lambda/layer folder in the GitHub repo you cloned earlier and choose Upload. You will use this Lambda layer code later to create the Lambda function.

Create a knowledge base
In this step, we create a knowledge base using the Amazon shareholder letters dataset we uploaded to our S3 bucket in the previous step.

On the Amazon Bedrock console, under Orchestration in the navigation pane, choose Knowledge base.
Choose Create knowledge base.
In the Knowledge base details section, enter a name and optional description.
In the IAM permissions section, select Create and use a new service role and enter a name for the role.
Add tags as needed.
Choose Next.
Leave Data source name as the default name.
For S3 URI, choose Browse S3 to choose the S3 bucket knowledgebase-<your-account-number>/dataset/. You need to point to the bucket and dataset folder you created in the previous steps.
In the Advanced settings section, leave the default values (if you want, you can change the default chunking strategy and specify the chunk size and overlay in percentage).
Choose Next.
For Embeddings model, select Titan Embedding G1 – Text.
For Vector database, you can either select Quick create a new vector store or Choose a vector store you have created. Note that, to use the vector store of your choice, you need to have a vector store preconfigured for use. We currently support four vector engine types: the vector engine for Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, and Redis Enterprise Cloud. For this post, we select Quick create a new vector store, which by default creates a new OpenSearch Serverless vector store in your account.
Choose Next.
On the Review and create page, review all the information, or choose Previous to modify any options.
Choose Create knowledge base. The knowledge base creation process begins and the status shows In progress. It will take a few minutes to create the vector store and knowledge base. Don't navigate away from the page; otherwise, creation will fail.
When the knowledge base status is in the Ready state, note down the knowledge base ID. You will use it in the next steps to configure the Lambda function.
Now that the knowledge base is ready, we need to sync our Amazon shareholder letters data to it. In the Data Source section of the knowledge base details page, choose Sync to trigger the data ingestion process from the S3 bucket to the knowledge base.

This sync process splits the document files into smaller chunks of the chunk size specified earlier, generates vector embeddings using the selected text embedding model, and stores them in the vector store managed by Knowledge Bases for Amazon Bedrock.
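The splitting step can be pictured with a small sketch. This is a hypothetical helper, not the service's actual implementation, and it measures chunk size in characters rather than tokens:

```python
def chunk_text(text: str, chunk_size: int = 300, overlap_pct: int = 20) -> list:
    # Advance by chunk_size minus the requested percentage overlap,
    # so consecutive chunks share overlap_pct% of their characters
    step = max(1, chunk_size * (100 - overlap_pct) // 100)
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("0123456789" * 100)  # a 1,000-character toy document
print(len(chunks))  # 5 chunks; consecutive chunks share 60 characters (20% of 300)
```

Overlap between consecutive chunks helps preserve context that would otherwise be cut at a chunk boundary, at the cost of some index redundancy.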

When the dataset sync is complete, the status of the data source will change to the Ready state. Note that if you add any additional documents to the S3 data folder, you need to re-sync the knowledge base.

Congratulations, your knowledge base is ready.
Note that you can also use Knowledge Bases for Amazon Bedrock service APIs and the AWS Command Line Interface (AWS CLI) to programmatically create a knowledge base. You will need to run various sections of the Jupyter notebook provided under the /notebook folder in the GitHub repo.
Create a Lambda function
This Lambda function is deployed using an AWS CloudFormation template available in the GitHub repo under the /cfn folder. The template requires two parameters: the S3 bucket name and the knowledge base ID.

On the AWS CloudFormation service home page, choose Create stack to create a new stack.
Select Template is ready for Prepare template.
Select Upload the template file for Template source.
Choose Choose file, navigate to the GitHub repo you cloned earlier, and choose the .yaml file under the /cfn folder.
Choose Next.
For Stack name, enter a name.
In the Parameters section, enter the knowledge base ID and S3 bucket name you noted down earlier.
Choose Next.
Leave all default options as is, choose Next, and choose Submit.
Verify that the CloudFormation template ran successfully, and there are no errors.

Congratulations, you have created a Lambda function, related roles, and policies successfully.
Test the contextual chatbot application
To test your chatbot application, complete the following steps:

Open a new terminal or a command line window on your machine.
Run the following command to install the AWS SDK for Python (Boto3). Boto3 makes it straightforward to integrate a Python application, library, or script with AWS services.

pip install boto3

Run the following command to install and set up a local Python development environment to run the Streamlit application:

pip install streamlit

Navigate to the /streamlit folder in the GitHub repository folder you cloned earlier.
Run the following command to instantiate the chatbot application:

python -m streamlit run

This should open a web-based chat application powered by Streamlit in your default web browser.

Use this Streamlit chatbot application to post natural language questions to start the conversations powered by Knowledge Bases for Amazon Bedrock.

When you submit a prompt, the Streamlit app triggers the Lambda function, which invokes the Knowledge Bases RetrieveAndGenerate API to search and generate responses.
The following table includes some sample questions and related knowledge base responses. Try out some of these questions as prompts.


Question: What is Amazon doing in the field of generative AI?
Response: Amazon has been working on their own large language models (LLMs) for generative AI and believes it will transform and improve every customer experience. They plan to continue investing substantially in these models across all their consumer, seller, brand, and creator experiences.

Question: What is AWS year-over-year revenue in 2022?
Response: AWS revenue grew 29% year-over-year in 2022 on a $62 billion revenue base.

Question: How many days has Amazon asked employees to come to work in office?
Response: Amazon has asked corporate employees to come back to office at least three days a week beginning May 2022.

Question: By what percentage did AWS revenue grow year-over-year in 2022?
Response: AWS had a 29% year-over-year (‘YoY’) revenue increase in 2022.

Question: Compared to Graviton2 processors, what performance improvement did Graviton3 chips deliver according to the passage?
Response: In 2022, AWS delivered their Graviton3 chips, providing 25% better performance than the Graviton2 processors.

Question: Which was the first inference chip launched by AWS according to the passage?
Response: AWS launched their first inference chips (“Inferentia”) in 2019, and they have saved companies like Amazon over a hundred million dollars in capital expense.

Question: According to the context, in what year did Amazon’s annual revenue increase from $245B to $434B?
Response: Amazon’s annual revenue increased from $245B in 2019 to $434B in 2022.

Question: Tell me again, what was the revenue in 2019?
Response: Amazon’s revenue in 2019 was $245 billion.

Question: And, 2021?
Response: Amazon’s revenue in 2021 was $469.8 billion, an increase of 22% over 2020.

Question: And, remind me again, when was the first inference chip launched?
Response: Amazon’s first inference chip was Inferentia, which launched in 2019.

During the first call to the Lambda function, the RetrieveAndGenerate API returns a sessionId, which is then passed by the Streamlit app along with the subsequent user prompt as an input to the RetrieveAndGenerate API to continue the conversation in the same session. The RetrieveAndGenerate API manages the short-term memory and uses the chat history as long as the same sessionId is passed as an input in the successive calls.
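The session-continuity pattern described above can be sketched in plain Python. Here call_lambda is a hypothetical stand-in for invoking the Lambda function, and the payload keys are assumptions; in the actual app, the sessionId would be kept in Streamlit's st.session_state between reruns.

```python
def chat_turn(call_lambda, question, session_id=None):
    """Send one prompt; thread the sessionId through to preserve chat history."""
    payload = {"question": question}
    if session_id:
        # Pass the sessionId returned by the previous call so the
        # RetrieveAndGenerate API continues the same session.
        payload["sessionId"] = session_id
    result = call_lambda(payload)
    return result["answer"], result["sessionId"]

def run_conversation(call_lambda, questions):
    """Drive a multi-turn conversation, carrying the sessionId forward."""
    session_id = None
    answers = []
    for question in questions:
        answer, session_id = chat_turn(call_lambda, question, session_id)
        answers.append(answer)
    return answers, session_id
```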
Congratulations, you have successfully created and tested a chatbot application using Knowledge Bases for Amazon Bedrock.
Clean up
Resources such as the S3 bucket, OpenSearch Serverless collection, and knowledge base will continue to incur charges if you don't delete them. To clean up, delete the CloudFormation stack, the S3 bucket (including any document folders and files stored in it), the OpenSearch Serverless collection, the knowledge base, and any roles, policies, and permissions that you created earlier.
Conclusion
In this post, we provided an overview of contextual chatbots and explained why they’re important. We described the complexities involved in data ingestion and text generation workflows for a RAG architecture. We then introduced how Knowledge Bases for Amazon Bedrock creates a fully managed serverless RAG system, including a vector store. Finally, we provided a solution architecture and sample code in a GitHub repo to retrieve and generate contextual responses for a chatbot application using a knowledge base.
By explaining the value of contextual chatbots, the challenges of RAG systems, and how Knowledge Bases for Amazon Bedrock addresses those challenges, this post aimed to showcase how Amazon Bedrock enables you to build sophisticated conversational AI applications with minimal effort.
For more information, see the Amazon Bedrock Developer Guide and Knowledge Base APIs.

About the Authors
Manish Chugh is a Principal Solutions Architect at AWS based in San Francisco, CA. He specializes in machine learning and generative AI. He works with organizations ranging from large enterprises to early-stage startups on problems related to machine learning. His role involves helping these organizations architect scalable, secure, and cost-effective workloads on AWS. He regularly presents at AWS conferences and other partner events. Outside of work, he enjoys hiking on East Bay trails, road biking, and watching (and playing) cricket.
Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
Pallavi Nargund is a Principal Solutions Architect at AWS. In her role as a cloud technology enabler, she works with customers to understand their goals and challenges, and gives prescriptive guidance to help them achieve their objectives with AWS offerings. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Outside of work, she enjoys volunteering, gardening, cycling, and hiking.

Meet Hydragen: A Hardware-Aware Exact Implementation of Attention with Shared Prefixes

As artificial intelligence continues to permeate every facet of technology, optimizing the performance of large language models (LLMs) for practical applications has become a pivotal challenge. The advent of Transformer-based LLMs has revolutionized how we interact with AI, enabling applications that range from conversational agents to complex problem-solving tools. However, the widespread deployment of these models, especially in scenarios where they process batches of sequences sharing common prefixes, has highlighted a significant efficiency bottleneck. Traditional attention mechanisms, while foundational to the success of LLMs, often struggle with computational redundancy when sequences within a batch share a starting point. This inefficiency strains computing resources and limits the scalability of LLM applications.

To address this challenge, a research team from Stanford University, the University of Oxford, and the University of Waterloo has introduced a groundbreaking approach named Hydragen. Hydragen is ingeniously designed to optimize LLM inference in shared-prefix scenarios, dramatically improving throughput and reducing computational overhead. By decomposing the attention operation into separate computations for shared prefixes and unique suffixes, Hydragen minimizes redundant memory reads and maximizes the efficiency of matrix multiplications, a process better aligned with the capabilities of modern GPUs. This decomposition allows for the batching of attention queries across sequences when processing the shared prefix, significantly enhancing computational efficiency.

Hydragen’s innovation lies in its two-fold approach. Firstly, it decomposes the attention mechanism to address the shared prefixes and the distinct suffixes of sequences separately. This strategy cleverly circumvents the inefficiencies of traditional attention computations, which treat each sequence independently, leading to unnecessary repetition of computations for the shared segments. Secondly, Hydragen introduces inter-sequence batching for the shared prefix, leveraging the uniformity of this segment across sequences to perform a single, consolidated attention computation. This method reduces the workload on the GPU and ensures that the computational power of tensor cores is used to its fullest potential.
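The decomposition can be illustrated with a minimal NumPy sketch. This is an illustration of the underlying identity, not the authors' implementation: attention over the full key/value sequence equals the prefix and suffix partial attentions merged exactly via their softmax normalizers (log-sum-exp weights).

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention over one chunk; also return the log-normalizer (LSE) per query.

    A production kernel would use a numerically stabilized LSE; this sketch
    keeps the math readable.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n_queries, n_keys)
    lse = np.log(np.sum(np.exp(scores), axis=-1))  # log of the softmax denominator
    weights = np.exp(scores - lse[:, None])        # softmax within the chunk
    return weights @ v, lse

def combine(out_prefix, lse_prefix, out_suffix, lse_suffix):
    """Merge the two partial attentions into exact full attention.

    Each partial output is reweighted by its softmax denominator, so the
    merged result matches attention over the concatenated keys/values.
    """
    m = np.maximum(lse_prefix, lse_suffix)  # shift for numerical stability
    w_p = np.exp(lse_prefix - m)[:, None]
    w_s = np.exp(lse_suffix - m)[:, None]
    return (w_p * out_prefix + w_s * out_suffix) / (w_p + w_s)
```

Because the prefix keys and values are identical for every sequence in the batch, the prefix call can be executed once as a single large matrix multiplication over all sequences' queries, which is where the efficiency gain comes from; only the short per-sequence suffixes are attended to individually.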

The impact of Hydragen is profound, offering up to 32 times improvement in end-to-end LLM throughput compared to existing methods. Such performance enhancement is particularly significant as it scales with both the batch size and the length of the shared prefix, showcasing Hydragen’s adaptability to various operational scales and scenarios. Moreover, Hydragen’s methodology extends beyond simple prefix-suffix splits, accommodating more complex, tree-based sharing patterns common in advanced LLM applications. This flexibility allows Hydragen to significantly reduce inference times in various settings, from chatbot interactions to competitive programming challenges.

The results of implementing Hydragen are compelling, underscoring its capability to transform LLM inference. Not only does Hydragen dramatically increase throughput, but it also enables the efficient processing of very long shared contexts with minimal throughput penalty. This means that LLMs can now handle more extensive and context-rich prompts without a corresponding increase in computational cost or time. For instance, in tasks involving long document question answering, Hydragen demonstrates its superiority by processing queries in significantly less time than traditional methods, even when dealing with documents tens of thousands of tokens long.

In conclusion, the development of Hydragen marks a significant milestone in optimizing LLMs for real-world applications. The key takeaways from this research include:

Innovative Decomposition: Hydragen’s unique attention decomposition method significantly enhances computational efficiency for batches of sequences with shared prefixes.

Enhanced Throughput: Hydragen demonstrates up to a 32x improvement in throughput, setting a new standard for LLM performance, especially in large-batch and shared-prefix scenarios.

Versatile Application: The methodology is adaptable to complex sharing patterns, making it suitable for a wide range of LLM applications, from conversational AI to intricate problem-solving tools.


The post Meet Hydragen: A Hardware-Aware Exact Implementation of Attention with Shared Prefixes appeared first on MarkTechPost.