Tinkoff Researchers Unveil ReBased: Pioneering Machine Learning with Enhanced Subquadratic Architectures for Superior In-Context Learning

Large Language Models (LLMs) are setting new standards across a wide range of tasks and driving a revolution in natural language processing. Despite these successes, most of them rely on the attention mechanism of the Transformer architecture, which scales poorly with long text sequences and makes extending the context window computationally impractical.

Several substitutes for Transformers have been proposed to deal with this limitation. To avoid quadratic complexity in the sequence length, some research has proposed replacing the exponential function in the attention mechanism with a kernel function, which allows the computations to be reordered. Nevertheless, this approach degrades performance compared with standard Transformers, and the question of how to choose the kernel function remains open. State Space Models (SSMs) provide an alternative way to define a linear model; when evaluated on language modeling, they can produce results on par with Transformers.
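To see why swapping the exponential for a kernel changes the complexity, here is a minimal non-causal sketch; the feature map phi is a generic placeholder rather than the choice of any particular paper. Attention becomes phi(Q)(phi(K)^T V), so the key/value statistics are accumulated once instead of forming a T-by-T matrix:

import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Standard attention, softmax(QK^T)V, costs O(T^2 d).
    # With a kernel feature map phi, attention can be rewritten as
    #   phi(Q) @ (phi(K)^T V) / (phi(Q) @ sum_t phi(k_t)),
    # so the K/V terms are accumulated once, giving O(T d^2) cost.
    Qf, Kf = phi(Q), phi(K)          # (T, d)
    kv = Kf.T @ V                    # (d, d_v), accumulated once
    norm = Qf @ Kf.sum(axis=0)       # (T,), per-query normalizer
    return (Qf @ kv) / norm[:, None]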

Note that both Linear Transformers and SSMs are types of Recurrent Neural Networks (RNNs). However, as data volumes grow, RNNs struggle to manage long-term text dependencies because their fixed-size memory overflows. Although Linear Transformers have a larger hidden state than conventional RNNs, SSMs have demonstrated superior text-modeling quality. To tackle these issues, the Based model was introduced with a hybrid design that combines a Linear Transformer with a new kernel function derived from the Taylor expansion of the exponential function. When tested on the Multi-Query Associative Recall (MQAR) task, the Based model performed better than its alternatives on longer contexts. However, unlike the traditional Transformer architecture, even the Based model suffers a performance decline as the context grows.

To improve on the Based architecture, one must deeply understand the processes taking place within it. Based on their examination of the attention score distribution, researchers from Tinkoff argue that the kernel function used in Based is not ideal and has limitations when dealing with long contexts and small model capacity.

In response, the team presented ReBased, an improved variant of the Linear Transformer model. Their main focus was a flaw in Based’s attention process: its kernel cannot assign zero probability to a token, so the model can never fully ignore irrelevant tokens. By refining the kernel function and introducing new architectural improvements, they developed a model that simplifies the attention computation and improves accuracy on tasks that involve retrieving information from long sequences of tokens.

After comparing ReBased’s internal representations with those of Based and vanilla attention modules, the researchers found that ReBased behaves more like attention than Based does. Unlike Based’s Taylor expansion of an exponential, the ReBased kernel function differs from the exponent yet demonstrates superior performance. The findings suggest that a second-order polynomial is not sufficient for optimal performance and that more advanced learnable kernels could further boost the efficiency of trained models; normalization may improve many kernel functions even more. This indicates that researchers should revisit traditional kernel-based methods to see whether they can be made more flexible and efficient. The research also shows that, on the MQAR task, models without attention fall far behind attention-based models, particularly as sequence lengths grow. Evaluated on MQAR, ReBased outperforms the original Based model across various scenarios and model sizes. After training on the Pile dataset, ReBased also surpassed its predecessor in In-Context Learning and modeled associative dependencies exceptionally well, as reflected in improved perplexity measures.
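For intuition, the contrast between the two kernels can be written down directly. This is a rough sketch based on a reading of the paper, with the learnable normalization parameters shown as plain scalars rather than per-channel vectors:

import numpy as np

def _layernorm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def based_score(q, k):
    # Based: second-order Taylor expansion of exp(q.k).
    # Its minimum is 0.5 (at q.k = -1), so a token can never
    # receive exactly zero attention weight.
    s = q @ k
    return 1.0 + s + 0.5 * s ** 2

def rebased_score(q, k, gamma=1.0, beta=0.0):
    # ReBased-style score (sketch): normalize q and k, apply a learnable
    # affine transform, then square. The score can now reach exactly zero,
    # and the quadratic still factors into feature maps (outer products),
    # so linear-time computation is preserved.
    qn = gamma * _layernorm(q) + beta
    kn = gamma * _layernorm(k) + beta
    return (qn @ kn) ** 2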

Compared to non-attention models, attention models still perform far better on longer sequences. As these results highlight, further study is needed into strategies that could bridge this gap and reach the performance of attention-based methods. Other models may eventually match or even surpass the strengths of attention, particularly on associative recall tasks such as machine translation; a better understanding of this gap could lead to more effective models for handling long sequences across different natural language processing tasks.

The team notes that their proposed approach works well for most tasks Transformers are used for, but it remains unclear how well it handles tasks that require extensive copying or recalling past context. Handling such tasks effectively is essential for fully alleviating the inference issues associated with attention mechanisms. It should also be mentioned that the models tested in the research are of academic scale only, which restricts how readily the results transfer to much larger models. Despite these limitations, the authors believe their findings shed light on the method’s potential effectiveness.

Meet FinTral: A Suite of State-of-the-Art Multimodal Large Language Models (LLMs) Built Upon the Mistral-7B Model Tailored for Financial Analysis

Financial documents are usually laden with complex numerical data and very specific terminology and jargon, which presents a challenge for existing Natural Language Processing (NLP) models. These models require advanced capabilities for numerical processing and a deep understanding of this jargon to accurately interpret and leverage the wealth of information in these documents. The rapid pace of financial markets adds another layer of complexity, necessitating real-time analysis for effective decision-making. Financial documents often feature diverse types of visual content, demanding multimodal processing abilities to fully exploit their potential for generating actionable insights and market intelligence.

Recent advancements in financial NLP have been marked by the development of specialized models like FinBERT, which paved the way for more sophisticated systems, including BloombergGPT, PIXIU, Instruct-FinGPT, and GPT-FinRE. These models have been designed to tackle the unique challenges of financial language, from sentiment analysis to event extraction and investment strategy enhancement. Innovations have also extended to multimodal capabilities with FinVis-GPT and rigorous model evaluation frameworks like FinLMEval and DISCFinLLM. Despite these advancements, a pressing need remains to address further issues, such as preventing information hallucination and enhancing numerical reasoning in financial NLP models.

A team of researchers from the University of British Columbia and Invertible AI have introduced a groundbreaking Large Language Model (LLM), FinTral, tailored for the financial sector. FinTral employs a multimodal approach, processing textual, numerical, tabular, and visual data to navigate the complexities of financial documents. The work also introduces FinSet, a comprehensive benchmark for evaluating financial LLMs. FinTral demonstrates remarkable capabilities, including a version with enhanced vision and tool-retrieval functions that outperforms established models like GPT-4 in numerous tasks.

Building on this foundation, the model stands out by integrating a multimodal approach, leveraging textual, numerical, tabular, and visual data for enriched financial document analysis. Starting from the base Mistral-7b model, FinTral undergoes further domain-specific pretraining on the expansive FinSet dataset, comprising 20 billion tokens collected from diverse sources such as C4, news, and financial filings. To refine its understanding of and responsiveness to financial queries, it is further tuned with instructions and with human and AI feedback. FinTral integrates visual data processing through CLIP encoders and employs external tools for numerical tasks, effectively augmenting its capabilities. The model’s effectiveness is further amplified by Direct Preference Optimization (DPO) and Retrieval Augmented Generation, enabling it to tackle the complexities of financial analysis with unprecedented accuracy and depth.

Experiments demonstrate FinTral’s exceptional performance across various financial tasks, quantitatively surpassing many contemporary models. FinTral-INST, obtained by fine-tuning the pretrained model, outperformed the other baseline models with an average score of 0.49. Models that underwent reinforcement learning with AI feedback showed marked improvements, with FinTral-DPO outperforming ChatGPT and reaching an average score of 0.59, just below GPT-4’s average score of 0.69. Even with these results, there remains room for improvement in areas such as real-time data handling, model maintenance and updating, and the scarcity of annotated data.

In conclusion, FinTral is an advanced financial language model leveraging extensive datasets and diverse training methods to analyze complex financial data. It reduces model hallucinations by pretraining with clean financial data and employing retrieval methods, enhancing accuracy and reliability. Its real-time adaptability to financial markets and dynamic data retrieval can significantly improve predictive accuracy and decision-making. The researchers acknowledge the limitations and risk factors involved in the research and are optimistic about the future developments this work could pave the way for.


This Paper from Google DeepMind Explores Sparse Training: A Game-Changer in Machine Learning Efficiency for Reinforcement Learning Agents

The efficacy of deep reinforcement learning (RL) agents critically depends on their ability to utilize network parameters efficiently. Recent insights have cast light on deep RL agents’ challenges, notably their tendency to underutilize network parameters, leading to suboptimal performance. This inefficiency is not merely a technical hiccup but a fundamental bottleneck that curtails the potential of RL agents in complex domains.

The core problem is the underutilization of network parameters by deep RL agents. Despite the remarkable successes of deep RL in various applications, evidence suggests these agents often fail to harness the full capacity of their networks. This inefficiency manifests as dormant neurons during training and implicit underparameterization, leading to a significant performance gap on tasks requiring intricate reasoning and decision-making.

Current methodologies in the field, while pioneering, grapple with this challenge with varying degrees of success. Sparse training methods, which aim to pare network parameters down to the essential ones, have shown promise. However, they often trade sparsity against performance without fundamentally addressing the root cause of parameter underutilization.

The study by researchers from Google DeepMind, Mila – Québec AI Institute, and Université de Montréal introduces a groundbreaking technique known as gradual magnitude pruning, which meticulously trims down the network parameters, ensuring that only those of paramount importance are retained. This approach is rooted in the understanding that dormant neurons and underutilized parameters significantly hamper the efficiency of a network. This phenomenon restricts the agent’s learning capacity and inflates computational costs without commensurate benefits. By applying a principled strategy to increase network sparsity gradually, the research unveils an unseen scaling law, demonstrating that judicious pruning can lead to substantial performance gains across various tasks.
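The schedule behind gradual magnitude pruning is simple to sketch. The following follows the commonly used polynomial ramp; the hyperparameters are illustrative, not the paper's exact settings:

import numpy as np

def target_sparsity(step, start_step, end_step, final_sparsity):
    # Polynomial schedule: sparsity ramps from 0 to final_sparsity
    # between start_step and end_step, then stays constant.
    if step < start_step:
        return 0.0
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity * (1.0 - (1.0 - progress) ** 3)

def magnitude_prune(weights, sparsity):
    # Zero out the smallest-magnitude fraction of weights.
    if sparsity <= 0.0:
        return weights
    k = int(np.floor(sparsity * weights.size))
    threshold = np.sort(np.abs(weights), axis=None)[k]
    return np.where(np.abs(weights) < threshold, 0.0, weights)

# Example: gradually prune 95% of a layer's weights over training.
w = np.random.randn(512, 512)
for step in range(0, 200_001, 1000):
    w = magnitude_prune(w, target_sparsity(step, 20_000, 160_000, 0.95))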

Networks subjected to gradual magnitude pruning consistently outperformed their dense counterparts across a spectrum of reinforcement learning tasks. This was not limited to simple environments but extended to complex domains requiring sophisticated decision-making and reasoning. The method’s efficacy was particularly pronounced when traditional dense networks struggled, underscoring the potential of pruning to unlock new performance levels in deep RL agents.

By significantly reducing the number of active parameters, gradual magnitude pruning presents a sustainable path toward more efficient and cost-effective reinforcement learning applications. This approach aligns with making AI technologies more accessible and reducing their environmental impact, a consideration of increasing importance in the field.

In conclusion, the contributions of this research are manifold, offering new perspectives on optimizing deep RL agents:

Introduction of gradual magnitude pruning: A novel technique that maximizes parameter efficiency, leading to significant performance improvements.

Demonstration of a scaling law: Unveiling the relationship between network size and performance, challenging the prevailing notion that bigger networks are inherently better.

Evidence of general applicability: Showing the technique’s effectiveness across various agents and training regimes, suggesting its potential as a universal method for enhancing deep RL agents.

Alignment with sustainability goals: Proposing a path towards more environmentally friendly and cost-effective AI applications by reducing computational requirements.

Build a robust text-to-SQL solution generating complex queries, self-c …

Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata. Today, generative AI can enable people without SQL knowledge to query data in plain language. This generative AI task is called text-to-SQL: it uses natural language processing (NLP) to convert text into semantically correct SQL queries. The solution in this post aims to bring enterprise analytics operations to the next level by shortening the path to your data using natural language.
With the emergence of large language models (LLMs), NLP-based SQL generation has undergone a significant transformation. Demonstrating exceptional performance, LLMs are now capable of generating accurate SQL queries from natural language descriptions. However, challenges still remain. First, human language is inherently ambiguous and context-dependent, whereas SQL is precise, mathematical, and structured. This gap may result in inaccurate conversion of the user’s needs into the SQL that’s generated. Second, you might need to build text-to-SQL features for every database because data is often not stored in a single target. You may have to recreate the capability for every database to enable users with NLP-based SQL generation. Third, despite the larger adoption of centralized analytics solutions like data lakes and warehouses, complexity rises with different table names and other metadata that is required to create the SQL for the desired sources. Therefore, collecting comprehensive and high-quality metadata also remains a challenge. To learn more about text-to-SQL best practices and design patterns, see Generating value from enterprise data: Best practices for Text2SQL and generative AI.
Our solution aims to address those challenges using Amazon Bedrock and AWS Analytics Services. We use Anthropic Claude v2.1 on Amazon Bedrock as our LLM. To address the challenges, our solution first incorporates the metadata of the data sources within the AWS Glue Data Catalog to increase the accuracy of the generated SQL query. The workflow also includes a final evaluation and correction loop, in case any SQL issues are identified by Amazon Athena, which is used downstream as the SQL engine. Athena also allows us to use a multitude of supported endpoints and connectors to cover a large set of data sources.
After we walk through the steps to build the solution, we present the results of some test scenarios with varying SQL complexity levels. Finally, we discuss how straightforward it is to incorporate different data sources into your SQL queries.
Solution overview
There are three critical components in our architecture: Retrieval Augmented Generation (RAG) with database metadata, a multi-step self-correction loop, and Athena as our SQL engine.
We use the RAG method to retrieve the table descriptions and schema descriptions (columns) from the AWS Glue metastore to ensure that the request is related to the right table and datasets. In our solution, we built the individual steps to run a RAG framework with the AWS Glue Data Catalog for demonstration purposes. However, you can also use knowledge bases in Amazon Bedrock to build RAG solutions quickly.
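For reference, the table and column metadata used for this retrieval can be pulled from the Data Catalog with the AWS Glue API. The helper below is a minimal sketch rather than part of the original notebook; the database name imdb_stg is borrowed from the examples later in this post:

import boto3

glue = boto3.client("glue")

def get_table_metadata(database="imdb_stg"):
    # Collect table names, column names/types, and any comments so they
    # can be embedded and stored in the vector store for retrieval.
    docs = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            columns = [
                f'{c["Name"]} ({c["Type"]}): {c.get("Comment", "")}'
                for c in table["StorageDescriptor"]["Columns"]
            ]
            docs.append({"table": table["Name"], "columns": columns})
    return docs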
The multi-step component allows the LLM to correct the generated SQL query for accuracy. Here, the generated SQL is sent to Athena to check for syntax errors. We use Athena error messages to enrich our prompt for the LLM, enabling more accurate and effective corrections in the generated SQL.
You can think of the error messages occasionally returned by Athena as feedback. The cost implications of an error correction step are negligible compared to the value delivered. You could even include these corrective steps as supervised fine-tuning or reinforcement learning examples to fine-tune your LLMs. However, we did not cover this flow in our post for simplicity purposes.
Note that there is always inherent risk of having inaccuracies, which naturally comes with generative AI solutions. Even if Athena error messages are highly effective to mitigate this risk, you can add more controls and views, such as human feedback or example queries for fine-tuning, to further minimize such risks.
Athena not only allows us to correct the SQL queries, but it also simplifies the overall problem for us because it serves as the hub, where the spokes are multiple data sources. Access management, SQL syntax, and more are all handled via Athena.
The following diagram illustrates the solution architecture.

Figure 1. The solution architecture and process flow.

The process flow includes the following steps:

Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store it in an Amazon OpenSearch Serverless vector store, which serves as our knowledge base in our RAG framework.

At this stage, the process is ready to receive the query in natural language. Steps 7–9 represent a correction loop, if applicable.

The user enters their query in natural language. You can use any web application to provide the chat UI. Therefore, we did not cover the UI details in our post.
The solution applies a RAG framework via similarity search, which adds extra context from the metadata in the vector database. This context is used to find the correct table, database, and attributes.
The query is merged with the context and sent to Anthropic Claude v2.1 on Amazon Bedrock.
The solution takes the generated SQL query from the model and connects to Athena to validate the syntax.
If Athena provides an error message that mentions the syntax is incorrect, the model uses the error text from Athena’s response.
The new prompt adds Athena’s response.
The model creates the corrected SQL and continues the process. This iteration can be performed multiple times.
Finally, we run the SQL using Athena and generate output. Here, the output is presented to the user. For the sake of architectural simplicity, we did not show this step.

Prerequisites
For this post, you should complete the following prerequisites:

Have an AWS account.
Install the AWS Command Line Interface (AWS CLI).
Set up the SDK for Python (Boto3).
Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store it in an OpenSearch Serverless vector store.
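The following minimal sketch shows one way to perform this embedding step with the Bedrock runtime client, assuming the Titan Text Embeddings V1 model ID; it is illustrative rather than the exact code used in the notebook:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed_text(text, model_id="amazon.titan-embed-text-v1"):
    # Call the Titan embeddings model on Amazon Bedrock and return the
    # embedding vector for the supplied metadata text.
    response = bedrock.invoke_model(
        modelId=model_id,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]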

Implement the solution
You can use the following Jupyter notebook, which includes all the code snippets provided in this section, to build the solution. We recommend using Amazon SageMaker Studio to open this notebook with an ml.t3.medium instance with the Python 3 (Data Science) kernel. For instructions, refer to Train a Machine Learning Model. Complete the following steps to set up the solution:

Create the knowledge base in OpenSearch Service for the RAG framework:

def add_documnets(self, index_name: str, file_name: str):
    # Load the metadata JSON and index it into the OpenSearch vector store
    documents = JSONLoader(file_path=file_name, jq_schema='.', text_content=False, json_lines=False).load()
    docs = OpenSearchVectorSearch.from_documents(
        embedding=self.embeddings,
        opensearch_url=self.opensearch_domain_endpoint,
        http_auth=self.http_auth,
        documents=documents,
        index_name=index_name,
        engine="faiss",
    )
    index_exists = self.check_if_index_exists(index_name, aws_region, opensearch_domain_endpoint, http_auth)
    if not index_exists:
        logger.info(f'index :{index_name} is not existing')
        sys.exit(-1)
    else:
        logger.info(f'index :{index_name} Got created')

Build the prompt (final_question) by combining the user input in natural language (user_query), the relevant metadata from the vector store (vector_search_match), and our instructions (details):

def userinput(user_query):
    logger.info(f'Searching metadata from vector store')

    # vector_search_match = rqst.getEmbeddding(user_query)
    vector_search_match = rqst.getOpenSearchEmbedding(index_name, user_query)

    # print(vector_search_match)
    details = """It is important that the SQL query complies with Athena syntax.
    During join if column name are same please use alias ex llm.customer_id
    in select statement. It is also important to respect the type of columns:
    if a column is string, the value should be enclosed in quotes.
    If you are writing CTEs then include all the required columns.
    While concatenating a non string column, make sure cast the column to string.
    For date columns comparing to string, please cast the string input."""
    final_question = "\n\nHuman:" + details + vector_search_match + user_query + "\n\nAssistant:"
    answer = rqst.generate_sql(final_question)
    return answer

Invoke Amazon Bedrock for the LLM (Anthropic Claude v2.1) and prompt it to generate the SQL query. In the following code, it makes multiple attempts in order to illustrate the self-correction step:

try:
    logger.info(f'we are in Try block to generate the sql and count is :{attempt + 1}')
    generated_sql = self.llm.predict(prompt)
    # Extract the SQL between the code fences in the model output
    query_str = generated_sql.split("```")[1]
    query_str = " ".join(query_str.split("\n")).strip()
    sql_query = query_str[3:] if query_str.startswith("sql") else query_str

    # return sql_query
    syntaxcheckmsg = rqstath.syntax_checker(sql_query)
    if syntaxcheckmsg == 'Passed':
        logger.info(f'syntax checked for query passed in attempt number :{attempt + 1}')
        return sql_query

If any issues are received with the generated SQL query ({sqlgenerated}) from the Athena response ({syntaxcheckmsg}), the new prompt (prompt) is generated based on the response and the model tries again to generate the new SQL:

else:
    prompt = f"""{prompt}
    This is syntax error: {syntaxcheckmsg}.
    To correct this, please generate an alternative SQL query which will correct the syntax error. The updated query should take care of all the syntax issues encountered. Follow the instructions mentioned above to remediate the error.
    Update the below SQL query to resolve the issue:
    {sqlgenerated}
    Make sure the updated SQL query aligns with the requirements provided in the initial question."""
    prompts.append(prompt)

After the SQL is generated, the Athena client is invoked to run and generate the output:

query_execution = self.athena_client.start_query_execution(
    QueryString=query_string,
    ResultConfiguration=result_config,
    QueryExecutionContext=query_execution_context,
)
execution_id = query_execution["QueryExecutionId"]
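The syntax_checker helper referenced in the preceding snippets is not shown in this post. The following is a hypothetical sketch of how such a check could work: it wraps the candidate query in an Athena EXPLAIN statement and returns either Passed or Athena's error text. The function and parameter names here are illustrative, and the workgroup is assumed to have a query output location configured:

import time
import boto3

athena = boto3.client("athena")

def syntax_checker(sql_query, workgroup="primary"):
    # Run EXPLAIN on the candidate query: Athena parses and plans it
    # without scanning data, so syntax and column errors surface quickly.
    execution = athena.start_query_execution(
        QueryString=f"EXPLAIN {sql_query}",
        WorkGroup=workgroup,
    )
    execution_id = execution["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state == "SUCCEEDED":
        return "Passed"
    # Return Athena's error text so it can be appended to the next prompt.
    return status["QueryExecution"]["Status"].get("StateChangeReason", state)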

Test the solution
In this section, we run our solution with different example scenarios to test different complexity levels of SQL queries.
To test our text-to-SQL, we use two datasets available from IMDB. Subsets of IMDb data are available for personal and non-commercial use. You can download the datasets and store them in Amazon Simple Storage Service (Amazon S3). You can use the following Spark SQL snippet to create tables in AWS Glue. For this example, we use title_ratings and title:

source_title_ratings3_path = 's3://llm-athena-output/input_data/title.ratings.tsv'
target_title_s3_path = 's3://llm-athena-output/output_data/imdb_stg/title_ratings'
source_titleratingdf = spark.read.csv(source_title_ratings3_path, sep="\t", header=True)
source_titleratingdf.write.mode('overwrite').format('parquet').option('path', target_title_s3_path).saveAsTable('imdb_stg.title_ratings')

Store data in Amazon S3 and metadata in AWS Glue
In this scenario, our dataset is stored in an S3 bucket. Athena has an S3 connector that allows you to use Amazon S3 as a data source that can be queried.
For our first query, we provide the input “I am new to this. Can you help me see all the tables and columns in imdb schema?”
The following is the generated query:

WITH tables AS (
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'imdb_stg'),
columns AS (
SELECT
c.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM information_schema.columns c
WHERE c.table_schema = 'imdb_stg')
SELECT
t.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM tables t
INNER JOIN columns c
ON t.table_name = c.table_name
ORDER BY
t.table_name,
c.ordinal_position
LIMIT 10;

The following screenshot and code show our output.

table_name column_name data_type
0 title titleid varchar
1 title ordering integer
2 title title varchar
3 title region varchar
4 title language varchar

For our second query, we ask “Show me all the title and details in US region whose rating is more than 9.5.”
The following is our generated query:

WITH us_titles AS (
SELECT t.title, t.region, tr.averageRating, tr.numVotes
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.region = 'US' AND cast(tr.averageRating as varchar) > '9.5'
)
SELECT title, region, averageRating, numVotes
FROM us_titles
LIMIT 100;

The response is as follows.

title region averageRating numVotes
0 The Way You Saw Me US 9.7 8
1 The Brother Side of the Wake US 9.6 20
2 Ignis Fatuus US 9.6 11
3 Love and Hip Hop Atlanta US 9.9 11
4 ronny/lily US 9.7 14781

For our third query, we enter “Great Response! Now show me all the original type titles having ratings more than 7.5 and not in the US region.”
The following query is generated:

WITH titles AS (
SELECT t.titleId,
t.title,
t.types,
t.isOriginalTitle,
cast(tr.averageRating as decimal(3,1)) as averageRating,
tr.numVotes,
t.region
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.isOriginalTitle = '1'
AND cast(tr.averageRating as decimal(3,1)) > 7.5
AND t.region != 'US')
SELECT *
FROM titles
LIMIT 100;

We get the following results.

titleId title types isOriginalTitle averageRating numVotes region
0 tt0986264 Taare Zameen Par original 1 8.3 203760 XWW

Generate self-corrected SQL
This scenario simulates a SQL query that has syntax issues. Here, the generated SQL will be self-corrected based on the response from Athena. In the following response, Athena gave a COLUMN_NOT_FOUND error and mentioned that table_description can’t be resolved:

Status : {'State': 'FAILED', 'StateChangeReason': "COLUMN_NOT_FOUND: line 1:50: Column 'table_description'
cannot be resolved or requester is not authorized to access requested resources",
'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 501000, tzinfo=tzlocal()),
'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 778000, tzinfo=tzlocal()),
'AthenaError': {'ErrorCategory': 2, 'ErrorType': 1006, 'Retryable': False, 'ErrorMessage': "COLUMN_NOT_FOUND:
line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to
access requested resources"}}
COLUMN_NOT_FOUND: line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to access requested resources
Try Count: 2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,Try Count: 2
we are in Try block to generate the sql and count is :2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,we are in Try block to generate the sql and count is :2
Executing: Explain WITH tables AS ( SELECT table_name FROM information_schema.tables WHERE table_schema = 'imdb_stg' ), columns AS ( SELECT c.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM information_schema.columns c WHERE c.table_schema = 'imdb_stg' ) SELECT t.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM tables t INNER JOIN columns c ON t.table_name = c.table_name ORDER BY t.table_name, c.ordinal_position LIMIT 10;
I am checking the syntax here
execution_id: 904857c3-b7ac-47d0-8e7e-6b9d0456099b
Status : {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 29, 537000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 30, 183000, tzinfo=tzlocal())}
syntax checked for query passed in tries number :2

Using the solution with other data sources
To use the solution with other data sources, Athena handles the job for you. To do this, Athena uses data source connectors that can be used with federated queries. You can consider a connector as an extension of the Athena query engine. Pre-built Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), and Amazon Relational Database Service (Amazon RDS), as well as JDBC-compliant relational data sources such as MySQL and PostgreSQL under the Apache 2.0 license. After you set up a connection to any data source, you can use the preceding code base to extend the solution. For more information, refer to Query any data source with Amazon Athena’s new federated query.
Clean up
To clean up the resources, you can start by cleaning up your S3 bucket where the data resides. Unless your application invokes Amazon Bedrock, it will not incur any cost. For the sake of infrastructure management best practices, we recommend deleting the resources created in this demonstration.
Conclusion
In this post, we presented a solution that allows you to use NLP to generate complex SQL queries with a variety of resources enabled by Athena. We also increased the accuracy of the generated SQL queries via a multi-step evaluation loop based on error messages from downstream processes. Additionally, we used the metadata in the AWS Glue Data Catalog to consider the table names asked in the query through the RAG framework. We then tested the solution in various realistic scenarios with different query complexity levels. Finally, we discussed how to apply this solution to different data sources supported by Athena.
Amazon Bedrock is at the center of this solution. Amazon Bedrock can help you build many generative AI applications. To get started with Amazon Bedrock, we recommend following the quick start in the following GitHub repo and familiarizing yourself with building generative AI applications. You can also try knowledge bases in Amazon Bedrock to build such RAG solutions quickly.

About the Authors
Sanjeeb Panda is a Data and ML engineer at Amazon. With a background in AI/ML, data science, and big data, Sanjeeb designs and develops innovative data and ML solutions that solve complex technical challenges and achieve strategic goals for global 3P sellers managing their businesses on Amazon. Outside of his work as a Data and ML engineer at Amazon, Sanjeeb Panda is an avid foodie and music enthusiast.
Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies, and specifically generative AI solutions, to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a postdoc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate at MIT. He is passionate about yoga and meditation.

Customer Journey & Segmentation are Key to Enhancing Engagement

I’ll start with an obvious observation: a customer’s first interaction with your company is never “making a purchase.” Though it’d be nice if they arrived immediately at the product they were interested in and bought it right away, most customers go on a long journey before ultimately forking over their cash. Understanding each customer’s journey, then, is absolutely key to building a better, more efficient marketing and sales process. 

Even savvy companies are learning that this is easier said than done. 

Years of targeted digital ads and email campaigns have raised customers’ expectations. They expect more personalization than ever.

Simultaneously, the death of third-party cookies and other privacy changes continues to make the data required to curate that personalized experience harder and harder to find and use.

That’s why our Customer Journey feature is so crucial. With our Website Visitor ID X-Ray Pixel, you can see how each identified visitor navigates your site. Seeing what choices they make before ultimately purchasing provides you with crucial information you can use to optimize your digital ads, email campaigns, and more! 

The Evolving Expectations of Customers

Challenges in Meeting the New Expectations

Importance of Customer Journey Segmentation

Understanding Customer Journey with Customers.ai

Segmenting Based on Customer Journey

The Evolving Expectations of Customers

The Rise of Personalized Marketing

The digital age has transformed passive consumers into active participants who dictate the terms of engagement. They no longer navigate the digital world as mere observers but as individuals seeking a reflection of their unique preferences and interests. This shift has been fueled by advancements in data analytics and machine learning, enabling brands to deliver personalized experiences at scale. Consumers now expect interactions with brands to be tailored specifically to them, drawing on their past behaviors, purchases, and interactions to anticipate their needs.

Setting the Standard: The Influence of Tech Giants

Companies like Amazon and Netflix have not just participated in the evolution of consumer expectations; they’ve spearheaded it. By leveraging vast data to curate personalized experiences, these tech giants have created a new normal. Amazon’s recommendation algorithms that suggest products based on browsing and purchasing history, and Netflix’s ability to recommend shows and movies with uncanny accuracy, have set a high bar. These experiences have reshaped what consumers expect from all digital interactions, demanding a level of personalization once reserved for the realm of science fiction.

Challenges in Meeting the New Expectations

The End of the Cookie Era

The impending extinction of third-party cookies poses a formidable challenge to personalized marketing. For years, cookies have been the linchpins of digital advertising, enabling brands to track user behavior across the web. Their decline, driven by privacy concerns and regulatory changes, has left marketers searching for new ways to understand and reach their audiences. This seismic shift requires a reimagining of data collection and user tracking, pushing marketers towards more privacy-conscious strategies.

Navigating Privacy Regulations

The introduction of stringent privacy laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) has further complicated the landscape. These regulations mandate greater transparency and user consent, limiting how marketers can collect and utilize data. As companies navigate this new terrain, they must balance the desire for personalization with the imperative of compliance, a task that often requires significant adjustments to their data practices.

Finding Alternatives for Data Collection and Analysis

In response to these challenges, innovative solutions have emerged. Marketers are increasingly turning to first-party data, gathered directly from interactions with their brand, as a cookie alternative. This shift not only complies with privacy regulations but also opens new avenues for building deeper, more direct relationships with customers. Additionally, technologies such as machine learning and AI offer new ways to analyze data, predict consumer behavior, and personalize experiences without infringing on privacy. These tools are paving the way for a new era of marketing, where personalization and privacy coexist.

Importance of Customer Journey Segmentation

In the intricate dance of digital marketing, understanding the rhythm of your audience’s movements is paramount. Customer journey segmentation is not just a tool; it’s a compass that guides companies through the vast, often tumultuous sea of consumer interactions. By dividing the customer journey into distinct segments, businesses can tailor their strategies to meet individuals exactly where they are, enhancing engagement and driving conversions.

Crafting Tailored Experiences

At the heart of customer journey segmentation is the ability to craft experiences that resonate on a personal level. Whether a customer is at the discovery phase, considering their options, or ready to purchase, each segment requires a different approach. This tailored treatment transforms a generic interaction into a personal conversation, significantly increasing the likelihood of conversion.

Enhancing Customer Understanding

Segmentation offers a window into the customer’s world, providing insights that are otherwise obscured by the aggregate data. By dissecting the journey into manageable parts, marketers can identify patterns, preferences, and pain points specific to each segment. This deep dive into the customer psyche is invaluable, informing everything from product development to customer service strategies.

Optimizing Marketing Efforts

With resources always at a premium, efficiency is the watchword of any successful marketing department. Customer journey segmentation ensures that efforts are not wasted on mismatched messaging. By aligning marketing tactics with the customer’s stage in the journey, companies can allocate their budgets more effectively, maximizing ROI and minimizing waste.

Predicting Future Behaviors

Understanding the past and present of customer behavior is crucial, but the real power lies in prediction. Segmentation allows companies to forecast future behaviors based on observed patterns, providing a strategic advantage. Armed with this knowledge, businesses can preemptively address customer needs and desires, staying one step ahead in a competitive landscape.

In conclusion, customer journey segmentation is the linchpin of modern marketing strategy. It underpins the creation of personalized experiences, deepens customer understanding, optimizes marketing resources, and offers predictive insights into future behaviors. In a world where personalization is king, segmentation is the key to the kingdom.

Understanding Customer Journey with Customers.ai

Customers.ai’s Website Visitor ID X-Ray Pixel allows you to track the journey of every new contact you generate. Even better, setting it up is super simple. 

Add the Pixel to every page on your site. This ensures that you can track the whole journey. If a page doesn’t have our X-Ray Pixel on it, you won’t be able to see whether a customer visited that page or not. 

Click on your My Leads tab. 

Click on the contact. 

See their whole Customer Journey, including which pages they visited, the order of the pages they visited, which emails they’ve received, and what they did after they received the email! 

Segmenting Based on Customer Journey

The specific strategy you’ll need depends on your customers and your industry. But here are some foundational principles that will be useful no matter what: 

Study Your Customers

Now that you’ve got the data, make sure that you use it. Study the Customer Journeys to understand how they are navigating your site and engaging with your marketing campaigns. 

You get detailed analytics for every email campaign you send to see what resonates and what doesn’t. Our Customer Journey tool shows you which pages Customers looked at and the order they looked at them, so you can see which actions lead to other actions. 

Segment Based on Page Viewed

Our system allows you to build audiences based on different intent signals. Contacts that originate looking at your Blue Hats, for example, are probably interested in different things than contacts that originate looking at your blog. 

You can easily build audiences that filter and then export them into your existing CRM flows! 

Design Campaigns Based on Customer Journeys 

If your Customer Journey insights reveal common trajectories, use that information to design specific campaigns based on that! 

If you see, for example, that people are often looking at a specific informational page immediately  before making a purchase, you can replicate that experience in your targeted email campaigns and digital ads. 

Get started understanding customer journeys today by seeing how many of your website visitors we can turn into contacts!

Google AI Introduces LLM Comparator: A Step Towards Understanding the Evaluation of Large Language Models

Improving LLMs involves continuously refining algorithms and training procedures to enhance their accuracy and versatility. However, the primary challenge in developing LLMs is accurately evaluating their performance. LLMs generate complex, freeform text, making it difficult to benchmark their outputs against a fixed standard. This complexity necessitates innovative approaches to assessment, moving beyond simple accuracy metrics to more nuanced evaluations of text quality and relevance.

Current challenges in analyzing evaluation results include the lack of specialized tools, the difficulty of reading and comparing long texts, and the need to compute metrics by slices. Various methodologies and tools have been developed in the visualization community for such analysis, including visualizing individual data points, supporting slice-level analysis, explaining individual predictions, and comparing models. Automatic side-by-side evaluation (AutoSxS) is prevalent in evaluating LLMs. The process involves using baseline models, selecting prompt sets, obtaining individual ratings, and calculating aggregated metrics.
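For intuition, the aggregation step of such a side-by-side evaluation is straightforward to express. The toy sketch below turns per-prompt rater scores into win rates per prompt category; the field names are invented for illustration:

from collections import defaultdict

def win_rates_by_slice(ratings):
    # ratings: list of dicts like {"category": "coding", "score": 1.25},
    # where score > 0 means model A was preferred over model B.
    wins, totals = defaultdict(int), defaultdict(int)
    for r in ratings:
        totals[r["category"]] += 1
        if r["score"] > 0:
            wins[r["category"]] += 1
    return {c: wins[c] / totals[c] for c in totals}

print(win_rates_by_slice([
    {"category": "coding", "score": 1.5},
    {"category": "coding", "score": -0.5},
    {"category": "reasoning", "score": 0.25},
]))
# {'coding': 0.5, 'reasoning': 1.0}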

A team of researchers at Google Research has introduced the LLM Comparator tool, which facilitates the side-by-side comparison of LLM outputs, enabling an in-depth analysis of their performance. The LLM Comparator allows users to interactively explore the differences between model responses, clearly representing where and why one model may outperform another.

The LLM Comparator integrates visual analytics, allowing users to delve into the specifics of model performance across different scenarios. It features a score distribution histogram, offering a detailed view of rating variances and a performance visualization across different prompt categories. It is instrumental in pinpointing specific areas of model strength or weakness. Moreover, the tool’s rationale clusters ingeniously condense raters’ reasoning into thematic groups, providing deep insights into their decision-making processes. Adding n-gram analysis and custom functions further enhances this functionality, enabling users to delve into the intricacies of model responses.

The effectiveness of the LLM Comparator is underscored by its impact on Google. Since its introduction, the tool has attracted significant attention, with over 400 users engaging in more than 1,000 evaluation experiments. This widespread adoption speaks to its utility in streamlining the evaluation process for LLM developers, offering valuable insights that guide the refinement of these complex AI systems.

In conclusion, the LLM Comparator represents a significant step forward in evaluating large language models. By providing a scalable, interactive analysis platform, it addresses the critical challenge of assessing LLM performance. The tool facilitates a deeper understanding of model capabilities and accelerates the development of more advanced and effective AI systems.

This AI Paper Boldly Quantizes the Weight Matrices of LLMs to 1-Bit: Paving the Way for the Extremely Low Bit-Width Deployment of LLMs

Large language models (LLMs), computational giants capable of understanding and generating text with astonishing accuracy, hold the key to various applications, from automated content creation to sophisticated conversational agents. However, their deployment is marred by a significant hurdle: computational and memory requirements. As models become more complex, deploying them outside high-powered servers becomes a formidable challenge, limiting their accessibility and real-world utility.

Approaches to model optimization have ventured into various territories, from pruning to knowledge distillation. Yet a solution that marries a minimal memory footprint with minimal loss in performance has yet to be found. Within this context, a pioneering approach dubbed OneBit emerges from the collaborative efforts of researchers at Tsinghua University and Harbin Institute of Technology. OneBit represents a paradigm shift, addressing the efficiency challenge head-on by introducing a framework for quantization-aware training (QAT) of LLMs with an unprecedented 1-bit representation.

While successful to a degree, traditional quantization methods falter when pushed to the extremes of low-bit representation, often resulting in a drastic degradation of model performance. OneBit, however, circumvents this issue through a novel parameter representation that significantly reduces the bit-width of weight matrices without severely impacting the model’s effectiveness. This is achieved by decomposing the weight matrices in a way that retains essential information with minimal storage, coupled with an astute parameter initialization method that speeds up the convergence of training.

OneBit’s methodology leverages a novel linear layer and Sign-Value-Independent Decomposition (SVID) for weight matrices, enabling the representation of LLMs using approximately 1-bit values. This decomposition separates each original weight matrix into a sign matrix and two value vectors, with the former maintaining the high rank of the original matrix at a fraction of the space cost and the latter providing the necessary floating-point precision in linear projections. This strategic decomposition and the utilization of quantization-aware knowledge distillation facilitate the transfer of capabilities from the original model to its 1-bit counterpart, ensuring that the essence of the model’s predictive power is preserved.
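The following is a minimal numpy illustration of the idea rather than the authors' code: keep the elementwise signs as a 1-bit matrix and approximate the magnitudes with a rank-1 outer product of two value vectors obtained from the SVD of |W|:

import numpy as np

def svid(W):
    # 1-bit part: elementwise signs of the weight matrix.
    S = np.sign(W)
    S[S == 0] = 1.0
    # Value part: best rank-1 approximation of the magnitudes |W|,
    # stored as two floating-point vectors a and b.
    U, s, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    a = U[:, 0] * np.sqrt(s[0])
    b = Vt[0, :] * np.sqrt(s[0])
    return S, a, b

W = np.random.randn(64, 32)
S, a, b = svid(W)
W_hat = S * np.outer(a, b)                    # reconstruction used in the linear layer
print(np.mean(np.sign(W_hat) == np.sign(W)))  # signs of the original weights are preserved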

OneBit has demonstrated its ability to retain at least 83% of a model’s non-quantized performance across various tasks, showcasing its viability for efficient LLM deployment. This achievement paves the way for applying LLMs in environments with limited resources and establishes a new standard for research in model quantization.

OneBit’s implications are profound. By significantly reducing the memory footprint required to deploy LLMs, it democratizes access to cutting-edge natural language processing capabilities, enabling their integration into everyday devices and applications. This breakthrough has the potential to accelerate the adoption of LLMs across a wide range of sectors, from education and healthcare to entertainment and customer service, making the benefits of AI more accessible to people around the world.

In conclusion, OneBit represents a significant leap forward in the quest for efficient and accessible large language models. By marrying the seemingly conflicting goals of minimal memory usage and minimal performance loss, it addresses a critical challenge in the deployment of LLMs and opens new avenues for their application. The contributions of the OneBit research team remind us of the transformative power of innovation, charting a course toward a future where the potential of large language models can be fully realized, unfettered by the constraints of computational and memory resources.

Microsoft AI Research Introduces UFO: An Innovative UI-Focused Agent to Fulfill User Requests Tailored to Applications on Windows OS, Harnessing the Capabilities of GPT-Vision

Microsoft has recently released UFO, a UI-focused agent for specialized Windows OS interaction. UFO addresses the challenges of interacting with the graphical user interface (GUI) of applications on the Windows operating system (OS) through natural language commands. LLMs have shown strong results in understanding and executing textual commands, but they are still unable to navigate and operate within the UI of Windows applications.

Existing models focus mainly on smartphones or web applications, and a UI agent tailored specifically for the Windows OS environment remained unavailable. To fill this gap, Microsoft’s researchers proposed UFO, a UI-focused agent designed for smooth interaction with Windows applications. UFO uses a dual-agent framework comprising an Application Selection Agent (AppAgent) and an Action Selection Agent (ActAgent). Both agents use GPT-Vision to analyze GUI screenshots and control information, which allows them to select the right application and execute the required actions. UFO also incorporates features such as control interaction, application switching, action customization, and safeguards to enhance its functionality and user experience.

UFO works by first analyzing the user’s request and the current desktop environment, which includes screenshots and available applications. Based on this analysis, the AppAgent selects an appropriate application and develops a global task-completion strategy. The ActAgent then performs actions within the selected application, iteratively selecting controls and performing actions until the user request is fulfilled. UFO’s control interaction module translates the selected actions into executable operations, allowing for automated execution without the need for human intervention.
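At a high level, the dual-agent loop can be summarized with the following illustrative pseudocode; the function names are placeholders, not Microsoft's actual API:

def fulfill_request(user_request, desktop, max_steps=20):
    # AppAgent: choose the target application and a global plan from a
    # desktop screenshot plus the list of available applications.
    app, plan = app_agent_select(user_request, desktop.screenshot(), desktop.applications())
    desktop.focus(app)

    # ActAgent: iteratively pick a control and an action inside the chosen
    # application until the request is judged complete (or the step budget runs out).
    for _ in range(max_steps):
        control, action, done = act_agent_step(user_request, plan, app.screenshot(), app.controls())
        app.execute(control, action)   # translated by the control interaction module
        if done:
            break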

The framework is highly extensible and allows users to create custom actions and controls for specific tasks and applications. The proposed model is evaluated on a wide range of user requests to analyze its performance; the model demonstrated successful results on almost every task in Windows applications, highlighting its versatility and potential to increase user productivity.

In conclusion, the proposed model efficiently interacts with Windows applications through natural language commands. By leveraging GPT-Vision and a dual-agent framework, UFO demonstrates superior effectiveness in navigating and operating within Windows applications to fulfill user requests.

How Axfood enables accelerated machine learning throughout the organiz …

This is a guest post written by Axfood AB. 
In this post, we share how Axfood, a large Swedish food retailer, improved operations and scalability of their existing artificial intelligence (AI) and machine learning (ML) operations by prototyping in close collaboration with AWS experts and using Amazon SageMaker.
Axfood is Sweden’s second largest food retailer, with over 13,000 employees and more than 300 stores. Axfood has a structure with multiple decentralized data science teams with different areas of responsibility. Together with a central data platform team, the data science teams bring innovation and digital transformation through AI and ML solutions to the organization. Axfood has been using Amazon SageMaker to cultivate their data using ML and has had models in production for many years. Lately, the level of sophistication and the sheer number of models in production are increasing exponentially. However, even though the pace of innovation is high, the different teams had developed their own ways of working and were in search of a new MLOps best practice.
Our challenge
To stay competitive in terms of cloud services and AI/ML, Axfood chose to partner with AWS and has been collaborating with them for many years.
During one of our recurring brainstorming sessions with AWS, we were discussing how to best collaborate across teams to increase the pace of innovation and efficiency of data science and ML practitioners. We decided to put in a joint effort to build a prototype on a best practice for MLOps. The aim of the prototype was to build a model template for all data science teams to build scalable and efficient ML models—the foundation for a new generation of AI and ML platforms for Axfood. The template should bridge and combine best practices from AWS ML experts and company-specific best practice models—the best of both worlds.
We decided to build a prototype from one of the most developed ML models currently in use within Axfood: forecasting sales in stores, more specifically the forecast of fruit and vegetable sales for upcoming campaigns in food retail stores. Accurate daily forecasting supports the stores’ ordering process and increases sustainability: by accurately predicting the needed in-store stock levels, sales are optimized and food waste is minimized. This was the perfect place to start for our prototype—not only would Axfood gain a new AI/ML platform, but we would also get a chance to benchmark our ML capabilities and learn from leading AWS experts.
Our solution: A new ML template on Amazon SageMaker Studio
Building a full ML pipeline that is designed for an actual business case can be challenging. In this case, we are developing a forecasting model, so there are two main steps to complete:

Train the model to make predictions using historical data.
Apply the trained model to make predictions of future events.

In Axfood’s case, a well-functioning pipeline for this purpose was already set up using SageMaker notebooks and orchestrated by the third-party workflow management platform Airflow. However, there are many clear benefits of modernizing our ML platform and moving to Amazon SageMaker Studio and Amazon SageMaker Pipelines. Moving to SageMaker Studio provides many predefined out-of-the-box features:

Monitoring model and data quality as well as model explainability
Built-in integrated development environment (IDE) tools such as debugging
Cost/performance monitoring
Model acceptance framework
Model registry

However, the most important incentive for Axfood is the ability to create custom project templates using Amazon SageMaker Projects to be used as a blueprint for all data science teams and ML practitioners. The Axfood team already had a robust and mature level of ML modeling, so the main focus was on building the new architecture.
Solution overview
Axfood’s proposed new ML framework is structured around two main pipelines: the model build pipeline and the batch inference pipeline:

These pipelines are versioned within two separate Git repositories: one build repository and one deploy (inference) repository. Together, they form a robust pipeline for forecasting fruits and vegetables.
The pipelines are packaged into a custom project template using SageMaker Projects in integration with a third-party Git repository (Bitbucket) and Bitbucket pipelines for continuous integration and continuous deployment (CI/CD) components.
The SageMaker project template includes seed code corresponding to each step of the build and deploy pipelines (we discuss these steps in more detail later in this post) as well as the pipeline definition—the recipe for how the steps should be run.
Automation of building new projects based on the template is streamlined through AWS Service Catalog, where a portfolio is created, serving as an abstraction for multiple products.
Each product translates into an AWS CloudFormation template, which is deployed when a data scientist creates a new SageMaker project with our MLOps blueprint as the foundation. This activates an AWS Lambda function that creates a Bitbucket project with two repositories—model build and model deploy—containing the seed code.

The following diagram illustrates the solution architecture. Workflow A depicts the intricate flow between the two model pipelines—build and inference. Workflow B shows the flow to create a new ML project.

Model build pipeline
The model build pipeline orchestrates the model’s lifecycle, beginning with preprocessing, moving through training, and culminating in registration in the model registry:

Preprocessing – Here, the SageMaker ScriptProcessor class is employed for feature engineering, resulting in the dataset the model will be trained on.
Training and batch transform – Custom training and inference containers from SageMaker are harnessed to train the model on historical data and create predictions on the evaluation data using a SageMaker Estimator and Transformer for the respective tasks.
Evaluation – The trained model undergoes evaluation by comparing the generated predictions on the evaluation data to the ground truth using ScriptProcessor.
Baseline jobs – The pipeline creates baselines based on statistics in the input data. These are essential for monitoring data and model quality, as well as feature attributions.
Model registry – The trained model is registered for future use. Designated data scientists then approve the model for deployment to production.
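To make this concrete, here is a minimal, hypothetical sketch of the preprocessing and training steps using the SageMaker Pipelines SDK. Script names, container image URIs, instance types, and the bucket are illustrative assumptions rather than Axfood’s actual configuration.

```python
from sagemaker.processing import ScriptProcessor, ProcessingOutput
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.pipeline import Pipeline

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# Preprocessing: feature engineering with a ScriptProcessor in a custom container.
preprocessor = ScriptProcessor(
    image_uri="<custom-preprocessing-image>",  # assumed custom container
    command=["python3"],
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
preprocess_step = ProcessingStep(
    name="Preprocess",
    processor=preprocessor,
    code="preprocessing.py",  # hypothetical feature-engineering script
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

# Training: a custom training container wrapped in a SageMaker Estimator.
estimator = Estimator(
    image_uri="<custom-training-image>",  # assumed custom container
    role=role,
    instance_type="ml.m5.2xlarge",
    instance_count=1,
    output_path="s3://my-bucket/model-artifacts/",  # placeholder bucket
)
train_step = TrainingStep(
    name="Train",
    estimator=estimator,
    inputs={"train": TrainingInput(
        s3_data=preprocess_step.properties.ProcessingOutputConfig
        .Outputs["train"].S3Output.S3Uri)},
)

pipeline = Pipeline(name="ForecastModelBuild", steps=[preprocess_step, train_step])
# pipeline.upsert(role_arn=role); pipeline.start()
```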

For production environments, data ingestion and trigger mechanisms are managed via a primary Airflow orchestration. Meanwhile, during development, the pipeline is activated each time a new commit is introduced to the model build Bitbucket repository. The following figure visualizes the model build pipeline.

Batch inference pipeline
The batch inference pipeline handles the inference phase, which consists of the following steps:

Preprocessing – Data is preprocessed using ScriptProcessor.
Batch transform – The model uses the custom inference container with a SageMaker Transformer and generates predictions given the input preprocessed data. The model used is the latest approved trained model in the model registry.
Postprocessing – The predictions undergo a series of postprocessing steps using ScriptProcessor.
Monitoring – Continuous surveillance performs checks for drift related to data quality, model quality, and feature attribution.

If discrepancies arise, business logic within the postprocessing script assesses whether retraining the model is necessary. The pipeline is scheduled to run at regular intervals.
The following diagram illustrates the batch inference pipeline. Workflow A corresponds to preprocessing, data quality and feature attribution drift checks, inference, and postprocessing. Workflow B corresponds to model quality drift checks. These pipelines are divided because the model quality drift check will only run if new ground truth data is available.
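As a rough illustration of the batch transform step in this pipeline, the following sketch runs a SageMaker batch transform job against an already approved model; the model name, bucket, and data format are assumptions.

```python
from sagemaker.transformer import Transformer

# Assumes the latest approved model from the registry has already been resolved
# to a SageMaker model name (e.g. via the model registry APIs).
transformer = Transformer(
    model_name="fruit-veg-forecast-latest-approved",  # hypothetical model name
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path="s3://my-bucket/forecast-predictions/",  # placeholder bucket
    accept="text/csv",
)

# Run batch inference on the preprocessed input data.
transformer.transform(
    data="s3://my-bucket/preprocessed/inference-input.csv",  # placeholder input
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```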

SageMaker Model Monitor
With Amazon SageMaker Model Monitor integrated, the pipelines benefit from real-time monitoring on the following:

Data quality – Monitors any drift or inconsistencies in data
Model quality – Watches for any fluctuations in model performance
Feature attribution – Checks for drift in feature attributions

Monitoring model quality requires access to ground truth data. Although obtaining ground truth can be challenging at times, using data or feature attribution drift monitoring serves as a competent proxy for model quality.
Specifically, in the case of data quality drift, the system watches out for the following:

Concept drift – This pertains to changes in the correlation between input and output, requiring ground truth
Covariate shift – Here, the emphasis is on alterations in the distribution of independent input variables

SageMaker Model Monitor’s data drift functionality meticulously captures and scrutinizes the input data, deploying rules and statistical checks. Alerts are raised whenever anomalies are detected.
In parallel to using data quality drift checks as a proxy for monitoring model degradation, the system also monitors feature attribution drift using the normalized discounted cumulative gain (NDCG) score. This score is sensitive both to changes in the feature attribution ranking order and to the raw attribution scores of features. By monitoring drift in attribution for individual features and their relative importance, it’s straightforward to spot degradation in model quality.
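For intuition only, the following sketch shows one way an NDCG-style score over feature attributions could be computed, comparing a baseline attribution ranking with live attributions; the actual Model Monitor implementation may differ.

```python
import numpy as np

def ndcg_attribution_drift(baseline_attributions: dict, live_attributions: dict) -> float:
    """Score how well the live attribution ranking preserves the baseline ranking.
    A value of 1.0 means the ordering, weighted by baseline importance, is unchanged."""
    # Rank features by their live attribution, strongest first.
    ranked_features = sorted(live_attributions, key=live_attributions.get, reverse=True)
    # DCG: baseline importances discounted by their position in the live ranking.
    dcg = sum(baseline_attributions[f] / np.log2(i + 2)
              for i, f in enumerate(ranked_features))
    # Ideal DCG: baseline importances in their own ideal order.
    ideal = sorted(baseline_attributions.values(), reverse=True)
    idcg = sum(v / np.log2(i + 2) for i, v in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: the live ranking differs from the baseline, so the score drops below 1.
baseline = {"price": 0.5, "weekday": 0.3, "campaign": 0.2}
live = {"price": 0.2, "weekday": 0.5, "campaign": 0.3}
print(ndcg_attribution_drift(baseline, live))
```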
Model explainability
Model explainability is a pivotal part of ML deployments, because it ensures transparency in predictions. For a detailed understanding, we use Amazon SageMaker Clarify.
It offers both global and local model explanations through a model-agnostic feature attribution technique based on the Shapley value concept. This is used to decode why a particular prediction was made during inference. Such explanations, which are inherently contrastive, can vary based on different baselines. SageMaker Clarify aids in determining this baseline using K-means or K-prototypes in the input dataset, which is then added to the model build pipeline. This functionality enables us to build generative AI applications in the future for increased understanding of how the model works.
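To illustrate the idea of a K-means summarized baseline for Shapley-value explanations, here is a small sketch using the open-source shap package rather than SageMaker Clarify itself; the model and data are placeholders standing in for the forecasting use case.

```python
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder tabular data and model.
X, y = make_regression(n_samples=500, n_features=8, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# Summarize the input data into K cluster centers to serve as the explanation baseline.
background = shap.kmeans(X, 10)

# Model-agnostic Shapley-value explainer computed against that baseline.
explainer = shap.KernelExplainer(model.predict, background)
local_explanations = explainer.shap_values(X[:5])       # local attributions for 5 rows
global_importance = abs(local_explanations).mean(axis=0)  # simple global view
print(global_importance)
```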
Industrialization: From prototype to production
The MLOps project includes a high degree of automation and can serve as a blueprint for similar use cases:

The infrastructure can be reused entirely, whereas the seed code can be adapted for each task, with most changes limited to the pipeline definition and the business logic for preprocessing, training, inference, and postprocessing.
The training and inference scripts are hosted using SageMaker custom containers, so a variety of models can be accommodated without changes to the data and model monitoring or model explainability steps, as long as the data is in tabular format.

After finishing the work on the prototype, we turned to how we should use it in production. To do so, we felt the need to make some additional adjustments to the MLOps template:

The original seed code used in the prototype for the template included preprocessing and postprocessing steps run before and after the core ML steps (training and inference). However, when scaling up to use the template for multiple use cases in production, the built-in preprocessing and postprocessing steps may reduce generality and lead to duplicated code.
To improve generality and minimize repetitive code, we chose to slim down the pipelines even further. Instead of running the preprocessing and postprocessing steps as part of the ML pipeline, we run these as part of the primary Airflow orchestration before and after triggering the ML pipeline.
This way, use case-specific processing tasks are abstracted from the template, and what is left is a core ML pipeline performing tasks that are general across multiple use cases with minimal repetition of code. Parameters that differ between use cases are supplied as input to the ML pipeline from the primary Airflow orchestration.

The result: A rapid & efficient approach to model build & deployment
The prototype in collaboration with AWS has resulted in an MLOps template following current best practices that is now available for use to all of Axfood’s data science teams. By creating a new SageMaker project within SageMaker Studio, data scientists can get started on new ML projects quickly and seamlessly transition to production, allowing for more efficient time management. This is made possible by automating tedious, repetitive MLOps tasks as part of the template.
Furthermore, several new functionalities have been added in an automated fashion to our ML setup. These gains include:

Model monitoring – We can perform drift checks for model and data quality as well as model explainability
Model and data lineage – It’s now possible to trace exactly which data has been used for which model
Model registry – This helps us catalog models for production and manage model versions

Conclusion
In this post, we discussed how Axfood improved operations and scalability of our existing AI and ML operations in collaboration with AWS experts and by using SageMaker and its related products.
These improvements will help Axfood’s data science teams build ML workflows in a more standardized way and will greatly simplify analysis and monitoring of models in production, ensuring the quality of ML models built and maintained by our teams.
Please leave any feedback or questions in the comments section.

About the Authors
Dr. Björn Blomqvist is the Head of AI Strategy at Axfood AB. Before joining Axfood AB he led a team of Data Scientists at Dagab, a part of Axfood, building innovative machine learning solutions with the mission to provide good and sustainable food to people all over Sweden. Born and raised in the north of Sweden, in his spare time Björn ventures to snowy mountains and open seas.
Oskar Klang is a Senior Data Scientist at the analytics department at Dagab, where he enjoys working with everything analytics and machine learning, e.g. optimizing supply chain operations, building forecasting models and, more recently, GenAI applications. He is committed to building more streamlined machine learning pipelines, enhancing efficiency and scalability.
Pavel Maslov is a Senior DevOps and ML engineer in the Analytic Platforms team. Pavel has extensive experience in the development of frameworks, infrastructure, and tools in the domains of DevOps and ML/AI on the AWS platform. Pavel has been one of the key players in building the foundational capability within ML at Axfood.
Joakim Berg is the Team Lead and Product Owner Analytic Platforms, based in Stockholm, Sweden. He is leading a team of Data Platform and DevOps/MLOps engineers providing data and ML platforms for the data science teams. Joakim has many years of experience leading senior development and architecture teams from different industries.

21 High-Powered Takeaways from Ad Universe Summit

On February 22, the Ad Universe Summit took the digital marketing world by storm, packing a punch with its 5-hour marathon of cutting-edge advertising wisdom. 

Bringing together 22 top-notch advertising experts, we dove deep into everything from crafting killer ad creatives to mastering Facebook and Google ads, not to mention YouTube and LinkedIn strategies. 

It wasn’t just about the big names, though. With over 2000 registrants, the summit truly was buzzing with enthusiastic marketers eager to level up their game. 

Now, we could go on for days and days about the sessions and the information provided, but we don’t have that kind of time and neither do you.

Instead, let’s look at 21 key takeaways from the summit that you can add to your marketing strategies right now.


1. Don’t Rent. Own Your Ad Pixel – Larry Kim

CEO, Customers.ai

Until not too long ago, the standard approach was to rely on ad platforms for pixel tracking. You’d grab their pixel, slot it into your site, and let it feed data back to their servers. 

But this method has its drawbacks: match rates often disappoint, data enrichment is non-existent, and the decline of cookies is shrinking audience sizes.

The game-changer? Bringing your own data to the table and directly supplying it to the ad platforms.

2. Go All in on Influencer & User Generated Content – Ezra Firestone

Founder of BOOM Beauty

Want an old-school strategy that not only works but scales? It’s time to fully embrace influencer and user-generated content. 

That means allocating at least 10% of your marketing budget to influencer efforts. These images and videos make super-effective ads, enrich email content, and enhance blogs.

3. Combine First-Party Data & Audience Signals – Areen Mayela

Vice President, Paid Media Strategy at Hawke Media

Audience signals are one of the few levers we have to drive growth with Performance Max. When it comes to these audience signals, a mistake people make is that they use very broad data indicators – affinity and in-market audiences. 

You should be using your first-party, high-value audience data (think customer lists); that is where advertisers are seeing the most growth in PMax.

4. Protect Your Geotargeting with Exclusions – Navah Hopkins

Brand Evangelist at Optmyzr

Google recently made adjustments to zip code targeting, and zip codes are starting to be considered personal information. We know what that means…there is a good chance zip code targeting will be removed.

So if you need to target a narrow geographic location, make sure you use exclusions to do so. It’ll essentially protect the targeting you have in place.

5. Create Trust Through Localized Experiences – Mark Yeramian

Co-Founder, CEO of Moast

Localized experiences are helpful to connect customers and build trust with your brand. They can also be really powerful, especially for certain product categories.

For example, if I am looking for outdoor furniture, I might want to connect with people who own that in my area. Does it hold up to Seattle rain? What about the heat and humidity of the northeast? Make it easy for your customers to find localized reviews and user-generated content.

6. Forget 80/20. It’s 95/5 for B2B Marketers  – Andrea Cruz

Sr. Director, Client Strategy at Tinuiti

It’s time to get real – it’s no longer an 80/20 world for marketers. The reality is 95% of your prospects aren’t ready to buy. Does that mean you shouldn’t target them? Not at all. We need to stay top of mind for those who aren’t yet ready to buy while also being mindful of our budgets. 

The solution is simple – make sure your audiences are highly targeted. Go beyond job titles and layer in as much data as you can. 

7. Microtarget Your Audience with Twitter Ads – Dennis Yu

CEO of BlitzMetrics

Want to capture the attention of your audience and not blow through your budget? With the right content and promotion tactics, Twitter can be a great place to do just that. 

Twitter allows you to create really narrow audiences through geotargeting. Want to target just one company? Run an ad targeted to the location of their corporate office. 

8. Combine AI and First-Party Data for Amped Campaigns – Mati Ram

CEO of AdScale

First-party data is king. By pairing it with AI, marketers can elevate their campaigns to unprecedented heights. 

Leverage AI to pinpoint growth opportunities and diagnose barriers, craft highly targeted campaigns, and optimize budgets and bids around the clock across all channels.

9. Use AI for Unlimited Ad Creative – Nick Shackelford

Chief Revenue Officer of Structured

Creative is often the hardest part of ads. It’s time-consuming, requires design skills, and can be really hard to scale. But what if you could create great ads and scale creative at the same time?

You can with AI, the right tools (Figma, Google Sheets, GPT), and some simple configurations. Honestly, it’s hard to put this process into words.

Watch the replay for more details from Nick.

10. Reduce Bounce Rates with Mirrored Landing Pages – Steph Carcamo

Partner Marketing Manager at Justuno

How many times have you clicked on an ad only to be taken to a generic landing page? As a user it’s frustrating. As an advertiser, it’s a bad look.

Create a cohesive experience for your customers by mirroring your landing pages and ads. Mirror the offer, the CTA, and make sure any other exclusions noted in the ad are repeated on the landing page. This type of mirroring can reduce bounce rates by 20% and improve time to sale. 

11. Get the Spam Out of Your PMax Campaigns – Menachem Ani

Founder, CEO of JXT Group

The key to PMax is giving the system the right audience signals. Because there is no pure audience targeting with PMax, you need to give the system data on who your target customer is. Remember, Google doesn’t know what is good and what is bad so the most important thing you can do is give it your data, whether through a CRM connection or uploading audience data. 

To prevent spam and avoid the “feedback loop of doom,” make sure to install a spam filter on your landing page so conversion tracking never gets triggered.

12. Boost Your Budget with Thought Leader Ads – AJ Wilcox

CEO of B2Linked.com

One of LinkedIn’s newest ad formats, thought leader ads are low cost, high visibility, and they drive results! 

The reason? Thought leader ads come from real people vs. a company page. When people see the post in their feed, they are more likely to stop and actually engage vs. scrolling past.

Thought leader ads can be used for all stages of the funnel and audience targeting is the same as other ad types.

13. Reward High Value Customers with Special Offers – Parya Behrouzian

Sr. Director of Revenue Marketing at AdRoll

Zero and first party data is essential to personalization and loyalty. Using the data you have and the tools in your tech stack, create personalized ads that can be served to your best customers. 

Offer exclusive promotions and rewards to motivate repeat shoppers to keep coming back or sign up for a subscription.

The result? A better customer experience, increased sales, and increased LTV.

14. Create Custom Rules to Avoid Google Partner Network – Greg Kohler

Sr. Digital Marketing Manager at ServiceMaster

By creating custom rules in Google Ads editor, you can scale rules across campaigns. For example, if you want to ensure your ads aren’t shown on the Google Partner or Display network, a custom rule can be put in place, preventing it from showing. 

The same thing can be done for things like location targeting and even broad match. Custom rules make life easier and help scale campaigns. 

15. Use AI to Test New Markets – Jason Dodge

Founder, CEO of BlackTruck Media

When testing new markets, you can spend hours upon hours analyzing data to determine if it’s a fit. With some basic prompt engineering, you can cut down on that time. 

For example, a hotel company might want to know the most popular destinations from a particular airport to determine potential new locations. Instead of spending hours gathering info, AI tools like GPT can not only give you the data quickly, it can also layer on affinity data. More data = better.

16. Use Modular Creative to 10x Ad Output – Maxwell Finn

President at Unicorn Traffic

Ad iteration and testing is critical for ad success. Unfortunately, it can also be time consuming. Not with the Ford method.

The Ford method works by iterating on successful creative. The goal is to not just find a winning ad and move on but instead, to make what’s working, work better. By making your creative modular (think elements like hook, CTA, body, text overlay, etc.), you can change small variables for continuous improvement.

17. Focus on Your CAC and LTV Ratio – John Moran

Co-Founder of Solutions 8

While platforms like Google and Facebook are good ad platforms, they aren’t good measurement platforms. And with the deprecation of cookies, attribution isn’t going to get better. 

To get a better sense of how your ads are performing and determine where to put your budget, look at your CAC to LTV by channel and holistically. You can more quickly understand how much money you can afford to spend and know where to spend it.

18. Turn Your One Hit Wonders into Loyal Customers – Sabir Semerkant

Founder of Growth by Sabir

Turning one-time buyers (or as Sabir calls them, one hit wonders) into loyal customers is every company’s dream. Repeat customers cost less than new customers and they tend to be higher value.

To turn your one hit wonders into loyal customers, you need to treat them as their own audience, separate from your repeat customer base. Identify your one hit wonders, segment them into their own audiences, and create hyper-personalized messaging. Remember, personalization is the key. Without it, you won’t see the results you want.

19. Use Financial Day Trading Zone Strategy to Optimize Budgets – Scott Desgrosseilliers

CEO of Wicked Reports

Hitting your budget sweet spot can be tricky. Each day you check your goals, make adjustments, and it can easily result in overreactions.

The better way is to set a range, similar to day trading, and automate the actions based on that range. For example, if you have Meta ads running, you can set a rule that if your CAC hits the upper part of the range, you reduce budget but if it hits the lower part of the range, you up it. The result? A more efficient and effective budget strategy.

20. Focus on Profit Margin, Not ROAS – Saptarshi Nath

Co-Founder & CEO of Airboxr

Advertisers often get caught looking strictly at ROAS and ignoring actual profits.

After all, not all products have the same profit margin. Some may have 10% margins while others have 80%.

Take a look at which products with ad spend are driving sales and have a higher profit margin. Those are the products you want to put more budget towards.

21. Leverage YouTube Reach for Growth – Justin Buckley

Co-Founder of ATTN Agency

Vertical videos are everything right now and platforms are battling for the views. Upload the shorts you’re creating for TikTok and Meta into YouTube. 

They are cost-effective (Justin is seeing as low as ONE cent short placements), they can be targeted, and they can drive real engagement. Even better, if you hook up your Google Ads to creator/influencer channels, you can access their audience insights and build retargeting audiences.  

Whew! Is Your Brain Full Yet?

These are just a few of the many tips and tricks that were shared at the Ad Universe Summit. Be on the lookout for more summit content and if you missed it, be sure to register for the replay!

If you want to learn more about Customers.ai and how we can help take your ad campaigns to the next level, talk to our sales team or sign up and start your free trial!


The post 21 High-Powered Takeaways from Ad Universe Summit appeared first on Customers.ai.

Google DeepMind Introduces Round-Trip Correctness for Assessing Large …

The advent of code-generating Large Language Models (LLMs) has marked a significant leap forward. These models, capable of understanding and generating code, are revolutionizing how developers approach coding tasks. From automating mundane tasks to fixing complex bugs, LLMs promise to reduce development time and improve code quality significantly. Accurately assessing these models’ capabilities remains a challenge. Evaluation benchmarks, while foundational, offer a narrow window into the vast landscape of software development, focusing primarily on basic programming tasks or limited data science applications. This narrow focus falls short of capturing developers’ diverse challenges, highlighting the need for a more comprehensive evaluation method.

Google DeepMind introduces Round-Trip Correctness (RTC), an innovative evaluation method that broadens the assessment horizon of code LLMs. Unlike conventional benchmarks that rely on manual curation of tasks, RTC adopts an unsupervised approach, enabling evaluations across a wider array of real-world software domains without requiring exhaustive manual effort. The essence of RTC lies in its unique evaluation framework, where a model predicts a coding task and its inverse, such as generating code from a description and vice versa. This method evaluates the model’s ability to maintain the semantic integrity of the original input throughout the round-trip, offering a nuanced measure of its understanding and generation capabilities.

By leveraging the model’s performance on both forward and reverse tasks, RTC assesses its code synthesis and editing proficiency, among other applications. This approach evaluates the model’s accuracy in generating semantically correct code and its effectiveness in understanding and interpreting code descriptions. The adaptability of RTC extends to various coding tasks and domains, showcasing its potential as a universal framework for model evaluation.
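Conceptually, a round-trip evaluation can be sketched as below. The describe, generate, and is_equivalent callables are hypothetical stand-ins for the forward model call, the backward model call, and the semantic-equivalence check (for example, passing the same unit tests); they are not part of the published method’s API.

```python
def round_trip_correctness(code_samples, describe, generate, is_equivalent,
                           n_descriptions=3, n_regenerations=3):
    """Estimate round-trip correctness for a set of code snippets.

    describe(code) -> natural-language description   (forward model call)
    generate(description) -> candidate code          (backward model call)
    is_equivalent(candidate, original) -> bool       (e.g. same unit tests pass)
    """
    scores = []
    for original in code_samples:
        successes, trials = 0, 0
        for _ in range(n_descriptions):
            description = describe(original)
            for _ in range(n_regenerations):
                candidate = generate(description)
                successes += int(is_equivalent(candidate, original))
                trials += 1
        scores.append(successes / trials)
    return sum(scores) / len(scores)
```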

Demonstrating a strong correlation with model performance on established narrow-domain benchmarks, RTC also reveals its capability to facilitate evaluations in a broader range of software domains. This comprehensive assessment is pivotal for developing LLMs that are more attuned to the multifaceted needs of software development. The insights gained from RTC evaluations are invaluable for guiding the evolution of code-generating models, ensuring they are robust, versatile, and aligned with real-world development challenges.

In conclusion, the introduction of Round-Trip Correctness as a method for evaluating code LLMs represents a significant advancement in the field. This method offers:

A comprehensive and unsupervised approach to model evaluation that extends beyond the limitations of traditional benchmarks.

The capability to assess models across a diverse spectrum of software domains, reflecting the real-world challenges of software development.

Insights into LLMs’ code generation and understanding capabilities, fostering the development of more effective and adaptable models.

By bridging the gap between narrow-domain benchmarks and the expansive needs of software development, RTC paves the way for the next generation of code-generating LLMs. These models promise to be more in tune with developers’ diverse needs, ultimately enhancing the efficiency and quality of software development processes.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Google DeepMind Introduces Round-Trip Correctness for Assessing Large Language Models appeared first on MarkTechPost.

Can We Drastically Reduce AI Training Costs? This AI Paper from MIT, P …

Training Large Language Models (LLMs) involves two main phases: pre-training on extensive datasets and fine-tuning for specific tasks. While pre-training requires significant computational resources, fine-tuning adds comparatively less new information to the model, making it more compressible. This pretrain-finetune paradigm has greatly advanced machine learning, allowing LLMs to excel in various tasks and adapt to individual needs, promising a future with highly specialized models tailored to specific requirements.

Various quantization techniques, such as rescaling activations, decomposing matrix multiplications, and iterative weight rounding, aim to reduce memory usage and latency in LLMs. Additionally, pruning methods induce sparsity by zeroing certain parameter values. Parameter-efficient fine-tuning (PEFT) approaches, like adapter layers and Low-Rank Adaptation (LoRA), reduce trainable parameters during fine-tuning, enhancing efficiency without sacrificing accuracy. These methods offer significant potential for compression-aware training and multi-tenant serving systems.

Researchers from the Massachusetts Institute of Technology, Princeton University, and Together AI have proposed BitDelta, which effectively quantizes fine-tuning deltas to 1 bit without sacrificing performance. This discovery suggests potential redundancy in fine-tuning information and has implications for multi-tenant serving and storage. By employing a high-precision base model alongside multiple 1-bit deltas, BitDelta significantly reduces GPU memory requirements by over 10×, thereby improving generation latency in multi-tenant environments.

BitDelta employs a two-stage process for efficient quantization of fine-tuning deltas in LLMs. Firstly, it quantizes each weight matrix delta into a binary matrix multiplied by a scaling factor, initialized as the average absolute value of the delta. Secondly, it calibrates scaling factors via model distillation over a small dataset, maintaining frozen binary matrices. BitDelta‘s efficiency allows for rapid compression of models, facilitating shared server usage and significantly reducing GPU memory consumption and inference latency.
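A minimal PyTorch sketch of the first stage, binarizing a weight delta with the scale initialized to the mean absolute delta, might look like the following; the distillation-based calibration of the scales is omitted, and the tensor shapes are illustrative.

```python
import torch

def binarize_delta(base_weight: torch.Tensor, finetuned_weight: torch.Tensor):
    """Compress W_finetuned - W_base into a sign matrix and a single scaling factor."""
    delta = finetuned_weight - base_weight
    scale = delta.abs().mean()   # per-matrix scaling factor, initialized to mean |delta|
    sign = torch.sign(delta)     # 1-bit matrix (values in {-1, 0, +1})
    return sign, scale

def reconstruct_weight(base_weight, sign, scale):
    """Approximate the fine-tuned weight from the base weight plus the 1-bit delta."""
    return base_weight + scale * sign

# Illustrative example with random tensors standing in for real weight matrices.
base = torch.randn(1024, 1024)
finetuned = base + 0.01 * torch.randn(1024, 1024)
sign, scale = binarize_delta(base, finetuned)
approx = reconstruct_weight(base, sign, scale)
print(scale.item(), (approx - finetuned).abs().mean().item())
```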

BitDelta is evaluated against original uncompressed models and 8-bit RTN and 4-bit GPTQ quantization methods. Across Llama-2 and Mistral model families, BitDelta consistently performs well on high-margin metrics, often outperforming baselines. It accurately preserves fine-tuned information, even surpassing GPTQ when applied to quantized base models, showcasing its effectiveness and versatility across different model sizes and fine-tuning techniques.

In conclusion, researchers from the Massachusetts Institute of Technology, Princeton University, and Together AI have proposed BitDelta, a simple yet powerful method for quantizing weight deltas in LLMs down to 1 bit, efficiently representing multiple fine-tuned models with one base model and multiple deltas. BitDelta achieves minimal performance degradation through distillation-based calibration while significantly reducing GPU memory requirements and improving generation latency. This approach paves the way for more efficient model deployment and resource utilization in machine learning applications.

Check out the Paper and Github. All credit for this research goes to the researchers of this project.

The post Can We Drastically Reduce AI Training Costs? This AI Paper from MIT, Princeton, and Together AI Unveils How BitDelta Achieves Groundbreaking Efficiency in Machine Learning appeared first on MarkTechPost.

Scaling Up LLM Agents: Unlocking Enhanced Performance Through Simplici …

While large language models (LLMs) excel in many areas, they can struggle with complex tasks that require precise reasoning. Recent solutions often focus on sophisticated ensemble methods or frameworks where multiple LLM agents collaborate. These approaches certainly improve performance, but they add layers of complexity. However, what if a simpler strategy could lead to significant gains?

This work investigates a fascinating phenomenon: the potential to improve LLM performance simply by scaling up the number of agents used. It introduces a remarkably straightforward method – sampling and voting – that involves generating multiple outputs from LLMs and using majority voting to decide the final response. Let’s dive into the details.

Reference: https://arxiv.org/pdf/2402.05120.pdf

The Sampling-and-Voting Method

At its core, the sampling-and-voting method is refreshingly simple and comprises two phases (See Fig. 2):

Sampling: The task query is repeatedly fed into an LLM (or a framework with multiple LLM agents), generating multiple outputs (samples).

Voting: Majority voting determines the final answer. For closed-ended tasks (e.g., multiple choice), this involves counting the frequency of each option. For open-ended tasks (e.g., code generation), similarity measures like BLEU score are used to rank samples. The sample with the highest similarity to others wins.

This process (Algorithm 1) is elegantly agnostic, making it a potent plug-in to enhance existing LLM techniques.
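As a minimal sketch (not the authors’ reference implementation), the two phases can be written as follows; query_llm and similarity are hypothetical callables standing in for an LLM sampling call and a similarity measure such as BLEU.

```python
from collections import Counter

def sample_and_vote(query_llm, task_prompt: str, num_agents: int = 10) -> str:
    """Closed-ended tasks: sample num_agents answers and return the majority vote."""
    samples = [query_llm(task_prompt) for _ in range(num_agents)]
    answer, _ = Counter(samples).most_common(1)[0]
    return answer

def vote_open_ended(samples, similarity):
    """Open-ended tasks: pick the sample most similar to all the other samples."""
    return max(samples,
               key=lambda s: sum(similarity(s, other) for other in samples if other is not s))
```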

The method’s efficacy is extensively evaluated across the following three tasks:

Arithmetic Reasoning: GSM8K and the challenging MATH dataset

General Reasoning: MMLU and a chess state tracking task

Code Generation: HumanEval dataset

To explore the range of benefits, the authors tested language models of varying scales, including Llama2, GPT-3.5-Turbo, and GPT-4.

To test how well the method plays with other methods, it was combined with diverse techniques:

Prompt Engineering: Integrating with Chain-of-Thought (CoT), Zero-Shot Cot, and Solo Performance Prompting.

Multiple LLM Agents Collaboration: Used in conjunction with debate-style (LLM-Debate) and self-reflection methods.

The results offer compelling insights:

Performance Scaling: Increasing the number of agents generally boosts LLM performance across tasks and models of varying sizes. Surprisingly, smaller LLMs, when scaled up, often rival or outperform larger counterparts (Fig. 1).

Compatibility: The method combines seamlessly with other techniques, leading to even greater performance gains.

Simplicity vs. Complexity: In most cases, the proposed method alone achieves results on par with more complex approaches, suggesting power in its straightforward design.

Thorough experiments demonstrate the method’s consistency across hyperparameters (Fig. 4) and reveal a key point: performance gains positively correlate with task difficulty (Table 5). To unpack this relationship, three dimensions of difficulty are isolated:

Inherent Difficulty: Gains first increase and then decrease as problems become extremely complex.

Number of Steps: Gains become more pronounced as the steps needed to solve the task increase.

Prior Probability: Performance improves when the likelihood of a correct answer is higher.

These findings inspired optimizations like stepwise or hierarchical sampling-and-voting, maximizing gains through a nuanced understanding of task difficulty.

In conclusion, this work establishes a new benchmark, demonstrating that sometimes, ‘more agents’ may indeed be all you need. In many cases, scaling up LLM agents with a simple sampling-and-voting strategy significantly improves performance without intricate methods. This discovery simplifies complex LLM applications and paves the way for cost-optimization of future systems, a focus of ongoing research.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Scaling Up LLM Agents: Unlocking Enhanced Performance Through Simplicity appeared first on MarkTechPost.

Techniques and approaches for monitoring large language models on AWS

Large Language Models (LLMs) have revolutionized the field of natural language processing (NLP), improving tasks such as language translation, text summarization, and sentiment analysis. However, as these models continue to grow in size and complexity, monitoring their performance and behavior has become increasingly challenging.
Monitoring the performance and behavior of LLMs is a critical task for ensuring their safety and effectiveness. Our proposed architecture provides a scalable and customizable solution for online LLM monitoring, enabling teams to tailor the monitoring solution to their specific use cases and requirements. By using AWS services, our architecture provides real-time visibility into LLM behavior and enables teams to quickly identify and address any issues or anomalies.
In this post, we demonstrate a few metrics for online LLM monitoring and their respective architecture for scale using AWS services such as Amazon CloudWatch and AWS Lambda. This offers a customizable solution beyond what is possible with model evaluation jobs with Amazon Bedrock.
Overview of solution
The first thing to consider is that different metrics require different computation considerations. A modular architecture, where each module can intake model inference data and produce its own metrics, is necessary.
We suggest that each module take incoming inference requests to the LLM, passing prompt and completion (response) pairs to metric compute modules. Each module is responsible for computing its own metrics with respect to the input prompt and completion (response). These metrics are passed to CloudWatch, which can aggregate them and work with CloudWatch alarms to send notifications on specific conditions. The following diagram illustrates this architecture.

Fig 1: Metric compute module – solution overview

The workflow includes the following steps:

A user makes a request to Amazon Bedrock as part of an application or user interface.
Amazon Bedrock saves the request and completion (response) in Amazon Simple Storage Service (Amazon S3) as per the configuration of invocation logging.
The file saved on Amazon S3 creates an event that triggers a Lambda function. The function invokes the modules.
The modules post their respective metrics to CloudWatch metrics.
Alarms can notify the development team of unexpected metric values.

The second thing to consider when implementing LLM monitoring is choosing the right metrics to track. Although there are many potential metrics that you can use to monitor LLM performance, we explain some of the broadest ones in this post.
In the following sections, we highlight a few of the relevant module metrics and their respective metric compute module architecture.
Semantic similarity between prompt and completion (response)
When running LLMs, you can intercept the prompt and completion (response) for each request and transform them into embeddings using an embedding model. Embeddings are high-dimensional vectors that represent the semantic meaning of the text. Amazon Titan provides such models through Titan Embeddings. By taking a distance such as cosine between these two vectors, you can quantify how semantically similar the prompt and completion (response) are. You can use SciPy or scikit-learn to compute the cosine distance between vectors. The following diagram illustrates the architecture of this metric compute module.

Fig 2: Metric compute module – semantic similarity

This workflow includes the following key steps:

A Lambda function receives a streamed message via Amazon Kinesis containing a prompt and completion (response) pair.
The function gets an embedding for both the prompt and completion (response), and computes the cosine distance between the two vectors.
The function sends that information to CloudWatch metrics.
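A condensed sketch of the embedding and distance computation inside such a function is shown below, assuming Amazon Titan embeddings are called through Amazon Bedrock; the model ID, metric namespace, and event shape are assumptions, and Kinesis record parsing and error handling are omitted.

```python
import json
import boto3
from scipy.spatial.distance import cosine

bedrock = boto3.client("bedrock-runtime")
cloudwatch = boto3.client("cloudwatch")

def embed(text: str) -> list:
    """Get a Titan embedding for the given text (model ID is an assumption)."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def handler(event, context):
    # Simplified: assumes the prompt/completion pair has already been extracted.
    prompt, completion = event["prompt"], event["completion"]
    distance = cosine(embed(prompt), embed(completion))
    cloudwatch.put_metric_data(
        Namespace="LLM/Monitoring",  # assumed namespace
        MetricData=[{"MetricName": "PromptCompletionCosineDistance",
                     "Value": float(distance)}],
    )
    return {"cosine_distance": float(distance)}
```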

Sentiment and toxicity
Monitoring sentiment allows you to gauge the overall tone and emotional impact of the responses, whereas toxicity analysis provides an important measure of the presence of offensive, disrespectful, or harmful language in LLM outputs. Any shifts in sentiment or toxicity should be closely monitored to ensure the model is behaving as expected. The following diagram illustrates the metric compute module.

Fig 3: Metric compute module – sentiment and toxicity

The workflow includes the following steps:

A Lambda function receives a prompt and completion (response) pair through Amazon Kinesis.
Through AWS Step Functions orchestration, the function calls Amazon Comprehend to detect the sentiment and toxicity.
The function saves the information to CloudWatch metrics.

For more information about detecting sentiment and toxicity with Amazon Comprehend, refer to Build a robust text-based toxicity predictor and Flag harmful content using Amazon Comprehend toxicity detection.
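A condensed sketch of the Comprehend calls such a module might make is shown below; in the described solution these calls are orchestrated through Step Functions, and the metric namespace and event handling here are simplified assumptions.

```python
import boto3

comprehend = boto3.client("comprehend")
cloudwatch = boto3.client("cloudwatch")

def publish_sentiment_and_toxicity(completion_text: str):
    """Detect sentiment and toxicity for one completion and publish them as metrics."""
    sentiment = comprehend.detect_sentiment(Text=completion_text, LanguageCode="en")
    toxicity = comprehend.detect_toxic_content(
        TextSegments=[{"Text": completion_text}], LanguageCode="en"
    )
    cloudwatch.put_metric_data(
        Namespace="LLM/Monitoring",  # assumed namespace
        MetricData=[
            {"MetricName": "PositiveSentimentScore",
             "Value": sentiment["SentimentScore"]["Positive"]},
            {"MetricName": "Toxicity",
             "Value": toxicity["ResultList"][0]["Toxicity"]},
        ],
    )
```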
Ratio of refusals
An increase in refusals, such as when an LLM denies completion due to lack of information, could mean that either malicious users are trying to use the LLM in ways that are intended to jailbreak it, or that users’ expectations are not being met and they are getting low-value responses. One way to gauge how often this is happening is by comparing standard refusals from the LLM model being used with the actual responses from the LLM. For example, the following are some of Anthropic’s Claude v2 LLM common refusal phrases:
“Unfortunately, I do not have enough context to provide a substantive response. However, I am an AI assistant created by Anthropic to be helpful, harmless, and honest.”
“I apologize, but I cannot recommend ways to…”
“I’m an AI assistant created by Anthropic to be helpful, harmless, and honest.”
On a fixed set of prompts, an increase in these refusals can be a signal that the model has become overly cautious or sensitive. The inverse case should also be evaluated: a decrease in refusals could be a signal that the model is now more prone to engage in toxic or harmful conversations.
To help monitor model integrity and the model refusal ratio, we can compare the response with a set of known refusal phrases from the LLM. This could also be an actual classifier that can explain why the model refused the request. You can take the cosine distance between the response and known refusal responses from the model being monitored. The following diagram illustrates this metric compute module.

Fig 4: Metric compute module – ratio of refusals

The workflow consists of the following steps:

A Lambda function receives a prompt and completion (response) and gets an embedding from the response using Amazon Titan.
The function computes the cosine or Euclidean distance between the response and existing refusal prompts cached in memory.
The function sends that average to CloudWatch metrics.

Another option is to use fuzzy matching for a straightforward but less powerful approach to compare the known refusals to LLM output. Refer to the Python documentation for an example.
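As a small sketch of that fuzzy-matching variant using Python’s standard-library difflib, with the Claude refusal phrases quoted earlier and an arbitrary threshold chosen for illustration:

```python
from difflib import SequenceMatcher

KNOWN_REFUSALS = [
    "Unfortunately, I do not have enough context to provide a substantive response.",
    "I apologize, but I cannot recommend ways to",
    "I'm an AI assistant created by Anthropic to be helpful, harmless, and honest.",
]

def looks_like_refusal(completion: str, threshold: float = 0.6) -> bool:
    """Flag a completion whose best fuzzy match against known refusals exceeds the threshold."""
    best = max(SequenceMatcher(None, completion, phrase).ratio()
               for phrase in KNOWN_REFUSALS)
    return best >= threshold
```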
Summary
LLM observability is a critical practice for ensuring the reliable and trustworthy use of LLMs. Monitoring, understanding, and ensuring the accuracy and reliability of LLMs can help you mitigate the risks associated with these AI models. By monitoring hallucinations, bad completions (responses), and prompts, you can make sure your LLM stays on track and delivers the value you and your users are looking for. In this post, we discussed a few metrics to showcase examples.
For more information about evaluating foundation models, refer to Use SageMaker Clarify to evaluate foundation models, and browse additional example notebooks available in our GitHub repository. You can also explore ways to operationalize LLM evaluations at scale in Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services. Finally, we recommend referring to Evaluate large language models for quality and responsibility to learn more about evaluating LLMs.

About the Authors
Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.
Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Researchers from NVIDIA and the University of Maryland Propose ODIN: A …

The well-known Artificial Intelligence (AI)-based chatbot ChatGPT, built on top of GPT’s Transformer architecture, uses the technique of Reinforcement Learning from Human Feedback (RLHF). RLHF is an increasingly important method for utilizing the potential of pre-trained Large Language Models (LLMs) to generate more helpful, truthful responses that are in line with human preferences.

In RLHF, a reward model is first trained on human preferences for particular prompts; a language model is then trained with reinforcement learning to produce responses that maximize the learned reward. Since gathering human ratings is typically less complicated than gathering demonstrations for supervised fine-tuning, this approach streamlines the process of collecting data.

However, reward hacking is a subtle problem with RLHF, where the policy gets a large reward without meeting the real objectives. This happens as a result of the reward model’s limited Out-Of-Distribution (OOD) generalization and potential imperfections in representing human preferences. Being a strong LLM, the language model can provide OOD examples to take advantage of flaws in the reward model. 

The scenario is further complicated by human preference data, which is frequently skewed and inconsistent due to task complexity and subjectivity, defects in rating standards, and the low caliber of raters. Verbosity is a popular example of reward hacking, in which models produce more tokens to appear more thorough or better formatted in responses, but there is no real improvement in quality.

In order to address these issues, recent research from NVIDIA and the University of Maryland has aimed to mitigate reward hacking by examining how RL algorithms and reward models affect verbosity and performance. The team has presented an evaluation technique to compare various training setups and account for biases in model-based evaluations. The technique evaluates performance on the Pareto front of evaluation score versus response length, giving a comprehensive view of how quality trades off against response duration.

This process is intended to analyze the trade-off between the LLM’s assessment score and response duration, allowing for a systematic comparison of different training settings. By varying the training hyperparameters, it can be evaluated how these modifications affect the ratio of verbosity to answer quality.

The study looks at RL hyperparameters and techniques, such as reward clipping and length penalty, to lessen reward hacking on length. The primary goal is to remove the spurious length signal from the reward, even though various tuning procedures can yield better outcomes. To accomplish this, the team has suggested a two-head reward model that separates representations for length from true preferences. The length head is deleted during RL. 
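A minimal PyTorch sketch of the two-head idea is shown below: a shared backbone feeds two linear heads, one meant to absorb the length signal and one for quality, and only the quality head is used as the reward during RL. This illustrates the concept under the assumption of a Hugging Face-style backbone that returns last_hidden_state; it is not the authors’ implementation.

```python
import torch.nn as nn

class TwoHeadRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                        # e.g. an LLM encoder
        self.quality_head = nn.Linear(hidden_size, 1)   # reward used during RL
        self.length_head = nn.Linear(hidden_size, 1)    # absorbs the length signal; dropped for RL

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, -1, :]                       # representation of the final token
        return self.quality_head(pooled), self.length_head(pooled)

    def reward(self, input_ids, attention_mask=None):
        """Reward used for policy optimization: the length head is discarded."""
        quality, _ = self.forward(input_ids, attention_mask)
        return quality.squeeze(-1)
```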

The suggested reward-disentangling technique, ODIN, enabled the policy to attain a larger Pareto front than prior results, even with a more costly tuning budget. Proximal Policy Optimization (PPO) and ReMax both benefit from ODIN’s effectiveness, indicating that it can be used to enhance other RL-tuning methods and lessen length hacking.

In conclusion, this method’s experimental results have shown a noteworthy decrease in the reward model’s association with response duration. The derived strategy performs significantly better when the quality of the information is prioritized over verbosity. This method successfully reduces the problem of response length-related reward hacking, improving the dependability and utility of LLMs trained using the RLHF paradigm.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Researchers from NVIDIA and the University of Maryland Propose ODIN: A Reward Disentangling Technique that Mitigates Hacking in Reinforcement Learning from Human Feedback (RLHF) appeared first on MarkTechPost.