Align and monitor your Amazon Bedrock powered insurance assistance chatbot

Generative AI applications are gaining widespread adoption across various industries, including regulated industries such as financial services and healthcare. As these advanced systems play an increasingly critical role in decision-making processes and customer interactions, customers should work towards ensuring the reliability, fairness, and compliance of generative AI applications with industry regulations. To address this need, the AWS generative AI best practices framework was launched within AWS Audit Manager, enabling auditing and monitoring of generative AI applications. This framework provides step-by-step guidance on approaching generative AI risk assessment, collecting and monitoring evidence from Amazon Bedrock and Amazon SageMaker environments to assess your risk posture, and preparing to meet future compliance requirements.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Agents can be used to configure specialized agents that run actions seamlessly based on user input and your organization’s data. These managed agents play conductor, orchestrating interactions between FMs, API integrations, user conversations, and knowledge bases loaded with your data.
Insurance claim lifecycle processes typically involve several manual tasks that are painstakingly managed by human agents. An Amazon Bedrock-powered insurance agent can assist human agents and improve existing workflows by automating repetitive actions. As demonstrated in the example in this post, such an agent can create new claims, send pending document reminders for open claims, gather claims evidence, and search for information across existing claims and customer knowledge repositories.
Generative AI applications should be developed with adequate controls for steering the behavior of FMs. Responsible AI considerations such as privacy, security, safety, controllability, fairness, explainability, transparency and governance help ensure that AI systems are trustworthy. In this post, we demonstrate how to use the AWS generative AI best practices framework on AWS Audit Manager to evaluate this insurance claim agent from a responsible AI lens.
Use case
In this example of an insurance assistance chatbot, the customer’s generative AI application is designed with Amazon Bedrock Agents to automate tasks related to the processing of insurance claims and Amazon Bedrock Knowledge Bases to provide relevant documents. This allows users to directly interact with the chatbot when creating new claims and receiving assistance in an automated and scalable manner.

The user can interact with the chatbot using natural language queries to create a new claim, retrieve an open claim using a specific claim ID, receive a reminder for documents that are pending, and gather evidence about specific claims.
The agent then interprets the user’s request and determines if actions need to be invoked or information needs to be retrieved from a knowledge base. If the user request invokes an action, action groups configured for the agent will invoke different API calls, which produce results that are summarized as the response to the user. Figure 1 depicts the system’s functionalities and AWS services. The code sample for this use case is available in GitHub and can be expanded to add new functionality to the insurance claims chatbot.
How to create your own assessment of the AWS generative AI best practices framework

To create an assessment using the generative AI best practices framework on Audit Manager, go to the AWS Management Console and navigate to AWS Audit Manager.
Choose Create assessment.

Specify the assessment details, such as the name and an Amazon Simple Storage Service (Amazon S3) bucket to save assessment reports to. Select AWS Generative AI Best Practices Framework for assessment.

Select the AWS accounts in scope for assessment. If you’re using AWS Organizations and you have enabled it in Audit Manager, you will be able to select multiple accounts at once in this step. One of the key features of AWS Organizations is the ability to perform various operations across multiple AWS accounts simultaneously.

Next, select the audit owners to manage the preparation for your organization. When it comes to auditing activities within AWS accounts, it’s considered a best practice to create a dedicated role specifically for auditors or auditing purposes. This role should be assigned only the permissions required to perform auditing tasks, such as reading logs, accessing relevant resources, or running compliance checks.

Finally, review the details and choose Create assessment.

Principles of AWS generative AI best practices framework
Generative AI implementations can be evaluated based on eight principles in the AWS generative AI best practices framework. For each, we will define the principle and explain how Audit Manager conducts an evaluation.
Accuracy
A core principle of trustworthy AI systems is the accuracy of the application and/or model. Measures of accuracy should consider both computational measures and human-AI teaming. It is also important that AI systems are well tested and demonstrate adequate performance in the production setting. Accuracy measurements should always be paired with clearly defined and realistic test sets that are representative of the conditions of expected use.
For the use case of an insurance claims chatbot built with Amazon Bedrock Agents, you will use the large language model (LLM) Claude Instant from Anthropic, which you won’t need to further pre-train or fine-tune. It is therefore relevant for this use case to demonstrate the performance of the chatbot through task-level metrics, using the following:

A prompt benchmark
Source verification of documents ingested in knowledge bases or databases that the agent has access to
Integrity checks of the connected datasets as well as the agent
Error analysis to detect the edge cases where the application is erroneous
Schema compatibility of the APIs
Human-in-the-loop validation.

To measure the efficacy of the assistance chatbot, you will use promptfoo—a command line interface (CLI) and library for evaluating LLM apps. This involves three steps:

Create a test dataset containing prompts with which you test the different features.
Invoke the insurance claims assistant on these prompts and collect the responses. The traces of these responses are also helpful in debugging unexpected behavior.
Set up evaluation metrics that can be derived in an automated manner or using human evaluation to measure the quality of the assistant.

In the example of an insurance assistance chatbot, designed with Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases, there are four tasks:

getAllOpenClaims: Gets the list of all open insurance claims. Returns all claim IDs that are open.
getOutstandingPaperwork: Gets the list of pending documents that need to be uploaded by the policy holder before the claim can be processed. The API takes in only one claim ID and returns the list of documents that are pending to be uploaded. This API should be called for each claim ID.
getClaimDetail: Gets all details about a specific claim given a claim ID.
sendReminder: Sends a reminder to the policy holder about pending documents for the open claim. The API takes in only one claim ID and its pending documents at a time, sends the reminder, and returns the tracking details for the reminder. This API should be called for each claim ID you want to send reminders for.

For each of these tasks, you will create sample prompts to create a synthetic test dataset. The idea is to generate sample prompts with expected outcomes for each task. For the purposes of demonstrating the ideas in this post, you will create only a few samples in the synthetic test dataset. In practice, the test dataset should reflect the complexity of the task and possible failure modes for which you would want to test the application. Here are the sample prompts that you will use for each task:

getAllOpenClaims

What are the open claims?
List open claims.

getOutstandingPaperwork

What are the missing documents from {{claim}}?
What is missing from {{claim}}?

getClaimDetail

Explain the details to {{claim}}
What are the details of {{claim}}

sendReminder

Send reminder to {{claim}}
Send reminder to {{claim}}. Include the missing documents and their requirements.

Also include sample prompts for a set of unwanted results to make sure that the agent only performs the predefined tasks and doesn’t provide out-of-context or restricted information.

List all claims, including closed claims
What is 2+2?

Set up
You can start with the example of an insurance claims agent by cloning the use case of Amazon Bedrock-powered insurance agent. After you create the agent, set up promptfoo. Now, you will need to create a custom script that can be used for testing. This script should be able to invoke your application for a prompt from the synthetic test dataset. We created a Python script, invoke_bedrock_agent.py, with which we invoke the agent for a given prompt.
python invoke_bedrock_agent.py "What are the open claims?"
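The post invokes invoke_bedrock_agent.py without listing its contents, so the following is a minimal sketch of what such a script could look like using the Amazon Bedrock Agents runtime API; the agent ID, alias ID, and Region are placeholders you would replace with your own values.

# invoke_bedrock_agent.py: minimal sketch; agent ID, alias ID, and Region are placeholders
import sys
import uuid
import boto3

AGENT_ID = "YOUR_AGENT_ID"        # placeholder
AGENT_ALIAS_ID = "YOUR_ALIAS_ID"  # placeholder

def invoke_agent(prompt: str) -> str:
    client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
    response = client.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=str(uuid.uuid4()),  # new session per test prompt
        inputText=prompt,
        enableTrace=True,             # traces help debug unexpected behavior
    )
    # The completion is returned as an event stream of chunks
    completion = ""
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            completion += chunk["bytes"].decode("utf-8")
    return completion

if __name__ == "__main__":
    print(invoke_agent(sys.argv[1]))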
Step 1: Save your prompts
Create a text file of the sample prompts to be tested. As seen in the following, a claim can be a parameter that is inserted into the prompt during testing.
%%writefile prompts_getClaimDetail.txt
Explain the details to {{claim}}.

What are the details of {{claim}}.

Step 2: Create your prompt configuration with tests
For prompt testing, we defined test prompts per task. The YAML configuration file uses a format that defines test cases and assertions for validating prompts. Each prompt is processed through a series of sample inputs defined in the test cases. Assertions check whether the prompt responses meet the specified requirements. In this example, you use the prompts for task getClaimDetail and define the rules. There are different types of tests that can be used in promptfoo. This example uses keywords and similarity to assess the contents of the output. Keywords are checked using a list of values that are present in the output. Similarity is checked through the embedding of the FM’s output to determine if it’s semantically similar to the expected value.
%%writefile promptfooconfig.yaml
prompts: [prompts_getClaimDetail.txt] # text file that has the prompts
providers: ['bedrock_agent_as_provider.js'] # custom provider setting
defaultTest:
  options:
    provider:
      embedding:
        id: huggingface:sentence-similarity:sentence-transformers/all-MiniLM-L6-v2
tests:
  - description: 'Test via keywords'
    vars:
      claim: claim-008 # a claim that is open
    assert:
      - type: contains-any
        value:
          - 'claim'
          - 'open'
  - description: 'Test via similarity score'
    vars:
      claim: claim-008 # a claim that is open
    assert:
      - type: similar
        value: 'Providing the details for claim with id xxx: it is created on xx-xx-xxxx, last activity date on xx-xx-xxxx, status is x, the policy type is x.'
        threshold: 0.6
Step 3: Run the tests
Run the following commands to test the prompts against the set rules.
npx promptfoo@latest eval -c promptfooconfig.yaml
npx promptfoo@latest share
The promptfoo library generates a user interface where you can view the exact set of rules and the outcomes. The user interface for the tests that were run using the test prompts is shown in the following figure.

For each test, you can view the details: the prompt, the output, the test that was performed, and the reason for the outcome. You see the prompt test result for getClaimDetail in the following figure, using the similarity score against the expected result, given as a sentence.

Similarly, using the similarity score against the expected result, you get the test result for getAllOpenClaims as shown in the following figure.

Step 4: Save the output
For the final step, you want to attach evidence for both the FM and the application as a whole to the control ACCUAI 3.1: Model Evaluation Metrics. To do so, save the output of your prompt testing into an S3 bucket. In addition, the performance metrics of the FM can be found in the model card, which should also be saved to an S3 bucket first. Within Audit Manager, navigate to the corresponding control, ACCUAI 3.1: Model Evaluation Metrics, select Add manual evidence, and choose Import file from S3 to provide both model performance metrics and application performance, as shown in the following figure.
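If you prefer to script this step instead of using the console, the following sketch shows one way to upload the prompt-testing results to Amazon S3 and attach them as manual evidence with boto3; the bucket name, assessment ID, and control identifiers are placeholders you would look up in your own environment.

import boto3

s3 = boto3.client("s3")
auditmanager = boto3.client("auditmanager")

BUCKET = "my-evidence-bucket"    # placeholder bucket
KEY = "promptfoo/results.json"   # exported prompt-testing results

# 1. Store the prompt testing output in S3
s3.upload_file("results.json", BUCKET, KEY)

# 2. Attach it to the ACCUAI 3.1 control as manual evidence
auditmanager.batch_import_evidence_to_assessment_control(
    assessmentId="<your-assessment-id>",       # placeholder
    controlSetId="<accuracy-control-set-id>",  # placeholder
    controlId="<accuai-3-1-control-id>",       # placeholder
    manualEvidence=[{"s3ResourcePath": f"s3://{BUCKET}/{KEY}"}],
)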

In this section, we showed you how to test a chatbot and attach the relevant evidence. In the insurance claims chatbot, we did not customize the FM and thus the other controls—including ACCUAI 3.2: Regular Retraining for Accuracy, ACCUAI 3.11: Null Values, ACCUAI 3.12: Noise and Outliers, and ACCUAI 3.15: Update Frequency—are not applicable. Hence, we will not include these controls in the assessment performed for the use case of an insurance claims assistant.
We showed you how to test a RAG-based chatbot for controls using a synthetic test benchmark of prompts and add the results to the evaluation control. Based on your application, one or more controls in this section might apply and be relevant to demonstrate the trustworthiness of your application.
Fair
Fairness in AI includes concerns for equality and equity by addressing issues such as harmful bias and discrimination.
Fairness of the insurance claims assistant can be tested through the model responses when user-specific information is presented to the chatbot. For this application, it’s desirable to see no deviations in the behavior of the application when the chatbot is exposed to user-specific characteristics. To test this, you can create prompts containing user characteristics and then test the application using a process similar to the one described in the previous section. This evaluation can then be added as evidence to the control for FAIRAI 3.1: Bias Assessment.
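As an illustration of such a test, the following sketch builds counterfactual prompts that differ only in user-specific characteristics; the template and attribute values are hypothetical and should be adapted to your user base. Each prompt can then be run through the agent and the responses compared for unwanted deviations.

# Counterfactual prompts for bias testing: only the user characteristics vary
template = "My name is {name} and I live in {location}. What documents are missing from claim-008?"

variations = [
    {"name": "John", "location": "Seattle"},
    {"name": "Fatima", "location": "Dearborn"},
    {"name": "Wei", "location": "San Francisco"},
]

test_prompts = [template.format(**v) for v in variations]
# Invoke the agent on each prompt (see invoke_bedrock_agent.py) and compare the responses;
# for a fair application, the substance of the answer should not change across variations.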
An important element of fairness is having diversity in the teams that develop and test the application. This helps ensure that different perspectives are addressed in the AI development and deployment lifecycle so that the final behavior of the application addresses the needs of diverse users. The details of the team structure can be added as manual evidence for the control FAIRAI 3.5: Diverse Teams. Organizations might also already have ethics committees that review AI applications. The structure of the ethics committee and the assessment of the application can be included as manual evidence for the control FAIRAI 3.6: Ethics Committees.
Moreover, the organization can also improve fairness by incorporating features to improve accessibility of the chatbot for individuals with disabilities. By using Amazon Transcribe to stream transcription of user speech to text and Amazon Polly to play back speech audio to the user, voice can be used with an application built with Amazon Bedrock as detailed in Amazon Bedrock voice conversation architecture.
Privacy
NIST defines privacy as the norms and practices that help to safeguard human autonomy, identity, and dignity. Privacy values such as anonymity, confidentiality, and control should guide choices for AI system design, development, and deployment. The insurance claims assistant example doesn’t include any knowledge bases or connections to databases that contain customer data. If it did, additional access controls and authentication mechanisms would be required to make sure that customers can only access data they are authorized to retrieve.
Additionally, to discourage users from providing personally identifiable information (PII) in their interactions with the chatbot, you can use Amazon Bedrock Guardrails. By using the PII filter and adding the guardrail to the agent, PII entities in user queries and model responses will be redacted and pre-configured messaging will be provided instead. After guardrails are implemented, you can test them by invoking the chatbot with prompts that contain dummy PII. These model invocations are logged in Amazon CloudWatch; the logs can then be appended as automated evidence for privacy-related controls including PRIAI 3.10: Personal Identifier Anonymization or Pseudonymization and PRIAI 3.9: PII Anonymization.
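A guardrail of this kind can also be created programmatically. The following is a sketch using the Amazon Bedrock control plane API in boto3; the PII entity types, denied topic, and blocked messages are examples you would tailor to your own policies.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_guardrail(
    name="insurance-assistant-guardrail",
    description="Redacts PII and blocks unsupported topics",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
        ]
    },
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "InvestmentAdvice",
                "definition": "Guidance or recommendations about investing money.",
                "type": "DENY",
            }
        ]
    },
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't share that information.",
)
print(response["guardrailId"])  # attach this guardrail to the agent in the agent builder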
In the following figure, a guardrail was created to filter PII and unsupported topics. The user can test and view the trace of the guardrail within the Amazon Bedrock console using natural language. For this use case, the user asked a question whose answer would require the FM to provide PII. The trace shows that sensitive information has been blocked because the guardrail detected PII in the prompt.

As a next step, under the Guardrail details section of the agent builder, the user adds the PII guardrail, as shown in the figure below.

Amazon Bedrock is integrated with CloudWatch, which allows you to track usage metrics for audit purposes. As described in Monitoring generative AI applications using Amazon Bedrock and Amazon CloudWatch integration, you can enable model invocation logging. When analyzing insights with Amazon Bedrock, you can query model invocations. The logs provide detailed information about each model invocation, including the input prompt, the generated output, and any intermediate steps or reasoning. You can use these logs to demonstrate transparency and accountability.
Model invocation logging can be used to collect invocation logs including full request data, response data, and metadata for all calls performed in your account. This can be enabled by following the steps described in Monitor model invocation using CloudWatch Logs.
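As an alternative to the console steps, the following sketch enables model invocation logging through the API; the log group name and IAM role ARN are placeholders for resources you would create beforehand.

import boto3

bedrock = boto3.client("bedrock")

bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/aws/bedrock/modelinvocations",                 # placeholder log group
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",  # placeholder role
        },
        "textDataDeliveryEnabled": True,  # capture prompts and completions as text
    }
)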
You can then export the relevant CloudWatch logs from Log Insights for this model invocation as evidence for relevant controls. You can filter for bedrock-logs and choose to download them as a table, as shown in the figure below, so the results can be uploaded as manual evidence for AWS Audit Manager.

For the guardrail example, the specific model invocation will be shown in the logs as in the following figure. Here, the prompt and the user who ran it are captured. Regarding the guardrail action, it shows that the result is INTERVENED because of the blocked action with the PII entity email. For AWS Audit Manager, you can export the result and upload it as manual evidence under PRIAI 3.9: PII Anonymization.

Furthermore, organizations can establish monitoring of their AI applications—particularly when they deal with customer data and PII—and establish an escalation procedure for when a privacy breach might occur. Documentation related to the escalation procedure can be added as manual evidence for the control PRIAI 3.6: Escalation Procedures – Privacy Breach.
These are some of the most relevant controls to include in your assessment of a chatbot application along the privacy dimension.
Resilience
In this section, we show you how to improve the resilience of an application and add the corresponding evidence to the controls defined in the Resilience section of the AWS generative AI best practices framework.
AI systems, as well as the infrastructure in which they are deployed, are said to be resilient if they can withstand unexpected adverse events or unexpected changes in their environment or use. The resilience of a generative AI workload plays an important role in the development process and needs special considerations.
The various components of the insurance claims chatbot require resilient design considerations. Agents should be designed with appropriate timeouts and latency requirements to ensure a good customer experience. Data pipelines that ingest data to the knowledge base should account for throttling and use backoff techniques. It’s a good idea to consider parallelism to reduce bottlenecks when using embedding models, account for latency, and keep in mind the time required for ingestion. Considerations and best practices should be implemented for vector databases, the application tier, and monitoring the use of resources through an observability layer. Having a business continuity plan with a disaster recovery strategy is a must for any workload. Guidance for these considerations and best practices can be found in Designing generative AI workloads for resilience. Details of these architectural elements should be added as manual evidence in the assessment.
Responsible
Key principles of responsible design are explainability and interpretability. Explainability refers to the mechanisms that drive the functionality of the AI system, while interpretability refers to the meaning of the output of the AI system within the context of its designed functional purpose. Together, explainability and interpretability assist in the governance of an AI system and help maintain its trustworthiness. The trace of the agent for critical prompts and the various requests that users can send to the insurance claims chatbot can be added as evidence for the reasoning used by the agent to complete a user request.
The logs gathered from Amazon Bedrock offer comprehensive insights into the model’s handling of user prompts and the generation of corresponding answers. The figure below shows a typical model invocation log. By analyzing these logs, you can gain visibility into the model’s decision-making process. This logging functionality can serve as a manual audit trail, fulfilling RESPAI 3.4: Auditable Model Decisions.

Another important aspect of maintaining responsible design, development, and deployment of generative AI applications is risk management. This involves risk assessment, where risks are identified across broad categories for the application to identify harmful events and assign risk scores. This process also identifies mitigations that can reduce the inherent risk of a harmful event occurring to a lower residual risk. For more details on how to perform a risk assessment of your generative AI application, see Learn how to assess the risk of AI systems. Risk assessment is a recommended practice, especially for safety-critical or regulated applications, where identifying the necessary mitigations can lead to responsible design choices and a safer application for the users. The risk assessment reports are good evidence to include under this section of the assessment and can be uploaded as manual evidence. The risk assessment should also be reviewed periodically to account for changes to the application that can introduce the possibility of new harmful events and to consider new mitigations for reducing the impact of these events.
Safe
AI systems should “not, under defined conditions, lead to a state in which human life, health, property, or the environment is endangered.” (Source: ISO/IEC TS 5723:2022) For the insurance claims chatbot, safety principles should be followed to prevent interactions with users outside the limits of the defined functions. Amazon Bedrock Guardrails can be used to define topics that are not supported by the chatbot. The intended use of the chatbot should also be transparent to users to guide them in the best use of the AI application. An unsupported topic could include providing investment advice, which can be blocked by creating a guardrail with investment advice defined as a denied topic as described in Guardrails for Amazon Bedrock helps implement safeguards customized to your use case and responsible AI policies.
After this functionality is enabled as a guardrail, the model will refuse unsupported actions. The scenario illustrated in the following figure depicts a case where requesting investment advice is a restricted behavior, leading the model to decline to provide a response.

After the model is invoked, the user can navigate to CloudWatch to view the relevant logs. In cases where the model denies or intervenes in certain actions, such as providing investment advice, the logs will reflect the specific reasons for the intervention, as shown in the following figure. By examining the logs, you can gain insights into the model’s behavior, understand why certain actions were denied or restricted, and verify that the model is operating within the intended guidelines and boundaries. For the controls defined under the safety section of the assessment, you might want to design more experiments by considering various risks that arise from your application. The logs and documentation collected from the experiments can be attached as evidence to demonstrate the safety of the application.

Secure
NIST defines AI systems to be secure when they maintain confidentiality, integrity, and availability through protection mechanisms that prevent unauthorized access and use. Applications developed using generative AI should build defenses for adversarial threats including but not limited to prompt injection, data poisoning if a model is being fine-tuned or pre-trained, and model and data extraction exploits through AI endpoints.
Your information security teams should conduct standard security assessments that have been adapted to address the new challenges with generative AI models and applications—such as adversarial threats—and consider mitigations such as red-teaming. To learn more on various security considerations for generative AI applications, see Securing generative AI: An introduction to the Generative AI Security Scoping Matrix. The resulting documentation of the security assessments can be attached as evidence to this section of the assessment.
Sustainable
Sustainability refers to the “state of the global system, including environmental, social, and economic aspects, in which the needs of the present are met without compromising the ability of future generations to meet their own needs.”
Some actions that contribute to a more sustainable design of generative AI applications include considering and testing smaller models to achieve the same functionality, optimizing hardware and data storage, and using efficient training algorithms. To learn more about how you can do this, see Optimize generative AI workloads for environmental sustainability. Considerations implemented for achieving more sustainable applications can be added as evidence for the controls related to this part of the assessment.
Conclusion
In this post, we used the example of an insurance claims assistant powered by Amazon Bedrock Agents and looked at various principles that you need to consider when getting this application audit ready using the AWS generative AI best practices framework on Audit Manager. We defined each principle of safeguarding applications for trustworthy AI and provided some best practices for achieving the key objectives of the principles. Finally, we showed you how these development and design choices can be added to the assessment as evidence to help you prepare for an audit.
The AWS generative AI best practices framework provides a purpose-built tool that you can use for monitoring and governance of your generative AI projects on Amazon Bedrock and Amazon SageMaker. To learn more, see:

AWS generative AI best practices framework v2
AWS Audit Manager launches AWS Best Practices Framework for Generative AI
AWS Audit Manager extends generative AI best practices framework to Amazon SageMaker

About the Authors
Bharathi Srinivasan is a Generative AI Data Scientist at the AWS Worldwide Specialist Organisation. She works on developing solutions for Responsible AI, focusing on algorithmic fairness, veracity of large language models, and explainability. Bharathi guides internal teams and AWS customers on their responsible AI journey. She has presented her work at various learning conferences.
Irem Gokcek is a Data Architect in the AWS Professional Services team, with expertise spanning both Analytics and AI/ML. She has worked with customers from various industries such as retail, automotive, manufacturing and finance to build scalable data architectures and generate valuable insights from the data. In her free time, she is passionate about swimming and painting.
Fiona McCann is a Solutions Architect at Amazon Web Services in the public sector. She specializes in AI/ML with a focus on Responsible AI. Fiona has a passion for helping nonprofit customers achieve their missions with cloud solutions. Outside of building on AWS, she loves baking, traveling, and running half marathons in cities she visits.

London Stock Exchange Group uses Amazon Q Business to enhance post-trade client services

This post was co-written with Ben Doughton, Head of Product Operations – LCH, Iulia Midus, Site Reliability Engineer – LCH, and Maurizio Morabito, Software and AI specialist – LCH (part of London Stock Exchange Group, LSEG).
In the financial industry, quick and reliable access to information is essential, but searching for data or facing unclear communication can slow things down. An AI-powered assistant can change that. By instantly providing answers and helping to navigate complex systems, such assistants can make sure that key information is always within reach, improving efficiency and reducing the risk of miscommunication. Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business enables employees to become more creative, data-driven, efficient, organized, and productive.
In this blog post, we explore a client services agent assistant application developed by the London Stock Exchange Group (LSEG) using Amazon Q Business. We will discuss how Amazon Q Business saved time in generating answers, including summarizing documents, retrieving answers to complex Member enquiries, and combining information from different data sources (while providing in-text citations to the data sources used for each answer).
The challenge
The London Clearing House (LCH) Group of companies includes leading multi-asset class clearing houses and is part of the Markets division of LSEG PLC (LSEG Markets). LCH provides proven risk management capabilities across a range of asset classes, including over-the-counter (OTC) and listed interest rates, fixed income, foreign exchange (FX), credit default swaps (CDS), equities, and commodities.
As the LCH business continues to grow, the LCH team has been continuously exploring ways to improve their support to customers (members) and to increase LSEG’s impact on customer success. As part of LSEG’s multi-stage AI strategy, LCH has been exploring the role that generative AI services can have in this space. One of the key capabilities that LCH is interested in is a managed conversational assistant that requires minimal technical knowledge to build and maintain. In addition, LCH has been looking for a solution that is focused on its knowledge base and that can be quickly kept up to date. For this reason, LCH was keen to explore techniques such as Retrieval Augmented Generation (RAG). Following a review of available solutions, the LCH team decided to build a proof-of-concept around Amazon Q Business.
Business use case
Realizing value from generative AI relies on a solid business use case. LCH has a broad base of customers raising queries to their client services (CS) team across a diverse and complex range of asset classes and products. Example queries include: “What is the eligible collateral at LCH?” and “Can members clear NIBOR IRS at LCH?” This requires CS team members to refer to detailed service and policy documentation sources to provide accurate advice to their members.
Historically, the CS team has relied on producing product FAQs for LCH members to refer to and, where required, an in-house knowledge center for CS team members to refer to when answering complex customer queries. To improve the customer experience and boost employee productivity, the CS team set out to investigate whether generative AI could help answer questions from individual members, thus reducing the number of customer queries. The goal was to increase the speed and accuracy of information retrieval within the CS workflows when responding to the queries that inevitably come through from customers.
Project workflow
The CS use case was developed through close collaboration between LCH and Amazon Web Services (AWS) and involved the following steps:

Ideation: The LCH team carried out a series of cross-functional workshops to examine different large language model (LLM) approaches including prompt engineering, RAG, and custom model fine-tuning and pre-training. They considered different technologies such as Amazon SageMaker and Amazon SageMaker JumpStart and evaluated trade-offs between development effort and model customization. Amazon Q Business was selected because of its built-in enterprise search web crawler capability and ease of deployment without the need for LLM deployment. Another attractive feature was the ability to clearly provide source attribution and citations. This enhanced the reliability of the responses, allowing users to verify facts and explore topics in greater depth (important aspects to increase their overall trust in the responses received).
Knowledge base creation: The CS team built data sources connectors for the LCH website, FAQs, customer relationship management (CRM) software, and internal knowledge repositories and included the Amazon Q Business built-in index and retriever in the build.
Integration and testing: The application was secured using a third-party identity provider (IdP) for identity and access management, so that users are managed through their enterprise IdP, and used AWS Identity and Access Management (IAM) to authenticate users when they signed in to Amazon Q Business. Testing was carried out to verify the factual accuracy of responses, evaluating the performance and quality of the AI-generated answers, which demonstrated that the system had achieved a high level of factual accuracy. Wider improvements in business performance were also demonstrated, including enhancements in response time, with responses delivered within a few seconds. Tests were undertaken with both unstructured and structured data within the documents.
Phased rollout: The CS AI assistant was rolled out in a phased approach to provide thorough, high-quality answers. In the future, there are plans to integrate their Amazon Q Business application with existing email and CRM interfaces, and to expand its use to additional use cases and functions within LSEG. 

Solution overview
In this solution overview, we’ll explore the LCH-built Amazon Q Business application.
The LCH admin team developed a web-based interface that serves as a gateway for their internal client services team to interact with the Amazon Q Business API and other AWS services (Amazon Elastic Container Service (Amazon ECS), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and Amazon Bedrock). They secured it using SAML 2.0 IAM federation, maintaining secure access to the chat interface, to retrieve answers from a pre-indexed knowledge base and to validate the responses using Anthropic’s Claude v2 LLM.
The following figure illustrates the architecture for the LCH client services application.

The workflow consists of the following steps:

The LCH team set up the Amazon Q Business application using a SAML 2.0 IAM IdP. (The example in the blog post shows connecting with Okta as the IdP for Amazon Q Business. However, the LCH team built the application using a third-party solution as the IdP instead of Okta). This architecture allows LCH users to sign in using their existing identity credentials from their enterprise IdP, while they maintain control over which users have access to their Amazon Q Business application.
The application had two data sources as part of the configuration for their Amazon Q Business application:

An S3 bucket to store and index their internal LCH documents. This allows the Amazon Q Business application to access and search through their internal product FAQ PDF documents as part of providing responses to user queries. Indexing the documents in Amazon S3 makes them readily available for the application to retrieve relevant information.
In addition to internal documents, the team has also set up their public-facing LCH website as a data source using a web crawler that can index and extract information from their rulebooks.

The LCH team opted for a custom user interface (UI) instead of the built-in web experience provided by Amazon Q Business to have more control over the frontend by directly accessing the Amazon Q Business API. The application’s frontend was developed using an open source application framework and hosted on Amazon ECS. The frontend application accesses an Amazon API Gateway REST API endpoint to interact with the business logic written in AWS Lambda.
The architecture consists of two Lambda functions:

An authorizer Lambda function is responsible for authorizing the frontend application to access the Amazon Q Business API by generating temporary AWS credentials.
A ChatSync Lambda function is responsible for accessing the Amazon Q Business ChatSync API to start an Amazon Q Business conversation (a sketch of such a call follows this workflow).

The architecture includes a Validator Lambda function, which is used by the admin to validate the accuracy of the responses generated by the Amazon Q Business application.

The LCH team has stored a golden answer knowledge base in an S3 bucket, consisting of approximately 100 questions and answers about their product FAQs and rulebooks collected from their live agents. This knowledge base serves as a benchmark for the accuracy and reliability of the AI-generated responses.
By comparing the Amazon Q Business chat responses against their golden answers, LCH can verify that the AI-powered assistant is providing accurate and consistent information to their customers.
The Validator Lambda function retrieves data from a DynamoDB table and sends it to Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) that can be used to quickly experiment with and evaluate top FMs for a given use case, privately customize the FMs with existing data using techniques such as fine-tuning and RAG, and build agents that execute tasks using enterprise systems and data sources.
The Amazon Bedrock service uses Anthropic’s Claude v2 model to validate the Amazon Q Business application queries and responses against the golden answers stored in the S3 bucket.
Anthropic’s Claude v2 model returns a score for each question and answer, in addition to a total score, which is then provided to the application admin for review.
The Amazon Q Business application returned answers within a few seconds for each question. The overall expectation is that Amazon Q Business saves time for each live agent on each question by providing quick and correct responses.

This validation process helped LCH to build trust and confidence in the capabilities of Amazon Q Business, enhancing the overall customer experience.
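For illustration, a ChatSync-style call to Amazon Q Business could look like the following sketch; the application ID is a placeholder, and LCH’s actual Lambda implementation is not shown in this post.

from typing import Optional
import boto3

qbusiness = boto3.client("qbusiness")

def ask_q_business(question: str, conversation_id: Optional[str] = None) -> dict:
    kwargs = {
        "applicationId": "<your-q-business-application-id>",  # placeholder
        "userMessage": question,
    }
    if conversation_id:
        kwargs["conversationId"] = conversation_id  # continue an existing conversation
    response = qbusiness.chat_sync(**kwargs)
    return {
        "answer": response["systemMessage"],
        "conversationId": response["conversationId"],
        # Source attributions back the in-text citations mentioned earlier
        "sources": [s.get("title") for s in response.get("sourceAttributions", [])],
    }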
Conclusion
This post provides an overview of LSEG’s experience in adopting Amazon Q Business to support LCH client services agents for B2B query handling. This specific use case was built by working backward from a business goal to improve customer experience and staff productivity in a complex, highly technical area of the trading life cycle (post-trade). The variety and large size of enterprise data sources and the regulated environment that LSEG operates in makes this post particularly relevant to customer service operations dealing with complex query handling. Managed, straightforward-to-use RAG is a key capability within a wider vision of providing technical and business users with an environment, tools, and services to use generative AI across providers and LLMs. You can get started with this tool by creating a sample Amazon Q Business application.

About the Authors
Ben Doughton is a Senior Product Manager at LSEG with over 20 years of experience in Financial Services. He leads product operations, focusing on product discovery initiatives, data-informed decision-making and innovation. He is passionate about machine learning and generative AI as well as agile, lean and continuous delivery practices.
Maurizio Morabito, Software and AI specialist at LCH, one of the early adopters of Neural Networks in the years 1990–1992 before a long hiatus in technology and finance companies in Asia and Europe, finally returning to Machine Learning in 2021. Maurizio is now leading the way to implement AI in LSEG Markets, following the motto “Tackling the Long and the Boring”
Iulia Midus is a recent IT Management graduate and currently working in Post-trade. The main focus of the work so far has been data analysis and AI, and looking at ways to implement these across the business.
Magnus Schoeman is a Principal Customer Solutions Manager at AWS. He has 25 years of experience across private and public sectors where he has held leadership roles in transformation programs, business development, and strategic alliances. Over the last 10 years, Magnus has led technology-driven transformations in regulated financial services operations (across Payments, Wealth Management, Capital Markets, and Life & Pensions).
Sudha Arumugam is an Enterprise Solutions Architect at AWS, advising large Financial Services organizations. She has over 13 years of experience in creating reliable software solutions to complex problems and She has extensive experience in serverless event-driven architecture and technologies and is passionate about machine learning and AI. She enjoys developing mobile and web applications.
Elias Bedmar is a Senior Customer Solutions Manager at AWS. He is a technical and business program manager helping customers be successful on AWS. He supports large migration and modernization programs, cloud maturity initiatives, and adoption of new services. Elias has experience in migration delivery, DevOps engineering and cloud infrastructure.
Marcin Czelej is a Machine Learning Engineer at AWS Generative AI Innovation and Delivery. He combines over 7 years of experience in C/C++ and assembler programming with extensive knowledge in machine learning and data science. This unique skill set allows him to deliver optimized and customised solutions across various industries. Marcin has successfully implemented AI advancements in sectors such as e-commerce, telecommunications, automotive, and the public sector, consistently creating value for customers.
Zmnako Awrahman, Ph.D., is a generative AI Practice Manager at AWS Generative AI Innovation and Delivery with extensive experience in helping enterprise customers build data, ML, and generative AI strategies. With a strong background in technology-driven transformations, particularly in regulated industries, Zmnako has a deep understanding of the challenges and opportunities that come with implementing cutting-edge solutions in complex environments.

Evaluate large language models for your machine translation tasks on AWS

Large language models (LLMs) have demonstrated promising capabilities in machine translation (MT) tasks. Depending on the use case, they are able to compete with neural translation models such as Amazon Translate. LLMs particularly stand out for their natural ability to learn from the context of the input text, which allows them to pick up on cultural cues and produce more natural sounding translations. For instance, the sentence “Did you perform well?” might be translated into French as “Avez-vous bien performé?” The target translation can vary widely depending on the context. If the question is asked in the context of sport, such as “Did you perform well at the soccer tournament?”, the natural French translation would be very different. It is critical for AI models to capture not only the context, but also the cultural specificities to produce a more natural sounding translation. One of LLMs’ most fascinating strengths is their inherent ability to understand context.
A number of our global customers are looking to take advantage of this capability to improve the quality of their translated content. Localization relies on both automation and humans-in-the-loop in a process called Machine Translation Post Editing (MTPE). Building solutions that help enhance translated content quality presents multiple benefits:

Potential cost savings on MTPE activities
Faster turnaround for localization projects
Better experience for content consumers and readers overall with enhanced quality

LLMs have also shown gaps with regard to MT tasks, such as:

Inconsistent quality over certain language pairs
No standard pattern to integrate past translation knowledge, also known as translation memory (TM)
Inherent risk of hallucination

Switching MT workloads from neural translation models to LLM-driven translation should be considered on a case-by-case basis. However, the industry is seeing enough potential to consider LLMs as a valuable option.
This blog post with accompanying code presents a solution to experiment with real-time machine translation using foundation models (FMs) available in Amazon Bedrock. It can help collect more data on the value of LLMs for your content translation use cases.
Steering the LLMs’ output
Translation memory and TMX files are important concepts and file formats used in the field of computer-assisted translation (CAT) tools and translation management systems (TMSs).
Translation memory
A translation memory is a database that stores previously translated text segments (typically sentences or phrases) along with their corresponding translations. The main purpose of a TM is to aid human or machine translators by providing them with suggestions for segments that have already been translated before. This can significantly improve translation efficiency and consistency, especially for projects involving repetitive content or similar subject matter.
Translation Memory eXchange (TMX) is a widely used open standard for representing and exchanging TM data. It is an XML-based file format that allows for the exchange of TMs between different CAT tools and TMSs. A typical TMX file contains a structured representation of translation units, which are groupings of the same text translated into multiple languages.
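To make the structure concrete, the following sketch parses translation units from a TMX file using only the Python standard library; the example output is illustrative.

import xml.etree.ElementTree as ET

def load_translation_units(tmx_path: str) -> list:
    tree = ET.parse(tmx_path)
    units = []
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # The language code is carried on the <tuv> element (xml:lang)
            lang = tuv.get("{http://www.w3.org/XML/1998/namespace}lang") or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang] = "".join(seg.itertext())
        units.append(segments)
    return units

# Example result: [{'en': 'Did you perform well?', 'fr': 'Avez-vous bien joué ?'}, ...]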
Integrating TM with LLMs
The use of TMs in combination with LLMs can be a powerful approach for improving the quality and efficiency of machine translation. The following are a few potential benefits:

Improved accuracy and consistency – LLMs can benefit from the high-quality translations stored in TMs, which can help improve the overall accuracy and consistency of the translations produced by the LLM. The TM can provide the LLM with reliable reference translations for specific segments, reducing the chances of errors or inconsistencies.
Domain adaptation – TMs often contain translations specific to a particular domain or subject matter. By using a domain-specific TM, the LLM can better adapt to the terminology, style, and context of that domain, leading to more accurate and natural translations.
Efficient reuse of human translations – TMs store human-translated segments, which are typically of higher quality than machine-translated segments. By incorporating these human translations into the LLM’s training or inference process, the LLM can learn from and reuse these high-quality translations, potentially improving its overall performance.
Reduced post-editing effort – When the LLM can accurately use the translations stored in the TM, the need for human post-editing can be reduced, leading to increased productivity and cost savings.

Another approach to integrating TM data with LLMs is to use fine-tuning in the same way you would fine-tune a model for business domain content generation, for instance. For customers operating in global industries, potentially translating to and from over 10 languages, this approach can prove to be operationally complex and costly. The solution proposed in this post relies on LLMs’ context learning capabilities and prompt engineering. It enables you to use an off-the-shelf model as is without involving machine learning operations (MLOps) activity.
Solution overview
The LLM translation playground is a sample application providing the following capabilities:

Experiment with LLM translation capabilities using models available in Amazon Bedrock
Create and compare various inference configurations
Evaluate the impact of prompt engineering and Retrieval Augmented Generation (RAG) on translation with LLMs
Configure supported language pairs
Import, process, and test translation using your existing TMX file with multiple LLMs
Custom terminology conversion
Performance, quality, and usage metrics, including BLEU, BERT, METEOR, and chrF

The following diagram illustrates the translation playground architecture. The numbers are color-coded to represent two flows: the translation memory ingestion flow (orange) and the text translation flow (gray). The solution offers two TM retrieval modes for users to choose from: vector and document search. This is covered in detail later in the post.

The TM ingestion flow (orange) consists of the following steps:

The user uploads a TMX file to the playground UI.
Depending on which retrieval mode is being used, the appropriate adapter is invoked.
When using the Amazon OpenSearch Service adapter (document search), translation unit groupings are parsed and stored into an index dedicated to the uploaded file. When using the FAISS adapter (vector search), translation unit groupings are parsed and turned into vectors using the selected embedding model from Amazon Bedrock.
When using the FAISS adapter, translation units are stored into a local FAISS index along with the metadata.

The text translation flow (gray) consists of the following steps:

The user enters the text they want to translate along with source and target language.
The request is sent to the prompt generator.
The prompt generator invokes the appropriate knowledge base according to the selected mode.
The prompt generator receives the relevant translation units.
Amazon Bedrock is invoked using the generated prompt as input along with customization parameters.

The translation playground could be adapted into a scalable serverless solution as represented by the following diagram using AWS Lambda, Amazon Simple Storage Service (Amazon S3), and Amazon API Gateway.

Strategy for TM knowledge base
The LLM translation playground offers two options to incorporate the translation memory into the prompt. Each option is available through its own page within the application:

Vector store using FAISS – In this mode, the application processes the .tmx file the user uploaded, indexes it, and stores it locally into a vector store (FAISS).
Document store using Amazon OpenSearch Serverless – Only standard document search using Amazon OpenSearch Serverless is supported. To test vector search, use the vector store option (using FAISS).

In vector store mode, the translation segments are processed as follows:

Embed the source segment.
Extract metadata:

Segment language
System generated <tu> segment unique identifier

Store source segment vectors along with metadata and the segment itself in plain text as a document
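A minimal sketch of this ingestion path, assuming Amazon Titan Text Embeddings V2 and a local FAISS index, could look like the following; the metadata fields mirror the steps above, although the implementation in the repository may differ.

import json
import boto3
import faiss        # pip install faiss-cpu
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime")

def embed(text: str) -> list:
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

index = faiss.IndexFlatL2(1024)  # default dimension for Titan Text Embeddings V2
metadata = []                    # parallel list: FAISS position -> segment metadata

def ingest_segment(tu_id: str, lang: str, segment: str) -> None:
    # In the playground only source-language segments are embedded; this sketch
    # indexes every segment and filters by language at query time instead
    vector = np.array([embed(segment)], dtype="float32")
    index.add(vector)
    metadata.append({"tu_id": tu_id, "lang": lang, "text": segment})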

The translation customization section allows you to select the embedding model. You can choose either Amazon Titan Text Embeddings V2 or Cohere Embed Multilingual v3. Amazon Titan Text Embeddings V2 includes multilingual support for over 100 languages in pre-training. Cohere Embed supports 108 languages.
In document store mode, the language segments are not embedded and are stored following a flat structure. Two metadata attributes are maintained across the documents:

Segment Language
System generated <tu> segment unique identifier

Prompt engineering
The application uses prompt engineering techniques to incorporate several types of inputs for the inference. The following sample XML illustrates the prompt’s template structure:

<prompt>
<system_prompt>…</system_prompt>
<source_language>EN</source_language>
<target_language>FR</target_language>
<translation_memory_pairs>
<source_language>…</source_language>
<target_language>…</target_language>
</translation_memory_pairs>
<custom_terminology_pairs>
<source_language>…</source_language>
<target_language>…</target_language>
</custom_terminology_pairs>
<user_prompt>…</user_prompt>
</prompt>

Prerequisites
The project code uses the Python version of the AWS Cloud Development Kit (AWS CDK). To run the project code, make sure that you have fulfilled the AWS CDK prerequisites for Python.
The project also requires that the AWS account is bootstrapped to allow the deployment of the AWS CDK stack.
Install the UI
To deploy the solution, first install the UI (Streamlit application):

Clone the GitHub repository using the following command:

git clone https://github.com/aws-samples/llm-translation-playground.git

Navigate to the deployment directory:

cd llm-translation-playground

Install and activate a Python virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install Python libraries:

python -m pip install -r requirements.txt
Deploy the AWS CDK stack
Complete the following steps to deploy the AWS CDK stack:

Move into the deployment folder:

cd deployment/cdk

Configure the AWS CDK context parameters file context.json. For collection_name, use the OpenSearch Serverless collection name. For example:

"collection_name": "search-subtitles"

Deploy the AWS CDK stack:

cdk deploy

Validate successful deployment by reviewing the OpsServerlessSearchStack stack on the AWS CloudFormation console. The status should read CREATE_COMPLETE.
On the Outputs tab, make note of the OpenSearchEndpoint attribute value.

Configure the solution
The stack creates an AWS Identity and Access Management (IAM) role with the right level of permission needed to run the application. The LLM translation playground assumes this role automatically on your behalf. To achieve this, modify the role or principal under which you are planning to run the application so you are allowed to assume the newly created role. You can use the pre-created policy and attach it to your role. The policy Amazon Resource Name (ARN) can be retrieved as a stack output under the key LLMTranslationPlaygroundAppRoleAssumePolicyArn, as illustrated in the preceding screenshot. You can do so from the IAM console after selecting your role and choosing Add permissions. If you prefer to use the AWS Command Line Interface (AWS CLI), refer to the following sample command line:
aws iam attach-role-policy --role-name <role-name> --policy-arn <policy-arn>
Finally, configure the .env file in the utils folder as follows:

APP_ROLE_ARN – The ARN of the role created by the stack (stack output LLMTranslationPlaygroundAppRoleArn)
HOST – OpenSearch Serverless collection endpoint (without https)
REGION – AWS Region the collection was deployed into
INGESTION_LIMIT – Maximum number of translation units (<tu> tags) indexed per TMX file you upload

Run the solution
To start the translation playground, run the following commands:
cd llm-translation-playground/source
streamlit run LLM_Translation_Home.py
Your default browser should open a new tab or window displaying the Home page.

Simple test case
Let’s run a simple translation test using the phrase mentioned earlier: “Did you perform well?”
Because we’re not using a knowledge base for this test case, we can use either a vector store or document store. For this post, we use a document store.

Choose With Document Store.
For Source Text, enter the text to be translated.
Choose your source and target languages (for this post, English and French, respectively).
You can experiment with other parameters, such as model, maximum tokens, temperature, and top-p.
Choose Translate.

The translated text appears in the bottom section. For this example, the translated text, although accurate, is close to a literal translation, which is not a common phrasing in French.

We can rerun the same test after slightly modifying the initial text: “Did you perform well at the soccer tournament?”

We’re now introducing some situational context in the input. The translated text should be different and closer to a more natural translation. The new output literally means “Did you play well at the soccer tournament?”, which is consistent with the initial intent of the question.

Also note the completion metrics on the left pane, displaying latency, input/output tokens, and quality scores.
This example highlights the ability of LLMs to naturally adapt the translation to the context.
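
Behind the scenes, a translation request like this typically amounts to a single model invocation. The following minimal sketch shows one way to issue such a request, assuming the models are served through Amazon Bedrock; the model ID, prompt wording, and inference parameters are illustrative and not the playground's exact implementation:

import boto3

# Placeholder Region and model ID -- adjust to your account and model access.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

source_text = "Did you perform well at the soccer tournament?"
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    system=[{"text": "You are a professional English-to-French translator."}],
    messages=[{"role": "user", "content": [{"text": f"Translate into French: {source_text}"}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2, "topP": 0.9},
)
print(response["output"]["message"]["content"][0]["text"])
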
Adding translation memory
Let's test the impact of using a translation memory (TMX) file on the translation quality.

Copy the text contained within test/source_text.txt and paste it into the Source Text field.
Choose French as the target language and run the translation.
Copy the text contained within test/target_text.txt and paste it into the reference translation field.

Choose Evaluate and notice the quality scores on the left.
In the Translation Customization section, choose Browse files and choose the file test/subtitles_memory.tmx.

This will index the translation memory into the OpenSearch Serverless collection created previously. The indexing process can take a few minutes.

When the indexing is complete, select the created index from the index dropdown.
Rerun the translation.

You should see a noticeable increase in the quality score. For instance, we’ve seen up to 20 percentage points improvement in BLEU score with the preceding test case. Using prompt engineering, we were able to steer the model’s output by providing sample phrases directly pulled from the TMX file. Feel free to explore the generated prompt for more details on how the translation pairs were introduced.
You can replicate a similar test case with Amazon Translate by launching an asynchronous job customized using parallel data.

Here we took a simplistic retrieval approach, which consists of loading all of the samples as part of the same TMX file, matching the source and target language. You can enhance this technique by using metadata-driven filtering to collect the relevant pairs according to the source text. For example, you can classify the documents by theme or business domain, and use category tags to select language pairs relevant to the text and desired output.
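
To make the metadata-driven idea concrete, the following sketch builds an OpenSearch-style query body that combines lexical relevance with hypothetical language and category filters; the field names are assumptions about how the segments could be indexed, not the playground's actual schema:

# Hypothetical metadata-aware retrieval of translation pairs.
# The field names (segment, source_lang, target_lang, category) are assumptions
# about how the TMX segments could be indexed, not the playground's actual schema.
source_text = "Did you play well at the soccer tournament?"
query_body = {
    "size": 5,
    "query": {
        "bool": {
            "must": [
                {"match": {"segment": source_text}}  # lexical relevance to the text being translated
            ],
            "filter": [
                {"term": {"source_lang": "en"}},
                {"term": {"target_lang": "fr"}},
                {"term": {"category": "sports"}},  # hypothetical business-domain tag
            ],
        }
    },
}
# The body could then be passed to an OpenSearch client, for example:
# client.search(index="subtitles", body=query_body)
print(query_body)
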
Semantic similarity for translation memory selection
In vector store mode, the application allows you to upload a TMX and create a local index that uses semantic similarity to select the translation memory segments. First, we retrieve the segment with the highest similarity score based on the text to be translated and the source language. Then we retrieve the corresponding segment matching the target language and parent translation unit ID.
To try it out, upload the file in the same way as shown earlier. Depending on the size of the file, this can take a few minutes. There is a maximum limit of 200 MB. You can use the sample file as in the previous example or one of the other samples provided in the code repository.
This approach differs from the static index search because it’s assumed that the source text is semantically close to segments representative enough of the expected style and tone.
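
As a mental model for this selection step, the toy sketch below scores in-memory segments by cosine similarity and then resolves the target-language segment through the shared translation unit ID; the embeddings and record layout are illustrative stand-ins for whatever the vector store actually uses:

import numpy as np

# Toy in-memory translation memory: (translation_unit_id, language, text, embedding).
# Real embeddings would come from the vector store's embedding model.
memory = [
    ("tu-1", "en", "Did you play well at the tournament?", np.array([0.9, 0.1, 0.0])),
    ("tu-2", "en", "The weather is nice today.", np.array([0.1, 0.8, 0.3])),
    ("tu-1", "fr", "As-tu bien joué au tournoi ?", None),
    ("tu-2", "fr", "Il fait beau aujourd'hui.", None),
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = np.array([0.85, 0.15, 0.05])  # embedding of the text to be translated

# Step 1: best-matching segment in the source language.
source_segments = [m for m in memory if m[1] == "en"]
best = max(source_segments, key=lambda m: cosine(query_embedding, m[3]))

# Step 2: corresponding target-language segment, matched on the translation unit ID.
target = next(m for m in memory if m[0] == best[0] and m[1] == "fr")
print(best[2], "->", target[2])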

Adding custom terminology
Custom terminology allows you to make sure that your brand names, character names, model names, and other unique content get translated to the desired result. Given that LLMs are pre-trained on massive amounts of data, they can likely already identify unique names and render them accurately in the output. If there are names for which you want to enforce a strict and literal translation, you can try the custom terminology feature of this translation playground. Simply provide the source and target term pairs separated by a semicolon in the Translation Customization section. For instance, if you want to keep the phrase "Gen AI" untranslated regardless of the language, you can configure the custom terminology as illustrated in the following screenshot.
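
To make the mechanics concrete, the following sketch parses semicolon-separated pairs and folds them into the translation instructions; the prompt wording and the "Acme Cloud" brand name are illustrative assumptions, not the playground's actual implementation:

# Parse "source;target" custom terminology pairs, one per line.
# "Acme Cloud" is a made-up brand name and the prompt wording is illustrative.
raw_terminology = "Gen AI;Gen AI\nAcme Cloud;Acme Cloud"

terminology = {}
for line in raw_terminology.splitlines():
    source_term, target_term = (part.strip() for part in line.split(";", 1))
    terminology[source_term] = target_term

glossary = "\n".join(f"- Render '{src}' as '{tgt}'." for src, tgt in terminology.items())
prompt = (
    "Translate the following text into French.\n"
    f"Apply these terminology rules strictly:\n{glossary}\n\n"
    "Text: Our Gen AI assistant now runs on Acme Cloud."
)
print(prompt)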

Clean up
To delete the stack, navigate to the deployment folder and run cdk destroy.
Further considerations
Using existing TMX files with generative AI-based translation systems can potentially improve the quality and consistency of translations. The following are some steps to use TMX files for generative AI translations:

TMX data pipeline – TMX files contain structured translation units, but the format might need to be preprocessed to extract the source and target text segments in a form the generative AI model can consume. This involves extract, transform, and load (ETL) pipelines that can parse the XML structure, handle encoding issues, and add metadata (see the sketch following this list).
Incorporate quality estimation and human review – Although generative AI models can produce high-quality translations, it is recommended to incorporate quality estimation techniques and human review processes. You can use automated quality estimation models to flag potentially low-quality translations, which can then be reviewed and corrected by human translators.
Iterate and refine – Translation projects often involve iterative cycles of translation, review, and improvement. You can periodically retrain or fine-tune the generative AI model with the updated TMX file, creating a virtuous cycle of continuous improvement.
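
As a minimal sketch of such a pipeline step, the following code extracts source/target segment pairs from a TMX file using only the Python standard library; the file path and language codes are placeholders:

import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_path, source_lang="en", target_lang="fr"):
    """Yield (source, target) segment pairs from a TMX file."""
    root = ET.parse(tmx_path).getroot()
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").split("-")[0].lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segments[lang] = seg.text.strip()
        if source_lang in segments and target_lang in segments:
            yield segments[source_lang], segments[target_lang]

# Example usage with a placeholder path:
# for src, tgt in extract_pairs("test/subtitles_memory.tmx"):
#     print(src, "->", tgt)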

Conclusion
The LLM translation playground presented in this post enables you to evaluate the use of LLMs for your machine translation needs. The key features of this solution include:

Ability to use translation memory – The solution allows you to integrate your existing TM data, stored in the industry-standard TMX format, directly into the LLM translation process. This helps improve the accuracy and consistency of the translations by using high-quality human-translated content.
Prompt engineering capabilities – The solution showcases the power of prompt engineering, demonstrating how LLMs can be steered to produce more natural and contextual translations by carefully crafting the input prompts. This includes the ability to incorporate custom terminology and domain-specific knowledge.
Evaluation metrics – The solution includes standard translation quality evaluation metrics, such as BLEU, BERTScore, METEOR, and CHRF, to help you assess the quality and effectiveness of the LLM-powered translations compared to your existing machine translation workflows (see the example following this list).
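
For readers who want to reproduce such scores outside the playground, the following minimal sketch computes BLEU and CHRF with the sacrebleu library; the example sentences are illustrative, and the playground's own scoring code may differ:

import sacrebleu

# Illustrative hypothesis/reference pair; adjust to your own evaluation data.
hypotheses = ["As-tu bien joué au tournoi de football ?"]
references = [["As-tu bien joué au tournoi de foot ?"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
print(f"CHRF: {chrf.score:.2f}")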

As the industry continues to explore the use of LLMs, this solution can help you gain valuable insights and data to determine if LLMs can become a viable and valuable option for your content translation and localization workloads.
To dive deeper into the fast-moving field of LLM-based machine translation on AWS, check out the following resources:

How 123RF saved over 90% of their translation costs by switching to Amazon Bedrock
Video auto-dubbing using Amazon Translate, Amazon Bedrock, and Amazon Polly
Multi-Model agentic & reflective translation workflow in Amazon Bedrock

About the Authors
Narcisse Zekpa is a Sr. Solutions Architect based in Boston. He helps customers in the Northeast U.S. accelerate their business transformation through innovative and scalable solutions on the AWS Cloud. He is passionate about enabling organizations to transform their business using advanced analytics and AI. When Narcisse is not building, he enjoys spending time with his family, traveling, running, cooking, and playing basketball.
Ajeeb Peter is a Principal Solutions Architect with Amazon Web Services based in Charlotte, North Carolina, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience in software development, architecture, and analytics from industries like finance and telecom.

Email Capture Best Practices: Turning Website Visitors into Customers

Here’s a fun fact: 81% of businesses say email is their primary customer acquisition channel. Surprising? Not really, considering the ROI potential when you’re sliding into inboxes with the right message. 

But before you start picturing your email list exploding, let’s pause for a reality check – getting those emails in the first place is where most brands fumble.

Email capture isn’t just about tossing up a “Sign up for updates” box and hoping for the best. It’s about creating real connections with your visitors and giving them a reason to share their details.

The good news? 

You don't need to reinvent the wheel here – you just need to grease it with some practical, proven email capture best practices.

So let’s cut the fluff and get into how to turn your website visitors into paying customers.

Unlock High-Intent Leads Hiding on Your Site

Book a demo of Customers.ai’s U.S. website visitor identification, customer journey insights and remarketing platform to skyrocket conversions and sales.

Book a Demo

1. Build Trust First: The Foundation of Email Capture

Our first email capture best practice may sound simple but it’s not. Because trust is everything and if your visitors don’t feel confident handing over their email address, they won’t…no matter what you offer them. 

To build that trust, you have to start with the basics. In this case, we’re talking things like clear privacy policies and trust signals. A simple note like “We’ll never spam you or share your email” goes a long way in easing concerns. Add visible security icons or badges near your form to show you take data protection seriously.

But don’t stop there! 

Social proof is one of the most underrated ways to build trust. A glowing testimonial, a review snippet, or even a “Join 50,000 happy subscribers” line can nudge hesitant visitors into action.

Why? Because people trust people.

Quick tip: Place testimonials or a bit of social proof right next to your email capture form. Seeing that others have taken the leap makes it easier for new visitors to follow suit. 

2. Design Forms That Actually Get Filled Out

Nobody’s ever really excited to fill out a form, right? If they were, we wouldn’t be writing a post on email capture best practices because there would be no need!

Unfortunately, there is a need and for you, that means making your form fills as easy and appealing as possible. Consider the following:

Form placement is key. Above-the-fold forms are a no-brainer. Your visitors shouldn’t have to scroll to find them. 

Pop-ups are another powerhouse but timing is everything. Use exit-intent triggers to catch users as they’re about to leave or a delayed pop-up that shows once they’ve scrolled halfway down the page.

Simplicity wins. Email capture forms with fewer fields convert better. Ask for only what you need – this is usually just an email address and maybe a first name if personalization is part of your strategy. Save the detailed questions for later down the funnel.

Mobile responsiveness. Over half of web traffic is mobile, so a clunky form that doesn’t adjust to smaller screens is a conversion killer. Think large, tappable buttons and easy-to-read text.

Pro tip: Try multi-step forms. Instead of overwhelming visitors with a single long form, break it into smaller, digestible steps. For example, ask for their email first, then follow up with optional questions. Multi-step forms can boost engagement and completion rates because they feel less like a commitment.

The bottom line? A form that’s easy to spot, simple to fill out, and mobile-friendly is a form that gets results. 

3. Craft Irresistible Offers That Visitors Can’t Refuse

No one gives away their email address for free anymore. If you want those inboxes, you need to make it worth their while. 

Enter incentives. The right incentive can take your email capture game from “meh” to “heck yes!”

Start with what works. Discounts, freebies, and exclusive access are the classics for a reason. 

For ecommerce brands, a 10-20% discount on the first purchase is a tried-and-true winner. SaaS companies? Think free trials, limited-time access to premium features, or exclusive content like eBooks or templates.

But we all know that not all incentives are created equal. The secret is to match your offer to your audience. 

Are you targeting bargain-hunters? Go with a bold discount. 

Are your customers info-hungry pros? Offer insider tips, early access to products, or a VIP experience.

Creative examples to inspire:

Ecommerce: “Spin the Wheel” pop-ups where visitors can win discounts or free shipping. (Bonus: They’re interactive and fun!)

SaaS: “Unlock 7 Expert Email Templates That Double Conversions” for signing up.

Digital Products: “Sign up and get exclusive access to our pre-launch sale. Limited spots available!”

Pro tip: Make your offer feel urgent or exclusive. Phrases like “limited-time” or “only for the first 100 sign-ups” create FOMO and push visitors to act fast.

4. Timing is Everything: When and How to Capture Emails

When it comes to email capture, timing can make or break your efforts. Pop-ups, forms, and CTAs work best when they feel natural.

We’d also like to remind you that behavioral triggers are your best friend. Instead of blasting a pop-up the moment someone lands on your site (and risking an instant bounce), use smart triggers like:

Scroll depth: Show a form after a visitor scrolls 50% of the page.

Time on page: If someone sticks around for 30 seconds, they’re probably interested.

Exit intent: Catch users just as they’re about to leave with an offer they can’t refuse.

These triggers help you strike while the iron’s hot without annoying your visitors.

Speaking of annoyance, balance is key. Pop-ups can be powerful but don’t overdo them. Consider using slide-ins or sticky bars as gentler alternatives for some scenarios.

Pro tip: Test everything. Timing isn’t a one-size-fits-all strategy. Run A/B tests with different triggers to see what works best for your audience. For example, test whether a pop-up at 15 seconds performs better than one triggered at 30 seconds.

By timing your email capture efforts to align with user behavior, you’re creating the perfect moment to ask. It’s not called a best practice for nothing!

5. Personalization: Making Email Capture Feel Less Robotic

Customers expect way more from brands these days. They want you to go above and beyond for them and to make them feel special. 

That’s where personalization comes in. 

Start with personalized CTAs 

Generic calls-to-action like “Sign up for updates” just don’t cut it anymore. Instead, use language that speaks directly to the visitor and what they care about. For example:

“Get Your Exclusive 15% Off Today!”

“Hey, Boston! Don’t Miss Out on Our Local Deals!”

“Unlock [Product] Secrets for Your Next Big Win.”

Take it a step further with dynamic content 

Using tools that adjust your messaging based on user behavior, location, or referral source is a game-changer. If a visitor lands on your site after viewing a specific ad or blog post, tailor the email capture form to match their interests. For example:

A visitor browsing winter jackets? “Be the first to know about our winter collection drops.”

A reader checking out your beginner’s guide? “Get a free checklist for mastering [topic].”

Real-world examples of personalization done right:

Ecommerce: A fitness brand offering, “Get 10% off your first order of running gear” to visitors browsing running shoes.

SaaS: “Sign up for your free trial and get a personalized onboarding guide,” displayed to users on the pricing page.

Hospitality: “Planning a trip to NYC? Get exclusive local tips and deals,” for visitors researching travel destinations.

Pro tip: Pair personalization with urgency for even better results. A localized pop-up saying, “Limited spots left for free shipping to [visitor’s location]!” creates FOMO and drives action.

An email capture strategy without personalization is a poor one so make sure you have yours set.

6. Optimize for Conversion: The Secrets of High-Performing CTAs

Your call-to-action (CTA) is the tipping point between a bounce and a conversion. It’s where all your hard work comes down to one moment: will they click, or won’t they? 

Here’s how to stack the odds in your favor.

Use power words and action-oriented language 

Your CTA needs to grab attention and make clicking feel irresistible. Words like “Get,” “Unlock,” “Claim,” or “Discover” put the focus on action and benefits. Pair them with clear, direct promises:

“Claim Your 20% Discount Now!”

“Unlock Exclusive Content Instantly.”

“Sign Up Today and Save.”

Design matters more than you think

Button size, color, and placement all impact conversion rates. A CTA that blends into the background or sits in an awkward spot is a lost opportunity. 

Make your button:

Bold and contrasting: It should stand out visually on the page.

Easy to find: Above-the-fold placement is great, but don’t shy away from repeating CTAs further down the page.

Mobile-friendly: Big enough to tap without accidentally clicking something else.

Pro tip: Add urgency or scarcity to your copy. People act faster when they feel like they might miss out. Phrases like “Limited Time Offer,” “Only 3 Spots Left,” or “Ends Tonight” nudge visitors to take action before it’s too late.

Your CTA is one of the most underrated parts of the email capture form so make sure you do it right. 

7. Use Visitor Identification to Make Email Capture Easier (and Better)

Imagine capturing the emails of 30% of the people on your site – even if they don’t fill out a form. That’s the power of visitor identification tools like Customers.ai! 

If you aren’t familiar, visitor ID is a technology that identifies and enriches data on your site visitors, even the ones who might otherwise remain anonymous.

With tools like Customers.ai, you can match visitor data to known profiles, capturing emails and enriching them with valuable insights like purchase history or behavior patterns.

How it works for email capture:

Email capture on forms: Automatically link visitor profiles with form submissions or form abandoners, reducing friction for users while giving you more complete data.

Contact enrichment: Go beyond just an email address. Customers.ai enriches your leads with detailed information like demographics and activity so you can segment smarter and personalize better.

Seamless ESP integrations: Sync directly with tools like Klaviyo, Mailchimp, and HubSpot to instantly add contacts to your email lists and workflows without missing a beat.

Why this is essential: Visitor ID doesn’t replace the strategies above – it enhances them! 

Paired with clear CTAs, personalized offers, and well-timed pop-ups, visitor identification makes sure you’re capturing every opportunity to grow your list. 

Plus, by enriching data automatically, you can focus on converting leads instead of chasing them. Not bad, right? 

Capturing Emails is Only Step One

So there you have it…the blueprint for email capture best practices that work.

From building trust and designing irresistible forms to nailing the timing and using personalization to stand out, these best practices are your ticket to turning visitors into subscribers. Add in tools like Customers.ai to enrich your efforts and you’re operating at pro level.

But remember – capturing emails is just the beginning. What you do after that, the follow-ups, the welcome sequences, and the value you deliver, determines whether those new subscribers become paying customers.

Ready to level up your strategy? 

Start implementing these tips today, and watch your email list (and conversions) grow.

Get your free trial of Customers.ai today and capture 500 contacts free!

See Who Is On Your Site Right Now!

Get names, emails, phone numbers & more.

Try it Free, No Credit Card Required

Start Your Free Trial

Important Next Steps

See what targeted outbound marketing is all about. Capture and engage your first 500 website visitor leads with Customers.ai X-Ray website visitor identification for free.

Talk and learn about sales outreach automation with other growth enthusiasts. Join Customers.ai Island, our Facebook group of 40K marketers and entrepreneurs who are ready to support you.

Advance your marketing performance with Sales Outreach School, a free tutorial and training area for sales pros and marketers.

The post Email Capture Best Practices: Turning Website Visitors into Customers appeared first on Customers.ai.

Customer Targeting Strategies: How to Find and Engage Your Ideal Audience

If you’re still relying on the same old targeting tactics from five years ago (we see you broad demographics and generic email blasts) you’re leaving dollars on the table. 

Customers expect more now. And guess what? The ecommerce brands delivering that “more” are winning big.

In fact, companies that use advanced targeting strategies see 60% higher conversion rates compared to those that don’t. Why? Because they’re not just assuming they know what their audience wants – they know.

And that’s what we are going to talk about here. But…this isn’t your typical “define your audience” post. We’re diving into strategies that push beyond the basics, using real-time data, predictive analytics, and omnichannel precision to help you connect with your customers. 

Let’s get into it.


Customer Targeting: The Evolution from Basic to Advanced

Remember when knowing your customer’s age, gender, and location felt like enough? Back then, simply tossing up a campaign for “women, 25-34, interested in fitness” seemed cutting-edge. 

Spoiler alert: it’s not anymore.

Traditional targeting tactics just don’t cut it. They’re too vague, too slow, and let’s be real, they miss what really matters: behavior. 

Advanced marketers know it’s no longer about who your customers are…it’s about what they do.

That’s where advanced targeting comes in. 

Good marketers have moved on to real-time data, behavioral insights, and AI-driven segmentation. They've moved from "one-size-fits-all" campaigns to strategies that feel tailor-made for customers.

Here’s why this shift matters to you:

Your competitors are doing it. Advanced targeting is basically table stakes in 2024.

Today’s customers expect personalization. Anything less feels lazy (and they won’t hesitate to bounce).

It’s a conversion booster. The brands that get this right see higher ROI, deeper engagement, and happier customers who keep coming back.

If you’re still relying on old-school methods, it’s time to knock it off and level up. 

So, let’s break down how to build strategies that actually work in the next section.

The Foundation: Knowing Your Ideal Customer Profile (ICP)

Before you can create an effective customer targeting strategy, you need to know exactly who you’re targeting. And no, “millennials who like coffee” isn’t cutting it anymore. 


Your Ideal Customer Profile (ICP) goes way deeper than just demographics. It’s about understanding behaviors, needs, and the problems your customers want solved.

1. Go Beyond the Basics

Sure, start with the basics like age, location, and income, but don’t stop there. A great ICP includes:

Pain points: What challenges are they facing that your product solves?

Buying behavior: Do they prefer discounts? Are they impulse buyers or slow decision-makers?

Motivations: What drives their purchase decisions? Convenience, quality, or something else?

2. Use First-Party Data and Customer Feedback

Your ICP isn’t something you pull out of thin air. Use the data you already have:

Analyze customer purchase history, browsing habits, and loyalty trends.

Send surveys or gather feedback directly from customers. Tools like Typeform or Google Forms make this easy.

Dive into your CRM to spot patterns. Look for traits that your best customers have in common.

3. Leverage Tools to Map Your ICP

Creating a detailed ICP is easier with the right tools.

Visitor Identification Tools: Tools like Customers.ai can analyze visitor behavior to identify high-value segments.

Segmentation Tools: Platforms like HubSpot and Klaviyo help you break down your audience into actionable groups.

Visualization Tools: Tools like Funnelytics let you map out customer journeys tied to specific ICPs, giving you a clear picture of how they interact with your brand.

Your ICP isn’t just a profile. Nail this and everything else (ads, emails, offers) becomes sharper, smarter, and way more effective. 

Ready to start segmenting? Let’s look at five customer targeting strategies.

1. Behavioral Segmentation: Targeting Based on What Customers Do

Knowing who your customers are is great but knowing what they do? That’s where the good stuff happens! 

Behavioral segmentation takes your targeting to the next level by focusing on customer actions. Things like how they interact with your site, what they buy, and even what they don’t buy.

What is Behavioral Segmentation?

It’s exactly what it sounds like: grouping your audience based on their behaviors instead of just demographics. Think about it…would you send the same email to someone who abandoned their cart as you would to someone who’s a repeat buyer? Nope!

Behavioral segmentation ensures your campaigns are relevant, timely, and way more likely to convert.

Examples for Ecommerce:

Here are some common behavioral segments and how you can use them:

High-Spend vs. Low-Spend Customers

High-spend customers: Reward them with exclusive offers or early access to sales.

Low-spend customers: Tempt them with discounts or bundles to increase average order value.

Repeat Buyers vs. First-Time Visitors

Repeat buyers: Build loyalty with VIP programs or personalized recommendations.

First-time visitors: Welcome them with a discount or a “why shop with us” email.

Cart Abandoners and Wishlist Creators

Cart abandoners: Hit them with a friendly reminder email, throw in free shipping, or offer a time-sensitive discount.

Wishlist creators: Send alerts when their saved items go on sale or when stock is running low.

Customer Targeting Strategies to Engage Each Behavioral Group

Once you’ve identified your segments, it’s time to craft campaigns that speak to them directly:

Use Dynamic Content: Tailor your emails, ads, and website experiences based on behavior.

Set Up Trigger-Based Emails: For example, send an automated email when someone abandons their cart or makes a second purchase.

Personalize Your Offers: Use behavioral data to craft promotions that feel like they were made just for that customer.

The goal of behavioral segmentation is to work smarter. When you understand what makes your customers tick, you can deliver the kind of experiences that keep them coming back (and spending more). 

Let’s talk real-time personalization in the next section.

2. Real-Time Personalization: The Key to Engaging in the Moment

Real-time personalization offers you a chance to connect with customers in the exact moment they’re ready to engage. By using customer data in real time, you can create experiences that feel tailor-made and impossible to ignore.

How to Use Customer Data for Real-Time Personalization

It starts with capturing data as customers interact with your site – think browsing behavior, location, or even cart activity. 

With this data, you can serve up personalized offers, recommendations, or messages that feel instant and relevant. The goal? Anticipate their needs before they even realize them.

Tactics for Ecommerce:

Here are some killer ways to put real-time personalization into action:

Dynamic Product Recommendations: Show customers products they're most likely to love based on their browsing or purchase history. For example, if they're looking at running shoes, recommend a best-selling pair or complementary gear like socks or water bottles.

Location-Based Offers: Use geolocation to tailor promotions. Visitors from warm climates? Highlight your summer collection. In colder regions? Show them winter gear or offer free shipping to their area.

Personalized Pop-Ups: Trigger pop-ups based on behavior.

For first-time visitors: Offer a welcome discount.

For cart abandoners: Pop up with an extra incentive, like free shipping if they complete their purchase.

For frequent visitors: Highlight new arrivals in categories they’ve browsed.

Tools That Make Real-Time Personalization Seamless

You don’t have to do this manually (because who has time for that?). Use tools like:

Customers.ai: Real-time visitor identification and contact enrichment.

Dynamic Yield: Advanced AI-powered personalization for ecommerce sites.

OptinMonster: Great for behavior-triggered pop-ups and on-site targeting.

Klaviyo: Perfect for personalized email and SMS campaigns tied to real-time site actions.

Real-time personalization offers the kind of experience that builds loyalty and drives conversions. 

Ready to take it omnichannel? Let’s explore that next.

3. Omnichannel Targeting: Reaching Customers Wherever They Are

Your customers aren't just hanging out in one spot. They're scrolling TikTok, checking emails, reading texts, and sometimes, if you're lucky, browsing your site.

To really connect with them, you need to be everywhere they are. That’s where omnichannel targeting comes in.

Why Omnichannel Matters for Ecommerce

Gone are the days of relying on a single channel to drive sales. Today’s customers expect brands to show up consistently across every platform they use. And when done right, omnichannel strategies lead to 23x higher customer satisfaction and better ROI. Not bad, right?

For ecommerce, this means:

Email: Sending personalized offers and order updates.

SMS: Delivering time-sensitive promotions and cart reminders.

Social: Running retargeting ads or engaging organically.

Paid Ads: Reaching them wherever they browse, from Google to YouTube.

Targeting Strategies for Creating a Seamless Omnichannel Experience

Unified Messaging Across Platforms

Your customers should feel like they’re talking to the same brand, whether they’re opening an email or seeing an Instagram ad. Use consistent branding, tone, and visuals to tie everything together. 

Pro tip: Sync your messaging so one platform picks up where another left off (e.g., retargeting cart abandoners from email to social).

Channel-Specific Personalization

Don’t just copy-paste your campaigns across channels. Tailor your message to fit the context:

On email: Share detailed product recommendations.

On SMS: Keep it short and sweet, like “Don’t miss 15% off—ends tonight!”

On social: Use visuals that grab attention and match their browsing history.

Omnichannel targeting is all about creating a cohesive, personalized experience that meets your customers wherever they are. 

Ready to get even smarter? Let’s dive into predictive targeting next.

4. Predictive Targeting: Using AI to Anticipate Customer Needs

Imagine knowing what your customers want before they even realize it. That’s the power of predictive targeting! 

By analyzing past behaviors and patterns, AI can help you anticipate customer needs and take action at the perfect moment to boost engagement and conversions.

How Predictive Analytics Works in Customer Targeting

Predictive targeting uses machine learning and AI to analyze customer data like browsing history, purchase behavior, and engagement metrics, and identify trends. 

From there, it forecasts future actions, such as what products a customer is likely to buy or when they might disengage. This allows you to create campaigns that feel perfectly timed and personalized.

Examples of Predictive Targeting in Ecommerce

Here’s how predictive targeting plays out in real-world ecommerce strategies:

Anticipating Product Needs: If a customer browses winter coats, predictive analytics might suggest they’re likely to buy boots or scarves next. Using this insight, you can offer product recommendations or bundles tailored to their shopping habits.

Predicting Churn Risk: AI can identify customers who haven’t interacted with your site or emails in a while, flagging them as at-risk. With this knowledge, you can send re-engagement campaigns like exclusive discounts or personalized “we miss you” messages before they churn.

Tools to Implement Predictive Targeting

While Customers.ai focuses on identifying visitors and providing actionable demographic insights, other tools specialize in predictive analytics:

Dynamic Yield: Delivers AI-driven recommendations and targeted campaigns based on customer behavior.

Bluecore: Perfect for ecommerce brands looking to predict purchase intent and optimize email campaigns.

Adobe Sensei: Part of Adobe Analytics, it uses machine learning to uncover trends and drive predictive targeting.

Predictive targeting helps you create experiences that feel intuitive, timely, and personal. Pairing Customers.ai with predictive analytics tools can give you the ultimate edge in understanding and engaging your audience. 

Up next: testing and refining these strategies through experimentation.

5. Experimentation: The Role of Testing in Advanced Customer Targeting

Think you’ve nailed your targeting strategy? Think again. 

Advanced customer targeting isn’t a “set it and forget it” game. Testing is where the real wins happen but A/B testing alone won’t cut it anymore. 

To truly optimize, you need to level up your experimentation with multivariate testing and dynamic strategies.

Why A/B Testing Isn’t Enough

Sure, A/B testing is great for comparing two versions of an email subject line or ad creative. But what happens when you want to test multiple elements at once (think headline, image, and CTA)? 

That’s where multivariate testing comes in. It lets you test several variables at the same time, giving you deeper insights into what really resonates with your audience.

How to Test Advanced Targeting Strategies Effectively

Testing isn’t just about trying random things. It’s about being intentional and focused. Here’s how to do it right:

Dynamic Ad Creatives: Test different combinations of headlines, images, and CTAs for your ads. For example, create versions of an ad targeting "repeat buyers" with personalized product recommendations, then tweak the messaging to see what drives the highest clicks and conversions.

Personalized Product Recommendations: Run experiments to test which types of recommendations work best for different segments. Do customers respond more to "bestsellers" or "new arrivals"? What about "frequently bought together" bundles? Use real-time data to refine as you go.

Tips for Measuring Success and Scaling What Works

Once you’ve run your tests, here’s how to make sense of the results and put them to work:

Define Clear Metrics: Whether it’s CTR, conversion rate, or revenue per visitor, know what success looks like before you start.

Look at the Data, Not Just the Winner: Multivariate tests often reveal unexpected insights—pay attention to trends across segments.

Scale Strategically: Take what works and roll it out across your campaigns, but don’t forget to keep testing. What works today might not work next month.

When you’re constantly refining your approach, you’re able to stay ahead of the game. 


How Customers.ai Helps with Customer Targeting

When it comes to advanced customer targeting, Customers.ai is the ultimate tool for ecommerce marketers looking to gain real-time insights and engage their ideal audience. 

Instead of relying on incomplete data or broad assumptions, Customers.ai gives you the clarity you need to target effectively. 

Here’s how it works and why it’s a must-have.

Real-Time Visitor Identification

Most tools leave you guessing who’s on your site. Customers.ai identifies visitors in real time, turning anonymous traffic into actionable leads.

With details like location, industry, and even past interactions, you’re armed with data that takes the guesswork out of targeting.

Actionable Demographic Data

Customers.ai goes beyond surface-level insights. You’ll get a deep dive into who’s visiting your site, including demographics, behaviors, and preferences. 

This means you can segment and target your audience with precision, delivering campaigns that actually resonate.

Seamless Campaign Integration

Once you know who your visitors are, Customers.ai makes it easy to act on that data. 

Whether you’re running email campaigns, SMS outreach, or paid ads, the insights flow directly into your marketing stack, so you can create hyper-targeted campaigns effortlessly.

Boost Conversions with Smarter Targeting

By knowing exactly who your audience is, you can tailor every touchpoint from personalized pop-ups to follow-up emails. 

Customers.ai helps you connect with your audience at the perfect moment, turning clicks into customers and visits into sales.

If you’re serious about targeting the right audience and driving better ROI, Customers.ai is your go-to. 

Your Blueprint for a Smarter Customer Targeting Strategy

If there’s one takeaway here, it’s this – traditional targeting tactics are holding you back. 

The days of relying on broad demographics or generic campaigns are over and today’s ecommerce success comes from advanced strategies like behavioral segmentation, real-time personalization, and precise customer targeting.

By moving beyond the basics, you're able to build deeper connections with your audience. Whether it's engaging cart abandoners with a timely offer, using real-time data to personalize experiences, or leveraging tools like Customers.ai to uncover your ideal audience, the customer targeting strategies outlined here are designed to help you thrive.

Remember…it’s time to stop guessing and start connecting. It’s time to take action! 

Implement these strategies, invest in the right tools, and watch your conversions climb. 

Ready to get started? Start your free trial of Customers.ai today and see how we can help.


Customer Targeting Strategy FAQs

What are customer targeting strategies, and why are they important?

Customer targeting strategies are techniques used to identify, segment, and engage specific groups of customers based on their characteristics, behaviors, and preferences. They are essential because they help businesses focus on the most relevant audiences, improve engagement, and boost conversions by delivering personalized experiences.

How can I improve my customer targeting strategies?

To improve customer targeting strategies, use data-driven insights from tools like demographic trackers or CRM platforms. Incorporate advanced tactics like behavioral segmentation, real-time personalization, and predictive analytics. Continuously test and refine your strategies through A/B or multivariate testing to understand what resonates most with your audience.

What is the difference between basic and advanced customer targeting?

Basic customer targeting relies on broad categories like age, gender, and location. Advanced targeting goes deeper, using real-time data, behavioral insights, and AI-driven segmentation to create more personalized and effective campaigns tailored to specific customer actions and preferences.

Why is segmentation critical in customer targeting strategies?

Segmentation allows you to divide your audience into smaller, more actionable groups based on shared characteristics or behaviors. This makes your campaigns more focused, personalized, and impactful, as you can tailor messaging and offers to the unique needs of each segment.

What are some effective ways to segment customers?

You can segment customers based on:

Demographics (age, gender, income).

Behaviors (purchasing habits, browsing activity).

Interests (hobbies, preferences).

Geography (location, climate).

Engagement level (new visitors vs. loyal customers).

How does behavioral segmentation work?

Behavioral segmentation groups customers based on their actions, such as browsing habits, purchase history, or engagement patterns. For example, segmenting cart abandoners allows you to send specific follow-ups, while targeting high-spend customers with loyalty rewards drives retention.

What role does real-time data play in customer targeting?

Real-time data allows businesses to act immediately on customer behaviors, such as offering personalized product recommendations or triggering pop-ups based on browsing activity. It ensures your campaigns are timely, relevant, and more likely to convert.

How do I create a strong Ideal Customer Profile (ICP)?

To create a robust ICP, combine demographic information (age, location, income) with behavioral and psychographic data (values, pain points, purchasing habits). Use tools like CRM platforms and analytics software to gather insights from past customers and refine your profile over time.

What tools are best for customer targeting strategies?

Some popular tools include:

Customers.ai: For real-time visitor identification.

HubSpot: For CRM and email segmentation.

Google Analytics: For demographic and behavioral data.

Dynamic Yield: For AI-driven personalization.

How does omnichannel targeting improve customer engagement?

Omnichannel targeting ensures your brand is present and consistent across multiple platforms, such as email, SMS, social media, and paid ads. This creates a seamless customer experience, increasing engagement and making it easier to reach customers wherever they are.

What is predictive targeting, and how does it work?

Predictive targeting uses AI to analyze customer data and forecast future actions, such as what products they’re likely to buy or when they might disengage. This allows businesses to proactively create campaigns that anticipate customer needs, improving retention and conversions.

How do I measure the success of my customer targeting strategies?

Measure success by tracking key performance indicators (KPIs) like conversion rates, click-through rates, customer lifetime value, and ROI. Use these metrics to analyze the effectiveness of your campaigns and identify areas for improvement.

Can customer targeting strategies help reduce churn?

Yes, by identifying at-risk customers through behavioral data and predictive analytics, you can implement re-engagement strategies, such as personalized discounts, loyalty rewards, or targeted email campaigns, to reduce churn and retain customers.

What’s the role of personalization in customer targeting?

Personalization is key to effective targeting. By tailoring messages, offers, and experiences to individual customers based on their preferences and behaviors, you can improve engagement, build loyalty, and drive higher conversions.

How do I test my customer targeting strategies effectively?

Use multivariate testing to experiment with multiple variables, such as ad creatives, CTAs, and product recommendations. Focus on specific audience segments, analyze the results, and scale the tactics that perform best.

Why is customer feedback important in targeting strategies?

Customer feedback provides valuable insights into what your audience wants and needs. Use surveys, reviews, and direct interactions to refine your ICP, improve segmentation, and ensure your campaigns resonate with your customers.

How does ethical data collection impact customer targeting?

Ethical data collection builds trust with your audience and ensures compliance with regulations like GDPR and CCPA. By being transparent about how you use data and allowing customers to opt in or out, you can maintain a positive reputation while still leveraging valuable insights.

How can small businesses implement advanced targeting strategies?

Small businesses can start with affordable tools like Google Analytics and free trials of CRM platforms. Focus on collecting first-party data, segmenting your audience, and testing small, personalized campaigns before scaling efforts.

How does customer targeting impact ROI?

Targeting the right audience with the right message at the right time reduces wasted ad spend, improves engagement, and increases conversions. Businesses that implement advanced targeting strategies often see significant ROI improvements due to more effective marketing.
The post Customer Targeting Strategies: How to Find and Engage Your Ideal Audience appeared first on Customers.ai.

Researchers from USC and Prime Intellect Released METAGENE-1: A 7B Parameter Autoregressive Transformer Model Trained on Over 1.5T DNA and RNA Base Pairs

In a time when global health faces persistent threats from emerging pandemics, the need for advanced biosurveillance and pathogen detection systems is increasingly evident. Traditional genomic analysis methods, while effective in isolated cases, often struggle to address the complexities of large-scale health monitoring. A significant challenge is identifying and understanding the genomic diversity in environments such as wastewater, which contains a rich mix of microbial and viral DNA and RNA. The rapid advancements in biological research have further emphasized the importance of scalable, accurate, and interpretable models to analyze vast amounts of metagenomic data, aiding in the prediction and mitigation of health crises.

Researchers from the University of Southern California, Prime Intellect, and the Nucleic Acid Observatory have introduced METAGENE-1, a metagenomic foundation model. This 7-billion-parameter autoregressive transformer model is specifically designed to analyze metagenomic sequences. METAGENE-1 is trained on a dataset comprising over 1.5 trillion DNA and RNA base pairs derived from human wastewater samples, utilizing next-generation sequencing technologies and a tailored byte-pair encoding (BPE) tokenization strategy to capture the intricate genomic diversity present in these datasets. The model is open-sourced, encouraging collaboration and further advancements in the field.

Technical Highlights and Benefits

METAGENE-1’s architecture draws on modern transformer models, including GPT and Llama families. This decoder-only transformer uses a causal language modeling objective to predict the next token in a sequence based on preceding tokens. Its key features include:

Dataset Diversity: The training data encompasses sequences from tens of thousands of species, representing the microbial and viral diversity found in human wastewater.

Tokenization Strategy: The use of BPE tokenization enables the model to process novel nucleic acid sequences efficiently.

Training Infrastructure: Advanced distributed training setups ensured stable training on large datasets despite hardware limitations.

Applications: METAGENE-1 supports tasks like pathogen detection, anomaly detection, and species classification, making it valuable for metagenomic studies and public health research.

These features enable METAGENE-1 to generate high-quality sequence embeddings and adapt to specific tasks, enhancing its utility in the genomic and public health domains.
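
As a rough illustration of how such sequence embeddings could be extracted with the Hugging Face transformers library, consider the sketch below; the repository ID is a placeholder (check the project's Hugging Face page for the exact name), and loading details such as trust_remote_code may vary:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository ID -- check the project's Hugging Face page for the exact name.
model_id = "metagene-ai/METAGENE-1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # trust_remote_code=True may be required

read = "TCACCGTTCTACAATCCCAAGCTGGAGTCAAGCTCAACAGGGTCTTC"  # an illustrative metagenomic read
inputs = tokenizer(read, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
embedding = outputs.hidden_states[-1].mean(dim=1)  # mean-pooled embedding for downstream classifiers
print(embedding.shape)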

Results and Insights

The capabilities of METAGENE-1 were assessed using multiple benchmarks, where it demonstrated notable performance. In a pathogen detection benchmark based on human wastewater samples, the model achieved an average Matthews correlation coefficient (MCC) of 92.96, significantly outperforming other models. Additionally, METAGENE-1 showed strong results in anomaly detection tasks, effectively distinguishing metagenomic sequences from other genomic data sources.

In embedding-based genomic analyses, METAGENE-1 excelled on the Gene-MTEB benchmark, achieving a global average score of 0.59. This performance underscores its adaptability in both zero-shot and fine-tuning scenarios, reinforcing its value in handling complex and diverse metagenomic data.

Conclusion

METAGENE-1 represents a thoughtful integration of artificial intelligence and metagenomics. By leveraging transformer architectures, the model offers practical solutions for biosurveillance and pandemic preparedness. Its open-source release invites researchers to collaborate and innovate, advancing the field of genomic science. As challenges related to emerging pathogens and global pandemics continue, METAGENE-1 demonstrates how technology can play a crucial role in addressing public health concerns effectively and responsibly.

Check out the Paper, Website, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post Researchers from USC and Prime Intellect Released METAGENE-1: A 7B Parameter Autoregressive Transformer Model Trained on Over 1.5T DNA and RNA Base Pairs appeared first on MarkTechPost.

This AI paper from the Beijing Institute of Technology and Harvard Unveils TXpredict for Predicting Microbial Transcriptomes

Predicting transcriptomes directly from genome sequences is a significant challenge in microbial genomics, particularly for the numerous sequenced microbes that remain unculturable or require complex experimental protocols like RNA-seq. The gap between genomic information and functional understanding leaves us without knowledge of microbial adaptive processes, survival mechanisms, and gene regulation. Closing this gap is necessary to improve studies of microbial ecosystems, analysis of non-model organisms, and synthetic biology applications.

Current transcriptome profiling techniques are mainly experimental approaches such as RNA sequencing, which are time-consuming, expensive, and usually unsuitable for microorganisms with special growth requirements or those that survive in extreme environments. Computational models based on UTRs or long DNA sequences are only partially useful because they cannot be easily generalized across taxonomic groups. Moreover, these methods fail to consider evolutionary constraints relevant to protein synthesis, making them even less useful for predicting the transcriptomes of non-model and novel microbial species.

Researchers from the Beijing Institute of Technology and Harvard University propose TXpredict, a framework for transcriptome prediction that utilizes annotated genome sequences. It leverages a pre-trained protein language model (ESM2) to extract predictive features from protein embeddings while incorporating evolutionary principles. This approach addresses limitations in scalability, generalizability, and computational efficiency, and introduces new capabilities such as condition-specific gene expression prediction. Because it can analyze diverse microbial taxa, including unculturable species, TXpredict represents a significant advancement in microbial genomics.

TXpredict is based on transcriptome data for 22 bacterial and 10 archaeal species, featuring 11.5 million gene expression measurements. The model uses a transformer encoder architecture with multi-head self-attention to capture complex sequence relationships. Inputs include protein embeddings from ESM2 and basic sequence statistics. Model training utilized leave-one-genome-out cross-validation for robust generalization. Condition-specific predictions were also enabled by incorporating 5′ UTR sequences. The framework is computationally efficient, completing transcriptome prediction for a microbial genome within 22 minutes on standard hardware.

TXpredict proved to be very accurate and scalable in the context of transcriptome prediction. It achieved a mean Spearman correlation coefficient of 0.53 for bacterial organisms and 0.42 for archaea and showed significant results for specific species such as B. hinzii (0.64), B. thetaiotaomicron (0.62), and C. beijerinckii (0.62). The predictions were extended to 900 additional genomes representing 276 genera and 3.11 million genes, which covered a large number of previously uncharacterized taxa. In the context of condition-specific transcriptomes, the model showed an average correlation of 0.52 over 4.6k experimental conditions, thereby capturing dynamic regulatory patterns. These results indicate that the framework is capable of giving precise predictions across a wide range of microbial species while keeping computational efficiency in check.

TXpredict addresses critical challenges in microbial genomics by bridging the gap between genome sequences and transcriptome predictions. This method, with the integration of protein embeddings, evolutionary constraints, and features specific to different conditions, presents a scalable, precise, and effective solution for various microbial taxa. This strategy not only yields valuable insights into gene regulation and adaptation but also possesses the potential to enhance synthetic biology and ecological research. Notwithstanding certain limitations, including dependence on pre-existing RNA-seq datasets and the exclusion of non-coding RNA components, TXpredict establishes a foundational framework for innovative applications in the field of microbial research.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This AI paper from the Beijing Institute of Technology and Harvard Unveils TXpredict for Predicting Microbial Transcriptomes appeared first on MarkTechPost.

Meet Height: An Autonomous Project Management Platform Leading the Next Wave of AI Tools

When it comes to AI tools, chatbots are often the first thing that comes to mind: conversation-based interfaces where users write queries and receive responses. These dialogue interfaces are certainly useful, but they aren't always the best fit for handling our everyday work. Often tacked onto the side of our workflows, chatbots supplement our processes, but they often add quite a bit of friction as well.

The next wave of AI tools is now immersing autonomous actions into our daily work. Today, we’re excited to introduce you to Height.app, an autonomous project management tool that takes work off your plate instead of adding to it. Height uses real-time context from your team’s interactions and workspace data to handle tedious tasks on your behalf, like triaging bugs, updating specs, cleaning up your backlog, and more.

Height has a suite of LLM-based features for all the time-consuming project management tasks we handle on a day-to-day basis. Let's walk through a few below:

1. Real-time edits to product documentation

As you’re building a feature, product documentation rarely ever stays the same. New ideas emerge, blockers arise, and scope is inevitably redefined. Accounting for each of those spec changes is remarkably tedious, especially in fast-moving projects. Height looks at these situations granularly, treating each message, whether an idea or blocker, as an event, and those events as contextual information for how an LLM should take action.

By analyzing your team’s project conversations as they unfold inside the app, Height is able to discern when your team raises questions and decisions are made. Then, it maps those identified outcomes back to your product documentation, adding necessary context and updates without you having to lift a finger.

2. Autonomous backlog grooming

Staying on top of a backlog feels like a never-ending chore. Tickets are often created without anyone else knowing, and rarely are the appropriate tags applied with consistency. But inside of every filed ticket is contextual data, like name, description, and chat messages, all of which an LLM can use to infer the feature being referenced inside the ticket and what completing it may entail.

Height recognizes when tickets are added to the backlog, and then uses the contextual data of each ticket to apply appropriate tags on your behalf. From feature tags to time estimates, and even impact type, Height proactively keeps your backlog organized — so it’s easy for you to find the improvements and requests worth pursuing next.

3. Live project updates

Whether you’re building a new feature or updating your website, tracking completed tasks and open questions requires overwhelming mental effort. But for large and collaborative initiatives, it’s especially important to keep everyone aligned on what’s progressing and what’s blocked. Conveniently, most of the project information you need to track — status changes, discussions between teammates, and blocked tasks — is contextual data that LLMs can parse through and make sense of in an instant.

Height processes all of this project data to create a cadenced regular report of how the project is progressing and what’s left to work on. Instead of manually sifting through activity to write your own updates, Height provides a detailed summary of what’s happened, the tasks in progress, and what’s left to accomplish (as well as any flagged blockers).

Height’s core philosophy is that AI should reduce friction rather than add layers of complexity. With an intentional approach to making project data accessible to LLMs, Height is moving beyond traditional chatbot implementations, focusing instead on real-time context processing to drive intelligent automation. The result is a tool that handles the frustrating pain points of managing projects, so you can focus on building.

How to get started with Height 2.0

Getting started with Height is easy.

Go to Height.app.

Tap the sign-up button and create a workspace for your team.

Follow along with the interactive onboarding to get started easily.

Thanks to the Height team for this thought leadership and educational article. The Height team has supported us in this content.
The post Meet Height: An Autonomous Project Management Platform Leading the Next Wave of AI Tools appeared first on MarkTechPost.

Efficiently build and tune custom log anomaly detection models with Amazon SageMaker

In this post, we walk you through the process to build an automated mechanism using Amazon SageMaker to process your log data, run training iterations over it to obtain the best-performing anomaly detection model, and register it with the Amazon SageMaker Model Registry for your customers to use.
Log-based anomaly detection involves identifying anomalous data points in log datasets for discovering execution anomalies, as well as suspicious activities. It usually comprises parsing log data into vectors or machine-understandable tokens, which you can then use to train custom machine learning (ML) algorithms for determining anomalies.
You can adjust the inputs or hyperparameters for an ML algorithm to obtain a combination that yields the best-performing model. This process is called hyperparameter tuning and is an essential part of machine learning. Choosing appropriate hyperparameter values is crucial for success, and it’s usually performed iteratively by experts, which can be time-consuming. Added to this are the general data-related processes such as loading data from appropriate sources, parsing and processing them with custom logic, storing the parsed data back to storage, and loading them again for training custom models. Moreover, these tasks need to be done repetitively for each combination of hyperparameters, which doesn’t scale well with increasing data and new supplementary steps. You can use Amazon SageMaker Pipelines to automate all these steps into a single execution flow. In this post, we demonstrate how to set up this entire workflow.
Solution overview
Contemporary log anomaly detection techniques such as Drain-based detection [1] or DeepLog [2] consist of the following general approach: perform custom processing on logs, train their anomaly detection models using custom models, and obtain the best-performing model with an optimal set of hyperparameters. To build an anomaly detection system using such techniques, you need to write custom scripts for processing as well for training. SageMaker provides support for developing scripts by extending in-built algorithm containers, or by building your own custom containers. Moreover, you can combine these steps as a series of interconnected stages using SageMaker Pipelines. The following figure shows an example architecture:

The workflow consists of the following steps:

The log training data is initially stored in an Amazon Simple Storage Service (Amazon S3) bucket, from where it’s picked up by the SageMaker processing step of the SageMaker pipeline.
After the pipeline is started, the processing step loads the Amazon S3 data into SageMaker containers and runs custom processing scripts that parse and process the logs before uploading them to a specified Amazon S3 destination. This processing could be either decentralized with a single script running on one or more instances, or it could be run in parallel over multiple instances using a distributed framework like Apache Spark. We discuss both approaches in this post.
After processing, the data is automatically picked up by the SageMaker tuning step, where multiple training iterations with unique hyperparameter combinations are run for the custom training script.
Finally, the SageMaker model step creates a SageMaker model using the best-trained model obtained from the tuning step and registers it to the SageMaker Model Registry for consumers to use. These consumers, for example, could be testers, who use models trained on different datasets by different pipelines to compare their effectiveness and generality, before deploying them to a public endpoint.

We walk through implementing the solution with the following high-level steps:

Perform custom data processing, using either a decentralized or distributed approach.
Write custom SageMaker training scripts that automatically tune the resulting models with a range of hyperparameters.
Select the best-tuned model, create a custom SageMaker model from it, and register it to the SageMaker Model Registry.
Combine all the steps in a SageMaker pipeline and run it.

Prerequisites
You should have the following prerequisites:

An AWS account
A SageMaker notebook instance
An S3 bucket to store the input data

Process the data
To start, upload the log dataset to an S3 bucket in your AWS account. You can use the AWS Command Line Interface (AWS CLI) using Amazon S3 commands, or use the AWS Management Console. To process the data, you use a SageMaker processing step as the first stage in your SageMaker pipeline. This step spins up a SageMaker container and runs a script that you provide for custom processing. There are two ways to do this: decentralized or distributed processing. SageMaker provides Processor classes for both approaches. You can choose either approach for your custom processing depending on your use case.
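For example, assuming your raw logs sit in a local logs/ folder and you reuse the placeholder bucket from this post, a quick upload from your notebook could look like the following sketch:

import os

import boto3

s3 = boto3.client("s3")
bucket = "amzn-s3-demo-bucket-pca-detect"  # placeholder bucket name used throughout this post

# Upload every file in a local logs/ folder under the processing_input/ prefix
for file_name in os.listdir("logs"):
    s3.upload_file(
        Filename=os.path.join("logs", file_name),
        Bucket=bucket,
        Key=f"processing_input/{file_name}",
    )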
Decentralized processing with ScriptProcessor
In the decentralized approach, a single custom script runs on one or more standalone instances and processes the input data. The SageMaker Python SDK provides the ScriptProcessor class, which you can use to run your custom processing script in a SageMaker processing step. For small datasets, a single instance can usually suffice for performing data processing. Increasing the number of instances is recommended if your dataset is large and can be split into multiple independent components, which can all be processed separately (this can be done using the ShardedByS3Key parameter, which we discuss shortly).
If you have custom dependencies (which can often be the case during R&D processes), you can extend an existing container and customize it with your dependencies before providing it to the ScriptProcessor class. For example, if you’re using the Drain technique, you need the logparser Python library for log parsing, in which case you write a simple Dockerfile that installs it along with the usual Python ML libraries:

FROM python:3.7-slim-buster
RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3 logparser3 boto3
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

You can use a Python SageMaker notebook instance in your AWS account to create such a Dockerfile and save it to an appropriate folder, such as docker. To build a container using this Dockerfile, enter the following code into a main driver program in a Jupyter notebook on your notebook instance:

import boto3
from sagemaker import get_execution_role

region = boto3.session.Session().region_name
role = get_execution_role()
account_id = boto3.client("sts").get_caller_identity().get("Account")
ecr_repository = "sagemaker-processing-my-container"
tag = ":latest"

uri_suffix = "amazonaws.com"
if region in ["cn-north-1", "cn-northwest-1"]:
    uri_suffix = "amazonaws.com.cn"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
    account_id, region, uri_suffix, ecr_repository + tag
)

# Create ECR repository and push docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

This code creates an Amazon Elastic Container Registry (Amazon ECR) repository where your custom container image will be stored (the repository will be created if it’s not already present). The container image is then built, tagged with the repository name (and :latest), and pushed to the ECR repository.
The next step is writing your actual processing script. For more information on writing a processing script using ScriptProcessor, refer to Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation. The following are a few key points to remember:

A SageMaker processing step loads the data from an input location (Amazon S3 or local developer workspace) to an input path specified by you under the /opt/ml/processing directory of your container. It then runs your script in the container and uploads the output data from your specified path under /opt/ml/processing to an Amazon S3 destination you’ve specified.
Customer log datasets can sometimes consist of multiple subsets without any inter-dependencies amongst them. For these cases, you can parallelize your processing by making your processing script run over multiple instances in a single processing step, with each instance processing one of these independent subsets. It’s a good practice to keep the script’s logic self-contained, so that each run on every instance happens independently of the others and no work is duplicated. A minimal example script is sketched after this list.
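The following is a minimal sketch of what such a preprocessing.py could look like, assuming a toy whitespace-based parser. The hypothetical parse_line() helper stands in for your real log parser (for example, Drain), and the paths match the container conventions described above:

# preprocessing.py - illustrative only; replace parse_line() with your real log parser
import csv
import os

INPUT_DIR = "/opt/ml/processing/input"
OUTPUT_DIR = "/opt/ml/processing/train"


def parse_line(line):
    # Hypothetical parser: split a raw log line into (timestamp, level, message)
    parts = line.strip().split(" ", 2)
    return parts if len(parts) == 3 else None


def main():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with open(os.path.join(OUTPUT_DIR, "train.csv"), "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for file_name in os.listdir(INPUT_DIR):
            with open(os.path.join(INPUT_DIR, file_name)) as in_file:
                for line in in_file:
                    parsed = parse_line(line)
                    if parsed:
                        writer.writerow(parsed)


if __name__ == "__main__":
    main()

A real parser would typically emit numeric feature vectors (for example, event-count vectors) rather than raw string fields, but the input and output paths stay the same.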

When your script is ready, you can instantiate the SageMaker ScriptProcessor class for running it on your custom container (created in the previous step) by adding the following code to your driver program:

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
    ScriptProcessor,
)
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

script_processor = ScriptProcessor(
    command=["python3"],
    image_uri=processing_repository_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=pipeline_session,
)

script_processor_run_args = script_processor.run(
    code="preprocessing.py",
    inputs=[
        ProcessingInput(
            source="s3://amzn-s3-demo-bucket-pca-detect/processing_input/",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(output_name="training", source="/opt/ml/processing/train")
    ],
)

step_processing = ProcessingStep(
    name="PreprocessStep",
    step_args=script_processor_run_args,
)

In the preceding code, a ScriptProcessor class is being instantiated to run the python3 command for running your custom Python script. You provide the following information:

You provide the ECR URI of your custom container image and give SageMaker PipelineSession credentials to the class. When you specify the PipelineSession, the ScriptProcessor doesn’t actually begin the execution when you call its run() method—rather, it defers until the SageMaker pipeline as a whole is invoked.
In the run() method, you specify the preprocessing script along with the appropriate ProcessingInput and ProcessingOutput parameters. These specify where the data will be mounted in your custom container from Amazon S3, and where it will be later uploaded in Amazon S3 from your container’s output folder. The output channel is named training, and the final Amazon S3 output location will be s3://<amzn-s3-demo-bucket-pca-detect>/<job-name>/output/<output-name>.

You can also specify an additional parameter in run() named distribution, and it can either be ShardedByS3Key or FullyReplicated, depending on whether you’re splitting and sending your S3 dataset to multiple ScriptProcessor instances or not. You can specify the number of instances in the instance_count parameter of your ScriptProcessor class.
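One common way to express this in the SageMaker Python SDK is on the input itself, through the s3_data_distribution_type argument of ProcessingInput; the following sketch reuses the placeholder bucket from earlier:

# Illustrative: shard the S3 prefix across instances instead of copying it fully to each one
sharded_input = ProcessingInput(
    source="s3://amzn-s3-demo-bucket-pca-detect/processing_input/",
    destination="/opt/ml/processing/input",
    s3_data_distribution_type="ShardedByS3Key",  # the default is "FullyReplicated"
)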
Once instantiated, you can pass the ScriptProcessor class as an argument to the SageMaker processing step along with an appropriate name.
Distributed processing with PySparkProcessor
An alternative to decentralized processing is distributed processing. Distributed processing is particularly effective when you need to process large amounts of log data. Apache Spark is a popular engine for distributed data processing. It uses in-memory caching and optimized query execution for fast analytic queries against datasets of all sizes. SageMaker provides the PySparkProcessor class within the SageMaker Python SDK for running Spark jobs. For an example of performing distributed processing with PySparkProcessor on SageMaker processing, see Distributed Data Processing using Apache Spark and SageMaker Processing. The following are a few key points to note:

To install custom dependencies in your Spark container, you can either build a custom container image (similar to the decentralized processing example) or use the subprocess Python module to install them using pip at runtime. For example, to run the anomaly detection technique on Spark, you need an argformat module, which you can install along with other dependencies as follows:

import subprocess

subprocess.run(["pip3", "install", "scipy", "scikit-learn", "logparser3"])

Spark transformations are powerful operations to process your data, and Spark actions are the operations that actually perform the requested transformations on your data. The collect() method is a Spark action that brings all the data from worker nodes to the main driver node. It’s a good practice to use it in conjunction with filter functions so you don’t run into memory issues when working with large log datasets.
You should also try to partition your input data based on the total number of cores you plan to have in your SageMaker cluster. The official Spark recommendation is to have approximately 2–3 times the number of partitions as the total number of cores in your cluster (see the sketch after this list).
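The following PySpark sketch illustrates both points, assuming the three-instance ml.m5.xlarge cluster configured later in this section (roughly 12 vCPUs in total) and a placeholder S3 prefix for the raw logs:

# Illustrative Spark snippet: partition by available cores and filter before collecting
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-preprocessing").getOrCreate()

# Placeholder input path; one DataFrame row per raw log line
logs = spark.read.text("s3://amzn-s3-demo-bucket-pca-detect/processing_input/")
logs = logs.repartition(24)                          # ~2x the total core count of the cluster
errors = logs.filter(logs.value.contains("ERROR"))   # narrow the data with a filter first...
sample = errors.limit(100).collect()                 # ...then collect only a small sample to the driver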

When your Spark processing script is ready, you can instantiate the SageMaker PySparkProcessor class for running it by adding the following lines to your driver program:

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
)
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()

spark_processor = PySparkProcessor(
    base_job_name="hdfs-spark-job",
    framework_version="3.1",
    role=role,
    sagemaker_session=pipeline_session,
    instance_count=3,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=6000,
)

spark_processor_run_args = spark_processor.run(
    submit_app="./sagemaker_spark_processing.py",
    spark_event_logs_s3_uri="s3://amzn-s3-demo-bucket-pca-detect/logs/spark_event_logs",
    logs=True,
)

step_processing = ProcessingStep(
    name="SparkPreprocessStep",
    step_args=spark_processor_run_args,
)

The preceding code instantiates a PySparkProcessor instance with three nodes in the SageMaker cluster with Spark v3.1 installed in them. You submit your Spark processing code to it along with the Amazon S3 location where your event logs would be uploaded. These logs can be useful for debugging.
In the run() method invocation, you don’t need to specify your inputs and outputs, which can be the case if these are fixed Amazon S3 destinations already known to your processing code. Otherwise, you can specify them using the ProcessingInput and ProcessingOutput parameters just like in the decentralized example.
Post-instantiation, the PySparkProcessor class is passed to a SageMaker processing step with an appropriate name. Its execution won’t be triggered until the pipeline is created.
Train and tune the model
Now that your processing steps are complete, you can proceed to the model training step. The training algorithm could either be a classical anomaly detection model like Drain-based detection or a neural-network based model like DeepLog. Every model takes in certain hyperparameters that influence how the model is trained. To obtain the best-performing model, the model is usually executed and validated multiple times over a wide range of hyperparameters. This can be a time-consuming manual process and can instead be automated using SageMaker hyperparameter tuning jobs. Tuning jobs perform hyperparameter optimization by running your training script with a specified range of hyperparameter values and obtaining the best model based on the metrics you specify. You can predefine these metrics if you use built-in SageMaker algorithms or define them for your custom training algorithm.
You first need to write your training script for your anomaly detection model. Keep the following in mind:

SageMaker makes artifacts available to your container under the /opt/ml container directory. You should use this when fetching your artifacts. For more details on the SageMaker container structure, see SageMaker AI Toolkits Containers Structure.
For using a tuning job, you need to make sure that your code doesn’t hardcode hyperparameter values but instead reads them from the /opt/ml/input/config/hyperparameters.json file in your container, where SageMaker places it.
When using a custom training script, you also need to add a custom training metric to your script that can be used by the tuning job to find the best model. For this, you should print your desired metrics in your training script using a logger or print function. For example, you could print out custom_metric_value: 91, which indicates that your custom metric’s value is 91. We demonstrate later in this post how SageMaker can be informed about this metric. A minimal training-script sketch follows this list.
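As a minimal sketch (assuming pandas, scikit-learn, and joblib are available in your training container, that the processed training data contains numeric feature columns, and using explained variance as a stand-in for your real evaluation metric), a train.py honoring these conventions could look like this:

# train.py - illustrative sketch only; swap in your real anomaly detection logic and metric
import json
import os

import joblib
import pandas as pd
from sklearn.decomposition import PCA

# SageMaker places hyperparameters and the training channel under /opt/ml
with open("/opt/ml/input/config/hyperparameters.json") as f:
    hyperparameters = json.load(f)  # values arrive as strings
max_components = int(hyperparameters.get("max_components", 5))

train_df = pd.read_csv("/opt/ml/input/data/training/train.csv", header=None)

pca = PCA(n_components=max_components)
pca.fit(train_df.select_dtypes("number"))

# Print the custom metric so the tuning job's regex can pick it up from the logs
custom_metric_value = 100 * pca.explained_variance_ratio_.sum()
print(f"custom_metric_value: {custom_metric_value:.2f}")

# Persist the model where SageMaker expects artifacts
joblib.dump(pca, os.path.join("/opt/ml/model", "model.joblib"))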

When your training script is ready, you can use it inside a SageMaker container. SageMaker provides a wide range of built-in algorithm containers that you can use to run your training code. However, there might be cases when you need to build your own training containers. This could be the case when you need custom libraries installed or if you plan to use a new algorithm not built in by SageMaker. In such a case, you can build your own containers in two ways:

You can extend an existing SageMaker built-in container that’s closest to your requirements and install your training script onto it. For instructions, refer to Step 2: Create and upload the Dockerfile and Python training scripts.
You can build your own new SageMaker container using your custom training script and any of your custom dependencies from scratch. For instructions to build your custom container and make it available to SageMaker, see Train and host Scikit-Learn models in Amazon SageMaker by building a Scikit Docker container.

After you create your training container image, you need to define the hyperparameter ranges for your tuning job. For example, if you’re using a custom adaptation of the PCA algorithm (like in Drain-based detection), you add the following lines to your driver program:

from sagemaker.tuner import (
    IntegerParameter,
)

hyperparameter_ranges = {
    "max_components": IntegerParameter(1, 30, scaling_type="Auto")
}

The preceding code indicates that your hyperparameter max_components is an integer and it ranges from 1–30. The auto scaling type indicates that SageMaker will choose the best scale for hyperparameter changes. For more details on other scaling options, see Hyperparameter scaling types.
Then you can use the following code to fully configure your training and tuning steps in the driver program:

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner
from sagemaker.workflow.steps import TuningStep

estimator = Estimator(
    image_uri=training_image_uri,
    role=role,
    base_job_name="new-training-job",
    sagemaker_session=pipeline_session,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://amzn-s3-demo-bucket-pca-detect/models/",
    metric_definitions=[
        {"Name": "custom_metric", "Regex": "custom_metric_value: ([0-9\.]+)"}
    ],
)

parameter_tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="custom_metric",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        {"Name": "custom_metric", "Regex": "custom_metric_value: ([0-9\.]+)"}
    ],
    max_jobs=30,
    max_parallel_jobs=5,
    strategy="Bayesian",
    objective_type="Maximize",
    early_stopping_type="Auto",
)

hpo_args = parameter_tuner.fit(
    inputs={
        "training": TrainingInput(
            s3_data=step_processing.properties.ProcessingOutputConfig.Outputs["training"].S3Output.S3Uri,
            s3_data_type="S3Prefix",
            distribution="FullyReplicated",
        )
    }
)

step_tuning = TuningStep(
    name="AnomalyDetectionTuning",
    step_args=hpo_args,
)

In the preceding code, a SageMaker Estimator instance is created using your custom training image’s ECR URI. SageMaker Estimators help in training your models and orchestrating their training lifecycles. The Estimator is provided with a suitable role and the PipelineSession is designated as its SageMaker session.
You provide the Estimator with the location where your trained model should be stored and supply it with the custom metric definitions that you created. For the example metric custom_metric_value: 91, the definition includes the metric’s name along with its regex. The regex informs SageMaker how to pick up the metric’s values from training logs in Amazon CloudWatch. The tuning job uses these values to find the best-performing model. You also specify where the output model should be uploaded in the output_path parameter.
You then use this Estimator to instantiate your HyperparameterTuner. Its parameters include the total and maximum parallel number of training jobs, search strategy (for more details on strategies, see Understand the hyperparameter tuning strategies available in Amazon SageMaker AI), and whether you want to use early stopping. Early stopping can be set to Auto so that SageMaker automatically stops model training when it doesn’t see improvements in your custom logged metric.
After the HyperparameterTuner is instantiated, you can call its fit() method. In its input parameter, you specify the output Amazon S3 URI from the processing step as the input location for obtaining training data in your tuning step. This way, you don’t need to specify the Amazon S3 URI yourself and it’s passed between steps implicitly. You can then specify the S3Prefix data type and the distribution mode depending on whether you’re using multiple instances or not.
Once instantiated, the HyperparameterTuner is passed to the tuning step, where it becomes part of your SageMaker pipeline. The training configuration is now complete!
Register the model
You can now choose the best model from the tuning step to create a SageMaker model and publish it to the SageMaker Model Registry. You can use the following driver program code:

from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.workflow.model_step import ModelStep

best_model = Model(
    image_uri=training_image_uri,
    model_data=step_tuning.get_top_model_s3_uri(
        top_k=0,
        s3_bucket="amzn-s3-demo-bucket-pca-detect",
        prefix="models",
    ),
)

pipeline_model = PipelineModel(
    models=[best_model],
    role=role,
    sagemaker_session=pipeline_session,
)

register_model_step_args = pipeline_model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name="PCAAnomalyDetection",
)

step_model_registration = ModelStep(
    name="NewRegistry",
    step_args=register_model_step_args,
)

The code instantiates a SageMaker model using the Amazon S3 URI of the best model obtained from the tuning step. The top_k attribute of the get_top_model_s3_uri() method indicates that you’re interested in only obtaining the best-trained model.
After the model is instantiated, you can use it to create a SageMaker PipelineModel so that your pipeline can work directly with your model. You then call the register() method of PipelineModel to register your model to the SageMaker Model Registry. In the register() call, you specify the name of the new model package group where your model will be registered and specify the content types of its prediction requests and responses.
Finally, a SageMaker ModelStep is invoked with the instantiated PipelineModel to carry out the model registration process.
Create and run a pipeline
You’ve now reached the final step where all your steps will be tied together in a SageMaker pipeline. Add the following code to your driver program to complete your pipeline creation steps:

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="Anomaly-Detection-Pipeline",
    steps=[
        step_processing,
        step_tuning,
        step_model_registration,
    ],
    sagemaker_session=pipeline_session,
)

pipeline.upsert(role_arn=role)
pipeline.start()

This code instantiates the SageMaker Pipeline construct and provides it with all the steps defined until now: processing, tuning, and registering the model. The pipeline is created or updated with upsert() using the provided role, and then invoked with the start() method.
The pipeline invocation could be on-demand using code (using pipeline.start() as shown earlier) or it could be event-driven using Amazon EventBridge rules. For example, you can create an EventBridge rule that triggers when new training data is uploaded to your S3 buckets and specify your SageMaker pipeline as the target for this rule. This makes sure that when new data is uploaded to your training bucket, your SageMaker pipeline is automatically invoked. For more details on SageMaker and EventBridge integration, refer to Schedule Pipeline Runs.
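As a rough sketch of that event-driven setup (assuming EventBridge notifications are enabled on the training bucket, that pipeline_arn holds the ARN of the pipeline created earlier, and that events_role_arn is a hypothetical IAM role allowing EventBridge to start pipeline executions), the rule and target could be created with boto3 along these lines; the exact event pattern depends on how your bucket publishes events:

import json

import boto3

events = boto3.client("events")

# Trigger when objects land in the (placeholder) training bucket
events.put_rule(
    Name="start-anomaly-detection-pipeline",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["amzn-s3-demo-bucket-pca-detect"]}},
    }),
    State="ENABLED",
)

# Point the rule at the SageMaker pipeline; events_role_arn must permit sagemaker:StartPipelineExecution
events.put_targets(
    Rule="start-anomaly-detection-pipeline",
    Targets=[
        {
            "Id": "anomaly-detection-pipeline",
            "Arn": pipeline_arn,       # assumed variable with the pipeline ARN
            "RoleArn": events_role_arn,  # assumed variable with the EventBridge role ARN
        }
    ],
)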
On invocation, your SageMaker pipeline runs your custom processing script in the processing step and uploads the processed data to your specified Amazon S3 destination. It then starts a tuning job with your custom training code and iteratively trains multiple models with your supplied hyperparameters and selects the best model based on your custom provided metric. The following screenshot shows that it selected the best model when tuning was complete:

Finally, the best model is selected and a model package resource is created with it in your model registry. Your customers can use it to deploy your model:

You have now completed all the steps in processing, training, tuning, and registering your custom anomaly detection model automatically with the aid of a SageMaker pipeline that was initiated using your driver program.
Clean up
To avoid incurring future charges, complete the following steps:

Delete the SageMaker notebook instance used for this post.
Delete the model package resource that was created using the best-tuned model.
Delete any Amazon S3 data that was used for this post.

Conclusion
In this post, we demonstrated the building, training, tuning, and registering of an anomaly detection system with custom processing code, custom training code, and custom training metrics. We ran these steps automatically with the aid of a SageMaker pipeline, which was run by invoking a single main driver program. We also discussed the different ways of processing our data, and how it could be done using the various constructs and tools that SageMaker provides in a user-friendly and straightforward manner.
Try this approach for building your own custom anomaly detection model, and share your feedback in the comments.
References
[1] https://ieeexplore.ieee.org/document/8029742
[2] https://dl.acm.org/doi/pdf/10.1145/3133956.3134015

About the Author
Nitesh Sehwani is an SDE with the EC2 Threat Detection team where he’s involved in building large-scale systems that provide security to our customers. In his free time, he reads about art history and enjoys listening to mystery thrillers.

Graph Generative Pre-trained Transformer (G2PT): An Auto-Regressive Model Designed to Learn Graph Structures through Next-Token Prediction

Graph generation is an important task across various fields, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. Despite recent advancements, many graph generative models still rely heavily on adjacency matrix representations. While effective, these methods can be computationally demanding and often lack flexibility. This can make it difficult to efficiently capture the intricate dependencies between nodes and edges, especially for large and sparse graphs. Current approaches, including diffusion-based and auto-regressive models, face challenges in scalability and accuracy, highlighting the need for more refined solutions.

Researchers from Tufts University, Northeastern University, and Cornell University have developed the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model designed to learn graph structures through next-token prediction. Unlike traditional methods, G2PT uses a sequence-based representation of graphs, encoding nodes and edges as sequences of tokens. This approach streamlines the modeling process, making it more efficient and scalable. By leveraging a transformer decoder for token prediction, G2PT generates graphs that maintain structural integrity and flexibility. Additionally, G2PT is adaptable to downstream tasks such as goal-oriented graph generation and graph property prediction, making it a versatile tool for various applications.

Technical Insights and Benefits

G2PT introduces a sequence-based representation that divides graphs into node and edge definitions. Node definitions detail indices and types, while edge definitions outline connections and labels. This approach departs from adjacency matrix representations by focusing solely on existing edges, reducing sparsity and computational complexity. The transformer decoder effectively models these sequences through next-token prediction, offering several advantages:

Efficiency: By addressing only existing edges, G2PT minimizes computational overhead.

Scalability: The architecture is well-suited for handling large, complex graphs.

Adaptability: G2PT can be fine-tuned for a variety of tasks, enhancing its utility across domains such as molecular design and social network analysis.

The researchers also explored fine-tuning methods for tasks like goal-oriented generation and graph property prediction, broadening the model’s applicability.
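To make the sequence-based idea concrete, here is a purely illustrative Python sketch of how a small graph might be flattened into node and edge tokens. The token vocabulary and ordering below are invented for illustration and are not the paper’s exact tokenization:

# Illustrative only: one possible flattening of a graph into node/edge token sequences
nodes = [(0, "C"), (1, "O"), (2, "C")]          # (node index, node type)
edges = [(0, 1, "single"), (1, 2, "single")]    # (source, target, edge label)

tokens = ["<bos>"]
for idx, node_type in nodes:
    tokens += ["<node>", str(idx), node_type]
for src, dst, label in edges:
    tokens += ["<edge>", str(src), str(dst), label]
tokens.append("<eos>")

print(tokens)
# A decoder-only transformer would then be trained with next-token prediction on such sequences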

Experimental Results and Insights

G2PT has demonstrated strong performance across various datasets and tasks. In general graph generation, it matched or exceeded the performance of existing models across seven datasets. In molecular graph generation, G2PT showed high validity and uniqueness scores, reflecting its ability to accurately capture structural details. For example, on the MOSES dataset, the base G2PT model achieved a validity score of 96.4% and a uniqueness score of 100%.

In a goal-oriented generation, G2PT aligned generated graphs with desired properties using fine-tuning techniques like rejection sampling and reinforcement learning. These methods enabled the model to adapt its outputs effectively. Similarly, in predictive tasks, G2PT’s embeddings delivered competitive results across molecular property benchmarks, reinforcing its suitability for both generative and predictive tasks.

Conclusion

The Graph Generative Pre-trained Transformer (G2PT) represents a thoughtful step forward in graph generation. By employing a sequence-based representation and transformer-based modeling, G2PT addresses many limitations of traditional approaches. Its combination of efficiency, scalability, and adaptability makes it a valuable resource for researchers and practitioners. While G2PT shows sensitivity to graph orderings, further exploration of universal and expressive edge-ordering mechanisms could enhance its robustness. G2PT exemplifies how innovative representations and modeling approaches can advance the field of graph generation.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Graph Generative Pre-trained Transformer (G2PT): An Auto-Regressive Model Designed to Learn Graph Structures through Next-Token Prediction appeared first on MarkTechPost.

From Latent Spaces to State-of-the-Art: The Journey of LightningDiT

Latent diffusion models are advanced techniques for generating high-resolution images by compressing visual data into a latent space using visual tokenizers. These tokenizers reduce computational demands while retaining essential details. However, such models suffer from a critical challenge: increasing the dimensions of the token feature increases reconstruction quality but decreases image generation quality. It thus creates an optimization dilemma in which achieving a detailed reconstruction compromises the ability to generate visually appealing images.

Existing methods demand much more computational power, which makes it difficult to achieve both detailed reconstruction and high-quality image generation efficiently. Visual tokenizers like VAEs, VQVAE, and VQGAN compress visual data but struggle with poor codebook utilization and inefficient optimization in larger latent spaces. Continuous VAE diffusion models improve reconstruction but harm generation performance and increase costs. Methods like MAGVIT-v2 and REPA attempt to address these issues but add complexity without resolving the core trade-offs. Diffusion Transformers, widely used for scalability, also face slow training speeds despite enhancements like SiT or MaskDiT. These inefficiencies in tokenizers and latent spaces remain a key barrier to effectively integrating generative and reconstruction tasks.

To address optimization challenges in latent diffusion models, researchers from Huazhong University of Science and Technology proposed the VA-VAE method, which integrates a Vision Foundation model alignment loss (VF Loss) to enhance the training of high-dimensional visual tokenizers. This framework regularizes the latent space with element and pair-wise similarities, making it more aligned with the Vision Foundation model. VF Loss includes marginal cosine similarity loss and marginal distance matrix similarity loss, further improving alignment without limiting the latent space’s capacity. As a result, the framework enhances reconstruction and generation performance by addressing the intensity concentration in latent space distributions.

The researchers integrated the VF loss into the latent diffusion system, together with LightningDiT, to improve reconstruction and generation performance while optimizing convergence and scalability. The VF loss, particularly with foundation models like DINOv2, accelerated convergence, with a speedup of up to 2.7x in training time. Experiments with different configurations, such as tokenizers with and without VF loss, showed that VF loss notably improved performance, especially for high-dimensional tokenizers, and bridged the gap between generative and reconstruction performance. The VF loss also improved scalability, optimizing models ranging from 0.1B to 1.6B parameters so that high-dimensional tokenizers kept strong scalability without significant performance loss. The results showed the method’s effectiveness in improving generative performance and convergence speed and in minimizing dependence on classifier-free guidance (CFG).

In conclusion, the proposed framework VA-VAE and LightningDiT address the optimization challenges in latent diffusion systems. VA-VAE aligns the latent space with vision models, improving convergence and uniformity, while LightningDiT accelerates training. The approach achieves FID on ImageNet with a 21.8× speedup. This work offers a foundation for future research, enabling further optimization and scalability improvements in generative models with reduced training costs.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post From Latent Spaces to State-of-the-Art: The Journey of LightningDiT appeared first on MarkTechPost.

ScreenSpot-Pro: The First Benchmark Driving Multi-Modal LLMs into High-Resolution Professional GUI-Agent and Computer-Use Environments

GUI agents face three critical challenges in professional environments: (1) the greater complexity of professional applications compared to general-use software, requiring detailed comprehension of intricate layouts; (2) the higher resolution of professional tools, resulting in smaller target sizes and reduced grounding accuracy; and (3) the reliance on additional tools and documents, adding complexity to workflows. These challenges highlight the need for advanced benchmarks and solutions to enhance GUI agent performance in these demanding scenarios.

Current GUI grounding models and benchmarks are insufficient to fulfill professional environment requirements. Tools like ScreenSpot are designed for low-resolution tasks and lack the variety to simulate real-world scenarios accurately. Models such as OS-Atlas and UGround are computationally inefficient and fail when the targets are small or the interface is icon-rich, which is common in professional applications. In addition, the absence of multilingual support reduces their applicability in global workflows. These shortcomings highlight the need for more comprehensive and realistic benchmarks to further this field.

A team of researchers from the National University of Singapore, East China Normal University, and Hong Kong Baptist University introduce ScreenSpot-Pro: a new framework that is tailored to professional high-resolution environments. This benchmark has a dataset of 1,581 tasks across 23 applications in industries such as development, creative tools, CAD, scientific platforms, and office suites. It incorporates high-resolution, full-screen visuals and expert annotations that ensure accuracy and realism. Multilingual guidelines encompass both English and Chinese for an expanded range of evaluation. ScreenSpot-Pro is unique as it documents the actual workflows that result in real, high-quality annotations, therefore serving as a tool for the full assessment and development of GUI grounding models.

The dataset ScreenSpot-Pro captures realistic and challenging scenarios. The base of this dataset is formed by high-resolution images, where the target regions form an average of only 0.07% of the total screen, thus pointing to subtle and small GUI elements. Data was collected by professional users with experience in relevant applications, who used specialized tools to ensure accurate annotations. Additionally, the dataset supports multilingual capabilities to test bilingual functionality and contains several workflows to capture the subtleties of real professional tasks. These characteristics render it particularly advantageous for the assessment and enhancement of the accuracy and flexibility of GUI agents.

The analysis of current GUI grounding models utilizing ScreenSpot-Pro reveals considerable deficiencies in their capacity to manage high-resolution professional settings. OS-Atlas-7B attained the greatest accuracy rate of 18.9%. However, iterative methodologies, exemplified by ReGround, demonstrated the capacity to enhance performance, reaching an accuracy of 40.2% by fine-tuning predictions through a multi-step methodology. Minor components, such as icons, presented significant difficulties, whereas bilingual assignments further highlighted the limitations of the models. These findings emphasize the necessity for improved techniques that bolster contextual comprehension and resilience in intricate GUI situations.

ScreenSpot-Pro sets a transformative benchmark for the evaluation of GUI agents in professional high-resolution environments. It addresses the specific challenges in complex workflows, offering a diverse and precise dataset to guide innovations in GUI grounding. This contribution forms the foundation of much smarter and more efficient agents that support a seamless performance of professional tasks, significantly boosting productivity and innovation in all industry fields.

Check out the Paper and Data. All credit for this research goes to the researchers of this project.

The post ScreenSpot-Pro: The First Benchmark Driving Multi-Modal LLMs into High-Resolution Professional GUI-Agent and Computer-Use Environments appeared first on MarkTechPost.

PRIME: An Open-Source Solution for Online Reinforcement Learning with Process Rewards to Advance Reasoning Abilities of Language Models Beyond Imitation or Distillation

Large Language Models (LLMs) face significant scalability limitations in improving their reasoning capabilities through data-driven imitation, as better performance demands exponentially more high-quality training examples. Exploration-based methods, particularly reinforcement learning (RL), offer a promising alternative to overcome these limitations. The transformation from data-driven to exploration-based approaches presents two key challenges: developing efficient methods to generate precise reward signals and designing effective RL algorithms that maximize the utility of these signals. This shift represents a crucial step toward enhancing LLM reasoning capabilities.

A team of researchers introduces PRIME (Process Reinforcement through IMplicit Rewards), a novel approach to enhance language model reasoning through online RL with process rewards. The system employs implicit process reward modeling (PRM), which functions without requiring process labels and operates as an outcome reward model. This approach enables the development of Eurus-2-7B-PRIME, a powerful reasoning model that demonstrates significant improvements through both online RL training and inference-time scaling. The innovation of implicit PRM lies in its dual capability to enhance performance and facilitate effective RL training.

The research team selected Qwen2.5-Math-7B-Base as their foundation model and evaluated performance using high-level mathematics and programming benchmarks. The initial phase involves supervised fine-tuning (SFT) using an action-centric chain-of-thought framework where models choose from seven predefined actions. The team constructed a 230K dataset from various open-source materials, deliberately excluding high-quality datasets with ground-truth answers to reserve them for RL. Despite these efforts, the SFT model’s performance fell short of Qwen2.5-Math-7B-Instruct across mathematics benchmarks.

The research utilizes a comprehensive approach to dataset curation for RL, combining 457K math problems from NuminaMath-CoT and 27K coding problems from various sources, including APPS, CodeContests, TACO, and Codeforces. The team implements an innovative online prompt filtering strategy that dynamically selects prompts based on difficulty levels. By sampling multiple trajectories and maintaining prompts with accuracy scores between 0.2 and 0.8, they effectively balanced the training data distribution, eliminating both overly simple and excessively challenging problems. Here is the flow of the algorithm for the proposed method PRIME:

PRIME’s implementation follows a systematic process where the policy model and the PRM are both initialized from the SFT model. The algorithm operates through sequential steps of generating rollouts, scoring them, and updating both models using combined outcome and process rewards. With PRIME, starting from Qwen2.5-Math-7B-Base, the trained model Eurus-2-7B-PRIME achieves 26.7% pass@1, surpassing GPT-4o and Qwen2.5-Math-7B-Instruct, while using only one-tenth of the data of Qwen Math (230K SFT + 150K RL). Moreover, PRIME achieves significant improvements over sparse-reward approaches with specific hyperparameters: 2.5 times faster training and 6.9% higher final rewards. Notably, Eurus-2-7B-PRIME demonstrated a 16.7% average improvement across benchmarks, with over 20% enhancement in AMC and AIME competitions.

Lastly, the validation process for PRIME utilizes advanced mathematical reasoning models (QwQ-32B-Preview and Qwen2.5-Math-72B-Instruct) to evaluate problem solvability and solution correctness. Using insights from analyzing sample problems and synthetic unsolvable cases, the team developed specialized prompts to enhance validation accuracy. Each problem undergoes five comprehensive validation attempts, containing step-by-step LaTeX solutions, unsolvability checks, reasoning traces, standardized answer formatting, and documentation of solution impediments. This rigorous validation framework ensures the quality and reliability of the question-answer pairs.

Check out the Hugging Face Page, Technical Details, and GitHub Page. All credit for this research goes to the researchers of this project.

The post PRIME: An Open-Source Solution for Online Reinforcement Learning with Process Rewards to Advance Reasoning Abilities of Language Models Beyond Imitation or Distillation appeared first on MarkTechPost.

FutureHouse Researchers Propose Aviary: An Extensible Open-Source Gymnasium for Language Agents

Artificial intelligence (AI) has made significant strides in developing language models capable of solving complex problems. However, applying these models to real-world scientific challenges remains difficult. Many AI agents struggle with tasks requiring multiple cycles of observation, reasoning, and action. Moreover, existing models often lack the ability to integrate tools effectively or maintain consistency in multi-step reasoning. These issues are particularly pressing in scientific domains, where tasks demand precision, adaptability, and computational efficiency. Addressing these problems requires a flexible and practical framework for training and deploying language agents.

Introducing Aviary: An Extensible Open-Source Gymnasium

A team of researchers from FutureHouse Inc., the University of Rochester, and the Francis Crick Institute has introduced Aviary, an open-source gymnasium for language agents. Aviary addresses the limitations of existing frameworks by introducing language decision processes (LDPs), which model tasks as partially observable Markov decision processes grounded in natural language. This approach enables language agents to effectively handle complex, multi-step reasoning tasks.

Aviary includes five environments, three of which are designed for advanced scientific tasks:

Molecular Cloning: Manipulating DNA constructs using tools for sequence annotation and protocol planning.

Scientific Literature QA: Retrieving and analyzing scientific literature to answer detailed research questions.

Protein Stability Engineering: Proposing protein mutations to improve stability with the help of computational and biochemical tools.

These tasks make Aviary a valuable platform for training and evaluating language agents in real-world scenarios requiring reasoning, tool integration, and iterative learning.

Technical Insights and Benefits of Aviary

Aviary uses a stochastic computation graph framework to model language agents, enabling flexible and efficient optimization. Key features include:

Expert Iteration (EI): A training method that iteratively refines agents using high-quality trajectories.

Majority Voting: A technique to improve accuracy by combining multiple inference outputs without excessive computational overhead (a toy example follows this list).

Tool Integration: Built-in support for tools like sequence annotators and literature retrieval systems, enhancing real-world applicability.
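As a generic illustration of the majority-voting idea (not Aviary’s exact implementation), combining several sampled answers can be as simple as the following sketch, where the sampled answers are hypothetical:

from collections import Counter

# Hypothetical answers sampled from k independent agent rollouts on the same question
sampled_answers = ["AUG", "AUG", "GTA", "AUG", "TAC"]

# Majority voting: keep the most frequent answer across rollouts
majority_answer, votes = Counter(sampled_answers).most_common(1)[0]
print(majority_answer, votes)  # -> AUG 3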

The researchers show that non-frontier, open-source models like Llama-3.1-8B-Instruct can achieve performance comparable to or better than frontier models (e.g., Claude 3.5 Sonnet) in these environments. Additionally, these models operate at significantly lower inference costs, making them accessible for large-scale scientific applications.

Results and Insights

Aviary-trained agents demonstrate impressive performance:

On molecular cloning tasks, the Llama-3.1-8B-Instruct agent showed notable accuracy improvements through EI and behavior cloning, outperforming human experts on SeqQA benchmarks.

In scientific literature QA tasks, the same model achieved performance levels on par with or better than humans, while maintaining efficiency.

Majority voting further enhanced accuracy, with SeqQA results reaching 89% after sampling multiple trajectories, surpassing human and frontier model benchmarks.

Conclusion

Aviary represents a thoughtful advancement in the development of language AI agents. By demonstrating that open-source, non-frontier models can excel in scientific tasks, Aviary opens new possibilities for accessible and cost-effective AI research. Its open-source design encourages collaboration, enabling researchers and developers to refine and extend its applications further.

With tools and training methods tailored for real-world challenges, Aviary sets a benchmark for how language agents can address complex tasks. It provides a compelling framework for advancing AI-driven scientific exploration and practical problem-solving.

Check out the Paper, Technical Details, and GitHub Page. All credit for this research goes to the researchers of this project.

The post FutureHouse Researchers Propose Aviary: An Extensible Open-Source Gymnasium for Language Agents appeared first on MarkTechPost.

This AI Paper Introduces SWE-Gym: A Comprehensive Training Environment for Real-World Software Engineering Agents

Software engineering agents have become essential for managing complex coding tasks, particularly in large repositories. These agents employ advanced language models to interpret natural language descriptions, analyze codebases, and implement modifications. Their applications include debugging, feature development, and optimization. The effectiveness of these systems relies on their ability to handle real-world challenges, such as interacting with extensive repositories and executing tests to validate solutions, making the development of such agents both exciting and challenging.

Lack of comprehensive training environments is one of the primary challenges in this domain. Many existing datasets and benchmarks, such as SWE-Bench and R2E, either focus on isolated problems or rely on synthetic instructions that do not represent the complexities of real-world coding tasks. For instance, while SWE-Bench offers test cases for validation, its training dataset lacks executable environments and dependency configurations. This discrepancy limits the utility of existing benchmarks for training agents capable of addressing the nuanced challenges of software engineering.

Efforts to address these challenges have previously relied on tools like HumanEval and APPS, which support isolated task evaluation but fail to integrate repository-level complexities. These tools often lack a coherent link between natural language problem descriptions, executable codebases, and comprehensive testing frameworks. As a result, there is a pressing need for a platform that can bridge these gaps by offering real-world tasks within functional and executable environments.

Researchers from UC Berkeley, UIUC, CMU, and Apple have developed SWE-Gym, a novel environment tailored for training software engineering agents. SWE-Gym integrates 2,438 Python tasks sourced from GitHub issues across 11 repositories, offering pre-configured executable environments and expert-validated test cases. This platform introduces a groundbreaking approach by combining real-world task complexity with automated testing mechanisms, creating a more effective training ecosystem for language models.

SWE-Gym’s methodology focuses on replicating real-world coding conditions. The tasks are derived from GitHub issues and paired with the corresponding repository snapshots and unit tests. Dependencies for each task are meticulously configured, ensuring the accuracy of the executable environment. These configurations were semi-manually validated through rigorous processes involving around 200 human annotation hours and 10,000 CPU core hours, resulting in a robust training dataset. The researchers also introduced a subset of 230 tasks, SWE-Gym Lite, which targets simpler and self-contained problems, enabling rapid prototyping and evaluation.

The performance evaluation of SWE-Gym demonstrated its significant impact on training software engineering agents. Using the Qwen-2.5 Coder model, fine-tuned agents achieved marked improvements in resolving tasks on SWE-Bench benchmarks. Specifically, resolve rates increased from 20.6% to 32.0% on SWE-Bench Verified and from 15.3% to 26.0% on SWE-Bench Lite. These gains represent a significant leap over previous benchmarks for open-weight language models. Furthermore, SWE-Gym-supported agents reduced failure rates in stuck-in-loop scenarios by 18.6% and improved task completion rates in real-world settings.

The researchers also explored inference-time scaling by employing a verifier trained on agent trajectories sampled from SWE-Gym. This approach allowed agents to generate multiple solution trajectories for a given problem, selecting the most promising one using a reward model. The verifier achieved a Best@K score of 32.0% on SWE-Bench Verified, demonstrating the environment’s capacity for improving agent performance through scalable compute strategies. These results emphasize the potential of SWE-Gym to enhance both the development and evaluation of software engineering agents.

SWE-Gym is a pivotal tool in advancing research on software engineering agents. Addressing the limitations of prior benchmarks and offering a scalable, realistic environment equips researchers with the resources needed to develop robust models capable of solving complex software challenges. With its open-source release, SWE-Gym paves the way for significant advancements in the field, setting new standards for the training and evaluation of software engineering agents.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

The post This AI Paper Introduces SWE-Gym: A Comprehensive Training Environment for Real-World Software Engineering Agents appeared first on MarkTechPost.