Zyphra’s release of Zamba2-2.7B marks a pivotal moment in the development of small language models, demonstrating a significant advancement in efficiency and performance. The model was trained on approximately 3 trillion tokens drawn from Zyphra’s proprietary datasets, which allows it to match the performance of larger models such as Zamba1-7B and other leading 7B models. This feat is achieved while notably reducing the resource requirements for inference, making it a highly efficient solution for on-device applications.
The model achieves a twofold improvement in time-to-first-token, a critical metric for applications requiring real-time interaction, meaning Zamba2-2.7B can generate initial responses twice as fast as comparable models. This is crucial for virtual assistants, chatbots, and other responsive AI systems where quick response times are essential.
In addition to its speed, Zamba2-2.7B is designed to use memory more efficiently. It reduces memory overhead by 27%, making it a suitable option for deployment on devices with limited memory resources. This smarter memory usage ensures the model can operate effectively even in environments with constrained computational resources, broadening its applicability across various devices and platforms.
Another key advantage of Zamba2-2.7B is its lower generation latency. The model delivers 1.29 times lower latency compared to Phi3-3.8B, which enhances the smoothness and continuity of interactions. Lower latency is particularly important in applications that require seamless and uninterrupted communication, such as customer service bots and interactive educational tools. Maintaining high performance with reduced latency positions Zamba2-2.7B as a leading choice for developers looking to enhance user experience in their AI-driven applications.
Benchmark comparisons underscore the superior performance of Zamba2-2.7B. When benchmarked against other models of similar scale, including Gemma2-2.7B, StableLM-3B, and Phi2-2.7B, Zamba2-2.7B consistently outperforms its peers. This superior performance is a testament to Zyphra’s innovative approach and dedication to advancing AI technology. The company’s commitment to pushing what small language models can achieve is evident in the impressive capabilities of Zamba2-2.7B.
The model utilizes an improved interleaved shared attention scheme with LoRA projectors on shared MLP blocks. This advanced architecture allows the model to handle complex tasks more efficiently, ensuring high-quality outputs with minimal delays. The upgrade from Mamba1 blocks to Mamba2 blocks further enhances the model’s performance, providing a solid foundation for its advanced capabilities. These innovations contribute to the model’s ability to deliver faster, smarter, and more efficient AI solutions.
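To make the idea of a shared block specialized by LoRA projectors more concrete, here is a minimal, hypothetical PyTorch sketch (not Zyphra’s actual implementation): one shared MLP block is reused at several points in the network, and a small per-reuse LoRA adapter lets each reuse behave slightly differently at very low parameter cost.

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Low-rank adapter: adds up(down(x)) on top of a shared computation."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # project dim -> rank
        self.up = nn.Linear(rank, dim, bias=False)     # project rank -> dim
        nn.init.zeros_(self.up.weight)                 # start as a no-op

    def forward(self, x):
        return self.up(self.down(x))

class SharedMLPWithLoRA(nn.Module):
    """One MLP whose weights are shared across layers; each reuse has its own LoRA."""
    def __init__(self, dim: int, hidden: int, num_reuses: int, rank: int = 8):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        self.adapters = nn.ModuleList([LoRAAdapter(dim, rank) for _ in range(num_reuses)])

    def forward(self, x, reuse_idx: int):
        # The shared weights provide the bulk of the capacity; the tiny per-reuse
        # LoRA adapter specializes each interleaved block cheaply.
        return self.shared_mlp(x + self.adapters[reuse_idx](x))

block = SharedMLPWithLoRA(dim=256, hidden=1024, num_reuses=4)
h = torch.randn(2, 16, 256)           # (batch, sequence, dim)
out = block(h, reuse_idx=0)
print(out.shape)                      # torch.Size([2, 16, 256])
```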
Zyphra’s release of Zamba2-2.7B signifies a major milestone in the evolution of small language models. Combining high performance with reduced latency and efficient memory usage, Zamba2-2.7B sets a new standard for on-device AI applications. The model meets and exceeds the expectations for small language models, offering a robust solution for developers and businesses looking to integrate sophisticated AI capabilities into their products.
In conclusion, Zyphra’s launch of Zamba2-2.7B marks a new era in AI technology where efficiency and performance are seamlessly integrated. This model’s ability to deliver faster, smarter, and more efficient AI solutions makes it a valuable asset for a wide range of on-device applications, paving the way for more advanced and responsive AI-driven experiences.
The post Zamba2-2.7B Released: A State-of-the-Art Small Language Model Achieving Twice the Speed and 27% Reduced Memory Overhead appeared first on MarkTechPost.
Prior work on abstention in large language models (LLMs) has made significant strides in query processing, answerability assessment, and handling misaligned queries. Researchers have explored methods to predict question ambiguity, detect malicious queries, and develop frameworks for query alteration. The BDDR framework and self-adversarial training pipelines have been introduced to analyze query changes and classify attacks. Evaluation benchmarks like SituatedQA and AmbigQA have been crucial in assessing LLM performance with unanswerable or ambiguous questions. These contributions have established a foundation for implementing effective abstention strategies in LLMs, enhancing their ability to handle uncertain or potentially harmful queries.
The University of Washington and Allen Institute for AI researchers have surveyed abstention in large language models, highlighting its potential to reduce hallucinations and enhance AI safety. They present a framework analyzing abstention from the query, model, and human value perspectives. The study reviews existing abstention methods, categorizes them by LLM development stages, and assesses various benchmarks and metrics. The authors identify future research areas, including exploring abstention as a meta-capability across tasks and customizing abstention abilities based on context. This comprehensive review aims to expand the impact and applicability of abstention methodologies in AI systems, ultimately improving their reliability and safety.
This paper explores the capabilities and challenges of large language models in natural language processing. While LLMs excel in tasks like question answering and summarization, they can produce problematic outputs such as hallucinations and harmful content. The authors propose incorporating abstention mechanisms to mitigate these issues, allowing LLMs to refuse answers when uncertain. They introduce a framework evaluating query answerability and alignment with human values, aiming to expand abstention strategies beyond current calibration techniques. The survey encourages new abstention methods across diverse tasks, enhancing AI interaction robustness and trustworthiness. It contributes an analysis framework, reviewing existing methods and discussing underexplored abstention aspects.
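As a concrete illustration of the simplest calibration-style abstention that the survey builds on, the hedged sketch below (not taken from the paper) samples a model several times and refuses to answer when the responses disagree too much; `generate` is a hypothetical stand-in for any text-generation call.

```python
from collections import Counter
from typing import Callable, List

def abstaining_answer(
    generate: Callable[[str], str],   # hypothetical LLM call: prompt -> answer
    query: str,
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> str:
    """Self-consistency style abstention: answer only if sampled responses agree."""
    samples: List[str] = [generate(query).strip().lower() for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    if count / n_samples < min_agreement:
        # The model is uncertain (or the query may be unanswerable): abstain.
        return "I'm not confident enough to answer that."
    return answer
```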
The paper’s methodology focuses on classifying and examining abstention strategies in large language models. It categorizes methods based on their application during pre-training, alignment, and inference stages. A novel framework evaluates queries from the query, model capability, and human value alignment perspectives. The study explores input-processing approaches to determine abstention, including ambiguity prediction and value misalignment detection. It incorporates calibration techniques while acknowledging their limitations. The methodology also outlines future research directions, such as privacy-enhanced designs and generalizing abstention beyond LLMs. The authors review existing benchmarks and evaluation metrics, identifying gaps to inform future research and improve abstention strategies’ effectiveness in enhancing LLM reliability and safety.
The study’s findings highlight the critical role of judicious abstention in bolstering the dependability and security of large language models. It introduces a framework considering abstention from query, model, and human value perspectives, providing a comprehensive overview of current strategies. The study identifies gaps in existing methodologies, including limitations in evaluation metrics and benchmarks. Future research directions proposed include enhancing privacy protections, generalizing abstention beyond LLMs, and improving multilingual abstention. The authors encourage studying abstention as a meta-capability across tasks and advocate for more generalizable evaluation and customization of abstention capabilities. These findings underscore abstention’s significance in LLMs and outline a roadmap for future research to improve abstention strategies’ effectiveness and applicability in AI systems.
The paper concludes by highlighting several key aspects of abstention in large language models. It identifies under-explored research directions and advocates studying abstention as a meta-capability across various tasks. The authors emphasize the potential of abstention-aware designs to enhance privacy and copyright protections. They suggest generalizing abstention beyond LLMs to other AI domains and stress the need for improved multilingual abstention capabilities. The survey underscores strategic abstention’s importance in enhancing LLM reliability and safety, emphasizing the need for more adaptive and context-aware mechanisms. Overall, the paper outlines a roadmap for future research to improve abstention strategies’ effectiveness and ethical considerations in AI systems.
The post This AI Paper Presents a Survey of the Current Methods Used to Achieve Refusal in LLMs: Provide Evaluation Benchmarks and Metrics Used to Measure Abstention in LLMs appeared first on MarkTechPost.
Large language models (LLMs) have emerged as powerful tools in artificial intelligence, demonstrating remarkable capabilities in understanding and generating text. These models utilize advanced technologies such as web-scale unsupervised pretraining, instruction fine-tuning, and value alignment, showcasing strong performance across various tasks. However, the application of LLMs to real-world big data presents significant challenges, primarily due to the enormous costs involved. By 2025, the total cost of LLMs is projected to reach nearly $5,000 trillion, far exceeding the GDP of major economies. This financial burden is particularly pronounced in processing text and structured data, which account for a substantial portion of the expenses despite being smaller in volume compared to multimedia data. As a result, there has been a growing focus on Relational Table Learning (RTL) in recent years, given that relational databases host approximately 73% of the world’s data.
Researchers from Shanghai Jiao Tong University and Tsinghua University present the rLLM (relationLLM) project, which addresses the challenges in RTL by providing a platform for rapid development of RTL-type methods using LLMs. This approach focuses on two key functions: decomposing state-of-the-art Graph Neural Networks (GNNs), LLMs, and Table Neural Networks (TNNs) into standardized modules, and enabling the construction of robust models through a “combine, align, and co-train” methodology. To demonstrate the application of rLLM, a simple RTL method called BRIDGE is introduced. BRIDGE processes table data using TNNs and utilizes “foreign keys” in relational tables to establish relationships between table samples, which are then analyzed using GNNs. This method considers multiple tables and their interconnections, providing a comprehensive approach to relational data analysis. Additionally, to address the scarcity of datasets in the emerging field of RTL, the project introduces a robust data collection named SJTUTables, comprising three relational table datasets: TML1M, TLF2K, and TACM12K.
The rLLM project introduces a comprehensive architecture consisting of three main layers: the Data Engine Layer, the Module Layer, and the Model Layer. This structure is designed to facilitate efficient processing and analysis of relational table data.
The Data Engine Layer forms the foundation, focusing on fundamental data structures for graph and table data. It decouples data loading and storage through Dataset subclasses and BaseGraph/BaseTable subclasses, respectively. This design allows for flexible handling of various graph and table data types, optimizing storage and processing for both homogeneous and heterogeneous graphs, as well as table data.
The Module Layer decomposes operations of GNNs, LLMs, and TNNs into standard submodules. For GNNs, it includes GraphTransform for preprocessing and GraphConv for implementing graph convolution layers. LLM modules comprise a Predictor for data annotation and an Enhancer for data augmentation. TNN modules feature TableTransform for mapping features to higher-dimensional spaces and TableConv for multi-layer interactive learning among feature columns.
BRIDGE demonstrates rLLM’s application in RTL-type methods. It addresses relational database complexity by processing both table and non-table features. A Table Encoder, using TableTransform and TableConv modules, handles heterogeneous table data to produce table embeddings. A Graph Encoder, employing GraphTransform and GraphConv modules, models foreign key relationships and generates graph embeddings. BRIDGE integrates outputs from both encoders, enabling simultaneous modeling of multi-table data and their interconnections. The framework supports both supervised and unsupervised training approaches, adapting to various data scenarios and learning objectives.
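The following is a minimal, hypothetical sketch of the BRIDGE idea described above (it is not the rLLM library’s actual API): a table encoder embeds rows of the target table, a graph encoder propagates information along foreign-key links, and the two views are combined for row-level prediction.

```python
import torch
import torch.nn as nn

class TableEncoder(nn.Module):
    """Stands in for TableTransform + TableConv: maps raw row features to embeddings."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim))

    def forward(self, x):
        return self.mlp(x)

class GraphEncoder(nn.Module):
    """Stands in for GraphTransform + GraphConv: aggregates neighbors over foreign keys."""
    def __init__(self, hid_dim):
        super().__init__()
        self.lin = nn.Linear(hid_dim, hid_dim)

    def forward(self, h, adj):
        # adj: dense (n, n) adjacency built from foreign-key relations (row-normalized).
        return torch.relu(self.lin(adj @ h))

class Bridge(nn.Module):
    def __init__(self, in_dim, hid_dim, n_classes):
        super().__init__()
        self.table_enc = TableEncoder(in_dim, hid_dim)
        self.graph_enc = GraphEncoder(hid_dim)
        self.head = nn.Linear(hid_dim, n_classes)

    def forward(self, x, adj):
        h = self.table_enc(x)          # per-row table embeddings
        g = self.graph_enc(h, adj)     # relation-aware embeddings via foreign keys
        return self.head(h + g)        # combine both views for prediction

n, d = 100, 16
x = torch.randn(n, d)
adj = torch.eye(n)                     # placeholder adjacency; a real one comes from foreign keys
logits = Bridge(d, 64, 3)(x, adj)
print(logits.shape)                    # torch.Size([100, 3])
```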
Experimental results reveal the limitations of traditional single-tabular TNNs in processing relational table data. These TNNs, confined to learning from a single target table, fail to utilize the rich information available in multiple tables and their interconnections, resulting in suboptimal performance. In contrast, the BRIDGE algorithm demonstrates superior capabilities by effectively combining a table encoder with a graph encoder. This integrated approach enables BRIDGE to extract valuable insights from both individual tables and their relationships. Consequently, BRIDGE achieves a significant performance improvement over conventional methods, highlighting the importance of considering the relational structure of data in table learning tasks.
The rLLM framework introduces a robust approach to relational table learning using Large Language Models. It integrates advanced methods and optimizes data structures for improved efficiency. The project invites collaboration from researchers and software engineers to expand its capabilities and applications in the field of relational data analysis.
The post rLLM (relationLLM): A PyTorch Library Designed for Relational Table Learning (RTL) with Large Language Models (LLMs) appeared first on MarkTechPost.
Amazon Q Business is a fully managed, permission-aware generative artificial intelligence (AI)-powered assistant built with enterprise-grade security and privacy features. Amazon Q Business can be configured to answer questions, provide summaries, generate content, and securely complete tasks based on your enterprise data. The native data source connectors provided by Amazon Q Business can seamlessly integrate and index content from multiple repositories into a unified index.

Amazon Q Business uses AWS IAM Identity Center to record the workforce users you assign access to and their attributes, such as group associations. IAM Identity Center is used by many AWS managed applications such as Amazon Q. You connect your existing source of identities to Identity Center once and can then assign users to any of these AWS services. Because Identity Center serves as the common reference for your users and groups, these AWS applications can give your users a consistent experience as they navigate AWS. For example, it enables user subscription management across Amazon Q offerings and consolidates Amazon Q billing from across multiple AWS accounts. Additionally, Q Business conversation APIs add a layer of privacy protection by using trusted identity propagation enabled by IAM Identity Center.

Amazon Q Business comes with rich API support to perform administrative tasks or to build an AI assistant with a customized user experience for your enterprise. With administrative APIs you can automate creating Q Business applications, set up data source connectors, build custom document enrichment, and configure guardrails. With conversation APIs, you can chat with and manage conversations with the Q Business AI assistant. Trusted identity propagation provides authorization based on user context, which enhances the privacy controls of Amazon Q Business.

In this blog post, you will learn what trusted identity propagation is and why to use it, how to automate the configuration of a trusted token issuer in AWS IAM Identity Center with the provided AWS CloudFormation templates, and which APIs to invoke from your application to call the Amazon Q Business identity-aware conversation APIs.

Why use trusted identity propagation?

Trusted identity propagation provides a mechanism that enables applications that authenticate outside of AWS to make requests on behalf of their users with the use of a trusted token issuer. Consider a client-server application that uses an external identity provider (IdP) to authenticate a user and provide access to an AWS resource that’s private to the user. For example, your web application might use Okta as an external IdP to authenticate a user to view their private conversations from Q Business. In this scenario, Q Business is unable to use the identity token generated by the third-party provider to grant direct access to the user’s private data, because there is no mechanism to trust an identity token issued by a third party.

To solve this, you can use IAM Identity Center to bring the user identity from your external IdP into an AWS Identity and Access Management (IAM) role session, which allows you to authorize requests based on the human, their attributes, and their group memberships, rather than setting up fine-grained permissions in an IAM policy. You exchange the token issued by the external IdP for a token generated by Identity Center. The token generated by Identity Center refers to the corresponding Identity Center user.
The web application can now use the new token to initiate a request to Q Business for the private chat conversation. Because that token refers to the corresponding user in Identity Center, Q Business can authorize the requested access to the private conversation based on the user or their group membership as represented in Identity Center. Some of the benefits of using trusted identity propagation are:
- Prevents user impersonation and protects against unauthorized access to users’ private data through identity spoofing.
- Facilitates auditability and fosters responsible use of resources, because Q Business automatically logs API invocations to AWS CloudTrail along with the user identifier.
- Promotes software design principles rooted in user privacy.
Overview of trusted identity propagation deployment

The following figure is a model of a client-server architecture for trusted identity propagation.
To understand how your application can be integrated with IAM Identity Center for trusted identity propagation, consider the model client-server web application shown in the preceding figure. In this model architecture, the web browser represents the user interface to your application. This could be a web page rendered in a browser, Slack, Microsoft Teams, or another application. The application server might be a web server running on Amazon Elastic Container Service (Amazon ECS), or a Slack or Microsoft Teams gateway implemented with AWS Lambda. IAM Identity Center itself might be deployed in a delegated admin account (the Identity Account in the preceding figure), or in the same AWS account where the application server is deployed along with Amazon Q Business (the Application Account in the preceding figure). Finally, you have an OAuth 2.0 OpenID Connect (OIDC) external IdP such as Okta, Ping One, Microsoft Entra ID, or Amazon Cognito for authentication and authorization.

Deployment of trusted identity propagation involves five steps. As a best practice, we recommend that the security owner manages IAM Identity Center updates and the application owner manages application updates, providing clear separation of duties. The security owner is responsible for administering the Identity Center of an organization or account. The application owner is responsible for creating an application on AWS.
1. The security owner adds the external OIDC IdP’s issuer URL as a trusted token issuer on the IAM Identity Center instance. It’s important that the issuer URL matches the iss claim present in the JSON Web Token (JWT) identity token generated by the IdP after user authentication. This is configured once for a given issuer URL.
2. The security owner creates a customer managed identity provider application in IAM Identity Center and explicitly configures the specific audience that is authorized to perform token exchange with Identity Center for the given trusted token issuer. Because there could be more than one application (or audience) for which the external IdP authenticates users, explicitly specifying an audience helps prevent unauthorized applications from using the token exchange process. It’s important that the audience ID matches the aud claim present in the JWT identity token generated by the IdP after user authentication.
3. The security owner edits the application policy for the customer managed identity provider application created in the previous step to add or update the IAM execution role used by the application server or AWS Lambda. This helps prevent unapproved users or applications from invoking the CreateTokenWithIAM API in Identity Center to initiate the token exchange.
4. The application owner creates and attaches an IAM policy to the application execution role that allows the application to invoke the CreateTokenWithIAM API on Identity Center to perform a token exchange and to create temporary credentials using AWS Security Token Service (AWS STS).
5. The application owner creates an IAM role with a policy allowing access to the Q Business conversation API, for use with AWS STS to create temporary credentials for invoking Q Business APIs.
You can use AWS CloudFormation templates, discussed later in this blog, to automate the preceding deployment steps. See the IAM Identity Center documentation for detailed step-by-step instructions on setting up trusted identity propagation. You can also use the AWS Command Line Interface (AWS CLI) setup process in Making authenticated Amazon Q Business API calls using IAM Identity Center. Important: Choosing to add a trusted token issuer is a security decision that requires careful consideration. Only choose trusted token issuers that you trust to perform the following tasks:
- Authenticate the user who is specified in the token.
- Control the audience claim, a claim you configure as the user identifier.
- Generate a token that IAM Identity Center can exchange for an Identity Center-created token.
- Control the Identity Center customer managed application policy to add only the IAM users, roles, and execution roles that can perform the exchange.
Authorization flow

For a typical web application, the trusted identity propagation process will involve five steps, as shown in the following flow diagram.
1. Sign in and obtain an authorization code from the IdP.
2. Use the authorization code and client secret to retrieve the ID token from the IdP.
3. Exchange the IdP-generated JWT ID token for an IAM Identity Center token that includes the AWS STS identity context.
4. Use the STS identity context to obtain temporary access credentials from AWS STS.
5. Use the temporary access credentials to access Q Business APIs.
An end-to-end implementation of the identity propagation is available for reference in <project_home>/webapp/main.py of AWS Samples – main.py.
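The following is a hedged boto3 sketch of steps 3 through 5 only (steps 1 and 2 are standard OAuth handled by your IdP library). It assumes you already hold the IdP-issued JWT ID token; the application ARN, role ARN, and Q Business application ID are placeholder values you would replace with your own deployment’s identifiers, and error handling is omitted for brevity.

```python
import base64
import json

import boto3

# Hypothetical placeholder values -- substitute your own deployment's identifiers.
IDC_APP_CLIENT_ID = "arn:aws:sso::123456789012:application/ssoins-example/apl-example"
ASSUME_ROLE_ARN = "arn:aws:iam::123456789012:role/QBusinessSTSAssumeRole"
QBUSINESS_APP_ID = "your-qbusiness-application-id"

def jwt_payload(token: str) -> dict:
    """Decode (without verifying) the payload segment of a JWT."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)          # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

def chat_as_user(idp_id_token: str, prompt: str) -> str:
    # Step 3: exchange the external IdP token for an Identity Center token.
    oidc = boto3.client("sso-oidc")
    idc_token = oidc.create_token_with_iam(
        clientId=IDC_APP_CLIENT_ID,
        grantType="urn:ietf:params:oauth:grant-type:jwt-bearer",
        assertion=idp_id_token,
    )["idToken"]

    # Step 4: use the sts:identity_context claim to create an identity-aware session.
    identity_context = jwt_payload(idc_token)["sts:identity_context"]
    creds = boto3.client("sts").assume_role(
        RoleArn=ASSUME_ROLE_ARN,
        RoleSessionName="qbusiness-chat",
        ProvidedContexts=[{
            "ProviderArn": "arn:aws:iam::aws:contextProvider/IdentityCenter",
            "ContextAssertion": identity_context,
        }],
    )["Credentials"]

    # Step 5: call the Q Business conversation API with the identity-aware credentials.
    qbusiness = boto3.client(
        "qbusiness",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return qbusiness.chat_sync(applicationId=QBUSINESS_APP_ID, userMessage=prompt)["systemMessage"]
```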
Sample JWT tokens

In the preceding authorization flow, one of the key steps is step 3, where the JWT ID token from the OAuth IdP is exchanged with IAM Identity Center for an AWS identity-aware JWT token. Key attributes of the respective JWT tokens are explored in the next section. An understanding of the tokens will help with troubleshooting authorization flow errors.

OpenID Connect JWT ID token

OIDC ID tokens take the form of a JWT, which is a JSON payload that’s signed with the private key of the issuer and can be parsed and verified by the application. In contrast to access tokens, ID tokens are intended to be understood by the OAuth client and include a handful of defined property names that provide information to the application. Important properties include aud, email, iss, and jti, which are used by IAM Identity Center to validate the token issuer, match the user directory, and issue a new Identity Center token. The following code sample shows a JWT identity token issued by an OIDC external IdP (such as Okta).
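The original sample token is not reproduced here; as an illustrative stand-in (every value below is invented), the decoded claims of such an ID token look roughly like the following.

```python
# Illustrative decoded OIDC ID-token claims (all values are made up).
oidc_id_token_claims = {
    "iss": "https://dev-123456.okta.com/oauth2/default",  # must match the trusted token issuer URL
    "aud": "0oa1b2c3d4EXAMPLE",                           # must match the authorized audience (aud claim)
    "sub": "00u9example",
    "email": "jane.doe@example.com",                      # matched against the Identity Center user
    "jti": "ID.3k2j1example",                             # unique token ID; helps prevent replay
    "iat": 1721900000,
    "exp": 1721903600,
}
```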
IAM Identity Center JWT token with identity context

A sample JWT token generated by CreateTokenWithIAM is shown in the following code sample. This token includes a property called sts:identity_context, which allows you to create an identity-enhanced IAM role session using the AWS STS AssumeRole API. The enhanced STS session allows the receiving AWS service to authorize the IAM Identity Center user to perform an action and log the user identity to CloudTrail for auditing.
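Again as a stand-in for the missing sample, the decoded claims of an Identity Center-issued token look roughly like the following illustrative payload; the key addition is the sts:identity_context claim that is passed to AssumeRole.

```python
# Illustrative decoded IAM Identity Center token claims (all values are made up).
idc_token_claims = {
    "iss": "https://identitycenter.amazonaws.com/ssoins-1234567890example",
    "aud": "arn:aws:sso::123456789012:application/ssoins-1234567890example/apl-example",
    "sub": "43a4b2c8-example-identity-center-user-id",
    "sts:identity_context": "AQIC5wM2LY4SfcEXAMPLE...",   # passed to sts:AssumeRole as ContextAssertion
    "exp": 1721903600,
}
```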
Automate configuration of a trusted token issuer using AWS CloudFormation

A broad range of possibilities exists to integrate your application with Amazon Q Business using IAM Identity Center and your enterprise IdP. For all integration projects, Identity Center needs to be configured to use a trusted token issuer. The sample CloudFormation templates discussed in this post focus on helping you automate the core trusted token issuer setup. If you’re new to Amazon Q Business and don’t have all the inputs required to deploy the CloudFormation template, the prerequisites section includes links to resources that can help you get started. You can also follow the tutorial on configuring the sample web application with Okta included in the accompanying AWS Samples repository.

Note: The full source code of the solution, including the AWS CloudFormation templates and sample web application, is available in the AWS Samples Repository.

Prerequisites and considerations
IAM Identity Center is deployed with users and groups provisioned.
For information on enabling different IAM Identity Center instances, see Configure an IAM Identity Center instance. For tutorials on setting up users and groups, see the Identity Center Getting started tutorials. The tutorials include syncing users and groups from Okta, Microsoft Entra ID, Google Workspace, Ping Identity, OneLogin, JumpCloud, and CyberArk.
Amazon Q Business application integrated with Identity Center.
For information on configuring a starter application, see Creating a sample Amazon Q Business application.
A web application that requires access to Q Business APIs.
A sample web application is available in the AWS Samples – Webapp. Check the README.md file in the <project_home>/webapp folder for additional instructions to set up the sample.
An external OIDC IdP is deployed.
For instructions to set up an Okta OIDC application, see Create an OIDC Web App in the Okta Admin Console. For instructions to set up a Microsoft Entra ID OIDC application, see Register an application with the Microsoft identity platform. For platform type, select Web Applications and then select Web. For instructions to set up an Amazon Cognito user pool, see Create a new user pool. To configure your web application to interact with a Cognito user pool, see User pool app clients. A sample CloudFormation template to set up a Cognito user pool and configure an app client is available in the AWS Samples – Cognito CFN.
Template for configuring AWS IAM Identity Center by a security owner

A security owner can use this CloudFormation template to automate configuration of the trusted token issuer in your IAM Identity Center. Deploy this stack in the AWS account where your Identity Center instance is located. This could be in the same AWS account where your application is deployed as a standalone or account instance, or can be in a delegated admin account managed as part of AWS Organizations.
To launch the stack, choose:
You can download the latest version of the CloudFormation template from AWS Samples – TTI CFN. The following figure shows the stack input for the template.
The stack creation requires four parameters:
- AuthorizedAudiences: The authorized audience is an auto-generated UUID from a third-party IdP service or a pseudo-ID configured by the administrator of the third-party IdP to uniquely identify the client (your application) for which the ID token is generated. The value must match the aud attribute value included in the JWT ID token generated by the third-party identity provider.
- ClientAppExecutionArn: The Amazon Resource Name (ARN) of the IAM user, group, or execution role that’s used to run your application, which will invoke Identity Center for token exchange and the AWS STS service for generating temporary credentials. For example, this could be the execution role ARN of the Lambda function where your code is run.
- IDCInstanceArn: The instance ARN of the IAM Identity Center instance used by your application.
- TokenIssuerUrl: The URL of the trusted token issuer. The trusted token issuer is a third-party identity provider that will authenticate a user and generate an ID token for authorization purposes. The token URL must match the iss attribute value included in the JWT ID token generated by the third-party identity provider.
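If you prefer to deploy this security-owner template programmatically rather than through the console, a hedged boto3 sketch might look like the following; the template URL and all parameter values are placeholders for your own environment.

```python
import boto3

cfn = boto3.client("cloudformation")
cfn.create_stack(
    StackName="qbusiness-tti-setup",
    # Placeholder: point this at the TTI CloudFormation template from the AWS Samples repository.
    TemplateURL="https://example-bucket.s3.amazonaws.com/tti-template.yaml",
    Parameters=[
        {"ParameterKey": "IDCInstanceArn", "ParameterValue": "arn:aws:sso:::instance/ssoins-EXAMPLE"},
        {"ParameterKey": "TokenIssuerUrl", "ParameterValue": "https://dev-123456.okta.com/oauth2/default"},
        {"ParameterKey": "AuthorizedAudiences", "ParameterValue": "0oa1b2c3d4EXAMPLE"},
        {"ParameterKey": "ClientAppExecutionArn", "ParameterValue": "arn:aws:iam::123456789012:role/my-app-execution-role"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],   # the stack creates IAM resources with custom names
)
cfn.get_waiter("stack_create_complete").wait(StackName="qbusiness-tti-setup")
```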
The following figure shows the output of the CloudFormation stack to configure a trusted token issuer with IAM Identity Center.
The stack creation produces the following output:
IDCApiAppArn: The ARN for the IAM Identity Center custom application auth provider. You will use this application to call the Identity Center CreateTokenWithIAM API to exchange the third-party JWT ID token with the Identity Center token.
Validate the configuration
1. From the AWS Management Console of the account where your IAM Identity Center instance is located, go to the AWS IAM Identity Center console to verify that the trusted token issuer is configured properly.
2. From the left navigation pane, choose Applications and choose the Customer Managed tab to see a list of applications, as shown in the following figure. The newly created customer managed IdP application will have the same name as the CloudFormation stack.
3. Choose the application name to open the application configuration page.
4. On your application configuration page, as shown in the following figure, verify the following:
- User and group assignments are set to Do not require assignments.
- Trusted applications for identity propagation lists Amazon Q and includes the application scope qbusiness:conversations:access.
- Authentication with the trusted token issuer is set to Configured.
Next, to verify the trusted token issuer configuration, choose Actions at the top right of the page and select Edit configurations from the drop-down menu. At the bottom of the page, expand Authentication with trusted token issuer and verify that your issuer URL is selected by default and that the audience ID (Aud claim) is configured properly for the issuer URL, as shown in the following figure. Next, expand Application credentials to verify that your application execution IAM role is listed.
Depending on your IAM Identity Center instance type, you might not be able to access the customer managed applications page in the console. In such cases, you can use the AWS CLI or SDK to view the configuration. Here is a list of useful AWS CLI commands: list-applications, list-application-access-scopes, get-application-assignment-configuration, describe-trusted-token-issuer, and list-application-grants.

Template for configuring your application by the application owner

To propagate user identities, your application will need to:
- Invoke the IAM Identity Center instance to exchange a third-party JWT ID token and obtain an Identity Center ID token.
- Invoke AWS STS to generate temporary credentials with an IAM assumed role.
The application owner can use a CloudFormation template to generate the required IAM policy to attach to your application execution role, along with the assumed role that has the required Q Business chat API privileges for use with AWS STS to generate temporary credentials. Remember to attach the generated add-on policy to your application’s IAM execution role to allow the application to invoke the Identity Center and AWS STS APIs.
To launch the stack, choose:
You can download the latest version of the CloudFormation template from AWS Samples – App Roles CFN. The following figure shows the CloudFormation stack configuration to install the IAM roles and policies required for the application to propagate identities.
The stack creation takes four parameters, as shown in the preceding figure:
- ClientAppExecutionArn: The ARN of an IAM user, group, or execution role that is used to run your application and will invoke IAM Identity Center for token exchange and AWS STS for generating temporary credentials. For example, this could be the execution role ARN of the Lambda function where your code is run.
- IDCApiAppArn: The ARN of the IAM Identity Center custom application auth provider. This is created as part of the trusted token issuer configuration.
- KMSKeyId: [Optional] The AWS Key Management Service (AWS KMS) key ID, if the Q Business application is encrypted with a customer managed encryption key.
- QBApplicationID: The Q Business application ID, which your application will use to invoke chat APIs. The STS assume role will be restricted to this application ID.
The following figure shows the output of the CloudFormation stack to install IAM roles and policies required for the application to propagate identities.
The stack creation produces the following outputs:
- ClientAppExecutionAddOnPolicyArn: A customer managed IAM policy created with the required permissions for your application to invoke the IAM Identity Center CreateTokenWithIAM API and call the STS AssumeRole API to generate temporary credentials for calling Q Business chat APIs. You can include this policy in your application’s IAM execution role to allow access to these APIs.
- QBusinessSTSAssumeRoleArn: An IAM role that includes the necessary permissions to call Q Business chat APIs, for use with the STS AssumeRole API call.
Validate the configuration
From the AWS account where your application is deployed, open the IAM console and verify that the IAM role for the STS AssumeRole call and the customer managed IAM policy for the application execution role have been created.
To verify the IAM role for STS AssumeRole, obtain the role name from the QBusinessSTSAssumeRoleArn stack output value, choose the Roles link in the left panel of the IAM console, and use the search bar to find the role name, as shown in the following figure.
Choose the link to the role to open it, and expand the inline policy to review the permissions, as shown in the following figure. To verify that the IAM policy for the application execution role add-on was created, obtain the IAM policy name from the ClientAppExecutionAddOnPolicyArn stack output value, go to Policies in the IAM console, and search for the policy, as shown in the following figure. Choose the link to the policy name to open the policy and review the permissions, as shown in the following figure.
Update the application for invoking the Q Business API with identity propagation

Most web applications using OAuth 2.0 with an IdP will have implemented a sign-in mechanism and invoke the IdP’s ID endpoint to retrieve a JWT ID token. However, before invoking Amazon Q Business APIs that require identity propagation, your application needs to be updated to include calls to CreateTokenWithIAM and AssumeRole to facilitate trusted token propagation. The CreateTokenWithIAM API exchanges the JWT ID token received from the OIDC IdP for an IAM Identity Center-generated JWT token. The newly generated token is then passed to the AssumeRole API to create identity-aware temporary security credentials that you can use to access AWS resources.

Note: Remember to add permissions to your IAM role and user policy to allow invoking these APIs. Alternatively, you can attach the sample policy referenced by ClientAppExecutionAddOnPolicyArn that was created by the CloudFormation template for configuring your application.

Sample access helper methods such as get_oidc_id_token, get_idc_sts_id_context, and get_sts_credential are available in <project_home>/src/qbapi_tools/access_helpers.py (AWS Samples – access_helpers.py). An end-to-end sample implementation of the complete sequence of steps depicted in the end-to-end authentication sequence is provided in <project_home>/webapp/main.py (AWS Samples – main.py).

Restrictions and limitations

Below are some common limitations and restrictions that you may encounter while configuring trusted token propagation, along with recommendations on how to mitigate them.

Group membership propagation

Enterprises typically manage group membership in their external IdP. However, when using trusted token propagation, the web identity token generated by the external IdP is exchanged for an ID token generated by IAM Identity Center. Thus, when invoking the Q Business API from an STS session enhanced with the Identity Center identity context, only the group membership information available for the user in Identity Center is passed to the Q Business API, not the group membership from the external IdP. To mitigate this issue, it’s recommended that all relevant users and groups are synchronized to Identity Center from the external IdP using System for Cross-domain Identity Management (SCIM). For more information, see automatic provisioning (synchronization) of users and groups.

Caching credentials to prevent invalid grant types

You can use a web identity token only once with the CreateTokenWithIAM API. This prevents token replay attacks, where an attacker intercepts a JWT and reuses it multiple times to bypass authentication and authorization controls. Because it isn’t practical to generate a new ID token for every Q Business API call, it’s recommended that the temporary credentials generated for a Q Business API session using AWS STS AssumeRole are cached and reused for subsequent API calls.

Clean up

To avoid incurring additional charges, make sure you delete any resources created in this post.
- Follow the instructions in Deleting a stack on the AWS CloudFormation console to delete any CloudFormation stacks created using the templates provided in this post.
- If you enabled an IAM Identity Center instance, follow the instructions to delete your IAM Identity Center instance.
- Ensure you unregister or delete any IdP resources, such as Okta, Entra ID, Ping Identity, or Amazon Cognito, that you created for this post.
- Finally, delete any sample code repositories you have cloned or downloaded, and any associated resources deployed as part of setting up the environment for running the samples in the code repository.
Conclusion

Trusted identity propagation is an important mechanism for securely integrating Amazon Q Business APIs into enterprise applications that use external IdPs. By implementing trusted identity propagation with AWS IAM Identity Center, organizations can confidently build AI-powered applications and tools using Amazon Q Business APIs, knowing that user identities are properly verified and protected throughout the process. This approach allows enterprises to harness the full potential of generative AI while maintaining the highest standards of security and privacy. To get started with Amazon Q Business, explore the Getting started guide. To learn more about how trusted token propagation works, see How to develop a user-facing data application with IAM Identity Center and S3 Access Grants.
About the Author

Rajesh Kumar Ravi is a Senior Solutions Architect at Amazon Web Services specializing in building generative AI solutions with Amazon Q Business, Amazon Bedrock, and Amazon Kendra. He is an accomplished technology leader with experience in building innovative AI products, nurturing the builder community, and contributing to the development of new ideas. He enjoys walking and loves to go on short hiking trips outside of work.
In today’s digital landscape, the demand for audio and video content is skyrocketing. Organizations are increasingly using media to engage with their audiences in innovative ways. From product documentation in video format to podcasts replacing traditional blog posts, content creators are exploring diverse channels to reach a wider audience. The rise of virtual workplaces has also led to a surge in content captured through recorded meetings, calls, and voicemails. Additionally, contact centers generate a wealth of media content, including support calls, screen-share recordings, and post-call surveys.

We are excited to introduce Mediasearch Q Business, an open source solution powered by Amazon Q Business and Amazon Transcribe. Mediasearch Q Business builds on the Mediasearch solution powered by Amazon Kendra and enhances the search experience using Amazon Q Business. Mediasearch Q Business supercharges the way you consume media files by using them as part of the knowledge base used by Amazon Q Business to generate reliable answers to user questions. The solution also features an enhanced Amazon Q Business query application that allows users to play the relevant section of the original media files or YouTube videos directly from the search results page, providing a seamless and intuitive user experience.

Solution overview

Mediasearch Q Business is straightforward to install and try out.
The solution has two components, as illustrated in the following diagram:
- A Mediasearch indexer that transcribes media files (audio and video) in an Amazon Simple Storage Service (Amazon S3) bucket or media from a YouTube playlist and ingests the transcriptions into either an Amazon Q Business native index (configured as part of the Amazon Q Business application) or an Amazon Kendra index.
- A Mediasearch finder, which provides a UI and makes API calls to the Amazon Q Business service APIs on behalf of the logged-in user. The responses from the API calls are displayed to the end user.
The Mediasearch indexer finds and transcribes audio and video files stored in an S3 bucket. The indexer can also index YouTube videos from a YouTube playlist as audio files and transcribe these audio files. It prepares the transcriptions by embedding time markers at the start of each sentence, and it indexes each prepared transcription in an Amazon Q Business native retriever or an Amazon Kendra retriever. The indexer runs the first time when you install it, and subsequently runs on an interval that you specify, maintaining the index to reflect any new, modified, or deleted files. The Mediasearch finder is a web search client that you use to search for content in your Amazon Q Business application. Additionally, the Mediasearch finder includes in-line embedded media players in the search result, so you can see the relevant section of the transcript, and play the corresponding section from the original media (audio files and video files in your media bucket or a YouTube video) without navigating away from the search page.
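To illustrate the time-marker idea, the snippet below is a simplified, hypothetical sketch (not the Mediasearch indexer’s actual code) that walks an Amazon Transcribe result file and prepends each sentence with the start time of its first word; the finder later uses such markers to start playback at the right point.

```python
import json

def add_time_markers(transcribe_json: str) -> str:
    """Rebuild a transcript with a [seconds] marker at the start of each sentence."""
    items = json.loads(transcribe_json)["results"]["items"]
    sentences, current, start_time = [], [], None
    for item in items:
        word = item["alternatives"][0]["content"]
        if item["type"] == "pronunciation":
            if start_time is None:
                start_time = item["start_time"]        # first word of the sentence
            current.append(word)
        elif current:                                   # punctuation carries no timestamps
            current[-1] += word
            if word in ".?!":
                sentences.append(f"[{start_time}] " + " ".join(current))
                current, start_time = [], None
    if current:
        sentences.append(f"[{start_time}] " + " ".join(current))
    return "\n".join(sentences)
```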
In the sections that follow, we discuss the following topics:
- How to deploy the solution to your AWS account
- How to use it to index and search sample media files
- How to use the solution with your own media files
- How the solution works
- The estimated costs involved
- How to monitor usage and troubleshoot problems
- Options to customize and tune the solution
- How to uninstall and clean up when you’re done experimenting
Prerequisites

Make sure you have the following:
- An AWS account where you can launch an AWS CloudFormation stack
- An AWS IAM Identity Center instance ARN that will be used by the Amazon Q Business application to provide access to users
Deploy the Mediasearch Q Business solution

In this section, we walk through deploying the two solution components: the indexer and the finder. We use a CloudFormation stack to deploy the necessary resources in the us-east-1 AWS Region. If you’re deploying the solution to another Region, follow the instructions in the README available in the Mediasearch Q Business GitHub repository.

Deploy the Mediasearch Q Business indexer component

To deploy the indexer component, complete the following steps:
1. Choose Launch Stack.
2. In the Identity center ARN and Retriever selection section, for IdentityCenterInstanceArn, enter the ARN for your IAM Identity Center instance.
You can find the ARN on the Settings page of the IAM Identity Center console. The ARN is a required field.
3. Use default values for all other parameters. We will customize these values later to suit your specific requirements.
4. Acknowledge that the stack might create IAM resources with custom names, then choose Create stack.
The indexer stack takes around 10 minutes to deploy. Wait for the indexer to finish deploying before you deploy the Mediasearch Q Business finder.

Deploy the Mediasearch Q Business finder component

The Mediasearch finder uses Amazon Cognito to authenticate users to the solution. For an authenticated user to interact with an Amazon Q Business application, you must configure an IAM Identity Center customer managed application that supports either SAML 2.0 or OAuth 2.0. In this post, we create a customer managed application that supports OAuth 2.0, a secure way for applications to communicate and share user data without exposing passwords.

We use a technique called trusted identity propagation, which allows the Mediasearch Q Business finder app to access the Amazon Q service securely without sharing passwords between the two identity providers (Amazon Cognito and IAM Identity Center in our example). Instead of sharing passwords, trusted identity propagation uses tokens. Tokens are like digital certificates that prove who the user is and what they’re allowed to do. AWS managed applications that work with trusted identity propagation get tokens directly from IAM Identity Center. IAM Identity Center can also exchange identity tokens and access tokens from external authorization servers like Amazon Cognito. This lets an application authenticate users and obtain tokens outside of AWS (such as with Amazon Cognito, Microsoft Entra ID, or Okta), exchange that token for an IAM Identity Center token, and then use the new token to request access to AWS services like Amazon Q Business. For more information, see Using trusted identity propagation with customer managed applications.

When the IAM Identity Center instance is in the same account where you are deploying the Mediasearch Q Business solution, the finder stack allows you to automatically create the IAM Identity Center customer managed application as part of the stack deployment. If you use the organization instance of IAM Identity Center enabled in your management account, then you will be deploying the Mediasearch Q Business finder stack in a different AWS account. In this case, follow the steps in the README to create an IAM Identity Center application manually.

To deploy the finder component and create the IAM Identity Center customer managed application, complete the following steps:
1. Choose Launch Stack.
2. For IdentityCenterInstanceArn, enter the ARN for the IAM Identity Center instance. This is the same value you used while deploying the indexer stack.
3. For CreateIdentityCenterApplication, choose Yes to create the IAM Identity Center application for the Mediasearch finder application.
4. Under Mediasearch Indexer parameters, enter the Amazon Q Business application ID that was created by the indexer stack. You can copy this from the QBusinessApplicationId output of the indexer stack.
5. Select the retriever type that was used to deploy the Mediasearch indexer (if you deployed an Amazon Kendra index, select Kendra; otherwise, select Native). If you selected Kendra, enter the Amazon Kendra index ID that was used by the indexer stack.
6. For MediaBucketNames, use the MediaBucketsUsed output from the indexer CloudFormation stack to allow the search page to access media files across the YTMediaBucket and MediaBucket.
7. Acknowledge that the stack might create IAM resources with custom names, then choose Create stack.
Configure user access to Amazon Q Business

To access the Mediasearch Q Business solution, add a user with an appropriate subscription to the Amazon Q Business application and to the IAM Identity Center customer managed application.

Add a user to the Amazon Q Business application

To start using the Amazon Q Business application, you can add users or groups to the Amazon Q Business application from your IAM Identity Center instance. Complete the following steps to add a user to the application:
1. Access the Amazon Q Business application by choosing the link for QBusinessApplication in the indexer CloudFormation stack outputs.
2. Under Groups and users, on the Users tab, choose Manage access and subscription.
3. Choose Add groups and users.
4. Choose Add existing users and groups.
5. Search for an existing user, choose the user, and choose Assign.
6. Select the added user and, on the Change subscription menu, choose Update subscription tier.
7. Select the appropriate subscription tier and choose Confirm.
For details of each Amazon Q subscription, refer to Amazon Q Business pricing.

Assign users to the IAM Identity Center customer managed application

Now you can assign users or groups to the IAM Identity Center customer managed application. Complete the following steps to add a user:
1. From the outputs section of the finder CloudFormation stack, choose the URL for IdentityCenterApplicationConsoleURL to navigate to the customer managed application.
2. Choose Assign users and groups.
3. Select users and choose Assign users.
This concludes the user access configuration for the Mediasearch Q Business solution.

Test with the sample media files

When the Mediasearch indexer and finder stacks are deployed, the indexer should have completed processing the audio (MP3) files for the YouTube videos and sample media files (selected AWS Podcast episodes and AWS Knowledge Center videos). You can now run your first Mediasearch query.
To log in to the Mediasearch finder application, choose the URL for MediasearchFinderURL in the stack outputs.
The Mediasearch finder application in your browser will show a splash page for Amazon Q Business.
Choose Get Started to access the Amazon Cognito page.
To access Mediasearch Q Business, you need to log in to the application using a user ID in the Amazon Cognito user pool created by the finder stack. The email address in Amazon Cognito must match the email address for the user in IAM Identity Center. Alternatively, the Mediasearch solution allows you to create a user through the application.
On the Create Account tab, enter your email (which matches the email address in IAM Identity Center), followed by a password and password confirmation, and choose Create Account.
Amazon Cognito will send an email with a confirmation code for email verification.
Enter this confirmation code to complete your email verification.
After email verification, you will now be able to log in to the Mediasearch Q Business application. After you’re logged in, in the Enter a prompt box, write a query, such as “What is AWS Fargate?”
The query returns a response from Amazon Q Business based on the media (sample media files and YouTube audio sources) ingested into the index. The response includes citations, with reference to sources. Users can verify their answer from Amazon Q Business by playing media files from their S3 buckets or YouTube starting at the time marker where the relevant information is found.
Use the embedded video player to play the original video inline. Observe that the media playback starts at the relevant section of the video based on the time marker. To play the video full screen in a new browser tab, use the Full screen menu option in the player, or choose the media file hyperlink shown above the answer text. Choose (right-click) the video file hyperlink, copy the URL, and enter it into a text editor.
If the media is an audio file for a YouTube video, it looks something like the following:
If the media file is a non-YouTube media file that resides in the MediaBucket, the URL looks like the following:

https://mediasearchtest.s3.amazonaws.com/mediasamples/What_is_an_Interface_VPC_Endpoint_and_how_can_I_create_Interface_Endpoint_for_my_VPC_.mp4?AWSAccessKeyId=ASIAXMBGHMGZLSYWJHGD&Expires=1625526197&Signature=BYeOXOzT585ntoXLDoftkfS4dBU%3D&x-amz-security-token=…. #t=253.52

This is a presigned S3 URL that provides your browser with temporary read access to the media file referenced in the search result. Using presigned URLs means you don’t need to provide permanent public access to all of your indexed media files.
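As a hedged sketch of how such a link could be produced (the bucket, key, and timestamp below are placeholders, not the solution’s actual values), a presigned URL with a playback-start fragment can be generated like this:

```python
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "mediasearchtest", "Key": "mediasamples/example-video.mp4"},  # placeholder object
    ExpiresIn=3600,                       # temporary read access only; no public bucket policy needed
)
# Appending a media fragment tells the browser's player where to start playback.
start_seconds = 253.52
playback_url = f"{url}#t={start_seconds}"
print(playback_url)
```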
Experiment with additional queries, such as “How has AWS helped customers in building MLOps platform?” or “How can I use Generative AI to improve customer experience?” or try your own questions.
Index and search your own media files

To index media files stored in your own S3 bucket, replace the MediaBucket and MediaFolderPrefix parameters with your own bucket name and prefix when you install or update the indexer component stack, and modify the MediaBucketName parameter with your own bucket name when you install or update the finder component stack. Additionally, you can replace the YouTube playlist (PlayListURL) with your own playlist URL and update the indexer stack.
When creating a new Mediasearch indexer stack, you can choose to use either a native retriever or an Amazon Kendra retriever. You can make this selection using the RetrieverType parameter. When using the Amazon Kendra retriever, you can either let the indexer stack create an Amazon Kendra index or use an existing Amazon Kendra IndexId to add files stored in the new location.

To deploy a new indexer, follow the steps from earlier in this post, but replace the defaults to specify the media bucket name and prefix for your own media files, or replace the YouTube playlist URL with your own playlist URL. Make sure that you comply with the YouTube Terms of Service.

Alternatively, update an existing Mediasearch indexer stack to replace the previously indexed files with files from the new location, or to update the YouTube playlist URL or the number of videos to download from the playlist:
1. Select the stack on the AWS CloudFormation console, choose Update, then Use current template, then Next.
2. Modify the media bucket name and prefix parameter values as needed.
3. Modify the YouTube Playlist URL and Number of YouTube Videos values as needed.
4. Choose Next twice, select the acknowledgement check box, and choose Update stack.
You can also update an existing Mediasearch finder stack to change bucket names or add additional bucket names to the MediaBucketNames parameter.
When the Mediasearch indexer stack is successfully created or updated, the indexer automatically finds, transcribes, and indexes the media files stored in your S3 bucket. When it’s complete, you can submit queries and find answers from the audio tracks of your own audio and video files.

You have the option to provide metadata for any or all of your media files. Use metadata to assign values to index attributes for sorting, filtering, and faceting your search results, or to specify access control lists to govern access to the files. Metadata files can be in the same S3 folder as your media files (default), or in a parallel folder structure specified by the optional indexer parameter MetadataFolderPrefix. For more information about how to create metadata files, see Amazon S3 document metadata.

You can also provide customized transcription options for any or all of your media files. This allows you to take full advantage of Amazon Transcribe features such as custom vocabularies, automatic content redaction, and custom language models.

How the Mediasearch solution works

Let’s take a quick look at how the solution works, as illustrated in the following diagram.
The Mediasearch solution has an event-driven serverless computing architecture with the following steps:
1. You provide an S3 bucket containing the audio and video files you want to index and search (the MediaBucket). Leave this blank if you don't want to index media from your MediaBucket. You also provide your YouTube playlist URL and the number of videos to index from the playlist. Make sure that you comply with the YouTube Terms of Service. The YTIndexer indexes the latest videos from the playlist; for example, if the number of videos is set to 5, the YTIndexer indexes the five most recent videos in the playlist. Videos that were indexed previously are skipped.
2. An AWS Lambda function fetches the YouTube videos from the playlist as audio (MP3 files) into the YTMediaBucket and creates a metadata file in the MetadataFolderPrefix location with metadata for each video. The YouTube video ID and its related metadata are recorded in an Amazon DynamoDB table (YTMediaDDBQueueTable).
3. Amazon EventBridge generates events on a repeating interval (every 2 hours, every 6 hours, and so on). These events invoke the Lambda function S3CrawlLambdaFunction.
4. The S3CrawlLambdaFunction function is invoked when the CloudFormation stack is first deployed, and subsequently by the scheduled EventBridge events. It crawls the MediaBucket and the YTMediaBucket and starts an Amazon Q Business index (or Amazon Kendra) data source sync job. The function lists all the supported media files (FLAC, MP3, MP4, Ogg, WebM, AMR, or WAV) and associated metadata and transcribe options stored in the user-provided S3 bucket. Each new file is added to another DynamoDB tracking table and submitted to be transcribed by an Amazon Transcribe job. A file that has been previously transcribed is submitted for transcription again only if it has been modified since it was last transcribed, or if its associated Amazon Transcribe options have been updated. The DynamoDB table is updated to reflect the transcription status and last modified timestamp of each file. Tracked files that no longer exist in the S3 bucket are removed from the DynamoDB table and from the Amazon Q Business index (or Amazon Kendra index). If no new or updated files are discovered, the data source sync job is stopped immediately.
5. The DynamoDB table holds a record for each media file, with attributes that track transcription job names, status, and last modified timestamps.
6. As each Amazon Transcribe job completes, EventBridge generates a job complete event, which invokes another Lambda function (S3JobCompletionLambdaFunction). This function processes the transcription job output, generating a modified transcription with a time marker inserted at the start of each sentence. The modified transcription is indexed in Amazon Q Business (or Amazon Kendra), and the job status for the file is updated in the DynamoDB table. When the last file has been transcribed and indexed, the data source sync job is stopped.
7. The index is populated and kept in sync with the transcriptions of all the media files in the S3 bucket monitored by the Mediasearch indexer component, integrated with content from any other provisioned data sources. The media transcriptions are used by the Amazon Q Business application, which allows users to find content and answers to their questions.
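To illustrate step 4, the following is a minimal sketch of how a crawler Lambda function might submit one media file to Amazon Transcribe. The bucket, key, job-naming scheme, and output location are hypothetical placeholders; the actual solution adds metadata tracking and sync-job management around this call:

```python
import boto3

transcribe = boto3.client("transcribe")

def start_media_transcription(bucket: str, key: str) -> str:
    """Submit one media file for transcription (illustrative only)."""
    job_name = key.replace("/", "-").replace(".", "-")  # hypothetical naming scheme
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        IdentifyLanguage=True,               # let Transcribe detect the language
        OutputBucketName=bucket,             # write the transcript back to the same bucket
        OutputKey=f"transcripts/{key}.json",
    )
    return job_name
```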
The sample finder client application enhances users’ search experience by embedding an inline media player with each source or citation that is based on a transcribed media file. The client uses the time markers embedded in the transcript to start media playback at the relevant section of the original media file. An Amazon Cognito user pool is used to authenticate users and is configured to exchange tokens from IAM Identity Center to support Amazon Q Business service calls.
Estimated costs
In addition to the Amazon S3 costs associated with storing your media, the Mediasearch solution incurs usage costs from Amazon Q Business, Amazon Kendra (if you use an Amazon Kendra index), Amazon Transcribe, and Amazon API Gateway. Additional minor costs are incurred by the other services mentioned after Free Tier allowances have been used. For more information, see the pricing pages for Amazon Q Business, Amazon Kendra, Amazon Transcribe, Lambda, DynamoDB, and EventBridge.
Monitor and troubleshoot
To see the details of each media file transcription job, navigate to the Transcription jobs page on the Amazon Transcribe console. Each media file is transcribed only once, unless the file is modified. Modified files are re-transcribed and re-indexed to reflect the changes. Choose any transcription job to review the transcription and examine additional job details.
You can check the status of the data source sync by navigating to the Amazon Q Business application deployed by the indexer stack (choose the link on the indexer stack outputs page for QApplication). In the data source section, choose the custom data source and view the status of the sync job.
On the DynamoDB console, choose Tables in the navigation pane. Use your MediaSearch stack name as a filter to display the MediaSearch DynamoDB tables, and examine the items showing each indexed media file and corresponding status. The table MediaSearch-Indexer-YTMediaDDBQueueTable has one record for each YouTube videoid that is downloaded as an audio (mp3) file along with the metadata for the video like author, view count, video title, and so on.
The table MediaSearch-Indexer-MediaDynamoTable has one record for each media file (including YouTube videos), and contains attributes with information about the file and its processing status.
On the Functions page of the Lambda console, use your indexer stack name as a filter to list the Lambda functions that are part of the solution:
The YouTubeVideoIndexer function indexes and downloads YouTube videos if the CloudFormation stack parameter PlayListURL is set to a valid YouTube playlist.
The S3CrawlLambdaFunction function crawls the YTMediaBucket and the MediaBucket for media files and initiates the transcription jobs for the media files.
When the transcription job is complete, a completion event invokes the S3JobCompletionLambdaFunction function, which ingests the transcription into the Amazon Q Business index (or Amazon Kendra index) with any related metadata.
Choose any of the functions to examine the function details, including environment variables, source code, and more. Choose Monitor and View logs in CloudWatch to examine the output of each function invocation and troubleshoot any issues. On the Functions page of the Lambda console, use your finder stack name as a filter to list the Lambda functions that are part of the solution:
The BuildTriggerLambda function runs the build of the finder AWS Amplify application after cloning the AWS CodeCommit repository with the finder ReactJS code.
The IDCTokenCreateLambda function uses the authorization header that contains a JWT token from a successful authentication with Amazon Cognito to exchange bearer tokens from IAM Identity Center.
The IDCAppCreateLambda function creates an OAuth 2.0 IAM Identity Center application to exchange tokens from IAM Identity Center and a trusted token issuer for the Amazon Cognito user pool.
The UserConversationLambda function is called from API Gateway to list or delete Amazon Q Business conversations.
The UserPromptsLambda function is called from API Gateway to call the chat_sync API of the Amazon Q Business service.
The PreSignedURLCreateLambda function is called from API Gateway to create a presigned URL for S3 buckets. The presigned URL is used to play the media files residing in the MediaBucket that serves as the source for an Amazon Q Business response.
Choose any of the functions to examine the function details, including environment variables, source code, and more. Choose Monitor and View logs in CloudWatch to examine the output of each function invocation and troubleshoot any issues.
Customize and enhance the solution
You can fork the MediaSearch Q Business GitHub repository, enhance the code, and send us pull requests so we can incorporate and share your improvements. The following are a few suggestions for features you might want to implement:
Enhance the indexer stack to allow existing Amazon Q Business application IDs to be used
Extend your search sources to include other video streaming platforms relevant to your organization
Build Amazon CloudWatch metrics and dashboards to improve the manageability of MediaSearch
Clean up
When you're finished experimenting with this solution, clean up your resources by using the AWS CloudFormation console to delete the indexer and finder stacks that you deployed. This deletes all the resources that were created by deploying the solution. Preexisting Amazon Q Business applications, indexes, IAM Identity Center applications, and trusted token issuers that were created manually aren't deleted.
Conclusion
The combination of Amazon Q Business and Amazon Transcribe enables a scalable, cost-effective solution to surface insights from your media files. You can use the content of your media files to find accurate answers to your users' questions, whether they're from text documents or media files, and consume them in their native format. This solution enhances the overall experience of the previous Mediasearch solution by using the powerful generative artificial intelligence (AI) capabilities of Amazon Q Business.
The sample MediaSearch Q Business solution is provided as open source. Use it as a starting point for your own solution, and help us make it better by contributing back fixes and features through GitHub pull requests. For expert assistance, AWS Professional Services and other Amazon partners are here to help. We'd love to hear from you. Let us know what you think in the comments section, or use the issues forum in the MediaSearch Q Business GitHub repository.
About the Authors
Roshan Thomas is a Senior Solutions Architect at Amazon Web Services. He is based in Melbourne, Australia, and works closely with power and utilities customers to accelerate their journey in the cloud. He is passionate about technology and helping customers architect and build solutions on AWS.
Anup Dutta is a Solutions Architect with AWS based in Chennai, India. In his role at AWS, Anup works closely with startups to design and build cloud-centered solutions on AWS.
Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.
Abhinav Jawadekar is a Principal Solutions Architect in the Amazon Q Business service team at AWS. Abhinav works with AWS customers and partners to help them build generative AI solutions on AWS.
This post is co-written with Benjamin Moody from Monks.
Monks is the global, purely digital, unitary operating brand of S4Capital plc. With a legacy of innovation and specialized expertise, Monks combines an extraordinary range of global marketing and technology services to accelerate business possibilities and redefine how brands and businesses interact with the world. Its integration of systems and workflows delivers unfettered content production, scaled experiences, enterprise-grade technology, and data science fueled by AI, managed by the industry's best and most diverse digital talent, to help the world's trailblazing companies outmaneuver and outpace their competition.
Monks leads the way in crafting cutting-edge brand experiences. We shape modern brands through innovative and forward-thinking solutions. As brand experience experts, we harness the synergy of strategy, creativity, and in-house production to deliver exceptional results. Tasked with using the latest advancements in AWS services and machine learning (ML) acceleration, our team embarked on an ambitious project to revolutionize real-time image generation. Specifically, we focused on using AWS Inferentia2 chips with Amazon SageMaker to enhance the performance and cost-efficiency of our image generation processes.
Initially, our setup faced significant challenges with scalability and cost management. The primary issues were maintaining consistent inference performance under varying loads while providing a generative experience for the end user. Traditional compute resources were not only costly but also failed to meet our low-latency requirements. This prompted us to explore more advanced AWS solutions that could offer high-performance computing and cost-effective scalability.
The adoption of AWS Inferentia2 chips and SageMaker asynchronous inference endpoints emerged as a promising solution. These technologies promised to address our core challenges by significantly enhancing processing speed (AWS Inferentia2 chips were four times faster in our initial benchmarks) and reducing costs through fully managed auto scaling inference endpoints. In this post, we share how we used AWS Inferentia2 chips with SageMaker asynchronous inference to improve performance by four times and achieve a 60% reduction in cost per image for our real-time diffusion AI image generation.
Solution overview
The combination of SageMaker asynchronous inference with AWS Inferentia2 allowed us to efficiently handle requests with large payloads and long processing times while maintaining our low-latency requirements. A prerequisite was to fine-tune the Stable Diffusion XL model with domain-specific images stored in Amazon Simple Storage Service (Amazon S3). For this, we used Amazon SageMaker JumpStart. For more details, refer to Fine-Tune a Model.
The solution workflow consists of the following components:
Endpoint creation – We created an asynchronous inference endpoint using our existing SageMaker models, using AWS Inferentia2 chips for higher price/performance.
Request handling – Requests were queued by SageMaker upon invocation. Users submitted their image generation requests, where the input payload was placed in Amazon S3. SageMaker then queued the request for processing.
Processing and output – After processing, the results were stored back in Amazon S3 in a specified output bucket. During periods of inactivity, SageMaker automatically scaled the instance count to zero, significantly reducing costs because charges only occurred when the endpoint was actively processing requests.
Notifications – Completion notifications were set up through Amazon Simple Notification Service (Amazon SNS), notifying users of success or errors.
The following diagram illustrates our solution architecture and process workflow.
In the following sections, we discuss the key components of the solution in more detail.
SageMaker asynchronous endpoints
SageMaker asynchronous endpoints queue incoming requests and process them asynchronously, which is ideal for large inference payloads (up to 1 GB) or inference requests with long processing times (up to 60 minutes) that need to be processed as requests arrive. The ability to serve long-running requests enabled Monks to effectively serve their use case. Auto scaling the instance count to zero allows you to design cost-optimal inference in response to spiky traffic, so you pay only when the instances are serving traffic: you can scale the endpoint instance count to zero in the absence of outstanding requests and scale back up when new requests arrive. To learn how to create a SageMaker asynchronous endpoint, attach auto scaling policies, and invoke an asynchronous endpoint, refer to Create an Asynchronous Inference Endpoint.
AWS Inferentia2 chips, which powered the SageMaker asynchronous endpoints, are AWS AI chips optimized to deliver high performance for deep learning inference applications at the lowest cost. Integrated within SageMaker asynchronous inference endpoints, AWS Inferentia2 chips support scale-out distributed inference with ultra-high-speed connectivity between chips. This setup was ideal for deploying our large-scale generative AI model across multiple accelerators efficiently and cost-effectively.
In the context of our high-profile nationwide campaign, asynchronous computing was key to managing peak and unexpected spikes in concurrent requests to our inference infrastructure, which we expected to reach hundreds of concurrent requests per second. Asynchronous inference endpoints, like those provided by SageMaker, offer dynamic scalability and efficient task management. The solution offered the following benefits:
Efficient handling of longer processing times – SageMaker asynchronous inference endpoints are perfect for scenarios where each request might involve substantial computational work. These fully managed endpoints queue incoming inference requests and process them asynchronously. This method was particularly advantageous in our application, because it allowed the system to manage fluctuating demand efficiently. The ability to process requests asynchronously makes sure our infrastructure can handle large unexpected spikes in traffic without causing delays in response times.
Cost-effective resource utilization – One of the most significant advantages of using asynchronous inference endpoints is their impact on cost management. These endpoints can automatically scale the compute resources down to zero in periods of inactivity, without the risk of dropping or losing requests as resources scale back up.
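For context, invoking an asynchronous endpoint differs from real-time invocation: the request payload is read from Amazon S3 and the result is written back to S3. The following is a minimal boto3 sketch; the endpoint name, bucket, and key are placeholders, not the ones used in this project:

```python
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

# The input payload (for example, a JSON prompt for image generation) is staged in S3 first.
response = sm_runtime.invoke_endpoint_async(
    EndpointName="sdxl-inf2-async-endpoint",  # hypothetical endpoint name
    InputLocation="s3://my-async-bucket/inputs/request-001.json",
    ContentType="application/json",
)

# SageMaker returns immediately with the S3 location where the result will appear.
print("Result will be written to:", response["OutputLocation"])
```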
Custom scaling policies using Amazon CloudWatch metrics
SageMaker endpoint auto scaling behavior is defined through a scaling policy, which helps us scale to multiple users using the application concurrently. This policy defines how and when to scale resources up or down to provide optimal performance and cost-efficiency.
SageMaker synchronous inference endpoints are typically scaled using the InvocationsPerInstance metric, which helps determine event triggers based on real-time demand. However, for SageMaker asynchronous endpoints, this metric isn't available due to their asynchronous nature. We encountered challenges with alternative metrics such as ApproximateBacklogSizePerInstance because they didn't meet our real-time requirements; the inherent delay in these metrics resulted in unacceptable latency in our scaling processes. Consequently, we sought a custom metric that could more accurately reflect the real-time load on our SageMaker instances.
Amazon CloudWatch custom metrics provide a powerful tool for monitoring and managing your applications and services in the AWS Cloud. We had previously established a range of custom metrics to monitor various aspects of our infrastructure, including a particularly crucial one for tracking cache misses during image generation. Because asynchronous endpoints don't provide the InvocationsPerInstance metric, this custom cache miss metric became essential: it enabled us to gauge the number of requests contributing to the size of the endpoint queue. With this insight into the number of requests, one of our senior developers began to explore additional metrics available through CloudWatch to calculate the asynchronous endpoint capacity and utilization rate. We used the following calculations:
InferenceCapacity = (CPU utilization * 60) / (InferenceTimeInSeconds * InstanceGPUCount)
Number of inference requests = (served from cache + cache misses)
Usage rate = (number of requests) / (InferenceCapacity)
The calculations included the following variables:
CPU utilization – Represents the average CPU utilization percentage of the SageMaker instances (the CPUUtilization CloudWatch metric). It provides a snapshot of how much CPU is currently being used by the instances.
InferenceCapacity – The total number of inference tasks that the system can process per minute, calculated from the average CPU utilization and scaled by the number of accelerators available (inf2.48xlarge has 12). This metric estimates the system's throughput capability per minute.
Multiply by 60 / divide by InferenceTimeInSeconds – This step adjusts the CPUUtilization metric to reflect how it translates into jobs per minute, assuming each job takes 10 seconds. Therefore, (CPU utilization * 60) / 10 represents the theoretical maximum number of jobs that can be processed in one minute based on current or typical CPU utilization.
Multiply by 12 – Because the inf2.48xlarge instance has 12 accelerators, this multiplication provides the total capacity in terms of how many jobs all accelerators can handle collectively in 1 minute.
Number of inference requests (served from cache + cache misses) – We monitor the total number of inference requests processed, distinguishing between those served from cache and those requiring real-time processing due to cache misses. This helps us gauge the overall workload.
Usage rate (number of inference requests) / (InferenceCapacity) – This formula determines the rate of resource usage by comparing the number of operations that invoke new tasks (number of requests) to the total inference capacity (InferenceCapacity).
A higher InferenceCapacity value suggests that we have either scaled up our resources or that our instances are under-utilized. Conversely, a lower capacity value could indicate that we're reaching our capacity limits and might need to scale out to maintain performance.
Our custom usage rate metric quantifies the usage rate of available SageMaker instance capacity. It's a composite measure that factors in both the image generation tasks that weren't served from cache and those that resulted in a cache miss, relative to the total capacity metric. The usage rate is intended to provide insights into how much of the total provisioned SageMaker instance capacity is actively being used for image generation operations. It serves as a key indicator of operational efficiency and helps identify the workload's operational demands.
We then used the usage rate metric as our auto scaling trigger metric. The use of this trigger in our auto scaling policy made sure SageMaker instances were neither over-provisioned nor under-provisioned. A high usage rate value might indicate the need to scale up resources to maintain performance; a low value could signal under-utilization, indicating a potential for cost optimization by scaling down resources. We applied our custom metrics as triggers for a scaling policy:
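The exact policy Monks used isn't reproduced here. The following is a minimal sketch, assuming a custom UsageRate metric published to a hypothetical CloudWatch namespace, of how a target tracking policy on that metric could be attached to a SageMaker endpoint variant with Application Auto Scaling; the endpoint name, namespace, and target value are placeholders:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names
resource_id = "endpoint/sdxl-inf2-async-endpoint/variant/AllTraffic"

# Register the endpoint variant as a scalable target (instance count between 0 and 5).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=5,
)

# Target tracking on the custom usage rate metric: scale so usage stays near 70%.
autoscaling.put_scaling_policy(
    PolicyName="usage-rate-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "CustomizedMetricSpecification": {
            "MetricName": "UsageRate",              # hypothetical custom metric name
            "Namespace": "Custom/ImageGeneration",  # hypothetical namespace
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```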
Deployment on AWS Inferentia2 chips
The integration of AWS Inferentia2 chips into our SageMaker inference endpoints not only resulted in a four-times increase in inference performance for our fine-tuned Stable Diffusion XL model, but also significantly enhanced cost-efficiency. Specifically, SageMaker instances powered by these chips reduced our deployment costs by 60% compared to other comparable instances on AWS. This substantial reduction in cost, coupled with improved performance, underscores the value of using AWS Inferentia2 for intensive computational tasks such as real-time diffusion AI image generation.
Given the importance of swift response times for our use case, we established an acceptance criterion of single-digit-second latency. SageMaker instances equipped with AWS Inferentia2 chips successfully optimized our infrastructure to deliver image generation in just 9.7 seconds. This enhancement not only met our performance requirements at a low cost, but also provided a seamless and engaging user experience owing to the high availability of Inferentia2 chips. The effort to integrate with the Neuron SDK also proved highly beneficial: the optimized model not only met our performance criteria, but also enhanced the overall efficiency of our inference processes.
Results and benefits
The implementation of SageMaker asynchronous inference endpoints significantly enhanced our architecture's ability to handle varying traffic loads and optimize resource utilization, leading to marked improvements in performance and cost-efficiency:
Inference performance – The AWS Inferentia2 setup processed an average of 27,796 images per instance per hour, giving us a 2x improvement in throughput over comparable accelerated compute instances.
Inference savings – In addition to the performance enhancements, the AWS Inferentia2 configurations achieved a 60% reduction in cost per image compared to the original estimation. The cost for processing each image with AWS Inferentia2 was $0.000425. Although the initial requirement to compile models for the AWS Inferentia2 chips introduced an additional time investment, the substantial throughput gains and significant cost reductions justified this effort. For demanding workloads that necessitate high throughput without compromising budget constraints, AWS Inferentia2 instances are certainly worthy of consideration.
Smoothing out traffic spikes – We effectively smoothed out spikes in traffic to provide a continual real-time experience for end users. As shown in the following figure, the SageMaker asynchronous endpoint auto scaling and managed queue prevent significant drift from our goal of single-digit-second latency per image generation.
Scheduled scaling to manage demand – We can scale up and back down on a schedule to cover more predictable traffic demands, reducing inference costs while still meeting demand. The following figure illustrates the impact of auto scaling reacting to unexpected demand as well as scaling up and down on a schedule.
Conclusion
In this post, we discussed the potential benefits of applying SageMaker and AWS Inferentia2 chips within a production-ready generative AI application. SageMaker fully managed asynchronous endpoints give an application time to react to both unexpected and predictable demand in a structured manner, even for high-demand applications such as image-based generative AI. Despite the learning curve involved in compiling the Stable Diffusion XL model to run on AWS Inferentia2 chips, using AWS Inferentia2 allowed us to achieve our demanding low-latency inference requirements, providing an excellent user experience, all while remaining cost-efficient.
To learn more about SageMaker deployment options for your generative AI use cases, refer to the blog series Model hosting patterns in Amazon SageMaker. You can get started with hosting a Stable Diffusion model with SageMaker and AWS Inferentia2 by using the following example.
Discover how Monks serves as a comprehensive digital partner by integrating a wide array of solutions. These encompass media, data, social platforms, studio production, brand strategy, and cutting-edge technology. Through this integration, Monks enables efficient content creation, scalable experiences, and AI-driven data insights, all powered by top-tier industry talent.
About the Authors
Benjamin Moody is a Senior Solutions Architect at Monks. He focuses on designing and managing high-performance, robust, and secure architectures, utilizing a broad range of AWS services. Ben is particularly adept at handling projects with complex requirements, including those involving generative AI at scale. Outside of work, he enjoys snowboarding and traveling.
Karan Jain is a Senior Machine Learning Specialist at AWS, where he leads the worldwide Go-To-Market strategy for Amazon SageMaker Inference. He helps customers accelerate their generative AI and ML journey on AWS by providing guidance on deployment, cost-optimization, and GTM strategy. He has led product, marketing, and business development efforts across industries for over 10 years, and is passionate about mapping complex service features to customer solutions.
Raghu Ramesha is a Senior Gen AI/ML Specialist Solutions Architect with AWS. He focuses on helping enterprise customers build and deploy AI/ML production workloads to Amazon SageMaker at scale. He specializes in generative AI, machine learning, and computer vision domains, and holds a master's degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
Rupinder Grewal is a Senior Gen AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Parag Srivastava is a Senior Solutions Architect at AWS, where he has been helping customers successfully apply generative AI to real-life business scenarios. During his professional career, he has been extensively involved in complex digital transformation projects. He is also passionate about building innovative solutions around geospatial aspects of addresses.
This post is co-authored by Daryl Martis and Darvish Shadravan from Salesforce. This is the fourth post in a series discussing the integration of Salesforce Data Cloud and Amazon SageMaker. In Part 1 and Part 2, we show how Salesforce Data Cloud and Einstein Studio integration with SageMaker allows businesses to access their Salesforce data securely using SageMaker’s tools to build, train, and deploy models to endpoints hosted on SageMaker. SageMaker endpoints can be registered with Salesforce Data Cloud to activate predictions in Salesforce. In Part 3, we demonstrate how business analysts and citizen data scientists can create machine learning (ML) models, without code, in Amazon SageMaker Canvas and deploy trained models for integration with Salesforce Einstein Studio to create powerful business applications. In this post, we show how native integrations between Salesforce and Amazon Web Services (AWS) enable you to Bring Your Own Large Language Models (BYO LLMs) from your AWS account to power generative artificial intelligence (AI) applications in Salesforce. Requests and responses between Salesforce and Amazon Bedrock pass through the Einstein Trust Layer, which promotes responsible AI use across Salesforce. We demonstrate BYO LLM integration by using Anthropic’s Claude model on Amazon Bedrock to summarize a list of open service cases and opportunities on an account record page, as shown in the following figure.
Partner quote
“We continue to expand on our strong collaboration with AWS with our BYO LLM integration with Amazon Bedrock, empowering our customers with more model choices and allowing them to create AI-powered features and Copilots customized for their specific business needs. Our open and flexible AI environment, grounded with customer data, positions us well to be leaders in AI-driven solutions in the CRM space.” –Kaushal Kurapati, Senior Vice President of Product for AI at Salesforce
Amazon Bedrock
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can quickly experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Since Amazon Bedrock is serverless, you don't have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.
Salesforce Data Cloud and Einstein Model Builder
Salesforce Data Cloud is a data platform that unifies your company's data, giving every team a 360-degree view of the customer to drive automation and analytics, personalize engagement, and power trusted AI. Data Cloud creates a holistic customer view by turning volumes of disconnected data into a single, trusted model that's simple to access and understand. With data harmonized within Salesforce Data Cloud, customers can put their data to work to build predictions and generative AI–powered business processes across sales, support, and marketing.
With Einstein Model Builder, customers can build their own models using Salesforce's low-code model builder experience or integrate their own custom-built models into the Salesforce platform. Einstein Model Builder's BYO LLM experience provides the capability to register custom generative AI models from external environments such as Amazon Bedrock and Salesforce Data Cloud. Once custom Amazon Bedrock models are registered in Einstein Model Builder, models are connected through the Einstein Trust Layer, a robust set of features and guardrails that protect the privacy and security of data, improve the safety and accuracy of AI results, and promote the responsible use of AI across Salesforce. Registered models can then be used in Prompt Builder, a newly launched, low-code prompt engineering tool that allows Salesforce admins to build, test, and fine-tune trusted AI prompts that can be used across the Salesforce platform. These prompts can be integrated with Salesforce capabilities such as Flows, Invocable Actions, and Apex.
Solution overview
With the Salesforce Einstein Model Builder BYO LLM feature, you can invoke Amazon Bedrock models in your AWS account. At the time of this writing, Salesforce supports Anthropic Claude 3 models on Amazon Bedrock for BYO LLM. For this post, we use the Anthropic Claude 3 Sonnet model. To learn more about inference with Claude 3, refer to Anthropic Claude models in the Amazon Bedrock documentation. For your implementation, you can use the model of your choice; refer to Bring Your Own Large Language Model in Einstein 1 Studio for the models supported with Salesforce Einstein Model Builder. The following image shows a high-level architecture of how you can integrate the LLM from your AWS account into the Salesforce Prompt Builder.
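Before wiring Salesforce to the model, you can sanity-check from your own AWS account that the same model is reachable. The following is a minimal boto3 sketch; the Region, prompt, and model ID are illustrative, so confirm the exact model ID available in your account and Region:

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Anthropic Claude 3 Sonnet via the Messages API (model ID as commonly documented;
# verify it against the model list in your account and Region).
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [
        {"role": "user", "content": "Summarize the open service cases for account Acme Corp."}
    ],
}

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```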
In this post, we show how to build generative AI–powered Salesforce applications with Amazon Bedrock. The following are the high-level steps involved:
1. Grant Amazon Bedrock invoke model permission to an AWS Identity and Access Management (IAM) user.
2. Register the Amazon Bedrock model in Salesforce Einstein Model Builder.
3. Integrate the prompt template with the field in the Lightning App Builder.
Prerequisites
Before deploying this solution, make sure you meet the following prerequisites:
Have access to Salesforce Data Cloud and meet the requirements for using BYO LLM.
Have Amazon Bedrock set up. If this is the first time you are accessing Anthropic Claude models on Amazon Bedrock, you need to request access, and you need sufficient permissions to request access to models through the console. To request model access, sign in to the Amazon Bedrock console and choose Model access at the bottom of the left navigation pane.
Solution walkthrough
To build generative AI–powered Salesforce applications with Amazon Bedrock, implement the following steps.
Grant Amazon Bedrock invoke model permission to an IAM user
Salesforce Einstein Studio requires an access key and a secret to access the Amazon Bedrock API. Follow the instructions to set up an IAM user and access keys. The IAM user must have Amazon Bedrock invoke model permission to access the model. Complete the following steps:
1. On the IAM console, select Users in the navigation pane.
2. On the right side of the console, choose Add permissions and Create inline policy.
3. On the Specify permissions screen, in the Service dropdown menu, select Bedrock.
4. Under Actions allowed, enter "invoke." Under Read, select InvokeModel.
5. Select All under Resources.
6. Choose Next.
7. On the Review and create screen, under Policy name, enter BedrockInvokeModelPolicy.
8. Choose Create policy.
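If you prefer scripting this instead of using the console, the following boto3 sketch attaches an equivalent inline policy. The user name is a placeholder, and you may want to scope Resource to specific model ARNs rather than "*":

```python
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": "*",  # consider restricting to specific foundation model ARNs
        }
    ],
}

iam.put_user_policy(
    UserName="salesforce-einstein-user",  # hypothetical IAM user name
    PolicyName="BedrockInvokeModelPolicy",
    PolicyDocument=json.dumps(policy_document),
)
```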
Register Amazon Bedrock model in Einstein Model Builder
1. On the Salesforce Data Cloud console, on the Einstein Studio tab, choose Add Foundation Model.
2. Choose Connect to Amazon Bedrock.
3. For Endpoint information, enter the endpoint name, your AWS account access key, and your secret key. Enter the Region and model information.
4. Choose Connect.
5. Create the configuration for the model endpoint you created in the previous steps. Provide inference parameters such as temperature to set the deterministic factor of the LLM. Enter a sample prompt to verify the response.
6. Save this new model configuration. Enter the name for the saved LLM model and choose Create Model.
7. After the model creation is successful, choose Close and proceed to create the prompt template.
8. Select the model name to open the model configuration.
9. Select Create Prompt Template to launch the prompt builder.
10. Select Field Generation as the prompt template type, enter a template name, set Object to Account, and set Object Field to PB Case and Oppty Summary. This associates the template with a custom field in the account record object to summarize the cases.
For this demo, a rich text field named PB Case and Oppty Summary was created and added to the Salesforce Account page layout according to the Add a Field Generation Prompt Template to a Lightning Record Page instructions.
Provide the prompt and input variables or objects for data grounding and select the model. Refer to Prompt Builder to learn more.
Integrate the prompt template with the field in the Lightning App Builder
1. On the Salesforce console, use the search bar to find the Lightning App Builder.
2. Build or edit an existing page to integrate the prompt template with the field, as shown in the following screenshot. Refer to Add a Field Generation Prompt Template to a Lightning Record Page for detailed instructions.
3. Navigate to the Account page and choose the PB Case and Oppty Summary field enabled for chat completion to launch the Einstein generative AI assistant and summarize the account case data.
Clean up
Complete the following steps to clean up your resources:
1. Delete the IAM user.
2. Delete the foundation model in Einstein Studio.
Amazon Bedrock offers on-demand inference pricing, so there are no additional costs from a continued model subscription. To remove model access, refer to the steps in Remove model access.
Conclusion
In this post, we demonstrated how to use your own LLM in Amazon Bedrock to power Salesforce applications. We used summarization of open service cases on an account object as an example to showcase the implementation steps.
Amazon Bedrock is a fully managed service that makes high-performing FMs from leading AI companies and Amazon available for your use through a unified API. You can choose from a wide range of FMs to find the model that is best suited for your use case. Salesforce Einstein Model Builder lets you register your Amazon Bedrock model and use it in Prompt Builder to create prompts grounded in your data. These prompts can then be integrated with Salesforce capabilities such as Flows, Invocable Actions, and Apex. You can then build custom generative AI applications with Claude 3 that are grounded in the Salesforce user experience. Amazon Bedrock requests from Salesforce pass through the Einstein Trust Layer, which provides responsible AI use with features such as dynamic grounding, zero data retention, and toxicity detection while maintaining safety and security standards.
AWS and Salesforce are excited for our mutual customers to harness this integration and build generative AI–powered applications. To learn more and start building, refer to the following resources.
Amazon Bedrock
Amazon Bedrock resources
Bring Your Own Large Language Model in Einstein 1 Studio
Prompt Engineering for Salesforce Developers
The Ultimate Guide to Prompt Builder | Spring '24
About the Authors
Daryl Martis is the Director of Product for Einstein Studio at Salesforce Data Cloud. He has over 10 years of experience in planning, building, launching, and managing world-class solutions for enterprise customers, including AI/ML and cloud solutions. He has previously worked in the financial services industry in New York City. Follow him on LinkedIn.
Darvish Shadravan is a Director of Product Management in the AI Cloud at Salesforce. He focuses on building AI/ML features for CRM, and is the product owner for the Bring Your Own LLM feature. You can connect with him on LinkedIn.
Rachna Chadha is a Principal Solutions Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.
Ravi Bhattiprolu is a Sr. Partner Solutions Architect at AWS. Ravi works with strategic partners Salesforce and Tableau to deliver innovative and well-architected products and solutions that help joint customers realize their business objectives.
Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.
Mike Patterson is a Senior Customer Solutions Manager in the Strategic ISV segment at AWS. He has partnered with Salesforce Data Cloud to align business objectives with innovative AWS solutions to achieve impactful customer experiences. In Mike's spare time, he enjoys spending time with his family, sports, and outdoor activities.
Dharmendra Kumar Rai (DK Rai) is a Sr. Data Architect, Data Lake & AI/ML, serving strategic customers. He works closely with customers to understand how AWS can help them solve problems, especially in the AI/ML and analytics space. DK has many years of experience in building data-intensive solutions across a range of industry verticals, including high-tech, FinTech, insurance, and consumer-facing applications.
Amazon Forecast is a fully managed service that uses statistical and machine learning (ML) algorithms to deliver highly accurate time series forecasts. Launched in August 2019, Forecast predates Amazon SageMaker Canvas, a popular low-code no-code AWS tool for building, customizing, and deploying ML models, including time series forecasting models. With SageMaker Canvas, you get faster model building, cost-effective predictions, advanced features such as a model leaderboard and algorithm selection, and enhanced transparency. You can either use the SageMaker Canvas UI, which provides a visual interface for building and deploying models without needing to write any code or have any ML expertise, or use its automated machine learning (AutoML) APIs for programmatic interactions. In this post, we provide an overview of the benefits SageMaker Canvas offers and details on how Forecast users can transition their use cases to SageMaker Canvas.
Benefits of SageMaker Canvas
Forecast customers have been seeking greater transparency, lower costs, faster training, and enhanced controls for building time series ML models. In response to this feedback, we have made next-generation time series forecasting capabilities available in SageMaker Canvas, which already offers a robust platform for preparing data and building and deploying ML models. With the addition of forecasting, you can now access end-to-end ML capabilities for a broad set of model types, including regression, multi-class classification, computer vision (CV), natural language processing (NLP), and generative artificial intelligence (AI), within the unified user-friendly platform of SageMaker Canvas.
SageMaker Canvas offers up to 50% faster model building performance and up to 45% quicker predictions on average for time series models compared to Forecast across various benchmark datasets. Generating predictions is significantly more cost-effective than Forecast, because costs are based solely on the Amazon SageMaker compute resources used. SageMaker Canvas also provides excellent model transparency by offering direct access to trained models, which you can deploy at your chosen location, along with numerous model insight reports, including access to validation data, model- and item-level performance metrics, and the hyperparameters employed during training.
SageMaker Canvas includes the key capabilities found in Forecast, including the ability to train an ensemble of forecasting models using both statistical and neural network algorithms. It creates the best model for your dataset by generating base models for each algorithm, evaluating their performance, and then combining the top-performing models into an ensemble. This approach leverages the strengths of different models to produce more accurate and robust forecasts. You have the flexibility to select one or several algorithms for model creation, along with the capability to evaluate the impact of model features on prediction accuracy.
SageMaker Canvas simplifies your data preparation with automated solutions for filling in missing values, making your forecasting efforts as seamless as possible. It facilitates out-of-the-box integration of external information, such as country-specific holidays, through simple UI options or API configurations. You can also take advantage of its data flow feature to connect with external data providers' APIs to import data, such as weather information.
Furthermore, you can conduct what-if analyses directly in the SageMaker Canvas UI to explore how various scenarios might affect your outcomes. We will continue to innovate and deliver cutting-edge, industry-leading forecasting capabilities through SageMaker Canvas by lowering latency, reducing training and prediction costs, and improving accuracy. This includes expanding the range of forecasting algorithms we support and incorporating new advanced algorithms to further enhance the model building and prediction experience.
Transitioning from Forecast to SageMaker Canvas
Today, we're releasing a transition package comprising two resources to help you transition your usage from Forecast to SageMaker Canvas. The first component is a workshop where you can get hands-on experience with the SageMaker Canvas UI and APIs and learn how to transition your usage from Forecast to SageMaker Canvas. The second is a Jupyter notebook that shows how to transform your existing Forecast training datasets to the SageMaker Canvas format.
Before we learn how to build forecast models in SageMaker Canvas using your Forecast input datasets, let's understand some key differences between Forecast and SageMaker Canvas:
Dataset types – Forecast uses multiple datasets: target time series, related time series (optional), and item metadata (optional). In contrast, SageMaker Canvas requires only one dataset, eliminating the need for managing multiple datasets.
Model invocation – SageMaker Canvas allows you to invoke the model for a single dataset or a batch of datasets using the UI as well as the APIs. Unlike Forecast, which requires you to first create a forecast and then query it, you simply use the UI or API to invoke the endpoint where the model is deployed to generate forecasts. The SageMaker Canvas UI also gives you the option to deploy the model for inference on SageMaker real-time endpoints. With just a few clicks, you can receive an HTTPS endpoint that can be invoked from within your application to generate forecasts.
In the following sections, we discuss the high-level steps for transforming your data, building a model, and deploying a model with SageMaker Canvas, using either the UI or the APIs.
Build and deploy a model using the SageMaker Canvas UI
We recommend reorganizing your data sources to directly create a single dataset for use with SageMaker Canvas. Refer to Time Series Forecasts in Amazon SageMaker Canvas for guidance on structuring your input dataset to build a forecasting model in SageMaker Canvas. However, if you prefer to continue using multiple datasets as you do in Forecast, you have the following options to merge them into a single dataset supported by SageMaker Canvas:
SageMaker Canvas UI – Use the SageMaker Canvas UI to join the target time series, related time series, and item metadata datasets into one dataset. The following screenshot shows an example data flow created in SageMaker Canvas to merge the three datasets into one SageMaker Canvas dataset.
Python script – Use a Python script to merge the datasets, as in the sketch following this list. For sample code and hands-on experience in transforming multiple Forecast datasets into one dataset for SageMaker Canvas, refer to this workshop.
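As an illustration of the Python option, the following sketch merges three Forecast-style CSVs into a single Canvas-ready file. The file names and column names (item_id, timestamp, and the covariates) are assumptions; adjust them to match your own schemas:

```python
import pandas as pd

# Hypothetical Forecast-style input files and column names; adjust to your schema.
target = pd.read_csv("target_time_series.csv")        # item_id, timestamp, demand
related = pd.read_csv("related_time_series.csv")      # item_id, timestamp, price, promo
metadata = pd.read_csv("item_metadata.csv")           # item_id, category, color

# Join the related time series on item and timestamp, then attach static item metadata.
merged = (
    target
    .merge(related, on=["item_id", "timestamp"], how="left")
    .merge(metadata, on="item_id", how="left")
)

# Single dataset that can be uploaded to SageMaker Canvas for time series forecasting.
merged.to_csv("canvas_dataset.csv", index=False)
```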
When the dataset is ready, use the SageMaker Canvas UI, available on the SageMaker console, to load the dataset into the SageMaker Canvas application, which uses AutoML to train, build, and deploy the model for inference. The workshop shows how to merge your datasets and build the forecasting model. After the model is built, there are multiple ways to generate and consume forecasts:
Make an in-app prediction – You can generate forecasts using the SageMaker Canvas UI and export them to Amazon QuickSight using the built-in integration, or download the prediction file to your local desktop. You can also access the generated predictions from the Amazon Simple Storage Service (Amazon S3) storage location where SageMaker Canvas is configured to store model artifacts, datasets, and other application data. Refer to Configure your Amazon S3 storage to learn more about the Amazon S3 storage location used by SageMaker Canvas.
Deploy the model to a SageMaker endpoint – You can deploy the model to SageMaker real-time endpoints directly from the SageMaker Canvas UI. These endpoints can be queried by developers in their applications with a few lines of code (see the sketch after this list). You can update the code in your existing application to invoke the deployed model. Refer to the workshop for more details.
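The following is a minimal sketch of what those few lines of code might look like once the Canvas model is deployed to a real-time endpoint. The endpoint name and CSV payload format are assumptions, so check the deployed model's expected input schema first:

```python
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint deployed from SageMaker Canvas; the payload format depends on your model.
payload = "item_001,2024-07-01T00:00:00,,,"  # item, timestamp, and empty future covariates

response = sm_runtime.invoke_endpoint(
    EndpointName="canvas-forecast-endpoint",
    ContentType="text/csv",
    Body=payload,
)

print(response["Body"].read().decode("utf-8"))
```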
Build and deploy a model using the SageMaker Canvas (Autopilot) APIs
You can use the sample code provided in the notebook in the GitHub repo to process your datasets, including target time series data, related time series data, and item metadata, into the single dataset needed by the SageMaker Canvas APIs. Next, use the SageMaker AutoML API for time series forecasting to process the data, train the ML model, and deploy the model programmatically (a brief sketch follows the resources list below). Refer to the sample notebook in the GitHub repo for a detailed implementation of how to train a time series model and produce predictions using the model, and refer to the workshop for more hands-on experience.
Conclusion
In this post, we outlined the steps to transition from Forecast and build time series ML models in SageMaker Canvas, and provided a data transformation notebook and prescriptive guidance through a workshop. After the transition, you can benefit from a more accessible UI, cost-effectiveness, and the higher transparency of the underlying AutoML API in SageMaker Canvas, democratizing time series forecasting within your organization and saving time and resources on model training and deployment.
SageMaker Canvas can be accessed from the SageMaker console. Time series forecasting with Canvas is available in all Regions where SageMaker Canvas is available. For more information about AWS Region availability, see AWS Services by Region.
Resources
For more information, see the following resources:
Refer to the workshop to get hands-on experience with SageMaker Canvas
Refer to Time Series Forecasts in Amazon SageMaker Canvas for information on time series forecasting using the SageMaker Canvas UI
Refer to Create an AutoML job for time-series forecasting using the API for information on time series forecasting using the SageMaker Canvas API
Refer to Time-Series Forecasting with Amazon SageMaker Autopilot on GitHub for a notebook showing a sample implementation to train a time series model and produce predictions using AutoML APIs
To learn how to include weather data in your forecasting model, see Use weather data to improve forecasts with Amazon SageMaker Canvas
To learn how to set up monitoring for your forecasting model for accuracy drift and automatically retrain the model based on the drift threshold, refer to Automated Time-series Performance Monitoring and Retraining using Amazon SageMaker Autopilot on GitHub
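As referenced above, the following is a brief, hedged sketch of what a programmatic time series AutoML job can look like with boto3. The bucket, role ARN, schedule, and column names are placeholders, and the parameters should be checked against the CreateAutoMLJobV2 documentation before use:

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical S3 locations, role, and schema; verify parameters against the CreateAutoMLJobV2 docs.
sm.create_auto_ml_job_v2(
    AutoMLJobName="canvas-ts-forecast-job",
    AutoMLJobInputDataConfig=[
        {
            "ChannelType": "training",
            "ContentType": "text/csv;header=present",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-forecast-bucket/canvas_dataset.csv",
                }
            },
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-forecast-bucket/automl-output/"},
    AutoMLProblemTypeConfig={
        "TimeSeriesForecastingJobConfig": {
            "ForecastFrequency": "D",
            "ForecastHorizon": 14,
            "TimeSeriesConfig": {
                "TargetAttributeName": "demand",
                "TimestampAttributeName": "timestamp",
                "ItemIdentifierAttributeName": "item_id",
            },
        }
    },
    RoleArn="arn:aws:iam::111122223333:role/MySageMakerExecutionRole",
)
```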
About the Authors
Nirmal Kumar is a Sr. Product Manager for the Amazon SageMaker service. Committed to broadening access to AI/ML, he steers the development of no-code and low-code ML solutions. Outside work, he enjoys travelling and reading non-fiction.
Dan Sinnreich is a Sr. Product Manager for Amazon SageMaker, focused on expanding no-code / low-code services. He is dedicated to making ML and generative AI more accessible and applying them to solve challenging problems. Outside of work, he can be found playing hockey, scuba diving, and reading science fiction.
Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since very young, starting to code at the age of 7. He started learning AI/ML in his later years of university, and has fallen in love with it since then.
Biswanath Hore is a Solutions Architect at Amazon Web Services. He works with customers early in their AWS journey, helping them adopt cloud solutions to address their business needs. He is passionate about Machine Learning and, outside of work, loves spending time with his family.
Amazon Q Business is the generative artificial intelligence (AI) assistant that empowers employees with your company's knowledge and data. Microsoft SharePoint Online is used by many organizations as a secure place to store, organize, share, and access their internal data. With generative AI, employees can get answers to their questions, summarize content, or generate insights from data stored in SharePoint Online. Using Amazon Q Business connectors, you can connect SharePoint Online data to an Amazon Q Business application and start gaining insights from your data quickly.
This post demonstrates how to use Amazon Q Business with SharePoint Online as the data source to provide answers, generate summaries, and present insights using least privilege access controls and best practices recommended by the Microsoft SharePoint Dev Support Team.
Solution overview
In this post, we walk you through the process of setting up an Amazon Q Business application that connects to your SharePoint Online sites using an out-of-the-box Amazon Q Business connector and configuring it with the Sites.Selected application permission scope. The Sites.Selected permission is important because many organizations implement policies that prevent granting read access on all sites (Sites.Read.All) or full control (Sites.FullControl.All) to any connector.
The solution respects users' existing identities, roles, and permissions by enabling identity crawling and access control lists (ACLs) on the Amazon Q Business connector for SharePoint Online, using secure credentials facilitated through AWS Secrets Manager. If a user doesn't have permissions to access certain data without Amazon Q Business, then they can't access it using Amazon Q Business either. Only the data the user has access to is used to support the user query.
Prerequisites
The following are the prerequisites necessary to deploy the solution:
An AWS account with an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for the application. If you don't have an AWS account, see How do I create and activate a new Amazon Web Services account?
An Amazon Q Business application. If you haven't set one up yet, see Creating an Amazon Q Business application environment.
A Microsoft account and a SharePoint Online subscription to create and publish the application using the steps outlined in this post. If you don't have this, check with your organization admins to create sandboxes for you to experiment in, or create a new account and trial subscription as needed to complete the steps.
An application in Microsoft Entra ID with Sites.FullControl application-level permissions, along with its client ID and client secret. This application won't be used by the Amazon Q Business connector, but it's needed to grant Sites.Selected permissions exclusively to the target application.
Register a new app in the Microsoft Azure portal
Complete the following steps to register a new app in the Microsoft Azure portal:
Log in to the Azure Portal with your Microsoft account. Choose New registration.
For Name, provide the name for your application. For this post, we use the name TargetApp. The Amazon Q Business application uses TargetApp to connect to the SharePoint Online site to crawl and index the data. For Who can use this application or access this API, choose Accounts in this organizational directory only (<Tenant name> only – Single tenant). Choose Register.
Note down the application (client) ID and the directory (tenant) ID on the Overview page. You’ll need them later when asked for TargetApp-ClientId and TenantId. Choose API permissions under Manage in the navigation pane. Choose Add a permission to allow the application to read data in your organization’s directory about the signed-in user.
Choose Microsoft Graph. Choose Delegated permissions. Choose User.Read.All from the User section. Choose GroupMember.Read.All from the GroupMember section. Choose Sites.Selected from the Sites section. Choose Add permissions.
On the options menu (three dots), choose Remove permission. Remove the original User.Read – Delegated permission. Choose Grant admin consent for Default Directory.
Choose Certificates & secrets in the navigation pane. Choose New client secret.
For Description, enter a description. Choose a value for Expires. Note that in production, you’ll need to manually rotate your secret before it expires. Choose Add. Note down the value for your new secret. You’ll need it later when asked for your client secret (TargetApp-ClientSecret).
Optionally, choose Owners to add any additional owners for the application. Owners will be able to manage permissions of the Azure AD application (TargetApp).
Use the Graph API to grant permissions to the application on the SharePoint Online site
In this step, you define which of your SharePoint Online sites will be granted access to TargetApp. The Amazon Q Business application uses TargetApp to connect to the SharePoint Online site to crawl and index the data. For this post, we use Postman, a platform for working with APIs, to grant permissions. To grant permissions to a specific SharePoint Online site, you need another Azure AD application, which we refer to as AdminApp, with Sites.FullControl.All permissions. If you don’t have the prerequisite AdminApp, follow the previous steps to register AdminApp and, for Application Permissions, grant Sites.FullControl.All permissions. As mentioned in the prerequisites, AdminApp is used only to grant SharePoint Online site access permissions to TargetApp. We use the ClientId and ClientSecret values of AdminApp from the Azure AD application to get an AccessToken value.
Create a POST request in Postman with the URL https://login.microsoftonline.com/{TenantId}/oauth2/v2.0/token. In the body of the request, choose x-www-form-urlencoded and set the following key-value pairs:
Set client_id to AdminApp-ClientId. Set client_secret to AdminApp-ClientSecret. Set grant_type to client_credentials. Set scope to https://graph.microsoft.com/.default.
Choose Send. From the returned response, copy the value of access_token. You need it in a later step when asked for the bearer token. Use the value of access_token from the previous step to grant permissions to TargetApp.
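If you prefer a script to Postman, the same token request can be made with a few lines of Python. The sketch below assumes the requests library and uses placeholder values for the tenant ID and the AdminApp client ID and secret; it is an illustrative equivalent of the Postman call above, not part of the original walkthrough.

```python
# Hypothetical sketch: requesting an app-only access token for Microsoft Graph
# using the client-credentials flow, equivalent to the Postman request above.
# TENANT_ID, ADMIN_APP_CLIENT_ID, and ADMIN_APP_CLIENT_SECRET are placeholders
# for your own AdminApp values.
import requests

TENANT_ID = "<TenantId>"
ADMIN_APP_CLIENT_ID = "<AdminApp-ClientId>"
ADMIN_APP_CLIENT_SECRET = "<AdminApp-ClientSecret>"

token_url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
payload = {
    "client_id": ADMIN_APP_CLIENT_ID,
    "client_secret": ADMIN_APP_CLIENT_SECRET,
    "grant_type": "client_credentials",
    "scope": "https://graph.microsoft.com/.default",
}

# The token endpoint expects form-encoded data (x-www-form-urlencoded).
response = requests.post(token_url, data=payload, timeout=30)
response.raise_for_status()
access_token = response.json()["access_token"]
print(access_token[:20] + "...")  # bearer token used in the next step
```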
Get the SiteId of the SharePoint Online site by visiting your site URL (for example, https://<yourcompany>.sharepoint.com/sites/{SiteName}) in a browser. You need to log in to the site by providing valid credentials to access the site. Edit the URL in the browser address bar to append /_api/site/id at the end of {SiteName} to get the SiteId. You need this SiteId in the next step.
Create another POST request in Postman using the URL https://graph.microsoft.com/v1.0/sites/{SiteId}/permissions. Replace {SiteId} in the URL of the request with the SiteId from the previous step.
You can repeat this step for each site you want to include in the Amazon Q Business SharePoint Online connector.
Choose Bearer Token for Type on the Authorization tab. Enter the value of access_token from earlier for Token.
For the payload, select raw and enter the JSON code for the permission grant (replace the <<TargetApp-ClientId>> and <<TargetApp-Name>> values); a representative payload appears in the sketch after the next step.
Choose Send to complete the process of granting SharePoint Online sites access to the TargetApp Azure AD application.
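For reference, the following Python sketch performs the same grant through the Microsoft Graph sites permissions endpoint, including a representative JSON payload for the step above. The role value, placeholder strings, and variable names are assumptions for illustration; verify the payload shape against the current Microsoft Graph documentation before using it in your tenant.

```python
# Hypothetical sketch: granting TargetApp Sites.Selected access to one site via
# the Microsoft Graph sites permissions endpoint. The JSON body is a
# representative example of the payload referenced above.
import requests

ACCESS_TOKEN = "<access_token from the previous step>"
SITE_ID = "<SiteId>"  # from https://<yourcompany>.sharepoint.com/sites/<SiteName>/_api/site/id
TARGET_APP_CLIENT_ID = "<TargetApp-ClientId>"
TARGET_APP_NAME = "TargetApp"

grant_url = f"https://graph.microsoft.com/v1.0/sites/{SITE_ID}/permissions"
body = {
    "roles": ["read"],  # "read" is typically sufficient for crawling; adjust if the connector requires more
    "grantedToIdentities": [
        {
            "application": {
                "id": TARGET_APP_CLIENT_ID,
                "displayName": TARGET_APP_NAME,
            }
        }
    ],
}

response = requests.post(
    grant_url,
    json=body,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the created permission object, including its id
```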
Configure the Amazon Q Business SharePoint Online connector
Complete the following steps to configure the Amazon Q Business application’s SharePoint Online connector:
On the Amazon Q Business console, choose Add Data source. Search for and choose SharePoint. Give it a name and description (optional). Choose SharePoint Online for Hosting method under Source settings. For Site URLs specific to your SharePoint repository, provide the full URL of each SharePoint site you want to include in crawling and indexing.
If the full URL of the site is https://<yourcompany>.sharepoint.com/sites/anycompany, use <yourcompany> as the value for Domain.
Choose OAuth 2.0 authentication for Authentication method. Provide the value of TenantId for TenantId.
The SharePoint connector needs credentials to connect to the SharePoint Online site using the Microsoft Graph API. To facilitate this, create a new Secrets Manager secret. These credentials will not appear in any access logs for the SharePoint Online site.
Choose Create and add a new secret. Enter a name for the secret. Enter the user name and password of a SiteCollection administrator on the sites included in the Amazon Q repository. Enter your client ID and client secret that you got from registering TargetApp in the previous steps. Choose Save.
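If you prefer to create the secret ahead of time rather than through the console wizard, a boto3 sketch like the following can be used. The secret name and the JSON key names (userName, password, clientId, clientSecret) are assumptions for illustration; align them with the field names the connector configuration expects.

```python
# Hypothetical sketch: creating the Secrets Manager secret the connector reads.
# The key names below are assumptions; match them to what the Amazon Q Business
# console expects for the SharePoint Online connector.
import json
import boto3

secrets = boto3.client("secretsmanager", region_name="us-east-1")

secret_value = {
    "userName": "<SiteCollection admin user name>",
    "password": "<SiteCollection admin password>",
    "clientId": "<TargetApp-ClientId>",
    "clientSecret": "<TargetApp-ClientSecret>",
}

secrets.create_secret(
    Name="qbusiness-sharepoint-online-credentials",
    Description="Credentials for the Amazon Q Business SharePoint Online connector",
    SecretString=json.dumps(secret_value),
)
```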
Choose Create a new service role to create an IAM role, and enter a name for the role. For Sync scope, choose Select entities and choose All (or specify the combination of items to sync). Choose a sync option based on your needs (on demand or at a frequency of your choice). For this post, we choose on-demand. Choose Add data source. After the data source is created, choose Sync now to start the crawling and indexing.
Test the solution
To test the solution, you can add users and groups, assign subscriptions, and test user and group access within your Amazon Q Business application.
Clean up
If you’re only experimenting using the steps in this post, delete your application from the Azure Portal and delete the Amazon Q Business application from the Amazon Q Business console to avoid incurring costs.
Conclusion
In this post, we discussed how to configure the Amazon Q Business SharePoint Online connector using least privilege, site-level access controls to crawl and index SharePoint Online site content securely. We also demonstrated how to retain and apply ACLs while responding to user conversations. Organizations can now use their existing SharePoint Online data to gain better insights, generate summaries, and get answers to natural language queries in a conversational way using Amazon Q Business. By connecting SharePoint Online as a data source, employees can interact with the organization’s knowledge and data stored in SharePoint using natural language, making it effortless to find relevant information, extract key points, and derive valuable insights. This can significantly improve productivity, decision-making, and knowledge sharing within the organization. Try out the solution in this post, and leave your feedback and questions in the comments section.
About the Authors Surendar Gajavelli is a Sr. Solutions Architect based out of Nashville, TN. He is a passionate technology enthusiast who enjoys working with customers and helping them build innovative solutions. Abhi Patlolla is a Sr. Solutions Architect based out of the NYC region, helping customers in their cloud transformation, AI/ML, and data initiatives. He is a strategic and technical leader, advising executives and engineers on cloud strategies to foster innovation and positive impact.
The LMSys Chatbot Arena has recently released scores for GPT-4o Mini, sparking discussion among AI researchers. According to the results, GPT-4o Mini outperformed Claude 3.5 Sonnet, which is frequently praised as the most intelligent Large Language Model (LLM) on the market. This ranking prompted a closer look at the factors behind GPT-4o Mini’s exceptional performance.
To address curiosity about the rankings, LMSys released a random selection of one thousand real user prompts, contrasting the answers of GPT-4o Mini with those of Claude 3.5 Sonnet and other LLMs. A recent Reddit post shared significant insights into why GPT-4o Mini frequently outperformed Claude 3.5 Sonnet.
GPT-4o Mini’s critical success factors are as follows:
Refusal Rate: One of the key areas where GPT-4o Mini shines is its lower refusal rate. Whereas Claude 3.5 Sonnet occasionally chooses not to respond to specific prompts, GPT-4o Mini answers more consistently. This quality fits the needs of users who prefer a more cooperative LLM that attempts to answer every question, no matter how difficult or peculiar.
Length of Response: GPT-4o Mini frequently offers more thorough and extended responses than Claude 3.5 Sonnet. Claude 3.5 Sonnet strives for succinct answers, whereas GPT-4o Mini tends to be exhaustively detailed. This thoroughness can be especially appealing when people are looking for in-depth details or explanations of certain topics.
Formatting and Presentation: GPT-4o Mini performs noticeably better than Claude 3.5 Sonnet in the formatting and presentation of replies. GPT-4o Mini uses headers, varied font sizes, bolding, and efficient whitespace management to improve the readability and aesthetic appeal of its responses. Claude 3.5 Sonnet, on the other hand, styles its outputs minimally. As a result of this presentational difference, GPT-4o Mini’s responses may be more engaging and simpler to understand.
A prevalent idea among some users holds that the average human assessor lacks the discernment needed to judge the correctness of LLM responses. This idea, however, does not hold for LMSys: the majority of users ask questions they are able to evaluate fairly, and GPT-4o Mini’s winning answers were typically superior in at least one important prompt-related respect.
LMSys prompts cover a wide range of topics, from challenging assignments such as arithmetic, coding, and reasoning problems to more everyday questions about entertainment or routine task support. Both Claude 3.5 Sonnet and GPT-4o Mini can provide accurate responses despite their differing levels of sophistication, but GPT-4o Mini has an advantage in simpler cases because of its superior formatting and its willingness to answer rather than refuse.
In conclusion, GPT-4o Mini outperforms Claude 3.5 Sonnet on LMSys because of its superior formatting, lengthier and more thorough responses, and lower refusal rate. These features meet the needs of the typical LMSys user, who prioritizes readability, thorough responses, and more cooperation from the LLM. Maintaining the top spots on platforms like LMSys will become harder as the LLM landscape evolves, necessitating constant updates and refinements to the models. The post Why GPT-4o Mini Outperforms Claude 3.5 Sonnet on LMSys? appeared first on MarkTechPost.
TensorOpera has announced the launch of its groundbreaking small language model, Fox-1, through an official press release. This innovative model represents a significant step forward in small language models (SLMs), setting new benchmarks for scalability and performance in generative AI, particularly for cloud and edge computing applications.
Fox-1-1.6B boasts a 1.6 billion parameter architecture, distinguishing it from other SLMs due to its superior performance and efficiency. The model has been meticulously designed to cater to the needs of developers and enterprises aiming for scalable and efficient AI deployment. It surpasses similar models from industry giants such as Apple, Google, and Alibaba.
A key feature of Fox-1 is its integration into TensorOpera’s AI and FedML platforms. This integration facilitates the deployment, training, and creation of AI applications across various platforms and devices, ranging from high-powered GPUs in the cloud to edge devices like smartphones and AI-enabled PCs. This versatility underscores TensorOpera’s commitment to providing a scalable, generative AI platform that enhances ownership and efficiency across diverse computing environments.
Image Source
SLMs, including Fox-1, offer several advantages over larger language models (LLMs). They are designed to operate with significantly reduced latency and require less computational power, making them ideal for environments with limited resources. This efficiency translates into faster data processing and lower costs, which is critical for deploying AI in various settings, from mobile devices to server-constrained environments.
Fox-1 is particularly noteworthy for its incorporation into composite AI architectures like Mixture of Experts (MoE) and model federation systems. These configurations leverage multiple SLMs working together to create more powerful systems capable of handling complex tasks such as multilingual processing and predictive analytics from various data sources.
Fox-1’s architecture is a decoder-only transformer-based model with 1.6 billion parameters, trained on a comprehensive dataset comprising 3 trillion tokens of text and code data. The model’s design includes Grouped Query Attention (GQA), enhancing its query processing efficiency and significantly improving inference latency and response times. This advanced architectural design allows Fox-1 to outperform competitors on standard benchmarks, demonstrating its robustness and capability.
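To make the GQA idea concrete, here is a minimal PyTorch sketch of grouped query attention in general, where a small number of key/value heads are shared across groups of query heads, shrinking the KV cache and speeding up inference. The head counts and dimensions are illustrative assumptions and do not reflect Fox-1’s actual configuration.

```python
# Minimal toy sketch of grouped query attention (GQA): many query heads share a
# smaller set of key/value heads. Illustrative only, not Fox-1's implementation.
import torch
import torch.nn.functional as F

batch, seq_len, d_model = 2, 16, 256
n_q_heads, n_kv_heads = 8, 2          # 4 query heads share each KV head
head_dim = d_model // n_q_heads

x = torch.randn(batch, seq_len, d_model)

w_q = torch.nn.Linear(d_model, n_q_heads * head_dim, bias=False)
w_k = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
w_v = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)

q = w_q(x).view(batch, seq_len, n_q_heads, head_dim).transpose(1, 2)
k = w_k(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)
v = w_v(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)

# Repeat each KV head so it is shared by a group of query heads.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).reshape(batch, seq_len, n_q_heads * head_dim)
print(out.shape)  # torch.Size([2, 16, 256])
```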
Image Source
Performance evaluations reveal that Fox-1 excels in various benchmarks, including ARC Challenge, HellaSwag, TruthfulQA, MMLU, Winogrande, and GSM8k. It consistently outperforms models like Gemma-2B, Qwen1.5-1.8B, StableLM-2-1.6B, and OpenELM1.1B, showcasing its superior performance despite having fewer parameters than some of these competitors.
Regarding inference efficiency, Fox-1 demonstrates impressive throughput, achieving over 200 tokens per second on the TensorOpera model serving platform. This high throughput is attributed to its efficient architectural design, particularly the GQA mechanism. Fox-1’s memory efficiency also makes it suitable for on-device deployment, requiring significantly less GPU memory than its peers.
Image Source
Integrating Fox-1 into TensorOpera’s product suite enhances its versatility, enabling seamless deployment and training across cloud and edge environments. This integration empowers AI developers to leverage the comprehensive capabilities of the TensorOpera AI Platform for cloud-based training and subsequently deploy and personalize these solutions on edge devices via the TensorOpera FedML platform. This approach offers cost efficiency and enhanced privacy and provides personalized user experiences.
In conclusion, TensorOpera’s Fox-1 is a pioneering model in the SLM landscape, setting new standards for performance and efficiency. Its versatile integration into cloud and edge platforms makes it a formidable tool for developers and enterprises seeking scalable AI solutions. TensorOpera is releasing the base version of Fox-1 under the Apache 2.0 license to facilitate broad adoption, allowing free use for production and research purposes. An instruction-tuned version is also in the pipeline, promising even greater capabilities.
Check out the Model and Details. All credit for this research goes to the researchers of this project. The post TensorOpera Unveils Fox Foundation Model: A Unique Step in Small Language Models Enhancing Scalability and Efficiency for Cloud and Edge Computing appeared first on MarkTechPost.
In the past decade, the data-driven approach built on deep neural networks has driven artificial intelligence successes in challenging applications across many fields. These advancements address numerous problems; however, existing methodologies still face obstacles in data science applications, especially in fields such as biology, healthcare, and business, because they demand deep expertise and advanced coding skills. Moreover, a significant barrier in this area is the lack of communication between domain experts and advanced artificial intelligence models.
In recent years, the fast progress in Large Language Models (LLMs) has opened up many possibilities in artificial intelligence. Some well-known LLMs are GPT-3, GPT-4, PaLM, LLaMA, and Qwen. These models have great potential to understand, generate, and apply natural language. These advancements have created a medium for LLM-powered agents that are now being developed to solve problems in search engines, software engineering, gaming, recommendation systems, and scientific experiments. These agents are often guided by a chain of thought (CoT) like ReAct and can use tools such as APIs, code interpreters, and retrievers. The methods discussed in this paper include (a) Enhancing LLMs with Function Calling, and (b) Powering LLMs by Code Interpreter.
A team of researchers from Hong Kong Polytechnic University has introduced LAMBDA, a new open-source and code-free multi-agent data analysis system developed to overcome the lack of effective communication between domain experts and advanced AI models. LAMBDA provides an essential medium that allows smooth interaction between domain knowledge and AI capabilities in data science. This method solves numerous problems like removing coding barriers, integrating human intelligence with AI, and reshaping data science education, promising reliability and portability. Reliability means LAMBDA can address the tasks of data analysis stably and correctly. Portability means it is compatible with various LLMs, allowing it to be enhanced by the latest state-of-the-art models.
The proposed method, LAMBDA, a multi-agent data analysis system, contains two agents that work together to solve data analysis tasks using natural language. The process starts with writing code based on user instructions and then executing that code. The two main roles of LAMBDA are the “programmer” and the “inspector.” The programmer writes code according to the user’s instructions and dataset. This code is then run on the host system. If the code encounters any errors during execution, the inspector plays the role of suggesting improvements. The programmer uses these suggestions to fix the code and submit it for re-evaluation.
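A minimal sketch of this programmer/inspector loop, with the two LLM-backed agents stubbed out, might look like the following; the function names, retry budget, and execution method are assumptions for illustration rather than LAMBDA’s actual implementation.

```python
# Hedged sketch of the programmer/inspector loop described above. The two agent
# functions are stand-ins for LLM calls; LAMBDA's real prompts, APIs, and
# execution sandbox are not reproduced here.
import subprocess
import sys
import tempfile

def programmer_agent(instruction: str, feedback: str = "") -> str:
    """Would call an LLM to write (or revise) analysis code; stubbed here."""
    return "print('analysis placeholder')"

def inspector_agent(code: str, error: str) -> str:
    """Would call an LLM to suggest a fix for the failing code; stubbed here."""
    return f"Fix the following error and retry: {error}"

def run(code: str) -> tuple[bool, str]:
    """Execute generated code in a subprocess and capture any error output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return result.returncode == 0, result.stderr

def lambda_loop(instruction: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        code = programmer_agent(instruction, feedback)
        ok, error = run(code)
        if ok:
            return code                          # code executed successfully
        feedback = inspector_agent(code, error)  # inspector suggests improvements
    raise RuntimeError("Could not produce working code within the retry budget.")

print(lambda_loop("Train a classifier on the provided dataset."))
```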
The results of the experiments show that LAMBDA performs well in machine learning tasks. On classification tasks, it achieved the highest accuracy rates of 89.67%, 100%, 98.07%, and 98.89% for the AIDS, NHANES, Breast Cancer, and Wine datasets, respectively. For regression tasks, it achieved the lowest MSE (Mean Squared Error) of 0.2749, 0.0315, 0.4542, and 0.2528, respectively. These results highlight its effectiveness across a variety of data science applications. Moreover, LAMBDA successfully overcame the coding barrier without any human involvement in the entire process of these experiments, connecting data science with domain experts who lack coding skills.
In this paper, a team of researchers from Hong Kong Polytechnic University has proposed LAMBDA, a new open-source, code-free multi-agent data analysis system that combines human intelligence with AI. The experimental results show that it performs well on data analysis tasks, and in the future it can be improved with planning and reasoning techniques. By bridging the gap between human expertise and AI capabilities, LAMBDA makes data science and analysis accessible to users with no coding skills, encouraging more innovation and discovery in the future.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project. The post LAMBDA: A New Open-Source, Code-Free Multi-Agent Data Analysis System to Bridge the Gap Between Domain Experts and Advanced AI Models appeared first on MarkTechPost.
Competition significantly shapes human societies, influencing economics, social structures, and technology. Traditional research on competition, relying on empirical studies, is limited by data accessibility and lacks micro-level insights. Agent-based modeling (ABM) emerged to overcome these limitations, progressing from rule-based to machine learning-based agents. However, these approaches still struggle to accurately simulate complex human behavior. The advent of Large Language Models (LLMs) has enabled the creation of autonomous agents for social simulations. While recent work has explored LLM-based agents in various environments, studies specifically examining competition dynamics remain sparse. This gap hinders a comprehensive understanding of competition across different domains.
Empirical studies on competition have uncovered valuable insights, such as inter-team competition fostering intra-team cooperation and the “Matthew Effect” in academia. However, these studies face limitations in controlling variables and collecting comprehensive data. Recent advancements in LLM-empowered-ABM have revolutionized social simulations. Notable projects include the Generative Agent, which established a foundational framework for agent designs, and studies exploring information dissemination, recommendation systems, and macroeconomic environments. Significant progress has also been made in collaborative cooperation simulations.
Despite these advancements, research on competition mechanisms using LLM-based agents remains limited. Existing studies have explored auction scenarios and corporate competition, but they fall short of simulating complex competitive environments and thoroughly analyzing competitive behaviors and system evolution. This gap in research presents an opportunity for more comprehensive studies on competition dynamics using LLM-based agent simulations, which could overcome the limitations of traditional empirical studies and provide deeper insights into competitive phenomena.
Researchers from the University of Science and Technology of China, Microsoft Research, William & Mary, Georgia Institute of Technology, and Carnegie Mellon University introduce CompeteAI, a comprehensive framework to study competition dynamics between LLM-based agents. The framework consists of environment selection, setup, simulation execution, and analysis. Using GPT-4, researchers developed a virtual town simulation with restaurant and customer agents. Restaurant agents compete to attract customers, driving continuous evolution and innovation. Customer agents, with diverse characteristics, act as judges by selecting restaurants and providing feedback. This setup allows for a detailed examination of competitive behaviors and system evolution. The framework begins with selecting an appropriate competition context, followed by environment setup, running experiments to capture agent interactions, and finally analyzing behaviors to derive insights into competition dynamics. Also, the framework’s core component is creating a competitive environment with meticulously designed competitors, judges, and interactions. Constraints, such as resource and service limitations for competitors or financial restrictions for judges, are crucial for success. The design is inspired by resource dependence theory, where competition for resources influences organizational behavior and strategies.
The CompeteAI framework implements a simulated small-town environment with two competing restaurants and 50 diverse customers. The simulation runs for 15 days or until one restaurant quits. Both restaurants and customers are powered by GPT-4 (0613) LLM-based agents. Restaurant agents manage their establishments through pre-defined actions like modifying menus, managing chefs, and creating advertisements. Customer agents, either individuals or groups, choose restaurants daily based on provided information and leave feedback after meals.
To overcome challenges in practical implementation, the researchers developed a comprehensive restaurant management system with APIs, allowing text-based LLM agents to interact effectively with the simulated environment. The system incorporates diverse customer characteristics and relationships to trigger more realistic competitive behaviors. Restaurant agents analyze daily information, design strategies, and interact with the management system, storing summaries for future planning. Customer agents, with varying characteristics and group dynamics, make decisions based on restaurant information, personal preferences, and group discussions. Also, this framework includes a dish quality evaluation mechanism, considering factors such as the chef’s skill level, dish cost, and selling price. This empirical approach ensures a realistic representation of service quality in a competitive environment.
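A highly simplified sketch of the daily simulation loop described above might look like the following; the agent functions are placeholders for GPT-4 calls, and the scoring rule, action space, and bookkeeping are assumptions that are far simpler than the actual CompeteAI restaurant management system.

```python
# Hedged toy sketch of the CompeteAI-style daily loop: restaurants update their
# offers, customers choose and (implicitly) give feedback, and the cycle repeats.
import random

def restaurant_agent_update(name: str, history: list) -> dict:
    """Would call an LLM with yesterday's customer flow, feedback, and rival info."""
    return {"name": name, "menu_quality": random.uniform(0, 1), "price": random.uniform(5, 20)}

def customer_agent_choose(offers: list, preference: float) -> dict:
    """Would call an LLM with restaurant ads and personal preferences; a toy rule here."""
    return max(offers, key=lambda o: o["menu_quality"] - preference * o["price"] / 20)

def simulate(days: int = 15, n_customers: int = 50) -> dict:
    history: list = []
    tally = {"Restaurant 1": 0, "Restaurant 2": 0}
    for _ in range(days):
        offers = [restaurant_agent_update(name, history) for name in tally]
        for _ in range(n_customers):
            choice = customer_agent_choose(offers, preference=random.random())
            tally[choice["name"]] += 1
        history.append(offers)  # feedback the restaurant agents would analyze next day
    return tally

print(simulate())
```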
The researchers conducted experiments with 9 runs for individual customers and 6 runs for group customers. This analysis covered both micro-level and macro-level perspectives:
Micro-level results revealed the sophisticated behavior of LLM-based agents in the CompeteAI framework. Agents demonstrated contextual perception, analyzing scenarios from “shallow to deep” – examining customer flow trends, dish feedback, and rival actions before deeper strategic analysis. They employed classic market strategies including differentiation, imitation, customer orientation, and social learning. Customer decisions were influenced by multiple factors, with “satisfaction of needs” being crucial for all. In particular, individual customers valued the restaurant’s reputation more, while groups were more open to exploring new options, showcasing the framework’s ability to simulate diverse consumer behaviors.
The macro-level analysis uncovered several significant phenomena in the simulated competitive environment. Strategy dynamics exhibited a complex interplay of differentiation and imitation behaviors between competing restaurants. The Matthew Effect was observed, where initial advantages led to continued success for one restaurant through positive feedback loops. Interestingly, customer grouping diminished the “winner-take-all” phenomenon, occurring less frequently for group customers (16.7%) compared to individual customers (66.7%). Perhaps most importantly, competition consistently improved overall product quality. In 86.67% of cases, the average dish score in at least one restaurant improved over time, with average dish scores increasing by 0.26 for Restaurant 1 and 0.22 for Restaurant 2 from Day 1 to Day 15.
These findings demonstrate the complex dynamics of competition between LLM-based agents and provide insights into market behaviors, customer decision-making, and the impact of competition on service quality in simulated environments.
The CompeteAI framework introduces an innovative approach to studying competition dynamics using LLM-based agents. By simulating a virtual town with competing restaurants and diverse customers, the study reveals sophisticated agent behaviors aligning with classic economic and sociological theories. Key findings include the emergence of complex strategy dynamics, the Matthew Effect, and the impact of customer grouping on market outcomes. The research demonstrates that LLM-based agents can effectively simulate competitive environments, consistently improving product quality over time. This innovative framework offers valuable insights for future studies in sociology, economics, and human behavior, providing a promising platform for interdisciplinary research in controlled, realistic settings.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. The post CompeteAI: An Artificial Intelligence AI Framework that Understands the Competition Dynamics of Large Language Model-based Agents appeared first on MarkTechPost.
Evaluating model performance is essential in the rapidly advancing fields of Artificial Intelligence and Machine Learning, especially with the introduction of Large Language Models (LLMs). This evaluation process helps researchers understand these models’ capabilities and build dependable systems on top of them. However, what are referred to as Questionable Research Practices (QRPs) frequently jeopardize the integrity of these assessments. These practices can greatly exaggerate published results, misleading the scientific community and the general public about the actual effectiveness of ML models.
The primary driving force for QRPs is the ambition to publish in esteemed journals or to attract funding and users. Due to the intricacy of ML research, which includes pre-training, post-training, and evaluation stages, there is much potential for QRPs. Contamination, cherrypicking, and misreporting are the three basic categories these actions fall into.
Contamination
When data from the test set is used for training, assessment, or even model prompts, this is known as contamination. High-capacity models such as LLMs can remember test data that is exposed during training. Researchers have provided extensive documentation on this problem, detailing cases in which models were purposefully or unintentionally trained using test data. There are various ways that contamination can occur, which are as follows.
Training on the Test Set: This results in unduly optimistic performance predictions when test data is unintentionally added to the training set.
Prompt Contamination: During few-shot evaluations, using test data in the prompt gives the model an unfair advantage.
Retrieval Augmented Generation (RAG) Contamination: Data leakage via retrieval systems using benchmarks.
Dirty Paraphrases and Contaminated Models: Training models on rephrased test data, or using already contaminated models to generate training data.
Over-hyping and Meta-contamination: Tuning model designs or hyperparameters after test results are obtained, or recycling designs inherited from contaminated prior work.
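As a concrete illustration of the first category, a very basic contamination check can flag training documents that share long n-grams with the test set. The sketch below is an illustrative toy, not the auditing procedure discussed in the paper; real contamination audits involve large-scale deduplication, fuzzy matching, and canary strings.

```python
# Hedged sketch of a simple contamination check: flag training examples that
# share long n-grams with the test set.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(train_docs: list, test_docs: list, n: int = 8) -> list:
    test_grams = set()
    for doc in test_docs:
        test_grams |= ngrams(doc, n)
    return [i for i, doc in enumerate(train_docs) if ngrams(doc, n) & test_grams]

train = ["the quick brown fox jumps over the lazy dog near the river bank today",
         "a completely unrelated training sentence about weather patterns"]
test = ["question: the quick brown fox jumps over the lazy dog near the river bank today"]
print(flag_contaminated(train, test))  # [0] -- the first training doc overlaps the test set
```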
Cherrypicking
Cherrypicking is the practice of adjusting experimental conditions to support the intended result. Researchers may test their models several times under different scenarios and only publish the best outcomes. This comprises the following practices.
Baseline Nerfing: Deliberately under-optimizing baseline models to give the impression that the new model is better.
Runtime Hacking: Modifying inference parameters after the fact to improve performance metrics.
Benchmark Hacking: Choosing easier benchmarks, or subsets of benchmarks, on which the model is sure to perform well.
Golden Seed: Reporting the top-performing seed after training with several random seeds.
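The golden-seed practice is easy to illustrate: when a training procedure is noisy, reporting only the best of several seeds inflates the result relative to reporting the mean with an error bar. The numbers below are synthetic and purely illustrative.

```python
# Hedged toy illustration of the "golden seed" practice: cherrypicking the best
# of several seeded runs versus honestly reporting mean and spread.
import random
import statistics

def train_and_eval(seed: int) -> float:
    """Stand-in for a training run whose score varies with the random seed."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.02)  # true performance ~0.80 with run-to-run noise

scores = [train_and_eval(seed) for seed in range(10)]
print(f"golden seed (cherrypicked): {max(scores):.3f}")
print(f"honest report: {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")
```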
Misreporting
Misreporting covers a variety of techniques in which researchers present generalizations based on skewed or limited benchmarks. For example, consider the following:
Superfluous Cog: Claiming originality by adding unnecessary modules.
Whack-a-mole: Monitoring for specific failure cases and patching them individually as they appear, rather than addressing the underlying problem.
P-hacking: The selective presentation of statistically significant findings.
Point Scores: Ignoring variability by reporting results from a single run without error bars.
Outright Lies and Over/Underclaiming: Creating fake outcomes or making incorrect assertions regarding the capabilities of the model.
Irreproducible Research Practices (IRPs), in addition to QRPs, add to the complexity of the ML evaluation environment. It is challenging for subsequent researchers to duplicate, expand upon, or examine earlier research because of IRPs. One common instance is dataset concealing, in which researchers withhold information about the training datasets they utilize, including metadata. The competitive nature of ML research and worries about copyright infringement frequently motivate this technique. The validation and replication of discoveries, which are essential to the advancement of science, are hampered by the lack of transparency in dataset sharing.
In conclusion, the integrity of ML research and assessment is critical. Although QRPs and IRPs may benefit companies and researchers in the near term, they damage the field’s credibility and dependability over the long run. Setting up and upholding strict guidelines for research processes is essential as ML models are used more often and have a greater impact on society. The full potential of ML models can only be attained by openness, responsibility, and a dedication to moral research. It is imperative that the community collaborates to recognize and address these practices, guaranteeing that the progress in ML is grounded in honesty and fairness.
Check out the Paper. All credit for this research goes to the researchers of this project. The post The Impact of Questionable Research Practices on the Evaluation of Machine Learning (ML) Models appeared first on MarkTechPost.
Autonomous web navigation focuses on developing AI agents capable of performing complex online tasks. These tasks range from data retrieval and form submissions to more intricate activities like finding the cheapest flights or booking accommodations. By leveraging large language models (LLMs) and other AI methodologies, autonomous web navigation aims to enhance productivity in both consumer and enterprise domains by automating tasks that are typically manual and time-consuming.
This research addresses the primary challenge of current web agents, which are inefficient and error-prone. Traditional web agents struggle with the noisy and expansive HTML Document Object Models (DOMs) and the dynamic nature of modern web pages. These agents often fail to perform tasks accurately because they cannot handle the complexity and variability of web content effectively. This inefficiency is a significant barrier to the practical deployment of autonomous web agents in real-world applications, where reliability and precision are crucial.
Existing methods employed by web agents include encoding the DOM, using screenshots, and utilizing accessibility trees. Despite these techniques, current systems often fall short because they use a flat encoding of the DOM that does not capture the hierarchical structure of web pages. This leads to suboptimal performance, with agents failing to complete tasks or providing incorrect outputs. These limitations necessitate a more sophisticated approach to web navigation and task execution.
Researchers at Emergence AI introduced Agent-E, a novel web agent designed to overcome the shortcomings of existing systems. Agent-E’s hierarchical architecture divides the task planning and execution phases into two distinct components: the planner agent and the browser navigation agent. This separation allows each component to focus on its specific role, improving efficiency and performance. The planner agent decomposes tasks into sub-tasks, which are then executed by the browser navigation agent using advanced DOM distillation techniques.
The methodology of Agent-E involves several innovative steps to manage noisy and expansive web content effectively. The planner agent breaks down user tasks into smaller sub-tasks and assigns them to the browser navigation agent. This agent uses flexible DOM distillation techniques to select the most relevant DOM representation for each task, reducing noise and focusing on task-specific information. Agent-E employs change observation to monitor state changes during task execution, providing feedback that enhances the agent’s performance and accuracy.
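A schematic sketch of this planner/browser-navigation split might look like the following; the sub-task decomposition, DOM-distillation stub, and function names are assumptions for illustration and do not reproduce Agent-E’s actual prompts or skills.

```python
# Hedged sketch of the hierarchical split described above: a planner agent
# decomposes the user task into sub-tasks, and a browser-navigation agent
# executes each one against a distilled DOM. All functions are stubs.
def planner_agent(task: str) -> list:
    """Would call an LLM to break the task into ordered sub-tasks; stubbed here."""
    return [f"open the relevant site for: {task}", "fill in the search form", "extract the result"]

def distill_dom(raw_dom: str, sub_task: str) -> str:
    """Would pick the DOM representation (text-only, input fields, full) suited to the sub-task."""
    return raw_dom[:500]  # toy stand-in for task-specific pruning

def browser_navigation_agent(sub_task: str, raw_dom: str) -> str:
    """Would call an LLM over the distilled DOM and perform browser actions."""
    dom_view = distill_dom(raw_dom, sub_task)
    return f"completed '{sub_task}' using {len(dom_view)} chars of distilled DOM"

def agent_e(task: str, raw_dom: str) -> list:
    observations = []
    for sub_task in planner_agent(task):
        outcome = browser_navigation_agent(sub_task, raw_dom)
        observations.append(outcome)  # change observation feeds back into planning
    return observations

print(agent_e("find the cheapest flight to Tokyo", raw_dom="<html>...</html>" * 100))
```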
Evaluations using the WebVoyager benchmark demonstrated that Agent-E significantly outperforms previous state-of-the-art web agents. Agent-E achieved a success rate of 73.2%, marking a 20% improvement over previous text-only web agents and a 16% increase over multi-modal web agents. On complex sites like Wolfram Alpha, Agent-E’s performance improvement reached up to 30%. Beyond success rates, the research team reported on additional metrics such as task completion times and error awareness. Agent-E averaged 150 seconds to complete a task successfully and 220 seconds for failed tasks. It required an average of 25 LLM calls per task, highlighting its efficiency and effectiveness.
In conclusion, the research conducted by Emergence AI represents a significant advancement in autonomous web navigation. By addressing the inefficiencies of current web agents through a hierarchical architecture and advanced DOM management techniques, Agent-E sets a new benchmark for performance and reliability. The study’s findings suggest that these innovations could be applied beyond web automation to other areas of AI-driven automation, offering valuable insights into the design principles of agentic systems. Agent-E’s success in achieving a 73.2% task completion rate and efficient task execution process underscores its potential for transforming web navigation and automation.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. The post Emergence AI Proposes Agent-E: A Web Agent Achieving 73.2% Success Rate with a 20% Improvement in Autonomous Web Navigation appeared first on MarkTechPost.