Economists from the University of Chicago Present a Study on the Adoption of ChatGPT

Artificial intelligence, particularly AI chatbots like ChatGPT, has ushered in a new era of technological interaction. These intelligent systems, capable of understanding and generating human-like text, are not just prevalent across various applications but are also transforming the way we communicate, work, and learn. The rapid adoption of AI chatbots, particularly ChatGPT, across different domains and age groups worldwide, from businesses optimizing customer service to educators offering personalized learning experiences, is a testament to the transformative power of these technologies.

AI chatbots have the potential to bridge generational gaps in technology adoption. While younger generations, particularly Millennials and Gen Z, are more inclined to use these tools for social interaction, entertainment, and learning, older adults are gradually embracing chatbots for more practical purposes. This diverse appeal of AI chatbots across different age groups is a promising sign of their potential to create a more inclusive and connected digital world. 

In the corporate sector, businesses are leveraging AI chatbots, particularly ChatGPT, to enhance customer service, streamline operations, and provide personalized user experiences.  This focus on enhancing user experiences through AI chatbots is a clear indication of how businesses are prioritizing the needs and preferences of their customers and employees. 

Researchers at the University of Chicago have previously shown that AI chatbots like ChatGPT perform well on investment-related tasks, such as anticipating investment policies, interpreting extensive disclosures, and recognizing risk. Their recent research, conducted between November 2023 and January 2024, examines the spread of ChatGPT by gathering qualitative and quantitative data on who has used the platform so far, how employees expect it to influence their work, and why some people use it while others don’t. The authors, working with Statistics Denmark, surveyed 100,000 employees from 11 occupations exposed to ChatGPT. Through an experimental component of the survey, they investigate whether informing employees about expert assessments of ChatGPT’s usefulness for their job tasks affects their adoption of the tool. The survey responses are linked to administrative data on participants’ demographics, income, wealth, education, and labor market histories.

Intensity Of Use Across Occupations & Genders

Regarding ChatGPT use, the findings suggest that the tool is widespread in the exposed occupations, including software developers, journalists, and legal professionals. With adoption rates ranging from 34% among financial advisors to 79% among software developers, the data shows that ChatGPT has seen the highest adoption among journalists, software developers, and IT support professionals, with over 50% of these workers using the AI chatbot at their jobs. However, adoption is lower in more traditional roles like office clerks, financial advisors, and customer service representatives, suggesting a divide in how different occupations leverage this emerging technology. Additionally, the significant portion of workers across all roles who are “Aware, Never Used” indicates there is still room for further adoption as more professionals become familiar with its capabilities and integrate it into their workflows.

Among workers, 32% now use ChatGPT, and 6% have a Plus subscription. Marketing Professionals and Journalists show the highest intensity of ChatGPT usage, with 15% holding an active Plus subscription, while Software Developers are slightly lower at 14%. Legal Professionals, Accountants/Auditors, Teachers, Office Clerks, and Financial Advisors have the lowest intensity of use, with 4% or fewer holding a Plus subscription.

The likelihood of using ChatGPT is higher among younger and less experienced professionals: the probability of use decreases by 1 percentage point for every year of age and by 0.7 percentage points for every year of experience. Under this estimate, for example, a worker ten years older than a colleague would be roughly 10 percentage points less likely to use the tool. People with more education and better grades are also more likely to use ChatGPT, which explains why its users earn somewhat higher incomes even though they have less work experience.

Among working-age men and women, ChatGPT usage is 20% lower among women. This gender disparity persists even among co-workers in the same office and is not explained by differences in employees’ specific job responsibilities. Employees in the exposed occupations believe that ChatGPT has substantial potential to increase productivity. However, workers with more experience are twice as likely as those with less experience to say that ChatGPT offers smaller time savings.

What Do People Say Prevents Them from Using ChatGPT?

When workers hear how experts rate ChatGPT’s ability to save them time on work tasks, their own opinions change. This change lasts for at least two weeks and narrows the gap between workers’ and experts’ opinions by 15%. The treatment does not change how workers use ChatGPT, though.

The study highlights several key factors that prevent workers, particularly women, from using ChatGPT. The most significant barrier appears to be “Restrictions on Use,” with 30% of women and 41% of men citing this as a deterrent. These restrictions could be related to data privacy, company policies, or regulatory compliance, and they may be limiting the adoption of ChatGPT among employees. Another notable factor is “Need Training,” cited by 48% of women and 37% of men. This indicates that a lack of training or familiarity with the tool is a barrier to usage, and points to a need for more educational resources and support to help workers, especially women, become comfortable using ChatGPT. Interestingly, “Data Confidentiality” is also a concern for a significant proportion of both women (26%) and men (31%), highlighting the importance of addressing data privacy and security considerations in the adoption of AI chatbots like ChatGPT in the workplace.  However, when asked why they don’t use ChatGPT, few employees mention “existential fears” of technological dependence or job loss.

The data suggests that policy restrictions, training needs, and data privacy concerns are the primary factors preventing more widespread use of ChatGPT among workers, with women generally facing greater barriers than their male counterparts.

According to this research, businesses and governments have the power to make tools like ChatGPT more widely used, and taking the initiative to promote its use could help reverse some of the worrying tendencies identified. For instance, workers with less experience may need additional support to capture the advantages of generative AI, because current ChatGPT users already earned more before the tool arrived. Similarly, the gender gap documented in this study could narrow if more training opportunities were made available to women. Lastly, many employees say they will not increase their output on tasks where ChatGPT makes them more efficient. On the other hand, when businesses restructure their processes to make better use of ChatGPT, these productivity improvements could translate into even greater output growth, benefiting the economy.

Check out the Full Report. All credit for this research goes to the researchers of this project.


The post Economists from the University of Chicago Present a Study on the Adoption of ChatGPT appeared first on MarkTechPost.

Google AI Introduces CoverBench: A Challenging Benchmark Focused on Verifying Language Model (LM) Outputs in Complex Reasoning Settings

One of the primary challenges in AI research is verifying the correctness of language models (LMs) outputs, especially in contexts requiring complex reasoning. As LMs are increasingly used for intricate queries that demand multiple reasoning steps, domain expertise, and quantitative analysis, ensuring the accuracy and reliability of these models is crucial. This task is particularly important in fields like finance, law, and biomedicine, where incorrect information can lead to significant adverse outcomes.

Current methods for verifying LM outputs include fact-checking and natural language inference (NLI) techniques. These methods typically rely on datasets designed for specific reasoning tasks, such as question answering (QA) or financial analysis. However, these datasets are not tailored for claim verification, and existing methods exhibit limitations like high computational complexity, dependence on large volumes of labeled data, and inadequate performance on tasks requiring long-context reasoning or multi-hop inferences. High label noise and the domain-specific nature of many datasets further hinder the generalizability and applicability of these methods in broader contexts.

A team of researchers from Google and Tel Aviv University proposed CoverBench, a benchmark specifically designed for evaluating complex claim verification across diverse domains and reasoning types. CoverBench addresses the limitations of existing methods by providing a unified format and a diverse set of 733 examples requiring complex reasoning, including long-context understanding, multi-step reasoning, and quantitative analysis. The benchmark includes true and false claims vetted for quality, ensuring low levels of label noise. This novel approach allows for a comprehensive evaluation of LM verification capabilities, highlighting areas needing improvement and setting a higher standard for claim verification tasks.

CoverBench comprises datasets from nine different sources, including FinQA, QRData, TabFact, MultiHiertt, HybridQA, ContractNLI, PubMedQA, TACT, and Feverous. These datasets cover a range of domains such as finance, Wikipedia, biomedical, legal, and statistics. The benchmark involves converting various QA tasks into declarative claims, standardizing table representations, and generating negative examples using seed models like GPT-4. The final dataset contains long input contexts, averaging 3,500 tokens, which challenge current models’ capabilities. The datasets were manually vetted to ensure the correctness and difficulty of the claims.
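
To make the unified format concrete, here is a minimal sketch of what a CoverBench-style claim-verification record and prompt might look like. The field names and the example claim are illustrative assumptions, not the benchmark's actual schema.

```python
# Illustrative record in a unified claim-verification format (field names are
# assumptions for illustration, not CoverBench's actual schema).
example = {
    "source_dataset": "FinQA",          # one of the nine source datasets
    "context": "<long table/text context, ~3,500 tokens on average>",
    "claim": "The company's operating margin increased from 2018 to 2019.",
    "label": "TRUE",                    # binary: TRUE or FALSE
}

def build_verification_prompt(record: dict) -> str:
    """Turn a record into a prompt asking a model to verify the claim."""
    return (
        "Given the following context, decide whether the claim is TRUE or FALSE.\n\n"
        f"Context:\n{record['context']}\n\n"
        f"Claim: {record['claim']}\n"
        "Answer with TRUE or FALSE."
    )

print(build_verification_prompt(example))
```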

The evaluation of CoverBench demonstrates that current competitive LMs struggle significantly with the tasks presented, achieving performance near the random baseline in many instances. The highest-performing models, such as Gemini 1.5 Pro, achieved a Macro-F1 score of 62.1, indicating substantial room for improvement. In contrast, models like Gemma-1.1-7b-it performed much lower, underscoring the benchmark’s difficulty. These results highlight the challenges LMs face in complex claim verification and the significant headroom for advancements in this area.
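
Because the task is binary (TRUE/FALSE) claim verification, Macro-F1 averages the F1 score of each class. The sketch below shows how such a score could be computed with scikit-learn, using made-up gold labels and predictions.

```python
from sklearn.metrics import f1_score

# Made-up gold labels and model predictions for a handful of claims (1 = TRUE, 0 = FALSE).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0, 1, 1]

# Macro-F1 averages the per-class F1 scores, so TRUE and FALSE claims
# count equally regardless of class imbalance.
macro_f1 = f1_score(gold, pred, average="macro")
print(f"Macro-F1: {macro_f1:.3f}")
```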

In conclusion, CoverBench significantly contributes to AI research by providing a challenging benchmark for complex claim verification. It overcomes the limitations of existing datasets by offering a diverse set of tasks that require multi-step reasoning, long-context understanding, and quantitative analysis. The benchmark’s thorough evaluation reveals that current LMs have substantial room for improvement in these areas. CoverBench thus sets a new standard for claim verification, pushing the boundaries of what LMs can achieve in complex reasoning tasks.

Check out the Paper. All credit for this research goes to the researchers of this project.


The post Google AI Introduces CoverBench: A Challenging Benchmark Focused on Verifying Language Model LM Outputs in Complex Reasoning Settings appeared first on MarkTechPost.

IncarnaMind: An AI Tool that Enables You to Chat with Your Personal Documents (PDF, TXT) Using Large Language Models (LLMs) like GPT

IncarnaMind is leading the way in Artificial Intelligence by enabling users to engage with their personal papers, whether they are in PDF or TXT format. The necessity of being able to query documents in natural language has increased with the introduction of AI-driven solutions. However, problems still exist, especially when it comes to accuracy and context management, even with strong models like GPT. Using a unique architecture intended to improve user-document interaction, IncarnaMind has tackled these problems.

Sliding Window Chunking and an Ensemble Retriever mechanism are the two main components of IncarnaMind, and they are both essential for effective and efficient information retrieval from documents.

Sliding Window Chunking: IncarnaMind’s Sliding Window Chunking dynamically modifies the window’s size and position in contrast to conventional Retrieval-Augmented Generation (RAG) methods, which depend on fixed chunk sizes. Depending on the complexity of the data and the user’s query, this adaptive technique guarantees that the system can balance between obtaining more comprehensive, contextually rich information and fine-grained details. This approach makes the system far more capable of parsing and comprehending complex documents, which makes it an effective tool for retrieving detailed information.
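
IncarnaMind's exact implementation lives in its GitHub repository; as a rough illustration of the general idea of a sliding window whose size adapts to the query, here is a hedged sketch. The sizing heuristic and parameter values are assumptions for illustration, not the project's actual logic.

```python
def sliding_window_chunks(text: str, window_size: int, stride: int) -> list[str]:
    """Split text into overlapping chunks of `window_size` words, advancing by `stride`."""
    words = text.split()
    chunks = []
    for start in range(0, max(len(words) - window_size, 0) + 1, stride):
        chunks.append(" ".join(words[start:start + window_size]))
    return chunks

def adaptive_chunks(text: str, query: str) -> list[str]:
    """Toy heuristic: short, specific queries get small windows for fine-grained
    detail; longer queries get larger windows for broader context."""
    window = 64 if len(query.split()) < 8 else 256
    return sliding_window_chunks(text, window_size=window, stride=window // 2)
```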

Ensemble Retriever: This approach improves queries even further by integrating several retrieval strategies. The Ensemble Retriever enhances the LLM’s responses by enabling IncarnaMind to effectively sort through both coarse- and fine-grained data in the user’s ground truth documents. By ensuring that the material presented is accurate and relevant, this multifaceted retrieval strategy helps alleviate the prevalent issue of factual hallucinations frequently observed in LLMs.
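
The paragraph above describes fusing coarse-grained (e.g., keyword) and fine-grained (e.g., embedding) retrieval. A minimal sketch of one common way to combine such retrievers with a weighted score sum is shown below; the weights and retriever interfaces are assumptions for illustration, not IncarnaMind's actual code.

```python
from collections import defaultdict

def ensemble_retrieve(query, retrievers, weights, top_k=5):
    """Combine scores from several retrievers with a weighted sum.

    `retrievers` is a list of callables mapping a query to {doc_id: score};
    `weights` gives each retriever's contribution to the final ranking.
    """
    combined = defaultdict(float)
    for retrieve, weight in zip(retrievers, weights):
        for doc_id, score in retrieve(query).items():
            combined[doc_id] += weight * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy usage with two stand-in retrievers (keyword-style and embedding-style scores).
keyword = lambda q: {"doc1": 0.9, "doc2": 0.4}
semantic = lambda q: {"doc2": 0.8, "doc3": 0.7}
print(ensemble_retrieve("termination clause", [keyword, semantic], [0.5, 0.5]))
```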

One of its greatest advantages is that IncarnaMind can solve some of the enduring problems that other AI-driven document interaction technologies still face. Because traditional tools use a single chunk size for information retrieval, they frequently have trouble with different levels of data complexity. This is addressed by IncarnaMind’s adaptive chunking technique, which allows for more accurate and pertinent data extraction by modifying chunk sizes based on the content and context of the document.

Most retrieval techniques concentrate on either precise data retrieval or semantic understanding. These two factors are balanced by IncarnaMind’s Ensemble Retriever, which guarantees responses that are both semantically rich and contextually appropriate. The inability of many current solutions to query more than one document at once restricts their use in scenarios involving several documents. IncarnaMind removes this obstacle by enabling multi-hop queries over several documents at once, providing a more thorough and integrated comprehension of the data.

IncarnaMind is made to be adaptable and work with many other LLMs, such as the Llama2 series, Anthropic Claude, and OpenAI GPT. The Llama2-70b-chat model, which has demonstrated the best performance in terms of reasoning and safety when compared to other models like GPT-4 and Claude 2.0, is the model for which the tool is specifically optimized. However, some users may find this to be a drawback as the Llama2-70b-gguf quantized version requires more than 35GB of GPU RAM to execute. The Together.ai API, which supports llama2-70b-chat and other open-source models, provides a workable substitute in these situations.

In conclusion, IncarnaMind marks a significant advance in how users interact with their personal documents. By tackling key issues in document retrieval and offering strong interoperability with multiple LLMs, it is well positioned to become a crucial tool for anyone requiring accurate, context-aware document querying. 

Check out the GitHub. All credit for this research goes to the researchers of this project.


The post IncarnaMind: An AI Tool that Enables You to Chat with Your Personal Documents (PDF, TXT) Using Large Language Models (LLMs) like GPT appeared first on MarkTechPost.

Cisco achieves 50% latency improvement using Amazon SageMaker Inference faster autoscaling feature

This post is co-authored with Travis Mehlinger and Karthik Raghunathan from Cisco.
Webex by Cisco is a leading provider of cloud-based collaboration solutions, including video meetings, calling, messaging, events, polling, asynchronous video, and customer experience solutions like contact center and purpose-built collaboration devices. Webex’s focus on delivering inclusive collaboration experiences fuels its innovation, which leverages AI and machine learning to remove the barriers of geography, language, personality, and familiarity with technology. Its solutions are underpinned with security and privacy by design. Webex works with the world’s leading business and productivity apps, including AWS.
Cisco’s Webex AI (WxAI) team plays a crucial role in enhancing these products with AI-driven features and functionalities, leveraging large language models (LLMs) to improve user productivity and experiences. In the past year, the team has increasingly focused on building AI capabilities powered by LLMs to improve productivity and experience for users. Notably, the team’s work extends to Webex Contact Center, a cloud-based omni-channel contact center solution that empowers organizations to deliver exceptional customer experiences. By integrating LLMs, the WxAI team enables advanced capabilities such as intelligent virtual assistants, natural language processing, and sentiment analysis, allowing Webex Contact Center to provide more personalized and efficient customer support. However, as these LLMs grew to contain hundreds of gigabytes of data, the WxAI team faced challenges in efficiently allocating resources and starting applications with the embedded models. To optimize its AI/ML infrastructure, Cisco migrated its LLMs to Amazon SageMaker Inference, improving speed, scalability, and price-performance.
This post highlights how Cisco implemented the new faster autoscaling feature. For more details on Cisco’s use cases, solution, and benefits, see How Cisco accelerated the use of generative AI with Amazon SageMaker Inference.
In this post, we will discuss the following:

Overview of Cisco’s use case and architecture
Introduction of the new faster autoscaling feature
    Single-model real-time endpoints
    Deployment using Amazon SageMaker inference components
Results on the performance improvements Cisco saw with the faster autoscaling feature for GenAI inference
Next steps

Cisco’s Use-case: Enhancing Contact Center Experiences
Webex is applying generative AI to its contact center solutions, enabling more natural, human-like conversations between customers and agents. The AI can generate contextual, empathetic responses to customer inquiries, as well as automatically draft personalized emails and chat messages. This helps contact center agents work more efficiently while maintaining a high level of customer service.

Architecture
Initially, WxAI embedded LLM models directly into the application container images running on Amazon Elastic Kubernetes Service (Amazon EKS). However, as the models grew larger and more complex, this approach faced significant scalability and resource utilization challenges. Operating the resource-intensive LLMs through the applications required provisioning substantial compute resources, which slowed down processes like allocating resources and starting applications. This inefficiency hampered WxAI’s ability to rapidly develop, test, and deploy new AI-powered features for the Webex portfolio.
To address these challenges, WxAI team turned to SageMaker Inference – a fully managed AI inference service that allows seamless deployment and scaling of models independently from the applications that use them. By decoupling the LLM hosting from the Webex applications, WxAI could provision the necessary compute resources for the models without impacting the core collaboration and communication capabilities.

“The applications and the models work and scale fundamentally differently, with entirely different cost considerations, by separating them rather than lumping them together, it’s much simpler to solve issues independently.”
– Travis Mehlinger, Principal Engineer at Cisco. 

This architectural shift has enabled Webex to harness the power of generative AI across its suite of collaboration and customer engagement solutions.

Today, the SageMaker endpoint uses auto scaling based on invocations per instance. However, it takes about 6 minutes to detect the need to scale.
Introducing new predefined metric types for faster autoscaling
The Cisco Webex AI team wanted to improve their inference auto scaling times, so they worked with the Amazon SageMaker team to improve inference.
Amazon SageMaker’s real-time inference endpoint offers a scalable, managed solution for hosting Generative AI models. This versatile resource can accommodate multiple instances, serving one or more deployed models for instant predictions. Customers have the flexibility to deploy either a single model or multiple models using SageMaker InferenceComponents on the same endpoint. This approach allows for efficient handling of diverse workloads and cost-effective scaling.
To optimize real-time inference workloads, SageMaker employs application automatic scaling (auto scaling). This feature dynamically adjusts both the number of instances in use and the quantity of model copies deployed (when using inference components), responding to real-time changes in demand. When traffic to the endpoint surpasses a predefined threshold, auto scaling increases the available instances and deploys additional model copies to meet the heightened demand. Conversely, as workloads decrease, the system automatically removes unnecessary instances and model copies, effectively reducing costs. This adaptive scaling ensures that resources are optimally utilized, balancing performance needs with cost considerations in real-time.
Working with Cisco, Amazon SageMaker released a new sub-minute, high-resolution predefined metric type, SageMakerVariantConcurrentRequestsPerModelHighResolution, for faster autoscaling and reduced detection time. This high-resolution metric has been shown to reduce scaling detection times by up to 6x (compared to the existing SageMakerVariantInvocationsPerInstance metric), thereby improving overall end-to-end inference latency by up to 50% on endpoints hosting generative AI models like Llama3-8B.
With this release, SageMaker real-time endpoints also emit two new CloudWatch metrics, ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy, which are better suited for monitoring and scaling Amazon SageMaker endpoints hosting LLMs and FMs.
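For readers who want to try the new metric, the following is a minimal sketch of registering a SageMaker endpoint variant with Application Auto Scaling and attaching a target-tracking policy on the new predefined metric type. The endpoint name, variant name, capacities, target value, and cooldowns are placeholder assumptions, not Cisco's configuration.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder endpoint and variant names.
resource_id = "endpoint/my-llm-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy on the new high-resolution concurrency metric.
autoscaling.put_scaling_policy(
    PolicyName="concurrent-requests-high-resolution",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution"
        },
        "TargetValue": 5.0,       # target concurrent requests per model (placeholder)
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```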
Cisco’s Evaluation of faster autoscaling feature for GenAI inference
Cisco evaluated Amazon SageMaker’s new predefined metric types for faster autoscaling on their generative AI workloads. They observed up to a 50% improvement in end-to-end inference latency by using the new SageMakerVariantConcurrentRequestsPerModelHighResolution metric, compared to the existing SageMakerVariantInvocationsPerInstance metric.
The setup involved using their Generative AI models, on SageMaker’s real-time inference endpoints. SageMaker’s autoscaling feature dynamically adjusted both the number of instances and the quantity of model copies deployed to meet real-time changes in demand. The new high-resolution SageMakerVariantConcurrentRequestsPerModelHighResolution metric reduced scaling detection times by up to 6x, enabling faster autoscaling and lower latency.
In addition, SageMaker now emits new CloudWatch metrics, including ConcurrentRequestsPerModel and ConcurrentRequestsPerModelCopy, which are better suited for monitoring and scaling endpoints hosting large language models (LLMs) and foundation models (FMs). This enhanced autoscaling capability has been a game-changer for Cisco, helping to improve the performance and efficiency of their critical Generative AI applications.

“We are really pleased with the performance improvements we’ve seen from Amazon SageMaker’s new autoscaling metrics. The higher-resolution scaling metrics have significantly reduced latency during initial load and scale-out on our Gen AI workloads. We’re excited to do a broader rollout of this feature across our infrastructure”
– Travis Mehlinger, Principal Engineer at Cisco.

Cisco further plans to work with the SageMaker Inference team to drive improvements in the remaining variables that affect autoscaling latency, such as model download and load times.
Conclusion
Cisco’s Webex AI team is continuing to leverage Amazon SageMaker Inference to power generative AI experiences across its Webex portfolio. Its evaluation of SageMaker’s faster autoscaling has shown up to 50% latency improvements on its GenAI inference endpoints. As the WxAI team continues to push the boundaries of AI-driven collaboration, its partnership with Amazon SageMaker will be crucial in informing upcoming improvements and advanced GenAI inference capabilities. With this new feature, Cisco looks forward to further optimizing its AI inference performance by rolling it out broadly across multiple AWS Regions and delivering even more impactful generative AI features to its customers.

About the Authors
Travis Mehlinger is a Principal Software Engineer in the Webex Collaboration AI group, where he helps teams develop and operate cloud-native AI and ML capabilities to support Webex AI features for customers around the world. In his spare time, Travis enjoys cooking barbecue, playing video games, and traveling around the US and UK to race go karts.
Karthik Raghunathan is the Senior Director for Speech, Language, and Video AI in the Webex Collaboration AI Group. He leads a multidisciplinary team of software engineers, machine learning engineers, data scientists, computational linguists, and designers who develop advanced AI-driven features for the Webex collaboration portfolio. Prior to Cisco, Karthik held research positions at MindMeld (acquired by Cisco), Microsoft, and Stanford University.
Praveen Chamarthi is a Senior AI/ML Specialist with Amazon Web Services. He is passionate about AI/ML and all things AWS. He helps customers across the Americas to scale, innovate, and operate ML workloads efficiently on AWS. In his spare time, Praveen loves to read and enjoys sci-fi movies.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, multi-tenant models, cost optimizations, and making deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.
Ravi Thakur is a Sr Solutions Architect Supporting Strategic Industries at AWS, and is based out of Charlotte, NC. His career spans diverse industry verticals, including banking, automotive, telecommunications, insurance, and energy. Ravi’s expertise shines through his dedication to solving intricate business challenges on behalf of customers, utilizing distributed, cloud-native, and well-architected design patterns. His proficiency extends to microservices, containerization, AI/ML, Generative AI, and more. Today, Ravi empowers AWS Strategic Customers on personalized digital transformation journeys, leveraging his proven ability to deliver concrete, bottom-line benefits.

How Cisco accelerated the use of generative AI with Amazon SageMaker Inference

This post is co-authored with Travis Mehlinger and Karthik Raghunathan from Cisco.
Webex by Cisco is a leading provider of cloud-based collaboration solutions, including video meetings, calling, messaging, events, polling, asynchronous video, and customer experience solutions like contact center and purpose-built collaboration devices. Webex’s focus on delivering inclusive collaboration experiences fuels their innovation, which uses artificial intelligence (AI) and machine learning (ML), to remove the barriers of geography, language, personality, and familiarity with technology. Its solutions are underpinned with security and privacy by design. Webex works with the world’s leading business and productivity apps—including AWS.
Cisco’s Webex AI (WxAI) team plays a crucial role in enhancing these products with AI-driven features and functionalities, using large language models (LLMs) to improve user productivity and experiences. In the past year, the team has increasingly focused on building AI capabilities powered by LLMs to improve productivity and experience for users. Notably, the team’s work extends to Webex Contact Center, a cloud-based omni-channel contact center solution that empowers organizations to deliver exceptional customer experiences. By integrating LLMs, the WxAI team enables advanced capabilities such as intelligent virtual assistants, natural language processing (NLP), and sentiment analysis, allowing Webex Contact Center to provide more personalized and efficient customer support. However, as these LLM models grew to contain hundreds of gigabytes of data, the WxAI team faced challenges in efficiently allocating resources and starting applications with the embedded models. To optimize its AI/ML infrastructure, Cisco migrated its LLMs to Amazon SageMaker Inference, improving speed, scalability, and price-performance.
This post highlights how Cisco implemented new functionalities and migrated existing workloads to Amazon SageMaker inference components for their industry-specific contact center use cases. By integrating generative AI, they can now analyze call transcripts to better understand customer pain points and improve agent productivity. Cisco has also implemented conversational AI experiences, including chatbots and virtual agents that can generate human-like responses, to automate personalized communications based on customer context. Additionally, they are using generative AI to extract key call drivers, optimize agent workflows, and gain deeper insights into customer sentiment. Cisco’s adoption of SageMaker Inference has enabled them to streamline their contact center operations and provide more satisfying, personalized interactions that address customer needs.
In this post, we discuss the following:

Cisco’s business use cases and outcomes
How Cisco accelerated the use of generative AI powered by LLMs for their contact center use cases with the help of SageMaker Inference
Cisco’s generative AI inference architecture, which is built as a robust and secure foundation, using various services and features such as SageMaker Inference, Amazon Bedrock, Kubernetes, Prometheus, Grafana, and more
How Cisco uses an LLM router and auto scaling to route requests to appropriate LLMs for different tasks while simultaneously scaling their models for resiliency and performance efficiency.
How the solutions in this post impacted Cisco’s business roadmap and strategic partnership with AWS
How Cisco helped SageMaker Inference build new capabilities to deploy generative AI applications at scale

Enhancing collaboration and customer engagement with generative AI: Webex’s AI-powered solutions
In this section, we discuss Cisco’s AI-powered use cases.
Meeting summaries and insights
For Webex Meetings, the platform uses generative AI to automatically summarize meeting recordings and transcripts. This extracts the key takeaways and action items, helping distributed teams stay informed even if they missed a live session. The AI-generated summaries provide a concise overview of important discussions and decisions, allowing employees to quickly get up to speed. Beyond summaries, Webex’s generative AI capabilities also surface intelligent insights from meeting content. This includes identifying action items, highlighting critical decisions, and generating personalized meeting notes and to-do lists for each participant. These insights help make meetings more productive and hold attendees accountable.

Enhancing contact center experiences
Webex is also applying generative AI to its contact center solutions, enabling more natural, human-like conversations between customers and agents. The AI can generate contextual, empathetic responses to customer inquiries, as well as automatically draft personalized emails and chat messages. This helps contact center agents work more efficiently while maintaining a high level of customer service.

Webex customers realize positive outcomes with generative AI
Webex’s adoption of generative AI is driving tangible benefits for customers. Clients using the platform’s AI-powered meeting summaries and insights have reported productivity gains. Webex customers using the platform’s generative AI for contact centers have handled hundreds of thousands of calls with improved customer satisfaction and reduced handle times, enabling more natural, empathetic conversations between agents and clients. Webex’s strategic integration of generative AI is empowering users to work smarter and deliver exceptional experiences.
For more details on how Webex is harnessing generative AI to enhance collaboration and customer engagement, see Webex | Exceptional Experiences for Every Interaction on the Webex blog.
Using SageMaker Inference to optimize resources for Cisco
Cisco’s WxAI team is dedicated to delivering advanced collaboration experiences powered by cutting-edge ML. The team develops a comprehensive suite of AI and ML features for the Webex ecosystem, including audio intelligence capabilities like noise removal and optimizing speaker voices, language intelligence for transcription and translation, and video intelligence features like virtual backgrounds. At the forefront of WxAI’s innovations is the AI-powered Webex Assistant, a virtual assistant that provides voice-activated control and seamless meeting support in multiple languages. To build these sophisticated capabilities, WxAI uses LLMs, which can contain up to hundreds of gigabytes of training data.
Initially, WxAI embedded LLM models directly into the application container images running on Amazon Elastic Kubernetes Service (Amazon EKS). However, as the models grew larger and more complex, this approach faced significant scalability and resource utilization challenges. Operating the resource-intensive LLMs through the applications required provisioning substantial compute resources, which slowed down processes like allocating resources and starting applications. This inefficiency hampered WxAI’s ability to rapidly develop, test, and deploy new AI-powered features for the Webex portfolio. To address these challenges, the WxAI team turned to SageMaker Inference—a fully managed AI inference service that allows seamless deployment and scaling of models independently from the applications that use them. By decoupling the LLM hosting from the Webex applications, WxAI could provision the necessary compute resources for the models without impacting the core collaboration and communication capabilities.

 “The applications and the models work and scale fundamentally differently, with entirely different cost considerations; by separating them rather than lumping them together, it’s much simpler to solve issues independently.”
– Travis Mehlinger, Principal Engineer at Cisco.

This architectural shift has enabled Webex to harness the power of generative AI across its suite of collaboration and customer engagement solutions.
Solution overview: Improving efficiency and reducing costs by migrating to SageMaker Inference
To address the scalability and resource utilization challenges faced with embedding LLMs directly into their applications, the WxAI team migrated to SageMaker Inference. By taking advantage of this fully managed service for deploying LLMs, Cisco unlocked significant performance and cost-optimization opportunities. Key benefits include the ability to deploy multiple LLMs behind a single endpoint for faster scaling and improved response latencies, as well as cost savings. Additionally, the WxAI team implemented an LLM proxy to simplify access to LLMs for Webex teams, enable centralized data collection, and reduce operational overhead. With SageMaker Inference, Cisco can efficiently manage and scale their LLM deployments, harnessing the power of generative AI across the Webex portfolio while maintaining optimal performance, scalability, and cost-effectiveness.
The following diagram illustrates the WxAI architecture on AWS.

The architecture is built on a robust and secure AWS foundation:

The architecture uses AWS services like Application Load Balancer, AWS WAF, and EKS clusters for seamless ingress, threat mitigation, and containerized workload management.
The LLM proxy (a microservice deployed on an EKS pod as part of the Service VPC) simplifies the integration of LLMs for Webex teams, providing a streamlined interface and reducing operational overhead. The LLM proxy supports LLM deployments on SageMaker Inference, Amazon Bedrock, or other LLM providers for Webex teams.
The architecture uses SageMaker Inference for optimized model deployment, auto scaling, and routing mechanisms.
The system integrates Loki for logging, Amazon Managed Service for Prometheus for metrics, and Grafana for unified visualization, seamlessly integrated with Cisco SSO.
The Data VPC houses the data layer components, including Amazon ElastiCache for caching and Amazon Relational Database Service (Amazon RDS) for database services, providing efficient data access and management.

Use case overview: Contact center topic analytics
A key focus area for the WxAI team is to enhance the capabilities of the Webex Contact Center platform. A typical Webex Contact Center installation has hundreds of agents handling many interactions through various channels like phone calls and digital channels. Webex’s AI-powered Topic Analytics feature extracts the key reasons customers are calling about by analyzing aggregated historical interactions and clustering them into meaningful topic categories, as shown in the following screenshot. The contact center administrator can then use these insights to optimize operations, enhance agent performance, and ultimately deliver a more satisfactory customer experience.

The Topic Analytics feature is powered by a pipeline of three models: a call driver extraction model, a topic clustering model, and a topic labeling model, as illustrated in the following diagram.

The model details are as follows:

Call driver extraction – This generative model summarizes the primary reason or intent (referred to as the call driver) behind a customer’s call. Accurate automatic tagging of calls with call drivers helps contact center supervisors and administrators quickly understand the primary reason for any historical call. One of the key considerations when solving this problem was selecting the right model to balance quality and operational costs. The WxAI team chose the FLAN-T5 model on SageMaker Inference and instruction fine-tuned it for extracting call drivers from call transcripts. FLAN-T5 is a powerful text-to-text transfer transformer model that performs various natural language understanding and generation tasks. This workload had a global footprint, deployed in the us-east-2, eu-west-2, eu-central-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, and ca-central-1 AWS Regions.
Topic clustering – Although automatically tagging every contact center interaction with its call driver is a useful feature in itself, analyzing these call drivers in an aggregated fashion over a large batch of calls can uncover even more interesting trends and insights. The topic clustering model achieves this by clustering all the individually extracted call drivers from a large batch of calls into different topic clusters. It does this by creating a semantic embedding for each call driver and employing an unsupervised hierarchical clustering technique that operates on the vector embeddings. This results in distinct and coherent topic clusters where semantically similar call drivers are grouped together.
Topic labeling – The topic labeling model is a generative model that creates a descriptive name to serve as the label for each topic cluster. Several LLMs were prompt-tuned and evaluated in a few-shot setting to choose the ideal model for the label generation task. Finally, Llama2-13b-chat, with its ability to better capture contextual nuances and semantics of natural language conversation, was used for its accuracy, performance, and cost-effectiveness. Additionally, Llama2-13b-chat was deployed and used on SageMaker inference components, while maintaining relatively low operating costs compared to other LLMs, by using specific hardware like g4dn and g5 instances. A simplified sketch of the clustering and labeling steps follows this list.
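The following is a simplified sketch of the clustering and labeling steps described above: extracted call drivers are embedded, grouped with agglomerative (hierarchical) clustering, and each cluster is summarized into a short topic label. The embedding model, distance threshold, and labeling prompt are illustrative assumptions, not Cisco's production configuration.

```python
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

call_drivers = [
    "customer wants to reset their password",
    "caller cannot log in to the portal",
    "question about last month's invoice",
    "billing amount looks incorrect",
]

# 1) Semantic embeddings for each extracted call driver (model choice is illustrative).
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(call_drivers)

# 2) Unsupervised hierarchical clustering on the embedding vectors.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0)
cluster_ids = clustering.fit_predict(embeddings)

clusters = defaultdict(list)
for driver, cluster_id in zip(call_drivers, cluster_ids):
    clusters[cluster_id].append(driver)

# 3) Each cluster would then be sent to a generative model (e.g., an LLM endpoint)
#    with a prompt such as this one to produce a descriptive topic label.
for cluster_id, drivers in clusters.items():
    prompt = "Give a short topic name for these call reasons:\n- " + "\n- ".join(drivers)
    print(cluster_id, prompt)
```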

This solution also used the auto scaling capabilities of SageMaker to dynamically adjust the number of instances, with a desired minimum of 1 and a maximum of 30 instances per endpoint. This approach provides efficient resource utilization while maintaining high throughput, allowing the WxAI platform to handle batch jobs overnight and scale to hundreds of inferences per minute during peak hours. By deploying the model on SageMaker Inference with auto scaling, the WxAI team was able to deliver reliable and accurate responses to customer interactions for their Topic Analytics use case.
By accurately pinpointing the call driver, the system can suggest appropriate actions, resources, and next steps to the agent, streamlining the customer support process, further leading to personalized and accurate responses to customer questions.
To handle fluctuating demand and optimize resource utilization, the WxAI team implemented auto scaling for their SageMaker Inference endpoints. They configured the endpoints to scale from a minimum to a maximum instance count based on GPU utilization. Additionally, the LLM proxy routed requests between the different LLMs deployed on SageMaker Inference. This proxy abstracts the complexities of communicating with various LLM providers and enables centralized data collection and analysis. This led to enhanced generative AI workflows, optimized latency, and personalized use case implementations.
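As a rough illustration of what such a proxy-level router might look like, here is a hedged sketch that maps task types to model backends and forwards requests to a SageMaker endpoint. The task names, endpoint names, and payload format are assumptions for illustration, not Cisco's implementation.

```python
import json

import boto3

smr = boto3.client("sagemaker-runtime")

# Hypothetical mapping from task type to the SageMaker endpoint serving the right model.
TASK_TO_ENDPOINT = {
    "call_driver_extraction": "flan-t5-call-driver-endpoint",
    "topic_labeling": "llama2-13b-chat-endpoint",
}

def route_request(task: str, prompt: str) -> str:
    """Route a request to the endpoint configured for the given task."""
    endpoint = TASK_TO_ENDPOINT[task]
    response = smr.invoke_endpoint(
        EndpointName=endpoint,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),  # payload schema depends on the serving container
    )
    return response["Body"].read().decode("utf-8")

# Example: extract the call driver from a transcript snippet.
# print(route_request("call_driver_extraction", "Customer: I can't log in to my account..."))
```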
Benefits
Through the strategic adoption of AWS AI services, Cisco’s WxAI team has realized significant benefits, enabling them to build cutting-edge, AI-powered collaboration capabilities more rapidly and cost-effectively:

Improved development and deployment cycle time – By decoupling models from applications, the team has streamlined processes like bug fixes, integration testing, and feature rollouts across environments, accelerating their overall development velocity.
Simplified engineering and delivery – The clear separation of concerns between the lean application layer and resource-intensive model layer has simplified engineering efforts and delivery, allowing the team to focus on innovation rather than infrastructure complexities.
Reduced costs – By using fully managed services like SageMaker Inference, the team has offloaded infrastructure management overhead. Additionally, capabilities like asynchronous inference and multi-model endpoints have enabled significant cost optimization without compromising performance or availability.
Scalability and performance – Services like SageMaker Inference and Amazon Bedrock, combined with technologies like NVIDIA Triton Inference Server on SageMaker, have empowered the WxAI team to scale their AI/ML workloads reliably and deliver high-performance inference for demanding use cases.
Accelerated innovation – The partnership with AWS has given the WxAI team access to cutting-edge AI services and expertise, enabling them to rapidly prototype and deploy innovative capabilities like the AI-powered Webex Assistant and advanced contact center AI features.

Cisco’s contributions to SageMaker Inference: Enhancing generative AI inference capabilities
Building upon the success of their strategic migration to SageMaker Inference, Cisco has been instrumental in partnering with the SageMaker Inference team to build and enhance key generative AI capabilities within the SageMaker platform. Since the early days of generative AI, Cisco has provided the SageMaker Inference team with valuable inputs and expertise, enabling the introduction of several new features and optimizations:

Cost and performance optimizations for generative AI inference – Cisco helped the SageMaker Inference team develop innovative techniques to optimize the use of accelerators, enabling SageMaker Inference to reduce foundation model (FM) deployment costs by 50% on average and latency by 20% on average with inference components. This breakthrough delivers significant cost savings and performance improvements for customers running generative AI workloads on SageMaker.
Scaling improvements for generative AI inference – Cisco’s expertise in distributed systems and auto scaling has also helped the SageMaker team develop advanced capabilities to better handle the scaling requirements of generative AI models. These improvements reduce auto scaling times by up to 40% and auto scaling detection by 6 times, so customers can rapidly scale their generative AI workloads on SageMaker to meet spikes in demand without compromising performance.
Streamlined generative AI model deployment for inference – Recognizing the need for simplified generative AI model deployment, Cisco collaborated with AWS to introduce the ability to deploy open source LLMs and FMs with just a few clicks. This user-friendly functionality removes the complexity traditionally associated with deploying these advanced models, empowering more customers to harness the power of generative AI.
Simplified inference deployment for Kubernetes customers – Cisco’s deep expertise in Kubernetes and container technologies helped the SageMaker team develop new Kubernetes Operator-based inference capabilities. These innovations make it straightforward for customers running applications on Kubernetes to deploy and manage generative AI models, reducing LLM deployment costs by 50% on average.
Using NVIDIA Triton Inference Server for generative AI – Cisco worked with AWS to integrate the NVIDIA Triton Inference Server, a high-performance model serving container managed by SageMaker, to power generative AI inference on SageMaker Inference. This enabled the WxAI team to scale their AI/ML workloads reliably and deliver high-performance inference for demanding generative AI use cases.
Packaging generative AI models more efficiently – To further simplify the generative AI model lifecycle, Cisco worked with AWS to enhance the capabilities in SageMaker for packaging LLMs and FMs for deployment. These improvements make it straightforward to prepare and deploy these generative AI models, accelerating their adoption and integration.
Improved documentation for generative AI – Recognizing the importance of comprehensive documentation to support the growing generative AI ecosystem, Cisco collaborated with the AWS team to enhance the SageMaker documentation. This includes detailed guides, best practices, and reference materials tailored specifically for generative AI use cases, helping customers quickly ramp up their generative AI initiatives on the SageMaker platform.

By closely partnering with the SageMaker Inference team, Cisco has played a pivotal role in driving the rapid evolution of generative AI Inference capabilities in SageMaker. The features and optimizations introduced through this collaboration are empowering AWS customers to unlock the transformative potential of generative AI with greater ease, cost-effectiveness, and performance.

“Our partnership with the SageMaker Inference product team goes back to the early days of generative AI, and we believe the features we have built in collaboration, from cost optimizations to high-performance model deployment, will broadly help other enterprises rapidly adopt and scale generative AI workloads on SageMaker, unlocking new frontiers of innovation and business transformation.”
– Travis Mehlinger, Principal Engineer at Cisco.

Conclusion
By using AWS services like SageMaker Inference and Amazon Bedrock for generative AI, Cisco’s WxAI team has been able to optimize their AI/ML infrastructure, enabling them to build and deploy AI-powered features more efficiently, reliably, and cost-effectively. This strategic approach has unlocked significant benefits for Cisco in deploying and scaling its generative AI capabilities for the Webex platform. Cisco’s own journey with generative AI, as showcased in this post, offers valuable lessons and insights for other users of SageMaker Inference.
Recognizing the impact of generative AI, Cisco has played a crucial role in shaping the future of these capabilities within SageMaker Inference. By providing valuable insights and hands-on collaboration, Cisco has helped AWS develop a range of powerful features that are making generative AI more accessible and scalable for organizations. From optimizing infrastructure costs and performance to streamlining model deployment and scaling, Cisco’s contributions have been instrumental in enhancing the SageMaker Inference service.
Moving forward, the Cisco-AWS partnership aims to drive further advancements in areas like conversational and generative AI inference. As generative AI adoption accelerates across industries, Cisco’s Webex platform is designed to scale and streamline user experiences through various use cases discussed in this post and beyond. You can expect to see ongoing innovation from this collaboration in SageMaker Inference capabilities, as Cisco and SageMaker Inference continue to push the boundaries of what’s possible in the world of AI.
For more information on Webex Contact Center’s Topic Analytics feature and related AI capabilities, refer to The Webex Advantage: Navigating Customer Experience in the Age of AI on the Webex blog.

About the Authors
Travis Mehlinger is a Principal Software Engineer in the Webex Collaboration AI group, where he helps teams develop and operate cloud-centered AI and ML capabilities to support Webex AI features for customers around the world. In his spare time, Travis enjoys cooking barbecue, playing video games, and traveling around the US and UK to race go-karts.
Karthik Raghunathan is the Senior Director for Speech, Language, and Video AI in the Webex Collaboration AI Group. He leads a multidisciplinary team of software engineers, machine learning engineers, data scientists, computational linguists, and designers who develop advanced AI-driven features for the Webex collaboration portfolio. Prior to Cisco, Karthik held research positions at MindMeld (acquired by Cisco), Microsoft, and Stanford University.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.
Ravi Thakur is a Senior Solutions Architect at AWS, based in Charlotte, NC. He specializes in solving complex business challenges using distributed, cloud-centered, and well-architected patterns. Ravi’s expertise includes microservices, containerization, AI/ML, and generative AI. He empowers AWS strategic customers on digital transformation journeys, delivering bottom-line benefits. In his spare time, Ravi enjoys motorcycle rides, family time, reading, movies, and traveling.
Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.
Madhur Prashant is an AI and ML Solutions Architect at Amazon Web Services. He is passionate about the intersection of human thinking and generative AI. His interests lie in generative AI, specifically building solutions that are helpful and harmless, and most of all optimal for customers. Outside of work, he loves doing yoga, hiking, spending time with his twin, and playing the guitar.

Discover insights from Box with the Amazon Q Box connector

Seamless access to content and insights is crucial for delivering exceptional customer experiences and driving successful business outcomes. Box, a leading cloud content management platform, serves as a central repository for diverse digital assets and documents in many organizations. An enterprise Box account typically contains a wealth of materials, including documents, presentations, knowledge articles, and more. However, extracting meaningful information from the vast amount of Box data can be challenging without the right tools and capabilities. Employees in roles such as customer support, project management, and product management require the ability to effortlessly query Box content, uncover relevant insights, and make informed decisions that address customer needs effectively.
Building a generative artificial intelligence (AI)-powered conversational application that is seamlessly integrated with your enterprise’s relevant data sources requires time, money, and people. First, you need to develop connectors to those data sources. Next, you need to index this data to make it available for a Retrieval Augmented Generation (RAG) approach where relevant passages are delivered with high accuracy to a large language model (LLM). To do this, you need to select an index that provides the capabilities to index the content for semantic and vector search, build the infrastructure to retrieve and rank the answers, and build a feature-rich web application. You also need to hire and staff a large team to build, maintain, and manage such a system.
Amazon Q Business is a fully managed generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business can help you get fast, relevant answers to pressing questions, solve problems, generate content, and take action using the data and expertise found in your company’s information repositories, code, and enterprise systems (such as Box, among others). Amazon Q provides out-of-the-box native data source connectors that can index content into a built-in retriever and uses an LLM to provide accurate, well-written answers. A data source connector is a component of Amazon Q that helps integrate and synchronize data from multiple repositories into one index.
Amazon Q Business offers multiple prebuilt connectors to a large number of data sources, including Box Content Cloud, Atlassian Confluence, Amazon Simple Storage Service (Amazon S3), Microsoft SharePoint, Salesforce, and many more, and helps you create your generative AI solution with minimal configuration. For a full list of Amazon Q Business supported data source connectors, see Amazon Q Business connectors.
In this post, we guide you through the process of configuring and integrating Amazon Q for Business with your Box Content Cloud. This will enable your support, project management, product management, leadership, and other teams to quickly obtain accurate answers to their questions from the documents stored in your Box account.
Find accurate answers from Box documents using Amazon Q Business
After you integrate Amazon Q Business with Box, you can ask questions based on the documents stored in your Box account. For example:

Natural language search – You can search for information within documents located in any folder by using conversational language, simplifying the process of finding desired data without the need to remember specific keywords or filters.
Summarization – You can ask Amazon Q Business to summarize the contents of documents to meet your needs. This enables you to quickly understand the main points and find relevant information without having to scan through each document manually.

Overview of the Box connector for Amazon Q Business
To crawl and index contents in Box, you can configure the Amazon Q Business Box connector as a data source in your Amazon Q Business application. When you connect Amazon Q Business to a data source and initiate the sync process, Amazon Q Business crawls and indexes documents from the data source into its index.
Types of documents
Let’s look at what is considered a document in the context of the Amazon Q Business Box connector. A document is a collection of information that consists of a title, the content (or body), metadata (data about the document), and access control list (ACL) information, which makes sure answers are provided only from documents that the user has access to.
The Amazon Q Business Box connector supports crawling of the following entities in Box:

Files – Each file is considered a single document
Comments – Each comment is considered a single document
Tasks – Each task is considered a single document
Web links – Each web link is considered a single document

Additionally, Box users can create custom objects and custom metadata fields. Amazon Q supports the crawling and indexing of these custom objects and custom metadata.
The Amazon Q Business Box connector also supports indexing a rich set of metadata from the various entities in Box, and it lets you map these source metadata fields to Amazon Q index fields. These field mappings allow you to map Box field names to Amazon Q index field names. Amazon Q connectors support two types of metadata fields:

Reserved or default fields – These are required with each document, such as the title, creation date, or author
Custom metadata fields – These are fields you create in the data source, in addition to the default fields the data source already provides

Refer to Box data source connector field mappings for more information.
Authentication
Before you index content from Box, you first need to establish a secure connection between the Amazon Q Business connector for Box and your Box cloud instance. To establish a secure connection, you need to authenticate with the data source. Let’s look at the supported authentication mechanism for the Box connector.
The Amazon Q Box connector uses Box server authentication with JSON Web Tokens (JWT) as its authentication method. This approach requires the configuration of several parameters, including the Box client ID, client secret, public key ID, private key, and passphrase. With this token-based JWT authentication, the Amazon Q Business assistant can securely connect to and interact with data stored in the Box platform on behalf of your organization.
Refer to JWT Auth in the Box Developer documentation for more information on setting up and managing JWT tokens in Box.
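To sanity check the JWT credentials outside of Amazon Q Business, you can use the official Box Python SDK (boxsdk). The following is a minimal sketch, assuming the boxsdk package with its JWT extras is installed and that the configuration JSON downloaded from the Box Developer Console is saved as box_config.json (the file name is an assumption):

from boxsdk import JWTAuth, Client

# Requires: pip install "boxsdk[jwt]"
# box_config.json is the app settings file downloaded from the Box Developer Console
auth = JWTAuth.from_settings_file("box_config.json")
client = Client(auth)

# Fetch the app's service account to confirm the connection works
service_account = client.user().get()
print(f"Authenticated as: {service_account.name} ({service_account.login})")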
Supported Box subscriptions
To integrate Amazon Q Business with Box using the Box connector, access to Box Enterprise or Box Enterprise Plus plans is required. Both plans provide the necessary capabilities to create a custom application, download a JWT token as an administrator, and then configure the connector to ingest relevant data from Box.
Secure querying with ACL crawling, identity crawling, and User Store
The success of Amazon Q Business applications hinges on two key factors: making sure end-users only see responses generated from documents they have access to, and maintaining the privacy and security of each user’s conversation history. Amazon Q Business achieves this by validating the user’s identity every time they access the application, and using this to restrict tasks and answers to the user’s authorized documents. This is accomplished through the integration of AWS IAM Identity Center, which serves as the authoritative identity source and validates users. You can configure IAM Identity Center to use your enterprise identity provider (IdP)—such as Okta or Microsoft Entra ID—as the identity source.
ACLs and identity crawling are enabled by default and can’t be disabled. The Box connector automatically retrieves user identities and ACLs from the connected data sources. This allows Amazon Q Business to filter chat responses based on the end-user’s document access level, so they only see the information they are authorized to view. If you need to index documents without ACLs, you must explicitly mark them as public in your data source. For more information on how the Amazon Q Business connector crawls Box ACLs, refer to How Amazon Q Business connector crawls Box ACLs.
In the Box platform, an administrative user can provision additional user accounts and assign varying permission levels, such as viewer, editor, or co-owner, to files or folders. Fine-grained access is further enhanced through the Amazon Q User Store, which is an Amazon Q data source connector feature that streamlines user and group management across all the data sources attached to your application. This granular permission mapping enables Amazon Q Business to efficiently enforce access controls based on the user’s identity and permissions within the Box environment. For more information on the Amazon Q Business User store, refer to Understanding Amazon Q Business User Store.
Solution overview
In this post, we walk through the steps to configure a Box connector for an Amazon Q Business application. We use an existing Amazon Q application and configure the Box connector to sync data from specific Box folders, map relevant Box fields to the Amazon Q index, initiate the data sync, and then query the ingested Box data using the Amazon Q web experience.
As part of querying the Amazon Q Business application, we cover how to ask natural language questions on documents present in your Box folders and get back relevant results and insights using Amazon Q Business.
Prerequisites
For this walkthrough, you need the following:

An Amazon Q Business application. If you haven’t created one yet, refer to Build private and secure enterprise generative AI apps with Amazon Q Business and AWS IAM Identity Center for instructions.
Privileges to create a new Amazon Q application or add data sources to an existing application, AWS resources, and AWS Identity and Access Management (IAM) roles and policies.
A Box Enterprise or Box Enterprise Plus account.
A Box user with admin rights.
Access to AWS Secrets Manager.
Privileges to create IAM Identity Center users.

Create users in IAM Identity Center
For this post, you need to create three sample users in IAM Identity Center. One user will act as the admin user; the other two will serve as department-specific users. This is to simulate the configuration of user-level access control on distinct folders within your Box account. Make sure to use the same email addresses when creating the users in your Box account.
Complete the following steps to create the users in IAM Identity Center:

On the IAM Identity Center console, choose Users in the navigation pane.
Choose Add user.
For Username, enter a user name. For example, john_doe.
For Password, select Send an email to this user with password setup instructions.
For Email address and Confirm email address, enter your email address.
For First name and Last name, enter John and Doe, respectively. You can also provide your preferred first and last names if necessary.
Keep all other fields as default and choose Next.

On the Add user to groups page, keep everything as default and choose Next.
Verify the details on the Review and add user page, then choose Add user.

The user will get an email containing a link to join IAM Identity Center.

Choose Accept Invitation and set up a password for your user. Remember to note it down for testing the Amazon Q Business application later.
If required by your organization, complete the multi-factor authentication (MFA) setup for this user to enhance security during sign-in.
Confirm that you can log in as the first user using the credentials you created in the previous step.
Repeat the previous steps to create your second department-specific user. Use a different email address for this user. For example, set Username as mary_major, First name as Mary, and Last name as Major. Alternatively, you can use your own values if preferred.
Verify that you can log in as the second user using the credentials you created in the previous step.
Repeat the previous steps to create the third user, who will serve as the admin. Use your Box admin user’s email address for this account, and choose your preferred user name, first name, and last name. For this example, saanvi_sarkar will act as the admin user.
Confirm that you can log in as the admin user using the credentials you created in the previous step.

This concludes the setup of all three users in IAM Identity Center, each with a unique email address.

Create two users in your Box account
For this example, you need two demo users in your Box account in addition to the admin user. Complete the following steps to create these two demo users, using the same email addresses you used when setting up these users in IAM Identity Center:

Log in to your Box Enterprise Admin Console as an admin user.
Choose Users & Groups in the navigation pane.

On the Managed Users tab, the admin user is listed by default.

To create your first department-specific user, choose Add Users, then choose Add Users Manually.

Enter the same name and email address that you used while creating this first department-specific user in IAM Identity Center. For example, use John Doe for Name and his email address for Email. You don’t need to specify groups or folders.
Select the acknowledgement check box to agree to the payment method for adding this new user to your Box account.
Choose Next.

On the Add Users page, choose Complete to agree and add this new user to your Box account.
To create your second department-specific user, choose Add Users, then choose Add Users Manually.
Enter the same name and email address that you used while creating this second department-specific user in IAM Identity Center. For example, use Mary Major for Name and her email address for Email. You don’t need to specify groups or folders.

You now have all three users provisioned in your Box account.

Create a custom Box application for Amazon Q
Before you configure the Box data source connector in Amazon Q Business, you create a custom Box application in your Box account.
Complete the following steps to create an application and configure its authentication method:

Log in to your Box Enterprise Developer Console as an admin user.
Choose My Apps in the navigation pane.
Choose Create New App.
Choose Custom App.

For App name, enter a name for your app. For example, AmazonQConnector.
For Purpose, choose Other.
For Please specify, enter Other.
Leave the other options blank and choose Next.

For Authentication Method, select Server Authentication (with JWT).
Choose Create App.

In My Apps, choose your newly created app and go to the Configuration tab.
In the App Access Level section, choose App + Enterprise Access.

In the Application Scopes section, select the following permissions:

Write all files and folders stored in Box
Manage users
Manage groups
Manage enterprise properties

In the Advanced Features section, select Make API calls using the as-user header.
In the Add and Manage Public Keys section, choose Generate a Public/Private Keypair.

Complete the two-step verification process and choose OK to download the JSON file to your computer.

Choose Save Changes.
On the Authorization tab, choose Review and Submit.

In the Review App Authorization Submission pop-up, for App description, enter AmazonQConnector and choose Submit.

Your Box Enterprise owner needs to approve the application before you can use it. Complete the following steps to authorize it:

Log in to your Box Enterprise Admin Console as the admin user.
Choose Apps in the navigation pane and choose the Custom Apps Manager tab to view the apps that need to be authorized.
Choose the AmazonQConnector app that says Pending Authorization.
Choose the options menu (three dots) and choose Authorize App.

Choose Authorize in the Authorize App pop-up.

This will authorize your AmazonQConnector application and change the status to Authorized.

You can review the downloaded JSON file in your computer’s downloads directory. It contains the client ID, client secret, public key ID, private key, passphrase, and enterprise ID, which you’ll need when creating the Box data source in a later step.
Add sample documents to your Box account
In this step, you upload sample documents to your Box account. Later, you use the Amazon Q Box data source connector to crawl and index these documents.

Download the zip file to your computer.
Extract the files to a folder called AWS_Whitepapers.

Log in to your Box Enterprise account as an admin user.
Upload the AWS_Whitepapers folder to your Box account.

At the time of writing, this folder contains 6 folders and 60 files within them.
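If you prefer to script the upload instead of using the Box web UI, the following hedged sketch uses boxsdk to recreate the AWS_Whitepapers folder structure in Box. The JWT configuration file name and local folder path are assumptions, and note that a JWT app acts as its service account by default, so you may need client.as_user(...) to create the content under the admin user’s account:

import os
from boxsdk import JWTAuth, Client

auth = JWTAuth.from_settings_file("box_config.json")  # assumed file name
client = Client(auth)

# '0' is the Box root folder ID; create the top-level AWS_Whitepapers folder there
whitepapers = client.folder("0").create_subfolder("AWS_Whitepapers")

# Upload each file, preserving the first level of subfolders
# (Machine_Learning, Databases, and so on)
local_root = "AWS_Whitepapers"  # path to the locally extracted folder
for subdir in sorted(os.listdir(local_root)):
    subdir_path = os.path.join(local_root, subdir)
    if not os.path.isdir(subdir_path):
        continue
    box_subfolder = whitepapers.create_subfolder(subdir)
    for file_name in sorted(os.listdir(subdir_path)):
        box_subfolder.upload(os.path.join(subdir_path, file_name))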

Set user-specific permissions on folders in your Box account
In this step, you set up user-level access control for two users on two separate folders in your Box account.
For this ACL simulation, consider the two department-specific users created earlier. Assume John is part of the machine learning (ML) team, so he needs access only to the Machine_Learning folder contents, whereas Mary belongs to the database team, so she needs access only to the Databases folder contents.
Log in to your Box account as an admin and grant viewer access to each user for their respective folders, as shown in the following screenshots. This restricts them to see only their assigned folder’s contents.
The Machine_Learning folder is accessible to the owner and user John Doe only.

The Databases folder is accessible to the owner and user Mary Major only.
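You can also grant these viewer permissions programmatically. The following is a minimal sketch with boxsdk; the folder IDs and email addresses are placeholders you would replace with your own values:

from boxsdk import JWTAuth, Client
from boxsdk.object.collaboration import CollaborationRole

auth = JWTAuth.from_settings_file("box_config.json")  # assumed file name
client = Client(auth)

# Placeholder folder IDs: copy them from each folder's URL in the Box web app
ml_folder = client.folder("<machine-learning-folder-id>")
db_folder = client.folder("<databases-folder-id>")

# Grant viewer access to each department-specific user on their respective folder
ml_folder.collaborate_with_login("john_doe@example.com", CollaborationRole.VIEWER)
db_folder.collaborate_with_login("mary_major@example.com", CollaborationRole.VIEWER)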

Configure the Box connector for your Amazon Q Business application
Complete the following steps to configure your Box connector for Amazon Q Business:

On the Amazon Q Business console, choose Applications in the navigation pane.
Select the application you want to add the Box connector to.
On the Actions menu, choose Edit.

On the Update application page, leave all values unchanged and choose Update.

On the Update retriever page, leave all values unchanged and choose Next.

On the Connect data sources page, on the All tab, search for Box.
Choose the plus sign next to the Box connector.

On the Add data source page, for Data source name, enter a name, for example, box-data-source.
Open the JSON file you downloaded from the Box Developer Console.

The file contains values for clientID, clientSecret, publicKeyID, privateKey, passphrase, and enterpriseID.

In the Source section, for Box enterprise ID, enter the value of the enterpriseID key from the JSON file.

For Authorization, no change is needed because by default the ACLs are set to ON for the Box data source connector.
In the Authentication section, under AWS Secrets Manager secret, choose Create and add a new secret.
For Secret name, enter a name for the secret, for example, connector. The prefix QBusiness-Box- is automatically added for you.
For the remaining fields, enter the corresponding values from the downloaded JSON file.
Choose Save to add the secret.
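The console creates this secret for you, but if you would rather provision it ahead of time with the AWS SDK, the following is a hedged sketch using boto3. The secret name and the key names inside the secret are illustrative assumptions; align them with the fields the Box connector configuration asks for:

import json
import boto3

secrets_client = boto3.client("secretsmanager")

# Values come from the JSON file downloaded from the Box Developer Console.
# The key names are assumptions for illustration only.
box_credentials = {
    "clientID": "<your-client-id>",
    "clientSecret": "<your-client-secret>",
    "publicKeyID": "<your-public-key-id>",
    "privateKey": "<your-private-key>",
    "passphrase": "<your-passphrase>",
}

response = secrets_client.create_secret(
    Name="QBusiness-Box-connector",  # mirrors the prefix the console adds
    SecretString=json.dumps(box_credentials),
)
print(response["ARN"])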

In the Configure VPC and Security group section, use the default setting (No VPC) for this post.
Identity crawling is enabled by default, so no changes are necessary.

In the IAM role section, choose Create a new role (Recommended) and enter a role name, for example, box-role.

For more information on the required permissions to include in the IAM role, see IAM roles for data sources.

In the Sync scope section, in addition to file contents, you can include Box web links, comments, and tasks in your index. Use the default setting (unchecked) for this post.
In the Additional configuration section, you can choose to include or exclude regular expression (regex) patterns. These regex patterns can be applied based on the file name, file type, or file path. For this demo, we skip the regex patterns configuration.

In the Sync mode section, select New, modified, or deleted content sync.
In the Sync run schedule section, choose Run on demand.

In the Field Mappings section, keep the default settings.

After you complete the retriever creation, you can modify field mappings and add custom field attributes. You can access field mapping by editing the data source.

Choose Add data source and wait for the retriever to get created.

It can take a few seconds for the required roles and the connector to be created.

After the data source is created, you’re redirected to the Connect data sources page to add more data sources as needed.

For this walkthrough, choose Next.
In the Update groups and users section, choose Add groups and users to add the groups and users from IAM Identity Center set up by your administrator.

In the Add or assign users and groups pop-up, select Assign existing users and groups to add existing users configured in your connected IAM Identity Center and choose Next.

Optionally, if you have permissions to add users to connected IAM Identity Center, you can select Add new users.

On the Assign users and groups page, choose Get Started.
In the search box, enter John Doe and choose his user name.

Add the second user, Mary Major, by entering her name in the search box.

Optionally, you can add the admin user to this application.
Choose Assign to add these users to this Amazon Q app.
In the Groups and users section, choose the Users tab, where you will see no subscriptions configured currently.
Choose Manage access and subscriptions to configure the subscription.

On the Manage access and subscriptions page, choose the Users tab.
Select your users.
Choose Change subscription and choose Update subscription tier.

On the Confirm subscription change page, for New subscription, choose Business Pro.
Choose Confirm.

Verify the changed subscription for all three users, then choose Done.

Choose Update application to complete adding and setting up the Box data connector for Amazon Q Business.

Configure Box field mappings
To help you structure data for retrieval and chat filtering, Amazon Q Business crawls data source document attributes or metadata and maps them to fields in your Amazon Q index. Amazon Q has reserved fields that it uses when querying your application. When possible, Amazon Q automatically maps these built-in fields to attributes in your data source.
If a built-in field doesn’t have a default mapping, or if you want to map additional index fields, use the custom field mappings to specify how a data source attribute maps to your Amazon Q application.

On the Amazon Q Business console, choose your application.
Under Data sources, select your data source.
On the Actions menu, choose Edit.

In the Field mappings section, select the available fields you want to crawl under Files and folders, Comments, Tasks, and Web Links, then choose Update.

 When selecting all items, make sure you navigate through each page by choosing the page numbers and selecting Select All on every page to include all mapped items.

Index sample documents from the Box account
The Box connector setup for Amazon Q is now complete. Because you configured the data source sync schedule to run on demand, you need to start it manually.
In the Data sources section, choose the data source box-data-source and choose Sync now.

The Current sync state changes to Syncing – crawling, then to Syncing – indexing.

After a few minutes, the Current sync state changes to Idle, the Last sync status changes to Successful, and the Sync run history section shows more details, including the number of documents added.
As shown in the following screenshot, Amazon Q has successfully scanned and added all 60 files from the AWS_Whitepapers Box folder.
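If you want to trigger the on-demand sync from a script rather than the console, the following is a hedged sketch using the boto3 Amazon Q Business client and its StartDataSourceSyncJob operation; the application, index, and data source IDs are placeholders:

import boto3

qbusiness = boto3.client("qbusiness")

# Placeholder IDs: copy these from the Amazon Q Business console
response = qbusiness.start_data_source_sync_job(
    applicationId="<your-application-id>",
    indexId="<your-index-id>",
    dataSourceId="<your-box-data-source-id>",
)
print(response)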

Query Box data using the Amazon Q web experience
Now that the data synchronization is complete, you can start exploring insights from Amazon Q. In the newly created Amazon Q application, choose Customize web experience to open a new tab with a preview of the UI and options to customize according to your needs.
You can customize the Title, Subtitle, and Welcome message as needed, which will be reflected in the UI.

For this walkthrough, we use the defaults and choose View web experience to be redirected to the login page for the Amazon Q application.

Log in to the application as your first department-specific user, John Doe, using the credentials for the user that were added to the Amazon Q application.

When the login is successful, you’ll be redirected to the Amazon Q assistant UI, where you can start asking questions using natural language and get insights from your Box index.

Enter a prompt in the Amazon Q Business AI assistant at the bottom, such as “What AWS AI/ML service can I use to convert text from one language to another?” Press Enter or choose the arrow icon to generate the response. You can also try your own prompts.

Because John Doe has access to the Machine_Learning folder, Amazon Q Business successfully processed his ML-related query and displayed the response. You can choose Sources to view the source files that contributed to the response and verify where it came from.

Let’s attempt a different prompt related to the Databases folder, which John doesn’t have access to. Enter the prompt “How to reduce the amount of read traffic and connections to my Amazon RDS database?” or choose your own database-related prompt. Press Enter or choose the arrow icon to generate the response.

As anticipated, the Amazon Q Business application responds that it couldn’t generate a reply from the documents John can access, because he lacks access to the Databases folder.

Go back to the Amazon Q Business Applications page and choose your application again.
This time, open the web experience URL in private mode to initiate a new session, avoiding interference with the previous session.
Log in as Mary Major, the second department-specific user. Use her user name, password, and any MFA you set up initially.
Enter a prompt in the Amazon Q Business AI assistant at the bottom, such as “How to reduce the amount of read traffic and connections to my Amazon RDS database?” Press Enter or choose the arrow icon to generate the response. You can also try your own prompts.

Because Mary has access to the Databases folder, Amazon Q Business successfully processed her database-related query and displayed the response. You can choose Sources to view the source files that contributed to generating the response.

Now, let’s attempt a prompt that contains information from the Machine_Learning folder, which Mary isn’t authorized to access. Enter the prompt “What AWS AI/ML service can I use to convert text from one language to another?” or choose your own ML-related prompt.

As anticipated, the Amazon Q Business application will indicate it couldn’t generate a response because Mary lacks access to the Machine_Learning folder.

The preceding test scenarios illustrate the functionality of the Amazon Q Box connector in crawling and indexing documents along with their associated ACLs. With this mechanism, only users with the relevant permissions can access the respective folders and files within the linked Box account.
Congratulations! You’ve used Amazon Q to surface answers and insights from the content indexed from your Box account.
Frequently asked questions
In this section, we provide guidance to frequently asked questions.
Amazon Q Business is unable to answer your questions
If you get the response “Sorry, I could not find relevant information to complete your request,” this may be due to a few reasons:

No permissions – ACLs applied to your Box account don’t allow you to query certain data sources. If this is the case, reach out to your application administrator to make sure your ACLs are configured to access the data sources.
Data connector sync failed – Your data connector may have failed to sync information from the source to the Amazon Q Business application. Verify the data connector’s sync run schedule and sync history to confirm the sync is successful.
Incorrect regex pattern – Validate the correct definition of the regex include or exclude pattern when setting up the Box data source.

If none of these reasons apply to your use case, open a support case and work with your technical account manager to get this resolved.
How to generate responses from authoritative data sources
If you want Amazon Q Business to only generate responses from authoritative data sources, the use of guardrails can be highly beneficial. Within the application settings, you can specify the authorized data repositories, such as content management systems and knowledge bases, from which the assistant is permitted to retrieve and synthesize information. By defining these approved data sources as guardrails, you can instruct Amazon Q Business to only use reliable, up-to-date, and trustworthy information, eliminating the risk of incorporating data from non-authoritative or potentially unreliable sources.
Additionally, Amazon Q Business offers the capability to define content filters as part of Guardrails for Amazon Bedrock. These filters can specify the types of content, topics, or keywords deemed appropriate and aligned with your organization’s policies and standards. By incorporating these content-based guardrails, you can further refine the assistant’s responses to make sure they align with your authoritative information and messaging. The integration of Amazon Q Business with IAM Identity Center also serves as a critical guardrail, allowing you to validate user identities and align ACLs to make sure end-users only receive responses based on their authorized data access.
Amazon Q Business responds using old (stale) data even though your data source is updated
If you find that Amazon Q Business is responding with outdated or stale data, you can use the relevance tuning and boosting features to surface the latest documents. The relevance tuning functionality allows you to adjust the weightings assigned to various document attributes, such as recency, to prioritize the most recent information. Boosting can also be used to explicitly elevate the ranking of the latest documents, making sure they are prominently displayed in the assistant’s responses. For more information on relevance tuning, refer to Boosting chat responses using relevance tuning.
Additionally, it’s important to review the sync schedule and status for your data connectors. Verifying the sync frequency and the last successful sync run can help identify any issues with data freshness. Adjusting the sync schedule or running manual syncs, as needed, can help keep the data up to date and improve the relevance of the Amazon Q Business responses. For more information, refer to Sync run schedule.
Clean up
To prevent incurring additional costs, it’s essential to clean up and remove any resources created during the implementation of this solution. Specifically, you should delete the Amazon Q application, which will consequently remove the associated index and data connectors. However, any IAM roles and secrets created during the Amazon Q application setup process need to be removed separately. Failing to clean up these resources may result in ongoing charges, so it’s crucial to take the necessary steps to completely remove all components related to this solution.
Complete the following steps to delete the Amazon Q application, secret, and IAM role:

On the Amazon Q Business console, select the application that you created.
On the Actions menu, choose Delete and confirm the deletion.
On the Secrets Manager console, select the secret that was created for the Box connector.
On the Actions menu, choose Delete.
Select the waiting period as 7 days and choose Schedule deletion.
On the IAM console, select the role that was created during the Amazon Q application creation.
Choose Delete and confirm the deletion.
Delete the AWS_Whitepapers folder and its contents from your Box account.
Delete the two demo users that you created in your Box Enterprise account.
On the IAM Identity Center console, choose Users in the navigation pane.
Select the three demo users that you created and choose Delete users to remove these users.

Conclusion
The Amazon Q Box connector allows organizations to seamlessly integrate their Box files into the powerful generative AI capabilities of Amazon Q. By following the steps outlined in this post, you can quickly configure the Box connector as a data source for Amazon Q and initiate synchronization of your Box information. The native field mapping options enable you to customize exactly which Box data to include in Amazon Q’s index.
Amazon Q can serve as a powerful assistant capable of providing rich insights and summaries about your Box files directly from natural language queries.
The Amazon Q Box integration represents a valuable tool for software teams to gain AI-driven visibility into their organization’s document repository. By bridging Box’s industry-leading content management with Amazon’s cutting-edge generative AI, teams can drive productivity, make better informed decisions, and unlock deeper insights into their organization’s knowledge base. As generative AI continues advancing, integrations like this will become critical for organizations aiming to deliver streamlined, data-driven software development lifecycles.
To learn more about the Amazon Q connector for Box, refer to Connecting Box to Amazon Q.

About the Author
Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel and ride his motorcycle in Texas Hill Country.
Senthil Kamala Rathinam is a Solutions Architect at Amazon Web Services specializing in data and analytics. He is passionate about helping customers design and build modern data platforms. In his free time, Senthil loves to spend time with his family and play badminton.
Vijai Gandikota is a Principal Product Manager in the Amazon Q and Amazon Kendra organization of Amazon Web Services. He is responsible for the Amazon Q and Amazon Kendra connectors, ingestion, security, and other aspects of the Amazon Q and Amazon Kendra services.

Intel Labs Introduce RAG Foundry: An Open-Source Python Framework for …

Open-source libraries facilitated RAG pipeline creation but lacked comprehensive training and evaluation capabilities. Proposed frameworks for RAG-based large language models (LLMs) omitted crucial training components. Novel approaches, such as treating LLM prompting as a programming language, emerged but introduced complexity. Evaluation methodologies using synthetic data and LLM critics were developed to assess RAG performance. Studies investigated the impact of retrieval mechanisms on RAG systems. Concurrent frameworks offered RAG implementations and datasets but often imposed rigid workflows. Intel Labs introduces RAG Foundry built upon these contributions, providing a flexible, extensible framework for comprehensive RAG system development and experimentation.

RAG Foundry emerges as a comprehensive solution to the challenges inherent in Retrieval-Augmented Generation (RAG) systems. This open-source framework integrates data creation, training, inference, and evaluation into a unified workflow. It enables rapid prototyping, dataset generation, and model training using specialized knowledge sources. The modular structure, controlled by configuration files, ensures inter-module compatibility and supports isolated experimentation. RAG Foundry’s customizable nature facilitates thorough experimentation across various RAG aspects, including data selection, retrieval, and prompt design.

Researchers identify several critical challenges in the implementation and evaluation of Retrieval-Augmented Generation (RAG) systems. These include the inherent complexity of RAG systems, which demand deep understanding of data and intricate design decisions. Evaluation difficulties arise from the need to assess both retrieval accuracy and generative quality. Reproducibility issues stem from variations in training data and configurations. Existing frameworks often lack support for diverse use cases and customization options. The need for a flexible framework allowing comprehensive experimentation across all RAG aspects is evident. RAG Foundry emerges as a solution to these challenges, offering a customizable and integrated approach.

The methodology for RAG Foundry employs a modular approach with four distinct components: data creation, training, inference, and evaluation. Data creation involves selecting and preparing relevant datasets for RAG tasks. Training focuses on fine-tuning LLMs using various RAG techniques. Inference generates predictions based on processed datasets. The evaluation assesses model performance using local and global metrics, including an Answer Processor for custom logic. Experiments were conducted on knowledge-intensive tasks like TriviaQA, ASQA, and PubmedQA to test RAG improvements. Results analysis compared outcomes across datasets, emphasizing main metrics, faithfulness, and relevancy scores.
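To illustrate the modular, configuration-driven idea, the following is a hypothetical sketch of a four-stage pipeline; it is not RAG Foundry’s actual API, and all module and configuration names are invented for illustration:

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class PipelineConfig:
    # Invented configuration keys for illustration only
    dataset: str
    retriever: str
    prompt_template: str
    metrics: List[str]

def create_data(cfg: PipelineConfig) -> List[Dict]:
    # Stage 1: select examples and attach retrieved context
    return [{"question": "q1", "context": f"retrieved via {cfg.retriever}", "answer": "a1"}]

def train(cfg: PipelineConfig, data: List[Dict]) -> Callable[[str], str]:
    # Stage 2: fine-tune or configure a model (stubbed here)
    return lambda prompt: "model output"

def infer(model: Callable[[str], str], cfg: PipelineConfig, data: List[Dict]) -> List[str]:
    # Stage 3: generate predictions from the processed dataset
    return [model(cfg.prompt_template.format(**ex)) for ex in data]

def evaluate(preds: List[str], data: List[Dict], cfg: PipelineConfig) -> Dict[str, float]:
    # Stage 4: compute the metrics named in the configuration (stubbed here)
    return {metric: 0.0 for metric in cfg.metrics}

cfg = PipelineConfig(
    dataset="triviaqa",
    retriever="bm25",
    prompt_template="Context: {context}\nQuestion: {question}\nAnswer:",
    metrics=["exact_match", "faithfulness", "relevancy"],
)
data = create_data(cfg)
model = train(cfg, data)
print(evaluate(infer(model, cfg, data), data, cfg))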

These datasets offer diverse question-answering scenarios, including general knowledge and biomedical domains. Chosen for their complexity and relevance to knowledge-intensive tasks, they enable comprehensive assessment of RAG techniques. This approach highlights the importance of multi-aspect metrics in evaluation and demonstrates the RAG Foundry framework’s effectiveness in enhancing LLMs for various RAG applications.

The RAG Foundry experiment evaluated Retrieval-Augmented Generation techniques across TriviaQA, ASQA, and PubmedQA datasets, revealing diverse performance outcomes. In TriviaQA, retrieved context integration and RAG fine-tuning improved results, while Chain-of-Thought (CoT) reasoning decreased performance. ASQA saw improvements with all methods, particularly fine-tuned CoT. For PubmedQA, most methods outperformed the baseline, with fine-tuned RAG showing best results. Notably, only CoT configurations produced evaluable reasoning for PubmedQA’s binary answers. These findings underscore the dataset-dependent efficacy of RAG techniques, highlighting the need for tailored approaches in enhancing model performance across varied contexts.

In conclusion, the researchers introduced an open-source library designed to enhance large language models for Retrieval-Augmented Generation tasks. The framework demonstrates its effectiveness through experiments on two models across three datasets, utilizing comprehensive evaluation metrics. RAG Foundry’s modular design facilitates customization and rapid experimentation in data creation, training, inference, and evaluation. The robust evaluation process incorporates both local and global metrics, including an Answer Processor for custom logic. While showcasing the potential of RAG techniques in improving model performance, the study also highlights the need for careful evaluation and ongoing research to refine these methods, positioning RAG Foundry as a valuable tool for researchers in this evolving field.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

The post Intel Labs Introduce RAG Foundry: An Open-Source Python Framework for Augmenting Large Language Models LLMs for RAG Use Cases appeared first on MarkTechPost.

DAGify: An Open-Source Program for Streamlining and Expediting the Tra …

Agile and cloud-native solutions are in high demand in the rapidly evolving fields of workflow orchestration and data engineering. Control-M and other legacy enterprise schedulers have long served as the backbone of many organizations’ operations. As the market moves toward more adaptable and scalable systems, however, Apache Airflow has become the go-to option for contemporary data workflow management. Yet switching from Control-M to Apache Airflow can be difficult and time-consuming.

Across many industries, Control-M has proven to be a dependable and robust solution for handling batch processes and workflows. However, its proprietary nature and constraints can make it difficult for businesses to adopt more agile development methods and cloud-native designs. With its robust orchestration features, large community support, and open-source architecture, Apache Airflow presents a strong alternative. Even so, migrating from a deeply entrenched system like Control-M to Airflow is no easy task: the process involves converting complex job definitions, dependencies, and schedules, which frequently calls for a lot of manual labor and expertise.

Recently, a team from Google introduced DAGify, an open-source program that streamlines and expedites this transition from Control-M to Airflow. DAGify offers an automated conversion solution to help overcome this difficulty. It helps businesses convert their existing Control-M job definitions into Airflow Directed Acyclic Graphs (DAGs), which minimizes the chance of errors during the migration and reduces the manual labor required.

When teams use DAGify to ease the migration, they can concentrate on streamlining their workflows in Airflow instead of getting bogged down in the difficulties of manual conversion. Fundamentally, DAGify uses a template-driven method to convert Control-M XML files into Airflow’s native DAG format, which makes it adaptable across different Control-M configurations and Airflow requirements. The program extracts vital data about jobs, dependencies, and schedules by parsing Control-M XML files. That data is then mapped to Airflow tasks, dependencies, and operators, preserving the fundamental structure of the original workflow.

DAGify is highly configurable thanks to its template system, which lets users specify how Control-M properties should be converted into Airflow parameters. For instance, a Control-M “Command” task can be mapped to an Airflow SSHOperator via a user-defined YAML template that outlines how attributes like JOBNAME and CMDLINE are included in the generated DAG, ensuring a smooth transition from Control-M to Airflow.
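To make the mapping concrete, here is a minimal, hedged sketch of the kind of transformation DAGify automates. The Control-M attribute names follow the example above (JOBNAME, CMDLINE), but the XML shape, template format, and generated code are illustrative rather than DAGify’s actual output:

import xml.etree.ElementTree as ET

# Minimal Control-M style job definition (illustrative, not a full export)
controlm_xml = """
<DEFTABLE>
  <JOB JOBNAME="daily_sales_load" TASKTYPE="Command" CMDLINE="python /opt/etl/load_sales.py"/>
</DEFTABLE>
"""

DAG_TEMPLATE = '''from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator
import pendulum

with DAG(dag_id="{job_name}", start_date=pendulum.datetime(2024, 1, 1), schedule=None) as dag:
    run_job = SSHOperator(
        task_id="{job_name}",
        ssh_conn_id="controlm_host",  # assumed Airflow connection
        command="{cmdline}",
    )
'''

root = ET.fromstring(controlm_xml)
for job in root.iter("JOB"):
    if job.get("TASKTYPE") == "Command":
        dag_source = DAG_TEMPLATE.format(job_name=job.get("JOBNAME"), cmdline=job.get("CMDLINE"))
        print(dag_source)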

DAGify comes with a number of pre-made templates for typical Control-M job types, and users can alter these templates to suit their own requirements. This adaptability lets the tool support a wide variety of Control-M setups, ensuring a seamless migration procedure.

Google Cloud Composer is a compelling choice for enterprises seeking a fully managed Airflow solution. By simplifying the management of Airflow infrastructure, Cloud Composer frees teams up to concentrate on creating and coordinating their data pipelines. The migration of Control-M workflows to a cloud-native environment is now simpler than ever because of DAGify’s seamless integration with Google Cloud Composer. Through this integration, the migration process can be made even more efficient and scalable, allowing organizations to reap the benefits of Airflow in the cloud more rapidly.

In conclusion, DAGify is a big step forward in making the switch from Control-M to Apache Airflow easier. Organizations can move to Airflow more quickly and confidently using DAGify’s automated conversion process and its integration with Google Cloud Composer. Regardless of a team’s level of experience with the platform, DAGify is an invaluable tool that can help speed up the transition and realize the full potential of Apache Airflow in data engineering operations.

Check out the GitHub and Details. All credit for this research goes to the researchers of this project.

The post DAGify: An Open-Source Program for Streamlining and Expediting the Transition from Control-M to Apache Airflow appeared first on MarkTechPost.

Comparing Taipy’s Callbacks and Streamlit’s Caching: A Detailed Te …

Taipy and Streamlit have garnered significant attention among data scientists and machine learning engineers working with Python-based web application frameworks. Both platforms offer unique functionality tailored to different development needs. Let’s compare Taipy’s callback functionality with Streamlit’s caching mechanisms, look at where Taipy outperforms Streamlit, and offer technical insights to help developers choose the right tool for their specific requirements.

Taipy: Advanced Callbacks for Enhanced Interactivity

Taipy, a newer entrant in the Python web framework ecosystem, offers a robust and flexible environment for building complex data-driven applications. It is an innovative open-source tool designed to streamline the creation, management, and execution of data-driven pipelines with minimal coding effort. It offers a solution for Python developers who find building production-ready web applications challenging due to the complexity of front-end and back-end development: Taipy covers both the front end and the back end, providing a complete solution for applications that require both, particularly for data-driven tasks.

Callback Mechanisms in Taipy

Event-Driven Callbacks: Taipy employs a sophisticated callback mechanism that allows developers to create highly interactive applications. Various events, such as user interactions with widgets or changes in data, can trigger callbacks. This event-driven approach ensures that only the relevant parts of the application are updated, enhancing performance and user experience.

Scenario Management: Taipy’s unique feature is its scenario management capability, which enables users to conduct what-if analyses and manage different application states effectively. This is handy in applications that require complex decision-making processes or multiple user flows.

Design Flexibility: Taipy provides extensive design flexibility, allowing developers to customize the appearance and behavior of their applications beyond the standard templates Streamlit offers. This includes a rich library of UI components and the ability to handle large datasets efficiently through features like pagination and asynchronous execution.

Asynchronous Callbacks: Taipy supports asynchronous execution, which is particularly beneficial for handling long-running tasks without blocking the main application thread. This ensures a responsive user interface even when performing complex computations.

Data Nodes and Tasks: Taipy’s architecture includes data nodes and tasks that facilitate the creation of complex data pipelines. Data nodes represent the data state at any point in the pipeline, while tasks define operations on these nodes. This modular approach enhances application maintainability and scalability.
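As a minimal sketch of an event-driven Taipy callback (assuming the taipy package is installed; treat the details as illustrative rather than a complete application), a global on_change function can react whenever a bound variable changes in the UI:

from taipy.gui import Gui

temperature = 20
message = "Comfortable"

# Called automatically by Taipy whenever a bound variable changes in the UI
def on_change(state, var_name, var_value):
    if var_name == "temperature":
        state.message = "Hot" if var_value > 25 else "Comfortable"

page = """
# Temperature monitor
<|{temperature}|slider|min=0|max=40|>

Current assessment: <|{message}|text|>
"""

if __name__ == "__main__":
    Gui(page).run()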

Streamlit: Simplifying Caching for Rapid Prototyping

Streamlit has gained popularity for its simplicity and ease of use. It enables developers to convert Python scripts into interactive web applications with minimal effort. One of its key features is its caching system, which optimizes performance by storing the results of expensive computations and preventing redundant executions.

Caching Mechanisms in Streamlit

st.cache_data: This decorator caches the return value of a function based on the input parameters. It is especially useful for functions that perform data fetching, cleaning, or other repetitive computations. The cached data can be stored in memory or disk, providing flexibility based on the application’s needs.

st.cache_resource: Designed for caching resources such as database connections or machine learning models, this decorator ensures that these resources are initialized only once, reducing the overhead of repeatedly re-establishing connections or loading models. This is critical for applications that require persistent and reusable resources across different sessions.

Session-Specific Caching: Streamlit supports session-specific caching, ensuring the cached data is unique to each user’s session. This feature is beneficial for applications where users interact with personalized datasets or perform unique operations that should not interfere with one another.

Function-Based Caching: Streamlit’s legacy ‘@st.cache’ decorator (now superseded by st.cache_data and st.cache_resource) allows developers to cache function outputs to avoid recomputation. This is particularly useful for data preprocessing and complex computations that do not change often, and it helps speed up the application by reducing unnecessary recalculations.

State Management: Streamlit provides a session state feature that allows developers to persist data across different script runs. This is essential for maintaining user inputs, selections, and other states that must be preserved throughout the session.
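The following short sketch shows st.cache_data, st.cache_resource, and session state together; the data URL and SQLite database are placeholders:

import pandas as pd
import streamlit as st

@st.cache_data  # re-runs only when the input arguments change
def load_data(csv_url: str) -> pd.DataFrame:
    return pd.read_csv(csv_url)

@st.cache_resource  # initialized once and shared across reruns and sessions
def get_connection():
    import sqlite3
    return sqlite3.connect("app.db", check_same_thread=False)

df = load_data("https://example.com/data.csv")  # placeholder URL
st.dataframe(df.head())

if "counter" not in st.session_state:  # session state persists across reruns
    st.session_state.counter = 0
if st.button("Increment"):
    st.session_state.counter += 1
st.write("Clicks this session:", st.session_state.counter)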

Technical Comparison: Taipy vs. Streamlit

Prototyping and Ease of Use

Taipy: While Taipy also supports prototyping, it shines in production environments. Its extensive features cater to both early-stage development and the demanding needs of live, user-facing products. This dual capability makes Taipy a versatile tool for long-term projects.

Streamlit: Known for its rapid prototyping capabilities, Streamlit’s straightforward API and live reloading features make it ideal for quickly developing and iterating applications.

Caching and Performance

Taipy: Taipy does not rely on caching; its strength lies in its advanced callback mechanisms. These callbacks ensure that only the application’s necessary components are updated in response to user interactions, leading to better performance and a more responsive user experience.

Streamlit: Streamlit’s caching system is user-friendly and efficient. Caching data and resources minimizes redundant computations and improves overall performance.

Interactivity and User Experience

Taipy: Excels in creating highly interactive and customizable user interfaces. Its event-driven callbacks and scenario management features allow developers to build applications that are not only responsive but also tailored to specific user needs and workflows. Taipy’s design flexibility enables the creation of unique and varied application appearances.

Streamlit: It provides a consistent user interface across applications. Its live reloading and rich widget library allow developers to create interactive dashboards with minimal code. However, this can be a limitation for developers seeking more customized and interactive designs.

Data Handling and Scalability

Taipy: Designed with scalability in mind, Taipy supports large data handling through features like pagination, chart decimation, and asynchronous execution. Its robust architecture makes it suitable for applications that process and visualize large datasets without compromising performance.

Streamlit: While Streamlit handles data well, it does not inherently support large-scale data management or complex data workflows. This can be a limitation for some applications that require extensive data processing or need to handle large datasets efficiently.

Backend Integration and Data Pipelines

Taipy: Offers comprehensive backend support, including pre-built components for data pipelines and scenario management. Taipy’s architecture includes data nodes and tasks that facilitate the creation of complex data pipelines. This integrated approach simplifies the development of full-stack applications.

Streamlit: Primarily focused on the front end, Streamlit does not provide extensive backend support or data pipeline management. Developers often need to integrate Streamlit with other tools to handle backend processes.

Asynchronous Execution and Long-Running Tasks

Taipy: Supports asynchronous execution, which is particularly beneficial for handling long-running tasks without blocking the main application thread. This ensures a responsive user interface even when performing complex computations.

Streamlit: Streamlit supports asynchronous execution to some extent, but its primary focus is on synchronous operations. This can limit applications requiring real-time data processing or long-running tasks.

Difference in UML infrastructure between Taipy and Streamlit

Taipy Infrastructure

Taipy is an advanced enterprise application development framework that handles complex workflows and data dependencies. Its infrastructure includes:

Core Components:

Taipy GUI: The user interface component.

Taipy Core: Manages workflows, data nodes, and scenarios.

Data Nodes: Represent data storage or data sources.

Scenarios: Define sets of actions to achieve specific goals.

Tasks: Units of work to be executed, usually data processing steps.

Sequences: Sequences of tasks forming complete workflows.

External Interactions:

Databases: For storing and retrieving data.

APIs: These are used to integrate with external services or data sources.

User Interface (UI): Interacts with end-users.

Taipy UML Diagram

Source: marktechpost.com

Streamlit Infrastructure

Streamlit is a lightweight framework designed to create data applications quickly. Its infrastructure consists of:

Core Components:

Streamlit Script: The Python script that defines the app.

Widgets: User interface elements like sliders, buttons, and text inputs.

Data: Direct interaction with data sources within the script.

Layout: Arrangement of widgets and visualizations on the app page.

Streamlit Server: Manages the app’s serving to users.

External Interactions:

Data Sources: Directly accessed within the script (e.g., files, databases, APIs).

UI: Interacts with end-users via the web app.

Streamlit UML Diagram

Source: marktechpost.com

Why is Taipy’s infrastructure better compared to Streamlit’s?

The Taipy infrastructure, as illustrated in the UML diagram, offers a comprehensive and robust framework well-suited for enterprise-level applications. Its infrastructure is designed to handle complex workflows and data dependencies with advanced features such as automation, asynchronous execution, and tight integration of core components like data nodes, pipelines, scenarios, and tasks. This structured approach ensures that all aspects of the workflow are well-coordinated, reliable, and maintainable, providing a significant edge over simpler frameworks. By supporting sophisticated data pipelines and automatic task triggering, Taipy enhances efficiency and reduces manual intervention, making it ideal for large-scale data processing and real-time analytics. This level of sophistication and integration makes Taipy a superior choice for building highly efficient, scalable, and adaptive enterprise applications compared to straightforward solutions like Streamlit.

Why are Taipy Callbacks a Better Solution?

Advanced Features and Flexibility

Complex Workflows: Handle sophisticated data pipelines that trigger tasks and scenarios based on data changes or events.

Automation: Reduce manual intervention and enhance efficiency by automating workflow processes.

Asynchronous Execution: Support parallel processing for faster response times, crucial for large-scale data processing and real-time analytics.

Deep Integration with Core Components

Tightly Coupled Workflows: Ensure the workflow is well-coordinated, leading to reliable and maintainable applications.

Complex Dependencies Management: Manage and execute tasks in a well-defined sequence, ideal for enterprise applications requiring high reliability and scalability.

Adaptive Applications: Build responsive applications that adapt easily to changing business requirements and data environments. This provides a significant edge over simpler frameworks like Streamlit.

Use Cases Where Taipy Callbacks are Better Compared to Streamlit Caching

Taipy callbacks excel in use cases where complex data workflows and dependencies are prevalent. For instance, in financial analytics, where real-time data processing and complex computational models are essential, Taipy’s ability to automate task execution based on data changes ensures timely and accurate results. Similarly, healthcare applications that manage patient data, diagnostics, and treatment plans require the kind of robust workflow management that Taipy’s callbacks handle seamlessly. In contrast, Streamlit’s caching is more suitable for simpler scenarios where the primary goal is to improve app performance by storing frequently accessed data. Streamlit relies on caching to speed up repetitive tasks, whereas Taipy’s advanced automation and dependency management make it largely independent of caching. Taipy is designed to empower developers to build sophisticated Python data and AI web applications effortlessly, and its infrastructure supports large datasets, ensuring smooth and efficient data processing and visualization.

Conclusion

In conclusion, Taipy offers a more comprehensive solution for developers building complex, scalable applications. Its advanced callback mechanisms, design flexibility, and robust support for large datasets make it a powerful tool for production environments. Whether for prototyping or full-scale deployment, Taipy’s features provide a seamless pathway from development to execution.

Thanks to Taipy for the thought leadership and resources for this article. Taipy has supported us in creating this content.

The post Comparing Taipy’s Callbacks and Streamlit’s Caching: A Detailed Technical Analysis appeared first on MarkTechPost.

Improve AI assistant response accuracy using Knowledge Bases for Amazo …

AI chatbots and virtual assistants have become increasingly popular in recent years thanks to the breakthroughs of large language models (LLMs). Trained on large volumes of data, these models incorporate memory components in their architectural design, allowing them to understand and comprehend textual context.
Most common use cases for chatbot assistants focus on a few key areas: enhancing customer experiences, boosting employee productivity and creativity, and optimizing business processes. Examples include customer support, troubleshooting, and internal and external knowledge base search.
Despite these capabilities, a key challenge with chatbots is generating high-quality and accurate responses. One way of solving this challenge is to use Retrieval Augmented Generation (RAG). RAG is the process of optimizing the output of an LLM so it references an authoritative knowledge base outside of its training data sources before generating a response. Reranking seeks to improve search relevance by reordering the result set returned by a retriever with a different model. In this post, we explain how two techniques—RAG and reranking—can help improve chatbot responses using Knowledge Bases for Amazon Bedrock.
Solution overview
RAG is a technique that combines the strengths of knowledge base retrieval and generative models for text generation. It works by first retrieving relevant responses from a database, then using those responses as context to feed the generative model to produce a final output. Using a RAG approach for building a chatbot has many advantages. For example, retrieving responses from its database before generating a response could provide more relevant and coherent responses. This helps improve the conversational flow. RAG also scales better with more data compared to pure generative models, and it doesn’t require fine-tuning of the model when new data is added to the knowledge base. Additionally, the retrieval component enables the model to incorporate external knowledge by retrieving relevant background information from its database. This approach helps provide factual, in-depth, and knowledgeable responses.
To find an answer, RAG uses vector search across the documents. The advantage of vector search is speed and scalability. Rather than scanning every document to find the answer, the RAG approach turns the texts (knowledge base) into embeddings and stores these embeddings in a database. The embeddings are a compressed representation of the documents, expressed as arrays of numerical values. After the embeddings are stored, the vector search queries the vector database to find similarity based on the vectors associated with the documents. Typically, a vector search returns the top k most relevant documents for the user question. However, because the similarity algorithm in a vector database works on vectors and not documents, vector search doesn’t always surface the most relevant information in the top k results. This directly impacts the accuracy of the response if the most relevant contexts aren’t available to the LLM.
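
As a minimal sketch of this idea (not part of the solution code), top-k retrieval over stored embeddings can be expressed as a cosine-similarity search; the query and document embeddings are assumed to come from whichever embedding model you use.

import numpy as np

def top_k_documents(query_vec, doc_vecs, documents, k=5):
    # cosine similarity between the query embedding and every stored embedding
    doc_matrix = np.asarray(doc_vecs)
    sims = doc_matrix @ query_vec / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec) + 1e-10
    )
    # indices of the k highest-scoring documents, most similar first
    top_idx = np.argsort(sims)[::-1][:k]
    return [(documents[i], float(sims[i])) for i in top_idx]
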
Reranking is a technique that can further improve the responses by selecting the best option out of several candidate responses. The following architecture illustrates how a reranking solution could work.

Architecture diagram for Reranking model integration with Knowledge Bases for Bedrock

Let’s create a question answering solution, where we ingest The Great Gatsby, a 1925 novel by American writer F. Scott Fitzgerald. This book is publicly available through Project Gutenberg. We use Knowledge Bases for Amazon Bedrock to implement the end-to-end RAG workflow and ingest the embeddings into an Amazon OpenSearch Serverless vector search collection. We then retrieve answers using standard RAG and a two-stage RAG, which involves a reranking API. We then compare results from these two methods.
The code sample is available in this GitHub repo.
In the following sections, we walk through the high-level steps:

Prepare the dataset.
Generate questions from the document using an Amazon Bedrock LLM.
Create a knowledge base that contains this book.
Retrieve answers using the knowledge base retrieve API.
Evaluate the response using the RAGAS framework.
Retrieve answers again by running a two-stage RAG, using the knowledge base retrieve API and then applying reranking on the context.
Evaluate the two-stage RAG response using the RAGAS framework.
Compare the results and the performance of each RAG approach.

For efficiency purposes, we provide sample code in a notebook used to generate a set of questions and answers. These Q&A pairs are used in the RAG evaluation process. We highly recommend having a human validate each question and answer for accuracy.
The following sections explain the major steps with the help of code blocks.

Prerequisites
To clone the GitHub repository to your local machine, open a terminal window and run the following commands:

git clone https://github.com/aws-samples/amazon-bedrock-samples
cd knowledge-bases/features-examples/03-advanced-concepts/reranking

Prepare the dataset
Download the book from the Project Gutenberg website. For this post, we create 10 large documents from this book and upload them to Amazon Simple Storage Service (Amazon S3):

import math
import urllib.request

import boto3
import sagemaker

target_url = "https://www.gutenberg.org/ebooks/64317.txt.utf-8"  # The Great Gatsby
data = urllib.request.urlopen(target_url)
my_texts = []
for line in data:
    my_texts.append(line.decode())

doc_size = 700  # lines per document, used to determine the number of batches
batches = math.ceil(len(my_texts) / doc_size)

sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
s3_prefix = "bedrock/knowledgebase/datasource"

start = 0
s3 = boto3.client("s3")
for batch in range(batches):
    batch_text_arr = my_texts[start:start + doc_size]
    batch_text = "".join(batch_text_arr)
    s3.put_object(
        Body=batch_text,
        Bucket=default_bucket,
        Key=f"{s3_prefix}/{start}.txt"
    )
    start += doc_size

Create Knowledge Base for Bedrock
If you’re new to using Knowledge Bases for Amazon Bedrock, refer to Knowledge Bases for Amazon Bedrock now supports Amazon Aurora PostgreSQL and Cohere embedding models, where we described how Knowledge Bases for Amazon Bedrock manages the end-to-end RAG workflow.
In this step, you create a knowledge base using a Boto3 client. You use Amazon Titan Text Embeddings V2 to convert the documents into embeddings (embeddingModelArn) and point to the S3 bucket you created earlier as the data source (dataSourceConfiguration):

bedrock_agent = boto3.client("bedrock-agent")
response = bedrock_agent.create_knowledge_base(
    name=knowledge_base_name,
    description="Knowledge Base for Bedrock",
    roleArn=role_arn,
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": embedding_model_arn
        }
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": collection_arn,
            "vectorIndexName": index_name,
            "fieldMapping": {
                "vectorField": "bedrock-knowledge-base-default-vector",
                "textField": "AMAZON_BEDROCK_TEXT_CHUNK",
                "metadataField": "AMAZON_BEDROCK_METADATA"
            }
        }
    }
)
knowledge_base_id = response["knowledgeBase"]["knowledgeBaseId"]
knowledge_base_name = response["knowledgeBase"]["name"]

response = bedrock_agent.create_data_source(
    knowledgeBaseId=knowledge_base_id,
    name=f"{knowledge_base_name}-ds",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": f"arn:aws:s3:::{bucket}",
            "inclusionPrefixes": [
                f"{s3_prefix}/",
            ]
        }
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,
                "overlapPercentage": 10
            }
        }
    }
)
data_source_id = response["dataSource"]["dataSourceId"]

response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId=knowledge_base_id,
    dataSourceId=data_source_id,
)

Generate questions from the document
We use Anthropic Claude on Amazon Bedrock to generate a list of 10 questions and the corresponding answers. The Q&A data serves as the foundation for the RAG evaluation based on the approaches that we are going to implement. We define the generated answers from this step as ground truth data. See the following code:

import re

prompt_template = """The questions should be diverse in nature
across the document. The questions should not contain options and should not start with Q1/Q2.
Restrict the questions to the context information provided.

<document>
{{document}}
</document>

Think step by step and pay attention to the number of questions to create.

Your response should follow this format:

Question: question
Answer: answer

"""
system_prompt = """You are a professor. Your task is to set up 1 question for an upcoming
quiz/examination based on the given document wrapped in <document></document> XML tags."""

prompt = prompt_template.replace("{{document}}", documents)
temperature = 0.9
top_k = 250
messages = [{"role": "user", "content": [{"text": prompt}]}]
# Base inference parameters to use.
inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
# Additional inference parameters to use.
additional_model_fields = {"top_k": top_k}

# Send the message.
response = bedrock_runtime.converse(
    modelId=model_id,
    messages=messages,
    system=[{"text": system_prompt}],
    inferenceConfig=inference_config,
    additionalModelRequestFields=additional_model_fields
)
print(response["output"]["message"]["content"][0]["text"])
result = response["output"]["message"]["content"][0]["text"]
q_pos = [(a.start(), a.end()) for a in list(re.finditer("Question:", result))]
a_pos = [(a.start(), a.end()) for a in list(re.finditer("Answer:", result))]
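
The notebook's parsing logic isn't reproduced in this post; the following is a sketch (an assumption, not the original code) of how the question/answer pairs could be sliced out of the generated text using the marker positions computed above.

questions = []
ground_truths = []
for i, (q_start, q_end) in enumerate(q_pos):
    a_start, a_end = a_pos[i]
    # text between "Question:" and the matching "Answer:" marker
    questions.append(result[q_end:a_start].strip())
    # text between "Answer:" and the next "Question:" (or the end of the output)
    next_q_start = q_pos[i + 1][0] if i + 1 < len(q_pos) else len(result)
    ground_truths.append(result[a_end:next_q_start].strip())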

Retrieve answers using the knowledge base APIs
We use the generated questions and retrieve answers from the knowledge base using the retrieve and converse APIs:

contexts = []
answers = []

for question in questions:
    response = agent_runtime.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={
            "text": question
        },
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": topk
            }
        }
    )

    retrieval_results = response["retrievalResults"]
    local_contexts = []
    for result in retrieval_results:
        local_contexts.append(result["content"]["text"])
    contexts.append(local_contexts)
    combined_docs = "\n".join(local_contexts)
    prompt = llm_prompt_template.replace("{{documents}}", combined_docs)
    prompt = prompt.replace("{{query}}", question)
    temperature = 0.9
    top_k = 250
    messages = [{"role": "user", "content": [{"text": prompt}]}]
    # Base inference parameters to use.
    inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
    # Additional inference parameters to use.
    additional_model_fields = {"top_k": top_k}

    # Send the message.
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields
    )
    answers.append(response["output"]["message"]["content"][0]["text"])

Evaluate the RAG response using the RAGAS framework
We now evaluate the effectiveness of the RAG workflow using a framework called RAGAS. The framework provides a suite of metrics to evaluate different dimensions. In our example, we evaluate responses based on the following dimensions:

Answer relevancy – This metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the question and the answer, with values ranging between 0–1, where higher scores indicate better relevancy.
Answer similarity – This assesses the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0–1. A higher score signifies a better alignment between the generated answer and the ground truth.
Context relevancy – This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of 0–1, with higher values indicating better relevancy.
Answer correctness – The assessment of answer correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0–1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.
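
The following is a sketch of how scores like the ones reported below can be computed with RAGAS. The exact metric and column names depend on the RAGAS version, and the questions, answers, contexts, and ground_truths variables are assumed to come from the previous steps.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    answer_similarity,
    answer_correctness,
    context_relevancy,
)

eval_dataset = Dataset.from_dict({
    "question": questions,          # generated in the previous step
    "answer": answers,              # responses from the RAG pipeline
    "contexts": contexts,           # retrieved passages per question
    "ground_truth": ground_truths,  # older RAGAS releases expect a ground_truths column instead
})

# By default, RAGAS uses its configured judge LLM and embeddings to score each dimension.
scores = evaluate(
    eval_dataset,
    metrics=[answer_relevancy, answer_similarity, answer_correctness, context_relevancy],
)
print(scores)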

The following is a summarized report for the standard RAG approach based on the RAGAS evaluation:
answer_relevancy: 0.9006225160334027
answer_similarity: 0.7400904157096762
answer_correctness: 0.32703043056663855
context_relevancy: 0.024797687553157175
Two-stage RAG: Retrieve and rerank
Now that you have the results from the standard RAG approach, let’s explore the two-stage retrieval approach, which extends standard RAG by integrating a reranking model. In the context of RAG, reranking models are used after an initial set of contexts is retrieved by the retriever. The reranking model takes the list of results and reranks each one based on the similarity between the context and the user query. In our example, we use a powerful reranking model called bge-reranker-large. The model is available on the Hugging Face Hub and is free for commercial use. In the following code, we use the knowledge base’s retrieve API so we can get a handle on the context, and we rerank it using the reranking model deployed as an Amazon SageMaker endpoint. We provide the sample code for deploying the reranking model on SageMaker in the GitHub repository. Here’s a code snippet that demonstrates the two-stage retrieval process:

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

def generate_two_stage_context_answers(bedrock_runtime,
                                        agent_runtime,
                                        model_id,
                                        knowledge_base_id,
                                        retrieval_topk,
                                        reranking_model,
                                        questions,
                                        rerank_top_k=3):
    contexts = []
    answers = []
    predictor = Predictor(endpoint_name=reranking_model, serializer=JSONSerializer(), deserializer=JSONDeserializer())
    for question in questions:
        retrieval_results = two_stage_retrieval(agent_runtime, knowledge_base_id, question, retrieval_topk, predictor, rerank_top_k)
        local_contexts = []
        for result in retrieval_results:
            local_contexts.append(result)

        contexts.append(local_contexts)
        combined_docs = "\n".join(local_contexts)
        prompt = llm_prompt_template.replace("{{documents}}", combined_docs)
        prompt = prompt.replace("{{query}}", question)
        temperature = 0.9
        top_k = 250
        messages = [{"role": "user", "content": [{"text": prompt}]}]
        inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
        additional_model_fields = {"top_k": top_k}

        response = bedrock_runtime.converse(
            modelId=model_id,
            messages=messages,
            inferenceConfig=inference_config,
            additionalModelRequestFields=additional_model_fields
        )
        answers.append(response["output"]["message"]["content"][0]["text"])
    return contexts, answers
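
The two_stage_retrieval helper is not shown above. The following is a sketch of what it might look like: retrieve a larger candidate set from the knowledge base, score each query/passage pair with the reranking endpoint, and keep the best few. The request payload expected by the bge-reranker SageMaker endpoint depends on how the model was deployed, so the pair format below is an assumption.

def two_stage_retrieval(agent_runtime, knowledge_base_id, question,
                        retrieval_topk, predictor, rerank_top_k):
    # Stage 1: vector search over the knowledge base for a broad candidate set.
    response = agent_runtime.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": retrieval_topk}
        },
    )
    passages = [r["content"]["text"] for r in response["retrievalResults"]]

    # Stage 2: score each candidate against the question with the reranker endpoint.
    scores = predictor.predict([[question, passage] for passage in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:rerank_top_k]]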

Evaluate the two-stage RAG response using the RAGAS framework
We evaluate the answers generated by the two-stage retrieval process. The following is a summarized report based on RAGAS evaluation:
answer_relevancy: 0.841581671275458
answer_similarity: 0.7961827348349313
answer_correctness: 0.43361356731293665
context_relevancy: 0.06049484724216884

Compare the results
Let’s compare the results from our tests. As shown in the following figure, the reranking API improves context relevancy, answer correctness, and answer similarity, which are important for improving the accuracy of the RAG process.

RAG vs Two Stage Retrieval evaluation metrics

Similarly, we also measured the RAG latency for both approaches. The results are shown in the following metrics and the corresponding chart:
Standard RAG latency: 76.59s
Two Stage Retrieval latency: 312.12s

Latency metric for RAG and Two Stage Retrieval process

In summary, using a reranking model (bge-reranker-large) hosted on an ml.m5.xlarge instance yields approximately four times the latency compared to the standard RAG approach. We recommend testing with different reranking model variants and instance types to obtain the optimal performance for your use case.
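
To reproduce the latency comparison, a simple wall-clock measurement around each approach is sufficient. This is a sketch; the function call follows the two-stage snippet above.

import time

start = time.perf_counter()
contexts, answers = generate_two_stage_context_answers(
    bedrock_runtime, agent_runtime, model_id, knowledge_base_id,
    retrieval_topk, reranking_model, questions,
)
two_stage_latency = time.perf_counter() - start
print(f"Two-stage retrieval latency: {two_stage_latency:.2f}s")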

Conclusion
In this post, we demonstrated how to implement a two-stage retrieval process by integrating a reranking model. We explored how integrating a reranking model with Knowledge Bases for Amazon Bedrock can provide better performance. Finally, we used RAGAS, an open source framework, to provide context relevancy, answer relevancy, answer similarity, and answer correctness metrics for both approaches.
Try out this retrieval process today, and share your feedback in the comments.

About the Author
Wei Teh is a Machine Learning Solutions Architect at AWS. He is passionate about helping customers achieve their business objectives using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.
Pallavi Nargund is a Principal Solutions Architect at AWS. In her role as a cloud technology enabler, she works with customers to understand their goals and challenges, and give prescriptive guidance to achieve their objective with AWS offerings. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Outside of work she enjoys volunteering, gardening, cycling and hiking.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Automate the machine learning model approval process with Amazon SageM …

Innovations in artificial intelligence (AI) and machine learning (ML) are causing organizations to take a fresh look at the possibilities these technologies can offer. As you aim to bring your proofs of concept to production at an enterprise scale, you may experience challenges aligning with the strict security compliance requirements of your organization. In the face of these challenges, MLOps offers an important path to shorten your time to production while increasing confidence in the quality of deployed workloads by automating governance processes.
ML models in production are not static artifacts. They reflect the environment where they are deployed and, therefore, require comprehensive monitoring mechanisms for model quality, bias, and feature importance. Organizations often want to introduce additional compliance checks that validate that the model aligns with their organizational standards before it is deployed. These frequent manual checks can create long lead times to deliver value to customers. Automating these checks allows them to be repeated regularly and consistently, rather than relying on infrequent manual point-in-time checks.
This post illustrates how to use common architecture principles to transition from a manual monitoring process to one that is automated. You can use these principles and existing AWS services such as Amazon SageMaker Model Registry and Amazon SageMaker Pipelines to deliver innovative solutions to your customers while maintaining compliance for your ML workloads.
Challenge
As AI becomes ubiquitous, it’s increasingly used to process information and interact with customers in a sensitive context. Suppose a tax agency is interacting with its users through a chatbot. It’s important that this new system aligns with organizational guidelines by allowing developers to have a high degree of confidence that it responds accurately and without bias. At maturity, an organization may have tens or even hundreds of models in production. How can you make sure every model is properly vetted before it’s deployed and on each deployment?
Traditionally, organizations have created manual review processes to keep updated code from becoming available to the public through mechanisms such as an Enterprise Review Committee (ERC), Enterprise Review Board (ERB), or a Change Advisory Board (CAB).
Just as mechanisms have evolved with the rise of continuous integration and continuous delivery (CI/CD), MLOps can reduce the need for manual processes while increasing the frequency and thoroughness of quality checks. Through automation, you can scale in-demand skillsets, such as model and data analysis, introducing and enforcing in-depth analysis of your models at scale across diverse product teams.
In this post, we use SageMaker Pipelines to define the required compliance checks as code. This allows you to introduce analysis of arbitrary complexity while not being limited by the busy schedules of highly technical individuals. Because the automation takes care of repetitive analytics tasks, technical resources can focus on relentlessly improving the quality and thoroughness of the MLOps pipeline to improve compliance posture, and make sure checks are performing as expected.
Deployment of an ML model to production generally requires at least two artifacts to be approved: the model and the endpoint. In our example, the organization is willing to approve a model for deployment if it passes their checks for model quality, bias, and feature importance prior to deployment. Secondly, the endpoint can be approved for production if it performs as expected when deployed into a production-like environment. In a subsequent post, we walk you through how to deploy a model and implement sample compliance checks. In this post, we discuss how you can extend this process to large language models (LLMs), which produce a varied set of outputs and introduce complexities regarding automated quality assurance checks.
Aligning with AWS multi-account best practices
The solution outlined in this post spans across several accounts in a given AWS organization. For a deeper look at the various components required for an AWS organization multi-account enterprise ML environment, see MLOps foundation roadmap for enterprises with Amazon SageMaker. In this post, we refer to the advanced analytics governance account as the AI/ML governance account. We focus on the development of the enforcement mechanism for the centralized automated model approval within this account.
This account houses centralized components such as a model registry on SageMaker Model Registry, ML project templates on SageMaker Projects, model cards on Amazon SageMaker Model Cards, and container images on Amazon Elastic Container Registry (Amazon ECR).
We use an isolated environment (in this case, a separate AWS environment) to deploy and promote across various environments. You can modify the strategies discussed in this post along the spectrum of centralized vs. decentralized depending on the posture of your organization. For this example, we provide a centralized model. You can also extend this model to align with strict compliance requirements. For example, the AI/ML governance team trusts the development teams are sending the correct bias and explainability reports for a given model. Additional checks could be included to “trust by verify” to further bolster the posture of this organization. Additional complexities such as this are not addressed in this post. To dive further into the topic of MLOps secure implementations, refer to Amazon SageMaker MLOps: from idea to production in six steps.
Solution overview
The following diagram illustrates the solution architecture using SageMaker Pipelines to automate model approval.

The workflow comprises a comprehensive process for model building, training, evaluation, and approval within an organization containing different AWS accounts, integrating various AWS services. The detailed steps are as follows:

Data scientists from the product team use Amazon SageMaker Studio to create Jupyter notebooks used to facilitate data preprocessing and model pre-building. The code is committed to AWS CodeCommit, a managed source control service. Optionally, you can commit to third-party version control systems such as GitHub, GitLab, or Enterprise Git.
The commit to CodeCommit invokes the SageMaker pipeline, which runs several steps, including model building and training, and running processing jobs using Amazon SageMaker Clarify to generate bias and explainability reports.

SageMaker Clarify processes and stores its outputs, including model artifacts and reports in JSON format, in an Amazon Simple Storage Service (Amazon S3) bucket.
A model is registered in the SageMaker model registry with a model version.

The Amazon S3 PUT action invokes an AWS Lambda function.
This Lambda function copies all the artifacts from the S3 bucket in the development account to another S3 bucket in the AI/ML governance account, providing restricted access and data integrity. This post assumes your accounts and S3 buckets are in the same AWS Region. For cross-Region copying, see Copy data from an S3 bucket to another account and Region by using the AWS CLI.
Registering the model invokes a default Amazon CloudWatch event associated with SageMaker model registry actions.
The CloudWatch event is consumed by Amazon EventBridge, which invokes another Lambda function.
This Lambda function is tasked with starting the SageMaker approval pipeline.
The SageMaker approval pipeline evaluates the artifacts against predefined benchmarks to determine if they meet the approval criteria.
Based on the evaluation, the pipeline updates the model status to approved or rejected accordingly.

This workflow provides a robust, automated process for model approval using AWS’s secure, scalable infrastructure and services. Each step is designed to make sure that only models meeting the set criteria are approved, maintaining high standards for model performance and fairness.
Prerequisites
To implement this solution, you need to first create and register an ML model in the SageMaker model registry with the necessary SageMaker Clarify artifacts. You can create and run the pipeline by following the example provided in the following GitHub repository.
The following sections assume that a model package version has been registered with status Pending Manual Approval. This status allows you to build an approval workflow. You can either have a manual approver or set up an automated approval workflow based on metrics checks in the aforementioned reports.
Build your pipeline
SageMaker Pipelines allows you to define a series of interconnected steps defined as code using the Pipelines SDK. You can extend the pipeline to help meet your organizational needs with both automated and manual approval steps. In this example, we build the pipeline to include two major steps. The first step evaluates artifacts uploaded to the AI/ML governance account by the model build pipeline against threshold values set by model registry administrators for model quality, bias, and feature importance. The second step receives the evaluation and updates the model’s status and metadata based on the values received. The pipeline is represented in SageMaker Pipelines by the following DAG.

Next, we dive into the code required for the pipeline and its steps. First, we define a pipeline session to help manage AWS service integration as we define our pipeline. This can be done as follows:

from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()

Each step runs as a SageMaker Processor for which we specify a small instance type due to the minimal compute requirements of our pipeline. The processor can be defined as follows:

from sagemaker.processing import Processor

step_processor = Processor(
    image_uri=image_uri,
    role=role,
    instance_type="ml.t3.medium",
    base_job_name=base_job_name,
    instance_count=1,
    sagemaker_session=pipeline_session,
)

We then define the pipeline steps using step_processor.run(…) as the input parameter to run our custom script inside the defined environment.
Validate model package artifacts
The first step takes two arguments: default_bucket and model_package_group_name. It outputs the results of the checks in JSON format stored in Amazon S3. The step is defined as follows:

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

process_step = ProcessingStep(
    name="RegisteredModelValidationStep",
    step_args=step_processor.run(
        code="automated-model-approval/model-approval-checks.py",
        inputs=[],
        outputs=[
            ProcessingOutput(
                output_name="checks",
                destination=f"s3://{default_bucket}/governance-pipeline/processor/",
                source="/opt/ml/processing/output"
            )],
        arguments=[
            "--default_bucket", default_bucket_s3,
            "--model_package_group_name", model_package_group_name
        ]
    )
)

This step runs the custom script passed to the code parameter. We now explore this script in more detail.
Values passed to arguments can be parsed using standard methods like argparse and will be used throughout the script. We use these values to retrieve the model package. We then parse the model package’s metadata to find the location of the model quality, bias, and explainability reports. See the following code:

model_package_arn = client.list_model_packages(ModelPackageGroupName=model_package_group_name)[
    "ModelPackageSummaryList"
][0]["ModelPackageArn"]
model_package_metrics = client.describe_model_package(ModelPackageName=model_package_arn)["ModelMetrics"]
model_quality_s3_key = model_package_metrics["ModelQuality"]["Statistics"]["S3Uri"].split(f"{default_bucket}/")[1]
model_quality_bias = model_package_metrics["Bias"]
model_quality_pretrain_bias_key = model_quality_bias["PreTrainingReport"]["S3Uri"].split(f"{default_bucket}/")[1]
model_quality__post_train_bias_key = model_quality_bias["PostTrainingReport"]["S3Uri"].split(f"{default_bucket}/")[1]
model_explainability_s3_key = model_package_metrics["Explainability"]["Report"]["S3Uri"].split(f"{default_bucket}/")[1]
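
For reference, a minimal sketch of the boilerplate that would sit at the top of model-approval-checks.py to read these values (the argument names match the pipeline definition above; the client setup is an assumption about the rest of the script):

import argparse
import boto3

parser = argparse.ArgumentParser()
parser.add_argument("--default_bucket", type=str, required=True)
parser.add_argument("--model_package_group_name", type=str, required=True)
args = parser.parse_args()

default_bucket = args.default_bucket
model_package_group_name = args.model_package_group_name
client = boto3.client("sagemaker")
s3_client = boto3.client("s3")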

The reports retrieved are simple JSON files we can then parse. In the following example, we retrieve the treatment equity and compare to our threshold in order to return a True or False result. Treatment equity is defined as the difference in the ratio of false negatives to false positives for the advantaged vs. disadvantaged group. We arbitrarily set the optimal threshold to be 0.8.

s3_obj = s3_client.get_object(Bucket=default_bucket, Key=model_quality__post_train_bias_key)
s3_obj_data = s3_obj["Body"].read().decode("utf-8")
model_quality__post_train_bias_json = json.loads(s3_obj_data)
treatment_equity = model_quality__post_train_bias_json["post_training_bias_metrics"][
    "facets"]["column_8"][0]["metrics"][-1]["value"]
treatment_equity_check_threshold = 0.8
treatment_equity_check = treatment_equity < treatment_equity_check_threshold

After running through the measures of interest, we return the true/false checks to a JSON file that will be copied to Amazon S3 as per the output variable of the ProcessingStep.
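
The following is a sketch of writing those checks to the processing output directory, which the ProcessingOutput defined earlier copies to Amazon S3 (the key names in the dictionary are illustrative assumptions):

import json
import os

checks = {
    "treatment_equity_check": treatment_equity_check,
    # additional true/false checks for model quality and feature importance would go here
}
os.makedirs("/opt/ml/processing/output", exist_ok=True)
with open("/opt/ml/processing/output/checks.json", "w") as f:
    json.dump(checks, f)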
Update the model package status in the model registry
When the initial step is complete, we use the JSON file created in Amazon S3 as input to update the model package’s status and metadata. See the following code:

update_model_status_step = ProcessingStep(
    name="UpdateModelStatusStep",
    step_args=step_processor.run(
        code="automated-model-approval/validate-model.py",
        inputs=[
            ProcessingInput(
                source=process_step.properties.ProcessingOutputConfig.Outputs[
                    "checks"
                ].S3Output.S3Uri,
                destination="/opt/ml/processing/input",
            ),
        ],
        outputs=[],
        arguments=[
            "--model_package_group_name", model_package_group_name
        ]
    ),
)

This step runs the custom script passed to the code parameter. We now explore this script in more detail. First, parse the values in checks.json to evaluate if the model passed all checks or review the reasons for failure:

is_approved = True
reasons = []
with open("/opt/ml/processing/input/checks.json") as checks_file:
    checks = json.load(checks_file)
print(f"checks: {checks}")
for key, value in checks.items():
    if not value:
        is_approved = False
        reasons.append(key)

After we know if the model should be approved or rejected, we update the model status and metadata as follows:

if is_approved:
    approval_description = "Model package meets organisational guidelines"
else:
    approval_description = "Model values for the following checks do not meet the threshold: "
    for reason in reasons:
        approval_description += f"{reason} "

model_package_update_input_dict = {
    "ModelPackageArn": model_package_arn,
    "ApprovalDescription": approval_description,
    "ModelApprovalStatus": "Approved" if is_approved else "Rejected"
}

model_package_update_response = client.update_model_package(**model_package_update_input_dict)

This step produces a model with a status of Approved or Rejected based on the set of checks specified in the first step.
Orchestrate the steps as a SageMaker pipeline
We orchestrate the previous steps as a SageMaker pipeline with two parameter inputs passed as arguments to the various steps:

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.parameters import ParameterString

model_package_group_name = ParameterString(
    name="ModelPackageGroupName", default_value="ModelPackageGroupName is required variable."
)

default_bucket_s3 = ParameterString(
    name="Bucket", default_value="Bucket is required variable"
)

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[model_package_group_name, default_bucket_s3],
    steps=[process_step, update_model_status_step],
)

It’s straightforward to extend this pipeline by adding elements into the list passed to the steps parameter. In the next section, we explore how to run this pipeline as new model packages are registered to our model registry.
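
For reference, a sketch of registering the pipeline definition and starting an execution manually (the role ARN and parameter values are placeholders):

# Create or update the pipeline definition in SageMaker, then start a run.
pipeline.upsert(role_arn=role)
execution = pipeline.start(
    parameters={
        "ModelPackageGroupName": "my-model-package-group",  # example value
        "Bucket": "my-governance-bucket",                   # example value
    }
)
execution.wait()  # block until the pipeline run finishes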
Run the event-driven pipeline
In this section, we outline how to invoke the pipeline using an EventBridge rule and Lambda function.
Create a Lambda function and select the Python 3.9 runtime. The following function retrieves the model package ARN, the model package group name, and the S3 bucket where the artifacts are stored based on the event. It then starts running the pipeline using these values:

import json
import boto3

sagemaker_client = boto3.client("sagemaker")

def lambda_handler(event, context):
    model_arn = event.get("detail", {}).get("ModelPackageArn", "Unknown")
    model_package_group_name = event.get("detail", {}).get("ModelPackageGroupName", "Unknown")
    model_package_name = event.get("detail", {}).get("ModelPackageName", "Unknown")
    model_data_url = event.get("InferenceSpecification", {}).get("ModelDataUrl", "Unknown")

    # Specify the name of your SageMaker pipeline
    pipeline_name = "model-governance-pipeline"

    # Define multiple parameters
    pipeline_parameters = [
        {"Name": "ModelPackageGroupName", "Value": model_package_group_name},
        {"Name": "Bucket", "Value": model_data_url},
    ]

    # Start the pipeline execution
    response = sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineExecutionDisplayName=pipeline_name,
        PipelineParameters=pipeline_parameters
    )

    # Return the response
    return response

After defining the Lambda function, we create the EventBridge rule to automatically invoke the function when a new model package is registered with a status of PendingManualApproval in the model registry. You can use AWS CloudFormation and the following template to create the rule:

{
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "CloudFormation template for EventBridge rule 'invoke-model-approval-checks'",
    "Resources": {
        "EventRule0": {
            "Type": "AWS::Events::Rule",
            "Properties": {
                "EventBusName": "default",
                "EventPattern": {
                    "source": ["aws.sagemaker"],
                    "detail-type": ["SageMaker Model Package State Change"],
                    "detail": {
                        "ModelApprovalStatus": ["PendingManualApproval"]
                    }
                },
                "Name": "invoke-model-approval-checks",
                "State": "ENABLED",
                "Targets": [{
                    "Id": "Id403a084c-2837-4408-940f-b808389653d1",
                    "Arn": "<Your Lambda function ARN>"
                }]
            }
        }
    }
}

We now have a SageMaker pipeline consisting of two steps being invoked when a new model is registered to evaluate model quality, bias, and feature importance metrics and update the model status accordingly.
Applying this approach to generative AI models
In this section, we explore how the complexities introduced by LLMs change the automated monitoring workflow.
Traditional ML models typically produce concise outputs with obvious ground truths in their training dataset. In contrast, LLMs can generate long, nuanced sequences that may have little to no ground truth because of the autoregressive way this class of model is trained. This strongly influences various components of the governance pipeline we’ve described.
For instance, in traditional ML models, bias is detected by looking at the distributions of labels over different population subsets (for example, male vs. female). The labels (often a single number or a few numbers) are a clear and simple signal used to measure bias. In contrast, generative models produce lengthy and complex answers, which don’t provide an obvious signal to be used for monitoring. HELM (a holistic framework for evaluating foundation models) allows you to simplify monitoring by untangling the evaluation process into metrics of concern. This includes accuracy, calibration and uncertainty, robustness, fairness, bias and stereotypes, toxicity, and efficiency. We then apply downstream processes to measure these metrics independently. This is generally done using standardized datasets composed of examples and a variety of accepted responses.
We concretely evaluate four metrics of interest to any governance pipeline for LLMs: memorization and copyright, disinformation, bias, and toxicity, as described in HELM. This is done by collecting inference results from the model pushed to the model registry. The benchmarks include:

Memorization and copyright, using popular books from a bestseller list in BooksCorpus and source code from the Linux kernel. This can be quickly extended to include additional copyrighted works.
Disinformation with headlines from the MisinfoReactionFrames dataset, which has false headlines across a number of topics.
Bias with Bias Benchmark for Question Answering (BBQ). This QA dataset works to highlight biases affecting various social groups.
Toxicity with Bias in Open-ended Language Generation Dataset (BOLD), which benchmarks across profession, gender, race, religion, and political ideology.

Each of these datasets is publicly available. They each allow complex aspects of a generative model’s behavior to be isolated and distilled down to a single number. This flow is described in the following architecture.

For a detailed view of this topic along with important mechanisms to scale in production, refer to Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services.
Conclusion
In this post, we discussed a sample solution to begin automating your compliance checks for models going into production. As AI/ML becomes increasingly common, organizations require new tools to codify the expertise of their highly skilled employees in the AI/ML space. By embedding your expertise as code and running these automated checks against models using event-driven architectures, you can increase both the speed and quality of models, because you can run these checks as needed rather than relying on the availability of individuals for manual compliance or quality assurance reviews. By taking well-known CI/CD techniques from the application development lifecycle and applying them to the ML modeling lifecycle, organizations can scale in the era of generative AI.
If you have any thoughts or questions, please leave them in the comments section.

About the Authors
Jayson Sizer McIntosh is a Senior Solutions Architect at Amazon Web Services (AWS) in the World Wide Public Sector (WWPS) based in Ottawa (Canada) where he primarily works with public sector customers as an IT generalist with a focus on Dev(Sec)Ops/CICD. Bringing his experience implementing cloud solutions in high compliance environments, he is passionate about helping customers successfully deliver modern cloud-based services to their users.
Nicolas Bernier is an AI/ML Solutions Architect, part of the Canadian Public Sector team at AWS. He is currently conducting research in Federated Learning and holds five AWS certifications, including the ML Specialty Certification. Nicolas is passionate about helping customers deepen their knowledge of AWS by working with them to translate their business challenges into technical solutions.
Pooja Ayre is a seasoned IT professional with over 9 years of experience in product development, having worn multiple hats throughout her career. For the past two years, she has been with AWS as a Solutions Architect, specializing in AI/ML. Pooja is passionate about technology and dedicated to finding innovative solutions that help customers overcome their roadblocks and achieve their business goals through the strategic use of technology. Her deep expertise and commitment to excellence make her a trusted advisor in the IT industry.

Writer Releases Palmyra-Med and Palmyra-Fin Models: Outperforming Othe …

The field of generative AI is increasingly focusing on creating models tailored to specific industries, enhancing performance in areas such as healthcare and finance. This specialization aims to meet the unique demands of these sectors, which require high accuracy and compliance due to their complex and regulated nature.

In healthcare and finance, traditional AI models often fall short of providing the precision and efficiency needed for industry-specific tasks. Medical and financial applications demand models that can handle specialized data accurately and cost-effectively. Existing general-purpose models may fail to fully address these fields’ intricacies, leading to performance gaps and higher costs for industry applications.

Currently, general-purpose models such as GPT-4 and Med-PaLM-2 are widely used for medical and financial tasks. While powerful, these models often lack the specialized capabilities needed for advanced medical diagnostics and detailed financial analysis. This limitation highlights the need for more refined and focused models that can deliver superior performance in these sectors.

To address these needs, the Writer Team has developed two new domain-specific models: Palmyra-Med and Palmyra-Fin. Palmyra-Med is designed for medical applications, while Palmyra-Fin targets financial tasks. These models are part of Writer’s suite of language models and are engineered to offer exceptional performance in their respective domains. Palmyra-Med-70B is distinguished by its high accuracy in medical benchmarks, achieving an average score of 85.9%. This surpasses competitors such as Med-PaLM-2 and performs particularly well in clinical knowledge, genetics, and biomedical research. Its cost efficiency is truly praiseworthy, priced at $10 per million output tokens, substantially lower than the $60 charged by models like GPT-4.

Palmyra-Fin-70B, designed for financial applications, has demonstrated outstanding results. It passed the CFA Level III exam with a score of 73%, outperforming general-purpose models like GPT-4, which scored only 33%. Furthermore, in the long-fin-eval benchmark, Palmyra-Fin-70B outperformed other models, including Claude 3.5 Sonnet and Mixtral-8x7b. This model excels in financial trend analysis, investment evaluations, and risk assessments, showcasing its ability to handle complex financial data precisely.

Palmyra-Med-70B uses advanced techniques to achieve its high benchmark scores. It integrates a specialized dataset and fine-tuning methodologies, including Direct Preference Optimization (DPO), to enhance its performance in medical tasks. The model’s accuracy in various benchmarks—such as 90.9% in MMLU Clinical Knowledge and 83.7% in MMLU Anatomy—demonstrates its deep understanding of clinical procedures and human anatomy. It scores 94.0% and 80% in genetics and biomedical research, respectively, underscoring its ability to interpret complex medical data and assist in research.

Palmyra-Fin-70B’s approach involves extensive training on financial data and custom fine-tuning. The model’s performance on the CFA Level III exam and its results in the long-fin-eval benchmark highlight its strong grasp of economic concepts and capability to process and analyze large amounts of financial information effectively. The model’s 100% accuracy in needle-in-haystack tasks reflects its ability to retrieve precise information from extensive financial documents.

In conclusion, Palmyra-Med and Palmyra-Fin represent significant advancements in specialized AI models for the medical and financial industries. Developed by Writer, these models offer enhanced accuracy and efficiency, addressing the specific needs of these sectors with a focus on cost-effectiveness and superior performance. They set a new standard for domain-specific AI applications, providing valuable tools for professionals in healthcare and finance.

Check out the Details, Palmyra-Fin-70B-32K Model, and Palmyra-Med-70b-32k Model. All credit for this research goes to the researchers of this project.

The post Writer Releases Palmyra-Med and Palmyra-Fin Models: Outperforming Other Comparable Models, like GPT-4, Med-PaLM-2, and Claude 3.5 Sonnet appeared first on MarkTechPost.

Haize Labs Introduced Sphynx: A Cutting-Edge Solution for AI Hallucina …

Haize Labs has recently introduced Sphynx, an innovative tool designed to address the persistent challenge of hallucination in AI models. In this context, hallucinations refer to instances where language models generate incorrect or nonsensical outputs, which can be problematic in various applications. The introduction of Sphynx aims to enhance the robustness and reliability of hallucination detection models through dynamic testing and fuzzing techniques.

Hallucinations represent a significant issue in large language models (LLMs). These models can sometimes produce inaccurate or irrelevant outputs despite their impressive capabilities. This undermines their utility and poses risks in critical applications where accuracy is paramount. Traditional approaches to mitigate this problem have involved training separate LLMs to detect hallucinations. However, these detection models are not immune to the issue they are meant to resolve. This paradox raises crucial questions about their reliability and the necessity for more robust testing methods.

Haize Labs proposes a novel “haizing” approach involving fuzz-testing hallucination detection models to uncover their vulnerabilities. The idea is to intentionally induce conditions that might lead these models to fail, thereby identifying their weak points. This method ensures that detection models are theoretically sound and practically robust against various adversarial scenarios.

Sphynx generates perplexing and subtly varied questions to test the limits of hallucination detection models. By perturbing elements such as the question, answer, or context, Sphynx aims to confuse the model into producing incorrect outputs. For instance, it might take a correctly answered question and rephrase it in a way that maintains the same intent but challenges the model to reassess its decision. This process helps identify scenarios where the model might incorrectly label a hallucination as valid or vice versa.

The core of Sphynx’s approach is a straightforward beam search algorithm. This method involves iteratively generating variations of a given question and testing the hallucination detection model against these variants. Sphynx effectively maps out the model’s robustness by ranking these variations based on their likelihood of inducing a failure. The simplicity of this algorithm belies its effectiveness, demonstrating that even basic perturbations can reveal significant weaknesses in state-of-the-art models.
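
Sphynx's implementation is not reproduced here, but the idea can be sketched as a beam search over question rewrites; perturb() and detector_score() below are hypothetical stand-ins for a paraphrasing step and the hallucination detector under test.

def fuzz_question(question, answer, context, beam_width=4, depth=3):
    beam = [question]
    for _ in range(depth):
        # expand every question in the beam into several rephrased variants
        candidates = [variant for q in beam for variant in perturb(q)]
        # keep the rewrites most likely to make the detector flip its verdict
        # (lower score = detector is less confident the answer is grounded)
        candidates.sort(key=lambda q: detector_score(q, answer, context))
        beam = candidates[:beam_width]
    return beam  # the most adversarial rewrites found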

Sphynx’s testing methodology has yielded insightful results. For instance, when applied to leading hallucination detection models like GPT-4o (OpenAI), Claude-3.5-Sonnet (Anthropic), Llama 3 (Meta), and Lynx (Patronus AI), the robustness scores varied significantly. These scores, which measure the models’ ability to withstand adversarial attacks, highlighted substantial disparities in their performance. Such evaluations are critical for developers and researchers aiming to deploy AI systems in real-world applications where reliability is non-negotiable.

The introduction of Sphynx underscores the importance of dynamic and rigorous testing in AI development. Static datasets and conventional testing approaches, while useful, are not enough to uncover the nuanced and complex failure modes that can arise in AI systems. By forcing these failures to surface during development, Sphynx helps ensure that models are better prepared for real-world deployment.

In conclusion, Haize Labs’ Sphynx represents an advancement in the ongoing effort to mitigate AI hallucinations. By leveraging dynamic fuzz testing and a straightforward haizing algorithm, Sphynx offers a robust framework for enhancing the reliability of hallucination detection models. This innovation addresses a critical challenge in AI and sets the stage for more resilient and dependable AI applications in the future.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.

The post Haize Labs Introduced Sphynx: A Cutting-Edge Solution for AI Hallucination Detection with Dynamic Testing and Fuzzing Techniques appeared first on MarkTechPost.

NuMind Released: Empowering Custom NLP Model Creation with In-House Fo …

NuMind is an innovative tool designed to facilitate the creation of custom natural language processing (NLP) models through an interactive teaching process. Developed by NuMind, the tool aims to democratize the use of advanced NLP models by allowing users to build high-performance information extraction models without requiring extensive technical expertise or sharing sensitive data.

NuMind leverages in-house foundation models, automatic machine learning, and an active learning strategy to streamline the model creation process. By teaching the AI, users can develop lightweight custom models, typically less than 1 GB, that are highly efficient and, after sufficient training and corrections, often outperform larger generic large language models (LLMs) such as GPT-3.5 and GPT-4.

NuMind supports various NLP tasks, including classification, multilabel classification, named entity recognition (NER), and, soon, structured extraction. These tasks enable users to extract relevant information from diverse documents, such as medical reports, legal documents, financial statements, social media posts, and chat messages.

Teaching the AI involves three main steps: telling the AI what to do, showing the AI how to do it, and iteratively correcting the AI’s mistakes. This approach mimics the way humans teach each other and proves highly effective. Users begin by describing the project and creating classes or labels. Next, they demonstrate the task by annotating a few documents. The AI then uses these annotations to fine-tune its models, with an active learning procedure selecting the most informative documents for further annotation.

NuMind ensures that all data and computations remain local, maintaining the privacy and confidentiality of user data. This feature is important for industries with stringent data privacy requirements. As users continue to correct the AI’s mistakes, the model improves rapidly, often requiring fewer corrections over time. This iterative process enables users to create high-quality custom models with minimal effort.

For instance, creating a NER model with NuMind involves downloading the tool, starting a project, and selecting the entity detection task. Users then annotate documents to teach the AI, which learns from these annotations and improves its performance through subsequent iterations. This method allows the creation of robust models capable of accurately identifying entities within documents.

NuMind also includes a feature for reviewing disagreements between the user’s annotations and the model’s predictions. This review process helps users identify and correct any discrepancies, further enhancing the model’s accuracy. Additionally, NuMind provides a model playground where users can test and debug the model by editing text and observing the AI’s predictions. This interactive debugging is crucial for understanding and improving the model’s robustness.

Once satisfied with the model’s performance, users can deploy it on a server, creating a REST API for integration with other applications. The small size of the custom models, less than 1 GB, allows them to be hosted on inexpensive CPU servers, making deployment cost-effective. NuMind is also developing a production cloud solution for users with less stringent privacy requirements.

NuMind’s versatility extends to various industries and languages, making it a valuable tool for multiple applications. Whether extracting information from complex legal documents, analyzing social media content, or processing financial data, NuMind provides a powerful, user-friendly solution for creating custom NLP models.

In conclusion, the release of NuMind simplifies the model creation process and ensures data privacy, empowering users across different industries to harness the power of AI for information extraction. This tool enhances productivity and allows for leveraging AI in various domains.

Check out the Details. All credit for this research goes to the researchers of this project.

The post NuMind Released: Empowering Custom NLP Model Creation with In-House Foundation Models and Active Learning for Over 10 Industries and Languages appeared first on MarkTechPost.

Build custom generative AI applications powered by Amazon Bedrock

With last month’s blog, I started a series of posts that highlight the key factors that are driving customers to choose Amazon Bedrock. I explored how Bedrock enables customers to build a secure, compliant foundation for generative AI applications. Now I’d like to turn to a slightly more technical, but equally important differentiator for Bedrock—the multiple techniques that you can use to customize models and meet your specific business needs.
As we’ve all heard, large language models (LLMs) are transforming the way we leverage artificial intelligence (AI) and enabling businesses to rethink core processes. Trained on massive datasets, these models can rapidly comprehend data and generate relevant responses across diverse domains, from summarizing content to answering questions. The wide applicability of LLMs explains why customers across healthcare, financial services, and media and entertainment are moving quickly to adopt them. However, our customers tell us that while pre-trained LLMs excel at analyzing vast amounts of data, they often lack the specialized knowledge necessary to tackle specific business challenges.
Customization unlocks the transformative potential of large language models. Amazon Bedrock equips you with a powerful and comprehensive toolset to transform your generative AI from a one-size-fits-all solution into one that is finely tailored to your unique needs. Customization includes varied techniques such as Prompt Engineering, Retrieval Augmented Generation (RAG), and fine-tuning and continued pre-training. Prompt Engineering involves carefully crafting prompts to get a desired response from LLMs. RAG combines knowledge retrieved from external sources with language generation to provide more contextual and accurate responses. Model customization techniques, including fine-tuning and continued pre-training, involve further training a pre-trained language model on specific tasks or domains for improved performance. These techniques can be used in combination with each other to train base models in Amazon Bedrock with your data to deliver contextual and accurate outputs. Read the examples below to understand how customers are using customization in Amazon Bedrock to deliver on their use cases.
Thomson Reuters, a global content and technology company, has seen positive results with Claude 3 Haiku, but anticipates even better results with customization. The company—which serves professionals in legal, tax, accounting, compliance, government, and media—expects that it will see even faster and more relevant AI results by fine-tuning Claude with their industry expertise.

“We’re excited to fine-tune Anthropic’s Claude 3 Haiku model in Amazon Bedrock to further enhance our Claude-powered solutions. Thomson Reuters aims to provide accurate, fast, and consistent user experiences. By optimizing Claude around our industry expertise and specific requirements, we anticipate measurable improvements that deliver high-quality results at even faster speeds. We’ve already seen positive results with Claude 3 Haiku, and fine-tuning will enable us to tailor our AI assistance more precisely.”
– Joel Hron, Chief Technology Officer at Thomson Reuters.

At Amazon, we see Buy with Prime using Amazon Bedrock’s cutting-edge RAG-based customization capabilities to drive greater efficiency. Orders placed on merchants’ sites are covered by Buy with Prime Assist, a 24/7 live chat customer service. They recently launched a chatbot solution in beta that is capable of handling product support queries. The solution is powered by Amazon Bedrock and customized with data to go beyond traditional email-based systems. My colleague Amit Nandy, Product Manager at Buy with Prime, says,

“By indexing merchant websites, including subdomains and PDF manuals, we constructed tailored knowledge bases that provided relevant and comprehensive support for each merchant’s unique offerings. Combined with Claude’s state-of-the-art foundation models and Guardrails for Amazon Bedrock, our chatbot solution delivers a highly capable, secure, and trustworthy customer experience. Shoppers can now receive accurate, timely, and personalized assistance for their queries, fostering increased satisfaction and strengthening the reputation of Buy with Prime and its participating merchants.”

Stories like these are the reason why we continue to double down on our customization capabilities for generative AI applications powered by Amazon Bedrock.
In this blog, we’ll explore the three major techniques for customizing LLMs in Amazon Bedrock and cover related announcements from the recent AWS New York Summit.
Prompt Engineering: Guiding your application toward desired answers
Prompts are the primary inputs that drive LLMs to generate answers. Prompt engineering is the practice of carefully crafting these prompts to guide LLMs effectively. Well-designed prompts can significantly boost a model’s performance by providing clear instructions, context, and examples tailored to the task at hand. Amazon Bedrock supports multiple prompt engineering techniques. For example, few-shot prompting provides examples with desired outputs to help models better understand tasks, such as sentiment analysis samples labeled “positive” or “negative.” Zero-shot prompting provides only a task description, without examples. And chain-of-thought prompting enhances multi-step reasoning by asking models to break down complex problems, which is useful for arithmetic, logic, and deductive tasks.
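To make these techniques concrete, here is a minimal sketch (not code from this post) that sends a zero-shot, a few-shot, and a chain-of-thought prompt through the Bedrock Converse API using boto3; the model ID, prompt text, and helper function are illustrative assumptions.

```python
# Illustrative sketch: zero-shot, few-shot, and chain-of-thought prompts
# sent through the Amazon Bedrock Converse API.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # assumed model ID

ZERO_SHOT = (
    "Classify the sentiment of this review as positive or negative:\n"
    "'The checkout process was painless and fast.'"
)

FEW_SHOT = """Classify the sentiment of each review as positive or negative.
Review: "Delivery took three weeks." -> negative
Review: "Great price and easy returns." -> positive
Review: "The checkout process was painless and fast." ->"""

CHAIN_OF_THOUGHT = (
    "A customer bought 3 items at $12 each and used a $5 coupon. "
    "Think step by step, then give the total they paid."
)

def ask(prompt: str) -> str:
    # Send a single user message and return the model's text reply.
    response = client.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

for name, prompt in [
    ("zero-shot", ZERO_SHOT),
    ("few-shot", FEW_SHOT),
    ("chain-of-thought", CHAIN_OF_THOUGHT),
]:
    print(f"--- {name} ---")
    print(ask(prompt))
```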
Our Prompt Engineering Guidelines outline various prompting strategies and best practices for optimizing LLM performance across applications. Leveraging these techniques can help practitioners achieve their desired outcomes more effectively. However, developing optimal prompts that elicit the best responses from foundation models is a challenging and iterative process, often requiring weeks of refinement by developers.

[Figures: examples of zero-shot prompting, few-shot prompting, and chain-of-thought prompting with the Prompt Flows Visual Builder]

Retrieval-Augmented Generation: Augmenting results with retrieved data
LLMs generally lack specialized knowledge, jargon, context, or up-to-date information needed for specific tasks. For instance, legal professionals seeking reliable, current, and accurate information within their domain may find interactions with generalist LLMs inadequate. Retrieval-Augmented Generation (RAG) is the process of allowing a language model to consult an authoritative knowledge base outside of its training data sources before generating a response.
The RAG process involves three main steps, sketched in code after the list:

Retrieval: Given an input prompt, a retrieval system identifies and fetches relevant passages or documents from a knowledge base or corpus.
Augmentation: The retrieved information is combined with the original prompt to create an augmented input.
Generation: The LLM generates a response based on the augmented input, leveraging the retrieved information to produce more accurate and informed outputs.
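
The sketch below walks through those three steps end to end. The tiny in-memory corpus and keyword-overlap retriever are illustrative stand-ins for a real vector store, and the model ID is an assumption; only the Converse call reflects an actual Bedrock API.

```python
# Illustrative RAG sketch: retrieve -> augment -> generate.
# The in-memory corpus and keyword scorer stand in for a real vector store.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # assumed model ID

CORPUS = [
    "Refunds for Buy with Prime orders are issued to the original payment method.",
    "Standard delivery for Prime-eligible items takes one to two business days.",
    "Merchants can set their returns window between 14 and 30 days.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Step 1 - Retrieval: rank passages by naive keyword overlap with the query.
    words = set(query.lower().split())
    scored = sorted(CORPUS, key=lambda p: -len(words & set(p.lower().split())))
    return scored[:k]

def augment(query: str, passages: list[str]) -> str:
    # Step 2 - Augmentation: combine retrieved passages with the original prompt.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    # Step 3 - Generation: the LLM answers from the augmented input.
    response = client.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

question = "How long does standard delivery take?"
print(generate(augment(question, retrieve(question))))
```

In practice, a managed feature such as Amazon Bedrock’s Knowledge Bases (described next) replaces the hand-rolled retriever and corpus with connected data sources and a managed retrieval pipeline.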

Amazon Bedrock’s Knowledge Bases is a fully managed RAG feature that allows you to connect LLMs to internal company data sources—delivering relevant, accurate, and customized responses. To offer greater flexibility and accuracy in building RAG-based applications, we announced multiple new capabilities at the AWS New York Summit. For example, now you can securely access data from new sources like the web (in preview), allowing you to index public web pages, or access enterprise data from Confluence, SharePoint, and Salesforce (all in preview). Advanced chunking options are another exciting new feature, enabling you to create custom chunking algorithms tailored to your specific needs, as well as leverage built-in semantic and hierarchical chunking options. You now have the capability to extract information with precision from complex data formats (e.g., complex tables within PDFs), thanks to advanced parsing techniques. Plus, the query reformulation feature allows you to deconstruct complex queries into simpler sub-queries, enhancing retrieval accuracy. All these new features help you reduce the time and cost associated with data access and construct highly accurate and relevant knowledge resources—all tailored to your specific enterprise use cases.
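For comparison with the hand-rolled sketch above, here is what the managed path can look like: a single RetrieveAndGenerate call against a Knowledge Base via the boto3 bedrock-agent-runtime client. The knowledge base ID, model ARN, region, and sample question are placeholder assumptions.

```python
# Illustrative sketch: querying a Bedrock Knowledge Base so retrieval,
# augmentation, and generation are handled by the managed RAG feature.
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.retrieve_and_generate(
    input={"text": "What is the returns window for participating merchants?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                        "anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

print(response["output"]["text"])                # generated, grounded answer
for citation in response.get("citations", []):   # source passages backing the answer
    print(citation)
```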
Model Customization: Enhancing performance for specific tasks or domains
Model customization in Amazon Bedrock adapts pre-trained language models to specific tasks or domains. It involves taking a large, pre-trained model and further training it on a smaller, specialized dataset related to your use case. This approach leverages the knowledge acquired during the initial pre-training phase while adapting the model to your requirements, without losing the original capabilities. The fine-tuning process in Amazon Bedrock is designed to be efficient, scalable, and cost-effective, enabling you to tailor language models to your unique needs without extensive computational resources or data. In Amazon Bedrock, model fine-tuning can be combined with prompt engineering or the Retrieval-Augmented Generation (RAG) approach to further enhance the performance and capabilities of language models. Model customization can be applied with both labeled and unlabeled data.
Fine-Tuning with labeled data involves providing labeled training data to improve the model’s performance on specific tasks. The model learns to associate appropriate outputs with certain inputs, adjusting its parameters for better task accuracy. For instance, if you have a dataset of customer reviews labeled as positive or negative, you can fine-tune a pre-trained model within Bedrock on this data to create a sentiment analysis model tailored to your domain. At the AWS New York Summit, we announced Fine-tuning for Anthropic’s Claude 3 Haiku. By providing task-specific training datasets, users can fine-tune and customize Claude 3 Haiku, boosting its accuracy, quality, and consistency for their business applications.
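As a rough sketch of what submitting such a job can look like, the snippet below uses the boto3 bedrock client’s create_model_customization_job operation; the training record shown in the comment, the S3 URIs, IAM role, base-model identifier, and hyperparameter are illustrative assumptions, and the exact training-record schema depends on the base model.

```python
# Illustrative sketch: fine-tuning with labeled data in Amazon Bedrock.
# One labeled training record (schema varies by base model) might look like:
#   {"prompt": "Review: 'Shipping was slow.' Sentiment:", "completion": "negative"}
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.create_model_customization_job(
    jobName="sentiment-finetune-001",                 # placeholder names
    customModelName="my-sentiment-model",
    roleArn="arn:aws:iam::111122223333:role/BedrockFineTuneRole",
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder identifier
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train/labeled.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
    hyperParameters={"epochCount": "2"},              # assumed hyperparameter name
)
```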
Continued Pre-training with unlabeled data, also known as domain adaptation, allows you to further train the LLMs on your company’s proprietary, unlabeled data. It exposes the model to your domain-specific knowledge and language patterns, enhancing its understanding and performance for specific tasks.
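In this sketch, continued pre-training reuses the same job operation; the assumed differences are the customization type and an unlabeled JSONL dataset (the record schema, base-model identifier, and hyperparameter shown here are placeholders rather than definitive values).

```python
# Illustrative sketch: continued pre-training (domain adaptation) on unlabeled text.
# An unlabeled record might simply carry raw domain text, e.g. {"input": "<internal policy text>"}.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.create_model_customization_job(
    jobName="domain-adapt-001",                       # placeholder names
    customModelName="my-domain-adapted-model",
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",  # assumed base model ID
    customizationType="CONTINUED_PRE_TRAINING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train/unlabeled.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
    hyperParameters={"epochCount": "1"},              # assumed hyperparameter name
)
```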
Customization holds the key to unlocking the true power of generative AI
Large language models are revolutionizing AI applications across industries, but tailoring these general models with specialized knowledge is key to unlocking their full business impact. Amazon Bedrock empowers organizations to customize LLMs through Prompt Engineering techniques, such as Prompt Management and Prompt Flows, that help craft effective prompts. Retrieval-Augmented Generation, powered by Amazon Bedrock’s Knowledge Bases, lets you integrate LLMs with proprietary data sources to generate accurate, domain-specific responses. And Model Customization techniques, including fine-tuning with labeled data and continued pre-training with unlabeled data, help optimize LLM behavior for your unique needs. After taking a close look at these three main customization methods, it’s clear that while they take different approaches, they all share a common goal: to help you address your specific business problems.
Resources       
For more information on customization with Amazon Bedrock, check the below resources:

Learn more about Amazon Bedrock
Learn more about Amazon Bedrock Knowledge Bases
Read announcement blog on additional data connectors in Knowledge Bases for Amazon Bedrock
Read blog on advanced chunking and parsing options in Knowledge Bases for Amazon Bedrock
Learn more about Prompt Engineering
Learn more about Prompt Engineering techniques and best practices
Read announcement blog on Prompt Management and Prompt Flows
Learn more about fine-tuning and continued pre-training
Read the announcement blog on fine-tuning Anthropic’s Claude 3 Haiku

About the author
Vasi Philomin is VP of Generative AI at AWS. He leads generative AI efforts, including Amazon Bedrock and Amazon Titan.