Microsoft AI Research Introduces OLA-VLM: A Vision-Centric Approach to Optimizing Multimodal Large Language Models

Multimodal large language models (MLLMs) are advancing rapidly, enabling machines to interpret and reason about textual and visual data simultaneously. These models have transformative applications in image analysis, visual question answering, and multimodal reasoning. By bridging the gap between vision and language, they play a crucial role in improving artificial intelligence’s ability to understand and interact with the world holistically.

Despite their promise, these systems face significant challenges. A core limitation is their reliance on natural language supervision during training, which often yields suboptimal visual representations. While increasing dataset size and computational complexity has led to modest improvements, such scaling is no substitute for targeted optimization of visual understanding within these models, which is required for strong performance on vision-based tasks. Current methods also frequently struggle to balance computational efficiency with improved performance.

Existing techniques for training MLLMs typically involve using visual encoders to extract features from images and feeding them into the language model alongside natural language data. Some methods employ multiple visual encoders or cross-attention mechanisms to enhance understanding. However, these approaches come at the cost of significantly higher data and computation requirements, limiting their scalability and practicality. This inefficiency underscores the need for a more effective way to optimize MLLMs for visual comprehension.

Researchers at SHI Labs at Georgia Tech and Microsoft Research introduced a novel approach called OLA-VLM to address these challenges. The method aims to improve MLLMs by distilling auxiliary visual information into their hidden layers during pretraining. Instead of increasing visual encoder complexity, OLA-VLM leverages embedding optimization to enhance the alignment of visual and textual data. Introducing this optimization into intermediate layers of the language model ensures better visual reasoning without additional computational overhead during inference.

The technology behind OLA-VLM involves embedding loss functions to optimize representations from specialized visual encoders. These encoders are trained for image segmentation, depth estimation, and image generation tasks. The distilled features are mapped to specific layers of the language model using predictive embedding optimization techniques. Further, special task-specific tokens are appended to the input sequence, allowing the model to incorporate auxiliary visual information seamlessly. This design ensures that the visual features are effectively integrated into the MLLM’s representations without disrupting the primary training objective of next-token prediction. The result is a model that learns more robust and vision-centric representations.
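To make the idea concrete, the following is a minimal PyTorch-style sketch of the kind of predictive embedding distillation described above. The layer choices, projector shape, pooling, and loss weighting are illustrative assumptions, not the exact OLA-VLM configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDistillationHead(nn.Module):
    """Projects an intermediate LLM hidden state into a target visual-encoder feature space."""
    def __init__(self, llm_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, target_dim)
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_state)

def auxiliary_embedding_loss(hidden_states, target_features, heads, layer_map):
    """Cosine embedding loss between projected hidden states of chosen LLM layers and
    frozen features from task-specific encoders (e.g., depth, segmentation, generation)."""
    loss = 0.0
    for task, layer_idx in layer_map.items():
        pred = heads[task](hidden_states[layer_idx].mean(dim=1))   # pool over sequence length
        target = target_features[task].mean(dim=1)                 # pooled frozen encoder feature
        loss = loss + (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
    return loss

# Illustrative usage, combined with the standard next-token objective (layer indices are made up):
# total_loss = next_token_loss + 0.5 * auxiliary_embedding_loss(hs, feats, heads, {"depth": 8, "seg": 12, "gen": 16})

Consistent with the article, the auxiliary heads and target encoders would be used only during training; at inference the deployed model keeps a single visual encoder.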

The performance of OLA-VLM was rigorously tested on various benchmarks, showing substantial improvements over existing single- and multi-encoder models. On CV-Bench, a vision-centric benchmark suite, OLA-VLM outperformed the LLaVA-1.5 baseline by up to 8.7% on depth estimation, achieving an accuracy of 77.8%. For segmentation, it achieved a mean Intersection over Union (mIoU) of 45.4%, a significant improvement over the baseline’s 39.3%. The model also demonstrated consistent gains across 2D and 3D vision tasks, with an average improvement of up to 2.5% on benchmarks such as distance and relation reasoning. OLA-VLM achieved these results using only a single visual encoder during inference, making it far more efficient than multi-encoder systems.

To further validate its effectiveness, researchers analyzed the representations learned by OLA-VLM. Probing experiments revealed that the model achieved superior visual feature alignment in its intermediate layers. This alignment significantly enhanced the model’s downstream performance across various tasks. For instance, the researchers noted that integrating special task-specific tokens during training contributed to better optimizing features for depth, segmentation, and image generation tasks. The results underscored the efficiency of the predictive embedding optimization approach, proving its capability to balance high-quality visual understanding with computational efficiency.

OLA-VLM establishes a new standard for integrating visual information into MLLMs by focusing on embedding optimization during pretraining. This research addresses the gap in current training methods by introducing a vision-centric perspective to improve the quality of visual representations. The proposed approach enhances performance on vision-language tasks and achieves this with fewer computational resources compared to existing methods. OLA-VLM exemplifies how targeted optimization during pretraining can substantially improve multimodal model performance.

In conclusion, the research conducted by SHI Labs and Microsoft Research highlights a groundbreaking advancement in multimodal AI. By optimizing visual representations within MLLMs, OLA-VLM bridges a critical gap in performance and efficiency. This method demonstrates how embedding optimization can effectively address challenges in vision-language alignment, paving the way for more robust and scalable multimodal systems in the future.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Meta FAIR Releases Meta Motivo: A New Behavioral Foundation Model for Controlling Virtual Physics-based Humanoid Agents for a Wide Range of Complex Whole-Body Tasks

Foundation models, pre-trained on extensive unlabeled data, have emerged as a cutting-edge approach for developing versatile AI systems capable of solving complex tasks through targeted prompts. Researchers are now exploring the potential of extending this paradigm beyond language and visual domains, focusing on behavioral foundation models (BFMs) for agents interacting with dynamic environments. Specifically, the research aims to develop BFMs for humanoid agents, targeting whole-body control through proprioceptive observations. This approach addresses a long-standing challenge in robotics and AI, characterized by the high-dimensionality and intrinsic instability of humanoid control systems. The ultimate goal is to create generalized models that can express diverse behaviors in response to various prompts, including imitation, goal achievement, and reward optimization.

Meta researchers introduce FB-CPR (Forward-Backward representations with Conditional Policy Regularization), an online unsupervised reinforcement learning algorithm designed to ground policy learning in observation-only, unlabeled behaviors. Its key technical innovation is the use of forward-backward representations to embed unlabeled trajectories into a shared latent space, together with a latent-conditional discriminator that encourages policies to comprehensively “cover” dataset states. Demonstrating the method’s effectiveness, the team developed Meta Motivo, a behavioral foundation model for whole-body humanoid control that can be prompted to solve diverse tasks such as motion tracking, goal reaching, and reward optimization in a zero-shot setting. The model uses the SMPL skeleton and the AMASS motion capture dataset to achieve remarkable behavioral expressiveness.

Researchers introduce a robust approach to forward-backward (FB) representation learning with conditional policy regularization. At the pre-training stage, the agent has access to an unlabeled behavior dataset containing observation-only trajectories. The method focuses on developing a continuous set of latent-conditioned policies where latent variables are drawn from a distribution defined over a latent space. By representing behaviors through the joint space of states and latent variables, the researchers aim to capture diverse motion patterns. The key innovation lies in inferring latent variables for each trajectory using the ERFB method, which allows encoding trajectories into a shared representational space. The ultimate goal is to regularize the unsupervised training of the behavioral foundation model by minimizing the discrepancy between the induced policy distribution and the dataset distribution.
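As a rough illustration of the conditional policy regularization idea, the sketch below shows a latent-conditioned discriminator trained to separate dataset state-latent pairs from policy rollouts, with its score usable as a regularization signal. This is a simplified, GAIL-style rendering under stated assumptions, not the authors’ implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentConditionedDiscriminator(nn.Module):
    """Scores (state, latent) pairs; trained to tell dataset embeddings from policy rollouts."""
    def __init__(self, state_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

def discriminator_loss(disc, dataset_states, dataset_z, policy_states, policy_z):
    """Binary logistic loss: dataset (state, latent) pairs are 'real', policy pairs are 'fake'."""
    real = disc(dataset_states, dataset_z)
    fake = disc(policy_states, policy_z)
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def regularization_signal(disc, state, z):
    """Bonus that pushes the latent-conditioned policy toward dataset-like states."""
    with torch.no_grad():
        return torch.sigmoid(disc(state, z)).log()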

The research presents a comprehensive performance evaluation of the FB-CPR algorithm across multiple task categories. FB-CPR demonstrates remarkable zero-shot capabilities, achieving 73.4% of top-line algorithm performance without explicit task-specific training. In reward-maximization tasks, the method outperforms unsupervised baselines, notably achieving 177% of DIFFUSER’s performance while maintaining significantly lower computational complexity. For goal-reaching tasks, FB-CPR performs comparably to specialized baselines, outperforming zero-shot alternatives by 48% and 118% in proximity and success metrics respectively. A human evaluation study further revealed that while task-specific algorithms might achieve higher numerical performance, FB-CPR was consistently perceived as more “human-like”, with participants rating its behaviors as more natural in 83% of reward-based tasks and 69% of goal-reaching scenarios.

This research introduced FB-CPR, a unique algorithm that combines zero-shot properties of forward-backward models with innovative regularization techniques for policy learning using unlabeled behavior datasets. By training the first behavioral foundation model for complex humanoid agent control, the method demonstrated state-of-the-art performance across diverse tasks. Despite its significant achievements, the approach has notable limitations. FB-CPR struggles with tasks far removed from motion-capture datasets and occasionally produces imperfect movements, particularly in scenarios involving falling or standing. The current model is restricted to proprioceptive observations and cannot navigate environments or interact with objects. Future research directions include integrating additional state variables, exploring complex perception methods, utilizing video-based human activity datasets, and developing more direct language-policy alignment techniques to expand the model’s capabilities and generalizability.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Nexa AI Releases OmniAudio-2.6B: A Fast Audio Language Model for Edge Deployment

Audio language models (ALMs) play a crucial role in various applications, from real-time transcription and translation to voice-controlled systems and assistive technologies. However, many existing solutions face limitations such as high latency, significant computational demands, and a reliance on cloud-based processing. These issues pose challenges for edge deployment, where low power consumption, minimal latency, and localized processing are critical. In environments with limited resources or strict privacy requirements, these challenges make large, centralized models impractical. Addressing these constraints is essential for unlocking the full potential of ALMs in edge scenarios.

Nexa AI has announced OmniAudio-2.6B, an audio-language model designed specifically for edge deployment. Unlike traditional architectures that separate Automatic Speech Recognition (ASR) and language models, OmniAudio-2.6B integrates Gemma-2-2b, Whisper Turbo, and a custom projector into a unified framework. This design eliminates the inefficiencies and delays associated with chaining separate components, making it well-suited for devices with limited computational resources.

OmniAudio-2.6B aims to provide a practical, efficient solution for edge applications. By focusing on the specific needs of edge environments, Nexa AI offers a model that balances performance with resource constraints, demonstrating its commitment to advancing AI accessibility.

Technical Details and Benefits

OmniAudio-2.6B’s architecture is optimized for speed and efficiency. The integration of Gemma-2-2b, a refined LLM, and Whisper Turbo, a robust ASR system, ensures a seamless and efficient audio processing pipeline. The custom projector bridges these components, reducing latency and enhancing operational efficiency. Key performance highlights include:

Processing Speed: On a 2024 Mac Mini M4 Pro, OmniAudio-2.6B achieves 35.23 tokens per second with FP16 GGUF format and 66 tokens per second with Q4_K_M GGUF format, using the Nexa SDK. In comparison, Qwen2-Audio-7B, a prominent alternative, processes only 6.38 tokens per second on similar hardware. This difference represents a significant improvement in speed.

Resource Efficiency: The model’s compact design minimizes its reliance on cloud resources, making it ideal for applications in wearables, automotive systems, and IoT devices where power and bandwidth are limited.

Accuracy and Flexibility: Despite its focus on speed and efficiency, OmniAudio-2.6B delivers high accuracy, making it versatile for tasks such as transcription, translation, and summarization.

These advancements make OmniAudio-2.6B a practical choice for developers and businesses seeking responsive, privacy-friendly solutions for edge-based audio processing.

Performance Insights

Benchmark tests underline the impressive performance of OmniAudio-2.6B. On a 2024 Mac Mini M4 Pro, the model processes up to 66 tokens per second, significantly surpassing the 6.38 tokens per second of Qwen2-Audio-7B. This increase in speed expands the possibilities for real-time audio applications.
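For reference, the reported throughput figures translate into the following speedups over Qwen2-Audio-7B (simple arithmetic on the numbers above):

# Throughput reported on a 2024 Mac Mini M4 Pro (tokens per second)
omniaudio_fp16 = 35.23
omniaudio_q4_k_m = 66.0
qwen2_audio_7b = 6.38

print(f"FP16 GGUF speedup:   {omniaudio_fp16 / qwen2_audio_7b:.1f}x")    # ~5.5x
print(f"Q4_K_M GGUF speedup: {omniaudio_q4_k_m / qwen2_audio_7b:.1f}x")  # ~10.3x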

For example, OmniAudio-2.6B can enhance virtual assistants by enabling faster, on-device responses without the delays associated with cloud reliance. In industries such as healthcare, where real-time transcription and translation are critical, the model’s speed and accuracy can improve outcomes and efficiency. Its edge-friendly design further enhances its appeal for scenarios requiring localized processing.

Conclusion

OmniAudio-2.6B represents an important step forward in audio-language modeling, addressing key challenges such as latency, resource consumption, and cloud dependency. By integrating advanced components into a cohesive framework, Nexa AI has developed a model that balances speed, efficiency, and accuracy for edge environments.

With performance metrics showing up to a 10.3x improvement over existing solutions, OmniAudio-2.6B offers a robust, scalable option for a variety of edge applications. This model reflects a growing emphasis on practical, localized AI solutions, paving the way for advancements in audio-language processing that meet the demands of modern applications.

Check out the Details and Model on Hugging Face. All credit for this research goes to the researchers of this project.

Llama 3.3 70B now available in Amazon SageMaker JumpStart

Today, we are excited to announce that Llama 3.3 70B from Meta is available in Amazon SageMaker JumpStart. Llama 3.3 70B marks an exciting advancement in large language model (LLM) development, offering performance comparable to larger Llama versions with fewer computational resources.
In this post, we explore how to deploy this model efficiently on Amazon SageMaker AI, using advanced SageMaker AI features for optimal performance and cost management.
Overview of the Llama 3.3 70B model
Llama 3.3 70B represents a significant breakthrough in model efficiency and performance optimization. This new model delivers output quality comparable to Llama 3.1 405B while requiring only a fraction of the computational resources. According to Meta, this efficiency gain translates to nearly five times more cost-effective inference operations, making it an attractive option for production deployments.
The model’s sophisticated architecture builds upon Meta’s optimized version of the transformer design, featuring an enhanced attention mechanism that can help substantially reduce inference costs. During its development, Meta’s engineering team trained the model on an extensive dataset comprising approximately 15 trillion tokens, incorporating both web-sourced content and over 25 million synthetic examples specifically created for LLM development. This comprehensive training approach results in the model’s robust understanding and generation capabilities across diverse tasks.
What sets Llama 3.3 70B apart is its refined training methodology. The model underwent an extensive supervised fine-tuning process, complemented by Reinforcement Learning from Human Feedback (RLHF). This dual-approach training strategy helps align the model’s outputs more closely with human preferences while maintaining high performance standards. In benchmark evaluations against its larger counterpart, Llama 3.3 70B demonstrated remarkable consistency, trailing Llama 3.1 405B by less than 2% in 6 out of 10 standard AI benchmarks and actually outperforming it in three categories. This performance profile makes it an ideal candidate for organizations seeking to balance model capabilities with operational efficiency.
The following figure summarizes the benchmark results (source).

Getting started with SageMaker JumpStart
SageMaker JumpStart is a machine learning (ML) hub that can help accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. These models are fully customizable for your use case with your data, and you can deploy them into production using either the UI or SDK.
Deploying Llama 3.3 70B through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.
Deploy Llama 3.3 70B through the SageMaker JumpStart UI
You can access the SageMaker JumpStart UI through either Amazon SageMaker Unified Studio or Amazon SageMaker Studio. To deploy Llama 3.3 70B using the SageMaker JumpStart UI, complete the following steps:

In SageMaker Unified Studio, on the Build menu, choose JumpStart models.

Alternatively, on the SageMaker Studio console, choose JumpStart in the navigation pane.

Search for Meta Llama 3.3 70B.
Choose the Meta Llama 3.3 70B model.
Choose Deploy.
Accept the end-user license agreement (EULA).
For Instance type, choose an instance (ml.g5.48xlarge or ml.p4d.24xlarge).
Choose Deploy.

Wait until the endpoint status shows as InService. You can now run inference using the model.
Deploy Llama 3.3 70B using the SageMaker Python SDK
For teams looking to automate deployment or integrate with existing MLOps pipelines, you can use the following code to deploy the model using the SageMaker Python SDK:

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

js_model_id = "meta-textgeneration-llama-3-3-70b-instruct"

gpu_instance_type = "ml.p4d.24xlarge"

response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR,
)

model = model_builder.build()

predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)},
    accept_eula=True,
)
predictor.predict(sample_input)
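After validating the deployment, remember to clean up to avoid ongoing charges; with the SageMaker Python SDK this typically looks like the following:

predictor.delete_model()     # deletes the SageMaker model created for the endpoint
predictor.delete_endpoint()  # deletes the endpoint and its endpoint configuration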

Set up auto scaling and scale down to zero
You can optionally set up auto scaling to scale down to zero after deployment. For more information, refer to Unlock cost savings with the new scale down to zero feature in SageMaker Inference.
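As a rough sketch of what that setup can look like with the AWS SDK for Python (Boto3), the following registers an inference component as a scalable target with a minimum capacity of zero. This assumes an endpoint that uses inference components, as described in the linked post; the component name is a placeholder, and the full scaling-policy configuration is covered there.

import boto3

autoscaling = boto3.client("application-autoscaling")

# <INFERENCE_COMPONENT_NAME> is the inference component backing your Llama 3.3 70B endpoint.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/<INFERENCE_COMPONENT_NAME>",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,   # allows the endpoint to scale down to zero copies when idle
    MaxCapacity=1,
)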
Optimize deployment with SageMaker AI
SageMaker AI simplifies the deployment of sophisticated models like Llama 3.3 70B, offering a range of features designed to optimize both performance and cost efficiency. With the advanced capabilities of SageMaker AI, organizations can deploy and manage LLMs in production environments, taking full advantage of Llama 3.3 70B’s efficiency while benefiting from the streamlined deployment process and optimization tools of SageMaker AI. Default deployment through SageMaker JumpStart uses accelerated deployment, which uses speculative decoding to improve throughput. For more information on how speculative decoding works with SageMaker AI, see Amazon SageMaker launches the updated inference optimization toolkit for generative AI.
Firstly, the Fast Model Loader revolutionizes the model initialization process by implementing an innovative weight streaming mechanism. This feature fundamentally changes how model weights are loaded onto accelerators, dramatically reducing the time required to get the model ready for inference. Instead of the traditional approach of loading the entire model into memory before beginning operations, Fast Model Loader streams weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster startup and scaling times.
One SageMaker inference capability is Container Caching, which transforms how model containers are managed during scaling operations. This feature eliminates one of the major bottlenecks in deployment scaling by pre-caching container images, removing the need for time-consuming downloads when adding new instances. For large models like Llama 3.3 70B, where container images can be substantial in size, this optimization significantly reduces scaling latency and improves overall system responsiveness.
Another key capability is Scale to Zero. It introduces intelligent resource management that automatically adjusts compute capacity based on actual usage patterns. This feature represents a paradigm shift in cost optimization for model deployments, allowing endpoints to scale down completely during periods of inactivity while maintaining the ability to scale up quickly when demand returns. This capability is particularly valuable for organizations running multiple models or dealing with variable workload patterns.
Together, these features create a powerful deployment environment that maximizes the benefits of Llama 3.3 70B’s efficient architecture while providing robust tools for managing operational costs and performance.
Conclusion
The combination of Llama 3.3 70B with the advanced inference features of SageMaker AI provides an optimal solution for production deployments. By using Fast Model Loader, Container Caching, and Scale to Zero capabilities, organizations can achieve both high performance and cost-efficiency in their LLM deployments.
We encourage you to try this implementation and share your experiences.

About the authors
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Adriana Simmons is a Senior Product Marketing Manager at AWS.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Yotam Moss is a Software development Manager for Inference at AWS AI.

AWS re:Invent 2024 Highlights: Top takeaways from Swami Sivasubramanian

We spoke with Dr. Swami Sivasubramanian, Vice President of Data and AI, shortly after AWS re:Invent 2024 to hear his impressions—and to get insights on how the latest AWS innovations help meet the real-world needs of customers as they build and scale transformative generative AI applications.
Q: What made this re:Invent different?
Swami Sivasubramanian: The theme I spoke about in my re:Invent keynote was simple but powerful—convergence. I believe that we’re at an inflection point unlike any other in the evolution of AI. We’re seeing a remarkable convergence of data, analytics, and generative AI. It’s a combination that enables next-level generative AI applications that are far more capable. And it lets our customers move faster in a really significant way, getting more value, more quickly. Companies like Rocket Mortgage are building on an AI-driven platform powered by Amazon Bedrock to create AI agents and automate tasks—working to give their employees access to generative AI with no-code tools. Canva uses AWS to power 1.2 million requests a day and sees 450 new designs created every second. There’s also a human side to convergence, as people across organizations are working together in new ways, requiring a deeper level of collaboration between groups, like science and engineering teams. And this isn’t just a one-time collaboration. It’s an ongoing process.
People’s expectations for applications and customer experiences are changing again with generative AI. Increasingly, I think generative AI inference is going to be a core building block for every application. To realize this future, organizations need more than just a chatbot or a single powerful large language model (LLM). At re:Invent, we made some exciting announcements about the future of generative AI, of course. But we also launched a remarkable portfolio of new products, capabilities, and features that will help our customers manage generative AI at scale—making it easier to control costs, build trust, increase productivity, and deliver ROI.
Q: Are there key innovations that build on the experience and lessons learned at Amazon in adopting generative AI? How are you bringing those capabilities to your customers?
Swami Sivasubramanian: Yes. We announced Amazon Nova, a new generation of foundation models (FMs) with state-of-the-art intelligence across a wide range of tasks and industry-leading price performance. Amazon Nova models expand the growing selection of the broadest and most capable FMs in Amazon Bedrock for enterprise customers. Amazon Nova Micro, Lite, and Pro demonstrate exceptional intelligence and speed—and perform quite competitively against the best models in their respective categories. Amazon Nova Canvas, our state-of-the-art image generation model, creates professional-grade images from text and image inputs, democratizing access to production-grade visual content for advertising, training, social media, and more. Finally, Amazon Nova Reel offers state-of-the-art video generation that allows customers to create high-quality video from text or images. With about 1,000 generative AI applications in motion inside Amazon, groups like Amazon Ads are using Amazon Nova to remove barriers for sellers and advertisers, enabling new levels of creativity and innovation. New capabilities like image and video generation are helping Amazon Ads customers promote more products in their catalogs, and experiment with new strategies like keyword-level creative to increase engagement and drive sales.

But there’s more ahead, and here’s where an important shift is happening. We’re working on an even more capable any-to-any model where you can provide text, images, audio, and video as input and the model can generate outputs in any of these modalities. And we think this multi-modal approach is how models are going to evolve, moving ahead where one model can accept any kind of input and generate any kind of output. Over time, I think this is what state-of-the-art models will look like.
Q: Speaking of announcements like Amazon Nova, you’ve been a key innovator in AI for many years. What continues to inspire you?
Swami Sivasubramanian: It’s fascinating to think about what LLMs are capable of. What inspires me most, though, is how we can help our customers unblock the challenges they are facing and realize that potential. Consider hallucinations. As highly capable as today’s models are, they still have a tendency to get things wrong occasionally. It’s a challenge that many of our customers struggle with when integrating generative AI into their businesses and moving to production. We explored the problem and asked ourselves if we could do more to help. We looked inward, and leveraged Automated Reasoning, an innovation that Amazon has been using as a behind-the-scenes technology in many of our services like identity and access management.
I like to think of this situation as yin and yang. Automated Reasoning is all about certainty and being able to mathematically prove that something is correct. Generative AI is all about creativity and open-ended responses. Though they might seem like opposites, they’re actually complementary—with Automated Reasoning completing and strengthening generative AI. We’ve found that Automated Reasoning works really well when you have a huge surface area of a problem, a corpus of knowledge about that problem area, and when it’s critical that you get the correct answer—which makes Automated Reasoning a good fit for addressing hallucinations.
At re:Invent, we announced Amazon Bedrock Guardrails Automated Reasoning checks—the first and only generative AI safeguard that helps prevent factual errors due to hallucinations. All by using logically accurate and verifiable reasoning that explains why generative AI responses are correct. I think that it’s an innovation that will have significant impact across organizations and industries, helping build trust and accelerate generative AI adoption.
Q: Controlling costs is important to all organizations, large and small, particularly as they take generative AI applications into production. How do the announcements at re:Invent answer this need?
Swami Sivasubramanian: Like our customers, here at Amazon we’re increasing our investment in generative AI development, with multiple projects in process—all requiring timely access to accelerated compute resources. But allocating optimal compute capacity to each project can create a supply/demand challenge. To address this challenge, we created an internal service that helped Amazon drive utilization of compute resources to more than 90% across all our projects. This service enabled us to smooth out demand across projects and achieve higher capacity utilization, speeding development.
As with Automated Reasoning, we realized that our customers would also benefit from these capabilities. So, at re:Invent, I announced the new task governance capability in Amazon SageMaker HyperPod, which helps our customers optimize compute resource utilization and reduce time to market by up to 40%. With this capability, users can dynamically run tasks across the end-to-end FM workflow— accelerating time to market for AI innovations while avoiding cost overruns due to underutilized compute resources.
Our customers also tell me that the trade-off between cost and accuracy for models is real. We’re answering this need by making it super-easy to evaluate models on Amazon Bedrock, so they don’t have to spend months researching and making comparisons. We’re also lowering costs with game-changing capabilities such as Amazon Bedrock Model Distillation, which pairs models for lower costs; Amazon Bedrock Intelligent Prompt Routing, which manages prompts more efficiently, at scale; and prompt caching, which reduces repeated processing without compromising on accuracy.
Q: Higher productivity is one of the core promises of generative AI. How is AWS helping employees at all levels be more productive?
Swami Sivasubramanian: I like to point out that using generative AI becomes irresistible when it makes employees 10 times more productive. In short, not an incremental increase, but a major leap in productivity. And we’re helping employees get there. For example, Amazon Q Developer is transforming code development by taking care of the time-consuming chores that developers don’t want to deal with, like software upgrades. And it also helps them move much faster by automating code reviews and dealing with mainframe modernization. Consider Novacomp, a leading IT company in Latin America, which leveraged Amazon Q Developer to upgrade a project with over 10,000 lines of Java code in just 50 minutes, a task that would have typically taken an estimated 3 weeks. The company also simplified everyday tasks for developers, reducing its technical debt by 60% on average.
On the business side, Amazon Q Business is bridging the gap between unstructured and structured data, recognizing that most businesses need to draw from a mix of data. With Amazon Q in QuickSight, non-technical users can leverage natural language to build, discover, and share meaningful insights in seconds. Now they can access databases and data warehouses, as well as unstructured business data, like emails, reports, charts, graphs, and images.
And looking ahead, we announced advanced agentic capabilities for Amazon Q Business, coming in 2025, which will use agents to automate complex tasks that stretch across multiple teams and applications. Agents give generative AI applications next-level capabilities, and we’re bringing them to our customers via Amazon Q Business, as well as Amazon Bedrock multi-agent collaboration, which improves successful task completion by 40% over popular solutions. This major improvement translates to more accurate and human-like outcomes in use cases like automating customer support, analyzing financial data for risk management, or optimizing supply-chain logistics.
It’s all part of how we’re enabling greater productivity today, with even more on the horizon.
Q: To get employees and customers adopting generative AI and benefiting from that increased productivity, it has to be trusted. What steps is AWS taking to help build that trust?
Swami Sivasubramanian: I think that lack of trust is a big obstacle to moving from proof of concept to production. Business leaders are about to hit go and they hesitate because they don’t want to lose the trust of their customers. As generative AI continues to drive innovation across industries and our daily life, the need for responsible AI has become increasingly acute. And we’re helping meet that need with innovations like the Amazon Bedrock Automated Reasoning checks I mentioned earlier, which work to prevent hallucinations—and increase trust. We also announced new LLM-as-a-judge capabilities with Amazon Bedrock Model Evaluation, so you can now perform tests and evaluate other models with humanlike quality at a fraction of the cost and time of running human evaluations. These evaluations assess multiple quality dimensions, including correctness, helpfulness, and responsible AI criteria such as answer refusal and harmfulness.
I should also mention that AWS recently became the first major cloud provider to announce ISO/IEC 42001 accredited certification for AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. This international management system standard outlines requirements and controls for organizations to promote the responsible development and use of AI systems. Technical standards like ISO/IEC 42001 are significant because they provide a much-needed common framework for responsible AI development and deployment.
Q: Data remains central to building more personalized experiences applicable to your business. How do the re:Invent launches help AWS customers get their data ready for generative AI?
Swami Sivasubramanian: Generative AI isn’t going to be useful for organizations unless it can seamlessly access and deeply understand the organization’s data. With these insights, our customers can create customized experiences, such as highly personalized customer service agents that can help service representatives resolve issues faster. For AWS customers, getting data ready for generative AI isn’t just a technical challenge—it’s a strategic imperative. Proprietary, high-quality data is the key differentiator in transforming generic AI into powerful, business-specific applications. To prepare for this AI-driven future, we’re helping our customers build a robust, cloud-based data foundation, with built-in security and privacy. That’s the backbone of AI readiness.
With the next generation of Amazon SageMaker announced at re:Invent, we’re introducing an integrated experience to access, govern, and act on all your data by bringing together widely adopted AWS data, analytics, and AI capabilities. Collaborate and build faster from a unified studio using familiar AWS tools for model development, generative AI, data processing, and SQL analytics—with Amazon Q Developer assisting you along the way. Access all your data whether it’s stored in data lakes, data warehouses, third-party or federated data sources. And move with confidence and trust, thanks to built-in governance to address enterprise security needs.

At re:Invent, we also launched key Amazon Bedrock capabilities that help our customers maximize the value of their data. Amazon Bedrock Knowledge Bases now offers the only managed, out-of-the-box Retrieval Augmented Generation (RAG) solution, which enables our customers to natively query their structured data where it resides, accelerating development. Support for GraphRAG generates more relevant responses by modeling and storing relationships between data. And Amazon Bedrock Data Automation transforms unstructured, multimodal data into structured data for generative AI—automatically extracting, transforming, and generating usable data from multimodal content, at scale. These capabilities and more help our customers leverage their data to create powerful, insightful generative AI applications.
Q: What did you take away from your customer conversations at re:Invent?
Swami Sivasubramanian: I continue to be amazed and inspired by our customers and the important work they’re doing. We continue to offer our customers the choice and specialization they need to power their unique use cases. With Amazon Bedrock Marketplace, customers now have access to more than 100 popular, emerging, and specialized models.
At re:Invent, I heard a lot about the new efficiency and transformative experiences customers are creating. I also heard about innovations that are changing people’s lives. Like Exact Sciences, a molecular diagnostic company, which developed an AI-powered solution using Amazon Bedrock to accelerate genetic testing and analysis by 50%. Behind that metric there’s a real human value—enabling earlier cancer detection and personalized treatment planning. And that’s just one story among thousands, as our customers reach higher and build faster, achieving impressive results that change industries and improve lives.
I get excited when I think about how we can help educate the next wave of innovators building these experiences. With the launch of the new Education Equity Initiative, Amazon is committing up to $100 million in cloud technology and technical resources to help existing, dedicated learning organizations reach more learners by creating new and innovative digital learning solutions. That’s truly inspiring to me.
In fact, the pace of change, the remarkable innovations we introduced at re:Invent, and the enthusiasm of our customers all reminded me of the early days of AWS, when anything seemed possible. And now, it still is.

About the author
Swami Sivasubramanian is VP, AWS AI & Data. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Multi-tenant RAG with Amazon Bedrock Knowledge Bases

Organizations are continuously seeking ways to use their proprietary knowledge and domain expertise to gain a competitive edge. With the advent of foundation models (FMs) and their remarkable natural language processing capabilities, a new opportunity has emerged to unlock the value of their data assets.
As organizations strive to deliver personalized experiences to customers using generative AI, it becomes paramount to specialize the behavior of FMs using their own—and their customers’—data. Retrieval Augmented Generation (RAG) has emerged as a simple yet effective approach to achieve a desired level of specialization.
Amazon Bedrock Knowledge Bases is a fully managed capability that simplifies the management of the entire RAG workflow, empowering organizations to give FMs and agents contextual information from a company’s private data sources to deliver more relevant and accurate responses tailored to their specific needs.
For organizations developing multi-tenant products, such as independent software vendors (ISVs) creating software as a service (SaaS) offerings, the ability to personalize experiences for each of their customers (tenants in their SaaS application) is particularly significant. This personalization can be achieved by implementing a RAG approach that selectively uses tenant-specific data.
In this post, we discuss and provide examples of how to achieve personalization using Amazon Bedrock Knowledge Bases. We focus particularly on addressing the multi-tenancy challenges that ISVs face, including data isolation, security, tenant management, and cost management. We focus on scenarios where the RAG architecture is integrated into the ISV application and not directly exposed to tenants. Although the specific implementations presented in this post use Amazon OpenSearch Service as a vector database to store tenants’ data, the challenges and architecture solutions proposed can be extended and tailored to other vector store implementations.
Multi-tenancy design considerations
When architecting a multi-tenanted RAG system, organizations need to take several considerations into account:

Tenant isolation – One crucial consideration in designing multi-tenanted systems is the level of isolation between the data and resources related to each tenant. These resources include data sources, ingestion pipelines, vector databases, and the RAG client application. The level of isolation is typically governed by the security, performance, and scalability requirements of your solution, together with your regulatory requirements. For example, you may need to encrypt the data related to each of your tenants using a different encryption key. You may also need to make sure that high activity generated by one tenant doesn’t affect other tenants.
Tenant variability – A similar yet distinct consideration is the level of variability of the features provided to each tenant. In the context of RAG systems, tenants might have varying requirements for data ingestion frequency, document chunking strategy, or vector search configuration.
Tenant management simplicity – Multi-tenant solutions need a mechanism for onboarding and offboarding tenants. This dimension determines the degree of complexity for this process, which might involve provisioning or tearing down tenant-specific infrastructure, such as data sources, ingestion pipelines, vector databases, and RAG client applications. This process could also involve adding or deleting tenant-specific data in its data sources.
Cost-efficiency – The operating costs of a multi-tenant solution depend on the way it provides the isolation mechanism for tenants, so designing a cost-efficient architecture for the solution is crucial.

These four considerations need to be carefully balanced and weighted to suit the needs of the specific solution. In this post, we present a model to simplify the decision-making process. Using the core isolation concepts of silo, pool, and bridge defined in the SaaS Tenant Isolation Strategies whitepaper, we propose three patterns for implementing a multi-tenant RAG solution using Amazon Bedrock Knowledge Bases, Amazon Simple Storage Service (Amazon S3), and OpenSearch Service.
A typical RAG solution using Amazon Bedrock Knowledge Bases is composed of several components, as shown in the following figure:

A data source, such as an S3 bucket
A knowledge base including a data source
A vector database such as an Amazon OpenSearch Serverless collection and index or other supported vector databases
A RAG client application

The main challenge in adapting this architecture for multi-tenancy is determining how to provide isolation between tenants for each of the components. We propose three prescriptive patterns that cater to different use cases and offer varying levels of isolation, variability, management simplicity, and cost-efficiency. The following figure illustrates the trade-offs between these three architectural patterns in terms of achieving tenant isolation, variability, cost-efficiency, and ease of tenant management.

Multi-tenancy patterns
In this section, we describe the implementation of these three different multi-tenancy patterns in a RAG architecture based on Amazon Bedrock Knowledge Bases, discussing their use cases as well as their pros and cons.
Silo
The silo pattern, illustrated in the following figure, offers the highest level of tenant isolation, because the entire stack is deployed and managed independently for each single tenant.

In the context of the RAG architecture implemented by Amazon Bedrock Knowledge Bases, this pattern prescribes the following:

A separate data source per tenant – In this post, we consider the scenario in which tenant documents to be vectorized are stored in Amazon S3, therefore a separate S3 bucket is provisioned per tenant. This allows for per-tenant AWS Key Management Service (AWS KMS) encryption keys, as well as per-tenant S3 lifecycle policies to manage object expiration, and object versioning policies to maintain multiple versions of objects. Having separate buckets per tenant provides isolation and allows for customized configurations based on tenant requirements.
A separate knowledge base per tenant – This allows for a separate chunking strategy per tenant, and it’s particularly useful if you expect the document bases of your tenants to differ in nature. For example, one of your tenants might have a document base composed of flat text documents, which can be treated with fixed-size chunking, whereas another tenant might have a document base with explicit sections, for which semantic chunking would be better suited. Having a different knowledge base per tenant also lets you decide on different embedding models, giving you the possibility to choose different vector dimensions, balancing accuracy, cost, and latency. You can choose a different KMS key per tenant for the transient data stores, which Amazon Bedrock uses for end-to-end per-tenant encryption. You can also choose per-tenant data deletion policies to control whether your vectors are deleted from the vector database when a knowledge base is deleted. Separate knowledge bases also mean that you can have different ingestion schedules per tenant, allowing you to agree on different data freshness standards with your customers.
A separate OpenSearch Serverless collection per tenant – Having a separate OpenSearch Serverless collection per tenant allows you to use a separate KMS encryption key per tenant, maintaining per-tenant end-to-end encryption. For each tenant-specific collection, you can create a separate vector index and choose, per tenant, between the Euclidean and dot product distance metrics, which determines how much importance is given to document length. You can also choose the specific settings for the HNSW algorithm per tenant to control memory consumption, cost, and indexing time. Each vector index, in conjunction with the metadata mappings set up in your knowledge base, can have a different metadata set per tenant, which can be used to perform filtered searches. Metadata filtering can be used in the silo pattern to restrict the search to a subset of documents with a specific characteristic. For example, one of your tenants might upload dated documents and want to filter documents pertaining to a specific year, whereas another tenant might upload documents coming from different company divisions and want to filter over the documentation of a specific division.

Because the silo pattern offers tenant architectural independence, onboarding and offboarding a tenant means creating and destroying the RAG stack for that tenant, composed of the S3 bucket, knowledge base, and OpenSearch Serverless collection. You would typically do this using infrastructure as code (IaC). Depending on your application architecture, you may also need to update the log sinks and monitoring systems for each tenant.
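For illustration, a minimal Boto3 sketch of the per-tenant storage portion of that stack might look like the following; in practice you would express this (along with the knowledge base and OpenSearch Serverless collection) in your IaC tool of choice, and the bucket naming convention here is purely illustrative.

import boto3

def provision_tenant_storage(tenant_id: str, region: str = "us-east-1") -> dict:
    """Creates a dedicated KMS key and an S3 bucket encrypted with it for one tenant."""
    kms = boto3.client("kms", region_name=region)
    s3 = boto3.client("s3", region_name=region)

    key_arn = kms.create_key(Description=f"Key for tenant {tenant_id}")["KeyMetadata"]["Arn"]

    bucket_name = f"rag-tenant-{tenant_id}"  # illustrative naming convention
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )

    # Default all objects in the tenant bucket to SSE-KMS with the tenant's key
    s3.put_bucket_encryption(
        Bucket=bucket_name,
        ServerSideEncryptionConfiguration={"Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": key_arn,
            }
        }]},
    )
    return {"bucket": bucket_name, "kms_key_arn": key_arn}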
Although the silo pattern offers the highest level of tenant isolation, it is also the most expensive to implement, mainly because it requires a separate OpenSearch Serverless collection per tenant, for the following reasons:

Minimum capacity charges – Each OpenSearch Serverless collection encrypted with a separate KMS key has a minimum of 2 OpenSearch Compute Units (OCUs) charged hourly. These OCUs are charged independently from usage, meaning that you will incur charges for dormant tenants if you choose to have a separate KMS encryption key per tenant.
Scalability overhead – Each collection separately scales OCUs depending on usage, in steps of 6 GB of memory, and associated vCPUs and fast access storage. This means that resources might not be fully and optimally utilized across tenants.

When choosing the silo pattern, note that a maximum of 100 knowledge bases are supported in each AWS account. This makes the silo pattern favorable for your largest tenants with specific isolation requirements. Having a separate knowledge base per tenant also reduces the impact of quotas on concurrent ingestion jobs (maximum one concurrent job per KB, five per account), job size (100 GB per job), and data sources (maximum of 5 million documents per data source). It also improves performance fairness as perceived by your tenants. Deleting a knowledge base when offboarding a tenant might be time-consuming, depending on the size of the data sources and the synchronization process. To mitigate this, you can set the data deletion policy in your tenants’ knowledge bases to RETAIN. This way, the knowledge base deletion process will not delete your tenants’ data from the OpenSearch Service index. You can delete the index by deleting the OpenSearch Serverless collection.
Pool
In contrast with the silo pattern, in the pool pattern, illustrated in the following figure, the whole end-to-end RAG architecture is shared by your tenants, making it particularly suitable to accommodate many small tenants.

The pool pattern prescribes the following:

Single data source – The tenants’ data is stored within the same S3 bucket. This implies that the pool model supports a shared KMS key for encryption at rest, not offering the possibility of per-tenant encryption keys. To identify tenant ownership downstream for each document uploaded to Amazon S3, a corresponding JSON metadata file has to be generated and uploaded. The metadata file generation process can be asynchronous, or even batched for multiple files, because Amazon Bedrock Knowledge Bases requires an explicit triggering of the ingestion job. The metadata file must use the same name as its associated source document file, with .metadata.json appended to the end of the file name, and must be stored in the same folder or location as the source file in the S3 bucket. The following code is an example of the format:

{
  "metadataAttributes": {
    "tenantId": "tenant_1"
  }
}

In the preceding JSON structure, the key tenantId has been deliberately chosen, and can be changed to whichever key you want to use to express tenancy. The tenancy field is used at runtime to filter documents belonging to a specific tenant, so the filtering key at runtime must match the metadata key in the JSON used to index the documents. Additionally, you can include other metadata keys to perform further filtering that isn’t based on tenancy. If you don’t upload the .metadata.json file for an object, the client application won’t be able to find that document using metadata filtering.

Single knowledge base – A single knowledge base is created to handle the data ingestion for your tenants. This means that your tenants share the same chunking strategy and embedding model, and the same KMS key for encryption at rest. Moreover, because ingestion jobs are triggered per data source per knowledge base, you will be restricted to offering the same data freshness standards to all your tenants.
Single OpenSearch Serverless collection and index – Your tenant data is pooled in a single OpenSearch Service vector index, so your tenants share the same KMS encryption key for vector data and the same HNSW parameters for indexing and query. Because tenant data isn’t physically segregated, it’s crucial that the query client be able to filter results for a single tenant. This can be achieved efficiently using either the Amazon Bedrock Knowledge Bases Retrieve or RetrieveAndGenerate API, expressing the tenant filtering condition as part of the retrievalConfiguration (for more details, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy). If you want to restrict the vector search to return results only for tenant_1, the following is an example client implementation performing RetrieveAndGenerate based on the AWS SDK for Python (Boto3):

import boto3

bedrock_agent_runtime = boto3.client(
    service_name="bedrock-agent-runtime"
)

tenant_filter = {
    "equals": {
        "key": "tenantId",
        "value": "tenant_1"
    }
}

retrievalConfiguration = {
    "vectorSearchConfiguration": {
        "filter": tenant_filter
    }
}

bedrock_agent_runtime.retrieve_and_generate(
    input={
        "text": "The original user query"
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": <YOUR_KNOWLEDGEBASE_ID>,
            "modelArn": <FM_ARN>,
            "retrievalConfiguration": retrievalConfiguration
        }
    }
)

In the preceding code, text contains the original user query that needs to be answered. <YOUR_KNOWLEDGEBASE_ID> needs to be substituted with the identifier of the knowledge base used to pool your tenants, and <FM_ARN> with the Amazon Resource Name (ARN) of the Amazon Bedrock model you want to use to reply to the user query. The client presented in the preceding code has been streamlined to highlight the tenant filtering functionality. In a production case, we recommend implementing session and error handling, logging and retry logic, and separating the tenant filtering logic from the client invocation to make it inaccessible to developers.
Because the end-to-end architecture is pooled in this pattern, onboarding and offboarding a tenant doesn’t require you to create new physical or logical constructs; it’s as simple as starting or stopping the upload of that tenant’s documents to Amazon S3. This also implies that there is no AWS managed API that can be used to offboard a specific tenant and forget its data end to end. To delete the historical documents belonging to a specific tenant, you can delete the relevant objects in Amazon S3. Typically, customers have an external application that maintains the list of available tenants and their status, facilitating the onboarding and offboarding process.
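
The following is a minimal sketch of one way to identify and delete a tenant’s documents in the shared bucket by inspecting the .metadata.json files described earlier; the bucket name is hypothetical, and after deletion you would start an ingestion job on the shared knowledge base so that the corresponding vectors are removed according to the data deletion policy:

import json
import boto3

s3 = boto3.client("s3")
bucket_name = "pooled-tenant-documents"  # hypothetical bucket
tenant_to_offboard = "tenant_1"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".metadata.json"):
            continue
        # Read the metadata file and check tenant ownership
        body = s3.get_object(Bucket=bucket_name, Key=key)["Body"].read()
        attributes = json.loads(body).get("metadataAttributes", {})
        if attributes.get("tenantId") == tenant_to_offboard:
            # Delete both the metadata file and its source document
            s3.delete_object(Bucket=bucket_name, Key=key)
            s3.delete_object(Bucket=bucket_name, Key=key[: -len(".metadata.json")])
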
Sharing the monitoring system and logging capabilities in this pattern reduces the complexity of operations with a large number of tenants. However, it requires you to collect tenant-specific metrics from the client side to attribute usage to individual tenants.
The pool pattern optimizes the end-to-end cost of your RAG architecture, because sharing OCUs across tenants maximizes the use of each OCU and minimizes the tenants’ idle time. Sharing the same pool of OCUs across tenants means that this pattern doesn’t offer performance isolation at the vector store level, so the largest and most active tenants might impact the experience of other tenants.
When choosing the pool pattern for your RAG architecture, you should be aware that a single ingestion job can ingest or delete a maximum of 100 GB. Additionally, the data source can have a maximum of 5 million documents. If the solution has many tenants that are geographically distributed, consider triggering the ingestion job multiple times a day so you don’t hit the ingestion job size limit. Also, depending on the number and size of your documents to be synchronized, the time for ingestion will be determined by the embedding model invocation rate. For example, consider the following scenario:

Number of tenants to be synchronized = 10
Average number of documents per tenant = 100
Average size per document = 2 MB, containing roughly 200,000 tokens divided into 220 chunks of 1,000 tokens to allow for overlap
Using Amazon Titan Embeddings v2 on demand, allowing for 2,000 RPM and 300,000 TPM

This would result in the following:

Total embeddings requests = 10*100*220 = 220,000
Total tokens to process = 220,000*1,000 = 220,000,000
Total time taken to embed, based on the 2,000 RPM quota, is at least 220,000/2,000 = 110 minutes, or about 1 hour, 50 minutes

This means you could trigger an ingestion job 12 times per day to have a good time distribution of data to be ingested. This calculation is a best-case scenario and doesn’t account for the latency introduced by the FM when creating the vector from the chunk, or for the tokens-per-minute quota. If you expect to synchronize a large number of tenants at the same time, consider using provisioned throughput to decrease the time it takes to create vector embeddings. This approach will also help distribute the load on the embedding models, limiting throttling of the Amazon Bedrock runtime API calls.
Bridge
The bridge pattern, illustrated in the following figure, offers a middle ground between the silo and pool patterns, balancing tenant data isolation and security with resource sharing and cost.

The bridge pattern delivers the following characteristics:

Separate data source per tenant in a common S3 bucket – Tenant data is stored in the same S3 bucket, but prefixed by a tenant identifier. Although having a different prefix per tenant doesn’t offer the possibility of using per-tenant encryption keys, it does create a logical separation that can be used to segregate data downstream in the knowledge bases.
Separate knowledge base per tenant – This pattern prescribes creating a separate knowledge base per tenant similar to the silo pattern. Therefore, the considerations in the silo pattern apply. Applications built using the bridge pattern usually share query clients across tenants, so they need to identify the specific tenant’s knowledge base to query. They can identify the knowledge base by storing the tenant-to-knowledge base mapping in an external database, which manages tenant-specific configurations. The following example shows how to store this tenant-specific information in an Amazon DynamoDB table:

import boto3

# Create a DynamoDB resource
dynamodb = boto3.resource("dynamodb")

table_name = "tenantKbConfig"
attribute_definitions = [
    {"AttributeName": "tenantId", "AttributeType": "S"}
]

key_schema = [
    {"AttributeName": "tenantId", "KeyType": "HASH"}
]

# Create the table holding the tenant knowledge base configurations
tenant_kb_config_table = dynamodb.create_table(
    TableName=table_name,
    AttributeDefinitions=attribute_definitions,
    KeySchema=key_schema,
    BillingMode="PAY_PER_REQUEST"  # Use on-demand billing mode for illustration
)

# Wait until the table is created before writing to it
tenant_kb_config_table.wait_until_exists()

# Create a tenant
tenant_kb_config_table.put_item(
    Item={
        "tenantId": "tenant_1",
        "knowledgebaseId": <YOUR_KNOWLEDGEBASE_ID>,
        "modelArn": <FM_ARN>
    }
)
In a production setting, your application will already store tenant-specific parameters for other functionality in your data stores. Depending on your application architecture, you might choose to store knowledgebaseId and modelArn alongside those other tenant-specific parameters, or create a separate data store (for example, the tenantKbConfig table) specifically for your RAG architecture. The client application can then look up this mapping when invoking the RetrieveAndGenerate API. The following is an example implementation:

import boto3

# Create a DynamoDB resource
dynamodb = boto3.resource("dynamodb")

# Create a Bedrock Agent Runtime client
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Define the table name
table_name = "tenantKbConfig"

# Define a function returning the tenant configuration
def get_tenant_config(tenant_id):
    table = dynamodb.Table(table_name)
    response = table.get_item(
        Key={
            "tenantId": tenant_id
        }
    )
    if "Item" in response:
        return {
            "knowledgebaseId": response["Item"].get("knowledgebaseId"),
            "modelArn": response["Item"].get("modelArn")
        }
    else:
        return None

# Retrieve the tenant configuration from DynamoDB
tenant_config = get_tenant_config("tenant_1")

# Invoke the RetrieveAndGenerate API
bedrock_agent_runtime.retrieve_and_generate(
    input={
        "text": "What type of info do your documents contain?"
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": tenant_config["knowledgebaseId"],
            "modelArn": tenant_config["modelArn"]
        }
    }
)

Separate OpenSearch Service index per tenant – You store data within the same OpenSearch Serverless collection, but you create a vector index per tenant. This implies your tenants share the same KMS encryption key and the same pool of OCUs, optimizing the OpenSearch Service resource usage for indexing and querying. The separation into vector indexes gives you the flexibility of choosing different HNSW parameters per tenant, letting you tailor the performance of your k-NN indexing and querying for your different tenants.
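
As an illustration, the following sketch creates a tenant-specific vector index with its own HNSW parameters using the opensearch-py client. The collection endpoint, index name, field names, dimension, and parameter values are placeholders and must match your knowledge base configuration:

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Hypothetical OpenSearch Serverless collection endpoint and region
host = "abc123xyz.us-east-1.aoss.amazonaws.com"
region = "us-east-1"

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Per-tenant index with tenant-specific HNSW parameters (illustrative values)
tenant_index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": 1024,  # must match the embedding model output size
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {"ef_construction": 512, "m": 16},
                },
            },
            "text": {"type": "text"},
            "metadata": {"type": "text"},
        }
    },
}

client.indices.create(index="tenant-1-index", body=tenant_index_body)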

The bridge pattern supports up to 100 tenants, and onboarding and offboarding a tenant requires the creation and deletion of a knowledge base and an OpenSearch Service vector index. To delete the data pertaining to a particular tenant, you can delete the created resources and use the tenant-specific prefix as a logical parameter in your Amazon S3 API calls. Unlike the silo pattern, the bridge pattern doesn’t allow for per-tenant end-to-end encryption; however, it offers a similar level of tenant customization to the silo pattern while optimizing costs.
Summary of differences
The following figure and table provide a consolidated view for comparing the characteristics of the different multi-tenant RAG architecture patterns. This comprehensive overview highlights the key attributes and trade-offs associated with the pool, bridge, and silo patterns, enabling informed decision-making based on specific requirements.
The following figure illustrates the mapping of design characteristics to components of the RAG architecture.

The following table summarizes the characteristics of the multi-tenant RAG architecture patterns.

| Characteristic | Attribute of | Pool | Bridge | Silo |
| --- | --- | --- | --- | --- |
| Per-tenant chunking strategy | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes |
| Customer managed key for encryption of transient data and at rest | Amazon Bedrock Knowledge Base Data Source | No | No | Yes |
| Per-tenant distance measure | Amazon OpenSearch Service Index | No | Yes | Yes |
| Per-tenant ANN index configuration | Amazon OpenSearch Service Index | No | Yes | Yes |
| Per-tenant data deletion policies | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes |
| Per-tenant vector size | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes |
| Tenant performance isolation | Vector database | No | No | Yes |
| Tenant onboarding and offboarding complexity | Overall solution | Simplest, requires management of new tenants in existing infrastructure | Medium, requires minimal management of end-to-end infrastructure | Hardest, requires management of end-to-end infrastructure |
| Query client implementation | Original Data Source | Medium, requires dynamic filtering | Hardest, requires external tenant mapping table | Simplest, same as single-tenant implementation |
| Amazon S3 tenant management complexity | Amazon S3 buckets and objects | Hardest, need to maintain tenant-specific metadata files for each object | Medium, each tenant needs a different S3 path | Simplest, each tenant requires a different S3 bucket |
| Cost | Vector database | Lowest | Medium | Highest |
| Per-tenant FM used to create vector embeddings | Amazon Bedrock Knowledge Base | No | Yes | Yes |

Conclusion
This post explored three distinct patterns for implementing a multi-tenant RAG architecture using Amazon Bedrock Knowledge Bases and OpenSearch Service. The silo, pool, and bridge patterns offer varying levels of tenant isolation, variability, management simplicity, and cost-efficiency, catering to different use cases and requirements. By understanding the trade-offs and considerations associated with each pattern, organizations can make informed decisions and choose the approach that best aligns with their needs.
Get started with Amazon Bedrock Knowledge Bases today.

About the Authors
Emanuele Levi is a Solutions Architect in the Enterprise Software and SaaS team, based in London. Emanuele helps UK customers on their journey to refactor monolithic applications into modern microservices SaaS architectures. Emanuele is mainly interested in event-driven patterns and designs, especially when applied to analytics and AI, where he has expertise in the fraud-detection industry.
Mehran Nikoo is a Generative AI Go-To-Market Specialist at AWS. He leads the generative AI go-to-market strategy for UK and Ireland.
Dani Mitchell is a Generative AI Specialist Solutions Architect at AWS. He is focused on computer vision use cases and helps AWS customers in EMEA accelerate their machine learning and generative AI journeys with Amazon SageMaker and Amazon Bedrock.

Meta AI Proposes Large Concept Models (LCMs): A Semantic Leap Beyond Token-based Language Modeling

Large Language Models (LLMs) have achieved remarkable advancements in natural language processing (NLP), enabling applications in text generation, summarization, and question-answering. However, their reliance on token-level processing—predicting one word at a time—presents challenges. This approach contrasts with human communication, which often operates at higher levels of abstraction, such as sentences or ideas.

Token-level modeling also struggles with tasks requiring long-context understanding and may produce outputs with inconsistencies. Moreover, extending these models to multilingual and multimodal applications is computationally expensive and data-intensive. To address these issues, researchers at Meta AI have proposed a new approach: Large Concept Models (LCMs).

Large Concept Models

Meta AI’s Large Concept Models (LCMs) represent a shift from traditional LLM architectures. LCMs bring two significant innovations:

High-dimensional Embedding Space Modeling: Instead of operating on discrete tokens, LCMs perform computations in a high-dimensional embedding space. This space represents abstract units of meaning, referred to as concepts, which correspond to sentences or utterances. The embedding space, called SONAR, is designed to be language- and modality-agnostic, supporting over 200 languages and multiple modalities, including text and speech.

Language- and Modality-agnostic Modeling: Unlike models tied to specific languages or modalities, LCMs process and generate content at a purely semantic level. This design allows seamless transitions across languages and modalities, enabling strong zero-shot generalization.

At the core of LCMs are concept encoders and decoders that map input sentences into SONAR’s embedding space and decode embeddings back into natural language or other modalities. These components are frozen, ensuring modularity and ease of extension to new languages or modalities without retraining the entire model.

Technical Details and Benefits of LCMs

LCMs introduce several innovations to advance language modeling:

Hierarchical Architecture: LCMs employ a hierarchical structure, mirroring human reasoning processes. This design improves coherence in long-form content and enables localized edits without disrupting broader context.

Diffusion-based Generation: Diffusion models were identified as the most effective design for LCMs. These models predict the next SONAR embedding based on preceding embeddings. Two architectures were explored:

One-Tower: A single Transformer decoder handles both context encoding and denoising.

Two-Tower: Separates context encoding and denoising, with dedicated components for each task.

Scalability and Efficiency: Concept-level modeling reduces sequence length compared to token-level processing, addressing the quadratic complexity of standard Transformers and enabling more efficient handling of long contexts.

Zero-shot Generalization: LCMs exhibit strong zero-shot generalization, performing well on unseen languages and modalities by leveraging SONAR’s extensive multilingual and multimodal support.

Search and Stopping Criteria: A search algorithm with a stopping criterion based on distance to an “end of document” concept ensures coherent and complete generation without requiring fine-tuning.
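
The paper’s components aren’t reproduced here, but the following illustrative sketch shows the general shape of concept-level generation with a distance-based stopping criterion; sonar_encode, sonar_decode, and lcm_predict_next are hypothetical stand-ins for the frozen SONAR encoder/decoder and the trained LCM:

import numpy as np

EMBED_DIM = 1024  # illustrative embedding size

# Hypothetical stand-ins for the frozen SONAR encoder/decoder and the trained LCM;
# a real system would call the actual models here.
def sonar_encode(sentence):
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.normal(size=EMBED_DIM)

def sonar_decode(embedding):
    return "<decoded sentence>"

def lcm_predict_next(context_embeddings):
    # Placeholder for the diffusion-based next-concept predictor
    return context_embeddings.mean(axis=0)

EOD_EMBEDDING = sonar_encode("<end of document>")  # the "end of document" concept
STOP_THRESHOLD = 0.1  # illustrative value

def generate_document(prompt_sentences, max_concepts=8):
    # Autoregress over concepts (sentence embeddings) rather than tokens
    context = [sonar_encode(s) for s in prompt_sentences]
    generated = []
    for _ in range(max_concepts):
        next_embedding = lcm_predict_next(np.stack(context))
        # Stopping criterion: halt when the prediction is close to the end-of-document concept
        if np.linalg.norm(next_embedding - EOD_EMBEDDING) < STOP_THRESHOLD:
            break
        generated.append(sonar_decode(next_embedding))
        context.append(next_embedding)
    return generated

print(generate_document(["Large Concept Models operate on sentences rather than tokens."]))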

Insights from Experimental Results

Meta AI’s experiments highlight the potential of LCMs. A diffusion-based Two-Tower LCM scaled to 7 billion parameters demonstrated competitive performance in tasks like summarization. Key results include:

Multilingual Summarization: LCMs outperformed baseline models in zero-shot summarization across multiple languages, showcasing their adaptability.

Summary Expansion Task: This novel evaluation task demonstrated the capability of LCMs to generate expanded summaries with coherence and consistency.

Efficiency and Accuracy: LCMs processed shorter sequences more efficiently than token-based models while maintaining accuracy. Metrics such as mutual information and contrastive accuracy showed significant improvement, as detailed in the study’s results.

Conclusion

Meta AI’s Large Concept Models present a promising alternative to traditional token-based language models. By leveraging high-dimensional concept embeddings and modality-agnostic processing, LCMs address key limitations of existing approaches. Their hierarchical architecture enhances coherence and efficiency, while their strong zero-shot generalization expands their applicability to diverse languages and modalities. As research into this architecture continues, LCMs have the potential to redefine the capabilities of language models, offering a more scalable and adaptable approach to AI-driven communication.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

From Theory to Practice: Compute-Optimal Inference Strategies for Language Model

Large language models (LLMs) have demonstrated remarkable performance across multiple domains, driven by scaling laws highlighting the relationship between model size, training computation, and performance. Despite significant advancements in model scaling, a critical gap exists in comprehending how computational resources during inference impact model performance post-training. The complexity arises from balancing performance improvements against the increasing computational costs associated with advanced inference techniques. Moreover, understanding the trade-offs between performance gains and computational expenses is crucial for developing more efficient and effective LLM inference strategies.

Existing research on LLMs has explored various strategies to enhance mathematical reasoning and problem-solving capabilities. They focus on generating step-by-step solutions, which are expanded to include solution verification and ranking methodologies. Inference strategies have ranged from deterministic methods like greedy decoding and beam search to more dynamic sampling algorithms that introduce diversity in generated sequences. More advanced techniques have emerged, including majority voting, weighted majority voting, and search-based algorithms like Monte Carlo Tree Search (MCTS). Process Reward Models (PRMs) have also gained prominence, providing a mechanism to assign rewards to intermediate reasoning steps and guide the multi-step problem-solving process.
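
As a simple illustration of weighted majority voting, the following sketch aggregates sampled solutions whose final answers have already been scored by a reward model; the data and names are illustrative:

from collections import defaultdict

def weighted_majority_vote(candidates):
    """candidates: list of (final_answer, reward_score) pairs from sampled solutions."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score          # weight each vote by its reward score
    return max(totals, key=totals.get)   # answer with the highest total weight

# Example: three sampled solutions agree on "42", one says "41"
print(weighted_majority_vote([("42", 0.9), ("42", 0.4), ("41", 0.8), ("42", 0.2)]))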

Researchers from the Institute for Interdisciplinary Information Sciences at Tsinghua University and the School of Computer Science at Carnegie Mellon University have presented a comprehensive study on inference scaling laws and compute-optimal inference strategies. The research aims to explore the critical trade-offs between model sizes and token generation across various inference methodologies. By investigating cost-performance relationships, the researchers examine inference approaches like greedy search, majority voting, best-of-n, weighted voting, and two distinct tree search algorithms. The study reveals that smaller models can outperform larger models when equipped with advanced inference algorithms, challenging conventional assumptions about model scaling and computational efficiency.

The research methodology is structured around two primary experimental questions investigating compute-optimal inference strategies for mathematical problem-solving. Two mathematical datasets, MATH and GSM8K, are selected. The experimental design uses multiple policy models, including Pythia models, math-specialized Llemma models, and Mistral-7B, to explore performance variations across different model sizes and architectures. A consistent Llemma-34B reward model, fine-tuned on the Math-Shepherd synthetic dataset, is utilized to evaluate solution quality. Each experimental configuration is executed multiple times to ensure robust and reliable results, allowing comprehensive statistical analysis of performance scaling and computational efficiency across different inference strategies and model sizes.

The results show that Llemma-7B achieves competitive accuracy with Llemma-34B while requiring approximately 50% less computational resources. This finding suggests that smaller models when paired with appropriate inference strategies, can deliver more favorable cost-performance trade-offs than the larger models. Moreover, the REBASE inference strategy consistently proves Pareto-optimal across various settings and outperforms sampling-based methods and traditional tree search algorithms like MCTS. Notably, REBASE achieves higher accuracy with substantially lower computational budgets, a novel finding that challenges previous assumptions about computational complexity in inference strategies.

In conclusion, the researchers provide critical insights into compute-optimal inference strategies for LLMs, offering three fundamental conclusions. First, the study demonstrates that smaller models using complex inference techniques can outperform larger models within constrained computational budgets. Second, the research reveals the fundamental limitations of sampling-based majority voting strategies. Third, the novel REBASE tree search method emerges as a groundbreaking inference strategy, proving Pareto-optimal across tested compute budgets and surpassing established methods. Lastly, the research is limited by its focus on mathematical problem-solving, and the authors propose future work exploring inference scaling laws across diverse task domains.

Check out the Paper. All credit for this research goes to the researchers of this project.

This AI Paper Introduces SRDF: A Self-Refining Data Flywheel for High-Quality Vision-and-Language Navigation Datasets

Vision-and-Language Navigation (VLN) combines visual perception with natural language understanding to guide agents through 3D environments. The goal is to enable agents to follow human-like instructions and navigate complex spaces effectively. Such advancements hold potential in robotics, augmented reality, and smart assistant technologies, where linguistic instructions guide interaction with physical spaces.

The core problem in VLN research is the lack of high-quality annotated datasets that pair navigation trajectories with precise natural language instructions. Annotating these datasets manually requires significant resources, expertise, and effort, making the process costly and time-intensive. Moreover, these annotations often fail to provide the linguistic richness and fidelity required for generalizing the models across diverse environments, limiting their effectiveness in real-world applications.

Existing solutions rely on synthetic data generation and environment augmentation. Synthetic data is generated using trajectory-to-instruction models, while simulators diversify the environments. However, these methods often fall short on quality, producing poorly aligned data between language and navigation trajectories. This misalignment results in suboptimal agent performance. The problem is further compounded by metrics that inadequately evaluate the semantic and directional alignment of instructions with their corresponding trajectories, making quality control challenging.

Researchers from Shanghai AI Laboratory, UNC Chapel Hill, Adobe Research, and Nanjing University proposed the Self-Refining Data Flywheel (SRDF), a system designed to iteratively improve both the dataset and the models through mutual collaboration between an instruction generator and a navigator. This fully automated method eliminates the need for human-in-the-loop annotation. Starting with a small, high-quality human-annotated dataset, the SRDF system generates synthetic instructions and uses them to train a base navigator. The navigator then evaluates the fidelity of these instructions, filtering out low-quality data to train a better generator in subsequent iterations. This iterative refinement ensures continuous improvement in both the data quality and the models’ performance.

The SRDF system comprises two key components: an instruction generator and a navigator. The generator creates synthetic navigation instructions from trajectories using advanced multimodal language models. The navigator, in turn, evaluates these instructions by measuring how accurately it can follow the generated paths. High-quality data is identified based on strict fidelity metrics, such as the Success weighted by Path Length (SPL) and normalized Dynamic Time Warping (nDTW). Poor-quality data is either regenerated or excluded, ensuring that only reliable and highly aligned data is used for training. Over three iterations, the system refines the dataset, which ultimately contains 20 million high-fidelity instruction-trajectory pairs spanning 860 diverse environments.
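
The following illustrative sketch captures the filtering step of such a flywheel, assuming a navigator that returns a fidelity score (for example, SPL or nDTW) for each instruction-trajectory pair; the class, scores, and threshold are hypothetical:

class StubNavigator:
    """Hypothetical stand-in for a trained VLN navigator; returns a fidelity score in [0, 1]."""
    def run(self, instruction, reference):
        return 0.8 if "turn" in instruction else 0.3

def srdf_filter(pairs, navigator, fidelity_threshold=0.5):
    # Keep instruction-trajectory pairs the navigator can follow faithfully
    # (fidelity measured with metrics such as SPL or nDTW).
    kept, rejected = [], []
    for instruction, trajectory in pairs:
        score = navigator.run(instruction, reference=trajectory)
        (kept if score >= fidelity_threshold else rejected).append((instruction, trajectory))
    # Kept pairs train the next generator; rejected ones are regenerated or discarded
    return kept, rejected

pairs = [("turn left at the sofa and stop", ["n1", "n2"]), ("go", ["n3"])]
print(srdf_filter(pairs, StubNavigator()))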

The SRDF system demonstrated exceptional performance improvements across various metrics and benchmarks. On the Room-to-Room (R2R) dataset, the SPL metric for the navigator rose from 70% to an unprecedented 78%, surpassing the human benchmark of 76%. This marks the first instance where a VLN agent has outperformed human-level navigation accuracy. The instruction generator also achieved impressive results, with SPICE scores increasing from 23.5 to 26.2, surpassing all prior Vision-and-Language Navigation instruction generation methods. Further, the SRDF-generated data facilitated superior generalization across downstream tasks, including long-term navigation (R4R) and dialogue-based navigation (CVDN), achieving state-of-the-art performance across all tested datasets.

Specifically, the system excelled in long-horizon navigation, achieving a 16.6% improvement in Success Rate on the R4R dataset. On the CVDN dataset, it significantly improved the Goal Progress metric, outperforming all prior models. Furthermore, the scalability of SRDF was evident as the instruction generator consistently improved with larger datasets and diverse environments, ensuring robust performance across varied tasks and benchmarks. The researchers also reported enhanced instruction diversity and richness, with over 10,000 unique words incorporated into the SRDF-generated dataset, addressing the vocabulary limitations of previous datasets.

The SRDF approach addresses the long-standing challenge of data scarcity in VLN by automating dataset refinement. The iterative collaboration between the navigator and the instruction generator ensures continuous enhancement of both components, leading to highly aligned, high-quality datasets. This breakthrough method has set a new standard in VLN research, showcasing the critical role of data quality and alignment in advancing embodied AI. With its ability to surpass human performance and generalize across diverse tasks, SRDF is poised to drive significant progress in developing intelligent navigation systems.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

DL4Proteins Notebook Series Bridging Machine Learning and Protein Engineering: A Practical Guide to Deep Learning Tools for Protein Design

Protein design and prediction are crucial in advancing synthetic biology and therapeutics. Despite significant progress with deep learning models like AlphaFold and ProteinMPNN, there is a gap in accessible educational resources that integrate foundational machine learning concepts with advanced protein engineering methods. This gap hinders the broader understanding and application of these cutting-edge technologies. The challenge is developing practical, hands-on tools that enable researchers, educators, and students to effectively apply deep learning techniques to protein design tasks, bridging theoretical knowledge and real-world applications in computational protein engineering.

DL4Proteins notebook series is a Jupyter notebook series designed by Graylab researchers to make deep learning for protein design and prediction accessible to a broad audience. Inspired by the groundbreaking work of David Baker, Demis Hassabis, and John Jumper—recipients of the 2024 Nobel Prize in Chemistry—this resource provides practical introductions to tools like AlphaFold, RFDiffusion, and ProteinMPNN. Aimed at researchers, educators, and students, DL4Proteins integrates foundational machine learning concepts with advanced protein engineering methods, fostering innovation in synthetic biology and therapeutics. With topics ranging from neural networks to graph models, these open-source notebooks enable hands-on learning and bridge the gap between research and education.

The notebook “Neural Networks with NumPy” introduces the foundational concepts of neural networks and demonstrates their implementation using NumPy. It provides a hands-on approach to understanding how basic neural network components, such as forward and backward propagation, are constructed from scratch. The notebook demystifies the mathematical framework underlying neural networks by focusing on core operations like matrix multiplication and activation functions. This resource is ideal for beginners seeking to build an intuitive understanding of machine learning fundamentals without relying on advanced libraries. Through practical coding exercises, users gain essential insights into the mechanics of deep learning in a simplified yet effective way.
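
As a flavor of what such a from-scratch implementation looks like (this is not the notebook’s code), the following minimal sketch trains a two-layer network on the XOR problem with plain NumPy:

import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = XOR(x1, x2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two-layer network with sigmoid activations
W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))
W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # Backward pass: gradients of the mean squared error
    d_out = (y_hat - y) * y_hat * (1 - y_hat)
    d_h = d_out @ W2.T * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(y_hat, 2))  # predictions should approach [[0], [1], [1], [0]]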

The notebook “Neural Networks with PyTorch” introduces building neural networks using a popular deep learning framework. It simplifies implementing neural networks by leveraging PyTorch’s high-level abstractions, such as tensors, autograd, and modules. The notebook guides users through creating, training, and evaluating models, highlighting how PyTorch automates key tasks like gradient computation and optimization. By transitioning from NumPy to PyTorch, users gain exposure to modern tools for scaling machine learning models. This resource enables a deeper understanding of neural networks through practical examples while showcasing PyTorch’s versatility in streamlining deep learning workflows.

The CNNs notebook introduces the foundational concepts of CNNs, focusing on their application in handling image-like data. It explains how CNNs utilize convolutional layers to extract spatial features from input data. The notebook demonstrates key components such as convolution, pooling, and fully connected layers while covering how to construct and train CNN models using PyTorch. Through step-by-step implementation and visualization, users learn how CNNs process input data hierarchically, enabling efficient feature extraction and representation for diverse deep-learning applications.

The “Language Models for Shakespeare and Proteins” notebook explores the use of LMs in understanding sequences, such as text and proteins. Drawing parallels between predicting words in Shakespearean texts and amino acids in protein sequences highlights the versatility of LMs. Using PyTorch, the notebook provides a hands-on guide to building and training simple language models for sequence prediction tasks. Additionally, it explains concepts like tokenization, embeddings, and the generation of sequential data, demonstrating how these techniques can be applied to both natural language and protein design, bridging the gap between computational linguistics and biological insights.
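
To illustrate the parallel, the following small sketch tokenizes a protein sequence character by character and maps each residue to a learned embedding with PyTorch; the sequence and dimensions are arbitrary examples, not taken from the notebook:

import torch
import torch.nn as nn

# Character-level vocabulary over the 20 standard amino acids
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
stoi = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(sequence):
    # Each residue becomes an integer token, analogous to a word in a sentence
    return torch.tensor([stoi[aa] for aa in sequence], dtype=torch.long)

# Learned embedding table: one vector per residue type
embedding = nn.Embedding(num_embeddings=len(AMINO_ACIDS), embedding_dim=16)

tokens = tokenize("MKTAYIAKQR")   # a short, made-up protein fragment
vectors = embedding(tokens)       # shape: (sequence_length, embedding_dim)
print(vectors.shape)              # torch.Size([10, 16])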

The “Language Model Embeddings: Transfer Learning for Downstream Tasks” notebook delves into applying language model embeddings in solving real-world problems. It demonstrates how embeddings, generated from pre-trained language models, capture meaningful patterns in sequences, whether in text or protein data. These embeddings are repurposed for downstream tasks like classification or regression, showcasing the power of transfer learning. The notebook provides a hands-on approach to extracting embeddings and training models for specific applications, such as protein property prediction. This approach accelerates learning and improves performance in specialized tasks by leveraging pre-trained models, bridging foundational knowledge and practical implementations.

The “Introduction to AlphaFold” notebook provides an accessible overview of AlphaFold, a breakthrough tool for predicting protein structures with high accuracy. It explains the core principles behind AlphaFold, including its reliance on deep learning and the use of multiple sequence alignments (MSAs) to predict protein folding. The notebook offers practical insights into how AlphaFold generates 3D protein structures from amino acid sequences, showcasing its transformative impact on structural biology. Users are guided through real-world applications, enabling them to understand and apply this powerful tool in research, from exploring protein functions to advancing drug discovery and synthetic biology innovations.

The “Graph Neural Networks for Proteins” notebook introduces the use of GNNs in protein research, emphasizing their ability to model the complex relationships between amino acids in protein structures. It explains how GNNs treat proteins as graphs, where nodes represent amino acids, and edges capture interactions or spatial proximity. By leveraging GNNs, researchers can predict properties like protein functions or binding affinities. The notebook provides a practical guide to implementing GNNs for protein-related tasks, offering insights into their architecture and training process. This approach opens new possibilities in protein engineering, drug discovery, and understanding protein dynamics.

The “Denoising Diffusion Probabilistic Models” notebook explores the application of diffusion models in protein structure prediction and design. These models generate data by gradually denoising a noisy input, enabling the prediction of intricate molecular structures. The notebook explains the foundational concepts of diffusion processes and reverse sampling, guiding users through their application to protein modeling tasks. By simulating stepwise denoising, diffusion models can capture complex distributions, making them suitable for generating accurate protein conformations. This method provides a cutting-edge approach to tackling challenges in protein engineering, offering powerful tools for creating and refining protein structures in various scientific applications.

The “Putting It All Together: Designing Proteins” notebook combines advanced tools like RFdiffusion, ProteinMPNN, and AlphaFold to guide users through the complete protein design process. This workflow begins with RFdiffusion to generate backbone structures, followed by ProteinMPNN to design optimal sequences that stabilize the generated structures. Finally, AlphaFold is used to predict and refine the 3D structures of the designed proteins. By integrating these tools, the notebook provides a streamlined approach to protein engineering, enabling users to tackle real-world challenges in synthetic biology and therapeutics through the iterative design, validation, and refinement of protein structures.

The “RFDiffusion: All-Atom” notebook introduces RFdiffusion for generating high-fidelity protein structures, focusing on the full atomistic level of detail. It leverages a denoising diffusion model to iteratively refine and generate accurate atomic representations of protein structures from initial coarse backbones. This process allows for precisely predicting atomic positions and interactions within a protein, which is critical for understanding protein folding and function. The notebook guides users through setting up and running the RFdiffusion model, emphasizing its application in protein design and its potential to advance the field of structural biology and drug discovery.


In conclusion, integrating deep learning tools with protein design and prediction holds immense potential in advancing synthetic biology and therapeutics. The notebooks offer practical, hands-on resources for understanding and applying cutting-edge technologies like AlphaFold, RFDiffusion, ProteinMPNN, and graph-based models. These tools empower researchers, educators, and students to explore protein structure prediction, design, and optimization by bridging foundational machine-learning concepts with real-world applications.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.

CloudFerro and ESA Φ-lab Launch the First Global Embeddings Dataset for Earth Observations

CloudFerro and European Space Agency (ESA) Φ-lab have introduced the first global embeddings dataset for Earth observations, a significant development in geospatial data analysis. This dataset, part of the Major TOM project, aims to provide standardized, open, and accessible AI-ready datasets for Earth observation. This collaboration addresses the challenge of managing and analyzing the massive archives of Copernicus satellite data while promoting scalable AI applications.

The Role of Embedding Datasets in Earth Observation

The ever-increasing volume of Earth observation data presents challenges in processing and analyzing large-scale geospatial imagery efficiently. Embedding datasets tackle this issue by transforming high-dimensional image data into compact vector representations. These embeddings encapsulate key semantic features, facilitating faster searches, comparisons, and analyses.

The Major TOM project focuses on the geospatial domain, ensuring that its embedding datasets are compatible and reproducible for various Earth observation tasks. By leveraging advanced deep learning models, these embeddings streamline the processing and analysis of satellite imagery on a global scale.

Features of the Global Embeddings Dataset

The embedding datasets, derived from Major TOM Core datasets, include over 60 TB of AI-ready Copernicus data. Key features include:

Comprehensive Coverage: With over 169 million data points and more than 3.5 million unique images, the dataset provides a thorough representation of Earth’s surface.

Diverse Models: Generated using four distinct models—SSL4EO-S2, SSL4EO-S1, SigLIP, and DINOv2—the embeddings offer varied feature representations tailored to different use cases.

Efficient Data Format: Stored in GeoParquet format, the embeddings integrate seamlessly with geospatial data workflows, enabling efficient querying and compatibility with processing pipelines.

Embedding Methodology

The creation of the embeddings involves several steps:

Image Fragmentation: Satellite images are divided into smaller patches suitable for model input sizes, preserving geospatial details.

Preprocessing: Fragments are normalized and scaled according to the requirements of the embedding models.

Embedding Generation: Preprocessed fragments are processed through pretrained deep learning models to create embeddings.

Data Integration: The embeddings and metadata are compiled into GeoParquet archives, ensuring streamlined access and usability.

This structured approach ensures high-quality embeddings while reducing computational demands for downstream tasks.
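
As an illustration of how such embeddings could be consumed downstream, the following sketch loads a GeoParquet file and runs a cosine-similarity search; the file path and column names are assumptions and may differ from the published archives:

import numpy as np
import geopandas as gpd

# Hypothetical local copy of one GeoParquet embedding archive
gdf = gpd.read_parquet("major_tom_embeddings_sample.parquet")

# Stack the per-patch embedding vectors into a matrix and normalize them
embeddings = np.stack(gdf["embedding"].to_numpy())
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cosine-similarity search: find the patches most similar to a query patch
query = embeddings[0]
scores = embeddings @ query
top = np.argsort(-scores)[:5]
print(gdf.iloc[top][["geometry"]])  # the geometry column links results back to locations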

Applications and Use Cases

The embedding datasets have diverse applications, including:

Land Use Monitoring: Researchers can track land use changes efficiently by linking embedding spaces to labeled datasets.

Environmental Analysis: The dataset supports analyses of phenomena like deforestation and urban expansion with reduced computational costs.

Data Search and Retrieval: The embeddings enable fast similarity searches, simplifying access to relevant geospatial data.

Time-Series Analysis: Consistent embedding footprints facilitate long-term monitoring of changes across different regions.

Computational Efficiency

The embedding datasets are designed for scalability and efficiency. The computations were performed on CloudFerro’s CREODIAS cloud platform, utilizing high-performance hardware such as NVIDIA L40S GPUs. This setup enabled the processing of trillions of pixels from Copernicus data while maintaining reproducibility.

Standardization and Open Access

A hallmark of the Major TOM embedding datasets is their standardized format, which ensures compatibility across models and datasets. Open access to these datasets fosters transparency and collaboration, encouraging innovation within the global geospatial community.

Advancing AI in Earth Observation

The global embeddings dataset represents a significant step forward in integrating AI with Earth observation. By enabling efficient processing and analysis, it equips researchers, policymakers, and organizations to better understand and manage the Earth’s dynamic systems. This initiative lays the groundwork for new applications and insights in geospatial analysis.

Conclusion

The partnership between CloudFerro and ESA Φ-lab exemplifies progress in the geospatial data industry. By addressing the challenges of Earth observation and unlocking new possibilities for AI applications, the global embeddings dataset enhances our capacity to analyze and manage satellite data. As the Major TOM project evolves, it is poised to drive further advancements in science and technology.

Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.

xAI Releases Grok-2: An Advanced Language Model Now Freely Available on X

xAI, Elon Musk’s artificial intelligence venture, has introduced Grok-2, its most advanced language model to date. This AI tool is freely accessible to all users on the X platform, underscoring a step towards broader accessibility of AI technologies. Designed to deliver nuanced understanding and human-like text generation, Grok-2 offers capabilities that can enhance both personal productivity and business operations.

What Is Grok-2 and Why Does It Matter?

Grok-2 is a sophisticated AI-powered language model built to achieve high accuracy, contextual understanding, and flexibility. Unlike earlier models, it leverages advanced neural architectures to better interpret user inputs and generate responses aligned with real-world needs. Positioned alongside models like OpenAI’s GPT-4 and Google’s Bard, Grok-2 stands out for its emphasis on accessibility and ease of use.

Key Features of Grok-2

Contextual Understanding: Grok-2 effectively handles complex instructions and delivers coherent responses.

Personalization: The model adapts to user preferences, providing tailored outputs.

Content Creation: From formal emails to creative writing, Grok-2 demonstrates versatility.

Multimodal Support: The model can process text, images, and other media inputs.

Open Access: Grok-2 is available to all users at no cost, making advanced AI tools more widely accessible.

How Does Grok-2 Compare to Other Language Models?

Grok-2’s strengths lie in its integration with the X platform and its free availability. By removing subscription barriers, xAI enables a broader audience to benefit from AI technology. The seamless connection with X enhances usability, allowing users to create, refine, and share content effortlessly within their networks.

Businesses can leverage Grok-2 for automated customer support and content generation, while individuals might use it to improve productivity in tasks like resume writing or crafting social media posts. The model’s multimodal capabilities further expand its potential applications, especially in creative fields.

How to Access and Use Grok-2

Accessing Grok-2 is straightforward on the X platform. Here’s a step-by-step guide:

Log into X: Ensure your account is active.

Find Grok-2: Navigate to the dedicated section on the platform.

Start Exploring: Experiment with various prompts to explore the model’s features.

Provide Feedback: Share your experiences to help improve the tool.

The Future of AI with Grok-2

The introduction of Grok-2 is a significant development in AI’s evolution. By making advanced AI technology accessible, xAI enables innovation in areas like education, healthcare, and entertainment. The model’s capabilities can empower users to explore creative and practical applications without financial constraints.

Integration with the X platform positions Grok-2 as a key tool for fostering collaboration and creativity. Businesses, startups, and individuals alike can benefit from its capabilities, which promise to streamline workflows and inspire new ideas.

Conclusion

xAI’s release of Grok-2 represents a meaningful step toward broader AI accessibility. By offering a powerful tool that combines usability with advanced functionality, xAI is helping shape the future of AI integration across industries.

Whether you are a professional aiming to improve efficiency or an individual exploring creative projects, Grok-2 provides a robust platform to meet diverse needs. Experience the possibilities with Grok-2 on X and see how it can transform your interaction with AI.

Check out the Details. All credit for this research goes to the researchers of this project.

Top 10 ChatGPT Use Cases for Businesses

ChatGPT is a versatile tool with immense potential for businesses across diverse industries. Its capability to comprehend and generate human-like text enables its use in numerous applications, making it valuable for companies aiming to optimize operations, boost customer engagement, and foster innovation. Let’s look at the top 10 ChatGPT use cases for businesses, showcasing how it can be effectively leveraged to meet various needs.

Customer Support and Virtual Assistants

One of the most prominent applications of ChatGPT is in customer support. Businesses can deploy ChatGPT-based chatbots as virtual assistants to handle routine inquiries, resolve customer issues, and provide information on products and services. These virtual assistants can operate 24/7, ensuring that customers receive instant support, improving response times and overall satisfaction. Using ChatGPT in this domain helps reduce the burden on human support agents, enabling them to focus on more complex queries.

Content Creation and Copywriting

ChatGPT’s advanced natural language capabilities make it an ideal tool for content creation. Whether generating blog posts, social media content, or marketing copy, businesses can use ChatGPT to create high-quality written material. By providing initial prompts or outlines, companies can receive draft content that aligns with their tone and messaging, significantly reducing the time and effort required to produce engaging content. This use case benefits marketing teams looking to scale content production without sacrificing quality.

Market Research and Trend Analysis

Market research can be time-consuming, but ChatGPT simplifies it by quickly summarizing industry trends, analyzing competitor information, and generating insights from unstructured data. By feeding the model with relevant data, businesses can obtain concise reports on market dynamics, helping them make informed decisions. ChatGPT can also sift through customer reviews, social media comments, and other sources to identify emerging trends and sentiment shifts, offering valuable insights for strategy development.

Training and Onboarding

Organizations often face challenges in training and onboarding new employees. ChatGPT can be an intelligent assistant, providing information on company policies, procedures, and role-specific details. New hires can interact with ChatGPT to get answers to common questions, reducing the reliance on human trainers and allowing for a more consistent and scalable onboarding experience. Moreover, ChatGPT can create customized training content, quizzes, and interactive learning modules to enhance the learning experience.

Personalized Customer Engagement

Personalized engagement is key to retaining customers and fostering loyalty. ChatGPT can analyze customer data to create customized messages, product recommendations, and targeted marketing campaigns. By integrating ChatGPT with CRM systems, businesses can generate individualized responses that cater to specific customer preferences and behaviors, making interactions more relevant and engaging. This level of personalization helps companies to stand out in competitive markets.

Document Automation and Summarization

Many business functions, such as legal and compliance, involve managing large volumes of documents. ChatGPT can assist by automating document generation, proofreading, and summarization. Businesses can use it to create standardized contracts, agreements, and reports based on predefined templates, ensuring consistency and accuracy. Additionally, ChatGPT can extract key points and summarize lengthy documents, saving time and aiding quick decision-making.

Social Media Management

Managing social media effectively requires consistent content updates and prompt responses to user interactions. ChatGPT can help businesses create engaging posts, respond to customer comments, and manage direct messages. Its ability to understand context and tone enables it to craft appropriate replies and maintain brand voice. This automation capability helps businesses keep an active social media presence without investing extensive resources.

Product Development and Innovation

ChatGPT can also brainstorm new product ideas and generate innovative solutions. By inputting customer feedback, market trends, and product performance data, businesses can use ChatGPT to generate creative ideas and suggestions for product improvement. It can assist product development teams in exploring new features, functionalities, and design elements, making it a valuable tool for driving innovation.

Sales Support and Lead Qualification

For sales teams, ChatGPT can act as a virtual assistant that helps qualify leads, answer product-related questions, and assist in scheduling meetings. By integrating with sales tools and CRM platforms, ChatGPT can automate the initial stages of lead qualification, reducing response times and allowing sales representatives to focus on closing deals. It can also generate tailored sales scripts based on customer profiles, enhancing the effectiveness of sales interactions.

Internal Collaboration and Knowledge Sharing

ChatGPT can facilitate internal collaboration by serving as a knowledge management tool. It can answer employee questions about internal processes, policies, and best practices. Also, ChatGPT can create, manage, and share knowledge articles, ensuring employees can access up-to-date information. This capability is especially useful for large organizations with distributed teams, as it promotes information consistency and improves knowledge accessibility.

Conclusion

ChatGPT offers businesses a versatile tool to enhance operations, streamline workflows, and improve engagement across various domains. By integrating ChatGPT into their processes, organizations can unlock new efficiencies and elevate their interactions with customers, partners, and employees. ChatGPT’s potential applications will only expand as AI technology evolves, making it a cornerstone of future business innovations.

How Amazon trains sequential ensemble models at scale with Amazon SageMaker Pipelines

Amazon SageMaker Pipelines includes features that allow you to streamline and automate machine learning (ML) workflows. This allows scientists and model developers to focus on model development and rapid experimentation rather than infrastructure management.
Pipelines offers the ability to orchestrate complex ML workflows with a simple Python SDK and to visualize those workflows through SageMaker Studio. This helps with data preparation and feature engineering tasks as well as model training and deployment automation. Pipelines also integrates with Amazon SageMaker Automatic Model Tuning, which can automatically find the hyperparameter values that result in the best performing model, as determined by your chosen metric.
Ensemble models are becoming popular within the ML community. They generate more accurate predictions by combining the predictions of multiple models. Pipelines can quickly be used to create an end-to-end ML pipeline for ensemble models. This enables developers to build highly accurate models while maintaining efficiency and reproducibility.
In this post, we provide an example of an ensemble model that was trained and deployed using Pipelines.
Use case overview
Sales representatives generate new leads and create opportunities within Salesforce to track them. The following application is an ML approach using unsupervised learning to automatically identify use cases in each opportunity based on various text information, such as name, description, details, and product service group.
Preliminary analysis showed that use cases vary by industry and different use cases have a very different distribution of annualized revenue and can help with segmentation. Hence, a use case is an important predictive feature that can optimize analytics and improve sales recommendation models.
We can treat use case identification as a topic identification problem, and we explored different topic identification models such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and BERTopic. In both LSA and LDA, each document is treated as a collection of words only, and the order of the words or their grammatical role does not matter, which may cause some information loss in determining the topic. Moreover, they require a predetermined number of topics, which was hard to determine in our data set. Because BERTopic overcomes these problems, it was used to identify the use case.
The approach uses three sequential BERTopic models to generate the final clustering in a hierarchical method.
Each BERTopic model consists of four parts:

Embedding – Different embedding methods can be used in BERTopic. In this scenario, input data comes from various areas and is usually inputted manually. As a result, we use sentence embedding to ensure scalability and fast processing.
Dimension reduction – We use Uniform Manifold Approximation and Projection (UMAP), which is an unsupervised and nonlinear dimension reduction method, to reduce high dimension text vectors.
Clustering – We use the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) method to form different use case clusters.
Keyword identification – We use class-based TF-IDF to extract the most representative words from each cluster.

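The following is a minimal sketch of how one such BERTopic layer can be assembled from these four parts. The model names and parameter values (for example, the all-MiniLM-L6-v2 sentence encoder and 20 clusters) are illustrative assumptions rather than the exact configuration used in this solution.

# One BERTopic layer composed of the four parts described above (illustrative settings)
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import Birch
from umap import UMAP

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # sentence embeddings
umap_model = UMAP(n_components=5, metric="cosine")         # nonlinear dimension reduction
cluster_model = Birch(n_clusters=20)                       # BIRCH clustering (15-25 range)
ctfidf_model = ClassTfidfTransformer()                     # class-based TF-IDF for keywords

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=cluster_model,  # BERTopic accepts any clustering model with fit/predict
    ctfidf_model=ctfidf_model,
)

# docs = ["<opportunity text>", ...]
# topics, probabilities = topic_model.fit_transform(docs)
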
Sequential ensemble model
There is no predetermined number of topics, so we set the input number of clusters to 15–25 topics. Upon observation, some of the topics are wide and general. Therefore, another layer of the BERTopic model is applied individually to them. After combining all of the newly identified topics from the second-layer model with the original topics from the first-layer results, postprocessing is performed manually to finalize topic identification. Lastly, a third layer is used for some of the clusters to create sub-topics.
To enable the second- and third-layer models to work effectively, you need a mapping file to map results from previous models to specific words or phrases. This helps make sure that the clustering is accurate and relevant.
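To make the layering concrete, the following is a small, hypothetical sketch of how documents from overly broad first-layer topics could be routed to the second-layer model using such a mapping. The mapping format and the "refine" flag are illustrative assumptions, not the exact logic used in this solution.

# Hypothetical illustration of the hierarchical layering: documents whose
# first-layer topic is flagged as too broad in the mapping are re-clustered
# by the second-layer BERTopic model.
def docs_for_refinement(docs, first_layer_topics, mapping):
    """Return the documents whose first-layer topic is marked for refinement."""
    return [doc for doc, topic in zip(docs, first_layer_topics)
            if mapping.get(str(topic)) == "refine"]

# Toy example; in the pipeline, the mapping file is read from Amazon S3 after
# the callback step updates it.
docs = ["migrate data warehouse to cloud", "chatbot for customer support", "build a data lake"]
first_layer_topics = [0, 1, 0]
mapping = {"0": "refine", "1": "Conversational AI"}

broad_docs = docs_for_refinement(docs, first_layer_topics, mapping)
# broad_docs would then be passed to the second-layer BERTopic model
print(broad_docs)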
We use Bayesian optimization for hyperparameter tuning and cross-validation to reduce overfitting. The data set contains features like opportunity name, opportunity details, needs, associated product name, product details, and product groups. The models are evaluated using a customized loss function, and the best embedding model is selected.
Challenges and considerations
Here are some of the challenges and considerations of this solution:

The pipeline’s data preprocessing capability is crucial for enhancing model performance. With the ability to preprocess incoming data prior to training, we can make sure that our models are fed with high-quality data. Some of the preprocessing and data cleaning steps include converting all text columns to lowercase, removing template elements, contractions, URLs, and emails, removing non-relevant NER labels, and lemmatizing the combined text. The result is more accurate and reliable predictions.
We need a compute environment that is highly scalable so that we can effortlessly handle and train on millions of rows of data. This allows us to perform large-scale data processing and modeling tasks with ease and reduces development time and costs.
Because every step of the ML workflow has varying resource requirements, a flexible and adaptable pipeline is essential for efficient resource allocation. We can reduce the overall processing time, resulting in faster model development and deployment, by optimizing resource usage for each step.
Running custom scripts for data processing and model training requires the availability of required frameworks and dependencies.
Coordinating the training of multiple models can be challenging, especially when each subsequent model depends on the output of the previous one. The process of orchestrating the workflow between these models can be complex and time-consuming.
Following each training layer, it’s necessary to revise a mapping that reflects the topics produced by the model and use it as an input for the subsequent model layer.

Solution overview
In this solution, the entry point is Amazon SageMaker Studio, which is a web-based integrated development environment (IDE) provided by AWS that enables data scientists and ML developers to build, train, and deploy ML models at scale in a collaborative and efficient manner.
The following diagram illustrates the high-level architecture of the solution.

As part of the architecture, we’re using the following SageMaker pipeline steps:

SageMaker Processing – This step allows you to preprocess and transform data before training. One benefit of this step is the ability to use built-in algorithms for common data transformations and automatic scaling of resources. You can also use custom code for complex data preprocessing, and it allows you to use custom container images.
SageMaker Training – This step allows you to train ML models using SageMaker-built-in algorithms or custom code. You can use distributed training to accelerate model training.
SageMaker Callback – This step allows you to run custom code during the ML workflow, such as sending notifications or triggering additional processing steps. You can run external processes and resume the pipeline workflow on completion in this step.
SageMaker Model – This step allows you to create or register a model in Amazon SageMaker

Implementation walkthrough
First, we set up the SageMaker pipeline:
import boto3
import sagemaker

# create a Session with custom region (e.g. us-east-1), will be None if not specified
region = "<your-region-name>"

# allocate default S3 bucket for SageMaker session, will be None if not specified
default_bucket = "<your-s3-bucket>"
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client("sagemaker")

Initialize a SageMaker session:
sagemaker_session = sagemaker.session.Session(boto_session=boto_session, sagemaker_client=sagemaker_client, default_bucket=default_bucket,)

Set the SageMaker execution role for the session:
role = sagemaker.session.get_execution_role(sagemaker_session)

Manage interactions under a pipeline context:
pipeline_session = sagemaker.workflow.pipeline_context.PipelineSession(boto_session=boto_session, sagemaker_client=sagemaker_client, default_bucket=default_bucket,)

Define the base image for scripts to run on:
account_id = role.split(":")[4]
# create a base image that takes care of dependencies
ecr_repository_name = "<your-base-image-to-run-script>"
tag = "latest"
container_image_uri = "{0}.dkr.ecr.{1}.amazonaws.com/{2}:{3}".format(account_id, region, ecr_repository_name, tag)

The following is a detailed explanation of the workflow steps:

Preprocess the data – This involves cleaning and preparing the data for feature engineering and splitting the data into train, test, and validation sets.

import os
BASE_DIR = os.path.dirname(os.path.realpath(__file__))

from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep

from sagemaker.processing import (
ProcessingInput,
ProcessingOutput,
ScriptProcessor,
)

processing_instance_type = ParameterString(
name="ProcessingInstanceType",
# choose an instance type suitable for the job
default_value="ml.m5.4xlarge"
)

script_processor = ScriptProcessor(
image_uri=container_image_uri,
command=["python"],
instance_type=processing_instance_type,
instance_count=1,
role=role,
)

# define the data preprocess job
step_preprocess = ProcessingStep(
name="DataPreprocessing",
processor=script_processor,
inputs=[
ProcessingInput(source=BASE_DIR, destination="/opt/ml/processing/input/code/")
],
outputs=[
ProcessingOutput(output_name="data_train", source="/opt/ml/processing/data_train"), # output data, dictionaries, etc. for later steps
],
code=os.path.join(BASE_DIR, "preprocess.py"),
)

Train layer 1 BERTopic model – A SageMaker training step is used to train the first layer of the BERTopic model using an Amazon Elastic Container Registry (Amazon ECR) image and a custom training script.

base_job_prefix = "OppUseCase"

from sagemaker.workflow.steps import TrainingStep
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

training_instance_type = ParameterString(
name="TrainingInstanceType",
default_value="ml.m5.4xlarge"
)

# create an estimator for the training job
estimator_first_layer = Estimator(
image_uri=container_image_uri,
instance_type=training_instance_type,
instance_count=1,
output_path=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer", # S3 location where the training output is stored
role=role,
entry_point="train_first_layer.py"
)

# create a training job for the estimator based on inputs from the data preprocessing step
# (channel names are examples; match them to what the training script expects)
step_train_first_layer = TrainingStep(
name="TrainFirstLayerModel",
estimator=estimator_first_layer,
inputs={
"data_train": TrainingInput(
s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
),
},
)

Use a callback step – This involves sending a message to an Amazon Simple Queue Service (Amazon SQS) queue, which triggers an AWS Lambda function. The Lambda function updates the mapping file in Amazon Simple Storage Service (Amazon S3) and sends a success token back to the pipeline to resume its run.

from sagemaker.workflow.callback_step import CallbackStep, CallbackOutput, CallbackOutputTypeEnum

first_sqs_queue_to_use = ParameterString(
name="FirstSQSQueue",
default_value="<first_queue_url>", # add queue url
)

first_callback_output = CallbackOutput(output_name="s3_mapping_first_update", output_type=CallbackOutputTypeEnum.String)

step_first_mapping_update = CallbackStep(
name="FirstMappingUpdate",
sqs_queue_url=first_sqs_queue_to_use,

# Input arguments that will be provided in the SQS message
inputs={
"input_location": f"s3://{default_bucket}/{base_job_prefix}/mapping",
"output_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_first_update"
},
outputs=[
first_callback_output,
],
)

step_first_mapping_update.add_depends_on([step_train_first_layer]) # the callback step runs after step_train_first_layer

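For reference, the Lambda function triggered by the SQS message might look like the following minimal sketch. The message field names (token, arguments) and the mapping-update logic are assumptions for illustration; what matters is that the callback token is returned to SageMaker with send_pipeline_execution_step_success (or send_pipeline_execution_step_failure on error) so the pipeline can resume.

# Hypothetical Lambda handler for the callback queue (message field names are assumptions)
import json
import boto3

sm_client = boto3.client("sagemaker")

def lambda_handler(event, context):
    for record in event["Records"]:                      # SQS event batch
        body = json.loads(record["body"])
        token = body["token"]                            # callback token issued by the pipeline
        args = body["arguments"]                         # "input_location" / "output_location"
        try:
            # ... update the mapping file in Amazon S3 here, e.g. read from
            # args["input_location"] and write the revised mapping to args["output_location"] ...
            sm_client.send_pipeline_execution_step_success(
                CallbackToken=token,
                OutputParameters=[
                    {"Name": "s3_mapping_first_update", "Value": args["output_location"]}
                ],
            )
        except Exception as err:
            sm_client.send_pipeline_execution_step_failure(
                CallbackToken=token, FailureReason=str(err)
            )
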
Train layer 2 BERTopic model – Another SageMaker TrainingStep is used to train the second layer of the BERTopic model using an ECR image and a custom training script.

estimator_second_layer = Estimator(
image_uri=container_image_uri,
instance_type=training_instance_type, # same type as the first training layer
instance_count=1,
output_path=f"s3://{default_bucket}/{base_job_prefix}/train_second_layer", # S3 location where the training output is stored
role=role,
entry_point="train_second_layer.py"
)

# create a training job for the estimator based on inputs from the preprocessing step,
# the output of the previous callback step, and the first training layer step
# (channel names are examples; match them to what the training script expects)
step_train_second_layer = TrainingStep(
name="TrainSecondLayerModel",
estimator=estimator_second_layer,
inputs={
"data_train": TrainingInput(
s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
),
"mapping": TrainingInput(
# output of the previous callback step
s3_data=step_first_mapping_update.properties.Outputs["s3_mapping_first_update"],
),
"first_layer": TrainingInput(
s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer"
),
},
)

Use a callback step – Similar to Step 3, this involves sending a message to an SQS queue which triggers a Lambda function. The Lambda function updates the mapping file in Amazon S3 and sends a success token back to the pipeline to resume its run.

second_sqs_queue_to_use = ParameterString(
name="SecondSQSQueue",
default_value="<second_queue_url>", # add queue url
)

second_callback_output = CallbackOutput(output_name="s3_mapping_second_update", output_type=CallbackOutputTypeEnum.String)

step_second_mapping_update = CallbackStep(
name="SecondMappingUpdate",
sqs_queue_url=second_sqs_queue_to_use,

# Input arguments that will be provided in the SQS message
inputs={
"input_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_first_update",
"output_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_second_update"
},
outputs=[
second_callback_output,
],
)

step_second_mapping_update.add_depends_on([step_train_second_layer]) # the callback step runs after step_train_second_layer

Train layer 3 BERTopic model – This involves fetching the mapping file from Amazon S3 and training the third layer of the BERTopic model using an ECR image and a custom training script.

estimator_third_layer = Estimator(
image_uri=container_image_uri,
instance_type=training_instance_type, # same type as the previous two training layers
instance_count=1,
output_path=f"s3://{default_bucket}/{base_job_prefix}/train_third_layer", # S3 location where the training output is stored
role=role,
entry_point="train_third_layer.py"
)

# create a training job for the estimator based on inputs from the preprocessing step,
# the second callback step, and the outputs of the previous two training layers
# (channel names are examples; match them to what the training script expects)
step_train_third_layer = TrainingStep(
name="TrainThirdLayerModel",
estimator=estimator_third_layer,
inputs={
"data_train": TrainingInput(
s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
),
"mapping": TrainingInput(
# output of the previous callback step
s3_data=step_second_mapping_update.properties.Outputs["s3_mapping_second_update"],
),
"first_layer": TrainingInput(
s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer"
),
"second_layer": TrainingInput(
s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_second_layer"
),
},
)

Register the model – A SageMaker model step is used to register the model in the SageMaker model registry. When the model is registered, you can use the model through a SageMaker inference pipeline.

from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep

model = Model(
image_uri=container_image_uri,
model_data=step_train_third_layer.properties.ModelArtifacts.S3ModelArtifacts,
sagemaker_session=sagemaker_session,
role=role,
)

# define the model package group and approval status used for registration
model_package_group_name = "<your-model-package-group-name>"
model_approval_status = "PendingManualApproval"

register_args = model.register(
content_types=["text/csv"],
response_types=["text/csv"],
inference_instances=["ml.c5.9xlarge", "ml.m5.xlarge"],
model_package_group_name=model_package_group_name,
approval_status=model_approval_status,
)
step_register = ModelStep(name="OppUseCaseRegisterModel", step_args=register_args)

To effectively train a BERTopic model together with the BIRCH and UMAP methods, you need a custom training image that provides the additional dependencies and frameworks required to run the algorithms. For a working sample of a custom Docker image, refer to Create a custom Docker container Image for SageMaker.
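With all steps defined, they can be assembled into a pipeline, created (or updated), and started. The following is a minimal sketch that reuses the step and parameter objects defined earlier in this walkthrough; the pipeline name is an example placeholder.

# Assemble the previously defined steps into a pipeline, then create/update and start it
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="OppUseCaseEnsemblePipeline",  # example name
    parameters=[
        processing_instance_type,
        training_instance_type,
        first_sqs_queue_to_use,
        second_sqs_queue_to_use,
    ],
    steps=[
        step_preprocess,
        step_train_first_layer,
        step_first_mapping_update,
        step_train_second_layer,
        step_second_mapping_update,
        step_train_third_layer,
        step_register,
    ],
    sagemaker_session=pipeline_session,
)

pipeline.upsert(role_arn=role)  # create the pipeline definition, or update it if it already exists
execution = pipeline.start()    # start a pipeline execution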
Conclusion
In this post, we explained how you can use the wide range of steps offered by SageMaker Pipelines with custom images to train an ensemble model. For more information on how to get started with Pipelines using an existing ML Operations (MLOps) template, refer to Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines.

About the Authors
Bikramjeet Singh is an Applied Scientist at the AWS Sales Insights, Analytics and Data Science (SIADS) team, responsible for building GenAI platforms and AI/ML infrastructure solutions for ML scientists within SIADS. Prior to working as an Applied Scientist, Bikram worked as a Software Development Engineer within SIADS and Alexa AI.
Rahul Sharma is a Senior Specialist Solutions Architect at AWS, helping AWS customers build ML and Generative AI solutions. Prior to joining AWS, Rahul spent several years in the finance and insurance industries, helping customers build data and analytics platforms.
Sachin Mishra is a seasoned professional with 16 years of industry experience in technology consulting and software leadership roles. Sachin led the Sales Strategy Science and Engineering function at AWS. In this role, he was responsible for scaling cognitive analytics for sales strategy, leveraging advanced AI/ML technologies to drive insights and optimize business outcomes.
Nada Abdalla is a research scientist at AWS. Her work and expertise span multiple science areas in statistics and ML, including text analytics, recommendation systems, Bayesian modeling, and forecasting. She previously worked in academia and obtained her M.Sc. and PhD in Biostatistics from UCLA. Through her work in academia and industry she has published multiple papers in esteemed statistics journals and at applied ML conferences. In her spare time she enjoys running and spending time with her family.

Implementing login node load balancing in SageMaker HyperPod for enhan …

Amazon SageMaker HyperPod is designed to support large-scale machine learning (ML) operations, providing a robust environment for training foundation models (FMs) over extended periods. Multiple users — such as ML researchers, software engineers, data scientists, and cluster administrators — can work concurrently on the same cluster, each managing their own jobs and files without interfering with others.
When using HyperPod, you can use familiar orchestration options such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS). This blog post specifically applies to HyperPod clusters using Slurm as the orchestrator. In these clusters, the concept of login nodes is available, which cluster administrators can add to facilitate user access. These login nodes serve as the entry point through which users interact with the cluster’s computational resources. By using login nodes, users can separate their interactive activities, such as browsing files, submitting jobs, and compiling code, from the cluster’s head node. This separation helps prevent any single user’s activities from affecting the overall performance of the cluster.
However, although HyperPod provides the capability to use login nodes, it doesn’t provide an integrated mechanism for load balancing user activity across these nodes. As a result, users manually select a login node, which can lead to imbalances where some nodes are overutilized while others remain underutilized. This not only affects the efficiency of resource usage but can also lead to uneven performance experiences for different users.
In this post, we explore a solution for implementing load balancing across login nodes in Slurm-based HyperPod clusters. By distributing user activity evenly across all available nodes, this approach provides more consistent performance, better resource utilization, and a smoother experience for all users. We guide you through the setup process, providing practical steps to achieve effective load balancing in your HyperPod clusters.
Solution overview
In HyperPod, login nodes serve as access points for users to interact with the cluster’s computational resources so they can manage their tasks without impacting the head node. Although the default method for accessing these login nodes is through AWS Systems Manager, there are cases where direct Secure Shell (SSH) access is more suitable. SSH provides a more traditional and flexible way of managing interactions, especially for users who require specific networking configurations or need features such as TCP load balancing, which Systems Manager doesn’t support.
Given that HyperPod is typically deployed in a virtual private cloud (VPC) using private subnets, direct SSH access to the login nodes requires secure network connectivity into the private subnet. There are several options to achieve this:

AWS Site-to-Site VPN – Establishes a secure connection between your on-premises network and your VPC, suitable for enterprise environments
AWS Direct Connect – Offers a dedicated network connection for high-throughput and low-latency needs
AWS VPN Client – A software-based solution that remote users can use to securely connect to the VPC, providing flexible and easy access to the login nodes

This post demonstrates how to use the AWS VPN Client to establish a secure connection to the VPC. We set up a Network Load Balancer (NLB) within the private subnet to evenly distribute SSH traffic across the available login nodes and use the VPN connection to connect to the NLB in the VPC. The NLB ensures that user sessions are balanced across the nodes, preventing any single node from becoming a bottleneck and thereby improving overall performance and resource utilization.
For environments where VPN connectivity might not be feasible, an alternative option is to deploy the NLB in a public subnet to allow direct SSH access from the internet. In this configuration, the NLB can be secured by restricting access through a security group that allows SSH traffic only from specified, trusted IP addresses. As a result, authorized users can connect directly to the login nodes while maintaining some level of control over access to the cluster. However, this public-facing method is outside the scope of this post and isn’t recommended for production environments, as exposing SSH access to the internet can introduce additional security risks.
The following diagram provides an overview of the solution architecture.

Prerequisites
Before following the steps in this post, make sure you have the foundational components of a HyperPod cluster setup in place. This includes the core infrastructure for the HyperPod cluster and the network configuration required for secure access. Specifically, you need:

HyperPod cluster – This post assumes you already have a HyperPod cluster deployed. If not, refer to Getting started with SageMaker HyperPod and the HyperPod workshop for guidance on creating and configuring your cluster.
VPC, subnets, and security group – Your HyperPod cluster should reside within a VPC with associated subnets. To deploy a new VPC and subnets, follow the instructions in the Own Account section of the HyperPod workshop. This process includes deploying an AWS CloudFormation stack to create essential resources such as the VPC, subnets, security group, and an Amazon FSx for Lustre volume for shared storage.

Setting up login nodes for cluster access
Login nodes are dedicated access points that users can use to interact with the HyperPod cluster’s computational resources without impacting the head node. By connecting through login nodes, users can browse files, submit jobs, and compile code independently, promoting a more organized and efficient use of the cluster’s resources.
If you haven’t set up login nodes yet, refer to the Login Node section of the HyperPod Workshop, which provides detailed instructions on adding these nodes to your cluster configuration.
Each login node in a HyperPod cluster has an associated network interface within your VPC. A network interface, also known as an elastic network interface, represents a virtual network card that connects each login node to your VPC, allowing it to communicate over the network. These interfaces have assigned IPv4 addresses, which are essential for routing traffic from the NLB to the login nodes.
To proceed with the load balancer setup, you need to obtain the IPv4 addresses of each login node. You can obtain these addresses from the AWS Management Console or by invoking a command on your HyperPod cluster’s head node.
Using the AWS Management Console
To set up login nodes for cluster access using the AWS Management Console, follow these steps:

On the Amazon EC2 console, select Network interfaces in the navigation pane
In the Search bar, select VPC ID = (Equals) and choose the VPC ID of the VPC containing the HyperPod cluster
In the Search bar, select Description : (Contains) and enter the name of the instance group that includes your login nodes (typically, this is login-group)

For each login node, you will find an entry in the list, as shown in the following screenshot. Note down the IPv4 addresses for all login nodes of your cluster.

Using the HyperPod head node
Alternatively, you can also retrieve the IPv4 addresses by entering the following command on your HyperPod cluster’s head node:

sudo cat /opt/ml/config/resource_config.json | jq '.InstanceGroups[] | select(.Name=="login-group").Instances[].CustomerIpAddress'

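If you prefer to script this lookup instead, you can query the network interfaces with boto3. The following is a minimal sketch; the region, VPC ID, and the login-group description filter are placeholders you need to adapt to your cluster.

# Hypothetical boto3 lookup of the login node IP addresses by VPC and ENI description
import boto3

ec2 = boto3.client("ec2", region_name="<your-region-name>")

response = ec2.describe_network_interfaces(
    Filters=[
        {"Name": "vpc-id", "Values": ["<your-hyperpod-vpc-id>"]},
        {"Name": "description", "Values": ["*login-group*"]},  # instance group name of the login nodes
    ]
)

login_node_ips = [eni["PrivateIpAddress"] for eni in response["NetworkInterfaces"]]
print(login_node_ips)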
Create a Network Load Balancer
The next step is to create an NLB to manage traffic across your cluster's login nodes.
For the NLB deployment, you need the IPv4 addresses of the login nodes collected earlier and the appropriate security group configurations. If you deployed your cluster using the HyperPod workshop instructions, a security group that permits communication between all cluster nodes should already be in place.
This security group can be applied to the load balancer, as demonstrated in the following instructions. Alternatively, you can opt to create a dedicated security group that grants access specifically to the login nodes.
Create target group
First, we create the target group that will be used by the NLB.

On the Amazon EC2 console, select Target groups in the navigation pane
Choose Create target group
Create a target group with the following parameters:

For Choose a target type, choose IP addresses
For Target group name, enter smhp-login-node-tg
For Protocol : Port, choose TCP and enter 22
For IP address type, choose IPv4
For VPC, choose SageMaker HyperPod VPC (which was created with the CloudFormation template)
For Health check protocol, choose TCP

Choose Next, as shown in the following screenshot

In the Register targets section, register the login node IP addresses as the targets
For Ports, enter 22 and choose Include as pending below, as shown in the following screenshot

The login node IPs will appear as targets with Pending health status. Choose Create target group, as shown in the following screenshot

Create load balancer
To create the load balancer, follow these steps:

On the Amazon EC2 console, select Load Balancers in the navigation pane
Choose Create load balancer
Choose Network Load Balancer and choose Create, as shown in the following screenshot

Provide a name (for example, smhp-login-node-lb) and choose Internal as Scheme

For network mapping, select the VPC that contains your HyperPod cluster and an associated private subnet, as shown in the following screenshot

Select a security group that allows access on port 22 to the login nodes. If you deployed your cluster using the HyperPod workshop instructions, you can use the security group from this deployment.

Select the Target group that you created before and choose TCP as Protocol and 22 for Port, as shown in the following screenshot

Choose Create load balancer

After the load balancer has been created, you can find its DNS name on the load balancer’s detail page, as shown in the following screenshot. 
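If you prefer to automate these steps instead of using the console, the same target group, load balancer, and listener can be created with boto3. The following is a minimal sketch that mirrors the console walkthrough above; the VPC ID, private subnet ID, security group ID, and login node IP addresses are placeholders you need to replace with your own values.

# Hypothetical boto3 sketch mirroring the console steps: a target group on TCP/22,
# IP targets for the login nodes, an internal NLB, and a TCP listener.
import boto3

elbv2 = boto3.client("elbv2", region_name="<your-region-name>")

# placeholders -- replace with your values
vpc_id = "<your-hyperpod-vpc-id>"
subnet_id = "<your-private-subnet-id>"
security_group_id = "<your-security-group-id>"
login_node_ips = ["<login-node-ip-1>", "<login-node-ip-2>"]

tg = elbv2.create_target_group(
    Name="smhp-login-node-tg",
    Protocol="TCP",
    Port=22,
    VpcId=vpc_id,
    TargetType="ip",
    HealthCheckProtocol="TCP",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

elbv2.register_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": ip, "Port": 22} for ip in login_node_ips],
)

lb = elbv2.create_load_balancer(
    Name="smhp-login-node-lb",
    Type="network",
    Scheme="internal",
    Subnets=[subnet_id],
    SecurityGroups=[security_group_id],
)
lb_arn = lb["LoadBalancers"][0]["LoadBalancerArn"]

elbv2.create_listener(
    LoadBalancerArn=lb_arn,
    Protocol="TCP",
    Port=22,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)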

Making sure host keys are consistent across login nodes
When using multiple login nodes in a load-balanced environment, it’s crucial to maintain consistent SSH host keys across all nodes. SSH host keys are unique identifiers that each server uses to prove its identity to connecting clients. If each login node has a different host key, users will encounter “WARNING: SSH HOST KEY CHANGED” messages whenever they connect to a different node, causing confusion and potentially leading users to question the security of the connection.
To avoid these warnings, configure the same SSH host keys on all login nodes in the load balancing rotation. This setup makes sure that users won’t receive host key mismatch alerts when routed to a different node by the load balancer.
You can enter the following script on the cluster’s head node to copy the SSH host keys from the first login node to the other login nodes in your HyperPod cluster:

#!/bin/bash

SUDOER_USER="ubuntu"

# collect the IP addresses of all login nodes from the HyperPod resource config
login_nodes=($(sudo cat /opt/ml/config/resource_config.json | jq '.InstanceGroups[] | select(.Name=="login-group").Instances[].CustomerIpAddress' | tr '\n' ' ' | tr -d '"'))
source_node="${login_nodes[0]}"
key_paths=("/etc/ssh/ssh_host_rsa_key"
"/etc/ssh/ssh_host_rsa_key.pub"
"/etc/ssh/ssh_host_ecdsa_key"
"/etc/ssh/ssh_host_ecdsa_key.pub"
"/etc/ssh/ssh_host_ed25519_key"
"/etc/ssh/ssh_host_ed25519_key.pub")

tmp_dir="/tmp/ssh_host_keys_$(uuidgen)"

# build a command that copies all host keys into the temporary directory
copy_cmd=""
for key_path in "${key_paths[@]}"; do
copy_cmd="sudo cp $key_path $tmp_dir/;$copy_cmd"
done

# stage the host keys on the source node and make them readable for transfer
ssh $source_node "mkdir -p $tmp_dir;${copy_cmd} sudo chown -R $SUDOER_USER $tmp_dir;"

# copy the host keys from the source node to every other login node
for node in "${login_nodes[@]:1}"; do
echo "Copying SSH host keys from $source_node to $node..."
scp -r $source_node:$tmp_dir $node:$tmp_dir
ssh $node "sudo chown -R root:root $tmp_dir; sudo mv $tmp_dir/ssh_host_* /etc/ssh/;"
done

# remove the temporary directory from all login nodes
for node in "${login_nodes[@]}"; do
echo "Cleaning up tmp dir $tmp_dir on $node..."
ssh $node "sudo rm -r $tmp_dir"
done

Create AWS Client VPN endpoint
Because the NLB has been created with Internal scheme, it’s only accessible from within the HyperPod VPC. To access the VPC and send requests to the NLB, we use AWS Client VPN in this post.
AWS Client VPN is a managed client-based VPN service that enables secure access to your AWS resources and resources in your on-premises network.
We’ll set up an AWS Client VPN endpoint that provides clients with access to the HyperPod VPC and uses mutual authentication. With mutual authentication, Client VPN uses certificates to perform authentication between clients and the Client VPN endpoint.
To deploy a client VPN endpoint with mutual authentication, you can follow the steps outlined in Get started with AWS Client VPN. When configuring the client VPN to access the HyperPod VPC and the login nodes, keep these adaptations to the following steps in mind:

Step 2 (create a Client VPN endpoint) – By default, all client traffic is routed through the Client VPN tunnel. To allow internet access without routing traffic through the VPN, you can enable the option Enable split-tunnel when creating the endpoint. When this option is enabled, only traffic destined for networks matching a route in the Client VPN endpoint route table is routed through the VPN tunnel. For more details, refer to Split-tunnel on Client VPN endpoints.
Step 3 (target network associations) – Select the VPC and private subnet used by your HyperPod cluster, which contains the cluster login nodes.
Step 4 (authorization rules) – Choose the Classless Inter-Domain Routing (CIDR) range associated with the HyperPod VPC. If you followed the HyperPod workshop instructions, the CIDR range is 10.0.0.0/16.
Step 6 (security groups) – Select the security group that you previously used when creating the NLB.

Connecting to the login nodes
After the AWS Client VPN is configured, clients can establish a VPN connection to the HyperPod VPC. With the VPN connection in place, clients can use SSH to connect to the NLB, which will route them to one of the login nodes.
ssh -i /path/to/your/private-key.pem user@<NLB-IP-or-DNS>
To allow SSH access to the login nodes, you must create user accounts on the cluster and add their public keys to the authorized_keys file on each login node (or on all nodes, if necessary). For detailed instructions on managing multi-user access, refer to the Multi-User section of the HyperPod workshop.
In addition to using the AWS Client VPN, you can also access the NLB from other AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, if they meet the following requirements:

VPC connectivity – The EC2 instances must be either in the same VPC as the NLB or able to access the HyperPod VPC through a peering connection or similar network setup.
Security group configuration – The EC2 instance’s security group must allow outbound connections on port 22 to the NLB security group. Likewise, the NLB security group should be configured to accept inbound SSH traffic on port 22 from the EC2 instance’s security group.

Clean up
To remove deployed resources, you can clean them up in the following order:

Delete the Client VPN endpoint
Delete the Network Load Balancer
Delete the target group associated with the load balancer

If you also want to delete the HyperPod cluster, follow these additional steps:

Delete the HyperPod cluster
Delete the CloudFormation stack, which includes the VPC, subnets, security group, and FSx for Lustre volume

Conclusion
In this post, we explored how to implement login node load balancing for SageMaker HyperPod clusters. By using a Network Load Balancer to distribute user traffic across login nodes, you can optimize resource utilization and enhance the overall multi-user experience, providing seamless access to cluster resources for each user.
This approach represents only one way to customize your HyperPod cluster. Because of the flexibility of SageMaker HyperPod, you can adapt configurations to your unique needs while benefiting from a managed, resilient environment. Whether you need to scale foundation model workloads, share compute resources across different tasks, or support long-running training jobs, SageMaker HyperPod offers a versatile solution that can evolve with your requirements.
For more details on making the most of SageMaker HyperPod, dive into the HyperPod workshop and explore further blog posts covering HyperPod.

About the Authors
Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions and building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in domains such as autonomous driving.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.