Meet Hostinger Horizons: A No-Code AI Tool that Lets You Create, Edit, and Publish Custom Web Apps Without Writing a Single Line of Code

​In the evolving landscape of web development, the emergence of no-code platforms has significantly broadened access to application creation. Among these, Hostinger Horizons stands out as an AI-powered tool designed to facilitate the building, editing, and publishing of custom web applications without necessitating any coding expertise. By integrating essential services such as hosting, domain registration, and email functionalities, Hostinger Horizons offers a comprehensive solution for individuals and businesses seeking to establish a digital presence.​

Technical Overview

Hostinger Horizons utilizes advanced artificial intelligence and natural language processing to interpret user inputs and generate functional web applications. The platform features a user-friendly chat interface where users can describe their envisioned application in everyday language. For example, a prompt like “Create a personal finance tracker that allows users to log expenses and view spending reports” enables the AI to construct an application aligned with these specifications. ​

Notable Technical Features:

Real-Time Editing and Live Preview: Users can make modifications to their applications and observe changes instantaneously, promoting an iterative development process. ​

Multilingual Support: The platform accommodates over 80 languages, allowing users worldwide to develop applications in their native tongues. ​

Image and Voice Input: Beyond text prompts, users can upload images or utilize voice commands to guide the AI in building the application, enhancing accessibility and flexibility. ​

Sandbox Environment: Hostinger Horizons provides a sandbox environment where users can test their applications without affecting the live version, ensuring a smooth deployment process. ​

Integrated Deployment: Once the application meets the user’s satisfaction, it can be deployed directly through the platform. Hostinger Horizons manages all backend processes, including hosting and domain setup, streamlining the launch process. ​

Business Considerations

Hostinger Horizons is tailored to a diverse audience, encompassing entrepreneurs, small businesses, and individual creators. By removing the necessity for coding skills, the platform lowers the barrier to web application development, enabling rapid transformation of ideas into functional applications.​

Advantages for Businesses:

Cost-Effective Development: Traditional web application development often involves significant expenses related to hiring developers. Hostinger Horizons offers a more economical alternative, making it particularly advantageous for startups and small enterprises. ​

Rapid Prototyping: The platform facilitates swift development and deployment of applications, allowing businesses to test concepts and iterate based on user feedback without substantial time investments.​

Integrated Services: With built-in hosting, domain registration, and email services, businesses can manage all aspects of their web presence from a single platform, simplifying operations and reducing the need for multiple service providers. ​

Scalability: Hostinger Horizons’ cloud-based infrastructure ensures that applications can scale seamlessly as the business grows, accommodating increasing traffic and user engagement.​

Pricing Structure:

Hostinger Horizons offers several pricing plans to accommodate different needs:​

Starter Plan: Priced at $19.99 per month, it includes 100 messages, hosting (one month free), unlimited bandwidth, up to 50 web apps, and free email services. ​

Hobbyist Plan: At $49.99 per month, this plan offers 250 messages along with the features included in the Starter Plan.​

Hustler Plan: For $99.99 per month, users receive 500 messages and the standard features.​

Pro Plan: The most comprehensive plan at $199.99 per month provides 1,000 messages and all included features.

Hostinger also offers a free trial of 5 messages when you click the “Start for free” button.

Tutorial: Creating a Web Application with Hostinger Horizons

Developing a web application with Hostinger Horizons involves a straightforward process. Here’s a step-by-step guide:

Step 1: Sign Up and Access Hostinger Horizons

Visit the Hostinger Horizons page and select a plan that aligns with your requirements.​

After purchasing, log in to your Hostinger account and navigate to the hPanel dashboard.​

Go to “Websites” → “Website List” and click on “Add Website.” Choose “Hostinger Horizons” from the options to access the platform. ​

Step 2: Define Your Application Idea

In the chat interface, describe the application you wish to create. For example: “Create a web application for a Sudoku game. The web application should be mobile friendly. There should be 3 levels of games. Level 1: Easy mode. Level 2: Medium difficulty. Level 3: Difficult mode.”

The AI will process your input and generate a basic version of the application based on your description.​

Step 3: Customize the Application

Layout and Design: Use the real-time editor to adjust the layout, color scheme, and overall design to match your preferences.​

Functionality: Add or modify features by providing additional prompts. For instance, you can request the inclusion of a budgeting feature or integration with external APIs for real-time data.​

Content: Upload images, input text content, and configure any necessary settings to personalize the application.​

Step 4: Test the Application

Utilize the sandbox environment to test the application’s functionality. Ensure all features operate as intended and make any necessary adjustments based on your testing.​

Step 5: Deploy the Application

Once satisfied, click the “Publish” button to deploy your application.​

Thanks to the Hostinger team for the thought leadership and resources behind this article. The Hostinger team supported us in creating this content.
The post Meet Hostinger Horizons: A No-Code AI Tool that Lets You Create, Edit, and Publish Custom Web Apps Without Writing a Single Line of Code appeared first on MarkTechPost.

Understanding AI Agent Memory: Building Blocks for Intelligent Systems

AI agent memory comprises multiple layers, each serving a distinct role in shaping the agent’s behavior and decision-making. Dividing memory into distinct types makes it easier to understand and design AI systems that are both contextually aware and responsive. Let’s explore the four key types of memory commonly used in AI agents: Episodic, Semantic, Procedural, and Short-Term (or Working) Memory, along with the interplay between long-term and short-term storage.

1. Episodic Memory: Recalling Past Interactions

Episodic memory in AI refers to the storage of past interactions and the specific actions taken by the agent. Like human memory, episodic memory records the events or “episodes” an agent experiences during its operation. This type of memory is crucial because it enables the agent to reference previous conversations, decisions, and outcomes to inform future actions. For example, when a user interacts with a customer support bot, the bot might store the conversation history in an episodic memory log, allowing it to maintain context over multiple exchanges. This contextual awareness is especially important in multi-turn dialogues where understanding previous interactions can dramatically improve the quality of responses.

In practical applications, episodic memory is often implemented using persistent storage systems like vector databases. These systems can store semantic representations of interactions, enabling rapid retrieval based on similarity searches. This means that when an AI agent needs to refer back to an earlier conversation, it can quickly identify and pull relevant segments of past interactions, thereby enhancing the continuity and personalization of the experience.
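To make this concrete, here is a minimal, illustrative Python sketch of an episodic store. It is not tied to any particular vector database: the embed function is a toy, hash-based placeholder for a real embedding model, and retrieval is a simple cosine-similarity search over logged interactions.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: hash characters into a fixed-size, normalized vector.
    # A real system would call an embedding model instead.
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[i % 64] += ord(ch)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class EpisodicMemory:
    """Minimal episodic store: log interactions, recall similar past episodes."""

    def __init__(self):
        self.episodes = []  # list of (text, embedding) pairs

    def log(self, text: str) -> None:
        self.episodes.append((text, embed(text)))

    def recall(self, query: str, k: int = 2):
        q = embed(query)
        scored = [(float(np.dot(q, v)), t) for t, v in self.episodes]
        return [t for _, t in sorted(scored, reverse=True)[:k]]

memory = EpisodicMemory()
memory.log("User asked how to reset their password; sent a reset link.")
memory.log("User reported a billing error on their latest invoice.")
print(memory.recall("password problem"))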

2. Semantic Memory: External Knowledge and Self-awareness

Semantic memory in AI encompasses the agent’s repository of factual, external information and internal knowledge. Unlike episodic memory, which is tied to specific interactions, semantic memory holds generalized knowledge that the agent can use to understand and interpret the world. This may include language rules, domain-specific information, or self-awareness of the agent’s capabilities and limitations.

One common semantic memory use is in Retrieval-Augmented Generation (RAG) applications, where the agent leverages a vast data store to answer questions accurately. For instance, if an AI agent is tasked with providing technical support for a software product, its semantic memory might contain user manuals, troubleshooting guides, and FAQs. Semantic memory also includes grounding context that helps the agent filter and prioritize relevant data from a broader corpus of information available on the internet.

Integrating semantic memory ensures that an AI agent responds based on immediate context and draws on a broad spectrum of external knowledge. This creates a more robust, informed system that can handle diverse queries with accuracy and nuance.

3. Procedural Memory: The Blueprint of Operations

Procedural memory is the backbone of an AI system’s operational aspects. It includes systemic information such as the structure of the system prompt, the tools available to the agent, and the guardrails that ensure safe and appropriate interactions. In essence, procedural memory defines “how” the agent functions rather than “what” it knows.

This type of memory is typically managed through well-organized registries, such as Git repositories for code, prompt registries for conversational contexts, and tool registries that enumerate the available functions and APIs. An AI agent can execute tasks more reliably and predictably by having a clear blueprint of its operational procedures. The explicit definition of protocols and guidelines also ensures that the agent behaves in a controlled manner, thereby minimizing risks such as unintended outputs or safety violations.
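As a rough illustration, the sketch below models procedural memory as two small registries, a prompt registry and a tool registry, assembled into an operational blueprint. The registry names, prompt text, and tool are hypothetical placeholders rather than a prescribed format.

from typing import Callable, Dict

# Illustrative procedural memory: a prompt registry and a tool registry.
# All names and guardrail rules here are hypothetical placeholders.
PROMPT_REGISTRY: Dict[str, str] = {
    "support_agent.v1": (
        "You are a support assistant. Answer only questions about the product. "
        "Refuse requests for personal data."
    ),
}

def search_docs(query: str) -> str:
    # Placeholder tool implementation.
    return f"Top documentation hit for: {query}"

TOOL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "search_docs": search_docs,
}

def build_system_prompt(prompt_id: str) -> str:
    """Assemble the operational blueprint: system prompt plus available tools."""
    tools = ", ".join(TOOL_REGISTRY)
    return f"{PROMPT_REGISTRY[prompt_id]}\nAvailable tools: {tools}"

print(build_system_prompt("support_agent.v1"))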

Procedural memory supports consistency in performance and facilitates easier updates and maintenance. As new tools become available or system requirements evolve, the procedural memory can be updated in a centralized manner, ensuring that the agent adapts seamlessly to changes without compromising its core functionality.

4. Short-Term (Working) Memory: Integrating Information for Action

In many AI systems, the information drawn from long-term memory is consolidated into short-term or working memory. This is the temporary context that the agent actively uses to process current tasks. Short-term memory is a compilation of the episodic, semantic, and procedural memories that have been retrieved and localized for immediate use.

When an agent is presented with a new task or query, it assembles relevant information from its long-term stores. This might include a snippet of a previous conversation (episodic memory), pertinent factual data (semantic memory), and operational guidelines (procedural memory). The combined information forms the prompt fed into the underlying language model, allowing the AI to generate coherent, context-aware responses.
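A minimal sketch of this assembly step, assuming the retrieved snippets are already available as plain strings, might look like the following; the section layout and example content are illustrative only.

def assemble_working_memory(query: str,
                            episodic_snippets: list,
                            semantic_facts: list,
                            system_prompt: str) -> str:
    """Compile retrieved long-term memories into a single prompt (working memory)."""
    sections = [
        system_prompt,                                                     # procedural memory
        "Relevant past interactions:\n- " + "\n- ".join(episodic_snippets),  # episodic memory
        "Relevant knowledge:\n- " + "\n- ".join(semantic_facts),             # semantic memory
        f"Current user message: {query}",
    ]
    return "\n\n".join(sections)

prompt = assemble_working_memory(
    query="Why was I charged twice?",
    episodic_snippets=["User reported a billing error last week."],
    semantic_facts=["Duplicate charges are refunded after review."],
    system_prompt="You are a support assistant. Be concise and accurate.",
)
print(prompt)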

This process of compiling short-term memory is critical for tasks that require nuanced decision-making and planning. It allows the AI agent to “remember” the conversation history and tailor responses accordingly. The agility provided by short-term memory is a significant factor in creating interactions that feel natural and human-like. Also, the separation between long-term and short-term memory ensures that while the system has a vast knowledge repository, only the most pertinent information is actively engaged during interaction, optimizing performance and accuracy.

The Synergy of Long-Term and Short-Term Memory

To fully appreciate the architecture of AI agent memory, it is important to understand the dynamic interplay between long-term memory and short-term (working) memory. Long-term memory, consisting of episodic, semantic, and procedural types, is the deep storage that informs the AI about its history, external facts, and internal operational frameworks. On the other hand, short-term memory is a fluid, working subset that the agent uses to navigate current tasks. The agent can adapt to new contexts without losing the richness of stored experiences and knowledge by periodically retrieving and synthesizing data from long-term memory. This dynamic balance ensures that AI systems are well-informed, responsive, and contextually aware.

In conclusion, the multifaceted approach to memory in AI agents underscores the complexity and sophistication required to build systems that can interact intelligently with the world. Episodic memory allows for the personalization of interactions, semantic memory enriches responses with factual depth, and procedural memory guarantees operational reliability. Meanwhile, integrating these long-term memories into short-term working memory enables the AI to act swiftly and contextually in real-time scenarios. As AI advances, refining these memory systems will be pivotal in creating smart agents capable of nuanced, context-aware decision-making. The layered memory approach is a cornerstone of intelligent agent design, ensuring these systems remain robust, adaptive, and ready to tackle the challenges of an ever-evolving digital landscape.

Sources:

https://www.deeplearning.ai/short-courses/long-term-agentic-memory-with-langgraph/ 

https://arxiv.org/html/2502.12110v1 

https://arxiv.org/pdf/2309.02427

The post Understanding AI Agent Memory: Building Blocks for Intelligent Systems appeared first on MarkTechPost.

PilotANN: A Hybrid CPU-GPU System For Graph-based ANNS

Approximate Nearest Neighbor Search (ANNS) is a fundamental vector search technique that efficiently identifies similar items in high-dimensional vector spaces. Traditionally, ANNS has served as the backbone for retrieval engines and recommendation systems; however, it struggles to keep pace with modern Transformer architectures that employ higher-dimensional embeddings and larger datasets. Unlike deep learning systems that can be horizontally scaled due to their stateless nature, ANNS remains centralized, creating a severe single-machine throughput bottleneck. Empirical testing with 100-million-scale datasets reveals that even state-of-the-art CPU implementations of the Hierarchical Navigable Small World (HNSW) algorithm can’t maintain adequate performance as vector dimensions increase.

Previous research on large-scale ANNS has explored two optimization paths: index structure improvements and hardware acceleration. The Inverted MultiIndex (IMI) enhanced space partitioning through multi-codebook quantization, while PQFastScan improved performance with SIMD and cache-aware optimizations. DiskANN and SPANN introduced disk-based indexing for billion-scale datasets, addressing memory hierarchy challenges through different approaches. SONG and CAGRA achieved impressive speedups through GPU parallelization but remain constrained by GPU memory capacity. BANG handled billion-scale datasets via hybrid CPU-GPU processing but lacked critical CPU baseline comparisons. These methods frequently sacrifice compatibility or accuracy, or require specialized hardware.

Researchers from the Chinese University of Hong Kong, Centre for Perceptual and Interactive Intelligence, and Theory Lab of Huawei Technologies have proposed PilotANN, a hybrid CPU-GPU system designed to overcome the limitations of existing ANNS implementations. PilotANN addresses a core tension: CPU-only implementations struggle with computational demands, while GPU-only solutions are constrained by limited memory capacity. It solves this by utilizing both the abundant RAM of CPUs and the parallel processing capabilities of GPUs. Moreover, it employs a three-stage graph traversal process: GPU-accelerated subgraph traversal using dimensionally reduced vectors, CPU refinement, and precise search with complete vectors.

PilotANN fundamentally reimagines the vector search process through a “staged data ready processing” paradigm. It minimizes data movement across processing stages rather than adhering to traditional “move data for computation” models. It also consists of three stages: GPU piloting with subgraph and dimensionally-reduced vectors, residual refinement using subgraph with full vectors, and final traversal employing full graph and complete vectors. The design shows cost-effectiveness with only a single commodity GPU while scaling effectively across vector dimensions and graph complexity. Data transfer overhead is minimized to just the initial query vector movement to GPU and a small candidate set returning to CPU after GPU piloting.
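The staged idea can be illustrated, in a very simplified form, with a two-stage NumPy sketch: candidates are first selected using dimensionally reduced vectors, then re-ranked with the full-precision vectors. This omits the graph traversal, the GPU execution, and PilotANN’s third stage, and the random projection is only a stand-in for whatever reduction the real system applies.

import numpy as np

rng = np.random.default_rng(0)
full = rng.standard_normal((10_000, 96)).astype(np.float32)  # full-precision vectors
query = rng.standard_normal(96).astype(np.float32)

# Stage 1 ("piloting"): search with dimensionally reduced vectors to get candidates.
proj = rng.standard_normal((96, 16)).astype(np.float32) / np.sqrt(16)
reduced = full @ proj
q_reduced = query @ proj
candidate_ids = np.argsort(np.linalg.norm(reduced - q_reduced, axis=1))[:200]

# Stage 2 ("refinement"): re-rank only the candidates with full-precision vectors.
dists = np.linalg.norm(full[candidate_ids] - query, axis=1)
top10 = candidate_ids[np.argsort(dists)[:10]]
print(top10)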

Experimental results show PilotANN’s performance advantages across diverse large-scale datasets. PilotANN achieves a 3.9 times throughput speedup on the 96-dimensional DEEP dataset compared to the HNSW-CPU baseline, with even more impressive gains of 5.1-5.4 times on higher-dimensional datasets. PilotANN delivers significant speedups even on the notoriously challenging T2I dataset despite no specific optimizations for this benchmark. Moreover, it shows remarkable cost-effectiveness despite utilizing more expensive hardware. While the GPU-based platform costs 2.81 USD/hour compared to the CPU-only solution at 1.69 USD/hour, PilotANN achieves 2.3 times cost-effectiveness for DEEP and 3.0-3.2 times for T2I, WIKI, and LAION datasets when measuring throughput per dollar.

In conclusion, researchers introduced PilotANN, an advancement in graph-based ANNS that effectively utilizes CPU and GPU resources for emerging workloads. It shows great performance over existing CPU-only approaches through the intelligent decomposition of top-k search into a multi-stage CPU-GPU pipeline and implementation of efficient entry selection. It democratizes high-performance nearest neighbor search by achieving competitive results with a single commodity GPU, making advanced search capabilities accessible to researchers and organizations with limited computing resources. Unlike alternative solutions requiring expensive high-end GPUs, PilotANN enables efficient ANNS deployment on common hardware configurations while maintaining search accuracy.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post PilotANN: A Hybrid CPU-GPU System For Graph-based ANNS appeared first on MarkTechPost.

Tencent AI Researchers Introduce Hunyuan-T1: A Mamba-Powered Ultra-Large Language Model Redefining Deep Reasoning, Contextual Efficiency, and Human-Centric Reinforcement Learning

Large language models struggle to process and reason over lengthy, complex texts without losing essential context. Traditional models often suffer from context loss, inefficient handling of long-range dependencies, and difficulties aligning with human preferences, affecting the accuracy and efficiency of their responses. Tencent’s Hunyuan-T1 directly tackles these challenges by integrating a novel Mamba-powered architecture with advanced reinforcement learning and curriculum strategies, ensuring robust context capture and enhanced reasoning capabilities.

Hunyuan-T1 is the first model powered by the innovative Mamba architecture, a design that fuses Hybrid Transformer and Mixture-of-Experts (MoE) technologies. Built on the TurboS fast-thinking base, Hunyuan-T1 is specifically engineered to optimize the processing of long textual sequences while minimizing computational overhead. This allows the model to effectively capture extended context and manage long-distance dependencies, crucial for tasks that demand deep, coherent reasoning.

A key highlight of Hunyuan-T1 is its heavy reliance on RL during the post-training phase. Tencent dedicated 96.7% of the post-training compute to this approach, enabling the model to refine its reasoning abilities iteratively. Techniques such as data replay, periodic policy resetting, and self-rewarding feedback loops help improve output quality, ensuring the model’s responses are detailed, efficient, and closely aligned with human expectations.

To further boost reasoning proficiency, Tencent employed a curriculum learning strategy. This approach gradually increases the difficulty of training data while simultaneously expanding the model’s context length. As a result, Hunyuan-T1 is trained to use tokens more efficiently, seamlessly adapting from solving basic mathematical problems to tackling complex scientific and logical challenges. Efficiency is another cornerstone of Hunyuan-T1’s design. The TurboS base’s ability to capture long-text information prevents context loss, a common issue in many language models, and doubles the decoding speed compared to similar systems. This breakthrough means that users benefit from faster, higher-quality responses without compromising performance.

The model has achieved impressive scores on multiple benchmarks: 87.2 on MMLU-PRO, which tests various subjects including humanities, social sciences, and STEM fields; 69.3 on GPQA-diamond, a challenging evaluation featuring doctoral-level scientific problems; 64.9 on LiveCodeBench for coding tasks; and a remarkable 96.2 on the MATH-500 benchmark for mathematical reasoning. These results underscore Hunyuan-T1’s versatility and ability to handle high-stakes, professional-grade tasks across various fields. Beyond quantitative metrics, Hunyuan-T1 is designed to deliver outputs with human-like understanding and creativity. During its RL phase, the model underwent a comprehensive alignment process that combined self-rewarding feedback with external reward models. This dual approach ensures its responses are accurate and exhibit rich details and natural flow.

In conclusion, Tencent’s Hunyuan-T1 combines an ultra-large-scale, Mamba-powered architecture with state-of-the-art reinforcement learning and curriculum strategies. Hunyuan-T1 delivers high performance, enhanced reasoning, and exceptional efficiency.

Check out the Details, Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.

The post Tencent AI Researchers Introduce Hunyuan-T1: A Mamba-Powered Ultra-Large Language Model Redefining Deep Reasoning, Contextual Efficiency, and Human-Centric Reinforcement Learning appeared first on MarkTechPost.

Advancing Medical Reasoning with Reinforcement Learning from Verifiable Rewards (RLVR): Insights from MED-RLVR

Reinforcement Learning from Verifiable Rewards (RLVR) has recently emerged as a promising method for enhancing reasoning abilities in language models without direct supervision. This approach has shown notable success in mathematics and coding, where reasoning naturally aligns with structured problem-solving. While studies have demonstrated that RLVR alone can lead to self-evolved reasoning, research has largely been limited to these technical fields. Efforts to extend RLVR have explored synthetic datasets, such as those involving sequential tasks and object counting, indicating potential but also highlighting the challenges of adapting this method to different domains.

Expanding RLVR to broader areas remains an open challenge, particularly in tasks like multiple-choice question answering (MCQA), which provides structured, verifiable labels across diverse subjects, including medicine. However, unlike math and coding, which involve complex reasoning with an open-ended answer space, MCQA tasks typically have predefined answer choices, making it uncertain whether RLVR’s benefits translate effectively. This limitation is especially relevant in medical reasoning tasks, where models must navigate intricate clinical knowledge to produce accurate responses, an area that has proven difficult for existing AI systems.

Researchers from Microsoft Research investigate whether medical reasoning can emerge through RLVR. They introduce MED-RLVR, leveraging medical MCQA data to assess RLVR’s effectiveness in the medical domain. Their findings show that RLVR extends beyond math and coding, achieving performance comparable to supervised fine-tuning (SFT) in in-distribution tasks while significantly improving out-of-distribution generalization by eight percentage points. Analyzing training dynamics, they observe that reasoning capabilities emerge in a 3B-parameter base model without explicit supervision, highlighting RLVR’s potential for advancing reasoning in knowledge-intensive fields like medicine.

RL optimizes decision-making by training an agent to maximize rewards through interactions with an environment. It has been effectively applied to language models to align outputs with human preferences and, more recently, to elicit reasoning without explicit supervision. This study employs Proximal Policy Optimization (PPO) to train a policy model, incorporating a clipped objective function to stabilize training. Using a rule-based reward function, MED-RLVR assigns rewards based on output correctness and format validity. Without additional supervision, the model demonstrates emergent medical reasoning, similar to mathematical reasoning in prior RLVR studies, highlighting RLVR’s potential beyond structured domains.
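The paper’s exact reward specification is not reproduced here, but a rule-based reward of this kind can be sketched as follows; the answer-tag format and the specific reward values are assumptions for illustration.

import re

ANSWER_PATTERN = re.compile(r"<answer>\s*([A-J])\s*</answer>", re.IGNORECASE)

def mcqa_reward(model_output: str, gold_choice: str) -> float:
    """Rule-based reward: checks format validity, then answer correctness.

    The tag convention and reward values are illustrative assumptions,
    not the paper's exact specification.
    """
    match = ANSWER_PATTERN.search(model_output)
    if match is None:
        return -1.0   # invalid format
    if match.group(1).upper() == gold_choice.upper():
        return 1.0    # correct answer in valid format
    return 0.0        # valid format, wrong answer

print(mcqa_reward("Reasoning ... <answer>C</answer>", "C"))  # 1.0
print(mcqa_reward("The answer is C", "C"))                   # -1.0 (format failure)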

The MedQA-USMLE dataset, which includes multi-choice medical exam questions, is used to train MED-RLVR. Unlike the standard four-option version, this dataset presents a greater challenge by offering more answer choices. Training is based on the Qwen2.5-3B model using OpenRLHF for reinforcement learning. Compared to SFT, MED-RLVR demonstrates superior generalization, particularly on the MMLU-Pro-Health dataset. Analysis reveals six stages of reasoning evolution, including format failures, verbose outputs, reward hacking, and reintegrated reasoning. Unlike math or coding tasks, no self-validation behaviors (“aha-moments”) were observed, suggesting potential improvements through penalizing short reasoning chains or fine-tuning with longer CoTs.

In conclusion, the study focuses on MCQA in medicine, providing a controlled setting for evaluation. However, MCQA does not fully capture the complexity of real-world tasks like open-text answering, report generation, or medical dialogues. Additionally, the unimodal approach limits the model’s ability to integrate multimodal data, which is crucial for diagnostic applications. Future work should address these limitations. MED-RLVR, based on reinforcement learning with verifiable rewards, matches SFT on in-distribution tasks and improves out-of-distribution generalization. While medical reasoning emerges without explicit supervision, challenges like reward hacking persist, highlighting the need for further exploration of complex reasoning and multimodal integration.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Advancing Medical Reasoning with Reinforcement Learning from Verifiable Rewards (RLVR): Insights from MED-RLVR appeared first on MarkTechPost.

NVIDIA AI Researchers Introduce FFN Fusion: A Novel Optimization Technique that Demonstrates How Sequential Computation in Large Language Models (LLMs) can be Effectively Parallelized

Large language models (LLMs) have become vital across domains, enabling high-performance applications such as natural language generation, scientific research, and conversational agents. Underneath these advancements lies the transformer architecture, where alternating layers of attention mechanisms and feed-forward networks (FFNs) sequentially process tokenized input. However, with an increase in size and complexity, the computational burden required for inference grows substantially, creating an efficiency bottleneck. Efficient inference is now a critical concern, with many research groups focusing on strategies that can reduce latency, increase throughput, and cut computational costs while maintaining or improving model performance.

At the center of this efficiency problem lies the inherently sequential structure of transformers. Each layer’s output feeds into the next, demanding strict order and synchronization, which is especially problematic at scale. As model sizes expand, the cost of sequential computation and communication across GPUs grows, leading to reduced efficiency and increased deployment cost. This challenge is amplified in scenarios requiring fast, multi-token generation, such as real-time AI assistants. Reducing this sequential load while maintaining model capabilities presents a key technical hurdle. Unlocking new parallelization strategies that preserve accuracy yet significantly reduce computation depth is essential to broadening the accessibility and scalability of LLMs.

Several techniques have emerged to improve efficiency. Quantization reduces the precision of numerical representations to minimize memory and computation needs, though it often risks accuracy losses, especially at low bit-widths. Pruning eliminates redundant parameters and simplifies models but potentially harms accuracy without care. Mixture-of-Experts (MoE) models activate only a subset of parameters per input, making them highly efficient for specific workloads. Still, they can underperform at intermediate batch sizes due to low hardware utilization. While valuable, these strategies have trade-offs that limit their universal applicability. Consequently, the field seeks methods that offer broad efficiency improvements with fewer compromises, especially for dense architectures that are simpler to train, deploy, and maintain.

Researchers at NVIDIA introduced a new architectural optimization technique named FFN Fusion, which addresses the sequential bottleneck in transformers by identifying FFN sequences that can be executed in parallel. This approach emerged from the observation that when attention layers are removed using a Puzzle tool, models often retain long sequences of consecutive FFNs. These sequences show minimal interdependency and, therefore, can be processed simultaneously. By analyzing the structure of LLMs such as Llama-3.1-405B-Instruct, researchers created a new model called Ultra-253B-Base by pruning and restructuring the base model through FFN Fusion. This method results in a significantly more efficient model that maintains competitive performance.

FFN Fusion fuses multiple consecutive FFN layers into a single, wider FFN. This process is grounded in mathematical equivalence: by concatenating the weights of several FFNs, one can produce a single module that behaves like the sum of the original layers but can be computed in parallel. For instance, if three FFNs are stacked sequentially, each dependent on the output of the previous one, their fusion removes these dependencies by ensuring all three operate on the same input and their outputs are aggregated. The theoretical foundation for this method shows that the fused FFN maintains the same representational capacity. Researchers performed dependency analysis using cosine distance between FFN outputs to identify regions with low interdependence. These regions were deemed optimal for fusion, as minimal change in token direction between layers indicated the feasibility of parallel processing.
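For simple two-layer FFNs, this equivalence is easy to verify directly. The PyTorch sketch below fuses two FFNs that share the same input by stacking their first-layer weights and concatenating their second-layer weights, then checks that the fused output equals the sum of the individual outputs. Real Llama-style FFNs use gated (SwiGLU) activations at much larger scale, so treat this only as a toy demonstration of the underlying algebra.

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 16

class FFN(nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

ffn_a, ffn_b = FFN(), FFN()

# Fuse: stack first-layer weights (wider hidden dim), concatenate second-layer weights.
fused_w1 = nn.Linear(d_model, 2 * d_ff, bias=False)
fused_w2 = nn.Linear(2 * d_ff, d_model, bias=False)
with torch.no_grad():
    fused_w1.weight.copy_(torch.cat([ffn_a.w1.weight, ffn_b.w1.weight], dim=0))
    fused_w2.weight.copy_(torch.cat([ffn_a.w2.weight, ffn_b.w2.weight], dim=1))

x = torch.randn(4, d_model)
fused_out = fused_w2(torch.relu(fused_w1(x)))
sum_out = ffn_a(x) + ffn_b(x)
print(torch.allclose(fused_out, sum_out, atol=1e-6))  # True: fused FFN equals the sum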

Applying FFN Fusion to the Llama-405B model resulted in Ultra-253B-Base, which delivered notable gains in speed and resource efficiency. Specifically, the new model achieved a 1.71x improvement in inference latency and reduced per-token computational cost by 35x at a batch size of 32. This efficiency did not come at the expense of capability. Ultra-253B-Base scored 85.17% on MMLU, 72.25% on MMLU-Pro, 84.92% on Arena Hard, 86.58% on HumanEval, and 9.19 on MT-Bench. These results often matched or exceeded the original 405B-parameter model, even though Ultra-253B-Base contained only 253 billion parameters. Memory usage also improved with a 2× reduction in kv-cache requirements. The training process involved distilling 54 billion tokens at an 8k context window, followed by staged fine-tuning at 16k, 32k, and 128k contexts. These steps ensured the fused model maintained high accuracy while benefiting from reduced size.

This research demonstrates how thoughtful architectural redesign can unlock significant efficiency gains. Researchers showed that FFN layers in transformer architectures are often more independent than previously assumed. Their method of quantifying inter-layer dependency and transforming model structures allowed for broader application across models of various sizes. The technique was also validated on a 70B-parameter model, proving generalizability. Further experiments indicated that while FFN layers can often be fused with minimal impact, full block parallelization, including attention, introduces more performance degradation due to stronger interdependencies.

Several Key Takeaways from the Research on FFN Fusion:

The FFN Fusion technique reduces sequential computation in transformers by parallelizing low-dependency FFN layers.  

Fusion is achieved by replacing sequences of FFNs with a single wider FFN using concatenated weights.  

Ultra-253B-Base, derived from Llama-3.1-405B, achieves 1.71x faster inference and 35x lower per-token cost.  

Benchmark results include: 85.17% (MMLU), 72.25% (MMLU-Pro), 86.58% (HumanEval), 84.92% (Arena Hard), and 9.19 (MT-Bench).  

Memory usage is cut by half due to kv-cache optimization.  

FFN Fusion is more effective at larger model scales and works well with techniques like pruning and quantization.  

Full transformer block parallelization shows potential but requires further research due to stronger interdependencies.  

A systematic method using cosine distance helps identify which FFN sequences are safe to fuse.  

The technique is validated across different model sizes, including 49B, 70B, and 253B.  

This approach lays the foundation for more parallel-friendly and hardware-efficient LLM designs.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post NVIDIA AI Researchers Introduce FFN Fusion: A Novel Optimization Technique that Demonstrates How Sequential Computation in Large Language Models LLMs can be Effectively Parallelized appeared first on MarkTechPost.

Amazon Bedrock Guardrails image content filters provide industry-leadi …

Amazon Bedrock Guardrails announces the general availability of image content filters, enabling you to moderate both image and text content in your generative AI applications. Previously limited to text-only filtering, this enhancement now provides comprehensive content moderation across both modalities. This new capability removes the heavy lifting required to build your own image safeguards or spend cycles on manual content moderation that can be error-prone and tedious.
Tero Hottinen, VP, Head of Strategic Partnerships at KONE, envisions the following use case:

“In its ongoing evaluation, KONE recognizes the potential of Amazon Bedrock Guardrails as a key component in protecting generative AI applications, particularly for relevance and contextual grounding checks, as well as the multimodal safeguards. The company envisions integrating product design diagrams and manuals into its applications, with Amazon Bedrock Guardrails playing a crucial role in enabling more accurate diagnosis and analysis of multimodal content.”

Amazon Bedrock Guardrails provides configurable safeguards to help customers block harmful or unwanted inputs and outputs for their generative AI applications. Customers can create custom Guardrails tailored to their specific use cases by implementing different policies to detect and filter harmful or unwanted content from both input prompts and model responses. Furthermore, customers can use Guardrails to detect model hallucinations and help make responses grounded and accurate. Through its standalone ApplyGuardrail API, Guardrails enables customers to apply consistent policies across any foundation model, including those hosted on Amazon Bedrock, self-hosted models, and third-party models. Bedrock Guardrails supports seamless integration with Bedrock Agents and Bedrock Knowledge Bases, enabling developers to enforce safeguards across various workflows, such as Retrieval Augmented Generation (RAG) systems and agentic applications.
Amazon Bedrock Guardrails offers six distinct policies: content filters to detect and filter harmful material across several categories, including hate, insults, sexual content, violence, and misconduct, and to prevent prompt attacks; topic filters to restrict specific subjects; sensitive information filters to block personally identifiable information (PII); word filters to block specific terms; contextual grounding checks to detect hallucinations and analyze response relevance; and Automated Reasoning checks (currently in gated preview) to identify, correct, and explain factual claims. With the new image content moderation capability, these safeguards now extend to both text and images, helping customers block up to 88% of harmful multimodal content. You can independently configure moderation for either image or text content (or both) with adjustable thresholds from low to high, helping you to build generative AI applications that align with your organization’s responsible AI policies.
This new capability is generally available in US East (N. Virginia), US West (Oregon), Europe (Frankfurt), and Asia Pacific (Tokyo) AWS Regions.
In this post, we discuss how to get started with image content filters in Amazon Bedrock Guardrails.
Solution overview
To get started, create a guardrail on the AWS Management Console and configure the content filters for either text or image data or both. You can also use AWS SDKs to integrate this capability into your applications.
Create a guardrail
To create a guardrail, complete the following steps:

On the Amazon Bedrock console, under Safeguards in the navigation pane, choose Guardrails.
Choose Create guardrail.
In the Configure content filters section, under Harmful categories and Prompt attacks, you can use the existing content filters to detect and block image data in addition to text data.
After you’ve selected and configured the content filters you want to use, you can save the guardrail and start using it to help you block harmful or unwanted inputs and outputs for your generative AI applications.

Test a guardrail with text generation
To test the new guardrail on the Amazon Bedrock console, select the guardrail and choose Test. You have two options: test the guardrail by choosing and invoking a model or test the guardrail without invoking a model by using the Amazon Bedrock Guardrails independent ApplyGuardrail API.
With the ApplyGuardrail API, you can validate content at any point in your application flow before processing or serving results to the user. You can also use the API to evaluate inputs and outputs for self-managed (custom) or third-party FMs, regardless of the underlying infrastructure. For example, you could use the API to evaluate a Meta Llama 3.2 model hosted on Amazon SageMaker or a Mistral NeMo model running on your laptop.
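As a rough sketch, the call below shows how the standalone ApplyGuardrail API can be invoked from boto3 to validate a prompt that combines text and an image. The guardrail ID, version, Region, and file name are placeholders, and you should confirm the request fields against the current bedrock-runtime API reference for your SDK version.

import boto3

# Placeholders: guardrail ID, version, Region, and image file are illustrative.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("test-image.jpg", "rb") as f:
    image_bytes = f.read()

response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="<<guardrail_id>>",
    guardrailVersion="1",
    source="INPUT",  # validate an input prompt; use "OUTPUT" for model responses
    content=[
        {"text": {"text": "Describe what is happening in this image."}},
        {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
    ],
)

print(response["action"])       # "GUARDRAIL_INTERVENED" or "NONE"
print(response["assessments"])  # per-policy details, e.g. which content filter fired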
Test a guardrail by choosing and invoking a model
Select a model that supports image inputs or outputs, for example, Anthropic’s Claude 3.5 Sonnet. Verify that the prompt and response filters are enabled for image content. Then, provide a prompt, upload an image file, and choose Run.

In this example, Amazon Bedrock Guardrails intervened. Choose View trace for more details.
The guardrail trace provides a record of how safety measures were applied during an interaction. It shows whether Amazon Bedrock Guardrails intervened or not and what assessments were made on both input (prompt) and output (model response). In this example, the content filters blocked the input prompt because they detected violence in the image with medium confidence.

Test a guardrail without invoking a model
On the Amazon Bedrock console, choose Use ApplyGuardrail API, the independent API to test the guardrail without invoking a model. Choose whether you want to validate an input prompt or an example of a model generated output. Then, repeat the steps from the previous section. Verify that the prompt and response filters are enabled for image content, provide the content to validate, and choose Run.

For this example, we reused the same image and input prompt, and Amazon Bedrock Guardrails intervened again. Choose View trace again for more details.

Test a guardrail with image generation
Now, let’s test Amazon Bedrock Guardrails multimodal toxicity detection with an image generation use case. We generate an image using the Stability model on Amazon Bedrock through the InvokeModel API with the guardrail applied:

import base64
import json
import os
import random
import string

import boto3
import botocore.exceptions

region = "us-east-1"  # use a Region where the guardrail and model are available
guardrailIdentifier = "<<guardrail_id>>"  # replace with your guardrail ID
guardrailVersion = "1"

model_id = "stability.sd3-5-large-v1:0"
output_images_folder = "images/output"

body = json.dumps(
    {
        "prompt": "A Gun",  # for image generation ("A gun" should get blocked by violence)
        "output_format": "jpeg",
    }
)

bedrock_runtime = boto3.client("bedrock-runtime", region_name=region)
try:
    print("Making a call to InvokeModel API for model: {}".format(model_id))
    response = bedrock_runtime.invoke_model(
        body=body,
        modelId=model_id,
        trace="ENABLED",
        guardrailIdentifier=guardrailIdentifier,
        guardrailVersion=guardrailVersion,
    )
    response_body = json.loads(response.get("body").read())
    print("Received response from InvokeModel API (Request Id: {})".format(response["ResponseMetadata"]["RequestId"]))
    if "images" in response_body and len(response_body["images"]) > 0:
        os.makedirs(output_images_folder, exist_ok=True)
        images = response_body["images"]
        for image in images:
            image_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
            image_file = os.path.join(output_images_folder, "generated-image-{}.jpg".format(image_id))
            print("Saving generated image {} at {}".format(image_id, image_file))
            with open(image_file, "wb") as image_file_descriptor:
                image_file_descriptor.write(base64.b64decode(image.encode("utf-8")))
    else:
        print("No images generated from model")
        guardrail_trace = response_body["amazon-bedrock-trace"]["guardrail"]
        guardrail_trace["modelOutput"] = ["<REDACTED>"]
        print(guardrail_trace["outputs"])
        print("\nGuardrail Trace: {}".format(json.dumps(guardrail_trace, indent=2)))
except botocore.exceptions.ClientError as err:
    print("Failed while calling InvokeModel API with RequestId = {}".format(err.response["ResponseMetadata"]["RequestId"]))
    raise err

You can access the complete example from the GitHub repo.
Conclusion
In this post, we explored how Amazon Bedrock Guardrails’ new image content filters provide comprehensive multimodal content moderation capabilities. By extending beyond text-only filtering, this solution now helps customers block up to 88% of harmful or unwanted multimodal content across configurable categories including hate, insults, sexual content, violence, misconduct, and prompt attack detection. Guardrails can help organizations across healthcare, manufacturing, financial services, media, and education enhance brand safety without the burden of building custom safeguards or conducting error-prone manual evaluations.
To learn more, see Stop harmful content in models using Amazon Bedrock Guardrails.

About the Authors
Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services, specializing in Amazon Bedrock security. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies and security principles allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value while maintaining robust security postures.
Shyam Srinivasan is on the Amazon Bedrock Guardrails product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.
Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at AWS. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.
Dr. Andrew Kane is an AWS Principal WW Tech Lead (AI Language Services) based out of London. He focuses on the AWS Language and Vision AI services, helping our customers architect multiple AI services into a single use case-driven solution. Before joining AWS at the beginning of 2015, Andrew spent two decades working in the fields of signal processing, financial payments systems, weapons tracking, and editorial and publishing systems. He is a keen karate enthusiast (just one belt away from Black Belt) and is also an avid home-brewer, using automated brewing hardware and other IoT sensors.

Integrating custom dependencies in Amazon SageMaker Canvas workflows

When implementing machine learning (ML) workflows in Amazon SageMaker Canvas, organizations might need to consider external dependencies required for their specific use cases. Although SageMaker Canvas provides powerful no-code and low-code capabilities for rapid experimentation, some projects might require specialized dependencies and libraries that aren’t included by default in SageMaker Canvas. This post provides an example of how to incorporate code that relies on external dependencies into your SageMaker Canvas workflows.
Amazon SageMaker Canvas is a low-code no-code (LCNC) ML platform that guides users through every stage of the ML journey, from initial data preparation to final model deployment. Without writing a single line of code, users can explore datasets, transform data, build models, and generate predictions.
SageMaker Canvas offers comprehensive data wrangling capabilities that help you prepare your data, including:

Over 300 built-in transformation steps
Feature engineering capabilities
Data normalization and cleansing functions
A custom code editor supporting Python, PySpark, and SparkSQL

In this post, we demonstrate how to incorporate dependencies stored in Amazon Simple Storage Service (Amazon S3) within an Amazon SageMaker Data Wrangler flow. Using this approach, you can run custom scripts that depend on modules not inherently supported by SageMaker Canvas.
Solution overview
To showcase the integration of custom scripts and dependencies from Amazon S3 into SageMaker Canvas, we explore the following example workflow.
The solution follows three main steps:

Upload custom scripts and dependencies to Amazon S3
Use SageMaker Data Wrangler in SageMaker Canvas to transform your data using the uploaded code
Train and export the model

The following diagram is the architecture for the solution.

In this example, we work with two complementary datasets available in SageMaker Canvas that contain shipping information for computer screen deliveries. By joining these datasets, we create a comprehensive dataset that captures various shipping metrics and delivery outcomes. Our goal is to build a predictive model that can determine whether future shipments will arrive on time based on historical shipping patterns and characteristics.
Prerequisites
As a prerequisite, you need access to Amazon S3 and Amazon SageMaker AI. If you don’t already have a SageMaker AI domain configured in your account, you also need permissions to create a SageMaker AI domain.
Create the data flow
To create the data flow, follow these steps:

On the Amazon SageMaker AI console, in the navigation pane, under Applications and IDEs, select Canvas, as shown in the following screenshot. You might need to create a SageMaker domain if you haven’t done so already.
After your domain is created, choose Open Canvas.

In Canvas, select the Datasets tab and select canvas-sample-shipping-logs.csv, as shown in the following screenshot. After the preview appears, choose + Create a data flow.

The initial data flow will open with one source and one data type.

At the top right of the screen, select Add data → tabular. Choose Canvas Datasets as the source and select canvas-sample-product-descriptions.csv.
Choose Next as shown in the following screenshot. Then choose Import.

After both datasets have been added, select the plus sign. From the dropdown menu, choose Combine data. From the next dropdown menu, choose Join.

To perform an inner join on the ProductID column, in the right-hand menu, under Join type, choose Inner join. Under Join keys, choose ProductId, as shown in the following screenshot.

After the datasets have been joined, select the plus sign. In the dropdown menu, select + Add transform. A preview of the dataset will open.

The dataset contains XShippingDistance (long) and YShippingDistance (long) columns. For our purposes, we want to use a custom function that will find the total distance using the X and Y coordinates and then drop the individual coordinate columns. For this example, we find the total distance using a function that relies on the mpmath library.

To call the custom function, select + Add transform. In the dropdown menu, select Custom transform. Change the editor to Python (Pandas) and try to run the following function from the Python editor:

from mpmath import sqrt  # Import sqrt from mpmath

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):

    # Use mpmath's sqrt to calculate the total distance for each row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)

    # Drop the original x and y columns
    df = df.drop(columns=[x_col, y_col])

    return df

df = calculate_total_distance(df)

Running the function produces the following error: ModuleNotFoundError: No module named ‘mpmath’, as shown in the following screenshot.

This error occurs because mpmath isn’t a module that is inherently supported by SageMaker Canvas. To use a function that relies on this module, we need to approach the use of a custom function differently.
Zip the script and dependencies
To use a function that relies on a module that isn’t natively supported in Canvas, the custom script must be zipped with the module(s) it relies on. For this example, we used our local integrated development environment (IDE) to create a script.py that relies on the mpmath library.
The script.py file contains two functions: one function that is compatible with the Python (Pandas) runtime (function calculate_total_distance), and one that is compatible with the Python (Pyspark) runtime (function udf_total_distance).

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from mpmath import sqrt  # Import sqrt from mpmath

    # Use mpmath's sqrt to calculate the total distance for each row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)

    # Drop the original x and y columns
    df = df.drop(columns=[x_col, y_col])

    return df

def udf_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    spark = (
        SparkSession.builder
        .master("local")
        .appName("DistanceCalculation")
        .getOrCreate()
    )

    def calculate_distance(x, y):
        import sys

        # Add the path to mpmath
        mpmath_path = "/tmp/maths"
        if mpmath_path not in sys.path:
            sys.path.insert(0, mpmath_path)

        from mpmath import sqrt
        return float(sqrt(x**2 + y**2))

    # Register and apply UDF
    distance_udf = udf(calculate_distance, FloatType())
    df = df.withColumn(new_col, distance_udf(df[x_col], df[y_col]))
    df = df.drop(x_col, y_col)

    return df

To make sure the script can run, install mpmath into the same directory as script.py by running pip install mpmath -t . from that directory.
Run zip -r my_project.zip . from the same directory to create a .zip file containing the function and the mpmath installation. The current directory now contains a .zip file, our Python script, and the installation our script depends on, as shown in the following screenshot.

Upload to Amazon S3
After creating the .zip file, upload it to an Amazon S3 bucket.

After the zip file has been uploaded to Amazon S3, it’s accessible in SageMaker Canvas.
Run the custom script
Return to the data flow in SageMaker Canvas and replace the prior custom function code with the following code and choose Update.

import zipfile
import boto3
import sys
from pathlib import Path
import shutil
import importlib.util

def load_script_and_dependencies(bucket_name, zip_key, extract_to):
    """
    Downloads a zip file from S3, unzips it, and ensures dependencies are available.

    Args:
        bucket_name (str): Name of the S3 bucket.
        zip_key (str): Key for the .zip file in the bucket.
        extract_to (str): Directory to extract files to.

    Returns:
        str: Path to the extracted folder containing the script and dependencies.
    """

    s3_client = boto3.client("s3")

    # Local path for the zip file
    zip_local_path = "/tmp/dependencies.zip"

    # Download the .zip file from S3
    s3_client.download_file(bucket_name, zip_key, zip_local_path)
    print(f"Downloaded zip file from S3: {zip_key}")

    # Unzip the file
    try:
        with zipfile.ZipFile(zip_local_path, "r") as zip_ref:
            zip_ref.extractall(extract_to)
            print(f"Extracted files to {extract_to}")
    except Exception as e:
        raise RuntimeError(f"Failed to extract zip file: {e}")

    # Add the extracted folder to Python path
    if extract_to not in sys.path:
        sys.path.insert(0, extract_to)

    return extract_to


def call_function_from_script(script_path, function_name, df):
    """
    Dynamically loads a function from a Python script using importlib.
    """
    try:
        # Get the script name from the path
        module_name = script_path.split("/")[-1].replace(".py", "")

        # Load the module specification
        spec = importlib.util.spec_from_file_location(module_name, script_path)
        if spec is None:
            raise ImportError(f"Could not load specification for module {module_name}")

        # Create the module
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module

        # Execute the module
        spec.loader.exec_module(module)

        # Get the function from the module
        if not hasattr(module, function_name):
            raise AttributeError(f"Function '{function_name}' not found in the script.")

        loaded_function = getattr(module, function_name)

        # Clean up: remove module from sys.modules after execution
        del sys.modules[module_name]

        # Call the function
        return loaded_function(df)

    except Exception as e:
        raise RuntimeError(f"Error loading or executing function: {e}")

bucket_name = "canvasdatabuckett"  # S3 bucket name
zip_key = "functions/my_project.zip"  # S3 path to the zip file with our custom dependency
script_name = "script.py"  # Name of the script in the zip file
function_name = "calculate_total_distance"  # Name of the function to call from our script
extract_to = "/tmp/maths"  # Local path to our custom script and dependencies

# Step 1: Load the script and dependencies
extracted_path = load_script_and_dependencies(bucket_name, zip_key, extract_to)

# Step 2: Call the function from the script
script_path = f"{extracted_path}/{script_name}"
df = call_function_from_script(script_path, function_name, df)

This example code unzips the .zip file and adds the required dependencies to the local path so they’re available to the function at run time. Because mpmath was added to the local path, you can now call a function that relies on this external library.
The preceding code runs using the Python (Pandas) runtime and calculate_total_distance function. To use the Python (Pyspark) runtime, update the function_name variable to call the udf_total_distance function instead.
Complete the data flow
As a last step, remove irrelevant columns before training the model. Follow these steps:

On the SageMaker Canvas console, select + Add transform. From the dropdown menu, select Manage columns.
Under Transform, choose Drop column. Under Columns to drop, add ProductId_0, ProductId_1, and OrderID, as shown in the following screenshot.

The final dataset should contain 13 columns. The complete data flow is pictured in the following image.

Train the model
To train the model, follow these steps:

At the top right of the page, select Create model and name your dataset and model.
Select Predictive analysis as the problem type and OnTimeDelivery as the target column, as shown in the screenshot below.

When building the model, you can choose to run a Quick build or a Standard build. A Quick build prioritizes speed over accuracy and produces a trained model in less than 20 minutes. A Standard build prioritizes accuracy over speed, but the model takes longer to train.
Results
After the model build is complete, you can view the model's accuracy, along with metrics such as F1 score, precision, and recall. In the case of a Standard build, the model achieved 94.5% accuracy.

After the model training is complete, there are four ways you can use your model:

Deploy the model directly from SageMaker Canvas to an endpoint
Add the model to the SageMaker Model Registry
Export your model to a Jupyter Notebook
Send your model to Amazon QuickSight for use in dashboard visualizations

Clean up
To manage costs and prevent additional workspace charges, choose Log out to sign out of SageMaker Canvas when you’re done using the application, as shown in the following screenshot. You can also configure SageMaker Canvas to automatically shut down when idle.
If you created an S3 bucket specifically for this example, you might also want to empty and delete your bucket.
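If you prefer to script that cleanup, the following is a small sketch using boto3; it assumes the bucket name from the earlier example and that your role is allowed to delete it.

import boto3

# Empty and delete the example bucket (replace the name if yours differs)
bucket = boto3.resource('s3').Bucket('canvasdatabuckett')
bucket.objects.all().delete()           # remove all current objects
bucket.object_versions.all().delete()   # remove old versions if versioning was enabled
bucket.delete()                         # delete the now-empty bucket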

Summary
In this post, we demonstrated how you can upload custom dependencies to Amazon S3 and integrate them into SageMaker Canvas workflows. By walking through a practical example of implementing a custom distance calculation function with the mpmath library, we showed how to:

Package custom code and dependencies into a .zip file
Store and access these dependencies from Amazon S3
Implement custom data transformations in SageMaker Data Wrangler
Train a predictive model using the transformed data

This approach means that data scientists and analysts can extend SageMaker Canvas capabilities beyond the more than 300 included functions.
To try custom transforms yourself, refer to the Amazon SageMaker Canvas documentation and sign in to SageMaker Canvas today. For additional insights into how you can optimize your SageMaker Canvas implementation, we recommend exploring these related posts:

Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio
Unlock the power of data governance and no-code machine learning with Amazon SageMaker Canvas and Amazon DataZone

About the Author
Nadhya Polanco is an Associate Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and Machine Learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and exploring new destinations.

Generate training data and cost-effectively train categorical models w …

In this post, we explore how you can use Amazon Bedrock to generate high-quality categorical ground truth data, which is crucial for training machine learning (ML) models in a cost-sensitive environment. Generative AI solutions can play an invaluable role during the model development phase by simplifying training and test data creation for multiclass classification supervised learning use cases. We dive deep into this process on how to use XML tags to structure the prompt and guide Amazon Bedrock in generating a balanced label dataset with high accuracy. We also showcase a real-world example for predicting the root cause category for support cases. This use case, solvable through ML, can enable support teams to better understand customer needs and optimize response strategies.
Business challenge
The exploration and methodology described in this post addresses two key challenges: costs associated with generating a ground truth dataset for multiclass classification use cases can be prohibitive, and conventional approaches and synthetic dataset creation strategies for generating ground truth data are inadequate in generating balanced classes and meeting desired performance parameters for the real-world use cases.
Ground truth data generation is expensive and time consuming
Ground truth annotation needs to be accurate and consistent, often requiring massive time and expertise to ensure the dataset is balanced, diverse, and large enough for model training and testing. For a multiclass classification problem such as support case root cause categorization, this challenge is compounded many times over.
Let’s say the task at hand is to predict the root cause categories (Customer Education, Feature Request, Software Defect, Documentation Improvement, Security Awareness, and Billing Inquiry) for customer support cases. Based on our experiments using best-in-class supervised learning algorithms available in AutoGluon, we arrived at a sample size of 3,000 for the training dataset for each category to attain an accuracy of 90%. This requirement translates into a time and effort investment from trained personnel, who could be support engineers or other technical staff, to review tens of thousands of support cases to arrive at an even distribution of 3,000 per category. With each support case and its related correspondences averaging 5 minutes of review and assessment from a human labeler, this translates into 1,500 hours (5 minutes x 18,000 support cases) of work, or 188 days assuming an 8-hour workday. Besides the time spent in review and labeling, there is an upfront investment in training the labelers so that the exercise, split between 10 or more labelers, remains consistent. To break this down further, a ground truth labeling campaign split between 10 labelers would require close to 4 weeks to label 18,000 cases if the labelers spend 40 hours a week on the exercise.
Not only is such an extended and effort-intensive campaign expensive, but it can cause inconsistent labeling for categories every time the labeler puts aside the task and resumes it later. The exercise also doesn’t guarantee a balanced labeled ground truth dataset because some root cause categories such as Customer Education could be far more common than Feature Request or Software Defect, thereby extending the campaign.
Conventional techniques to get balanced classes or synthetic data generation have shortfalls
A balanced labeled dataset is critical for a multiclass classification use case to mitigate bias and make sure the model learns to accurately classify all classes, rather than favoring the majority class. If the dataset is imbalanced, with one or more classes having significantly fewer instances than others, the model might struggle to learn the patterns and features associated with the minority classes, leading to poor performance and biased predictions. This issue is particularly problematic in applications where accurate classification of minority classes is critical, such as medical diagnoses, fraud detection, or root cause categorization. For the use case of labeling the support root cause categories, it’s often harder to source examples for categories such as Software Defect, Feature Request, and Documentation Improvement for labeling than it is for Customer Education. This results in an imbalanced class distribution for training and test datasets.
To address this challenge, various techniques can be employed, including oversampling the minority classes, undersampling the majority classes, using ensemble methods that combine multiple classifiers trained on different subsets of the data, or synthetic data generation to augment minority classes. However, the ideal approach for achieving optimal performance is to start with a balanced and highly accurate labeled dataset for ground truth training.
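As a simple illustration of the first of these techniques, the following sketch shows naive random oversampling by duplication with pandas; it assumes a DataFrame with a root_cause label column and, as discussed next, it does not add genuinely new information to the minority classes.

import pandas as pd

def oversample_minority_classes(df, label_col='root_cause', random_state=42):
    # Duplicate minority-class rows (sampling with replacement) until every class
    # matches the size of the largest class
    max_count = df[label_col].value_counts().max()
    balanced_parts = [
        group.sample(n=max_count, replace=True, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(balanced_parts).sample(frac=1, random_state=random_state).reset_index(drop=True)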
Although oversampling for minority classes means extended and expensive data labeling with humans who review the support cases, synthetic data generation to augment the minority classes poses its own challenges. For the multiclass classification problem of labeling support case data, synthetic data generation can quickly result in overfitting. This is because it can be difficult to synthesize real-world examples of technical case correspondences that contain complex content related to software configuration, implementation guidance, documentation references, technical troubleshooting, and the like.
Because ground truth labeling is expensive and synthetic data generation isn’t an option for use cases such as root cause prediction, the effort to train a model is often put aside. This results in a missed opportunity to review the root cause trends that can guide investment in the right areas such as education for customers, documentation improvement, or other efforts to reduce the case volume and improve customer experience.
Solution overview
The preceding section discussed why conventional ground truth data generation techniques aren’t viable for certain supervised learning use cases and fall short in training a highly accurate model to predict the support case root cause in our example. Let’s look at how generative AI can help solve this problem.
Generative AI supports key use cases such as content creation, summarization, code generation, creative applications, data augmentation, natural language processing, scientific research, and many others. Amazon Bedrock is well-suited for this data augmentation exercise to generate high-quality ground truth data. Using highly tuned and custom tailored prompts with examples and techniques discussed in the following sections, support teams can pass the anonymized support case correspondence to Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock or other available large language models (LLMs) to predict the root cause label for a support case from one of the many categories (Customer Education, Feature Request, Software Defect, Documentation Improvement, Security Awareness, and Billing Inquiry). After achieving the desired accuracy, you can use this ground truth data in an ML pipeline with automated machine learning (AutoML) tools such as AutoGluon to train a model and inference the support cases.
Checking LLM accuracy for ground truth data
To evaluate an LLM for the task of category labeling, the process begins by determining if labeled data is available. If labeled data exists, the next step is to check if the model’s use case produces discrete outcomes. Where discrete outcomes with labeled data exist, standard ML methods such as precision, recall, or other classic ML metrics can be used. These metrics provide high precision but are limited to specific use cases due to limited ground truth data.
If the use case doesn’t yield discrete outputs, task-specific metrics are more appropriate. These include metrics such as ROUGE or cosine similarity for text similarity, and specific benchmarks for assessing toxicity (Detoxify), prompt stereotyping (cross-entropy loss), or factual knowledge (HELM, LAMA).
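As a small illustration of one such task-specific metric, the following sketch computes TF-IDF cosine similarity between a generated answer and a reference text using scikit-learn; it is a generic example rather than part of the evaluation workflow described here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_similarity(generated: str, reference: str) -> float:
    # Return the TF-IDF cosine similarity between two texts (0.0 to 1.0)
    vectors = TfidfVectorizer().fit_transform([generated, reference])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

print(text_similarity('The patch fixes the memory leak.',
                      'Engineering released a fix for the memory leak.'))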
If labeled data is unavailable, the next question is whether the testing process should be automated. The automation decision depends on the cost-accuracy trade-off, because higher accuracy comes at a higher cost. For cases where automation is not required, human-in-the-loop (HIL) approaches can be used. This involves manual evaluation based on predefined assessment rules (for example, ground truth), yielding high evaluation precision, but it is often time-consuming and costly.
When automation is preferred, using another LLM to assess outputs can be effective. Here, a reliable LLM can be instructed to rate generated outputs, providing automated scores and explanations. However, the precision of this method depends on the reliability of the chosen LLM. Each path represents a tailored approach based on the availability of labeled data and the need for automation, allowing for flexibility in assessing a wide range of FM applications.
The following figure illustrates an FM evaluation workflow.

For the use case, if a historic collection of 10,000 or more support cases labeled using Amazon SageMaker Ground Truth with HIL is available, it can be used for evaluating the accuracy of the LLM prediction. The key goal for generating new ground truth data using Amazon Bedrock should be to augment it for increasing diversity and increasing the training data size for AutoGluon training to arrive at a performant model that can be used for the final inference or root cause prediction. In the following sections, we explain how to take an incremental and measured approach to improve Anthropic’s Claude 3.5 Sonnet prediction accuracy through prompt engineering.
Prompt engineering for FM accuracy and consistency
Prompt engineering is the art and science of designing a prompt to get an LLM to produce the desired output. We suggest consulting LLM prompt engineering documentation such as Anthropic prompt engineering for experiments. Based on experiments conducted without a finely tuned and optimized prompt, we observed low accuracy rates of less than 60%. In the following sections, we provide a detailed explanation on how to construct your first prompt, and then gradually improve it to consistently achieve over 90% accuracy.
Designing the prompt
Before starting any scaled use of generative AI, you should have the following in place:

A clear definition of the problem you are trying to solve along with the end goal.
A way to test the model's output for accuracy. The thumbs up/down technique to determine accuracy, along with comparison against the 10,000-case dataset labeled through SageMaker Ground Truth, is well-suited for this exercise.
A defined success criterion on how accurate the model needs to be.

It’s helpful to think of an LLM as a new employee who is very well read, but knows nothing about your culture, your norms, what you are trying to do, or why you are trying to do it. The LLM’s performance will depend on how precisely you can explain what you want. How would a skilled manager handle a very smart, but new and inexperienced employee? The manager would provide contextual background, explain the problem, explain the rules they should apply when analyzing the problem, and give some examples of what good looks like along with why it is good. Later, if they saw the employee making mistakes, they might try to simplify the problem and provide constructive feedback by giving examples of what not to do, and why. One difference is that an employee would understand the job they are being hired for, so we need to explicitly tell the LLM to assume the persona of a support employee.
Prerequisites
To follow along with this post, set up Amazon SageMaker Studio to run Python in a notebook and interact with Amazon Bedrock. You also need the appropriate permissions to access Amazon Bedrock models.
Set up SageMaker Studio
Complete the following steps to set up SageMaker Studio:

On the SageMaker console, choose Studio under Applications and IDEs in the navigation pane.
Create a new SageMaker Studio instance if you haven’t already.
If prompted, set up a user profile for SageMaker Studio by providing a user name and specifying AWS Identity and Access Management (IAM) permissions.
Open a SageMaker Studio notebook:

Choose JupyterLab.
Create a private JupyterLab space.
Configure the space (set the instance type to ml.m5.large for optimal performance).
Launch the space.
On the File menu, choose New and Notebook to create a new notebook.

Configure SageMaker to meet your security and compliance objectives. Refer to Configure security in Amazon SageMaker AI for details.

Set up permissions for Amazon Bedrock access
Make sure you have the following permissions:

IAM role with Amazon Bedrock permissions – Make sure that your SageMaker Studio execution role has the necessary permissions to access Amazon Bedrock. Attach the AmazonBedrockFullAccess policy or a custom policy with specific Amazon Bedrock permissions to your IAM role.
AWS SDKs and authentication – Verify that your AWS credentials (usually from the SageMaker role) have Amazon Bedrock access. Refer to Getting started with the API to set up your environment to make Amazon Bedrock requests through the AWS API.
Model access – Grant permission to use Anthropic’s Claude 3.5 Sonnet. For instructions, see Add or remove access to Amazon Bedrock foundation models.
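As an optional sanity check (not part of the original walkthrough), you can confirm that your role can reach Amazon Bedrock before running the full example; this assumes the us-east-1 Region used later in this post.

import boto3

# List the foundation models visible to this role; an access error here usually
# means the IAM policy or the model access grant is missing
bedrock = boto3.client('bedrock', region_name='us-east-1')
models = bedrock.list_foundation_models()['modelSummaries']
print([m['modelId'] for m in models if 'claude-3-5' in m['modelId']])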

Test the code using the native inference API for Anthropic’s Claude
The following code uses the native inference API to send a text message to Anthropic’s Claude. The Python code invokes the Amazon Bedrock Runtime service:

import boto3
import json
from datetime import datetime
import time

# Create an Amazon Bedrock Runtime client in the AWS Region of your choice.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Set the model ID, in this case Anthropic's Claude 3.5 Sonnet.
model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"

# Load the prompt from a file (shown and explained later in the post)
with open('prompt.txt', 'r') as file:
    data = file.read()

def callBedrock(body):
    # Format the request payload using the model's native structure.
    prompt = data + body

    # Truncate the prompt to the max input window size of Claude 3.5 Sonnet
    prompt = prompt[:180000]

    # Define the parameters passed to the model.
    native_request = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "temperature": 0.2,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}],
            }
        ],
    }

    # Convert the native request to JSON.
    request = json.dumps(native_request)

    try:
        # Invoke the model with the request.
        response = client.invoke_model(modelId=model_id, body=request)
    except Exception as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        raise

    # Load the response returned from Amazon Bedrock into a JSON object
    model_response = json.loads(response["body"].read())

    # Extract and return the response text.
    response_text = model_response["content"][0]["text"]
    return response_text
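Once prompt.txt is in place, a single anonymized case correspondence can be classified with one call; the variable name and case text below are illustrative.

# Hypothetical usage: classify one anonymized support case correspondence
case_text = 'Customer: The export job fails with error DB-7721 ... Agent: ...'
print(callBedrock(case_text))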

Construct the initial prompt
We demonstrate the approach for the specific use case of root cause prediction, with a goal of achieving 90% accuracy. Start by creating a prompt similar to one you would give to humans in natural language. This can be a simple description of each root cause label and why you would choose it, how to interpret the case correspondences, how to analyze and choose the corresponding root cause label, and examples for every category. Also ask the model to provide its reasoning so you can understand how it reached certain decisions. It can be especially interesting to understand the reasoning for the decisions you don't agree with. See the following example code:

Please familiarize yourself with these categories.  When you evaluate a case, evaluate the definitions in order and label the case with the first definition that fits.  If a case morphs from one type to another, choose the type the case started out as. 

Read the correspondence, especially the original request, and the last correspondence from the support agent to the customer. If there are a lot of correspondences, or the case does not seem straightforward to infer, read the correspondences in date-stamped order to understand what happened. If the case references documentation, read or skim the documentation to determine whether it clearly supports what the support agent mentioned and whether it answers the customer's issue.

Software Defect:  “Software Defect” are cases where the application does not work as expected. The support agent confirms this through analysis and troubleshooting and mentions internal team is working on a fix or patch to address the bug or defect.

An example of Software Defect case is [Customer: “Our data pipeline jobs are failing with a ‘memory allocation error’ during the aggregation phase. This started occurring after upgrading to version 4.2.1. The same ETL workflows were running fine before the upgrade. We’ve verified our infrastructure meets all requirements.” Agent: “After analyzing the logs, we’ve confirmed a memory leak in the aggregation module – a regression introduced in 4.2.1. Engineering has identified the root cause and is developing an emergency patch. We expect to release version 4.2.2 within 48 hours to resolve this issue.”]
…. 

Analyze the results
We recommend taking a small sample (for example, 150) of random cases, running them through Anthropic's Claude 3.5 Sonnet using the initial prompt, and manually checking the initial results. You can load the input data and model output into Excel and add the following columns for analysis:

Claude Label – A calculated column with Anthropic’s Claude’s category
Label – True category after reviewing each case and selecting a specific root cause category to compare with the model’s prediction and derive an accuracy measurement
Close Call – 1 or 0 so that you can take numerical averages
Notes – For cases where there was something noteworthy about the case or inaccurate categorizations
Claude Correct – A calculated column (0 or 1) based on whether our category matched the model’s output category
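A short, optional sketch of how accuracy could then be computed from such a spreadsheet with pandas follows (the column names match the list above; the file name is hypothetical, and reading .xlsx files requires openpyxl).

import pandas as pd

# Load the manually reviewed results exported from Excel
results = pd.read_excel('claude_label_review.xlsx')

# Accuracy is the share of cases where the model's category matched ours
accuracy = results['Claude Correct'].mean()
close_call_rate = results['Close Call'].mean()
print(f'Accuracy: {accuracy:.1%}, close calls: {close_call_rate:.1%}')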

Although the first run is expected to produce accuracy too low to use the prompt for generating ground truth data, the reasoning will help you understand why Anthropic's Claude mislabeled the cases. In our example, the accuracy was only 61%, and many of the misses fell into the following categories:

Cases where Anthropic’s Claude categorized Customer Education cases as Software Defect because it interpreted the support agent instructions to reconfigure something as a workaround for a Software Defect.
Cases where users asked questions about billing that Anthropic's Claude categorized as Customer Education. Although billing questions could also be Customer Education cases, we wanted these to be categorized as the more specific Billing Inquiry category. Likewise, although Security Awareness cases are also Customer Education, we wanted to categorize these as the more specific Security Awareness category.

Iterate on the prompt and make changes
Providing the LLM explicit instructions on correcting these errors should result in a major boost in accuracy. We tested the following adjustments with Anthropic’s Claude:

We defined and assigned a persona with background information for the LLM: “You are a Support Agent and an expert on the enterprise application software. You will be classifying customer cases into categories…”
We ordered the categories from more deterministic and well-defined to less specific and instructed Anthropic’s Claude to evaluate the categories in the order they appear in the prompt.
Following the Anthropic documentation's suggestion to use XML tags, we enclosed the root cause categories in lightweight XML (not a formal XML document), with elements delimited by tags. It's ideal to create the categories as nodes, with a separate sub-node for each category. Each category node should consist of the category name, a description, and what the output should look like. The categories should be delimited by begin and end tags.

You are a Support Agent and an expert on the enterprise application software. You will be classifying the customer support cases into categories, based on the given interaction between an agent and a customer. You can only choose ONE Category from the list below. You follow instructions well, step by step, and evaluate the categories in the order they appear in the prompt when making a decision.

The categories are defined as:

<categories>
<category>
<name>
“Software Defect”
</name>
<description>
“Software Defect” are cases where the application software does not work as expected. The agent confirms the application is not working as expected and may refer to internal team working on a fix or patch to address the bug or defect. The category includes common errors or failures related to performance, software version, functional defect, unexpected exception or usability bug when the customer is following the documented steps.
</description>
</category>

</categories>

We created a good examples node with at least one good example for every category. Each good example consisted of the example, the classification, and the reasoning:

Here are some good examples with reasoning:

<good examples>
<example>
<example data>
Customer: “Our data pipeline jobs are failing with a ‘memory allocation error’ during the aggregation phase. This started occurring after upgrading to version 4.2.1. The same ETL workflows were running fine before the upgrade. We’ve verified our infrastructure meets all requirements.”
Agent: “After analyzing the logs, we’ve confirmed a memory leak in the aggregation module – a regression introduced in 4.2.1. Engineering has identified the root cause and is developing an emergency patch. We expect to release version 4.2.2 within 48 hours to resolve this issue.”
</example data>
<classification>
“Software Defect”
</classification>
<explanation>
Customer is reporting a data processing exception with a specific version and the agent confirms this is a regression and defect. The agent confirms that engineering is working to provide an emergency patch for the issue.
</explanation>
</example>

</good examples>

We created a bad examples node with examples of where the LLM miscategorized previous cases. The bad examples node should have the same set of fields as the good examples (example data, classification, and explanation), but here the explanation describes the error. The following is a snippet:

Here are some examples for wrong classification with reasoning:

<bad examples>

<example>
<example data>
Customer: “We need the ability to create custom dashboards that can aggregate data across multiple tenants in real-time. Currently, we can only view metrics per individual tenant, which requires manual consolidation for our enterprise reporting needs.”
Agent: “I understand your need for cross-tenant analytics. While the current functionality is limited to single-tenant views as designed, I’ve submitted your request to our product team as a high-priority feature enhancement. They’ll evaluate it for inclusion in our 2025 roadmap. I’ll update you when there’s news about this capability.”
</example data>
<example output>
<classification>
“Software Defect”
</classification>
<explanation>
Classification should be Feature Request and not Software Defect. The application does not have the function or capability being requested but it is working as documented or advertised. In the example, the agent mentions they have submitted the request to their product team to consider in the future roadmap.
</explanation>
</example>

</bad examples>

We also added instructions for how to format the output:

Given the above categories defined in XML, logically think through which category fits best and then complete the classification. Provide a response in XML with the following elements: classification, explanation (limited to 2 sentences). Return your results as this sample output XML below and do not append your thought process to the response.
 
<response>
<classification> Software Defect </classification>
<explanation> The support case is for ETL Pipeline Performance Degradation where the customer reports their nightly data transformation job takes 6 hours to complete instead of 2 hours before but no changes to configuration occurred. The agent mentions Engineering confirmed memory leak in version 5.1.2 and are deploying a Hotfix indicating this is a Software Defect.
</explanation>
</response>

Test with the new prompt
The preceding approach should result in an improved prediction accuracy. In our experiment, we saw 84% accuracy with the new prompt and the output was consistent and more straightforward to parse. Anthropic’s Claude followed the suggested output format in almost all cases. We wrote code to fix errors such as unexpected tags in the output and drop responses that could not be parsed.
The following is the code to parse the output:

# This python script parses LLM output into a comma separated list with the SupportID, Category, Reason
# Command line is python parse_llm_output.py PathToLLMOutput.txt PathToParsedOutput.csv
# Note: it will overwrite the output file without confirming
# It will write completion status and any error messages to stdout

import re
import sys

# These tokens are based on the format of the Claude output.
# The pattern captures three groups -- CaseID, RootCause, and Reasoning -- which we extract with re.match.
pattern = re.compile(
    "^([0-9]*).*<classification>(.*)</classification><explanation>(.*)</explanation>"
)

endToken = "</response>"
checkToken = "<classification>"

acceptableClassifications = [
    "Billing Inquiry",
    "Documentation Improvement",
    "Feature Request",
    "Security Awareness",
    "Software Defect",
    "Customer Education",
]

def parseResponse(response):
    # Parsing is trivial with the regular expression groups
    m = pattern.match(response)
    return m

# Get the input and output files
if len(sys.argv) != 3:
    print("Command line error: parse_llm_output.py inputfile outputfile")
    sys.exit(1)

# Open the files
input = open(sys.argv[1], encoding="utf8")
output = open(sys.argv[2], "w")

# Read the entire file in. This works well with 30,000 responses, but would need to be adjusted for, say, 3,000,000 responses
responses = input.read()

# Get rid of the double quotes and newlines to avoid incorrect Excel parsing; they are unnecessary
responses = responses.replace('"', "")
responses = responses.replace("\n", "")

# Initialize our placeholder and counters
parsedChars = 0
skipped = 0
invalid = 0
responseCount = 0

# Write the header
output.write("CaseID,RootCause,Reason\n")

# Find the first response
index = responses.find(endToken, parsedChars)

while index > 0:
    # Extract the response
    response = responses[parsedChars : index + len(endToken)]
    # Parse it
    parsedResponse = parseResponse(response)

    # Is the response valid?
    if parsedResponse is None or len(response.split(checkToken)) != 2:
        # This happens when there is a missing </response> delimiter or some other formatting problem,
        # which clutters up this response and the next one
        skipped = skipped + 2
    else:
        # Compare case-insensitively against the accepted category names
        if parsedResponse.group(2).lower() not in [c.lower() for c in acceptableClassifications]:
            # Make sure the classification is one we expect
            print("Invalid Classification: {0}".format(parsedResponse.group(2)))
            invalid = invalid + 1
        else:
            # Write a valid line to the output file; enclose the reason in double quotes because it contains commas
            output.write(
                '{0},{1},"{2}"\n'.format(
                    parsedResponse.group(1),
                    parsedResponse.group(2),
                    parsedResponse.group(3),
                )
            )

    # Move the pointer past where we parsed and update the counter
    parsedChars = index + len(endToken)
    responseCount = responseCount + 1

    # Find the next response
    index = responses.find(endToken, parsedChars)

print("skipped {0} of {1} responses".format(skipped, responseCount))
print("{0} of these were invalid".format(invalid))

Most mislabeled cases were close calls or had very similar traits. For example, when a customer described a problem, the support agent suggested possible solutions and asked for logs in order to troubleshoot. However, the customer self-resolved the case and so the resolution details weren’t conclusive. For this scenario, the root cause prediction was inaccurate. In our experiment, Anthropic’s Claude labeled these cases as Software Defects, but the most likely scenario is that the customer figured it out for themselves and never followed up.
Continuing to fine-tune the prompt, adjusting examples and incrementally adding such scenarios, can help achieve over 90% prediction accuracy, as we confirmed in our experimentation. The following code is an example of how to adjust the prompt and add a few more bad examples:

<example>
<example data>
Subject: Unable to configure custom routing rules in application gateway
Customer: Our team can’t set up routing rules in the application gateway. We’ve tried following the documentation but the traffic isn’t being directed as expected. This is blocking our production deployment.
Agent: I understand you’re having difficulties with routing rules configuration. To better assist you, could you please provide:
Current routing rule configuration
Application gateway logs
Expected traffic flow diagram
[No response from customer for 5 business days – Case closed by customer]
</example data>
    <example output>
      <classification>
       Software Defect
      </classification>
 <explanation>
Classification should be Customer Education and not Software Defect. The agent acknowledges the problem and asks the customer for additional information to troubleshoot, however, the customer does not reply and closes the case. Cases where the agent tells the customer how to solve the problem and provides documentation or asks for further details to troubleshoot but the customer self-resolves the case should be labeled Customer Education.
</explanation>
</example>

With the preceding adjustments and refinement to the prompt, we consistently obtained over 90% accuracy and noted that a few miscategorized cases were close calls where humans chose multiple categories including the one Anthropic’s Claude chose. See the appendix at the end of this post for the final prompt.
Run batch inference at scale with AutoGluon Multimodal
As illustrated in the previous sections, by crafting a well-defined and tailored prompt, Amazon Bedrock can help automate generation of ground truth data with balanced categories. This ground truth data is necessary to train the supervised learning model for a multiclass classification use case. We suggest taking advantage of the preprocessing capabilities of SageMaker to further refine the fields, encoding them into a format that's optimal for model ingestion. The manifest files can be set up as the catalyst, triggering an AWS Lambda function that sets the entire SageMaker pipeline into action. This end-to-end process seamlessly handles data inference and stores the results in Amazon Simple Storage Service (Amazon S3). We recommend AutoGluon Multimodal for training and prediction, and deploying the model in a batch inference pipeline to predict the root cause for new or updated support cases at scale on a daily cadence.
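As a rough sketch of what the training and batch prediction steps could look like with AutoGluon Multimodal (the column names and file paths are illustrative, and hyperparameters are left at their defaults):

import pandas as pd
from autogluon.multimodal import MultiModalPredictor

# Ground truth generated with Amazon Bedrock: case text plus the root cause label
train_df = pd.read_csv('ground_truth_train.csv')  # hypothetical file

# Train a text classifier on the case correspondence to predict the root cause category
predictor = MultiModalPredictor(label='RootCause', problem_type='multiclass')
predictor.fit(train_data=train_df, time_limit=3600)

# Batch inference on new or updated support cases
new_cases = pd.read_csv('daily_case_batch.csv')  # hypothetical file
new_cases['PredictedRootCause'] = predictor.predict(new_cases)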
Clean up
To prevent unnecessary expenses, it’s essential to properly decommission all provisioned resources. This cleanup process involves stopping notebook instances and deleting JupyterLab spaces, SageMaker domains, S3 bucket, IAM role, and associated user profiles. Refer to Clean up Amazon SageMaker notebook instance resources for details.
Conclusion
This post explored how Amazon Bedrock and advanced prompt engineering can generate high-quality labeled data for training ML models. Specifically, we focused on a use case of predicting the root cause category for customer support cases, a multiclass classification problem. Traditional approaches to generating labeled data for such problems are often prohibitively expensive, time-consuming, and prone to class imbalances. Amazon Bedrock, guided by XML prompt engineering, demonstrated the ability to generate balanced labeled datasets, at a lower cost, with over 90% accuracy for the experiment, and can help overcome labeling challenges for training categorical models for real-world use cases.
The following are our key takeaways:

Generative AI can simplify labeled data generation for complex multiclass classification problems
Prompt engineering is crucial for guiding LLMs to achieve desired outputs accurately
An iterative approach, incorporating good/bad examples and specific instructions, can significantly improve model performance
The generated labeled data can be integrated into ML pipelines for scalable inference and prediction using AutoML multimodal supervised learning algorithms for batch inference

Review your ground truth training costs, in terms of the time, effort, and service costs of HIL labeling, and do a comparative analysis with Amazon Bedrock to plan your next categorical model training at scale.
Appendix
The following code is the final prompt:

You are a Support Agent and an expert in the enterprise application software. You will be classifying the customer support cases into one of the 6 categories, based on the given interaction between the Support Agent and a customer. You can only choose ONE Category from the list below. You follow instructions well, step by step, and evaluate the categories in the order they appear in the prompt when making a decision.
 
The categories are defined as:
 
<categories>
 
<category>
<name>
“Billing Inquiry”
</name>
<description>
“Billing Inquiry” cases are the ones related to Account or Billing inquiries and questions related to charges, savings, or discounts. It also includes requests to provide guidance on account closing, request for Credit, cancellation requests, billing questions, and questions about discounts.
</description>
</category>
 
<category>
<name>
“Security Awareness”
</name>
<description>
“Security Awareness” cases are the cases associated with a security related incident. Security Awareness cases include exposed credentials, mitigating a security vulnerability, DDoS attacks, security concerns related to malicious traffic. Note that general security questions where the agent is helping to educate the user on the best practice such as SSO or MFA configuration, Security guidelines, or setting permissions for users and roles should be labeled as Customer Education and not Security Awareness.
</description>
</category>
 
<category>
<name>
“Feature Request”
</name>
<description>
“Feature Request” are the cases where the customer is experiencing a limitation in the application software and asking for a feature they want to have. Customer highlights a limitation and is requesting for the capability. For a Feature Request case, the support agent typically acknowledges that the question or expectation is a feature request for the software. Agent may use words such as the functionality or feature does not exist or it is currently not supported.
</description>
</category>
 
<category>
<name>
“Software Defect”
</name>
<description>
“Software Defect” are cases where the application does not work as expected. The support agent confirms this through analysis and troubleshooting and mentions internal team is working on a fix or patch to address the bug or defect.
</description>
</category>
 
<category>
<name>
“Documentation Improvement”
</name>
<description>
“Documentation Improvement” are cases where there is a lack of documentation, incorrect documentation, or insufficient documentation and when the case is not attributed to a Software Defect or a Feature Request. In Documentation Improvement cases the agent acknowledges the application documentation is incomplete or not up to date, or that they will ask documentation team to improve the documentation. For Documentation Improvement cases, the agent may suggest a workaround that is not part of application documentation and does not reference the standard application documentation or link. References to workarounds or sources such as Github or Stack Overflow, when used as an example of a solution, are examples of a Documentation Improvement case because the details and examples are missing from the official documentation.
</description>
</category>
 
<category>
<name>
“Customer Education”
</name>
<description>
“Customer Education” cases are cases where the customer could have resolved the case information using the existing application documentation. In these cases, the agent is educating the customer they are not using the feature correctly or have an incorrect configuration, while guiding them to the documentation. Customer Education cases include scenario where an agent provides troubleshooting steps for a problem or answers a question and provides links to the official application documentation. User Education cases include cases when the customer asks for best practices and agent provides knowledge article links to the support center documentation. Customer Education also includes cases created by the agent or application developers to suggest and educate the customer on a change to reduce cost, improve security, or improve application performance. Customer Education cases include cases where the customer asks a question or requests help with an error or configuration and the agent guides them appropriately with steps or documentation links. Customer Education cases also include the cases where the customer is using an unsupported configuration or version that may be End Of Life (EOL). Customer Education cases also include inconclusive cases where the customer reported an issue with the application but the case is closed without resolution details.
</description>
</category>
 
</categories>
 
Here are some good examples with reasoning:
 
<good examples>
 
<example>
<example data>
Customer: “I noticed unexpected charges of $12,500 on our latest invoice, which is significantly higher than our usual $7,000 monthly spend. We haven’t added new users, so I’m concerned about this increase.”
Support: “I understand your concern about the increased charges. Upon review, I see that 50 Premium Sales Cloud licenses were automatically activated on January 15th when your sandbox environments were refreshed. I can help adjust your sandbox configuration and discuss Enterprise License Agreement options to optimize costs.”
Customer: “Thank you for clarifying. Please tell me more about the Enterprise License options.”
</example data>
<example output>
<classification>
“Billing Inquiry”
</classification>
<explanation>
Customer is asking a question to clarify the unexpected increase in their billing statement charge and the agent explains why this occurred. The customer wants to learn more about ways to optimize costs.
</explanation>
 
<example>
<example data>
Customer: “URGENT: We’ve detected unauthorized API calls from an unknown IP address accessing sensitive customer data in our production environment. Our monitoring shows 1000+ suspicious requests in the last hour.”
Support: “I understand the severity of this security incident. I’ve immediately revoked the compromised API credentials and initiated our security protocol. The suspicious traffic has been blocked. I’m escalating this to our Security team for forensic analysis. I’ll stay engaged until this is resolved.”
</example data>
<example output>
<classification>
“Security Awareness”
</classification>
<explanation>
Customer reported unauthorized API calls and suspicious requests. The agent confirms revoking compromised API credentials and initiating the protocol.
</explanation>
 
<example>
<example data>
Customer: “Is there a way to create custom notification templates for different user groups? We need department-specific alert formats, but I can only find a single global template option.”
Support: “I understand you’re looking to customize notification templates per user group. Currently, this functionality isn’t supported in our platform – we only offer the global template system. I’ll submit this as a feature request to our product team. In the meantime, I can suggest using notification tags as a workaround.”
Customer: “Thanks, please add my vote for this feature.”
</example data>
<example output>
<classification>
“Feature Request”
</classification>
<explanation>
Customer is asking for a new feature to have custom notification templates for different user groups since they have a use case that is currently not supported by the application. The agent confirms the functionality does not exist and mentions submitting a feature request to the product team.
</explanation>
 
<example>
<example data>
Customer: “Our data pipeline jobs are failing with a ‘memory allocation error’ during the aggregation phase. This started occurring after upgrading to version 4.2.1. The same ETL workflows were running fine before the upgrade. We’ve verified our infrastructure meets all requirements.”
Support: “After analyzing the logs, we’ve confirmed a memory leak in the aggregation module – a regression introduced in 4.2.1. Engineering has identified the root cause and is developing an emergency patch. We expect to release version 4.2.2 within 48 hours to resolve this issue.”
</example data>
<example output>
<classification>
“Software Defect”
</classification>
<explanation>
Customer is reporting a data processing exception with a specific version and the agent confirms this is a regression and defect. The agent confirms that engineering is working to provide an emergency patch for the issue.
</explanation>
 
<example>
<example data>
Customer: “The data export function is failing consistently when we include custom fields. The export starts but crashes at 45% with error code DB-7721. This worked fine last week before the latest release.”
Support: “I’ve reproduced the issue in our test environment and confirmed this is a bug introduced in version 4.2.1. Our engineering team has identified the root cause – a query optimization error affecting custom field exports. They’re working on a hotfix (patch 4.2.1.3).”
Customer: “Please notify when fixed.”
</example data>
<example output>
<classification>
“Software Defect”
</classification>
<explanation>
This is a Software Defect as the data export function is not working as expected to export the custom fields. The agent acknowledged the issue and confirmed engineering is working on a hotfix.
</explanation>
 
<example>
<example data>
Customer: “I’m trying to implement the batch processing API but the documentation doesn’t explain how to handle partial failures or provide retry examples. The current docs only show basic success scenarios.”
Support: “The documentation is lacking detailed error handling examples for batch processing. I’ll submit this to our documentation team to add comprehensive retry logic examples and partial failure scenarios. For now, I can share a working code snippet that demonstrates proper error handling and retry mechanisms.”
Customer: “Thanks, the code example would help.”
</example data>
<example output>
<classification>
Documentation Improvement
</classification>
<explanation>
The agent acknowledges the gap in the documentation and mentions they will pass on this to the documentation team for further improvements. Agent mentions providing a working code snippet with retry examples.
</explanation>
 
<example>
<example data>
Customer: “We can’t get our SSO integration working. The login keeps failing and we’re not sure what’s wrong with our configuration.”
Support: “I can help guide you through the SSO setup. Looking at your configuration, I notice the SAML assertion isn’t properly formatted. Please follow our step-by-step SSO configuration guide here [link to docs]. Pay special attention to section 3.2 about SAML attributes. The guide includes validation steps to ensure proper integration.”
Customer: “Found the issue in section 3.2. Working now, thanks!”
</example data>
<example output>
<classification>
Customer Education
</classification>
<explanation>
Customer is asking for help and guidance to get their SSO integration working. The agent went over the details and presented the steps along necessary along with the documentation links.
</explanation>
 
</good examples>
 
Here are some examples for wrong classification with reasoning:
 
<bad examples>
 
<example>
<example data>
Customer: “We want to enhance our application security. Currently, each team member has individual login credentials. What’s the recommended approach?”
Support: “I recommend implementing SAML-based SSO with your existing identity provider. This will:
Centralize authentication
Enable MFA enforcement
Streamline user provisioning
Enhance security auditing
</example data>
<example output>
<classification>
“Security Awareness”
</classification>
<explanation>
Classification should be Customer Education and not Security Awareness. General security questions where the agent is helping to educate the user such as Security guidelines and best practices, should be labeled as Customer Education.
</explanation>
</example>
 
<example>
<example data>
Customer: “Our SAP invoices aren’t syncing instantly with Salesforce opportunities. We’ve configured MuleSoft Composer as per documentation, but updates only happen intermittently.”
Support: “I understand you’re looking for real-time synchronization. Currently, MuleSoft Composer’s fastest sync interval is 15 minutes by design. While I can help optimize your current setup, I’ll submit a feature request for real-time sync capability. Here’s how to optimize the current polling interval: doc link”
</example data>
<example output>
<classification>
Customer Education
</classification>
<explanation>
Classification should be Feature Request and not Customer Education. The agent tells the customer that fastest sync interval is 15 minutes by design. The agent also points out they will submit a Feature Request. Cases where the customer ask for features should be classified as Feature Request.
</explanation>
</example>
 
<example>
<example data>
Customer: “Our sales ETL pipeline keeps timing out with error ‘V_001’ at the transform step. This was working perfectly before.”
Support: “I’ve analyzed your configuration. The timeout occurs because the transformation spans 5 years of data containing 23 cross-object formula fields and is running without filters. Please implement these optimization steps from our documentation: Document link on ETL performance”
</example data>
<example output>
<classification>
Software Defect
</classification>
<explanation>
Classification should be Customer Education and not Software Defect. The agent tells the user that timeout is caused by misconfiguration and needs to be restricted using filters. The agent provides documentation explaining how to troubleshoot the issue. Cases where the agent tells the user how to solve the problem and provides documentation should be labeled Customer Education.
</explanation>
</example>
 
<example>
<example data>
Customer: “We are trying to deploy a custom workflow template but receiving this error: Resource handler returned message: ‘Error: Multiple or missing values for mandatory single-value field, Field: ACTION_TYPE, Parameter: Workflow Action (Status Code: 400, Request ID: TKT-2481-49bc)’ when deploying through Flow Designer.”
Support: “I’ve reviewed your Flow Designer deployment (instance: dev85xxx.xxx.com/flow/TKT-2481-49bc) which failed to create a Workflow Action resource. This error occurs when the action configuration is ambiguous. After checking the Flow Designer documentation [1], each Action Step in your template must define exactly one ‘Action Type’ attribute. The Flow Designer documentation [2] specifies that each workflow action requires a single, explicit action type definition. You cannot have multiple or undefined action types in a single step. This is similar to an issue reported in the Product Community [3]. Please review your workflow template and ensure each action step has exactly one defined Action Type. The documentation provides detailed configuration examples at [4]. Let me know if you need any clarification on implementing these changes.
</example data>
<example output>
<classification>
Documentation Improvement
</classification>
<explanation>
Classification should be Customer Education and not Documentation Improvement. The agent tells the user they have to change the action configuration and define an Action type attribute. Cases where the agent tells the user how to solve problem and provides documentation should be classified Customer Education.
</explanation>
</example>
 
</bad examples>
 
Given the above categories defined in XML, logically think through which category fits best and then complete the classification. Provide a response in XML with the following elements: classification, explanation (limited to 2 sentences). Return your results as this sample output XML below and do not append your thought process to the response.
 
<response>
<classification> Software Defect </classification>
<explanation> The support case is for ETL Pipeline Performance Degradation where the customer reports their nightly data transformation job takes 6 hours to complete instead of 2 hours before but no changes to configuration occurred. The agent mentions Engineering confirmed memory leak in version 5.1.2 and are deploying a Hotfix indicating this is a Software Defect.
</explanation>
</response>
 
Here is the conversation you need to categorize:

About the Authors
Sumeet Kumar is a Sr. Enterprise Support Manager at AWS leading the technical and strategic advisory team of TAM builders for automotive and manufacturing customers. He has diverse support operations experience and is passionate about creating innovative solutions using AI/ML.
Andy Brand is a Principal Technical Account Manager at AWS, where he helps education customers develop secure, performant, and cost-effective cloud solutions. With over 40 years of experience building, operating, and supporting enterprise software, he has a proven track record of addressing complex challenges.
Tom Coombs is a Principal Technical Account Manager at AWS, based in Switzerland. In Tom’s role, he helps enterprise AWS customers operate effectively in the cloud. From a development background, he specializes in machine learning and sustainability.
Ramu Ponugumati is a Sr. Technical Account Manager and a specialist in analytics and AI/ML at AWS. He works with enterprise customers to modernize and cost optimize workloads, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, playing badminton, and hiking.

Google DeepMind Researchers Propose CaMeL: A Robust Defense that Creat …

Large Language Models (LLMs) are becoming integral to modern technology, driving agentic systems that interact dynamically with external environments. Despite their impressive capabilities, LLMs are highly vulnerable to prompt injection attacks. These attacks occur when adversaries inject malicious instructions through untrusted data sources, aiming to compromise the system by extracting sensitive data or executing harmful operations. Traditional security methods, such as model training and prompt engineering, have shown limited effectiveness, underscoring the urgent need for robust defenses.

Google DeepMind Researchers propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models may be susceptible to attacks. Unlike traditional approaches that require retraining or model modifications, CaMeL introduces a new paradigm inspired by proven software security practices. It explicitly extracts control and data flows from user queries, ensuring untrusted inputs never alter program logic directly. This design isolates potentially harmful data, preventing it from influencing the decision-making processes inherent to LLM agents.

Technically, CaMeL functions by employing a dual-model architecture: a Privileged LLM and a Quarantined LLM. The Privileged LLM orchestrates the overall task, isolating sensitive operations from potentially harmful data. The Quarantined LLM processes data separately and is explicitly stripped of tool-calling capabilities to limit potential damage. CaMeL further strengthens security by assigning metadata or “capabilities” to each data value, defining strict policies about how each piece of information can be utilized. A custom Python interpreter enforces these fine-grained security policies, monitoring data provenance and ensuring compliance through explicit control-flow constraints.
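
To make the capability idea concrete, the following is a deliberately simplified, illustrative sketch (not CaMeL’s actual implementation) of tagging values with provenance metadata and enforcing a policy before a tool call:

from dataclasses import dataclass, field

@dataclass
class TaggedValue:
    value: str
    sources: set = field(default_factory=set)  # provenance 'capabilities', e.g. {'user'} or {'untrusted_email'}

def send_email(recipient: TaggedValue, body: TaggedValue):
    # Policy: the recipient address must originate only from trusted user input,
    # never from untrusted retrieved content such as email bodies
    if not recipient.sources <= {'user'}:
        raise PermissionError(f'Blocked: recipient derived from untrusted sources {recipient.sources}')
    print(f'Sending to {recipient.value}: {body.value[:40]}...')

trusted = TaggedValue('alice@example.com', {'user'})
injected = TaggedValue('attacker@evil.test', {'untrusted_email'})  # value planted by a prompt injection

send_email(trusted, TaggedValue('Quarterly report attached.', {'user'}))
try:
    send_email(injected, TaggedValue('Quarterly report attached.', {'user'}))
except PermissionError as err:
    print(err)  # the policy blocks the untrusted recipient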

Results from empirical evaluation using the AgentDojo benchmark highlight CaMeL’s effectiveness. In controlled tests, CaMeL successfully thwarted prompt injection attacks by enforcing security policies at granular levels. The system demonstrated the ability to maintain functionality, solving 67% of tasks securely within the AgentDojo framework. Compared with other defenses such as “Prompt Sandwiching” and “Spotlighting,” CaMeL performed significantly better on security, providing near-total protection against attacks while incurring moderate overhead. The overhead primarily manifests in token usage, with approximately a 2.82× increase in input tokens and a 2.73× increase in output tokens, which is acceptable considering the security guarantees provided.

Moreover, CaMeL addresses subtle vulnerabilities, such as data-to-control flow manipulations, by strictly managing dependencies through its metadata-based policies. For instance, a scenario where an adversary attempts to leverage benign-looking instructions from email data to control the system execution flow would be mitigated effectively by CaMeL’s rigorous data tagging and policy enforcement mechanisms. This comprehensive protection is essential, given that conventional methods might fail to recognize such indirect manipulation threats.

In conclusion, CaMeL represents a significant advancement in securing LLM-driven agentic systems. Its ability to robustly enforce security policies without altering the underlying LLM offers a powerful and flexible approach to defending against prompt injection attacks. By adopting principles from traditional software security, CaMeL not only mitigates explicit prompt injection risks but also safeguards against sophisticated attacks leveraging indirect data manipulation. As LLM integration expands into sensitive applications, adopting CaMeL could be vital in maintaining user trust and ensuring secure interactions within complex digital ecosystems.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

The post Google DeepMind Researchers Propose CaMeL: A Robust Defense that Creates a Protective System Layer around the LLM, Securing It even when Underlying Models may be Susceptible to Attacks appeared first on MarkTechPost.

This AI Paper Introduces PLAN-AND-ACT: A Modular Framework for Long-Ho …

Large language models are powering a new wave of digital agents to handle sophisticated web-based tasks. These agents are expected to interpret user instructions, navigate interfaces, and execute complex commands in ever-changing environments. The difficulty lies not in understanding language but in translating that understanding into precise, sequenced actions while adapting to dynamic contexts. Success for long-horizon tasks like booking travel or retrieving specific web data depends on managing a sequence of steps that evolves with each action. Despite major progress in language capabilities, creating agents that can effectively plan and adapt at each step remains an unsolved problem.

Composing broad goals into actionable steps is a major issue in building such agents. When a user requests “follow the top contributor of this GitHub project,” the agent must interpret the command and determine how to navigate to the contributors section, identify the relevant person, and initiate the follow action. This task becomes even more complex in dynamic environments where content may shift between executions. Without a clear planning and updating strategy, agents can make inconsistent decisions or fail entirely. The scarcity of training data showing how to plan and execute long tasks correctly adds another layer of difficulty.

Previously, researchers attempted to address these issues with models that either relied on single-agent strategies or applied reinforcement learning to guide actions. Single-agent systems like ReAct attempted to merge reasoning and execution but often faltered as the model was overwhelmed by thinking and acting at once. Reinforcement learning approaches showed promise but proved unstable and highly sensitive to environment-specific tuning. Collecting training data for these methods required extensive interaction with environments, making it time-consuming and impractical to scale. These methods also struggled to maintain performance consistency when tasks changed mid-process.

Researchers from UC Berkeley, the University of Tokyo, and ICSI introduced a new PLAN-AND-ACT system. Companies like Apple, Nvidia, Microsoft, and Intel supported the work. This framework splits task planning and execution into two modules: a PLANNER and an EXECUTOR. The PLANNER is tasked with creating a structured plan based on the user’s request, essentially outlining what steps need to be taken. The EXECUTOR then translates each step into environment-specific actions. By separating these responsibilities, the system allows the PLANNER to focus on strategy while the EXECUTOR handles execution, improving the reliability of both components. This modular design marks a significant shift from previous approaches.
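
As a rough illustration of this split (a conceptual sketch, not the paper's code), the following Python example uses a hypothetical call_llm helper, stubbed here with canned responses so it runs end to end, to show how the PLANNER's numbered steps feed an EXECUTOR that emits one environment action per step.

# Conceptual planner/executor sketch; call_llm is a hypothetical stand-in for an
# LLM client and returns canned text so the example is runnable.
def call_llm(system_prompt: str, user_prompt: str) -> str:
    if "planner" in system_prompt.lower():
        return "1. Open the Contributors page\n2. Identify the top contributor\n3. Follow that user"
    return 'click("follow-button")'

def planner(user_request: str, page_state: str = "") -> list[str]:
    """PLANNER: turn a high-level request plus current page state into ordered steps."""
    plan_text = call_llm(
        "You are a web-task planner. Output one numbered step per line.",
        f"Request: {user_request}\nCurrent state: {page_state}",
    )
    return [line.strip() for line in plan_text.splitlines() if line.strip()]

def executor(step: str, page_state: str) -> str:
    """EXECUTOR: translate a single plan step into an environment-specific action."""
    return call_llm(
        "You are a web-action executor. Output exactly one action.",
        f"Step: {step}\nPage: {page_state}",
    )

for step in planner("Follow the top contributor of this GitHub project"):
    action = executor(step, page_state="<html>...</html>")
    print(f"{step} -> {action}")
    # In the full system the environment is updated here, and the PLANNER can be
    # re-invoked (dynamic replanning) when the page changes.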

The methodology behind PLAN-AND-ACT is detailed and focuses heavily on scalable training. Since human-annotated planning data is limited, researchers introduced a synthetic data generation pipeline. They began by collecting action trajectories from simulated agents—sequences of clicks, inputs, and responses. Large language models then analyzed these trajectories to reconstruct high-level plans grounded in actual outcomes. For example, a plan might specify identifying the top contributor, while the actions linked to it include clicking the “Contributors” tab and parsing the resulting HTML. The team expanded their dataset with 10,000 additional synthetic plans and then generated 5,000 more targeted plans based on failure analysis. This synthetic training method saved time and produced high-quality data that reflected real execution needs.
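
A minimal sketch of that reconstruction step is shown below; annotate_plan is a hypothetical helper whose stubbed return value stands in for the LLM call the authors describe.

# Grounded-plan reconstruction sketch: an LLM reads a recorded action trajectory
# and writes the high-level plan that explains it.
def annotate_plan(trajectory: list[str]) -> str:
    # A real pipeline would prompt an LLM with the trajectory here, asking for the
    # concise high-level plan (one step per line) that the actions accomplish.
    return "1. Open the repository\n2. Find the top contributor\n3. Follow them"

trajectory = [
    'goto("https://github.com/org/repo")',
    'click("Contributors tab")',
    'click("first contributor link")',
    'click("Follow button")',
]
print(annotate_plan(trajectory))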

In testing, PLAN-AND-ACT achieved a task success rate of 53.94% on the WebArena-Lite benchmark, surpassing the previous best result of 49.1% from WebRL. Without any planner, a base executor achieved only 9.85%. Adding a non-finetuned planner boosted performance to 29.63%, while finetuning on 10,000 synthetic plans brought results up to 44.24%. Incorporating dynamic replanning added a final 10.31% performance gain. Across all experiments, the data showed that most performance improvements came from enhancing the PLANNER rather than the EXECUTOR. Even with a base EXECUTOR, having a strong PLANNER led to substantial success rate increases, validating the researchers’ hypothesis that separating planning and execution yields better task outcomes.

In conclusion, this paper highlights how identifying the gap between goal understanding and environment interaction can lead to more effective AI systems. By focusing on structured planning and scalable data generation, the researchers proposed a method that solves a specific problem and demonstrates a framework that can extend to broader applications. PLAN-AND-ACT shows that effective planning, not just execution, is critical to AI agent success in complex environments.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

The post This AI Paper Introduces PLAN-AND-ACT: A Modular Framework for Long-Horizon Planning in Web-Based Language Agents appeared first on MarkTechPost.

DeepSeek AI Unveils DeepSeek-V3-0324: Blazing Fast Performance on Mac …

Artificial intelligence (AI) has made significant strides in recent years, yet challenges persist in achieving efficient, cost-effective, and high-performance models. Developing large language models (LLMs) often requires substantial computational resources and financial investment, which can be prohibitive for many organizations. Additionally, ensuring that these models possess strong reasoning capabilities and can be deployed effectively on consumer-grade hardware remains a hurdle.​

DeepSeek AI has addressed these challenges head-on with the release of DeepSeek-V3-0324, a significant upgrade to its V3 large language model. This new model not only enhances performance but also operates at an impressive speed of 20 tokens per second on a Mac Studio, a consumer-grade device. This advancement intensifies the competition with industry leaders like OpenAI, showcasing DeepSeek’s commitment to making high-quality AI models more accessible and efficient. ​

DeepSeek-V3-0324 introduces several technical improvements over its predecessor. Notably, it demonstrates significant enhancements in reasoning capabilities, with benchmark scores showing substantial increases:

MMLU-Pro: 75.9 → 81.2 (+5.3)
GPQA: 59.1 → 68.4 (+9.3)
AIME: 39.6 → 59.4 (+19.8)
LiveCodeBench: 39.2 → 49.2 (+10.0)

These improvements indicate a more robust understanding and processing of complex tasks. Additionally, the model has enhanced front-end web development skills, producing more executable code and aesthetically pleasing web pages and game interfaces. Its Chinese writing proficiency has also seen advancements, aligning with the R1 writing style and improving the quality of medium-to-long-form content. Furthermore, function calling accuracy has been increased, addressing issues present in previous versions.

The release of DeepSeek-V3-0324 under the MIT License underscores DeepSeek AI’s dedication to open-source collaboration, allowing developers worldwide to utilize and build upon this technology without restrictive licensing constraints. The model’s ability to run efficiently on devices like the Mac Studio, achieving 20 tokens per second, exemplifies its practical applicability and efficiency. This performance level not only makes advanced AI more accessible but also reduces the dependency on expensive, specialized hardware, thereby lowering the barrier to entry for many users and organizations. ​
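
Because the weights are openly available, fetching them is straightforward; the following minimal sketch uses the huggingface_hub library, and the repository ID reflects the official release (note that the full checkpoint is very large, so plan storage accordingly).

# Minimal sketch: download the open DeepSeek-V3-0324 weights from Hugging Face.
# Assumes the official repo id "deepseek-ai/DeepSeek-V3-0324".
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3-0324",
    local_dir="./deepseek-v3-0324",
)
print(f"Weights downloaded to {local_path}")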

In conclusion, DeepSeek AI’s release of DeepSeek-V3-0324 marks a significant milestone in the AI landscape. By addressing key challenges related to performance, cost, and accessibility, DeepSeek has positioned itself as a formidable competitor to established entities like OpenAI. The model’s technical advancements and open-source availability promise to democratize AI technology further, fostering innovation and broader adoption across various sectors.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

The post DeepSeek AI Unveils DeepSeek-V3-0324: Blazing Fast Performance on Mac Studio, Heating Up the Competition with OpenAI appeared first on MarkTechPost.

Amazon SageMaker JumpStart adds fine-tuning support for models in a pr …

Amazon SageMaker JumpStart is a machine learning (ML) hub that provides pre-trained models, solution templates, and algorithms to help developers quickly get started with machine learning. Within SageMaker JumpStart, the private model hub feature allows organizations to create their own internal repository of ML models, enabling teams to share and manage models securely within their organization.
Today, we are announcing an enhanced private hub feature with several new capabilities that give organizations greater control over their ML assets. These enhancements include the ability to fine-tune SageMaker JumpStart models directly within the private hub, support for adding and managing custom-trained models, deep linking capabilities for associated notebooks, and improved model version management. These new features streamline the ML workflow by combining the convenience of pre-built solutions with the flexibility of custom development, while maintaining enterprise-grade security and governance.
For enterprise customers, the ability to curate and fine-tune both pre-built and custom models is crucial for successful AI implementation. Model curation provides quality control, compliance, and security while preventing duplicate efforts across teams. When enterprises fine-tune curated models, they can specialize general-purpose solutions for their specific industry needs and gain competitive advantages through improved performance on their proprietary data. Similarly, the ability to fine-tune custom models enables organizations to continuously improve their AI solutions, adapt to changing business conditions, and preserve institutional knowledge, while maintaining cost-efficiency.
A common enterprise scenario involves centralized data science teams developing foundation models (FMs), evaluating the performance against open source FMs, and iterating on performance. After they develop their custom FM, it can serve as a baseline for the entire organization, and individual departments—such as legal, finance, or customer service—can fine-tune these models using their department-specific data that might be subject to different privacy requirements or access controls. This hub-and-spoke approach to model development maximizes resource efficiency while allowing for specialized optimization at the department level. This comprehensive approach to model management, now supported by the enhanced private hub features in SageMaker JumpStart, enables enterprises to balance standardization with customization while maintaining proper governance and control over their ML assets.
Solution overview
SageMaker JumpStart has introduced several new enhancements to its private model hub feature, allowing administrators greater control and flexibility in managing their organization’s ML models. These enhancements include:

Fine-tuning of models referenced in the private hub – Administrators can now add models from the SageMaker JumpStart catalog to their private hub and fine-tune them using Amazon SageMaker training jobs, without having to create the models from scratch.
Support for custom models – In addition to the pre-trained SageMaker JumpStart models, administrators can now add their own custom-trained models to the private hub and fine-tune them as needed.
Deep linking of notebooks – Administrators can now deep link to specific notebooks associated with the models in the private hub, making it straightforward for users to access and work with the models.
Updating models in the private hub – The private hub now supports updating models over time as new versions or iterations become available, allowing organizations to stay current with the latest model improvements.

These new capabilities give AWS customers more control over their ML infrastructure and enable faster model deployment and experimentation, while still maintaining the appropriate access controls and permissions within their organization.
In the following sections, we provide guidance on how to use these new private model hub features using the Amazon SageMaker SDK and Amazon SageMaker Studio console.
To learn more about how to manage models using private hubs, see Manage Amazon SageMaker JumpStart foundation model access with private hubs.
Prerequisites
To use the SageMaker Python SDK and run the code associated with this post, you need the following prerequisites:

An AWS account that contains your AWS resources
An AWS Identity and Access Management (IAM) role with access to SageMaker Studio notebooks
SageMaker JumpStart enabled in a SageMaker Studio domain

Create a private hub, curate models, and configure access control
This section provides a step-by-step guide for administrators to create a private hub, curate models, and configure access control for your organization’s users.

Because this feature is integrated into the latest SageMaker Python SDK, first update the SDK so you can use granular model access control with a private hub:

!pip3 install sagemaker --force-reinstall --quiet

Next, import the SageMaker and Boto3 libraries:

import boto3
from sagemaker import Session
from sagemaker.jumpstart.hub.hub import Hub

Configure your private hub:

HUB_NAME="CompanyHub"
HUB_DISPLAY_NAME="Allowlisted Models"
HUB_DESCRIPTION="These are allowlisted models taken from the SageMaker Public Hub"
REGION="<your_region_name>"  # for example, "us-west-2"

In the preceding code, HUB_NAME specifies the name of your hub. HUB_DISPLAY_NAME is the display name for your hub that will be shown to users in UI experiences. HUB_DESCRIPTION is the description for your hub that will be shown to users.
Use an AWS Region where SageMaker JumpStart is available, as of March 2025: us-west-2, us-east-1, us-east-2, eu-west-1, eu-central-1, eu-central-2, eu-north-1, eu-south-2, me-south-1, me-central-1, ap-south-1, ap-south-2, eu-west-3, af-south-1, sa-east-1, ap-east-1, ap-northeast-2, ap-northeast-3, ap-southeast-3, ap-southeast-4, ap-southeast-5, ap-southeast-7, eu-west-2, eu-south-1, ap-northeast-1, us-west-1, ap-southeast-1, ap-southeast-2, ca-central-1, ca-west-1, cn-north-1, cn-northwest-1, il-central-1, mx-central-1, us-gov-east-1, us-gov-west-1.

Set up a Boto3 client for SageMaker:

sm_client = boto3.client("sagemaker")
session = Session(sagemaker_client=sm_client)
session.get_caller_identity_arn()

Check whether the following policy has already been added to your admin IAM role; if not, add it as an inline policy (use the Region you configured earlier):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

In addition to setting up IAM permissions for the admin role, you need to scope down permissions for your users so they can’t access public hub content.

Use the following policy to deny your users access to the public hub. It can be added as an inline policy in the user’s IAM role (use the Region you configured earlier):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "s3:*",
            "Effect": "Deny",
            "Resource": [
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
            ],
            "Condition": {
                "StringNotLike": {"s3:prefix": ["*.ipynb", "*/eula.txt"]}
            }
        },
        {
            "Action": "sagemaker:*",
            "Effect": "Deny",
            "Resource": [
                "arn:aws:sagemaker:<REGION>:aws:hub/SageMakerPublicHub",
                "arn:aws:sagemaker:<REGION>:aws:hub-content/SageMakerPublicHub/*/*"
            ]
        }
    ]
}

After you have set up the private hub configuration and permissions, you’re ready to create the private hub.

Use the following code to create the private hub within your AWS account in the Region you specified earlier:

hub = Hub(hub_name=HUB_NAME, sagemaker_session=session)

try:
    hub.create(
        description=HUB_DESCRIPTION,
        display_name=HUB_DISPLAY_NAME
    )
    print(f"Successfully created Hub with name {HUB_NAME} in {REGION}")
except Exception as e:
    if "ResourceInUse" in str(e):
        print(f"A hub with the name {HUB_NAME} already exists in your account.")
    else:
        raise e

Use describe() to verify the configuration of your hub. After your private hub is set up, you can add a reference to models from the SageMaker JumpStart public hub to your private hub. No model artifacts need to be managed by the customer. The SageMaker team will manage version or security updates. For a list of available models, refer to Built-in Algorithms with pre-trained Model Table.
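
For example, a quick way to confirm the hub settings with the describe() call mentioned above:

hub_description = hub.describe()
print(hub_description)
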
To search programmatically, run the following command:

from sagemaker.jumpstart.filters import Or

filter_value = Or(
    "framework == meta",
    "framework == deepseek"
)
models = []
next_token = None

while True:
    response = hub.list_sagemaker_public_hub_models(
        filter=filter_value,
        next_token=next_token
    )
    models.extend(response["hub_content_summaries"])
    next_token = response.get("next_token")

    if not next_token:
        break

print(models)

The filter argument is optional. For a list of filters you can apply, refer to the following GitHub repo.

Use the retrieved models from the preceding command to create model references for your private hub:

for model in models:
    print(f"Adding {model.get('hub_content_name')} to Hub")
    hub.create_model_reference(model_arn=model.get("hub_content_arn"),
                               model_name=model.get("hub_content_name"))

The SageMaker JumpStart private hub offers other useful features for managing and interacting with the curated models. Administrators can check the metadata of a specific model using the hub.describe_model(model_name=<model_name>) command. To list the available models in the private hub, you can use a simple loop:

response = hub.list_models()
models = response["hub_content_summaries"]
while response.get("next_token"):
    response = hub.list_models(next_token=response["next_token"])
    models.extend(response["hub_content_summaries"])

for model in models:
    print(model.get("HubContentArn"))

If you need to remove a specific model reference from the private hub, use the following command:

hub.delete_model_reference("<model_name>")

If you want to delete the private hub from your account and Region, you will need to delete all the HubContents first, then delete the private hub. Use the following code:

for model in models:
    hub.delete_model_reference(model_name=model.get("HubContentName"))

hub.delete()

Fine-tune models referenced in the private hub
This section walks through how to interact with allowlisted models in SageMaker JumpStart. We demonstrate how to list available models, identify a model from the public hub, and fine-tune the model using the SageMaker Python SDK as well as the SageMaker Studio UI.
User experience using the SageMaker Python SDK
To interact with your models using the SageMaker Python SDK, complete the following steps:

Just like the admin process, the first step is to force reinstall the SageMaker Python SDK:

!pip3 install sagemaker --force-reinstall --quiet

When interacting with the SageMaker SDK functions, add references to the hub_arn:

model_id="meta-vlm-llama-3-2-11b-vision"
model_version="2.1.8"
hub_arn="<YourHubARN>"

from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version, hub_arn=hub_arn
)
print(my_hyperparameters)
hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters, hub_arn=hub_arn
)

You can then start a training job by specifying the model ID, version, and hub name:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    hub_name=hub_arn,
    model_version=model_version,
    environment={"accept_eula": "false"},  # change to {"accept_eula": "true"} to accept the model EULA
    disable_output_compression=True,
    instance_type="ml.p4d.24xlarge",
    hyperparameters=my_hyperparameters,
)
estimator.fit({"training": train_data_location})
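
After the training job completes, you can typically deploy the fine-tuned model from the same estimator. The following is a minimal sketch; the instance type and request payload are illustrative and depend on the model you fine-tuned:

# Deploy the fine-tuned model to a real-time endpoint (instance type is an example).
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    accept_eula=True,  # required for gated models such as the Llama family
)

# The payload format depends on the model; this is a simple text example.
response = predictor.predict({"inputs": "Summarize the training data format."})
print(response)

# Delete the endpoint when you're done to avoid ongoing charges.
predictor.delete_endpoint()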

For a custom model, see the example notebooks in GitHub.
User experience in SageMaker Studio
Complete the following steps to interact with allowlisted models using SageMaker Studio:

On the SageMaker Studio console, choose JumpStart in the navigation pane or in the Prebuilt and automated solutions section.
Choose one of the model hubs you have access to.

If you have access to multiple hubs, you will see a list of hubs, as shown in the following screenshot.

If you have access to only one hub, you will be redirected to the model list.

To fine-tune a model, choose Train (this option will be enabled if it’s supported).
Modify your training job configurations like training data, instance type, and hyperparameters, and choose Submit.

Deep link notebooks in the private hub
You can now also access the notebook associated with the model in your curated hub.

Choose your model, then choose Preview notebooks.
Choose Open in JupyterLab to start the deep link workflow.
Select a running JupyterLab space and choose Open notebook.

You will need to upgrade your space to use a SageMaker distribution of at least 2.4.1. For more information on how to upgrade your SageMaker distribution, see Update the SageMaker Distribution Image.

This will automatically open the selected notebook in your JupyterLab instance, with your private HubName populated in the necessary classes.

Update models in the private hub
Modify your existing private HubContent by calling the new sagemaker:UpdateHubContent API. You can now update an existing HubContent version in place without needing to delete and re-add it. Updating the HubContentDocument is not supported at this time, because such changes can be backward-incompatible and fundamentally alter the performance and usage of the model itself. Refer to the public API documentation for more details.

client.update_hub_content(
    hub_content_name="my-model",
    hub_content_version="1.0.0",
    hub_content_type="Model",
    hub_name="my-hub",
    support_status="DEPRECATED"
)

Additionally, you can modify your ModelReferences by calling the new sagemaker:UpdateHubContentReference API. Refer to the public API documentation for more usage details.

client.update_hub_content_reference(
    hub_content_name="your-model",
    hub_content_type="ModelReference",
    hub_name="my-hub",
    min_version="1.2.0"
)

Conclusion
This post demonstrated the new enhancements to the SageMaker JumpStart private model hub feature, which gives enterprise customers greater control and flexibility in managing their ML assets. The key capabilities introduced include the ability to fine-tune pre-built SageMaker JumpStart models directly within the private hub, support for importing and fine-tuning custom-trained models, deep linking to associated notebooks for streamlined access and collaboration, and improved model version management through APIs. These features enable enterprises to curate a centralized repository of trusted, specialized ML models, while still providing the flexibility for individual teams and departments to fine-tune and adapt these models to their specific needs. The seamless integration with SageMaker Studio further streamlines the model development and deployment workflow, empowering enterprises to accelerate their ML initiatives while maintaining the appropriate security and control over their ML assets.
Now that you’ve seen how the enhanced private model hub features in Amazon SageMaker JumpStart can give your organization greater control and flexibility over managing your machine learning assets, start leveraging these capabilities to curate a centralized repository of trusted models and accelerate your AI initiatives.

About the Authors
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Niris Okram is a senior academic research specialist solutions architect at AWS. He has extensive experience working with public, private and research customers on various fields related to cloud. He is passionate about designing and building systems to accelerate the customer’s mission on AWS cloud.
Benjamin Crabtree is a software engineer with the Amazon SageMaker and Bedrock teams. He is passionate about democratizing the new and frequent breakthroughs in AI. Ben received his undergraduate degree from the University of Michigan and now lives in Brooklyn, NY.
Banu Nagasundaram leads product, engineering, and strategic partnerships for SageMaker JumpStart, SageMaker’s machine learning and GenAI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Generative AI-powered game design: Accelerating early development with …

In the competitive world of game development, staying ahead of technological advancements is crucial. Generative AI has emerged as a game changer, offering unprecedented opportunities for game designers to push boundaries and create immersive virtual worlds. At the forefront of this revolution is Stability AI’s cutting-edge text-to-image AI model, Stable Diffusion 3.5 Large (SD3.5 Large), which is transforming the way we approach game environment creation.
SD3.5 Large, available in Amazon Bedrock, is Stability AI’s most advanced text-to-image model to date. With 8.1 billion parameters, this model excels at generating high-quality, 1-megapixel images from text descriptions with exceptional prompt adherence, making it ideal for creating detailed game environments at speed. Its improved architecture, based on the Multimodal Diffusion Transformer (MMDiT), combines multiple pre-trained text encoders for enhanced text understanding and uses QK-normalization to improve training stability.
The model demonstrates improved performance in image quality, typography, and complex prompt understanding. It excels at creating diverse, high-quality images across multiple styles, making it valuable for industries such as media, gaming, advertising, and education.
In this post, we explore how you can use SD3.5 Large to address practical gaming needs such as early concept art and character design.
Key improvements in SD3.5 Large compared to SD3 Large
SD3.5 Large offers the following improvements:

Enhanced photorealism – Delivers detailed 3D imagery with unprecedented realism
Superior scene complexity – Handles multiple subjects in intricate scenes with remarkable accuracy
Improved anatomical rendering – Generates more precise and natural human representations
Diverse representation – Creates images with inclusive representation of skin tones and features without extensive prompting

Real-world use cases for game environment creation
Image generation is poised to revolutionize a few key areas within the gaming industry. Firstly, it will significantly enhance the ideation and design process, allowing teams to rapidly create new scenes and objects, thereby accelerating the design cycle. Secondly, it will enable in-game content generation, empowering users to create new objects, modify avatar skins, or generate new textures. Although current adoption is more prevalent in the design phase, the continued advancement of generative AI is expected to lead to increased user-generated AI content (such as player avatars), which will substantially boost user creativity and overall gaming experience. This shift towards AI-assisted content creation in gaming promises to open up new realms of possibilities for both developers and players alike.
The following are sample prompts for creating early game worlds and their output:

A vibrant fantasy landscape featuring rolling hills, a sparkling river, and a majestic castle in the distance under a bright blue sky.

A dense tropical rainforest teeming with exotic plants and wildlife, sunlight filtering through the thick canopy, with a hidden waterfall cascading into a crystal-clear pool.

A futuristic city skyline at dusk, featuring sleek skyscrapers with neon lights and flying vehicles soaring between them, reflecting on the glassy surface of a river.

The following are sample prompts for creating early game assets and props from different angles:

An intricately designed realistic game weapon prop of a fiery blue and green blade set against a blurred background of a gargantuan temple. The blade merges geometrical design of the blade with an alien cultural aesthetic.
Close-up, side-angle view of an intricately designed realistic, game weapon prop of a fiery blue and green blade set against a blurred background of a gargantuan temple. The blade merges geometrical design of the blade with an alien cultural aesthetic.
Top-down view of an intricately designed realistic, game weapon prop of a fiery blue and green blade set against a blurred background of a gargantuan temple. The blade merges geometrical design of the blade with an alien cultural aesthetic.

Solution overview
To demonstrate the power of SD3.5 Large in game environment creation, let’s walk through a hypothetical workflow. We have provided a Jupyter notebook to deploy a sample gaming use case in the following GitHub repo. Use the us-west-2 AWS Region to run this demo.
Prerequisites
This notebook is designed to run on AWS, using Amazon Bedrock for access to both Anthropic’s Claude 3 Sonnet and Stability AI’s models. Make sure you have the following set up before moving forward:

An AWS account.
An Amazon SageMaker domain.
Access to Stability AI’s SD3.5 Large text-to-image model through the Amazon Bedrock console. For instructions, see Manage access to Amazon Bedrock foundation models.

Define the game world
Start by outlining the core concepts of your game world, including its theme, atmosphere, and key locations. For example, “Mystic Realms is set in a vibrant fantasy world where players embark on quests to uncover ancient secrets and battle mystical creatures. The game features diverse environments, including enchanted forests, mystical mountains, and forgotten ruins. The atmosphere is whimsical and magical, with bright colors and fantastical elements that evoke a sense of wonder.”
Craft detailed prompts for worlds and objects
Use natural language to describe specific environments and objects you want to create. The following screenshot shows some generated prompts.

You can also generate initial concept images with Amazon Bedrock following these steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Model catalog.
For Providers, select Stability AI, then choose Stable Diffusion 3.5 Large.
Choose Open in playground.
Enter your prompt and choose Run. A high-fidelity image will be generated in seconds.
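
If you prefer to script this step, the following is a minimal sketch using the Amazon Bedrock Runtime API with boto3; the model ID and request fields follow Stability AI’s Bedrock format but should be verified against the current Amazon Bedrock documentation:

# Sketch: generate a concept image programmatically with the Bedrock Runtime API.
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

prompt = ("A vibrant fantasy landscape featuring rolling hills, a sparkling river, "
          "and a majestic castle in the distance under a bright blue sky.")

response = bedrock.invoke_model(
    modelId="stability.sd3-5-large-v1:0",  # model ID assumed; confirm in the Bedrock console
    body=json.dumps({
        "prompt": prompt,
        "mode": "text-to-image",
        "aspect_ratio": "16:9",
        "output_format": "png",
    }),
)

result = json.loads(response["body"].read())
with open("concept_art.png", "wb") as f:
    f.write(base64.b64decode(result["images"][0]))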

Iterate and refine
After you have a base concept you’re happy with, you can generate variations to explore different possibilities for the same environment. Analyze the generated images and refine your prompts to achieve the desired results. You might want to adjust elements like lighting, color palette, or specific environmental features. Finally, use the generated images as reference material for 3D artists to create fully realized game environments.
Clean up
To avoid charges, you must stop the active SageMaker notebook instances if you used the notebook demo. For instructions, refer to Clean up Amazon SageMaker notebook instance resources.
Conclusion
Stability AI’s latest series of models represents a significant advancement in generative AI, providing game developers, designers, and content creators with a powerful tool to enhance creative workflows and explore new dimensions of visual storytelling. By using Stability AI’s capabilities, organizations can address practical gaming needs, from concept art and character design to level creation and marketing campaigns. However, it’s essential to approach this technology with a responsible and ethical mindset, considering potential biases, respecting intellectual property rights, and mitigating the risks of misuse. By embracing these models while being aware of their limitations and ethical considerations, gaming professionals can push the boundaries of what’s possible in game design and visual content creation.
To get started, check out Stability AI models available in Amazon Bedrock.

About the Authors
Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS Enterprise customers grow by understanding their goals and challenges, and guiding them on how they can architect their applications in a cloud-native manner while making sure they are resilient and scalable. She’s passionate about machine learning technologies and environmental sustainability.
Parth Patel is a Senior Solutions Architect at AWS in the San Francisco Bay Area. Parth guides customers to accelerate their journey to the cloud and help them adopt and grow on the AWS Cloud successfully. He focuses on machine learning, environmental sustainability, and application modernization.

Google AI Released Gemini 2.5 Pro Experimental: An Advanced AI Model t …

​In the evolving field of artificial intelligence, a significant challenge has been developing models that can effectively reason through complex problems, generate accurate code, and process multiple forms of data. Traditional AI systems often excel in specific tasks but struggle to generalize across diverse domains, limiting their practical applications. This fragmentation underscores the need for more integrated and versatile AI solutions.​

Addressing this, Google has introduced Gemini 2.5 Pro Experimental, an advanced AI model designed to enhance reasoning, coding, and multimodal capabilities. Building upon its predecessors, Gemini 2.5 Pro is engineered to tackle complex challenges in fields such as coding, science, and mathematics. Its multimodal design enables it to interpret and generate text, audio, images, video, and code, broadening its applicability across various sectors. ​

From a technical standpoint, Gemini 2.5 Pro incorporates advanced reasoning capabilities, allowing the model to process tasks methodically and make informed decisions. It features a substantial context window, currently supporting up to 1 million tokens, with plans to expand to 2 million tokens. This extensive context window enables the model to comprehend large datasets and address intricate problems that require synthesizing information from multiple sources. In coding applications, Gemini 2.5 Pro demonstrates proficiency by creating visually compelling web applications and efficiently performing code transformation and editing tasks.
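
For developers who want to try the model programmatically, the following is a minimal sketch using the google-genai Python SDK; the experimental model identifier shown is an assumption based on the release naming and may change:

# Sketch: query Gemini 2.5 Pro Experimental through the Gemini API.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-pro-exp-03-25",  # assumed experimental model id; check current docs
    contents="Write a Python function that checks whether a number is prime, "
             "and briefly explain your reasoning.",
)
print(response.text)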

https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#advanced-coding

Empirical evaluations highlight Gemini 2.5 Pro’s strong performance. It leads in benchmarks related to mathematics and science, such as GPQA and AIME 2025, reflecting its robust reasoning capabilities. Notably, it achieved a score of 18.8% on Humanity’s Last Exam, a dataset designed to assess advanced knowledge and reasoning. In coding benchmarks, Gemini 2.5 Pro scored 63.8% on SWE-Bench Verified, indicating its competence in agentic code evaluations. Furthermore, it topped the LMArena leaderboard by a significant margin, underscoring its advanced capabilities in multimodal reasoning, coding, and STEM fields.

In conclusion, Gemini 2.5 Pro Experimental represents a notable advancement in AI, reflecting Google’s commitment to developing more intelligent and versatile models. By integrating reasoning capabilities directly into its architecture, Gemini 2.5 Pro addresses previous limitations, offering enhanced performance and improved accuracy. Its ability to handle complex problems across coding, science, and mathematics, coupled with its multimodal proficiency, positions it as a valuable tool in the AI landscape. As AI continues to evolve, models like Gemini 2.5 Pro pave the way for more sophisticated and context-aware applications, fostering innovation across various sectors.

Check out the Technical details and Try it here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

The post Google AI Released Gemini 2.5 Pro Experimental: An Advanced AI Model that Excels in Reasoning, Coding, and Multimodal Capabilities appeared first on MarkTechPost.