Unlock model insights with log probability support for Amazon Bedrock …

You can use Amazon Bedrock Custom Model Import to seamlessly integrate your customized models—such as Llama, Mistral, and Qwen—that you have fine-tuned elsewhere into Amazon Bedrock. The experience is completely serverless, minimizing infrastructure management while providing your imported models with the same unified API access as native Amazon Bedrock models. Your custom models benefit from automatic scaling, enterprise-grade security, and native integration with Amazon Bedrock features such as Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases.
Understanding how confident a model is in its predictions is essential for building reliable AI applications, particularly when working with specialized custom models that might encounter domain-specific queries.
With log probability support now added to Custom Model Import, you can access information about your models’ confidence in their predictions at the token level. This enhancement provides greater visibility into model behavior and enables new capabilities for model evaluation, confidence scoring, and advanced filtering techniques.
In this post, we explore how log probabilities work with imported models in Amazon Bedrock. You will learn what log probabilities are, how to enable them in your API calls, and how to interpret the returned data. We also highlight practical applications—from detecting potential hallucinations to optimizing RAG systems and evaluating fine-tuned models—that demonstrate how these insights can improve your AI applications, helping you build more trustworthy solutions with your custom models.
Understanding log probabilities
In language models, a log probability represents the logarithm of the probability that the model assigns to a token in a sequence. These values indicate how confident the model is about each token it generates or processes. Log probabilities are expressed as negative numbers, with values closer to zero indicating higher confidence. For example, a log probability of -0.1 corresponds to approximately 90% confidence, while a value of -3.0 corresponds to about 5% confidence. By examining these values, you can identify when a model is highly certain versus when it’s making less confident predictions. Log probabilities provide a quantitative measure of how likely the model considered each generated token, offering valuable insight into the confidence of its output. By analyzing them, you can:

Gauge confidence across a response: Assess how confident the model was in different sections of its output, helping you identify where it was certain versus uncertain.
Score and compare outputs: Compare overall sequence likelihood (by adding or averaging log probabilities) to rank or filter multiple model outputs.
Detect potential hallucinations: Identify sudden drops in token-level confidence, which can flag segments that might require verification or review.
Reduce RAG costs with early pruning: Run short, low-cost draft generations based on retrieved contexts, compute log probabilities for those drafts, and discard low-scoring candidates early, avoiding unnecessary full-length generations or expensive reranking while keeping only the most promising contexts in the pipeline.
Build confidence-aware applications: Adapt system behavior based on certainty levels—for example, trigger clarifying prompts, provide fallback responses, or flag responses for human review.

Overall, log probabilities are a powerful tool for interpreting and debugging model responses with measurable certainty—particularly valuable for applications where understanding why a model responded in a certain way can be as important as the response itself.
Prerequisites
To use log probability support with custom model import in Amazon Bedrock, you need:

An active AWS account with access to Amazon Bedrock
A custom model created in Amazon Bedrock using the Custom Model Import feature after July 31, 2025, when log probability support was released
Appropriate AWS Identity and Access Management (IAM) permissions to invoke models through the Amazon Bedrock Runtime

Introducing log probabilities support in Amazon Bedrock
With this release, Amazon Bedrock now allows models imported using the Custom Model Import feature to return token-level log probabilities as part of the inference response.
When invoking a model through the Amazon Bedrock InvokeModel API, you can access token log probabilities by setting "return_logprobs": true in the JSON request body. With this flag enabled, the model's response includes additional fields providing log probabilities for both the prompt tokens and the generated tokens, so that you can analyze the model's confidence in its predictions. These log probabilities let you quantitatively assess how confident your custom models are when processing inputs and generating responses. The granular metrics allow for better evaluation of response quality, troubleshooting of unexpected outputs, and optimization of prompts or model configurations.
Let’s walk through an example of invoking a custom model on Amazon Bedrock with log probabilities enabled and examine the output format. Suppose you have already imported a custom model (for instance, a fine-tuned Llama 3.2 1B model) into Amazon Bedrock and have its model Amazon Resource Name (ARN). You can invoke this model using the Amazon Bedrock Runtime SDK (Boto3 for Python in this example) as shown in the following example:

import boto3, json

bedrock_runtime = boto3.client("bedrock-runtime")
model_arn = "arn:aws:bedrock:<<aws-region>>:<<account-id>>:imported-model/your-model-id"

# Define the request payload with log probabilities enabled
request_payload = {
    "prompt": "The quick brown fox jumps",
    "max_gen_len": 50,
    "temperature": 0.5,
    "stop": [".", "\n"],
    "return_logprobs": True   # Request log probabilities
}

response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps(request_payload),
    contentType="application/json",
    accept="application/json"
)

# Parse the JSON response
result = json.loads(response["body"].read())
print(json.dumps(result, indent=2))

In the preceding code, we send a prompt—"The quick brown fox jumps"—to our custom imported model. We configure standard inference parameters: a maximum generation length of 50 tokens, a temperature of 0.5 for moderate randomness, and stop conditions (either a period or a newline). The "return_logprobs": True parameter tells Amazon Bedrock to return log probabilities in the response.
The InvokeModel API returns a JSON response containing three main components: the standard generated text output, metadata about the generation process, and now log probabilities for both prompt and generated tokens. These values reveal the model’s internal confidence for each token prediction, so you can understand not just what text was produced, but how certain the model was at each step of the process. The following is an example response from the “quick brown fox jumps” prompt, showing log probabilities (appearing as negative numbers):

{
  'prompt_logprobs': [
    None,
    {'791': -3.6223082542419434, '14924': -1.184808373451233},
    {'4062': -9.256651878356934, '220': -3.6941518783569336},
    {'14198': -4.840845108032227, '323': -1.7158453464508057},
    {'39935': -0.049946799874305725},
    {'35308': -0.2087990790605545}
  ],
  'generation': ' over the lazy dog',
  'prompt_token_count': 6,
  'generation_token_count': 5,
  'stop_reason': 'stop',
  'logprobs': [
    {'927': -0.04093993827700615},
    {'279': -0.0728893131017685},
    {'16053': -0.02005653828382492},
    {'5679': -0.03769925609230995},
    {'627': -1.194122076034546}
  ]
}

The raw API response provides token IDs paired with their log probabilities. To make this data interpretable, we need to first decode the token IDs using the appropriate tokenizer (in this case, the Llama 3.2 1B tokenizer), which maps each ID back to its actual text token. Then we convert log probabilities to probabilities by applying the exponential function, translating these values into more intuitive probabilities between 0 and 1. We implemented these transformations with a small amount of custom code to produce a human-readable format where each token appears alongside its probability, making the model’s confidence in its predictions immediately clear.
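The following is a minimal sketch of that transformation. It assumes the tokenizer matching your imported model is available locally or from the Hugging Face Hub through the transformers library (the model identifier shown is illustrative) and reuses the result dictionary from the earlier invocation:

import json
import math
from transformers import AutoTokenizer

# Assumption: the tokenizer for your imported model is available; the
# identifier below is an illustrative placeholder.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def decode_logprobs(entries, tokenizer):
    """Map {token_id: logprob} dicts to readable tokens with probabilities."""
    readable = []
    for entry in entries:
        if entry is None:  # the first prompt position has no prediction
            readable.append(None)
            continue
        readable.append({
            token_id: f"'{tokenizer.decode([int(token_id)])}' (p={math.exp(logprob):.4f})"
            for token_id, logprob in entry.items()
        })
    return readable

result["prompt_logprobs"] = decode_logprobs(result["prompt_logprobs"], tokenizer)
result["logprobs"] = decode_logprobs(result["logprobs"], tokenizer)
print(result)

Applying this transformation to the earlier response produces output similar to the following: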

{'prompt_logprobs': [None,
  {'791': "'The' (p=0.0267)", '14924': "'Question' (p=0.3058)"},
  {'4062': "' quick' (p=0.0001)", '220': "' ' (p=0.0249)"},
  {'14198': "' brown' (p=0.0079)", '323': "' and' (p=0.1798)"},
  {'39935': "' fox' (p=0.9513)"},
  {'35308': "' jumps' (p=0.8116)"}],
 'generation': ' over the lazy dog',
 'prompt_token_count': 6,
 'generation_token_count': 5,
 'stop_reason': 'stop',
 'logprobs': [{'927': "' over' (p=0.9599)"},
  {'279': "' the' (p=0.9297)"},
  {'16053': "' lazy' (p=0.9801)"},
  {'5679': "' dog' (p=0.9630)"},
  {'627': "'.\n' (p=0.3030)"}]}

Let’s break down what this tells us about the model’s internal processing:

generation: This is the actual text generated by the model (in our example, it’s a continuation of the prompt that we sent to the model). This is the same field you would get normally from any model invocation.
prompt_token_count and generation_token_count: These indicate the number of tokens in the input prompt and in the output, respectively. In our example, the prompt was tokenized into six tokens, and the model generated five tokens in its completion.
stop_reason: The reason the generation stopped (“stop” means the model naturally stopped at a stop sequence or end-of-text, “length” means it hit the max token limit, and so on). In our case it shows “stop”, indicating the model stopped on its own or because of the stop condition we provided.
prompt_logprobs: This array provides log probabilities for each token in the prompt. As the model processes your input, it continuously predicts what should come next based on what it has seen so far. These values measure which tokens in your prompt were expected or surprising to the model.

The first entry is None because the very first token has no preceding context. The model cannot predict anything without prior information. Each subsequent entry contains token IDs mapped to their log probabilities. We have converted these IDs to readable text and transformed the log probabilities into percentages for easier understanding.
You can observe the model’s increasing confidence as it processes familiar sequences. For example, after seeing The quick brown, the model predicted fox with 95.1% confidence. After seeing the full context up to fox, it predicted jumps with 81.1% confidence.
Many positions show multiple tokens with their probabilities, revealing alternatives the model considered. For instance, at the second position, the model evaluated both The (2.7%) and Question (30.6%), which means the model considered both tokens viable at that position. This added visibility helps you understand where the model weighted alternatives and can reveal when it was more uncertain or had difficulty choosing from multiple options.
Notably low probabilities appear for some tokens—quick received just 0.01%—indicating the model found these words unexpected in their context.
The overall pattern tells a clear story: individual words initially received low probabilities, but as the complete quick brown fox jumps phrase emerged, the model’s confidence increased dramatically, showing it recognized this as a familiar expression.
When multiple tokens in your prompt consistently receive low probabilities, your phrasing might be unusual for the model. This uncertainty can affect the quality of completions. Using these insights, you can reformulate prompts to better align with patterns the model encountered in its training data.

logprobs: This array contains log probabilities for each token in the model’s generated output. The format is similar: a dictionary mapping token IDs to their corresponding log probabilities.

After decoding these values, we can see that the tokens over, the, lazy, and dog all have high probabilities. This demonstrates the model recognized it was completing the well-known phrase the quick brown fox jumps over the lazy dog—a common pangram that the model appears to have strong familiarity with.
In contrast, the final token (a period followed by a newline) has a much lower probability (30.3%), revealing the model’s uncertainty about how to conclude the sentence. This makes sense because the model had multiple valid options: ending the sentence with a period, continuing with additional content, or choosing another punctuation mark altogether.

Practical use cases of log probabilities
Token-level log probabilities from the Custom Model Import feature provide valuable insights into your model’s decision-making process. These metrics transform how you interact with your custom models by revealing their confidence levels for each generated token. Here are impactful ways to use these insights:
Ranking multiple completions
You can use log probabilities to quantitatively rank multiple generated outputs for the same prompt. When your application needs to choose between different possible completions—whether for summarization, translation, or creative writing—you can calculate each completion’s overall likelihood by averaging or adding the log probabilities across all its tokens.
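The calculation itself is straightforward. The following minimal sketch, which reuses the response format shown earlier, averages the token log probabilities of each candidate and ranks the candidates; candidates here is a hypothetical list of parsed InvokeModel responses for the same prompt:

def average_logprob(logprobs):
    """Average token log probability for one completion.
    Expects the 'logprobs' list returned by the imported model:
    one {token_id: logprob} dictionary per generated token."""
    values = [lp for entry in logprobs for lp in entry.values()]
    return sum(values) / len(values)

# Rank candidate completions from most to least confident
ranked = sorted(candidates, key=lambda c: average_logprob(c["logprobs"]), reverse=True)
best_completion = ranked[0]["generation"]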
Example:
Prompt: Translate the phrase “Battre le fer pendant qu’il est chaud”

Completion A: “Strike while the iron is hot” (Average log probability: -0.39)
Completion B: “Beat the iron while it is hot.” (Average log probability: -0.46)

In this example, Completion A receives a higher log probability score (closer to zero), indicating the model found this idiomatic translation more natural than the more literal Completion B. This numerical approach enables your application to automatically select the most probable output or present multiple candidates ranked by the model’s confidence level.
This ranking capability extends beyond translation to many scenarios where multiple valid outputs exist—including content generation, code completion, and creative writing—providing an objective quality metric based on the model’s confidence rather than relying solely on subjective human judgment.
Detecting hallucinations and low-confidence answers
Models might produce hallucinations—plausible-sounding but factually incorrect statements—when handling ambiguous prompts, complex queries, or topics outside their expertise. Log probabilities provide a practical way to detect these instances by revealing the model’s internal uncertainty, helping you identify potentially inaccurate information even when the output appears confident.
By analyzing token-level log probabilities, you can identify which parts of a response the model was potentially uncertain about, even when the text appears confident on the surface. This capability is especially valuable in retrieval-augmented generation (RAG) systems, where responses should be grounded in retrieved context. When a model has relevant information available, it typically generates answers with higher confidence. Conversely, low confidence across multiple tokens suggests the model might be generating content without sufficient supporting information.
Example:

Prompt:

“Explain how the Portfolio Synergy Quotient (PSQ) is applied in multi-asset investment
strategies?”

Model output:

“The PSQ is a measure of the diversification benefits of combining different asset
classes in a portfolio.”

In this example, we intentionally asked about a fictional metric—Portfolio Synergy Quotient (PSQ)—to demonstrate how log probabilities reveal uncertainty in model responses. Despite producing a professional-sounding definition for this non-existent financial concept, the token-level confidence scores tell a revealing story. The confidence scores shown below are derived by applying the exponential function to the log probabilities returned by the model.

PSQ shows medium confidence (63.8%), indicating that the model recognized the acronym format but wasn’t highly certain about this specific term.
Common finance terminology like classes (98.2%) and portfolio (92.8%) exhibit high confidence, likely because these are standard concepts widely used in financial contexts.
Critical connecting concepts show notably low confidence: measure (14.0%) and diversification (31.8%) reveal the model’s uncertainty when attempting to explain what PSQ means or does.
Functional words like is (45.9%) and of (56.6%) hover in the medium confidence levels, suggesting uncertainty about the overall structure of the explanation.

By identifying these low-confidence segments, you can implement targeted safeguards in your applications—such as flagging content for verification, retrieving additional context, generating clarifying questions, or applying confidence thresholds for sensitive information. This approach helps create more reliable AI systems that can distinguish between high-confidence knowledge and uncertain responses.
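As a simple illustration of such a safeguard, the following sketch flags generated tokens whose probability falls below a configurable threshold; the tokenizer, the threshold, and the flag count are assumptions you would tune for your own model and domain:

import math

def low_confidence_tokens(logprobs, tokenizer, threshold=0.3):
    """Return generated tokens whose probability falls below the threshold."""
    flagged = []
    for entry in logprobs:
        for token_id, logprob in entry.items():
            probability = math.exp(logprob)
            if probability < threshold:
                flagged.append((tokenizer.decode([int(token_id)]), probability))
    return flagged

# Route the response to verification if too many tokens are uncertain
if len(low_confidence_tokens(result["logprobs"], tokenizer)) > 3:
    print("Low-confidence answer: flag for review")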
Monitoring prompt quality
When engineering prompts for your application, log probabilities reveal how well the model understands your instructions. If the first few generated tokens show unusually low probabilities, it often signals that the model struggled to interpret what you are asking.
By tracking the average log probability of the initial tokens—typically the first 5–10 generated tokens—you can quantitatively measure prompt clarity. Well-structured prompts with clear context typically produce higher probabilities because the model immediately knows what to do. Vague or underspecified prompts often yield lower initial token likelihoods as the model hesitates or searches for direction.
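A minimal way to measure this, following the same pattern as the ranking helper, is to average only the first few entries of the logprobs array; basic_response and optimized_response below are hypothetical parsed responses, and the printed values correspond to the example that follows:

def initial_confidence(logprobs, n=5):
    """Average log probability of the first n generated tokens."""
    values = [lp for entry in logprobs[:n] for lp in entry.values()]
    return sum(values) / len(values)

# Compare two prompt variants for the same task
print(initial_confidence(basic_response["logprobs"]))      # e.g., -1.215
print(initial_confidence(optimized_response["logprobs"]))  # e.g., -0.333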
Example:
Prompt comparison for customer service responses:

Basic prompt:

“Write a response to this customer complaint: I ordered a laptop 2 weeks ago and it
still hasn’t arrived.”

Average log probability of first five tokens: -1.215 (lower confidence)

Optimized prompt:

“You are a senior customer service manager with expertise in conflict resolution and
customer retention. You work for a reputable electronics retailer that values
customer satisfaction above all else. Your task is to respond to the following
customer complaint with professionalism and empathy.
Customer Complaint: I ordered a laptop 2 weeks ago and it still hasn’t arrived.”

Average log probability of first five tokens: -0.333 (higher confidence)

The optimized prompt generates higher log probabilities, demonstrating that precise instructions and clear context reduce the model’s uncertainty. Rather than making absolute judgments about prompt quality, this approach lets you measure relative improvement between versions. You can directly observe how specific elements—role definitions, contextual details, and explicit expectations—increase model confidence. By systematically measuring these confidence scores across different prompt iterations, you build a quantitative framework for prompt engineering that reveals exactly when and how your instructions become unclear to the model, enabling continuous data-driven refinement.
Reducing RAG costs with early pruning
In traditional RAG implementations, systems retrieve 5–20 documents and generate complete responses using these retrieved contexts. This approach drives up inference costs because every retrieved context consumes tokens regardless of actual usefulness.
Log probabilities enable a more cost-effective alternative through early pruning. Instead of immediately processing the retrieved documents in full:

Generate draft responses based on each retrieved context
Calculate the average log probability across these short drafts
Rank contexts by their average log probability scores
Discard low-scoring contexts that fall below a confidence threshold
Generate the complete response using only the highest-confidence contexts

This approach works because contexts that contain relevant information produce higher log probabilities in the draft generation phase. When the model encounters helpful context, it generates text with greater confidence, reflected in log probabilities closer to zero. Conversely, irrelevant or tangential contexts produce more uncertain outputs with lower log probabilities.
By filtering contexts before full generation, you can reduce token consumption while maintaining or even improving answer quality. This shifts the process from a brute-force approach to a targeted pipeline that directs full generation only toward contexts where the model demonstrates genuine confidence in the source material.
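A compact sketch of this pipeline is shown below. It assumes an invoke helper that wraps the InvokeModel call from earlier and returns the parsed JSON response, and it reuses the average_logprob helper from the ranking example; the draft length, threshold, and number of contexts to keep are illustrative values you would tune:

def draft_score(context, question, max_draft_tokens=30):
    """Generate a short draft from one retrieved context and score it."""
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    draft = invoke(prompt, max_gen_len=max_draft_tokens)  # hypothetical wrapper around InvokeModel
    return average_logprob(draft["logprobs"])

def prune_contexts(contexts, question, threshold=-1.0, keep=3):
    """Keep only the contexts whose drafts the model generates with confidence."""
    scored = [(draft_score(c, question), c) for c in contexts]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for score, c in scored if score >= threshold][:keep]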
Fine-tuning evaluation
When you have fine-tuned a model for your specific domain, log probabilities offer a quantitative way to assess the effectiveness of your training. By analyzing confidence patterns in responses, you can determine if your model has developed proper calibration—showing high confidence for correct domain-specific answers and appropriate uncertainty elsewhere.
A well-calibrated fine-tuned model should assign higher probabilities to accurate information within its specialized area while maintaining lower confidence when operating outside its training domain. Problems with calibration appear in two main forms. Overconfidence occurs when the model assigns high probabilities to incorrect responses, suggesting it hasn’t properly learned the boundaries of its knowledge. Underconfidence manifests as consistently low probabilities despite generating accurate answers, indicating that training might not have sufficiently reinforced correct patterns.
By systematically testing your model across various scenarios and analyzing the log probabilities, you can identify areas needing additional training or detect potential biases in your current approach. This creates a data-driven feedback loop for iterative improvements, making sure your model performs reliably within its intended scope while maintaining appropriate boundaries around its expertise.
Getting started
Here’s how to start using log probabilities with models imported through the Amazon Bedrock Custom Model Import feature:

Enable log probabilities in your API calls: Add "return_logprobs": true to your request payload when invoking your custom imported model. This parameter works with both the InvokeModel and InvokeModelWithResponseStream APIs (a streaming sketch follows this list). Begin with familiar prompts to observe which tokens your model predicts with high confidence compared to which it finds surprising.
Analyze confidence patterns in your custom models: Examine how your fine-tuned or domain-adapted models respond to different inputs. The log probabilities reveal whether your model is appropriately calibrated for your specific domain—showing high confidence where it should be certain.
Develop confidence-aware applications: Implement practical use cases such as hallucination detection, response ranking, and content verification to make your applications more robust. For example, you can flag low-confidence sections of responses for human review or select the highest-confidence response from multiple generations.
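As referenced above, the return_logprobs flag also works with the InvokeModelWithResponseStream API. The following is a minimal sketch; the exact per-chunk fields returned for imported models can vary, so inspect a sample chunk from your own model before relying on specific keys:

# Streaming invocation with log probabilities requested
stream_response = bedrock_runtime.invoke_model_with_response_stream(
    modelId=model_arn,
    body=json.dumps({
        "prompt": "The quick brown fox jumps",
        "max_gen_len": 50,
        "return_logprobs": True
    }),
    contentType="application/json",
    accept="application/json"
)

for event in stream_response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    # Generated text and token log probabilities arrive incrementally;
    # print each chunk to see the structure your model returns.
    print(chunk)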

Conclusion
Log probability support for Amazon Bedrock Custom Model Import offers enhanced visibility into model decision-making. This feature transforms previously opaque model behavior into quantifiable confidence metrics that developers can analyze and use.
Throughout this post, we have demonstrated how to enable log probabilities in your API calls, interpret the returned data, and use these insights for practical applications. From detecting potential hallucinations and ranking multiple completions to optimizing RAG systems and evaluating fine-tuning quality, log probabilities offer tangible benefits across diverse use cases.
For customers working with customized foundation models like Llama, Mistral, or Qwen, these insights address a fundamental challenge: understanding not just what a model generates, but how confident it is in its output. This distinction becomes critical when deploying AI in domains requiring high reliability—such as finance, healthcare, or enterprise applications—where incorrect outputs can have significant consequences.
By revealing confidence patterns across different types of queries, log probabilities help you assess how well your model customizations have affected calibration, highlighting where your model excels and where it might need refinement. Whether you are evaluating fine-tuning effectiveness, debugging unexpected responses, or building systems that adapt to varying confidence levels, this capability represents an important advancement in bringing greater transparency and control to generative AI development on Amazon Bedrock.
We look forward to seeing how you use log probabilities to build more intelligent and trustworthy applications with your custom imported models. This capability demonstrates the commitment from Amazon Bedrock to provide developers with tools that enable confident innovation while delivering the scalability, security, and simplicity of a fully managed service.

About the authors
Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps organizations design, prototype, and scale AI-powered solutions in the cloud. With expertise in deep learning, scalable cloud-native systems, and multi-agent orchestration, he focuses on turning emerging innovations into production-ready architectures that drive measurable business value. He is passionate about making complex AI concepts practical and enabling customers to innovate responsibly at scale—from early experimentation to enterprise deployment. Before joining AWS, Manoj worked in consulting, delivering data science and AI solutions for enterprise clients, building end-to-end machine learning systems supported by strong MLOps practices for training, deployment, and monitoring in production.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Revendra Kumar is a Senior Software Development Engineer at Amazon Web Services. In his current role, he focuses on model hosting and inference MLOps on Amazon Bedrock. Prior to this, he worked as an engineer on hosting Quantum computers on the cloud and developing infrastructure solutions for on-premises cloud environments. Outside of his professional pursuits, Revendra enjoys staying active by playing tennis and hiking.

Migrate from Anthropic’s Claude 3.5 Sonnet to Claude 4 Sonnet on Ama …

This post is co-written with Gareth Jones from Anthropic.
Anthropic’s Claude 4 Sonnet model has launched on Amazon Bedrock, marking a significant advancement in foundation model capabilities. Consequently, the deprecation timeline for Anthropic’s Claude 3.5 Sonnet (v1 and v2) was announced. This evolution creates a dual imperative for production AI applications: the opportunity to harness enhanced performance and the operational necessity to migrate before deprecation. Organizations must treat model migrations as a core component of their AI inference strategy, because poor execution can result in service disruptions, performance regressions, and cost overruns.
This post provides a systematic approach to migrating from Anthropic’s Claude 3.5 Sonnet to Claude 4 Sonnet on Amazon Bedrock. We examine the key model differences, highlight essential migration considerations, and deliver proven best practices to transform this necessary transition into a strategic advantage that drives measurable value for your organization.
Overview of model differences
Understanding the specific changes between model versions is the first step in planning a successful migration. The migration from Anthropic’s Claude 3.5 Sonnet to Claude 4 Sonnet introduces capability and behavioral shifts that you can take advantage of:

Increased context window – Anthropic’s Claude 4 Sonnet expands the context window from 200,000 tokens to 1 million tokens (beta). This allows applications to process and reason over codebases, financial reports, or lengthy legal documents in a single prompt, simplifying complex workflows.
Native reasoning mechanisms – Whereas Anthropic’s Claude 3.5 Sonnet models rely on the chain-of-thought (CoT) prompting technique, Claude 4 models introduce built-in, API-enabled reasoning features like extended thinking and interleaved thinking. These give the model dedicated computational time to reason before answering, dramatically improving performance on complex problems.
Advanced tool use – Anthropic’s Claude 4 Sonnet significantly upgrades tool use capabilities. It can execute multiple tools in parallel and use extended thinking between tool calls, enabling more sophisticated and efficient agentic workflows compared to the sequential tool use of older models.

To find out more about the model differences, refer to the Complete Model Comparison Guide.
A successful migration hinges on a few critical technical and strategic considerations, which we discuss in the following sections. Addressing these points proactively can minimize risk and accelerate your path to production.
Prerequisites
Before you can start using Anthropic’s Claude 4 Sonnet model, you must enable access to these models in Amazon Bedrock. For instructions, see Request access to an Amazon Bedrock foundation model. Make sure you review and accept the model’s End User License Agreement (EULA) before proceeding with the model access request. It is crucial to first confirm that Claude 4 Sonnet is available in your intended AWS Region, because model support can differ depending on location. For an updated list of supported Regions, refer to Model support by AWS Region in Amazon Bedrock. Additionally, you can use cross-Region inference (CRIS) by specifying an inference profile when running model inference, which can help improve throughput and maximize resource availability. For a list of supported cross-Region inference profiles, refer to Supported Regions and models for inference profiles.
API changes and code updates
When migrating on Amazon Bedrock, you can use either the model-specific InvokeModel API or the unified Converse API.
If you use the InvokeModel API, the migration is straightforward. You only need to update the modelId in your code, because the request body structure for Anthropic’s Messages API remains consistent between versions:

Old model ID – 'anthropic.claude-3-5-sonnet-20240620-v1:0' or 'anthropic.claude-3-5-sonnet-20241022-v2:0'
New model ID – 'anthropic.claude-sonnet-4-20250514-v1:0'

If you use a CRIS profile for the model, be sure to specify the correct inference profile ID in one of the source Regions, for example, 'us.anthropic.claude-sonnet-4-20250514-v1:0'.
This migration is an excellent opportunity to switch to the Converse API. It provides a standardized request/response format for language models on Amazon Bedrock, making future migrations to different models or providers much simpler. See the following code:

import boto3

bedrock_runtime = boto3.client(service_name='bedrock-runtime')
# Simply change the model ID to switch models
response = bedrock_runtime.converse(
    modelId='us.anthropic.claude-sonnet-4-20250514-v1:0',
    messages=[{'role': 'user', 'content': [{'text': "Your prompt here"}]}],
    inferenceConfig={'maxTokens': 1024}
)

print(response['output']['message']['content'][0]['text'])

The following are some key changes:

Updated text editor tool – Developers using Anthropic’s built-in text editor tool for agentic workflows must be aware that the tool definition has been updated. The tool type is now text_editor_20250124 and the tool name is now str_replace_based_edit_tool. The computer-use-2024-10-22 beta flag in the anthropic_beta header can be removed because the text editor tool is generally available in Anthropic’s Claude 4. Any code using this tool must be updated to reflect these changes. For more details, see Anthropic defined tools.
Removed undo_edit command – The undo_edit command is no longer supported in Anthropic’s Claude 4 Sonnet. Any code using this command should be removed.
New refusal stop reason – The model now introduces a new refusal stop reason. This reason is returned when the model declines to generate content due to safety policies. Application logic that handles model responses must be updated to recognize and manage this new stop reason to help prevent unexpected behavior or errors. For example, when you receive a refusal stop reason in a multi-turn interaction, you must reset the conversation context by either removing or updating the turn that was refused.
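As a minimal illustration of that handling logic with the Converse API, the sketch below resets the conversation when a refusal is detected; messages is the running conversation history, and the exact stop reason string surfaced for refusals should be confirmed against the current Amazon Bedrock documentation:

response = bedrock_runtime.converse(
    modelId='us.anthropic.claude-sonnet-4-20250514-v1:0',
    messages=messages,
    inferenceConfig={'maxTokens': 1024}
)

if response['stopReason'] == 'refusal':
    # Remove (or rewrite) the turn that triggered the refusal so the refused
    # content does not remain in the multi-turn context before retrying.
    messages = messages[:-1]
else:
    messages.append(response['output']['message'])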

Prompt engineering and behavioral shifts
Don’t assume your existing prompts will work perfectly with the new model. Adherence to model-specific best practices is key.
Refer to Claude 4 prompt engineering best practices released by Anthropic to achieve optimal results. For example, Anthropic’s Claude models perform best when prompts are structured with XML tags (for example, <document>, </document>) to clearly separate different parts of the input.
Anthropic’s Claude 4 Sonnet is designed to follow instructions more precisely. This is a major advantage, but might mean it is less verbose or elaborative than Claude 3.5 Sonnet unless explicitly prompted to be. Your system prompts defining the AI’s persona might need adjustments.
New reasoning features
Anthropic’s Claude 4 Sonnet model’s built-in extended thinking is a powerful capability. To unlock the deep reasoning capabilities of the new models, developers can enable extended thinking by including the thinking keyword argument in their API call. But make sure to use it strategically—extended thinking incurs additional costs because reasoning tokens are billed as output tokens at standard model rates. You will be charged for the full thinking process (not just the summary you see), and costs can scale with the thinking budget you set.
We recommend enabling extended thinking for tasks requiring deep, multi-step analysis where accuracy is paramount (for example, complex scientific problems or advanced coding challenges).
We recommend disabling extended thinking for simple queries or latency-sensitive applications. The feature adds to both cost (reasoning tokens are billed) and response time. For many tasks, a well-crafted CoT prompt is still the most efficient choice.
To use extended thinking, simply update the Converse API call by adding the additionalModelRequestFields parameter with the thinking configuration. You can specify the maximum thinking tokens that the model can use by setting a value for budget_tokens. Note that maxTokens should be larger than budget_tokens for extended thinking. See the following code:

import boto3

bedrock_runtime = boto3.client(service_name='bedrock-runtime')

# Add configurations to enable extended thinking
response = bedrock_runtime.converse(
    modelId='us.anthropic.claude-sonnet-4-20250514-v1:0',
    messages=[{'role': 'user', 'content': [{'text': "Your prompt here"}]}],
    inferenceConfig={'maxTokens': 2048},
    additionalModelRequestFields={"thinking": {"type": "enabled",
     "budget_tokens": 1024}}
)

reasoning_text = response['output']['message']['content'][0]['reasoningContent']['reasoningText']['text']
assistant_response = response['output']['message']['content'][1]['text']

The following table shows an example of extended thinking and how it can improve response accuracy. We prompt Anthropic’s Claude 3.5 Sonnet v2 and Claude 4 Sonnet (with thinking enabled) with an example prompt:

Alan, Bob, Colin, Dave and Emily are standing in a circle. Alan is on Bob’s immediate left. Bob is on Colin’s immediate left. Colin is on Dave’s immediate left. Dave is on Emily’s immediate left. Who is on Alan’s immediate right?

The table shows responses from the two models. With extended thinking, the model is able to carefully reason through the problem and arrive at the correct answer.

Model: Claude 3.5 Sonnet v2
Extended thinking: N/A
Output: Let me solve this step by step. First, let’s understand what we know: Alan is on Bob’s immediate left Bob is on Colin’s immediate left Colin is on Dave’s immediate left Dave is on Emily’s immediate left Starting with any person, let’s arrange them in order: If Dave is on Emily’s immediate left, then Emily is on Dave’s immediate right If Colin is on Dave’s immediate left, then Dave is on Colin’s immediate right If Bob is on Colin’s immediate left, then Colin is on Bob’s immediate right If Alan is on Bob’s immediate left, then Bob is on Alan’s immediate right So the order going clockwise is: Alan → Bob → Colin → Dave → Emily Since they’re in a circle, Emily must be next to Alan to complete the circle. Therefore, Emily is on Alan’s immediate right. The answer is Emily.
Correct answer: No

Model: Claude 4 Sonnet
Extended thinking: Let me work through this step by step. We have 5 people standing in a circle: Alan, Bob, Colin, Dave, and Emily. Let me trace through the given information: Alan is on Bob’s immediate left Bob is on Colin’s immediate left Colin is on Dave’s immediate left Dave is on Emily’s immediate left Since they’re in a circle, I need to figure out the order. Let me start with one person and work around. If Alan is on Bob’s immediate left, that means going clockwise: Alan, then Bob. If Bob is on Colin’s immediate left, that means going clockwise: Bob, then Colin. If Colin is on Dave’s immediate left, that means going clockwise: Colin, then Dave. If Dave is on Emily’s immediate left, that means going clockwise: Dave, then Emily. So going clockwise around the circle: Alan → Bob → Colin → Dave → Emily But wait, since it’s a circle, Emily must connect back to someone. Since we have Alan → Bob → Colin → Dave → Emily, and this is a complete circle of 5 people, Emily must be followed by Alan. So the complete circle clockwise is: Alan → Bob → Colin → Dave → Emily → Alan This means Emily is on Alan’s immediate left, and therefore Alan’s immediate right would be… let me think about this carefully. If we go clockwise: Alan → Bob → Colin → Dave → Emily → (back to Alan) So Alan’s immediate right (going clockwise from Alan) is Bob. And Alan’s immediate left (going counter-clockwise from Alan) is Emily. Wait, let me double-check this by verifying the given conditions: Alan is on Bob’s immediate left ✓ (Alan comes right before Bob going clockwise) Bob is on Colin’s immediate left ✓ (Bob comes right before Colin going clockwise) Colin is on Dave’s immediate left ✓ (Colin comes right before Dave going clockwise) Dave is on Emily’s immediate left ✓ (Dave comes right before Emily going clockwise) I confirm Bob sits directly to Alan’s right in the circular arrangement.
Output: Looking at this step by step: Given information: Alan is on Bob’s immediate left Bob is on Colin’s immediate left Colin is on Dave’s immediate left Dave is on Emily’s immediate left Let me trace the clockwise order around the circle: Alan → Bob (since Alan is on Bob’s immediate left) Bob → Colin (since Bob is on Colin’s immediate left) Colin → Dave (since Colin is on Dave’s immediate left) Dave → Emily (since Dave is on Emily’s immediate left) Since they’re standing in a circle, Emily must connect back to complete the circle: Emily → Alan So the complete clockwise order is: Alan → Bob → Colin → Dave → Emily → Alan Therefore, Bob is on Alan’s immediate right.
Correct answer: Yes

With extended thinking enabled, the API will, by default, return a summary of the model’s full thought process rather than the entire chain of reasoning. This is a safeguard designed to prevent misuse while still providing the full intelligence benefits. Developers should be aware that this can result in a chunky or delayed pattern in streaming responses and should account for this in their application’s user experience.
Anthropic’s Claude 4 models feature the interleaved thinking capability when using extended thinking with tools, allowing the model to perform intermediate reasoning between tool calls and develop more nuanced analytical conclusions based on the tool results received. To enable interleaved thinking for tool calling, add the additional parameter "anthropic_beta": ["interleaved-thinking-2025-05-14"] to additionalModelRequestFields in the Converse API request.
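For example, building on the earlier extended thinking call, the interleaved thinking beta flag can be added alongside the thinking configuration; tool_config below is a placeholder for your existing Converse tool definitions:

response = bedrock_runtime.converse(
    modelId='us.anthropic.claude-sonnet-4-20250514-v1:0',
    messages=[{'role': 'user', 'content': [{'text': "Your prompt here"}]}],
    inferenceConfig={'maxTokens': 2048},
    toolConfig=tool_config,  # placeholder for your existing tool definitions
    additionalModelRequestFields={
        "thinking": {"type": "enabled", "budget_tokens": 1024},
        "anthropic_beta": ["interleaved-thinking-2025-05-14"]
    }
)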
For more details on using extended thinking, refer to Extended thinking.
Robust evaluation is non-negotiable
You must verify that the new model performs as well as or better than the old one on your specific tasks. Create a curated set of prompts and expected outputs that are highly representative of your production traffic. This custom benchmark is far more valuable than reported benchmark scores.
Integrate this dataset into an automated evaluation pipeline. This pipeline should become part of your continuous integration and continuous delivery (CI/CD) process, serving as a permanent regression suite for future model or prompt changes. For tooling, teams can use built-in solutions like Amazon Bedrock evaluations or open source evaluation frameworks like RAGAS and DeepEval.
Managing integrated safety and governance
A model’s safety profile changes with each version. This must be tested as part of an integrated system. Never test the new model in isolation. Your migration test plan is only valid if it evaluates the new model with the exact same guardrail configuration you use in production, for example, Amazon Bedrock Guardrails. A new model’s conversational style might trigger existing guardrails in unexpected ways, leading to a spike in blocked responses.
Implementing safe deployment strategies
When deploying Anthropic’s Claude 4 Sonnet to production, implement a phased rollout strategy to minimize risks. Consider shadow testing to compare model performance using mirrored traffic without impacting users, then conduct A/B testing to measure impact on business KPIs. For the actual rollout, use either a canary release approach by gradually exposing a small percentage of users to the new model, or a blue/green deployment strategy that maintains parallel environments for instant rollback capability. This structured approach facilitates safe validation of the new model’s performance while maintaining business continuity.
Conclusion
By treating the transition from Anthropic’s Claude 3.5 Sonnet (v1 and v2) to Claude 4 Sonnet as a structured engineering project, you can de-risk the migration process and unlock significant benefits. Understanding the key model differences, adapting your prompts and API calls, and implementing a robust, automated evaluation strategy are the pillars of a successful upgrade.
This transition is an opportunity to maintain business continuity and enhance your application with next-generation capabilities. We encourage you to begin your analysis and testing today.
For more details, refer to Migrating to Claude 4 and Anthropic’s Claude in Amazon Bedrock, and reach out to us for migration-specific inquiries.

About the authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Deepak Dalakoti, PhD, is a Deep Learning Architect at the Generative AI Innovation Centre in Sydney, Australia. With expertise in AI, he partners with clients to accelerate their generative AI adoption through customized, innovative solutions. Outside the world of AI, he enjoys exploring new activities and experiences.
Mahsa Paknezhad, PhD, is a Deep Learning Architect at the AWS Generative AI Innovation Center. With a focus on scalability and production readiness, Mahsa helps organizations across diverse industries harness advanced generative AI models to achieve meaningful business outcomes. Mahsa has a proven track record of successfully delivering projects specifically within the mining industry and the healthcare sector.
Nicholas Moore is a Solutions Architect at AWS, helping startups and businesses of all sizes turn ideas into reality. He specializes in cloud solutions with a focus on AI, analytics, and modern application development.
Derrick Choo is a Senior Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multi-modal systems.
Sovik Kumar Nath is an AI/ML and Generative AI Senior Solutions Architect with AWS. He has extensive experience designing end-to-end ML and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has master’s degrees from the University of South Florida and University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Gareth Jones is a Product Manager at Anthropic focused on the Claude API. His work includes collaboration with AWS on Amazon Bedrock to make Claude widely accessible to developers.

TwinMind Introduces Ear-3 Model: A New Voice AI Model that Sets New In …

TwinMind, a California-based voice AI startup, unveiled its Ear-3 speech-recognition model, claiming state-of-the-art performance on several key metrics and expanded multilingual support. The release positions Ear-3 as a competitive offering against existing ASR (automatic speech recognition) solutions from providers like Deepgram, AssemblyAI, Eleven Labs, Otter, Speechmatics, and OpenAI.

Key Metrics

Word Error Rate (WER): 5.26%. Significantly lower than many competitors: Deepgram ~8.26%, AssemblyAI ~8.31%.
Speaker Diarization Error Rate (DER): 3.8%. Slight improvement over the previous best from Speechmatics (~3.9%).
Language support: 140+ languages. Over 40 more languages than many leading models; aims for "true global coverage."
Cost per hour of transcription: US$0.23/hr. Positioned as the lowest among major services.

Technical Approach & Positioning

TwinMind indicates Ear-3 is a “fine-tuned blend of several open-source models,” trained on a curated dataset containing human-annotated audio sources such as podcasts, videos, and films.

Diarization and speaker labeling are improved via a pipeline that includes audio cleaning and enhancement before diarization, plus “precise alignment checks” to refine speaker boundary detections.

The model handles code-switching and mixed scripts, which are typically difficult for ASR systems due to varied phonetics, accent variance, and linguistic overlap.

Trade-offs & Operational Details

Ear-3 requires cloud deployment. Because of its model size and compute load, it cannot run fully offline. TwinMind’s Ear-2 (its earlier model) remains the fallback when connectivity is lost.

Privacy: TwinMind claims audio is not stored long-term; only transcripts are stored locally, with optional encrypted backups. Audio recordings are deleted “on the fly.”

Platform integration: API access for the model is planned in the coming weeks for developers/enterprises. For end users, Ear-3 functionality will be rolled out to TwinMind’s iPhone, Android, and Chrome apps over the next month for Pro users.

Comparative Analysis & Implications

Ear-3’s WER and DER metrics put it ahead of many established models. Lower WER translates to fewer transcription errors (mis-recognitions, dropped words, etc.), which is critical for domains like legal, medical, lecture transcription, or archival of sensitive content. Similarly, lower DER (i.e. better speaker separation + labeling) matters for meetings, interviews, podcasts — anything with multiple participants.

The price point of US$0.23/hr makes high-accuracy transcription more economically feasible for long-form audio (e.g. hours of meetings, lectures, recordings). Combined with support for over 140 languages, there is a clear push to make this usable in global settings, not just English-centric or well-resourced language contexts.

However, cloud dependency could be a limitation for users needing offline or edge-device capabilities, or where data privacy / latency concerns are stringent. Implementation complexity for supporting 140+ languages (accent drift, dialects, code-switching) may reveal weaker zones under adverse acoustic conditions. Real-world performance may vary compared to controlled benchmarking.

Conclusion

TwinMind’s Ear-3 model represents a strong technical claim: high accuracy, speaker diarization precision, extensive language coverage, and aggressive cost reduction. If benchmarks hold in real usage, this could shift expectations for what “premium” transcription services should deliver.


OpenAI Adds Full MCP Tool Support in ChatGPT Developer Mode: Enabling …

OpenAI has just introduced a major upgrade to ChatGPT’s developer mode by adding full support for Model Context Protocol (MCP) tools. Until now, MCP integrations inside ChatGPT were limited to search and fetch operations—essentially read-only. With this update, MCP connectors can perform write actions, which means developers can now directly update systems, trigger workflows, and chain complex automations from within a ChatGPT conversation. The capability is currently available to Plus and Pro users.

This change moves ChatGPT beyond being just an intelligent query layer. Instead of only retrieving data from connected sources, it can now act on that data. For example, developers can update Jira tickets directly through chat, kick off a Zapier workflow, or combine connectors to perform multi-step tasks such as analyzing error logs, opening an incident ticket, and notifying a team channel. ChatGPT is no longer just a conversational assistant—it is positioned as an orchestration layer for real work across distributed tools.

The technical foundation of this expansion lies in the MCP framework, which defines how large language models interact with external services through structured protocols. Connectors expose capabilities that ChatGPT can call, typically described using JSON schemas. The addition of write support introduces new requirements around authentication, security, and reliability. Since connectors now modify external state, API tokens, OAuth scopes, and access controls need to be tightly scoped. Error handling becomes critical: when a write operation fails, ChatGPT must be able to surface the issue clearly, log it, and recover gracefully. Developers also need to consider transaction safety when chaining multiple write actions across services.
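For illustration only, a minimal MCP server exposing a write-capable tool might look like the following sketch, written with the open-source Python MCP SDK (FastMCP); the server name, tool name, and ticket-update logic are hypothetical placeholders rather than an OpenAI-specific API:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticket-tools")

@mcp.tool()
def update_ticket(ticket_id: str, status: str) -> str:
    """Write action: update the status of a ticket in an external tracker."""
    # A real connector would call your issue tracker's API here (hypothetical).
    return f"Ticket {ticket_id} moved to {status}"

if __name__ == "__main__":
    mcp.run()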

From a developer experience standpoint, enabling these capabilities is straightforward. Once developer mode is activated in ChatGPT, developers can register connectors that include both read and write methods. These connectors can then be invoked naturally during a conversation. The workflow is designed for iteration—developers can prototype, test, and refine integrations directly in chat rather than building custom middleware from scratch. OpenAI’s documentation provides schemas, endpoint definitions, and examples to standardize connector behavior across services.

The impact for enterprise and automation use cases is significant. Operations teams can streamline incident response by having ChatGPT log issues, update tickets, and push alerts automatically. Business teams can embed ChatGPT into CRM pipelines, where a single conversational update might sync customer data, generate reports, and notify account managers. For engineering teams, ChatGPT can now trigger builds, update GitHub pull requests, or synchronize task trackers—all without leaving the chat interface. In each case, ChatGPT is not just summarizing information but actively driving workflows.

This update marks an important step in the future of ChatGPT. By enabling full MCP tool support, OpenAI is pushing the assistant from being a knowledge layer to a true automation platform. It provides developers with the flexibility to build connectors that bridge natural language instructions and real-world actions, effectively turning conversation into a universal interface for enterprise systems. For organizations using ChatGPT Plus or Pro, developer mode now opens the door to integrating conversational AI directly into daily operations, where chat doesn’t just answer questions—it gets work done.

“We’ve (finally) added full support for MCP tools in ChatGPT. In developer mode, developers can create connectors and use them in chat for write actions (not just search/fetch). Update Jira tickets, trigger Zapier workflows, or combine connectors for complex automations.” (OpenAI Developers, @OpenAIDevs, September 10, 2025)


What are Optical Character Recognition (OCR) Models? Top Open-Source O …

Optical Character Recognition (OCR) is the process of turning images that contain text—such as scanned pages, receipts, or photographs—into machine-readable text. What began as brittle rule-based systems has evolved into a rich ecosystem of neural architectures and vision-language models capable of reading complex, multi-lingual, and handwritten documents.

How OCR Works?

Every OCR system tackles three core challenges:

Detection – Finding where text appears in the image. This step has to handle skewed layouts, curved text, and cluttered scenes.

Recognition – Converting the detected regions into characters or words. Performance depends heavily on how the model handles low resolution, font diversity, and noise.

Post-Processing – Using dictionaries or language models to correct recognition errors and preserve structure, whether that’s table cells, column layouts, or form fields.

The difficulty grows when dealing with handwriting, scripts beyond Latin alphabets, or highly structured documents such as invoices and scientific papers.

From Hand-Crafted Pipelines to Modern Architectures

Early OCR: Relied on binarization, segmentation, and template matching. Effective only for clean, printed text.

Deep Learning: CNN and RNN-based models removed the need for manual feature engineering, enabling end-to-end recognition.

Transformers: Architectures such as Microsoft’s TrOCR expanded OCR into handwriting recognition and multilingual settings with improved generalization.

Vision-Language Models (VLMs): Large multimodal models like Qwen2.5-VL and Llama 3.2 Vision integrate OCR with contextual reasoning, handling not just text but also diagrams, tables, and mixed content.

Comparing Leading Open-Source OCR Models

Tesseract (LSTM-based): Mature, supports 100+ languages, widely used. Best fit: bulk digitization of printed text.
EasyOCR (PyTorch CNN + RNN): Easy to use, GPU-enabled, 80+ languages. Best fit: quick prototypes, lightweight tasks.
PaddleOCR (CNN + Transformer pipelines): Strong Chinese/English support, table and formula extraction. Best fit: structured multilingual documents.
docTR (modular: DBNet, CRNN, ViTSTR): Flexible, supports both PyTorch and TensorFlow. Best fit: research and custom pipelines.
TrOCR (Transformer-based): Excellent handwriting recognition, strong generalization. Best fit: handwritten or mixed-script inputs.
Qwen2.5-VL (vision-language model): Context-aware, handles diagrams and layouts. Best fit: complex documents with mixed media.
Llama 3.2 Vision (vision-language model): OCR integrated with reasoning tasks. Best fit: QA over scanned docs, multimodal tasks.
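As a quick way to try one of the models above, the following sketch runs EasyOCR on a local image; it assumes easyocr is installed (pip install easyocr) and that invoice.jpg is a placeholder path:

import easyocr

reader = easyocr.Reader(["en"])            # loads detection + recognition models for English
results = reader.readtext("invoice.jpg")   # placeholder path to a local image

for bbox, text, confidence in results:
    print(f"{text} (confidence={confidence:.2f})")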

Emerging Trends

Research in OCR is moving in three notable directions:

Unified Models: Systems like VISTA-OCR collapse detection, recognition, and spatial localization into a single generative framework, reducing error propagation.

Low-Resource Languages: Benchmarks such as PsOCR expose performance gaps in languages like Pashto, pointing to multilingual fine-tuning as a way to close them.

Efficiency Optimizations: Models such as TextHawk2 reduce visual token counts in transformers, cutting inference costs without losing accuracy.

Conclusion

The open-source OCR ecosystem offers options that balance accuracy, speed, and resource efficiency. Tesseract remains dependable for printed text, PaddleOCR excels with structured and multilingual documents, while TrOCR pushes the boundaries of handwriting recognition. For use cases requiring document understanding beyond raw text, vision-language models like Qwen2.5-VL and Llama 3.2 Vision are promising, though costly to deploy.

The right choice depends less on leaderboard accuracy and more on the realities of deployment: the types of documents, scripts, and structural complexity you need to handle, and the compute budget available. Benchmarking candidate models on your own data remains the most reliable way to decide.
The post What are Optical Character Recognition (OCR) Models? Top Open-Source OCR Models appeared first on MarkTechPost.

Enhance video understanding with Amazon Bedrock Data Automation and open-set object detection

In real-world video and image analysis, businesses often face the challenge of detecting objects that weren’t part of a model’s original training set. This becomes especially difficult in dynamic environments where new, unknown, or user-defined objects frequently appear. For example, media publishers might want to track emerging brands or products in user-generated content; advertisers need to analyze product appearances in influencer videos despite visual variations; retail providers aim to support flexible, descriptive search; self-driving cars must identify unexpected road debris; and manufacturing systems need to catch novel or subtle defects without prior labeling.
In all these cases, traditional closed-set object detection (CSOD) models—which only recognize a fixed list of predefined categories—fail to deliver. They either misclassify the unknown objects or ignore them entirely, limiting their usefulness for real-world applications.
Open-set object detection (OSOD) is an approach that enables models to detect both known and previously unseen objects, including those not encountered during training. It supports flexible input prompts, ranging from specific object names to open-ended descriptions, and can adapt to user-defined targets in real time without requiring retraining. By combining visual recognition with semantic understanding—often through vision-language models—OSOD lets users query the system broadly, even when the target object is unfamiliar, ambiguous, or entirely new.
In this post, we explore how Amazon Bedrock Data Automation uses OSOD to enhance video understanding.
Amazon Bedrock Data Automation and video blueprints with OSOD
Amazon Bedrock Data Automation is a cloud-based service that extracts insights from unstructured content such as documents, images, video, and audio. For video content specifically, Amazon Bedrock Data Automation supports functionalities such as chapter segmentation, frame-level text detection, chapter-level classification with Interactive Advertising Bureau (IAB) taxonomies, and frame-level OSOD. For more information about Amazon Bedrock Data Automation, see Automate video insights for contextual advertising using Amazon Bedrock Data Automation.
Amazon Bedrock Data Automation video blueprints support OSOD at the frame level. You can input a video along with a text prompt specifying the desired objects to detect. For each frame, the model outputs a dictionary containing bounding boxes in XYWH format (the x and y coordinates of the top-left corner, followed by the width and height of the box), along with corresponding labels and confidence scores. You can further customize the output based on your needs—for instance, filtering by high-confidence detections when precision is prioritized.
The input text is highly flexible, so you can define dynamic fields in the Amazon Bedrock Data Automation video blueprints powered by OSOD.
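
For example, the following snippet is a minimal sketch of post-processing frame-level detections: it keeps only high-confidence results and converts the normalized XYWH boxes to pixel coordinates. The field names mirror the sample output shown later in this post; the confidence threshold is an arbitrary choice.

def filter_detections(detections, frame_width, frame_height, min_confidence=0.8):
    kept = []
    for det in detections:
        if det["confidence"] < min_confidence:
            continue
        box = det["bounding_box"]
        kept.append({
            "label": det["label"],
            "confidence": det["confidence"],
            # Normalized left/top/width/height -> absolute pixel values
            "x": int(box["left"] * frame_width),
            "y": int(box["top"] * frame_height),
            "w": int(box["width"] * frame_width),
            "h": int(box["height"] * frame_height),
        })
    return kept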
Example use cases
In this section, we explore some examples of different use cases for Amazon Bedrock Data Automation video blueprints using OSOD. The following table summarizes the functionality of this feature.

| Functionality | Sub-functionality | Examples |
| --- | --- | --- |
| Multi-granular visual comprehension | Object detection from fine-grained object reference | “Detect the apple in the video.” |
| Multi-granular visual comprehension | Object detection from cross-granularity object reference | “Detect all the fruit items in the image.” |
| Multi-granular visual comprehension | Object detection from open questions | “Find and detect the most visually important elements in the image.” |
| Visual hallucination detection | Identify and flag object mentions in the input text that do not correspond to actual content in the given image | “Detect if apples appear in the image.” |

Ads analysis
Advertisers can use this feature to compare the effectiveness of various ad placement strategies across different locations and conduct A/B testing to identify the optimal advertising approach. For example, the following image is the output in response to the prompt “Detect the locations of echo devices.”

Smart resizing
By detecting key elements in the video, you can choose appropriate resizing strategies for devices with different resolutions and aspect ratios, making sure important visual information is preserved. For example, the following image is the output in response to the prompt “Detect the key elements in the video.”

Surveillance with intelligent monitoring
In home security systems, producers or users can take advantage of the model’s high-level understanding and localization capabilities to maintain safety, without the need to manually enumerate all possible scenarios. For example, the following image is the output in response to the prompt “Check dangerous elements in the video.”

Custom labels
You can define your own labels and search through videos to retrieve specific, desired results. For example, the following image is the output in response to the prompt “Detect the white car with red wheels in the video.”

Image and video editing
With flexible text-based object detection, you can accurately remove or replace objects in photo editing software, minimizing the need for imprecise, hand-drawn masks that often require multiple attempts to achieve the desired result. For example, the following image is the output in response to the prompt “Detect the people riding motorcycles in the video.”
Sample video blueprint input and output
The following example demonstrates how to define an Amazon Bedrock Data Automation video blueprint to detect visually prominent objects at the chapter level, with sample output including objects and their bounding boxes.
The following code is our example blueprint schema:

blueprint = {
  "$schema": "http://json-schema.org/draft-07/schema#",
  "description": "This blueprint enhances the searchability and discoverability of video content by providing comprehensive object detection and scene analysis.",
  "class": "media_search_video_analysis",
  "type": "object",
  "properties": {
    # Targeted Object Detection: Identifies visually prominent objects in the video
    # Set granularity to chapter level for more precise object detection
    "targeted-object-detection": {
      "type": "array",
      "instruction": "Please detect all the visually prominent objects in the video",
      "items": {
        "$ref": "bedrock-data-automation#/definitions/Entity"
      },
      "granularity": ["chapter"]  # Chapter-level granularity provides per-scene object detection
    },
  }
}

The following code is our example video custom output:

"chapters": [
    ...,
    {
        "inference_result": {
            "emotional-tone": "Tension and suspense"
        },
        "frames": [
            {
                "frame_index": 10289,
                "inference_result": {
                    "targeted-object-detection": [
                        {
                            "label": "man",
                            "bounding_box": {
                                "left": 0.6198254823684692,
                                "top": 0.10746771097183228,
                                "width": 0.16384708881378174,
                                "height": 0.7655990719795227
                            },
                            "confidence": 0.9174646443068981
                        },
                        {
                            "label": "ocean",
                            "bounding_box": {
                                "left": 0.0027531087398529053,
                                "top": 0.026655912399291992,
                                "width": 0.9967235922813416,
                                "height": 0.7752640247344971
                            },
                            "confidence": 0.7712276351034641
                        },
                        {
                            "label": "cliff",
                            "bounding_box": {
                                "left": 0.4687306359410286,
                                "top": 0.5707792937755585,
                                "width": 0.168929323554039,
                                "height": 0.20445972681045532
                            },
                            "confidence": 0.719932173293829
                        }
                    ]
                },
                "timecode_smpte": "00:05:43;08",
                "timestamp_millis": 343276
            }
        ],
        "chapter_index": 11,
        "start_timecode_smpte": "00:05:36;16",
        "end_timecode_smpte": "00:09:27;14",
        "start_timestamp_millis": 336503,
        "end_timestamp_millis": 567400,
        "start_frame_index": 10086,
        "end_frame_index": 17006,
        "duration_smpte": "00:03:50;26",
        "duration_millis": 230897,
        "duration_frames": 6921
    },
    ...
]

For the full example, refer to the following GitHub repo.
Conclusion
The OSOD capability within Amazon Bedrock Data Automation significantly enhances the ability to extract actionable insights from video content. By combining flexible text-driven queries with frame-level object localization, OSOD helps users across industries implement intelligent video analysis workflows—ranging from targeted ad evaluation and security monitoring to custom object tracking. Integrated seamlessly into the broader suite of video analysis tools available in Amazon Bedrock Data Automation, OSOD not only streamlines content understanding but also helps reduce the need for manual intervention and rigid predefined schemas, making it a powerful asset for scalable, real-world applications.
To learn more about Amazon Bedrock Data Automation video and audio analysis, see New Amazon Bedrock Data Automation capabilities streamline video and audio analysis.

About the authors
Dongsheng An is an Applied Scientist at AWS AI, specializing in face recognition, open-set object detection, and vision-language models. He received his Ph.D. in Computer Science from Stony Brook University, focusing on optimal transport and generative modeling.
Lana Zhang is a Senior Solutions Architect in the AWS World Wide Specialist Organization AI Services team, specializing in AI and generative AI with a focus on use cases including content moderation and media analysis. She’s dedicated to promoting AWS AI and generative AI solutions, demonstrating how generative AI can transform classic use cases by adding business value. She assists customers in transforming their business solutions across diverse industries, including social media, gaming, ecommerce, media, advertising, and marketing.
Raj Jayaraman is a Senior Generative AI Solutions Architect at AWS, bringing over a decade of experience in helping customers extract valuable insights from data. Specializing in AWS AI and generative AI solutions, Raj’s expertise lies in transforming business solutions through the strategic application of AWS’s AI capabilities, ensuring customers can harness the full potential of generative AI in their unique contexts. With a strong background in guiding customers across industries in adopting AWS Analytics and Business Intelligence services, Raj now focuses on assisting organizations in their generative AI journey—from initial demonstrations to proof of concepts and ultimately to production implementations.

How Skello uses Amazon Bedrock to query data in a multi-tenant environment

This is a guest post co-written with Skello.
Skello is a leading human resources (HR) software as a service (SaaS) solution focusing on employee scheduling and workforce management. Catering to diverse sectors such as hospitality, retail, healthcare, construction, and industry, Skello offers features including schedule creation, time tracking, and payroll preparation. With approximately 20,000 customers and 400,000 daily users across Europe as of 2024, Skello continually innovates to meet its clients’ evolving needs.
One such innovation is the implementation of an AI-powered assistant to enhance user experience and data accessibility. In this post, we explain how Skello used Amazon Bedrock to create this AI assistant for end-users while maintaining customer data safety in a multi-tenant environment. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
We dive deep into the challenges of implementing large language models (LLMs) for data querying, particularly in the context of a French company operating under the General Data Protection Regulation (GDPR). Our solution demonstrates how to balance powerful AI capabilities with strict data protection requirements.
Challenges with multi-tenant data access
As Skello’s platform grew to serve thousands of businesses, we identified a critical need: our users needed better ways to access and understand their workforce data. Many of our customers, particularly those in HR and operations roles, found traditional database querying tools too technical and time-consuming. This led us to identify two key areas for improvement:

Quick access to non-structured data – Our users needed to find specific information across various data types—employee records, scheduling data, attendance logs, and performance metrics. Traditional search methods often fell short when users had complex questions like “Show me all part-time employees who worked more than 30 hours last month” or “What’s the average sick leave duration in the retail department?”
Visualization of data through graphs for analytics – Although our platform collected comprehensive workforce data, users struggled to transform this raw information into actionable insights. They needed an intuitive way to create visual representations of trends and patterns without writing complex SQL queries or learning specialized business intelligence tools.

To address these challenges, we needed a solution that could:

Understand natural language questions about complex workforce data
Correctly interpret context and intent from user queries
Generate appropriate database queries while respecting data access rules
Return results in user-friendly formats, including visualizations
Handle variations in how users might phrase similar questions
Process queries about time-based data and trends

LLMs emerged as the ideal solution for this task. Their ability to understand natural language and context, combined with their capability to generate structured outputs, made them perfectly suited for translating user questions into precise database queries. However, implementing LLMs in a business-critical application required careful consideration of security, accuracy, and performance requirements.
Solution overview
Using LLMs to generate structured queries from natural language input is an emerging area of interest. This process enables the transformation of user requests into organized data structures, which can then be used to query databases automatically.
The following diagram of Skello’s high-level architecture illustrates this user request transformation process.

The implementation using AWS Lambda and Amazon Bedrock provides several advantages:

Scalability through serverless architecture
Cost-effective processing with pay-as-you-go pricing
Low-latency performance
Access to advanced language models like Anthropic’s Claude 3.5 Sonnet
Rapid deployment capabilities
Flexible integration options

Basic query generation process
The following diagram illustrates how we transform natural language queries into structured database requests. For this example, the user asks “Give me the gender parity.”

The process works as follows:

The authentication service validates the user’s identity and permissions.
The LLM converts the natural language to a structured query format.
The query validation service enforces compliance with security policies.
The database access layer executes the query within the user’s permitted scope.
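
The following snippet is a simplified sketch of the second step (converting natural language into a structured query) using the Amazon Bedrock Converse API through boto3. The model ID, system prompt, and output schema are illustrative assumptions, not Skello’s production implementation.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

SYSTEM_PROMPT = (
    "Translate the user's question into a JSON query with the keys "
    "'metric', 'filters', and 'group_by'. Respond with JSON only."
)

def to_structured_query(question: str) -> dict:
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # placeholder model ID
        system=[{"text": SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0},
    )
    raw = response["output"]["message"]["content"][0]["text"]
    return json.loads(raw)  # validated against security policies in a later step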

Handling complex queries
For more sophisticated requests like “Give me the worked hours per week per position for the last 3 months,” our system completes the following steps:

Extract query components:

Target metric: worked hours
Aggregation levels: week, position
Time frame: 3 months

Generate temporal calculations:

Use relative time expressions instead of hard-coded dates
Implement standardized date handling patterns
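
One possible structured representation of such a request might look like the following sketch (the field names are illustrative, not Skello’s actual schema):

query = {
    "metric": "worked_hours",
    "group_by": ["week", "position"],
    "time_range": {"relative": "last_3_months"},  # relative expression, not hard-coded dates
}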

Data schema optimization
To make our system as efficient and user-friendly as possible, we carefully organized our data structure—think of it as creating a well-organized filing system for a large office.
We created standardized schema definitions, establishing consistent ways to store similar types of information. For example, date-related fields (hire dates, shift times, vacation periods) follow the same format. This helps prevent confusion when users ask questions like “Show me all events from last week.” It’s similar to having all calendars in your office using the same date format instead of some using MM/DD/YY and others using DD/MM/YY.
Our system employs consistent naming conventions with clear, predictable names for all data fields. Instead of technical abbreviations like emp_typ_cd, we use clear terms like employee_type. This makes it straightforward for the AI to understand what users mean when they ask questions like “Show me all full-time employees.”
For optimized search patterns, we strategically organized our data to make common searches fast and efficient. This is particularly important because it directly impacts user experience and system performance. We analyzed usage patterns to identify the most frequently requested information and designed our database indexes accordingly. Additionally, we created specialized data views that pre-aggregate common report requests. This comprehensive approach means questions like “Who’s working today?” get answered almost instantly.
We also established clear data relationships by mapping out how different pieces of information relate to each other. For example, we clearly connect employees to their departments, shifts, and managers. This helps answer complex questions like “Show me all department managers who have team members on vacation next week.”
These optimizations deliver real benefits to our users:

Faster response times when asking questions
More accurate answers to queries
Less confusion when referring to specific types of data
Ability to ask more complex questions about relationships between different types of information
Consistent results when asking similar questions in different ways

For example, whether a user asks “Show me everyone’s vacation time” or “Display all holiday schedules,” the system understands they’re looking for the same type of information. This reliability makes the system more trustworthy and easier to use for everyone, regardless of their technical background.
Graph generation and display
One of the most powerful features of our system is its ability to turn data into meaningful visual charts and graphs automatically. This consists of the following actions:

Smart label creation – The system understands what your data means and creates clear, readable labels. For example, if you ask “Show me employee attendance over the last 6 months,” the horizontal axis automatically labels the months (January through June), the vertical axis shows attendance numbers with simple-to-read intervals, and the title clearly states what you’re looking at: “Employee Attendance Trends.”
Automatic legend creation – The system creates helpful legends that explain what each part of the chart means. For instance, if you ask “Compare sales across different departments,” different departments get different colors, a clear legend shows which color represents which department, and additional information like “Dashed lines show previous year” is automatically added when needed.
Choosing the right type of chart – The system is smart about picking the best way to show your information. For example, it uses bar charts for comparing different categories (“Show me sales by department”), line graphs for trends over time (“How has attendance changed this year?”), pie charts for showing parts of a whole (“What’s the breakdown of full-time vs. part-time staff?”), and heat maps for complex patterns (“Show me busiest hours per day of the week”).
Smart sizing and scaling – The system automatically adjusts the size and scale of charts to make them simple to read. For example, if numbers range from 1–100, it might show intervals of 10; if you’re looking at millions, it might show them in a more readable way (1M, 2M, etc.); charts automatically resize to show patterns clearly; and important details are never too small to see.

All of this happens automatically—you ask your question, and the system handles the technical details of creating a clear, professional visualization. For example, the following figure shows an example output for the question “How many hours my employees worked over the past 7 weeks?”

Security-first architecture
Our implementation adheres to OWASP best practices (specifically LLM06) by maintaining complete separation between security controls and the LLM.
Through dedicated security services, user authentication and authorization checks are performed before LLM interactions, with user context and permissions managed through Amazon Bedrock SessionParameters, keeping security information entirely outside of LLM processing.
Our validation layer uses Amazon Bedrock Guardrails to protect against prompt injection, inappropriate content, and forbidden topics such as racism, sexism, or illegal content.
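
The following snippet is a hedged sketch of how a guardrail can be attached to a Converse API call through boto3; the model ID, guardrail identifier, and version are placeholders.

import boto3

bedrock = boto3.client("bedrock-runtime")
user_question = "How many employees were on sick leave last month?"  # example input

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # placeholder model ID
    messages=[{"role": "user", "content": [{"text": user_question}]}],
    guardrailConfig={
        "guardrailIdentifier": "my-guardrail-id",   # placeholder guardrail
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])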
The system’s architecture implements strict role-based access controls through a detailed permissions matrix, so users can only access data within their authorized scope. For authentication, we use industry-standard JWT and SAML protocols, and our authorization service maintains granular control over data access permissions.
This multi-layered approach prevents potential security bypasses through prompt manipulation or other LLM-specific attacks. The system automatically enforces data boundaries at both database and API levels, effectively preventing cross-contamination between different customer accounts. For instance, department managers can only access their team’s data, with these restrictions enforced through database compartmentalization.
Additionally, our comprehensive audit system maintains immutable logs of all actions, including timestamps, user identifiers, and accessed resources, stored separately to protect their integrity. This security framework operates seamlessly in the background, maintaining robust protection of sensitive information without disrupting the user experience or legitimate workflows.
Benefits
Creating data visualizations has never been more accessible. Even without specialized expertise, you can now produce professional-quality charts that communicate your insights effectively. The streamlined process makes sure your visualizations remain consistently clear and intuitive, so you can concentrate on exploring your data questions instead of spending time on presentation details.
The solution works through simple conversational requests that require no technical knowledge or specialized software. You simply describe what you want to visualize using everyday language and the system interprets your request and creates the appropriate visualization. There’s no need to learn complex software interfaces, remember specific commands, or understand data formatting requirements. The underlying technology handles the data processing, chart selection, and professional formatting automatically, transforming your spoken or written requests into polished visual presentations within moments.
Your specific information needs drive how the data is displayed, making the insights more relevant and actionable. When it’s time to share your findings, these visualizations integrate seamlessly into your reports and presentations with polished formatting that enhances your overall message. This democratization of data visualization empowers everyone to tell compelling data stories.
Conclusion
In this post, we explored Skello’s implementation of an AI-powered assistant using Amazon Bedrock and Lambda. We saw how end-users can query their own data in a multi-tenant environment while maintaining logical boundaries and complying with GDPR regulations. The combination of serverless architecture and advanced language models proved effective in enhancing data accessibility and user experience.
We invite you to explore the AWS Machine Learning Blog for more insights on AI solutions and their potential business applications. If you’re interested in learning more about Skello’s journey in modernizing HR software, check out our blog post series on the topic.
If you have any questions or suggestions about implementing similar solutions in your own multi-tenant environment, please feel free to share them in the comments section.

About the authors
Nicolas de Place is a Data & AI Solutions Architect specializing in machine learning strategy for high-growth startups. He empowers emerging companies to harness the full potential of artificial intelligence and advanced analytics, designing scalable ML architectures and data-driven solutions
Cédric Peruzzi is a Software Architect at Skello, where he focuses on designing and implementing Generative AI features. Before his current role, he worked as a software engineer and architect, bringing his experience to help build better software solutions.

Create a private workforce on Amazon SageMaker Ground Truth with the AWS CDK

Private workforces for Amazon SageMaker Ground Truth and Amazon Augmented AI (Amazon A2I) help organizations build proprietary, high-quality datasets while keeping high standards of security and privacy.
The AWS Management Console provides a fast and intuitive way to create a private workforce, but many organizations need to automate their infrastructure deployment through infrastructure as code (IaC) because it provides benefits such as automated and consistent deployments, increased operational efficiency, and reduced chances of human errors or misconfigurations.
However, creating a private workforce with IaC is not a straightforward task because of some complex technical dependencies between services during the initial creation.
In this post, we present a complete solution for programmatically creating private workforces on Amazon SageMaker AI using the AWS Cloud Development Kit (AWS CDK), including the setup of a dedicated, fully configured Amazon Cognito user pool. The accompanying GitHub repository provides a customizable AWS CDK example that shows how to create and manage a private workforce, paired with a dedicated Amazon Cognito user pool, and how to integrate the necessary Amazon Cognito configurations.
Solution overview
This solution demonstrates how to create a private workforce and a coupled Amazon Cognito user pool and its dependent resources. The goal is to provide a comprehensive setup for the base infrastructure to enable machine learning (ML) labeling tasks.
The key technical challenge in this solution is the mutual dependency between the Amazon Cognito resources and the private workforce.
Specifically, the creation of the user pool app client requires certain parameters, such as the callback URL, which is only available after the private workforce is created. However, the private workforce creation itself needs the app client to be already present. This mutual dependency makes it challenging to set up the infrastructure in a straightforward manner.
Additionally, the user pool domain name must remain consistent across deployments, because it can’t be easily changed after the initial creation and inconsistency in the name can lead to deployment errors.
To address these challenges, the solution uses several AWS CDK constructs, including AWS CloudFormation custom resources. This custom approach allows the orchestration of the user pool and SageMaker private workforce creation, to correctly configure the resources and manage their interdependencies.
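
As an illustration only, the following Python AWS CDK fragment sketches how the CreateWorkforce call can be wrapped in an AwsCustomResource inside a stack. The workforce name is a placeholder, user_pool and app_client are assumed to have been created earlier in the same stack, and the actual repository adds further orchestration around the Amazon Cognito dependencies.

from aws_cdk import custom_resources as cr

# Runs inside a Stack's __init__, hence the use of self as the construct scope
workforce = cr.AwsCustomResource(
    self, "PrivateWorkforce",
    on_create=cr.AwsSdkCall(
        service="SageMaker",
        action="createWorkforce",
        parameters={
            "WorkforceName": "my-private-workforce",        # placeholder name
            "CognitoConfig": {
                "UserPool": user_pool.user_pool_id,          # created earlier in the stack
                "ClientId": app_client.user_pool_client_id,  # created earlier in the stack
            },
        },
        physical_resource_id=cr.PhysicalResourceId.of("my-private-workforce"),
    ),
    policy=cr.AwsCustomResourcePolicy.from_sdk_calls(
        resources=cr.AwsCustomResourcePolicy.ANY_RESOURCE
    ),
)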
The solution architecture is composed of one stack with several resources and services, some of which are needed only for the initial setup of the private workforce, and some that are used by the private workforce workers when logging in to complete a labeling task. The following diagram illustrates this architecture.

The solution’s deployment requires AWS services and resources that work together to set up the private workforce. The numbers in the diagram reflect the stack components that support the stack creation, which occur in the following order:

Amazon Cognito user pool – The user pool provides user management and authentication for the SageMaker private workforce. It handles user registration, login, and password management. A default email invitation is initially set to onboard new users to the private workforce. The user pool is both associated with an AWS WAF firewall and configured to deliver user activity logs to Amazon CloudWatch for enhanced security.
Amazon Cognito user pool app client – The user pool app client configures the client application that will interact with the user pool. During the initial deployment, a temporary placeholder callback URL is used, because the actual callback URL can only be determined later in the process.
AWS Systems Manager Parameter Store – Parameter Store, a capability of AWS Systems Manager, stores and persists the prefix of the user pool domain across deployments in a string parameter. The provided prefix must be such that the resulting domain is globally unique.
Amazon Cognito user pool domain – The user pool domain defines the domain name for the managed login experience provided by the user pool. This domain name must remain consistent across deployments, because it can’t be easily changed after the initial creation.
IAM roles – AWS Identity and Access Management (IAM) roles for CloudFormation custom resources include permissions to make AWS SDK calls to create the private workforce and other API calls during the next steps.
Private workforce – Implemented using a custom resource backing the CreateWorkforce API call, the private workforce is the foundation to manage labeling activities. It creates the labeling portal and manages portal-level access controls, including authentication through the integrated user pool. Upon creation, the labeling portal URL is made available to be used as a callback URL by the Amazon Cognito app client. The connected Amazon Cognito app client is automatically updated with the new callback URL.
SDK call to fetch the labeling portal domain – This SDK call reads the subdomain of the labeling portal. This is implemented as a CloudFormation custom resource.
SDK call to update user pool – This SDK call updates the user pool with a user invitation email that points to the labeling portal URL. This is implemented as a CloudFormation custom resource.
Filter for placeholder callback URL – Custom logic separates the placeholder URL from the app client’s callback URLs. This is implemented as a CloudFormation custom resource, backed by a custom AWS Lambda function.
SDK call to update the app client to remove the placeholder callback URL – This SDK call updates the app client with the correct callback URLs. This is implemented as a CloudFormation custom resource.
User creation and invitation emails – Amazon Cognito users are created and sent invitation emails with instructions to join the private workforce.

After this initial setup, a worker can join the private workforce and access the labeling portal. The authentication flow includes the email invitation, initial registration, authentication, and login to the labeling portal. The following diagram illustrates this workflow.

The detailed workflow steps are as follows:

A worker receives an email invitation that provides the user name, temporary password, and URL of the labeling portal.
When trying to reach the labeling portal, the worker is redirected to the Amazon Cognito user pool domain for authentication. Amazon Cognito domain endpoints are additionally protected by AWS WAF. The worker then sets a new password and registers with multi-factor authentication.
Authentication actions by the worker are logged and sent to CloudWatch.
The worker can log in and is redirected to the labeling portal.
In the labeling portal, the worker can access existing labeling jobs in SageMaker Ground Truth.

The solution uses a mix of AWS CDK constructs and CloudFormation custom resources to integrate the Amazon Cognito user pool and the SageMaker private workforce so workers can register and access the labeling portal. In the following sections, we show how to deploy the solution.
Prerequisites
You must have the following prerequisites:

An AWS account, already bootstrapped for the AWS CDK
AWS credentials with sufficient permissions to deploy the solution
The AWS CDK installed (version 2.178.1 or later)
Python (version 3.13 or later)
The AWS Command Line Interface (AWS CLI) installed
A mobile device with an authenticator app installed

Deploy the solution
To deploy the solution, complete the following steps. Make sure you have AWS credentials available in your environment with sufficient permissions to deploy the solution resources.

Clone the GitHub repository.
Follow the detailed instructions in the README file to deploy the stack using the AWS CDK and AWS CLI.
Open the AWS CloudFormation console and choose the Workforce stack for more information on the ongoing deployment and the created resources.

Test the solution
If you invited yourself from the AWS CDK CLI to join the private workforce, follow the instructions in the email that you received to register and access the labeling portal. Otherwise, complete the following steps to invite yourself and others to join the private workforce. For more information, see Creating a new user in the AWS Management Console.

On the Amazon Cognito console, choose User pools in the navigation pane.
Choose the existing user pool, MyWorkforceUserPool.
Choose Users, then choose Create a user.
Choose Email as the alias attribute to sign in.
Choose Send an email invitation as the invitation message.
For User name, enter a name for the new user. Make sure not to use the email address.
For Email address, enter the email address of the worker to be invited.
For simplicity, choose Generate a password for the user.
Choose Create.

After you receive the invitation email, follow the instructions to set a new password and register with an authenticator application. Then you can log in and see a page listing your labeling jobs.

Best practices and considerations
When setting up a private workforce, consider the best practices for Amazon Cognito and the AWS CDK, as well as additional customizations:

Customized domain – Provide your own prefix for the Amazon Cognito subdomain when deploying the solution. This way, you can use a more recognizable domain name for the labeling application, rather than a randomly generated one. For even greater customization, integrate the user pool with a custom domain that you own. This gives you full control over the URL used for the login and aligns it with the rest of your organization’s applications.
Enhance security controls – Depending on your organization’s security and compliance requirements, you can further adapt the Amazon Cognito resources, for instance, by integrating with external identity providers and following other security best practices.
Implement VPC configuration – You can implement additional security controls, such as adding a virtual private cloud (VPC) configuration to the private workforce. This helps you enhance the overall security posture of your solution, providing an additional layer of network-level security and isolation.
Restrict the source IPs – When creating the SageMaker private workforce, you can specify a list of IP address ranges (in CIDR notation) from which workers can log in.
AWS WAF customization – Bring your own existing AWS WAF or configure one to your organization’s needs by setting up custom rules, IP filtering, rate-based rules, and web access control lists (ACLs) to protect your application.
Integrate with CI/CD – Incorporate the IaC in a continuous integration and continuous delivery (CI/CD) pipeline to standardize deployments, track changes, and improve resource tracking and observability across multiple environments (for instance, development, staging, and production).
Extend the solution – Depending on your specific use case, you might want to extend the solution to include the creation and management of work teams and labeling jobs or flows. This can help integrate the private workforce setup more seamlessly with your existing ML workflows and data labeling processes.
Integrate with additional AWS services – To suit your specific requirements, you can further integrate the private workforce and user pool with other relevant AWS services, such as CloudWatch for logging, monitoring, and alarms, and Amazon Simple Notification Service (Amazon SNS) for notifications to enhance the capabilities of your data labeling solution.

Clean up
To clean up your resources, open the AWS CloudFormation console and delete the Workforce stack. Alternatively, if you deployed using the AWS CDK CLI, you can run cdk destroy from the same terminal where you ran cdk deploy and use the same AWS CDK CLI arguments as during deployment.
Conclusion
This solution demonstrates how to programmatically create a private workforce on SageMaker Ground Truth, paired with a dedicated and fully configured Amazon Cognito user pool. By using the AWS CDK and AWS CloudFormation, this solution brings the benefits of IaC to the setup of your ML data labeling private workforce.
To further customize this solution to meet your organization’s standards, discover how to accelerate your journey on the cloud with the help of AWS Professional Services.
We encourage you to learn more from the developer guides on data labeling on SageMaker and Amazon Cognito user pools. Refer to the following blog posts for more examples of labeling data using SageMaker Ground Truth:

Power Your LLM Training and Evaluation with the New SageMaker AI Generative AI Tools
Enhance speech synthesis and video generation models with RLHF using audio and video segmentation in Amazon SageMaker
Accelerate custom labeling workflows in Amazon SageMaker Ground Truth without using AWS Lambda
Create a data labeling project with Amazon SageMaker Ground Truth Plus
High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus

About the author
Dr. Giorgio Pessot is a Machine Learning Engineer at Amazon Web Services Professional Services. With a background in computational physics, he specializes in architecting enterprise-grade AI systems at the confluence of mathematical theory, DevOps, and cloud technologies, where technology and organizational processes converge to achieve business objectives. When he’s not whipping up cloud solutions, you’ll find Giorgio engineering culinary creations in his kitchen.

Building Advanced MCP (Model Context Protocol) Agents with Multi-Agent Coordination, Context Awareness, and Gemini Integration

In this tutorial, we are walking through the process of building an advanced MCP (Model Context Protocol) Agent that runs smoothly inside Jupyter or Google Colab. We are designing the system with real-world practicality in mind, focusing on multi-agent coordination, context awareness, memory management, and dynamic tool usage. As we progress, we see how each agent specializes in its own role, whether it’s coordinating, researching, analyzing, or executing, and how together they form a swarm that can handle complex tasks. Check out the FULL CODES here.

import json
import logging
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
from enum import Enum
from datetime import datetime
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    import google.generativeai as genai
    GEMINI_AVAILABLE = True
except ImportError:
    print("google-generativeai not installed. Run: pip install google-generativeai")
    GEMINI_AVAILABLE = False

We start by importing essential Python libraries for data handling, logging, and agent structuring, while also setting up logging for better debugging. We then check for the availability of the Gemini API, so we can seamlessly integrate it if it is installed; otherwise, we run in demo mode. Check out the FULL CODES here.

class AgentRole(Enum):
    COORDINATOR = "coordinator"
    RESEARCHER = "researcher"
    ANALYZER = "analyzer"
    EXECUTOR = "executor"

@dataclass
class Message:
    role: str
    content: str
    timestamp: datetime
    metadata: Dict[str, Any] = None

@dataclass
class AgentContext:
    agent_id: str
    role: AgentRole
    capabilities: List[str]
    memory: List[Message]
    tools: List[str]

We define the core building blocks of our agent system. We create AgentRole to assign clear responsibilities, use Message to store conversations with context, and build AgentContext to capture each agent’s identity, role, memory, and tools so we can manage interactions effectively. Check out the FULL CODES here.

class MCPAgent:
    """Advanced MCP Agent with evolved capabilities - Jupyter Compatible"""

    def __init__(self, agent_id: str, role: AgentRole, api_key: str = None):
        self.agent_id = agent_id
        self.role = role
        self.api_key = api_key
        self.memory = []
        self.context = AgentContext(
            agent_id=agent_id,
            role=role,
            capabilities=self._init_capabilities(),
            memory=[],
            tools=self._init_tools()
        )

        self.model = None
        if GEMINI_AVAILABLE and api_key:
            try:
                genai.configure(api_key=api_key)
                self.model = genai.GenerativeModel('gemini-pro')
                print(f"Agent {agent_id} initialized with Gemini API")
            except Exception as e:
                print(f"Gemini configuration failed: {e}")
                print("Running in demo mode with simulated responses")
        else:
            print(f"Agent {agent_id} running in demo mode")

    def _init_capabilities(self) -> List[str]:
        """Initialize role-specific capabilities"""
        capabilities_map = {
            AgentRole.COORDINATOR: ["task_decomposition", "agent_orchestration", "priority_management"],
            AgentRole.RESEARCHER: ["data_gathering", "web_search", "information_synthesis"],
            AgentRole.ANALYZER: ["pattern_recognition", "data_analysis", "insight_generation"],
            AgentRole.EXECUTOR: ["action_execution", "result_validation", "output_formatting"]
        }
        return capabilities_map.get(self.role, [])

    def _init_tools(self) -> List[str]:
        """Initialize available tools based on role"""
        tools_map = {
            AgentRole.COORDINATOR: ["task_splitter", "agent_selector", "progress_tracker"],
            AgentRole.RESEARCHER: ["search_engine", "data_extractor", "source_validator"],
            AgentRole.ANALYZER: ["statistical_analyzer", "pattern_detector", "visualization_tool"],
            AgentRole.EXECUTOR: ["code_executor", "file_handler", "api_caller"]
        }
        return tools_map.get(self.role, [])

    def process_message(self, message: str, context: Optional[Dict] = None) -> Dict[str, Any]:
        """Process incoming message with context awareness - Synchronous version"""

        msg = Message(
            role="user",
            content=message,
            timestamp=datetime.now(),
            metadata=context
        )
        self.memory.append(msg)

        prompt = self._generate_contextual_prompt(message, context)

        try:
            if self.model:
                response = self._generate_response_gemini(prompt)
            else:
                response = self._generate_demo_response(message)

            response_msg = Message(
                role="assistant",
                content=response,
                timestamp=datetime.now(),
                metadata={"agent_id": self.agent_id, "role": self.role.value}
            )
            self.memory.append(response_msg)

            return {
                "agent_id": self.agent_id,
                "role": self.role.value,
                "response": response,
                "capabilities_used": self._analyze_capabilities_used(message),
                "next_actions": self._suggest_next_actions(response),
                "timestamp": datetime.now().isoformat()
            }

        except Exception as e:
            logger.error(f"Error processing message: {e}")
            return {"error": str(e)}

    def _generate_response_gemini(self, prompt: str) -> str:
        """Generate response using Gemini API - Synchronous"""
        try:
            response = self.model.generate_content(prompt)
            return response.text
        except Exception as e:
            logger.error(f"Gemini API error: {e}")
            return self._generate_demo_response(prompt)

    def _generate_demo_response(self, message: str) -> str:
        """Generate simulated response for demo purposes"""
        role_responses = {
            AgentRole.COORDINATOR: f"As coordinator, I'll break down the task: '{message[:50]}...' into manageable components and assign them to specialized agents.",
            AgentRole.RESEARCHER: f"I'll research information about: '{message[:50]}...' using my data gathering and synthesis capabilities.",
            AgentRole.ANALYZER: f"Analyzing the patterns and insights from: '{message[:50]}...' to provide data-driven recommendations.",
            AgentRole.EXECUTOR: f"I'll execute the necessary actions for: '{message[:50]}...' and validate the results."
        }

        base_response = role_responses.get(self.role, f"Processing: {message[:50]}...")

        time.sleep(0.5)

        additional_context = {
            AgentRole.COORDINATOR: " I've identified 3 key subtasks and will coordinate their execution across the agent team.",
            AgentRole.RESEARCHER: " My research indicates several relevant sources and current trends in this area.",
            AgentRole.ANALYZER: " The data shows interesting correlations and actionable insights for decision making.",
            AgentRole.EXECUTOR: " I've completed the requested actions and verified the outputs meet quality standards."
        }

        return base_response + additional_context.get(self.role, "")

    def _generate_contextual_prompt(self, message: str, context: Optional[Dict]) -> str:
        """Generate context-aware prompt based on agent role"""

        base_prompt = f"""
You are an advanced AI agent with the role: {self.role.value}
Your capabilities: {', '.join(self.context.capabilities)}
Available tools: {', '.join(self.context.tools)}

Recent conversation context:
{self._get_recent_context()}

Current request: {message}
"""

        role_instructions = {
            AgentRole.COORDINATOR: """
Focus on breaking down complex tasks, coordinating with other agents,
and maintaining overall project coherence. Consider dependencies and priorities.
Provide clear task decomposition and agent assignments.
""",
            AgentRole.RESEARCHER: """
Prioritize accurate information gathering, source verification,
and comprehensive data collection. Synthesize findings clearly.
Focus on current trends and reliable sources.
""",
            AgentRole.ANALYZER: """
Focus on pattern recognition, data interpretation, and insight generation.
Provide evidence-based conclusions and actionable recommendations.
Highlight key correlations and implications.
""",
            AgentRole.EXECUTOR: """
Concentrate on practical implementation, result validation,
and clear output delivery. Ensure actions are completed effectively.
Focus on quality and completeness of execution.
"""
        }

        return base_prompt + role_instructions.get(self.role, "")

    def _get_recent_context(self, limit: int = 3) -> str:
        """Get recent conversation context"""
        if not self.memory:
            return "No previous context"

        recent = self.memory[-limit:]
        context_str = ""
        for msg in recent:
            context_str += f"{msg.role}: {msg.content[:100]}...\n"
        return context_str

    def _analyze_capabilities_used(self, message: str) -> List[str]:
        """Analyze which capabilities were likely used"""
        used_capabilities = []
        message_lower = message.lower()

        capability_keywords = {
            "task_decomposition": ["break down", "divide", "split", "decompose"],
            "data_gathering": ["research", "find", "collect", "gather"],
            "pattern_recognition": ["analyze", "pattern", "trend", "correlation"],
            "action_execution": ["execute", "run", "implement", "perform"],
            "agent_orchestration": ["coordinate", "manage", "organize", "assign"],
            "information_synthesis": ["synthesize", "combine", "merge", "integrate"]
        }

        for capability, keywords in capability_keywords.items():
            if capability in self.context.capabilities:
                if any(keyword in message_lower for keyword in keywords):
                    used_capabilities.append(capability)

        return used_capabilities

    def _suggest_next_actions(self, response: str) -> List[str]:
        """Suggest logical next actions based on response"""
        suggestions = []
        response_lower = response.lower()

        if "need more information" in response_lower or "research" in response_lower:
            suggestions.append("delegate_to_researcher")
        if "analyze" in response_lower or "pattern" in response_lower:
            suggestions.append("delegate_to_analyzer")
        if "implement" in response_lower or "execute" in response_lower:
            suggestions.append("delegate_to_executor")
        if "coordinate" in response_lower or "manage" in response_lower:
            suggestions.append("initiate_multi_agent_collaboration")
        if "subtask" in response_lower or "break down" in response_lower:
            suggestions.append("task_decomposition_required")

        return suggestions if suggestions else ["continue_conversation"]

We implement the MCPAgent as a notebook-friendly, role-aware agent that initializes capabilities and tools based on its assigned role, keeps a memory of messages, and generates context-aware responses. We seamlessly use Gemini when available (falling back to a demo response otherwise) and wrap everything with structured outputs like capabilities used and suggested next actions. We also provide utilities to craft role-specific prompts, surface recent context, detect implied capabilities, and propose the next step in a multi-agent workflow. Check out the FULL CODES here.

class MCPAgentSwarm:
    """Multi-agent coordination system - Jupyter Compatible"""

    def __init__(self, api_key: str = None):
        self.api_key = api_key
        self.agents = {}
        self.task_history = []
        self.results = {}

    def create_agent(self, agent_id: str, role: AgentRole) -> MCPAgent:
        """Create and register a new agent"""
        agent = MCPAgent(agent_id, role, self.api_key)
        self.agents[agent_id] = agent
        print(f"Created agent: {agent_id} with role: {role.value}")
        return agent

    def coordinate_task(self, task: str) -> Dict[str, Any]:
        """Coordinate complex task across multiple agents - Synchronous"""

        print(f"\nCoordinating task: {task}")
        print("=" * 60)

        if "coordinator" not in self.agents:
            self.create_agent("coordinator", AgentRole.COORDINATOR)

        coordinator = self.agents["coordinator"]

        print("\nStep 1: Task Decomposition")
        decomposition = coordinator.process_message(
            f"Decompose this complex task into subtasks and identify which specialized agents are needed: {task}"
        )
        print(f"Coordinator: {decomposition['response']}")

        self._ensure_required_agents()

        print("\nStep 2: Agent Collaboration")
        results = {}
        for agent_id, agent in self.agents.items():
            if agent_id != "coordinator":
                print(f"\n{agent_id.upper()} working...")
                result = agent.process_message(
                    f"Handle your specialized part of this task: {task}\n"
                    f"Coordinator's guidance: {decomposition['response'][:200]}..."
                )
                results[agent_id] = result
                print(f"{agent_id}: {result['response'][:150]}...")

        print("\nStep 3: Final Synthesis")
        # Build the summary list outside the f-string to avoid nested-quote issues
        results_summary = [f"{k}: {v['response'][:100]}..." for k, v in results.items()]
        final_result = coordinator.process_message(
            f"Synthesize these agent results into a comprehensive final output for the task '{task}':\n"
            f"Results summary: {results_summary}"
        )
        print(f"Final Result: {final_result['response']}")

        task_record = {
            "task": task,
            "timestamp": datetime.now().isoformat(),
            "decomposition": decomposition,
            "agent_results": results,
            "final_synthesis": final_result,
            "agents_involved": list(self.agents.keys())
        }
        self.task_history.append(task_record)

        return task_record

    def _ensure_required_agents(self):
        """Ensure all required agent types exist"""
        required_roles = [AgentRole.RESEARCHER, AgentRole.ANALYZER, AgentRole.EXECUTOR]

        for role in required_roles:
            agent_id = role.value
            if agent_id not in self.agents:
                self.create_agent(agent_id, role)

    def get_swarm_status(self) -> Dict[str, Any]:
        """Get current status of the agent swarm"""
        return {
            "total_agents": len(self.agents),
            "agent_roles": {aid: agent.role.value for aid, agent in self.agents.items()},
            "tasks_completed": len(self.task_history),
            "last_task": self.task_history[-1]["task"] if self.task_history else "None"
        }

We manage a swarm of role-specific agents, create them on demand, and coordinate complex tasks through decomposition, collaboration, and final synthesis. We track results and history, ensure required agents exist, and provide a quick status view of the whole system at any time. Check out the FULL CODES here.

def demo_notebook_compatible():
    """Demonstrate advanced MCP agent capabilities - Notebook Compatible"""

    print("Starting Advanced MCP Agent Tutorial")
    print("Jupyter/Colab Compatible Version")
    print("=" * 60)

    API_KEY = None  # Set to your actual key

    if not API_KEY:
        print("Running in DEMO MODE (simulated responses)")
        print("Set API_KEY variable for real Gemini AI responses")
        print("-" * 60)

    swarm = MCPAgentSwarm(API_KEY)

    print("\nDemo 1: Single Agent Interaction")
    researcher = swarm.create_agent("research_agent", AgentRole.RESEARCHER)

    result = researcher.process_message(
        "Research the latest trends in AI agent architectures and multi-agent systems"
    )
    print("\nResearcher Response:")
    print(f"{result['response']}")
    print(f"Capabilities Used: {result['capabilities_used']}")
    print(f"Suggested Next Actions: {result['next_actions']}")

    print("\n\nDemo 2: Multi-Agent Coordination")

    complex_task = """
    Analyze the impact of AI agents on software development productivity.
    Include research on current tools, performance metrics, future predictions,
    and provide actionable recommendations for development teams.
    """

    coordination_result = swarm.coordinate_task(complex_task)

    print("\n\nDemo 3: Swarm Status")
    status = swarm.get_swarm_status()
    print(f"Total Agents: {status['total_agents']}")
    print(f"Agent Roles: {status['agent_roles']}")
    print(f"Tasks Completed: {status['tasks_completed']}")

    print("\nTutorial Completed Successfully!")
    return swarm

def run_demo():
    """Simple function to run the demo"""
    return demo_notebook_compatible()

if __name__ == "__main__":
    print("Running MCP Agent Demo...")
    swarm = run_demo()
else:
    print("MCP Agent Tutorial loaded!")
    print("Run: swarm = run_demo() to start the demonstration")

We wrap everything into a notebook-friendly demo that showcases the MCP agent system in action. We start by creating a researcher agent for single-agent interaction, then demonstrate multi-agent collaboration on a complex task, and finally check swarm status. We also ensure the code runs smoothly in both script mode and Jupyter/Colab mode, with a clear fallback to demo responses when no Gemini API key is set.

In conclusion, we have successfully demonstrated how our MCP agents can coordinate, decompose tasks, and synthesize results into actionable insights, all within a notebook-friendly, synchronous setup. We have seen how memory enables continuity of context, how role-based specialization ensures efficiency, and how the swarm can adapt to various challenges. With Gemini integration available for real AI responses and a fallback demo mode for simulation, we are leaving with a working foundation for advanced multi-agent systems.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Building Advanced MCP (Model Context Protocol) Agents with Multi-Agent Coordination, Context Awareness, and Gemini Integration appeared first on MarkTechPost.

NVIDIA AI Releases Universal Deep Research (UDR): A Prototype Framewor …

Why do existing deep research tools fall short?

Deep Research Tools (DRTs) like Gemini Deep Research, Perplexity, OpenAI’s Deep Research, and Grok DeepSearch rely on rigid workflows bound to a fixed LLM. While effective, they impose strict limitations: users cannot define custom strategies, swap models, or enforce domain-specific protocols.

NVIDIA’s analysis identifies three core problems:

Users cannot enforce preferred sources, validation rules, or cost control.

Specialized research strategies for domains such as finance, law, or healthcare are unsupported.

DRTs are tied to single models, preventing flexible pairing of the best LLM with the best strategy.

These issues restrict adoption in high-value enterprise and scientific applications.

https://arxiv.org/pdf/2509.00244

What is Universal Deep Research (UDR)?

Universal Deep Research (UDR) is an open-source system (in preview) that decouples strategy from model. It allows users to design, edit, and run their own deep research workflows without retraining or fine-tuning any LLM.

Unlike existing tools, UDR works at the system orchestration level:

It converts user-defined research strategies into executable code.

It runs workflows in a sandboxed environment for safety.

It treats the LLM as a utility for localized reasoning (summarization, ranking, extraction) instead of giving it full control.

This architecture makes UDR lightweight, flexible, and model-agnostic.

https://arxiv.org/pdf/2509.00244

How does UDR process and execute research strategies?

UDR takes two inputs: the research strategy (step-by-step workflow) and the research prompt (topic and output requirements).

Strategy Processing

Natural language strategies are compiled into Python code with enforced structure.

Variables store intermediate results, avoiding context-window overflow.

All functions are deterministic and transparent.

Strategy Execution

Control logic runs on CPU; only reasoning tasks call the LLM.

Notifications are emitted via yield statements, keeping users updated in real time.

Reports are assembled from stored variable states, ensuring traceability.

This separation of orchestration vs. reasoning improves efficiency and reduces GPU cost.
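To make the execution model concrete, here is a minimal, hypothetical sketch of a UDR-style compiled strategy: deterministic control logic runs as ordinary Python, intermediate results live in variables rather than the context window, notifications are emitted via yield, and the LLM is called only for localized reasoning. The function names (call_llm, web_search) and notification fields are assumptions for illustration, not the actual UDR code.

import time
from typing import Any, Dict, Generator

def call_llm(prompt: str) -> str:
    """Stand-in for a localized LLM call (summarization, ranking, extraction)."""
    return f"[LLM output for: {prompt[:40]}...]"

def web_search(query: str) -> list:
    """Stand-in for a search tool; returns a list of result snippets."""
    return [f"result for {query}"]

def minimal_strategy(topic: str) -> Generator[Dict[str, Any], None, str]:
    """A compiled 'minimal' strategy: a few queries, then a concise report."""
    state: Dict[str, Any] = {}  # intermediate results stay in variables, not the context window

    yield {"type": "started", "timestamp": time.time(), "description": f"Researching {topic}"}

    queries = [f"{topic} overview", f"{topic} recent developments"]
    state["snippets"] = [s for q in queries for s in web_search(q)]
    yield {"type": "search_done", "timestamp": time.time(),
           "description": f"Collected {len(state['snippets'])} snippets"}

    state["summary"] = call_llm("Summarize: " + " ".join(state["snippets"]))
    yield {"type": "summarized", "timestamp": time.time(), "description": "Draft summary ready"}

    return f"# Report on {topic}\n\n{state['summary']}\n"

# Drive the generator: print notifications as they arrive, capture the final report.
gen = minimal_strategy("agentic AI")
try:
    while True:
        print(next(gen))
except StopIteration as stop:
    print(stop.value)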

What example strategies are available?

NVIDIA ships UDR with three template strategies:

Minimal – Generate a few search queries, gather results, and compile a concise report.

Expansive – Explore multiple topics in parallel for broader coverage.

Intensive – Iteratively refine queries using evolving subcontexts, ideal for deep dives.

These serve as starting points, but the framework allows users to encode entirely custom workflows.

https://arxiv.org/pdf/2509.00244

What outputs does UDR generate?

UDR produces two key outputs:

Structured Notifications – Progress updates (with type, timestamp, and description) for transparency.

Final Report – A Markdown-formatted research document, complete with sections, tables, and references.

This design gives users both auditability and reproducibility, unlike opaque agentic systems.

Where can UDR be applied?

UDR’s general-purpose design makes it adaptable across domains:

Scientific discovery: structured literature reviews.

Enterprise due diligence: validation against filings and datasets.

Business intelligence: market analysis pipelines.

Startups: custom assistants built without retraining LLMs.

By separating model choice from research logic, UDR supports innovation in both dimensions.

Summary

Universal Deep Research signals a shift from model-centric to system-centric AI agents. By giving users direct control over workflows, NVIDIA enables customizable, efficient, and auditable research systems.

For startups and enterprises, UDR provides a foundation for building domain-specific assistants without the cost of model retraining—opening new opportunities for innovation across industries.

Check out the PAPER, PROJECT and CODE. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post NVIDIA AI Releases Universal Deep Research (UDR): A Prototype Framework for Scalable and Auditable Deep Research Agents appeared first on MarkTechPost.

Baidu Releases ERNIE-4.5-21B-A3B-Thinking: A Compact MoE Model for Dee …

Baidu AI Research team has just released ERNIE-4.5-21B-A3B-Thinking, a new reasoning-focused large language model designed around efficiency, long-context reasoning, and tool integration. Being part of the ERNIE-4.5 family, this model is a Mixture-of-Experts (MoE) architecture with 21B total parameters but only 3B active parameters per token, making it computationally efficient while maintaining competitive reasoning capability. Released under the Apache-2.0 license, it is accessible for both research and commercial deployment via Hugging Face.

What is the architectural design of ERNIE-4.5-21B-A3B-Thinking?

ERNIE-4.5-21B-A3B-Thinking is built on a Mixture-of-Experts backbone. Instead of activating all 21B parameters, the router selects a subset of experts, resulting in 3B active parameters per token. This structure reduces computation without compromising the specialization of different experts. The research team applies router orthogonalization loss and token-balanced loss to encourage diverse expert activation and stable training.

This design provides a middle ground between small dense models and ultra-large systems. The research team's working assumption is that ~3B active parameters per token may represent a practical sweet spot between reasoning performance and deployment efficiency.
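The following is a minimal sketch of sparse top-k expert routing, included only to make the "21B total, 3B active" idea concrete. The layer sizes, number of experts, and router are illustrative and do not reflect ERNIE-4.5's actual configuration or its auxiliary losses.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.router(x)                              # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                 # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                       # only k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([10, 64])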

How does the model handle long-context reasoning?

A defining capability of ERNIE-4.5-21B-A3B-Thinking is its 128K context length. This allows the model to process very long documents, perform extended multi-step reasoning, and integrate structured data sources such as academic papers or multi-file codebases.

The research team achieves this through progressive scaling of Rotary Position Embeddings (RoPE)—gradually increasing the frequency base from 10K up to 500K during training. Additional optimizations, including FlashMask attention and memory-efficient scheduling, make these long-context operations computationally feasible.
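A quick, generic illustration of why raising the RoPE frequency base helps: a larger base stretches the longest rotary wavelength, so positions remain distinguishable at much longer distances. The base schedule (10K to 500K) comes from the article; the frequency computation below is the standard RoPE formulation, not Baidu's implementation.

import numpy as np

def rope_inv_freq(head_dim: int, base: float) -> np.ndarray:
    """Standard RoPE angular frequencies for a given base."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

for base in (10_000, 100_000, 500_000):
    inv_freq = rope_inv_freq(head_dim=128, base=base)
    longest_wavelength = 2 * np.pi / inv_freq[-1]
    print(f"base={base:>7,}  longest wavelength ≈ {longest_wavelength:,.0f} positions")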

What training strategy supports its reasoning?

The model follows the multi-stage recipe defined across the ERNIE-4.5 family:

Stage I – Text-only pretraining builds the core language backbone, starting with 8K context and expanding to 128K.

Stage II – Vision training is skipped for this text-only variant.

Stage III – Joint multimodal training is not used here, as A3B-Thinking is purely textual.

Post-training focuses on reasoning tasks. The research team employs Supervised Fine-Tuning (SFT) across mathematics, logic, coding, and science, followed by Progressive Reinforcement Learning (PRL). Reinforcement stages begin with logic, then extend to mathematics and programming, and finally to broader reasoning tasks. This is enhanced by Unified Preference Optimization (UPO), which integrates preference learning with PPO to stabilize alignment and reduce reward hacking.

What role does tool usage play in this model?

ERNIE-4.5-21B-A3B-Thinking supports structured tool and function calling, making it useful for scenarios where external computation or retrieval is required. Developers can integrate it with vLLM, Transformers 4.54+, and FastDeploy. This tool-use capability is particularly suited for program synthesis, symbolic reasoning, and multi-agent workflows.

Built-in function calling allows the model to reason over long contexts while dynamically invoking external APIs, a key requirement for applied reasoning in enterprise systems.
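As a rough sketch of what structured tool calling might look like against an OpenAI-compatible endpoint (for example, one served by vLLM), the snippet below sends a tool schema alongside a chat request. The endpoint URL, API key, and the get_stock_price tool are assumptions for illustration; exact server flags and supported tool-call formats depend on your serving setup.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",  # hypothetical tool
        "description": "Return the latest price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-21B-A3B-Thinking",
    messages=[{"role": "user", "content": "What is Baidu trading at right now?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)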

How does ERNIE-4.5-21B-A3B-Thinking perform on reasoning benchmarks?

The model shows strong performance improvements across logical reasoning, mathematics, scientific QA, and programming tasks. In evaluations, it demonstrates:

Enhanced accuracy in multi-step reasoning datasets, where long chains of thought are required.

Competitiveness with larger dense models on STEM reasoning tasks.

Stable text generation and academic synthesis performance, benefiting from extended context training.

These results suggest that the MoE structure amplifies reasoning specialization, making it efficient without requiring trillion-scale dense parameters.

https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking

How does it compare to other reasoning-focused LLMs?

This release enters a landscape that includes OpenAI’s o3, Anthropic’s Claude 4, DeepSeek-R1, and Qwen-3. Many of these competitors rely on dense architectures or larger active parameter counts. Baidu research team’s choice of a compact MoE with 3B active parameters offers a different balance:

Scalability: Sparse activation reduces compute overhead while scaling expert capacity.

Long-context readiness: 128K context is directly trained, not retrofitted.

Commercial openness: Apache-2.0 license lowers adoption friction for enterprises.

Summary

ERNIE-4.5-21B-A3B-Thinking demonstrates how deep reasoning can be achieved without massive dense parameter counts. By combining efficient MoE routing, 128K context training, and tool integration, Baidu’s research team offers a model that balances research-grade reasoning with deployment feasibility.

Check out the Model on Hugging Face and PAPER. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Baidu Releases ERNIE-4.5-21B-A3B-Thinking: A Compact MoE Model for Deep Reasoning appeared first on MarkTechPost.

TII Falcon-H1 models now available on Amazon Bedrock Marketplace and A …

This post was co-authored with Jingwei Zuo from TII.
We are excited to announce the availability of the Technology Innovation Institute (TII)’s Falcon-H1 models on Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, developers and data scientists can now use six instruction-tuned Falcon-H1 models (0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B) on AWS, and have access to a comprehensive suite of hybrid architecture models that combine traditional attention mechanisms with State Space Models (SSMs) to deliver exceptional performance with unprecedented efficiency.
In this post, we present an overview of Falcon-H1 capabilities and show how to get started with TII’s Falcon-H1 models on both Amazon Bedrock Marketplace and SageMaker JumpStart.
Overview of TII and AWS collaboration
TII is a leading research institute based in Abu Dhabi. As part of UAE’s Advanced Technology Research Council (ATRC), TII focuses on advanced technology research and development across AI, quantum computing, autonomous robotics, cryptography, and more. TII employs international teams of scientists, researchers, and engineers in an open and agile environment, aiming to drive technological innovation and position Abu Dhabi and the UAE as a global research and development hub in alignment with the UAE National Strategy for Artificial Intelligence 2031.
TII and Amazon Web Services (AWS) are collaborating to expand access to made-in-the-UAE AI models across the globe. By combining TII’s technical expertise in building large language models (LLMs) with AWS Cloud-based AI and machine learning (ML) services, professionals worldwide can now build and scale generative AI applications using the Falcon-H1 series of models.
About Falcon-H1 models
The Falcon-H1 architecture implements a parallel hybrid design, using elements from Mamba and Transformer architectures to combine the faster inference and lower memory footprint of SSMs like Mamba with the effectiveness of Transformers’ attention mechanism in understanding context and enhanced generalization capabilities. The Falcon-H1 architecture scales across multiple configurations ranging from 0.5–34 billion parameters and provides native support for 18 languages. According to TII, the Falcon-H1 family demonstrates notable efficiency with published metrics indicating that smaller model variants achieve performance parity with larger models. Some of the benefits of Falcon-H1 series include:

Performance – The hybrid attention-SSM model has optimized parameters with adjustable ratios between attention and SSM heads, leading to faster inference, lower memory usage, and strong generalization capabilities. According to TII benchmarks published in Falcon-H1’s technical blog post and technical report, Falcon-H1 models demonstrate superior performance across multiple scales against other leading Transformer models of similar or larger scales. For example, Falcon-H1-0.5B delivers performance similar to typical 7B models from 2024, and Falcon-H1-1.5B-Deep rivals many of the current leading 7B-10B models.
Wide range of model sizes – The Falcon-H1 series includes six sizes: 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B, with both base and instruction-tuned variants. The Instruct models are now available in Amazon Bedrock Marketplace and SageMaker JumpStart.
Multilingual by design – The models support 18 languages natively (Arabic, Czech, German, English, Spanish, French, Hindi, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Romanian, Russian, Swedish, Urdu, and Chinese) and can scale to over 100 languages according to TII, thanks to a multilingual tokenizer trained on diverse language datasets.
Up to 256,000 context length – The Falcon-H1 series enables applications in long-document processing, multi-turn dialogue, and long-range reasoning, showing a distinct advantage over competitors in practical long-context applications like Retrieval Augmented Generation (RAG).
Robust data and training strategy – Training of Falcon-H1 models employs an innovative approach that introduces complex data early on, contrary to traditional curriculum learning. It also implements strategic data reuse based on careful memorization window assessment. Additionally, the training process scales smoothly across model sizes through a customized Maximal Update Parametrization (µP) recipe, specifically adapted for this novel architecture.
Balanced performance in science and knowledge-intensive domains – Through a carefully designed data mixture and regular evaluations during training, the model achieves strong general capabilities and broad world knowledge while minimizing unintended specialization or domain-specific biases.

In line with their mission to foster AI accessibility and collaboration, TII have released Falcon-H1 models under the Falcon LLM license. It offers the following benefits:

Open source nature and accessibility
Multi-language capabilities
Cost-effectiveness compared to proprietary models
Energy-efficiency

About Amazon Bedrock Marketplace and SageMaker JumpStart
Amazon Bedrock Marketplace offers access to over 100 popular, emerging, specialized, and domain-specific models, so you can find the best proprietary and publicly available models for your use case based on factors such as accuracy, flexibility, and cost. On Amazon Bedrock Marketplace you can discover models in a single place and access them through unified and secure Amazon Bedrock APIs. You can also select your desired number of instances and the instance type to meet the demands of your workload and optimize your costs.
SageMaker JumpStart helps you quickly get started with machine learning. It provides access to state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch. With SageMaker JumpStart you can deploy models in a secure environment by provisioning them on SageMaker inference instances and isolating them within your virtual private cloud (VPC). You can also use Amazon SageMaker AI to further customize and fine-tune the models and streamline the entire model deployment process.
Solution overview
This post demonstrates how to deploy a Falcon-H1 model using both Amazon Bedrock Marketplace and SageMaker JumpStart. Although we use Falcon-H1-0.5B as an example, you can apply these steps to other models in the Falcon-H1 series. For help determining which deployment option—Amazon Bedrock Marketplace or SageMaker JumpStart—best suits your specific requirements, see Amazon Bedrock or Amazon SageMaker AI?
Deploy Falcon-H1-0.5B-Instruct with Amazon Bedrock Marketplace
In this section, we show how to deploy the Falcon-H1-0.5B-Instruct model in Amazon Bedrock Marketplace.
Prerequisites
To try the Falcon-H1-0.5B-Instruct model in Amazon Bedrock Marketplace, you must have access to an AWS account that will contain your AWS resources. Prior to deploying Falcon-H1-0.5B-Instruct, verify that your AWS account has sufficient quota allocation for ml.g6.xlarge instances. The default quota for endpoints using several instance types and sizes is 0, so attempting to deploy the model without a higher quota will trigger a deployment failure.
To request a quota increase, open the AWS Service Quotas console and search for Amazon SageMaker. Locate ml.g6.xlarge for endpoint usage and choose Request quota increase, then specify your required limit value. After the request is approved, you can proceed with the deployment.
Deploy the model using the Amazon Bedrock Marketplace UI
To deploy the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, under Discover in the navigation pane, choose Model catalog.
Filter for Falcon-H1 as the model name and choose Falcon-H1-0.5B-Instruct.

The model overview page includes information about the model’s license terms, features, setup instructions, and links to further resources.

Review the model license terms, and if you agree with the terms, choose Deploy.

For Endpoint name, enter an endpoint name or leave it as the default pre-populated name.
To minimize costs while experimenting, set the Number of instances to 1.
For Instance type, choose from the list of compatible instance types. Falcon-H1-0.5B-Instruct is an efficient model, so ml.g6.xlarge is sufficient for this exercise.

Although the default configurations are typically sufficient for basic needs, you can customize advanced settings like VPC, service access permissions, encryption keys, and resource tags. These advanced settings might require adjustment for production environments to maintain compliance with your organization’s security protocols.

Choose Deploy.
A prompt asks you to stay on the page while the AWS Identity and Access Management (IAM) role is being created. If your AWS account lacks sufficient quota for the selected instance type, you’ll receive an error message. In this case, refer to the preceding prerequisite section to increase your quota, then try the deployment again.

While deployment is in progress, you can choose Marketplace model deployments in the navigation pane to monitor the deployment progress in the Managed deployment section. When the deployment is complete, the endpoint status will change from Creating to In Service.
Interact with the model in the Amazon Bedrock Marketplace playground
You can now test Falcon-H1 capabilities directly in the Amazon Bedrock playground by selecting the managed deployment and choosing Open in playground.

You can now use the Amazon Bedrock Marketplace playground to interact with Falcon-H1-0.5B-Instruct.
Invoke the model using code
In this section, we demonstrate how to invoke the model using the Amazon Bedrock Converse API.
Replace the placeholder code with the endpoint’s Amazon Resource Name (ARN), which begins with arn:aws:sagemaker. You can find this ARN on the endpoint details page in the Managed deployments section.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

endpoint_arn = "{ENDPOINT ARN}"  # Replace with endpoint ARN

response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[{"role": "user", "content": [{"text": "What is generative AI?"}]}],
    inferenceConfig={"temperature": 0.1, "topP": 0.1},
)

print(response["output"]["message"]["content"][0]["text"])

To learn more about the detailed steps and example code for invoking the model using Amazon Bedrock APIs, refer to Submit prompts and generate response using the API.
Deploy Falcon-H1-0.5B-Instruct with SageMaker JumpStart
You can access FMs in SageMaker JumpStart through Amazon SageMaker Studio, the SageMaker SDK, and the AWS Management Console. In this walkthrough, we demonstrate how to deploy Falcon-H1-0.5B-Instruct using the SageMaker Python SDK. Refer to Deploy a model in Studio to learn how to deploy the model through SageMaker Studio.
Prerequisites
To deploy Falcon-H1-0.5B-Instruct with SageMaker JumpStart, you must have the following prerequisites:

An AWS account that will contain your AWS resources.
An IAM role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, see Identity and Access Management for Amazon SageMaker AI.
Access to SageMaker Studio with a JupyterLab space, or an interactive development environment (IDE) such as Visual Studio Code or PyCharm.

Deploy the model programmatically using the SageMaker Python SDK
Before deploying Falcon-H1-0.5B-Instruct using the SageMaker Python SDK, make sure you have installed the SDK and configured your AWS credentials and permissions.
The following code example demonstrates how to deploy the model:

import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker import Session
import boto3
import json

# Initialize SageMaker session
session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Specify model parameters
model_id = "huggingface-llm-falcon-h1-0-5b-instruct"
instance_type = "ml.g6.xlarge"  # Choose appropriate instance based on your needs

# Create and deploy the model
model = JumpStartModel(
    model_id=model_id,
    role=role,
    instance_type=instance_type,
    model_version="*",  # Latest version
)

# Deploy the model
predictor = model.deploy(
    initial_instance_count=1,
    accept_eula=True,  # Required for deploying foundation models
)

print("Endpoint name:")
print(predictor.endpoint_name)

Perform inference using the SageMaker Python API

When the previous code segment completes successfully, the Falcon-H1-0.5B-Instruct model deployment is complete and available on a SageMaker endpoint. Note the endpoint name shown in the output; you will replace the placeholder in the following code segment with this value. The following code demonstrates how to prepare the input data, make the inference API call, and process the model’s response:

import json
import boto3

session = boto3.Session()  # Make sure your AWS credentials are configured
sagemaker_runtime = session.client("sagemaker-runtime")

endpoint_name = "{ENDPOINT_NAME}"  # Replace with endpoint name from deployment output

payload = {
    "messages": [
        {"role": "user", "content": "What is generative AI?"}
    ],
    "parameters": {
        "max_tokens": 256,
        "temperature": 0.1,
        "top_p": 0.1
    }
}

# Perform inference
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload)
)

# Parse the response
result = json.loads(response["Body"].read().decode("utf-8"))
generated_text = result["choices"][0]["message"]["content"].strip()
print("Generated Response:")
print(generated_text)

Clean up
To avoid ongoing charges for AWS resources used while experimenting with Falcon-H1 models, make sure to delete all deployed endpoints and their associated resources when you’re finished. To do so, complete the following steps:

Delete Amazon Bedrock Marketplace resources:

On the Amazon Bedrock console, choose Marketplace model deployment in the navigation pane.
Under Managed deployments, choose the Falcon-H1 model endpoint you deployed earlier.
Choose Delete and confirm the deletion if you no longer need to use this endpoint in Amazon Bedrock Marketplace.

Delete SageMaker endpoints:

On the SageMaker AI console, in the navigation pane, choose Endpoints under Inference.
Select the endpoint associated with the Falcon-H1 models.
Choose Delete and confirm the deletion. This stops the endpoint and avoids further compute charges.

Delete SageMaker models:

On the SageMaker AI console, choose Models under Inference.
Select the model associated with your endpoint and choose Delete.

Always verify that all endpoints are deleted after experimentation to optimize costs. Refer to the Amazon SageMaker documentation for additional guidance on managing resources.
Conclusion
The availability of Falcon-H1 models in Amazon Bedrock Marketplace and SageMaker JumpStart helps developers, researchers, and businesses build cutting-edge generative AI applications with ease. Falcon-H1 models offer multilingual support (18 languages) across various model sizes (from 0.5B to 34B parameters) and support up to 256K context length, thanks to their efficient hybrid attention-SSM architecture.
By using the seamless discovery and deployment capabilities of Amazon Bedrock Marketplace and SageMaker JumpStart, you can accelerate your AI innovation while benefiting from the secure, scalable, and cost-effective AWS Cloud infrastructure.
We encourage you to explore the Falcon-H1 models in Amazon Bedrock Marketplace or SageMaker JumpStart. You can use these models in AWS Regions where Amazon Bedrock or SageMaker JumpStart and the required instance types are available.
For further learning, explore the AWS Machine Learning Blog, SageMaker JumpStart GitHub repository, and Amazon Bedrock User Guide. Start building your next generative AI application with Falcon-H1 models and unlock new possibilities with AWS!
Special thanks to everyone who contributed to the launch: Evan Kravitz, Varun Morishetty, and Yotam Moss.

About the authors
Mehran Nikoo leads the Go-to-Market strategy for Amazon Bedrock and agentic AI in EMEA at AWS, where he has been driving the development of AI systems and cloud-native solutions over the last four years. Prior to joining AWS, Mehran held leadership and technical positions at Trainline, McLaren, and Microsoft. He holds an MBA from Warwick Business School and an MRes in Computer Science from Birkbeck, University of London.
Mustapha Tawbi is a Senior Partner Solutions Architect at AWS, specializing in generative AI and ML, with 25 years of enterprise technology experience across AWS, IBM, Sopra Group, and Capgemini. He has a PhD in Computer Science from Sorbonne and a Master’s degree in Data Science from Heriot-Watt University Dubai. Mustapha leads generative AI technical collaborations with AWS partners throughout the MENAT region.
Jingwei Zuo is a Lead Researcher at the Technology Innovation Institute (TII) in the UAE, where he leads the Falcon Foundational Models team. He received his PhD in 2022 from University of Paris-Saclay, where he was awarded the Plateau de Saclay Doctoral Prize. He holds an MSc (2018) from the University of Paris-Saclay, an Engineer degree (2017) from Sorbonne Université, and a BSc from Huazhong University of Science & Technology.
John Liu is a Principal Product Manager for Amazon Bedrock at AWS. Previously, he served as the Head of Product for AWS Web3/Blockchain. Prior to joining AWS, John held various product leadership roles at public blockchain protocols and financial technology (fintech) companies for 14 years. He also has nine years of portfolio management experience at several hedge funds.
Hamza MIMI is a Solutions Architect for partners and strategic deals in the MENAT region at AWS, where he bridges cutting-edge technology with impactful business outcomes. With expertise in AI and a passion for sustainability, he helps organizations architect innovative solutions that drive both digital transformation and environmental responsibility, transforming complex challenges into opportunities for growth and positive change.

Oldcastle accelerates document processing with Amazon Bedrock

This post was written with Avdhesh Paliwal of Oldcastle APG.
Oldcastle APG, one of the largest global networks of manufacturers in the architectural products industry, was grappling with an inefficient and labor-intensive process for handling proof of delivery (POD) documents, known as ship tickets. The company was processing 100,000–300,000 ship tickets per month across more than 200 facilities. Their existing optical character recognition (OCR) system was unreliable, requiring constant maintenance and manual intervention. It could only accurately read 30–40% of the documents, leading to significant time and resource expenditure.
This post explores how Oldcastle partnered with AWS to transform their document processing workflow using Amazon Bedrock with Amazon Textract. We discuss how Oldcastle overcame the limitations of their previous OCR solution to automate the processing of hundreds of thousands of POD documents each month, dramatically improving accuracy while reducing manual effort. This solution demonstrates a practical, scalable approach that can be adapted to your specific needs, such as similar challenges addressing document processing or using generative AI for business process optimization.
Challenges with document processing
The primary challenge for Oldcastle was to find a solution that could accomplish the following:

Accurately process a high volume of ship tickets (PODs) with minimal human intervention
Scale to handle 200,000–300,000 documents per month
Handle inconsistent inputs like rotated pages and variable formatting
Improve the accuracy of data extraction from the current 30–40% to a much higher rate
Add new capabilities like signature validation on PODs
Provide real-time visibility into outstanding PODs and deliveries

Additionally, Oldcastle needed a solution for processing supplier invoices and matching them against purchase orders, which presented similar challenges due to varying document formats. The existing process required dispatchers at more than 200 facilities to spend 4–5 hours daily manually processing ship tickets. This consumed valuable human resources and led to delays in processing and potential errors in data entry. The IT team was burdened with constant maintenance and development efforts to keep the unreliable OCR system functioning.
Solution overview
AWS Solutions Architects worked closely with Oldcastle engineers to build a solution addressing these challenges. The end-to-end workflow uses Amazon Simple Email Service (Amazon SES) to receive ship tickets, which are sent directly from drivers in the field. The system processes emails at scale using an event-based architecture centered on Amazon S3 Event Notifications. The workflow sends ship ticket documents to an automatic scaling compute job orchestrator. Documents are processed with the following steps:

The system sends PDF files to Amazon Textract using the Start Document Analysis API with Layout and Signature features.
Amazon Textract results are processed by an AWS Lambda microservice. This microservice resolves rotation issues with page text and generates a collection of pages of markdown representation of the text.
The markdown is passed to Amazon Bedrock, which efficiently extracts key values from the markdown text.
The orchestrator saves the results to their Amazon Relational Database Service (Amazon RDS) for PostgreSQL database.

The following diagram illustrates the solution architecture.

In this architecture, Amazon Textract is an effective solution to handle large PDF files at scale. The output of Amazon Textract contains the necessary geometries used to calculate rotation and fix layout issues before generating markdown. Quality markdown layouts are critical for Amazon Bedrock in identifying the right key-value pairs from the content. We further optimized cost by extracting only the data needed to limit output tokens and by using Amazon Bedrock batch processing to get the lowest token cost. Amazon Bedrock was used for its cost-effectiveness and its ability to process variably formatted shipping tickets where the fields that need to be extracted are the same.
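The following is a minimal sketch of the Textract step described above: start an asynchronous document analysis job with the Layout and Signatures features, then fetch the results. The bucket and object names are placeholders, and a production pipeline would typically use the SNS completion notification rather than polling.

import time
import boto3

textract = boto3.client("textract")

job = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "example-ship-tickets", "Name": "ticket-0001.pdf"}},
    FeatureTypes=["LAYOUT", "SIGNATURES"],
)

while True:
    result = textract.get_document_analysis(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Blocks carry the geometry needed to detect rotation and rebuild reading order
# before converting the page text to markdown for Amazon Bedrock.
for block in result.get("Blocks", [])[:10]:
    print(block["BlockType"], block.get("Text", ""))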
Results
The implementation using this architecture on AWS brought numerous benefits to Oldcastle:

Business process improvement – The solution accomplished the following:

Alleviated the need for manual processing of ship tickets at each facility
Automated document processing with minimal human intervention
Improved accuracy and reliability of data extraction
Enhanced ability to validate signatures and reject incomplete documents
Provided real-time visibility into outstanding PODs and deliveries

Productivity gains – Oldcastle saw the following benefits:

Significantly fewer human hours were spent on manual data entry and document processing
Staff had more time for more value-added activities
The IT team benefited from reduced development and maintenance efforts

Scalability and performance – The team experienced the following performance gains:

They seamlessly scaled from processing a few thousand documents to 200,000–300,000 documents per month
The team observed no performance issues with increased volume

User satisfaction – The solution improved user sentiment in several ways:

High user confidence in the new system due to its accuracy and reliability
Positive feedback from business users on the ease of use and effectiveness

Cost-effective – With this approach, Oldcastle can process documents at less than $0.04 per page

Conclusion
With the success of the AWS implementation, Oldcastle is exploring potential expansion to other use cases such as AP invoice processing, W9 form validation, and automated document approval workflows. This strategic move towards AI-powered document processing is positioning Oldcastle for improved efficiency and scalability in its operations.
Review your current manual document processing procedures and identify where intelligent document processing can help you automate these workflows for your business.
For further exploration and learning, we recommend checking out the following resources:

Intelligent Document Processing on AWS
Automate document processing with Amazon Bedrock Prompt Flows
Intelligent Document Processing with Generative AI

About the authors
Erik Cordsen is a Solutions Architect at AWS serving customers in Georgia. He is passionate about applying cloud technologies and ML to solve real life problems. When he is not designing cloud solutions, Erik enjoys travel, cooking, and cycling.
Sourabh Jain is a Senior Solutions Architect with over 8 years of experience developing cloud solutions that drive better business outcomes for organizations worldwide. He specializes in architecting and implementing robust cloud software solutions, with extensive experience working alongside global Fortune 500 teams across diverse time zones and cultures.
Avdhesh Paliwal is an accomplished Application Architect at Oldcastle APG with 29 years of extensive ERP experience. His expertise spans Manufacturing, Supply Chain, and Human Resources modules, with a proven track record of designing and implementing enterprise solutions that drive operational efficiency and business value.

How London Stock Exchange Group is detecting market abuse with their A …

London Stock Exchange Group (LSEG) is a global provider of financial markets data and infrastructure. It operates the London Stock Exchange and manages international equity, fixed income, and derivative markets. The group also develops capital markets software, offers real-time and reference data products, and provides extensive post-trade services. This post was co-authored with Charles Kellaway and Rasika Withanawasam of LSEG.
Financial markets are remarkably complex, hosting increasingly dynamic investment strategies across new asset classes and interconnected venues. Accordingly, regulators place great emphasis on the ability of market surveillance teams to keep pace with evolving risk profiles. However, the landscape is vast; London Stock Exchange alone facilitates the trading and reporting of over £1 trillion of securities by 400 members annually. Effective monitoring must cover all MiFID asset classes, markets and jurisdictions to detect market abuse, while also giving weight to participant relationships, and market surveillance systems must scale with volumes and volatility. As a result, many systems are outdated and unsatisfactory for regulatory expectations, requiring manual and time-consuming work.
To address these challenges, London Stock Exchange Group (LSEG) has developed an innovative solution using Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models from leading AI companies, to automate and enhance their market surveillance capabilities. LSEG’s AI-powered Surveillance Guide helps analysts efficiently review trades flagged for potential market abuse by automatically analyzing news sensitivity and its impact on market behavior.
In this post, we explore how LSEG used Amazon Bedrock and Anthropic’s Claude foundation models to build an automated system that significantly improves the efficiency and accuracy of market surveillance operations.
The challenge
Currently, LSEG’s surveillance monitoring systems generate automated, customized alerts to flag suspicious trading activity to the Market Supervision team. Analysts then conduct initial triage assessments to determine whether the activity warrants further investigation, which might require undertaking differing levels of qualitative analysis. This could involve manual collation of all and any evidence that might be applicable when methodically corroborating regulation, news, sentiment and trading activity. For example, during an insider dealing investigation, analysts are alerted to statistically significant price movements. The analyst must then conduct an initial assessment of related news during the observation period to determine if the highlighted price move has been caused by specific news and its likely price sensitivity. This initial step in assessing the presence, or absence, of price sensitive news guides the subsequent actions an analyst will take with a possible case of market abuse.
Initial triaging can be a time-consuming and resource-intensive process and still necessitate a full investigation if the identified behavior remains potentially suspicious or abusive.
Moreover, the dynamic nature of financial markets and evolving tactics and sophistication of bad actors demand that market facilitators revisit automated rules-based surveillance systems. The increasing frequency of alerts and high number of false positives adversely impact an analyst’s ability to devote quality time to the most meaningful cases, and such heightened emphasis on resources could result in operational delays.
Solution overview
To address these challenges, LSEG collaborated with AWS to improve insider dealing detection, developing a generative AI prototype that automatically predicts the probability of news articles being price sensitive. The system employs Anthropic’s Claude Sonnet 3.5 model—the most price-performant model at the time—through Amazon Bedrock to analyze news content from LSEG’s Regulatory News Service (RNS) and classify articles based on their potential market impact. The results support analysts to more quickly determine whether highlighted trading activity can be mitigated during the observation period.
The architecture consists of three main components:

A data ingestion and preprocessing pipeline for RNS articles
Amazon Bedrock integration for news analysis using Claude Sonnet 3.5
Inference application for visualising results and predictions

The following diagram illustrates the conceptual approach:

The workflow processes news articles through the following steps:

Ingest raw RNS news documents in HTML format
Preprocess and extract clean news text
Fill the classification prompt template with text from the news documents
Prompt Anthropic’s Claude Sonnet 3.5 through Amazon Bedrock
Receive and process model predictions and justifications
Present results through the visualization interface developed using Streamlit

Methodology
The team collated a comprehensive dataset of approximately 250,000 RNS articles spanning 6 consecutive months of trading activity in 2023. The raw data—HTML documents from RNS—were initially pre-processed within the AWS environment by removing extraneous HTML elements and reformatting the documents to extract clean textual content. Having isolated substantive news content, the team subsequently carried out exploratory data analysis to understand distribution patterns within the RNS corpus, focused on three dimensions:

News categories: Distribution of articles across different regulatory categories
Instruments: Financial instruments referenced in the news articles
Article length: Statistical distribution of document sizes

Exploration provided contextual understanding of the news landscape and informed the sampling strategy in creating a representative evaluation dataset. 110 articles were selected to cover major news categories, and this curated subset was presented to market surveillance analysts who, as domain experts, evaluated each article’s price sensitivity on the following nine-point scale:

1–3: PRICE_NOT_SENSITIVE – Low probability of price sensitivity
4–6: HARD_TO_DETERMINE – Uncertain price sensitivity
7–9: PRICE_SENSITIVE – High probability of price sensitivity

The experiment was executed within Amazon SageMaker using Jupyter Notebooks as the development environment. The technical stack consisted of:

Instructor library: Provided integration capabilities with Anthropic’s Claude Sonnet 3.5 model in Amazon Bedrock
Amazon Bedrock: Served as the API infrastructure for model access
Custom data processing pipelines (Python): For data ingestion and preprocessing

This infrastructure enabled systematic experimentation with various algorithmic approaches, including traditional supervised learning methods, prompt engineering with foundation models, and fine-tuning scenarios.
The evaluation framework established specific technical success metrics:

Data pipeline implementation: Successful ingestion and preprocessing of RNS data
Metric definition: Clear articulation of precision, recall, and F1 metrics
Workflow completion: Execution of comprehensive exploratory data analysis (EDA) and experimental workflows

The analytical approach was a two-step classification process:

Step 1: Classify news articles as potentially price sensitive or other
Step 2: Classify news articles as potentially price not sensitive or other

This multi-stage architecture was designed to maximize classification accuracy by allowing analysts to focus on specific aspects of price sensitivity at each stage. The results from each step were then merged to produce the final output, which was compared with the human-labeled dataset to generate quantitative results.
To consolidate the results from both classification steps, the following data merging rules were applied:

Step 1: Sensitive + Step 2: Other → Final: Sensitive
Step 1: Other + Step 2: Non-sensitive → Final: Non-sensitive
Step 1: Other + Step 2: Other → Final: Ambiguous (Hard to Determine, requires manual review)
Step 1: Sensitive + Step 2: Non-sensitive → Final: Ambiguous (Hard to Determine, requires manual review)
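As a minimal sketch, the merging rules above reduce to a small piece of deterministic logic; the label strings and function name here are illustrative, not LSEG's implementation.

def merge_classifications(step1: str, step2: str) -> str:
    """step1 is 'SENSITIVE' or 'OTHER'; step2 is 'NOT_SENSITIVE' or 'OTHER'."""
    if step1 == "SENSITIVE" and step2 == "OTHER":
        return "PRICE_SENSITIVE"
    if step1 == "OTHER" and step2 == "NOT_SENSITIVE":
        return "PRICE_NOT_SENSITIVE"
    # Both steps abstained, or they disagree: escalate to an analyst.
    return "HARD_TO_DETERMINE"

print(merge_classifications("SENSITIVE", "OTHER"))          # PRICE_SENSITIVE
print(merge_classifications("OTHER", "NOT_SENSITIVE"))      # PRICE_NOT_SENSITIVE
print(merge_classifications("SENSITIVE", "NOT_SENSITIVE"))  # HARD_TO_DETERMINE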

Based on the insights gathered, prompts were optimized. The prompt templates elicited three key components from the model:

A concise summary of the news article
A price sensitivity classification
A chain-of-thought explanation justifying the classification decision

The following is an example prompt:

system_non_sensitive = """
You are an expert financial analyst with deep knowledge of market dynamics, investor
psychology, and the intricate relationships between news events and asset prices.
Your core function is to analyze news articles and assess their likelihood of being
non-price sensitive with unparalleled accuracy and insight.
Key aspects of your expertise include:
1. Market Dynamics: You have a comprehensive understanding of how financial markets
operate, including the factors that typically drive price movements and those that
are often overlooked by the market.
2. Investor Psychology: You possess keen insight into how different types of news affect
investor sentiment and decision-making, particularly in distinguishing between
information that causes reactions and information that doesn’t.
3. News Analysis: You excel at dissecting financial news articles, identifying key
elements, and determining their relevance (or lack thereof) to asset valuations and
market movements.
4. Pattern Recognition: You can draw upon a vast knowledge of historical market
reactions to various types of news, allowing you to identify patterns of
non-impactful information.
5. Sector-Specific Knowledge: You understand the nuances of different industry sectors
and how the importance of news can vary across them.
6. Regulatory Insight: You’re well-versed in financial regulations and can identify when
news does or doesn’t meet thresholds for material information.
7. Macroeconomic Perspective: You can place company-specific news in the broader context
of economic trends and assess whether it’s likely to be overshadowed by larger market
forces.
8. Quantitative Skills: You can evaluate financial metrics and understand when changes or
announcements related to them are significant enough to impact prices.
Your primary task is to analyze given news articles and determine, with a high degree of
confidence, whether they are likely to be non-price sensitive. This involves:
– Carefully examining the content and context of each news item
– Assessing its potential (or lack thereof) to influence investor decisions
– Considering both short-term and long-term implications
– Providing clear, well-reasoned justifications for your assessments
– Identifying key factors that support your conclusion
– Recommending further information that could enhance the analysis
– Offering insights that can help traders make more informed decisions
You should always maintain a conservative approach, erring on the side of caution. If
there’s any reasonable doubt about whether news could be price-sensitive, you should
classify it as ‘OTHER’ rather than ‘NOT_PRICE_SENSITIVE’.
Your analyses should be sophisticated yet accessible, catering to both experienced
traders and those new to the market. Always strive for objectivity, acknowledging any
uncertainties or limitations in your assessment.
Remember, your insights play a crucial role in helping traders filter out market noise
and focus on truly impactful information, ultimately contributing to more effective
and educated trading decisions.
"""

The solution was optimized to maximize:

Precision for the NOT SENSITIVE class
Recall for the PRICE SENSITIVE class

This optimization strategy was deliberate, facilitating high confidence in non-sensitive classifications to reduce unnecessary escalations to human analysts (in other words, to reduce false positives). Through this methodical approach, prompts were iteratively refined while maintaining rigorous evaluation standards through comparison against the expert-annotated baseline data.
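To make the evaluation targets concrete, the small sketch below computes precision on the non-sensitive class and recall on the price-sensitive class against analyst labels using scikit-learn. The label arrays are invented purely for demonstration and are not LSEG data.

from sklearn.metrics import precision_score, recall_score

labels      = ["PRICE_SENSITIVE", "PRICE_SENSITIVE", "NOT_SENSITIVE", "NOT_SENSITIVE", "PRICE_SENSITIVE"]
predictions = ["PRICE_SENSITIVE", "PRICE_SENSITIVE", "NOT_SENSITIVE", "PRICE_SENSITIVE", "PRICE_SENSITIVE"]

not_sensitive_precision = precision_score(labels, predictions, pos_label="NOT_SENSITIVE")
price_sensitive_recall = recall_score(labels, predictions, pos_label="PRICE_SENSITIVE")

print(f"NOT_SENSITIVE precision: {not_sensitive_precision:.2f}")  # 1.00 in this toy example
print(f"PRICE_SENSITIVE recall:  {price_sensitive_recall:.2f}")   # 1.00 in this toy example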
Key benefits and results
Over a 6-week period, Surveillance Guide demonstrated remarkable accuracy when evaluated on a representative sample dataset. Key achievements include the following:

100% precision in identifying non-sensitive news, allocating 6 articles to this category that analysts confirmed were non-price-sensitive
100% recall in detecting price-sensitive content, allocating the 36 hard-to-determine and 28 price-sensitive articles labelled by analysts into one of these two categories (never misclassifying price-sensitive content)
Automated analysis of complex financial news
Detailed justifications for classification decisions
Effective triaging of results by sensitivity level

In this implementation, LSEG has employed Amazon Bedrock so that they can use secure, scalable access to foundation models through a unified API, minimizing the need for direct model management and reducing operational complexity. Because of the serverless architecture of Amazon Bedrock, LSEG can take advantage of dynamic scaling of model inference capacity based on news volume, while maintaining consistent performance during market-critical periods. Its built-in monitoring and governance features support reliable model performance and maintain audit trails for regulatory compliance.
Impact on market surveillance
This AI-powered solution transforms market surveillance operations by:

Reducing manual review time for analysts
Improving consistency in price-sensitivity assessment
Providing detailed audit trails through automated justifications
Enabling faster response to potential market abuse cases
Scaling surveillance capabilities without proportional resource increases

The system’s ability to process news articles instantly and provide detailed justifications helps analysts focus their attention on the most critical cases while maintaining comprehensive market oversight.
Proposed next steps
LSEG plans to first enhance the solution, for internal use, by:

Integrating additional data sources, including company financials and market data
Implementing few-shot prompting and fine-tuning capabilities
Expanding the evaluation dataset for continued accuracy improvements
Deploying in live environments alongside manual processes for validation
Adapting to additional market abuse typologies

Conclusion
LSEG’s Surveillance Guide demonstrates how generative AI can transform market surveillance operations. Powered by Amazon Bedrock, the solution improves efficiency and enhances the quality and consistency of market abuse detection.
As financial markets continue to evolve, AI-powered solutions architected along similar lines will become increasingly important for maintaining integrity and compliance. AWS and LSEG are intent on being at the forefront of this change.
The selection of Amazon Bedrock as the foundation model service provides LSEG with the flexibility to iterate on their solution while maintaining enterprise-grade security and scalability. To learn more about building similar solutions with Amazon Bedrock, visit the Amazon Bedrock documentation or explore other financial services use cases in the AWS Financial Services Blog.

About the authors
Charles Kellaway is a Senior Manager in the Equities Trading team at LSE plc, based in London. With a background spanning both Equity and Insurance markets, Charles specialises in deep market research and business strategy, with a focus on deploying technology to unlock liquidity and drive operational efficiency. His work bridges the gap between finance and engineering, and he always brings a cross-functional perspective to solving complex challenges.
Rasika Withanawasam is a seasoned technology leader with over two decades of experience architecting and developing mission-critical, scalable, low-latency software solutions. Rasika’s core expertise lies in big data and machine learning applications, focusing intently on FinTech and RegTech sectors. He has held several pivotal roles at LSEG, including Chief Product Architect for the flagship Millennium Surveillance and Millennium Analytics platforms, and currently serves as Manager of the Quantitative Surveillance & Technology team, where he leads AI/ML solution development.
Richard Chester is a Principal Solutions Architect at AWS, advising large Financial Services organisations. He has 25+ years’ experience across the Financial Services Industry where he has held leadership roles in transformation programs, DevOps engineering, and Development Tooling. Since moving across to AWS from being a customer, Richard is now focused on driving the execution of strategic initiatives, mitigating risks and tackling complex technical challenges for AWS customers.

MBZUAI Researchers Release K2 Think: A 32B Open-Source System for Adva …

A team of researchers from MBZUAI’s Institute of Foundation Models and G42 released K2 Think, a 32B-parameter open reasoning system for advanced AI reasoning. It pairs long chain-of-thought supervised fine-tuning with reinforcement learning from verifiable rewards, agentic planning, test-time scaling, and inference optimizations (speculative decoding + wafer-scale hardware). The result is frontier-level math performance with markedly lower parameter count and competitive results on code and science—together with a transparent, fully open release spanning weights, data, and code.

System overview

K2 Think is built by post-training an open-weight Qwen2.5-32B base model and adding a lightweight test-time compute scaffold. The design emphasizes parameter efficiency: a 32B backbone is deliberately chosen to enable fast iteration and deployment while leaving headroom for post-training gains. The core recipe combines six “pillars”: (1) Long chain-of-thought (CoT) supervised fine-tuning; (2) Reinforcement Learning with Verifiable Rewards (RLVR); (3) agentic planning before solving; (4) test-time scaling via best-of-N selection with verifiers; (5) speculative decoding; and (6) inference on a wafer-scale engine.

The goals are straightforward: raise pass@1 on competition-grade math benchmarks, maintain strong code/science performance, and keep response length and wall-clock latency under control through plan-before-you-think prompting and hardware-aware inference.

Pillar 1: Long CoT SFT

Phase-1 SFT uses curated, long chain-of-thought traces and instruction/response pairs spanning math, code, science, instruction following, and general chat (AM-Thinking-v1-Distilled). The effect is to teach the base model to externalize intermediate reasoning and adopt a structured output format. Rapid pass@1 gains occur early (≈0.5 epoch), with AIME’24 stabilizing around ~79% and AIME’25 around ~72% on the SFT checkpoint before RL, indicating convergence.

Pillar 2: RL with Verifiable Rewards

K2 Think then trains with RLVR on Guru, a ~92k-prompt, six-domain dataset (Math, Code, Science, Logic, Simulation, Tabular) designed for verifiable end-to-end correctness. The implementation uses the verl library with a GRPO-style policy-gradient algorithm. Notable observation: starting RL from a strong SFT checkpoint yields modest absolute gains and can plateau/degenerate, whereas applying the same RL recipe directly on the base model shows large relative improvements (e.g., ~40% on AIME’24 over training), supporting a trade-off between SFT strength and RL headroom.

A second ablation shows multi-stage RL with a reduced initial context window (e.g., 16k → 32k) underperforms—failing to recover the SFT baseline—suggesting that reducing max sequence length below the SFT regime can disrupt learned reasoning patterns.

Pillars 3–4: Agentic “Plan-Before-You-Think” and Test-time Scaling

At inference, the system first elicits a compact plan before generating a full solution, then performs best-of-N (e.g., N=3) sampling with verifiers to select the most likely-correct answer. Two effects are reported: (i) consistent quality gains from the combined scaffold; and (ii) shorter final responses despite the added plan—average token counts drop across benchmarks, with reductions up to ~11.7% (e.g., Omni-HARD), and overall lengths comparable to much larger open models. This matters for both latency and cost.
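The best-of-N step can be pictured with the small sketch below: sample N candidates, score each with a verifier, and keep the highest-scoring one. The generator, verifier, and N=3 are toy stand-ins; K2 Think's actual prompts and verifiers are not reproduced here.

import random
from typing import Callable, List

def best_of_n(generate: Callable[[str], str],
              verify: Callable[[str, str], float],
              prompt: str,
              n: int = 3) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [verify(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

# Toy stand-ins: a solver that sometimes errs, and a verifier that checks 17 * 24.
def toy_generate(prompt: str) -> str:
    return str(17 * 24 + random.choice([0, 0, 1, -2]))

def toy_verify(prompt: str, answer: str) -> float:
    return 1.0 if answer == "408" else 0.0

print(best_of_n(toy_generate, toy_verify, "Compute 17 * 24."))  # usually "408"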

Table-level analysis shows K2 Think’s response lengths are shorter than Qwen3-235B-A22B and in the same range as GPT-OSS-120B on math; after adding plan-before-you-think and verifiers, K2 Think’s average tokens fall versus its own post-training checkpoint (e.g., AIME’24 −6.7%, AIME’25 −3.9%, HMMT25 −7.2%, Omni-HARD −11.7%, LCBv5 −10.5%, GPQA-D −2.1%).

Pillars 5–6: Speculative decoding and wafer-scale inference

K2 Think targets Cerebras Wafer-Scale Engine inference with speculative decoding, advertising per-request throughput upwards of 2,000 tokens/sec, which makes the test-time scaffold practical for production and research loops. The hardware-aware inference path is a central part of the release and aligns with the system’s “small-but-fast” philosophy.

https://k2think-about.pages.dev/assets/tech-report/K2-Think_Tech-Report.pdf

Evaluation protocol

Benchmarking covers competition-level math (AIME’24, AIME’25, HMMT’25, Omni-MATH-HARD), code (LiveCodeBench v5; SciCode sub/main), and science knowledge/reasoning (GPQA-Diamond; HLE). The research team reports a standardized setup: max generation length 64k tokens, temperature 1.0, top-p 0.95, stop marker </answer>, and each score as an average of 16 independent pass@1 evaluations to reduce run-to-run variance.
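A trivial sketch of the scoring protocol: each reported benchmark number is the mean of 16 independent pass@1 runs. The per-run accuracies below are invented purely to show the averaging, not actual K2 Think results.

import statistics

runs = [0.90, 0.92, 0.88, 0.91, 0.93, 0.89, 0.90, 0.92,
        0.91, 0.90, 0.89, 0.93, 0.92, 0.90, 0.91, 0.88]  # 16 pass@1 evaluations

assert len(runs) == 16
print(f"Reported score: {statistics.mean(runs):.4f} (stdev {statistics.stdev(runs):.4f})")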

https://k2think-about.pages.dev/assets/tech-report/K2-Think_Tech-Report.pdf

Results

Math (micro-average across AIME’24/’25, HMMT25, Omni-HARD). K2 Think reaches 67.99, leading the open-weight cohort and comparing favorably even to much larger systems; it posts 90.83 (AIME’24), 81.24 (AIME’25), 73.75 (HMMT25), and 60.73 on Omni-HARD—the latter being the most difficult split. The positioning is consistent with strong parameter efficiency relative to DeepSeek V3.1 (671B) and GPT-OSS-120B (120B).

Code. LiveCodeBench v5 score is 63.97, exceeding similarly sized peers and even larger open models (e.g., > Qwen3-235B-A22B at 56.64). On SciCode, K2 Think is 39.2/12.0 (sub/main), tracking the best open systems closely on sub-problem accuracy.

Science. GPQA-Diamond reaches 71.08; HLE is 9.95. The model is not just a math specialist: it stays competitive across knowledge-heavy tasks.

https://k2think-about.pages.dev/assets/tech-report/K2-Think_Tech-Report.pdf

https://k2think-about.pages.dev/assets/tech-report/K2-Think_Tech-Report.pdf

Key numbers at a glance

Backbone: Qwen2.5-32B (open weight), post-trained with long CoT SFT + RLVR (GRPO via verl).

RL data: Guru (~92k prompts) across Math/Code/Science/Logic/Simulation/Tabular.

Inference scaffold: Plan-before-you-think + BoN with verifiers; shorter outputs (e.g., −11.7% tokens on Omni-HARD) at higher accuracy.

Throughput target: ~2,000 tok/s on Cerebras WSE with speculative decoding.

Math micro-avg: 67.99 (AIME’24 90.83, AIME’25 81.24, HMMT’25 73.75, Omni-HARD 60.73).

Code/Science: LCBv5 63.97; SciCode 39.2/12.0; GPQA-D 71.08; HLE 9.95.

Safety-4 macro: 0.75 (Refusal 0.83, Conv. Robustness 0.89, Cybersecurity 0.56, Jailbreak 0.72).

Summary

K2 Think demonstrates that integrative post-training + test-time compute + hardware-aware inference can close much of the gap to larger, proprietary reasoning systems. At 32B, it is tractable to fine-tune and serve; with plan-before-you-think and BoN-with-verifiers, it controls token budgets; with speculative decoding on wafer-scale hardware, it reaches ~2k tok/s per request. K2 Think is presented as a fully open system—weights, training data, deployment code, and test-time optimization code.

Check out the Paper, Model on Hugging Face, GitHub and Direct Access. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post MBZUAI Researchers Release K2 Think: A 32B Open-Source System for Advanced AI Reasoning and Outperforms 20x Larger Reasoning Models appeared first on MarkTechPost.