Perplexity Launches an AI Email Assistant Agent for Gmail and Outlook, Aimed at Scheduling, Drafting, and Inbox Triage

Perplexity introduced “Email Assistant,” an AI agent that plugs into Gmail and Outlook to draft replies in your voice, auto-label and prioritize messages, and coordinate meetings end-to-end (availability checks, time suggestions, and calendar invites). The feature is restricted to Perplexity’s Max plan and is live today.

What does it do?

Email Assistant adds an agent to any thread (via cc) that handles the back-and-forth typical of scheduling. It reads availability, proposes times, and issues invites, while also surfacing daily priorities and generating reply drafts aligned to the user’s tone. Launch support covers Gmail and Outlook with one-click setup links.

https://www.perplexity.ai/assistant

How does it plug into calendars and mail?

Perplexity has been shipping native connectors for Google and Microsoft stacks; the current changelog notes that Gmail/Gcal/Outlook connections support email search and “create calendar invites directly within Perplexity,” which is what the Email Assistant automates from within a live thread. Practically, users enroll, then send or cc assistant@perplexity.com to delegate scheduling and triage tasks.


Security posture

Perplexity specifies SOC 2 and GDPR compliance and says user data is not used for training. For teams evaluating agents in regulated environments, that implies standard audit controls and data-handling boundaries, but as always, production rollouts should validate data-access scopes and DLP posture in the target tenant.

Competitive context

Email Assistant overlaps with Microsoft Copilot for Outlook and Google Gemini for Gmail (summaries/assists). Perplexity’s differentiator is agentic handling of the entire negotiation loop inside email threads plus cross-account connectors already present in its Comet stack. That makes it a realistic drop-in for users who prefer an external agent rather than suite-native assistants.

Early read for implementers

Integration path: Connect Gmail/Outlook, then cc the agent on threads that need scheduling; use it for triage queries and auto-drafts.

Workflow coverage: Auto-labels for “needs reply” vs. FYI; daily summaries; draft-in-your-style replies; invite creation.

Boundary conditions: Max-only; launch support limited to Gmail/Outlook; verify calendar write permissions and compliance needs per domain.

Summary

Perplexity’s Email Assistant is a concrete agentic workflow for inboxes: cc it, let it negotiate times, send invites, and keep your triage queue lean—currently gated to Max subscribers and Gmail/Outlook environments.

Try it here.


The post Perplexity Launches an AI Email Assistant Agent for Gmail and Outlook, Aimed at Scheduling, Drafting, and Inbox Triage appeared first on MarkTechPost.

Microsoft Brings MCP to Azure Logic Apps (Standard) in Public Preview, Turning Connectors into Agent Tools

Microsoft has released a public preview that enables Azure Logic Apps (Standard) to run as Model Context Protocol (MCP) servers, exposing Logic Apps workflows as agent tools discoverable and callable by MCP-capable clients (e.g., VS Code + Copilot).

What’s actually shipping

Remote MCP server on Logic Apps (Standard): You configure a Standard logic app to host an MCP endpoint (/api/mcp) and surface HTTP Request/Response workflows as tools. Authentication is front-doored by Easy Auth; MCP endpoints default to OAuth 2.0. VS Code (≥1.102) includes GA MCP client support for testing.

API Center registration path (preview): You can also create/register MCP servers in Azure API Center, where selected managed connector actions become tools with cataloging and governance.

https://learn.microsoft.com/en-us/azure/logic-apps/set-up-model-context-protocol-server-standard

Key requirements and transport details

Workflow shape: Tools must be implemented as HTTP Request trigger (“When a HTTP request is received”) plus a Response action.

Auth & access control: By default, MCP uses OAuth 2.0; Easy Auth enforces client/identity/tenant restrictions. During setup, App Service authentication must allow unauthenticated requests (the MCP flow still performs OAuth).

Transports: Streamable HTTP works out of the box. SSE additionally requires VNET integration and the host.json setting Runtime.Backend.EdgeWorkflowRuntimeTriggerListener.AllowCrossWorkerCommunication=true. (A streamable-HTTP client sketch follows this list.)

Enablement switch: MCP APIs are enabled by adding extensions.workflow.McpServerEndpoints.enable=true in host.json.
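For a quick smoke test outside VS Code, the open-source MCP Python SDK (the mcp package) can exercise such an endpoint over streamable HTTP. The sketch below is illustrative only: the endpoint URL and tool name are placeholders, and acquiring a token for the Easy Auth/OAuth front door is assumed to happen out of band:

# Minimal MCP client sketch (assumes the `mcp` Python SDK is installed and that a
# bearer token for the Easy Auth-protected endpoint was obtained separately).
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

MCP_URL = "https://<your-logic-app>.azurewebsites.net/api/mcp"  # placeholder host
HEADERS = {"Authorization": "Bearer <access-token>"}             # placeholder token

async def main():
    async with streamablehttp_client(MCP_URL, headers=HEADERS) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)
            # Call one of the workflows exposed as a tool (hypothetical tool name and args).
            result = await session.call_tool("create_ticket", {"subject": "smoke test"})
            print(result)

asyncio.run(main())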

API Center path: preview limitations that matter

When creating MCP servers via API Center backed by Logic Apps, the current preview imposes the following limits:

Start with an empty Standard logic app resource.

One connector per MCP server.

Built-in service-provider and custom connectors aren’t supported in this path (managed connectors only).

One action per tool.

These constraints materially affect tool granularity and server layout in larger estates.

Why is Standard (single-tenant) the target?

Standard runs on the single-tenant Logic Apps runtime (on Azure Functions), supports multiple workflows per app, and integrates directly with virtual networks and private endpoints—all relevant for exposing private systems safely to agents and for predictable throughput/latency. By contrast, Consumption is multitenant, single-workflow per app, and pay-per-execution.

Tooling semantics and discoverability

Microsoft recommends adding trigger descriptions, parameter schemas/descriptions, and required markers to improve agent tool selection and invocation reliability. These annotations are read by MCP clients and influence calling behavior.

Connectors and enterprise reach

Organizations can front existing workflows and a large catalog of Logic Apps connectors (cloud and on-prem) through MCP, turning them into callable agent tools; Microsoft explicitly cites “more than 1,400 connectors.”

Operations, governance, and testing

Run history plus Application Insights/Log Analytics are available for diagnostics and auditability. VS Code provides quick client validation via MCP: Add Server, including OAuth sign-in and tool enumeration. Registering via API Center brings discovery/governance to MCP servers across teams.

Production notes (preview)

SSE requires both VNET and the cross-worker setting; without these, use streamable HTTP.

Easy Auth must be configured precisely (including the “allow unauthenticated” toggle) or client sign-in flows will fail despite OAuth expectations.

Throttling, idempotency, and schema versioning remain your responsibility when wrapping connectors as tools (not new, but now in the agent path). InfoQ highlights similar operational concerns from early adopters.

Summary

The preview cleanly MCP-enables Logic Apps (Standard): you expose HTTP-based workflows as OAuth-protected tools; you can catalog them in API Center; and you can reach private systems through single-tenant networking. For teams already invested in Logic Apps, this is a low-friction, standards-aligned route to operationalize enterprise agent tooling—just mind the API Center limits, SSE prerequisites, and Easy Auth nuances during rollout.

Check out more details here.


The post Microsoft Brings MCP to Azure Logic Apps (Standard) in Public Preview, Turning Connectors into Agent Tools appeared first on MarkTechPost.

Alibaba Qwen Team Just Released FP8 Builds of Qwen3-Next-80B-A3B (Instruct & Thinking), Bringing 80B/3B-Active Hybrid-MoE to Commodity GPUs

Alibaba’s Qwen team has just released FP8-quantized checkpoints for its new Qwen3-Next-80B-A3B models in two post-training variants—Instruct and Thinking—aimed at high-throughput inference with ultra-long context and MoE efficiency. The FP8 repos mirror the BF16 releases but package “fine-grained FP8” weights (block size 128) and deployment notes for sglang and vLLM nightly builds. Benchmarks in the cards remain those of the original BF16 models; FP8 is provided “for convenience and performance,” not as a separate evaluation run.

What’s in the A3B stack

Qwen3-Next-80B-A3B is a hybrid architecture combining Gated DeltaNet (a linear/conv-style attention surrogate) with Gated Attention, interleaved with an ultra-sparse Mixture-of-Experts (MoE). The 80B total parameter budget activates ~3B params per token via 512 experts (10 routed + 1 shared). The layout is specified as 48 layers arranged into 12 blocks: 3×(Gated DeltaNet → MoE) followed by 1×(Gated Attention → MoE). Native context is 262,144 tokens, validated up to ~1,010,000 tokens using RoPE scaling (YaRN). Hidden size is 2048; attention uses 16 Q heads and 2 KV heads at head dim 256; DeltaNet uses 32 V and 16 QK linear heads at head dim 128.
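A rough way to visualize that layout is to enumerate the mixer/MoE pairs per block; the snippet below only mirrors the counts quoted above (the module names are descriptive labels, not Qwen's actual classes):

# Enumerate the 48-layer hybrid layout: 12 blocks, each 3x(Gated DeltaNet -> MoE)
# followed by 1x(Gated Attention -> MoE). Purely illustrative bookkeeping.
deltanet = ("gated_deltanet", {"v_heads": 32, "qk_heads": 16, "head_dim": 128})
attention = ("gated_attention", {"q_heads": 16, "kv_heads": 2, "head_dim": 256})
moe = ("sparse_moe", {"experts": 512, "routed_per_token": 10, "shared": 1})

layers = []
for _ in range(12):                      # 12 repeating blocks
    for _ in range(3):
        layers.append((deltanet, moe))   # linear-attention sublayer + MoE FFN
    layers.append((attention, moe))      # full-attention sublayer + MoE FFN

print(len(layers))                       # 48 layers in total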

The Qwen team reports that the 80B-A3B base model outperforms Qwen3-32B on downstream tasks at ~10% of its training cost and delivers ~10× inference throughput beyond 32K context—driven by low activation in MoE and multi-token prediction (MTP). The Instruct variant is non-reasoning (no <think> tags), whereas the Thinking variant enforces reasoning traces by default and is optimized for complex problems.

FP8 releases: what actually changed

The FP8 model cards state the quantization is “fine-grained fp8” with block size 128. Deployment differs slightly from BF16: both sglang and vLLM require current main/nightly builds, with example commands provided for 256K context and optional MTP. The Thinking FP8 card also recommends a reasoning parser flag (e.g., --reasoning-parser deepseek-r1 in sglang, deepseek_r1 in vLLM). These releases retain Apache-2.0 licensing.
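For local experimentation, the same checkpoints can also be loaded through vLLM's offline Python API rather than the documented serve command. The sketch below is a guess at a working configuration: the repo id, parallelism, and context length are assumptions, and a recent nightly build is required per the model card:

# Offline-inference sketch with vLLM (not the official serve command).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",  # assumed HF repo id; check the model card
    tensor_parallel_size=4,                         # size to your available GPUs
    max_model_len=262144,                           # native context per the model card
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize the trade-offs of FP8 inference."], params)
print(outputs[0].outputs[0].text)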

Benchmarks (reported on BF16 weights)

The Instruct FP8 card reproduces Qwen’s BF16 comparison table, putting Qwen3-Next-80B-A3B-Instruct on par with Qwen3-235B-A22B-Instruct-2507 on several knowledge/reasoning/coding benchmarks, and ahead on long-context workloads (up to 256K). The Thinking FP8 card lists AIME’25, HMMT’25, MMLU-Pro/Redux, and LiveCodeBench v6, where Qwen3-Next-80B-A3B-Thinking surpasses earlier Qwen3 Thinking releases (30B A3B-2507, 32B) and claims wins over Gemini-2.5-Flash-Thinking on multiple benchmarks.

Training and post-training signals

The series is trained on ~15T tokens before post-training. Qwen highlights stability additions (zero-centered, weight-decayed layer norm, etc.) and uses GSPO in RL post-training for the Thinking model to handle the hybrid attention + high-sparsity MoE combination. MTP is used to speed inference and improve pretraining signal.

Why does FP8 matter?

On modern accelerators, FP8 activations/weights reduce memory bandwidth pressure and resident footprint versus BF16, allowing larger batch sizes or longer sequences at similar latency. Because A3B routes only ~3B parameters per token, the combination of FP8 + MoE sparsity compounds throughput gains in long-context regimes, particularly when paired with speculative decoding via MTP as exposed in the serving flags. That said, quantization interacts with routing and attention variants; real-world acceptance rates for speculative decoding and end-task accuracy can vary with engine and kernel implementations—hence Qwen’s guidance to use current sglang/vLLM and to tune speculative settings.
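The bandwidth argument is easy to quantify at the level of resident weights; a back-of-the-envelope calculation (ignoring KV cache, activations, and engine overhead) looks like this:

# Rough memory math for the points above (illustrative only).
total_params = 80e9
active_params_per_token = 3e9
bytes_bf16, bytes_fp8 = 2, 1

weights_bf16_gb = total_params * bytes_bf16 / 1e9      # ~160 GB of resident weights
weights_fp8_gb = total_params * bytes_fp8 / 1e9        # ~80 GB of resident weights
active_fraction = active_params_per_token / total_params  # ~3.75% of params routed per token
print(weights_bf16_gb, weights_fp8_gb, f"{active_fraction:.2%}")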

Summary

Qwen’s FP8 releases make the 80B/3B-active A3B stack practical to serve at 256K context on mainstream engines, preserving the hybrid-MoE design and MTP path for high throughput. The model cards keep benchmarks from BF16, so teams should validate FP8 accuracy and latency on their own stacks, especially with reasoning parsers and speculative settings. Net outcome: lower memory bandwidth and improved concurrency without architectural regressions, positioned for long-context production workloads.

Check out the Qwen3-Next-80B-A3B models in two post-training variants—Instruct and Thinking.

The post Alibaba Qwen Team Just Released FP8 Builds of Qwen3-Next-80B-A3B (Instruct & Thinking), Bringing 80B/3B-Active Hybrid-MoE to Commodity GPUs appeared first on MarkTechPost.

Rapid ML experimentation for enterprises with Amazon SageMaker AI and Comet

This post was written with Sarah Ostermeier from Comet.
As enterprise organizations scale their machine learning (ML) initiatives from proof of concept to production, the complexity of managing experiments, tracking model lineage, and managing reproducibility grows exponentially. This is primarily because data scientists and ML engineers constantly explore different combinations of hyperparameters, model architectures, and dataset versions, generating massive amounts of metadata that must be tracked for reproducibility and compliance. As the ML model development scales across multiple teams and regulatory requirements intensify, tracking experiments becomes even more complex. With increasing AI regulations, particularly in the EU, organizations now require detailed audit trails of model training data, performance expectations, and development processes, making experiment tracking a business necessity and not just a best practice.
Amazon SageMaker AI provides the managed infrastructure enterprises need to scale ML workloads, handling compute provisioning, distributed training, and deployment without infrastructure overhead. However, teams still need robust experiment tracking, model comparison, and collaboration capabilities that go beyond basic logging.
Comet is a comprehensive ML experiment management platform that automatically tracks, compares, and optimizes ML experiments across the entire model lifecycle. It provides data scientists and ML engineers with powerful tools for experiment tracking, model monitoring, hyperparameter optimization, and collaborative model development. It also offers Opik, Comet’s open source platform for LLM observability and development.
Comet is available in SageMaker AI as a Partner AI App, as a fully managed experiment management capability, with enterprise-grade security, seamless workflow integration, and a straightforward procurement process through AWS Marketplace.
The combination addresses the needs of an enterprise ML workflow end-to-end, where SageMaker AI handles infrastructure and compute, and Comet provides the experiment management, model registry, and production monitoring capabilities that teams require for regulatory compliance and operational efficiency. In this post, we demonstrate a complete fraud detection workflow using SageMaker AI with Comet, showcasing reproducibility and audit-ready logging needed by enterprises today.
Enterprise-ready Comet on SageMaker AI
Before proceeding to setup instructions, organizations must identify their operating model and based on that, decide how Comet is going to be set up. We recommend implementing Comet using a federated operating model. In this architecture, Comet is centrally managed and hosted in a shared services account, and each data science team maintains fully autonomous environments. Each operating model comes with their own sets of benefits and limitations. For more information, refer to SageMaker Studio Administration Best Practices.
Let’s dive into the setup of Comet in SageMaker AI. Large enterprises generally have the following personas:

Administrators – Responsible for setting up the common infrastructure services and environment for use case teams
Users – ML practitioners from use case teams who use the environments set up by the platform team to solve their business problems

In the following sections, we go through each persona’s journey.
Comet works well with both SageMaker AI and Amazon SageMaker. SageMaker AI provides the Amazon SageMaker Studio integrated development environment (IDE), and SageMaker provides the Amazon SageMaker Unified Studio IDE. For this post, we use SageMaker Studio.
Administrator journey
In this scenario, the administrator receives a request from a team working on a fraud detection use case to provision an ML environment with a fully managed training and experimentation setup. The administrator’s journey includes the following steps:

Follow the prerequisites to set up Partner AI Apps. This sets up permissions for administrators, allowing Comet to assume a SageMaker AI execution role on behalf of the users and additional privileges for managing the Comet subscription through AWS Marketplace.
On the SageMaker AI console, under Applications and IDEs in the navigation pane, choose Partner AI Apps, then choose View details for Comet.

The details are shown, including the contract pricing model for Comet and infrastructure tier estimated costs.

Comet provides subscription options ranging from a 1-month to a 36-month contract. With this contract, users can access Comet in SageMaker. Based on the number of users, the admin can review and select the appropriate instance size for the Comet dashboard server. Comet supports 5–500 users running more than 100 experiment jobs.

Choose Go to Marketplace to subscribe; you are redirected to the Comet listing on AWS Marketplace.
Choose View purchase options.

In the subscription form, provide the required details.

When the subscription is complete, the admin can start configuring Comet.

While deploying Comet, add the project lead of the fraud detection use case team as an admin to manage the admin operations for the Comet dashboard.

It takes a few minutes for the Comet server to be deployed. For more details on this step, refer to Partner AI App provisioning.

Set up a SageMaker AI domain following the steps in Use custom setup for Amazon SageMaker AI. As a best practice, provide a pre-signed domain URL for the use case team member to directly access the Comet UI without logging in to the SageMaker console.
Add the team members to this domain and enable access to Comet while configuring the domain.

Now the SageMaker AI domain is ready for users to log in to and start working on the fraud detection use case.
User journey
Now let’s explore the journey of an ML practitioner from the fraud detection use case. The user completes the following steps:

Log in to the SageMaker AI domain through the pre-signed URL.

You will be redirected to the SageMaker Studio IDE. Your user name and AWS Identity and Access Management (IAM) execution role are preconfigured by the admin.

Create a JupyterLab Space following the JupyterLab user guide.
You can start working on the fraud detection use case by spinning up a Jupyter notebook.

The admin has also set up required access to the data through an Amazon Simple Storage Service (Amazon S3) bucket.

To access Comet APIs, install the comet_ml library and configure the required environment variables as described in Set up the Amazon SageMaker Partner AI Apps SDKs.
To access the Comet UI, choose Partner AI Apps in the SageMaker Studio navigation pane and choose Open for Comet.

Now, let’s walk through the use case implementation.
Solution overview
This use case highlights common enterprise challenges: working with imbalanced datasets (in this example, only 0.17% of transactions are fraudulent), requiring multiple experiment iterations, and maintaining full reproducibility for regulatory compliance. To follow along, refer to the Comet documentation and Quickstart guide for additional setup and API details.
For this use case, we use the Credit Card Fraud Detection dataset. The dataset contains credit card transactions with binary labels representing fraudulent (1) or legitimate (0) transactions. In the following sections, we walk through some of the important sections of the implementation. The entire code of the implementation is available in the GitHub repository.
Prerequisites
As a prerequisite, configure the necessary imports and environment variables for the Comet and SageMaker integration:

# Comet ML for experiment tracking
import os
import comet_ml
from comet_ml import Experiment, API, Artifact
from comet_ml.integration.sagemaker import log_sagemaker_training_job_v1

# Partner AI App settings (replace the placeholders with your values)
os.environ["AWS_PARTNER_APP_AUTH"] = "true"
os.environ["AWS_PARTNER_APP_ARN"] = "<Your_AWS_PARTNER_APP_ARN>"
os.environ["COMET_API_KEY"] = "<Your_Comet_API_Key>"
# To find the API key: on the Comet details page, choose Open Comet, then in the
# top-right corner choose your user menu -> API Key

# Comet ML configuration
COMET_WORKSPACE = "<your-comet-workspace-name>"
COMET_PROJECT_NAME = "<your-comet-project-name>"

Prepare the dataset
One of Comet’s key enterprise features is automatic dataset versioning and lineage tracking. This capability provides full auditability of what data was used to train each model, which is critical for regulatory compliance and reproducibility. Start by loading the dataset:

# Create a Comet Artifact to track our raw dataset
dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["raw"]
)

# Add the raw dataset file to the artifact
dataset_artifact.add_remote(s3_data_path, metadata={
    "dataset_stage": "raw",
    "dataset_split": "not_split",
    "preprocessing": "none"
})

Start a Comet experiment
With the dataset artifact created, you can now start tracking the ML workflow. Creating a Comet experiment automatically begins capturing code, installed libraries, system metadata, and other contextual information in the background. You can log the dataset artifact created earlier in the experiment. See the following code:

# Create a new Comet experiment
experiment_1 = comet_ml.Experiment(
project_name=COMET_PROJECT_NAME,
workspace=COMET_WORKSPACE,
)
# Log the dataset artifact to this experiment for lineage tracking
experiment_1.log_artifact(dataset_artifact)

Preprocess the data
The next steps are standard preprocessing steps, including removing duplicates, dropping unneeded columns, splitting into train/validation/test sets, and standardizing features using scikit-learn’s StandardScaler. We wrap the processing code in preprocess.py and run it as a SageMaker Processing job. See the following code:

# Run SageMaker processing job
processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.t3.medium'
)
processor.run(
    code='preprocess.py',
    inputs=[ProcessingInput(source=s3_data_path, destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination=f's3://{bucket_name}/{processed_data_prefix}')]
)

After you submit the processing job, SageMaker AI launches the compute instances, processes and analyzes the input data, and releases the resources upon completion. The output of the processing job is stored in the S3 bucket specified.
Next, create a new version of the dataset artifact to track the processed data. Comet automatically versions artifacts with the same name, maintaining complete lineage from raw to processed data.

# Create an updated version of the 'fraud-dataset' Artifact for the preprocessed data
preprocessed_dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["preprocessed"],
    metadata={
        "description": "Credit card fraud detection dataset",
        "fraud_percentage": f"{fraud_percentage:.3f}%",
        "dataset_stage": "preprocessed",
        "preprocessing": "StandardScaler + train/val/test split",
    }
)

# Add our train, validation, and test dataset files as remote assets
preprocessed_dataset_artifact.add_remote(
    uri=f's3://{bucket_name}/{processed_data_prefix}',
    logical_path='split_data'
)

# Log the updated dataset to the experiment to track the updates
experiment_1.log_artifact(preprocessed_dataset_artifact)

The Comet and SageMaker AI experiment workflow
Data scientists prefer rapid experimentation; therefore, we organized the workflow into reusable utility functions that can be called multiple times with different hyperparameters while maintaining consistent logging and evaluation across all runs. In this section, we showcase the utility functions along with a brief snippet of the code inside the function:

train() – Spins up a SageMaker model training job using the SageMaker built-in XGBoost algorithm:

# Create SageMaker estimator
estimator = Estimator(
    image_uri=xgboost_image,
    role=execution_role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=model_output_path,
    sagemaker_session=sagemaker_session_obj,
    hyperparameters=hyperparameters_dict,
    max_run=1800  # Maximum training time in seconds
)

# Start training
estimator.fit({
    'train': train_channel,
    'validation': val_channel
})

log_training_job() – Captures the training metadata and metrics and links the model asset to the experiment for complete traceability:

# Log SageMaker training job to Comet
log_sagemaker_training_job_v1(
estimator=training_estimator,
experiment=api_experiment
)

log_model_to_comet() – Links model artifacts to Comet, captures the training metadata, and links the model asset to the experiment for complete traceability:

experiment.log_remote_model(
model_name=model_name,
uri=model_artifact_path,
metadata=metadata
)

deploy_and_evaluate_model() – Performs model deployment and evaluation, and metric logging:

# Deploy to endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge")

# Log metrics and visualizations to Comet
experiment.log_metrics(metrics)
experiment.log_confusion_matrix(matrix=cm, labels=['Normal', 'Fraud'])

# Log ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob_as_np_array)
experiment.log_curve("roc_curve", x=fpr, y=tpr)

The complete prediction and evaluation code is available in the GitHub repository.
Run the experiments
Now you can run multiple experiments by calling the utility functions with different configurations and compare experiments to find the most optimal settings for the fraud detection use case.
For the first experiment, we establish a baseline using standard XGBoost hyperparameters:

# Define hyperparameters for first experiment
hyperparameters_v1 = {
    'objective': 'binary:logistic',  # Binary classification
    'num_round': 100,                # Number of boosting rounds
    'eval_metric': 'auc',            # Evaluation metric
    'learning_rate': 0.15,           # Learning rate
    'booster': 'gbtree'              # Booster algorithm
}

# Train the model
estimator_1 = train(
    model_output_path=f"s3://{bucket_name}/{model_output_prefix}/1",
    execution_role=role,
    sagemaker_session_obj=sagemaker_session,
    hyperparameters_dict=hyperparameters_v1,
    train_channel_loc=train_channel_location,
    val_channel_loc=validation_channel_location
)

# Log the training job and model artifact
log_training_job(experiment_key=experiment_1.get_key(), training_estimator=estimator_1)
log_model_to_comet(experiment=experiment_1,
                   model_name="fraud-detection-xgb-v1",
                   model_artifact_path=estimator_1.model_data,
                   metadata=metadata)

# Deploy and evaluate
deploy_and_evaluate_model(experiment=experiment_1,
                          estimator=estimator_1,
                          X_test_scaled=X_test_scaled,
                          y_test=y_test
)

While running a Comet experiment from a Jupyter notebook, we need to end the experiment to make sure everything is captured and persisted in the Comet server. See the following code: experiment_1.end()
When the baseline experiment is complete, you can run additional experiments with different hyperparameters. Check out the notebook to see the details of both experiments.
When the second experiment is complete, navigate to the Comet UI to compare these two experiment runs.
View Comet experiments in the UI
To access the UI, you can locate the URL in the SageMaker Studio IDE or by executing the code provided in the notebook: experiment_2.url
The following screenshot shows the Comet experiments UI. The experiment details are for illustration purposes only and do not represent a real-world fraud detection experiment.

This concludes the fraud detection experiment.
Clean up
For the experimentation part, SageMaker processing and training infrastructure is ephemeral in nature and shuts down automatically when the job is complete. However, you must still manually clean up a few resources to avoid unnecessary costs:

Shut down the SageMaker JupyterLab Space after use. For instructions, refer to Idle shutdown.
The Comet subscription renews based on the contract chosen. Cancel the contract when there is no further requirement to renew the Comet subscription.

Advantages of SageMaker and Comet integration
Having demonstrated the technical workflow, let’s examine the broader advantages this integration provides.
Streamlined model development
The Comet and SageMaker combination reduces the manual overhead of running ML experiments. While SageMaker handles infrastructure provisioning and scaling, Comet’s automatic logging captures hyperparameters, metrics, code, installed libraries, and system performance from your training jobs without additional configuration. This helps teams focus on model development rather than experiment bookkeeping.
Comet’s visualization capabilities extend beyond basic metric plots. Built-in charts enable rapid experiment comparison, and custom Python panels support domain-specific analysis tools for debugging model behavior, optimizing hyperparameters, or creating specialized visualizations that standard tools can’t provide.
Enterprise collaboration and governance
For enterprise teams, the combination creates a mature platform for scaling ML projects across regulated environments. SageMaker provides consistent, secure ML environments, and Comet enables seamless collaboration with complete artifact and model lineage tracking. This helps avoid costly mistakes that occur when teams can’t recreate previous results.
Complete ML lifecycle integration
Unlike point solutions that only address training or monitoring, Comet paired with SageMaker supports your complete ML lifecycle. Models can be registered in Comet’s model registry with full version tracking and governance. SageMaker handles model deployment, and Comet maintains the lineage and approval workflows for model promotion. Comet’s production monitoring capabilities track model performance and data drift after deployment, creating a closed loop where production insights inform your next round of SageMaker experiments.
Conclusion
In this post, we showed how to use SageMaker and Comet together to spin up fully managed ML environments with reproducibility and experiment tracking capabilities.
To enhance your SageMaker workflows with comprehensive experiment management, deploy Comet directly in your SageMaker environment through the AWS Marketplace, and share your feedback in the comments.
For more information about the services and features discussed in this post, refer to the following resources:

Set up Partner AI Apps
Comet Quickstart
GitHub notebook
Comet Documentation
Opik open source platform for LLM observability

About the authors
Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping large financial institutions adopt and scale generative AI and ML workloads. He is the author of the book “Generative AI for financial services.” He has more than 15 years of experience building enterprise-grade applications on generative AI/ML and related technologies. In his spare time, he plays an unnamed sport with his son that lies somewhere between football and rugby.
Naufal Mir is a Senior GenAI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning workloads to SageMaker. He previously worked at financial services institutes developing and operating systems at scale. Outside of work, he enjoys ultra endurance running and cycling.
Sarah Ostermeier is a Technical Product Marketing Manager at Comet. She specializes in bringing Comet’s GenAI and ML developer products to the engineers who need them through technical content, educational resources, and product messaging. She has previously worked as an ML engineer, data scientist, and customer success manager, helping customers implement and scale AI solutions. Outside of work she enjoys traveling off the beaten path, writing about AI, and reading science fiction.

Understanding the Universal Tool Calling Protocol (UTCP)

The Universal Tool Calling Protocol (UTCP) is a lightweight, secure, and scalable way for AI agents and applications to find and call tools directly, without the need for additional wrapper servers.

Key Features

Lightweight and secure – Allows tools to be accessed directly, avoiding unnecessary middle layers.

Scalable – Can support a large number of tools and providers without losing performance.

Modular design – Version 1.0.0 introduces a plugin-based core, making the protocol easier to extend, test, and package.

Built on Pydantic models – Provides simple, well-defined data structures that make implementation straightforward.

The Problem with Current Approaches

Traditional solutions for integrating tools often require:

Building and maintaining wrapper servers for every tool

Routing all traffic through a central protocol or service

Reimplementing authentication and security for each tool

Accepting additional latency and complexity

These steps add friction for developers and slow down execution.

The UTCP Solution

UTCP offers a better alternative by:

Defining a clear, language-agnostic standard for describing tools and their interfaces

Allowing agents to connect directly to tools using their native communication protocols

Providing an architecture that lets developers add:

New communication protocols (HTTP, SSE, CLI, etc.)

Alternative storage systems

Custom search strategies

All of this can be done without modifying the core library.

By eliminating the need for wrapper servers or other heavy middle layers, UTCP streamlines the way AI agents and applications connect with tools. It reduces latency and overall complexity, since requests no longer have to pass through extra infrastructure. Authentication and security become simpler as well, because UTCP allows agents to use the tool’s existing mechanisms rather than duplicating them in an intermediary service. This leaner approach also makes it easier to build, test, and maintain integrations, while naturally supporting growth as the number of tools and providers increases.

How It Works

UTCP makes tool integration simple and predictable. First, an AI agent discovers your tools by fetching a UTCP manual, which contains definitions and metadata for every capability you expose. Next, the agent learns how to call these tools by reading the manual and understanding the associated call templates. Once the definitions are clear, the agent can invoke your APIs directly using their native communication protocols. Finally, your API processes the request and returns a normal response. This process ensures seamless interoperability without extra middleware or custom translation layers.
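Conceptually, the flow reduces to “fetch the manual, look up a tool, make the call it describes.” The sketch below illustrates that shape only; the discovery URL, field names, and tool are invented for illustration and are not the normative UTCP schema (see utcp.io and the official client libraries for the real definitions):

# Conceptual sketch of the UTCP flow described above. Field names such as
# "tools", "name", "call_template", "url", and "http_method" are illustrative,
# not the official UTCP schema.
import requests

manual = requests.get("https://api.example.com/utcp").json()  # hypothetical discovery endpoint

def call_tool(manual: dict, tool_name: str, args: dict):
    # Find the tool definition in the manual.
    tool = next(t for t in manual["tools"] if t["name"] == tool_name)
    template = tool["call_template"]
    # Invoke the tool directly over its native protocol (plain HTTP here).
    resp = requests.request(
        method=template.get("http_method", "GET"),
        url=template["url"],
        json=args,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

print(call_tool(manual, "get_weather", {"city": "Berlin"}))  # hypothetical tool and arguments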

Source: https://www.utcp.io/

Architecture Overview

Version 1.0 of UTCP introduces a modular, plugin-based architecture designed for scalability and flexibility. At its core are manuals, which define tools and their metadata, as well as call templates that specify how to interact with each tool over different protocols. 

The UTCP Client acts as the engine for discovering tools and executing calls. Around this core is a plugin system that supports protocol adapters, custom communication methods, tool repositories, and search strategies. This separation of concerns makes it easy to extend the system or customize it for a particular environment without altering its foundation.

How is UTCP different from MCP?

UTCP and MCP both help AI agents connect with external tools, but they focus on different needs. UTCP enables direct calls to APIs, CLIs, WebSockets, and other interfaces through simple JSON manuals, keeping infrastructure light and latency low. MCP provides a more structured layer, wrapping tools behind dedicated servers and standardizing communication with JSON-RPC.

Key points:

Architecture: UTCP connects agents straight to tools; MCP uses a server layer for routing.

Performance & Overhead: UTCP minimizes hops; MCP centralizes calls but adds a layer of processing.

Infrastructure: UTCP requires only manuals and a discovery endpoint, while MCP relies on servers for wrapping and routing.

Protocol Support: UTCP works across HTTP, WebSocket, CLI, SSE, and more; MCP focuses on JSON-RPC transport.

Security & Auth: UTCP uses the tool’s existing mechanisms, while MCP manages access inside its servers.

Flexibility: UTCP supports hybrid deployments through its MCP plugin, while MCP offers centralized management and monitoring.

Both approaches are useful: UTCP is ideal for lightweight, flexible integrations, while MCP suits teams wanting a standardized gateway with built-in control.

Conclusion

UTCP is a versatile solution for both tool providers and AI developers. It lets API owners, SaaS providers, and enterprise teams expose services like REST or GraphQL endpoints to AI agents in a simple, secure way. At the same time, developers building agents or applications can use UTCP to connect effortlessly with internal or external tools. By removing complexity and overhead, it streamlines integration and makes it easier for software to access powerful capabilities.

The post Understanding the Universal Tool Calling Protocol (UTCP) appeared first on MarkTechPost.

Meta AI Proposes ‘Metacognitive Reuse’: Turning LLM Chains-of-Thought into a Procedural Handbook that Cuts Tokens by 46%

Meta researchers introduced a method that compresses repeated reasoning patterns into short, named procedures—“behaviors”—and then conditions models to use them at inference or distills them via fine-tuning. The result: up to 46% fewer reasoning tokens on MATH while matching or improving accuracy, and up to 10% accuracy gains in a self-improvement setting on AIME, without changing model weights. The work frames this as procedural memory for LLMs—how to reason, not just what to recall—implemented with a curated, searchable “behavior handbook.”

https://arxiv.org/pdf/2509.13237

What problem does this solve?

Long chain-of-thought (CoT) traces repeatedly re-derive common sub-procedures (e.g., inclusion–exclusion, base conversions, geometric angle sums). That redundancy burns tokens, adds latency, and can crowd out exploration. Meta’s idea is to abstract recurring steps into concise, named behaviors (name + one-line instruction) recovered from prior traces via an LLM-driven reflection pipeline, then reuse them during future reasoning. On math benchmarks (MATH-500; AIME-24/25), this reduces output length substantially while preserving or improving solution quality.

How does the pipeline work?

Three roles, one handbook:

Metacognitive Strategist (R1-Llama-70B):

1) solves a problem to produce a trace, 2) reflects on the trace to identify generalizable steps, and 3) emits behaviors as (behavior_name → instruction) entries. These populate a behavior handbook (procedural memory).

Teacher (LLM B): generates behavior-conditioned responses used to build training corpora.

Student (LLM C): consumes behaviors in-context (inference) or is fine-tuned on behavior-conditioned data. Retrieval is topic-based on MATH and embedding-based (BGE-M3 + FAISS) on AIME.

Prompts: The team provides explicit prompts for solution, reflection, behavior extraction, and behavior-conditioned inference (BCI). In BCI, the model is instructed to reference behaviors explicitly in its reasoning, encouraging consistently short, structured derivations.

What are the evaluation modes?

Behavior-Conditioned Inference (BCI): Retrieve K relevant behaviors and prepend them to the prompt.

Behavior-Guided Self-Improvement: Extract behaviors from a model’s own earlier attempts and feed them back as hints for revision.

Behavior-Conditioned SFT (BC-SFT): Fine-tune students on teacher outputs that already follow behavior-guided reasoning, so the behavior usage becomes parametric (no retrieval at test time).

Key results (MATH, AIME-24/25)

Token efficiency: On MATH-500, BCI reduces reasoning tokens by up to 46% versus the same model without behaviors, while matching or improving accuracy. This holds for both R1-Llama-70B and Qwen3-32B students across token budgets (2,048–16,384).

Self-improvement gains: On AIME-24, behavior-guided self-improvement beats a critique-and-revise baseline at nearly every budget, with up to 10% higher accuracy as budgets increase, indicating better test-time scaling of accuracy (not just shorter traces).

BC-SFT quality lift: Across Llama-3.1-8B-Instruct, Qwen2.5-14B-Base, Qwen2.5-32B-Instruct, and Qwen3-14B, BC-SFT consistently outperforms standard SFT and the original base models on accuracy across budgets, while remaining more token-efficient. Importantly, the advantage is not explained by an easier training corpus: teacher correctness rates in the two training sets (original vs. behavior-conditioned) are close, yet BC-SFT students generalize better on AIME-24/25.

Why does this work?

The handbook stores procedural knowledge (how-to strategies), distinct from classic RAG’s declarative knowledge (facts). By converting verbose derivations into short, reusable steps, the model skips re-derivation and reallocates compute to novel subproblems. Behavior prompts serve as structured hints that bias the decoder toward efficient, correct trajectories; BC-SFT then internalizes these trajectories so that behaviors are implicitly invoked without prompt overhead.

What’s inside a “behavior”?

Behaviors range from domain-general reasoning moves to precise mathematical tools, e.g.,

behavior_inclusion_exclusion_principle: avoid double counting by subtracting intersections;

behavior_translate_verbal_to_equation: formalize word problems systematically;

behavior_distance_from_point_to_line: apply |Ax+By+C|/√(A²+B²) for tangency checks.

During BCI, the student explicitly cites behaviors when they’re used, making traces auditable and compact.

Retrieval and cost considerations

On MATH, behaviors are retrieved by topic; on AIME, top-K behaviors are selected via BGE-M3 embeddings and FAISS. While BCI introduces extra input tokens (the behaviors), input tokens are pre-computable and non-autoregressive, and are often billed cheaper than output tokens on commercial APIs. Since BCI shrinks output tokens, the overall cost can drop while latency improves. BC-SFT eliminates retrieval at test time entirely.
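To make the retrieval step concrete, the sketch below mimics the AIME-style setup (BGE-M3 embeddings + FAISS) and prepends the top-K behaviors to a prompt. The handbook entries and prompt wrapper are invented examples, not the paper's artifacts:

# Illustrative behavior retrieval (BGE-M3 + FAISS) and behavior-conditioned prompting.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

behaviors = {
    "behavior_inclusion_exclusion_principle": "Avoid double counting by subtracting intersections.",
    "behavior_translate_verbal_to_equation": "Formalize word problems as equations before solving.",
}
names = list(behaviors)

encoder = SentenceTransformer("BAAI/bge-m3")
vecs = encoder.encode([behaviors[n] for n in names], normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])           # inner product == cosine after normalization
index.add(np.asarray(vecs, dtype="float32"))

def behavior_conditioned_prompt(question: str, k: int = 2) -> str:
    q = encoder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    hints = "\n".join(f"- {names[i]}: {behaviors[names[i]]}" for i in ids[0])
    return f"Useful behaviors:\n{hints}\n\nProblem: {question}\nCite behaviors you use."

print(behavior_conditioned_prompt("How many integers from 1 to 100 are divisible by 2 or 5?"))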

Image source: marktechpost.com

Summary

Meta’s behavior-handbook approach operationalizes procedural memory for LLMs: it abstracts recurring reasoning steps into reusable “behaviors,” applies them via behavior-conditioned inference or distills them with BC-SFT, and empirically delivers up to 46% fewer reasoning tokens with accuracy that holds or improves (≈10% gains in self-correction regimes). The method is straightforward to integrate—an index, a retriever, optional fine-tuning—and surfaces auditable traces, though scaling beyond math and managing a growing behavior corpus remain open engineering problems.

Check out the PAPER.

The post Meta AI Proposes ‘Metacognitive Reuse’: Turning LLM Chains-of-Thought into a Procedural Handbook that Cuts Tokens by 46% appeared first on MarkTechPost.

IBM and ETH Zürich Researchers Unveil Analog Foundation Models to Tackle Noise in In-Memory AI Hardware

IBM researchers, together with ETH Zürich, have unveiled a new class of Analog Foundation Models (AFMs) designed to bridge the gap between large language models (LLMs) and Analog In-Memory Computing (AIMC) hardware. AIMC has long promised a radical leap in efficiency—running models with a billion parameters in a footprint small enough for embedded or edge devices—thanks to dense non-volatile memory (NVM) that combines storage and computation. But the technology’s Achilles’ heel has been noise: performing matrix-vector multiplications directly inside NVM devices yields non-deterministic errors that cripple off-the-shelf models.

Why does analog computing matter for LLMs?

Unlike GPUs or TPUs that shuttle data between memory and compute units, AIMC performs matrix-vector multiplications directly inside memory arrays. This design removes the von Neumann bottleneck and delivers massive improvements in throughput and power efficiency. Prior studies showed that combining AIMC with 3D NVM and Mixture-of-Experts (MoE) architectures could, in principle, support trillion-parameter models on compact accelerators. That could make foundation-scale AI feasible on devices well beyond data-centers.

https://arxiv.org/pdf/2505.09663

What makes Analog In-Memory Computing (AIMC) so difficult to use in practice?

The biggest barrier is noise. AIMC computations suffer from device variability, DAC/ADC quantization, and runtime fluctuations that degrade model accuracy. Unlike quantization on GPUs—where errors are deterministic and manageable—analog noise is stochastic and unpredictable. Earlier research found ways to adapt small networks like CNNs and RNNs (<100M parameters) to tolerate such noise, but LLMs with billions of parameters consistently broke down under AIMC constraints.

How do Analog Foundation Models address the noise problem?

The IBM team introduces Analog Foundation Models, which integrate hardware-aware training to prepare LLMs for analog execution. Their pipeline uses:

Noise injection during training to simulate AIMC randomness.

Iterative weight clipping to stabilize distributions within device limits.

Learned static input/output quantization ranges aligned with real hardware constraints.

Distillation from pre-trained LLMs using 20B tokens of synthetic data.

These methods, implemented with AIHWKIT-Lightning, allow models like Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct to sustain performance comparable to weight-quantized 4-bit / activation 8-bit baselines under analog noise. In evaluations across reasoning and factual benchmarks, AFMs outperformed both quantization-aware training (QAT) and post-training quantization (SpinQuant).
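The sketch below illustrates the first two ingredients (noise injection and iterative weight clipping) on a single linear layer in plain PyTorch. It is a conceptual toy, not IBM's AIHWKIT-Lightning pipeline, and the noise scale and clipping percentile are arbitrary:

# Conceptual hardware-aware training sketch: weight noise + iterative clipping.
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    def __init__(self, in_features, out_features, noise_std=0.05, clip_pct=0.999):
        super().__init__(in_features, out_features)
        self.noise_std = noise_std
        self.clip_pct = clip_pct

    def clip_weights(self):
        # Iterative clipping: bound weights to a high percentile of their magnitude,
        # mimicking the limited conductance range of analog devices.
        with torch.no_grad():
            bound = torch.quantile(self.weight.abs(), self.clip_pct)
            self.weight.clamp_(-bound, bound)

    def forward(self, x):
        if self.training:
            # Inject noise relative to the largest weight magnitude, loosely simulating
            # stochastic AIMC matrix-vector multiplication errors.
            noise = torch.randn_like(self.weight) * self.noise_std * self.weight.abs().max()
            return nn.functional.linear(x, self.weight + noise, self.bias)
        return super().forward(x)

layer = NoisyLinear(256, 256)
layer.train()
out = layer(torch.randn(8, 256))
layer.clip_weights()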

Do these models work only for analog hardware?

No. An unexpected outcome is that AFMs also perform strongly on low-precision digital hardware. Because AFMs are trained to tolerate noise and clipping, they handle simple post-training round-to-nearest (RTN) quantization better than existing methods. This makes them useful not just for AIMC accelerators, but also for commodity digital inference hardware.

Can performance scale with more compute at inference time?

Yes. The researchers tested test-time compute scaling on the MATH-500 benchmark, generating multiple answers per query and selecting the best via a reward model. AFMs showed better scaling behavior than QAT models, with accuracy gaps shrinking as more inference compute was allocated. This is consistent with AIMC’s strengths—low-power, high-throughput inference rather than training.

https://arxiv.org/pdf/2505.09663

How does it impact the future of Analog In-Memory Computing (AIMC)?

The research team provides the first systematic demonstration that large LLMs can be adapted to AIMC hardware without catastrophic accuracy loss. While training AFMs is resource-heavy and reasoning tasks like GSM8K still show accuracy gaps, the results are a milestone. The combination of energy efficiency, robustness to noise, and cross-compatibility with digital hardware makes AFMs a promising direction for scaling foundation models beyond GPU limits.

Summary

The introduction of Analog Foundation Models marks a critical milestone for scaling LLMs beyond the limits of digital accelerators. By making models robust to the unpredictable noise of analog in-memory computing, the research team shows that AIMC can move from a theoretical promise to a practical platform. While training costs remain high and reasoning benchmarks still show gaps, this work establishes a path toward energy-efficient, large-scale models running on compact hardware, pushing foundation models closer to edge deployment.

Check out the PAPER and GITHUB PAGE.

The post IBM and ETH Zürich Researchers Unveil Analog Foundation Models to Tackle Noise in In-Memory AI Hardware appeared first on MarkTechPost.

An Internet of AI Agents? Coral Protocol Introduces Coral v1: An MCP-Native Runtime and Registry for Cross-Framework AI Agents

Coral Protocol has released Coral v1 of its agent stack, aiming to standardize how developers discover, compose, and operate AI agents across heterogeneous frameworks. The release centers on an MCP-based runtime (Coral Server) that enables threaded, mention-addressed agent-to-agent messaging, a developer workflow (CLI + Studio) for orchestration and observability, and a public registry for agent discovery. Coral lists pay-per-usage payouts on Solana as “coming soon,” not generally available.

What Coral v1 Actually Ships


Coral Server (runtime): Implements Model Context Protocol (MCP) primitives so agents can register, create threads, send messages, and mention other agents, enabling structured A2A coordination instead of brittle context splicing.

Coral CLI + Studio: Add remote/local agents, wire them into shared threads, and inspect thread/message telemetry for debugging and performance tuning.

Registry surface: A discovery layer to find and integrate agents. Monetization and hosted checkout are explicitly marked as “coming soon.”

Why Interoperability Matters

Agent frameworks (e.g., LangChain, CrewAI, custom stacks) don’t speak a common operational protocol, which blocks composition. Coral’s MCP threading model provides a common transport and addressing scheme, so specialized agents can coordinate without ad-hoc glue code or prompt concatenation. The Coral Protocol team emphasizes persistent threads and mention-based targeting to keep collaboration organized and low-overhead.

Reference Implementation: Anemoi on GAIA

Coral’s open implementation Anemoi demonstrates the semi-centralized pattern: a light planner + specialized workers communicating directly over Coral MCP threads. On GAIA, Anemoi reports 52.73% pass@3 using GPT-4.1-mini (planner) and GPT-4o (workers), surpassing a reproduced OWL setup at 43.63% under identical LLM/tooling. The arXiv paper and GitHub readme both document these numbers and the coordination loop (plan → execute → critique → refine).

The design reduces reliance on a single powerful planner, trims redundant token passing, and improves scalability/cost for long-horizon tasks—credible, benchmark-anchored evidence that structured A2A beats naive prompt chaining when planner capacity is limited.

Incentives and Marketplace Status

Coral positions a usage-based marketplace where agent authors can list agents with pricing metadata and get paid per call. As of this writing, the developer page clearly labels “Pay Per Usage / Get Paid Automatically” and “Hosted checkout” as coming soon—teams should avoid assuming GA for payouts until Coral updates availability.

Summary

Coral v1 contributes a standards-first interop runtime for multi-agent systems, plus practical tooling for discovery and observability. The Anemoi GAIA results provide empirical backing for the A2A, thread-based design under constrained planners. The marketplace narrative is compelling, but treat monetization as upcoming per Coral’s own site; build against the runtime/registry now and keep payments feature-flagged until GA.

Introducing Coral v1. For the first time, anyone can: → Publish AI agents on a marketplace where the world can discover them → Get paid for AI agents they create → Rent agents on demand to build AI startups 10x faster. Here’s why this matters. — Coral Protocol (@Coral_Protocol) September 20, 2025

The post An Internet of AI Agents? Coral Protocol Introduces Coral v1: An MCP-Native Runtime and Registry for Cross-Framework AI Agents appeared first on MarkTechPost.

LLM-as-a-Judge: Where Do Its Signals Break, When Do They Hold, and What Should “Evaluation” Mean?

What exactly is being measured when a judge LLM assigns a 1–5 (or pairwise) score?

Most “correctness/faithfulness/completeness” rubrics are project-specific. Without task-grounded definitions, a scalar score can drift from business outcomes (e.g., “useful marketing post” vs. “high completeness”). Surveys of LLM-as-a-judge (LAJ) note that rubric ambiguity and prompt template choices materially shift scores and human correlations.

How stable are judge decisions to prompt position and formatting?

Large controlled studies find position bias: identical candidates receive different preferences depending on order; list-wise and pairwise setups both show measurable drift (e.g., repetition stability, position consistency, preference fairness).

Work cataloging verbosity bias shows longer responses are often favored independent of quality; several reports also describe self-preference (judges prefer text closer to their own style/policy).
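A common partial mitigation for position bias is to query the judge twice with the candidate order swapped and accept only order-consistent verdicts. The sketch below shows only the control logic; judge_prefers is a stand-in to be replaced with your actual LAJ call:

# Position-bias control: swap candidate order and keep only consistent verdicts.
import random

def judge_prefers(prompt: str, first: str, second: str) -> str:
    """Stand-in for a real LLM-judge call; must return 'first' or 'second'."""
    return random.choice(["first", "second"])  # placeholder behavior only

def debiased_preference(prompt: str, a: str, b: str) -> str:
    verdict_ab = judge_prefers(prompt, a, b)   # A presented first
    verdict_ba = judge_prefers(prompt, b, a)   # B presented first
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"
    return "tie"  # order-sensitive verdict; treat as inconclusive

print(debiased_preference("Which summary is more faithful?", "summary A", "summary B"))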

Do judge scores consistently match human judgments of factuality?

Empirical results are mixed. For summary factuality, one study reported low or inconsistent correlations with humans for strong models (GPT-4, PaLM-2), with only partial signal from GPT-3.5 on certain error types.

Conversely, domain-bounded setups (e.g., explanation quality for recommenders) have reported usable agreement with careful prompt design and ensembling across heterogeneous judges.

Taken together, correlation seems task- and setup-dependent, not a general guarantee.

How robust are judge LLMs to strategic manipulation?

LLM-as-a-Judge (LAJ) pipelines are attackable. Studies show universal and transferable prompt attacks can inflate assessment scores; defenses (template hardening, sanitization, re-tokenization filters) mitigate but do not eliminate susceptibility.

Newer evaluations differentiate content-author vs. system-prompt attacks and document degradation across several families (Gemma, Llama, GPT-4, Claude) under controlled perturbations.

Is pairwise preference safer than absolute scoring?

Preference learning often favors pairwise ranking, yet recent research finds protocol choice itself introduces artifacts: pairwise judges can be more vulnerable to distractors that generator models learn to exploit; absolute (pointwise) scores avoid order bias but suffer scale drift. Reliability therefore hinges on protocol, randomization, and controls rather than a single universally superior scheme.

Could “judging” encourage overconfident model behavior?

Recent reporting on evaluation incentives argues that test-centric scoring can reward guessing and penalize abstention, shaping models toward confident hallucinations; proposals suggest scoring schemes that explicitly value calibrated uncertainty. While this is a training-time concern, it feeds back into how evaluations are designed and interpreted.

Where do generic “judge” scores fall short for production systems?

When an application has deterministic sub-steps (retrieval, routing, ranking), component metrics offer crisp targets and regression tests. Common retrieval metrics include Precision@k, Recall@k, MRR, and nDCG; these are well-defined, auditable, and comparable across runs.
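For reference, each of these metrics takes only a few lines to implement for binary relevance labels, which is part of what makes them attractive as regression tests:

# Minimal reference implementations of the retrieval metrics named above.
import math

def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / max(len(relevant), 1)

def mrr(retrieved, relevant):
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

retrieved = ["doc3", "doc1", "doc7"]
relevant = {"doc1", "doc2"}
print(precision_at_k(retrieved, relevant, 3), mrr(retrieved, relevant), ndcg_at_k(retrieved, relevant, 3))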

Industry guides emphasize separating retrieval and generation and aligning subsystem metrics with end goals, independent of any judge LLM.

If judge LLMs are fragile, what does “evaluation” look like in the wild?

Public engineering playbooks increasingly describe trace-first, outcome-linked evaluation: capture end-to-end traces (inputs, retrieved chunks, tool calls, prompts, responses) using OpenTelemetry GenAI semantic conventions and attach explicit outcome labels (resolved/unresolved, complaint/no-complaint). This supports longitudinal analysis, controlled experiments, and error clustering—regardless of whether any judge model is used for triage.
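A minimal version of that trace-first pattern, loosely following the OpenTelemetry GenAI attribute names (verify names against the current semantic-convention spec before relying on them), might look like this:

# One span per LLM call, with usage and an application-level outcome label attached.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("genai-eval")

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.request.model", "gpt-4o")      # GenAI semconv attribute
    span.set_attribute("gen_ai.usage.input_tokens", 812)
    span.set_attribute("gen_ai.usage.output_tokens", 154)
    span.set_attribute("app.outcome", "resolved")              # app-specific outcome label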

Tooling ecosystems (e.g., LangSmith and others) document trace/eval wiring and OTel interoperability; these are descriptions of current practice rather than endorsements of a particular vendor.

Are there domains where LLM-as-a-Judge (LAJ) seems comparatively reliable?

Some constrained tasks with tight rubrics and short outputs report better reproducibility, especially when ensembles of judges and human-anchored calibration sets are used. But cross-domain generalization remains limited, and bias/attack vectors persist.

Does LLM-as-a-Judge (LAJ) performance drift with content style, domain, or “polish”?

Beyond length and order, studies and news coverage indicate LLMs sometimes over-simplify or over-generalize scientific claims compared to domain experts—useful context when using LAJ to score technical material or safety-critical text.

Key Technical Observations

Biases are measurable (position, verbosity, self-preference) and can materially change rankings without content changes. Controls (randomization, de-biasing templates) reduce but do not eliminate effects.

Adversarial pressure matters: prompt-level attacks can systematically inflate scores; current defenses are partial.

Human agreement varies by task: factuality and long-form quality show mixed correlations; narrow domains with careful design and ensembling fare better.

Component metrics remain well-posed for deterministic steps (retrieval/routing), enabling precise regression tracking independent of judge LLMs.

Trace-based online evaluation described in industry literature (OTel GenAI) supports outcome-linked monitoring and experimentation.

Summary

In conclusion, this article does not argue against using LLM-as-a-Judge but highlights the nuances, limitations, and ongoing debates around its reliability and robustness. The intention is not to dismiss its use but to frame open questions that need further exploration. Companies and research groups actively developing or deploying LLM-as-a-Judge (LAJ) pipelines are invited to share their perspectives, empirical findings, and mitigation strategies, adding valuable depth and balance to the broader conversation on evaluation in the GenAI era.

The post LLM-as-a-Judge: Where Do Its Signals Break, When Do They Hold, and What Should “Evaluation” Mean? appeared first on MarkTechPost.

A Coding Guide to End-to-End Robotics Learning with LeRobot: Training, …

In this tutorial, we walk step by step through using Hugging Face’s LeRobot library to train and evaluate a behavior-cloning policy on the PushT dataset. We begin by setting up the environment in Google Colab, installing the required dependencies, and loading the dataset through LeRobot’s unified API. We then design a compact visuomotor policy that combines a convolutional backbone with a small MLP head, allowing us to map image and state observations directly to robot actions. By training on a subset of the dataset for speed, we are able to quickly demonstrate how LeRobot enables reproducible, dataset-driven robot learning pipelines. Check out the FULL CODES here.

!pip -q install --upgrade lerobot torch torchvision timm imageio[ffmpeg]

import os, math, random, io, sys, json, pathlib, time
import torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import DataLoader, Subset
from torchvision.utils import make_grid, save_image
import numpy as np
import imageio.v2 as imageio

try:
    from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
except Exception:
    from lerobot.datasets.lerobot_dataset import LeRobotDataset

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

We begin by installing the required libraries and setting up our environment for training. We import all the essential modules, configure the dataset loader, and fix the random seed to ensure reproducibility. We also detect whether we are running on a GPU or CPU, allowing our experiments to run efficiently. Check out the FULL CODES here.

REPO_ID = "lerobot/pusht"
ds = LeRobotDataset(REPO_ID)
print("Dataset length:", len(ds))

s0 = ds[0]
keys = list(s0.keys())
print("Sample keys:", keys)

def key_with(prefixes):
    for k in keys:
        for p in prefixes:
            if k.startswith(p): return k
    return None

K_IMG = key_with(["observation.image", "observation.images", "observation.rgb"])
K_STATE = key_with(["observation.state"])
K_ACT = "action"
assert K_ACT in s0, f"No 'action' key found in sample. Found: {keys}"
print("Using keys -> IMG:", K_IMG, "STATE:", K_STATE, "ACT:", K_ACT)

We load the PushT dataset with LeRobot and inspect its structure. We check the available keys, identify which ones correspond to images, states, and actions, and map them for consistent access throughout our training pipeline. Check out the FULL CODES here.

class PushTWrapper(torch.utils.data.Dataset):
    def __init__(self, base):
        self.base = base
    def __len__(self): return len(self.base)
    def __getitem__(self, i):
        x = self.base[i]
        img = x[K_IMG]
        if img.ndim == 4: img = img[-1]
        img = img.float() / 255.0 if img.dtype == torch.uint8 else img.float()
        state = x.get(K_STATE, torch.zeros(2))
        state = state.float().reshape(-1)
        act = x[K_ACT].float().reshape(-1)
        if img.shape[-2:] != (96, 96):
            img = F.interpolate(img.unsqueeze(0), size=(96, 96), mode="bilinear", align_corners=False)[0]
        return {"image": img, "state": state, "action": act}

wrapped = PushTWrapper(ds)
N = len(wrapped)
idx = list(range(N))
random.shuffle(idx)
n_train = int(0.9*N)
train_idx, val_idx = idx[:n_train], idx[n_train:]

train_ds = Subset(wrapped, train_idx[:12000])
val_ds = Subset(wrapped, val_idx[:2000])

BATCH = 128
train_loader = DataLoader(train_ds, batch_size=BATCH, shuffle=True, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_ds, batch_size=BATCH, shuffle=False, num_workers=2, pin_memory=True)

We wrap each sample so we consistently get a normalized 96×96 image, a flattened state, and an action, picking the last frame if a temporal stack is present. We then shuffle, split into train/val, and cap sizes for fast Colab runs. Finally, we create efficient DataLoaders with batching, shuffling, and pinned memory to keep training smooth. Check out the FULL CODES here.

class SmallBackbone(nn.Module):
    def __init__(self, out=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, 2, 2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, 1, 1), nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, out), nn.ReLU(inplace=True))
    def forward(self, x): return self.head(self.conv(x))

class BCPolicy(nn.Module):
    def __init__(self, img_dim=256, state_dim=2, hidden=256, act_dim=2):
        super().__init__()
        self.backbone = SmallBackbone(img_dim)
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + state_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden//2), nn.ReLU(inplace=True),
            nn.Linear(hidden//2, act_dim)
        )
    def forward(self, img, state):
        z = self.backbone(img)
        if state.ndim == 1: state = state.unsqueeze(0)
        z = torch.cat([z, state], dim=-1)
        return self.mlp(z)

policy = BCPolicy().to(DEVICE)
opt = torch.optim.AdamW(policy.parameters(), lr=3e-4, weight_decay=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(DEVICE=="cuda"))

@torch.no_grad()
def evaluate():
    policy.eval()
    mse, n = 0.0, 0
    for batch in val_loader:
        img = batch["image"].to(DEVICE, non_blocking=True)
        st = batch["state"].to(DEVICE, non_blocking=True)
        act = batch["action"].to(DEVICE, non_blocking=True)
        pred = policy(img, st)
        mse += F.mse_loss(pred, act, reduction="sum").item()
        n += act.numel()
    return mse / n

def cosine_lr(step, total, base=3e-4, min_lr=3e-5):
    if step >= total: return min_lr
    cos = 0.5*(1 + math.cos(math.pi*step/total))
    return min_lr + (base - min_lr)*cos

EPOCHS = 4
steps_total = EPOCHS*len(train_loader)
step = 0
best = float("inf")
ckpt = "/content/lerobot_pusht_bc.pt"

for epoch in range(EPOCHS):
    policy.train()
    for batch in train_loader:
        lr = cosine_lr(step, steps_total); step += 1
        for g in opt.param_groups: g["lr"] = lr

        img = batch["image"].to(DEVICE, non_blocking=True)
        st = batch["state"].to(DEVICE, non_blocking=True)
        act = batch["action"].to(DEVICE, non_blocking=True)

        opt.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(enabled=(DEVICE=="cuda")):
            pred = policy(img, st)
            loss = F.smooth_l1_loss(pred, act)
        scaler.scale(loss).backward()
        scaler.unscale_(opt)  # unscale gradients before clipping so the norm threshold applies to true gradients
        nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
        scaler.step(opt); scaler.update()

    val_mse = evaluate()
    print(f"Epoch {epoch+1}/{EPOCHS} | Val MSE: {val_mse:.6f}")
    if val_mse < best:
        best = val_mse
        torch.save({"state_dict": policy.state_dict(), "val_mse": best}, ckpt)

print("Best Val MSE:", best, "| Saved:", ckpt)

We define a compact visuomotor policy: a CNN backbone extracts image features that we fuse with the robot state to predict 2-D actions. We train with AdamW, a cosine learning-rate schedule, mixed precision, and gradient clipping, while evaluating with MSE on the validation set. We checkpoint the best model by validation loss so we can reload the strongest policy later. Check out the FULL CODES here.

policy.load_state_dict(torch.load(ckpt)["state_dict"]); policy.eval()
os.makedirs("/content/vis", exist_ok=True)

def draw_arrow(imgCHW, action_xy, scale=40):
    import PIL.Image, PIL.ImageDraw
    C, H, W = imgCHW.shape
    arr = (imgCHW.clamp(0, 1).permute(1, 2, 0).cpu().numpy()*255).astype(np.uint8)
    im = PIL.Image.fromarray(arr)
    dr = PIL.ImageDraw.Draw(im)
    cx, cy = W//2, H//2
    dx, dy = float(action_xy[0])*scale, float(-action_xy[1])*scale
    dr.line((cx, cy, cx+dx, cy+dy), width=3, fill=(0, 255, 0))
    return np.array(im)

frames = []
with torch.no_grad():
    for i in range(60):
        b = wrapped[i]
        img = b["image"].unsqueeze(0).to(DEVICE)
        st = b["state"].unsqueeze(0).to(DEVICE)
        pred = policy(img, st)[0].cpu()
        frames.append(draw_arrow(b["image"], pred))
video_path = "/content/vis/pusht_pred.mp4"
imageio.mimsave(video_path, frames, fps=10)
print("Wrote", video_path)

grid = make_grid(torch.stack([wrapped[i]["image"] for i in range(16)]), nrow=8)
save_image(grid, "/content/vis/grid.png")
print("Saved grid:", "/content/vis/grid.png")

We reload the best checkpoint and switch the policy to eval so we can visualize its behavior. We overlay predicted action arrows on frames, stitch them into a short MP4, and also save a quick image grid for a snapshot view of the dataset. This lets us confirm, at a glance, what actions our model outputs on real PushT observations.

In conclusion, we see how easily LeRobot integrates data handling, policy definition, and evaluation into a single framework. By training our lightweight policy and visualizing predicted actions on PushT frames, we confirm that the library gives us a practical entry point into robot learning without needing real-world hardware. We are now equipped to extend the pipeline to more advanced models, such as diffusion or ACT policies, to experiment with different datasets, and even to share our trained policies on the Hugging Face Hub.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post A Coding Guide to End-to-End Robotics Learning with LeRobot: Training, Evaluating, and Visualizing Behavior Cloning Policies on PushT appeared first on MarkTechPost.

Google’s Sensible Agent Reframes Augmented Reality (AR) Assistance a …

Sensible Agent is an AI research framework and prototype from Google that chooses both the action an augmented reality (AR) agent should take and the interaction modality to deliver/confirm it, conditioned on real-time multimodal context (e.g., whether hands are busy, ambient noise, social setting). Rather than treating “what to suggest” and “how to ask” as separate problems, it computes them jointly to minimize friction and social awkwardness in the wild.

https://research.google/pubs/sensible-agent-a-framework-for-unobtrusive-interaction-with-proactive-ar-agent/

What interaction failure modes is it targeting?

Voice-first prompting is brittle: it’s slow under time pressure, unusable with busy hands/eyes, and awkward in public. Sensible Agent’s core bet is that a high-quality suggestion delivered through the wrong channel is effectively noise. The framework explicitly models the joint decision of (a) what the agent proposes (recommend/guide/remind/automate) and (b) how it’s presented and confirmed (visual, audio, or both; inputs via head nod/shake/tilt, gaze dwell, finger poses, short-vocabulary speech, or non-lexical conversational sounds). By binding content selection to modality feasibility and social acceptability, the system aims to lower perceived effort while preserving utility.

How is the system architected at runtime?

A prototype on an Android-class XR headset implements a pipeline with three main stages. First, context parsing fuses egocentric imagery (vision-language inference for scene/activity/familiarity) with an ambient audio classifier (YAMNet) to detect conditions like noise or conversation. Second, a proactive query generator prompts a large multimodal model with few-shot exemplars to select the action, query structure (binary / multi-choice / icon-cue), and presentation modality. Third, the interaction layer enables only those input methods compatible with the sensed I/O availability, e.g., head nod for “yes” when whispering isn’t acceptable, or gaze dwell when hands are occupied.
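To make the shape of that pipeline concrete, the following schematic sketch mimics the joint "what+how" decision; the data structure, rules, and labels are illustrative stand-ins, not an API Google has published.

from dataclasses import dataclass

@dataclass
class Context:
    hands_busy: bool
    ambient_speech: bool      # e.g., flagged by an audio event classifier such as YAMNet
    socially_sensitive: bool
    familiar_place: bool

def choose_what_and_how(ctx: Context) -> dict:
    """Jointly pick the proactive action, query structure, and I/O modality."""
    query = "binary" if ctx.familiar_place else "multi_choice"
    if ctx.socially_sensitive or ctx.ambient_speech:
        output = "visual_icon"
        confirm = "gaze_dwell" if ctx.hands_busy else "head_nod"
    else:
        output = "audio+visual"
        confirm = "head_nod" if ctx.hands_busy else "speech_short_vocab"
    return {"action": "recommend", "query_type": query, "output": output, "confirm_input": confirm}

print(choose_what_and_how(Context(hands_busy=True, ambient_speech=False,
                                  socially_sensitive=True, familiar_place=False)))

In the actual system these mappings are produced by a large multimodal model conditioned on few-shot exemplars rather than hand-written rules; the point of the sketch is only the coupling of content and modality in one decision.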

Where do the few-shot policies come from—designer instinct or data?

The team seeded the policy space with two studies: an expert workshop (n=12) to enumerate when proactive help is useful and which micro-inputs are socially acceptable; and a context mapping study (n=40; 960 entries) across everyday scenarios (e.g., gym, grocery, museum, commuting, cooking) where participants specified desired agent actions and chose a preferred query type and modality given the context. These mappings ground the few-shot exemplars used at runtime, shifting the choice of “what+how” from ad-hoc heuristics to data-derived patterns (e.g., multi-choice in unfamiliar environments, binary under time pressure, icon + visual in socially sensitive settings).

What concrete interaction techniques does the prototype support?

For binary confirmations, the system recognizes head nod/shake; for multi-choice, a head-tilt scheme maps left/right/back to options 1/2/3. Finger-pose gestures support numeric selection and thumbs up/down; gaze dwell triggers visual buttons where raycast pointing would be fussy; short-vocabulary speech (e.g., “yes,” “no,” “one,” “two,” “three”) provides a minimal dictation path; and non-lexical conversational sounds (“mm-hm”) cover noisy or whisper-only contexts. Crucially, the pipeline only offers modalities that are feasible under current constraints (e.g., suppress audio prompts in quiet spaces; avoid gaze dwell if the user isn’t looking at the HUD).

https://research.google/pubs/sensible-agent-a-framework-for-unobtrusive-interaction-with-proactive-ar-agent/

Does the joint decision actually reduce interaction cost?

A preliminary within-subjects user study (n=10) comparing the framework to a voice-prompt baseline across AR and 360° VR reported lower perceived interaction effort and lower intrusiveness while maintaining usability and preference. This is a small sample typical of early HCI validation; it’s directional evidence rather than product-grade proof, but it aligns with the thesis that coupling intent and modality reduces overhead.

How does the audio side work, and why YAMNet?

YAMNet is a lightweight, MobileNet-v1–based audio event classifier trained on Google’s AudioSet, predicting 521 classes. In this context it’s a practical choice to detect rough ambient conditions—speech presence, music, crowd noise—fast enough to gate audio prompts or to bias toward visual/gesture interaction when speech would be awkward or unreliable. The model’s ubiquity in TensorFlow Hub and Edge guides makes it straightforward to deploy on device.
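For orientation, running YAMNet from TensorFlow Hub on a 16 kHz mono waveform looks roughly like the following; the speech threshold at the end is an illustrative heuristic, not part of Sensible Agent.

import csv, io
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/yamnet/1")

# The model ships a CSV mapping each of the 521 output indices to a display name
class_map = tf.io.read_file(model.class_map_path()).numpy().decode("utf-8")
class_names = [row["display_name"] for row in csv.DictReader(io.StringIO(class_map))]

waveform = np.zeros(16000, dtype=np.float32)        # 1 s of silence at 16 kHz as a stand-in input
scores, embeddings, spectrogram = model(waveform)   # scores: [frames, 521]

speech_score = scores.numpy().mean(axis=0)[class_names.index("Speech")]
print("speech present:", speech_score > 0.3)        # heuristic threshold to gate audio prompts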

How can you integrate it into an existing AR or mobile assistant stack?

A minimal adoption plan looks like this: (1) instrument a lightweight context parser (VLM on egocentric frames + ambient audio tags) to produce a compact state; (2) build a few-shot table of context→(action, query type, modality) mappings from internal pilots or user studies; (3) prompt an LMM to emit both the “what” and the “how” at once; (4) expose only feasible input methods per state and keep confirmations binary by default; (5) log choices and outcomes for offline policy learning. The Sensible Agent artifacts show this is feasible in WebXR/Chrome on Android-class hardware, so migrating to a native HMD runtime or even a phone-based HUD is mostly an engineering exercise.
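A hedged sketch of steps (2) and (3): turning a small table of context→(action, query type, modality) mappings into a few-shot prompt for whichever multimodal model sits behind the agent. The exemplars and the call_lmm function are placeholders.

FEWSHOT = [
    {"context": "in a gym, hands holding dumbbells, music loud",
     "action": "remind", "query": "binary", "modality": "visual + head nod"},
    {"context": "in a quiet museum, hands free, unfamiliar exhibit",
     "action": "recommend", "query": "multi_choice", "modality": "visual icons + gaze dwell"},
]

def build_prompt(current_context: str) -> str:
    lines = ["Given the context, choose the agent action, query type, and modality."]
    for ex in FEWSHOT:
        lines.append(f"Context: {ex['context']}\n"
                     f"Decision: action={ex['action']}, query={ex['query']}, modality={ex['modality']}")
    lines.append(f"Context: {current_context}\nDecision:")
    return "\n\n".join(lines)

# decision = call_lmm(build_prompt("cooking, hands messy, podcast playing"))  # hypothetical LMM call
print(build_prompt("cooking, hands messy, podcast playing"))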

Summary

Sensible Agent operationalizes proactive AR as a coupled policy problem—selecting the action and the interaction modality in a single, context-conditioned decision—and validates the approach with a working WebXR prototype and small-N user study showing lower perceived interaction effort relative to a voice baseline. The framework’s contribution is not a product but a reproducible recipe: a dataset of context→(what/how) mappings, few-shot prompts to bind them at runtime, and low-effort input primitives that respect social and I/O constraints.

Check out the Paper and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Google’s Sensible Agent Reframes Augmented Reality (AR) Assistance as a Coupled “what+how” Decision—So What does that Change? appeared first on MarkTechPost.

Qwen3-ASR-Toolkit: An Advanced Open Source Python Command-Line Toolkit …

Qwen has released Qwen3-ASR-Toolkit, an MIT-licensed Python CLI that programmatically bypasses the Qwen3-ASR-Flash API’s 3-minute/10 MB per-request limit by performing VAD-aware chunking, parallel API calls, and automatic resampling/format normalization via FFmpeg. The result is stable, hour-scale transcription pipelines with configurable concurrency, context injection, and clean text post-processing. The toolkit requires Python ≥3.8; install it with:

pip install qwen3-asr-toolkit

What the toolkit adds on top of the API

Long-audio handling. The toolkit slices input using voice activity detection (VAD) at natural pauses, keeping each chunk under the API’s hard duration/size caps, then merges outputs in order.

Parallel throughput. A thread pool dispatches multiple chunks concurrently to DashScope endpoints, improving wall-clock latency for hour-long inputs. You control concurrency via -j/--num-threads.

Format & rate normalization. Any common audio/video container (MP4/MOV/MKV/MP3/WAV/M4A, etc.) is converted to the API’s required mono 16 kHz before submission. Requires FFmpeg installed on PATH.

Text cleanup & context. The tool includes post-processing to reduce repetitions/hallucinations and supports context injection to bias recognition toward domain terms; the underlying API also exposes language detection and inverse text normalization (ITN) toggles.

The official Qwen3-ASR-Flash API is single-turn and enforces ≤3 min duration and ≤10 MB payloads per call. That is reasonable for interactive requests but awkward for long media. The toolkit operationalizes best practices—VAD-aware segmentation + concurrent calls—so teams can batch large archives or live capture dumps without writing orchestration from scratch.
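The orchestration the toolkit performs is conceptually straightforward; the rough sketch below shows the chunk-then-parallel-submit pattern under stated assumptions (pause detection and the transcribe_chunk call are placeholders, not the toolkit's internal code).

from concurrent.futures import ThreadPoolExecutor

MAX_CHUNK_SECONDS = 170  # stay safely under the API's 3-minute per-request cap

def split_at_pauses(pause_times, total_seconds, max_len=MAX_CHUNK_SECONDS):
    """Greedily cut at the last detected pause before a chunk would exceed max_len.

    A real implementation would also force-cut long pause-free stretches.
    """
    chunks, start, last_pause = [], 0.0, None
    for t in sorted(pause_times):
        if t - start > max_len and last_pause is not None:
            chunks.append((start, last_pause))
            start = last_pause
        last_pause = t
    chunks.append((start, total_seconds))
    return chunks

def transcribe_all(audio_path, pause_times, total_seconds, transcribe_chunk, num_threads=4):
    chunks = split_at_pauses(pause_times, total_seconds)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() preserves input order, so segments are merged in playback order
        texts = pool.map(lambda span: transcribe_chunk(audio_path, *span), chunks)
    return " ".join(texts)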

Quick start

Install prerequisites

# System: FFmpeg must be available
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install -y ffmpeg

Install the CLI

pip install qwen3-asr-toolkit

Configure credentials

# International endpoint key
export DASHSCOPE_API_KEY="sk-…"

Run

# Basic: local video, default 4 threads
qwen3-asr -i "/path/to/lecture.mp4"

# Faster: raise parallelism and pass key explicitly (optional if env var set)
qwen3-asr -i "/path/to/podcast.wav" -j 8 -key "sk-…"

# Improve domain accuracy with context
qwen3-asr -i "/path/to/earnings_call.m4a" \
    -c "tickers, CFO name, product names, Q3 revenue guidance"

Arguments you’ll actually use: -i/--input-file (file path or http/https URL), -j/--num-threads, -c/--context, -key/--dashscope-api-key, -t/--tmp-dir, -s/--silence. Output is printed and saved as <input_basename>.txt.

Minimal pipeline architecture

1) Load local file or URL → 2) VAD to find silence boundaries → 3) Chunk under API caps → 4) Resample to 16 kHz mono → 5) Parallel submit to DashScope → 6) Aggregate segments in order → 7) Post-process text (dedupe, repetitions) → 8) Emit .txt transcript.

Summary

Qwen3-ASR-Toolkit turns Qwen3-ASR-Flash into a practical long-audio pipeline by combining VAD-based segmentation, FFmpeg normalization (mono/16 kHz), and parallel API dispatch under the 3-minute/10 MB caps. Teams get deterministic chunking, configurable throughput, and optional context/LID/ITN controls without custom orchestration. For production, pin the package version, verify region endpoints/keys, and tune thread count to your network and QPS—then pip install qwen3-asr-toolkit and ship.

Check out the GitHub Page for Codes. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Qwen3-ASR-Toolkit: An Advanced Open Source Python Command-Line Toolkit for Using the Qwen-ASR API Beyond the 3 Minutes/10 MB Limit appeared first on MarkTechPost.

Top Computer Vision CV Blogs & News Websites (2025)

Computer vision moved fast in 2025: new multimodal backbones, larger open datasets, and tighter model–systems integration. Practitioners need sources that publish rigorously, link code and benchmarks, and track deployment patterns—not marketing posts. This list prioritizes primary research hubs, lab blogs, and production-oriented engineering outlets with consistent update cadence. Use it to monitor SOTA shifts, grab reproducible code paths, and translate papers into deployable pipelines.

Google Research (AI Blog)

Primary source for advances from Google/DeepMind teams, including vision architectures (e.g., V-MoE) and periodic research year-in-review posts across CV and multimodal. Posts typically include method summaries, figures, and links to papers/code.

Marktechpost

Consistent reporting on new computer-vision models, datasets, and benchmarks with links to papers, code, and demos. Dedicated CV category plus frequent deep-dives (e.g., DINOv3 releases and analysis). Useful for staying on top of weekly research drops without wading through raw feeds.

AI at Meta

High-signal posts with preprints and open-source drops. Recent examples include DINOv3—scaled self-supervised backbones with SOTA across dense prediction tasks—which provide technical detail and artifacts.

NVIDIA Technical Blog

Production-oriented content on VLM-powered analytics, optimized inference, and GPU pipelines. Category feed for Computer Vision includes blueprints, SDK usage, and performance guidance relevant to enterprise deployments.

arXiv cs.CV — raw research firehose

The canonical preprint feed for CV. Use the recent or new views for daily updates; taxonomy confirms scope (image processing, pattern recognition, scene understanding). Best paired with RSS + custom filters.

CVF Open Access (CVPR/ICCV/ECCV)

Final versions of main-conference papers and workshops, searchable and citable. CVPR 2025 proceedings and workshop menus are already live, making this the authoritative archive post-acceptance.

BAIR Blog (UC Berkeley)

Occasional but deep posts on frontier topics (e.g., extremely large image modeling, robotics-vision crossovers). Good for conceptual clarity directly from authors.

Stanford Blog

Technical explainers and lab roundups (e.g., SAIL at CVPR 2025) with links to papers/talks. Useful to scan emerging directions across perception, generative models, and embodied vision.

Roboflow Blog

High-frequency, implementation-focused posts (labeling, training, deployment, apps, and trend reports). Strong for practitioners who need working pipelines and edge deployments.

Hugging Face Blog

Hands-on guides (VLMs, FiftyOne integrations) and ecosystem notes across Transformers, Diffusers, and timm; good for rapid prototyping and fine-tuning CV/VLM stacks.

PyTorch Blog

Change logs, APIs, and recipes affecting CV training/inference (Transforms V2, multi-weight support, FX feature extraction). Read when upgrading training stacks.

The post Top Computer Vision CV Blogs & News Websites (2025) appeared first on MarkTechPost.

Move your AI agents from proof of concept to production with Amazon Be …

Building an AI agent that can handle a real-life use case in production is a complex undertaking. Although creating a proof of concept demonstrates the potential, moving to production requires addressing scalability, security, observability, and operational concerns that don’t surface in development environments.
This post explores how Amazon Bedrock AgentCore helps you transition your agentic applications from experimental proof of concept to production-ready systems. We follow the journey of a customer support agent that evolves from a simple local prototype to a comprehensive, enterprise-grade solution capable of handling multiple concurrent users while maintaining security and performance standards.
Amazon Bedrock AgentCore is a comprehensive suite of services designed to help you build, deploy, and scale agentic AI applications. If you’re new to AgentCore, we recommend exploring our existing deep-dive posts on individual services: AgentCore Runtime for secure agent deployment and scaling, AgentCore Gateway for enterprise tool development, AgentCore Identity for securing agentic AI at scale, AgentCore Memory for building context-aware agents, AgentCore Code Interpreter for code execution, AgentCore Browser Tool for web interaction, and AgentCore Observability for transparency on your agent behavior. This post demonstrates how these services work together in a real-world scenario.
The customer support agent journey
Customer support represents one of the most common and compelling use cases for agentic AI. Modern businesses handle thousands of customer inquiries daily, ranging from simple policy questions to complex technical troubleshooting. Traditional approaches often fall short: rule-based chatbots frustrate customers with rigid responses, and human-only support teams struggle with scalability and consistency. An intelligent customer support agent needs to seamlessly handle diverse scenarios: managing customer orders and accounts, looking up return policies, searching product catalogs, troubleshooting technical issues through web research, and remembering customer preferences across multiple interactions. Most importantly, it must do all this while maintaining the security and reliability standards expected in enterprise environments. Consider the typical evolution path many organizations follow when building such agents:

The proof of concept stage – Teams start with a simple local prototype that demonstrates core capabilities, such as a basic agent that can answer policy questions and search for products. This works well for demos but lacks the robustness needed for real customer interactions.
The reality check – As soon as you try to scale beyond a few test users, challenges emerge. The agent forgets previous conversations, tools become unreliable under load, there’s no way to monitor performance, and security becomes a paramount concern.
The production challenge – Moving to production requires addressing session management, secure tool sharing, observability, authentication, and building interfaces that customers actually want to use. Many promising proofs of concept stall at this stage due to the complexity of these requirements.

In this post, we address each challenge systematically. We start with a prototype agent equipped with three essential tools: return policy lookup, product information search, and web search for troubleshooting. From there, we add the capabilities needed for production deployment: persistent memory for conversation continuity and a hyper-personalized experience, centralized tool management for reliability and security, full observability for monitoring and debugging, and finally a customer-facing web interface. This progression mirrors the real-world path from proof of concept to production, demonstrating how Amazon Bedrock AgentCore services work together to solve the operational challenges that emerge as your agentic applications mature. For simplification and demonstration purposes, we consider a single-agent architecture. In real-life use cases, customer support agents are often created as multi-agent architectures and those scenarios are also supported by Amazon Bedrock AgentCore services.
Solution overview
Every production system starts with a proof of concept, and our customer support agent is no exception. In this first phase, we build a functional prototype that demonstrates the core capabilities needed for customer support. In this case, we use Strands Agents, an open source agent framework, to build the proof of concept and Anthropic’s Claude 3.7 Sonnet on Amazon Bedrock as the large language model (LLM) powering our agent. For your application, you can use another agent framework and model of your choice.
Agents rely on tools to take actions and interact with live systems. Several tools are used in customer support agents, but to keep our example simple, we focus on three core capabilities to handle the most common customer inquiries:

Return policy lookup – Customers frequently ask about return windows, conditions, and processes. Our tool provides structured policy information based on product categories, covering everything from return timeframes to refund processing and shipping policies.
Product information retrieval – Technical specifications, warranty details, and compatibility information are essential for both pre-purchase questions and troubleshooting. This tool serves as a bridge to your product catalog, delivering formatted technical details that customers can understand.
Web search for troubleshooting – Complex technical issues often require the latest solutions or community-generated fixes not found in internal documentation. Web search capability allows the agent to access the web for current troubleshooting guides and technical solutions in real time.

The tools implementation and the end-to-end code for this use case are available in our GitHub repository. In this post, we focus on the main code that connects with Amazon Bedrock AgentCore, but you can follow the end-to-end journey in the repository.
Create the agent
With the tools available, let’s create the agent. The architecture for our proof of concept will look like the following diagram.

You can find the end-to-end code for this post on the GitHub repository. For simplicity, we show only the essential parts for our end-to-end code here:

from strands import Agent, tool
from strands.models import BedrockModel

@tool
def get_return_policy(product_category: str) -> str:
    """Get return policy information for a specific product category."""
    # Returns structured policy info: windows, conditions, processes, refunds
    # check github for full code
    return {"return_window": "10 days", "conditions": ""}

@tool
def get_product_info(product_type: str) -> str:
    """Get detailed technical specifications and information for electronics products."""
    # Returns warranty, specs, features, compatibility details
    # check github for full code
    return {"product": "ThinkPad X1 Carbon", "info": "ThinkPad X1 Carbon info"}

@tool
def web_search(keywords: str, region: str = "us-en", max_results: int = 5) -> str:
    """Search the web for updated troubleshooting information."""
    # Provides access to current technical solutions and guides
    # check github for full code
    return "results from websearch"

# Initialize the Bedrock model
model = BedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    temperature=0.3
)

# Create the customer support agent
agent = Agent(
    model=model,
    tools=[
        get_product_info,
        get_return_policy,
        web_search
    ],
    system_prompt="""You are a helpful customer support assistant for an electronics company.
    Use the appropriate tools to provide accurate information and always offer additional help."""
)

Test the proof of concept
When we test our prototype with realistic customer queries, the agent demonstrates the correct tool selection and interaction with real-world systems:

# Return policy inquiry
response = agent("What's the return policy for my ThinkPad X1 Carbon?")
# Agent correctly uses get_return_policy with "laptops" category

# Technical troubleshooting
response = agent("My iPhone 14 heats up, how do I fix it?")
# Agent uses web_search to find current troubleshooting solutions

The agent works well for these individual queries, correctly mapping laptop inquiries to return policy lookups and complex technical issues to web search, providing comprehensive and actionable responses.
The proof of concept reality check
Our proof of concept successfully demonstrates that an agent can handle diverse customer support scenarios using the right combination of tools and reasoning. The agent runs perfectly on your local machine and handles queries correctly. However, this is where the proof of concept gap becomes obvious. The tools are defined as local functions in your agent code, the agent responds quickly, and everything seems production-ready. But several critical limitations become apparent the moment you think beyond single-user testing:

Memory loss between sessions – If you restart your notebook or application, the agent completely forgets previous conversations. A customer who was discussing a laptop return yesterday would need to start from scratch today, re-explaining their entire situation. This isn’t just inconvenient—it’s a poor customer experience that breaks the conversational flow that makes AI agents valuable.
Single customer limitation – Your current agent can only handle one conversation at a time. If two customers try to use your support system simultaneously, their conversations would interfere with each other, or worse, one customer might see another’s conversation history. There’s no mechanism to maintain separate conversation context for different users.
Tools embedded in code – Your tools are defined directly in the agent code. This means:

You can’t reuse these tools across different agents (sales agent, technical support agent, and so on).
Updating a tool requires changing the agent code and redeploying everything.
Different teams can’t maintain different tools independently.

No production infrastructure – The agent runs locally with no consideration for scalability, security, monitoring, and reliability.

These fundamental architectural barriers can prevent real customer deployment. Agent building teams can take months to address these issues, which delays the time to value from their work and adds significant costs to the application. This is where Amazon Bedrock AgentCore services become essential. Rather than spending months building these production capabilities from scratch, Amazon Bedrock AgentCore provides managed services that address each gap systematically.
Let’s begin our journey to production by solving the memory problem first, transforming our agent from one that forgets every conversation into one that remembers customers across conversations and can hyper-personalize conversations using Amazon Bedrock AgentCore Memory.
Add persistent memory for hyper-personalized agents
The first major limitation we identified in our proof of concept was memory loss—our agent forgot everything between sessions, forcing customers to repeat their context every time. This “goldfish agent” behavior breaks the conversational experience that makes AI agents valuable in the first place.
Amazon Bedrock AgentCore Memory solves this by providing managed, persistent memory that operates on two complementary levels:

Short-term memory – Immediate conversation context and session-based information for continuity within interactions
Long-term memory – Persistent information extracted across multiple conversations, including customer preferences, facts, and behavioral patterns

After adding Amazon Bedrock AgentCore Memory to our customer support agent, our new architecture will look like the following diagram.

Install dependencies
Before we start, let’s install our dependencies: boto3, the AgentCore SDK, and the AgentCore Starter Toolkit SDK. Those will help us quickly add Amazon Bedrock AgentCore capabilities to our agent proof of concept. See the following code:

pip install boto3 bedrock-agentcore bedrock-agentcore-starter-toolkit

Create the memory resources
Amazon Bedrock AgentCore Memory uses configurable strategies to determine what information to extract and store. For our customer support use case, we use two complementary strategies:

USER_PREFERENCE – Automatically extracts and stores customer preferences like “prefers ThinkPad laptops,” “uses Linux,” or “plays competitive FPS games.” This enables personalized recommendations across conversations.
SEMANTIC – Captures factual information using vector embeddings, such as “customer has MacBook Pro order #MB-78432” or “reported overheating issues during video editing.” This provides relevant context for troubleshooting.

See the following code:

from bedrock_agentcore.memory import MemoryClient
from bedrock_agentcore.memory.constants import StrategyType

memory_client = MemoryClient(region_name=region)

strategies = [
    {
        StrategyType.USER_PREFERENCE.value: {
            "name": "CustomerPreferences",
            "description": "Captures customer preferences and behavior",
            "namespaces": ["support/customer/{actorId}/preferences"],
        }
    },
    {
        StrategyType.SEMANTIC.value: {
            "name": "CustomerSupportSemantic",
            "description": "Stores facts from conversations",
            "namespaces": ["support/customer/{actorId}/semantic"],
        }
    },
]

# Create memory resource with both strategies
response = memory_client.create_memory_and_wait(
    name="CustomerSupportMemory",
    description="Customer support agent memory",
    strategies=strategies,
    event_expiry_days=90,
)

Integrate with Strands Agents hooks
The key to making memory work seamlessly is automation—customers shouldn’t need to think about it, and agents shouldn’t require manual memory management. Strands Agents provides a powerful hook system that lets you intercept agent lifecycle events and handle memory operations automatically. The hook system enables both built-in components and user code to react to or modify agent behavior through strongly-typed event callbacks. For our use case, we create CustomerSupportMemoryHooks to retrieve the customer context and save the support interactions:

MessageAddedEvent hook – Triggered when customers send messages, this hook automatically retrieves relevant memory context and injects it into the query. The agent receives both the customer’s question and relevant historical context without manual intervention.
AfterInvocationEvent hook – Triggered after agent responses, this hook automatically saves the interaction to memory. The conversation becomes part of the customer’s persistent history immediately.

See the following code:

class CustomerSupportMemoryHooks(HookProvider):
    def retrieve_customer_context(self, event: MessageAddedEvent):
        """Inject customer context before processing queries"""
        user_query = event.agent.messages[-1]["content"][0]["text"]

        # Retrieve relevant memories from both strategies
        all_context = []
        for context_type, namespace in self.namespaces.items():
            memories = self.client.retrieve_memories(
                memory_id=self.memory_id,
                namespace=namespace.format(actorId=self.actor_id),
                query=user_query,
                top_k=3,
            )
            # Format and add to context
            for memory in memories:
                if memory.get("content", {}).get("text"):
                    all_context.append(f"[{context_type.upper()}] {memory['content']['text']}")

        # Inject context into the user query
        if all_context:
            context_text = "\n".join(all_context)
            original_text = event.agent.messages[-1]["content"][0]["text"]
            event.agent.messages[-1]["content"][0]["text"] = f"Customer Context:\n{context_text}\n\n{original_text}"

    def save_support_interaction(self, event: AfterInvocationEvent):
        """Save interactions after agent responses"""
        # Get last customer query and agent response; check github for implementation
        customer_query = "This is a sample query"
        agent_response = "LLM gave a sample response"

        # Extract customer query and agent response
        # Save to memory for future retrieval
        self.client.create_event(
            memory_id=self.memory_id,
            actor_id=self.actor_id,
            session_id=self.session_id,
            messages=[(customer_query, "USER"), (agent_response, "ASSISTANT")]
        )

In this code, we can see that our hooks are the ones interacting with Amazon Bedrock AgentCore Memory to save and retrieve memory events.
Integrate memory with the agent
Adding memory to our existing agent requires minimal code changes; you can simply instantiate the memory hooks and pass them to the agent constructor. The agent code then only needs to connect with the memory hooks to use the full power of Amazon Bedrock AgentCore Memory. We will create a new hook for each session, which will help us handle different customer interactions. See the following code:

# Create memory hooks for this customer session
memory_hooks = CustomerSupportMemoryHooks(
    memory_id=memory_id,
    client=memory_client,
    actor_id=customer_id,
    session_id=session_id
)

# Create agent with memory capabilities
agent = Agent(
    model=model,
    hooks=[memory_hooks],
    tools=[get_product_info, get_return_policy, web_search],
    system_prompt=SYSTEM_PROMPT
)

Test the memory in action
Let’s see how memory transforms the customer experience. When we invoke the agent, it draws on memory from previous interactions, which captured the customer’s interest in gaming headphones, ThinkPad laptops, and MacBook thermal issues:

# Test personalized recommendations
response = agent("Which headphones would you recommend?")
# Agent remembers: "prefers low latency for competitive FPS games"
# Response includes gaming-focused recommendations

# Test preference recall
response = agent("What is my preferred laptop brand?")
# Agent remembers: "prefers ThinkPad models" and "needs Linux compatibility"
# Response acknowledges ThinkPad preference and suggests compatible models

The transformation is immediately apparent. Instead of generic responses, the agent now provides personalized recommendations based on the customer’s stated preferences and past interactions. The customer doesn’t need to re-explain their gaming needs or Linux requirements—the agent already knows.
Benefits of Amazon Bedrock AgentCore Memory
With Amazon Bedrock AgentCore Memory integrated, our agent now delivers the following benefits:

Conversation continuity – Customers can pick up where they left off, even across different sessions or support channels
Personalized service – Recommendations and responses are tailored to individual preferences and past issues
Contextual troubleshooting – Access to previous problems and solutions enables more effective support
Seamless experience – Memory operations happen automatically without customer or agent intervention

However, we still have limitations to address. Our tools remain embedded in the agent code, preventing reuse across different support agents or teams. Security and access controls are minimal, and we still can’t handle multiple customers simultaneously in a production environment.
In the next section, we address these challenges by centralizing our tools using Amazon Bedrock AgentCore Gateway and implementing proper identity management with Amazon Bedrock AgentCore Identity, creating a scalable and secure foundation for our customer support system.
Centralize tools with Amazon Bedrock AgentCore Gateway and Amazon Bedrock AgentCore Identity
With memory solved, our next challenge is tool architecture. Currently, our tools are embedded directly in the agent code—a pattern that works for prototypes but creates significant problems at scale. When you need multiple agents (customer support, sales, technical support), each one duplicates the same tools, leading to extensive code, inconsistent behavior, and maintenance nightmares.
Amazon Bedrock AgentCore Gateway simplifies this process by centralizing tools into reusable, secure endpoints that agents can access. Combined with Amazon Bedrock AgentCore Identity for authentication, it creates an enterprise-grade tool sharing infrastructure.
We will now update our agent to use Amazon Bedrock AgentCore Gateway and Amazon Bedrock AgentCore Identity. The architecture will look like the following diagram.

In this case, we convert our web search tool to be used in the gateway and keep the return policy and get product information tools local to this agent. That is important because web search is a common capability that can be reused across different use cases in an organization, whereas return policy and product information are capabilities commonly associated with customer support services. With Amazon Bedrock AgentCore services, you can decide which capabilities to use and how to combine them. In this case, we also use two new tools that could have been developed by other teams: check warranty and get customer profile. Because those teams have already exposed those tools using AWS Lambda functions, we can use them as targets for our Amazon Bedrock AgentCore Gateway. Amazon Bedrock AgentCore Gateway can also support REST APIs as targets. That means that if we have an OpenAPI specification or a Smithy model, we can also quickly expose our tools using Amazon Bedrock AgentCore Gateway.
Convert existing services to MCP
Amazon Bedrock AgentCore Gateway uses the Model Context Protocol (MCP) to standardize how agents access tools. Converting existing Lambda functions into MCP endpoints requires minimal changes—mainly adding tool schemas and handling the MCP context. To use this functionality, we convert our local tools to Lambda functions and create the tools schema definitions to make these functions discoverable by agents:

# Original Lambda function (simplified)
def web_search(keywords: str, region: str = "us-en", max_results: int = 5) -> str:
    # web_search functionality
    ...

def lambda_handler(event, context):
    if get_tool_name(event) == "web_search":
        keywords = get_named_parameter(event=event, name="keywords")

        search_result = web_search(keywords)
        return {"statusCode": 200, "body": search_result}

The following code is the tool schema definition:

{
    "name": "web_search",
    "description": "Search the web for updated information using DuckDuckGo",
    "inputSchema": {
        "type": "object",
        "properties": {
            "keywords": {
                "type": "string",
                "description": "The search query keywords"
            },
            "region": {
                "type": "string",
                "description": "The search region (e.g., us-en, uk-en, ru-ru)"
            },
            "max_results": {
                "type": "integer",
                "description": "The maximum number of results to return"
            }
        },
        "required": [
            "keywords"
        ]
    }
}

For demonstration purposes, we build a new Lambda function from scratch. In reality, organizations already have different functionalities available as REST services or Lambda functions, and this approach lets you expose existing enterprise services as agent tools without rebuilding them.
Configure security with Amazon Bedrock AgentCore Gateway and integrate with Amazon Bedrock AgentCore Identity
Amazon Bedrock AgentCore Gateway requires authentication for both inbound and outbound connections. Amazon Bedrock AgentCore Identity handles this through standard OAuth flows. After you set up an OAuth authorization configuration, you can create a new gateway and pass this configuration to it. See the following code:

# Create gateway with JWT-based authentication
auth_config = {
    “customJWTAuthorizer”: {
        “allowedClients”: [cognito_client_id],
        “discoveryUrl”: cognito_discovery_url
    }
}

gateway_response = gateway_client.create_gateway(
    name=”customersupport-gw”,
    roleArn=gateway_iam_role,
    protocolType=”MCP”,
    authorizerType=”CUSTOM_JWT”,
    authorizerConfiguration=auth_config,
    description=”Customer Support AgentCore Gateway”
)

For inbound authentication, agents must present valid JSON Web Tokens (JWTs), a compact, self-contained standard for securely transmitting information between parties, issued by identity providers such as Amazon Cognito, Okta, or Entra ID, in order to access Amazon Bedrock AgentCore Gateway tools.
For outbound authentication, Amazon Bedrock AgentCore Gateway can authenticate to downstream services using AWS Identity and Access Management (IAM) roles, API keys, or OAuth tokens.
For demonstration purposes, we have created an Amazon Cognito user pool with a dummy user name and password. For your use case, you should set up a proper identity provider and manage the users accordingly. This configuration ensures that only authorized agents can access specific tools and that a full audit trail is provided.
Add Lambda targets
After you set up Amazon Bedrock AgentCore Gateway, adding Lambda functions as tool targets is straightforward:

lambda_target_config = {
    "mcp": {
        "lambda": {
            "lambdaArn": lambda_function_arn,
            "toolSchema": {"inlinePayload": api_spec},
        }
    }
}

gateway_client.create_gateway_target(
    gatewayIdentifier=gateway_id,
    name="LambdaTools",
    targetConfiguration=lambda_target_config,
    credentialProviderConfigurations=[{
        "credentialProviderType": "GATEWAY_IAM_ROLE"
    }]
)

The gateway now exposes your Lambda functions as MCP tools that authorized agents can discover and use.
Integrate MCP tools with Strands Agents
Converting our agent to use centralized tools requires updating the tool configuration. We keep some tools local, such as product info and return policies specific to customer support that will likely not be reused in other use cases, and use centralized tools for shared capabilities. Because Strands Agents has a native integration for MCP tools, we can simply use the MCPClient from Strands with a streamablehttp_client. See the following code:

# Get OAuth token for gateway access
gateway_access_token = get_token(
    client_id=cognito_client_id,
    client_secret=cognito_client_secret,
    scope=auth_scope,
    url=token_url
)

# Create authenticated MCP client
mcp_client = MCPClient(
    lambda: streamablehttp_client(
        gateway_url,
        headers={"Authorization": f"Bearer {gateway_access_token['access_token']}"}
    )
)

# Combine local and MCP tools
tools = [
    get_product_info,     # Local tool (customer support specific)
    get_return_policy,    # Local tool (customer support specific)
] + mcp_client.list_tools_sync()  # Centralized tools from gateway

agent = Agent(
    model=model,
    tools=tools,
    hooks=[memory_hooks],
    system_prompt=SYSTEM_PROMPT
)

Test the enhanced agent
With the centralized tools integrated, our agent now has access to enterprise capabilities like warranty checking:

# Test web search using centralized tool  
response = agent("How can I fix Lenovo ThinkPad with a blue screen?")
# Agent uses web_search from AgentCore Gateway

The agent seamlessly combines local tools with centralized ones, providing comprehensive support capabilities while maintaining security and access control.
However, we still have a significant limitation: our entire agent runs locally on our development machine. For production deployment, we need scalable infrastructure, comprehensive observability, and the ability to handle multiple concurrent users.
In the next section, we address this by deploying our agent to Amazon Bedrock AgentCore Runtime, transforming our local prototype into a production-ready system with Amazon Bedrock AgentCore Observability and automatic scaling capabilities.
Deploy to production with Amazon Bedrock AgentCore Runtime
With the tools centralized and secured, our final major hurdle is production deployment. Our agent currently runs locally on your laptop, which is ideal for experimentation but unsuitable for real customers. Production requires scalable infrastructure, comprehensive monitoring, automatic error recovery, and the ability to handle multiple concurrent users reliably.
Amazon Bedrock AgentCore Runtime transforms your local agent into a production-ready service with minimal code changes. Combined with Amazon Bedrock AgentCore Observability, it provides enterprise-grade reliability, automatic scaling, and comprehensive monitoring capabilities that operations teams need to maintain agentic applications in production.
Our architecture will look like the following diagram.

Minimal code changes for production
Converting your local agent requires adding just four lines of code:

from bedrock_agentcore.runtime import BedrockAgentCoreApp  # addition 1: import the runtime wrapper

app = BedrockAgentCoreApp()  # addition 2: create the app

# Your existing agent code remains unchanged
model = BedrockModel(model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0")
memory_hooks = CustomerSupportMemoryHooks(memory_id, memory_client, actor_id, session_id)
agent = Agent(
    model=model,
    tools=[get_return_policy, get_product_info],
    system_prompt=SYSTEM_PROMPT,
    hooks=[memory_hooks]
)

@app.entrypoint  # addition 3: mark the function AgentCore Runtime should invoke
def invoke(payload):
    user_input = payload.get("prompt", "")
    response = agent(user_input)
    return response.message["content"][0]["text"]

if __name__ == "__main__":
    app.run()  # addition 4: start the AgentCore HTTP server
BedrockAgentCoreApp automatically creates an HTTP server with the required /invocations and /ping endpoints, handles proper content types and response formats, manages error handling according to AWS standards, and provides the infrastructure bridge between your agent code and Amazon Bedrock AgentCore Runtime.
Secure production deployment
Production deployment requires proper authentication and access control. Amazon Bedrock AgentCore Runtime integrates with Amazon Bedrock AgentCore Identity to provide enterprise-grade security. Using the Bedrock AgentCore Starter Toolkit, we can deploy our application using three simple steps: configure, launch, and invoke.
During the configuration, a Dockerfile is created to guide the deployment of our agent. It contains information about the agent and its dependencies, the Amazon Bedrock AgentCore Identity configuration, and the Amazon Bedrock AgentCore Observability configuration to be used. During the launch step, AWS CodeBuild is used to run this Dockerfile and an Amazon Elastic Container Registry (Amazon ECR) repository is created to store the agent dependencies. The Amazon Bedrock AgentCore Runtime agent is then created using the image in the ECR repository, and an endpoint is generated that is used to invoke the agent in applications. If your agent is configured with OAuth authentication through Amazon Bedrock AgentCore Identity, as ours will be, you also need to pass the authentication token during the agent invocation step. The following diagram illustrates this process.

The code to configure and launch our agent on Amazon Bedrock AgentCore Runtime will look as follows:

from bedrock_agentcore_starter_toolkit import Runtime

# Configure secure deployment with Cognito authentication
agentcore_runtime = Runtime()

response = agentcore_runtime.configure(
    entrypoint="lab_helpers/lab4_runtime.py",
    execution_role=execution_role_arn,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region=region,
    agent_name="customer_support_agent",
    authorizer_configuration={
        "customJWTAuthorizer": {
            "allowedClients": [cognito_client_id],
            "discoveryUrl": cognito_discovery_url,
        }
    }
)

# Deploy to production
launch_result = agentcore_runtime.launch()

This configuration creates a secure endpoint that only accepts requests with valid JWT tokens from your identity provider (such as Amazon Cognito, Okta, or Entra). For our agent, we use a dummy setup with Amazon Cognito, but your application can use an identity provider of your choosing. The deployment process automatically builds your agent into a container, creates the necessary AWS infrastructure, and establishes monitoring and logging pipelines.
Session management and isolation
One of the most critical production features for agents is proper session management. Amazon Bedrock AgentCore Runtime automatically handles session isolation, making sure different customers’ conversations don’t interfere with each other:

# Customer 1 conversation
response1 = agentcore_runtime.invoke(
    {"prompt": "My iPhone Bluetooth isn't working. What should I do?"},
    bearer_token=auth_token,
    session_id="session-customer-1"
)

# Customer 1 follow-up (maintains context)
response2 = agentcore_runtime.invoke(
    {"prompt": "I've turned Bluetooth on and off but it still doesn't work"},
    bearer_token=auth_token,
    session_id="session-customer-1"  # Same session, context preserved
)

# Customer 2 conversation (completely separate)
response3 = agentcore_runtime.invoke(
    {"prompt": "Still not working. What is going on?"},
    bearer_token=auth_token,
    session_id="session-customer-2"  # Different session, no context
)

Customer 1’s follow-up maintains full context about their iPhone Bluetooth issue, whereas Customer 2’s message (in a different session) has no context and the agent appropriately asks for more information. This automatic session isolation is crucial for production customer support scenarios.
Comprehensive observability with Amazon Bedrock AgentCore Observability
Production agents need comprehensive monitoring to diagnose issues, optimize performance, and maintain reliability. Amazon Bedrock AgentCore Observability automatically instruments your agent code and sends telemetry data to Amazon CloudWatch, where you can analyze patterns and troubleshoot issues in real time. The observability data includes session-level tracking, so you can trace individual customer sessions and understand exactly what happened during a support interaction. You can use Amazon Bedrock AgentCore Observability with an agent of your choice, whether or not it is hosted on Amazon Bedrock AgentCore Runtime. Because Amazon Bedrock AgentCore Runtime integrates with Amazon Bedrock AgentCore Observability automatically, no extra work is needed to observe our agent.
With Amazon Bedrock AgentCore Runtime deployment, your agent is ready to be used in production. However, we still have one limitation: our agent is accessible only through SDK or API calls, requiring customers to write code or use technical tools to interact with it. For true customer-facing deployment, we need a user-friendly web interface that customers can access through their browsers.
In the following section, we demonstrate the complete journey by building a sample web application using Streamlit, providing an intuitive chat interface that can interact with our production-ready Amazon Bedrock AgentCore Runtime endpoint. The exposed endpoint maintains the security, scalability, and observability capabilities we’ve built throughout our journey from proof of concept to production. In a real-world scenario, you would integrate this endpoint with your existing customer-facing applications and UI frameworks.
Create a customer-facing UI
With our agent deployed to production, the final step is creating a customer-facing UI that customers can use to interface with the agent. Although SDK access works for developers, customers need an intuitive web interface for seamless support interactions.
To demonstrate a complete solution, we build a sample Streamlit-based web application that connects to our production-ready Amazon Bedrock AgentCore Runtime endpoint. The frontend includes secure Amazon Cognito authentication, real-time streaming responses, persistent session management, and a clean chat interface. Although we use Streamlit for rapid prototyping, enterprises would typically integrate the endpoint with their existing interfaces or preferred UI frameworks.
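As an illustration only (not the exact application from the workshop), the following is a minimal Streamlit chat sketch; the endpoint URL, payload shape, and token placeholder are assumptions that depend on how the runtime endpoint is exposed in your account:

import requests
import streamlit as st

AGENT_ENDPOINT_URL = "https://example.com/agent/invocations"  # placeholder for your runtime endpoint
AUTH_TOKEN = "<jwt-from-your-identity-provider>"              # placeholder bearer token

st.title("Customer Support Assistant")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("How can we help you today?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    # Forward the question to the deployed agent with the JWT from the identity provider
    resp = requests.post(
        AGENT_ENDPOINT_URL,
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        json={"prompt": prompt},
        timeout=60,
    )
    answer = resp.text  # response shape depends on your entrypoint's return value

    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.write(answer)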
The end-to-end application (shown in the following diagram) maintains full conversation context across the sessions while providing the security, scalability, and observability capabilities that we built throughout this post. The result is a complete customer support agentic system that handles everything from initial authentication to complex multi-turn troubleshooting conversations, demonstrating how Amazon Bedrock AgentCore services transform prototypes into production-ready customer applications.

Conclusion
Our journey from prototype to production demonstrates how Amazon Bedrock AgentCore services address the traditional barriers to deploying enterprise-ready agentic applications. What started as a simple local customer support chatbot transformed into a comprehensive, production-grade system capable of serving multiple concurrent users with persistent memory, secure tool sharing, comprehensive observability, and an intuitive web interface—without months of custom infrastructure development.
The transformation required minimal code changes at each step, showcasing how Amazon Bedrock AgentCore services work together to solve the operational challenges that typically stall promising proofs of concept. Memory capabilities avoid the “goldfish agent” problem, centralized tool management through Amazon Bedrock AgentCore Gateway creates a reusable infrastructure that securely serves multiple use cases, Amazon Bedrock AgentCore Runtime provides enterprise-grade deployment with automatic scaling, and Amazon Bedrock AgentCore Observability delivers the monitoring capabilities operations teams need to maintain production systems.
The following video provides an overview of AgentCore capabilities.

Ready to build your own production-ready agent? Start with our complete end-to-end tutorial, where you can follow along with the exact code and configurations we’ve explored in this post. For additional use cases and implementation patterns, explore the broader GitHub repository, and dive deeper into service capabilities and best practices in the Amazon Bedrock AgentCore documentation.

About the authors
Maira Ladeira Tanke is a Tech Lead for Agentic AI at AWS, where she enables customers on their journey to develop autonomous AI systems. With over 10 years of experience in AI/ML, Maira partners with enterprise customers to accelerate the adoption of agentic applications using Amazon Bedrock AgentCore and Strands Agents, helping organizations harness the power of foundation models to drive innovation and business transformation. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.

Building AI agents is 5% AI and 100% software engineering

Production-grade agents live or die on data plumbing, controls, and observability—not on model choice. The doc-to-chat pipeline below maps the concrete layers and why they matter.

What is a “doc-to-chat” pipeline?

A doc-to-chat pipeline ingests enterprise documents, standardizes them, enforces governance, indexes embeddings alongside relational features, and serves retrieval + generation behind authenticated APIs with human-in-the-loop (HITL) checkpoints. It’s the reference architecture for agentic Q&A, copilots, and workflow automation where answers must respect permissions and be audit-ready. Production implementations are variations of RAG (retrieval-augmented generation) hardened with LLM guardrails, governance, and OpenTelemetry-backed tracing.

How do you integrate cleanly with the existing stack?

Use standard service boundaries (REST/JSON, gRPC) over a storage layer your org already trusts. For tables, Iceberg gives ACID, schema evolution, partition evolution, and snapshots—critical for reproducible retrieval and backfills. For vectors, use a system that coexists with SQL filters: pgvector collocates embeddings with business keys and ACL tags in PostgreSQL; dedicated engines like Milvus handle high-QPS ANN with disaggregated storage/compute. In practice, many teams run both: SQL+pgvector for transactional joins and Milvus for heavy retrieval.

Key properties

Iceberg tables: ACID, hidden partitioning, snapshot isolation; vendor support across warehouses.

pgvector: SQL + vector similarity in one query plan for precise joins and policy enforcement (see the sketch after this list).

Milvus: layered, horizontally scalable architecture for large-scale similarity search.
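To make the pgvector point concrete, here is a minimal sketch assuming a hypothetical doc_chunks table with tenant_id, acl_tags, and embedding columns; the connection string, schema, and query vector are placeholders:

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Connection string, table, and column names are illustrative placeholders
conn = psycopg.connect("postgresql://app:secret@localhost:5432/knowledge")
register_vector(conn)

query_embedding = np.array([0.12, -0.03, 0.44], dtype=np.float32)  # output of your embedding model
tenant_id = "acme-corp"
allowed_tags = ["support", "public"]

rows = conn.execute(
    """
    SELECT doc_id, chunk_text
    FROM doc_chunks
    WHERE tenant_id = %s          -- business-key filter in the same query plan
      AND acl_tags && %s          -- caller only sees permitted chunks
    ORDER BY embedding <-> %s     -- L2 distance; use <=> for cosine distance
    LIMIT 10
    """,
    (tenant_id, allowed_tags, query_embedding),
).fetchall()
print(rows)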

How do agents, humans, and workflows coordinate on one “knowledge fabric”?

Production agents require explicit coordination points where humans approve, correct, or escalate. AWS A2I provides managed HITL loops (private workforces, flow definitions) and is a concrete blueprint for gating low-confidence outputs. Frameworks like LangGraph model these human checkpoints inside agent graphs so approvals are first-class steps in the DAG, not ad hoc callbacks. Use them to gate actions like publishing summaries, filing tickets, or committing code.

Pattern: LLM → confidence/guardrail checks → HITL gate → side-effects. Persist every artifact (prompt, retrieval set, decision) for auditability and future re-runs.
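A framework-agnostic sketch of that gate follows; the threshold, queue, and audit store are illustrative placeholders, not a specific A2I or LangGraph API:

from dataclasses import dataclass, asdict
import json
import uuid

CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune per workload

@dataclass
class Draft:
    request_id: str
    answer: str
    confidence: float
    retrieved_doc_ids: list

def hitl_gate(draft: Draft, review_queue: list, audit_log: list) -> dict:
    """Persist the artifact, then either auto-approve or route to human review."""
    audit_log.append(asdict(draft))  # every prompt/retrieval/decision is replayable later
    if draft.confidence >= CONFIDENCE_THRESHOLD:
        return {"action": "auto_publish", "answer": draft.answer}
    review_queue.append(asdict(draft))  # e.g., feed an A2I human loop or a LangGraph interrupt
    return {"action": "pending_review", "request_id": draft.request_id}

audit_log, review_queue = [], []
draft = Draft(str(uuid.uuid4()), "Rotate the key under Settings > API.", 0.62, ["doc-17", "doc-42"])
print(json.dumps(hitl_gate(draft, review_queue, audit_log), indent=2))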

How is reliability enforced before anything reaches the model?

Treat reliability as layered defenses:

Language + content guardrails: Pre-validate inputs/outputs for safety and policy. Options span managed (Bedrock Guardrails) and OSS (NeMo Guardrails, Guardrails AI; Llama Guard). Independent comparisons and a position paper catalog the trade-offs.

PII detection/redaction: Run analyzers on both source docs and model I/O. Microsoft Presidio offers recognizers and masking, with explicit caveats to combine it with additional controls (a minimal redaction sketch follows this list).

Access control and lineage: Enforce row-/column-level ACLs and audit across catalogs (Unity Catalog) so retrieval respects permissions; unify lineage and access policies across workspaces.

Retrieval quality gates: Evaluate RAG with reference-free metrics (faithfulness, context precision/recall) using Ragas/related tooling; block or down-rank poor contexts.
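To make the PII layer concrete, here is a minimal Presidio sketch; the example text is synthetic, and production pipelines should combine this with the other controls listed above:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact Jane Doe at jane.doe@example.com or +1-202-555-0173 about ticket 8841."

# Detect entities such as PERSON, EMAIL_ADDRESS, PHONE_NUMBER in the source text
findings = analyzer.analyze(text=text, language="en")

# Replace detected spans with entity placeholders before indexing or prompting
redacted = anonymizer.anonymize(text=text, analyzer_results=findings)
print(redacted.text)  # e.g., "Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER> about ticket 8841."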

How do you scale indexing and retrieval under real traffic?

Two axes matter: ingest throughput and query concurrency.

Ingest: Normalize at the lakehouse edge; write to Iceberg for versioned snapshots, then embed asynchronously. This enables deterministic rebuilds and point-in-time re-indexing.

Vector serving: Milvus’s shared-storage, disaggregated compute architecture supports horizontal scaling with independent failure domains; use HNSW/IVF/Flat hybrids and replica sets to balance recall/latency.

SQL + vector: Keep business joins server-side (pgvector), e.g., WHERE tenant_id = ? AND acl_tag @> … ORDER BY embedding <-> :q LIMIT k. This avoids N+1 trips and respects policies.

Chunking/embedding strategy: Tune chunk size/overlap and semantic boundaries; bad chunking is the silent killer of recall.

For structured+unstructured fusion, prefer hybrid retrieval (BM25 + ANN + reranker) and store structured features next to vectors to support filters and re-ranking features at query time.
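A minimal sketch of the fusion step, using reciprocal rank fusion over hypothetical lexical and vector result lists (a cross-encoder reranker such as Cohere Rerank would then run on the fused candidates):

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked doc-id lists (e.g., BM25 and ANN) by summing 1 / (k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc-3", "doc-9", "doc-1"]  # lexical retrieval
ann_hits = ["doc-9", "doc-4", "doc-3"]   # vector retrieval
print(reciprocal_rank_fusion([bm25_hits, ann_hits]))
# ['doc-9', 'doc-3', 'doc-4', 'doc-1'] -- pass these to the reranker before generation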

How do you monitor beyond logs?

You need traces, metrics, and evaluations stitched together:

Distributed tracing: Emit OpenTelemetry spans across ingestion, retrieval, model calls, and tools; LangSmith natively ingests OTEL traces and interoperates with external APMs (Jaeger, Datadog, Elastic). This gives end-to-end timing, prompts, contexts, and costs per request (a minimal sketch appears below).

LLM observability platforms: Compare options (LangSmith, Arize Phoenix, LangFuse, Datadog) by tracing, evals, cost tracking, and enterprise readiness. Independent roundups and comparison matrices are available.

Continuous evaluation: Schedule RAG evals (Ragas/DeepEval/MLflow) on canary sets and live traffic replays; track faithfulness and grounding drift over time.

Add schema profiling/mapping on ingestion to keep observability attached to data shape changes (e.g., new templates, table evolution) and to explain retrieval regressions when upstream sources shift.
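To illustrate the tracing bullet above, a minimal OpenTelemetry sketch with a console exporter; swap in an OTLP exporter to ship spans to LangSmith, Jaeger, or Datadog, and treat the span and attribute names as illustrative:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter keeps the sketch self-contained; use an OTLP exporter in production
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("doc_to_chat")

def answer_question(question: str) -> str:
    with tracer.start_as_current_span("request") as span:
        span.set_attribute("question.length", len(question))
        with tracer.start_as_current_span("retrieval"):
            contexts = ["placeholder context"]       # hybrid retrieval call goes here
        with tracer.start_as_current_span("generation") as gen_span:
            gen_span.set_attribute("context.count", len(contexts))
            return "placeholder answer"              # model call goes here

print(answer_question("How do I rotate my API keys?"))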

Example: doc-to-chat reference flow (signals and gates)

Ingest: connectors → text extraction → normalization → Iceberg write (ACID, snapshots).

Govern: PII scan (Presidio) → redact/mask → catalog registration with ACL policies.

Index: embedding jobs → pgvector (policy-aware joins) and Milvus (high-QPS ANN).

Serve: REST/gRPC → hybrid retrieval → guardrails → LLM → tool use.

HITL: low-confidence paths route to A2I/LangGraph approval steps.

Observe: OTEL traces to LangSmith/APM + scheduled RAG evaluations.

Why is “5% AI, 100% software engineering” accurate in practice?

Most outages and trust failures in agent systems are not model regressions; they’re data quality, permissioning, retrieval decay, or missing telemetry. The controls above—ACID tables, ACL catalogs, PII guardrails, hybrid retrieval, OTEL traces, and human gates—determine whether the same base model is safe, fast, and credibly correct for your users. Invest in these first; swap models later if needed.

References:

https://iceberg.apache.org/docs/1.9.0/evolution/

https://iceberg.apache.org/docs/1.5.2/

https://docs.snowflake.com/en/user-guide/tables-iceberg

https://docs.dremio.com/current/developer/data-formats/apache-iceberg/

https://github.com/pgvector/pgvector

https://www.postgresql.org/about/news/pgvector-070-released-2852/

https://github.com/pgvector/pgvector-go

https://github.com/pgvector/pgvector-rust

https://github.com/pgvector/pgvector-java

https://milvus.io/docs/four_layers.md

https://milvus.io/docs/v2.3.x/architecture_overview.md

https://milvus.io/docs/v2.2.x/architecture.md

https://www.linkedin.com/posts/armand-ruiz_

https://docs.vespa.ai/en/tutorials/hybrid-search.html

https://www.elastic.co/what-is/hybrid-search

https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch

https://docs.cohere.com/reference/rerank

https://docs.cohere.com/docs/rerank

https://cohere.com/rerank

https://opentelemetry.io/docs/concepts/signals/traces/

https://opentelemetry.io/docs/specs/otel/logs/

https://docs.smith.langchain.com/evaluation

https://docs.smith.langchain.com/evaluation/concepts

https://docs.smith.langchain.com/reference/python/evaluation

https://docs.smith.langchain.com/observability

https://www.langchain.com/langsmith

https://arize.com/docs/phoenix

https://github.com/Arize-ai/phoenix

https://langfuse.com/docs/observability/get-started

https://langfuse.com/docs/observability/overview

https://docs.datadoghq.com/opentelemetry/

https://langchain-ai.github.io/langgraph/concepts/human_in_the_loop/

https://langchain-ai.github.io/langgraph/tutorials/get-started/4-human-in-the-loop/

https://docs.langchain.com/oss/python/langgraph/add-human-in-the-loop

https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-use-augmented-ai-a2i-human-review-loops.html

https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-start-human-loop.html

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-a2i-runtime.html

https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-monitor-humanloop-results.html

https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html

https://aws.amazon.com/bedrock/guardrails/

https://docs.aws.amazon.com/bedrock/latest/APIReference/API_CreateGuardrail.html

https://docs.aws.amazon.com/bedrock/latest/userguide/agents-guardrail.html

https://docs.nvidia.com/nemo-guardrails/index.html

https://developer.nvidia.com/nemo-guardrails

https://github.com/NVIDIA/NeMo-Guardrails

https://docs.nvidia.com/nemo/guardrails/latest/user-guides/guardrails-library.html

https://guardrailsai.com/docs/

https://github.com/guardrails-ai/guardrails

https://guardrailsai.com/docs/getting_started/quickstart

https://guardrailsai.com/docs/getting_started/guardrails_server

https://pypi.org/project/guardrails-ai/

https://github.com/guardrails-ai/guardrails_pii

https://huggingface.co/meta-llama/Llama-Guard-4-12B

https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/

https://microsoft.github.io/presidio/

https://github.com/microsoft/presidio

https://github.com/microsoft/presidio-research

https://docs.databricks.com/aws/en/data-governance/unity-catalog/access-control

https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/

https://docs.databricks.com/aws/en/data-governance/unity-catalog/abac/

https://docs.ragas.io/

https://docs.ragas.io/en/stable/references/evaluate/

https://docs.ragas.io/en/latest/tutorials/rag/

https://python.langchain.com/docs/concepts/text_splitters/

https://python.langchain.com/api_reference/text_splitters/index.html

https://pypi.org/project/langchain-text-splitters/

https://milvus.io/docs/evaluation_with_deepeval.md

https://mlflow.org/docs/latest/genai/eval-monitor/

https://mlflow.org/docs/2.10.1/llms/rag/notebooks/mlflow-e2e-evaluation.html

The post Building AI agents is 5% AI and 100% software engineering appeared first on MarkTechPost.