How CLICKFORCE accelerates data-driven advertising with Amazon Bedrock …

CLICKFORCE is one of the leaders in digital advertising services in Taiwan, specializing in data-driven advertising and conversion (D4A – Data for Advertising & Action). With a mission to deliver industry-leading, trend-aligned, and innovative marketing solutions, CLICKFORCE helps brands, agencies, and media partners make smarter advertising decisions.
However, as the advertising industry rapidly evolves, traditional analysis methods and generic AI outputs are no longer sufficient to provide actionable insights. To remain competitive, CLICKFORCE turned to AWS to build Lumos, a next-generation AI-driven marketing analysis solution powered by Amazon Bedrock, Amazon SageMaker AI, Amazon OpenSearch Service, and AWS Glue.
In this post, we demonstrate how CLICKFORCE used AWS services to build Lumos and transform advertising industry analysis from weeks-long manual work into an automated, one-hour process.
Digital advertising challenges
Before adopting Amazon Bedrock, CLICKFORCE faced several roadblocks in building actionable intelligence for digital advertising. Large language models (LLMs) tend to produce generic recommendations rather than actionable industry-specific intelligence. Without an understanding of the advertising environment, these models didn’t have the industry context needed to align their suggestions with actual industry realities.
Another significant challenge was the absence of integrated internal datasets, which weakened the reliability of outputs and increased the risk of hallucinated or inaccurate insights. At the same time, marketing teams relied on disconnected tools and techniques, such as vibe coding, without standardized architectures or workflows, making the processes difficult to maintain and scale.
Preparing a comprehensive industry analysis report was also a time-consuming process, typically requiring between two and six weeks. The timeline stemmed from multiple labor-intensive stages: one to three days to define objectives and set the research plan, one to four weeks to gather and validate data from different sources, one to two weeks to conduct statistical analysis and build charts, one to two weeks to extract strategic insights, and finally three to seven days to draft and finalize the report. Each stage often required back-and-forth coordination across teams, which further extended the timeline. As a result, marketing strategies were frequently delayed and based more on intuition than timely, data-backed insights.
Solution overview
To address these challenges, CLICKFORCE built Lumos, an integrated AI-powered industry analysis service, using AWS services.
The solution is designed around Amazon Bedrock Agents for contextualized reasoning and Amazon SageMaker AI for fine-tuning Text-to-SQL accuracy. CLICKFORCE chose Amazon Bedrock because it provides managed access to foundation models without the need to build or maintain infrastructure, while also offering agents that can orchestrate multi-step tasks and integrate with enterprise data sources through Knowledge Bases. This allowed the team to ground insights in real, verifiable data, minimize hallucinations, and quickly experiment with different models, while also reducing operational overhead and accelerating time-to-market.

The first step was to build a unified AI agent using Amazon Bedrock. End users interact with a chatbot interface that runs on Amazon ECS, developed with Streamlit and fronted by an Application Load Balancer. When a user submits a query, it is routed to an AWS Lambda function that invokes an Amazon Bedrock Agent. The agent retrieves relevant information from an Amazon Bedrock knowledge base, which is built from source documents—such as campaign reports, product descriptions, and industry analysis files—hosted in Amazon S3. These documents are automatically converted into vector embeddings and indexed in Amazon OpenSearch Service. By grounding model responses in this curated document set, CLICKFORCE made sure that outputs were contextualized, hallucinations were reduced, and insights aligned with real-world advertising data.
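The following is a minimal sketch of what the Lambda-to-agent handoff can look like with the Bedrock Agents runtime API. The agent and alias IDs are placeholders, and the event shape (session ID and query passed from the Streamlit front end) is an assumption; the post does not publish CLICKFORCE's actual handler.

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def lambda_handler(event, context):
    # Placeholder IDs; in practice these come from the deployment configuration.
    response = bedrock_agent_runtime.invoke_agent(
        agentId="AGENT_ID",
        agentAliasId="AGENT_ALIAS_ID",
        sessionId=event["session_id"],   # assumed to be forwarded from the chat UI
        inputText=event["query"],
    )

    # invoke_agent returns an event stream; concatenate the text chunks.
    answer = ""
    for item in response["completion"]:
        chunk = item.get("chunk")
        if chunk:
            answer += chunk["bytes"].decode("utf-8")
    return {"answer": answer}
```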
Next, CLICKFORCE made the workflows more action-oriented by using Text-to-SQL requests. When queries required data retrieval, the Bedrock Agent generated JSON schemas via the Agent Actions API Schema. These were passed to Lambda Executor functions that translated requests into Text-to-SQL queries. With AWS Glue crawlers continuously updating SQL databases from CSV files in Amazon S3, analysts were able to run precise queries on campaign performance, audience behaviors, and competitive benchmarks.
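As one possible shape for such an executor, the sketch below runs a generated SQL statement against the Glue Data Catalog through Amazon Athena. The post does not specify the query engine or the action group's exact event format, so the database name, S3 output location, and the assumption that the SQL arrives as a parameter named sql are all illustrative.

```python
import time
import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    # Assumption: the agent's action group passes the generated SQL as a parameter named "sql".
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    sql = params["sql"]

    # Query the tables that AWS Glue crawlers keep updated from the CSV files in Amazon S3.
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "lumos_campaign_data"},            # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},  # hypothetical bucket
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes (simplified; production code should bound retries).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return {"state": state, "rows": rows}
```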
Finally, the company improved accuracy by incorporating Amazon SageMaker and MLflow into the development workflow. Initially, CLICKFORCE relied on foundation models for Text-to-SQL translation but found them to be inflexible and often inaccurate. By using SageMaker, the team processed data, evaluated different approaches, and tuned the overall Text-to-SQL pipeline. Once validated, the optimized pipeline was deployed through AWS Lambda functions and integrated back into the agent, making sure that improvements flowed directly into the Lumos application. With MLflow providing experiment tracking and evaluation, the cycle of data processing, pipeline tuning, and deployment became streamlined, allowing Lumos to achieve higher precision in query generation and deliver automated, data-driven marketing reports.
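A minimal sketch of how MLflow experiment tracking can frame this tuning loop follows; the experiment name, parameters, and the evaluate_pipeline() stub are hypothetical stand-ins for CLICKFORCE's actual evaluation harness.

```python
import mlflow

def evaluate_pipeline():
    # Placeholder: run the candidate Text-to-SQL pipeline against a validation set
    # of question-SQL pairs and return (execution accuracy, median latency in ms).
    return 0.87, 420.0

mlflow.set_experiment("text-to-sql-tuning")  # hypothetical experiment name

with mlflow.start_run(run_name="schema-linking-v2"):
    # Hypothetical knobs describing one candidate pipeline configuration.
    mlflow.log_param("few_shot_examples", 8)
    mlflow.log_param("schema_linking", True)

    accuracy, latency_ms = evaluate_pipeline()

    mlflow.log_metric("execution_accuracy", accuracy)
    mlflow.log_metric("p50_latency_ms", latency_ms)
```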
Results
The impact of adopting Amazon Bedrock Agents and SageMaker AI has been transformative for CLICKFORCE. Industry analysis that previously required two to six weeks can now be completed in under one hour, dramatically accelerating decision-making. The company also reduced its reliance on third-party industry research reports, which resulted in a 47 percent reduction in operational costs.
In addition to time and cost savings, the Lumos system has extended scalability across roles within the marketing environment. Brand owners, agencies, analysts, marketers, and media partners can now independently generate insights without waiting for centralized analyst teams. This autonomy has led to greater agility across campaigns. Moreover, by grounding outputs in both internal datasets and industry-specific context, Lumos significantly reduced the risk of hallucinations and made sure that insights aligned more closely with industry realities.

Users can generate industry analysis reports through natural language conversations and iteratively refine the content by continuing the dialogue.

These visual reports, generated through the Lumos system powered by Amazon Bedrock Agents and SageMaker AI, showcase the platform’s ability to produce comprehensive market intelligence within minutes. The charts illustrate brand sales distribution and retail and e-commerce performance, demonstrating how AI-driven analytics automate data aggregation, visualization, and insight generation with high precision and efficiency.
Conclusion
CLICKFORCE’s Lumos system represents a breakthrough in how digital marketing decisions are made. By combining Amazon Bedrock Agents, Amazon SageMaker AI, Amazon OpenSearch Service, and AWS Glue, CLICKFORCE transformed its industry analysis workflow from weeks-long manual work into a fast, automated, and reliable process that delivers results in under an hour.

About the Authors
Ray Wang is a Senior Solutions Architect at AWS. With 12+ years of experience in backend development and consulting, Ray is dedicated to building modern solutions in the cloud, especially in NoSQL, big data, machine learning, and generative AI. As a hungry go-getter, he passed all 12 AWS certifications to increase the breadth and depth of his technical knowledge. He loves to read and watch sci-fi movies in his spare time.
Shanna Chang is a Solutions Architect at AWS. She focuses on observability in modern architectures and cloud-native monitoring solutions. Before joining AWS, she was a software engineer.

Inworld AI Releases TTS-1.5 For Realtime, Production Grade Voice Agents

Inworld AI has introduced Inworld TTS-1.5, an upgrade to its TTS-1 family that targets realtime voice agents with strict constraints on latency, quality, and cost. TTS-1.5 is described as the top ranked text to speech system on Artificial Analysis and is designed to be more expressive and more stable than prior generations while remaining suitable for large scale consumer deployments.

Realtime latency for interactive agents

TTS-1.5 focuses on P90 time to first audio latency, which is a critical metric for user perceived responsiveness. For TTS-1.5 Max, P90 time to first audio is below 250 ms. For TTS-1.5 Mini, P90 time to first audio is below 130 ms. These values are about 4 times faster than the prior TTS generation according to Inworld.
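P90 here is simply the 90th percentile of time-to-first-audio measurements, meaning 9 out of 10 requests start playing audio at or below that value. A quick way to compute it from your own latency samples (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical time-to-first-audio samples in milliseconds from a load test.
ttfa_ms = np.array([118, 142, 131, 127, 96, 160, 133, 121, 149, 110])

p90 = np.percentile(ttfa_ms, 90)
print(f"P90 time to first audio: {p90:.0f} ms")  # 90% of requests are at or below this value
```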

The TTS-1.5 stack supports streaming over WebSocket so synthesis and playback can start as soon as the first audio chunk is generated. In practice this keeps end to end interaction latency in the same range as typical realtime language model responses when models run on modern GPUs, which is important when TTS is part of a full agent pipeline.

Inworld recommends TTS-1.5 Max for most applications because it balances latency near 200 ms with higher stability and quality. TTS-1.5 Mini is positioned for latency sensitive workloads such as real time gaming or ultra responsive voice agents where every millisecond is important.

Expression, stability and benchmark position

TTS-1.5 builds on TTS-1 and delivers about 30 percent more expressive range and about 40 percent better stability than the earlier models.

Here expression refers to features such as prosody, emphasis, and emotional variation. Stability is measured by metrics such as word error rate and output consistency across long sequences and varied prompts. A lower word error rate reduces issues such as truncated sentences, unintended word substitutions, or artifacts, which is important when TTS output is driven directly from generated language model text.

Pricing and cost profile at consumer scale

TTS-1.5 is priced with two main configurations. Inworld TTS-1.5 Mini costs 5 dollars per 1 million characters, which is about 0.005 dollars per minute of speech. TTS-1.5 Max costs 10 dollars per 1 million characters, which is about 0.01 dollars per minute.
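The per-minute figures follow directly from the per-character prices once you assume a characters-per-minute speech rate. The quick check below assumes roughly 1,000 characters per minute of synthesized speech, which is an approximation; the exact ratio varies with language and speaking rate.

```python
# Back-of-the-envelope cost per minute of synthesized speech.
PRICE_PER_MILLION_CHARS = {"TTS-1.5 Mini": 5.00, "TTS-1.5 Max": 10.00}  # USD
CHARS_PER_MINUTE = 1000  # assumed average speech rate

for model, price in PRICE_PER_MILLION_CHARS.items():
    per_minute = price / 1_000_000 * CHARS_PER_MINUTE
    print(f"{model}: ~${per_minute:.3f} per minute")
# TTS-1.5 Mini: ~$0.005 per minute, TTS-1.5 Max: ~$0.010 per minute
```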

This cost profile makes it feasible to run TTS continuously in high usage products such as voice native companions, education platforms, or customer support lines without TTS becoming the dominant variable cost.

Multilingual support, voice cloning and deployment options

Inworld TTS-1.5 supports 15 languages. The list includes English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. This allows a single TTS pipeline to cover a wide set of markets without separate models per region.

The system provides instant voice cloning and professional voice cloning. Instant voice cloning can create a custom voice from about 15 seconds of audio and is exposed directly in the Inworld portal and through API. Professional voice cloning uses at least 30 minutes of clean audio, with 20 minutes or more recommended for best results, and targets branded voices and less common accents.

For deployment, TTS-1.5 is available as a cloud API and also as an on prem solution, where the full model runs inside the customer infrastructure for data sovereignty and compliance. The same quality profile is maintained across both deployment modes, and the models integrate with partner platforms such as LiveKit, Pipecat, and Vapi for end to end voice agent stacks.

Key Takeaways

Inworld TTS 1.5 delivers realtime performance, with P90 time to first audio under 250 ms for the Max model and under 130 ms for the Mini model, about 4 times faster than the prior generation.

The model increases expressiveness by about 30 percent and improves stability with about 40 percent lower word error rate.

Pricing is optimized for consumer scale: TTS 1.5 Mini costs about 5 dollars per 1 million characters and TTS 1.5 Max costs about 10 dollars per 1 million characters, which is significantly cheaper per minute than many competing systems.

TTS 1.5 supports 15 languages and offers instant and professional voice cloning, enabling custom and branded voices from short reference audio or longer recorded datasets.

The system is available as a cloud API and as an on prem deployment, and integrates with existing voice agent stacks, which makes it suitable for production realtime agents that require explicit guarantees on latency, quality, and data control.

Check out the Technical details.

FlashLabs Researchers Release Chroma 1.0: A 4B Real Time Speech Dialogue Model With Personalized Voice Cloning

Chroma 1.0 is a real time speech to speech dialogue model that takes audio as input and returns audio as output while preserving the speaker identity across multi turn conversations. It is presented as the first open source end to end spoken dialogue system that combines low latency interaction with high fidelity personalized voice cloning from only a few seconds of reference audio.

The model operates directly on discrete speech representations rather than on text transcripts. It targets the same use cases as commercial real time agents, but with a compact 4B parameter dialogue core and a design that treats speaker similarity as a primary objective, not as an auxiliary feature. Chroma achieves a reported 10.96% relative improvement in speaker similarity over a human baseline and reaches a Real Time Factor (RTF) of 0.43, so it can generate speech more than 2 times faster than playback.


From cascaded ASR, LLM and TTS pipelines to end to end S2S

Most production assistants still use a three stage pipeline: automatic speech recognition to convert audio to text, a large language model for reasoning, and text to speech synthesis. This structure is flexible but it introduces latency and loses paralinguistic information such as timbre, emotion, speaking rate and prosody once the system collapses audio to text. In real time dialogue this loss of acoustic detail directly hurts speaker fidelity and naturalness.

Chroma follows the newer class of speech to speech systems that map between sequences of codec tokens. A speech tokenizer and neural codec produce quantized acoustic codes. A language model then reasons and responds over a sequence that interleaves text tokens and audio codes, without an explicit intermediate transcript. This keeps the model conditioned on prosody and speaker identity during the whole processing chain.

Architecture: Reasoner + speech generation stack

Chroma 1.0 has two main subsystems. The Chroma Reasoner handles multimodal understanding and text generation. The speech stack, Chroma Backbone, Chroma Decoder and Chroma Codec Decoder, converts that semantic output into personalized response audio.

The Chroma Reasoner is built on the Thinker module from the Qwen-omni series and uses the Qwen2 Audio encoding pipeline. It processes text and audio inputs with shared front ends, fuses them with cross modal attention, and aligns them over time using Time aligned Multimodal Rotary Position Embedding (TM-RoPE). The output is a sequence of hidden states that carry both linguistic content and acoustic cues, for example rhythm and emphasis.


The Chroma Backbone is a 1B parameter LLaMA style model based on Llama3. It is conditioned on the target voice using CSM-1B, which encodes a short reference audio clip and its transcript into embedding prompts that are prepended to the sequence. During inference, token embeddings and hidden states from the Reasoner are fed as unified context, so the Backbone always sees the semantic state of the dialogue while it generates acoustic codes.

To support streaming, the system uses a fixed 1 to 2 interleaving schedule. For every text token from the Reasoner, the Backbone produces 2 audio code tokens. This allows the model to start emitting speech as soon as text generation begins and avoids waiting for full sentences. This interleaving is the main mechanism behind the low Time to First Token.
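The schedule can be pictured as a simple token stream, as in the toy generator below; the token values are purely illustrative and do not reflect Chroma's real vocabularies.

```python
def interleave(text_tokens, codes_per_text_token=2):
    """Toy illustration of the fixed 1:2 interleaving schedule: for every text
    token produced by the Reasoner, the Backbone emits two audio code tokens,
    so speech can start streaming as soon as text generation begins."""
    code_id = 0
    for text_tok in text_tokens:
        yield ("text", text_tok)
        for _ in range(codes_per_text_token):
            yield ("audio", f"<code_{code_id}>")
            code_id += 1

for kind, tok in interleave(["Sure", ",", " I", " can", " help"]):
    print(kind, tok)
```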

The Chroma Decoder is a lightweight LLaMA variant with about 100M parameters. The Backbone predicts only the first Residual Vector Quantization codebook per frame, which is a coarse representation. The Decoder then takes the Backbone hidden state and the first code and autoregressively predicts the remaining RVQ levels inside the same frame. This factorization keeps long context temporal structure in the Backbone and restricts the Decoder to frame local refinement, which reduces compute and improves detailed prosody and articulation.

The Chroma Codec Decoder concatenates the coarse and refined codes and maps them to waveform samples. It follows the decoder design of the Mimi vocoder and uses a causal convolutional neural network so that each output sample depends only on past context, which is required for streaming. The system uses 8 codebooks, which cuts the number of autoregressive refinement steps for the Decoder while preserving enough detail for voice cloning.

Training setup and synthetic speech to speech (S2S) data

High quality speech dialogue data with strong reasoning signals is scarce. Chroma therefore uses a synthetic speech to speech (S2S) pipeline. A Reasoner-like LLM first produces textual answers for user questions. A text to speech (TTS) system then synthesizes target speech that matches the timbre of the reference audio for those answers. These synthetic pairs train the Backbone and Decoder to perform acoustic modeling and voice cloning. The Reasoner stays frozen and acts as a provider of text embeddings and multimodal hidden states.

Voice cloning quality and comparison with existing systems

Objective evaluation uses the SEED-TTS-EVAL protocol on English CommonVoice speakers. Chroma operates at a 24 kHz sampling rate and achieves a Speaker Similarity score of 0.81. The human baseline is 0.73. CosyVoice-3 reaches 0.72 and most other TTS baselines lie below the human reference. The research team reports this as a 10.96% relative improvement over the human baseline, which indicates that the model captures fine paralinguistic details more consistently than human recordings on this metric.


Subjective evaluation compares Chroma with the ElevenLabs eleven_multilingual_v2 model. In naturalness CMOS, listeners prefer ElevenLabs 57.2% of the time versus 24.4% for Chroma, with 18.3% ties. In speaker similarity CMOS, the scores are very close, 42.4% for ElevenLabs and 40.6% for Chroma, with 17.0% ties. A follow up test asking which audio sounds more natural between ElevenLabs and the original recordings yields 92.0% preference for ElevenLabs versus 8.0% for ground truth, which shows that perceived naturalness and speaker fidelity are not aligned.

Latency and real-time behavior

Latency is measured with one concurrent stream. For a 38.80 second response, the total generation time is 16.58 seconds, which gives a Real Time Factor (RTF) of 0.43. The Reasoner contributes 119.12 ms TTFT, the Backbone 8.48 ms and the Decoder 19.27 ms per frame on average. The Codec Decoder works on groups of 4 frames so TTFT does not apply to that component. The overall Time to First Token is 146.87 ms, which is well under one second and suitable for interactive dialogue.


Spoken dialogue and reasoning benchmarks

Chroma is evaluated on the basic track of URO Bench. It uses only 4B parameters yet achieves an overall task accomplishment score of 57.44%. GLM-4 Voice, a 9B parameter model, leads with 69.09%. Chroma ranks second overall and outperforms several 7B and 0.5B omni baselines on many dimensions. It reaches 71.14% on Storal, 51.69% on TruthfulQA and 22.74% on GSM8K. For oral conversation metrics it attains the highest scores on MLC at 60.26% and on CommonVoice at 62.07%.


Critically, Chroma is the only model in this comparison that supports personalized voice cloning. All other systems focus on spoken dialogue and reasoning only. This means Chroma provides competitive cognitive capability while also performing high fidelity voice personalization in real time.

Key Takeaways

End to end real time speech to speech: Chroma 1.0 is a 4B parameter spoken dialogue model that maps speech to speech directly using codec tokens, it avoids explicit ASR and TTS stages and preserves prosody and speaker identity through the whole pipeline.

Reasoner plus speech stack architecture: The system combines a Qwen-based Chroma Reasoner with a 1B LLaMA style Backbone, a 100M Chroma Decoder and a Mimi based Codec Decoder, it uses RVQ codebooks and an interleaved 1 to 2 text to audio token schedule to support streaming and low Time to First Token.

Strong personalized voice cloning: On SEED-TTS-EVAL with CommonVoice speakers, Chroma reaches a Speaker Similarity score of 0.81 at 24 kHz, this is reported as a 10.96 percent relative improvement over the human baseline of 0.73 and outperforms CosyVoice 3 and other TTS baselines.

Sub second latency and faster than real time generation: Single stream inference on an H200 GPU yields an overall Time to First Token of about 147 ms, for a 38.80 second response the model generates audio in 16.58 seconds, resulting in a Real Time Factor of 0.43 which is more than 2 times faster than playback.

Competitive dialogue and reasoning with cloning as a unique feature: On URO Bench basic track, Chroma attains 57.44 percent overall task accomplishment and competitive scores on Storal, TruthfulQA, GSM8K, MLC and CommonVoice.

Check out the Paper, Model Weights, Project and Playground.

Salesforce AI Introduces FOFPred: A Language-Driven Future Optical Flow Prediction Framework that Enables Improved Robot Control and Video Generation

The Salesforce AI research team presents FOFPred, a language driven future optical flow prediction framework that connects large vision language models with diffusion transformers for dense motion forecasting in control and video generation settings. FOFPred takes one or more images and a natural language instruction such as ‘moving the bottle from right to left’ and predicts 4 future optical flow frames that describe how every pixel is expected to move over time.


Future optical flow as a motion representation

Optical flow is the apparent per pixel displacement between two frames. FOFPred focuses on future optical flow, which means predicting dense displacement fields for future frames given only current observations and text, without access to future images at inference.

Future optical flow is a compact motion only representation. It removes static appearance and keeps only pixel level motion, so it is well suited as an intermediate state for robot control policies and as a conditioning signal for video diffusion models. Compared to predicting future RGB frames, it reduces the complexity of the output distribution and avoids modeling textures and high frequency details that are not required for motion planning.

To plug into existing latent diffusion infrastructure, the research team encodes optical flow as RGB images: flow magnitude and direction are mapped from polar form into HSV channels, then converted to RGB. The scaling of each channel is tuned so that consecutive flow frames are visually smooth and resemble animated graphics. A standard Flux.1 variational autoencoder then encodes and decodes these flow images.
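The mapping itself is the standard optical flow visualization: direction becomes hue, magnitude becomes value. A minimal version with OpenCV looks like the following; FOFPred additionally tunes the channel scaling, which this generic sketch does not reproduce.

```python
import numpy as np
import cv2

def flow_to_rgb(flow):
    """Encode a dense optical flow field of shape (H, W, 2) as an RGB image,
    mapping direction to hue and magnitude to value."""
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (angle * 180 / np.pi / 2).astype(np.uint8)   # hue from direction (0-180 in OpenCV)
    hsv[..., 1] = 255                                          # full saturation
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

# Example: synthetic flow whose horizontal motion grows from left to right.
flow = np.zeros((64, 64, 2), np.float32)
flow[..., 0] = np.linspace(0.0, 5.0, 64)[None, :]
rgb = flow_to_rgb(flow)  # this image can now be fed to an image VAE such as Flux.1's
```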

Unified VLM Diffusion backbone

FOFPred uses a unified architecture that combines a frozen vision language model, a frozen VAE and a trainable diffusion transformer. The pipeline is:

Qwen2.5-VL is used as the vision language encoder to jointly encode the caption and visual inputs.

Flux.1 VAE encodes the input images and the training optical flow targets into latent tensors.

An OmniGen style diffusion transformer, DiT, takes projected visual and textual features as conditional inputs and generates latent future flow sequences.

Only the DiT and small MLP projectors are trained. The Qwen2.5-VL and Flux.1 weights stay frozen, which lets the model reuse image editing pretraining and multimodal reasoning ability from prior work. Temporal modeling is added by extending the RoPE positional encoding and attention blocks from two dimensional spatial positions to full spatio-temporal positions across input and output frame sequences. This gives full spatio-temporal attention without adding extra parameters, so the DiT can reuse OmniGen image pretraining directly.


Training on noisy web videos with relative optical flow

The core model is trained on web scale human activity videos with paired captions. The research team uses the Something Something V2 dataset and the EgoDex egocentric manipulation dataset to obtain around 500,000 video caption pairs.

Training uses an end to end flow matching objective in latent space. Future optical flow sequences are first computed offline, then encoded by the VAE and used as targets in a flow matching diffusion loss for the DiT. During training the method also applies classifier free guidance on both text and visual conditions and masks some frames and viewpoints to improve robustness.
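For reference, a generic conditional flow matching loss over the flow latents can be written as the short sketch below. This is the textbook rectified-flow style formulation, assuming a callable dit(z_t, t, cond) that predicts a velocity field; it is not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, z_target, cond, sigma_min=1e-4):
    """z_target: VAE latents of the future optical flow frames,
    cond: projected text and visual features from the frozen encoders."""
    b = z_target.shape[0]
    # One random time step per sample, broadcast over the remaining dimensions.
    t = torch.rand(b, device=z_target.device).view(b, *([1] * (z_target.dim() - 1)))
    noise = torch.randn_like(z_target)
    z_t = (1 - (1 - sigma_min) * t) * noise + t * z_target   # interpolate noise -> data
    v_target = z_target - (1 - sigma_min) * noise            # target velocity
    v_pred = dit(z_t, t.flatten(), cond)                     # DiT predicts the velocity
    return F.mse_loss(v_pred, v_target)
```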

A critical contribution is the relative optical flow calculation used to build clean training targets from noisy egocentric videos. For each frame pair the method:

Computes dense optical flow with an off the shelf estimator.

Estimates camera motion via homography using deep features.

Uses projective geometry to subtract camera motion and obtain object centric relative flow vectors.

Filters frame pairs by selecting those where the top k percent flow magnitudes exceed a threshold, which focuses training on segments with meaningful motion.

These steps are run offline at lower resolution for efficiency, then recomputed at original resolution for the final targets. The ablation study shows that static frame targets or raw flow without camera motion removal harm downstream performance, while disentangled relative flow targets give the best results.
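The geometry of the camera motion subtraction can be sketched with classical OpenCV components, standing in for the deep feature matcher and off the shelf flow estimator the paper actually uses.

```python
import numpy as np
import cv2

def relative_flow(prev_gray, next_gray):
    """Illustrative relative optical flow: estimate dense flow, estimate the
    camera-induced motion with a homography, and subtract it so only
    object-centric motion remains."""
    # 1. Dense optical flow between consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # 2. Sparse matches for homography estimation (a stand-in for deep features).
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, 500, 0.01, 8)
    pts_next, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts_prev, None)
    good = status.ravel() == 1
    H, _ = cv2.findHomography(pts_prev[good], pts_next[good], cv2.RANSAC, 3.0)
    # 3. Camera-induced flow: where each pixel would move under the homography alone.
    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(np.float32)
    warped = cv2.perspectiveTransform(grid, H).reshape(h, w, 2)
    camera_flow = warped - np.stack([xs, ys], axis=-1)
    # 4. Object-centric relative flow (frame pairs with little residual motion
    #    would then be filtered out, as described above).
    return flow - camera_flow
```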


Language driven robot manipulation

The first downstream use case is robot control. FOFPred is finetuned on robot video caption data to predict future optical flow from both fixed and wrist mounted cameras. On top of FOFPred, the research team attaches a diffusion policy network that takes predicted flow, text and robot state, and outputs continuous actions. This setup follows prior diffusion policy work but uses future optical flow instead of predicted RGB frames as the core representation.

On the CALVIN ABCD benchmark, which evaluates long horizon zero shot chains of 5 language specified manipulation tasks, FOFPred reaches an average chain length of 4.48. VPP reaches 4.33 and DreamVLA reaches 4.44 under the same protocol. FOFPred also attains a Task 5 success rate of 78.7 percent, which is the best among reported methods. In a low data setting with 10 percent of CALVIN demonstrations, FOFPred still reaches 3.43 average length, higher than the 3.25 of VPP.

On RoboTwin 2.0, a dual arm manipulation benchmark with 5 tasks that require both arms, FOFPred attains an average success rate of 68.6 percent. The VPP baseline reaches 61.8 percent under identical training settings. FOFPred improves success on every task in the subset.


Motion aware text to video generation

The second downstream task is motion control in text to video generation. The research team builds a two stage pipeline by connecting FOFPred with the Go with the Flow video diffusion model. FOFPred takes an initial frame and a language description of motion, predicts a sequence of future flow frames, and interpolates them into a dense motion field. Go with the Flow then uses this motion field and the initial frame to synthesize the final video, enforcing the described motion pattern.

On the motion heavy Something Something V2 benchmark, the combined FOFPred and Go with the Flow pipeline improves over the CogVideoX baseline under identical conditions. The method reaches SSIM 68.4, PSNR 22.26, LPIPS 28.5, FVD 75.39, KVD 11.38, and motion fidelity 0.662, which are consistently better than CogVideoX. Importantly, FOFPred only uses language and a single frame at inference, while several controllable video baselines require hand or object masks or trajectories as extra inputs.


Key Takeaways

FOFPred reframes motion prediction as language driven future optical flow, predicting 4 dense optical flow frames from one or more current images and a text instruction, which provides a compact motion only representation for downstream tasks.

The model uses a unified VLM Diffusion backbone, with Qwen2.5-VL as a frozen vision language encoder, Flux.1-VAE as a frozen latent encoder for images and flow, and an OmniGen style DiT as the only trained component with spatio temporal RoPE based attention.

Training relies on large scale web and egocentric video from Something Something-V2 and EgoDex, and builds relative optical flow targets by estimating ego-motion via homography, subtracting camera flow and filtering for high motion segments, which significantly improves downstream performance.

In robot manipulation, FOFPred acts as a motion backbone for a diffusion policy head and achieves state of the art or better results on CALVIN ABCD and RoboTwin 2.0, including 4.48 average task chain length on CALVIN and 68.6 percent average success on RoboTwin, outperforming VPP and DreamVLA variants.

For text to video generation, connecting FOFPred to Go with the Flow yields better SSv2 metrics than CogVideoX, with higher SSIM and PSNR, lower FVD and KVD, and improved motion fidelity, while requiring only language and a single frame at inference, making FOFPred a reusable motion controller for both robotics and video synthesis pipelines.

Check out the Paper, Model and Repo.

How Thomson Reuters built an Agentic Platform Engineering Hub with Ama …

This post was co-written with Naveen Pollamreddi and Seth Krause from Thomson Reuters.
Thomson Reuters (TR) is a leading AI and technology company dedicated to delivering trusted content and workflow automation solutions. With over 150 years of expertise, TR provides essential solutions across legal, tax, accounting, risk, trade, and media sectors in a fast-evolving world. AI plays a critical role at TR. It’s embedded in how it helps create, enhance, connect, and deliver trusted information to customers. It powers the products used by professionals around the world. AI at TR empowers professionals with professional-grade AI that clarifies complex challenges.
This blog post explains how TR’s Platform Engineering team, a geographically distributed unit overseeing TR’s service availability, boosted its operational productivity by transitioning from manual to an automated agentic system using Amazon Bedrock AgentCore.
Business challenge
Platform engineering teams face significant challenges in providing seamless, self-service experiences to their internal customers at scale for operational activities such as database management, information security and risk management (ISRM) operations, landing zone maintenance, infrastructure provisioning, secrets management, continuous integration and deployment (CI/CD) pipeline orchestration, and compliance automation. At TR, the Platform Engineering team supports multiple lines of business by providing essential cloud infrastructure and enablement services, including cloud account provisioning and database management. However, manual processes and the need for repeated coordination between teams for operational tasks created delays that slowed down innovation.
“Our engineers were spending considerable time answering the same questions and executing identical processes across different teams,” says Naveen Pollamreddi, Distinguished Engineer at TR. “We needed a way to automate these interactions while maintaining our security and compliance standards.”
Current state
The Platform Engineering team offers services to multiple product teams within TR, including Product Engineering and Service Management. These teams consume its internal home-grown solutions as a service to build and run applications at scale on AWS services. Over time, these services have come to be offered not only as tools but also through TR's internal processes, following Information Technology Infrastructure Library (ITIL) standards and using third-party software as a service (SaaS) systems.
Some of these services rely on humans to execute a predefined list of steps, repeated many times, creating a significant dependency on engineers to run the same tasks again and again for multiple applications. Current processes are semi-automated and are:

Repetitive and labor intensive – Because of the nature of the workflows and multi-team engagement model, these operational processes tend to be labor intensive and repetitive. The Platform Engineering team spent a lot of time doing work that is undifferentiated heavy lifting.
Longer time to value – Because of process interdependencies, these operational workflows aren’t fully autonomous and take a long time to realize the value compared to fully automated processes.
Resource and cost intensive – Manual execution requires dedicated engineering resources whose time could be better spent on innovation rather than repetitive tasks. Each operational request consumes engineer hours across multiple teams for coordination, execution, and validation.

The Platform Engineering team is solving this problem by building autonomous agentic solutions that use specialized agents across multiple service domains and groups. The cloud account provisioning agent automates the creation and configuration of new cloud accounts according to internal standards, handling tasks such as setting up organizational units, applying security policies, and configuring baseline networking. The database patching agent manages the end-to-end database patching lifecycle, including version upgrades. Network service agents handle network configuration requests such as VPC setup, subnet allocation, and connectivity establishment between environments. Architecture review agents assist in evaluating proposed architectures against best practices, security requirements, and compliance standards, providing automated feedback and recommendations. AgentCore serves as the foundational orchestration layer for these agents, providing the core agentic capabilities that enable intelligent decision-making, natural language understanding, tool calling, and agent-to-agent (A2A) communication.
Solution overview
TR’s Platform Engineering team built this solution with scalability, extensibility, and security as core principles and designed it so that non-technical users can quickly create and deploy AI-powered automation. Aimed at a broad enterprise audience, the architecture lets business users interact with specialized agents through basic natural language requests without needing to understand the underlying technical complexity. TR chose Amazon Bedrock AgentCore because it provides the complete foundational infrastructure needed to build, deploy, and operate enterprise-grade AI agents at scale without having to build that infrastructure from scratch. The Platform Engineering team gained the flexibility to innovate with their preferred frameworks while making sure their autonomous agents operate with enterprise-level security, reliability, and scalability—critical requirements for managing production operational workflows at scale.
The following diagram illustrates the architecture of the solution:

TR built an AI-powered platform engineering hub using AgentCore. The solution consists of:

A custom web portal for more secure agent interactions
A central orchestrator agent that routes requests and manages interactions
Multiple service-specific agents handling specialized tasks such as AWS account provisioning and database patching
A human-in-the-loop validation service for sensitive operations

TR decided to use AgentCore because it helped their developers to accelerate from prototype to production with fully managed services that minimize infrastructure complexity and build AI agents using different frameworks, models, and tools while maintaining complete control over how agents operate and integrate with their existing systems.
Solution workflow
The team used the following workflow to develop and deploy the agentic AI system.

Discovery and architecture planning: Evaluated existing AWS resources and code base to design a comprehensive solution incorporating AgentCore, focusing on service objectives and integration requirements.
Core development and migration: Developed a dual-track approach by migrating existing solutions to AgentCore while building TRACK (deployment engine), enabling rapid agent creation. Implemented a registry system as a modular bridge between the agent and the orchestrator.
System enhancement and deployment: Refined orchestrator functionality, developed an intuitive UX, and executed a team onboarding process for the new agentic system deployment.

Building the orchestrator agent
TR’s Platform Engineering team designed their orchestrator service, named Aether, as a modular system using the LangGraph Framework. The orchestrator retrieves context from their agent registry to determine the appropriate agent for each situation. When an agent’s actions are required, the orchestrator makes a tool call that programmatically populates data from the registry, helping prevent potential prompt injection attacks and facilitating more secure communication between endpoints.
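A stripped-down sketch of this routing pattern in LangGraph is shown below. The state fields and the lookup_registry() and invoke_service_agent() helpers are hypothetical placeholders; Aether's real implementation is not published in this post.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class OrchestratorState(TypedDict):
    user_request: str
    agent_card: dict
    response: str

def lookup_registry(request: str) -> dict:
    # Placeholder: query the A2A registry for the best-matching agent card;
    # here we return a canned entry for illustration.
    return {"agent_id": "database-patching-agent", "endpoint": "https://example.internal/a2a"}

def invoke_service_agent(card: dict, request: str) -> str:
    # Placeholder: call the selected service agent's endpoint with arguments
    # populated programmatically from the registry entry, not free-form model output.
    return f"Routed '{request}' to {card['agent_id']}"

def route(state: OrchestratorState) -> dict:
    return {"agent_card": lookup_registry(state["user_request"])}

def dispatch(state: OrchestratorState) -> dict:
    return {"response": invoke_service_agent(state["agent_card"], state["user_request"])}

builder = StateGraph(OrchestratorState)
builder.add_node("route", route)
builder.add_node("dispatch", dispatch)
builder.set_entry_point("route")
builder.add_edge("route", "dispatch")
builder.add_edge("dispatch", END)
orchestrator = builder.compile()
```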
To maintain conversation context while keeping the system stateless, the orchestrator integrates with the AgentCore Memory service capabilities at both conversation and user levels. Short-term memory maintains context within individual conversations, while long-term memory tracks user preferences and interaction patterns over time. This dual-memory approach allows the system to learn from past interactions and avoid repeating previous mistakes.
Service Agent Development Framework
The Platform Engineering team developed their own framework, TR-AgentCore-Kit (TRACK), to simplify agent deployment across the organization. TRACK, which is a homegrown solution, utilizes a customized version of the Bedrock AgentCore Starter Toolkit. The team customized this toolkit to meet TR’s specific compliance alignment requirements, which include asset identification standards and resource tagging standards. The framework handles connection to AgentCore Runtime, tool management, AgentCore Gateway connectivity, and baseline agent setup, so developers can focus on implementing business logic rather than dealing with infrastructure concerns. AgentCore Gateway provided a straightforward and more secure way for developers to build, deploy, discover, and connect to tools at scale. TRACK also handles the registration of service agents into the Aether environment by deploying agent cards into the custom-built A2A registry. TRACK maintains a seamless flow for developers by offering deployment to AWS and registration with the custom-built services in one package. Because TRACK deploys the agent cards into the registry, an agent built by a service team can be fully onboarded and made available from the overarching orchestrator.
Agent discovery and registration system
To enable seamless agent discovery and communication, TR implemented a custom A2A solution using Amazon DynamoDB and Amazon API Gateway. This system supports cross-account agent calls, which was essential for their modular architecture. The registration process occurs through the TRACK project, so that teams can register their agents directly with the orchestrator service. The A2A registry maintains a comprehensive history of agent versions for auditing purposes and requires human validation before allowing new agents into the production environment. This governance model facilitates conformance with TR’s ISRM standards while providing flexibility for future expansion.
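As an illustration of what registering an agent card might look like, the snippet below writes a card to a DynamoDB table. The table name, attribute names, and the pending-validation status are assumptions made for the sketch; TR's actual registry schema is not described in this post.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
registry = dynamodb.Table("a2a-agent-registry")  # hypothetical table name

agent_card = {
    "agent_id": "database-patching-agent",      # partition key (assumed)
    "version": "1.0.0",                          # retained for audit history
    "description": "Manages the end-to-end database patching lifecycle",
    "runtime_endpoint": "arn:aws:bedrock-agentcore:us-east-1:111122223333:runtime/EXAMPLE",  # placeholder
    "owner_team": "platform-engineering-databases",
    "status": "pending_validation",              # human approval required before production use
}

registry.put_item(Item=agent_card)
```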
Aether web portal integration
The team developed a web portal using React, hosted on Amazon Simple Storage Service (Amazon S3), to provide a more secure and intuitive interface for agent interactions. The portal authenticates users against TR’s enterprise single sign-on (SSO) and provides access to agent flows based on user permissions. This approach helps ensure that sensitive operations, such as AWS account provisioning or database patching, are only accessible to authorized personnel.
Human-in-the-loop validation service
The system includes Aether Greenlight, a validation service that makes sure critical operations receive appropriate human oversight. This service extends beyond basic requester approval, so that team members outside the initial conversation can participate in the validation process. The system maintains a complete audit trail of approvals and actions, supporting TR’s compliance requirements.
Outcome
By building a self-service agentic system on AgentCore, TR implemented autonomous agents that use AI orchestration to handle complex operational workflows end-to-end.
Productivity and efficiency

15-fold productivity gain through intelligent automation of routine tasks
70% automation rate achieved at first launch, dramatically reducing manual workload
Continuous reliability with repeatable runbooks executed by agents around the clock

Speed and agility

Faster time to value: Accelerated product delivery by automating environment setup, policy enforcement, and day-to-day operations
Self-service workflows: Empowered teams with clear standards and paved-road tooling

Security and compliance

Stronger security posture: Applied guardrails and database patching by default
Human-in-the-loop approvals: Maintained oversight while automating verification of changes

Cost and resource optimization

Better cost efficiency: Automated infrastructure usage optimization
Strategic talent allocation: Freed engineering teams to focus on highest-priority, high-value work
Reduced operational toil: Removed repetitive tasks and variance through standardization

Developer experience

Improved satisfaction: Streamlined workflows with intuitive self-service capabilities
Consistent standards: Established repeatable patterns for other teams to adopt and scale

Conclusion
The agentic system described in this post establishes a replicable pattern that teams across the organization can use to adopt similar automation capabilities, creating a multiplier effect for operational excellence. The Aether project aims to enhance the experience of engineers by removing the need for manual execution of tasks that can be automated, freeing time for further innovation and creative thinking. As Aether continues to improve, the team hopes the pattern will be adopted more broadly, assisting teams beyond Platform Engineering and raising productivity organization-wide, solidifying TR as a front-runner in the age of artificial intelligence.
Using Amazon Bedrock AgentCore, TR transformed their platform engineering operations from manual processes to an AI-powered self-service hub. This approach not only improved efficiency but also strengthened security and compliance controls.
Ready to transform your platform engineering operations:

Explore AgentCore
Explore AgentCore documentation
For additional use cases, explore notebook-based tutorials

About the Authors
Naveen Pollamreddi is a Distinguished Engineer in Thomson Reuters as part of the Platform Engineering team and drives the Agentic AI strategy for Cloud Infrastructure services.
Seth Krause is a Cloud Engineer on Thomson Reuters’ Platform Engineering Compute team. Since joining the company, he has contributed to architecting and implementing generative AI solutions that enhance productivity across the organization. Seth specializes in building cloud-based microservices with a current focus on integrating AI capabilities into enterprise workflows.
Pratip Bagchi is an Enterprise Solutions Architect at Amazon Web Services. He is passionate about helping customers to drive AI adoption and innovation to unlock business value and enterprise transformation.
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Build agents to learn from experiences using Amazon Bedrock AgentCore …

Today, most agents operate only on what’s visible in the current interaction: they can access facts and knowledge, but they can’t remember how they solved similar problems before or why certain approaches worked or failed. This creates a significant gap in their ability to learn and improve over time. Amazon Bedrock AgentCore episodic memory addresses this limitation by capturing and surfacing experience-level knowledge for AI agents. Although semantic memory helps an agent remember what it knows, episodic memory documents how it arrived there: the goal, reasoning steps, actions, outcomes, and reflections. By converting each interaction into a structured episode, you can enable agents to recall knowledge and interpret and apply prior reasoning. This helps agents adapt across sessions, avoid repeating mistakes, and evolve their planning over time.
Amazon Bedrock AgentCore Memory is a fully managed service that helps developers create context-aware AI agents through both short-term memory and long-term intelligent memory capabilities. To learn more, see Amazon Bedrock AgentCore Memory: Building context-aware agents and Building smarter AI agents: AgentCore long-term memory deep dive.
In this post, we walk you through the complete architecture to structure and store episodes, discuss the reflection module, and share compelling benchmarks that demonstrate significant improvements in agent task success rates.
Key challenges in designing agent episodic memory
Episodic memory enables agents to retain and reason over their own experiences. However, designing such a system requires solving several key challenges to make sure experiences remain coherent, evaluable, and reusable:

Maintaining temporal and causal coherence – Episodes need to preserve the order and cause-effect flow of reasoning steps, actions, and outcomes so the agent can understand how its decisions evolved.
Detecting and segmenting multiple goals – Sessions often involve overlapping or shifting goals. The episodic memory must identify and separate them to avoid mixing unrelated reasoning traces.
Learning from experience – Each episode should be evaluated for success or failure. Reflection should then compare similar past episodes to identify generalizable patterns and principles, enabling the agent to adapt those insights to new goals rather than replaying prior trajectories.

In the next section, we describe how to build an AgentCore episodic memory strategy, covering its extraction, storage, retrieval, and reflection pipeline and how these components work together to help transform experience into adaptive intelligence.
How AgentCore episodic memory works
When your agentic application sends conversational events to AgentCore Memory, raw interactions get transformed into rich episodic memory records through an intelligent extraction and reflection process. The following diagram illustrates how this episodic memory strategy works and how simple agent conversations become meaningful, reflective memories that shape future interactions.

The following diagram illustrates the data flow of the same architecture in more detail.

The preceding diagrams illustrate the different steps in the episodic memory strategy. The first two steps (marked in pink and purple) form the two-stage episode extraction module, whose stages serve distinct but complementary purposes. The third step (marked in blue) is the reflection module, which helps the agent learn from past experience. In the following sections, we discuss these steps in detail.
Episode extraction module
The episode extraction module is the foundational step in the episodic strategy that transforms raw user-agent interaction data into structured, meaningful episodes. We follow a two-stage approach where the stages are designed to capture both granular step-wise mechanics of each interaction (called turn extraction) and broader episode-wise knowledge to create coherent narratives (called episode extraction). To make an analogy, think of it in terms of taking notes during a meeting (turn level) and writing the meeting summary at the end of the meeting (episode). Both stages are valuable but serve different purposes when learning from experience.
In the first stage of episode extraction, the system performs turn-level processing to understand what went right or wrong. Here, single exchange units between the user and the agent called conversational turns are identified, segmented, and transformed into structured summaries in the following dimensions:

Turn situation – A brief description of the circumstances and context that the assistant is responding to in this turn. This includes the immediate context, the user’s overarching objectives that might span multiple turns, and the relevant history from previous interactions that informed the current exchange.
Turn intent – The assistant’s specific purpose and primary goal for this turn, essentially answering the question “What was the assistant trying to accomplish in this moment?”
Turn action – A detailed record of the concrete steps taken during the interaction, documenting which specific tools were used, what input arguments or parameters were provided to each tool, and how the assistant translated intent into executable actions.
Turn thought – The reasoning behind the assistant’s decisions, explaining the “why” behind tool selection and approach.
Turn assessment – An honest evaluation of whether the assistant successfully achieved its stated goal for this specific turn, providing immediate feedback on the effectiveness of the chosen approach and actions taken.
Goal assessment – A broader perspective on whether the user’s overall objective across the entire conversation appears to be satisfied or progressing toward completion, looking beyond individual turns to evaluate holistic success.

After processing and structuring individual turns, the system proceeds to the episode extraction stage, when a user completes their goal (detected by the large language model) or an interaction ends. This helps capture the complete user journey, because a user’s goal often spans multiple turns and individual turn data alone can’t convey whether the overall objective was achieved or what the holistic strategy looked like. In this stage, sequentially related turns are synthesized into coherent episodic memories that capture complete user journeys, from initial request to final resolution:

Episode situation – The broader circumstances that initiated the user’s need for assistance
Episode intent – A clear articulation of what the user ultimately wanted to accomplish
Success evaluation – A definitive assessment of whether the conversation achieved its intended purpose for each episode
Evaluation justification – Concrete reasoning for success or failure assessments, grounded in specific conversational moments that demonstrate progress toward or away from user goals
Episode insights – Insights capturing proven effective approaches and identifying pitfalls to avoid for the current episode
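To make the structure concrete, a single extracted episode can be pictured as a record like the following. The field names and values are illustrative only and do not represent the exact storage schema used by AgentCore Memory.

```python
episode_record = {
    "episode_situation": "Customer received a damaged item from a recent order and wants it replaced.",
    "episode_intent": "Get a replacement shipped for the damaged item.",
    "success_evaluation": "SUCCESS",
    "evaluation_justification": (
        "The agent located the order, created a replacement shipment, "
        "and the user confirmed the resolution in the final turn."
    ),
    "episode_insights": [
        "Verify the order ID before calling the replacement tool.",
        "Offer a refund only if replacement stock is unavailable.",
    ],
}
```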

Reflection module
The reflection module highlights the ability of Amazon Bedrock AgentCore episodic memory to learn from past experiences and generate insights that help improve future performance. This is where individual episode learnings evolve into generalizable knowledge that can guide agents across diverse scenarios.
The reflection module operates through cross-episodic reflection, retrieving past similar successful episodes based on user intent and reflecting across multiple episodes to achieve more generalizable insights. When new episodes are processed, the system performs the following actions:

Using the user intent as a semantic key, the system identifies historically successful and relevant episodes from the vector store that share similar goals, contexts, or problem domains.
The system analyzes patterns across the main episode and relevant episodes, looking for transferable insights about what approaches work consistently across different contexts.
Existing reflection knowledge is reviewed and either enhanced with new insights or expanded with entirely new patterns discovered through cross-episodic analysis.

At the end of the process, each reflection memory record contains the following information:

Use case – When and where the insight applies, including relevant user goals and trigger conditions
Hints (insights) – Actionable guidance covering tool selection strategies, effective approaches, and pitfalls to avoid
Confidence scoring – A score (0.1–1.0) indicating how well the insight generalizes across different scenarios

Episodes provide agents with concrete examples of how similar problems were solved before. These case studies show the specific tools used, reasoning applied, and outcomes achieved, including both successes and failures. This creates a learning framework where agents can follow proven strategies and avoid documented mistakes.
Reflection memories extract patterns from multiple episodes to deliver strategic insights. Instead of individual cases, they reveal which tools work best, what decision-making approaches succeed, and which factors drive outcomes. These distilled principles give agents higher-level guidance for navigating complex scenarios.
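An illustrative reflection record, mirroring the use case, hints, and confidence fields described above (again, not the service's literal schema), might look like this:

```python
reflection_record = {
    "use_case": "User wants to modify or cancel an existing order before it ships.",
    "hints": [
        "Look up the order status first; cancellation tools fail on shipped orders.",
        "Confirm the exact items to change before issuing the modification call.",
        "Avoid combining cancellation and refund actions in a single tool call.",
    ],
    "confidence": 0.8,  # 0.1-1.0: how well the insight generalizes across scenarios
}
```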
Custom override configurations
Although built-in memory strategies cover the common use cases, many domains require tailored approaches for memory processing. The system supports built-in strategy overrides through custom prompts that extend the built-in logic, helping teams adapt memory handling to their specific requirements. You can implement the following custom override configurations:

Custom prompts – These prompts focus on criteria and logic rather than output formats and help developers define the following:

Extraction criteria – What information gets extracted or filtered out.
Consolidation rules – How related memories should be consolidated.
Conflict resolution – How to handle contradictory information.
Insight generation – How cross-episode reflections are synthesized.

Custom model – AgentCore Memory supports custom model selection for memory extraction, consolidation, and reflection operations. This flexibility helps developers balance accuracy and latency based on their specific requirements. You can define these models using APIs when you create the memory resource as a strategy override or through the Amazon Bedrock AgentCore console (as shown in the following screenshot).
Namespaces – Namespaces provide a hierarchical organization for episodes and reflections, enabling access to your agent’s experiences at different levels of granularity and providing a natural logical grouping. For instance, to design a namespace for a travel application, episodes could be stored under travel_booking/users/userABC/episodes and reflections could reside at travel_booking/users/userABC. Note that the namespace for episodes must be a sub-path of the namespace for reflections.

Performance evaluation
We evaluated Amazon Bedrock AgentCore episodic memory on real-world goal completion benchmarks from the retail and airline domains (sampled from τ2-bench). These benchmarks contain tasks that mirror actual customer service scenarios where agents need to help users achieve specific goals.
We compared three different setups in our experiments:

For the baseline, we ran the agent (built with Anthropic’s Claude 3.7) without interacting with the memory component.
For memory-augmented agents, we explored two methods of using memories:

In-context learning examples – The first method uses extracted episodes as in-context learning examples. Specifically, we constructed a tool named retrieve_exemplars (tool definition in the appendix) that agents can use by issuing a query (for example, "how to get refund?") to get step-by-step instructions from the episodes repository. When agents face similar problems, the retrieved episodes are added to the context to guide the agent's next action.
Reflection-as-guidance – The second method we explored is reflection-as-guidance. Specifically, we constructed a tool named retrieve_reflections (tool definition in the appendix) that agents can use to access broader insights from past experiences. Similar to retrieve_exemplars, the agent can generate a query to retrieve reflections as context, gaining insights that inform decisions about strategy and approach rather than specific step-by-step actions.

We used the following evaluation methodology:

The baseline agent first processes a set of historical customer interactions, which become the source for memory extraction.
The agent then receives new user queries from τ2-bench.
Each query is attempted four times in parallel.
For evaluation, pass rate metrics are measured across these four attempts. Pass^k is the percentage of tasks where the agent succeeded in at least k of the four attempts (a small helper that computes this metric follows the list):

Pass^1: Succeeded at least once (measures capability)
Pass^2: Succeeded at least twice (measures reliability)
Pass^3: Succeeded at least three times (measures consistency)
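The following is a minimal sketch of how Pass^k can be computed from per-task attempt outcomes; it assumes each task's result is recorded as a list of booleans, one per attempt.

# Compute Pass^k: the percentage of tasks where the agent succeeded in at
# least k of its attempts (four attempts per task in this evaluation).
def pass_at_least_k(task_attempts: list[list[bool]], k: int) -> float:
    passed = sum(1 for attempts in task_attempts if sum(attempts) >= k)
    return 100.0 * passed / len(task_attempts)

# Example: three tasks, four attempts each.
results = [
    [True, True, False, True],    # task 1: 3 successes
    [False, False, True, False],  # task 2: 1 success
    [True, True, True, True],     # task 3: 4 successes
]
print(pass_at_least_k(results, 1))  # 100.0
print(pass_at_least_k(results, 3))  # ~66.7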

The results in the following table show clear improvements across both domains and multiple attempts.

| System | Memory Type Used by Agent | Retail Pass^1 | Retail Pass^2 | Retail Pass^3 | Airline Pass^1 | Airline Pass^2 | Airline Pass^3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | No Memory | 65.8% | 49.7% | 42.1% | 47.0% | 33.3% | 24.0% |
| Memory-Augmented Agent | Episodes as ICL Example | 69.3% | 53.8% | 43.4% | 55.0% | 46.7% | 43.0% |
| Memory-Augmented Agent | Cross-Episode Reflection Memory | 77.2% | 64.3% | 55.7% | 58.0% | 46.0% | 41.0% |

Memory-augmented agents consistently outperform the baseline across domains and consistency levels. Crucially, these results demonstrate that different memory retrieval strategies are better suited to different task characteristics. In the retail domain, cross-episode reflection improved Pass^1 by 11.4 percentage points and Pass^3 by 13.6 percentage points over the baseline, suggesting that generalized strategic insights are particularly valuable when handling open-ended customer service scenarios with diverse interaction patterns. In contrast, the airline domain, which is characterized by complex, rule-based policies and multi-step procedures, benefits more from episodes as examples, which achieved the highest Pass^3 (43.0% vs. 41.0% for reflection). This indicates that concrete step-by-step examples help agents navigate structured workflows reliably. The relative improvement is most pronounced at higher consistency thresholds (Pass^3), where memory helps agents avoid the mistakes that cause intermittent failures.
Best practices for using episodic memory
The key to effective episodic memory is knowing when to use it and which type fits your situation. In this section, we discuss what we’ve learned works best.
When to use episodic memory
Episodic memory delivers the most value when you match the right memory type to your current need. It is ideal for complex, multi-step tasks where context and past experience matter significantly, such as debugging code, planning trips, and analyzing data. It's also particularly valuable for repetitive workflows where learning from previous attempts can dramatically improve outcomes, and for domain-specific problems where accumulated expertise makes a real difference.
However, episodic memory isn’t always the right choice. You can skip it for simple, one-time questions like weather checks or basic facts that don’t need reasoning or context. Simple customer service conversations, basic Q&A, or casual chats don’t need the advanced features that episodic memory adds. The true benefit of episodic memory is observed over time. For short tasks, a session summary provides sufficient information. However, for complex tasks and repetitive workflows, episodic memory helps agents build on past experiences and continuously improve their performance.
Choosing episodes vs. reflection
Episodes work best when you’re facing similar specific problems and need clear guidance. If you’re debugging a React component that won’t render, episodes can show you exactly how similar problems were fixed before, including the specific tools used, thinking process, and results. They give you real examples when general advice isn’t enough, showing the complete path from finding the problem to solving it.
Reflection memories work best when you need strategic guidance across broader contexts rather than specific step-by-step solutions. Use reflections when you’re facing a new type of problem and need to understand general principles, like “What’s the most effective approach for data visualization tasks?” or “Which debugging strategies tend to work best for API integration issues?” Reflections are particularly valuable when you’re making high-level decisions about tool selection and which method to follow, or understanding why certain patterns consistently succeed or fail.
Before starting tasks, check reflections for strategy guidance, look at similar episodes for solution patterns, and find high-confidence mistakes documented in previous attempts. During tasks, look at episodes when you hit roadblocks, use reflection insights for tool choices, and think about how your current situation differs from past examples.
Conclusion
Episodic memory fills a critical gap in current agent capabilities. By storing complete reasoning paths and learning from outcomes, agents can avoid repeating mistakes and build on successful strategies.
Episodic memory completes the memory framework of Amazon Bedrock AgentCore alongside summarization, semantic, and preference memory. Each serves a specific purpose: summarization manages context length, semantic memory stores facts, preference memory handles personalization, and episodic memory captures experience. The combination helps give agents both structured knowledge and practical experience to handle complex tasks more effectively.
To learn more about episodic memory, refer to Episodic memory strategy, How to best retrieve episodes to improve agentic performance, and the AgentCore Memory GitHub samples.

Appendix
In this section, we provide examples of the two methods of using memories for memory-augmented agents, along with the corresponding tool definitions.
Episode example
The following is an example using extracted episodes as in-context learning examples:

** Context **
A customer (Jane Doe) contacted customer service expressing frustration
about a recent flight delay that disrupted their travel plans and wanted
to discuss compensation or resolution options for the inconvenience they
experienced.

** Goal **
The user’s primary goal was to obtain compensation or some form of resolution
for a flight delay they experienced, seeking acknowledgment of the disruption
and appropriate remediation from the airline.

### Step 1:

**Thought:**
The assistant chose to gather information systematically rather than making
assumptions, as flight delay investigations require specific reservation and
flight details. This approach facilitates accurate assistance and demonstrates
professionalism by acknowledging the customer’s frustration while taking concrete
steps to help resolve the issue.

**Action:**
The assistant responded conversationally without using any tools, asking the
user to provide their user ID to access reservation details.

— End of Step 1 —

** Episode Reflection **:
The conversation demonstrates an excellent systematic approach to flight
modifications: starting with reservation verification, then identifying
confirmation, followed by comprehensive flight searches, and finally processing
changes with proper authorization. The assistant effectively used appropriate
tools in a logical sequence – get_reservation_details for verification, get_user_details
for identity/payment info, search_direct_flight for options, and update tools for
processing changes. Key strengths included transparent pricing calculations,
proactive mention of insurance benefits, clear presentation of options, and proper
handling of policy constraints (explaining why mixed cabin classes aren’t allowed).
The assistant successfully leveraged user benefits (Gold status for free bags) and
maintained security protocols throughout. This methodical approach made sure user
needs were addressed while following proper procedures for reservation modifications.

Reflection example
The following is an example of Reflection memory, which can be used for agent guidance:

**Title:** Proactive Alternative Search Despite Policy Restrictions

**Use Cases:**
This applies when customers request flight modifications or changes that
are blocked by airline policies (such as basic economy no-change rules,
fare class restrictions, or booking timing limitations). Rather than simply
declining the request, this pattern involves immediately searching for
alternative solutions to help customers achieve their underlying goals.
It’s particularly valuable for emergency situations, budget-conscious travelers,
or when customers have specific timing needs that their current reservations
don’t accommodate.

**Hints:**
When policy restrictions prevent the requested modification, immediately pivot
to solution-finding rather than just explaining limitations. Use search_direct_flight
to find alternative options that could meet the customer’s needs, even if it requires
separate bookings or different approaches. Present both the policy constraint
explanation AND viable alternatives in the same response to maintain momentum toward
resolution. Consider the customer’s underlying goal (getting home earlier,
changing dates, etc.) and search for flights that accomplish this objective.
When presenting alternatives, organize options clearly by date and price, highlight
budget-friendly choices, and explain the trade-offs between keeping existing reservations
versus canceling and rebooking. This approach transforms policy limitations into problem-solving
opportunities and maintains customer satisfaction even when the original request cannot be fulfilled.

Tool definitions
The following code is the tool definition for retrieve_exemplars:
def retrieve_exemplars(task: str) -> str:
    """
    Retrieve example processes to help solve the given task.

    Args:
        task: The task to solve that requires example processes.

    Returns:
        str: The example processes to help solve the given task.
    """

The following is the tool definition for retrieve_reflections:
def retrieve_reflections(task: str, k: int = 5) -> str:
    """
    Retrieve synthesized reflection knowledge from past agent experiences by matching
    against knowledge titles and use cases. Each knowledge entry contains: (1) a descriptive
    title, (2) specific use cases describing the types of goals where this knowledge applies
    and when to apply it, and (3) actionable hints including best practices from successful
    episodes and common pitfalls to avoid from failed episodes. Use this to get strategic
    guidance for similar tasks.

    Args:
        task: The current task or goal you are trying to accomplish. This will be matched
            against knowledge titles and use cases to find relevant reflection knowledge.
            Describe your task clearly to get the most relevant matches.
        k: Number of reflection knowledge entries to retrieve. Default is 5.

    Returns:
        str: The synthesized reflection knowledge from past agent experiences.
    """

About the Authors
Jiarong Jiang is a Principal Applied Scientist at AWS, driving innovations in Retrieval Augmented Generation (RAG) and agent memory systems to improve the accuracy and intelligence of enterprise AI. She’s passionate about helping customers build context-aware, reasoning-driven applications that use their own data effectively.
Akarsha Sehwag is a Generative AI Data Scientist for the Amazon Bedrock AgentCore Memory team. With over 6 years of expertise in AI/ML, she has built production-ready enterprise solutions across diverse customer segments in generative AI, deep learning, and computer vision domains. Outside of work, she likes to hike, bike, and play badminton.
Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
Peng Shi is a Senior Applied Scientist at AWS, where he leads advancements in agent memory systems to enhance the accuracy, adaptability, and reasoning capabilities of AI. His work focuses on creating more intelligent and context-aware applications that bridge cutting-edge research with real-world impact.
Anil Gurrala is a Senior Solutions Architect at AWS based in Atlanta. With over 3 years at Amazon and nearly two decades of experience in digital innovation and transformation, he helps customers with modernization initiatives, architecture design, and optimization on AWS. Anil specializes in implementing agentic AI solutions while partnering with enterprises to architect scalable applications and optimize their deployment within the AWS cloud environment. Outside of work, Anil enjoys playing volleyball and badminton, and exploring new destinations around the world.
Ruo Cheng is a Senior UX Designer at AWS, designing enterprise AI and developer experiences across Amazon Bedrock and Amazon Bedrock AgentCore. With a decade of experience, she leads design for AgentCore Memory, shaping memory-related workflows and capabilities for agent-based applications. Ruo is passionate about translating complex AI and infrastructure concepts into intuitive, user-centered experiences.

How bunq handles 97% of support with Amazon Bedrock

This post was co-authored with Benjamin Kleppe, Machine Learning Engineering Lead at bunq.
The integration of agentic AI is transforming the banking industry, marking a significant shift from traditional customer service systems. Agentic AI demonstrates autonomous decision-making capabilities in complex financial environments, enabling banks to provide round-the-clock multilingual support, process transactions, and deliver personalized financial insights at scale.
bunq is Europe's second-largest neobank, built to make life easy for people and businesses who live an international lifestyle. Founded in 2012 by serial entrepreneur Ali Niknam, bunq has always put users at the heart of everything it does. The company helps its 20 million users across Europe spend, save, budget, and invest confidently, all within a single, user-friendly application built on user feedback.
In this post, we show how bunq upgraded Finn, its in-house generative AI assistant, using Amazon Bedrock to make user support and banking operations seamless across multiple languages and time zones.
Business challenge
Banks face a major challenge in delivering consistent, high-quality customer support across multiple channels, languages, and time zones. Traditional support systems struggle with the complexity of financial products, regulatory requirements, and the growing expectation for instant, accurate responses. Customers expect instant access to essential banking functions like transaction disputes, account management, and financial advice, while banks need to maintain strict security protocols and compliance standards.

As a user-centric bank, bunq serves users who expect round-the-clock support for their banking needs, such as requesting a refund or seeking guidance on features. Traditional support models couldn't keep up with this demand, creating frustrating bottlenecks and straining internal resources. Beyond direct support, bunq's team also needed efficient ways to analyze incoming feature requests and bug reports to continuously improve their system. It was clear that bunq needed a smarter solution that could provide instant, accurate assistance around the clock and help the team turn valuable user feedback into action.
Solution overview
Launched in 2023, bunq's generative AI assistant, Finn, is fully built in-house as part of bunq's proprietary AI stack. Finn uses leading AI foundation models (FMs) and tooling, including Anthropic's Claude models through Amazon Bedrock. Unlike generic chatbots, Finn processes natural language and provides real-time, intelligent answers. Finn can translate the bunq application into 38 languages and translate speech-to-speech calls to the support team in real time. It can also summarize complex banking information, provide financial insights and budgeting advice, and even recognize images, automating tedious tasks such as invoice processing.

bunq's approach uses AWS services to create a scalable AI agent infrastructure that can handle the demands of modern banking while maintaining security and compliance. The solution uses the following AWS services:

Amazon Bedrock – A fully managed service that makes high-performing FMs from leading AI companies and Amazon available through a unified API. bunq uses Amazon Bedrock to access Anthropic’s Claude models with enhanced security features, scalability, and compliance—critical requirements for banking applications.
Amazon Elastic Container Service (Amazon ECS) – A fully managed container orchestration service that makes it straightforward to deploy, manage, and scale containerized applications. Amazon ECS alleviates the need to install and operate container orchestration software or manage clusters of virtual machines, helping bunq focus on building Finn’s multi-agent architecture.
Amazon DynamoDB – A fully managed, serverless, NoSQL database service designed to run high-performance applications at scale. DynamoDB delivers single-digit millisecond performance and stores agent memory, conversation history, and session data, enabling Finn to maintain context across customer interactions.
Amazon OpenSearch Serverless – An on-demand, automatic scaling configuration for Amazon OpenSearch Service. OpenSearch Serverless automatically scales compute resources based on application needs and provides vector search capabilities for Finn’s Retrieval Augmented Generation (RAG) implementation, enabling semantic search across bunq’s knowledge base.

Building a multi-agent implementation with Amazon Bedrock
Users can interact with Finn through bunq's application and web interface, using natural language for their requests, such as account information, transaction history, financial advice, and support issues. The system processes requests in real time, accessing only the data pertinent to the request, while maintaining strict security and privacy controls.

User support scenarios demand more than what a single AI agent can deliver. A multi-agent architecture allows specialized agents to handle distinct tasks—one agent might excel at understanding the user, another focuses on extracting relevant documentation, and a third handles transaction analysis or account operations. For Finn, this means a user asking about a failed payment can trigger a coordinated response: one agent interprets the question, another checks transaction logs, and a third suggests solutions based on similar cases. They all work together seamlessly to deliver a comprehensive answer in seconds, instead of bouncing the user between departments.

The initial multi-agent support system for banking services followed a seemingly straightforward pattern: a central router agent directed user queries to specialized sub-agents. Each agent handled specific domains—technical support, general inquiries, transaction status, account management, and so on. However, as the system grew, so did the size and complexity of the demands. As bunq added more specialized agents to handle the new ecosystem, three issues became apparent:

Routing complexity – With multiple specialized agents, the router needed increasingly sophisticated logic to determine the correct destination.
Overlapping capabilities – Multiple agents required access to the same data sources and capabilities, forcing the router to predict not just the primary intent but also which secondary agents might be needed downstream—an impossible task at scale.
Scalability bottleneck – Every new agent or capability meant updating the router’s logic. Adding a new specialized agent required comprehensive testing of all routing scenarios. The router became a single point of failure and a potential development bottleneck.

Rethinking the architecture
bunq redesigned its system around an orchestrator agent that works fundamentally differently from the old router. Instead of trying to route to all possible agents, the orchestrator performs the following actions:

Routes queries to only three to five primary agents
Empowers these primary agents to invoke other agents as tools when needed
Delegates decision-making to the agents themselves

With this agent-as-tool pattern, primary agents detect when they need specialized help. Tool agents are invoked dynamically by primary agents. Agents can call other agents through a well-defined interface—they become tools in each other’s toolkits.
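The following is a minimal, hypothetical sketch of the agent-as-tool pattern. It is not bunq's implementation; the agent names and helper functions are placeholders used only to illustrate how a primary agent can invoke other agents through a tool-style interface while the orchestrator stays small.

from typing import Callable, Dict

# Hypothetical illustration of the agent-as-tool pattern (not bunq's actual code).
# Each "agent" is a callable; specialized agents are exposed to primary agents as tools.
AgentFn = Callable[[str], str]

def transaction_analysis_agent(query: str) -> str:
    return f"[transaction analysis for: {query}]"    # placeholder specialized agent

def documentation_agent(query: str) -> str:
    return f"[relevant documentation for: {query}]"  # placeholder specialized agent

def payments_primary_agent(query: str, tools: Dict[str, AgentFn]) -> str:
    # The primary agent decides which specialized agents it needs and calls them as tools.
    logs = tools["transaction_analysis"](query)
    docs = tools["documentation"](query)
    return f"Answer based on {logs} and {docs}"

def orchestrator(query: str) -> str:
    # The orchestrator routes to a small set of primary agents; it does not need
    # to know about every specialized agent.
    tools = {
        "transaction_analysis": transaction_analysis_agent,
        "documentation": documentation_agent,
    }
    return payments_primary_agent(query, tools)

print(orchestrator("Why did my payment fail yesterday?"))

In this shape, adding a new specialized capability means registering another tool for the primary agents, so the orchestrator's routing logic stays small.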
The following diagram illustrates this workflow.

bunq’s Finn service uses a comprehensive AWS infrastructure designed for security, scalability, and intelligent orchestration. The following architecture diagram shows how multiple AWS services work together to deliver a multi-agent AI system.

Orchestration and agent architecture
At the core of the system is the orchestrator agent, running on Amazon Elastic Container Service (Amazon ECS). This orchestrator implements the agent-as-tool pattern, routing user queries to a limited set of primary agents rather than attempting to predict every possible scenario. The orchestrator maintains three to five primary agents (Primary Agent 1 through 5), each deployed as containerized services on Amazon ECS. This design provides horizontal scalability—as demand increases, additional agent instances can be spun up automatically.

Each primary agent is empowered to invoke specialized agents as needed. These specialized agents (Specialized Agent 1, 2, 3, and so on) act as tools that primary agents can call upon for specific capabilities, such as analyzing transaction data, retrieving documentation, or processing complex queries. This hierarchical structure avoids the routing complexity bottleneck while maintaining flexibility.
Infrastructure details
The architecture is built on a robust foundation of AWS services that enable Finn’s performance. Users access the service through bunq’s application, with traffic secured by AWS WAF and Amazon CloudFront, while authentication flows through bunq’s proprietary identity system. Amazon Bedrock provides access to Anthropic’s Claude models for natural language understanding, complemented by Amazon SageMaker hosted fine-tuned models for specialized banking scenarios. Agent memory and conversation history are stored in DynamoDB, and OpenSearch Service serves as a vector store for RAG capabilities, enabling semantic search across bunq’s knowledge base. Amazon Simple Storage Service (Amazon S3) handles document storage, and Amazon MemoryDB manages user sessions for real-time interactions. Comprehensive observability through AWS CloudTrail, Amazon GuardDuty, and Amazon CloudWatch helps the team monitor performance, detect threats, and maintain compliance—all within a secure virtual private cloud (VPC).
Real-world impact
The transformation from bunq’s initial router-based architecture to the orchestrator pattern with Amazon Bedrock delivered measurable improvements across user support operations. The multi-agent deployment achieved significant operational efficiency gains:

Finn now handles 97% of bunq’s user support activity, with over 82% fully automated. Average response times dropped to just 47 seconds, helping bunq deliver the real-time solutions users expect.
The rapid deployment timeline highlights bunq’s focus on innovation. The team moved from concept to production in 3 months, starting in January 2025. bunq brought together a team of 80 people—from AI engineers to support staff—who worked together to test, learn, and deploy updates three times a day.
Before implementing the orchestrator architecture, escalations were mainly manual processes. The new multi-agent system increased automation, transforming end-to-end support metrics. Beyond that, Finn expanded bunq’s reach by translating the application into 38 languages, making banking more accessible to millions of users across Europe.
The solution enabled bunq to become Europe’s first AI-powered bank, offering capabilities no traditional support system could deliver: real-time speech-to-speech translation (a first in global banking), image recognition for receipt processing and document verification, and intelligent financial insights—all while maintaining the round-the-clock availability users demand.

“We went from concept to production in 3 months. Before the orchestrator architecture, escalations were mainly manual. Now Finn handles 97% of support with 70% fully automated and 47-second average response times.”
– Benjamin Kleppe, Machine Learning Engineering Lead at bunq.

Conclusion
bunq's journey from manual support escalations to an intelligent multi-agent system shows how modern AI architecture can transform banking operations. By moving from a rigid router-based approach to a flexible orchestrator pattern with Amazon Bedrock, bunq avoided scalability bottlenecks while maintaining the agility needed to serve 20 million users across Europe.

The orchestrator pattern with agent-as-tool capabilities proved essential to bunq's success. Rather than predicting every possible user scenario upfront, the system empowers primary agents to dynamically invoke specialized agents as needed. This architectural shift reduced complexity, accelerated development cycles, and helped bunq deploy updates three times per day during the initial rollout.

The results: 97% of support interactions handled by Finn, 70% fully automated, and average response times of just 47 seconds. Beyond efficiency gains, the solution expanded bunq's reach to 38 languages and positioned the company as Europe's first AI-powered bank. By freeing internal resources from manual processes, bunq can now focus on what it does best: building a bank that makes life easy for its users.
To learn more about building AI-powered applications with FMs, refer to Amazon Bedrock. Explore how Anthropic’s Claude on Amazon Bedrock can transform your customer experience with enhanced security features and scalability. Get started with the Amazon Bedrock documentation to build your own multi-agent solutions.

About the Authors
Benjamin Kleppe is Machine Learning Engineering Lead at bunq, where he leads the development and scaling of AI-powered solutions that make banking smarter and more personal for 20 million users across Europe. He focuses on building intelligent systems that enhance user experience, improve product discovery, and automate complex banking processes. Benjamin is passionate about pushing the boundaries of AI innovation in banking, having led bunq to become Europe’s first AI-powered bank with the launch of Finn, their proprietary generative AI platform.
Jagdeep Singh Soni is a Senior AI/ML Solutions Architect at AWS based in the Netherlands, specializing in generative AI and Amazon Bedrock. He helps customers and partners architect and implement intelligent agent solutions using Amazon Bedrock and other AWS AI/ML services. With 16 years of experience in innovation and cloud architecture, Jagdeep focuses on enabling organizations to build production-ready generative AI applications that use foundation models and agent frameworks for real-world business outcomes.
Guy Kfir is a generative AI Lead at AWS with over 15 years of experience in cloud technology sales, business development, and AI/ML evangelism. He works with enterprise customers, startups, and partners across EMEA to accelerate adoption of generative AI solutions and execute go-to-market strategies.

What are Context Graphs?

Knowledge Graphs and their limitations

With the rapid growth of AI applications, Knowledge Graphs (KGs) have emerged as a foundational structure for representing knowledge in a machine-readable form. They organize information as triples—a head entity, a relation, and a tail entity—forming a graph-like structure where entities are nodes and relationships are edges. This representation allows machines to understand and reason over connected knowledge, supporting intelligent applications such as question answering, semantic analysis, and recommendation systems.

Despite their effectiveness, Knowledge Graphs (KGs) have notable limitations. They often lose important contextual information, making it difficult to capture the complexity and richness of real-world knowledge. Additionally, many KGs suffer from data sparsity, where entities and relationships are incomplete or poorly connected. This lack of full annotation limits the contextual signals available during inference, posing challenges for effective reasoning, even when integrated with large language models.

Context Graphs

Context Graphs (CGs) extend traditional Knowledge Graphs by adding extra information such as time, location, and source details. Instead of storing knowledge as isolated facts, they capture the situation in which a fact or decision occurred, leading to a clearer and more accurate understanding of real-world knowledge.

When used with agent-based systems, context graphs also store how decisions were made. Agents need more than rules—they need to know how rules were applied before, when exceptions were allowed, who approved decisions, and how conflicts were handled. Since agents operate directly where decisions happen, they can naturally record this full context.

Over time, these stored decision traces form a context graph that helps agents learn from past actions. This allows systems to understand not only what happened, but also why it happened, making agent behavior more consistent and reliable.

What are the effects of Contextual Information?

Contextual information adds important layers to knowledge representation by going beyond simple entity–relation facts. It helps distinguish between facts that look similar but occur under different conditions, such as differences in time, location, scale, or surrounding circumstances. For example, two companies may be competitors in one market or time period but not in another. By capturing such context, systems can represent knowledge in a more detailed way and avoid treating all similar-looking facts as identical.

In context graphs, contextual information also plays a key role in reasoning and decision-making. It includes signals such as historical decisions, policies applied, exceptions granted, approvals involved, and related events from other systems. When agents record how a decision was made—what data was used, which rule was checked, and why an exception was allowed—this information becomes reusable context for future decisions. Over time, these records help connect entities that are not directly linked and allow systems to reason based on past outcomes and precedents, rather than relying only on fixed rules or isolated triples.
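To make this concrete, the following is a minimal sketch of how a contextualized fact and a decision trace could be represented. The field names are illustrative assumptions, not a standard context graph schema.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Illustrative data model for context graph entries (field names are assumptions).
@dataclass
class ContextualFact:
    head: str                       # entity, e.g., "Company A"
    relation: str                   # e.g., "competitor_of"
    tail: str                       # entity, e.g., "Company B"
    context: Dict[str, str] = field(default_factory=dict)  # time, location, source, ...

@dataclass
class DecisionTrace:
    decision: str                   # what was decided
    data_used: List[str]            # which data informed it
    policy_checked: str             # which rule or policy applied
    exception: Optional[str] = None   # why an exception was granted, if any
    approved_by: Optional[str] = None # who approved it
    outcome: Optional[str] = None     # what happened afterwards

fact = ContextualFact(
    "Company A", "competitor_of", "Company B",
    context={"market": "EU", "time": "2023", "source": "annual report"},
)
trace = DecisionTrace(
    decision="refund approved",
    data_used=["order history", "refund policy v2"],
    policy_checked="refund_policy",
    exception="first-time customer goodwill",
    approved_by="support lead",
    outcome="customer retained",
)
print(fact, trace, sep="\n")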

Shift from static tools to decision-making agents

There has been a clear shift in AI systems—from static tools to decision-making agents, driven largely by major industry players. Real-world decisions are rarely based on rules alone; they involve exceptions, approvals, and lessons from past cases. Context graphs address this gap by capturing how decisions are made across systems—what policies were checked, which data was used, who approved the decision, and what outcome followed. By structuring this decision history as context, agents can reuse prior judgments instead of repeatedly relearning the same edge cases. Some examples of this shift include:

Google

Gmail’s Gemini features and Gemini 3–based agent frameworks both show AI shifting from simple help to active decision-making, whether that’s managing inbox priorities or running complex workflows.

Gmail relies on conversation history and user intent, while Gemini 3 agents use memory and state to handle longer tasks. In both cases, context matters more than single prompts.

Gemini 3 acts as an orchestration layer for multi-agent systems (ADK, Agno, Letta, Eigent), similar to how Gemini orchestrates summarization, writing, and prioritization inside Gmail.

Features like AI Inbox and Suggested Replies rely on persistent understanding of user behavior, just as agent frameworks like Letta and mem0 rely on stateful memory to prevent context loss and ensure consistent behavior.

Gmail turns email into actionable summaries and to-dos, while Gemini-powered agents automate browsers, workflows, and enterprise tasks—both reflecting a broader shift toward AI systems that act, not just respond.

OpenAI

ChatGPT Health brings health data from different sources—medical records, apps, wearables, and notes—into one place. This creates a clear, shared context that helps the system understand health patterns over time instead of answering isolated questions, similar to how context graphs link facts with their context.

By using personal health history and past interactions, ChatGPT Health helps users make better-informed decisions, such as preparing for doctor visits or understanding test results.

Health runs in a separate, secure space, keeping sensitive information private and contained. This ensures health context stays accurate and protected, which is essential for safely using context-based systems like context graphs.

JP Morgan

JP Morgan's replacement of proxy advisors with its AI tool, Proxy IQ, shows a shift toward building in-house decision systems that aggregate and analyze voting data across thousands of meetings, rather than relying on third-party recommendations.

By analyzing proxy data internally, the firm can incorporate historical voting behavior, company-specific details, and firm-level policies—aligning with the idea of context graphs that preserve how decisions are formed over time.

Internal AI-based analysis gives JP Morgan more transparency, speed, and consistency in proxy voting, reflecting a broader move toward context-aware, AI-driven decision-making in enterprise settings.

NVIDIA

NVIDIA’s NeMo Agent Toolkit helps turn AI agents into production-ready systems by adding observability, evaluation, and deployment controls. By capturing execution traces, reasoning steps, and performance signals, it records how an agent arrived at an outcome—not just the final result—aligning closely with the idea of context graphs.

Tools like OpenTelemetry tracing and structured evaluations convert agent behavior into usable context. This makes it easier to debug decisions, compare different runs, and steadily improve reliability.

Similar to how DLSS 4.5 integrates AI deeply into real-time graphics pipelines, NAT integrates AI agents into enterprise workflows. Both highlight a broader shift toward AI systems that retain state, history, and context, which is critical for dependable, large-scale deployment.

Microsoft

Copilot Checkout and Brand Agents turn shopping conversations into direct purchases. Questions, comparisons, and decisions happen in one place, creating clear context around why a customer chose a product.

These AI agents operate exactly where buying decisions happen—inside chats and brand websites—allowing them to guide users and complete checkout without extra steps.

Merchants keep control of transactions and customer data. Over time, these interactions build useful context about customer intent and buying patterns, helping future decisions become faster and more accurate.

The post What are Context Graphs? appeared first on MarkTechPost.

A Coding Guide to Anemoi-Style Semi-Centralized Agentic Systems Using …

In this tutorial, we demonstrate how a semi-centralized Anemoi-style multi-agent system works by letting two peer agents negotiate directly without a manager or supervisor. We show how a Drafter and a Critic iteratively refine an output through peer-to-peer feedback, reducing coordination overhead while preserving quality. We implement this pattern end-to-end in Colab using LangGraph, focusing on clarity, control flow, and practical execution rather than abstract orchestration theory. Check out the FULL CODES here.

!pip -q install -U langgraph langchain-openai langchain-core

import os
import json
from getpass import getpass
from typing import TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (hidden): ")

MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
llm = ChatOpenAI(model=MODEL, temperature=0.2)

We set up the Colab environment by installing the required LangGraph and LangChain packages and securely collecting the OpenAI API key as a hidden input. We initialize the language model that will be shared by all agents, keeping the configuration minimal and reproducible. Check out the FULL CODES here.

class AnemoiState(TypedDict):
    task: str
    max_rounds: int
    round: int
    draft: str
    critique: str
    agreed: bool
    final: str
    trace: bool

We define a typed state that acts as the shared communication surface between agents during negotiation. We explicitly track the task, draft, critique, agreement flag, and iteration count to keep the flow transparent and debuggable. This state obviates the need for a central manager or for implicit memory. Check out the FULL CODES here.

DRAFTER_SYSTEM = """You are Agent A (Drafter) in a peer-to-peer loop.
You write a high-quality solution to the user's task.
If you receive critique, you revise decisively and incorporate it.
Return only the improved draft text."""

def drafter_node(state: AnemoiState) -> AnemoiState:
    task = state["task"]
    critique = state.get("critique", "").strip()
    r = state.get("round", 0) + 1

    if critique:
        user_msg = f"""TASK:
{task}

CRITIQUE:
{critique}

Revise the draft."""
    else:
        user_msg = f"""TASK:
{task}

Write the first draft."""

    draft = llm.invoke(
        [
            {"role": "system", "content": DRAFTER_SYSTEM},
            {"role": "user", "content": user_msg},
        ]
    ).content.strip()

    if state.get("trace", False):
        print(f"\n— Drafter Round {r} —\n{draft}\n")

    return {**state, "round": r, "draft": draft, "agreed": False}

We implement the Drafter agent, which produces the initial response and revises it whenever peer feedback is available. We keep the Drafter focused purely on improving the user-facing draft, without awareness of control logic or termination conditions. It mirrors the Anemoi idea of agents optimizing locally while observing peer signals. Check out the FULL CODES here.

CRITIC_SYSTEM = """You are Agent B (Critic).
Return strict JSON:
{"agree": true/false, "critique": "..."}"""

def critic_node(state: AnemoiState) -> AnemoiState:
    task = state["task"]
    draft = state.get("draft", "")

    raw = llm.invoke(
        [
            {"role": "system", "content": CRITIC_SYSTEM},
            {
                "role": "user",
                "content": f"TASK:\n{task}\n\nDRAFT:\n{draft}",
            },
        ]
    ).content.strip()

    cleaned = raw.strip("`").replace("json", "").strip()

    try:
        data = json.loads(cleaned)
        agree = bool(data.get("agree", False))
        critique = str(data.get("critique", "")).strip()
    except Exception:
        agree = False
        critique = raw

    if state.get("trace", False):
        print(f"— Critic Decision —\nAGREE: {agree}\n{critique}\n")

    final = draft if agree else state.get("final", "")
    return {**state, "agreed": agree, "critique": critique, "final": final}

We implement the Critic agent, which evaluates the draft and decides whether it is ready to ship or needs revision. We enforce a strict agree-or-revise decision to avoid vague feedback and ensure fast convergence. This peer evaluation step allows quality control without introducing a supervisory agent. Check out the FULL CODES here.

def continue_or_end(state: AnemoiState) -> str:
    if state.get("agreed", False):
        return "end"
    if state.get("round", 0) >= state.get("max_rounds", 3):
        return "force_ship"
    return "loop"

def force_ship_node(state: AnemoiState) -> AnemoiState:
    return {**state, "final": state.get("final") or state.get("draft", "")}

graph = StateGraph(AnemoiState)
graph.add_node("drafter", drafter_node)
graph.add_node("critic", critic_node)
graph.add_node("force_ship", force_ship_node)

graph.set_entry_point("drafter")
graph.add_edge("drafter", "critic")
graph.add_conditional_edges(
    "critic",
    continue_or_end,
    {"loop": "drafter", "force_ship": "force_ship", "end": END},
)
graph.add_edge("force_ship", END)

anemoi_critic_loop = graph.compile()

demo_task = """Explain the Anemoi semi-centralized agent pattern and why peer-to-peer critic loops reduce bottlenecks."""

result = anemoi_critic_loop.invoke(
    {
        "task": demo_task,
        "max_rounds": 3,
        "round": 0,
        "draft": "",
        "critique": "",
        "agreed": False,
        "final": "",
        "trace": False,
    }
)

print("\n====================")
print(" FINAL OUTPUT")
print("====================\n")
print(result["final"])

We assemble the LangGraph workflow that routes control between the Drafter and the Critic until the Critic agrees or the maximum number of rounds is reached. We rely on simple conditional routing rather than centralized planning, thereby preserving the system's semi-centralized nature. Finally, we execute the graph and return the best available output to the user.

In conclusion, we demonstrated that Anemoi-style peer negotiation is a practical alternative to manager-worker architectures, offering lower latency, reduced context bloat, and simpler agent coordination. By allowing agents to monitor and correct each other directly, we achieved convergence with fewer tokens and less orchestration complexity. In this tutorial, we provided a reusable blueprint for building scalable, semi-centralized agent systems. It lays the foundation for extending the pattern to multi-peer meshes, red-team loops, or protocol-based agent interoperability.

Check out the FULL CODES here.
The post A Coding Guide to Anemoi-Style Semi-Centralized Agentic Systems Using Peer-to-Peer Critic Loops in LangGraph appeared first on MarkTechPost.

Zhipu AI Releases GLM-4.7-Flash: A 30B-A3B MoE Model for Efficient Loc …

GLM-4.7-Flash is a new member of the GLM 4.7 family and targets developers who want strong coding and reasoning performance in a model that is practical to run locally. Zhipu AI (Z.ai) describes GLM-4.7-Flash as a 30B-A3B MoE model and presents it as the strongest model in the 30B class, designed for lightweight deployment where performance and efficiency both matter.

Model class and position inside the GLM 4.7 family

GLM-4.7-Flash is a text generation model with 31B parameters, BF16 and F32 tensor types, and the architecture tag glm4_moe_lite. It supports English and Chinese and is configured for conversational use. GLM-4.7-Flash sits in the GLM-4.7 collection next to the larger GLM-4.7 and GLM-4.7-FP8 models.

Z.ai positions GLM-4.7-Flash as a free tier and lightweight deployment option relative to the full GLM-4.7 model, while still targeting coding, reasoning, and general text generation tasks. This makes it interesting for developers who cannot deploy a 358B class model but still want a modern MoE design and strong benchmark results.

Architecture and context length

In a Mixture of Experts architecture of this type, the model stores more parameters than it activates for each token. That allows specialization across experts while keeping the effective compute per token closer to a smaller dense model.

GLM-4.7-Flash supports a context length of 128K tokens and achieves strong performance on coding benchmarks among models of similar scale. This context size is suitable for large codebases, multi-file repositories, and long technical documents, where many existing models would need aggressive chunking.

GLM-4.7-Flash uses a standard causal language modeling interface and a chat template, which allows integration into existing LLM stacks with minimal changes.
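As a quick illustration of that interface, the following is a minimal sketch of loading the model with the Transformers library and generating a chat completion. The model ID comes from the Hugging Face link in this post; the exact loading flags (dtype, device placement, required Transformers version) may differ in your environment, and the sampling values mirror the defaults described later in this post.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID taken from the Hugging Face link in this post; loading details
# (dtype, device_map, minimum Transformers version) are environment-dependent.
model_id = "zai-org/GLM-4.7-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that parses a CSV line."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Default sampling settings reported for most tasks: temperature 1.0, top_p 0.95.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))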

Benchmark performance in the 30B class

The Z.ai team compares GLM-4.7-Flash with Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. GLM-4.7-Flash leads or is competitive across a mix of math, reasoning, long horizon, and coding agent benchmarks.

(Benchmark comparison table: https://huggingface.co/zai-org/GLM-4.7-Flash)

This comparison shows why GLM-4.7-Flash is one of the strongest models in the 30B class, at least among the models included. The important point is that GLM-4.7-Flash is not only a compact deployment of GLM but also a high-performing model on established coding and agent benchmarks.

Evaluation parameters and thinking mode

For most tasks, the default settings are temperature 1.0, top p 0.95, and max new tokens 131,072. This defines a relatively open sampling regime with a large generation budget.

For Terminal Bench and SWE-bench Verified, the configuration uses temperature 0.7, top p 1.0, and max new tokens 16,384. For τ²-Bench, the configuration uses temperature 0 and max new tokens 16,384. These stricter settings reduce randomness for tasks that need stable tool use and multi-step interaction.

The Z.ai team also recommends turning on Preserved Thinking mode for multi-turn agentic tasks such as τ²-Bench and Terminal Bench 2. This mode preserves internal reasoning traces across turns, which is useful when you build agents that need long chains of function calls and corrections.

How GLM-4.7-Flash fits developer workflows

GLM-4.7-Flash combines several properties that are relevant for agentic, coding focused applications:

A 30B-A3B MoE architecture with 31B parameters and a 128K-token context length.

Strong benchmark results on AIME 25, GPQA, SWE-bench Verified, τ²-Bench, and BrowseComp compared to other models in the same table.

Documented evaluation parameters and a Preserved Thinking mode for multi turn agent tasks.

First class support for vLLM, SGLang, and Transformers based inference, with ready to use commands.

A growing set of finetunes and quantizations, including MLX conversions, in the Hugging Face ecosystem.

Check out the model weights on Hugging Face.
The post Zhipu AI Releases GLM-4.7-Flash: A 30B-A3B MoE Model for Efficient Local Coding and Agents appeared first on MarkTechPost.

Introducing multimodal retrieval for Amazon Bedrock Knowledge Bases

We are excited to announce the general availability of multimodal retrieval for Amazon Bedrock Knowledge Bases. This new capability adds native support for video and audio content, on top of text and images. With it, you can build Retrieval Augmented Generation (RAG) applications that search and retrieve information across text, images, audio, and video, all within a fully managed service.
Modern enterprises store valuable information in multiple formats. Product documentation includes diagrams and screenshots, training materials contain instructional videos, and customer insights are captured in recorded meetings. Until now, building artificial intelligence (AI) applications that could effectively search across these content types required complex custom infrastructure and significant engineering effort.
Previously, Bedrock Knowledge Bases used text-based embedding models for retrieval. While it supported text documents and images, images had to be processed using foundation models (FMs) or Bedrock Data Automation to generate text descriptions—a text-first approach that lost visual context and prevented visual search. Video and audio required custom external preprocessing pipelines. Now, with multimodal embeddings, the retriever natively supports text, images, audio, and video within a single embedding model.
With multimodal retrieval in Bedrock Knowledge Bases, you can now ingest, index, and retrieve information from text, images, video, and audio using a single, unified workflow. Content is encoded using multimodal embeddings that preserve visual and audio context, enabling your applications to find relevant information across media types. You can even search using an image to find visually similar content or locate specific scenes in videos.
In this post, we’ll guide you through building multimodal RAG applications. You’ll learn how multimodal knowledge bases work, how to choose the right processing strategy based on your content type, and how to configure and implement multimodal retrieval using both the console and code examples.
Understanding multimodal knowledge bases
Amazon Bedrock Knowledge Bases automates the complete RAG workflow: ingesting content from your data sources, parsing and chunking it into searchable segments, converting chunks to vector embeddings, and storing them in a vector database. During retrieval, user queries are embedded and matched against stored vectors to find semantically similar content, which augments the prompt sent to your foundation model.
With multimodal retrieval, this workflow now handles images, video, and audio alongside text through two processing approaches. Amazon Nova Multimodal Embeddings encodes content natively into a unified vector space, enabling cross-modal retrieval where you can query with text and retrieve videos, or search using images to find visual content.
Alternatively, Bedrock Data Automation converts multimedia into rich text descriptions and transcripts before embedding, providing high-accuracy retrieval over spoken content. Your choice depends on whether visual context or speech precision matters most for your use case.

We explore each of these approaches in this post.
Amazon Nova Multimodal Embeddings
Amazon Nova Multimodal Embeddings is the first unified embedding model that encodes text, documents, images, video, and audio into a single shared vector space. Content is processed natively without text conversion. The model supports up to 8,172 tokens for text and 30 seconds for video and audio segments, handles over 200 languages, and offers four embedding dimensions (3,072 by default, plus 1,024, 384, and 256) to balance accuracy and efficiency. Bedrock Knowledge Bases automatically segments video and audio into configurable chunks (5–30 seconds), with each segment independently embedded.

For video content, Nova embeddings capture visual elements—scenes, objects, motion, and actions—as well as audio characteristics like music, sounds, and ambient noise. For videos where spoken dialogue is important to your use case, you can use Bedrock Data Automation to extract transcripts alongside visual descriptions. For standalone audio files, Nova processes acoustic features such as music, environmental sounds, and audio patterns. The cross-modal capability enables use cases such as describing a visual scene in text to retrieve matching videos, uploading a reference image to find similar products, or locating specific actions in footage—all without pre-existing text descriptions.
Best for: Product catalogs, visual search, manufacturing videos, sports footage, security cameras, and scenarios where visual content drives the use case.
Amazon Bedrock Data Automation
Bedrock Data Automation takes a different approach by converting multimedia content into rich textual representations before embedding. For images, it generates detailed descriptions including objects, scenes, text within images, and spatial relationships. For video, it produces scene-by-scene summaries, identifies key visual elements, and extracts the on-screen text. For audio and video with speech, Bedrock Data Automation provides accurate transcriptions with timestamps and speaker identification, along with segment summaries that capture the key points discussed.

Once converted to text, this content is chunked and embedded using text embedding models like Amazon Titan Text Embeddings or Amazon Nova Multimodal Embeddings. This text-first approach enables highly accurate question-answering over spoken content—when users ask about specific statements made in a meeting or topics discussed in a podcast, the system searches through precise transcripts rather than audio embeddings. This makes it particularly valuable for compliance scenarios where you need exact quotes and verbatim records for audit trails, meeting analysis, customer support call mining, and use cases where you need to retrieve and verify specific spoken information.
Best for: Meetings, webinars, interviews, podcasts, training videos, support calls, and scenarios requiring precise retrieval of specific statements or discussions.
Use case scenario: Visual product search for e-commerce
Multimodal knowledge bases can be used for applications ranging from enhanced customer experiences and employee training to maintenance operations and legal analysis. Traditional e-commerce search relies on text queries, requiring customers to articulate what they're looking for with the right keywords. This breaks down when they've seen a product elsewhere, have a photo of something they like, or want to find items similar to what appears in a video.

Now, customers can search your product catalog using text descriptions, upload an image of an item they've photographed, or reference a scene from a video to find matching products. The system retrieves visually similar items by comparing the embedded representation of their query—whether text, image, or video—against the multimodal embeddings of your product inventory.

For this scenario, Amazon Nova Multimodal Embeddings is the ideal choice. Product discovery is fundamentally visual—customers care about colors, styles, shapes, and visual details. By encoding your product images and videos into the Nova unified vector space, the system matches based on visual similarity without relying on text descriptions that might miss subtle visual characteristics. While a complete recommendation system would incorporate customer preferences, purchase history, and inventory availability, retrieval from a multimodal knowledge base provides the foundational capability: finding visually relevant products regardless of how customers choose to search.
Console walkthrough
In the following section, we walk through the high-level steps to set up and test a multimodal knowledge base for our e-commerce product search example. We create a knowledge base containing smartphone product images and videos, then demonstrate how customers can search using text descriptions, uploaded images, or video references. The GitHub repository provides a guided notebook that you can follow to deploy this example in your account.
Prerequisites
Before you get started, make sure that you have the following prerequisites:

An AWS Account with appropriate service access
An AWS Identity and Access Management (IAM) role with the appropriate permissions to access Amazon Bedrock and Amazon Simple Storage Service (Amazon S3)

Provide the knowledge base details and data source type
Start by opening the Amazon Bedrock console and creating a new knowledge base. Provide a descriptive name for your knowledge base and select your data source type—in this case, Amazon S3 where your product images and videos are stored.

Configure data source
Connect your S3 bucket containing product images and videos. For the parsing strategy, select Amazon Bedrock default parser. Since we’re using Nova Multimodal Embeddings, the images and videos are processed natively and embedded directly into the unified vector space, preserving their visual characteristics without conversion to text.

Configure data storage and processing
Select Amazon Nova Multimodal Embeddings as your embedding model. This unified embedding model encodes both your product images and customer queries into the same vector space, enabling cross-modal retrieval where text queries can retrieve images and image queries can find visually similar products. For this example, we use Amazon S3 Vectors as the vector store (you could optionally use other available vector stores), which provides cost-effective and durable storage optimized for large-scale vector data sets while maintaining sub-second query performance. You also need to configure the multimodal storage destination by specifying an S3 location. Knowledge Bases uses this location to store extracted images and other media from your data source. When users query the knowledge base, relevant media is retrieved from this storage.

Review and create
Review your configuration settings, including the knowledge base details, data source configuration, embedding model selection, and vector store setup (Amazon S3 Vectors). For this example, the embedding model is Amazon Nova Multimodal Embeddings v1 with 3,072 vector dimensions; higher dimensions provide richer representations, and you can use lower dimensions such as 1,024, 384, or 256 to optimize for storage and cost. Once everything looks correct, create your knowledge base.
Create an ingestion job
Once created, initiate the sync process to ingest your product catalog. The knowledge base processes each image and video, generates embeddings, and stores them in the managed vector store. Monitor the sync status to confirm the documents are successfully indexed.
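If you prefer to script this step, the sync can also be started programmatically. The following is a minimal sketch using the boto3 bedrock-agent client; the knowledge base and data source IDs are placeholders, and you should confirm parameter names against the current API reference.

import time
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Placeholder IDs for your own knowledge base and data source
kb_id = "YOUR_KB_ID"
ds_id = "YOUR_DATA_SOURCE_ID"

job = bedrock_agent.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=ds_id)
job_id = job["ingestionJob"]["ingestionJobId"]

# Poll until the sync completes
while True:
    status = bedrock_agent.get_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=ds_id, ingestionJobId=job_id
    )["ingestionJob"]["status"]
    if status in ("COMPLETE", "FAILED"):
        print("Ingestion finished with status:", status)
        break
    time.sleep(10)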

Test the knowledge base using text as input in your prompt
With your knowledge base ready, test it using a text query in the console. Search with product descriptions like “A metallic phone cover” (or an equivalent description relevant to your own product media) to verify that text-based retrieval works correctly across your catalog.
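You can run the same test programmatically with the Retrieve API. The following is a minimal sketch using the boto3 bedrock-agent-runtime client; the knowledge base ID is a placeholder, and the request shape for image or video queries should be checked in the documentation.

import boto3

runtime = boto3.client("bedrock-agent-runtime")

response = runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    retrievalQuery={"text": "A metallic phone cover"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

for result in response["retrievalResults"]:
    # Each result carries a relevance score, content, and source location metadata
    print(result.get("score"), result.get("location"))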

Test the knowledge base using a reference image and retrieve different modalities
Now for the powerful part: visual search. Upload a reference image of a product you want to find. For example, imagine you saw a cell phone cover on another website and want to find similar items in your catalog. Simply upload the image without any additional text prompt.

The multimodal knowledge base extracts visual features from your uploaded image and retrieves visually similar products from your catalog. As you can see in the results, the system returns phone covers with similar design patterns, colors, or visual characteristics. Notice the metadata associated with each chunk in the Source details panel. The x-amz-bedrock-kb-chunk-start-time-in-millis and x-amz-bedrock-kb-chunk-end-time-in-millis fields indicate the exact temporal location of this segment within the source video. When building applications programmatically, you can use these timestamps to extract and display the specific video segment that matched the query, enabling features like “jump to relevant moment” or clip generation directly from your source videos. This cross-modal capability transforms the shopping experience—customers no longer need to describe what they’re looking for with words; they can show you.
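As a simple illustration of using those timestamp fields, the following sketch converts the chunk metadata into start and end offsets in seconds so a player or clipping tool could jump to the matched segment. The dictionary shape mirrors the metadata keys named above and the values are illustrative.

def chunk_time_range(chunk_metadata):
    # Metadata keys as surfaced in the Source details panel
    start_ms = int(chunk_metadata["x-amz-bedrock-kb-chunk-start-time-in-millis"])
    end_ms = int(chunk_metadata["x-amz-bedrock-kb-chunk-end-time-in-millis"])
    return start_ms / 1000.0, end_ms / 1000.0

# Example: a chunk that matched between 42.5 s and 57.0 s of the source video
start_s, end_s = chunk_time_range({
    "x-amz-bedrock-kb-chunk-start-time-in-millis": 42500,
    "x-amz-bedrock-kb-chunk-end-time-in-millis": 57000,
})
print(f"Jump to {start_s:.1f}s, play until {end_s:.1f}s")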
Test the knowledge base using a reference image and retrieve different modalities using Bedrock Data Automation
Now let's look at what the results would be if you had configured Bedrock Data Automation parsing during the data source setup. In the following screenshot, notice the transcript section in the Source details panel.

For each retrieved video chunk, Bedrock Data Automation automatically generates a detailed text description—in this example, describing the smartphone’s metallic rose gold finish, studio lighting, and visual characteristics. This transcript appears directly in the test window alongside the video, providing rich textual context. You get both visual similarity matching from the multimodal embeddings and detailed product descriptions that can answer specific questions about features, colors, materials, and other attributes visible in the video.
Clean-up
To clean up your resources, complete the following steps, starting with deleting the knowledge base:

On the Amazon Bedrock console, choose Knowledge Bases
Select your Knowledge Base and note both the IAM service role name and S3 Vector index ARN
Choose Delete and confirm

To delete the S3 vector index and vector bucket used as the vector store, use the following AWS Command Line Interface (AWS CLI) commands:

aws s3vectors delete-index --vector-bucket-name YOUR_VECTOR_BUCKET_NAME --index-name YOUR_INDEX_NAME --region YOUR_REGION
aws s3vectors delete-vector-bucket --vector-bucket-name YOUR_VECTOR_BUCKET_NAME --region YOUR_REGION

On the IAM console, find the role noted earlier
Select and delete the role

To delete the sample dataset:

On the Amazon S3 console, find your S3 bucket
Select and delete the files you uploaded for this tutorial

Conclusion
Multimodal retrieval for Amazon Bedrock Knowledge Bases removes the complexity of building RAG applications that span text, images, video, and audio. With native support for video and audio content, you can now build comprehensive knowledge bases that unlock insights from your enterprise data—not just text documents.
The choice between Amazon Nova Multimodal Embeddings and Bedrock Data Automation gives you flexibility to optimize for your specific content. The Nova unified vector space enables cross-modal retrieval for visual-driven use cases, while the Bedrock Data Automation text-first approach delivers precise transcription-based retrieval for speech-heavy content. Both approaches integrate seamlessly into the same fully managed workflow, alleviating the need for custom preprocessing pipelines.
Availability
Region availability depends on the features selected for multimodal support; refer to the documentation for details.
Next steps
Get started with multimodal retrieval today:

Explore the documentation: Review the Amazon Bedrock Knowledge Bases documentation and Amazon Nova User Guide for additional technical details.
Experiment with code examples: Check out the Amazon Bedrock samples repository for hands-on notebooks demonstrating multimodal retrieval.
Learn more about Nova: Read the Amazon Nova Multimodal Embeddings announcement for deeper technical insights.

About the authors
Dani Mitchell is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS). He is focused on helping accelerate enterprises across the world on their generative AI journeys with Amazon Bedrock and Bedrock AgentCore.
Pallavi Nargund is a Principal Solutions Architect at AWS. She is a generative AI lead for US Greenfield and leads the AWS for Legal Tech team. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Pallavi holds a Bachelor’s of Engineering from the University of Pune, India. She lives in Edison, New Jersey, with her husband, two girls, and her two pups.
Jean-Pierre Dodel is a Principal Product Manager for Amazon Bedrock, Amazon Kendra, and Amazon Quick Index. He brings 15 years of Enterprise Search and AI/ML experience to the team, with prior work at Autonomy, HP, and search startups before joining Amazon 8 years ago. JP is currently focusing on innovations for multimodal RAG, agentic retrieval, and structured RAG.

Nous Research Releases NousCoder-14B: A Competitive Olympiad Programmi …

Nous Research has introduced NousCoder-14B, a competitive olympiad programming model that is post-trained on Qwen3-14B using reinforcement learning (RL) with verifiable rewards. On the LiveCodeBench v6 benchmark, which covers problems from 08/01/2024 to 05/01/2025, the model reaches a Pass@1 accuracy of 67.87 percent. This is 7.08 percentage points higher than the Qwen3-14B baseline of 60.79 percent on the same benchmark. The research team trained the model on 24k verifiable coding problems using 48 B200 GPUs over 4 days, and released the weights under the Apache 2.0 license on Hugging Face.

https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/

Benchmark focus and what Pass@1 means

LiveCodeBench v6 is designed for competitive programming evaluation. The test split used here contains 454 problems. The training set uses the same recipe as the DeepCoder-14B project from Agentica and Together AI. It combines problems from TACO Verified, PrimeIntellect SYNTHETIC 1, and LiveCodeBench problems created before 07/31/2024.

The benchmark only includes competitive programming style tasks. For each problem, a solution must respect strict time and memory limits and must pass a large set of hidden input/output tests. Pass@1 is the fraction of problems where the first generated program passes all tests, including the time and memory constraints.
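In code, that definition of Pass@1 is just the mean of a per-problem pass indicator. The following minimal sketch, with made-up results, shows the computation:

# Each entry: did the first generated program pass every hidden test
# within the time and memory limits? (illustrative values only)
first_attempt_passed = [True, False, True, True, False]

pass_at_1 = sum(first_attempt_passed) / len(first_attempt_passed)
print(f"Pass@1 = {pass_at_1:.2%}")  # 60.00% for this toy example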

https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/

Dataset construction for execution based RL

All datasets used for training are composed of verifiable code generation problems. Each problem has a reference implementation and many test cases. The training set contains 24k problems drawn from:

TACO Verified

PrimeIntellect SYNTHETIC 1

LiveCodeBench problems that come before 07/31/2024

The test set is LiveCodeBench v6, which has 454 problems between 08/01/2024 and 05/01/2025.

Every problem is a complete competitive programming task with a description, input format, output format, and test cases. This setup is important for RL because it gives a binary reward signal that is cheap to compute once the code has run.

RL environment with Atropos and Modal

The RL environment is built using the Atropos framework. NousCoder-14B is prompted using the standard LiveCodeBench prompt format, and it generates Python code for each problem. Each rollout receives a scalar reward that depends on test case results, as sketched in the snippet after this list:

Reward 1 when the generated code passes all test cases for that problem

Reward −1 when the code outputs a wrong answer, exceeds a 15 second time limit, or exceeds a 4 GB memory limit on any test case
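Here is a minimal sketch of that reward rule; the test runner and result format are stand-ins for the actual Atropos and Modal machinery:

def rollout_reward(test_results):
    # test_results: list of dicts like {"passed": bool, "timed_out": bool, "oom": bool}
    # produced by a sandboxed runner (stand-in for the Modal verifier).
    for r in test_results:
        if r["timed_out"] or r["oom"] or not r["passed"]:
            return -1.0  # wrong answer, >15 s time limit, or >4 GB memory on any test
    return 1.0  # passed every test case for the problem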

To execute untrusted code safely and at scale, the team uses Modal as an autoscaled sandbox. In the main design, which the research team describes as the setting they actually use, the system launches one Modal container per rollout. Each container runs all test cases for that rollout. This avoids mixing training compute with verification compute and keeps the RL loop stable.

The research team also pipelines inference and verification. When an inference worker finishes a generation, it sends the completion to a Modal verifier and immediately starts a new generation. With many inference workers and a fixed pool of Modal containers, this design keeps the training loop inference compute bound instead of verification bound.

The team discusses 3 verification parallelization strategies: one container per problem, one per rollout, and one per test case. They ultimately avoid the per test case setting because of container launch overhead, and instead use an approach where each container evaluates many test cases, running a small set of the hardest test cases first. If any of these fail, the system can stop verification early.
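The hardest-tests-first idea can be sketched as follows; the difficulty ordering heuristic and the run_test callable are hypothetical stand-ins for the actual per-container evaluation:

def verify_with_early_stop(program, test_cases, run_test):
    # run_test is a stand-in for executing one test case in the sandbox.
    # Order historically hard tests first (the heuristic is a stand-in),
    # so a failing rollout is rejected after very few executions.
    ordered = sorted(test_cases, key=lambda t: t.get("difficulty", 0), reverse=True)
    for test in ordered:
        if not run_test(program, test):
            return False  # stop verification early on the first failure
    return True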

GRPO objectives, DAPO, GSPO, and GSPO+

NousCoder-14B uses Group Relative Policy Optimization (GRPO), which does not require a separate value model. On top of GRPO, the research team tests 3 objectives: Dynamic sAmpling Policy Optimization (DAPO), Group Sequence Policy Optimization (GSPO), and a modified GSPO variant called GSPO+.

All 3 objectives share the same definition of advantage. The advantage for each rollout is the reward for that rollout normalized by the mean and standard deviation of rewards inside the group. DAPO applies importance weighting and clipping at the token level, and introduces three main changes relative to GRPO:

A clip higher rule that increases exploration for low probability tokens

A token level policy gradient loss that gives each token equal weight

Dynamic sampling, where groups that are all correct or all incorrect are dropped because they carry zero advantage

GSPO moves the importance weighting to the sequence level. It defines a sequence importance ratio that aggregates token ratios over the whole program. GSPO+ keeps sequence level correction, but it rescales gradients so that tokens are weighted equally regardless of sequence length.
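The shared advantage definition is simple to write down. Here is a minimal sketch of group-normalized advantages, independent of which objective consumes them; the epsilon term and use of the population standard deviation are assumptions for numerical stability:

import statistics

def group_advantages(rewards, eps=1e-6):
    # rewards: scalar rewards for all rollouts of the same problem (one group)
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example group: three failed rollouts and one success
print(group_advantages([-1.0, -1.0, -1.0, 1.0]))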

On LiveCodeBench v6, the differences between these objectives are modest. At a context length of 81,920 tokens, DAPO reaches a Pass@1 of 67.87 percent while GSPO and GSPO+ reach 66.26 percent and 66.52 percent. At 40,960 tokens, all 3 objectives cluster around 63 percent Pass@1.

Iterative context extension and overlong filtering

Qwen3-14B supports long context and the training follows an iterative context extension schedule. The team first trains the model with a 32k context window and then continues training at the maximum Qwen3-14B context window of 40k. At each stage they select the checkpoint with the best LiveCodeBench score at 40k context and then use YaRN context extension at evaluation time to reach 80k tokens, that is 81,920 tokens.

A key trick is overlong filtering. When a generated program exceeds the maximum context window, they reset its advantage to zero. This removes that rollout from the gradient signal rather than penalizing it. The research team reports that this approach avoids pushing the model toward shorter solutions for purely optimization reasons and helps maintain quality when they scale context length at test time.
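A rough sketch of overlong filtering on top of the group-normalized advantages above; token counting and the context limit are placeholders:

def mask_overlong(advantages, rollout_lengths, max_context_tokens):
    # Zero the advantage of any rollout whose generation exceeded the context
    # window, removing it from the gradient instead of penalizing it.
    return [
        0.0 if length > max_context_tokens else adv
        for adv, length in zip(advantages, rollout_lengths)
    ]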

Key Takeaways

NousCoder-14B is a Qwen3-14B based competitive programming model trained with execution based RL; it reaches 67.87 percent Pass@1 on LiveCodeBench v6, a 7.08 percentage point gain over the Qwen3-14B baseline of 60.79 percent on the same benchmark.

The model is trained on 24k verifiable coding problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and LiveCodeBench tasks from before 07/31/2024, and evaluated on a disjoint LiveCodeBench v6 test set of 454 problems from 08/01/2024 to 05/01/2025.

The RL setup uses Atropos, with Python solutions executed in sandboxed containers, a simple reward of 1 for solving all test cases and minus 1 for any failure or resource limit breach, and a pipelined design where inference and verification run asynchronously.

Group Relative Policy Optimization objectives DAPO, GSPO, and GSPO+ are used for long context code RL, all operate on group normalized rewards, and show similar performance, with DAPO reaching the best Pass@1 at the longest 81,920 token context.

The training uses iterative context extension, first at 32k then at 40k tokens, along with YaRN based extension at evaluation time to 81,920 tokens, includes overlong rollout filtering for stability, and ships as a fully reproducible open stack with Apache 2.0 weights and RL pipeline code.

Check out the Model Weights and Technical details.
The post Nous Research Releases NousCoder-14B: A Competitive Olympiad Programming Model Post-Trained on Qwen3-14B via Reinforcement Learning appeared first on MarkTechPost.

Vercel Releases Agent Skills: A Package Manager For AI Coding Agents W …

Vercel has released agent-skills, a collection that turns best practice playbooks into reusable skills for AI coding agents. The project follows the Agent Skills specification and focuses first on React and Next.js performance, web design review, and claimable deployments on Vercel. Skills are installed with a command that feels similar to npm, and are then discovered by compatible agents during normal coding flows.

Agent Skills format

Agent Skills is an open format for packaging capabilities for AI agents. A skill is a folder that contains instructions and optional scripts. The format is designed so that different tools can understand the same layout.

A typical skill in vercel-labs/agent-skills has three main components:

SKILL.md for natural language instructions that describe what the skill does and how it should behave

a scripts directory for helper commands that the agent can call to inspect or modify the project

an optional references directory with additional documentation or examples

react-best-practices also compiles its individual rule files into a single AGENTS.md file. This file is optimized for agents. It aggregates the rules into one document that can be loaded as a knowledge source during a code review or refactor. This removes the need for ad-hoc prompt engineering per project.

Core skills in vercel-labs/agent-skills

The repository currently presents three main skills that target common front end workflows:

1. react-best-practices

This skill encodes React and Next.js performance guidance as a structured rule library. It contains more than 40 rules grouped into 8 categories. These cover areas such as elimination of network waterfalls, bundle size reduction, server side performance, client side data fetching, re-render behavior, rendering performance, and JavaScript micro optimizations.

Each rule includes an impact rating. Critical issues are listed first, then lower impact changes. Rules are expressed with concrete code examples that show an anti pattern and a corrected version. When a compatible agent reviews a React component, it can map findings directly onto these rules.

2. web-design-guidelines

This skill is focused on user interface and user experience quality. It includes more than 100 rules that span accessibility, focus handling, form behavior, animation, typography, images, performance, navigation, dark mode, touch interaction, and internationalization.

During a review, an agent can use these rules to detect missing ARIA attributes, incorrect label associations for form controls, misuse of animation when the user requests reduced motion, missing alt text or lazy loading on images, and other issues that are easy to miss during manual review.

3. vercel-deploy-claimable

This skill connects the agent review loop to deployment. It can package the current project into a tarball, auto detect the framework based on package.json, and create a deployment on Vercel. The script can recognize more than 40 frameworks and also supports static HTML sites.

The skill returns two URLs. One is a preview URL for the deployed site. The other is a claim URL. The claim URL allows a user or team to attach the deployment to their Vercel account without sharing credentials from the original environment.

Installation and integration flow

Skills can be installed from the command line. The launch announcement highlights a simple path:

npx skills i vercel-labs/agent-skills

This command fetches the agent-skills repository and prepares it as a skills package.

Vercel and the surrounding ecosystem also provide an add-skill CLI that is designed to wire skills into specific agents. A typical flow looks like this:

npx add-skill vercel-labs/agent-skills

add-skill scans for installed coding agents by checking their configuration directories. For example, Claude Code uses a .claude directory, and Cursor uses .cursor and a directory under the home folder. The CLI then installs the chosen skills into the correct skills folders for each tool.

You can call add-skill in non interactive mode to control exactly what is installed. For example, you can install only the React skill for Claude Code at a global level:

npx add-skill vercel-labs/agent-skills --skill react-best-practices -g -a claude-code -y

You can also list available skills before installing them:

npx add-skill vercel-labs/agent-skills --list

After installation, skills live in agent specific directories such as ~/.claude/skills or .cursor/skills. The agent discovers these skills, reads SKILL.md, and is then able to route relevant user requests to the correct skill.
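As a rough illustration of that discovery step, the following sketch scans the directories mentioned above and reads each skill's SKILL.md. The helper is hypothetical and is not part of add-skill or any agent's actual discovery logic.

from pathlib import Path

def list_installed_skills():
    # Typical agent-specific skills directories mentioned above
    candidates = [Path.home() / ".claude" / "skills", Path(".cursor") / "skills"]
    skills = {}
    for root in candidates:
        if root.is_dir():
            for skill_dir in root.iterdir():
                manifest = skill_dir / "SKILL.md"
                if manifest.is_file():
                    skills[skill_dir.name] = manifest
    return skills

print(list_installed_skills())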

Once the skills are installed, the user interacts through natural language. For example, ‘Review this component for React performance issues’ or ‘Check this page for accessibility problems’. The agent inspects the installed skills and uses react-best-practices or web-design-guidelines when appropriate.

Key Takeaways

vercel-labs/agent-skills implements the Agent Skills specification, packaging each capability as a folder with SKILL.md, optional scripts, and references, so different AI coding agents can consume the same skill layout.

The repository currently ships 3 skills: react-best-practices for React and Next.js performance, web-design-guidelines for UI and UX review, and vercel-deploy-claimable for creating claimable deployments on Vercel.

react-best-practices encodes more than 40 rules in 8 categories, ordered by impact, and provides concrete code examples, which lets agents run structured performance reviews instead of ad hoc prompt based checks.

web-design-guidelines provides more than 100 rules across accessibility, focus handling, forms, animation, typography, images, performance, navigation, dark mode, touch interaction, and internationalization, enabling systematic UI quality checks by agents.

Skills are installed through commands such as npx skills i vercel-labs/agent-skills and npx add-skill vercel-labs/agent-skills, then discovered from agent specific skills directories, which turns best practice libraries into reusable, version controlled building blocks for AI coding workflows.

Check out the GitHub Repo.
The post Vercel Releases Agent Skills: A Package Manager For AI Coding Agents With 10 Years of React and Next.js Optimisation Rules appeared first on MarkTechPost.

A Coding Guide to Understanding How Retries Trigger Failure Cascades i …

In this tutorial, we build a hands-on comparison between a synchronous RPC-based system and an asynchronous event-driven architecture to understand how real distributed systems behave under load and failure. We simulate downstream services with variable latency, overload conditions, and transient errors, and then drive both architectures using bursty traffic patterns. By observing metrics such as tail latency, retries, failures, and dead-letter queues, we examine how tight RPC coupling amplifies failures and how asynchronous event-driven designs trade immediate consistency for resilience. Throughout the tutorial, we focus on the practical mechanisms (retries, exponential backoff, circuit breakers, bulkheads, and queues) that engineers use to control cascading failures in production systems. Check out the FULL CODES here.

import asyncio, random, time, math, statistics
from dataclasses import dataclass, field
from collections import deque

def now_ms():
    return time.perf_counter() * 1000.0

def pctl(xs, p):
    if not xs:
        return None
    xs2 = sorted(xs)
    k = (len(xs2) - 1) * p
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return xs2[int(k)]
    return xs2[f] + (xs2[c] - xs2[f]) * (k - f)

@dataclass
class Stats:
    latencies_ms: list = field(default_factory=list)
    ok: int = 0
    fail: int = 0
    dropped: int = 0
    retries: int = 0
    timeouts: int = 0
    cb_open: int = 0
    dlq: int = 0

    def summary(self, name):
        l = self.latencies_ms
        return {
            "name": name,
            "ok": self.ok,
            "fail": self.fail,
            "dropped": self.dropped,
            "retries": self.retries,
            "timeouts": self.timeouts,
            "cb_open": self.cb_open,
            "dlq": self.dlq,
            "lat_p50_ms": round(pctl(l, 0.50), 2) if l else None,
            "lat_p95_ms": round(pctl(l, 0.95), 2) if l else None,
            "lat_p99_ms": round(pctl(l, 0.99), 2) if l else None,
            "lat_mean_ms": round(statistics.mean(l), 2) if l else None,
        }

We define the core utilities and data structures used throughout the tutorial. We establish timing helpers, percentile calculations, and a unified metrics container to track latency, retries, failures, and tail behavior. It gives us a consistent way to measure and compare RPC and event-driven executions. Check out the FULL CODES here.

@dataclass
class FailureModel:
    base_latency_ms: float = 8.0
    jitter_ms: float = 6.0
    fail_prob: float = 0.05
    overload_fail_prob: float = 0.40
    overload_latency_ms: float = 50.0

    def sample(self, load_factor: float):
        base = self.base_latency_ms + random.random() * self.jitter_ms
        if load_factor > 1.0:
            base += (load_factor - 1.0) * self.overload_latency_ms
            fail_p = min(0.95, self.fail_prob + (load_factor - 1.0) * self.overload_fail_prob)
        else:
            fail_p = self.fail_prob
        return base, (random.random() < fail_p)

class CircuitBreaker:
    def __init__(self, fail_threshold=8, window=20, open_ms=500):
        self.fail_threshold = fail_threshold
        self.window = window
        self.open_ms = open_ms
        self.events = deque(maxlen=window)
        self.open_until_ms = 0.0

    def allow(self):
        return now_ms() >= self.open_until_ms

    def record(self, ok: bool):
        self.events.append(not ok)
        if len(self.events) >= self.window and sum(self.events) >= self.fail_threshold:
            self.open_until_ms = now_ms() + self.open_ms

class Bulkhead:
    def __init__(self, limit):
        self.sem = asyncio.Semaphore(limit)

    async def __aenter__(self):
        await self.sem.acquire()

    async def __aexit__(self, exc_type, exc, tb):
        self.sem.release()

def exp_backoff(attempt, base_ms=20, cap_ms=400):
    return random.random() * min(cap_ms, base_ms * (2 ** (attempt - 1)))

We model failure behavior and resilience primitives that shape system stability. We simulate overload-sensitive latency and failures, and we introduce circuit breakers, bulkheads, and exponential backoff to control cascading effects. These components let us experiment with safe versus unsafe distributed-system configurations. Check out the FULL CODES here.

class DownstreamService:
    def __init__(self, fm: FailureModel, capacity_rps=250):
        self.fm = fm
        self.capacity_rps = capacity_rps
        self._inflight = 0

    async def handle(self, payload: dict):
        self._inflight += 1
        try:
            load_factor = max(0.5, self._inflight / (self.capacity_rps / 10))
            lat, should_fail = self.fm.sample(load_factor)
            await asyncio.sleep(lat / 1000.0)
            if should_fail:
                raise RuntimeError("downstream_error")
            return {"status": "ok"}
        finally:
            self._inflight -= 1

async def rpc_call(
    svc,
    req,
    stats,
    timeout_ms=120,
    max_retries=0,
    cb=None,
    bulkhead=None,
):
    t0 = now_ms()
    if cb and not cb.allow():
        stats.cb_open += 1
        stats.fail += 1
        return False

    attempt = 0
    while True:
        attempt += 1
        try:
            if bulkhead:
                async with bulkhead:
                    await asyncio.wait_for(svc.handle(req), timeout=timeout_ms / 1000.0)
            else:
                await asyncio.wait_for(svc.handle(req), timeout=timeout_ms / 1000.0)
            stats.latencies_ms.append(now_ms() - t0)
            stats.ok += 1
            if cb: cb.record(True)
            return True
        except asyncio.TimeoutError:
            stats.timeouts += 1
        except Exception:
            pass
        stats.fail += 1
        if cb: cb.record(False)
        if attempt <= max_retries:
            stats.retries += 1
            await asyncio.sleep(exp_backoff(attempt) / 1000.0)
            continue
        return False

We implement the synchronous RPC path and its interaction with downstream services. We observe how timeouts, retries, and in-flight load directly affect latency and failure propagation. It also highlights how tight coupling in RPC can amplify transient issues under bursty traffic. Check out the FULL CODES here.

@dataclass
class Event:
    id: int
    tries: int = 0

class EventBus:
    def __init__(self, max_queue=5000):
        self.q = asyncio.Queue(maxsize=max_queue)

    async def publish(self, e: Event):
        try:
            self.q.put_nowait(e)
            return True
        except asyncio.QueueFull:
            return False

async def event_consumer(
    bus,
    svc,
    stats,
    stop,
    max_retries=0,
    dlq=None,
    bulkhead=None,
    timeout_ms=200,
):
    while not stop.is_set() or not bus.q.empty():
        try:
            e = await asyncio.wait_for(bus.q.get(), timeout=0.2)
        except asyncio.TimeoutError:
            continue

        t0 = now_ms()
        e.tries += 1
        try:
            if bulkhead:
                async with bulkhead:
                    await asyncio.wait_for(svc.handle({"id": e.id}), timeout=timeout_ms / 1000.0)
            else:
                await asyncio.wait_for(svc.handle({"id": e.id}), timeout=timeout_ms / 1000.0)
            stats.ok += 1
            stats.latencies_ms.append(now_ms() - t0)
        except Exception:
            stats.fail += 1
            if e.tries <= max_retries:
                stats.retries += 1
                await asyncio.sleep(exp_backoff(e.tries) / 1000.0)
                await bus.publish(e)
            else:
                stats.dlq += 1
                if dlq is not None:
                    dlq.append(e)
        finally:
            bus.q.task_done()

We build the asynchronous event-driven pipeline using a queue and background consumers. We process events independently of request submission, apply retry logic, and route unrecoverable messages to a dead-letter queue. It demonstrates how decoupling improves resilience while introducing new operational considerations. Check out the FULL CODES here.

async def generate_requests(total=2000, burst=350, gap_ms=80):
    reqs = []
    rid = 0
    while rid < total:
        n = min(burst, total - rid)
        for _ in range(n):
            reqs.append(rid)
            rid += 1
        await asyncio.sleep(gap_ms / 1000.0)
    return reqs

async def main():
    random.seed(7)
    fm = FailureModel()
    svc = DownstreamService(fm)
    ids = await generate_requests()

    rpc_stats = Stats()
    cb = CircuitBreaker()
    bulk = Bulkhead(40)

    await asyncio.gather(*[
        rpc_call(svc, {"id": i}, rpc_stats, max_retries=3, cb=cb, bulkhead=bulk)
        for i in ids
    ])

    bus = EventBus()
    ev_stats = Stats()
    stop = asyncio.Event()
    dlq = []

    consumers = [
        asyncio.create_task(event_consumer(bus, svc, ev_stats, stop, max_retries=3, dlq=dlq))
        for _ in range(16)
    ]

    for i in ids:
        await bus.publish(Event(i))

    await bus.q.join()
    stop.set()
    for c in consumers:
        c.cancel()

    print(rpc_stats.summary("RPC"))
    print(ev_stats.summary("EventDriven"))
    print("DLQ size:", len(dlq))

# Top-level await works in a notebook such as Colab; in a plain script, use asyncio.run(main()) instead.
await main()

We drive both architectures with bursty workloads and orchestrate the full experiment. We collect metrics, cleanly terminate consumers, and compare outcomes across RPC and event-driven executions. The final step ties together latency, throughput, and failure behavior into a coherent system-level comparison.

In conclusion, we clearly saw the trade-offs between RPC and event-driven architectures in distributed systems. We observed that RPC offers lower latency when dependencies are healthy but becomes fragile under saturation, where retries and timeouts quickly cascade into system-wide failures. In contrast, the event-driven approach decouples producers from consumers, absorbs bursts through buffering, and localizes failures, but requires careful handling of retries, backpressure, and dead-letter queues to avoid hidden overload and unbounded queues. Through this tutorial, we demonstrated that resilience in distributed systems does not come from choosing a single architecture, but from combining the right communication model with disciplined failure-handling patterns and capacity-aware design.

Check out the FULL CODES here.
The post A Coding Guide to Understanding How Retries Trigger Failure Cascades in RPC and Event-Driven Architectures appeared first on MarkTechPost.

NVIDIA Releases PersonaPlex-7B-v1: A Real-Time Speech-to-Speech Model …

NVIDIA Researchers released PersonaPlex-7B-v1, a full duplex speech to speech conversational model that targets natural voice interactions with precise persona control.

From ASR→LLM→TTS to a single full duplex model

Conventional voice assistants usually run a cascade. Automatic Speech Recognition (ASR) converts speech to text, a language model generates a text answer, and Text to Speech (TTS) converts back to audio. Each stage adds latency, and the pipeline cannot handle overlapping speech, natural interruptions, or dense backchannels.

PersonaPlex replaces this stack with a single Transformer model that performs streaming speech understanding and speech generation in one network. The model operates on continuous audio encoded with a neural codec and predicts both text tokens and audio tokens autoregressively. Incoming user audio is incrementally encoded, while PersonaPlex simultaneously generates its own speech, which enables barge in, overlaps, rapid turn taking, and contextual backchannels.

PersonaPlex runs in a dual stream configuration. One stream tracks user audio, the other stream tracks agent speech and text. Both streams share the same model state, so the agent can keep listening while speaking and can adjust its response when the user interrupts. This design is directly inspired by Kyutai’s Moshi full duplex framework.

Hybrid prompting, voice control and role control

PersonaPlex uses two prompts to define the conversational identity.

The voice prompt is a sequence of audio tokens that encodes vocal characteristics, speaking style, and prosody.

The text prompt describes role, background, organization information, and scenario context.

Together, these prompts constrain both the linguistic content and the acoustic behavior of the agent. On top of this, a system prompt supports fields such as name, business name, agent name, and business information, with a budget up to 200 tokens.

Architecture, Helium backbone and audio path

The PersonaPlex model has 7B parameters and follows the Moshi network architecture. A Mimi speech encoder that combines ConvNet and Transformer layers converts waveform audio into discrete tokens. Temporal and depth Transformers process multiple channels that represent user audio, agent text, and agent audio. A Mimi speech decoder that also combines Transformer and ConvNet layers generates the output audio tokens. Audio uses a 24 kHz sample rate for both input and output.

PersonaPlex is built on Moshi weights and uses Helium as the underlying language model backbone. Helium provides semantic understanding and enables generalization outside the supervised conversational scenarios. This is visible in the ‘space emergency’ example, where a prompt about a reactor core failure on a Mars mission leads to coherent technical reasoning with appropriate emotional tone, even though this situation is not part of the training distribution.

Training data blend, real conversations and synthetic roles

Training is a single stage and uses a blend of real and synthetic dialogues.

Real conversations come from 7,303 calls, about 1,217 hours, in the Fisher English corpus. These conversations are back annotated with prompts using GPT-OSS-120B. The prompts are written at different granularity levels, from simple persona hints like ‘You enjoy having a good conversation’ to longer descriptions that include life history, location, and preferences. This corpus provides natural backchannels, disfluencies, pauses, and emotional patterns that are difficult to obtain from TTS alone.

Synthetic data covers assistant and customer service roles. The NVIDIA team reports 39,322 synthetic assistant conversations, about 410 hours, and 105,410 synthetic customer service conversations, about 1,840 hours. Qwen3-32B and GPT-OSS-120B generate the transcripts, and Chatterbox TTS converts them to speech. For assistant interactions, the text prompt is fixed as ‘You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way.’ For customer service scenarios, prompts encode organization, role type, agent name, and structured business rules such as pricing, hours, and constraints.

This design lets PersonaPlex disentangle natural conversational behavior, which comes mainly from Fisher, from task adherence and role conditioning, which come mainly from synthetic scenarios.

Evaluation on FullDuplexBench and ServiceDuplexBench

PersonaPlex is evaluated on FullDuplexBench, a benchmark for full duplex spoken dialogue models, and on a new extension called ServiceDuplexBench for customer service scenarios.

FullDuplexBench measures conversational dynamics with Takeover Rate and latency metrics for tasks such as smooth turn taking, user interruption handling, pause handling, and backchanneling. GPT-4o serves as an LLM judge for response quality in question answering categories. PersonaPlex reaches smooth turn taking TOR 0.908 with latency 0.170 seconds and user interruption TOR 0.950 with latency 0.240 seconds. Speaker similarity between voice prompts and outputs on the user interruption subset uses WavLM TDNN embeddings and reaches 0.650.

PersonaPlex outperforms many other open source and closed systems on conversational dynamics, response latency, interruption latency, and task adherence in both assistant and customer service roles.

https://research.nvidia.com/labs/adlr/personaplex/

Key Takeaways

PersonaPlex-7B-v1 is a 7B parameter full duplex speech to speech conversational model from NVIDIA, built on the Moshi architecture with a Helium language model backbone, code under MIT and weights under the NVIDIA Open Model License.

The model uses a dual stream Transformer with Mimi speech encoder and decoder at 24 kHz, it encodes continuous audio into discrete tokens and generates text and audio tokens at the same time, which enables barge in, overlaps, fast turn taking, and natural backchannels.

Persona control is handled by hybrid prompting: a voice prompt made of audio tokens sets timbre and style, while a text prompt and a system prompt of up to 200 tokens define role, business context, and constraints, with ready made voice embeddings such as the NATF and NATM families.

Training uses a blend of 7,303 Fisher conversations, about 1,217 hours, annotated with GPT-OSS-120B, plus synthetic assistant and customer service dialogs, about 410 hours and 1,840 hours, generated with Qwen3-32B and GPT-OSS-120B and rendered with Chatterbox TTS, which separates conversational naturalness from task adherence.

On FullDuplexBench and ServiceDuplexBench, PersonaPlex reaches smooth turn taking takeover rate 0.908 and user interruption takeover rate 0.950 with sub second latency and improved task adherence.

Check out the Technical details, Model weights and Repo.
The post NVIDIA Releases PersonaPlex-7B-v1: A Real-Time Speech-to-Speech Model Designed for Natural and Full-Duplex Conversations appeared first on MarkTechPost.