PSEEDR

Amazon AMET Payments: Reducing QA Cycles from Weeks to Hours with Multi-Agent AI

Coverage of aws-ml-blog

· PSEEDR Editorial

A look at how Amazon's internal payments team utilized the Strands Agents SDK and Amazon Bedrock to automate test case generation, shifting from manual workflows to agentic AI.

In a recent post, aws-ml-blog details how the Amazon AMET Payments team addressed a critical bottleneck in their software delivery lifecycle: the manual generation of quality assurance (QA) test cases. By deploying a multi-agent AI solution, the team compressed a workflow that traditionally required a full week of engineering effort into a matter of hours.

The Context: The Cost of Quality at Scale
For enterprise-grade software, particularly in high-stakes domains like payments processing, QA is non-negotiable. The Amazon AMET Payments team serves approximately 10 million customers monthly across five countries, releasing an average of five new features every month. In this environment, the sheer volume of test cases required to ensure stability can become a drag on velocity. Traditionally, creating comprehensive test suites for a single project consumed roughly 40 hours of manual work, equating to the cost of one full-time QA engineer annually just for test generation. As the industry moves toward agentic workflows, the challenge has been moving beyond simple code completion to building systems that understand complex business logic without hallucinating invalid scenarios.
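The post's figures can be sanity-checked with quick arithmetic. The sketch below assumes test-generation effort tracks the five-features-per-month release cadence (the post counts features rather than projects, so that mapping is our assumption) and uses a conventional ~2,000-hour approximation for one engineer-year:

```python
# Back-of-envelope check of the figures reported in the post.
HOURS_PER_PROJECT = 40      # manual test-suite creation per project (from the post)
PROJECTS_PER_MONTH = 5      # assumed to track the 5 features/month cadence
FTE_HOURS_PER_YEAR = 2000   # common approximation for one full-time engineer

annual_hours = HOURS_PER_PROJECT * PROJECTS_PER_MONTH * 12
print(annual_hours)                       # 2400
print(annual_hours / FTE_HOURS_PER_YEAR)  # 1.2 -> roughly one FTE-year
```

Under those assumptions, the team's "one full-time QA engineer annually" framing checks out.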

The Gist: SAARAM and Strands Agents
The blog post outlines the development of a solution named SAARAM (QA Lifecycle App). This internal tool leverages Amazon Bedrock, utilizing the Claude Sonnet model by Anthropic, and is orchestrated via the Strands Agents SDK. Unlike standard automation scripts, this approach employs a multi-agent architecture where different AI agents collaborate to analyze requirements and generate test logic.

A distinct aspect of this implementation is its "human-centric" design philosophy. Rather than focusing solely on optimizing algorithmic performance, the engineering team studied the cognitive patterns of human QA testers. They mapped out how a human engineer deconstructs a feature into testable units and designed the agents to mimic this step-by-step reasoning. This cognitive alignment helped ensure that the generated tests were not just syntactically correct, but semantically relevant to the business logic.
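The post does not show how SAARAM encodes that decomposition, but the idea can be sketched in a few lines. The structure below (the `TestableUnit` fields and the one-happy-path-plus-one-negative-case-per-criterion heuristic) is our illustrative assumption, not the team's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TestableUnit:
    behavior: str      # what the feature should do
    condition: str     # the scenario under test
    expectation: str   # the observable outcome

def decompose(feature: str, acceptance_criteria: list[str]) -> list[TestableUnit]:
    """Mimic a human tester: one positive and one negative case per criterion."""
    units = []
    for criterion in acceptance_criteria:
        units.append(TestableUnit(feature, f"given {criterion}", "succeeds"))
        units.append(TestableUnit(feature, f"given violation of {criterion}",
                                  "is rejected"))
    return units

cases = decompose("refund payment",
                  ["amount <= original charge", "payment is settled"])
print(len(cases))  # 4
```

Aligning the agent's intermediate representation with how a human tester thinks is what the post credits for keeping generated tests semantically relevant.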

Engineering Reliability
One of the primary hurdles in applying generative AI to QA is the tendency for models to hallucinate: inventing features or test parameters that do not exist. The AMET team mitigated this by enforcing structured outputs. By constraining the AI's responses to specific formats, they significantly improved the reliability of the generated code, allowing the system to be integrated directly into the CI/CD pipeline. The result was not only a reduction in time but an improvement in test coverage quality.
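The post doesn't show the team's exact schema enforcement. One common pattern, sketched here stdlib-only with a hypothetical test-case schema, is to validate model output against a strict field set and reject anything with missing or unknown fields before it reaches the pipeline:

```python
import json

# Hypothetical schema for a generated test case; not SAARAM's actual format.
REQUIRED = {"test_id", "preconditions", "steps", "expected_result"}

def parse_test_case(raw: str) -> dict:
    """Reject malformed or hallucinated output before it enters CI/CD."""
    case = json.loads(raw)          # must at least be valid JSON
    missing = REQUIRED - case.keys()
    extra = case.keys() - REQUIRED  # unknown fields often signal hallucination
    if missing or extra:
        raise ValueError(f"schema violation: missing={missing}, extra={extra}")
    return case

good = ('{"test_id": "TC-1", "preconditions": [], '
        '"steps": ["submit payment"], "expected_result": "approved"}')
print(parse_test_case(good)["test_id"])  # TC-1
```

Frameworks like Strands Agents and Bedrock support structured-output modes that push this constraint into the model call itself rather than post-hoc validation.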

This case study serves as a practical blueprint for organizations looking to implement agentic AI in production. It demonstrates that success often lies not just in the choice of the model, but in the orchestration framework and the strict structuring of inputs and outputs.

For a detailed breakdown of the architecture and the specific benefits realized by the payments team, we recommend reading the full report.

Read the full post at aws-ml-blog

Key Takeaways

  • The Amazon AMET Payments team reduced test case generation time from 1 week to mere hours using a multi-agent AI solution.
  • The solution, SAARAM, utilizes Amazon Bedrock (Claude Sonnet) and the Strands Agents SDK to orchestrate complex QA tasks.
  • A "human-centric" design approach was used to model AI behavior on human cognitive patterns, improving the relevance of generated tests.
  • Structured outputs were implemented to strictly control AI responses, significantly reducing hallucinations and increasing reliability.
  • The initiative saved the equivalent of one full-time engineer's annual workload per project while improving overall test coverage.