Wednesday, January 28, 2026

Build a reliable agentic AI solution with Amazon Bedrock: Learn from Pushpay's journey on GenAI evaluation


This post was co-written with Saurabh Gupta and Todd Colby from Pushpay.

Pushpay is a market-leading digital giving and engagement platform designed to help churches and faith-based organizations drive community engagement, manage donations, and strengthen generosity fundraising processes efficiently. Pushpay's church management system provides church administrators and ministry leaders with insight-driven reporting, donor development dashboards, and automation of financial workflows.

Using the power of generative AI, Pushpay developed an innovative agentic AI search feature built for the unique needs of ministries. The approach uses natural language processing so ministry staff can ask questions in plain English and generate real-time, actionable insights from their community data. The AI search feature addresses a critical challenge faced by ministry leaders: the need for quick access to community insights without requiring technical expertise. For example, ministry leaders can enter "show me people who are members in a group, but haven't given this year" or "show me people who are not engaged in my church," and use the results to take meaningful action to better support individuals in their community. Most community leaders are time-constrained and lack technical backgrounds; with this solution they can obtain meaningful data about their congregations in seconds using natural language queries.

By empowering ministry staff with faster access to community insights, the AI search feature supports Pushpay's mission to inspire generosity and connection between churches and their community members. Early adopters report that the solution has shortened their time to insights from minutes to seconds. To achieve this outcome, the Pushpay team built the feature using agentic AI capabilities on Amazon Web Services (AWS) while implementing robust quality assurance measures and establishing a rapid, iterative feedback loop for continuous improvement.

In this post, we walk you through Pushpay's journey in building this solution and explore how Pushpay used Amazon Bedrock to create a custom generative AI evaluation framework for continuous quality assurance and rapid iteration feedback loops on AWS.

Solution overview: AI-powered search architecture

The solution consists of several key components that work together to deliver an enhanced search experience. The following figure shows the solution architecture diagram and the overall workflow.

Figure 1: AI Search Solution Architecture

  • User interface layer: The solution begins with Pushpay users submitting natural language queries through the existing Pushpay application interface. By using natural language queries, church ministry staff can obtain data insights from AI capabilities without learning new tools or interfaces.
  • AI search agent: At the heart of the system lies the AI search agent, which consists of two key components:
    • System prompt: Contains the large language model (LLM) role definitions, instructions, and application descriptions that guide the agent's behavior.
    • Dynamic prompt constructor (DPC): Automatically constructs additional customized system prompts based on user-specific information, such as church context, sample queries, and the application filter inventory. It also uses semantic search to select only the relevant filters among hundreds of available application filters. The DPC improves response accuracy and user experience.
  • Amazon Bedrock advanced features: The solution uses the following Amazon Bedrock managed capabilities:
    • Prompt caching: Reduces latency and cost by caching frequently used system prompts.
    • LLM processing: Uses Claude Sonnet 4.5 to process prompts and generate the JSON output the application requires to display the desired query results as insights to users (a minimal invocation sketch follows this list).
  • Evaluation system: The evaluation system implements a closed-loop improvement process in which user interactions are instrumented, captured, and evaluated offline. The evaluation results feed into a dashboard that product and engineering teams use to analyze and drive iterative improvements to the AI search agent. During this process, the data science team collects a golden dataset and continuously curates it based on actual user queries paired with validated responses.
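To make these pieces concrete, the following is a minimal sketch, under stated assumptions, of how an agent like this could call the Amazon Bedrock Converse API with a cached static system prompt plus DPC-built context. The model ID, file name, and helper functions are illustrative and not Pushpay's implementation.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Large, stable instructions: role definition, filter tool descriptions, and so on.
with open("system_prompt.txt") as f:
    STATIC_SYSTEM_PROMPT = f.read()

def build_dynamic_context(church_context: dict, relevant_filters: list[str]) -> str:
    """Hypothetical stand-in for the dynamic prompt constructor (DPC)."""
    return (
        f"Church context: {json.dumps(church_context)}\n"
        f"Available filters: {json.dumps(relevant_filters)}\n"
        "Respond only with a JSON document describing the filters to apply."
    )

def run_search_agent(query: str, church_context: dict, relevant_filters: list[str]) -> dict:
    response = bedrock.converse(
        # Illustrative model ID for Claude Sonnet 4.5; confirm availability in your Region.
        modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
        system=[
            {"text": STATIC_SYSTEM_PROMPT},
            {"cachePoint": {"type": "default"}},  # cache everything above this point
            {"text": build_dynamic_context(church_context, relevant_filters)},
        ],
        messages=[{"role": "user", "content": [{"text": query}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```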

The challenges of the initial solution without evaluation

To create the AI search feature, Pushpay developed the first iteration of the AI search agent. The solution implements a single agent configured with a carefully tuned system prompt that includes the system role, instructions, and an explanation of how the user interface works, with detailed descriptions of each filter tool and its sub-settings. The system prompt is cached using Amazon Bedrock prompt caching to reduce token cost and latency. The agent uses the system prompt to invoke an Amazon Bedrock LLM, which generates the JSON document that Pushpay's application uses to apply filters and present query results to users.
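For illustration only, the JSON document returned by the agent could resemble the sketch below; the field names and operators are hypothetical and not Pushpay's actual schema.

```python
# Hypothetical shape of the agent's JSON output; the application would map each
# entry onto one of its configurable filters before running the query.
example_agent_output = {
    "filters": [
        {"field": "group_membership", "operator": "is_member", "value": True},
        {"field": "giving_current_year", "operator": "equals", "value": 0},
    ],
    "combine": "AND",
}
```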

However, this first iteration quickly revealed some limitations. While it demonstrated a 60-70% success rate on basic business queries, the team reached an accuracy plateau. Evaluation of the agent was a manual and tedious process. Tuning the system prompt beyond this accuracy threshold proved difficult given the diverse spectrum of user queries and the application's coverage of over 100 distinct configurable filters. These presented critical blockers on the team's path to production.

Figure 2: AI Search First Solution

Improving the solution by adding a custom generative AI evaluation framework

To address the challenges of measuring and improving agent accuracy, the team implemented a generative AI evaluation framework integrated into the existing architecture, shown in the following figure. This framework consists of four key components that work together to provide comprehensive performance insights and enable data-driven improvements.

Figure 3: Introducing the GenAI Evaluation Framework

  1. The golden dataset: A curated golden dataset containing over 300 representative queries, each paired with its corresponding expected output, forms the foundation of automated evaluation. The product and data science teams carefully developed and validated this dataset to achieve comprehensive coverage of real-world use cases and edge cases. In addition, a continuous curation process adds representative actual user queries with validated results.
  2. The evaluator: The evaluator component processes user input queries and compares the agent-generated output against the golden dataset using the LLM-as-a-judge pattern (a minimal evaluator sketch follows this list). This approach generates core accuracy metrics while capturing detailed logs and performance data, such as latency, for further analysis and debugging.
  3. Domain category: Domain categories are developed using a combination of generative AI domain summarization and human-defined regular expressions to effectively categorize user queries. The evaluator determines the domain category for each query, enabling nuanced, category-based evaluation as an additional dimension of the evaluation metrics.
  4. Generative AI evaluation dashboard: The dashboard serves as mission control for Pushpay's product and engineering teams, displaying domain category-level metrics to assess performance and latency and to guide decisions. It shifts the team from single aggregate scores to nuanced, domain-based performance insights.
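A minimal sketch of such an offline evaluator is shown below, assuming a golden dataset of (query, expected output, domain) records and an LLM-as-a-judge grading prompt; the judge prompt, model ID, and helper names are illustrative, not Pushpay's implementation.

```python
import json
from collections import defaultdict
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative grading prompt for the LLM-as-a-judge pattern.
JUDGE_PROMPT = (
    "You are grading a search agent. Compare the expected filter JSON with the "
    "agent's filter JSON and reply with a single word: CORRECT or INCORRECT.\n"
    "Expected: {expected}\nActual: {actual}"
)

def judge(expected: dict, actual: dict) -> bool:
    response = bedrock.converse(
        modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",  # illustrative
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(
                expected=json.dumps(expected), actual=json.dumps(actual))}],
        }],
        inferenceConfig={"maxTokens": 10, "temperature": 0},
    )
    verdict = response["output"]["message"]["content"][0]["text"].strip().upper()
    return verdict.startswith("CORRECT")

def evaluate(golden_dataset: list[dict], agent) -> dict[str, float]:
    """Return accuracy per domain category, ready for dashboard reporting."""
    correct, total = defaultdict(int), defaultdict(int)
    for record in golden_dataset:
        actual = agent(record["query"])  # the agent under test
        total[record["domain"]] += 1
        if judge(record["expected"], actual):
            correct[record["domain"]] += 1
    return {domain: correct[domain] / total[domain] for domain in total}
```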

The accuracy dashboard: Pinpointing weaknesses by domain

Because user queries are categorized into domain categories, the dashboard incorporates statistical confidence visualization, using a 95% Wilson score interval to display accuracy metrics and query volumes at each domain level. By using categories, the team can pinpoint the AI agent's weaknesses by domain. In the following example, the "activity" domain shows significantly lower accuracy than other categories.

Figure 4: Pinpointing Agent Weaknesses by Domain
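For reference, the 95% Wilson score interval behind these confidence bands can be computed as shown below; this is the standard formula rather than code from Pushpay's dashboard, and the example counts are made up.

```python
from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed accuracy of correct/total."""
    if total == 0:
        return (0.0, 0.0)
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2))
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# Made-up example: 52 correct out of 60 queries gives roughly (0.76, 0.93),
# a wider band than the same accuracy measured over 600 queries would give.
print(wilson_interval(52, 60))
```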

Additionally, a performance dashboard, shown in the following figure, visualizes latency indicators at the domain category level, including latency distributions from the p50 to p90 percentiles. In the following example, the activity domain shows notably higher latency than the others.

Figure 5: Identifying Latency Bottlenecks by Domain

Strategic rollout through domain-level insights

Domain-based metrics revealed varying performance levels across semantic domains, providing critical insight into agent effectiveness. Pushpay used this granular visibility to make strategic feature rollout decisions. By temporarily suppressing underperforming categories, such as activity queries, while they underwent optimization, the system achieved 95% overall accuracy. With this approach, users experienced only the highest-performing features while the team refined the others to production standards.

Figure 6: Reaching 95% Accuracy with Domain-Level Feature Rollout
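A rollout gate of this kind could look roughly like the following sketch; the suppressed domain set and the fallback to the existing non-AI search path are assumptions made for illustration.

```python
# Domains temporarily held back from the AI path while they are being optimized.
SUPPRESSED_DOMAINS = {"activity"}

def route_query(query: str, classify_domain, run_agent, run_standard_search):
    """Serve only high-performing domains through the agent; fall back otherwise."""
    domain = classify_domain(query)
    if domain in SUPPRESSED_DOMAINS:
        return run_standard_search(query)  # existing, non-AI search experience
    return run_agent(query)
```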

Strategic prioritization: Focusing on high-impact domains

To prioritize improvements systematically, Pushpay employed a 2×2 matrix framework plotting topics against two dimensions (shown in the following figure): business priority (vertical axis) and current performance or feasibility (horizontal axis). This visualization placed topics with both high business value and strong existing performance in the top-right quadrant. The team then focused on these areas because they required less heavy lifting to lift accuracy further, from already-good levels to an exceptional 95% accuracy for the business-focused topics.

The implementation followed an iterative cycle: after each round of improvements, the team re-analyzes the results to identify the next set of high-potential topics. This systematic, cyclical approach enabled continuous optimization while maintaining focus on business-critical areas.

Figure 7: Strategic Prioritization Framework for Domain Category Optimization

Dynamic prompt construction

The insights gained from the evaluation framework led to an architectural enhancement: the introduction of a dynamic prompt constructor. This component enabled rapid iterative improvements by allowing fine-grained control over which domain categories the agent could address. The structured field inventory, previously embedded in the system prompt, was transformed into a dynamic element that uses semantic search to assemble contextually relevant prompts for each user query. This approach tailors the prompt filter inventory based on three key contextual dimensions: query content, user persona, and tenant-specific requirements. The result is a more precise and efficient system that generates highly relevant responses while maintaining the flexibility needed for continuous optimization.
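The following is a minimal sketch of how semantic filter selection could work, assuming an Amazon Titan Text Embeddings model and an in-memory filter inventory; the model ID, names, and top-k cutoff are illustrative rather than Pushpay's implementation. In practice, the filter-description embeddings would be precomputed and cached so that only the query is embedded at request time.

```python
import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    """Embed text with a Titan embeddings model (illustrative model ID)."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

def select_filters(query: str, filter_inventory: dict[str, str], top_k: int = 15) -> list[str]:
    """Return the names of the filters whose descriptions best match the query."""
    query_vec = embed(query)
    scored = []
    for name, description in filter_inventory.items():
        vec = embed(description)  # precompute and cache these in a real system
        similarity = float(vec @ query_vec / (np.linalg.norm(vec) * np.linalg.norm(query_vec)))
        scored.append((similarity, name))
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]
```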

Business impact

The generative AI evaluation framework became the cornerstone of Pushpay's AI feature development, delivering measurable value across three dimensions:

  • User experience: The AI search feature reduced time-to-insight from roughly 120 seconds (experienced users manually navigating a complex UX) to below 4 seconds, a more than 15-fold acceleration that directly supports ministry leaders' productivity and decision-making speed. The feature democratized data insights, so that users of varying technical levels can access meaningful intelligence without specialized expertise.
  • Development velocity: The scientific evaluation approach transformed optimization cycles. Rather than debating prompt modifications, the team now validates changes and measures domain-specific impacts within minutes, replacing prolonged deliberations with data-driven iteration.
  • Production readiness: Improvements from 60-70% accuracy to more than 95% accuracy on the high-performing domains provided the quantitative confidence required for customer-facing deployment, while the framework's architecture enables continuous refinement across the remaining domain categories.

Key takeaways for your AI agent journey

The following are key takeaways from Pushpay's experience that you can apply in your own AI agent journey.

1/ Build with production in mind from day one

Building agentic AI systems is simple, but scaling them to production is hard. Builders should adopt a scaling mindset during the proof-of-concept phase, not after. Implementing robust tracing and evaluation frameworks early provides a clear pathway from experimentation to production. With this methodology, teams can identify and address accuracy issues systematically before they become blockers.

2/ Utilize the advanced features of Amazon Bedrock

Amazon Bedrock prompt caching significantly reduces token costs and latency by caching frequently used system prompts. For agents with large, stable system prompts, this feature is essential for production-grade performance.

3/ Think beyond aggregate metrics

Aggregate accuracy scores can mask critical performance variations. By evaluating agent performance at the domain category level, Pushpay uncovered weaknesses that a single accuracy metric cannot capture. This granular approach enables targeted optimization and informed rollout decisions, so users only experience high-performing features while the others are refined.

4/ Data security and responsible AI

When developing agentic AI systems, consider data security and LLM safety concerns from the outset, following the AWS Shared Responsibility Model, because security requirements fundamentally influence the architectural design. Pushpay's customers are churches and faith-based organizations who are stewards of sensitive information, including pastoral care conversations, financial giving patterns, family struggles, prayer requests, and more. In this implementation, Pushpay set a clear approach to incorporating AI ethically within its product ecosystem, maintaining strict security standards to ensure church data and personally identifiable information (PII) remain within its secure partnership ecosystem. Data is shared only with secure and appropriate data protections applied and is not used to train external models. To learn more about Pushpay's standards for incorporating AI within their products, visit the Pushpay Information Center for a more in-depth review of company standards.

Conclusion: Your Path to Production-Ready AI Agents

Pushpay's journey from a 60-70% accuracy prototype to a 95% accurate, production-ready AI agent demonstrates that building reliable agentic AI systems requires more than refined prompts: it demands a systematic, data-driven approach to evaluation and optimization. The key breakthrough wasn't in the AI technology itself, but in implementing a comprehensive evaluation framework, built on a strong observability foundation, that provided granular visibility into agent performance across different domains. This systematic approach enabled rapid iteration, strategic rollout decisions, and continuous improvement.

Ready to build your own production-ready AI agent?

  • Explore Amazon Bedrock: Start building your agent with Amazon Bedrock
  • Implement LLM-as-a-judge: Create your own evaluation system using the patterns described in LLM-as-a-judge on Amazon Bedrock Model Evaluation
  • Build your golden dataset: Start curating representative queries and expected outputs for your specific use case

About the authors

Roger Wang is a Senior Solution Architect at AWS. He is a seasoned architect with over 20 years of experience in the software industry. He helps New Zealand and global software and SaaS companies use cutting-edge technology at AWS to solve complex business challenges. Roger is passionate about bridging the gap between business drivers and technological capabilities and thrives on facilitating conversations that drive impactful outcomes.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where she focuses on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of large language models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Frank Huang, PhD, is a Senior Analytics Specialist Solutions Architect at AWS based in Auckland, New Zealand. He focuses on helping customers deliver advanced analytics and AI/ML solutions. Throughout his career, Frank has worked across a variety of industries such as financial services, Web3, hospitality, media and entertainment, and telecommunications. Frank is eager to use his deep expertise in cloud architecture, AIOps, and end-to-end solution delivery to help customers achieve tangible business outcomes with the power of data and AI.

Saurabh Gupta is a data science and AI professional at Pushpay based in Auckland, New Zealand, where he focuses on implementing practical AI solutions and statistical modeling. He has extensive experience in machine learning, data science, and Python for data science applications, with specialized training in database agents and AI implementation. Prior to his current role, he gained experience in telecom, retail, and financial services, developing expertise in marketing analytics and customer retention programs. He has a Master's in Statistics from the University of Auckland and a Master's in Business Administration from the Indian Institute of Management, Calcutta.

Todd Colby is a Senior Software Engineer at Pushpay based in Seattle. His expertise is focused on evolving complex legacy applications with AI and translating user needs into structured, high-accuracy solutions. He leverages AI to increase delivery speed and produce innovative metrics and business decision tools.
