Governing Intelligence: How AI Is Reshaping Public Sector Software (feat. Andrew Stockwell)
EUNA Solutions' VP of AI reveals how rigorous observability, purpose-built guardrails, and a centralized AI gateway make responsible public sector AI deployable at scale.
Deploying AI in regulated, mission-critical environments is a challenge of a different order from shipping a consumer app. Where most AI practitioners enjoy the freedom to iterate quickly and fail cheaply, public sector software vendors must satisfy procurement regulations, legal liability constraints, and a profound obligation to public trust. Andrew Stockwell, VP of AI at EUNA Solutions — a leading provider of cloud-based software for government bodies across the United States and Canada — has spent years operating at this intersection. In a wide-ranging conversation on the Snowpal Podcast, Stockwell walked through the technical decisions, architectural patterns, and organizational strategies his team uses to ship production-quality AI responsibly in one of the world’s most demanding verticals.
Sections
Augmentation over automation — why human-in-the-loop is a legal necessity, not timidity
LLM Ops in practice — how Arize, test sets, and iterative prompt tuning enforce reliability
Base LLMs, RAG, and differentiation — where the moat actually lives (and how it evolves)
The AI gateway — multi-tenancy, PII removal, prompt injection guards, and model flexibility
Token economics and ROI — the 3× budget overrun and how departmental accountability replaced centralized approval
SDLC transformation — halved time-to-merge, citizen developers, and the guardrail standardization challenge
The SaaS landscape — why trust and compliance posture compound into a durable moat
Context windows and the horizon — what becomes possible as context limits expand
Podcast
Trust the Guardrails: Building AI That Governments Can Actually Use — on Apple and Spotify.
1. Augmentation Over Automation: The Public Sector Constraint
The starting point for any AI deployment at EUNA Solutions is a deliberate philosophical choice: keep a human in the loop. This is not timidity — it is a response to the legal and regulatory realities of government procurement, grants, and budget management.
“There’s a strong bias for augmentation instead of automation... It’s about trust, explainability and transparency. We cannot roll out any agentic solution that we have not tested thoroughly.”
Stockwell’s concrete example — EUNA Solutions’ AI solicitation agent — illustrates the principle sharply. The agent analyzes Request for Proposal (RFP) documents and suggests categories a procurement officer may have overlooked. For a fire engine RFP, the system might prompt an official to specify hose diameter or wheel size. But it stops there deliberately.
“We cannot say to them the hose diameter needs to be X or Y, because then we kind of hold liable.”
The boundary between recommendation and prescription is not a product design preference; it is a legal firewall. Every guardrail, every observability hook, and every evaluation run exists to enforce that line in production, not just in staging.
2. LLM Ops in Practice: Building Confidence Before Go-Live
With liability stakes this high, EUNA Solutions treats LLM observability as a first-class engineering discipline, not an afterthought. The team uses Arize — an LLM ops and observability platform comparable to LangSmith and Langfuse — to evaluate agents outside production before release.
“We have the system prompt, we have the agent, and then we run a standardised test set through Arize and we have an expected result that we want. And we have the actual output that was outputted from the LLM. And then we were able to see the difference between the two, adjust the system prompt and rerun those evaluations again to make sure we’re getting the level of accuracy that we want.”
The same observability loop runs in production. Drift in model output — inevitable as underlying models are updated by providers — is caught early and corrected by updating system or guardrail prompts before customers are affected.
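The evaluate-adjust-rerun loop Stockwell describes can be sketched in a few lines. This is not the Arize API — `call_llm` is a toy stand-in for a real model call, and the single-item test set is purely illustrative — but it captures the shape of the workflow: score a system prompt against expected outputs, tighten the prompt, and rerun the same evaluations.

```python
# Minimal sketch of the evaluate/adjust/rerun loop described above.
# `call_llm` is a stand-in for a real LLM call; the test set pairs each
# input with the output the team expects for that input.

def call_llm(system_prompt: str, user_input: str) -> str:
    """Placeholder for a real model call routed through the gateway."""
    # Toy behavior: only a prompt with explicit output instructions
    # produces the expected category format.
    if "respond with a category name only" in system_prompt:
        return {"fire engine RFP": "vehicle-specification"}.get(user_input, "unknown")
    return "Here is a long chatty answer about " + user_input

def accuracy(system_prompt: str, test_set: list[tuple[str, str]]) -> float:
    hits = sum(call_llm(system_prompt, inp) == expected for inp, expected in test_set)
    return hits / len(test_set)

test_set = [("fire engine RFP", "vehicle-specification")]

# First run: the loose prompt fails the evaluation...
loose = "You are a helpful procurement assistant."
# ...so the prompt is tightened and the same test set is rerun.
strict = loose + " Always respond with a category name only."

assert accuracy(loose, test_set) < accuracy(strict, test_set)
print(f"loose: {accuracy(loose, test_set):.0%}, strict: {accuracy(strict, test_set):.0%}")
```

The key property is that the test set, not the prompt, is the stable artifact: when a provider updates the underlying model, the same evaluations re-score the new behavior and flag drift.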
The Guardrail Feedback Loop
Guardrails at EUNA Solutions are product-specific rather than generic. For each feature, product managers and engineers define what the model must not output. The guardrail layer intercepts every agent response and re-prompts the model if the response violates a rule — iterating until the output is compliant before it is surfaced to the user.
“The guardrail reviews it. If it’s not [acceptable], it goes back to the agent and it keeps iterating until the response gets back. And then it’s populated in the front end for the customer to see.”
This architecture means that test coverage at the system-prompt level becomes a form of regression testing. QA engineers own the evaluation suites; software engineers own the prompts. The distinction matters because evaluation quality is ultimately a domain knowledge problem, not just a technical one.
3. Base LLMs, RAG, and the Question of Differentiation
A natural question arises: if the solicitation agent is powered by a base LLM with no proprietary fine-tuning, what stops a competitor from replicating it? Stockwell’s answer is pragmatic and instructive for anyone building in the LLM application layer.
“These large language models are trained on billions of parameters of data. So they have all those public RFPs that have been published in the past in them... The system prompt is probably the most important in this. That is a very, very long prompt, a lot of tokens where we give it examples of how we want it to output. We give it the wording that we want it to use. We give it different scenarios.”
The moat, in the near term, is prompt engineering depth, evaluation infrastructure, and the guardrail layer — not proprietary model weights. Over time, Stockwell anticipates a migration toward fine-tuned models or Retrieval-Augmented Generation (RAG) pipelines seeded with the company’s accumulated private data.
RAG as a Shared Capability
RAG is not just a customer-facing feature at EUNA Solutions — it is an internal developer productivity tool. The team has built reusable RAG patterns that engineering teams can adopt by pointing their data at a managed vector store (currently AWS OpenSearch) and calling through the AI gateway.
“If you want a RAG agent, it’s a pretty easy thing to deploy, here’s the pattern. All you have to do is shift your data into this vector store, call it through the AI gateway, choose a large language model. Here’s Arize that you can use for testing. And that’s kind of enabled them.”
One illustrative internal application is developer tooling built on top of MCP (Model Context Protocol) servers, which give AI agents contextual knowledge about SharePoint, Salesforce, or role-based access control systems — reducing the friction of context-gathering in an engineering session.
4. The AI Gateway: Multi-Tenancy at the LLM Layer
EUNA Solutions serves multiple government entities, each with its own data isolation requirements. The core architectural solution is a centralized AI gateway through which every LLM API call flows, regardless of which product line triggers it.
“Every single API call to a large language model goes to that AI gateway. And then we split the gateway by the different products. So we have like a procurement entry point, the grants entry point, a budget entry point broken up by the products. And we were able to see which customers are calling the model, what their token spend, what their limit is.”
Beyond multi-tenancy and billing visibility, the gateway serves as a unified enforcement point for cross-cutting concerns. Toxic language filtering, PII removal, SQL injection prevention, and prompt injection guardrails all live at this layer — applied consistently across every product without requiring each engineering team to re-implement them.
Model Flexibility and Cost Optimization
The gateway also enables provider-agnostic model selection. Engineering teams can evaluate expensive frontier models against cheaper, lightweight alternatives using the same evaluation harness and choose based on accuracy data rather than intuition.
“We can take an expensive large language model and we can take something like Gemini Flash and test it and see what the output is. And if they both give me the same accuracy, I’m going to take the cheaper one.”
In customer-facing contexts, this flexibility could eventually become a product feature — allowing government agencies to choose a model tier based on their accuracy requirements and budget, with pricing attached to that choice.
5. Token Economics and the ROI Question
Enabling Claude across the organization at EUNA Solutions produced an immediate and instructive result: token spend ran to roughly three times the projected budget. The experience offers a candid case study in enterprise AI governance.
“We enabled Claude and we thought our budget would be X and it’s like three times X because of the usage... We kind of narrow it in and we say, hey, you spent $5,000 on X this month. What did you use it for? Log it in the AI innovation hub and what’s the return on investment?”
Stockwell’s response was to build an AI Innovation Hub — an internal tool where employees log their AI projects, enabling leadership to tie token spend to concrete outcomes. The shift in governance model is noteworthy: rather than centralized approval for every use of AI, departmental leaders are accountable for demonstrating ROI within their own teams.
“I should not be approving your token usage if you’re in finance or if you’re in HR... it should be the leaders in those areas understanding what their employees are using AI for and making sure that they’re using it in a way that we are getting a positive ROI from it.”
One employee, for example, completed a documentation project in four months that would have taken twelve — a result that justified elevated token consumption. The AI enablement team’s role is not cost policing but process re-engineering: helping teams understand whether an AI automation is truly the right solution, or whether the underlying process should be redesigned first.
“Instead of just, ‘should we automate this?’, it’s like, ‘can we take a step back and let’s look at the entire process to see if this is really an AI automation, is it something that Claude should be doing, is it a software engineering process?’”
6. SDLC Transformation: Speed, Quality, and the Guardrail Gap
EUNA Solutions has a dedicated team focused solely on SDLC transformation through AI. The headline metric is time-to-merge-request, which has dropped by approximately half as developer adoption of AI tooling has increased.
“It’s still two weeks [sprints], but we’re able to get through a lot more in those two weeks.”
Early adopters within engineering teams have begun building their own agent-based review pipelines — ad hoc solutions to code quality and risk concerns that arise naturally as AI-generated code enters production codebases. The challenge for the AI platform team is standardizing these patterns so their benefits are available to all engineers, not just those who built them.
“As our developers start using it, they start building up their own agents within the different solutions to mitigate the risks that they’re seeing — they’ll have a review agent, they’ll have this agent. And now what we kind of have to do is figure out how do we standardize that so that all developers have access to these different things.”
Claude Code has been central to this internal transformation. Non-technical staff in HR, marketing, legal, and finance are now building their own internal applications — a dynamic that is surfacing new questions about production readiness checklists, SDLC governance for citizen-developed apps, and how to enforce coding standards outside traditional engineering pipelines.
7. The SaaS Landscape: Threats, Opportunities, and the Adoption Curve
The conversation broadened to the macro question of what AI means for SaaS companies as a category. Stockwell’s view is nuanced: the threat is real, but the response is within reach of any company willing to move quickly and invest in AI capability.
“You can go get a Replit account or Lovable and vibe code something very, very quickly. Governments move in this space pretty slowly, and there’s definitely a trust component to this. So we’ve kind of built over years all the guardrails, not just from an AI perspective, but from a data and infrastructure perspective.”
The compound moat — compliance posture, data trust, customer relationships, and now AI platform depth — is harder to replicate than any individual feature. The risk is not that AI replaces EUNA Solutions outright; it is that a nimbler competitor replicates enough functionality fast enough to win new contracts.
“If you kind of have a vision — instead of pushing out five product features a year, you can maybe push out ten because you’re using AI and you’re using more and more tokens to produce things.”
Stockwell’s broader framing of the adoption curve is a useful corrective to the hype cycle. The majority of potential users have not yet meaningfully engaged with AI tooling. Teams and organizations that move quickly across that curve — in Stockwell’s words, “as quickly as possible” — are building a lead that will compound as the curve steepens.
8. Looking Ahead: Context Windows, Vibe Coding, and the Horizon
Two technical constraints define the current ceiling of AI-assisted software development in Stockwell’s view: context window size and the maturity of evaluation infrastructure. Both are moving.
“What’s stopping a company right now from taking their current software application and giving it to an AI agent and saying, ‘take this and redo X, Y, and Z with these features and deploy it’ is context. The context window is too small. It cannot take all the tokens into account of your entire codebase. But [looking ahead] — maybe a year, maybe two years from now, maybe even sooner — that’s not going to be an issue.”
The implication is that organizations building strong AI posture now are positioning themselves for a qualitatively different capability in the near term. The teams and companies that have invested in observability, guardrails, prompt engineering depth, and developer education will be able to absorb larger context windows and more autonomous agents without starting from scratch on governance.
His advice to teams navigating the current pace of change is to resist the temptation to over-engineer.
“People try and sometimes overcomplicate things when you can do a very small pilot project. It’s very easy to build an agentic pattern and it’s very easy to productionize it once you have the capabilities — your LLM ops, your guardrails. And there’s a lot of value that we can already get to our customer just by embedding a basic LLM powered by an agentic solution.”
Technologies
At EUNA Solutions, every LLM API call — whether targeting Claude, Gemini Flash, or any other provider — is routed through a centralized AI gateway that enforces rate limiting, token metering, PII redaction, toxic language filtering, and prompt injection guardrails before a single token reaches a customer-facing surface. Atop that gateway sits a stack of reusable agentic patterns: RAG pipelines backed by AWS OpenSearch vector stores, Lambda functions for serverless orchestration, and MCP (Model Context Protocol) servers that give agents contextual awareness of enterprise systems including SharePoint, Salesforce, and role-based access control environments. Each pattern is observable end-to-end through Arize, an LLM ops platform comparable to LangSmith and Langfuse, which runs standardized evaluation sets against expected outputs both in pre-production and live environments — enabling the team to detect prompt drift, adjust system prompts or guardrail prompts, and rerun evals before any degradation surfaces to users. The guardrail layer itself operates as a feedback loop: agent responses are intercepted, evaluated against product-specific constraint rules, and re-submitted to the model iteratively until compliant output is produced, at which point it is passed to the front end.
On the developer productivity side, EUNA Solutions has embedded Claude and Claude Code across engineering, HR, marketing, legal, and finance, producing a measurable 50% reduction in time-to-merge-request without shortening two-week sprints — the same cycles now yield significantly higher throughput. Engineers across product lines, spanning AWS, Azure, and GCP infrastructure inherited through acquisition, are building bespoke dev-side MCP servers and autonomous review agents to validate AI-generated code against production readiness checklists, effectively creating team-local SDLC guardrails that the AI platform team is now working to standardize organization-wide. Model selection is treated as an empirical rather than intuitive decision: the AI gateway enables side-by-side evaluation of frontier models against lightweight alternatives like Gemini Flash, with accuracy benchmarked against the same Arize test sets used in production monitoring, so cost optimization is grounded in observed performance deltas rather than vendor claims. Fine-tuning and expanded RAG coverage — augmenting base LLMs already trained on vast public corpora, including historical RFPs — remain the planned evolution path as private data assets mature and context window constraints, currently the binding limit on whole-codebase agentic refactoring, continue to expand.
Conclusion
The technical story that emerges from Andrew Stockwell’s experience at EUNA Solutions is less about any single model or framework and more about infrastructure discipline. The companies and teams succeeding with AI in regulated, high-stakes environments are not doing so because they have access to better models — they are doing so because they have invested in the layers that make models trustworthy: rigorous evaluation pipelines, purpose-built guardrails, centralized observability, and a culture of measured experimentation over speculative automation.
As Stockwell put it in his closing remarks:
“Don’t look at it from a negative point of view. Look at it from a positive point of view and just have the right vision and strategy to execute on it. Anyone can do anything now. I can take anyone who’s never coded and they can vibe code an app or an idea. So there’s so many opportunities.”
The technical foundations EUNA Solutions has built — an AI gateway, reusable agentic patterns, LLM observability, and a governed AI innovation hub — are a blueprint for any software organization trying to ship AI responsibly at scale. The tools are largely available. The discipline is the differentiator.
About the Guest
Andrew Stockwell is VP of AI at EUNA Solutions, a cloud-based software provider for public sector organizations in the United States and Canada. His background spans actuarial economics, data science, and a Master’s degree in Computer Science. He has led AI platform, enablement, and LLM ops initiatives across multiple organizations, with a focus on responsible deployment of generative AI in regulated environments.
About the Snowpal Podcast
The Snowpal Podcast explores the intersection of technology, software architecture, and entrepreneurship. Episodes feature practitioners sharing hands-on experience building and deploying software at scale. Hosted by Krish Palaniappan, founder of Snowpal.