Case study · Client product · Built end to end

ServiceNow Co-Pilot

A developer assistant built around one principle: an answer is only as good as its grounding and its proof. It reads a ServiceNow screenshot, answers from cited documentation, and for buildable requests it actually builds and smoke tests the fix in a sandbox instance, then hands back an importable Update Set. When it cannot ground an answer, it says so.

ServiceNow Co-Pilot · buildable answer
A buildable, lab validated ServiceNow Co-Pilot answer with a green Built and tested in PDI badge, the Script Include code, alternatives considered, citations, and a download Update Set button
Trust architecture · Anti-hallucination · RAG + citations · Proof of work · TypeScript · Supabase
178 · Automated tests, safety first
3 · Honest answer tiers
7 · Sequenced build phases
5 · Weighted eval dimensions

The problem

ServiceNow development is high stakes and slow to get right. A developer staring at a broken form, a slow Business Rule, or a vague "build me a utility" request pays three recurring taxes, and a generic AI assistant makes every one of them worse.

The first is the diagnosis tax: working out what the screenshot is even showing (which table, which error, which form context) before any real work can start. The second is the correctness tax. ServiceNow has strong opinions on Script Includes versus GlideAjax versus Scripted REST, on GlideRecord performance, and on the ACL surface, and a generic assistant will happily hallucinate a plausible but wrong API. Here a wrong answer ships to production. The third is the proof tax. Even a correct answer is just words until it compiles and runs on the instance, and the developer is still left to build and test it by hand.

Most AI co-pilots stop at confident prose. The hard and valuable part, grounding the answer in real documentation and proving it runs, is exactly what they skip. That gap is the product.

The trust model

The clearest way to understand the product is its trust spectrum. The same interface renders three honestly different classes of answer, and it never pretends one is another. An answer is graded by how much I can actually stand behind it, not by how confident the prose sounds.

Built + lab validated

A buildable solution is actually built and smoke tested in a sandbox ServiceNow instance (a Personal Developer Instance, or PDI), then exported as an Update Set the developer imports. Nothing is claimed validated unless it ran. This is the hero flow.

Diagnostic, advice only

Not everything is buildable. A "why is this happening" question is classified as diagnostic, grounded in retrieved docs and cited on every claim, and labelled advice only, not lab validated. Real advice, never dressed up as something that was tested.

Unverified

Asked about something with no supporting source, the co-pilot admits it. It returns an unverified badge, low confidence, and zero citations, offers a generic starting point, and refuses to invent a property name or a citation.

The most important screen in the product is the one where it admits it does not know. That behaviour is enforced by the architecture, not by prompt politeness: the model only ever drafts prose, and the orchestrator, not the model, attaches citations. A claim with no retrieved source cannot acquire a fake one.
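A minimal sketch of that rule, not the production code. Every name here (`DraftClaim`, `RetrievedChunk`, `attachCitations`) is hypothetical, and so is the assumption that a drafted claim carries opaque chunk ids for the orchestrator to resolve; the case study does not say how claims are matched to chunks. The sketch also collapses the tiers to two for brevity.

```typescript
// Hypothetical shapes for illustration only.
interface RetrievedChunk {
  id: string;
  sourceUrl: string;
  text: string;
}

// The model drafts prose. It has no field in which to write a citation;
// at most it can point at ids of chunks it was shown (an assumption of this sketch).
interface DraftClaim {
  prose: string;
  supportingChunkIds: string[];
}

interface CitedClaim {
  prose: string;
  citations: { chunkId: string; sourceUrl: string }[];
}

type Tier = "grounded" | "unverified";

// The orchestrator resolves ids against what retrieval actually returned.
// Ids that match no retrieved chunk are dropped, so a citation can only
// ever point at a real retrieved source.
function attachCitations(
  claims: DraftClaim[],
  retrieved: RetrievedChunk[],
): { claims: CitedClaim[]; tier: Tier } {
  const byId = new Map(retrieved.map((c) => [c.id, c] as const));
  const cited: CitedClaim[] = claims.map((claim) => ({
    prose: claim.prose,
    citations: claim.supportingChunkIds
      .filter((id) => byId.has(id))
      .map((id) => ({ chunkId: id, sourceUrl: byId.get(id)!.sourceUrl })),
  }));
  // An answer is only "grounded" when every claim kept at least one citation.
  const everyClaimCited =
    cited.length > 0 && cited.every((c) => c.citations.length > 0);
  return { claims: cited, tier: everyClaimCited ? "grounded" : "unverified" };
}
```

In this shape, a model that invents a chunk id gains nothing: the id fails the lookup, the claim ends up with zero citations, and the answer is labelled unverified.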
A diagnostic, advice only ServiceNow Co-Pilot answer with a medium confidence badge and a citation on every claim
Diagnostic and advice only. Grounded, cited, and honestly labelled as not lab validated, because it was not.
An unverified ServiceNow Co-Pilot answer that states no citations and that the answer is unverified
Unverified, no supporting source found. It states there are no citations and refuses to invent property names or defaults.
Architecture

The whole system is shaped to make a wrong or unsafe answer structurally hard, not just discouraged. A developer pastes a screenshot, types a question, and optionally attaches a ServiceNow XML export. The request flows through a reasoning agent that parses the image, retrieves cited chunks, reasons into a strict contract, classifies the request, and only then, if it is buildable and a sandbox is configured, builds and proves it. The diagram below is not a description of the code; it is the set of decisions that make the trust model real.
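The last step of that flow, build then prove or fall back to advice, can be sketched as follows. This is an illustration under assumptions, not the production code: `validateInSandbox`, `BuildSteps`, and the status strings are hypothetical names, and the three steps are passed in as functions so the sketch stays self-contained.

```typescript
type ValidationStatus = "lab_validated" | "advice_only";

// Hypothetical seams for the three sandbox steps.
interface BuildSteps {
  build: () => Promise<void>;
  smokeTest: () => Promise<boolean>;
  exportUpdateSet: () => Promise<string>; // returns the Update Set XML
}

interface ValidationResult {
  status: ValidationStatus;
  updateSetXml?: string;
  reason?: string;
}

async function validateInSandbox(
  kind: "buildable" | "diagnostic",
  sandboxConfigured: boolean,
  steps: BuildSteps,
): Promise<ValidationResult> {
  // Only buildable requests with a configured sandbox are ever built.
  if (kind !== "buildable" || !sandboxConfigured) {
    return { status: "advice_only", reason: "not buildable or no sandbox configured" };
  }
  try {
    await steps.build();
    if (!(await steps.smokeTest())) {
      return { status: "advice_only", reason: "smoke test failed" };
    }
    const updateSetXml = await steps.exportUpdateSet();
    // "lab_validated" is reachable only after all three steps succeeded.
    return { status: "lab_validated", updateSetXml };
  } catch (err) {
    // Any failure degrades to advice only; it never yields a false "validated".
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "advice_only", reason };
  }
}
```

The point of the shape is that there is exactly one return path that produces the validated status, and it sits after the export succeeds.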

ServiceNow Co-Pilot architecture: a React 19 PWA posts to a Hono API, a reasoning agent does vision parse, retrieval over local embeddings and pgvector, XXE safe XML intake, structured reasoning, citation attach and classify, and a PDI validator that builds and smoke tests in a sandbox ServiceNow instance and returns an Update Set as proof
The decisions this diagram encodes:
  • The Answer Contract: the model fills a strict schema of reasoning prose only and never holds the pen on a citation; the orchestrator attaches citations and the validation status.
  • Embeddings run locally with bge-small-en-v1.5, so there is no per-query cost and no corpus egress to a third party.
  • The validator is PDI-only by construction: a single chokepoint throws before any network request if the target is not the configured sandbox, so the agent can never be steered into writing to production.
  • Buildable requests follow build, then test, then prove, and degrade honestly to advice only if anything fails, so a false "validated" is never emitted.
  • XML intake is XXE safe because it never parses the XML as XML: it strips the DOCTYPE with a bracket-aware scanner, leaving entity expansion attacks no parser to attack.
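To make "bracket aware" concrete: a DOCTYPE can carry an internal subset in square brackets that itself contains `>` characters, so stopping at the first `>` would leave entity declarations behind. The following is a minimal sketch of such a scanner, assuming a hypothetical `stripDoctype` helper; it is not the project's implementation and it does not treat comments inside the subset specially.

```typescript
// Hypothetical helper: removes every <!DOCTYPE ...> declaration by scanning
// characters, without handing the input to an XML parser.
function stripDoctype(xml: string): string {
  const opener = /<!DOCTYPE/gi; // matched case-insensitively to be conservative
  let out = "";
  let i = 0;
  while (i < xml.length) {
    opener.lastIndex = i;
    const m = opener.exec(xml);
    if (m === null) {
      out += xml.slice(i); // no further declaration: keep the rest
      break;
    }
    out += xml.slice(i, m.index); // keep everything before the declaration
    let depth = 0; // nesting depth of [ ... ] internal subsets
    let quote: string | null = null; // active quote character, if any
    let j = m.index + m[0].length;
    for (; j < xml.length; j++) {
      const ch = xml[j];
      if (quote !== null) {
        if (ch === quote) quote = null; // closing quote
      } else if (ch === '"' || ch === "'") {
        quote = ch; // brackets and > inside quotes are ignored
      } else if (ch === "[") {
        depth++;
      } else if (ch === "]") {
        depth = Math.max(0, depth - 1);
      } else if (ch === ">" && depth === 0) {
        break; // the real end of the declaration
      }
    }
    // An unterminated declaration runs to the end of input and is dropped whole.
    i = j + 1;
  }
  return out;
}
```

With the declaration gone, an entity reference such as `&x;` in the body has no definition left to expand.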
The product, end to end

Every screen below is captured from the running application. Together they walk the full path: from the composer a developer types into, through the source it reads, to the three honest answer tiers, the offline state, and the phone.

ServiceNow Co-Pilot landing screen with the PDI only, advice grounded, cited tagline
The front door. The tagline on every screen, "PDI-only, advice grounded, cited", is the product's entire safety posture in three phrases.
The composer with a typed question, a pasted screenshot preview, and an attached ServiceNow XML export
The composer accepts a typed question, a pasted, dragged or uploaded screenshot with a live preview, and an optional ServiceNow XML export. Ask stays disabled until there is a question.
A ServiceNow incident list screenshot of open P1 incidents that the model reads as input
The kind of screenshot the model reads. From an open P1 incident list and a question, it extracts the table, the error text, and the form context before any answer is drafted.
The buildable, lab validated answer with the Script Include code, alternatives considered, citations, a Built and tested in PDI badge, and a download Update Set button
The hero flow. A complete engineering recommendation with the actual Script Include code, alternatives weighed by their downsides, a Built + tested in PDI badge, and a download Update Set button. The answer is not a suggestion; it is a tested, importable change.
The offline state of the installable PWA with a banner explaining past conversations stay viewable but new answers need a connection
An installable progressive web app. Go offline and past conversations stay viewable while the composer disables new questions. The API is never cached, because it carries authenticated and private data.
The full structured answer reflowed cleanly to a phone viewport
On the phone, the full structured answer (badge, screenshot, code, and citations) reflows cleanly to a mobile viewport.

Screenshots are captured from the running application. The demo conversations are curated to show the full trust spectrum, and they render through the live persistence and UI path.

How I built it

The product was delivered in seven sequenced phases, each test-driven and merged behind passing checks: the model gateway, then the knowledge layer, then the reasoning agent and XML intake, then the PDI validator, then persistence, then the web app and PWA, and finally the evaluation harness.

Quality is measured, not vibed. A golden set scores every answer on five weighted dimensions (completeness, citation or unverified, kind match, concept coverage, and groundedness), and two of them are hard guardrails: anti-hallucination and groundedness. A model A/B benchmark holds retrieval constant and swaps only the model, so the cheapest one that clears the guardrails wins, and only free, vision-capable models are ever in the running.
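A minimal sketch of how a weighted score and hard guardrails can coexist, under loudly stated assumptions: the weights are placeholders, the function and field names are hypothetical, and the sketch assumes the anti-hallucination guardrail corresponds to the "citation or unverified" dimension, which the text above does not confirm.

```typescript
type Dimension =
  | "completeness"
  | "citationOrUnverified"
  | "kindMatch"
  | "conceptCoverage"
  | "groundedness";

// Each dimension is scored in [0, 1].
type Scores = Record<Dimension, number>;

// Placeholder weights for illustration; the real weights are not published here.
const WEIGHTS: Scores = {
  completeness: 0.2,
  citationOrUnverified: 0.25,
  kindMatch: 0.15,
  conceptCoverage: 0.15,
  groundedness: 0.25,
};

// Assumed mapping of the two hard guardrails onto dimensions.
const GUARDRAILS: Dimension[] = ["citationOrUnverified", "groundedness"];

function scoreAnswer(s: Scores): { weighted: number; passed: boolean } {
  const dims = Object.keys(WEIGHTS) as Dimension[];
  const weighted = dims.reduce((sum, d) => sum + WEIGHTS[d] * s[d], 0);
  // A guardrail cannot be averaged away by high scores elsewhere.
  const passed = GUARDRAILS.every((d) => s[d] >= 1);
  return { weighted, passed };
}

interface Candidate {
  model: string;
  cost: number; // hypothetical relative cost
  answers: Scores[]; // scores on the golden set, retrieval held constant
}

// The cheapest model whose every golden-set answer clears the guardrails wins.
function pickModel(candidates: Candidate[]): string | null {
  const eligible = candidates.filter(
    (c) => c.answers.length > 0 && c.answers.every((a) => scoreAnswer(a).passed),
  );
  if (eligible.length === 0) return null;
  return eligible.reduce((best, c) => (c.cost < best.cost ? c : best)).model;
}
```

The design choice worth noting is that the guardrails gate eligibility before cost is compared, so a cheap model that hallucinates once is never selected.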

A later high-effort code review pass found and fixed 10 issues, each with a regression test, from a conversation timestamp bug to a double-decoded screenshot path. The result is a codebase where the safety-critical behaviours are the most tested ones.

178 · Passing tests
41 · Test files
7 · Sequenced phases
10 · Review fixes, each with a test

Safety is tested, not assumed. The PDI-only guard and the anti-hallucination and groundedness behaviour each have dedicated tests, so a regression that let the model invent a citation, or let the validator touch a non-PDI host, fails CI.
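A minimal sketch of what such a guard and its checks could look like. `assertPdiTarget` and the host names are hypothetical, and the exact rule the real guard applies is not shown in this case study; the sketch assumes an exact hostname match over HTTPS.

```typescript
// Hypothetical chokepoint: every outbound validator request would call this
// before any network code runs, and proceed only with the URL it returns.
function assertPdiTarget(targetUrl: string, configuredPdiHost: string): URL {
  const url = new URL(targetUrl); // throws on malformed input
  const sameHost = url.hostname === configuredPdiHost.toLowerCase();
  if (url.protocol !== "https:" || !sameHost) {
    throw new Error(`Refusing to contact non-PDI host: ${url.hostname}`);
  }
  return url;
}

// Small helper for the checks below: did the guard throw?
function guardRejects(targetUrl: string, pdiHost: string): boolean {
  try {
    assertPdiTarget(targetUrl, pdiHost);
    return false;
  } catch {
    return true;
  }
}

// Example host, made up for illustration.
const PDI_HOST = "dev12345.service-now.com";

const checks: [string, boolean][] = [
  ["https://dev12345.service-now.com/api/now/table/incident", false],
  ["https://prod.service-now.com/api/now/table/incident", true],
  // Suffix and userinfo tricks must not pass an exact hostname match.
  ["https://dev12345.service-now.com.evil.example/", true],
  ["https://dev12345.service-now.com@evil.example/", true],
  ["http://dev12345.service-now.com/", true],
];

for (const [target, shouldReject] of checks) {
  if (guardRejects(target, PDI_HOST) !== shouldReject) {
    throw new Error(`Guard regression for ${target}`);
  }
}
```

Comparing the parsed hostname exactly, rather than testing for a substring or suffix, is what keeps look-alike hosts out.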

Tech & tools
TypeScript · Hono · React 19 · Vite · Supabase (Postgres, pgvector, Storage) · OpenRouter · bge-small-en-v1.5 local embeddings · Zod · PWA

One language across the API, web and scripts, with Zod giving runtime and compile-time safety. The Answer Contract is the spine of the whole app.

My role
  • Designed the trust model: the three honest answer tiers and the rule that the model never holds the pen on a citation. This was a product and safety design problem, not a prompt.
  • Designed the knowledge layer: a local embedding pipeline and pgvector retrieval whose chunks become the citations the orchestrator attaches.
  • Designed the reasoning agent and the strict Answer Contract that turns model output into a structured, validatable answer.
  • Designed the PDI validator, the PDI-only safety guard, and the build then test then prove flow that degrades honestly to advice only.
  • Designed persistence and the web PWA: schema-decoupled answer storage, a private screenshot bucket, and an offline-aware shell.
  • Designed the evaluation harness: the golden set, the five weighted dimensions, and the guardrailed model A/B benchmark.
  • Shipped it across seven phases and led the review and hardening pass that found and fixed 10 issues with regression tests.

Want the full walkthrough?

I am happy to demo ServiceNow Co-Pilot live and talk through the trust model, the architecture, and how I designed and shipped it end to end.