AI SRE

I joined the AI SRE team at Harness following the acquisition of Transposit, an incident management platform. While technically powerful, Transposit was built before modern AI capabilities were viable.

As AI advanced, it created an opportunity to reimagine smarter, more intuitive ways of responding to incidents.

I led UX to design that new experience.

Problem

Responders — typically on-call engineers — operate under extreme time pressure with a single goal: restore service as quickly as possible.

In reality, incident response is fragmented. Critical information is spread across Slack, Zoom, Jira, APMs, and documentation, forcing responders to constantly switch context to piece together what’s happening. This slows decision-making, increases cognitive load, and extends resolution time.

Success Metric

Success ultimately comes down to one metric:
reducing Mean Time to Resolution (MTTR).

MTTR measures how long it takes a team to restore service after an incident occurs — making it the most meaningful indicator of incident response effectiveness.
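As a rough illustration of the metric, MTTR can be computed as the average time between an incident starting and service being restored. The Python sketch below assumes each incident is recorded as a (started, resolved) pair; the function name and the sample timestamps are hypothetical and are not taken from AI SRE or any customer data.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - started) across a list of (started, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample incidents, for illustration only.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),    # 60 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 30)),  # 30 minutes
    (datetime(2024, 1, 3, 22, 0), datetime(2024, 1, 4, 0, 0)),    # 120 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)
```

Teams differ on where they start and stop the clock (detection, acknowledgement, mitigation, resolution), so the pair definition above is one assumption among several reasonable ones.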

Competitive Landscape

Before Transposit was acquired, it competed with first-generation tools like Incident.io, FireHydrant, Rootly, and Blameless. As AI matured, the category shifted toward more proactive, assistive workflows.

We benchmarked both traditional and new AI-native tools to find where generative AI could meaningfully reduce MTTR and improve responder effectiveness.

Early first-gen market leaders:

New AI-native disruptors:

AI-Native Redesign

We observed that a significant portion of incident response time was spent on manual, operational work — searching across tools, gathering context, and stitching together fragmented information.

Rather than asking responders to manage this overhead themselves, we explored how AI could take on these tasks proactively and surface the right context at the right time.

This led to our guiding question:

How might we use AI to help responders reduce MTTR?

AI Root Cause Analysis

To address this, I redesigned the incident detail page — the primary workspace for responders — and re-architected the flow of information so natural language AI became the main entry point into every investigation, rather than a secondary feature.

At the center of this redesign is the Likely Root Cause Theory section, where responders can review AI-generated hypotheses as AI SRE continuously analyzes signals across integrated systems in the background.

This includes an AI reasoning map that visually explains how AI SRE arrived at its conclusions, making the system’s logic transparent to responders. Alongside this, I designed clear human-in-the-loop feedback mechanisms, allowing responders to adjust confidence, confirm a true root cause, or dismiss false leads.

I also created targeted AI questioning, enabling users to reference specific contextual information directly within the natural language interface to steer the AI’s investigation.

AI reasoning map, feedback mechanisms & targeted AI questioning.

AI Scribe

To further reduce operational burden, we introduced AI Scribe, a listening agent that responders can invite to Slack, Microsoft Teams, or Zoom calls. AI Scribe automatically captures communications, actions, and decisions in real time, creating a single, authoritative incident record without manual note-taking.

As part of this feature, I designed how the transcription experience works and appears, as well as how live summaries surface within the incident detail page as AI Scribe listens and updates the record. This keeps documentation accurate and up to date while allowing responders to stay focused on resolving the incident rather than documenting it.

Transcription experience and surfacing AI updates.

AI Postmortem

After an incident is mitigated but before it is marked resolved, teams typically run a postmortem while juggling Zoom/Teams, Slack, and a shared doc. To reduce this overhead, I designed AI Postmortem, which listens to post-incident calls and automatically captures structured outcomes.

I designed a Meeting Mode tailored for the two-screen setup most responders use.

Meeting Mode for two-screen setup used by most responders.

It provides a single, flat one-page view that AI updates in real time by writing a post-incident summary, generating a timeline of key events, and creating assignable action items.

If anything is missing, responders can simply add it in natural language and the system updates the record — no forms, no manual formatting. In effect, Meeting Mode acts like a junior SRE who handles documentation so responders can focus on analysis and decision-making.

Pre-incident AI Investigations

Not every issue starts as an incident. Responders often investigate early — checking logs, reviewing changes, and gathering context — but this work was scattered across Slack messages, random docs, and personal notes.

I explored how AI could support this by designing Pre-Incident Investigations, a lightweight scratchpad workspace where users could pin context, gather evidence, and ask AI questions in one place. If an investigation became a formal incident, everything carried over automatically.

Because the space sat somewhere between notes and incidents, naming mattered. I ran a quick survey with the field to land on terminology that felt intuitive.

This extended AI SRE beyond reactive response into proactive investigation.

Scratchpad design and naming research results.

Together, these features allow responders to remain in a single workspace while steering AI SRE’s investigation, shifting the operational burden of documentation and maintenance to the AI, ultimately saving time and reducing MTTR.

Results

82.5%

↓ in MTTR from 4h to 42min

$300K

Annual Contract Value Secured

With our first beta customer, Tebra, AI SRE delivered a meaningful reduction in incident resolution time. For an incident type that historically took around four hours to resolve using their previous tooling, the same workflow was completed in just 42 minutes with AI SRE, a reduction in MTTR of roughly 82.5%.
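The headline percentage follows directly from the two durations. A minimal Python sketch of the arithmetic, treating "around four hours" as exactly 240 minutes (an assumption) and using a hypothetical helper name:

```python
def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Percentage drop from the earlier duration to the later one."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) / before_minutes * 100


baseline_minutes = 4 * 60   # "around four hours", assumed to be exactly 240 minutes
with_ai_sre_minutes = 42    # resolution time reported with AI SRE

reduction = percent_reduction(baseline_minutes, with_ai_sre_minutes)
print(f"{reduction:.1f}%")
```

Because the baseline is approximate, the result is best read as "a little over 80%" rather than a precise measurement.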

Beyond operational gains, the AI-native capabilities also drove clear business impact. A live demo of AI SRE’s investigation and automation workflows directly helped close iHerb, our beta design partner, as a new customer, contributing approximately $300K in annual contract value.

Website & Marketing Enablement

Because the marketing team didn’t have deep product context, I partnered closely with them to translate complex AI SRE capabilities into clear, compelling narratives.

While marketing defined the overall strategy and brand direction, I led the product storytelling and UX, designing key screens, feature visuals, messaging structure, and interactive widgets to accurately represent how the product works.

Website I designed to support AI SRE in the Harness dark mode rebrand.

A lightweight product walkthrough video I made for AI SRE.

I also created diagrams, flows, and supporting visuals used across documentation, demos, and PM assets to ensure consistency between the product, website, and customer materials.

Final Thoughts

Designing AI SRE showed me that the biggest gains don’t come from adding features — they come from rethinking where the work should live.

By letting AI handle the operational burden, we gave time and clarity back to responders. That shift reduced MTTR, improved adoption, and helped the product win real customers.

It’s the kind of 0→1, systems-level problem I’m most excited to solve.
