PostHog AI observability — a GTM teardown
A go-to-market take for the Developer Marketer (AI observability) role. The companion working demo is in the repo; the traces it produced are in PostHog right now.
TL;DR
AI observability is a land-grab, and it is happening among developers PostHog already has a relationship with through its analytics platform. The winning move isn't to out-feature LangSmith on trace UIs. It's to make the case that the AI is just another part of your product and belongs in the same tool as the rest of it. PostHog is the only serious player that can say that truthfully. Everything below is built around that wedge.
1. The market, in one paragraph
Every team building with LLMs hits the same wall: the app is non-deterministic, "bugs" look like slightly-wrong prose, and cost balloons invisibly per token. They reach for tooling to record and inspect every model call — traces, token/cost, latency, and increasingly evals (is the output actually good?). That category is "AI observability." It's early, the buyer is a developer, and no one has won it yet.
2. The competitive set (and PostHog's wedge)
Current as of June 2026; funding/acquisition facts independently verified. The category is hot and consolidating fast — context that helps PostHog's pitch.
| Player | Status / 2025–26 | Pricing | PostHog's honest counter |
|---|---|---|---|
| LangSmith (LangChain) | $125M Series B, Oct 2025, $1.25B (unicorn). Pulls hardest for all-in-LangChain shops. | Developer $0; Plus $39/seat/mo; Enterprise custom. | Point solution in a silo. PostHog ties the trace to the real user session, the funnel it moved, and an A/B test behind a flag — without leaving the tool. |
| Langfuse | OSS (MIT); acquired by ClickHouse, Jan 2026 (alongside ClickHouse's $400M Series D). OTel-native. | Hobby free; Core $29/mo; Pro $199/mo; self-host free. | PostHog is also OSS + developer-first, but arrives with analytics, replay, flags, experiments attached — and Langfuse now serves ClickHouse's agenda. |
| Helicone | Proxy/gateway (YC W23). Acquired by Mintlify, Mar 2026 — maintenance mode, feature dev ended. | Hobby free; Pro $79/mo; Team $799/mo. | Vendor-risk wedge: a tool in maintenance mode is a dead end. PostHog is independent and building this as a first-class, ongoing product. |
| Arize / Phoenix | $70M Series C, Feb 2025. Phoenix OSS is the eval core; Arize AX is the enterprise SaaS. | Phoenix free; AX Pro $50/mo; Enterprise ~$60k/yr. | Built for ML engineers, not product engineers shipping features. Tells you how the model behaves, nothing about adoption/retention. |
| Braintrust | $80M Series B, Feb 2026, ~$800M (ICONIQ, a16z, Greylock). Eval-to-production loop. | Starter $0; Pro $249/mo; Enterprise custom. | Standalone AI-only tool, billed separately on data/score volume, disconnected from product impact. |
| Datadog LLM / Agent Obs | The enterprise "bolt AI onto APM" play (the "billion-dollar competitor"); hard push into agents. | Usage-based: ~$8 / 10K LLM spans (≈$480/mo at 500K) + free tier. | Datadog correlates AI with infra; PostHog correlates it with the user and product. Datadog sells to ops; PostHog to the person who built the feature — at a fraction of the price. |
Two market facts to weaponize:
- Consolidation is underway (Langfuse→ClickHouse, Helicone→Mintlify). Point tools are getting absorbed into bigger agendas. "Pick the platform building this as a core product, not a feature it acquired" is a true, current message.
- Pricing edge is real. PostHog = 100k LLM events/mo free, then $0.00006/event, which works out to $0.60 per 10K events (~10× cheaper), vs Braintrust's $249/mo Pro and Datadog's ~$8/10K-span usage pricing.
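The pricing claim is easy to sanity-check with arithmetic. A minimal sketch, assuming the figures quoted in this teardown, a single flat per-event rate after the free tier, and that one PostHog event is roughly comparable to one Datadog span (it may not be). The function names are mine, not anyone's API.

```python
from decimal import Decimal
import math

# Figures quoted in this teardown; re-check current price lists before reuse.
POSTHOG_FREE_EVENTS = 100_000             # free LLM events per month
POSTHOG_PER_EVENT = Decimal("0.00006")    # USD per event beyond the free tier
DATADOG_PER_10K_SPANS = Decimal("8")      # approximate USD per 10K LLM spans
BRAINTRUST_PRO_FLAT = Decimal("249")      # USD per month


def posthog_monthly_cost(events: int) -> Decimal:
    """Bill for one month, assuming one flat rate after the free tier."""
    billable = max(0, events - POSTHOG_FREE_EVENTS)
    return billable * POSTHOG_PER_EVENT


def break_even_events(flat_monthly_price: Decimal) -> int:
    """Monthly event volume at which the PostHog bill reaches a flat plan price."""
    return POSTHOG_FREE_EVENTS + math.ceil(flat_monthly_price / POSTHOG_PER_EVENT)


posthog_per_10k = POSTHOG_PER_EVENT * 10_000
print(posthog_monthly_cost(500_000))            # bill for 400K billable events
print(break_even_events(BRAINTRUST_PRO_FLAT))   # volume that matches $249/mo
print(DATADOG_PER_10K_SPANS / posthog_per_10k)  # per-unit price ratio
```

Under those assumptions, 500K events a month costs $24, and the bill only reaches Braintrust's $249 Pro price at 4.25M events a month. The sketch ignores competitor free tiers, minimums, and seat fees, so treat it as a per-unit comparison rather than a quote.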
The one-sentence wedge: Every competitor can only see the AI. PostHog sees the AI in the context of the actual product and the actual user — because the same project already has product analytics, session replay, feature flags, and experiments. You go from "this generation was slow/expensive/wrong" to "…and here's the user it happened to, what they did next, and whether they churned." Nobody else can close that loop.
3. Positioning & messaging
- Category line: "AI observability that lives where your product already does."
- For the skeptic (on LangSmith/Langfuse): "Keep your traces. But your AI feature lives in a product you're already measuring. Stop stitching two tools together."
- For the greenfield team: "You'll need analytics, flags, experiments, and AI observability anyway. Start with the one tool that has all four — instrument once."
- Anti-pattern to avoid: competing on trace-viewer aesthetics. Compete on integration of context, not the trace UI.
4. The launch plan (zero-to-done)
- Hero artifact — a build, not a blog. An "instrument a real AI feature in 15 minutes" repo + post (this demo is the template).
- The wedge demo: a 90-second clip from an `$ai_generation` trace → the same person's session replay and product events. Lead with it.
- Eval-story content: "We found our AI's failure mode with evals", told as reproducible teardowns (mine on finish detection is one).
- Co-marketing with an adjacent AI dev tool (agent framework, vector DB, eval lib): a joint "instrument X with PostHog" guide.
- Beta → adoption: ship weekly, announce each improvement as its own small launch, instrument docs/onboarding to see where activation drops.
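The "instrument once" idea behind the hero artifact can be sketched as a thin wrapper that times a model call and builds an `$ai_generation` event keyed to the same distinct ID the product events use. This is an illustration under assumptions, not the demo's actual code: the property names follow my understanding of PostHog's manual-capture schema and should be checked against the current docs, `fake_model` is a hypothetical stand-in for a real LLM client, and actually sending the event is left to whichever PostHog SDK you use.

```python
import time
import uuid
from typing import Callable


def traced_generation(
    call_model: Callable[[str], dict],
    prompt: str,
    distinct_id: str,
    model: str,
) -> tuple[dict, dict]:
    """Run one model call and build the event that describes it.

    Returns (model_response, event). Property names are assumptions to
    verify against the current PostHog LLM analytics docs.
    """
    started = time.perf_counter()
    response = call_model(prompt)
    latency_s = time.perf_counter() - started

    event = {
        "event": "$ai_generation",
        # Same ID the product events use, so the trace joins to the user.
        "distinct_id": distinct_id,
        "properties": {
            "$ai_trace_id": str(uuid.uuid4()),
            "$ai_model": model,
            "$ai_input_tokens": response["input_tokens"],
            "$ai_output_tokens": response["output_tokens"],
            "$ai_latency": latency_s,
        },
    }
    return response, event


def fake_model(prompt: str) -> dict:
    """Hypothetical stand-in for a real LLM client call."""
    return {"text": "reverse", "input_tokens": len(prompt.split()), "output_tokens": 1}


response, event = traced_generation(
    fake_model, "Classify this card scan", "user-123", "example-model"
)
print(event["event"], event["properties"]["$ai_input_tokens"])
```

The design choice that matters is the shared `distinct_id`: that one field is what lets a trace link to a session replay and product events.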
5. The GEO / AI-search angle (my specialty)
Developers increasingly discover tools by asking an LLM, not Googling. "What should I use to monitor my LLM app?" is now answered by ChatGPT/Claude/Perplexity — and those answers are shaped by what's legible to models. I'd own a deliberate GEO program: comparison/alternative pages written to be cited by models; canonical, code-first docs an LLM can quote verbatim; and tracking share-of-voice in LLM answers, not just search rank. Exactly what I built at Depot across ChatGPT/Claude/Perplexity/Kagi.
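Share-of-voice tracking can start very simply: log the answers models give to a fixed prompt set, then count how often each tool is named. A rough sketch, assuming answers are already collected as plain strings and treating a brand mention as a proxy for a citation. The sample answers are made up for illustration, and `share_of_voice` is a hypothetical helper, not an existing tool.

```python
import re


def share_of_voice(answers: list[str], brands: list[str]) -> dict[str, float]:
    """Fraction of answers that mention each brand at least once (case-insensitive)."""
    if not answers:
        return {brand: 0.0 for brand in brands}
    rates = {}
    for brand in brands:
        pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
        hits = sum(1 for answer in answers if pattern.search(answer))
        rates[brand] = hits / len(answers)
    return rates


# Made-up answers for illustration only; real input would be logged responses
# to prompts such as "What should I use to monitor my LLM app?".
sample_answers = [
    "Popular options include LangSmith, Langfuse, and PostHog.",
    "Many teams start with Langfuse because it is open source.",
    "If you already use PostHog for analytics, its LLM tracing is a natural fit.",
    "Datadog offers LLM observability for teams already on its APM.",
]
rates = share_of_voice(sample_answers, ["PostHog", "Langfuse", "Datadog"])
print(rates)
```

Tracked per model and per week, the same number becomes a trend line you can set against each content release.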
6. Proof: I used the product to find a real bug
I built finish detection for my Pokémon card pipeline (normal / holo / reverse — which sets the price) with Claude vision, wrapped in PostHog AI observability, evaluated against real labeled scans:
| Run | Change | Accuracy | What the data showed |
|---|---|---|---|
| v1 | naive prompt | 42% | Model shoved everything into reverse — its traces showed it equated scanner glare with foil. |
| v2 | "glare ≠ foil" prompt | 47% | normal recall doubled. The fix landed exactly where the bias was. |
| v3 | full resolution | 53% | holo still 0% — even at full res. |
The finding: holo is undetectable from flat scans regardless of prompt or resolution — foil only reveals its pattern under angled light. Evals didn't just tune a prompt; they told me the limiting factor is data capture. That's the entire value proposition of AI observability in one true story — and the kind of content I'd publish to sell it.
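Overall accuracy hides this kind of failure; per-class recall exposes it. A minimal sketch of the scoring, assuming each eval record is a (true label, predicted label) pair. The toy labels below are invented to show the shape of the problem and are not my real scan results.

```python
from collections import Counter

FINISHES = ("normal", "holo", "reverse")


def evaluate(pairs: list[tuple[str, str]]) -> tuple[float, dict[str, float]]:
    """Return (accuracy, per-class recall) for (true_label, predicted_label) pairs."""
    if not pairs:
        raise ValueError("need at least one labeled example")
    totals = Counter(truth for truth, _ in pairs)
    correct = Counter(truth for truth, predicted in pairs if truth == predicted)
    accuracy = sum(correct.values()) / len(pairs)
    recall = {
        finish: (correct[finish] / totals[finish]) if totals[finish] else 0.0
        for finish in FINISHES
    }
    return accuracy, recall


# Toy labels, not the real scans: a model that over-predicts "reverse".
toy_pairs = [
    ("normal", "reverse"), ("normal", "normal"),
    ("holo", "reverse"), ("holo", "reverse"),
    ("reverse", "reverse"), ("reverse", "reverse"),
]
accuracy, recall = evaluate(toy_pairs)
print(accuracy, recall)
```

On the toy data the model is right half the time overall while holo recall is zero, the same shape as the finding above: a class the model never gets right, which a single accuracy number would bury.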
7. How I'd measure the launch
- Activation: % of new AI-observability projects that send a 2nd-day trace.
- The wedge in action: how many AI-obs adopters also view a linked session replay / product insight.
- GEO share-of-voice: citation rate in LLM answers to "best LLM observability tool."
- Time-to-value: time from signup to first trace. If it's not minutes, onboarding is the bug.
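Two of these metrics reduce to a few lines once signup and trace timestamps are available. A sketch under assumptions: the record shape (`signed_up`, `trace_times`) is hypothetical, "2nd-day trace" is read as a trace on the calendar day after a project's first trace, and the sample projects are invented.

```python
from datetime import datetime, timedelta
from statistics import median


def second_day_activation(projects: list[dict]) -> float:
    """Share of projects with a trace on the calendar day after their first trace."""
    if not projects:
        return 0.0
    activated = 0
    for project in projects:
        days = {t.date() for t in project["trace_times"]}
        if days and (min(days) + timedelta(days=1)) in days:
            activated += 1
    return activated / len(projects)


def median_minutes_to_first_trace(projects: list[dict]) -> float:
    """Median minutes from signup to first trace, over projects that sent one."""
    waits = [
        (min(p["trace_times"]) - p["signed_up"]).total_seconds() / 60
        for p in projects
        if p["trace_times"]
    ]
    return median(waits)  # raises StatisticsError if no project has a trace


# Invented sample projects for illustration.
projects = [
    {"signed_up": datetime(2026, 6, 1, 9, 0),
     "trace_times": [datetime(2026, 6, 1, 9, 10), datetime(2026, 6, 2, 14, 0)]},
    {"signed_up": datetime(2026, 6, 1, 10, 0),
     "trace_times": [datetime(2026, 6, 1, 10, 30)]},
    {"signed_up": datetime(2026, 6, 2, 8, 0), "trace_times": []},
]
print(second_day_activation(projects), median_minutes_to_first_trace(projects))
```

Projects that never send a trace count against activation but drop out of the median, so the two numbers should be read together.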
Competitor facts verified June 2026; sources in competitor-research.json. Funding/pricing move fast — re-check before reuse.