bench/leaderboards/lookalike
[agent view]Markdown rendering of the lookalike matrix, optimized for LLM ingestion. Switch back via the toggle above.
# Lookalike Benchmark

Active dataset: `lookalike-2026-q2`
14 seed companies across 7 verticals × 5 vendors. Each vendor is asked for its top `K = 10` lookalikes per seed. An LLM judge (`majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini`) scores each returned company for relevance; the cell value is **Precision@K** — relevant / K.

## Endpoints

- JSON API: https://benchmarks.openfunnel.dev/api/leaderboards/lookalikes
- Markdown agent docs: https://benchmarks.openfunnel.dev/llms.txt
- OpenAPI 3.1 spec: https://benchmarks.openfunnel.dev/openapi.json
- MCP server discovery: https://benchmarks.openfunnel.dev/.well-known/mcp.json
- Public data + code (reproduce any cell): https://github.com/openfunnel/gtm-bench

## Vendors live (5)

`openfunnel` (OpenFunnel), `ocean` (Ocean.io), `exa` (Exa), `parallel` (Parallel), `predictleads` (PredictLeads)

## Vendors not surveyed

- `ZoomInfo` - company lookalike API is not on self-serve — gated behind sales contract
- `Clay` - lookalike runs inside Clay tables, no standalone API
- `Apollo` - no public lookalike endpoint
- `Lusha` - /v3/companies/lookalike requires 5-100 seed companies per request; benchmark scores one seed per cell

## Leaderboard (vendor totals)

| Rank | Vendor | Seeds judged | avg Precision@K | total relevant |
|------|--------|--------------|-----------------|----------------|
| 1 | openfunnel | 14/14 | 89.1% | 123 |
| 2 | predictleads | 14/14 | 73.6% | 103 |
| 3 | ocean | 14/14 | 70.7% | 99 |
| 4 | parallel | 13/14 | 64.6% | 84 |
| 5 | exa | 14/14 | 31.8% | 44 |

- `avg Precision@K` - mean Precision@10 across all seeds the vendor returned ≥K results for. Headline metric.
- `total relevant` - sum of relevant lookalikes across all seeds (out of `seeds_judged × K`).

## Seed × vendor matrix

Cell value = Precision@10. `-` means the vendor has not been run on that seed yet, or returned fewer than K results.

### B2B SaaS

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Pylon | 100.0% | 70.0% | 40.0% | 60.0% | 100.0% |
| Default | 100.0% | 40.0% | 44.4% | 70.0% | 80.0% |

### Devtools

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Trigger.dev | 100.0% | 30.0% | 11.1% | 0.0% | 100.0% |
| Liveblocks | 60.0% | 40.0% | 10.0% | 30.0% | 80.0% |

### E-commerce

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Postscript | 90.0% | 80.0% | 10.0% | 80.0% | 90.0% |
| Recharge | 100.0% | 80.0% | 20.0% | 80.0% | 10.0% |

### Healthtech

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Hinge Health | 90.0% | 60.0% | 20.0% | 100.0% | 100.0% |
| Aledade | 20.0% | 60.0% | 10.0% | 100.0% | 70.0% |

### Home Services SaaS / Chains

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Roto-Rooter | 87.5% | 70.0% | 30.0% | 100.0% | 100.0% |
| ServiceTitan | 100.0% | 70.0% | 10.0% | 100.0% | 100.0% |

### Local Trades

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| JDV Electric | 100.0% | 90.0% | 70.0% | - | 60.0% |
| Point Loma Home Pros | 100.0% | 100.0% | 40.0% | 70.0% | 90.0% |

### Real Estate

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| BLVD Residential | 100.0% | 100.0% | 60.0% | 20.0% | 50.0% |
| Emerge Living | 100.0% | 100.0% | 70.0% | 30.0% | 0.0% |

## TAM Recall (coverage vs an independent reference set)

A second, **judge-free** benchmark answering the opposite question to Precision@K:
not "are the few you returned good?" but "build me my whole TAM." Each vendor
returns its deepest list (fetch depth 100); we match it against a
**frozen, vendor-independent reference set** (G2 category rosters resolved to
canonical domains) and report **Recall@K** = the fraction of that reference set
surfaced in the vendor's top K. A "hit" is deterministic reference-set membership —
**no LLM judge**. Recall is relative to the reference set, **not absolute TAM**.

### Recall leaderboard (vendor totals, mean across seeds)

| Rank | Vendor | R@10 | R@50 | R@100 |
|------|--------|------|------|-------|
| 1 | ocean | 2.2% | 6.5% | 8.3% |
| 2 | openfunnel | 1.9% | 4.8% | 8.2% |
| 3 | parallel | 1.2% | 4.2% | 5.7% |
| 4 | predictleads | 2.9% | 5.6% | 5.6% |
| 5 | exa | 0.5% | 1.1% | 2.0% |

- `R@10/50/100` - mean Recall@K across scored seeds; `R@100` is the headline.

### Seed × vendor Recall@100 matrix

Cell = Recall@100. `-` = vendor not run / errored on that seed. Reference-set size in parentheses.

### B2B SaaS

| Seed (gold) | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Pylon (107) | 10.3% | 11.2% | 0.9% | 7.5% | 9.3% |

### E-commerce

| Seed (gold) | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Postscript (431) | 5.1% | 4.9% | 1.6% | 2.8% | 4.6% |
| Recharge (135) | 11.8% | 13.3% | 5.2% | 7.4% | 4.4% |

### Home Services SaaS / Chains

| Seed (gold) | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| ServiceTitan (499) | 5.6% | 3.8% | 0.4% | 5.0% | 3.8% |

Full methodology + the frozen, content-hashed gold sets: https://github.com/openfunnel/gtm-bench (`scripts/lookalike/RECALL_METHODOLOGY.md`).

## Methodology

1. Fix a canonical list of seed companies across 7 verticals. Each seed has a name, domain, and short description (the inputs every vendor sees).
2. For every (seed, vendor) cell, call the vendor's lookalike API with the seed company and `K = 10`. Capture the ordered top-K result list and credit cost.
3. Feed the seed + each returned candidate into the LLM judge (`majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini`). Judge returns a binary relevance label per candidate, with a one-line rationale. Same prompt and rubric across all vendors.
4. Cell value = relevant_count / K. Aggregate per vendor as `avg_precision_at_k` (mean across seeds with ≥K results).
5. `-` semantics: either the vendor returned fewer than K candidates (e.g. tail seeds where catalog is thin), or the (seed, vendor) pair has not been run / judged yet.

## Reproducibility

Every cell on this leaderboard is reproducible end-to-end from the public
mirror at https://github.com/openfunnel/gtm-bench. Each `data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json`
contains the **literal HTTP request/response** sent to the vendor (auth
headers redacted) plus the **literal LLM judge prompt + raw response** for
every candidate. Replay any `vendor_calls[]` entry with your own
credentials to verify the vendor's output, or replay
`judge_calls[].messages` against your own LLM to measure judge bias or
drift across model versions.

## Known limitations

- **Judge bias.** A single LLM judge has its own priors about what "similar" means. We publish the judge model and the full rationale so the bias is auditable, but expect ±5% drift if you swap models.
- **K-tail vs precision tradeoff.** Vendors who can only return small result sets win Precision@K by default (they don't have noisy tail entries). We balance this by requiring ≥K results for the cell to score.
- **Vertical balance.** 14 seeds spread across 7 verticals — modern B2B SaaS, devtools, DTC ecom, healthcare networks, vertical SaaS / national chains for home-services, independent local trades (HVAC / plumbing / electrical), and multifamily real-estate operators. Lets the matrix exercise both tech-stack-style matching and SIC/NAICS firmographic matching.
- **Precision ≠ recall.** Precision@10 measures whether the few results a vendor returned are clean, not how much of the market it covers. The **TAM Recall** section above is the complementary coverage metric — recall against a frozen, vendor-independent reference set (judge-free). Read them together: a vendor can be precise but narrow, or broad but noisy.

## License

CC-BY-4.0. Attribute "OpenFunnel Bench" and link back when redistributing.
03 · lookalike · live

Lookalike Benchmark

14 seed companies × 5 vendors - each vendor returns its top 10 lookalikes per seed. An LLM judge (majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini) scores every returned company for relevance. Cell value is Precision@10.

open data + codeEvery cell on this leaderboard is reproducible from a literal HTTP envelope + LLM judge prompt committed in openfunnel/gtm-bench · verify any number end-to-end.github →

Lookalike Precision@10

Scan the company type, then compare vendors. Each cell is the % of a vendor's top 10similar-company suggestions the LLM judge marked relevant.

exampleWant more B2B customer support platforms? Find that row, then compare which vendor returns the most relevant similar companies.
leaderboards/lookalike/lookalike-2026-q214 examples · 5 vendors
how to readcell = Precision@K (K = 10) — % of vendor's top 10 lookalikes judged relevant🥇🥈🥉top 3 vendors per company typeN/Avendor not yet run (or returned fewer than K results)click any scored cell to view the companies that vendor returned
#Company typeOpenFunnelOcean.ioExaParallelPredictLeads
01Open-source background jobs / long-running workflow platform aimed at TypeScript devsDevtoolsseed exampleTrigger.devtrigger.dev
02Family-owned independent residential electrical contractor serving Delaware, Chester, Lower Montgomery, and Philadelphia counties; panel repair/replacement, wiring, lighting, EV charger installs for homeownersLocal Tradesseed exampleJDV Electricjdvelectric.comN/A
03SMS marketing platform for DTC e-commerce brands on ShopifyE-commerceseed examplePostscriptpostscript.io
04National plumbing and drain-cleaning services franchise — residential + commercialHome Services SaaS / Chainsseed exampleRoto-Rooterrotorooter.com
05Subscription billing and customer lifecycle platform for Shopify DTC merchantsE-commerceseed exampleRechargerechargepayments.com
06Regional multi-trade home services company offering HVAC, plumbing, electrical, and water filtration to homeowners in the greater San Diego areaLocal Tradesseed examplePoint Loma Home Prospointlomahomepros.com
07Digital musculoskeletal (MSK) care platform — app-guided physical therapy with wearable motion sensors and human coaches for back, joint, and chronic painHealthtechseed exampleHinge Healthhingehealth.com
08Multifamily property management company operating apartment communities; focused on on-site operations, resident experience, and asset performance for institutional ownersReal Estateseed exampleBLVD Residentialblvdresidential.com
09Value-based care enablement platform for independent primary care practices and ACOsHealthtechseed exampleAledadealedade.com
10B2B customer support platform built around shared Slack/Teams channels with enterprise customersB2B SaaSseed examplePylonusepylon.com
11RevOps tool — form, routing, scheduling, and lead enrichment for inbound pipelinesB2B SaaSseed exampleDefaultdefault.com
12Multifamily property management company operating apartment communities across the Sunbelt; focused on resident experience and asset operations for institutional ownersReal Estateseed exampleEmerge Livingemergeliving.com
13Vertical SaaS for trades — dispatch, CRM, billing for HVAC, plumbing, and electrical contractorsHome Services SaaS / Chainsseed exampleServiceTitanservicetitan.com
14Realtime collaboration primitives for product teams — presence, multiplayer cursors, threaded commentsDevtoolsseed exampleLiveblocksliveblocks.io

TAM Recall - coverage of the addressable market

Precision@10above asks "of the few you returned, how many are good?" This asks the opposite: "build me my whole TAM". Each vendor returns its deepest list (fetch depth 100); we match against a frozen, vendor-independent reference set resolved to canonical domains and report Recall@10/50/100. No LLM judge - a hit is deterministic reference-set membership. Recall is relative to that reference set, never claimed as absolute TAM. Top coverage: Ocean.io at 8.3% avg Recall@100.

exampleWant the full market for B2B customer support platforms? Pylon is the seed example for that market; compare which vendor recovers the most companies from the frozen reference set.
leaderboards/lookalike/tam-recall4 markets · 5 vendors · depth 100
#VendorR@10R@50R@100
🥇Ocean.io2.2%6.5%8.3%
🥈OpenFunnel1.9%4.8%8.2%
🥉Parallel1.2%4.2%5.7%
04PredictLeads2.9%5.6%5.6%
05Exa0.5%1.1%2.0%
how to readcell = Recall@100- % of the market's frozen reference set the vendor surfaced in its top 100 resultsN/Avendor not run / errored on this seed
#Market tested / seed exampleOcean.ioOpenFunnelParallelPredictLeadsExa
01SMS marketing platforms for Shopify and DTC brandsE-commerceseed examplePostscriptpostscript.io431 company reference set - G2 SMS marketing roster
02B2B customer support and conversational support platformsB2B SaaSseed examplePylonusepylon.com107 company reference set - G2 conversational support roster
03subscription management and recurring billing for e-commerceE-commerceseed exampleRechargerechargepayments.com135 company reference set - G2 subscription management roster
04field service management software for trades businessesHome Services SaaS / Chainsseed exampleServiceTitanservicetitan.com499 company reference set - G2 field service management roster
leaderboards/lookalike/tam-recall/fairnesshome vendor: openfunnel

[01.d.i] Anti-bias check: the reference set is pooled from public sources, never from a benchmarked vendor. If it were biased toward openfunnel, most gold companies would be found only by openfunnel. Instead most are surfaced by ≥2 independent vendors (methodology §7).

SeedGoldFound by ≥2Found by anyopenfunnel-only
Postscript43116 (27.1%)5913
Pylon10710 (37.0%)273
Recharge1359 (21.9%)418
ServiceTitan49921 (29.6%)7120

Can an AI agent actually use this vendor?

Same agent-readiness lens as the technographics benchmark. Vendors that let an autonomous agent obtain a working key on its own (OTP-via-email or device-code) work end-to-end without human handoff.

leaderboards/lookalike/agent-readiness3/5 agent-ready
VendorAgent sign-upAPI docsllms.txtMCPTry it
OpenFunnel✓ readyotp-emaildocs ↗llms.txt ↗mcp ↗sign up →
Ocean.iomanual signupdocs ↗llms.txt ↗mcp ↗
Exa✓ readyotp-emaildocs ↗llms.txt ↗mcp ↗sign up →
Parallel✓ readyotp-emaildocs ↗llms.txt ↗mcp ↗sign up →
PredictLeadsmanual signupdocs ↗mcp ↗
[02] methodology, metric definitions, and known limitations+

How the matrix is built

  1. Fix a canonical list of 14 seed companies across 7 verticals (b2b-saas, devtools, ecommerce, healthtech, home-services SaaS / chains, local trades, real estate). Each seed has a name, domain, and short description - the exact inputs every vendor sees.
  2. For every (seed, vendor) cell, call the vendor's lookalike API with K = 10. Capture the ordered top-K result list and credit cost.
  3. Feed the seed + each returned candidate to the LLM judge (majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini). Judge returns a binary relevance label per candidate plus a one-line rationale. Identical prompt and rubric across all vendors.
  4. Cell value = relevant_count / K. Aggregate per vendor as avg_precision_at_k (mean across seeds with ≥K results returned).
  5. A vendor that returns fewer than K candidates for a seed has the cell rendered as - rather than scored on a truncated denominator. Keeps cells comparable.

What each metric means

  • Precision@K · cell value. Of the top 10 lookalikes a vendor returned for the seed, the fraction the LLM judge labeled relevant. The buyer's metric - "of the K I paid for, how many are usable".
  • avg Precision@K · headline ranking metric. Mean Precision@K across all judged seeds. Higher is better.
  • total relevant · sum of relevant lookalikes across all seeds. Reach metric - useful when comparing two vendors with similar precision.
  • cost per relevant · vendor credit spend ÷ total relevant lookalikes. The economics metric.

Why an LLM judge instead of a hand-labeled set

A fully hand-labeled lookalike set would require labeling K × seeds × vendors candidates (10 × 14 × 5 = 700judgements) every time we re-run a snapshot. That doesn't scale, and it isn't how the buyer actually evaluates a vendor in the wild - the buyer reads the list and decides "close enough to my ICP, yes or no".

The judge approximates that decision with a consistent rubric: given the seed's name, domain, and description, is this returned candidate plausibly the same kind of company a B2B seller would target as a lookalike? The judge's rationale is persisted alongside the binary label so any cell can be audited by a human in seconds. When the model swaps, the cohort re-runs with the same prompt; deltas are visible.

How each vendor was queried

  • OpenFunnel· embeddings over the OpenFunnel company index with the seed's domain as the query; optional graph re-rank using shared jobs / tech / funding co-signals. Top-K by cosine score.
  • Ocean.io · /companies/lookalikes with seed domain. Default similarity model, K = 10.
  • Exa · findSimilar with seed domain. Neural web-content embedding model, K = 10. Filters down to results that look like company sites (heuristics on result URL/title).
  • Parallel· agentic research task: "find 10 companies similar to {seed}". The agent decides its own retrieval strategy. We record the final ranked list.
  • PredictLeads · /api/v3/companies/{domain}/similar_companies; ranks via shared tech, news, and jobs co-signals.

What this benchmark does not tell you

  • Judge bias.A single LLM judge has its own priors about what "similar" means. We publish the judge model and the full rationale so the bias is auditable, but expect ±5% drift across model versions.
  • K-tail vs precision tradeoff. Vendors with thin catalogs can win Precision@K by refusing to return tail results. We mitigate by requiring ≥K results for a cell to be scored - thin cells render as -, not a high Precision number with a small denominator.
  • No recall metric.Precision@K doesn't measure how many real lookalikes the vendor missed. That requires a held-out ground truth set we don't yet have.
  • Domain-only seeding. All vendors receive the same compact input (name + domain + 1-line description). Vendors that benefit from richer inputs (e.g. headcount filters, ARR band, geography) may underperform their in-product behavior. The flip side is that this matches how an agent would query them.
  • Cohort coverage. 10 seeds, 2 per vertical. Spans modern B2B SaaS / devtools / DTC ecom / healthcare networks and traditional trades (HVAC, plumbing) - the matrix exercises both tech-stack-driven matching and SIC/NAICS-style firmographic matching.

Verify any number end-to-end

The full benchmark — runner code, judge prompt, leaderboard snapshot, and per-cell raw audit trail (the literal HTTP request/response we sent each vendor + the literal LLM judge prompt/response per candidate) — is mirrored in a public repo: openfunnel/gtm-bench. Auth headers are scrubbed via an allow-list; everything else is verbatim.

To audit a single cell, open data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json in that repo and replay any of the vendor_calls[] with your own credentials, or re-score with your own LLM by replaying judge_calls[].messages against any OpenAI-v1 compatible model — useful for measuring judge bias or drift across model versions.

Inclusion queue and how to request a provider

Live: OpenFunnel, Ocean.io, Exa, Parallel, PredictLeads.

Requested but not directly comparable: ZoomInfo (company lookalikes are sales-gated, no self-serve API), Clay (lookalike runs inside Clay tables), Apollo (no public lookalike endpoint), Lusha (`/v3/companies/lookalike` requires 5-100 seeds per request, incompatible with the per-seed cell unit of this benchmark).

Under review next: Common Room, Koala, LeadGenius, 6sense, Demandbase.

To request a provider, email founders@openfunnel.dev with a link to the public API docs and pricing page.