# Lookalike Benchmark
Active dataset: `lookalike-2026-q2`
14 seed companies across 7 verticals × 5 vendors. Each vendor is asked for its top `K = 10` lookalikes per seed. An LLM judge (`majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini`) scores each returned company for relevance; the cell value is **Precision@K** — relevant / K.
## Endpoints
- JSON API: https://benchmarks.openfunnel.dev/api/leaderboards/lookalikes
- Markdown agent docs: https://benchmarks.openfunnel.dev/llms.txt
- OpenAPI 3.1 spec: https://benchmarks.openfunnel.dev/openapi.json
- MCP server discovery: https://benchmarks.openfunnel.dev/.well-known/mcp.json
- Public data + code (reproduce any cell): https://github.com/openfunnel/gtm-bench
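A minimal sketch of reading the JSON API from Python with only the standard library. The response schema is not described on this page, so `describe` is a hypothetical helper that lists top-level keys without assuming any field names; the OpenAPI spec listed above is the place to look for the real structure.

```python
import json
import urllib.request

LEADERBOARD_URL = "https://benchmarks.openfunnel.dev/api/leaderboards/lookalikes"


def fetch_leaderboard(url: str = LEADERBOARD_URL, timeout: float = 10.0):
    """GET the leaderboard endpoint and return the decoded JSON payload."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def describe(payload) -> list[str]:
    """Summarise a payload without assuming its schema (hypothetical helper)."""
    if isinstance(payload, dict):
        return sorted(payload.keys())
    if isinstance(payload, list):
        return [f"list of {len(payload)} items"]
    return [type(payload).__name__]
```

Calling `describe(fetch_leaderboard())` prints the top-level shape of the response, which is a reasonable first step before relying on any specific field.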
## Vendors live (5)
`openfunnel` (OpenFunnel), `ocean` (Ocean.io), `exa` (Exa), `parallel` (Parallel), `predictleads` (PredictLeads)
## Vendors not surveyed
- `ZoomInfo` - company lookalike API is not self-serve; it is gated behind a sales contract
- `Clay` - lookalike runs inside Clay tables, no standalone API
- `Apollo` - no public lookalike endpoint
- `Lusha` - /v3/companies/lookalike requires 5-100 seed companies per request; benchmark scores one seed per cell
## Leaderboard (vendor totals)
| Rank | Vendor | Seeds judged | avg Precision@K | total relevant |
|------|--------|--------------|-----------------|----------------|
| 1 | openfunnel | 14/14 | 89.1% | 123 |
| 2 | predictleads | 14/14 | 73.6% | 103 |
| 3 | ocean | 14/14 | 70.7% | 99 |
| 4 | parallel | 13/14 | 64.6% | 84 |
| 5 | exa | 14/14 | 31.8% | 44 |
- `avg Precision@K` - mean Precision@10 across all seeds the vendor returned ≥K results for. Headline metric.
- `total relevant` - sum of relevant lookalikes across all seeds (out of `seeds_judged × K`).
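As a worked check of the two definitions above, the sketch below recomputes the `parallel` row from its matrix cells. The relevant counts are derived here as cell value × K, which assumes every scored cell used the full denominator K = 10; the function name `vendor_totals` is illustrative, not part of the benchmark code.

```python
from typing import Optional

K = 10

# Relevant counts for `parallel`, in matrix row order (Pylon ... Emerge Living).
# None marks the `-` cell (JDV Electric), which is excluded from the mean.
parallel_relevant: list[Optional[int]] = [6, 7, 0, 3, 8, 8, 10, 10, 10, 10, None, 7, 2, 3]


def vendor_totals(relevant_counts: list[Optional[int]], k: int = K):
    """Return (seeds_judged, avg_precision_at_k, total_relevant) for one vendor."""
    scored = [count for count in relevant_counts if count is not None]
    seeds_judged = len(scored)
    total_relevant = sum(scored)
    # With a fixed denominator K, the mean of per-seed precisions equals
    # total_relevant / (seeds_judged * K).
    avg_precision = total_relevant / (seeds_judged * k) if seeds_judged else 0.0
    return seeds_judged, avg_precision, total_relevant


seeds_judged, avg_precision, total_relevant = vendor_totals(parallel_relevant)
print(f"{seeds_judged}/14  {avg_precision:.1%}  {total_relevant}")  # 13/14  64.6%  84
```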
## Seed × vendor matrix
Cell value = Precision@10. `-` means the vendor has not been run on that seed yet, or returned fewer than K results.
### B2B SaaS
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Pylon | 100.0% | 70.0% | 40.0% | 60.0% | 100.0% |
| Default | 100.0% | 40.0% | 44.4% | 70.0% | 80.0% |
### Devtools
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Trigger.dev | 100.0% | 30.0% | 11.1% | 0.0% | 100.0% |
| Liveblocks | 60.0% | 40.0% | 10.0% | 30.0% | 80.0% |
### E-commerce
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Postscript | 90.0% | 80.0% | 10.0% | 80.0% | 90.0% |
| Recharge | 100.0% | 80.0% | 20.0% | 80.0% | 10.0% |
### Healthtech
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Hinge Health | 90.0% | 60.0% | 20.0% | 100.0% | 100.0% |
| Aledade | 20.0% | 60.0% | 10.0% | 100.0% | 70.0% |
### Home Services SaaS / Chains
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Roto-Rooter | 87.5% | 70.0% | 30.0% | 100.0% | 100.0% |
| ServiceTitan | 100.0% | 70.0% | 10.0% | 100.0% | 100.0% |
### Local Trades
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| JDV Electric | 100.0% | 90.0% | 70.0% | - | 60.0% |
| Point Loma Home Pros | 100.0% | 100.0% | 40.0% | 70.0% | 90.0% |
### Real Estate
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| BLVD Residential | 100.0% | 100.0% | 60.0% | 20.0% | 50.0% |
| Emerge Living | 100.0% | 100.0% | 70.0% | 30.0% | 0.0% |
## TAM Recall (coverage vs an independent reference set)
A second, **judge-free** benchmark answering the opposite question to Precision@K:
not "are the few you returned good?" but "build me my whole TAM." Each vendor
returns its deepest list (fetch depth 100); we match it against a
**frozen, vendor-independent reference set** (G2 category rosters resolved to
canonical domains) and report **Recall@K** = the fraction of that reference set
surfaced in the vendor's top K. A "hit" is deterministic reference-set membership —
**no LLM judge**. Recall is relative to the reference set, **not absolute TAM**.
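The Recall@K definition above can be sketched in a few lines. The `normalize` function is a simplified, hypothetical stand-in for the benchmark's canonical-domain resolution, and the `.example` domains are made up for illustration.

```python
def normalize(domain: str) -> str:
    """Hypothetical canonicalisation: lowercase and drop a leading 'www.'."""
    return domain.strip().lower().removeprefix("www.")


def recall_at_k(ranked_domains: list[str], reference_set: set[str], k: int) -> float:
    """Fraction of the reference set that appears in the vendor's top k results."""
    reference = {normalize(d) for d in reference_set}
    if not reference:
        raise ValueError("reference set must not be empty")
    top_k = {normalize(d) for d in ranked_domains[:k]}
    # A hit is plain set membership: no judge, no fuzzy matching.
    return len(top_k & reference) / len(reference)


reference = {"alpha.example", "beta.example", "gamma.example", "delta.example"}
ranked = ["www.Alpha.example", "other.example", "beta.example", "noise.example", "gamma.example"]

print(recall_at_k(ranked, reference, 2))  # 0.25
print(recall_at_k(ranked, reference, 5))  # 0.75
```

Because the denominator is the size of the reference set rather than K, a vendor fetched to depth 100 against a reference set of several hundred companies cannot reach 100% recall.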
### Recall leaderboard (vendor totals, mean across seeds)
| Rank | Vendor | R@10 | R@50 | R@100 |
|------|--------|------|------|-------|
| 1 | ocean | 2.2% | 6.5% | 8.3% |
| 2 | openfunnel | 1.9% | 4.8% | 8.2% |
| 3 | parallel | 1.2% | 4.2% | 5.7% |
| 4 | predictleads | 2.9% | 5.6% | 5.6% |
| 5 | exa | 0.5% | 1.1% | 2.0% |
- `R@10/50/100` - mean Recall@K across scored seeds; `R@100` is the headline.
### Seed × vendor Recall@100 matrix
Cell = Recall@100. `-` = vendor not run / errored on that seed. Reference-set size in parentheses.
### B2B SaaS
| Seed (gold) | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Pylon (107) | 10.3% | 11.2% | 0.9% | 7.5% | 9.3% |
### E-commerce
| Seed (gold) | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Postscript (431) | 5.1% | 4.9% | 1.6% | 2.8% | 4.6% |
| Recharge (135) | 11.8% | 13.3% | 5.2% | 7.4% | 4.4% |
### Home Services SaaS / Chains
| Seed (gold) | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| ServiceTitan (499) | 5.6% | 3.8% | 0.4% | 5.0% | 3.8% |
Full methodology + the frozen, content-hashed gold sets: https://github.com/openfunnel/gtm-bench (`scripts/lookalike/RECALL_METHODOLOGY.md`).
## Methodology
1. Fix a canonical list of seed companies across 7 verticals. Each seed has a name, domain, and short description (the inputs every vendor sees).
2. For every (seed, vendor) cell, call the vendor's lookalike API with the seed company and `K = 10`. Capture the ordered top-K result list and credit cost.
3. Feed the seed + each returned candidate into the LLM judge (`majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini`). Judge returns a binary relevance label per candidate, with a one-line rationale. Same prompt and rubric across all vendors.
4. Cell value = relevant_count / K. Aggregate per vendor as `avg_precision_at_k` (mean across seeds with ≥K results).
5. `-` semantics: either the vendor returned fewer than K candidates (e.g. tail seeds where catalog is thin), or the (seed, vendor) pair has not been run / judged yet.
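Steps 3 to 5 can be sketched as follows. This assumes `majority(n=3)` means a simple majority over three binary votes per candidate; the function names and the vote data are illustrative and not taken from the benchmark's runner code.

```python
from typing import Optional

K = 10


def majority_relevant(votes: list[bool]) -> bool:
    """True when more than half of the judge votes say 'relevant'."""
    return sum(votes) > len(votes) / 2


def score_cell(candidate_votes: list[list[bool]], k: int = K) -> Optional[float]:
    """Precision@K for one (seed, vendor) cell, or None when fewer than k results."""
    if len(candidate_votes) < k:
        return None  # rendered as `-` instead of using a truncated denominator
    relevant_count = sum(majority_relevant(votes) for votes in candidate_votes[:k])
    return relevant_count / k


def render(cell: Optional[float]) -> str:
    """Format a cell the way the matrix displays it."""
    return "-" if cell is None else f"{cell:.1%}"


# Ten candidates: seven with a 2-of-3 'relevant' vote, three with 1-of-3.
votes = [[True, True, False]] * 7 + [[False, False, True]] * 3
print(render(score_cell(votes)))      # 70.0%
print(render(score_cell(votes[:9])))  # -
```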
## Reproducibility
Every cell on this leaderboard is reproducible end-to-end from the public
mirror at https://github.com/openfunnel/gtm-bench. Each `data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json`
contains the **literal HTTP request/response** sent to the vendor (auth
headers redacted) plus the **literal LLM judge prompt + raw response** for
every candidate. Replay any `vendor_calls[]` entry with your own
credentials to verify the vendor's output, or replay
`judge_calls[].messages` against your own LLM to measure judge bias or
drift across model versions.
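A minimal sketch of pulling the judge prompts out of one audit file for replay. Only the path layout and the `vendor_calls[]` and `judge_calls[].messages` names come from the text above; the assumption that they sit at the top level of a JSON object, and the helper names, are hypothetical, so check a real file before relying on this.

```python
import json
from pathlib import Path


def raw_path(repo_root: str, dataset: str, seed: str, vendor: str) -> Path:
    """Build the audit-file path following the layout described above."""
    return Path(repo_root) / "data" / "lookalike-runs" / dataset / seed / f"{vendor}.raw.json"


def load_audit(path: Path) -> dict:
    """Read one audit file from a local clone of the public repo."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def judge_prompts(audit: dict) -> list[list[dict]]:
    """Return the message list of every judge call (assumed top-level key)."""
    return [call["messages"] for call in audit.get("judge_calls", [])]
```

Each entry returned by `judge_prompts` can then be sent unchanged to the model you want to compare against, and the resulting labels diffed with the published ones.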
## Known limitations
- **Judge bias.** An LLM judge, even a 3-model majority vote, has its own priors about what "similar" means. We publish the judge models and the full rationale so the bias is auditable, but expect ±5% drift if you swap models.
- **K-tail vs precision tradeoff.** Vendors who can only return small result sets win Precision@K by default (they don't have noisy tail entries). We balance this by requiring ≥K results for the cell to score.
- **Vertical balance.** 14 seeds spread across 7 verticals: modern B2B SaaS, devtools, DTC ecom, healthcare networks, vertical SaaS / national chains for home-services, independent local trades (HVAC / plumbing / electrical), and multifamily real-estate operators. This lets the matrix exercise both tech-stack-style matching and SIC/NAICS firmographic matching.
- **Precision ≠ recall.** Precision@10 measures whether the few results a vendor returned are clean, not how much of the market it covers. The **TAM Recall** section above is the complementary coverage metric — recall against a frozen, vendor-independent reference set (judge-free). Read them together: a vendor can be precise but narrow, or broad but noisy.
## License
CC-BY-4.0. Attribute "OpenFunnel Bench" and link back when redistributing.
## Lookalike Precision@10
Scan the company type, then compare vendors. Each cell is the % of a vendor's top 10 similar-company suggestions the LLM judge marked relevant.
## TAM Recall - coverage of the addressable market
Precision@10 above asks "of the few you returned, how many are good?" This asks the opposite: "build me my whole TAM". Each vendor returns its deepest list (fetch depth 100); we match against a frozen, vendor-independent reference set resolved to canonical domains and report Recall@10/50/100. No LLM judge - a hit is deterministic reference-set membership. Recall is relative to that reference set, never claimed as absolute TAM. Top coverage: Ocean.io at 8.3% avg Recall@100.
Markets tested, with the seed and reference set behind each. Per-vendor Recall@100 for these seeds is listed in the Seed × vendor Recall@100 matrix above.
| # | Market tested | Seed | Reference set |
|---|---|---|---|
| 01 | SMS marketing platforms for Shopify and DTC brands | Postscript | 431 companies, G2 SMS marketing roster |
| 02 | B2B customer support and conversational support platforms | Pylon | 107 companies, G2 conversational support roster |
| 03 | Subscription management and recurring billing for e-commerce | Recharge | 135 companies, G2 subscription management roster |
| 04 | Field service management software for trades businesses | ServiceTitan | 499 companies, G2 field service management roster |
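Anti-bias check: the reference set is pooled from public sources, never from a benchmarked vendor. If it were biased toward openfunnel, most gold companies found by any vendor would be found only by openfunnel. Instead, openfunnel-only hits are a minority of the found companies for every seed, and 21.9-37.0% of them are surfaced by ≥2 independent vendors (methodology §7). Percentages in the table are relative to "Found by any".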
| Seed | Gold | Found by ≥2 | Found by any | openfunnel-only |
|---|---|---|---|---|
| Postscript | 431 | 16 (27.1%) | 59 | 13 |
| Pylon | 107 | 10 (37.0%) | 27 | 3 |
| Recharge | 135 | 9 (21.9%) | 41 | 8 |
| ServiceTitan | 499 | 21 (29.6%) | 71 | 20 |
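The overlap columns in this table can be derived from per-vendor hit sets roughly as follows. The function name, the dictionary layout, and the `.example` domains are illustrative assumptions; the final line only re-checks the Postscript percentage from the table.

```python
from collections import Counter


def overlap_stats(hits_by_vendor: dict[str, set[str]], vendor: str) -> dict[str, float]:
    """Count gold companies found by any vendor, by >=2 vendors, and by `vendor` alone."""
    found_counts = Counter(domain for hits in hits_by_vendor.values() for domain in hits)
    found_by_any = len(found_counts)
    found_by_2plus = sum(1 for n in found_counts.values() if n >= 2)
    vendor_only = sum(1 for d in hits_by_vendor.get(vendor, set()) if found_counts[d] == 1)
    share = found_by_2plus / found_by_any if found_by_any else 0.0
    return {
        "found_by_any": found_by_any,
        "found_by_2plus": found_by_2plus,
        "vendor_only": vendor_only,
        "share_2plus": share,
    }


hits = {
    "openfunnel": {"a.example", "b.example", "c.example"},
    "ocean": {"b.example", "c.example", "d.example"},
    "exa": {"c.example"},
}
stats = overlap_stats(hits, "openfunnel")
print(stats["found_by_any"], stats["found_by_2plus"], stats["vendor_only"])  # 4 2 1

# Postscript row: 16 of the 59 found companies were surfaced by >=2 vendors.
print(f"{16 / 59:.1%}")  # 27.1%
```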
## Can an AI agent actually use this vendor?
Same agent-readiness lens as the technographics benchmark. Vendors that let an autonomous agent obtain a working key on its own (OTP-via-email or device-code) work end-to-end without human handoff.
| Vendor | Agent sign-up | API docs | llms.txt | MCP | Try it |
|---|---|---|---|---|---|
| OpenFunnel | ✓ ready (otp-email) | docs ↗ | llms.txt ↗ | mcp ↗ | sign up → |
| Ocean.io | manual signup | docs ↗ | llms.txt ↗ | mcp ↗ | — |
| Exa | ✓ ready (otp-email) | docs ↗ | llms.txt ↗ | mcp ↗ | sign up → |
| Parallel | ✓ ready (otp-email) | docs ↗ | llms.txt ↗ | mcp ↗ | sign up → |
| PredictLeads | manual signup | docs ↗ | — | mcp ↗ | — |
## How the matrix is built
- Fix a canonical list of 14 seed companies across 7 verticals (b2b-saas, devtools, ecommerce, healthtech, home-services SaaS / chains, local trades, real estate). Each seed has a name, domain, and short description - the exact inputs every vendor sees.
- For every (seed, vendor) cell, call the vendor's lookalike API with `K = 10`. Capture the ordered top-K result list and credit cost.
- Feed the seed + each returned candidate to the LLM judge (`majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini`). Judge returns a binary relevance label per candidate plus a one-line rationale. Identical prompt and rubric across all vendors.
- Cell value = `relevant_count / K`. Aggregate per vendor as `avg_precision_at_k` (mean across seeds with ≥K results returned).
- A vendor that returns fewer than K candidates for a seed has the cell rendered as `-` rather than scored on a truncated denominator. This keeps cells comparable.
## What each metric means
- `Precision@K` · cell value. Of the top 10 lookalikes a vendor returned for the seed, the fraction the LLM judge labeled relevant. The buyer's metric - "of the K I paid for, how many are usable".
- `avg Precision@K` · headline ranking metric. Mean Precision@K across all judged seeds. Higher is better.
- `total relevant` · sum of relevant lookalikes across all seeds. Reach metric - useful when comparing two vendors with similar precision.
- `cost per relevant` · vendor credit spend ÷ total relevant lookalikes. The economics metric.
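The `cost per relevant` definition is a single division; the sketch below adds the one edge case worth handling. The credit figures are invented for illustration, since no vendor spend is published on this page.

```python
from typing import Optional


def cost_per_relevant(credits_spent: float, total_relevant: int) -> Optional[float]:
    """Vendor credit spend divided by relevant lookalikes; None when nothing was relevant."""
    if total_relevant == 0:
        return None  # undefined rather than infinite
    return credits_spent / total_relevant


# Hypothetical numbers: 200 credits spent, 80 relevant lookalikes returned.
print(cost_per_relevant(200, 80))  # 2.5
```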
## Why an LLM judge instead of a hand-labeled set
A fully hand-labeled lookalike set would require labeling K × seeds × vendors candidates (10 × 14 × 5 = 700 judgements) every time we re-run a snapshot. That doesn't scale, and it isn't how the buyer actually evaluates a vendor in the wild - the buyer reads the list and decides "close enough to my ICP, yes or no".
The judge approximates that decision with a consistent rubric: given the seed's name, domain, and description, is this returned candidate plausibly the same kind of company a B2B seller would target as a lookalike? The judge's rationale is persisted alongside the binary label so any cell can be audited by a human in seconds. When the model swaps, the cohort re-runs with the same prompt; deltas are visible.
## How each vendor was queried
- OpenFunnel · embeddings over the OpenFunnel company index with the seed's domain as the query; optional graph re-rank using shared jobs / tech / funding co-signals. Top-K by cosine score.
- Ocean.io · `/companies/lookalikes` with seed domain. Default similarity model, K = 10.
- Exa · `findSimilar` with seed domain. Neural web-content embedding model, K = 10. Filters down to results that look like company sites (heuristics on result URL/title).
- Parallel · agentic research task: "find 10 companies similar to {seed}". The agent decides its own retrieval strategy. We record the final ranked list.
- PredictLeads · `/api/v3/companies/{domain}/similar_companies`; ranks via shared tech, news, and jobs co-signals.
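For readers unfamiliar with the phrase "Top-K by cosine score", here is a generic sketch of that ranking step in plain Python. It is not any vendor's implementation: the two-dimensional vectors, company names, and function names are made up, and real systems use high-dimensional embeddings and an approximate-nearest-neighbour index rather than a full scan.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(seed_vec: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the k index entries with the highest cosine similarity to the seed."""
    scored = sorted(((cosine(seed_vec, vec), name) for name, vec in index.items()), reverse=True)
    return [name for _, name in scored[:k]]


# Toy index of hypothetical company embeddings.
index = {"company-a": [1.0, 0.0], "company-b": [0.0, 1.0], "company-c": [1.0, 1.0]}
print(top_k_similar([1.0, 0.0], index, k=2))  # ['company-a', 'company-c']
```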
## What this benchmark does not tell you
- **Judge bias.** An LLM judge, even a 3-model majority vote, has its own priors about what "similar" means. We publish the judge models and the full rationale so the bias is auditable, but expect ±5% drift across model versions.
- **K-tail vs precision tradeoff.** Vendors with thin catalogs can win Precision@K by refusing to return tail results. We mitigate by requiring ≥K results for a cell to be scored - thin cells render as `-`, not a high Precision number with a small denominator.
- **Precision ≠ recall.** Precision@K doesn't measure how many real lookalikes the vendor missed. The TAM Recall section covers that separately, as judge-free recall against a frozen, vendor-independent reference set.
- **Domain-only seeding.** All vendors receive the same compact input (name + domain + 1-line description). Vendors that benefit from richer inputs (e.g. headcount filters, ARR band, geography) may underperform their in-product behavior. The flip side is that this matches how an agent would query them.
- **Cohort coverage.** 14 seeds, 2 per vertical. Spans modern B2B SaaS / devtools / DTC ecom / healthcare networks and traditional trades (HVAC, plumbing) - the matrix exercises both tech-stack-driven matching and SIC/NAICS-style firmographic matching.
## Verify any number end-to-end
The full benchmark (runner code, judge prompt, leaderboard snapshot, and per-cell raw audit trail: the literal HTTP request/response we sent each vendor plus the literal LLM judge prompt/response per candidate) is mirrored in a public repo: https://github.com/openfunnel/gtm-bench. Auth headers are scrubbed via an allow-list; everything else is verbatim.
To audit a single cell, open `data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json` in that repo and replay any of the `vendor_calls[]` with your own credentials, or re-score with your own LLM by replaying `judge_calls[].messages` against any OpenAI-v1 compatible model. This is useful for measuring judge bias or drift across model versions.
## Inclusion queue and how to request a provider
Live: OpenFunnel, Ocean.io, Exa, Parallel, PredictLeads.
Requested but not directly comparable: ZoomInfo (company lookalikes are sales-gated, no self-serve API), Clay (lookalike runs inside Clay tables), Apollo (no public lookalike endpoint), Lusha (`/v3/companies/lookalike` requires 5-100 seeds per request, incompatible with the per-seed cell unit of this benchmark).
Under review next: Common Room, Koala, LeadGenius, 6sense, Demandbase.
To request a provider, email founders@openfunnel.dev with a link to the public API docs and pricing page.








