# Lookalike Benchmark
Active dataset: `lookalike-2026-q2`
14 seed companies across 7 verticals × 5 vendors. Each vendor is asked for its top `K = 10` lookalikes per seed. An LLM judge (`gpt-5.4-mini`) scores each returned company for relevance; the cell value is **Precision@K** — relevant / K.
## Endpoints
- JSON API: https://benchmarks.openfunnel.dev/api/leaderboards/lookalikes
- Markdown agent docs: https://benchmarks.openfunnel.dev/llms.txt
- OpenAPI 3.1 spec: https://benchmarks.openfunnel.dev/openapi.json
- MCP server discovery: https://benchmarks.openfunnel.dev/.well-known/mcp.json
- Public data + code (reproduce any cell): https://github.com/openfunnel/gtm-bench
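A minimal sketch of reading the JSON API with only the Python standard library. The response schema is not described in this document, so the sketch returns the decoded JSON as-is and assumes nothing about its fields; the helper names `build_request` and `fetch_leaderboard` are illustrative, not part of any published client.

```python
import json
import urllib.request

# Endpoint URL taken from the list above.
LEADERBOARD_URL = "https://benchmarks.openfunnel.dev/api/leaderboards/lookalikes"


def build_request(url: str = LEADERBOARD_URL) -> urllib.request.Request:
    """Build a GET request that asks for JSON (helper name is illustrative)."""
    return urllib.request.Request(url, headers={"Accept": "application/json"}, method="GET")


def fetch_leaderboard(timeout: float = 30.0):
    """Fetch and decode the leaderboard. The payload shape is not assumed here."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_leaderboard()` performs the network request; inspect the returned object before relying on any particular key.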
## Vendors live (5)
`openfunnel` (OpenFunnel), `ocean` (Ocean.io), `exa` (Exa), `parallel` (Parallel), `predictleads` (PredictLeads)
## Vendors not surveyed
- `ZoomInfo` - company lookalike API is not self-serve; it is gated behind a sales contract
- `Clay` - lookalikes run inside Clay tables; there is no standalone API
- `Apollo` - no public lookalike endpoint
- `Lusha` - `/v3/companies/lookalike` requires 5-100 seed companies per request, while this benchmark scores one seed per cell
## Leaderboard (vendor totals)
| Rank | Vendor | Seeds judged | avg Precision@K | total relevant | avg latency |
|------|--------|--------------|-----------------|----------------|-------------|
| 1 | openfunnel | 14/14 | 89.0% | 120 | 30747ms |
| 2 | predictleads | 14/14 | 73.6% | 103 | 726ms |
| 3 | ocean | 14/14 | 71.4% | 100 | 1840ms |
| 4 | parallel | 13/14 | 70.0% | 91 | 1492ms |
| 5 | exa | 14/14 | 37.3% | 52 | 244ms |
- `avg Precision@K` - mean Precision@10 across all seeds the vendor returned ≥K results for. Headline metric.
- `total relevant` - sum of relevant lookalikes across all seeds (out of `seeds_judged × K`).
- `avg latency` - mean per-seed request latency across the cohort.
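The vendor totals can be recomputed from per-seed relevant counts. The sketch below does this for `parallel`, with counts read off the seed × vendor matrix in this document (Precision@10 × 10) and `None` standing in for the one cell rendered as `-`. The function name `vendor_totals` is illustrative and is not taken from the benchmark's runner code.

```python
K = 10

# Relevant counts per seed for `parallel`, in matrix order (Pylon ... BLVD Residential).
# None marks the JDV Electric cell, which is rendered as `-` and therefore not scored.
parallel_relevant = [5, 10, 7, 0, 4, 8, 10, 9, 9, 10, 8, None, 4, 7]


def vendor_totals(relevant_counts, k=K):
    """Return (seeds_judged, avg_precision_at_k, total_relevant) for one vendor."""
    scored = [r for r in relevant_counts if r is not None]
    if not scored:
        return 0, None, 0
    avg_precision = sum(r / k for r in scored) / len(scored)
    return len(scored), avg_precision, sum(scored)


judged, avg_p, total = vendor_totals(parallel_relevant)
print(f"{judged}/{len(parallel_relevant)} seeds, avg Precision@{K} = {avg_p:.1%}, total relevant = {total}")
```

Unscored cells are dropped from both the numerator and the denominator of the average, which is why `parallel` is averaged over 13 seeds rather than 14.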
## Seed × vendor matrix
Cell value = Precision@10. `-` means the vendor has not been run on that seed yet, or returned fewer than K results.
### B2B SaaS
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Pylon | 90.0% | 70.0% | 50.0% | 50.0% | 100.0% |
| Default | 90.0% | 70.0% | 60.0% | 100.0% | 60.0% |
### Devtools
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Liveblocks | 100.0% | 40.0% | 20.0% | 70.0% | 100.0% |
| Trigger.dev | 90.0% | 30.0% | 12.5% | 0.0% | 100.0% |
### E-commerce
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Postscript | 90.0% | 80.0% | 20.0% | 40.0% | 90.0% |
| Recharge | 66.7% | 90.0% | 30.0% | 80.0% | 10.0% |
### Healthtech
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Hinge Health | 90.0% | 60.0% | 30.0% | 100.0% | 100.0% |
| Aledade | 40.0% | 50.0% | 30.0% | 90.0% | 70.0% |
### Home Services SaaS / Chains
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| ServiceTitan | 90.0% | 60.0% | 0.0% | 90.0% | 100.0% |
| Roto-Rooter | 100.0% | 70.0% | 20.0% | 100.0% | 100.0% |
### Local Trades
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Point Loma Home Pros | 100.0% | 100.0% | 50.0% | 80.0% | 80.0% |
| JDV Electric | 100.0% | 80.0% | 70.0% | - | 70.0% |
### Real Estate
| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Emerge Living | 100.0% | 100.0% | 80.0% | 40.0% | 0.0% |
| BLVD Residential | 100.0% | 100.0% | 50.0% | 70.0% | 50.0% |
## Methodology
1. Fix a canonical list of seed companies across 7 verticals. Each seed has a name, domain, and short description (the inputs every vendor sees).
2. For every (seed, vendor) cell, call the vendor's lookalike API with the seed company and `K = 10`. Capture the ordered top-K result list, latency, and credit cost.
3. Feed the seed + each returned candidate into the LLM judge (`gpt-5.4-mini`). Judge returns a binary relevance label per candidate, with a one-line rationale. Same prompt and rubric across all vendors.
4. Cell value = relevant_count / K. Aggregate per vendor as `avg_precision_at_k` (mean across seeds with ≥K results).
5. `-` semantics: either the vendor returned fewer than K candidates (e.g. tail seeds where catalog is thin), or the (seed, vendor) pair has not been run / judged yet.
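Steps 4 and 5 can be sketched as a small scoring function. This is an illustration under stated assumptions, not the benchmark's actual runner: `score_cell` and `render_cell` are hypothetical names, the labels are made up, and only the "fewer than K candidates" case of `-` is modeled (a cell that has not been run yet would simply have no labels to score).

```python
from typing import Optional, Sequence

K = 10


def score_cell(labels: Sequence[bool], k: int = K) -> Optional[float]:
    """Precision@K for one (seed, vendor) cell, or None when the cell renders as `-`.

    `labels` holds the judge's binary relevance label for each returned
    candidate, in the vendor's ranked order.
    """
    if len(labels) < k:
        return None  # fewer than K candidates: the cell is not scored
    return sum(bool(x) for x in labels[:k]) / k


def render_cell(value: Optional[float]) -> str:
    """Format a cell the way the matrix does: a percentage, or `-` if unscored."""
    return "-" if value is None else f"{value:.1%}"


print(render_cell(score_cell([True] * 7 + [False] * 3)))  # ten candidates, seven relevant
print(render_cell(score_cell([True] * 3)))                # only three candidates returned
```

Returning `None` instead of `3 / 3` is the point of step 5: a short list never gets a high score from a small denominator.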
## Reproducibility
Every cell on this leaderboard is reproducible end-to-end from the public
mirror at https://github.com/openfunnel/gtm-bench. Each `data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json`
contains the **literal HTTP request/response** sent to the vendor (auth
headers redacted) plus the **literal LLM judge prompt + raw response** for
every candidate. Replay any `vendor_calls[]` entry with your own
credentials to verify the vendor's output, or replay
`judge_calls[].messages` against your own LLM to measure judge bias or
drift across model versions.
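A minimal sketch of locating one cell's audit file in a local clone and pulling out the judge prompts. It assumes the file decodes to a JSON object whose `judge_calls` entries are objects with a `messages` key, as the field paths above suggest; the seed directory name (`pylon` in the usage note) and the helper names are hypothetical, so check the repo for the actual naming.

```python
import json
from pathlib import Path

DATASET = "lookalike-2026-q2"  # active dataset named at the top of this document


def raw_path(repo_root: str, seed: str, vendor: str, dataset: str = DATASET) -> Path:
    """Path to one cell's audit file, following the layout quoted above."""
    return Path(repo_root) / "data" / "lookalike-runs" / dataset / seed / f"{vendor}.raw.json"


def judge_messages(raw: dict) -> list:
    """Collect the `messages` payload of every judge call in a decoded raw file."""
    return [call["messages"] for call in raw.get("judge_calls", [])]


def load_cell(repo_root: str, seed: str, vendor: str) -> dict:
    """Read and decode the audit file for one (seed, vendor) cell."""
    return json.loads(raw_path(repo_root, seed, vendor).read_text(encoding="utf-8"))
```

For example, `judge_messages(load_cell("gtm-bench", "pylon", "exa"))` would return one message list per judged candidate, assuming the seed directory is named that way.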
## Known limitations
- **Judge bias.** A single LLM judge has its own priors about what "similar" means. We publish the judge model and the full rationale so the bias is auditable, but expect ±5% drift if you swap models.
- **K-tail vs precision tradeoff.** Vendors who can only return small result sets win Precision@K by default (they don't have noisy tail entries). We balance this by requiring ≥K results for the cell to score.
- **Vertical balance.** 14 seeds spread across 7 verticals — modern B2B SaaS, devtools, DTC ecom, healthcare networks, vertical SaaS / national chains for home-services, independent local trades (HVAC / plumbing / electrical), and multifamily real-estate operators. Lets the matrix exercise both tech-stack-style matching and SIC/NAICS firmographic matching.
- **No recall metric.** Precision@K does not measure how many *real* lookalikes exist that the vendor missed. That requires a held-out ground truth set we don't yet have.
## License
CC-BY-4.0. Attribute "OpenFunnel Bench" and link back when redistributing.
## Lookalike Precision@10
Rows are seed companies, columns are vendors, and each cell is the % of the vendor's top 10 lookalikes the LLM judge marked relevant.
### B2B SaaS (2 seeds)
narrow B2B SaaS — customer support, RevOps, ops tooling for ops/eng/CS teams
| # | Seed company | OpenFunnel | Ocean.io | Exa | Parallel | PredictLeads |
|---|---|---|---|---|---|---|
| 01 | Pylon | 🥈90% | 🥉70% | 50% | 50% | 🥇100% |
| 02 | Default | 🥈90% | 🥉70% | 60% | 🥇100% | 60% |
### Devtools (2 seeds)
developer primitives — realtime, background jobs, infra-as-code
| # | Seed company | OpenFunnel | Ocean.io | Exa | Parallel | PredictLeads |
|---|---|---|---|---|---|---|
| 01 | Liveblocks | 🥇100% | 40% | 20% | 🥉70% | 🥈100% |
| 02 | Trigger.dev | 🥈90% | 🥉30% | 13% | 0% | 🥇100% |
### E-commerce (2 seeds)
DTC infra — SMS, subscriptions, retention
| # | Seed company | OpenFunnel | Ocean.io | Exa | Parallel | PredictLeads |
|---|---|---|---|---|---|---|
| 01 | Postscript | 🥇90% | 🥉80% | 20% | 40% | 🥈90% |
| 02 | Recharge | 🥉67% | 🥇90% | 30% | 🥈80% | 10% |
### Healthtech (2 seeds)
vertical healthcare — digital MSK / digital therapeutics, value-based primary care
| # | Seed company | OpenFunnel | Ocean.io | Exa | Parallel | PredictLeads |
|---|---|---|---|---|---|---|
| 01 | Hinge Health | 🥉90% | 60% | 30% | 🥇100% | 🥈100% |
| 02 | Aledade | 40% | 🥉50% | 30% | 🥇90% | 🥈70% |
### Home Services SaaS / Chains (2 seeds)
the platforms behind the trades — vertical SaaS for contractors and national service chains
| # | Seed company | OpenFunnel | Ocean.io | Exa | Parallel | PredictLeads |
|---|---|---|---|---|---|---|
| 01 | ServiceTitan | 🥈90% | 60% | 0% | 🥉90% | 🥇100% |
| 02 | Roto-Rooter | 🥇100% | 70% | 20% | 🥈100% | 🥉100% |
### Local Trades (2 seeds)
independent local service contractors — HVAC, plumbing, electrical small businesses serving homeowners
| # | Seed company | OpenFunnel | Ocean.io | Exa | Parallel | PredictLeads |
|---|---|---|---|---|---|---|
| 01 | Point Loma Home Pros | 🥇100% | 🥈100% | 50% | 🥉80% | 80% |
| 02 | JDV Electric | 🥇100% | 🥈80% | 🥉70% | N/A | 70% |
### Real Estate (2 seeds)
multifamily property operators / property management — apartment ops, resident experience
| # | Seed company | OpenFunnel | Ocean.io | Exa | Parallel | PredictLeads |
|---|---|---|---|---|---|---|
| 01 | Emerge Living | 🥇100% | 🥈100% | 🥉80% | 40% | 0% |
| 02 | BLVD Residential | 🥇100% | 🥈100% | 50% | 🥉70% | 50% |
## Can an AI agent actually use this vendor?
Same agent-readiness lens as the technographics benchmark. Vendors that let an autonomous agent obtain a working key on its own (OTP-via-email or device-code) work end-to-end without human handoff.
| Vendor | Agent sign-up | API docs | llms.txt | MCP | Try it |
|---|---|---|---|---|---|
| OpenFunnel | ✓ ready (otp-email) | docs ↗ | llms.txt ↗ | mcp ↗ | sign up → |
| Ocean.io | manual signup | docs ↗ | — | — | — |
| Exa | ✓ ready (otp-email) | docs ↗ | — | — | sign up → |
| Parallel | ✓ ready (otp-email) | docs ↗ | — | — | sign up → |
| PredictLeads | manual signup | docs ↗ | — | — | — |
## Methodology, metric definitions, and known limitations
### How the matrix is built
- Fix a canonical list of 14 seed companies across 7 verticals (b2b-saas, devtools, ecommerce, healthtech, home-services SaaS / chains, local trades, real estate). Each seed has a name, domain, and short description - the exact inputs every vendor sees.
- For every (seed, vendor) cell, call the vendor's lookalike API with `K = 10`. Capture the ordered top-K result list, request latency, and credit cost.
- Feed the seed + each returned candidate to the LLM judge (`gpt-5.4-mini`). Judge returns a binary relevance label per candidate plus a one-line rationale. Identical prompt and rubric across all vendors.
- Cell value = `relevant_count / K`. Aggregate per vendor as `avg_precision_at_k` (mean across seeds with ≥K results returned).
- A vendor that returns fewer than K candidates for a seed has the cell rendered as `-` rather than scored on a truncated denominator. Keeps cells comparable.
### What each metric means
- `Precision@K` · cell value. Of the top 10 lookalikes a vendor returned for the seed, the fraction the LLM judge labeled relevant. The buyer's metric - "of the K I paid for, how many are usable".
- `avg Precision@K` · headline ranking metric. Mean Precision@K across all judged seeds. Higher is better.
- `total relevant` · sum of relevant lookalikes across all seeds. Reach metric - useful when comparing two vendors with similar precision.
- `avg latency` · mean per-seed request time.
- `cost per relevant` · vendor credit spend ÷ total relevant lookalikes. The economics metric.
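The `cost per relevant` ratio is the only metric here that divides by a quantity that can be zero, so a short sketch may help. The function name and the credit figure are hypothetical; no vendor's actual spend is stated in this document.

```python
from typing import Optional


def cost_per_relevant(credits_spent: float, total_relevant: int) -> Optional[float]:
    """Vendor credit spend divided by total relevant lookalikes.

    Returns None when nothing relevant was returned, since the ratio is undefined.
    """
    if total_relevant <= 0:
        return None
    return credits_spent / total_relevant


# Hypothetical inputs: 200 credits spent, 100 relevant lookalikes returned.
print(cost_per_relevant(200.0, 100))
```

Lower is better, and two vendors with the same Precision@K can differ widely on this number if one charges more per call.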
### Why an LLM judge instead of a hand-labeled set
A fully hand-labeled lookalike set would require labeling K × seeds × vendors candidates (10 × 14 × 5 = 700 judgements) every time we re-run a snapshot. That doesn't scale, and it isn't how the buyer actually evaluates a vendor in the wild - the buyer reads the list and decides "close enough to my ICP, yes or no".
The judge approximates that decision with a consistent rubric: given the seed's name, domain, and description, is this returned candidate plausibly the same kind of company a B2B seller would target as a lookalike? The judge's rationale is persisted alongside the binary label so any cell can be audited by a human in seconds. When the model swaps, the cohort re-runs with the same prompt; deltas are visible.
### How each vendor was queried
- OpenFunnel · embeddings over the OpenFunnel company index with the seed's domain as the query; optional graph re-rank using shared jobs / tech / funding co-signals. Top-K by cosine score.
- Ocean.io · `/companies/lookalikes` with seed domain. Default similarity model, K = 10.
- Exa · `findSimilar` with seed domain. Neural web-content embedding model, K = 10. Filters down to results that look like company sites (heuristics on result URL/title).
- Parallel · agentic research task: "find 10 companies similar to {seed}". The agent decides its own retrieval strategy. We record the final ranked list.
- PredictLeads · `/api/v3/companies/{domain}/similar_companies`; ranks via shared tech, news, and jobs co-signals.
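"Top-K by cosine score" is the most mechanical of the retrieval strategies listed above, so here is a generic sketch of it in plain Python. This is not OpenFunnel's implementation: the embeddings are toy 3-dimensional vectors, the domains are made up, and the optional graph re-rank step is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(seed_domain, index, k=10):
    """Rank every company other than the seed by cosine similarity to the seed."""
    seed_vec = index[seed_domain]
    scored = [(cosine(seed_vec, vec), domain) for domain, vec in index.items() if domain != seed_domain]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [domain for _, domain in scored[:k]]


# Toy embeddings with made-up domains, for illustration only.
index = {
    "seed.example": [1.0, 0.0, 0.0],
    "near.example": [0.9, 0.1, 0.0],
    "mid.example": [0.5, 0.5, 0.0],
    "far.example": [0.0, 0.0, 1.0],
}
print(top_k("seed.example", index, k=2))
```

A production index would use an approximate nearest-neighbour structure rather than a full scan, but the ranking criterion is the same.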
### What this benchmark does not tell you
- **Judge bias.** A single LLM judge has its own priors about what "similar" means. We publish the judge model and the full rationale so the bias is auditable, but expect ±5% drift across model versions.
- **K-tail vs precision tradeoff.** Vendors with thin catalogs can win Precision@K by refusing to return tail results. We mitigate by requiring ≥K results for a cell to be scored - thin cells render as `-`, not a high Precision number with a small denominator.
- **No recall metric.** Precision@K doesn't measure how many real lookalikes the vendor missed. That requires a held-out ground truth set we don't yet have.
- **Domain-only seeding.** All vendors receive the same compact input (name + domain + 1-line description). Vendors that benefit from richer inputs (e.g. headcount filters, ARR band, geography) may underperform their in-product behavior. The flip side is that this matches how an agent would query them.
- **Cohort coverage.** 14 seeds, 2 per vertical. Spans modern B2B SaaS / devtools / DTC ecom / healthcare networks and traditional trades (HVAC, plumbing) - the matrix exercises both tech-stack-driven matching and SIC/NAICS-style firmographic matching.
### Verify any number end-to-end
The full benchmark — runner code, judge prompt, leaderboard snapshot, and per-cell raw audit trail (the literal HTTP request/response we sent each vendor + the literal LLM judge prompt/response per candidate) — is mirrored in a public repo: openfunnel/gtm-bench. Auth headers are scrubbed via an allow-list; everything else is verbatim.
To audit a single cell, open data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json in that repo and replay any of the vendor_calls[] with your own credentials, or re-score with your own LLM by replaying judge_calls[].messages against any OpenAI-v1 compatible model — useful for measuring judge bias or drift across model versions.
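Re-scoring with your own model amounts to posting a stored `messages` list to a chat-completions endpoint. The sketch below assumes an OpenAI-v1 compatible server (a `/chat/completions` route under the base URL, bearer-token auth, and the usual `choices[0].message.content` response shape). The function names, the model name, and the base URL are placeholders you supply, and the original judge's sampling parameters are not stated here, so none are set.

```python
import json
import urllib.request


def build_replay_request(messages, model, base_url, api_key):
    """Build (but do not send) a chat-completions request that replays one judge call."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def replay(request, timeout=60.0):
    """Send the request and return the model's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return payload["choices"][0]["message"]["content"]
```

Comparing the replayed label against the stored judge response, candidate by candidate, gives a direct measure of how much a cell moves when the judge model changes.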
### Inclusion queue and how to request a provider
Live: OpenFunnel, Ocean.io, Exa, Parallel, PredictLeads.
Requested but not directly comparable: ZoomInfo (company lookalikes are sales-gated, no self-serve API), Clay (lookalike runs inside Clay tables), Apollo (no public lookalike endpoint), Lusha (`/v3/companies/lookalike` requires 5-100 seeds per request, incompatible with the per-seed cell unit of this benchmark).
Under review next: Common Room, Koala, LeadGenius, 6sense, Demandbase.
To request a provider, email founders@openfunnel.dev with a link to the public API docs and pricing page.








