How does OpenFunnel Bench score lookalike companies APIs?

OpenFunnel Bench scores lookalike companies APIs on Precision@K with an LLM-as-judge. Identical seed companies go to every vendor in the same format; each vendor returns its top-K lookalikes; the LLM judge scores every returned company for relevance to the seed. The cell value is Precision@K = relevant / K. The headline ranking metric is the average Precision@K across the full seed cohort. Tiebreakers in order: total relevant, then cost per relevant. There is no internal ground truth - relevance is decided by a documented external judge model, not vendor self-reporting.
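The ranking order described above (avg Precision@K first, then total relevant, then cost per relevant) can be sketched as a multi-key sort. This is a minimal illustration, not the benchmark's runner code: the `VendorTotals` type, the vendor names, and all numbers below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VendorTotals:
    vendor: str
    avg_precision: float      # mean Precision@K across judged seeds, in [0, 1]
    total_relevant: int       # relevant lookalikes summed across the cohort
    cost_per_relevant: float  # hypothetical unit: currency per relevant result


def rank(vendors: List[VendorTotals]) -> List[VendorTotals]:
    # Higher precision wins; ties go to more relevant results, then lower cost.
    return sorted(
        vendors,
        key=lambda v: (-v.avg_precision, -v.total_relevant, v.cost_per_relevant),
    )


# Hypothetical rows chosen so that both tiebreakers are exercised.
rows = [
    VendorTotals("vendor_a", 0.70, 99, 0.40),
    VendorTotals("vendor_b", 0.70, 99, 0.25),
    VendorTotals("vendor_c", 0.70, 103, 0.90),
    VendorTotals("vendor_d", 0.89, 80, 1.50),
]
ranked = [v.vendor for v in rank(rows)]
print(ranked)
```

Here `vendor_d` leads on precision, `vendor_c` beats the other 0.70 vendors on total relevant, and `vendor_b` beats `vendor_a` on cost.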

What criteria does OpenFunnel Bench use to evaluate company lookalike APIs?

Four criteria, all derived from identical inputs across vendors. (1) Avg Precision@K - headline ranking metric, mean fraction of top-K results scored relevant by the LLM judge. (2) Total relevant - reach metric across the cohort, used to break ties. (3) Seeds judged - number of seeds where the vendor returned at least K results and the judge scored every one; filters out cells that would skew the average. (4) Cost per relevant - vendor spend divided by relevant count, the buying-decision metric. Vendors that cannot be queried on the same terms (Lusha requires 5-100 seeds per call; ZoomInfo gates its lookalike API behind a sales contract) are listed as not surveyed with explicit reasons.

What is the most accurate company lookalike API in 2026?

On OpenFunnel Bench, the most accurate company lookalike API is defined as the one with the highest avg Precision@K - the share of top-K lookalikes that an LLM judge scored as relevant, averaged across the seed cohort. The vendors currently benchmarked are Ocean.io, Exa, Parallel, OpenFunnel, and PredictLeads. ZoomInfo, Clay, Apollo, and Lusha are excluded with explicit reasons (no self-serve or standalone API, or a request shape incompatible with the benchmark's one-seed-per-cell unit). The current top-ranked vendor is shown on the leaderboard, which refreshes per snapshot. Important caveat: relevance is judged by an LLM, not by a domain expert - the judge prompt and model are documented and held constant across vendors.

What is the best company lookalike API for AI agents in 2026?

For AI agents making build vs buy decisions on company lookalike APIs, the best provider combines high avg Precision@K, predictable per-seed cost, and an agent-ready signup flow (programmatic OAuth or email OTP). OpenFunnel Bench ranks Ocean.io, Exa, Parallel, OpenFunnel, and PredictLeads on identical inputs against a shared B2B seed cohort. Of the benchmarked vendors, Exa, Parallel, and OpenFunnel publish agent-ready signup flows; Ocean.io and PredictLeads require human-mediated onboarding. The full leaderboard with each vendor's auth mode is queryable as JSON at /api/leaderboards/lookalikes under CC-BY-4.0.

How accurate is Ocean.io for finding similar companies?

Ocean.io is one of five providers currently benchmarked on the OpenFunnel Bench lookalike leaderboard. Ocean.io uses AI-driven lookalike search across a global company graph. The benchmark sends the same seed company to Ocean.io that every other vendor sees, requests the same top-K results, and scores each returned company with an LLM judge. Ocean.io's current avg Precision@K, total relevant lookalikes returned, and cost per relevant are on the leaderboard. Numbers refresh per snapshot and the seed cohort is rotated to avoid overfitting.

Ocean.io vs Exa vs Parallel: which lookalike API is best?

Ocean.io, Exa, and Parallel are all benchmarked on OpenFunnel Bench against the same B2B seed cohort. Each surfaces lookalikes through a different mechanism. Ocean.io runs AI-driven similarity search across a company graph. Exa uses neural web search with a 'similar to this URL' endpoint - strongest when the seed has rich web content. Parallel exposes an agentic research API; lookalike comes via Entity Search. Their strengths are complementary rather than strictly comparable: a web-heavy seed favors Exa, a sparse seed favors graph-based vendors. Current avg Precision@K per vendor is on the leaderboard and tiebreaks by total relevant, then cost per relevant.

What is a company lookalike API?

A company lookalike API is a B2B data endpoint that takes one or more seed companies and returns a ranked list of other companies similar to the seed(s) along some axis - product, vertical, size, signals, or web footprint. Vendors differ in what they treat as 'similar': Ocean.io and PredictLeads weight company graph + signals, Exa weights public web embeddings, Parallel runs agentic entity search, OpenFunnel combines embeddings with a jobs/news graph. Lookalike APIs power ICP expansion, account discovery, and outbound prospecting workflows in B2B sales and marketing.

What is Precision@K and how is it measured in this benchmark?

Precision@K is the fraction of a vendor's top-K returned lookalikes that an LLM judge scored as relevant to the seed. Formally: Precision@K = relevant_count / K. On OpenFunnel Bench, K is fixed across vendors (the active dataset uses K = 10) and the judge model and prompt are documented and held constant. Per-cell Precision@K is the value rendered in the matrix; avg Precision@K (mean across the judged seed cohort) is the headline ranking metric. Cells where the vendor returned fewer than K results, or where the judge failed, are flagged so they do not silently distort the rollup.
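A minimal sketch of the per-cell rule stated above, assuming the judge output is available as one boolean per returned company in rank order, with `None` standing in for a judge failure. The function name and the `None` convention are illustrative assumptions, not taken from the benchmark's code.

```python
from typing import Optional, Sequence


def precision_at_k(labels: Sequence[Optional[bool]], k: int) -> Optional[float]:
    """Return relevant_count / k, or None when the cell should be flagged.

    labels: judge verdicts in rank order; None marks a candidate the judge
    could not score (an assumed convention for this sketch).
    """
    top = list(labels[:k])
    if len(top) < k or any(label is None for label in top):
        return None  # fewer than K results, or a judge failure: flag the cell
    return sum(1 for label in top if label) / k


full_cell = precision_at_k([True] * 7 + [False] * 3, k=10)   # 7 of 10 relevant
short_cell = precision_at_k([True] * 8, k=10)                # only 8 results
failed_cell = precision_at_k([True] * 9 + [None], k=10)      # judge failure
print(full_cell, short_cell, failed_cell)
```

Returning `None` rather than a score on a truncated denominator keeps flagged cells out of the average instead of letting them inflate or deflate it.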

Why use an LLM judge for a lookalike benchmark?

Lookalike relevance is intrinsically subjective. Two B2B prospectors looking at the same five returned companies for a seed will often disagree on which two are 'really similar.' A human-only judging pipeline does not scale across hundreds of (seed, vendor, K) cells and introduces inter-rater drift. An LLM-as-judge with a fixed prompt and fixed model scores every returned company across every vendor under identical conditions, eliminating vendor self-reporting bias and inter-rater drift. The trade-off is calibration: the judge has its own biases, but those biases are held constant across vendors, so vendor-to-vendor rank comparisons remain valid even if absolute Precision@K scores carry a judge-specific offset.

Which lookalike companies API has the lowest cost per relevant result?

Cost per relevant on OpenFunnel Bench is total estimated request spend divided by the total number of relevant lookalikes the LLM judge scored for that vendor across the seed cohort. The current cheapest cost-per-relevant vendor among Ocean.io, Exa, Parallel, OpenFunnel, and PredictLeads is on the leaderboard. Caveat: list pricing rarely matches what a serious buyer pays - enterprise contracts are negotiated and can come in 2-10x cheaper than the public per-credit rate. Use cost per relevant as a relative comparison signal between vendors at the same usage scale, not a final budget figure.
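The arithmetic behind that caveat is small enough to sketch. The spend figures below are hypothetical; only the formula (spend divided by relevant count) comes from the text above.

```python
from typing import Optional


def cost_per_relevant(total_spend: float, total_relevant: int) -> Optional[float]:
    """Spend divided by relevant results; None when nothing was relevant."""
    if total_relevant <= 0:
        return None
    return total_spend / total_relevant


# Hypothetical: the same 100 relevant results at list price and at a
# negotiated rate five times cheaper (inside the 2-10x range quoted above).
list_price = cost_per_relevant(50.0, 100)
negotiated = cost_per_relevant(50.0 / 5, 100)
print(list_price, negotiated)
```

Because a negotiated discount scales the figure linearly, the comparison stays meaningful between vendors only when both are priced at the same usage tier.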

# Lookalike Benchmark

Active dataset: `lookalike-2026-q2`
14 seed companies across 7 verticals × 5 vendors. Each vendor is asked for its top `K = 10` lookalikes per seed. An LLM judge (`majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini`) scores each returned company for relevance; the cell value is **Precision@K** — relevant / K.

## Endpoints

- JSON API: https://benchmarks.openfunnel.dev/api/leaderboards/lookalikes
- Markdown agent docs: https://benchmarks.openfunnel.dev/llms.txt
- OpenAPI 3.1 spec: https://benchmarks.openfunnel.dev/openapi.json
- MCP server discovery: https://benchmarks.openfunnel.dev/.well-known/mcp.json
- Public data + code (reproduce any cell): https://github.com/openfunnel/gtm-bench
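A hedged sketch of reading the JSON endpoint listed above with the Python standard library. The response schema is not described in this document, so the example only inspects top-level keys rather than assuming field names; the `opener` parameter is an illustrative hook for testing offline.

```python
import json
from urllib.request import urlopen

LEADERBOARD_URL = "https://benchmarks.openfunnel.dev/api/leaderboards/lookalikes"


def fetch_leaderboard(url=LEADERBOARD_URL, opener=urlopen):
    """Download and decode the leaderboard JSON."""
    with opener(url) as response:
        return json.load(response)


def describe(payload):
    """List top-level keys without assuming anything about the schema."""
    if isinstance(payload, dict):
        return sorted(payload.keys())
    return [type(payload).__name__]


# Usage (performs a network request):
#   print(describe(fetch_leaderboard()))
```

The OpenAPI spec linked above is the authoritative source for the actual field names.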

## Vendors live (5)

`openfunnel` (OpenFunnel), `ocean` (Ocean.io), `exa` (Exa), `parallel` (Parallel), `predictleads` (PredictLeads)

## Vendors not surveyed

- `ZoomInfo` - company lookalike API is not on self-serve — gated behind sales contract
- `Clay` - lookalike runs inside Clay tables, no standalone API
- `Apollo` - no public lookalike endpoint
- `Lusha` - /v3/companies/lookalike requires 5-100 seed companies per request; benchmark scores one seed per cell

## Leaderboard (vendor totals)

| Rank | Vendor | Seeds judged | avg Precision@K | total relevant |
|------|--------|--------------|-----------------|----------------|
| 1 | openfunnel | 14/14 | 89.1% | 123 |
| 2 | predictleads | 14/14 | 73.6% | 103 |
| 3 | ocean | 14/14 | 70.7% | 99 |
| 4 | parallel | 13/14 | 64.6% | 84 |
| 5 | exa | 14/14 | 31.8% | 44 |

- `avg Precision@K` - mean Precision@10 across all seeds the vendor returned ≥K results for. Headline metric.
- `total relevant` - sum of relevant lookalikes across all seeds (out of `seeds_judged × K`).
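The rollup can be checked by hand from the published cells. The sketch below recomputes the `parallel` row from its Precision@10 values in the seed × vendor matrix, with `None` standing in for the `-` cell; the function name is illustrative.

```python
K = 10

# Published Precision@10 cells for `parallel`, in matrix order
# (Pylon, Default, Trigger.dev, Liveblocks, Postscript, Recharge, Hinge Health,
#  Aledade, Roto-Rooter, ServiceTitan, JDV Electric, Point Loma Home Pros,
#  BLVD Residential, Emerge Living). None marks the "-" cell.
parallel_cells = [60.0, 70.0, 0.0, 30.0, 80.0, 80.0, 100.0,
                  100.0, 100.0, 100.0, None, 70.0, 20.0, 30.0]


def rollup(cells):
    judged = [c for c in cells if c is not None]
    avg = sum(judged) / len(judged) if judged else None
    return len(judged), len(cells), avg


judged, total, avg = rollup(parallel_cells)
# With every judged cell scored out of K, relevant count = sum(percent) * K / 100.
total_relevant = round(sum(c for c in parallel_cells if c is not None) * K / 100)

print(f"{judged}/{total} seeds judged, avg Precision@K = {avg:.1f}%, "
      f"total relevant = {total_relevant}")
```

This reproduces the leaderboard row for `parallel`: 13/14 seeds judged, 64.6%, 84 relevant.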

## Seed × vendor matrix

Cell value = Precision@10. `-` means the vendor has not been run on that seed yet, or returned fewer than K results.

### B2B SaaS

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Pylon | 100.0% | 70.0% | 40.0% | 60.0% | 100.0% |
| Default | 100.0% | 40.0% | 44.4% | 70.0% | 80.0% |

### Devtools

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Trigger.dev | 100.0% | 30.0% | 11.1% | 0.0% | 100.0% |
| Liveblocks | 60.0% | 40.0% | 10.0% | 30.0% | 80.0% |

### E-commerce

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Postscript | 90.0% | 80.0% | 10.0% | 80.0% | 90.0% |
| Recharge | 100.0% | 80.0% | 20.0% | 80.0% | 10.0% |

### Healthtech

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Hinge Health | 90.0% | 60.0% | 20.0% | 100.0% | 100.0% |
| Aledade | 20.0% | 60.0% | 10.0% | 100.0% | 70.0% |

### Home Services SaaS / Chains

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Roto-Rooter | 87.5% | 70.0% | 30.0% | 100.0% | 100.0% |
| ServiceTitan | 100.0% | 70.0% | 10.0% | 100.0% | 100.0% |

### Local Trades

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| JDV Electric | 100.0% | 90.0% | 70.0% | - | 60.0% |
| Point Loma Home Pros | 100.0% | 100.0% | 40.0% | 70.0% | 90.0% |

### Real Estate

| Seed | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| BLVD Residential | 100.0% | 100.0% | 60.0% | 20.0% | 50.0% |
| Emerge Living | 100.0% | 100.0% | 70.0% | 30.0% | 0.0% |

## TAM Recall (coverage vs an independent reference set)

A second, **judge-free** benchmark answering the opposite question to Precision@K:
not "are the few you returned good?" but "build me my whole TAM." Each vendor
returns its deepest list (fetch depth 100); we match it against a
**frozen, vendor-independent reference set** (G2 category rosters resolved to
canonical domains) and report **Recall@K** = the fraction of that reference set
surfaced in the vendor's top K. A "hit" is deterministic reference-set membership —
**no LLM judge**. Recall is relative to the reference set, **not absolute TAM**.
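
A minimal sketch of Recall@K as defined above, assuming hits are decided by exact match on a normalized domain. The `canonical` helper is a hypothetical stand-in for the benchmark's domain resolution, and the `.example` domains are made up.

```python
from typing import Iterable, Optional, Sequence


def canonical(domain: str) -> str:
    """Hypothetical normalizer: lowercase and drop a leading 'www.'."""
    d = domain.strip().lower()
    return d[4:] if d.startswith("www.") else d


def recall_at_k(ranked_domains: Sequence[str],
                reference_set: Iterable[str],
                k: int) -> Optional[float]:
    """Fraction of the reference set that appears in the vendor's top k."""
    reference = {canonical(d) for d in reference_set}
    if not reference:
        return None
    top = {canonical(d) for d in ranked_domains[:k]}
    return len(top & reference) / len(reference)


reference = ["a.example", "b.example", "c.example", "d.example"]
returned = ["www.A.example", "x.example", "c.example", "b.example"]

r_at_2 = recall_at_k(returned, reference, 2)  # only a.example is in the top 2
r_at_4 = recall_at_k(returned, reference, 4)  # a, b and c are in the top 4
print(r_at_2, r_at_4)
```

The denominator is the reference-set size, not K, which is why Recall@10 against a roster of several hundred companies is small even for a perfect top 10.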

### Recall leaderboard (vendor totals, mean across seeds)

| Rank | Vendor | R@10 | R@50 | R@100 |
|------|--------|------|------|-------|
| 1 | ocean | 2.2% | 6.5% | 8.3% |
| 2 | openfunnel | 1.9% | 4.8% | 8.2% |
| 3 | parallel | 1.2% | 4.2% | 5.7% |
| 4 | predictleads | 2.9% | 5.6% | 5.6% |
| 5 | exa | 0.5% | 1.1% | 2.0% |

- `R@10/50/100` - mean Recall@K across scored seeds; `R@100` is the headline.

### Seed × vendor Recall@100 matrix

Cell = Recall@100. `-` = vendor not run / errored on that seed. Reference-set size in parentheses.

### B2B SaaS

| Seed (gold) | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Pylon (107) | 10.3% | 11.2% | 0.9% | 7.5% | 9.3% |

### E-commerce

| Seed (gold) | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| Postscript (431) | 5.1% | 4.9% | 1.6% | 2.8% | 4.6% |
| Recharge (135) | 11.8% | 13.3% | 5.2% | 7.4% | 4.4% |

### Home Services SaaS / Chains

| Seed (gold) | openfunnel | ocean | exa | parallel | predictleads |
|------|------|------|------|------|------|
| ServiceTitan (499) | 5.6% | 3.8% | 0.4% | 5.0% | 3.8% |

Full methodology + the frozen, content-hashed gold sets: https://github.com/openfunnel/gtm-bench (`scripts/lookalike/RECALL_METHODOLOGY.md`).

## Methodology

1. Fix a canonical list of seed companies across 7 verticals. Each seed has a name, domain, and short description (the inputs every vendor sees).
2. For every (seed, vendor) cell, call the vendor's lookalike API with the seed company and `K = 10`. Capture the ordered top-K result list and credit cost.
3. Feed the seed + each returned candidate into the LLM judge (`majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini`). Judge returns a binary relevance label per candidate, with a one-line rationale. Same prompt and rubric across all vendors.
4. Cell value = relevant_count / K. Aggregate per vendor as `avg_precision_at_k` (mean across seeds with ≥K results).
5. `-` semantics: either the vendor returned fewer than K candidates (e.g. tail seeds where catalog is thin), or the (seed, vendor) pair has not been run / judged yet.
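
Steps 3 and 4 can be sketched as follows. The judge string names `majority(n=3)` but this document does not spell out the vote rule, so a simple more-than-half majority over three binary labels is an assumption here, as are the function names.

```python
from typing import Optional, Sequence


def majority_label(votes: Sequence[bool]) -> bool:
    """True when more than half of the judge models voted 'relevant'."""
    return sum(1 for v in votes if v) * 2 > len(votes)


def score_cell(candidate_votes: Sequence[Sequence[bool]], k: int) -> Optional[float]:
    """Precision@K for one (seed, vendor) cell, or None for a '-' cell."""
    if len(candidate_votes) < k:
        return None  # vendor returned fewer than K candidates
    relevant = sum(1 for votes in candidate_votes[:k] if majority_label(votes))
    return relevant / k


# Hypothetical votes from three judge models for three returned candidates
# (K = 3 here only to keep the example short; the benchmark uses K = 10).
votes = [
    (True, True, False),   # 2 of 3 -> relevant
    (False, False, True),  # 1 of 3 -> not relevant
    (True, True, True),    # 3 of 3 -> relevant
]
cell = score_cell(votes, k=3)
thin_cell = score_cell(votes[:2], k=3)
print(cell, thin_cell)
```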

## Reproducibility

Every cell on this leaderboard is reproducible end-to-end from the public
mirror at https://github.com/openfunnel/gtm-bench. Each `data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json`
contains the **literal HTTP request/response** sent to the vendor (auth
headers redacted) plus the **literal LLM judge prompt + raw response** for
every candidate. Replay any `vendor_calls[]` entry with your own
credentials to verify the vendor's output, or replay
`judge_calls[].messages` against your own LLM to measure judge bias or
drift across model versions.
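
A hedged sketch of opening one audit file for replay. Only the `vendor_calls` and `judge_calls[].messages` names come from the text above; every other field in the file is left untouched because its schema is not described here, and the helper names are illustrative.

```python
import json
from pathlib import Path


def load_run(path):
    """Read one <vendor>.raw.json audit file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(run):
    """Count the recorded vendor and judge calls."""
    return {
        "vendor_calls": len(run.get("vendor_calls", [])),
        "judge_calls": len(run.get("judge_calls", [])),
    }


def judge_prompts(run):
    """Yield each recorded judge prompt so it can be re-sent to another model."""
    for call in run.get("judge_calls", []):
        yield call["messages"]


# Usage, with the placeholders filled in from a local clone of the repo:
#   run = load_run("data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json")
#   print(summarize(run))
#   for messages in judge_prompts(run):
#       ...  # send `messages` to the model you want to compare against
```

Comparing your own model's labels with the recorded ones, candidate by candidate, gives a direct estimate of judge drift for that cell.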

## Known limitations

- **Judge bias.** A single LLM judge has its own priors about what "similar" means. We publish the judge model and the full rationale so the bias is auditable, but expect ±5% drift if you swap models.
- **K-tail vs precision tradeoff.** Vendors who can only return small result sets win Precision@K by default (they don't have noisy tail entries). We balance this by requiring ≥K results for the cell to score.
- **Vertical balance.** 14 seeds spread across 7 verticals — modern B2B SaaS, devtools, DTC ecom, healthcare networks, vertical SaaS / national chains for home-services, independent local trades (HVAC / plumbing / electrical), and multifamily real-estate operators. Lets the matrix exercise both tech-stack-style matching and SIC/NAICS firmographic matching.
- **Precision ≠ recall.** Precision@10 measures whether the few results a vendor returned are clean, not how much of the market it covers. The **TAM Recall** section above is the complementary coverage metric — recall against a frozen, vendor-independent reference set (judge-free). Read them together: a vendor can be precise but narrow, or broad but noisy.

## License

CC-BY-4.0. Attribute "OpenFunnel Bench" and link back when redistributing.

Lookalike Benchmark

14 seed companies × 5 vendors - each vendor returns its top 10 lookalikes per seed. An LLM judge (majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini) scores every returned company for relevance. Cell value is Precision@10.

[01] results

Lookalike Precision@10

Scan the company type, then compare vendors. Each cell is the % of a vendor's top 10 similar-company suggestions the LLM judge marked relevant.

Example: Want more B2B customer support platforms? Find that row, then compare which vendor returns the most relevant similar companies.

How to read:
- cell = Precision@K (K = 10): % of vendor's top 10 lookalikes judged relevant
- 🥇🥈🥉 = top 3 vendors per company type
- N/A = vendor not yet run (or returned fewer than K results)

#	Company type	Vertical	Seed example	Domain	Parallel
01	Open-source background jobs / long-running workflow platform aimed at TypeScript devs	Devtools	Trigger.dev	trigger.dev	0.0%
02	Family-owned independent residential electrical contractor serving Delaware, Chester, Lower Montgomery, and Philadelphia counties; panel repair/replacement, wiring, lighting, EV charger installs for homeowners	Local Trades	JDV Electric	jdvelectric.com	N/A
03	SMS marketing platform for DTC e-commerce brands on Shopify	E-commerce	Postscript	postscript.io	80.0%
04	National plumbing and drain-cleaning services franchise - residential + commercial	Home Services SaaS / Chains	Roto-Rooter	rotorooter.com	100.0%
05	Subscription billing and customer lifecycle platform for Shopify DTC merchants	E-commerce	Recharge	rechargepayments.com	80.0%
06	Regional multi-trade home services company offering HVAC, plumbing, electrical, and water filtration to homeowners in the greater San Diego area	Local Trades	Point Loma Home Pros	pointlomahomepros.com	70.0%
07	Digital musculoskeletal (MSK) care platform - app-guided physical therapy with wearable motion sensors and human coaches for back, joint, and chronic pain	Healthtech	Hinge Health	hingehealth.com	100.0%
08	Multifamily property management company operating apartment communities; focused on on-site operations, resident experience, and asset performance for institutional owners	Real Estate	BLVD Residential	blvdresidential.com	20.0%
09	Value-based care enablement platform for independent primary care practices and ACOs	Healthtech	Aledade	aledade.com	100.0%
10	B2B customer support platform built around shared Slack/Teams channels with enterprise customers	B2B SaaS	Pylon	usepylon.com	60.0%
11	RevOps tool - form, routing, scheduling, and lead enrichment for inbound pipelines	B2B SaaS	Default	default.com	70.0%
12	Multifamily property management company operating apartment communities across the Sunbelt; focused on resident experience and asset operations for institutional owners	Real Estate	Emerge Living	emergeliving.com	30.0%
13	Vertical SaaS for trades - dispatch, CRM, billing for HVAC, plumbing, and electrical contractors	Home Services SaaS / Chains	ServiceTitan	servicetitan.com	100.0%
14	Realtime collaboration primitives for product teams - presence, multiplayer cursors, threaded comments	Devtools	Liveblocks	liveblocks.io	30.0%

[01.b] not surveyed

Relevant lookalike vendors without a directly comparable API surface

[01.d] tam recall

TAM Recall - coverage of the addressable market

Precision@10 above asks "of the few you returned, how many are good?" This asks the opposite: "build me my whole TAM". Each vendor returns its deepest list (fetch depth 100); we match against a frozen, vendor-independent reference set resolved to canonical domains and report Recall@10/50/100. No LLM judge - a hit is deterministic reference-set membership. Recall is relative to that reference set, never claimed as absolute TAM. Top coverage: Ocean.io at 8.3% avg Recall@100.

Example: Want the full market for B2B customer support platforms? Pylon is the seed example for that market; compare which vendor recovers the most companies from the frozen reference set.

#	Vendor	R@10	R@50	R@100
🥇	Ocean.io	2.2%	6.5%	8.3%
🥈	OpenFunnel	1.9%	4.8%	8.2%
🥉	Parallel	1.2%	4.2%	5.7%
04	PredictLeads	2.9%	5.6%	5.6%
05	Exa	0.5%	1.1%	2.0%

How to read:
- cell = Recall@100: % of the market's frozen reference set the vendor surfaced in its top 100 results
- N/A = vendor not run / errored on this seed

#	Market tested	Vertical	Seed example	Domain	Reference set	Ocean.io	OpenFunnel	Parallel	PredictLeads	Exa
01	SMS marketing platforms for Shopify and DTC brands	E-commerce	Postscript	postscript.io	431 companies (G2 SMS marketing roster)	4.9%	5.1%	2.8%	4.6%	1.6%
02	B2B customer support and conversational support platforms	B2B SaaS	Pylon	usepylon.com	107 companies (G2 conversational support roster)	11.2%	10.3%	7.5%	9.3%	0.9%
03	Subscription management and recurring billing for e-commerce	E-commerce	Recharge	rechargepayments.com	135 companies (G2 subscription management roster)	13.3%	11.8%	7.4%	4.4%	5.2%
04	Field service management software for trades businesses	Home Services SaaS / Chains	ServiceTitan	servicetitan.com	499 companies (G2 field service management roster)	3.8%	5.6%	5.0%	3.8%	0.4%

[01.d.i] Anti-bias check (home vendor: openfunnel): the reference set is pooled from public sources, never from a benchmarked vendor. If it were biased toward openfunnel, most of the gold companies found by any vendor would be found only by openfunnel. Instead, openfunnel-only hits are a minority of the found companies for every seed, and roughly 22-37% are surfaced by ≥2 independent vendors (methodology §7). Percentages in the table are shares of the "Found by any" count.

Seed	Gold	Found by ≥2	Found by any	openfunnel-only
Postscript	431	16 (27.1%)	59	13
Pylon	107	10 (37.0%)	27	3
Recharge	135	9 (21.9%)	41	8
ServiceTitan	499	21 (29.6%)	71	20
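
One way the columns of this table could be derived from per-vendor hit lists is sketched below. The function and the toy data are illustrative assumptions; the published counts come from the benchmark's own pipeline (methodology §7).

```python
from collections import Counter


def fairness_row(gold, results_by_vendor, home_vendor):
    """Overlap statistics for one seed's reference (gold) set."""
    gold = set(gold)
    hits = {v: set(found) & gold for v, found in results_by_vendor.items()}
    # For each gold company, how many vendors surfaced it.
    vendors_per_company = Counter(d for found in hits.values() for d in found)
    found_any = len(vendors_per_company)
    found_2plus = sum(1 for n in vendors_per_company.values() if n >= 2)
    home_only = sum(
        1 for d in hits.get(home_vendor, set()) if vendors_per_company[d] == 1
    )
    return {
        "gold": len(gold),
        "found_by_2plus": found_2plus,
        "found_by_any": found_any,
        "home_only": home_only,
        "share_2plus": found_2plus / found_any if found_any else 0.0,
    }


# Toy data: six gold companies and three vendors ("home" is the home vendor).
gold = {"g1", "g2", "g3", "g4", "g5", "g6"}
results = {
    "home": ["g1", "g2", "g3"],
    "vendor_b": ["g2", "g4"],
    "vendor_c": ["g2", "g3", "not-in-gold"],
}
row = fairness_row(gold, results, home_vendor="home")
print(row)
```

As a consistency check on the published figures, the Postscript row's 16 out of 59 works out to 27.1%, which matches the percentage shown.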

[01.c] agent readiness

Can an AI agent actually use this vendor?

Same agent-readiness lens as the technographics benchmark. Vendors that let an autonomous agent obtain a working key on its own (OTP-via-email or device-code) work end-to-end without human handoff.

3/5 vendors agent-ready

Vendor	Agent sign-up	API docs	llms.txt	MCP
OpenFunnel	✓ ready (otp-email)	✓	✓	✓
Ocean.io	manual signup	✓	✓	✓
Exa	✓ ready (otp-email)	✓	✓	✓
Parallel	✓ ready (otp-email)	✓	✓	✓
PredictLeads	manual signup	✓	-	✓

[02] methodology, metric definitions, and known limitations

[02.a] methodology

How the matrix is built

1. Fix a canonical list of 14 seed companies across 7 verticals (b2b-saas, devtools, ecommerce, healthtech, home-services SaaS / chains, local trades, real estate). Each seed has a name, domain, and short description - the exact inputs every vendor sees.
2. For every (seed, vendor) cell, call the vendor's lookalike API with K = 10. Capture the ordered top-K result list and credit cost.
3. Feed the seed + each returned candidate to the LLM judge (majority(n=3): gpt-5.1,gpt-5.2,gpt-5.4-mini). Judge returns a binary relevance label per candidate plus a one-line rationale. Identical prompt and rubric across all vendors.
4. Cell value = relevant_count / K. Aggregate per vendor as avg_precision_at_k (mean across seeds with ≥K results returned).
5. A vendor that returns fewer than K candidates for a seed has the cell rendered as - rather than scored on a truncated denominator. This keeps cells comparable.

[02.b] metric definitions

What each metric means

Precision@K · cell value. Of the top 10 lookalikes a vendor returned for the seed, the fraction the LLM judge labeled relevant. The buyer's metric - "of the K I paid for, how many are usable".
avg Precision@K · headline ranking metric. Mean Precision@K across all judged seeds. Higher is better.
total relevant · sum of relevant lookalikes across all seeds. Reach metric - useful when comparing two vendors with similar precision.
cost per relevant · vendor credit spend ÷ total relevant lookalikes. The economics metric.

[02.c] why LLM-as-judge

Why an LLM judge instead of a hand-labeled set

A fully hand-labeled lookalike set would require labeling K × seeds × vendors candidates (10 × 14 × 5 = 700 judgements) every time we re-run a snapshot. That doesn't scale, and it isn't how the buyer actually evaluates a vendor in the wild - the buyer reads the list and decides "close enough to my ICP, yes or no".

The judge approximates that decision with a consistent rubric: given the seed's name, domain, and description, is this returned candidate plausibly the same kind of company a B2B seller would target as a lookalike? The judge's rationale is persisted alongside the binary label so any cell can be audited by a human in seconds. When the model swaps, the cohort re-runs with the same prompt; deltas are visible.

[02.d] per-vendor query rules

How each vendor was queried

OpenFunnel · embeddings over the OpenFunnel company index with the seed's domain as the query; optional graph re-rank using shared jobs / tech / funding co-signals. Top-K by cosine score.
Ocean.io · /companies/lookalikes with seed domain. Default similarity model, K = 10.
Exa · findSimilar with seed domain. Neural web-content embedding model, K = 10. Filters down to results that look like company sites (heuristics on result URL/title).
Parallel · agentic research task: "find 10 companies similar to {seed}". The agent decides its own retrieval strategy. We record the final ranked list.
PredictLeads · /api/v3/companies/{domain}/similar_companies; ranks via shared tech, news, and jobs co-signals.
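
For readers unfamiliar with "top-K by cosine score", the idea can be sketched generically in pure Python. This is not any vendor's implementation: the two-dimensional vectors and `.example` domains are toy stand-ins for real company embeddings, and the optional graph re-rank is omitted.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(seed_vec: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int) -> List[str]:
    """Domains of the k indexed companies most similar to the seed."""
    scored = sorted(
        ((cosine(seed_vec, vec), domain) for domain, vec in index.items()),
        reverse=True,
    )
    return [domain for _, domain in scored[:k]]


# Toy index: the seed itself is assumed to be excluded from the candidates.
index = {
    "a.example": [1.0, 0.1],
    "b.example": [0.0, 1.0],
    "c.example": [0.7, 0.7],
}
nearest = top_k([1.0, 0.0], index, k=2)
print(nearest)
```

A production index would use an approximate nearest-neighbour structure rather than scoring every company, but the ranking criterion is the same.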

[02.e] known limitations

What this benchmark does not tell you

Judge bias. A single LLM judge has its own priors about what "similar" means. We publish the judge model and the full rationale so the bias is auditable, but expect ±5% drift across model versions.
K-tail vs precision tradeoff. Vendors with thin catalogs can win Precision@K by refusing to return tail results. We mitigate by requiring ≥K results for a cell to be scored - thin cells render as -, not a high Precision number with a small denominator.
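Precision is not recall. Precision@K doesn't measure how many real lookalikes the vendor missed. The TAM Recall section ([01.d]) is the complementary coverage metric: recall against a frozen, vendor-independent reference set, which is not the same as absolute TAM. Read the two together: a vendor can be precise but narrow, or broad but noisy.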
Domain-only seeding. All vendors receive the same compact input (name + domain + 1-line description). Vendors that benefit from richer inputs (e.g. headcount filters, ARR band, geography) may underperform their in-product behavior. The flip side is that this matches how an agent would query them.
Cohort coverage. 14 seeds, 2 per vertical across 7 verticals. Spans modern B2B SaaS, devtools, DTC ecom, healthcare networks, home-services SaaS / national chains, independent local trades (HVAC, plumbing, electrical), and multifamily real estate - the matrix exercises both tech-stack-driven matching and SIC/NAICS-style firmographic matching.

[02.f] reproducibility

Verify any number end-to-end

The full benchmark — runner code, judge prompt, leaderboard snapshot, and per-cell raw audit trail (the literal HTTP request/response we sent each vendor + the literal LLM judge prompt/response per candidate) — is mirrored in a public repo: openfunnel/gtm-bench. Auth headers are scrubbed via an allow-list; everything else is verbatim.

To audit a single cell, open data/lookalike-runs/<dataset>/<seed>/<vendor>.raw.json in that repo and replay any of the vendor_calls[] with your own credentials, or re-score with your own LLM by replaying judge_calls[].messages against any OpenAI-v1 compatible model — useful for measuring judge bias or drift across model versions.

[02.g] providers under review

Inclusion queue and how to request a provider

Live: OpenFunnel, Ocean.io, Exa, Parallel, PredictLeads.

Requested but not directly comparable: ZoomInfo (company lookalikes are sales-gated, no self-serve API), Clay (lookalike runs inside Clay tables), Apollo (no public lookalike endpoint), Lusha (`/v3/companies/lookalike` requires 5-100 seeds per request, incompatible with the per-seed cell unit of this benchmark).

Under review next: Common Room, Koala, LeadGenius, 6sense, Demandbase.

To request a provider, email founders@openfunnel.dev with a link to the public API docs and pricing page.