What I would build
Talent Engineering at Polymarket
This page centers on sourcing: using data and engineering to find the right people, judge fit, and reach them at scale. That covers company lists, discovery, identity, enrichment, pipelines, activation, and outreach. The architecture map is the system sketch for that slice; below are the outcome, strategy, and spend model.
Broader talent engineering spans scheduling, ATS workflow, interview ops, analytics across the funnel, and more. A closing section summarizes adjacent work I'd still want to own or stay close to; the rest of the page stays on the sourcing core.
Goal outcome
- Identify which companies produce the right talent: scrape jobs through Exa to continually surface the companies producing the people we want to hire.
- Map everyone who fits the ICP for every role, using a combination of data sources and enrichment methods.
- Work out the best way to reach each of them. Once we know who to contact, we can mix channels: email, DMs, texts, event invites, or anything else.
- Everything leading to better hires, faster.
Strategy
Understand who is great and why, inferred from how people show up in the world, not only where they worked (though that matters and is a big part of the strategy). Signals include what they post and reply to, who they follow and who follows them, what they ship (repos, writing, design, markets), and anything else relevant to the role. Combined, these signals should reliably surface the best people to reach out to.
Talent is segmented, not one-size-fits-all. Engineering, policy / government affairs, CX, design, and markets-facing roles do not share company lists, discovery rules, or a bar for "great." Each segment runs its own custom company lists (who counts as a strong source org, tiering, who to pull from) and its own custom evals for judging talent: rubrics, scorecards, and model prompts calibrated to what excellence means in that lane. Sources and enrichment depth follow from those lists and evals, not from a generic template.
Data sources
Pull structured and behavioral signal from APIs, scraped pages, public artifacts, and community surfaces, scraping aggressively where it matters. The architecture map lists categories; the point is coverage plus cost: no single source owns "truth," and every ingestion path is budgeted.
Across the large platforms, the workhorse pattern is regex plus small LLMs: build a lean text blob per account or entity (username, bio, a handful of short posts or snippets), strip obvious junk with deterministic filters, then ask a cheap model for a tiny structured output — a label, score, or tag. That keeps token mass down and makes full passes over huge surfaces (including follower-scale lists) economically sane; see cost section for concrete token math.
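As a minimal sketch of that pattern, with hypothetical filter rules and prompt wording (`llm` stands in for whatever cheap model endpoint is actually used):

```python
import re

# Deterministic junk filters run before any model call: URLs, @mentions,
# and whitespace runs carry little signal but plenty of tokens.
JUNK = [
    (re.compile(r"https?://\S+"), ""),
    (re.compile(r"@\w+"), ""),
    (re.compile(r"\s+"), " "),
]

def clean(text: str) -> str:
    for pattern, repl in JUNK:
        text = pattern.sub(repl, text)
    return text.strip()

def build_blob(profile: dict, max_posts: int = 3, max_chars: int = 1500) -> str:
    """One lean text blob per account: username, bio, a few short posts."""
    parts = [profile.get("username", ""), clean(profile.get("bio", ""))]
    parts += [clean(p) for p in profile.get("posts", [])[:max_posts]]
    return " | ".join(p for p in parts if p)[:max_chars]

PROMPT = (
    "Label this profile for the '{segment}' talent segment. "
    "Reply with exactly one word: strong, maybe, or skip.\n\n{blob}"
)

def score_profile(profile: dict, segment: str, llm) -> str:
    """`llm` is any callable prompt -> short string (a cheap small model)."""
    return llm(PROMPT.format(segment=segment, blob=build_blob(profile))).strip().lower()
```

Forcing a one-word answer is what keeps each profile at a few hundred input tokens and a handful of output tokens, which is the whole economics of the full-surface pass.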
Company intelligence
Build a living model of which organizations train the people we want: tiering, roster changes, talent density, and narrative (what the company actually does vs the headline). Those org sets are the custom company lists per segment: engineering's A-list is not the same as policy's or CX's. The model feeds discovery seeds and tells you when a "B-tier" company is temporarily a better bet than a brand name for that specific lane.
Discovery
Expand from strong seeds (people and companies we already believe in) through graph moves (social graph, co-attendance, co-contribution) and event-driven spikes (launches, posts, wins). Followers and following are a first-class volume surface: a seed account's audience is a discovery frontier we can score in bulk (who matches the segment, who is noise) without hand-reviewing millions of rows.
I'd optimize for precision at the top of the funnel, not maximum raw leads. Regex gates and cheap model passes exist to shrink the set before we make expensive calls.
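A gate like this might look as follows; the include/exclude patterns are illustrative, since real ones would come from each segment's own lists and evals:

```python
import re

# Illustrative gates for an engineering segment: a bio must hit at least
# one include pattern and no exclude pattern to earn a model call.
INCLUDE = re.compile(r"\b(engineer|rust|golang|distributed|infra|ml)\b", re.I)
EXCLUDE = re.compile(r"\b(giveaway|airdrop|follow\s*back|promo)\b", re.I)

def passes_gate(bio: str) -> bool:
    return bool(INCLUDE.search(bio)) and not EXCLUDE.search(bio)

followers = [
    {"handle": "a", "bio": "Distributed systems engineer, mostly Rust"},
    {"handle": "b", "bio": "Crypto giveaway, follow back"},
    {"handle": "c", "bio": "Dog photos and brunch"},
]
frontier = [f for f in followers if passes_gate(f["bio"])]  # only "a" survives
```

Everything the gate drops costs nothing; the model only sees the survivors.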
Signal at scale: regex + small LLMs on followers and platform blobs
For ~1.2M followers (the Polymarket account, or any similarly large homogeneous list), a full scoring pass with a cheap small model lands in the tens of dollars, maybe low hundreds if prompts stay tight and outputs stay tiny.
Assumptions (concrete): one lean blob per account (username, bio, maybe 1–3 short posts), with regex stripping obvious junk first. After filtering, plan on ~200–500 input tokens per profile, and ask only for a 5–20 token label or score on the way out.
| Reference pricing (per 1M tokens) | Input | Output |
|---|---|---|
| GPT-4o-mini–class | ≈ $0.15–$0.30 | ≈ $0.60–$1.20 |
| Mistral Small 3.x / 4–class | ≈ $0.10–$0.20 | ≈ $0.30–$0.60 |
Pass over 1.2M users at 300 tokens in and 10 tokens out per user → 360M input tokens, 12M output tokens. At GPT-4o-mini–ish rates: input ≈ $54–$108, output ≈ $7–$14 → ~$60–$120 ballpark for one full pass. Tighter regex, shorter prompts, self-hosted small open weights, or batching can shave another 2–5×; bloated context pushes toward a few hundred.
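That arithmetic packages into a quick estimator (defaults mirror the GPT-4o-mini-class row above; a sanity-check tool, not a pricing source):

```python
def pass_cost(n_profiles, tokens_in=300, tokens_out=10,
              price_in=(0.15, 0.30), price_out=(0.60, 1.20)):
    """USD (low, high) for one full pass; prices are per 1M tokens."""
    m_in = n_profiles * tokens_in / 1e6   # millions of input tokens
    m_out = n_profiles * tokens_out / 1e6
    lo = m_in * price_in[0] + m_out * price_out[0]
    hi = m_in * price_in[1] + m_out * price_out[1]
    return round(lo), round(hi)

pass_cost(1_200_000)  # → (61, 122), the ~$60–$120 ballpark
```

Doubling the blob to 600 tokens roughly doubles the bill, which is why the regex trimming upstream matters more than the model choice.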
Discord, Reddit, Telegram, and Product Hunt get the same treatment: each scraped handle, thread author, project, or listing becomes one "profile" with the same trimmed blob. 100k items × ~300 tokens in ≈ 30M input tokens; at $0.10–$0.30/M input that's ~$3–$9, plus roughly $1–$3 of output for a cheap label pass, i.e. low double-digit dollars per full scan at that scale if regex keeps text tight.
Identity resolution
One person, many handles: deterministic joins where possible (email, wallet, stable IDs), probabilistic joins with scores everywhere else.
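One way to sketch that, with entirely hypothetical weights and fields (wallets and other stable IDs would slot in beside email on the deterministic path):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Handle:
    source: str                 # e.g. "x", "github", "farcaster"
    username: str
    email: Optional[str] = None
    display_name: str = ""
    bio_tokens: frozenset = frozenset()

def match_score(a: Handle, b: Handle) -> float:
    """Deterministic join on a stable ID, else a weighted probabilistic score."""
    if a.email and a.email == b.email:
        return 1.0  # stable ID match: treat as the same person
    score = 0.0
    if a.username.lower() == b.username.lower():
        score += 0.5                      # same handle across platforms
    if a.display_name and a.display_name == b.display_name:
        score += 0.2                      # same display name
    overlap = len(a.bio_tokens & b.bio_tokens)
    score += min(0.3, 0.05 * overlap)     # shared bio vocabulary, capped
    return round(score, 2)
```

High scores merge automatically, the middle band goes to review, and every score travels with the merged record so downstream enrichment knows how certain the join is.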
Enrichment & scoring
Tier spend: cheap features for everyone in the frontier, then structured LLM extraction for maybes, deep passes only for people who clear a bar. The wide layers lean on regex pre-filters + small models with a minimal output schema; the narrow layers can afford richer prompts. Custom evals per segment: scorecards, gates, and often different feature sets, all trained on hiring-manager feedback so engineering, CX, policy, and other lanes are not sharing one generic rubric.
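The tier-spend idea reduces to a toy funnel; the per-profile costs and score bars below are made up for illustration:

```python
# Toy three-tier funnel: every profile pays for each tier whose entry
# bar its current score clears. Costs and bars are illustrative only.
TIERS = [
    ("cheap_features", 0.0001, 0.0),  # (name, $ per profile, min score)
    ("structured_llm", 0.002, 0.4),
    ("deep_pass", 0.10, 0.8),
]

def spend(profiles):
    """profiles: iterable of (profile_id, score in [0, 1])."""
    total = 0.0
    counts = {name: 0 for name, _, _ in TIERS}
    for _pid, score in profiles:
        for name, cost, bar in TIERS:
            if score >= bar:
                counts[name] += 1
                total += cost
    return counts, round(total, 4)
```

The shape matters more than the numbers: the expensive tier sees orders of magnitude fewer people than the cheap one, so total spend is dominated by how well the early gates shrink the set.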
Role pipelines
Parallel tracks share infrastructure but not definitions of “great.” Each pipeline gets templates for sourcing, eval rubric, and which activation levers matter (e.g., competitions and OSS for eng; dinners and briefings for policy-aware roles).
Activation & events
Deliberate IRL and community moments: hosted comps, small dinners, curated invites, with attendance and follow-up feeding back into discovery and scoring. We'd work backwards from what we know about each person to pick the events they'd actually want to attend.
Outreach & feedback loops
Outbound that respects context (why you, why now), plus tight loops from recruiter and hiring-manager decisions into weights and seeds. Every “not a fit” would improve the next pass.
Cost snapshot — API & data stack
Numbers below are published vendor list pricing. Polymarket may already do better on negotiated contracts (especially the official X enterprise API).
Volume shape: X has millions of profiles in play; regex + cheap model passes get that down to tens of thousands of real candidates without blowing the budget (see Signal at scale). LinkedIn is smaller but still huge — on the order of ~100k engineers in NYC alone. Exa sits beside Fiber for open-web company search and people identification (search, deep research, page contents, monitors).
| Surface / vendor | Pricing model | List $ | Notes |
|---|---|---|---|
| X — official API | Enterprise (Polymarket) | Existing co. plan | Polymarket is on the enterprise X API tier; treat as baseline access, not a new line item in this stack. |
| X — SocialData.tools | Per successful results returned | ~$0.20 / 1k results | Published default for many endpoints (e.g. 1k posts); endpoint-specific pricing in their docs; volume discounts for large projects. |
| LinkedIn — Fiber.ai Enterprise | Monthly + credits | $2,400 / mo | 150k credits/mo; pay only on successful calls; reverse email, CRM enrichment, AI search, full API, priority Slack, higher limits, custom endpoints. |
| LinkedIn — Fiber.ai Bulk API Search | Monthly search credits | $2,000 / mo | 1M API search credits/mo; successful calls only; companies, jobs, people, combined + AI natural-language search; priority Slack. |
| Exa — Search | Per 1k requests | $7 / 1k | Real-time web search; token-efficient page text + highlights; agent tool-call style queries; configurable latency ~180ms–1s. Core for open-web company and people discovery alongside Fiber. |
| Exa — Deep Search | Per 1k requests | $12 / 1k | Multi-step agent workflows; structured outputs; web-grounded citations — for heavier research-style company / person questions. |
| Exa — Contents | Per 1k pages (per content type) | $1 / 1k | Full page body + highlights for LLM context; configurable livecrawl — cheap way to ground IDs on live pages. |
| Exa — Monitors | Per 1k requests | $15 / 1k | Scheduled searches (e.g. daily/weekly cadence); fresh events on the web; updates via webhooks. |
| GitHub | Public API rate limits | $0 | Standard API is free within limits. Most info can also be pulled with browser agents. |
| Reddit + browser agents | Scrape + Comet Max (Perplexity) | $0 data; ~$200 / mo agents | Reddit itself: scrape or automate with browser-style agents. Budget Comet Max ~$200/mo (Perplexity). |
| Discord | Scrape + bots | $0.20 per 1k posts | We'd use a combination of bots and browser agents to scrape the data. |
| Farcaster / other | APIs, indexers, self-host | $0–$500+ / mo | Depends whether you rent indexers or run infra. |
| LLM (enrichment + matching) | Per-token / per-request | ~$500–$5k+ / mo | Interactive workflows and fat context. Bulk passes over lean blobs stay cheap — often $60–$120 per ~1.2M profiles at mini-model rates (Signal at scale). |
| Embeddings + vector store | API + hosted DB | ~$50–$500 / mo | Scales with corpus size and query QPS. |
| ATS + comms | SaaS seat + API | Existing OPEX | Webhooks and status sync live alongside the stack above. |
| Contingent search fees | % of first-year cash comp | Target: $0 | Internal pipeline + selective retained/advisory only if needed. |
Stack order of magnitude: Fiber (~$2k–$2.4k/mo) + SocialData (~$0.20/1k results) + Exa (usage-based: Contents $1/1k pages, Search $7/1k, Deep Search $12/1k, Monitors $15/1k — mix depends on workflow) + Comet Max (~$200/mo) + LLM/embeddings. All-in often lands around ~$5k–$10k/mo before heavy Exa or SocialData volume spikes.
Beyond sourcing
Outside the sourcing core above, talent engineering usually includes a wider set of systems and execution work. Below are adjacent tasks I'd tackle in the same spirit.
- Treat hiring as an engineering problem: pipelines, scorecards, feedback loops, and analytics (velocity, source quality, cost per qualified conversation, time to fill).
- Ship tooling recruiters and hiring managers can run: briefs before screens, automated scheduling hygiene, outbound sequences.
- Integrate ATS, comms, calendars, and data sources; default to automation and experiments.
System diagram: Architecture map