Sourcing map

What I would build

Talent Engineering at Polymarket

This page centers primarily on sourcing: using data and engineering to find the right people, judge fit, and reach them at scale. That covers company lists, discovery, identity, enrichment, pipelines, activation, and outreach. The architecture map is the system sketch for that slice; below are the outcome, strategy, and spend model.

Broader talent engineering spans scheduling, ATS workflow, interview ops, analytics across the funnel, and more. A closing section summarizes adjacent work I'd still want to own or stay close to; the rest of the page stays on the sourcing core.

Goal outcome

Strategy

Understand who is great and why, inferred from how people show up in the world, not only where they worked (though work history is important and a big part of the strategy). Signals include what they post and reply to, who they follow and who follows them, what they ship (repos, writing, design, markets), and anything else relevant to the role. Combined, these signals should reliably surface the best people to reach out to.

Talent is segmented and not one-size-fits-all. It is not the same company lists, discovery rules, or bar for "great" across engineering, policy / government affairs, CX, design, or markets-facing roles. Each segment runs its own custom company lists (who counts as a strong source org, tiering, and who to pull from) and its own custom evals for judging talent: rubrics, scorecards, and model prompts calibrated to what excellence means in that lane. Sources and enrichment depth follow from those lists and evals, not from a generic template.

Data sources

Pull structured and behavioral signal from APIs, scraped pages, public artifacts, and community surfaces, scraping aggressively where it matters. The architecture map lists categories; the point is coverage + cost: no single source owns "truth," and every ingestion path is budgeted.

Across the large platforms, the workhorse pattern is regex plus small LLMs: build a lean text blob per account or entity (username, bio, a handful of short posts or snippets), strip obvious junk with deterministic filters, then ask a cheap model for a tiny structured output — a label, score, or tag. That keeps token mass down and makes full passes over huge surfaces (including follower-scale lists) economically sane; see cost section for concrete token math.
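As a concrete sketch of that pattern (all names, regexes, and the prompt shape are illustrative; the real gates and labels would be tuned per segment):

```python
import re

# Deterministic pre-filters: strip junk before any tokens hit a model.
URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")

# Hypothetical segment gate: cheap regex signal that a profile is worth a model call.
ENG_GATE_RE = re.compile(r"\b(engineer|rust|solidity|backend|infra|ml)\b", re.I)

def build_blob(username: str, bio: str, posts: list[str], max_posts: int = 3) -> str:
    """One lean text blob per account: username, bio, a few short posts."""
    parts = [f"user: {username}", f"bio: {bio}"] + [f"post: {p}" for p in posts[:max_posts]]
    text = " | ".join(parts)
    text = URL_RE.sub("", text)                 # links add tokens, not signal
    return WHITESPACE_RE.sub(" ", text).strip()

def passes_gate(blob: str) -> bool:
    """Only blobs that clear the regex gate get a (paid) model call."""
    return bool(ENG_GATE_RE.search(blob))

def label_prompt(blob: str) -> str:
    """Tiny structured-output prompt for a cheap model; the answer stays 5-20 tokens."""
    return (
        "Classify this profile for the engineering segment. "
        'Reply with JSON only: {"segment_fit": 0-3, "tag": "<one word>"}\n'
        + blob
    )
```

The regex gate is free and runs over everything; the prompt only ever sees the small slice that survives it, which is what keeps full passes over follower-scale lists cheap.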

Company intelligence

Build a living model of which organizations train the people we want: tiering, roster changes, talent density, and narrative (what the company actually does vs the headline). Those org sets are custom company lists per segment: engineering's A-list is not the same as policy's or CX's. The model feeds discovery seeds and tells you when a "B-tier" company is temporarily a better bet than a brand name for that specific lane.

Discovery

Expand from strong seeds: people and companies we already believe in through graph moves (social graph, co-attendance, co-contribution) and event-driven spikes (launches, posts, wins). Followers and following are a first-class volume surface: a seed account’s audience is a discovery frontier we can score in bulk (who matches the segment, who is noise) without hand-reviewing millions of rows.

I'd optimize for precision at the top of the funnel, not maximum raw leads. Regex gates and cheap model passes exist to shrink the set before we make expensive calls.

Signal at scale: regex + small LLMs on followers and platform blobs

For the ~1.2M followers of the Polymarket account (or any large homogeneous list), a full scoring pass with a cheap small model lands in the tens of dollars, maybe low hundreds if prompts stay tight and outputs stay tiny.

Assumptions (concrete): one lean blob per account (username, bio, maybe 1–3 short posts), with regex stripping obvious junk first. After filtering, plan on ~200–500 input tokens per profile into the model, and ask only for a 5–20 token label or score on the way out.

Reference pricing (per 1M tokens):

| Model class | Input | Output |
| --- | --- | --- |
| GPT-4o-mini–class | ≈ $0.15–$0.30 | ≈ $0.60–$1.20 |
| Mistral Small 3.x / 4–class | ≈ $0.10–$0.20 | ≈ $0.30–$0.60 |

Pass over 1.2M users at 300 tokens in and 10 tokens out per user → 360M input tokens, 12M output tokens. At GPT-4o-mini–ish rates: input ≈ $54–$108, output ≈ $7–$14, so a ~$60–$120 ballpark for one full pass. Tighter regex, shorter prompts, self-hosted small open weights, or batching can shave another 2–5×; bloated context pushes toward a few hundred.

Discord, Reddit, Telegram, and Product Hunt get scraped; each handle, thread author, project, or listing is treated as one "profile" with the same trimmed blob. 100k items × ~300 tokens in ≈ 30M input tokens; at $0.10–$0.30/M input that is ~$3–$9, plus roughly $1–$3 in outputs for a cheap label pass, i.e. low double-digit dollars per full scan at that scale if regex keeps text tight.
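The token math above fits in a few lines. A quick estimator (rates are $ per 1M tokens, matching the reference pricing table):

```python
def pass_cost(n_profiles: int, in_tokens: int, out_tokens: int,
              in_rate: float, out_rate: float) -> float:
    """USD cost of one full scoring pass; rates are $ per 1M tokens."""
    input_cost = n_profiles * in_tokens / 1e6 * in_rate
    output_cost = n_profiles * out_tokens / 1e6 * out_rate
    return input_cost + output_cost

# 1.2M followers, 300 tokens in / 10 out, GPT-4o-mini-class rates
low = pass_cost(1_200_000, 300, 10, 0.15, 0.60)   # ≈ $61
high = pass_cost(1_200_000, 300, 10, 0.30, 1.20)  # ≈ $122
```

Swapping in the 100k-item community-surface numbers (or self-hosted open-weight rates) reproduces the other estimates in this section.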

Identity resolution

One person, many handles: deterministic joins where possible (email, wallet, stable IDs), probabilistic joins with scores everywhere else.

Enrichment & scoring

Tier spend: cheap features for everyone in the frontier, then structured LLM extraction for maybes, deep passes only for people who clear a bar. The wide layers lean on regex pre-filters + small models with minimal output schema; the narrow layers can afford richer prompts. Custom evals per segment: scorecards, gates, and often different feature sets all trained on hiring-manager feedback so engineering, CX, policy, and other lanes are not sharing one generic rubric.
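A toy router for that tiering, with made-up thresholds; the gate and scorer are stand-ins for the per-segment regex filters and small-model labels described above.

```python
from typing import Callable

def tier(profile: dict,
         segment_gate: Callable[[dict], bool],
         cheap_score: Callable[[dict], float]) -> str:
    """Route each frontier profile to a spend tier; thresholds are illustrative."""
    if not segment_gate(profile):   # free regex features run over everyone
        return "drop"
    s = cheap_score(profile)        # small-model label: pennies per profile
    if s >= 0.8:
        return "deep_pass"          # rich prompts, human-grade enrichment
    if s >= 0.4:
        return "llm_extract"        # structured extraction for the maybes
    return "watchlist"
```

The point of the shape: the expensive tiers only ever see profiles that already cleared two cheap filters, so per-segment recalibration means changing thresholds and scorers, not rebuilding the pipeline.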

Role pipelines

Parallel tracks share infrastructure but not definitions of “great.” Each pipeline gets templates for sourcing, eval rubric, and which activation levers matter (e.g., competitions and OSS for eng; dinners and briefings for policy-aware roles).

Activation & events

Deliberate IRL and community moments: hosted comps, small dinners, curated invites, with attendance and follow-up feeding back into discovery and scoring. We'd work backwards from what we know about each person to figure out which events they'd actually be interested in.

Outreach & feedback loops

Outbound that respects context (why you, why now), plus tight loops from recruiter and hiring-manager decisions into weights and seeds. Every “not a fit” would improve the next pass.

Cost snapshot — API & data stack

Numbers below are published vendor list pricing. Polymarket may already do better on negotiated contracts (especially the official X enterprise API).

Volume shape: X has millions of profiles in play; regex + cheap model passes get that down to tens of thousands of real candidates without blowing the budget (see Signal at scale). LinkedIn is smaller but still huge — on the order of ~100k engineers in NYC alone. Exa sits beside Fiber for open-web company search and people identification (search, deep research, page contents, monitors).

| Surface / vendor | Pricing model | List $ | Notes |
| --- | --- | --- | --- |
| X — official API | Enterprise (Polymarket) | Existing co. plan | Polymarket is on the enterprise X API tier; treat as baseline access, not a new line item in this stack. |
| X — SocialData.tools | Per successful result returned | ~$0.20 / 1k results | Published default for many endpoints (e.g. 1k posts); endpoint-specific pricing in their docs; volume discounts for large projects. |
| LinkedIn — Fiber.ai Enterprise | Monthly + credits | $2,400 / mo | 150k credits/mo; pay only on successful calls; reverse email, CRM enrichment, AI search, full API, priority Slack, higher limits, custom endpoints. |
| LinkedIn — Fiber.ai Bulk API Search | Monthly search credits | $2,000 / mo | 1M API search credits/mo; successful calls only; companies, jobs, people, combined + AI natural-language search; priority Slack. |
| Exa — Search | Per 1k requests | $7 / 1k | Real-time web search; token-efficient page text + highlights; agent tool-call style queries; configurable latency ~180ms–1s. Core for open-web company and people discovery alongside Fiber. |
| Exa — Deep Search | Per 1k requests | $12 / 1k | Multi-step agent workflows; structured outputs; web-grounded citations; for heavier research-style company / person questions. |
| Exa — Contents | Per 1k pages (per content type) | $1 / 1k | Full page body + highlights for LLM context; configurable livecrawl; a cheap way to ground IDs on live pages. |
| Exa — Monitors | Per 1k requests | $15 / 1k | Scheduled searches (e.g. daily/weekly cadence); fresh events on the web; updates via webhooks. |
| GitHub | Public API rate limits | $0 | Standard API is free within limits. Most info can also be pulled with browser agents. |
| Reddit + browser agents | Scrape + Comet Max (Perplexity) | $0 data; ~$200 / mo agents | Reddit itself: scrape or automate with browser-style agents. Budget Comet Max ~$200/mo (Perplexity). |
| Discord | Scrape + bots | $0.20 per 1k posts | We'd use a combination of bots and browser agents to scrape the data. |
| Farcaster / other | APIs, indexers, self-host | $0–$500+ / mo | Depends whether you rent indexers or run infra. |
| LLM (enrichment + matching) | Per-token / per-request | ~$500–$5k+ / mo | Interactive workflows and fat context. Bulk passes over lean blobs stay cheap, often $60–$120 per ~1.2M profiles at mini-model rates (Signal at scale). |
| Embeddings + vector store | API + hosted DB | ~$50–$500 / mo | Scales with corpus size and query QPS. |
| ATS + comms | SaaS seat + API | Existing OPEX | Webhooks and status sync live alongside the stack above. |
| Contingent search fees | % of first-year cash comp | Target: $0 | Internal pipeline + selective retained/advisory only if needed. |

Stack order of magnitude: Fiber (~$2k–$2.4k/mo) + SocialData (~$0.20/1k results) + Exa (usage-based: Contents $1/1k pages, Search $7/1k, Deep Search $12/1k, Monitors $15/1k — mix depends on workflow) + Comet Max (~$200/mo) + LLM/embeddings. All-in often lands around ~$5k–$10k/mo before heavy Exa or SocialData volume spikes.
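One way to sanity-check that order of magnitude: plug assumed monthly volumes into the list prices above. Every default volume here is an assumption, not a quoted usage figure.

```python
def monthly_stack_cost(
    fiber: float = 2_400,            # Fiber.ai Enterprise, list $/mo
    comet: float = 200,              # Comet Max browser agents, $/mo
    socialdata_results_k: float = 5_000,  # thousands of results/mo (assumed volume)
    exa_search_k: float = 200,       # thousands of Exa Search requests/mo (assumed)
    exa_contents_k: float = 500,     # thousands of Exa Contents pages/mo (assumed)
    llm: float = 1_000,              # mid-range LLM line from the table
    embeddings: float = 200,         # mid-range embeddings + vector store
) -> float:
    """Rough $/mo for the sourcing stack at list prices and assumed volumes."""
    usage = (socialdata_results_k * 0.20   # SocialData: ~$0.20 per 1k results
             + exa_search_k * 7            # Exa Search: $7 per 1k requests
             + exa_contents_k * 1)         # Exa Contents: $1 per 1k pages
    return fiber + comet + usage + llm + embeddings
```

At these assumed volumes the total lands mid-range in the ~$5k–$10k/mo band; heavy Exa or SocialData months push it out the top.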

Beyond sourcing

Outside the sourcing core above, talent engineering work usually includes a wider set of systems and execution work. Below are adjacent tasks that could be tackled in the same spirit.

  • Treat hiring as an engineering problem: pipelines, scorecards, feedback loops, and analytics (velocity, source quality, cost per qualified conversation, time to fill).
  • Ship tooling recruiters and hiring managers can run: briefs before screens, automated scheduling hygiene, outbound sequences.
  • Integrate ATS, comms, calendars, and data sources; default to automation and experiments.

System diagram: Architecture map