Search 18 million+
mathematical dependencies

A unified statement-level dependency graph spanning both informal and formal mathematics, including 11.7 million arXiv statements linked to Mathlib through a shared embedding space.

Simon KurganEvan WangEric LeonenSophie SzetoLuke AlexanderArtemii RemizovJarod AlperGiovanni InchiostroVasily Ilin

11.7M
Theorems indexed
18.3M
Dependency edges
388k
Lean declarations
47,952
Formal-informal matches

Motivation

Mathematical knowledge is organized around statements and their dependencies, but this structure is exposed unevenly. Informal papers cite mostly at the document level, while formal proof assistants like Lean record fine-grained dependencies over a much smaller body of mathematics. This asymmetry limits attribution, duplication detection, and automated formalization.

For mathematicians, a unified dependency graph makes it possible to check whether a result is already known and trace exactly which lemmas a proof relies on. A 2024 study found that 2.4% of withdrawn arXiv submissions were self-identified as non-novel, which might have been identified earlier with improved cross-paper dependency tracking.

For AI agents, the graph gives neural theorem provers and autoformalization tools a access to mathematics as a graph. Instead of retrieving flat semantic neighbors, an agent can walk the graph to find connected lemmas and related formalizations across informal and formal spaces.

How It Works

  1. 1

    Parse informal statements. We extract over 11.7 million theorem-like environments from mathematics arXiv papers using a regex-based parser, and recover 18.3 million candidate directed dependency edges. We use deterministic, heuristic, and notation-based parsing methods, and label each edge with the parser type.

  2. 2

    Extract the formal graph. LeanGraph extracts typed declaration-level dependencies from Mathlib4 and 25 open-source Lean projects, yielding 388,105 nodes and 11.3 million typed edges across six semantic categories.

  3. 3

    Generate slogans and embed. Every formal and informal statement we extract is summarized into a concise natural-language slogan by Qwen3-235B and embedded with Qwen3-Embedding-8B into a shared semantic space.

  4. 4

    Bridging the corpora. A cross-modal nearest-neighbor sweep proposes (informal, formal) candidate matches, and a GPT-5.4 judge affirms 47,952 matches above a 0.8 cosine similarity floor.

Overview

Informal Dependency Extractors

EXTRACTOREDGESPRECISION
Deterministic5.23M98.8%
Heuristic6.47M76.6%
Notation7.88M42.7%
Any (total)18.3M68.1%

Precision estimated by LLM judge (Kimi K2.5) on 500 sampled arXiv papers.

Formal Edge Types

TYPEDESCRIPTION
proofUsed inside a theorem proof term
sigAppears in a type signature
defUsed in a non-Prop definition body
fieldReferenced in a structure field type
extendsStructure or class inheritance
docrefBacktick reference in a docstring

LeanGraph extracts 388,105 nodes and 11.3M typed edges across 25 Lean projects (Mathlib v4.27-v4.29 plus 24 community formalizations).

REST API

TheoremGraph provides a REST API for semantic search and dependency graph traversal.

curl "https://api.theoremsearch.com/graph/paper?external_id=2301.00001"

Returns all statements and dependency edges for an arXiv paper or Lean repository. Full API reference →

MCP Tool

TheoremGraph is also available as an MCP tool for AI agents via a single theorem_search tool.

ENDPOINThttps://api.theoremsearch.com/mcp