Citracer

📝 Description

Trace citation chains for any keyword across research papers.

Given a source PDF and a keyword, citracer parses the bibliography with GROBID, finds every occurrence of the keyword in the body, identifies the references cited near each occurrence, downloads those papers, and recursively walks the resulting citation graph. The output is an interactive HTML page.

Supported sources. citracer currently resolves cited papers through three external services: arXiv, Semantic Scholar, and OpenReview (for ICLR / TMLR papers not on arXiv). Workshop proceedings, books, and paywalled journal articles are not retrievable and appear as unavailable nodes in the graph.

⚙️ Installation

Requirements: Python 3.10+ and Docker.

pip install -r requirements.txt
docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.1

GROBID must be reachable on http://localhost:8070. Verify with curl http://localhost:8070/api/isalive.

A Semantic Scholar API key is optional but recommended: without one, calls to the public endpoint are throttled to about 3.5 s apart; with a key, the throttle drops to 0.2 s.
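
The throttle described above can be pictured as a minimum-interval limiter. The following is only a sketch under stated assumptions: the names `Throttle` and `s2_throttle` are hypothetical and are not taken from citracer's code; only the 3.5 s and 0.2 s intervals come from this README.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls (illustrative only)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def s2_throttle(api_key=None):
    """Pick the interval quoted in this README: 0.2 s with a key, 3.5 s without."""
    return Throttle(0.2 if api_key else 3.5)
```

Calling `wait()` before each request keeps the spacing regardless of how long the request itself takes.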

🚀 Usage

python citation_tracer.py --pdf paper.pdf --keyword "channel-independent" --depth 3

Flag	Default	Description
`--pdf`	required	Path to the source PDF
`--keyword`	required	Term to trace through citations
`--depth`	`3`	Maximum recursion depth
`--details`	off	Show passages directly in node tooltips
`--output`	`./output/graph.html`	Output HTML file
`--cache-dir`	`./cache`	Local cache for PDFs and metadata
`--grobid-url`	`http://localhost:8070`	GROBID service URL
`--s2-api-key`	none	Semantic Scholar API key
`--context-window`	sentence-based	If set, fall back to a ±N character window for ref association
`--no-open`	off	Do not open the result in a browser
`-v, --verbose`	off	Verbose logging

🎨 Output

Nodes are colored by status:

Color	Status	Meaning
blue	`root`	The source PDF
green	`analyzed`	PDF retrieved and the keyword was found in its text
gray	`analyzed (no match)`	PDF retrieved and parsed, but the keyword does not appear
red	`unavailable`	PDF could not be retrieved

Node size scales with the number of keyword occurrences. The interactive graph supports hover for live preview, click to pin a node, click on the legend to toggle visibility by status, and KaTeX rendering of LaTeX in passages.

🔍 How it works

PDF parsing. GROBID processes the PDF and returns TEI XML. citracer walks the <body> to reconstruct the plain text while recording the character offset of every inline <ref type="bibr"> citation. The bibliography is extracted from <listBibl>. Figure-diagram paragraphs (detected by their density of mathematical Unicode characters) are skipped to avoid polluting the keyword matcher.
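
As an illustration of the TEI walk described above, here is a minimal sketch that rebuilds plain text from <body> while recording where each inline citation starts. It makes several assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it is self-contained, the function name `walk_body` is hypothetical, and it omits the figure-noise filter.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace used in GROBID output


def walk_body(tei_xml):
    """Return (plain_text, refs) where refs is a list of (char_offset, target_id)."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{TEI_NS}body")
    parts = []   # text fragments in document order
    refs = []    # offset of each citation marker and the bibliography entry it targets
    length = 0

    def emit(text):
        nonlocal length
        if text:
            parts.append(text)
            length += len(text)

    def visit(node):
        # Record the offset before emitting the marker text itself.
        if node.tag == f"{TEI_NS}ref" and node.get("type") == "bibr":
            refs.append((length, node.get("target", "").lstrip("#")))
        emit(node.text)
        for child in node:
            visit(child)
            emit(child.tail)  # text that follows the child inside the parent

    if body is not None:
        visit(body)
    return "".join(parts), refs
```

Because offsets are measured in the reconstructed text, they can be compared directly with the offsets of keyword matches found in that same text.
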
Keyword matching. The keyword is compiled to a flexible regex that handles morphological variants (e.g. channel-independent matches channel-independence, channel independently, channelindependence). The body is segmented into sentences with pysbd, and each occurrence of the keyword is associated with the references cited in the same sentence or the immediately following one.
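
A rough sketch of both ideas, the flexible regex and the sentence-based association, follows. Everything in it is an assumption made for illustration: the suffix list, the function names `compile_keyword` and `refs_near`, and the input shapes are hypothetical, and sentence spans are passed in rather than computed with pysbd so that the example has no third-party dependency.

```python
import re
from bisect import bisect_right

# Illustrative suffix list; the rules of the real matcher are not documented here.
SUFFIXES = ("ently", "ence", "ent", "ing", "ed", "s")


def stem(token):
    """Strip one common suffix, keeping at least four characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def compile_keyword(keyword):
    """Build a case-insensitive pattern tolerant of separators and word endings."""
    tokens = [t for t in re.split(r"[\s\-]+", keyword.strip()) if t]
    if not tokens:
        raise ValueError("keyword must contain at least one word")
    parts = [re.escape(t) for t in tokens[:-1]]
    parts.append(re.escape(stem(tokens[-1])) + r"\w*")  # any ending on the last word
    # Words may be joined by hyphens, whitespace, or nothing at all.
    return re.compile(r"\b" + r"[\s\-]*".join(parts), re.IGNORECASE)


def refs_near(match_offset, sentence_spans, refs):
    """Return ids of refs cited in the matching sentence or the one after it.

    sentence_spans: sorted (start, end) character spans from a sentence segmenter.
    refs: (offset, ref_id) pairs for inline citations.
    """
    starts = [start for start, _ in sentence_spans]
    i = bisect_right(starts, match_offset) - 1
    if i < 0:
        return []
    last = min(i + 1, len(sentence_spans) - 1)
    lo, hi = sentence_spans[i][0], sentence_spans[last][1]
    return [ref_id for offset, ref_id in refs if lo <= offset < hi]
```

With this sketch, `compile_keyword("channel-independent")` matches all three variants listed above but not "channel dependence".
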
Reference resolution. Each cited paper is resolved through the following cascade:
1. If GROBID extracted a DOI or arXiv ID, use it directly.
2. Otherwise, search arXiv by title (phrase first, then keyword fallback, with rapidfuzz validation).
3. If arXiv has nothing, query Semantic Scholar with 429-aware backoff.
4. As a last resort, search OpenReview (covers ICLR/TMLR papers not on arXiv).
Resolved PDFs are cached in ./cache/pdfs/.
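
The cascade can be sketched as a loop over search functions with a short-circuit for known identifiers. This is a simplified illustration, not citracer's resolver: `Ref`, `title_matches`, `resolve`, and the locator strings are hypothetical, the searchers are plain callables standing in for the arXiv, Semantic Scholar, and OpenReview clients, and `difflib` replaces rapidfuzz so the example runs on the standard library alone.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional

# A searcher takes a title and returns (found_title, locator) or None.
Searcher = Callable[[str], Optional[tuple[str, str]]]


@dataclass
class Ref:
    title: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None


def title_matches(wanted, found, threshold=0.9):
    """Accept a search hit only if its title is close to the cited title."""
    ratio = SequenceMatcher(None, wanted.lower(), found.lower()).ratio()
    return ratio >= threshold


def resolve(ref, searchers):
    """Return a locator for the cited paper, or None if every step fails."""
    # Step 1: identifiers extracted from the bibliography skip the search entirely.
    if ref.doi:
        return f"doi:{ref.doi}"
    if ref.arxiv_id:
        return f"arxiv:{ref.arxiv_id}"
    # Steps 2 to 4: try each service in order and validate the returned title.
    for search in searchers:
        found = search(ref.title)
        if found is not None and title_matches(ref.title, found[0]):
            return found[1]
    return None
```

Passing the searchers as an ordered list keeps the priority (arXiv, then Semantic Scholar, then OpenReview) in one place and makes each step easy to stub in tests.
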
Recursion. The tracer is a BFS that processes papers in queue order, deduplicating by canonical ID (DOI > arXiv > OpenReview > title hash). When the same PDF is reached via a second path, the new edge is added without re-parsing.
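
A minimal sketch of such a traversal is shown below. It assumes papers are plain dictionaries and that a hypothetical `expand` callable stands in for the download, parse, and match steps; the depth convention (the root is depth 0 and papers at `max_depth` are not expanded) is also an assumption.

```python
import hashlib
from collections import deque


def canonical_id(paper):
    """Pick the strongest available identifier: DOI > arXiv > OpenReview > title hash."""
    for key in ("doi", "arxiv_id", "openreview_id"):
        if paper.get(key):
            return f"{key}:{paper[key].lower()}"
    normalized = " ".join(paper["title"].lower().split())
    return "title:" + hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:16]


def trace(root, expand, max_depth):
    """Breadth-first walk; returns (seen_ids, edges) with each paper expanded once."""
    seen = {canonical_id(root)}
    edges = []
    queue = deque([(root, 0)])
    while queue:
        paper, depth = queue.popleft()
        if depth >= max_depth:
            continue  # keep the node but do not follow its citations
        pid = canonical_id(paper)
        for cited in expand(paper):
            cid = canonical_id(cited)
            edges.append((pid, cid))  # always record the edge
            if cid not in seen:       # enqueue only papers not reached before
                seen.add(cid)
                queue.append((cited, depth + 1))
    return seen, edges
```

Since a paper enters the queue only the first time its canonical ID is seen, a second path to it contributes an edge but no extra call to `expand`.
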
Rendering. The graph is serialized to an interactive HTML page using pyvis, with a custom overlay for the legend filter, side info panel, keyword highlighting, and KaTeX math.

📁 Project structure

citation_tracer/
├── cli.py                  # argparse entry point
├── pdf_parser.py           # GROBID + TEI walking + figure-noise filter + pymupdf fallback
├── keyword_matcher.py      # morphological regex + sentence-based ref association
├── reference_resolver.py   # arXiv-first cascade resolver with cache
├── tracer.py               # BFS recursion with deduplication
├── visualizer.py           # pyvis rendering + custom overlay
├── models.py               # dataclasses
└── utils.py                # ID normalization, hashing, logging

🧩 Dependencies

Package	Used for
GROBID	PDF structural parsing (external service)
lxml	TEI XML processing
pymupdf	PDF text extraction (parser fallback)
arxiv	arXiv search and download
pysbd	Sentence boundary detection
pyvis	Interactive HTML graph rendering
rapidfuzz	Fuzzy title matching
requests	HTTP client
tqdm	Progress bar
KaTeX	LaTeX math rendering in the HTML output (CDN)

External APIs: arXiv, Semantic Scholar, and OpenReview.

⚠️ Limitations

GROBID misclassifies a small fraction of references (especially sub-citations like Liu et al., 2024b). These are silently dropped.
pysbd handles most academic abbreviations but can occasionally split mid-sentence; falling back to --context-window 300 is sometimes useful.
arXiv enforces ~3 seconds between requests, so the first run on a deep trace can take several minutes. The local cache makes subsequent runs fast.
Only three sources are supported for resolving cited papers: arXiv, Semantic Scholar and OpenReview. Workshop papers, books, and journal articles without an open-access PDF on one of these platforms appear as unavailable red nodes.

✍️ Authors

Marc Pinet - Initial work - marcpinet
