Marc Pinet

Citracer

📝 Description

Trace citation chains for any keyword across research papers.

Given a source PDF and a keyword, citracer parses the bibliography with GROBID, finds every occurrence of the keyword in the body, identifies the references cited near each occurrence, downloads those papers, and recursively walks the resulting citation graph. The output is an interactive HTML page.

Supported sources. citracer currently resolves cited papers through three external services: arXiv, Semantic Scholar, and OpenReview (for ICLR / TMLR papers not on arXiv). Workshop proceedings, books, and paywalled journal articles are not retrievable and appear as unavailable nodes in the graph.

citracer interactive graph

⚙️ Installation

Requirements: Python 3.10+ and Docker.

pip install -r requirements.txt
docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.1

GROBID must be reachable on http://localhost:8070. Verify with curl http://localhost:8070/api/isalive.

A Semantic Scholar API key is optional but recommended — without one the public endpoint is throttled to ~3.5s between calls. With a key, the throttle drops to 0.2s.
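The throttling can be pictured as a minimal rate limiter that enforces a floor on the interval between calls. This is an illustrative stand-in, not citracer's actual implementation, and `api_key` below is a placeholder:

```python
import time


class Throttle:
    """Minimal rate limiter: enforce a minimum interval between calls."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = 0.0

    def wait(self) -> None:
        # Sleep just long enough to respect the minimum interval.
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()


api_key = None  # set to your Semantic Scholar key if you have one
s2_throttle = Throttle(0.2 if api_key else 3.5)  # intervals per the README
```

Calling `s2_throttle.wait()` before each request keeps the client under the endpoint's limit without any per-call bookkeeping at the call sites.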

🚀 Usage

python citation_tracer.py --pdf paper.pdf --keyword "channel-independent" --depth 3

| Flag | Default | Description |
|------|---------|-------------|
| `--pdf` | required | Path to the source PDF |
| `--keyword` | required | Term to trace through citations |
| `--depth` | `3` | Maximum recursion depth |
| `--details` | off | Show passages directly in node tooltips |
| `--output` | `./output/graph.html` | Output HTML file |
| `--cache-dir` | `./cache` | Local cache for PDFs and metadata |
| `--grobid-url` | `http://localhost:8070` | GROBID service URL |
| `--s2-api-key` | none | Semantic Scholar API key |
| `--context-window` | sentence-based | If set, fall back to a ±N-character window for ref association |
| `--no-open` | off | Do not open the result in a browser |
| `-v, --verbose` | off | Verbose logging |

🎨 Output

Nodes are colored by status:

| Color | Status | Meaning |
|-------|--------|---------|
| blue | root | The source PDF |
| green | analyzed | PDF retrieved and the keyword was found in its text |
| gray | analyzed (no match) | PDF retrieved and parsed, but the keyword does not appear |
| red | unavailable | PDF could not be retrieved |

Node size scales with the number of keyword occurrences. The interactive graph supports hovering over a node for a live preview, clicking a node to pin it, clicking legend entries to toggle visibility by status, and KaTeX rendering of LaTeX in passages.

🔍 How it works

  1. PDF parsing. GROBID processes the PDF and returns TEI XML. citracer walks the <body> to reconstruct the plain text while recording the character offset of every inline <ref type="bibr"> citation. The bibliography is extracted from <listBibl>. Figure-diagram paragraphs (detected by their density of mathematical Unicode characters) are skipped to avoid polluting the keyword matcher.
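Step 1's TEI walk can be sketched with lxml. This is a simplified illustration (the function names are mine, and the real walker also applies the figure-noise filter):

```python
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"


def _walk(elem, parts, refs):
    # Record the character offset of each inline <ref type="bibr">
    # before appending its own text.
    if elem.tag == TEI + "ref" and elem.get("type") == "bibr":
        refs.append((sum(len(p) for p in parts), elem.get("target")))
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        _walk(child, parts, refs)
        if child.tail:  # text that follows the child inside this element
            parts.append(child.tail)


def parse_body(tei_xml: bytes):
    """Return (plain_text, [(char_offset, ref_target), ...]) for <body>."""
    body = etree.fromstring(tei_xml).find(".//" + TEI + "body")
    parts, refs = [], []
    _walk(body, parts, refs)
    return "".join(parts), refs
```

Keeping offsets rather than just the ref IDs is what later allows each keyword hit to be matched against citations in the same sentence.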

  2. Keyword matching. The keyword is compiled to a flexible regex that handles morphological variants (e.g. channel-independent matches channel-independence, channel independently, channelindependence). The body is segmented into sentences with pysbd, and each occurrence of the keyword is associated with the references cited in the same sentence or the immediately following one.
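The morphological matching in step 2 might look roughly like this. It is a crude sketch (citracer's actual regex construction is more involved), tolerating hyphenated, spaced, and joined token variants plus a suffix on the last token:

```python
import re


def flexible_pattern(keyword: str) -> re.Pattern:
    """Build a regex tolerant of hyphen/space/joined variants and a
    morphological suffix on the last token (very crude stemming)."""
    tokens = re.split(r"[\s-]+", keyword.lower())
    head = r"[\s-]?".join(re.escape(t) for t in tokens[:-1])
    last = re.escape(tokens[-1].rstrip("tce"))  # independent -> independen
    body = (head + r"[\s-]?" if head else "") + last + r"\w*"
    return re.compile(body, re.IGNORECASE)
```

Each match is then associated with the `<ref>` offsets that fall inside the same pysbd-segmented sentence or the immediately following one.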

  3. Reference resolution. Each cited paper is resolved through the following cascade:

    1. If GROBID extracted a DOI or arXiv ID, use it directly.
    2. Otherwise, search arXiv by title (phrase first, then keyword fallback, with rapidfuzz validation).
    3. If arXiv has nothing, query Semantic Scholar with 429-aware backoff.
    4. As a last resort, search OpenReview (covers ICLR/TMLR papers not on arXiv).

    Resolved PDFs are cached in ./cache/pdfs/.
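The cascade in step 3 is essentially a "first backend that answers wins" loop. A minimal sketch, with the backends passed in as callables (the real resolver additionally does rapidfuzz title validation and 429-aware backoff):

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Ref:
    title: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None


def resolve(ref: Ref, by_id: Callable, searchers: list[Callable]) -> Optional[str]:
    """Try a direct identifier first, then each backend in priority order
    (arXiv -> Semantic Scholar -> OpenReview). None means unavailable."""
    if ref.doi or ref.arxiv_id:
        return by_id(ref)
    for search in searchers:
        hit = search(ref.title)
        if hit is not None:
            return hit
    return None  # rendered as a red "unavailable" node
```

Because the backends are plain callables, adding a fourth source would only mean appending to the `searchers` list.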

  4. Recursion. The tracer is a BFS that processes papers in queue order, deduplicating by canonical ID (DOI > arXiv > OpenReview > title hash). When the same PDF is reached via a second path, the new edge is added without re-parsing.
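Step 4 is a textbook BFS with a `seen` set keyed by canonical ID; a sketch (the `get_citations` hook is hypothetical and would wrap the parse-and-resolve pipeline):

```python
from collections import deque


def trace(root_id, get_citations, max_depth=3):
    """BFS over the citation graph, deduplicating by canonical ID.
    get_citations(node_id) -> iterable of canonical IDs of cited papers."""
    seen = {root_id}
    edges = []
    queue = deque([(root_id, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for cited in get_citations(node):
            edges.append((node, cited))  # record every path as an edge
            if cited not in seen:        # but parse each paper only once
                seen.add(cited)
                queue.append((cited, depth + 1))
    return seen, edges
```

Note that the edge is appended unconditionally while the queue append is guarded, which is exactly the "add the new edge without re-parsing" behavior described above.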

  5. Rendering. The graph is serialized to an interactive HTML page using pyvis, with a custom overlay for the legend filter, side info panel, keyword highlighting, and KaTeX math.

📁 Project structure

citation_tracer/
├── cli.py                  # argparse entry point
├── pdf_parser.py           # GROBID + TEI walking + figure-noise filter + pymupdf fallback
├── keyword_matcher.py      # morphological regex + sentence-based ref association
├── reference_resolver.py   # arXiv-first cascade resolver with cache
├── tracer.py               # BFS recursion with deduplication
├── visualizer.py           # pyvis rendering + custom overlay
├── models.py               # dataclasses
└── utils.py                # ID normalization, hashing, logging

🧩 Dependencies

| Package | Used for |
|---------|----------|
| GROBID | PDF structural parsing (external service) |
| lxml | TEI XML processing |
| pymupdf | PDF text extraction (parser fallback) |
| arxiv | arXiv search and download |
| pysbd | Sentence boundary detection |
| pyvis | Interactive HTML graph rendering |
| rapidfuzz | Fuzzy title matching |
| requests | HTTP client |
| tqdm | Progress bar |
| KaTeX | LaTeX math rendering in the HTML output (CDN) |

External APIs: arXiv, Semantic Scholar, and OpenReview.

⚠️ Limitations

  • GROBID misclassifies a small fraction of references (especially sub-citations like Liu et al., 2024b). These are silently dropped.
  • pysbd handles most academic abbreviations but can occasionally split mid-sentence; falling back to --context-window 300 is sometimes useful.
  • arXiv enforces ~3 seconds between requests, so the first run on a deep trace can take several minutes. The local cache makes subsequent runs fast.
  • Only three sources are supported for resolving cited papers: arXiv, Semantic Scholar, and OpenReview. Workshop papers, books, and journal articles without an open-access PDF on one of these platforms appear as unavailable red nodes.

✍️ Authors

Marc Pinet