Home
cd ../playbooks
ProductivityIntermediate

PDF to Readable HTML

Convert a PDF into one self-contained, readable HTML file that preserves images, tables, charts, and reading order — optionally translating it while keeping every figure intact.

10 minutes
By daymadeSource
#pdf#html#reading#translation#conversion

You just want to read a dense PDF comfortably on your phone, or translate it without losing the charts. This playbook turns it into one portable, image-faithful HTML page.

Who it's for: readers, researchers, students, knowledge workers

Example

"Make this PDF readable on my phone, in English" → One self-contained HTML file with images, tables, and charts preserved, translated and visually verified

CLAUDE.md Template

New here? 3-minute setup guide → | Already set up? Copy the template below.

# PDF to HTML

Turn a PDF into a single, self-contained, readable HTML file — images, tables,
charts and reading order preserved — and optionally translate it, keeping every
figure in place.

The pipeline is **extract → look → (translate) → build → verify**. The middle
"look" and final "verify" steps are where faithfulness actually comes from: a PDF
is a layout, not just a text stream, so you read the rendered pages before
building and the rendered HTML before delivering.

This skill runs **inline** (no `context: fork`): translation orchestrates a
Dynamic Workflow, and a subagent cannot spawn one.

## When to use / not use

- **Use** when the goal is to *read* a PDF as HTML/web page, to convert a PDF to
  a styled HTML document, or to translate a PDF into another language while
  keeping its figures and tables.
- **doc-to-markdown** instead if they want plain Markdown text (no styling, figures optional).
- **pdf-creator** instead for the reverse direction (Markdown → PDF).

## What it does NOT do

- **Scanned/image-only PDFs** (no text layer): OCR first (e.g. `ocrmypdf`), then use this.
- **Complex multi-column tables**: cell *text* is preserved and readable, but column
  alignment can flatten into a text flow — PyMuPDF reads a table as text blocks, not a
  grid, so the grid lines are gone. Tables that are *images* in the PDF survive as
  images. If the table's grid structure is essential, use **doc-to-markdown** (pandoc
  rebuilds real tables) or convert that page separately.
- **Pixel-perfect facsimile**: output is a clean *re-flow* that keeps images and
  reading order, not a 1:1 copy of the original page layout.
- **Rewriting**: it translates and re-lays-out; it does not summarize, add a TL;DR,
  or editorialize. Faithfulness is the point (see Fidelity below).

## Dependencies

`uv` (runs Python with inline deps), Google Chrome or Chromium (visual
verification). Python packages come via `uv run --with`: PyMuPDF, Pillow, numpy.
Nothing to pre-install beyond Chrome and uv.

## Workflow

Copy this checklist and tick as you go:

```
- [ ] 1. Extract structure + render pages   (extract_pdf.py)
- [ ] 2. Read pages/*.png — SEE the layout, find content vs decorative images
- [ ] 3. (only if translating) run the translation workflow
- [ ] 4. Build the single-file HTML          (build_html.py)
- [ ] 5. Verify visually                      (verify_render.py → Read every segment)
- [ ] 6. Deliver the .html
```

### 1. Extract

```bash
uv run --with pymupdf python scripts/extract_pdf.py input.pdf
```

Writes `input-build/` with `structure.json` (text blocks with font sizes + image
blocks flagged `decorative`), `images/`, and `pages/` (one PNG per page).

### 2. Look before you build

Read `input-build/pages/*.png`. This is not optional: you need to see the real
layout, confirm which images are content vs decoration, and spot tables/charts.
For a long PDF, read every page; for a short one it's quick. This is also where
you understand the document well enough to translate it well.

### 3. Translate (optional)

Only if the user asked for another language. Read
[references/translation_workflow.md](references/translation_workflow.md) and
follow it: a Dynamic Workflow translates pages in parallel, captions data charts,
and reconciles terminology. It produces two overlay files (`units.json`,
`caps.json`) that step 4 consumes. **Do not** hand-translate inline for anything
longer than a page — the workflow keeps terminology consistent and is far faster.

### 4. Build

```bash
# original-language HTML
uv run --with Pillow python scripts/build_html.py input-build/structure.json --out output.html

# translated HTML (overlays from step 3)
uv run --with Pillow python scripts/build_html.py input-build/structure.json --out output.html \
    --translation input-build/units.json --captions input-build/caps.json --lang zh-CN
```

`build_html.py` is **data-driven**: it infers heading levels from font size (most
common size = body; larger steps up to h3/h2/h1), drops decorative images, and
inlines content images as compressed base64 → one portable file. It is not
hand-tuned to any document. If a particular PDF has an unusual structure (e.g.
multi-column, sidebars, a figure the size heuristic misreads), read the script and
adjust — it's short and meant to be edited per document.

### 5. Verify visually (mandatory)

```bash
uv run --with Pillow --with numpy python scripts/verify_render.py output.html
```

Then **Read every `seg-*.png`** and check: fonts render (no tofu boxes), no
clipped tables/figures, headings/lists look right, all expected images present.
Text being correct does not mean the render is correct (failure_cases #7). Fix and
re-verify until it's clean.

A quick structural cross-check is fine too, but count occurrences correctly:
`grep -o '<figure>' output.html | wc -l` — **not** `grep -c` (failure_cases #1).

### 6. Deliver

Hand over the single `.html`. It's self-contained (images inlined), so it opens
with a double-click and nothing can go missing.

## Scripts

| Script | Run with | Purpose |
|--------|----------|---------|
| `scripts/extract_pdf.py` | `uv run --with pymupdf` | PDF → structure.json + images/ + page renders |
| `scripts/build_html.py` | `uv run --with Pillow` | structure.json (+ optional translation/captions) → single-file HTML |
| `scripts/verify_render.py` | `uv run --with Pillow --with numpy` | headless-Chrome render → readable PNG segments |

## Fidelity (read before translating)

The deliverable looks authoritative, so wrong content is worse than ugly content.
The non-negotiable rules — and the specific ways this has gone wrong before — are
in [references/failure_cases.md](references/failure_cases.md). The one that bites
hardest: **never give a real person an inferred translated name, and copy every
number/proper-noun verbatim** (failure_cases #6). Read that file before any
translation run; skim it before any run.

## Next Step

After producing the HTML, suggest the natural follow-up:

```
Conversion complete: output.html (single self-contained file).

Options:
A) Make a PDF of it — run /daymade-docs:pdf-creator if you want a print/share copy (Recommended if they need to send it)
B) Extract the text as Markdown instead — run /daymade-docs:doc-to-markdown (if they wanted editable text, not a reading page)
C) No thanks — the HTML is what I wanted
```
README.md

What This Does

Convert a PDF into one self-contained, readable HTML file that preserves images, tables, charts, and reading order — optionally translating it while keeping every figure intact.

What's Inside

The template covers:

  • When to use / not use
  • What it does NOT do
  • Dependencies
  • Workflow
  • Scripts
  • Fidelity (read before translating)

Quick Start

Step 1: Create a Project Folder

Make a dedicated folder for this workflow and open it in Claude Code.

Step 2: Download the Template

Click Download above to save the template, then drop it into your project as CLAUDE.md (or paste it into your existing one).

Step 3: Start Working

Tell Claude what you need in plain language — it will follow the template's workflow automatically. For example:

Make this PDF readable on my phone, in English

Claude reads the template and runs the steps for you.

$Related Playbooks