
PDF to Readable HTML

Name: PDF to Readable HTML
Author: daymade

Convert a PDF into one self-contained, readable HTML file that preserves images, tables, charts, and reading order — optionally translating it while keeping every figure intact.

10 minutes


#pdf #html #reading #translation #conversion

You just want to read a dense PDF comfortably on your phone, or translate it without losing the charts. This playbook turns it into one portable, image-faithful HTML page.

Who it's for: readers, researchers, students, knowledge workers

Example

"Make this PDF readable on my phone, in English" → One self-contained HTML file with images, tables, and charts preserved, translated and visually verified

CLAUDE.md Template


# PDF to HTML

Turn a PDF into a single, self-contained, readable HTML file — images, tables,
charts and reading order preserved — and optionally translate it, keeping every
figure in place.

The pipeline is **extract → look → (translate) → build → verify**. The middle
"look" and final "verify" steps are where faithfulness actually comes from: a PDF
is a layout, not just a text stream, so you read the rendered pages before
building and the rendered HTML before delivering.

This skill runs **inline** (no `context: fork`): translation orchestrates a
Dynamic Workflow, and a subagent cannot spawn one.

## When to use / not use

- **Use** when the goal is to *read* a PDF as HTML/web page, to convert a PDF to
  a styled HTML document, or to translate a PDF into another language while
  keeping its figures and tables.
- Use **doc-to-markdown** instead if the user wants plain Markdown text (no styling, figures optional).
- Use **pdf-creator** instead for the reverse direction (Markdown → PDF).

## What it does NOT do

- **Scanned/image-only PDFs** (no text layer): OCR first (e.g. `ocrmypdf`), then use this.
- **Complex multi-column tables**: cell *text* is preserved and readable, but column
  alignment can flatten into a text flow — PyMuPDF reads a table as text blocks, not a
  grid, so the grid lines are gone. Tables that are *images* in the PDF survive as
  images. If the table's grid structure is essential, use **doc-to-markdown** (pandoc
  rebuilds real tables) or convert that page separately.
- **Pixel-perfect facsimile**: output is a clean *re-flow* that keeps images and
  reading order, not a 1:1 copy of the original page layout.
- **Rewriting**: it translates and re-lays-out; it does not summarize, add a TL;DR,
  or editorialize. Faithfulness is the point (see Fidelity below).

## Dependencies

`uv` (runs Python with inline deps), Google Chrome or Chromium (visual
verification). Python packages come via `uv run --with`: PyMuPDF, Pillow, numpy.
Nothing to pre-install beyond Chrome and uv.

## Workflow

Copy this checklist and tick as you go:

```
- [ ] 1. Extract structure + render pages   (extract_pdf.py)
- [ ] 2. Read pages/*.png — SEE the layout, find content vs decorative images
- [ ] 3. (only if translating) run the translation workflow
- [ ] 4. Build the single-file HTML          (build_html.py)
- [ ] 5. Verify visually                      (verify_render.py → Read every segment)
- [ ] 6. Deliver the .html
```

### 1. Extract

```bash
uv run --with pymupdf python scripts/extract_pdf.py input.pdf
```

Writes `input-build/` with `structure.json` (text blocks with font sizes + image
blocks flagged `decorative`), `images/`, and `pages/` (one PNG per page).

### 2. Look before you build

Read `input-build/pages/*.png`. This is not optional: you need to see the real
layout, confirm which images are content vs decoration, and spot tables/charts.
Read every page even when the PDF is long; for a short one it's quick. This is also where
you understand the document well enough to translate it well.

### 3. Translate (optional)

Only if the user asked for another language. Read
[references/translation_workflow.md](references/translation_workflow.md) and
follow it: a Dynamic Workflow translates pages in parallel, captions data charts,
and reconciles terminology. It produces two overlay files (`units.json`,
`caps.json`) that step 4 consumes. **Do not** hand-translate inline for anything
longer than a page — the workflow keeps terminology consistent and is far faster.

### 4. Build

```bash
# original-language HTML
uv run --with Pillow python scripts/build_html.py input-build/structure.json --out output.html

# translated HTML (overlays from step 3)
uv run --with Pillow python scripts/build_html.py input-build/structure.json --out output.html \
    --translation input-build/units.json --captions input-build/caps.json --lang zh-CN
```

`build_html.py` is **data-driven**: it infers heading levels from font size (most
common size = body; larger steps up to h3/h2/h1), drops decorative images, and
inlines content images as compressed base64 → one portable file. It is not
hand-tuned to any document. If a particular PDF has an unusual structure (e.g.
multi-column, sidebars, a figure the size heuristic misreads), read the script and
adjust — it's short and meant to be edited per document.
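
The font-size heuristic described above can be sketched as follows. This is a minimal illustration, not the code in `build_html.py`: the function name `infer_heading_levels`, the rounding to one decimal, and the choice to give the largest size `h1` are all assumptions made for the example.

```python
from collections import Counter


def infer_heading_levels(font_sizes, max_levels=3):
    """Map each distinct font size to 'p', 'h1', 'h2' or 'h3'.

    Assumptions (not taken from build_html.py): sizes are rounded to one
    decimal, the most common size is body text, and larger sizes are ranked
    from largest (h1) downward, capped at h{max_levels}.
    """
    rounded = [round(size, 1) for size in font_sizes]
    if not rounded:
        return {}
    # The most frequent size is treated as body text.
    body_size, _ = Counter(rounded).most_common(1)[0]
    # Distinct sizes above body, largest first.
    larger = sorted({s for s in rounded if s > body_size}, reverse=True)
    levels = {}
    for size in sorted(set(rounded)):
        if size <= body_size:
            levels[size] = "p"
        else:
            rank = larger.index(size) + 1
            levels[size] = f"h{min(rank, max_levels)}"
    return levels


# Hypothetical sizes: one value per text block.
sizes = [10.0, 10.0, 10.0, 10.0, 14.0, 18.0, 24.0, 8.0]
print(infer_heading_levels(sizes))
# {8.0: 'p', 10.0: 'p', 14.0: 'h3', 18.0: 'h2', 24.0: 'h1'}
```

A document whose pull quotes or sidebars use a large font would be misread by a rule like this, which is the kind of case where editing the script per document makes sense.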

### 5. Verify visually (mandatory)

```bash
uv run --with Pillow --with numpy python scripts/verify_render.py output.html
```

Then **Read every `seg-*.png`** and check: fonts render (no tofu boxes), no
clipped tables/figures, headings/lists look right, all expected images present.
Text being correct does not mean the render is correct (failure_cases #7). Fix and
re-verify until it's clean.
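
A long page is easier to inspect as several screen-sized slices than as one very tall screenshot. The sketch below shows one way the crop boxes for such slices could be computed. It is an illustration only: `segment_boxes`, the pixel values, and the overlap parameter are hypothetical and do not describe how `verify_render.py` works internally.

```python
def segment_boxes(total_height, segment_height, overlap=0):
    """Return (top, bottom) pixel ranges that cover a tall render.

    Hypothetical helper: consecutive segments share `overlap` pixels so a
    line of text cut at one boundary is fully visible in the next segment.
    """
    if segment_height <= overlap:
        raise ValueError("segment_height must be larger than overlap")
    boxes = []
    top = 0
    while top < total_height:
        bottom = min(top + segment_height, total_height)
        boxes.append((top, bottom))
        if bottom == total_height:
            break
        # Step back by the overlap before starting the next segment.
        top = bottom - overlap
    return boxes


print(segment_boxes(5000, 2000, overlap=100))
# [(0, 2000), (1900, 3900), (3800, 5000)]
```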

A quick structural cross-check is fine too, but count occurrences correctly:
`grep -o '<figure>' output.html | wc -l` — **not** `grep -c` (failure_cases #1).
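
The difference matters because `grep -c` counts matching lines, while `grep -o ... | wc -l` counts individual matches. Generated HTML often puts several figures on one line, so the two numbers diverge. A small Python sketch of the same distinction (the function names and the sample HTML are made up for the example):

```python
def count_lines_containing(text, needle):
    """What `grep -c` reports: lines with at least one match."""
    return sum(1 for line in text.splitlines() if needle in line)


def count_occurrences(text, needle):
    """What `grep -o needle | wc -l` reports: every non-overlapping match."""
    return text.count(needle)


# Two figures share the first line, one sits on the second.
html = (
    "<figure>a</figure><figure>b</figure>\n"
    "<figure>c</figure>\n"
    "<p>no figure here</p>\n"
)
print(count_lines_containing(html, "<figure>"))  # 2
print(count_occurrences(html, "<figure>"))       # 3
```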

### 6. Deliver

Hand over the single `.html`. It's self-contained (images inlined), so it opens
with a double-click and nothing can go missing.
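
The "images inlined" property rests on embedding each image as a base64 data URI instead of linking to a file. A minimal sketch of that idea, assuming PNG input and leaving out the compression that `build_html.py` applies first; `to_data_uri` and `figure_html` are hypothetical names, not functions from the script.

```python
import base64
from html import escape


def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a data: URI usable in an <img src>."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def figure_html(image_bytes, alt_text, mime_type="image/png"):
    """Wrap an inlined image in a <figure>; the alt text is HTML-escaped."""
    src = to_data_uri(image_bytes, mime_type)
    return f'<figure><img src="{src}" alt="{escape(alt_text)}"></figure>'


# Hypothetical input: only the 8-byte PNG signature, enough to show the shape.
png_signature = b"\x89PNG\r\n\x1a\n"
print(figure_html(png_signature, 'Chart "A"'))
```

Base64 text is roughly a third larger than the bytes it encodes, which is one reason to compress images before inlining them.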

## Scripts

| Script | Run with | Purpose |
|--------|----------|---------|
| `scripts/extract_pdf.py` | `uv run --with pymupdf` | PDF → structure.json + images/ + page renders |
| `scripts/build_html.py` | `uv run --with Pillow` | structure.json (+ optional translation/captions) → single-file HTML |
| `scripts/verify_render.py` | `uv run --with Pillow --with numpy` | headless-Chrome render → readable PNG segments |

## Fidelity (read before translating)

The deliverable looks authoritative, so wrong content is worse than ugly content.
The non-negotiable rules — and the specific ways this has gone wrong before — are
in [references/failure_cases.md](references/failure_cases.md). The one that bites
hardest: **never give a real person an inferred translated name, and copy every
number/proper-noun verbatim** (failure_cases #6). Read that file in full before any
translation run; skim it before any other run.
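
The "copy every number verbatim" rule lends itself to a mechanical spot check. The sketch below is not part of the skill's scripts: `missing_numbers`, the regular expression, and the sample sentences are assumptions for illustration. It flags numbers present in the source text but absent, or reformatted, in the translation.

```python
import re
from collections import Counter

# Digits with optional . or , groups and an optional trailing percent sign.
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*%?")


def extract_numbers(text):
    """Count every number-like token exactly as written."""
    return Counter(NUMBER_PATTERN.findall(text))


def missing_numbers(source_text, translated_text):
    """Numbers that occur more often in the source than in the translation."""
    missing = extract_numbers(source_text) - extract_numbers(translated_text)
    return sorted(missing.elements())


source = "Revenue grew 12.5% to 3,400 units in 2023."
translated = "En 2023, los ingresos crecieron 12.5% hasta 3400 unidades."
print(missing_numbers(source, translated))
# ['3,400']
```

The check compares strings, not values, so `3400` does not satisfy `3,400`. That strictness matches the verbatim rule, but a flagged number still needs a human look, since some target languages legitimately format numbers differently.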

## Next Step

After producing the HTML, suggest the natural follow-up:

```
Conversion complete: output.html (single self-contained file).

Options:
A) Make a PDF of it — run /daymade-docs:pdf-creator if you want a print/share copy (Recommended if they need to send it)
B) Extract the text as Markdown instead — run /daymade-docs:doc-to-markdown (if they wanted editable text, not a reading page)
C) No thanks — the HTML is what I wanted
```

README.md

What This Does

Convert a PDF into one self-contained, readable HTML file that preserves images, tables, charts, and reading order — optionally translating it while keeping every figure intact.

What's Inside

The template covers:

When to use / not use
What it does NOT do
Dependencies
Workflow
Scripts
Fidelity (read before translating)

Quick Start

Step 1: Create a Project Folder

Make a dedicated folder for this workflow and open it in Claude Code.

Step 2: Download the Template

Click Download above to save the template, then drop it into your project as CLAUDE.md (or paste it into your existing one).

Step 3: Start Working

Tell Claude what you need in plain language — it will follow the template's workflow automatically. For example:

Make this PDF readable on my phone, in English

Claude reads the template and runs the steps for you.
