Saturday, November 8, 2025

PDF → DOCX using Python — a practical guide

 


Converting PDF files to editable DOCX format is a common task: you might need to repurpose content, edit a report, or prepare material for collaborators who prefer Word. PDFs are a display format (layout-focused) while DOCX is a flow/document model (reflowable text and editable paragraphs). That mismatch means conversion is rarely perfect, but with Python you can automate reliable, repeatable conversions for many real-world PDFs. This article explains the typical approaches, lists useful libraries, shows a practical code recipe, and highlights common pitfalls and tips.

Two main approaches

  1. Text extraction and reflow — extract textual content (and images) from the PDF and create a DOCX document from that content. This is best for digital PDFs that contain real text (not scanned images). You’ll get editable text, but exact layout, fonts, and complex multi-column/multi-level formatting may not be preserved.

  2. OCR (optical character recognition) — when the PDF pages are images (scanned), you must run OCR to convert images into text. OCR introduces recognition errors and requires quality scans, but it makes scanned documents editable.

Most robust solutions combine both: try text extraction first, and fall back to OCR for pages with no extractable text.
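
As a rough illustration of that combined strategy, here is a minimal sketch. The helper names (needs_ocr, extract_page_texts) and the 20-character threshold are assumptions made for this example, not part of any library; tune the threshold for your own documents.

```python
def needs_ocr(text, min_chars=20):
    """Return True when a page's extracted text is too sparse to trust."""
    return len((text or "").strip()) < min_chars


def extract_page_texts(pdf_path, dpi=300, lang="eng"):
    """Yield one text string per page, running OCR only where extraction fails."""
    # Third-party imports live inside the function so that needs_ocr()
    # can be reused without the PDF/OCR stack installed.
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if needs_ocr(text):
                # Render just this page (pdf2image page numbers are 1-based).
                images = convert_from_path(
                    pdf_path, dpi=dpi, first_page=number, last_page=number
                )
                text = pytesseract.image_to_string(images[0], lang=lang)
            yield text or ""
```

Checking each page separately means a mostly digital PDF with a few scanned inserts only pays the OCR cost for those pages.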

Useful Python libraries

  • pdfplumber — excellent for extracting text, words, and layout information. Works well for many PDFs.
  • PyMuPDF (fitz) — fast, extracts text and images; gives bounding boxes for text blocks.
  • pdfminer.six — powerful low-level text extraction (more complex API).
  • python-docx — create and write DOCX files; supports paragraphs, runs, basic styling, and adding images.
  • pytesseract (with Tesseract OCR engine) — OCR for scanned images. Requires Tesseract installed on the system.
  • Pillow (PIL) — image manipulation (useful with OCR and image extraction).
  • pdf2image — convert PDF pages to images for OCR if needed.

You can install the typical stack with:

pip install pdfplumber python-docx pytesseract pdf2image pillow
# system requirement: install tesseract-ocr in your OS (apt/brew/choco)

Practical recipe (digital PDF → DOCX)

Below is a pragmatic script that:

  • extracts text page-by-page with pdfplumber,
  • creates paragraphs with python-docx,
  • renders each embedded image's region and embeds it in the DOCX.

import pdfplumber
from docx import Document
from docx.shared import Inches
from io import BytesIO

RESOLUTION = 150  # DPI used when rendering a page to crop image regions


def pdf_to_docx(pdf_path, docx_path):
    doc = Document()
    with pdfplumber.open(pdf_path) as pdf:
        for index, page in enumerate(pdf.pages):
            # Page break between PDF pages (not after the last one)
            if index > 0:
                doc.add_page_break()

            # Extract text (preserve simple newlines)
            text = page.extract_text()
            if text:
                # Split by double-newline to make paragraphs. extract_text()
                # usually separates lines with single newlines, so this often
                # yields one paragraph per page.
                for para in text.split('\n\n'):
                    if para.strip():
                        doc.add_paragraph(para.strip())
            else:
                # If no text, add a placeholder line (you might OCR here)
                doc.add_paragraph("[No extractable text on this page - consider OCR]")

            # Extract images and add them
            if page.images:
                # pdfplumber's images are location references, so render the
                # page once and crop each image region out of the rendering
                rendered = page.to_image(resolution=RESOLUTION).original
                scale = RESOLUTION / 72  # PDF points -> rendered pixels
                page_x0, page_top = page.bbox[0], page.bbox[1]
                for img_dict in page.images:
                    box = (
                        int((img_dict['x0'] - page_x0) * scale),
                        int((img_dict['top'] - page_top) * scale),
                        int((img_dict['x1'] - page_x0) * scale),
                        int((img_dict['bottom'] - page_top) * scale),
                    )
                    if box[2] <= box[0] or box[3] <= box[1]:
                        continue  # skip degenerate regions
                    bio = BytesIO()
                    rendered.crop(box).save(bio, format='PNG')
                    bio.seek(0)
                    doc.add_picture(bio, width=Inches(4))

    doc.save(docx_path)

# usage
pdf_to_docx("input.pdf", "output.docx")

This approach works well for many documents where text is extractable. It creates simple paragraphs; complex headings, tables, and multi-column layouts generally need additional logic to detect and reconstruct.
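
One piece of that additional logic is paragraph detection, since extracted text tends to arrive as one line per visual line. The sketch below is a simple heuristic, not a definitive method: it assumes a paragraph ends at a blank line, or at a line that ends a sentence and is clearly shorter than the longest line on the page. The function name and the 0.75 ratio are illustrative choices.

```python
def lines_to_paragraphs(text, short_ratio=0.75):
    """Group extracted lines into paragraphs with a simple length heuristic."""
    lines = [ln.strip() for ln in text.splitlines()]
    longest = max((len(ln) for ln in lines), default=0)
    paragraphs, current = [], []

    def flush():
        # Join the collected lines into one paragraph and start a new one.
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for line in lines:
        if not line:
            flush()  # blank line: explicit paragraph break
            continue
        current.append(line)
        ends_sentence = line.endswith((".", "!", "?", ":"))
        is_short = len(line) < short_ratio * longest
        if ends_sentence and is_short:
            flush()  # short line that ends a sentence: likely paragraph end
    flush()
    return paragraphs
```

Each returned string can be passed to doc.add_paragraph(). Headings without punctuation will be merged into the following paragraph, so this is a starting point rather than a complete solution.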

Adding OCR for scanned PDFs

For scanned pages, convert the page to an image and use pytesseract:

from docx import Document
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path('scanned.pdf', dpi=300)
doc = Document()
for p_img in pages:
    text = pytesseract.image_to_string(p_img, lang='eng')
    doc.add_paragraph(text.strip() or "[OCR returned no text]")
    doc.add_page_break()
doc.save('scanned_output.docx')

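Be sure to install Tesseract separately (for example, sudo apt install tesseract-ocr on Debian/Ubuntu), plus any language packs you need. pdf2image also depends on the Poppler utilities being installed on the system (poppler-utils on Debian/Ubuntu).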

Handling tables and complex layout

  • Tables: neither pdfplumber nor pdfminer will magically create Word tables; you must detect table structures (by lines or consistent column x-coordinates) and use python-docx’s table API to recreate them.
  • Multi-column: detect text x-positions and reorder reading flow before writing paragraphs.
  • Fonts and styles: python-docx can set fonts and basic styles, but matching original fonts exactly is hard.
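
The first two points can be sketched as small helpers. Everything here is illustrative: the function names are made up for this example, reading_order assumes equal-width columns, and the 3-point line tolerance is a guess. The row data could come from pdfplumber's page.extract_tables() and the word dicts from page.extract_words(), both of which return plain Python lists.

```python
def normalize_rows(rows):
    """Pad ragged rows to equal width and replace None cells with ''."""
    width = max((len(row) for row in rows), default=0)
    return [
        [("" if cell is None else str(cell)) for cell in row]
        + [""] * (width - len(row))
        for row in rows
    ]


def add_table(doc, rows):
    """Recreate a grid of cell values as a Word table on a python-docx Document."""
    rows = normalize_rows(rows)
    if not rows or not rows[0]:
        return None
    table = doc.add_table(rows=len(rows), cols=len(rows[0]))
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            table.cell(r, c).text = value
    return table


def reading_order(words, page_width, columns=2, line_tol=3):
    """Sort word dicts column by column, then by line, then left to right."""
    column_width = page_width / columns

    def key(word):
        # Assign the word to a column by its left edge, then bucket its
        # vertical position so words on the same line sort together.
        column = min(int(word["x0"] // column_width), columns - 1)
        return (column, round(word["top"] / line_tol), word["x0"])

    return sorted(words, key=key)
```

Real pages often have uneven columns or full-width headings, so in practice you would detect column boundaries from the data instead of dividing the page evenly.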

Tips and pitfalls

  • Accuracy vs. fidelity: conversion that preserves exact visual layout is best done with dedicated tools (Adobe Acrobat, LibreOffice headless export), while Python extraction excels at making content editable.
  • Performance: large PDFs and OCR are slow and memory-intensive. Process page-by-page and consider batching.
  • Images and embedded fonts: some PDFs embed text as shapes or outlines — that text won’t be extractable and will need OCR.
  • Legal / privacy: ensure you have the right to convert and store document contents.
  • Testing: test on a representative sample of your PDFs (reports, receipts, scanned documents) and tune extraction heuristics.
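
For the performance point, one way to keep memory bounded is to render and OCR a few pages at a time. This is a sketch under assumptions: the helper names and the batch size of 10 are arbitrary, and it relies on pdf2image's first_page/last_page arguments and its pdfinfo_from_path helper to get the page count.

```python
def page_batches(total_pages, batch_size=10):
    """Yield inclusive 1-based (first, last) page ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_in_batches(pdf_path, batch_size=10, dpi=300, lang="eng"):
    """Yield OCR text page by page while holding only one batch of images."""
    # Imported here so page_batches() works without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    total = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total, batch_size):
        images = convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for image in images:
            yield pytesseract.image_to_string(image, lang=lang)
```

Because ocr_in_batches is a generator, the caller can write each page to the DOCX as it arrives instead of collecting the whole document's text first.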

When to choose external tools

If you need near-perfect layout preservation (exact page look, headers/footers, positioning), consider:

  • Export via LibreOffice in headless mode: libreoffice --headless --convert-to docx file.pdf (can be invoked from Python subprocess).
  • Commercial APIs and desktop tools often give better visual fidelity but may cost money and require data transfer.
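
Invoking LibreOffice from Python might look like the sketch below. The function names and the 120-second timeout are assumptions for illustration. The binary may be called soffice on some systems, and some installations need an explicit PDF import filter, so check the command against your own setup before relying on it.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir, binary="libreoffice"):
    """Build the argument list for a headless LibreOffice PDF -> DOCX export."""
    return [
        binary, "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(pdf_path),
    ]


def convert_with_libreoffice(pdf_path, out_dir=".", timeout=120):
    """Run the conversion and return the path where the DOCX is expected."""
    command = build_convert_command(pdf_path, out_dir)
    # check=True raises CalledProcessError if LibreOffice exits with an error.
    subprocess.run(command, check=True, timeout=timeout)
    return Path(out_dir) / (Path(pdf_path).stem + ".docx")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.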

Conclusion

Converting PDF to DOCX with Python is achievable and flexible. The basic flow is: extract text/images (with pdfplumber / PyMuPDF), reconstruct content in python-docx, and add OCR for scanned pages (pytesseract + pdf2image). Expect to invest effort in handling tables, complex layouts, and scanned-image quality. For many automation tasks—batch conversions, text reflow, or content repurposing—a Python-based pipeline is efficient and customizable.
