Skill v1.0.1 (current)
Automated scan: 100/100
freekmurze/dotfiles/pdf
9 files
Details
Published: May 17, 2026 at 01:59 PM
Content Hash: sha256:38d8559d4899602f...
Git SHA: 3974caaa4459
Bump Type: patch
SKILL.md · 296 lines · 6.9 KB
version: "1.0.1"
name: pdf
description: Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. Use when Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
license: Proprietary. LICENSE.txt has complete terms
PDF Processing Guide
Overview
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.
Quick Start
```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```
Python Libraries
pypdf - Basic Operations
Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```
Split PDF
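```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    # One writer per page, so each page lands in its own file
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```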
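One file per page can produce a lot of output for long documents. The following is a minimal sketch of splitting into fixed-size chunks instead, assuming pypdf is installed. The helper names `chunk_ranges` and `split_pdf_in_chunks` and the output naming scheme are illustrative choices, not part of pypdf.

```python
from typing import List, Tuple


def chunk_ranges(total_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return zero-based, half-open (start, end) ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf_in_chunks(input_path: str, chunk_size: int, prefix: str = "chunk") -> List[str]:
    """Write each chunk to its own file and return the file names (hypothetical helper)."""
    # Imported here so chunk_ranges stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    outputs = []
    for start, end in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # File names use one-based page numbers, e.g. chunk_1-5.pdf
        name = f"{prefix}_{start + 1}-{end}.pdf"
        with open(name, "wb") as output:
            writer.write(output)
        outputs.append(name)
    return outputs
```

Keeping the range arithmetic in its own function makes the page boundaries easy to check before any files are written.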
Extract Metadata
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```
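`reader.metadata` can be `None` when a PDF carries no document information, and individual fields can be `None` too, so direct attribute access may fail or print `None`. Here is a small defensive sketch. `metadata_summary` and the `"(not set)"` placeholder are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Optional

FIELDS = ("title", "author", "subject", "creator")


def metadata_summary(meta: Optional[Any], default: str = "(not set)") -> Dict[str, str]:
    """Map each metadata field to its value, or to default when it is missing or empty."""
    return {
        # getattr on None (no metadata at all) simply yields the fallback
        field: (getattr(meta, field, None) or default)
        for field in FIELDS
    }
```

With pypdf, this could be called as `metadata_summary(PdfReader("document.pdf").metadata)`.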
Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```
pdfplumber - Text and Table Extraction
Extract Text with Layout
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```
Extract Tables
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```
Advanced Table Extraction
```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                # First row is treated as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    # Writing .xlsx requires an Excel engine such as openpyxl
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```
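The snippet above uses the first row of each table as column names. pdfplumber can return `None` for empty cells, and header text can repeat or contain line breaks, which makes for awkward DataFrame columns. Below is a sketch of one way to normalize the header row first. `clean_header` is a hypothetical helper and the `col_N` fallback naming is an arbitrary choice.

```python
from typing import List, Optional


def clean_header(header: List[Optional[str]]) -> List[str]:
    """Return unique, non-empty column names for a raw header row."""
    used = set()
    cleaned = []
    for index, cell in enumerate(header):
        # Collapse whitespace, including newlines inside a cell
        base = " ".join(cell.split()) if cell else ""
        if not base:
            base = f"col_{index + 1}"
        # Add a numeric suffix until the name no longer collides
        name, suffix = base, 1
        while name in used:
            suffix += 1
            name = f"{base}_{suffix}"
        used.add(name)
        cleaned.append(name)
    return cleaned
```

In the extraction loop this would slot in as `pd.DataFrame(table[1:], columns=clean_header(table[0]))`.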
reportlab - Create PDFs
Basic PDF Creation
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```
Create PDF with Multiple Pages
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)
```
Command-Line Tools
pdftotext (poppler-utils)
```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```
qpdf
```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```
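When the qpdf merge shown above needs to run over many file sets, it can be driven from Python. The sketch below builds the same argument list and runs it with `subprocess`. The names `build_merge_command` and `merge_with_qpdf` are hypothetical, and qpdf is assumed to be available on `PATH`.

```python
import subprocess
from typing import List, Sequence


def build_merge_command(inputs: Sequence[str], output: str) -> List[str]:
    """Build the argv for: qpdf --empty --pages <inputs...> -- <output>."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def merge_with_qpdf(inputs: Sequence[str], output: str) -> None:
    """Run the merge; raises CalledProcessError on any non-zero exit status."""
    subprocess.run(build_merge_command(inputs, output), check=True)
```

Passing the arguments as a list rather than a shell string means file names with spaces need no extra quoting.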
pdftk (if available)
```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```
Common Tasks
Extract Text from Scanned PDFs
```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```
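OCR is slow, so it can help to check first which pages actually lack a text layer. A rough sketch: extract text the normal way, then flag pages whose text is empty or nearly so. `pages_needing_ocr` and the 20-character threshold are assumptions for illustration, not a feature of pypdf or pytesseract.

```python
from typing import Iterable, List, Optional


def pages_needing_ocr(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> List[int]:
    """Return zero-based indices of pages with fewer than min_chars non-whitespace characters."""
    flagged = []
    for index, text in enumerate(page_texts):
        # Strip all whitespace so blank lines and spaces do not count as content
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            flagged.append(index)
    return flagged
```

With pypdf, the input could be `[page.extract_text() for page in PdfReader("scanned.pdf").pages]`, and only the flagged pages would then go through `convert_from_path` and `pytesseract`.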
Add Watermark
```python
from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```
Extract Images
```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# Writes output_prefix-000.jpg, output_prefix-001.jpg, etc.
# -j saves JPEG-encoded images as .jpg; other images are written as .ppm or .pbm
```
Password Protection
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```
Quick Reference
| Task | Best Tool | Command/Code |
|---|---|---|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see forms.md) | See forms.md |
Next Steps
- For advanced pypdfium2 usage, see reference.md
- For JavaScript libraries (pdf-lib), see reference.md
- If you need to fill out a PDF form, follow the instructions in forms.md
- For troubleshooting guides, see reference.md