---
version: "1.0.0"
name: glmocr-sdk
description: |
  Trigger when: (1) User wants to extract text, tables, formulas, or structured data from images/PDFs/scanned documents, (2) User mentions "OCR", "文字识别", "文档解析", (3) User has a document (screenshot, scanned page, invoice, paper, whiteboard photo) and needs its content in structured form, (4) User asks to parse, digitize, or extract content from a visual document.
  Invokes the GLM-OCR SDK (pip install glmocr) to parse documents via Zhipu's cloud API. No GPU required. Returns structured JSON (regions with labels + bounding boxes) and Markdown. Agent can operate entirely via CLI; no YAML files needed.
  NOT for: real-time camera feeds, audio transcription, or non-document images (photos, illustrations).
metadata:
  openclaw:
    requires:
      env:
        - ZHIPU_API_KEY
    primaryEnv: ZHIPU_API_KEY
    emoji: "📄"
    homepage: https://github.com/zai-org/GLM-OCR/tree/main/skills/sdk
---
OpenClaw Skill: glmocr
Parses documents (images, PDFs, scans) via the GLM-OCR SDK.
📌 On-demand: This skill requires only `ZHIPU_API_KEY` in the environment. No YAML config files or GPU needed.
⚡ Quick Start
    # Install
    pip install glmocr

    # Set API key (once)
    export ZHIPU_API_KEY=sk-xxx

    # or add to .env file in working directory:
    echo "ZHIPU_API_KEY=sk-xxx" >> .env

    # One-liner
    import glmocr
    result = glmocr.parse("document.pdf")
    print(result.markdown_result)
    print(result.to_dict())

    # CLI: pass API key directly (no env setup needed)
    glmocr parse image.png --api-key sk-xxx

    # Or load from a specific .env file
    glmocr parse image.png --env-file /path/to/.env

    # Or rely on env var / auto-discovered .env (set once, then omit)
    glmocr parse image.png
    glmocr parse ./scans/ --output ./output/ --stdout
Configuration Priority
Constructor kwargs > os.environ > .env file > config.yaml > built-in defaults
Agents override everything via constructor kwargs or env vars — no YAML editing needed.
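
The precedence order above can be pictured as a layered lookup. The following is a minimal sketch using `collections.ChainMap`; it only illustrates the ordering and is not the SDK's actual implementation. The function name `resolve_settings` and all sample values are hypothetical.

```python
from collections import ChainMap


def resolve_settings(kwargs, environ, dotenv, yaml_cfg, defaults):
    """Merge config layers so that earlier layers shadow later ones.

    Layer order mirrors the documented priority:
    constructor kwargs > os.environ > .env file > config.yaml > defaults.
    """
    # Drop None values so an unset kwarg does not mask a lower layer.
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (kwargs, environ, dotenv, yaml_cfg, defaults)
    ]
    return dict(ChainMap(*layers))


# Hypothetical layer contents, for illustration only.
settings = resolve_settings(
    kwargs={"timeout": 1200, "model": None},
    environ={"log_level": "DEBUG"},
    dotenv={"timeout": 300, "log_level": "INFO"},
    yaml_cfg={"model": "glm-ocr"},
    defaults={"timeout": 600, "log_level": "INFO", "enable_layout": True},
)
print(settings["timeout"])    # 1200, taken from kwargs
print(settings["log_level"])  # DEBUG, taken from the environment layer
```

In this sketch a value set in a higher layer always wins, which is why an agent can leave `config.yaml` untouched and still control every setting.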
Key Environment Variables
| Variable | Description | Example |
|---|---|---|
| `ZHIPU_API_KEY` | API key (required for MaaS) | `sk-abc123` |
| `GLMOCR_MODEL` | Model name | `glm-ocr` |
| `GLMOCR_TIMEOUT` | Request timeout (seconds) | `600` |
| `GLMOCR_ENABLE_LAYOUT` | Layout detection on/off | `true` |
| `GLMOCR_LOG_LEVEL` | DEBUG / INFO / WARNING / ERROR | `INFO` |
Python API
Convenience function (single call)
    import glmocr

    # Single file → PipelineResult
    result = glmocr.parse("invoice.png")

    # Multiple files → list[PipelineResult]
    results = glmocr.parse(["page1.png", "page2.png", "report.pdf"])
Class-based (multiple calls / resource reuse)
    from glmocr import GlmOcr

    parser = GlmOcr(api_key="sk-xxx")  # mode auto-set to "maas"
    parser = GlmOcr(mode="maas")       # reads ZHIPU_API_KEY from env

    # Always use as context manager or call .close()
    with GlmOcr(api_key="sk-xxx") as parser:
        result = parser.parse("document.png")
        print(result.markdown_result)

    parser.close()  # if not using `with`
Constructor Parameters
| Parameter | Type | Description |
|---|---|---|
| `api_key` | str | API key. Providing this auto-enables MaaS mode. |
| `api_url` | str | Override MaaS endpoint URL |
| `model` | str | Model name override |
| `timeout` | int | Request timeout in seconds (default: 600) |
| `enable_layout` | bool | Enable layout detection |
| `log_level` | str | Logging level |
Working with PipelineResult
Fields
    result.markdown_result   # str: full document as Markdown
    result.json_result       # list[list[dict]]: structured regions per page
    result.original_images   # list[str]: absolute paths of input images
json_result structure
List of pages → list of regions per page:
    [
      [
        {
          "index": 0,
          "label": "title",
          "content": "Annual Report 2024",
          "bbox_2d": [100, 50, 900, 120]
        },
        {
          "index": 1,
          "label": "table",
          "content": "| Q1 | Q2 |\n|---|---|\n| 120 | 145 |",
          "bbox_2d": [100, 140, 900, 400]
        }
      ]
    ]
Bounding boxes (bbox_2d): [x1, y1, x2, y2] normalised to 0–1000 scale.
Region labels: title, text, table, figure, formula, header, footer, page_number, reference, seal
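
Because `bbox_2d` is normalised, cropping or drawing a region requires mapping it back to pixels. The sketch below assumes that x values scale with image width and y values with image height; the source only states the 0-1000 scale, so treat that mapping as an assumption. The helper name `bbox_to_pixels` and the image size are hypothetical.

```python
def bbox_to_pixels(bbox_2d, image_width, image_height, scale=1000):
    """Map [x1, y1, x2, y2] on the 0-1000 scale to integer pixel coordinates.

    Assumes x is normalised by width and y by height (not confirmed by the SDK docs).
    """
    x1, y1, x2, y2 = bbox_2d
    return (
        round(x1 * image_width / scale),
        round(y1 * image_height / scale),
        round(x2 * image_width / scale),
        round(y2 * image_height / scale),
    )


# Region taken from the sample json_result; the 2000x3000 image size is made up.
region = {"index": 0, "label": "title", "bbox_2d": [100, 50, 900, 120]}
pixels = bbox_to_pixels(region["bbox_2d"], image_width=2000, image_height=3000)
print(pixels)  # (200, 150, 1800, 360)
```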
Serialization
    # Dict (JSON-serializable, for passing to other tools)
    d = result.to_dict()
    # Keys: json_result, markdown_result, original_images, usage (MaaS), data_info (MaaS)

    # JSON string
    json_str = result.to_json()             # pretty-printed, ensure_ascii=False
    json_str = result.to_json(indent=None)  # compact single line

    # Save to disk: writes <stem>/<stem>.json + <stem>/<stem>.md + layout_vis/
    result.save(output_dir="./output")
    result.save(output_dir="./output", save_layout_visualization=False)
Error Handling
The SDK does not raise on MaaS errors — check to_dict() for an "error" key:
    result = parser.parse("image.png")
    d = result.to_dict()
    if "error" in d:
        # Handle failure
        print("OCR failed:", d["error"])
    else:
        print(d["markdown_result"])
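
Callers that prefer exceptions over key checks can wrap the check in a small helper. This is a sketch that works on the plain dict returned by `to_dict()`; the `OcrError` class and the `unwrap` function are hypothetical and not part of the SDK, and the sample dicts are made up.

```python
class OcrError(RuntimeError):
    """Hypothetical exception type; the SDK itself does not define it."""


def unwrap(result_dict):
    """Return the Markdown from a to_dict() payload, or raise if it has an "error" key."""
    if "error" in result_dict:
        raise OcrError(str(result_dict["error"]))
    return result_dict["markdown_result"]


# Hand-written payloads standing in for result.to_dict().
ok = {"markdown_result": "# Title", "json_result": [[]]}
failed = {"error": "request timed out"}

print(unwrap(ok))
try:
    unwrap(failed)
except OcrError as exc:
    print("OCR failed:", exc)
```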
CLI Reference
Agent-preferred interface: use the CLI for most operations. Set `ZHIPU_API_KEY` in env once, then invoke as needed.
Supported input formats: .jpg, .jpeg, .png, .bmp, .gif, .webp, .pdf
Basic usage
    # Parse a single file → saves to ./output/<stem>/
    # MaaS mode is the default; ZHIPU_API_KEY must be set (or use --api-key)
    glmocr parse image.png

    # Pass API key directly without any env setup
    glmocr parse image.png --api-key sk-xxx

    # Parse a directory → saves each file to ./output/<stem>/
    glmocr parse ./scans/

    # Use self-hosted vLLM/SGLang instead of cloud
    glmocr parse image.png --mode selfhosted

    # Specify output directory
    glmocr parse image.png --output ./results/
Read results in the terminal (agent-friendly)
    # Print Markdown + JSON to stdout (and still save to disk)
    glmocr parse image.png --stdout

    # Print to stdout ONLY: do not write any files
    glmocr parse image.png --stdout --no-save

    # JSON only (no Markdown output)
    glmocr parse image.png --stdout --json-only

    # Pipe JSON into jq for structured extraction
    glmocr parse image.png --stdout --json-only --no-save | jq '.[0] | map(select(.label=="table"))'
Save control
    # Skip layout visualization images (faster, smaller output)
    glmocr parse image.png --no-layout-vis

    # Parse and save only JSON + Markdown, skip layout vis
    glmocr parse image.png --no-layout-vis --output ./results/
Batch processing
    # All images in a folder
    glmocr parse ./invoice_scans/ --output ./parsed/ --no-layout-vis

    # With progress visible in logs
    glmocr parse ./docs/ --output ./parsed/ --log-level INFO
Debugging
    glmocr parse image.png --log-level DEBUG
Full flag reference
| Flag | Default | Description |
|---|---|---|
| `--api-key` / `-k` | env var | API key for MaaS mode (overrides `ZHIPU_API_KEY`) |
| `--mode` | `maas` | `maas` (cloud, default) or `selfhosted` (local GPU) |
| `--env-file` | auto | Path to `.env` file (default: auto-discover from cwd) |
| `--output` / `-o` | `./output` | Output directory |
| `--stdout` | off | Print JSON + Markdown to stdout |
| `--no-save` | off | Skip writing files (use with `--stdout`) |
| `--json-only` | off | stdout JSON only, no Markdown |
| `--no-layout-vis` | off | Skip layout visualization images |
| `--config` / `-c` | none | Path to YAML config override |
| `--log-level` | `INFO` | DEBUG / INFO / WARNING / ERROR |
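
When an agent drives the CLI from Python, building the argument list in one place keeps flag spelling consistent. The sketch below assembles an argv list from a subset of the flags in the table above; the function `build_parse_command` and its parameter names are hypothetical, and the API key is deliberately left to the environment so that it does not appear in the command line.

```python
import shlex


def build_parse_command(path, *, output=None, stdout=False, no_save=False,
                        json_only=False, no_layout_vis=False, log_level=None):
    """Assemble an argv list for `glmocr parse` using flags from the table above."""
    argv = ["glmocr", "parse", str(path)]
    if output is not None:
        argv += ["--output", str(output)]
    if stdout:
        argv.append("--stdout")
    if no_save:
        argv.append("--no-save")
    if json_only:
        argv.append("--json-only")
    if no_layout_vis:
        argv.append("--no-layout-vis")
    if log_level is not None:
        argv += ["--log-level", log_level]
    return argv


argv = build_parse_command("image.png", stdout=True, no_save=True, json_only=True)
print(shlex.join(argv))  # glmocr parse image.png --stdout --no-save --json-only
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.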
Typical Agent Workflow
    receive document path / URL
            │
            ▼
    glmocr.parse(path)           ← single call, handles PDF/image
            │
            ▼
    result.to_dict()             ← safe to pass as tool output
            │
            ├── markdown_result  → hand to LLM for reading / summarization
            └── json_result      → structured extraction (tables, formulas, regions by label)
Filter by label
    import glmocr

    result = glmocr.parse("report.png")
    regions = result.json_result[0]  # first page

    tables    = [r for r in regions if r["label"] == "table"]
    formulas  = [r for r in regions if r["label"] == "formula"]
    body_text = [r for r in regions if r["label"] == "text"]
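
Once table regions are filtered out, their `content` usually needs to be turned into rows. The sketch below assumes the content is a simple Markdown pipe table like the one in the sample `json_result`; other table encodings, escaped pipes, and empty edge cells are not handled. The helper name `parse_markdown_table` is hypothetical.

```python
def parse_markdown_table(content):
    """Split a simple pipe table into a header list and a list of row lists."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]

    def cells(line):
        # Remove the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is the |---|---| separator row; data rows start after it.
    rows = [cells(line) for line in lines[2:]]
    return header, rows


# Content string copied from the sample table region.
content = "| Q1 | Q2 |\n|---|---|\n| 120 | 145 |"
header, rows = parse_markdown_table(content)
print(header)  # ['Q1', 'Q2']
print(rows)    # [['120', '145']]
```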
Multi-page PDF → iterate pages
    from glmocr import GlmOcr

    with GlmOcr(api_key="sk-xxx") as parser:
        result = parser.parse("document.pdf")  # all pages in one PipelineResult

    for page_idx, page_regions in enumerate(result.json_result):
        print(f"Page {page_idx + 1}: {len(page_regions)} regions")
        for region in page_regions:
            print(f"  [{region['label']}] {region['content'][:60]}")
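
For documents with many pages it is often handier to index regions by label across the whole document than to filter page by page. The sketch below works on a plain `list[list[dict]]` in the documented `json_result` shape; the helper `group_by_label` is hypothetical, and the second page in the sample data is invented for illustration.

```python
from collections import defaultdict


def group_by_label(json_result):
    """Map each label to a list of (page_number, region) pairs; pages are 1-based."""
    grouped = defaultdict(list)
    for page_number, regions in enumerate(json_result, start=1):
        for region in regions:
            grouped[region["label"]].append((page_number, region))
    return dict(grouped)


# Page 1 follows the sample in this document; page 2 is made up.
sample = [
    [
        {"index": 0, "label": "title", "content": "Annual Report 2024",
         "bbox_2d": [100, 50, 900, 120]},
        {"index": 1, "label": "table", "content": "| Q1 | Q2 |\n|---|---|\n| 120 | 145 |",
         "bbox_2d": [100, 140, 900, 400]},
    ],
    [
        {"index": 0, "label": "table", "content": "| Q3 | Q4 |\n|---|---|\n| 150 | 170 |",
         "bbox_2d": [100, 60, 900, 300]},
    ],
]

grouped = group_by_label(sample)
table_pages = [page for page, _ in grouped["table"]]
print(sorted(grouped))  # ['table', 'title']
print(table_pages)      # [1, 2]
```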
Programmatic config (no env vars)
    from glmocr.config import GlmOcrConfig

    cfg = GlmOcrConfig.from_env(
        api_key="sk-xxx",
        mode="maas",
        timeout=600,
        log_level="DEBUG",
    )
Output Directory Layout
After result.save(output_dir):
    output_dir/
      <image_stem>/
        <image_stem>.json     ← structured regions
        <image_stem>.md       ← full Markdown (with cropped figure images)
        imgs/                 ← cropped figures referenced in Markdown
        layout_vis/           ← layout detection overlay images (if enabled)
          <image_stem>.jpg
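
A downstream step that needs to find the saved files can derive their locations from the input file name. The sketch below follows the layout described above using `pathlib`; it only computes paths and does not check that the files exist. The helper name `expected_outputs` and the sample file name are hypothetical.

```python
from pathlib import Path


def expected_outputs(input_path, output_dir="./output"):
    """Return the paths that the layout above implies for one input file."""
    stem = Path(input_path).stem
    base = Path(output_dir) / stem
    return {
        "json": base / f"{stem}.json",
        "markdown": base / f"{stem}.md",
        "figures": base / "imgs",
        "layout_vis": base / "layout_vis",
    }


paths = expected_outputs("scans/invoice_01.png", output_dir="parsed")
print(paths["json"].as_posix())      # parsed/invoice_01/invoice_01.json
print(paths["markdown"].as_posix())  # parsed/invoice_01/invoice_01.md
```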
Common Pitfalls
- `ZHIPU_API_KEY` not set: SDK defaults to MaaS mode. Without a key, `parse()` will fail with a clear error message and quick-fix instructions. Set via `export ZHIPU_API_KEY=sk-xxx`, add to a `.env` file, or pass `--api-key sk-xxx` to the CLI.
- Large PDFs: Default timeout is 600s. For very long documents increase with `timeout=1200`.
- `result.json_result` is a string: Happens when the model returns malformed JSON. The SDK preserves the raw string; parse or log it manually.
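
The last pitfall can be handled with a small guard before any code indexes into `json_result`. This sketch uses only the standard `json` module: it passes lists through, attempts to parse strings, and otherwise hands back the raw string for logging. The helper name `coerce_json_result` and the sample strings are hypothetical.

```python
import json


def coerce_json_result(json_result):
    """Return (pages, raw): parsed pages if available, else None plus the raw string."""
    if not isinstance(json_result, str):
        return json_result, None  # already structured
    try:
        return json.loads(json_result), None
    except json.JSONDecodeError:
        return None, json_result  # keep the raw text for manual inspection


# A string that happens to be valid JSON, and a truncated one.
pages, raw = coerce_json_result('[[{"index": 0, "label": "text", "content": "hi"}]]')
print(pages[0][0]["label"])  # text

bad_pages, bad_raw = coerce_json_result('[[{"index": 0, "label": ')
print(bad_pages is None, len(bad_raw))
```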