Skill v1.0.1
current · Automated scan 100/100 · shuyu-labs/webcode/office-to-md
4 files
──Details
Published: May 15, 2026 at 02:02 AM
Content Hash: sha256:e89583d827c2f49e...
Git SHA: a40b2b092e81
Bump Type: patch
──Files
Files (1 file, 4.0 KB)
SKILL.md · 4.0 KB · active
SKILL.md · 176 lines · 4.0 KB
version: "1.0.1"
name: office-to-md
description: Convert Office documents (Word, Excel, PowerPoint, PDF) to Markdown format. ONLY use this skill when the user explicitly requests to CONVERT, TRANSFORM or PARSE a specific office file into Markdown. Do NOT trigger for general questions, documentation reading, or discussions about files.
Office Document to Markdown Converter
Convert various Office document formats to structured Markdown with text, table, and image extraction.
File Description
- enhanced_parser.py - Core document parser
- doc_converter.py - DOC to DOCX converter (requires LibreOffice)
- requirements.txt - Python dependencies
Install Dependencies
bash
pip install -r requirements.txt
Additional Dependencies for DOC Format
.doc format requires LibreOffice:
bash
# Windows: install LibreOffice from the official website
# https://www.libreoffice.org/download/

# Linux
sudo apt install libreoffice

# Mac
brew install --cask libreoffice
Quick Start
Python Code
python
from enhanced_parser import EnhancedDocumentParser

# Initialize parser
parser = EnhancedDocumentParser(
    image_base_url="http://localhost:5000",
    image_save_dir="./static/images",
    filter_headers_footers=True,  # Filter headers and footers
)

# Parse document
result = parser.parse_document("document.docx")
if result["success"]:
    print(result["markdown"])
    print(f"Extracted {result['images_count']} images")
Start API Service
bash
# Start service using app.py from project root
python app.py

# Visit http://localhost:5000/analyzer to upload files
Supported Formats
| Format | Extensions | Notes |
|---|---|---|
| Word | .docx, .doc | .doc requires LibreOffice |
| Excel | .xlsx, .xls | Supports multiple worksheets and date formats |
| PowerPoint | .pptx | Extracts slide text and images |
| PDF | .pdf | Auto-detects tables and images |
Features
Word Documents
- Automatic heading level detection
- Convert tables to Markdown tables
- Extract inline images
- Filter headers and footers
- Preserve list formatting
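As an illustration of the table conversion listed above, here is a minimal sketch of turning extracted rows into a Markdown table. The function name `rows_to_markdown` is hypothetical and this is not taken from enhanced_parser.py; it only shows the general shape of such a conversion, assuming the first row is the header.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes and flatten newlines so a cell cannot break the table,
        # then pad short rows to the full column count.
        cells = [str(c).replace("|", "\\|").replace("\n", " ") for c in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "|" + "---|" * width]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


print(rows_to_markdown([["Name", "Qty"], ["Bolt", 4], ["Nut | washer", 10]]))
```

Escaping the pipe character matters because an unescaped `|` inside a cell would be read as a column boundary.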
Excel Workbooks
- Support for multiple worksheets
- Automatic date format detection (prevents display as numbers)
- Convert to Markdown tables
- Extract embedded images
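For context on why dates can show up as numbers: Excel stores a date as a serial day count, and a reader that ignores the cell's number format sees only that number. The sketch below shows the standard conversion for the 1900 date system. It is an illustration of the mechanism, not the logic used in enhanced_parser.py, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Epoch for Excel's 1900 date system. Using 1899-12-30 absorbs Excel's
# nonexistent 1900-02-29, so results are correct for serials of 61 and up.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial number (days, with fractional time) to datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


print(excel_serial_to_datetime(45000).date())  # 2023-03-15
print(excel_serial_to_datetime(44927.5))       # 2023-01-01 12:00:00
```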
PowerPoint Presentations
- Extract content by slide
- Extract images and text boxes
- Preserve slide order
PDF Documents
- Auto-detect tables (line detection + text position detection)
- Extract page images
- Intelligently identify headings and lists
- Output content in original order
Advanced Options
DOC Conversion
bash
# Test LibreOffice configuration
python doc_converter.py
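For readers who want to check the LibreOffice setup by hand, the following is a rough sketch of a headless DOC to DOCX conversion. It uses LibreOffice's real `--headless`, `--convert-to`, and `--outdir` options, but the helper names are hypothetical and doc_converter.py may work differently. It also assumes the `soffice` or `libreoffice` executable is on PATH, which is often not the case on Windows.

```python
import shutil
import subprocess
from pathlib import Path


def find_soffice():
    """Return the path of a LibreOffice executable on PATH, or None."""
    return shutil.which("soffice") or shutil.which("libreoffice")


def build_convert_command(soffice, src, out_dir):
    """Build the headless LibreOffice command that converts src to DOCX."""
    return [soffice, "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir), str(src)]


def convert_doc_to_docx(src, out_dir, timeout=120):
    """Run the conversion and return the expected output path."""
    soffice = find_soffice()
    if soffice is None:
        raise RuntimeError("LibreOffice not found on PATH")
    subprocess.run(build_convert_command(soffice, src, out_dir),
                   check=True, timeout=timeout)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(src).stem + ".docx")
```

If `find_soffice()` returns None, installing LibreOffice or adding its program directory to PATH is the first thing to try.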
PDF Table Strategy
python
from enhanced_parser import EnhancedDocumentParser

parser = EnhancedDocumentParser(
    pdf_table_strategy="lines_strict"  # Default: strict line detection, fastest
    # "lines": Normal line detection
    # "text": Based on text position, more accurate but slower
)
Image Processing
python
from enhanced_parser import EnhancedDocumentParser

parser = EnhancedDocumentParser(
    image_base_url="https://your-domain.com",  # Image access URL
    image_save_dir="./static/images",  # Image save directory
)
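How the two settings combine into an image link is not documented here. One plausible reading, consistent with the sample URL in the return format, is that the save directory is appended to the base URL. The sketch below encodes that assumption; `image_url` is a hypothetical helper, not part of the parser.

```python
from pathlib import PurePosixPath


def image_url(image_base_url, image_save_dir, filename):
    """Join base URL, save directory, and filename (assumed URL layout)."""
    # PurePosixPath collapses the leading "./" of "./static/images".
    rel_dir = PurePosixPath(image_save_dir).as_posix().strip("/")
    return f"{image_base_url.rstrip('/')}/{rel_dir}/{filename}"


print(image_url("http://localhost:5000", "./static/images", "uuid.png"))
# http://localhost:5000/static/images/uuid.png
```

Under this assumption, image_save_dir has to be a directory that the web server also exposes at the same relative path.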
Return Format
json
{
  "success": true,
  "markdown": "# Document Title\n\nContent...",
  "images_count": 2,
  "images": [
    {
      "filename": "uuid.png",
      "url": "http://localhost:5000/static/images/uuid.png",
      "size": 12345
    }
  ],
  "file_type": "docx",
  "file_info": {
    "name": "document.docx",
    "size": 45678,
    "paragraphs": 50,
    "tables": 3
  }
}
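A short sketch of consuming a result of this shape: it writes the Markdown to disk and collects the image URLs. The helper name `save_result` is hypothetical, and because the shape of a failed result is not documented here, the sketch only checks the `success` flag.

```python
from pathlib import Path


def save_result(result, out_path):
    """Write result["markdown"] to out_path and return the image URLs."""
    if not result.get("success"):
        raise ValueError("conversion failed")
    Path(out_path).write_text(result["markdown"], encoding="utf-8")
    return [img["url"] for img in result.get("images", [])]


# Sample dict mirroring the documented return format.
sample = {
    "success": True,
    "markdown": "# Document Title\n\nContent...",
    "images_count": 1,
    "images": [{"filename": "uuid.png",
                "url": "http://localhost:5000/static/images/uuid.png",
                "size": 12345}],
}
```

In real use, `result` would come from `parser.parse_document(...)` instead of the sample dict.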
Common Issues
DOC Conversion Failed
- Ensure LibreOffice is installed
- Run python doc_converter.py to test the configuration
Dates Display as Numbers
- Excel parsing automatically handles date formats
- Ensure you're using the latest version of enhanced_parser.py
PDF Table Recognition Inaccurate
- Try different pdf_table_strategy values
- Use "lines_strict" for standard tables
- Use "text" for complex tables
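The advice above can be automated by trying the strategies in order, from fastest to slowest, and stopping at the first result that contains a table. This is a sketch under assumptions: `parse_with_fallback` and `has_markdown_table` are hypothetical helpers, and "contains a table" is approximated by looking for a Markdown separator row. The parser is passed in as a factory so the sketch does not depend on enhanced_parser.py being importable.

```python
import re

# Matches a Markdown table separator row such as "|---|---|" or "| --- | :-: |".
TABLE_SEPARATOR = re.compile(r"^\s*\|(\s*:?-+:?\s*\|)+\s*$", re.MULTILINE)


def has_markdown_table(markdown):
    return bool(TABLE_SEPARATOR.search(markdown))


def parse_with_fallback(path, make_parser,
                        strategies=("lines_strict", "lines", "text")):
    """Return (strategy, result) for the first strategy that yields a table.

    If none does, return (None, last_result).
    """
    result = None
    for strategy in strategies:
        result = make_parser(strategy).parse_document(path)
        if result.get("success") and has_markdown_table(result.get("markdown", "")):
            return strategy, result
    return None, result


# Intended use (requires enhanced_parser.py):
# make = lambda s: EnhancedDocumentParser(pdf_table_strategy=s)
# strategy, result = parse_with_fallback("report.pdf", make)
```

A separator-row check only shows that some table was found, not that it is correct, so the output is still worth a visual check on complex documents.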
File Limitations
- Maximum file size: 160MB
- Supported extensions: docx, doc, pdf, xlsx, xls, pptx
- Automatic cleanup of temporary files
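These limits can be checked before a file is handed to the parser. The sketch below is a hypothetical pre-flight helper, not part of the skill. It assumes "160MB" means 160 × 1024 × 1024 bytes, which the source does not specify.

```python
from pathlib import Path

MAX_BYTES = 160 * 1024 * 1024  # assumption: "160MB" read as 160 MiB
SUPPORTED = {"docx", "doc", "pdf", "xlsx", "xls", "pptx"}


def check_convertible(path):
    """Return None if the file passes the checks, else a short reason."""
    p = Path(path)
    ext = p.suffix.lower().lstrip(".")
    if ext not in SUPPORTED:
        return f"unsupported extension: {ext or '(none)'}"
    if not p.is_file():
        return "file not found"
    if p.stat().st_size > MAX_BYTES:
        return "file larger than 160 MB"
    return None
```

Checking the extension first avoids touching the filesystem for files that would be rejected anyway.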