version: "1.0.2"
name: docx-parse-resilient
description: Extract text from DOCX files with shell-primary approach and Python zipfile fallback for maximum reliability
Resilient DOCX Text Extraction
Extract text from Microsoft Word (.docx) files using a robust two-tier approach: shell-based extraction as the primary method, with Python zipfile fallback when shell commands fail or return no output.
When to Use
- Python environment may lack python-docx but the zipfile module is available (standard library)
- Working in constrained or inconsistent environments (containers, minimal images, CI/CD)
- Shell unzip command returns errors or no output
- Need reliable extraction with automatic fallback
Core Technique
DOCX files are ZIP archives containing XML files. This skill provides two extraction methods:
- Primary (Shell): unzip -p + sed for fast extraction
- Fallback (Python): zipfile module for reliable extraction when shell fails
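To see the archive layout for yourself, the following minimal sketch lists the members of a DOCX and reports whether the main body part is present. The helper name list_docx_parts is illustrative and not part of this skill; it uses only the standard zipfile module.

```python
import zipfile

MAIN_PART = "word/document.xml"  # main body part read throughout this skill


def list_docx_parts(path):
    """Return (member_names, has_main_part) for the ZIP archive at path."""
    with zipfile.ZipFile(path) as z:
        names = z.namelist()
    return names, MAIN_PART in names


# Example (assumes document.docx exists in the working directory):
# names, ok = list_docx_parts("document.docx")
```

A typical archive also holds parts such as [Content_Types].xml and word/styles.xml, but only word/document.xml is read by the methods described here.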
Step-by-Step Instructions
1. Verify the DOCX file exists
ls -la document.docx
2. Test shell extraction first (recommended)
Try the shell-based approach:
unzip -p document.docx word/document.xml 2>/dev/null | sed -e 's/<[^>]*>//g'
3. Check if shell extraction produced output
Verify the shell method returned content:
content=$(unzip -p document.docx word/document.xml 2>/dev/null | sed -e 's/<[^>]*>//g')
if [ -z "$content" ]; then
    echo "Shell extraction returned no output, trying Python fallback..."
fi
4. Use Python zipfile fallback if needed
When shell commands fail or return empty output, use Python's standard zipfile module:
python3 -c "
import zipfile
import sys
import re
try:
    with zipfile.ZipFile('document.docx', 'r') as z:
        content = z.read('word/document.xml').decode('utf-8')
    # Strip XML tags
    text = re.sub(r'<[^>]*>', '', content)
    # Clean whitespace
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    print('\n'.join(lines))
except Exception as e:
    print(f'Error: {e}', file=sys.stderr)
    sys.exit(1)
"
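One caveat with tag stripping: Word commonly writes word/document.xml with very few line breaks, so removing tags alone can run paragraphs together. A hedged variant, assuming paragraphs end with a closing w:p tag as in standard WordprocessingML, inserts a newline at each paragraph boundary before stripping. The function name extract_paragraphs is hypothetical.

```python
import re
import zipfile


def extract_paragraphs(path):
    """Return DOCX body text with one line per non-empty paragraph."""
    with zipfile.ZipFile(path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    # Mark paragraph ends before the tags are removed.
    xml = xml.replace("</w:p>", "\n")
    # Strip all remaining XML tags.
    text = re.sub(r"<[^>]*>", "", xml)
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return "\n".join(lines)
```

This keeps the regex approach of the fallback and changes only where line breaks come from.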
5. Save extracted text to file
# Try shell first
unzip -p document.docx word/document.xml 2>/dev/null | \
    sed -e 's/<[^>]*>//g' > output.txt
# Verify output has content
if [ ! -s output.txt ]; then
    # Fallback to Python
    python3 -c "
import zipfile, re
with zipfile.ZipFile('document.docx', 'r') as z:
    content = z.read('word/document.xml').decode('utf-8')
text = re.sub(r'<[^>]*>', '', content)
lines = [line.strip() for line in text.split('\n') if line.strip()]
print('\n'.join(lines))
" > output.txt
fi
Complete Shell Function with Fallback
Add this resilient function to your scripts:
parse_docx_resilient() {
    local file="$1"
    local output="$2"
    if [ ! -f "$file" ]; then
        echo "Error: File not found: $file" >&2
        return 1
    fi
    # Primary: Shell extraction
    local content
    content=$(unzip -p "$file" word/document.xml 2>/dev/null | \
        sed -e 's/<[^>]*>//g' | \
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
        sed -e '/^$/d')
    # Check if shell extraction succeeded
    if [ -n "$content" ]; then
        echo "$content" > "${output:-/dev/stdout}"
        return 0
    fi
    # Fallback: Python zipfile
    echo "Shell extraction failed, using Python fallback..." >&2
    # The path is passed as an argument instead of being pasted into the
    # Python source, so quotes in the file name cannot break the code.
    python3 -c "
import zipfile, sys, re
try:
    with zipfile.ZipFile(sys.argv[1], 'r') as z:
        content = z.read('word/document.xml').decode('utf-8')
    text = re.sub(r'<[^>]*>', '', content)
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    print('\n'.join(lines))
except Exception as e:
    print(f'Python extraction failed: {e}', file=sys.stderr)
    sys.exit(1)
" "$file" > "${output:-/dev/stdout}" || return 1
}
# Usage examples:
# parse_docx_resilient document.docx            # Output to stdout
# parse_docx_resilient document.docx out.txt    # Output to file
Python Script Alternative
For complex workflows, save as a standalone script:
#!/usr/bin/env python3
"""DOCX text extractor with resilient fallback."""
import zipfile
import sys
import re
import subprocess

def extract_with_shell(filepath):
    """Try shell-based extraction first."""
    try:
        result = subprocess.run(
            ['unzip', '-p', filepath, 'word/document.xml'],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0 and result.stdout.strip():
            text = re.sub(r'<[^>]*>', '', result.stdout)
            lines = [l.strip() for l in text.split('\n') if l.strip()]
            return '\n'.join(lines)
    except Exception:
        pass
    return None

def extract_with_python(filepath):
    """Fallback Python zipfile extraction."""
    with zipfile.ZipFile(filepath, 'r') as z:
        content = z.read('word/document.xml').decode('utf-8')
    text = re.sub(r'<[^>]*>', '', content)
    lines = [l.strip() for l in text.split('\n') if l.strip()]
    return '\n'.join(lines)

def parse_docx_resilient(filepath):
    """Extract text with automatic fallback."""
    # Try shell first
    content = extract_with_shell(filepath)
    if content:
        return content, 'shell'
    # Fallback to Python
    content = extract_with_python(filepath)
    return content, 'python'

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: parse_docx_resilient.py <file.docx>", file=sys.stderr)
        sys.exit(1)
    content, method = parse_docx_resilient(sys.argv[1])
    if content:
        print(f"# Extracted using {method} method", file=sys.stderr)
        print(content)
    else:
        print("Failed to extract text from DOCX", file=sys.stderr)
        sys.exit(1)
Limitations
- Does not preserve formatting, images, or table structure
- May include some residual XML entity references
- Works best for simple text extraction needs
- The DOCX must be a valid Office Open XML file
- Python fallback requires Python 3 with standard library (no external packages)
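The residual entity references mentioned above come from stripping tags with a regex, which leaves sequences such as &amp; untouched. As a sketch of one way around this, still using only the standard library, the XML can be parsed with xml.etree.ElementTree, which decodes entities while reading. The namespace URI below is the main WordprocessingML namespace used by ordinary (transitional) DOCX files, which is assumed here; the function name extract_text_etree is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (transitional OOXML, assumed here).
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text_etree(path):
    """Return one line per non-empty paragraph, with XML entities decoded."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    lines = []
    for para in root.iter(W + "p"):
        # Join the text runs (w:t elements) inside this paragraph.
        text = "".join(t.text or "" for t in para.iter(W + "t")).strip()
        if text:
            lines.append(text)
    return "\n".join(lines)
```

Formatting, images, and table layout are still discarded, and text inside nested structures such as text boxes may appear more than once, so the other limitations in this list continue to apply.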
Verification
Confirm extraction worked by checking output:
# Test extraction to stdout
parse_docx_resilient document.docx | head -20
# Test with file output
parse_docx_resilient document.docx extracted.txt
wc -l extracted.txt  # Should show line count > 0
# Verify content
grep -c "[a-zA-Z]" extracted.txt  # Should print a nonzero count of lines containing letters
Troubleshooting
Shell returns "unknown error" or no output:
- This is expected in some environments
- The function automatically falls back to Python zipfile
- Check which unzip to verify unzip is available
Python also fails:
- Verify the file is a valid DOCX: file document.docx
- Check if file is corrupted: unzip -t document.docx
- Ensure Python 3 is available: python3 --version
File not found errors:
- Use absolute path or verify working directory
- Check file permissions: ls -la document.docx
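The file checks in this section can also be run from Python when file or unzip is not installed. This is a sketch only: check_docx is a hypothetical helper, and it relies on zipfile.is_zipfile and ZipFile.testzip from the standard library.

```python
import os
import zipfile


def check_docx(path):
    """Return a short diagnosis string, or None if the file looks usable."""
    if not os.path.isfile(path):
        return "file not found"
    if not zipfile.is_zipfile(path):
        return "not a ZIP archive, so not a valid DOCX"
    with zipfile.ZipFile(path) as z:
        bad = z.testzip()  # name of the first corrupt member, or None
        if bad is not None:
            return f"corrupt member: {bad}"
        if "word/document.xml" not in z.namelist():
            return "missing word/document.xml"
    return None


# Example (assumes document.docx is the file under investigation):
# print(check_docx("document.docx") or "looks fine")
```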
Environment Detection
To pre-detect which method to use:
have_method=0
# Check if unzip is available
if command -v unzip > /dev/null 2>&1; then
    echo "Shell method available"
    have_method=1
fi
# Check if Python 3 is available
if command -v python3 > /dev/null 2>&1; then
    echo "Python fallback available"
    have_method=1
fi
# Warn only when neither tool was found
if [ "$have_method" -eq 0 ]; then
    echo "Warning: No extraction method available!"
fi
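When the caller is already a Python program, the same detection can be sketched with shutil.which, which searches PATH much as command -v does. The function name available_methods is hypothetical.

```python
import shutil


def available_methods():
    """Return the extraction methods usable in this environment."""
    methods = []
    # The shell method needs both unzip and sed on PATH.
    if shutil.which("unzip") and shutil.which("sed"):
        methods.append("shell")
    # zipfile ships with Python's standard library, so if this code is
    # running, the Python fallback is normally available too.
    methods.append("python")
    return methods
```

The list is ordered by preference, so a caller could try methods[0] first and fall back to the rest.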