Skill v2.0.0
current · LLM-judged scan: 95/100 · majiayu000/claude-skill-registry-data/skills-mitkox-fteplusai
3 files
Details
Published: May 15, 2026 at 09:13 AM
Content Hash: sha256:80ff16f47c5d67c8...
Git SHA: 01042ae58061
Bump Type: patch
SKILL.md · 496 lines · 13.7 KB
skill: 'hardware-sizing'
version: '2.0.0'
updated: '2025-12-31'
category: 'local-ai-infrastructure'
complexity: 'advanced'
prerequisite_skills: []
composable_with:
  - 'local-ai-deployment'
  - 'mlops-operations'
  - 'financial-modeling'
  - 'production-readiness'
Hardware Sizing Skill
Overview
Expertise in calculating and specifying hardware requirements for local AI deployments, including GPU selection, server configuration, storage, and network planning based on workload characteristics and team size.
Key Capabilities
- GPU selection and sizing for LLM inference
- Server configuration for AI workloads
- Storage planning for models and data
- Network bandwidth calculations
- TCO modeling for hardware investments
- Capacity planning and growth projections
GPU Selection Guide
NVIDIA GPU Comparison
| GPU | VRAM | FP16 TFLOPS | Bandwidth | TDP | Price (approx) | Best For |
|---|---|---|---|---|---|---|
| RTX 4090 | 24GB | 82.6 | 1 TB/s | 450W | $1,600 | Small teams, dev |
| RTX A6000 | 48GB | 38.7 | 768 GB/s | 300W | $4,500 | Medium teams |
| A100 40GB | 40GB | 77.9 | 1.5 TB/s | 400W | $10,000 | Production |
| A100 80GB | 80GB | 77.9 | 2.0 TB/s | 400W | $15,000 | Large models |
| H100 80GB | 80GB | 267 | 3.35 TB/s | 700W | $30,000 | Maximum perf |
| L40S | 48GB | 91.6 | 864 GB/s | 350W | $8,000 | Balanced |
Model VRAM Requirements
| Model Size | FP16 | INT8 | INT4/AWQ | Example Models |
|---|---|---|---|---|
| 7B | 14GB | 8GB | 4GB | Qwen-Next (small variant), GLM-4.6 (small variant) |
| 13B | 26GB | 14GB | 8GB | Qwen-Next (mid variant), MiniMax-M2 (mid variant) |
| 34B | 68GB | 36GB | 18GB | Qwen-Next / GLM-4.6 (large-ish variants) |
| 70B | 140GB | 75GB | 38GB | Qwen-Next / GLM-4.6 / MiniMax-M2 (largest variants) |
| 110B | 220GB | 115GB | 58GB | Frontier-scale variants (verify availability + license) |
VRAM Formula:
VRAM Required = (Parameters × Bytes per Parameter) + Context Window Overhead
- FP16: 2 bytes per parameter
- INT8: 1 byte per parameter
- INT4: 0.5 bytes per parameter
- Context overhead: ~2GB for 8K context, ~8GB for 32K context
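The formula above can be sketched as a small helper. This is a minimal illustration, assuming that one billion parameters at one byte each is about 1 GB and that only the two context lengths named in the formula (8K and 32K) have a known overhead; the function and constant names are hypothetical.

```python
# Bytes per parameter for each precision, taken from the VRAM formula.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Context overhead in GB for the two reference points given in the formula.
# Other context lengths are not covered by the source, so they are rejected.
CONTEXT_OVERHEAD_GB = {8192: 2.0, 32768: 8.0}


def estimate_vram_gb(params_billions: float, precision: str, context_tokens: int) -> float:
    """Return weights plus context overhead in GB (1B params at 1 byte ~ 1 GB)."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    if context_tokens not in CONTEXT_OVERHEAD_GB:
        raise ValueError(f"no overhead figure for context length: {context_tokens}")
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb + CONTEXT_OVERHEAD_GB[context_tokens]


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        print(precision, estimate_vram_gb(70, precision, 8192))
```

The results are rough estimates; the table values above include some rounding and runtime overhead that this two-term formula does not model.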
GPU Sizing by Team Size
| Team Size | Usage Level | Model Size | Recommended GPU | Quantity |
|---|---|---|---|---|
| 1-5 | Dev/Test | 7B-13B | RTX 4090 | 1 |
| 5-15 | Production | 13B-34B | RTX 4090 or A6000 | 1-2 |
| 15-30 | Production | 34B-70B | A100 40GB | 2 |
| 30-75 | Production | 70B | A100 80GB | 2-4 |
| 75-150 | Enterprise | 70B+ | H100 or A100 | 4-8 |
| 150+ | Enterprise | 70B+ | H100 cluster | 8+ |
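For scripting, the table above can be expressed as a lookup. This is only a sketch: the row ranges overlap at their boundaries (5, 15, 30, 75, 150), so the code assumes a boundary team size falls into the smaller tier. The names `SIZING_ROWS` and `recommend_gpu` are hypothetical.

```python
# Rows from the sizing table: (largest team size covered, usage, model size, GPU, quantity).
SIZING_ROWS = [
    (5, "Dev/Test", "7B-13B", "RTX 4090", "1"),
    (15, "Production", "13B-34B", "RTX 4090 or A6000", "1-2"),
    (30, "Production", "34B-70B", "A100 40GB", "2"),
    (75, "Production", "70B", "A100 80GB", "2-4"),
    (150, "Enterprise", "70B+", "H100 or A100", "4-8"),
]
# The open-ended "150+" row.
OPEN_ENDED_ROW = ("Enterprise", "70B+", "H100 cluster", "8+")


def recommend_gpu(team_size: int) -> dict:
    """Return the table row that covers the given team size."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    for max_size, usage, model_size, gpu, quantity in SIZING_ROWS:
        # Assumption: boundary sizes resolve to the smaller tier.
        if team_size <= max_size:
            return {"usage": usage, "model_size": model_size, "gpu": gpu, "quantity": quantity}
    usage, model_size, gpu, quantity = OPEN_ENDED_ROW
    return {"usage": usage, "model_size": model_size, "gpu": gpu, "quantity": quantity}


if __name__ == "__main__":
    print(recommend_gpu(40))
```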
Server Configuration Templates
Small Team Server (5-15 developers)
```yaml
# Small team AI server specification
server:
  type: Tower or 2U Rack
  cpu:
    model: AMD EPYC 7343 or Intel Xeon Gold 5315Y
    cores: 16
    threads: 32
  memory:
    type: DDR4-3200 ECC
    capacity: 128GB
    channels: 8
  gpu:
    model: NVIDIA RTX 4090
    count: 1-2
    vram_total: 24-48GB
    nvlink: false
  storage:
    system:
      type: NVMe SSD
      capacity: 500GB
      raid: None
    models:
      type: NVMe SSD
      capacity: 2TB
      raid: None
    logs:
      type: SATA SSD
      capacity: 2TB
      raid: 1
  network:
    type: 10GbE
    ports: 2
    bonding: Active/Standby
  power:
    psu: 1200W
    redundancy: Single (N)
    ups: Recommended
  estimated_cost:
    hardware: $10,000 - $15,000
    annual_power: $1,500
    annual_maintenance: $1,000
```
Medium Team Server (15-50 developers)
```yaml
# Medium team AI server specification
server:
  type: 2U Rack Mount
  cpu:
    model: AMD EPYC 7543 or Intel Xeon Platinum 8358
    cores: 32
    threads: 64
  memory:
    type: DDR4-3200 ECC
    capacity: 256GB
    channels: 8
  gpu:
    model: NVIDIA A6000 or RTX 4090
    count: 2-4
    vram_total: 96-192GB
    nvlink: Recommended for A6000
  storage:
    system:
      type: NVMe SSD
      capacity: 1TB
      raid: 1
    models:
      type: NVMe SSD
      capacity: 4TB
      raid: 0
    logs:
      type: SAS SSD
      capacity: 4TB
      raid: 10
  network:
    type: 25GbE
    ports: 2
    bonding: LACP
  power:
    psu: 2000W
    redundancy: Redundant (N+1)
    ups: Required
  estimated_cost:
    hardware: $35,000 - $60,000
    annual_power: $4,000
    annual_maintenance: $3,000
```
Enterprise Server (50-200 developers)
```yaml
# Enterprise AI server specification
server:
  type: 4U Rack Mount or DGX-style
  cpu:
    model: 2x AMD EPYC 9354 or Intel Xeon Platinum 8480+
    cores: 64 total
    threads: 128
  memory:
    type: DDR5-4800 ECC
    capacity: 512GB - 1TB
    channels: 12-16
  gpu:
    model: NVIDIA A100 80GB or H100
    count: 4-8
    vram_total: 320-640GB
    nvlink: Required (NVSwitch for 8+ GPUs)
  storage:
    system:
      type: NVMe SSD
      capacity: 2TB
      raid: 1
    models:
      type: NVMe SSD
      capacity: 8TB
      raid: 0 or 10
    logs:
      type: NVMe SSD
      capacity: 8TB
      raid: 10
    backup:
      type: SAS HDD
      capacity: 32TB
      raid: 6
  network:
    type: 100GbE or InfiniBand
    ports: 2-4
    bonding: LACP
  power:
    psu: 3000W+
    redundancy: Redundant (N+N)
    ups: Required with generator backup
  estimated_cost:
    hardware: $150,000 - $400,000
    annual_power: $15,000 - $30,000
    annual_maintenance: $10,000 - $20,000
```
Capacity Planning
Request Volume Estimation
| Developer Usage | Requests/Day | Tokens/Request | Daily Tokens |
|---|---|---|---|
| Light (occasional) | 20-30 | 2,000 | 40K-60K |
| Medium (regular) | 50-100 | 3,000 | 150K-300K |
| Heavy (power user) | 150-250 | 4,000 | 600K-1M |
| Intensive (AI-first) | 300-500 | 5,000 | 1.5M-2.5M |
Throughput Calculation
```
# Calculate required throughput
Daily Requests = Team Size × Requests per User per Day
Peak Factor = 0.1 (10% of daily load in peak hour)
Peak Requests per Minute = (Daily Requests × Peak Factor) / 60
Tokens per Request = Avg Input Tokens + Avg Output Tokens
Peak Tokens per Second = Peak Requests per Minute × Tokens per Request / 60

# Example: 50 medium-usage developers
Daily Requests = 50 × 100 = 5,000
Peak Requests/min = 5,000 × 0.1 / 60 = 8.3
Tokens/Request = 2,000 + 1,000 = 3,000
Peak Tokens/sec = 8.3 × 3,000 / 60 = 415 tok/s
```
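The same calculation as a short function, as a minimal sketch. The function name is hypothetical. Without the intermediate rounding to 8.3 requests per minute, the example comes out at about 417 tok/s rather than 415.

```python
def peak_tokens_per_second(
    team_size: int,
    requests_per_user_per_day: float,
    tokens_per_request: float,
    peak_factor: float = 0.1,
) -> float:
    """Peak token rate, assuming peak_factor of the daily load lands in one hour."""
    daily_requests = team_size * requests_per_user_per_day
    # Requests in the peak hour, spread evenly over its 60 minutes.
    peak_requests_per_minute = daily_requests * peak_factor / 60
    # Convert requests per minute to tokens per second.
    return peak_requests_per_minute * tokens_per_request / 60


if __name__ == "__main__":
    # 50 developers, 100 requests/day each, 2,000 input + 1,000 output tokens.
    print(round(peak_tokens_per_second(50, 100, 2_000 + 1_000), 1))
```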
GPU Throughput Reference
| GPU | Model Size | Throughput (tok/s) | Concurrent Requests |
|---|---|---|---|
| RTX 4090 | 7B | 100-150 | 8-12 |
| RTX 4090 | 13B | 50-80 | 4-8 |
| A100 40GB | 13B | 120-180 | 16-24 |
| A100 40GB | 34B | 60-100 | 8-16 |
| A100 80GB | 70B | 40-70 | 4-8 |
| H100 80GB | 70B | 100-150 | 8-16 |
| 2x A100 80GB | 70B (TP=2) | 80-140 | 8-16 |
Sizing Formula
```
Required GPUs = Peak Tokens/sec / Single GPU Throughput × Safety Factor
Safety Factor = 1.3 (30% headroom for spikes)

# Example: 415 tok/s needed for 70B model
Single A100 80GB throughput = 55 tok/s average
Required GPUs = 415 / 55 × 1.3 = 9.8 → 10 A100 80GB

# OR with 2-GPU tensor parallel:
TP=2 throughput = 110 tok/s
Required TP pairs = 415 / 110 × 1.3 = 4.9 → 5 pairs (10 GPUs)
```
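The sizing formula rounds up to whole serving units, which a small helper makes explicit. This is a sketch; `required_gpus` and its `gpus_per_unit` parameter (1 for a single GPU, 2 for a TP=2 pair) are hypothetical names.

```python
import math


def required_gpus(
    peak_tok_s: float,
    unit_throughput_tok_s: float,
    gpus_per_unit: int = 1,
    safety_factor: float = 1.3,
) -> int:
    """Total GPUs needed, rounding up to whole serving units."""
    units = math.ceil(peak_tok_s / unit_throughput_tok_s * safety_factor)
    return units * gpus_per_unit


if __name__ == "__main__":
    print(required_gpus(415, 55))                    # single A100 80GB per replica
    print(required_gpus(415, 110, gpus_per_unit=2))  # TP=2 pairs
```

Both calls reproduce the 10-GPU result from the worked example.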
Storage Planning
Model Storage Requirements
| Model Size | Weights (FP16) | Weights (INT4) | With Tokenizer |
|---|---|---|---|
| 7B | 14GB | 4GB | +500MB |
| 13B | 26GB | 7GB | +500MB |
| 34B | 68GB | 18GB | +500MB |
| 70B | 140GB | 38GB | +500MB |
| 100B+ | 200GB+ | 50GB+ | +1GB |
Storage Architecture
```yaml
storage_tiers:
  tier1_hot:  # Active models
    type: NVMe SSD
    iops: 500K+
    latency: <0.1ms
    purpose: Currently loaded models, active inference
    sizing: 2x largest model size
  tier2_warm:  # Standby models
    type: SATA SSD or NVMe
    iops: 50K+
    latency: <1ms
    purpose: Quick-loading alternate models
    sizing: 5-10x model sizes for model library
  tier3_cold:  # Archives
    type: HDD or object storage
    purpose: Model version history, backups
    sizing: 3x warm storage for versioning

log_storage:
  type: SSD (fast write)
  sizing: |
    Daily logs = Requests/day × 2KB average
    Monthly = Daily × 30
    Retention storage = Monthly × Retention months
```
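The log and hot-tier sizing rules above reduce to a few multiplications. The sketch below assumes decimal units (1 GB = 1,000,000 KB); the function names are hypothetical.

```python
def log_storage_gb(requests_per_day: float, retention_months: int, avg_log_kb: float = 2.0) -> float:
    """Log storage for the retention window, using the 2KB-per-request average."""
    daily_kb = requests_per_day * avg_log_kb
    monthly_kb = daily_kb * 30
    # Assumption: decimal units, 1 GB = 1,000,000 KB.
    return monthly_kb * retention_months / 1_000_000


def hot_tier_gb(largest_model_gb: float) -> float:
    """Hot tier sized at 2x the largest model, per the tier1_hot rule."""
    return 2 * largest_model_gb


if __name__ == "__main__":
    # 5,000 requests/day with 12 months of retention.
    print(log_storage_gb(5_000, 12))
    # A 70B model in FP16 (140GB of weights).
    print(hot_tier_gb(140))
```

At these volumes logs are small next to model weights, so the model tiers dominate the storage budget.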
Network Planning
Bandwidth Requirements
| Component | Traffic Type | Bandwidth Need |
|---|---|---|
| API Requests | Client → Server | 1-10 Mbps per concurrent user |
| Responses | Server → Client | 5-50 Mbps per concurrent user |
| Model Loading | Storage → GPU | 10+ Gbps (reduces load time) |
| Monitoring | Server → Collector | 10-100 Mbps |
| Replication | Server → Backup | Varies by backup frequency |
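To see why the model-loading row calls for 10+ Gbps, the transfer time for a set of weights can be estimated from link speed alone. This is an idealized sketch that ignores protocol overhead, disk read speed, and deserialization, so real load times will be longer; the function name is hypothetical.

```python
def model_load_seconds(model_size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time: gigabytes converted to gigabits, divided by link rate."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return model_size_gb * 8 / link_gbps


if __name__ == "__main__":
    # 70B FP16 weights (140GB) over 10, 25, and 100 Gbps links.
    for gbps in (10, 25, 100):
        print(gbps, "Gbps:", model_load_seconds(140, gbps), "s")
```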
Network Architecture
```
┌─────────────────────────────────────────────────────┐
│                  Corporate Network                  │
│                      (10 GbE)                       │
└──────────────────────┬──────────────────────────────┘
                       │
┌──────────────────────┴──────────────────────────────┐
│                    Load Balancer                    │
│                 (25-100 GbE uplink)                 │
└──────────────────────┬──────────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │                           │
    ┌────┴────┐                 ┌────┴────┐
    │ AI Node │ ◄──(25 GbE)──►  │ AI Node │
    │   #1    │                 │   #2    │
    └────┬────┘                 └────┬────┘
         │                           │
         └─────────────┬─────────────┘
                       │
              ┌────────┴────────┐
              │  Storage Array  │
              │    (100 GbE)    │
              └─────────────────┘
```
TCO Calculation
Hardware TCO Template
```markdown
## 3-Year Total Cost of Ownership

### Capital Expenditure (CapEx)
| Item | Unit Cost | Quantity | Total |
|------|-----------|----------|-------|
| Server (compute) | $15,000 | 2 | $30,000 |
| GPUs (A100 80GB) | $15,000 | 4 | $60,000 |
| Storage (NVMe) | $500/TB | 8TB | $4,000 |
| Network equipment | $5,000 | 1 | $5,000 |
| Installation | $2,000 | 1 | $2,000 |
| **CapEx Total** | | | **$101,000** |

### Operating Expenses (OpEx) - Annual
| Item | Monthly | Annual |
|------|---------|--------|
| Power (3kW average) | $400 | $4,800 |
| Cooling | $100 | $1,200 |
| Maintenance/support | $500 | $6,000 |
| Hosting/colocation | $1,000 | $12,000 |
| Admin labor (0.25 FTE) | $2,500 | $30,000 |
| **Annual OpEx** | **$4,500** | **$54,000** |

### 3-Year TCO
| Year | CapEx | OpEx | Cumulative |
|------|-------|------|------------|
| Year 1 | $101,000 | $54,000 | $155,000 |
| Year 2 | $0 | $54,000 | $209,000 |
| Year 3 | $0 | $54,000 | $263,000 |

### Per-Request Cost (at 500K requests/month)
Year 1: $155,000 / 6M requests = $0.026/request
Year 3: $263,000 / 18M requests = $0.015/request (amortized)
```
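The cumulative and per-request figures in the template follow from two short formulas, sketched here so they can be recomputed with other inputs. The function names are hypothetical; the inputs are the template's own numbers.

```python
def cumulative_tco(capex: float, annual_opex: float, years: int) -> list:
    """Cumulative cost at the end of each year: CapEx once, OpEx every year."""
    return [capex + annual_opex * year for year in range(1, years + 1)]


def cost_per_request(cumulative_cost: float, requests_per_month: float, months: int) -> float:
    """Cumulative cost divided by all requests served over the period."""
    return cumulative_cost / (requests_per_month * months)


if __name__ == "__main__":
    totals = cumulative_tco(101_000, 54_000, 3)
    print(totals)
    print(round(cost_per_request(totals[0], 500_000, 12), 3))  # Year 1
    print(round(cost_per_request(totals[2], 500_000, 36), 3))  # Year 3, amortized
```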
Cloud API Cost Comparison
```markdown
## Local vs Cloud Cost Comparison

### Assumptions
- 50 developers, medium usage
- 100 requests/dev/day = 5,000 requests/day
- 3,000 tokens/request average
- 15M tokens/day = 450M tokens/month

### Cloud API Costs (GPT-4o-mini pricing)
- Input: $0.15/1M tokens × 150M = $22.50/month
- Output: $0.60/1M tokens × 300M = $180/month
- Total: ~$200/month = $2,400/year

### Cloud API Costs (GPT-4o pricing)
- Input: $2.50/1M tokens × 150M = $375/month
- Output: $10.00/1M tokens × 300M = $3,000/month
- Total: $3,375/month = $40,500/year

### Local Large Model (Qwen-Next / MiniMax-M2 / GLM-4.6)
- Year 1 TCO: $155,000
- Equivalent cloud cost: $40,500/year
- Breakeven: 3.8 years ($155,000 / $40,500), counting Year 1 TCO only; the $54,000 annual OpEx in later years exceeds the cloud bill, so at this volume local does not pay back on cost alone

### With Data Sovereignty Premium
If data can't go to the cloud, local is the only option.
Value of data sovereignty: Priceless / Required
```
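The comparison above can be recomputed with a short script. As an assumption not taken from the template, this sketch also uses a stricter breakeven: CapEx divided by annual savings (cloud cost minus local OpEx), fed with the $101,000 CapEx and $54,000 annual OpEx from the TCO template. The function names are hypothetical.

```python
from typing import Optional


def monthly_api_cost(input_mtok: float, output_mtok: float,
                     input_price: float, output_price: float) -> float:
    """Monthly API bill; token volumes in millions, prices in $ per 1M tokens."""
    return input_mtok * input_price + output_mtok * output_price


def breakeven_years(capex: float, local_annual_opex: float,
                    cloud_annual_cost: float) -> Optional[float]:
    """Years until CapEx is repaid by annual savings, or None if there are none."""
    annual_savings = cloud_annual_cost - local_annual_opex
    if annual_savings <= 0:
        return None  # Local running costs alone match or exceed the cloud bill.
    return capex / annual_savings


if __name__ == "__main__":
    gpt4o_monthly = monthly_api_cost(150, 300, 2.50, 10.00)
    print(gpt4o_monthly, gpt4o_monthly * 12)
    print(breakeven_years(101_000, 54_000, gpt4o_monthly * 12))
```

With the template's figures the second function returns `None`, which suggests the cost case for local hardware depends on higher request volume or lower OpEx than assumed here.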
Scaling Strategy
Horizontal Scaling Triggers
| Metric | Add Capacity When | Scale Strategy |
|---|---|---|
| GPU Utilization | >80% sustained | Add GPU or node |
| Queue Depth | >10 requests sustained | Add replica |
| P95 Latency | >5s sustained | Add GPU for parallelism |
| Memory Pressure | >90% VRAM | Larger GPU or quantization |
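The triggers above map naturally to a threshold check. In this sketch the caller is assumed to pass values already averaged over a window, since the table requires the conditions to be sustained; the function name and parameter names are hypothetical.

```python
def scaling_actions(gpu_util_pct: float, queue_depth: float,
                    p95_latency_s: float, vram_used_pct: float) -> list:
    """Return the scale strategies whose trigger thresholds are exceeded.

    Inputs are assumed to be averages over a sustained window, not spot readings.
    """
    actions = []
    if gpu_util_pct > 80:
        actions.append("Add GPU or node")
    if queue_depth > 10:
        actions.append("Add replica")
    if p95_latency_s > 5:
        actions.append("Add GPU for parallelism")
    if vram_used_pct > 90:
        actions.append("Larger GPU or quantization")
    return actions


if __name__ == "__main__":
    print(scaling_actions(gpu_util_pct=85, queue_depth=3, p95_latency_s=2.0, vram_used_pct=95))
```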
Vertical Scaling Path
```
Stage 1: Single RTX 4090 (24GB)
    ↓ Need more VRAM
Stage 2: Single A6000 (48GB)
    ↓ Need more throughput
Stage 3: 2x A6000 with tensor parallel
    ↓ Need larger models
Stage 4: 2x A100 80GB
    ↓ Need more throughput
Stage 5: 4x A100 80GB with NVLink
    ↓ Need maximum performance
Stage 6: 8x H100 with NVSwitch
```
Best Practices
Procurement
- Budget 20% contingency for unexpected needs
- Test a single unit before buying in bulk
- Consider used enterprise GPUs (for example, A100s at roughly 50% of new cost)
- Plan for 3-year lifecycle (hardware depreciation)
- Include installation and training in budget
Deployment
- Start small, scale up - validate before expanding
- Keep 30% headroom for traffic spikes
- Plan upgrade path before initial deployment
- Document all specifications for future reference
Monitoring
- Track utilization trends weekly
- Plan capacity 3-6 months ahead
- Review TCO quarterly against cloud alternatives
- Update sizing models with actual usage data
This skill ensures organizations size hardware appropriately for their AI workloads, optimizing for both performance and cost-effectiveness.