Data Cleanup & Formatting
Clean, standardize, and validate messy data files — fix formats, remove duplicates, and get analysis-ready datasets.
80% of your analysis time is spent cleaning data, not analyzing it. Dates are in 4 formats, company names have 3 spellings each, there are 200 duplicates, and someone entered 'N/A' in a numeric field. You can't trust insights from data you can't trust.
Who it's for:
- Data analysts spending most of their time cleaning before analyzing
- Operations teams receiving messy data exports from multiple systems
- Researchers standardizing datasets from different sources
- Marketing teams cleaning CRM data for segmentation
- Anyone who receives CSVs and spreadsheets that need serious cleanup before use
## Example
"Clean up this customer database export" → Cleaned dataset: dates standardized to ISO format, 150 duplicates merged, company names normalized, 30 invalid emails flagged, missing fields filled where inferrable, and a data quality report showing before/after statistics
New here? Follow the Quick Start below. Already set up? Copy the template that follows.
# Data Cleanup & Formatting
## Your Role
You are an expert data analyst and quality specialist. Your job is to clean, standardize, and validate messy datasets while maintaining complete traceability of all changes.
## Core Principles
- Never modify source files — always create new cleaned versions
- Document every change in a detailed change log
- Flag ambiguous items for human review rather than guessing
- Validate output (row counts, totals) against source data
- Apply consistent standards across all fields
## Instructions
When given a data file to clean, perform:
1. **COLUMN STANDARDIZATION** - Consistent headers, data types
2. **DATE FORMATTING** - Convert all dates to specified format (default: YYYY-MM-DD)
3. **TEXT STANDARDIZATION** - Proper case, trimmed whitespace, consistent abbreviations
4. **MISSING VALUES** - Flag, fill with defaults, or mark for review
5. **DUPLICATES** - Identify, merge, or flag for review
6. **DATA VALIDATION** - Check ranges, formats, cross-field consistency
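The steps above can be sketched in plain Python. This is a minimal illustration, not the template's actual implementation: the column names (`email`, `signup_date`) and the list of date formats tried are assumptions, and a real pass would also write every change to the log.

```python
from datetime import datetime

# Date formats to attempt, in order (an assumed list for this sketch).
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%y", "%m/%d/%Y", "%b %d, %Y"]

def normalize_date(value):
    """Try each known format; return an ISO date, or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

def clean_rows(rows, key="email"):
    """Trim whitespace, normalize headers and dates, drop duplicate keys.

    Returns (cleaned, review): rows that could not be parsed go to
    `review` for human judgment instead of being guessed at.
    """
    seen, cleaned, review = set(), [], []
    for row in rows:
        # Column standardization: lowercase, whitespace-trimmed headers.
        row = {k.strip().lower(): v.strip() for k, v in row.items()}
        if "signup_date" in row:
            iso = normalize_date(row["signup_date"])
            if iso is None:
                review.append(row)   # flag for review rather than guessing
                continue
            row["signup_date"] = iso
        dedupe_key = row.get(key, "").lower()
        if dedupe_key in seen:
            continue                  # duplicate: keep the first occurrence
        seen.add(dedupe_key)
        cleaned.append(row)
    return cleaned, review
```

Merging duplicates (rather than keeping the first) and cross-field validation would hang off the same loop; the key design point is that ambiguous rows land in `review`, never silently in `cleaned`.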
## Output Files
1. `cleaned-[filename].csv` - Standardized, validated dataset
2. `change-log.md` - Every modification with row references
3. `data-quality-report.md` - Summary of issues found and resolved
4. `review-needed.csv` - Rows requiring human judgment
## Commands
- "Clean this CSV" - Full cleanup pipeline
- "Find duplicates" - Duplicate detection and report
- "Standardize dates/phones/names" - Specific field cleanup
- "Data quality report" - Assessment without modifications
## What This Does
Takes messy CSV/Excel files with inconsistent formats, duplicates, and errors, and produces clean, standardized, analysis-ready data. Includes a detailed change log so you know exactly what was modified. Handles what used to take hours of manual cleanup.
## Quick Start
### Step 1: Download the Template
Click Download above to get the CLAUDE.md file.
### Step 2: Place Your Data File
Put the messy CSV or Excel file in your working directory.
### Step 3: Start Using It
Run `claude` in that directory, then say: "Clean up customers.csv — standardize dates to YYYY-MM-DD, fix phone numbers, remove duplicates, and flag missing emails"
## What Gets Cleaned
| Area | Before → After |
|---|---|
| Dates | "Jan 5, 2024" / "1/5/24" → "2024-01-05" |
| Phone numbers | "(555) 123-4567" / "5551234567" → "+1-555-123-4567" |
| Names | "john smith" / "JOHN SMITH" → "John Smith" |
| Duplicates | 3 entries for same person → 1 merged record |
| Missing values | Empty fields → Flagged for review |
| Whitespace | Extra spaces, line breaks → Clean text |
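The phone and name rows of the table can be sketched as small normalizers. This assumes US 10-digit numbers, and returning `None` stands in for "flag for review"; the hyphenated `+1` format shown matches the table, not any universal standard.

```python
import re

def normalize_phone(raw):
    """Reduce to digits; format 10-digit US numbers as +1-XXX-XXX-XXXX."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # strip a leading country code
    if len(digits) != 10:
        return None                  # flag for review instead of guessing
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def normalize_name(raw):
    """Collapse internal whitespace and apply title case.

    Note: .title() is naive about names like "McDonald" — another
    reason borderline cases belong in review-needed.csv.
    """
    return " ".join(raw.split()).title()
```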
## What You Get
1. `cleaned-customers.csv` — Ready for analysis
2. `change-log.md` — Every modification documented
3. `data-quality-report.md` — Issues found and resolved
4. `review-needed.csv` — Rows requiring human judgment
## Tips
- Never overwrite source data: Always create a new "cleaned" file
- Start with 100 rows: Verify rules work before processing the full dataset
- Save your standards: Create a "data-standards.md" for reuse across projects
- Validate results: Check row counts and totals match the original
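The "validate results" tip can be automated as a one-line reconciliation check. A sketch, assuming the merged and flagged counts come from the change log and `review-needed.csv`:

```python
def counts_reconcile(original_rows, cleaned_rows, merged_dupes, flagged_rows):
    """True when every source row is accounted for: kept, merged away, or flagged."""
    return original_rows == cleaned_rows + merged_dupes + flagged_rows
```

If the check fails, rows were silently dropped somewhere in the pipeline and the change log deserves a closer look.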
## Commands
- "Clean and standardize this CSV file"
- "Show me all duplicate entries in this dataset"
- "Standardize all dates to ISO format (YYYY-MM-DD)"
- "Generate a data quality report for this file"
## Troubleshooting
- **Lost rows after cleanup:** Check the change log; duplicates may have been merged. Say: "Show me which rows were merged"
- **Wrong format applied:** Specify exactly: "Use MM/DD/YYYY for dates, not ISO format"
- **Too many flagged items:** Set thresholds: "Only flag rows missing more than 3 fields"
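The "only flag rows missing more than 3 fields" threshold amounts to a simple count of empty-looking values. A sketch, with an assumed list of placeholder strings treated as missing:

```python
def needs_review(row, max_missing=3):
    """Flag a row only when more than max_missing fields are empty or placeholders."""
    empties = ("", "n/a", "na", "null")   # assumed placeholder set
    missing = sum(
        1 for v in row.values()
        if v is None or str(v).strip().lower() in empties
    )
    return missing > max_missing
```

Raising `max_missing` is exactly what the "Set thresholds" instruction does: fewer rows land in `review-needed.csv`, at the cost of letting sparser rows through.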