📂 CustomTextParser

🔄 Concordance .DAT File Toolkit

Easily convert and manipulate Concordance .DAT load files for legal e-discovery, metadata extraction, and bulk processing.


🛠 What It Does

A Python CLI tool for handling complex .DAT files with custom delimiters (þ, control characters), broken encodings, and data that Excel cannot open cleanly.

This tool can:

  • ✅ Convert .DAT to .CSV and .CSV back to .DAT
  • 🔍 Compare two .DAT files (with optional field mapping)
  • 🧠 Replace or remap headers
  • 🔗 Merge multiple .DAT files, grouped by matching headers
  • 🧹 Delete rows based on field values
  • 🎯 Extract and export selected fields


⚑ Cython Acceleration (v1.1+)

This tool now uses Cython-compiled quote-aware parsing for maximum speed on large .DAT files.

🚀 Performance Gain

| File Size | Rows  | Before (Pure Python) | Now (Cython) |
|-----------|-------|----------------------|--------------|
| 131 MB    | ~90k  | ~17 sec              | 3.45 sec     |
| 204 MB    | ~1.1M | ~52 sec              | 13.56 sec    |
| 1.06 GB   | ~5.7M | ~300 sec             | 64.39 sec    |

✅ Quote-safe, newline-tolerant, and roughly 4 to 5× faster than the previous version.

🧱 How It Works

A custom parser module (quote_split_chunked.pyx) is written in Cython and compiled to a native .pyd extension, enabling fast, chunked line processing while preserving quote-state logic.
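The idea of carrying quote state across chunks can be sketched in pure Python. This is an illustration only, not the contents of quote_split_chunked.pyx; the function name split_records and the þ qualifier are assumptions.

```python
QUOTE = "\u00fe"  # þ, assumed text qualifier


def split_records(chunks):
    """Yield complete records from an iterable of text chunks.

    A newline ends a record only when the parser is outside a quoted field,
    so the quote state must survive from one chunk to the next.
    """
    buffer = []
    in_quotes = False
    for chunk in chunks:
        for ch in chunk:
            if ch == QUOTE:
                in_quotes = not in_quotes
                buffer.append(ch)
            elif ch == "\n" and not in_quotes:
                yield "".join(buffer).rstrip("\r")
                buffer = []
            else:
                buffer.append(ch)
    if buffer:
        # Trailing record without a final newline.
        yield "".join(buffer)
```

Because in_quotes and buffer live outside the chunk loop, a quoted field that starts in one chunk and ends in the next is still treated as a single field.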

🛠 Compiling the Cython Module

Install a C compiler first:

  • Windows: Visual C++ Build Tools
  • Linux: sudo apt install build-essential python3-dev
  • macOS: xcode-select --install

Then build:

python setup.py build_ext --inplace

⚙️ Key Features

  • Handles Concordance .DAT files with embedded line breaks
  • Supports various encodings: UTF-8, UTF-16, Windows-1252, and more
  • Warns about fields that exceed Excel's 32,767-character cell limit
  • CLI-first design, ideal for automation and scripting

🚀 Use Cases

  • Legal eDiscovery processing
  • Metadata cleanup and normalization
  • Custom conversions and field extraction
  • Comparing vendor-delivered load files

📦 Installation

Clone the repo

git clone https://github.com/yourusername/dat-file-tool.git
cd dat-file-tool

Install dependencies (optional)

  • Python 3.7+
  • Requires: chardet
  • Optional: cython for native-speed parsing

✨ Features

  • βœ… Convert .dat to .csv | .csv to .dat or keep as .dat
  • πŸ”€ Compare two .dat files (with optional header mapping)
  • 🧹 Delete specific rows from .dat using a value list
  • πŸ” Merge .dat files by common headers
  • πŸ”€ Auto-detect encoding (UTF-8, UTF-16, Windows-1252, Latin-1)
  • πŸ’¬ Smart line reader handles embedded newlines and quoted fields
  • πŸ“ Output directory support via -o DIR
  • ⚠️ Excel field-length warning for long text fields (>32,767 chars)
  • 🎯 Select only specific fields from a DAT file using --select

| Flag | Description |
|------|-------------|
| --csv | Export DAT file to CSV format (comma-separated values) |
| --tsv | Export DAT file to TSV format (tab-separated values) |
| --dat | Export to DAT format (default if none specified) |
| --c, --compare | Compare two DAT files line by line |
| --r, --replace-header | Replace headers using a mapping file (old_header,new_header) |
| --merge | Merge multiple DAT files grouped by matching headers |
| --delete | Delete rows based on field values listed in a file |
| --select | Export only selected fields from the DAT file |
| --join | Strictly join two DAT files using a key field, with duplicate header conflict resolution |
| --key | Key field required to perform a join |
| -o, --output-dir | Specify output directory for generated files |
| --reorder-header, --reorder | Reorder headers based on a specified order file |
| --split | Split converted output into N files (even split) |
| --max-rows | Maximum rows per output file (e.g., 10000) |
| --group-by | Keep groups (by FIELD) intact when splitting |

🧪 Usage Examples

🔁 Convert DAT to CSV / TSV

python Main.py input.dat --csv
# Output: input_converted.csv

python Main.py input.dat --tsv
# Output: input_converted.tsv

✂️ Split Output into Multiple Files ✅

Split the converted output into multiple files either by number of files or by maximum rows per file. Use --group-by to keep related rows (families) intact.

# 1) Evenly split into 3 files
python Main.py input.dat --csv --split 3
# Output: input_part1.csv, input_part2.csv, input_part3.csv

# 2) Split into files containing up to 10,000 rows each
python Main.py input.dat --csv --max-rows 10000
# Output: input_part1.csv, input_part2.csv, ... (each up to 10000 rows)

# 3) Keep families intact while splitting into 3 files (group by 'Family' header)
python Main.py input.dat --csv --split 3 --group-by Family
# Output: each file contains whole families; no family is split across files

# Note: If a single family's row count exceeds the requested --max-rows, that family will be placed alone in a file with a warning.
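The interaction between --max-rows and --group-by can be pictured with a small packing routine. This is a sketch under stated assumptions, not the tool's code: the name split_by_max_rows is hypothetical, rows are represented as dicts, and rows of the same family are assumed to be adjacent in the input.

```python
from itertools import groupby


def split_by_max_rows(rows, max_rows, group_key):
    """Pack rows into parts of at most max_rows, never splitting a group.

    Rows of one group are assumed to be adjacent, as families usually are
    in load files. A group larger than max_rows ends up alone in its part.
    """
    parts, current = [], []
    for _, group in groupby(rows, key=lambda r: r[group_key]):
        family = list(group)
        if len(family) > max_rows:
            print(f"warning: group of {len(family)} rows exceeds max_rows={max_rows}")
        # Start a new part when this family would overflow the current one.
        if current and len(current) + len(family) > max_rows:
            parts.append(current)
            current = []
        current.extend(family)
    if current:
        parts.append(current)
    return parts
```

With families of sizes 2, 1, and 3 and a limit of 3 rows, the first two families share a part and the third gets its own.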

🎥 Demo Example

[Demo animation]

You can also specify custom output paths:

python Main.py input.dat --csv output.csv


2. Compare Two Files

Compare two DAT files and generate a detailed difference report:

# Simple comparison
python Main.py file1.dat file2.dat --compare

# With header mapping (useful for comparing files with different headers)
python Main.py file1.dat file2.dat --compare --mapping mapping.txt

Mapping File Format (mapping.txt):

OldHeader1,NewHeader1
OldHeader2,NewHeader2

Output: Creates file1_diff.csv containing all differences, with SHA-256 hashes for verification.
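A line-by-line comparison with per-row hashes might look like the sketch below. The function names, the dict layout of a difference, and the choice of the unit separator as a join character are all assumptions made for illustration; the actual layout of file1_diff.csv is not documented here.

```python
import hashlib


def row_hash(row):
    """SHA-256 of a row's fields, joined with a separator unlikely to occur in data."""
    joined = "\x1f".join(row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def diff_rows(rows_a, rows_b):
    """Compare two row lists position by position and return the differing lines."""
    diffs = []
    for line_no in range(max(len(rows_a), len(rows_b))):
        a = rows_a[line_no] if line_no < len(rows_a) else None
        b = rows_b[line_no] if line_no < len(rows_b) else None
        if a != b:
            diffs.append({
                "line": line_no + 1,  # 1-based line number
                "hash_a": row_hash(a) if a is not None else "",
                "hash_b": row_hash(b) if b is not None else "",
            })
    return diffs
```

A row that exists in only one file is reported with an empty hash on the other side.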


3. Replace Headers

Replace or rename column headers using a mapping file:

python Main.py data.dat --replace-header mapping.txt

Mapping File Format:

OldName,NewName
Age,PersonAge
Score,TestScore

Output: Creates data_Replaced.dat with renamed headers.
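The renaming step itself is a dictionary lookup per header. The following sketch assumes the mapping file has already been read into a string; load_mapping and replace_headers are hypothetical names.

```python
import csv
import io


def load_mapping(text):
    """Read 'old,new' pairs from mapping text, skipping blank or malformed lines."""
    mapping = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 2:
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def replace_headers(headers, mapping):
    """Rename headers found in the mapping; leave the others unchanged."""
    return [mapping.get(h, h) for h in headers]
```

Headers that do not appear in the mapping pass through untouched, so a partial mapping file is enough.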


4. Select Specific Fields

Extract only selected columns from a file:

python Main.py data.dat --select fields.txt

Select File Format (fields.txt):

Name
Email
Age

Output: Creates data_selected.dat containing only the specified fields.
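Field selection reduces to picking column indexes once and applying them to every row. A minimal sketch, with select_fields as a hypothetical name and the assumption that a missing field should raise an error rather than be skipped:

```python
def select_fields(headers, rows, wanted):
    """Return (headers, rows) restricted to the wanted fields, in the wanted order."""
    missing = [name for name in wanted if name not in headers]
    if missing:
        raise KeyError(f"fields not found: {missing}")
    idx = [headers.index(name) for name in wanted]
    return list(wanted), [[row[i] for i in idx] for row in rows]
```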


5. Delete Rows

Remove rows matching specific field values:

python Main.py data.dat --delete delete_list.txt

Delete File Format (delete_list.txt):

Status
Inactive
Deleted
Suspended

The first line names the field; the following lines are the values to delete.

Output:

  • Creates data{kept}.dat (rows to keep)
  • Creates data{removed}.dat (rows deleted)
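The kept/removed split can be sketched as a single pass over the rows. The name partition_rows is hypothetical, and exact string matching on the field value is an assumption.

```python
def partition_rows(headers, rows, delete_spec_lines):
    """Split rows into (kept, removed) using a delete list.

    delete_spec_lines[0] names the field; the remaining lines are values to delete.
    """
    lines = [ln.strip() for ln in delete_spec_lines if ln.strip()]
    field, values = lines[0], set(lines[1:])
    col = headers.index(field)  # raises ValueError if the field is absent
    kept, removed = [], []
    for row in rows:
        (removed if row[col] in values else kept).append(row)
    return kept, removed
```

Writing both lists out, rather than only the kept rows, makes it easy to audit what was deleted.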

6. Join Two Files

Perform a strict inner join on two DAT files based on key fields:

python Main.py file1.dat file2.dat --join --key "UserID"

# Multiple key fields
python Main.py file1.dat file2.dat --join --key "UserID Department"

Features:

  • Validates key field existence in both files
  • Detects and handles duplicate headers with three resolution modes:
    1. Suffix mode: Adds _2 to file2 column names
    2. File1 mode: Keeps file1 values (default)
    3. File2 mode: Overwrites with file2 values
  • Detects and reports duplicate keys with error handling

Output: Creates file1_joined.dat containing merged data.
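The three resolution modes can be illustrated with a small in-memory join. This is a sketch, not the tool's implementation: inner_join, the mode names "suffix", "file1", and "file2", and the decision to raise on duplicate keys are assumptions based on the description above.

```python
def inner_join(h1, rows1, h2, rows2, keys, mode="file1"):
    """Inner-join two tables on keys; mode is 'suffix', 'file1', or 'file2'."""
    for k in keys:
        if k not in h1 or k not in h2:
            raise KeyError(f"key field {k!r} missing from one of the files")

    def key_of(headers, row):
        return tuple(row[headers.index(k)] for k in keys)

    index2 = {}
    for row in rows2:
        k = key_of(h2, row)
        if k in index2:
            raise ValueError(f"duplicate key in file2: {k}")
        index2[k] = row

    extra = [h for h in h2 if h not in keys]   # non-key columns of file2
    clashes = {h for h in extra if h in h1}    # headers present in both files
    if mode == "suffix":
        out_headers = h1 + [h + "_2" if h in clashes else h for h in extra]
    else:
        out_headers = h1 + [h for h in extra if h not in clashes]

    joined, seen1 = [], set()
    for row in rows1:
        k = key_of(h1, row)
        if k in seen1:
            raise ValueError(f"duplicate key in file1: {k}")
        seen1.add(k)
        match = index2.get(k)
        if match is None:
            continue                           # inner join: drop unmatched rows
        rec = dict(zip(h1, row))
        for h in extra:
            value = match[h2.index(h)]
            if h not in clashes:
                rec[h] = value
            elif mode == "suffix":
                rec[h + "_2"] = value
            elif mode == "file2":
                rec[h] = value                 # overwrite file1's value
        joined.append([rec[h] for h in out_headers])
    return out_headers, joined
```

In "file1" mode a clashing column from file2 is simply ignored, which matches the described default of keeping file1 values.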


7. Merge Multiple Files

Merge multiple DAT files with automatic header validation:

python Main.py merge_list.txt --merge

Merge List Format (merge_list.txt):

/path/to/file1.dat
/path/to/file2.dat
/path/to/file3.dat

Features:

  • Groups files by header hash
  • Creates separate output files for each group
  • Generates merge log with file counts and row statistics
  • Validates file existence and readability
  • Excludes problematic files with detailed warnings

Output:

  • Creates merge_list_group_1.dat, merge_list_group_2.dat, etc.
  • Creates merge_list_merge_log.csv with merge statistics
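Grouping by header hash can be sketched as follows. The names header_hash and group_by_header are hypothetical, the choice of SHA-256 is an assumption, and files are represented as already-parsed (headers, rows) pairs to keep the example self-contained.

```python
import hashlib
from collections import defaultdict


def header_hash(headers):
    """Stable fingerprint of a header row (order-sensitive)."""
    return hashlib.sha256("\x1f".join(headers).encode("utf-8")).hexdigest()


def group_by_header(files):
    """files maps a path to (headers, rows); returns one merged table per header hash."""
    groups = defaultdict(lambda: {"headers": None, "rows": [], "sources": []})
    for path, (headers, rows) in files.items():
        g = groups[header_hash(headers)]
        g["headers"] = headers
        g["rows"].extend(rows)
        g["sources"].append(path)
    return list(groups.values())
```

Each returned group would correspond to one merge_list_group_N.dat file, and the sources list carries the information a merge log needs.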

⚙️ Optional Arguments

| Flag | Description |
|------|-------------|
| -o DIR | Set output directory |
| --help | Show help message |

📦 Output Files

  • All exports go to the directory specified by -o, or default to the input file's folder.
  • Output filenames include tags like {kept}, {removed}, or _Replaced.

💡 Encoding Detection Logic

Handles common encodings reliably:

  • ✅ UTF-8
  • ✅ UTF-8 with BOM
  • ✅ UTF-16 LE / BE (BOM detection)
  • 🔍 Uses chardet fallback for uncertain cases (based on confidence)
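The order of checks listed above could be sketched like this. The function name detect_encoding, the 0.5 confidence threshold, and the cp1252 last-resort default are assumptions; only chardet.detect and the codecs BOM constants are real library APIs.

```python
import codecs


def detect_encoding(raw):
    """Guess the encoding of raw bytes: BOM first, then strict UTF-8, then chardet."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"          # the utf-16 codec reads the BOM to pick byte order
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        import chardet           # optional dependency
    except ImportError:
        return "cp1252"          # assumed last-resort default
    guess = chardet.detect(raw)
    if guess["encoding"] and guess["confidence"] >= 0.5:  # threshold is an assumption
        return guess["encoding"]
    return "cp1252"
```

Checking the BOM and strict UTF-8 before calling chardet keeps the statistical guess for the cases that really need it.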

🧪 Excel Limit Check

Warns if any field exceeds Excel's max cell limit (32,767 chars).
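A check of this kind needs only the field lengths. In the sketch below, find_long_fields is a hypothetical name and the report format is an assumption; the 32,767 limit is the one stated above.

```python
EXCEL_CELL_LIMIT = 32767  # maximum characters Excel stores in one cell


def find_long_fields(headers, rows, limit=EXCEL_CELL_LIMIT):
    """Return (row_number, header, length) for every field longer than limit."""
    hits = []
    for row_no, row in enumerate(rows, start=1):
        for header, value in zip(headers, row):
            if len(value) > limit:
                hits.append((row_no, header, len(value)))
    return hits
```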


📝 Requirements

  • Python 3.7+
  • Dependencies: see requirements.txt (chardet; Cython is optional)

🧰 Development Tips

VS Code Debug Setup (optional)

Add .vscode/launch.json:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug Merge Example",
      "type": "python",
      "request": "launch",
      "program": "${workspaceFolder}/Main.py",
      "console": "integratedTerminal",
      "args": [
        "--merge", "File_list.csv", "--csv", "-o", "merged/"
      ]
    }
  ]
}

🀝 Contributing

Feel free to fork, enhance, or report issues! Contributions are welcome πŸ’¬


👤 Author

Md Ehsan Ahsan
📧 MyGitHub
🛠️ Built with love using Python 🐍


⚠️ Disclaimer

This tool is provided as-is without any warranties.
Use it at your own risk.
I am not responsible if it eats your files, breaks your computer, or ruins your spreadsheet.

🚀 But hey, if it helps you automate the boring stuff, you're welcome! 😄


📝 License

This project is free to use under the MIT License.