A utility for web crawling and scraping
- Blazing fast – Built with Rust & tokio for async I/O and concurrency
- Smart filtering – URL-based keyword matching (no content fetching required)
- NDJSON output – One JSON object per line for easy streaming
- BFS crawling – Breadth-first traversal with configurable depth limits
- Rate limiting – Configurable request rate (default: ~2 req/sec)
- Error recovery – Gracefully handles network errors and broken links
- Rich logging – Colored, timestamped logs with context chains
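Because each result is a single JSON object on its own line, consumers can process the crawl as a stream instead of buffering a whole document. A minimal Python sketch, assuming records carry the `URL`/`Title`/`Links` fields shown in the output examples below (the sample data here is made up):

```python
import json
from io import StringIO

# Illustrative stand-in for crawn's stdout; in practice you would read
# from sys.stdin or an .ndjson file. The numbers are hypothetical.
stream = StringIO(
    '{"URL": "https://example.com", "Title": "Example Domain", "Links": 12}\n'
    '{"URL": "https://example.com/about", "Title": "About Us", "Links": 9}\n'
)

# One JSON object per line: parse line by line, no slurping required.
records = [json.loads(line) for line in stream if line.strip()]
total_links = sum(r["Links"] for r in records)
print(f"{len(records)} pages, {total_links} outgoing links")
```

The same loop works unchanged on `crawn ... > output.ndjson` files, since NDJSON needs no surrounding array or trailing-comma bookkeeping.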
Run this command (requires cargo):

```sh
cargo install crawn
```

- Or build from source (requires cargo):

```sh
git clone https://github.com/Tahaa-Dev/crawn.git
cd crawn
cargo build --release
```

- Basic Crawling:

```sh
crawn https://example.com
```

- With Logging:

```sh
crawn -l crawler.log https://example.com
```

- Verbose Mode (Log All Requests):

```sh
crawn -v https://example.com | bat --language json
```

- Custom Depth Limit:

```sh
crawn -m 3 https://example.com | grep 'rust' > output.ndjson
```

- Full HTML:

```sh
crawn --include-content https://example.com | jq -s '.'
```

- Extracted text only:

```sh
crawn --include-text https://example.com | grep 'rust' | jq -s '.' > output.json
```

Results are written as NDJSON (newline-delimited JSON):
```json
{"URL": "https://example.com", "Title": "Example Domain", "Links": 12}
{"URL": "https://example.com/about", "Title": "About Us", "Links": 9}
{"URL": "https://example.com/contact", "Title": "Contact", "Links": 48}
```

- With `--include-text`:

```json
{"URL": "https://example.com", "Title": "Example Domain", "Links": 27, "Text": "Example Domain\nThis domain is..."}
```

- With `--include-content`:

```json
{"URL": "https://example.com", "Title": "Example Domain", "Links": 30, "Content": "<!DOCTYPE html>\n<html>..."}
```

- INFO (verbose mode only): Request logs
- WARN (always): Recoverable errors (404, network timeouts)
- FATAL (always): Unrecoverable errors (invalid URL, disk full)
```
2026-01-24 02:37:40.351 [INFO]: Sent request to URL: https://example.com
2026-01-24 02:37:41.123 [WARN]: Failed to fetch URL: https://example.com/broken-link
    Cause: HTTP 404 Not Found
```
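Log lines follow a `timestamp [LEVEL]: message` shape, so recoverable failures can be pulled out of a saved log with a few lines of Python. This is a sketch against the format shown above, not an official parser:

```python
import re

# Sample log content copied from the examples above.
log = """\
2026-01-24 02:37:40.351 [INFO]: Sent request to URL: https://example.com
2026-01-24 02:37:41.123 [WARN]: Failed to fetch URL: https://example.com/broken-link
    Cause: HTTP 404 Not Found
"""

# Match "date time [LEVEL]: message"; indented continuation lines
# (like "Cause: ...") deliberately do not match.
line_re = re.compile(r"^(\S+ \S+) \[(\w+)\]: (.*)$")

warnings = [
    m.group(3)
    for line in log.splitlines()
    if (m := line_re.match(line)) and m.group(2) == "WARN"
]
print(warnings)
```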
- Crawl Documentation Site:

```sh
crawn https://doc.rust-lang.org/book/ > output.ndjson
```

- Crawl with Logging to custom file:

```sh
crawn -l crawler.log -v https://example.com | jq -s '.'
```

- Limit to 2 Levels Deep:

```sh
crawn -m 2 https://example.com > shallow.ndjson
```

- Same-domain only (no external links by design)
- No JavaScript rendering (static HTML only)
- No authentication (public pages only)
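The same-domain restriction can be mirrored when post-filtering results yourself. A hypothetical Python check (this is not crawn's internal logic, just host comparison via the standard library):

```python
from urllib.parse import urlparse

def same_domain(seed: str, candidate: str) -> bool:
    """Return True if candidate shares the seed URL's hostname."""
    return urlparse(seed).hostname == urlparse(candidate).hostname

print(same_domain("https://example.com", "https://example.com/about"))  # True
print(same_domain("https://example.com", "https://other.org/page"))     # False
```

Note this treats subdomains (e.g. `blog.example.com`) as different hosts; loosen the comparison if that is not what you want.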
- crawn is licensed under the MIT license.
- For specifics about contributing to crawn, see CONTRIBUTING.md.
- For new changes to crawn, see CHANGELOG.md.