A streaming Python script that searches multiple Wikipedia language editions for external links to specific domains, identifies the user who first introduced each link, and collects user information and recent contributions.
This project allows you to:
- Read a list of domains from a CSV file (`domains.csv`).
- Read a list of Wikipedia language codes from a CSV file (`wiki_versions.csv`).
- For each `(language, domain)` pair, search for external links on Wikipedia using the `exturlusage` API (sketched below).
- For each link found, immediately find the user who introduced it by scanning the revision history of that Wikipedia page.
- Collect the user's additional metadata (e.g. edit count, registration date) and some of their recent edits (via `usercontribs`).
- Write out:
  - `results.csv`: details about each external link usage (including the introducing user).
  - `user_info_all.csv`: user details for each discovered user.
  - `user_contributions_all.csv`: recent contributions for each discovered user.
This streaming approach minimises memory usage by writing data to CSV as soon as it's discovered and by de-duplicating `(lang, user)` pairs on disk (rather than storing them in memory).
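The search step maps onto the MediaWiki `exturlusage` endpoint. Below is a minimal, illustrative sketch of that call with API continuation; the function name and defaults are assumptions, not the script's actual code:

```python
import requests

def iter_external_links(lang, domain, batch_size=500):
    """Yield pages on {lang}.wikipedia.org whose external links match
    `domain`, streaming batch by batch via API continuation."""
    api = f"https://{lang}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,       # search pattern, e.g. "example.com"
        "eulimit": batch_size,   # results per request (API max for users: 500)
        "format": "json",
    }
    session = requests.Session()
    while True:
        data = session.get(api, params=params, timeout=30).json()
        for hit in data.get("query", {}).get("exturlusage", []):
            yield hit            # dict with 'pageid', 'title', 'url'
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume where the last batch ended

# Usage: stream every English Wikipedia page linking to example.com
# for page in iter_external_links("en", "example.com"):
#     print(page["title"], page["url"])
```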
- Parallel requests: uses multi-threading to speed up API calls.
- Memory-friendly: writes results as they are generated; does not hold all data in memory at once.
- User metadata: captures relevant user info (edit count, registration, user rights, etc.).
- User contributions: fetches up to a specified limit of recent user contributions (both lookups are sketched after this list).
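Both user lookups use standard MediaWiki list queries (`list=users` and `list=usercontribs`). A hedged sketch, with the English endpoint hard-coded for brevity and `USER_CONTRIB_LIMIT` mirroring the documented default of 10:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # per-language endpoint in practice
USER_CONTRIB_LIMIT = 10                     # matches the documented default

def fetch_user_info(username):
    """Fetch edit count, registration date, groups and rights for one user."""
    params = {
        "action": "query",
        "list": "users",
        "ususers": username,
        "usprop": "editcount|registration|groups|rights",
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=30).json()
    return data["query"]["users"][0]

def fetch_user_contribs(username, limit=USER_CONTRIB_LIMIT):
    """Fetch up to `limit` recent contributions for one user."""
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": username,
        "uclimit": limit,
        "ucprop": "title|timestamp|comment|sizediff",
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=30).json()
    return data["query"]["usercontribs"]
```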
- Python 3.7+ (tested on Python 3.9+).
- Packages specified in `requirements.txt`:
  - `requests`
  - `pandas`
- A Unix-like environment (Linux, macOS, or WSL on Windows) is recommended, as the script relies on the `sort` command for on-disk de-duplication of `(lang, user)` pairs.
If you are on Windows without a Unix shell, you can install Git Bash or adapt the script to another deduplication method (e.g. SQLite).
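The de-duplication step itself is simple to reproduce. A minimal sketch of both options, assuming a two-column `lang,user` CSV (file names here are illustrative, not the script's actual paths):

```python
import sqlite3
import subprocess

# 1) Unix path: the equivalent of `sort | uniq` over the pairs file.
with open("lang_user_pairs.dedup.csv", "w") as out:
    subprocess.run(["sort", "-u", "lang_user_pairs.csv"], stdout=out, check=True)

# 2) Portable alternative (e.g. on Windows): a UNIQUE index in SQLite.
con = sqlite3.connect("dedup.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS pairs (lang TEXT, user TEXT, UNIQUE(lang, user))"
)
with open("lang_user_pairs.csv", encoding="utf-8") as f:
    for line in f:
        lang, user = line.rstrip("\n").split(",", 1)
        con.execute("INSERT OR IGNORE INTO pairs VALUES (?, ?)", (lang, user))
con.commit()
```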
- Clone this repository:

  ```bash
  git clone https://github.com/CheckFirstHQ/wikipedia-external-links-scanner.git
  cd wikipedia-external-links-scanner
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Prepare your input CSVs inside the `sources/` directory:
  - `domains.csv`: a list of domains to search (e.g. `example.com`), one per line.
  - `wiki_versions.csv`: a list of all Wikipedia language codes, with columns `Wikipedia Name` and `Language Code`.
- Make sure your `sources/` folder has (sample contents below):
  - `domains.csv`
  - `wiki_versions.csv`
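For reference, the two inputs might look like this (rows are illustrative examples, not required values):

`domains.csv`:

```
example.com
example.org
```

`wiki_versions.csv`:

```
Wikipedia Name,Language Code
English Wikipedia,en
French Wikipedia,fr
```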
- Run the main script:

  ```bash
  python3 wiki_search.py
  ```
Where `wiki_search.py` is the script that:

- Reads the input files.
- Fetches external links in a streaming fashion.
- Finds the user who introduced each link (see the revision-scan sketch after this list).
- Writes results to `results.csv`.
- De-duplicates `(lang, user)` pairs on disk via `sort | uniq`.
- Fetches user info and contributions.
- Generates `user_info_all.csv` and `user_contributions_all.csv`.
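The "who introduced this link" step can be pictured as a forward scan of the page's revision history, stopping at the first revision whose wikitext contains the link. This is an illustrative sketch, not necessarily the script's exact strategy:

```python
import requests

def find_link_introducer(lang, title, url_fragment):
    """Scan revisions oldest-to-newest; return (user, timestamp) of the
    first revision whose content contains `url_fragment`, else None."""
    api = f"https://{lang}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvdir": "newer",                 # oldest revision first
        "rvlimit": 50,                    # API caps content queries at 50
        "rvslots": "main",
        "rvprop": "user|timestamp|content",
        "format": "json",
    }
    session = requests.Session()
    while True:
        data = session.get(api, params=params, timeout=30).json()
        page = next(iter(data["query"]["pages"].values()))
        for rev in page.get("revisions", []):
            text = rev.get("slots", {}).get("main", {}).get("*", "")
            if url_fragment in text:
                return rev.get("user"), rev.get("timestamp")
        if "continue" not in data:
            return None
        params.update(data["continue"])
```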
- Check the output CSVs:
  - `results.csv`
  - `user_info_all.csv`
  - `user_contributions_all.csv`
- Wikipedia API Limits: Be mindful of query limits and rate limits. If you experience slow responses or issues, reduce `MAX_WORKERS` and/or add `time.sleep()` calls (see the throttling sketch after this list).
- Disk-based Deduplication: The script relies on the `sort` and `uniq` commands. On Windows, you might need Git Bash or WSL installed, or adapt the script to use SQLite or a Python-based approach.
- Data Volume: If you expect millions of results, ensure sufficient disk space. Streaming helps mitigate memory usage, but large outputs can still grow quickly on disk.
- User Info & Contributions: The script fetches only up to `USER_CONTRIB_LIMIT` (default 10) contributions per user. Adjust if needed.
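As an illustration of the first note, here is one way to combine a bounded thread pool with a per-request pause. `MAX_WORKERS` is the script's documented knob; the delay value and function name are assumptions for this sketch:

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4        # lower this if you hit rate limits or slow responses
REQUEST_DELAY = 0.5    # illustrative per-request pause, in seconds

def polite_fetch(pair):
    """One throttled exturlusage call per (lang, domain) pair."""
    lang, domain = pair
    time.sleep(REQUEST_DELAY)  # simple politeness delay before each request
    api = f"https://{lang}.wikipedia.org/w/api.php"
    params = {"action": "query", "list": "exturlusage",
              "euquery": domain, "eulimit": 10, "format": "json"}
    return requests.get(api, params=params, timeout=30).json()

pairs = [("en", "example.com"), ("fr", "example.com")]  # illustrative input
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(polite_fetch, pairs))
```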