MemeTrans

This repository contains the dataset and code for the paper "MemeTrans: A Dataset for Detecting High-Risk Memecoin Launches on Solana".

Environment

Python 3.9
Conda recommended
Install packages using:

pip install -r requirements.txt

Part 1: High-risk Memecoin Prediction

Step 1: Train ML Models on the Generated Features & Labels

cd MemeTrans/risk_prediction
python ml_model_train.py --model rf

Step 2: Evaluate Results in the Memecoin Selection Application

python memecoin_selection.py

Part 2: Data Pipeline

The data_pipeline/ directory contains the full pipeline for reproducing the dataset from scratch.

Directory	Description
`memecoin/`	Collects Pump.fun token migration transactions from the Raydium fee account via Solana RPC, producing the memecoin list (`raw_data/memecoin.jsonl`).
`transaction/`	Queries raw transactions from Google BigQuery, splits memecoin windows into CSV parts, generates BigQuery load scripts and JOIN SQL, and parses raw transactions (inner/outer) into structured records.
`bundle/`	Queries the Jito bundle API to identify MEV bundles, and traces on-chain fund flow to find wallets that share a creator, which is used for bundled account detection.
`feature/`	Generates the full feature set from parsed transactions, including holding concentration, market activity, bundle/cluster statistics, and OHLCV time series.
`annotation/`	Manipulation detection and label annotation (coming soon).
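
To illustrate the kind of OHLCV time series that `feature/` produces, here is a minimal sketch that buckets trades into fixed-width bars. It is not the repository's implementation: the function name `ohlcv_bars`, the `(timestamp, price, amount)` tuple layout, and the 60-second default interval are all assumptions made for this example.

```python
def ohlcv_bars(trades, interval=60):
    """Aggregate trades into fixed-width OHLCV bars.

    `trades` is an iterable of (timestamp, price, amount) tuples with integer
    timestamps in seconds. This layout is hypothetical, not the parsed_tx schema.
    Returns a dict mapping each bar's start timestamp to its OHLCV values,
    in chronological order.
    """
    bars = {}
    # Process trades in time order so "open" and "close" are well defined.
    for ts, price, amount in sorted(trades, key=lambda t: t[0]):
        start = ts - ts % interval  # start of the bar this trade falls into
        bar = bars.get(start)
        if bar is None:
            # First trade in the bar sets all four price fields.
            bars[start] = {
                "open": price,
                "high": price,
                "low": price,
                "close": price,
                "volume": amount,
            }
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last trade seen so far in this bar
            bar["volume"] += amount
    return bars
```

Bars with no trades are simply absent from the result; whether to forward-fill them is a design choice this sketch leaves open.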

Pipeline Execution Order

memecoin/  →  transaction/ (BigQuery + parse)  →  bundle/ (Jito + fund flow)
                                                        ↓
                                                  feature/ (feat_gen)
                                                        ↓
                                                  annotation/
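
The order above can also be written down as a small dependency map, which is handy if you script the pipeline yourself. This is a sketch only: it reads the diagram as a linear chain (an assumption), and `STAGE_DEPENDENCIES` and `execution_order` are hypothetical names that do not exist in the repository. It uses `graphlib` from the standard library, available in Python 3.9.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on, as read off the diagram.
# Treating the diagram as a strict linear chain is an assumption.
STAGE_DEPENDENCIES = {
    "memecoin": set(),
    "transaction": {"memecoin"},
    "bundle": {"transaction"},
    "feature": {"bundle"},
    "annotation": {"feature"},
}


def execution_order(dependencies=STAGE_DEPENDENCIES):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())
```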

RPC Configuration

Scripts that query Solana RPC read endpoints from data_pipeline/rpc_endpoints.txt (one URL per line, gitignored). Create this file with your own RPC endpoints before running the memecoin collection or bundle scripts.
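
A minimal sketch of how such a file can be read, assuming only the format stated above (one URL per line). The helper names `load_rpc_endpoints` and `endpoint_rotation` are hypothetical, and the round-robin rotation is an assumption about how multiple endpoints might be used, not a description of the repository's scripts.

```python
from itertools import cycle
from pathlib import Path


def load_rpc_endpoints(path="data_pipeline/rpc_endpoints.txt"):
    """Return the non-empty lines of the endpoints file, one URL per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and skip blank lines.
    endpoints = [line.strip() for line in lines if line.strip()]
    if not endpoints:
        raise ValueError(f"no RPC endpoints found in {path}")
    return endpoints


def endpoint_rotation(endpoints):
    """Yield endpoints in round-robin order (the rotation policy is an assumption)."""
    return cycle(endpoints)
```

Calling `next()` on the iterator returned by `endpoint_rotation` yields the next URL, wrapping around at the end of the list.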

Parsed Dataset

Because the raw transaction data from BigQuery exceeds 1 TB, we provide the parsed transaction datasets on Google Drive:

inner_tx.zip — Pre-migration (bonding curve) transactions
outer_tx.zip — Post-migration (Raydium DEX) transactions

Download and extract them into raw_data/parsed_tx/ to skip the coin_collection.py → BigQuery → parse_* steps and run the downstream pipeline directly.
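
The extraction step can be sketched with the standard library as follows. The function name `extract_parsed_tx` is hypothetical, and the sketch assumes the archives have already been downloaded to a local path; it makes no assumption about their internal layout.

```python
import zipfile
from pathlib import Path


def extract_parsed_tx(archives, target_dir="raw_data/parsed_tx"):
    """Extract each zip archive into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create raw_data/parsed_tx/ if missing
    extracted = []
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
            extracted.extend(zf.namelist())
    return extracted
```

For example, `extract_parsed_tx(["inner_tx.zip", "outer_tx.zip"])` run from the repository root would unpack both archives into `raw_data/parsed_tx/`.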

Q&A

If you have any questions, please open an issue or contact the corresponding author at: husihao26@gmail.com

We will respond as soon as possible.
