
MemeTrans

This repository contains the dataset and code for the paper: "MemeTrans: A Dataset for Detecting High-Risk Memecoin Launches on Solana"

Environment

  • Python 3.9
  • Conda recommended
  • Install packages using:
pip install -r requirements.txt

Part 1: High-risk Memecoin Prediction

Step 1: Train ML Models on the Generated Features & Labels

cd MemeTrans/risk_prediction
python ml_model_train.py --model rf

Step 2: Evaluate Results in the Memecoin Selection Application

python memecoin_selection.py

Part 2: Data Pipeline

The data_pipeline/ directory contains the full pipeline for reproducing the dataset from scratch.

| Directory | Description |
|-----------|-------------|
| memecoin/ | Collects Pump.fun token migration transactions from the Raydium fee account via Solana RPC, producing the memecoin list (raw_data/memecoin.jsonl). |
| transaction/ | Queries raw transactions from Google BigQuery, splits memecoin windows into CSV parts, generates BigQuery load scripts and JOIN SQL, and parses raw transactions (inner/outer) into structured records. |
| bundle/ | Queries the Jito bundle API to identify MEV bundles, and traces on-chain fund flow to detect shared wallet creators for bundled account detection. |
| feature/ | Generates the full feature set from parsed transactions, including holding concentration, market activity, bundle/cluster statistics, and OHLCV time series. |
| annotation/ | Manipulation detection and label annotation (coming soon). |
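
The memecoin/ stage is described as collecting migration transactions for one account via Solana RPC. As a rough illustration of what such a query looks like on the wire, the sketch below builds a JSON-RPC request body for the standard Solana method getSignaturesForAddress. The function name and the placeholder address are assumptions made for this example; the repository's scripts may use a different method or client library.

```python
from typing import Any, Dict, Optional


def signatures_request(
    address: str,
    limit: int = 1000,
    before: Optional[str] = None,
    request_id: int = 1,
) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 body for Solana's getSignaturesForAddress.

    `address` is the account whose transaction signatures are listed.
    `before` is an optional signature used to page backwards in time.
    """
    # The Solana RPC accepts a limit between 1 and 1,000 for this method.
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    options: Dict[str, Any] = {"limit": limit}
    if before is not None:
        options["before"] = before
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "getSignaturesForAddress",
        "params": [address, options],
    }


# Hypothetical usage: "<FEE_ACCOUNT_ADDRESS>" is a placeholder, not a real address.
# body = signatures_request("<FEE_ACCOUNT_ADDRESS>", limit=500)
```

The body can be POSTed as JSON to any endpoint listed in your RPC configuration; paging works by passing the oldest signature from one response as `before` in the next request.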

Pipeline Execution Order

memecoin/  →  transaction/ (BigQuery + parse)  →  bundle/ (Jito + fund flow)
                                                        ↓
                                                  feature/ (feat_gen)
                                                        ↓
                                                  annotation/
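
If you script the stages yourself, the order in the diagram above can be written down as a dependency map and resolved with the standard library. This is only a sketch: the edges are read off the diagram, and the names STAGE_DEPENDENCIES and execution_order are invented for this example rather than taken from the repository.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9
from typing import Dict, List, Set

# Each stage maps to the set of stages that must finish before it starts.
# These edges are an assumption based on the arrows in the diagram.
STAGE_DEPENDENCIES: Dict[str, Set[str]] = {
    "memecoin": set(),
    "transaction": {"memecoin"},
    "bundle": {"transaction"},
    "feature": {"bundle"},
    "annotation": {"feature"},
}


def execution_order(deps: Dict[str, Set[str]] = STAGE_DEPENDENCIES) -> List[str]:
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


# Prints the stage names from memecoin through annotation, one per line.
for stage in execution_order():
    print(stage)
```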

RPC Configuration

Scripts that query Solana RPC load their endpoint URLs from data_pipeline/rpc_endpoints.txt (one URL per line; the file is gitignored). Create this file with your own RPC endpoints before running the memecoin collection or bundle scripts.
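
For reference, a file in this format can be read with a few lines of Python. The sketch below is an illustration, not the repository's loader: the function names are hypothetical, and skipping blank lines and lines starting with '#' is an assumption that the README does not state.

```python
from itertools import cycle
from pathlib import Path
from typing import Iterator, List


def load_rpc_endpoints(path: str = "data_pipeline/rpc_endpoints.txt") -> List[str]:
    """Read one RPC URL per line from `path`.

    Blank lines and lines starting with '#' are skipped (an assumption;
    the documented format only says "one URL per line").
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    endpoints = [
        line.strip()
        for line in lines
        if line.strip() and not line.strip().startswith("#")
    ]
    if not endpoints:
        raise ValueError(f"no RPC endpoints found in {path}")
    return endpoints


def endpoint_rotation(endpoints: List[str]) -> Iterator[str]:
    """Yield endpoints round-robin so requests are spread across providers."""
    return cycle(endpoints)
```

Rotating across several endpoints is one simple way to stay under per-provider rate limits; whether the pipeline scripts do this is not stated in this README.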

Parsed Dataset

The raw transaction data from BigQuery exceeds 1 TB, so we provide the parsed transaction datasets on Google Drive:

  • inner_tx.zip — Pre-migration (bonding curve) transactions
  • outer_tx.zip — Post-migration (Raydium DEX) transactions

Download and extract them into raw_data/parsed_tx/ to skip the coin_collection.py → BigQuery → parse_* steps and run the downstream pipeline directly.
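
The extraction step can be done by hand or with a short script. The following is a minimal sketch that assumes both archives have already been downloaded into the current directory; the function name extract_parsed_tx is hypothetical, and the layout inside the archives is not documented here, so check the extracted paths against what the downstream scripts expect.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List


def extract_parsed_tx(
    archives: Iterable[str],
    dest: str = "raw_data/parsed_tx",
) -> List[str]:
    """Extract each zip archive into `dest` and return the member names."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create raw_data/parsed_tx/ if missing
    extracted: List[str] = []
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest_dir)
            extracted.extend(zf.namelist())
    return extracted


# Hypothetical usage, run from the directory that contains the downloads:
# extract_parsed_tx(["inner_tx.zip", "outer_tx.zip"])
```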

Q&A

If you have any questions, please open an issue or contact the corresponding author at: husihao26@gmail.com

We will respond as soon as possible.
