Pipeline for generating AI character files and training datasets by scraping public figures' online presence across Twitter and blogs.
⚠️ IMPORTANT: Create a new Twitter account for this tool. DO NOT use your main account: automated scraping may trigger Twitter's automation detection and result in account restrictions.
- Install dependencies:
  npm install
- Copy .env.example into a .env file and fill in the values:
  # (Required) Twitter Authentication
  TWITTER_USERNAME= # your twitter username
  TWITTER_PASSWORD= # your twitter password
  # (Optional) Blog Configuration
  BLOG_URLS_FILE= # path to file containing blog URLs
  # (Optional) Scraping Configuration
  MAX_TWEETS= # max tweets to scrape
  MAX_RETRIES= # max retries for scraping
  RETRY_DELAY= # delay between retries
  MIN_DELAY= # minimum delay between requests
  MAX_DELAY= # maximum delay between requests
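To make the required/optional split concrete, here is a minimal sketch of how these variables could be read and validated. It is not the tool's actual loader: the `loadConfig` function, the `ASSUMED_DEFAULTS` values, and the millisecond units are all assumptions, since the source does not state defaults or units.

```javascript
// Hypothetical defaults: the source lists these variables but not their
// default values or units, so the numbers below are placeholders.
const ASSUMED_DEFAULTS = {
  MAX_TWEETS: 1000,
  MAX_RETRIES: 3,
  RETRY_DELAY: 5000, // assumed milliseconds
  MIN_DELAY: 1000, // assumed milliseconds
  MAX_DELAY: 3000, // assumed milliseconds
};

function loadConfig(env) {
  // The two Twitter credentials are marked "(Required)" in .env.example.
  for (const key of ["TWITTER_USERNAME", "TWITTER_PASSWORD"]) {
    if (!env[key]) throw new Error(`Missing required variable: ${key}`);
  }
  const config = {
    username: env.TWITTER_USERNAME,
    password: env.TWITTER_PASSWORD,
    blogUrlsFile: env.BLOG_URLS_FILE || null, // optional
  };
  // Optional numeric settings fall back to the assumed defaults when unset.
  for (const [key, fallback] of Object.entries(ASSUMED_DEFAULTS)) {
    const raw = env[key];
    const value = raw === undefined || raw === "" ? fallback : Number(raw);
    if (!Number.isFinite(value) || value < 0) {
      throw new Error(`${key} must be a non-negative number, got "${raw}"`);
    }
    config[key] = value;
  }
  if (config.MIN_DELAY > config.MAX_DELAY) {
    throw new Error("MIN_DELAY must not exceed MAX_DELAY");
  }
  return config;
}
```

Assuming the .env file has already been loaded into the process environment, a script would call `loadConfig(process.env)` once at startup and fail early if a credential is missing.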
Twitter collection:
npm run twitter -- username
Example: npm run twitter -- pmarca
Blog collection:
npm run blog
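The scraping settings in .env (MAX_RETRIES, RETRY_DELAY, MIN_DELAY, MAX_DELAY) suggest that requests are spaced out by a randomized delay and retried on failure. The sketch below shows one common way to do that; `randomDelay`, `withRetries`, and the way they map onto the environment variables are assumptions for illustration, not the tool's actual code.

```javascript
// Pick a delay between minDelay and maxDelay. `random` is injectable so the
// function can be tested deterministically.
function randomDelay(minDelay, maxDelay, random = Math.random) {
  return minDelay + random() * (maxDelay - minDelay);
}

// Resolve after `ms` milliseconds (units are an assumption).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task` once, then retry up to `maxRetries` more times, waiting
// `retryDelay` between attempts. Rethrows the last error if all attempts fail.
async function withRetries(task, { maxRetries, retryDelay }, wait = sleep) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) await wait(retryDelay);
    }
  }
  throw lastError;
}
```

A scraper loop might call `await sleep(randomDelay(MIN_DELAY, MAX_DELAY))` between requests and wrap each request in `withRetries`. Randomizing the pause makes the traffic look less mechanical than a fixed interval, which fits the warning about automation detection.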
Generate character:
npm run character -- username
Example: npm run character -- pmarca
Fine-tune:
npm run finetune
Fine-tune (test):
npm run finetune:test
Run this after the Twitter collection step. The date argument is optional:
npm run generate-virtuals -- username date
Example: npm run generate-virtuals -- pmarca 2024-11-29
Example without date: npm run generate-virtuals -- pmarca
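npm forwards everything after `--` to the underlying script, where it appears in `process.argv` after the Node binary and script path. The following sketch shows how a script could read the username and optional date; `parseArgs` is a hypothetical helper, and the YYYY-MM-DD check is an assumption based on the example date above.

```javascript
// Parse "<username> [date]" from the arguments npm forwards after "--".
function parseArgs(argv) {
  const [username, date] = argv;
  if (!username) {
    throw new Error("Usage: npm run generate-virtuals -- <username> [date]");
  }
  // The example date is 2024-11-29, so a YYYY-MM-DD format is assumed here.
  if (date !== undefined && !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new Error(`Expected date as YYYY-MM-DD, got "${date}"`);
  }
  // What the tool does when the date is omitted is not stated in the source,
  // so this sketch simply reports it as null.
  return { username, date: date ?? null };
}

// In a script, skip the Node binary and script path:
// const args = parseArgs(process.argv.slice(2));
```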
The generated character file is written to pipeline/[username]/[date]/character/character.json.
The generated tweet dataset is written to pipeline/[username]/[date]/raw/tweets.json.
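As a small illustration of the layout described above, this sketch builds both output paths for a given run. The `outputPaths` helper is hypothetical; only the path patterns come from this README.

```javascript
// Build the two documented output locations for one pipeline run.
function outputPaths(username, date) {
  const runDir = ["pipeline", username, date].join("/");
  return {
    character: `${runDir}/character/character.json`,
    tweets: `${runDir}/raw/tweets.json`,
  };
}

const paths = outputPaths("pmarca", "2024-11-29");
console.log(paths.character); // pipeline/pmarca/2024-11-29/character/character.json
console.log(paths.tweets); // pipeline/pmarca/2024-11-29/raw/tweets.json
```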