Substack Scraper

Scrape paid or free articles from a Substack newsletter, saving both HTML and Markdown versions.

Setup

  1. Clone the repository and enter the directory:

    git clone https://github.com/gitgithan/substack_scraper.git
    cd substack_scraper
  2. (Recommended) Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate
  3. Install dependencies:

    pip install requests beautifulsoup4 lxml markdownify selenium
    
  4. Install ChromeDriver (the version must match your installed Chrome) and ensure it's in your PATH.

  5. Edit substack_scraper.py:

    • Set BASE_URL to your newsletter's main URL (e.g., https://newsletter.eng-leadership.com)
    • Set SITEMAP_STRING to the sitemap path (e.g., /sitemap.xml)
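
To illustrate how the two settings fit together, here is a minimal sketch. It assumes the script joins BASE_URL and SITEMAP_STRING into a single sitemap URL and reads article links from the sitemap's `<loc>` elements. The function names are hypothetical and not taken from substack_scraper.py, and the `/p/` filter is an assumption about how Substack post URLs look.

```python
import xml.etree.ElementTree as ET

# Example values mirroring the two settings described above.
BASE_URL = "https://newsletter.eng-leadership.com"
SITEMAP_STRING = "/sitemap.xml"

# Standard namespace used by sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_url(base_url, sitemap_string):
    """Join the two settings, tolerating a trailing or leading slash."""
    return base_url.rstrip("/") + "/" + sitemap_string.lstrip("/")


def extract_post_urls(sitemap_xml):
    """Return <loc> values that look like posts (assumed to contain '/p/')."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text]
    return [url for url in locs if "/p/" in url]


if __name__ == "__main__":
    # Prints the sitemap URL the settings above would produce.
    print(sitemap_url(BASE_URL, SITEMAP_STRING))
```

If the sitemap URL printed here does not open in a browser, the two settings probably need adjusting before running the scraper.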

Usage

  • Scrape free articles:

    python substack_scraper.py
    
  • Scrape paid articles (manual login required):

    python substack_scraper.py --paid
    

    This launches a browser so you can log in manually, using either an email OTP or a password. After solving any captcha and logging in, press Enter in the terminal to continue scraping.

    Note: If paid content does not load correctly, you may need to increase the sleep duration in the script (see sleep() in scrape_article_selenium). Paid articles sometimes take longer to render after login.
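
As an alternative to raising a fixed sleep, a polling wait returns as soon as the content is ready and only fails after a timeout. The sketch below is not how the script currently works; `wait_until` is a hypothetical helper written with the standard library only.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With Selenium, the predicate could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, "article")`, where the `article` selector is an assumption about the page markup. Selenium's own `WebDriverWait` provides the same kind of polling.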

Output

  • HTML files: html_files/
  • Markdown files: md_files/
  • Article metadata: articles.json
  • List of URLs: urls.txt
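
For a quick sanity check after a run, the sketch below counts what was written. It assumes urls.txt holds one URL per line and that articles.json is a JSON array or object; neither format is confirmed here, and `summarize_output` is a hypothetical helper, not part of the script.

```python
import json
from pathlib import Path


def count_files(directory):
    """Count regular files in directory, or 0 if it does not exist."""
    if not directory.is_dir():
        return 0
    return sum(1 for path in directory.iterdir() if path.is_file())


def summarize_output(root="."):
    """Count the files and records found under root."""
    root = Path(root)
    urls_file = root / "urls.txt"
    articles_file = root / "articles.json"

    urls = []
    if urls_file.exists():
        # Assumption: one URL per line, blank lines ignored.
        lines = urls_file.read_text(encoding="utf-8").splitlines()
        urls = [line.strip() for line in lines if line.strip()]

    articles = []
    if articles_file.exists():
        # Assumption: a JSON array or object; only its length is used.
        articles = json.loads(articles_file.read_text(encoding="utf-8"))

    return {
        "urls": len(urls),
        "articles": len(articles),
        "html_files": count_files(root / "html_files"),
        "md_files": count_files(root / "md_files"),
    }


if __name__ == "__main__":
    print(summarize_output())
```

A large gap between the URL count and the HTML or Markdown file counts suggests that some articles failed to download or render.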
