@commoncrawl

Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Pinned

  1. cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python 452 92

  2. cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python 208 16

  3. cc-index-table Public

    Index Common Crawl archives in tabular format

    Java 124 14

  4. cc-warc-examples Public

    Forked from Smerity/cc-warc-examples

    CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java 37 18

  5. cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook 28 4

  6. cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook 62 10

Repositories

Showing 10 of 77 repositories
  • cc-mrjob Public Forked from Smerity/cc-mrjob

    Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

    Python 168 MIT 79 3 3 Updated Jan 22, 2026
  • cc-vec Public
    Python 5 MIT 2 0 1 Updated Jan 20, 2026
  • cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python 452 MIT 92 4 2 Updated Jan 20, 2026
  • cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python 208 Apache-2.0 16 1 0 Updated Jan 17, 2026
  • cdx_toolkit Public

    A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

    Python 197 Apache-2.0 34 6 5 Updated Jan 15, 2026
  • robotstxt-experiments Public

    How is the Robots Exclusion Protocol (robots.txt) used on the WWW? This project tries to gain some insights by mining Common Crawl's robots.txt captures from the years 2016 – 2024.

    Jupyter Notebook 0 MIT 0 0 0 Updated Jan 15, 2026
  • nutch Public Forked from Aloisius/nutch

    Common Crawl fork of Apache Nutch

    Java 40 Apache-2.0 1,272 6 (1 issue needs help) 0 Updated Jan 11, 2026
  • warcio-s3 Public Forked from webrecorder/warcio

    Streaming WARC/ARC library for fast web archive IO

    Python 0 Apache-2.0 67 0 0 Updated Jan 11, 2026
  • webarchive-indexing Public Forked from ikreymer/webarchive-indexing

    Tools for bulk indexing of WARC/ARC files on Hadoop, EMR, or the local file system.

    Python 6 MIT 11 0 1 Updated Jan 11, 2026
  • cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook 28 4 0 0 Updated Jan 9, 2026