Skip to content

khushidubeyokok/TopicModelling

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 

Repository files navigation

TopicModelling Research papers with BERTopic

This project uses BERTopic along with OpenAI's GPT-4o-mini to generate meaningful topic names and analyze 63,000+ research papers.

The project is split into 3 efficient, modular parts to ensure reusability and fast experimentation.

Project Structure

Part 1: Topic Modeling Pipeline + Caching

  • Load the neuralwork/arxiver dataset.
  • Preprocess abstracts.
  • Apply BERTopic for topic modeling.
  • Use GPT-4o-mini to assign descriptive topic names.
  • Save results (info, topic, topic name) and visualisations into a CSV in Google Drive — so you never have to re-run this compute-heavy step again.

Part 2: Final Dataset Construction

  • Download original dataset (full metadata: id, title, abstract, authors, date, link, markdown).
  • Merge with saved topic modeling results from Part 1.
  • Final DataFrame format: id | title | abstract | authors | published_date | link | markdown | topic | Topic Name

Part 3: Insights & Analysis

  • Visualizations:
  • topic_model.visualize_barchart()
  • visualize_heatmap()
  • visualize_topics()
  • visualize_hierarchy()
  • Key Analyses:
  1. Top 10 Topics by Paper Count

    • Most frequent: Astrophysics of Neutrinos and Black Holes (9561 papers)
    • Least frequent: Renewable Energy and Grid Management (509 papers) image
  2. Papers Published Per Month

    • 📈 Peak month: May 2023 (4701 papers)
    • 📉 Lowest month: Sept 2022 (1 paper) image
  3. Monthly Trends of Top 5 Topics

    • Trends shown for:
      • Astrophysics of Neutrinos and Black Holes
      • Audio Recognition and Analysis
      • Deep Neural Network Optimization Techniques
      • Medical Imaging Segmentation and Diagnosis
      • Quantum Phase Transitions and Spin Dynamics image

Dataset Details

  • Source: neuralwork/arxiver
  • Fields: ID, Title, Abstract, Authors, Date, Markdown, Link.
  • Size: 63,357 papers (September 2022 – October 2023)

Model & Config

  • Embedding: all-MiniLM-L6-v2
  • UMAP:
  • n_neighbors=10, min_dist=0.1
  • HDBSCAN:
  • min_cluster_size=60, min_samples=15
  • Topic Naming:
  • GPT-4o-mini used to summarize top words from each topic into reader-friendly labels

Visualizations

(Open the notebook to view interactive graphs)

Type Sample
Bar Chart image
Heatmap image
Inter-Topic Distance image

Sample Analysis (Top 5 Topics)

Topic Name Peak Month Trough Month Total Papers
Astrophysics of Neutrinos and Black Holes Jul 2023 (1089) Oct 2022 (1) 9561
Audio Recognition and Analysis May 2023 (190) Dec 2022 (3) 1166
Deep Neural Network Optimization Oct 2023 (94) Dec 2022 (2) 787
Medical Imaging & Diagnosis Mar 2023 (183) Dec 2022 (2) 1412
Quantum Phase Transitions Mar 2023 (842) Nov 2022 (1) 7659

Why This Matters

  • Faster Iteration: Cached outputs (topics + names) make it easy to rerun analysis anytime.
  • Human-Centric Topics: GPT-4o naming boosts clarity and presentation value.
  • Topic Evolution: Monthly trends reveal emerging or fading fields.
  • Customizable: You can easily extend this notebook to:
  • Add new months
  • Train on more papers
  • Refine topic naming

Releases

No releases published

Packages

No packages published