This project uses BERTopic along with OpenAI's GPT-4o-mini to generate meaningful topic names and analyze 63,000+ research papers.
The project is split into 3 efficient, modular parts to ensure reusability and fast experimentation.
- Load the `neuralwork/arxiver` dataset.
- Preprocess abstracts.
- Apply BERTopic for topic modeling.
- Use GPT-4o-mini to assign descriptive topic names.
- Save the results (`info`, `topic`, `topic name`) as a CSV, along with the visualisations, to Google Drive so this compute-heavy step never has to be re-run.
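The save-once, reuse-later idea in the last step can be sketched as a small cache helper. This is a minimal illustration, not the notebook's actual code: `load_or_compute` and `run_topic_modeling` are hypothetical names, and the toy columns stand in for the real results.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_compute(cache_path: Path, compute: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached CSV if it exists; otherwise run `compute` and save its output."""
    if cache_path.exists():
        return pd.read_csv(cache_path)
    results = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    results.to_csv(cache_path, index=False)
    return results


def run_topic_modeling() -> pd.DataFrame:
    # Hypothetical stand-in for the compute-heavy BERTopic + GPT-4o-mini step.
    return pd.DataFrame(
        {
            "id": ["a1", "a2"],
            "topic": [0, 1],
            "Topic Name": ["Toy Topic A", "Toy Topic B"],
        }
    )
```

In Colab, `cache_path` would point somewhere under the mounted Drive folder (the exact path is up to you), so later sessions read the CSV instead of refitting the model.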
- Download original dataset (full metadata: `id`, `title`, `abstract`, `authors`, `date`, `link`, `markdown`).
- Merge with saved topic modeling results from Part 1.
- Final DataFrame format:
  `id | title | abstract | authors | published_date | link | markdown | topic | Topic Name`
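A merge along these lines would produce the final format. The sketch assumes the saved Part 1 results carry an `id` column to join on and that the date column is already named `published_date`; both are assumptions, and the rows are toy data.

```python
import pandas as pd

FINAL_COLUMNS = [
    "id", "title", "abstract", "authors", "published_date",
    "link", "markdown", "topic", "Topic Name",
]


def merge_topics(papers: pd.DataFrame, topics: pd.DataFrame) -> pd.DataFrame:
    """Attach topic id and topic name to each paper, keeping every paper row."""
    merged = papers.merge(
        topics[["id", "topic", "Topic Name"]],
        on="id",
        how="left",             # papers without a saved topic are kept (NaN topic)
        validate="one_to_one",  # raise if an id is duplicated on either side
    )
    return merged[FINAL_COLUMNS]


# Toy rows standing in for the real dataset and the Part 1 CSV.
papers = pd.DataFrame(
    {
        "id": ["p1", "p2"],
        "title": ["Paper one", "Paper two"],
        "abstract": ["First abstract.", "Second abstract."],
        "authors": ["A. Author", "B. Author"],
        "published_date": ["2023-01-05", "2023-02-10"],
        "link": ["https://example.org/p1", "https://example.org/p2"],
        "markdown": ["# Paper one", "# Paper two"],
    }
)
topics = pd.DataFrame(
    {"id": ["p1", "p2"], "topic": [3, 7], "Topic Name": ["Toy Topic A", "Toy Topic B"]}
)
final_df = merge_topics(papers, topics)
```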
- Visualizations: `topic_model.visualize_barchart()`, `visualize_heatmap()`, `visualize_topics()`, `visualize_hierarchy()`
- Key Analyses:
  - Top 10 Topics by Paper Count
  - Papers Published Per Month
  - Monthly Trends of Top 5 Topics
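The three key analyses can each be expressed as a short pandas aggregation over the merged DataFrame. The sketch below assumes the `published_date` and `Topic Name` columns from the final format; the function names and toy rows are illustrative only.

```python
import pandas as pd


def top_topics(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Paper count for the n most frequent topic names."""
    return df["Topic Name"].value_counts().head(n)


def papers_per_month(df: pd.DataFrame) -> pd.Series:
    """Number of papers published in each calendar month."""
    months = pd.to_datetime(df["published_date"]).dt.to_period("M")
    return months.value_counts().sort_index()


def monthly_trends(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Month-by-topic table of paper counts, restricted to the n largest topics."""
    subset = df[df["Topic Name"].isin(top_topics(df, n).index)]
    months = pd.to_datetime(subset["published_date"]).dt.to_period("M").rename("month")
    return subset.groupby([months, "Topic Name"]).size().unstack(fill_value=0)


# Toy data: three papers on topic A, two on B, one on C.
toy = pd.DataFrame(
    {
        "published_date": [
            "2023-01-05", "2023-01-20", "2023-02-03",
            "2023-01-11", "2023-02-15", "2023-02-27",
        ],
        "Topic Name": ["A", "A", "A", "B", "B", "C"],
    }
)
trends = monthly_trends(toy, n=2)
```

Each result can be passed straight to a plotting call, for example `trends.plot()` for one line per topic.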
- Source: neuralwork/arxiver
- Fields: ID, Title, Abstract, Authors, Date, Markdown, Link.
- Size: 63,357 papers (September 2022 – October 2023)
- Embedding: `all-MiniLM-L6-v2`
- UMAP: `n_neighbors=10`, `min_dist=0.1`
- HDBSCAN: `min_cluster_size=60`, `min_samples=15`
- Topic Naming: GPT-4o-mini used to summarize top words from each topic into reader-friendly labels
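These settings could be wired into BERTopic roughly as follows. Only the values listed above come from this project; every other argument is left at its library default, which may differ from what the notebook uses, and `build_topic_model` is a hypothetical helper name. The GPT-4o-mini naming step is not shown.

```python
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
UMAP_PARAMS = {"n_neighbors": 10, "min_dist": 0.1}
HDBSCAN_PARAMS = {"min_cluster_size": 60, "min_samples": 15}


def build_topic_model():
    """Create an unfitted BERTopic model with the settings listed above."""
    # Imported inside the function so the settings can be inspected
    # without loading the heavy dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    return BERTopic(
        embedding_model=EMBEDDING_MODEL,  # sentence-transformers model name
        umap_model=UMAP(**UMAP_PARAMS),
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
    )
```

Fitting is then a single call, `topics, probs = build_topic_model().fit_transform(abstracts)`, where `abstracts` is a list of preprocessed abstract strings.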
(Open the notebook to view interactive graphs)
| Type | Sample |
|---|---|
| Bar Chart | ![]() |
| Heatmap | ![]() |
| Inter-Topic Distance | ![]() |
| Topic Name | Peak Month | Trough Month | Total Papers |
|---|---|---|---|
| Astrophysics of Neutrinos and Black Holes | Jul 2023 (1089) | Oct 2022 (1) | 9561 |
| Audio Recognition and Analysis | May 2023 (190) | Dec 2022 (3) | 1166 |
| Deep Neural Network Optimization | Oct 2023 (94) | Dec 2022 (2) | 787 |
| Medical Imaging & Diagnosis | Mar 2023 (183) | Dec 2022 (2) | 1412 |
| Quantum Phase Transitions | Mar 2023 (842) | Nov 2022 (1) | 7659 |
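A summary like the table above can be derived from a month-by-topic count table. The sketch below uses made-up numbers and a hypothetical `summarize_trends` helper; it takes the minimum over every month present in the table, which may not match how the notebook picks the trough.

```python
import pandas as pd


def summarize_trends(monthly: pd.DataFrame) -> pd.DataFrame:
    """Per-topic peak month, trough month, and total from a month x topic count table."""
    return pd.DataFrame(
        {
            "Peak Month": monthly.idxmax(),
            "Peak Count": monthly.max(),
            "Trough Month": monthly.idxmin(),
            "Trough Count": monthly.min(),
            "Total Papers": monthly.sum(),
        }
    )


# Toy counts: rows are months, columns are topics.
monthly = pd.DataFrame(
    {"Topic A": [5, 12, 7], "Topic B": [3, 1, 9]},
    index=["2023-01", "2023-02", "2023-03"],
)
summary = summarize_trends(monthly)
```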
- Faster Iteration: Cached outputs (topics + names) make it easy to rerun analysis anytime.
- Human-Centric Topics: GPT-4o-mini naming makes topics easier to read and present.
- Topic Evolution: Monthly trends reveal emerging or fading fields.
- Customizable: You can easily extend this notebook to:
- Add new months
- Train on more papers
- Refine topic naming





