Categorical data generator

Point72/superstore

 
 

superstore

High-performance synthetic data generation library for testing and development.

Overview

superstore is a Rust-powered Python library for generating realistic synthetic datasets. It provides:

Data Generators

| Generator   | Description                                 | Use Cases                        |
|-------------|---------------------------------------------|----------------------------------|
| Retail      | Sales transactions, employees               | BI dashboards, forecasting       |
| Time Series | Financial-style series with regimes, jumps  | Quant research, backtesting      |
| Weather     | Sensor data with seasonal/diurnal patterns  | IoT analytics, anomaly detection |
| Logs        | Web server & application logs               | Observability, alerting          |
| Finance     | Stock prices, OHLCV, options chains         | Trading systems, risk analysis   |
| Telemetry   | Machine metrics, anomalies, failures        | DevOps dashboards, ML training   |

Statistical Tools

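| Tool            | Description                           | Use Cases                          |
|-----------------|---------------------------------------|------------------------------------|
| Distributions   | Sample from statistical distributions | Simulation, Monte Carlo            |
| Copulas         | Correlated multivariate data          | Risk modeling, portfolio analysis  |
| Temporal Models | AR, Markov chains, random walks       | Time series simulation             |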
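
For readers unfamiliar with the temporal models named above, here is a minimal standard-library sketch of an AR(1) process and a random walk. It only illustrates the concepts and is not superstore's API: the function names `ar1` and `random_walk` and their parameters are assumptions made for this example.

```python
import random
from typing import List, Optional


def ar1(n: int, phi: float = 0.8, sigma: float = 1.0,
        seed: Optional[int] = None) -> List[float]:
    """Simulate x[t] = phi * x[t-1] + eps[t] with eps[t] ~ N(0, sigma^2).

    Hypothetical helper for illustration only; the series starts from x = 0.
    """
    rng = random.Random(seed)  # private RNG so a seed gives repeatable output
    x = 0.0
    series = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, sigma)
        series.append(x)
    return series


def random_walk(n: int, sigma: float = 1.0,
                seed: Optional[int] = None) -> List[float]:
    """A random walk is the special case of AR(1) with phi = 1."""
    return ar1(n, phi=1.0, sigma=sigma, seed=seed)


walk = random_walk(5, seed=42)
print(len(walk))
```

With `|phi| < 1` the series reverts toward zero; with `phi = 1` shocks accumulate and the series wanders.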

Key Features

  • Rust-powered: High-performance generation, 10-100x faster than pure Python
  • Flexible output: pandas DataFrame, polars DataFrame, or Python dicts
  • Configurable: Pydantic config classes for validated, structured configuration
  • Reproducible: Seed support for deterministic generation
  • Scalable: Streaming and parallel generation for large datasets

Installation

pip install superstore

For development with polars support:

pip install "superstore[develop]"

Quick Start

from superstore import superstore, employees, timeseries, weather

# Generate 1000 retail records as a pandas DataFrame
df = superstore(count=1000)

# Generate as polars DataFrame
df_polars = superstore(count=1000, output="polars")

# Generate as list of dicts
records = superstore(count=1000, output="dict")
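
When the output is a list of dicts, the records can be processed with the standard library alone. The sketch below aggregates a numeric field by a grouping key. The `records` list is a hand-written stand-in for the generator's output, and the field names `region` and `sales` are assumptions for illustration, not the library's documented schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

# Hypothetical stand-in for `superstore(count=..., output="dict")`.
# Field names are illustrative only.
records = [
    {"region": "West", "sales": 120.0},
    {"region": "East", "sales": 80.5},
    {"region": "West", "sales": 30.0},
]


def total_by(rows: Iterable[Mapping], key: str, value: str) -> Dict[str, float]:
    """Sum `value` for each distinct `key` across an iterable of dict rows."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


print(total_by(records, "region", "sales"))
```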

Reproducibility with Seeds

All data generators support an optional seed parameter for reproducible random data generation:

from superstore import superstore, employees, timeseries, weather, machines

# Same seed produces identical data
df1 = superstore(count=100, seed=42)
df2 = superstore(count=100, seed=42)
assert df1.equals(df2)  # True

# Works with all generators
employees_df = employees(count=50, seed=123)
timeseries_df = timeseries(nper=30, seed=456)
weather_df = weather(count=100, seed=789)
machine_list = machines(count=10, seed=321)

# No seed means random data each time
df3 = superstore(count=100)  # Different each call
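
The seed check above can be wrapped in a small reusable helper, for example in a test suite. This is a sketch under stated assumptions: `is_reproducible` is a hypothetical helper, and `fake_records` is a standard-library stand-in for a seeded generator so that the example runs without superstore installed.

```python
import random
from typing import Any, Callable, Dict, List, Optional


def is_reproducible(generator: Callable[..., Any], seed: int, **kwargs: Any) -> bool:
    """Call `generator` twice with the same seed and compare the results."""
    first = generator(seed=seed, **kwargs)
    second = generator(seed=seed, **kwargs)
    # DataFrames expose .equals(); plain lists and dicts compare with ==.
    if hasattr(first, "equals"):
        return bool(first.equals(second))
    return first == second


def fake_records(count: int, seed: Optional[int] = None) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for a seeded generator; NOT part of superstore."""
    rng = random.Random(seed)
    return [{"row_id": i, "sales": round(rng.uniform(1, 500), 2)} for i in range(count)]


print(is_reproducible(fake_records, seed=42, count=100))
```

Assuming the generators accept `count` and `seed` as keyword arguments, as in the calls above, the analogous check against the library would be `is_reproducible(superstore, seed=42, count=100)`.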

Development

Setup

# Clone the repository
git clone https://github.com/1kbgz/superstore.git
cd superstore

# Install development dependencies
make develop

Building

# Build Python wheel
make build

Testing

# Run all tests
make test

Linting

# Run linters
make lint

# Fix formatting
make fix

Architecture

superstore uses a hybrid Rust/Python architecture:

  • rust/: Core Rust library with all data generation logic
  • src/: PyO3 bindings exposing Rust functions to Python
  • superstore/: Python package with native module

The core data generation is implemented in Rust for performance, with PyO3 providing seamless Python integration. Output format conversion (pandas/polars/dict) happens in the Rust bindings layer.

License

This library is released under the Apache 2.0 license.

Note

This library was generated using copier from the Base Python Project Template repository.
