8,015 changes: 4,989 additions & 3,026 deletions docs/tutorials/virtual_db_tutorial.ipynb

Large diffs are not rendered by default.

525 changes: 0 additions & 525 deletions docs/virtual_database_concepts.md

This file was deleted.

18 changes: 12 additions & 6 deletions docs/virtual_db.md
@@ -1,11 +1,22 @@
# VirtualDB

+VirtualDB provides a unified query interface across heterogeneous datasets with
+different experimental condition structures and terminologies. Each dataset
+defines experimental conditions in its own way, with properties stored at
+different hierarchy levels (repository, dataset, or field) and using different
+naming conventions. VirtualDB uses an external YAML configuration to map these
+varying structures to a common schema, normalize factor level names (e.g.,
+"D-glucose", "dextrose", "glu" all become "glucose"), and enable cross-dataset
+queries with standardized field names and values.
+
+## API Reference
+
::: tfbpapi.virtual_db.VirtualDB
    options:
      show_root_heading: true
      show_source: true

-## Helper Functions
+### Helper Functions

::: tfbpapi.virtual_db.get_nested_value
    options:
@@ -14,8 +25,3 @@
::: tfbpapi.virtual_db.normalize_value
    options:
      show_root_heading: true

-## Usage
-
-For comprehensive usage documentation including comparative datasets, see
-[Virtual Database Concepts](virtual_database_concepts.md).
257 changes: 257 additions & 0 deletions docs/virtual_db_configuration.md
@@ -0,0 +1,257 @@
# VirtualDB Configuration Guide

VirtualDB requires a YAML configuration file that defines which datasets to
include, how to map their fields to common names, and how to normalize factor
levels.

## Basic Example

```yaml
repositories:
  # Each repository defines a "table" in the virtual database
  BrentLab/harbison_2004:
    # REQUIRED: Specify which field is the sample identifier. At this level, it means
    # that all datasets have a field `sample_id` that uniquely identifies samples.
    sample_id:
      field: sample_id
    # Repository-wide properties (apply to all datasets in this repository)
    # Paths are explicit from the datacard root
    nitrogen_source:
      path: experimental_conditions.media.nitrogen_source.name

    dataset:
      # Each dataset gets its own view with standardized fields
      harbison_2004:
        # note: this is optional. If not specified, then the config_name is used.
        # This is useful if the config_name isn't suited to a table name, or if it
        # were to conflict with another dataset in the configuration
        db_name: harbison
        # Dataset-specific properties (constant for all samples)
        # Explicit path from datacard/config root
        phosphate_source:
          path: experimental_conditions.media.phosphate_source.compound

        # Field-level properties (vary per sample)
        # Path is relative to field's definitions dict
        carbon_source:
          field: condition
          path: media.carbon_source.compound
          dtype: string  # Optional: specify data type

        # Field without path (column alias with normalization)
        environmental_condition:
          field: condition

  BrentLab/kemmeren_2014:
    dataset:
      kemmeren_2014:
        # optional -- see the note for `db_name` in harbison above
        db_name: kemmeren
        # REQUIRED: If `sample_id` isn't defined at the repo level, then it must be
        # defined at the dataset level for each dataset in the repo
        sample_id:
          field: sample_id
        # Same logical fields, different physical paths
        # Explicit path from datacard/config root
        carbon_source:
          path: experimental_conditions.media.carbon_source.compound
          dtype: string
        temperature_celsius:
          path: experimental_conditions.temperature_celsius
          dtype: numeric  # Enables numeric filtering with comparison operators

  # Comparative dataset example
  BrentLab/yeast_comparative_analysis:
    dataset:
      dto:
        # Use field mappings to change a field's displayed name. If not specifically
        # listed, then the field is included as it exists in the source data
        dto_fdr:
          field: dto_fdr
        dto_empirical_pvalue:
          field: empirical_pvalue

        # links specify which primary datasets are referenced by composite ID fields
        links:
          binding_id:
            - [BrentLab/harbison_2004, harbison_2004]
          perturbation_id:
            - [BrentLab/kemmeren_2014, kemmeren_2014]

# ===== Normalization Rules =====
# Map varying terminologies to standardized values
factor_aliases:
  carbon_source:
    glucose: [D-glucose, glu, dextrose]
    galactose: [D-galactose, gal]

# Handle missing values with defaults
missing_value_labels:
  carbon_source: "unspecified"

# ===== Documentation =====
description:
  carbon_source: The carbon source provided to the cells during growth
```
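The normalization rules in the example above can be pictured with a small Python sketch. This is not the tfbpapi implementation: `build_alias_lookup` and `normalize_level` are hypothetical names, and the case-insensitive matching and pass-through of unknown values are assumptions made for illustration.

```python
# Hypothetical sketch of factor-level normalization; not the tfbpapi implementation.
factor_aliases = {
    "carbon_source": {
        "glucose": ["D-glucose", "glu", "dextrose"],
        "galactose": ["D-galactose", "gal"],
    }
}
missing_value_labels = {"carbon_source": "unspecified"}


def build_alias_lookup(aliases):
    """Invert {canonical: [alias, ...]} into {lowercased alias: canonical}."""
    lookup = {}
    for canonical, names in aliases.items():
        lookup[canonical.lower()] = canonical
        for name in names:
            lookup[name.lower()] = canonical
    return lookup


def normalize_level(field, value):
    """Return the canonical level for `value`, or the field's missing-value label."""
    if value is None:
        # Assumption: a missing value falls back to the configured label, if any.
        return missing_value_labels.get(field)
    lookup = build_alias_lookup(factor_aliases.get(field, {}))
    # Assumption: matching is case-insensitive; unknown values pass through unchanged.
    return lookup.get(str(value).lower(), value)


print(normalize_level("carbon_source", "D-glucose"))
print(normalize_level("carbon_source", "raffinose"))
print(normalize_level("carbon_source", None))
```

Inverting the alias table once keeps each lookup a single dictionary access, which is why the sketch builds a reverse map rather than scanning the alias lists.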

### Property Hierarchy

Properties are extracted at three hierarchy levels:

1. **Repository-wide**: Common to all datasets in a repository
    - Paths relative to datacard/config root (explicit)
    - Example: `path: experimental_conditions.media.nitrogen_source.name`

2. **Dataset-specific**: Specific to one dataset configuration
    - Paths relative to datacard/config root (explicit)
    - Example: `path: experimental_conditions.media.phosphate_source.compound`

3. **Field-level**: Vary per sample, defined in field definitions
    - `field` specifies which field to extract from
    - `path` relative to that field's definitions dict
    - Example: `field: condition, path: media.carbon_source.compound`

**Special case**: Field without path creates a column alias
- `field: condition` (no path) renames `condition` column, enables normalization

### Path Resolution

Paths use dot notation to navigate nested structures:

**Repository/Dataset-level** (explicit paths from datacard root):
- `path: experimental_conditions.temperature_celsius` - access experimental conditions
- `path: experimental_conditions.media.carbon_source.compound` - nested condition data
- `path: description` - access fields outside experimental_conditions

**Field-level** (paths relative to field definitions):
- `field: condition, path: media.carbon_source.compound` looks in field
`condition`'s definitions and navigates to `media.carbon_source.compound`
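
As a rough illustration of the two resolution modes, the sketch below walks a dotted path through nested dictionaries. `resolve_path` is a hypothetical stand-in (the library's own helper is `get_nested_value`, whose signature is not shown here), and the shape of the field definitions, keyed by each value of `condition`, is an assumption.

```python
from typing import Any


def resolve_path(data: dict, path: str, default: Any = None) -> Any:
    """Follow a dot-separated path through nested dicts (hypothetical helper)."""
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Repository/dataset level: the path starts at the datacard root.
datacard = {
    "description": "example datacard",
    "experimental_conditions": {
        "temperature_celsius": 30,
        "media": {"carbon_source": {"compound": "D-glucose"}},
    },
}
temperature = resolve_path(datacard, "experimental_conditions.temperature_celsius")

# Field level: the path is relative to each entry of the field's definitions.
# Assumption: definitions are keyed by the values the `condition` field can take.
condition_definitions = {
    "YPD": {"media": {"carbon_source": {"compound": "D-glucose"}}},
    "GAL": {"media": {"carbon_source": {"compound": "D-galactose"}}},
}
carbon_by_condition = {
    level: resolve_path(definition, "media.carbon_source.compound")
    for level, definition in condition_definitions.items()
}

print(temperature)
print(carbon_by_condition)
```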

### Data Type Specifications

Field mappings support an optional `dtype` parameter to ensure proper type handling
during metadata extraction and query filtering.

**Supported dtypes**:
- `string` - Text data (default if not specified)
- `numeric` - Numeric values (integers or floating-point numbers)
- `bool` - Boolean values (true/false)

**When to use dtype**:

1. **Numeric filtering**: Required for fields used with comparison operators
(`<`, `>`, `<=`, `>=`, `between`)
2. **Type consistency**: When source data might be extracted with incorrect type
3. **Performance**: Helps with query optimization and prevents type mismatches
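
The reason numeric filtering needs `dtype: numeric` can be seen in a short sketch. The `coerce` function is hypothetical, and the set of strings treated as true is an assumption; the point is only that values left as strings compare lexicographically.

```python
def coerce(value, dtype="string"):
    """Convert an extracted value to the configured dtype (hypothetical helper)."""
    if value is None:
        return None
    if dtype == "numeric":
        return float(value)
    if dtype == "bool":
        if isinstance(value, bool):
            return value
        # Assumption: these spellings count as true; everything else is false.
        return str(value).strip().lower() in {"true", "1", "yes"}
    return str(value)


temperatures = ["9", "30", "100"]

# Compared as strings, "9" sorts after "25" and "100" sorts before it.
as_text = [t for t in temperatures if coerce(t) > "25"]

# Compared as numbers, the filter behaves as intended.
as_numbers = [t for t in temperatures if coerce(t, "numeric") > 25]

print(as_text)
print(as_numbers)
```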

## Comparative Datasets

Comparative datasets differ from other dataset types in that they represent
relationships between samples across datasets rather than individual samples.
Each row relates two or more samples from other datasets.

### Structure

Comparative datasets use `source_sample` fields instead of a single `sample_id`:
- Multiple fields with `role: source_sample`
- Each contains composite identifier: `"repo_id;config_name;sample_id"`
- Example: `binding_id = "BrentLab/harbison_2004;harbison_2004;42"`

### Fields

All fields in the comparative dataset are included by default. A field can be
renamed (aliased) by mapping it explicitly in the configuration.

```yaml
dto:
  # this makes the displayed field name 'dto_pvalue'
  # instead of 'empirical_pvalue'
  dto_pvalue:
    field: empirical_pvalue
```

### Link Structure

The `links` section specifies how the composite IDs map to primary datasets. Each
key under `links` is the name of a field in the comparative dataset that
contains composite IDs. Its value is a list of `[repo_id, config_name]`
pairs indicating which primary datasets that field references. Those primary
datasets must also be defined in the overall VirtualDB configuration.

```yaml
# Within the comparative dataset config
dto:
  links:
    binding_id:
      - [BrentLab/harbison_2004, harbison_2004]  # [repo_id, config_name]
      - [BrentLab/callingcards, annotated_features]
    perturbation_id:
      - [BrentLab/kemmeren_2014, kemmeren_2014]
```

See the [huggingface datacard documentation](huggingface_datacard.md#5-comparative)
for more detailed explanation of comparative datasets and composite IDs.
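
The requirement that linked primary datasets also appear in the configuration could be checked along the following lines. This is a sketch over a plain dictionary that mirrors the YAML layout; `find_unresolved_links` is a hypothetical function, not part of tfbpapi.

```python
# Hypothetical check; the dict mirrors the YAML layout used in this guide.
config = {
    "repositories": {
        "BrentLab/harbison_2004": {"dataset": {"harbison_2004": {"db_name": "harbison"}}},
        "BrentLab/kemmeren_2014": {"dataset": {"kemmeren_2014": {"db_name": "kemmeren"}}},
        "BrentLab/yeast_comparative_analysis": {
            "dataset": {
                "dto": {
                    "links": {
                        "binding_id": [
                            ["BrentLab/harbison_2004", "harbison_2004"],
                            ["BrentLab/callingcards", "annotated_features"],
                        ],
                        "perturbation_id": [["BrentLab/kemmeren_2014", "kemmeren_2014"]],
                    }
                }
            }
        },
    }
}


def find_unresolved_links(config):
    """Return (link_field, repo_id, config_name) for links to unconfigured datasets."""
    repos = config.get("repositories", {})
    defined = {
        (repo_id, config_name)
        for repo_id, repo in repos.items()
        for config_name in repo.get("dataset", {})
    }
    unresolved = []
    for repo in repos.values():
        for dataset in repo.get("dataset", {}).values():
            for link_field, targets in dataset.get("links", {}).items():
                for target_repo, target_config in targets:
                    if (target_repo, target_config) not in defined:
                        unresolved.append((link_field, target_repo, target_config))
    return unresolved


print(find_unresolved_links(config))
```

Here `BrentLab/callingcards` is linked but not configured, so it is the one pair reported.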

## Internal Structure

VirtualDB uses an in-memory DuckDB database to construct a layered hierarchy
of SQL views over locally cached Parquet files. Views are created lazily on
first query and are not persisted to disk.

### View Hierarchy

For each configured dataset, VirtualDB registers a series of views that
build on each other. Using `harbison` as an example primary dataset and
`dto` as a comparative dataset:

**1. Metadata view** -- `harbison_meta`

One row per unique `sample_id`. Derived columns from the configuration
(e.g., `carbon_source`, `temperature_celsius`) are resolved here using
datacard definitions, factor aliases, and missing value labels. This is
the primary view for querying sample-level metadata.

**2. Raw data view** -- `harbison`

The full Parquet data joined to the metadata view, so that every row
carries both the raw measurement columns and the derived metadata
columns.

**Developer note**: An internal view named `__<db_name>_parquet` holds the raw
Parquet data without any metadata joins or derived columns. It serves as the
base for the join to the metadata view and is not exposed directly to users.

**3. Expanded view (comparative only)** -- `dto_expanded`

For comparative datasets, each composite ID field (e.g. `binding_id`
with format `"repo_id;config_name;sample_id"`) is parsed into two
additional columns:

- `<link_field>_source` -- the `repo_id;config_name` prefix, aliased
to the configured `db_name` when the pair is in the VirtualDB config.
For example, `BrentLab/harbison_2004;harbison_2004` becomes `harbison`.
- `<link_field>_id` -- the sample_id component.

This makes it straightforward to join back to primary dataset views
or filter by source dataset without parsing composite IDs in SQL.
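
The parsing described above amounts to the following Python sketch. `expand_composite_id` and the `db_names` mapping are hypothetical; leaving the source as the unaliased `repo_id;config_name` prefix when the pair is not configured is an assumption drawn from the wording above.

```python
def expand_composite_id(link_field, composite_id, db_names):
    """Split "repo_id;config_name;sample_id" into the two expanded columns."""
    repo_id, config_name, sample_id = composite_id.split(";", 2)
    prefix = f"{repo_id};{config_name}"
    return {
        # Assumption: unconfigured pairs keep the raw prefix as their source.
        f"{link_field}_source": db_names.get((repo_id, config_name), prefix),
        f"{link_field}_id": sample_id,
    }


# Hypothetical mapping from (repo_id, config_name) to the configured db_name.
db_names = {
    ("BrentLab/harbison_2004", "harbison_2004"): "harbison",
    ("BrentLab/kemmeren_2014", "kemmeren_2014"): "kemmeren",
}

row = expand_composite_id(
    "binding_id", "BrentLab/harbison_2004;harbison_2004;42", db_names
)
print(row)
```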

### View Diagram

```
__harbison_parquet          (raw parquet, not directly exposed)
    |
    +-> harbison_meta       (deduplicated, one row per sample_id,
    |                        with derived columns from config)
    |
    +-> harbison            (full parquet joined to harbison_meta)

__dto_parquet               (raw parquet, not directly exposed)
    |
    +-> dto_expanded        (parquet + parsed columns:
                             binding_id_source, binding_id_id,
                             perturbation_id_source, perturbation_id_id)
```
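
The layering in the diagram can be mimicked with a few SQL views. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB and an ordinary table in place of the cached Parquet file; the column names, sample rows, and the condition-to-carbon-source mapping are made up for illustration and are not taken from the real datasets.

```python
import sqlite3

# sqlite3 stands in for the in-memory DuckDB database; all data is invented.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Stand-in for the raw parquet data (several rows per sample).
    CREATE TABLE __harbison_parquet (
        sample_id INTEGER, condition TEXT, target TEXT, effect REAL
    );
    INSERT INTO __harbison_parquet VALUES
        (1, 'YPD', 'YAL001C', 0.5),
        (1, 'YPD', 'YAL002W', 1.2),
        (2, 'GAL', 'YAL001C', 0.1);

    -- Metadata view: one row per sample_id, with a derived column.
    CREATE VIEW harbison_meta AS
    SELECT DISTINCT
        sample_id,
        CASE condition
            WHEN 'YPD' THEN 'glucose'
            WHEN 'GAL' THEN 'galactose'
            ELSE 'unspecified'
        END AS carbon_source
    FROM __harbison_parquet;

    -- Full view: every raw row joined to its sample's metadata.
    CREATE VIEW harbison AS
    SELECT p.*, m.carbon_source
    FROM __harbison_parquet AS p
    JOIN harbison_meta AS m USING (sample_id);
    """
)

meta_rows = con.execute(
    "SELECT sample_id, carbon_source FROM harbison_meta ORDER BY sample_id"
).fetchall()
full_rows = con.execute(
    "SELECT sample_id, target, carbon_source FROM harbison ORDER BY sample_id, target"
).fetchall()

print(meta_rows)
print(full_rows)
```

Because both layers are views, nothing is copied: a query against `harbison` is rewritten down to the underlying table each time it runs.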

## Usage

For usage examples and tutorials,
see the [VirtualDB Tutorial](tutorials/virtual_db_tutorial.ipynb).
3 changes: 1 addition & 2 deletions mkdocs.yml
@@ -154,8 +154,6 @@ nav:
- "Cache Management": tutorials/cache_manager_tutorial.ipynb
- "Querying Data":
- "VirtualDB: Unified Cross-Dataset Queries": tutorials/virtual_db_tutorial.ipynb
-- Concepts:
-- "Virtual Database Design": virtual_database_concepts.md
- API Reference:
- Core:
- VirtualDB: virtual_db.md
@@ -169,3 +167,4 @@ nav:
- HuggingFace Configuration:
- HuggingFace Dataset Card Format: huggingface_datacard.md
- BrentLab Collection: brentlab_yeastresources_collection.md
+- VirtualDB Configuration: virtual_db_configuration.md
2 changes: 1 addition & 1 deletion tfbpapi/datacard.py
@@ -264,7 +264,7 @@ def get_repository_info(self) -> dict[str, Any]:
"dataset_types": [config.dataset_type.value for config in card.configs],
"total_files": total_files,
"last_modified": last_modified,
-"has_default_config": self.dataset_card.get_default_config() is not None,
+"has_default_config": self.dataset_card.default_config is not None,
}

def extract_metadata_schema(self, config_name: str) -> dict[str, Any]: