BrentLab · cmatKhan · Feb 6, 2026 · Feb 4, 2026 · Feb 6, 2026
diff --git a/docs/tutorials/virtual_db_tutorial.ipynb b/docs/tutorials/virtual_db_tutorial.ipynb
diff --git a/docs/virtual_database_concepts.md b/docs/virtual_database_concepts.md
diff --git a/docs/virtual_db.md b/docs/virtual_db.md
@@ -1,11 +1,22 @@
 # VirtualDB
 
+VirtualDB provides a unified query interface across heterogeneous datasets with
+different experimental condition structures and terminologies. Each dataset
+defines experimental conditions in its own way, with properties stored at
+different hierarchy levels (repository, dataset, or field) and using different
+naming conventions. VirtualDB uses an external YAML configuration to map these
+varying structures to a common schema, normalize factor level names (e.g.,
+"D-glucose", "dextrose", "glu" all become "glucose"), and enable cross-dataset
+queries with standardized field names and values.
+
+## API Reference
+
 ::: tfbpapi.virtual_db.VirtualDB
     options:
       show_root_heading: true
       show_source: true
 
-## Helper Functions
+### Helper Functions
 
 ::: tfbpapi.virtual_db.get_nested_value
     options:
@@ -14,8 +25,3 @@
 ::: tfbpapi.virtual_db.normalize_value
     options:
       show_root_heading: true
-
-## Usage
-
-For comprehensive usage documentation including comparative datasets, see
-[Virtual Database Concepts](virtual_database_concepts.md).
diff --git a/docs/virtual_db_configuration.md b/docs/virtual_db_configuration.md
@@ -0,0 +1,257 @@
+# VirtualDB Configuration Guide
+
+VirtualDB requires a YAML configuration file that defines which datasets to
+include, how to map their fields to common names, and how to normalize factor
+levels.
+
+## Basic Example
+
+```yaml
+repositories:
+  # Each repository defines a "table" in the virtual database
+  BrentLab/harbison_2004:
+    # REQUIRED: Specify which field is the sample identifier. At this level, it means
+    # that all datasets have a field `sample_id` that uniquely identifies samples.
+    sample_id:
+      field: sample_id
+    # Repository-wide properties (apply to all datasets in this repository)
+    # Paths are explicit from the datacard root
+    nitrogen_source:
+      path: experimental_conditions.media.nitrogen_source.name
+
+    dataset:
+      # Each dataset gets its own view with standardized fields
+      harbison_2004:
+        # note: this is optional. If not specified, then the config_name is used.
+        # This is useful if the config_name isn't suited to a table name, or if it
+        # were to conflict with another dataset in the configuration
+        db_name: harbison
+        # Dataset-specific properties (constant for all samples)
+        # Explicit path from datacard/config root
+        phosphate_source:
+          path: experimental_conditions.media.phosphate_source.compound
+
+        # Field-level properties (vary per sample)
+        # Path is relative to field's definitions dict
+        carbon_source:
+          field: condition
+          path: media.carbon_source.compound
+          dtype: string  # Optional: specify data type
+
+        # Field without path (column alias with normalization)
+        environmental_condition:
+          field: condition
+
+  BrentLab/kemmeren_2014:
+    dataset:
+      kemmeren_2014:
+        # optional -- see the note for `db_name` in harbison above
+        db_name: kemmeren
+        # REQUIRED: If `sample_id` isn't defined at the repo level, then it must be
+        # defined at the dataset level for each dataset in the repo
+        sample_id:
+          field: sample_id
+        # Same logical fields, different physical paths
+        # Explicit path from datacard/config root
+        carbon_source:
+          path: experimental_conditions.media.carbon_source.compound
+          dtype: string
+        temperature_celsius:
+          path: experimental_conditions.temperature_celsius
+          dtype: numeric  # Enables numeric filtering with comparison operators
+
+  # Comparative dataset example
+  BrentLab/yeast_comparative_analysis:
+    dataset:
+      dto:
+        # Use field mappings to change a field's displayed name. If not specifically
+        # listed, then the field is included as it exists in the source data
+        dto_fdr:
+          field: dto_fdr
+        dto_empirical_pvalue:
+          field: empirical_pvalue
+
+        # links specify which primary datasets are referenced by composite ID fields
+        links:
+          binding_id:
+            - [BrentLab/harbison_2004, harbison_2004]
+          perturbation_id:
+            - [BrentLab/kemmeren_2014, kemmeren_2014]
+
+# ===== Normalization Rules =====
+# Map varying terminologies to standardized values
+factor_aliases:
+  carbon_source:
+    glucose: [D-glucose, glu, dextrose]
+    galactose: [D-galactose, gal]
+
+# Handle missing values with defaults
+missing_value_labels:
+  carbon_source: "unspecified"
+
+# ===== Documentation =====
+description:
+  carbon_source: The carbon source provided to the cells during growth
+```
+
+### Property Hierarchy
+
+Properties are extracted at three hierarchy levels:
+
+1. **Repository-wide**: Common to all datasets in a repository
+   - Paths relative to datacard/config root (explicit)
+   - Example: `path: experimental_conditions.media.nitrogen_source.name`
+
+2. **Dataset-specific**: Specific to one dataset configuration
+   - Paths relative to datacard/config root (explicit)
+   - Example: `path: experimental_conditions.media.phosphate_source.compound`
+
+3. **Field-level**: Vary per sample, defined in field definitions
+   - `field` specifies which field to extract from
+   - `path` relative to that field's definitions dict
+   - Example: `field: condition, path: media.carbon_source.compound`
+
+**Special case**: A field mapping without a `path` creates a column alias.
+
+- `field: condition` (no path) renames the `condition` column and enables normalization
+
+### Path Resolution
+
+Paths use dot notation to navigate nested structures:
+
+**Repository/Dataset-level** (explicit paths from datacard root):
+
+- `path: experimental_conditions.temperature_celsius` - access experimental conditions
+- `path: experimental_conditions.media.carbon_source.compound` - nested condition data
+- `path: description` - access fields outside experimental_conditions
+
+**Field-level** (paths relative to field definitions):
+
+- `field: condition, path: media.carbon_source.compound` looks in field
+  `condition`'s definitions and navigates to `media.carbon_source.compound`
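+
+As a rough illustration of the two resolution modes, the fragment below sketches
+what the relevant parts of a datacard might look like. The layout (in particular
+the `definitions` key and the `YPD` level) and all values are hypothetical,
+chosen only so that the paths above have something to resolve against. A real
+datacard may be organized differently.
+
+```yaml
+# Hypothetical datacard excerpt; structure and values are illustrative only
+experimental_conditions:
+  temperature_celsius: 30      # path: experimental_conditions.temperature_celsius
+  media:
+    carbon_source:
+      compound: D-glucose      # path: experimental_conditions.media.carbon_source.compound
+
+# Hypothetical definitions for the `condition` field (assumed layout)
+condition:
+  definitions:
+    YPD:
+      media:
+        carbon_source:
+          compound: D-glucose  # field: condition, path: media.carbon_source.compound
+```
+
+With the `factor_aliases` from the basic example, both `D-glucose` values would
+be reported as `glucose`.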
+
+### Data Type Specifications
+
+Field mappings support an optional `dtype` parameter to ensure proper type handling
+during metadata extraction and query filtering.
+
+**Supported dtypes**:
+
+- `string` - Text data (default if not specified)
+- `numeric` - Numeric values (integers or floating-point numbers)
+- `bool` - Boolean values (true/false)
+
+**When to use dtype**:
+
+1. **Numeric filtering**: Required for fields used with comparison operators
+   (`<`, `>`, `<=`, `>=`, `between`)
+2. **Type consistency**: When source data might be extracted with incorrect type
+3. **Performance**: Helps with query optimization and prevents type mismatches
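+
+The fragment below is a minimal sketch of one mapping per dtype. Only
+`temperature_celsius` appears in the basic example; `strain_background`,
+`is_control`, and their paths are hypothetical names made up for illustration,
+and combining `dtype` with a path-less `field` mapping is an assumption.
+
+```yaml
+# Illustrative mappings; all names except temperature_celsius are hypothetical
+temperature_celsius:
+  path: experimental_conditions.temperature_celsius
+  dtype: numeric   # needed for comparisons such as temperature_celsius > 30
+strain_background:
+  path: experimental_conditions.strain_background
+  dtype: string    # the default, shown only for completeness
+is_control:
+  field: is_control
+  dtype: bool      # assumed to work on a column alias (field without path)
+```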
+
+## Comparative Datasets
+
+Comparative datasets differ from other dataset types in that they represent
+relationships between samples across datasets rather than individual samples.
+Each row relates two or more samples from other datasets.
+
+### Structure
+
+Comparative datasets use `source_sample` fields instead of a single `sample_id`:
+
+- Multiple fields with `role: source_sample`
+- Each contains a composite identifier: `"repo_id;config_name;sample_id"`
+- Example: `binding_id = "BrentLab/harbison_2004;harbison_2004;42"`
+
+### Fields
+
+All fields in the comparative dataset are included. They can be renamed
+(aliased) by explicitly mapping them in the configuration.
+
+```yaml
+dto:
+  # this would make the displayed field name 'dto_pvalue'
+  # instead of 'empirical_pvalue'
+  dto_pvalue:
+    field: empirical_pvalue
+```
+
+### Link Structure
+
+The `links` section specifies how the composite IDs map to primary datasets. Each
+key under `links` is the name of a field in the comparative dataset that
+contains composite IDs. Its value is a list of `[repo_id, config_name]`
+pairs indicating which primary datasets are referenced by that field. Those primary
+datasets must also be defined in the overall VirtualDB configuration.
+
+```yaml
+# Within the comparative dataset config
+dto:
+  links:
+    binding_id:
+      - [BrentLab/harbison_2004, harbison_2004]  # [repo_id, config_name]
+      - [BrentLab/callingcards, annotated_features]
+    perturbation_id:
+      - [BrentLab/kemmeren_2014, kemmeren_2014]
+```
+
+See the [HuggingFace datacard documentation](huggingface_datacard.md#5-comparative)
+for a more detailed explanation of comparative datasets and composite IDs.
+
+## Internal Structure
+
+VirtualDB uses an in-memory DuckDB database to construct a layered hierarchy
+of SQL views over locally cached Parquet files. Views are created lazily on
+first query and are not persisted to disk.
+
+### View Hierarchy
+
+For each configured dataset, VirtualDB registers a series of views that
+build on each other. Using `harbison` as an example primary dataset and
+`dto` as a comparative dataset:
+
+**1. Metadata view** -- `harbison_meta`
+
+One row per unique `sample_id`. Derived columns from the configuration
+(e.g., `carbon_source`, `temperature_celsius`) are resolved here using
+datacard definitions, factor aliases, and missing value labels. This is
+the primary view for querying sample-level metadata.
+
+**2. Raw data view** -- `harbison`
+
+The full parquet data joined to the metadata view so that every row
+carries both the raw measurement columns and the derived metadata
+columns. **Developer note**: An internal view called `__<db_name>_parquet`
+holds just the raw parquet data without any metadata joins or derived columns.
+It serves as the base for the join to the metadata view but is not exposed
+directly to users.
+
+**3. Expanded view (comparative only)** -- `dto_expanded`
+
+For comparative datasets, each composite ID field (e.g. `binding_id`
+with format `"repo_id;config_name;sample_id"`) is parsed into two
+additional columns:
+
+- `<link_field>_source` -- the `repo_id;config_name` prefix, aliased
+  to the configured `db_name` when the pair is in the VirtualDB config.
+  For example, `BrentLab/harbison_2004;harbison_2004` becomes `harbison`.
+- `<link_field>_id` -- the sample_id component.
+
+This makes it straightforward to join back to primary dataset views
+or filter by source dataset without parsing composite IDs in SQL.
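+
+As a worked example, take the composite ID from the Structure section together
+with the basic configuration, where `BrentLab/harbison_2004` / `harbison_2004`
+has `db_name: harbison`. The fragment below is not configuration. It only lays
+out the relevant columns of one `dto_expanded` row as key/value pairs. The type
+of `binding_id_id` is not specified here, so the value is quoted as a string by
+assumption.
+
+```yaml
+# One dto_expanded row, written as key/value pairs for illustration
+binding_id: "BrentLab/harbison_2004;harbison_2004;42"  # original composite ID
+binding_id_source: harbison                             # prefix aliased to db_name
+binding_id_id: "42"                                     # sample_id component
+```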
+
+### View Diagram
+
+```
+__harbison_parquet  (raw parquet, not directly exposed)
+  |
+  +-> harbison_meta  (deduplicated, one row per sample_id,
+  |                   with derived columns from config)
+  |
+  +-> harbison  (full parquet joined to harbison_meta)
+
+__dto_parquet  (raw parquet, not directly exposed)
+  |
+  +-> dto_expanded  (parquet + parsed columns:
+                     binding_id_source, binding_id_id,
+                     perturbation_id_source, perturbation_id_id)
+```
+
+## Usage
+
+For usage examples and tutorials,
+see the [VirtualDB Tutorial](tutorials/virtual_db_tutorial.ipynb).
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -154,8 +154,6 @@ nav:
       - "Cache Management": tutorials/cache_manager_tutorial.ipynb
     - "Querying Data":
       - "VirtualDB: Unified Cross-Dataset Queries": tutorials/virtual_db_tutorial.ipynb
-  - Concepts:
-    - "Virtual Database Design": virtual_database_concepts.md
   - API Reference:
     - Core:
       - VirtualDB: virtual_db.md
@@ -169,3 +167,4 @@ nav:
   - HuggingFace Configuration:
     - HuggingFace Dataset Card Format: huggingface_datacard.md
     - BrentLab Collection: brentlab_yeastresources_collection.md
+  - VirtualDB Configuration: virtual_db_configuration.md
diff --git a/tfbpapi/datacard.py b/tfbpapi/datacard.py
@@ -264,7 +264,7 @@ def get_repository_info(self) -> dict[str, Any]:
             "dataset_types": [config.dataset_type.value for config in card.configs],
             "total_files": total_files,
             "last_modified": last_modified,
-            "has_default_config": self.dataset_card.get_default_config() is not None,
+            "has_default_config": self.dataset_card.default_config is not None,
         }
 
     def extract_metadata_schema(self, config_name: str) -> dict[str, Any]: