Skip to content

Database Schema

This document describes the core database schema of A-LEMS, focusing on the main tables every developer should understand.


๐Ÿ“Š Schema Overview

A-LEMS uses a relational database with 10+ tables to store complete data lineage. The schema is designed for:

  • Research reproducibility - Complete provenance of all measurements
  • ML readiness - Flattened views for model training
  • Performance - Indexed for fast queries
  • Extensibility - Easy to add new metrics

Database Schema Diagram

Auto-generated from live database. Always up-to-date.


๐Ÿ—„๏ธ Core Tables

1. experiments

Stores metadata for each experiment session.

Column Type Description
exp_id INTEGER Primary key
name TEXT Experiment name
description TEXT Experiment description
workflow_type TEXT 'linear', 'agentic', or 'comparison'
model_name TEXT LLM model used
provider TEXT 'cloud' or 'local'
task_name TEXT Task identifier
country_code TEXT Grid carbon intensity region
created_at TIMESTAMP When experiment started
group_id TEXT Session identifier
status TEXT 'pending', 'running', 'completed', 'failed'
started_at TIMESTAMP Actual start time
completed_at TIMESTAMP Actual end time
error_message TEXT Error if failed
runs_completed INTEGER Number of successful runs
runs_total INTEGER Total planned runs
optimization_enabled INTEGER 0/1 flag
hw_id INTEGER Foreign key to hardware_config
env_id INTEGER Foreign key to environment_config

Foreign Keys: - hw_id โ†’ hardware_config.hw_id - env_id โ†’ environment_config.env_id


2. hardware_config

Stores hardware fingerprints for reproducibility across different machines.

Column Type Description
hw_id INTEGER Primary key
hardware_hash TEXT Unique hardware fingerprint
hostname TEXT Machine hostname
cpu_model TEXT CPU model name
cpu_cores INTEGER Physical cores
cpu_threads INTEGER Logical threads
ram_gb REAL Total RAM in GB
kernel_version TEXT Linux kernel version
microcode_version TEXT CPU microcode
rapl_domains TEXT JSON array of available RAPL domains
created_at TIMESTAMP Record creation time

Additional Hardware Details (if available): - cpu_architecture - x86_64, arm64, etc. - cpu_vendor - GenuineIntel, AMD, etc. - cpu_family - CPU family ID - cpu_model_id - CPU model ID - cpu_stepping - CPU stepping - has_avx2, has_avx512, has_vmx - CPU feature flags - gpu_model, gpu_driver, gpu_count - GPU information - gpu_power_available - GPU power monitoring support - rapl_has_dram, rapl_has_uncore - RAPL capabilities - system_manufacturer, system_product, system_type - System info - virtualization_type - VM detection


3. environment_config

Stores software environment for reproducibility.

Column Type Description
env_id INTEGER Primary key
env_hash TEXT Unique environment fingerprint
python_version TEXT Python version
python_implementation TEXT CPython, PyPy, etc.
os_name TEXT Linux, Darwin, Windows
os_version TEXT OS version string
kernel_version TEXT Kernel release
git_commit TEXT Git commit hash
git_branch TEXT Git branch name
git_dirty BOOLEAN Uncommitted changes present
numpy_version TEXT NumPy version if installed
torch_version TEXT PyTorch version if installed
transformers_version TEXT Transformers version if installed
created_at TIMESTAMP Record creation time

4. runs

Core table storing all experiment run data (80+ columns). This is the main fact table.

Column Group Description Example Columns
Identifiers Foreign keys and run info run_id, exp_id, hw_id, baseline_id, run_number, workflow_type
Timing Nanosecond precision timestamps start_time_ns, end_time_ns, duration_ns
Energy Energy in microjoules total_energy_uj, dynamic_energy_uj, pkg_energy_uj, core_energy_uj
Performance CPU performance counters instructions, cycles, ipc, cache_misses
Scheduler OS scheduling metrics context_switches, thread_migrations, run_queue_length
Thermal Temperature readings package_temp_celsius, start_temp_c, max_temp_c
C-States CPU idle states c2_time_seconds, c3_time_seconds, c6_time_seconds
Memory Memory usage rss_memory_mb, vms_memory_mb
Tokens LLM token counts total_tokens, prompt_tokens, completion_tokens
Agentic Agent-specific metrics planning_time_ms, execution_time_ms, llm_calls, tool_calls
Sustainability Environmental impact carbon_g, water_ml, methane_mg
Efficiency Derived metrics energy_per_instruction, energy_per_token

Foreign Keys: - exp_id โ†’ experiments.exp_id - hw_id โ†’ hardware_config.hw_id - baseline_id โ†’ idle_baselines.baseline_id


5. idle_baselines

Stores idle power measurements used for baseline subtraction.

Column Type Description
baseline_id TEXT Primary key
timestamp REAL Measurement timestamp
package_power_watts REAL Package idle power
core_power_watts REAL Core idle power
uncore_power_watts REAL Uncore idle power
dram_power_watts REAL DRAM idle power
duration_seconds INTEGER Measurement duration
sample_count INTEGER Number of samples
package_std REAL Standard deviation
core_std REAL Standard deviation
uncore_std REAL Standard deviation
dram_std REAL Standard deviation
governor TEXT CPU governor during measurement
turbo TEXT Turbo boost status
background_cpu REAL Background CPU usage
process_count INTEGER Running processes
method TEXT Measurement method

๐Ÿ“ˆ Sample Tables

6. energy_samples

High-frequency (100Hz) RAPL samples.

Column Type Description
sample_id INTEGER Primary key
run_id INTEGER Foreign key to runs
timestamp_ns INTEGER Nanosecond timestamp
pkg_energy_uj INTEGER Package energy (ยตJ)
core_energy_uj INTEGER Core energy (ยตJ)
uncore_energy_uj INTEGER Uncore energy (ยตJ)
dram_energy_uj INTEGER DRAM energy (ยตJ)

Index: idx_energy_run_time on (run_id, timestamp_ns)


7. cpu_samples

10Hz CPU telemetry from turbostat.

Column Type Description
sample_id INTEGER Primary key
run_id INTEGER Foreign key to runs
timestamp_ns INTEGER Nanosecond timestamp
cpu_util_percent REAL CPU utilization
cpu_busy_mhz REAL Busy frequency
cpu_avg_mhz REAL Average frequency
c1_residency to c10_residency REAL C-state residency percentages
package_power REAL Package power (W)
dram_power REAL DRAM power (W)
gpu_rc6 REAL GPU RC6 residency
package_temp REAL Package temperature (ยฐC)
ipc REAL Instructions per cycle
extra_metrics_json TEXT JSON blob for additional metrics

8. interrupt_samples

10Hz interrupt rate samples.

Column Type Description
sample_id INTEGER Primary key
run_id INTEGER Foreign key to runs
timestamp_ns INTEGER Nanosecond timestamp
interrupts_per_sec REAL Interrupt rate

9. thermal_samples

1Hz temperature samples from all thermal zones.

Column Type Description
sample_id INTEGER Primary key
run_id INTEGER Foreign key to runs
timestamp_ns INTEGER Nanosecond timestamp
sample_time_s REAL Seconds since run start
cpu_temp REAL CPU temperature (ยฐC)
system_temp REAL System temperature (ยฐC)
wifi_temp REAL WiFi temperature (ยฐC)
throttle_event INTEGER Throttling detected (0/1)
all_zones_json TEXT JSON of all thermal zones
sensor_count INTEGER Number of sensors

๐Ÿงฉ Event Tables

10. orchestration_events

Tracks agentic workflow steps for tax calculation.

Column Type Description
event_id INTEGER Primary key
run_id INTEGER Foreign key to runs
step_index INTEGER Step number
phase TEXT 'planning', 'execution', 'synthesis'
event_type TEXT 'llm_call', 'tool_call', 'waiting'
start_time_ns INTEGER Event start
end_time_ns INTEGER Event end
duration_ns INTEGER Event duration
power_watts REAL Power during event
cpu_util_percent REAL CPU utilization
interrupt_rate REAL Interrupt rate
event_energy_uj INTEGER Energy consumed
tax_contribution_uj INTEGER Tax contribution
tax_percent REAL Tax percentage

11. orchestration_tax_summary

Pre-computed tax metrics per pair of runs.

Column Type Description
comparison_id INTEGER Primary key
linear_run_id INTEGER Foreign key to runs
agentic_run_id INTEGER Foreign key to runs
linear_dynamic_uj INTEGER Linear energy
agentic_dynamic_uj INTEGER Agentic energy
orchestration_tax_uj INTEGER Absolute tax (agentic - linear)
tax_percent REAL Tax percentage
linear_orchestration_uj INTEGER Linear orchestration
agentic_orchestration_uj INTEGER Agentic orchestration

๐Ÿ‘๏ธ Views

12. ml_features

Flattened view for ML training with 80+ features.

CREATE VIEW ml_features AS
SELECT
    r.run_id,
    e.workflow_type,
    e.provider,
    r.duration_ns / 1e6 AS duration_ms,
    r.dynamic_energy_uj / 1e6 AS energy_j,
    r.ipc,
    r.cache_miss_rate,
    r.instructions,
    r.cycles,
    -- ... 70+ more columns
FROM runs r
JOIN experiments e ON r.exp_id = e.exp_id;
๐Ÿ”— Relationships Diagram
text
markdown
### Database Relationships

The database schema follows these key relationships:


text

#### Foreign Key Dependencies

| Source Table | Foreign Key | Target Table | Relationship |
|--------------|-------------|--------------|--------------|
| `runs` | `exp_id` | `experiments` | Many runs belong to one experiment |
| `runs` | `hw_id` | `hardware_config` | Many runs share same hardware |
| `runs` | `env_id` | `environment_config` | Many runs share same environment |
| `energy_samples` | `run_id` | `runs` | One run has many samples |
| `cpu_samples` | `run_id` | `runs` | One run has many CPU samples |
| `interrupt_samples` | `run_id` | `runs` | One run has many interrupt samples |
| `thermal_samples` | `run_id` | `runs` | One run has many thermal samples |
| `orchestration_events` | `run_id` | `runs` | One run has many events |
| `llm_interactions` | `run_id` | `runs` | One run has many LLM interactions |
| `orchestration_tax_summary` | `linear_run_id` | `runs` | Links linear run to tax summary |
| `orchestration_tax_summary` | `agentic_run_id` | `runs` | Links agentic run to tax summary |

#### Cardinality Summary

| Relationship | Type | Description |
|--------------|------|-------------|
| experiments โ†’ runs | One-to-Many | One experiment can have multiple runs |
| hardware_config โ†’ runs | One-to-Many | Same hardware used for many runs |
| environment_config โ†’ runs | One-to-Many | Same environment used for many runs |
| runs โ†’ sample tables | One-to-Many | One run generates many samples |
| runs โ†’ orchestration_events | One-to-Many | One run generates many events |
| runs โ†’ orchestration_tax_summary | One-to-One | One pair generates one tax summary |


## ๐Ÿ“Š Sample Queries

### Get Latest Experiment with Hardware

```sql
SELECT 
    e.exp_id,
    e.task_name,
    e.provider,
    h.cpu_model,
    env.python_version,
    env.git_commit
FROM experiments e
LEFT JOIN hardware_config h ON e.hw_id = h.hw_id
LEFT JOIN environment_config env ON e.env_id = env.env_id
ORDER BY e.exp_id DESC
LIMIT 1;

Calculate Average Tax by Provider

SELECT 
    e.provider,
    AVG(ots.tax_percent) as avg_tax,
    COUNT(*) as samples
FROM orchestration_tax_summary ots
JOIN runs r ON ots.linear_run_id = r.run_id
JOIN experiments e ON r.exp_id = e.exp_id
GROUP BY e.provider;

Get Energy Samples for a Run

SELECT 
    timestamp_ns,
    pkg_energy_uj / 1e6 as energy_j
FROM energy_samples
WHERE run_id = 977
ORDER BY timestamp_ns;

๐Ÿ”ง Maintenance

Run Migrations

python scripts/migrate_db.py

Create Fresh Database

python scripts/setup_fresh_db.py

Backup Database

cp data/experiments.db data/experiments.db.backup_$(date +%Y%m%d)

This schema document corresponds to the diagram at ../assets/diagrams/database-schema-auto.svg.