Running Experiments

This guide explains how to run experiments in A-LEMS, from simple single runs to complex batch experiments.


🚀 Quick Start

Run a simple experiment with one repetition:

python -m core.execution.tests.run_experiment \
    --tasks gsm8k_basic \
    --repetitions 1 \
    --providers local \
    --save-db

📋 Available Tasks

| ID | Name | Category | Level | Description |
|----|------|----------|-------|-------------|
| gsm8k_basic | GSM8K Arithmetic | reasoning | 1 | Grade school arithmetic |
| gsm8k_multi_step | Multi-step Arithmetic | reasoning | 2 | Multi-step math problems |
| logical_reasoning | Logical Deduction | reasoning | 2 | Deductive reasoning |
| commonsense_reasoning | Commonsense Reasoning | reasoning | 1 | Basic commonsense |
| code_fibonacci | Fibonacci Function | coding | 2 | Generate Python function |
| code_sorting | Sorting Algorithm | coding | 2 | Implement quicksort |
| bug_fixing | Bug Fixing | coding | 2 | Fix syntax errors |
| factual_qa | Factual Question | qa | 1 | Simple factual QA |
| science_qa | Science Question | qa | 1 | Basic science |
| geography_qa | Geography Question | qa | 1 | Geography knowledge |
| news_summary | News Summary | summarization | 1 | Summarize articles |
| research_summary | Research Summary | summarization | 2 | Academic abstracts |
| sentiment_analysis | Sentiment Analysis | classification | 1 | Classify sentiment |
| topic_classification | Topic Classification | classification | 1 | Classify topics |
| entity_extraction | Entity Extraction | extraction | 1 | Extract named entities |
| keyword_extraction | Keyword Extraction | extraction | 1 | Extract key terms |

Task Categories

| Category | Description | Example Tasks |
|----------|-------------|---------------|
| reasoning | Math, logic, and commonsense problems | gsm8k_basic, logical_reasoning |
| coding | Code generation and debugging | code_fibonacci, bug_fixing |
| qa | Factual and knowledge-based questions | factual_qa, science_qa |
| summarization | Text summarization tasks | news_summary, research_summary |
| classification | Text classification | sentiment_analysis, topic_classification |
| extraction | Information extraction | entity_extraction, keyword_extraction |

Level Meaning

| Level | Description |
|-------|-------------|
| 1 | Simple tasks, single-step, minimal tools |
| 2 | Complex tasks, multi-step, may use tools |
| 3 | Advanced tasks, multiple tools, synthesis |

Total Tasks: 16
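
When assembling a batch, it can be convenient to select task IDs by category or level instead of typing them out. The sketch below copies the task table into a hypothetical `TASKS` dictionary (an illustration, not an A-LEMS API) and joins the matches into the comma-separated form that `--tasks` expects.

```python
# Illustrative only: TASKS mirrors the task table above; it is not part of A-LEMS.
TASKS = {
    "gsm8k_basic": ("reasoning", 1),
    "gsm8k_multi_step": ("reasoning", 2),
    "logical_reasoning": ("reasoning", 2),
    "commonsense_reasoning": ("reasoning", 1),
    "code_fibonacci": ("coding", 2),
    "code_sorting": ("coding", 2),
    "bug_fixing": ("coding", 2),
    "factual_qa": ("qa", 1),
    "science_qa": ("qa", 1),
    "geography_qa": ("qa", 1),
    "news_summary": ("summarization", 1),
    "research_summary": ("summarization", 2),
    "sentiment_analysis": ("classification", 1),
    "topic_classification": ("classification", 1),
    "entity_extraction": ("extraction", 1),
    "keyword_extraction": ("extraction", 1),
}


def select_tasks(category=None, level=None):
    """Return task IDs matching the given category and/or level, in table order."""
    return [
        task_id
        for task_id, (cat, lvl) in TASKS.items()
        if (category is None or cat == category)
        and (level is None or lvl == level)
    ]


if __name__ == "__main__":
    # Build the value for --tasks from all level-2 coding tasks.
    print(",".join(select_tasks(category="coding", level=2)))
    # prints: code_fibonacci,code_sorting,bug_fixing
```

In practice, the output of `--list-tasks` is the authoritative source; a hard-coded copy like this one can drift out of date.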


🎯 Running Different Task Types

Simple Task (Level 1)

python -m core.execution.tests.run_experiment \
    --tasks factual_qa \
    --repetitions 3 \
    --providers local \
    --save-db

Complex Task (Level 2)

python -m core.execution.tests.run_experiment \
    --tasks gsm8k_multi_step \
    --repetitions 5 \
    --providers cloud \
    --save-db

☁️ Choosing Providers

Local Provider (No API Key)

python -m core.execution.tests.run_experiment \
    --tasks gsm8k_basic \
    --providers local \
    --save-db

Cloud Provider (Requires API Key)

# Set API key first
export GROQ_API_KEY="your-key-here"

python -m core.execution.tests.run_experiment \
    --tasks gsm8k_basic \
    --providers cloud \
    --save-db

Multiple Providers

python -m core.execution.tests.run_experiment \
    --tasks gsm8k_basic \
    --providers cloud,local \
    --repetitions 3 \
    --save-db

🔁 Repetitions for Statistical Significance

| Repetitions | Purpose | When to Use |
|-------------|---------|-------------|
| 1 | Quick test | Debugging, verifying setup |
| 3-5 | Initial results | Exploratory analysis |
| 10-30 | Statistical significance | Research papers, final results |
| 100+ | Production benchmarking | Large-scale studies |

# 30 repetitions for statistical power
python -m core.execution.tests.run_experiment \
    --tasks gsm8k_basic \
    --repetitions 30 \
    --providers local \
    --save-db
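
To see why more repetitions help, the following sketch summarizes a list of per-run energy readings with a mean, a sample standard deviation, and an approximate 95% confidence interval. It is a minimal illustration using only the standard library: the `summarize` helper and the sample values are hypothetical, and how you export readings from the A-LEMS database is left out.

```python
import math
import statistics


def summarize(energies_j, z=1.96):
    """Return (mean, sample std dev, half-width of an approximate 95% CI)."""
    n = len(energies_j)
    if n < 2:
        raise ValueError("need at least two repetitions to estimate spread")
    mean = statistics.mean(energies_j)
    stdev = statistics.stdev(energies_j)
    # Normal approximation; the interval narrows in proportion to 1/sqrt(n).
    half_width = z * stdev / math.sqrt(n)
    return mean, stdev, half_width


if __name__ == "__main__":
    # Made-up energy readings in joules, for illustration only.
    sample = [1.20, 1.25, 1.18, 1.22, 1.21]
    mean, stdev, hw = summarize(sample)
    print(f"mean={mean:.3f} J  stdev={stdev:.3f} J  95% CI=±{hw:.3f} J")
```

The normal approximation (`z=1.96`) is optimistic for very small samples; with only 3 to 5 repetitions a Student's t multiplier would give a wider, more honest interval.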

🧪 Test Harness (Quick Testing)

For faster iteration during development:

python -m core.execution.tests.test_harness \
    --task-id gsm8k_basic \
    --repetitions 1 \
    --provider local \
    --verbose

Differences from run_experiment:

  • ✅ Faster (less overhead)
  • ✅ Shows real-time hardware telemetry
  • ❌ Not for production results
  • ❌ Limited batch capabilities

📊 Real-time Progress

While running, you'll see:

📊 Progress: 2/6 runs
  Rep 1/3
    Linear: 1.2043 J
    Agentic: 2.5945 J
    Tax: 2.15x
  ✅ Pair 1 saved (linear: 123, agentic: 124)
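
The numbers in this sample are consistent with `Tax` being the ratio of agentic to linear energy (2.5945 / 1.2043 ≈ 2.15), and with each repetition producing one linear and one agentic run (3 repetitions, 6 runs). The guide does not state either definition explicitly, so treat the sketch below as a reading of the output under those assumptions; `energy_tax` and `total_runs` are hypothetical helper names.

```python
def energy_tax(linear_j, agentic_j):
    """Ratio of agentic to linear energy (assumed meaning of 'Tax')."""
    if linear_j <= 0:
        raise ValueError("linear energy must be positive")
    return agentic_j / linear_j


def total_runs(repetitions, modes=2):
    """Runs for one task on one provider, assuming one run per mode per repetition."""
    return repetitions * modes


if __name__ == "__main__":
    # Values taken from the sample output above.
    print(f"Tax: {energy_tax(1.2043, 2.5945):.2f}x")  # prints: Tax: 2.15x
    print(f"Runs: {total_runs(3)}")                   # prints: Runs: 6
```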

โ„๏ธ Cool-down Periods

Experiments include automatic cool-down between runs:

# Default 2 seconds
python -m core.execution.tests.run_experiment --tasks gsm8k_basic --repetitions 3

# Custom cool-down
python -m core.execution.tests.run_experiment --tasks gsm8k_basic --cool-down 5
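
Cool-down adds idle time that grows with the number of runs, which matters when planning large batches. The sketch below estimates that overhead, assuming one pause between each pair of consecutive runs and none after the last; `cooldown_overhead_s` is a hypothetical helper, and the actual scheduling inside A-LEMS may differ.

```python
def cooldown_overhead_s(runs, cool_down_s=2.0):
    """Idle seconds added by cool-down, assuming one pause between consecutive runs."""
    if runs < 1:
        return 0.0
    return (runs - 1) * cool_down_s


if __name__ == "__main__":
    # Compare the default 2 s pause with a custom 5 s pause over 60 runs.
    for cool_down in (2, 5):
        idle = cooldown_overhead_s(60, cool_down)
        print(f"{cool_down} s cool-down, 60 runs: {idle:.0f} s idle")
```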

🔧 Advanced Options

Disable Warm-up

python -m core.execution.tests.run_experiment \
    --tasks gsm8k_basic \
    --no-warmup

Specify Country for Carbon Intensity

python -m core.execution.tests.run_experiment \
    --tasks gsm8k_basic \
    --country IN  # India grid intensity
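
The country code presumably selects a grid carbon intensity that is used to convert measured energy into emissions. A common way to do that conversion is energy in kWh multiplied by intensity in gCO2e/kWh; the sketch below shows the arithmetic under that assumption. The `emissions_g` helper and the intensity value are placeholders, not figures or functions taken from A-LEMS.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J


def emissions_g(energy_j, intensity_g_per_kwh):
    """Grams of CO2-equivalent for the given energy and grid carbon intensity."""
    return energy_j / JOULES_PER_KWH * intensity_g_per_kwh


if __name__ == "__main__":
    # 500 gCO2e/kWh is a placeholder, not a real figure for any country.
    print(f"{emissions_g(2.5945, 500.0):.6f} g CO2e")
```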

Enable Optimizer (Experimental)

python -m core.execution.tests.run_experiment \
    --tasks gsm8k_basic \
    --optimizer

📝 Command Reference

| Option | Description | Example |
|--------|-------------|---------|
| --tasks | Comma-separated task IDs | --tasks gsm8k_basic,factual_qa |
| --repetitions | Number of repetitions | --repetitions 10 |
| --providers | Comma-separated providers | --providers cloud,local |
| --save-db | Save results to database | --save-db |
| --verbose | Detailed output | --verbose |
| --cool-down | Seconds between runs | --cool-down 5 |
| --no-warmup | Skip warm-up runs | --no-warmup |
| --country | Country code for carbon intensity | --country US |
| --optimizer | Enable optimizer | --optimizer |
| --list-tasks | Show available tasks | --list-tasks |
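
For scripted batch experiments, the options above can be assembled programmatically. The sketch below builds an argument list from the documented flags and, optionally, launches it with `subprocess`. The `build_command` and `launch` helpers are hypothetical wrappers written for this guide; only the module path and flag names come from the commands shown here.

```python
import subprocess
import sys


def build_command(tasks, providers=("local",), repetitions=1, cool_down=None,
                  save_db=True, extra_flags=()):
    """Assemble an argv list for run_experiment from the documented flags."""
    cmd = [
        sys.executable, "-m", "core.execution.tests.run_experiment",
        "--tasks", ",".join(tasks),
        "--providers", ",".join(providers),
        "--repetitions", str(repetitions),
    ]
    if cool_down is not None:
        cmd += ["--cool-down", str(cool_down)]
    cmd.extend(extra_flags)  # e.g. ("--no-warmup",) or ("--verbose",)
    if save_db:
        cmd.append("--save-db")
    return cmd


def launch(cmd):
    """Run the experiment; must be called from the A-LEMS project root."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_command(["gsm8k_basic", "factual_qa"],
                        providers=["cloud", "local"],
                        repetitions=10, cool_down=5)
    print(" ".join(cmd))  # inspect the command before calling launch(cmd)
```

Passing the arguments as a list avoids shell quoting problems, and printing the command first makes it easy to confirm the batch before committing to a long run.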

⚠️ Common Issues

| Issue | Solution |
|-------|----------|
| No valid tasks selected | Check task ID with --list-tasks |
| API key not found | Set environment variable: export GROQ_API_KEY="key" |
| Permission denied | Run sudo ./scripts/fix_permissions.sh |
| Database locked | Wait or remove lock file |
| No baseline | First run measures automatically |
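
Some of these issues can be caught before a long batch starts. As a minimal sketch, the check below reports a missing `GROQ_API_KEY` whenever the `cloud` provider is requested. `missing_requirements` is a hypothetical helper, and it covers only the API key case from the table above.

```python
import os


def missing_requirements(providers, env=None):
    """Return a list of human-readable problems to fix before launching."""
    env = os.environ if env is None else env
    problems = []
    # The cloud provider needs an API key; the local provider does not.
    if "cloud" in providers and not env.get("GROQ_API_KEY"):
        problems.append(
            'GROQ_API_KEY is not set; run: export GROQ_API_KEY="your-key-here"'
        )
    return problems


if __name__ == "__main__":
    for problem in missing_requirements(["cloud", "local"]):
        print(f"Preflight: {problem}")
```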

✅ Next Steps