Adding New Experiment Tasks¶
This guide explains how to add new tasks to A-LEMS for both linear and agentic experiments.
Task Overview¶
Tasks are defined in config/tasks.yaml and follow a simple structure:
```yaml
task_id:
  description: "Human-readable description"
  level: 1|2|3
  tools: 0|1|2|...
  prompt: "The actual task prompt for the LLM"
  expected_output: "Expected answer (optional)"
```
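The file can be loaded with PyYAML (an assumption; any YAML parser works). A minimal sketch, with one task definition inlined for illustration:

```python
import yaml

# A task definition, shown inline here for illustration;
# in practice, read it from config/tasks.yaml.
raw = """
tasks:
  factual_qa:
    description: "Factual Question"
    level: 1
    tools: 0
    prompt: "What is the capital of France?"
    expected_output: "Paris"
"""

tasks = yaml.safe_load(raw)["tasks"]
print(tasks["factual_qa"]["level"])  # → 1
```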
Task Levels¶
| Level | Description | Example | Tool Usage |
|---|---|---|---|
| 1 | Simple tasks | Factual QA, basic math | 0 tools (LLM only) |
| 2 | Moderate complexity | Multi-step reasoning | 1-2 tools |
| 3 | Complex tasks | Research synthesis | 3+ tools |
Task Definition Fields¶
| Field | Required | Type | Description |
|---|---|---|---|
| `description` | ✅ Yes | String | Human-readable description |
| `level` | ✅ Yes | Integer | 1, 2, or 3 |
| `tools` | ✅ Yes | Integer | Number of tools available |
| `prompt` | ✅ Yes | String | The actual prompt sent to the LLM |
| `expected_output` | ❌ No | String | For validation/accuracy checking |
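A small helper can enforce the required fields and types before a task runs. This is a sketch, not part of the harness; the field names and types come from the table above:

```python
# Required fields and their expected types, per the table above.
REQUIRED = {"description": str, "level": int, "tools": int, "prompt": str}

def check_task(task_id, spec):
    """Raise ValueError if a task definition is missing or mistypes a field."""
    for field, ftype in REQUIRED.items():
        if field not in spec:
            raise ValueError(f"{task_id}: missing required field '{field}'")
        if not isinstance(spec[field], ftype):
            raise ValueError(f"{task_id}: '{field}' must be {ftype.__name__}")
    if spec["level"] not in (1, 2, 3):
        raise ValueError(f"{task_id}: level must be 1, 2, or 3")

# Passes silently for a well-formed definition:
check_task("factual_qa", {
    "description": "Factual Question",
    "level": 1,
    "tools": 0,
    "prompt": "What is the capital of France?",
})
```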
Step-by-Step: Adding a Task¶
Step 1: Edit config/tasks.yaml¶
```yaml
# config/tasks.yaml
tasks:
  # Existing tasks...

  my_new_task:
    description: "My Custom Task Description"
    level: 2
    tools: 1
    prompt: |
      Solve this problem step by step:
      A train leaves Chicago at 2 PM traveling at 60 mph.
      Another train leaves New York at 3 PM traveling at 70 mph.
      The distance between cities is 800 miles.
      When will they meet?
    expected_output: "8:42 PM"
```
Step 2: Verify Task Loads¶
```bash
# List all tasks to verify
python -m core.execution.tests.run_experiment --list-tasks
```
You should see your new task in the list:
```
Available Tasks:
----------------------------------------------------------------------
ID                   Name                        Level   Tools
----------------------------------------------------------------------
my_new_task          My Custom Task                  2       1
...
```
Task Types by Level¶
Level 1: Simple Tasks (LLM Only)¶
```yaml
factual_qa:
  description: "Factual Question"
  level: 1
  tools: 0
  prompt: "What is the capital of France?"
  expected_output: "Paris"

science_qa:
  description: "Science Question"
  level: 1
  tools: 0
  prompt: "What is the chemical symbol for gold?"
  expected_output: "Au"
```
Level 2: Moderate Tasks (1-2 Tools)¶
```yaml
gsm8k_basic:
  description: "GSM8K Arithmetic"
  level: 2
  tools: 1
  prompt: |
    Natalia sold clips to 48 of her friends in April,
    and then she sold half as many clips in May.
    How many clips did Natalia sell altogether in April and May?
  expected_output: "72"

logical_reasoning:
  description: "Logical Deduction"
  level: 2
  tools: 1
  prompt: |
    All humans are mortal. Socrates is human.
    Therefore, Socrates is mortal. Is this valid?
  expected_output: "Yes"
```
Level 3: Complex Tasks (3+ Tools)¶
```yaml
research_summary:
  description: "Research Abstract Summary"
  level: 3
  tools: 3
  prompt: |
    Summarize the key findings from this research paper abstract:
    [PAPER TEXT HERE]
    Include methodology, results, and conclusions.
  expected_output: "The paper presents..."
```
Task with Multiple Tools¶
```yaml
complex_math:
  description: "Complex Multi-step Math"
  level: 3
  tools: 2
  prompt: |
    Calculate the compound interest on $5000 invested for 3 years
    at 4.5% annual rate compounded monthly.
    Then calculate the equivalent simple interest.
    Finally, find the difference between the two.
  expected_output: "Compound: $721.24, Simple: $675.00, Difference: $46.24"
```
The agentic executor will:
- Plan the steps
- Use calculator tool for each calculation
- Synthesize results
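The expected_output figures for this task can be derived directly from the standard interest formulas. A quick sketch (values rounded to cents):

```python
principal, annual_rate, years = 5000.0, 0.045, 3

# Compound interest, compounded monthly: P * (1 + r/12)^(12t) - P
compound = principal * (1 + annual_rate / 12) ** (12 * years) - principal

# Simple interest over the same period: P * r * t
simple = principal * annual_rate * years

# Difference between the two methods
difference = compound - simple
print(round(compound, 2), round(simple, 2), round(difference, 2))
```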
Task Categories¶
You can also organize tasks by category in config/task_categories.yaml:
```yaml
# config/task_categories.yaml
math:
  - gsm8k_basic
  - gsm8k_multi_step
  - complex_math
reasoning:
  - logical_reasoning
  - commonsense_reasoning
qa:
  - factual_qa
  - science_qa
  - geography_qa
code:
  - code_fibonacci
  - code_sorting
  - bug_fixing
```
Query by category:
```sql
SELECT * FROM task_categories WHERE category = 'math';
```
How Tasks Are Executed¶
Linear Execution¶
```python
# In LinearExecutor.execute()
def execute(self, prompt):
    # Single LLM call
    response = self._call_llm(prompt)
    return response
```
Agentic Execution¶
```python
# In AgenticExecutor.execute()
def execute(self, task):
    # Phase 1: Planning
    plan = self._create_plan(task)

    # Phase 2: Execution
    results = []
    for step in plan['steps']:
        if step['type'] == 'tool':
            result = self._execute_tool(step['tool'], step['args'])
        else:
            result = self._call_llm(step['prompt'])
        results.append(result)

    # Phase 3: Synthesis
    final = self._synthesize(task, plan, results)
    return final
```
Task Complexity Score¶
The system automatically calculates complexity:
```python
def _calculate_complexity_score(llm_calls, tool_calls, total_tokens):
    """
    Complexity = weighted combination of:
    - LLM calls (α=0.4) - each model invocation
    - Tool calls (β=0.3) - external operations
    - Token volume (γ=0.3) - computation scale
    """
    normalized_llm = min(llm_calls / 10, 1.0)
    normalized_tools = min(tool_calls / 10, 1.0)
    normalized_tokens = min(total_tokens / 1000, 1.0)

    return (0.4 * normalized_llm +
            0.3 * normalized_tools +
            0.3 * normalized_tokens)
```
Stored in runs.complexity_score.
Adding Tool Support for Tasks¶
If your task needs a new tool, add it in agentic.py:
```python
def _execute_tool(self, name: str, args: Dict) -> Any:
    if name == "calculator":
        return self._run_calculator(args)
    elif name == "web_search":
        return self._run_web_search(args)
    elif name == "my_new_tool":  # ← Add your tool
        return self._run_my_new_tool(args)
    else:
        raise ValueError(f"Unknown tool: {name}")

def _run_my_new_tool(self, args: Dict) -> Any:
    """Implement your custom tool logic."""
    # Parse args
    input_data = args.get('input')

    # Your tool logic here
    result = process_data(input_data)

    # Track for metrics
    self._tool_calls += 1

    return result
```
Task Validation¶
Check Expected Output¶
```python
def validate_result(actual: str, expected: str) -> bool:
    """Check if the task result matches the expected output."""
    if not expected:
        return True  # No validation

    # Simple exact match
    if actual.strip() == expected.strip():
        return True

    # Numeric tolerance for math tasks
    try:
        actual_num = float(actual)
        expected_num = float(expected)
        return abs(actual_num - expected_num) < 0.01
    except ValueError:
        pass

    return False
```
Stored in runs.task_success (to be added).
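Example behavior of the validator (restated in compact form so the snippet runs standalone):

```python
def validate_result(actual: str, expected: str) -> bool:
    """Exact match first, then numeric comparison with a small tolerance."""
    if not expected:
        return True
    if actual.strip() == expected.strip():
        return True
    try:
        return abs(float(actual) - float(expected)) < 0.01
    except ValueError:
        return False

print(validate_result("72", "72"))      # → True  (exact match)
print(validate_result("72.004", "72"))  # → True  (within numeric tolerance)
print(validate_result("73", "72"))      # → False
```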
Task Documentation Template¶
When adding a new task, document it in docs/user-guide/tasks.md:
## `my_new_task`

**Level:** 2
**Tools:** 1 (calculator)
**Description:** My custom task description

### Prompt

Solve this problem step by step:
A train leaves Chicago at 2 PM traveling at 60 mph.
Another train leaves New York at 3 PM traveling at 70 mph.
The distance between cities is 800 miles.
When will they meet?

### Expected Behavior

1. LLM plans the steps (head start, closing speed, time to meet)
2. Uses the calculator for each step
3. Synthesizes the final answer

### Example Output

Step 1: By 3 PM the Chicago train has covered 60 miles, leaving 800 - 60 = 740 miles
Step 2: Closing speed is 60 + 70 = 130 mph, so time to meet is 740 ÷ 130 ≈ 5.69 hours
Step 3: 3 PM + 5 hours 42 minutes ≈ 8:42 PM
They will meet at approximately 8:42 PM.

### Expected Answer: `8:42 PM`
Testing Your New Task¶
Quick Test¶
```bash
# Test linear execution
python -m core.execution.tests.test_harness \
    --task-id my_new_task \
    --repetitions 1 \
    --provider local \
    --verbose

# Test agentic execution (if tools needed)
python -m core.execution.tests.test_harness \
    --task-id my_new_task \
    --repetitions 1 \
    --provider local \
    --agentic \
    --verbose
```
Batch Test¶
```bash
python -m core.execution.tests.run_experiment \
    --tasks my_new_task \
    --repetitions 5 \
    --providers local,cloud \
    --save-db
```
Analyzing Task Performance¶
Query Task Results¶
```sql
SELECT
    r.run_id,
    r.workflow_type,
    r.dynamic_energy_uj / 1e6 AS energy_j,
    r.duration_ns / 1e9 AS duration_s,
    r.llm_calls,
    r.tool_calls,
    r.complexity_score
FROM runs r
JOIN experiments e ON r.exp_id = e.exp_id
WHERE e.task_name = 'my_new_task'
ORDER BY r.run_id;
```
Compare Across Tasks¶
```sql
SELECT
    e.task_name,
    e.provider,
    AVG(r.dynamic_energy_uj / 1e6) AS avg_energy,
    AVG(r.duration_ns / 1e9) AS avg_duration,
    AVG(r.complexity_score) AS avg_complexity
FROM runs r
JOIN experiments e ON r.exp_id = e.exp_id
GROUP BY e.task_name, e.provider
ORDER BY avg_energy DESC;
```
Best Practices¶
1. Clear Prompts¶
✅ Good:
```yaml
prompt: "Calculate 15% tip on a $45.50 meal."
```
❌ Bad:
```yaml
prompt: "do math"
```
2. Appropriate Level¶
- Level 1: Questions answerable in one LLM call
- Level 2: Requires 2-3 steps, may use tools
- Level 3: Multi-step with multiple tools
3. Include Expected Output¶
Always include expected output for validation:
```yaml
expected_output: "6.83"
```
4. Test Both Providers¶
```bash
# Test with local (faster for iteration)
python -m core.execution.tests.test_harness --task-id my_task --provider local

# Test with cloud (production)
python -m core.execution.tests.test_harness --task-id my_task --provider cloud
```
Advanced: Dynamic Tasks¶
You can also generate tasks programmatically:
```python
# scripts/generate_tasks.py
import random

def generate_math_tasks(base=2, count=10):
    """Generate multiplication tasks of varying difficulty."""
    tasks = {}
    for i in range(count):
        a = random.randint(1, 10**base)
        b = random.randint(1, 10**base)
        task_id = f"multiply_{base}_{i}"
        tasks[task_id] = {
            'description': f"Multiply {a} × {b}",
            'level': min(3, base),
            'tools': 1,
            'prompt': f"Calculate {a} × {b}",
            'expected_output': str(a * b)
        }
    return tasks
```
Then merge with tasks.yaml.
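One way to do the merge, sketched with PyYAML (an assumption) and everything inlined so it runs standalone; in practice you would load and rewrite config/tasks.yaml and pass in the dict returned by generate_math_tasks():

```python
import yaml

# Existing config, shown inline for illustration;
# in practice: yaml.safe_load(open("config/tasks.yaml"))
existing = yaml.safe_load("""
tasks:
  factual_qa:
    description: "Factual Question"
    level: 1
    tools: 0
    prompt: "What is the capital of France?"
    expected_output: "Paris"
""")

# Suppose generate_math_tasks() returned this (one example entry):
generated = {
    "multiply_2_0": {
        "description": "Multiply 12 × 34",
        "level": 2,
        "tools": 1,
        "prompt": "Calculate 12 × 34",
        "expected_output": "408",
    }
}

# Merge and serialize; write the result back to config/tasks.yaml.
existing["tasks"].update(generated)
merged_yaml = yaml.safe_dump(existing, allow_unicode=True, sort_keys=False)
```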
Checklist for New Tasks¶
- [ ] Task ID is unique and descriptive
- [ ] Description clearly explains the task
- [ ] Level appropriate for complexity
- [ ] Tools count matches actual tool usage
- [ ] Prompt is clear and unambiguous
- [ ] Expected output included (if applicable)
- [ ] Tested with local provider
- [ ] Tested with cloud provider (if available)
- [ ] Documented in user guide
- [ ] Added to appropriate category
Debugging Task Issues¶
Task Not Found¶
```bash
python -m core.execution.tests.run_experiment --list-tasks
# Verify your task appears in the list
```
Tool Not Supported¶
```
Error: Unknown tool: my_tool
```
Fix: Add tool to agentic.py _execute_tool() method.
Wrong Complexity Level¶
Check calculated score:
```sql
SELECT run_id, complexity_score, llm_calls, tool_calls
FROM runs WHERE task_name = 'my_task';
```
This guide corresponds to the Harness diagram at ../assets/diagrams/harness.svg.