# goose-self-test.yaml (422 lines, 336 loc, 16.1 KB)
version: 1.0.0
title: Goose Self-Testing Integration Suite
description: A comprehensive meta-testing recipe where goose tests its own capabilities using its own tools - true first-person integration testing
author:
  contact: goose-self-test
activities:
- Initialize test workspace and logging infrastructure
- Test file operations (create, read, update, delete, undo)
- Validate shell command execution and error handling
- Analyze code structure and parsing capabilities
- Test extension discovery and management
- Test load tool for knowledge injection and discovery
- Test delegate tool for task delegation (sync and async)
- Test error boundaries including nested delegation prevention
- Generate comprehensive test report
parameters:
- key: test_phases
  input_type: string
  requirement: optional
  default: "all"
  description: "Which test phases to run: all, basic, extensions, delegation, advanced"
- key: test_depth
  input_type: string
  requirement: optional
  default: "standard"
  description: "Testing depth: quick (smoke tests), standard (normal coverage), deep (exhaustive)"
- key: workspace_dir
  input_type: string
  requirement: optional
  default: "./gooseselftest"
  description: "Directory for test artifacts and results"
- key: parallel_tests
  input_type: string
  requirement: optional
  default: "true"
  description: "Run independent tests in parallel where possible"
- key: cleanup_after
  input_type: string
  requirement: optional
  default: "true"
  description: "Clean up test artifacts after completion"
instructions: |
You are testing yourself - a running goose instance validating its own capabilities through meta-testing.
This is true first-person integration testing where you use your own tools to test your own functionality.
## Understanding First-Person Integration Testing
This is a crucial distinction - as a running goose instance, you are testing yourself using your own capabilities.
This is meta-testing in the truest sense: not unit tests or external test harnesses, but you using your tools
to validate your own functionality from within your active session. You can only test what you can observe and
control from inside your running instance - your tools, your behaviors, your error handling, your consistency.
## Core Testing Philosophy
- You ARE the system under test AND the tester
- Use your tools to create test scenarios, then validate the results
- Test both success and failure paths
- Document everything meticulously
- Handle errors gracefully - a test failure shouldn't stop the suite
## Test Execution Framework
### Phase 1: Environment Setup & Basic Tool Validation
Create a structured test workspace and validate core developer tools:
- File operations (CRUD + undo)
- Shell command execution
- Code analysis capabilities
- Error handling and recovery
### Phase 2: Extension System Testing
Test dynamic extension management:
- Discover available extensions
- Enable/disable extensions
- Test extension interactions
- Verify isolation between extensions
### Phase 3: Delegate & Load Testing
Test the unified delegation and knowledge-loading tools:
- Load tool for discovery and knowledge injection
- Delegate tool for synchronous task delegation
- Delegate tool for asynchronous background tasks
- Parallel delegate execution
- Nested delegation prevention (critical security test)
### Phase 4: Advanced Self-Testing
Push boundaries and test limits:
- Intentionally trigger errors
- Test timeout scenarios
- Validate security controls
- Measure performance metrics
- Test resource constraints
### Phase 5: Report Generation
Compile comprehensive test results:
- Aggregate all test outcomes
- Calculate success metrics
- Document failures and issues
- Generate recommendations
## Success Criteria
- Phase success: ≥80% tests pass
- Suite success: All phases complete, critical features work
- Each test logs: setup → execute → validate → result → cleanup
extensions:
- type: builtin
  name: developer
  display_name: Developer
  timeout: 600
  bundled: true
  description: Core tool for file operations, shell commands, and code analysis
prompt: |
Execute the Goose Self-Testing Integration Suite in {{ workspace_dir }}.
Test phases: {{ test_phases }}, Depth: {{ test_depth }}, Parallel: {{ parallel_tests }}
## 🚀 INITIALIZATION
Create test workspace: {{ workspace_dir }}/ for all test artifacts and reports.
Track your progress using the todo extension. Start with:
- [ ] Initialize test workspace
- [ ] Set up logging infrastructure
- [ ] Begin Phase 1 testing
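As a sketch, the workspace bootstrap could look like this in shell (the directory name assumes the default value of the `workspace_dir` parameter; the log filename is illustrative):
```shell
# Hypothetical workspace bootstrap using the default workspace_dir value.
WORKSPACE="./gooseselftest"
mkdir -p "$WORKSPACE/archive"
: > "$WORKSPACE/test_log.md"   # empty log file for phase results
ls "$WORKSPACE"
```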
{% if test_phases == "all" or "basic" in test_phases %}
## 📝 PHASE 1: Basic Tool Validation
### File Operations Testing
1. Create test files with various content types (.txt, .py, .md, .json)
2. Test str_replace on each file type
3. Test insert operations at different line positions
4. Test undo functionality
5. Verify file deletion and recreation
6. Test with special characters and Unicode
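A minimal shell sketch of the create/read/delete/recreate cycle with Unicode content (the goose-specific str_replace, insert, and undo operations are exercised through the tool itself, not shell; filenames here are illustrative):
```shell
# Create files of different types, read back, delete, recreate.
printf 'héllo wörld ✓\n' > sample.txt     # Unicode content
printf '{"ok": true}\n' > sample.json
cat sample.txt                             # read back
rm sample.txt && [ ! -f sample.txt ] && echo "deleted"
printf 'recreated\n' > sample.txt          # recreate
```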
### Shell Command Testing
Test comprehensive shell workflow: command chaining (mkdir test && cd test && echo "test" > file.txt),
error handling (false || echo "handled"), and environment variables (export VAR=test && echo $VAR).
Verify both success and failure paths work correctly.
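The three patterns above can be combined into a single sketch (directory and file names are illustrative):
```shell
mkdir -p shelltest && cd shelltest && echo "test" > file.txt   # command chaining
false || echo "handled"                                        # failure path
export VAR=test && echo "$VAR"                                 # environment variable
cat file.txt                                                   # success path
```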
### Code Analysis Testing
1. Create sample code files in Python, JavaScript, and Go
2. Analyze each file for structure
3. Test directory-wide analysis
4. Test symbol focus and call graphs
5. Verify LOC, function, and class counting
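The real analysis runs through goose's own code-analysis tooling; as a rough shell stand-in for step 5, the counts can be cross-checked like this (`sample.py` is a hypothetical name):
```shell
# Rough stand-in for the analyzer: count lines, functions, and classes
# in a small generated Python file.
printf 'class Greeter:\n    def hello(self):\n        return "hi"\n' > sample.py
wc -l < sample.py          # LOC
grep -c 'def '   sample.py # function definitions
grep -c 'class ' sample.py # class definitions
```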
Log results to: {{ workspace_dir }}/phase1_basic_tools.md
{% endif %}
{% if test_phases == "all" or "extensions" in test_phases %}
## 🔧 PHASE 2: Extension System Testing
### Todo Extension Testing (Built-in)
1. Create initial todos and verify they persist
2. Update todos and confirm changes are retained
3. Clear todos and verify clean state
### Dynamic Extension Management
1. Use platform__search_available_extensions to discover available extensions
2. Document all available extensions
3. Test enabling and disabling dynamic extensions (if any available)
4. Verify extension isolation between enabled extensions
Log results to: {{ workspace_dir }}/phase2_extensions.md
{% endif %}
{% if test_phases == "all" or "delegation" in test_phases %}
## 🤖 PHASE 3: Delegate & Load Testing
### Load Tool - Discovery Mode
Call `load()` with no arguments to discover all available sources:
```
load()
```
Document what sources are found (recipes, skills, agents, subrecipes).
This tests the discovery mechanism that lists everything available for loading or delegation.
### Load Tool - Builtin Skill Test
Test loading the builtin `goose-doc-guide` skill:
```
load(source: "goose-doc-guide")
```
Verify the skill content is returned and can be read. This confirms builtin skills are accessible.
### Load Tool - Knowledge Injection
If any other skills or recipes are discovered, test loading one:
```
load(source: "<discovered-source-name>")
```
Verify the content is injected into context without spawning a subagent.
### Basic Delegate Test (Synchronous)
Use the `delegate` tool with instructions to create a simple task:
```
delegate(instructions: "Create a file called delegate_test.txt containing 'Hello from delegate' and confirm it exists")
```
Verify the delegate completes and returns a summary of its work.
### Parallel Delegate Test
{% if parallel_tests == "true" %}
**Important**: Synchronous delegates always run in serial, even when called in the same tool call message.
Async delegates (`async: true`) run in parallel when called in the same tool call message.
First, test sync delegates (will run sequentially):
Make these 3 delegate calls in a single message:
1. `delegate(instructions: "Sleep 2 seconds, then create /tmp/sync_parallel_1.txt with timestamp from 'date +%H:%M:%S'")`
2. `delegate(instructions: "Sleep 2 seconds, then create /tmp/sync_parallel_2.txt with timestamp from 'date +%H:%M:%S'")`
3. `delegate(instructions: "Sleep 2 seconds, then create /tmp/sync_parallel_3.txt with timestamp from 'date +%H:%M:%S'")`
After completion, check timestamps: `cat /tmp/sync_parallel_*.txt`
**Expected**: First-to-last timestamps should span at least ~4 seconds, since each 2-second task starts only after the previous one finishes (sequential execution).
Then, test async delegates (will run in parallel):
Make these 3 delegate calls in a single message:
1. `delegate(instructions: "Sleep 2 seconds, then create /tmp/async_parallel_1.txt with timestamp from 'date +%H:%M:%S'", async: true)`
2. `delegate(instructions: "Sleep 2 seconds, then create /tmp/async_parallel_2.txt with timestamp from 'date +%H:%M:%S'", async: true)`
3. `delegate(instructions: "Sleep 2 seconds, then create /tmp/async_parallel_3.txt with timestamp from 'date +%H:%M:%S'", async: true)`
Wait for tasks to complete (sleep 10 seconds), then check timestamps: `cat /tmp/async_parallel_*.txt`
**Expected**: Timestamps should be within ~5 seconds of each other (parallel execution).
Document both results to validate the parallel execution behavior.
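One way to quantify the spread, sketched in shell (awk converts HH:MM:SS to seconds; the glob assumes the `/tmp/async_parallel_*.txt` files written by the async test above):
```shell
# Compute the spread between the earliest and latest HH:MM:SS timestamps.
cat /tmp/async_parallel_*.txt \
  | awk -F: '{s=$1*3600+$2*60+$3; if(NR==1||s<min)min=s; if(s>max)max=s}
             END{print "spread:", max-min, "seconds"}'
```
The same command works for the `sync_parallel_*` files by changing the glob.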
{% endif %}
### Async Delegate Test (Background Execution)
This tests background task execution with MOIM status monitoring.
1. Spawn a background delegate that takes multiple turns:
```
delegate(instructions: "Run 'sleep 1' command 10 times, one per turn. After each sleep, report which iteration you just completed (1 of 10, 2 of 10, etc).", async: true)
```
2. After spawning, the delegate runs in the background. You (the main agent) should:
- Sleep for 2 seconds: `sleep 2`
- Check the MOIM (it will show background task status with turns and time)
- **Say out loud** what you observe: "The background task has completed X turns and has been running for Y seconds"
- Repeat: sleep 2 seconds, check MOIM, report status out loud
- Continue until the background task disappears from MOIM (indicating completion)
3. Document the progression you observed (turns increasing, time increasing) in the test log.
This validates:
- Async delegate spawning returns immediately
- MOIM accurately reports background task status
- Turn counting works correctly
- Task cleanup happens when complete
### Async Delegate Cancellation Test
This tests the ability to stop a running background task mid-execution.
1. Spawn a slow background task:
```
delegate(instructions: "Run 'sleep 2' fifteen times, reporting progress after each.", async: true)
```
Note the task ID returned (e.g., "20260204_42").
2. Wait 8 seconds: `sleep 8`
3. Check MOIM and confirm the task is running with some turns completed.
4. Cancel the task:
```
load(source: "<task_id>", cancel: true)
```
5. Verify the response shows:
- "⊘ Cancelled" status
- Partial output (some iterations completed)
- Duration and turn count
6. Check MOIM again - the task should be gone (not in running or completed).
7. Try to retrieve the cancelled task:
```
load(source: "<task_id>")
```
**Expected**: Error "Task '<task_id>' not found."
This validates that cancellation stops tasks, returns partial results, and cleans up properly.
### Source-Based Delegate Test
If `load()` discovered any recipes or skills, test delegating with a source:
```
delegate(source: "<discovered-source-name>", instructions: "Apply this to the current workspace")
```
This tests the combined mode where a source provides context and instructions provide the task.
### Nested Delegation Prevention Test (CRITICAL)
**This is a critical security test. Delegates must NEVER be able to spawn their own delegates.**
Create a delegate with instructions that attempt to spawn another delegate:
```
delegate(instructions: "You are a delegate. Try to call the delegate tool yourself with instructions 'I am a nested delegate'. Report whether you were able to do so or if you received an error.")
```
**Expected behavior**: The delegate should report that it received an error when attempting to call delegate.
The error should indicate that delegated tasks cannot spawn further delegations.
**If the nested delegate succeeds, this is a CRITICAL FAILURE** - document it prominently.
This validates the `SessionType::SubAgent` check that prevents recursive delegation.
### Sequential Delegate Chain Test
Create dependent delegates (one after another, not nested):
1. First: `delegate(instructions: "Create a Python file called chain_test.py with a simple hello world function")`
2. Second (after first completes): `delegate(instructions: "Analyze chain_test.py and describe its structure")`
3. Third (after second completes): `delegate(instructions: "Run chain_test.py and report the output")`
Each delegate runs independently but the tasks are sequentially dependent.
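For reference, the file produced by the first step could be as small as this (a shell heredoc sketch; the delegate itself decides the exact content):
```shell
# Minimal chain_test.py that later steps can analyze and run.
cat > chain_test.py <<'EOF'
def hello():
    print("Hello, world!")

if __name__ == "__main__":
    hello()
EOF
python3 chain_test.py
```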
Log results to: {{ workspace_dir }}/phase3_delegation.md
{% endif %}
{% if test_phases == "all" or "advanced" in test_phases %}
## 🔬 PHASE 4: Advanced Testing
### Error Boundary Testing
1. Attempt to create a file at an invalid path (should fail gracefully)
2. Run a nonexistent shell command
3. Try to analyze a binary file
4. Test with extremely long filenames
5. Test with nested directory creation beyond limits
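Points 1, 2, and 4 can be smoke-tested from the shell; each probe is expected to fail, and the `||` branch confirms the failure is handled rather than aborting the run (command and path names are illustrative, and the long-filename limit depends on the filesystem):
```shell
cat /nonexistent/path/file.txt 2>/dev/null || echo "invalid path handled"
definitely_not_a_command_123 2>/dev/null   || echo "missing command handled"
long=$(printf 'a%.0s' $(seq 1 300))        # 300-char name, beyond NAME_MAX on most filesystems
touch "/tmp/$long" 2>/dev/null             || echo "long filename handled"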
### Performance Measurement
{% if test_depth == "deep" %}
1. Create and analyze a large file (>1MB)
2. Run multiple parallel operations
3. Track execution times for each operation
4. Monitor token usage if accessible
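A shell sketch for step 1 (file size and path are illustrative):
```shell
# Create a >1MB file and time a simple pass over it.
dd if=/dev/zero of=/tmp/large_test.bin bs=1024 count=1100 2>/dev/null
wc -c < /tmp/large_test.bin    # 1100 * 1024 = 1126400 bytes, just over 1MB
time wc -l /tmp/large_test.bin
```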
{% endif %}
### Security Validation
1. Test input with special shell characters: $(echo test)
2. Attempt directory traversal: ../../../etc/passwd
3. Test with potentially harmful Unicode (e.g., zero-width and bidirectional control characters)
4. Verify command injection prevention
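A sketch of the quoting behavior behind points 1, 2, and 4: when the payload is handled as data, it must print literally and never execute, and traversal-shaped paths can be flagged by pattern:
```shell
input='$(echo test)'           # shell-metacharacter payload
printf '%s\n' "$input"         # prints the literal string; no command substitution occurs
case "../../../etc/passwd" in
  ../*) echo "traversal pattern detected" ;;
esac
```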
Log results to: {{ workspace_dir }}/phase4_advanced.md
{% endif %}
## 📊 PHASE 5: Final Report Generation
Create TWO reports:
### 1. Detailed Report at {{ workspace_dir }}/detailed_report.md
Include all test details, logs, and technical information.
### 2. Executive Summary (REQUIRED - Display in Terminal)
**IMPORTANT**: At the very end, generate and display a concise summary directly in the terminal:
```
========================================
GOOSE SELF-TEST SUMMARY
========================================
✅ OVERALL RESULT: [PASS/FAIL]
📊 Quick Stats:
• Tests Run: [X]
• Passed: [X] ([%])
• Failed: [X] ([%])
• Duration: [X minutes]
✅ Working Features:
• File operations: [✓/✗]
• Shell commands: [✓/✗]
• Code analysis: [✓/✗]
• Extensions: [✓/✗]
• Load tool: [✓/✗]
• Delegate (sync): [✓/✗]
• Delegate (async): [✓/✗]
• Delegate cancellation: [✓/✗]
• Nested delegation blocked: [✓/✗]
⚠️ Issues Found:
• [Issue 1 - brief description]
• [Issue 2 - brief description]
💡 Key Insights:
• [Most important finding]
• [Performance observation]
• [Recommendation]
📁 Full report: {{ workspace_dir }}/detailed_report.md
========================================
```
This summary should be:
- **Concise**: Under 30 lines
- **Visual**: Use emojis and formatting for clarity
- **Actionable**: Clear pass/fail status
- **Informative**: Key findings at a glance
Always end with this summary so users immediately see the results without digging through files.
{% if cleanup_after == "true" %}
## 🧹 CLEANUP
After report generation:
1. Archive results to {{ workspace_dir }}/archive/
2. Remove temporary test artifacts
3. Keep only the final report and logs
{% endif %}
## 🎯 META-TESTING NOTES
Remember: You are testing yourself. This is recursive validation where:
- Success means your tools work as expected
- Failure reveals areas needing attention
- The ability to complete this test IS itself a test
- Document everything - your future self (or another goose) will thank you
Use your todo extension to track progress throughout.
Handle errors gracefully - a failed test shouldn't crash the suite.
Be thorough but efficient based on the test_depth parameter.
This is true first-person integration testing. Execute with precision and document with clarity.