Commit 8a37c18
[GREEN] feat(oracle): Add unified training pipeline for deterministic corpus merge (Refs GH-123)
Adds unified_training module that merges all data sources deterministically:
- Synthetic corpus (configurable sample count, default 12,000)
- Depyler corpus (hand-crafted from tickets)
- Verificar corpus (extracted from verificar tool)
- OIP GitHub corpus (mined from Git commit history)
- Real errors file (optional)
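A minimal sketch of what a deterministic merge over these sources could look like, assuming each source yields `(error_message, label)` pairs. The names `SOURCE_ORDER` and `merge_sources`, the sample shape, and the merge order are all hypothetical and not taken from the commit:

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical sample shape: (error_message, label).
Sample = Tuple[str, str]

# Assumed fixed merge order. Iterating sources in one fixed order (instead of
# relying on caller-supplied ordering) keeps the merged output reproducible.
SOURCE_ORDER = ["synthetic", "depyler", "verificar", "oip_github", "real_errors"]


def merge_sources(
    sources: Dict[str, Optional[List[Sample]]],
) -> Tuple[List[Sample], Dict[str, int]]:
    """Concatenate all sources in SOURCE_ORDER and count samples per source."""
    merged: List[Sample] = []
    counts: Dict[str, int] = {}
    for name in SOURCE_ORDER:
        # A missing or None source (e.g. the optional real errors file)
        # simply contributes no samples.
        samples = sources.get(name) or []
        counts[name] = len(samples)
        merged.extend(samples)
    return merged, counts
```

The per-source counts returned here are one plausible ingredient of the merge statistics. The actual fields reported by the module are not shown in this commit message.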
Key features:
- Hash-based deduplication keyed on normalized error messages
- Deterministic shuffle using LCG with configurable seed
- Optional class balancing with max samples per class
- Comprehensive merge statistics
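The features above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the normalization rule (lowercase plus whitespace collapse), the SHA-256 hash, the LCG constants (Knuth's MMIX multiplier and increment), and all function names are hypothetical choices.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # hypothetical shape: (error_message, label)


def normalize(message: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(message.lower().split())


def dedup(samples: List[Sample]) -> List[Sample]:
    """Keep the first sample for each distinct normalized-message hash."""
    seen = set()
    unique: List[Sample] = []
    for message, label in samples:
        key = hashlib.sha256(normalize(message).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((message, label))
    return unique


class Lcg:
    """64-bit linear congruential generator (Knuth MMIX constants, assumed)."""

    def __init__(self, seed: int) -> None:
        self.state = seed % 2**64

    def next(self) -> int:
        self.state = (6364136223846793005 * self.state + 1442695040888963407) % 2**64
        # Return the high bits; the low bits of an LCG have short periods.
        return self.state >> 33


def lcg_shuffle(items: List[Sample], seed: int) -> List[Sample]:
    """Fisher-Yates shuffle driven by the LCG: same seed, same order."""
    rng = Lcg(seed)
    out = list(items)
    for i in range(len(out) - 1, 0, -1):
        j = rng.next() % (i + 1)  # slight modulo bias, acceptable for a sketch
        out[i], out[j] = out[j], out[i]
    return out


def balance(samples: List[Sample], max_per_class: int) -> List[Sample]:
    """Keep at most max_per_class samples per label, preserving order."""
    counts: Dict[str, int] = defaultdict(int)
    kept: List[Sample] = []
    for message, label in samples:
        if counts[label] < max_per_class:
            counts[label] += 1
            kept.append((message, label))
    return kept
```

Because the shuffle depends only on the seed and the input order, running `lcg_shuffle(dedup(samples), seed)` twice on the same input gives the same result, which is the property the commit describes as deterministic.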
Tests: 6 new tests (all passing)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
1 parent 28e2815 commit 8a37c18
2 files changed: +410 -0 lines changed