Commit 8a37c18

noahgift and claude committed
[GREEN] feat(oracle): Add unified training pipeline for deterministic corpus merge (Refs GH-123)
Adds unified_training module that merges all data sources deterministically:
- Synthetic corpus (configurable sample count, default 12,000)
- Depyler corpus (hand-crafted from tickets)
- Verificar corpus (extracted from verificar tool)
- OIP GitHub corpus (mined from Git commit history)
- Real errors file (optional)

Key features:
- Hash-based deduplication (normalized error messages)
- Deterministic shuffle using LCG with configurable seed
- Optional class balancing with max samples per class
- Comprehensive merge statistics

Tests: 6 new tests (all passing)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent 28e2815 commit 8a37c18
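
The key features listed in the commit message (hash-based deduplication, a seeded LCG shuffle, and an optional per-class cap) can be illustrated with a minimal sketch. This is not the code added by this commit, whose `unified_training.rs` source is not shown here. The `Sample` type, the `normalize` rule (lowercase plus whitespace collapsing), the LCG constants (Knuth's MMIX multiplier and increment), and every function name below are assumptions chosen for illustration only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashSet};
use std::hash::{Hash, Hasher};

/// Hypothetical sample type: an error message plus its class label.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub message: String,
    pub label: String,
}

/// Assumed normalization: lowercase and collapse whitespace runs.
pub fn normalize(message: &str) -> String {
    message
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Keep the first sample seen for each hash of the normalized message.
/// Two distinct messages that collide on the 64-bit hash would be merged.
pub fn dedup(samples: Vec<Sample>) -> Vec<Sample> {
    let mut seen = HashSet::new();
    samples
        .into_iter()
        .filter(|s| {
            let mut hasher = DefaultHasher::new();
            normalize(&s.message).hash(&mut hasher);
            seen.insert(hasher.finish()) // false if the hash was already present
        })
        .collect()
}

/// Linear congruential generator (constants are an assumption, see lead-in).
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33 // drop the low bits, which have short periods in an LCG
    }
}

/// Fisher-Yates shuffle driven by the LCG: same seed and input give the same order.
/// The modulo step has a slight bias, which is acceptable for a sketch.
pub fn shuffle<T>(items: &mut [T], seed: u64) {
    let mut rng = Lcg::new(seed);
    for i in (1..items.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

/// Keep at most `max_per_class` samples per label, preserving input order.
pub fn balance(samples: Vec<Sample>, max_per_class: usize) -> Vec<Sample> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    samples
        .into_iter()
        .filter(|s| {
            let n = counts.entry(s.label.clone()).or_insert(0);
            *n += 1;
            *n <= max_per_class
        })
        .collect()
}

/// Merge all sources: concatenate, deduplicate, shuffle, then optionally balance.
pub fn merge(sources: Vec<Vec<Sample>>, seed: u64, max_per_class: Option<usize>) -> Vec<Sample> {
    let mut merged = dedup(sources.into_iter().flatten().collect());
    shuffle(&mut merged, seed);
    match max_per_class {
        Some(cap) => balance(merged, cap),
        None => merged,
    }
}
```

Because every step depends only on the input order and the seed, calling `merge` twice with the same sources and seed yields the same corpus. Deduplicating before the shuffle keeps the result independent of where duplicates sit after shuffling. Note that `DefaultHasher` output is not guaranteed stable across Rust releases, so a pipeline that needs reproducibility across toolchains would likely pin its own hash function.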

2 files changed: +410 -0 lines changed

crates/depyler-oracle/src/lib.rs

Lines changed: 7 additions & 0 deletions
@@ -33,6 +33,7 @@ pub mod synthetic;
 pub mod tfidf;
 pub mod training;
 pub mod tuning;
+pub mod unified_training;
 pub mod verificar_integration;

 pub use autofixer::{AutoFixer, FixContext, FixResult, TransformRule};
@@ -71,6 +72,12 @@ pub use github_corpus::{
     analyze_corpus, get_moe_samples_from_oip, CorpusStats,
 };

+// Unified training pipeline
+pub use unified_training::{
+    build_unified_corpus, build_default_unified_corpus, build_unified_corpus_with_oip,
+    print_merge_stats, UnifiedTrainingConfig, UnifiedTrainingResult, MergeStats,
+};
+
 /// Error types for the oracle.
 #[derive(Debug, thiserror::Error)]
 pub enum OracleError {
