Feature Selection and Preprocessing

2 sources - 10 claims

The initial gene-screening process reduced 14,147 candidate genes to a smaller computational subset. Stage 1 retained genes with marginal association p-values below 0.05 after adjustment for age and sex. Gaussian mixture model (GMM) clustering selected three components and yielded 436 candidate genes for downstream modeling. The final multiple myeloma application identified eight candidate epigenetic risk loci, each with a full-data posterior inclusion probability (PIP) of 1.00.

A directed acyclic graph will make assumed causal pathways, bias sources, dataset shift, measurement practices, and downstream treatment decisions explicit. Candidate features will be filtered by availability and expert domain input before dimensionality reduction. Missing data will be imputed with multiple imputation by chained equations (MICE) under a missing-at-random assumption, and the training and testing subsets will be imputed separately to avoid data leakage. Dimensionality reduction will use Boruta and LASSO within each outer-loop training set, and the final modeling feature set will combine the features selected by the two methods.
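The ordering of the planned steps (split first, impute each subset on its own, select features on the outer-loop training set only, then combine the selections) can be illustrated with a minimal sketch. It is not the authors' pipeline. Several things here are assumptions made to keep the example short and dependency-free: column-mean imputation stands in for MICE, two toy correlation-based selectors stand in for Boruta and LASSO, "combine" is taken to mean the set union, and every function name and the synthetic data are hypothetical.

```python
import random
import statistics


def outer_folds(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) for each of k outer folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def impute_subset(rows):
    """Fill None entries with column means computed from `rows` alone.

    ASSUMPTION: mean imputation is a stand-in for MICE.
    """
    filled = [list(r) for r in rows]
    for c in range(len(rows[0])):
        observed = [r[c] for r in rows if r[c] is not None]
        mean = statistics.fmean(observed) if observed else 0.0
        for r in filled:
            if r[c] is None:
                r[c] = mean
    return filled


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def toy_selector_a(X, y):
    """ASSUMPTION: placeholder for Boruta. Keeps columns with |r| > 0.5."""
    return {c for c in range(len(X[0]))
            if abs(pearson([r[c] for r in X], y)) > 0.5}


def toy_selector_b(X, y):
    """ASSUMPTION: placeholder for LASSO. Keeps the two columns with largest |r|."""
    ranked = sorted(range(len(X[0])),
                    key=lambda c: -abs(pearson([r[c] for r in X], y)))
    return set(ranked[:2])


def select_per_fold(X, y, k=3, selectors=(toy_selector_a, toy_selector_b)):
    """Per outer fold, return the union of features chosen on training rows."""
    results = []
    for train, test in outer_folds(len(X), k):
        X_train = impute_subset([X[i] for i in train])  # uses training rows only
        X_test = impute_subset([X[i] for i in test])    # imputed separately
        y_train = [y[i] for i in train]
        chosen = set()
        for select in selectors:
            chosen |= select(X_train, y_train)          # test rows never seen
        results.append((sorted(chosen), X_test))
    return results


# Synthetic data: column 0 tracks the label, columns 1 and 2 are noise.
rng = random.Random(1)
y = [i % 2 for i in range(30)]
X = [[label + rng.gauss(0, 0.1), rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]
X[0][1] = None  # introduce missing values
X[5][0] = None

results = select_per_fold(X, y)
for fold, (features, _) in enumerate(results):
    print(f"fold {fold}: selected columns {features}")
```

The point of the structure is that nothing computed from the test rows, neither imputation statistics nor selection scores, reaches the training side of a fold. In a real analysis the placeholders would be replaced by actual MICE, Boruta, and LASSO implementations passed in through the `selectors` argument and the imputation function.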