Model Architecture

1 source - 5 claims

The core architecture used frozen pretrained backbones with trainable LoRA adaptation modules. The default audio backbone was Whisper Small with mean temporal pooling, and it was the strongest audio backbone in the early architecture and dataset setting. Broad ASR pretraining across many speakers supported downstream mental health classification. Training used randomly selected 30-second speech segments because Whisper Small operates on a fixed 30-second input window.
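
The three mechanical pieces named above (random 30-second segment selection, mean temporal pooling, and a LoRA-adapted layer on top of frozen weights) can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's implementation: the function names, the zero-padding of short recordings, and the LoRA rank and scaling are hypothetical, and no Whisper model is loaded. The 16 kHz sample rate is the rate Whisper models expect.

```python
import random

SAMPLE_RATE = 16_000      # Hz; Whisper models take 16 kHz audio
SEGMENT_SECONDS = 30      # length of Whisper's fixed input window
SEGMENT_SAMPLES = SAMPLE_RATE * SEGMENT_SECONDS


def random_segment(waveform, rng=random):
    """Return a randomly positioned 30-second crop of a 1-D sample sequence.

    Assumption: recordings shorter than 30 seconds are zero-padded on the
    right; the source does not say how short clips were handled.
    """
    n = len(waveform)
    if n <= SEGMENT_SAMPLES:
        return list(waveform) + [0.0] * (SEGMENT_SAMPLES - n)
    start = rng.randrange(n - SEGMENT_SAMPLES + 1)
    return list(waveform[start:start + SEGMENT_SAMPLES])


def mean_temporal_pool(frames):
    """Average per-frame feature vectors over time into one clip-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def _matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, W, A, B, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stands for the frozen pretrained weight; A (r x d_in)
    and B (d_out x r) are the low-rank factors that would be trained.
    The rank r and scaling alpha here are hypothetical values.
    """
    r = len(A)
    base = _matvec(W, x)
    update = _matvec(B, _matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]
```

For example, `mean_temporal_pool([[1.0, 2.0], [3.0, 4.0]])` returns `[2.0, 3.0]`. With `B` set to all zeros, which is the usual LoRA initialization, `lora_linear` reproduces the frozen layer's output exactly, so adaptation starts from the pretrained behavior.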