Pretraining Mismatch

1 source - 5 claims

When trained from scratch on WikiText-103, Block-ChaCAL outperformed both dense attention and dense ChaCAL in perplexity. That advantage did not carry over to models pretrained with dense attention. When GPT-2 was fine-tuned on OpenWebText starting from dense-attention pretrained weights, dense fine-tuning did better than either the ChaCAL or the Block-ChaCAL variant. Likewise, on SCROLLS with BART-base, substituting Block-ChaCAL directly into the decoder layers underperformed standard dense fine-tuning. The common thread is a pretraining mismatch: replacing dense attention inside a pretrained model changes its computation graph, so weights learned under dense attention are applied to a different operator, and performance can suffer. Encoder-decoder models appeared especially sensitive to operator substitution, because their encoder representations and decoder attention dynamics are coupled under dense attention.
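The effect of operator substitution on a fixed set of weights can be illustrated with a toy example. The sketch below is not ChaCAL or Block-ChaCAL; it assumes a generic block-local mask as a stand-in for a sparse operator, and the function name `attention`, the `block_size` parameter, and the toy inputs are all hypothetical. It only shows that the same query, key, and value vectors yield different outputs once each token is restricted to its own block, which is the kind of change a pretrained model has to absorb during fine-tuning.

```python
import math


def softmax(row):
    # Numerically stable softmax; masked entries (-inf) get weight 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, block_size=None):
    """Single-head scaled dot-product attention over lists of vectors.

    block_size=None gives dense attention. Otherwise each query attends
    only to keys in its own block (an assumed stand-in for a
    block-sparse operator, not the ChaCAL formulation).
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            same_block = block_size is None or i // block_size == j // block_size
            if same_block:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / math.sqrt(d))
            else:
                scores.append(float("-inf"))
        weights = softmax(scores)
        out.append(
            [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))]
        )
    return out


# Toy "pretrained" projections: the same vectors are reused for q, k and v.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

dense = attention(x, x, x)
blocked = attention(x, x, x, block_size=2)

# Largest element-wise difference between the two operators' outputs.
max_gap = max(
    abs(a - b) for row_d, row_b in zip(dense, blocked) for a, b in zip(row_d, row_b)
)
print("max output gap:", max_gap)
```

With `block_size` equal to the sequence length the mask is empty and the function reduces to dense attention, so the gap comes entirely from the restricted attention pattern rather than from any change in the weights.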