How can distillation be carried out under these circumstances? #12

lean-wang · 2024-09-18T05:42:33Z

Dear Professor,

I would like to ask some questions. In your experiment, you used the models "teacher": "arcee-ai/Arcee-Spark" and "student": "Qwen/Qwen2-1.5B" for the distil-logits task. I am wondering how to perform distillation when the vocabulary sizes of these two models are different and the word indices in their vocabularies do not match. The shape of the teacher's logits is [b, seq_len, vocabulary_size_spark], and the student's logits shape is [b, seq_len, vocabulary_size_qwen]. How can distillation be carried out under these circumstances?

ashdtu · 2024-12-17T16:42:40Z

Hi @lean-wang, when the student and teacher tokenizers (and hence their output logits) differ in dimension, hidden-feature-based distillation is recommended. You would need to look at the architectures of both the student and the teacher model and find latent dimensions of similar size.
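A minimal sketch of what a hidden-feature loss could look like, in PyTorch. This is not the code from this repository. The class name `HiddenStateDistillLoss`, the learned linear projection (used here so the two hidden sizes do not have to match exactly), and the toy tensor shapes are all assumptions made for illustration. It also assumes that both models produce the same number of token positions for a given input, so that hidden states can be compared position by position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HiddenStateDistillLoss(nn.Module):
    """MSE between teacher hidden states and linearly projected student hidden states.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Learned projection so the two hidden sizes need not be equal.
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, student_hidden, teacher_hidden, attention_mask):
        # student_hidden: [b, seq_len, student_dim]
        # teacher_hidden: [b, seq_len, teacher_dim]
        # attention_mask: [b, seq_len], 1 for real tokens, 0 for padding
        projected = self.proj(student_hidden)
        # Per-token squared error, averaged over the feature dimension.
        # The teacher is detached so no gradient flows into it.
        per_token = F.mse_loss(
            projected, teacher_hidden.detach(), reduction="none"
        ).mean(dim=-1)
        mask = attention_mask.to(per_token.dtype)
        # Average over non-padding positions only.
        return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


# Toy shapes for illustration only (not the real model sizes).
torch.manual_seed(0)
b, seq_len, student_dim, teacher_dim = 2, 5, 8, 16
student_hidden = torch.randn(b, seq_len, student_dim, requires_grad=True)
teacher_hidden = torch.randn(b, seq_len, teacher_dim)
attention_mask = torch.ones(b, seq_len, dtype=torch.long)
attention_mask[1, 3:] = 0  # pretend the second sequence is padded

loss_fn = HiddenStateDistillLoss(student_dim, teacher_dim)
loss = loss_fn(student_hidden, teacher_hidden, attention_mask)
loss.backward()
```

Note that the vocabulary dimension never appears in this loss, which is why it sidesteps the logits mismatch. With Hugging Face models, per-layer hidden states can typically be obtained by passing `output_hidden_states=True` to the forward call; which layers to pair is a design choice left open here.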

The other option, which is independent of model architecture, is to generate reasoning with the teacher model and use it as the training target for supervised fine-tuning of the student. Hope this helps!
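As a rough sketch of the data-preparation side of this second option: the teacher produces text, and that text becomes the fine-tuning target for the student. The function `build_sft_examples`, the `teacher_generate` callable, and the `prompt`/`completion` record layout are hypothetical names chosen for this example; in practice `teacher_generate` would wrap a real generation call to the teacher model.

```python
from typing import Callable, Dict, List


def build_sft_examples(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each prompt with the teacher's generated text (hypothetical helper)."""
    examples = []
    for prompt in prompts:
        completion = teacher_generate(prompt).strip()
        if not completion:
            # Skip prompts for which the teacher returned nothing usable.
            continue
        examples.append({"prompt": prompt, "completion": completion})
    return examples


def fake_teacher(prompt: str) -> str:
    # Stand-in for a real teacher model call, used only to make the sketch runnable.
    return "" if "skip" in prompt else f"Reasoning for: {prompt}"


examples = build_sft_examples(["What is 2 + 2?", "skip me"], fake_teacher)
```

Because the teacher and student only exchange plain text, each model tokenizes the data with its own tokenizer, so differing vocabulary sizes and token indices do not matter. The resulting records can then be fed to any standard supervised fine-tuning loop for the student.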
