Ablation studies on LSMDC

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the LSMDC dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

Model Task R@1 R@5 R@10 MdR Params Links
Concat t2v 0.5(0.5) 2.4(0.9) 4.9(1.7) 195.8(22.1) 150.63k config, model, log
CE - MW,P,CG t2v 9.7(0.3) 24.0(0.9) 33.3(0.5) 29.5(2.5) 159.78M config, model, log
CE - P,CG t2v 11.2(0.7) 26.1(0.9) 35.3(0.4) 25.7(2.1) 159.78M config, model, log
CE - CG t2v 10.6(0.4) 26.9(0.5) 35.3(0.5) 27.2(1.8) 114.48M config, model, log
CE t2v 11.2(0.4) 26.9(1.1) 34.8(2.0) 25.3(3.1) 116.86M config, model, log
Concat v2t 0.6(0.4) 3.1(0.7) 5.7(1.3) 184.7(20.7) 150.63k config, model, log
CE - MW,P,CG v2t 11.1(1.0) 24.5(0.9) 32.5(0.8) 32.2(1.9) 159.78M config, model, log
CE - P,CG v2t 11.5(0.2) 25.3(1.0) 34.3(0.3) 28.0(3.0) 159.78M config, model, log
CE - CG v2t 10.8(0.2) 26.1(0.6) 34.0(0.8) 28.8(0.3) 114.48M config, model, log
CE v2t 11.7(0.5) 25.8(1.5) 34.4(1.7) 28.0(2.6) 116.86M config, model, log

Each successive row adds one component to the model. The names refer to the following model designs:

  • Concat: A bare-bones concatenation model. Each expert is first aggregated across time (which still requires some parameters for the variable-length VLAD layers); the aggregated experts are then concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
  • CE - MW,P,CG: The CE model without MoE weights, projection to a common dimension, or Collaborative Gating.
  • CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
  • CE - CG: The CE model without Collaborative Gating (CG).
  • CE: The full CE model.
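
The Concat baseline in the list above can be illustrated with a minimal sketch. This is not the repository's implementation: mean pooling stands in for the VLAD aggregation, cosine similarity is an assumed comparison function, and the names `mean_pool` and `concat_similarity` are hypothetical.

```python
import math

def mean_pool(frames):
    """Average a list of per-frame feature vectors into a single vector.

    Stand-in for the VLAD aggregation mentioned in the text (assumption).
    """
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def concat_similarity(expert_frames, text_emb):
    """Score one video against one already-aggregated text embedding."""
    pooled = []
    # Aggregate each expert across time, then concatenate in a fixed order.
    for name in sorted(expert_frames):
        pooled.extend(mean_pool(expert_frames[name]))
    if len(pooled) != len(text_emb):
        raise ValueError("concatenated video embedding must match the text dimensionality")
    # Cosine similarity between the two embeddings (assumed comparison).
    dot = sum(v * t for v, t in zip(pooled, text_emb))
    norm = math.sqrt(sum(v * v for v in pooled)) * math.sqrt(sum(t * t for t in text_emb))
    return dot / norm

# Toy input: two experts with different frame counts and feature sizes.
video = {"scene": [[1.0, 0.0], [3.0, 0.0]], "flow": [[0.0, 2.0]]}
score = concat_similarity(video, [0.0, 2.0, 2.0, 0.0])
```

The dimensionality check mirrors the note on the Concat row: with no learned projection, the concatenated video embedding has to match the text embedding size directly.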

Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.
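
For readers unfamiliar with the column names, the sketch below shows how R@K (recall at rank K, in percent) and MdR (median rank) are commonly computed from a query-by-candidate similarity matrix. It assumes the correct match for query `i` is candidate `i`; the function names are illustrative and are not taken from the codebase.

```python
import statistics

def retrieval_ranks(sim):
    """Return the rank of the correct candidate for each query.

    sim[i][j] is the similarity of query i and candidate j; the correct
    candidate for query i is assumed to be j == i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank 1 means no other candidate scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks

def recall_at_k(ranks, k):
    """Percentage of queries whose correct candidate is ranked in the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median rank of the correct candidate (lower is better)."""
    return statistics.median(ranks)

sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.6],
]
ranks = retrieval_ranks(sim)
```

Higher is better for R@K, while lower is better for MdR.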

Importance of Different Experts: The next ablation investigates the contribution of each expert to the final embedding. Since not all experts are available for every video, we pair each expert with scene features to approximate its usefulness.

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 4.2(0.4) 12.6(0.4) 18.6(0.6) 85.5(0.5) 12.83M config, model, log
Scene + Inst. t2v 8.0(1.9) 19.4(1.5) 27.1(1.1) 45.2(3.3) 27.87M config, model, log
Scene + r2p1d t2v 8.0(0.3) 20.5(0.6) 29.1(1.0) 39.8(2.1) 26.69M config, model, log
Scene + RGB t2v 5.9(0.5) 16.3(0.4) 22.4(0.3) 59.3(2.5) 27.87M config, model, log
Scene + Flow t2v 6.4(0.9) 18.2(0.9) 26.2(1.0) 48.5(0.9) 27.09M config, model, log
Scene + Audio t2v 6.3(0.1) 15.9(0.6) 24.4(0.5) 50.7(1.5) 26.6M config, model, log
Scene + OCR t2v 3.8(0.8) 11.9(0.7) 17.5(1.3) 88.8(9.0) 33.23M config, model, log
Scene + Speech t2v 3.8(0.4) 12.3(0.5) 18.1(0.6) 83.8(6.5) 27.46M config, model, log
Scene + Face t2v 4.4(0.4) 12.6(0.2) 19.2(0.6) 83.8(6.0) 26.4M config, model, log
Scene v2t 4.6(0.4) 12.6(1.1) 18.0(0.9) 91.8(2.0) 12.83M config, model, log
Scene + Inst. v2t 7.6(0.9) 19.7(0.1) 27.3(0.9) 47.5(1.3) 27.87M config, model, log
Scene + r2p1d v2t 7.9(0.2) 20.7(0.8) 28.0(0.7) 42.7(4.3) 26.69M config, model, log
Scene + RGB v2t 6.0(0.2) 16.4(0.6) 22.8(0.8) 64.0(4.8) 27.87M config, model, log
Scene + Flow v2t 6.6(1.2) 17.4(0.4) 24.8(0.9) 50.2(1.4) 27.09M config, model, log
Scene + Audio v2t 6.3(0.5) 17.3(0.5) 23.8(0.8) 56.3(3.5) 26.6M config, model, log
Scene + OCR v2t 4.8(0.6) 12.4(0.7) 17.6(0.4) 96.7(16.4) 33.23M config, model, log
Scene + Speech v2t 3.9(0.2) 11.9(0.1) 17.8(0.3) 84.8(4.4) 27.46M config, model, log
Scene + Face v2t 4.9(0.3) 13.2(0.4) 19.1(1.7) 93.2(7.5) 26.4M config, model, log
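
One way to read the table above is to subtract the scene-only row from each paired row. The sketch below does this for the t2v R@1 column, using cells copied from the table. The helper `parse_cell` is hypothetical; it splits a cell into its main value and the bracketed value (presumably a spread over repeated runs, which the table does not state explicitly).

```python
import re

def parse_cell(cell):
    """Split a table cell such as '8.0(1.9)' into (value, bracketed value)."""
    m = re.fullmatch(r"([\d.]+)\(([\d.]+)\)", cell)
    if m is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return float(m.group(1)), float(m.group(2))

# t2v R@1 cells copied from the LSMDC table above.
r1_t2v = {
    "Scene": "4.2(0.4)",
    "Scene + Inst.": "8.0(1.9)",
    "Scene + r2p1d": "8.0(0.3)",
    "Scene + RGB": "5.9(0.5)",
    "Scene + Flow": "6.4(0.9)",
    "Scene + Audio": "6.3(0.1)",
    "Scene + OCR": "3.8(0.8)",
    "Scene + Speech": "3.8(0.4)",
    "Scene + Face": "4.4(0.4)",
}

baseline, _ = parse_cell(r1_t2v["Scene"])
# Gain in R@1 over the scene-only model, rounded to the table's precision.
gains = {
    name: round(parse_cell(cell)[0] - baseline, 1)
    for name, cell in r1_t2v.items()
    if name != "Scene"
}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {gain:+.1f}")
```

On this metric, OCR and Speech come out slightly below the scene-only baseline, by no more than the bracketed values of those rows, so the difference should be read with caution.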

We can also study the cumulative effect of adding the experts one at a time:

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 4.2(0.4) 12.6(0.4) 18.6(0.6) 85.5(0.5) 12.83M config, model, log
Prev. + Speech t2v 3.8(0.4) 12.3(0.5) 18.1(0.6) 83.8(6.5) 27.46M config, model, log
Prev. + Audio t2v 6.0(0.5) 17.2(0.6) 24.4(0.7) 50.5(1.3) 38.86M config, model, log
Prev. + Flow t2v 7.6(0.7) 20.1(0.6) 28.1(0.7) 37.7(2.8) 50.75M config, model, log
Prev. + RGB t2v 8.9(0.9) 21.8(0.2) 29.9(0.7) 34.5(1.3) 63.43M config, model, log
Prev. + Inst. t2v 10.4(1.3) 25.6(1.0) 34.2(1.0) 28.3(1.5) 76.11M config, model, log
Prev. + R2P1D t2v 11.3(0.4) 27.4(0.9) 36.3(1.1) 25.7(1.2) 87.61M config, model, log
Prev. + OCR t2v 11.5(0.5) 26.2(0.6) 35.8(0.5) 25.7(1.8) 105.65M config, model, log
Prev. + Face t2v 11.3(0.3) 26.7(1.5) 35.1(1.6) 25.3(3.1) 116.86M config, model, log
Scene v2t 4.6(0.4) 12.6(1.1) 18.0(0.9) 91.8(2.0) 12.83M config, model, log
Prev. + Speech v2t 3.9(0.2) 11.9(0.1) 17.8(0.3) 84.8(4.4) 27.46M config, model, log
Prev. + Audio v2t 7.2(0.3) 18.2(0.3) 24.7(1.2) 57.0(2.2) 38.86M config, model, log
Prev. + Flow v2t 7.9(0.6) 20.5(0.4) 28.6(0.5) 41.3(2.5) 50.75M config, model, log
Prev. + RGB v2t 9.3(0.6) 21.9(1.2) 30.1(0.8) 34.8(1.3) 63.43M config, model, log
Prev. + Inst. v2t 11.1(0.8) 25.1(1.6) 33.8(1.0) 30.0(1.3) 76.11M config, model, log
Prev. + R2P1D v2t 11.8(0.2) 26.6(0.9) 35.8(0.4) 27.7(2.1) 87.61M config, model, log
Prev. + OCR v2t 11.2(0.3) 26.0(1.3) 33.8(0.9) 27.4(1.2) 105.65M config, model, log
Prev. + Face v2t 11.6(0.5) 25.8(1.4) 34.4(1.7) 27.7(2.5) 116.86M config, model, log
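
Because each row in the cumulative table builds on the previous one, the marginal effect of an expert is the difference between consecutive rows. A short sketch for the t2v R@1 column, with values copied from the table above (the variable names are illustrative):

```python
# (expert added at this step, t2v R@1) in the order of the cumulative table.
steps = [
    ("Scene", 4.2),
    ("Speech", 3.8),
    ("Audio", 6.0),
    ("Flow", 7.6),
    ("RGB", 8.9),
    ("Inst.", 10.4),
    ("R2P1D", 11.3),
    ("OCR", 11.5),
    ("Face", 11.3),
]

# Difference between each row and the row before it, rounded to one decimal.
marginal = [
    (name, round(value - prev, 1))
    for (_, prev), (name, value) in zip(steps, steps[1:])
]
for name, delta in marginal:
    print(f"+ {name:6s} {delta:+.1f}")
```

The last two steps (OCR and Face) change R@1 by less than the bracketed values of their rows, so their marginal effect on this metric is not clearly distinguishable from noise.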

Importance of Model Capacity: The next ablation investigates how the shared embedding dimension used by CE affects performance.

Dimension Task R@1 R@5 R@10 MdR Params Links
384 t2v 11.1(0.6) 26.6(0.6) 35.8(0.1) 26.2(2.3) 55.27M config, model, log
512 t2v 11.2(0.4) 26.2(0.8) 35.8(0.6) 26.3(2.1) 75.08M config, model, log
640 t2v 11.7(0.6) 26.9(1.4) 36.0(1.3) 25.3(0.6) 95.61M config, model, log
768 t2v 11.2(0.4) 26.9(1.1) 34.8(2.0) 25.3(3.1) 116.86M config, model, log
1024 t2v 11.3(0.3) 27.2(0.9) 36.1(1.3) 24.7(1.5) 161.52M config, model, log
384 v2t 11.3(0.7) 26.0(1.1) 34.5(1.4) 28.8(1.6) 55.27M config, model, log
512 v2t 11.0(0.5) 26.2(0.2) 35.0(1.5) 29.7(1.2) 75.08M config, model, log
640 v2t 12.0(0.6) 26.1(1.5) 34.1(1.1) 28.2(1.0) 95.61M config, model, log
768 v2t 11.7(0.5) 25.8(1.5) 34.4(1.7) 28.0(2.6) 116.86M config, model, log
1024 v2t 11.4(0.9) 26.6(0.7) 35.3(1.1) 27.5(3.5) 161.52M config, model, log
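
To make the capacity trade-off concrete, the sketch below computes how many parameters (in millions) each additional embedding dimension costs between consecutive rows of the table above. The numbers are copied from the table; the variable names are illustrative.

```python
# Shared embedding dimension and parameter count (millions) from the table above.
dims = [384, 512, 640, 768, 1024]
params_m = [55.27, 75.08, 95.61, 116.86, 161.52]

pairs = list(zip(dims, params_m))
# Parameters (millions) added per extra embedding dimension, per interval.
per_dim = [
    (p1 - p0) / (d1 - d0)
    for (d0, p0), (d1, p1) in zip(pairs, pairs[1:])
]
for (d0, _), (d1, _), cost in zip(pairs, pairs[1:], per_dim):
    print(f"{d0:4d} -> {d1:4d}: {cost:.3f}M params per dimension")
```

The per-dimension cost rises with each interval, so the parameter count grows somewhat faster than linearly in the embedding dimension. Over the same range the parameter count nearly triples, while t2v R@1 stays between 11.1 and 11.7.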

Ablation studies on DIDEMO

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the DIDEMO dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

Model Task R@1 R@5 R@10 MdR Params Links
Concat t2v 1.6(0.1) 7.9(0.6) 14.2(0.7) 65.5(2.3) 374.62k config, model, log
CE - MW,P,CG t2v 14.8(0.3) 40.4(1.7) 54.7(0.8) 8.7(0.6) 107.26M config, model, log
CE - P,CG t2v 15.4(0.8) 42.0(0.2) 55.2(0.8) 8.3(0.6) 107.26M config, model, log
CE - CG t2v 14.6(1.0) 40.1(0.2) 54.2(0.2) 8.7(0.6) 76.91M config, model, log
CE t2v 16.1(1.4) 41.1(0.4) 54.4(0.8) 8.3(0.6) 79.29M config, model, log
Concat v2t 2.4(0.5) 8.5(0.6) 14.0(1.0) 72.5(4.4) 374.62k config, model, log
CE - MW,P,CG v2t 15.9(0.4) 41.1(0.5) 54.9(0.7) 8.3(0.6) 107.26M config, model, log
CE - P,CG v2t 16.2(0.3) 41.4(0.2) 54.4(0.9) 8.3(0.6) 107.26M config, model, log
CE - CG v2t 15.7(0.9) 39.6(0.8) 53.7(0.9) 9.0(0.0) 76.91M config, model, log
CE v2t 15.6(1.3) 40.9(0.4) 55.2(0.5) 8.2(0.3) 79.29M config, model, log

Each successive row adds one component to the model. The names refer to the following model designs:

  • Concat: A bare-bones concatenation model. Each expert is first aggregated across time (which still requires some parameters for the variable-length VLAD layers); the aggregated experts are then concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
  • CE - MW,P,CG: The CE model without MoE weights, projection to a common dimension, or Collaborative Gating.
  • CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
  • CE - CG: The CE model without Collaborative Gating (CG).
  • CE: The full CE model.

Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates the contribution of each expert to the final embedding. Since not all experts are available for every video, we pair each expert with scene features to approximate its usefulness.

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 6.7(1.3) 20.8(0.7) 32.7(0.9) 25.0(1.0) 7.62M config, model, log
Scene + Inst. t2v 12.1(0.6) 31.7(0.2) 46.0(1.3) 13.0(1.0) 17.47M config, model, log
Scene + r2p1d t2v 13.0(0.8) 36.8(1.0) 52.0(0.3) 10.0(0.0) 16.29M config, model, log
Scene + RGB t2v 8.7(1.6) 26.4(1.0) 39.7(0.4) 16.8(0.8) 17.47M config, model, log
Scene + Flow t2v 10.1(0.7) 29.4(1.1) 42.9(0.8) 14.5(0.9) 16.68M config, model, log
Scene + Audio t2v 6.3(0.3) 23.0(0.8) 33.6(0.7) 22.7(1.3) 17.47M config, model, log
Scene + OCR t2v 6.1(0.6) 20.5(0.7) 32.3(1.1) 25.7(1.2) 18.9M config, model, log
Scene + Speech t2v 6.1(0.5) 21.0(1.0) 30.7(0.4) 27.2(0.3) 28.6M config, model, log
Scene + Face t2v 6.4(0.1) 20.8(1.2) 31.9(0.7) 24.8(1.8) 16.29M config, model, log
Scene v2t 6.6(0.7) 21.3(0.4) 33.0(0.4) 25.2(2.5) 7.62M config, model, log
Scene + Inst. v2t 12.5(1.3) 33.0(0.5) 46.1(0.6) 13.3(1.2) 17.47M config, model, log
Scene + r2p1d v2t 13.5(0.6) 36.2(0.2) 51.3(1.3) 10.3(0.6) 16.29M config, model, log
Scene + RGB v2t 8.5(1.0) 26.8(1.4) 38.8(0.7) 16.7(0.6) 17.47M config, model, log
Scene + Flow v2t 11.3(0.2) 29.8(0.6) 42.1(1.2) 15.3(1.5) 16.68M config, model, log
Scene + Audio v2t 6.6(0.4) 23.0(2.3) 33.6(1.0) 22.2(2.0) 17.47M config, model, log
Scene + OCR v2t 6.6(0.3) 20.9(1.2) 32.1(0.7) 26.2(1.8) 18.9M config, model, log
Scene + Speech v2t 6.8(0.8) 21.1(0.9) 31.4(1.2) 27.3(0.6) 28.6M config, model, log
Scene + Face v2t 6.8(0.2) 20.7(0.3) 32.1(1.5) 25.7(1.5) 16.29M config, model, log

We can also study the cumulative effect of adding the experts one at a time:

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 6.7(1.3) 20.8(0.7) 32.7(0.9) 25.0(1.0) 7.62M config, model, log
Prev. + Speech t2v 6.1(0.5) 21.0(1.0) 30.7(0.4) 27.2(0.3) 28.6M config, model, log
Prev. + Audio t2v 6.8(0.1) 23.1(0.3) 33.6(0.5) 21.8(0.8) 36.09M config, model, log
Prev. + Flow t2v 10.9(1.1) 31.0(1.1) 44.1(1.1) 14.0(1.0) 42.79M config, model, log
Prev. + RGB t2v 12.0(1.1) 33.0(0.1) 45.9(0.4) 12.7(0.6) 50.27M config, model, log
Prev. + Inst. t2v 13.7(2.1) 34.6(1.9) 49.4(1.0) 11.0(1.0) 57.76M config, model, log
Prev. + R2P1D t2v 15.5(0.7) 40.1(0.2) 54.9(1.5) 8.3(0.6) 64.06M config, model, log
Prev. + OCR t2v 15.6(0.5) 39.9(1.1) 54.2(1.1) 9.0(0.0) 72.98M config, model, log
Prev. + Face t2v 16.1(1.5) 41.2(0.5) 54.5(0.9) 8.3(0.6) 79.29M config, model, log
Scene v2t 6.6(0.7) 21.3(0.4) 33.0(0.4) 25.2(2.5) 7.62M config, model, log
Prev. + Speech v2t 6.8(0.8) 21.1(0.9) 31.4(1.2) 27.3(0.6) 28.6M config, model, log
Prev. + Audio v2t 7.1(0.3) 22.8(0.5) 33.9(0.2) 22.2(0.8) 36.09M config, model, log
Prev. + Flow v2t 11.4(0.5) 30.8(0.9) 44.0(1.0) 14.0(1.0) 42.79M config, model, log
Prev. + RGB v2t 11.9(0.2) 32.6(1.8) 45.7(1.6) 12.7(0.6) 50.27M config, model, log
Prev. + Inst. v2t 12.9(0.6) 36.1(1.1) 49.2(1.3) 11.0(1.0) 57.76M config, model, log
Prev. + R2P1D v2t 15.6(0.5) 40.3(0.5) 54.3(0.9) 8.7(0.6) 64.06M config, model, log
Prev. + OCR v2t 16.1(1.5) 38.6(1.0) 53.4(1.0) 9.0(1.0) 72.98M config, model, log
Prev. + Face v2t 16.2(1.6) 41.1(0.7) 55.1(0.3) 8.2(0.3) 79.29M config, model, log

Importance of Model Capacity: The next ablation investigates how the shared embedding dimension used by CE affects performance.

Dimension Task R@1 R@5 R@10 MdR Params Links
384 t2v 15.2(0.8) 40.2(1.5) 54.6(1.0) 8.7(0.6) 36.45M config, model, log
512 t2v 15.3(0.8) 39.9(0.9) 53.7(0.9) 9.0(0.0) 50.01M config, model, log
640 t2v 16.2(0.6) 40.2(0.3) 54.3(0.2) 8.7(0.6) 64.29M config, model, log
768 t2v 16.1(1.4) 41.1(0.4) 54.4(0.8) 8.3(0.6) 79.29M config, model, log
1024 t2v 15.6(0.8) 39.3(1.2) 53.8(0.4) 8.7(0.6) 111.44M config, model, log
384 v2t 15.0(0.2) 40.2(1.6) 53.6(2.5) 9.0(1.0) 36.45M config, model, log
512 v2t 15.6(2.7) 39.2(1.0) 52.4(0.5) 9.3(0.6) 50.01M config, model, log
640 v2t 15.8(0.8) 40.1(0.1) 54.4(0.7) 9.0(0.0) 64.29M config, model, log
768 v2t 15.6(1.3) 40.9(0.4) 55.2(0.5) 8.2(0.3) 79.29M config, model, log
1024 v2t 15.6(1.1) 39.9(0.6) 54.1(1.0) 8.7(0.6) 111.44M config, model, log

Ablation studies on MSVD

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the MSVD dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

Model Task R@1 R@5 R@10 MdR Params Links
Concat t2v 3.5(0.1) 13.9(0.2) 24.1(0.3) 32.7(0.6) 314.8k config, model, log
CE - MW,P,CG t2v 16.8(0.1) 44.8(0.5) 59.6(0.6) 7.0(0.0) 131.37M config, model, log
CE - P,CG t2v 18.9(1.0) 48.1(1.0) 63.2(0.7) 6.0(0.0) 131.37M config, model, log
CE - CG t2v 19.6(0.5) 49.4(0.8) 64.2(1.0) 5.7(0.6) 81.67M config, model, log
CE t2v 19.8(0.3) 49.0(0.3) 63.8(0.1) 6.0(0.0) 84.04M config, model, log
Concat v2t 4.0(0.6) 14.9(0.8) 22.4(0.8) 42.5(0.9) 314.8k config, model, log
CE - MW,P,CG v2t 21.7(0.3) 47.6(0.1) 58.2(0.4) 6.4(0.5) 131.37M config, model, log
CE - P,CG v2t 22.9(2.5) 48.6(1.2) 58.3(1.3) 6.2(0.4) 131.37M config, model, log
CE - CG v2t 23.4(2.5) 49.0(1.7) 59.9(1.4) 5.8(0.7) 81.67M config, model, log
CE v2t 23.9(1.4) 50.2(0.8) 59.6(1.2) 5.6(0.5) 84.04M config, model, log

Each successive row adds one component to the model. The names refer to the following model designs:

  • Concat: A bare-bones concatenation model. Each expert is first aggregated across time (which still requires some parameters for the variable-length VLAD layers); the aggregated experts are then concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
  • CE - MW,P,CG: The CE model without MoE weights, projection to a common dimension, or Collaborative Gating.
  • CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
  • CE - CG: The CE model without Collaborative Gating (CG).
  • CE: The full CE model.

Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates the contribution of each expert to the final embedding. Since not all experts are available for every video, we pair each expert with scene features to approximate its usefulness.

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 7.0(0.2) 21.8(0.4) 32.9(0.1) 23.0(0.0) 9.99M config, model, log
Scene + Inst. t2v 16.0(0.3) 41.8(0.5) 57.3(0.1) 8.0(0.0) 22.2M config, model, log
Scene + r2p1d t2v 16.8(0.0) 43.7(0.3) 58.4(0.1) 7.0(0.0) 21.02M config, model, log
Scene + RGB t2v 10.7(0.3) 31.5(0.2) 45.9(0.4) 12.7(0.6) 22.2M config, model, log
Scene + Flow t2v 11.7(0.3) 34.2(0.3) 48.1(0.2) 11.3(0.6) 21.41M config, model, log
Scene + OCR t2v 7.1(0.2) 22.3(0.0) 33.8(0.2) 22.7(0.6) 37.95M config, model, log
Scene + Face t2v 6.9(0.0) 22.4(0.4) 33.9(0.5) 22.7(0.6) 21.02M config, model, log
Scene v2t 8.3(0.4) 20.8(1.1) 29.0(0.4) 50.5(3.6) 9.99M config, model, log
Scene + Inst. v2t 17.6(0.9) 40.8(0.3) 52.1(0.3) 9.2(0.3) 22.2M config, model, log
Scene + r2p1d v2t 20.9(0.5) 43.7(2.0) 53.5(1.3) 8.2(0.8) 21.02M config, model, log
Scene + RGB v2t 11.0(0.8) 28.5(0.2) 37.2(0.5) 25.2(0.7) 22.2M config, model, log
Scene + Flow v2t 14.4(1.3) 34.4(0.6) 43.8(1.3) 15.8(1.3) 21.41M config, model, log
Scene + OCR v2t 8.9(1.4) 22.0(2.2) 29.1(1.0) 50.2(4.9) 37.95M config, model, log
Scene + Face v2t 8.4(0.9) 20.8(0.7) 29.3(0.7) 49.2(8.1) 21.02M config, model, log

We can also study the cumulative effect of adding the experts one at a time:

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 7.0(0.2) 21.8(0.4) 32.9(0.1) 23.0(0.0) 9.99M config, model, log
Prev. + Flow t2v 11.7(0.3) 34.2(0.3) 48.1(0.2) 11.3(0.6) 21.41M config, model, log
Prev. + RGB t2v 12.8(0.6) 37.4(0.4) 52.2(0.1) 10.0(0.0) 31.26M config, model, log
Prev. + Inst. t2v 16.0(0.3) 42.1(0.7) 57.4(0.3) 7.7(0.6) 41.11M config, model, log
Prev. + R2P1D t2v 19.3(0.3) 48.5(0.4) 63.3(0.7) 6.0(0.0) 49.78M config, model, log
Prev. + OCR t2v 18.6(0.9) 47.4(1.3) 62.6(1.1) 6.0(0.0) 75.38M config, model, log
Prev. + Face t2v 19.8(0.3) 49.0(0.3) 63.8(0.1) 6.0(0.0) 84.04M config, model, log
Scene v2t 8.3(0.4) 20.8(1.1) 29.0(0.4) 50.5(3.6) 9.99M config, model, log
Prev. + Flow v2t 14.4(1.3) 34.4(0.6) 43.8(1.3) 15.8(1.3) 21.41M config, model, log
Prev. + RGB v2t 14.5(2.6) 34.9(1.3) 44.7(1.4) 15.2(0.7) 31.26M config, model, log
Prev. + Inst. v2t 17.6(0.8) 41.1(1.9) 52.1(2.2) 9.2(1.4) 41.11M config, model, log
Prev. + R2P1D v2t 23.1(0.7) 48.2(0.7) 58.5(0.3) 6.3(0.4) 49.78M config, model, log
Prev. + OCR v2t 22.6(2.1) 46.7(1.9) 57.0(2.9) 7.0(1.0) 75.38M config, model, log
Prev. + Face v2t 23.9(1.4) 50.2(0.8) 59.6(1.2) 5.6(0.5) 84.04M config, model, log

Importance of Model Capacity: The next ablation investigates how the shared embedding dimension used by CE affects performance.

Dimension Task R@1 R@5 R@10 MdR Params Links
384 t2v 19.3(0.4) 48.1(0.7) 63.3(0.4) 6.0(0.0) 39.43M config, model, log
512 t2v 19.4(0.4) 48.9(0.4) 63.7(0.2) 6.0(0.0) 53.71M config, model, log
640 t2v 19.8(0.8) 49.5(0.6) 64.2(0.8) 6.0(0.0) 68.58M config, model, log
768 t2v 19.8(0.3) 49.0(0.3) 63.8(0.1) 6.0(0.0) 84.04M config, model, log
1024 t2v 18.9(0.8) 47.7(1.6) 62.9(1.4) 6.3(0.6) 116.73M config, model, log
384 v2t 21.8(0.3) 48.8(1.4) 59.7(1.9) 6.2(0.7) 39.43M config, model, log
512 v2t 23.5(0.8) 48.7(0.2) 59.0(0.8) 6.0(0.0) 53.71M config, model, log
640 v2t 23.8(2.8) 48.3(1.8) 60.1(2.3) 6.3(0.6) 68.58M config, model, log
768 v2t 23.9(1.4) 50.2(0.8) 59.6(1.2) 5.6(0.5) 84.04M config, model, log
1024 v2t 21.2(2.7) 46.5(1.9) 57.0(1.6) 7.0(1.0) 116.73M config, model, log

Ablation studies on ACTIVITY-NET

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the ACTIVITY-NET dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

Model Task R@1 R@5 R@10 MdR Params Links
Concat t2v 1.2(0.1) 5.2(0.3) 9.4(0.4) 120.0(6.2) 417.8k config, model, log
CE - MW,P,CG t2v 17.4(0.4) 45.8(0.6) 61.4(0.6) 6.3(0.6) 330.42M config, model, log
CE - P,CG t2v 18.3(0.7) 47.2(0.6) 63.2(0.5) 6.0(0.0) 330.42M config, model, log
CE - CG t2v 17.6(0.4) 46.8(0.5) 62.9(0.4) 6.0(0.0) 258.3M config, model, log
CE t2v 18.2(0.3) 47.7(0.6) 63.9(0.5) 6.0(0.0) 260.68M config, model, log
Concat v2t 1.3(0.1) 5.3(0.6) 9.7(0.6) 141.7(2.9) 417.8k config, model, log
CE - MW,P,CG v2t 17.6(0.2) 45.9(0.5) 61.5(0.7) 6.7(0.6) 330.42M config, model, log
CE - P,CG v2t 17.2(0.1) 46.3(0.4) 62.5(0.5) 6.0(0.0) 330.42M config, model, log
CE - CG v2t 17.3(0.2) 46.4(0.3) 62.5(0.5) 6.0(0.0) 258.3M config, model, log
CE v2t 17.7(0.6) 46.6(0.7) 62.8(0.4) 6.0(0.0) 260.68M config, model, log

Each successive row adds one component to the model. The names refer to the following model designs:

  • Concat: A bare-bones concatenation model. Each expert is first aggregated across time (which still requires some parameters for the variable-length VLAD layers); the aggregated experts are then concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
  • CE - MW,P,CG: The CE model without MoE weights, projection to a common dimension, or Collaborative Gating.
  • CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
  • CE - CG: The CE model without Collaborative Gating (CG).
  • CE: The full CE model.

Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates the contribution of each expert to the final embedding. Since not all experts are available for every video, we pair each expert with scene features to approximate its usefulness.

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 7.3(0.4) 21.5(0.7) 32.4(0.2) 24.7(0.6) 25.97M config, model, log
Scene + Inst. t2v 14.4(0.3) 38.9(0.3) 53.3(0.2) 9.0(0.0) 54.13M config, model, log
Scene + r2p1d t2v 16.0(0.5) 43.8(0.4) 60.6(0.3) 7.0(0.0) 52.95M config, model, log
Scene + RGB t2v 10.2(0.7) 29.7(0.4) 43.1(0.7) 14.3(0.6) 54.13M config, model, log
Scene + Flow t2v 13.8(0.3) 37.9(0.1) 53.1(0.3) 9.0(0.0) 53.35M config, model, log
Scene + Audio t2v 8.0(0.2) 24.1(0.7) 35.7(0.3) 21.3(0.6) 53.15M config, model, log
Scene + OCR t2v 7.4(0.3) 23.0(0.3) 33.8(0.3) 23.3(0.6) 62.73M config, model, log
Scene + Speech t2v 7.4(0.1) 23.1(0.4) 34.4(0.6) 22.3(0.6) 75.66M config, model, log
Scene + Face t2v 7.9(0.3) 24.7(0.8) 36.2(0.7) 21.0(0.0) 52.95M config, model, log
Scene v2t 6.4(0.2) 20.4(0.3) 31.4(0.1) 25.3(0.6) 25.97M config, model, log
Scene + Inst. v2t 12.4(0.2) 35.9(0.1) 50.5(0.3) 10.0(0.0) 54.13M config, model, log
Scene + r2p1d v2t 13.7(0.2) 40.5(0.2) 57.1(0.3) 8.0(0.0) 52.95M config, model, log
Scene + RGB v2t 9.3(0.2) 28.2(0.4) 41.3(0.5) 15.7(0.6) 54.13M config, model, log
Scene + Flow v2t 12.4(0.1) 36.3(0.4) 52.1(0.3) 10.0(0.0) 53.35M config, model, log
Scene + Audio v2t 7.3(0.2) 23.2(0.3) 34.1(0.3) 22.0(0.0) 53.15M config, model, log
Scene + OCR v2t 6.4(0.1) 20.5(0.8) 31.2(0.5) 26.7(0.6) 62.73M config, model, log
Scene + Speech v2t 6.4(0.2) 20.9(0.1) 32.4(0.4) 24.7(0.6) 75.66M config, model, log
Scene + Face v2t 7.2(0.3) 23.1(0.3) 34.4(1.0) 21.7(0.6) 52.95M config, model, log

We can also study the cumulative effect of adding the experts one at a time:

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 7.3(0.4) 21.5(0.7) 32.4(0.2) 24.7(0.6) 25.97M config, model, log
Prev. + Speech t2v 7.4(0.1) 23.1(0.4) 34.4(0.6) 22.3(0.6) 75.66M config, model, log
Prev. + Audio t2v 7.7(0.3) 23.7(0.1) 35.3(0.3) 21.3(1.2) 100.47M config, model, log
Prev. + Flow t2v 14.4(0.1) 39.1(0.5) 54.5(0.1) 9.0(0.0) 125.48M config, model, log
Prev. + RGB t2v 14.9(0.6) 40.7(0.2) 55.9(0.2) 8.0(0.0) 151.27M config, model, log
Prev. + Inst. t2v 15.8(0.6) 43.2(0.3) 58.3(0.4) 7.0(0.0) 177.07M config, model, log
Prev. + R2P1D t2v 18.2(0.5) 46.4(0.5) 62.5(1.1) 6.0(0.0) 201.68M config, model, log
Prev. + OCR t2v 18.0(0.6) 46.9(0.3) 63.2(0.2) 6.0(0.0) 236.06M config, model, log
Prev. + Face t2v 18.4(0.2) 47.9(0.5) 63.6(0.5) 6.0(0.0) 260.68M config, model, log
Scene v2t 6.4(0.2) 20.4(0.3) 31.4(0.1) 25.3(0.6) 25.97M config, model, log
Prev. + Speech v2t 6.4(0.2) 20.9(0.1) 32.4(0.4) 24.7(0.6) 75.66M config, model, log
Prev. + Audio v2t 7.2(0.2) 22.7(0.3) 34.3(0.2) 22.3(0.6) 100.47M config, model, log
Prev. + Flow v2t 13.8(0.3) 37.9(0.3) 53.4(0.2) 9.0(0.0) 125.48M config, model, log
Prev. + RGB v2t 14.6(0.6) 40.0(0.3) 55.4(0.4) 8.3(0.6) 151.27M config, model, log
Prev. + Inst. v2t 15.4(0.4) 42.1(0.2) 58.1(0.7) 7.7(0.6) 177.07M config, model, log
Prev. + R2P1D v2t 17.4(0.4) 45.7(0.4) 62.0(0.8) 6.3(0.6) 201.68M config, model, log
Prev. + OCR v2t 17.0(0.2) 45.8(0.2) 61.8(0.2) 6.3(0.6) 236.06M config, model, log
Prev. + Face v2t 17.7(0.7) 46.5(0.5) 62.7(0.4) 6.3(0.6) 260.68M config, model, log

Importance of Model Capacity: The next ablation investigates how the shared embedding dimension used by CE affects performance.

Dimension Task R@1 R@5 R@10 MdR Params Links
384 t2v 17.5(0.2) 46.9(0.0) 63.2(0.2) 6.0(0.0) 127.3M config, model, log
512 t2v 18.0(0.3) 47.4(0.8) 63.1(0.4) 6.0(0.0) 171.04M config, model, log
640 t2v 18.2(0.6) 47.8(1.0) 63.4(0.9) 6.0(0.0) 215.5M config, model, log
768 t2v 18.2(0.3) 47.7(0.6) 63.9(0.5) 6.0(0.0) 260.68M config, model, log
1024 t2v 18.3(0.1) 48.2(0.8) 63.3(0.5) 6.0(0.0) 353.2M config, model, log
384 v2t 16.8(0.3) 45.3(0.1) 61.9(0.4) 6.7(0.6) 127.3M config, model, log
512 v2t 17.2(0.1) 45.9(0.7) 62.0(0.7) 6.7(0.6) 171.04M config, model, log
640 v2t 17.5(0.5) 46.1(0.5) 62.4(0.3) 6.3(0.6) 215.5M config, model, log
768 v2t 17.7(0.6) 46.6(0.7) 62.8(0.4) 6.0(0.0) 260.68M config, model, log
1024 v2t 17.7(0.1) 47.0(0.5) 63.4(0.4) 6.0(0.0) 353.2M config, model, log

Ablation studies on QuerYD

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the QuerYD dataset.

Experts Task R@1 R@5 R@10 R@50 MdR MnR Geom Params Links
Scene t2v 8.7(0.4) 26.3(1.1) 37.1(0.7) 68.5(2.2) 22.2(1.6) 52.3(3.0) 20.4(0.1) 7.51M config, model, log
Scene + Inst. t2v 11.7(1.4) 31.6(0.9) 43.4(1.3) 74.5(0.9) 14.0(1.0) 41.1(2.1) 25.2(0.8) 17.25M config, model, log
Scene + r2p1d t2v 11.7(2.1) 32.1(3.0) 45.3(3.3) 74.6(0.4) 13.7(1.9) 42.9(2.2) 25.7(2.4) 16.07M config, model, log
Scene + Audio t2v 7.6(2.7) 27.4(1.4) 40.4(0.9) 69.1(0.9) 17.0(1.7) 49.0(1.9) 20.2(2.3) 17.25M config, model, log
Scene v2t 9.1(0.8) 25.4(0.9) 35.3(1.5) 68.2(2.2) 23.2(0.3) 52.6(2.6) 20.1(0.5) 7.51M config, model, log
Scene + Inst. v2t 11.9(0.5) 31.0(3.6) 43.5(2.7) 74.8(1.8) 14.5(0.9) 40.8(2.1) 25.2(1.1) 17.25M config, model, log
Scene + r2p1d v2t 12.7(1.4) 30.9(2.8) 44.0(1.8) 74.3(1.2) 14.3(1.2) 42.8(1.7) 25.8(1.7) 16.07M config, model, log
Scene + Audio v2t 10.1(1.2) 25.7(1.5) 37.5(1.2) 69.8(1.6) 20.0(1.3) 48.9(2.0) 21.3(1.1) 17.25M config, model, log

We can also study the cumulative effect of adding the experts one at a time:

Experts Task R@1 R@5 R@10 R@50 MdR MnR Geom Params Links
Scene t2v 8.7(0.4) 26.3(1.1) 37.1(0.7) 68.5(2.2) 22.2(1.6) 52.3(3.0) 20.4(0.1) 7.51M config, model, log
Prev. + Audio t2v 7.6(2.7) 27.4(1.4) 40.4(0.9) 69.1(0.9) 17.0(1.7) 49.0(1.9) 20.2(2.3) 17.25M config, model, log
Prev. + Inst. t2v 12.7(1.7) 34.8(1.7) 47.0(1.3) 78.0(1.0) 12.3(0.6) 37.6(2.1) 27.5(1.5) 24.63M config, model, log
Prev. + R2P1D t2v 14.3(0.3) 37.5(1.3) 48.6(0.8) 78.8(0.3) 11.3(0.6) 35.2(1.8) 29.7(0.3) 30.82M config, model, log
Scene v2t 9.1(0.8) 25.4(0.9) 35.3(1.5) 68.2(2.2) 23.2(0.3) 52.6(2.6) 20.1(0.5) 7.51M config, model, log
Prev. + Audio v2t 10.1(1.2) 25.7(1.5) 37.5(1.2) 69.8(1.6) 20.0(1.3) 48.9(2.0) 21.3(1.1) 17.25M config, model, log
Prev. + Inst. v2t 12.8(1.3) 33.5(2.8) 46.6(1.0) 76.7(1.7) 11.8(0.8) 37.6(1.9) 27.1(0.6) 24.63M config, model, log
Prev. + R2P1D v2t 14.0(0.3) 35.4(2.9) 47.2(2.8) 78.7(2.4) 12.3(1.5) 35.8(2.4) 28.6(1.2) 30.82M config, model, log
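
The Geom column in the QuerYD tables is not defined in this document. Its values appear consistent with the geometric mean of R@1, R@5 and R@10, and the sketch below checks that reading against the t2v rows of the first QuerYD table. This interpretation is an assumption, as is the helper name `geometric_mean`.

```python
def geometric_mean(values):
    """Geometric mean of a sequence of positive numbers."""
    prod = 1.0
    for v in values:
        prod *= v
    return prod ** (1.0 / len(values))

# (R@1, R@5, R@10) and the reported Geom value, copied from the t2v rows
# of the first QuerYD table.
rows = {
    "Scene": ((8.7, 26.3, 37.1), 20.4),
    "Scene + Inst.": ((11.7, 31.6, 43.4), 25.2),
    "Scene + r2p1d": ((11.7, 32.1, 45.3), 25.7),
    "Scene + Audio": ((7.6, 27.4, 40.4), 20.2),
}

# Absolute gap between the recomputed and the reported value for each row.
diffs = {
    name: abs(geometric_mean(recalls) - reported)
    for name, (recalls, reported) in rows.items()
}
for name, diff in diffs.items():
    print(f"{name:14s} |recomputed - reported| = {diff:.2f}")
```

The recomputed values agree with the table to within about 0.2. A small gap is expected if the reported figure was averaged over runs rather than computed from the averaged recalls.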