We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the LSMDC dataset.
CE Design: First, we investigate the importance of the parts used by the CE model.
Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Concat | t2v | 0.5(0.5) | 2.4(0.9) | 4.9(1.7) | 195.8(22.1) | 150.63k | config, model, log |
CE - MW,P,CG | t2v | 9.7(0.3) | 24.0(0.9) | 33.3(0.5) | 29.5(2.5) | 159.78M | config, model, log |
CE - P,CG | t2v | 11.2(0.7) | 26.1(0.9) | 35.3(0.4) | 25.7(2.1) | 159.78M | config, model, log |
CE - CG | t2v | 10.6(0.4) | 26.9(0.5) | 35.3(0.5) | 27.2(1.8) | 114.48M | config, model, log |
CE | t2v | 11.2(0.4) | 26.9(1.1) | 34.8(2.0) | 25.3(3.1) | 116.86M | config, model, log |
Concat | v2t | 0.6(0.4) | 3.1(0.7) | 5.7(1.3) | 184.7(20.7) | 150.63k | config, model, log |
CE - MW,P,CG | v2t | 11.1(1.0) | 24.5(0.9) | 32.5(0.8) | 32.2(1.9) | 159.78M | config, model, log |
CE - P,CG | v2t | 11.5(0.2) | 25.3(1.0) | 34.3(0.3) | 28.0(3.0) | 159.78M | config, model, log |
CE - CG | v2t | 10.8(0.2) | 26.1(0.6) | 34.0(0.8) | 28.8(0.3) | 114.48M | config, model, log |
CE | v2t | 11.7(0.5) | 25.8(1.5) | 34.4(1.7) | 28.0(2.6) | 116.86M | config, model, log |
Each row adds an additional component to the model. The names refer to the following model designs:
- Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
- CE - MW,P,CG: The CE model without MoE weights (MW), projection to a common dimension (P), or Collaborative Gating (CG).
- CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
- CE - CG: The CE model without Collaborative Gating.
- CE: The full CE model.
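To make the "MoE weights" component more concrete, the following is a minimal sketch of a mixture-of-experts style similarity, in which per-expert cosine similarities are combined using softmax-normalised weights. It is an illustration under stated assumptions rather than the implementation used for these results: the function names, the toy vectors, and the `weight_logits` input are all hypothetical, and treating "without MoE weights" as uniform weighting is an assumption. Projection to a common dimension and Collaborative Gating are not modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_similarity(text_embs, video_embs, weight_logits):
    """Weighted sum of per-expert cosine similarities.

    text_embs, video_embs: dict mapping expert name -> vector.
    weight_logits: dict mapping expert name -> unnormalised score
    (hypothetical input; a trained model would predict these).
    """
    names = sorted(text_embs)
    weights = softmax([weight_logits[n] for n in names])
    return sum(
        w * cosine(text_embs[n], video_embs[n]) for n, w in zip(names, weights)
    )


# Toy example with two experts (values are made up for illustration).
text = {"scene": [1.0, 0.0], "flow": [0.0, 1.0]}
video = {"scene": [1.0, 0.0], "flow": [1.0, 0.0]}

# Equal logits give uniform weights (assumed stand-in for "no MoE weights").
uniform = moe_similarity(text, video, {"scene": 0.0, "flow": 0.0})
# Unequal logits let the better-matching expert dominate the score.
weighted = moe_similarity(text, video, {"scene": 2.0, "flow": 0.0})
print(uniform, weighted)
```

In this toy case the scene expert matches perfectly and the flow expert not at all, so shifting weight towards the scene expert raises the combined similarity above the uniform average.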
Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.
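For readers unfamiliar with the column names, the sketch below shows one common way to compute recall at K (R@K) and median rank (MdR) from a query-by-candidate similarity matrix. It assumes a one-to-one pairing in which candidate `i` is the correct match for query `i`, and that the v2t direction can be obtained by transposing the t2v matrix; both are simplifying assumptions, and the function names are hypothetical rather than taken from the evaluation code behind these tables.

```python
from statistics import median


def ranks_from_similarity(sim):
    """Return the rank of the correct candidate for each query.

    sim[i][j] is the similarity of query i to candidate j, and
    candidate i is assumed to be the correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank is 1 plus the number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct candidate is in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the correct candidate (lower is better)."""
    return median(ranks)


def transpose(sim):
    """Swap the query and candidate roles."""
    return [list(col) for col in zip(*sim)]


# Toy 3x3 similarity matrix (rows: text queries, columns: videos).
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
t2v_ranks = ranks_from_similarity(sim)
v2t_ranks = ranks_from_similarity(transpose(sim))
print(t2v_ranks, recall_at_k(t2v_ranks, 1), median_rank(t2v_ranks))
print(v2t_ranks, recall_at_k(v2t_ranks, 1), median_rank(v2t_ranks))
```

Higher R@K and lower MdR indicate better retrieval, which is how the tables in this section should be read.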
Importance of Different Experts: The next ablation investigates the contribution of each expert to the final embedding. Because not every expert is available for every video, we pair each expert with scene features to approximate its usefulness.
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 4.2(0.4) | 12.6(0.4) | 18.6(0.6) | 85.5(0.5) | 12.83M | config, model, log |
Scene + Inst. | t2v | 8.0(1.9) | 19.4(1.5) | 27.1(1.1) | 45.2(3.3) | 27.87M | config, model, log |
Scene + r2p1d | t2v | 8.0(0.3) | 20.5(0.6) | 29.1(1.0) | 39.8(2.1) | 26.69M | config, model, log |
Scene + RGB | t2v | 5.9(0.5) | 16.3(0.4) | 22.4(0.3) | 59.3(2.5) | 27.87M | config, model, log |
Scene + Flow | t2v | 6.4(0.9) | 18.2(0.9) | 26.2(1.0) | 48.5(0.9) | 27.09M | config, model, log |
Scene + Audio | t2v | 6.3(0.1) | 15.9(0.6) | 24.4(0.5) | 50.7(1.5) | 26.6M | config, model, log |
Scene + OCR | t2v | 3.8(0.8) | 11.9(0.7) | 17.5(1.3) | 88.8(9.0) | 33.23M | config, model, log |
Scene + Speech | t2v | 3.8(0.4) | 12.3(0.5) | 18.1(0.6) | 83.8(6.5) | 27.46M | config, model, log |
Scene + Face | t2v | 4.4(0.4) | 12.6(0.2) | 19.2(0.6) | 83.8(6.0) | 26.4M | config, model, log |
Scene | v2t | 4.6(0.4) | 12.6(1.1) | 18.0(0.9) | 91.8(2.0) | 12.83M | config, model, log |
Scene + Inst. | v2t | 7.6(0.9) | 19.7(0.1) | 27.3(0.9) | 47.5(1.3) | 27.87M | config, model, log |
Scene + r2p1d | v2t | 7.9(0.2) | 20.7(0.8) | 28.0(0.7) | 42.7(4.3) | 26.69M | config, model, log |
Scene + RGB | v2t | 6.0(0.2) | 16.4(0.6) | 22.8(0.8) | 64.0(4.8) | 27.87M | config, model, log |
Scene + Flow | v2t | 6.6(1.2) | 17.4(0.4) | 24.8(0.9) | 50.2(1.4) | 27.09M | config, model, log |
Scene + Audio | v2t | 6.3(0.5) | 17.3(0.5) | 23.8(0.8) | 56.3(3.5) | 26.6M | config, model, log |
Scene + OCR | v2t | 4.8(0.6) | 12.4(0.7) | 17.6(0.4) | 96.7(16.4) | 33.23M | config, model, log |
Scene + Speech | v2t | 3.9(0.2) | 11.9(0.1) | 17.8(0.3) | 84.8(4.4) | 27.46M | config, model, log |
Scene + Face | v2t | 4.9(0.3) | 13.2(0.4) | 19.1(1.7) | 93.2(7.5) | 26.4M | config, model, log |
We can also study their cumulative effect:
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 4.2(0.4) | 12.6(0.4) | 18.6(0.6) | 85.5(0.5) | 12.83M | config, model, log |
Prev. + Speech | t2v | 3.8(0.4) | 12.3(0.5) | 18.1(0.6) | 83.8(6.5) | 27.46M | config, model, log |
Prev. + Audio | t2v | 6.0(0.5) | 17.2(0.6) | 24.4(0.7) | 50.5(1.3) | 38.86M | config, model, log |
Prev. + Flow | t2v | 7.6(0.7) | 20.1(0.6) | 28.1(0.7) | 37.7(2.8) | 50.75M | config, model, log |
Prev. + RGB | t2v | 8.9(0.9) | 21.8(0.2) | 29.9(0.7) | 34.5(1.3) | 63.43M | config, model, log |
Prev. + Inst. | t2v | 10.4(1.3) | 25.6(1.0) | 34.2(1.0) | 28.3(1.5) | 76.11M | config, model, log |
Prev. + R2P1D | t2v | 11.3(0.4) | 27.4(0.9) | 36.3(1.1) | 25.7(1.2) | 87.61M | config, model, log |
Prev. + OCR | t2v | 11.5(0.5) | 26.2(0.6) | 35.8(0.5) | 25.7(1.8) | 105.65M | config, model, log |
Prev. + Face | t2v | 11.3(0.3) | 26.7(1.5) | 35.1(1.6) | 25.3(3.1) | 116.86M | config, model, log |
Scene | v2t | 4.6(0.4) | 12.6(1.1) | 18.0(0.9) | 91.8(2.0) | 12.83M | config, model, log |
Prev. + Speech | v2t | 3.9(0.2) | 11.9(0.1) | 17.8(0.3) | 84.8(4.4) | 27.46M | config, model, log |
Prev. + Audio | v2t | 7.2(0.3) | 18.2(0.3) | 24.7(1.2) | 57.0(2.2) | 38.86M | config, model, log |
Prev. + Flow | v2t | 7.9(0.6) | 20.5(0.4) | 28.6(0.5) | 41.3(2.5) | 50.75M | config, model, log |
Prev. + RGB | v2t | 9.3(0.6) | 21.9(1.2) | 30.1(0.8) | 34.8(1.3) | 63.43M | config, model, log |
Prev. + Inst. | v2t | 11.1(0.8) | 25.1(1.6) | 33.8(1.0) | 30.0(1.3) | 76.11M | config, model, log |
Prev. + R2P1D | v2t | 11.8(0.2) | 26.6(0.9) | 35.8(0.4) | 27.7(2.1) | 87.61M | config, model, log |
Prev. + OCR | v2t | 11.2(0.3) | 26.0(1.3) | 33.8(0.9) | 27.4(1.2) | 105.65M | config, model, log |
Prev. + Face | v2t | 11.6(0.5) | 25.8(1.4) | 34.4(1.7) | 27.7(2.5) | 116.86M | config, model, log |
Importance of Model Capacity: The next ablation investigates the effect of the size of the shared embedding dimension used by CE.
Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
384 | t2v | 11.1(0.6) | 26.6(0.6) | 35.8(0.1) | 26.2(2.3) | 55.27M | config, model, log |
512 | t2v | 11.2(0.4) | 26.2(0.8) | 35.8(0.6) | 26.3(2.1) | 75.08M | config, model, log |
640 | t2v | 11.7(0.6) | 26.9(1.4) | 36.0(1.3) | 25.3(0.6) | 95.61M | config, model, log |
768 | t2v | 11.2(0.4) | 26.9(1.1) | 34.8(2.0) | 25.3(3.1) | 116.86M | config, model, log |
1024 | t2v | 11.3(0.3) | 27.2(0.9) | 36.1(1.3) | 24.7(1.5) | 161.52M | config, model, log |
384 | v2t | 11.3(0.7) | 26.0(1.1) | 34.5(1.4) | 28.8(1.6) | 55.27M | config, model, log |
512 | v2t | 11.0(0.5) | 26.2(0.2) | 35.0(1.5) | 29.7(1.2) | 75.08M | config, model, log |
640 | v2t | 12.0(0.6) | 26.1(1.5) | 34.1(1.1) | 28.2(1.0) | 95.61M | config, model, log |
768 | v2t | 11.7(0.5) | 25.8(1.5) | 34.4(1.7) | 28.0(2.6) | 116.86M | config, model, log |
1024 | v2t | 11.4(0.9) | 26.6(0.7) | 35.3(1.1) | 27.5(3.5) | 161.52M | config, model, log |
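When comparing rows, it can help to split each `value(spread)` cell into its two numbers. The sketch below parses a table row and checks whether the gap between two cells exceeds the sum of their parenthesised values. The text here does not state what the parenthesised number represents, so treating it as a spread (for example a standard deviation across runs) is an assumption, the comparison rule is only a rough heuristic, and the helper names are hypothetical.

```python
import re

# Matches cells such as "11.1(0.6)": a value followed by a bracketed number.
CELL = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\((\d+(?:\.\d+)?)\)\s*$")

METRICS = ["R@1", "R@5", "R@10", "MdR"]


def parse_cell(cell):
    """Split a 'value(spread)' cell into a (value, spread) tuple of floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"not a value(spread) cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def parse_row(line):
    """Parse one row of the six-metric-column tables used in this section."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    label, task = cells[0], cells[1]
    metrics = {name: parse_cell(c) for name, c in zip(METRICS, cells[2:6])}
    return {"label": label, "task": task, "metrics": metrics, "params": cells[6]}


def gap_exceeds_spread(a, b):
    """True if two (value, spread) cells differ by more than their summed spreads."""
    (value_a, spread_a), (value_b, spread_b) = a, b
    return abs(value_a - value_b) > spread_a + spread_b


small = parse_row(
    "384 | t2v | 11.1(0.6) | 26.6(0.6) | 35.8(0.1) | 26.2(2.3) | 55.27M | config, model, log |"
)
large = parse_row(
    "1024 | t2v | 11.3(0.3) | 27.2(0.9) | 36.1(1.3) | 24.7(1.5) | 161.52M | config, model, log |"
)
for name in METRICS:
    print(name, gap_exceeds_spread(small["metrics"][name], large["metrics"][name]))
```

The same helpers apply to the CE Design and expert tables above, since they share the same column layout.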
We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the DIDEMO dataset.
CE Design: First, we investigate the importance of the parts used by the CE model.
Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Concat | t2v | 1.6(0.1) | 7.9(0.6) | 14.2(0.7) | 65.5(2.3) | 374.62k | config, model, log |
CE - MW,P,CG | t2v | 14.8(0.3) | 40.4(1.7) | 54.7(0.8) | 8.7(0.6) | 107.26M | config, model, log |
CE - P,CG | t2v | 15.4(0.8) | 42.0(0.2) | 55.2(0.8) | 8.3(0.6) | 107.26M | config, model, log |
CE - CG | t2v | 14.6(1.0) | 40.1(0.2) | 54.2(0.2) | 8.7(0.6) | 76.91M | config, model, log |
CE | t2v | 16.1(1.4) | 41.1(0.4) | 54.4(0.8) | 8.3(0.6) | 79.29M | config, model, log |
Concat | v2t | 2.4(0.5) | 8.5(0.6) | 14.0(1.0) | 72.5(4.4) | 374.62k | config, model, log |
CE - MW,P,CG | v2t | 15.9(0.4) | 41.1(0.5) | 54.9(0.7) | 8.3(0.6) | 107.26M | config, model, log |
CE - P,CG | v2t | 16.2(0.3) | 41.4(0.2) | 54.4(0.9) | 8.3(0.6) | 107.26M | config, model, log |
CE - CG | v2t | 15.7(0.9) | 39.6(0.8) | 53.7(0.9) | 9.0(0.0) | 76.91M | config, model, log |
CE | v2t | 15.6(1.3) | 40.9(0.4) | 55.2(0.5) | 8.2(0.3) | 79.29M | config, model, log |
Each row adds an additional component to the model. The names refer to the following model designs:
- Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
- CE - MW,P,CG: The CE model without MoE weights (MW), projection to a common dimension (P), or Collaborative Gating (CG).
- CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
- CE - CG: The CE model without Collaborative Gating.
- CE: The full CE model.
Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.
Importance of Different Experts: The next ablation investigates the contribution of each expert to the final embedding. Because not every expert is available for every video, we pair each expert with scene features to approximate its usefulness.
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 6.7(1.3) | 20.8(0.7) | 32.7(0.9) | 25.0(1.0) | 7.62M | config, model, log |
Scene + Inst. | t2v | 12.1(0.6) | 31.7(0.2) | 46.0(1.3) | 13.0(1.0) | 17.47M | config, model, log |
Scene + r2p1d | t2v | 13.0(0.8) | 36.8(1.0) | 52.0(0.3) | 10.0(0.0) | 16.29M | config, model, log |
Scene + RGB | t2v | 8.7(1.6) | 26.4(1.0) | 39.7(0.4) | 16.8(0.8) | 17.47M | config, model, log |
Scene + Flow | t2v | 10.1(0.7) | 29.4(1.1) | 42.9(0.8) | 14.5(0.9) | 16.68M | config, model, log |
Scene + Audio | t2v | 6.3(0.3) | 23.0(0.8) | 33.6(0.7) | 22.7(1.3) | 17.47M | config, model, log |
Scene + OCR | t2v | 6.1(0.6) | 20.5(0.7) | 32.3(1.1) | 25.7(1.2) | 18.9M | config, model, log |
Scene + Speech | t2v | 6.1(0.5) | 21.0(1.0) | 30.7(0.4) | 27.2(0.3) | 28.6M | config, model, log |
Scene + Face | t2v | 6.4(0.1) | 20.8(1.2) | 31.9(0.7) | 24.8(1.8) | 16.29M | config, model, log |
Scene | v2t | 6.6(0.7) | 21.3(0.4) | 33.0(0.4) | 25.2(2.5) | 7.62M | config, model, log |
Scene + Inst. | v2t | 12.5(1.3) | 33.0(0.5) | 46.1(0.6) | 13.3(1.2) | 17.47M | config, model, log |
Scene + r2p1d | v2t | 13.5(0.6) | 36.2(0.2) | 51.3(1.3) | 10.3(0.6) | 16.29M | config, model, log |
Scene + RGB | v2t | 8.5(1.0) | 26.8(1.4) | 38.8(0.7) | 16.7(0.6) | 17.47M | config, model, log |
Scene + Flow | v2t | 11.3(0.2) | 29.8(0.6) | 42.1(1.2) | 15.3(1.5) | 16.68M | config, model, log |
Scene + Audio | v2t | 6.6(0.4) | 23.0(2.3) | 33.6(1.0) | 22.2(2.0) | 17.47M | config, model, log |
Scene + OCR | v2t | 6.6(0.3) | 20.9(1.2) | 32.1(0.7) | 26.2(1.8) | 18.9M | config, model, log |
Scene + Speech | v2t | 6.8(0.8) | 21.1(0.9) | 31.4(1.2) | 27.3(0.6) | 28.6M | config, model, log |
Scene + Face | v2t | 6.8(0.2) | 20.7(0.3) | 32.1(1.5) | 25.7(1.5) | 16.29M | config, model, log |
We can also study their cumulative effect:
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 6.7(1.3) | 20.8(0.7) | 32.7(0.9) | 25.0(1.0) | 7.62M | config, model, log |
Prev. + Speech | t2v | 6.1(0.5) | 21.0(1.0) | 30.7(0.4) | 27.2(0.3) | 28.6M | config, model, log |
Prev. + Audio | t2v | 6.8(0.1) | 23.1(0.3) | 33.6(0.5) | 21.8(0.8) | 36.09M | config, model, log |
Prev. + Flow | t2v | 10.9(1.1) | 31.0(1.1) | 44.1(1.1) | 14.0(1.0) | 42.79M | config, model, log |
Prev. + RGB | t2v | 12.0(1.1) | 33.0(0.1) | 45.9(0.4) | 12.7(0.6) | 50.27M | config, model, log |
Prev. + Inst. | t2v | 13.7(2.1) | 34.6(1.9) | 49.4(1.0) | 11.0(1.0) | 57.76M | config, model, log |
Prev. + R2P1D | t2v | 15.5(0.7) | 40.1(0.2) | 54.9(1.5) | 8.3(0.6) | 64.06M | config, model, log |
Prev. + OCR | t2v | 15.6(0.5) | 39.9(1.1) | 54.2(1.1) | 9.0(0.0) | 72.98M | config, model, log |
Prev. + Face | t2v | 16.1(1.5) | 41.2(0.5) | 54.5(0.9) | 8.3(0.6) | 79.29M | config, model, log |
Scene | v2t | 6.6(0.7) | 21.3(0.4) | 33.0(0.4) | 25.2(2.5) | 7.62M | config, model, log |
Prev. + Speech | v2t | 6.8(0.8) | 21.1(0.9) | 31.4(1.2) | 27.3(0.6) | 28.6M | config, model, log |
Prev. + Audio | v2t | 7.1(0.3) | 22.8(0.5) | 33.9(0.2) | 22.2(0.8) | 36.09M | config, model, log |
Prev. + Flow | v2t | 11.4(0.5) | 30.8(0.9) | 44.0(1.0) | 14.0(1.0) | 42.79M | config, model, log |
Prev. + RGB | v2t | 11.9(0.2) | 32.6(1.8) | 45.7(1.6) | 12.7(0.6) | 50.27M | config, model, log |
Prev. + Inst. | v2t | 12.9(0.6) | 36.1(1.1) | 49.2(1.3) | 11.0(1.0) | 57.76M | config, model, log |
Prev. + R2P1D | v2t | 15.6(0.5) | 40.3(0.5) | 54.3(0.9) | 8.7(0.6) | 64.06M | config, model, log |
Prev. + OCR | v2t | 16.1(1.5) | 38.6(1.0) | 53.4(1.0) | 9.0(1.0) | 72.98M | config, model, log |
Prev. + Face | v2t | 16.2(1.6) | 41.1(0.7) | 55.1(0.3) | 8.2(0.3) | 79.29M | config, model, log |
Importance of Model Capacity: The next ablation investigates the effect of the size of the shared embedding dimension used by CE.
Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
384 | t2v | 15.2(0.8) | 40.2(1.5) | 54.6(1.0) | 8.7(0.6) | 36.45M | config, model, log |
512 | t2v | 15.3(0.8) | 39.9(0.9) | 53.7(0.9) | 9.0(0.0) | 50.01M | config, model, log |
640 | t2v | 16.2(0.6) | 40.2(0.3) | 54.3(0.2) | 8.7(0.6) | 64.29M | config, model, log |
768 | t2v | 16.1(1.4) | 41.1(0.4) | 54.4(0.8) | 8.3(0.6) | 79.29M | config, model, log |
1024 | t2v | 15.6(0.8) | 39.3(1.2) | 53.8(0.4) | 8.7(0.6) | 111.44M | config, model, log |
384 | v2t | 15.0(0.2) | 40.2(1.6) | 53.6(2.5) | 9.0(1.0) | 36.45M | config, model, log |
512 | v2t | 15.6(2.7) | 39.2(1.0) | 52.4(0.5) | 9.3(0.6) | 50.01M | config, model, log |
640 | v2t | 15.8(0.8) | 40.1(0.1) | 54.4(0.7) | 9.0(0.0) | 64.29M | config, model, log |
768 | v2t | 15.6(1.3) | 40.9(0.4) | 55.2(0.5) | 8.2(0.3) | 79.29M | config, model, log |
1024 | v2t | 15.6(1.1) | 39.9(0.6) | 54.1(1.0) | 8.7(0.6) | 111.44M | config, model, log |
We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the MSVD dataset.
CE Design: First, we investigate the importance of the parts used by the CE model.
Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Concat | t2v | 3.5(0.1) | 13.9(0.2) | 24.1(0.3) | 32.7(0.6) | 314.8k | config, model, log |
CE - MW,P,CG | t2v | 16.8(0.1) | 44.8(0.5) | 59.6(0.6) | 7.0(0.0) | 131.37M | config, model, log |
CE - P,CG | t2v | 18.9(1.0) | 48.1(1.0) | 63.2(0.7) | 6.0(0.0) | 131.37M | config, model, log |
CE - CG | t2v | 19.6(0.5) | 49.4(0.8) | 64.2(1.0) | 5.7(0.6) | 81.67M | config, model, log |
CE | t2v | 19.8(0.3) | 49.0(0.3) | 63.8(0.1) | 6.0(0.0) | 84.04M | config, model, log |
Concat | v2t | 4.0(0.6) | 14.9(0.8) | 22.4(0.8) | 42.5(0.9) | 314.8k | config, model, log |
CE - MW,P,CG | v2t | 21.7(0.3) | 47.6(0.1) | 58.2(0.4) | 6.4(0.5) | 131.37M | config, model, log |
CE - P,CG | v2t | 22.9(2.5) | 48.6(1.2) | 58.3(1.3) | 6.2(0.4) | 131.37M | config, model, log |
CE - CG | v2t | 23.4(2.5) | 49.0(1.7) | 59.9(1.4) | 5.8(0.7) | 81.67M | config, model, log |
CE | v2t | 23.9(1.4) | 50.2(0.8) | 59.6(1.2) | 5.6(0.5) | 84.04M | config, model, log |
Each row adds an additional component to the model. The names refer to the following model designs:
- Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
- CE - MW,P,CG: The CE model without MoE weights (MW), projection to a common dimension (P), or Collaborative Gating (CG).
- CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
- CE - CG: The CE model without Collaborative Gating.
- CE: The full CE model.
Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.
Importance of Different Experts: The next ablation investigates the contribution of each expert to the final embedding. Because not every expert is available for every video, we pair each expert with scene features to approximate its usefulness.
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 7.0(0.2) | 21.8(0.4) | 32.9(0.1) | 23.0(0.0) | 9.99M | config, model, log |
Scene + Inst. | t2v | 16.0(0.3) | 41.8(0.5) | 57.3(0.1) | 8.0(0.0) | 22.2M | config, model, log |
Scene + r2p1d | t2v | 16.8(0.0) | 43.7(0.3) | 58.4(0.1) | 7.0(0.0) | 21.02M | config, model, log |
Scene + RGB | t2v | 10.7(0.3) | 31.5(0.2) | 45.9(0.4) | 12.7(0.6) | 22.2M | config, model, log |
Scene + Flow | t2v | 11.7(0.3) | 34.2(0.3) | 48.1(0.2) | 11.3(0.6) | 21.41M | config, model, log |
Scene + OCR | t2v | 7.1(0.2) | 22.3(0.0) | 33.8(0.2) | 22.7(0.6) | 37.95M | config, model, log |
Scene + Face | t2v | 6.9(0.0) | 22.4(0.4) | 33.9(0.5) | 22.7(0.6) | 21.02M | config, model, log |
Scene | v2t | 8.3(0.4) | 20.8(1.1) | 29.0(0.4) | 50.5(3.6) | 9.99M | config, model, log |
Scene + Inst. | v2t | 17.6(0.9) | 40.8(0.3) | 52.1(0.3) | 9.2(0.3) | 22.2M | config, model, log |
Scene + r2p1d | v2t | 20.9(0.5) | 43.7(2.0) | 53.5(1.3) | 8.2(0.8) | 21.02M | config, model, log |
Scene + RGB | v2t | 11.0(0.8) | 28.5(0.2) | 37.2(0.5) | 25.2(0.7) | 22.2M | config, model, log |
Scene + Flow | v2t | 14.4(1.3) | 34.4(0.6) | 43.8(1.3) | 15.8(1.3) | 21.41M | config, model, log |
Scene + OCR | v2t | 8.9(1.4) | 22.0(2.2) | 29.1(1.0) | 50.2(4.9) | 37.95M | config, model, log |
Scene + Face | v2t | 8.4(0.9) | 20.8(0.7) | 29.3(0.7) | 49.2(8.1) | 21.02M | config, model, log |
We can also study their cumulative effect:
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 7.0(0.2) | 21.8(0.4) | 32.9(0.1) | 23.0(0.0) | 9.99M | config, model, log |
Prev. + Flow | t2v | 11.7(0.3) | 34.2(0.3) | 48.1(0.2) | 11.3(0.6) | 21.41M | config, model, log |
Prev. + RGB | t2v | 12.8(0.6) | 37.4(0.4) | 52.2(0.1) | 10.0(0.0) | 31.26M | config, model, log |
Prev. + Inst. | t2v | 16.0(0.3) | 42.1(0.7) | 57.4(0.3) | 7.7(0.6) | 41.11M | config, model, log |
Prev. + R2P1D | t2v | 19.3(0.3) | 48.5(0.4) | 63.3(0.7) | 6.0(0.0) | 49.78M | config, model, log |
Prev. + OCR | t2v | 18.6(0.9) | 47.4(1.3) | 62.6(1.1) | 6.0(0.0) | 75.38M | config, model, log |
Prev. + Face | t2v | 19.8(0.3) | 49.0(0.3) | 63.8(0.1) | 6.0(0.0) | 84.04M | config, model, log |
Scene | v2t | 8.3(0.4) | 20.8(1.1) | 29.0(0.4) | 50.5(3.6) | 9.99M | config, model, log |
Prev. + Flow | v2t | 14.4(1.3) | 34.4(0.6) | 43.8(1.3) | 15.8(1.3) | 21.41M | config, model, log |
Prev. + RGB | v2t | 14.5(2.6) | 34.9(1.3) | 44.7(1.4) | 15.2(0.7) | 31.26M | config, model, log |
Prev. + Inst. | v2t | 17.6(0.8) | 41.1(1.9) | 52.1(2.2) | 9.2(1.4) | 41.11M | config, model, log |
Prev. + R2P1D | v2t | 23.1(0.7) | 48.2(0.7) | 58.5(0.3) | 6.3(0.4) | 49.78M | config, model, log |
Prev. + OCR | v2t | 22.6(2.1) | 46.7(1.9) | 57.0(2.9) | 7.0(1.0) | 75.38M | config, model, log |
Prev. + Face | v2t | 23.9(1.4) | 50.2(0.8) | 59.6(1.2) | 5.6(0.5) | 84.04M | config, model, log |
Importance of Model Capacity: The next ablation investigates the effect of the size of the shared embedding dimension used by CE.
Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
384 | t2v | 19.3(0.4) | 48.1(0.7) | 63.3(0.4) | 6.0(0.0) | 39.43M | config, model, log |
512 | t2v | 19.4(0.4) | 48.9(0.4) | 63.7(0.2) | 6.0(0.0) | 53.71M | config, model, log |
640 | t2v | 19.8(0.8) | 49.5(0.6) | 64.2(0.8) | 6.0(0.0) | 68.58M | config, model, log |
768 | t2v | 19.8(0.3) | 49.0(0.3) | 63.8(0.1) | 6.0(0.0) | 84.04M | config, model, log |
1024 | t2v | 18.9(0.8) | 47.7(1.6) | 62.9(1.4) | 6.3(0.6) | 116.73M | config, model, log |
384 | v2t | 21.8(0.3) | 48.8(1.4) | 59.7(1.9) | 6.2(0.7) | 39.43M | config, model, log |
512 | v2t | 23.5(0.8) | 48.7(0.2) | 59.0(0.8) | 6.0(0.0) | 53.71M | config, model, log |
640 | v2t | 23.8(2.8) | 48.3(1.8) | 60.1(2.3) | 6.3(0.6) | 68.58M | config, model, log |
768 | v2t | 23.9(1.4) | 50.2(0.8) | 59.6(1.2) | 5.6(0.5) | 84.04M | config, model, log |
1024 | v2t | 21.2(2.7) | 46.5(1.9) | 57.0(1.6) | 7.0(1.0) | 116.73M | config, model, log |
We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the ACTIVITY-NET dataset.
CE Design: First, we investigate the importance of the parts used by the CE model.
Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Concat | t2v | 1.2(0.1) | 5.2(0.3) | 9.4(0.4) | 120.0(6.2) | 417.8k | config, model, log |
CE - MW,P,CG | t2v | 17.4(0.4) | 45.8(0.6) | 61.4(0.6) | 6.3(0.6) | 330.42M | config, model, log |
CE - P,CG | t2v | 18.3(0.7) | 47.2(0.6) | 63.2(0.5) | 6.0(0.0) | 330.42M | config, model, log |
CE - CG | t2v | 17.6(0.4) | 46.8(0.5) | 62.9(0.4) | 6.0(0.0) | 258.3M | config, model, log |
CE | t2v | 18.2(0.3) | 47.7(0.6) | 63.9(0.5) | 6.0(0.0) | 260.68M | config, model, log |
Concat | v2t | 1.3(0.1) | 5.3(0.6) | 9.7(0.6) | 141.7(2.9) | 417.8k | config, model, log |
CE - MW,P,CG | v2t | 17.6(0.2) | 45.9(0.5) | 61.5(0.7) | 6.7(0.6) | 330.42M | config, model, log |
CE - P,CG | v2t | 17.2(0.1) | 46.3(0.4) | 62.5(0.5) | 6.0(0.0) | 330.42M | config, model, log |
CE - CG | v2t | 17.3(0.2) | 46.4(0.3) | 62.5(0.5) | 6.0(0.0) | 258.3M | config, model, log |
CE | v2t | 17.7(0.6) | 46.6(0.7) | 62.8(0.4) | 6.0(0.0) | 260.68M | config, model, log |
Each row adds an additional component to the model. The names refer to the following model designs:
- Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses slightly more VLAD clusters than the others so that the concatenated embedding matches the dimensionality of the text.
- CE - MW,P,CG: The CE model without MoE weights (MW), projection to a common dimension (P), or Collaborative Gating (CG).
- CE - P,CG: The CE model without projection to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
- CE - CG: The CE model without Collaborative Gating.
- CE: The full CE model.
Note that some metrics have been omitted from the table above to make room for the parameter counts; the omitted metrics can be found in the linked logs.
Importance of Different Experts: The next ablation investigates the contribution of each expert to the final embedding. Because not every expert is available for every video, we pair each expert with scene features to approximate its usefulness.
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 7.3(0.4) | 21.5(0.7) | 32.4(0.2) | 24.7(0.6) | 25.97M | config, model, log |
Scene + Inst. | t2v | 14.4(0.3) | 38.9(0.3) | 53.3(0.2) | 9.0(0.0) | 54.13M | config, model, log |
Scene + r2p1d | t2v | 16.0(0.5) | 43.8(0.4) | 60.6(0.3) | 7.0(0.0) | 52.95M | config, model, log |
Scene + RGB | t2v | 10.2(0.7) | 29.7(0.4) | 43.1(0.7) | 14.3(0.6) | 54.13M | config, model, log |
Scene + Flow | t2v | 13.8(0.3) | 37.9(0.1) | 53.1(0.3) | 9.0(0.0) | 53.35M | config, model, log |
Scene + Audio | t2v | 8.0(0.2) | 24.1(0.7) | 35.7(0.3) | 21.3(0.6) | 53.15M | config, model, log |
Scene + OCR | t2v | 7.4(0.3) | 23.0(0.3) | 33.8(0.3) | 23.3(0.6) | 62.73M | config, model, log |
Scene + Speech | t2v | 7.4(0.1) | 23.1(0.4) | 34.4(0.6) | 22.3(0.6) | 75.66M | config, model, log |
Scene + Face | t2v | 7.9(0.3) | 24.7(0.8) | 36.2(0.7) | 21.0(0.0) | 52.95M | config, model, log |
Scene | v2t | 6.4(0.2) | 20.4(0.3) | 31.4(0.1) | 25.3(0.6) | 25.97M | config, model, log |
Scene + Inst. | v2t | 12.4(0.2) | 35.9(0.1) | 50.5(0.3) | 10.0(0.0) | 54.13M | config, model, log |
Scene + r2p1d | v2t | 13.7(0.2) | 40.5(0.2) | 57.1(0.3) | 8.0(0.0) | 52.95M | config, model, log |
Scene + RGB | v2t | 9.3(0.2) | 28.2(0.4) | 41.3(0.5) | 15.7(0.6) | 54.13M | config, model, log |
Scene + Flow | v2t | 12.4(0.1) | 36.3(0.4) | 52.1(0.3) | 10.0(0.0) | 53.35M | config, model, log |
Scene + Audio | v2t | 7.3(0.2) | 23.2(0.3) | 34.1(0.3) | 22.0(0.0) | 53.15M | config, model, log |
Scene + OCR | v2t | 6.4(0.1) | 20.5(0.8) | 31.2(0.5) | 26.7(0.6) | 62.73M | config, model, log |
Scene + Speech | v2t | 6.4(0.2) | 20.9(0.1) | 32.4(0.4) | 24.7(0.6) | 75.66M | config, model, log |
Scene + Face | v2t | 7.2(0.3) | 23.1(0.3) | 34.4(1.0) | 21.7(0.6) | 52.95M | config, model, log |
We can also study their cumulative effect:
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 7.3(0.4) | 21.5(0.7) | 32.4(0.2) | 24.7(0.6) | 25.97M | config, model, log |
Prev. + Speech | t2v | 7.4(0.1) | 23.1(0.4) | 34.4(0.6) | 22.3(0.6) | 75.66M | config, model, log |
Prev. + Audio | t2v | 7.7(0.3) | 23.7(0.1) | 35.3(0.3) | 21.3(1.2) | 100.47M | config, model, log |
Prev. + Flow | t2v | 14.4(0.1) | 39.1(0.5) | 54.5(0.1) | 9.0(0.0) | 125.48M | config, model, log |
Prev. + RGB | t2v | 14.9(0.6) | 40.7(0.2) | 55.9(0.2) | 8.0(0.0) | 151.27M | config, model, log |
Prev. + Inst. | t2v | 15.8(0.6) | 43.2(0.3) | 58.3(0.4) | 7.0(0.0) | 177.07M | config, model, log |
Prev. + R2P1D | t2v | 18.2(0.5) | 46.4(0.5) | 62.5(1.1) | 6.0(0.0) | 201.68M | config, model, log |
Prev. + OCR | t2v | 18.0(0.6) | 46.9(0.3) | 63.2(0.2) | 6.0(0.0) | 236.06M | config, model, log |
Prev. + Face | t2v | 18.4(0.2) | 47.9(0.5) | 63.6(0.5) | 6.0(0.0) | 260.68M | config, model, log |
Scene | v2t | 6.4(0.2) | 20.4(0.3) | 31.4(0.1) | 25.3(0.6) | 25.97M | config, model, log |
Prev. + Speech | v2t | 6.4(0.2) | 20.9(0.1) | 32.4(0.4) | 24.7(0.6) | 75.66M | config, model, log |
Prev. + Audio | v2t | 7.2(0.2) | 22.7(0.3) | 34.3(0.2) | 22.3(0.6) | 100.47M | config, model, log |
Prev. + Flow | v2t | 13.8(0.3) | 37.9(0.3) | 53.4(0.2) | 9.0(0.0) | 125.48M | config, model, log |
Prev. + RGB | v2t | 14.6(0.6) | 40.0(0.3) | 55.4(0.4) | 8.3(0.6) | 151.27M | config, model, log |
Prev. + Inst. | v2t | 15.4(0.4) | 42.1(0.2) | 58.1(0.7) | 7.7(0.6) | 177.07M | config, model, log |
Prev. + R2P1D | v2t | 17.4(0.4) | 45.7(0.4) | 62.0(0.8) | 6.3(0.6) | 201.68M | config, model, log |
Prev. + OCR | v2t | 17.0(0.2) | 45.8(0.2) | 61.8(0.2) | 6.3(0.6) | 236.06M | config, model, log |
Prev. + Face | v2t | 17.7(0.7) | 46.5(0.5) | 62.7(0.4) | 6.3(0.6) | 260.68M | config, model, log |
Importance of Model Capacity: The next ablation investigates the effect of the size of the shared embedding dimension used by CE.
Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
384 | t2v | 17.5(0.2) | 46.9(0.0) | 63.2(0.2) | 6.0(0.0) | 127.3M | config, model, log |
512 | t2v | 18.0(0.3) | 47.4(0.8) | 63.1(0.4) | 6.0(0.0) | 171.04M | config, model, log |
640 | t2v | 18.2(0.6) | 47.8(1.0) | 63.4(0.9) | 6.0(0.0) | 215.5M | config, model, log |
768 | t2v | 18.2(0.3) | 47.7(0.6) | 63.9(0.5) | 6.0(0.0) | 260.68M | config, model, log |
1024 | t2v | 18.3(0.1) | 48.2(0.8) | 63.3(0.5) | 6.0(0.0) | 353.2M | config, model, log |
384 | v2t | 16.8(0.3) | 45.3(0.1) | 61.9(0.4) | 6.7(0.6) | 127.3M | config, model, log |
512 | v2t | 17.2(0.1) | 45.9(0.7) | 62.0(0.7) | 6.7(0.6) | 171.04M | config, model, log |
640 | v2t | 17.5(0.5) | 46.1(0.5) | 62.4(0.3) | 6.3(0.6) | 215.5M | config, model, log |
768 | v2t | 17.7(0.6) | 46.6(0.7) | 62.8(0.4) | 6.0(0.0) | 260.68M | config, model, log |
1024 | v2t | 17.7(0.1) | 47.0(0.5) | 63.4(0.4) | 6.0(0.0) | 353.2M | config, model, log |
We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the QuerYD dataset.
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
---|---|---|---|---|---|---|---|---|---|---|
Scene | t2v | 8.7(0.4) | 26.3(1.1) | 37.1(0.7) | 68.5(2.2) | 22.2(1.6) | 52.3(3.0) | 20.4(0.1) | 7.51M | config, model, log |
Scene + Inst. | t2v | 11.7(1.4) | 31.6(0.9) | 43.4(1.3) | 74.5(0.9) | 14.0(1.0) | 41.1(2.1) | 25.2(0.8) | 17.25M | config, model, log |
Scene + r2p1d | t2v | 11.7(2.1) | 32.1(3.0) | 45.3(3.3) | 74.6(0.4) | 13.7(1.9) | 42.9(2.2) | 25.7(2.4) | 16.07M | config, model, log |
Scene + Audio | t2v | 7.6(2.7) | 27.4(1.4) | 40.4(0.9) | 69.1(0.9) | 17.0(1.7) | 49.0(1.9) | 20.2(2.3) | 17.25M | config, model, log |
Scene | v2t | 9.1(0.8) | 25.4(0.9) | 35.3(1.5) | 68.2(2.2) | 23.2(0.3) | 52.6(2.6) | 20.1(0.5) | 7.51M | config, model, log |
Scene + Inst. | v2t | 11.9(0.5) | 31.0(3.6) | 43.5(2.7) | 74.8(1.8) | 14.5(0.9) | 40.8(2.1) | 25.2(1.1) | 17.25M | config, model, log |
Scene + r2p1d | v2t | 12.7(1.4) | 30.9(2.8) | 44.0(1.8) | 74.3(1.2) | 14.3(1.2) | 42.8(1.7) | 25.8(1.7) | 16.07M | config, model, log |
Scene + Audio | v2t | 10.1(1.2) | 25.7(1.5) | 37.5(1.2) | 69.8(1.6) | 20.0(1.3) | 48.9(2.0) | 21.3(1.1) | 17.25M | config, model, log |
We can also study their cumulative effect:
Experts | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Geom | Params | Links |
---|---|---|---|---|---|---|---|---|---|---|
Scene | t2v | 8.7(0.4) | 26.3(1.1) | 37.1(0.7) | 68.5(2.2) | 22.2(1.6) | 52.3(3.0) | 20.4(0.1) | 7.51M | config, model, log |
Prev. + Audio | t2v | 7.6(2.7) | 27.4(1.4) | 40.4(0.9) | 69.1(0.9) | 17.0(1.7) | 49.0(1.9) | 20.2(2.3) | 17.25M | config, model, log |
Prev. + Inst. | t2v | 12.7(1.7) | 34.8(1.7) | 47.0(1.3) | 78.0(1.0) | 12.3(0.6) | 37.6(2.1) | 27.5(1.5) | 24.63M | config, model, log |
Prev. + R2P1D | t2v | 14.3(0.3) | 37.5(1.3) | 48.6(0.8) | 78.8(0.3) | 11.3(0.6) | 35.2(1.8) | 29.7(0.3) | 30.82M | config, model, log |
Scene | v2t | 9.1(0.8) | 25.4(0.9) | 35.3(1.5) | 68.2(2.2) | 23.2(0.3) | 52.6(2.6) | 20.1(0.5) | 7.51M | config, model, log |
Prev. + Audio | v2t | 10.1(1.2) | 25.7(1.5) | 37.5(1.2) | 69.8(1.6) | 20.0(1.3) | 48.9(2.0) | 21.3(1.1) | 17.25M | config, model, log |
Prev. + Inst. | v2t | 12.8(1.3) | 33.5(2.8) | 46.6(1.0) | 76.7(1.7) | 11.8(0.8) | 37.6(1.9) | 27.1(0.6) | 24.63M | config, model, log |
Prev. + R2P1D | v2t | 14.0(0.3) | 35.4(2.9) | 47.2(2.8) | 78.7(2.4) | 12.3(1.5) | 35.8(2.4) | 28.6(1.2) | 30.82M | config, model, log |
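The Geom column is not defined in the text. For the rows checked in the sketch below, it agrees with the geometric mean of R@1, R@5 and R@10 after rounding to one decimal place, so that reading is used here as an assumption. Other rows may differ in the last digit, which would be expected if the reported value is averaged over several runs rather than computed from the averaged recalls. The function name is hypothetical.

```python
def geometric_mean(values):
    """Geometric mean of a list of positive numbers."""
    product = 1.0
    for v in values:
        product *= v
    return product ** (1.0 / len(values))


# R@1, R@5, R@10 taken from the "Scene" and "Scene + Inst." t2v rows above.
scene = geometric_mean([8.7, 26.3, 37.1])
scene_inst = geometric_mean([11.7, 31.6, 43.4])
print(round(scene, 1), round(scene_inst, 1))
```

Because a geometric mean is pulled down by its smallest term, this summary rewards models that improve R@1 as well as the easier R@5 and R@10.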