Focal finetuning: add a loss component that uses a semantic segmentation mask for a class you specifically want to improve, like hands/fingers. Could even just pick a fixed-size window centered on the region that maximizes CLIP similarity to the target class.
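
Rough sketch of what this could look like, not a tested training recipe. It assumes a plain per-pixel MSE as the base loss and adds an extra MSE term averaged over the masked region only. `focal_region_loss`, `best_window_mask`, and `focal_weight` are made-up names for illustration. The mask would come from whatever segmentation model you have, and the score map for the window variant is assumed to be precomputed (for example, per-patch CLIP similarity upsampled to pixel resolution); neither step is shown here.

```python
import numpy as np


def focal_region_loss(pred, target, mask, focal_weight=4.0):
    """Per-pixel MSE plus an extra MSE term restricted to the masked region.

    pred, target: float arrays of shape (H, W, C)
    mask: array of shape (H, W), 1 where the target class (e.g. hands) is present
    focal_weight: hypothetical hyperparameter scaling the masked term
    """
    sq_err = ((pred - target) ** 2).mean(axis=-1)  # (H, W) error per pixel
    base = sq_err.mean()
    mask = mask.astype(sq_err.dtype)
    # Average error over masked pixels only; guard against an empty mask.
    focal = (sq_err * mask).sum() / max(mask.sum(), 1.0)
    return base + focal_weight * focal


def best_window_mask(score_map, window):
    """Binary mask for the fixed-size square window with the highest total score.

    score_map: (H, W) per-pixel relevance, assumed precomputed
    window: side length in pixels, with 1 <= window <= min(H, W)
    """
    h, w = score_map.shape
    # Summed-area table with a zero border, so each window sum is four lookups.
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = score_map.cumsum(axis=0).cumsum(axis=1)
    sums = (sat[window:, window:] - sat[:-window, window:]
            - sat[window:, :-window] + sat[:-window, :-window])
    top, left = np.unravel_index(np.argmax(sums), sums.shape)
    mask = np.zeros((h, w))
    mask[top:top + window, left:left + window] = 1.0
    return mask
```

Either mask plugs into the same loss: `focal_region_loss(pred, target, seg_mask)` for the segmentation variant, or `focal_region_loss(pred, target, best_window_mask(score_map, 64))` for the window variant. With an empty mask the extra term is zero and the loss falls back to the base MSE.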