focal finetuning

add a loss component that uses a semantic segmentation mask for a class you specifically want to improve, like hands/fingers, so errors inside that region count for more during finetuning.
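
rough sketch of what that loss term could look like. assumptions (none of this is specified above): the base loss is per-pixel mse, the mask is 1 inside the target class and 0 elsewhere, and `focal_finetune_loss` / `focal_weight` are made-up names. plain python lists so it runs without dependencies; in practice this would be tensor ops in whatever training framework is in use.

```python
def focal_finetune_loss(pred, target, mask, focal_weight=1.0):
    """Base MSE over the whole image plus an extra MSE term restricted to the mask.

    pred, target, mask: equally sized 2D lists of numbers.
    mask: 1 inside the targeted segmentation class (e.g. hands), 0 elsewhere.
    focal_weight: hypothetical knob for how much the masked region is emphasized.
    """
    total_err = 0.0   # squared error summed over every pixel
    masked_err = 0.0  # squared error summed over masked pixels only
    n_pixels = 0
    n_masked = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            err = (p - t) ** 2
            total_err += err
            masked_err += m * err
            n_pixels += 1
            n_masked += m
    base = total_err / n_pixels
    # Normalize by mask area so small regions (fingers) are not drowned out.
    # If the class is absent from this image, the extra term is simply zero.
    focal = masked_err / n_masked if n_masked > 0 else 0.0
    return base + focal_weight * focal
```

normalizing the extra term by mask area instead of image area is a design choice here: it keeps the penalty from shrinking when the targeted region is only a few pixels wide.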

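could even skip the segmentation model: just pick a fixed-size window centered on the region that maximizes CLIP similarity for the target class (e.g. a "hand" prompt), and focus the loss there.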
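
rough sketch of the window search. assumptions: `score_fn` is a hypothetical callable standing in for "CLIP similarity between this crop and a prompt for the target class" (no real CLIP call is made here), the search is a brute-force sliding window, and `best_window` / `window_mask` are made-up names. plain python lists again, just to show the logic.

```python
def best_window(image, win_h, win_w, score_fn, stride=1):
    """Return (top, left) of the fixed-size window whose crop scores highest.

    image: 2D list of pixel values.
    score_fn: hypothetical scorer for a crop, e.g. CLIP image-text similarity.
    """
    height, width = len(image), len(image[0])
    best = None
    for top in range(0, height - win_h + 1, stride):
        for left in range(0, width - win_w + 1, stride):
            crop = [row[left:left + win_w] for row in image[top:top + win_h]]
            score = score_fn(crop)
            # Strict comparison: on ties the first window found is kept.
            if best is None or score > best[0]:
                best = (score, top, left)
    if best is None:
        raise ValueError("window is larger than the image")
    _, top, left = best
    return top, left


def window_mask(height, width, top, left, win_h, win_w):
    """Binary mask that is 1 inside the chosen window and 0 elsewhere."""
    return [
        [1 if top <= y < top + win_h and left <= x < left + win_w else 0
         for x in range(width)]
        for y in range(height)
    ]
```

the resulting box mask could stand in for a segmentation mask in a region-weighted loss. scoring every window with a real CLIP model would be expensive, so a larger `stride` or a coarse-to-fine search would probably be needed in practice.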