Diegetic classifier: takes video as input and classifies the sound in a given clip. Inverted, it could be used for text-to-sound generation, à la sound effects for videos.
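
One way to picture the classifier, as a rough sketch only: the note does not say what the classes are or how the model works, so this assumes the labels are "diegetic" vs "non-diegetic" and swaps in a simple stand-in heuristic (sound whose loudness tracks on-screen motion is treated as diegetic). The names `pearson`, `classify_clip`, the per-frame energy inputs, and the `threshold` value are all hypothetical; a real system would presumably use a learned audio-visual model.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no co-variation signal
    return cov / sqrt(var_x * var_y)


def classify_clip(
    motion_energy: Sequence[float],
    audio_energy: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff, not from the note
) -> str:
    """Label a clip 'diegetic' when audio loudness tracks on-screen motion.

    Both inputs are hypothetical per-frame features extracted upstream:
    visual motion energy and audio loudness, one value per video frame.
    """
    score = pearson(motion_energy, audio_energy)
    return "diegetic" if score >= threshold else "non-diegetic"


if __name__ == "__main__":
    # Footsteps: loudness peaks coincide with motion peaks.
    print(classify_clip([0.1, 0.9, 0.2, 0.8, 0.1], [0.2, 1.0, 0.3, 0.9, 0.2]))
    # Steady background score: loudness does not follow the picture.
    print(classify_clip([0.1, 0.9, 0.2, 0.8, 0.1], [0.5, 0.5, 0.5, 0.5, 0.5]))
```

The "inverted" use in the note would run the other way, from a label or text description to audio; that generative direction is not covered by this heuristic.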