GAIA is an open-source multimedia knowledge extraction system: it takes a massive stream of unstructured, heterogeneous multimedia data from various sources and languages as input and creates a coherent, structured knowledge base that indexes entities, relations, and events.
GAIA classifies knowledge elements into fine-grained types. Given a photo of a flag and people, for example, it can detect the flag and infer that the people are soldiers or politicians of that country. This supports applications such as multimedia news event understanding and recommendation. A minimal sketch of such contextual type refinement follows.
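The sketch below illustrates the idea with a hypothetical `Detection` record and a hand-written `refine_person_type` rule; GAIA's actual fine-grained typing is a trained model, not a rule like this.

```python
from dataclasses import dataclass, field

@dataclass
class Detection:
    label: str                                    # coarse type from an object detector
    attributes: dict = field(default_factory=dict)  # extra evidence, e.g. flag country

def refine_person_type(context: list) -> str:
    """Refine a coarse 'person' type using co-occurring detections in the scene."""
    flags = [d for d in context if d.label == "flag" and "country" in d.attributes]
    if flags:
        # Contextual cue: people pictured with a national flag are likely
        # soldiers or politicians of that country.
        return f"person affiliated with {flags[0].attributes['country']}"
    return "person"

scene = [Detection("person"), Detection("flag", {"country": "Ukraine"})]
print(refine_person_type(scene))  # person affiliated with Ukraine
```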
The system has two branches: a text knowledge extraction (TKE) branch and a visual knowledge extraction (VKE) branch. Each branch takes the same input and initially creates a separate knowledge base (KB); the two KBs are then fused into a coherent multimedia KB.
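A minimal sketch of that two-branch flow, with toy stubs standing in for the real pipelines (`run_tke`, `run_vke`, and `fuse` are placeholder names, not GAIA's API):

```python
def run_tke(documents):
    """Text branch: extract (subject, relation, object) facts from documents."""
    return {("Ukraine", "located_in", "Europe")}      # placeholder output

def run_vke(images):
    """Visual branch: extract entities and facts from images and key frames."""
    return {("flag_001", "instance_of", "flag")}      # placeholder output

def fuse(text_kb, visual_kb):
    """Merge the two single-modality KBs into one multimedia KB; in GAIA this
    step also involves the cross-modal grounding described below."""
    return text_kb | visual_kb

multimedia_kb = fuse(run_tke(["doc1.txt"]), run_vke(["frame1.jpg"]))
print(multimedia_kb)
```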
TKE extracts entities, relations, and events from input documents. It clusters mentions of the same entity through entity linking and coreference, and clusters mentions of the same event through event coreference.
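One common way to realize this clustering step is union-find over mention pairs: mentions linked to the same KB entry, or marked coreferent, end up in one cluster. The sketch below assumes hypothetical mention IDs and linker/coreference outputs; GAIA's actual linker and coreference components are learned models.

```python
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Entity-linking results per mention (None = NIL / unlinkable).
links = {"m1": "Q76", "m2": "Q76", "m3": None, "m4": None}
coref_pairs = [("m3", "m4")]            # pairs a coreference model marked coreferent

by_kb_id = {}
for mention, kb_id in links.items():
    if kb_id is not None:
        if kb_id in by_kb_id:
            union(mention, by_kb_id[kb_id])   # same KB entry -> same cluster
        else:
            by_kb_id[kb_id] = mention
for a, b in coref_pairs:
    union(a, b)                               # coreferent mentions -> same cluster

clusters = {}
for m in links:
    clusters.setdefault(find(m), []).append(m)
print(list(clusters.values()))  # [['m1', 'm2'], ['m3', 'm4']]
```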
VKE takes images and video key frames as input and creates a single, coherent visual KB. Like TKE, it consists of entity extraction, linking, and coreference modules.
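A minimal sketch of visual coreference by embedding similarity, assuming illustrative random features and a greedy threshold rule; GAIA uses trained detectors and embedding models rather than this toy clustering.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_visual_entities(embeddings, threshold=0.8):
    """Greedy clustering: each detection joins the first cluster whose
    representative embedding is similar enough, else starts a new cluster."""
    clusters = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(emb, cluster["rep"]) >= threshold:
                cluster["members"].append(i)
                break
        else:
            clusters.append({"rep": emb, "members": [i]})
    return [c["members"] for c in clusters]

rng = np.random.default_rng(0)
person_a = rng.normal(size=128)
detections = [person_a + rng.normal(scale=0.01, size=128),  # same person, frame 1
              person_a + rng.normal(scale=0.01, size=128),  # same person, frame 2
              rng.normal(size=128)]                          # a different entity
print(cluster_visual_entities(detections))  # [[0, 1], [2]]
```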
We use a visual grounding system to resolve entity coreference across modalities. For each entity mention extracted from text, we feed the mention together with its whole sentence into an ELMo model that extracts contextualized features for the mention, and then compare these features with CNN feature maps of the surrounding images, yielding a relevance score per image. More details are in Section 5 of the paper.
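A minimal numpy sketch of how such a per-image relevance score could be computed, with random vectors standing in for the ELMo mention features and CNN feature maps; the shapes and the attention-pooling choice here are illustrative assumptions, and the actual scoring function is the one described in Section 5 of the paper.

```python
import numpy as np

def relevance_score(mention_emb, feature_map):
    """mention_emb: (d,) contextualized mention embedding (ELMo stand-in).
    feature_map: (H, W, d) CNN feature map of one image.
    Returns a scalar relevance score via attention-weighted pooling."""
    h, w, d = feature_map.shape
    regions = feature_map.reshape(h * w, d)   # one vector per spatial cell
    logits = regions @ mention_emb            # dot-product similarity per region
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                        # softmax over regions
    return float(attn @ logits)               # expected similarity

rng = np.random.default_rng(0)
mention = rng.normal(size=256)                        # e.g. features for "the flag"
images = [rng.normal(size=(7, 7, 256)) for _ in range(3)]
scores = [relevance_score(mention, fm) for fm in images]
best = int(np.argmax(scores))                         # ground mention in top image
print(scores, best)
```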