
Code Specifics

Ricardo Oliveira edited this page Jun 20, 2022 · 1 revision

Extractors

The extractors are a small library that we created to normalize how we extract data from the articles fetched from the publishers. They can be found in dags/common/parsing/*_extractors.py. The available extractors are:

| Name | Description | Extractor Type |
| --- | --- | --- |
| NestedValueExtractor | Gets the value from a nested source in JSON (e.g. test.nested.field.value). | JSONExtractor |
| TextExtractor | Gets the text present inside the specified tags. | XMLExtractor |
| AttributeExtractor | Gets the value of the specified attribute of the specified tag. | XMLExtractor |
| ConstantExtractor | Simply returns a fixed value. Used to standardize the usage of the lib. | XMLExtractor |
| CustomExtractor | Takes a function as a parameter to configure a custom way of getting data. Used for more intricate cases. | XMLExtractor, JSONExtractor |
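
To make the XML entries above more concrete, here is a minimal sketch of what three of these extractors could look like. The constructor parameters (`source`, `attribute`, `value`, `extraction_function`) and the sample article are assumptions made for illustration and may not match the real classes in dags/common/parsing; only the `destination` attribute and the `extract` method are taken from the parser code on this page.

```python
import xml.etree.ElementTree as ET
from typing import Any, Callable, Optional


class AttributeExtractor:
    """Hypothetical sketch: reads one attribute of the first matching tag."""

    def __init__(self, destination: str, source: str, attribute: str):
        self.destination = destination  # key used in the parsed output
        self.source = source            # tag (or path) to look up
        self.attribute = attribute      # attribute to read on that tag

    def extract(self, article: ET.Element) -> Optional[str]:
        node = article.find(self.source)
        # None signals "nothing found" when the tag or attribute is missing.
        return node.get(self.attribute) if node is not None else None


class ConstantExtractor:
    """Hypothetical sketch: always returns the same value."""

    def __init__(self, destination: str, value: Any):
        self.destination = destination
        self.value = value

    def extract(self, article: Any) -> Any:
        return self.value


class CustomExtractor:
    """Hypothetical sketch: delegates the extraction to a user function."""

    def __init__(self, destination: str, extraction_function: Callable[[Any], Any]):
        self.destination = destination
        self.extraction_function = extraction_function

    def extract(self, article: Any) -> Any:
        return self.extraction_function(article)


# Made-up sample article, for illustration only.
article = ET.fromstring(
    '<article><link href="https://example.org/paper.pdf"/>'
    "<author>Ada</author><author>Grace</author></article>"
)

link = AttributeExtractor("pdf_link", "link", "href").extract(article)
language = ConstantExtractor("language", "en").extract(article)
authors = CustomExtractor(
    "authors", lambda a: [node.text for node in a.findall("author")]
).extract(article)
```

Because every class exposes the same `destination` and `extract` pair, even a fixed value can be handled like any other field, which seems to be the point of having a ConstantExtractor at all.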

Each of these extractors defines how to extract one piece of data from the source. The parser then puts everything together with the following code:

def _publisher_specific_parsing(self, article: ET.Element):
  return {
    extractor.destination: value
    for extractor in self.extractors
    if (value := extractor.extract(article)) is not None
  }

This runs every configured extractor against the article and stores the result under the extractor's destination key in a dict, skipping any extractor that returns None. With this structure, the same parser can be used without knowing whether the source is XML or JSON: the format-specific logic lives entirely in the extractors.
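
As a rough illustration of that last point, the sketch below runs the same dict comprehension over an XML article and a JSON article. The two extractor classes are simplified stand-ins written for this example (their constructor arguments are assumptions), and `parse` is a hypothetical free-function version of `_publisher_specific_parsing`.

```python
import xml.etree.ElementTree as ET
from typing import Any, Optional


class TextExtractor:
    """Simplified stand-in: text of the first matching XML tag."""

    def __init__(self, destination: str, source: str):
        self.destination = destination
        self.source = source

    def extract(self, article: ET.Element) -> Optional[str]:
        node = article.find(self.source)
        return node.text if node is not None else None


class NestedValueExtractor:
    """Simplified stand-in: follows a dotted path through nested dicts."""

    def __init__(self, destination: str, json_path: str):
        self.destination = destination
        self.json_path = json_path

    def extract(self, article: dict) -> Any:
        value: Any = article
        for key in self.json_path.split("."):
            # Stop as soon as the path cannot be followed any further.
            if not isinstance(value, dict) or key not in value:
                return None
            value = value[key]
        return value


def parse(article: Any, extractors: list) -> dict:
    # Same comprehension as the parser method: keep only non-None values.
    return {
        extractor.destination: value
        for extractor in extractors
        if (value := extractor.extract(article)) is not None
    }


# Made-up sample data, for illustration only.
xml_article = ET.fromstring("<article><title>Sample title</title></article>")
json_article = {"test": {"nested": {"field": {"value": 42}}}}

xml_result = parse(
    xml_article,
    [TextExtractor("title", "title"), TextExtractor("abstract", "abstract")],
)
json_result = parse(
    json_article,
    [
        NestedValueExtractor("answer", "test.nested.field.value"),
        NestedValueExtractor("missing", "test.unknown"),
    ],
)
```

In this sketch, the `abstract` and `missing` keys are absent from the results because their extractors return None, while `parse` itself never inspects the article format.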
