Code Specifics

Extractors

The extractors are a small library that we created to normalize how we extract data from the articles fetched from the publishers. They can be found in dags/common/parsing/*_extractors.py. The available extractors are:

Name	Description	Extractor Type
NestedValueExtractor	Gets the value from a nested source in JSON. (e.g. test.nested.field.value)	JSONExtractor
TextExtractor	Gets the text present inside the specified tags.	XMLExtractor
AttributeExtractor	Gets the value present inside the specified attribute of the specified tag.	XMLExtractor
ConstantExtractor	Simply returns a constant value. Used to standardize the usage of the lib.	XMLExtractor
CustomExtractor	Takes a function as parameter to configure a custom way of getting data. Used for more intricate cases.	XMLExtractor, JSONExtractor
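
To make the dotted-path idea behind NestedValueExtractor concrete, here is a minimal sketch. It is not the code from dags/common/parsing; the constructor arguments (destination, path) and the sample article are assumptions made for illustration. Only the extract method and the destination attribute are taken from the parser code on this page.

```python
from typing import Any, Optional


class NestedValueExtractor:
    """Hypothetical sketch: walks a dotted path through nested dicts."""

    def __init__(self, destination: str, path: str):
        self.destination = destination  # key used in the parsed output
        self.path = path.split(".")     # "a.b.c" -> ["a", "b", "c"]

    def extract(self, article: dict) -> Optional[Any]:
        current: Any = article
        for key in self.path:
            # Give up as soon as a level is missing or is not a dict.
            if not isinstance(current, dict) or key not in current:
                return None
            current = current[key]
        return current


# Made-up article used only to exercise the sketch.
article = {"test": {"nested": {"field": {"value": "10.1000/example"}}}}

found = NestedValueExtractor("doi", "test.nested.field.value").extract(article)
missing = NestedValueExtractor("title", "test.nested.title").extract(article)
```

In this sketch, a path that cannot be followed yields None instead of raising, which lets the caller decide whether a missing field matters.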

Each of these extractors defines how to extract the data from the source. The parser then puts everything together in the following code:

def _publisher_specific_parsing(self, article: ET.Element):
  return {
    extractor.destination: value
    for extractor in self.extractors
    if (value := extractor.extract(article)) is not None
  }

This runs every configured extractor and stores each result in a dict under the extractor's destination key, skipping any extractor that returns None. Because the parser only talks to the extractors, the same parser can be used without knowing whether the source is XML or JSON.
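
The following sketch shows how the parser and two XML extractors might fit together. The extractor classes here are simplified stand-ins written for this example, not the real implementations: their constructor arguments, the Parser class name, and the sample XML are all assumptions. Only _publisher_specific_parsing mirrors the code above.

```python
import xml.etree.ElementTree as ET
from typing import Optional


class TextExtractor:
    """Hypothetical stand-in: returns the text of the first matching tag."""

    def __init__(self, destination: str, source: str):
        self.destination = destination
        self.source = source  # ElementTree path, relative to the article

    def extract(self, article: ET.Element) -> Optional[str]:
        node = article.find(self.source)
        return None if node is None else node.text


class AttributeExtractor:
    """Hypothetical stand-in: returns one attribute of the first matching tag."""

    def __init__(self, destination: str, source: str, attribute: str):
        self.destination = destination
        self.source = source
        self.attribute = attribute

    def extract(self, article: ET.Element) -> Optional[str]:
        node = article.find(self.source)
        return None if node is None else node.get(self.attribute)


class Parser:
    def __init__(self, extractors):
        self.extractors = extractors

    def _publisher_specific_parsing(self, article: ET.Element):
        # Same comprehension as above: None results are left out.
        return {
            extractor.destination: value
            for extractor in self.extractors
            if (value := extractor.extract(article)) is not None
        }


# Made-up article: it has a title and a license link, but no abstract.
article = ET.fromstring(
    "<article><title>Example</title>"
    "<license href='https://example.org/cc-by'/></article>"
)
parser = Parser([
    TextExtractor("title", "title"),
    TextExtractor("abstract", "abstract"),
    AttributeExtractor("license_url", "license", "href"),
])
parsed = parser._publisher_specific_parsing(article)
# parsed == {"title": "Example", "license_url": "https://example.org/cc-by"}
```

The abstract key is absent from the result because its extractor returned None, so downstream code can tell a field that was not found from one that was found.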
