Code Specifics
Ricardo Oliveira edited this page Jun 20, 2022 · 1 revision
The extractors are a small library that we created to normalize how we extract data from the articles fetched from the publishers. They can be found in dags/common/parsing/*_extractors.py. Here are the available extractors:
Name | Description | Extractor Type |
---|---|---|
NestedValueExtractor | Gets the value from a nested source in JSON. (e.g. test.nested.field.value) | JSONExtractor |
TextExtractor | Gets the text present inside the specified tags. | XMLExtractor |
AttributeExtractor | Gets the value of the specified attribute on the specified tag. | XMLExtractor |
ConstantExtractor | Simply returns a fixed value. Used to standardize the usage of the lib. | XMLExtractor |
CustomExtractor | Takes a function as parameter to configure a custom way of getting data. Used for more intricate cases. | XMLExtractor, JSONExtractor |
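The dotted-path lookup described for NestedValueExtractor can be sketched as follows. This is only an illustration: the function `get_nested` and its signature are hypothetical and not the library's actual API, and the sketch assumes the JSON source is a plain dict and that a missing key should yield `None`.

```python
from typing import Any, Optional


def get_nested(source: dict, path: str) -> Optional[Any]:
    """Follow a dotted path such as "test.nested.field.value" through nested dicts.

    Hypothetical helper for illustration only; returns None when any step is missing.
    """
    current: Any = source
    for key in path.split("."):
        # Stop as soon as the path leaves the dict structure or a key is absent.
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


article = {"test": {"nested": {"field": {"value": 42}}}}
found = get_nested(article, "test.nested.field.value")
missing = get_nested(article, "test.missing.value")
print(found)    # 42
print(missing)  # None
```

Returning `None` for a missing path fits the parser's convention of skipping values that are `None`.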
Each of these extractors defines how to extract the data from the source. The parser then puts everything together with the following code:
    def _publisher_specific_parsing(self, article: ET.Element):
        return {
            extractor.destination: value
            for extractor in self.extractors
            if (value := extractor.extract(article)) is not None
        }
This runs every configured extractor against the article and adds a key: value pair to the dict whenever the extracted value is not None. Thanks to this structure, the same parser works without knowing whether the source is XML or JSON; it simply relies on the extractors.
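To make the mechanism concrete, here is a minimal, self-contained sketch of the same pattern on an XML source. The `Sketch*` classes, their constructor parameters, and the `parse` function are simplified, hypothetical stand-ins written for this example; the real extractors in the repository may have different signatures and behavior. The only parts taken from the code above are the `destination` attribute, the `extract` method, and the dict comprehension.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List, Optional


class SketchTextExtractor:
    """Hypothetical stand-in: returns the text of the first matching tag."""

    def __init__(self, destination: str, source: str) -> None:
        self.destination = destination
        self.source = source

    def extract(self, article: ET.Element) -> Optional[str]:
        node = article.find(self.source)
        return node.text if node is not None else None


class SketchAttributeExtractor:
    """Hypothetical stand-in: returns an attribute of the first matching tag."""

    def __init__(self, destination: str, source: str, attribute: str) -> None:
        self.destination = destination
        self.source = source
        self.attribute = attribute

    def extract(self, article: ET.Element) -> Optional[str]:
        node = article.find(self.source)
        return node.get(self.attribute) if node is not None else None


class SketchConstantExtractor:
    """Hypothetical stand-in: ignores the source and returns a fixed value."""

    def __init__(self, destination: str, value: Any) -> None:
        self.destination = destination
        self.value = value

    def extract(self, article: ET.Element) -> Any:
        return self.value


def parse(article: ET.Element, extractors: List[Any]) -> Dict[str, Any]:
    # Same comprehension as the parser: keep only values that are not None.
    return {
        extractor.destination: value
        for extractor in extractors
        if (value := extractor.extract(article)) is not None
    }


article = ET.fromstring('<article><title>Example</title><journal id="J1"/></article>')
extractors = [
    SketchTextExtractor("title", "title"),
    SketchAttributeExtractor("journal_id", "journal", "id"),
    SketchTextExtractor("abstract", "abstract"),  # tag is absent, so this key is skipped
    SketchConstantExtractor("publisher", "Example Publisher"),
]
result = parse(article, extractors)
print(result)
# {'title': 'Example', 'journal_id': 'J1', 'publisher': 'Example Publisher'}
```

Because `parse` only relies on `destination` and `extract`, a JSON-based extractor exposing the same two members could be added to the list without changing the parser.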