Package : Extractor : Feeds Class

Nuwan Waidyanatha edited this page Sep 2, 2023 · 1 revision

Introduction

The extractor module handles data extraction from external data sources. It uses sparkNoSQLwls to store and retrieve data feed information. An object may use the read and write functions to interact with the NoSQL database to:

  1. write (insert or update) a new realm- and context-specific data feed URI to a module- and entity-specific NoSQL DB
  2. get a realm- and context-specific data feed URI from the module- and entity-specific NoSQL DB
  3. build a list of data feeds using a dictionary of parameters to insert parametric values into the query placeholders
  4. read the data using the data feed information through scraping, API calls, or direct URL-based downloads
  5. convert the received data into a desired data type object such as DataFrame (spark), DataFrame (pandas), array, list, dict, or string

Specifications

Data Model

The data is structured as JSON or a dict. The data model builds on the concepts of Source, Realm, Context, URI, and Get key/value pair elements. See Appendix A for an example.

id

A unique identifier assigned by the NoSQL DB to the respective data feed. It is a hexadecimal string constructed by the NoSQL engine using the ObjectId() feature. You may use the same method to define an ObjID yourself; otherwise one is created automatically when the data feed is inserted into the NoSQL DB.

Example

    "_id" : ObjectId("638740c7d8ab1a48899bea90")

Source

Defines information about the data source. Use this section to include a dictionary of key/value pairs describing any qualifications and restrictions that apply to the data source. Such information might cover the:

  1. owner, describing the data owner with whom any contract granting permission to retrieve the data is held
  2. supplier, who actually makes the data available for extraction
  3. dates, describing the effective dates during which data retrieved from the data feed is valid
  4. contract, holding information such as a contractual time period for keeping the data (e.g., privacy data limitations)

Example

        "source" : {
            "owner" : "Expedia",   # data owner unique identifier (i.e., legal entity name)
            "supplier" : 'kayak.com',
            "dates": {
                "activated" :'2023-07-07', # optional date the data source becomes active
                "expires":'2024-07-06',     # optional date the data source becomes inactive
            },
            "contract" : {
                "anonymize":{
                    "protocol" : 'SHA256',
                    "attributes":['first_name','last_name','address'],
                },
                "from_date":'2023-08-01',
                "to_date" : '2023-09-30',
            }
        }
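
To illustrate how the dates and contract entries might be applied, here is a minimal sketch that checks whether a source is active on a given day and replaces the attributes listed under anonymize with SHA-256 digests. The helper names is_source_active and anonymize_record are hypothetical and not part of the package; the sketch also assumes the dates are ISO formatted (YYYY-MM-DD), as in the example above.

```python
import hashlib
from datetime import date

source = {
    "owner": "Expedia",
    "supplier": "kayak.com",
    "dates": {"activated": "2023-07-07", "expires": "2024-07-06"},
    "contract": {
        "anonymize": {
            "protocol": "SHA256",
            "attributes": ["first_name", "last_name", "address"],
        },
        "from_date": "2023-08-01",
        "to_date": "2023-09-30",
    },
}


def is_source_active(source: dict, on: date) -> bool:
    """Hypothetical helper: True if `on` lies within the activated/expires window."""
    dates = source.get("dates", {})
    # Both dates are optional, so fall back to an open-ended window
    start = date.fromisoformat(dates["activated"]) if dates.get("activated") else date.min
    end = date.fromisoformat(dates["expires"]) if dates.get("expires") else date.max
    return start <= on <= end


def anonymize_record(record: dict, source: dict) -> dict:
    """Hypothetical helper: hash the attributes named in the contract."""
    rule = source.get("contract", {}).get("anonymize", {})
    hashed = dict(record)
    if rule.get("protocol", "").upper() != "SHA256":
        return hashed  # this sketch only handles SHA256
    for attr in rule.get("attributes", []):
        if attr in hashed:
            hashed[attr] = hashlib.sha256(str(hashed[attr]).encode("utf-8")).hexdigest()
    return hashed
```

Attributes not named in the contract pass through unchanged, so a record can be anonymized before it is stored without altering the rest of its fields.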

Context

A dictionary of optional key/value pairs to categorize the data feed. This taxonomy is useful for filtering the data feeds programmatically. While the source describes the data feed itself, this section exists purely to provide a taxonomy for the business use cases.

Example

        "context": {
            "summary":'scraping kayak.com airline booking data for HERO',
            "country":'any',
            "destination":'any',
            "origin" : 'any'
            }
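
As a sketch of the programmatic filtering mentioned above, the following keeps only the feeds whose context contains every requested key/value pair. The helper filter_by_context is hypothetical, and it compares values literally, so a value such as 'any' is treated as an ordinary string rather than a wildcard.

```python
def filter_by_context(feeds: list, wanted: dict) -> list:
    """Hypothetical helper: keep feeds whose context matches every wanted pair."""
    return [
        feed
        for feed in feeds
        if all(feed.get("context", {}).get(key) == value for key, value in wanted.items())
    ]


feeds = [
    {"context": {"summary": "airline bookings", "country": "canada", "scope": "national"}},
    {"context": {"summary": "hotel bookings", "country": "any"}},
]

# Only the first feed carries "country": "canada"
canadian = filter_by_context(feeds, {"country": "canada"})
```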

Realm

The realm has a set of predefined mandatory keys. The value of each key is used to generate the DB name and collection name under which the data feeds are stored as documents.

Note: all values are restricted to alphanumeric characters; any special characters are stripped.

  1. module - a distinct string that relates to the functional class or module for which the data feed was primarily introduced and used
    • e.g., "module" : 'OTA'
    • The lowercase value is used as the leading prefix of the NoSQL database name, unless an alternate database name is specified
  2. entity - a distinct string that may relate to the sector or class of data
    • e.g., "entity" : 'Transport'
    • The lowercase value is used as the second prefix of the NoSQL database name, following the leading prefix, unless otherwise specified.
  3. package - a distinct string that may relate to a functional class of the module and entity
    • e.g., "package" : 'Airline'
    • The lowercase value is used as the leading prefix of the NoSQL database collection name.
  4. function - a distinct string that may relate to the function (scope) associated with the package class
    • e.g., "function":'Booking'
    • The lowercase value is used as the second prefix of the NoSQL database collection name, following the leading prefix, unless otherwise specified.

URI

The core information required for extracting the data. It has a set of predefined elements:

  1. uri is a list comprising one or more dictionaries of feed information
    • e.g., "uri" : [ { <first feed info block> }, { <second feed info block> }, ..., { <last feed info block> } ]
  2. urn [optional] is a Uniform Resource Name, usually registered with IANA
  3. protocol defines whether it is FTP, HTTP, HTTPS, telnet, scp, and so on, so that any protocol-specific login can be applied
    • e.g., "protocol" : 'https', the domain will be prefixed with https://
  4. domain specifies the URL or the absolute path of the root directory
    • e.g., "domain" : 'kayak.com' (excluding the protocol; if included, the protocol will be stripped before saving)
  5. port [optional] defines the port number, if one must be appended to the domain
    • e.g., "port" : '9090' will be appended to form https://kayak.com:9090
  6. path [optional] defines any extensions to the domain, or the directory path relative to the root
    • e.g., "path" : ['flights','international'] will be appended to the domain, in the listed order, to form https://kayak.com:9090/flights/international/
  7. query [optional] comprises any parametric values to be used in filtering the data from the source
    • expression - defines the query with placeholders
      • e.g., "expression":'{arrivalPort}-{departurePort}/{flightDate}/1adults?a&fs=cfc=1;bfc=1;transportation=transportation_plane'
    • parameter - defines the data type of each placeholder in the expression
      • e.g.,
          "parameter" :{
              "arrivalPort" : 'string',
              "departurePort":'string',
              "flightDate" : 'date'
          }
        
  8. fragments are additional strings, preceded by # tags, that define the page
    • e.g., "fragment":'number' will be appended to the URL to form https://kayak.com:9090/flights/#number
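
To show how these elements combine, here is a minimal sketch that assembles a single URL from one feed info block and a dictionary of parameter values. The helper build_url is hypothetical and is not the package's implementation; it fills the placeholders with str.format, does not URL-encode values, and does not check them against the declared parameter types.

```python
def build_url(feed: dict, params: dict) -> str:
    """Hypothetical helper: assemble one URL from a feed info block."""
    url = f"{feed['protocol']}://{feed['domain']}"
    if feed.get("port"):
        url += f":{feed['port']}"
    path = feed.get("path") or []
    if isinstance(path, str):  # tolerate a single string as well as a list
        path = [path]
    for part in path:
        url += "/" + part.strip("/")
    url += "/"
    query = feed.get("query") or {}
    if query.get("expression"):
        # Every declared parameter needs a value before the expression is filled in
        missing = set(query.get("parameter", {})) - set(params)
        if missing:
            raise KeyError(f"missing parameter values: {sorted(missing)}")
        url += query["expression"].format(**params)
    if feed.get("fragment"):
        url += "#" + feed["fragment"]
    return url


feed = {
    "protocol": "https",
    "domain": "kayak.com",
    "port": "9090",
    "path": ["flights", "international"],
    "query": {
        "expression": "{arrivalPort}-{departurePort}/{flightDate}/1adults",
        "parameter": {"arrivalPort": "string", "departurePort": "string", "flightDate": "date"},
    },
}
url = build_url(feed, {"arrivalPort": "YYZ", "departurePort": "YVR", "flightDate": "2023-09-02"})
```

Calling build_url once per combination of parameter values yields the list of concrete data feeds to read.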

Get

Characterizes the behavior for receiving the data:

  1. method defines how the data is retrieved, such as wget, beautifulsoup, scp, and so on
  2. object defines the data type of the received object, such as json, csv, txt, and so on

Example

        "get":{
            "method":'download',
            "object":'json'
        } 
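
As an illustration of how the object value might drive the handling of received data, the sketch below parses downloaded text according to get['object']. The helper parse_payload is hypothetical, and it covers only the json, csv, and txt types named above.

```python
import csv
import io
import json


def parse_payload(raw: str, get: dict):
    """Hypothetical helper: turn downloaded text into a Python object."""
    kind = get.get("object", "txt").lower()
    if kind == "json":
        return json.loads(raw)
    if kind == "csv":
        # One dictionary per row, keyed by the header line
        return list(csv.DictReader(io.StringIO(raw)))
    if kind == "txt":
        return raw
    raise ValueError(f"unsupported object type: {kind}")


get = {"method": "download", "object": "json"}
data = parse_payload('{"price": 120, "currency": "CAD"}', get)
```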

Package Functions

The package specification describes the class of functions to use in:

  1. writing data feed information
  2. getting data feed information
  3. building data feeds with parametric values
  4. reading data for each of the feeds
  5. converting to a data type object

write feed information to db

    write_feeds_to_nosql(
        feed_list:list = [],   # provide a non-empty list of dictionaries comprising source, realm, context, uri, and get dictionaries
        **kwargs,   # optional for changing the default behavior of the function (see function description in code)
    ) -> list:
  • feed_list must comply with the data model
  • returns a list of dictionaries comprising NoSQL database and collection names to which the data feeds were saved
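
Because feed_list must comply with the data model, it can help to check each feed before writing it. The sketch below is one possible check; the helper validate_feed is hypothetical, and treating all five sections as required is an assumption based on the parameter description above.

```python
REQUIRED_SECTIONS = ("source", "realm", "context", "uri", "get")
REQUIRED_REALM_KEYS = ("module", "entity", "package", "function")


def validate_feed(feed: dict) -> list:
    """Hypothetical helper: return a list of problems; empty means none were found."""
    problems = [f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in feed]
    realm = feed.get("realm", {})
    problems += [f"missing realm key: {key}" for key in REQUIRED_REALM_KEYS if not realm.get(key)]
    uri = feed.get("uri")
    if uri is not None and (not isinstance(uri, list) or not uri):
        problems.append("uri must be a non-empty list of feed info blocks")
    return problems


feed = {
    "source": {"owner": "kayak.com"},
    "realm": {"module": "OTA", "entity": "Transport", "package": "Airline", "function": "Booking"},
    "context": {"country": "canada"},
    "uri": [{"protocol": "https", "domain": "kayak.com"}],
    "get": {"method": "download", "object": "json"},
}
problems = validate_feed(feed)
```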

get feed information from db

    read_feeds_to_list(
        db_name = 'string',
        coll_list=[],
        realm = [], 
        context={},
        **kwargs,
    ) -> list:
  • db_name [mandatory] the precise, non-empty database name in which to search the collections and documents
  • coll_list [optional] a list of precise, non-empty collection names; alternatively, define a substring in kwargs. If neither is given, feed information is read from all collections in the database
  • realm [optional] a list of dictionaries comprising module, entity, package, and function key/value pairs to further filter the retrieved documents by those values
  • context [optional] a dict of key/value pairs to further filter the documents by the context values
  • kwargs [optional] the keys HASINNAME or DOCHASINNAME filter the collections or documents by those values
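
The order of precedence for choosing collections, as described for coll_list, can be sketched in memory as follows. The helper select_collections is hypothetical, and reading HASINNAME as a substring match on collection names is an assumption.

```python
from typing import Optional


def select_collections(
    available: list,
    coll_list: Optional[list] = None,
    has_in_name: Optional[str] = None,
) -> list:
    """Hypothetical helper: pick the collections to read feeds from."""
    if coll_list:
        # Precise names take priority
        return [name for name in available if name in coll_list]
    if has_in_name:
        # Otherwise fall back to a substring match (assumed HASINNAME behavior)
        return [name for name in available if has_in_name in name]
    # With no filter, every collection in the database is read
    return list(available)


available = ["airline_booking", "airline_schedule", "hotel_booking"]
exact = select_collections(available, coll_list=["hotel_booking"])
by_substring = select_collections(available, has_in_name="airline")
```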

build data feeds with parametric values

TBD

read data for each feed

TBD

convert data to object

TBD

Appendix A - Example Feed Dictionary

[
    {
        "id" : 'ObjID(9876)',
        "source" : {
            "owner" : "kayak.com",   # data owner unique identifier (i.e., legal entity name)
            "dates": {
                "activated" :'2023-07-07', # optional date the data source is active
                "expires":'2024-07-06',     # and inactive period
            }},
        "context": {
            "summary":'scraping kayak.com airline booking data for HERO', # any set of key value pairs to
            "country":'canada',"scope" : 'national',   # describe, identify, and distinguish the data feed
            },
        "realm":{
            "module" : 'etl', # a unique realm name, db name prefix
            "entity" : 'loader', # db name second prefix
            "package" : 'Airline',   # collection prefix
            "function":'Booking',
        },
        "uri":
        [
            {
            "urn" : "", # urn:ota:transport:airline:booking (IANA)
            "protocol":'https',   # FTP, FTPS, HTTP, TELNET
            "domain" : 'kayak.com',  # https://kayak.com/flights/
            "port" : '',
            "path" : 'flights',
            "query": {
                "expression":'{arrivalPort}-{departurePort}/{flightDate}/1adults?a&fs=cfc=1;bfc=1;transportation=transportation_plane',
                "parameter" :{
                    "arrivalPort" : 'string',
                    "departurePort":'string',
                    "flightDate" : 'date'
                }
            },
            "fragment":''  # https://kayak.com/flights/#number
            },
            {
            "urn" : "", # urn:ota:transport:airline:booking (IANA)
            "protocol":'https',   # FTP, FTPS, HTTP, TELNET
            "domain" : 'kayak.com',  # https://kayak.com/flights/
            "port" : '',
            "path" : 'flights',
            "query": {
                "expression":'{arrivalPort}-{departurePort}/{flightDate}/1adults?a&fs=cfc=1;bfc=1;transportation=transportation_plane',
                "parameter" :{
                    "arrivalPort" : 'string',
                    "departurePort":'string',
                    "flightDate" : 'date'
                }
            },
            "fragments":[]             
            }
        ],
        "get":{
            "method":'download',
            "object":'json'
        } 

    }
]