Step 6: Read Data

Describing schemas is good and all, but at some point we have to start reading data! So let's get to work. But first, let's describe what we're about to do:

The HttpStream superclass, as described in the concepts documentation, facilitates reading data from HTTP endpoints. It contains built-in functions or helpers for:

  • authentication
  • pagination
  • handling rate limiting or transient errors
  • and other useful functionality

In order for it to be able to do this, we have to provide it with a few inputs:

  • the URL base and path of the endpoint we'd like to hit
  • how to parse the response from the API
  • how to perform pagination

Optionally, we can provide additional inputs to customize requests:

  • request parameters and headers
  • how to recognize rate limit errors, and how long to wait (by default it retries 429 and 5XX errors using exponential backoff)
  • HTTP method and request body if applicable
  • how to configure the exponential backoff policy

Backoff policy options:

  • retry_factor: the factor for the exponential backoff policy (default: 5)
  • max_retries: the maximum number of retries for the backoff policy (default: 5)
  • raise_on_http_errors: if set to False, opts out of raising an exception on HTTP error codes (default: True)
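
To get a feel for how retry_factor and max_retries interact, here is a minimal standalone sketch. It assumes the wait before each retry is retry_factor * 2 ** attempt with no jitter. That formula and the helper name backoff_delays are assumptions made for illustration; the schedule the CDK actually uses may differ.

```python
from typing import List


def backoff_delays(retry_factor: float = 5, max_retries: int = 5) -> List[float]:
    """Return the assumed wait (in seconds) before each retry.

    Assumption: delay = retry_factor * 2 ** attempt, without jitter.
    """
    return [retry_factor * 2 ** attempt for attempt in range(max_retries)]


if __name__ == "__main__":
    # With the defaults listed above (retry_factor=5, max_retries=5)
    print(backoff_delays())  # [5, 10, 20, 40, 80]
```

Under this assumption, lowering retry_factor shortens every wait proportionally, while lowering max_retries only removes the longest waits at the end.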

There are many other customizable options - you can find them in the airbyte_cdk.sources.streams.http.HttpStream class.

So in order to read data from the exchange rates API, we'll fill out the necessary information for the stream to do its work. First, we'll implement a basic read that just reads the last day's exchange rates, then we'll implement incremental sync using stream slicing.

Let's begin by pulling data for the last day's rates by using the /latest endpoint:

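from typing import Any, Iterable, Mapping, MutableMapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class ExchangeRates(HttpStream):
    url_base = "https://api.exchangeratesapi.io/"

    primary_key = None

    def __init__(self, base: str, **kwargs):
        # Forward the remaining keyword arguments (such as the authenticator) to HttpStream
        super().__init__(**kwargs)
        self.base = base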


    def path(
        self, 
        stream_state: Mapping[str, Any] = None, 
        stream_slice: Mapping[str, Any] = None, 
        next_page_token: Mapping[str, Any] = None
    ) -> str:
        # The "/latest" path gives us the latest currency exchange rates
        return "latest"  

    def request_params(
            self,
            stream_state: Mapping[str, Any],
            stream_slice: Mapping[str, Any] = None,
            next_page_token: Mapping[str, Any] = None,
    ) -> MutableMapping[str, Any]:
        # The api requires that we include the base currency as a query param so we do that in this method
        return {'base': self.base}

    def parse_response(
            self,
            response: requests.Response,
            stream_state: Mapping[str, Any],
            stream_slice: Mapping[str, Any] = None,
            next_page_token: Mapping[str, Any] = None,
    ) -> Iterable[Mapping]:
        # The response is a simple JSON whose schema matches our stream's schema exactly, 
        # so we just return a list containing the response
        return [response.json()]

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        # The API does not offer pagination, 
        # so we return None to indicate there are no more pages in the response
        return None

This may look big, but that's just because there are lots of (unused, for now) parameters in these methods (those can be hidden with Python's **kwargs, but don't worry about it for now). Really we just added a few lines of "significant" code:

  1. Added a constructor __init__ which stores the base currency to query for.
  2. return {'base': self.base} to add the ?base=<base-value> query parameter to the request based on the base input by the user.
  3. return [response.json()] to parse the response from the API to match the schema of our schema .json file.
  4. return "latest" to indicate that we want to hit the /latest endpoint of the API to get the latest exchange rate data.
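
To make the combined effect of url_base, path and request_params concrete, the sketch below assembles the request URL by hand using only the standard library. The helper build_url is hypothetical and is not part of the CDK; it only illustrates what the resulting URL should look like.

```python
from urllib.parse import urlencode, urljoin


def build_url(url_base: str, path: str, params: dict) -> str:
    """Join the base URL and path, then append the query string (hypothetical helper)."""
    url = urljoin(url_base, path)
    return f"{url}?{urlencode(params)}" if params else url


if __name__ == "__main__":
    print(build_url("https://api.exchangeratesapi.io/", "latest", {"base": "USD"}))
    # https://api.exchangeratesapi.io/latest?base=USD
```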

Let's also pass the base parameter input by the user to the stream class:

def streams(self, config: Mapping[str, Any]) -> List[Stream]:
    auth = NoAuth()
    return [ExchangeRates(authenticator=auth, base=config['base'])]

We're now ready to query the API!

To do this, we'll need a ConfiguredCatalog. We've prepared one here -- download this and place it in sample_files/configured_catalog.json. Then run:

 python main.py read --config sample_files/config.json --catalog sample_files/configured_catalog.json

You should see some output lines, one of which is a record from the API:

{"type": "RECORD", "record": {"stream": "exchange_rates", "data": {"base": "USD", "rates": {"GBP": 0.7196938353, "HKD": 7.7597848573, "IDR": 14482.4824162185, "ILS": 3.2412081092, "DKK": 6.1532478279, "INR": 74.7852709971, "CHF": 0.915763343, "MXN": 19.8439387671, "CZK": 21.3545717832, "SGD": 1.3261894911, "THB": 31.4398014067, "HRK": 6.2599917253, "EUR": 0.8274720728, "MYR": 4.0979726934, "NOK": 8.3043442284, "CNY": 6.4856433595, "BGN": 1.61836988, "PHP": 48.3516756309, "PLN": 3.770872983, "ZAR": 14.2690111709, "CAD": 1.2436905254, "ISK": 124.9482829954, "BRL": 5.4526272238, "RON": 4.0738932561, "NZD": 1.3841125362, "TRY": 8.3101365329, "JPY": 108.0182043856, "RUB": 74.9555647497, "KRW": 1111.7583781547, "USD": 1.0, "AUD": 1.2840711626, "HUF": 300.6206040546, "SEK": 8.3829540753}, "date": "2021-04-26"}, "emitted_at": 1619498062000}}
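
If you want to inspect such a line programmatically, a small sketch like the following can help. It parses a trimmed copy of the record above (most currencies are omitted for brevity) and pulls out the payload. The helper record_data is hypothetical and only relies on the message shape visible in the output.

```python
import json
from typing import Any, Mapping, Optional

# A trimmed copy of the RECORD line shown above (most currencies omitted).
LINE = (
    '{"type": "RECORD", "record": {"stream": "exchange_rates", '
    '"data": {"base": "USD", "rates": {"GBP": 0.7196938353, "EUR": 0.8274720728}, '
    '"date": "2021-04-26"}, "emitted_at": 1619498062000}}'
)


def record_data(output_line: str) -> Optional[Mapping[str, Any]]:
    """Return the record payload if the line is a RECORD message, else None."""
    message = json.loads(output_line)
    if message.get("type") != "RECORD":
        return None
    return message["record"]["data"]


if __name__ == "__main__":
    data = record_data(LINE)
    print(data["date"], data["rates"]["EUR"])  # 2021-04-26 0.8274720728
```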

There we have it - a stream which reads data in just a few lines of code!

We theoretically could stop here and call it a connector. But let's give adding incremental sync a shot.

Adding incremental sync

To add incremental sync, we'll do a few things:

  1. Pass the start_date param input by the user into the stream.
  2. Declare the stream's cursor_field.
  3. Implement the get_updated_state method.
  4. Implement the stream_slices method.
  5. Update the path method to specify the date to pull exchange rates for.
  6. Update the configured catalog to use incremental sync when we're testing the stream.

We'll describe what each of these methods does below. Before we begin, it may help to familiarize yourself with how incremental sync works in Airbyte by reading the docs on incremental.

To keep things concise, we'll only show functions as we edit them one by one.

Let's get the easy parts out of the way and pass the start_date:

def streams(self, config: Mapping[str, Any]) -> List[Stream]:
    auth = NoAuth()
    # Parse the date from a string into a datetime object
    start_date = datetime.strptime(config['start_date'], '%Y-%m-%d')
    return [ExchangeRates(authenticator=auth, base=config['base'], start_date=start_date)]

Let's also add this parameter to the constructor and declare the cursor_field:

from datetime import datetime, timedelta


class ExchangeRates(HttpStream):
    url_base = "https://api.exchangeratesapi.io/"
    cursor_field = "date"
    primary_key = "date"

    def __init__(self, base: str, start_date: datetime, **kwargs):
        # Forward the remaining keyword arguments (such as the authenticator) to HttpStream
        super().__init__(**kwargs)
        self.base = base
        self.start_date = start_date

Declaring the cursor_field informs the framework that this stream now supports incremental sync. The next time you run python main.py discover --config sample_files/config.json you'll find that the supported_sync_modes field now also contains incremental.

But we're not quite done with supporting incremental: we have to actually emit state! We'll structure our state object very simply: it will be a dict whose single key is 'date' and whose value is the date of the last day we synced data from. For example, {'date': '2021-04-26'} indicates the connector previously read data up until April 26th and therefore shouldn't re-read anything before April 26th.

Let's do this by implementing the get_updated_state method inside the ExchangeRates class.

    def get_updated_state(self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]) -> Mapping[str, Any]:
        # This method is called once for each record returned from the API. It compares the cursor field value in that record
        # with the current state and returns an updated state object. If this is the first time we run a sync or no state was passed, current_stream_state will be None.
        if current_stream_state is not None and 'date' in current_stream_state:
            current_parsed_date = datetime.strptime(current_stream_state['date'], '%Y-%m-%d')
            latest_record_date = datetime.strptime(latest_record['date'], '%Y-%m-%d')
            return {'date': max(current_parsed_date, latest_record_date).strftime('%Y-%m-%d')}
        else:
            return {'date': self.start_date.strftime('%Y-%m-%d')}

This implementation compares the date from the latest record with the date in the current state and takes the maximum as the "new" state object.
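
To see how the state evolves record by record, here is a standalone sketch of the same logic as a plain function, so it can be run without the CDK. The function name updated_state and the sample records are made up for illustration.

```python
from datetime import datetime
from typing import Any, Mapping, Optional

DATE_FORMAT = "%Y-%m-%d"


def updated_state(
    current_state: Optional[Mapping[str, Any]],
    latest_record: Mapping[str, Any],
    start_date: datetime,
) -> Mapping[str, Any]:
    """Standalone mirror of get_updated_state: keep the later of the two dates."""
    if current_state and "date" in current_state:
        current = datetime.strptime(current_state["date"], DATE_FORMAT)
        latest = datetime.strptime(latest_record["date"], DATE_FORMAT)
        return {"date": max(current, latest).strftime(DATE_FORMAT)}
    # No usable state yet: fall back to the configured start date
    return {"date": start_date.strftime(DATE_FORMAT)}


if __name__ == "__main__":
    state = None
    # Illustrative records, one per day, in the order the API would return them
    for record in [{"date": "2021-04-24"}, {"date": "2021-04-25"}, {"date": "2021-04-26"}]:
        state = updated_state(state, record, start_date=datetime(2021, 4, 24))
    print(state)  # {'date': '2021-04-26'}
```

Because the function takes the maximum, an older record arriving late never moves the state backwards.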

We'll implement the stream_slices method to return a list of the dates for which we should pull data based on the stream state if it exists:

    def _chunk_date_range(self, start_date: datetime) -> List[Mapping[str, Any]]:
        """
        Returns a list of each day between the start date and now.
        The return value is a list of dicts {'date': date_string}.
        """
        dates = []
        while start_date < datetime.now():
            dates.append({'date': start_date.strftime('%Y-%m-%d')})
            start_date += timedelta(days=1)
        return dates

    def stream_slices(
        self, sync_mode, cursor_field: List[str] = None, stream_state: Mapping[str, Any] = None
    ) -> Iterable[Optional[Mapping[str, Any]]]:
        # Resume from the date stored in the state if there is one, otherwise from the configured start date
        if stream_state and 'date' in stream_state:
            start_date = datetime.strptime(stream_state['date'], '%Y-%m-%d')
        else:
            start_date = self.start_date
        return self._chunk_date_range(start_date)

Each slice will cause an HTTP request to be made to the API. We can then use the information present in the stream_slice parameter (a single element from the list we constructed in stream_slices above) to set other configurations for the outgoing request like path or request_params. For more info about stream slicing, see the slicing docs.
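
The slicing logic can be tried out on its own with the sketch below. It is a standalone variant of _chunk_date_range that takes an explicit end_date instead of calling datetime.now(), purely so the output is reproducible. The end_date parameter is an addition made for this illustration.

```python
from datetime import datetime, timedelta
from typing import Any, List, Mapping


def chunk_date_range(start_date: datetime, end_date: datetime) -> List[Mapping[str, Any]]:
    """Return one {'date': 'YYYY-MM-DD'} slice per day from start_date up to (not including) end_date."""
    dates = []
    current = start_date
    while current < end_date:
        dates.append({"date": current.strftime("%Y-%m-%d")})
        current += timedelta(days=1)
    return dates


if __name__ == "__main__":
    # Pretend "now" is noon on 2021-04-26
    now = datetime(2021, 4, 26, 12)
    print(chunk_date_range(datetime(2021, 4, 24), now))
    # [{'date': '2021-04-24'}, {'date': '2021-04-25'}, {'date': '2021-04-26'}]

    # Resuming from a saved state date yields that same day again
    print(chunk_date_range(datetime(2021, 4, 26), now))
    # [{'date': '2021-04-26'}]
```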

In order to pull data for a specific date, the Exchange Rates API requires that we pass the date as the path component of the URL. Let's override the path method to achieve this:

def path(self, stream_state: Mapping[str, Any] = None, stream_slice: Mapping[str, Any] = None, next_page_token: Mapping[str, Any] = None) -> str:
    return stream_slice['date']

With these changes, your implementation should look like the file here.

The last thing we need to do is change the sync_mode field in the sample_files/configured_catalog.json to incremental:

"sync_mode": "incremental",

We should now have a working implementation of incremental sync!

Let's try it out:

python main.py read --config sample_files/config.json --catalog sample_files/configured_catalog.json

You should see a bunch of RECORD messages and STATE messages. To verify that incremental sync is working, pass the input state back to the connector and run it again:

# Save the latest state to sample_files/state.json
python main.py read --config sample_files/config.json --catalog sample_files/configured_catalog.json | grep STATE | tail -n 1 | jq .state.data > sample_files/state.json

# Run a read operation with the latest state message
python main.py read --config sample_files/config.json --catalog sample_files/configured_catalog.json --state sample_files/state.json

You should see that only the record from the last synced date is read again! This is acceptable behavior: Airbyte requires at-least-once delivery of records, so re-emitting the last record is OK.
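
If jq is not available, the state-saving pipeline above (grep STATE, tail -n 1, jq .state.data) can be approximated in Python. The sketch below assumes each output line is a JSON message and that STATE messages carry their payload under state.data, as the jq expression suggests. The helper name last_state_data and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Optional


def last_state_data(output_lines: Iterable[str]) -> Optional[dict]:
    """Return state.data from the last STATE message, or None if there is none."""
    latest = None
    for line in output_lines:
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any output that is not JSON
        if isinstance(message, dict) and message.get("type") == "STATE":
            latest = message["state"]["data"]
    return latest


if __name__ == "__main__":
    # Illustrative output lines, not captured from a real run
    sample_output = [
        '{"type": "RECORD", "record": {"stream": "exchange_rates", "data": {"date": "2021-04-25"}}}',
        '{"type": "STATE", "state": {"data": {"date": "2021-04-25"}}}',
        '{"type": "STATE", "state": {"data": {"date": "2021-04-26"}}}',
    ]
    print(last_state_data(sample_output))  # {'date': '2021-04-26'}
```

Writing the returned dict to sample_files/state.json with json.dump would then give the file that the --state flag expects.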

With that, we've implemented incremental sync for our connector!