Skip to content

s7clarke10/pipelinewise-tap-oracle

 
 

Repository files navigation

pipelinewise-tap-oracle

singer_oracle_tap

PyPI version PyPI - Python Version License: MIT

Singer tap that extracts data from a Oracle database and produces JSON-formatted data following the Singer spec.

This is a PipelineWise compatible tap connector.

How to use it

The recommended method of running this tap is to use it from PipelineWise. When running it from PipelineWise you don't need to configure this tap with JSON files and most of things are automated. Please check the related documentation at Tap Oracle

If you want to run this Singer Tap independently please read further.

Log based replication

Tap-Oracle Log-based replication requires some configuration changes in Oracle database:

  • Enable ARCHIVELOG mode

  • Set retention period a reasonable and long enough period, ie. 1 day, 3 days, etc.

  • Enable Supplemental logging

Setting up Log-based replication on a self hosted Oracle Database:

To verify the current archiving mode, if the result is ARCHIVELOG, archiving is enabled:

  SQL> SELECT LOG_MODE FROM V$DATABASE

To enable ARCHIVELOG mode (if not enabled yet):

  SQL> SHUTDOWN IMMEDIATE
  SQL> STARTUP MOUNT
  SQL> ALTER DATABASE ARCHIVELOG
  SQL> ALTER DATABASE OPEN

To set retention period, use RMAN:

  RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 1 DAYS;

To enable supplemental logging:

  SQL> ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS

Setting up Log-based replication on Oracle on Amazon RDS

To set retention period:

  begin
      rdsadmin.rdsadmin_util.set_configuration(
          name  => 'archivelog retention hours',
          value => '24');
  end;

To enable supplemental logging:

  begin
    rdsadmin.rdsadmin_util.alter_supplemental_logging(p_action => 'ADD');
  end;

Install and Run

First, make sure Python 3 is installed on your system or follow these installation instructions for Mac or Ubuntu.

It's recommended to use a virtualenv:

  python3 -m venv venv
  pip install pipelinewise-tap-oracle

or

  python3 -m venv venv
  . venv/bin/activate
  pip install --upgrade pip
  pip install .

OS Setup

Before you can use tap-oracle, you need to download and install Oracle Client software. This means downloading appropriate Oracle Client software, and configuring environment variables to point the Oracle Client software as per Oracle's software setup.

The following script is an Example for running this tap in a Docker Container and it a snippet of code from a Dockerfile. An equivalent setup can be done as a one-off in a Linux Server by way of example with appropriate environment variables set in a .bash_profile or appropriate shell for your environment.

Note: The legacy cx_Oracle library is not available on MacOS. The setup.py will ignore installing the library for a MacOS, the ora_python_driver_type becomes mandatory with a value of thin for the client mode.

# For more details on the Oracle CX_Oracle library refer to 
# https://cx-oracle.readthedocs.io/en/latest/
# For more details on the Oracle oracledb library ref to
# https://oracle.github.io/python-oracledb/
# Install the libaio1 library for the Oracle Client, unzip
RUN apt-get update \
    && apt-get install -y unzip libaio1 \
    && apt-get clean \
    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete

# Install the Oracle Client
RUN mkdir /opt/oracle \
    && cd ~ \
    && curl -O https://download.oracle.com/otn_software/linux/instantclient/1912000/instantclient-basic-linux.x64-19.12.0.0.0dbru.zip \
    && unzip instantclient-basic-linux.x64-19.12.0.0.0dbru.zip -d /opt/oracle \
    && rm instantclient-basic-linux.x64-19.12.0.0.0dbru.zip

# Set required Oracle Client environment variables
ENV ORACLE_HOME="/opt/oracle/instantclient_19_12"
ENV PATH="$PATH:$ORACLE_HOME" \
    LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$ORACLE_HOME" \
    TNS_ADMIN="$ORACLE_HOME/network/admin"

Configuration

Running the the tap requires a config.json file. Example with the minimal settings:

  {
    "host": "foo.com",
    "port": 1521,
    "user": "my_user",
    "password": "password",
    "service_name": "ORCL"
  }

Recommended optional settings

  • "filter_schemas": "schema name" - This will speed up discover time as it only discovers the given schema.
  • "filter_tables": ["SCHEMA-TABLE1", "SCHEMA-TABLE1"] - this will speed up discovery to just the listed tables.
  • "use_singer_decimal": true - This will help avoid numeric rounding issues emitting as a string with a format of singer.decimal.
  • "cursor_array_size": 10000 - This will help speed up extracts over a WAN or low latency network. The default is 1000.
  • "ora_python_driver_type": "cx|thick|thin" - Provides an option to specify a Oracle driver library to use. The default is cx.

Optional:

For older database or connecting to an instance you can use the legacy SID for the connection. Swap the sid keyword for service_name.

  {
    "sid": "ORCL"
  }

Optional:

To filter the discovery to a particular schema within a database. This is useful if you have a large number of schemas and wish to speed up the discovery.

{
  "filter_schemas": "your database schema name",
}

Optional:

To filter the discovery to a particular list of tables in a database. This is useful if you have a large number of tables in a schema and wish to speed up the discovery. Note: There is a format feature each table of ["SCHEMA-TABLE"] and should follow JSON arry literal formatting. You can also filter tables by setting an environment variable MELTANO_EXTRACT__SELECT. e.g. export MELTANO_EXTRACT__SELECT='["HR-EMPLOYEES", "HR-DEPARTMENTS"]'

{
  "filter_tables": ["HR-EMPLOYEES", "HR-DEPARTMENTS"],
}

Optional:

Support for a common user for working with pluggable databases (PDB). Every common user can connect to an perform operations within the root database, and within any PDB in which it has privileges.

{
  "common_user": "common_user_defined_in_oracle",
  "common_password": "common_user_password",
  "common_service_name": "common_user_service_connection_name",
}

Optional:

A boolean setting: when enabled true, it outputs decimal and floating point numbers as strings to avoid loss of precision and scale. There are hints in the schema message, format = "singer.decimal", and additionalProperties scale_precision dictionary providing precision and scale. For decimal data, the target can use this information to correctly replicate decimal data without loss. For the Floats and Number data type without precision and scale it is recommended that post processing formats the datatype based on an inspection of the data because the true data size is unknown / dynamic.

{
  "use_singer_decimal": true,
}

Optional:

To avoid problems with uncommitted changes being read, you can set offset_value to add to the value found in the STATE for INCREMENTAL loads. If the value provided is for a datetime replication key then the offset_value is read as seconds to offset by, otherwise the value is used as provided.

Using offset_value < 0 would result in an overlapping set of records being read each time the tap is run.

Using offset_value > 0 may result in data being missed. However it can be useful if a period (month-year) is being used. This prevents the tap from using period >= last-read-period and doubling up on the extract.

Usage (offsetting by +1 day in seconds = 24*3600):

{
  "offset_value": 86400
}

Optional:

A numeric setting adjusting the internal buffersize. The common query tuning scenario is for SELECT statements that return a large number of rows over a slow network. Increasing arraysize can improve performance by reducing the number of round-trips to the database. However increasing this value increases the amount of memory required.

{
  "cursor_array_size": 10000,
}

Optional:

A setting which will dynamically import the correct Oracle Library and set the connection mode. This allows you to select the legacy 'cx' Oracle driver library or the newer 'oracledb' library. With the newer 'oracledb' library you can run it in either 'thick' or 'thin' mode. Under thin mode you can pass specific driver settings like 'https_proxy' if and as required via config settings.

{
  "ora_python_driver_type": "thin",
}

To run tests:

Tests require Oracle on Amazon RDS >= 12.1, and a user called ROOT.

  1. Define environment variables that requires running the tests.
  export TAP_ORACLE_HOST=<oracle-rds-host>
  export TAP_ORACLE_PORT=<oracle-rds-port>
  export TAP_ORACLE_USER=ROOT
  export TAP_ORACLE_PASSWORD=<oracle-rds-password>
  export TAP_ORACLE_SID=<oracle-rds-sid>
  1. Install python dependencies in a virtual env and run nose unit and integration tests
  python3 -m venv venv
  . venv/bin/activate
  pip install --upgrade pip
  pip install .
  pip install nose
  1. To run unit tests:
  nosetests