Singer tap that extracts data from a Oracle database and produces JSON-formatted data following the Singer spec.
This is a PipelineWise compatible tap connector.
The recommended method of running this tap is to use it from PipelineWise. When running it from PipelineWise you don't need to configure this tap with JSON files and most of things are automated. Please check the related documentation at Tap Oracle
If you want to run this Singer Tap independently please read further.
Tap-Oracle Log-based replication requires some configuration changes in Oracle database:
-
Enable
ARCHIVELOG
mode -
Set retention period a reasonable and long enough period, ie. 1 day, 3 days, etc.
-
Enable Supplemental logging
To verify the current archiving mode, if the result is ARCHIVELOG
, archiving is enabled:
SQL> SELECT LOG_MODE FROM V$DATABASE
To enable ARCHIVELOG
mode (if not enabled yet):
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT
SQL> ALTER DATABASE ARCHIVELOG
SQL> ALTER DATABASE OPEN
To set retention period, use RMAN:
RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 1 DAYS;
To enable supplemental logging:
SQL> ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS
To set retention period:
begin
rdsadmin.rdsadmin_util.set_configuration(
name => 'archivelog retention hours',
value => '24');
end;
To enable supplemental logging:
begin
rdsadmin.rdsadmin_util.alter_supplemental_logging(p_action => 'ADD');
end;
First, make sure Python 3 is installed on your system or follow these installation instructions for Mac or Ubuntu.
It's recommended to use a virtualenv:
python3 -m venv venv
pip install pipelinewise-tap-oracle
or
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install .
Before you can use tap-oracle, you need to download and install Oracle Client software. This means downloading appropriate Oracle Client software, and configuring environment variables to point the Oracle Client software as per Oracle's software setup.
The following script is an Example for running this tap in a Docker Container and it a snippet of code from a Dockerfile
. An equivalent setup can be done as a one-off in a Linux Server by way of example with appropriate environment variables set in a .bash_profile or appropriate shell for your environment.
# For more details on the Oracle CX_Oracle library refer to
# https://cx-oracle.readthedocs.io/en/latest/
# Install the libaio1 library for the Oracle Client, unzip
RUN apt-get update \
&& apt-get install -y unzip libaio1 \
&& apt-get clean \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete
# Install the Oracle Client
RUN mkdir /opt/oracle \
&& cd ~ \
&& curl -O https://download.oracle.com/otn_software/linux/instantclient/1912000/instantclient-basic-linux.x64-19.12.0.0.0dbru.zip \
&& unzip instantclient-basic-linux.x64-19.12.0.0.0dbru.zip -d /opt/oracle \
&& rm instantclient-basic-linux.x64-19.12.0.0.0dbru.zip
# Set required Oracle Client environment variables
ENV ORACLE_HOME="/opt/oracle/instantclient_19_12"
ENV PATH="$PATH:$ORACLE_HOME" \
LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$ORACLE_HOME" \
TNS_ADMIN="$ORACLE_HOME/network/admin"
Running the the tap requires a config.json
file. Example with the minimal settings:
{
"host": "foo.com",
"port": 1521,
"user": "my_user",
"password": "password",
"service_name": "ORCL"
}
Recommended optional settings
"filter_schemas": "schema name"
- This will speed up discover time as it only discovers the given schema."filter_tables": ["SCHEMA-TABLE1", "SCHEMA-TABLE1"]
- this will speed up discovery to just the listed tables."use_singer_decimal": true
- This will help avoid numeric rounding issues emitting as a string with a format of singer.decimal."cursor_array_size": 10000
- This will help speed up extracts over a WAN or low latency network. The default is 1000.
Optional:
For older database or connecting to an instance you can use the legacy SID for the connection.
Swap the sid
keyword for service_name
.
{
"sid": "ORCL"
}
Optional:
To filter the discovery to a particular schema within a database. This is useful if you have a large number of schemas and wish to speed up the discovery.
{
"filter_schemas": "your database schema name",
}
Optional:
To filter the discovery to a particular list of tables in a database. This is useful if you have a large number of tables in a schema and wish to speed up the discovery.
Note: There is a format feature each table of ["SCHEMA-TABLE"] and should follow JSON arry literal formatting.
You can also filter tables by setting an environment variable MELTANO_EXTRACT__SELECT
. e.g. export MELTANO_EXTRACT__SELECT='["HR-EMPLOYEES", "HR-DEPARTMENTS"]'
{
"filter_tables": ["HR-EMPLOYEES", "HR-DEPARTMENTS"],
}
Optional:
Support for a common user for working with pluggable databases (PDB). Every common user can connect to an perform operations within the root database, and within any PDB in which it has privileges.
{
"common_user": "common_user_defined_in_oracle",
"common_password": "common_user_password",
"common_service_name": "common_user_service_connection_name",
}
Optional:
A boolean setting: when enabled true
, it outputs decimal and floating point numbers as strings to avoid loss of precision and scale.
There are hints in the schema message, format = "singer.decimal", and additionalProperties scale_precision dictionary providing precision and scale. For decimal data, the target can use this
information to correctly replicate decimal data without loss. For the Floats and Number data type without precision and scale it is recommended that post processing formats the datatype based on an inspection of the data because the true data size is unknown / dynamic.
{
"use_singer_decimal": true,
}
Optional:
To avoid problems with uncommitted changes being read, you can set offset_value
to add to the value found in the STATE for INCREMENTAL loads. If the value provided is for a datetime replication key then the offset_value
is read as seconds to offset by, otherwise the value is used as provided.
Using offset_value < 0 would result in an overlapping set of records being read each time the tap is run.
Using offset_value > 0 may result in data being missed. However it can be useful if a period (month-year) is being used. This prevents the tap from using period >= last-read-period and doubling up on the extract.
Usage (offsetting by +1 day in seconds = 24*3600):
{
"offset_value": 86400
}
Optional:
A numeric setting adjusting the internal buffersize. The common query tuning scenario is for SELECT statements that return a large number of rows over a slow network. Increasing arraysize can improve performance by reducing the number of round-trips to the database. However increasing this value increases the amount of memory required.
{
"cursor_array_size": 10000,
}
Tests require Oracle on Amazon RDS >= 12.1, and a user called ROOT
.
- Define environment variables that requires running the tests.
export TAP_ORACLE_HOST=<oracle-rds-host>
export TAP_ORACLE_PORT=<oracle-rds-port>
export TAP_ORACLE_USER=ROOT
export TAP_ORACLE_PASSWORD=<oracle-rds-password>
export TAP_ORACLE_SID=<oracle-rds-sid>
- Install python dependencies in a virtual env and run nose unit and integration tests
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install .
pip install nose
- To run unit tests:
nosetests