The Hashmap Data Cataloger is a utility that can be used to catalog (read) data assets such as databases, schemas, and tables from a given source system and map (write) them into a given destination system.

hashmap-data-cataloger (hdc) can be invoked from the command-line interface (next section) or as a library of APIs.
This tool is available on PyPI and can be installed as:

```
pip install hashmap-data-cataloger
```

This will install hashmap-data-cataloger and all of its dependencies.
The hdc tool is a configuration-driven application that depends on three types of configuration files, each encoded as YAML.
The hdc tool uses this YAML file to define the supported sources, destinations, and corresponding mappers, configuring itself to enable the 'map' or 'catalog' functions. The layout of this file looks like this.
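As a rough illustration, such a file might be laid out as follows. The section names (sources, destinations, mappers, connection_profiles) come from the description above, but the field names inside each section are assumptions for illustration only, not the package's actual schema:

```yaml
# Illustrative sketch only -- field names are assumptions,
# not the package's actual schema.
sources:
  - name: oracle
    connection_profile: oracle_profile
destinations:
  - name: snowflake
    connection_profile: snowflake_profile
mappers:
  - source: oracle
    destination: snowflake
connection_profiles:
  - oracle_profile
  - snowflake_profile
```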
The default version comes with pre-configured sources, destinations, and mappers that can be used as is when invoking hdc from the CLI or through an API call (see examples below). The user only needs to update the connection profile for each source/destination individually under the 'connection_profiles' section. Each profile name being updated should match one of the profile names in the profile.yml file.
You can override the default file from the CLI by using the '-c' option followed by the path to a custom YAML file. However, it must conform to the format linked above.
To create a default YAML configuration file, do the following:
- Using any text editor, create a file like this and save it as 'app_config.yml'.
- Create a hidden directory named '.hdc' in the user's home directory.
- Move 'app_config.yml' into the hidden directory created above.
The hdc tool uses this YAML file to provide the necessary connection details for the source and destination databases. The elements required in the YAML file and their layout look like this. Presently, the connections are secured via user credentials.
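As a rough illustration, a connection profile might look like the sketch below. The key names are assumptions for illustration only, not the package's actual schema; only the use of user credentials is stated above:

```yaml
# Illustrative sketch only -- key names are assumptions,
# not the package's actual schema.
oracle_profile:
  host: <hostname>
  port: 1521
  database: <database name>
  user: <username>
  password: <password>
```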
This file cannot be overridden from the CLI and therefore must be made available beforehand as follows:
- Using any text editor, create a file like this and save it as 'profile.yml'.
- Create a hidden directory named '.hdc' in the user's home directory.
- Move 'profile.yml' into the hidden directory created above.
The hdc tool uses this YAML file to configure the log settings (via Python's logging module). The elements required in the YAML file and their layout look like this.
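As a rough illustration, a minimal file following Python's standard logging dictConfig schema might look like this; the package's actual default settings may differ:

```yaml
# Minimal example following Python's logging.config.dictConfig schema.
# The hdc package's actual defaults may differ.
version: 1
formatters:
  simple:
    format: '%(asctime)s %(name)s %(levelname)s %(message)s'
handlers:
  console:
    class: logging.StreamHandler
    formatter: simple
    level: INFO
root:
  level: INFO
  handlers: [console]
```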
You can override this file from the CLI using the '-l' option followed by the path to a custom YAML file. However, it must conform to the format linked above.
To create a default YAML configuration file, do the following:
- Using any text editor, create a file like this and save it as 'log_settings.yml'.
- Create a hidden directory named '.hdc' in the user's home directory.
- Move 'log_settings.yml' into the hidden directory created above.
Once the package is installed along with its dependencies, invoke it from the command line as:
```
usage: hdc [-h] -r {catalog,map} -s SOURCE [-d DESTINATION] [-c APP_CONFIG] [-l LOG_SETTINGS]

optional arguments:
  -h, --help            show this help message and exit
  -r {catalog,map}, --run {catalog,map}
                        One of 'catalog' or 'map'
  -s SOURCE, --source SOURCE
                        Name of any one of sources configured in hdc.yml
  -d DESTINATION, --destination DESTINATION
                        Name of any one of destinations configured in hdc.yml
  -c APP_CONFIG, --app_config APP_CONFIG
                        Path to application config (YAML) file if other than default
  -l LOG_SETTINGS, --log_settings LOG_SETTINGS
                        Path to log settings (YAML) file if other than default
```

For example:

```
python3 -m hdc -r catalog -s oracle
python3 -m hdc -r map -s oracle -d snowflake
python3 -m hdc -r map -s netezza -d snowflake
```
Other applications can import hdc as a library and make use of the cataloging or mapping functions as explained below.
- AssetMapper: provides a method 'map_assets()' to kick off the crawling, mapping, and writing of data assets from a given source system to a target system, based on the connection profile parameters for each. An AssetMapper object can be created in the following manner:

```python
asset_mapper = AssetMapper(source='<source name>', destination='<destination name>')
result: bool = asset_mapper.map_assets()
```
- 'source': the str name of any one of the sources configured in the default app_config.yml.
- 'destination': the str name of any one of the destinations configured in the default app_config.yml.
- Cataloger: provides a method 'obtain_catalog()' to kick off a crawler process against a given source system and pull the data asset information according to the connection profile parameters. A Cataloger object can be created in the following manner:

```python
cataloger = Cataloger(source='<source name>')
result: pandas.DataFrame = cataloger.obtain_catalog()
```
- 'source': the str name of any one of the sources configured in the default app_config.yml.
At present, the hdc tool crawls through the entire hierarchy of a given database (all schemas, and all tables under each schema). This could be fine-tuned to crawl only selected schemas under a given database.
Allow configuration of external key stores for storing the user authentication details required when connecting to source or destination systems. The application shall be able to interact with the external key store based on the configuration provided.

This would provide a stronger security option than configuring user credentials directly in the profile.yml file.
TBD
TBD
TBD