
Writing a data source driver

Jeff Heard edited this page Jul 20, 2013 · 9 revisions

Audience

This document is for Python-savvy developers interested in connecting a new kind of data source to Geoanalytics.

What's a driver for?

A Driver is a Python class that connects data to Geoanalytics. An instance of the driver is associated with every data resource that is instantiated. A driver supports the following functions:

  • Introspecting the functionality supported for various services
  • Readying a local cached copy of data to be used by the current request (WMS, WFS, and eventually WCS)
  • Computing the spatial metadata fields for the data resource
  • Getting data for a single point
  • Getting the data as an analyzable dataset, either a Pandas DataFrame or Panel object. These are high-level, high-performance data structures that do not directly support spatial analysis but do support statistical and numerical analysis.
  • Taking a DataFrame or Panel object and saving it as a new dataset, returning that resource as a file path.

Create the driver

First, create a new module. A module should contain no more than a single Driver, and that Driver should be assigned to the "driver" variable at the end of the module:

from ga_resources.drivers.related import Driver

class MyDriver(Driver):
    ...

driver = MyDriver

Supporting essential functionality

The driver needs to support some essential functionality. This involves producing cached copies of the data source and reporting some basic statistics about it.

def ready_data_resource(self, **kwargs)

The keyword arguments that this takes have not been standardized yet. They are any keyword arguments passed to a WFS/WMS request as well as possibly:

  • sort_by : column name
  • bbox : (minx, miny, maxx, maxy) in the passed in srs
  • srs : spatial reference system as a string (epsg:* or a proj.4 string)
  • boundary : Geometry as WKT
  • query : Currently the support for this depends upon the driver. CQL may be required in the future, but for now the query language is as you implement it. Document this in the driver.

Processing these parameters in ready_data_resource will allow you to affect rendered layers (by writing a set of features to the data source that will ultimately get passed to the WMS renderer) or features returned from a WFS request.
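As a sketch of the parameter handling above, a driver might normalize these keyword arguments up front before touching the data. parse_request_kwargs is a hypothetical helper, not part of the Geoanalytics API; the default srs is an assumption for illustration:

```python
def parse_request_kwargs(kwargs):
    # Hypothetical helper: pull out the optional WMS/WFS keyword
    # arguments described above, leaving None where absent.
    bbox = kwargs.get('bbox')             # (minx, miny, maxx, maxy) or None
    srs = kwargs.get('srs', 'epsg:4326')  # assumed default for illustration
    sort_by = kwargs.get('sort_by')       # column name or None
    boundary = kwargs.get('boundary')     # WKT geometry string or None
    query = kwargs.get('query')           # driver-defined query language
    return bbox, srs, sort_by, boundary, query
```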

Normally, this is implemented by calling the superclass method and then doing any post-processing. The return value should be a three-tuple:

def ready_data_resource(self, **kwargs):
    slug, srs = super(ShapefileDriver, self).ready_data_resource(**kwargs)
    
    # ... do any additional work

    return slug, srs, {
        'type': 'shape',
        "file": self.cached_basename + '.shp'
    }

The first two elements of the tuple are returned by the superclass ready_data_resource, but the third is determined by your driver implementation. It is the DataSource fragment for Mapnik's XML configuration. It takes the form of a set of key-value pairs and should at the very least include "type". For more information on what parameters are available for each Mapnik driver, see Mapnik's documentation and look at the source code for the existing drivers in Geoanalytics.
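As an illustration, a hypothetical driver backed by PostGIS might build its fragment like the following. The exact keys are defined by Mapnik's postgis datasource plugin; only "type" is universally required:

```python
def mapnik_datasource_fragment(table, dbname):
    # Hypothetical example of a Mapnik DataSource fragment for a
    # PostGIS-backed driver; keys follow Mapnik's postgis plugin.
    return {
        'type': 'postgis',
        'dbname': dbname,
        'table': table,
    }
```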

def compute_fields(self, **kwargs)

This method should create a ga_resources.models.SpatialMetadata object and attach it to the DataResource or repopulate the SpatialMetadata object from the DataResource. It should not check to make sure the resource hasn't been modified before doing so. This is done inside Geoanalytics, and if compute_fields is called, it is expected that this method should fully recompute all the data fields.

The data fields are as follows:

  • native_bounding_box : geometry. A rectangular POLYGON field containing the bounding box in the native coordinate system.
  • bounding_box : geometry. A rectangular POLYGON field containing the bounding box in lat/lon coordinates. (EPSG:4326)
  • three_d : boolean. Whether the features or coverage referred to by the data source contain three-dimensional data rather than two.
  • native_srs : string. A spatial reference string in proj.4 format.

The compute_fields method should then save() the SpatialMetadata object and save self.resource. The compute_fields method should return nothing. It should always call the superclass compute_fields method with its arguments.
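One small piece of this work can be sketched standalone: building the rectangular POLYGON WKT used for the native_bounding_box and bounding_box fields. bbox_polygon_wkt is a hypothetical helper, not a Geoanalytics function:

```python
def bbox_polygon_wkt(minx, miny, maxx, maxy):
    # Build a closed rectangular POLYGON in WKT: five points,
    # with the first repeated at the end to close the ring.
    return ('POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))' %
            (minx, miny, maxx, miny, maxx, maxy, minx, maxy, minx, miny))
```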

def as_dataframe(self, **kwargs)

This important method does two things. It enables WFS support and gives users the ability to access the underlying data in a form suitable for hardware accelerated analysis. The essence of it is this: using the keyword arguments to limit or manipulate the result set, take the underlying data source and turn it into a DataFrame or Panel object and return it.

This is the most complex method in the driver by far and requires the most hand-implementation. For a good reference, see the Shapefile driver implementation.

The keyword arguments that this takes have not been standardized yet, but they are the same as can be received by ready_data_resource. They are any keyword arguments passed to a WFS/WMS request as well as possibly:

  • sort_by : column name
  • bbox : (minx, miny, maxx, maxy) in the passed in srs
  • srs : spatial reference system as a string (epsg:* or a proj.4 string)
  • boundary : Geometry as WKT
  • query : Currently the support for this depends upon the driver. CQL may be required in the future, but for now the query language is as you implement it. Document this in the driver.

These should be used to limit the scope and manipulate the records that are returned. A DataFrame should contain all the fields requested from the underlying data source and a geometry field containing geometry objects from the Shapely library.

It is also possible to return a Panel object, although WFS does not currently support Panels. Panels are necessary, however, when representing rasters as DataFrames. Future implementations of WFS or WCS may support Panels, but this functionality is not yet well specified. For an example of an implementation of this method that returns a Panel, see the GeoTIFF driver code.
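A minimal, hypothetical sketch of the keyword-argument handling in as_dataframe follows. To stay self-contained it simplifies the geometry column to (x, y) tuples; a real driver would hold Shapely geometry objects and would also honor boundary and query:

```python
import pandas as pd

def apply_common_kwargs(df, sort_by=None, bbox=None, **kwargs):
    # Hypothetical post-filtering step: restrict rows to the bounding
    # box, then sort. Geometry is simplified to (x, y) tuples here.
    if bbox is not None:
        minx, miny, maxx, maxy = bbox
        mask = df['geometry'].apply(
            lambda p: minx <= p[0] <= maxx and miny <= p[1] <= maxy)
        df = df[mask]
    if sort_by is not None:
        df = df.sort_values(sort_by)
    return df
```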

Supporting optional functionality

def from_dataframe(cls, df, filename, srs)

Parameters:

  • df : pandas.DataFrame. The data frame to serialize
  • filename : string. The filename to serialize the data to
  • srs : osgeo.ogr.osr.SpatialReference. The spatial reference system for the file.

This method should take a DataFrame object and serialize it to the datatype that the Driver supports. For a working example, see the Shapefile driver implementation. It should return None, but the filename that you passed in should contain the data in the proper format.
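As a hedged sketch of the contract, a driver for a simple CSV-backed format might serialize like this. The format choice is illustrative only; a Shapefile driver would use OGR and would actually consume the srs argument:

```python
import pandas as pd

def from_dataframe(df, filename, srs):
    # Hypothetical serializer for a CSV-like format: write the frame
    # to 'filename' and return None, as the contract requires.
    # 'srs' is unused here because plain CSV carries no projection.
    df.to_csv(filename, index=False)
```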

def get_data_for_point(self, wherex, wherey, srs, fuzziness=0, **kwargs)

Parameters:

  • wherex : Processed by the superclass method; ignore
  • wherey : Processed by the superclass method; ignore
  • srs : Processed by the superclass method; ignore
  • fuzziness : number. If you implement fuzziness, "buffer" the area around the requested point and return any matches within that buffer.

This method should return a list of Python dicts containing data records as "column : value" pairs for the requested location. If there is no exact data record, the behavior is determined by the driver: you could return None, the nearest neighbor to the requested point, or an interpolated value for that point.
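A hypothetical sketch of the point query with fuzziness, treating the buffer as a square window around the point (a real driver might use Shapely buffering and true geometric intersection instead):

```python
def records_near_point(records, wherex, wherey, fuzziness=0):
    # Hypothetical point query over records carrying (x, y) locations:
    # 'fuzziness' widens the point into a square search window.
    hits = []
    for rec in records:
        if (abs(rec['x'] - wherex) <= fuzziness
                and abs(rec['y'] - wherey) <= fuzziness):
            # Return only the data columns, not the coordinates.
            hits.append({k: v for k, v in rec.items() if k not in ('x', 'y')})
    return hits
```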

def get_data_fields(self, **kwargs)

The get_data_fields method is optional functionality. It will be deprecated soon in favor of taking the fields from as_dataframe. If you implement it, it should return a list of tuples containing:

  • field name : string
  • field type name : string
  • field width in bytes : integer

These are used for informational purposes only in views and are not needed to support any underlying functionality.
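Since get_data_fields is slated to be replaced by inspection of as_dataframe, one hypothetical interim implementation simply derives the tuples from the frame's dtypes:

```python
import pandas as pd

def data_fields_from_dataframe(df):
    # Hypothetical helper: produce (name, type name, width in bytes)
    # tuples from a DataFrame's column dtypes.
    return [(col, str(df[col].dtype), df[col].dtype.itemsize)
            for col in df.columns]
```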

Introspection of supported functionality

The following class methods all return True by default. To support graceful degradation of functionality in Geoanalytics, your driver should override any of these items it does not support.
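For example, a hypothetical read-only, single-layer driver might switch off the capabilities it lacks. The plain base class here is a stand-in for ga_resources.drivers.Driver, so the snippet stays self-contained:

```python
class SingleLayerDriver(object):
    # Hypothetical driver that degrades gracefully: it cannot save
    # new datasets and exposes only one layer, so it overrides the
    # corresponding introspection class methods to return False.

    @classmethod
    def supports_save(cls):
        return False

    @classmethod
    def supports_multiple_layers(cls):
        return False
```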

def supports_multiple_layers(cls)

If your dataset has sub-datasets underlying it, such as a PostGIS dataset where each dataset is a particular table or query, then you need to support multiple layers. If you support multiple layers, then certain functions may get a "sublayer" keyword argument that lists the layers that are being requested by the requesting entity. These functions are:

  • ready_data_resource
  • get_data_fields
  • get_data_for_point
  • as_dataframe

def supports_download(cls)

If your dataset can be downloaded in its entirety, then this should return True. If this is True, then your data resource should always have either a resource_file or resource_url attribute in the database.

def supports_related(cls)

If your vector dataset fits entirely on disk, then this should probably return True unless you have specific reasons for it not to. This allows a Geoanalytics user to define a RelatedResource that contains records to be joined to your resource on a particular join key. This is the equivalent of a table join, and if you have tabular data, this should be supported.

def supports_upload(cls)

If your driver supports reading or unpacking a dataset from a single file (say a .zip for a Shapefile, or a KML file), then this should return True. A user could then upload a single file in its entirety and have a usable dataset.

def supports_configuration(cls)

If your driver supports loading configuration variables via a JSON configuration object stored in the DataResource object, then this should return True. Configuration variables might be used to designate stored queries, paths to data, defaults for a layer, and so on. The GRASS driver uses this for data paths and defaults for display.
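A minimal sketch of reading such a configuration, assuming the raw JSON string has already been fetched from the DataResource (load_configuration is a hypothetical helper):

```python
import json

def load_configuration(raw_config):
    # Hypothetical helper: parse the JSON configuration object stored
    # on the DataResource, treating an empty value as no configuration.
    if not raw_config:
        return {}
    return json.loads(raw_config)
```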

def supports_point_query(cls)

If you can get meaningful information from a single point, then this should return True. This will enable WMS GetFeatureInfo.

def supports_save(cls)

If this is left as True, then your driver should implement a from_dataframe method to support saving data in this format.

def datatype(cls)

This should either be ga_resources.drivers.VECTOR or ga_resources.drivers.RASTER.

def mimetype(self)

This should return the mimetype for downloading the whole dataset. It defaults to "application/octet-stream".