Data inventory

Joaquin Bedia edited this page Dec 9, 2013 · 6 revisions

A typical task when preparing predictor datasets for downscaling is to explore the data in order to obtain basic preliminary information, such as the time span of each dataset, and to check that the predictors are consistent with one another. The dataInventory function retrieves the information needed to access and manipulate the variables stored in a dataset.

Gridded datasets

For gridded datasets, the function dataInventory takes the dataset as its argument. The following call returns a description of the NcML dataset created in this example, which contains several variables of the NCEP reanalysis over the Iberian Peninsula.

> inv.iberiaNCEP <- dataInventory("datasets/reanalysis/Iberia_NCEP/Iberia_NCEP.ncml")

The inventory is a list of four elements, one for each variable stored in the dataset:

> names(inv.iberiaNCEP)
[1] "Q"    "SLPd" "T"    "Z"

The structure of the data inventory is shown in more detail below:

> str(inv.iberiaNCEP)
List of 4
 $ Q   :List of 5
  ..$ Description: chr "Specific humidity"
  ..$ DataType   : chr "float"
  ..$ Units      : chr "kg kg**-1"
  ..$ TimeStep   :Class 'difftime'  atomic [1:1] 24
  .. .. ..- attr(*, "tzone")= chr ""
  .. .. ..- attr(*, "units")= chr "hours"
  ..$ Dimensions :List of 4
  .. ..$ level:List of 3
  .. .. ..$ Type  : chr "Pressure"
  .. .. ..$ Units : chr "millibar"
  .. .. ..$ Values: num 850
  .. ..$ time :List of 3
  .. .. ..$ Type  : chr "Time"
  .. .. ..$ Units : chr "days since 1950-01-01 00:00:00"
  .. .. ..$ Values: POSIXlt[1:16071], format: "1958-01-01" "1958-01-02" "1958-01-03" "1958-01-04" ...
  .. ..$ lat  :List of 3
  .. .. ..$ Type  : chr "Lat"
  .. .. ..$ Units : chr "degrees north"
  .. .. ..$ Values: num [1:6] 35 37.5 40 42.5 45 47.5
  .. ..$ lon  :List of 3
  .. .. ..$ Type  : chr "Lon"
  .. .. ..$ Units : chr "degrees east"
  .. .. ..$ Values: num [1:9] -15 -12.5 -10 -7.5 -5 -2.5 0 2.5 5
 $ SLPd:List of 5
  ..$ Description: chr "Mean Sea Level Pressure; Mean daily value"
  ..$ DataType   : chr "float"
  ..$ Units      : chr "Pa"
  ..$ TimeStep   :Class 'difftime'  atomic [1:1] 24
  .. .. ..- attr(*, "tzone")= chr ""
  .. .. ..- attr(*, "units")= chr "hours"
  ..$ Dimensions :List of 3
  .. ..$ time:List of 3
  .. .. ..$ Type  : chr "Time"
  .. .. ..$ Units : chr "days since 1950-01-01 00:00:00"
  .. .. ..$ Values: POSIXlt[1:16071], format: "1958-01-01" "1958-01-02" "1958-01-03" "1958-01-04" ...
  .. ..$ lat :List of 3
  .. .. ..$ Type  : chr "Lat"
  .. .. ..$ Units : chr "degrees north"
  .. .. ..$ Values: num [1:6] 35 37.5 40 42.5 45 47.5
  .. ..$ lon :List of 3
  .. .. ..$ Type  : chr "Lon"
  .. .. ..$ Units : chr "degrees east"
  .. .. ..$ Values: num [1:9] -15 -12.5 -10 -7.5 -5 -2.5 0 2.5 5
 $ T   :List of 5
  ..$ Description: chr "Temperature"
  ..$ DataType   : chr "float"
  ..$ Units      : chr "K"
  ..$ TimeStep   :Class 'difftime'  atomic [1:1] 24
  .. .. ..- attr(*, "tzone")= chr ""
  .. .. ..- attr(*, "units")= chr "hours"
  ..$ Dimensions :List of 4
  .. ..$ level:List of 3
  .. .. ..$ Type  : chr "Pressure"
  .. .. ..$ Units : chr "millibar"
  .. .. ..$ Values: num 850
  .. ..$ time :List of 3
  .. .. ..$ Type  : chr "Time"
  .. .. ..$ Units : chr "days since 1950-01-01 00:00:00"
  .. .. ..$ Values: POSIXlt[1:16071], format: "1958-01-01" "1958-01-02" "1958-01-03" "1958-01-04" ...
  .. ..$ lat  :List of 3
  .. .. ..$ Type  : chr "Lat"
  .. .. ..$ Units : chr "degrees north"
  .. .. ..$ Values: num [1:6] 35 37.5 40 42.5 45 47.5
  .. ..$ lon  :List of 3
  .. .. ..$ Type  : chr "Lon"
  .. .. ..$ Units : chr "degrees east"
  .. .. ..$ Values: num [1:9] -15 -12.5 -10 -7.5 -5 -2.5 0 2.5 5
 $ Z   :List of 5
  ..$ Description: chr "Geopotential"
  ..$ DataType   : chr "float"
  ..$ Units      : chr "m**2 s**-2"
  ..$ TimeStep   :Class 'difftime'  atomic [1:1] 24
  .. .. ..- attr(*, "tzone")= chr ""
  .. .. ..- attr(*, "units")= chr "hours"
  ..$ Dimensions :List of 4
  .. ..$ level:List of 3
  .. .. ..$ Type  : chr "Pressure"
  .. .. ..$ Units : chr "millibar"
  .. .. ..$ Values: num 850
  .. ..$ time :List of 3
  .. .. ..$ Type  : chr "Time"
  .. .. ..$ Units : chr "days since 1950-01-01 00:00:00"
  .. .. ..$ Values: POSIXlt[1:16071], format: "1958-01-01" "1958-01-02" "1958-01-03" "1958-01-04" ...
  .. ..$ lat  :List of 3
  .. .. ..$ Type  : chr "Lat"
  .. .. ..$ Units : chr "degrees north"
  .. .. ..$ Values: num [1:6] 35 37.5 40 42.5 45 47.5
  .. ..$ lon  :List of 3
  .. .. ..$ Type  : chr "Lon"
  .. .. ..$ Units : chr "degrees east"
  .. .. ..$ Values: num [1:9] -15 -12.5 -10 -7.5 -5 -2.5 0 2.5 5
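
The nested list shown above can be queried with ordinary list indexing, for instance to check that predictors share the same time axis and grid before downscaling. The following is a minimal sketch rather than part of the package: since the NcML dataset is not available here, `inv` is a hand-built, hypothetical stand-in that reproduces only two variables and the fields used below, with values copied from the str() output. The helper `same.axis` is likewise hypothetical.

```r
# Hypothetical stand-in for the inventory above (two variables, reduced fields).
# Values are copied from the str() output; this is not the output of dataInventory.
time.values <- as.POSIXlt(seq(as.Date("1958-01-01"), by = "day", length.out = 16071))
grid.dims <- list(
  time = list(Type = "Time", Values = time.values),
  lat  = list(Type = "Lat",  Values = seq(35, 47.5, by = 2.5)),
  lon  = list(Type = "Lon",  Values = seq(-15, 5, by = 2.5))
)
level.dim <- list(level = list(Type = "Pressure", Units = "millibar", Values = 850))
inv <- list(
  T    = list(Units = "K",  Dimensions = c(level.dim, grid.dims)),
  SLPd = list(Units = "Pa", Dimensions = grid.dims)
)

# Time span covered by one variable
time.span <- range(inv$T$Dimensions$time$Values)
print(time.span)

# Dimension names of each variable (SLPd is a surface field with no level)
dim.names <- lapply(inv, function(v) names(v$Dimensions))

# Check whether every variable has the same values along a given dimension
same.axis <- function(dim.name) {
  ref <- inv[[1]]$Dimensions[[dim.name]]$Values
  all(vapply(inv, function(v) identical(v$Dimensions[[dim.name]]$Values, ref),
             logical(1)))
}
consistent <- vapply(c("time", "lat", "lon"), same.axis, logical(1))
print(consistent)
```

With a real inventory, the same expressions would apply to the object returned by dataInventory, assuming it keeps the structure printed above.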

Observational datasets (.csv format)

For observational (station) datasets stored in the standard text format, the function dataInventory takes as its argument the path to the directory where the files are stored.

> inventory.csv <- dataInventory(dataset = "datasets/observations/GSN_Iberia/", return.stats = TRUE)
> str(inventory.csv)
List of 3
 $ Stations     :List of 5
  ..$ stationIDs  : chr [1:6] "SP000008027" "SP000008181" "SP000008202" "SP000008215" ...
  ..$ stationNames: chr [1:6] "SAN SEBASTIAN - IGUELDO" "BARCELONA/AEROPUERTO" "SALAMANCA AEROPUERTO" "NAVACERRADA" ...
  ..$ Altitude    : int [1:6] 251 4 790 1894 704 90
  ..$ LonLatCoords:'data.frame':	6 obs. of  2 variables:
  .. ..$ LON: num [1:6] -2.04 2.07 -5.5 -4.01 -1.86 ...
  .. ..$ LAT: num [1:6] 43.3 41.3 41 40.8 39 ...
  ..$ timeAxis    :List of 3
  .. ..$ startDate: POSIXlt[1:1], format: "1957-01-01"
  .. ..$ endDate  : POSIXlt[1:1], format: "2012-12-31"
  .. ..$ timeStep :Class 'difftime'  atomic [1:1] 24
  .. .. .. ..- attr(*, "tzone")= chr ""
  .. .. .. ..- attr(*, "units")= chr "hours"
 $ Variables    :List of 4
  ..$ varIDs      : chr [1:3] "tmax" "tmin" "precip"
  ..$ varNames    : chr [1:3] "maximum daily temperature" "minimum daily temperature" "daily accumulated precipitation"
  ..$ units       : chr [1:3] "0.1 deg C" "0.1 deg C" "0.1 mm"
  ..$ missing.code: chr [1:3] "NaN" "NaN" "NaN"
 $ Summary.stats:List of 4
  ..$ missing.percent: num [1:6, 1:3] 2.6 0.7 0.5 2.9 2.9 7.6 2.2 2.1 3.5 2.9 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:6] "SP000008027" "SP000008181" "SP000008202" "SP000008215" ...
  .. .. ..$ : chr [1:3] "tmax" "tmin" "precip"
  ..$ min            : num [1:6, 1:3] -43 0 -52 -122 -18 0 -100 -80 -200 -203 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:6] "SP000008027" "SP000008181" "SP000008202" "SP000008215" ...
  .. .. ..$ : chr [1:3] "tmax" "tmin" "precip"
  ..$ max            : num [1:6, 1:3] 386 374 410 318 426 466 252 268 220 206 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:6] "SP000008027" "SP000008181" "SP000008202" "SP000008215" ...
  .. .. ..$ : chr [1:3] "tmax" "tmin" "precip"
  ..$ mean           : num [1:6, 1:3] 163 201 183 104 195 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:6] "SP000008027" "SP000008181" "SP000008202" "SP000008215" ...
  .. .. ..$ : chr [1:3] "tmax" "tmin" "precip"

The inventory of an observational dataset has a different structure from that of a gridded one: it is tailored to provide summary information for weather station datasets, which are typically used as predictands in downscaling applications.
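
Note that the summary statistics are expressed in the units reported under Variables (here tenths of a degree and tenths of a millimetre). As a minimal sketch, the snippet below rescales the maximum tmax values to degrees Celsius. It assumes that the leading number of the units string is the scaling factor, which is an interpretation and not something the package is known to enforce. The object `inv.obs` is a hypothetical, reduced stand-in for inventory.csv, with values copied from the outputs on this page.

```r
# Hypothetical, reduced stand-in for inventory.csv (only the fields used below)
station.ids <- c("SP000008027", "SP000008181", "SP000008202",
                 "SP000008215", "SP000008280", "SP000008410")
inv.obs <- list(
  Variables = list(
    varIDs = c("tmax", "tmin", "precip"),
    units  = c("0.1 deg C", "0.1 deg C", "0.1 mm")
  ),
  Summary.stats = list(
    # Only the tmax column of the 'max' table is reproduced here
    max = matrix(c(386, 374, 410, 318, 426, 466), ncol = 1,
                 dimnames = list(station.ids, "tmax"))
  )
)

# Units indexed by variable ID
var.units <- setNames(inv.obs$Variables$units, inv.obs$Variables$varIDs)

# Assumption: the number before the first space is the scaling factor
scale.factor <- as.numeric(sub(" .*$", "", var.units["tmax"]))

# Maximum tmax per station, converted from 0.1 deg C to deg C
tmax.max.degC <- inv.obs$Summary.stats$max[, "tmax"] * scale.factor
print(tmax.max.degC)
```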

The argument return.stats defaults to FALSE and is ignored for gridded datasets. For point observation datasets, setting return.stats = TRUE adds a list of four tables (2D matrices), with stations arranged in rows and variables in columns, giving the percentage (%) of missing data and the minimum, maximum and mean values for each station and variable over the whole time period:

> inventory.csv$Summary.stats$missing.percent
            tmax tmin precip
SP000008027  2.6  2.2    0.4
SP000008181  0.7  2.1    0.4
SP000008202  0.5  3.5    0.3
SP000008215  2.9  2.9    0.3
SP000008280  2.9 12.5    0.3
SP000008410  7.6 11.2    4.5

In this case, the largest percentage of missing data (12.5%) corresponds to the fifth station (station code SP000008280, as indicated by the row name) for the variable tmin (minimum daily temperature). The station name can be retrieved by its position:

> inventory.csv$Stations$stationNames[5]
[1] "ALBACETE LOS LLANOS"
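
Instead of reading the table by eye, the position of the largest value can be located programmatically. The sketch below rebuilds the missing.percent table from the values printed above, so that it runs without the dataset, and uses base R only. The 5% threshold and the names `worst`, `max.missing` and `flagged` are illustrative assumptions.

```r
# Missing-data percentages copied from the table above (filled column by column)
missing.percent <- matrix(
  c(2.6, 0.7, 0.5, 2.9,  2.9,  7.6,   # tmax
    2.2, 2.1, 3.5, 2.9, 12.5, 11.2,   # tmin
    0.4, 0.4, 0.3, 0.3,  0.3,  4.5),  # precip
  nrow = 6,
  dimnames = list(
    c("SP000008027", "SP000008181", "SP000008202",
      "SP000008215", "SP000008280", "SP000008410"),
    c("tmax", "tmin", "precip")
  )
)

# Row and column indices of the largest percentage of missing data
worst <- which(missing.percent == max(missing.percent), arr.ind = TRUE)
worst.station  <- rownames(missing.percent)[worst[1, "row"]]
worst.variable <- colnames(missing.percent)[worst[1, "col"]]

# Stations exceeding an illustrative 5% threshold for at least one variable
max.missing <- 5
flagged <- rownames(missing.percent)[apply(missing.percent > max.missing, 1, any)]
print(flagged)
```

Applied to a real inventory, the matrix would be taken from inventory.csv$Summary.stats$missing.percent instead of being typed in by hand.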