---
title: "HDF"
output:
  html_document:
    number_sections: true
    toc: true
    toc_float: false
    collapsed: false
editor_options:
  chunk_output_type: inline
---
```{r set-options, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=TRUE, out.width = "75%", out.height = "75%")
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
<span style="color: green;">**NOTE: This page has been revised for the 2024 version of the course, but there may be some additional edits.**</span> <br>
# Introduction #
HDF (Hierarchical Data Format, see [[https://www.hdfgroup.org]](https://www.hdfgroup.org)) is another format for storing large data files and associated metadata. It is closely related to netCDF, because netCDF-4 in fact uses the HDF5 "data model" to store data. There are currently two versions of HDF in use, HDF4 and HDF5, along with two special cases designed for handling satellite remote-sensing data (HDF-EOS and HDF-EOS5). Unlike netCDF-4, which is backward compatible with netCDF-3, the two HDF versions are not mutually readable, but there are tools for converting between them (as well as for converting HDF files to netCDF files).
HDF5 files can be read and written in R using the `rhdf5` package, which is part of the Bioconductor collection of packages. HDF4 files can also be handled via the `rgdal` package, but the process is more cumbersome. Consequently, the current standard approach for analyzing HDF4 data in R is to first convert it to HDF5 or netCDF, and proceed from there using `rhdf5`.
Compared to netCDF files, HDF files can be messy: they can contain images as well as data, and HDF5 files in particular may contain groups of datasets, which emulate in some ways the folder/directory structure on a local machine. This will become evident below.
# Preliminary set up #
The `rhdf5` package can be retrieved from Bioconductor and installed as follows. A number of dependent packages will usually need to be installed as well, but the process is generally straightforward.
```{r install_rhdf5, eval=FALSE, echo=TRUE, message=FALSE}
# install BiocManager (if necessary), then rhdf5
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("rhdf5", version = "3.18", force = TRUE)
```
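With the package installed, here is a minimal sketch (not run here) of what writing and reading a small HDF5 file with `rhdf5` looks like; the file name `example.h5` and the data are just placeholders:
```{r rhdf5-sketch, eval=FALSE}
# minimal rhdf5 round trip (hypothetical file and data)
library(rhdf5)
h5createFile("example.h5")                       # create an empty HDF5 file
h5createGroup("example.h5", "group1")            # add a group
h5write(matrix(rnorm(12), nrow = 3), "example.h5", "group1/test_matrix")
h5ls("example.h5")                               # list the contents
m <- h5read("example.h5", "group1/test_matrix")  # read the dataset back
h5closeAll()
```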
The examples below use data from TEMIS, the Tropospheric Emission Monitoring Internet Service ([[http://www.temis.nl/index.php]](http://www.temis.nl/index.php)), and the Global Fire Emissions Database (GFED, [[https://www.globalfiredata.org/index.html]](https://www.globalfiredata.org/index.html)). The TEMIS data are HDF4 and need to be converted, while the GFED data are HDF5. The specific data sets can be found on the ClimateLab SFTP server.
The various conversion tasks and tools are:
- HDF4 to HDF5: `h4toh5convert` from the HDFGroup [[h4h5tools]](https://portal.hdfgroup.org/downloads/h4h5tools/h4h5tools_2_2_5.html)
- HDF4 to netCDF: `h4tonccf` (netCDF-3), and `h4tonccf_nc4` (netCDF-4), from HDF-EOS [[http://hdfeos.org/software/h4cflib.php]](http://hdfeos.org/software/h4cflib.php)
- HDF4 and HDF5 (and other file types) to netCDF: `ncl_convert2nc` from Unidata [[https://www.ncl.ucar.edu/]](https://www.ncl.ucar.edu/overview.shtml)
The first two conversion tools have installers for MacOS, Windows, and Linux. On the Mac, it will usually be the case that the executable file should be moved to the /Applications folder, while on Windows the usual fooling around with paths will be required. The `ncl_convert2nc` tool is part of NCL, which can be installed and run on Windows 10/11 machines using the Windows Subsystem for Linux (WSL2).
There are some useful tutorials on using HDF in R from the NSF NEON program:
- [[intro-hdf5-R]](http://neonscience.github.io/neon-data-institute-2016//R/intro-hdf5-R/)
- [[hdf5-intro-r]](https://www.neonscience.org/hdf5-intro-r)
- [[HDF5]](https://github.com/NEONScience/NEON-Data-Skills/tree/master/code/HDF5)
Begin by loading the appropriate packages:
```{r libraries, message=FALSE}
# load packages
library(sf)
library(rhdf5)
library(terra)
library(RColorBrewer)
```
For later plotting of the data, also read a world outline file. The file(s) can be copied from the class RESS server, or downloaded from the folder at [[https://pages.uoregon.edu/bartlein/RESS/shp_files/world2015/]](https://pages.uoregon.edu/bartlein/RESS/shp_files/world2015/). Clicking on this link should open a file listing on the server, from which you can download the files. Alternatively, you could use FileZilla to download the files from the course SFTP server. Note that shape files consist of multiple files, e.g. `*.shp`, `*.shx`, `*.dbf`, `*.prj`, `*.cpg`, and all of them must be downloaded.
Move the downloaded files to a convenient folder, e.g. `/Users/bartlein/Projects/RESS/data/shp_files/world2015/`, and plot the outlines.
```{r hdf5-intro-1}
# plot the world outline
shp_file <- "/Users/bartlein/Projects/RESS/data/shp_files/world2015/UIA_World_Countries_Boundaries.shp"
world_outline <- as(st_geometry(read_sf(shp_file)), Class = "Spatial")
plot(world_outline, col="lightblue", lwd=1)
```
Convert the `world_outline` to a spatial vector for adding to `terra` maps later.
```{r convert_outline}
# convert world_outline to a spatial vector
world_otl <- vect(world_outline)
class(world_otl)
plot(world_otl)
```
## Convert an HDF4 file to HDF5
Here's a link to the `.hdf` file: [[https://pages.uoregon.edu/bartlein/RESS/hdf_files/uvief20190423.hdf]](https://pages.uoregon.edu/bartlein/RESS/hdf_files/uvief20190423.hdf). Download the file and copy it to a convenient place, say a folder `/hdf_files` in your data folder (e.g. `/Users/bartlein/Projects/RESS/data/hdf_files/`).
In this example, a data set of the "Clear-Sky UV Index" from TEMIS was downloaded (i.e. `uvief20190423.hdf` -- 2019-04-23) from the website listed above. This is an HDF4 file, which can be verified in Panoply. Viewing files in Panoply is always good practice -- in this case the particular layout of the data in the file is rather ambiguous at first glance, and so getting a look at the contents of the file before attempting to read it is a good idea.
The HDF4 file can be converted by typing in a terminal or command window (for example):
```{r convert, echo=TRUE, eval=FALSE}
h4toh5convert uvief20190423.hdf
# or, e.g.: /Users/bartlein/Downloads/hdf/HDF_Group/H4H5/2.2.5/bin/h4toh5convert GFED4.0_DQ_2015270_BA.hdf
# if the h4toh5convert.exe program isn't in the PATH
```
This will create the file `uvief20190423.h5`. (Note that an explicit output filename and path can be included.) Not everything in the HDF4 file will be converted, and it's likely that a number of error messages will appear, giving the impression that the operation has failed. Check (via Panoply) whether the HDF5 file was indeed created and has something in it.
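One quick way to check from within R (assuming the converted file has been moved to the `hdf_files` data folder used below) is to confirm that the file exists and that `h5ls()` finds something in it:
```{r check-conversion, eval=FALSE}
# verify that the converted HDF5 file exists and has content (example path)
h5_file <- "/Users/bartlein/Projects/RESS/data/hdf_files/uvief20190423.h5"
file.exists(h5_file)
nrow(rhdf5::h5ls(h5_file))  # number of objects found in the file
```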
# Read a simple HDF5 file #
Begin by setting paths and filenames:
```{r hdf5-intro-2}
# set path and filename
hdf_path <- "/Users/bartlein/Projects/RESS/data/hdf_files/"
hdf_name <- "uvief20190423.h5"
hdf_file <- paste(hdf_path, hdf_name, sep="")
```
List the contents of the file. (WARNING: This can produce a lot of output)
```{r listfile1}
# list the file content
h5ls(hdf_file)
```
Read the attributes of the `UVI_field` variable, the UV index:
```{r readattributes}
# read the attributes of UVI_field
h5readAttributes(hdf_file, name = "UVI_field")
```
Get the latitudes and longitudes of the grid:
```{r getlatlon1}
# get lons and lats
lon <- h5read(hdf_file, "/Longitudes")
nlon <- length(lon)
nlon
head(lon); tail(lon)
lat <- h5read(hdf_file, "/Latitudes")
nlat <- length(lat)
nlat
head(lat); tail(lat)
```
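It is also worth verifying that the grid is regular; a quick check is to look at the unique increments of the coordinate vectors:
```{r check-grid, eval=FALSE}
# check that the longitude and latitude increments are constant
unique(round(diff(lon), 6))
unique(round(diff(lat), 6))
```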
Get the UV data, convert it to a `SpatRaster` object, and take a look at a short summary. Note that the UVI data are in layer 3 of the multi-layer raster.
```{r read_UVI1}
hdf_file <- "/Users/bartlein/Projects/RESS/data/hdf_files/uvief20190423.h5"
UVI_in <- rast(hdf_file, lyrs = 3)
UVI_in
plot(UVI_in)
```
Notice that the raster has an unknown extent (i.e. the range of longitudes and latitudes), and that it also appears to be upside down. Flip the raster, and add the longitude and latitude extent.
```{r hdf5-intro-3}
# flip the raster
UVI <- flip(UVI_in, direction = "vertical")
# add the spatial extent
ext(UVI) <- c(-179.875, 179.875, -89.875, 89.875)
UVI
```
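As an alternative to typing in the extent by hand, it could be built from the `lon` and `lat` vectors read earlier, assuming (as appears to be the case here) that those values are cell-center coordinates matching the extent above:
```{r extent-from-coords, eval=FALSE}
# set the extent from the coordinate vectors rather than hard-coding it
ext(UVI) <- c(min(lon), max(lon), min(lat), max(lat))
```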
Plot the raster:
```{r hdf5-intro-4}
plot(UVI)
```
Looks ok. Remove the input raster object:
```{r hdf5-intro-5}
# remove the input raster object
remove(UVI_in)
```
Note that the data could also be read directly as a matrix (a 2-d array):
```{r readh1}
# read the UV data
h1 <- h5read(hdf_file, "/UVI_field")
class(h1); str(h1)
```
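Comparing the dimensions of the array with the lengths of the coordinate vectors shows how the data are laid out, and hence whether the array would need transposing before being turned into a raster (a quick check, assuming `h1`, `nlon`, and `nlat` are still in the workspace):
```{r check-dims, eval=FALSE}
# compare the array dimensions with the coordinate-vector lengths
dim(h1)
c(nlon, nlat)
```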
Remove the array:
```{r hdf5-intro-6}
# remove the array
remove(h1)
```
## Plot the UV data
A plot with overlain outlines can be produced as follows:
```{r rasterPlot}
color_fun <- colorRampPalette(brewer.pal(9,"YlOrRd"))
plot(UVI, col = color_fun(50), main = "UV Index", ylab = "Latitude", xlab = "Longitude")
plot(world_otl, add = TRUE)
```
The `colorRampPalette()` function creates another function, `color_fun()`, that can be used to interpolate a color scale or ramp, here the `RColorBrewer` "YlOrRd" scale. When supplied to the `col` argument of `plot()`, `color_fun(50)` creates a color ramp with 50 steps.
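For example, calling the new function directly shows what it returns, a vector of interpolated hex color codes of whatever length is requested:
```{r color-fun-example, eval=FALSE}
# color_fun is itself a function; calling it returns n interpolated colors
color_fun(5)
length(color_fun(50))
```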
Close the open files before moving on:
```{r close1}
# close open HDF5 files
h5closeAll()
```
# Read a more complicated HDF5 file #
A much more structured file, one that exploits the ability of HDF5 to store data in "groups" that mimic in some ways the directory or folder structure on a local computer, is represented by the GFED4.1 fire data set. Here's a link to the file: [[https://pages.uoregon.edu/bartlein/RESS/hdf_files/GFED4.1s_2016.hdf5]](https://pages.uoregon.edu/bartlein/RESS/hdf_files/GFED4.1s_2016.hdf5). Download the file as usual. Begin by setting paths and file names.
```{r hdf5-intro-8}
# set path
hdf_path <- "/Users/bartlein/Projects/RESS/data/hdf_files/"
hdf_name <- "GFED4.1s_2016.hdf5"
hdf_file <- paste(hdf_path, hdf_name, sep="")
```
List the file contents (DOUBLE WARNING: This really does produce a lot of output; only some of it is included here):
```{r list2, eval=FALSE, echo=TRUE}
# list the file contents
h5ls(hdf_file) # may create a lot of output
```
```{r output_listing, eval=FALSE, echo=TRUE}
## group name otype dclass dim
## 0 / ancill H5I_GROUP
## 1 /ancill basis_regions H5I_DATASET INTEGER 1440 x 720
## 2 /ancill grid_cell_area H5I_DATASET FLOAT 1440 x 720
## 3 / biosphere H5I_GROUP
## 4 /biosphere 01 H5I_GROUP
## 5 /biosphere/01 BB H5I_DATASET FLOAT 1440 x 720
## 6 /biosphere/01 NPP H5I_DATASET FLOAT 1440 x 720
## 7 /biosphere/01 Rh H5I_DATASET FLOAT 1440 x 720
## 8 /biosphere 02 H5I_GROUP
## 9 /biosphere/02 BB H5I_DATASET FLOAT 1440 x 720
## 10 /biosphere/02 NPP H5I_DATASET FLOAT 1440 x 720
## 11 /biosphere/02 Rh H5I_DATASET FLOAT 1440 x 720
...
## 52 / burned_area H5I_GROUP
## 53 /burned_area 01 H5I_GROUP
## 54 /burned_area/01 burned_fraction H5I_DATASET FLOAT 1440 x 720
## 55 /burned_area/01 source H5I_DATASET INTEGER 1440 x 720
## 56 /burned_area 02 H5I_GROUP
## 57 /burned_area/02 burned_fraction H5I_DATASET FLOAT 1440 x 720
## 58 /burned_area/02 source H5I_DATASET INTEGER 1440 x 720
## 59 /burned_area 03 H5I_GROUP
...
```
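Because `h5ls()` returns an ordinary data frame, the listing can be trimmed with standard subsetting if the full output is more than is needed; a sketch:
```{r list2-trimmed, eval=FALSE}
# h5ls() returns a data frame, so standard subsetting applies
listing <- h5ls(hdf_file)
head(listing, 20)                          # first 20 entries only
listing[listing$otype == "H5I_DATASET", ]  # dataset entries only
```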
Read the longitudes and latitudes. A preliminary look at the file in Panoply revealed that the longitudes and latitudes are stored as full matrices, as opposed to vectors. Consequently, after reading, only the first column of the longitude matrix and the first row of the latitude matrix are kept (and the latitudes are reversed so they run from south to north).
```{r readlatlon2}
# read lons and lats
lon_array <- h5read(hdf_file, "lon")
lon <- lon_array[,1]
nlon <- length(lon)
nlon
head(lon); tail(lon)
lat_array <- h5read(hdf_file, "lat")
lat <- lat_array[1,]
nlat <- length(lat)
nlat
lat <- lat[nlat:1]
head(lat); tail(lat)
```
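With 1440 columns and 720 rows covering the globe, the implied grid spacing (0.25 degrees) can be confirmed directly:
```{r gfed-resolution, eval=FALSE}
# implied spacing of the GFED grid
360 / nlon; 180 / nlat
diff(lon)[1]; diff(lat)[1]
```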
Read the attributes of the `burned_area` variable:
```{r readattr2}
# read the attributes of the burned_area variable
h5readAttributes(hdf_file, name = "/burned_area/07/burned_fraction/")
```
From the listing of the contents of the file, and the information provided in Panoply, it can be seen that, for example, the burned fraction (the proportion of a grid cell burned), for July 2016 is in the variable named `/burned_area/07/burned_fraction`. Get that variable:
```{r readh2}
# read a raster slice
h2 <- h5read(hdf_file, "/burned_area/07/burned_fraction")
class(h2); str(h2)
```
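Because the monthly groups follow a consistent naming pattern (`/burned_area/01/`, `/burned_area/02/`, ...), all twelve months could be read in a loop; a sketch, assuming every month is present in the file:
```{r read-all-months, eval=FALSE}
# read the burned fraction for all 12 months into a named list (sketch)
months <- sprintf("%02d", 1:12)
bf_list <- lapply(months, function(mm) {
  h5read(hdf_file, paste0("/burned_area/", mm, "/burned_fraction"))
})
names(bf_list) <- months
length(bf_list); dim(bf_list[[1]])
```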
The distribution of the burned fraction data is extremely long-tailed, meaning there are a lot of small values and few large ones:
```{r hist1}
hist(h2, breaks = 40)
```
The usual strategy for dealing with such data is to log transform it:
```{r hist2}
h2[h2 == 0.0] <- NA
h2 <- log10(h2)
hist(h2, breaks = 40)
```
Convert the log-transformed array to a `SpatRaster` object:
```{r hist3}
BF <- rast(h2)
BF
```
Take a quick look:
```{r image2}
# terra plot
plot((BF))
```
The plot, confirmed by Panoply, suggests that the raster needs to be transposed.
```{r transpose_BF}
# transpose the raster
BF <- t(BF)
BF
plot(BF)
```
As was the case with the UV Index data, the rasterized input data (`BF`) do not contain spatial extent information. This can be obtained from the longitude and latitude values read in above using the `h5read()` function.
```{r hdf5-intro-9}
# add the spatial extent
ext(BF) <- c(lon[1], lon[nlon], lat[1], lat[nlat])
BF
plot(BF)
```
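Strictly speaking, a raster extent runs to the outer edges of the cells rather than to the cell centers; for the 0.25-degree GFED grid the extent could alternatively be set half a cell beyond the centers (an optional refinement, not used above):
```{r extent-edges, eval=FALSE}
# set the extent to cell edges, half a 0.25-degree cell beyond the centers
half_cell <- 0.25 / 2
ext(BF) <- c(lon[1] - half_cell, lon[nlon] + half_cell,
             lat[1] - half_cell, lat[nlat] + half_cell)
```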
## Plot the GFED burned fraction data
Plot the data:
```{r rasterPlot2}
color_fun <- colorRampPalette(brewer.pal(9,"YlOrRd"))
plot(BF, y = 1, col = color_fun(50), main = "log10 Burned Fraction", ylab = "Latitude", xlab = "Longitude")
plot(world_otl, add = TRUE)
```
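As a possible final step, the processed raster could be written out for later use; the output filename here is just an example:
```{r write-bf, eval=FALSE}
# write the log10 burned-fraction raster to a GeoTIFF (example filename)
writeRaster(BF, "GFED4.1s_2016_07_log10_burned_fraction.tif", overwrite = TRUE)
```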