WFS paging and parallelization support #70

Open
salvafern opened this issue Mar 29, 2022 · 7 comments

@salvafern

salvafern commented Mar 29, 2022

Hi @eblondel ,

I have been trying out ows4R to query biological occurrence data from EMODnet-Biology.

In the example below, I requested the CPR dataset, restricted to the North Sea and to Calanus finmarchicus.

I got the WFS request from the EMODnet-Biology download toolbox (at the end of the selection, you can copy the WFS request under "Get webservice url").

The good news is that viewParams passed via vendor params work like a charm! (Although I have to watch out for the encoding, see lifewatch/eurobis#15 (comment).)
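
The encoded viewParams strings used in this issue can also be built from their readable form. A minimal sketch using base R's `utils::URLencode()`; `encode_view_params()` is a hypothetical helper (not part of ows4R), and writing spaces as `+` is an assumption copied from the URL that the download toolbox produces:

```r
# Hypothetical helper, not part of ows4R: percent-encode a readable
# viewParams string and write spaces as "+", as the download toolbox does.
encode_view_params <- function(x) {
  # reserved = TRUE also encodes ":", ";", "(", ")", "&", "[" and "]"
  encoded <- utils::URLencode(x, reserved = TRUE)
  gsub("%20", "+", encoded, fixed = TRUE)
}

params <- encode_view_params("where:datasetid IN (8020)")
print(params)
```

Note that `URLencode()` returns its input unchanged when the string already looks percent-encoded, so the helper should only be given the readable form.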

However, I am having trouble with the paging and parallel options. After some debugging, I think the issue might be that ows4R relies on an attribute named numberMatched when using resultType = "hits" at: https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L240

This attribute is not being returned by geo.vliz.be (it should happen around: https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L291)

Could you have a look and see what is happening?

Thanks a lot!

# Example: get the CPR dataset for the North Sea and Calanus finmarchicus

library(ows4R)
library(parallel)

# URL as provided by download toolbox
url_download_toolbox <- "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal%3Aeurobis-obisenv_basic&resultType=results&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&propertyName=datasetid%2Cdatecollected%2Cdecimallatitude%2Cdecimallongitude%2Ccoordinateuncertaintyinmeters%2Cscientificname%2Caphiaid%2Cscientificnameaccepted&outputFormat=csv"
URLdecode(url_download_toolbox)
#> [1] "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal:eurobis-obisenv_basic&resultType=results&viewParams=where:((up.geoobjectsids+&&+ARRAY[2350]))+AND+datasetid+IN+(216);context:0100;aphiaid:104464&propertyName=datasetid,datecollected,decimallatitude,decimallongitude,coordinateuncertaintyinmeters,scientificname,aphiaid,scientificnameaccepted&outputFormat=csv"

# Only params
params <- "where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464"
URLdecode(params)
#> [1] "where:((up.geoobjectsids+&&+ARRAY[2350]))+AND+datasetid+IN+(216);context:0100;aphiaid:104464"

# Create wfs client and find feature
wfs <- WFSClient$
  new("https://geo.vliz.be/geoserver/Dataportal/wfs", "1.1.0", logger = "INFO")$
  getCapabilities()$
  findFeatureTypeByName("Dataportal:eurobis-obisenv_basic")
#> [ows4R][INFO] OWSGetCapabilities - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&request=GetCapabilities

# Create cluster
cl <- makeCluster(detectCores() - 1)

# Perform tests: around 20K rows
system.time(feature_only_viewparams <- wfs$getFeatures(viewParams = params, resultType="results"))
#> [ows4R][INFO] WFSDescribeFeatureType - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&request=DescribeFeatureType 
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=results&request=GetFeature
#>    user  system elapsed 
#>   0.990   0.100   3.712

system.time(feature_pagination <- wfs$getFeatures(viewParams = params, paging = TRUE, paging_length = 1000))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resulttype=hits&request=GetFeature
#> Error in seq.default(from = 0, to = numberMatched, by = paging_length): 'to' must be of length 1
#> Timing stopped at: 0.09 0.001 0.678

system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=results&request=GetFeature
#>    user  system elapsed 
#>   0.986   0.088   3.429

# Debugging pagination
nft <- wfs$getFeatures(viewParams = params, resultType="hits")
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=1.1.0&typeName=Dataportal:eurobis-obisenv_basic&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464&resultType=hits&request=GetFeature
names(nft)
#> [1] "numberOfFeatures" "timeStamp"

"numberMatched" %in% names(nft)
#> [1] FALSE

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.6 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>   [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#> [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#> [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#>   [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>   [1] httr_1.4.2    reprex_2.0.1  ows4R_0.2-1   keyring_1.3.0 geometa_0.6-6
#> 
#> loaded via a namespace (and not attached):
#>   [1] tinytex_0.35       tidyselect_1.1.1   xfun_0.28          purrr_0.3.4       
#> [5] sf_0.9-4           lattice_0.20-41    vctrs_0.3.8        generics_0.1.0    
#> [9] htmltools_0.5.0    yaml_2.2.1         utf8_1.2.2         XML_3.99-0.3      
#> [13] rlang_0.4.11       e1071_1.7-3        pillar_1.6.3       glue_1.4.2        
#> [17] withr_2.4.2        DBI_1.1.1          bit64_4.0.5        sp_1.4-6          
#> [21] lifecycle_1.0.1    evaluate_0.14      knitr_1.29         tzdb_0.1.2        
#> [25] callr_3.7.0        ps_1.6.0           curl_4.3           class_7.3-17      
#> [29] fansi_0.5.0        highr_0.8          Rcpp_1.0.7         readr_2.0.2       
#> [33] KernSmooth_2.23-17 openssl_1.4.2      classInt_0.4-3     vroom_1.5.5       
#> [37] jsonlite_1.7.0     bit_4.0.4          fs_1.5.0           hms_1.1.1         
#> [41] askpass_1.1        digest_0.6.25      processx_3.5.2     dplyr_1.0.7       
#> [45] grid_3.6.3         rgdal_1.5-12       cli_3.0.1          tools_3.6.3       
#> [49] magrittr_2.0.1     tibble_3.1.5       crayon_1.4.1       pkgconfig_2.0.3   
#> [53] ellipsis_0.3.2     assertthat_0.2.1   rmarkdown_2.11     rstudioapi_0.13   
#> [57] R6_2.5.1           units_0.6-7        compiler_3.6.3   

Created on 2022-03-29 by the reprex package (v2.0.1)

This issue partly follows up on #29.

@eblondel
Owner

@salvafern make sure to use WFS version 2.0.0. AFAIK pagination is only supported in WFS 2.0, and I see you used 1.1.0.
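
The two versions also report the count differently in a `resultType=hits` response: WFS 1.1.0 puts it in a `numberOfFeatures` attribute (as the reprex above shows), while WFS 2.0.0 uses `numberMatched`. A minimal base-R sketch that accepts either one; `hits_count()` is a hypothetical helper and not ows4R's implementation, and the sample responses are abbreviated, with a made-up count for the 1.1.0 case:

```r
# Hypothetical helper, not part of ows4R: read the feature count from a
# resultType=hits response supplied as a single character string.
hits_count <- function(xml_text) {
  # Prefer the WFS 2.0.0 attribute, then fall back to the WFS 1.1.0 one
  for (attr_name in c("numberMatched", "numberOfFeatures")) {
    pattern <- paste0(attr_name, "=\"([0-9]+)\"")
    m <- regmatches(xml_text, regexec(pattern, xml_text))[[1]]
    if (length(m) == 2) return(as.numeric(m[2]))
  }
  NA_real_  # no numeric count found (for example numberMatched="unknown")
}

# Abbreviated sample responses (assumed shapes; the 1.1.0 count is made up)
wfs_110 <- '<wfs:FeatureCollection numberOfFeatures="20000" timeStamp="2022-03-29T00:00:00Z"/>'
wfs_200 <- '<wfs:FeatureCollection numberMatched="408603" numberReturned="0"/>'

print(hits_count(wfs_110))
print(hits_count(wfs_200))
```

Returning `NA` rather than `NULL` makes a missing count easier to detect before it reaches a call such as `seq()`.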

@eblondel
Owner

Try with setting version 2.0.0 like this:

wfs <- WFSClient$
  new("https://geo.vliz.be/geoserver/Dataportal/wfs", "2.0.0", logger = "INFO")$
  getCapabilities()$
  findFeatureTypeByName("Dataportal:eurobis-obisenv_basic")

params <- "where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+datasetid+IN+%28216%29%3Bcontext%3A0100%3Baphiaid%3A104464"

# With pagination
system.time(feature_pagination <- wfs$getFeatures(viewParams = params, paging = TRUE, paging_length = 1000))

I just tested the pagination and it worked.
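
For readers unfamiliar with WFS 2.0 paging: each page is requested with the standard `startIndex` and `count` parameters, and the number of pages follows from the `numberMatched` value of a prior hits request. A rough sketch of that split; `page_urls()` and the base URL are illustrative assumptions and do not reproduce ows4R's internal code:

```r
# Hypothetical helper, not part of ows4R: build one GetFeature URL per page.
page_urls <- function(base_url, number_matched, paging_length) {
  # Fail early if the count is missing (NULL), instead of failing inside seq()
  stopifnot(length(number_matched) == 1, number_matched >= 0, paging_length > 0)
  if (number_matched == 0) return(character(0))
  # Zero-based offsets of the first feature on each page
  starts <- seq(from = 0, to = number_matched - 1, by = paging_length)
  # "%.0f" avoids scientific notation such as "4e+05" in the URL
  sprintf("%s&startIndex=%.0f&count=%.0f", base_url, starts, paging_length)
}

# Placeholder endpoint and layer name, for illustration only
base_url <- "https://example.org/wfs?service=WFS&version=2.0.0&request=GetFeature&typeNames=ns:layer"
urls <- page_urls(base_url, number_matched = 408603, paging_length = 10000)
print(length(urls))
print(urls[length(urls)])
```

The explicit length check is what separates a clear error message from the `'to' must be of length 1` failure seen with WFS 1.1.0.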

eblondel self-assigned this Mar 30, 2022

@salvafern
Author

Indeed, now it works, thanks a lot!
I was using v1.1.0 to copy what the download toolbox did, but I guess there's no harm in using v2.0.0.

I have now also tried the parallel options:

Using parallelization and pagination together

I am probably doing something wrong. I expected one request to be made per chunk, but I just ran into an error.

library(ows4R)
library(parallel)

wfs <- WFSClient$
  new("https://geo.vliz.be/geoserver/Dataportal/wfs", "2.0.0", logger = "INFO")$
  getCapabilities()$
  findFeatureTypeByName("Dataportal:eurobis-obisenv_basic")

# Querying dataset: https://www.emodnet-biology.eu/data-catalog?module=dataset&dasid=8020
# ~500K rows
params <- "where%3Adatasetid+IN+%288020%29"

# With pagination and parallelization
cl <- makeCluster(detectCores() - 1)
cl
#> socket cluster with 15 nodes on host ‘localhost’

debug(wfs$getFeatures)
system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                paging = TRUE, paging_length = 10000,
                                                parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  : 
#>   No layers in datasource.
#> Timing stopped at: 0.023 0 11.45

Via debug() I can see that, at some point, the response to a request of type 'hits' is read with sf::st_read(), which of course fails. This happens at https://github.com/eblondel/ows4R/blob/master/R/WFSFeatureType.R#L328

The response in destfile looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<wfs:FeatureCollection
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xmlns:fes="http://www.opengis.net/fes/2.0"
	xmlns:wfs="http://www.opengis.net/wfs/2.0"
	xmlns:gml="http://www.opengis.net/gml/3.2"
	xmlns:ows="http://www.opengis.net/ows/1.1"
	xmlns:xlink="http://www.w3.org/1999/xlink"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" numberMatched="408603" numberReturned="0" timeStamp="2022-03-31T07:57:57.251Z" xsi:schemaLocation="http://www.opengis.net/wfs/2.0 http://schemas.opengis.net/wfs/2.0/wfs.xsd"/>
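
Because this response declares `numberReturned="0"` and contains no members, there is no layer for `sf::st_read()` to read. One possible guard, sketched in base R, is to check the downloaded file before reading it; `has_features()` is a hypothetical helper, and the plain-text check on `numberReturned` is a simplification rather than the way ows4R handles this:

```r
# Hypothetical helper, not part of ows4R: TRUE unless the file is a
# hits-only response that reports zero returned features.
has_features <- function(path) {
  xml_text <- paste(readLines(path, warn = FALSE), collapse = "\n")
  !grepl("numberReturned=\"0\"", xml_text, fixed = TRUE)
}

# Write an abbreviated copy of the response above to a temporary file
destfile <- tempfile(fileext = ".xml")
writeLines('<wfs:FeatureCollection numberMatched="408603" numberReturned="0"/>', destfile)

if (has_features(destfile)) {
  message("response contains features: safe to pass to sf::st_read()")
} else {
  message("hits-only response: nothing to read")
}
```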

Using only parallelization

I compared a run without parallelization against parallelization with mclapply and with parLapply, but I am not seeing any improvement in performance. Perhaps it needs pagination as well?

# Neither pagination nor parallelization
system.time(feature <- wfs$getFeatures(viewParams = params, resultType="results"))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature 
#> user  system elapsed 
#> 26.718   2.080  67.476

# Parallelization parLapply
system.time(feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                parallel = TRUE, parallel_handler = parallel::parLapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature 
#> user  system elapsed 
#> 27.457   2.477  65.883

# Parallelization mclapply
system.time(feature_parallel2 <- wfs$getFeatures(viewParams = params, resultType="results", 
                                                 parallel = TRUE, parallel_handler = parallel::mclapply, cl = cl))
#> [ows4R][INFO] WFSGetFeature - Fetching https://geo.vliz.be/geoserver/Dataportal/wfs?service=WFS&version=2.0.0&typeNames=Dataportal:eurobis-obisenv_basic&viewParams=where%3Adatasetid+IN+%288020%29&resultType=results&request=GetFeature 
#> user  system elapsed 
#> 26.226   2.274  63.895 

Many thanks again for the help! Let me know if there is anything I can do.

@eblondel
Owner

Yes, it sounds like there are issues with the parallelization. I will have a look asap.

@eblondel
Owner

If you want to use the cluster approach, you can use the parallel::parLapply handler, which works with a cluster. mclapply apparently can't work because I didn't allow specifying the extra arguments this handler needs.
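
The two handlers take their arguments in a different order: `parallel::parLapply(cl, X, fun, ...)` expects the cluster first, while `parallel::mclapply(X, FUN, ..., mc.cores)` forks processes and takes no cluster. A small sketch of a dispatcher that copes with both; `run_pages()` is hypothetical and not ows4R code, and squaring numbers stands in for fetching one page per task:

```r
library(parallel)

# Hypothetical dispatcher, not part of ows4R: call a handler with or
# without a cluster depending on whether it declares a `cl` argument.
run_pages <- function(pages, fun, handler = lapply, cl = NULL, ...) {
  if ("cl" %in% names(formals(handler))) {
    handler(cl, pages, fun, ...)   # cluster-based handlers such as parLapply
  } else {
    handler(pages, fun, ...)       # lapply-like handlers such as mclapply
  }
}

square <- function(i) i^2  # stand-in for "download page i"

# Cluster-based handler: needs a cluster created by makeCluster()
cl <- makeCluster(2)
res_cluster <- run_pages(1:4, square, handler = parLapply, cl = cl)
stopCluster(cl)

# Fork-based handler: mc.cores = 1 keeps the call portable, since
# forking with more than one core is not available on Windows
res_fork <- run_pages(1:4, square, handler = mclapply, mc.cores = 1)

print(unlist(res_cluster))
print(unlist(res_fork))
```

Either handler can only help when there are several tasks to distribute, which may explain why the timings above, taken with a single unpaged request, show no speed-up.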

@salvafern
Author

I got the same error :(

feature_parallel <- wfs$getFeatures(viewParams = params, resultType="results", 
                                    paging = TRUE, paging_length = 10000,
                                    parallel = TRUE, parallel_handler = parallel::parLapply, cl = cl)
#> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  : 
#>   No layers in datasource.

eblondel added the bug label Apr 1, 2022

@eblondel
Owner

eblondel commented Apr 8, 2022

@salvafern I haven't forgotten this. I have started working on it, but I am still looking into the best way to fix the parallel handlers.
