Move gdalinfo call to executor #948

EmileSonneveld · 2024-11-25T11:05:55Z

Currently, gdalinfo is called on output assets in the driver. For GTiff output on S3, the assets were written on an executor and need to be downloaded again in the driver.
With a fuse mount this happens implicitly; with direct S3 access it happens explicitly here:

openeo-geopyspark-driver/openeogeotrellis/integrations/gdal.py, lines 177 to 182 in 88ab283:

    if not abs_asset_path.exists() and asset_href.startswith("s3://"):
        try:
            abs_asset_path.parent.mkdir(parents=True, exist_ok=True)
            with open(abs_asset_path, "wb") as f:
                for chunk in stream_s3_binary_file_contents(asset_href):
                    f.write(chunk)

Moving gdalinfo to the executor and passing the info on would avoid this extra download.
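
To make the idea concrete, here is a rough Python sketch of what the executor-side step could look like: run gdalinfo on the file while it is still local, and return only a small dict. The function names (`run_gdalinfo_json`, `describe_written_asset`) and the shape of the returned dict are hypothetical and not part of the current code base; the sketch also assumes the `gdalinfo` CLI is available on the executor.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable


def run_gdalinfo_json(path: Path) -> dict:
    """Run the `gdalinfo` CLI with JSON output on a local file."""
    result = subprocess.run(
        ["gdalinfo", "-json", str(path)],
        capture_output=True,
        text=True,
        check=True,  # raise if gdalinfo exits with a non-zero status
    )
    return json.loads(result.stdout)


def describe_written_asset(
    path: Path,
    info_fn: Callable[[Path], dict] = run_gdalinfo_json,
) -> dict:
    """Hypothetical executor-side step, called right after the asset is written.

    Only this small dict would travel back to the driver, not the raster itself.
    """
    info = info_fn(path)
    return {
        "path": str(path),
        "size": info.get("size"),  # gdalinfo reports [width, height]
        "band_count": len(info.get("bands", [])),
        "gdalinfo": info,
    }
```

`info_fn` is injectable so the step can be exercised without GDAL installed, for example with a stub that returns a fixed dict.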

This might avoid OOMs like the one in #809,
and it would have avoided the log deadlock in #906.

cc @jdries

jdries · 2024-11-25T11:47:32Z

This requires that the Scala code makes the gdalinfo call, and also that we have a way to pass the resulting metadata back to the driver.
This could perhaps be achieved by assembling the STAC JSON files on the executors already.
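
A minimal sketch of that hand-off, written in Python purely for illustration (the real call would live in the Scala code, as noted above). It assumes the executor writes a small STAC-like JSON file next to each asset and the driver later reads only those files. The function names, the `<asset>.json` sidecar naming and the selection of fields are all assumptions, and the item is not meant to be a complete, valid STAC item.

```python
import json
from pathlib import Path
from typing import List


def write_sidecar_item(asset_path: Path, asset_href: str, gdalinfo: dict) -> Path:
    """Executor side (hypothetical): store metadata in a JSON file next to the asset."""
    width, height = gdalinfo["size"]  # gdalinfo JSON reports [width, height]
    item = {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": asset_path.stem,
        # proj:shape is ordered [height, width] in the STAC projection extension
        "properties": {"proj:shape": [height, width]},
        "assets": {
            asset_path.name: {
                "href": asset_href,
                "band_count": len(gdalinfo.get("bands", [])),  # illustrative field
            }
        },
    }
    sidecar = asset_path.with_name(asset_path.name + ".json")
    sidecar.write_text(json.dumps(item))
    return sidecar


def read_sidecar_items(directory: Path) -> List[dict]:
    """Driver side (hypothetical): collect the small JSON files, not the rasters."""
    return [json.loads(p.read_text()) for p in sorted(directory.glob("*.json"))]
```

With direct S3 access the sidecar would be uploaded alongside the asset, so the driver would fetch a few kilobytes of JSON per asset instead of the full GTiff.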
