
Graphfcns #31

Merged
merged 46 commits into from
Feb 28, 2024

Conversation

barneydobson
Collaborator

@barneydobson barneydobson commented Jan 26, 2024

Description

This PR will be used to implement all of the graph functions, addressing #8 (part of #19). Examples have been discussed in #10, and this PR will address the 'model parameters' part of that issue.

Design

I have implemented some functions and parameters to demonstrate how we might do this, but I wanted @dalonsoa and @cheginit to agree on the design before I do all of them (there are many, so I don't want to implement them all and then have to change them...). As a reminder, these functions are expected to be called like this (example from #10):

# This kind of behaviour would be in generate_network.py
import networkx as nx

custom_parameters = {'outlet_derivation': {'max_river_length': 50.0}}
params = validate_parameters(custom_parameters)

G = nx.Graph()
function_list = ['double_directed', 'split_long_edges']
for function in function_list:
    assert function in graphfcns, f"Function {function} not registered in graphfcns."
    G = graphfcns[function](G, **params)

(We may also need to handle addresses separately in cases where a raster or downloaded data is needed, but as mentioned in #10, it may be sensible to store addresses as parameters too.)
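For anyone skimming, the `graphfcns` registry used in the snippet above could look something like the following. This is only a hedged sketch of the idea (the actual implementation in this PR may differ), with an illustrative placeholder body:

```python
# Hypothetical sketch of the 'graphfcns' registry pattern; names are illustrative.
graphfcns = {}

def register_graphfcn(func):
    """Register a graph function under its name so it can be looked up from a config."""
    graphfcns[func.__name__] = func
    return func

@register_graphfcn
def double_directed(G, **kwargs):
    # Placeholder body for illustration only
    return G
```

A user (or an external module) can then register their own function and simply list its name in the config.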

A couple of discussion points:

  • I will try to design each function so that it adds or changes as few attributes as possible, though in some cases it doesn't seem sensible to restrict this to only one (e.g., see format_osmnx_lanes). I also plan to add some kind of 'verbosity' setting so that, for example, even if the subcatchment shapes are not actually needed for the algorithm, they can still be printed during the derive-subcatchments function. Verbosity could also include printing the graph after every function, which would certainly be useful for debugging.
  • At the moment, each function's docstring specifies which attributes the graph being operated on needs to have, and which attributes the function will add. I can envisage some clever way to design this so that you can determine a priori whether a list of functions is feasible, based on which functions need which attributes and which functions create them. Any ideas on this front are welcome!
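To make the a priori feasibility idea concrete, here is a minimal sketch (not the PR's implementation; the dict-based function descriptions are purely illustrative) of walking a function list and checking that each function's required attributes are available by the time it runs:

```python
# Illustrative sketch of an a priori feasibility check for a sequence of
# graph functions, given which attributes each requires and adds.
def sequence_is_feasible(functions, starting_attributes):
    """Return True if every function's requirements are met, in order."""
    attributes = set(starting_attributes)
    for fcn in functions:
        if not set(fcn["requires"]).issubset(attributes):
            return False
        attributes |= set(fcn["adds"])
    return True

# Hypothetical descriptions of two graph functions
assign_id = {"requires": ["osmid"], "adds": ["id"]}
double_directed = {"requires": ["id"], "adds": []}

# Feasible: assign_id provides 'id' before double_directed needs it
sequence_is_feasible([assign_id, double_directed], {"osmid"})  # True
# Infeasible: double_directed runs before 'id' exists
sequence_is_feasible([double_directed, assign_id], {"osmid"})  # False
```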

Edits:

  • A smaller point: the typing in register_graphfcn seems weird, as surely the `...` is the antithesis of typing, but I couldn't get it to work any other way, since the graph functions all require different inputs as well as kwargs.

Current graph functions to implement

  • Calculate width
  • Format ID
  • Double directed
  • Split long edges
  • Burn in roads, derive subcatchments and set contributing area
  • Set elevation
  • Set surface slope
  • Set Chahinan angle
  • Calculate weights
  • Identify outlets
  • Find nearest outlets/shortest paths
  • Pipe by pipe? (or possibly this is in a separate hydraulic design module)
  • Implement as graphfcn classes

barneydobson and others added 4 commits January 26, 2024 10:36
- Create graph_utilities.py
- Create tests
- Create parameters
@barneydobson barneydobson self-assigned this Jan 26, 2024
@cheginit
Collaborator

I don't understand the justification for the need to treat graph functions like this. From the development (especially typing) and maintenance perspective, this approach can make debugging difficult and reduce readability. What's the trade-off for this? If you want to apply some functions in a specific sequence, why not just have a wrapper function that does so?

@barneydobson
Collaborator Author

Yes I probably should have motivated it a bit:

  • Keeping functions in a prescribed format makes it easy for a user to switch them out if they want to change the method (e.g., not shortest path but another algorithm, or a new method for subcatchment derivation). They just need to register a new graph function and change the function list in their config.
  • Making bigger changes to the workflow is also possible. E.g., some methods derive topology and hydraulics simultaneously or iteratively; something like this is also easier to accommodate.
  • The anticipated use case is to run the downloaders, preprocessing and a certain number of graph functions only once for a case study, and then vary parameters for the remaining graph functions thousands of times. However, which graph functions need to be run once, and which need to be run many times, depends on which parameters I want to vary. E.g., subcatchment derivation is slow, so I will probably want to run it (and thus all of the graph functions needed to enable it) only once, unless of course my parameters interact with subcatchment derivation.
  • On the debugging: the kwargs argument is ignored in a graph function; it is just there so that they can be called in a loop. Happy for other ideas here (especially re: maintenance), but ultimately the benefits above of flexible function choice and operation order seem to me to outweigh the costs. From a research and methods-comparison perspective, it is very good if others can use this software to implement inherently different kinds of synthetic algorithms.
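The run-once vs. run-many split described above could be sketched roughly as follows. This is a hypothetical illustration only: the function names, the placeholder `dict` "graphs", and the lambda bodies are all assumptions, not the package's API:

```python
# Hedged sketch of splitting the function list into expensive steps run once
# and cheap steps run many times with varying parameters.
graphfcns = {
    "derive_subcatchments": lambda G, **params: dict(G, subs=True),         # slow
    "calculate_weights": lambda G, **params: dict(G, w=params.get("w")),    # cheap
}

run_once = ["derive_subcatchments"]   # expensive, parameter-independent
run_many = ["calculate_weights"]      # cheap, varies with parameters

def preprocess(G, params):
    for name in run_once:
        G = graphfcns[name](G, **params)
    return G

def evaluate(G_pre, params):
    G = G_pre
    for name in run_many:
        G = graphfcns[name](G, **params)
    return G

G_pre = preprocess({}, {})                                # expensive steps once
results = [evaluate(G_pre, {"w": w}) for w in range(3)]   # cheap parameter sweep
```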

Hopefully it is clear?

@cheginit
Collaborator

I see. For this case, I recommend defining a base class with some required methods (which can be enforced using Python's built-in ABC module) that users can inherit from. The class can be used for any changes to the workflow, either replacing or adding steps. For running the workflow, you can either define the __call__ method of the class so that it carries out the operations in the desired sequence, or have a method called, for example, run. The advantage of the __call__ method is that whenever you call the class, it runs the entire workflow. The advantage of the run method is that you can instantiate the class once and run the workflow multiple times as you change its attributes. IMHO, this approach gives you the flexibility that you're looking for without sacrificing readability and maintainability. That said, if you have already implemented the register approach and feel strongly about it, it's fine by me.
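A minimal sketch of the __call__ vs. run distinction described above (names and the trivial step are hypothetical, just to show the shape):

```python
from abc import ABC, abstractmethod

# Illustrative only: a base workflow class where run() executes the steps
# and __call__ simply delegates to run().
class BaseWorkflow(ABC):
    def __init__(self, config):
        self.config = config

    @abstractmethod
    def step_one(self):
        """Subclasses must implement each workflow step."""

    def run(self):
        # Instantiate once, then call run() repeatedly as attributes change
        return self.step_one()

    def __call__(self):
        # Calling the instance executes the whole workflow
        return self.run()

class MyWorkflow(BaseWorkflow):
    def step_one(self):
        return self.config["x"] * 2

wf = MyWorkflow({"x": 21})
wf.run()   # run the workflow explicitly
wf()       # same result via __call__
```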

@barneydobson
Collaborator Author

barneydobson commented Jan 29, 2024

Thanks for the suggestions. The register is implemented, but it was minimal work and is easy to change. I think how this interacts with the config file (#10) is probably the most critical aspect for usability and will be the hardest bit to code (saving this for now). I think a class-based approach could make sense. I will have a look at ABC and see; the pydantic package seems pretty nice for the parameters so far, so I am happy if there is some graph-function, or maybe entire 'workflow', equivalent.

@dalonsoa - any views on this?

Edit:
In particular, a class-based approach would nicely handle the 'adds attributes' and 'requires attributes' elements of a graphfcn. It would be very easy to iterate over a list of functions and determine a priori whether a sequence is feasible.

@barneydobson
Collaborator Author

barneydobson commented Jan 29, 2024

OK here is a testable mockup of how this could work in practice to give everyone a better idea - would be great to hear thoughts:

import networkx as nx
from swmmanywhere import parameters
from abc import ABC, abstractmethod

class BaseGraphFunction(ABC):
    @abstractmethod
    def __init__(self):
        self.required_attributes = []
        self.adds_attributes = []

    @abstractmethod
    def __call__(self, 
                 G: nx.Graph, 
                 **kwargs) -> nx.Graph:
        return G
    
    def validate_requirements(self,
                              attributes: set):
        for attribute in self.required_attributes:
            assert attribute in attributes, "{0} not in attributes".format(
                attribute)

    def add_graphfcn(self,
                    attributes: set):
        self.validate_requirements(attributes)
        return attributes.union(self.adds_attributes)

graphfcns = {}
   
def register_graphfcn(cls):
    graphfcns[cls.__name__] = cls
    return cls

def get_osmid_id(data):
    id_ = data.get('osmid', data.get('id'))
    if isinstance(id_, list):
        id_ = id_[0]
    return id_

@register_graphfcn
class assign_id(BaseGraphFunction):
    def __init__(self):
        super().__init__()
        self.required_attributes = ['osmid']
        self.adds_attributes = ['id']
    def __call__(self, 
                 G: nx.Graph, 
                 **kwargs):
        G = G.copy()
        for u, v, data in G.edges(data=True):
            data['id'] = get_osmid_id(data)
        return G
    
@register_graphfcn
class format_osmnx_lanes(BaseGraphFunction):
    def __init__(self):
        super().__init__()
        self.required_attributes = ['lanes']
        self.adds_attributes = ['width', 'nlanes']
    def __call__(self, 
                G: nx.Graph, 
                subcatchment_derivation: parameters.SubcatchmentDerivation, 
                **kwargs):
        G = G.copy()
        for u, v, data in G.edges(data=True):
            lanes = data.get('lanes',1)
            if isinstance(lanes, list):
                lanes = sum([float(x) for x in lanes])
            else:
                lanes = float(lanes)
            data['nlanes'] = lanes
            data['width'] = lanes * subcatchment_derivation.lane_width
        return G

@register_graphfcn
class double_directed(BaseGraphFunction):
    def __init__(self):
        super().__init__()
        self.required_attributes = ['id']
        self.adds_attributes = []
    def __call__(self, 
                 G: nx.Graph, 
                 **kwargs):
        G_new = G.copy()
        for u, v, data in G.edges(data=True):
            if (v, u) not in G.edges:
                reverse_data = data.copy()
                reverse_data['id'] = '{0}.reversed'.format(data['id'])
                G_new.add_edge(v, u, **reverse_data)
        return G_new


from swmmanywhere.prepare_data import download_street

# Download example graph
bbox = (-0.11643,51.50309,-0.11169,51.50549)
G = download_street(bbox)

# Define functions list
functions_list = ['assign_id', 'format_osmnx_lanes', 'double_directed']
functions_list_init = []
# Populate attributes common to all edges in graph
attributes = set(list(G.edges(data=True))[0][2].keys())
for u, v, data in G.edges(data=True):
    attributes = attributes.intersection(data.keys())

# Add and validate attributes and functions
for function_name in functions_list:
    functions_list_init.append(graphfcns[function_name]())
    attributes = functions_list_init[-1].add_graphfcn(attributes)

# iterate over functions
params = {'subcatchment_derivation' : parameters.SubcatchmentDerivation()}
for function in functions_list_init:
    G = function(G,**params)

# This will break because double_directed needs 'id' which has not yet been 
# added
G = download_street(bbox)

functions_list = ['double_directed', 'assign_id', 'format_osmnx_lanes']
functions_list_init = []
attributes = set(list(G.edges(data=True))[0][2].keys())
for u, v, data in G.edges(data=True):
    attributes = attributes.intersection(data.keys())

for function_name in functions_list:
    functions_list_init.append(graphfcns[function_name]())
    attributes = functions_list_init[-1].add_graphfcn(attributes)

params = {'subcatchment_derivation' : parameters.SubcatchmentDerivation()}
for function in functions_list_init:
    G = function(G,**params)

@cheginit
Collaborator

That's not exactly what I had in mind. For example, if I were to refactor my BUSN code, here's the base class that I would have defined with all the must-have methods:

from __future__ import annotations

import functools
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Optional

import geopandas as gpd
import networkx as nx
import xarray as xr
import yaml


Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)
yaml_load = functools.partial(yaml.load, Loader=Loader)

class BaseModel(ABC):
    def __init__(self, config_path: Path | str):
        self.config = self.load_config(config_path)
        # Additional initialization code if necessary

    @staticmethod
    def load_config(config_path: Path | str) -> dict[str, Any]:
        config = yaml_load(Path(config_path).read_text())
        # Additional config validation code
        return config

    @abstractmethod
    def get_street(self, bbox: tuple[float, float, float, float]) -> nx.DiGraph:
        pass

    @abstractmethod
    def network_cleanup(self, street: nx.DiGraph, merge_dist: int) -> nx.DiGraph:
        pass

    @abstractmethod
    def get_buildings(self, bbox: tuple[float, float, float, float]) -> gpd.GeoDataFrame:
        pass

    @abstractmethod
    def set_network_attrs(self, street: nx.DiGraph, building_footprints: Optional[gpd.GeoDataFrame], cover_mul: float) -> nx.DiGraph:
        pass

    @abstractmethod
    def set_street_topo(self, street: nx.DiGraph, slope_min: float) -> nx.DiGraph:
        pass

    @abstractmethod
    def pipe_attrs(self, street: nx.DiGraph) -> nx.DiGraph:
        pass

    @abstractmethod
    def flow_weight(self, street: nx.DiGraph) -> None:
        pass

    @abstractmethod
    def cover_weight(self, street: nx.DiGraph, cover_da: xr.DataArray) -> None:
        pass

    @abstractmethod
    def road_weight(self, street: nx.DiGraph) -> None:
        pass

    @abstractmethod
    def btw_weight(self, street: nx.DiGraph) -> None:
        pass

    @abstractmethod
    def compute_betweenness(self, street: nx.DiGraph) -> nx.DiGraph:
        pass

    @abstractmethod
    def remove_streets(self, street: nx.DiGraph, drain_levels: int) -> dict[float, nx.DiGraph]:
        pass

    @abstractmethod
    def run(self) -> None:
        """Abstract method for running the model's main workflow."""
        # It will use `self.config` to run the operations in desired order
        pass

Then, I would've defined a DefaultModel that inherits from this, with the default methods of my methodology. However, in this case, you can define several classes, one for each method, that inherit from the base class. For example, there can be a class based on the shortest path that you implemented, another one for the betweenness centrality that I implemented, and so on.

@dalonsoa
Collaborator

I see the advantage of @barneydobson's implementation using small classes that deal with one particular activity, instead of plain functions, as was the initial idea. It needs some work to polish some rough edges, but the validation step to ensure the right information is there is definitely useful.

I definitely do not understand the advantage of @cheginit's implementation of an all-mighty class that does a lot of work, with many methods to define (even if we consider inheritance). I do not see how that is more readable, easier to test and debug, or more extensible by end users. Anyone who wants to try something new will need to hack the core code to tell it to use a new (sub)class, rather than just providing a configuration that will run any registered functions (or classes) in the right order. And such registration can happen in a Python module totally external to this tool, so there is no need to change the core code.

Dobson added 2 commits January 30, 2024 12:36
@cheginit
Collaborator

Maybe I couldn't explain it well. There's no reason for a user to "hack the core code". If a user wants to change a specific step, they just need to inherit from one of the main classes, let's say BCModel, and then change that specific function. For example, if a user would like to modify the set_network_attrs step, they simply need to do:

class NewBCModel(BCModel):
    def set_network_attrs(...):
        # new code

This approach is common for packages that require this level of customization, such as optimization packages (pymoo) or ML ones like PyTorch.

I am not sure why you'd think this is difficult to read, test and debug. Perhaps because I am used to using many such packages, this approach feels more natural and suitable to me.

@dalonsoa
Collaborator

Because you need the whole class that does everything (including for testing), and there is a somewhat implicit implementation of much of the functionality (defined in the parent classes - I know this is what inheritance is all about, but I find that much harder to follow than a more functional approach). Why put all possible functionality in a single class when you can have small classes that only need to worry about their own specific functionality?

About the customisation, I think we are seeing different use cases - or understanding the use cases differently. I see this tool used either directly from a command line or with a very simple script that reads a configuration file and calls high level functions. In other words, an application. You see it more as, following your example, PyTorch, a library that contains many bits and pieces that you put together to do some customised tasks. Both are totally fine, they are just different and need different solutions. I don't know what @barneydobson had in mind.

@cheginit
Collaborator

Ok, I see what you mean. The reason I think this class-based approach makes more sense is that we have all these functionalities and there are also different methods of generating the network, each of which uses a subset of these functionalities. So, IMHO, since each method bundles a subset of these functionalities, it's easier to understand and follow the workflow when all required steps are gathered in a class. There is a base class that contains the steps shared between all methods, then there are method-specific classes containing only modified/additional steps.

I think, the advantages of this approach don't depend on whether it's mainly used via CLI/config file or scripting. Having said that, since @barneydobson and @dalonsoa will be maintaining the package, using an implementation that you're more comfortable with is fine by me.

@barneydobson
Collaborator Author

barneydobson commented Jan 31, 2024

OK thanks so much @cheginit and @dalonsoa for this very helpful discussion. I definitely see the pros and cons of both. I think we are probably all agreed that there is both a need for and benefit of having a 'default loadout' that can be customised by a user for their case. For me I think the practical distinction above is boiling down to whether this default is contained in a default config file, or a default model class - with many implications discussed.

One thing that was not made explicit in @cheginit's approach was the handling of graphfcn attribute dependencies and validation. I suppose treating the graphfcns as classes is still probably the most general way to handle this. Are these graphfcns defined within (or attached to) a model_class, or instantiated within a function in generate_network as defined by the config file? I don't know, but I think from the perspective of this PR, the graphfcns are written the same either way.

If you are both happy with it - perhaps we can revisit the discussion of default config vs default model class when at #10 , this will also have the benefit of some more tangible examples.

@barneydobson barneydobson mentioned this pull request Jan 31, 2024
Dobson added 3 commits January 31, 2024 12:53
- Load test data rather than download
- Test derive_rc
- Minor edits to fix derive subs
@cheginit
Collaborator

One thing that was not made explicit in @cheginit's approach was the handling of graphfcn attribute dependencies and validation.

I think all validation and checks should be done when instantiating the class. If you look at the example template class that I wrote, there is a load_config method with a comment about validations. Since all the other methods in the class use self.config to get the attributes, if you validate and check the state of the required attributes while loading the config, then the method calls will run without any issues.

The validation is done by calling the functions or classes that you develop separately in the other PR and don't need to (shouldn't) be part of this base class.
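The validate-at-instantiation idea could be sketched like this. This is a hedged illustration only: the required keys and class name are assumptions, not the package's actual config schema:

```python
# Illustrative sketch: validate the config once, at instantiation, so that
# subsequent method calls can rely on self.config being well-formed.
REQUIRED_KEYS = {"bbox", "function_list"}  # hypothetical required keys

class Model:
    def __init__(self, config):
        self.config = self.load_config(config)

    @staticmethod
    def load_config(config):
        missing = REQUIRED_KEYS - config.keys()
        if missing:
            raise ValueError(f"Config missing keys: {sorted(missing)}")
        return config

model = Model({"bbox": (0, 0, 1, 1), "function_list": ["assign_id"]})
# Model({}) would raise ValueError at instantiation, before any method runs
```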

Dobson added 6 commits January 31, 2024 15:28
- Introduce Addresses into parameters
- Test
- Add tests
- Identify and fix error in interpolate_points_on_raster in geospatial_utilities and its test
Collaborator

@dalonsoa dalonsoa left a comment

I've made a collection of comments and suggestions to make the code cleaner and easier to read. I have not entered into discussing the science and the specialised packages and formats because, to be honest, you know more about that.

It is a huge PR, so it might deserve another review because I might have missed something, but so far it is looking really good.

Review comments were left on:
swmmanywhere/geospatial_utilities.py
swmmanywhere/graph_utilities.py
@barneydobson barneydobson merged commit e9284f3 into main Feb 28, 2024
7 of 10 checks passed
@barneydobson barneydobson deleted the graphfcns branch February 28, 2024 12:04