Graph response formatting fails when no pipeline data is in the database #367

alyssadai · 2024-10-25T06:08:21Z

Is there an existing issue for this?

I have searched the existing issues

Expected Behavior

No response

Current Behavior

When submitting any cohort query to an n-API v0.4.0

in non-aggregated mode,
with a graph with old JSONLD data (no pipeline metadata),

the query results in an Internal server error.

The same issue does not occur if the n-API is run in aggregated mode, or for a graph database containing at least one subject with pipeline metadata.

Error message

Error in the n-API container logs:

TypeError: DataFrame.reset_index() got an unexpected keyword argument 'name'

This error is misleading, as it gives the impression that the relevant section of code:

api/app/api/crud.py

Lines 222 to 232 in 9913736

    
           session_completed_pipeline_data = ( 
        
               pipeline_grouped_data.groupby( 
        
                   ["sub_id", "session_id", "session_type"] 
        
               ) 
        
               .apply( 
        
                   lambda x: dict( 
        
                       zip(x["pipeline_name"], x["pipeline_version"]) 
        
                   ) 
        
               ) 
        
               .reset_index(name="completed_pipelines") 
        
           )

is using a non-existent or deprecated argument / has some inherent syntax error.

However, what's actually happening is that the code assumes reset_index() is operating on a pd.Series (which DOES have the name argument). But something is going wrong in the logic for session_completed_pipeline_data such that it's producing a pd.DataFrame instead (which DOESN'T have the name argument for reset_index()).

Source of problem

Earlier in the code, when pipeline_grouped_data is constructed:

api/app/api/crud.py

Lines 202 to 220 in 9913736

    
           pipeline_grouped_data = ( 
        
               group.groupby( 
        
                   [ 
        
                       "sub_id", 
        
                       "session_id", 
        
                       "session_type", 
        
                       "pipeline_name", 
        
                   ], 
        
                   dropna=True, 
        
               ) 
        
               .agg( 
        
                   { 
        
                       "pipeline_version": lambda x: list( 
        
                           x.dropna().unique() 
        
                       ) 
        
                   } 
        
               ) 
        
               .reset_index() 
        
           )

we are dropping NaNs during the groupby, meaning that when there are no pipeline names in the data, we get an empty dataframe like:

Empty DataFrame
Columns: [sub_id, session_id, session_type, pipeline_name, pipeline_version]
Index: []

as a result, when we then try to run groupby again on this object to construct session_completed_pipeline_data, that has no effect and still returns a pd.DataFrame, causing the unexpected keyword error when we then try to run reset_index() on it.

If we instead set dropna=False in the groupby when constructing pipeline_grouped_data, there is no longer an error, but the resulting completed_pipelines field for single subject-session looks like this in the response:

        "completed_pipelines": {
          "null": []
        }

Environment

OS:
Python/Node version:

How to reproduce

No response

Anything else?

Some considerations

Why this wasn't caught by our tests

This section of code is not currently covered by any tests - i.e., we don't assert over the new pipeline fields in an n-API response based on a non-aggregated (mocked or real) result from the graph (i.e., checking that the graph response is re-formatted correctly on the API side)

We do have a mocked response from the graph that we currently use in some tests of the API response formatting:

api/tests/conftest.py

Lines 111 to 168 in 9913736

    
           @pytest.fixture 
        
           def mock_post_query_to_graph(): 
        
               """Mock post_query_to_graph function that returns toy aggregate data containing a dataset with no modalities for testing.""" 
        
               def mockreturn(query, timeout=5.0): 
        
                   return { 
        
                       "head": { 
        
                           "vars": [ 
        
                               "dataset_uuid", 
        
                               "dataset_name", 
        
                               "dataset_portal_uri", 
        
                               "sub_id", 
        
                               "image_modal", 
        
                               "pipeline_name", 
        
                               "pipeline_version", 
        
                           ] 
        
                       }, 
        
                       "results": { 
        
                           "bindings": [ 
        
                               { 
        
                                   "dataset_uuid": { 
        
                                       "type": "uri", 
        
                                       "value": "http://neurobagel.org/vocab/12345", 
        
                                   }, 
        
                                   "dataset_portal_uri": { 
        
                                       "type": "literal", 
        
                                       "value": "https://rpq-qpn.ca/en/researchers-section/databases/", 
        
                                   }, 
        
                                   "sub_id": {"type": "literal", "value": "sub-ON95534"}, 
        
                                   "dataset_name": {"type": "literal", "value": "QPN"}, 
        
                               }, 
        
                               { 
        
                                   "dataset_uuid": { 
        
                                       "type": "uri", 
        
                                       "value": "http://neurobagel.org/vocab/12345", 
        
                                   }, 
        
                                   "dataset_portal_uri": { 
        
                                       "type": "literal", 
        
                                       "value": "https://rpq-qpn.ca/en/researchers-section/databases/", 
        
                                   }, 
        
                                   "sub_id": {"type": "literal", "value": "sub-ON95535"}, 
        
                                   "dataset_name": {"type": "literal", "value": "QPN"}, 
        
                                   "image_modal": { 
        
                                       "type": "uri", 
        
                                       "value": "http://purl.org/nidash/nidm#T1Weighted", 
        
                                   }, 
        
                                   "pipeline_version": { 
        
                                       "type": "literal", 
        
                                       "value": "7.3.2", 
        
                                   }, 
        
                                   "pipeline_name": { 
        
                                       "type": "uri", 
        
                                       "value": "https://github.com/nipoppy/pipeline-catalog/tree/main/processing/freesurfer", 
        
                                   }, 
        
                               }, 
        
                           ] 
        
                       }, 
        
                   }

However, even if we did use this, it would not have caught the error because:
a. It assumes a dataset with info containing pipeline metadata
b. It is an aggregated response

To avoid similar issues

We need tests that assert over the API response given all possible graph database types allowed by Neurobagel, i.e. datasets with:
i. ONLY phenotypic data
ii. ONLY phenotypic + bids data (no derivatives)
iii. ONLY phenotypic + derivatives data (no BIDS)
iv. Phenotypic + bids + derivatives
- see datasets in https://github.com/neurobagel/neurobagel_examples/tree/main/data-upload
- this means either spinning up multiple test graphs or mocking these 4 types of graph responses
- we need to make sure the graph responses are also non-aggregated
We may want to simplify our existing code for reformatting graph responses so that they are more explicit, rather than complex lambda funcs or chaining multiple methods together (harder to debug)

The text was updated successfully, but these errors were encountered:

alyssadai · 2024-10-25T06:25:30Z

Also possibly related to #303

alyssadai added the type:bug label Oct 25, 2024

surchs added this to Neurobagel Oct 25, 2024

alyssadai added bug:functional and removed type:bug labels Oct 25, 2024

alyssadai moved this to Backlog in Neurobagel Oct 25, 2024

surchs added the Epic A collection of issues that are related by topic and can be addressed together. label Oct 28, 2024

surchs removed the status in Neurobagel Nov 15, 2024

surchs removed the bug:functional label Dec 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Graph response formatting fails when no pipeline data is in the database #367

Graph response formatting fails when no pipeline data is in the database #367

alyssadai commented Oct 25, 2024 •

edited

Loading

alyssadai commented Oct 25, 2024

Graph response formatting fails when no pipeline data is in the database #367

Graph response formatting fails when no pipeline data is in the database #367

Comments

alyssadai commented Oct 25, 2024 • edited Loading

Is there an existing issue for this?

Expected Behavior

Current Behavior

Error message

Source of problem

Environment

How to reproduce

Anything else?

Some considerations

Why this wasn't caught by our tests

To avoid similar issues

alyssadai commented Oct 25, 2024

alyssadai commented Oct 25, 2024 •

edited

Loading