Query parquet database data for selected area on click #41

carlhiggs · 2024-10-22T11:30:22Z

To abstract and reduce the data included in a map, and provide additional flexibility, rather than include these as column attributes (e.g. of administrative areas), the linkage ID of a selected area could be used to retrieve, summarise and display relevant data on click.

A possible application for this workflow relates to presenting socio-demographic charactersistics of areas based on analysis of highly granular synthetic population data. Rather than pre-aggregate statistics as averages for areas (which can bloat the spatial data retrieved), the distribution could be analysed on click and represented, for example as a box plot of age, stratified by gender or other categorical aspects.

One approach for this is to create a Parquet file that contains the additional attributes with linkage codes. Parquet is designed as an efficient format for querying. One way of doing this in an interactive web app is using a query engine like Amazon Athena.

The basic workflow is

User clicks on area >>> area id is passed to a lambda function >>> lambda function runs Athena query of parquet data in S3 bucket >>> Lambda returns summary >> Typescript is used to format summary for display to user

A sketch has been made of this workflow, but I have some lingering permissions issues impeding data retrieval needed to prototype this full workflow.

…onsole, and only in proof of concept form, later we could load this via AWS-CDK; towards #41

carlhiggs · 2024-10-24T13:53:26Z

Have addressed permissions issues, prototyped some example queries (e.g. basic queries of 3 million records takes about a second, apparently) and about to attempt to integrate basic query into app. I'll write up full approach later, just marking this milestone as it seems using Lambda to run an Athena query on Parquet file is now functional and appears promising. Next step is to invoke the Lambda from typescript and create a display using the retrieved data.

For example, the following query when run in Test of Lambda within the AWS console:

SELECT 
        ROUND(AVG(gender-1)*100,1) AS pct_female,
        cast(
            row(
                ROUND(approx_percentile(age,0.25),1),
                ROUND(approx_percentile(age,0.5),1), 
                ROUND(approx_percentile(age,0.75),1)  
            ) as row(age_p25 int,age_p50 int, age_p75 int)) as age
    FROM synpop_manchester_2021
    WHERE "{key.lower()}.home" = '{value}'
    GROUP BY "{key.lower()}.home";

returns,

{
  "statusCode": 200,
  "body": "[{\"Data\": [{\"VarCharValue\": \"pct_female\"}, {\"VarCharValue\": \"age\"}]}, {\"Data\": [{\"VarCharValue\": \"51.3\"}, {\"VarCharValue\": \"{age_p25=23, age_p50=42, age_p75=60}\"}]}]"
}

In the above

because gender is coded as 1=male and 2=female, if we subtract 1 from this we have a binary indicator of female-ness -- and taking the average gives the proportion of females, multiplied by one hundred being the percentage.
and we summarise age to get the 25th, 50th and 75th percentiles, which we can present as Median (IQR) --- or median age and 'usual range', perhaps contrasting with the overall statistics as a reference comparison.

The data will be interpreted in JSON more or less as:

{
    "Data": [{
        "VarCharValue": "pct_female",
        "VarCharValue": "age",
        "VarCharValue": "51.3",
        "VarCharValue": "{age_p25=23, age_p50=42, age_p75=60}"
    }]
  }

which is a bit idiosyncrative (i.e. it shoes the variable names, then the values corresponding to these), but totally usable.

Ideally, we'll reframe it so it returns

{
    "Data": [{
        "pct_female": "51.3",
        "age": "{age_p25=23, age_p50=42, age_p75=60}"
    }]
}

But that's just a detail. For now, the priorities are:

get something on a map (i.e. the above in a popup when you click on an area)
get something useful on the map (retrieve place of work LSOAs, and plot flows to these on click)
make it all more useful with additional results processed next
refine the code so its nicer and optimised (lowest priority, but still important; potentially things will be more optimised using partitioning and compression... but that's all later)

…election

carlhiggs added a commit that referenced this issue Oct 24, 2024

added in Athena parquet query lambda function; currently loaded via c…

9339eb1

…onsole, and only in proof of concept form, later we could load this via AWS-CDK; towards #41

carlhiggs added a commit that referenced this issue Oct 24, 2024

updated Lambda function towards #41

fd65600

carlhiggs added a commit that referenced this issue Oct 24, 2024

updated Lambda function towards #41

7f6677d

carlhiggs referenced this issue Oct 28, 2024

updated proof of concept retrieval of data for popups based on area s…

db9e9bf

…election

carlhiggs added a commit that referenced this issue Oct 28, 2024

further advanced towards #41, adding nicer styling

40c66ea

carlhiggs added a commit that referenced this issue Nov 1, 2024

developed queryable popup table as per #41

fc58dd7

carlhiggs added a commit that referenced this issue Nov 1, 2024

added in div for the group by, better presenting #41

ae59cb6

carlhiggs added a commit that referenced this issue Nov 1, 2024

added JIBE project attribution and URL to table copy function for #41

f4699bc

carlhiggs closed this as completed Jan 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query parquet database data for selected area on click #41

Query parquet database data for selected area on click #41

carlhiggs commented Oct 22, 2024

carlhiggs commented Oct 24, 2024 •

edited

Loading

Query parquet database data for selected area on click #41

Query parquet database data for selected area on click #41

Comments

carlhiggs commented Oct 22, 2024

carlhiggs commented Oct 24, 2024 • edited Loading

carlhiggs commented Oct 24, 2024 •

edited

Loading