Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query parquet database data for selected area on click #41

Closed
carlhiggs opened this issue Oct 22, 2024 · 1 comment
Closed

Query parquet database data for selected area on click #41

carlhiggs opened this issue Oct 22, 2024 · 1 comment

Comments

@carlhiggs
Copy link
Contributor

To abstract and reduce the data included in a map, and provide additional flexibility, rather than include these as column attributes (e.g. of administrative areas), the linkage ID of a selected area could be used to retrieve, summarise and display relevant data on click.

A possible application for this workflow relates to presenting socio-demographic charactersistics of areas based on analysis of highly granular synthetic population data. Rather than pre-aggregate statistics as averages for areas (which can bloat the spatial data retrieved), the distribution could be analysed on click and represented, for example as a box plot of age, stratified by gender or other categorical aspects.

One approach for this is to create a Parquet file that contains the additional attributes with linkage codes. Parquet is designed as an efficient format for querying. One way of doing this in an interactive web app is using a query engine like Amazon Athena.

The basic workflow is

User clicks on area >>> area id is passed to a lambda function >>> lambda function runs Athena query of parquet data in S3 bucket >>> Lambda returns summary >> Typescript is used to format summary for display to user

A sketch has been made of this workflow, but I have some lingering permissions issues impeding data retrieval needed to prototype this full workflow.

carlhiggs added a commit that referenced this issue Oct 24, 2024
…onsole, and only in proof of concept form, later we could load this via AWS-CDK; towards #41
carlhiggs added a commit that referenced this issue Oct 24, 2024
@carlhiggs
Copy link
Contributor Author

carlhiggs commented Oct 24, 2024

Have addressed permissions issues, prototyped some example queries (e.g. basic queries of 3 million records takes about a second, apparently) and about to attempt to integrate basic query into app. I'll write up full approach later, just marking this milestone as it seems using Lambda to run an Athena query on Parquet file is now functional and appears promising. Next step is to invoke the Lambda from typescript and create a display using the retrieved data.

For example, the following query when run in Test of Lambda within the AWS console:

SELECT 
        ROUND(AVG(gender-1)*100,1) AS pct_female,
        cast(
            row(
                ROUND(approx_percentile(age,0.25),1),
                ROUND(approx_percentile(age,0.5),1), 
                ROUND(approx_percentile(age,0.75),1)  
            ) as row(age_p25 int,age_p50 int, age_p75 int)) as age
    FROM synpop_manchester_2021
    WHERE "{key.lower()}.home" = '{value}'
    GROUP BY "{key.lower()}.home";

returns,

{
  "statusCode": 200,
  "body": "[{\"Data\": [{\"VarCharValue\": \"pct_female\"}, {\"VarCharValue\": \"age\"}]}, {\"Data\": [{\"VarCharValue\": \"51.3\"}, {\"VarCharValue\": \"{age_p25=23, age_p50=42, age_p75=60}\"}]}]"
}

In the above

  • because gender is coded as 1=male and 2=female, if we subtract 1 from this we have a binary indicator of female-ness -- and taking the average gives the proportion of females, multiplied by one hundred being the percentage.
  • and we summarise age to get the 25th, 50th and 75th percentiles, which we can present as Median (IQR) --- or median age and 'usual range', perhaps contrasting with the overall statistics as a reference comparison.

The data will be interpreted in JSON more or less as:

{
    "Data": [{
        "VarCharValue": "pct_female",
        "VarCharValue": "age",
        "VarCharValue": "51.3",
        "VarCharValue": "{age_p25=23, age_p50=42, age_p75=60}"
    }]
  }

which is a bit idiosyncrative (i.e. it shoes the variable names, then the values corresponding to these), but totally usable.

Ideally, we'll reframe it so it returns

{
    "Data": [{
        "pct_female": "51.3",
        "age": "{age_p25=23, age_p50=42, age_p75=60}"
    }]
}

But that's just a detail. For now, the priorities are:

  1. get something on a map (i.e. the above in a popup when you click on an area)
  2. get something useful on the map (retrieve place of work LSOAs, and plot flows to these on click)
  3. make it all more useful with additional results processed next
  4. refine the code so its nicer and optimised (lowest priority, but still important; potentially things will be more optimised using partitioning and compression... but that's all later)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant