Coco/area relational integrity #428

Open · wants to merge 33 commits into base: develop
Conversation

CocoisBuggy (Contributor)

Fulfils the feature request in #126

There's a bit happening here, since I made some performance-driven changes to help me with a different PR (#407). I tried to knock out the major test cases, but the nature of this particular feature is a little thorny because of how we utilize de-normalized data.

The commit history is a little messy and I am frankly feeling a bit lazy, which is why I have not squashed the commits. I've been moving between machines a bunch and reworked the approach a few times to get the speed tuned in.
If someone would like me to squash, I will try to motivate myself to do it.

The key takeaway, in terms of changes that will affect other devs, is that a new field has been introduced on the area document, from which all other ancestry / children data is derived: the `parent` field points to the area's direct parent. The `pathTokens` and `ancestors` fields have been colocated into a single array of sub-documents so that update propagations can all happen together - not a big deal now, but it gives us a sane path to embed more de-normalized data if we need to.
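For illustration, the reworked document shape looks roughly like this. This is a sketch only: apart from `parent`, `embeddedRelations`, `ancestors`, and `pathTokens`, the field names are assumptions, not the PR's actual schema.

```typescript
import mongoose from 'mongoose'

// Rough sketch of the reworked area document (field names beyond
// `parent` and `embeddedRelations` are assumptions).
interface AreaDocument {
  _id: mongoose.Types.ObjectId
  area_name: string
  // Single source of truth: the direct parent, or undefined for a root.
  parent?: mongoose.Types.ObjectId
  // Derived, de-normalized data colocated so updates propagate together.
  embeddedRelations: {
    children: mongoose.Types.ObjectId[]
    // One entry per ancestor; replaces the old separate `ancestors` and
    // `pathTokens` fields and exposes the real _id for effective indexing.
    ancestors: Array<{
      _id: mongoose.Types.ObjectId
      name: string
    }>
  }
}
```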

@vnugent I'd like your particular eyes for these changes.

I ran into a one-off issue where the replica set specification was
throwing off the script in a development environment. This code lets
you un-set the variable if you don't need it while in dev.
This is by no means a complete environment, but if no one minds me
leaving this here it would save me some trouble - it is not required, though.

If you are unfamiliar with NixOS, this file is intended to give my
environment the necessary mongo utils / yarn / node and so forth.

You can peg a Node version and do all sorts of other magic. I would
recommend any nerds with sufficient free time investigate - but you will
find it very hard to escape once it has ensnared you.
The query is too slow. It remains to be seen if this is because the
scope is simply too liberal, but my intuition is that this is down
to the regex search and full-document retrieval.

Current root query time is along the lines of 16 seconds (unthinkably
slow), so I'll try another approach; otherwise this architecture is
simply no good.
value? This may be down to me not manually running some cron operation
in my development environment?

I will work it out later.
There is still some query tuning that I want to do - I might squash this
commit later.

Cost Reference

 - Texas: ~50 ms
 - California: ~900 ms
Achieved by pruning the data being passed through the pipeline.
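"Pruning" here presumably means projecting away fields that later stages don't need. A minimal, hypothetical sketch of that kind of stage (the field names and pipeline are assumptions, not the PR's actual query):

```typescript
import mongoose from 'mongoose'

// Hypothetical pruning stage: keep only the fields the query actually
// needs before any expensive $lookup/$sort, so later stages move less
// data. Field names are assumptions, not the PR's actual pipeline.
const rootId = new mongoose.Types.ObjectId() // the area being queried

const pipeline = [
  { $match: { 'embeddedRelations.ancestors._id': rootId } },
  { $project: { area_name: 1, parent: 1, 'embeddedRelations.ancestors': 1 } }
]
```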
The only way I can think of to speed this up is to eliminate the parent
lookup stage in the pipeline, which seems to double the time the query
takes to run until completion.

The natural way to address this would be to embed a parent pointer,
which I assume should be easy enough since there must be existing code
to maintain the integrity of

 - pathTokens
 - ancestors

so I'll take a look at doing that too.
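For what it's worth, a parent pointer doesn't lose any expressiveness: the full ancestry can still be derived on demand with a single `$graphLookup` stage. A hedged sketch, with the collection and field names assumed:

```typescript
import mongoose from 'mongoose'

// Derive the full ancestry of one area from its `parent` pointer in a
// single stage, instead of joining parents one level at a time.
// Collection and field names are assumptions.
const areaId = new mongoose.Types.ObjectId()

const pipeline = [
  { $match: { _id: areaId } },
  {
    $graphLookup: {
      from: 'areas', // same collection, walked upwards
      startWith: '$parent',
      connectFromField: 'parent',
      connectToField: '_id',
      as: 'derivedAncestors'
    }
  }
]
```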
Rather than duping logic from other code units over and over
The mongo shell was renamed to mongosh a little while ago, and I can't
be bothered to symlink the name, so now if mongo is not available the
script will fall back to mongosh.

+ Add mongosh to the nix shell
The tests are not passing at this commit, even though TSC is not
reporting type issues. I will try to work out what is preventing the
type errors from being reported.
Previously you would need to re-enter the auth credentials.
This commit should be squashed later...
This code, as far as I can tell, hasn't been touched for a while and is
just clutter? If I am wrong about this then I need to know why we are
still importing data that we should presumably now be maintaining.
I'm not sure why I didn't do this initially
There is a new, simple source of truth for area structure derivations.
Each area document holds a parent reference, which points to its direct
parent (or none, if it is a root node). The derived fields such as

 - Children
 - Ancestry

have been collated in `embeddedRelations`. The denormalized fields
`ancestors` and `pathTokens` have also been colocated into a single
entity path that makes propagating updates to denormalized data easier
(e.g., when updating an area's name). It leaves us space for adding new
fields, and finally exposes the ACTUAL id of the document so that we can
index effectively.

I would suggest that using the parent field is the most straightforward
and safest thing to do in pretty much all scenarios, but if you are a
MongoDB veteran you may find a need for the array indexing features -
though as far as I can tell, this is rarely the case.
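A hedged sketch of the two access patterns (the model name and the `embeddedRelations` path are assumptions based on the shape described above):

```typescript
import { Model, Types } from 'mongoose'

// `AreaModel` and the embeddedRelations path are assumptions.
async function childrenAndDescendants (AreaModel: Model<any>, areaId: Types.ObjectId) {
  // Straightforward and safe: follow the parent pointer.
  const directChildren = await AreaModel.find({ parent: areaId })

  // Array-indexing alternative: every area whose embedded ancestor list
  // contains `areaId`, i.e. all descendants at any depth.
  const allDescendants = await AreaModel.find({
    'embeddedRelations.ancestors._id': areaId
  })

  return { directChildren, allDescendants }
}
```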

 - Added tests
 - Updated some existing tests
 - Data returned at the gql layer should be identical, but I have not
   fully de-risked those tests. All I can say is that they are passing.

This commit does not address #126, but gives us an enviable head start
on it. It is, however, needed by #407.
I'm going to go live in the mountains for a week or so, but will take a
laptop so may want to work on this a bit.
It's all in, I think. I added a new test to cover name uniqueness but no
doubt more cases will come up.
@CocoisBuggy requested a review from vnugent, November 29, 2024 11:44
vnugent (Contributor) commented Nov 30, 2024

@CocoisBuggy can you please explain what problems this PR is trying to solve? Maybe a diagram or example would help. Thanks!
Not sure if you noticed, but I created a rough plan for v2 (#422) where we can address some of the major issues. Perhaps we can discuss larger-impact changes in v2?

CocoisBuggy (Contributor, Author) commented Nov 30, 2024

Sure thing - sorry, I suppose I wasn't super clear in my explanation.

The tldr is:

  • performance
  • code sanity
  • collate denormalized fields together
  • allow updating of area parents to reorganize areas
  • deconvolute data structure

Who should be affected?

Only developers working on the back end. There is a scary number of changes, but I haven't touched the way in which the data is accessed by the API, so there shouldn't be a problem - but now that I'm writing that, it may be just as well for me to check that the existing tests cover the behavior.

Developers shouldn't have to change their mental model too much, since all the old data is still available - and is recomputed when its dependencies update.

Performance

In my other PR I'm working on some queries that I need for a feature I'm interested in, but performing the lookup for parents of nodes was always a substantial hit when I was looking at my profiler. Partly this is a consequence of storing the ancestors as a comma-delimited string, which always seemed to be a performance sink whenever I touched it. When using the children array I found a couple of strange hydras when performing deep queries.

After struggling to get the queries down below 100-200ms I decided to look elsewhere for a solution.
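To make the old pain point concrete: with ancestry stored as one delimited string of uuids, a "descendants of X" query degrades to a substring/regex scan, while the new sub-document array makes it an indexable equality match. A hypothetical before/after (field names assumed):

```typescript
import mongoose from 'mongoose'

const someUuid = 'uuid-of-the-target-area' // placeholder value
const someAreaId = new mongoose.Types.ObjectId()

// Old shape (assumed): ancestry as one comma-delimited uuid string, so
// matching descendants needs a regex scan that can't use a normal index.
const oldQuery = { ancestors: { $regex: someUuid } }

// New shape: ancestor ids are real ObjectIds in an array of
// sub-documents, so the same question is an equality match that a
// multikey index can serve.
const newQuery = { 'embeddedRelations.ancestors._id': someAreaId }
```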

Code Sanity

While I was looking into area updates and coming to terms with the existing denormalization strategies, I found myself yearning for a single approach to be used, so that it could all be debugged and understood together - especially since they are absolutely related pieces of code.

It also made me increasingly perplexed that the children were specified as an array of ObjectId but the ancestors field is a string - and a string of uuid values at that.

To future-proof against this problem getting worse over time, and to resolve the inconsistencies, I decided to put the related data together.

Area Structure Updating

When I went to implement this feature, neither the array of children nor the ancestor string felt like a good way to express which area had which parent. I decided to make these explicitly derived fields, and then simply put a parent field on each area. The difference here is slightly more than philosophical, but without going into too much detail, my belief is that this is the easiest way to not get stuck in side-effect hell. I can expand more on that if needed, but for the time being it suffices to say that there's no way to accidentally end up with areas belonging to multiple parents, which would have caused cascading bugs.

So now you need only update an area's parent, and then recompute the paths from the altered node (which is quite speedy).
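As a sketch, the reparenting flow might look like the following - none of these function or model names are this PR's actual API; they're placeholders to show the shape of the idea:

```typescript
import { Model, Types } from 'mongoose'

// Hypothetical reparent-then-recompute flow (names are placeholders).
async function reparentArea (
  AreaModel: Model<any>,
  areaId: Types.ObjectId,
  newParentId: Types.ObjectId
): Promise<void> {
  // 1. The only authoritative write: move the parent pointer.
  await AreaModel.updateOne({ _id: areaId }, { $set: { parent: newParentId } })

  // 2. Everything else is derived, so rebuild it for the moved subtree.
  //    Because there is exactly one `parent` per document, an area can
  //    never end up belonging to two parents.
  await recomputeEmbeddedRelations(AreaModel, areaId)
}

// Rebuild each node's embedded ancestor list from its parent's freshly
// computed one, then recurse into its children.
async function recomputeEmbeddedRelations (
  AreaModel: Model<any>,
  areaId: Types.ObjectId
): Promise<void> {
  const node = await AreaModel.findById(areaId)
  if (node == null) return

  const parent = node.parent != null ? await AreaModel.findById(node.parent) : null
  const ancestors = parent != null
    ? [...parent.embeddedRelations.ancestors, { _id: parent._id, name: parent.area_name }]
    : []

  await AreaModel.updateOne(
    { _id: areaId },
    { $set: { 'embeddedRelations.ancestors': ancestors } }
  )

  for (const child of await AreaModel.find({ parent: areaId })) {
    await recomputeEmbeddedRelations(AreaModel, child._id)
  }
}
```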

Conclusion

The PR is more or less worthless if we want to specifically avoid a parent-field pattern. It should work well with the v2 milestone, especially for phasing out uuid under the hood. I can do more legwork to make sure it doesn't change the existing API, though that shouldn't be necessary for this PR.
