Add Codespace usage metrics #173

Jongmassey · 2024-05-10T16:04:18Z

closes opensafely-core/codespaces-initiative#42

Jongmassey · 2024-05-17T11:45:39Z

Two things outstanding on this:

generating the required PAT for the opensafely org and adding to the prod environment
future schema changes to the github_codespaces table will require migrations (unlike the other tables that are dropped and recreated each time). There currently isn't a mechanism for this.

iaindillingham

Hopefully the comments are useful and interesting. More generally, please take a moment to rework the commit messages. "How to Write a Git Commit Message" contains sound advice. Consider this example, from this PR:

Codespaces database table definition and metric

Use the creation timestamp as PK as near-impossible clash

Slight misuse of "metric" to store raw records of codespaces

Some questions:

What does this commit do? (imperative mood)
Why is the clash "near-impossible"? Why not "impossible"?
What's the 'Slight misuse of "metric"'? Why are we misusing "metric" at all?
What's wrong with sentences?

metrics/github/client.py

iaindillingham · 2024-05-20T06:56:06Z

metrics/github/github.py

+    # The Repo type requires fields neither returned by the codespaces
+    # endpoint, nor required for codespaces metrics so str for repo name
+    repo: str


But we do expect an instance of Repo when we see repo, don't we? The required fields of Repo are org, name, team, and created_on. We can populate three of these without effort, can we not? What's stopping us from making Repo.created_on optional?

org and name are trivial. team isn't applicable here. We could get created_on in another trip to the API, or change Repo.from_dict() to not require it.

AFAICT we don't need any of these fields beyond the name

From the data perspective, we don't need these fields. Conceptually, however, we represent repositories as instances of the Repo class. Here, you're introducing another representation of repositories: as instances of the str class.

Imagine returning to this code in the future. Why is PR.repo a Repo and Codespace.repo a str? The comment says something about an endpoint and metrics. Might the endpoint change? Might metrics change? We'd have to follow a path through the system to better understand the comment, especially if the endpoint changed.

I've added a commit that switches this to using the Repo type and fills in all the required information for this type.

INSTALL.md

metrics/github/github.py

metrics/github/query.py

tests/metrics/timescaledb/test_db.py

Jongmassey · 2024-05-30T20:37:41Z

@iaindillingham I've renamed the "metrics" method to something more appropriate. I hope the other changes are as expected, I'd appreciate a re-review :)

iaindillingham

This looks great, @Jongmassey. I've left a couple of comments, but if you just wanted to merge, then I'd completely understand 🙂. Thank you for all your hard work.

metrics/github/client.py

metrics/github/github.py

iaindillingham · 2024-05-31T12:46:48Z

metrics/timescaledb/db.py

+            # the primary key constraint and telling the database to update
+            # the conflicting rows according to the SET clause
+            insert_stmt = insert_stmt.on_conflict_do_update(
+                constraint=constraint,


I guess we can't pass table.primary_key instead of constraint here, because table.primary_key.name == None?

If that's the case, then consider naming the primary key when you define the table and removing the logic that determines the value of constraint from upsert. Something like:

GitHubCodespaces = Table( "github_codespaces", metadata, Column("created_at", TIMESTAMP(timezone=True)), Column("organisation", Text), Column("repo", Text), Column("user", Text), Column("last_used_at", TIMESTAMP(timezone=True)), PrimaryKeyConstraint( "created_at", "organisation", "repo", "user", name="github_codespaces_pkey" ), )

By just passing table.primary_key it actually passes a set of columns, from which it infers the correct constraint based on the columns. With compound primary keys, this is known to be a bit broken! The way I've written it, it handles either explicitly or defaultly-named PK constraints, either single or multiple columns. I feel this is a pretty robust strategy!

I did actually initially write this with an explicit PrimaryKeyConstraint in the table definition, but felt that defaulting to the default constraint name was a bit cleaner.

metrics/timescaledb/db.py

Define a Codespace dataclass containing required fields (see discussion in opensafely-core/codespaces-initiative#42). Rather than use an instance of the existing Repo dataclass to store repo data, we only need the name and we only receive a minimal amount of repo data from the API so just store the name as a string. This is hopefully less confusing than modifying the Repo class or populating the extra fields this class requires with dummy data. An additional PAT is required to query codespaces for the opensafely GitHub organisation. Any future querying of codespaces for other organisations will require similarly permissioned PATs. The organisation codespaces endpoint is queried and returned data is passed unmodified to the Codespace dataclass's from_dict() method, which does the required data conversion. This follows the pattern established for the other domain dataclasses.

Use a broad composite key to ensure uniqueness and stability over time. Candidate "id" fields from Codespaces API not used as minimally documented by GitHub and thus untrustworthy. Large composite primary key does have performance penalities, but for the expected data volume this is not presumed to be significant.

This is required as codespaces' state will change over time, and their data will disappear from the API once deleted so we are unable to do a full refresh of the data each time. Uses PostgreSQL "INSERT..ON CONFLICT.. UPDATE" style as newer "MERGE" statement not yet supported in SQLAlchemy.

As the codespaces data will change over time and deleted codespaces will disappear from the API, we do not drop and recreate the table each time the task is run as per the other tasks. Instead, calling the upsert() method ensures the table exists then merges the new data with the existing. Conversion method added to metrics.py is not a metric in the usual sense of the word, but instead renames some fields to match the database schema. This is neccesary due to different naming conventions in the domain dataclasses and database tables.

Jongmassey force-pushed the Jongmassey/codespaces-usage branch 10 times, most recently from 04ab973 to 73ab482 Compare May 17, 2024 11:35

Jongmassey marked this pull request as ready for review May 17, 2024 11:45

Jongmassey force-pushed the Jongmassey/codespaces-usage branch 3 times, most recently from 67a9fa5 to b33bf6c Compare May 17, 2024 12:04

iaindillingham reviewed May 20, 2024

View reviewed changes

Jongmassey force-pushed the Jongmassey/codespaces-usage branch 4 times, most recently from 5e8fc42 to a3fe32e Compare May 21, 2024 21:02

Jongmassey requested a review from iaindillingham May 21, 2024 21:14

Jongmassey force-pushed the Jongmassey/codespaces-usage branch 4 times, most recently from b509b5a to 60e3873 Compare May 30, 2024 20:35

iaindillingham approved these changes May 31, 2024

View reviewed changes

Jongmassey added 3 commits May 31, 2024 14:33

Jongmassey force-pushed the Jongmassey/codespaces-usage branch from 60e3873 to 3b7b084 Compare May 31, 2024 13:33

Jongmassey merged commit 2d16d4b into main May 31, 2024
9 checks passed

Jongmassey deleted the Jongmassey/codespaces-usage branch May 31, 2024 14:27

Jongmassey mentioned this pull request Jun 3, 2024

Add metrics for Codespaces usage opensafely-core/codespaces-initiative#42

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Codespace usage metrics #173

Add Codespace usage metrics #173

Jongmassey commented May 10, 2024 •

edited

Loading

Jongmassey commented May 17, 2024

iaindillingham left a comment

iaindillingham May 20, 2024

Jongmassey May 20, 2024

iaindillingham May 21, 2024

Jongmassey May 21, 2024

Jongmassey commented May 30, 2024

iaindillingham left a comment

iaindillingham May 31, 2024

Jongmassey May 31, 2024

Add Codespace usage metrics #173

Add Codespace usage metrics #173

Conversation

Jongmassey commented May 10, 2024 • edited Loading

Jongmassey commented May 17, 2024

iaindillingham left a comment

Choose a reason for hiding this comment

iaindillingham May 20, 2024

Choose a reason for hiding this comment

Jongmassey May 20, 2024

Choose a reason for hiding this comment

iaindillingham May 21, 2024

Choose a reason for hiding this comment

Jongmassey May 21, 2024

Choose a reason for hiding this comment

Jongmassey commented May 30, 2024

iaindillingham left a comment

Choose a reason for hiding this comment

iaindillingham May 31, 2024

Choose a reason for hiding this comment

Jongmassey May 31, 2024

Choose a reason for hiding this comment

Jongmassey commented May 10, 2024 •

edited

Loading