-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Codespace usage metrics #173
Conversation
04ab973
to
73ab482
Compare
Two things outstanding on this:
|
67a9fa5
to
b33bf6c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully the comments are useful and interesting. More generally, please take a moment to rework the commit messages. "How to Write a Git Commit Message" contains sound advice. Consider this example, from this PR:
Codespaces database table definition and metric
Use the creation timestamp as PK as near-impossible clash
Slight misuse of "metric" to store raw records of codespaces
Some questions:
- What does this commit do? (imperative mood)
- Why is the clash "near-impossible"? Why not "impossible"?
- What's the 'Slight misuse of "metric"'? Why are we misusing "metric" at all?
- What's wrong with sentences?
metrics/github/github.py
Outdated
# The Repo type requires fields neither returned by the codespaces | ||
# endpoint, nor required for codespaces metrics so str for repo name | ||
repo: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we do expect an instance of Repo
when we see repo
, don't we? The required fields of Repo
are org
, name
, team
, and created_on
. We can populate three of these without effort, can we not? What's stopping us from making Repo.created_on
optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
org
and name
are trivial. team
isn't applicable here. We could get created_on
in another trip to the API, or change Repo.from_dict()
to not require it.
AFAICT we don't need any of these fields beyond the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the data perspective, we don't need these fields. Conceptually, however, we represent repositories as instances of the Repo
class. Here, you're introducing another representation of repositories: as instances of the str
class.
Imagine returning to this code in the future. Why is PR.repo
a Repo
and Codespace.repo
a str
? The comment says something about an endpoint and metrics. Might the endpoint change? Might metrics change? We'd have to follow a path through the system to better understand the comment, especially if the endpoint changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a commit that switches this to using the Repo
type and fills in all the required information for this type.
5e8fc42
to
a3fe32e
Compare
b509b5a
to
60e3873
Compare
@iaindillingham I've renamed the "metrics" method to something more appropriate. I hope the other changes are as expected, I'd appreciate a re-review :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, @Jongmassey. I've left a couple of comments, but if you just wanted to merge, then I'd completely understand 🙂. Thank you for all your hard work.
# the primary key constraint and telling the database to update | ||
# the conflicting rows according to the SET clause | ||
insert_stmt = insert_stmt.on_conflict_do_update( | ||
constraint=constraint, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can't pass table.primary_key
instead of constraint
here, because table.primary_key.name == None
?
If that's the case, then consider naming the primary key when you define the table and removing the logic that determines the value of constraint
from upsert
. Something like:
GitHubCodespaces = Table(
"github_codespaces",
metadata,
Column("created_at", TIMESTAMP(timezone=True)),
Column("organisation", Text),
Column("repo", Text),
Column("user", Text),
Column("last_used_at", TIMESTAMP(timezone=True)),
PrimaryKeyConstraint(
"created_at", "organisation", "repo", "user", name="github_codespaces_pkey"
),
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By just passing table.primary_key
it actually passes a set of columns, from which it infers the correct constraint based on the columns. With compound primary keys, this is known to be a bit broken! The way I've written it, it handles either explicitly or defaultly-named PK constraints, either single or multiple columns. I feel this is a pretty robust strategy!
I did actually initially write this with an explicit PrimaryKeyConstraint
in the table definition, but felt that defaulting to the default constraint name was a bit cleaner.
Define a Codespace dataclass containing required fields (see discussion in opensafely-core/codespaces-initiative#42). Rather than use an instance of the existing Repo dataclass to store repo data, we only need the name and we only receive a minimal amount of repo data from the API so just store the name as a string. This is hopefully less confusing than modifying the Repo class or populating the extra fields this class requires with dummy data. An additional PAT is required to query codespaces for the opensafely GitHub organisation. Any future querying of codespaces for other organisations will require similarly permissioned PATs. The organisation codespaces endpoint is queried and returned data is passed unmodified to the Codespace dataclass's from_dict() method, which does the required data conversion. This follows the pattern established for the other domain dataclasses.
Use a broad composite key to ensure uniqueness and stability over time. Candidate "id" fields from Codespaces API not used as minimally documented by GitHub and thus untrustworthy. Large composite primary key does have performance penalities, but for the expected data volume this is not presumed to be significant.
This is required as codespaces' state will change over time, and their data will disappear from the API once deleted so we are unable to do a full refresh of the data each time. Uses PostgreSQL "INSERT..ON CONFLICT.. UPDATE" style as newer "MERGE" statement not yet supported in SQLAlchemy.
As the codespaces data will change over time and deleted codespaces will disappear from the API, we do not drop and recreate the table each time the task is run as per the other tasks. Instead, calling the upsert() method ensures the table exists then merges the new data with the existing. Conversion method added to metrics.py is not a metric in the usual sense of the word, but instead renames some fields to match the database schema. This is neccesary due to different naming conventions in the domain dataclasses and database tables.
60e3873
to
3b7b084
Compare
closes opensafely-core/codespaces-initiative#42