Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Codespace usage metrics #173

Merged
merged 4 commits into from
May 31, 2024
Merged

Add Codespace usage metrics #173

merged 4 commits into from
May 31, 2024

Conversation

Jongmassey
Copy link
Contributor

@Jongmassey Jongmassey commented May 10, 2024

@Jongmassey Jongmassey force-pushed the Jongmassey/codespaces-usage branch 10 times, most recently from 04ab973 to 73ab482 Compare May 17, 2024 11:35
@Jongmassey
Copy link
Contributor Author

Two things outstanding on this:

  • generating the required PAT for the opensafely org and adding to the prod environment
  • future schema changes to the github_codespaces table will require migrations (unlike the other tables that are dropped and recreated each time). There currently isn't a mechanism for this.

@Jongmassey Jongmassey marked this pull request as ready for review May 17, 2024 11:45
@Jongmassey Jongmassey force-pushed the Jongmassey/codespaces-usage branch 3 times, most recently from 67a9fa5 to b33bf6c Compare May 17, 2024 12:04
Copy link
Member

@iaindillingham iaindillingham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully the comments are useful and interesting. More generally, please take a moment to rework the commit messages. "How to Write a Git Commit Message" contains sound advice. Consider this example, from this PR:

Codespaces database table definition and metric

Use the creation timestamp as PK as near-impossible clash

Slight misuse of "metric" to store raw records of codespaces

Some questions:

  • What does this commit do? (imperative mood)
  • Why is the clash "near-impossible"? Why not "impossible"?
  • What's the 'Slight misuse of "metric"'? Why are we misusing "metric" at all?
  • What's wrong with sentences?

metrics/github/client.py Outdated Show resolved Hide resolved
Comment on lines 114 to 116
# The Repo type requires fields neither returned by the codespaces
# endpoint, nor required for codespaces metrics so str for repo name
repo: str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we do expect an instance of Repo when we see repo, don't we? The required fields of Repo are org, name, team, and created_on. We can populate three of these without effort, can we not? What's stopping us from making Repo.created_on optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

org and name are trivial. team isn't applicable here. We could get created_on in another trip to the API, or change Repo.from_dict() to not require it.

AFAICT we don't need any of these fields beyond the name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the data perspective, we don't need these fields. Conceptually, however, we represent repositories as instances of the Repo class. Here, you're introducing another representation of repositories: as instances of the str class.

Imagine returning to this code in the future. Why is PR.repo a Repo and Codespace.repo a str? The comment says something about an endpoint and metrics. Might the endpoint change? Might metrics change? We'd have to follow a path through the system to better understand the comment, especially if the endpoint changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a commit that switches this to using the Repo type and fills in all the required information for this type.

INSTALL.md Outdated Show resolved Hide resolved
metrics/github/github.py Outdated Show resolved Hide resolved
metrics/github/query.py Outdated Show resolved Hide resolved
tests/metrics/timescaledb/test_db.py Outdated Show resolved Hide resolved
tests/metrics/timescaledb/test_db.py Outdated Show resolved Hide resolved
tests/metrics/timescaledb/test_db.py Outdated Show resolved Hide resolved
tests/metrics/timescaledb/test_db.py Outdated Show resolved Hide resolved
tests/metrics/timescaledb/test_db.py Show resolved Hide resolved
@Jongmassey Jongmassey force-pushed the Jongmassey/codespaces-usage branch 4 times, most recently from 5e8fc42 to a3fe32e Compare May 21, 2024 21:02
@Jongmassey Jongmassey requested a review from iaindillingham May 21, 2024 21:14
@Jongmassey Jongmassey force-pushed the Jongmassey/codespaces-usage branch 4 times, most recently from b509b5a to 60e3873 Compare May 30, 2024 20:35
@Jongmassey
Copy link
Contributor Author

@iaindillingham I've renamed the "metrics" method to something more appropriate. I hope the other changes are as expected, I'd appreciate a re-review :)

Copy link
Member

@iaindillingham iaindillingham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, @Jongmassey. I've left a couple of comments, but if you just wanted to merge, then I'd completely understand 🙂. Thank you for all your hard work.

metrics/github/client.py Outdated Show resolved Hide resolved
metrics/github/github.py Outdated Show resolved Hide resolved
# the primary key constraint and telling the database to update
# the conflicting rows according to the SET clause
insert_stmt = insert_stmt.on_conflict_do_update(
constraint=constraint,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can't pass table.primary_key instead of constraint here, because table.primary_key.name == None?

If that's the case, then consider naming the primary key when you define the table and removing the logic that determines the value of constraint from upsert. Something like:

GitHubCodespaces = Table(
    "github_codespaces",
    metadata,
    Column("created_at", TIMESTAMP(timezone=True)),
    Column("organisation", Text),
    Column("repo", Text),
    Column("user", Text),
    Column("last_used_at", TIMESTAMP(timezone=True)),
    PrimaryKeyConstraint(
        "created_at", "organisation", "repo", "user", name="github_codespaces_pkey"
    ),
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By just passing table.primary_key it actually passes a set of columns, from which it infers the correct constraint based on the columns. With compound primary keys, this is known to be a bit broken! The way I've written it, it handles either explicitly or defaultly-named PK constraints, either single or multiple columns. I feel this is a pretty robust strategy!

I did actually initially write this with an explicit PrimaryKeyConstraint in the table definition, but felt that defaulting to the default constraint name was a bit cleaner.

metrics/timescaledb/db.py Show resolved Hide resolved
Define a Codespace dataclass containing required fields (see discussion
in opensafely-core/codespaces-initiative#42).
Rather than use an instance of the existing Repo dataclass to store
repo data, we only need the name and we only receive a
minimal amount of repo data from the API so just store the name as a
string. This is hopefully less confusing than modifying the Repo class
or populating the extra fields this class requires with dummy data.

An additional PAT is required to query codespaces for the opensafely
GitHub organisation. Any future querying of codespaces for other
organisations will require similarly permissioned PATs.

The organisation codespaces endpoint is queried and returned data is
passed unmodified to the Codespace dataclass's from_dict() method, which
does the required data conversion. This follows the pattern established
for the other domain dataclasses.
Use a broad composite key to ensure uniqueness and stability over time.
Candidate "id" fields from Codespaces API not used as minimally documented
by GitHub and thus untrustworthy. Large composite primary key does have
performance penalities, but for the expected data volume this is not
presumed to be significant.
This is required as codespaces' state will change over time, and their
data will disappear from the API once deleted so we are unable to
do a full refresh of the data each time.

Uses PostgreSQL "INSERT..ON CONFLICT.. UPDATE" style as newer
"MERGE" statement not yet supported in SQLAlchemy.
As the codespaces data will change over time and deleted codespaces
will disappear from the API, we do not drop and recreate the table
each time the task is run as per the other tasks. Instead, calling
the upsert() method ensures the table exists then merges the new
data with the existing.

Conversion method added to metrics.py is not a metric in the usual
sense of the word, but instead renames some fields to match the
database schema. This is neccesary due to different naming conventions
in the domain dataclasses and database tables.
@Jongmassey Jongmassey force-pushed the Jongmassey/codespaces-usage branch from 60e3873 to 3b7b084 Compare May 31, 2024 13:33
@Jongmassey Jongmassey merged commit 2d16d4b into main May 31, 2024
9 checks passed
@Jongmassey Jongmassey deleted the Jongmassey/codespaces-usage branch May 31, 2024 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add metrics for Codespaces usage
2 participants