This repository uses the Python package pre-commit to manage pre-commit hooks. Pre-commit hooks are actions that run automatically, typically on each commit, to perform a common set of tasks. For example, a pre-commit hook might run a linter automatically and surface any warnings before code is committed, helping all of our code adhere to a consistent quality standard.
For this repository, we are using pre-commit for a number of purposes:
- Checking for secrets being committed accidentally (there is a strict definition of a "secret");
- Checking for any large files (over 5 MB) being committed; and
- Cleaning Jupyter notebooks, which means removing all outputs, execution counts, and Python kernels and, for Google Colaboratory (Colab), stripping out user information.
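To make the large-file check concrete, here is a minimal sketch of the idea in Python. It is not the hook's actual implementation: the function name find_large_files is hypothetical, and it assumes 1 MB means 1024 * 1024 bytes.

```python
from pathlib import Path

# Threshold taken from the list above: files over 5 MB are flagged.
# Assumption: 1 MB is treated as 1024 * 1024 bytes in this sketch.
MAX_BYTES = 5 * 1024 * 1024


def find_large_files(paths, max_bytes=MAX_BYTES):
    """Return the paths (as strings) whose size on disk exceeds max_bytes."""
    large = []
    for path in map(Path, paths):
        # Skip anything that is not a regular file (directories, missing paths).
        if path.is_file() and path.stat().st_size > max_bytes:
            large.append(str(path))
    return large
```

A real hook would receive the list of staged files from pre-commit and fail the commit if the returned list is not empty.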
We have configured pre-commit to run automatically on every commit. Running on each commit means pre-commit can flag contraventions before they enter the repository's history, which helps keep our repository in a healthy state.
⚠️ No pre-commit hooks will be run on Google Colab notebooks pushed directly to GitHub. For security reasons, it is highly recommended that you manually download your notebook and commit it locally, so that pre-commit hooks are executed on your changes.
In order for pre-commit to run, you need to configure it on your system:
- Install the pre-commit package into your Python environment from requirements.txt; and
- Run pre-commit install in your terminal to set up pre-commit to run when code is committed.
⚠️ The detect-secrets package does its best to prevent accidental committing of secrets, but it can't catch everything. It doesn't replace good software development practices! See the definition of a secret for further information.
We use detect-secrets to check that no secrets are accidentally committed. This hook requires you to generate a baseline file if one is not already present within the root directory. To create the baseline file, run the following at the root of the repository:
detect-secrets scan > .secrets.baseline
Next, audit the baseline that has been generated by running:
detect-secrets audit .secrets.baseline
When you run this command, you'll enter an interactive console and be presented with a list of high-entropy strings and/or anything which could be a secret, and asked to verify whether this is the case. By doing this, the hook will be in a position to know if you're later committing any new secrets to the repository, and it will be able to alert you accordingly.
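If you want to check whether any baseline entries still need auditing, the sketch below lists them. It rests on assumptions about the file layout that you should verify against your own .secrets.baseline: that it is a JSON document with a top-level "results" mapping from file name to a list of entries, and that the audit step records its verdict in an "is_secret" field. The function name unaudited_findings is hypothetical.

```python
import json


def unaudited_findings(baseline):
    """Return (filename, line_number, type) for entries with no audit verdict.

    Assumption: `baseline` is the parsed JSON of .secrets.baseline, with a
    top-level "results" mapping and an "is_secret" field on audited entries.
    """
    pending = []
    for filename, entries in baseline.get("results", {}).items():
        for entry in entries:
            if "is_secret" not in entry:
                pending.append(
                    (filename, entry.get("line_number"), entry.get("type"))
                )
    return pending


# Hypothetical usage, run from the root of the repository:
# with open(".secrets.baseline", encoding="utf-8") as f:
#     print(unaudited_findings(json.load(f)))
```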
The detect-secrets documentation, as of January 2021, says it works:
...by running periodic diff outputs against heuristically crafted [regular expression] statements, to identify whether any new secret has been committed.
This means it uses regular expression patterns to scan your code changes for anything that looks like a secret according to one or more of these patterns. By definition, there are only a limited number of patterns, so the detect-secrets package cannot detect every conceivable type of secret.
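The limitation is easier to see with a toy example. The sketch below scans lines against a small set of regular expressions. The patterns and the function name scan_added_lines are hypothetical and chosen for illustration only; they are not the patterns that detect-secrets uses.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the patterns
# shipped with detect-secrets.
PATTERNS = {
    "AWS-style access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Private key header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}


def scan_added_lines(lines):
    """Return (line_number, pattern_name) for every line matching a pattern."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

With these two patterns, a line such as password = "hunter2" passes unnoticed because nothing in the pattern set describes it, which is exactly why a finite list of patterns cannot catch every secret.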
To understand what types of secrets will be detected, read the caveats and the list of supported plugins that the package uses. Also, when a variable holds a secret, give it a name containing a word that will trip the KeywordDetector plugin; see the DENYLIST variable for the full list of words.
If pre-commit detects any secrets when you try to create a commit, it will detail what it found and where to go to check the secret.
If the detected secret is a false positive, there are two options to resolve this and prevent your commit from being blocked:
- inline allowlisting (recommended); or
- updating .secrets.baseline.
In either case, if an actual secret is detected (or a combination of actual secrets and false positives), first remove the actual secret before following either of these processes.
To exclude a false positive, add a pragma comment such as:
secret = "Password123" # pragma: allowlist secret
or
# pragma: allowlist nextline secret
secret = "Password123"
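The two pragma forms above can be read as a simple rule: a line is excluded if it carries the same-line pragma, or if the line immediately before it carries the nextline pragma. The sketch below expresses that rule in Python. It is an illustration of the behaviour described here, not the detect-secrets implementation, and the function name is_allowlisted is hypothetical.

```python
# The two pragma strings shown in the examples above.
SAME_LINE = "pragma: allowlist secret"
NEXT_LINE = "pragma: allowlist nextline secret"


def is_allowlisted(lines, index):
    """Return True if lines[index] is covered by either pragma form."""
    # Same-line form: the pragma sits in a comment on the flagged line itself.
    if SAME_LINE in lines[index]:
        return True
    # Nextline form: the pragma sits on the line directly above.
    return index > 0 and NEXT_LINE in lines[index - 1]
```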
If the detected secret is actually a secret (or other sensitive information), remove the secret and re-commit; there is no need to add any pragma comments.
If your commit contains a mixture of false positives and actual secrets, remove the actual secrets first before adding pragma comments to the false positives.
To exclude a false positive, you can also update the .secrets.baseline file by repeating the same two commands as in the initial setup.
During auditing, if the detected secret is actually a secret (or other sensitive information), remove the secret and re-commit. There is no need to update the .secrets.baseline file in this case.
If your commit contains a mixture of false positives and actual secrets, remove the actual secrets first before updating and auditing the .secrets.baseline file.
It may be necessary or useful to keep certain output cells of a Jupyter notebook, for example charts or graphs visualising some set of data. To do this, according to the documentation for the nbstripout package, either:
- Add a keep_output tag to the desired cell; or
- Add "keep_output": true to the desired cell's metadata.
You can access cell tags or metadata in Jupyter by enabling the "Tags" or "Edit Metadata" toolbar (View > Cell Toolbar > Tags; View > Cell Toolbar > Edit Metadata). For the tags approach, enter keep_output in the text field for each desired cell, and press the "Add tag" button. For the metadata approach, press the "Edit Metadata" button on each desired cell, and edit the metadata to look like this:
{
  "keep_output": true
}
This will tell the hook not to strip the resulting output of the desired cell(s), allowing the output(s) to be committed.
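If you prefer not to use the Jupyter toolbar, the same metadata can be set with a short script, since a .ipynb file is a JSON document. The sketch below is a minimal example under stated assumptions: the notebook uses the common layout with a top-level "cells" list, and the function name mark_keep_output and the file name analysis.ipynb are hypothetical.

```python
import json


def mark_keep_output(notebook, cell_indices):
    """Set "keep_output": true in the metadata of the chosen cells.

    Assumption: `notebook` is the parsed JSON of a .ipynb file with a
    top-level "cells" list.
    """
    for index in cell_indices:
        cell = notebook["cells"][index]
        # Create the metadata dictionary if the cell does not have one yet.
        cell.setdefault("metadata", {})["keep_output"] = True
    return notebook


# Hypothetical usage: keep the output of the third cell (index 2).
# with open("analysis.ipynb", encoding="utf-8") as f:
#     notebook = json.load(f)
# mark_keep_output(notebook, [2])
# with open("analysis.ipynb", "w", encoding="utf-8") as f:
#     json.dump(notebook, f, indent=1)
```

This can also be handy for a notebook downloaded from Colab, where tags and metadata cannot be edited in the browser.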
ℹ️ Currently (March 2020) there is no way to add tags and/or metadata to Google Colab notebooks. It's strongly suggested that you download the Colab as a .ipynb file, and edit tags and/or metadata using Jupyter before committing the code if you want to keep some outputs.