geometry | output | colorlinks | linkcolor | link-citations | bibliography | abstract | header-includes | toc | toc-title | hyperrefoptions | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
left=3.5cm,right=3.5cm,top=2.5cm,bottom=2.5cm |
pdf_document |
true |
blue |
true |
|
Scientific software is often experimental and thus messy,
undocumented, not reusable. As a given piece of software grows
in user count, though, it can benefit in applying
development best practices which are well-known in the
industry setting.
This thesis will discuss a selection of these practices and show
their application for a `python` project named
`GWFish`, a Fisher matrix code used for gravitational wave
data analysis.
|
\usepackage{siunitx}
\usepackage{listings}
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour},
commentstyle=\color{codegreen},
keywordstyle=\color{magenta},
numberstyle=\tiny\color{codegray},
stringstyle=\color{codepurple},
basicstyle=\ttfamily\footnotesize,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
%numbers=left,
numbersep=5pt,
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2
}
\lstset{style=mystyle}
|
true |
Table of contents |
|
It is an opinion held by many that "scientists write bad code"; as evidence for this claim I will provide the large number of positive votes on this academia StackExchange post titled "Why do many talented scientists write horrible software?".
This is, however, a thorny statement which should be qualified: code may be "bad" according to the standards of software companies, but it may serve the purpose it is written to accomplish perfectly well, especially if this purpose is to act as a proof of concept, or as an exploratory step. However, the industry best practices that are missing from scientific code do have a reason to exist, and that reason often becomes apparent when projects become large enough, and/or used by enough people. Ignoring some of them is acceptable, and perhaps even advisable, for small projects, but for larger ones they start to matter more and more, and code not using them starts to accumulate technical debt, becoming difficult to read, maintain, modify, extend.
Resources on these best practices are plentiful, but are often focused on giving "industry" examples, which may not be too relevant to the scientific setting. When making an attempt to implement them in my own code, it was always a mental strain to "map" them to the kinds of issues I was interested in.
This work is an attempt to ease this process for others, providing some examples
of the practical application of these concepts in a practical, scientific context.
Using GWFish
[@harmsGWFishSimulationSoftware2022] to this end
is something of a perfect storm.
It is a young piece of software (its development started in earnest in early 2021,
although the ideas it implements are quite a bit older)
which serves a conceptually simple purpose, and which
has recently started to be useful to a large amount of people.
Because of this, it is worth spending time on refactoring it, and adopting some development best practices. As of the writing of this thesis this process is still underway, therefore unfortunately I will not be able to discuss as many completed modifications as I would like.
Also, GWFish
is written in python
, which is also the programming language I know best.
All discussions in this work will be limited to this language, and although some may
apply for others as well I cannot guarantee this.
The discussions will feature several suggestions of python
-specific
tools, and will assume a degree of familiarity with the language
and its more common scientific packages.
All the code snippets discussed in this thesis, as well as the source code for this document, are provided in full in the repository github.com/jacopok/clean-coding-thesis.
What follows is a short explanation of what this piece of software does,
aimed at non-physicists. I will first introduce the concept of gravitational
wave detection and the purpose of GWFish
at a general level, without
the mathematical details.
Then, I will go into more detail into the things it needs to compute,
including some math, but still at a level which an undergraduate
in a scientific field should be able to understand.
The first direct measurement of gravitational waves was accomplished in 2015 by the LIGO interferometers in the United States [@ligoscientificcollaborationandvirgocollaborationObservationGravitationalWaves2016]. In 2017, they were joined by the Virgo interferometer in Italy, and the network has since detected almost 100 distinct signals. Most of these signals were generated by pairs of black holes orbiting each other, with the remaining few corresponding to pairs of neutron stars or one neutron star and one black hole.
By lowering the noise in the interferometer, we went from no detection to about one signal per week during detector operation. This has already been called the birth of a new era of gravitational wave astronomy, which can complement "regular" electromagnetic astronomy.
A lot of interest is going towards the question: how can we do better? Which kinds of new detectors are best if we want to detect even more gravitational waves? These instruments are big, expensive projects, therefore a careful scientific evaluation of what we think a new detector concept might be able to accomplish is crucial when outlining a funding proposal.
The basic questions we wish to answer for any new detector concept are:
- Which kinds of gravitational wave sources will it be able to detect? How many of them?
- How well will it be able to estimate the properties of these sources?
The answers depend on two basic aspects:
- Given a specific astrophysical source, can new planned detector X measure it?
If so, how well?
- For example, consider a pair of black holes, with masses so-and-so, at a distance of such-and-such...
- How many of that astrophysical source kind are there? How far away are they?
- For example, what is the distribution of black hole binaries? What are their typical masses, how often do they occur in any given universe volume?
GWFish
is a piece of software built for the purpose of giving an approximate
answer to the first of these questions, while it relies on other tools to answer the second.
Typically, the workflow consists in the generation of a theoretical population of, for example,
binaries of black holes, which will contain several thousands of them, listing for each
their masses, distances, spin and so on. This will then be fed into a tool like GWFish
to
get detection statistics.
This section and the following will go in some detail about the mathematics of the approximation used in order to quickly get the required detection statistics.
Our starting point is matched filtering, a fundamental technique used for all modern gravitational data analysis. The idea is that we want to extract a very weak signal which is submerged in noise. A complete overview of the method can be found in [@maggioreGravitationalWavesVolume2007], here I will give a very brief one.
Our detectors are measuring strain, which roughly speaking is the fluctuation in
the length of the detector arms normalized to the arm length itself,
It is clear that the largest contribution in terms of amplitude is an oscillation
with period on the order of a couple of seconds, and amplitude on the order of a few times
The signal we want to measure, unfortunately, is at least three orders of magnitude smaller:
We are not even showing the part of the spectrum with period on the order a few seconds,
the trend at low frequency continues and the amplitude for
The first step towards actually measuring something lies in whitening the measured data, that is, dividing every Fourier component by its root-mean-square value. After this procedure, our data looks like this:
Still, no signal is visible! This is because the signal present in these data has high frequency and low amplitude; specifically, if we were to plot it subjected to the same procedure as the data, it would look like the orange curve:
So, how can we possibly detect something that is so far smaller in amplitude than our data?
The trick lies in a technique called matched filtering.
Suppose our measured data is
$$ I = \frac{1}{T} \int d(t) s(t) \mathrm{d}t = \underbrace{\frac{1}{T} \int n(t) s(t) \mathrm{d}t}{I_n} + \underbrace{\frac{1}{T} \int s(t) s(t) \mathrm{d}t}{I_s},, $$
where
This is the essence of gravitational data analysis: if we know the expected signal ahead of time, we may extract its contribution from noisy data. The integrals above are optimal if the case of white noise, but going back to the actually-measured data the procedure is slightly more complicated.
Let us now jump ahead to the final result: first,
we need to estimate the noise power spectral density (that is, the noise power per frequency bin)
This product is the basis for signal searches, which are performed by
looking for peaks in the following function of
for a selection ("template bank") of plausible signals
Similarly, parameter estimation for any observed signal
where
Up to now we discussed the analysis of current data; the question GWFish
seeks to answer,
on the other hand, pertains to data taken by detectors we have not built yet.
We can, however, make estimates as to what their noise level will be and go from there.
A typical question it answers could be posed as:
Given the estimated noise curve of the planned Einstein Telescope gravitational wave interferometer, suppose that two black holes with masses
$M_1 = M_2 = 30 M_{\odot}$ (solar masses) merge at a distance of$10^9$ light years from Earth.1 How well could we measure their masses, distance, and position from the gravitational wave data?
The "proper" way to answer this question would be to simulate noise distributed according to the given noise curve, add the known signal to it, and analyze it as if it were real data. This can be done, and it is done to a certain extent, but it is very expensive: a single analysis of this kind takes several hours to days. This is what is done for the data of current detectors, where we know it to be worth it since it's real astrophysical data).
This prevents us from exploring things such as the dependence of the results on things
such as the masses of the black holes, the distance and so on, as that requires re-doing
the aforementioned analysis several (thousands of) times.
The solution (or at least a partial one) is to make an approximation,
called the Fisher matrix approximation: basically, we take the aforementioned likelihood,
and approximate it as a multivariate Gaussian in the parameters
Taking the expectation value amounts to looking at the case in which the noise equals zero,
i.e.
This is the basic quantity GWFish
is computing; we are approximating the likelihood as a
Gaussian in the form
$$ \sigma^2 i \approx (\mathcal{F}^{-1}){ii},. $$
This is a frequentist phrasing; alternatively, it is equivalent to a Bayesian one
if we take a flat prior on the parameters,
The first step towards successfully managing a software project is version control: having a proper system to track changes to the code made by many people.
The most popular system for this purpose is by far and away git
,
which was developed by Linus Torvalds in 2005.
It is a powerful and fast distributed version control system.
Typically, git
is used in conjunction with a hosting server in which to store
the software project: common choices include github
,
gitlab
, bitbucket
.
GWFish
is hosted on github
, at the url github.com/janosch314/GWFish.
I will not introduce the basics of how to use git
here, since there are many excellent
resources for that; the discussion here is targeted at someone who is familiar with how to e.g. make commits, push to a remote repository and pull from it.
Recently, the group developing GWFish
has started adopting a formal process for the development of new features,
which quite closely matches Github flow.
This is not the only possible way of handling a collaborative workflow, but it is a rather simple
and effective one.
The idea is as follows: there is a main
branch, which is expected to include finished code features
and a usable version of the code.
If I want to add a new feature, I will make a feature branch off the main one.
The syntax is as follows:
git branch my-new-feature
git checkout my-new-feature
I can then work on this feature, making commits and pushing them to the repository. This way, my work can be public while not interfering with the main branch. While I will not voluntarily commit anything "wrong", building a new feature is necessarily experimental, and the branch is a safe place to possibly make mistakes.
When the feature seems to be good to go, I can then open a Pull Request (PR); the name is somewhat
unfortunate since it's not in the perspective of the one making it.
gitlab
calls them "merge requests", which is more descriptive: I am asking for my code
in the branch to be merged into the main one.
Regardless of the name, the concept is rather simple: a PR is a structured process with the end
goal of integrating code from a feature branch into the main one.
This could also be accomplished with a simple git command (git merge
), but the idea is that
code from a feature branch should be evaluated and discussed.
As an example I will provide a pull request
I made not for GWFish
but for pycbc
, an older and bigger piece of software
for gravitational wave data analysis.
There is no need to discuss the specifics of that PR, the point here is the structure
of what was happening: I made a feature branch, proposing a new feature.
I had previously outlined in an issue
the reason why this feature was needed --- a common pattern
is to start with an issue and to solve it with a PR.
A maintainer of the project reviewed the PR; technical issues and slight modifications were discussed and implemented until a version of the feature which was satisfactory to everyone was reached. To this end, the automated testing pipeline was quite useful: for each commit I made to the branch, the pipeline could run to check that I had not accidentally created a regression, i.e. broken some previously-working functionality.
Besides making it easier for developers to track their work, versioning is useful in order for users to be able to understand how and when the software they are using is changing, whether they should upgrade, whether the functionality they are using will be broken or discontinued.
The full extent of this information should be specified in long-form in a changelog, discussed in the next section, but a simple shorthand for it may be given by a standardized versioning system such as semantic versioning. It is a rather simple specification, with all versions2 looking like X.Y.Z with integer X, Y and Z. An alternative, which is recommended for large projects, is calendar-based versioning.
The advantage of using a standardized (and very widespread) system is that the meaning may be clear without requiring explanation. If it is not, or a user wants to know more, having a changelog is useful.
A changelog should be human-readable, and communicate in short to a user what changed from a version to another. As with versions, it is good for it to follow some standard, and a good one may be found at keep a changelog.
If we write good git commit messages it may be tempting to use them to automatically construct a changelog. This is a bad idea! They contain way too much information compared to what the user needs to know about the new version.
Is it worth it to use semantic versioning and to maintain a changelog? As always, it depends on the size of the project at hand, but I would argue that if you want people to be able to use it without needing personal contact with you, it is time to have a changelog.
Even if this is not the case, however, a changelog may be very useful: if a project is being version-tracked, the act of writing out what has been modified can be helpful in clarifying what is happening with the code.
A common source of issues in software development lies in dependency management. Ideally, we would like the user to be able to install our package without having to worry about its dependencies, with everything being handled automatically.
This has been a long-standing issue in the python
ecosystem, and a complete solution does
not exist. However, a lot of work is going into it, and the most promising approach seems to
be the management of a pyproject.toml
file with poetry
.
A usage guide for poetry
is beyond the scope of this work, and the documentation for it is
quite good.
It allows for the automatic creation of virtual environments containing only the dependencies
specified as required for the project, thus making it less likely that we will inadvertently use
functionality from dependencies we are not specifying.
It also automates checks for valid version combinations, as well as version control and publishing
a project to pypi
.
Overall, it is a very convenient system to manage the versions of our software and of its dependencies.
A crucial aspect in software usability is the presence of good documentation.
In my experience, when documentation is brought up people's mind often goes to comments in the code, or line comments; but these are not proper documentation. Documentation is meant for the user of our code, while line comments are at best useful for future developers. As Clean Code [@martinCleanCodeHandbook2008] puts it, line comments are a "necessary evil".
Sometimes, a simple-looking piece of code has a counterintuitive element, which may be clarified by a quick line comment; however, in most of their typical uses, line comments could be substituted by clearer code. The rest of this section, therefore, will not discuss line comments.
The diátaxis framework [@procidaDiataxisDocumentationFramework2022] allows us to structure our thinking about documentation according to the needs of the user, as opposed to our convenience when writing the code.
The website referenced above does an excellent job of explaining its categories, I will just give a quick summary here. The classification is along two axes: the first is based on whether the piece of documentation is meant to be used while studying or while working, while the second is based on whether the piece of documentation contains pratical steps or theoretical knowledge. Based on this, they distinguish the following categories:
- tutorials show the user at study practical steps in a safe environment: they are meant to show them how to get started using the software;
- how-to guides show the user at work practical steps in the real world: they are meant to show them how to accomplish some practical goal;
- reference material describes software in a way that is helpful for a user at work, by listing its features, providing examples etc.;
- explanation discusses the sofware broadly and theoretically, and is therefore meant for a user at study.
Reference material may be auto-generated in certain cases: for example,
the documentation-formatting software for Python
sphinx
has an
autodoc
extension which can read function and class docstrings and
extract them into HTML documentation.
This, however, is not the be-all and end-all of documentation, and it is arguably
not even the most important part of it.
Writing documentation may be a chore, and finding time to do it
is difficult.
This section is devoted to the thought process behind
the documentation I wrote for GWFish
,
and how it grew organically from user requirements.
I started writing it when I was asked to give a short tutorial on the usage of this
piece of software for people working on another gravitational wave detector proposal,
LGWA [@harmsLunarGravitationalwaveAntenna2021].
Having a spoken tutorial refer to written documentation is helpful, therefore
I wrote it down as a documentation page.
I used sphinx
combined with
readthedocs
to make it so documentation could
be version-tracked in the same repository as the code, as well as be deployed
automatically whenever changes were made there.
Since the aim of a tutorial is to introduce new users to the software, I focused mine on a simple test case: computing the Fisher matrix errors for a single signal, with a specific combination of future planned detectors. The tutorial is in two parts, here and here; it is basic in the sense that it assumes no prior knowledge of the software itself, but it does assume familiarity with the task of Fisher matrix forecasting.
The tutorial provides runnable scripts and their expected output, with an explanation of what that output means. It is computationally cheap, since it is meant to be a learning tool, not to solve any concrete problems.
This was the starting point; after the architecture (sphinx
+readthedocs
) for the documentation was built it was easy to
make small improvements and complement documentation pages with new information.
A few examples of things that were added are as follows:
- the conventions for the naming of relevant parameters are not uniform, therefore I added a page to the reference material section detailing the ones used by this code specifically;
- a piece of functionality for this code is the computation of a detection horizon (i.e. the maximum distance at which a certain signal can be detected), I discussed the line of reasonining behind this computation in the explanation section;
- users may want to add new, custom detectors to do their own experiments: this was a good candidate for a very brief how-to page, in which the steps to complete this real-world task are detailed; this page does not discuss the details of all the possible configurations for the detector, since that is material best deferred to a reference page.
Crucially, all of these arose from someone's need (often mine) of having a clear explanation of some aspect of the software readily available. Documentation grew organically, without any titanic one-time effort.
We should make sure the code we write works. That is fairly uncontroversial, but the way to practically test it is definitely nontrivial.
Often, in scientific code, what is tested are the end-to-end results of the code, but not the intermediate steps; these may only be tested informally, in an ad hoc way, if at all.
In the GWFish
case, end-to-end testing was performed by way of a comparison
with alternative software which did the same computations.
Specifically, cross-checks were performed between GWFish
, GWFast
[@iacovelliForecastingDetectionCapabilities2022]
and GWBench
[@borhanianGwbenchNovelFisher2021] as an activity within the
Einstein Telescope Observational Science Board,
which is an organization dedicated to developing the science case for the Einstein
Telescope proposal.
Since all these pieces of software are working with the same assumptions ---
Fisher matrix approximation, the same parametrization for the planned detectors,
etc. --- they should yield the same end result, and indeed, they did,
at least for the situations considered.
This effort can give us confidence that those versions of all codes were working correctly
--- reaching the same, wrong result with three independent approaches is of course possible but unlikely (or it is an intrisic feature of the assumptions made by all three).
This kind of testing is definitely useful, but it is not the main subject of this section. What we will discuss instead is how to make an automated test suite, which can be expanded as more features are added to the code, and which may be made complete enough that the fact it passes (i.e. runs without any errors) can give us a reasonable degree of confidence that our code is working sensibly.
We will not prove our code correct, but we can construct a series of checks that it is not failing in any silly way. This simplifies the development process significantly, since we can check at any time whether we have broken anything. Also, it is not too difficult an extension to run our test suite in an isolated environment with different software versions; this way we can make an informed claim about which ones our software supports and which it doesn't.
The golden standard in this regard is called test driven development, in which a workflow is adopted in which a test is written before the code which implements the feature it is testing. Before getting to that, however, we shall discuss the simpler task of how to test existing code: what paradigms and techniques we can use to construct good and convenient tests?
This section showcases how unit tests can be added to a relatively
simple section of code: a function within GWFish
meant to invert matrices,
with some extra restrictions.
Really, while testing it we will see that it does not really compute the
inverse of a matrix but a pseudoinverse; the tests here may be viewed
as exploratory, I wrote them as I would when testing a "black box"
function whose behaviour in edge cases is unknown.
Within GWFish
, an important step is the inversion of the Fisher matrix, which is
required in order to provide estimates of the errors on the parameters.
This by itself does not seem like a difficult task computationally:
after all, the matrices at hand are not very large (on the order of
The code which inverts the Fisher matrix within GWFish
looks like this:
import numpy as np
def invertSVD(matrix):
thresh = 1e-10
dm = np.sqrt(np.diag(matrix))
normalizer = np.outer(dm, dm)
matrix_norm = matrix / normalizer
[U, S, Vh] = np.linalg.svd(matrix_norm)
kVal = sum(S > thresh)
matrix_inverse_norm = U[:, 0:kVal] @ np.diag(1. / S[0:kVal]) @ Vh[0:kVal, :]
return matrix_inverse_norm / normalizer
How can we investigate whether this code will correctly invert a matrix?
Let us start by building the simplest kind of test possible, which will already allow us to showcase some ideas about automated testing.
We start by adding a testing function to the same script as the one in which the
invertSVD
function is defined.
The basic paradigm in testing is, of course, to run the code with some input
and see whether it produces the correct result.
We will use a symmetric matrix, since all Fisher matrices are symmetric. We will not check exact equality, since when working with floating point numbers that cannot be guaranteed.
def test_matrix_inversion():
matrix = np.array([[1, 3], [3, 4]])
inverse = invertSVD(matrix)
inverse_should_be = np.array([[-4/5, 3/5], [3/5, -1/5]])
return np.allclose(inverse, inverse_should_be)
if __name__ == '__main__':
print(test_matrix_inversion())
As expected, when running the script we get the result True
;
on the other hand, if we change one of the numbers in the inverse_should_be
matrix we get False
.
Several problems with this appear right away: do we really need to manually compute matrix inverses to test our code? We will get to that; first, though, let us refactor our test.
As is, we need to actively look at the output of the script in order to see whether our test has succeeded or failed. Also, the test and the actual code live in the same file, which is not great: as we add more tests, it will become a source of clutter.
Our first refactoring step lies in moving the test code to its own script.
Also, as opposed to returning a boolean value, we will use an assert
statement.
assert
is a convenient tool for debugging and testing:
a statement like assert x
will not do anything if x
is truthy,3
while it will fail with an AssertionError
if x
is falsey.
So, our code will look like:
from gwfish_matrix_inverse import invertSVD
import numpy as np
def test_matrix_inversion():
matrix = np.array([[1, 3], [3, 4]])
inverse = invertSVD(matrix)
inverse_should_be = np.array([[-4/5, 3/5], [3/5, -1/5]])
assert np.allclose(inverse, inverse_should_be)
if __name__ == '__main__':
test_matrix_inversion()
and, unlike before, it will now not output anything if everything is working correctly, and raise an error if not (try it!).
The next step is to try the same thing with the
pytest
library.
We first need to install it (pip install pytest
); after that, in
the folder containing these files, we may simply run:
$ pytest
============================= test session starts ==============================
platform linux -- Python 3.9.11, pytest-7.1.3, pluggy-1.0.0
rootdir: /home/jacopo/Documents/clean-coding-thesis/scripts/testing_2
collected 1 item
test_matrix_inverse.py . [100%]
============================== 1 passed in 0.07s ===============================
pytest
is able to go through the files in the folder, see that
one of them has a name starting with test_
, inside it
there's a function starting with test_
, run that function, find no
errors, give us a success!
Note that now calling the function test_matrix_inversion
in the script is not
required anymore: we may safely remove those last two lines.
What happens if the test fails? Here, pytest
really shines:
let us change one of the numbers in the should-be inverse, and re-run the same
command:
$ pytest
============================= test session starts ==============================
platform linux -- Python 3.9.11, pytest-7.1.3, pluggy-1.0.0
rootdir: /home/jacopo/Documents/clean-coding-thesis/scripts/testing_2
collected 1 item
test_matrix_inverse.py F [100%]
=================================== FAILURES ===================================
____________________________ test_matrix_inversion _____________________________
def test_matrix_inversion():
matrix = np.array([[1, 3], [3, 4]])
inverse = invertSVD(matrix)
inverse_should_be = np.array([[-4/5, 3/6], [3/5, -1/5]])
> assert np.allclose(inverse, inverse_should_be)
E assert False
E + where False = <function allclose at 0x7f25d5d27280>(array([[-0.8, 0.6],
[ 0.6, -0.2]]), array([[-0.8, 0.5],\n [ 0.6, -0.2]]))
E + where <function allclose at 0x7f25d5d27280> = np.allclose
test_matrix_inverse.py:10: AssertionError
=========================== short test summary info ============================
FAILED test_matrix_inverse.py::test_matrix_inversion - assert False
============================== 1 failed in 0.12s ===============================
pytest
ran our test code and found an error.
It tells us exactly where it found it, and specifically how it came about:
we used the np.allclose
function to check for the equality of two matrices
which it prints out, so we can see at a glance what has gone wrong.
Of course, we might not understand where the problem is right away in general, but this all is about convenience, and having the tools at hand to spot as many details as possible.
Often, the information shown by pytest
will still not be enough.
A really convenient thing to be able to do, then, is to start from the
failure and work interactively with the variables defined at that time:
this way, we can do all sorts of manipulations, or even make plots if we
want!
This is easily achieved by adding the --pdb
flag to our call to pytest
.
pdb
, "python debugger", is a standard tool for debugging, and it has
plenty of features worth exploring. Here I will give just a flavor of the
possibilities.
The shell looks like this:
$ pytest --pdb
[... same output as before ...]
test_matrix_inverse.py:10: AssertionError
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> PDB post_mortem (IO-capturing turned off) >>>>>>>>>>>>>>>>>>>
> /home/jacopo/Documents/clean-coding-thesis/scripts/testing_2/
test_matrix_inverse.py(10)test_matrix_inversion()
-> assert np.allclose(inverse, inverse_should_be)
(Pdb) matrix @ inverse
array([[ 1.00000000e+00, -1.11022302e-16],
[ 4.44089210e-16, 1.00000000e+00]])
(Pdb) matrix @ inverse_should_be
array([[ 1.0000000e+00, -1.0000000e-01],
[-4.4408921e-16, 7.0000000e-01]])
(Pdb) q
=========================== short test summary info ============================
FAILED test_matrix_inverse.py::test_matrix_inversion - assert False
!!!!!!!!!!!!!!!!!!! _pytest.outcomes.Exit: Quitting debugger !!!!!!!!!!!!!!!!!!!
It is hard to show in a fixed medium such as this, but after the test failure
I was presented with a shell prompt (Pdb)
,
from which I could give arbitrary python
commands.
I used it to compute the matrix product between the initial matrix
and the computed inverse (with the numpy shortcut A@B
=$AB$ in a matricial sense),
and the same with the manually-written inverse,
which showed that the computed inverse was indeed correct.
So far, we have tested the output of our code against a manually computed "correct result". This is OK as far as it goes, but is necessarily only checks a few examples which we hope will be relevant, but which might not cover all edge cases.
In many situations, we may be able to find an invariant in our code, which
we expect to hold regardless of input.
In this matrix inversion scenario, this is particularly simple: the defining property
of the inverse
The first step in this direction is to refactor our test so that it can accept any matrix: the following implementation uses the same matrix as before, but now we check the aforementioned property as opposed to the specific inverse.
MATRIX = np.array([[1, 3], [3, 4]])
def test_matrix_inversion_constant_matrix(matrix = MATRIX):
inverse = invertSVD(matrix)
assert np.allclose(inverse@matrix, np.eye(*matrix.shape))
assert np.allclose(matrix@inverse, np.eye(*matrix.shape))
The method np.eye
is a convenient way to generate an identity
matrix with arbitrary shape.
This is the first step; what we could now do is to construct a method which generates random matrices and feed it to the algorithm.
This would already be quite good, but there is a better way, thanks to the
hypothesis
library.
We are starting out on a fairly complex example (a matrix full of floating point numbers) for it --- if we were manipulating, say, strings, things could be quite a bit simpler ---; however, it is the typical kind of task that we might need in a scientific context.
After installing hypothesis
with the numpy
extra (pip install hypothesis[numpy]
),
we may use it as follows:
from gwfish_matrix_inverse import invertSVD
import numpy as np
from hypothesis import given
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays
@given(arrays(np.float64, (2, 2)))
def test_matrix_inversion_hypothesis(matrix):
inverse = invertSVD(matrix)
assert np.allclose(inverse@matrix, np.eye(*matrix.shape))
assert np.allclose(matrix@inverse, np.eye(*matrix.shape))
The @given
decorator is what tells hypothesis
to provide us with some
test data, of the kind specified in its argument: for us, numpy arrays
.
We then specify the data type (floating point numbers), the shape (which
for now we keep as 2x2, we will generalize this later), and any extra conditions.
The data provided by hypothesis
will not be random but arbitrary:
it will purposefully try extreme examples, trying to get our code
to break. We will then be able to constrain the parameter space
we allow it to explore when attempting this if we wish.
If we run this code, we get an immediate failure: I will not clutter
this document with the full output, but the command to run is still pytest
,
which fails by raising an error in the SVD step with the falsifying example:
Fair enough: the inverse of the zero matrix does not exist. One would now probably think of somehow restricting the examples to invertible matrices, but let us try to simply put a lower bound on the numbers allowed, so that they cannot be zero:
from gwfish_matrix_inverse import invertSVD
import numpy as np
from hypothesis import given
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays
@given(arrays(np.float64, (2, 2), elements=st.floats(min_value=1e-20)))
def test_matrix_inversion_hypothesis(matrix):
inverse = invertSVD(matrix)
assert np.allclose(inverse@matrix, np.eye(*matrix.shape))
assert np.allclose(matrix@inverse, np.eye(*matrix.shape))
As expected, the code fails with a singular matrix,
but surprisingly it does not raise an error: instead, it returns the matrix
The test then fails on the step of checking that this is the inverse (since it is not).
This is getting to the problem which really does occur
in these computations: the Fisher matrix
We really should test this only the kinds of inputs we expect to be possible.
Doing so in this case turned out to be tricky but possible with hypothesis
.
Gravitational-wave Fisher matrices can always be written as
The following test generates arbitrary vectors
The conditions
from gwfish_matrix_inverse import invertSVD
import numpy as np
from hypothesis import given, reject, target, seed
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays
import pytest
MATRIX_DIMENSION = 4
ABS_TOLERANCE = 1e-1
REL_TOLERANCE = 1e-2
MIN_NORM = 1e-5
MAX_NORM = 1e5
@seed(1)
@given(
vector_norms=arrays(
np.float64,
(MATRIX_DIMENSION,),
elements=st.floats(
min_value=MIN_NORM,
max_value=MAX_NORM,
),
unique=True,
),
cosines=arrays(
np.float64,
(MATRIX_DIMENSION, MATRIX_DIMENSION),
elements=st.floats(
min_value=-1.0,
max_value=1.0,
),
unique=True,
),
)
def test_matrix_pseudoinverse_hypothesis(vector_norms, cosines):
cosines[np.arange(MATRIX_DIMENSION), np.arange(MATRIX_DIMENSION)] = 1
cosines = np.maximum(cosines, cosines.T)
matrix = np.outer(vector_norms, vector_norms) * cosines
inverse = invertSVD(matrix)
assert np.allclose(
inverse @ matrix @ inverse, inverse, atol=ABS_TOLERANCE, rtol=REL_TOLERANCE
)
assert np.allclose(
matrix @ inverse @ matrix, matrix, atol=ABS_TOLERANCE, rtol=REL_TOLERANCE
)
This is not the be-all-and-end all for this kind of test: for one, the tolerances were hand-picked, and a better understanding of the numerical problem is desirable. Still, this showcases how powerful this property-based testing framework is. Not all tests can be formally specified in this way, but many can, and if possible writing a test of this sort is very powerful.
Ideally, we would like to not rely on a specific version of our dependencies,
but instead to support a range of versions for them.
Testing every possible combination is unfeasible,
but we can go to some length by at least checking every version
of our "main" dependencies.
For example, we can check that our software is installable
with every currently supported version of python
.
This section will discuss a tool to automate this:
tox
.
It allows us to run our tests in a freshly created
virtual environment with arbitrary package versions.
In order to use it we need to structure our code snippet as a package,
so we will use poetry
with a pyproject.toml
file as follows:
[tool.poetry]
name = "matrix_inverse"
version = "0.1.0"
description = ""
authors = ["jacopo <[email protected]>"]
packages = [{include = "matrix_inverse"}]
[tool.poetry.dependencies]
python = ">=3.8, <3.12"
pytest = "^7.2.0"
hypothesis = "^6.56.4"
numpy = "^1.23.4"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Then, we need to move our code to a folder called matrix_inverse
.
Further, for consistency we will move the test code in a folder
called tests
.
Finally, we can install tox
and create a tox.ini
file containing
[tox]
skipsdist = true
envlist = py{38,39,310,311}
isolated_build = true
[testenv]
deps =
poetry
commands =
poetry install
poetry run pytest {posargs}
Then we are done! We can run tox
from the shell, which will create
virtual environments with our dependencies for python 3.8, 3.9, 3.10
and 3.11; it will then run the tests within these.
This takes a while the first time and the output is very long, but it ends with
__________________________________ summary ___________________________________
py38: commands succeeded
py39: commands succeeded
py310: commands succeeded
py311: commands succeeded
congratulations :)
We can check that the system is working properly by trying to exploit
some functionality that is available only in certain python releases.
For example, the package tomllib
was added to the standard python
library in version 3.11; before, one would have had to import it manually.
So, if without further changes we add a line import tomllib
to our module or test code, we expect to get an error for versions
3.10 and below; indeed, the output is
__________________________________ summary ___________________________________
ERROR: py38: commands failed
ERROR: py39: commands failed
ERROR: py310: commands failed
py311: commands succeeded
Running such a combination of tests can get expensive as the
test suite grows, but it may be worth it, especially if we require backward
compatibility.
For example, I often use the latest python
release, but sometimes develop
code that will be run on clusters whose installations are only updated
irregularly; it is therefore important for me to be able to easily check that
the code I am writing is compatible with the version installed there.
Writing tests is time-consuming, but once they have been written they are extremely useful. When is a good time to write them?
If we are developing a new feature, a good answer to this is: before writing the code that implements the feature. This practice is known as test-driven development (TDD), and it is more than twenty years old. The idea is that writing the test first forces the developer to think about what the thing they are implementing should accomplish.
If this seems hard, that's a feature, not a bug: writing code without being able to formulate a test case for it is indicative of the code being not well-thought-out, or perhaps not modular enough.
This may be applied to new code, but often the real situation is that we have a large module without any tests: what to do then? One approach, which will be showcased in the next chapter, is to start by creating some end-to-end tests to document the expected behavior of the code, and then refactor / simplify it as needed (since it will probably need refactoring).
Another opportunity to write tests is in the event of bugs or problems: if we encounter
a failure for a certain input to our code, we can calcify that input as a test case.
In pytest
, the @pytest.mark.xfail
decorator can help in clarifying: the
test we just wrote is known to be currently failing.
When the bug is fixed, the test should succeed, and it can be kept in our suite, allowing
us to avoid regressing on that bug fix.
Finally, the kind of test development shown in the previous sections is a sort of "exploratory testing", which is yet another context in which tests can be written. If we have some functionality which we do not completely understand the behavior of, perhaps because it was written by somebody else, or maybe it even lies in a different package. As we are trying to make sense of it --- as we would need to anyway --- we can write tests cementing our understanding of its behavior. Besides allowing us to clarify our understanding in this exploratory phase, the tests will remain there acting as a safeguard against changes in the functionality, in the case that it is externally managed. For example, if we are relying on an external library which gets an update, we can quickly check whether it breaks our use case.
All previous sections have been around code more than about it: how to track its versions, document it, test it. Indeed, the aforementioned topics should typically have a higher priority than making changes to the code. Eventually, though, we will have to modify it: how and why should we?
In order to make the discussion here somewhat constrained, I will limit it to the refactoring of one single function, purposefully doing so without modifying anything else.
The computation of the errors from Fisher matrices in GWFish
goes through a function
called analyzeFisherErrors
, which takes the following parameters:
network
, an object of typeNetwork
, which contains severalDetector
objects;parameter_values
, apandas
DataFrame
with each row representing a signal, and with each column corresponding to a different event parameter (masses, distance etc.);fisher_parameters
, a list containing the aforementioned parameters in order;population
, a string containing the name of the population being considered;networks_ids
, a list of lists of integers, each between 0 and the number of detectors minus one.
The idea with the networks_ids
parameter is the following: suppose we have a network
of three detectors, A B and C, and that we want to be able to compute the capabilities
they would have all together or as pairs of 2.
We can accomplish this with network_ids=[[0, 1], [1, 2], [0, 2], [0, 1, 2]]
:
the numbers in each sub-list are indices used to select the detectors
making up a subnetwork.
This function performs the following tasks for each of the subnetworks:
- compute the overall signal-to-noise ratio (SNR) as the root-mean-square of the individual SNRs;
- compute the network Fisher matrix as the sum of the individual Fisher matrices;
- compute the covariance matrix by inverting the network Fisher matrix;
- compute the sky localization ellipse size, which is determined by the errors on the sky position angles (right ascension and declination);
- write all the computed errors to a text file.
The function does not return anything. The code is reported here for reference, with minor changes to allow for correct printing. The full one can be found in this snapshot of the repository.
def analyzeFisherErrors(network, parameter_values, fisher_parameters, population, networks_ids):
"""
Analyze parameter errors.
"""
# Check if sky-location parameters are part of Fisher analysis.
# If yes, sky-location error will be calculated.
signals_havesky = False
if ('ra' in fisher_parameters) and ('dec' in fisher_parameters):
signals_havesky = True
i_ra = fisher_parameters.index('ra')
i_dec = fisher_parameters.index('dec')
signals_haveids = False
if 'id' in parameter_values.columns:
signals_haveids = True
signal_ids = parameter_values['id']
parameter_values.drop('id', inplace=True, axis=1)
npar = len(fisher_parameters)
ns = len(network.detectors[0].fisher_matrix[:, 0, 0]) # number of signals
N = len(networks_ids)
detect_SNR = network.detection_SNR
network_names = []
for n in np.arange(N):
network_names.append('_'.join(
[network.detectors[k].name for k in networks_ids[n]])
)
for n in np.arange(N):
parameter_errors = np.zeros((ns, npar))
sky_localization = np.zeros((ns,))
networkSNR = np.zeros((ns,))
for d in networks_ids[n]:
networkSNR += network.detectors[d].SNR ** 2
networkSNR = np.sqrt(networkSNR)
for k in np.arange(ns):
network_fisher_matrix = np.zeros((npar, npar))
if networkSNR[k] > detect_SNR[1]:
for d in networks_ids[n]:
if network.detectors[d].SNR[k] > detect_SNR[0]:
network_fisher_matrix += np.squeeze(
network.detectors[d].fisher_matrix[k, :, :])
if npar > 0:
network_fisher_inverse = invertSVD(network_fisher_matrix)
parameter_errors[k, :] = np.sqrt(
np.diagonal(network_fisher_inverse))
if signals_havesky:
sky_localization[k] = (
np.pi * np.abs(np.cos(parameter_values['dec'].iloc[k]))
* np.sqrt(network_fisher_inverse[i_ra, i_ra]
* network_fisher_inverse[i_dec, i_dec]
- network_fisher_inverse[i_ra, i_dec]**2))
delim = " "
header = ('network_SNR '+delim.join(parameter_values.keys())+
" "+delim.join(["err_" + x for x in fisher_parameters])
)
ii = np.where(networkSNR > detect_SNR[1])[0]
save_data = np.c_[networkSNR[ii],
parameter_values.iloc[ii],
parameter_errors[ii, :]]
if signals_havesky:
header += " err_sky_location"
save_data = np.c_[save_data, sky_localization[ii]]
if signals_haveids:
header = "signal "+header
save_data = np.c_[signal_ids.iloc[ii], save_data]
file_name = ('Errors_' + network_names[n] + '_'
+ population + '_SNR' + str(detect_SNR[1]) + '.txt'
)
if signals_haveids and (len(save_data) > 0):
np.savetxt('Errors_' + network_names[n] + '_'
+ population + '_SNR' + str(detect_SNR[1]) + '.txt',
save_data,
delimiter=' ',
fmt='%s ' + "%.3E " * (len(save_data[0, :]) - 1),
header=header,
comments=''
)
else:
np.savetxt(
'Errors_' + network_names[n] + '_' + population +
'_SNR' + str(detect_SNR[1]) + '.txt',
save_data,
delimiter=' ',
fmt='%s ' + "%.3E " * (len(save_data[0, :]) - 1),
header=header,
comments=''
)
This function was not written from scratch like this: it shows the signs of a simple function being gradually extended to "work above its pay grade", which made it quite complex. It has many switches and several layers of loops. It features some duplication, which is a violation of the so-called "DRY principle" (Don't Repeat Yourself): similar but different parts of the code being activated in different situations are dangerous, since it is easy to modify one but not the other.
The next sections will discuss how these issues can be addressed, but
one thing needs to be made explicit: if this was one-off code, it would be
fine as-is.
The necessity to refactor it came from actual user needs --- specifically,
a user required hdf5
output for the errors, which would not be easily achievable
with the function as written here.
As opposed to adding another if
clause, we can refactor the logic
and make it more modular.
Writing the function in a "messy" way and then refactoring it if needed is a good process: the messy code is often easier to write quickly, it does not require us to think about which abstractions we should use and how each of the modular components we make maybe used in other contexts. Premature abstraction can create technical debt just like forests of for loops can. So, this section should not be interpreted as "fixing bad code", but as a natural part of the development process: starting with the easiest-to-write code that will get the job done, and then refining it.
The first thing that catches the eye is that this function is doing two "big" things at once: calculations (combining Fisher matrices etc.) and input-output (I/O: writing out to a file). Combining them may seem natural in a first formulation, since once we have computed these values we want to save them somewhere, but this poses several issues.
- If a user wishes to use the computed errors, they have no way to access them directly, and if they do not modify the source code they must save to a file and then read from that file;
- in this case, loss of information: the values are being saved in scientific notation with three decimals in the mantissa, which means all information beyond these digits is lost;
- tests are really difficult to write for such a function: it has side effects, meaning that when it is run it saves things to disk somewhere, as opposed to being fully characterized by its output.
A simple solution to this is to split the function in two: one for the computation, one for the output.
We can do better: several sub-tasks in this function can be modularized, with the guiding principle of having each function perform a single conceptual task.
This function's call signature is a good example of how type hints
can be useful. Without reading the function code, how could we be able to tell
that, for example, population
should be a string while network
should be
an object of type Newtork
?
Documenting the function could be a solution, but is not the best one for this purpose. Despite our best efforts, it is not guaranteed to remain up-to-date; also, we may get it wrong, and there is no way to automatically check it.
In this context, since we are writing the code of a relatively large module, it makes sense to use type hinting: python is dynamically typed, so it allows us to not say anything about the types of the variables we are using, but we do have functionality to specify which types we expect.
The syntax to do it in this case is as follows:
def analyzeFisherErrors(
network: det.Network,
parameter_values: pd.DataFrame,
fisher_parameters: list[str],
population: str,
networks_ids: list[list[int]]
) -> None:
Python by itself will completely ignore these (as long as they can be evaluated without
raising an error), which means they are only as good
as documentation written in a comment; however, there are ways in which they
can be better.
For one, they are returned when calling help(analyzeFisherErrors)
in a console.
That is, however, also true for the docstring of the function.
The real power of type annotations in python
is that they can be formally checked
by a tool, called a static type checker,
which will raise an error if, for example, a function is called with the wrong types,
or if it uses its arguments in a way that is not compatible with their stated types.
There are several of these available, but here we will focus on mypy
.
This is a very useful tool to make spotting errors easier in a big codebase; since it is only used on the module code, it allows us to have clarity on the types within our module while retaining the flexibility of dynamic types when we use our code in an interactive console or a script.
After installing mypy
(pip install mypy
), we may use it from the command line
with the command
mypy fishermatrix.py
where fishermatrix.py
is the name of the module file containing the analyzeFisherErrors
function.
By itself this will raise several errors, of the sort
auxiliary.py:1: error: Skipping analyzing "scipy": module is installed,
but missing library stubs or py.typed marker
for many packages beyond scipy
. This reflects the fact that these imported packages are
not adopting type hints themselves, which means calls to them cannot be checked.
This reflects the fact that type hinting is a relatively recent addition to the
python
ecosystem, and many large libraries have not adopted it.
This does not prevent us from using mypy
in our own code, however, it might
just limit its usefulness; in order to get rid of the error we can run mypy
with the option --ignore-missing-imports
.
With it, we get
$ mypy fishermatrix.py --ignore-missing-imports
Success: no issues found in 1 source file
This is good! Note that the file also contains other, non-type-hinted functions,
which does not create any issue.
We could enforce typing on every function with the option --strict
.
We can make sure that mypy
is indeed checking the body of the function
by making the signature wrong in some way: for example, if we change the
hint on the network_ids
argument to networks_ids: int
(as opposed to
networks_ids: list[list[int]]
) we get the error
$ mypy fishermatrix.py --ignore-missing-imports
fishermatrix.py:52: error: Argument 1 to "len" has incompatible type "int"; expected "Sized"
fishermatrix.py:59: error: Value of type "int" is not indexable
fishermatrix.py:66: error: Value of type "int" is not indexable
fishermatrix.py:74: error: Value of type "int" is not indexable
Found 4 errors in 1 file (checked 1 source file)
mypy
is able to understand that we are using this parameter
as a list (for example, indexing it), so if it were an integer it
would not be valid.
Several errors can be caught this way.
Git commits often get polluted with meaningless whitespace changes, newlines added somewhere, and so on. This takes away from our ability to understand what was actually changed. Also, if different parts of the code are written by different people, the style will often look inconsistent.
This is a somewhat minor thing, but having automatic formatting as a part
of our development routine ensures consistency and makes all code
easier to read.
There are several choices for this task, and I personally like
black
(of course, this is something
that should be agreed on within a project).
For python
, standard style is defined by PEP8;
while conforming to it is not a requirement, it is used widely enough that
code formatted according to it will be readily understandable to
a large amount of people.
After installing it with pip install black
, we may format any file or folder
with a command such as:
$ black fishermatrix.py
reformatted fishermatrix.py
All done!
1 file reformatted.
This is the output of the first run, while subsequent ones will show a message such as
All done!
1 file left unchanged.
We can also use automated analysis tools to check our code for style and compliance to best practices: this is called linting.
When running pylint
, a common linter, on this function, we get a few warnings,
shortened here for brevity:
$ pylint fishermatrix.py
************* Module fishermatrix
fishermatrix.py:38:0: C0301: Line too long (114/100) (line-too-long)
fishermatrix.py:12:0: C0116: Missing function or method docstring (missing-function-docstring)
fishermatrix.py:15:4: C0103: Variable name "dm" doesn't conform to snake_case naming style (invalid-name)
fishermatrix.py:27:0: C0103: Function name "analyzeFisherErrors" doesn't conform to snake_case naming style (invalid-name)
fishermatrix.py:27:0: R0914: Too many local variables (29/15) (too-many-locals)
fishermatrix.py:27:0: R0912: Too many branches (15/12) (too-many-branches)
Some of these are rather generic warnings, which can be disabled or changed;
however, they do point to some issues we discussed earlier: this function
has too many tasks, and this naturally shows up in its number of local variables,
branches (e.g. if
s and for
s).
The heuristics used by a given linter are not gospel, but they may give us a helpful indication about which parts of the code were not carefully written.
Things such as automatic formatting, linting, and even static type checking,
are very useful if applied consistently.
However, we are fallible and may forget to do so;
also, typing black <name.py>
often gets annoying.
There is a nice way, however, to ensure that badly-formatted code does not have the chance to get into our version tracking: git hooks.
Using them is not very difficult: we just need to create a file called
pre-commit
in the folder .git/hooks/
.
Its content, as a bash script, will be executed every time we make a commit;
if it returns a non-zero code (which, for example, black
will do if
it needed to modify the files it found) it will abort the commit.
However, it will have modified the relevant files; re-adding them
will allow us to make a commit with only correctly-formatted files.
The code in analyzeFisherErrors
is often using the syntax for i in np.arange(N)
in order to loop
over N
numbers, as opposed to the native python for i in range(N)
.
The logic behind it was to improve performance: after all, numpy
is typically faster than native python
for vector and matrix operations.
However, using it in this context is a case of premature optimization. Benchmarking the whole function for this example is overkill, so let us use a simpler example: suppose we want to compute (but not store) the sums of the first 1000 integers. The native-python solution is
for x in range(1000):
x+x
which takes about \qty{20}{\micro\second} on my machine (measured with the ipython
magic %timeit
).
The alternative with np.arange
,
for x in np.arange(1000):
x+x
takes roughly double that: \qty{43}{\micro\second}.
The way to really get a speed improvement in this context would be to ditch the for
loop completely,
and directly work with numpy
vectors:
x_vec = np.arange(1000):
x_vec + x_vec
This takes roughly \qty{1.4}{\micro\second}.
Really, though, the for
loops within analyzeFisherErrors
are not the bottleneck in its evaluation, and this function is a rather fast component of the code.
I would argue that in this case readability matters more than speed;
even if np.arange
was slightly faster than range
,
it would still be worth it to use the simpler
native syntax in order to have less visual clutter.
Here performance is secondary to speed, but sometimes it may not be.
In those cases, proper profiling is essential.
There are several tools for this in the python
ecosystem.
For targeted optimization of a single function, I have had great success
with cProfile
combined with
the visualization tool snakeviz
:
they can provide a breakdown of each function call happening inside the target function,
allowing us to understand which the potentially slow parts are.
Another great tool which has recently been rising in popularity is
scalene
[@bergerScaleneScriptingLanguageAware2020]. It allows for the profiling
of memory usage, as well as distinguishing time spent in native python
versus time spent running wrappers around C
code.
This is useful since a common strategy to accelerate python
code
is indeed to take the numerical kernel of our computations and have it
be computed by efficient, compiled C
code.
If we made a mistake and the C
code is not being run, evaluation
may be slow: scalene
can aid in spotting such issues.
When refactoring, an important part of the job is to ensure that we are not breaking existing functionality.
So, before modifying anything substantial we should be covered by a test
(or multiple).
A minimal example of calling the analyzeFisherErrors
function looks like
the following. Note that, while it is formatted as a test, it is currently
not performing any actual testing! As written, this is no more than a
smoke test.4
import pytest
from fishermatrix import analyzeFisherErrors
from detection import Network, Detector
import pandas as pd
import numpy as np
def test_fisher_analysis_output():
params = {
"mass_1": 1.4,
"mass_2": 1.4,
"redshift": 0.01,
"luminosity_distance": 40,
"theta_jn": 5 / 6 * np.pi,
"ra": 3.45,
"dec": -0.41,
"psi": 1.6,
"phase": 0,
"geocent_time": 1187008882,
}
parameter_values = pd.DataFrame()
for key, item in params.items():
parameter_values[key] = np.full((1,), item)
fisher_parameters = list(params.keys())
network = Network(
detector_ids=["ET"],
parameters=parameter_values,
fisher_parameters=fisher_parameters,
config="detectors.yaml",
)
network.detectors[0].fisher_matrix[0, :, :] = fishermatrix.FisherMatrix(
"gwfish_TaylorF2",
parameter_values.iloc[0],
fisher_parameters,
network.detectors[0],
)
network.detectors[0].SNR[0] = 100
analyzeFisherErrors(
network=network,
parameter_values=parameter_values,
fisher_parameters=fisher_parameters,
population="test",
networks_ids=[[0]],
)
The fact that the minimum code required to run this function is so large is an indication of the high degree of coupling in the codebase: a part of it cannot function independently of the others.
In this case, testing the output seems to be hard to do since our code is writing out a file.
Specifically, running this code leads to a file named Errors_ET_test_SNR8.0.txt
being created,
with the content:
network_SNR mass_1 mass_2 redshift luminosity_distance theta_jn ra dec psi phase geocent_time err_mass_1 err_mass_2 err_redshift err_luminosity_distance err_theta_jn err_ra err_dec err_psi err_phase err_geocent_time err_sky_location
100.0 1.400E+00 1.400E+00 1.000E-02 4.000E+01 2.618E+00 3.450E+00 -4.100E-01 1.600E+00 0.000E+00 1.187E+09 1.018E-07 1.018E-07 8.969E-08 2.322E+00 1.042E-01 3.127E-03 2.694E-03 2.042E-01 4.093E-01 5.639E-05 2.423E-05
An option would be to read the file as a part of the test and then delete it, but that is somewhat risky: the deletion might have issues (especially if the test fails), and if the file is still there for the next test we have created non-independent test cases.
There are various ways around this, but I will use it to showcase the concept of mocking: substituting a complex function with a simplified version, which does not have its full functionality, but which allows us to check that the complex function would have been called correctly.
In our case, we can mock the numpy.savetxt
function, so that our test does not actually
save anything to a file, but we just check that we would save the correct thing.
We can do so with the pytest-mock
library, and it looks like this:
import pytest
from fishermatrix import analyzeFisherErrors
import fishermatrix
import waveforms
import detection
from detection import Network, Detector
import pandas as pd
import numpy as np
def test_fisher_analysis_output(mocker):
params = {
"mass_1": 1.4,
"mass_2": 1.4,
"redshift": 0.01,
"luminosity_distance": 40,
"theta_jn": 5 / 6 * np.pi,
"ra": 3.45,
"dec": -0.41,
"psi": 1.6,
"phase": 0,
"geocent_time": 1187008882,
}
parameter_values = pd.DataFrame()
for key, item in params.items():
parameter_values[key] = np.full((1,), item)
fisher_parameters = list(params.keys())
network = Network(
detector_ids=["ET"],
parameters=parameter_values,
fisher_parameters=fisher_parameters,
config="detectors.yaml",
)
network.detectors[0].fisher_matrix[0, :, :] = fishermatrix.FisherMatrix(
"gwfish_TaylorF2",
parameter_values.iloc[0],
fisher_parameters,
network.detectors[0],
)
network.detectors[0].SNR[0] = 100
mocker.patch("numpy.savetxt")
analyzeFisherErrors(
network=network,
parameter_values=parameter_values,
fisher_parameters=fisher_parameters,
population="test",
networks_ids=[[0]],
)
header = (
"network_SNR mass_1 mass_2 redshift luminosity_distance "
"theta_jn ra dec psi phase geocent_time err_mass_1 err_mass_2 "
"err_redshift err_luminosity_distance err_theta_jn err_ra "
"err_dec err_psi err_phase err_geocent_time err_sky_location"
)
data = [
1.00000000000e02,
1.39999999999e00,
1.39999999999e00,
1.00000000000e-02,
4.00000000000e01,
2.61799387799e00,
3.45000000000e00,
-4.09999999999e-01,
1.60000000000e00,
0.00000000000e00,
1.18700888200e09,
1.01791427671e-07,
1.01791427689e-07,
8.96883449508e-08,
2.32204133549e00,
1.04213847237e-01,
3.12695677565e-03,
2.69412953826e-03,
2.04240222976e-01,
4.09349000642e-01,
5.63911212310e-05,
2.42285325663e-05,
]
assert np.savetxt.call_args.args[0] == "Errors_ET_test_SNR8.0.txt"
assert np.allclose(np.savetxt.call_args.args[1], data)
assert np.savetxt.call_args.kwargs == {
"delimiter": " ",
"header": header,
"comments": "",
}
With this test code ready as a "safety net", we can move on to the refactoring.
I refactored the function as follows:
def sky_localization_area(,
`
network_fisher_inverse: np.ndarray,
declination_angle: np.ndarray,
right_ascension_index: int,
declination_index: int,
) -> float:
"""
Compute the 1-sigma sky localization ellipse area starting
from the full network Fisher matrix inverse and the inclination.
"""
return (
np.pi
* np.abs(np.cos(declination_angle))
* np.sqrt(
network_fisher_inverse[right_ascension_index, right_ascension_index]
* network_fisher_inverse[declination_index, declination_index]
- network_fisher_inverse[right_ascension_index, declination_index] ** 2
)
)
def compute_fisher_errors(
network: det.Network,
parameter_values: pd.DataFrame,
fisher_parameters: list[str],
sub_network_ids: list[int],
) -> tuple[np.ndarray, np.ndarray, Optional[np.ndarray]]:
"""
Compute Fisher matrix errors for a network whose
SNR and Fisher matrices have already been calculated.
Will only return output for the n_above_thr signals
for which the network SNR is above network.detection_SNR[1].
Returns:
network_snr: array with shape (n_above_thr,)
Network SNR for the detected signals.
parameter_errors: array with shape (n_above_thr, n_parameters)
One-sigma Fisher errors for the parameters.
sky_localization: array with shape (n_above_thr,) or None
One-sigma sky localization area in steradians,
returned if the signals have both right ascension and declination,
None otherwise.
"""
n_params = len(fisher_parameters)
n_signals = len(parameter_values)
assert n_params > 0
assert n_signals > 0
signals_havesky = False
if ("ra" in fisher_parameters) and ("dec" in fisher_parameters):
signals_havesky = True
i_ra = fisher_parameters.index("ra")
i_dec = fisher_parameters.index("dec")
detector_snr_thr, network_snr_thr = network.detection_SNR
parameter_errors = np.zeros((n_signals, n_params))
if signals_havesky:
sky_localization = np.zeros((n_signals,))
network_snr = np.zeros((n_signals,))
detectors = [network.detectors[d] for d in sub_network_ids]
network_snr = np.sqrt(sum((detector.SNR**2 for detector in detectors)))
for k in range(n_signals):
network_fisher_matrix = np.zeros((n_params, n_params))
for detector in detectors:
if detector.SNR[k] > detector_snr_thr:
network_fisher_matrix += detector.fisher_matrix[k, :, :]
network_fisher_inverse = invertSVD(network_fisher_matrix)
parameter_errors[k, :] = np.sqrt(np.diagonal(network_fisher_inverse))
if signals_havesky:
sky_localization[k] = sky_localization_area(,
`
network_fisher_inverse, parameter_values["dec"].iloc[k], i_ra, i_dec
)
detected = np.where(network_snr > network_snr_thr)[0]
if signals_havesky:
return (
network_snr[detected],
parameter_errors[detected, :],
sky_localization[detected],
)
return network_snr[detected], parameter_errors[detected, :], None
def output_to_txt_file(
parameter_values: pd.DataFrame,
network_snr: np.ndarray,
parameter_errors: np.ndarray,
sky_localization: Optional[np.ndarray],
fisher_parameters: list[str],
filename: str,
) -> None:
delim = " "
header = (
"network_SNR "
+ delim.join(parameter_values.keys())
+ " "
+ delim.join(["err_" + x for x in fisher_parameters])
)
save_data = np.c_[network_snr, parameter_values, parameter_errors]
if sky_localization is not None:
header += " err_sky_location"
save_data = np.c_[save_data, sky_localization]
row_format = "%s " + " ".join(["%.3E" for _ in range(save_data.shape[1] - 1)])
np.savetxt(
filename + ".txt",
save_data,
delimiter=" ",
header=header,
comments="",
fmt=row_format,
)
def errors_file_name(
network: det.Network, sub_network_ids: list[int], population_name: str
) -> str:
sub_network = "_".join([network.detectors[k].name for k in sub_network_ids])
return (
"Errors_"
+ sub_network
+ "_"
+ population_name
+ "_SNR"
+ str(network.detection_SNR[1])
)
def analyze_and_save_to_txt(
network: det.Network,
parameter_values: pd.DataFrame,
fisher_parameters: list[str],
sub_network_ids_list: list[list[int]],
population_name: str,
) -> None:
for sub_network_ids in sub_network_ids_list:
network_snr, errors, sky_localization = compute_fisher_errors(
network=network,
parameter_values=parameter_values,
fisher_parameters=fisher_parameters,
sub_network_ids=sub_network_ids,
)
filename = errors_file_name(
network=network,
sub_network_ids=sub_network_ids,
population_name=population_name,
)
output_to_txt_file(
parameter_values=parameter_values,
network_snr=network_snr,
parameter_errors=errors,
sky_localization=sky_localization,
fisher_parameters=fisher_parameters,
filename=filename,
)
A summary of the changes made is as follows.
- The functionality of
analyzeFisherErrors
was split into five:compute_fisher_errors
,sky_localization_area
,output_to_txt
,errors_file_name
,analyze_and_save_to_txt
. These all contain logically distinct sections of the code, which we may desire to modify independently of each other. - The
compute_fisher_errors
function now only considers one subnetwork, and the looping over subnetworks is relegated to the higher-level functionanalyze_and_save_to_txt
. - The check for
npar>0
which nested the whole function by one layer was changed to anassert
statement at the beginning of the function --- if the number of parameters is less than zero the whole function call is invalid, and something has gone wrong. - A few computations were happening in different places, such as
detector selection (evaluating
network.detectors[d]
) or a check that the network SNR was greater than a threshold value. They were unified. - The computation of
newtorkSNR
was compacted.
The resulting code is still not perfect, of course. For one, too many parameters are being passed amongst these functions. This makes them coupled, long, complex.
However, it is now straightforward to use the compute_fisher_errors
function within a script, if we do not need to output to a file.
Also, it is equally straightforward to make a function analogous to
analyze_and_save_to_txt
but which saves to a format different from
txt
.
A way to ameliorate this issue will entail modifying more than just
analyzeFisherErrors
. Probably, the most convenient approach will be to
restructure the classes in the whole package,
probably with one representing the whole population analyzed:
the functions in the refactored code are all working on the same data
(a table of parameter values, a list of which parameters to consider
for the Fisher analysis, a table of Fisher errors etc.).
This thesis discussed how some best practices can be applied in order to make software more maintainable, user-friendly, readable.
Several aspects have not been discussed here, one of the most important of which
is large-scale module organization and object-oriented programming:
the structure of classes, the possibility to use inheritance, interfaces, and so on.
The main reason why this topic was omitted, besides its inherent complexity,
is that concretely applying it to GWFish
will take a large effort, and it has
not been done yet.
Several improvements are underway for GWFish
, both in terms of new functionality
(such as allowing the treatment of different kinds of signals which are difficult to model
with the current infrastructure) and of the application of the concepts discussed
in this thesis.
The main takeaway I hope a reader will have from this thesis is in terms of a sort of "Maslow hierarchy"5 of software needs. The main topics discussed are in rough order of urgency: if a piece of software is not being properly versioned that is the first problem to be solved, since we could not reliably keep track of any changes applied otherwise. The relative importance of testing and documentation is debatable, but both are surely more urgent than cosmetic and even structural changes such as the ones discussed in the "Refactoring" chapter.
Beautiful, clean and clever code is the equivalent of "self-actualization" in the hierarchy of needs. It is wonderful to be able to reach it, but there are several steps before it which need to be fulfilled for the clever snippet of code to be useful.
Implementing a robust process for the development of a software project that is growing large is not simple, but it is definitely worth it, and an initial time investment can have a significant payoff.
Footnotes
-
The problem as stated is under-specified, there are several other parameters to consider, but let us keep it simple here. ↩
-
All full releases; prereleases are allowed to be in the form e.g. X.Y.Z-alpha. ↩
-
"Truthy" in this context means that, when casted to a boolean value, it will be cast to
True
. For example,0
is falsey, while all other numbers are truthy. ↩ -
The terms may originate in analog circuit design: this "test" is the equivalent of plugging in a soldered circuit board and seeing whether smoke is coming out of it. The absence of smoke is not a guarantee of the correctness of the results. ↩
-
This is a reference to the notorious Maslow hierarchy of needs for humans, starting with a need for food and shelter and going all the way to self-actualization and transcendence. ↩