---
title: Scientific coding best practices
author: Jacopo Tissino
date: March 10, 2022
geometry: left=3.5cm,right=3.5cm,top=3cm,bottom=3cm
output: pdf_document
colorlinks: true
linkcolor: blue
---
What follows is a rough list of topics to touch on, which should be expanded on with detailed reasoning and examples. The idea for the examples is as follows: try to write them consistently, in python, and make them at least somewhat realistic --- give the reader a way to understand how the patterns described can actually help their code.
Maybe mention external, proprietary tools which could be used for the more advanced stuff (e.g. Github Actions for continuous integration), while trying not to be too platform-specific and giving alternatives.
- the ideal requirements for code, in rough order of importance:
  - doing what it's supposed to (!)
  - being tracked with a version control system (→ `git`)
  - documentation: other people (and you in 6 months' time) will know how to use it
  - testing (→ `pytest`, `tox`): it will be harder to make the code non-functional
  - using good software design, with a focus on maintainability and extensibility
    - refactoring code is much safer and easier if it comes after a suite of tests is already in place to check that the changes are not breaking existing functionality, which is why this is last
- the importance of open source software and science accessibility (→ `zenodo`)
- the interplay between rapid development and modularity/generality
- discuss how the principles here should be taken with a grain of salt, in that:
  - scientists are paid to write papers and not beautiful code
  - however, well-organized code can easily save time in the long run and avoid many headaches
  - maybe we can make some concrete examples?
- not reinventing the wheel: for many "common" problems, somebody has already made a better solution than what you can make quickly
- use branches to develop new features!
- semantic versioning for public code: it's important to have a consistent set of rules according to which the version number changes
  - the same logic applies to conventional commits
  - but git logs are not changelogs: instead, make a proper CHANGELOG
- the possibility to use Continuous Integration (→ `tox`, CircleCI, Github Actions)
- making semantic commits after many changes have been made, using `git add --patch ...`
- using pre-commit hooks to make auto-formatting and other checks painless: `pre-commit` combined with `black` for auto-formatting, `flake8` for linting, `mypy` for type checking, checks that no large files are being committed, and so on
- watch out for dependency management in modules! Document it (`pyproject.toml`), possibly in an automated way (→ `poetry`)
- issue tracking: a clear pipeline for handling the inevitable problems
- having a paradigm for how branches are managed:
  - for example, git flow works for bigger projects with hard versioning and a need to maintain old versions,
  - while github flow is simpler and works well for smaller projects, or ones with a continuous development schedule
- having a proper test suite which can be run at will, as opposed to "ad hoc" tests of some property: code is really easy to break, and running all the tests when something changes is a very good way to spot newly introduced problems (→ `pytest`, `unittest`)
  - also, half of the benefit of testing is that, in order to be testable, code cannot be a tangled mess of interdependent pieces: by testing the code, one is also forced to modularize it, thereby improving it
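For instance, a minimal `pytest` test file might look like this (the function under test is an invented example):

```python
# test_scale_factor.py: run the whole suite with `pytest`
import pytest

def redshift_to_scale_factor(z: float) -> float:
    """Cosmological scale factor, a = 1 / (1 + z)."""
    if z <= -1:
        raise ValueError("redshift must be larger than -1")
    return 1 / (1 + z)

def test_zero_redshift_gives_unit_scale_factor():
    assert redshift_to_scale_factor(0) == 1

def test_unphysical_redshift_raises():
    with pytest.raises(ValueError):
        redshift_to_scale_factor(-2)
```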
- property-based testing, as opposed to unit testing: test that some condition holds for "randomly chosen" inputs, as opposed to just using some selected input-output pairs (→ `hypothesis`)
- parametrizing tests to check more behavior (→ `pytest.mark.parametrize`)
- benchmarking code inside the test suite (→ `pytest-benchmark`)
- using fixtures to simplify setup and tear-down
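A sketch combining these last few points, reusing the invented function from above:

```python
import pytest
from hypothesis import given, strategies as st

def redshift_to_scale_factor(z: float) -> float:
    return 1 / (1 + z)

# property-based test: the condition must hold for any generated input
@given(st.floats(min_value=0, max_value=1e6))
def test_scale_factor_lies_in_unit_interval(z):
    assert 0 < redshift_to_scale_factor(z) <= 1

# parametrized test: several input-output pairs, a single test function
@pytest.mark.parametrize("z, expected", [(0, 1), (1, 0.5), (3, 0.25)])
def test_known_values(z, expected):
    assert redshift_to_scale_factor(z) == pytest.approx(expected)

# fixture: shared setup, which pytest injects into the tests that need it
@pytest.fixture
def example_redshifts():
    return [0.0, 0.5, 1.0, 2.0]

def test_scale_factor_decreases_with_redshift(example_redshifts):
    factors = [redshift_to_scale_factor(z) for z in example_redshifts]
    assert factors == sorted(factors, reverse=True)
```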
- test-driven development (for things whose purpose is already known, which is not always the case, but it should often be)
- running tests with a debugger (→ `pdb`)
  - it takes a while to get used to, but it makes finding issues way easier than inserting `print` statements everywhere
- measuring test coverage, not "to get 100%" but to get an idea of the thoroughness of the tests
- maybe mention clever ideas like mutation testing (→ `mutmut`): a "test for your tests", in which the source code itself is randomly modified in certain places (say, changing a number, or negating a boolean...) to see whether the tests still pass --- if they do, they might not be capturing some issue with the code
- beautiful abstractions! when and how to use them (→ `abc`, `Protocol`)
  - decoupling implementation from usage
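A minimal sketch with `abc` (all names invented for illustration): the function at the bottom depends only on the abstract interface, so any conforming implementation can be swapped in without changing it.

```python
import random
from abc import ABC, abstractmethod

class Sampler(ABC):
    """Abstract interface: usage code does not care about the implementation."""

    @abstractmethod
    def sample(self, n: int) -> list[float]:
        ...

class UniformSampler(Sampler):
    def sample(self, n: int) -> list[float]:
        return [random.random() for _ in range(n)]

def sample_mean(sampler: Sampler, n: int) -> float:
    # depends only on the abstraction, not on a concrete sampler
    return sum(sampler.sample(n)) / n

print(sample_mean(UniformSampler(), 10_000))  # close to 0.5
```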
- avoiding "magic numbers" in the code
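For example:

```python
SPEED_OF_LIGHT_M_S = 299_792_458  # a named constant instead of a magic number

def check_speed(speed: float) -> None:
    # `speed > 299792458` would be mysterious; the name documents
    # both the meaning and the units
    if speed > SPEED_OF_LIGHT_M_S:
        raise ValueError("superluminal speed!")
```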
- the DRY (don't repeat yourself) principle! "if you're copy-pasting, you should feel like you're doing something dirty"
- the notion of "code smell" and technical debt
- SOLID principles for object-oriented design, when do they apply? (this may not be the best way to approach the topic, but it seems like a good summary of important things)
  - Single-responsibility: no huge classes
  - Open-closed: code should be open for extension, closed for modification:
    - this means it should be usable in all* possible situations, without needing to modify it
    - if functionality needs to be added, it should be possible to do so without changing the existing code
    - but this is to be taken with a grain of salt: in "business" software development one typically has a much better idea of what the code should do than in science. Still, it is good to keep in mind that ideally this would be the case, since it enables us to avoid writing code that we already know will need to be modified. For example, a parameter which we already know we will try several values for should not be hardcoded (a sketch follows below the list).
  - Liskov substitution: the ability to substitute a parent class with its child
  - Interface segregation: decoupling, a part of the software should only need to care about the code it actually needs
  - Dependency inversion: "depend on abstractions, not on concretions"
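A tiny sketch of that hardcoded-parameter example (the `snr` attribute is made up for illustration):

```python
# hardcoded: trying a new threshold means editing the function
def select_high_snr(events):
    return [e for e in events if e.snr > 8]

# parametrized: new values can be tried without modifying the code
def select_high_snr(events, snr_threshold: float = 8.0):
    return [e for e in events if e.snr > snr_threshold]
```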
- modules vs scripts in python: ideally, the combination should look like
  - module structure: large amount of code + no direct instructions in the code
  - script structure: small amount of code + direct instructions (under an `if __name__ == '__main__'` clause)
  - modules for separating interface and implementation
  - scripts for testing and trying quick things
    - also, jupyter notebooks work well for that sort of thing
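A sketch of this split (file names hypothetical):

```python
# analysis.py (module): reusable definitions, nothing runs on import
def running_mean(values: list[float], window: int) -> list[float]:
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# run_analysis.py (script): direct instructions, guarded so that
# they only execute when the file is run directly
from analysis import running_mean

if __name__ == "__main__":
    print(running_mean([1.0, 2.0, 3.0, 4.0], window=2))
```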
- write code optimized for human comprehension!
  - use `bool`s as opposed to `int` values of 1 and 0
  - give understandable names to variables
    - if these names are becoming very long, that could be a sign your code is not very modular
  - use `Enum`s as opposed to strings for "things which can only take on a certain set of values"
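For instance (names invented for illustration):

```python
from enum import Enum, auto

class Interpolation(Enum):
    LINEAR = auto()
    CUBIC = auto()

def resample(data: list[float], method: Interpolation = Interpolation.LINEAR):
    # a typo like Interpolation.LINAER fails immediately and loudly,
    # while the string "linaer" would silently fail every comparison
    if method is Interpolation.LINEAR:
        ...  # linear interpolation
    elif method is Interpolation.CUBIC:
        ...  # cubic interpolation
```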
- creating data structures for data which should always be together (→ `dataclass`)
  - this allows us to be versatile but consistent in creation as well: making `classmethod`s in the form `from_...`
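A sketch of such a data structure, with an invented `from_file` alternative constructor:

```python
from dataclasses import dataclass

@dataclass
class TimeSeries:
    times: list[float]
    values: list[float]

    @classmethod
    def from_file(cls, path: str) -> "TimeSeries":
        # a `from_...` classmethod: a second, consistent way to build the object
        times, values = [], []
        with open(path) as file:
            for line in file:
                t, v = line.split()
                times.append(float(t))
                values.append(float(v))
        return cls(times, values)
```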
- functional programming patterns and how/when to use them in python (generators, `filter`, `map`, `reduce`)
  - these functions should have no side effects!
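A small sketch of these patterns:

```python
from functools import reduce

# a generator yields values lazily, without building the full list in memory
def squares(values):
    for v in values:
        yield v**2

# a map/filter/reduce-style pipeline: every step is a pure function
evens = filter(lambda x: x % 2 == 0, range(10))
total = reduce(lambda acc, x: acc + x, squares(evens), 0)
print(total)  # 0 + 4 + 16 + 36 + 64 = 120
```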
- static type checking the code is a good idea! (→ `mypy`): it can help catch bugs before they even have the possibility to occur
  - python is easy to use partly because it is dynamically typed, but for large projects typing makes life much easier
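For example, running `mypy` on the following flags the second call, before the code is ever executed:

```python
def average(values: list[float]) -> float:
    return sum(values) / len(values)

average([1.0, 2.0, 3.0])  # fine
average("not a list")     # mypy: incompatible argument type
```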
- building documentation starting from the docstrings in the code (→ `sphinx` and so on)
  - in python, docstrings are used by the code itself, when calling `help`!
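A sketch of a docstring in the numpydoc style, which `sphinx` can render (e.g. through its napoleon extension), and which `help(schwarzschild_radius)` shows interactively:

```python
def schwarzschild_radius(mass: float) -> float:
    """Schwarzschild radius of a body of the given mass.

    Parameters
    ----------
    mass : float
        Mass of the body, in kilograms.

    Returns
    -------
    float
        Schwarzschild radius, in meters.
    """
    G = 6.674e-11  # gravitational constant [m^3 kg^-1 s^-2]
    c = 299_792_458  # speed of light [m/s]
    return 2 * G * mass / c**2
```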
- the principles of good documentation: the Diátaxis Framework
- optimize with a scientific mindset: not blindly changing things, but writing code, benchmarking/profiling it, trying to improve what's taking the longest, and testing to see whether the changes make any difference
  - the tricky business of benchmarking: your computer will not behave consistently! To get precise measurements, it's good to average over several randomized runs.
  - don't optimize prematurely: it's often good to start by writing the easy, naïve version of what you're thinking of and then, if it turns out to be too slow, change it! This avoids many instances of losing hours writing complex code which is a couple of milliseconds faster (or, sometimes, actually slower) than the simple version, which could have been written in much less time
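For instance, with the standard library's `timeit`:

```python
import timeit

# average over many repetitions: a single run is dominated by noise
n_runs = 10_000
total = timeit.timeit("sum(range(1000))", number=n_runs)
print(f"{total / n_runs:.2e} seconds per call")
```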
- the tools for optimization, when it's needed:
  - caching
  - parallelization
  - threading
  - inserting compiled code within interpreted code (→ `numba`)
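Quick sketches of the first and last of these:

```python
from functools import lru_cache

# caching: repeated calls with the same argument reuse the stored result
@lru_cache(maxsize=None)
def expensive_computation(n: int) -> float:
    return sum(i**0.5 for i in range(n))

# numba compiles the decorated function to machine code on first call
import numpy as np
from numba import njit

@njit
def fast_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

fast_sum(np.random.rand(10**6))  # first call compiles, later calls are fast
```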
- even though I don't always agree with the tone, "How to ask questions the smart way" and "How to report bugs effectively" are good references
  - making a minimal reproducible example of your issue can result in you figuring out the problem yourself
- contributing to open source projects