diff --git a/paper/paper.md b/paper/paper.md
index 9756f6a..d7b54d2 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -47,8 +47,7 @@ We introduce **DeepBench**, a Python library that employs mechanistic models (i.
 # Statement of Need
 There are multiple open problems and issues that are critical for the machine learning and scientific communities to address -- principally, interpretability, explainability, uncertainty quantification, and inductive bias in machine learning models when they are applied to scientific data. Multiple kinds of data sets and data simulation software can be used for developing models and confronting these challenges. These data sets range from natural images and text to multi-dimensional data of physical processes. Indeed, multiple benchmark data and simulation software have been created and inculcated for developing and comparing models. However, these benchmarks are typically limited in significant ways. Natural image data sets comprising images from the real or natural world (e.g., vehicles, animals, landscapes) are widely used in the development of machine learning models. These kinds of data sets tend to be large, diverse, and carefully curated. However, they are not underpinned by or constructed upon physical principles: they cannot be generated by mathematical expressions of formal physical theory, so there is not a robust connection between the data and a quantitative theory. Therefore, these data sets have a severely limited capacity to help address many questions in machine learning models, such as uncertainty quantification. On the other hand, complex physics simulations (e.g., cosmological n-body simulations and particle physics simulators) are accurate, detailed, and based on precise quantitative theories and models. This facilitates studies of interpretability and uncertainty quantification because there is the possibility of linking the simulated data to the input choices through each layer of calculation in the simulator. However, they are relatively small in size and number, and they are computationally expensive to reproduce. In addition, while they are underpinned by specific physical functions, the complexity of the calculations makes them challenging as a venue through which to make connections between machine learning results and input choices. Complex physics simulations have one or more layers of mechanistic models. Mechanistic models are defined with analytic functions and equations that describe and express components of a given physical process: these are based on theory and empirical observations. In both of these scenarios, it is difficult to build interpretable models that connect raw data and labels, and it is difficult to generate new data rapidly.
-The physical sciences community lacks sufficient datasets and software as benchmarks for the development of statistical and machine learning models.
-In particular, there currently does not exist simulation software that generates data underpinned by physical principles and that satisfies the following criteria:
+The physical sciences community lacks sufficient datasets and software as benchmarks for the development of statistical and machine learning models. In particular, there currently does not exist simulation software that generates data underpinned by physical principles and that satisfies the following criteria:
 * multi-domain
 * multi-purpose
@@ -86,18 +85,14 @@ The **DeepBench** software simulates data for analysis tasks that require precis
 # Primary Modules
-* Geometry objects: two-dimensional images generated with matplotlib `[@hunterMatplotlib2DGraphics2007b]`. The shapes include $N$-sided polygons, arcs, straight lines, and ellipses. They are solid, filled or unfilled two-dimensional shapes with edges of variable thickness.
-* Physics objects: one-dimensional profiles for two types of implementations of pendulums dynamics: one using Newtonian physics, the other using Hamiltonian.
+* Geometry objects: two-dimensional images generated with `matplotlib` `[@hunterMatplotlib2DGraphics2007b]`. The shapes include $N$-sided polygons, arcs, straight lines, and ellipses. They are solid, filled or unfilled two-dimensional shapes with edges of variable thickness.
+* Physics objects: one-dimensional profiles for two types of implementations of pendulum dynamics: one using Newtonian physics, the other using Hamiltonian.
 * Astronomy objects: two-dimensional images generated based on radial profiles of typical astronomical objects. The star object is created using the Moffat distribution provided by the AstroPy `[@theastropycollaborationAstropyCommunityPython2013a]` library. The spiral galaxy object is created with the function used to produce a logarithmic spiral `[@ringermacherNewFormulaDescribing2009a]`. The elliptical Galaxy object is created using the Sérsic profile provided by the AstroPy library. Two-dimensional models are representations of astronomical objects commonly found in data sets used for galaxy morphology classification.
 * Image: two-dimensional images that are combinations and/or concatenations of Geometry or Astronomy objects. The combined images are within `matplotlib` meshgrid objects. Sky images are composed of any combination of Astronomy objects, while geometric images comprise individual geometric shape objects.
 * Collection: Provides a framework for producing module images or objects at once and storing all parameters that were included in their generation, including exact noise levels, object hyper-parameters, and non-specified defaults.
-All objects also come with the option to add noise to each object.
-For Physics objects -- i.e., the pendulum -- the user may add Gaussian noise to parameters: initial angle $theta_0$, the pendulum length $L$, the gravitational acceleration $g$, the planet properties $\Phi = (M/r^2)$, and Newton's gravity constant $G$.
-Note that $ g = G * \Phi = G * M/r^2$: all parameters in that relationship can receive noise.
-For Astronomy and Geometry Objects, which are images, the user can add Poisson or Gaussian noise to the output images.
-Finally, the user can regenerate the same noise using the saved random seed.
+All objects also come with the option to add noise to each object. For Physics objects -- i.e., the pendulum -- the user may add Gaussian noise to parameters: initial angle $\theta_0$, the pendulum length $L$, the gravitational acceleration $g$, the planet properties $\Phi = (M/r^2)$, and Newton's gravity constant $G$. Note that $g = G * \Phi = G * M/r^2$: all parameters in that relationship can receive noise. For Astronomy and Geometry Objects, which are images, the user can add Poisson or Gaussian noise to the output images. Finally, the user can regenerate the same noise using the saved random seed.
 # Example Outputs