diff --git a/interim.tex b/interim.tex index d7b7ffa..40e0622 100644 --- a/interim.tex +++ b/interim.tex @@ -12,6 +12,7 @@ \usepackage{amsmath} \usepackage{amsthm} \usepackage{graphicx} +\usepackage[]{listings} \usepackage[colorinlistoftodos]{todonotes} \usepackage[colorlinks=true, allcolors=blue]{hyperref} \usepackage{comment} diff --git a/report/background/background.tex b/report/background/background.tex index c282f17..45cfa50 100644 --- a/report/background/background.tex +++ b/report/background/background.tex @@ -25,5 +25,5 @@ \chapter{Background} % 10-20 pages \input{report/background/relationalmodel.tex} \input{report/background/databaserepresentation.tex} -\section{Benchmarking databases} +\section{Benchmarking databases}\label{sec:benchmarking} \section{Expanding the use of adjunctions} \ No newline at end of file diff --git a/report/background/categorytheory.tex b/report/background/categorytheory.tex index 138a9a1..9a63dc7 100644 --- a/report/background/categorytheory.tex +++ b/report/background/categorytheory.tex @@ -20,4 +20,4 @@ \subsection{Natural transformations} \subsection{Adjunctions} \todo{Describe why adjunctions are such a key character in this paper} \todo{Add more information between here} -\subsection{Graded Monads} \ No newline at end of file +\subsection{Graded Monads}\label{sec:gradedmonads} \ No newline at end of file diff --git a/report/background/relationalmodel.tex b/report/background/relationalmodel.tex index 66fb8fa..92f596e 100644 --- a/report/background/relationalmodel.tex +++ b/report/background/relationalmodel.tex @@ -1,4 +1,4 @@ -\section{The relational model of a database} +\section{The relational model of a database}\label{sec:relationalmodel} We briefly describe the relational model of a database so that we can introduce the key operators we are modelling using category theory. 
\todo{not modelling, think of better word} \subsection{Introduction to the relational model} There are several different data models to choose from when designing a database that specify important aspects of the design such as the structure, operations and constraints on the data \cite{DatabaseSystems}. For this project we concern ourselves with the relational model and its associated algebra. @@ -174,7 +174,7 @@ \subsubsection{Products}\label{sec:products} \end{verbatim} as one might expect from an ordinary Cartesian product on sets. Furthermore, we also draw attention to the new schema as seen in the headings of \fref{tab:productResult}. -\subsubsection{Joins} +\subsubsection{Joins}\label{sec:joins} A join is a way of creating a new relation by combining relations with common attributes. There are many such ways of doing this. \begin{table}[h] \centering diff --git a/report/bibs/interim.bib b/report/bibs/interim.bib index 3364fe4..a73a353 100644 --- a/report/bibs/interim.bib +++ b/report/bibs/interim.bib @@ -64,4 +64,59 @@ @article{JoinProcessing isbn={0360-0300}, url={https://doi.org/10.1145/128762.128764}, doi={10.1145/128762.128764} +} +@article{CambridgeStructuralDatabase, + author={Colin R. Groom and Ian J. Bruno and Matthew P. Lightfoot and Suzanna C. Ward}, + year={2016}, + month={APR}, + title={The Cambridge Structural Database}, + journal={Acta Crystallographica Section B-Structural Science Crystal Engineering and Materials}, + volume={72}, + pages={171-179}, + note={PT: J; NR: 57; TC: 6076; J9: ACTA CRYSTALLOGR B; PN: 2; PG: 9; GA: DJ1UI; UT: WOS:000373989900003}, + abstract={The Cambridge Structural Database (CSD) contains a complete record of all published organic and metal-organic small-molecule crystal structures. The database has been in operation for over 50 years and continues to be the primary means of sharing structural chemistry data and knowledge across disciplines. 
As well as structures that are made public to support scientific articles, it includes many structures published directly as CSD Communications. All structures are processed both computationally and by expert structural chemistry editors prior to entering the database. A key component of this processing is the reliable association of the chemical identity of the structure studied with the experimental data. This important step helps ensure that data is widely discoverable and readily reusable. Content is further enriched through selective inclusion of additional experimental data. Entries are available to anyone through free CSD community web services. Linking services developed and maintained by the CCDC, combined with the use of standard identifiers, facilitate discovery from other resources. Data can also be accessed through CCDC and third party software applications and through an application programming interface.}, + isbn={2052-5206}, + language={English}, + doi={10.1107/S2052520616003954} +} +@article{MessidorDatabase, + author={Etienne Decenciere and Xiwei Zhang and Guy Cazuguel and Bruno Lay and Beatrice Cochener and Caroline Trone and Philippe Gain and John-Richard Ordonez-Varela and Pascale Massin and Ali Erginay and Beatrice Charton and Jean-Claude Klein}, + year={2014}, + title={Feedback on a Publicly Distributed Image Database: the Messidor Database}, + journal={Image Analysis \& Stereology}, + volume={33}, + number={3}, + pages={231-234}, + note={PT: J; NR: 7; TC: 565; J9: IMAGE ANAL STEREOL; PG: 4; GA: AU2MP; UT: WOS:000345452800006}, + abstract={The Messidor database, which contains hundreds of eye fundus images, has been publicly distributed since 2008. It was created by the Messidor project in order to evaluate automatic lesion segmentation and diabetic retinopathy grading methods. Designing, producing and maintaining such a database entails significant costs. 
By publicly sharing it, one hopes to bring a valuable resource to the public research community. However, the real interest and benefit of the research community is not easy to quantify. We analyse here the feedback on the Messidor database, after more than 6 years of diffusion. This analysis should apply to other similar research databases.}, + isbn={1580-3139}, + language={English}, + doi={10.5566/ias.1155} +} +@article{MonadComprehensions, + author={George Giorgidze and Torsten Grust and Nils Schweinsberg and Jeroen Weijers}, + year={2011}, + month={sep}, + title={Bringing Back Monad Comprehensions}, + journal={SIGPLAN Not.}, + volume={46}, + number={12}, + pages={13--22}, + abstract={This paper is about a Glasgow Haskell Compiler (GHC) extension that generalises Haskell's list comprehension notation to monads. The monad comprehension notation implemented by the extension supports generator and filter clauses, as was the case in the Haskell 1.4 standard. In addition, the extension generalises the recently proposed parallel and SQL-like list comprehension notations to monads. The aforementioned generalisations are formally defined in this paper. The extension will be available in GHC 7.2. This paper gives several instructive examples that we hope will facilitate wide adoption of the extension by the Haskell community. We also argue why the do notation is not always a good fit for monadic libraries and embedded domain-specific languages, especially for those that are based on collection monads. Should the question of how to integrate the extension into the Haskell standard arise, the paper proposes a solution to the problem that led to the removal of the monad comprehension notation from the language standard.}, + isbn={0362-1340}, + url={https://doi.org/10.1145/2096148.2034678}, + doi={10.1145/2096148.2034678} +} +@misc{GHCListComprehension, + title={6.2.7. 
Generalised (SQL-like) List Comprehensions - Glasgow Haskell Compiler 9.7.20230125 User's Guide}, + volume={2023}, + number={26/01/}, + url={https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/generalised_list_comprehensions.html} +} +@inproceedings{ComprehensiveComprehensions, + author={Simon Peyton Jones and Philip Wadler}, + year={2007}, + title={Comprehensive comprehensions}, + booktitle={Proceedings of the ACM SIGPLAN workshop on Haskell workshop}, + pages={61-72} } \ No newline at end of file diff --git a/report/introduction/introduction.tex b/report/introduction/introduction.tex index 4678073..0f7a5f8 100644 --- a/report/introduction/introduction.tex +++ b/report/introduction/introduction.tex @@ -1,4 +1,28 @@ \chapter{Introduction} % 1-3 pages \begin{comment} It’s a good idea to *try* to write the introduction to your final report early on in the project. However, you will find it hard, as you won’t yet have a complete story and you won’t know what your main contributions are going to be. However, the exercise is useful as it will tell you what you *don’t* yet know and thus what questions your project should aim to answer. For the interim report this section should be a short, succinct, summary of the project’s main objectives. Some of this material may be re-usable in your final report, but the chances are that your final introduction will be quite different. You are therefore advised to keep this part of the interim report short, focusing on the following questions: What is the problem, why is it interesting and what’s your main idea for solving it? (DON'T use those three questions as subheadings however! The answers should emerge from what you write.) -\end{comment} \ No newline at end of file +\end{comment} + +Databases are vital to modern-day society and contain domain-specific knowledge about anything from specialised images of eyes \todo{add citation here} to the structure of crystals \cite{CambridgeStructuralDatabase}. 
Since their conception, many different data models describing how to hold the data in databases have emerged, including the relational model \cite{RelationalModel} and the semi-structured model \cite{DatabaseSystems}. + +In this project we concern ourselves with the relational model (introduced in \fref{sec:relationalmodel}) as a way of modelling the database. This rich model has many methods of expressing queries, most notably relational algebra and relational calculus, each with its own strengths and weaknesses \cite{RelationalCalculus,RelationalModel}. \todo{they are equivalent though?} However, the specification favoured by the authors of \cite{RelationalAlgebraByWayOfAdjunctions} is the list comprehension, a feature that ``provide[s] for a concise and expressive notation for writing list-processing code'' \cite{MonadComprehensions}. They eventually propose using GHC's extended list comprehension syntax, which was specifically designed to bridge the already close relationship between relational calculus and list comprehensions \cite{GHCListComprehension,ComprehensiveComprehensions}, in order to avoid the significant theoretical performance cost of a naive translation. + +It is widely noted that \emph{joins}, an integral operation in relational algebra, are associated with inefficient implementations \cite{JoinProcessing}. It is easy to see why when considering the most general joins. However, the authors of \cite{RelationalAlgebraByWayOfAdjunctions} concern themselves with a specialised join called an \emph{equijoin}. As described in \fref{sec:joins}, an equijoin is a specialised \emph{theta-join} -- a way of combining two relations based on an arbitrary condition depending on the attributes of both relations. An integral part of the calculation of a theta-join is calculating the Cartesian product (all possible combinations of tuples of both relations, as described in \fref{sec:products}). 
The algorithm must then filter every single tuple individually to check for equality of attributes! It is clear that this is wasteful for such a specialised join. As a more practical example, consider the SQL program: +\begin{lstlisting} + SELECT * + FROM R, S + WHERE R.a = S.b +\end{lstlisting} +This could be naively converted into a list comprehension as follows: +\[ + \left[\,(r, s)\;|\;r \leftarrow R,\;s \leftarrow S,\;r.a = s.b\,\right] +\] +where $(r, s)$ is seen as a single tuple whose attributes \attribute{\relation{R}.a} and \attribute{\relation{S}.b} are merged. \todo{fix example so that you do not need to see things} + + With this naive list comprehension implementation we effectively convert \equijoin{R}{\attribute{a}}{S}{\attribute{b}} to \select{\attribute{a} = \attribute{b}}{\relation{R} \times \relation{S}}, generating a relation with $|R||S|$ tuples in the process and then filtering each one. + + This can be implemented much more efficiently by viewing databases as indexed tables. We can index each relation by its associated attribute in the equijoin and merge the results, so that Cartesian products are only taken locally, between tuples that share the same index. This approach admits a linear-time equijoin, provided care is taken over the comparison and projection functions \cite{RelationalAlgebraByWayOfAdjunctions}. With some mathematical tools explained in \fref{sec:gradedmonads} we can give these operations a monadic structure, and therefore a comprehension syntax using the extended syntax discussed above. + + What this project adds to this story is a concrete demonstration of the improvement this solution offers. + We will use \emph{Haskell}, together with the list comprehensions and functions described above, to implement simple database querying software that takes these key changes into account. 
Along with real-world data sources, the benchmarking techniques found in \fref{sec:benchmarking} will be used to measure and compare the efficiency of the two approaches -- with the new equijoin and without. This evidence would help to justify the use of these methods and to support the claims made in \cite{RelationalAlgebraByWayOfAdjunctions}. + A concrete implementation could also provide insight into the shortcomings of the remaining operations, as well as those mentioned in the paper. Profiling techniques could be used to further analyse the performance bottlenecks, with the aim of inspiring a theoretical approach to determining the issue, as shown in \cite{RelationalAlgebraByWayOfAdjunctions}, rather than an efficiency-driven optimisation approach.
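
The contrast between the naive comprehension and the indexed equijoin described in this introduction can be made concrete with a minimal Haskell sketch. It is illustrative only and is not the implementation of \cite{RelationalAlgebraByWayOfAdjunctions}: the names \texttt{naiveEquijoin}, \texttt{indexBy} and \texttt{indexedEquijoin} are hypothetical, relations are assumed to be plain lists, and \texttt{Data.Map} from the \texttt{containers} package is swapped in for the indexed tables, so indexing here costs $O(n \log n)$ rather than the linear time discussed above.
\begin{lstlisting}[language=Haskell]
module Main where

import Data.List (sort)
import qualified Data.Map.Strict as Map

-- Naive equijoin: enumerate every pair in the Cartesian product,
-- then keep only the pairs whose keys agree.
naiveEquijoin :: Eq k => (a -> k) -> (b -> k) -> [a] -> [b] -> [(a, b)]
naiveEquijoin keyR keyS rs ss =
  [ (r, s) | r <- rs, s <- ss, keyR r == keyS s ]

-- Group the tuples of a relation by key, preserving their order
-- within each bucket. (Assumption: Data.Map stands in for the
-- indexed tables of the paper.)
indexBy :: Ord k => (a -> k) -> [a] -> Map.Map k [a]
indexBy key = foldr (\x -> Map.insertWith (++) (key x) [x]) Map.empty

-- Indexed equijoin: index both relations, keep the keys present in
-- both indexes, and take a Cartesian product only inside each bucket.
indexedEquijoin :: Ord k => (a -> k) -> (b -> k) -> [a] -> [b] -> [(a, b)]
indexedEquijoin keyR keyS rs ss =
  concat (Map.elems (Map.intersectionWith localProduct
                       (indexBy keyR rs) (indexBy keyS ss)))
  where
    localProduct xs ys = [ (x, y) | x <- xs, y <- ys ]

main :: IO ()
main = do
  let r = [(1, "x"), (2, "y"), (2, "z")] :: [(Int, String)]
      s = [(2, 'p'), (3, 'q'), (2, 'r')] :: [(Int, Char)]
  -- Compare the two results up to ordering.
  print (sort (naiveEquijoin fst fst r s) == sort (indexedEquijoin fst fst r s))
\end{lstlisting}
Both functions return the same tuples up to ordering. The difference is that the indexed version only forms products within each bucket of matching keys, instead of over all $|R||S|$ pairs.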