diff --git a/report/ethics/ethicalissues.tex b/report/ethics/ethicalissues.tex
index fdb83c0..16ea1b6 100644
--- a/report/ethics/ethicalissues.tex
+++ b/report/ethics/ethicalissues.tex
@@ -14,4 +14,7 @@ \chapter{Ethical issues} % 1-2 pages
\subparagraph*{Data protection} The immediate issue that springs to mind is whether the data source contains personal data and, if so, the data protection measures required to fulfil my legal and ethical obligations. To avoid this issue for primary data sources, I will not collect data when creating a benchmarking database, instead opting to randomly generate records. Any data sources I acquire will already be in the public domain and will avoid sensitive personal data. This should minimise the risk to the individuals affected. Data protection law considers ``organisation'' and ``structuring'' to be processing of personal data, which is why this is relevant: the databases acquired might need to be reorganised into the data structures explained in this report \todo{cite}. \todo{See if this is enough} \todo{Cite data protection laws}
\subparagraph*{Database rights} Another consideration when outsourcing database collections is the extent to which those who created the database are protected. The creation of a database, although not creative, has been recognised as taking significant work, and so database creators have a form of copyright protection over their work. \todo{Make sure that this is accurate, cite if so} This risk will be mitigated by ensuring that all databases used are in the public domain and that the correct licences are acquired.
-\todo{Potentially comment on the distribution of my implementation}
\ No newline at end of file
+\todo{Potentially comment on the distribution of my implementation}
+
+\section{Extensions}
+Furthermore, any domains investigated in this project carry their own ethical considerations, which reasonably cannot be discussed now (as they are largely unknown). Care should be taken to think through the ethical impact of the work when branching out to other scenarios that use bulk types. For instance, the suggested family of languages -- logic programming -- has close ties with artificial intelligence, itself an ethically complex subject. \todo{Cite AI ethics} I think it is fair to say, however, that the work done will not bias any field in a particular direction (ethical or not), but may contribute more efficient implementations or a different perspective on ideas within them.
\ No newline at end of file
diff --git a/report/evaluation/evaluationplan.tex b/report/evaluation/evaluationplan.tex
index 3cf449e..197bcad 100644
--- a/report/evaluation/evaluationplan.tex
+++ b/report/evaluation/evaluationplan.tex
@@ -16,4 +16,7 @@ \section{Correctness}\label{sec:correctnessevaluation}
\section{Analysis}
Evaluating the analysis is more difficult. Of course, as per usual benchmarking standards, each test would be run multiple times, varying whether the runs happen in succession or not (in order to let any potential cache optimisations and ideal memory conditions occur). However these repeats are conducted, their deviation should be closely monitored to ensure the results are consistent.
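+For instance, a minimal sketch of such a repeated measurement using the \emph{criterion} library might look as follows (\texttt{naiveEquijoin} and \texttt{indexedEquijoin} are placeholders for the two implementations under test; criterion runs each benchmark many times and reports the standard deviation alongside the mean):
+\begin{lstlisting}
+import Criterion.Main (bench, defaultMain, nf)
+
+main :: IO ()
+main = defaultMain
+  -- criterion runs each benchmark repeatedly and reports the mean,
+  -- standard deviation and outliers of the collected samples.
+  [ bench "naive equijoin"   (nf (uncurry (naiveEquijoin   fst fst)) (rs, ss))
+  , bench "indexed equijoin" (nf (uncurry (indexedEquijoin fst fst)) (rs, ss))
+  ]
+  where
+    -- Placeholder data; randomly generated records would be used here.
+    rs = [ (k, "r") | k <- [1 .. 1000 :: Int] ]
+    ss = [ (k, "s") | k <- [1 .. 1000 :: Int] ]
+\end{lstlisting}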
-It is difficult to suggest comparing deviations of other database systems to the one implemented, even of atomic operations, to ensure they follow a similar distribution of relative performance as databases are so mainstream that query optimisation is commonplace and may affect results in a way that cannot be predicted.
\ No newline at end of file
+It is difficult to suggest comparing the deviations of other database systems to those of the one implemented, even for atomic operations, to ensure they follow a similar distribution of relative performance: databases are so mainstream that query optimisation is commonplace and may affect results in ways that cannot be predicted.
+
+\section{Extension to other domains}
+If a purely theoretical approach is taken when investigating the methods of the paper in neighbouring domains, proofs should be provided for all major conclusions, in which case their correctness can be shown. A comment should be made on whether the conclusions of such an investigation would actually add value to an implementation in functional languages or to a perspective on the field. Although a purely mathematical approach was taken in the inspiring paper, its roots were in efficiency of implementation, readability and conformance to the functional paradigm; making other domains work in a similar fashion may fall short of these values even if it is mathematically sound.
\ No newline at end of file
diff --git a/report/introduction/introduction.tex b/report/introduction/introduction.tex
index 0f7a5f8..1541d04 100644
--- a/report/introduction/introduction.tex
+++ b/report/introduction/introduction.tex
@@ -3,11 +3,12 @@ \chapter{Introduction} % 1-3 pages
It’s a good idea to *try* to write the introduction to your final report early on in the project. However, you will find it hard, as you won’t yet have a complete story and you won’t know what your main contributions are going to be. However, the exercise is useful as it will tell you what you *don’t* yet know and thus what questions your project should aim to answer. For the interim report this section should be a short, succinct, summary of the project’s main objectives. Some of this material may be re-usable in your final report, but the chances are that your final introduction will be quite different. You are therefore advised to keep this part of the interim report short, focusing on the following questions: What is the problem, why is it interesting and what’s your main idea for solving it? (DON'T use those three questions as subheadings however! The answers should emerge from what you write.)
\end{comment}
-Databases are absolutely vital to modern day society and contain domain specific knowledge about anything from specialised images of eyes \todo{add citation here} to the structure of crystals \cite{CambridgeStructuralDatabase}. Since its conception many different data models describing how to hold the data in databases have emerged, including the relational model \cite{RelationalModel} and the semi-structured model\cite{DatabaseSystems}.
+Databases are vital to modern-day society for their ability to structure, sort and query vast amounts of data from any domain, ranging from specialised images of eyes \todo{add citation here} to crystal structures \cite{CambridgeStructuralDatabase}. Of course, no single theoretical model of data has dominated since the field's conception, with different approaches describing how exactly to hold the data.
Examples of such data models include the semi-structured model \cite{DatabaseSystems} and, more relevant to this project, the relational model \cite{RelationalModel} as introduced in \fref{sec:relationalmodel}.
-In this project we concern ourselves with the relational model (to be introduced in \fref{sec:relationalmodel}) as a way of modelling the database. This rich model has many methods of expressing queries, especially relational algebra and relational calculus, both with their strengths and weaknesses \cite{RelationalCalculus,RelationalModel}. \todo{they are equivalent though?} However, the favoured specification of the authors of \cite{RelationalAlgebraByWayOfAdjunctions} seemed to be through list comprehensions, a beautiful feature that ``provide for a concise and expressive notation for writing list-processing code''. \cite{MonadComprehensions} They eventually propose using GHC's extended list comprehension syntax specifically designed to help bridge the already close relationship between relational calculus and list comprehensions \cite{GHCListComprehension,ComprehensiveComprehensions} in order to avoid the significant theoretical performance hit.
+The relational model has rich and varied support for the expression of queries. Two examples from its family of query representations are \emph{relational calculus} and \emph{relational algebra}, each with its own advantages \cite{RelationalCalculus,RelationalModel}. \todo{they are equivalent though?} However, the favoured specification of queries for the authors of \cite{RelationalAlgebraByWayOfAdjunctions} seemed to be list comprehensions, a beautiful feature said to ``provide for a concise and expressive notation for writing list-processing code'' \cite{MonadComprehensions}. They eventually propose using GHC's extended list comprehension syntax to help bridge the already close relationship between relational calculus and list comprehensions \cite{GHCListComprehension,ComprehensiveComprehensions}. Such a method is required to avoid a significant theoretical performance hit when working with equijoins.
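+To give a flavour of this extended syntax (an illustrative sketch with invented data, not an example from the paper), GHC's \texttt{TransformListComp} extension allows SQL-like grouping clauses directly inside a comprehension \cite{ComprehensiveComprehensions}:
+\begin{lstlisting}
+{-# LANGUAGE TransformListComp #-}
+import GHC.Exts (groupWith, the)
+
+-- SQL's "SELECT dept, SUM(salary) ... GROUP BY dept" as a comprehension.
+-- After the group clause, dept and salary are rebound to whole groups.
+totals :: [(String, Int)]
+totals =
+  [ (the dept, sum salary)
+  | (dept, salary) <- [("eng", 10), ("ops", 5), ("eng", 7)]
+  , then group by dept using groupWith
+  ]
+-- totals == [("eng", 17), ("ops", 5)]
+\end{lstlisting}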
-It is widely noted that \emph{joins}, an integral operation in relational algebra are associated with inefficient implementations \cite{JoinProcessing}. It is easy to see why when considering the most general joins. However in their paper \cite{RelationalAlgebraByWayOfAdjunctions} they concern themselves with a specialised join called an \emph{equijoin}. As described in \fref{sec:joins} an equijoin is a specialised \emph{theta-join} -- a way of combining two relations based off of an arbitrary condition depending on the attributes of both relations. An integral part to the calculation of a theta-join is calculating the Cartesian product (all possible combinations of tuples of both relationships, as described in \fref{sec:products}). The algorithm must then filter every single tuple individually to check for equality of attributes! It is clear that this is wasteful for such an specialised join. As a more practical example consider the SQL program:
+It is widely noted that \emph{joins}, an integral family of operations in relational algebra, are associated with inefficient implementations \cite{JoinProcessing}. This is clear to see in the general case, but in their paper \cite{RelationalAlgebraByWayOfAdjunctions} the authors concern themselves with equijoins. As further described in \fref{sec:joins}, an equijoin is a specialised \emph{theta-join} (a way of combining two relations based on an arbitrary condition over the attributes of both): it simply matches records whose given attributes are equal. An integral part of calculating a theta-join is computing the Cartesian product (all possible combinations of tuples from both relations, as described in \fref{sec:products}). The algorithm must then filter this bloated collection of tuples by the predicate associated with the theta-join! It is clear that this is wasteful for such a specialised join. As a more practical example consider the SQL program:
\begin{lstlisting}
SELECT * FROM R, S
@@ -19,10 +20,12 @@ \chapter{Introduction} % 1-3 pages
\]
where $(r, s)$ is seen as a single tuple whose attributes \attribute{\relation{R}.a} and \attribute{\relation{S}.b} are merged. \todo{fix example so that you do not need to see things}
- With this naive list comprehension implementation we effectively convert \equijoin{R}{\attribute{a}}{S}{\attribute{b}} to \select{\attribute{a} = \attribute{b}}{\relation{R} \times \relation{S}}, generating a relation with $|R||S|$ tuples in the process then filtering through each.
+With this naive implementation we effectively convert \equijoin{R}{\attribute{a}}{S}{\attribute{b}} to \select{\attribute{a} = \attribute{b}}{\relation{R} \times \relation{S}}, generating a relation with $|R||S|$ tuples in the process and then filtering through each.
- This can much be much more efficiently implemented by viewing databases as indexed tables. We can index each relation by its associated attribute in the equijoin and merge the results -- localising the data required to in a cartesian product. This approach admits a linear time equijoin, if careful about comparison and projection functions. \cite{RelationalAlgebraByWayOfAdjunctions} With some mathematical tools explained in \fref{sec:gradedmonads} we can describe give these operations a monadic structure and therefore a comprehension syntax using the extended syntax discussed above.
+This can be implemented much more efficiently by viewing databases as indexed tables. We can index each relation by its associated attribute in the equijoin and merge the results -- localising the Cartesian product to tuples we know are in the new relation; if care is taken with comparisons and projection, this approach admits a linear-time implementation of the equijoin \cite{RelationalAlgebraByWayOfAdjunctions}. With some mathematical tools explained in \fref{sec:gradedmonads} we show that these indexed tables share enough of a monadic structure to fit a comprehension syntax using the extended syntax discussed above. A provisional Haskell sketch of both the naive and the indexed approach is given below.
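+The sketch below is our provisional code, not taken verbatim from \cite{RelationalAlgebraByWayOfAdjunctions}; for brevity it uses \texttt{Data.Map} as the index, which costs $O(n \log n)$ rather than the linear bound achievable with the paper's more careful choice of structures:
+\begin{lstlisting}
+import qualified Data.Map.Strict as Map
+
+-- Naive equijoin: the comprehension enumerates the full Cartesian
+-- product (|rs| * |ss| pairs) and only then discards non-matches.
+naiveEquijoin :: Eq k => (a -> k) -> (b -> k) -> [a] -> [b] -> [(a, b)]
+naiveEquijoin f g rs ss = [ (r, s) | r <- rs, s <- ss, f r == g s ]
+
+-- Index a relation by its join attribute.
+indexBy :: Ord k => (a -> k) -> [a] -> Map.Map k [a]
+indexBy f = Map.fromListWith (++) . map (\x -> (f x, [x]))
+
+-- Indexed equijoin: only tuples sharing a key are ever paired, so the
+-- Cartesian product is localised to the matching buckets.
+indexedEquijoin :: Ord k => (a -> k) -> (b -> k) -> [a] -> [b] -> [(a, b)]
+indexedEquijoin f g rs ss =
+  concat (Map.elems (Map.intersectionWith pairs (indexBy f rs) (indexBy g ss)))
+  where pairs xs ys = [ (x, y) | x <- xs, y <- ys ]
+\end{lstlisting}
+Note that the two functions may produce their output in different orders, which is harmless when relations are viewed as bags.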
- What this project adds to this story is a concrete demonstration of the improvement this solution offers.
- We will use the \emph{Haskell} and the list comprehensions and functions described above to implement a simple database querying software taking into account these key changes. Along with real world pragmatic data sources, benchmarking techniques found in \fref{sec:benchmarking} will be used to accurately measure and compare the efficiency difference between the two approaches -- with the new equijoin and without. This evidence would be very important to justify the use of these methods and the claims made in the paper \cite{RelationalAlgebraByWayOfAdjunctions}.
- With a concrete implementation, it could also provide insights into the downfall of remaining operations as well as those mentioned in the paper, potentially using profiling techniques to further analyse the performance bottlenecks, in order to inspire a theoretical approach at determining the issue as shown in \cite{RelationalAlgebraByWayOfAdjunctions} instead of an efficiency-driven optimisation approach.
+What this project adds to this story is a concrete demonstration of the improvement this solution offers.
+We will use \emph{Haskell} and the list comprehensions and functions described above to implement simple database querying software taking these key changes into account. Along with real-world pragmatic data sources, benchmarking techniques found in \fref{sec:benchmarking} will be used to accurately measure and compare the efficiency difference between the two approaches -- with the new equijoin and without. This evidence would be very important in justifying the use of these methods and the claims made in the paper \cite{RelationalAlgebraByWayOfAdjunctions}.
+With a concrete implementation, further insights may come to light on the shortcomings of functional implementations of the remaining operations in the algebra. Using tools such as profiling, clear performance bottlenecks may be identified and then studied at a theoretical level in a similar fashion to \cite{RelationalAlgebraByWayOfAdjunctions}, rather than through an efficiency-driven optimisation approach.
+
+Similarly, this project uses the same category-theoretic approach to explore how similar operations are conducted in other domains. We explore what other monadic structures may admit an inefficient list comprehension implementation for common uses, since many bulk types admit a list comprehension syntax by virtue of their monadic structure. A particularly interesting domain to explore is that of \emph{logic programming}. What effect do relational algebraic operations on model sets have on the underlying logic programs? Furthermore, if we view normal logic programs themselves as monads, what is the interpretation of the operations above? We also explore whether there is a way to neatly and efficiently implement joins as they appear natively in such domains, such as recombining a split program's stable models to find the stable models of the whole system.
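+To make the bulk-type reading of programs concrete, a minimal (hypothetical) Haskell sketch might represent a definite logic program as a bag of clauses; since clause order is irrelevant, a list is read here up to reordering:
+\begin{lstlisting}
+-- Hypothetical first-order syntax for definite logic programs.
+data Term   = Var String | Fun String [Term]
+data Atom   = Atom String [Term]
+
+-- A definite clause: head :- body (an empty body makes it a fact).
+data Clause = Clause { clauseHead :: Atom, clauseBody :: [Atom] }
+
+-- A program is a bulk type of clauses: a list read up to reordering
+-- and duplication, i.e. a bag.
+type Program = [Clause]
+\end{lstlisting}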
diff --git a/report/project/projectplan.tex b/report/project/projectplan.tex
index b4b3321..f7555f0 100644
--- a/report/project/projectplan.tex
+++ b/report/project/projectplan.tex
@@ -28,6 +28,11 @@ \section{Reporting the project}
Reporting the project is a necessity. My project is unique in that the first section is itself an analysis and therefore, if done correctly, will almost immediately be written in a final report style while it is being conducted. Later parts of the project might need their own time set apart to write about, as mathematics is rarely linear and the report should tell a compelling story. My plan is to continue to develop the tools to efficiently write my report alongside the implementation of the database, and to begin to draft and validate the final parts of the report a month before the deadline. Useful tools I have already integrated are both manual and commit-triggered checks on the spelling and code style of the reports.
I have also worked on automatic releases and management of different report versions, early in December. A few of the systems are helpful but not at their full potential yet, so I might put some time into thinking about possible extensions, for instance custom dictionaries for the spell checker so that it does not flag domain-specific terminology as errors. I definitely need to migrate the report structure I have to lhs2\TeX{} so that code snippets can be reported with ease. This should be finished by February, in time to develop the database in the report if that is the workflow I find most beneficial.
+
+\section{Application of methods to other domains}
+After enough theoretical knowledge is obtained in the research and implementation stages, thought can be given to how to extend the line of reasoning and techniques used in the paper in order to apply them to different domains with these so-called bulk types.
+\subsection{Logic programming}
+It is clear that programs themselves can be thought of as a bulk type. Especially in purely declarative languages such as \emph{ASP}, where the order of clauses does not matter, a view of the program as a bulk type easily falls out as a sugared version of a bag. The roadmap could change depending on how promising the results are; an initial investigation could be made into \emph{definite logic programs}. We could apply this analysis of bulk types both to the programs themselves and to their answer sets. The investigation could then expand to extensions of the language, notably \emph{negation-by-failure}, where more complex behaviours are seen. I do not see the project extending its syntax definitions past \emph{normal logic programs}, due to the increasing complexity.
\section{Possible extensions}
\subsection{Progress in optimising other aspects of relational algebra}
Given a successful stage in \fref{sec:theoreticalanalysis}, this might give the project a lot of scope to grow as an extension over the final months. I could choose between using the category theory learned to write proofs of the theoretical advantage of any novel approach I think of, or just a novel description of what is already there. Alternatively, I could extend my implementation to include the modified operations and do another analysis of their practical performance change.