%===================================== CHAP 6 =================================
\chapter{Conclusion and Future Work}
\section{Conclusion}
The aim of this project has been to augment the use of examples for learning by making use of educational technology. A system parsing Wikipedia articles to extract sections containing examples has been created. Four research goals were established in chapter \ref{cap_1} to guide the project's workflow towards the desired results. In chapter \ref{cap_2}, other people's work regarding text data mining, semi-structured text and Wikimedia was examined, to help discover the usefulness and possibilities of this project. The concept of a pipeline turning raw source data from Wikipedia into an index containing examples was explained in chapter \ref{cap_3}, along with how the index is searched. To optimize how the system handles the collecting and serving of examples, an analysis of the examples' structure and content was performed as well. Chapter \ref{cap_4} explained in detail how the defined concept was implemented into a working system. Finally, chapter \ref{cap_5} examined the accomplishment of the research goals. Research goals 1, 2 and 3 were summed up and concluded, while four experiments were conducted to evaluate the fourth research goal. The first experiment tested the precision of several keywords when used as search phrases. The second experiment tested the precision of the results when the four different whitelists were applied one at a time. The third experiment evaluated how well the system finds related examples when a specific example is selected. Finally, the fourth experiment tested the system's recall.
Through Experiment I we showed that the system is able to find relevant examples with the original implementation, which applies the union of the whitelists \textit{Top200Edu} and \textit{MathTechWiki} to filter examples before they are added to the index. A set of keywords with different degrees of generality and from different domains gave an average precision of \(0.8\), which is a satisfactory result for the system's first implementation. The experiment did reveal that the applied whitelists influence the system to a significant degree. Consequently, Experiment II was conducted to discover how the whitelists influenced the system, and to find the best whitelist. The results of Experiment II taught us that the system's performance is affected by many different aspects. The format of the search phrase and the number of relevant examples in the whole collection revealed patterns that had a negative impact on the precision. All things considered, the two whitelists \textit{Top200Edu} and \textit{MathTech} gave better results than the others. They both had very similar scores, which reflects that the two lists are also very similar, with \textit{MathTech} excluding 43 categories compared to \textit{Top200Edu}. Although these two whitelists performed best individually, the originally implemented union of \textit{Top200Edu} and \textit{MathTechWiki} scored better overall. The experiments did, however, reveal some weaknesses regarding this union, for instance a lack of relevant results for some keywords where the independent whitelists managed better. Therefore, different unions of whitelists should be combined and tested in an attempt to find one that patches the weaknesses found in Experiments I and II.
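Conceptually, this whitelist filtering amounts to a simple category membership test before indexing. The following is a minimal sketch of one plausible form of such filtering, assuming a whitelist is a set of category names; all names below are hypothetical and not taken from the actual implementation:
\begin{verbatim}
# Illustrative sketch only: an example is indexed if at least one
# of its categories appears in the active whitelist.
def passes_whitelist(example, whitelist):
    return any(category in whitelist for category in example.categories)

# The union of two whitelists, as used in the original implementation.
active_whitelist = top200edu_categories | mathtechwiki_categories
examples_to_index = [e for e in extracted_examples
                     if passes_whitelist(e, active_whitelist)]
\end{verbatim}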
Experiment III tested how well the system found and rated related examples based on an example selected by a user. The query combines information from the initial search with the categories of the selected example. The experiment revealed that the set of categories of the selected example greatly influenced the results. Since many of the examples had sparse category lists, the results ended up being unstable. The algorithm's rating of categories worked well when the example had many categories, but that is not always the case.
In Experiment IV, the system's recall was measured with a small subset of manually inspected examples from the database. All relevant documents in the subset were returned for each keyword used, which entails a recall of 1. Although a recall of 1 seems good, it most likely comes at the cost of lower precision for the system. Sacrificing some recall for better precision could give the system more optimal overall performance.
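For reference, precision and recall are used here in their standard information-retrieval sense, which we assume matches the definitions used in chapter \ref{cap_5}:
\[
\text{precision} = \frac{\lvert \text{relevant} \cap \text{retrieved} \rvert}{\lvert \text{retrieved} \rvert},
\qquad
\text{recall} = \frac{\lvert \text{relevant} \cap \text{retrieved} \rvert}{\lvert \text{relevant} \rvert}.
\]
A recall of 1 thus only means that no relevant example in the inspected subset was missed; it says nothing about how many irrelevant examples were returned alongside them, which is why high recall can coexist with low precision.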
\section{Future Work}
There are several aspects of this project which would benefit from further work or research.
Firstly, the system itself can be greatly improved by enabling it to handle examples from sources other than Wikipedia. The foundation for doing so already exists in the system, since the idea of extracting examples from different sources has existed since the beginning. For simplicity and to save time, this project has focused on using only Wikipedia as a source, and the system is therefore heavily tailored to Wikipedia articles. But since the different processes in the pipeline are highly independent, replacing them to accommodate other sources should be straightforward, as sketched below. Accommodating more sources would result in a richer database of examples, which in turn would help the end user.
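As a minimal sketch of what such a source-agnostic pipeline stage could look like, assuming each source only needs to provide document fetching and example extraction; all names below are hypothetical and not part of the current implementation:
\begin{verbatim}
# Hypothetical sketch; class and function names are illustrative.
from abc import ABC, abstractmethod

class ExampleSource(ABC):
    """A pluggable source of documents that may contain examples."""

    @abstractmethod
    def fetch_documents(self):
        """Yield raw documents (e.g. article markup) from the source."""

    @abstractmethod
    def extract_examples(self, document):
        """Return the example sections found in one raw document."""

class WikipediaSource(ExampleSource):
    def fetch_documents(self):
        ...  # read articles from a Wikipedia dump

    def extract_examples(self, document):
        ...  # parse the wiki markup for sections containing examples

def build_index(sources, index):
    # The indexing step depends only on the interface above, so a new
    # source can be added without changing the rest of the pipeline.
    for source in sources:
        for document in source.fetch_documents():
            for example in source.extract_examples(document):
                index.add(example)
\end{verbatim}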
Another aspect that could benefit the system is a deeper knowledge of examples and how they relate to each other. A better understanding could improve both the relevance scores assigned when examples are searched for and the display of related examples. Improved retrieval of related examples can give a natural learning progression when browsing from one example to its related ones.
The search itself could also be further optimized. Since Elasticsearch was chosen to manage the database of examples, the search API served by the Elasticsearch process could be explored further. Elasticsearch offers a great number of customizations that can be applied to the queries used for the search; these would make the queries more complex, but could also improve the search results. Experiment IV indicated that the system has a very high recall. By making use of methods available in Elasticsearch's search API, a more optimal trade-off between precision and recall could be achieved, as illustrated below.
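As one illustration of such a customization, the sketch below uses the standard Elasticsearch query DSL to require a larger fraction of the query terms to match, trading recall for precision, and to weight title matches more heavily. The index and field names are hypothetical:
\begin{verbatim}
# Hypothetical query sketch; index and field names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch()

query = {
    "bool": {
        "must": {
            "match": {
                "text": {
                    "query": "pythagorean theorem",
                    # Requiring most query terms to match raises
                    # precision at the cost of recall.
                    "minimum_should_match": "75%",
                }
            }
        },
        "should": {
            # Matches in the title count extra towards the score.
            "match": {"title": {"query": "pythagorean theorem",
                                "boost": 2.0}}
        },
    }
}

results = es.search(index="examples", query=query)  # elasticsearch-py 8.x
\end{verbatim}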
Improvements can also be made to the querying for examples based on a user-selected example. The algorithm rating the results relies too heavily on categories, while the current collection of examples does not facilitate the use of categories well enough. One approach is to alter the algorithm; for instance, making use of the references in the articles was explored during the project, although it was not successful enough to make it into the final implementation. Another approach could be to change our view of categories: it could be more beneficial to look at them as tags instead. The current categories could be converted to tags, and a system for manually tagging examples could be implemented. This way, tags would work similarly to how they do for a YouTube video, helping both the search and the creation of a set of related examples. A combination of the two mentioned approaches should improve the retrieval of examples related to a selected example, and Elasticsearch offers ready-made building blocks for this, as sketched below.
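One readily available building block for such related-example retrieval is Elasticsearch's \texttt{more\_like\_this} query, which scores documents by textual similarity to the selected example and could be combined with a score boost for shared tags. The sketch below is illustrative only; the index and field names are hypothetical:
\begin{verbatim}
# Hypothetical sketch; index and field names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch()

def find_related(selected_id, selected_tags):
    return es.search(index="examples", query={
        "bool": {
            "must": {
                "more_like_this": {
                    "fields": ["title", "text"],
                    # Use the selected example itself as the query.
                    "like": [{"_index": "examples", "_id": selected_id}],
                    "min_term_freq": 1,
                    "min_doc_freq": 2,
                }
            },
            "should": {
                # Shared tags (converted categories) boost the score.
                "terms": {"tags": selected_tags}
            },
        }
    })
\end{verbatim}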
Finally, usability testing should be conducted. The system is created to help the user, so it is very important to make sure the user interaction is optimal. Both the design of the user interface and the user interaction should be tested, and the feedback handled accordingly. Analyzing how users interact with the system would also provide feedback that can help optimize the search, especially regarding the balance between recall and precision. For instance, if users rarely explore the lower-ranked results, increasing precision is a good idea; on the other hand, if users explore many of the returned results, high recall can be more beneficial.
\cleardoublepage