Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements on Fitness Function: Current Status #41

Open
danielvangelder opened this issue Jun 4, 2018 · 2 comments
Open

Improvements on Fitness Function: Current Status #41

danielvangelder opened this issue Jun 4, 2018 · 2 comments
Assignees
Labels

Comments

@danielvangelder
Copy link
Contributor

For a part of our honours project, Paul and I have been working on improving the fitness function for the GA. However during this tweaking we ran into some problems and were hoping you were willing to take a look at it.

The goal was that the fitness function was supposed to align more with the function that is described in the paper. So right now the GA compares each Fixture and orders all of them based on some comparison. But we wanted a function that would return a real value for a given Fixture. So the new fitness function would be the step_distance (amount of query levels to reach for full coverage) plus the branch_distance (distance at deepest level to achieve coverage target) which is the distance of the last querylevel normalised: d = x / (x + 1).

The main difference between the old implementation and new implementation is that although both of them use a comparator between two Fixtures' fitnesses, the new implementation compares a calculated real value and the old implementation compares the data structures of the fitnesses themselves.

The implementation of the new version was rather straight forward and is as mentioned above. However, when running this on the tests we got some tests which kept running in an infinite loop. Intuitively we would have expected that the tests would run without problems but apparently we either have some subtle bug in our fitness calculation or the data that we use does not give a clear enough indication of the fitness of the Fixture. After many debugging sessions we found that it was a combination of these two. To illustrate this I will show an example and compare the different ways in which the Fixture's would be compared between the two illustrations:

We have the following query (this one will fail on the new implementation, it is test case SubquerySelectWhereTest#test7):
SELECT * FROM PRODUCTS T WHERE T.PRICE < (SELECT MAX(TYPE) FROM (SELECT NAME, TYPE FROM PRODUCT_DETAIL WHERE LENGTH(NAME) = 2) T2 WHERE T.PRODUCT = T2.NAME)

The typical fitness of a Fixture halfway through the loop would be something like this:
{{Level: 0, MaxRangeVariableIndex: 0, Distance: 1.7976931348623156E306, Subdata: [{Level: 0, MaxRangeVariableIndex: 0, Distance: 8.513101825710521}]}}

The old implementation would first compare the two Fixture's on the depth that they achieved, thereafter on the first distance and then for the subdata on the distance, etcetera. If at any point in this comparison one Fixture turns out to be better it stops comparing and returns the resulting comparison value.

The new implementation, however, would find a fitness value of approximately: 2.96 for the fitness described above, since the query has three levels (and is currently in the second level '1') and a very big branch_distance on the top level. However this very big branch distance is currently not what we should optimize for but the distance in the 'subdata'.

I'm not really sure what this subdata object means. since if we have multiple querylevels, a QueryLevelData object is created with another nested QueryLevelData, called prevLevelData, within it for the previous level. However for these types of queries we have only a single QueryLevelData with the subLevelData which is a list of QueryLevelData (in this case containing only one object) and that object will have another subLevelData for another deeper level. So I assume that these actually represent the deeper querylevels in some cases (perhaps only for subqueries). Therefore I think that the bug is originating from the assumption that the different 'query levels' are always stored in nested QueryLevelData objects, which are actually in the subLevelData. But in some cases we don't have subLevelData and instead get these nested QueryLevelData objects without subLevelData.

I implemented this approach by just looking into the subLevelData (which used to have only private access) and this approach will work for the failing tests, however this causes problems in some other tests which do converge but never to a minimum. So that means it takes too long to finish the GA.

After I discussed this with Maurício and he discussed this with Jeroen we found that there is a difference between the subLevelData and the nested QueryLevelData object and that both are important for the fitness function. The decision to be made now is what to look at in the final calculation. I propose the following possible solutions:

  • We combine the distance values on all values
  • We look at the deepest level (then we still need to know what the deepest level actually is)

I currently implemented something which looks at the 'deepest' subLevelData object present. This fails for only some tests so that is already a good improvement.
Tests that fail are currently:

  • RightJointTest#testRightJoinWithNulls2BasedOnEspocrmQ21P15()
  • All tests in DatatypesTest
  • StringsTest#testStringFunctionsReverse()
  • SubquerySelectWhereTest#test7() (Maurício implemented this as a way to inject a correct population into the GA, strangely this doesn't work on the new implementation.

The new implementation can be found on the redesign-fitness branch.

@apanichella
Copy link
Member

apanichella commented Jun 6, 2018

Hi @danielvangelder ,
I just made a commit (id = 6bd7b97) to fix most of the problems. Actually, those tests failed beacuse of a bug in how the population was initialized.

In the old version of the code, the population was configured with only one solution provided in the test case but without allowing filling the entire population size. With a population with only one solution, clearly, the search does not work.

I also fixed some broken test cases, where the initialization of the population should be handled differently when the fitness value is set up using Mockito.

At the moment, there is only one failing test: RightJointTest#testRightJoinWithNulls2BasedOnEspocrmQ21P15(). To fix it, we should check how the joins are handled in the new version of the code. Perhaps, you can help me giving a look at that part of the fitness evaluation.

Annibale

@JeroennC
Copy link

Hi @danielvangelder , @apanichella ,

I left a comment on Annibale's commit because I believe one of those lines will cause bugs or wrong output.

The reason we have the two versions of QueryLevelData, namely the prevLevelData and subLevelData is due to subqueries within the SELECT or WHERE parts of the query. Probably also in the HAVING.

The idea behind it is that if there is a subquery in the FROM, this subquery must be solved (i.e. have some output) in order for us to find a solution. However, if there is a subquery in the SELECT or WHERE clause, we don't know for certain whether it is relevant at all. Thus we require a fitness function that allows us to optimize for these subqueries, but is not halted by needing to optimize it.

That is why we introduced subLevelData. If a fixture has solved everything on the current level then we try optimizing in one of the subqueries. An example of this necessity:

SELECT * FROM t1 WHERE t1.x = 'abcd' AND t1.y = 'efgh' AND t1.z = (SELECT w FROM t2 WHERE v = 'klmn')

If t1.x and t1.y are solved, but the subquery returns no rows then the subquery must be optimized.
Now if we change the query to:

SELECT * FROM t1 WHERE (t1.x = 'abcd' AND t1.y = 'efgh') OR t1.z = (SELECT w FROM t2 WHERE v = 'klmn')

It is not necessary for us to solve the subquery. Hence we can allow our fixtures to optimize for either part of the OR and ignore the rest. This is the logical difference that can exist between subqueries in the FROM and other parts of the query, and the subLevelData distinction is how this logical difference is implemented.

This is also why during my thesis we decided not yet to try and change the fitness value to a single real number, the structure can get quite complex. Probably the best idea remains knowing beforehand what the possible structure of a FixtureFitness object is. I believe Mauricio told me that you attempted retrieving this from HSQLDB before? Perhaps I can be of help looking into it.

Jeroen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants