-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query Builder tests for OOM and memory limit (from PR #2053) #2054
Conversation
fff58e8
to
bbb8502
Compare
The query basically will do aggregation of half of dataframe | ||
""" | ||
q = QueryBuilder() | ||
return ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this brackets? It's inconsistent with query_resample
from this file and most of the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a way to have decent multilines ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need then in this case. The function quoted above query_resample
doesn't use brackets. We should keep the style homogeneous.
lib: Library = arctic_library_lmdb | ||
df = generate_big_dataframe(300000) | ||
lib.write(symbol, df) | ||
del df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed. It should be deleted when it goes out of scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory yes. In practice not sure. Thus one delete more is not hurting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have good understanding of the code we're putting both in theory and in practice. It's unusual to delete stuff in python in such way. We should prove there is a reason to do this and document why it is needed. If not we should omit it.
Let's see @alexowens90's take on this as well.
everything not relevant from mem leak measurement out of | ||
test, so it works as less as possible | ||
""" | ||
lib: Library = arctic_library_lmdb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are some things type annotated e.g. lib: Library
while others are not e.g. df is not df: pd.DataFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ommissions, or somewhere not needed. fixed here and there
""" | ||
q = QueryBuilder() | ||
return ( | ||
q[q["strings"] != "QASDFGH"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The chances of this filtering anything out are very low, is that deliberate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this is deliberate to impose attempt to filter by impossible condition which should still be evaluated. And as we know with strings things are always buggier than with numbers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My issue is that the condition is not impossible, just unlikely. Making it actually impossible is fine (e.g. use a different character set than the generated dataframe), but we have optimisations in place for when nothing is filtered out, which could then skew the results if you get unlucky and the dataframe does contain QASDFGH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It is very unlikely. Thus making the results 99.99% of the dataframe (note that by default I think we generate str of 10 chars array and this is 7 chars. So perhaps really impossible to achieve unless bug
Anyway for our purposes we do not need exactness at all. Thus actual returned result if it is 99 or 100% of rows really does not matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does matter. Returning 100% and 99.999% of a dataframe can take totally different code paths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see what @alexowens90 thinks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added aditional query also without any filter. Thus we have now 3 queries that have groupby agg., with filter(50%) filter(100%) effectivly and no filter ...
there is additional functional test also that covers the case if there is something fishy with queries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@grusev the query needs to be changed to one of:
- Return 100% of the dataframe
- Return <100% of the dataframe
And it has to be consistent. We have optimisations in place so that a filter that doesn't remove anything is much more efficient than a filter that removes even 1 row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have it already .. look at:
def query_no_filter_only_groupby_with_aggregations() -> QueryBuilder:
"""
groupby composite aggregation query for QueryBuilder memory tests.
The query basically will do aggregation of half of dataframe
"""
q = QueryBuilder()
return (
q.groupby("uint8")
.agg({"uint32": "mean",
"int32": "sum",
"strings": "count",
"float64": "sum",
"float32": "min",
"int16": "max"})
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only the comment is misleading but all is ok
Co-authored-by: Vasil Danielov Pashov <[email protected]>
…w on V1 store only
Pass size of dataframe and it will generate random row range | ||
""" | ||
percentage_rows_returned = 0.57 | ||
start_percentage = random.uniform(0.01, 1.0 - percentage_rows_returned) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will probably disable starting from 0-th row. Is this the desired behavior?
The same question applies to query_date_range_57percent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to be exact in testing, thus nothing will change in both cases start or not from 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the same comment as: https://github.com/man-group/ArcticDB/pull/2054/files#r1907343843
start_percentage = random.uniform(0.01, 1.0 - percentage_rows_returned) | ||
result_size_rows = int(0.57 * size) | ||
q = QueryBuilder() | ||
a = random.randint(0,int((size-1) * start_percentage)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- As the row range is closed on the left side and opened on the right side it is safe to pass size here. It's also the correct thing to do because the current thing represents a the dataframe without it's last row.
- Why is the conversion to int needed. Looking at the type hint size is already an integer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not want exactness or correctness here, we want randomness, nothing changes for the test if it includes or not the last row. If we wanted to cover exactly last row we would not use at all random.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want randomness
Randomness is achieved by using random
This is all about clear and readable code. When I'm reading the code I'm wondering what is the reason not to include this in the query? Why are we specifically removing that last row? Turns out there is no reason to do that, if there's no reason for something it should be in the code, so that it doesn't bring such confusion.
Reference Issues/PRs
What does this implement or fix?
Any other comments?
Checklist
Query Builder tests for OOM and memory limit
The PR contains code that introduces memray and pytest-memray libraries. With their help it is possible to create 2 types of tests:
memory leaks test - the memory leaks tests will accept one or two parameters:
memory limit test - the memory limit test is about enforcing memory efficiency of code. Currently I have set CURRENT memory limit as limit for the this QueryBuilder code. If test fails in future then that would mean that our code is LESS memory efficient in that code. Why that has happened needs to be investigated Also, the number currently set accepts situation AS IS. It does not makes claim that we are memory efficient. No such study has been done it is outside of the test purpose. If such thing is needed then an investigative task need to be created
Have you updated the relevant docstrings, documentation and copyright notice?
Is this contribution tested against all ArcticDB's features?
Do all exceptions introduced raise appropriate error messages?
Are API changes highlighted in the PR description?
Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?