Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query Builder tests for OOM and memory limit (from PR #2053) #2054

Merged
merged 33 commits into from
Jan 10, 2025
Merged

Conversation

grusev
Copy link
Collaborator

@grusev grusev commented Dec 10, 2024

Reference Issues/PRs

What does this implement or fix?

Any other comments?

Checklist

Query Builder tests for OOM and memory limit

The PR contains code that introduces memray and pytest-memray libraries. With their help it is possible to create 2 types of tests:

  • memory leaks test - the memory leaks tests will accept one or two parameters:

    • mem leak threshold - a threshold of mem in KB/MB that will trigger failure if above
    • (optional) filter function - the function can be passed to memray to filterout code that we do not want to be accounted as mem leak. In future we might get some code that we KNOW has memory leaks but is outside of our scope. Thus we can use this function to filter it out.. Currently the code accepts that mem leaks under certain treshold will not lead to failure. If that threshold is vialoted in future this should mean immediate development investigative task. In the log there will be information what is the code that triggered that. IMPORTANT: code source will be if the test is executed against release build - only the size of leak will be correct. To know the source of the code the test must be executed against build with debug information
  • memory limit test - the memory limit test is about enforcing memory efficiency of code. Currently I have set CURRENT memory limit as limit for the this QueryBuilder code. If test fails in future then that would mean that our code is LESS memory efficient in that code. Why that has happened needs to be investigated Also, the number currently set accepts situation AS IS. It does not makes claim that we are memory efficient. No such study has been done it is outside of the test purpose. If such thing is needed then an investigative task need to be created

  • Have you updated the relevant docstrings, documentation and copyright notice?

  • Is this contribution tested against all ArcticDB's features?

  • Do all exceptions introduced raise appropriate error messages?

  • Are API changes highlighted in the PR description?

  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@grusev grusev force-pushed the qb_oom branch 2 times, most recently from fff58e8 to bbb8502 Compare December 11, 2024 12:26
@grusev grusev marked this pull request as ready for review December 12, 2024 06:01
The query basically will do aggregation of half of dataframe
"""
q = QueryBuilder()
return (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this brackets? It's inconsistent with query_resample from this file and most of the codebase.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a way to have decent multilines ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need then in this case. The function quoted above query_resample doesn't use brackets. We should keep the style homogeneous.

lib: Library = arctic_library_lmdb
df = generate_big_dataframe(300000)
lib.write(symbol, df)
del df
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed. It should be deleted when it goes out of scope.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory yes. In practice not sure. Thus one delete more is not hurting

Copy link
Collaborator

@vasil-pashov vasil-pashov Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have good understanding of the code we're putting both in theory and in practice. It's unusual to delete stuff in python in such way. We should prove there is a reason to do this and document why it is needed. If not we should omit it.

Let's see @alexowens90's take on this as well.

everything not relevant from mem leak measurement out of
test, so it works as less as possible
"""
lib: Library = arctic_library_lmdb
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are some things type annotated e.g. lib: Library while others are not e.g. df is not df: pd.DataFrame

Copy link
Collaborator Author

@grusev grusev Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ommissions, or somewhere not needed. fixed here and there

"""
q = QueryBuilder()
return (
q[q["strings"] != "QASDFGH"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chances of this filtering anything out are very low, is that deliberate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is deliberate to impose attempt to filter by impossible condition which should still be evaluated. And as we know with strings things are always buggier than with numbers

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue is that the condition is not impossible, just unlikely. Making it actually impossible is fine (e.g. use a different character set than the generated dataframe), but we have optimisations in place for when nothing is filtered out, which could then skew the results if you get unlucky and the dataframe does contain QASDFGH

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It is very unlikely. Thus making the results 99.99% of the dataframe (note that by default I think we generate str of 10 chars array and this is 7 chars. So perhaps really impossible to achieve unless bug

Anyway for our purposes we do not need exactness at all. Thus actual returned result if it is 99 or 100% of rows really does not matter

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does matter. Returning 100% and 99.999% of a dataframe can take totally different code paths

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see what @alexowens90 thinks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added aditional query also without any filter. Thus we have now 3 queries that have groupby agg., with filter(50%) filter(100%) effectivly and no filter ...

there is additional functional test also that covers the case if there is something fishy with queries

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grusev the query needs to be changed to one of:

  • Return 100% of the dataframe
  • Return <100% of the dataframe

And it has to be consistent. We have optimisations in place so that a filter that doesn't remove anything is much more efficient than a filter that removes even 1 row.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have it already .. look at:

def query_no_filter_only_groupby_with_aggregations() -> QueryBuilder:
"""
groupby composite aggregation query for QueryBuilder memory tests.
The query basically will do aggregation of half of dataframe
"""
q = QueryBuilder()
return (
q.groupby("uint8")
.agg({"uint32": "mean",
"int32": "sum",
"strings": "count",
"float64": "sum",
"float32": "min",
"int16": "max"})
)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only the comment is misleading but all is ok

Pass size of dataframe and it will generate random row range
"""
percentage_rows_returned = 0.57
start_percentage = random.uniform(0.01, 1.0 - percentage_rows_returned)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably disable starting from 0-th row. Is this the desired behavior?

The same question applies to query_date_range_57percent

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to be exact in testing, thus nothing will change in both cases start or not from 0

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start_percentage = random.uniform(0.01, 1.0 - percentage_rows_returned)
result_size_rows = int(0.57 * size)
q = QueryBuilder()
a = random.randint(0,int((size-1) * start_percentage))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. As the row range is closed on the left side and opened on the right side it is safe to pass size here. It's also the correct thing to do because the current thing represents a the dataframe without it's last row.
  2. Why is the conversion to int needed. Looking at the type hint size is already an integer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not want exactness or correctness here, we want randomness, nothing changes for the test if it includes or not the last row. If we wanted to cover exactly last row we would not use at all random.

Copy link
Collaborator

@vasil-pashov vasil-pashov Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want randomness

Randomness is achieved by using random

This is all about clear and readable code. When I'm reading the code I'm wondering what is the reason not to include this in the query? Why are we specifically removing that last row? Turns out there is no reason to do that, if there's no reason for something it should be in the code, so that it doesn't bring such confusion.

@grusev grusev merged commit d71a0bb into master Jan 10, 2025
125 of 126 checks passed
@grusev grusev deleted the qb_oom branch January 10, 2025 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants