Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix libraries performance problem #34523

Merged
merged 1 commit into from
May 20, 2024

Conversation

jesperhodge
Copy link
Member

@jesperhodge jesperhodge commented Apr 16, 2024

Context (skip if you are only interested in the actual changes)

This is an attempt to fix a performance problem on the libraries home page. When you go to studio home and click on the libraries tab, on prod it will be quick for admins but extremely slow for course instructors (> 12 seconds) and leads to timeouts. It grows with the number of libraries that are assigned to the instructor.

My hypothesis is that this is due to slow Python code in the BulkRoleCache and RoleCache classes which was suspiciously slow locally. Complexity for admins is O(n) where n is the total number of libraries that exist, but for course instructors it's O(n * m) where m is the number of courses and libraries that this instructor has access to, which can be > 1000. The reason is that for every library that exists, there's a lookup in the RoleCache to see whether the user has a role for this library. This used to be an iteration over a set.

I changed the cache to store roles per user in a dict that groups roles by course id if they have one. That changes the complexity of the lookup to O(1), so total complexity for loading the libraries home page is reduced to O(n).

I observed that locally with a user that only had a small number of libraries assigned it cut the request time to a third.

Even if this does not fix the described performance problem it's an improvement of code performance and good to have. So fixing the described bug is not a requirement for merging this PR.

Description

This is a python performance improvement. I want to change the lookup time in the RoleCache to O(1) when there's a cache hit.
I changed the RoleCache / BulkRoleCache to store roles per user in a dict that groups roles by course id if they have one. I then transformed it into a flat set like it was before to keep the _.roles property the same to avoid breaking anything that relied on this.

Caveats

This will improve code performance but since the ungrouped set of all roles is also needed, I'm caching it as a set as well in the _roles variable of RoleCache. That means that the RoleCache takes up twice as much memory while it's being used, but I think it's request-scoped and should reset when the request is finished if I'm not mistaken (correct me if I'm wrong?), so it should not be a big deal.

Supporting information

Internal JIRA

Testing instructions

  • Log into studio as a course instructor
  • assign two libraries to that instructor
  • go to studio and click on the libraries tab
  • courses and libraries should load without any errors
  • otherwise, we really just want to make sure I'm not introducing regression bugs.

Other information

  • While the cache now stores user roles in a dict, I also kept the original ._roles property of the RolesCache and made it so it transforms the roles into a set that's stored in this variable. One reason for this is not to break any existing code.
  • Most unexpected errors I had to deal with came from calculating or using the key for this course dict. Some roles didn't have a CourseKey and some of the course keys - for example of V2 libraries - didn't have an html_id() function, and this threw errors. I think I dealt with this problem sufficiently now.
  • The RoleCache is pretty central so it's really important to avoid breaking any existing functionality that relies on it. Please pay close attention to that in the review.

@jesperhodge jesperhodge force-pushed the fix--libraries-performance-problem branch from 34fa403 to a71b414 Compare April 16, 2024 22:13
@jesperhodge jesperhodge marked this pull request as ready for review April 19, 2024 16:11
@jesperhodge jesperhodge requested a review from a team April 19, 2024 19:22
Copy link
Contributor

@connorhaugh connorhaugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a couple of nits before you merge.

cms/djangoapps/contentstore/rest_api/v1/views/home.py Outdated Show resolved Hide resolved
common/djangoapps/student/roles.py Outdated Show resolved Hide resolved
common/djangoapps/student/roles.py Outdated Show resolved Hide resolved
@jesperhodge jesperhodge requested a review from kdmccormick April 23, 2024 20:06
@jesperhodge
Copy link
Member Author

Hi @kdmccormick , Connor suggested to ask you for a review here. If you don't have time, could you suggest someone else who has good backend experience?

Copy link
Member

@kdmccormick kdmccormick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR Jesper, and thanks for the great PR description with your findings. Below is all my feedback that doesn't have anything to do with the optimization. I'll leave another review in a moment that directly addresses the optimization.

Comment on lines 348 to 355
def get_user_roles(username):
def get_user_roles(username, by_course_id=False):
"""
Returns a set of all roles that this user has.
:param username: The id of the selected user.
:return: All roles for all courses that this user has.
:return: Either a set of all roles for all courses that this user has
or a dictionary of course_id to roles with the key
ROLE_CACHE_UNGROUPED_COURSES_KEY (see common/djangoapps/student/roles.py)
for roles that are not course specific.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this change is unneeded?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 65 to 78
def get_role_cache_course_key(course_id=None):
"""
Get the cache key for the course key.
"""
default = ROLE_CACHE_UNGROUPED_COURSES_KEY
if not course_id:
return default
if not getattr(course_id, 'html_id', False):
return default
html_id = course_id.html_id()
if not html_id:
return default

return html_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the definition of html_id, this method is just a roundabout way of saying:

    return str(course_id) if course_id else ROLE_CACHE_UNGROUPED_COURSES_KEY

so why not just do that?

By the way, dynamic methods like getattr play poorly with linting and typechecking, and they're also prone to spelling errors and mis-refactorings. I find it best to avoid them unless you're consciously doing something in the realm of meta-programming.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 148 to 156
# TODO: test that this works as expected
def get_all_roles_set(self):
return self._roles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still a TODO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 152 to 160
def get_roles_by_course_id(self):
return self._roles_by_course_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Pythonic way to write a getter is to use a readonly property:

Suggested change
def get_roles_by_course_id(self):
return self._roles_by_course_id
@property
def roles_by_course_id(self):
return self._roles_by_course_id

From the perspective of the caller, the property will look like any other field, eg obj.roles_by_course_id rather than obj.get_roles_by_course_id()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do it as per our convention then, but I think generally it's a pretty bad thing to do. There's no benefit other than that it looks slightly nicer, but the drawback is big: when accessing this property, you hide the fact that not just a value is recalled but a function is run, which can lead to bugs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 108 to 170
return any(
res = False
course_id_string = get_role_cache_course_key(course_id)
course_roles = self._roles_by_course_id.get(course_id_string)
if not course_roles:
return False
res = any(
access_role.role in self.get_roles(role) and
access_role.course_id == course_id and
access_role.org == org
for access_role in self._roles
for access_role in course_roles
)
return res
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The res variable serves no purpose, and it obfuscates the logic in this function. Furthermore, you can avoid the guard conditional by defaulting course_roles to [], as any([]) == False.

It would be equivalent to write:

        course_id_string = get_role_cache_course_key(course_id)
        course_roles = self._roles_by_course_id.get(course_id_string, [])
        return any(
            access.role in self.get_roles(role) and access_role.org == org
            for access_role in course_roles
        )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 91 to 101
user_id = role.user.id
course_id = get_role_cache_course_key(role.course_id)
if not roles_by_user.get(user_id):
roles_by_user[user_id] = {course_id: set()}
if not roles_by_user[user_id].get(course_id):
roles_by_user[user_id][course_id] = set()
roles_by_user[user_id][course_id].add(role)

if not roles_by_user.get(course_id):
roles_by_user[course_id] = set()
roles_by_user[course_id].add(role)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the new schema of the cache? I thought it would be {user_id: {course_id: role, ...}, ...}, but in the last three lines here, we're keying roles_by_user[course_id] too? A comment describing the cache schema would go a long way here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, see my other comment, I must have gotten confused. The last three lines make no sense to me looking at it now, need to review and fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 69 to 87
roles_by_user = defaultdict(set)
roles_by_user = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nit: defaultdicts are great because they internally implement the "check-if-missing-and-set-a-default" pattern that regular dicts require you to repeat over and over again:

            if not roles_by_user.get(user_id):
                roles_by_user[user_id] = {course_id: set()}
            if not roles_by_user[user_id].get(course_id):
                roles_by_user[user_id][course_id] = set()
            roles_by_user[user_id][course_id].add(role)

If roles_by_user were declared as defaultdict(lambda: defaultdict(set)), then the code above could become:

            roles_by_user[user_id][course_id].add(role)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this in the wrong place before.

I think this might be because he's using it to mix two different key types (which I also think is worth examining). That could be done with a defaultdict, but it'd need a lambda to switch by type or something, which is getting into 😦 territory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# No need to worry about ErrorBlocks - split's get_libraries() never returns them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before putting your PR up for review, it's best to take a look through and scrub away any unnecessary changes, including no-op changes like whitespace and trailing commas. A clean diff saves your reviewers time, and it saves other devs time later when they are debugging commits or putting together a release.

Of course, style/whitespace improvements can be very helpful, but they are best when broken into separate commits or even entirely separate PRs.

Copy link
Contributor

@ormsbee ormsbee Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be because he's using it to mix two different key types (which I also think is worth examining). That could be done with a defaultdict, but it'd need a lambda to switch by type or something, which is getting into 😦 territory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ormsbee did you mean to put this on #34523 (comment) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... yeah. Weird. Thank you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay thanks, I'll scrub my PR for things like this next time.

Copy link
Member

@kdmccormick kdmccormick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jesperhodge and @ormsbee, here are my thoughts on the actual optimization.

Comment on lines 165 to 172
res = any(
access_role.role in self.get_roles(role) and
access_role.course_id == course_id and
access_role.org == org
for access_role in self._roles
for access_role in course_roles
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions that we are going from O(total_num_libs * accessible_libs_and_courses) down to O(total_num_libs). Am I correct this is precisely where that optimization is happening?

In other words: is the optimization that self._roles contained all of a user's access roles across the whole system, whereas course_roles contains only their set of roles in the course in question?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is precisely where the optimization is happening.

is the optimization that self._roles contained all of a user's access roles across the whole system, whereas course_roles contains only their set of roles in the course in question?

Yes. Let's understand the fundamental problem with the prior implementation: the previous code was iterating over every single role a user has, and some user have thousands of roles, because each library or course results in a new role. Now running a find operation here by iterating over those thousands of roles is very inefficient.
On top of that the code itself seems already slow, which I still don't completely understand, but that's fine - iterating over thousands of roles every time this function is called (and it's called for every library that exists on the platform) makes this much worse.
Just picking the key for the course or library you want to check from a dictionary instead of a check is a quick O(1) operation instead of an iteration over the whole set.
Now the reason why the optimization should work very well is that a user usually has at most a handful of roles for a course. So as long as this is not iterating over the ungrouped set it should significantly increase performance of the python code.

Copy link
Member Author

@jesperhodge jesperhodge Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically as pseudocode:
roles = dict[course_id]
is much quicker than
roles = all_roles.iterate_and_filter(role.course_id == course_id)

class BulkRoleCache: # lint-amnesty, pylint: disable=missing-class-docstring
CACHE_NAMESPACE = "student.roles.BulkRoleCache"
CACHE_KEY = 'roles_by_user'

@classmethod
def prefetch(cls, users): # lint-amnesty, pylint: disable=missing-function-docstring
roles_by_user = defaultdict(set)
roles_by_user = {}
get_cache(cls.CACHE_NAMESPACE)[cls.CACHE_KEY] = roles_by_user
Copy link
Member

@kdmccormick kdmccormick Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also raises my eyebrows performance-wise, although I might just be mis-understanding it.

roles_by_user is a dict which is stored in the cache, and the next few lines populate the dict. Does every update to roles_by_user trigger a cache-write? Or does the code somehow know that it needs to bulk-update the roles_by_user cache entry once this method finishes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works because it's a Request cache, so just an in-memory dict with no write-through to an external cache server.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, ty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to the existing code or to my code change here?
I have to say I was quite confused by the prefetch method as well - the line
get_cache(cls.CACHE_NAMESPACE)[cls.CACHE_KEY] = roles_by_user is a pretty weird looking way to write to a cache, especially since the RequestCache has explicit set operations.
I was wondering if this even works and if so I highly doubt that this is good practice.

However my change was only to change the datastructure in the cache for each user from a set to a dict, so I didn't want to touch this to avoid breaking anything. But I was going to ask you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I asked in our python slack about it and they confirmed that the RequestCache is pretty much just a writable dict. It's still weird to me that you'd do this instead of using it's explicit setter method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very likely that this usage of RequestCache just predates RequestCache having a setter method.

Copy link
Contributor

@ormsbee ormsbee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions/requests. If this is being made for library auth optimization, you should definitely explain that in some detail in your commit message because it wouldn't be at all clear to casual observers why all the course_ids being thrown around affects libraries.

@@ -60,21 +62,47 @@ def strict_role_checking():
ACCESS_ROLES_INHERITANCE.update(OLD_ACCESS_ROLES_INHERITANCE)


def get_role_cache_course_key(course_id=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is meant to be a CourseKey and not a string representation, please call it course_key instead of course_id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 72 to 73
if not getattr(course_id, 'html_id', False):
return default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that html_id is just there for backwards compatibility, and isn't actually intended to be used, since it just returns the str representation for CourseKey. What is this conditional here to filter out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to usestr()

class BulkRoleCache: # lint-amnesty, pylint: disable=missing-class-docstring
CACHE_NAMESPACE = "student.roles.BulkRoleCache"
CACHE_KEY = 'roles_by_user'

@classmethod
def prefetch(cls, users): # lint-amnesty, pylint: disable=missing-function-docstring
roles_by_user = defaultdict(set)
roles_by_user = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a docstring comment that explains the data structure being created here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 93 to 94
if not roles_by_user.get(user_id):
roles_by_user[user_id] = {course_id: set()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (optional): You can also do this kind of "init to this value if no entry exists" logic with setdefault, e.g.:

roles_by_user.setdefault(user_id, {course_id: set()})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Also consider using named variables to make what you're doing clearer. So something like this (though I haven't debugged this, I just wrote it as an example):

roles_by_user = {}
get_cache(cls.CACHE_NAMESPACE)[cls.CACHE_KEY] = roles_by_user

for role in CourseAccessRole.objects.filter(user__in=users).select_related('user'):
    user_id = role.user.id
    course_id = get_role_cache_course_key(role.course_id)
    
    # Add role to the set in roles_by_user[user_id][course_id]
    user_roles_dict = roles_by_user.setdefault(user_id, {})
    user_roles_set_for_course = user_roles_dict.setdefault(course_id, set())
    user_roles_set_for_course.add(role)

    # Add role to roles_by_user[course_id] for backwards compatibility
    course_roles_set = roles_by_user.setdefault(course_id, set())
    course_roles_set.add(role)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it with Kyle's defaultdict recommendation, but also improved naming a bit as you suggested

common/djangoapps/student/roles.py Show resolved Hide resolved
self._roles_by_course_id = {}
roles = CourseAccessRole.objects.filter(user=user).all()
for role in roles:
course_id = get_role_cache_course_key(role.course_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that org level roles (where org is specified but course_id is null) will be dumped into the generic ungrouped cache key? Is that okay?

Copy link
Member Author

@jesperhodge jesperhodge Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems okay to me. I mean it was ungrouped before already, so it's not worse than before, and we're trying to improve performance for lookups for libraries that relate to a specific course, not for orgs. It could be an idea for a future improvement but it's not clear that that is necessary or has a benefit. Also it depends how many roles are in that group. If we can get to a strong assumption that the biggest mass of roles is already grouped because it's by course which should include libraries, then a user will normally not belong to thousands of different orgs, so the amount of ungrouped roles will be quite small und thus no big problem to iterate over.

Now a big caveat here is that I didn't get this to work with V2 libraries. It seems that V2 libraries don't have course keys with an html_id attribute and thus will not be groupable by course and land in the ungrouped set. So this performance improvement will not apply to V2 libraries as-is (at least it doesn't break anything according to the unit tests), and I wanted to ask about it. However, you or Kyle pointed out that html_id is not the right attribute to look for the string representation, that it should be something like str(course_key). If I understand that correctly and that works for everything including V2 libraries, that should resolve our problem here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, you or Kyle pointed out that html_id is not the right attribute to look for the string representation, that it should be something like str(course_key). If I understand that correctly and that works for everything including V2 libraries, that should resolve our problem here.

Yes, for any OpaqueKey instance (including CourseKeys, LibraryKeys, and UsageKeys), simply calling str(...) with the key is a reliable and surprise-free way of getting a string that you can use as a cache key, guaranteed to be unique from any other OpaqueKey instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

surprise-free

(Well, I should dial this back a bit. In the CMS, keys for modulestore content can have version information attached to them, which means that two different versions of a piece of content may have two different keys. To get the version-agnostic key, you'd have to call course_key.for_version(None).for_branch(None). That's all unrelated to this PR, though, since we're in the content of LMS, and LMS content keys are automatically version-agnostic).

# No need to worry about ErrorBlocks - split's get_libraries() never returns them.

Copy link
Contributor

@ormsbee ormsbee Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be because he's using it to mix two different key types (which I also think is worth examining). That could be done with a defaultdict, but it'd need a lambda to switch by type or something, which is getting into 😦 territory.

self._roles_by_course_id[course_id].add(role)
self._roles = set()
for roles_for_course in self._roles_by_course_id.values():
self._roles.update(roles_for_course)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Shouldn't this kind of refresh logic be delegated to something in BulkRoleCache?

Copy link
Member Author

@jesperhodge jesperhodge Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it should. The BulkRoleCache is responsible for roles in general it seems and this is scoped to a specific user's roles, so you could maybe make some case for it to stay here.

The refresh happened here originally as well, so I tried to change as few things as I could to avoid breaking stuff. With your guidance here I could see about moving this logic to the BulkRoleCache in some way, potentially, but maybe that's rather part of a separate need to work on the RoleCache in general:

Generally I wasn't very impressed with the design of the RolesCache and BulkRoleCache, given that

  • the performance was not optimal, as I've explained
  • the separation between RolesCache and BulkRolesCache is not very clear since as you said, maybe this refresh logic belongs in the BulkRoleCache
  • I'm not quite sure why you have a RequestCache with a BulkRoleCache but then you instead cache a single user's roles in the _roles property of the RoleCache here instead of the RequestCache. Is the RequestCache somehow slower, or why not just fetch the roles you want from the RequestCache and maybe avoid the whole problem we're having?
  • the private data attribute _roles was accessed directly instead of by a getter in some external code which means that changing it means you may break places in the codebase you aren't aware of, which is why I tried to not change that attribute's internal data structure and why I was so concerned about changing any code here
  • The roleCache is initialized and saved in a user's _.roles attribute, named the same as the roleCache's internal data attribute _roles, so you'll need to do something like user_roles = current_user._roles._roles which is extremely confusing
  • Outside of this implementation, it's anyway very weird how the RoleCache is randomly saved as an attribute a user has. The way it's done is just that some code somewhere suddenly adds this attribute _roles to a user object, which is not a set of roles but instead an instance of RoleCache. The problem besides the naming is that that's not a standard attribute a user has and you'll just go through code line by line to see if the code somewhere instantiates a RoleCache and adds it as such an attribute to a user. It's completely intransparent whether the user object will have an attached RoleCache or not at a given moment in time
  • The BulkRoleCache's prefetch method uses very weird logic to set or overwrite the RequestCache that completely ignores the RequestCache's setter methods

I could probably find some more things that bother me.
Now a few things of these I'm attempting to fix here but I'm not trying to fix the overall design that shows the problems you see above. I'll try to keep these two things separate, which is why I probably wouldn't move the refresh logic here to the BulkRoleCache.

It probably would make sense to improve the architecture of this RoleCache in a separate issue, though. I'd love to see that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure why you have a RequestCache with a BulkRoleCache but then you instead cache a single user's roles in the _roles property of the RoleCache here instead of the RequestCache. Is the RequestCache somehow slower, or why not just fetch the roles you want from the RequestCache and maybe avoid the whole problem we're having?

I'd have to dig into the commit history, but keep in mind that things like this sometimes spring up as a series of bandaids that deal with specific bugs/issues over time, and don't necessarily add up to a coherent design intent. (Though sometimes they do, and those can be mind-bending. 😛 )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside of this implementation, it's anyway very weird how the RoleCache is randomly saved as an attribute a user has. The way it's done is just that some code somewhere suddenly adds this attribute _roles to a user object, which is not a set of roles but instead an instance of RoleCache. The problem besides the naming is that that's not a standard attribute a user has and you'll just go through code line by line to see if the code somewhere instantiates a RoleCache and adds it as such an attribute to a user. It's completely intransparent whether the user object will have an attached RoleCache or not at a given moment in time

I think this is part of the early days of fumbling around as to how to cache something at the request level. There's no real built-in thing for this in Django, but there is a User object that's passed into the view as part of the Request, and that Request can be munged with in middleware. So if you want to stick some data onto a user for the duration of the request, you can slap it in there. Not something I'd recommend as a good practice, but I'd bet that's where it started from.

Comment on lines 148 to 156
# TODO: test that this works as expected
def get_all_roles_set(self):
return self._roles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove TODO and make this a @property.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

roles_by_user[user_id][course_id] = set()
roles_by_user[user_id][course_id].add(role)

if not roles_by_user.get(course_id):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I made quite a mistake here. I think it should be if not roles_by_user[user_id].get(course_id). What's even the purpose of this since the three lines above already cover it? I think I did something wrong here. I'll have another look.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done (removed these lines)

@jesperhodge jesperhodge requested review from ormsbee and kdmccormick May 8, 2024 16:53
Copy link
Contributor

@ormsbee ormsbee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one change request on the Course ID format used in the tests (sorry for not catching that earlier). After that, please squash your commits and take some of the great context you have in your PR description and add it to your commit message.

Thank you!

@@ -189,7 +192,8 @@ def test_get_orgs_for_user(self):
@ddt.ddt
class RoleCacheTestCase(TestCase): # lint-amnesty, pylint: disable=missing-class-docstring

IN_KEY = CourseKey.from_string('edX/toy/2012_Fall')
IN_KEY_STRING = 'edX/toy/2012_Fall'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that you're just copying what's there, but this type of Course ID doesn't work for anything in the system any longer. While it should make no difference for these tests specifically, please use the newer style course keys, e.g. course-v1:edX+toy+2012_Fall).

I thought we had managed to kill most of these IDs from the test suite, except where we had to keep them for backwards compatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The library key you're using is fine, fwiw.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@kdmccormick kdmccormick added the content libraries misc Libraries Overhaul tech work not captured in the stories label May 15, 2024
@jesperhodge jesperhodge force-pushed the fix--libraries-performance-problem branch from d30fd00 to 245f9aa Compare May 20, 2024 18:56
This is an attempt to fix a performance problem on the libraries home page. When you go to studio home and click on the libraries tab, on prod it will be quick for admins but extremely slow for course instructors (> 12 seconds) and leads to timeouts. It grows with the number of libraries that are assigned to the instructor.

The Python code for the request to load libraries for a particular user goes through all existing libraries and then checks all of the user's roles for each library, which results in a complexity of O(l*r), l=libraries, r=roles. This PR improves the complexity to O(l).

The BulkRoleCache and RoleCache classes were using a python set to store all roles for a particular user. A user can have a large number of roles, and lookup speed of iterating through a set is slow (O(n)). Most roles don't have the same course id, however. So if you have the course id of the role you're looking for, we can use a dict of course ids that contain related roles. The number of roles per course id is negligible, so we arrive at a lookup speed of O(1) when looking up a user's roles that belong to a specific course id.

The BulkRoleCache now caches and stores user roles in a data structure like this:
    {
        user_id_1: {
            course_id_1: {role1, role2, role3},  # Set of roles associated with course_id_1
            course_id_2: {role4, role5, role6},  # Set of roles associated with course_id_2
            [ROLE_CACHE_UNGROUPED_ROLES_KEY]: {role7, role8}  # Set of roles not tied to any specific course or library. For example, Global Staff roles.
        },
        user_id_2: { ... }  # Similar structure for another user
    }

While this changes the data structure used to store roles under the hood and adds the new property `roles_by_course_id` to the RoleCache,
when initializing the RoleCache will store roles additionally in the previous data structure - as a flat set - in the `_roles` property accessible via `all_roles_set`. This establishes
backwards compatibility.

We are now storing roles twice in the RoleCache (in each of the two data structures), which means this takes twice as much memory, but only in the scope of a request.
@jesperhodge jesperhodge force-pushed the fix--libraries-performance-problem branch from 245f9aa to e524349 Compare May 20, 2024 18:57
@kdmccormick kdmccormick requested review from kdmccormick and removed request for kdmccormick May 20, 2024 19:12
@kdmccormick kdmccormick dismissed their stale review May 20, 2024 19:14

Trusting Dave's and Connor's reviews

@kdmccormick kdmccormick removed their request for review May 20, 2024 19:14
@jesperhodge jesperhodge merged commit e95d7e7 into master May 20, 2024
49 checks passed
@jesperhodge jesperhodge deleted the fix--libraries-performance-problem branch May 20, 2024 20:34
@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX staging environment in preparation for a release to production.

@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX production environment.

1 similar comment
@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX production environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
content libraries misc Libraries Overhaul tech work not captured in the stories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants