
Update search indexing to use celery rather than coverage records (PP-1225) #1849

Merged · 12 commits into main · May 22, 2024

Conversation

@jonathangreen (Member) commented May 14, 2024

Description

This PR updates our search indexing to happen via a celery queue rather than the existing work coverage record approach. The existing search index work coverage records are left in the DB and will be removed in a follow-up PR.

All responsibility for search index initialization and migration has been moved into InstanceInitializationScript. This script now handles both the DB and search index initialization. When a migration to a new search index is needed, it now happens via a celery task; previously this was handled by the search coverage monitor.

There was some refactoring of responsibilities between the external search classes and the search client. The search client is now responsible for making sure the correct search alias is used when requests are being made.

Motivation and Context

On our larger CMs, a significant share of DB load comes from generating the work coverage records for search, mostly due to deadlocks when two processes try to update the same work. Moving search indexing to a queue like this should drastically reduce that DB load and improve our performance.
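For illustration, a minimal sketch of what queue-based indexing can look like. The task name, helper functions, and retry settings here are assumptions for the sketch, not the PR's actual code:

```python
from celery import shared_task

@shared_task(bind=True, max_retries=4)
def index_work(self, work_id: int) -> None:
    # Hypothetical task: one lightweight queue entry per changed work,
    # instead of a coverage record row that can deadlock with others.
    try:
        work = load_work(work_id)  # placeholder for the DB lookup
        documents = Work.to_search_documents([work])
        get_search_service().index_documents(documents)
    except Exception as exc:
        # On failure, retry via the queue rather than recording a
        # failed coverage record.
        raise self.retry(exc=exc)
```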

How Has This Been Tested?

  • Tested locally
  • Running unit tests

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@jonathangreen force-pushed the feature/search-indexing-celery branch from 02ba424 to 369d8c0 on May 14, 2024 16:41

codecov bot commented May 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.11%. Comparing base (474f4dd) to head (8779d75).
Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1849      +/-   ##
==========================================
+ Coverage   90.10%   90.11%   +0.01%     
==========================================
  Files         325      324       -1     
  Lines       39640    39578      -62     
  Branches     8591     8595       +4     
==========================================
- Hits        35716    35665      -51     
+ Misses       2602     2599       -3     
+ Partials     1322     1314       -8     


@jonathangreen force-pushed the feature/search-indexing-celery branch from d844845 to 23560ca on May 16, 2024 23:29
@jonathangreen added the "feature" (New feature) and "cleanup migration" (PR that will need a cleanup migration once it's been fully deployed) labels on May 17, 2024
@jonathangreen jonathangreen marked this pull request as ready for review May 17, 2024 16:24
@@ -1,5 +1,6 @@
from __future__ import annotations
jonathangreen (Member Author):

This script is now solely responsible for initializing and migrating both our database and search index.
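As a rough sketch of the shape this implies (method names here are illustrative, not necessarily the script's real API):

```python
class InstanceInitializationScript:
    def run(self) -> None:
        # Hold a lock so that, with many containers starting at once,
        # exactly one performs initialization (see the lock discussion below).
        with self.initialization_lock():
            self.initialize_database()      # run schema creation / migrations
            self.initialize_search_index()  # create the index, set pointers,
                                            # and queue the migration task if needed
```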


# Get references to the read and write pointers.
self._search_read_pointer = self._search_service.read_pointer_name()
self._search_write_pointer = self._search_service.write_pointer_name()
jonathangreen (Member Author):

Previously, responsibility for making sure operations ran against the correct search alias was split between this class and the search service. That responsibility has now moved entirely into the search service, so this class doesn't need to know anything about the aliases we use.
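A simplified sketch of the centralized alias handling, with illustrative names (the PR's actual naming scheme may differ):

```python
class SearchService:
    def __init__(self, client, base_name: str) -> None:
        self.client = client          # e.g. an opensearch-py client
        self.base_name = base_name

    def read_pointer_name(self) -> str:
        return f"{self.base_name}-search-read"

    def write_pointer_name(self) -> str:
        return f"{self.base_name}-search-write"

    def search(self, query: dict) -> dict:
        # Callers never name an index directly; the service always
        # resolves requests through the appropriate alias.
        return self.client.search(index=self.read_pointer_name(), body=query)
```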

@@ -253,54 +214,25 @@ def count_works(self, filter):
)
return qu.count()

def create_search_documents_from_works(
self, works: Iterable[Work]
) -> Sequence[SearchDocument]:
jonathangreen (Member Author):

This function didn't really add anything to Work.to_search_documents other than timing how long that function took, so it was removed, and callers were updated to call Work.to_search_documents directly.
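In other words (hypothetical caller code; the wrapper name matches the removed method above):

```python
# Before: a wrapper that only added timing around the real work.
#   documents = search_index.create_search_documents_from_works(works)
# After: callers build the documents directly.
documents = Work.to_search_documents(works)
```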

"""The 'write' pointer; the pointer that will be used to populate an index with search documents."""
@dataclass(frozen=True)
class SearchPointer:
"""A search pointer, which is an alias that points to a specific index."""
jonathangreen (Member Author):

Previously, calls to write_pointer returned a SearchWritePointer while calls to read_pointer returned a str. This introduces a new SearchPointer class that both methods now return.
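A sketch of what such a dataclass might carry; the field names are assumptions, not the PR's actual attributes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchPointer:
    alias: str    # the alias name, e.g. "catalog-search-read"
    index: str    # the concrete index the alias currently targets
    version: int  # the schema revision of that index
```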

@abstractmethod
def index_set_populated(self, revision: SearchSchemaRevision) -> None:
"""Set an index as populated."""

jonathangreen (Member Author):

We used to have empty and populated aliases for an index. Now that all our initialization is centralized in the instance_initialization script, and the actions in that script are protected by a postgres lock ensuring only one instance performs initialization at a time, we don't need these extra aliases.
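For reference, a minimal sketch of the postgres-lock pattern described here, using SQLAlchemy (the lock key and helper function are illustrative):

```python
from sqlalchemy import text

LOCK_KEY = 10001  # arbitrary application-chosen advisory lock key

def initialize_exactly_once(engine) -> None:
    with engine.connect() as conn:
        # pg_advisory_lock blocks until this session holds the lock,
        # so concurrent instances queue up behind the first one.
        conn.execute(text("SELECT pg_advisory_lock(:key)"), {"key": LOCK_KEY})
        try:
            run_initialization(conn)  # hypothetical: DB + search setup
        finally:
            conn.execute(text("SELECT pg_advisory_unlock(:key)"), {"key": LOCK_KEY})
```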

).join_from(
Identifier,
equivalent_identifiers,
Identifier.id == literal_column("equivalent_cte.equivalent_id"),
jonathangreen (Member Author):

This query is equivalent to what we had before, but we are using a join condition rather than a where clause to limit the results. This removes a warning that SQLAlchemy was emitting every time we ran this query, because it thought we were doing a join without a condition on it.
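Schematically (simplified; `equivalent_cte` stands in for the CTE used in the real query):

```python
from sqlalchemy import literal_column, select

# Relating the CTE only in WHERE leaves the join without a condition,
# which makes SQLAlchemy warn about a cartesian product:
#   select(Identifier, equivalent_cte).where(
#       Identifier.id == literal_column("equivalent_cte.equivalent_id"))

# Expressing the relation as the join's ON clause returns the same rows
# without the warning:
query = select(Identifier).join_from(
    Identifier,
    equivalent_cte,
    Identifier.id == literal_column("equivalent_cte.equivalent_id"),
)
```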

)
.select_from(
join(Classification, Subject, Classification.subject_id == Subject.id)
== literal_column("equivalent_cte.equivalent_id"),
jonathangreen (Member Author):

Same as the identifiers query: we use a join condition to stop a SQLAlchemy warning.

@@ -88,7 +88,7 @@ def services_search_fixture() -> ServicesSearchFixture:
search_container = Search()
client_mock = create_autospec(boto3.client)
service_mock = create_autospec(SearchServiceOpensearch1)
revision_directory_mock = create_autospec(SearchRevisionDirectory.create)
revision_directory_mock = create_autospec(SearchRevisionDirectory)
jonathangreen (Member Author):

This spec was wrong: we need a spec for SearchRevisionDirectory, not its create method. This caused me issues when I tried to actually use the mock.


def test_to_search_document(self, db: DatabaseTransactionFixture):
"""Test the output of the to_search_document method."""
customlist, editions = db.customlist()
jonathangreen (Member Author):

This wasn't really testing the Coverage Provider, so this test was moved into the Work tests.

def test_to_search_documents_with_missing_data(
self, db: DatabaseTransactionFixture
):
# Missing edition relationship
jonathangreen (Member Author):

This was also moved into the Work tests.

@jonathangreen jonathangreen requested a review from a team May 17, 2024 16:25
jonathangreen (Member Author):

I'm still doing some final local testing on this one, but I think this is ready for review whenever anyone has the cycles to take a look.

tdilauro (Contributor) left a review:

This looks great! 🎸🤘🏽

Just a few small comments. I'm gonna go ahead and approve.

cls.create_search_index(service, revision)
task = get_migrate_search_chain().apply_async()
cls.logger().info(
f"Task queued to indexing data into new search index (Task ID: {task.id})."
tdilauro (Contributor):

Minor: grammar - suggest either:

  • "Task queued to index data..." or
  • "Task queued for indexing data..."

Comment on lines +172 to +174
# This script doesn't take any arguments, but we still call argparse, so that
# we can use the --help option to print out a help message. This avoids the
# surprise of the script actually running when the user just wanted to see the help.
tdilauro (Contributor):

💯

Comment on lines 13 to 14
:param retries: The number of retries that have already been attempted.
:return: The number of seconds to wait before the next retry.
tdilauro (Contributor):

Should we have a limit on the level of backoff?

  limit = 6  # e.g.
  backoff: int = 3 ** (min(retries, limit) + 1)

jonathangreen (Member Author):

In the instances we use this function right now, this is limited by the max retries of the task, but it does make sense to have a parameter like this. I'll add it in.
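A sketch of the capped backoff along the lines suggested above (the merged signature may differ):

```python
def exponential_backoff(retries: int, limit: int = 6) -> int:
    """Seconds to wait before the next retry; the delay stops
    growing once `retries` reaches `limit`."""
    return 3 ** (min(retries, limit) + 1)

# exponential_backoff(0) == 3, exponential_backoff(3) == 81,
# exponential_backoff(10) == exponential_backoff(6) == 2187
```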

Comment on lines 286 to 287
assert work_external_indexing.is_queued(first.work)
work_external_indexing.clear()
tdilauro (Contributor):

Minor: Can you combine these two into

  assert work_external_indexing.is_queued(first.work, clear=True)
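One plausible shape for that fixture method (hypothetical; the actual fixture may differ):

```python
def is_queued(self, work, *, clear: bool = False) -> bool:
    # True if the work was queued for indexing; optionally reset the
    # recorded queue so the next assertion starts clean.
    queued = work.id in self.queued_work_ids
    if clear:
        self.queued_work_ids.clear()
    return queued
```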

Comment on lines 131 to 132
moby_duck = db.work(title="Moby Duck", with_open_access_download=True)
moby_dick = db.work(title="Moby Dick", with_open_access_download=True)
tdilauro (Contributor):

I had to look at this a couple of times because I didn't catch the difference. Moby Duck. 🦆

jonathangreen (Member Author):

I think I pulled this from an older test. It is hard to see the diff though. I renamed things a bit to make it easier to see what's happening at a glance.

Comment on lines 182 to 185
to_remove = []
for item in items:
if item.get("_id") == doc_id:
to_remove.append(item)
tdilauro (Contributor):

I see this came in from older code, but could do this with a list comprehension:

 to_remove = [item for item in items if item.get("_id") == doc_id]

assert doc["customlists"] is None

if work.presentation_edition.contributions:
assert len(doc["contributors"]) is len(
tdilauro (Contributor):

Should this be '==' instead of 'is' here?

jonathangreen (Member Author):

This whole test was just moved in from the coverage providers tests. Not sure why it was using is. I updated it.

tdilauro (Contributor):

@jonathangreen Also, I forgot to mention that your pre-review comments were very handy. Thanks!

@jonathangreen jonathangreen merged commit 5be6f9a into main May 22, 2024
20 checks passed
@jonathangreen jonathangreen deleted the feature/search-indexing-celery branch May 22, 2024 13:10
jonathangreen added a commit that referenced this pull request May 24, 2024