What is the reason to use the Postgres `id` column for the `_id` field in Elasticsearch? #1947

sarayourfriend · 2023-05-01T03:13:50Z

sarayourfriend
May 1, 2023
Collaborator

The Elasticsearch _id field is set to the Postgres integer ID (primary key) of the row rather than the Openverse UUID identifier:

openverse/ingestion_server/ingestion_server/elasticsearch_models.py

Line 101 in 697f62f

"_id": row[schema["id"]],

What is the reason for this difference? It makes certain queries against Elasticsearch oddly different (and easy to mess up, if you're not aware of this small detail), as the _id of the Elasticsearch document is not the identifier that we use literally everywhere else in our stack.

Would it be possible to change the _id to be the Openverse identifier instead? What's the reason to maintain this difference, if there is one?
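To make the difference concrete, here is a minimal sketch of the two options. It assumes `schema` maps column names to positions in a row tuple (which is what `row[schema["id"]]` suggests); the `bulk_action` helper, the `id_column` parameter, the column layout, and the sample values are all hypothetical and not taken from the ingestion server.

```python
# Hypothetical column layout: column name -> position in the row tuple.
schema = {"id": 0, "identifier": 1, "title": 2}
# Illustrative row: Postgres integer id, Openverse UUID identifier, title.
row = (42, "b840de61-6a4f-4a0c-8d0c-6b1c2f3a9e11", "A photo")


def bulk_action(row, schema, index, id_column="id"):
    """Build a bulk-style action dict for one database row.

    `id_column` selects which column becomes the document `_id`.
    """
    return {
        "_index": index,
        "_id": row[schema[id_column]],
        "_source": {
            "id": row[schema["id"]],
            "identifier": row[schema["identifier"]],
            "title": row[schema["title"]],
        },
    }


# Current behaviour: `_id` is the Postgres primary key.
current = bulk_action(row, schema, "image")
# Proposed behaviour: `_id` is the Openverse identifier.
proposed = bulk_action(row, schema, "image", id_column="identifier")
```

In this sketch the `_source` is identical either way; only the key used for direct document lookups changes. With the current form, fetching a document by `_id` requires the Postgres ID, and anything that only knows the Openverse identifier has to query on the `identifier` field instead.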

Answered by obulat


obulat · 2023-05-01T08:01:14Z

obulat
May 1, 2023
Maintainer

As far as I understand from the "How Indexing Works" diagram¹, the _id was used to determine whether items in the database needed to be synced with the Elasticsearch index. We have since migrated to refreshing all data instead of syncing only the latest items.

In fact, we have a [now obsolete, I think] comment in ingestion_server/indexer.py describing this process:

openverse/ingestion_server/ingestion_server/indexer.py

Lines 1 to 6 in 4d6e995

    
    """
    A utility for indexing data to Elasticsearch.

    For each table to sync, find its largest ID in database. Find the corresponding largest
    ID in Elasticsearch. If the database ID is greater than the largest corresponding
    ID in Elasticsearch, copy the missing records over to Elasticsearch.

It seems this was the issue for implementing this kind of sync between upstream and Elasticsearch: cc-archive/cccatalog-api#7
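The sync described in that docstring can be sketched roughly as follows. This is only an illustration of the max-ID comparison, using plain Python collections in place of Postgres and Elasticsearch; the `records_to_sync` function and its arguments are hypothetical and are not the indexer's actual code.

```python
def records_to_sync(db_rows, es_ids):
    """Return the database rows that are missing from the index.

    db_rows: iterable of dicts, each with an integer "id" (stand-in for a table).
    es_ids:  iterable of integer document `_id`s (stand-in for the index).
    """
    db_rows = list(db_rows)
    max_es_id = max(es_ids, default=0)
    max_db_id = max((row["id"] for row in db_rows), default=0)

    # Nothing to do if the index has already caught up with the database.
    if max_db_id <= max_es_id:
        return []

    # Otherwise copy over everything newer than the largest indexed ID.
    return [row for row in db_rows if row["id"] > max_es_id]


table = [{"id": 1}, {"id": 2}, {"id": 3}]
missing = records_to_sync(table, es_ids=[1, 2])
```

Note that this approach only works when the document `_id` is comparable to the database key and increases with insertion order, which a sequential integer primary key provides.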

¹ https://github.com/cc-archive/cccatalog-api/blob/main/ingestion_server/howitworks.png

1 reply

sarayourfriend May 1, 2023
Collaborator Author

Ahhh, I see! Because the Postgres ID is an int in series, you can use it to see which one was "last", gotcha. I'll look into the code now and see how it actually works, if the database ID is indeed used in this way, and then open an issue to add documentation in the code explaining the reason for it.
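A small sketch of why the integer ID can tell you which row was "last" while a UUID cannot. The rows and UUID values below are made up for illustration, and the example assumes identifiers are random (version 4 style) UUIDs with no relation to insertion order.

```python
import uuid

# (Postgres id, identifier) pairs in insertion order; values are illustrative.
rows = [
    (1, uuid.UUID("f0000000-0000-4000-8000-000000000000")),
    (2, uuid.UUID("10000000-0000-4000-8000-000000000000")),
    (3, uuid.UUID("80000000-0000-4000-8000-000000000000")),
]

# The largest sequential id is the most recently inserted row.
latest_by_id = max(rows, key=lambda r: r[0])

# The "largest" UUID is just whichever value sorts highest numerically;
# here it is the first row inserted, not the last.
latest_by_uuid = max(rows, key=lambda r: r[1])
```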
