Reduce storage required for indexing - stop writing sp_name, res_type, and sp_updated to hfj_spidx_* tables (hapifhir#5941)

* Reduce storage required for indexing - implementation
volodymyr-korzh authored Jun 20, 2024
1 parent 5799c6b commit 0397b9d
Showing 48 changed files with 1,837 additions and 266 deletions.
@@ -0,0 +1,7 @@
---
type: perf
issue: 5937
title: "A new configuration option, `StorageSettings#setIndexStorageOptimized(boolean)` has been added. If enabled,
the server will not write data to the `SP_NAME`, `RES_TYPE`, `SP_UPDATED` columns for all `HFJ_SPIDX_xxx` tables.
This can help reduce the overall storage size on servers where HFJ_SPIDX tables are expected to have a large
amount of data."
@@ -0,0 +1,25 @@
## Possible migration errors on SQL Server (MSSQL)

* This affects only clients running SQL Server (MSSQL) who have custom indexes on the `HFJ_SPIDX` tables that
include the `sp_name` or `res_type` columns.
* For those clients, the migration that makes the `sp_name` and `res_type` columns nullable on the `HFJ_SPIDX` tables may complete with errors,
because changing the nullability of a column that is part of an index can fail on SQL Server (MSSQL).
* If a client wants to keep the existing indexes and settings, these errors can be ignored. However, if a client wants to enable both the [Index Storage Optimized](/hapi-fhir/apidocs/hapi-fhir-jpaserver-model/ca/uhn/fhir/jpa/model/entity/StorageSettings.html#setIndexStorageOptimized(boolean))
and [Index Missing Fields](/hapi-fhir/apidocs/hapi-fhir-jpaserver-model/ca/uhn/fhir/jpa/model/entity/StorageSettings.html#getIndexMissingFields()) settings, manual steps are required to make `sp_name` and `res_type` nullable.

To make the columns nullable in that scenario, execute the steps below:

1. Drop the indexes that include the `sp_name` or `res_type` columns, for example:
```sql
DROP INDEX IDX_SP_TOKEN_REST_TYPE_SP_NAME ON HFJ_SPIDX_TOKEN;
```
2. Change the `sp_name` and `res_type` columns to nullable:

```sql
ALTER TABLE HFJ_SPIDX_TOKEN ALTER COLUMN RES_TYPE varchar(100) NULL;
ALTER TABLE HFJ_SPIDX_TOKEN ALTER COLUMN SP_NAME varchar(100) NULL;
```
3. Additionally, the following index may need to be added to improve search performance:
```sql
CREATE INDEX IDX_SP_TOKEN_MISSING_OPTIMIZED ON HFJ_SPIDX_TOKEN (HASH_IDENTITY, SP_MISSING, RES_ID, PARTITION_ID);
```
@@ -68,3 +68,19 @@ This setting controls whether non-resource (ex: Patient is a resource, MdmLink i
Clients may want to disable this setting for performance reasons as it populates a new set of database tables when enabled.

Setting this property explicitly to false disables the feature: [Non Resource DB History](/apidocs/hapi-fhir-storage/ca/uhn/fhir/jpa/api/config/JpaStorageSettings.html#isNonResourceDbHistoryEnabled())

# Enabling Index Storage Optimization

When this setting is enabled, the server will not write data to the `SP_NAME`, `RES_TYPE`, or `SP_UPDATED` columns of the `HFJ_SPIDX_xxx` tables.

This setting may be enabled on servers where the `HFJ_SPIDX_xxx` tables are expected to hold a large amount of data (millions of rows), in order to reduce the overall storage size.

Setting this property explicitly to true enables the feature: [Index Storage Optimized](/hapi-fhir/apidocs/hapi-fhir-jpaserver-model/ca/uhn/fhir/jpa/model/entity/StorageSettings.html#setIndexStorageOptimized(boolean))
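
A minimal sketch of enabling the flag on the JPA storage settings bean is shown below. The surrounding configuration class and bean wiring are assumptions and will differ between deployments; only the `setIndexStorageOptimized(true)` call is taken from this change:

```java
import ca.uhn.fhir.jpa.api.config.JpaStorageSettings;

public class StorageConfig {

	// Hypothetical factory method - register the returned settings in your own server configuration
	public JpaStorageSettings storageSettings() {
		JpaStorageSettings settings = new JpaStorageSettings();
		// Stop writing SP_NAME, RES_TYPE and SP_UPDATED to the HFJ_SPIDX_xxx tables
		settings.setIndexStorageOptimized(true);
		return settings;
	}
}
```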

## Limitations

* This setting only applies to newly inserted and updated rows in the `HFJ_SPIDX_xxx` tables. All existing rows will still have values in the `SP_NAME`, `RES_TYPE`, and `SP_UPDATED` columns. Executing the `$reindex` operation will apply the storage optimization to existing data.

* If this setting is enabled along with the [Index Missing Fields](/hapi-fhir/apidocs/hapi-fhir-jpaserver-model/ca/uhn/fhir/jpa/model/entity/StorageSettings.html#getIndexMissingFields()) setting, the following index may need to be added to the `HFJ_SPIDX_xxx` tables to improve search performance: `(HASH_IDENTITY, SP_MISSING, RES_ID, PARTITION_ID)`.

* This setting should not be enabled in combination with the [Include Partition in Search Hashes](/hapi-fhir/apidocs/hapi-fhir-jpaserver-model/ca/uhn/fhir/jpa/model/config/PartitionSettings.html#setIncludePartitionInSearchHashes(boolean)) flag, because the partition ID cannot be included in the search hashes when index storage optimization is enabled.
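
As a rough illustration of the last limitation (the setting classes are real, but the wiring shown here is only a sketch): with partitioning enabled, a server configured with both flags fails its startup validation with a `ConfigurationException`.

```java
import ca.uhn.fhir.jpa.api.config.JpaStorageSettings;
import ca.uhn.fhir.jpa.model.config.PartitionSettings;

public class ConflictingSettingsExample {

	public static void main(String[] args) {
		PartitionSettings partitionSettings = new PartitionSettings();
		partitionSettings.setPartitioningEnabled(true);
		// Folds the partition ID into the search hash values...
		partitionSettings.setIncludePartitionInSearchHashes(true);

		JpaStorageSettings storageSettings = new JpaStorageSettings();
		// ...which cannot be combined with optimized index storage: a server
		// configured with both settings is rejected at startup with a ConfigurationException.
		storageSettings.setIndexStorageOptimized(true);
	}
}
```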
@@ -502,7 +502,7 @@ The following columns are common to **all HFJ_SPIDX_xxx tables**.
<td>SP_NAME</td>
<td></td>
<td>String</td>
<td>Nullable</td>
<td>
This is the name of the search parameter being indexed.
</td>
@@ -511,7 +511,7 @@ The following columns are common to **all HFJ_SPIDX_xxx tables**.
<td>RES_TYPE</td>
<td></td>
<td>String</td>
<td>Nullable</td>
<td>
This is the name of the resource being indexed.
</td>
@@ -6,6 +6,6 @@ The [PartitionSettings](/hapi-fhir/apidocs/hapi-fhir-jpaserver-model/ca/uhn/fhir

The following settings can be enabled:

* **Include Partition in Search Hashes** ([JavaDoc](/hapi-fhir/apidocs/hapi-fhir-jpaserver-model/ca/uhn/fhir/jpa/model/config/PartitionSettings.html#setIncludePartitionInSearchHashes(boolean))): If this feature is enabled, partition IDs will be factored into [Search Hashes](/hapi-fhir/docs/server_jpa/schema.html#search-hashes). When this flag is not set (the default) and a search requests a specific partition, an additional SQL WHERE predicate is added to the query to explicitly request the given partition ID. When this flag is set, this additional WHERE predicate is not necessary since the partition is factored into the hash value being searched on. Setting this flag avoids the need to manually adjust indexes against the HFJ_SPIDX tables. Note that this flag should **not be used in environments where partitioning is being used for security purposes**, since it is possible for a user to reverse engineer false hash collisions. This setting should also not be enabled in combination with the [Index Storage Optimized](/hapi-fhir/apidocs/hapi-fhir-jpaserver-model/ca/uhn/fhir/jpa/model/entity/StorageSettings.html#isIndexStorageOptimized()) flag, because the partition ID cannot be included in the search hashes in that case.

* **Cross-Partition Reference Mode**: ([JavaDoc](/hapi-fhir/apidocs/hapi-fhir-jpaserver-model/ca/uhn/fhir/jpa/model/config/PartitionSettings.html#setAllowReferencesAcrossPartitions(ca.uhn.fhir.jpa.model.config.PartitionSettings.CrossPartitionReferenceMode))): This setting controls whether resources in one partition should be allowed to create references to resources in other partitions.
@@ -19,7 +19,9 @@
*/
package ca.uhn.fhir.jpa.config;

import ca.uhn.fhir.context.ConfigurationException;
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.i18n.Msg;
import ca.uhn.fhir.interceptor.api.IInterceptorBroadcaster;
import ca.uhn.fhir.jpa.api.config.JpaStorageSettings;
import ca.uhn.fhir.jpa.api.dao.DaoRegistry;
@@ -47,6 +49,7 @@
import ca.uhn.fhir.jpa.search.cache.ISearchResultCacheSvc;
import ca.uhn.fhir.rest.server.IPagingProvider;
import ca.uhn.fhir.rest.server.util.ISearchParamRegistry;
import jakarta.annotation.PostConstruct;
import org.hl7.fhir.instance.model.api.IBaseResource;
import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.annotation.Autowired;
@@ -206,4 +209,15 @@ public SearchContinuationTask createSearchContinuationTask(SearchTaskParameters
exceptionService() // singleton
);
}

@PostConstruct
public void validateConfiguration() {
if (myStorageSettings.isIndexStorageOptimized()
&& myPartitionSettings.isPartitioningEnabled()
&& myPartitionSettings.isIncludePartitionInSearchHashes()) {
throw new ConfigurationException(Msg.code(2525) + "Incorrect configuration. "
+ "StorageSettings#isIndexStorageOptimized and PartitionSettings.isIncludePartitionInSearchHashes "
+ "cannot be enabled at the same time.");
}
}
}
@@ -20,7 +20,9 @@
package ca.uhn.fhir.jpa.dao.index;

import ca.uhn.fhir.jpa.model.entity.BaseResourceIndex;
import ca.uhn.fhir.jpa.model.entity.BaseResourceIndexedSearchParam;
import ca.uhn.fhir.jpa.model.entity.ResourceTable;
import ca.uhn.fhir.jpa.model.entity.StorageSettings;
import ca.uhn.fhir.jpa.searchparam.extractor.ResourceIndexedSearchParams;
import ca.uhn.fhir.jpa.util.AddRemoveCount;
import com.google.common.annotations.VisibleForTesting;
@@ -29,10 +31,12 @@
import jakarta.persistence.PersistenceContextType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
@@ -42,6 +46,9 @@
public class DaoSearchParamSynchronizer {
private static final Logger ourLog = LoggerFactory.getLogger(DaoSearchParamSynchronizer.class);

@Autowired
private StorageSettings myStorageSettings;

@PersistenceContext(type = PersistenceContextType.TRANSACTION)
protected EntityManager myEntityManager;

@@ -68,6 +75,11 @@ public AddRemoveCount synchronizeSearchParamsToDatabase(
return retVal;
}

@VisibleForTesting
public void setStorageSettings(StorageSettings theStorageSettings) {
this.myStorageSettings = theStorageSettings;
}

@VisibleForTesting
public void setEntityManager(EntityManager theEntityManager) {
myEntityManager = theEntityManager;
@@ -115,6 +127,7 @@ private <T extends BaseResourceIndex> void synchronize(
List<T> paramsToRemove = subtract(theExistingParams, newParams);
List<T> paramsToAdd = subtract(newParams, theExistingParams);
tryToReuseIndexEntities(paramsToRemove, paramsToAdd);
updateExistingParamsIfRequired(theExistingParams, paramsToAdd, newParams, paramsToRemove);

for (T next : paramsToRemove) {
if (!myEntityManager.contains(next)) {
@@ -134,6 +147,62 @@ private <T extends BaseResourceIndex> void synchronize(
theAddRemoveCount.addToRemoveCount(paramsToRemove.size());
}

/**
* <p>
* This method updates existing search parameter entities during a
* <code>$reindex</code> or update operation by:
* 1. Marking existing entities for update so that index storage optimization is applied,
* if it is enabled (disabled by default).
* 2. Recovering the <code>SP_NAME</code> and <code>RES_TYPE</code> values of existing entities
* when index storage optimization is disabled (but was enabled previously).
* </p>
* For details, see: {@link StorageSettings#isIndexStorageOptimized()}
*/
private <T extends BaseResourceIndex> void updateExistingParamsIfRequired(
Collection<T> theExistingParams,
List<T> theParamsToAdd,
Collection<T> theNewParams,
List<T> theParamsToRemove) {

theExistingParams.stream()
.filter(BaseResourceIndexedSearchParam.class::isInstance)
.map(BaseResourceIndexedSearchParam.class::cast)
.filter(this::isSearchParameterUpdateRequired)
.filter(sp -> !theParamsToAdd.contains(sp))
.filter(sp -> !theParamsToRemove.contains(sp))
.forEach(sp -> {
// force hibernate to update Search Parameter entity by resetting SP_UPDATED value
sp.setUpdated(new Date());
recoverExistingSearchParameterIfRequired(sp, theNewParams);
theParamsToAdd.add((T) sp);
});
}

/**
* Search parameters need to be updated after the IndexStorageOptimized setting changes.
* If IndexStorageOptimized is disabled (but was enabled previously), this method copies the param name
* and resource type from the freshly extracted search parameter to the existing one.
*/
private <T extends BaseResourceIndex> void recoverExistingSearchParameterIfRequired(
BaseResourceIndexedSearchParam theSearchParamToRecover, Collection<T> theNewParams) {
if (!myStorageSettings.isIndexStorageOptimized()) {
theNewParams.stream()
.filter(BaseResourceIndexedSearchParam.class::isInstance)
.map(BaseResourceIndexedSearchParam.class::cast)
.filter(paramToAdd -> paramToAdd.equals(theSearchParamToRecover))
.findFirst()
.ifPresent(newParam -> {
theSearchParamToRecover.restoreParamName(newParam.getParamName());
theSearchParamToRecover.setResourceType(newParam.getResourceType());
});
}
}

private boolean isSearchParameterUpdateRequired(BaseResourceIndexedSearchParam theSearchParameter) {
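// rewrite the entity whenever its stored optimization state no longer matches the current IndexStorageOptimized setting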
return (myStorageSettings.isIndexStorageOptimized() && !theSearchParameter.isIndexStorageOptimized())
|| (!myStorageSettings.isIndexStorageOptimized() && theSearchParameter.isIndexStorageOptimized());
}

/**
* The logic here is that often times when we update a resource we are dropping
* one index row and adding another. This method tries to reuse rows that would otherwise
@@ -250,6 +250,104 @@ protected void init740() {
.unique(false)
.withColumns("RES_UPDATED", "RES_ID")
.heavyweightSkipByDefault();

// Allow null values in SP_NAME, RES_TYPE columns for all HFJ_SPIDX_* tables. These are marked as failure
// allowed, since SQL Server won't let us change nullability on columns with indexes pointing to them.
{
Builder.BuilderWithTableName spidxCoords = version.onTable("HFJ_SPIDX_COORDS");
spidxCoords
.modifyColumn("20240617.1", "SP_NAME")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();
spidxCoords
.modifyColumn("20240617.2", "RES_TYPE")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();

Builder.BuilderWithTableName spidxDate = version.onTable("HFJ_SPIDX_DATE");
spidxDate
.modifyColumn("20240617.3", "SP_NAME")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();
spidxDate
.modifyColumn("20240617.4", "RES_TYPE")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();

Builder.BuilderWithTableName spidxNumber = version.onTable("HFJ_SPIDX_NUMBER");
spidxNumber
.modifyColumn("20240617.5", "SP_NAME")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();
spidxNumber
.modifyColumn("20240617.6", "RES_TYPE")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();

Builder.BuilderWithTableName spidxQuantity = version.onTable("HFJ_SPIDX_QUANTITY");
spidxQuantity
.modifyColumn("20240617.7", "SP_NAME")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();
spidxQuantity
.modifyColumn("20240617.8", "RES_TYPE")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();

Builder.BuilderWithTableName spidxQuantityNorm = version.onTable("HFJ_SPIDX_QUANTITY_NRML");
spidxQuantityNorm
.modifyColumn("20240617.9", "SP_NAME")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();
spidxQuantityNorm
.modifyColumn("20240617.10", "RES_TYPE")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();

Builder.BuilderWithTableName spidxString = version.onTable("HFJ_SPIDX_STRING");
spidxString
.modifyColumn("20240617.11", "SP_NAME")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();
spidxString
.modifyColumn("20240617.12", "RES_TYPE")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();

Builder.BuilderWithTableName spidxToken = version.onTable("HFJ_SPIDX_TOKEN");
spidxToken
.modifyColumn("20240617.13", "SP_NAME")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();
spidxToken
.modifyColumn("20240617.14", "RES_TYPE")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();

Builder.BuilderWithTableName spidxUri = version.onTable("HFJ_SPIDX_URI");
spidxUri.modifyColumn("20240617.15", "SP_NAME")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();
spidxUri.modifyColumn("20240617.16", "RES_TYPE")
.nullable()
.withType(ColumnTypeEnum.STRING, 100)
.failureAllowed();
}
}

protected void init720() {
@@ -98,10 +98,19 @@ public Condition createHashIdentityPredicate(String theResourceType, String theP

public Condition createPredicateParamMissingForNonReference(
String theResourceName, String theParamName, Boolean theMissing, RequestPartitionId theRequestPartitionId) {
List<Condition> conditions = new ArrayList<>();
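// With index storage optimization enabled, SP_NAME / RES_TYPE are not written to the index tables,
// so the missing-parameter predicate matches on HASH_IDENTITY instead of the name and type columns.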
if (getStorageSettings().isIndexStorageOptimized()) {
Long hashIdentity = BaseResourceIndexedSearchParam.calculateHashIdentity(
getPartitionSettings(), getRequestPartitionId(), theResourceName, theParamName);
conditions.add(BinaryCondition.equalTo(getColumnHashIdentity(), generatePlaceholder(hashIdentity)));
} else {
conditions.add(BinaryCondition.equalTo(getResourceTypeColumn(), generatePlaceholder(theResourceName)));
conditions.add(BinaryCondition.equalTo(getColumnParamName(), generatePlaceholder(theParamName)));
}
conditions.add(BinaryCondition.equalTo(getMissingColumn(), generatePlaceholder(theMissing)));

ComboCondition condition = ComboCondition.and(conditions.toArray());
return combineWithRequestPartitionIdPredicate(theRequestPartitionId, condition);
}

@@ -4,6 +4,7 @@
import ca.uhn.fhir.jpa.model.entity.BaseResourceIndex;
import ca.uhn.fhir.jpa.model.entity.ResourceIndexedSearchParamNumber;
import ca.uhn.fhir.jpa.model.entity.ResourceTable;
import ca.uhn.fhir.jpa.model.entity.StorageSettings;
import ca.uhn.fhir.jpa.searchparam.extractor.ResourceIndexedSearchParams;
import ca.uhn.fhir.jpa.util.AddRemoveCount;
import jakarta.persistence.EntityManager;
@@ -61,6 +62,7 @@ void setUp() {
THE_SEARCH_PARAM_NUMBER.setResource(resourceTable);

subject.setEntityManager(entityManager);
subject.setStorageSettings(new StorageSettings());
}

@Test