prototype of bulk import v2 distributed file examination #4898

keith-turner · 2024-09-17T22:22:53Z

Two new ways of computing bulk import load plans are offered in this change. First, the RFile API was modified to support computing a LoadPlan as the RFile is written. Second, a new LoadPlan.compute() method was added that creates a LoadPlan from an existing RFile. In addition, methods were added to LoadPlan that support serializing and deserializing load plans to and from JSON.

All of these changes together support the use case of computing load plans in a distributed manner. For example, with a bulk import directory with N files the following use case is now supported.

For each file, a task is spun up on a remote server that calls the new LoadPlan.compute() API to determine which tablets the file overlaps. Then the new LoadPlan.toJson() method is called to serialize the load plan and send it to a central place.
All the load plans from the remote servers are deserialized calling the new LoadPlan.fromJson() method and merged into a single load plan that is used to do the bulk import.
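The two steps above can be illustrated with a minimal, self-contained sketch. It is not Accumulo's implementation and does not use the real LoadPlan API: the class `LoadPlanSketch` and its methods are hypothetical, rows are plain strings, and a "file" is just a list of its rows. It only shows the idea of mapping a file's rows onto tablets defined by split points and then merging the per-file results centrally.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; not part of Accumulo.
class LoadPlanSketch {

  // A tablet covers the row range (prevEndRow, endRow]; null means unbounded on that side.
  static String tablet(String prevEndRow, String endRow) {
    return "(" + (prevEndRow == null ? "-inf" : prevEndRow) + ", "
        + (endRow == null ? "+inf" : endRow) + "]";
  }

  // Step 1 (per file, on a remote worker): find the tablets that contain at least one row.
  static Set<String> overlappingTablets(List<String> rowsInFile, TreeSet<String> splits) {
    Set<String> tablets = new LinkedHashSet<>();
    for (String row : rowsInFile) {
      String prevEndRow = splits.lower(row); // greatest split strictly less than row
      String endRow = splits.ceiling(row); // smallest split greater than or equal to row
      tablets.add(tablet(prevEndRow, endRow));
    }
    return tablets;
  }

  // Step 2 (central): merge the per-file results into one plan keyed by file name.
  static Map<String, Set<String>> merge(List<Map<String, Set<String>>> perFilePlans) {
    Map<String, Set<String>> merged = new TreeMap<>();
    for (Map<String, Set<String>> plan : perFilePlans) {
      merged.putAll(plan);
    }
    return merged;
  }

  public static void main(String[] args) {
    TreeSet<String> splits = new TreeSet<>(Arrays.asList("g", "m", "t"));
    Map<String, Set<String>> planA = new TreeMap<>();
    planA.put("f1.rf", overlappingTablets(Arrays.asList("a", "h"), splits));
    Map<String, Set<String>> planB = new TreeMap<>();
    planB.put("f2.rf", overlappingTablets(Arrays.asList("n", "z"), splits));
    System.out.println(merge(Arrays.asList(planA, planB)));
  }
}
```

In the real workflow the per-file result would travel between machines as the JSON produced by LoadPlan.toJson(); the sketch skips serialization and passes maps directly.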

Another use case these new APIs could support is running this new code in the map reduce job that generates bulk import data.

In each reducer, as it writes to an RFile, it could also be building a LoadPlan. A load plan can be obtained from the RFile after closing it, serialized using LoadPlan.toJson(), and the result saved to a file. So after the map reduce job completes, each RFile would have a corresponding file with a load plan for that RFile.
Another process that runs after the map reduce job can load all the load plans from files and merge them using the new LoadPlan.fromJson() method. Then the merged LoadPlan can be used to do the bulk import.
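As a rough illustration of building a plan while writing, the sketch below tracks which tablets a sorted stream of rows touches as each row is appended. It is a simplified model under stated assumptions, not the RFile writer: `PlanningWriterSketch`, `append`, and `getTablets` are hypothetical names, rows are plain strings, and no data is actually written.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; not part of Accumulo.
class PlanningWriterSketch {
  private final TreeSet<String> splits;
  private final List<String> tablets = new ArrayList<>();
  private String currentEndRow; // end row of the tablet currently being written into
  private boolean started = false;
  private String lastRow;

  PlanningWriterSketch(TreeSet<String> splits) {
    this.splits = splits;
  }

  // Rows must arrive in sorted order, as they would when writing a sorted file.
  void append(String row) {
    if (lastRow != null && row.compareTo(lastRow) < 0) {
      throw new IllegalArgumentException("rows must be appended in sorted order");
    }
    lastRow = row;
    // Because input is sorted, the row is still in the current tablet
    // as long as it does not pass the current tablet's end row.
    boolean inCurrent =
        started && (currentEndRow == null || row.compareTo(currentEndRow) <= 0);
    if (!inCurrent) {
      String prevEndRow = splits.lower(row); // greatest split strictly less than row
      currentEndRow = splits.ceiling(row); // smallest split greater than or equal to row
      tablets.add("(" + (prevEndRow == null ? "-inf" : prevEndRow) + ", "
          + (currentEndRow == null ? "+inf" : currentEndRow) + "]");
      started = true;
    }
    // A real writer would also write the key and value to the file here.
  }

  // Tablets touched so far, in the order they were first seen.
  List<String> getTablets() {
    return Collections.unmodifiableList(tablets);
  }
}
```

Tracking the plan during the write avoids a second pass over the finished file, which is the appeal of doing this inside each reducer.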

Both of these use cases avoid doing the analysis of files on the single machine doing the bulk import. Bulk import v1 had this functionality and would ask random tservers to do the file analysis. This could cause unexpected load on those tservers. Bulk v1 would also interleave analyzing files and adding them to tablets. This could lead to odd situations where a file is added to some tablets and then analysis fails, leaving the file partially imported. Bulk v2 does all analysis before any files are added to tablets, but it lacks this distributed analysis capability. These changes provide the building blocks for bulk v2 to do the distributed analysis that bulk v1 did.

This is a prototype for a few new APIs that allow distributing the examination of files for bulk import. For a given bulk import directory with N files this would support a use case like the following.

1. For each file, a task is spun up on a remote server that calls the new LoadPlan.compute() API to determine which tablets the file overlaps. Then the new LoadPlan.toJson() method is called to serialize the load plan and send it to a central place.
2. All the load plans from the remote servers are deserialized by calling the new LoadPlan.fromJson() method and merged into a single load plan that is used to do the bulk import.

Another use case these new APIs could support is running this new code in the map reduce job that generates bulk import data.

1. In each reducer, after it produces an RFile, it could then call the new LoadPlan.compute(), then call LoadPlan.toJson() and save the result to a file. So after the map reduce job completes, each RFile would have a corresponding file with a load plan for that RFile.
2. Another process that runs after the map reduce job can load all the load plans from files and merge them using the new LoadPlan.fromJson() method. Then the merged LoadPlan can be used to do the bulk import.

BulkNewIT.testComputeLoadPlan() simulates this map reduce use case by going through the steps in code that a map reduce job would. This tests the new APIs and shows what using them would look like.

Both of these use cases avoid doing the analysis of files on the single machine doing the bulk import. Bulk import v1 had this functionality and would ask random tservers to do the file analysis. This could cause unexpected load on those tservers. Bulk v1 would also interleave analyzing files and adding them to tablets. This could lead to odd situations where a file is added to some tablets and then analysis fails, leaving the file partially imported. Bulk v2 does all analysis before any files are added to tablets, but it lacks this distributed analysis capability. This is an initial attempt to offer that functionality in bulk v2.

keith-turner · 2024-09-18T16:41:05Z

core/src/main/java/org/apache/accumulo/core/data/LoadPlan.java

+  // TODO javadoc
+  public static LoadPlan compute(URI file, SortedSet<Text> splits) throws IOException {
+
+    // TODO if the files needed a crypto service how could it be instantiated? Was trying to make


Not sure of the best way forward here. As a design goal, I was attempting to make this compute method independent of something like an Accumulo client and a client context, but ran into a problem with that goal when it came to the crypto service.

You can look at how the rfile PrintInfo command does this. It calls:

CryptoService cs = CryptoFactoryLoader.getServiceForClient(CryptoEnvironment.Scope.TABLE, siteConfig.getAllCryptoProperties());

As a server-side utility, you could make certain assumptions that it has the ability to read the accumulo properties file on the server side, like that utility does. However, as a purely client-side API, you may need to just pass in the CryptoService directly, or pass in other options, so it can set up the right config (crypto, compression, etc.) to be able to read the files.

Made an update in 9ef0bcf to pass in a map of props. The map is passed to the RFile API, which internally calls CryptoFactoryLoader.getServiceForClient using that map of properties.

That could work, since it's a static entry point to building a load plan. If it was dangling off AccumuloClient, users might expect client properties to be passed. But for the static entry point, I think it's reasonable to require them to be provided explicitly.

I like how you were able to use the existing RFile.newScanner() code.

core/src/main/java/org/apache/accumulo/core/client/rfile/RFileWriter.java

ddanielr

Needs some minor formatting changes to pass build checks

keith-turner added 3 commits September 17, 2024 22:16

format code

407358a

fix build

fd70d34

keith-turner added 12 commits September 18, 2024 18:18

remove tight coupling to SortedSet by adding simple indirection

058328a

format code

c8c5f21

use Rfile api to read

9ef0bcf

adds cache

368b2a4

adds ability to constuct load plan while writing to rfile

e228b68

adds tests and javadoc

9328277

fail build when local mods

3190d19

update pom for including sha in version

174b4e0

revert pom change

f82d111

Revert "update pom for including sha in version"

9c7dc66

This reverts commit 174b4e0.

cleanup

4285753

more cleanup

ec0febb

keith-turner mentioned this pull request Sep 30, 2024

Offers new ways to compute bulk import load plans. #4933

Open

fix validation bug

97e4684

ddanielr reviewed Sep 30, 2024

keith-turner and others added 6 commits October 1, 2024 18:36

Update core/src/main/java/org/apache/accumulo/core/data/LoadPlan.java

aabe2d8

Co-authored-by: Daniel Roberts <[email protected]>

Update core/src/test/java/org/apache/accumulo/core/data/LoadPlanTest.…

2003eae

…java Co-authored-by: Daniel Roberts <[email protected]>

Update core/src/test/java/org/apache/accumulo/core/data/LoadPlanTest.…

667f12e

…java Co-authored-by: Daniel Roberts <[email protected]>

Update core/src/test/java/org/apache/accumulo/core/data/LoadPlanTest.…

a5ead55

…java Co-authored-by: Daniel Roberts <[email protected]>

sync w/ 3.1 changes

926dec7

Add new prefix for bulk load working files

235945b
