prototype of bulk import v2 distributed file examination #4898

Draft
wants to merge 22 commits into base: 2.1

Commits on Sep 17, 2024

  1. prototype of bulk import v2 distributed file examination

    This is a prototype of a few new APIs that allow distributing the
    examination of files for bulk import.
    
    For a given bulk import directory with N files this would support a use
    case like the following.
    
     1. For each file a task is spun up on a remote server that calls the new
        LoadPlan.compute() API to determine what tablets the file overlaps.
        Then the new LoadPlan.toJson() method is called to serialize the
        load plan and send it to a central place.
     2. All the load plans from the remote servers are deserialized calling
        the new LoadPlan.fromJson() method and merged into a single load
        plan that is used to do the bulk import.
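
The two steps above can be modeled with a minimal, self-contained sketch. It does not use any Accumulo classes: `LoadPlanSketch`, `compute`, and `merge` are hypothetical stand-ins, a file is reduced to the sorted set of rows it contains, and a tablet is written as the string `(prevEndRow, endRow]`. The real `LoadPlan.compute()` signature is not shown in this commit message, so nothing here should be read as the actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Simplified stand-in for the idea behind the new APIs; NOT the Accumulo API.
final class LoadPlanSketch {

  // Describes a tablet as "(prevEndRow, endRow]"; null means unbounded.
  static String tablet(String prevEndRow, String endRow) {
    return "(" + (prevEndRow == null ? "-inf" : prevEndRow) + ", "
        + (endRow == null ? "+inf" : endRow) + "]";
  }

  // Step 1 (remote task): find the tablets that contain at least one row
  // of the file, given the table's split points.
  static List<String> compute(TreeSet<String> fileRows, TreeSet<String> splits) {
    List<String> tablets = new ArrayList<>();
    String row = fileRows.isEmpty() ? null : fileRows.first();
    while (row != null) {
      String endRow = splits.ceiling(row);   // smallest split >= row, or null
      String prevEndRow = splits.lower(row); // greatest split < row, or null
      tablets.add(tablet(prevEndRow, endRow));
      // Skip the remaining rows of this tablet and jump to the next one with data.
      row = endRow == null ? null : fileRows.higher(endRow);
    }
    return tablets;
  }

  // Step 2 (central place): merge per-file plans (file name -> tablets).
  static Map<String, List<String>> merge(List<Map<String, List<String>>> plans) {
    Map<String, List<String>> merged = new TreeMap<>();
    for (Map<String, List<String>> plan : plans) {
      merged.putAll(plan);
    }
    return merged;
  }

  public static void main(String[] args) {
    TreeSet<String> splits = new TreeSet<>(List.of("g", "m", "t"));
    // Each of these plans would be computed by a different remote task.
    Map<String, List<String>> planA =
        Map.of("f1.rf", compute(new TreeSet<>(List.of("a", "c", "n", "z")), splits));
    Map<String, List<String>> planB =
        Map.of("f2.rf", compute(new TreeSet<>(List.of("h", "k")), splits));
    System.out.println(merge(List.of(planA, planB)));
  }
}
```

In this model only tablets that actually hold data from a file end up in its plan, so a file with rows on both sides of an empty tablet does not list that tablet.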
    
    Another use case these new APIs could support is running this new code
    in the map reduce job that generates bulk import data.
    
      1. In each reducer after it produces an rfile it could then call the
         new LoadPlan.compute(), then call LoadPlan.toJson() and save the
         result to a file.  So after the map reduce job completes each rfile
         would have a corresponding file with a load plan for that file.
      2. Another process that runs after the map reduce job can load all the
         load plans from files and merge them using the new
         LoadPlan.fromJson() method.  Then the merged LoadPlan can be used
         to do the bulk import.
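
The file handling in this map reduce flow could look roughly like the sketch below. It is an illustration under assumptions, not code from this PR: the class `SidecarPlanSketch`, the `.plan` suffix, and the one-tablet-per-line text format are all hypothetical, and the text format stands in for the output of `LoadPlan.toJson()` and the input of `LoadPlan.fromJson()`. It uses the local filesystem through `java.nio.file`, where a real job would more likely use the Hadoop filesystem API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sidecar-file handling; the plan format is a stand-in for JSON.
final class SidecarPlanSketch {

  static final String SUFFIX = ".plan"; // assumed naming convention

  // Reducer side: after writing an rfile, save its plan in a file next to it.
  static Path savePlan(Path rfile, List<String> tablets) throws IOException {
    Path sidecar = rfile.resolveSibling(rfile.getFileName() + SUFFIX);
    return Files.write(sidecar, tablets); // one tablet description per line
  }

  // Driver side: after the job completes, read every sidecar file in the
  // directory and merge the plans into one map of rfile name -> tablets.
  static Map<String, List<String>> mergePlans(Path dir) throws IOException {
    Map<String, List<String>> merged = new TreeMap<>();
    try (DirectoryStream<Path> sidecars = Files.newDirectoryStream(dir, "*" + SUFFIX)) {
      for (Path sidecar : sidecars) {
        String name = sidecar.getFileName().toString();
        String rfileName = name.substring(0, name.length() - SUFFIX.length());
        merged.put(rfileName, Files.readAllLines(sidecar));
      }
    }
    return merged;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("bulk");
    // Each call would happen in a different reducer.
    savePlan(dir.resolve("f1.rf"), List.of("(-inf, g]", "(m, t]"));
    savePlan(dir.resolve("f2.rf"), List.of("(g, m]"));
    // A single process merges the plans once the job has finished.
    System.out.println(mergePlans(dir));
  }
}
```

Keeping each plan next to its rfile means the merging process only needs the output directory of the job to find everything it has to load.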
    
    BulkNewIT.testComputeLoadPlan() simulates this map reduce use case by
    going through the steps in code that a map reduce job would.  This tests
    the new APIs and shows what using them would look like.
    
    Both of these use cases avoid analyzing all of the files on the single
    machine doing the bulk import.  Bulk import v1 had this functionality
    and would ask random tservers to do the file analysis, which could cause
    unexpected load on those tservers.  Bulk v1 would also interleave
    analyzing files and adding them to tablets.  This could lead to odd
    situations where analysis fails after a file has been added to some
    tablets, leaving the file partially imported.  Bulk v2 does all analysis
    before any files are added to tablets, however it lacks this distributed
    analysis capability.  This is an initial attempt to offer that
    functionality in bulk v2.
    keith-turner committed Sep 17, 2024
    aa593ad
  2. format code

    keith-turner committed Sep 17, 2024
    407358a
  3. fix build

    keith-turner committed Sep 17, 2024
    fd70d34

Commits on Sep 18, 2024

  1. 058328a
  2. format code

    keith-turner committed Sep 18, 2024
    c8c5f21
  3. use Rfile api to read

    keith-turner committed Sep 18, 2024
    9ef0bcf

Commits on Sep 19, 2024

  1. adds cache

    keith-turner committed Sep 19, 2024
    368b2a4

Commits on Sep 26, 2024

  1. e228b68

Commits on Sep 27, 2024

  1. adds tests and javadoc

    keith-turner committed Sep 27, 2024
    9328277
  2. 3190d19
  3. 174b4e0
  4. revert pom change

    keith-turner committed Sep 27, 2024
    f82d111

Commits on Sep 30, 2024

  1. Revert "update pom for including sha in version"

    This reverts commit 174b4e0.
    keith-turner committed Sep 30, 2024
    9c7dc66
  2. cleanup

    keith-turner committed Sep 30, 2024
    4285753
  3. more cleanup

    keith-turner committed Sep 30, 2024
    ec0febb
  4. fix validation bug

    keith-turner committed Sep 30, 2024
    97e4684

Commits on Oct 1, 2024

  1. aabe2d8
  2. 2003eae
  3. 667f12e
  4. a5ead55
  5. sync w/ 3.1 changes

    keith-turner committed Oct 1, 2024
    926dec7

Commits on Oct 30, 2024

  1. 235945b