
Commit

new image, delete stuff about copying, minor proofreading. (#6875)
johnnyaug authored Oct 25, 2023
1 parent 32cfbfa commit 669e932
Showing 3 changed files with 28 additions and 38 deletions.
Binary file modified docs/assets/img/UI-Import-Dialog.png
2 changes: 2 additions & 0 deletions docs/howto/copying.md
@@ -7,6 +7,8 @@ redirect_from:
- /integrations/rclone.html
---

# Copying data to/from lakeFS

{% include toc.html %}

## Using DistCp
64 changes: 26 additions & 38 deletions docs/howto/import.md
@@ -6,42 +6,35 @@ redirect_from:
- /setup/import.html
---

# Import data into lakeFS
_This section describes how to import existing data into a lakeFS repository, without copying it.
If you are interested in copying data into lakeFS, see [Copying data to/from lakeFS](./copying.md)._
{: .mt-5 .mb-1 }

The simplest way to bring data into lakeFS is by [copying it](#copying-data-into-a-lakefs-repository), but this approach may not be suitable when a lot of data is involved.
To avoid copying the data, lakeFS offers [Zero-copy import](#zero-copy-import). With this approach, lakeFS only creates pointers to your existing objects in your new repository.
# Importing data into lakeFS
{: .mt-2 }

{% include toc_2-3.html %}

## Zero-copy import

Mirror an existing object store location into a lakeFS repository, without copying the data.

### Prerequisites
## Prerequisites

* Importing is permitted for users in the Supers (lakeFS open-source) group or the SuperUsers (lakeFS Cloud/Enterprise) group.
To learn how lakeFS Cloud and lakeFS Enterprise users can fine-tune import permissions, see [Fine-grained permissions](#fine-grained-permissions) below.
* The lakeFS _server_ must have permissions to list the objects in the source bucket.
* The source bucket must be in the same region as your repository.

### Using the lakeFS UI

1. In your repository's main page, click the _Import_ button to open the import dialog:

![lakeFS UI import dialog]({% link assets/img/UI-Import-Dialog.png %})
## Using the lakeFS UI

1. On your repository's main page, click the _Import_ button to open the import dialog.
2. Under _Import from_, fill in the location on your object store you would like to import from.
3. Fill in the import destination in lakeFS
4. Add a commit message, and optionally metadata.
5. Press _Import_
3. Fill in the import destination in lakeFS. This should be a path under the current branch.
4. Add a commit message, and optionally commit metadata.
5. Press _Import_.

Once the import is complete, a new commit containing the imported objects will be created in the destination branch.

#### Notes
* Any previously existing objects under the destination prefix will be deleted.
* The import duration depends on the amount of imported objects, but will roughly be a few thousand objects per second.
![lakeFS UI import dialog]({% link assets/img/UI-Import-Dialog.png %})

### Using the CLI: _lakectl import_
## Using the CLI: _lakectl import_
The _lakectl import_ command acts the same as the UI import wizard. It commits the changes to the selected branch.

<div class="tabs">
@@ -73,15 +66,18 @@ lakectl import \
</div>
</div>
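
For reference, a minimal invocation might look like the sketch below; the bucket, repository, and branch names are placeholders, not values from the original examples:

```bash
lakectl import \
  --from s3://example-bucket/production/collections/ \
  --to lakefs://example-repo/main/collections/ \
  --message "import production collections"
```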

### Limitations
## Notes
{:.no_toc}

1. Any previously existing objects under the destination prefix will be deleted.
1. The import duration depends on the number of imported objects; expect roughly a few thousand objects per second (for example, importing 10 million objects would take on the order of an hour).
1. Importing is only possible from the object storage service in which your installation stores its data. For example, if lakeFS is configured to use S3, you cannot import data from Azure.
2. For security reasons, if you are using lakeFS on top of your local disk (`blockstore.type=local`), you need to enable the import feature explicitly.
1. For security reasons, if you are using lakeFS on top of your local disk (`blockstore.type=local`), you need to enable the import feature explicitly.
To do so, set `blockstore.local.import_enabled` to `true` and specify the allowed import paths in `blockstore.local.allowed_external_prefixes` (see the [configuration reference]({% link reference/configuration.md %}) and the sketch after this list).
Since there are some differences between object stores and file systems in the way directories/prefixes are treated, local import is allowed only for directories.
3. Making changes to data in the original bucket will not be reflected in lakeFS, and may cause inconsistencies.
1. Making changes to data in the original bucket will not be reflected in lakeFS, and may cause inconsistencies.
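
For the local-blockstore item above, a minimal configuration sketch could look like this; the paths are placeholders, and the key names are the ones referenced in that item (see the [configuration reference]({% link reference/configuration.md %}) for the authoritative schema):

```yaml
blockstore:
  type: local
  local:
    path: ~/lakefs/data          # placeholder: where lakeFS keeps its own data
    import_enabled: true         # import is disabled by default for local storage
    allowed_external_prefixes:
      - /mnt/shared/datasets     # placeholder: only directories under this prefix may be imported
```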

### Fine-grained permissions
## Fine-grained permissions
{:.no_toc}
{: .d-inline-block }
lakeFS Cloud
@@ -94,7 +90,7 @@ With RBAC support, the lakeFS user running the import command should have the following permissions:

As mentioned above, all of these permissions are available by default to the Supers (lakeFS open-source) group or the SuperUsers (lakeFS Cloud/Enterprise) group.

### Provider-specific permissions
## Provider-specific permissions
{:.no_toc}

In addition, the following provider-specific permissions may be required:
@@ -108,7 +104,8 @@
<div markdown="1" id="aws-s3">


#### AWS S3: Importing from public buckets
## AWS S3: Importing from public buckets
{:.no_toc}

lakeFS needs access to the imported location, first to list the files to import and later to read the files upon user request.

@@ -151,7 +148,8 @@ the following policy needs to be attached to the lakeFS S3 service account to allow this access:
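
Purely as an illustration of the shape such a policy takes, a minimal read-only statement might look like the following; the bucket name and statement ID are placeholders, and the precise action list should be taken from the full documentation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadImportSource",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::example-public-bucket",
        "arn:aws:s3:::example-public-bucket/*"
      ]
    }
  ]
}
```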
<div markdown="1" id="azure-storage">
See [Azure deployment][deploy-azure-storage-account-creds] for limitations when using account credentials.

#### Azure Data Lake Gen2
### Azure Data Lake Gen2
{:.no_toc}

lakeFS requires a hint in the import source URL to understand that the provided storage account is ADLS Gen2:
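
As an illustration, the assumed convention is an `adls`-based endpoint host; the storage account, container, and path below are placeholders, so verify the exact form against the Azure section of the docs:

```
https://<storage-account>.adls.core.windows.net/<container>/<path-to-import>/
```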

@@ -169,15 +167,5 @@ No specific prerequisites
</div>
</div>

## Copying data into a lakeFS repository

Another way of getting existing data into a lakeFS repository is by copying it. This has the advantage that the objects, along with their metadata, are managed by the lakeFS installation, including lifecycle rules, immutability guarantees, and consistent listing. However, do make sure to account for storage cost and time.

To copy data into lakeFS you can use the following tools:

1. The `lakectl` command line tool - see the [reference][lakectl-fs-upload] to learn more about using it to copy local data into lakeFS. Using `lakectl fs upload --recursive` you can upload multiple objects together from a given directory (see the sketch after this list).
1. Using [rclone](./copying.md#using-rclone)
1. Using Hadoop's [DistCp](./copying.md#using-distcp)
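
For item 1 above, a hypothetical recursive upload could look like this; the repository, branch, and directory names are placeholders:

```bash
lakectl fs upload --recursive \
  --source ./local-data/ \
  lakefs://example-repo/main/datasets/
```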

[deploy-azure-storage-account-creds]: {% link howto/deploy/azure.md %}#storage-account-credentials
[lakectl-fs-upload]: {% link reference/cli.md %}#lakectl-fs-upload
