lakeview

lakeview is a visibility tool for AWS S3-based data lakes.

Think of it as ncdu, but for petabyte-scale data on S3.

Instead of scanning billions of objects using the S3 API (which would require millions of API calls), lakeview uses Athena to query S3 Inventory Reports.

What can it do?

  1. Aggregate the sizes of directories* in S3, allowing you to drill down and find what is taking up space.
  2. Compare sizes between different dates - see how directory sizes change over time between different inventory reports.
  3. (Planned, not yet implemented) Find the largest duplicates in your directories.

* S3, being an object store and not a filesystem, doesn't really have a notion of directories, but its API supports so-called "common prefixes".
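To make the idea of "common prefixes" concrete, here is a minimal Python sketch of grouping keys by delimiter and summing their sizes. It is only an illustration of the concept: the function name `du` and the sample keys are made up for this example, and lakeview itself performs the aggregation through Athena rather than in Python.

```python
from collections import defaultdict


def du(objects, prefix="", delimiter="/"):
    """Sum object sizes per common prefix, one level below `prefix`.

    `objects` is an iterable of (key, size_in_bytes) pairs.
    Hypothetical helper for illustration; not part of lakeview.
    """
    totals = defaultdict(int)
    for key, size in objects:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        # Keys that contain the delimiter after the prefix roll up into a
        # "directory" (common prefix); keys without it stay as plain objects.
        totals[prefix + head + sep] += size
    return dict(totals)


# Made-up inventory rows: (key, size in bytes)
inventory = [
    ("users/alice/a.parquet", 100),
    ("users/bob/b.parquet", 50),
    ("production/events/1.parquet", 700),
    ("README.txt", 5),
]

top_level = du(inventory)
users_level = du(inventory, prefix="users/")
print(top_level)
print(users_level)
```

Calling `du` again with one of the returned prefixes is the "drill down" step: each call reports only the level directly below the given prefix.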

All capabilities are provided in both a human-consumable web interface and a machine-consumable JSON report - feel free to plug them into your favorite monitoring tool.

What does it look like?

Size report:

Size diff:

Quickstart

  1. Ensure you have an S3 inventory set up (preferably as Parquet or ORC)

  2. Verify the table is registered in Athena

  3. Run lakeview as a standalone Docker container:

    docker run -it -p 5000:5000 \
        -v $HOME/.aws:/home/lakeview/.aws \
        treeverse/lakeview \
            --table <athena table name> \
            --output-location <s3 uri>

    Note: <athena table name> is the name of the table you registered in step 2, and <s3 uri> is a location in S3 where Athena can store its results (e.g. s3://my-bucket/athena/).

  4. Open http://localhost:5000/ and start exploring

Using lakeview as an API

API endpoint: /du

To get results as JSON, add an Accept: application/json header to your request, or pass json as a query string parameter.

Query Parameters:

prefix (default: "") - return objects and directories* starting with the given prefix

delimiter (default: "/") - use this character as delimiter to group objects under a common prefix

date - date string corresponding to the inventory you'd like to query (YYYY-MM-DD-00-00 is S3's default structure)

compare (optional) - another date string. If present, lakeview will calculate a diff between the two reports for every common prefix and will sort the results based on the largest absolute diff
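As a rough sketch, the parameters above could be assembled into a request from Python using only the standard library. The helper names `inventory_date` and `build_du_request` are hypothetical, and the base URL assumes the Quickstart setup on localhost:5000. The sketch asks for JSON through the Accept header rather than the json query parameter.

```python
from datetime import date
from urllib.parse import urlencode
from urllib.request import Request


def inventory_date(day):
    """Format a date in the YYYY-MM-DD-00-00 structure described above."""
    return day.strftime("%Y-%m-%d-00-00")


def build_du_request(base_url, day, compare=None, prefix="", delimiter="/"):
    """Build a request for the /du endpoint (hypothetical helper)."""
    params = {
        "prefix": prefix,
        "delimiter": delimiter,
        "date": inventory_date(day),
    }
    if compare is not None:
        params["compare"] = inventory_date(compare)
    # urlencode percent-encodes the delimiter, so "/" becomes "%2F".
    url = f"{base_url}/du?{urlencode(params)}"
    return Request(url, headers={"Accept": "application/json"})


req = build_du_request(
    "http://localhost:5000",  # assumes the Quickstart port mapping
    date(2020, 8, 23),
    compare=date(2020, 8, 22),
)
print(req.full_url)
```

To actually send it, pass `req` to `urllib.request.urlopen` while a lakeview server is running; the sketch stops short of that so it can run without one.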

Example

Request:

http://localhost:5000/du?prefix=&delimiter=%2F&date=2020-08-23-00-00&compare=2020-08-22-00-00&json

Response:

{
  "compare": "2020-08-22-00-00",
  "date": "2020-08-23-00-00",
  "delimiter": "/",
  "prefix": "",
  "response": [
    {
      "common_prefix": "users/",
      "diff": 3363690400953,
      "size_left": 231203538669496,
      "size_right": 231203538669496
    },
    {
      "common_prefix": "production/",
      "diff": 2737293183914,
      "size_left": 6238586023266733,
      "size_right": 6238586023266733
    },
    {
      "common_prefix": "staging/",
      "diff": 281953288549,
      "size_left": 367219795944457,
      "size_right": 367219795944457
    },
    ...
  ]
}
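For plugging the JSON report into a monitoring tool, a small post-processing step might look like the sketch below. The field names (`response`, `common_prefix`, `diff`) come from the example response above; the helpers `human_size` and `top_changes` and the sample values are made up for illustration, and the sketch assumes `diff` is a signed byte count.

```python
def human_size(num_bytes):
    """Render a (possibly negative) byte count using binary units."""
    sign = "-" if num_bytes < 0 else ""
    size = float(abs(num_bytes))
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB"):
        # Stop at the first unit that fits, or at the largest unit we know.
        if size < 1024 or unit == "PiB":
            return f"{sign}{size:.1f} {unit}"
        size /= 1024


def top_changes(report, limit=10):
    """Return (common_prefix, readable diff) pairs, largest change first."""
    rows = sorted(report["response"], key=lambda r: abs(r["diff"]), reverse=True)
    return [(r["common_prefix"], human_size(r["diff"])) for r in rows[:limit]]


# Made-up report with the same shape as the example response.
sample_report = {
    "response": [
        {"common_prefix": "staging/", "diff": 1024},
        {"common_prefix": "users/", "diff": -2048},
    ]
}

for common_prefix, change in top_changes(sample_report):
    print(f"{common_prefix:<20} {change}")
```

The explicit sort is redundant if the server already orders results by absolute diff, but it keeps the helper correct for reports that were filtered or merged after download.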

Building and running locally

Clone the repo, and from the root directory run:

$ pip install -r requirements.txt

Then start the server:

$ python server.py \
      --table <athena table name> \
      --output-location <s3 uri>

For a complete reference, run:

$ python server.py --help

License

lakeview is distributed under the Apache 2.0 license. See the included LICENSE file.

More information

lakeview was originally built (with <3) by Treeverse.

We're actively developing lakeFS as an open source tool that delivers resilience and manageability to object-storage based data lakes.