Quickstart

Welcome! This quickstart guide will get you up and running with lakeFS-spec by showing you how to

install the lakefs-spec package,
spin up a local lakeFS server,
create a lakeFS repository for experimentation, and
perform basic file system operations in a lakeFS repository using lakeFS-spec.

Prerequisites

To follow along with this guide, you will need:

A Windows, macOS, or Linux machine (lakeFS-spec supports all three)
Docker, with Docker Compose
Python 3.9 or later
optionally, lakectl, the lakeFS command line tool

Please take a moment to make sure you have these tools available before proceeding with the next steps.
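If you would like to check these prerequisites from Python, a small script along the following lines can help. It is only a sketch: the function name check_prerequisites is made up for this example, and it only looks for executables on your PATH, so it cannot detect Compose installed as a Docker CLI plugin.

```python
import shutil
import sys


def check_prerequisites() -> dict:
    """Report which quickstart prerequisites appear to be available."""
    return {
        "python>=3.9": sys.version_info >= (3, 9),
        "docker": shutil.which("docker") is not None,
        # Only finds the standalone binary, not the `docker compose` plugin.
        "docker-compose": shutil.which("docker-compose") is not None,
        # lakectl is optional for this guide.
        "lakectl (optional)": shutil.which("lakectl") is not None,
    }


for name, available in check_prerequisites().items():
    print(f"{name}: {'found' if available else 'not found'}")
```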

Installing lakeFS-spec

A note on virtual environments

We generally recommend installing the library in a virtual environment to ensure proper isolation, especially when following this quickstart guide.

If you are using Poetry, virtual environments can automatically be created by the tool.

If you prefer the venv functionality built into Python, see the official docs (tl;dr: python -m venv venv; source venv/bin/activate).

To install the package directly from PyPI, run:

With pip:

pip install lakefs-spec

With Poetry:

poetry add lakefs-spec

Or, if you want to try the latest pre-release version directly from GitHub:

With pip:

pip install git+https://github.com/aai-institute/lakefs-spec.git

With Poetry:

poetry add git+https://github.com/aai-institute/lakefs-spec.git

First Steps

Spinning up a local lakeFS instance

Warning

This setup is not recommended for production use, since it does not store data persistently.

Please check out the lakeFS docs for production-ready deployment options.

If you don't already have access to a lakeFS server, you can quickly start a local instance using Docker Compose. Before continuing, please make sure Docker is installed and running on your machine.

The lakeFS quickstart deployment can be launched directly with a configuration file provided in the lakeFS-spec repository:

$ curl https://raw.githubusercontent.com/aai-institute/lakefs-spec/main/hack/docker-compose.yml | docker-compose -f - up

If you do not have curl installed on your machine or would like to examine and/or customize the container configuration, you can also create a docker-compose.yml file locally and use it with docker-compose up:

docker-compose.yml

version: "3"

services:
  lakefs:
    image: treeverse/lakefs:1.7.0
    ports:
      - 8000:8000
    environment:
      LAKEFS_INSTALLATION_USER_NAME: "quickstart"
      LAKEFS_INSTALLATION_ACCESS_KEY_ID: "AKIAIOSFOLQUICKSTART"
      LAKEFS_INSTALLATION_SECRET_ACCESS_KEY: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      LAKEFS_DATABASE_TYPE: "local"
      LAKEFS_AUTH_ENCRYPT_SECRET_KEY: "THIS_MUST_BE_CHANGED_IN_PRODUCTION"
      LAKEFS_BLOCKSTORE_TYPE: "local"

To let lakeFS-spec discover the credentials for this lakeFS instance automatically, create a .lakectl.yaml file in your home directory containing the credentials for the quickstart environment. If you have the lakectl tool installed on your machine, you can also create this file interactively with lakectl config.

~/.lakectl.yaml

credentials:
  access_key_id: AKIAIOSFOLQUICKSTART
  secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
server:
  endpoint_url: http://127.0.0.1:8000

The credentials must match those set in the environment section of the Docker Compose file above.
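If lakectl is not installed, the same file can also be generated from Python. The sketch below uses only the standard library; the helper name render_lakectl_config is hypothetical, and the example writes to a temporary directory so that an existing ~/.lakectl.yaml is not overwritten.

```python
import tempfile
from pathlib import Path


def render_lakectl_config(access_key_id: str, secret_access_key: str, endpoint_url: str) -> str:
    """Return the contents of a minimal .lakectl.yaml file."""
    return (
        "credentials:\n"
        f"  access_key_id: {access_key_id}\n"
        f"  secret_access_key: {secret_access_key}\n"
        "server:\n"
        f"  endpoint_url: {endpoint_url}\n"
    )


config_text = render_lakectl_config(
    access_key_id="AKIAIOSFOLQUICKSTART",
    secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    endpoint_url="http://127.0.0.1:8000",
)

# Written to a temporary directory here; use Path.home() / ".lakectl.yaml"
# as the target once you are happy with the contents.
target = Path(tempfile.mkdtemp()) / ".lakectl.yaml"
target.write_text(config_text)
print(target.read_text())
```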

After the container has finished initializing, you can access the web UI of your local lakeFS deployment in your browser at http://127.0.0.1:8000. Fill out the setup form, where you can optionally share your email address with the developers of lakeFS to receive product updates. Next, log into your fresh lakeFS instance with the credentials listed above.
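If you start the container from a script, it can be handy to wait until the server responds before moving on. The following is a minimal sketch using only the standard library; the function name wait_for_server and its parameters are assumptions for this example, and it treats any HTTP response (including an error status) as a sign that the server is up.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll base_url until it returns any HTTP response or timeout (seconds) elapses."""
    # Bypass any proxy settings from the environment, since the server is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while True:
        try:
            opener.open(base_url, timeout=2.0).close()
            return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status, so it is up.
            return True
        except OSError:
            # Connection refused or timed out: the server is not reachable yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For the quickstart deployment, calling wait_for_server("http://127.0.0.1:8000") should return True once the container accepts connections, and False if nothing answers within the timeout.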

Success

Your fresh local lakeFS instance is a playground for you to explore lakeFS functionality.

In the next step, we will create your first repository on this server.

Create a lakeFS repository

Once you have logged into the web UI of the lakeFS server for the first time, you can create an empty repository on the next page. Click the small "Click here" link at the bottom of the page to proceed, and create a repository named repo. We do not add the sample data in this guide.

Tip: Creating a repository later

If you have inadvertently skipped over the quickstart repository creation page, you can always create a new repository on the Repositories tab in the lakeFS web UI (and optionally choose to add the sample data).

Success

You have successfully created a lakeFS repository named repo, ready to be used with lakeFS-spec.

Using the lakeFS file system

We will now use the lakeFS-spec file system interface to perform some basic operations on the repository created in the previous step:

Upload a local file to the repository
Read data from a file in the repository
Make a commit
Fetch metadata about repository contents
Delete a file from the repository

To get started, create a file called quickstart.py with the following contents:

quickstart.py

from pathlib import Path

from lakefs_spec import LakeFSFileSystem

REPO, BRANCH = "repo", "main"

# Prepare example local data
local_path = Path("demo.txt")
local_path.write_text("Hello, lakeFS!")

Tip

We will keep adding more code to this file as we progress through the next steps. Feel free to execute the script after each step and observe the effects as noted in the guide.

This code snippet prepares a file demo.txt on your machine, ready to be added to the lakeFS repository, so let's do just that:

fs = LakeFSFileSystem()  # will auto-discover credentials from ~/.lakectl.yaml
repo_path = f"{REPO}/{BRANCH}/{local_path.name}"

with fs.transaction(REPO, BRANCH) as tx:
    fs.put(str(local_path), f"{REPO}/{tx.branch.id}/{local_path.name}")
    tx.commit(message="Add demo data")
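The remote paths used above follow a repository/branch/file layout. As a small illustration, the hypothetical helper below (not part of lakeFS-spec) splits such a path into its three components:

```python
from typing import NamedTuple


class LakeFSPath(NamedTuple):
    repository: str
    ref: str
    resource: str


def split_lakefs_path(path: str) -> LakeFSPath:
    """Split 'repository/ref/resource' into its components."""
    parts = path.split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected 'repository/ref/resource', got {path!r}")
    # The resource may itself contain slashes, or be empty for the branch root.
    resource = parts[2] if len(parts) == 3 else ""
    return LakeFSPath(parts[0], parts[1], resource)


print(split_lakefs_path("repo/main/demo.txt"))
# LakeFSPath(repository='repo', ref='main', resource='demo.txt')
```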

If you execute the quickstart.py script at this point, you can already see the committed file in the lakeFS web UI.

While examining the file contents in the browser is nice, we want to access the committed file programmatically. Add the following lines at the end of your script and observe the output:

# Use a context manager so the file handle is closed after reading
with fs.open(repo_path, "rt") as f:
    print(f.readline())  # prints "Hello, lakeFS!"

Note that executing the same code multiple times will only result in a single commit in the repository since the contents of the file on disk and in the repository are identical.
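The idea of skipping work when contents are identical can be illustrated with a checksum comparison. The sketch below is not lakeFS-spec's actual implementation; the function names and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_upload(local_path: Path, remote_digest: Optional[str]) -> bool:
    """Upload only if the remote copy is missing or differs from the local file."""
    return remote_digest is None or file_digest(local_path) != remote_digest


demo = Path(tempfile.mkdtemp()) / "demo.txt"
demo.write_text("Hello, lakeFS!")

remote_digest = file_digest(demo)  # pretend this was recorded at the first upload
print(needs_upload(demo, None))           # True: nothing uploaded yet
print(needs_upload(demo, remote_digest))  # False: contents are unchanged
demo.write_text("Hello again!")
print(needs_upload(demo, remote_digest))  # True: contents changed
```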

In addition to simple read and write operations, the fsspec file system interface also allows us to list the files in a repository folder using ls, and query the metadata of objects in the repository through info (akin to the POSIX stat system call). Let's add the following code to our script and observe the output:

# Compare the sizes of the local file and its copy in the repository
file_info = fs.info(repo_path)
print(
    f"{local_path.name}: local size: {local_path.stat().st_size}, remote size: {file_info['size']}"
)

# Get information about all files in the repo root
print(fs.ls(f"{REPO}/{BRANCH}/"))
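To make the stat analogy concrete, here is a local-only sketch that builds an info-style dictionary from a file's stat result. The helper local_info is hypothetical, and the keys name, size, and type merely follow a common fsspec convention; the exact dictionary returned by lakeFS-spec may contain different or additional entries.

```python
import tempfile
from pathlib import Path


def local_info(path: Path) -> dict:
    """Build an info-style dictionary for a local path from its stat result."""
    stat = path.stat()
    return {
        "name": str(path),
        "size": stat.st_size,
        "type": "directory" if path.is_dir() else "file",
    }


demo = Path(tempfile.mkdtemp()) / "demo.txt"
demo.write_text("Hello, lakeFS!")

print(local_info(demo)["size"])  # 14
print([p.name for p in demo.parent.iterdir()])  # ['demo.txt']
```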

As the last order of business, let's clean up the repository to its original state by removing the file using the rm operation and creating another commit (also, the local file is deleted, since we don't need it anymore):

with fs.transaction(REPO, BRANCH) as tx:
    fs.rm(f"{REPO}/{tx.branch.id}/{local_path.name}")
    tx.commit(message="Delete demo data")

# Remove the local example file as well
local_path.unlink()

Success

You now have all the basic tools available to version data from your Python code using the file system interface provided by lakeFS-spec.

Full example code

quickstart.py

from pathlib import Path

from lakefs_spec import LakeFSFileSystem

REPO, BRANCH = "repo", "main"

# Prepare example local data
local_path = Path("demo.txt")
local_path.write_text("Hello, lakeFS!")

# Upload the local file to the repo and commit
fs = LakeFSFileSystem()  # will auto-discover credentials from ~/.lakectl.yaml
repo_path = f"{REPO}/{BRANCH}/{local_path.name}"

with fs.transaction(REPO, BRANCH) as tx:
    fs.put(str(local_path), f"{REPO}/{tx.branch.id}/{local_path.name}")
    tx.commit(message="Add demo data")

# Read back the file contents
with fs.open(repo_path, "rt") as f:
    print(f.readline())  # prints "Hello, lakeFS!"

# Compare the sizes of the local file and its copy in the repository
file_info = fs.info(repo_path)
print(
    f"{local_path.name}: local size: {local_path.stat().st_size}, remote size: {file_info['size']}"
)

# Get information about all files in the repo root
print(fs.ls(f"{REPO}/{BRANCH}/"))

# Delete uploaded file from the repository (and commit)
with fs.transaction(REPO, BRANCH) as tx:
    fs.rm(f"{REPO}/{tx.branch.id}/{local_path.name}")
    tx.commit(message="Delete demo data")

local_path.unlink()

Next Steps

After this walkthrough of the installation and an introduction to basic file system operations using lakeFS-spec, you might want to consider more advanced topics: