Quickstart¶
Welcome! This quickstart guide will get you up and running with lakeFS-spec by showing you how to
- install the
lakefs-spec
package, - spin up a local lakeFS server,
- create a lakeFS repository for experimentation, and
- perform basic file system operations in a lakeFS repository using lakeFS-spec.
Prerequisites
To follow along with this guide, you will need a few prerequisites ready on your machine:
- lakeFS-spec supports Windows, macOS, or Linux
- Docker, with Docker Compose
- Python 3.9 or later
- optionally,
lakectl
, the lakeFS command line tool
Please take a moment to make sure you have these tools available before proceeding with the next steps.
Installing lakeFS-spec¶
A note on virtual environments
We generally recommend installing the library in a virtual environment to ensure proper isolation, especially when following this quickstart guide.
If you are using Poetry, virtual environments can automatically be created by the tool.
If you prefer the venv
functionality built into Python, see the official docs (tl;dr: python -m venv venv; source venv/bin/activate
).
To install the package directly from PyPI, run:
Or, if you want to try the latest pre-release version directly from GitHub:
First Steps¶
Spinning up a local lakeFS instance¶
Warning
This setup is not recommended for production uses, since it does not store the data persistently.
Please check out the lakeFS docs for production-ready deployment options.
If you don't already have access to a lakeFS server, you can quickly start a local instance using Docker Compose. Before continuing, please make sure Docker is installed and running on your machine.
The lakeFS quickstart deployment can be launched directly with a configuration file provided in the lakeFS-spec repository:
$ curl https://raw.githubusercontent.com/aai-institute/lakefs-spec/main/hack/docker-compose.yml | docker-compose -f - up
If you do not have curl
installed on your machine or would like to examine and/or customize the container configuration, you can also create a docker-compose.yml
file locally and use it with docker-compose up
:
version: "3"
services:
lakefs:
image: treeverse/lakefs:1.7.0
ports:
- 8000:8000
environment:
LAKEFS_INSTALLATION_USER_NAME: "quickstart"
LAKEFS_INSTALLATION_ACCESS_KEY_ID: "AKIAIOSFOLQUICKSTART"
LAKEFS_INSTALLATION_SECRET_ACCESS_KEY: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
LAKEFS_DATABASE_TYPE: "local"
LAKEFS_AUTH_ENCRYPT_SECRET_KEY: "THIS_MUST_BE_CHANGED_IN_PRODUCTION"
LAKEFS_BLOCKSTORE_TYPE: "local"
In order to allow lakeFS-spec to automatically discover credentials to access this lakeFS instance, create a .lakectl.yaml
in your home directory containing the credentials for the quickstart environment (you can also use lakectl config
to create this file interactively if you have the lakectl
tool installed on your machine):
credentials: # (1)!
access_key_id: AKIAIOSFOLQUICKSTART
secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
server:
endpoint_url: http://127.0.0.1:8000
- These must match the credentials set in the
environment
section of the Docker Compose file above
After the container has finished initializing, you can access the web UI of your local lakeFS deployment in your browser. Fill out the setup form, where you can optionally share your email address with the developers of lakeFS to receive updates on their product. Next, you can log into your fresh lakeFS instance with the credentials listed above.
Success
Your fresh local lakeFS instance is a playground for you to explore lakeFS functionality.
In the next step, we will create your first repository on this server.
Create a lakeFS repository¶
Once you have logged into the web UI of the lakeFS server for the first time, you can create an empty repository on the next page.
Click the small Click here link at the bottom of the page to proceed and create a repository named repo
(we don't want to add the sample data for this guide):
Tip: Creating a repository later
If you have inadvertently skipped over the quickstart repository creation page, you can always create a new repository on the Repositories tab in the lakeFS web UI (and optionally choose to add the sample data):
Success
You have successfully created a lakeFS repository named repo
, ready to be used with lakeFS-spec.
Using the lakeFS file system¶
We will now use the lakeFS-spec file system interface to perform some basic operations on the repository created in the previous step:
- Upload a local file to the repository
- Read data from a file in the repository
- Make a commit
- Fetch metadata about repository contents
- Delete a file from the repository
To get started, create a file called quickstart.py
with the following contents:
from pathlib import Path
from lakefs_spec import LakeFSFileSystem
REPO, BRANCH = "repo", "main"
# Prepare example local data
local_path = Path("demo.txt")
local_path.write_text("Hello, lakeFS!")
Tip
We will keep adding more code to this file as we progress through the next steps. Feel free to execute the script after each step and observe the effects as noted in the guide.
This code snippet prepares a file demo.txt
on your machine, ready to be added to the lakeFS repository, so let's do just that:
fs = LakeFSFileSystem() # will auto-discover credentials from ~/.lakectl.yaml
repo_path = f"{REPO}/{BRANCH}/{local_path.name}"
with fs.transaction(REPO, BRANCH) as tx:
fs.put(str(local_path), f"{REPO}/{tx.branch.id}/{local_path.name}")
tx.commit(message="Add demo data")
If you execute the quickstart.py
script at this point, you can already see the committed file in the lakeFS web UI:
While examining the file contents in the browser is nice, we want to access the committed file programmatically. Add the following lines at the end of your script and observe the output:
Note that executing the same code multiple times will only result in a single commit in the repository since the contents of the file on disk and in the repository are identical.
In addition to simple read and write operations, the fsspec file system interface also allows us to list the files in a repository folder using ls
, and query the metadata of objects in the repository through info
(akin to the POSIX stat
system call).
Let's add the following code to our script and observe the output:
# Compare the sizes of local file and repo
file_info = fs.info(repo_path)
print(
f"{local_path.name}: local size: {file_info['size']}, remote size: {local_path.stat().st_size}"
)
# Get information about all files in the repo root
print(fs.ls(f"{REPO}/{BRANCH}/"))
As the last order of business, let's clean up the repository to its original state by removing the file using the rm
operation and creating another commit (also, the local file is deleted, since we don't need it anymore):
with fs.transaction(REPO, BRANCH) as tx:
fs.rm(f"{REPO}/{tx.branch.id}/{local_path.name}")
tx.commit(message="Delete demo data")
Success
You now have all the basic tools available to version data from your Python code using the file system interface provided by lakeFS-spec.
Full example code
from pathlib import Path
from lakefs_spec import LakeFSFileSystem
REPO, BRANCH = "repo", "main"
# Prepare example local data
local_path = Path("demo.txt")
local_path.write_text("Hello, lakeFS!")
# Upload the local file to the repo and commit
fs = LakeFSFileSystem() # will auto-discover credentials from ~/.lakectl.yaml
repo_path = f"{REPO}/{BRANCH}/{local_path.name}"
with fs.transaction(REPO, BRANCH) as tx:
fs.put(str(local_path), f"{REPO}/{tx.branch.id}/{local_path.name}")
tx.commit(message="Add demo data")
# Read back the file contents
f = fs.open(repo_path, "rt")
print(f.readline()) # prints "Hello, lakeFS!"
# Compare the sizes of local file and repo
file_info = fs.info(repo_path)
print(
f"{local_path.name}: local size: {file_info['size']}, remote size: {local_path.stat().st_size}"
)
# Get information about all files in the repo root
print(fs.ls(f"{REPO}/{BRANCH}/"))
# Delete uploaded file from the repository (and commit)
with fs.transaction(REPO, BRANCH) as tx:
fs.rm(f"{REPO}/{tx.branch.id}/{local_path.name}")
tx.commit(message="Delete demo data")
local_path.unlink()
Next Steps¶
After this walkthrough of the installation and an introduction to basic file system operations using lakeFS-spec, you might want to consider more advanced topics: