How to use the lakeFS file system¶
This guide contains instructions and code snippets on how to use the lakeFS file system.
The lakeFS URI structure¶
In the following subsections, we frequently make use of lakeFS URIs in the example code.
lakeFS URIs identify resources in a lakeFS deployment through a unique path consisting of repository name, lakeFS revision/ref name, and file name relative to the repository root. Optionally, they may be prefixed with the lakefs:// URI scheme (this is required when using third-party libraries).
As an example, a URI like repo/main/file.txt addresses the file.txt file on the main branch in the repository named repo.
In some lakeFS file system operations, directories are also allowed as resource names.
For example, the URI repo/main/data/ (note the optional trailing slash) refers to the data directory on the main branch in the repo repository.
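The URI layout described above can be illustrated with a small parser. This is a minimal sketch using only the Python standard library; the helper name split_lakefs_uri is hypothetical and not part of lakeFS-spec, which handles URI parsing internally.

```python
def split_lakefs_uri(uri: str) -> tuple[str, str, str]:
    """Split a lakeFS URI into (repository, ref, path).

    Hypothetical helper for illustration only, not part of lakeFS-spec.
    """
    # The lakefs:// scheme prefix is optional, so drop it if present.
    prefix = "lakefs://"
    if uri.startswith(prefix):
        uri = uri[len(prefix):]

    # The first two components are repository and ref; the rest is the
    # resource path relative to the repository root (may be empty).
    parts = uri.split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected <repo>/<ref>/<path>, got {uri!r}")
    repo, ref = parts[0], parts[1]
    path = parts[2] if len(parts) == 3 else ""
    return repo, ref, path
```

With this sketch, repo/main/file.txt and lakefs://repo/main/file.txt both resolve to the same three components, and a trailing slash on a directory URI is kept as part of the path.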
On staged versus committed changes¶
When uploading, copying, or removing files or directories on a branch, these operations result in staged changes in the repository until a commit is created. lakeFS-spec does not create these commits automatically, since it strictly separates file operations from versioning operations. If you want to conduct versioning operations, like creating commits, between file transfers, the best way to do so is by using filesystem transactions.
How to use lakeFS file system APIs¶
The following section explains in more depth how to use the LakeFSFileSystem APIs.
This section concerns the explicitly implemented operations. In addition, there are a number of file system APIs inherited from the AbstractFileSystem interface in fsspec.
More information on file system usage can be found in the fsspec documentation.
Uploading and downloading files¶
Arguably the most important feature of the file system is file transfer.
File uploads¶
To upload a file, you can use the fs.put() and fs.put_file() methods.
While fs.put_file() operates on single files only, the fs.put() API can be used for directory uploads.
from lakefs_spec import LakeFSFileSystem
fs = LakeFSFileSystem()
# local source path, then remote target path.
fs.put_file("file.txt", "my-repo/my-ref/file.txt")
If you want to upload an entire directory to lakeFS, you can use the fs.put() API together with the recursive=True switch:
# structure:
# dir/
# ├── a.txt
# ├── b.yaml
# ├── c.csv
# └── ...
fs.put("dir", "my-repo/my-ref/dir", recursive=True)
Info
The above method of file uploading results in two transfers: once from the client to the lakeFS server, and once from the lakeFS server to the object storage. This can impact performance if the uploaded files are very large. To avoid this performance issue, you can also decide to write the file directly to the underlying object storage.
Direct lakeFS blockstore uploads require the installation of the corresponding fsspec file system implementation through pip.
For an S3-based lakeFS deployment, install the s3fs package. For Google Cloud Storage (GCS), install the gcsfs package.
For Azure blob storage, install the adlfs package.
File downloads¶
To download a file, you can use the fs.get() or fs.get_file() methods.
While fs.get_file() downloads single files only, the fs.get() API can be used for recursive directory downloads.
from lakefs_spec import LakeFSFileSystem
fs = LakeFSFileSystem()
# remote path, then local target path.
fs.get_file("my-repo/my-ref/file.txt", "file.txt")
In the case of a directory in lakeFS, use the fs.get() API together with the recursive=True switch:
# structure:
# dir/
# ├── a.txt
# ├── b.yaml
# ├── c.csv
# └── ...
# downloads the entire `dir` directory (and subdirectories) into the current directory.
fs.get("my-repo/my-ref/dir", "dir", recursive=True)
Checking the existence of lakeFS objects¶
To check the existence of a file in a given revision of a repository, you can use the fs.exists() API:
from lakefs_spec import LakeFSFileSystem
fs = LakeFSFileSystem()
my_file_exists = fs.exists("my-repo/my-ref/my-file.txt")
This function returns True if the file exists on that revision, and False if it does not. Errors (e.g. permission errors) will be raised, since in that case, object existence cannot be decided.
Warning
fs.exists() only works on file objects, and will return False if called on directories.
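Because fs.exists() returns False for directories, checking for a directory needs a different call. The following is a minimal sketch that falls back to fs.ls(). The helper name path_exists is hypothetical, and the sketch assumes that listing a missing directory either raises FileNotFoundError (the usual fsspec convention) or returns an empty list.

```python
def path_exists(fs, path: str) -> bool:
    """Return True if `path` is a file or a directory with at least one entry.

    Hypothetical helper; `fs` is any fsspec-style file system object,
    such as a LakeFSFileSystem instance.
    """
    # Files are covered by the regular existence check.
    if fs.exists(path):
        return True
    # Assumption: a missing directory raises FileNotFoundError or lists as empty.
    try:
        return len(fs.ls(path, detail=False)) > 0
    except FileNotFoundError:
        return False
```

Note that errors other than FileNotFoundError (for example permission errors) are deliberately not caught here, in line with the behavior of fs.exists() described above.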
Obtaining info on stored objects¶
To query the metadata of a single object in a lakeFS repository, use the fs.info() API:
from lakefs_spec import LakeFSFileSystem
fs = LakeFSFileSystem()
my_file_info = fs.info("my-repo/my-ref/my-file.txt")
The resulting my_file_info object is a dictionary containing useful information such as the storage location of the file, creation timestamp, and size (in bytes).
You can also call fs.info() on directories. In this case, the resulting dictionary only contains the directory name and the cumulative size of the files it contains.
Listing directories in lakeFS¶
To list the files in a directory in lakeFS, use the fs.ls() method:
from lakefs_spec import LakeFSFileSystem
fs = LakeFSFileSystem()
my_dir_listing = fs.ls("my-repo/my-ref/my-dir/")
This returns a list of Python dictionaries containing information on the objects contained in the requested directory.
The returned objects have the same fields set as those returned by a normal fs.info() call on a file object.
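Since each entry is a plain dictionary, a listing can be post-processed with ordinary Python. The sketch below sums the sizes of the file entries. It assumes the fsspec convention that each entry carries "name", "type", and "size" keys, and the sample entries are made up for illustration rather than taken from a real repository.

```python
def total_file_size(listing: list[dict]) -> int:
    """Sum the sizes (in bytes) of all file entries in an fs.ls() result."""
    return sum(
        entry.get("size") or 0
        for entry in listing
        # Assumption: entries follow the fsspec convention of a "type" key
        # that is "file" for objects and "directory" for prefixes.
        if entry.get("type") == "file"
    )


# Made-up sample data shaped like the output of fs.ls(..., detail=True).
sample_listing = [
    {"name": "my-repo/my-ref/my-dir/a.txt", "type": "file", "size": 120},
    {"name": "my-repo/my-ref/my-dir/b.yaml", "type": "file", "size": 30},
    {"name": "my-repo/my-ref/my-dir/sub/", "type": "directory", "size": 0},
]
print(total_file_size(sample_listing))
```

In real use, the result of fs.ls("my-repo/my-ref/my-dir/") would take the place of sample_listing.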
Deleting objects from a lakeFS branch¶
To delete objects from a lakeFS branch, use the fs.rm_file() or fs.rm() APIs. As before, while the former works only for single files, the latter can be used to remove entire directories with the recursive=True option.
from lakefs_spec import LakeFSFileSystem
fs = LakeFSFileSystem()
fs.rm_file("my-repo/my-branch/my-file.txt")
# removes the entire `my-dir` directory.
fs.rm("my-repo/my-branch/my-dir/", recursive=True)
Copying files in a repository¶
To copy files on a branch or from one branch to another, use the fs.cp_file() or fs.copy() methods:
from lakefs_spec import LakeFSFileSystem
fs = LakeFSFileSystem()
# copies a single file on the same branch to a new location.
fs.cp_file("my-repo/branch-a/file.txt", "my-repo/branch-a/file.txt.bak")
# copies a single file from branch A to branch B.
fs.cp_file("my-repo/branch-a/file.txt", "my-repo/branch-b/file.txt")
# copies the entire `my-dir` directory from branch A to branch B (which must exist).
fs.copy("my-repo/branch-a/my-dir/", "my-repo/branch-b/my-dir/", recursive=True)
Info
Files and directories can only be copied between branches in the same repository, not between different repositories.
Trying to copy to a non-existent branch will not create the branch.
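The same-repository restriction can be checked up front before starting a copy. This is a minimal sketch with a hypothetical helper name, same_repository. It only compares the repository component of the two URIs and does not contact the lakeFS server, so it cannot tell whether the target branch exists.

```python
def same_repository(src: str, dst: str) -> bool:
    """Return True if both lakeFS URIs point into the same repository.

    Hypothetical helper for illustration only, not part of lakeFS-spec.
    """

    def repo_of(uri: str) -> str:
        # Drop the optional lakefs:// scheme, then take the first component.
        return uri.removeprefix("lakefs://").split("/", 1)[0]

    return repo_of(src) == repo_of(dst)
```

A caller could run this check on the source and target URIs and skip the fs.copy() call when it returns False.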