Using transactions on the lakeFS file system

In addition to file operations, you can carry out versioning operations in your Python code using file system transactions.

A transaction is a context manager that collects all file uploads, defers them, and executes them when the transaction completes. Transactions are an "all or nothing" proposition: if an error occurs during the transaction, none of the queued files are uploaded. For more information on fsspec transactions, see the official documentation.
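The deferral mechanism described above can be pictured with a toy model. The following is a minimal sketch using only the standard library, not the lakeFS-spec implementation: the `DeferredUploads` class and its `defer` method are hypothetical names chosen for illustration, and "uploading" is simulated by moving names from one container to another.

```python
from collections import deque


class DeferredUploads:
    """Hypothetical toy model of a transaction, not part of lakeFS-spec.

    Actions are queued while the context is open and only carried out
    if the block exits without an exception.
    """

    def __init__(self):
        self.pending = deque()  # queued, not yet executed
        self.executed = []      # actions that were actually carried out

    def defer(self, name):
        # Only record the action; nothing happens yet.
        self.pending.append(name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Clean exit: run everything in the order it was queued.
            while self.pending:
                self.executed.append(self.pending.popleft())
        else:
            # Error: drop the whole queue ("all or nothing").
            self.pending.clear()
        return False  # never swallow the exception


ok = DeferredUploads()
with ok:
    ok.defer("train-data.txt")
    ok.defer("test-data.txt")

failed = DeferredUploads()
try:
    with failed:
        failed.defer("my-file.txt")
        raise ValueError("oops!")
except ValueError:
    pass
```

After running this, `ok.executed` holds both file names in queue order, while `failed.executed` stays empty because the exception caused the queue to be discarded.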

The main features of the lakeFS file system transaction are:

Atomicity

If an exception occurs anywhere during the transaction, all queued file uploads and versioning operations are discarded:

from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()

with fs.transaction as tx:
    fs.put_file("my-file.txt", "repo/main/my-file.txt")
    tx.commit("repo", "main", message="Add my-file.txt")
    raise ValueError("oops!")

The above code will not produce a commit on main, since the ValueError prompts a discard of the full upload queue.

Versioning helpers

The lakeFS file system's transaction is the intended place for conducting versioning operations between file transfers. The following example uploads two files, creates a commit after each upload, and applies a tag at the end.

from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem()

with fs.transaction as tx:
    fs.put_file("train-data.txt", "repo/main/train-data.txt")
    tx.commit("repo", "main", message="Add training data")
    fs.put_file("test-data.txt", "repo/main/test-data.txt")
    sha = tx.commit("repo", "main", message="Add test data")
    tx.tag("repo", sha, tag="My train-test split")

The full list of supported lakeFS versioning operations:

  • commit, for creating commits on a branch, optionally with attached metadata.
  • create_branch, for creating a new branch.
  • merge, for merging a given branch into another branch.
  • revert, for reverting a previous commit on a branch.
  • rev_parse, for parsing revisions like branch/tag names and SHA fragments into full commit SHAs.
  • tag, for creating a tag pointing to a commit.

Warning

All of the operations above are deferred, and their results are not available until completion of the transaction. For example, the sha return value of tx.commit will be a placeholder for the actual commit SHA computed by the lakeFS server on commit creation.

While you can directly use some values (branch/tag names) returned by transaction versioning helpers, care needs to be taken with computed objects like commit SHAs:

with fs.transaction as tx:
    fs.put_file("my-file.txt", "repo/branch/my-file.txt")
    sha = tx.commit("repo", "branch", message="Add my-file.txt")

# This will not work: `sha` is of type `Placeholder[Commit]`
fs.get_file(f"repo/{sha}/my-file.txt", "my-new-file.txt")

See the following section on how to reuse commits created during transactions.

Reusing resources created in transactions

Some transaction versioning helpers create objects on the lakeFS instance whose values are not known until the helpers are actually executed. An example is a commit SHA, which only becomes available once the lakeFS server has created the commit. In the example above, a commit is created directly after a file upload, but its SHA will not be available until the transaction is complete. Once the transaction has completed, you can use the computed value (a Placeholder object) in your code like any other lakeFS server result:

with fs.transaction as tx:
    fs.put_file("my-file.txt", "repo/branch/my-file.txt")
    sha = tx.commit("repo", "branch", message="Add my-file.txt")

# After the transaction completes, use the SHA value as normal.
fs.get_file(f"repo/{sha.id}/my-file.txt", "my-new-file.txt")
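To make the placeholder behavior more concrete, here is a rough standard-library sketch of the general pattern: a result object is handed out immediately but only filled in when the transaction completes. `LazyResult` and `ToyTransaction` are hypothetical names for this illustration and are not the library's `Placeholder` class or transaction; the "commit SHA" is faked with a local hash instead of a server call.

```python
import hashlib
from collections import deque


class LazyResult:
    """Hypothetical stand-in for a deferred result (not lakeFS-spec's Placeholder)."""

    def __init__(self):
        self._value = None
        self._ready = False

    def resolve(self, value):
        self._value = value
        self._ready = True

    @property
    def value(self):
        if not self._ready:
            raise RuntimeError("result is not available until the transaction completes")
        return self._value


class ToyTransaction:
    """Hypothetical transaction that defers commits and resolves results on exit."""

    def __init__(self):
        self._pending = deque()

    def commit(self, message):
        # Hand out the result object right away; it is still empty.
        result = LazyResult()
        self._pending.append((message, result))
        return result

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            while self._pending:
                message, result = self._pending.popleft()
                # Assumption: a local hash stands in for the server-computed SHA.
                result.resolve(hashlib.sha256(message.encode()).hexdigest())
        else:
            self._pending.clear()
        return False


with ToyTransaction() as tx:
    sha = tx.commit("Add my-file.txt")
    try:
        sha.value  # too early: the commit has not been executed yet
        early_access_failed = False
    except RuntimeError:
        early_access_failed = True

# After the block exits, the deferred result has been filled in.
resolved = sha.value
```

In this sketch, reading `sha.value` inside the `with` block raises, while the same attribute is usable once the block has exited, which mirrors why computed values should only be read after the transaction completes.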

Thread safety

Because the transaction stores queued uploads in a collections.deque, upload queueing and file transfers are thread-safe.
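The property relied on here is that appends to a `collections.deque` are thread-safe in CPython. The sketch below illustrates only that property with plain threads and does not involve lakeFS-spec; `fill_queue` and its parameters are hypothetical names for this example.

```python
import threading
from collections import deque


def fill_queue(workers, items_per_worker):
    """Append items to one shared deque from several threads (illustrative helper)."""
    queue = deque()

    def enqueue(worker_id):
        for i in range(items_per_worker):
            # deque.append is thread-safe, so no explicit lock is needed here.
            queue.append((worker_id, i))

    threads = [threading.Thread(target=enqueue, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return queue


shared = fill_queue(workers=8, items_per_worker=1000)
total = len(shared)
```

No appended item is lost, so `total` equals the number of workers times the items per worker. The relative order of items from different threads is not deterministic, but each thread's own items stay in the order it appended them.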