pCacheFS – persistently caches other filesystems

Imagine you have a remote filesystem mounted locally that houses lots of data you access regularly. Perhaps it is your music library on a remote NFS server, or mounted via SSHFS. Imagine that this filesystem is slow to access – perhaps it is over a wireless network.

If, like me, you have music playing in the background much of the time when you are using your workstation, having that music on a slow remote filesystem can be irritating. If I change track, there is a delay of a few seconds while my music player opens and reads the metadata for the new file I’ve selected. If I want to perform a search of my music library, it takes a long time unless my music player has some sort of database of all the metadata (which admittedly many modern players do). And if you have ever tried seeking within an MP3 on a remote filesystem, it’s painful.

With my SSHFS-mounted music library, I can easily be kept waiting 5 seconds or longer for some of these operations to complete. (Before you post a comment saying there’s something wrong with my LAN: I access my music over a VPN, over a WAN at ADSL speeds.)

For this reason I wrote pCacheFS (http://code.google.com/p/pcachefs/).

pCacheFS is a FUSE filesystem that presents a mirror of other filesystems, with transparent caching. Not only that, but the cache is persistent: it is stored locally, which means that when I reboot, my cache remains intact.

As the cache can reside on any filesystem I choose (e.g. a local hard disk), it can grow much bigger than caches that live only in RAM. Given enough time, pCacheFS will probably end up building a complete local replica of my remote filesystem.

How it works

Using pCacheFS is very simple. Go to the website (http://code.google.com/p/pcachefs/), download pCacheFS 0.1 and extract it somewhere. Then run this:

pcachefs.py -c /cache -t /remote /remote-cached

The -c parameter specifies the location of your cache directory – all data that pCacheFS caches will end up stored here. It’ll be created if needed.

The -t parameter specifies your ‘target’ directory – this is the slow filesystem whose data you want to cache.

The last parameter is where the cached filesystem will be mounted.
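
For example, to put a cache in front of an SSHFS-mounted music library (the host and paths here are placeholders of my choosing, not something pCacheFS prescribes):

sshfs user@musicserver:/srv/music /remote
pcachefs.py -c /cache -t /remote /remote-cached

When you are finished, unmount both FUSE filesystems as usual:

fusermount -u /remote-cached
fusermount -u /remote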

If you look in your new mountpoint you’ll see a mirror of the target filesystem:

$ ls /remote
. .. file1 file2 file3
$ ls /remote-cached
. .. file1 file2 file3

You can read files in /remote-cached just as in /remote, with the important difference that everything you access in /remote-cached is transparently cached by pCacheFS. When you access that data again, it will come from the cache (i.e. much faster than your remote filesystem), saving you time and network traffic.
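
A quick (and unscientific) way to see the effect is to time the same read twice; the second read should be served from the local cache:

$ time cat /remote-cached/file1 > /dev/null   # first read: fetched from the remote
$ time cat /remote-cached/file1 > /dev/null   # second read: served from /cache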

pCacheFS will only cache data you access in /remote-cached, right down to the byte level. If you tail a file in /remote-cached, only the parts of the file that tail reads will be cached, not the entire file. This means that your filesystem performance for uncached files is the same as it would be if you were accessing the remote filesystem directly.
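
For instance, reading just the last few kilobytes of a large file through the cached mount should grow the cache by roughly that amount, not by the size of the whole file (the du figures will vary, since the cache’s on-disk layout is an internal detail):

$ tail -c 4096 /remote-cached/file1 > /dev/null
$ du -sh /cache    # grows by about the bytes actually read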

One notable limitation of pCacheFS at this point is that its filesystems are read-only. Any write operations will fail – the only way to change the data is to change it directly in the underlying target filesystem. This may change in the future.
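
So, for example, trying to create a file through the cached mount fails (the exact error depends on your FUSE setup; ‘Function not implemented’ is typical for unimplemented operations):

$ touch /remote-cached/file4
touch: cannot touch '/remote-cached/file4': Function not implemented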

Conclusion

If you think you may have a use for this I thoroughly recommend giving pCacheFS a try. As it can be used to cache any filesystem and the cache itself can reside on any filesystem, you may even find more novel uses for it.

6 thoughts on “pCacheFS – persistently caches other filesystems”

  1. Hello Jonny,
    I am trying to use your pCacheFS, but sadly it always returns ‘Invalid argument’.

    Running it in the foreground shows that the script is failing with the following backtrace:
    UFS GETATTR C_F_S
    create (posix.stat_result(st_mode=16877, st_ino=27, st_dev=35L, st_nlink=1, st_uid=0, st_gid=0, st_size=4096, st_atime=1368460867, st_mtime=1368298239, st_ctime=1368298239),)
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/fuse.py", line 362, in __call__
        return apply(self.func, args, kw)
      File "pcachefs/pcachefs.py", line 119, in getattr
        return self.cacher.getattr(path)
      File "pcachefs/pcachefs.py", line 415, in getattr
        result = self.underlying_fs.getattr(path)
      File "pcachefs/pcachefs.py", line 221, in getattr
        return factory.create(FuseStat, os.stat(self._get_real_path(path)))
      File "pcachefs/factory.py", line 4, in create
        return t(args)
      File "pcachefs/pcachefs.py", line 53, in __init__
        self.st_mode = stat.st_mode
    AttributeError: 'tuple' object has no attribute 'st_mode'

  2. I finally discovered the issue:
    factory.create(FuseStat, os.stat(...)) isn’t calling FuseStat.__init__ with the stat result as the stat parameter. Instead, it is passing a tuple which contains it. The solution is to add this at the top of FuseStat.__init__:
    if isinstance(stat, tuple):
        stat = stat[0]

    It then fails with NameError: global name '__builtin__' is not defined on the line "with __builtin__.open(cache_dir, 'wb') as stat_cache_file:", solved simply by adding an "import __builtin__".

    In summary:
    diff -r 91f01152cee2 pcachefs/pcachefs.py
    --- a/pcachefs/pcachefs.py Thu Apr 11 18:07:57 2013 +0100
    +++ b/pcachefs/pcachefs.py Tue May 14 15:18:54 2013 +0200
    @@ -27,6 +27,7 @@
     import pickle
     import types
     import factory
    +import __builtin__

     from datetime import datetime
     from ranges import (Ranges, Range)
    @@ -50,6 +51,9 @@
         def __init__(self, stat):
             fuse.Stat.__init__(self)

    +        if isinstance(stat, tuple):
    +            stat = stat[0]
    +
             self.st_mode = stat.st_mode
             self.st_nlink = stat.st_nlink
             self.st_size = stat.st_size

    I also added basic symlink support; however, there’s still something not completely right, as I get SIGBUS when running programs from it.

    • Hi Ángel,

      Thanks for commenting – and great to hear you solved the problem yourself.

      Thanks for the patches too – I’ve hit the builtin problems before, particularly when trying to unit test pCacheFS. A royal pain with Python 2.x.

      J

    • Hi Angel,

      Did you ever fix the SIGBUS error? I’m testing this and am getting it as well. I’ve strace-ed the processes that seem to hit the SIGBUS, but it’s unclear which operation is causing it (most things run fine).

      -J

      • Hello Jerome

        Sorry for the late reply. Seems your comment was in moderation and didn’t get through until a couple of days ago.

        No, I didn’t fix the mysterious SIGBUS. The fact that it only happens when running programs makes it hard to track down. Perhaps a problem with mmapping, or even some kernel assumption… :/

        Let us know if you figure out the fix or even a reproducer.

        Cheers

  3. Hi Jonny,

    thanks a lot for your nice software.

    I’m using it to overcome two big limitations of the HP ltfs driver (yes I’m using tapes):
    1) it lacks a caching mechanism and therefore you read again from tape every time you access a file
    2) it doesn’t handle concurrent reads well. To overcome this I had to modify the pcachefs.py code a bit, redefining the requested range when opening the file:
    with open(cache_data, 'wb') as f:
        debug(' creating blank file, size', str(file_stat.st_size))
        f.seek(file_stat.st_size - 1)
        f.write('\0')  # write a single byte so the blank file takes its full size
    requested_range = Range(0, file_stat.st_size)
    blocks_to_read = cached_blocks.get_uncovered_portions(requested_range)

    In this way the file is read in one go, and concurrent reads are served from the copy on disk. Would it be possible to have this as an option?

    For completeness, you could add eviction of files not accessed recently when disk space runs out.

    Thanks again,
    Pietro
