pCacheFS – persistently caches other filesystems

Imagine you have a remote filesystem mounted locally that houses lots of data you access regularly. Perhaps it is your music library on a remote NFS server, or mounted via SSHFS. Imagine that this filesystem is slow to access – perhaps it is over a wireless network.

If, like me, you have music playing in the background much of the time when you are using your workstation, having that music on a slow remote filesystem can be irritating. If I change track, there is a delay of a few seconds while my music player opens and reads the metadata for the new file I’ve selected. If I want to perform a search of my music library, it takes a long time unless my music player has some sort of database of all the metadata (which admittedly many modern players do). And if you have ever tried seeking within an MP3 on a remote filesystem, it’s painful.

With my SSHFS-mounted music library, I can easily be kept waiting 5 seconds or longer for some of these operations to complete. (Before you post a comment saying there’s something wrong with my LAN: I access my music over a VPN, over a WAN at ADSL speeds.)

For this reason I wrote pCacheFS (http://code.google.com/p/pcachefs/).

pCacheFS is a FUSE filesystem that presents a mirror of other filesystems, with transparent caching. Not only that, but the cache is persistent: it is stored locally, which means that when I reboot, my cache remains intact.

As the cache can reside on any filesystem I choose (e.g. a local hard disk), it can grow much bigger than caches that live only in RAM. Given enough time, pCacheFS will probably end up building a complete local replica of my remote filesystem.

How it works

Using pCacheFS is very simple. Go to the website (http://code.google.com/p/pcachefs/), download pCacheFS 0.1 and extract it somewhere. Then run this:

pcachefs.py -c /cache -t /remote /remote-cached

The -c parameter specifies the location of your cache directory – all data that pCacheFS caches will end up stored here. It’ll be created if needed.

The -t parameter specifies your ‘target’ directory – this is the slow filesystem whose data you want to cache.

The last parameter is where the cached filesystem will be mounted.
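
For example, to put a cache in front of an SSHFS-mounted music library (the host and paths here are placeholders of my choosing, not something pCacheFS prescribes):

sshfs user@musicserver:/srv/music /remote
pcachefs.py -c /cache -t /remote /remote-cached

When you are finished, unmount both FUSE filesystems as usual:

fusermount -u /remote-cached
fusermount -u /remote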

If you look in your new mountpoint you’ll see a mirror of the target filesystem:

$ ls /remote
. .. file1 file2 file3
$ ls /remote-cached
. .. file1 file2 file3

You can read files in /remote-cached just as in /remote, with the important difference that everything you access in /remote-cached is transparently cached by pCacheFS. When you access that data again, it will come from the cache (i.e. much faster than your remote filesystem), saving you time and network traffic.
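
A quick (and unscientific) way to see the effect is to time the same read twice; the second read should be served from the local cache:

$ time cat /remote-cached/file1 > /dev/null   # first read: fetched from the remote
$ time cat /remote-cached/file1 > /dev/null   # second read: served from /cache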

pCacheFS will only cache data you access in /remote-cached, right down to the byte level. If you tail a file in /remote-cached, only the parts of the file that tail reads will be cached, not the entire file. This means that your filesystem performance for uncached files is the same as it would be if you were accessing the remote filesystem directly.
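
For instance, reading just the last few kilobytes of a large file through the cached mount should grow the cache by roughly that amount, not by the size of the whole file (the du figures will vary, since the cache’s on-disk layout is an internal detail):

$ tail -c 4096 /remote-cached/file1 > /dev/null
$ du -sh /cache    # grows by about the bytes actually read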

One notable limitation of pCacheFS at this point is that its filesystems are read-only. Any write operations will fail – the only way to change the data is to change it directly in the underlying target filesystem. This may change in the future.
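
So, for example, trying to create a file through the cached mount fails (the exact error depends on your FUSE setup; ‘Function not implemented’ is typical for unimplemented operations):

$ touch /remote-cached/file4
touch: cannot touch '/remote-cached/file4': Function not implemented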

Conclusion

If you think you may have a use for this I thoroughly recommend giving pCacheFS a try. As it can be used to cache any filesystem and the cache itself can reside on any filesystem, you may even find more novel uses for it.

6 thoughts on “pCacheFS – persistently caches other filesystems”

  1. Hello Jonny,
    I am trying to use your pCacheFS, but sadly it always returns ‘Invalid argument’.

    Running it in the foreground shows that the script is failing with the following backtrace:
    UFS GETATTR C_F_S
    create (posix.stat_result(st_mode=16877, st_ino=27, st_dev=35L, st_nlink=1, st_uid=0, st_gid=0, st_size=4096, st_atime=1368460867, st_mtime=1368298239, st_ctime=1368298239),)
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/fuse.py", line 362, in __call__
        return apply(self.func, args, kw)
      File "pcachefs/pcachefs.py", line 119, in getattr
        return self.cacher.getattr(path)
      File "pcachefs/pcachefs.py", line 415, in getattr
        result = self.underlying_fs.getattr(path)
      File "pcachefs/pcachefs.py", line 221, in getattr
        return factory.create(FuseStat, os.stat(self._get_real_path(path)))
      File "pcachefs/factory.py", line 4, in create
        return t(args)
      File "pcachefs/pcachefs.py", line 53, in __init__
        self.st_mode = stat.st_mode
    AttributeError: 'tuple' object has no attribute 'st_mode'

  2. I finally discovered the issue:
    factory.create(FuseStat, os.stat(...)) isn’t calling FuseStat.__init__ with the stat result as the stat parameter. Instead, it is passing a tuple which contains it. The solution is to add this at the top of FuseStat.__init__:
    if isinstance(stat, tuple):
        stat = stat[0]

    It then fails with NameError: global name '__builtin__' is not defined on the line "with __builtin__.open(cache_dir, 'wb') as stat_cache_file:", solved simply by adding an "import __builtin__".

    In summary:
    diff -r 91f01152cee2 pcachefs/pcachefs.py
    --- a/pcachefs/pcachefs.py Thu Apr 11 18:07:57 2013 +0100
    +++ b/pcachefs/pcachefs.py Tue May 14 15:18:54 2013 +0200
    @@ -27,6 +27,7 @@
     import pickle
     import types
     import factory
    +import __builtin__

     from datetime import datetime
     from ranges import (Ranges, Range)
    @@ -50,6 +51,9 @@
         def __init__(self, stat):
             fuse.Stat.__init__(self)

    +        if isinstance(stat, tuple):
    +            stat = stat[0]
    +
             self.st_mode = stat.st_mode
             self.st_nlink = stat.st_nlink
             self.st_size = stat.st_size

    I also added basic symlink support; however, there’s still something not completely right, as I get SIGBUS when running programs from it.

    • Hi Ángel,

      Thanks for commenting – and great to hear you solved the problem yourself.

      Thanks for the patches too – I’ve hit the builtin problems before, particularly when trying to unit test pCacheFS. A royal pain with Python 2.x.

      J

    • Hi Angel,

      Did you ever fix the SIGBUS error? I’m testing this and am getting it as well. I’ve strace-ed the processes that seem to hit the SIGBUS, but it’s unclear which operation is causing it (most things run fine).

      -J

      • Hello Jerome

        Sorry for the late reply. Seems your comment was in moderation and didn’t get through until a couple of days ago.

        No, I didn’t fix the mysterious SIGBUS. The fact that it only happens when running programs makes it hard to track down. Perhaps a problem with mmapping, or even some kernel assumption… :/

        Let us know if you figure out the fix or even a reproducer.

        Cheers

  3. Hi Jonny,

    thanks a lot for your nice software.

    I’m using it to overcome two big limitations of the HP ltfs driver (yes I’m using tapes):
    1) it lacks a caching mechanism and therefore you read again from tape every time you access a file
    2) it doesn’t handle concurrent reads well. To overcome this I had to modify the pcachefs.py code a bit, redefining the requested range when opening the file:
    with open(cache_data, 'wb') as f:
        debug(' creating blank file, size', str(file_stat.st_size))
        f.seek(file_stat.st_size - 1)
        f.write('\0')  # write a single byte so the blank file takes its full size
    requested_range = Range(0, file_stat.st_size)
    blocks_to_read = cached_blocks.get_uncovered_portions(requested_range)

    In this way the file is read in one go, and concurrent reads are served from the copy on disk. Would it be possible to have this as an option?

    For completeness, you could add eviction of files not accessed recently when disk space runs out.

    Thanks again,
    Pietro
