
Show and Tell - brain-plasma (fast sharing for large objects between callbacks)

As mentioned here I’ve been using Apache Plasma to solve what is still one of Dash’s biggest problems: sharing large data between callbacks, apps, and pages.

Now I’ve roughly formalized some of that functionality in the PyPI package brain-plasma. It’s a simple and easy-to-use way to store Python objects, even very large Pandas dataframes or dictionaries, in a shared memory space. This method offers (imperfect, but much better) thread safety, blazing speed relative to reading from disk or Redis, and a super simple, if corny, API. Basically, it uses Plasma to function as the “brain” of your app or other Python project by creating an indexed object namespace in Plasma.

brain_plasma.Brain can brain.learn() new things, brain.recall() old factoids, and can brain.forget() just like I too often do; I can tell my brain to brain.wake_up() if it’s been brain.sleep()ing; sadly, sometimes it’s just brain.dead(). But it can store quite a bit of brain.knowledge() and it is very good at remembering brain.names().

Full basic docs at https://github.com/russellromney/brain-plasma

Basic usage is:

import numpy
import pandas as pd
from brain_plasma import Brain

brain = Brain()

df = pd.DataFrame(numpy.random.randint(0,100,size=(1000000,4)))
txt = 'my text string'

# store the data
brain.learn(df,'df')
brain.learn(txt,'txt')

# get the data again
txt==brain.recall('txt')
> True

# delete a name's value
brain.forget('df')

# get all variable names currently available to brain
names = brain.names()

This is still a work in progress in EXTREME ALPHA, i.e. I built it today and it is only tested enough to confirm that the functionality works and is better than what I was using before. So please don’t use this in your production apps until a) the Apache Plasma API is more stable (it’s not) or b) this API is more stable, and c) the functionality is hammered out a bit more (probably in v0.15).

I’d love any help, requests, or critiques you have!


I really like the corny API :smiley:


Added thanks to @tcbegley: __getitem__ and __setitem__

brain = Brain()
brain['text'] = 'asdf' # calls brain.learn()
brain['text'] # calls brain.recall()
# >>> 'asdf'

Updates:

Ability to start the underlying plasma_store process when you instantiate Brain

brain = Brain(start_process=True, size=100000000)

Also, if you have used brain.dead(i_am_sure=True) to kill the plasma_store process, you can restart it with the new method brain.start(path='this/path',size=numberofbytes). Both parameters are optional; the default is to reuse the previous size and path.


Fixed a bug that sometimes prevented assigning a new value to a name that already had one:

# old error example
brain['a'] = 'asdf'
brain['a']
# >>> 'asdf'
brain['a'] = 5
# >>> Plasma Error - ObjectID already exists

New attributes: brain.size and brain.mb give the number of bytes (integer) and megabytes (e.g. '50 MB'), respectively, available in the plasma_store.


Updates:

Ability to resize the memory available in the underlying plasma_store process without losing any variables.

brain['a'] = [1,2,3,4]
brain.size
# >>> 50000000

brain.resize(100000000)

# size changes
brain.size
# >>> 100000000

# all the values remain
brain['a']
# >>> [1,2,3,4]

Now you have to explicitly tell Brain NOT to start the process; it no longer assumes that the plasma_store process is already there.

Plus general bugfixes, stabilizing the API, and performance.


Updates:

New functions:

# how much space is used
brain.used()

# how much space is free
brain.free()

# dynamically find size of plasma_store
brain.size()

# see dictionary of names:ObjectID()s
brain.object_map()

Bugfix: brain.start() and brain.resize() used to start a new plasma_store instance; now they don’t. The problem was in brain.dead().


Update with release v0.2:

Big things! brain-plasma is stable again (with breaking changes around starting and killing Plasma instances), documentation is better, and there is a new killer feature: namespaces!

RELEASE WITH BREAKING CHANGES

  • changed parameter order of learn() to ('name',thing) which is more intuitive (but you should always use bracket notation)
  • removed ability to start, kill, or resize the underlying Plasma instance (stability)
  • added ability to use unique namespaces to hold same-name values.
  • newly available:
    • len(brain) --> # 5
    • del brain['this'] --> # brain.forget('this')
    • 'this' in brain --> # True
    • (implemented __len__ , __delitem__ , and __contains__ )
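To make the mapping between the operators and the named methods concrete, here is a rough sketch with a plain dict standing in for the Plasma store. ToyBrain is a hypothetical class for illustration only; it is not how brain_plasma.Brain is implemented, and it talks to no Plasma process at all.

```python
class ToyBrain:
    """Dict-backed stand-in showing how the operators map onto named methods."""

    def __init__(self):
        self._values = {}

    def learn(self, name, thing):
        self._values[name] = thing

    def recall(self, name):
        return self._values[name]

    def forget(self, name):
        self._values.pop(name, None)

    # brain[name] = thing  ->  learn(name, thing)
    def __setitem__(self, name, thing):
        self.learn(name, thing)

    # brain[name]  ->  recall(name)
    def __getitem__(self, name):
        return self.recall(name)

    # del brain[name]  ->  forget(name)
    def __delitem__(self, name):
        self.forget(name)

    # name in brain
    def __contains__(self, name):
        return name in self._values

    # len(brain)
    def __len__(self):
        return len(self._values)


brain = ToyBrain()
brain["this"] = "asdf"
has_this = "this" in brain
count = len(brain)
del brain["this"]
```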

Using namespaces:

brain.namespace
>>> 'default'

brain['this'] = 'default text object'

# change namespace
brain.set_namespace('newname')
brain['this'] = 'newname text object'

brain.set_namespace('default')
brain['this']
>>> 'default text object'

brain.names(namespaces='all')
>>>['this','this']

brain.show_namespaces()
>>> {'default','newname'}

brain.remove_namespace('newname')
brain.namespace
>>> 'default'
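One way to picture namespaces is as a second part of the key: the same name can exist once per namespace because values are stored under (namespace, name). The sketch below is a toy model under that assumption. NamespacedStore is a hypothetical name, the ValueError stands in for the library's own exception, and falling back to 'default' when the active namespace is removed is a guess rather than documented behaviour.

```python
class NamespacedStore:
    """Toy model: values are keyed by (namespace, name)."""

    def __init__(self):
        self.namespace = "default"
        self._namespaces = {"default"}
        self._values = {}

    def set_namespace(self, namespace):
        # Switching to an unknown namespace creates it.
        self._namespaces.add(namespace)
        self.namespace = namespace

    def __setitem__(self, name, value):
        self._values[(self.namespace, name)] = value

    def __getitem__(self, name):
        return self._values[(self.namespace, name)]

    def names(self, namespaces=None):
        if namespaces == "all":
            return [name for (_, name) in self._values]
        return [name for (ns, name) in self._values if ns == self.namespace]

    def show_namespaces(self):
        return set(self._namespaces)

    def remove_namespace(self, namespace):
        if namespace == "default":
            raise ValueError("the default namespace cannot be removed")
        # Drop every value stored under the removed namespace.
        self._values = {k: v for k, v in self._values.items() if k[0] != namespace}
        self._namespaces.discard(namespace)
        if self.namespace == namespace:
            self.namespace = "default"  # assumption, see the note above


store = NamespacedStore()
store["this"] = "default text object"
store.set_namespace("newname")
store["this"] = "newname text object"
store.set_namespace("default")
```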

I’m currently using the namespaces feature to back up persistent user state data that can’t be held client-side, and continuing to use the main storage features for quick big-object access.
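For the big-object case in a Dash app, one approach is to have the first callback store the large object under a name and pass only that short name on to the next callback, so the object itself never travels through the browser. A minimal sketch of that flow follows. A plain dict stands in for a connected Brain so the example runs anywhere, the function names and the 'big_table' key are made up, and real callbacks would carry the usual Dash decorators.

```python
# Stand-in for a connected Brain; a dict supports the same bracket notation.
store = {}


def load_data_callback(n_rows):
    """First callback: build the large object once and store it by name."""
    rows = [{"id": i, "value": i * i} for i in range(n_rows)]
    store["big_table"] = rows
    # Only this short string needs to pass between callbacks.
    return "big_table"


def summary_callback(name):
    """Second callback: receive the name and read the object back."""
    rows = store[name]
    return {"rows": len(rows), "total": sum(r["value"] for r in rows)}


key = load_data_callback(1000)
summary = summary_callback(key)
```

With the real library, store would be a Brain instance shared by name across processes, which is what a dict cannot do.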

I’d love some help on emulating dictionary indexed assignment behavior like in this helpful issue: https://github.com/russellromney/brain-plasma/issues/18

Hope this is useful for some folks!


Hi Russell,

Thanks for this tool but could you please tell me (and maybe others) how to use it with Dash?

I tried it in my app and I get this:

WARNING: Logging before InitGoogleLogging() is written to STDERR
E1002 12:08:20.027307 2871042944 io.cc:168] Connection to IPC socket failed for pathname /tmp/plasma, retrying 5 more times
E1002 12:08:20.430019 2871042944 io.cc:168] Connection to IPC socket failed for pathname /tmp/plasma, retrying 4 more times
E1002 12:08:20.834305 2871042944 io.cc:168] Connection to IPC socket failed for pathname /tmp/plasma, retrying 3 more times
E1002 12:08:21.237042 2871042944 io.cc:168] Connection to IPC socket failed for pathname /tmp/plasma, retrying 2 more times
E1002 12:08:21.638428 2871042944 io.cc:168] Connection to IPC socket failed for pathname /tmp/plasma, retrying 1 more times
Traceback (most recent call last):
  File "index.py", line 8, in <module>
    from apps import triangles_app, ts_analysis, graphing, gtaa, play
  File "/Users/Desktop/webapp/apps/graphing.py", line 34, in <module>
    brain = Brain()
  File "/Users/Desktop/webapp/env/lib/python3.7/site-packages/brain_plasma/brain_plasma.py", line 20, in __init__
    self.client = plasma.connect(self.path,num_retries=5)
  File "pyarrow/_plasma.pyx", line 805, in pyarrow._plasma.connect
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Could not connect to socket /tmp/plasma

I am running the dash app in a virtualenv and plasma_store -m 50000000 -s /tmp/plasma in a separate terminal window.

Thanks!
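A quick way to narrow down a "Could not connect to socket" error like this one is to check, from the same environment the app runs in, whether anything is actually listening on that path. Here is a minimal sketch using only the standard library. socket_is_listening is a made-up helper name, and it only tells you that some process accepts connections on the Unix socket, not that the process is a Plasma store.

```python
import os
import socket
import stat


def socket_is_listening(path):
    """Return True if something accepts connections on the Unix socket at path."""
    try:
        # The path must exist and be a socket file, not a regular file.
        if not stat.S_ISSOCK(os.stat(path).st_mode):
            return False
    except OSError:
        return False
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
        return True
    except OSError:
        return False
    finally:
        client.close()


plasma_ready = socket_is_listening("/tmp/plasma")
```

If this returns False while plasma_store is running in another terminal, the two processes are probably not looking at the same path.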

I solved my problem by launching the plasma_store process with a path inside the dash webapp. Is this the best way to do it?

You should not need to do that. Afaik, /tmp is used to create temporary socket files which are used to communicate with the Plasma instance. Changing the path should not affect how brain-plasma works, though. Which system are you on?

I’m on macOS. Also, I couldn’t use start_process=True as an argument (I have brain-plasma 0.2)

That’s odd that it doesn’t work on Mac. Will you open an issue on Github with an example of the non-working code?

In v0.2 I removed that ability as it made the tool unstable. I have updated the reference in the README on Github.

This is really cool! Thanks for making it @russellthehippo. I’m using it for a server and was wondering what the best way to start a gunicorn/flask process is with brain/plasma. I can’t start all workers at the same time because they will all be trying to do the initial write of the dataframe to the same object. Right now, I’m running an initial script to read my data into the brain, and then starting the gunicorn workers.

It sounds like you need to have each worker check if the object exists already before it tries to load it. Maybe I misunderstand the question.

Well, I do, but these objects have to be loaded as soon as the worker starts and all the workers try to load the same object at the same time. I guess there’s no obvious solution besides a two stage initialisation.
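One possible alternative to a separate first stage is to let the workers race but serialize the check-then-load step with a file lock, so whichever worker gets the lock first does the write and the rest find the object already there. This is a rough sketch, not something brain-plasma provides. load_once and the lock path are hypothetical, fcntl is Unix-only, and store is assumed to support `in` and bracket assignment the way Brain does since v0.2.

```python
import fcntl


def load_once(store, name, loader, lock_path):
    """Populate store[name] exactly once, even if several workers call this together."""
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock on this file.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-check inside the lock: another worker may have loaded it already.
            if name not in store:
                store[name] = loader()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return store[name]
```

Each worker would call something like load_once(brain, 'df', read_my_data, '/tmp/df.lock') at startup, where read_my_data is whatever function builds the dataframe. The store has to be shared between processes for this to help; an ordinary dict would be private to each worker.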

Update with Release v0.3:

Summary

This release is the biggest yet on the path to production usefulness beyond holding a few large objects. I mostly rewrote and entirely refactored brain_plasma.Brain: it now hashes names for direct access, which speeds up read and write operations by roughly two orders of magnitude in the benchmark below, thanks to fewer and more lightweight calls. The API is mostly the same. Custom exceptions are added to help users catch and understand errors better. Most functions are unit tested and can be checked with pytest.

The sum of these changes means brain-plasma can be used as a fast production backend similar to Redis, but with fast support for very large values as well as for very small values (and for pure Python objects rather than transformed values a la JSON) and for few as well as many values. I’m pretty excited about it.

Hashing speedup

Speedup results are drastic, especially when there are more than a dozen or so names in the store. This is because the old Brain called client.list() multiple times for most Brain interactions. This was admittedly a horrible design. The new Brain doesn’t call client.list() at all for most operations, including all reads and writes. The script many_vals.py compares the old Brain with the new one (all values in seconds):

plasma_store -m 10000000 -s /tmp/plasma
# new terminal
python many_vals.py
>>>
100 items:
    learn:
        old: 3.6606647968292236
        hash: 0.030955076217651367
    recall:
        old: 4.092543840408325
        hash: 0.017110824584960938
 10 items:
    learn:
        old: 0.32016992568969727
        hash: 0.005012035369873047
    recall:
        old: 0.31406521797180176
        hash: 0.002324819564819336
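The idea behind the speedup can be sketched without Plasma at all: if a name is turned deterministically into a fixed-size ID, a read can go straight to the object instead of listing the whole store and searching it. Plasma object IDs are 20 bytes, so the sketch derives a 20-byte digest. The choice of BLAKE2b, the "namespace:name" key format, and the HashedStore class are assumptions for illustration and not necessarily what brain-plasma does internally.

```python
import hashlib


def name_to_id(name, namespace="default"):
    """Derive a deterministic 20-byte ID from a namespace and a name."""
    key = f"{namespace}:{name}".encode("utf-8")
    return hashlib.blake2b(key, digest_size=20).digest()


class HashedStore:
    """Toy store: every read and write is a single direct lookup by hashed ID."""

    def __init__(self):
        self._objects = {}  # 20-byte ID -> value

    def learn(self, name, value):
        self._objects[name_to_id(name)] = value

    def recall(self, name):
        return self._objects[name_to_id(name)]


store = HashedStore()
store.learn("df", [1, 2, 3])
value = store.recall("df")
```

Because the ID depends only on the name, no index of existing names has to be fetched before a read or write.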

Unit tests

Most functions are tested in tests/. Check them yourself or test your changes with:

pip install pytest
pytest

Exceptions

Custom exceptions are added to help users catch and understand errors better. Most types of errors that are specific to brain-plasma functions, rather than ordinary Python errors, are defined as custom exceptions. Function docstrings mention which exceptions may be raised. The new exceptions can be imported en masse like:

from brain_plasma.exceptions import (
    BrainNameNotExistError,
    BrainNamespaceNameError,
    BrainNamespaceNotExistError,
    BrainNamespaceRemoveDefaultError,
    BrainNameLengthError,
    BrainNameTypeError,
    BrainClientDisconnectedError,
    BrainRemoveOldNameValueError,
    BrainLearnNameError,
    BrainUpdateNameError,
)

Other

Code is formatted with the excellent black. Markdown is formatted with Prettier.

Hello, I am using brain-plasma in production. I update my dataframes with kombu, so I don’t have to query the database anymore except at startup. In my Dockerfile I call entrypoint.sh, which uses this line to start the store alongside gunicorn:

plasma_store -m 50000000 -s /tmp/plasma &

then I do

exec gunicorn src.app:server --bind 0.0.0.0:8000 --log-level=info --timeout=90

I hope this can be helpful for you


Could Brain Plasma be used as an alternative backend for Flask-Caching? See https://pythonhosted.org/Flask-Caching/#custom-cache-backends

That would be great I guess.


Hi @mwveliz, thanks for the reply! Could you please explain in a bit more detail how you deal with the issue of multiple workers trying to write the initial dataframe to plasma? Thanks :slight_smile:

Hi @dldx I am not sure because I only use one worker, but maybe this way (using --preload) you could do it:

gunicorn --preload src.app:server --bind 0.0.0.0:8000 --log-level=info --timeout=90

as described on


and

Hope it can help you :slight_smile: