Hey Dash Community
I wanted to create a thread to share tips and tricks with using the new Apache Arrow project with Python, Dash, and Pandas. See http://wesmckinney.com/blog/apache-arrow-pandas-internals/ for an introductory essay by the author.
Is anybody using this project yet? In what ways?
I’d also like to collect some simple examples of Arrow with Dash that we can eventually fold into the Dash user guide (if they are compelling!)
Off the top of my head, I can see Arrow being useful in Dash in a few ways:
Using Parquet files → PyArrow when caching data (to disk) inside callbacks, or when transferring data between callbacks in multi-user Dash apps. Right now, the examples use JSON, which can be slower and can lose type information (e.g. dates, categoricals) in the conversion. See https://plot.ly/dash/performance and Capture window/tab closing event for examples that could be adapted to Arrow with Parquet
Using PyArrow’s Tables instead of Pandas DataFrames (https://arrow.apache.org/docs/python/pandas.html). In http://arrow.apache.org/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/, the authors mention that:
Arrow supports zero-copy reads, so objects can naturally be stored in shared memory and used by multiple processes
Does this apply for multiple WSGI processes? In Dash, a common pattern is to read from a global Pandas DataFrame, which (AFAIK) is copied into each of the workers when the app runs under gunicorn (`$ gunicorn app.server --workers 4`). If this were an Arrow Table, would the Table be automatically shared across all workers? How does that work?
Another common pattern in Dash is to share data between callbacks by serializing it as JSON and storing it in the user’s browser (see https://plot.ly/dash/sharing-data-between-callbacks). Would some serialized form of an Arrow Table be more efficient (in both serialization/deserialization time and request size)?
When Dash apps deal with “large data” (data that is too big to fit in RAM), we usually store it in a database or a SQLite file. Is Parquet + Arrow a better solution for this now?
It would be great to fill up this thread with answers to these questions in the form of simple, reproducible examples. And if anyone has any other ideas or examples, please share!