Apache Arrow and Dash - Community Thread

Have these initial questions been answered in the meantime, and could the answers be shared here? I would be interested to know whether it’s worth investing time in Apache Arrow.

  1. Using Parquet files → PyArrow when caching data to disk and loading it from callbacks, or when transferring data between callbacks in multi-user Dash apps. Right now, the examples use JSON, which may or may not be slower and may or may not have data-conversion issues. See https://plot.ly/dash/performance and the “Capture window/tab closing event” thread for examples that could be adapted to Arrow with Parquet (a minimal caching sketch is included after the quote below).
  2. Using PyArrow’s Tables instead of Pandas DataFrames (https://arrow.apache.org/docs/python/pandas.html). In http://arrow.apache.org/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/, the authors mention that:

> Arrow supports zero-copy reads, so objects can naturally be stored in shared memory and used by multiple processes
Does this apply to multiple WSGI processes? In Dash, a common pattern is to read from a global Pandas DataFrame, which (AFAIK) is copied into each worker when the app runs under gunicorn ( $ gunicorn app.server --workers 4 ). If this were an Arrow Table, would the Table automatically be shared across all workers? How does that work?
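To make question 1 concrete, this is roughly the Parquet caching pattern I have in mind as a replacement for the JSON examples; the cache path and helper names are made up:

```python
# Minimal sketch: cache callback output to disk as Parquet instead of JSON.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CACHE_PATH = "/tmp/dash_cache.parquet"  # made-up cache location

def cache_dataframe(df: pd.DataFrame, path: str = CACHE_PATH) -> None:
    # Convert the DataFrame to an Arrow Table and write it as Parquet.
    pq.write_table(pa.Table.from_pandas(df), path)

def load_dataframe(path: str = CACHE_PATH) -> pd.DataFrame:
    # Read the Parquet file back into an Arrow Table, then into pandas.
    return pq.read_table(path).to_pandas()
```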
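For question 2, one approach I can imagine is writing the Table to an Arrow IPC file once and memory-mapping it in every gunicorn worker, so each worker reads the same bytes through the OS page cache instead of holding its own heap copy; is that the intended pattern, or does something more automatic exist? (The file path below is made up.)

```python
# Sketch: memory-mapped, zero-copy reads of one Arrow IPC file from many workers.
import pandas as pd
import pyarrow as pa
import pyarrow.ipc as ipc

DATA_PATH = "/tmp/shared_table.arrow"  # made-up shared file

def write_shared_table(df: pd.DataFrame, path: str = DATA_PATH) -> None:
    # Run once (e.g. before the workers start): serialize to the IPC file format.
    table = pa.Table.from_pandas(df)
    with pa.OSFile(path, "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

def open_shared_table(path: str = DATA_PATH) -> pa.Table:
    # Run in each worker: memory-map the file and read it without copying.
    return ipc.open_file(pa.memory_map(path, "r")).read_all()
```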

  3. Another common pattern in Dash is to share data between callbacks by serializing the data as JSON and storing it in the network/the user’s browser (see https://plot.ly/dash/sharing-data-between-callbacks). These examples use JSON. Would some serialized form of an Arrow Table be more efficient, both in serialization/deserialization time and in request size? (See the IPC sketch after this list.)
  4. When Dash apps deal with “large data” (data that is too big to fit in RAM), we usually store it in a database or an SQLite file. Is Parquet + Arrow a better solution for this now? (See the column/row-group sketch at the end.)
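To illustrate question 3, this is the kind of Arrow-based replacement for the JSON round trip that I am asking about; whether it actually wins on speed and payload size is exactly the question (function names are made up):

```python
# Sketch: ship a DataFrame through dcc.Store / a hidden div as base64-encoded
# Arrow IPC bytes instead of JSON.
import base64
import pandas as pd
import pyarrow as pa
import pyarrow.ipc as ipc

def df_to_store(df: pd.DataFrame) -> str:
    # Serialize to the Arrow IPC stream format, then base64 for the browser.
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return base64.b64encode(sink.getvalue().to_pybytes()).decode("ascii")

def store_to_df(payload: str) -> pd.DataFrame:
    # Reverse the round trip in the receiving callback.
    return ipc.open_stream(base64.b64decode(payload)).read_all().to_pandas()
```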
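And for question 4, this kind of selective column / row-group access is what I would hope Parquet + Arrow makes practical for larger-than-RAM data (file and column names are made up):

```python
# Sketch: read only what a callback needs from a big Parquet file.
import pyarrow.parquet as pq

PATH = "large_dataset.parquet"  # made-up file

# Read just two columns instead of the whole file.
subset = pq.read_table(PATH, columns=["timestamp", "value"]).to_pandas()

# Or iterate over row groups to keep peak memory bounded.
pf = pq.ParquetFile(PATH)
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i, columns=["value"]).to_pandas()
    # ...process chunk, then let it go out of scope...
```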