Apache Arrow and Dash - Community Thread

Have these initial questions been answered in the meantime, and could the answers be shared here? I would be interested to know whether it’s worth investing time in Apache Arrow.

  1. Using Parquet files → PyArrow when caching data to disk and loading it from callbacks, or when transferring data between callbacks in multi-user Dash apps. Right now, the examples use JSON, which may or may not be slower and may or may not have data-conversion issues. See https://plot.ly/dash/performance and the “Capture window/tab closing event” thread for examples that could be adapted to Arrow with Parquet (a minimal caching sketch is included after the quote below).
  2. Using PyArrow’s Tables instead of Pandas DataFrames (https://arrow.apache.org/docs/python/pandas.html). In http://arrow.apache.org/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/, the authors mention that:

> Arrow supports zero-copy reads, so objects can naturally be stored in shared memory and used by multiple processes
Does this apply to multiple WSGI processes? In Dash, a common pattern is to read from a global Pandas DataFrame, which (AFAIK) is copied into each worker when the app runs under gunicorn ( $ gunicorn app.server --workers 4 ). If this were an Arrow Table, would the Table automatically be shared across all workers? How does that work?
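To make question 1 concrete, this is roughly the Parquet caching pattern I have in mind as a replacement for the JSON examples; the cache path and helper names are made up:

```python
# Minimal sketch: cache callback output to disk as Parquet instead of JSON.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CACHE_PATH = "/tmp/dash_cache.parquet"  # made-up cache location

def cache_dataframe(df: pd.DataFrame, path: str = CACHE_PATH) -> None:
    # Convert the DataFrame to an Arrow Table and write it as Parquet.
    pq.write_table(pa.Table.from_pandas(df), path)

def load_dataframe(path: str = CACHE_PATH) -> pd.DataFrame:
    # Read the Parquet file back into an Arrow Table, then into pandas.
    return pq.read_table(path).to_pandas()
```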
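For question 2, one approach I can imagine is writing the Table to an Arrow IPC file once and memory-mapping it in every gunicorn worker, so each worker reads the same bytes through the OS page cache instead of holding its own heap copy; is that the intended pattern, or does something more automatic exist? (The file path below is made up.)

```python
# Sketch: memory-mapped, zero-copy reads of one Arrow IPC file from many workers.
import pandas as pd
import pyarrow as pa
import pyarrow.ipc as ipc

DATA_PATH = "/tmp/shared_table.arrow"  # made-up shared file

def write_shared_table(df: pd.DataFrame, path: str = DATA_PATH) -> None:
    # Run once (e.g. before the workers start): serialize to the IPC file format.
    table = pa.Table.from_pandas(df)
    with pa.OSFile(path, "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

def open_shared_table(path: str = DATA_PATH) -> pa.Table:
    # Run in each worker: memory-map the file and read it without copying.
    return ipc.open_file(pa.memory_map(path, "r")).read_all()
```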

  3. Another common pattern in Dash is to share data between callbacks by serializing the data as JSON and storing it in the network/the user’s browser (see https://plot.ly/dash/sharing-data-between-callbacks). These examples use JSON. Would some serialized form of an Arrow Table be more efficient, both in serialization/deserialization time and in request size? (See the IPC sketch after this list.)
  4. When Dash apps deal with “large data” (data that is too big to fit in RAM), we usually store it in a database or an SQLite file. Is Parquet + Arrow a better solution for this now? (See the column/row-group sketch at the end.)
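To illustrate question 3, this is the kind of Arrow-based replacement for the JSON round trip that I am asking about; whether it actually wins on speed and payload size is exactly the question (function names are made up):

```python
# Sketch: ship a DataFrame through dcc.Store / a hidden div as base64-encoded
# Arrow IPC bytes instead of JSON.
import base64
import pandas as pd
import pyarrow as pa
import pyarrow.ipc as ipc

def df_to_store(df: pd.DataFrame) -> str:
    # Serialize to the Arrow IPC stream format, then base64 for the browser.
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return base64.b64encode(sink.getvalue().to_pybytes()).decode("ascii")

def store_to_df(payload: str) -> pd.DataFrame:
    # Reverse the round trip in the receiving callback.
    return ipc.open_stream(base64.b64decode(payload)).read_all().to_pandas()
```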
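And for question 4, this kind of selective column / row-group access is what I would hope Parquet + Arrow makes practical for larger-than-RAM data (file and column names are made up):

```python
# Sketch: read only what a callback needs from a big Parquet file.
import pyarrow.parquet as pq

PATH = "large_dataset.parquet"  # made-up file

# Read just two columns instead of the whole file.
subset = pq.read_table(PATH, columns=["timestamp", "value"]).to_pandas()

# Or iterate over row groups to keep peak memory bounded.
pf = pq.ParquetFile(PATH)
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i, columns=["value"]).to_pandas()
    # ...process chunk, then let it go out of scope...
```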