Have these initial questions been answered in the meantime, and could the answers be shared here? I would be interested to know whether it’s worth investing time into Apache Arrow.
- Using Parquet files → PyArrow when caching data to disk and loading it in callbacks, or when transferring data between callbacks in multi-user Dash apps. Right now, the examples use JSON, which may or may not be slower and may or may not have data-conversion issues. See https://plot.ly/dash/performance and the “Capture window/tab closing event” thread for examples that could be adapted to Arrow with Parquet (a rough sketch of what I mean is the first snippet after this list).
- Using PyArrow’s Tables instead of Pandas DataFrames (https://arrow.apache.org/docs/python/pandas.html). In http://arrow.apache.org/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/, the authors mention:
> Arrow supports zero-copy reads, so objects can naturally be stored in shared memory and used by multiple processes
Does this apply to multiple WSGI processes? In Dash, a common pattern is to read from a global Pandas DataFrame, which (AFAIK) is copied for each of the workers when the app runs under gunicorn (`$ gunicorn app.server --workers 4`). If this were an Arrow Table, would the Table automatically be shared across all workers? How does that work? (The second snippet after this list shows the kind of setup I have in mind.)
- Another common pattern in Dash is to share data between callbacks by serializing it as JSON and storing it in the network / the user’s browser (see https://plot.ly/dash/sharing-data-between-callbacks). Would some serialized form of an Arrow Table be more efficient than JSON, both in serialization/deserialization time and in request size? (Third snippet below.)
- When Dash apps deal with “large data” (data that is too big to fit in RAM), we usually store it in a database or a sqlite file. Is Parquet + Arrow now a better solution for this? (Last snippet below.)
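To make the first point concrete, here is roughly what I have in mind: caching to disk as Parquet via PyArrow instead of `to_json()`. This is only a sketch assuming pandas and pyarrow are installed; `CACHE_PATH` and the helper names are placeholders I made up.

```python
# Sketch only: cache a DataFrame to disk as Parquet instead of JSON.
# CACHE_PATH and the helper names are made up for illustration.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CACHE_PATH = "/tmp/dash_cached_result.parquet"

def cache_dataframe(df: pd.DataFrame) -> None:
    # Convert to an Arrow Table and write it out as a Parquet file
    pq.write_table(pa.Table.from_pandas(df), CACHE_PATH)

def load_cached_dataframe() -> pd.DataFrame:
    # Read the Parquet file back into pandas inside a callback
    return pq.read_table(CACHE_PATH).to_pandas()
```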
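For the gunicorn question, my (possibly wrong) understanding is that each worker could memory-map the same Arrow IPC file, and zero-copy reads would let the OS share those pages between processes instead of each worker holding its own copy. Something like the following, where `/tmp/data.arrow` and the function names are invented and the file is written once before the workers start:

```python
# Sketch of the memory-mapping idea; the path and function names are invented.
import pandas as pd
import pyarrow as pa

ARROW_PATH = "/tmp/data.arrow"

def write_once(df: pd.DataFrame) -> None:
    # Write the DataFrame once as an Arrow IPC file before gunicorn forks workers
    table = pa.Table.from_pandas(df)
    with pa.OSFile(ARROW_PATH, "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

def load_in_worker() -> pa.Table:
    # Each worker memory-maps the same file; reading is zero-copy, so (as I
    # understand it) the data lives in the shared OS page cache rather than
    # being duplicated per worker.
    source = pa.memory_map(ARROW_PATH, "r")
    return pa.ipc.open_file(source).read_all()
```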
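For sharing data between callbacks, the alternative I am picturing is serializing the Table with Arrow’s IPC stream format and base64-encoding it so it can still be stored in a `dcc.Store` / hidden div (which only accept JSON-serializable values). Whether this actually beats `df.to_json()` in payload size and (de)serialization time is exactly what I would like to know; the helper names are mine:

```python
# Sketch: put an Arrow-IPC-serialized DataFrame into dcc.Store instead of JSON.
# Base64 is used because the store only accepts JSON-serializable values.
import base64
import pandas as pd
import pyarrow as pa

def df_to_store(df: pd.DataFrame) -> str:
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return base64.b64encode(sink.getvalue().to_pybytes()).decode("ascii")

def store_to_df(payload: str) -> pd.DataFrame:
    buf = pa.py_buffer(base64.b64decode(payload))
    return pa.ipc.open_stream(buf).read_all().to_pandas()
```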
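Finally, for the “large data” case, the Parquet route I am imagining is letting each callback read only the columns and row groups it needs instead of issuing a sqlite query. Again just a sketch, and the file name and column names are invented; I have not benchmarked whether the filter pushdown makes this competitive with a database:

```python
# Sketch: read only the needed columns/rows from a Parquet file in a callback,
# instead of querying sqlite. File and column names are invented.
import pyarrow.parquet as pq

def load_slice(country: str):
    table = pq.read_table(
        "big_data.parquet",
        columns=["country", "year", "value"],   # column pruning
        filters=[("country", "=", country)],    # predicate pushdown on row groups
    )
    return table.to_pandas()
```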