- Parquet is usually the fastest format to read and write among those tested (Feather is close).
- A PyArrow Table requires no serialization, since the data is already in the Arrow format, but you pay an I/O cost to move it from shared memory into the requesting process.
- The clientside JSON store pattern only works if the data is small; otherwise you end up shipping the browser-stored data back and forth multiple times. Anything that stays on the server is likely to be faster than clientside JSON.
- Anything using Arrow is usually faster than static files, because all serialization happens in memory. However, Arrow doesn’t help much with larger-than-memory data, since the data cannot be held in memory. There, Parquet combined with Dask or Vaex is likely the best bet; SQLite may also work well, depending on the operations needed.
brain-plasma (mentioned above) provides a simple API for fast serialization of smaller-than-memory values; Dask and Vaex do a good job otherwise. Just remember that Arrow needs to be held in memory to be most useful. If speed and out-of-core processing matter most, perhaps consider Julia and JuliaDB, which has the fastest load times and out-of-core processing I’ve seen anywhere. It’s a tough problem to solve no matter how you swing it.