@russellthehippo No worries, it's more likely that I'm in unknown waters here, which is why I'm not fully understanding you and making mistakes. It makes sense with the `get_old_data()` caching, but I guess that automatically means I can't store the 18M+ new rows in a beneficial way (by just using caching, I mean). Caching the old data was very fast though, even going through Redis and `pa.Table.from_pandas`.
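For context, the caching I mean is roughly the following (a minimal sketch, assuming a local Redis server; the key name and helper names are just placeholders I made up):

```python
import pandas as pd
import pyarrow as pa
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

def cache_old_data(df: pd.DataFrame, key: str = "old_data") -> None:
    # Serialize the DataFrame to the Arrow IPC stream format and cache the bytes.
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    with pa.RecordBatchStreamWriter(sink, table.schema) as writer:
        writer.write_table(table)
    r.set(key, sink.getvalue().to_pybytes())

def get_old_data(key: str = "old_data") -> pd.DataFrame:
    # Read the cached Arrow stream back into a DataFrame.
    reader = pa.RecordBatchStreamReader(pa.BufferReader(r.get(key)))
    return reader.read_all().to_pandas()
```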
Yes, I realized after posting that I hadn't done it correctly. I found your absolutely excellent thread here on sharing data with Apache Plasma. Things started to click for me, I tried it out, and I think I have it more or less working now.
- So in `get_new_rows` there is a `plasma_read` function that returns the new rows, which are already stored in the Apache Plasma store? I.e., it gets the already-stored new rows, finds rows that are newer still, deletes the old "new rows" object from the store, writes the updated one, and then returns the data so that I can concat it in the `app.callback` (I've sketched my understanding below this list).
- This solution makes sense when I already have a big dataset that I then start an app on, while the alternative solution makes sense if the app is running from when the data is empty. This is interesting.
- I think a dual solution will be beneficial: run the alternative method from the start until (close to) the end of the project, when no more data comes in (like now), and then switch over to simply always using a cached store of the data.
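Here is the cycle from the first bullet as I understand it (a sketch only; `plasma_write`, `plasma_read`, and `fetch_rows_after` are hypothetical names I made up, not your actual code):

```python
import numpy as np
import pandas as pd
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")  # assumes a running plasma_store on this socket

def plasma_write(df: pd.DataFrame) -> plasma.ObjectID:
    # Store a DataFrame under a fresh random 20-byte ID.
    object_id = plasma.ObjectID(np.random.bytes(20))
    client.put(df, object_id)  # uses pyarrow serialization under the hood
    return object_id

def plasma_read(object_id: plasma.ObjectID) -> pd.DataFrame:
    return client.get(object_id)

def get_new_rows(object_id: plasma.ObjectID):
    # Get the already-stored new rows, find rows newer than those,
    # replace the stored object, and return the combined result.
    stored = plasma_read(object_id)
    fresh = fetch_rows_after(stored.index.max())  # hypothetical query helper
    combined = pd.concat([stored, fresh])
    client.delete([object_id])        # drop the old "new rows" object
    new_id = plasma_write(combined)   # write the updated "new rows" object
    return combined, new_id           # concat `combined` with old data in the callback
```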
I implemented the alternative!
- I started a Plasma store in a separate process, but there is another issue I stumbled upon: how do I run the initial big query only once?
- I ended up creating a separate script (see the sketch after this list).
- I run it once: it stores the big data in the Plasma store and the `object_id` in a pickle file, and then it terminates. I then start the app and read from the Plasma store. This avoids the app running the query multiple times. Is there a better way?
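The one-off loader script looks more or less like this (a sketch; `load_big_query` stands in for my actual query, and the store itself is started separately, e.g. with `plasma_store -m 4000000000 -s /tmp/plasma`):

```python
# load_once.py -- run once before starting the app
import pickle
import numpy as np
import pyarrow.plasma as plasma

from my_queries import load_big_query  # hypothetical module with the expensive query

client = plasma.connect("/tmp/plasma")

df = load_big_query()  # the big initial query, executed exactly once
object_id = plasma.ObjectID(np.random.bytes(20))
client.put(df, object_id)

# Persist the 20-byte ID so the app can find the object after this script exits.
with open("object_id.pkl", "wb") as f:
    pickle.dump(object_id.binary(), f)
```

The app then just does the reverse at startup:

```python
with open("object_id.pkl", "rb") as f:
    object_id = plasma.ObjectID(pickle.load(f))
df = client.get(object_id)
```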
I then do what you say: I pull all new rows (I cannot test it at the moment since the on-site gateway is offline, so it pulls an empty dataframe), pull the old data, concat, and store the result in shared memory. However, I currently don't delete the old object (which obviously is a problem), but from the Plasma documentation it seems to be as easy as `client.delete([object_id])` (it takes a list of object IDs).
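The update step I have now, plus the delete I still need to add, would then look roughly like this (a sketch; `pull_new_rows` is a placeholder for my query):

```python
import numpy as np
import pandas as pd
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")

def refresh_shared_data(old_id: plasma.ObjectID) -> plasma.ObjectID:
    old_df = client.get(old_id)
    new_rows = pull_new_rows()  # hypothetical; currently returns an empty frame (gateway offline)
    combined = pd.concat([old_df, new_rows], ignore_index=True)

    new_id = plasma.ObjectID(np.random.bytes(20))
    client.put(combined, new_id)
    client.delete([old_id])  # remove the stale object so the store doesn't fill up

    # Persist new_id (e.g. overwrite object_id.pkl) so the next callback reads the fresh object.
    return new_id
```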
Again, thanks a lot for the help! And thanks for offering to help via messages. I'll be sure to message you if anything else comes up =)