Auto Athena queries in Python dash app

Hi there,

has anyone here had a chance to create a Python dash app using data from AWS Athena? I am trying to do so and having difficulty automatically querying structured data from my data lake (AWS Athena) via Python.
Specifically, I have been creating a real-time report about information of each user, and as the user inputs their own ID to dcc.Input, then all relevant information about that user will render. Here the thing is, I must query all user IDs and their related data first, then my app can work well, but there are a very large number of users there => my app starts extremely slowly.
Now, I’ve been trying placing my Python queries to under callback, meaning that I want whenever the user inputs their own ID to dcc.Input, it will trigger my query codes for only that user ID. The below is my codes for this but it does not work:

#essential libraries
import pandas as pd
import numpy as np

import plotly.express as px
import dash
from dash import dcc, html
from dash import dash_table as dt
from dash.dependencies import Input, Output

from pyathena import connect

#DASH APP
app = dash.Dash(name)

#--------------------------------------------------
app.layout = html.Div(children=[
add_logo(),
*make_break(2),

html.Div
([
    #Input Box for user ID and Dropdown for each respective profile name
    html.Div
    ([
        html.Label('user ID'),

        *make_break(1),

        dcc.Input(id="input_user_id", 
        type="text", 
        placeholder="Input your ID here vĂ  Enter!", 
        debounce=True,                                                             #True = enter or tab to execute this input box 
        style={'margin':'auto', 'display':'inline-block', "width":'70%','height':'29px', 'font-size' :'115%'}),

        html.Div(id="output"),
    ], style={'padding': 10, 'flex': 1}),




],style={‘text-align’:‘center’, ‘width’:‘100%’,‘background-image’:‘url(“/assets/background-MS.jpg”)’})
#--------------------------------------------------
@app.callback(
Output(‘output’, ‘children’),
Input(‘input_user_id’, ‘value’),
)
def update_query_output(ID):
#query from AWS Athena
ms=pd.read_sql(f"“”
SELECT h.*
FROM
(
SELECT
CAST(date_format(date_parse( SUBSTR(cast(“ts” as varchar(8)), 1, 8), ‘%Y%m%d’ ), ‘%Y-%m-%d’) AS DATE) as “date”,
“ts”,
“profile.identity” as “id”,
“eventname”,
“eventprops.lesson_name”,
“eventprops.completed”,
“eventprops.course_type”,
FROM clevertap_db_user.my_db
) h
WHERE h.ID == ‘{ID}’
“”", con = conn)

@app.callback(


)


#--------------------------------------------------
if name == ‘main’:
app.run_server(debug=False)

Then, running the above codes will result in an error saying that name ‘ms’ is not defined

Hi,

Is this your entire update_query_output callback or have you removed some lines after the query? Just a reminder that your callback must return something to be placed as a Div “children” (a list of components or strings at least
).

I can also see that the input in the callback signature (function argument) is ID, but you are trying to filter in the query by a variable input_user_id. Besides, I believe you should replace the value of input_user_id in the string and you are missing the templating part: f".... WHERE h.ID == '{input_user_id}'" (assuming h.ID is a string, you might need extra quotes around that).

I don’t understand the error message. It could be from other parts of the code or just the error that I mentioned above.

Hope this helps!

I don’t understand the error message. It could be from other parts of the code or just the error that I mentioned above.

Yes, the error warning is exactly from below parts of the code that are removed to emphasize on the main part of code. It means that my query code does not work at all.

Is this your entire update_query_output callback or have you removed some lines after the query? Just a reminder that your callback must return something to be placed as a Div “children” (a list of components or strings at least
).

What it should return there? I tried return ms already but it is not a solution

I can also see that the input in the callback signature (function argument) is ID , but you are trying to filter in the query by a variable input_user_id . Besides, I believe you should replace the value of input_user_id in the string and you are missing the templating part: f".... WHERE h.ID == '{input_user_id}'" (assuming h.ID is a string, you might need extra quotes around that).

it does not work :frowning: actually, I initially left f".... WHERE h.ID == '{input_user_id}'", then I change it to ID to make my code readable.

it does not work :frowning: actually, I initially left f".... WHERE h.ID == '{input_user_id}'" , then I change it to ID to make my code readable.

What about this?

@app.callback(
    Output(‘output’, ‘children’),
    Input(‘input_user_id’, ‘value’),
)
def update_query_output(ID):
    #query from AWS Athena
   ms=pd.read_sql(f"""
SELECT h.*
FROM
(
SELECT
CAST(date_format(date_parse( SUBSTR(cast(“ts” as varchar(8)), 1, 8), ‘%Y%m%d’ ), ‘%Y-%m-%d’) AS DATE) as “date”,
“ts”,
“profile.identity” as “id”,
“eventname”,
“eventprops.lesson_name”,
“eventprops.completed”,
“eventprops.course_type”,
FROM clevertap_db_user.my_db
) h
WHERE h.ID == ‘{ID}’
“”", con = conn)

ID is the variable name in the function, so you have to use it in the query parameter.

What it should return there? I tried return ms already but it is not a solution

It is not, because ms is a pandas dataframe. You will have to generate Dash components from the dataframe data or return a dash_table.DataTable component (if you want a table). What exactly do you want to use this data for?

my bad! Actually h.ID == '{ID}' already in my code right now. When I copied my code to post here, I removed and adjuted some lines, so it results in some typo as such

It is not, because ms is a pandas dataframe. You will have to generate Dash components from the dataframe data or return a dash_table.DataTable component (if you want a table). What exactly do you want to use this data for?

I want the variable ms is an input for other callbacks below, like creating bar graph:

@app.callback(

Output('bar_graph', 'figure'),

Input("input_user_id", "value"),

Input('daterange', 'start_date'),

Input('daterange', 'end_date')

)

def update_bar(input_user_id,start_date,end_date):

#Ensure the dataframe is not overwritten

pl = ms.copy(deep=True)

pl = pl[( (pl.id == input_user_id) & (pl.date >= start_date) & (pl.date <= end_date) )]

pl = pl.groupby(['id', 'TĂȘn chủ đề', 'level'])['TĂȘn chủ đề'].count().reset_index(name = 'Số láș§n học')

pl.sort_values(['Số láș§n học', 'TĂȘn chủ đề'], ascending=False,inplace=True)



#Conditionally filter the dataframe using the input

pl_graph = px.bar(data_frame = pl,

                x='TĂȘn chủ đề',

                y='Số láș§n học',  

                title=f'<b>BĂ© { input_user_id} đã học cĂĄc chủ đề sau đñy từ {start_date} đáșżn {end_date}</b>',

                color='level',

                text_auto= True)

#Return the figure to render

return(pl_graph)

Right there the error message happens name ‘ms’ is not defined

You cannot do that, because ms is defined only in the scope of the first callback. Besides, the reason why you can’t return a pandas dataframe in a callback is that callback returns must be json-serializable objects.

One alternative is to replace html.Div("output") with dcc.Store("output") and send the data in ms as a dictionary to the data prop like this:

@app.callback(
    Output(‘output’, ‘data’),  # if dcc.Store("output") is defined in layout
    Input(‘input_user_id’, ‘value’),
)
def update_query_output(ID):
    #query from AWS Athena
   ms=pd.read_sql(f"""
SELECT h.*
FROM
(
SELECT
CAST(date_format(date_parse( SUBSTR(cast(“ts” as varchar(8)), 1, 8), ‘%Y%m%d’ ), ‘%Y-%m-%d’) AS DATE) as “date”,
“ts”,
“profile.identity” as “id”,
“eventname”,
“eventprops.lesson_name”,
“eventprops.completed”,
“eventprops.course_type”,
FROM clevertap_db_user.my_db
) h
WHERE h.ID == ‘{ID}’
""", con = conn)

    return ms.to_dict("records")  # convert pd.Dataframe to dict

Then you can pass the dict in the second callback:

@app.callback(

Output('bar_graph', 'figure'),

Input("input_user_id", "value"),

Input('daterange', 'start_date'),

Input('daterange', 'end_date'),

Input('output', 'data')
)

def update_bar(input_user_id,start_date,end_date, df_as_dict):
    pl = pd.DataFrame(df_as_dict)
    # ...rest of your code

Note that this approach sends the entire dataframe to the client as a json object, and the performance of this operation depends on how large your data is (there is a limit of ~10Mb or similar).

shoud It be pl = pd.DataFrame(ms) or pl = pd.DataFrame(ms.to_dict("records")) or anythin else, sir? (creating bar graph)

None of them, ms is not defined in the callback as I explained in the previous answer. You need to generate the dataframe from the df_as_dict as I wrote.

Many thanks for your precious time, sir! It worked like a charm!

1 Like

Hi @jlfsjunior,
I was very busy yesterday, and after adjusting my codes abiding by your suggestion and reviewing them quickly, I thought my code worked, but it turned out they did not.
I added dcc.Store to layout dcc.Store(id='output'), I turned the variable ms to dict return ms.to_dict("records"), I called its dict type as dataframe within other callbacks

@app.callback(

Output('profile_name_dd', 'options'),

Input('input_user_id', 'value'),

Input('output', 'data'),

)
def update_profileID_dd(input_user_id, ms_as_dict):

#Ensure the dataframe is not overwritten

mj_dd = pd.DataFrame(ms_as_dict)

#Filter the respective profiles

mj_dd = mj_dd[( (mj_dd['id'] == input_user_id) )]

return [ {'label': o, 'value': o} for o in mj_dd.completed.unique() ]

The error message says something like: KeyError: 'id', KeyError: 'date', in which id and date are columns of the data frame mj_dd

***Today I have just found a related topic: Part 4. Sharing Data Between Callbacks | Dash for Python Documentation | Plotly
But I have difficulty turning json to dataframe using pd.read_json() :frowning:
Specifically, when I created the variable mj_dd like mj_dd = pd.read_json(mj1_as_dict) and replace return [ {'label': o, 'value': o} for o in mj_dd.completed.unique() ] by print (mj_dd.completed.unique()), I realize that the data is empty :frowning:

Can you share a print of how ms_as_dict looks like inside the callback?

1 Like

Hey,

I found the solution already! The problem is here:

before:
dcc.Store(id='output')

after:
dcc.Store(id='output', data=[])

For the before part, my dcc.Store() adheres to the guideline: Part 5. Sharing Data Between Callbacks | Dash for Python Documentation | Plotly

Then, I saw this amazing video: Sharing Data between Dash Callbacks - YouTube and led me to the after part !

1 Like

Makes sense. Another approach (just FYI) would be to use prevent_initial_update=True in the callback, otherwise data will be None when you initialize the application.

Glad you figured it out though! :slight_smile:

1 Like