Hi,
TLDR:
Steps to reproduce:
import dask.dataframe as dd
import pandas as pd
pd.read_csv('https://plot.ly/~bdun9/2754.csv')  # succeeds
dd.read_csv('https://plot.ly/~bdun9/2754.csv')  # fails
Expected Result:
Identical output from dd.read_csv(), with significantly better scalability and performance on large file reads.
Error:
pd.read_csv() executes without error, while dd.read_csv() fails.
Error Message:
requests.exceptions.HTTPError: 405 Client Error: Method Not Allowed for url: https://plot.ly/~bdun9/2754.csv
Full Description:
I’ve been experimenting with Plotly and Dash using the following repository as a sandbox: https://github.com/plotly/dash-vanguard-report.
While testing the performance difference between Pandas and Dask, I ran into something that appears to be either a bug or an intentional security measure. When reading a CSV with dd.read_csv(url), Dask performs a HEAD request on the Plotly URL above. It needs this to determine the Content-Length value so that it can split the file into blocks and parallelize the underlying pandas.read_csv() calls, which is where the performance improvement on large files comes from.
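To illustrate why the file size matters, here is a minimal sketch of the idea. It is my own illustration, not Dask's actual code: `byte_ranges` and the block size are made up for the example. Given the total size reported by Content-Length, the file can be cut into byte ranges that separate workers could then read independently.

```python
from typing import List, Tuple


def byte_ranges(content_length: int, blocksize: int) -> List[Tuple[int, int]]:
    """Split [0, content_length) into (start, end) offsets of at most blocksize bytes.

    Hypothetical helper for illustration only; not part of Dask's API.
    """
    if blocksize <= 0:
        raise ValueError("blocksize must be positive")
    return [
        (start, min(start + blocksize, content_length))
        for start in range(0, content_length, blocksize)
    ]


# A 10-byte file read in 4-byte blocks yields three ranges.
print(byte_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Without a known total size, there is no way to compute these ranges up front, which would explain why the size lookup happens before any data is read.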
The error message was the following: “requests.exceptions.HTTPError: 405 Client Error: Method Not Allowed for url:…”
Is there a reason that HEAD requests are not allowed by the server on this csv url?
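For anyone who wants to confirm that only the HEAD method is rejected, here is a small probe using just the standard library. `status_for` is a helper name I made up, and the 10-second timeout is an arbitrary choice; it simply returns the HTTP status code for a given method instead of raising.

```python
import urllib.error
import urllib.request


def status_for(method: str, url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server gives for `method` on `url`.

    Hypothetical helper: 4xx/5xx responses are returned as codes
    rather than raised as exceptions.
    """
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Example usage (needs network access):
# for method in ("HEAD", "GET"):
#     print(method, status_for(method, "https://plot.ly/~bdun9/2754.csv"))
```

Given the error above, I would expect HEAD to come back as 405 while GET succeeds, but I have not included output here since it depends on the server.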
Any ideas for how I could resolve or work around this? If the performance improvement from using Dask is as good as the hype, it could IMO have great utility for Plotly and Dash.
Thanks!