Error on Dask.read_csv(url): 405 Method Not Allowed for url:

Hi,

TLDR:
Steps to reproduce:
import dask.dataframe as dd
import pandas as pd
pd.read_csv('https://plot.ly/~bdun9/2754.csv')  # <-- succeeds
dd.read_csv('https://plot.ly/~bdun9/2754.csv')  # <-- fails

Expected Result:
identical output from dd.read_csv(), with significantly better scalability and performance on large file reads

Error:
pandas.read_csv() executes without error while dask.read_csv() fails

Error Message:
requests.exceptions.HTTPError: 405 Client Error: Method Not Allowed for url: https://plot.ly/~bdun9/2754.csv

Full Description:
I've been experimenting with Plotly and Dash using the following repository as a sandbox: https://github.com/plotly/dash-vanguard-report.

As I was looking to test the performance difference between Pandas and Dask, I ran into something that appears to be either a bug or an intentional security measure. When reading a CSV with dd.read_csv(url), Dask performs a HEAD request on the Plotly URL specified above. This is required to determine the Content-Length of the file so that Dask can parallelize its calls to pd.read_csv() for a significant performance improvement when reading large files.
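The failure mode described above can be reproduced without touching plot.ly at all. The sketch below stands up a hypothetical local server (not the real plot.ly endpoint) that answers GET but refuses HEAD with 405, which is exactly the asymmetry that lets pandas succeed while a HEAD-based size probe fails:

```python
# Toy reproduction (hypothetical local server, not plot.ly itself):
# GET works, HEAD returns 405, mirroring the behavior in the post.
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NoHeadHandler(BaseHTTPRequestHandler):
    BODY = b"a,b\n1,2\n"

    def do_GET(self):
        # Serve the CSV body normally, as plot.ly does for GET.
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(self.BODY)))
        self.end_headers()
        self.wfile.write(self.BODY)

    def do_HEAD(self):
        # Mimic the behavior seen in the post: HEAD is refused.
        self.send_response(405)
        self.end_headers()

    def log_message(self, *args):
        # Silence per-request logging to keep output clean.
        pass


server = HTTPServer(("127.0.0.1", 0), NoHeadHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/data.csv"

# GET succeeds -- this is the path pandas.read_csv takes...
body = urllib.request.urlopen(url).read()

# ...while HEAD fails with 405 -- the request a size probe would make.
req = urllib.request.Request(url, method="HEAD")
try:
    urllib.request.urlopen(req)
    head_status = 200
except urllib.error.HTTPError as e:
    head_status = e.code

print(body, head_status)  # b'a,b\n1,2\n' 405
server.shutdown()
```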

The error message was the following: “requests.exceptions.HTTPError: 405 Client Error: Method Not Allowed for url:…”

Is there a reason that HEAD requests are not allowed by the server on this csv url?

Any ideas for how I could resolve or work around this? If the performance improvement from using Dask is as good as the hype, it could, IMO, have great utility for Plotly and Dash.

Thanks!

Hi @cedricmckinnie,

I glanced through the dask source code (https://github.com/dask/dask/blob/master/dask/dataframe/io/csv.py) but I didn't spot what you meant by a HEAD request failing. In any case, let's start this as a dask issue (https://github.com/dask/dask/issues) since it works in pandas. You can cc my github handle in the issue (@jonmmease).

As a side note, this is not a situation where I would expect much speedup from using dask. If you're loading from a single csv file, dask will still need to download the whole thing in one shot before splitting it up locally for parsing. Dask would have an easier time if you split your csv file into a collection of smaller files and then load the collection all at once using a glob pattern. Also, make sure your data set is at a large enough scale. You'll probably need to have at least 10s of megabytes per partition to see much improvement from using dask (maybe at least 100MB total).
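A minimal sketch of the split-then-glob layout suggested above (the file names and chunk size are invented for illustration; the dask call is left in a comment so the snippet itself only needs the standard library):

```python
# Split one logical table into several part files so that a glob-based
# reader (such as dask.dataframe.read_csv) can take one partition per file.
import csv
import glob
import os
import tempfile

tmp = tempfile.mkdtemp()
rows = [{"x": i, "y": i * i} for i in range(100)]

# Write the data as four part files instead of one monolithic CSV.
chunk = 25
for part, start in enumerate(range(0, len(rows), chunk)):
    with open(os.path.join(tmp, f"part-{part}.csv"), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["x", "y"])
        writer.writeheader()
        writer.writerows(rows[start:start + chunk])

# Dask can then read the whole collection with one glob pattern,
# roughly one partition per file (assumes dask is installed):
#   import dask.dataframe as dd
#   ddf = dd.read_csv(os.path.join(tmp, "part-*.csv"))

parts = sorted(glob.glob(os.path.join(tmp, "part-*.csv")))
print(len(parts))  # 4
```

Keep in mind Jon's caveat above: with only a few rows per file the overhead dominates, so in practice each part file should be tens of megabytes.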

Hope that helps get you started,
-Jon