Error on Dask.read_csv(url): 405 Method Not Allowed for url:

Hi,

TLDR:
Steps to reproduce:
import dask.dataframe as dd
import pandas as pd
pd.read_csv('https://plot.ly/~bdun9/2754.csv')  # <-- succeeds
dd.read_csv('https://plot.ly/~bdun9/2754.csv')  # <-- fails

Expected Result:
identical output from dd.read_csv(), with significantly better scalability and performance on large file reads

Error:
pandas.read_csv() executes without error while dask.read_csv() fails

Error Message:
requests.exceptions.HTTPError: 405 Client Error: Method Not Allowed for url: https://plot.ly/~bdun9/2754.csv

Full Description:
I've been experimenting with Plotly and Dash using the following repository as a sandbox: https://github.com/plotly/dash-vanguard-report.

As I was looking to test the performance difference between Pandas and Dask, I ran into something that appears to be either a bug or an intentional security measure. When reading a CSV with dd.read_csv(url), Dask performs a HEAD request on the Plotly URL specified above. This is required to determine the Content-Length of the file so that Dask can parallelize its calls to pd.read_csv() for a significant performance improvement when reading large files.
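The failure mode described above can be reproduced without touching plot.ly at all. The sketch below stands up a hypothetical local server (not the real plot.ly endpoint) that answers GET but refuses HEAD with 405, which is exactly the asymmetry that lets pandas succeed while a HEAD-based size probe fails:

```python
# Toy reproduction (hypothetical local server, not plot.ly itself):
# GET works, HEAD returns 405, mirroring the behavior in the post.
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NoHeadHandler(BaseHTTPRequestHandler):
    BODY = b"a,b\n1,2\n"

    def do_GET(self):
        # Serve the CSV body normally, as plot.ly does for GET.
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(self.BODY)))
        self.end_headers()
        self.wfile.write(self.BODY)

    def do_HEAD(self):
        # Mimic the behavior seen in the post: HEAD is refused.
        self.send_response(405)
        self.end_headers()

    def log_message(self, *args):
        # Silence per-request logging to keep output clean.
        pass


server = HTTPServer(("127.0.0.1", 0), NoHeadHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/data.csv"

# GET succeeds -- this is the path pandas.read_csv takes...
body = urllib.request.urlopen(url).read()

# ...while HEAD fails with 405 -- the request a size probe would make.
req = urllib.request.Request(url, method="HEAD")
try:
    urllib.request.urlopen(req)
    head_status = 200
except urllib.error.HTTPError as e:
    head_status = e.code

print(body, head_status)  # b'a,b\n1,2\n' 405
server.shutdown()
```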

The error message was the following: “requests.exceptions.HTTPError: 405 Client Error: Method Not Allowed for url:…”

Is there a reason that HEAD requests are not allowed by the server on this csv url?

Any ideas for how I could resolve or work around this? If the performance improvement from using Dask is as good as the hype, it could, IMO, have great utility for Plotly and Dash.

Thanks!

Hi @cedricmckinnie,

I glanced through the dask source code (https://github.com/dask/dask/blob/master/dask/dataframe/io/csv.py) but I didn't spot what you meant by a HEAD request failing. In any case, let's start this as a dask issue (https://github.com/dask/dask/issues) since it works in pandas. You can cc my github handle in the issue (@jonmmease).

As a side note, this is not a situation where I would expect much speedup from using dask. If you're loading from a single csv file, dask will still need to download the whole thing in one shot before splitting it up locally for parsing. Dask would have an easier time if you split your csv file into a collection of smaller files and then load the collection all at once using a glob pattern. Also, make sure your data set is at a large enough scale. You'll probably need to have at least 10s of megabytes per partition to see much improvement from using dask (maybe at least 100MB total).
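A minimal sketch of the split-then-glob layout suggested above (the file names and chunk size are invented for illustration; the dask call is left in a comment so the snippet itself only needs the standard library):

```python
# Split one logical table into several part files so that a glob-based
# reader (such as dask.dataframe.read_csv) can take one partition per file.
import csv
import glob
import os
import tempfile

tmp = tempfile.mkdtemp()
rows = [{"x": i, "y": i * i} for i in range(100)]

# Write the data as four part files instead of one monolithic CSV.
chunk = 25
for part, start in enumerate(range(0, len(rows), chunk)):
    with open(os.path.join(tmp, f"part-{part}.csv"), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["x", "y"])
        writer.writeheader()
        writer.writerows(rows[start:start + chunk])

# Dask can then read the whole collection with one glob pattern,
# roughly one partition per file (assumes dask is installed):
#   import dask.dataframe as dd
#   ddf = dd.read_csv(os.path.join(tmp, "part-*.csv"))

parts = sorted(glob.glob(os.path.join(tmp, "part-*.csv")))
print(len(parts))  # 4
```

Keep in mind Jon's caveat above: with only a few rows per file the overhead dominates, so in practice each part file should be tens of megabytes.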

Hope that helps get you started,
-Jon