dask-awkward array loaded by dak.from_parquet and field-sliced is not populated (has PlaceholderArrays) #501
Since #491 is in progress, can we test with that code, rather than trying to fix code that's about to disappear? |
Good idea. I'll check it on that git-branch. |
On that git-branch, the ROOT file raises an error:

>>> import uproot
>>> result = uproot.dask("dak-issue-501.root")["AnalysisJetsAuxDyn.pt"].compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 277, in dask
    return _get_dak_array(
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1560, in _get_dak_array
    return dask_awkward.from_map(
  File "/tmp/dask-awkward/src/dask_awkward/lib/io/io.py", line 630, in from_map
    form_with_unique_keys(io_func.form, "@"),
AttributeError: '_UprootRead' object has no attribute 'form'

and the Parquet file still has PlaceholderArrays:

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501.parquet")["AnalysisJetsAuxDyn.pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>[## ... ##]</Index></offsets>
    <content><NumpyArray dtype='float32' len='##'>[## ... ##]</NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8eaa0>
>>> result.layout.content.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8ead0> |
I think this must be because of the name of the one field containing a "." character, which is also used to indicate nesting. |
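A minimal pure-Python sketch of that ambiguity (a hypothetical `resolve` helper, not dask-awkward's actual code): a reader that splits a requested column path on "." cannot distinguish a flat field literally named "AnalysisJetsAuxDyn.pt" from a field "pt" nested inside a record "AnalysisJetsAuxDyn".

```python
# Sketch of why a "." in a field name is ambiguous for column selection.
# (Hypothetical `resolve` helper; not dask-awkward's actual code.)
def resolve(form: dict, path: str):
    node = form
    for part in path.split("."):  # naive split breaks on dotted names
        node = node["fields"][part]
    return node

# a genuinely nested field resolves fine:
nested = {"fields": {"a": {"fields": {"b": "leaf"}}}}
assert resolve(nested, "a.b") == "leaf"

# but a flat field whose name contains a dot does not:
flat = {"fields": {"AnalysisJetsAuxDyn.pt": "leaf"}}
try:
    resolve(flat, "AnalysisJetsAuxDyn.pt")
except KeyError as err:
    print("lookup failed:", err)  # KeyError: 'AnalysisJetsAuxDyn'
```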
That's right, it is:

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501-nodot.parquet")["AnalysisJetsAuxDyn_pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>
        [ 0 9 16 ... 189827 189839 189853]
    </Index></offsets>
    <content><NumpyArray dtype='float32' len='189853'>
        [131514.36 126449.445 112335.195 ... 12131.118 13738.865 10924.1 ]
    </NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
array([ 0, 9, 16, ..., 189827, 189839, 189853])
>>> result.layout.content.data
array([131514.36 , 126449.445, 112335.195, ..., 12131.118, 13738.865,
       10924.1 ], dtype=float32)

But we should expect that field names can contain any characters, right? When I look up "parquet column names dot", I see many instances of people doing this in Spark, which uses backticks to keep a dot from being interpreted as nesting. Handling column names with dots in them (by requiring such columns to be backtick-quoted) might need to be implemented in |
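The backtick convention mentioned above could look like the following sketch (a hypothetical `split_path` helper; dask-awkward does not currently implement this): dots split the path unless they appear inside backticks.

```python
# Sketch of Spark-style backtick quoting applied to column-path
# splitting. (Hypothetical helper, not an existing dask-awkward API.)
def split_path(path: str) -> list[str]:
    parts, buf, quoted = [], [], False
    for ch in path:
        if ch == "`":
            quoted = not quoted  # toggle quoting; backticks are dropped
        elif ch == "." and not quoted:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts

print(split_path("event.jets.pt"))            # ['event', 'jets', 'pt']
print(split_path("`AnalysisJetsAuxDyn.pt`"))  # ['AnalysisJetsAuxDyn.pt']
```

A caller that always backtick-quotes literal field names before building a column path would then be safe regardless of what characters the names contain.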
Actually, how does the dot cause column optimization to fail? What assumption is being broken? If scikit-hep/awkward#3088 is fixed, what else would be needed? |
I haven't spotted a place where we assume dots to be special, but I suspect that Parquet itself might have this convention built in (or not). |
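One way the mix-up could play out, as a toy model (invented names, not awkward's or dask-awkward's actual buffer-tracking code): column optimization does a dry run that records which buffers are "touched", then reads only those; if the dotted field name is mangled anywhere between the dry run and the read, the touched keys match nothing and every buffer stays a placeholder.

```python
# Toy model of touch-based column optimization leaving placeholders.
# (All names here are invented for illustration.)
class Placeholder:
    """Stand-in for a buffer that was never read from disk."""
    def __init__(self, name):
        self.name = name

def materialize(buffers, touched):
    # only buffers whose keys were recorded as touched get real data
    return {name: data if name in touched else Placeholder(name)
            for name, data in buffers.items()}

buffers = {
    "AnalysisJetsAuxDyn.pt-offsets": [0, 9, 16],
    "AnalysisJetsAuxDyn.pt-data": [131514.36, 126449.445],
}
# if the dotted name was split on "." during the dry run, the recorded
# key no longer matches the buffer key, so nothing is materialized:
out = materialize(buffers, touched={"AnalysisJetsAuxDyn-offsets"})
print(type(out["AnalysisJetsAuxDyn.pt-offsets"]).__name__)  # Placeholder
```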
This ZIP contains a ROOT file and a Parquet file.
dak-issue-501.zip
If we open it with
uproot.dask
, extract one field and compute it, we get what we expect. But if we open it with
dak.from_parquet
, extract one field and compute it, the field is populated with a PlaceholderArray. And of course, that would cause trouble downstream.
What happened here? This is almost the simplest case of column optimization that one could have.