dask-awkward array loaded by dak.from_parquet and field-sliced is not populated (has PlaceholderArrays) #501
Since #491 is in progress, can we test with that code, rather than trying to fix code that's about to disappear? |
Good idea. I'll check it on that git-branch. |
On that git-branch, the ROOT file raises an error:

>>> import uproot
>>> result = uproot.dask("dak-issue-501.root")["AnalysisJetsAuxDyn.pt"].compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 277, in dask
    return _get_dak_array(
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1560, in _get_dak_array
    return dask_awkward.from_map(
  File "/tmp/dask-awkward/src/dask_awkward/lib/io/io.py", line 630, in from_map
    form_with_unique_keys(io_func.form, "@"),
AttributeError: '_UprootRead' object has no attribute 'form'

and the Parquet file still has PlaceholderArrays:

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501.parquet")["AnalysisJetsAuxDyn.pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>[## ... ##]</Index></offsets>
    <content><NumpyArray dtype='float32' len='##'>[## ... ##]</NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8eaa0>
>>> result.layout.content.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8ead0> |
I think this must be because of the name of the one field containing a "." character, which is also used to indicate nesting. |
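A minimal pure-Python sketch of that ambiguity (a hypothetical `resolve` helper, not dask-awkward's actual code): a reader that splits a requested column path on "." cannot distinguish a flat field literally named "AnalysisJetsAuxDyn.pt" from a field "pt" nested inside a record "AnalysisJetsAuxDyn".

```python
# Sketch of why a "." in a field name is ambiguous for column selection.
# (Hypothetical `resolve` helper; not dask-awkward's actual code.)
def resolve(form: dict, path: str):
    node = form
    for part in path.split("."):  # naive split breaks on dotted names
        node = node["fields"][part]
    return node

# a genuinely nested field resolves fine:
nested = {"fields": {"a": {"fields": {"b": "leaf"}}}}
assert resolve(nested, "a.b") == "leaf"

# but a flat field whose name contains a dot does not:
flat = {"fields": {"AnalysisJetsAuxDyn.pt": "leaf"}}
try:
    resolve(flat, "AnalysisJetsAuxDyn.pt")
except KeyError as err:
    print("lookup failed:", err)  # KeyError: 'AnalysisJetsAuxDyn'
```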
That's right, it is:

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501-nodot.parquet")["AnalysisJetsAuxDyn_pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>
        [ 0 9 16 ... 189827 189839 189853]
    </Index></offsets>
    <content><NumpyArray dtype='float32' len='189853'>
        [131514.36 126449.445 112335.195 ... 12131.118 13738.865 10924.1 ]
    </NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
array([ 0, 9, 16, ..., 189827, 189839, 189853])
>>> result.layout.content.data
array([131514.36 , 126449.445, 112335.195, ..., 12131.118, 13738.865,
       10924.1 ], dtype=float32)

But we should expect that field names can contain any characters, right? When I look up "parquet column names dot", I see many instances of people doing this in Spark, which uses backticks to keep a dot from being interpreted as nesting. Handling column names with dots in them (by requiring such columns to be backtick-quoted) might need to be implemented in |
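The backtick convention mentioned above could look like the following sketch (a hypothetical `split_path` helper; dask-awkward does not currently implement this): dots split the path unless they appear inside backticks.

```python
# Sketch of Spark-style backtick quoting applied to column-path
# splitting. (Hypothetical helper, not an existing dask-awkward API.)
def split_path(path: str) -> list[str]:
    parts, buf, quoted = [], [], False
    for ch in path:
        if ch == "`":
            quoted = not quoted  # toggle quoting; backticks are dropped
        elif ch == "." and not quoted:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts

print(split_path("event.jets.pt"))            # ['event', 'jets', 'pt']
print(split_path("`AnalysisJetsAuxDyn.pt`"))  # ['AnalysisJetsAuxDyn.pt']
```

A caller that always backtick-quotes literal field names before building a column path would then be safe regardless of what characters the names contain.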
Actually, how does the dot cause column optimization to fail? What assumption is being broken? If scikit-hep/awkward#3088 is fixed, what else would be needed? |
I haven't spotted a place where we assume dots to be special, but I suspect that Parquet itself might have this convention built in (or not). |
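One way the mix-up could play out, as a toy model (invented names, not awkward's or dask-awkward's actual buffer-tracking code): column optimization does a dry run that records which buffers are "touched", then reads only those; if the dotted field name is mangled anywhere between the dry run and the read, the touched keys match nothing and every buffer stays a placeholder.

```python
# Toy model of touch-based column optimization leaving placeholders.
# (All names here are invented for illustration.)
class Placeholder:
    """Stand-in for a buffer that was never read from disk."""
    def __init__(self, name):
        self.name = name

def materialize(buffers, touched):
    # only buffers whose keys were recorded as touched get real data
    return {name: data if name in touched else Placeholder(name)
            for name, data in buffers.items()}

buffers = {
    "AnalysisJetsAuxDyn.pt-offsets": [0, 9, 16],
    "AnalysisJetsAuxDyn.pt-data": [131514.36, 126449.445],
}
# if the dotted name was split on "." during the dry run, the recorded
# key no longer matches the buffer key, so nothing is materialized:
out = materialize(buffers, touched={"AnalysisJetsAuxDyn-offsets"})
print(type(out["AnalysisJetsAuxDyn.pt-offsets"]).__name__)  # Placeholder
```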
This ZIP contains a ROOT file and a Parquet file.
dak-issue-501.zip
If we open it with
uproot.dask
, extract one field and compute it, we get what we expect. But if we open it with
dak.from_parquet
, extract one field and compute it, the field is populated with a PlaceholderArray. And of course, that would cause trouble downstream.
What happened here? This is almost the simplest case of column optimization that one could have.