Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add generic histogram method for Vector{<:Any}, so we can use counts(::UniqueElements, x) on any data #349

Open
kahaaga opened this issue Dec 21, 2023 · 10 comments
Labels
enhancement New feature or request that is non-breaking good first issue Good for newcomers

Comments

@kahaaga
Copy link
Member

kahaaga commented Dec 21, 2023

Currently, we can do

julia> x = rand([1, 4, 6, 7, 8], 1000)

julia> counts(x)
 Counts{Int64,1} over 5 outcomes
 1  199
 4  215
 6  197
 7  180
 8  209

But this only works if x is sortable, because fasthist! uses sorting internally. We should have a generic (probably much slower) fallback that enables things like this to work too:

julia> y = ['a', (1, 2, 3,), 25];

julia> counts(y)
ERROR: MethodError: no method matching isless(::Tuple{Int64, Int64, Int64}, ::Char)

Closest candidates are:
  isless(::Char, ::Char)
   @ Base char.jl:214
  isless(::AbstractChar, ::AbstractChar)
   @ Base char.jl:221
  isless(::Tuple, ::Tuple{})
   @ Base tuple.jl:532
  ...

Stacktrace:
  [1] lt(o::Base.Order.ForwardOrdering, a::Tuple{Int64, Int64, Int64}, b::Char)
    @ Base.Order ./ordering.jl:117
  [2] _sort!(v::Vector{Any}, #unused#::Base.Sort.InsertionSortAlg, o::Base.Order.ForwardOrdering, kw::NamedTuple{(:scratch, :lo, :hi), Tuple{Nothing, Int64, Int64}})
    @ Base.Sort ./sort.jl:748
  [3] _sort!
    @ ./sort.jl:713 [inlined]
  [4] _sort!
    @ ./sort.jl:660 [inlined]
  [5] _sort!
    @ ./sort.jl:596 [inlined]
  [6] #sort!#28
    @ ./sort.jl:1374 [inlined]
  [7] sort!
    @ ./sort.jl:1367 [inlined]
  [8] fasthist!(x::Vector{Any})
    @ ComplexityMeasures ~/Documents/Repos/ComplexityMeasures.jl/src/encoding_implementations/fasthist.jl:21
  [9] counts_and_outcomes
    @ ~/Documents/Repos/ComplexityMeasures.jl/src/core/counts.jl:99 [inlined]
 [10] counts(x::Vector{Any})
    @ ComplexityMeasures ~/Documents/Repos/ComplexityMeasures.jl/src/core/counts.jl:121
 [11] top-level scope
    @ REPL[42]:1
@kahaaga kahaaga added enhancement New feature or request that is non-breaking good first issue Good for newcomers labels Dec 21, 2023
@kahaaga
Copy link
Member Author

kahaaga commented Dec 21, 2023

Hm. Not entirely sure about this. Should we instead require the input to be sortable (i.e. the user has to map their data to some sortable format)? That would make maintenance a bit easier.

@Datseris
Copy link
Member

I think in the future there should be a version implemented for non-sortable collections. But this is extremely low priority because a user can always map their vector to the integers corresponding to unique elements and then use count on the integer. For now we can focus on v3 and hope that some other contributor will add this method in the future.

@kahaaga
Copy link
Member Author

kahaaga commented Dec 22, 2023

I think in the future there should be a version implemented for non-sortable collections. But this is extremely low priority because a user can always map their vector to the integers corresponding to unique elements and then use count on the integer.

Agreed. Then I keep this issue open.

@Datseris
Copy link
Member

There is a difficulty I see in implementing this: there is no julia function, or type trait, that checks for sortability. So it is not clear to me how we would check whether we can call fasthist or not in the first place...

@Datseris
Copy link
Member

We could check something like

julia> hasmethod(isless, (T, T))
false

with T the element type of the vector.

@kahaaga
Copy link
Member Author

kahaaga commented Dec 22, 2023

We could check something like

julia> hasmethod(isless, (T, T))
false

with T the element type of the vector.

Yes, this should work. The fallback would typically just occur when the user has different types of elements in their input vector. Then T becomes Any and

julia> T = Any
Any

julia> hasmethod(isless, (T, T))
false

@kahaaga
Copy link
Member Author

kahaaga commented Dec 22, 2023

In the current API, the docstring of outcome_space is

"""
    outcome_space(o::OutcomeSpace, x) → Ω

Return a sorted container containing all _possible_ outcomes of `o` for input `x`.

For some estimators the concrete outcome space is known without knowledge of input `x`,
in which case the function dispatches to `outcome_space(o)`.
In general it is recommended to use the 2-argument version irrespectively of estimator.
"""
function outcome_space(o::OutcomeSpace)

For a non-sortable vector, we can't sort the elements. Therefore, we either need to relax this requirement, or disallow using non-sortable data (which I think is too restrictive atm). We could just say that "if your input data isn't sortable, then the outcome space is given in the order of first appearance of the outcomes" or something like that

@Datseris
Copy link
Member

Yes order of first appearance is fine, but at the moment non sortable data simply domm't work with the pacakge ayways

@kahaaga
Copy link
Member Author

kahaaga commented Dec 23, 2023

I haven't done any sort of testing for speed etc, but an obvious (and perhaps the easiest) solution to this issue is precisely what @Datseris says above - convert to the integers (and then convert back again). We already have the infrastructure to do this conversion, in the form of UniqueElementsEncoding.

  1. Make a method is_sortable(x::AbstractVector{T}) where T = hasmethod(isless, (T, T))
  2. In all instances where fasthist!(x, args...) is called, instead call and intermediate function fasthist_checksorted!(x, args...). It does the following:
    a. Checks if the input x is sortable.
    b. If x is sortable, then use fasthist! directly.
    c. If the input is not sortable, then create a new vector x_ints that map the unique elements of x to unique integers. To do so, use UniqueElementsEncoding on x and use encode to map the elements of x onto the integers.
    d. Then pass x_ints to fasthist! and obtain the histogram.
    e. Use decode to replace each element of x_ints with the original input.
    f. Optionally optimize counts_and_outcomes(::UniqueElements, x) by not replacing every element in x_ints. Because fasthist! gives a histogram over the unique elements of x_int, we can get away with only decoding the unique elements in x_ints.

@kahaaga
Copy link
Member Author

kahaaga commented Dec 23, 2023

In any case, anyone working on this should wait until #347 is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request that is non-breaking good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants