This is great to see; it was my biggest gripe/grief with pandas back in the day and I'm glad the project is maturing and taking this more seriously.

As one of the primary implementors of `missing::Missing` in #julialang, I had a few thoughts reading over the proposal: 1/ https://twitter.com/jorisvdbossche/status/1296793636055777282
It's amazing how much being able to represent `Union{T, Missing}` formally in #julialang makes a difference. I.e. a column type can literally be `Vector{Union{Float64, Missing}}`, which means I have an array of elements that are Float64 or Missing.
This solves the issues of: how do I tell if there are or might be missing values in an array? And how do I create an empty array of a specific type that might have missing values? And subsequently, what's the type of an array with only missing values? (Vector{Missing})
These were all design questions brought up in the pandas proposal. The other wonderful thing about Union{T, Missing} is that it generalizes beyond just Arrays; I can use that as the type of a field in a custom struct.
Oh, and how about the fact that I could create my own custom Missing type (Undefined? Unspecified? RefusedToAnswer?) and have *all* *the* *internal* *optimizations* just work. Super cool! Like, it wouldn't be more than a couple hundred LOC to make your own missing type.
Which makes me shudder a little thinking about the implementation of NA for pandas. You essentially have to design for two languages: the Python side and the C/C++ internals. And as noted in the proposal, the breaking change/introduction of this is formidable.
You never want to have to break things in major ways, but I agree w/ the proposal that it would be worth a fully consistent missing-value story for pandas.
On the 3-value logic point, I'd strongly vote for it vs. NaN semantics. I still hear horror stories of analyses-gone-wrong because unaccounted-for missing values completely change statistics/ratios. *3-value logic forces users to DEAL w/ missing values, which they absolutely should*
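To make the NaN-vs-3-value-logic point concrete, here's a rough stdlib-only Python sketch. `NAType`, `NA`, and `share_positive` are hypothetical names made up for illustration; this is not how pandas or Julia implement their missing values, just the general idea: a NaN quietly compares as False and skews a ratio, while a 3-valued sentinel refuses to be coerced to True/False.

```python
import math


class NAType:
    """Hypothetical missing-value sentinel (not a pandas or Julia API)."""

    def __repr__(self):
        return "NA"

    def __bool__(self):
        # Refusing to coerce to True/False is what forces callers to decide.
        raise TypeError("boolean value of NA is ambiguous")

    def __gt__(self, other):
        # Comparisons propagate: the answer is unknown.
        return self

    def __and__(self, other):
        # Kleene AND: a definite False wins over the unknown operand.
        return False if other is False else self

    __rand__ = __and__

    def __or__(self, other):
        # Kleene OR: a definite True wins over the unknown operand.
        return True if other is True else self

    __ror__ = __or__


NA = NAType()


def share_positive(values):
    """Fraction of values that are > 0, with no special missing handling."""
    hits = [v for v in values if v > 0]
    return len(hits) / len(values)


# NaN semantics: `nan > 0` is False, so the NaN is silently counted
# as a non-positive observation and the ratio comes out as 2/4.
nan_result = share_positive([1.0, -2.0, math.nan, 3.0])

# 3-value logic: `NA > 0` is NA, and using NA as a condition raises.
try:
    share_positive([1.0, -2.0, NA, 3.0])
    na_outcome = "no error"
except TypeError:
    na_outcome = "TypeError"

# The user has to decide explicitly; here, by skipping missing values (2/3).
observed = [v for v in [1.0, -2.0, NA, 3.0] if v is not NA]
explicit_result = share_positive(observed)
```

The NaN version happily returns a number; the NA version makes you pick a policy before you get one.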
In #julialang, this has resulted in useful tools like skipmissing, dropmissing!, and coalesce to make dealing w/ missing values as convenient as possible. There's also an extremely promising version of passmissing in Missings.jl that allows arbitrary lifting over operations.
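For readers who don't know the Julia tools, here's a loose Python analogy of what three of them do. These are hypothetical re-implementations that use `None` as a stand-in for `missing`; they only borrow the Julia names and are not the Julia or Missings.jl code.

```python
from functools import wraps


def skipmissing(values):
    """Iterate over values, skipping the missing ones."""
    return (v for v in values if v is not None)


def coalesce(*args):
    """Return the first non-missing argument, or None if all are missing."""
    for a in args:
        if a is not None:
            return a
    return None


def passmissing(f):
    """Lift f: return missing if any positional argument is missing."""
    @wraps(f)
    def lifted(*args):
        if any(a is None for a in args):
            return None
        return f(*args)
    return lifted


ages = [31, None, 45]

# Aggregate over only the observed values.
total = sum(skipmissing(ages))

# Replace missing values with an explicit default.
filled = [coalesce(a, 0) for a in ages]

# Lift a function that knows nothing about missing values.
safe_upper = passmissing(str.upper)
names = [safe_upper(n) for n in ["ann", None]]
```

The lifting wrapper is the interesting one: `str.upper` was never written with missing values in mind, yet the lifted version propagates them without any change to the original function.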
Which leads me to my biggest question for pandas: how do they plan on handling non-comparison missing propagation? This doesn't seem to be discussed much in the proposal, but I know it still generates discussion in the Julia world.
Perhaps the "world" of pandas is limited/fenced in enough that NA propagation is "easy"? In #julialang we have the issue that the language is so easily extendable, that we want missing propagation to Just Work for any package/user code anywhere.
Anyway, super cool to see the progress here and read thru details of the implementation; brings back fond memories of past JuliaCons where we hashed out a lot of these details for #julialang. Best of luck!
As a last follow up: it's fun to see #julialang and the work we've put into `missing` show up so much in pandas proposal/issue discussions. I do think we got a lot right, even if there have been wrinkles/compiler work to be ironed out.
You can follow @quinn_jacobd.