Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 19.4k
ENH: Add option to use nullable dtypes in read_csv#48776
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
d79a552 to af6056bComparepandas/io/parsers/base_parser.py Outdated
| ---------- | ||
| values : ndarray | ||
| na_values : set | ||
| cast_type: Specifies if we want to cast explicitly |
mroeschkeSep 30, 2022 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this bool? Looks like we only need to check that it's not None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
| bool_mask=np.zeros(result.shape, dtype=np.bool_) | ||
| result=BooleanArray(result, bool_mask) | ||
| elifresult.dtype==np.object_anduse_nullable_dtypes: | ||
| result=StringDtype().construct_array_type()._from_sequence(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you test what happens when the string pyarrow global config is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| parser=all_parsers | ||
| data="""a,b,c,d,e,f,g,h,i | ||
| 1,2.5,True,a,,,,,12-31-2019 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a column here where both rows have an empty value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, casts to Int64 now in both cases. Better question is what we actually want here, because this could be everything
mroeschke commented Sep 30, 2022
Probably worth discussing in the issue, but I want to mention here since this will be the first instance of Motivation: It would be cool to see a state where Understandably An alternative and easier path to my goal would be to have |
phofl commented Sep 30, 2022
Wouldn’t it be easier to do this via an option like we are currently inferring for string? I guess this should also apply to our constructors etc? Similar to what the final state of nullable is supposed to be. Provide a global option to opt into them |
mroeschke commented Sep 30, 2022
Hmm so the end state idea could be have like a global option like I am taking a particular focus on IO methods since I am hoping to avoid the jump from pa.Table -> np.array -> pa.ChunkedArray (in theory) with |
phofl commented Sep 30, 2022
Currently the idea is to make a global option to opt into nullable dtypes, yes. I think we can most certainly make this into a three way option to allow arrow too. But wouldn’t this cause problems on the first operation done with a object backed by a numpy array? |
mroeschke commented Sep 30, 2022
Are you referring to the IO conversion I mentioned? |
phofl commented Sep 30, 2022
Ah no, sorry. More like if you get a pyarrow backed object from IO but a NDArray backed object from a constructor and you want to combine them somehow (concat, merge, ...) Basically what I wanted to ask: Wouldn't it make more sense if everything could be backed by arrow if a single flag is set to avoid these inconsistencies? |
mroeschke commented Sep 30, 2022
Ah yeah, most definitely. I haven't really encountered/thought too hard about the As long as |
mroeschke left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked pretty good. Could you merge in main once more?
| parser.read_csv(StringIO(data), dtype=dtype) | ||
| deftest_use_nullabla_dtypes(all_parsers): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: typo here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, fixed
| 3,4.5,False,b,6,7.5,True,a,12-31-2019, | ||
| """ | ||
| result=parser.read_csv( | ||
| StringIO(data), use_nullable_dtypes=True, parse_dates=["i"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize for use_nullable_dtypes = True/False here and for the other tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is impossible to understand if paramterized. Expected looks completely different. I could add a new test in theory, but would not bring much value, we are testing all possible cases already with numpy dtypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for checking.
mroeschke commented Oct 7, 2022
Thanks @phofl |
ENH: Add option to use nullable dtypes in read_csv (pandas-dev#48776)
* ENH: Add option to use nullable dtypes in read_csv * Finish implementation * Update * Fix mypy * Add tests and fix call * Fix typo
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.