ENH: Add option to use nullable dtypes in read_csv #48776

phofl · 2022-09-25T20:45:54Z

closes ENH: add option to get nullable dtypes to pd.read_csv #36712
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

mroeschke · 2022-09-30T00:57:25Z

pandas/io/parsers/base_parser.py

 ----------
 values : ndarray
 na_values : set
+ cast_type: Specifies if we want to cast explicitly


Could we make this bool? Looks like we only need to check that it's not None?

mroeschke · 2022-09-30T01:02:06Z

pandas/io/parsers/base_parser.py

+    bool_mask = np.zeros(result.shape, dtype=np.bool_)
+    result = BooleanArray(result, bool_mask)
+elif result.dtype == np.object_ and use_nullable_dtypes:
+    result = StringDtype().construct_array_type()._from_sequence(values)


Could you test what happens when the string pyarrow global config is true?
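For readers following the hunk under review, here is a minimal sketch of the same wrapping step using only public pandas API. The helper name `to_nullable` is hypothetical and not part of the PR. It uses `pd.array(..., dtype=pd.StringDtype())` in place of the private `_from_sequence` call, and it leaves the string storage at whatever the global configuration selects.

```python
import numpy as np
import pandas as pd


def to_nullable(result: np.ndarray):
    """Hypothetical helper: wrap a parsed NumPy column in a nullable array."""
    if result.dtype == np.bool_:
        # An all-False mask means no value is marked as missing.
        bool_mask = np.zeros(result.shape, dtype=np.bool_)
        return pd.arrays.BooleanArray(result, bool_mask)
    if result.dtype == np.object_:
        # Storage (python or pyarrow) follows the global string option.
        return pd.array(result, dtype=pd.StringDtype())
    return result


bools = to_nullable(np.array([True, False]))
strings = to_nullable(np.array(["a", None], dtype=object))
print(bools.dtype, strings.dtype)
```

Which concrete array class backs `strings` depends on the string storage configuration, which is the case the review comment asks to have tested.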

mroeschke · 2022-09-30T01:03:21Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

+    parser = all_parsers
+
+    data = """a,b,c,d,e,f,g,h,i
+1,2.5,True,a,,,,,12-31-2019


Could you add a column here where both rows have an empty value?

Added; it casts to Int64 now in both cases. The better question is what we actually want here, because an all-empty column could be any dtype.
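To make the ambiguity concrete, here is a small sketch that does not use the new keyword. It shows what the default NumPy path does with an all-empty column and how an explicit `dtype=` resolves the choice. The two-column sample data is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Column "e" is empty in every row.
data = "a,e\n1,\n3,\n"

# Default path: nothing can be inferred, so the column becomes float64 NaN.
default = pd.read_csv(StringIO(data))

# Explicit request: the same column as a nullable integer column of pd.NA.
explicit = pd.read_csv(StringIO(data), dtype={"e": "Int64"})

print(default.dtypes)
print(explicit.dtypes)
```

An inferred nullable dtype for such a column is necessarily a convention, since the data carries no type information.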

mroeschke · 2022-09-30T17:08:32Z

Probably worth discussing in the issue, but I want to mention here since this will be the first instance of use_nullable_dtypes.

Motivation: It would be cool to see a state where read_*(engine="pyarrow") can result in a DataFrame that is backed by ArrowExtensionArray (trying to avoid the conversion to numpy).

Understandably, engine="pyarrow" doesn't do that today, and it may be fairly difficult to change (deprecate) that behavior in the future.

An alternative and easier path to my goal would be to have use_nullable_dtypes="pandas"|"pyarrow" (default "pandas") to allow the user to pick the "nullable representation". Thoughts?

phofl · 2022-09-30T17:12:06Z

Wouldn’t it be easier to do this via an option like we are currently inferring for string? I guess this should also apply to our constructors etc?

This is similar to what the final state of nullable dtypes is supposed to be: provide a global option to opt into them.

mroeschke · 2022-09-30T17:30:57Z

Wouldn’t it be easier to do this via an option like we are currently inferring for string?

Hmm, so the end-state idea could be to have a global option like mode.nullable=None|"pandas"|"pyarrow", where the dtype/array representation is either numpy, pd.array & pd.NA, or pa.array & pd.NA, more or less consistently?

I am taking a particular focus on IO methods since I am hoping to avoid the jump from pa.Table -> np.array -> pa.ChunkedArray (in theory) with engine="pyarrow" and just have pa.Table -> pa.ChunkedArray

phofl · 2022-09-30T17:33:34Z

Currently the idea is to make a global option to opt into nullable dtypes, yes. I think we can most certainly make this into a three way option to allow arrow too.

But wouldn’t this cause problems on the first operation done with an object backed by a numpy array?

mroeschke · 2022-09-30T18:29:04Z

But wouldn’t this cause problems on the first operation done with an object backed by a numpy array?

Are you referring to the IO conversion I mentioned?

phofl · 2022-09-30T18:30:44Z

Ah no, sorry. More like if you get a pyarrow backed object from IO but a NDArray backed object from a constructor and you want to combine them somehow (concat, merge, ...)

Basically what I wanted to ask: Wouldn't it make more sense if everything could be backed by arrow if a single flag is set to avoid these inconsistencies?
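As a rough illustration of the mixed-backing situation described here, the sketch below combines a NumPy-backed Series with a nullable one. It uses the masked `Int64` dtype as a stand-in for an arrow-backed column, which is an assumption made so the example runs without pyarrow. The arrow case discussed in the thread may behave differently.

```python
import pandas as pd

# Stand-ins: one object as a constructor would build it by default,
# one as a nullable-aware reader might return it.
from_constructor = pd.Series([1, 2], dtype="int64")  # NumPy-backed
from_io = pd.Series([3, None], dtype="Int64")        # masked, uses pd.NA

# Combining them forces pandas to pick a common dtype.
combined = pd.concat([from_constructor, from_io], ignore_index=True)
print(combined.dtype)  # Int64
```

Here the result is coerced to the nullable dtype. A single global flag would avoid relying on such per-operation coercion rules.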

mroeschke · 2022-09-30T18:40:50Z

Wouldn't it make more sense if everything could be backed by arrow if a single flag is set to avoid these inconsistencies?

Ah yeah, most definitely. I haven't really encountered/thought too hard about the op(arrow-backed, ndarray-backed) outcomes yet, but a global option would hopefully avoid this.

As long as use_nullable_dtypes=True + a global config can lead to maintaining arrow objects from parsing, that would satisfy my goal.

mroeschke

Looked pretty good. Could you merge in main once more?

lithomas1 · 2022-10-01T23:23:24Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

 parser.read_csv(StringIO(data), dtype=dtype)
+
+
+def test_use_nullabla_dtypes(all_parsers):


nit: typo here and below.

lithomas1 · 2022-10-01T23:24:25Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

+3,4.5,False,b,6,7.5,True,a,12-31-2019,
+"""
+    result = parser.read_csv(
+        StringIO(data), use_nullable_dtypes=True, parse_dates=["i"]


Can you parametrize for use_nullable_dtypes = True/False here and for the other tests?

No, this would be impossible to understand if parametrized, because expected looks completely different. I could add a new test in theory, but it would not bring much value; we are already testing all possible cases with numpy dtypes.

OK, thanks for checking.

mroeschke · 2022-10-07T16:41:40Z

Thanks @phofl

ENH: Add option to use nullable dtypes in read_csv (pandas-dev#48776)

* ENH: Add option to use nullable dtypes in read_csv
* Finish implementation
* Update
* Fix mypy
* Add tests and fix call
* Fix typo

phofl added 3 commits September 21, 2022 21:13

ENH: Add option to use nullable dtypes in read_csv
d1c0b51

Finish implementation
d7a7eca

Update
4f05540

phofl added the IO CSV (read_csv, to_csv) and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) labels Sep 25, 2022

phofl marked this pull request as draft September 25, 2022 20:46

Merge remote-tracking branch 'upstream/main' into use_nullable_dtypes
afe9ca5

vuule mentioned this pull request Sep 28, 2022
[BUG] cudf.read_csv should not cast to floating types if there are null entries in csv rapidsai/cudf#6313
Open

Fix mypy
af6056b

phofl force-pushed the use_nullable_dtypes branch from d79a552 to af6056b September 29, 2022 08:13

Merge remote-tracking branch 'upstream/main' into use_nullable_dtypes
8b0dc2f

phofl marked this pull request as ready for review September 29, 2022 08:14

Merge remote-tracking branch 'upstream/main' into use_nullable_dtypes
291e761

mroeschke reviewed Sep 30, 2022

Add tests and fix call
8a4d206

mroeschke reviewed Oct 7, 2022

lithomas1 reviewed Oct 7, 2022

phofl added 2 commits October 7, 2022 13:45

Fix typo
30d68a8

Merge remote-tracking branch 'upstream/main' into use_nullable_dtypes
6139d87

lithomas1 approved these changes Oct 7, 2022

mroeschke added this to the 1.6 milestone Oct 7, 2022

mroeschke approved these changes Oct 7, 2022

mroeschke merged commit 7f24bff into pandas-dev:main Oct 7, 2022

zain581 added a commit to zain581/pandas that referenced this pull request Oct 7, 2022
Merge pull request #2 from pandas-dev/main
a8737bf
ENH: Add option to use nullable dtypes in read_csv (pandas-dev#48776)

phofl deleted the use_nullable_dtypes branch October 7, 2022 17:09

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

coroa mentioned this pull request Oct 17, 2022
BUG: pd.read_csv with use_nullable_dtypes incompatible with dtype coercion #49146
Closed
2 tasks
