Series combine#821

Rubtsowa · 2020-04-28T10:37:17Z

No description provided.

…ies_combine

examples/series/series_combine.py

densmirn · 2020-05-13T09:30:33Z

sdc/datatypes/hpat_pandas_series_functions.py

+
+len_val=max(len(self), len(other))
+result=numpy.empty(len_val, self._data.dtype)
+forindinrange(len_val):


Can we parallel the method based on chunks?

1e-to · 2020-05-13T09:56:34Z

sdc/datatypes/hpat_pandas_series_functions.py

+iffill_valueisNone:
+fill_value=numpy.nan
+
+len_val=max(len(self), len(other))


And what if all indexes are different? I think we should use sdc_join_series_indexes to find len of result series

1e-to · 2020-05-13T10:20:50Z

sdc/datatypes/hpat_pandas_series_functions.py

+
+len_val=max(len(self), len(other))
+result=numpy.empty(len_val, self._data.dtype)
+forindinrange(len_val):


It is case for non-indexes series. Also, it should rewrite with prange

+ usage of chunks to predict scalability

1e-to · 2020-05-13T10:22:02Z

examples/series/series_combine.py

+
+
+@njit
+defseries_copy():


Wrong name function

…ies_combine

kozlov-alexey · 2020-05-21T13:18:51Z

sdc/datatypes/hpat_pandas_series_functions.py

+iffill_valueisNone:
+fill_value=numpy.nan


This will make fill_value type undefined at compile time. You can probably use the same approach as in operators:
sdc/sdc/sdc_function_templates.py
Line 144 in e87095a
_fill_value=numpy.naniffill_value_is_none==Trueelsefill_value# noqa

kozlov-alexey · 2020-05-21T13:24:59Z

sdc/datatypes/hpat_pandas_series_functions.py

+fill_value=numpy.nan
+
+len_val=max(len(self), len(other))
+result=numpy.empty(len_val, self._data.dtype)


This is actually wrong, result dtype should be common dtype for result dtype of func(a, b) where a,b are series values and dtype of _fill_value. Provided tests do not cover this, but e.g. this (where fill_value is float and series are integers) won't pass:
def test_series_combine_integer_new(self): def test_impl(S1, S2): return S1.combine(S2, lambda a, b: 2 * a + b, 16.2) hpat_func = self.jit(test_impl) S1 = pd.Series([1, 2, 3, 4, 5]) S2 = pd.Series([6, 21, 3, 5]) result = hpat_func(S1, S2) result_ref = test_impl(S1, S2) print(f"DEBUG: result:\n{result},\nresult_ref:\n{result_ref}") pd.testing.assert_series_equal(result, result_ref)

kozlov-alexey

Need to add deducing result dtype and add parallelization.

…ies_combine

densmirn · 2020-05-26T12:59:01Z

sdc/tests/test_series.py

 pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2))

-@skip_numba_jit
+@unittest.expectedFailure


Please add comment why the test is skipped.
@unittest.expectedFailure # ...

@Rubtsowa No need to skip the test if that's how impl is intended to work. Use check_dtype=False in assert_series_equal and add a comment just before this check to refer to SDC Limitation.

densmirn · 2020-05-26T13:07:34Z

sdc/datatypes/hpat_pandas_series_functions.py

+ifself_indexes[j] ==-1:
+val_self=_fill_value
+else:
+ind_self=self_indexes[j]
+val_self=self[ind_self]._data[0]
+
+ifother_indexes[j] ==-1:
+val_other=_fill_value
+else:
+ind_other=other_indexes[j]
+val_other=other[ind_other]._data[0]


Suggested change
ifself_indexes[j]==-1:
val_self=_fill_value
else:
ind_self=self_indexes[j]
val_self=self[ind_self]._data[0]
ifother_indexes[j]==-1:
val_other=_fill_value
else:
ind_other=other_indexes[j]
val_other=other[ind_other]._data[0]
self_idx=self_indexes[j]
ifself_idx==-1:
val_self=_fill_value
else:
val_self=self[self_idx]._data[0]
other_idx=other_indexes[j]
ifother_idx==-1:
val_other=_fill_value
else:
val_other=other[other_idx]._data[0]

or
Suggested change
ifself_indexes[j] ==-1:
val_self=_fill_value
else:
ind_self=self_indexes[j]
val_self=self[ind_self]._data[0]
ifother_indexes[j] ==-1:
val_other=_fill_value
else:
ind_other=other_indexes[j]
val_other=other[ind_other]._data[0]
self_idx=self_indexes[j]
val_self=_fill_valueifself_idx==-1elseself[self_idx]._data[0]
other_idx=other_indexes[j]
val_other=_fill_valueifother_idx==-1elseother[other_idx]._data[0]

kozlov-alexey · 2020-05-27T08:46:40Z

sdc/datatypes/hpat_pandas_series_functions.py

 min_periods=2

 ifmin_periods<2:
 min_periods=2


Suggested change
iffill_valueisnotNone:
_fill_value=numpy.naniffill_valueisNoneelsefill_value:

kozlov-alexey · 2020-05-27T08:57:21Z

sdc/datatypes/hpat_pandas_series_functions.py


 ifnotisinstance(fill_value, (types.Omitted, types.NoneType, types.Number)) andfill_valueisnotNone:
 ty_checker.raise_exc(fill_value, 'number', 'fill_value')



self_idx (and other_idx) is position in the Series, not the index, so instead of using getitem on a Series, that performs index lookup and returns a Series, so that you have to take _data[0] from it, you can just write:
Suggested change
val_self=self[self_idx]._data[0]
val_self=self._data[self_idx]

…ies_combine

kozlov-alexey

Overall looks OK, but I would reorganize and extend existing tests a bit.

kozlov-alexey · 2020-05-27T10:15:54Z

sdc/datatypes/hpat_pandas_series_functions.py

+
+ Limitations
+ -----------
+ - Only supports the case when data in series of the same type.


This line is not correct - impl handles all cases. For the next line we need exact definition of difference to pandas, e.g:
Suggested change
-Onlysupportsthecasewhendatainseriesofthesame type.
-Resultingseriesdtypemaybewiderthaninpandasdueto type-stabilityrequirementsanddependsonfill_valuedtypeandresultofseriesindexesalignment.

kozlov-alexey · 2020-05-27T10:25:40Z

sdc/tests/test_series.py

 pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2))

-@skip_numba_jit
 deftest_series_combine_float3264(self):


This test has incorrect code, which should be corrected probably:
S1 = pd.Series([np.float64(1), np.float64(2), np.float64(3), np.float64(4), np.float64(5)]) S2 = pd.Series([np.float32(1), np.float32(2), np.float32(3), np.float32(4), np.float32(5)])
S2.dtype will be float64 on Win, not float32. Moreover, series dtype should be specified this way:
S1 = pd.Series([1, 2, 3, 4, 5], dtype=np.int64) S2 = pd.Series([1, 2, 3, 4, 5], dtype=np.int32)

kozlov-alexey · 2020-05-27T12:08:59Z

sdc/datatypes/hpat_pandas_series_functions.py

+chunks=parallel_chunks(len_val)
+foriinprange(len(chunks)):
+chunk=chunks[i]
+forjinrange(chunk.start, chunk.stop):
+self_idx=self_indexes[j]
+val_self=_fill_valueifself_idx==-1elseself._data[self_idx]
+
+other_idx=other_indexes[j]
+val_other=_fill_valueifother_idx==-1elseother._data[other_idx]
+
+result[j] =func(val_self, val_other)


Why not?
Suggested change
chunks=parallel_chunks(len_val)
foriinprange(len(chunks)):
chunk=chunks[i]
forjinrange(chunk.start, chunk.stop):
self_idx=self_indexes[j]
val_self=_fill_valueifself_idx==-1elseself._data[self_idx]
other_idx=other_indexes[j]
val_other=_fill_valueifother_idx==-1elseother._data[other_idx]
result[j] =func(val_self, val_other)
result=numpy.empty(len_val, res_dtype)
foriinprange(len(result)):
self_idx, other_idx=self_indexes[i], other_indexes[i]
val_self=_fill_valueifself_idx==-1elseself._data[self_idx]
val_other=_fill_valueifother_idx==-1elseother._data[other_idx]
result[i] =func(val_self, val_other)

kozlov-alexey · 2020-05-27T12:13:27Z

sdc/tests/test_series.py

 S2=pd.Series([6., 21., 3., 5.])
-withself.assertRaises(AssertionError):
-hpat_func(S1, S2)
+pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2))


General comment for tests: not all combinations of input series dtypes and fill_value are tested e.g. the one I mentioned before - where float fill_value is assigned to otherwise int series. There are no tests with series with non-default indexes (we refer to samelen, but it's not fully correct - series may have same len, but not same indexes), and no tests for checking func impact on result dtype, so it's hard to see from such tests what's really tested and what is not. So the suggestion is to organize tests in a different manner:
product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
Covers: test_series_combine_value_samelen
product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
with fill_value is omitted
Covers: test_series_combine_float3264, test_series_combine_integer_samelen, test_series_combine_samelen, test_series_combine_different_types
product of diff series dtypes (default int, int64, float64),
series indexes that align with and without -1 in indexers
fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
Covers: test_series_combine_integer, test_series_combine_value
product of diff series dtypes (default int, int64, float64),
series indexes that align with and without -1 in indexers
fill_value is omitted
Covers: test_series_combine, test_series_combine_assert1, test_series_combine_assert2, test_series_combine_different_types
New test:
5. (for testing func changes dtype properly)
product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
fill_value = 0
with diff functions (chaning and not chaning res dtype e.g. preserving int domain, e.g. ** and + and not, e.g. /)

For example, test 1 can look like this (it can also be split into two: one when we use check_dtype=False and one when we don't):
def test_series_combine_same_index_fill_value(self): def test_impl(S1, S2): return S1.combine(S2, lambda a, b: 2 * a + b) hpat_func = self.jit(test_impl) n = 11 np.random.seed(0) A = np.random.randint(-100, 100, n) B = np.arange(n) * 2 + 1 series_index = 1 + np.arange(n) series_dtypes = [None, np.int64, np.float64] fill_values = [None, np.nan, 4, 4.2] for dtype1, dtype2, fill_value in product(series_dtypes, series_dtypes, fill_values): S1 = pd.Series(A, index=series_index, dtype=dtype1) S2 = pd.Series(B, index=series_index, dtype=dtype2) with self.subTest(S1_dtype=dtype1, S2_dtype=dtype2, fill_value=fill_value): result = hpat_func(S1, S2) result_ref = test_impl(S1, S2) # check_dtype=False due to difference to pandas in some cases pd.testing.assert_series_equal(result, result_ref, check_dtype=False)

pep8speaks · 2020-06-01T13:01:44Z

Hello @Rubtsowa! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-01 13:52:44 UTC

…ies_combine

kozlov-alexey · 2020-06-16T12:07:23Z

sdc/datatypes/hpat_pandas_series_functions.py

 _func_name='Method isin().'

 ty_checker=TypeChecker(_func_name)
 ty_checker.check(self, SeriesType)

 ifnotisinstance(values, (types.Set, types.List)):


@Rubtsowa What? This sounds like a bug...

@kozlov-alexey This 'bug' in function sdc_join_series_indexes

@Rubtsowa Then it should be fixed (please create a JIRA case with a reproducer) before this is merged. Adapting to a bug is no way. @AlexanderKalistratov correct?

PokhodenkoSA

Very old PR. It is better to close it.

Rubtsowa added 2 commits April 28, 2020 13:30

Impl Series.combine
7ec05f3

merge
b576e03

Rubtsowa added the Ready for Review label Apr 28, 2020

Rubtsowa requested review from AlexanderKalistratov, PokhodenkoSA and densmirn April 28, 2020 10:37

Rubtsowa added 2 commits April 28, 2020 13:40

change comment
95d233e

Merge branch 'master' ofhttps://github.com/IntelPython/hpatinto ser…
2b95e9d
…ies_combine

densmirn suggested changes May 13, 2020
View reviewed changes

change example
2104aae

1e-to suggested changes May 13, 2020
View reviewed changes

Rubtsowa added 3 commits May 13, 2020 15:36

use sdc_join_series_indexes
aeef026

change
f09b792

Merge branch 'master' ofhttps://github.com/IntelPython/hpatinto ser…
fafcaed
…ies_combine

kozlov-alexey reviewed May 21, 2020
View reviewed changes

kozlov-alexey suggested changes May 21, 2020
View reviewed changes

kozlov-alexey added Waiting on author and removed Ready for Review labels May 21, 2020

Rubtsowa added 2 commits May 25, 2020 16:54

for from chunks, change dtype for result array
b1ed1b3

Merge branch 'master' ofhttps://github.com/IntelPython/hpatinto ser…
4fdb369
…ies_combine

Rubtsowa added Ready for Review and removed Waiting on author labels May 25, 2020

Rubtsowa added 2 commits May 26, 2020 12:22

change 'if' on 'if-else'
6d930d8

change for
c85feff

1e-to approved these changes May 26, 2020
View reviewed changes

Rubtsowa added 2 commits May 26, 2020 13:36

change for
18771f2

change for
8628e31

densmirn reviewed May 26, 2020
View reviewed changes

add comment in skip test, some change in impl
6f37f53

densmirn approved these changes May 26, 2020
View reviewed changes

kozlov-alexey reviewed May 27, 2020
View reviewed changes

Rubtsowa added 3 commits May 27, 2020 12:40

change if-else in 1 line
f66ace3

Merge branch 'master' ofhttps://github.com/IntelPython/hpatinto ser…
9fc55a7
…ies_combine

change test
e7dc1f5

kozlov-alexey reviewed May 27, 2020
View reviewed changes

change tests
6c78b8e

Rubtsowa added 2 commits June 1, 2020 16:08

fix problem with PEP8
d623b89

Merge branch 'master' ofhttps://github.com/IntelPython/hpatinto ser…
ecd3dba
…ies_combine

kozlov-alexey reviewed Jun 16, 2020
View reviewed changes

PokhodenkoSA reviewed Dec 6, 2021
View reviewed changes

	iffill_valueisnotNone:
	_fill_value=numpy.naniffill_valueisNoneelsefill_value:


		ifnotisinstance(fill_value, (types.Omitted, types.NoneType, types.Number)) andfill_valueisnotNone:
		ty_checker.raise_exc(fill_value, 'number', 'fill_value')

	val_self=self[self_idx]._data[0]
	val_self=self._data[self_idx]

	-Onlysupportsthecasewhendatainseriesofthesame type.
	-Resultingseriesdtypemaybewiderthaninpandasdueto type-stabilityrequirementsanddependsonfill_valuedtypeandresultofseriesindexesalignment.



		@njit
		defseries_copy():

		iffill_valueisNone:
		fill_value=numpy.nan

Series combine#821

Are you sure you want to change the base?

Series combine #821

Uh oh!

Conversation

Rubtsowa commented Apr 28, 2020

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

densmirnMay 13, 2020• edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kozlov-alexeyMay 21, 2020• edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kozlov-alexey left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kozlov-alexey left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pep8speaks commented Jun 1, 2020• edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2020-06-01 13:52:44 UTC

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RubtsowaJun 16, 2020• edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PokhodenkoSA left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

densmirnMay 13, 2020•
edited
Loading

kozlov-alexeyMay 21, 2020•
edited
Loading

pep8speaks commented Jun 1, 2020•
edited
Loading

RubtsowaJun 16, 2020•
edited
Loading