- Notifications
You must be signed in to change notification settings - Fork 62
Series combine#821
base:master
Are you sure you want to change the base?
Series combine #821
Uh oh!
There was an error while loading. Please reload this page.
Conversation
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
| len_val=max(len(self), len(other)) | ||
| result=numpy.empty(len_val, self._data.dtype) | ||
| forindinrange(len_val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we parallel the method based on chunks?
| iffill_valueisNone: | ||
| fill_value=numpy.nan | ||
| len_val=max(len(self), len(other)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what if all indexes are different? I think we should use sdc_join_series_indexes to find len of result series
| len_val=max(len(self), len(other)) | ||
| result=numpy.empty(len_val, self._data.dtype) | ||
| forindinrange(len_val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is case for non-indexes series. Also, it should rewrite with prange
densmirnMay 13, 2020 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+ usage of chunks to predict scalability
examples/series/series_combine.py Outdated
| @njit | ||
| defseries_copy(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong name function
| iffill_valueisNone: | ||
| fill_value=numpy.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make fill_value type undefined at compile time. You can probably use the same approach as in operators:
sdc/sdc/sdc_function_templates.py
Line 144 in e87095a
| _fill_value=numpy.naniffill_value_is_none==Trueelsefill_value# noqa |
| fill_value=numpy.nan | ||
| len_val=max(len(self), len(other)) | ||
| result=numpy.empty(len_val, self._data.dtype) |
kozlov-alexeyMay 21, 2020 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually wrong, result dtype should be common dtype for result dtype of func(a, b) where a,b are series values and dtype of _fill_value. Provided tests do not cover this, but e.g. this (where fill_value is float and series are integers) won't pass:
def test_series_combine_integer_new(self): def test_impl(S1, S2): return S1.combine(S2, lambda a, b: 2 * a + b, 16.2) hpat_func = self.jit(test_impl) S1 = pd.Series([1, 2, 3, 4, 5]) S2 = pd.Series([6, 21, 3, 5]) result = hpat_func(S1, S2) result_ref = test_impl(S1, S2) print(f"DEBUG: result:\n{result},\nresult_ref:\n{result_ref}") pd.testing.assert_series_equal(result, result_ref)
kozlov-alexey left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add deducing result dtype and add parallelization.
sdc/tests/test_series.py Outdated
| pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2)) | ||
| @skip_numba_jit | ||
| @unittest.expectedFailure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comment why the test is skipped.@unittest.expectedFailure # ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa No need to skip the test if that's how impl is intended to work. Use check_dtype=False in assert_series_equal and add a comment just before this check to refer to SDC Limitation.
| ifself_indexes[j] ==-1: | ||
| val_self=_fill_value | ||
| else: | ||
| ind_self=self_indexes[j] | ||
| val_self=self[ind_self]._data[0] | ||
| ifother_indexes[j] ==-1: | ||
| val_other=_fill_value | ||
| else: | ||
| ind_other=other_indexes[j] | ||
| val_other=other[ind_other]._data[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ifself_indexes[j]==-1: | |
| val_self=_fill_value | |
| else: | |
| ind_self=self_indexes[j] | |
| val_self=self[ind_self]._data[0] | |
| ifother_indexes[j]==-1: | |
| val_other=_fill_value | |
| else: | |
| ind_other=other_indexes[j] | |
| val_other=other[ind_other]._data[0] | |
| self_idx=self_indexes[j] | |
| ifself_idx==-1: | |
| val_self=_fill_value | |
| else: | |
| val_self=self[self_idx]._data[0] | |
| other_idx=other_indexes[j] | |
| ifother_idx==-1: | |
| val_other=_fill_value | |
| else: | |
| val_other=other[other_idx]._data[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or
| ifself_indexes[j] ==-1: | |
| val_self=_fill_value | |
| else: | |
| ind_self=self_indexes[j] | |
| val_self=self[ind_self]._data[0] | |
| ifother_indexes[j] ==-1: | |
| val_other=_fill_value | |
| else: | |
| ind_other=other_indexes[j] | |
| val_other=other[ind_other]._data[0] | |
| self_idx=self_indexes[j] | |
| val_self=_fill_valueifself_idx==-1elseself[self_idx]._data[0] | |
| other_idx=other_indexes[j] | |
| val_other=_fill_valueifother_idx==-1elseother[other_idx]._data[0] |
| min_periods=2 | ||
| ifmin_periods<2: | ||
| min_periods=2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| iffill_valueisnotNone: | |
| _fill_value=numpy.naniffill_valueisNoneelsefill_value: |
| ifnotisinstance(fill_value, (types.Omitted, types.NoneType, types.Number)) andfill_valueisnotNone: | ||
| ty_checker.raise_exc(fill_value, 'number', 'fill_value') | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self_idx (and other_idx) is position in the Series, not the index, so instead of using getitem on a Series, that performs index lookup and returns a Series, so that you have to take _data[0] from it, you can just write:
| val_self=self[self_idx]._data[0] | |
| val_self=self._data[self_idx] |
kozlov-alexey left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks OK, but I would reorganize and extend existing tests a bit.
| Limitations | ||
| ----------- | ||
| - Only supports the case when data in series of the same type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is not correct - impl handles all cases. For the next line we need exact definition of difference to pandas, e.g:
| -Onlysupportsthecasewhendatainseriesofthesame type. | |
| -Resultingseriesdtypemaybewiderthaninpandasdueto type-stabilityrequirementsanddependsonfill_valuedtypeandresultofseriesindexesalignment. |
sdc/tests/test_series.py Outdated
| pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2)) | ||
| @skip_numba_jit | ||
| deftest_series_combine_float3264(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test has incorrect code, which should be corrected probably:
S1 = pd.Series([np.float64(1), np.float64(2), np.float64(3), np.float64(4), np.float64(5)]) S2 = pd.Series([np.float32(1), np.float32(2), np.float32(3), np.float32(4), np.float32(5)]) S2.dtype will be float64 on Win, not float32. Moreover, series dtype should be specified this way:
S1 = pd.Series([1, 2, 3, 4, 5], dtype=np.int64) S2 = pd.Series([1, 2, 3, 4, 5], dtype=np.int32) | chunks=parallel_chunks(len_val) | ||
| foriinprange(len(chunks)): | ||
| chunk=chunks[i] | ||
| forjinrange(chunk.start, chunk.stop): | ||
| self_idx=self_indexes[j] | ||
| val_self=_fill_valueifself_idx==-1elseself._data[self_idx] | ||
| other_idx=other_indexes[j] | ||
| val_other=_fill_valueifother_idx==-1elseother._data[other_idx] | ||
| result[j] =func(val_self, val_other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not?
| chunks=parallel_chunks(len_val) | |
| foriinprange(len(chunks)): | |
| chunk=chunks[i] | |
| forjinrange(chunk.start, chunk.stop): | |
| self_idx=self_indexes[j] | |
| val_self=_fill_valueifself_idx==-1elseself._data[self_idx] | |
| other_idx=other_indexes[j] | |
| val_other=_fill_valueifother_idx==-1elseother._data[other_idx] | |
| result[j] =func(val_self, val_other) | |
| result=numpy.empty(len_val, res_dtype) | |
| foriinprange(len(result)): | |
| self_idx, other_idx=self_indexes[i], other_indexes[i] | |
| val_self=_fill_valueifself_idx==-1elseself._data[self_idx] | |
| val_other=_fill_valueifother_idx==-1elseother._data[other_idx] | |
| result[i] =func(val_self, val_other) |
sdc/tests/test_series.py Outdated
| S2=pd.Series([6., 21., 3., 5.]) | ||
| withself.assertRaises(AssertionError): | ||
| hpat_func(S1, S2) | ||
| pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment for tests: not all combinations of input series dtypes and fill_value are tested e.g. the one I mentioned before - where float fill_value is assigned to otherwise int series. There are no tests with series with non-default indexes (we refer to samelen, but it's not fully correct - series may have same len, but not same indexes), and no tests for checking func impact on result dtype, so it's hard to see from such tests what's really tested and what is not. So the suggestion is to organize tests in a different manner:
- product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
Covers: test_series_combine_value_samelen - product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
with fill_value is omitted
Covers: test_series_combine_float3264, test_series_combine_integer_samelen, test_series_combine_samelen, test_series_combine_different_types - product of diff series dtypes (default int, int64, float64),
series indexes that align with and without -1 in indexers
fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
Covers: test_series_combine_integer, test_series_combine_value - product of diff series dtypes (default int, int64, float64),
series indexes that align with and without -1 in indexers
fill_value is omitted
Covers: test_series_combine, test_series_combine_assert1, test_series_combine_assert2, test_series_combine_different_types
New test:
5. (for testing func changes dtype properly)
product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
fill_value = 0
with diff functions (chaning and not chaning res dtype e.g. preserving int domain, e.g. ** and + and not, e.g. /)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, test 1 can look like this (it can also be split into two: one when we use check_dtype=False and one when we don't):
def test_series_combine_same_index_fill_value(self): def test_impl(S1, S2): return S1.combine(S2, lambda a, b: 2 * a + b) hpat_func = self.jit(test_impl) n = 11 np.random.seed(0) A = np.random.randint(-100, 100, n) B = np.arange(n) * 2 + 1 series_index = 1 + np.arange(n) series_dtypes = [None, np.int64, np.float64] fill_values = [None, np.nan, 4, 4.2] for dtype1, dtype2, fill_value in product(series_dtypes, series_dtypes, fill_values): S1 = pd.Series(A, index=series_index, dtype=dtype1) S2 = pd.Series(B, index=series_index, dtype=dtype2) with self.subTest(S1_dtype=dtype1, S2_dtype=dtype2, fill_value=fill_value): result = hpat_func(S1, S2) result_ref = test_impl(S1, S2) # check_dtype=False due to difference to pandas in some cases pd.testing.assert_series_equal(result, result_ref, check_dtype=False) pep8speaks commented Jun 1, 2020 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
| _func_name='Method isin().' | ||
| ty_checker=TypeChecker(_func_name) | ||
| ty_checker.check(self, SeriesType) | ||
| ifnotisinstance(values, (types.Set, types.List)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa What? This sounds like a bug...
RubtsowaJun 16, 2020 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kozlov-alexey This 'bug' in function sdc_join_series_indexes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rubtsowa Then it should be fixed (please create a JIRA case with a reproducer) before this is merged. Adapting to a bug is no way. @AlexanderKalistratov correct?
PokhodenkoSA left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very old PR. It is better to close it.
No description provided.