
Conversation

zorroblue

Reference Issue

Fixes #344

What does this implement/fix? Explain your changes.

Adds an additional parameter rng to hash_X_y, defaulted as a zero seeded random state

Any other comments?

@glemaitre
Member

The random seed should not have a default. Each sampler will pass the random seed, which is required.
Then fix the tests if needed.

@zorroblue
Author

There are a few places where hash_X_y is called. Wouldn't those also need to be changed if we don't provide a default value?

@glemaitre
Member

The idea is to pass self.random_state in all those occurrences, since each estimator should have a random_state.

@zorroblue
Author

I almost fixed it, but it fails when random_state is passed as None: in that case, calling check_random_state(random_state) yields a differently initialized RandomState each time. I suspect it's because fit and sample both call hash_X_y in self.fit(X, y).sample(X, y) with random_state as None, so check_random_state gives a different state to each. How do we solve this? I can think of storing the actual RandomState object in random_state in __init__ itself. What do you think? @glemaitre
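The mismatch described above can be reproduced in a few lines. This is a minimal sketch, not code from the PR: with an int seed, check_random_state is reproducible across calls, but with None the draws come from an ever-advancing stream, so two independent calls (one in fit, one in sample) generally disagree and any hash derived from them differs too.

```python
from sklearn.utils import check_random_state

# With an int seed, each call returns a freshly seeded RandomState,
# so two independent calls draw identical values; fit() and sample()
# would therefore derive the same hash.
seeded_a = check_random_state(0).randint(1000)
seeded_b = check_random_state(0).randint(1000)
assert seeded_a == seeded_b

# With None, the draws come from numpy's global stream, which keeps
# advancing between calls, so independent calls in fit() and sample()
# will generally see different values and compute different hashes.
unseeded_a = check_random_state(None).randint(1000)
unseeded_b = check_random_state(None).randint(1000)
# unseeded_a == unseeded_b only by a 1-in-1000 coincidence
```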

@chkoar
Member

chkoar commented Oct 16, 2017

@zorroblue Nope. You should use the check_random_state function (from sklearn.utils) to get a RandomState object and pass it around. IMHO, I would probably leave the random state static in the hash_X_y function.

@glemaitre
Member

@chkoar We could take a deterministic approach instead: take N equally spaced samples, which would not require a random state. WDYT?

@chkoar
Member

chkoar commented Oct 16, 2017

@glemaitre nice. To recap, we use that in order to be sure that the sample method is fed with the same data as the fit method, right?

@glemaitre
Member

> @glemaitre nice. To recap, we use that in order to be sure that the sample method is fed with the same data as the fit method, right?

Exactly. Apart from computing a hash on the whole data, there is no good solution.
I think that making it deterministic on a couple of points would be enough.

@chkoar
Member

chkoar commented Oct 16, 2017

@glemaitre agree

@glemaitre
Member

@zorroblue are you still willing to make the changes?

@zorroblue
Author

zorroblue commented Oct 24, 2017

@glemaitre Sure! I still haven't understood what needs to be done, though. Could you tell me?

@glemaitre
Member

Oops, I forgot to answer, apparently. The idea is to select a number of samples which are equally spaced, similarly to https://github.com/scikit-learn/scikit-learn/pull/9041/files#diff-4cbaa7df0d8c0f765f3b07c72b70b60eR135

Once those samples are selected we can compute the hash and we don't need the random_state anymore.
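A minimal sketch of what such a deterministic hash_X_y could look like. Everything here is illustrative, not the PR's final code: joblib.hash provides the digest, and the n_samples parameter and the strided-slice selection scheme are assumptions modeled on the linked scikit-learn PR.

```python
import numpy as np
import joblib  # joblib.hash gives a deterministic digest of arrays


def hash_X_y(X, y, n_samples=10):
    """Hypothetical sketch: hash ~n_samples equally spaced rows of X and y.

    Fully deterministic, so fit() and sample() can both call it and
    compare digests without sharing a random_state.
    """
    row_step = max(1, X.shape[0] // n_samples)
    col_step = max(1, X.shape[1] // n_samples)
    row_idx = slice(None, None, row_step)  # every row_step-th row
    col_idx = slice(None, None, col_step)  # every col_step-th column
    return joblib.hash(X[row_idx, col_idx]), joblib.hash(y[row_idx])


X = np.arange(200, dtype=float).reshape(20, 10)
y = np.arange(20)
# Same data -> same digest on every call; no RNG involved.
assert hash_X_y(X, y) == hash_X_y(X, y)
```

Because the selected indices depend only on the array shape, the check stays stable across fit and sample while still catching a caller that passes different data to the two methods.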

@chkoar
Member

chkoar commented Nov 1, 2017

Could we change the title of the PR?

@glemaitre
Member

@chkoar Feel free to rename the title.

@zorroblue
Author

Sure @glemaitre I'll have a look at this in the weekend :)

@glemaitre glemaitre changed the title Add random state as parameter [WIP] Change hash computation to check data between fit and sample Nov 1, 2017
@glemaitre
Member

@zorroblue do you plan to make the changes any time soon?

@zorroblue
Author

Sorry @glemaitre for the delay, I am a little occupied until Dec 4 and can only take up this issue after that :(
If someone else is interested in taking this up, feel free to :)

@glemaitre
Member

Thanks for your effort. Sorry that we changed our mind regarding what to implement. Feel free to contribute any time.

@glemaitre glemaitre closed this Nov 30, 2017
@zorroblue
Author

No problem at all @glemaitre
I am looking forward to contributing more to the scikit-learn community :)

