Improvements to the cudamatrix directory. #3221

luitjens · 2019-04-10T21:27:13Z

Changes:

cu-array-inl.h, cu-packed-matrix.cc:
Remove unnecessary synchronization. Synchronization will occur with
stream semantics.

cu-device.h, cu-device.cc, cuda-common.h, cuda_64bit.mk
Add a handle for the cusolverDn library. Future changes will rely on
this.

cu-kernels-ansi.h, cu-kernels.cu, cu-kernels.h:
Add RowSumMat kernel support, which mirrors ColSumMat but operates on rows.

cu-matrix.cc:
Make cudaMemset2D asynchronous. Synchronization is handled via
streams.

cu-value.h:
Added -= operator which mirrors += operator

cu-vector.cc, cu-vector.h:
Added ApplyLogSoftMax which matches CPU version.
Remove stream synchronization on AddMatVec (handled by streams)
Use direct kernel for row sum instead of a mat vec. This is more
efficient as it avoids extra allocation and memset.

cu-sparse-matrix-test.cc:
Adjusted epsilon to be more tolerant of floating-point error caused by
order of operations.
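
As a rough CPU-side illustration of two items in the description (summing rows directly instead of going through a mat-vec, and comparing results with a relative tolerance), here is a minimal C++ sketch. It is not the PR's CUDA kernel. The names `SumRowsDirect`, `SumRowsViaMatVec` and `ApproxEqualRel` are hypothetical, and the orientation (output has one entry per column) is an assumption made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of all rows of a row-major (rows x cols) matrix: out[c] = sum_r m(r, c).
// Assumption: this orientation is illustrative and not taken from the PR.
std::vector<float> SumRowsDirect(const std::vector<float>& m,
                                 std::size_t rows, std::size_t cols) {
  assert(m.size() == rows * cols);
  std::vector<float> out(cols, 0.0f);
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      out[c] += m[r * cols + c];
  return out;
}

// The same quantity written as a mat-vec, M^T * ones. This needs an extra
// vector of ones that has to be allocated and filled before the product.
std::vector<float> SumRowsViaMatVec(const std::vector<float>& m,
                                    std::size_t rows, std::size_t cols) {
  assert(m.size() == rows * cols);
  std::vector<float> ones(rows, 1.0f);  // extra allocation and fill
  std::vector<float> out(cols, 0.0f);
  for (std::size_t c = 0; c < cols; ++c) {
    float dot = 0.0f;
    for (std::size_t r = 0; r < rows; ++r)
      dot += m[r * cols + c] * ones[r];
    out[c] = dot;
  }
  return out;
}

// Relative comparison: two float results that were accumulated in a different
// order can differ in the last bits, so tests compare with a tolerance.
bool ApproxEqualRel(float a, float b, float rel_tol) {
  return std::fabs(a - b) <= rel_tol * (std::fabs(a) + std::fabs(b));
}
```

Both functions return the same values up to rounding. On a GPU the order of additions inside a reduction can differ from the CPU loop, which is the kind of discrepancy a slightly looser test epsilon absorbs.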

danpovey · 2019-04-11T20:02:35Z

src/cudamatrix/cu-device.h

 inline cublasHandle_t GetCublasHandle() { return cublas_handle_; }
 inline cusparseHandle_t GetCusparseHandle() { return cusparse_handle_; }
 inline curandGenerator_t GetCurandHandle() { return curand_handle_; }
+ inline cusolverDnHandle_t GetCusolverDnHandle() { return cusolverdn_handle_; }


Is this needed for something just yet? Not that I necessarily object, just asking.

luitjens · 2019-04-11T21:19:10Z

Not yet but it will be for future commits. The example where I'm using it is to use a solver instead of the built in matrix inversion. Here is a snapshot from the ivector extractor I'm working on:

    #if 0
      quadratic.Invert();
      ivector->Resize(ivector_dim_,kUndefined);
      ivector->AddSpVec(1.0, quadratic, linear, 0.0);
    #else
      //x = quadratic^-1 * linear
      //ivector+=x
      //Inverting the matrix is unnecessary. We are only solving a single
      //linear system. So just use Cholesky to solve for a single ivector
      //Equation being solved: quadratic * ivector = linear
      int nrhs=1;
      ivector->Resize(ivector_dim_, kUndefined);
      //cusolver does an inplace solve. so copy RHS to ivector
      ivector->CopyFromVec(linear);
      //Forming new non-SP matrix for cusolver.
      CuMatrix<float> A(quadratic);
      //This is the cusolver return code. Checking it would require synchronization.
      //So we do not check it.
      int *d_info = NULL;
      //query temp buffer size
      int L_work;
      CUSOLVER_SAFE_CALL(cusolverDnSpotrf_bufferSize(GetCusolverDnHandle(),
          CUBLAS_FILL_MODE_LOWER, ivector_dim_, A.Data(), A.Stride(), &L_work));
      //allocate temp buffer
      float *workspace = static_cast<float*>(CuDevice::Instantiate().Malloc(L_work));
      //perform factorization
      CUSOLVER_SAFE_CALL(cusolverDnSpotrf(GetCusolverDnHandle(),
          CUBLAS_FILL_MODE_LOWER, ivector_dim_, A.Data(), A.Stride(),
          workspace, L_work, d_info));
      //solve for rhs
      CUSOLVER_SAFE_CALL(cusolverDnSpotrs(GetCusolverDnHandle(),
          CUBLAS_FILL_MODE_LOWER, ivector_dim_, nrhs, A.Data(), A.Stride(),
          ivector->Data(), ivector_dim_, d_info));
      CuDevice::Instantiate().Free(workspace);
    #endif

Note we could also integrate something similar into the built in inversion routines as this library supports both Cholesky and LU inversion. But in the case I'm working on inversion is overkill.
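
The point made in this comment, solving quadratic * ivector = linear with a Cholesky factorization instead of forming the inverse, can be sketched on the CPU in plain C++. This is only an illustration of the technique under simple assumptions (dense row-major storage, double precision, a symmetric positive-definite matrix). The names `CholeskyLower`, `CholeskySolve` and `SolveSpd` are hypothetical and are not part of Kaldi or cuSOLVER.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// In-place Cholesky factorization A = L * L^T of a row-major n x n matrix.
// Only the lower triangle is read and overwritten with L.
// Returns false if A is not positive definite.
bool CholeskyLower(std::vector<double>& a, int n) {
  for (int j = 0; j < n; ++j) {
    double d = a[j * n + j];
    for (int k = 0; k < j; ++k) d -= a[j * n + k] * a[j * n + k];
    if (d <= 0.0) return false;
    a[j * n + j] = std::sqrt(d);
    for (int i = j + 1; i < n; ++i) {
      double s = a[i * n + j];
      for (int k = 0; k < j; ++k) s -= a[i * n + k] * a[j * n + k];
      a[i * n + j] = s / a[j * n + j];
    }
  }
  return true;
}

// Solves L * L^T * x = b in place: x holds b on entry and the solution on exit.
void CholeskySolve(const std::vector<double>& l, int n, std::vector<double>& x) {
  for (int i = 0; i < n; ++i) {  // forward substitution: L * y = b
    double s = x[i];
    for (int k = 0; k < i; ++k) s -= l[i * n + k] * x[k];
    x[i] = s / l[i * n + i];
  }
  for (int i = n - 1; i >= 0; --i) {  // back substitution: L^T * x = y
    double s = x[i];
    for (int k = i + 1; k < n; ++k) s -= l[k * n + i] * x[k];
    x[i] = s / l[i * n + i];
  }
}

// Convenience wrapper: returns the solution of a * x = b, or an empty vector
// if the factorization fails. Arguments are taken by value because both
// steps work in place.
std::vector<double> SolveSpd(std::vector<double> a, int n, std::vector<double> b) {
  assert(static_cast<int>(a.size()) == n * n);
  assert(static_cast<int>(b.size()) == n);
  if (!CholeskyLower(a, n)) return {};
  CholeskySolve(a, n, b);
  return b;
}
```

For a single right-hand side this does one factorization and two triangular solves, and never forms the inverse explicitly. That mirrors the in-place pattern in the snippet above, where the right-hand side is copied into the output vector before the solve.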

danpovey · 2019-04-12T02:35:37Z

Hm, OK, I'll merge. Can you first confirm that you ran the tests in src/ with CUDA enabled, and none failed, other than the dct thing which we are fixing?

luitjens · 2019-04-12T02:45:24Z

I ran make test and nothing failed up to the dct test. It stopped there and did not continue. I did also run .are test in the cudamatrix directory, which ran fine.

danpovey · 2019-04-12T02:48:35Z

OK, please rerun after merging with master because we fixed the DCT thing.

luitjens · 2019-04-12T03:31:46Z

rebased onto master and am now testing at my patch~1 (so pure master). There are failures. Going to run through and see if I can figure any out and will report any that are not simply tolerance is too tight. So far one matrix test failed due to a tolerance at 1e-5 instead of 1e-4.

luitjens · 2019-04-12T03:41:32Z

the one failure with tolerance in cudamatrix is the one i've already fixed in my patch set (duh!). Anyway there are other failures. I tested a few random directories to see:

nnet2 failure:

LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:214) DctComponent, input-dim=10, output-dim=2, dct_dim=5, dct_keep_dim=1
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:83) Comparing feature gradients 10 times.
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing 1.62634e-05 and 7.41854e-05
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing 1.16659e-05 and 3.0525e-05
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing 2.23374e-05 and 1.37091e-05
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing -9.48816e-05 and -1.90735e-06
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing 7.98946e-05 and 4.42564e-06
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing -4.7326e-05 and -3.51667e-05
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing 0.00015349 and -4.29675e-05
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing -7.85769e-05 and 1.3791e-05
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing 6.35098e-06 and -1.19284e-05
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:116) Input gradients: comparing -2.54259e-05 and 1.44467e-05
WARNING ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:121) Bad difference!
LOG ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:128) Succeeded for 0 out of 10 tries.
ERROR ([5.5.287-9b730]:UnitTestGenericComponentInternal():nnet-component-test.cc:132) Feature-derivative check failed

nnet3 failure:

Input-indexes: <I1V> 8 <I1> 0 3 0 <I1> 0 4 0 <I1> 0 5 0 <I1> 0 6 0 <I1> 1 3 0 <I1> 1 4 0 <I1> 2 3 0 <I1> 2 4 0
Input-indexes-modified: <I1V> 12 <I1> 0 3 0 <I1> 0 4 0 <I1> 1 3 0 <I1> 1 4 0 <I1> 2 3 0 <I1> 2 4 0 <I1> 0 5 0 <I1> 0 6 0 <I1> 1 -2147483648 0 <I1> 1 -2147483648 0 <I1> 2 -2147483648 0 <I1> 2 -2147483648 0
Output-indexes: <I1V> 4 <I1> 0 4 0 <I1> 0 6 0 <I1> 1 4 0 <I1> 2 4 0
Output-indexes-modified: <I1V> 6 <I1> 0 4 0 <I1> 1 4 0 <I1> 2 4 0 <I1> 0 6 0 <I1> 1 -2147483648 0 <I1> 2 -2147483648 0
LOG ([5.5.287-9b730]:UnitTestTimeHeightConvolutionCompile():convolution-test.cc:396) iter = 1
WARNING ([5.5.287-9b730]:Check():convolution.cc:195) The input at the 2'th height is never used.
WARNING ([5.5.287-9b730]:GetRandomConvolutionModel():convolution-test.cc:70) Regenerating model because it didn't pass the check: num-filters-in=3, num-filters-out=7, height-in=8, height-out=1, height-subsample-out=1, {time,height}-offsets=[0,-1 0,0 0,1], required-time-offsets=[0], input-dim=24, output-dim=7
LOG ([5.5.287-9b730]:TestRunningComputation():convolution-test.cc:298) Tested convolution for model: num-filters-in=3, num-filters-out=9, height-in=4, height-out=3, height-subsample-out=1, {time,height}-offsets=[0,0 0,2 1,0 2,1 2,2], required-time-offsets=[0,2], input-dim=12, output-dim=27
LOG ([5.5.287-9b730]:TestDataBackprop():convolution-test.cc:341) Expected objf = -29.8989, observed objf = -29.8989
LOG ([5.5.287-9b730]:TestParamsBackprop():convolution-test.cc:384) Expected objf = 26.8043, observed objf = 26.8043
LOG ([5.5.287-9b730]:UnitTestTimeHeightConvolutionCompile():convolution-test.cc:432) Input-indexes: <I1V> 3 <I1> 2 3 1 <I1> 2 4 1 <I1> 2 5 1
Input-indexes-modified: <I1V> 3 <I1> 2 3 1 <I1> 2 4 1 <I1> 2 5 1
Output-indexes: <I1V> 1 <I1> 2 3 1
Output-indexes-modified: <I1V> 1 <I1> 2 3 1
LOG ([5.5.287-9b730]:UnitTestTimeHeightConvolutionCompile():convolution-test.cc:396) iter = 2
WARNING ([5.5.287-9b730]:Check():convolution.cc:195) The input at the 3'th height is never used.
WARNING ([5.5.287-9b730]:GetRandomConvolutionModel():convolution-test.cc:70) Regenerating model because it didn't pass the check: num-filters-in=8, num-filters-out=4, height-in=4, height-out=3, height-subsample-out=1, {time,height}-offsets=[0,-1 0,0 1,-2], required-time-offsets=[0,1], input-dim=32, output-dim=12
LOG ([5.5.287-9b730]:TestRunningComputation():convolution-test.cc:298) Tested convolution for model: num-filters-in=1, num-filters-out=6, height-in=4, height-out=4, height-subsample-out=1, {time,height}-offsets=[0,0], required-time-offsets=[0], input-dim=4, output-dim=24
LOG ([5.5.287-9b730]:TestDataBackprop():convolution-test.cc:341) Expected objf = -5.6619, observed objf = -5.6619
LOG ([5.5.287-9b730]:TestParamsBackprop():convolution-test.cc:384) Expected objf = 0.103592, observed objf = -4.79865
ERROR ([5.5.287-9b730]:TestParamsBackprop():convolution-test.cc:388) Difference in objf too large.
[ Stack-Trace: ]
kaldi::MessageLogger::LogMessage() const
kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)
kaldi::nnet3::time_height_convolution::TestParamsBackprop(kaldi::nnet3::time_height_convolution::ConvolutionModel const&, std::vector<kaldi::nnet3::Index, std::allocator<kaldi::nnet3::Index> > const&, std::vector<kaldi::nnet3::Index, std::allocator<kaldi::nnet3::Index> > const&, kaldi::nnet3::time_height_convolution::ConvolutionComputation const&)
kaldi::nnet3::time_height_convolution::UnitTestTimeHeightConvolutionCompile()
kaldi::nnet3::time_height_convolution::UnitTestTimeHeightConvolution()
main
__libc_start_main
_start

nnet3 failure:

WARNING ([5.5.287-9b730]:PreconditionDirectionsCpu():natural-gradient-online-test.cc:205) Floored 5 elements of d_{t+1}.
WARNING ([5.5.287-9b730]:PreconditionDirectionsCpu():natural-gradient-online-test.cc:205) Floored 4 elements of d_{t+1}.
WARNING ([5.5.287-9b730]:PreconditionDirectionsCpu():natural-gradient-online-test.cc:205) Floored 5 elements of d_{t+1}.
WARNING ([5.5.287-9b730]:SelfTest():natural-gradient-online.cc:316) Failed to verify W_t (worst error: O[0,0] = 8.46464e+08, d_t = [ 1.33135e-09 ]
ASSERTION_FAILED ([5.5.287-9b730]:SelfTest():natural-gradient-online.cc:283) Assertion failed: (rho_t_ > 0.9 * delta_ * d_t_max)
[ Stack-Trace: ]
kaldi::MessageLogger::LogMessage() const
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::nnet3::OnlineNaturalGradient::SelfTest() const
kaldi::nnet3::OnlineNaturalGradient::PreconditionDirectionsInternal(float, float, bool, kaldi::Vector<float> const&, kaldi::CuMatrixBase<float>*, kaldi::CuMatrixBase<float>*)
kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)
kaldi::nnet3::OnlineNaturalGradient::Init(kaldi::CuMatrixBase<float> const&)
kaldi::nnet3::OnlineNaturalGradient::PreconditionDirections(kaldi::CuMatrixBase<float>*, float*)
kaldi::nnet3::UnitTestPreconditionDirectionsOnline()
main
__libc_start_main
_start

Again these are all without my changes.

danpovey · 2019-04-12T03:48:43Z

Were these errors while running on CPU? You can tell from seeing whether it already printed that it got a GPU. Is this setup using MKL? I suspect the change to MKL may be altering things.

luitjens · 2019-04-12T03:57:40Z

This is not running with MKL. We are using Atlas. It looks like it has gotten to the GPU tests. See the attached log for the nnet2 fail.

danpovey · 2019-04-12T04:02:23Z

Looks like github does not allow attachments.
I am starting to test the current master on our hardware. Could it be those synchronization bugs due to newer hardware?
And you are sure this is master you are testing, none of your changes? Do "make depend" and "make clean" in cudamatrix to be sure it's not a dependency-tracking issue.

luitjens · 2019-04-12T04:11:16Z

I did run make depend and make clean. The only thing I didn't do is a distclean and reconfigure. It certainly could be synchronization bugs on newer hardware. Do you run any regression tests on V100?

danpovey · 2019-04-12T04:15:06Z

No, I haven't tested on v100. For me, on our hardware, the tests work with current master in the nnet3 directory. perhaps shiyin's commit would go some way to resolving it?

luitjens · 2019-04-12T04:18:01Z

possibly. I think we should accept that commit and I'll rebase Monday and retest.

luitjens · 2019-04-16T21:58:18Z

Rebased off master. This now passes make test on V100.

Fixes were

revert f8021d7
fixes to DCTComponent tests: df41d4c 9b730e0

danpovey · 2019-04-22T19:17:44Z

src/cudamatrix/cu-matrix.cc



 #if HAVE_CUDA == 1
+#include <nvToolsExt.h>


is this needed?

looks like this is left over from profiling. it can be safely removed.

danpovey · 2019-04-26T16:20:28Z

@huangruizhe can you please ASAP prepare a PR that reverts just the cusolver-related parts of this PR?

danpovey reviewed Apr 11, 2019

luitjens force-pushed the cudamatrix-fixes branch from 57e7104 to c3c3d69 Compare April 12, 2019 03:03

luitjens force-pushed the cudamatrix-fixes branch 2 times, most recently from c2465ee to 02beb2e Compare April 16, 2019 21:56

danpovey reviewed Apr 22, 2019

luitjens force-pushed the cudamatrix-fixes branch from 02beb2e to b0ae28c Compare April 22, 2019 19:45

danpovey mentioned this pull request Apr 22, 2019

Make no warp sync assumption for CC7.x #3211

Merged

danpovey merged commit 2c25629 into kaldi-asr:master Apr 22, 2019

huangruizhe mentioned this pull request Apr 27, 2019

removed cusolver #3276

Merged
