Skip to content

Conversation

@luitjens
Copy link
Contributor

Changes:

cu-array-inl.h, cu-packed-matrix.cc:
Remove unecessary synchronization. Synchronization will occur with
stream semantics

cu-device.h, cu-device.cc, cuda-common.h, cuda_64bit.mk
Add a handle for cusolverDN library. Future changes will rely on
this.

cu-kernels-ansi.h, cu-kernels.cu, cu-kernels.h:
Add RowSumMat kernel support which mirriors ColSumMat but operators on rows.

cu-matrix.cc:
make cudaMemset2D asynchronous. Synchronization is handled via
streams.

cu-value.h:
Added -= operator which mirrors += operator

cu-vector.cc, cu-vector.h:
Added ApplyLogSoftMax which matches CPU version.
Remove stream synchronization on AddMatVec (handled by streams)
Use direct kernel for row sum instead of a mat vec. This is more
efficient as it avoids extra allocation and memset.

cu-sparse-matrix-test.cc:
adjusted epsilon to be more tolerant of order of operations floating
point error.

inline cublasHandle_t GetCublasHandle() { return cublas_handle_; }
inline cusparseHandle_t GetCusparseHandle() { return cusparse_handle_; }
inline curandGenerator_t GetCurandHandle() { return curand_handle_; }
inline cusolverDnHandle_t GetCusolverDnHandle() { return cusolverdn_handle_; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed for something just yet? Not that I necessarily object, just asking.

@luitjens
Copy link
Contributor Author

luitjens commented Apr 11, 2019 via email

@danpovey
Copy link
Contributor

Hm, OK, I'll merge. Can you first confirm that you ran the tests in src/ with CUDA enabled, and none failed, other than the dct thing which we are fixing?

@luitjens
Copy link
Contributor Author

luitjens commented Apr 12, 2019 via email

@danpovey
Copy link
Contributor

danpovey commented Apr 12, 2019 via email

@luitjens
Copy link
Contributor Author

luitjens commented Apr 12, 2019 via email

@luitjens
Copy link
Contributor Author

luitjens commented Apr 12, 2019 via email

@danpovey
Copy link
Contributor

Were these errors while running on CPU? You can tell from seeing whether it already printed that it got a GPU. Is this setup using MKL? I suspect the change to MKL may be altering things.

@luitjens
Copy link
Contributor Author

luitjens commented Apr 12, 2019 via email

@danpovey
Copy link
Contributor

Looks like github does not allow attachments.
I am starting to test the current master on our hardware. Could it be those synchronization bugs due to newer hardware?
And you are sure this is master you are testing, none of your changes? Do "make depend" and "make clean" in cudamatrix to be sure it's not a dependency-tracking issue.

@luitjens
Copy link
Contributor Author

luitjens commented Apr 12, 2019 via email

@danpovey
Copy link
Contributor

danpovey commented Apr 12, 2019 via email

@luitjens
Copy link
Contributor Author

luitjens commented Apr 12, 2019 via email

@luitjens luitjens force-pushed the cudamatrix-fixes branch 2 times, most recently from c2465ee to 02beb2e Compare April 16, 2019 21:56
@luitjens
Copy link
Contributor Author

Rebased off master. This now passes make test on V100.

Fixes were

  1. revert f8021d7
  2. fixes to DCTComponent tests: df41d4c 9b730e0


#if HAVE_CUDA == 1
#include <nvToolsExt.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this is left over from profiling. it can be safely removed.

Changes: cu-array-inl.h, cu-packed-matrix.cc: Remove unecessary synchronization. Synchronization will occur with stream semantics cu-device.h, cu-device.cc, cuda-common.h, cuda_64bit.mk Add a handle for cusolverDN library. Future changes will rely on this. cu-kernels-ansi.h, cu-kernels.cu, cu-kernels.h: Add RowSumMat kernel support which mirriors ColSumMat but operators on rows. cu-matrix.cc: make cudaMemset2D asynchronous. Synchronization is handled via streams. cu-value.h: Added -= operator which mirriors += operator cu-vector.cc, cu-vector.h: Added ApplyLogSoftMax which matches CPU version. Remove stream synchronization on AddMatVec (handled by streams) Use direct kernel for row sum instead of a mat vec. This is more efficient as it avoids extra allocation and memset. cu-sparse-matrix-test.cc: adjusted epislon to be more tolerant of order of operations floating point error.
@danpovey danpovey merged commit 2c25629 into kaldi-asr:master Apr 22, 2019
@danpovey
Copy link
Contributor

@huangruizhe can you please ASAP prepare a PR that reverts just the cusolver-related parts of this PR?

@huangruizhe huangruizhe mentioned this pull request Apr 27, 2019
danpovey pushed a commit to danpovey/kaldi that referenced this pull request Jun 19, 2019
…aldi-asr#3221) Changes: cu-array-inl.h, cu-packed-matrix.cc: Remove unecessary synchronization. Synchronization will occur with stream semantics cu-device.h, cu-device.cc, cuda-common.h, cuda_64bit.mk Add a handle for cusolverDN library. Future changes will rely on this. cu-kernels-ansi.h, cu-kernels.cu, cu-kernels.h: Add RowSumMat kernel support which mirriors ColSumMat but operators on rows. cu-matrix.cc: make cudaMemset2D asynchronous. Synchronization is handled via streams. cu-value.h: Added -= operator which mirriors += operator cu-vector.cc, cu-vector.h: Added ApplyLogSoftMax which matches CPU version. Remove stream synchronization on AddMatVec (handled by streams) Use direct kernel for row sum instead of a mat vec. This is more efficient as it avoids extra allocation and memset. cu-sparse-matrix-test.cc: adjusted epislon to be more tolerant of order of operations floating point error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants