Card Status:
Tue Oct 17 10:25:24 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.73                 Driver Version: 384.73                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID P4-4Q          On   | 00000000:02:02.0  On |                  N/A |
| N/A   N/A    P8    N/A /  N/A |    495MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1510      G   /usr/bin/X                                    77MiB |
|    0     16459      G   /usr/bin/gnome-shell                         113MiB |
+-----------------------------------------------------------------------------+
Start of run: Tue Oct 17 10:25:24 CDT 2017
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/alignedTypes] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 [GRID P4-4Q] has 20 MP(s) x 128 (Cores/MP) = 2560 (Cores) > Compute scaling value = 1.00 > Memory Size = 49999872 Allocating memory... Generating host input data array... Uploading input data to GPU memory... Testing misaligned types... uint8... Avg. time: 1.774187 ms / Copy throughput: 26.246387 GB/s. TEST OK uint16... Avg. time: 1.033437 ms / Copy throughput: 45.059338 GB/s. TEST OK RGBA8_misaligned... Avg. time: 0.855937 ms / Copy throughput: 54.403517 GB/s. TEST OK LA32_misaligned... Avg. time: 0.622406 ms / Copy throughput: 74.816103 GB/s. TEST OK RGB32_misaligned... Avg. time: 0.986250 ms / Copy throughput: 47.215220 GB/s. TEST OK RGBA32_misaligned... Avg. time: 1.052531 ms / Copy throughput: 44.241926 GB/s. TEST OK Testing aligned types... RGBA8... Avg. time: 0.660156 ms / Copy throughput: 70.537861 GB/s. TEST OK I32... Avg. time: 0.660187 ms / Copy throughput: 70.534523 GB/s. TEST OK LA32... Avg. time: 0.612375 ms / Copy throughput: 76.041654 GB/s. TEST OK RGB32... Avg. time: 0.715219 ms / Copy throughput: 65.107369 GB/s. TEST OK RGBA32... Avg. time: 0.606094 ms / Copy throughput: 76.829712 GB/s. TEST OK RGBA32_2... Avg. time: 0.724719 ms / Copy throughput: 64.253905 GB/s. TEST OK [alignedTypes] -> Test Results: 0 Failures Shutting down... Test passed
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/asyncAPI] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 CUDA device [GRID P4-4Q] time spent executing by the GPU: 72.02 time spent by CPU in CUDA calls: 0.06 CPU executed 187171 iterations while waiting for GPU to finish
[CUDA Bandwidth Test] - Starting... Running on... Device 0: GRID P4-4Q Quick Mode Host to Device Bandwidth, 1 Device(s) PINNED Memory Transfers Transfer Size (Bytes) Bandwidth(MB/s) 33554432 6355.7 Device to Host Bandwidth, 1 Device(s) PINNED Memory Transfers Transfer Size (Bytes) Bandwidth(MB/s) 33554432 4529.5 Device to Device Bandwidth, 1 Device(s) PINNED Memory Transfers Transfer Size (Bytes) Bandwidth(MB/s) 33554432 157799.9 Result = PASS NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
batchCUBLAS Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 ==== Running single kernels ==== Testing sgemm #### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x40000000, 2) #### args: lda=128 ldb=128 ldc=128 ^^^^ elapsed = 0.00006199 sec GFLOPS=67.6623 @@@@ sgemm test OK Testing dgemm #### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0) #### args: lda=128 ldb=128 ldc=128 ^^^^ elapsed = 0.00007510 sec GFLOPS=55.8482 @@@@ dgemm test OK ==== Running N=10 without streams ==== Testing sgemm #### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x00000000, 0) #### args: lda=128 ldb=128 ldc=128 ^^^^ elapsed = 0.00008106 sec GFLOPS=517.417 @@@@ sgemm test OK Testing dgemm #### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0) #### args: lda=128 ldb=128 ldc=128 ^^^^ elapsed = 0.00045013 sec GFLOPS=93.179 @@@@ dgemm test OK ==== Running N=10 with streams ==== Testing sgemm #### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x40000000, 2) beta= (0x40000000, 2) #### args: lda=128 ldb=128 ldc=128 ^^^^ elapsed = 0.00010586 sec GFLOPS=396.22 @@@@ sgemm test OK Testing dgemm #### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0) #### args: lda=128 ldb=128 ldc=128 ^^^^ elapsed = 0.00034285 sec GFLOPS=122.338 @@@@ dgemm test OK ==== Running N=10 batched ==== Testing sgemm #### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x3f800000, 1) beta= (0xbf800000, -1) #### args: lda=128 ldb=128 ldc=128 ^^^^ elapsed = 0.00006104 sec GFLOPS=687.195 @@@@ sgemm test OK Testing dgemm #### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2) #### args: lda=128 ldb=128 ldc=128 ^^^^ elapsed = 0.00029707 sec GFLOPS=141.189 @@@@ dgemm test OK Test Summary 0 error(s)
WARNING: it is assumed that the matrices are stores in Matrix Market format with double as elementtype Usage: ./BiCGStab -F[matrix.mtx] [-E] [-D] Starting [~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BiCGStab] argv[0] = ~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BiCGStabUsing default input file [../../../../7_CUDALibraries/BiCGStab/gr_900_900_crg.mtx] GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Testing dbicgstab ^^^^ M=900, N=900, nnz=4322 Copy matrix from CPU to GPU, time(s) = 0.00008607 analysis lower 0.004964 (s), upper 0.001323 (s) CUSPARSE csrilu0 time(s) = 0.00109410 gpu total solve time 1.362471 (s), matvec time 0.117151 (s)
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/binomialOptions] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Generating input data... Running GPU binomial tree... Options count : 1024 Time steps : 2048 binomialOptionsGPU() time: 4.246000 msec Options per second : 241168.168883 Running CPU binomial tree... Comparing the results... GPU binomial vs. Black-Scholes L1 norm: 2.220162E-04 CPU binomial vs. Black-Scholes L1 norm: 2.217447E-04 CPU binomial vs. GPU binomial L1 norm: 1.161472E-06 Shutting down... NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. Test passed
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/binomialOptions_nvrtc] - Starting... Generating input data... Running GPU binomial tree... > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability Options count : 1024 Time steps : 2048 binomialOptionsGPU() time: 409.170013 msec Options per second : 2502.627188 Running CPU binomial tree... Comparing the results... GPU binomial vs. Black-Scholes L1 norm: 2.216577E-04 CPU binomial vs. Black-Scholes L1 norm: 9.435265E-05 CPU binomial vs. GPU binomial L1 norm: 1.513570E-04 Shutting down... NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. Test passed
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BlackScholes] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Initializing data... ...allocating CPU memory for options. ...allocating GPU memory for options. ...generating input data in CPU mem. ...copying input data to GPU mem. Data init done. Executing Black-Scholes GPU kernel (512 iterations)... Options count : 8000000 BlackScholesGPU() time : 0.481787 msec Effective memory bandwidth: 166.048442 GB/s Gigaoptions per second : 16.604844 BlackScholes, Throughput = 16.6048 GOptions/s, Time = 0.00048 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128 Reading back GPU results... Checking the results... ...running CPU calculations. Comparing the results... L1 norm: 1.741792E-07 Max absolute error: 1.192093E-05 Shutting down... ...releasing GPU memory. ...releasing CPU memory. Shutdown done. [BlackScholes] - Test Summary NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. Test passed
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BlackScholes_nvrtc] - Starting... Initializing data... ...allocating CPU memory for options. > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability ...allocating GPU memory for options. ...generating input data in CPU mem. ...copying input data to GPU mem. Data init done. Executing Black-Scholes GPU kernel (512 iterations)... Options count : 8000000 BlackScholesGPU() time : 0.481795 msec Effective memory bandwidth: 166.045751 GB/s Gigaoptions per second : 16.604575 BlackScholes, Throughput = 16.6046 GOptions/s, Time = 0.00048 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128 Reading back GPU results... Checking the results... ...running CPU calculations. Comparing the results... L1 norm: 1.741792E-07 Max absolute error: 1.192093E-05 Shutting down... ...releasing GPU memory. ...releasing CPU memory. Shutdown done. [~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BlackScholes_nvrtc] - Test Summary Test passed
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/boxFilterNPP Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 cudaSetDevice GPU0 = GRID P4-4Q NPP Library Version 9.0.176 CUDA Driver Version: 9.0 CUDA Runtime Version: 9.0 Device 0: < GRID P4-4Q >, Compute SM 6.1 detected boxFilterNPP opened: <../../../../common/data/Lena.pgm> successfully! Saved image: ../../../../common/data/Lena_boxFilter.pgm
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Read 3223503 byte corpus from ../../../../6_Advanced/c++11_cuda/warandpeace.txt counted 107310 instances of 'x', 'y', 'z', or 'w' in "../../../../6_Advanced/c++11_cuda/warandpeace.txt"
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/cannyEdgeDetectorNPP Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 cudaSetDevice GPU0 = GRID P4-4Q NPP Library Version 9.0.176 CUDA Driver Version: 9.0 CUDA Runtime Version: 9.0 Device 0: < GRID P4-4Q >, Compute SM 6.1 detected cannyEdgeDetectionNPP opened: <../../../../common/data/Lena.pgm> successfully! Saved image: ../../../../common/data/Lena_cannyEdgeDetection.pgm
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 GPU device GRID P4-4Q has compute capabilities (SM 6.1) Running qsort on 5000 elements with seed 100, on GRID P4-4Q cdpAdvancedQuicksort PASSED Sorted 5000 elems in 0.422 ms (11.851 Melems/sec)
Running on GPU 0 (GRID P4-4Q) Computing Bezier Lines (CUDA Dynamic Parallelism Version) ... Done!
Starting LU Decomposition (CUDA Dynamic Parallelism) GPU Device 0: "GRID P4-4Q" with compute capability 6.1 GPU device GRID P4-4Q has compute capabilities (SM 6.1) Compute LU decomposition of a random 1024x1024 matrix using CUDA Dynamic Parallelism Launching single task from device... Successfully launched CDP kernel GPU perf(dgetrf)= 1.585 Gflops Checking results... done Tests suceeded
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 GPU device GRID P4-4Q has compute capabilities (SM 6.1) Launching CDP kernel to build the quadtree Results: OK
starting Simple Print (CUDA Dynamic Parallelism) Running on GPU 0 (GRID P4-4Q) *************************************************************************** The CPU launches 2 blocks of 2 threads each. On the device each thread will launch 2 blocks of 2 threads each. The GPU we will do that recursively until it reaches max_depth=2 In total 2+8=10 blocks are launched!!! (8 from the GPU) *************************************************************************** Launching cdp_kernel() with CUDA Dynamic Parallelism: BLOCK 1 launched by the host BLOCK 0 launched by the host | BLOCK 4 launched by thread 0 of block 0 | BLOCK 2 launched by thread 0 of block 0 | BLOCK 3 launched by thread 0 of block 1 | BLOCK 5 launched by thread 0 of block 1 | BLOCK 6 launched by thread 1 of block 0 | BLOCK 7 launched by thread 1 of block 0 | BLOCK 8 launched by thread 1 of block 1 | BLOCK 9 launched by thread 1 of block 1
Running on GPU 0 (GRID P4-4Q) Initializing data: Running quicksort on 128 elements Launching kernel on the GPU Validating results: OK
CUDA Clock sample GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Average clocks/block = 3092.281250
CUDA Clock sample > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability Average clocks/block = 3056.218750
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/concurrentKernels] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > Detected Compute SM 6.1 hardware with 20 multi-processors Expected time for serial execution of 8 kernels = 0.080s Expected time for concurrent execution of 8 kernels = 0.010s Measured time for sample = 0.010s Test passed
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > GPU device has 20 Multi-Processors, SM 6.1 compute capabilities iteration = 1, residual = 4.449882e+01 iteration = 2, residual = 3.245218e+00 iteration = 3, residual = 2.690220e-01 iteration = 4, residual = 2.307639e-02 iteration = 5, residual = 1.993140e-03 iteration = 6, residual = 1.846192e-04 iteration = 7, residual = 1.693378e-05 iteration = 8, residual = 1.600115e-06 Test Summary: Error amount = 0.000000
Unified Memory not supported on this device Starting [conjugateGradientMultiBlockCG]... GPU Device 0: "GRID P4-4Q" with compute capability 6.1
conjugateGradientPrecond starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 GPU selected Device ID = 0 > GPU device has 20 Multi-Processors, SM 6.1 compute capabilities laplace dimension = 128 Convergence of conjugate gradient without preconditioning: iteration = 564, residual = 9.462559e-13 Convergence Test: OK Convergence of conjugate gradient using incomplete LU preconditioning: iteration = 188, residual = 9.029262e-13 Convergence Test: OK Test Summary: Counted total of 0 errors qaerr1 = 0.000005 qaerr2 = 0.000003
Unified Memory not supported on this device Starting [conjugateGradientUM]... GPU Device 0: "GRID P4-4Q" with compute capability 6.1
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/convolutionFFT2D] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Testing built-in R2C / C2R FFT-based convolution ...allocating memory ...generating random input data ...creating R2C & C2R FFT plans for 2048 x 2048 ...uploading to GPU and padding convolution kernel and input data ...transforming convolution kernel ...running GPU FFT convolution: 1967.535669 MPix/s (2.033000 ms) ...reading back GPU convolution results ...running reference CPU convolution ...comparing the results: rel L2 = 1.058234E-07 (max delta = 9.978815E-07) L2norm Error OK ...shutting down Testing custom R2C / C2R FFT-based convolution ...allocating memory ...generating random input data ...creating C2C FFT plan for 2048 x 1024 ...uploading to GPU and padding convolution kernel and input data ...transforming convolution kernel ...running GPU FFT convolution: 1800.180112 MPix/s (2.222000 ms) ...reading back GPU FFT results ...running reference CPU convolution ...comparing the results: rel L2 = 1.087169E-07 (max delta = 1.084381E-06) L2norm Error OK ...shutting down Testing updated custom R2C / C2R FFT-based convolution ...allocating memory ...generating random input data ...creating C2C FFT plan for 2048 x 1024 ...uploading to GPU and padding convolution kernel and input data ...transforming convolution kernel ...running GPU FFT convolution: 2481.389596 MPix/s (1.612000 ms) ...reading back GPU FFT results ...running reference CPU convolution ...comparing the results: rel L2 = 1.084221E-07 (max delta = 9.397971E-07) L2norm Error OK ...shutting down Test Summary: 0 errors Test passed
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/convolutionSeparable] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Image Width x Height = 3072 x 3072 Allocating and initializing host arrays... Allocating and initializing CUDA arrays... Running GPU convolution (16 identical iterations)... convolutionSeparable, Throughput = 9717.1599 MPixels/sec, Time = 0.00097 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0 Reading back GPU results... Checking the results... ...running convolutionRowCPU() ...running convolutionColumnCPU() ...comparing the results ...Relative L2 norm: 0.000000E+00 Shutting down... Test passed
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/convolutionTexture] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Initializing data... Running GPU rows convolution (10 identical iterations)... Average convolutionRowsGPU() time: 0.622700 msecs; //7577.632456 Mpix/s Copying convolutionRowGPU() output back to the texture... cudaMemcpyToArray() time: 0.267000 msecs; //17672.629892 Mpix/s Running GPU columns convolution (10 iterations) Average convolutionColumnsGPU() time: 0.614100 msecs; //7683.752090 Mpix/s Reading back GPU results... Checking the results... ...running convolutionRowsCPU() ...running convolutionColumnsCPU() Relative L2 norm: 0.000000E+00 Shutting down... Test passed
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Hello World. Hello World.
C++ Function Overloading starting... DevicecheckCudaErrors Count: 1 GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Shared Size: 1024 Constant Size: 0 Local Size: 0 Max Threads Per Block: 1024 Number of Registers: 10 PTX Version: 60 Binary Version: 60 simple_kernel(const int *pIn, int *pOut, int a) PASSED Shared Size: 2048 Constant Size: 0 Local Size: 0 Max Threads Per Block: 1024 Number of Registers: 11 PTX Version: 60 Binary Version: 60 simple_kernel(const int2 *pIn, int *pOut, int a) PASSED Shared Size: 2048 Constant Size: 0 Local Size: 0 Max Threads Per Block: 1024 Number of Registers: 11 PTX Version: 60 Binary Version: 60 simple_kernel(const int *pIn1, const int *pIn2, int *pOut, int a) PASSED
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/cudaOpenMP Starting... number of host CPUs: 4 number of CUDA devices: 1 0: GRID P4-4Q --------------------------- CPU thread 0 (of 1) uses CUDA device 0 ---------------------------
Initializing... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 cudaTensorCoreGemm requires requires SM 7.0 or higher to use Tensor Cores. Exiting...
Error: Condition (allocation_cb == 1) failed at cuHook.cpp:115 cuHook sample failed (Didn't receive the allocation callback)
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 step 1: read matrix market format Using default input file [../../../../7_CUDALibraries/cuSolverDn_LinearSolver/gr_900_900_crg.mtx] sparse matrix A is 900 x 900 with 7744 nonzeros, base=1 step 2: convert CSR(A) to dense matrix step 3: set right hand side vector (b) to 1 step 4: prepare data on device step 5: solve A*x = b timing: cholesky = 0.011339 sec step 6: evaluate residual |b - A*x| = 1.136868E-13 |A| = 1.600000E+01 |x| = 2.357708E+01 |b - A*x|/(|A|*|x|) = 3.013701E-16
step 1.1: preparation step 1.1: read matrix market format GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Using default input file [../../../../7_CUDALibraries/cuSolverRf/lap2D_5pt_n100.mtx] WARNING: cusolverRf only works for base-0 sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=0 step 1.2: set right hand side vector (b) to 1 step 2: reorder the matrix to reduce zero fill-in Q = symrcm(A) or Q = symamd(A) step 3: B = Q*A*Q^T step 4: solve A*x = b by LU(B) in cusolverSp step 4.1: create opaque info structure step 4.2: analyze LU(B) to know structure of Q and R, and upper bound for nnz(L+U) step 4.3: workspace for LU(B) step 4.4: compute Ppivot*B = L*U step 4.5: check if the matrix is singular step 4.6: solve A*x = b i.e. solve B*(Qx) = Q*b step 4.7: evaluate residual r = b - A*x (result on CPU) (CPU) |b - A*x| = 4.547474E-12 (CPU) |A| = 8.000000E+00 (CPU) |x| = 7.513384E+02 (CPU) |b - A*x|/(|A|*|x|) = 7.565621E-16 step 5: extract P, Q, L and U from P*B*Q^T = L*U L has implicit unit diagonal nnzL = 671550, nnzU = 681550 step 6: form P*A*Q^T = L*U step 6.1: P = Plu*Qreroder step 6.2: Q = Qlu*Qreorder step 7: create cusolverRf handle step 8: set parameters for cusolverRf step 9: assemble P*A*Q = L*U step 10: analyze to extract parallelism step 11: import A to cusolverRf step 12: refactorization step 13: solve A*x = b step 14: evaluate residual r = b - A*x (result on GPU) (GPU) |b - A*x| = 4.547474E-12 (GPU) |A| = 8.000000E+00 (GPU) |x| = 7.513384E+02 (GPU) |b - A*x|/(|A|*|x|) = 7.565621E-16 ===== statistics nnz(A) = 49600, nnz(L+U) = 1353100, zero fill-in ratio = 27.280242 ===== timing profile reorder A : 0.006283 sec B = Q*A*Q^T : 0.001430 sec cusolverSp LU analysis: 0.005644 sec cusolverSp LU factor : 0.137884 sec cusolverSp LU solve : 0.003406 sec cusolverSp LU extract : 0.011348 sec cusolverRf assemble : 0.022505 sec cusolverRf reset : 0.000154 sec cusolverRf refactor : 0.169308 sec cusolverRf solve : 0.186020 sec
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Using default input file [../../../../7_CUDALibraries/cuSolverSp_LinearSolver/lap2D_5pt_n100.mtx] step 1: read matrix market format sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1 step 2: reorder the matrix A to minimize zero fill-in if the user choose a reordering by -P=symrcm or -P=symamd The reordering will overwrite A such that A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A) step 2.1: set right hand side vector (b) to 1 step 3: prepare data on device step 4: solve A*x = b on CPU step 5: evaluate residual r = b - A*x (result on CPU) (CPU) |b - A*x| = 4.547474E-12 (CPU) |A| = 8.000000E+00 (CPU) |x| = 7.513384E+02 (CPU) |b - A*x|/(|A|*|x|) = 7.565621E-16 step 6: solve A*x = b on GPU step 7: evaluate residual r = b - A*x (result on GPU) (GPU) |b - A*x| = 1.818989E-12 (GPU) |A| = 8.000000E+00 (GPU) |x| = 7.513384E+02 (GPU) |b - A*x|/(|A|*|x|) = 3.026248E-16 timing chol: CPU = 0.054121 sec , GPU = 0.131622 sec
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Using default input file [../../../../7_CUDALibraries/cuSolverSp_LowlevelCholesky/lap2D_5pt_n100.mtx] step 1: read matrix market format sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1 step 2: create opaque info structure step 3: analyze chol(A) to know structure of L step 4: workspace for chol(A) step 5: compute A = L*L^T step 6: check if the matrix is singular step 7: solve A*x = b step 8: evaluate residual r = b - A*x (result on CPU) (CPU) |b - A*x| = 3.637979E-12 (CPU) |A| = 8.000000E+00 (CPU) |x| = 7.513384E+02 (CPU) |b - A*x|/(|A|*|x|) = 6.052497E-16 step 9: create opaque info structure step 10: analyze chol(A) to know structure of L step 11: workspace for chol(A) step 12: compute A = L*L^T step 13: check if the matrix is singular step 14: solve A*x = b (GPU) |b - A*x| = 1.364242E-12 (GPU) |b - A*x|/(|A|*|x|) = 2.269686E-16
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Using default input file [../../../../7_CUDALibraries/cuSolverSp_LowlevelQR/lap2D_5pt_n100.mtx] step 1: read matrix market format sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1 step 2: create opaque info structure step 3: analyze qr(A) to know structure of L step 4: workspace for qr(A) step 5: compute A = L*L^T step 6: check if the matrix is singular step 7: solve A*x = b step 8: evaluate residual r = b - A*x (result on CPU) (CPU) |b - A*x| = 6.821210E-12 (CPU) |A| = 8.000000E+00 (CPU) |x| = 7.513384E+02 (CPU) |b - A*x|/(|A|*|x|) = 1.134843E-15 step 9: create opaque info structure step 10: analyze qr(A) to know structure of L step 11: workspace for qr(A) GPU buffer size = 51372672 bytes step 12: compute A = L*L^T step 13: check if the matrix is singular step 14: solve A*x = b (GPU) |b - A*x| = 5.911716E-12 (GPU) |b - A*x|/(|A|*|x|) = 9.835307E-16
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/dct8x8 Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 CUDA sample DCT/IDCT implementation =================================== Loading test image: barbara.bmp... [512 x 512]... Success Running Gold 1 (CPU) version... Success Running Gold 2 (CPU) version... Success Running CUDA 1 (GPU) version... Success Running CUDA 2 (GPU) version... 34357.012287 MPix/s //0.007630 ms Success Running CUDA short (GPU) version... Success Dumping result to barbara_gold1.bmp... Success Dumping result to barbara_gold2.bmp... Success Dumping result to barbara_cuda1.bmp... Success Dumping result to barbara_cuda2.bmp... Success Dumping result to barbara_cuda_short.bmp... Success Processing time (CUDA 1) : 0.071000 ms Processing time (CUDA 2) : 0.007630 ms Processing time (CUDA short): 0.085000 ms PSNR Original <---> CPU(Gold 1) : 32.777073 PSNR Original <---> CPU(Gold 2) : 32.777050 PSNR Original <---> GPU(CUDA 1) : 32.777092 PSNR Original <---> GPU(CUDA 2) : 32.777077 PSNR Original <---> GPU(CUDA short): 32.749447 PSNR CPU(Gold 1) <---> GPU(CUDA 1) : 64.019310 PSNR CPU(Gold 2) <---> GPU(CUDA 2) : 71.777740 PSNR CPU(Gold 2) <---> GPU(CUDA short): 42.258053 Test Summary... Test passed
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/deviceQuery Starting... CUDA Device Query (Runtime API) version (CUDART static linking) Detected 1 CUDA Capable device(s) Device 0: "GRID P4-4Q" CUDA Driver Version / Runtime Version 9.0 / 9.0 CUDA Capability Major/Minor version number: 6.1 Total amount of global memory: 4096 MBytes (4294705152 bytes) (20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores GPU Max Clock rate: 1114 MHz (1.11 GHz) Memory Clock rate: 3003 Mhz Memory Bus Width: 256-bit L2 Cache Size: 2097152 bytes Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Supports Cooperative Kernel Launch: Yes Supports MultiDevice Co-op Kernel Launch: Yes Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 2 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1 Result = PASS
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/deviceQueryDrv Starting... CUDA Device Query (Driver API) statically linked version Detected 1 CUDA Capable device(s) Device 0: "GRID P4-4Q" CUDA Driver Version: 9.0 CUDA Capability Major/Minor version number: 6.1 Total amount of global memory: 4096 MBytes (4294705152 bytes) (20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores GPU Max Clock rate: 1114 MHz (1.11 GHz) Memory Clock rate: 3003 Mhz Memory Bus Width: 256-bit L2 Cache Size: 2097152 bytes Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384) Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Texture alignment: 512 bytes Maximum memory pitch: 2147483647 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Concurrent kernel execution: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Supports Cooperative Kernel Launch: Yes Supports MultiDevice Co-op Kernel Launch: Yes Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 2 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > Result = PASS
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/dwtHaar1D Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 source file = "../../../../3_Imaging/dwtHaar1D/data/signal.dat" reference file = "result.dat" gold file = "../../../../3_Imaging/dwtHaar1D/data/regression.gold.dat" Reading signal from "../../../../3_Imaging/dwtHaar1D/data/signal.dat" Writing result to "result.dat" Reading reference result from "../../../../3_Imaging/dwtHaar1D/data/regression.gold.dat" Test success!
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/dxtc Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Image Loaded '../../../../3_Imaging/dxtc/data/lena_std.ppm', 512 x 512 pixels Running DXT Compression on 512 x 512 image... 16384 Blocks, 64 Threads per Block, 1048576 Threads in Grid... dxtc, Throughput = 99.9405 MPixels/s, Time = 0.00262 s, Size = 262144 Pixels, NumDevsUsed = 1, Workgroup = 64 Checking accuracy... Deviation at ( 9, 1): 0.791667 rms Deviation at ( 99, 1): 1.041667 rms Deviation at ( 33, 2): 2.645833 rms Deviation at ( 38, 4): 1.916667 rms Deviation at ( 57, 4): 0.854167 rms Deviation at ( 20, 7): 1.041667 rms Deviation at ( 57, 7): 0.458333 rms Deviation at ( 8, 9): 0.937500 rms Deviation at ( 31, 9): 0.770833 rms Deviation at ( 13, 11): 1.041667 rms Deviation at ( 88, 11): 0.729167 rms Deviation at ( 4, 13): 8.562500 rms Deviation at ( 28, 13): 0.562500 rms Deviation at ( 90, 13): 0.708333 rms Deviation at ( 25, 14): 0.520833 rms Deviation at ( 87, 16): 0.708333 rms Deviation at ( 24, 19): 0.916667 rms Deviation at ( 25, 19): 0.625000 rms Deviation at ( 26, 19): 1.041667 rms Deviation at ( 55, 20): 4.791667 rms Deviation at ( 20, 23): 1.541667 rms Deviation at ( 99, 23): 3.312500 rms Deviation at ( 45, 24): 18.104166 rms Deviation at ( 8, 28): 0.895833 rms Deviation at ( 21, 30): 1.562500 rms Deviation at ( 115, 32): 24.104166 rms Deviation at ( 2, 33): 0.854167 rms Deviation at ( 102, 33): 2.250000 rms Deviation at ( 50, 35): 26.958334 rms Deviation at ( 12, 38): 2.166667 rms Deviation at ( 96, 39): 1.041667 rms Deviation at ( 40, 40): 0.270833 rms Deviation at ( 43, 44): 2.250000 rms Deviation at ( 54, 44): 4.791667 rms Deviation at ( 46, 46): 2.875000 rms Deviation at ( 116, 46): 0.604167 rms Deviation at ( 117, 46): 6.833333 rms Deviation at ( 117, 48): 0.937500 rms Deviation at ( 23, 51): 3.520833 rms Deviation at ( 67, 54): 5.687500 rms Deviation at ( 26, 55): 0.854167 rms Deviation at ( 21, 56): 5.000000 rms Deviation at ( 24, 56): 0.562500 rms Deviation at ( 30, 57): 0.937500 rms Deviation at ( 126, 57): 1.208333 rms Deviation at ( 21, 59): 2.541667 rms Deviation at ( 120, 59): 0.104167 rms Deviation at ( 112, 60): 1.125000 rms Deviation at ( 76, 61): 1.666667 rms Deviation at ( 77, 61): 1.083333 rms Deviation at ( 75, 62): 0.937500 rms Deviation at ( 121, 62): 0.937500 rms Deviation at ( 124, 64): 2.854167 rms Deviation at ( 78, 66): 0.541667 rms Deviation at ( 106, 68): 0.375000 rms Deviation at ( 16, 70): 3.104167 rms Deviation at ( 10, 71): 0.937500 rms Deviation at ( 108, 71): 0.354167 rms Deviation at ( 0, 72): 0.854167 rms Deviation at ( 118, 72): 5.562500 rms Deviation at ( 11, 73): 0.541667 rms Deviation at ( 68, 74): 1.937500 rms Deviation at ( 70, 76): 1.791667 rms Deviation at ( 124, 76): 3.354167 rms Deviation at ( 103, 78): 0.375000 rms Deviation at ( 74, 79): 0.270833 rms Deviation at ( 108, 79): 0.083333 rms Deviation at ( 43, 82): 24.979166 rms Deviation at ( 58, 82): 2.833333 rms Deviation at ( 67, 82): 3.125000 rms Deviation at ( 78, 82): 2.437500 rms Deviation at ( 123, 84): 0.541667 rms Deviation at ( 127, 88): 0.229167 rms Deviation at ( 99, 89): 0.770833 rms Deviation at ( 93, 91): 0.666667 rms Deviation at ( 118, 91): 1.125000 rms Deviation at ( 115, 92): 0.083333 rms Deviation at ( 115, 93): 0.083333 rms Deviation at ( 45, 94): 0.166667 rms Deviation at ( 14, 95): 1.937500 rms Deviation at ( 69, 95): 1.875000 rms Deviation at ( 106, 95): 1.125000 rms Deviation at ( 107, 95): 3.708333 rms Deviation at ( 13, 96): 1.354167 rms Deviation at ( 115, 98): 0.187500 rms Deviation at ( 118, 98): 0.187500 rms Deviation at ( 116, 101): 0.187500 rms Deviation at ( 87, 106): 0.270833 rms Deviation at ( 67, 107): 0.708333 rms Deviation at ( 74, 107): 0.375000 rms Deviation at ( 65, 109): 0.770833 rms Deviation at ( 89, 109): 0.708333 rms Deviation at ( 118, 109): 3.854167 rms Deviation at ( 88, 111): 0.208333 rms Deviation at ( 64, 113): 0.708333 rms Deviation at ( 84, 113): 0.333333 rms Deviation at ( 75, 114): 2.083333 rms Deviation at ( 66, 115): 0.770833 rms Deviation at ( 89, 116): 0.770833 rms Deviation at ( 19, 118): 5.270833 rms Deviation at ( 76, 121): 0.104167 rms Deviation at ( 70, 122): 0.708333 rms Deviation at ( 91, 122): 0.208333 rms Deviation at ( 75, 123): 0.854167 rms Deviation at ( 61, 124): 0.937500 rms Deviation at ( 91, 124): 0.270833 rms Deviation at ( 91, 125): 1.020833 rms RMS(reference, result) = 0.015238 Test passed
Starting eigenvalues GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Matrix size: 2048 x 2048 Precision: 0.000010 Iterations to be timed: 100 Result filename: 'eigenvalues.dat' Gerschgorin interval: -2.894310 / 2.923303 Average time step 1: 2.891819 ms Average time step 2, one intervals: 3.538522 ms Average time step 2, mult intervals: 7.413971 ms Average time TOTAL: 13.875119 ms Test Succeeded!
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/fastWalshTransform Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Initializing data... ...allocating CPU memory ...allocating GPU memory ...generating data Data length: 8388608; kernel length: 128 Running GPU dyadic convolution using Fast Walsh Transform... GPU time: 9.447000 ms; GOP/s: 30.634804 Reading back GPU results... Running straightforward CPU dyadic convolution... Comparing the results... Shutting down... L2 norm: 1.021579E-07 Test passed
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/FDTD3d Starting... Set-up, based upon target device GMEM size... getTargetDeviceGlobalMemSize cudaGetDeviceCount GPU Device 0: "GRID P4-4Q" with compute capability 6.1 cudaGetDeviceProperties generateRandomData FDTD on 376 x 376 x 376 volume with symmetric filter radius 4 for 5 timesteps... fdtdReference... calloc intermediate Host FDTD loop t = 0 t = 1 t = 2 t = 3 t = 4 fdtdReference complete fdtdGPU... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 set block size to 32x16 set grid size to 12x24 GPU FDTD loop t = 0 launch kernel t = 1 launch kernel t = 2 launch kernel t = 3 launch kernel t = 4 launch kernel fdtdGPU complete CompareData (tolerance 0.000100)...
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/FilterBorderControlNPP Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 cudaSetDevice GPU0 = GRID P4-4Q NPP Library Version 9.0.176 CUDA Driver Version: 9.0 CUDA Runtime Version: 9.0 Device 0: < GRID P4-4Q >, Compute SM 6.1 detected gradientFilterBorderNPP opened <../../../../common/data/Lena.pgm> successfully! Saved image: ./../../../../common/data/Lena.pgm_gradientVectorPrewittBorderX_Vertical.pgm Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal.pgm Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderX_Vertical_WithNoSourceBorders.pgm Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal_WithNoSourceBorders.pgm Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderX_Vertical_BorderDiffs.pgm Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal_BorderDiffs.pgm Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderX_Vertical_WithMixedBorders.pgm Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal_WithMixedBorders.pgm Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderX_Vertical_MixedBorderDiffs.pgm Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal_MixedBorderDiffs.pgm
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Result: 587296.812500
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/freeImageInteropNPP Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 cudaSetDevice GPU0 = GRID P4-4Q NPP Library Version 9.0.176 CUDA Driver Version: 9.0 CUDA Runtime Version: 9.0 Device 0: < GRID P4-4Q >, Compute SM 6.1 detected freeImageInteropNPP opened: <../../../../common/data/Lena.pgm> successfully! Size (512, 512) Pitch 512 Type FIC_MINISBLACK BPP 8
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/histEqualizationNPP Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 cudaSetDevice GPU0 = GRID P4-4Q NPP Library Version 9.0.176 CUDA Driver Version: 9.0 CUDA Runtime Version: 9.0 Device 0: < GRID P4-4Q >, Compute SM 6.1 detected histEqualizationNPP opened: <../../../../common/data/Lena.pgm> successfully! Saved image file ../../../../common/data/Lena_histEqualization.pgm
[[histogram]] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 CUDA device [GRID P4-4Q] has 20 Multi-Processors, Compute 6.1 Initializing data... ...allocating CPU memory. ...generating input data ...allocating GPU memory and copying input data Starting up 64-bin histogram... Running 64-bin GPU histogram for 67108864 bytes (16 runs)... histogram64() time (average) : 0.00048 sec, 138816.0059 MB/sec histogram64, Throughput = 138816.0059 MB/s, Time = 0.00048 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 64 Validating GPU results... ...reading back GPU results ...histogram64CPU() ...comparing the results... ...64-bin histograms match Shutting down 64-bin histogram... Initializing 256-bin histogram... Running 256-bin GPU histogram for 67108864 bytes (16 runs)... histogram256() time (average) : 0.00053 sec, 126740.0624 MB/sec histogram256, Throughput = 126740.0624 MB/s, Time = 0.00053 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 192 Validating GPU results... ...reading back GPU results ...histogram256CPU() ...comparing the results ...256-bin histograms match Shutting down 256-bin histogram... Shutting down... NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. [histogram] - Test Summary Test passed
HSOpticalFlow Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Loading "frame10.ppm" ... Loading "frame11.ppm" ... Computing optical flow on CPU... Computing optical flow on GPU... L1 error : 0.044308
CUDA inline PTX assembler sample GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Test Successful.
CUDA inline PTX assembler sample > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability Test Successful.
[Interval Computing] starting ... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > GPU Device has Compute Capabilities SM 6.1 GPU naive implementation Searching for roots in [0.01, 4]... Found 2 intervals that may contain the root(s) i[0] = [0.999655515093009, 1.00011722206639] i[1] = [1.00011907576551, 1.00044661086269] Number of equations solved: 65536 Time per equation: 13.6764059066772 us Check against Host computation...
NPP Library Version 9.0.176 CUDA Driver Version: 9.0 CUDA Runtime Version: 9.0 Device 0: < GRID P4-4Q >, Compute SM 6.1 detected jpegNPP usage -input=srcfile.jpg (input file JPEG image) -output=destfile.jpg (output file JPEG image) -scale=1.0 (scale multiplier for width and height) Source File: ../../../../common/data/Growth_of_cubic_bacteria_25x16.jpg Output File: scaled.jpg Scale Factor: 0.5 Image Size: 2560x1600x3 Output Size: 1280x800x3
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/lineOfSight] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Line of sight Average time: 1.848300 ms Test passed
[Matrix Multiply Using CUDA] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 MatrixA(320,320), MatrixB(640,320) Computing result using CUDA Kernel... done Performance= 682.11 GFlop/s, Time= 0.192 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block Checking computed result for correctness: Result = PASS NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
[Matrix Multiply CUBLAS] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 MatrixA(640,480), MatrixB(480,320), MatrixC(640,320) Computing result using CUBLAS...done. Performance= 2053.36 GFlop/s, Time= 0.096 msec, Size= 196608000 Ops Computing result using host CPU...done. Comparing CUBLAS Matrix Multiply with CPU results: PASS NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
[ matrixMulDrv (Driver API) ] > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability Total amount of global memory: 4294705152 bytes 64-bit Memory Address: NO > findModulePath <./matrixMul_kernel64.ptx> > initCUDA loading module: <./matrixMul_kernel64.ptx> > PTX JIT log: Processing time: 0.140000 (ms) Checking computed result for correctness: Result = PASS NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
[ matrixMulDynlinkJIT (CUDA dynamic linking) ] > Device 0: "GRID P4-4Q" with Compute 6.1 capability > Compiling CUDA module > PTX JIT log: Test run success!
[Matrix Multiply Using CUDA] - Starting... MatrixA(320,320), MatrixB(640,320) > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability Computing result using CUDA Kernel... Checking computed result for correctness: Result = PASS NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Monte Carlo Estimate Pi (with inline PRNG) ========================================== Estimating Pi on GPU (GRID P4-4Q) Precision: single Number of sims: 100000 Tolerance: 1.000000e-02 GPU result: 3.140440e+00 Expected: 3.141593e+00 Absolute error: 1.152754e-03 Relative error: 3.669329e-04 MonteCarloEstimatePiInlineP, Performance = 579847.94 sims/s, Time = 172.46(ms), NumDevsUsed = 1, Blocksize = 128
Monte Carlo Estimate Pi (with inline QRNG) ========================================== Estimating Pi on GPU (GRID P4-4Q) Precision: single Number of sims: 100000 Tolerance: 1.000000e-02 GPU result: 3.142520e+00 Expected: 3.141593e+00 Absolute error: 9.272099e-04 Relative error: 2.951401e-04 MonteCarloEstimatePiInlineQ, Performance = 602518.54 sims/s, Time = 165.97(ms), NumDevsUsed = 1, Blocksize = 128
Monte Carlo Estimate Pi (with batch PRNG) ========================================= Estimating Pi on GPU (GRID P4-4Q) Precision: single Number of sims: 100000 Tolerance: 1.000000e-02 GPU result: 3.136320e+00 Expected: 3.141593e+00 Absolute error: 5.272627e-03 Relative error: 1.678329e-03 MonteCarloEstimatePiP, Performance = 565585.27 sims/s, Time = 176.81(ms), NumDevsUsed = 1, Blocksize = 128
Monte Carlo Estimate Pi (with batch QRNG) ========================================= Estimating Pi on GPU (GRID P4-4Q) Precision: single Number of sims: 100000 Tolerance: 1.000000e-02 GPU result: 3.141840e+00 Expected: 3.141593e+00 Absolute error: 2.472401e-04 Relative error: 7.869895e-05 MonteCarloEstimatePiQ, Performance = 579888.28 sims/s, Time = 172.45(ms), NumDevsUsed = 1, Blocksize = 128
Monte Carlo Single Asian Option (with PRNG)
===========================================
Pricing option on GPU (GRID P4-4Q)
Precision: single
Number of sims: 100000
   Spot    |   Strike   |     r      |   sigma    |   tenor    |  Call/Put  |   Value    |  Expected  |
-----------|------------|------------|------------|------------|------------|------------|------------|
        40 |         35 |       0.03 |        0.2 |   0.333333 |       Call |    5.17634 |    5.16253 |
MonteCarloSingleAsianOptionP, Performance = 598601.69 sims/s, Time = 167.06(ms), NumDevsUsed = 1, Blocksize = 128
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/mergeSort Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Allocating and initializing host arrays... Allocating and initializing CUDA arrays... Initializing GPU merge sort... Running GPU merge sort... Time: 9.544000 ms Reading back GPU merge sort results... Inspecting the results... ...inspecting keys array: OK ...inspecting keys and values array: OK ...stability property: stable! Shutting down...
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/MersenneTwisterGP11213 Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Allocating data for 2400000 samples... Seeding with 777 ... Generating random numbers on GPU... Reading back the results... Generating random numbers on CPU... Comparing CPU/GPU random numbers... Max absolute error: 0.000000E+00 L1 norm: 0.000000E+00 MersenneTwisterGP11213, Throughput = 6.5520 GNumbers/s, Time = 0.00037 s, Size = 2400000 Numbers Shutting down...
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/MonteCarloMultiGPU Starting... Using single CPU thread for multiple GPUs MonteCarloMultiGPU ================== Parallelization method = streamed Problem scaling = weak Number of GPUs = 1 Total number of options = 8192 Number of paths = 262144 main(): generating input data... main(): starting 1 host threads... main(): GPU statistics, streamed GPU Device #0: GRID P4-4Q Options : 8192 Simulation paths: 262144 Total time (ms.): 45.661999 Note: This is elapsed time for all to compute. Options per sec.: 179405.199607 main(): comparing Monte Carlo and Black-Scholes results... Shutting down... Test Summary... L1 norm : 4.825160E-04 Average reserve: 11.741779 NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. Test passed
newdelete Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > Container = Vector test OK > Container = Vector, using placement new on SMEM buffer test OK > Container = Vector, with user defined datatype test OK Test Summary: 3/3 succesfully run
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > Detected Compute SM 6.1 hardware with 20 multi-processors pr_1, alpha = 0.85 0.051705 0.073679 0.057413 0.348703 0.199904 0.268596 pr_2, alpha = 0.90 0.037212 0.053958 0.041506 0.375080 0.205998 0.286246 Done!
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > Detected Compute SM 6.1 hardware with 20 multi-processors y_h 9.900001 14.299999 80.300003 62.699997 4.950000 Done!
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Modularity_score: 0.371466 Hit rate : 100.000000% (34 hits) Done!
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > Detected Compute SM 6.1 hardware with 20 multi-processors sssp_1_h 0.000000 0.500000 0.500000 1.333333 0.833333 1.333333 Done! sssp_2_h 340282346638528859811704183484516925440.000000 340282346638528859811704183484516925440.000000 340282346638528859811704183484516925440.000000 1.000000 1.500000 0.000000 Done!
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test] Device: 0, GRID P4-4Q, pciBusID: 2, pciDeviceID: 2, pciDomainID:0 ***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure. So you can see lesser Bandwidth (GB/s) in those cases. P2P Connectivity Matrix D\D 0 0 1 Unidirectional P2P=Disabled Bandwidth Matrix (GB/s) D\D 0 0 165.24 Unidirectional P2P=Enabled Bandwidth Matrix (GB/s) D\D 0 0 165.31 Bidirectional P2P=Disabled Bandwidth Matrix (GB/s) D\D 0 0 165.59 Bidirectional P2P=Enabled Bandwidth Matrix (GB/s) D\D 0 0 165.42 P2P=Disabled Latency Matrix (us) D\D 0 0 6.35 P2P=Enabled Latency Matrix (us) D\D 0 0 6.40 NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
[PTX Just In Time (JIT) Compilation (no-qatest)] - Starting... > Using CUDA device [0]: GRID P4-4Q Loading myPtx[] program CUDA Link Completed in 0.000000ms. Linker Output: info : 0 bytes gmem info : Function properties for '_Z8myKernelPi': info : used 4 registers, 0 stack, 0 bytes smem, 328 bytes cmem[0], 0 bytes lmem CUDA kernel launched
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/quasirandomGenerator Starting... Allocating GPU memory... Allocating CPU memory... Initializing QRNG tables... Testing QRNG... quasirandomGenerator, Throughput = 27.4616 GNumbers/s, Time = 0.00011 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384 Reading GPU results... Comparing to the CPU results... L1 norm: 7.275964E-12 Testing inverseCNDgpu()... quasirandomGenerator-inverse, Throughput = 33.7343 GNumbers/s, Time = 0.00009 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128 Reading GPU results... Comparing to the CPU results... L1 norm: 9.439909E-08 Shutting down...
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/quasirandomGenerator_nvrtc Starting... > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability Allocating GPU memory... Allocating CPU memory... Initializing QRNG tables... Testing QRNG... quasirandomGenerator, Throughput = 21.0417 GNumbers/s, Time = 0.00015 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384 Reading GPU results... Comparing to the CPU results... L1 norm: 7.275964E-12 Testing inverseCNDgpu()... quasirandomGenerator-inverse, Throughput = 31.1458 GNumbers/s, Time = 0.00010 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128 Reading GPU results... Comparing to the CPU results... L1 norm: 9.439909E-08 Shutting down...
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/radixSortThrust Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Sorting 1048576 32-bit unsigned int keys and values radixSortThrust, Throughput = 340.2453 MElements/s, Time = 0.00308 s, Size = 1048576 elements Test passed
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/reduction Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Using Device 0: GRID P4-4Q Reducing array of type int 16777216 elements 256 threads (max) 64 blocks Reduction, Throughput = 45.9587 GB/s, Time = 0.00146 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256 GPU result = 2139353471 CPU result = 2139353471 Test passed
reductionMultiBlockCG Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 33554432 elements numThreads: 1024 numBlocks: 40 Launching SinglePass Multi Block Cooperative Groups kernel Average time: 0.837790 ms Bandwidth: 160.204520 GB/s GPU result = 1.992401361465 CPU result = 1.992401361465
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/scalarProd Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Initializing data... ...allocating CPU memory. ...allocating GPU memory. ...generating input data in CPU mem. ...copying input data to GPU mem. Data init done. Executing GPU kernel... GPU time: 0.156000 msecs. Reading back GPU result... Checking GPU results... ..running CPU scalar product calculation ...comparing the results Shutting down... L1 error: 2.745062E-08 Test passed
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/scan Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Allocating and initializing host arrays... Allocating and initializing CUDA arrays... Initializing CUDA-C scan... *** Running GPU scan for short arrays (100 identical iterations)... Running scan for 4 elements (1703936 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 8 elements (851968 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 16 elements (425984 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 32 elements (212992 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 64 elements (106496 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 128 elements (53248 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 256 elements (26624 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 512 elements (13312 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 1024 elements (6656 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match scan, Throughput = 3.1163 MElements/s, Time = 0.00033 s, Size = 1024 Elements, NumDevsUsed = 1, Workgroup = 256 ***Running GPU scan for large arrays (100 identical iterations)... 
Running scan for 2048 elements (3328 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 4096 elements (1664 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 8192 elements (832 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 16384 elements (416 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 32768 elements (208 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 65536 elements (104 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 131072 elements (52 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match Running scan for 262144 elements (26 arrays)... Validating the results... ...reading back GPU results ...scanExclusiveHost() ...comparing the results ...Results Match scan, Throughput = 391.5051 MElements/s, Time = 0.00067 s, Size = 262144 Elements, NumDevsUsed = 1, Workgroup = 256 Shutting down...
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/segmentationTreeThrust Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 * Building segmentation tree... done in 255.807 (ms) * Dumping levels for each tree...
Starting shfl_scan GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > Detected Compute SM 6.1 hardware with 20 multi-processors Starting shfl_scan GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > Detected Compute SM 6.1 hardware with 20 multi-processors Computing Simple Sum test --------------------------------------------------- Initialize test data [1, 1, 1...] Scan summation for 65536 elements, 256 partial sums Partial summing 256 elements with 1 blocks of size 256 Test Sum: 65536 Time (ms): 0.052064 65536 elements scanned in 0.052064 ms -> 1258.758423 MegaElements/s CPU verify result diff (GPUvsCPU) = 0 CPU sum (naive) took 0.047840 ms Computing Integral Image Test on size 1920 x 1080 synthetic data --------------------------------------------------- Method: Fast Time (GPU Timer): 0.068608 ms Diff = 0 Method: Vertical Scan Time (GPU Timer): 0.154016 ms CheckSum: 2073600, (expect 1920x1080=2073600)
simpleAssert.cu:47: void testKernel(int): block: [1,0,0], thread: [28,0,0] Assertion `gtid < N` failed. simpleAssert.cu:47: void testKernel(int): block: [1,0,0], thread: [29,0,0] Assertion `gtid < N` failed. simpleAssert.cu:47: void testKernel(int): block: [1,0,0], thread: [30,0,0] Assertion `gtid < N` failed. simpleAssert.cu:47: void testKernel(int): block: [1,0,0], thread: [31,0,0] Assertion `gtid < N` failed. simpleAssert starting... OS_System_Type.release = 3.10.0-693.2.2.el7.x86_64 OS Info: <#1 SMP Tue Sep 12 22:26:13 UTC 2017> GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Launch kernel to generate assertion failures -- Begin assert output -- End assert output Device assert failed as expected, CUDA error message is: device-side assert triggered simpleAssert completed, returned OK
../../../../0_Simple/simpleAssert_nvrtc/simpleAssert_kernel.cu:13: void testKernel(int): block: [1,0,0], thread: [28,0,0] Assertion `gtid < N` failed. ../../../../0_Simple/simpleAssert_nvrtc/simpleAssert_kernel.cu:13: void testKernel(int): block: [1,0,0], thread: [29,0,0] Assertion `gtid < N` failed. ../../../../0_Simple/simpleAssert_nvrtc/simpleAssert_kernel.cu:13: void testKernel(int): block: [1,0,0], thread: [30,0,0] Assertion `gtid < N` failed. ../../../../0_Simple/simpleAssert_nvrtc/simpleAssert_kernel.cu:13: void testKernel(int): block: [1,0,0], thread: [31,0,0] Assertion `gtid < N` failed. simpleAssert_nvrtc starting... Launch kernel to generate assertion failures > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability -- Begin assert output -- End assert output Device assert failed as expected
simpleAtomicIntrinsics starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > GPU device has 20 Multi-Processors, SM 6.1 compute capabilities Processing time: 153.134995 (ms) simpleAtomicIntrinsics completed, returned OK
simpleAtomicIntrinsics_nvrtc starting... > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability Processing time: 1.088000 (ms) simpleAtomicIntrinsics_nvrtc completed, returned OK
Starting simpleCallback Found 1 CUDA capable GPUs GPU[0] GRID P4-4Q supports SM 6.1, capable GPU Callback Functions 1 GPUs available to run Callback Functions Starting 8 heterogeneous computing workloads Total of 8 workloads finished: Success
Launching a single block with 64 threads... Sum of all ranks 0..63 in threadBlockGroup is 2016 (expected 2016) Now creating 4 groups, each of size 16 threads: Sum of all ranks 0..15 in this tiledPartition16 group is 120 (expected 120) Sum of all ranks 0..15 in this tiledPartition16 group is 120 (expected 120) Sum of all ranks 0..15 in this tiledPartition16 group is 120 (expected 120) Sum of all ranks 0..15 in this tiledPartition16 group is 120 (expected 120) ...Done.
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 CUDA device [GRID P4-4Q] has 20 Multi-Processors SM 6.1 Covering Cubemap data array of 64~3 x 1: Grid size is 8 x 8, each block has 8 x 8 threads Processing time: 0.019 msec 1293.47 Mtexlookups/sec Comparing kernel output to expected data
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 simpleCUBLAS test running.. simpleCUBLAS test passed.
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 simpleCUBLASXT test running.. simpleCUBLASXT test passed.
[simpleCUFFT] is starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Temporary buffer size 448 bytes Transforming signal cufftExecC2C Launching ComplexPointwiseMulAndScale<<< >>> Transforming signal back cufftExecC2C
Poisson equation using CUFFT library on Multiple GPUs is starting... No. of GPU on node 1 Two GPUs are required to run simpleCUFFT_2d_MGPU sample code
[simpleCUFFT_callback] is starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Transforming signal cufftExecC2C Transforming signal back cufftExecC2C
[simpleCUFFT_MGPU] is starting... No. of GPU on node 1 Two GPUs are required to run simpleCUFFT_MGPU sample code
simpleDevLibCUBLAS test running... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Host and device APIs will be tested. simpleDevLibCUBLAS completed, returned OK
starting hyperQ... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > Detected Compute SM 6.1 hardware with 20 multi-processors Expected time for serial execution of 32 sets of kernels is between approx. 0.330s and 0.640s Expected time for fully concurrent execution of 32 sets of kernels is approx. 0.020s Measured time for sample = 0.050s
Checking for multiple GPUs... CUDA-capable device count: 1 Searching for UVA capable devices... > GPU0 = " GRID P4-4Q" IS capable of UVA Spawning processes and assigning GPUs... > Process 1 -> GPU0 > Process 1: Run kernel on GPU0, taking source data from and writing results to process 0, GPU0... Spawning processes and assigning GPUs... > Process 0 -> GPU0 Launching kernels... Checking test results... Shutting down...
[simpleLayeredTexture] - Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 CUDA device [GRID P4-4Q] has 20 Multi-Processors SM 6.1 Covering 2D data array of 512 x 512: Grid size is 64 x 64, each block has 8 x 8 threads Processing time: 0.100 msec 13107.20 Mtexlookups/sec Comparing kernel output to expected data
[simpleMultiCopy] - Starting... > Using CUDA device [0]: GRID P4-4Q [GRID P4-4Q] has 20 MP(s) x 128 (Cores/MP) = 2560 (Cores) > Device name: GRID P4-4Q > CUDA Capability 6.1 hardware with 20 multi-processors > scale_factor = 1.00 > array_size = 4194304 Relevant properties of this CUDA device (X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap") (X) Can overlap two CPU<>GPU data transfers with GPU kernel execution (Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000) Measured timings (throughput): Memcpy host to device : 2.706176 ms (6.199603 GB/s) Memcpy device to host : 13.667808 ms (1.227499 GB/s) Kernel : 0.259104 ms (647.508920 GB/s) Theoretical limits for speedup gained from overlapped data transfers: No overlap at all (transfer-kernel-transfer): 16.633087 ms Compute can overlap with one transfer: 16.373983 ms Compute can overlap with both data transfers: 13.667808 ms Average measured timings over 10 repetitions: Avg. time when execution fully serialized : 6.404608 ms Avg. time when overlapped using 4 streams : 3.068211 ms Avg. speedup gained (serialized - overlapped) : 3.336397 ms Measured throughput: Fully serialized execution : 5.239108 GB/s Overlapped using 4 streams : 10.936155 GB/s
Starting simpleMultiGPU CUDA-capable device count: 1 Generating input data... Computing with 1 GPUs... GPU Processing time: 23.084000 (ms) Computing with Host CPU... Comparing GPU and Host CPU results... GPU sum: 16777296.000000 CPU sum: 16777294.395033 Relative difference: 9.566307E-08
starting Simple Occupancy [ Manual configuration with 32 threads per block ] Potential occupancy: 50% Elapsed time: 0.15312ms [ Automatic, occupancy-based configuration ] Suggested block size: 1024 Minimum grid size for maximum occupancy: 40 Potential occupancy: 100% Elapsed time: 0.06736ms Test PASSED
[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/simpleP2P] - Starting... Checking for multiple GPUs... CUDA-capable device count: 1 Two or more GPUs with SM 2.0 or higher capability are required for ~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/simpleP2P. Waiving test.
simplePitchLinearTexture starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Bandwidth (GB/s) for pitch linear: 1.63e+02; for array: 1.64e+02 Texture fetch rate (Mpix/s) for pitch linear: 2.04e+04; for array: 2.05e+04 simplePitchLinearTexture completed, returned OK
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Device 0: "GRID P4-4Q" with Compute 6.1 capability printf() is called. Output: [2, 0]: Value is:10 [2, 1]: Value is:10 [2, 2]: Value is:10 [2, 3]: Value is:10 [2, 4]: Value is:10 [2, 5]: Value is:10 [2, 6]: Value is:10 [2, 7]: Value is:10 [3, 0]: Value is:10 [3, 1]: Value is:10 [3, 2]: Value is:10 [3, 3]: Value is:10 [3, 4]: Value is:10 [3, 5]: Value is:10 [3, 6]: Value is:10 [3, 7]: Value is:10 [0, 0]: Value is:10 [0, 1]: Value is:10 [0, 2]: Value is:10 [0, 3]: Value is:10 [0, 4]: Value is:10 [0, 5]: Value is:10 [0, 6]: Value is:10 [0, 7]: Value is:10 [1, 0]: Value is:10 [1, 1]: Value is:10 [1, 2]: Value is:10 [1, 3]: Value is:10 [1, 4]: Value is:10 [1, 5]: Value is:10 [1, 6]: Value is:10 [1, 7]: Value is:10
simpleSeparateCompilation starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 simpleSeparateCompilation completed, returned OK
[ simpleStreams ] Device synchronization method set to = 0 (Automatic Blocking) Setting reps to 100 to demonstrate steady state > GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Device: canMapHostMemory: Yes > CUDA Capable: SM 6.1 hardware > 20 Multiprocessor(s) x 128 (Cores/Multiprocessor) = 2560 (Cores) > scale_factor = 1.0000 > array_size = 16777216 > Using CPU/GPU Device Synchronization method (cudaDeviceScheduleAuto) > mmap() allocating 64.00 Mbytes (generic page-aligned system memory) > cudaHostRegister() registering 64.00 Mbytes of generic allocated system memory Starting Test memcopy: 9.06 kernel: 3.46 non-streamed: 24.61 4 streams: 9.30 -------------------------------
simpleSurfaceWrite starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 CUDA device [GRID P4-4Q] has 20 Multi-Processors, SM 6.1 Loaded 'lena_bw.pgm', 512 x 512 pixels Processing time: 0.036000 (ms) 7281.78 Mpixels/sec Wrote 'output.pgm' Comparing files output: reference: <../../../../0_Simple/simpleSurfaceWrite/data/ref_rotated.pgm> simpleSurfaceWrite completed, returned OK
> runTest GPU Device 0: "GRID P4-4Q" with compute capability 6.1 CUDA device [GRID P4-4Q] has 20 Multi-Processors Processing time: 159.790000 (ms) Compare OK > runTest GPU Device 0: "GRID P4-4Q" with compute capability 6.1 CUDA device [GRID P4-4Q] has 20 Multi-Processors Processing time: 0.961000 (ms) Compare OK [simpleTemplates] -> Test Results: 0 Failures
> runTest > Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability Processing time: 1.043000 (ms) Compare OK > runTest Processing time: 0.953000 (ms) Compare OK [simpleTemplates_nvrtc] -> Test Results: 0 Failures
simpleTexture starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Loaded 'lena_bw.pgm', 512 x 512 pixels Processing time: 0.037000 (ms) 7084.97 Mpixels/sec Wrote '../../../../0_Simple/simpleTexture/data/lena_bw_out.pgm' Comparing files output: <../../../../0_Simple/simpleTexture/data/lena_bw_out.pgm> reference: <../../../../0_Simple/simpleTexture/data/ref_rotated.pgm> simpleTexture completed, returned OK
> Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability > findModulePath <./simpleTexture_kernel64.ptx> > initCUDA loading module: <./simpleTexture_kernel64.ptx> > PTX JIT log: Loaded 'lena_bw.pgm', 512 x 512 pixels Processing time: 0.034000 (ms) 7710.12 Mpixels/sec Wrote '../../../../0_Simple/simpleTextureDrv/data/lena_bw_out.pgm' Comparing files output: <../../../../0_Simple/simpleTextureDrv/data/lena_bw_out.pgm> reference: <../../../../0_Simple/simpleTextureDrv/data/ref_rotated.pgm>
[simpleVoteIntrinsics] GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > GPU device has 20 Multi-Processors, SM 6.1 compute capabilities [VOTE Kernel Test 1/3] Running <> kernel1 ... OK [VOTE Kernel Test 2/3] Running <> kernel2 ... OK [VOTE Kernel Test 3/3] Running <> kernel3 ... OK Shutting down...
> Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability [simpleVoteIntrinsics_nvrtc] [VOTE Kernel Test 1/3] Running <> kernel1 ... OK [VOTE Kernel Test 2/3] Running <> kernel2 ... OK [VOTE Kernel Test 3/3] Running <> kernel3 ... OK Shutting down...
Device 0: < GRID P4-4Q >, Compute SM 6.1 detected > Using CUDA Host Allocated (cudaHostAlloc) > vectorAddGPU kernel will add vectors using mapped CPU memory... > Checking the results from vectorAddGPU() ... > Releasing CPU memory...
Sobol Quasi-Random Number Generator Starting... > number of vectors = 100000 > number of dimensions = 100 GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Allocating CPU memory... Allocating GPU memory... Initializing direction numbers... Copying direction numbers to device... Executing QRNG on GPU... Gsamples/s: 5.95593 Reading results from GPU... Executing QRNG on CPU... Gsamples/s: 0.139458 Checking results... L1-Error: 0 Shutting down...
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/sortingNetworks Starting...
Starting up CUDA context...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1
Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Running GPU bitonic sort (1 identical iterations)...

Testing array length 64 (16384 arrays per batch)...
Average time: 0.394000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 128 (8192 arrays per batch)...
Average time: 0.557000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 256 (4096 arrays per batch)...
Average time: 0.660000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 512 (2048 arrays per batch)...
Average time: 0.767000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 1024 (1024 arrays per batch)...
Average time: 0.868000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 2048 (512 arrays per batch)...
Average time: 0.948000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 4096 (256 arrays per batch)...
Average time: 1.368000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 8192 (128 arrays per batch)...
Average time: 1.715000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 16384 (64 arrays per batch)...
Average time: 2.273000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 32768 (32 arrays per batch)...
Average time: 2.911000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 65536 (16 arrays per batch)...
Average time: 3.652000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 131072 (8 arrays per batch)...
Average time: 4.489000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 262144 (4 arrays per batch)...
Average time: 5.443000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 524288 (2 arrays per batch)...
Average time: 6.493000 ms
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Testing array length 1048576 (1 arrays per batch)...
Average time: 7.657000 ms
sortingNetworks-bitonic, Throughput = 136.9434 MElements/s, Time = 0.00766 s, Size = 1048576 elements, NumDevsUsed = 1, Workgroup = 512
Validating the results...
...reading back GPU results
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: NOT stable

Shutting down...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 > GPU device has 20 Multi-Processors, SM 6.1 compute capabilities Loaded <../../../../3_Imaging/stereoDisparity/data/stereo.im0.640x533.ppm> as image 0 Loaded <../../../../3_Imaging/stereoDisparity/data/stereo.im1.640x533.ppm> as image 1 Launching CUDA stereoDisparityKernel() Input Size [640x533], Kernel size [17x17], Disparities [-16:0] GPU processing time : 0.9226 (ms) Pixel throughput : 369.728 Mpixels/sec GPU Checksum = 4293895789, GPU image: Computing CPU reference... CPU Checksum = 4293895789, CPU image:
Starting [~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/StreamPriorities]... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 CUDA stream priority range: LOW: 0 to HIGH: -1 elapsed time of kernels launched to LOW priority stream: 6.455 ms elapsed time of kernels launched to HI priority stream: 3.546 ms
Unified Memory not supported on this device GPU Device 0: "GRID P4-4Q" with compute capability 6.1
~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/template Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Processing time: 153.804001 (ms)
threadFenceReduction Starting... GPU Device 0: "GRID P4-4Q" with compute capability 6.1 GPU Device supports SM 6.1 compute capability 1048576 elements 128 threads (max) 64 blocks Average time: 0.048940 ms Bandwidth: 85.702985 GB/s GPU result = 0.062298238277 CPU result = 0.062298242003
Starting threadMigration [ threadMigration ] API test... > 1 CUDA device(s), 2 Thread(s)/device to launched Device 0: "GRID P4-4Q" (Compute 6.1) sharedMemPerBlock: 49152 constantMemory : 65536 regsPerBlock : 65536 clockRate : 1113500 > findModulePath found file at <./threadMigration_kernel64.ptx> > PTX JIT log: - ThreadProc() Launched... - ThreadProc() Launched... - ThreadProc() Finished! - ThreadProc() Finished!
./RunTests: line 17: ~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/threadMigration_kernel64.ptx: Permission denied
GPU0 <-> CPU: * Atomic Supported: no
Transpose Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1
> Device 0: "GRID P4-4Q"
> SM Capability 6.1 detected:
> [GRID P4-4Q] has 20 MP(s) x 128 (Cores/MP) = 2560 (Cores)
> Compute performance scaling factor = 1.00
Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16
transpose simple copy       , Throughput = 147.6276 GB/s, Time = 0.05292 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 145.9333 GB/s, Time = 0.05353 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 47.9143 GB/s, Time = 0.16305 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 146.1290 GB/s, Time = 0.05346 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 147.0868 GB/s, Time = 0.05311 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 146.8038 GB/s, Time = 0.05322 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 146.9169 GB/s, Time = 0.05318 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 140.4011 GB/s, Time = 0.05564 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed
Unified Memory not supported on this device GPU Device 0: "GRID P4-4Q" with compute capability 6.1
[Vector addition of 50000 elements] Copy input data from the host memory to the CUDA device CUDA kernel launch with 196 blocks of 256 threads Copy output data from the CUDA device to the host memory Test PASSED Done
Vector Addition (Driver API) > Using Device 0: "GRID P4-4Q" with Compute 6.1 capability > findModulePath found file at <./vectorAdd_kernel64.ptx> > initCUDA loading module: <./vectorAdd_kernel64.ptx> > PTX JIT log: Result = PASS
> Using CUDA Device [0]: GRID P4-4Q > GPU Device has SM 6.1 compute capability [Vector addition of 50000 elements] Copy input data from the host memory to the CUDA device CUDA kernel launch with 196 blocks of 256 threads Copy output data from the CUDA device to the host memory Test PASSED Done
GPU Device 0: "GRID P4-4Q" with compute capability 6.1 Warp Aggregated Atomics PASSED
End of run: Tue Oct 17 10:29:13 CDT 2017
Card Status:
Tue Oct 17 10:29:13 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.73                 Driver Version: 384.73                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID P4-4Q          On   | 00000000:02:02.0  On |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    495MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1510      G   /usr/bin/X                                    77MiB |
|    0     16459      G   /usr/bin/gnome-shell                         113MiB |
+-----------------------------------------------------------------------------+
Tests run by Tony Foster - all rights reserved Tue Oct 17 10:29:13 CDT 2017
Published for informational purposes only.