Evaluation of NVIDIA CUDA Toolkit Example Files

Card Status:

Tue Oct 17 10:25:24 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.73                 Driver Version: 384.73                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID P4-4Q          On   | 00000000:02:02.0  On |                  N/A |
| N/A   N/A    P8    N/A /  N/A |    495MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1510      G   /usr/bin/X                                    77MiB |
|    0     16459      G   /usr/bin/gnome-shell                         113MiB |
+-----------------------------------------------------------------------------+

Start of run: Tue Oct 17 10:25:24 CDT 2017

file: alignedTypes

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/alignedTypes] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

[GRID P4-4Q] has 20 MP(s) x 128 (Cores/MP) = 2560 (Cores)
> Compute scaling value = 1.00
> Memory Size = 49999872
Allocating memory...
Generating host input data array...
Uploading input data to GPU memory...
Testing misaligned types...
uint8...
Avg. time: 1.774187 ms / Copy throughput: 26.246387 GB/s.
        TEST OK
uint16...
Avg. time: 1.033437 ms / Copy throughput: 45.059338 GB/s.
        TEST OK
RGBA8_misaligned...
Avg. time: 0.855937 ms / Copy throughput: 54.403517 GB/s.
        TEST OK
LA32_misaligned...
Avg. time: 0.622406 ms / Copy throughput: 74.816103 GB/s.
        TEST OK
RGB32_misaligned...
Avg. time: 0.986250 ms / Copy throughput: 47.215220 GB/s.
        TEST OK
RGBA32_misaligned...
Avg. time: 1.052531 ms / Copy throughput: 44.241926 GB/s.
        TEST OK
Testing aligned types...
RGBA8...
Avg. time: 0.660156 ms / Copy throughput: 70.537861 GB/s.
        TEST OK
I32...
Avg. time: 0.660187 ms / Copy throughput: 70.534523 GB/s.
        TEST OK
LA32...
Avg. time: 0.612375 ms / Copy throughput: 76.041654 GB/s.
        TEST OK
RGB32...
Avg. time: 0.715219 ms / Copy throughput: 65.107369 GB/s.
        TEST OK
RGBA32...
Avg. time: 0.606094 ms / Copy throughput: 76.829712 GB/s.
        TEST OK
RGBA32_2...
Avg. time: 0.724719 ms / Copy throughput: 64.253905 GB/s.
        TEST OK

[alignedTypes] -> Test Results: 0 Failures
Shutting down...
Test passed

file: asyncAPI

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/asyncAPI] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

CUDA device [GRID P4-4Q]
time spent executing by the GPU: 72.02
time spent by CPU in CUDA calls: 0.06
CPU executed 187171 iterations while waiting for GPU to finish

file: bandwidthTest

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GRID P4-4Q
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6355.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     4529.5

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     157799.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

file: batchCUBLAS

batchCUBLAS Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1


 ==== Running single kernels ==== 

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128  alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00006199 sec  GFLOPS=67.6623
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128  alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00007510 sec  GFLOPS=55.8482
@@@@ dgemm test OK

 ==== Running N=10 without streams ==== 

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128  alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00008106 sec  GFLOPS=517.417
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128  alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00045013 sec  GFLOPS=93.179
@@@@ dgemm test OK

 ==== Running N=10 with streams ==== 

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128  alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00010586 sec  GFLOPS=396.22
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128  alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00034285 sec  GFLOPS=122.338
@@@@ dgemm test OK

 ==== Running N=10 batched ==== 

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128  alpha = (0x3f800000, 1) beta= (0xbf800000, -1)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00006104 sec  GFLOPS=687.195
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128  alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00029707 sec  GFLOPS=141.189
@@@@ dgemm test OK

Test Summary
0 error(s)

file: BiCGStab

WARNING: it is assumed that the matrices are stores in Matrix Market format with double as elementtype
 Usage: ./BiCGStab -F[matrix.mtx] [-E] [-D]
Starting [~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BiCGStab]
argv[0] = ~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BiCGStabUsing default input file [../../../../7_CUDALibraries/BiCGStab/gr_900_900_crg.mtx]
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Testing dbicgstab
^^^^ M=900, N=900, nnz=4322
Copy matrix from CPU to GPU, time(s) = 0.00008607
analysis lower 0.004964 (s), upper 0.001323 (s) 
CUSPARSE csrilu0 time(s) = 0.00109410 
gpu total solve time 1.362471 (s), matvec time 0.117151 (s)

file: binomialOptions

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/binomialOptions] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Generating input data...
Running GPU binomial tree...
Options count            : 1024     
Time steps               : 2048     
binomialOptionsGPU() time: 4.246000 msec
Options per second       : 241168.168883     
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 2.220162E-04
CPU binomial vs. Black-Scholes
L1 norm: 2.217447E-04
CPU binomial vs. GPU binomial
L1 norm: 1.161472E-06
Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed

file: binomialOptions_nvrtc

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/binomialOptions_nvrtc] - Starting...
Generating input data...
Running GPU binomial tree...
> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
Options count            : 1024     
Time steps               : 2048     
binomialOptionsGPU() time: 409.170013 msec
Options per second       : 2502.627188     
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 2.216577E-04
CPU binomial vs. Black-Scholes
L1 norm: 9.435265E-05
CPU binomial vs. GPU binomial
L1 norm: 1.513570E-04
Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed

file: BlackScholes

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BlackScholes] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count             : 8000000     
BlackScholesGPU() time    : 0.481787 msec
Effective memory bandwidth: 166.048442 GB/s
Gigaoptions per second    : 16.604844     

BlackScholes, Throughput = 16.6048 GOptions/s, Time = 0.00048 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.

[BlackScholes] - Test Summary

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed

file: BlackScholes_nvrtc

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BlackScholes_nvrtc] - Starting...
Initializing data...
...allocating CPU memory for options.
> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count             : 8000000     
BlackScholesGPU() time    : 0.481795 msec
Effective memory bandwidth: 166.045751 GB/s
Gigaoptions per second    : 16.604575     

BlackScholes, Throughput = 16.6046 GOptions/s, Time = 0.00048 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/BlackScholes_nvrtc] - Test Summary
Test passed

file: boxFilterNPP

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/boxFilterNPP Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

cudaSetDevice GPU0 = GRID P4-4Q
NPP Library Version 9.0.176
  CUDA Driver  Version: 9.0
  CUDA Runtime Version: 9.0
  Device 0: <      GRID P4-4Q >, Compute SM 6.1 detected
boxFilterNPP opened: <../../../../common/data/Lena.pgm> successfully!
Saved image: ../../../../common/data/Lena_boxFilter.pgm

file: c++11_cuda

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Read 3223503 byte corpus from ../../../../6_Advanced/c++11_cuda/warandpeace.txt
counted 107310 instances of 'x', 'y', 'z', or 'w' in "../../../../6_Advanced/c++11_cuda/warandpeace.txt"

file: cannyEdgeDetectorNPP

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/cannyEdgeDetectorNPP Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

cudaSetDevice GPU0 = GRID P4-4Q
NPP Library Version 9.0.176
  CUDA Driver  Version: 9.0
  CUDA Runtime Version: 9.0
  Device 0: <      GRID P4-4Q >, Compute SM 6.1 detected
cannyEdgeDetectionNPP opened: <../../../../common/data/Lena.pgm> successfully!
Saved image: ../../../../common/data/Lena_cannyEdgeDetection.pgm

file: cdpAdvancedQuicksort

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

GPU device GRID P4-4Q has compute capabilities (SM 6.1)
Running qsort on 5000 elements with seed 100, on GRID P4-4Q
    cdpAdvancedQuicksort PASSED
Sorted 5000 elems in 0.422 ms (11.851 Melems/sec)

file: cdpBezierTessellation

Running on GPU 0 (GRID P4-4Q)
Computing Bezier Lines (CUDA Dynamic Parallelism Version) ... Done!

file: cdpLUDecomposition

Starting LU Decomposition (CUDA Dynamic Parallelism)
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

GPU device GRID P4-4Q has compute capabilities (SM 6.1)
Compute LU decomposition of a random 1024x1024 matrix using CUDA Dynamic Parallelism
Launching single task from device...
Successfully launched CDP kernel
GPU perf(dgetrf)= 1.585 Gflops
Checking results... done
Tests suceeded

file: cdpQuadtree

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

GPU device GRID P4-4Q has compute capabilities (SM 6.1)
Launching CDP kernel to build the quadtree
Results: OK

file: cdpSimplePrint

starting Simple Print (CUDA Dynamic Parallelism)
Running on GPU 0 (GRID P4-4Q)
***************************************************************************
The CPU launches 2 blocks of 2 threads each. On the device each thread will
launch 2 blocks of 2 threads each. The GPU we will do that recursively
until it reaches max_depth=2

In total 2+8=10 blocks are launched!!! (8 from the GPU)
***************************************************************************

Launching cdp_kernel() with CUDA Dynamic Parallelism:

BLOCK 1 launched by the host
BLOCK 0 launched by the host
|  BLOCK 4 launched by thread 0 of block 0
|  BLOCK 2 launched by thread 0 of block 0
|  BLOCK 3 launched by thread 0 of block 1
|  BLOCK 5 launched by thread 0 of block 1
|  BLOCK 6 launched by thread 1 of block 0
|  BLOCK 7 launched by thread 1 of block 0
|  BLOCK 8 launched by thread 1 of block 1
|  BLOCK 9 launched by thread 1 of block 1

file: cdpSimpleQuicksort

Running on GPU 0 (GRID P4-4Q)
Initializing data:
Running quicksort on 128 elements
Launching kernel on the GPU
Validating results: OK

file: clock

CUDA Clock sample
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Average clocks/block = 3092.281250

file: clock_nvrtc

CUDA Clock sample
> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
Average clocks/block = 3056.218750

file: concurrentKernels

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/concurrentKernels] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> Detected Compute SM 6.1 hardware with 20 multi-processors
Expected time for serial execution of 8 kernels = 0.080s
Expected time for concurrent execution of 8 kernels = 0.010s
Measured time for sample = 0.010s
Test passed

file: conjugateGradient

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> GPU device has 20 Multi-Processors, SM 6.1 compute capabilities

iteration =   1, residual = 4.449882e+01
iteration =   2, residual = 3.245218e+00
iteration =   3, residual = 2.690220e-01
iteration =   4, residual = 2.307639e-02
iteration =   5, residual = 1.993140e-03
iteration =   6, residual = 1.846192e-04
iteration =   7, residual = 1.693378e-05
iteration =   8, residual = 1.600115e-06
Test Summary:  Error amount = 0.000000

file: conjugateGradientMultiBlockCG

Unified Memory not supported on this device
Starting [conjugateGradientMultiBlockCG]...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

file: conjugateGradientPrecond

conjugateGradientPrecond starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

GPU selected Device ID = 0 
> GPU device has 20 Multi-Processors, SM 6.1 compute capabilities

laplace dimension = 128
Convergence of conjugate gradient without preconditioning: 
  iteration = 564, residual = 9.462559e-13 
  Convergence Test: OK 

Convergence of conjugate gradient using incomplete LU preconditioning: 
  iteration = 188, residual = 9.029262e-13 
  Convergence Test: OK 
  Test Summary:
     Counted total of 0 errors
     qaerr1 = 0.000005 qaerr2 = 0.000003

file: conjugateGradientUM

Unified Memory not supported on this device
Starting [conjugateGradientUM]...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

file: convolutionFFT2D

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/convolutionFFT2D] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Testing built-in R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating R2C & C2R FFT plans for 2048 x 2048
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 1967.535669 MPix/s (2.033000 ms)
...reading back GPU convolution results
...running reference CPU convolution
...comparing the results: rel L2 = 1.058234E-07 (max delta = 9.978815E-07)
L2norm Error OK
...shutting down
Testing custom R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating C2C FFT plan for 2048 x 1024
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 1800.180112 MPix/s (2.222000 ms)
...reading back GPU FFT results
...running reference CPU convolution
...comparing the results: rel L2 = 1.087169E-07 (max delta = 1.084381E-06)
L2norm Error OK
...shutting down
Testing updated custom R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating C2C FFT plan for 2048 x 1024
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 2481.389596 MPix/s (1.612000 ms)
...reading back GPU FFT results
...running reference CPU convolution
...comparing the results: rel L2 = 1.084221E-07 (max delta = 9.397971E-07)
L2norm Error OK
...shutting down
Test Summary: 0 errors
Test passed

file: convolutionSeparable

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/convolutionSeparable] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Image Width x Height = 3072 x 3072

Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Running GPU convolution (16 identical iterations)...

convolutionSeparable, Throughput = 9717.1599 MPixels/sec, Time = 0.00097 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0

Reading back GPU results...

Checking the results...
 ...running convolutionRowCPU()
 ...running convolutionColumnCPU()
 ...comparing the results
 ...Relative L2 norm: 0.000000E+00

Shutting down...
Test passed

file: convolutionTexture

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/convolutionTexture] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Initializing data...
Running GPU rows convolution (10 identical iterations)...
Average convolutionRowsGPU() time: 0.622700 msecs; //7577.632456 Mpix/s
Copying convolutionRowGPU() output back to the texture...
cudaMemcpyToArray() time: 0.267000 msecs; //17672.629892 Mpix/s
Running GPU columns convolution (10 iterations)
Average convolutionColumnsGPU() time: 0.614100 msecs; //7683.752090 Mpix/s
Reading back GPU results...
Checking the results...
...running convolutionRowsCPU()
...running convolutionColumnsCPU()
Relative L2 norm: 0.000000E+00
Shutting down...
Test passed

file: cppIntegration

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Hello World.
Hello World.

file: cppOverload

C++ Function Overloading starting...
DevicecheckCudaErrors Count: 1
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Shared Size:   1024
Constant Size: 0
Local Size:    0
Max Threads Per Block: 1024
Number of Registers: 10
PTX Version: 60
Binary Version: 60
simple_kernel(const int *pIn, int *pOut, int a) PASSED

Shared Size:   2048
Constant Size: 0
Local Size:    0
Max Threads Per Block: 1024
Number of Registers: 11
PTX Version: 60
Binary Version: 60
simple_kernel(const int2 *pIn, int *pOut, int a) PASSED

Shared Size:   2048
Constant Size: 0
Local Size:    0
Max Threads Per Block: 1024
Number of Registers: 11
PTX Version: 60
Binary Version: 60
simple_kernel(const int *pIn1, const int *pIn2, int *pOut, int a) PASSED

file: cudaOpenMP

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/cudaOpenMP Starting...

number of host CPUs:    4
number of CUDA devices: 1
   0: GRID P4-4Q
---------------------------
CPU thread 0 (of 1) uses CUDA device 0
---------------------------

file: cudaTensorCoreGemm

Initializing...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

cudaTensorCoreGemm requires requires SM 7.0 or higher to use Tensor Cores.  Exiting...

file: cuHook

Error: Condition (allocation_cb == 1) failed at cuHook.cpp:115
cuHook sample failed (Didn't receive the allocation callback)

file: cuSolverDn_LinearSolver

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

step 1: read matrix market format
Using default input file [../../../../7_CUDALibraries/cuSolverDn_LinearSolver/gr_900_900_crg.mtx]
sparse matrix A is 900 x 900 with 7744 nonzeros, base=1
step 2: convert CSR(A) to dense matrix
step 3: set right hand side vector (b) to 1
step 4: prepare data on device
step 5: solve A*x = b 
timing: cholesky =   0.011339 sec
step 6: evaluate residual
|b - A*x| = 1.136868E-13 
|A| = 1.600000E+01 
|x| = 2.357708E+01 
|b - A*x|/(|A|*|x|) = 3.013701E-16

file: cuSolverRf

step 1.1: preparation
step 1.1: read matrix market format
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Using default input file [../../../../7_CUDALibraries/cuSolverRf/lap2D_5pt_n100.mtx]
WARNING: cusolverRf only works for base-0 
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=0
step 1.2: set right hand side vector (b) to 1
step 2: reorder the matrix to reduce zero fill-in
        Q = symrcm(A) or Q = symamd(A) 
step 3: B = Q*A*Q^T
step 4: solve A*x = b by LU(B) in cusolverSp
step 4.1: create opaque info structure
step 4.2: analyze LU(B) to know structure of Q and R, and upper bound for nnz(L+U)
step 4.3: workspace for LU(B)
step 4.4: compute Ppivot*B = L*U 
step 4.5: check if the matrix is singular 
step 4.6: solve A*x = b 
    i.e.  solve B*(Qx) = Q*b 
step 4.7: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 4.547474E-12 
(CPU) |A| = 8.000000E+00 
(CPU) |x| = 7.513384E+02 
(CPU) |b - A*x|/(|A|*|x|) = 7.565621E-16 
step 5: extract P, Q, L and U from P*B*Q^T = L*U 
        L has implicit unit diagonal
nnzL = 671550, nnzU = 681550
step 6: form P*A*Q^T = L*U
step 6.1: P = Plu*Qreroder
step 6.2: Q = Qlu*Qreorder 
step 7: create cusolverRf handle
step 8: set parameters for cusolverRf 
step 9: assemble P*A*Q = L*U 
step 10: analyze to extract parallelism 
step 11: import A to cusolverRf 
step 12: refactorization 
step 13: solve A*x = b 
step 14: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 4.547474E-12 
(GPU) |A| = 8.000000E+00 
(GPU) |x| = 7.513384E+02 
(GPU) |b - A*x|/(|A|*|x|) = 7.565621E-16 
===== statistics 
 nnz(A) = 49600, nnz(L+U) = 1353100, zero fill-in ratio = 27.280242

===== timing profile 
 reorder A   : 0.006283 sec
 B = Q*A*Q^T : 0.001430 sec

 cusolverSp LU analysis: 0.005644 sec
 cusolverSp LU factor  : 0.137884 sec
 cusolverSp LU solve   : 0.003406 sec
 cusolverSp LU extract : 0.011348 sec

 cusolverRf assemble : 0.022505 sec
 cusolverRf reset    : 0.000154 sec
 cusolverRf refactor : 0.169308 sec
 cusolverRf solve    : 0.186020 sec

file: cuSolverSp_LinearSolver

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Using default input file [../../../../7_CUDALibraries/cuSolverSp_LinearSolver/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm or -P=symamd
        The reordering will overwrite A such that 
            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)
step 2.1: set right hand side vector (b) to 1
step 3: prepare data on device
step 4: solve A*x = b on CPU
step 5: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 4.547474E-12 
(CPU) |A| = 8.000000E+00 
(CPU) |x| = 7.513384E+02 
(CPU) |b - A*x|/(|A|*|x|) = 7.565621E-16 
step 6: solve A*x = b on GPU
step 7: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 1.818989E-12 
(GPU) |A| = 8.000000E+00 
(GPU) |x| = 7.513384E+02 
(GPU) |b - A*x|/(|A|*|x|) = 3.026248E-16 
timing chol: CPU =   0.054121 sec , GPU =   0.131622 sec

file: cuSolverSp_LowlevelCholesky

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Using default input file [../../../../7_CUDALibraries/cuSolverSp_LowlevelCholesky/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: create opaque info structure
step 3: analyze chol(A) to know structure of L
step 4: workspace for chol(A)
step 5: compute A = L*L^T 
step 6: check if the matrix is singular 
step 7: solve A*x = b 
step 8: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 3.637979E-12 
(CPU) |A| = 8.000000E+00 
(CPU) |x| = 7.513384E+02 
(CPU) |b - A*x|/(|A|*|x|) = 6.052497E-16 
step 9: create opaque info structure
step 10: analyze chol(A) to know structure of L
step 11: workspace for chol(A)
step 12: compute A = L*L^T 
step 13: check if the matrix is singular 
step 14: solve A*x = b 
(GPU) |b - A*x| = 1.364242E-12 
(GPU) |b - A*x|/(|A|*|x|) = 2.269686E-16

file: cuSolverSp_LowlevelQR

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Using default input file [../../../../7_CUDALibraries/cuSolverSp_LowlevelQR/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: create opaque info structure
step 3: analyze qr(A) to know structure of L
step 4: workspace for qr(A)
step 5: compute A = L*L^T 
step 6: check if the matrix is singular 
step 7: solve A*x = b 
step 8: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 6.821210E-12 
(CPU) |A| = 8.000000E+00 
(CPU) |x| = 7.513384E+02 
(CPU) |b - A*x|/(|A|*|x|) = 1.134843E-15 
step 9: create opaque info structure
step 10: analyze qr(A) to know structure of L
step 11: workspace for qr(A)
GPU buffer size = 51372672 bytes
step 12: compute A = L*L^T 
step 13: check if the matrix is singular 
step 14: solve A*x = b 
(GPU) |b - A*x| = 5.911716E-12 
(GPU) |b - A*x|/(|A|*|x|) = 9.835307E-16

file: dct8x8

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/dct8x8 Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

CUDA sample DCT/IDCT implementation
===================================
Loading test image: barbara.bmp... [512 x 512]... Success
Running Gold 1 (CPU) version... Success
Running Gold 2 (CPU) version... Success
Running CUDA 1 (GPU) version... Success
Running CUDA 2 (GPU) version... 34357.012287 MPix/s //0.007630 ms
Success
Running CUDA short (GPU) version... Success
Dumping result to barbara_gold1.bmp... Success
Dumping result to barbara_gold2.bmp... Success
Dumping result to barbara_cuda1.bmp... Success
Dumping result to barbara_cuda2.bmp... Success
Dumping result to barbara_cuda_short.bmp... Success
Processing time (CUDA 1)    : 0.071000 ms 
Processing time (CUDA 2)    : 0.007630 ms 
Processing time (CUDA short): 0.085000 ms 
PSNR Original    <---> CPU(Gold 1)    : 32.777073
PSNR Original    <---> CPU(Gold 2)    : 32.777050
PSNR Original    <---> GPU(CUDA 1)    : 32.777092
PSNR Original    <---> GPU(CUDA 2)    : 32.777077
PSNR Original    <---> GPU(CUDA short): 32.749447
PSNR CPU(Gold 1) <---> GPU(CUDA 1)    : 64.019310
PSNR CPU(Gold 2) <---> GPU(CUDA 2)    : 71.777740
PSNR CPU(Gold 2) <---> GPU(CUDA short): 42.258053

Test Summary...
Test passed

file: deviceQuery

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID P4-4Q"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 4096 MBytes (4294705152 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1114 MHz (1.11 GHz)
  Memory Clock rate:                             3003 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 2
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

file: deviceQueryDrv

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version 
Detected 1 CUDA Capable device(s)

Device 0: "GRID P4-4Q"
  CUDA Driver Version:                           9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 4096 MBytes (4294705152 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1114 MHz (1.11 GHz)
  Memory Clock rate:                             3003 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Max Texture Dimension Sizes                    1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
  Texture alignment:                             512 bytes
  Maximum memory pitch:                          2147483647 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 2
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS

file: dwtHaar1D

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/dwtHaar1D Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

source file    = "../../../../3_Imaging/dwtHaar1D/data/signal.dat"
reference file = "result.dat"
gold file      = "../../../../3_Imaging/dwtHaar1D/data/regression.gold.dat"
Reading signal from "../../../../3_Imaging/dwtHaar1D/data/signal.dat"
Writing result to "result.dat"
Reading reference result from "../../../../3_Imaging/dwtHaar1D/data/regression.gold.dat"
Test success!

file: dxtc

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/dxtc Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Image Loaded '../../../../3_Imaging/dxtc/data/lena_std.ppm', 512 x 512 pixels

Running DXT Compression on 512 x 512 image...

16384 Blocks, 64 Threads per Block, 1048576 Threads in Grid...

dxtc, Throughput = 99.9405 MPixels/s, Time = 0.00262 s, Size = 262144 Pixels, NumDevsUsed = 1, Workgroup = 64

Checking accuracy...
Deviation at (   9,   1):       0.791667 rms
Deviation at (  99,   1):       1.041667 rms
Deviation at (  33,   2):       2.645833 rms
Deviation at (  38,   4):       1.916667 rms
Deviation at (  57,   4):       0.854167 rms
Deviation at (  20,   7):       1.041667 rms
Deviation at (  57,   7):       0.458333 rms
Deviation at (   8,   9):       0.937500 rms
Deviation at (  31,   9):       0.770833 rms
Deviation at (  13,  11):       1.041667 rms
Deviation at (  88,  11):       0.729167 rms
Deviation at (   4,  13):       8.562500 rms
Deviation at (  28,  13):       0.562500 rms
Deviation at (  90,  13):       0.708333 rms
Deviation at (  25,  14):       0.520833 rms
Deviation at (  87,  16):       0.708333 rms
Deviation at (  24,  19):       0.916667 rms
Deviation at (  25,  19):       0.625000 rms
Deviation at (  26,  19):       1.041667 rms
Deviation at (  55,  20):       4.791667 rms
Deviation at (  20,  23):       1.541667 rms
Deviation at (  99,  23):       3.312500 rms
Deviation at (  45,  24):       18.104166 rms
Deviation at (   8,  28):       0.895833 rms
Deviation at (  21,  30):       1.562500 rms
Deviation at ( 115,  32):       24.104166 rms
Deviation at (   2,  33):       0.854167 rms
Deviation at ( 102,  33):       2.250000 rms
Deviation at (  50,  35):       26.958334 rms
Deviation at (  12,  38):       2.166667 rms
Deviation at (  96,  39):       1.041667 rms
Deviation at (  40,  40):       0.270833 rms
Deviation at (  43,  44):       2.250000 rms
Deviation at (  54,  44):       4.791667 rms
Deviation at (  46,  46):       2.875000 rms
Deviation at ( 116,  46):       0.604167 rms
Deviation at ( 117,  46):       6.833333 rms
Deviation at ( 117,  48):       0.937500 rms
Deviation at (  23,  51):       3.520833 rms
Deviation at (  67,  54):       5.687500 rms
Deviation at (  26,  55):       0.854167 rms
Deviation at (  21,  56):       5.000000 rms
Deviation at (  24,  56):       0.562500 rms
Deviation at (  30,  57):       0.937500 rms
Deviation at ( 126,  57):       1.208333 rms
Deviation at (  21,  59):       2.541667 rms
Deviation at ( 120,  59):       0.104167 rms
Deviation at ( 112,  60):       1.125000 rms
Deviation at (  76,  61):       1.666667 rms
Deviation at (  77,  61):       1.083333 rms
Deviation at (  75,  62):       0.937500 rms
Deviation at ( 121,  62):       0.937500 rms
Deviation at ( 124,  64):       2.854167 rms
Deviation at (  78,  66):       0.541667 rms
Deviation at ( 106,  68):       0.375000 rms
Deviation at (  16,  70):       3.104167 rms
Deviation at (  10,  71):       0.937500 rms
Deviation at ( 108,  71):       0.354167 rms
Deviation at (   0,  72):       0.854167 rms
Deviation at ( 118,  72):       5.562500 rms
Deviation at (  11,  73):       0.541667 rms
Deviation at (  68,  74):       1.937500 rms
Deviation at (  70,  76):       1.791667 rms
Deviation at ( 124,  76):       3.354167 rms
Deviation at ( 103,  78):       0.375000 rms
Deviation at (  74,  79):       0.270833 rms
Deviation at ( 108,  79):       0.083333 rms
Deviation at (  43,  82):       24.979166 rms
Deviation at (  58,  82):       2.833333 rms
Deviation at (  67,  82):       3.125000 rms
Deviation at (  78,  82):       2.437500 rms
Deviation at ( 123,  84):       0.541667 rms
Deviation at ( 127,  88):       0.229167 rms
Deviation at (  99,  89):       0.770833 rms
Deviation at (  93,  91):       0.666667 rms
Deviation at ( 118,  91):       1.125000 rms
Deviation at ( 115,  92):       0.083333 rms
Deviation at ( 115,  93):       0.083333 rms
Deviation at (  45,  94):       0.166667 rms
Deviation at (  14,  95):       1.937500 rms
Deviation at (  69,  95):       1.875000 rms
Deviation at ( 106,  95):       1.125000 rms
Deviation at ( 107,  95):       3.708333 rms
Deviation at (  13,  96):       1.354167 rms
Deviation at ( 115,  98):       0.187500 rms
Deviation at ( 118,  98):       0.187500 rms
Deviation at ( 116, 101):       0.187500 rms
Deviation at (  87, 106):       0.270833 rms
Deviation at (  67, 107):       0.708333 rms
Deviation at (  74, 107):       0.375000 rms
Deviation at (  65, 109):       0.770833 rms
Deviation at (  89, 109):       0.708333 rms
Deviation at ( 118, 109):       3.854167 rms
Deviation at (  88, 111):       0.208333 rms
Deviation at (  64, 113):       0.708333 rms
Deviation at (  84, 113):       0.333333 rms
Deviation at (  75, 114):       2.083333 rms
Deviation at (  66, 115):       0.770833 rms
Deviation at (  89, 116):       0.770833 rms
Deviation at (  19, 118):       5.270833 rms
Deviation at (  76, 121):       0.104167 rms
Deviation at (  70, 122):       0.708333 rms
Deviation at (  91, 122):       0.208333 rms
Deviation at (  75, 123):       0.854167 rms
Deviation at (  61, 124):       0.937500 rms
Deviation at (  91, 124):       0.270833 rms
Deviation at (  91, 125):       1.020833 rms
RMS(reference, result) = 0.015238

Test passed

file: eigenvalues

Starting eigenvalues
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: 'eigenvalues.dat'
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 2.891819 ms
Average time step 2, one intervals: 3.538522 ms
Average time step 2, mult intervals: 7.413971 ms
Average time TOTAL: 13.875119 ms
Test Succeeded!

file: fastWalshTransform

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/fastWalshTransform Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Initializing data...
...allocating CPU memory
...allocating GPU memory
...generating data
Data length: 8388608; kernel length: 128
Running GPU dyadic convolution using Fast Walsh Transform...
GPU time: 9.447000 ms; GOP/s: 30.634804
Reading back GPU results...
Running straightforward CPU dyadic convolution...
Comparing the results...
Shutting down...
L2 norm: 1.021579E-07
Test passed

file: FDTD3d

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/FDTD3d Starting...

Set-up, based upon target device GMEM size...
 getTargetDeviceGlobalMemSize
 cudaGetDeviceCount
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

 cudaGetDeviceProperties
 generateRandomData

FDTD on 376 x 376 x 376 volume with symmetric filter radius 4 for 5 timesteps...

fdtdReference...
 calloc intermediate
 Host FDTD loop
        t = 0
        t = 1
        t = 2
        t = 3
        t = 4

fdtdReference complete
fdtdGPU...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

 set block size to 32x16
 set grid size to 12x24
 GPU FDTD loop
        t = 0 launch kernel
        t = 1 launch kernel
        t = 2 launch kernel
        t = 3 launch kernel
        t = 4 launch kernel

fdtdGPU complete

CompareData (tolerance 0.000100)...

file: FilterBorderControlNPP

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/FilterBorderControlNPP Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

cudaSetDevice GPU0 = GRID P4-4Q
NPP Library Version 9.0.176
  CUDA Driver  Version: 9.0
  CUDA Runtime Version: 9.0
  Device 0: <      GRID P4-4Q >, Compute SM 6.1 detected
gradientFilterBorderNPP opened <../../../../common/data/Lena.pgm> successfully!
Saved image: ./../../../../common/data/Lena.pgm_gradientVectorPrewittBorderX_Vertical.pgm
Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal.pgm
Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderX_Vertical_WithNoSourceBorders.pgm
Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal_WithNoSourceBorders.pgm
Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderX_Vertical_BorderDiffs.pgm
Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal_BorderDiffs.pgm
Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderX_Vertical_WithMixedBorders.pgm
Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal_WithMixedBorders.pgm
Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderX_Vertical_MixedBorderDiffs.pgm
Saved image: ../../../../common/data/Lena_gradientVectorPrewittBorderY_Horizontal_MixedBorderDiffs.pgm

file: fp16ScalarProduct

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Result: 587296.812500

file: freeImageInteropNPP

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/freeImageInteropNPP Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

cudaSetDevice GPU0 = GRID P4-4Q
NPP Library Version 9.0.176
  CUDA Driver  Version: 9.0
  CUDA Runtime Version: 9.0
  Device 0: <      GRID P4-4Q >, Compute SM 6.1 detected
freeImageInteropNPP opened: <../../../../common/data/Lena.pgm> successfully!
Size  (512, 512)
Pitch 512
Type  FIC_MINISBLACK
BPP   8

file: histEqualizationNPP

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/histEqualizationNPP Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

cudaSetDevice GPU0 = GRID P4-4Q
NPP Library Version 9.0.176
  CUDA Driver  Version: 9.0
  CUDA Runtime Version: 9.0
  Device 0: <      GRID P4-4Q >, Compute SM 6.1 detected
histEqualizationNPP opened: <../../../../common/data/Lena.pgm> successfully!
Saved image file ../../../../common/data/Lena_histEqualization.pgm

file: histogram

[[histogram]] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

CUDA device [GRID P4-4Q] has 20 Multi-Processors, Compute 6.1
Initializing data...
...allocating CPU memory.
...generating input data
...allocating GPU memory and copying input data

Starting up 64-bin histogram...

Running 64-bin GPU histogram for 67108864 bytes (16 runs)...

histogram64() time (average) : 0.00048 sec, 138816.0059 MB/sec

histogram64, Throughput = 138816.0059 MB/s, Time = 0.00048 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 64

Validating GPU results...
 ...reading back GPU results
 ...histogram64CPU()
 ...comparing the results...
 ...64-bin histograms match

Shutting down 64-bin histogram...


Initializing 256-bin histogram...
Running 256-bin GPU histogram for 67108864 bytes (16 runs)...

histogram256() time (average) : 0.00053 sec, 126740.0624 MB/sec

histogram256, Throughput = 126740.0624 MB/s, Time = 0.00053 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 192

Validating GPU results...
 ...reading back GPU results
 ...histogram256CPU()
 ...comparing the results
 ...256-bin histograms match

Shutting down 256-bin histogram...


Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

[histogram] - Test Summary
Test passed

file: HSOpticalFlow

HSOpticalFlow Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Loading "frame10.ppm" ...
Loading "frame11.ppm" ...
Computing optical flow on CPU...
Computing optical flow on GPU...
L1 error : 0.044308

file: inlinePTX

CUDA inline PTX assembler sample
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Test Successful.

file: inlinePTX_nvrtc

CUDA inline PTX assembler sample
> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
Test Successful.

file: interval

[Interval Computing]  starting ...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> GPU Device has Compute Capabilities SM 6.1

GPU naive implementation
Searching for roots in [0.01, 4]...
Found 2 intervals that may contain the root(s)
 i[0] = [0.999655515093009, 1.00011722206639]
 i[1] = [1.00011907576551, 1.00044661086269]
Number of equations solved: 65536
Time per equation: 13.6764059066772 us

Check against Host computation...

file: jpegNPP

NPP Library Version 9.0.176
  CUDA Driver  Version: 9.0
  CUDA Runtime Version: 9.0
  Device 0: <      GRID P4-4Q >, Compute SM 6.1 detected
jpegNPP usage
   -input=srcfile.jpg     (input  file JPEG image)
   -output=destfile.jpg   (output file JPEG image)
   -scale=1.0             (scale multiplier for width and height)

Source File: ../../../../common/data/Growth_of_cubic_bacteria_25x16.jpg
Output File: scaled.jpg
Scale Factor: 0.5
Image Size: 2560x1600x3
Output Size: 1280x800x3

file: libcuhook.so.1

file: lineOfSight

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/lineOfSight] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Line of sight
Average time: 1.848300 ms

Test passed

file: matrixMul

[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 682.11 GFlop/s, Time= 0.192 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

file: matrixMulCUBLAS

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 2053.36 GFlop/s, Time= 0.096 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

file: matrixMulDrv

[ matrixMulDrv (Driver API) ]
> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
  Total amount of global memory:     4294705152 bytes
  64-bit Memory Address:             NO
> findModulePath <./matrixMul_kernel64.ptx>
> initCUDA loading module: <./matrixMul_kernel64.ptx>
> PTX JIT log:

Processing time: 0.140000 (ms)
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

file: matrixMulDynlinkJIT

[ matrixMulDynlinkJIT (CUDA dynamic linking) ]
> Device 0: "GRID P4-4Q" with Compute 6.1 capability
> Compiling CUDA module
> PTX JIT log:

Test run success!

file: matrixMul_nvrtc

[Matrix Multiply Using CUDA] - Starting...
MatrixA(320,320), MatrixB(640,320)
> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
Computing result using CUDA Kernel...
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

file: MC_EstimatePiInlineP

Monte Carlo Estimate Pi (with inline PRNG)
==========================================

Estimating Pi on GPU (GRID P4-4Q)

Precision:      single
Number of sims: 100000
Tolerance:      1.000000e-02
GPU result:     3.140440e+00
Expected:       3.141593e+00
Absolute error: 1.152754e-03
Relative error: 3.669329e-04

MonteCarloEstimatePiInlineP, Performance = 579847.94 sims/s, Time = 172.46(ms), NumDevsUsed = 1, Blocksize = 128

file: MC_EstimatePiInlineQ

Monte Carlo Estimate Pi (with inline QRNG)
==========================================

Estimating Pi on GPU (GRID P4-4Q)

Precision:      single
Number of sims: 100000
Tolerance:      1.000000e-02
GPU result:     3.142520e+00
Expected:       3.141593e+00
Absolute error: 9.272099e-04
Relative error: 2.951401e-04

MonteCarloEstimatePiInlineQ, Performance = 602518.54 sims/s, Time = 165.97(ms), NumDevsUsed = 1, Blocksize = 128

file: MC_EstimatePiP

Monte Carlo Estimate Pi (with batch PRNG)
=========================================

Estimating Pi on GPU (GRID P4-4Q)

Precision:      single
Number of sims: 100000
Tolerance:      1.000000e-02
GPU result:     3.136320e+00
Expected:       3.141593e+00
Absolute error: 5.272627e-03
Relative error: 1.678329e-03

MonteCarloEstimatePiP, Performance = 565585.27 sims/s, Time = 176.81(ms), NumDevsUsed = 1, Blocksize = 128

file: MC_EstimatePiQ

Monte Carlo Estimate Pi (with batch QRNG)
=========================================

Estimating Pi on GPU (GRID P4-4Q)

Precision:      single
Number of sims: 100000
Tolerance:      1.000000e-02
GPU result:     3.141840e+00
Expected:       3.141593e+00
Absolute error: 2.472401e-04
Relative error: 7.869895e-05

MonteCarloEstimatePiQ, Performance = 579888.28 sims/s, Time = 172.45(ms), NumDevsUsed = 1, Blocksize = 128

file: MC_SingleAsianOptionP

Monte Carlo Single Asian Option (with PRNG)
===========================================

Pricing option on GPU (GRID P4-4Q)

Precision:      single
Number of sims: 100000

   Spot    |   Strike   |     r      |   sigma    |   tenor    |  Call/Put  |   Value    |  Expected  |
-----------|------------|------------|------------|------------|------------|------------|------------|
        40 |         35 |       0.03 |        0.2 |   0.333333 |       Call |    5.17634 |    5.16253 |

MonteCarloSingleAsianOptionP, Performance = 598601.69 sims/s, Time = 167.06(ms), NumDevsUsed = 1, Blocksize = 128

file: mergeSort

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/mergeSort Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Allocating and initializing host arrays...

Allocating and initializing CUDA arrays...

Initializing GPU merge sort...
Running GPU merge sort...
Time: 9.544000 ms
Reading back GPU merge sort results...
Inspecting the results...
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: stable!
Shutting down...

file: MersenneTwisterGP11213

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/MersenneTwisterGP11213 Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Allocating data for 2400000 samples...
Seeding with 777 ...
Generating random numbers on GPU...


Reading back the results...
Generating random numbers on CPU...

Comparing CPU/GPU random numbers...

Max absolute error: 0.000000E+00
L1 norm: 0.000000E+00

MersenneTwisterGP11213, Throughput = 6.5520 GNumbers/s, Time = 0.00037 s, Size = 2400000 Numbers
Shutting down...

file: MonteCarloMultiGPU

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/MonteCarloMultiGPU Starting...

Using single CPU thread for multiple GPUs
MonteCarloMultiGPU
==================
Parallelization method  = streamed
Problem scaling         = weak
Number of GPUs          = 1
Total number of options = 8192
Number of paths         = 262144
main(): generating input data...
main(): starting 1 host threads...
main(): GPU statistics, streamed
GPU Device #0: GRID P4-4Q
Options         : 8192
Simulation paths: 262144

Total time (ms.): 45.661999
        Note: This is elapsed time for all to compute.
Options per sec.: 179405.199607
main(): comparing Monte Carlo and Black-Scholes results...
Shutting down...
Test Summary...
L1 norm        : 4.825160E-04
Average reserve: 11.741779

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed

file: newdelete

newdelete Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

 > Container = Vector test OK

 > Container = Vector, using placement new on SMEM buffer test OK

 > Container = Vector, with user defined datatype test OK

Test Summary: 3/3 succesfully run

file: nvgraph_Pagerank

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> Detected Compute SM 6.1 hardware with 20 multi-processors
pr_1, alpha = 0.85
0.051705
0.073679
0.057413
0.348703
0.199904
0.268596

pr_2, alpha = 0.90
0.037212
0.053958
0.041506
0.375080
0.205998
0.286246


Done!

file: nvgraph_SemiRingSpMV

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> Detected Compute SM 6.1 hardware with 20 multi-processors
y_h
9.900001
14.299999
80.300003
62.699997
4.950000


Done!

file: nvgraph_SpectralClustering

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Modularity_score: 0.371466
Hit rate : 100.000000% (34 hits)
Done!

file: nvgraph_SSSP

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> Detected Compute SM 6.1 hardware with 20 multi-processors
sssp_1_h
0.000000
0.500000
0.500000
1.333333
0.833333
1.333333


Done!
sssp_2_h
340282346638528859811704183484516925440.000000
340282346638528859811704183484516925440.000000
340282346638528859811704183484516925440.000000
1.000000
1.500000
0.000000


Done!

file: p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GRID P4-4Q, pciBusID: 2, pciDeviceID: 2, pciDomainID:0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Connectivity Matrix
     D\D     0
     0       1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0 
     0 165.24 
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0 
     0 165.31 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0 
     0 165.59 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0 
     0 165.42 
P2P=Disabled Latency Matrix (us)
   D\D     0 
     0   6.35 
P2P=Enabled Latency Matrix (us)
   D\D     0 
     0   6.40 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

file: ptxjit

[PTX Just In Time (JIT) Compilation (no-qatest)] - Starting...
> Using CUDA device [0]: GRID P4-4Q
Loading myPtx[] program
CUDA Link Completed in 0.000000ms. Linker Output:
info    : 0 bytes gmem
info    : Function properties for '_Z8myKernelPi':
info    : used 4 registers, 0 stack, 0 bytes smem, 328 bytes cmem[0], 0 bytes lmem
CUDA kernel launched

file: quasirandomGenerator

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/quasirandomGenerator Starting...

Allocating GPU memory...
Allocating CPU memory...
Initializing QRNG tables...

Testing QRNG...

quasirandomGenerator, Throughput = 27.4616 GNumbers/s, Time = 0.00011 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384

Reading GPU results...
Comparing to the CPU results...

L1 norm: 7.275964E-12

Testing inverseCNDgpu()...

quasirandomGenerator-inverse, Throughput = 33.7343 GNumbers/s, Time = 0.00009 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128
Reading GPU results...

Comparing to the CPU results...
L1 norm: 9.439909E-08

Shutting down...

file: quasirandomGenerator_nvrtc

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/quasirandomGenerator_nvrtc Starting...

> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
Allocating GPU memory...
Allocating CPU memory...
Initializing QRNG tables...

Testing QRNG...

quasirandomGenerator, Throughput = 21.0417 GNumbers/s, Time = 0.00015 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384

Reading GPU results...
Comparing to the CPU results...

L1 norm: 7.275964E-12

Testing inverseCNDgpu()...

quasirandomGenerator-inverse, Throughput = 31.1458 GNumbers/s, Time = 0.00010 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128
Reading GPU results...

Comparing to the CPU results...
L1 norm: 9.439909E-08

Shutting down...

file: radixSortThrust

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/radixSortThrust Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1


Sorting 1048576 32-bit unsigned int keys and values

radixSortThrust, Throughput = 340.2453 MElements/s, Time = 0.00308 s, Size = 1048576 elements
Test passed

file: reduction

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/reduction Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Using Device 0: GRID P4-4Q

Reducing array of type int

16777216 elements
256 threads (max)
64 blocks

Reduction, Throughput = 45.9587 GB/s, Time = 0.00146 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256

GPU result = 2139353471
CPU result = 2139353471

Test passed

file: reductionMultiBlockCG

reductionMultiBlockCG Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

33554432 elements
numThreads: 1024
numBlocks: 40

Launching SinglePass Multi Block Cooperative Groups kernel
Average time: 0.837790 ms
Bandwidth:    160.204520 GB/s

GPU result = 1.992401361465
CPU result = 1.992401361465

file: scalarProd

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/scalarProd Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Initializing data...
...allocating CPU memory.
...allocating GPU memory.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.
Executing GPU kernel...
GPU time: 0.156000 msecs.
Reading back GPU result...
Checking GPU results...
..running CPU scalar product calculation
...comparing the results
Shutting down...
L1 error: 2.745062E-08
Test passed

file: scan

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/scan Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Initializing CUDA-C scan...

*** Running GPU scan for short arrays (100 identical iterations)...

Running scan for 4 elements (1703936 arrays)...
Validating the results...
...reading back GPU results
 ...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 8 elements (851968 arrays)...
Validating the results...
...reading back GPU results
 ...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 16 elements (425984 arrays)...
Validating the results...
...reading back GPU results
 ...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 32 elements (212992 arrays)...
Validating the results...
...reading back GPU results
 ...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 64 elements (106496 arrays)...
Validating the results...
...reading back GPU results
 ...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 128 elements (53248 arrays)...
Validating the results...
...reading back GPU results
 ...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 256 elements (26624 arrays)...
Validating the results...
...reading back GPU results
 ...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 512 elements (13312 arrays)...
Validating the results...
...reading back GPU results
 ...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 1024 elements (6656 arrays)...
Validating the results...
...reading back GPU results
 ...scanExclusiveHost()
 ...comparing the results
 ...Results Match


scan, Throughput = 3.1163 MElements/s, Time = 0.00033 s, Size = 1024 Elements, NumDevsUsed = 1, Workgroup = 256

***Running GPU scan for large arrays (100 identical iterations)...

Running scan for 2048 elements (3328 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 4096 elements (1664 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 8192 elements (832 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 16384 elements (416 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 32768 elements (208 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 65536 elements (104 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 131072 elements (52 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
 ...comparing the results
 ...Results Match

Running scan for 262144 elements (26 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
 ...comparing the results
 ...Results Match


scan, Throughput = 391.5051 MElements/s, Time = 0.00067 s, Size = 262144 Elements, NumDevsUsed = 1, Workgroup = 256

Shutting down...

file: segmentationTreeThrust

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/segmentationTreeThrust Starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

* Building segmentation tree... done in 255.807 (ms)
* Dumping levels for each tree...

file: shfl_scan

Starting shfl_scan
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> Detected Compute SM 6.1 hardware with 20 multi-processors
Starting shfl_scan
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> Detected Compute SM 6.1 hardware with 20 multi-processors
Computing Simple Sum test
---------------------------------------------------
Initialize test data [1, 1, 1...]
Scan summation for 65536 elements, 256 partial sums
Partial summing 256 elements with 1 blocks of size 256
Test Sum: 65536
Time (ms): 0.052064
65536 elements scanned in 0.052064 ms -> 1258.758423 MegaElements/s
CPU verify result diff (GPUvsCPU) = 0
CPU sum (naive) took 0.047840 ms

Computing Integral Image Test on size 1920 x 1080 synthetic data
---------------------------------------------------
Method: Fast  Time (GPU Timer): 0.068608 ms Diff = 0
Method: Vertical Scan  Time (GPU Timer): 0.154016 ms 
CheckSum: 2073600, (expect 1920x1080=2073600)

file: simpleAssert

simpleAssert.cu:47: void testKernel(int): block: [1,0,0], thread: [28,0,0] Assertion `gtid < N` failed.
simpleAssert.cu:47: void testKernel(int): block: [1,0,0], thread: [29,0,0] Assertion `gtid < N` failed.
simpleAssert.cu:47: void testKernel(int): block: [1,0,0], thread: [30,0,0] Assertion `gtid < N` failed.
simpleAssert.cu:47: void testKernel(int): block: [1,0,0], thread: [31,0,0] Assertion `gtid < N` failed.
simpleAssert starting...
OS_System_Type.release = 3.10.0-693.2.2.el7.x86_64
OS Info: <#1 SMP Tue Sep 12 22:26:13 UTC 2017>

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Launch kernel to generate assertion failures

-- Begin assert output


-- End assert output

Device assert failed as expected, CUDA error message is: device-side assert triggered

simpleAssert completed, returned OK

file: simpleAssert_nvrtc

../../../../0_Simple/simpleAssert_nvrtc/simpleAssert_kernel.cu:13: void testKernel(int): block: [1,0,0], thread: [28,0,0] Assertion `gtid < N` failed.
../../../../0_Simple/simpleAssert_nvrtc/simpleAssert_kernel.cu:13: void testKernel(int): block: [1,0,0], thread: [29,0,0] Assertion `gtid < N` failed.
../../../../0_Simple/simpleAssert_nvrtc/simpleAssert_kernel.cu:13: void testKernel(int): block: [1,0,0], thread: [30,0,0] Assertion `gtid < N` failed.
../../../../0_Simple/simpleAssert_nvrtc/simpleAssert_kernel.cu:13: void testKernel(int): block: [1,0,0], thread: [31,0,0] Assertion `gtid < N` failed.
simpleAssert_nvrtc starting...
Launch kernel to generate assertion failures
> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability

-- Begin assert output


-- End assert output

Device assert failed as expected

file: simpleAtomicIntrinsics

simpleAtomicIntrinsics starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> GPU device has 20 Multi-Processors, SM 6.1 compute capabilities

Processing time: 153.134995 (ms)
simpleAtomicIntrinsics completed, returned OK

file: simpleAtomicIntrinsics_nvrtc

simpleAtomicIntrinsics_nvrtc starting...
> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
Processing time: 1.088000 (ms)
simpleAtomicIntrinsics_nvrtc completed, returned OK

file: simpleCallback

Starting simpleCallback
Found 1 CUDA capable GPUs
GPU[0] GRID P4-4Q supports SM 6.1, capable GPU Callback Functions
1 GPUs available to run Callback Functions
Starting 8 heterogeneous computing workloads
Total of 8 workloads finished:
Success

file: simpleCooperativeGroups

Launching a single block with 64 threads...

 Sum of all ranks 0..63 in threadBlockGroup is 2016 (expected 2016)

 Now creating 4 groups, each of size 16 threads:

   Sum of all ranks 0..15 in this tiledPartition16 group is 120 (expected 120)
   Sum of all ranks 0..15 in this tiledPartition16 group is 120 (expected 120)
   Sum of all ranks 0..15 in this tiledPartition16 group is 120 (expected 120)
   Sum of all ranks 0..15 in this tiledPartition16 group is 120 (expected 120)

...Done.

file: simpleCubemapTexture

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

CUDA device [GRID P4-4Q] has 20 Multi-Processors SM 6.1
Covering Cubemap data array of 64~3 x 1: Grid size is 8 x 8, each block has 8 x 8 threads
Processing time: 0.019 msec
1293.47 Mtexlookups/sec
Comparing kernel output to expected data

file: simpleCUBLAS

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

simpleCUBLAS test running..
simpleCUBLAS test passed.

file: simpleCUBLASXT

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

simpleCUBLASXT test running..
simpleCUBLASXT test passed.

file: simpleCUFFT

[simpleCUFFT] is starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Temporary buffer size 448 bytes
Transforming signal cufftExecC2C
Launching ComplexPointwiseMulAndScale<<< >>>
Transforming signal back cufftExecC2C

file: simpleCUFFT_2d_MGPU

Poisson equation using CUFFT library on Multiple GPUs is starting...

No. of GPU on node 1
Two GPUs are required to run simpleCUFFT_2d_MGPU sample code

file: simpleCUFFT_callback

[simpleCUFFT_callback] is starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Transforming signal cufftExecC2C
Transforming signal back cufftExecC2C

file: simpleCUFFT_MGPU

[simpleCUFFT_MGPU] is starting...

No. of GPU on node 1
Two GPUs are required to run simpleCUFFT_MGPU sample code

file: simpleDevLibCUBLAS

simpleDevLibCUBLAS test running...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Host and device APIs will be tested.
simpleDevLibCUBLAS completed, returned OK

file: simpleHyperQ

starting hyperQ...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> Detected Compute SM 6.1 hardware with 20 multi-processors
Expected time for serial execution of 32 sets of kernels is between approx. 0.330s and 0.640s
Expected time for fully concurrent execution of 32 sets of kernels is approx. 0.020s
Measured time for sample = 0.050s

file: simpleIPC

Checking for multiple GPUs...
CUDA-capable device count: 1

Searching for UVA capable devices...
> GPU0 = "     GRID P4-4Q" IS capable of UVA

Spawning processes and assigning GPUs...
> Process   1 -> GPU0
> Process   1: Run kernel on GPU0, taking source data from and writing results to process 0, GPU0...

Spawning processes and assigning GPUs...
> Process   0 -> GPU0

Launching kernels...
Checking test results...

Shutting down...

file: simpleLayeredTexture

[simpleLayeredTexture] - Starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

CUDA device [GRID P4-4Q] has 20 Multi-Processors SM 6.1
Covering 2D data array of 512 x 512: Grid size is 64 x 64, each block has 8 x 8 threads
Processing time: 0.100 msec
13107.20 Mtexlookups/sec
Comparing kernel output to expected data

file: simpleMultiCopy

[simpleMultiCopy] - Starting...
> Using CUDA device [0]: GRID P4-4Q
[GRID P4-4Q] has 20 MP(s) x 128 (Cores/MP) = 2560 (Cores)
> Device name: GRID P4-4Q
> CUDA Capability 6.1 hardware with 20 multi-processors
> scale_factor = 1.00
> array_size   = 4194304


Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
    (Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000)

Measured timings (throughput):
 Memcpy host to device  : 2.706176 ms (6.199603 GB/s)
 Memcpy device to host  : 13.667808 ms (1.227499 GB/s)
 Kernel                 : 0.259104 ms (647.508920 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 16.633087 ms 
Compute can overlap with one transfer: 16.373983 ms
Compute can overlap with both data transfers: 13.667808 ms

Average measured timings over 10 repetitions:
 Avg. time when execution fully serialized      : 6.404608 ms
 Avg. time when overlapped using 4 streams      : 3.068211 ms
 Avg. speedup gained (serialized - overlapped)  : 3.336397 ms

Measured throughput:
 Fully serialized execution             : 5.239108 GB/s
 Overlapped using 4 streams             : 10.936155 GB/s

file: simpleMultiGPU

Starting simpleMultiGPU
CUDA-capable device count: 1
Generating input data...

Computing with 1 GPUs...
  GPU Processing time: 23.084000 (ms)

Computing with Host CPU...

Comparing GPU and Host CPU results...
  GPU sum: 16777296.000000
  CPU sum: 16777294.395033
  Relative difference: 9.566307E-08

file: simpleOccupancy

starting Simple Occupancy

[ Manual configuration with 32 threads per block ]
Potential occupancy: 50%
Elapsed time: 0.15312ms

[ Automatic, occupancy-based configuration ]
Suggested block size: 1024
Minimum grid size for maximum occupancy: 40
Potential occupancy: 100%
Elapsed time: 0.06736ms

Test PASSED

file: simpleP2P

[~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 1
Two or more GPUs with SM 2.0 or higher capability are required for ~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/simpleP2P.
Waiving test.

file: simplePitchLinearTexture

simplePitchLinearTexture starting...

GPU Device 0: "GRID P4-4Q" with compute capability 6.1


Bandwidth (GB/s) for pitch linear: 1.63e+02; for array: 1.64e+02

Texture fetch rate (Mpix/s) for pitch linear: 2.04e+04; for array: 2.05e+04

simplePitchLinearTexture completed, returned OK

file: simplePrintf

GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Device 0: "GRID P4-4Q" with Compute 6.1 capability
printf() is called. Output:

[2, 0]:         Value is:10
[2, 1]:         Value is:10
[2, 2]:         Value is:10
[2, 3]:         Value is:10
[2, 4]:         Value is:10
[2, 5]:         Value is:10
[2, 6]:         Value is:10
[2, 7]:         Value is:10
[3, 0]:         Value is:10
[3, 1]:         Value is:10
[3, 2]:         Value is:10
[3, 3]:         Value is:10
[3, 4]:         Value is:10
[3, 5]:         Value is:10
[3, 6]:         Value is:10
[3, 7]:         Value is:10
[0, 0]:         Value is:10
[0, 1]:         Value is:10
[0, 2]:         Value is:10
[0, 3]:         Value is:10
[0, 4]:         Value is:10
[0, 5]:         Value is:10
[0, 6]:         Value is:10
[0, 7]:         Value is:10
[1, 0]:         Value is:10
[1, 1]:         Value is:10
[1, 2]:         Value is:10
[1, 3]:         Value is:10
[1, 4]:         Value is:10
[1, 5]:         Value is:10
[1, 6]:         Value is:10
[1, 7]:         Value is:10

file: simpleSeparateCompilation

simpleSeparateCompilation starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

simpleSeparateCompilation completed, returned OK

file: simpleStreams

[ simpleStreams ]

Device synchronization method set to = 0 (Automatic Blocking)
Setting reps to 100 to demonstrate steady state

> GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Device:  canMapHostMemory: Yes
> CUDA Capable: SM 6.1 hardware
> 20 Multiprocessor(s) x 128 (Cores/Multiprocessor) = 2560 (Cores)
> scale_factor = 1.0000
> array_size   = 16777216

> Using CPU/GPU Device Synchronization method (cudaDeviceScheduleAuto)
> mmap() allocating 64.00 Mbytes (generic page-aligned system memory)
> cudaHostRegister() registering 64.00 Mbytes of generic allocated system memory

Starting Test
memcopy:        9.06
kernel:         3.46
non-streamed:   24.61
4 streams:      9.30
-------------------------------

file: simpleSurfaceWrite

simpleSurfaceWrite starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

CUDA device [GRID P4-4Q] has 20 Multi-Processors, SM 6.1
Loaded 'lena_bw.pgm', 512 x 512 pixels
Processing time: 0.036000 (ms)
7281.78 Mpixels/sec
Wrote 'output.pgm'
Comparing files
        output:    
        reference: <../../../../0_Simple/simpleSurfaceWrite/data/ref_rotated.pgm>
simpleSurfaceWrite completed, returned OK

file: simpleTemplates

> runTest
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

CUDA device [GRID P4-4Q] has 20 Multi-Processors
Processing time: 159.790000 (ms)
Compare OK

> runTest
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

CUDA device [GRID P4-4Q] has 20 Multi-Processors
Processing time: 0.961000 (ms)
Compare OK


[simpleTemplates] -> Test Results: 0 Failures

file: simpleTemplates_nvrtc

> runTest
> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
Processing time: 1.043000 (ms)
Compare OK

> runTest
Processing time: 0.953000 (ms)
Compare OK


[simpleTemplates_nvrtc] -> Test Results: 0 Failures

file: simpleTexture

simpleTexture starting...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Loaded 'lena_bw.pgm', 512 x 512 pixels
Processing time: 0.037000 (ms)
7084.97 Mpixels/sec
Wrote '../../../../0_Simple/simpleTexture/data/lena_bw_out.pgm'
Comparing files
        output:    <../../../../0_Simple/simpleTexture/data/lena_bw_out.pgm>
        reference: <../../../../0_Simple/simpleTexture/data/ref_rotated.pgm>
simpleTexture completed, returned OK

file: simpleTextureDrv

> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
> findModulePath <./simpleTexture_kernel64.ptx>
> initCUDA loading module: <./simpleTexture_kernel64.ptx>
> PTX JIT log:

Loaded 'lena_bw.pgm', 512 x 512 pixels
Processing time: 0.034000 (ms)
7710.12 Mpixels/sec
Wrote '../../../../0_Simple/simpleTextureDrv/data/lena_bw_out.pgm'
Comparing files
        output:    <../../../../0_Simple/simpleTextureDrv/data/lena_bw_out.pgm>
        reference: <../../../../0_Simple/simpleTextureDrv/data/ref_rotated.pgm>

file: simpleVoteIntrinsics

[simpleVoteIntrinsics]
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

> GPU device has 20 Multi-Processors, SM 6.1 compute capabilities

[VOTE Kernel Test 1/3]
        Running <> kernel1 ...
        OK

[VOTE Kernel Test 2/3]
        Running <> kernel2 ...
        OK

[VOTE Kernel Test 3/3]
        Running <> kernel3 ...
        OK
        Shutting down...

file: simpleVoteIntrinsics_nvrtc

> Using CUDA Device [0]: GRID P4-4Q
> GPU Device has SM 6.1 compute capability
[simpleVoteIntrinsics_nvrtc]
[VOTE Kernel Test 1/3]
        Running <> kernel1 ...
        OK

[VOTE Kernel Test 2/3]
        Running <> kernel2 ...
        OK

[VOTE Kernel Test 3/3]
        Running <> kernel3 ...
        OK
        Shutting down...

file: simpleZeroCopy

  Device 0: <      GRID P4-4Q >, Compute SM 6.1 detected
> Using CUDA Host Allocated (cudaHostAlloc)
> vectorAddGPU kernel will add vectors using mapped CPU memory...
> Checking the results from vectorAddGPU() ...
> Releasing CPU memory...

file: SobolQRNG

Sobol Quasi-Random Number Generator Starting...

> number of vectors = 100000
> number of dimensions = 100
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Allocating CPU memory...
Allocating GPU memory...
Initializing direction numbers...
Copying direction numbers to device...
Executing QRNG on GPU...
Gsamples/s: 5.95593
Reading results from GPU...

Executing QRNG on CPU...
Gsamples/s: 0.139458
Checking results...
L1-Error: 0
Shutting down...

file: sortingNetworks

~/NVIDIA_CUDA-9.0_Samples/bin/x86_64/linux/release/sortingNetworks Starting...

Starting up CUDA context...
GPU Device 0: "GRID P4-4Q" with compute capability 6.1

Allocating and initializing host arrays...

Allocating and initializing CUDA arrays...

Running GPU bitonic sort (1 identical iterations)...

Testing array length 64 (16384 arrays per batch)...
Average time: 0.394000 ms