NVIDIA CUDA Profiling Tools
1. Tools
- nsight - in Windows (Visual Studio) or Linux (Eclipse)
- nvprof - on the Linux command line

```
# on a GTX 1060
nvprof --metrics ipc,issued_ipc,achieved_occupancy,global_hit_rate,local_hit_rate,l2_tex_read_hit_rate,gld_transactions,gst_transactions,local_load_transactions,local_store_transactions,l2_tex_read_transactions,l2_tex_write_transactions,l2_read_transactions,l2_write_transactions,dram_read_transactions,dram_write_transactions,sysmem_read_transactions,sysmem_write_transactions ./wave
```
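Because the metric list is long and the metric names differ between architectures, it can help to assemble the command programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_nvprof_command`; it only builds the argument list and does not run the profiler. The metric names and the `./wave` binary are taken from the command above.

```python
import shlex


def build_nvprof_command(metrics, app):
    """Return the nvprof argument list for the given metric names (hypothetical helper)."""
    if not metrics:
        raise ValueError("at least one metric is required")
    # nvprof takes one comma-separated list of names after --metrics
    return ["nvprof", "--metrics", ",".join(metrics), app]


# A few of the metrics from the command above, grouped by purpose
performance = ["ipc", "issued_ipc", "achieved_occupancy"]
hit_rates = ["global_hit_rate", "local_hit_rate", "l2_tex_read_hit_rate"]

cmd = build_nvprof_command(performance + hit_rates, "./wave")
print(shlex.join(cmd))  # shell-ready form of the command
```

On a machine where nvprof is installed, the returned list could be passed to `subprocess.run(cmd)`.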
2. Metrics
2.1 Performance
- ipc - Instructions executed per cycle
- issued_ipc - Instructions issued per cycle
- achieved_occupancy - Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
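As a worked illustration of the achieved_occupancy definition, the sketch below divides the average number of active warps per active cycle by the per-multiprocessor warp limit. The function name and the sample counts are made up for illustration. The limit of 64 warps per multiprocessor is an assumption that matches compute capability 6.1 devices such as the GTX 1060 and should be checked for other GPUs.

```python
def achieved_occupancy(active_warp_cycles, active_cycles, max_warps_per_sm=64):
    """Average active warps per active cycle divided by the warp limit per multiprocessor.

    max_warps_per_sm=64 is an assumption (compute capability 6.1).
    """
    if active_cycles <= 0:
        raise ValueError("active_cycles must be positive")
    average_active_warps = active_warp_cycles / active_cycles
    return average_active_warps / max_warps_per_sm


# Made-up counts: 3,200,000 warp-cycles accumulated over 100,000 active cycles,
# which is an average of 32 active warps per active cycle
occupancy = achieved_occupancy(3_200_000, 100_000)
print(f"{occupancy:.2f}")  # 0.50
```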
Note: this article focuses on the data cache, so every mention of the L1 cache below refers to the L1 data cache.
2.2 Cache Hit Rate
L1 Cache
Fermi/Kepler (Capability 2.x/3.x)
- l1_cache_global_hit_rate - Hit rate in L1 cache for global loads
- l1_cache_local_hit_rate - Hit rate in L1 cache for local loads and stores
- nc_cache_global_hit_rate (Kepler only) - Hit rate in non-coherent cache for global loads

Maxwell/Pascal (Capability 5.x/6.x)
- global_hit_rate - Hit rate for global loads
- local_hit_rate - Hit rate for local loads and stores
L2 Cache
Fermi/Kepler (Capability 2.x/3.x)
- l2_l1_read_hit_rate - Hit rate at L2 cache for all read requests from L1 cache
- l2_tex_read_hit_rate - Hit rate at L2 cache for all read requests from texture cache

Maxwell/Pascal (Capability 5.x/6.x)
- l2_tex_read_hit_rate - Hit rate at L2 cache for all read requests from texture cache
2.3 Transactions
L1 Cache
Global data
- gld_transactions - Number of global memory load transactions
- gld_transactions_per_request - Average number of global memory load transactions performed for each global memory load
- gst_transactions - Number of global memory store transactions
- gst_transactions_per_request - Average number of global memory store transactions performed for each global memory store

Local data
- local_load_transactions - Number of local memory load transactions
- local_load_transactions_per_request - Average number of local memory load transactions performed for each local memory load
- local_store_transactions - Number of local memory store transactions
- local_store_transactions_per_request - Average number of local memory store transactions performed for each local memory store
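The per-request metrics (for example gld_transactions_per_request) are the ratio of transactions to requests, so they indicate how well the accesses of a warp coalesce. The following is a rough sketch of that idea, assuming one request per warp-wide access and 32-byte transactions. Both are simplifying assumptions, since the actual transaction size depends on the architecture and cache configuration. `count_transactions` is a hypothetical helper, not a profiler API.

```python
def count_transactions(byte_addresses, access_size=4, segment_size=32):
    """Count the distinct aligned segments touched by one warp-wide access.

    Simplified model: each touched segment_size-byte segment costs one transaction.
    """
    segments = set()
    for addr in byte_addresses:
        first = addr // segment_size
        last = (addr + access_size - 1) // segment_size
        segments.update(range(first, last + 1))
    return len(segments)


WARP_SIZE = 32

# Each lane reads one 4-byte float
coalesced = [4 * lane for lane in range(WARP_SIZE)]    # consecutive floats
strided = [128 * lane for lane in range(WARP_SIZE)]    # one float every 128 bytes

print(count_transactions(coalesced))  # 4
print(count_transactions(strided))    # 32
```

Under this model the coalesced pattern needs 4 transactions per request and the strided one needs 32, which is the kind of gap the per-request metrics are meant to expose.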
L2 Cache
Fermi/Kepler (Capability 2.x/3.x)
- l2_l1_read_transactions - Memory read transactions seen at L2 cache for all read requests from L1 cache
- l2_l1_write_transactions - Memory write transactions seen at L2 cache for all write requests from L1 cache

Maxwell/Pascal (Capability 5.x/6.x)
- l2_tex_read_transactions - Memory read transactions seen at L2 cache for read requests from the texture cache
- l2_tex_write_transactions - Memory write transactions seen at L2 cache for write requests from the texture cache

Both
- l2_read_transactions - Memory read transactions seen at L2 cache for all read requests
- l2_write_transactions - Memory write transactions seen at L2 cache for all write requests

Kepler only
- nc_l2_read_transactions - Memory read transactions seen at L2 cache for non-coherent global read requests
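A hit rate and the matching transaction count can be combined to estimate how many read transactions miss in L2. The sketch below does that arithmetic for l2_tex_read_hit_rate and l2_tex_read_transactions, assuming the hit rate is reported as a percentage and applies uniformly to the counted transactions. The function name and the numbers are illustrative, not measured.

```python
def estimated_l2_read_misses(read_transactions, hit_rate_percent):
    """Estimate the read transactions that miss in L2 from a count and a hit rate in percent."""
    if not 0.0 <= hit_rate_percent <= 100.0:
        raise ValueError("hit_rate_percent must be between 0 and 100")
    return read_transactions * (1.0 - hit_rate_percent / 100.0)


# Illustrative values only: 1,000,000 read transactions at a 75% hit rate
misses = estimated_l2_read_misses(read_transactions=1_000_000, hit_rate_percent=75.0)
print(round(misses))  # 250000
```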
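Notes
- Since the Kepler architecture, the default policy of the L1 cache for global data has been bypassing. Only on Fermi is the L1 cache both readable and writable for global data, and even there it cannot maintain cache coherence.
- To guarantee cache coherence, NVIDIA took a fairly extreme approach: bypassing the L1 cache. In Maxwell and Pascal the L1 cache is also merged with the texture cache and made read-only, though in my opinion this does not work well. The latest architecture, Volta, returns to the Fermi-style design in which the L1 cache and shared memory are configurable.
- As a result, on Maxwell and Pascal we treat the texture cache as the L1 data cache.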
GDRAM (device memory)
- dram_read_transactions - Device memory read transactions
- dram_write_transactions - Device memory write transactions

DRAM (system memory)
- sysmem_read_transactions - System memory read transactions
- sysmem_write_transactions - System memory write transactions
Influenced by L2 Hit Rate
Reference
https://docs.nvidia.com/cuda/profiler-users-guide/index.html