L3 bus monitoring SW tool



Statistic collectors overview

The L3 NOC Statistic Collectors module is a HW IP that computes traffic statistics. It relies on HW probes located on the EMIFs (external memory) or on initiators (DSS, IVA, ISS, ...). It can be programmed in 2 different ways (via CCS or via the SW tool described here).

The SW tool leverages only the EMIF HW probes, so it cannot monitor traffic that never reaches the EMIFs, e.g. L3 register read/write traffic or direct traffic from one HW IP to another (such as MPU-ABE). Fortunately, most of the traffic goes through external memory.

"How-to" describes the SW tool and its usage. Further technical details are available in last sections

Statistic collectors

SW tool how-to


Definition and filtering of counters

The HW IP has 8 configurable counters. Each counter can be set up to track a subset of the EMIF traffic:

Default mode = accumulation mode 2

Counter: 0  Master: alldmm  Transaction: w Probe: emif1
Counter: 1  Master: alldmm  Transaction: w Probe: emif2
Counter: 2  Master: alldmm  Transaction: r Probe: emif1
Counter: 3  Master: alldmm  Transaction: r Probe: emif2
time: 823237498 823204726 32772 -> 0.00 0.00 141.92 141.92
time: 823270313 823237498 32815 -> 0.00 0.00 141.90 141.90
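The conversion behind those numbers can be sketched as follows (a minimal sketch, not the actual omapconf source; it assumes the counters accumulate bytes and the timestamps are 32 kHz ticks, consistent with the trace above):

```python
TICK_HZ = 32768.0  # 32 kHz timestamp clock

def bandwidth_mbs(prev_count, curr_count, prev_ts, curr_ts):
    """Bandwidth in MB/s between two successive counter dumps."""
    delta_bytes = curr_count - prev_count    # bytes accumulated (assumption)
    delta_s = (curr_ts - prev_ts) / TICK_HZ  # e.g. 32815 ticks -> ~1.0014 s
    return delta_bytes / delta_s / 1e6

# 1 MB transferred in exactly one second worth of ticks -> 1.0 MB/s
print(bandwidth_mbs(0, 1_000_000, 0, 32768))  # -> 1.0
```

With the timestamps of the last trace line (823237498 to 823270313, i.e. 32815 ticks), about 142.1 MB of read traffic yields the displayed ~141.9 MB/s.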

Filter initiator

Basic method

Flexible method

Filter transaction

Filter EMIF (probe)

Tune delay

Examples

omapconf trace bw --m4 ma_mpu --m5 ma_mpu --m6 ma_mpu --m7 ma_mpu -> 4 counters capture alldmm R and W on EMIF1/EMIF2, and 4 counters capture ma_mpu R on EMIF1/EMIF2 and R+W on EMIF1/EMIF2 (so you can compute W by subtraction)
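Since the command above captures ma_mpu as R and R+W rather than W directly, the write bandwidth is recovered by subtraction; a trivial sketch:

```python
def ma_mpu_write_mbs(rw_mbs, r_mbs):
    """ma_mpu write bandwidth (MB/s) from the R+W and R counter results."""
    return rw_mbs - r_mbs

# e.g. 200.0 MB/s R+W and 141.9 MB/s R -> ~58.1 MB/s W
```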

omapconf trace bw --tr r+w --m0 dss --m1 dss --m2 gpu_p1 --m3 gpu_p1 --m4 ma_mpu --m5 ma_mpu --m6 alldmm --m7 alldmm

omapconf trace bw --tr r+w --m0 dss --m1 gpu_p1 --m2 gpu_p2 --m3 iva --m4 bb2d_p1 --m5 bb2d_p1 --m6 ma_mpu --m7 alldmm

omapconf trace bw --m0 dss --tr0 r --p0 emif2 --m1 gpu_p1 --tr1 w --p1 emif1 --m2 gpu_p2 --tr2 r+w --p2 emif2 --m3 iva --tr3 w --p3 emif1

Methodology

For a fairly steady use case, the suggested methodology is:

Example: video playback


Accumulation mode 1

At very high capture rates, dumping results immediately is too intrusive. The tool can instead store the register values + timestamps in an array and dump the results at the end of the test. This is "accumulation mode 1", which requires always setting the number of iterations. You may also tune the HW IP auto-reset delay/threshold.
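The principle can be sketched as below (hypothetical helper names, not the omapconf code; the real tool stores raw register values and formats them only after the last iteration):

```python
def capture(read_timestamp, read_counters, iterations):
    """Accumulation mode 1: store raw samples, dump only at the end."""
    samples = []                       # preallocated array in the real tool
    for _ in range(iterations):
        samples.append((read_timestamp(), read_counters()))
    return samples                     # formatted/printed after the loop

# Toy usage with fake register readers (12 ticks between iterations):
ticks = iter(range(0, 120, 12))
log = capture(lambda: next(ticks), lambda: (0, 0, 59008, 59008), 4)
```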

Tune number of iterations

After 100000 iterations, i.e. 100000 * 0.3 ms = 30 s, the trace displays:

0,0,0,S,,SDRAM,,0,EMIF 0:Wr:All Initiators,T,V,77,,,,0, -> Write_EMIF1, timestamp 77 (32 kHz ticks)
0,0,0,S,,SDRAM,,0,EMIF 1:Wr:All Initiators,T,V,77,,,,0, -> Write_EMIF2
0,0,0,S,,SDRAM,,52992,EMIF 0:Rd:All Initiators,T,V,77,,,,0, -> Read_EMIF1
0,0,0,S,,SDRAM,,53120,EMIF 1:Rd:All Initiators,T,V,77,,,,0, -> Read_EMIF2
0,0,0,S,,SDRAM,,0,EMIF 0:Wr:All Initiators,T,V,89,,,,0, -> timestamp 89, i.e. 12 ticks = 366 µs later
0,0,0,S,,SDRAM,,0,EMIF 1:Wr:All Initiators,T,V,89,,,,0,
0,0,0,S,,SDRAM,,59008,EMIF 0:Rd:All Initiators,T,V,89,,,,0,
0,0,0,S,,SDRAM,,59008,EMIF 1:Rd:All Initiators,T,V,89,,,,0,
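A sketch of how such a line can be parsed (field positions inferred from the sample output above, not from a CCS specification):

```python
def parse_trace_line(line):
    """Return (label, counter_value, timestamp_ticks) from one CSV line."""
    f = line.split(",")
    return f[8], int(f[7]), int(f[11])

label, count, ticks = parse_trace_line(
    "0,0,0,S,,SDRAM,,52992,EMIF 0:Rd:All Initiators,T,V,77,,,,0,")
# 12 ticks between timestamps 77 and 89 -> ~366 us at 32 kHz
delta_us = (89 - 77) / 32768 * 1e6
```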

The max number of iterations is 1000000. You can interrupt the tool with Ctrl-C; it will then dump the current content of the array. The "iterations" option could therefore be removed in the future: simply always set 1000000 and use Ctrl-C.

The format is compatible with the CCS output format (for which a custom post-processing script was written).

Tune HW IP reset threshold

The HW counters keep accumulating and saturate at 2^32 without resetting to 0. The only SW solution is to stop/restart the IP. To avoid intrusiveness, the SW tool keeps this simple: the user must tune when the reset happens.

By default, the HW is reset every second. A reset takes less than 31 µs, i.e. the resulting error is less than 0.003%.

--overflow_delay method

This option simply changes the delay between auto-resets of the HW IP.

Counter threshold (-o -t) method

Either you already know your use case throughput well, or run a first 'omapconf trace bw' to find the biggest contributor and derive its threshold.
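The arithmetic behind the choice is simple; a sketch (assuming the counters accumulate bytes and saturate at 2^32, as stated above):

```python
def max_reset_delay_s(rate_mbs):
    """Longest auto-reset delay before a byte counter saturates at 2**32."""
    return 2**32 / (rate_mbs * 1e6)

# e.g. at ~142 MB/s the counter saturates after ~30 s, so the default
# 1 s reset leaves a very comfortable margin
```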

Post-processing/Visualization

The example below shows DSS bandwidth during Wifi Display. "Write" corresponds to Writeback pipeline activity. The low peaks at 80 MB/s correspond to the end of frame, i.e. VSYNC, so the display refresh rate can be derived from the trace.


HW IP technical details

Statistic collectors are used for target load and master latency monitoring (SDRAM and LAT0/LAT1 collectors). The SW tool currently exposes only the SDRAM collector.

Probes, memory paths

Target load / EMIF probes

These probes are attached to the EMIFs. MPU transactions are not routed through the L3 NOC due to latency constraints, therefore different probes monitor these memory paths:

"DMM" probes can filter per initiator and per types of events. SW tool only handles "payload" event to get BW throughput.

Master latency / L3 master probes

These probes are attached to the most critical L3 masters, so they are the only probes that can measure traffic latency. They actually have the same features as the probes above (e.g. they can also monitor payload), but they are not leveraged by the SW tool.

SW tool principles

"EMIF" probe counters

Traffic events monitored by the "EMIF" probes (target load) are accumulated into counters, with the following granularity:

There are 8 counters but only 6 of them can filter. The SW tool currently monitors Read and Write traffic separately on EMIF1 and EMIF2, using 4 counters.

Counter capture principle / MPU intrusiveness

The tool configures 4 counters to monitor Write EMIF1 / Write EMIF2 / Read EMIF1 / Read EMIF2. The counters can all monitor "all DMM" traffic, traffic from a single initiator, or MA_MPU traffic.

The main loop will:

Apart from being readable by most HW IPs, 32 kHz timestamping only costs one register read instead of a system call plus a register read.
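One practical detail when relying on a free-running timestamp register: assuming it is 32 bits wide (an assumption here, like the counters), the delta computation has to handle wraparound. A sketch:

```python
TICK_HZ = 32768  # 32 kHz timestamp clock

def ticks_to_seconds(prev_ts, curr_ts):
    """Elapsed seconds between two reads of a 32-bit 32 kHz counter,
    correct across the 2**32 wraparound (modular arithmetic)."""
    return ((curr_ts - prev_ts) & 0xFFFFFFFF) / TICK_HZ

# works across the wrap: 0xFFFFFFF0 -> 0x10 is 32 ticks, not a negative delta
```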

The SW tool reads the various registers from userspace, which is intrusive on MPU processing. Preemption can occur, therefore there is no guarantee that the timestamp and the counters are read at exactly the same time.

Choose between CCS and SW tool
