Intel's XScale Microarchitecture processors provide a Performance
Monitoring Unit (PMU) that reports information useful for fine-tuning
code. This text file describes the API that has been developed for use
by Linux kernel programmers. When I have some extra time on my hands,
I will extend the code to support user mode performance monitoring
(which is probably much more useful). Note that to get the most out
of the PMU, I highly recommend getting the XScale reference manual
from Intel and looking at chapter 12.

To use the PMU, you must #include <asm/xscale-pmu.h> in your source file.

Since there is only one PMU, only one user can currently use it at a
given time. To claim the PMU, call pmu_claim(), which returns an
identifier. When you are done using the PMU, call pmu_release() with
the identifier that pmu_claim() gave you.

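The claim/release rule above can be sketched as follows. This is only an illustration: the stub versions of pmu_claim() and pmu_release() are stand-ins so the sketch compiles outside the kernel, and their int types, the negative errno return values, and the measure() and noop_work() helpers are all assumptions, not the real API in <asm/xscale-pmu.h>.

```c
#include <errno.h>

/*
 * Hypothetical stand-ins for the real PMU calls. The int identifier
 * and the -EBUSY / -EPERM returns are assumptions for illustration.
 */
static int pmu_owner;		/* 0 means the PMU is free */
static int pmu_next_id = 1;

static int pmu_claim(void)
{
	if (pmu_owner)
		return -EBUSY;	/* someone else already holds the PMU */
	pmu_owner = pmu_next_id++;
	return pmu_owner;
}

static int pmu_release(int id)
{
	if (id != pmu_owner)
		return -EPERM;	/* only the current owner may release */
	pmu_owner = 0;
	return 0;
}

/* Placeholder for the code whose performance is being measured. */
static void noop_work(void)
{
}

/* Claim the PMU, run the work, and release the PMU afterwards. */
static int measure(void (*work)(void))
{
	int id = pmu_claim();

	if (id < 0)
		return id;	/* PMU busy: do not measure at all */
	work();
	return pmu_release(id);
}
```

The point of the pattern is that every successful pmu_claim() is paired with exactly one pmu_release(), so a second user is refused cleanly instead of silently reprogramming the counters.
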
In addition, the PMU can only be used on XScale based systems that
provide an external timer. Systems that the PMU is currently supported

Before delving into how to use the PMU code, let's do a quick overview
of the PMU itself. The PMU consists of three registers that can be
used for performance measurements. The first is the CCNT register, which
provides the number of clock cycles elapsed since the PMU was started.
The next two registers, PMN0 and PMN1, are each user programmable to
provide 1 of 20 different performance statistics. By combining different
statistics, you can derive complex performance metrics.

To start the PMU, just call pmu_start(pmn0, pmn1). pmn0 and pmn1 tell
the PMU what statistics to capture and can each be one of:

Instruction fetches requiring access to external memory

Instruction cache could not deliver an instruction. Either an
ICACHE miss or an instruction TLB miss.

Stall in execution due to a data dependency. This counter is
incremented each cycle in which the condition is present.

A branch instruction was executed and it may or may not have

A branch (B or BL instructions only) was mispredicted

An instruction was executed

Stall because data cache buffers are full. Incremented on every
cycle in which condition is present.

EVT_DCACHE_FULL_STALL_CONTIG
Stall because data cache buffers are full. Incremented on every
cycle in which condition is contiguous.

Data cache access (data fetch)

Data cache write back. This counter is incremented for every
1/2 line (four words) that is written back.

Software changed the PC. This is incremented only when the
software changes the PC and there is no mode change. For example,
a MOV instruction that targets the PC would increment the counter.
An SWI would not, as it triggers a mode change.

The Bus Control Unit (BCU) received a request from the core

The BCU request queue is full. A high value for this event means
that the BCU is often waiting for requests to complete on the
external bus.

The BCU queues were drained due to either a Drain Write Buffer
command or an I/O transaction for a page that was marked as
uncacheable and unbufferable.

The BCU detected an ECC error on the memory bus but no ELOG
register was available to log the errors.

The BCU detected a 1-bit error while reading from the bus.

An RMW cycle occurred due to a narrow write on ECC protected memory.

To get the results back, call pmu_stop(&results) where results is defined
as a struct pmu_results:

	u32 ccnt;	/* Clock Counter Register */
	u32 pmn0;	/* Performance Counter Register 0 */
	u32 pmn1;	/* Performance Counter Register 1 */

Pretty simple, huh? Following are some examples of how to get some commonly
wanted numbers out of the PMU data. Note that these numbers are ratios and
kernel code normally avoids floating point, so computing them inside the
kernel isn't super useful. Instead, printk the raw counter values out to
syslog and do the division afterwards. See [1] for more examples.

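If you do want a rough ratio without floating point, one option is to scale the numerator before dividing. The helper below is a minimal sketch of that idea in portable C; the name ratio_permille() and the choice of a per-mille (parts per thousand) scale are illustrative assumptions and are not part of the PMU API.

```c
#include <stdint.h>

/*
 * Return (num / den) scaled by 1000, using only integer math.
 * Returns 0 when den is 0 so callers never divide by zero.
 */
static uint64_t ratio_permille(uint32_t num, uint32_t den)
{
	if (den == 0)
		return 0;
	/* Widen to 64 bits first so num * 1000 cannot overflow. */
	return ((uint64_t)num * 1000u) / den;
}
```

For example, ratio_permille(results.pmn1, results.pmn0) would give misses per thousand accesses as a whole number. Note that inside a 32-bit kernel a plain 64-bit "/" may not link, so the division would likely need to go through a kernel helper such as do_div().
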
Instruction Cache Efficiency

	pmu_start(EVT_INSTRUCTION, EVT_ICACHE_MISS);
	<code being measured>
	pmu_stop(&results);

	icache_miss_rate = results.pmn1 / results.pmn0;
	cycles_per_instruction = results.ccnt / results.pmn0;

Data Cache Efficiency

	pmu_start(EVT_DCACHE_ACCESS, EVT_DCACHE_MISS);
	<code being measured>
	pmu_stop(&results);

	dcache_miss_rate = results.pmn1 / results.pmn0;

Instruction Fetch Latency

	pmu_start(EVT_ICACHE_NO_DELIVER, EVT_ICACHE_MISS);
	<code being measured>
	pmu_stop(&results);

	average_stall_waiting_for_instruction_fetch =
		results.pmn0 / results.pmn1;

	percent_stall_cycles_due_to_instruction_fetch =
		results.pmn0 / results.ccnt;

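Once the raw counters have been printed to syslog, the formulas above can be evaluated in user space where floating point is available. The following is a rough sketch of that post-processing step for the instruction cache example; struct pmu_sample, safe_div() and the two metric functions are hypothetical names invented for this illustration, and the struct only mirrors the three counters described earlier.

```c
#include <stdint.h>

/* User-space copy of the three counters (assumed layout). */
struct pmu_sample {
	uint32_t ccnt;	/* clock cycles */
	uint32_t pmn0;	/* first programmed event */
	uint32_t pmn1;	/* second programmed event */
};

/* Divide as doubles; return 0.0 instead of dividing by zero. */
static double safe_div(uint32_t num, uint32_t den)
{
	return den ? (double)num / (double)den : 0.0;
}

/* Valid for a run started with pmu_start(EVT_INSTRUCTION, EVT_ICACHE_MISS). */
static double icache_miss_rate(const struct pmu_sample *s)
{
	return safe_div(s->pmn1, s->pmn0);
}

/* Same run: clock cycles divided by instructions executed. */
static double cycles_per_instruction(const struct pmu_sample *s)
{
	return safe_div(s->ccnt, s->pmn0);
}
```

The guard against a zero denominator matters in practice, since a short measurement window can easily record no events at all for one of the counters.
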
- Add support for usermode PMU usage. This might require hooking into
  the scheduler so that we pause the PMU when the task that requested
  statistics is scheduled out.

This code is still under development, so please feel free to send patches,
questions, comments, etc. to me.

Deepak Saxena <dsaxena@mvista.com>