Fetch and Decode Unit

 

[Figure 7: front end of a Bulldozer module]  [Figure 8: branch prediction diagram]

The figure on the left shows the front end of a Bulldozer module. The figure on the right shows the branch prediction logic in more detail.

Compared with a classical architecture, the branch prediction unit is decoupled from the actual fetch unit, so branch predictions can be made in parallel with, and independently of, instruction fetch. Prefetching is driven by the branch predictor, so that the required instructions are brought into the cache as early as possible. The instruction cache is 64 KB, two-way set associative, and instructions are fetched 32 bytes at a time. The instruction TLB has two levels: a fully associative first level with 72 entries, shared among all page sizes, and a 4-way set-associative second level with 512 entries that holds only 4 KB pages. The decode unit can also perform branch fusion. There are four decoders, which can generate four macro-instructions per clock, alternating between the two threads.
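To make the two-level instruction TLB organization concrete, here is a minimal Python sketch of a lookup. Only the sizes (72 fully associative entries for all page sizes, 512 entries 4-way for 4 KB pages) come from the text. The class name `TwoLevelITLB`, the LRU replacement, the indexing of the second level by the low bits of the virtual page number, the fill policy, and the fact that the caller passes the page size are all assumptions made for illustration, not a description of the real hardware.

```python
from collections import OrderedDict

PAGE_4K = 4096
L1_ENTRIES = 72                   # from the text: fully associative, all page sizes
L2_ENTRIES, L2_WAYS = 512, 4      # from the text: 4-way, 4 KB pages only
L2_SETS = L2_ENTRIES // L2_WAYS   # 128 sets (derived)


class TwoLevelITLB:
    """Toy model. LRU replacement and fill policy are assumptions."""

    def __init__(self):
        self.l1 = OrderedDict()                             # (vpn, page_size) -> pfn
        self.l2 = [OrderedDict() for _ in range(L2_SETS)]   # per set: vpn -> pfn

    def lookup(self, vaddr, page_size=PAGE_4K):
        vpn = vaddr // page_size
        key = (vpn, page_size)
        if key in self.l1:
            self.l1.move_to_end(key)            # mark as most recently used
            return "L1 hit", self.l1[key]
        if page_size == PAGE_4K:                # second level holds 4 KB pages only
            ways = self.l2[vpn % L2_SETS]
            if vpn in ways:
                ways.move_to_end(vpn)
                self._fill_l1(key, ways[vpn])   # promote into the first level
                return "L2 hit", ways[vpn]
        return "miss", None

    def fill(self, vaddr, pfn, page_size=PAGE_4K):
        """Install a translation after a page walk (hypothetical policy)."""
        vpn = vaddr // page_size
        self._fill_l1((vpn, page_size), pfn)
        if page_size == PAGE_4K:
            ways = self.l2[vpn % L2_SETS]
            ways[vpn] = pfn
            ways.move_to_end(vpn)
            if len(ways) > L2_WAYS:
                ways.popitem(last=False)        # evict least recently used way

    def _fill_l1(self, key, pfn):
        self.l1[key] = pfn
        self.l1.move_to_end(key)
        if len(self.l1) > L1_ENTRIES:
            self.l1.popitem(last=False)         # evict least recently used entry
```

In this sketch a large page can only ever hit in the first level, which mirrors the statement that the second level holds 4 KB pages only.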

The branch predictor is duplicated and works independently for the two threads. Whenever a prefetch misses in the L1 cache, the requests are sent in parallel to the L2 cache and to memory. Besides being split by thread, the predictor is also split into two levels. The first level is faster but less accurate, and relies in part on a small L1 BTB of 512 entries (Branch Target Buffer: a cache that stores the IP, or instruction pointer, addresses of jumps taken in the past, which the predictor returns as its prediction when the jump is predicted as taken). Its prediction is placed in the queue promptly and is started in parallel with the second-level prediction, which uses, among other things, a much larger L2 BTB (5120 entries). When the second-level prediction is ready, if the first-level result has not yet been used, it is overwritten by the second-level result, which is presumably more accurate. This combines the advantages of a fast predictor with those of an accurate one. Procedure returns are predicted by a separate unit.
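The override behaviour described above can be sketched in a few lines of Python. This is only an illustrative model: the BTB sizes come from the text, while the names `Prediction` and `ThreadPredictor`, the dictionary-based BTBs, the oldest-first eviction, and the rule that an L2 BTB miss leaves the first-level result untouched are assumptions. The direction predictor itself is not modelled; a missing BTB entry simply stands for "no taken target available".

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

L1_BTB_ENTRIES = 512     # from the text
L2_BTB_ENTRIES = 5120    # from the text


@dataclass
class Prediction:
    branch_ip: int
    target: Optional[int]    # None: no taken target predicted
    source: str              # "L1" or "L2"
    consumed: bool = False   # set once the fetch unit has used it


class ThreadPredictor:
    """One instance per thread, since the predictor is split by thread."""

    def __init__(self):
        self.l1_btb = {}         # branch IP -> target (small, fast)
        self.l2_btb = {}         # branch IP -> target (large, slower)
        self.queue = deque()     # predictions waiting for the fetch unit

    def train(self, ip, target):
        """Record a taken jump in both BTBs (hypothetical update policy)."""
        for btb, cap in ((self.l1_btb, L1_BTB_ENTRIES),
                         (self.l2_btb, L2_BTB_ENTRIES)):
            btb.pop(ip, None)
            btb[ip] = target
            if len(btb) > cap:
                del btb[next(iter(btb))]   # evict the oldest entry

    def predict_fast(self, ip):
        """First level: answer immediately and queue the result."""
        pred = Prediction(ip, self.l1_btb.get(ip), "L1")
        self.queue.append(pred)
        return pred

    def refine(self, pred):
        """Second level arrives later; overwrite only if not yet used."""
        if pred.consumed:
            return False
        l2_target = self.l2_btb.get(pred.branch_ip, pred.target)
        if l2_target == pred.target:
            return False
        pred.target, pred.source = l2_target, "L2"
        return True

    def consume(self):
        """The fetch unit takes the oldest queued prediction."""
        pred = self.queue.popleft()
        pred.consumed = True
        return pred
```

A prediction that is still waiting in the queue when `refine` runs gets the second-level target; one that `consume` has already handed to the fetch unit keeps the first-level result, which is the trade-off between speed and accuracy that the two-level design aims for.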

