scarab
Dynamic Branch Prediction
This project implements dynamic branch prediction with perceptrons to improve branch prediction accuracy. It is built on the scarab simulator, which is maintained by UC Santa Cruz, and is implemented in C/C++.
My group implemented the perceptron branch predictor from the following paper: Daniel Jimenez, Calvin Lin: Dynamic Branch Prediction with Perceptrons. A perceptron is a structure that takes multiple inputs and combines them into a single output value. The paper uses the bits of the global history register as inputs and computes a dot product with the perceptron's weights; the sign of the result (negative or non-negative) indicates whether the branch is predicted NOT TAKEN or TAKEN.
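Concretely, in the paper's formulation the history bits are mapped to x_i ∈ {−1, +1}, and the perceptron output is:

```latex
y = w_0 + \sum_{i=1}^{n} x_i w_i, \qquad \text{predict TAKEN iff } y \ge 0
```

where w_0 is the bias weight and n is the history length.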
Implementation
Git Repository: https://github.com/Aaron-Vuong/scarab/tree/ash/predictor
Steps Implemented
- Hash the branch address to get a perceptron index
```cpp
#define BASE_PERCEPTRON_HASH(addr) (addr % BASE_PERCEPTRON_ENTRIES)

uns32 index = BASE_PERCEPTRON_HASH(addr);
```
- The indexed perceptron is fetched
```cpp
// Take a (non-const) reference so the training step can update the weights
// in place.
auto& perceptron_state = perceptron_hist.at(proc_id).perceptron_state_all_cores[index];
```
- Compute the perceptron output 𝑦 (the dot product of the weights with the ±1 history bits)
```cpp
for (int i = 0; i < BASE_HIST_LENGTH; ++i) {
  // The bias weight (i == 0) contributes directly, with no history bit.
  // Computing history_bit after this check also avoids the out-of-range
  // shift (hist >> 32) that i == 0 would otherwise produce.
  if (i == 0) {
    prediction_score += perceptron_state.weights[i];
    continue;
  }
  // Map each global-history bit to +1 (taken) or -1 (not taken).
  int history_bit = ((hist >> (32 - i)) & 0x1) ? 1 : -1;
  prediction_score += history_bit * perceptron_state.weights[i];
}
```
- Send the prediction of the branch.
```cpp
// prediction_score is an integer, so compare against 0 rather than 0.0f.
uns8 pred = (prediction_score >= 0) ? 1 : 0;
return pred;
```
- Train the perceptron after the branch outcome is known, where t is +1 for TAKEN and -1 for NOT TAKEN. Following the paper, weights are updated when the prediction's sign disagrees with the outcome, or when the output magnitude has not yet reached the training threshold.
```cpp
if (((base_perceptron_output >= 0) != (t >= 0)) ||
    (abs(base_perceptron_output) <= PERCEPTRON_THRESHOLD)) {
  for (int i = 0; i < BASE_HIST_LENGTH; ++i) {
    // The bias weight moves directly with the outcome.
    if (i == 0) {
      perceptron_state.weights[i] += t;
      continue;
    }
    int history_bit = ((hist >> (32 - i)) & 0x1) ? 1 : -1;
    perceptron_state.weights[i] += history_bit * t;
  }
}
```
- The perceptron is written back into the table. Since we hold a reference to the table entry, the updated weights are already in place.
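The predict-and-train loop above can be sketched as a self-contained example. This is illustrative code, not scarab's actual implementation; the names are made up, but the update rule follows the paper:

```cpp
#include <cstdlib>
#include <vector>

// Minimal sketch of one perceptron from the paper: x_i in {-1, +1} are
// global-history bits, weights[0] is the bias weight.
struct Perceptron {
  std::vector<int> weights;  // weights[0] is the bias

  explicit Perceptron(int hist_len) : weights(hist_len + 1, 0) {}

  // y = w_0 + sum_i x_i * w_i; predict TAKEN when y >= 0.
  int output(const std::vector<int>& x) const {
    int y = weights[0];
    for (size_t i = 0; i < x.size(); ++i) y += x[i] * weights[i + 1];
    return y;
  }

  // Train on outcome t in {-1, +1} when the sign of y disagrees with t,
  // or while |y| is still below the training threshold theta.
  void train(const std::vector<int>& x, int t, int theta) {
    int y = output(x);
    if ((y >= 0) != (t >= 0) || std::abs(y) <= theta) {
      weights[0] += t;
      for (size_t i = 0; i < x.size(); ++i) weights[i + 1] += t * x[i];
    }
  }
};
```

Once |y| exceeds theta and predictions agree with outcomes, training stops, which keeps the weights from growing without bound.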
Evaluation & Results
We used the following configurations to gather data about our final implementation, where we abbreviated each parameter as follows.
- history_length to hl
- num_perceptrons_in_table to bpr
Experiments Run:
- Optimal History Length
--base_hist_length [1-32] --base_perceptron_rows 2048
- Optimal # of Perceptrons
--base_hist_length 8 --base_perceptron_rows [1-32768]
- Branch Predictor Comparison
--bp_mech perceptron / --bp_mech gshare / --bp_mech tage64k
There were two primary experiments that we ran:
- Adjust the history length (in bits) to its most effective value.
- Adjust the number of perceptrons in the perceptron table to its most effective value, at that optimal history length.
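The two sweeps could be scripted roughly as below. The scarab binary path and exact flag spellings are assumptions based on the parameter names in this report; this is a dry run that only echoes each command, so substitute your real invocation:

```shell
#!/bin/sh
# Dry-run sketch of the two parameter sweeps (binary path is an assumption).
SCARAB="./scarab"

# Sweep 1: history length at a fixed 2048-row table.
for hl in 1 2 4 8 16 32; do
  echo "$SCARAB --bp_mech perceptron --base_hist_length $hl --base_perceptron_rows 2048"
done

# Sweep 2: number of perceptron rows at the optimal 8-bit history length.
for rows in 1 8 64 512 2048 8192 32768; do
  echo "$SCARAB --bp_mech perceptron --base_hist_length 8 --base_perceptron_rows $rows"
done
```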
1. Finding the optimal history length
From the graph, we can see that with 2048 perceptrons, the optimal history length is ~8 bits of the most recent history. This is in line with the paper itself, which found history lengths of around 7 to 9 bits (plus 1 bias weight) to work best, with 8 + 1 being the most effective.
We actually see a decrease in overall IPC and an increase in branch mispredictions when moving from 8 bits to 16/32 bits:
- AVG IPC: 1.58478 (8 bits) -> 1.58217 (32 bits)
- AVG Mispredict %: 5.46334% (8 bits) -> 5.6417% (32 bits)

We think that if we look at too many bits, since we are referring only to the global history, old information can mislead the perceptron when it makes predictions about the current branch.
2. Finding the optimal number of perceptrons
Now that we have our optimal value of 8 bits, we can also look at how many perceptrons to keep in the table. We collected another sweep of values, going from 1 perceptron to 32768 perceptrons. The maximum of 32768 perceptrons performs best, reaching 1.59391 IPC averaged over all benchmarks. We also see an asymptotic curve as we approach higher perceptron counts, with the jump from 1 to 8 perceptrons having the largest effect on IPC. Increasing the number of perceptrons addresses the main problem of aliasing and interference between different branches: with more rows, each branch can get its own predictor and weights that accurately match its pattern. However, once the number of perceptrons approaches the number of unique branch addresses in the program, all interference has been removed and there is no more performance to gain, which we can observe in our graph.
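The aliasing effect can be illustrated with the simple modulo hash from the implementation section. This is an illustrative sketch (not scarab code, and the addresses are made up): with a small table, distinct branch PCs collide into the same row and share weights; growing the table gives more branches a private row, until every unique PC has one and the returns diminish.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Count how many distinct perceptron rows a set of branch addresses
// actually occupies under an (addr % rows) hash, as in BASE_PERCEPTRON_HASH.
int unique_rows(const std::vector<uint64_t>& addrs, uint64_t rows) {
  std::set<uint64_t> seen;  // distinct table indices used
  for (uint64_t a : addrs)
    seen.insert(a % rows);
  return (int)seen.size();
}
```

With e.g. six word-aligned branch PCs, a 4-row table maps them all to one row (total interference), while a table larger than the address spread gives every branch its own row, after which adding rows cannot help.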
3. Comparison to other branch predictors
In our final experiment/graph, we ran the other branch predictors with the default PARAMS.sunny_cove file. While we didn't beat the other predictors, we were at least able to approach the efficiency of gshare and tage64k, with a 5.3866% average mispredict ratio versus gshare's 3.14337%. One thing that could explain the difference from the paper's perceptron results is the benchmarks themselves: perceptrons can't learn non-linearly-separable functions with 100% accuracy, while gshare can (given enough time). If the benchmarks changed between SPEC2000 and SPEC2017 to contain more of these functions, that could explain the large misprediction spikes on certain benchmarks for the perceptron predictor.
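The classic example of a non-linearly-separable function is XOR. The sketch below (illustrative, not scarab code) brute-forces small integer weights to show that no single perceptron sign(w0 + w1·x1 + w2·x2) can reproduce XOR over ±1 inputs, while a function like AND is separable. The general impossibility also follows from summing the four sign constraints, which forces both w0 ≥ 0 and w0 < 0:

```cpp
// Check whether any integer weights in [-range, range] make
// sign(w0 + w1*x1 + w2*x2) match the target outcomes t[] over the
// four +/-1 input combinations (ties, y == 0, predict +1 / TAKEN).
bool separable(const int t[4], int range) {
  const int x1[4] = {-1, -1, 1, 1};
  const int x2[4] = {-1, 1, -1, 1};
  for (int w0 = -range; w0 <= range; ++w0)
    for (int w1 = -range; w1 <= range; ++w1)
      for (int w2 = -range; w2 <= range; ++w2) {
        bool ok = true;
        for (int k = 0; k < 4; ++k) {
          int y = w0 + w1 * x1[k] + w2 * x2[k];
          if ((y >= 0 ? 1 : -1) != t[k]) { ok = false; break; }
        }
        if (ok) return true;
      }
  return false;
}
```

A branch whose outcome behaves like XOR of two history bits would therefore cap a single perceptron's accuracy, while a table-based predictor like gshare can eventually memorize the pattern per history.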