# Simulators

5SIA0

## **Processors Processing Processors**

The meta-lecture



#### **Your Friend Harm**





Unfortunately for Harm, you need to go outside to drive tractors





And the outside world is filled with dangers
























### How to help Harm?

Of course, you have many ideas on how to speed up Harm's computer.

But which ones should you apply?





- Buy (or build) all hardware options
- Use analytical models
- Simulate the design points!

Hey, I like simulators. That sounds promising :)



- Performance
- Energy
- Power (!=Energy)
- Thermal
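The "Power (!=Energy)" remark in the list above can be made concrete with a small calculation. The following is a minimal sketch; the helper name `energy_joules` and all numbers used with it are assumptions for illustration, not values from the lecture.

```c
#include <assert.h>

/* Energy (J) = average power (W) x execution time (s).
 * Hypothetical helper, for illustration only. */
double energy_joules(double avg_power_watts, double runtime_seconds) {
    assert(avg_power_watts >= 0.0 && runtime_seconds >= 0.0);
    return avg_power_watts * runtime_seconds;
}
```

For example, an assumed design A drawing 10 W for 2 s uses 20 J, while an assumed design B drawing only 4 W for 6 s uses 24 J: the lower-power design costs more energy because it runs longer.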


### What details to simulate?

- Cycle accurate vs Functionality
- Caches
- Full operating system
- Disk accesses
- Background tasks
- ...





#### Simulate at gate level:

- modelsim/questasim (Mentor)
- ncsim (Cadence)
- VCS (Synopsys)
- Icarus Verilog (Open Source!)
- ...

Example: for an Nvidia GPU with more than 1 billion transistors, even small tests take over 8 hours! [1]

#### Advantages:

- No need to build a custom simulator if you need RTL to build hardware anyway
- Highest level of precision and detail

#### Disadvantage:

Horribly slow for realistic designs






## Slightly less horribly slow: Hardware Emulation

RTL description of Target Architecture

Synthesize for FPGA (slow)



Emulate on FPGA (fast!)

Note: instrumentation is required to get detailed information out!

### Levels of detail in Simulation

- Full-System versus User-level
- Cycle Accurate versus Functional
- Execution- versus Trace-driven



To OS or not to OS?


### **User-Level**

Famous example: SimpleScalar [1]

#### Advantages

- Fast to develop and update to new architectures
- Usually 'accurate enough'

#### Disadvantages

- Any time spent in the OS is not modelled accurately. This can have a severe impact: database applications spend 20-30% of their time in OS mode.







#### Functional - no/limited model of the microarchitecture

- An (add) instruction of the target can be translated to an (add) instruction on the host, and be simulated that way.
- Example 1: SimpleScalar <u>sim-fast</u>
- Example 2: QEMU, a full-system emulator using dynamic translation

#### Cycle Accurate - includes model of the microarchitecture

- Block resources in the pipeline when an instruction executes
- Use the target's branch predictor scheme
- Out-of-order execution
- Example: SimpleScalar <u>sim-outorder</u>
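The difference between the two levels above can be sketched with a toy simulator. This is a minimal illustration under stated assumptions: the three-instruction ISA, the `cpu_t` and `instr_t` types, and the fixed latencies are all hypothetical, and the timing part is only a crude stand-in for a real pipeline model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA, for illustration only. */
typedef enum { OP_ADD, OP_MUL, OP_HALT } opcode_t;

typedef struct {
    opcode_t op;
    uint8_t rd, rs1, rs2;   /* destination and source register indices */
} instr_t;

typedef struct {
    uint32_t regs[4];
    uint64_t cycles;        /* only updated when timing is modelled */
} cpu_t;

/* Functional step: a target add simply becomes a host add. */
static void exec_functional(cpu_t *cpu, const instr_t *in) {
    switch (in->op) {
    case OP_ADD: cpu->regs[in->rd] = cpu->regs[in->rs1] + cpu->regs[in->rs2]; break;
    case OP_MUL: cpu->regs[in->rd] = cpu->regs[in->rs1] * cpu->regs[in->rs2]; break;
    case OP_HALT: break;
    }
}

/* Assumed latencies in cycles, standing in for a pipeline model. */
static uint64_t latency(opcode_t op) {
    return op == OP_MUL ? 3 : 1;
}

/* Runs until OP_HALT. With timing == 0 this is purely functional;
 * with timing != 0 it also accounts for target cycles. */
void run(cpu_t *cpu, const instr_t *prog, int timing) {
    for (size_t pc = 0; prog[pc].op != OP_HALT; pc++) {
        assert(prog[pc].rd < 4 && prog[pc].rs1 < 4 && prog[pc].rs2 < 4);
        exec_functional(cpu, &prog[pc]);
        if (timing)
            cpu->cycles += latency(prog[pc].op);
    }
}
```

Both modes produce the same register contents; only the timing mode says anything about how long the target would take, which is why it costs extra work per simulated instruction.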









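### Question

Implement the execute function in regular C

```c
#include <stdint.h>

void execute(int32_t *instructions) {
    // Declare a pointer to a function that returns void
    // and has no arguments
    void (*fp)(void);

    // Set the function pointer to the first instruction
    // (casting a data pointer to a function pointer is not strict ISO C,
    // but common compilers accept it)
    fp = (void (*)(void))instructions;

    // Call the function
    // Note: make sure the last instruction in the list returns
    fp();
}

int main(void) {
    // Example encodings; on a real host these must be valid machine
    // instructions stored in executable memory
    int32_t instructions[] = {
        0x3FE9,
        0xA701,
        0xEF02,
        0x8FF0
    };
    execute(instructions);
    return 0;
}
```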



Execution Driven: Application executes on simulator






Trace Driven: simulator uses trace as input







### **Trace-driven Simulation**

#### Advantages

- Trace collection is only required once
- Trace collection can be done with an ISA-compatible processor
- Trace simulator does not need to simulate <u>all</u> instructions; it can skip ahead in the trace if an instruction is not implemented

#### Disadvantages

- Cannot speculatively execute code (trace is fixed)
- Trace file can become huge for large applications (hundreds of GBs)
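To illustrate what "simulator uses trace as input" can look like, here is a minimal sketch of a trace-driven cache model. The cache geometry (4 direct-mapped lines of 16 bytes), the function name `count_misses`, and the idea that the trace holds only memory addresses are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LINES  4u    /* assumed cache size: 4 lines */
#define LINE_BYTES 16u   /* assumed line size: 16 bytes */

/* Replays a trace of memory addresses through a direct-mapped cache
 * model and returns the number of misses. No instruction is executed:
 * the trace alone drives the simulation. */
size_t count_misses(const uint32_t *trace, size_t n) {
    assert(trace != NULL || n == 0);
    bool     valid[NUM_LINES] = { false };
    uint32_t tag[NUM_LINES]   = { 0 };
    size_t   misses = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t block = trace[i] / LINE_BYTES;   /* which memory block  */
        uint32_t index = block % NUM_LINES;       /* which cache line    */
        uint32_t t     = block / NUM_LINES;       /* tag to compare with */
        if (!valid[index] || tag[index] != t) {
            misses++;
            valid[index] = true;
            tag[index]   = t;
        }
    }
    return misses;
}
```

Because the trace is fixed, the same file can be replayed against many cache configurations, but it cannot show accesses the traced run never made (for example along a mispredicted path).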

## Mixing Simulation Strategies

#### **Direct-execution**

- Parts execute directly on the host (e.g. using dynamic translation such as QEMU)
- Other parts are executed on a cycle-accurate simulator

#### Use case:

You are interested in memory accesses and memory behavior. Execute only loads and stores on the simulator, and emulate the rest directly on the host machine.

## Simulation in the Multiprocessor Era














A multi-threaded application running on a single core target processor.

Question: Does this make sense?





A multi-core processor running on a single-threaded simulator.

Question: Does this make sense?





A multi-threaded simulator running on a single-core host.

Question: Does this make sense?





But how to build a fast, multi-threaded simulator?



### Parallelisation in all levels of the simulation stack

Figure: multi-threaded simulator (0, 1, ..., N)






## Parallel Simulation Techniques

- Discrete event simulation
- Quantum simulation (not Schrödinger's cat 'quantum', though)
- Slack simulation




## **Space Granularity**

- The textbook implicitly assumes the smallest hardware block that can be mapped to a simulator thread is a full target core.
- This holds for almost all real-world simulators, which severely limits the parallelism.

The exception is RTL simulation, where the blocks can be smaller. The Rocketick simulator even appears to use GPUs! [1]



A logical choice for a simulator "time step" is one cycle for the fastest core.






#### Disadvantage

<u>Under-utilisation</u> of the <u>host</u> platform if threads sit idle waiting for synchronisation







## Target vs Host Cores

There is **no relation** between the number of target cores and the number of host cores!!!







Utilisation of the host depends on the variation in processing time of a cycle, but also on the number of host cores!





## Quantum Simulation

#### Synchronize threads at larger time-steps, e.g. 3 cycles



#### Advantage

Utilisation improves because the variation in processing time is amortized over longer sections of simulation

#### Disadvantage

No longer cycle accurate
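The utilisation argument can be checked with a small model. This is a sketch under assumptions: one simulator thread per target core, a barrier after every `quantum` target cycles, and a made-up table `host_time` giving the host time each thread needs per simulated cycle. The function name `utilisation` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* host_time[t * cycles + c]: assumed host time units that thread t
 * needs to simulate target cycle c.
 * Returns busy time divided by (wall-clock time x number of threads)
 * when all threads synchronise every `quantum` target cycles. */
double utilisation(const unsigned *host_time, size_t threads,
                   size_t cycles, size_t quantum) {
    assert(host_time != NULL && threads > 0 && cycles > 0 && quantum > 0);
    unsigned long busy = 0, wall = 0;

    for (size_t start = 0; start < cycles; start += quantum) {
        size_t end = start + quantum < cycles ? start + quantum : cycles;
        unsigned long slowest = 0;
        for (size_t t = 0; t < threads; t++) {
            unsigned long sum = 0;
            for (size_t c = start; c < end; c++)
                sum += host_time[t * cycles + c];
            busy += sum;
            if (sum > slowest)
                slowest = sum;
        }
        wall += slowest;   /* everyone waits at the barrier for the slowest thread */
    }
    assert(wall > 0);
    return (double)busy / ((double)wall * (double)threads);
}
```

With two threads whose per-cycle costs alternate between 1 and 3 in opposite phase, synchronising every cycle gives a utilisation of about 0.67, while a quantum of 2 cycles gives 1.0 in this model, because the fast and slow cycles average out within each quantum.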





Instead of waiting in the red areas, use slack to process ahead




#### Side-effect: Drift

The cores might be simulating different points in time, and could <u>drift apart</u>

#### Mitigation

Allow a maximum drift (or slack), and synchronize when this value is exceeded
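The mitigation can be expressed as a simple check that a core performs before simulating its next cycle. This is a minimal sketch; the function name `may_advance` and the representation of per-core simulated time as an array are assumptions, and a real multi-threaded simulator would also need proper synchronisation around the shared times.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* local_time[i]: target cycle that core i has simulated up to.
 * Returns true if core `id` may simulate one more cycle without
 * getting more than `max_slack` cycles ahead of the slowest core. */
bool may_advance(const uint64_t *local_time, size_t cores,
                 size_t id, uint64_t max_slack) {
    assert(local_time != NULL && cores > 0 && id < cores);
    uint64_t slowest = local_time[0];
    for (size_t i = 1; i < cores; i++)
        if (local_time[i] < slowest)
            slowest = local_time[i];
    /* local_time[id] >= slowest, so the subtraction cannot underflow */
    return local_time[id] + 1 - slowest <= max_slack;
}
```

With simulated times 10, 12 and 11 and a maximum slack of 2, the core at cycle 12 has to wait, while the other two may continue.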






## Slack versus Quantum simulation

In quantum simulation, the core simulation times always stay within a cycle window, which is fixed in global time.

In slack simulation, the simulation times also stay within a window, but with the key difference that this is a sliding window.

Typically much less synchronisation!
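The fixed versus sliding window can be compared by computing, for a given simulated time of the slowest core, the furthest point any core may reach. This is a sketch; both function names are hypothetical, and it assumes quantum barriers sit at multiples of the quantum size.

```c
#include <assert.h>
#include <stdint.h>

/* Quantum simulation: the window ends at the next barrier, which is
 * fixed in global time (assumed here to be a multiple of `quantum`). */
uint64_t quantum_window_end(uint64_t slowest, uint64_t quantum) {
    assert(quantum > 0);
    return (slowest / quantum + 1) * quantum;
}

/* Slack simulation: the window end moves along with the slowest core. */
uint64_t slack_window_end(uint64_t slowest, uint64_t max_slack) {
    return slowest + max_slack;
}
```

With a window size of 3, the quantum limit stays at cycle 3 while the slowest core moves through cycles 0, 1 and 2, and only jumps to 6 once every core has reached the barrier. The slack limit moves to 3, 4 and 5 over the same steps, so faster cores are stopped less often.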



From the paper Graphite: a Distributed Parallel Simulator for Multicores

"Simulation slowdown is as low as 41x versus native execution"




[1] Graphite: A Distributed Parallel Simulator for Multicores - Jason E. Miller et al.


### Question

What can we do if it still takes weeks or months to simulate a full benchmark?











## **Program Modes**

#### **MPEG-2 DECODER**



Real-world programs spend time in different modes, which can have very different characteristics.

Sample uniformly over the program, hopefully capturing the dominant modes
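Uniform sampling can be sketched as follows. This is a much simplified illustration of systematic sampling, not the SMARTS methodology itself; the function name `sampled_mean`, the per-interval CPI array and the sampling period are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>

/* cpi[i]: CPI that detailed simulation of interval i would report.
 * Only every `period`-th interval is simulated in detail; the rest
 * is skipped. Returns the mean CPI over the sampled intervals. */
double sampled_mean(const double *cpi, size_t n, size_t period) {
    assert(cpi != NULL && n > 0 && period > 0);
    double sum = 0.0;
    size_t samples = 0;
    for (size_t i = 0; i < n; i += period) {
        sum += cpi[i];
        samples++;
    }
    return sum / (double)samples;
}
```

For an assumed program with two equally long modes (CPI 1.0 and CPI 2.0), sampling every second interval gives the same mean of 1.5 as simulating everything, at half the detailed-simulation cost. If the sampling period is too large relative to a mode's length, that mode can be missed or over-represented.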










[1] SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling - Roland E. Wunderlich et al.










## Summary

- Why Simulators
  - More accurate than models
  - Cheaper than building hardware
- Simulation detail
  - Full-System vs User-level
  - Functional vs Cycle Accurate (micro-arch.) vs Gate-Level
  - Execution- vs Trace-driven
- (Fast) Multiprocessor Simulation
  - Discrete event
  - Quantum
  - Slack
- Workload Sampling
- Summary (the meta lecture)

You can read about all of this in your textbook, chapter 9.

