A CSC714 Project Report on

 

Static Timing Analysis of IBM PowerPC 750CX

By

Amol Kamble: - akamble@ncsu.edu

Gopi Rao:   - gprao@ncsu.edu

 

 

Project URL

http://www4.ncsu.edu/unity/users/g/gprao/www/Project.htm

 

 

 

 

 

 

 

 

 

 

 

 

INTRODUCTION:

 

        For any Real Time embedded system to meet its deadline, the Worst-Case execution time (WCET) of each task in the system should be known. The process of statically determining the WCET of a task is called Timing Analysis. The scheduling decision of a task depends on its WCET and the total time in the schedule. For real-life programs, finding the program run that leads to WCET is impossible to achieve. What is achievable is computing an upper bound of the WCET, i.e., a time greater than real but incomputable WCET. This project is aimed at computing the WCET for IBM PowerPC 750 CX, which is used in the Airbus A380. Hence computation of WCET of IBM PowerPC 750CX helps in making effective scheduling decisions in such safety critical real time systems, dynamically.

 

 

ARCHITECTURE OVERVIEW – IBM POWERPC 750CX

 

        IBM PowerPC 750CX is an implementation of the PowerPC microprocessor family of Reduced Instruction Set Computer (RISC) processors. Being a superscalar processor, it can fetch, dispatch, execute and retire multiple instructions every cycle.

        The 750CX processes the instructions in 4 stages namely,

a)      Fetch – 4 instructions per cycle

b)      Decode/Dispatch – 3 instructions per cycle (one branch)

c)      Execute – Multiple instructions per cycle

d)      Write Back – 2 instructions per cycle.

 

The execute stage has different execution units which execute instructions in parallel. They are:

a)      Integer Unit – 2

b)      Floating Point Unit (3-stage pipelined)

c)      Branch Processing Unit – Both static and dynamic prediction

d)      Load Store Unit

e)      System register Unit

 

    The cache parameters of IBM PowerPC 750CX are as below:

    Cache Size: 32 KB    Associativity:  8    Cache Block – 32 bytes

 

 

 

 

 

 

STATIC TIMING ANALYSIS:

 

       As discussed above, violation of timing constraints for a Real Time system can cause irreparable damage especially if the system is a life-critical system (Avionic Embedded Real Time systems). Therefore, schedulability of tasks which determine if all the tasks can meet their deadline is performed offline. This requires the task parameters, especially the WCET, to be known a priori. The process of determining the WCET of a task is called Timing Analysis.

        There are two types of Timing Analysis. They are:

a)      Dynamic Timing Analysis

b)      Static Timing Analysis

 

If Dynamic timing analysis methods are used based on experimental approaches, it is not guaranteed that WCET bounds would be safe. Also,

modern contemporary processors exhibit complex architectural features like superscalar pipelining and caching, which can be unpredictable and therefore can hinder the system from portraying the worst case behavior.

 

The other alternative is to use Static timing analysis which has proven to be effective in computing the safe WCET upper bounds. The static timing analysis models traverse all the possible execution paths in the program and determine the timing independent of program variables. The various architectural aspects like data hazards and control hazards are taken care of. The framework used to compute the WCET bounds is as below:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


 

      The various blocks of the framework are explained in the following sections.

 

 

PCOMPILER:

 

       The Pcompiler (Pseudo-Compiler) extracts the control-flow information from the input (by forming Basic Blocks). The high-level-language code (‘C’ code in our case) is assembled using a gcc-ppc cross compiler. The assembled code is fed as input to the Pcompiler to generate the required control-flow and cache references information in the path files. It also outputs an inf file containing information about instructions, required by the Timing Analyzer. The aforementioned basic block formation is done as follows:

a)      The assembly code is parsed until a label or branch is reached.

b)       The basic block ends at the label/branch.

c)       The label name/branch target marks the beginning of a new basic block.

d)      The step a) is repeated until the end of the file or until all the instructions have been put into basic blocks.

 

 

STATICSIM (CACHE SIMULATOR):

 

      The usage of caches gives a huge performance boost to any system. But the behavior of a cache is not predictable especially in the case of data caches. Since instruction caches have spatial locality, they are more predictable. Therefore there is a trade-off between the performance and predictability when it comes to Real Time systems. The aforementioned framework models only an instruction cache.

 

      The Static Cache Simulator models an instruction cache. The parameters of the cache for a particular architecture, like Cache size, associativity, miss penalty etc are given as input to the cache simulator along with the path file. The simulator constructs a control flow graph and analyses it for a given cache configuration and determines the instructions that can be in the cache at entry and exit of each basic block.  This is used to categorize the caching behavior of each instruction.

 

 

 

TIMING ANALYZER:

 

     The Timing Analyzer (TA) uses the instruction caching information along with the control-flow information provided by the Pcompiler to estimate the WCET bound of a program. The TA takes into account the different hazards like the data hazards and the control hazards for each execution path. It then performs a path analysis wherein it traverses all the possible execution paths and chooses the longest execution path. A timing tree is constructed with each loop/function representing a node in the tree.

      The timing tree is processed in a bottom-up manner until the root of the tree is reached. In other words, WCET of the inner functions/loops are computed first before the outer loop/caller function WCET computation.

 

The main functions implemented to find the WCET are:

 

a)      Create_Timing_Path and Build_Timing_Tree: These functions create the TA tree and fill in the information given in the inf file. 

 

b)      Batch_Time: This function calls the Time_Best_Case and the Time_Worst_Case timing functions recursively to determine the best and the worst case timing bounds. This is done until all nodes are timed.

 

c)      Time_Worst_Case: This function scans over the timing tree and times all the paths for the worst case execution cycles.

 

d)      Time_Best_Case: This function scans over the timing tree and times all the paths for the best case execution cycles.

 

e)      Time_Path: The Time_Path function is the most important function in the timing analyzer. The function goes through all the instructions in the path specified and returns the number of cycles that will be required for the path, to the calling functions.
 
 
 
TESTS AND RESULTS:
 
      Simple programs were tested against the TA so that the correctness of the WCEC (Worst Case Execution Cycles) generated could be verified by hand-computing the WCEC. Complex programs involving loops would have made guaranteeing the correctness a lot more difficult and time consuming. Given the tenure of the current project, it is beyond the scope of this project and hence is a part of the Future Work.
 
The test programs used are as below:
a)      Add program
b)      Conditional Add program (only if-then, no else)
 
The following table summarizes the results:
 
 
 
 
 
 
 
 
 
 
Input files
Gcc-PPC 
Assembly 
Files
Pcompiler
Staticsim
(Uses Path file)
WCET/WCEC
     by TA
(Uses inf file and ist file) 
  Manually computed WCET/WCEC
Add.c
Add.s
Add.path
Add.inf
Add.ist
28 cycles
28 cycles
Condition.c
Condition.s
Condition.path
Condition.inf
Condition.ist
34 cycles
34 cycles
 
 
The detailed hand verification is as below:
 
Condition.c    // simple addition of 2 variables in a ‘if’
 
 #include <stdio.h>
 void main ()
 {
     int x, y, z;
     
     x = 0;
     y = 33;
    
 
 
     if(x >0)
     {
         x = x + y;
     }
}
 
The corresponding Condition.s generated is as follows:
 
stwu, 1 -32(1)
stw 31, 28(1)
mr 31, 1
--------------------------------------------------------program starts
li 0,0                      // value of x
stw 0, 8(31)           // stored in memory
li 0, 33                   // value of y
stw 0, 12(31)         // stored in memory
lwz 0, 8(31)           // load x to compare
cmpwi 7,0,0           
ble 7, .L1               // Branch to L1 if x < 0                         
lwz 0, 8(31)           // Load x
lwz 9, 12 (31)        // Load y   
add 0,0,9               
stw 0, 8(31)           // Store the result back in memory. 
------------------------------------------------------program ends
.L1
   lwz 11, 0(11)
   lwz 31, -4(11)
   mr 1, 11
   blr
   
The worst case execution of the above program would be when the ‘if’ condition is true. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
The various stages of the pipeline is shown below.
 
                                     0   1    2    3    4
stwu 1, -32(1)              IF ID EX EX WB
 
                                                                     5  6

stw 31,28(1)      IF  ID  st  EX EX WB

 

                                                                                    7

mr 31,1                IF st ID EX WB

 

                                       8 9

li 0,0                       IF ID EX EX WB

 

                                            10 11

stw 0,8(31)                     IF st ID EX EX WB 

 

                                                  12 13 

li 0,33                               IF st ID EX EX WB

 

                                                        14 15

stw 0,12(31)                                IF st ID EX EX WB                 

 

                                                              16 17   

lwz 0,8(31)                                       IF ID st EX EX WB

 

                                                                    18

cmpwi 7,0,0                                          IF st st ID EX WB  

                                                                       19

ble 7,.L1                                                     IF ID EX WB

                                                                       19 20 21

lwz 0,8(31)                                                      IF ID EX EX WB

 

                   19 20 21

             IF ID EX EX WB 22 23

lwz 9,12(31)    IF st ID EX EX WB                                       

 

                               23 24

add 0,0,9             IF st ID EX WB

 

                                  24 25 26

stw 0,8(31)                 IF ID EX EX WB            

 

                                           27 28

lwz 11,0(1)                    IF st ID EX EX WB

 

                                                    30 31

lwz 31,-4(11)                        IF st st ID EX EX WB

 

                                                          32  

mr 1,11                                       IF st ID EX WB    

                                                                                              33

blr                                                 IF ID EX WB    

 

The execution started at the 0th cycle and ended at the 31st cycle. Therefore it took 32 cycles to run.


CONSTRAINTS / OPEN PROBLEMS:

 

a)      The Pcompiler does not support branches involving function pointers. The   

      function pointers (used in Call/Return) are stored in the Link Registers and Count   

      Register.

 

b)      The IBM PowerPC 750CX is a superscalar processor. Currently the Timing Analyzer models single scalar processing. The superscalar processing is a part of the enhancement to the current TA.

 

 

 

FUTURE WORK:

  

a)      The staticsim currently models only an instruction cache. It would be interesting to include a data cache and consider its affects on the WCET.

        

b)      The current scope of the project allowed the correctness of the TA for simple programs. Testing against standard benchmark programs would definitely be desirable in the future.

 

 

INDIVIDUAL CONTRIBUTIONS:

 

   Amol Kamble: 

a)      Study of PowerPC Architecture

b)      Staticsim implementation

c)      Study of time.cpp

d)      isa.cpp, instdef.h - parts

    

   Gopi Rao:

   a) Study of PowerPC Architecture

   b) Pcompiler implementation

   c) Study of time.cpp

   d) Worst cases and Best cases construction.

   e) isa.cpp - parts