Stream Abstraction for GPU

Work Log

  • Week 1 (10/11/2009 ~ 10/18/2009) Brainstorming
  • Week 2 (10/19/2009 ~ 10/25/2009) Goal: Make RandomSourceFilter<float> | DelayFilter<float, float> work
    • Refactoring the filter class hierarchy (10/21): 
      • Abstract base class: FilterBase
      • Filter<inType, outType> derives from FilterBase
      • Other filters derive from Filter<>. Examples: RandomSourceFilter<outType>, FileSourceFilter<outType>, SocketSourceFilter<outType>, FileSinkFilter<inType>, SplitFilter<inType, outType, fanOutSize>, JoinFilter<inType, outType, fanInSize> ...
      • TODO: Will we support multiple input/output streams and types? If so, the class hierarchy may need refactoring again. High risk.
    • Do we need a global system object that holds all topology information? Currently a StreamSystem object knows all filters and their relationships.
    • Designing data channel interface (10/22):
      • Base Interface includes:
        • int ChannelBase::reserve(void **, int bSize): used instead of push() to avoid an extra memory copy
        • void ChannelBase::reserve_done()
        • int ChannelBase::pop(void **buffer, int peekBSize, int popBSize, int parallel, bool consumeFlag = true)
        • Two kinds of concrete channels<>: an inter-process channel (for the multi-node case) and an intra-process channel.
    • Implementing the execution framework (10/23)
      • Each filter currently runs as its own CPU thread
      • A filter runs its kernel once enough input is ready to meet its parallelism requirement
      • Channel buffer management? Currently using a producer/consumer style.
    • Debugging channel operations (reserve, pop) (10/24, 10/25): working
  • Week 3 (10/26/2009 ~ 11/1/2009) Goal: get a simple example working; integrate CUDA (deferred)
    • Adding TermOutFilter causes a crash. Debugging. (10/26)
    • Lessons learned: (a) add the "volatile" keyword to class members that are accessed from multiple threads. (b) g++ on Mac seems to have bugs handling volatile variables; not sure whether the cause is the compiler or pthread. (10/27)
    • Working on a way to terminate the program gracefully by propagating flush signals from source filters to sink filters. (10/28)
    • RandomSourceFilter<float> | DelayFilter<float, float> | outputFilter<float> works. (10/29)
    • A typelist is a potential way to represent filters with multiple input and output ports. Filter is now defined as Filter<inTypeList, outTypeList>, where each *TypeList can hold an arbitrary number of types. (See Loki::Typelist and "Modern C++ Design: Generic Programming and Design Patterns Applied".) Refactoring code... (10/30)
    • Got the FIR filter working. Need to change the channel buffer management algorithm: copy the tail buffer to the front of the head when the tail is large enough.
  • Week 4 (11/2/2009 ~ 11/8/2009) Goal: more test cases; multi-node streaming
    • Debugging (11/2/2009)
    • Adding VecAddFilter, more bug fixes (multi-input) (11/3/2009)
    • Started extending to the multi-node case. (11/4/2009)
    • Design the inter-connection channel class
  • Week 5 (11/9/2009 ~ 11/15/2009) Goal: Design multi-node case
    • Debugging multi-node case (11/14~11/17)
  • Week 6 (11/17/2009 ~ 11/23/2009) Goal: Implement multi-node case, add CUDA support
    • Simple multi-node case working: identity filter (11/17)
    • Refactor Makefile: add nvcc (11/17)
    • Add a GPU object per process. It encapsulates all CUDA calls and provides DMA functions. (11/18)
    • Revise Channel classes to handle different end-node configurations (GPU/CPU/Network -> GPU/CPU/Network, 6 combinations) (11/19)
    • Add GPU kernel calls (11/20): several filter examples now working
  • Week 7 (11/24/09 ~ 11/30/09): Thanksgiving
  • Week 8 (11/30/09 ~ 12/06/09) Goal: IS benchmark on GPU
    • Add random double generator filter (11/30/09)
    • Rewrite Reduce filter (12/1/09)
    • Add bucket sort filter (12/2/09)
    • Bucket sort CPU version working (12/3/09)
    • Add GPU part in bucket sort (12/4/09)
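
Sketch of the Week 2 channel interface

The reserve / reserve_done / pop operations listed under Week 2 can be illustrated with a minimal single-threaded C++ sketch. Everything beyond the three method names and their parameter lists is an assumption made for illustration: the class name SketchChannel, the return conventions (bytes granted or exposed, 0 on failure), the compaction step, and the reading of "parallel" as the number of kernel instances served by one pop() call. The log does not specify any of these, and the real channels are shared between filter threads, which this sketch ignores.

```cpp
#include <cstring>
#include <vector>

// Hypothetical channel for illustration only; not the project's ChannelBase.
class SketchChannel {
public:
    explicit SketchChannel(int capacity)
        : buf_(capacity), head_(0), tail_(0), pending_(0) {}

    // Give the producer a pointer to bSize writable bytes inside the channel,
    // so it can write in place rather than copying through a push() call.
    // Assumed convention: returns bSize on success, 0 if space is lacking.
    int reserve(void **out, int bSize) {
        if (pending_ != 0 || bSize <= 0) return 0;  // one reservation at a time
        if (tail_ + bSize > capacity()) compact();  // try to reclaim consumed space
        if (tail_ + bSize > capacity()) return 0;
        *out = buf_.data() + tail_;
        pending_ = bSize;
        return bSize;
    }

    // Commit the bytes written since the last successful reserve().
    void reserve_done() {
        tail_ += pending_;
        pending_ = 0;
    }

    // Expose a readable window without copying. Assumed meaning: each of the
    // `parallel` kernel instances peeks peekBSize bytes and advances by
    // popBSize bytes, so consecutive windows may overlap.
    // Assumed convention: returns the bytes exposed, 0 if not enough data.
    int pop(void **out, int peekBSize, int popBSize, int parallel,
            bool consumeFlag = true) {
        if (parallel < 1 || popBSize < 0 || popBSize > peekBSize) return 0;
        const int needed = peekBSize + (parallel - 1) * popBSize;
        if (needed > tail_ - head_) return 0;
        *out = buf_.data() + head_;
        if (consumeFlag) head_ += parallel * popBSize;
        return needed;
    }

private:
    int capacity() const { return static_cast<int>(buf_.size()); }

    // Move unread bytes to the front of the buffer. Pointers handed out by
    // earlier pop() calls become invalid after this.
    void compact() {
        const int live = tail_ - head_;
        if (head_ > 0 && live > 0)
            std::memmove(buf_.data(), buf_.data() + head_, live);
        head_ = 0;
        tail_ = live;
    }

    std::vector<char> buf_;
    int head_;     // offset of the first unread byte
    int tail_;     // offset one past the last committed byte
    int pending_;  // bytes reserved but not yet committed
};

// Usage sketch: the producer writes four floats in place, then the consumer
// reads overlapping two-float windows (as an FIR-style filter might),
// advancing one float per call. Returns the number of windows delivered.
int count_sliding_windows() {
    const int kFloat = static_cast<int>(sizeof(float));
    SketchChannel ch(64);

    void *w = nullptr;
    if (ch.reserve(&w, 4 * kFloat) == 0) return -1;
    float *dst = static_cast<float *>(w);
    for (int i = 0; i < 4; ++i) dst[i] = static_cast<float>(i + 1);
    ch.reserve_done();

    int windows = 0;
    void *r = nullptr;
    while (ch.pop(&r, 2 * kFloat, kFloat, 1) > 0) ++windows;
    return windows;
}
```

With four floats committed and a two-float window that advances by one float, the consumer sees three windows before pop() reports that too little data remains. Passing consumeFlag = false would expose the same window again on the next call, which is one way a peek-only read could be expressed under these assumptions.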