Problem 3 Bag of Tasks
A bag of tasks is a problem where you are given a set of tasks to
perform of varying sizes and degrees of complexity. Your task is to
implement a bag of tasks manager using pthreads. We will ignore
the varying degrees of complexity for this assignment and will only
focus on one operation, count the number of occurrences of each word
in series of input files. Your program will need to accept, as input,
a file with a list of files to be processed. Read input from stdin only. You can assume the list will contain many entries, but will always fit in to the host memory. You can also assume there will be only one input file per line. You will need to split the text in these input files into tokens using a white space( ) as a delimiter and you can assume words will not spread over multiple lines. Words here will be at most 80 characters long and the total number of words across all file swill not exceed 10MB. Once read and parsed, use a hash table, along with a hashing function(you're free to use a library hashing function or design your own), to store the total count of each word for each file. You are free to
handle hash collisions using any heuristic/method you choose, but be sure to handle collisions atomically (fetch and set/compare and swap). Also, use atomics to make sure values are updated correctly, mutexes and semaphores will not be allowed for the value update portion of this assignment (but you are free to use them elsewhere). Finally,
make sure to keep another data structure with a mapping of the words
in each file to their hash value and output a list of each word along
with the number of times that word appears across all files. In the output file, please display the items which occur most frequently first, but words with same number of occurrences need not be alphabetized. Also, please separate the word from the count using a colon(:), put the count in brackets([]), and please output one word-count pair per line.
Hints:
- A hash table of the data structures (which contain the word and it's count) is a good place to start
- A good heuristic for handling collisions is to make your data structure a linked list; collisions are added to the end of the list
- When dealing with multiple insertions at the same time, it's OK to allow both to be added, then clean up the list when printing it
Here are example input and output files:
Now implement the task manager on the GPU using CUDA. Use atomic operations here as well. You will also need to design your own simple hash function for this problem.
Hints: