Parallel & Distributed Computing 

  • Introduction to Heterogeneous Parallel Computing (57 pages)

  • High Performance Computing Benchmarking (60 pages)

  • Beowulf Cluster Using Raspberry Pi (100 pages)

  • CUDA C - Part 1 (157 pages)

  • CUDA C - Part 2 (85 pages)

  • Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU & GPU

    Introduction, the workload: throughput computing kernels, today's high performance computing platforms, performance evaluation on Core i7 and GTX 280, performance analysis, platform optimization guide, hardware recommendations, related work, conclusion.

  • Introduction to Map Reduce (166 pages)

    A simple example (city and corresponding temperatures), map() function, map and reduce operations, shuffle operation, overview of the map-reduce framework, logical view, map(k1,v1)->list(k2,v2), reduce(k2,list(v2))->list(k3,v3), example 1: word count, example 2: average number of social contacts, Java program for word count, Mapper, Writable and Comparable interfaces, WritableComparable, Context class, IntWritable, Reducer class, Job class, configuring a Job object, Combiner class, addInputPath, setOutputPath, Hadoop -- standalone setup in Linux, compiling the word-count program, Hadoop map reduce, ResourceManager, NodeManager, ApplicationMaster, Hadoop pseudo-distributed environment setup and running in Linux, HDFS setup, copying files into HDFS, what are combiners in Hadoop?, how many maps?, Reducer, how many reduces?, Reducer NONE, Partitioner, Counter, what Hadoop can and can't do.
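    The logical view listed above -- map(k1,v1)->list(k2,v2) followed by a shuffle that groups values by key and a reduce that folds each group -- can be sketched without Hadoop in plain C. This is only an illustration of the word-count dataflow; the function names map_words and reduce_counts are invented for the sketch and are not part of any Hadoop API.

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* One (key, value) pair; map() emits (word, 1) for every word. */
    struct kv { char key[32]; int value; };

    /* map(k1,v1) -> list(k2,v2): split a line into (word, 1) pairs. */
    static int map_words(const char *line, struct kv *out, int max) {
        char buf[256];
        strncpy(buf, line, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        int n = 0;
        for (char *w = strtok(buf, " \t\n"); w && n < max; w = strtok(NULL, " \t\n")) {
            strncpy(out[n].key, w, sizeof out[n].key - 1);
            out[n].key[sizeof out[n].key - 1] = '\0';
            out[n].value = 1;
            n++;
        }
        return n;
    }

    /* Shuffle + reduce(k2,list(v2)): group pairs by key and sum the 1s. */
    static int reduce_counts(const struct kv *pairs, int n, struct kv *out) {
        int m = 0;
        for (int i = 0; i < n; i++) {
            int j;
            for (j = 0; j < m; j++)
                if (strcmp(out[j].key, pairs[i].key) == 0) break;
            if (j == m) out[m++] = pairs[i];      /* first time this key is seen */
            else out[j].value += pairs[i].value;   /* same key: accumulate */
        }
        return m;
    }

    int main(void) {
        struct kv pairs[64], counts[64];
        int n = map_words("the quick fox the fox", pairs, 64);
        int m = reduce_counts(pairs, n, counts);
        for (int i = 0; i < m; i++)
            printf("%s %d\n", counts[i].key, counts[i].value);
        return 0;
    }
    ```

    In Hadoop the same three roles are played by the Mapper, the framework's shuffle, and the Reducer; a Combiner would apply reduce_counts locally on each map node before the shuffle to cut network traffic.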

  • Multithreading 1 (135 pages)

    Introduction, light-weight processes, difference between threads and processes, thread states - spawn, block, unblock, finish, user-level and kernel-level threads, thread library, advantages and disadvantages of user-level threads, POSIX Pthreads, Mach C-threads, advantages and disadvantages of kernel-level threads, combined approach, multithreading models - many-to-one, one-to-one, many-to-many, POSIX threads, sample program, pthread_create, handling thread IDs, pthread_self(), gettid(), pthread with mutex, pthread_mutex_t, can threads really speed up execution?, sample code - primality testing, clone() system call, sys_clone(), fork(), exit(), Linux threads, Linux process/thread model -- interruptible & uninterruptible, zombie, ready, executing, stopped, sample code, sending a signal to a specific thread, pthread_kill(), sample code, thread pool - faster to service a request, limits the number of threads, Java threads, one-to-one thread model in Windows, pthread_yield(), sample code, difference between pthread_yield() and sleep().
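    The pthread_create / pthread_join / pthread_mutex_t topics above fit in one small program: several threads increment a shared counter, and a mutex makes the update atomic. A minimal sketch, assuming a POSIX system (compile with -pthread):

    ```c
    #include <pthread.h>
    #include <stdio.h>

    /* Shared counter protected by a statically initialized mutex. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* without this, the increments race */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);  /* spawn */
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);                   /* block until finish */
        printf("counter = %ld\n", counter);             /* 4 * 100000 = 400000 */
        return 0;
    }
    ```

    Dropping the lock/unlock pair turns this into the classic lost-update race: the final count becomes nondeterministic and usually falls short of 400000.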

  • Multithreading 2 (146 pages)

    Threads and processes, kernel threads, context switch, user-space threads, fibers, GNU Pth, advantages and disadvantages of fibers, NPTL (Native POSIX Thread Library), LinuxThreads, POSIX Thread Trace Tool (PTT), clone(), pthread_create(), how to implement user-level threads in Linux, getcontext() and setcontext(), makecontext() and swapcontext(), setjmp and longjmp, hybrid model, implementing kernel-level threads in Linux, how efficient are threads in Linux, why not a hybrid kernel-space/user-space implementation?, clone and fork, mprotect(), wait(), CLONE_THREAD flag, pthread_join() blocking call, C++ multi-threading performance on independent data, cache performance and threading, L1 and L2 are usually per core while L3 is shared across cores, guidelines for converting single-threaded code to multi-threaded, CPU cache intro, ITLB & DTLB caches, replacement policies, cache entry structure, multi-core caches, false sharing in multi-threaded programming, grouping of cache lines (64 or 128 bytes), cache coherence protocol - MESI (Modified, Exclusive, Shared & Invalid), write-through and write-back policies, MESI state diagram, false-sharing scenarios (no padding & no spacing, no padding but spacing, padding but no spacing, padding & spacing), solutions to false sharing, detection of false sharing in Visual Studio.
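    The getcontext()/makecontext()/swapcontext() topics above are the building blocks for user-level threads (fibers): the kernel never schedules anything, the program itself swaps between saved contexts. A minimal sketch, assuming Linux with glibc (ucontext is obsolescent in POSIX but still available there); the names fiber_fn, fiber_ctx, and step are invented for the example:

    ```c
    #include <stdio.h>
    #include <ucontext.h>

    /* Two contexts: main and one fiber-style user-level thread. */
    static ucontext_t main_ctx, fiber_ctx;
    static char fiber_stack[64 * 1024];   /* the fiber's private stack */
    static int step = 0;

    static void fiber_fn(void) {
        step = 1;
        swapcontext(&fiber_ctx, &main_ctx);  /* cooperative yield to main */
        step = 3;                            /* resumed later by main */
    }

    int main(void) {
        getcontext(&fiber_ctx);              /* start from the current context */
        fiber_ctx.uc_stack.ss_sp = fiber_stack;
        fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
        fiber_ctx.uc_link = &main_ctx;       /* where to go when fiber_fn returns */
        makecontext(&fiber_ctx, fiber_fn, 0);

        swapcontext(&main_ctx, &fiber_ctx);  /* run fiber until it yields: step = 1 */
        step = 2;
        swapcontext(&main_ctx, &fiber_ctx);  /* resume fiber to completion: step = 3 */
        printf("final step = %d\n", step);
        return 0;
    }
    ```

    Because the switch is an ordinary library call, no kernel transition occurs -- which is exactly why user-level switches are cheap, and why a blocking system call in one fiber stalls them all.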

  • Flynn's Taxonomy (50 pages)

    Introduction, single and multiple data streams, SISD (Single Instruction Single Data), exploits no parallelism, corresponds to the von Neumann architecture, SIMD (Single Instruction Multiple Data), array processors and vector processors, MISD (Multiple Instruction Single Data), for fault tolerance, MIMD (Multiple Instruction Multiple Data), SPMD (Single Program Multiple Data), MPMD (Multiple Program Multiple Data), Sony PS3 gaming console, SIMD instructions, disadvantages of SIMD, MISD -- different operations on the same data, systolic arrays -- hardwired for a specific operation, wavefront processors, individual nodes triggered by the arrival of new data, data and partial results stored within the array, Cisco PXF network processor, MIMD - Xeon Phi, flavours of parallelism.
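    SIMD in Flynn's sense -- one instruction stream applied to multiple data elements -- can be demonstrated with the GCC/Clang vector extension, which lets a single C `+` compile to one vector add (e.g. an SSE paddd) over four lanes instead of four scalar adds. A minimal sketch, assuming a GCC- or Clang-compatible compiler (this extension is not part of standard C):

    ```c
    #include <stdio.h>

    /* A vector of 4 x 32-bit ints: 16 bytes, one SSE register wide. */
    typedef int v4si __attribute__((vector_size(16)));

    int main(void) {
        v4si a = {1, 2, 3, 4};
        v4si b = {10, 20, 30, 40};
        v4si c = a + b;   /* one instruction, four element-wise additions */
        for (int i = 0; i < 4; i++)
            printf("%d ", c[i]);   /* 11 22 33 44 */
        printf("\n");
        return 0;
    }
    ```

    The same element-wise pattern is what array and vector processors exploit; the listed disadvantage of SIMD follows directly: code with divergent per-element control flow cannot be expressed as one shared instruction stream.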