程序案例-EEE6207 OS-Assignment 20

EEE6207 OS Resit Assignment 2021
This assignment concerns using multiple threads to speed-up the multiplication of two n × n
matrices.
1 Background: Algorithmic performance
It seems intuitively reasonable that the runtime of any algorithm should scale as a function of the
‘size’ n of the problem. Consider the case where the runtime T (n) of an algorithm is given by:
T (n) = αn3 + βn2 + γn+ δ
which can often be derived simply by ‘counting’ the numbers of operations in a block of code.
Usually, the details of the constants depend on the processor, etc. and such a detailed analysis
is rarely justified. For this reason, computer scientists usually only consider the worst-case analysis
of when n tends asymptotically to infinity. To take the example above, as n → ∞, the term in n3
eventually dominates all other terms. Similarly, the value of α tends to be processor-specific and is
ignored. This leads to the so-called ‘big oh’ analysis—the example algorithm discussed above would
thus be said to beO(n3), or an “n-cubed algorithm”. (Notice that this is a rough-and-ready asymptotic
analysis that usually proves to be adequate in practice; for small values of n, it is entirely feasible that
an O(n3) algorithm runs faster than an O(n2) algorithm if the values of the respective constants are
favourable.)
2 Matrix multiplication
If we consider the elementary case of the product of two 2× 2 matrices,A×B = C:[
a11 a12
a21 a22
]
×
[
b11 b12
b21 b22
]
=
[
c11 c12
c21 c22
]
we can observe that element c11 is given by the (inner or dot) product of the row vector
[
a11 a12
] and
the column vector [b11 b21] . Similarly, the other elements in the product matrix C are formed by
vector products between rows and columns inA andB. (If you are not convinced of this, try writing
out the individual terms!)
Considering the algorithmic complexity of multiplying two n × n matrices, forming a single ele-
ment in the product matrix (e.g. c11) requires n multiply-accumulate operations. To generate every
element ofC requiresn2 such vector product operations—because there aren2 of them—meaning that
the algorithmic complexity of the whole matrix multiplication is O(n3). If, however, it were possible
to perform each of the vector products in parallel on, say, n2 processors then the matrix multiplica-
tion will be O(n)—a significantly smaller rate of growth than an O(n3) algorithm and an attractive
speed-up.
3 The assignment
Nowadays, it is near-universal for the processors in even the cheapest computers to contain multiple
cores – each core is effectively a stand-alone processor that shares main memory with all the other
cores. Mapping a computational thread onto each core can thus produce significant speed ups for a
whole range of algorithms. This assignment will consider amulti-threaded implementation of square
matrixmultiplication inwhich each vector product is carried out by a single thread. (In practice, most
P.I.Rockett, 2021 1
operating systems internally map one thread to one core—the details of how this is done and any
complications that result are outside the scope of this assignment. You can, of course, create more
threads than there are cores, and leave it to the OS to schedule them.)
You shouldwrite a program in C tomultiply two (arbitrary) n×nmatrices usingmultiple threads
and measure its runtime performance1. Starting with a 2 × 2 matrix (n = 2) and 4 (= n2) threads,
measure how the runtimes scale when you increase the value of n and hence the number of threads.
(In order to obtain reliable timings, you may need to repeat the same matrix multiplication multiple
times and then average.)
In principle, it should be possible to achieve O(n) performance by using n2 threads although
it is highly unlikely will achieve this—what do you observe and why Can you explain the
deviation from O(n) performance you observe
How does the performance vary with n and the numbers of statically allocated threads used for
calculating the vector products
Is it possible to achieve approximately the sameperformance by reusing a smaller pool of threads
(< n2) where each thread carries out a sequence of two or more of the vector multiplications Can you explain your results How does this all relate to the number of physical cores in your machine And the cache sizes In this assignment, you should consider only the so-called static thread allocation model where you create a sufficiently large pool of threads before carrying out the matrix multiplications. This is an entirely appropriate approach if, for example, you are going tomultiply a large number of identically- sized matrices, one after the other. Time only the durations of the matrix multiplications. Do NOT include the time it takes to create/destroy threads. Your solution should consider only the times it takes to perform the matrix multiplications, not the operating-system dependent overheads of thread creation and destruction. Depending on how you organise your program, youmay experience race hazards, that is, situations where one thread occasionally completes itswork before another thread, resulting in the answer some- times being calculated from incomplete results. You need to think carefully about whether you need to employ appropriate synchronisation mechanisms. (Unfortunately, race hazards are a major prob- lem in concurrent programming and have been the root cause of a number of software disasters!) The data structures you use to store the matrices are for you to decide, but I suggest anything more complicated than 2D arrays, such as specialist matrix classes, may cause you problems. FWIW, multiplying twomatrices stored in 2D arrays is a first-year C programming exercise. Similarly, passing function parameters by reference rather than by value would be sensible. Sufficient background information for this assignment is given in the printed notes available on the module’s MOLE pages. 4 What is not included Although sophisticated concurrency platforms are available—probably the most well-known be- ingOpenMP—you shouldnot use such software in this assignment. Hand code all the allocation of vector products to threads. Youmay come across divide-and-conquer approaches tomatrixmultiplication, such as Strassen’s method; this ingenious algorithm reduces the complexity of matrix multiplication from O(n3) 1You should not make the element values zero in case your compiler ‘optimises away’ what it may think are pointless 0× 0 operations; random values will be fine. You should, of course, check that your program produces the right answer! P.I.Rockett, 2021 2 to O(n2.81) on a single processor. For the purposes of this assignment, however, you should ignore such clever approaches, and consider only mapping vector products to threads. You should not do anything (explicit) to alter the way the operating system schedules the threads. Just rely on the default scheduling mechanisms. 5 Assessment The program should be written in C to exploit the standard Linux API calls. (Experience has shown that multithreading in languages like Java behve strangely.) The report, which should be up to 3 pages in length, should discuss the performance of your algorithm; it should include any relevant graphs. You should outline how your program works (although detailed documentation of every line of source code is not required). In addition, you should provide sufficient evidence that your program actually works as spe- cified. You may be required to physically demonstrate your program working. All source code should be presented as an appendix to your report; the appendices will not count towards the 3-page limit. Since this is a resit assignment, the mark will be capped to 50% Marks will be awarded for identifying the key factors that determine the speed-up from using multiple threads, and the appropriateness of the methods/approach employed. The report should be submitted by Friday, 21 May 2021 to the module’s MOLE page. P.I.Rockett, 2021 3 Marking Scheme 1. Introduction&demonstration of the base case of performingmultiplicationusing a single thread [10%]. 2. Implementation&demonstration of distributing thematrixmultiplication overN2 threads, con- sidering a large enough range of N to draw sensible conclusions. [15%] 3. Analysis & discussion of the results obtained in (2) above. [15%] 4. Implementation & demonstration of distributing the matrix multiplication over fewer than N2 threads, considering a large enough range of N to draw sensible conclusions. [15%] 5. Analysis & discussion of the results obtained in (4) above. [15%] 6. Overall analysis/discussion of results taking into consideration the characteristics of the pro- cessor. [15%] 7. Drawing appropriate conclusions. [15%] P.I.Rockett, 2021 4