CACHES, CACHE COHERENCE, AND FALSE SHARING

Recall that for a number of years now, processors have been able to execute operations much faster than they can access data in main memory. So if a processor must read data from main memory for each operation, it will spend most of its time simply waiting for the data to arrive. Also recall that in order to address this problem, chip designers have added blocks of relatively fast memory, called cache, to processors. The design of caches takes into consideration the principles of temporal and spatial locality: if a processor accesses main memory location x at time t, then it is likely that at times close to t it will access main memory locations close to x. Thus, if a processor needs to access main memory location x, rather than transferring only the contents of x to/from main memory, a block of memory containing x is transferred from/to the processor's cache. Such a block of memory is called a cache line or cache block.

We've already seen in Section 2.3.4 that the use of cache memory can have a huge impact on shared-memory performance. Let's recall why. First, consider the following situation. Suppose x is a shared variable with the value 5, and both thread 0 and thread 1 read x from memory into their (separate) caches, because both want to execute the statement

   my_y = x;

Here, my_y is a private variable defined by both threads. Now suppose thread 0 executes the statement

   x++;

Finally, suppose that thread 1 now executes

   my_z = x;

where my_z is another private variable. What's the value in my_z? Is it 5? Or is it 6? The problem is that there are (at least) three copies of x: the one in main memory, the one in thread 0's cache, and the one in thread 1's cache. When thread 0 executed x++, what happened to the values in main memory and thread 1's cache? This is the cache coherence problem, which we discussed in Chapter 2. We saw there that most systems insist that the caches be made aware that changes have been made to data they are caching. The line in thread 1's cache would have been marked invalid when thread 0 executed x++, and before assigning my_z = x, the core running thread 1 would see that its value of x was out of date. Thus, the core running thread 0 would have to update the copy of x in main memory (either now or earlier), and the core running thread 1 would get the line with the updated value of x from main memory.

The use of cache coherence can have a dramatic effect on the performance of shared-memory systems. To see why, take a look at matrix-vector multiplication. Recall that if A = (a_ij) is an m x n matrix and x is a vector with n components, then their product y = Ax is a vector with m components, and its ith component y_i is found by forming the dot product of the ith row of A with x:

   y_i = a_i0 * x_0 + a_i1 * x_1 + ... + a_i,n-1 * x_n-1

So if we store A as a two-dimensional array and x and y as one-dimensional arrays, we can implement serial matrix-vector multiplication with a doubly nested loop:

   for (i = 0; i < m; i++) {
      y[i] = 0.0;
      for (j = 0; j < n; j++)
         y[i] += A[i][j] * x[j];
   }

There are no loop-carried dependences in the outer loop, since A and x are never updated and iteration i only updates y[i]. Thus, we can parallelize the code by dividing the iterations in the outer loop among the threads:

   # pragma omp parallel for num_threads(thread_count)

To compare the run-time of the serial program with the run-time of the parallel program, recall that the efficiency E of the parallel program is the speedup S divided by the number of threads, t:

   E = S / t = (T_serial / T_parallel) / t

where T_serial is the run-time of the serial program and T_parallel is the run-time of the parallel program.

Table 5.4 shows the run-times and efficiencies of our matrix-vector multiplication with different sets of data and differing numbers of threads. In each case, the total number of floating point additions and multiplications is 64,000,000. An analysis that only considers arithmetic operations would predict that a single thread running the code would take the same amount of time for all three inputs. However, it's clear that this is not the case: the 8,000,000 x 8 system requires about 22% more time than the 8000 x 8000 system, and the 8 x 8,000,000 system requires about 26% more. Both of these differences are at least partially attributable to cache performance.

Recall that a write-miss occurs when a core tries to update a variable that's not in its cache, and it has to access main memory. A cache profiler shows that when the program is run with the 8,000,000 x 8 input, it has far more cache write-misses than with either of the other inputs. Since the number of elements in the vector y is far greater in this case (8,000,000 vs. 8000 or 8), and each element must be initialized, it's not surprising that the initialization of y slows down the execution of the program with the 8,000,000 x 8 input.

Also recall that a read-miss occurs when a core tries to read a variable that's not in its cache, and it has to access main memory. A cache profiler shows that when the program is run with the 8 x 8,000,000 input, it has far more cache read-misses than with either of the other inputs. These occur in Line 6, and a careful study of this program (see Exercise 5.12) shows that the main source of the differences is due to the reads of x. Once again, this isn't surprising, since for this input x has 8,000,000 elements, versus only 8000 or 8 for the other inputs.