Lectures on the discipline “Parallel Computing”. Classification of parallel data processing systems


4th year, 1st and 2nd streams, 7th semester

Lectures (34 hours), credit test

Department responsible for the course: ASVK

Program compiled by: Corresponding Member of the RAS, Doctor of Physical and Mathematical Sciences Vl. V. Voevodin

Lecturer: Corresponding Member of the RAS, Doctor of Physical and Mathematical Sciences Vl. V. Voevodin

Annotation

The course discusses general issues of organizing parallel computing. It considers the features of the architectures of modern parallel computing systems and studies the basic methods and paradigms of programming in parallel environments.

For the 1st and 2nd streams, the course also discusses approaches to matching the architecture of parallel systems with the structure of algorithms, the theory of analyzing the structure of programs and algorithms, and models of parallel computing.

Program

1. Large tasks and supercomputers. Parallel and pipeline data processing. Parallelism and pipelining in the architecture of modern high-performance computers. Scalar and vector instructions. Scalar, pipelined and vector units. The memory hierarchy in computers as a means of increasing the speed of program execution; locality of computations and locality of data use. Amdahl's law and its consequences; superlinear speedup.

2. Main classes of modern parallel computing systems. Computers with shared memory, examples, reasons for decreased performance in real programs. SMP, NUMA and ccNUMA architectures. Interconnection of processors and memory modules: bus, crossbar switch, omega network. Vector-pipeline computing systems, examples, reasons for decreased performance. Computers with distributed memory, examples, reasons for decreased performance. Topologies of communication between processors: star, mesh, three-dimensional torus, binary hypercube, and their properties. Computing clusters, examples, latency and throughput of various communication technologies. Architectures with parallelism at the level of machine instructions, VLIW, superscalarity.

3. Parallel programming technologies. Traditional sequential languages and parallelizing compilers, and their problems. Special comments and compiler directives, extensions of existing languages. Special parallel programming languages. Programming using libraries and message-passing interfaces. Parallel domain-specific libraries, specialized packages and high-level software systems. The parallel programming technologies MPI, OpenMP and Linda.

4. Performance of parallel computing systems. Versatility versus specialization of computers, the performance of special-purpose processors. Moore's law. Performance assessment methods. Reduction to a single numeric parameter: Mflops, MIPS. Peak and real performance of computers. The Linpack test and its variants. Sets of complementary test programs: STREAM and NPB.

5. Graph models of programs. The control graph and the information graph of a program. Information and operational history of program execution. The algorithm graph as a compact parametric form of representing the information history. Information independence of operations and the possibility of their parallel execution. The length of the critical path of the algorithm graph as a measure of the degree of parallelism. Finite and massive parallelism, coordinate and skewed parallelism. Equivalent program transformations, elementary loop transformations.

6. Heterogeneous distributed computing systems. Metacomputers and metacomputing, existing metacomputer projects. Distinctive properties of metacomputers. The GRID concept, its basic components and services, existing GRID segment projects, and the concept of a virtual organization.

Literature

1. Voevodin V.V., Voevodin Vl.V. Parallel computing. – St. Petersburg: BHV St. Petersburg, 2002. - 608 p.

2. Korolev L.N. Architecture of processors of electronic computing machines. – M.: Publishing house of the Faculty of Computational Mathematics and Cybernetics, Moscow State University, 2003.

3. V.V. Korneev. Parallel computing systems. – M.: Knowledge Publishing House, 1999. – 320 p.

4. Materials of the information and analytical center for parallel computing Parallel.ru.

Additional literature

1. Antonov A.S. Parallel programming using MPI technology: A textbook. – M.: Moscow State University Publishing House, 2004. – 71 p.

Throughout the history of computer technology, attempts have been made to find a general classification under which all possible directions of development of computer architectures would fall. None of these classifications could cover the entire variety of architectural solutions, and none has stood the test of time. Nevertheless, a number of terms have entered scientific circulation and are widely used; they are useful to know not only for developers but also for computer users.

Any computing system (be it a supercomputer or a personal computer) achieves its highest performance through the use of high-speed elements and the parallel execution of a large number of operations. It is precisely the ability of the various devices of a system to work in parallel (with overlap) that is the basis for accelerating basic operations.

Parallel computers are often divided, according to Flynn's classification, into SIMD (Single Instruction Multiple Data) and MIMD (Multiple Instruction Multiple Data) machines. Like any other, this classification is imperfect: there are machines that do not fall directly into it, and there are important features that it does not take into account. In particular, vector processors are often classified as SIMD machines, although their high performance depends on another form of parallelism, the pipelined organization of the machine. Multiprocessor vector systems, such as the Cray Y-MP, consist of several vector processors and can therefore be called MSIMD (Multiple SIMD).

Flynn's classification also does not distinguish other characteristics important for computational models, such as the granularity of parallel computations and synchronization methods.

There are four main types of parallel processing system architecture:

1) Pipeline and vector processing.

The basis of pipeline processing is the execution of an operation in several separate stages, with data passed from one stage to the next. Performance increases because several operations are simultaneously in progress at different stages of the pipeline. Pipelining is effective only when the pipeline is close to full and the rate at which new operands arrive matches the maximum throughput of the pipeline. If delays occur, fewer operations are executed in parallel and overall performance decreases. Vector operations provide an ideal opportunity to load the computing pipeline fully.



When a vector instruction is executed, the same operation is applied to all elements of a vector (or, most often, to the corresponding elements of a pair of vectors). Some setup time may be needed to configure the pipeline for a particular operation, but operands can then enter the pipeline at the maximum speed allowed by the memory. In this case there are no pauses either for fetching a new instruction or for determining the branch of computation at a conditional jump. Thus, the main principle of computing on a vector machine is to perform some elementary operation, or a combination of several elementary operations, applied repeatedly to a block of data. In the source program such operations correspond to small compact loops.
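As an illustration (a small sketch, not taken from the lecture text; the function and array names are arbitrary), the following C loop is exactly the kind of small compact loop that a vector-pipeline machine executes as a single vector operation applied element by element to a block of data:

#include <stddef.h>

/* y[i] = a*x[i] + y[i]: one elementary operation applied to a whole block
   of data; a vectorizing compiler can map this loop to vector instructions,
   and on a vector machine the operands stream through the arithmetic pipeline. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}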

2) SIMD type machines. SIMD machines consist of a large number of identical processing elements that have their own memory. All processing elements in such a machine execute the same program. Obviously, such a machine, composed of a large number of processors, can provide very high performance only on those tasks in which all processors can do the same job. The computation model for a SIMD machine is very similar to the computation model for a vector processor: a single operation is performed on a large block of data.

Unlike the limited pipeline operation of a vector processor, a matrix processor (synonymous with most SIMD machines) can be significantly more flexible. The processing elements of such processors are universal programmable computers, so a problem solved in parallel can be quite complex and contain branches. The typical manifestation of this computational model in a source program is roughly the same as in the case of vector operations: loops on array elements in which the values ​​produced in one iteration of the loop are not used in another iteration of the loop.

The computing models on vector and matrix computers are so similar that these computers are often discussed as equivalent.

3) MIMD type machines. The term "multiprocessor" covers most MIMD machines and (just as the term "matrix processor" applies to SIMD machines) is often used as a synonym for MIMD machines. In a multiprocessor system, each processing element (PE) executes its program quite independently of the other processing elements. The processing elements, of course, must somehow communicate with each other, which necessitates a more detailed classification of MIMD-type machines. Multiprocessors with shared memory (tightly coupled multiprocessors) have data and instruction memory available to all PEs. PEs communicate with the shared memory via a common bus or an exchange network. In contrast, in loosely coupled multiprocessor systems (machines with local memory), all memory is distributed among the processing elements, and each memory block is accessible only to the processor associated with it. An exchange network connects the processing elements with each other.

The basic model of computing on a MIMD multiprocessor is a set of independent processes that periodically access shared data. There is a large number of variants of this model. At one end of the spectrum is the distributed computing model, in which a program is divided into a fairly large number of parallel tasks consisting of many subroutines. At the other end is the dataflow computing model, in which each operation in a program can be treated as a separate process. Such an operation waits for its input data (operands), which must be passed to it by other processes. Once they are received, the operation is performed and the resulting value is passed on to those processes that need it. In dataflow models of high and medium granularity, processes contain a large number of operations and are executed in dataflow fashion.

4) Multiprocessor machines with SIMD processors.

Many modern supercomputers are multiprocessor systems that use vector processors or SIMD processors as their processors. Such machines belong to the MSIMD class.

Programming languages and the corresponding compilers for MSIMD machines typically provide language constructs that allow the programmer to describe "coarse-grained" parallelism. Within each task, the compiler automatically vectorizes suitable loops. MSIMD machines thus make it possible to exploit the better of the two decomposition principles: vector operations ("fine-grained" parallelism) for those parts of the program that suit them, and the flexible capabilities of the MIMD architecture for the other parts.

Over the years of development of computing technology, multiprocessor systems have undergone a number of stages of development. Historically, SIMD technology was the first to be mastered. However, there is currently a steady interest in MIMD architectures. This interest is mainly determined by two factors:

  1. The MIMD architecture provides great flexibility: with adequate hardware and software support, MIMD can operate as a single-user system providing high-performance data processing for a single application task, as a multi-program machine running multiple tasks in parallel, or as some combination of these capabilities.
  2. MIMD architecture can take full advantage of modern microprocessor technology based on strict cost/performance considerations. In fact, almost all modern multiprocessor systems are built on the same microprocessors that can be found in personal computers, workstations and small single-processor servers.

One of the distinctive features of a multiprocessor computing system is the exchange network through which the processors are connected to each other or to memory. The communication model is so important for a multiprocessor system that many performance characteristics and other estimates are expressed as the ratio of processing time to communication time for the tasks being solved. There are two main models of interprocessor communication: one based on message passing, the other on the use of shared memory. In a shared-memory multiprocessor system, one processor writes to a specific memory location and another processor reads from it. To ensure data consistency and process synchronization, the exchange is often implemented using the principle of mutually exclusive access to shared memory, the "mailbox" method.

In local-memory architectures, direct memory sharing is impossible. Instead, processors access shared data by passing messages over an exchange network. The effectiveness of the communication scheme depends on the communication protocols, the underlying communication networks, and the bandwidth of memory and of the communication channels.
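The message-passing model itself can be sketched with MPI, one of the technologies named in the course program. The following minimal example is only an illustration (the value sent and the roles of the processes are chosen arbitrarily): two processes exchange data exclusively through the network, with no shared memory involved.

#include <mpi.h>
#include <stdio.h>

/* Minimal message-passing sketch: process 0 sends one integer to process 1. */
int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (for example, mpirun -np 2 ./a.out).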

The costs of communication in shared-memory and vector machines are often, and unjustifiably, ignored, since the problems of communication are largely hidden from the programmer. However, communication overhead exists in these machines and is determined by conflicts over buses, memory and processors. As more processors are added to a system, more processes compete for the same data and the same bus, leading to saturation. The shared-memory system model is very convenient for programming and is sometimes viewed as a high-level means of assessing the impact of data sharing on system operation, even if the underlying system is actually implemented with local memory and message passing.

In circuit-switched and packet-switched networks, the possibility of network congestion must be considered as traffic demands grow. Here interprocessor communication ties up network resources: channels, processors, message buffers. The volume of transmitted information can be reduced by careful functional decomposition of the task and careful scheduling of the functions performed.

Thus, existing MIMD machines fall into two main classes depending on the number of processors being combined, which determines both the method of organizing memory and the method of their interconnections.

The first group includes machines with common (shared) main memory, combining up to several dozen (usually less than 32) processors. The relatively small number of processors in such machines makes it possible to have one centralized shared memory and combine processors and memory using a single bus. If processors have sufficient cache memory, the high-performance bus and shared memory can accommodate memory accesses from multiple processors. Because there is a single memory with the same access time, these machines are sometimes called UMA (Uniform Memory Access). This method of organization with relatively small shared memory is currently the most popular. The structure of such a system is shown in Fig. 10.1.

Fig. 10.1. Typical architecture of a multiprocessor system with shared memory.

The second group of machines consists of large-scale distributed-memory systems. In order to support a large number of processors, main memory has to be distributed among them; otherwise the memory bandwidth may simply not be enough to satisfy the requests coming from a very large number of processors. Naturally, this approach also requires implementing communication between the processors. Fig. 10.2 shows the structure of such a system.

As the number of processors grows, there is simply no way around the need for a distributed-memory model with a high-speed network for communication between processors. With the rapid growth in processor performance and the associated growth in demands on memory bandwidth, the scale of systems (i.e., the number of processors) that require distributed memory is decreasing, as is the number of processors that can be supported on one shared bus and shared memory.

Distributing memory between individual system nodes has two main advantages. First, it is a cost-effective way to increase memory bandwidth because most accesses can be performed in parallel to local memory in each node. Secondly, it reduces the access latency (access time) to local memory. These two advantages further reduce the number of processors for which a distributed memory architecture makes sense.

Typically, I/O devices, like memory, are distributed across the nodes, and in reality the nodes may consist of a small number (2-8) of processors interconnected in some other way. Although such clustering of several processors with memory and a network interface can be quite useful from a cost-efficiency point of view, it is not very significant for understanding how such a machine works, so for now we will focus on systems with one processor per node. The main architectural differences that should be highlighted in distributed-memory machines are how communication is carried out and what the logical memory model is.

Fig. 10.2. Typical distributed-memory machine architecture.

Simple calculations show that configurations of such systems can cost more than a million US dollars: just for fun, estimate how much, say, 4 TB of RAM alone costs. A whole series of natural questions arises: what tasks are so important that they require multimillion-dollar computers? Or, what tasks are so complex that a good Pentium is not enough? We would like to find reasonable answers to these and similar questions.

In order to assess the complexity of problems solved in practice, let us take a specific subject area, for example, the optimization of the oil production process. We have an underground oil reservoir with a certain number of drilled wells: some pump oil to the surface, while others pump water back down. It is necessary to simulate the situation in the reservoir in order to estimate oil reserves or to understand whether additional wells are needed.

We will accept a simplified scheme in which the modeled area is mapped onto a cube; it will nevertheless be sufficient for estimating the number of arithmetic operations required. Reasonable cube sizes at which plausible results can be obtained are 100*100*100 points. At each point of the cube, from 5 to 20 functions must be computed: the three velocity components, pressure, temperature, and the concentrations of the components (water, gas and oil are the minimum set of components; more realistic models consider, for example, different fractions of oil). Furthermore, the values of the functions are found by solving nonlinear equations, which requires from 200 to 1000 arithmetic operations. Finally, if a non-stationary process is studied, i.e., we need to understand how the system behaves in time, then 100-1000 time steps are taken. What do we get:

10^6 (grid points) * 10 (functions) * 500 (operations) * 500 (time steps) = 2.5*10^12

2500 billion arithmetic operations to perform just one calculation! What about changing model parameters? What about tracking the current situation when input data changes? Such calculations must be done many times, which imposes very stringent requirements on the performance of the computing systems used.

Examples of the use of supercomputers can be found not only in the oil industry. Here is just a small list of areas of human activity where the use of supercomputers is really necessary:

  • automotive industry
  • oil and gas production
  • pharmacology
  • weather forecasting and climate change modeling
  • seismic survey
  • electronic device design
  • synthesis of new materials
  • and many, many others

In 1995, the body of the Nissan Maxima was made 10% stronger thanks to the use of a Cray supercomputer (The Atlanta Journal, May 28, 1995). With its help, not only weak points of the body were found, but also the most effective way to remove them.

According to Mark Miller (Ford Motor Company), to perform crash tests, in which real cars crash against a concrete wall while simultaneously measuring the necessary parameters, filming and then processing the results, Ford would need from 10 to 150 prototypes of new models with total costs ranging from $4 million to $60 million. The use of supercomputers has reduced the number of prototypes by one third.

A very recent example is the development of one of the world's largest reservation systems, Amadeus, used by thousands of agencies with 180,000 terminals in more than a hundred countries. The installation of two Hewlett-Packard T600 servers with 12 processors each made it possible to raise the operational availability of the central system to 99.85% with a current load of about 60 million requests per day.

And similar examples can be found everywhere. At one time, DuPont researchers were looking for a replacement for chlorofluorocarbons. They needed to find a material with the same positive qualities: non-flammability, corrosion resistance and low toxicity, but without the harmful effect on the Earth's ozone layer. In one week, the necessary calculations were carried out on a supercomputer at a total cost of about 5 thousand dollars. According to DuPont experts, traditional experimental research methods would have required about three months and 50 thousand dollars, and this does not count the time needed to synthesize and purify the required amount of the substance.

What accounts for the increase in computer performance?

Why do supercomputers calculate so quickly? There may be several answer options, among which two have a clear advantage: the development of the element base and the use of new solutions in computer architecture.

Let's try to figure out which of these factors is decisive for achieving record performance. Let us turn to well-known historical facts. On one of the first computers in the world, the EDSAC, which appeared in 1949 in Cambridge and had a clock time of 2 microseconds (2*10^-6 seconds), it was possible to perform 2*n arithmetic operations in 18*n milliseconds, that is, on average 100 arithmetic operations per second. Compare this with one computing node of a modern Hewlett-Packard V2600 supercomputer: the clock time is approximately 1.8 nanoseconds (1.8*10^-9 seconds), and the peak performance is about 77 billion arithmetic operations per second.

What do we get? Over half a century, computer performance has increased by more than seven hundred million times. At the same time, the performance gain associated with reducing the clock time from 2 microseconds to 1.8 nanoseconds is only about a factor of 1000. Where did the rest come from? The answer is obvious: the use of new solutions in computer architecture. The main place among them is occupied by the principle of parallel data processing, which embodies the idea of simultaneous (parallel) execution of several actions.
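A rough back-of-the-envelope check, using only the figures quoted above, separates the contribution of the faster clock from the contribution of architecture (the numbers are approximate by nature):

#include <stdio.h>

int main(void)
{
    double edsac_ops  = 100.0;          /* EDSAC: ~100 operations per second   */
    double v2600_ops  = 77e9;           /* HP V2600 node: ~77*10^9 ops/s peak  */
    double clock_gain = 2000.0 / 1.8;   /* clock time: 2 us -> 1.8 ns          */
    double total_gain = v2600_ops / edsac_ops;

    printf("total gain:        %.1e\n", total_gain);               /* ~7.7e8 */
    printf("clock gain:        %.1e\n", clock_gain);               /* ~1.1e3 */
    printf("architecture gain: %.1e\n", total_gain / clock_gain);  /* ~7e5   */
    return 0;
}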

Parallel data processing on a computer

Parallel data processing, embodying the idea of the simultaneous execution of several actions, has two varieties: pipelining and parallelism proper. Both types of parallel processing are intuitive, so we will give only brief explanations.

Parallel Processing. If a certain device performs one operation per unit of time, then it will perform a thousand operations in a thousand units. If we assume that there are five identical independent devices capable of operating simultaneously, then a system of five devices can perform the same thousand operations not in a thousand, but in two hundred units of time. Similarly, a system of N devices will perform the same work in 1000/N units of time. Similar analogies can be found in life: if one soldier digs up a garden in 10 hours, then a company of fifty soldiers with the same abilities, working simultaneously, will cope with the same work in 12 minutes - the principle of parallelism in action!

By the way, the pioneer in parallel processing of data streams was Academician A.A. Samarsky, who performed the calculations necessary to simulate nuclear explosions in the early 50s. Samarsky solved this problem by seating several dozen young ladies with adding machines at tables. The young ladies transferred data to each other simply in words and put down the necessary numbers on adding machines. Thus, in particular, the evolution of the blast wave was calculated. There was a lot of work, the young ladies were tired, and Alexander Andreevich walked among them and encouraged them. This, one might say, was the first parallel system. Although the calculations for the hydrogen bomb were carried out masterfully, their accuracy was very low, because there were few nodes in the grid used, and the calculation time was too long.

Pipeline processing. What is needed to add two real numbers represented in floating-point form? A whole lot of small operations such as comparing exponents, aligning exponents, adding mantissas, normalization, etc. The processors of the first computers performed all these "micro-operations" for each pair of arguments, one after another, until they reached the final result, and only then proceeded to process the next pair of operands.

The idea of pipeline processing is to isolate the individual stages of performing a general operation, so that each stage, having completed its work, passes the result to the next one while simultaneously receiving a new portion of input data. We get an obvious gain in processing speed by overlapping previously separated operations. Suppose an operation consists of five micro-operations, each of which takes one unit of time. A single indivisible sequential device will process 100 pairs of arguments in 500 units. If each micro-operation is separated into its own stage of a pipeline device, then on the fifth unit of time the first five pairs of arguments will be at different stages of processing, and the whole set of one hundred pairs will be processed in 5 + 99 = 104 units of time, a speedup of almost five times compared with the sequential device (equal to the number of pipeline stages).
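The timing model used in this example can be written out explicitly; the following sketch simply reproduces the arithmetic for s pipeline stages and n pairs of arguments:

#include <stdio.h>

/* Sequential device: n*s units of time.  Pipeline: s units to fill the
   pipeline for the first result, then one new result per unit of time,
   i.e. s + (n - 1) units in total. */
int main(void)
{
    int s = 5, n = 100;
    int t_seq  = n * s;            /* 500 units */
    int t_pipe = s + (n - 1);      /* 104 units */
    printf("sequential: %d, pipelined: %d, speedup: %.2f\n",
           t_seq, t_pipe, (double)t_seq / t_pipe);   /* speedup ~ 4.8 */
    return 0;
}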

It would seem that pipeline processing could be successfully replaced by ordinary parallelism, by duplicating the main device as many times as there are pipeline stages. Indeed, the five devices in the previous example will process 100 pairs of arguments in 100 units of time, which is faster than the running time of the pipeline device! What's the catch? The answer is simple: by increasing the number of devices fivefold, we significantly increase both the amount of equipment and its cost. Imagine that a car plant decided to remove the assembly line while maintaining the rate of car production. If previously there were a thousand cars on the assembly line at the same time, then, acting by analogy with the previous example, it would be necessary to recruit a thousand teams, each of which (1) is able to completely assemble a car from start to finish, performing hundreds of different kinds of operations, and (2) can do so in the same time that a car previously spent on the assembly line. Can you imagine the cost of such a car? No? I agree, it's difficult, unless a Lamborghini comes to mind, but that is exactly why assembly-line processing arose...

A Brief History of the Emergence of Parallelism in Computer Architecture

Today, parallelism in computer architecture surprises few people. All modern microprocessors, be it Pentium III or PA-8700, MIPS R14000, E2K or Power3, use one or another type of parallel processing. In the Pentium 4 core, up to 126 micro-operations can be in flight simultaneously at different stages of execution. At presentations of new chips and in corporate press releases this is presented as the latest word in technology and the cutting edge of science, and this is indeed the case if we consider the implementation of these principles within the miniature confines of a single chip.

At the same time, these ideas themselves appeared a very long time ago. Initially, they were implemented in the most advanced, and therefore one-of-a-kind, computers of their time. Then, after the technology matured and production became cheaper, they moved down to middle-class computers, and today all of this is fully embodied in workstations and personal computers.

To see that all the major innovations in the architecture of modern processors have in fact been in use ever since the times when neither microprocessors nor the concept of a supercomputer existed, let us take a short excursion into history, starting almost from the moment the first computers were born.

IBM 701 (1953), IBM 704 (1955): bit-parallel memory, bit-parallel arithmetic.
All the very first computers (EDSAC, EDVAC, UNIVAC) had bit-serial memory, from which words were read sequentially bit by bit. The first commercially available computer to use bit-parallel memory (on CRTs) and bit-parallel arithmetic was the IBM 701, and the most popular model was the IBM 704 (150 copies sold), which, in addition to the above, was the first to use ferrite-core memory and a hardware floating-point arithmetic unit.

IBM 709 (1958): independent I/O processors.
The processors of the first computers managed input/output themselves. However, the speed of the fastest external device, which at that time was magnetic tape, was 1000 times lower than the speed of the processor, so the processor was essentially idle during I/O operations. In 1958, six independent input/output processors were attached to the IBM 704, which, after receiving commands, could work in parallel with the main processor, and the computer itself was renamed the IBM 709. This model turned out to be remarkably successful: together with its modifications, about 400 copies were sold, and the last one was switched off in 1975, after 20 years of existence!

IBM STRETCH (1961): lookahead, memory striping.
In 1956, IBM signed a contract with the Los Alamos Scientific Laboratory to develop the STRETCH computer, which had two fundamentally important features: lookahead instruction fetching and memory interleaving across two banks to reconcile the low speed of memory fetches with the speed of operation execution.

ATLAS (1963): command pipeline.
The pipelined principle of instruction execution was first used in the ATLAS machine, developed at the University of Manchester. Instruction execution is divided into 4 stages: instruction fetch, operand address calculation, operand fetch, and operation execution. Pipelining made it possible to reduce the instruction execution time from 6 μs to 1.6 μs. This computer had a huge impact on both computer architecture and software: it was the first to use a multiprogramming OS based on virtual memory and an interrupt system.

CDC 6600 (1964): independent functional devices.
Control Data Corporation (CDC), with the direct participation of one of its founders, Seymour R. Cray, produces the CDC-6600 computer - the first computer that used several independent functional units. For comparison with today, here are some computer parameters:

  • clock time 100ns,
  • performance 2-3 million operations per second,
  • RAM is divided into 32 banks of 4096 60-bit words,
  • memory cycle 1µs,
  • 10 independent functional units.
The machine was a huge success in the scientific market, actively displacing machines from IBM.

CDC 7600 (1969): pipelined independent functional units.
CDC releases the CDC-7600 computer with eight independent pipelined functional units, a combination of parallel and pipeline processing. Main parameters:

  • clock 27.5 ns,
  • 10-15 million operations/sec.,
  • 8 pipelined functional units,
  • 2-level memory.

ILLIAC IV (1974): matrix processors.

Project: 256 processor elements (PE) = 4 quadrants of 64PE, reconfigurability: 2 quadrants of 128PE or 1 quadrant of 256PE, clock cycle 40ns, performance 1Gflop;

work began in 1967; by the end of 1971 a one-quadrant system had been manufactured; in 1974 it was put into operation; development continued until 1975;

central part: control device (CU) + matrix of 64 PE;

  • The control unit is a simple low-performance computer that controls the PE matrix; all PEs of the matrix worked in synchronous mode, executing at each moment the same instruction received from the control unit, but on their own data;
  • each PE had its own ALU with a full instruction set and a local memory of 2K 64-bit words with a memory cycle of 350 ns; each PE had direct access only to its own memory;
  • data forwarding network: two-dimensional torus with a horizontal shift of 1 along the boundary;

Despite falling short of the project goals (the cost was 4 times higher, only 1 quadrant was built, the clock cycle was 80 ns, and real performance was up to 50 Mflops), this project had a huge impact on the architecture of subsequent machines built on a similar principle, in particular PEPE, BSP and ICL DAP.

Hierarchy of memory.
The memory hierarchy is not directly related to parallelism, but it certainly belongs to those features of computer architecture that are of great importance for increasing performance (smoothing out the difference between processor speed and memory access time). The main levels are: registers, cache memory, main memory, disk memory. The access time decreases from disk memory to registers, while the cost per word (byte) increases. Currently, such a hierarchy is supported even on personal computers.

What is being used in the world now?

In what directions is the development of high-performance computing technology currently taking place? There are four main directions.

Suppose that in your program the fraction of operations that need to be executed sequentially is f, where 0 ≤ f ≤ 1. According to Amdahl's law, the speedup that can be obtained on a computing system of p processors, compared with the single-processor variant, will not exceed 1 / (f + (1 - f) / p).

If 9/10 of the program is executed in parallel and 1/10 remains sequential, then it is in principle impossible to achieve a speedup of more than 10 times, regardless of the quality of the implementation of the parallel part of the code and the number of processors used (clearly, 10 is reached only in the limiting case when the execution time of the parallel part is zero).
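A small sketch of Amdahl's bound (it only evaluates the formula given above; the chosen values of f and p are arbitrary) shows how quickly the sequential fraction caps the achievable speedup:

#include <stdio.h>

/* Amdahl's law: with sequential fraction f, the speedup on p processors
   is bounded by 1 / (f + (1 - f) / p); as p grows it approaches 1 / f. */
static double amdahl(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    int procs[] = {2, 10, 100, 1000};
    for (int i = 0; i < 4; i++)
        printf("f = 0.1, p = %4d  ->  speedup <= %.2f\n",
               procs[i], amdahl(0.1, procs[i]));
    /* even with p = 1000 the bound stays just below 10 */
    return 0;
}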

Let's look at the problem from the other side: what part of the code needs to be sped up (and therefore examined beforehand) in order to obtain a given speedup? The answer can be found in a corollary of Amdahl's law: in order to speed up the execution of a program by a factor of q, it is necessary to speed up, by at least a factor of q, at least a (1 - 1/q)-th part of the program. Therefore, if you want to speed up a program 100 times compared to its sequential version, you need to speed up at least 99% of the code by no less than 100 times, and that part almost always constitutes a substantial portion of the program!

Hence the first conclusion: before thoroughly reworking the code to move to a parallel computer (and any supercomputer, in particular, is one), one must think carefully. If, having assessed the algorithm embedded in the program, you realize that the proportion of sequential operations is large, then you clearly cannot count on significant speedup, and you need to think about replacing individual components of the algorithm.

In some cases, the sequential nature of the algorithm is not so difficult to change. Let's say that the program has the following fragment to calculate the sum of n numbers:

s = 0
Do i = 1, n
  s = s + a(i)
EndDo

(the same can be written in any other language)

By its nature this fragment is strictly sequential, since the i-th iteration of the loop requires the result of the (i-1)-th, and all iterations are performed one after another. We have 100% sequential operations, which means there is no benefit from using parallel computers. At the same time, the way out is obvious. Since in most real programs (question: why in most and not in all?) there is no significant difference in what order the numbers are added, we choose a different addition scheme. First we find the sums of pairs of neighbouring elements: a(1)+a(2), a(3)+a(4), a(5)+a(6), and so on. Note that with this scheme all the pairs can be added simultaneously! In the following steps we act in exactly the same way, obtaining a parallel version of the algorithm.
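The pairwise scheme can be sketched as follows (a sequential model of the parallel algorithm: at every step all the additions in the inner loop are independent of each other and could be performed simultaneously on different processors):

#include <stdio.h>

/* Pairwise (tree) summation: about log2(n) steps instead of n - 1. */
static double pairwise_sum(double *a, int n)
{
    for (int step = 1; step < n; step *= 2)            /* ~log2(n) steps    */
        for (int i = 0; i + step < n; i += 2 * step)   /* independent pairs */
            a[i] += a[i + step];
    return a[0];
}

int main(void)
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %g\n", pairwise_sum(a, 8));   /* prints 36 */
    return 0;
}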

It would seem that in this case all the problems have been resolved. But imagine that the processors available to you are heterogeneous in performance. This means there will be a moment when some of them are still working while others have already finished everything and stand idly waiting. If the spread in processor performance is large, then the efficiency of the whole system under a uniform distribution of the load will be extremely low.

But let us go further and assume that all the processors are identical. Are the problems over? No, again! Some processors have finished their work, but their results must be transferred to others to continue the summation... and the transfer takes time... and during this time the processors are idle again...

In short, making a parallel computing system or a supercomputer work at maximum efficiency on a specific program is, frankly speaking, not an easy task, since it requires carefully matching the structure of programs and algorithms to the features of the architecture of parallel computing systems.

A final question. Do you think the following statement is true: the more powerful the computer, the faster a given problem can be solved on it?

The final answer. No, that is not true. This can be explained with a simple everyday example. If one digger digs a 1m*1m*1m hole in 1 hour, then two similar diggers will do it in 30 minutes - believable. How long will it take 60 diggers? In 1 minute? Of course not! Starting from some point they will simply get in each other's way, not speeding up but slowing down the process. It is the same with computers: if the task is too small, we will spend more time distributing the work, synchronizing processes and assembling the results than on directly useful work.

It is absolutely clear that not everything is so simple...

Laboratory of Parallel Information Technologies, Research Computing Center of MSU

Ministry of Education and Science of the Russian Federation

FSBEI HPE "Bryansk State Engineering and Technological Academy"

Department of Information Technologies

Sequential and parallel processing of information

Calculation and graphic work No. 1

in the discipline

"Information Processing Technologies"

Option No. 16

RGR-02068025.230400.084

Bryansk 2015

Introduction

Parallel information processing

Shared Memory Systems

Parallel SQL Processing

Sequential information processing

Simple Batch Systems

References

Introduction

This computational and graphical study examines sequential and parallel information processing. Examples are given for each of them.

Sequential information processing is the sequential passage of information from input to output through a series of transformations (stages), so that in each period of time (specific to a given block) the transformation is carried out in only one functional block, and information comes to it only from the previous block.

Parallel information processing is a model of information processing, according to which information undergoes a series of transformations in certain functional blocks, so that at any given time it is processed simultaneously (in parallel) in several blocks.

Parallel information processing

Parallel data processing, embodying the idea of the simultaneous execution of several actions, has two varieties: pipelining and parallelism proper.

Parallel Processing. If a certain device performs one operation per unit of time, then it will perform a thousand operations in a thousand units. If we assume that there are five identical independent devices capable of operating simultaneously, then a system of five devices can perform the same thousand operations not in a thousand, but in two hundred units of time. Similarly, a system of N devices will perform the same work in 1000/N units of time. Similar analogies can be found in life: if one soldier digs up a garden in 10 hours, then a company of fifty soldiers with the same abilities, working simultaneously, will cope with the same work in 12 minutes - the principle of parallelism in action!

Pipeline processing. What is needed to add two real numbers represented in floating-point form? A whole lot of small operations such as comparing exponents, aligning exponents, adding mantissas, normalization, etc. The processors of the first computers performed all these "micro-operations" for each pair of arguments, one after another, until they reached the final result, and only then proceeded to process the next pair of operands.

The idea of pipeline processing is to isolate the individual stages of performing a general operation, so that each stage, having completed its work, passes the result to the next one while simultaneously receiving a new portion of input data. We get an obvious gain in processing speed by overlapping previously separated operations. Suppose an operation consists of five micro-operations, each of which takes one unit of time. A single indivisible sequential device will process 100 pairs of arguments in 500 units. If each micro-operation is separated into its own stage of a pipeline device, then on the fifth unit of time the first five pairs of arguments will be at different stages of processing, and the whole set of one hundred pairs will be processed in 5 + 99 = 104 units of time, a speedup of almost five times compared with the sequential device (equal to the number of pipeline stages).

It would seem that pipeline processing could be successfully replaced by ordinary parallelism, by duplicating the main device as many times as there are pipeline stages. Indeed, the five devices in the previous example will process 100 pairs of arguments in 100 units of time, which is faster than the running time of the pipeline device! However, by increasing the number of devices fivefold, we significantly increase both the amount of equipment and its cost. Imagine that a car plant decided to remove the assembly line while maintaining the rate of car production. If previously there were a thousand cars on the assembly line at the same time, then, acting by analogy with the previous example, it would be necessary to recruit a thousand teams, each of which is able to completely assemble a car from start to finish, performing hundreds of different kinds of operations, and to do so in the same time that a car previously spent on the assembly line.

Today, parallelism in computer architecture surprises few people. All modern microprocessors use some form of parallel processing. In the Pentium 4 core, up to 126 micro-operations can be in flight simultaneously at different stages of execution. At the same time, these ideas themselves appeared a very long time ago. Initially, they were implemented in the most advanced, and therefore one-of-a-kind, computers of their time. Then, after the technology matured and production became cheaper, they moved down to middle-class computers, and today all of this is fully embodied in workstations and personal computers.

The performance of many applications running on single-processor computer systems can be significantly improved by using parallel processing tools. The following introduces the basic concepts of parallel processing and multiprocessor computer architecture.

When multiple applications request that their jobs be processed on a single-processor computer, its single processor has to do all the work. The purpose of parallel processing is usually to improve application performance. When an application issues a job request to a multiprocessor computer, the computer breaks the job into logical subtasks and then processes them using multiple processors in parallel, reducing the time needed to complete the job. The number of subtasks obtained by splitting one large task is called the degree of parallelism. The reduction in the processing time needed to complete a task is directly proportional to the degree of parallelism. Systems with parallel processing are designed so as to ensure the maximum performance of each processor in the system.

Parallel data processing


Lecture 1

Parallel data processing

Plan

1. Tiered-parallel form of the algorithm.

2. Automatic concurrency detection.

3. Degree and levels of parallelism.

4. Types of parallelism.

Parallelism is the ability to perform several arithmetic, logical or service operations simultaneously; the operations can be either large-block or small-block.

The performance of parallel computers depends on many factors, and to a large extent on the architecture and structure of the system (draw the structure of a parallel system and explain):

on the degree and level of parallelism in the system;

on the organization of data transfer between processors working in parallel;

on the switching system;

on the interaction of processors and memory;

on the relationship between the hardware and software implementation of macro-operations.

Parallel processing can be based on various principles:

Spatial parallelism;

Temporal parallelism:

  1. Pipelining.
  2. Vectorization.
  3. Matrix processing.
  4. Systolic processing.
  5. Organization of a dataflow processing structure.
  6. Organization of the system based on a hypercube structure.
  7. Dynamic restructuring of the structure of the computing system.

The description of any algorithm is hierarchical and based on the nesting property. In programming, the following nesting levels are distinguished: jobs, tasks, subtasks (processes), macro-operations, and operations. Nesting determines the depth of parallelization and is one of the important properties of algorithms in the analysis of parallel computing models.

1. Tiered-parallel form of the algorithm

The most general form of representing an algorithm is its information-control graph, which reflects the data dependences between the operators of the algorithm as well as the unconditional and conditional branches in the program. Such a graph implicitly contains all types of parallelism for the chosen method of solving the problem. A more specific form of representing task parallelism is the tiered-parallel form (TPF).

An algorithm in tiered-parallel form is represented as a set of tiers, with the zero tier containing operators (branches) that are independent of each other.

The graph can include transitions, meaning the transfer of the result of a primitive operation from one tier to an operation in the next tier. Tiers are separated by transitions. There can be "empty" transitions and "empty" primitive operations. An empty operation corresponds to preserving a result obtained at the previous tier. In a sequential chain of operations, an empty operation can be placed in any tier.

When constructing the TPF, one relies on a basic set of primitive operations (BNO). The tiered-parallel form is characterized by the following parameters:

1. Length of the graph (number of tiers) L.

2. Width of the i-th tier, b_i.

3. Width of the tiered-parallel graph, B = max(b_i).

4. Average width of the TPF graph, B_avg.

5. Fill factor of the i-th tier, k_i.

6. Scatter coefficient of operations in the graph, Q_i^j, j ∈ BNO, determined by the number of operations of the j-th type in the i-th tier.

7. The minimum required number of processing units (from the BNO) to implement the algorithm represented by the graph in the TPF.

8. The minimum solution time of the algorithm (the sum of the response times of the processing units with the maximum amount of computation in each tier), T_min.

9. The connectivity of the algorithm (the number of intermediate results that must be stored during the implementation of the algorithm), C.
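As an illustration of these parameters, here is a small sketch on a made-up algorithm graph (the graph itself is hypothetical, not an example from the lecture): each operation is assigned to a tier, after which the number of tiers L, the tier widths b_i and the graph width B are computed.

#include <stdio.h>

#define N 6

int main(void)
{
    /* dep[i][j] = 1 means operation i uses the result of operation j.
       Example graph: operations 0..3 have no predecessors,
       4 = f(0, 1), 5 = g(4, 2, 3).  Operations are topologically ordered. */
    int dep[N][N] = {{0}};
    dep[4][0] = dep[4][1] = 1;
    dep[5][4] = dep[5][2] = dep[5][3] = 1;

    /* tier of an operation = 1 + maximum tier of the operations it uses;
       operations of one tier are independent and can run in parallel */
    int tier[N];
    for (int i = 0; i < N; i++) {
        tier[i] = 0;
        for (int j = 0; j < i; j++)
            if (dep[i][j] && tier[j] + 1 > tier[i])
                tier[i] = tier[j] + 1;
    }

    int L = 0;                                /* number of tiers */
    for (int i = 0; i < N; i++)
        if (tier[i] + 1 > L) L = tier[i] + 1;

    int B = 0;                                /* width of the graph */
    for (int t = 0; t < L; t++) {
        int b = 0;                            /* width b_t of tier t */
        for (int i = 0; i < N; i++)
            if (tier[i] == t) b++;
        if (b > B) B = b;
        printf("tier %d: width %d\n", t, b);
    }
    printf("L = %d tiers, B = %d\n", L, B);
    return 0;
}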

2. Automatic concurrency detection

There are two possible ways to construct a parallel algorithm: directly from the problem statement or by transforming a sequential algorithm.

Methods for constructing a parallel algorithm from a sequential one are based on identifying typical, frequently occurring constructions in the sequential algorithm, which are replaced by parallel ones according to certain rules. (This allows one, to a certain extent, to recover the degree of parallelism lost by the algorithm when it was programmed in a sequential language.)

The nature of changes in the degree of parallelism during the preparation of a machine program is shown in Fig. 2.2.

Fig. 2.2. Change of potential parallelism during program development (potential parallelism plotted against the stages: solution method, source text, machine program): 1 - parallel programming system; 2 - sequential programming with a vectorizing compiler.

Despite the lower level of parallelism achieved when a parallel algorithm is constructed by transforming a sequential one, this method is widely used, since it makes it possible to reuse expensive application programs developed and debugged for sequential computers.

In a sequential program one distinguishes explicit and hidden parallel processing.

When analyzing a program, a dataflow graph is constructed. To detect explicit parallelism between processes, the sets of input (read) variables R and output (written) variables W of each process are analyzed.

Explicit parallel processing can be detected between processes i and j (i ≠ j) that satisfy the following conditions: the input data of one process must not be modified (written) by the other process, and no two processes may modify shared variables, i.e.:

a) R_i ∩ W_j = ∅;

b) W_i ∩ R_j = ∅;

c) W_i ∩ W_j = ∅.
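A minimal sketch of this test (the variables are represented here as bits of a machine word, and the two processes and their read/write sets are hypothetical) checks the three conditions directly:

#include <stdio.h>

/* R and W sets of a process stored as bit masks over variables 0..31. */
typedef struct { unsigned r, w; } proc_sets;

/* Two processes are independent if R_i & W_j, W_i & R_j and W_i & W_j
   are all empty. */
static int independent(proc_sets a, proc_sets b)
{
    return (a.r & b.w) == 0 && (a.w & b.r) == 0 && (a.w & b.w) == 0;
}

int main(void)
{
    /* hypothetical processes: P1 reads {x0, x1} and writes {x2};
                               P2 reads {x3}     and writes {x4}   */
    proc_sets p1 = { (1u << 0) | (1u << 1), 1u << 2 };
    proc_sets p2 = { 1u << 3, 1u << 4 };
    printf("P1 and P2 %s be executed in parallel\n",
           independent(p1, p2) ? "can" : "cannot");
    return 0;
}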

Hidden parallel processing requires some transformation procedure applied to the sequential program to make its parallel execution possible. The transformation may be as follows:

a) reducing the height of the trees of arithmetic expressions (Fig. 2.3). For arithmetic expressions with n variables or constants, reducing the height of the tree makes it possible to achieve processing that is faster by about a factor of O(n / log2 n) using O(n) processors;

b) transformation of linear recurrence relations;

Fig. 2.3. Reducing tree height: ((a + b) + c) + d is rebalanced into (a + b) + (c + d).

c) replacement of operators;

d) converting blocks of conditional jumps and loops to canonical form;

e) loop distribution.

Parallel architectures achieve high performance if the parallelism transformation takes into account the features of the computer architecture on which the algorithm is supposed to be executed.

When transforming parallelism, the following are taken into account: 1) the layout of data in memory; 2) memory addressing (indexing); 3) the choice of data route (the way processors and memory devices are connected).

Fig. 2.4. Storage of matrices with a shift.

As an example of taking the memory layout into account, consider diagonally addressed memory. To ensure parallel processing of matrices, the elements of their rows and columns must be distributed among the memory devices of the processors in such a way that they can be read and processed simultaneously. For this, the matrix is stored with a shift (Fig. 2.4).
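One common skewed-storage scheme (given here only as an illustration; the shift rule in the lecture's Fig. 2.4 may differ) places element (i, j) of an n x n matrix in memory bank (i + j) mod n, so that the n elements of any row and of any column all fall into different banks and can be fetched in parallel:

#include <stdio.h>

#define N 4

int main(void)
{
    /* print the bank assignment for a 4 x 4 matrix stored with a shift */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("a[%d][%d]->bank %d   ", i, j, (i + j) % N);
        printf("\n");
    }
    return 0;
}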

Any algorithm contains sequential (scalar) sections. It has been proven that the length of these scalar sections is the determining factor when implementing the algorithm on a parallel computer.

3. Degree and levels of parallelism

The degree of parallelism D is the order of the number of devices working in parallel in the system when implementing the algorithm of a task, provided that the number of processors (processing devices) is not limited. (There is another definition: the degree of parallelism is the number of processors of a multiprocessor system participating in parallel in the execution of the program at each moment of time t.)

1) Low degree: from 2 to 10 processors.

2) Medium degree: from 10 to 100 processors.

3) High degree: from 100 to 10^4 processors.

4) Ultra-high degree: from 10^4 to 10^6 processors.

Fig. 2.5. Parallelism profile

The graphical representation of the parameter D(t) as a function of time is called the program parallelism profile. Changes in the level of processor utilization during the observation period depend on many factors (the algorithm, the available resources, the degree of optimization provided by the compiler, etc.). Fig. 2.5 shows a typical parallelism profile.

Application programs contain a wide range of potential parallelism. In computation-intensive programs, from 500 to 3500 arithmetic operations could be executed in parallel in each cycle if a suitable computing environment existed. However, even a properly designed superscalar processor can sustain only 2 to 5.8 instructions per cycle. This drop is primarily due to communication and system overhead.

The degree of parallelism depends significantly on the architecture of the computer, especially on the switching system, the organization of the interaction of the processors working in parallel, and the methods of data exchange between processors and memory. An even greater impact on the performance of computing facilities than the degree of parallelism is exerted by the level of parallelism.

Algorithmic and circuit levels of parallelism are considered.

The following algorithmic levels of parallelism are distinguished:

1. Task level:

a) between tasks;

b) between task phases.

2. Software level:

a) between parts of the program (parts of one task are executed on many computers);

b) within loops.

(If the individual iterations of a loop do not depend on each other. For example: For I:=1 to N do A(I):=B(I) + C(I); see the OpenMP sketch after this list.)

3. Command level:

a) between phases of command execution.

4. Arithmetic and bit level:

a) between elements of a vector operation;

b) inside the logic circuits of the ALU.

Each level is characterized by certain properties, on the basis of which special structures of computing facilities have been developed. The instruction level is implemented in all modern computers, including personal computers.
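The loop example given above for parallelism within loops (item 2b) can be expressed, for instance, with an OpenMP directive; this is only a sketch of one possible notation, to be compiled with OpenMP support enabled:

#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* the iterations a[i] = b[i] + c[i] are independent of each other,
       so they can be divided among threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %g\n", a[N - 1]);
    return 0;
}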

The circuit level of parallelism is the hardware level at which the parallelization of data processing or the organization of parallel computing is carried out.

Parallel processing can be implemented at the following circuit levels:

1. The level of logic gates and memory elements. This is the lowest level, the level of transistors. Here parallel logic circuits are built from logic gates (for example, a parallel adder).

2. The level of logic circuits and simple automata with memory. Here a parallel elementary automaton (EA) is built.

3. The level of registers and memory integrated circuits. From elementary automata one obtains parallel microprocessor circuits (MP).

4. The level of elementary microprocessors. From microprocessors, parallel macroprocessors for performing medium-block operations (MAP) are built.

5. The level of macroprocessors implementing large-block operations. Here parallelism of macro-operations is realized, and parallel multiprocessor systems (MPS) are built from macroprocessors.

6. The level of computers, processors and programs. This is the highest level of parallelism, at which parallel computing systems (CS) are obtained from multiprocessor systems.

4. Types of parallelism

4.1. Natural parallelism and parallelism of multiple objects

In the information graph, “vertical” independent subgraphs can be identified that do not mutually use any intermediate results obtained when implementing primitive operations of another subgraph. This type of parallelism is called natural parallelism of independent tasks.

A task has natural parallelism if, in its original formulation, it reduces to an operation on multidimensional vectors, multidimensional matrices or lattice functions (Fig. 2.6). Intermediate results of the tasks are not used here. Each task is programmed independently of the others. This type of parallelism does not require combining computers into complexes; however, increasing the number of independent tasks in a computing system increases its throughput. For example: processing database transactions on multiprocessor servers.

Fig. 2.6. Information graph of a task characterized by natural parallelism (independent subgraphs for task 1 and task 2)

Fig. 2.7. Information graph of a task characterized by parallelism of multiple objects

Parallelism of multiple objects is a special case of natural parallelism. Its essence is that the task consists of processing information about different but similar objects using the same or almost the same program (Fig. 2.7).

Here the so-called integral operations are characteristic. The initial operands of integral operations are vectors, functions, or sets of objects, and the result is a number. For example, calculating the dot product of n-dimensional vectors,

(a, b) = a_1*b_1 + a_2*b_2 + ... + a_n*b_n,

includes two kinds of operations: pairwise products of the vector components, followed by an "integral operation" (an operation on an n-dimensional vector) that sums all components of the resulting vector.
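A minimal sketch of this in C with OpenMP (the function and array names are illustrative, not from the lecture): the pairwise products are computed in parallel, and the "integral operation", the summation of all components, is expressed as a reduction.

#include <omp.h>

/* Dot product: parallel pairwise products followed by a summation
   (the "integral operation") implemented as a reduction.          */
double dot_product(const double *a, const double *b, int n)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}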

When parallelizing over multiple objects, situations arise more often than in the general case in which certain parts of the computation must be performed differently for different objects.

For example, when computing the values of a function defined on a bounded region, the values at interior points are calculated by one formula, and the values at boundary points by another.
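A minimal sketch in C with OpenMP (the functions f_interior and f_boundary and the array names are hypothetical, introduced only for this illustration): all points are processed as similar objects by one parallel loop, but boundary points are computed by a different formula, which is exactly what increases the divergence D introduced below.

#include <omp.h>

/* Hypothetical formulas for interior and boundary points. */
static double f_interior(double x) { return x * x; }
static double f_boundary(double x) { return 0.0; }

/* Every grid point is a similar "object"; boundary points are
   computed by a different formula than interior points.       */
void evaluate_on_grid(const double *x, double *y, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (i == 0 || i == n - 1)
            y[i] = f_boundary(x[i]);
        else
            y[i] = f_interior(x[i]);
    }
}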

Parallelism of multiple objects is characterized by the following parameters:

1. Total program length L: the lengths of the operators of all branches are summed.

2. Average program length L_avg: calculated with respect to the rank of the task.

The main quantitative characteristic of a parallelized problem is its rank r: the number of parameters over which parallel processing must be carried out (for example, the number of vector components, or the number of points at which a function is specified).

3. Divergence of the problem D: if the information-processing program is exactly the same for all r objects, then D = 1; the more the programs for different objects differ from one another, the larger D becomes.

4.2. Parallelism of independent branches

The essence of parallelism of independent branches is that independent parts, called branches, can be distinguished in the program that solves a problem. If the computer has the appropriate hardware, these branches can be executed in parallel (Fig. 2.8).

Fig. 2.8. Information graph of a problem characterized by parallelism of independent branches

A program branch Y is independent of a branch X if the following conditions hold (a code sketch follows the list):

1. there are no functional connections between them, i.e. none of the input variables of branch Y is an output variable of branch X or of any branch that depends on X;
2. they are not connected through fields of working memory;
3. they are executed by different programs;
4. they are independent in control, i.e. the condition for executing branch Y does not depend on values produced during the execution of branch X or of a branch that depends on it.
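A minimal sketch in C with OpenMP sections (the routines branch_X and branch_Y are placeholders introduced only for this illustration; OpenMP is one possible way to express this, not prescribed by the lecture): two branches satisfying the conditions above share no data, no working storage, and no control dependencies, so they can be handed to different threads and executed concurrently.

#include <omp.h>

/* Placeholder bodies of two independent branches. */
static void branch_X(void) { /* ... computation of branch X ... */ }
static void branch_Y(void) { /* ... computation of branch Y ... */ }

/* Independent branches executed in parallel as OpenMP sections. */
void run_independent_branches(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        branch_X();

        #pragma omp section
        branch_Y();
    }
}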

4.3. Parallelism of adjacent operations, or local parallelism

Parallelism of adjacent operations arises when the input data for the current operations are produced at earlier stages of the computation, and the design of the computing hardware makes it possible to overlap the execution of several operations that are not related to one another through their output data or results.

Local parallelism is characterized by the following parameters:

1. The connectivity indicator of adjacent operations: the probability that the result of an operation will be used by the operation immediately following it. The lower the connectivity, the greater the depth of parallelism of adjacent operations. Typical values lie in the range 0.1 to 0.5.

2. The probability that, starting from a given operation, there exists a chain of at least l operations that can be executed simultaneously.

3. The probability that, starting from an arbitrary operation of the program, there exists a chain of exactly l operations that can be executed simultaneously.

4. The depth of parallelism of adjacent operations L_PSO: the expected value of the length of a chain of operations that can be executed simultaneously.
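One way to make item 4 precise (the notation p(l) is introduced here for illustration and is not from the lecture): if p(l) is the probability from item 3 that a chain of exactly l simultaneously executable operations starts at a randomly chosen operation, then the depth of parallelism is the expected chain length,

L_{\mathrm{PSO}} = \sum_{l \ge 1} l \cdot p(l).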

Local optimization of a program consists of examining several instructions that are to be executed in a row and reordering some of them, possibly also renumbering registers and memory cells, so as to achieve the maximum possible parallelism of adjacent operations.
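A minimal sketch in C of what such local optimization might do (the variable names are illustrative, not from the lecture): reordering statements and renaming temporaries places independent operations next to each other, so a superscalar processor can issue them in the same cycle.

/* Before optimization: each adjacent pair of statements is dependent,
   so adjacent operations cannot be executed simultaneously.           */
double chain_before(double a, double b, double c, double d, double e, double f)
{
    double t = a * b;
    t = t + c;          /* depends on t from the previous statement */
    double u = d * e;
    u = u + f;          /* depends on u from the previous statement */
    return t + u;
}

/* After local optimization: the order is changed and temporaries are
   renamed, so the two multiplications (and then the two additions)
   are independent adjacent operations.                               */
double chain_after(double a, double b, double c, double d, double e, double f)
{
    double t1 = a * b;
    double u1 = d * e;  /* independent of t1 */
    double t2 = t1 + c;
    double u2 = u1 + f; /* independent of t2 */
    return t2 + u2;
}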

In most cases, the indicator of connectivity of adjacent operations depends not so much on the problem as on the quality of local optimization.



54063. Logopsychocorrection in robots and children with speech disorders 67.5 KB
Games and the right to develop the emotional sphere Kazka-gra: About the fisherman and the fish The speech therapist reads a lesson from the fairy tale O. With cones, tension and relaxation of the muscles of the hands. Great tension and relaxation of the muscles of the legs. Vedmedica calls for gold bdzhilku to plunder with the vedmezhatami.