Introduction to Parallel Computing Tutorial

Table of Contents
  1. Abstract
  2. Parallel Computing Overview
    1. What Is Parallel Computing?
    2. Why Use Parallel Computing?
    3. Who Is Using Parallel Computing?
  3. Concepts and Terminology
    1. von Neumann Computer Architecture
    2. Flynn's Classical Taxonomy
    3. Parallel Computing Terminology
    4. Potential Benefits, Limits and Costs of Parallel Programming
  4. Parallel Computer Memory Architectures
    1. Shared Memory
    2. Distributed Memory
    3. Hybrid Distributed-Shared Memory
  5. Parallel Programming Models
    1. Shared Memory Model
    2. Threads Model
    3. Distributed Memory / Message Passing Model
    4. Data Parallel Model
    5. Hybrid Model
    6. SPMD and MPMD
  6. Designing Parallel Programs
    1. Automatic vs. Manual Parallelization
    2. Understand the Problem and the Program
    3. Partitioning
    4. Communications
    5. Synchronization
    6. Data Dependencies
    7. Load Balancing
    8. Granularity
    9. I/O
    10. Debugging
    11. Performance Analysis and Tuning
  7. Parallel Examples
    1. Array Processing
    2. PI Calculation
    3. Simple Heat Equation
    4. 1-D Wave Equation
  8. References and More Information
Abstract

This is the first tutorial in the "Livermore Computing Getting Started" workshop. It is intended to provide only a brief overview of the extensive and broad topic of Parallel Computing, as a lead-in for the tutorials that follow it. As such, it covers just the very basics of parallel computing, and is intended for someone who is just becoming acquainted with the subject and who is planning to attend one or more of the other tutorials in this workshop. It is not intended to cover Parallel Programming in depth, as this would require significantly more time. The tutorial begins with a discussion of parallel computing - what it is and how it's used - followed by a discussion of concepts and terminology associated with parallel computing. The topics of parallel memory architectures and programming models are then explored. These topics are followed by a series of practical discussions on a number of the complex issues related to designing and running parallel programs. The tutorial concludes with several examples of how to parallelize simple problems. References are included for further self-study.

Overview

What Is Parallel Computing?

Serial Computing

Traditionally, software has been written for serial computation:

Serial computing generic example

For example:

Serial computing example of processing payroll

Parallel Computing

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:

Parallel computing generic example

For example:

Parallel computing example of processing payroll

Parallel Computers

IBM BG/Q Compute Chip with 18 cores (PU) and 16 L2 Cache units (L2)
Example of a typical parallel computer cluster (Source: Top500.org)

Why Use Parallel Computing?

The Real World Is Massively Complex

Real world phenomena can be simulated with parallel computing.

Main Reasons for Using Parallel Programming

  1. SAVE TIME AND/OR MONEY - working in parallel shortens completion time.
  2. SOLVE LARGER / MORE COMPLEX PROBLEMS - parallel computing can solve increasingly complex problems.
  3. PROVIDE CONCURRENCY - for example, collaborative networks.
  4. TAKE ADVANTAGE OF NON-LOCAL RESOURCES - for example, SETI's large worldwide user base.
  5. MAKE BETTER USE OF UNDERLYING PARALLEL HARDWARE - modern processors expose many parallel computing cores.

The Future (Source: Top500.org)

Who Is Using Parallel Computing?

Science and Engineering

Historically, parallel computing has been considered to be "the high end of computing," and has been used to model difficult problems in many areas of science and engineering:

Parallel computing is key to simulating a range of complex physical phenomena.

Industrial and Commercial

Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example:

Parallel computing is used in many commercial applications.

Global Applications (Source: Top500.org)

Concepts and Terminology

von Neumann Computer Architecture

John von Neumann circa 1940s (Source: LANL archives)

The basic von Neumann computer architecture has four main components:
  1. Memory
  2. Control Unit
  3. Arithmetic Logic Unit
  4. Input/Output

Parallel computers still follow this basic design, just multiplied in units; the fundamental architecture remains the same. More information on von Neumann's other remarkable accomplishments: http://en.wikipedia.org/wiki/John_von_Neumann

Flynn's Classical Taxonomy

Flynn's taxonomy classifies parallel computer architectures along the two independent dimensions of Instruction stream and Data stream, each of which can be either Single or Multiple:

  1. Single Instruction, Single Data (SISD)
  2. Single Instruction, Multiple Data (SIMD)
  3. Multiple Instruction, Single Data (MISD)
  4. Multiple Instruction, Multiple Data (MIMD)

General Parallel Computing Terminology

CPU

Contemporary CPUs consist of one or more cores - each a distinct execution unit with its own instruction stream. Cores within a CPU may be organized into one or more sockets, each socket with its own distinct memory. When a CPU consists of two or more sockets, the hardware infrastructure usually supports memory sharing across sockets.

Node

A standalone "computer in a box," usually composed of multiple CPUs/processors/cores, memory, network interfaces, etc. Nodes are networked together to form a supercomputer.

Nodes in a supercomputer

Task

A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor. A parallel program consists of multiple tasks running on multiple processors.

Pipelining

Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing.

Shared Memory

Describes a computer architecture where all processors have direct access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.

Symmetric Multi-Processor (SMP)

Shared memory hardware architecture where multiple processors share a single address space and have equal access to all resources - memory, disk, etc.

Distributed Memory

In hardware, refers to network based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.

Communications

Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network.

Synchronization

The coordination of parallel tasks in real time, very often associated with communications.

Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.

Computational Granularity

In parallel computing, granularity is a quantitative or qualitative measure of the ratio of computation to communication.

Observed Speedup

Observed speedup of a code which has been parallelized, defined as:

        wall-clock time of serial execution
        -----------------------------------
        wall-clock time of parallel execution

One of the simplest and most widely used indicators of a parallel program's performance. For example, a code that runs in 100 seconds serially and 25 seconds in parallel has an observed speedup of 4.

Parallel Overhead

Required execution time that is unique to parallel tasks, as opposed to that for doing useful work. Parallel overhead can include factors such as task start-up time, synchronizations, data communications, software overhead imposed by parallel languages, libraries and the operating system, and task termination time.

Massively Parallel

Refers to the hardware that comprises a given parallel system - having many processing elements. The meaning of "many" keeps increasing, but currently, the largest parallel computers are comprised of processing elements numbering in the hundreds of thousands to millions.

Embarrassingly (IDEALLY) Parallel

Solving many similar, but independent tasks simultaneously; little to no need for coordination between the tasks.

Scalability

Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more resources. Factors that contribute to scalability include the hardware (particularly memory-CPU bandwidths and network communication properties), the application algorithm, the associated parallel overhead, and the characteristics of your specific application.

Potential Benefits, Limits and Costs of Parallel Programming

Amdahl's Law

Amdahl's Law states that potential program speedup is defined by the fraction of code (P) that can be parallelized:
                         1
        speedup   =   --------
                       1  - P
If none of the code can be parallelized, P = 0 and the speedup is 1 (no speedup). If all of the code is parallelized, P = 1 and the speedup is (in theory) infinite. Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by:

                           1
        speedup   =   ------------
                        P   +  S
                       ---
                        N

where P = parallel fraction, N = number of processors and S = serial fraction. It soon becomes obvious that there are limits to the scalability of parallelism. For example:

                           speedup
              -------------------------------------
        N     P = .50   P = .90   P = .95   P = .99
      -----   -------   -------   -------   -------
         10      1.82      5.26      6.89      9.17
        100      1.98      9.17     16.80     50.25
      1,000      1.99      9.91     19.62     90.99
     10,000      1.99      9.99     19.96     99.02
    100,000      1.99      9.99     19.99     99.90
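The speedup formula is easy to evaluate directly. A minimal C sketch that reproduces the table above (printed values are rounded, so the last digit may differ slightly from the truncated table entries):

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / (P/N + S), where S = 1 - P */
    static double amdahl_speedup(double P, double N)
    {
        double S = 1.0 - P;          /* serial fraction */
        return 1.0 / (P / N + S);
    }

    int main(void)
    {
        const double fractions[] = { 0.50, 0.90, 0.95, 0.99 };
        const double procs[]     = { 10, 100, 1000, 10000, 100000 };

        for (int i = 0; i < 5; i++) {
            printf("N = %7.0f:", procs[i]);
            for (int j = 0; j < 4; j++)
                printf("  %6.2f", amdahl_speedup(fractions[j], procs[i]));
            printf("\n");
        }
        return 0;
    }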
However, certain problems demonstrate increased performance by increasing the problem size. For example, suppose the parallelizable 2D grid calculations take 85 seconds and the serial portion takes 15 seconds:

        2D Grid Calculations      85 seconds    85%
        Serial fraction           15 seconds    15%

Increasing the problem size (for example, by refining the grid and time step) grows the parallel portion while the serial portion stays fixed:

        2D Grid Calculations     680 seconds    97.84%
        Serial fraction           15 seconds     2.16%

Problems that increase the percentage of parallel time with their size are more scalable than problems with a fixed percentage of parallel time.
Complexity

Portability

Resource Requirements

Scalability

Strong and weak scaling

Parallel Computer Memory Architectures

Shared Memory

General Characteristics

Uniform Memory Access (UMA)

Non-Uniform Memory Access (NUMA)

Advantages

Disadvantages

Distributed Memory

General Characteristics

Advantages

Disadvantages

Hybrid Distributed-Shared Memory

General Characteristics

Advantages and Disadvantages

Parallel Programming Models

SHARED memory model on a DISTRIBUTED memory machine

Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed across networked machines, but appeared to the user as a single shared memory global address space. Generically, this approach is referred to as "virtual shared memory".

DISTRIBUTED memory model on a SHARED memory machine

Message Passing Interface (MPI) on SGI Origin 2000. The SGI Origin 2000 employed the CC-NUMA type of shared memory architecture, where every task has direct access to global address space spread across all machines. However, the ability to send and receive messages using MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used.

Shared Memory Model (without threads)

Implementations

Threads Model

Implementations

In both cases, the programmer is responsible for determining the parallelism (although compilers can sometimes help).
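As an illustration of the threads model, here is a minimal OpenMP sketch in C; it assumes a compiler with OpenMP support (for example, compiled with cc -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int n = 1000000;
        double sum = 0.0;

        /* The compiler creates a team of threads for this loop; the
           reduction clause safely combines each thread's partial sum. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("harmonic sum = %f (up to %d threads available)\n",
               sum, omp_get_max_threads());
        return 0;
    }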

POSIX Threads

OpenMP

More Information

Distributed Memory / Message Passing Model

Implementations:
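MPI is the de facto industry standard for message passing. Below is a minimal, hedged sketch of point-to-point message passing in C; it assumes an MPI installation and should be launched with at least two tasks (e.g. mpirun -np 2 ./a.out):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                     /* data local to task 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %d from task 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }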

More Information

Data Parallel Model

Implementations:

Hybrid Model

Hybrid model with MPI and OpenMP
Hybrid model with MPI and CUDA

SPMD and MPMD

Single Program Multiple Data (SPMD)

Multiple Program Multiple Data (MPMD)

Designing Parallel Programs

Automatic vs. Manual Parallelization

Fully Automatic

Programmer Directed

Understand the Problem and the Program

Programs = algorithms + data + (hardware)

Determine whether or not the problem is one that can actually be parallelized.

Example of an easy-to-parallelize problem:

Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation.

This problem can be solved in parallel. Each of the molecular conformations is independently determinable, and the calculation of the minimum energy conformation is also a parallelizable problem.

Example of a problem with little to no parallelism: calculation of the first 10,000 members of the Fibonacci series (0, 1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula:

F(n) = F(n-1) + F(n-2)

The calculation of the F(n) value uses those of both F(n-1) and F(n-2), which must be computed first.

An example of a parallel approach to this problem uses Binet's closed-form formula instead of the recurrence:

    F(n) = ( phi^n - (1 - phi)^n ) / sqrt(5)

where phi = (1 + sqrt(5)) / 2 is the golden ratio. Each F(n) can then be computed independently of the others.
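A minimal serial C sketch of this closed form (note that double precision is only accurate for roughly the first 70 or so terms, so this is purely illustrative rather than a way to compute 10,000 members):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double phi = (1.0 + sqrt(5.0)) / 2.0;   /* golden ratio */

        /* Each iteration depends only on n, so the iterations could be
           divided among tasks with no communication between them. */
        for (int n = 0; n <= 20; n++) {
            double fn = (pow(phi, n) - pow(1.0 - phi, n)) / sqrt(5.0);
            printf("F(%d) = %.0f\n", n, fn);
        }
        return 0;
    }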

Partitioning

Domain Decomposition

In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data.

Domain decomposition

There are different ways to partition data:
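Common schemes include BLOCK decomposition (each task gets a contiguous chunk) and CYCLIC decomposition (elements are dealt out round-robin). A minimal C sketch of computing each task's index range under a BLOCK distribution; the names N, ntasks and taskid are illustrative:

    #include <stdio.h>

    int main(void)
    {
        int N = 100;        /* total number of array elements */
        int ntasks = 8;     /* number of parallel tasks (illustrative) */

        for (int taskid = 0; taskid < ntasks; taskid++) {
            /* BLOCK decomposition: contiguous chunk per task; the first
               N % ntasks tasks each get one extra element. */
            int base  = N / ntasks;
            int extra = N % ntasks;
            int mystart = taskid * base + (taskid < extra ? taskid : extra);
            int mysize  = base + (taskid < extra ? 1 : 0);

            printf("task %d: elements %d..%d\n",
                   taskid, mystart, mystart + mysize - 1);
        }
        return 0;
    }

Under a CYCLIC decomposition, element i would instead be assigned to task i % ntasks.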

Partitioning examples

Functional Decomposition

In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.

Functional decomposition

Functional decomposition lends itself well to problems that can be split into different tasks. For example:

Ecosystem Modeling

Each program calculates the population of a given group, where each group's growth depends on that of its neighbors. As time progresses, each process calculates its current state, then exchanges information with the neighbor populations. All tasks then progress to calculate the state at the next time step.

Ecosystem modeling

Signal Processing

An audio signal data set is passed through four distinct computational filters. Each filter is a separate process. The first segment of data must pass through the first filter before progressing to the second. When it does, the second segment of data passes through the first filter. By the time the fourth segment of data is in the first filter, all four tasks are busy.

Signal processing

Earth Systems Modeling

Each model component can be thought of as a separate task. Arrows represent exchanges of data between components during computation: the atmosphere model generates wind velocity data that are used by the ocean model, the ocean model generates sea surface temperature data that are used by the atmosphere model, and so on.

Diagram of earth systems modeling: complex relationships between earth systems and atmospheric modeling components

Communications

Who Needs Communications?

The need for communications between tasks depends upon your problem:

You DON'T need communications

Parallel processing without need for much communication

You DO need communications

Parallel processing that needs a lot of communication

Factors to Consider

There are a number of important factors to consider when designing your program's inter-task communications:

  1. Communication overhead
  2. Latency vs. bandwidth
  3. Visibility of communications
  4. Synchronous vs. asynchronous communications
  5. Scope of communications
  6. Efficiency of communications
  7. Overhead and complexity

Finally, realize that this is only a partial list of things to consider!

Synchronization

Synchronization of tasks in parallel processing is an important design consideration.

Types of Synchronization

  1. Barrier
  2. Lock / semaphore
  3. Synchronous communication operations

Data Dependencies

Definition

A dependence exists between program statements when the order of statement execution affects the results of the program. A data dependence results from multiple uses of the same location(s) in storage by different tasks.

Examples

Loop carried data dependence:
    DO J = MYSTART,MYEND
       A(J) = A(J-1) * 2.0
    END DO
The value of A(J-1) must be computed before the value of A(J); therefore A(J) exhibits a data dependence on A(J-1), and parallelism is inhibited.

Loop independent data dependence:
    task 1        task 2
    ------        ------

    X = 2         X = 4
      .             .
      .             .
    Y = X**2      Y = X**3
As with the previous example, parallelism is inhibited: the value of Y depends on which task last stores (or communicates) the value of X.

How to Handle Data Dependencies

  1. Distributed memory architectures: communicate required data at synchronization points.
  2. Shared memory architectures: synchronize read/write operations between tasks.

Load Balancing

How to Achieve Load Balance

  1. Equally partition the work each task receives.
  2. Use dynamic work assignment. For example:
     - Sparse arrays: some tasks will have actual data to work on while others have mostly "zeros."
     - Adaptive grid methods: some tasks may need to refine their mesh while others don't.
     - N-body simulations: particles may migrate across task domains, requiring more work for some tasks.

Scheduler-task pool

Granularity

Computation / Communication Ratio

Fine-grain Parallelism

Coarse-grain Parallelism

Which is Best?

I/O

The Bad News

The Good News

Debugging

Performance Analysis and Tuning

Parallel Examples

Array Processing

This example demonstrates calculations on 2-dimensional array elements, where the computation on each array element is independent of the other array elements. The serial program calculates one element at a time:
    do j = 1,n
      do i = 1,n
        a(i,j) = fcn(i,j)
      end do
    end do
Parallel Solution 1

Fortran (column-major):

do j = mystart, myend 
  do i = 1, n 
    a(i,j) = fcn(i,j) 
  end do 
end do

C (row-major):

for (i = mystart; i < myend; i++) {
  for (j = 0; j < n; j++) {
    a[i][j] = fcn(i, j);
  }
}
One possible solution:
    find out if I am MASTER or WORKER
      
    if I am MASTER
      
      initialize the array
      send each WORKER info on part of array it owns
      send each WORKER its portion of initial array
      
      receive from each WORKER results
      
    else if I am WORKER
      receive from MASTER info on part of array I own
      receive from MASTER my portion of initial array

      # calculate my portion of array
      do j = my first column,my last column
        do i = 1,n
          a(i,j) = fcn(i,j)
        end do
      end do

      send MASTER results

    endif
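A hedged MPI sketch of this solution in C. Because each element here is computed purely from its indices, no initial data actually needs to be scattered; for brevity the sketch uses a collective MPI_Gather to return results to the MASTER instead of individual sends, and it assumes N is divisible by the number of tasks:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define N 8                       /* array dimension (illustrative) */

    static double fcn(int i, int j) { return (double)(i * N + j); }

    int main(int argc, char *argv[])
    {
        int rank, ntasks;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        int myrows = N / ntasks;              /* assumes N % ntasks == 0 */
        double *a = NULL;
        double *mypart = malloc(myrows * N * sizeof(double));

        if (rank == 0)                        /* MASTER holds the full array */
            a = malloc(N * N * sizeof(double));

        /* Each task computes its own block of rows independently. */
        for (int i = 0; i < myrows; i++)
            for (int j = 0; j < N; j++)
                mypart[i * N + j] = fcn(rank * myrows + i, j);

        /* MASTER collects all results. */
        MPI_Gather(mypart, myrows * N, MPI_DOUBLE,
                   a,      myrows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("a[%d][%d] = %f\n", N - 1, N - 1, a[N * N - 1]);

        free(mypart);
        free(a);
        MPI_Finalize();
        return 0;
    }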
Example programs

Parallel Solution 2: Pool of Tasks

Master Process:
  - Holds the pool of tasks for worker processes to do
  - Sends a worker a task when requested
  - Collects results from workers

Worker Process: repeatedly does the following
  - Gets a task from the master process
  - Performs the computation
  - Sends results to the master

    find out if I am MASTER or WORKER

    if I am MASTER

      do until no more jobs
        if request send to WORKER next job
        else receive results from WORKER
      end do

    else if I am WORKER

      do until no more jobs
        request job from MASTER
        receive from MASTER next job

        calculate array element: a(i,j) = fcn(i,j)

        send results to MASTER
      end do

    endif
Discussion

PI Calculation

The value of PI can be approximated by a Monte Carlo method: randomly generate points in a unit square, count how many fall inside the inscribed quarter circle, and take PI ~ 4 * circle_count / npoints. Serial pseudo code for this procedure:
    npoints = 10000
    circle_count = 0

    do j = 1,npoints
      generate 2 random numbers between 0 and 1
      xcoordinate = random1
      ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
    end do

    PI = 4.0*circle_count/npoints
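A minimal serial C version of this Monte Carlo procedure (rand() is used only for brevity; a production code would use a better random number generator):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int npoints = 1000000;
        int circle_count = 0;

        srand(12345);
        for (int j = 0; j < npoints; j++) {
            double x = (double)rand() / RAND_MAX;   /* random point in unit square */
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)               /* inside the quarter circle */
                circle_count++;
        }

        printf("PI is approximately %f\n", 4.0 * circle_count / npoints);
        return 0;
    }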
Parallel Solution

Each task independently generates and tests its share of the points; the MASTER then collects the counts and computes PI. Parallel pseudo code:
    npoints = 10000
    circle_count = 0

    p = number of tasks
    num = npoints/p

    find out if I am MASTER or WORKER

    do j = 1,num
      generate 2 random numbers between 0 and 1
      xcoordinate = random1
      ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
      then circle_count = circle_count + 1
    end do

    if I am MASTER

      receive from WORKERS their circle_counts
      compute PI (use MASTER and WORKER calculations)

    else if I am WORKER

      send to MASTER circle_count

    endif
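A hedged MPI sketch of the parallel version in C, using MPI_Reduce to combine the per-task counts on the MASTER rather than explicit sends; num is each task's share of the points, as in the pseudo code above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, ntasks;
        long npoints = 1000000, my_count = 0, total_count = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        long num = npoints / ntasks;          /* each task's share of the points */
        srand(12345 + rank);                  /* different random stream per task */

        for (long j = 0; j < num; j++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                my_count++;
        }

        /* Combine all tasks' counts on the MASTER (rank 0). */
        MPI_Reduce(&my_count, &total_count, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("PI is approximately %f\n",
                   4.0 * (double)total_count / (double)(num * ntasks));

        MPI_Finalize();
        return 0;
    }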
Example Programs

Simple Heat Equation

The heat equation describes the temperature change over time, given an initial temperature distribution and boundary conditions. A finite differencing scheme is employed to solve the heat equation numerically on a square region: the elements of a 2-dimensional array represent the temperature at points on the square, and the calculation of an element depends on neighbor element values. A serial program would contain code like:
    do iy = 2, ny - 1
      do ix = 2, nx - 1
        u2(ix, iy) = u1(ix, iy) +
            cx * (u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy)) +
            cy * (u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
      end do
    end do
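On a shared memory machine, this loop nest could be parallelized directly with the threads model. A minimal OpenMP sketch in C; the array sizes and the coefficients cx and cy are illustrative placeholders, and the initialization of u1 is omitted:

    #include <stdio.h>

    #define NX 100
    #define NY 100

    int main(void)
    {
        static double u1[NX][NY], u2[NX][NY];
        double cx = 0.1, cy = 0.1;    /* diffusion coefficients (illustrative) */

        /* ... u1 would be initialized with the starting temperatures here ... */

        /* Each interior point is updated only from u1, so the iterations
           are independent and can be shared among threads. */
        #pragma omp parallel for collapse(2)
        for (int ix = 1; ix < NX - 1; ix++)
            for (int iy = 1; iy < NY - 1; iy++)
                u2[ix][iy] = u1[ix][iy]
                    + cx * (u1[ix+1][iy] + u1[ix-1][iy] - 2.0 * u1[ix][iy])
                    + cy * (u1[ix][iy+1] + u1[ix][iy-1] - 2.0 * u1[ix][iy]);

        printf("u2[1][1] = %f\n", u2[1][1]);
        return 0;
    }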
Parallel Solution

The entire array is partitioned and distributed as subarrays to all tasks. Interior elements of a subarray can be updated independently, but border elements depend on a neighbor task's data and must be exchanged at each time step:
    find out if I am MASTER or WORKER

    if I am MASTER
      initialize array
      send each WORKER starting info and subarray
      receive results from each WORKER

    else if I am WORKER
      receive from MASTER starting info and subarray

      # Perform time steps
      do t = 1, nsteps
        update time
        send neighbors my border info
        receive from neighbors their border info
        update my portion of solution array
       
      end do
     
      send MASTER results
         
    endif
Example Programs

1-D Wave Equation

In this example, the amplitude along a uniform, vibrating string is calculated after a specified amount of time has elapsed. The amplitude at each point depends on neighboring points and on the value at the previous time step:
    A(i,t+1) = (2.0 * A(i,t)) - A(i,t-1) + (c * (A(i-1,t) - (2.0 * A(i,t)) + A(i+1,t)))

where c is a constant

1-D Wave Equation Parallel Solution
    find out number of tasks and task identities

    #Identify left and right neighbors
    left_neighbor = mytaskid - 1
    right_neighbor = mytaskid + 1
    if mytaskid = first then left_neighbor = last
    if mytaskid = last then right_neighbor = first

    find out if I am MASTER or WORKER
    if I am MASTER
      initialize array
      send each WORKER starting info and subarray
    else if I am WORKER
      receive starting info and subarray from MASTER
    endif

    #Perform time steps
    #In this example the master participates in calculations
    do t = 1, nsteps
      send left endpoint to left neighbor
      receive left endpoint from right neighbor
      send right endpoint to right neighbor
      receive right endpoint from left neighbor

      #Update points along line
      do i = 1, npoints
        newval(i) = (2.0 * values(i)) - oldval(i)
        + (sqtau * (values(i-1) - (2.0 * values(i)) + values(i+1)))
      end do

    end do

    #Collect results and write to file
    if I am MASTER
      receive results from each WORKER
      write results to file
    else if I am WORKER
      send results to MASTER
    endif
Example Programs

This completes the tutorial.

Please complete the online evaluation form.

References and More Information
