The bulk-synchronous parallel (BSP) paradigm provides an abstract model that separates the concerns of hardware architects and software designers, thereby bridging the two. Leslie Valiant, winner of the 2010 ACM Turing Award, presented BSP as such a bridging model [1].
First and foremost, the BSP model simplifies the analysis and development of scalable and portable parallel algorithms. Unlike the preceding PRAM model [2], which ignores communication costs, the BSP cost model captures a wide range of parallel computers within a single realistic framework, thus enabling the design of truly optimal algorithms.
The BSP model outlined in Section 1 has three main elements: a framework for building well-structured algorithms, a target architecture that is both simple and realistic, and a cost model that accurately predicts running time. As a theory, BSP has been successful and has led to many tools that have had great impact in big data computing and high-performance computing (HPC). These practical tools are reviewed in Section 2. BSP will need to cope with several challenges in the coming years: ever-increasing core counts, corresponding increases in latency, hardware variability, and non-uniform memory access. Section 3 discusses these challenges. Section 4 introduces the multi-BSP model [3] as an answer to them.
1. Algorithm, architecture, and cost
A BSP algorithm consists of a sequence of supersteps that are executed concurrently by a number of processes. All processes execute the same program with the same superstep sequence, but each program may differentiate its instructions based on its unique process identity.
A BSP superstep contains both local computation and remote communication. Upon reaching a superstep, the algorithm designer may assume all previous supersteps have completed. The idea of bundling as many operations as possible in a single superstep leads to the term “bulk,” while the strict separation of one superstep from the next leads to the term “synchronous.”
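To make this concrete, the following minimal C sketch shows the single-program, superstep structure as it appears when using BSPlib primitives [4]. This is an illustration only: it assumes a BSPlib implementation is available, that the whole of main constitutes the parallel part, and exact integer types of some primitives vary between implementations.

```c
/* Minimal BSPlib-style superstep skeleton (sketch). Every process runs the
 * same program and the same sequence of supersteps, separated by bsp_sync(). */
#include <bsp.h>
#include <stdio.h>

int main(void) {
    bsp_begin(bsp_nprocs());        /* start the SPMD section on all processes */
    const int s = bsp_pid();        /* unique process identity                  */
    const int p = bsp_nprocs();     /* total number of processes                */

    /* Superstep 0: local computation only. */
    int local = s * s;
    bsp_sync();                     /* end of superstep 0                       */

    /* Superstep 1: communication would go here (e.g., via bsp_put). */
    printf("process %d of %d computed %d\n", s, p, local);
    bsp_sync();                     /* end of superstep 1                       */

    bsp_end();
    return 0;
}
```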
A BSP computer consists of a set of p processors, each of which, for purposes of algorithm analysis, executes a single computation operation in time t. Each processor is equipped with local memory and an interconnect facility that enables inter-processor communication. Data in local memory are immediately accessible to a processor, whereas data residing in remote memories can only be accessed by communication. The network is optimized for full-duplex all-to-all communication, during which processors can simultaneously receive and send one data word at constant cost g. A global synchronization facility furthermore checks whether a superstep has completed, and performs this check periodically: every l time units. A BSP computer is thus completely modeled by only four parameters, (p,t,g,l).
The BSP cost model provides a way of analyzing the supersteps of BSP algorithms to predict their performance when deployed on a BSP computer. The computation cost of a superstep is the maximum number of computation operations carried out by a processor, times t. The communication cost is the maximum number of data words sent or received by a processor, times g. The synchronization cost of a superstep is l. The cost of a complete BSP algorithm then becomes an expression of the form at+bg+cl, where (a,b,c) depend on the algorithm, (t,g,l) depend on the computer, while typically all of those parameters depend on p.
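As an illustration, consider summing n numbers on p processors in two supersteps. In the first superstep, every processor adds its n/p local numbers and sends its partial sum to processor 0; this costs roughly (n/p)t for computation, (p-1)g for the p-1 words received by processor 0, and l for synchronization. In the second superstep, processor 0 adds the p partial sums at a cost of roughly pt+l. The total BSP cost is thus approximately (n/p + p)t + (p-1)g + 2l, so that a = n/p + p, b = p-1, and c = 2, all depending on p.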
2. BSP in practice
The BSP model has evolved over the years, with pioneering work done in the 1990s by the Oxford Parallel group, led by Bill McColl. This resulted in the definition of the BSPlib standard interface in 1998 [4], accompanied by an implementation in the Oxford BSP toolset. This toolset demonstrated the opportunities that BSP offers for optimizing communication at runtime, without involvement of the programmer, for instance by automatically reordering communication requests or by automatically combining them [5].
Structured programming
Humans are poor at multitasking, and they often have similar difficulty in designing correct and efficient parallel algorithms. Imposing a structure on parallel programming, as BSP does via supersteps, leads to natural points of rest that ease thinking in parallel. The superstep counter furthermore provides a conceptual global clock. This imposed structure can be exploited for various purposes: proofs of correctness, optimization of communication, checkpointing for fault tolerance, and perhaps even energy minimization.
One great success of BSP is precisely this highly structured approach to writing parallel algorithms. Examples of successful programming paradigms that exploit BSP-structured parallel programming are: collectives in MPI [6, ch. 4], MapReduce [7], Hadoop [8], Hama [9], Pregel [10], Giraph [11], Spark [12], and Petuum [13]. Most of these platforms force a programmer to write in a superstep structure; MPI as a whole can express more unstructured programs, but can of course express superstep structures as well.
Large-scale graph computations form a prime application area of the BSP algorithmic approach. Google's Pregel system [10] and Facebook's open-source Giraph system for processing graphs [11] have both been inspired by the BSP model. In a computation by these systems, a single superstep has every vertex in the graph perform some local operations, after which every vertex communicates along its edges before starting the next superstep. In this fashion, Facebook managed to perform PageRank, label propagation, and k-means clustering computations on graphs with up to 10^12 vertices and 10^15 edges [14]. Similarly, the BSP paradigm is especially well suited for machine learning [12,13].
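The following small C program sketches this vertex-centric superstep style in serial form; it is an illustration only, not the actual Pregel or Giraph API, and the tiny hard-coded graph is assumed for the example. In each superstep, every vertex combines the messages it received along its incoming edges, updates its value, and sends new messages along its outgoing edges, here for a PageRank-like computation.

```c
/* Serial sketch of the vertex-centric, superstep-structured style used by
 * Pregel-like systems (hypothetical illustration, not Pregel's API). */
#include <stdio.h>

int main(void) {
    /* Tiny example graph in compressed sparse row form (assumed input). */
    const int n = 4;
    const int row_start[] = {0, 2, 3, 4, 6};      /* offsets into col        */
    const int col[]       = {1, 2, 2, 3, 0, 1};   /* outgoing edges          */
    double value[4] = {0.25, 0.25, 0.25, 0.25};   /* PageRank-style scores   */
    double incoming[4];

    for (int superstep = 0; superstep < 10; ++superstep) {
        for (int v = 0; v < n; ++v) incoming[v] = 0.0;
        /* "Communication": every vertex sends its share along its edges. */
        for (int v = 0; v < n; ++v) {
            const int degree = row_start[v + 1] - row_start[v];
            for (int e = row_start[v]; e < row_start[v + 1]; ++e)
                incoming[col[e]] += value[v] / degree;
        }
        /* The next superstep starts only after all messages are delivered. */
        for (int v = 0; v < n; ++v)
            value[v] = 0.15 / n + 0.85 * incoming[v];
    }
    for (int v = 0; v < n; ++v) printf("vertex %d: %.4f\n", v, value[v]);
    return 0;
}
```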
Algorithm design
Using BSP, algorithm designers can formulate algorithms that are oblivious to the parameters of the BSP computer model. Algorithms that are provably optimal regardless of the precise values of (p,t,g,l) are so-called immortal algorithms: algorithms that need to be written just once, and remain forever optimal.
In the absence of a proof of optimality, the cost model allows derivation of the asymptotic behavior of the algorithm to determine its scalability. Reducing the BSP cost of algorithms leads to practices proven healthy over decades in HPC: reducing and balancing both computation and communication, while avoiding synchronization.
The BSP model is also ideally suited for teaching parallel computing. In our own experience at Utrecht University and KU Leuven, we used BSP as our parallel computing framework because it is transparent and easy to explain, and because the limited size of the BSPlib interface eases practical implementation.
The BSPlib interface
The BSPlib interface [4] is the closest available interpretation of the BSP paradigm, since it allows any communication pattern to be expressed while still imposing a clear superstep structure. Receiving processes never need to confirm receipt, which means that all communication can be done in a one-sided and unbuffered manner. Zero-copy direct remote memory access (DRMA) achieves this through put and get primitives, which write local data directly into remote memory and retrieve remote data directly into local memory, respectively.
This contrasts with two-sided message passing, where a local send operation must be met by a matching remote receive operation; a practice which has been likened to the use of the goto-statement prior to the days of structured sequential programming [15,16]. One-sided communication furthermore ensures deadlocks cannot occur.
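A minimal sketch of such one-sided DRMA communication follows, using primitive names as in the BSPlib standard [4]; some implementations, such as MulticoreBSP for C, use size_t instead of int for sizes and offsets. Each process puts its identity directly into a registered variable on the next process, with no matching receive required.

```c
/* Sketch of one-sided DRMA communication in BSPlib. */
#include <bsp.h>
#include <stdio.h>

int main(void) {
    bsp_begin(bsp_nprocs());
    const int s = bsp_pid();
    const int p = bsp_nprocs();

    int slot = -1;
    bsp_push_reg(&slot, sizeof(int));   /* make `slot` remotely writable */
    bsp_sync();                         /* registration takes effect     */

    /* One-sided put: no matching receive is needed on the target. */
    bsp_put((s + 1) % p, &s, &slot, 0, sizeof(int));
    bsp_sync();                         /* puts are visible from here on */

    printf("process %d received %d\n", s, slot);
    bsp_pop_reg(&slot);
    bsp_end();
    return 0;
}
```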
Apart from DRMA, BSPlib also incorporates bulk-synchronous message passing, as inspired by the Green BSP library [17]. Here, data is sent to a remote receive queue in one superstep, and all queues are emptied in the next superstep. An interface similar to BSPlib is the Paderborn University BSP library [18]. BSPonMPI, released by Wijnand Suijlen in 2006, allows BSPlib programs to run on any architecture supported by MPI. For functional programming, there exists bulk synchronous ML (BSML) [19], which is accompanied by a formal calculus [20].
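Bulk-synchronous message passing can be sketched similarly, again assuming the BSPlib standard primitives [4] and the default tag size of zero: messages sent with bsp_send during one superstep appear in the destination's queue only after the next synchronization, where they can be drained with bsp_qsize and bsp_move.

```c
/* Sketch of bulk-synchronous message passing (BSMP) in BSPlib. */
#include <bsp.h>
#include <stdio.h>

int main(void) {
    bsp_begin(bsp_nprocs());
    const int s = bsp_pid();
    const int p = bsp_nprocs();

    /* Superstep 0: every process sends one integer to process 0
     * (no tag is used; the default tag size of zero is assumed). */
    bsp_send(0, NULL, &s, sizeof(int));
    bsp_sync();

    /* Superstep 1: process 0 empties its receive queue. */
    if (s == 0) {
        int nmessages, nbytes;
        bsp_qsize(&nmessages, &nbytes);
        for (int i = 0; i < nmessages; ++i) {
            int payload;
            bsp_move(&payload, sizeof(int));
            printf("process 0 received %d (of %d messages)\n", payload, p);
        }
    }
    bsp_end();
    return 0;
}
```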
The object-oriented MulticoreBSP for Java library [21] first explored the effectiveness, predictability, and scalability of BSP on the then-emerging shared-memory multicore architectures. MulticoreBSP for C [22] proves that state-of-the-art performance can be matched or exceeded while using the classical BSPlib, at least for the important kernels of sparse matrix-vector multiplication and the fast Fourier transform. MulticoreBSP for C implements an updated BSPlib interface and contains only 23 primitives, far fewer than the 452 primitives of the most-used parallel programming interface in HPC, MPI-3. This small size of the BSPlib interface makes full command of it attainable, and has repeatedly inspired students to write their own implementations for new architectures: MulticoreBSP for Java and the Parallella Epiphany BSP are two such examples.
3. Challenges to BSP
Recent trends in hardware architecture—such as a dramatic increase in the number of processor cores, larger variations in processor performance, more likely hardware failures, and deeper architectural hierarchies—pose new challenges to the BSP model. The Tianhe-2, the current leader of the TOP500 supercomputing sites, possesses 3.12 million processor cores, many of which reside on co-processors. This section outlines the problems such architectures cause for the BSP paradigm.
Latency
In contemporary computer hardware, synchronization is usually a collective barrier operation initiated over a network interconnect. It is not implemented as an automatic periodic mechanism in specialized hardware, as originally suggested for BSP by Valiant [1]. While the cost of synchronization can still be modeled via the latency parameter l, the number of processor cores in a massively parallel supercomputer has increased from hundreds to millions; synchronizing all of these is time-consuming and should not be done lightly [23, Way 5].
The use of a global barrier in a BSP system, however, is neither prescribed by the model, nor by BSPlib. If the communication pattern of a superstep is known beforehand, a BSP process can locally and without barriers determine when it is safe to proceed with the next superstep, without breaking the global consistency guaranteed by the BSP algorithm model. A well-known example of a global operation that avoids global synchronization is the tree-based allreduce operation [24]. Avoiding global barriers is furthermore a key feature of BSP systems like Pregel and Giraph, and has been incorporated in more generic BSPlib style frameworks as well [18,25].
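For illustration, the sketch below implements an allreduce (a global sum) by recursive doubling in BSPlib, assuming the number of processes is a power of two. Each of the log2(p) rounds involves only a single partner per process; a barrier-avoiding runtime such as those of [18,25] could therefore let each process wait only for its partner's data, whereas the plain BSPlib version below still issues a global bsp_sync per round.

```c
/* Sketch of a recursive-doubling allreduce (sum) in BSPlib, assuming the
 * number of processes p is a power of two. */
#include <bsp.h>
#include <stdio.h>

int main(void) {
    bsp_begin(bsp_nprocs());
    const int s = bsp_pid();
    const int p = bsp_nprocs();

    double local = (double)(s + 1);    /* value to be summed, one per process */
    double incoming = 0.0;
    bsp_push_reg(&incoming, sizeof(double));
    bsp_sync();

    for (int dist = 1; dist < p; dist *= 2) {
        const int partner = s ^ dist;           /* exchange partner this round  */
        bsp_put(partner, &local, &incoming, 0, sizeof(double));
        bsp_sync();                             /* end of this round's superstep */
        local += incoming;                      /* combine the partner's partial sum */
    }

    printf("process %d: sum = %g\n", s, local);
    bsp_pop_reg(&incoming);
    bsp_end();
    return 0;
}
```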
Hardware variability
The BSP cost model does not consider the possibility of processors slowing down due to error corrections, power management, heat constraints, and so on. The time-to-completion of a parallel program may then no longer be proportional to the work distributed to each individual process, and instead becomes dependent on parameters outside the influence of the parallel program itself. Without hard guarantees on hardware variability, computation times become unpredictable, making the notion of an optimal parallel algorithm irrelevant.
While hardware variability used to stay within stable bounds, recent trends indicate that this is changing; at large scales, hardware failure has even become unavoidable. To deal with hardware variability, MapReduce [7] and Hadoop [8], for instance, use disk-based data replication and hashing to achieve both tail-tolerance and fault-tolerance. Such techniques are already commonplace in BSP-structured big data computing [7,10,26], and are becoming equally important to HPC.
Non-uniform memory access
Modern shared-memory architectures exhibit increasingly non-uniform memory access (NUMA) times. This stems from a necessary increase in the number of compute cores per processor and from the high cost of providing a high-performance all-to-all interconnect between those cores.
NUMA effects have also appeared in network interconnects. A traditional fat-tree interconnect such as that used by the Tianhe-2 [27] naturally introduces a hierarchy of networks. The 2015 overview of HPC systems by Vetter [28] lists six hierarchical interconnects versus two non-hierarchical ones, whereas the 2013 overview still had a non-hierarchical majority. Such architectural developments run contrary to the BSP computer model, which assumes a network optimized for uniform all-to-all communication.
4. Multi-BSP
To function as a realistic bridging model, BSP should account for NUMA effects. Valiant proposes the theoretical multi-BSP model [3] to achieve this. It redefines the BSP computer model recursively: a multi-BSP computer consists of either p other multi-BSP computers, or of p regular processors. This leads to a tree hierarchy of interconnects that represents the parallel computer: each internal node represents a communication network while every node is equipped with local memory (for example, caches or main memory). All the compute power is attached to the leaves of the tree.
Since the network interconnect and the local memory sizes differ depending on the level in the architectural hierarchy, the number of parameters of the multi-BSP computer becomes linear in the number of levels. A multi-BSP algorithm must execute different types of supersteps for each level of the communication network, as illustrated in Fig. 1; the cost expression for a multi-BSP algorithm thus also grows linearly in the number of levels.
Figure 1: A multi-BSP algorithm for four processes, organized in a two-level hierarchy. The horizontal axis represents time, from left to right, while the vertical axis represents the processes. The arrows denote inter-process communication. The vertical bars separate the computations and the communications of the different levels.
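One way to picture the resulting parameter set is as a small tree data structure. The following C sketch is purely hypothetical (the type and field names are illustrative, not part of any existing multi-BSP library); it records, per level, the number of sub-components p, the communication and synchronization parameters g and l, and the memory size m, mirroring the recursive definition of the multi-BSP computer [3].

```c
/* Hypothetical description of a multi-BSP machine as a tree of levels. */
#include <stddef.h>

struct multi_bsp_level {
    int    p;   /* number of sub-components (children) at this level      */
    double g;   /* communication cost per data word at this level         */
    double l;   /* synchronization latency of this level's interconnect   */
    size_t m;   /* memory (e.g., cache or DRAM) available at this level   */
    struct multi_bsp_level *children;  /* NULL at the leaves (processors) */
};
```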
A current practical topic of investigation is the portable implementation of multi-BSP algorithms. An effective multi-BSP programming framework must automatically adapt the implemented algorithm to arbitrarily deep architectural hierarchies. Work in this direction is in progress at several research groups, including for MulticoreBSP and for BSML.
5. Outlook
BSP can play a role in unifying hybrid programming on arbitrary levels of hierarchy, providing an attractive alternative to today's mixed programming practice often seen in HPC: for example, MPI across nodes, OpenMP within nodes, and CUDA for accelerators. Results obtained with modern BSPlib implementations show that bulk synchrony is at odds with neither efficiency nor scalability. Hardware variability and failure, furthermore, can be mitigated with the help of BSP, as modern big data platforms have already shown. BSP thus can provide a unified practical framework suitable for both big data and HPC, providing resiliency without compromising efficiency: BSP still has untapped potential that can move both big data and HPC one “superstep” forward.
Created: Apr 10 2016
Last updated: Apr 10 2016
Web pages
BSP Worldwide: central web resource and mailing list
MulticoreBSP: updated BSPlib version with Java and C API for multicore computers
Epiphany BSP: BSPlib version for the Parallella board
BSML: Bulk Synchronous Parallel ML, for functional programming; also has a multi-BSP variant, Multi-ML
BSPedupack: educational programs in BSPlib for dense LU decomposition, fast Fourier transform, sparse matrix-vector multiplication
Architecture Independent Parallel Code: sorting, matrix multiplication
Articles
A bridging model for parallel computation Valiant, L. G. CACM
A bridging model for multi-core computing Valiant, L. G. J. Computer and System Sciences
BSPlib: the BSP Programming Library Hill, J. M. D.; McColl, B.; Stefanescu, D. C.; Goudreau, M. W.; Lang, K.; Rao, S. B.; Suel, T.; Tsantilas, T.; Bisseling, R. H. Parallel Computing
MulticoreBSP for C: a high-performance library for shared-memory parallel programming Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K. Int. J. Parallel Programming
Pregel: a system for large-scale graph processing Malewicz, G.; Austern, M. H.; Bik, A. J. C.; Dehnert, J. C.; Horn, I.; Leiser, N.; Czajkowski, G. Proc. ACM SIGMOD 2010
Questions and answers about BSP Skillicorn, D. B.; Hill, J. M. D.; McColl, W. F. Scientific Programming
BSP (bulk synchronous parallelism) Tiskin, A. Encyc. Parallel Computing
Books
Parallel scientific computation: a structured approach using BSP and MPI Bisseling, R. H., 2004
Conferences and Workshops
International Symposium on High-Level Parallel Programming and Applications (HLPP): annual conference on higher-level, structured approaches that “offer many possible advantages and have a key role to play in the scalable exploitation of ubiquitous parallelism.”
SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP): biennial conference organized by the SIAM Activity Group on Supercomputing. SIAM PP16 features a minisymposium on BSP.
Reviews
Parallel scientific computation Bisseling R., Oxford University Press, 2004.
HPC in big data age Cheptsov A. EuroMPI/ASIA 2014 (Proceedings of the 21st European MPI Users’ Group Meeting), Kyoto, Japan, Sep 9-12, 2014, p. 175-180.
Design patterns percolating to parallel programming framework implementation Aldinucci M., Campa S., Danelutto M., Kilpatrick P., Torquati M. International Journal of Parallel Programming 42(6), 2014, pp. 1012-1031.
References
1) Valiant, L. G. A bridging model for parallel computation. CACM 33, 8 (1990), 103-111.
2) Fortune, S.; Wyllie, J. Parallelism in random access machines. In Proc. of STOC ’78 (1978), 114-118.
3) Valiant, L. G. A bridging model for multi-core computing. Journal of Computer and System Sciences 77, 1 (2011), 154-166.
4) Hill, J. M. D.; McColl, B.; Stefanescu, D. C.; Goudreau, M. W.; Lang, K.; Rao, S. B.; Suel, T.; Tsantilas, T.; Bisseling, R. H. BSPlib: the BSP programming library. Parallel Computing 24, 14 (1998), 1947-1980.
5) Hill, J. M. D.; Skillicorn, D. B. Lessons learned from implementing BSP. Future Generation Computer Systems 13, 4-5 (1998), 327-335.
6) Snir, M.; Otto, S. W.; Huss-Lederman, S.; Walker, D. W.; Dongarra, J. MPI—the complete reference: volume 1: the MPI core (2nd ed.). MIT Press, 1998.
7) Dean, J.; Ghemawat, S. MapReduce: simplified data processing on large clusters. CACM 51, 1 (2008), 107-113.
8) White, T. Hadoop: the definitive guide (4th ed.). O’Reilly, 2015.
9) Seo, S.; Yoon, E. J.; Kim, J.; Jin, S.; Kim, J.-S.; Maeng, S. Hama: an efficient matrix computation with the MapReduce framework. In Proc. of CloudCom ’10 (2010), 721-726.
10) Malewicz, G.; Austern, M. H.; Bik, A. J. C.; Dehnert, J. C.; Horn, I.; Leiser, N.; Czajkowski, G. Pregel: a system for large-scale graph processing. In Proc. of SIGMOD ’10 (2010), 135-146.
11) Apache Giraph
12) Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauley, M.; Franklin, M. J.; Shenker, S.; Stoica, I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proc. of NSDI ’12 (2012), 15-28.
13) Xing, E. P.; Ho, Q.; Dai, W.; Kim, J. K.; Wei, J.; Lee, S.; Zheng, X.; Xie, P.; Kumar, A.; Yu, Y. Petuum: a new platform for distributed machine learning on big data. IEEE Transactions on Big Data 1, 2 (2015), 49-67.
14) Ching, A.; Edunov, S.; Kabiljo, M.; Logothetis, D.; Muthukrishnan, S. One trillion edges: graph processing at Facebook-scale. Proceedings of the VLDB Endowment 8, 12 (2015), 1804-1815.
15) Bisseling, R. H. Parallel scientific computation: a structured approach using BSP and MPI. Oxford University Press, 2004.
16) Gorlatch, S. Send-receive considered harmful: myths and realities of message passing. ACM Transactions on Programming Languages and Systems 26, 1 (2004), 47-56.
17) Goudreau, M. W.; Lang, K.; Rao, S. B.; Suel, T.; Tsantilas, T. Portable and efficient parallel computing using the BSP model. IEEE Transactions on Computers 48, 7 (1999), 670-689.
18) Bonorden, O.; Juurlink, B.; von Otte, I.; Rieping, I. The Paderborn University BSP (PUB) library. Parallel Computing 29, 2 (2003), 187-207.
19) Bousdira, W.; Gava, F.; Gesbert, L.; Loulergue, F.; Petiot, G. Functional parallel programming with revised bulk synchronous parallel ML. In Proc. of ICNC ’10 (2010), 191-196.
20) Loulergue, F.; Hains, G.; Foisy, C. A calculus of functional BSP programs. Science of Computer Programming 37, 1-3 (2000), 253-277.
21) Yzelman, A. N.; Bisseling, R. H. An object-oriented bulk synchronous parallel library for multicore programming. Concurrency and Computation: Practice and Experience 24, 5 (2012), 533-553.
22) Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K. MulticoreBSP for C: a high-performance library for shared-memory parallel programming. International Journal of Parallel Programming 42, 4 (2014), 619-642.
23) Yelick, K. A. Ten ways to waste a parallel computer. In Proc. of ISCA ’09 (2009). Slides of keynote talk.
24) Chan, E.; Heimlich, M.; Purkayastha, A.; Van De Geijn, R. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19, 13 (2007), 1749-1783.
25) Alpert, R. D.; Philbin, J. F. cBSP: zero-cost synchronization in a modified BSP model. Technical Report 97-054, NEC Research Institute, 1997.
26) Dean, J.; Barroso, L. A. The tail at scale. CACM 56, 2 (2013), 74-80.
27) Liao, X.; Xiao, L.; Yang, C.; Lu, Y. MilkyWay-2 supercomputer: system and application. Frontiers of Computer Science 8, 3 (2014), 345-356.
28) Vetter, J. S. (Ed.). Contemporary high performance computing: from petascale toward exascale, volume 2. CRC Press, 2015.