Lightweight Parallel Foundations 1.0.1-alpha 2023-06-26T11:02:34Z
A high performance and model-compliant communication layer

Modules | |
| Error Codes | |
Classes | |
| struct | lpf_args_t |
| struct | lpf_machine_t |
Macros | |
| #define | _LPF_VERSION 202000L |
| #define | _LPF_INCLUSIVE_MEMORY 202000L |
| #define | _LPF_EXCLUSIVE_MEMORY 202000L |
Typedefs | |
| typedef void * | lpf_t |
| typedef void(* | lpf_func_t) () |
| typedef void(* | lpf_spmd_t) (const lpf_t ctx, const lpf_pid_t pid, const lpf_pid_t nprocs, const lpf_args_t args) |
Variables | |
| typedef | lpf_pid_t |
| const lpf_args_t | LPF_NO_ARGS |
| typedef | lpf_err_t |
| typedef | lpf_init_t |
| typedef | lpf_sync_attr_t |
| typedef | lpf_memslot_t |
| typedef | lpf_msg_attr_t |
| const lpf_t | LPF_NONE |
| const lpf_t | LPF_ROOT |
| const lpf_init_t | LPF_INIT_NONE |
| const lpf_sync_attr_t | LPF_SYNC_DEFAULT |
| const lpf_msg_attr_t | LPF_MSG_DEFAULT |
| const lpf_pid_t | LPF_MAX_P |
| const lpf_machine_t | LPF_INVALID_MACHINE |
| const lpf_memslot_t | LPF_INVALID_MEMSLOT |
The C language is well known for its speed when the programmer does everything right, as well as for its Undefined Behaviour (UB) when the programmer does something wrong. The same holds true for this API: each function has well-defined behaviour provided that a set of preconditions is satisfied; when any precondition is violated, a call instead results in UB. We capture these semantics both in natural language and formally. The behaviour of a function comprises, given the input state and the function parameters, the following aspects:
For each LPF primitive defined in the core API, a strict set of rules defines the behaviour of an implementation. If a call to a primitive violates its preconditions, then UB will be the result. If, during a call to an LPF primitive, an implementation encounters an error, it will behave as defined for the encountered error condition (usually meaning the primitive shall have no effect) and return the associated error code.
To the designer of an immortal algorithm, predictability of the LPF system is just as important as its functional behaviour. Precisely specifying costs for each function individually, however, would restrict the implementation more than is necessary, as well as more than is realisable within acceptable cost.
Regarding necessity: for example, LPF defines nonblocking semantics for RDMA communication. Therefore, an implementation could choose to overlap communication with computation by initiating data transfers at each call to lpf_get() or lpf_put(), without waiting for an lpf_sync(). On the other hand, another implementation could choose to temporally separate communication from computation by delaying the initiation of data transfers until all computation globally has completed. LPF allows both (and more), yet strives to assign an unambiguous cost to the completion of the communication pattern as a whole, and in isolation from any computational cost: LPF, as a communication layer, only assigns a precise cost to communication, not computation.
In line with the above, instead of specifying the cost of each function individually, LPF specifies the total runtime cost of communication during groups of independent supersteps. If no extensions to the core API are in use, the BSP cost model specifies the wall-clock time of all calls to lpf_put(), lpf_get(), and lpf_sync() in a single superstep; see The BSP cost model for more details. An LPF implementation may enable the alternative choice of any relaxed BSP model. LPF also allows for probabilistic superstep costs that may furthermore be subject to SLAs over many executions of identical communication patterns. Finally, a call to any LPF primitive has asymptotic work complexity guarantees. These constitute the only guarantees that LPF defines on computational cost.
Through all these guarantees, probabilistic or otherwise, LPF allows for the reliable realisation of immortal algorithms.
LPF follows the BSP algorithmic model in that computation and communication are considered separately, and that a computation phase and a subsequent communication phase constitute a superstep. The BSP cost model, as well as any BSP-like cost model, exploits this separation to assign a cost to the execution of a given communication pattern. Note that a run-time system may overlap computation and communication during actual execution as long as doing so does not conflict with the chosen BSP or BSP-like cost model.
We sketch here an abstraction that applies to all BSP-like models of parallel computation, the so-called bag of messages, and how this interacts with the core API.
All code written using LPF executes in the computation phase. This means that all calls to lpf_get() and lpf_put() register communication requests with the runtime, which may or may not immediately initiate communication. A call to lpf_sync() ends the current computation phase and returns when the LPF process that called lpf_sync() can safely continue with the next superstep.
A communication request thus registered during a computational phase is a triplet \( ( s, d, b ) \). It consists of:
Any call to lpf_put() or lpf_get() is translated into such a communication request, after which the request is put into a conceptual bag of message requests \( M \). A communication request, once executed, is removed from \( M \). A subsequent call to lpf_sync() guarantees that all messages \( ( s, d, b ) \in M \) are executed and removed from \( M \). Any given process is safe to continue whenever, for all remaining messages \( (s,d,b) \in M \):
An implementation may, at any time before and during the call to a subsequent lpf_sync(), process any part of any message request already in \( M \). Messages in \( M \) may overlap in their destination memory area: these result in write conflicts. The conflicted destination memory areas, after a subsequent call to lpf_sync(), must match the result of a serialisation of the messages in \( M \). Because LPF does not define which serialisation, this conflict resolution is called arbitrary-order and is well known from literature on the CRCW PRAM model. While the end result must match a serialisation, an implementation is not required to actually serialise communication; implementations are, in fact, not even allowed to serialise for more than one process, since doing so would violate the BSP cost model, while relaxed BSP models are only allowed to improve on this default cost.
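As an illustration of arbitrary-order conflict resolution, the following hedged sketch (it assumes that sufficient memory-slot and message-queue capacity has been reserved and ignores error codes for brevity) has every process write its own ID into the same word on process 0; after the lpf_sync(), that word holds the ID of exactly one writer, but which one is not defined.

    void conflict_example( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs,
        lpf_args_t args )
    {
        lpf_pid_t mine = pid, winner = nprocs;
        lpf_memslot_t src = LPF_INVALID_MEMSLOT, dst = LPF_INVALID_MEMSLOT;
        lpf_register_global( ctx, &mine, sizeof(mine), &src );
        lpf_register_global( ctx, &winner, sizeof(winner), &dst );
        lpf_sync( ctx, LPF_SYNC_DEFAULT );   /* registrations take effect */
        /* all p processes write into the same destination word on process 0 */
        lpf_put( ctx, src, 0, 0, dst, 0, sizeof(mine), LPF_MSG_DEFAULT );
        lpf_sync( ctx, LPF_SYNC_DEFAULT );
        /* on process 0, winner now equals the ID of exactly one of the
           writers; which one is left to the implementation */
        (void) args;
    }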
While it is legal to define global communication patterns with write conflicts, some illegal patterns remain. Specifically, a global communication pattern \( M \) is illegal when
Violating this restriction results in undefined behaviour.
Recall that communication patterns in LPF are registered during the computational phase of a superstep. User code, during a computational phase, shall not address (except via LPF primitives):
Violating any of these restrictions results in undefined behaviour.
BSP-like models may be employed via dedicated implementations of the LPF core API and/or via extensions to the LPF core API. LPF specifies in which ways the core API can be extended so that any algorithm making use of extensions still functions using the standard semantics. LPF additionally requires that implementations ensure that the overall time spent communicating does not exceed the cost defined by the BSP cost model; that is, implementations may only define extensions and modify the cost model if the default cost remains a valid upper bound on the actual cost.
To make LPF available for practical use, it must be easy to retrofit existing codes with algorithms implemented using LPF.
Ideally, an application developer who makes use of a third-party binary module should never need to make any changes to his code when that module updates to make use of LPF.
To meet that requirement, this API specifies the lpf_exec() function. It takes as arguments an LPF context and an SPMD callback function; that callback receives its own LPF context, which, in turn, must always be passed as the first parameter to every other LPF SPMD function. The callback is also given the process ID and the total number of processes involved in the current SPMD run, as well as access to the input and output arguments of each individual process.
The lpf_exec() function can be called from anywhere in the program, even from within an SPMD section. It will guarantee that the callback function is executed in parallel using, at most, the specified number of parallel processes.
Example:
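The listing below is a minimal sketch only; it assumes the core header is installed as lpf/core.h, that a master process exists so that LPF_ROOT is valid, and it checks only the final error code.

    #include <lpf/core.h>
    #include <stdio.h>

    /* the SPMD program: executed once by every LPF process */
    void spmd( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs, lpf_args_t args )
    {
        printf( "Hello from LPF process %u out of %u\n",
            (unsigned) pid, (unsigned) nprocs );
        (void) ctx; (void) args;
    }

    int main( void )
    {
        /* request as many LPF processes as the machine makes available */
        const lpf_err_t rc = lpf_exec( LPF_ROOT, LPF_MAX_P, &spmd, LPF_NO_ARGS );
        return rc == LPF_SUCCESS ? 0 : 1;
    }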
If existing code already has spawned multiple processes using a framework other than LPF, then calling lpf_exec() will not re-use those pre-existing processes to run the requested LPF program; rather, it may instead spawn additional new LPF processes per pre-existing process, and then have each pre-existing process run its own instance of the requested LPF program.
If this is not intended, and the given LPF program should instead re-use the pre-existing processes and not spawn new ones, the lpf_hook() should be used instead.
Besides confinement to SPMD sections, there are no other features that help structure SPMD program text. Although the C language supports structured programming, this support is limited to sequential, von Neumann machines. In particular, C is not aware of the parallel programming rules that the use of LPF requires. Take as an example the lpf_sync() primitive, which breaks any encapsulation:
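Consider the following hedged sketch of a routine foo; the opaque library call black_box() is an illustrative name, and sufficient memory-slot and message-queue capacity is assumed to have been reserved beforehand.

    void black_box( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs ); /* library code */

    void foo( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs, lpf_args_t args )
    {
        double out = (double) pid, in = 0.0;
        lpf_memslot_t src = LPF_INVALID_MEMSLOT, dst = LPF_INVALID_MEMSLOT;
        lpf_register_global( ctx, &out, sizeof(out), &src );
        lpf_register_global( ctx, &in,  sizeof(in),  &dst );
        lpf_sync( ctx, LPF_SYNC_DEFAULT );      /* registrations take effect */

        black_box( ctx, pid, nprocs );          /* the LPF context is passed on */

        /* a cyclic shift: every process sends one word to its right neighbour */
        lpf_put( ctx, src, 0, (pid + 1) % nprocs, dst, 0, sizeof(out),
            LPF_MSG_DEFAULT );
        lpf_sync( ctx, LPF_SYNC_DEFAULT );      /* the final lpf_sync() in foo */
        (void) args;
    }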
(This code ignores error codes for simplicity.)
At first sight, the communication performed in the final lpf_sync() looks balanced. However, notice the call to black_box(), which is passed the LPF context. It could, for example, read as follows.
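The sketch below is again hedged: the fixed-size buffer assumes at most 1024 processes, and sufficient capacities are assumed to have been reserved. It gathers one word from every process onto process 0 and does not synchronise afterwards, so its puts remain pending for the caller's next lpf_sync().

    void black_box( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs )
    {
        static double local = 1.0;
        static double gathered[ 1024 ];         /* assumes nprocs <= 1024 */
        lpf_memslot_t src = LPF_INVALID_MEMSLOT, dst = LPF_INVALID_MEMSLOT;
        lpf_register_global( ctx, &local, sizeof(local), &src );
        lpf_register_global( ctx, gathered, sizeof(gathered), &dst );
        lpf_sync( ctx, LPF_SYNC_DEFAULT );
        /* every process writes one word into process 0; these puts remain
           pending until the caller's next lpf_sync() */
        lpf_put( ctx, src, 0, 0, dst, pid * sizeof(double), sizeof(double),
            LPF_MSG_DEFAULT );
        (void) nprocs;
    }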
In that case, the final lpf_sync() in foo becomes a \( (p+1) \)-relation, where \( p \) is the number of processes: process 0 then receives one word from each of the \( p \) processes on top of the single word it already receives from foo's own communication. On the other hand, the black box could instead read as follows.
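Again a hedged sketch under the same assumptions; this variant sends only a single word to the left neighbour.

    void black_box( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs )
    {
        static double out = 1.0, in = 0.0;
        lpf_memslot_t src = LPF_INVALID_MEMSLOT, dst = LPF_INVALID_MEMSLOT;
        lpf_register_global( ctx, &out, sizeof(out), &src );
        lpf_register_global( ctx, &in,  sizeof(in),  &dst );
        lpf_sync( ctx, LPF_SYNC_DEFAULT );
        /* one word to the left neighbour only; the put remains pending until
           the caller's next lpf_sync() */
        lpf_put( ctx, src, 0, (pid + nprocs - 1) % nprocs, dst, 0, sizeof(out),
            LPF_MSG_DEFAULT );
    }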
in which case the final lpf_sync() in foo would only be a \( 2 \)-relation. Hence, to understand the effect and performance of an lpf_sync(), complete knowledge of all SPMD code is required.
To enable the design of 'black boxes' (libraries) on top of LPF that do not require their users to track the use of LPF primitives within those libraries, LPF defines the lpf_rehook() primitive. It starts a new SPMD section within an existing one, mapping existing LPF processes to new ones on a one-to-one basis and providing a new context that is disjoint from the original one. This allows the definition of LPF libraries that take input and output through the standard lpf_args_t structure, as sketched below.
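As a hedged sketch of this pattern, a library entry point could hide its internal LPF usage behind lpf_rehook() as follows; the names library_entry and library_spmd are illustrative, the lpf_args_t field names are those documented below, and the LPF core header is assumed to be included.

    /* internal SPMD code of the library, written against a fresh context */
    static void library_spmd( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs,
        lpf_args_t args )
    {
        /* registrations, puts, gets, and syncs issued here cannot interfere
           with the caller's context */
        (void) ctx; (void) pid; (void) nprocs; (void) args;
    }

    /* public entry point, called collectively from the user's SPMD section */
    lpf_err_t library_entry( lpf_t ctx, const void * input, size_t input_size,
        void * output, size_t output_size )
    {
        lpf_args_t args;
        args.input = input;        args.input_size = input_size;
        args.output = output;      args.output_size = output_size;
        args.f_symbols = NULL;     args.f_size = 0;
        return lpf_rehook( ctx, &library_spmd, args );
    }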
Note that this is the exact same mechanism LPF defines to simplify the integration of LPF algorithms from arbitrary external software via lpf_exec() and lpf_hook().
Nevertheless, some higher-level libraries that require tight integration with LPF algorithms may opt to expose explicit LPF contracts for their users to track. This enables the best performance by avoiding many calls into the encapsulation-friendly lpf_rehook(), and makes sense, e.g., for a collectives library.
Higher-level libraries that provide relatively heavy computations, in contrast, would do well to make use of lpf_rehook(). For example, a library that computes large-scale fast Fourier transforms in parallel performs enough work to hide any overhead that encapsulation via lpf_rehook() incurs.
Finally, one may envision higher-level libraries that completely hide the LPF core API. LPF comes bundled with one example of this, namely, LPF BSPlib API Reference Documentation.
The LPF foresees four fundamentally different ways of executing an SPMD LPF program:
This specification only defines the lpf_hook(), lpf_exec(), and lpf_rehook(). It allows for the lpf_offload() as an implementation-defined extension.
A runtime error encountered by any third party library should never simply abort the entire application program. Likewise, calling lpf_exec() from within a larger user application should never halt the entire program without providing the opportunity to the top-level program to mitigate any errors encountered.
Runtime errors arise when a necessary and expected assumption about the system does not hold. For example, an implementation could reasonably assume that the memory allocation needed for buffer resizing, such as required by lpf_resize_message_queue(), normally succeeds. If the user requests an unrealistically large value, however, or if the system is under unusually high memory pressure, this assumption is violated. In such cases, LPF returns an error code that allows the user to mitigate the problem: for example, the caller may opt to reduce the amount of communication required (e.g., at the cost of extra supersteps) or may free non-critical memory and retry the same resize call.
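A hedged sketch of such mitigation, to be called from within an SPMD section; the halving strategy and the helper name reserve_queue are illustrative only, and the LPF core header is assumed to be included.

    lpf_err_t reserve_queue( lpf_t ctx, size_t max_msgs )
    {
        lpf_err_t rc = lpf_resize_message_queue( ctx, max_msgs );
        while( rc == LPF_ERR_OUT_OF_MEMORY && max_msgs > 1 ) {
            /* mitigate: request a smaller queue, at the cost of spreading the
               same communication pattern over more supersteps */
            max_msgs /= 2;
            rc = lpf_resize_message_queue( ctx, max_msgs );
        }
        return rc;
    }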
Runtime errors are communicated through return values. The special type for return codes is lpf_err_t, which is therefore the return type for all LPF functions. All error conditions and their coinciding error codes are specified in this document. One error code that all functions have in common is LPF_SUCCESS which means that the function executed the normal course of action. An implementation may define additional specific error codes for specific functions, though any high quality implementation minimises the number of error codes a user would have to deal with.
Sometimes errors happen that are assumed to be so rare that it is acceptable to terminate an LPF program immediately. Still, the application must have the opportunity to log the failure and/or seek end-user advice on how to proceed. For example, an LPF algorithm employed within a traditional HPC environment need not tolerate a suddenly-unplugged network cable, yet should inform its user when such a catastrophic network failure occurs. For this reason, the LPF defines the non-mitigable LPF_ERR_FATAL error code. When an LPF process encounters this error code, it enters a so-called failure state which guarantees the following and the following only:
All calls to any other LPF functions not mentioned in the above may return LPF_ERR_FATAL. If they do not, they shall function as though the process were not in a failure state.
A failure state is a property of an LPF process– LPF does not define a global (error) state. This allows for the lazy handling of failed processes, thus reducing communication and synchronisation requirements when no process is in a failure state.
There are no checks for programming errors. While it never makes sense to execute a communication request that, due to overflows for example, writes outside of any registered memory area, checking for such conditions would degrade performance significantly: not just within local computation phases, but also by increasing h-relations during communication phases. Therefore, the LPF core API ignores such programming errors entirely; instead, it clearly specifies which actions constitute programming errors and defines when such errors cause undefined behaviour. An implementation, however, may elect to provide a debugging mode that does include run-time checks for programming errors, so as to simplify debugging.
| #define _LPF_VERSION 202000L |
The version of this LPF specification. All implementations shall define this macro. The format is YYYYNN, where YYYY is the year the specification was released, and NN the number of the specifications released before this one in the same year.
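Code written against this version of the specification could, for example, guard itself as follows (a sketch; the header name lpf/core.h is an assumption).

    #include <lpf/core.h>

    #if !defined( _LPF_VERSION ) || _LPF_VERSION < 202000L
     #error "an LPF implementation of specification 202000 or newer is required"
    #endif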
| #define _LPF_INCLUSIVE_MEMORY 202000L |
An implementation that has defined this macro may never define the _LPF_EXCLUSIVE_MEMORY macro.
If this macro is defined, then 1) if an SPMD section is created using lpf_exec, then the process with PID 0 shall have the same memory address space as the process that made the corresponding call to lpf_exec; and 2) if an LPF SPMD section is started using lpf_hook, then the memory address space of the process does not change from that of prior to the call to lpf_hook.
| #define _LPF_EXCLUSIVE_MEMORY 202000L |
An implementation that has defined this macro may never define the _LPF_INCLUSIVE_MEMORY macro.
If this macro is defined, then SPMD processes started using lpf_exec or lpf_hook will never share the same address space. Some architectures require this, as do some LPF implementations.
| typedef void* lpf_t |
An LPF context.
Primitives that take an argument of type lpf_t shall require a valid LPF context. Such a context can be provided by LPF_ROOT, or can be passed as a parameter of lpf_spmd_t as called by an implementation, via a preceding call to lpf_exec or lpf_hook. Using any other instance of lpf_t shall lead to undefined behaviour.
All LPF primitives are thus safe to call concurrently from different LPF processes, as is reasonable to expect. If a single process spawns multiple threads, however, this specification does not require that the primitives be thread safe.
| typedef void(* lpf_func_t) () |
The type of a function symbol that may be broadcast using lpf_args_t, via lpf_exec and/or lpf_hook.
| typedef void(* lpf_spmd_t) (const lpf_t ctx, const lpf_pid_t pid, const lpf_pid_t nprocs, const lpf_args_t args) |
This is a pointer to the user-defined LPF SPMD program.
A user will supply his or her program to the LPF implementation as a plain C function with the prototype as defined by this lpf_spmd_t type.
| [in] | ctx | The LPF context data. Each process that runs the SPMD program gets its own unique object which it can use to access the LPF runtime independently of the other processes. |
| [in] | pid | The process ID. This is a value \( s \in \{ 0, 1, \ldots, p-1 \} \) that is unique amongst the \( p \) LPF processes involved with the SPMD section. |
| [in] | nprocs | The number of LPF processes \( p \) involved with the SPMD section. |
| [in,out] | args | The program input and output arguments corresponding to this LPF process; lpf_exec() will only communicate input and output from and to process 0. Other processes will receive args with lpf_args_t.input = lpf_args_t.output = NULL and lpf_args_t.input_size = lpf_args_t.output_size = 0. lpf_hook(), on the other hand, allows each LPF process to be assigned a different set of input/output arguments based on the process ID pid. |
| lpf_err_t lpf_exec | ( | lpf_t | ctx, |
| lpf_pid_t | P, | ||
| lpf_spmd_t | spmd, | ||
| lpf_args_t | args | ||
| ) |
Runs the given spmd function from a sequential environment as SPMD program in parallel on a BSP computer and waits for its completion.
The BSP computer is given by ctx, which must be the currently active runtime state as provided by lpf_exec() or lpf_hook(). When neither call is in effect, an implementation usually provides an LPF_ROOT runtime state, representing the master process, for the purpose of running SPMD programs. Alternatively, lpf_hook() can be used to run SPMD programs.
The SPMD program spmd will be executed using P LPF processes. If lpf_machine_t.free_p, as reported by lpf_probe(), is smaller than P, the number of LPF processes will be equal to lpf_machine_t.free_p. If P is less than lpf_machine_t.free_p, then at least one of the LPF processes will be assigned more than one processor. Such LPF processes may use subsequent calls to this lpf_exec() primitive to make use of the extra processors assigned to them.
To pass input to the SPMD section and retrieve its output results, the args parameter can be given a lpf_args_t object with nonzero lpf_args_t.input_size and lpf_args_t.output_size while pointing lpf_args_t.input and lpf_args_t.output to appropriately sized memory areas. lpf_exec() will make sure that LPF process 0 will get the contents of the input memory area via a lpf_args_t object as well. Also, lpf_exec() will provide an output memory area to the SPMD program as large as is provided to lpf_exec() itself to retrieve the output. Finally, to support dynamic polymorphism, the lpf_args_t object can be given an array with function pointers which the runtime will broadcast and translate to local memory addresses.
If _LPF_INCLUSIVE_MEMORY is defined, the call to the given spmd function will be made from the same memory address space as the current call to lpf_exec. If _LPF_EXCLUSIVE_MEMORY is defined instead, only the data passed through args will be guaranteed to be available when the call to the given spmd function is made.
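A hedged sketch of passing input and output through args follows; it assumes the core header is installed as lpf/core.h, uses only the lpf_args_t fields documented here, and omits most error handling.

    #include <lpf/core.h>
    #include <stdio.h>
    #include <string.h>

    /* process 0 doubles the input value and writes the result to output */
    void my_spmd( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs, lpf_args_t args )
    {
        if( pid == 0 && args.input_size >= sizeof(double)
            && args.output_size >= sizeof(double) )
        {
            double x;
            memcpy( &x, args.input, sizeof(x) );
            x *= 2.0;
            memcpy( args.output, &x, sizeof(x) );
        }
        (void) ctx; (void) nprocs;
    }

    int main( void )
    {
        double in = 3.14, out = 0.0;
        lpf_args_t args;
        args.input = &in;      args.input_size = sizeof(in);
        args.output = &out;    args.output_size = sizeof(out);
        args.f_symbols = NULL; args.f_size = 0;
        const lpf_err_t rc = lpf_exec( LPF_ROOT, LPF_MAX_P, &my_spmd, args );
        if( rc == LPF_SUCCESS )
            printf( "output = %f\n", out );   /* written by process 0 */
        return rc == LPF_SUCCESS ? 0 : 1;
    }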
| [in] | ctx | The current active runtime state as provided by a previous call to lpf_exec() or lpf_hook(), or, if none of the those have happened before, LPF_ROOT. LPF_ROOT may only be used if there is a master LPF process, otherwise use lpf_hook(). |
| [in] | P | The number of requested processes. |
| [in] | spmd | The entry point of the SPMD program. |
| [in,out] | args | A memory area with input for process 0 and a memory area to store its output. |
argc and argv.
| lpf_err_t lpf_hook | ( | lpf_init_t | init, |
| lpf_spmd_t | spmd, | ||
| lpf_args_t | args | ||
| ) |
Runs the given spmd function from within an existing parallel environment as an SPMD program on a BSP computer. Effectively, this collective call takes the existing parallel environment and transforms it into a BSP computer. The mapping of processes (the distinction between threads and processes is irrelevant in this context) is one-to-one, and the args parameters are passed on likewise. The type of supported 'existing parallel environments' is implementation dependent. The supported types are encapsulated in the init parameter. The reader is referred to the documentation of the implementation on how to obtain a valid lpf_init_t object.
If _LPF_INCLUSIVE_MEMORY is defined, the call to the given spmd function will be made from the same memory address space as the current call to lpf_hook. If _LPF_EXCLUSIVE_MEMORY is defined instead, only the data passed through args will be guaranteed to be available when the call to the given spmd function is made.
| [in] | init | A structure that allows multiple threads or processes in the top-layer parallelisation mechanism to communicate. The type of this parameter, as well as how to construct an instance of it, is implementation-defined. |
| [in] | spmd | The entry point of the SPMD program. |
| [in,out] | args | The input and output arguments this function has access to. |
In addition to the total BSP cost and compute time required to execute the user-defined SPMD program, the overhead of calling lpf_hook at entry is \( \mathcal{O}( (a+b)g + l ) \), where \( g, l \) are the BSP machine parameters, and \( a, b \) equal the lpf_args_t.input_size and lpf_args_t.f_size fields of args, respectively.
Upon exit of the spmd function the overhead of this function is \( \mathcal{O}( cg + l ) \), where \( c \) equals the lpf_args_t.output_size field of args.
argc and argv.
| lpf_err_t lpf_rehook | ( | lpf_t | ctx, |
| lpf_spmd_t | spmd, | ||
| lpf_args_t | args | ||
| ) |
Runs the spmd function on the BSP computer, but in a fresh context: message queues, memory registrations, etc. all start from scratch. This is useful when interfacing with code that is not aware of the LPF context of the calling code, i.e., to achieve encapsulation. The process mapping is exactly the same as in the parent context, and memory is shared between parent and child. The call is collective for all participating processes.
| [in] | ctx | The LPF context |
| [in] | spmd | The entry point of the SPMD program. |
| [in,out] | args | The input and output arguments this function has access to. |
| lpf_err_t lpf_register_global | ( | lpf_t | ctx, |
| void * | pointer, | ||
| size_t | size, | ||
| lpf_memslot_t * | memslot | ||
| ) |
Registers a memory area globally. It assigns a global memory slot to the given memory area, thus preparing its use for inter-process communication.
This is a collective function, meaning that all processes call this primitive on a local memory area they own, in the same superstep, and in the same order. For registration of memory that is referenced only locally see lpf_register_local().
The registration process is necessary to enable Remote Direct Memory Access (RDMA) primitives, such as lpf_get() and lpf_put(); both source and destination memory areas are always indicated using registered memory slots that are returned by either this function or by lpf_register_local().
A successful call to this primitive takes effect after a subsequent call to lpf_sync(). Earlier use of the resulting memslot leads to undefined behaviour. A successful call immediately consumes one memslot capacity; see also lpf_resize_memory_register on how to ensure sufficient capacity.
NULL as pointer and 0 for size is valid.
| [in,out] | ctx | The runtime state as provided by lpf_exec(). |
| [in] | pointer | The pointer to the memory area to register. |
| [in] | size | The size of the memory area to register in bytes. |
| [out] | memslot | Where to store the memory slot identifier. |
sizeof(size_t) bytes and \( p \) equals the number of SPMD processes in this LPF run. See also The BSP cost model.
On systems like Infiniband, handles to the remote memory areas must be communicated before they can be referred to. This can happen as part of a point-to-point synchronisation pattern during an lpf_sync().
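A hedged sketch of the typical registration pattern follows; it reads one word from the right neighbour, assumes that memory-slot and message-queue capacity has already been reserved, and ignores error codes for brevity.

    void neighbour_read( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs )
    {
        double mine = (double) pid, theirs = -1.0;
        lpf_memslot_t remote = LPF_INVALID_MEMSLOT, local = LPF_INVALID_MEMSLOT;
        lpf_register_global( ctx, &mine, sizeof(mine), &remote );   /* collective */
        lpf_register_local( ctx, &theirs, sizeof(theirs), &local ); /* immediate */
        lpf_sync( ctx, LPF_SYNC_DEFAULT );  /* the global registration takes effect */
        lpf_get( ctx, (pid + 1) % nprocs, remote, 0, local, 0, sizeof(theirs),
            LPF_MSG_DEFAULT );
        lpf_sync( ctx, LPF_SYNC_DEFAULT );  /* theirs now holds the neighbour's value */
        lpf_deregister( ctx, local );
        lpf_deregister( ctx, remote );
    }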
| lpf_err_t lpf_register_local | ( | lpf_t | ctx, |
| void * | pointer, | ||
| size_t | size, | ||
| lpf_memslot_t * | memslot | ||
| ) |
Registers a memory area locally. It assigns to the memory area a local memory slot so that it can be used in inter-process communication that is initiated from the local process.
Contrary to lpf_register_global(), this primitive does not enable remote processes to access this memory.
Also contrary to lpf_register_global(), a successful local memory registration takes effect immediately and does not require a subsequent call to lpf_sync(). A successful call also immediately consumes one memory slot of capacity; see lpf_resize_memory_register() on how to ensure sufficient capacity.
The registration process is necessary to enable Remote Direct Memory Access (RDMA) primitives, such as lpf_get() and lpf_put(); both source and destination memory areas are always indicated using registered memory slots that are returned by either this function or by lpf_register_global().
NULL as pointer and 0 for size is also valid.
| [in,out] | ctx | The runtime state as provided by lpf_exec(). |
| [in] | pointer | The pointer to the memory area to register. |
| [in] | size | The size of the memory area to register. |
| [out] | memslot | Where to store the memory slot identifier. |
| lpf_err_t lpf_deregister | ( | lpf_t | ctx, |
| lpf_memslot_t | memslot | ||
| ) |
Deregisters a locally or globally registered memory area and puts the slot back into the pool of free memory slots.
For global memory slots this is a collective function, which means that all processes must call this function on the same global memory slots in the exact same order. For locally registered memory areas, the deregistration can be performed in any order.
Deregistration takes effect immediately. Any local or remote communication request using the given memslot in the current superstep is illegal.
| [in,out] | ctx | The runtime state as provided by lpf_exec(). |
| [in] | memslot | The memory slot identifier to de-register. |
| lpf_err_t lpf_put | ( | lpf_t | ctx, |
| lpf_memslot_t | src_slot, | ||
| size_t | src_offset, | ||
| lpf_pid_t | dst_pid, | ||
| lpf_memslot_t | dst_slot, | ||
| size_t | dst_offset, | ||
| size_t | size, | ||
| lpf_msg_attr_t | attr | ||
| ) |
Copies contents of local memory into the memory of a (possibly remote) process. This operation is guaranteed to have completed after the next call to lpf_sync() exits. Until that time it occupies one entry in the operations queue. Concurrent reads or writes from or to the same memory area are allowed in a way resembling a CRCW PRAM with arbitrary conflict resolution. However, it is illegal to read from a memory location that is also being written.
More precisely, usage is subject to the following rules:
Breaking any of the above rules results in undefined behaviour.
As illustration, below is given a small state diagram of a memory area annotated with rule numbers from the list above.
| [in,out] | ctx | The runtime state as provided by lpf_exec() |
| [in] | src_slot | The memory slot of the local source memory area registered using lpf_register_local() or lpf_register_global(). |
| [in] | src_offset | The offset of reading out the source memory area, w.r.t. the base location of the registered area expressed in bytes. |
| [in] | dst_pid | The process ID of the destination process |
| [in] | dst_slot | The memory slot of the destination memory area at dst_pid, registered using lpf_register_global(). |
| [in] | dst_offset | The offset of writing to the destination memory area w.r.t. the base location of the registered area expressed in bytes. |
| [in] | size | The number of bytes to copy from the source memory area to the destination memory area. |
| [in] | attr | In case an attr not equal to LPF_MSG_DEFAULT is provided, the message created by this function may have modified semantics that may be used to extend this API. Examples include:
These attributes are stored after a call to this function has completed and may be modified immediately after without affecting any messages already scheduled. |
Possible operations that API extensions could provide via operations on message attributes of type lpf_msg_attr_t are, for instance,
Such extensions are implementation-dependent. Prescribing a specific behaviour in this specification would otherwise lead to a very specific type of BSP, whereas we aim for a maximally generic specification here.
| lpf_err_t lpf_get | ( | lpf_t | ctx, |
| lpf_pid_t | src_pid, | ||
| lpf_memslot_t | src_slot, | ||
| size_t | src_offset, | ||
| lpf_memslot_t | dst_slot, | ||
| size_t | dst_offset, | ||
| size_t | size, | ||
| lpf_msg_attr_t | attr | ||
| ) |
Copies contents from remote memory to local memory. This operation completes after one call to lpf_sync(). Until that time it occupies one entry in the operations queue. Concurrent reads or writes from or to the same memory area are allowed in a way resembling a CRCW PRAM with arbitrary conflict resolution. However, it is illegal to read from a memory location that is also being written.
More precisely, usage is subject to the following rules:
Breaking any of the above rules results in undefined behaviour.
As illustration, below is given a small state diagram of a memory area annotated with rule numbers from the list above.
| [in,out] | ctx | The runtime state as provided by lpf_exec(). |
| [in] | src_pid | The process ID of the source process. |
| [in] | src_slot | The memory slot of the source memory area at src_pid, as globally registered with lpf_register_global(). |
| [in] | src_offset | The offset of reading out the source memory area, w.r.t. the base location of the registered area expressed in bytes. |
| [in] | dst_slot | The memory slot of the local destination memory area registered using lpf_register_local() or lpf_register_global(). |
| [in] | dst_offset | The offset of writing to the destination memory area w.r.t. the base location of the registered area expressed in bytes. |
| [in] | size | The number of bytes to copy from the source remote memory location. |
| [in] | attr | In case an attr not equal to LPF_MSG_DEFAULT is provided, the message created by this function may have modified semantics that may be used to extend this API. Examples include:
These attributes are stored after a call to this function has completed and may be modified immediately after without affecting any messages already scheduled. |
| lpf_err_t lpf_sync | ( | lpf_t | ctx, |
| lpf_sync_attr_t | attr | ||
| ) |
Terminates the current computation phase, then executes all globally pending communication requests. The local part of the global communication phase is guaranteed to have finished when this function exits, after which the next computation phase starts.
| [in,out] | ctx | The runtime state as provided by lpf_exec() |
| [in] | attr | Any additional hints the user can provide the runtime with in hopes of speeding up the synchronisation process beyond what is predicted by the BSP cost model. An implementation must always adhere to the BSP cost contract no matter what hints the lpf_sync() is provided with. The default, and only possible lpf_sync_attr_t defined by this specification is the LPF_SYNC_DEFAULT. Any other attributes are extensions to this API. |
| lpf_err_t lpf_probe | ( | lpf_t | ctx, |
| lpf_machine_t * | params | ||
| ) |
This primitive allows a user to inspect the machine that this LPF program has been assigned. All resources reported in the lpf_machine_t struct are under full control of the current instance of the LPF implementation for the lifetime of the supplied ctx parameter.
A call to this function with the ctx parameter equal to LPF_ROOT will yield full information on the entire machine assigned to the LPF implementation. Within an LPF SPMD section, each process is assigned a subset of the available resources. Information on that subset may be enquired using the lpf_t context provided by the lpf_exec() of interest.
| [in] | ctx | The currently active runtime state as provided by a top-level lpf_exec(). Outside an SPMD section, LPF_ROOT may be used if there is a master LPF process. |
| [out] | params | The BSP machine parameters |
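A hedged sketch of probing before spawning; it assumes a master LPF process exists so that LPF_ROOT is valid, and my_spmd is a hypothetical SPMD function of type lpf_spmd_t.

    extern void my_spmd( lpf_t, lpf_pid_t, lpf_pid_t, lpf_args_t ); /* hypothetical */

    int run_on_all_free_processes( void )
    {
        lpf_machine_t machine = LPF_INVALID_MACHINE;
        if( lpf_probe( LPF_ROOT, &machine ) != LPF_SUCCESS )
            return 1;
        /* request exactly as many LPF processes as are currently available */
        const lpf_err_t rc =
            lpf_exec( LPF_ROOT, machine.free_p, &my_spmd, LPF_NO_ARGS );
        return rc == LPF_SUCCESS ? 0 : 1;
    }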
| lpf_err_t lpf_resize_memory_register | ( | lpf_t | ctx, |
| size_t | max_regs | ||
| ) |
Resizes the memory register for subsequent supersteps.
The new capacity becomes valid after a next call to lpf_sync().
Each local (lpf_register_local()) or global (lpf_register_global()) registration counts as one.
The local process allocates enough resources to register up to max_regs memory regions. This is a purely local property and does not guarantee that other processes have the same capacity; indeed, different processes may require different memory slot capacities, which LPF allows. Since the programmer can always determine a suitable upper bound for the number of registrations when designing her algorithm, there are no runtime out-of-bounds checks prescribed for lpf_register_local() and lpf_register_global(); such checks would also be too costly, as error checking would require communication.
If memory allocation was successful, the return value is LPF_SUCCESS and the local process may assume the new buffer size max_regs. In the case of insufficient local memory the return value will be LPF_ERR_OUT_OF_MEMORY. In that case, it is as if the call never happened and the user may retry the call locally after freeing up unused resources. Should retrying not lead to a successful call, the programmer may opt to broadcast the error (using existing slots) or to give up by returning from the spmd section.
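A hedged sketch of the usual pattern at the start of an SPMD section follows; the capacities of four registrations and two messages per process are illustrative only.

    void spmd_start( lpf_t ctx, lpf_pid_t pid, lpf_pid_t nprocs, lpf_args_t args )
    {
        /* reserve capacity for four registrations and two messages per
           process per superstep */
        if( lpf_resize_memory_register( ctx, 4 ) != LPF_SUCCESS ) return;
        if( lpf_resize_message_queue( ctx, 2 ) != LPF_SUCCESS ) return;
        lpf_sync( ctx, LPF_SYNC_DEFAULT );   /* the new capacities take effect */
        /* ... subsequent supersteps may rely on these capacities ... */
        (void) pid; (void) nprocs; (void) args;
    }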
| [in,out] | ctx | The runtime state as provided by lpf_exec(). |
| [in] | max_regs | The requested maximum number of memory regions that can be registered. This value must be the same on all processes. |
See also The BSP cost model.
| lpf_err_t lpf_resize_message_queue | ( | lpf_t | ctx, |
| size_t | max_msgs | ||
| ) |
Resizes the message queue to support the specified maximum communication pattern during subsequent supersteps.
The local process allocates enough resources to support patterns of up to max_msgs messages, but that does not guarantee that other processes have done the same. All processes may therefore only assume that the global minimum of max_msgs messages will be reserved. Each call to lpf_put() or lpf_get() counts as one message on the originating process and as one message on the target process, which may be the same process. Since the programmer can always determine a suitable upper bound for the maximum communication pattern when designing her algorithm, there are no runtime out-of-bounds checks prescribed for lpf_get() and lpf_put().
If memory allocation was locally successful, the return value is LPF_SUCCESS. However, in case of insufficient memory the return value will be LPF_ERR_OUT_OF_MEMORY on one or more processes. In that case, it is as if the call never happened. However, the failing processes may retry the call locally after freeing up unused resources – it is implementation defined what helps – and on success all processes may assume the new global minimum as buffer size. If the problem persists it is up to the programmer to mitigate this problem globally by broadcasting the error or to give up by returning from the spmd section.
| [in,out] | ctx | The runtime state as provided by lpf_exec(). |
| [in] | max_msgs | The requested maximum accepted number of messages per process. This value must be the same on all processes. |
See also The BSP cost model.
| typedef lpf_pid_t |
An unsigned integral type large enough to hold the number of processes. All operations that work with C unsigned integral types also work for this type.
| extern const lpf_args_t LPF_NO_ARGS |
An empty argument that can be passed to lpf_exec() when no input or output is required. It has the lpf_args_t data fields lpf_args_t.input, lpf_args_t.output, and lpf_args_t.f_symbols set to NULL. The lpf_args_t size fields lpf_args_t.input_size, lpf_args_t.output_size, and lpf_args_t.f_size are set to 0.
An implementation must thus define LPF_NO_ARGS with exactly these field values.
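A sketch using designated initialisers; the exact textual form of the definition is implementation-specific, but the field values are as stated above.

    const lpf_args_t LPF_NO_ARGS = {
        .input = NULL,     .input_size = 0,
        .output = NULL,    .output_size = 0,
        .f_symbols = NULL, .f_size = 0
    };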
| typedef lpf_err_t |
Type to hold an error return code that functions can return.
An implementation must ensure the error codes of this type may be compared for equality or for inequality.
| typedef lpf_init_t |
Represents a minimal form of initialisation data that can be used to lpf_hook() an LPF SPMD section from an existing, possibly non-bulk-synchronous, parallel environment. In particular, this object will be able to bootstrap communications between processes. How such an object can be obtained depends on the implementation.
| typedef lpf_sync_attr_t |
A mechanism to provide hints to the global lpf_sync() operator. Through this type, information may be made available to the LPF runtime that will allow it to speed up the synchronisation process. The attribute that shall induce the default behaviour shall be LPF_SYNC_DEFAULT.
The maximum BSP costs, as defined in the section The BSP cost model, shall remain an upper limit on the execution time.
Any call to lpf_sync using a modified lpf_sync_attr_t shall not modify the semantics of RDMA communication performed by lpf_get or lpf_put using the default message attribute LPF_MSG_DEFAULT.
| typedef lpf_memslot_t |
Type to identify a registered memory region. Both source and destination memory areas must be registered for direct remote memory access (DRMA).
| typedef lpf_msg_attr_t |
A mechanism to provide properties to messages scheduled via lpf_put and lpf_get. This information may allow for LPF implementations to change the functional and/or performance semantics of the message to be sent; for instance, to allow for a delay in delivery of a message, or to instruct the runtime that the message should be combined with data already residing in the destination memory, e.g., by addition.
| extern const lpf_t LPF_NONE |
| extern const lpf_t LPF_ROOT |
The context of the master process of the assigned machine. This notion of a master process is not guaranteed for all implementations; the user is advised to consult the implementation's manual because, in case of absence, the value of LPF_ROOT is undefined. Implementations that do not define LPF_ROOT must provide a mechanism to retrieve a valid instance of lpf_init_t; using such an instance, an LPF SPMD section can still be started using lpf_hook().
There shall only be one LPF process (one thread or one process) in any implementation that provides a valid LPF_ROOT.
| extern const lpf_init_t LPF_INIT_NONE |
An invalid lpf_init_t object.
| extern const lpf_sync_attr_t LPF_SYNC_DEFAULT |
Apply the default syncing mechanism in lpf_sync().
| extern const lpf_msg_attr_t LPF_MSG_DEFAULT |
| extern const lpf_pid_t LPF_MAX_P |
The maximum value that a variable of type lpf_pid_t can store. This can come in useful when you want to run an spmd function on the maximum number of processes with lpf_exec(): in that case, pass LPF_MAX_P as the P parameter (and, if no input or output is needed, LPF_NO_ARGS as the args parameter).
| extern const lpf_machine_t LPF_INVALID_MACHINE |
A dummy value to initialize variables of type lpf_machine_t at their declaration.
| extern const lpf_memslot_t LPF_INVALID_MEMSLOT |
A dummy value to initialize memory slots at their declaration. A debug implementation may check for this value so that errors can be detected.