|
Lightweight Parallel Foundations 1.0.1-alpha 2023-06-26T11:02:34Z
A high performance and model-compliant communication layer
|

Classes | |
| struct | lpf_coll_t |
Macros | |
| #define | _LPF_COLLECTIVES_VERSION 201500L |
Typedefs | |
| typedef void(* | lpf_reducer_t) (size_t n, const void *array, void *value) |
| typedef void(* | lpf_combiner_t) (size_t n, const void *combine, void *into) |
Variables | |
| const lpf_coll_t | LPF_INVALID_COLL |
| #define _LPF_COLLECTIVES_VERSION 201500L |
The version of this collectives specification.
| typedef void(* lpf_reducer_t) (size_t n, const void *array, void *value) |
A type of a reducer used with lpf_reduce() and lpf_allreduce(). This function defines how an array of n elements should be reduced into a single value. The value is assumed to be initialised at function entry.
| [in] | n | The number of elements in the array. |
| [in] | array | The array to reduce. |
| [in,out] | value | An initial reduction value into which all values in the array are reduced. |
This is a user-defined function that the LPF collectives library makes callbacks to. There are no hard guarantees on the runtime incurred during the induced computation phases, nor are there any guarantees on correctness and failure mitigation.
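As an illustration of the callback contract described above, the following is a minimal sketch of a reducer that sums doubles. The names `reducer_fn` and `sum_doubles` are hypothetical; the typedef is redeclared locally, copying the documented signature, so that the sketch compiles without the LPF headers.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in with the same signature as the documented lpf_reducer_t.
   The name reducer_fn is an assumption made for this sketch only. */
typedef void (*reducer_fn)(size_t n, const void *array, void *value);

/* Reduces n doubles from array into *value by summation. The content that
   *value holds at entry takes part in the sum, since the value is assumed
   to be initialised at function entry. */
static void sum_doubles(size_t n, const void *array, void *value)
{
    assert(n == 0 || array != NULL);
    assert(value != NULL);
    const double *in = (const double *)array;
    double *acc = (double *)value;
    for (size_t i = 0; i < n; ++i) {
        *acc += in[i];
    }
}
```

A function of this shape could be passed wherever a lpf_reducer_t is expected; the casts are needed because the library only sees untyped memory.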
| typedef void(* lpf_combiner_t) (size_t n, const void *combine, void *into) |
A type of combiner used with lpf_combine() and lpf_allcombine(). This function defines how an array of n elements should be combined with an existing array of the same size. All arrays should be assumed to be fully initialised at function entry.
An lpf_combiner_t function need not be applied to the full input array in one call; both arrays of size N may be cut into smaller pieces by the LPF collectives library, which then calls this lpf_combiner_t function repeatedly on each of the smaller pieces of both arrays.
For consistent results, the lpf_combiner_t should be associative and commutative. Combiners lacking these properties should only be used with great care and a full understanding of the underlying collective algorithms, which are implementation dependent.
| [in] | n | The number of elements in both the combine and into arrays. |
| [in] | combine | The data to be combined into the destination array. |
| [in,out] | into | An array with existing data into which to combine the array combine. |
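The piecewise application described above can be sketched as follows. The combiner `add_ints` performs element-wise integer addition; `apply_in_chunks` is a hypothetical driver, not part of LPF, that mimics a library cutting both arrays into pieces before calling the combiner. Because the combiner works element by element, the chunked and unchunked results agree.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise integer addition: into[i] += combine[i] for 0 <= i < n.
   The signature follows the documented lpf_combiner_t. */
static void add_ints(size_t n, const void *combine, void *into)
{
    assert(n == 0 || (combine != NULL && into != NULL));
    const int *src = (const int *)combine;
    int *dst = (int *)into;
    for (size_t i = 0; i < n; ++i) {
        dst[i] += src[i];
    }
}

/* Hypothetical driver (an assumption of this sketch): applies the combiner
   to consecutive pieces of at most `chunk` elements each. */
static void apply_in_chunks(void (*combiner)(size_t, const void *, void *),
                            size_t n, size_t chunk,
                            const int *combine, int *into)
{
    assert(chunk > 0);
    for (size_t off = 0; off < n; off += chunk) {
        size_t len = (n - off < chunk) ? (n - off) : chunk;
        combiner(len, combine + off, into + off);
    }
}
```

A combiner that depends on the absolute position of an element within the full array would not survive this kind of splitting, which is one reason to keep combiners purely element-wise.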
| lpf_err_t lpf_collectives_init | ( | lpf_t | ctx, |
| lpf_pid_t | s, | ||
| lpf_pid_t | p, | ||
| size_t | max_calls, | ||
| size_t | max_elem_size, | ||
| size_t | max_byte_size, | ||
| lpf_coll_t * | coll | ||
| ) |
Initialises a collectives struct, which allows the scheduling of collective calls. The initialised struct is only valid after a next call to lpf_sync().
All operations using the output argument coll need to be made collectively; i.e., all processes must make the same function call with possibly different arguments, as prescribed by the collective in question.
| [in,out] | ctx | The LPF runtime state as provided by lpf_exec() or lpf_hook() via lpf_spmd_t. |
| [in] | s | The unique ID number of this process in this LPF SPMD run. Must be larger than or equal to zero and strictly smaller than p. |
| [in] | p | The number of processes involved with this LPF SPMD run. Must be larger than or equal to zero. May not be larger than the value provided by lpf_exec() or lpf_hook() via lpf_spmd_t. |
| [in] | max_calls | The number of collective calls a single LPF process may make within a single superstep. Must be larger than zero. |
| [in] | max_elem_size | The maximum number of bytes of a single element to be reduced through lpf_reduce or lpf_allreduce. Must be larger than or equal to zero. |
| [in] | max_byte_size | The maximum number of bytes given as a size parameter to lpf_broadcast, lpf_scatter, lpf_gather, lpf_allgather, or lpf_alltoall. Must be larger than or equal to zero. For the lpf_combine and lpf_allcombine collectives, the effective byte size is \( \mathit{num}\mathit{size} \). |
| [out] | coll | The collective struct needed to perform collective operations using this library. After a successful call to this function, this return lpf_coll_t value may be used immediately. |
This implementation will register one memory slot on every call to this function.
| lpf_t lpf_collectives_get_context | ( | lpf_coll_t | coll | ) |
Returns the context embedded within a collectives instance.
| [in] | coll | A valid collectives instance. |
| lpf_err_t lpf_collectives_init_strided | ( | lpf_t | ctx, |
| lpf_pid_t | s, | ||
| lpf_pid_t | p, | ||
| lpf_pid_t | lo, | ||
| lpf_pid_t | hi, | ||
| lpf_pid_t | str, | ||
| size_t | max_calls, | ||
| size_t | max_elem_size, | ||
| size_t | max_byte_size, | ||
| lpf_coll_t * | coll | ||
| ) |
Initialises a collectives struct, which allows the scheduling of collective calls. The initialised struct is only valid after a next call to lpf_sync(). This variant selects only a subset of processes based on a lower PID bound (inclusive), an upper PID bound (exclusive), and a stride.
All operations using the output argument coll need to be made collectively; i.e., all processes that are involved with coll must make the same function call with possibly different arguments, as prescribed by the collective in question.
| [in,out] | ctx | The LPF runtime state as provided by lpf_exec() or lpf_hook(). |
| [in] | s | The unique ID number of this process in this LPF SPMD run. Must be larger than or equal to zero and strictly smaller than p. May not be smaller than lo. |
| [in] | p | The number of processes involved with this LPF SPMD run. Must be larger than or equal to zero. May not be larger than the value provided by lpf_exec() or lpf_hook() via lpf_spmd_t. |
| [in] | lo | The lower process ID participating in these collectives. Must be larger than or equal to zero. May not be larger than p. May not be larger than hi. |
| [in] | hi | The upper bound on the process IDs participating in these collectives. Must be larger than or equal to zero. May not be larger than p. May not be less than lo. |
| [in] | str | The stride of process IDs. Must be larger than or equal to one. |
| [in] | max_calls | The number of collective calls a single LPF process may make within a single superstep. Must be larger than zero. |
| [in] | max_elem_size | The maximum number of bytes of a single element to be reduced through lpf_reduce or lpf_allreduce. May be equal to zero. |
| [in] | max_byte_size | The maximum number of bytes given as a size parameter to lpf_broadcast, lpf_scatter, lpf_gather, lpf_allgather, or lpf_alltoall. May be equal to zero. |
| [out] | coll | The collective struct needed to perform collective operations using this library. After a successful call to this function, this return lpf_coll_t value may be used immediately. |
This implementation will register one memory slot on every call to this function.
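To make the subset selection concrete, the sketch below assumes that the selected PIDs are lo, lo + str, lo + 2 str, and so on, for as long as they remain strictly below hi. This reading follows from the inclusive lower bound, exclusive upper bound, and stride named above, but it is an assumption and not a statement about the LPF implementation. Both helper names are hypothetical, and `unsigned` stands in for lpf_pid_t.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed rule: PID s participates if lo <= s < hi and s lies a whole
   number of strides above lo. */
static bool in_strided_group(unsigned s, unsigned lo, unsigned hi, unsigned str)
{
    assert(str >= 1);
    assert(lo <= hi);
    return s >= lo && s < hi && (s - lo) % str == 0;
}

/* Number of PIDs selected under the same assumed rule, i.e. the ceiling
   of (hi - lo) / str. */
static unsigned strided_group_size(unsigned lo, unsigned hi, unsigned str)
{
    assert(str >= 1);
    assert(lo <= hi);
    return (hi - lo + str - 1) / str;
}
```

Under this reading, lo = 2, hi = 9, and str = 3 would select the PIDs 2, 5, and 8.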
| lpf_err_t lpf_collectives_destroy | ( | lpf_coll_t | coll | ) |
Destroys a lpf_coll_t object created via a call to lpf_collectives_init() or lpf_collectives_init_strided().
This function may only be called on a successfully initialised parameter coll. The lpf_coll_t instance shall then become invalid immediately.
This is a collective call, conforming to the descriptions of lpf_coll_t, lpf_collectives_init, and lpf_collectives_init_strided.
| [in] | coll | The collectives system to invalidate. |
| lpf_err_t lpf_broadcast | ( | lpf_coll_t | coll, |
| lpf_memslot_t | src, | ||
| lpf_memslot_t | dst, | ||
| size_t | size, | ||
| lpf_pid_t | root | ||
| ) |
Schedules a broadcast of a vector of size bytes. The broadcast shall be complete by the end of a next call to lpf_sync(). This is a collective call, meaning that if one LPF process calls this function, all LPF processes in the same SPMD section must make a matching call to this function.
The root process has size bytes to transmit, as uniquely identified by src and size.
For all \( i \in \{ 0, 1, \ldots, \mathit{size} - 1 \} \), after a next call to lpf_sync() all processes with non-root ID will locally have that \( \mathit{dst}_{ i } \) equals \( \mathit{src}_{ i } \) at PID root.
| [in,out] | coll | The LPF collectives state. This argument must match across processes in the same collective call to this function and must provide the same sized memory area of size bytes. |
| [in] | src | At PID root, the memory slot of the source memory area to read from. This argument is only read from by the process with PID equal to root; there, the memory slot must correspond to a valid memory area of size at least size bytes. At processes with PID not equal to root, the memory area corresponding to this slot will not be touched and may thus have any size, including zero. The memory slot must be global; this argument must match across processes in the same collective call to this function. |
| [in] | dst | At PID not equal to root, the destination memory area. The memory slot must correspond to a valid memory area of size at least size bytes. At the process with PID root the memory area corresponding to this slot will not be touched and may be of any size, including zero. The memory slot may be local or global. Arguments do not have to match across processes in the same collective call to this function. |
| [in] | size | The number of bytes to broadcast. This argument must match across processes in the same collective call to this function. |
| [in] | root | The PID of the root process of this operation. This argument must match across processes in the same collective call to this function. |
| lpf_err_t lpf_gather | ( | lpf_coll_t | coll, |
| lpf_memslot_t | src, | ||
| lpf_memslot_t | dst, | ||
| size_t | size, | ||
| lpf_pid_t | root | ||
| ) |
Schedules a gather of a vector of size bytes. The gather shall be complete by the end of a next call to lpf_sync(). This is a collective operation, meaning that if one LPF process calls this function, all LPF processes in the same SPMD section must make a matching call to this function.
The root process will retrieve size bytes from each other process. The source memory areas are identified by src and size.
For all \( i \in \{ 0, 1, \ldots, \mathit{size} - 1 \} \) and for all \( k \in \{ 0, 1, \ldots, p-1 \},\ k \neq \mathit{root} \), after a next call to lpf_sync() the process with ID root will have that \( \mathit{dst}_{ k \cdot \mathit{size} + i } \) equals \( \mathit{src}_{ i } \) at PID \( k \), with \( p \) the number of processes registered in coll. The memory area of size bytes starting at dst plus root * size will not be touched at PID root.
| [in,out] | coll | The LPF collectives state. This argument must match across processes in the same collective call to this function. |
| [in] | src | The memory slot of the source memory area to read from. When PID is not root, the corresponding memory area to src should be valid to read for at least size bytes. When PID is root, the corresponding memory area will not be touched and can be of any size, including zero. The memory slot can be local or global. This argument may differ across processes in the same collective call to this function. |
| [in] | dst | The memory slot of the destination memory area. At process ID root, the corresponding memory area must be valid and of size at least p * size bytes. At other processes, the corresponding memory area will not be touched and may be of any size. The memory slot must be global; this argument must match across processes in the same collective call to this function. |
| [in] | size | The number of bytes at each source array. This argument must match across processes in the same collective call to this function. |
| [in] | root | Which process is the root of this operation. This argument must match across processes in the same collective call to this function. |
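The resulting layout at the root can be modelled sequentially, without LPF. The sketch below assumes that the segment from process k lands at byte offset k * size in dst, which is consistent with the p * size bytes required at the root and with the untouched region at root * size. The function `model_gather` is hypothetical: it takes one source pointer per process and performs no communication.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of the gather result at the root: src[k] stands for the
   source area of process k, dst for the destination area at the root. */
static void model_gather(size_t p, size_t size, size_t root,
                         const unsigned char *const *src, unsigned char *dst)
{
    assert(root < p);
    for (size_t k = 0; k < p; ++k) {
        if (k == root) {
            continue; /* the bytes at dst + root * size are left untouched */
        }
        memcpy(dst + k * size, src[k], size);
    }
}
```

If the root's own contribution is needed in dst, this model suggests that the caller has to place it at offset root * size itself.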
| lpf_err_t lpf_scatter | ( | lpf_coll_t | coll, |
| lpf_memslot_t | src, | ||
| lpf_memslot_t | dst, | ||
| size_t | size, | ||
| lpf_pid_t | root | ||
| ) |
Schedules a scatter of a vector of size bytes. The operation shall be complete by the end of a next call to lpf_sync(). This is a collective operation, meaning that if one LPF process calls this function, all LPF processes in the same SPMD section must make a matching call to this function.
The root process will split a source memory area into \( p \) segments of size bytes each. The \( k \)-th segment is sent to process \( k \).
The \( k \)-th process, \( k \neq \mathit{root} \), shall, after the next call to lpf_sync, have that for all \( 0 \leq i < \mathit{size} \), \( \mathit{dst}_i \) equals \( \mathit{src}_{ k \mathit{size} + i } \) at PID root.
| [in,out] | coll | The LPF collectives state. This argument must match across processes in the same collective call to this function. |
| [in] | src | The memory slot of the source memory area to read from. When the PID equals root, this must point to a valid memory area of size at least \( p \mathit{size} \) bytes. For all other processes, the corresponding memory area will not be touched and may be of any size, including zero. The slot must be global; this argument must match across processes in the same collective call to this function. |
| [in] | dst | The memory slot of the destination memory area. At processes with PID not equal to root, this must point to a valid memory area of at least size bytes. At PID root, the memory area is not touched and may be of any size, including zero. This argument may differ across processes in the same collective call to this function. |
| [in] | size | The number of bytes that need to be scattered to a single process. The total number of bytes scattered, i.e., the size of the src memory area, equals \( p \mathit{size} \). This argument must match across processes in the same collective call to this function. |
| [in] | root | Which process is the root of this operation. This argument must match across processes in the same collective call to this function. |
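The scatter semantics above can likewise be modelled sequentially. The hypothetical function `model_scatter` below takes the root's source area and one destination pointer per process, and copies segment k of the source to process k for every k other than root. It is a sketch of the documented end state only and involves no LPF calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of the scatter result: src is the source area at the
   root (p * size bytes), dst[k] the destination area of process k. */
static void model_scatter(size_t p, size_t size, size_t root,
                          const unsigned char *src, unsigned char *const *dst)
{
    assert(root < p);
    for (size_t k = 0; k < p; ++k) {
        if (k == root) {
            continue; /* the destination area at the root is not touched */
        }
        memcpy(dst[k], src + k * size, size);
    }
}
```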
| lpf_err_t lpf_allgather | ( | lpf_coll_t | coll, |
| lpf_memslot_t | src, | ||
| lpf_memslot_t | dst, | ||
| size_t | size, | ||
| bool | exclude_myself | ||
| ) |
Schedules an allgather of a vector of size bytes. The operation shall be complete by the end of a next call to lpf_sync(). This is a collective operation, meaning that if one LPF process calls this function, all LPF processes in the same SPMD section must make a matching call to this function.
All processes locally have two memory areas; one of size bytes and another of p * size bytes. After the next call to lpf_sync, each process will have at the latter memory area a concatenation of all of the former memory areas from all processes. More formally:
At the end of the next call to lpf_sync, each process with its PID \( s \in \{ 0, 1, \ldots, p-1 \} \) has \( \forall i \in \{ 0, 1, \ldots, \mathit{size}-1 \} \) and \( \forall k \in \{ 0, 1, \ldots, p-1 \},\ k \neq s \) that \( \mathit{dst}_{ k \cdot \mathit{size} + i } \) local to PID \( s \) equals \( \mathit{src}_{ i } \) local to PID \( k \).
The induced communication pattern as defined above must never cause read and writes to occur at the same memory location, or undefined behaviour will occur.
| [in,out] | coll | The LPF collectives state. This argument must match across processes in the same collective call to this function. |
| [in] | src | The memory slot of the source memory area. This can be a local or global slot. The memory area must be at least size bytes large. This argument may differ across processes in the same collective call to this function. This argument must not equal dst. |
| [in] | dst | The memory slot of the destination memory area. On all processes, this must correspond to a valid memory area of size at least p * size. This must be a global slot; this argument must match across processes in the same collective call to this function. |
| [in] | size | The number of bytes in a single source memory area. This argument must match across processes in the same collective call to this function. |
| [in] | exclude_myself | Skip myself in the collective communication. |
s * size , with s that process' PID, and src was registered with length exactly size bytes.
| lpf_err_t lpf_alltoall | ( | lpf_coll_t | coll,
| lpf_memslot_t | src, | ||
| lpf_memslot_t | dst, | ||
| size_t | size | ||
| ) |
Schedules an all-to-all of a vector of size bytes. The operation shall be complete by the end of a next call to lpf_sync(). This is a collective operation, meaning that if one LPF process calls this function, all LPF processes in the same SPMD section must make a matching call to this function.
All processes locally have \( p \) elements of \( \mathit{size} \) bytes. The elements will be transposed amongst all participating processes.
At the end of this operation, each process with its unique ID \( s \in \{ 0, 1, \ldots, p-1 \} \) has \( \forall i \in \{ 0, 1, \ldots, \mathit{size} - 1 \} \) and \( \forall k \in \{ 0, 1, \ldots, p - 1 \},\ k \neq s \) that \( \mathit{dst}_{ k\mathit{size} + i } \) local to PID \( s \) equals \( \mathit{src}_{ s\mathit{size} + i } \) local to PID \( k \).
It is illegal to have src equal to dst.
All arguments to this function must match across all processes in the collective call to this function.
| [in,out] | coll | The LPF collectives state. |
| [in] | src | The memory slot of the source memory area. This must correspond to a valid memory area of size at least \( p \mathit{size} \). This memory area must not overlap with that of dst. This must be a global memory slot. |
| [in] | dst | The memory slot of the destination memory area to write to. On all processes, this must point to a valid memory area of size at least \( p \mathit{size} \). This parameter must not overlap with that of src. This must be a global memory slot. |
| [in] | size | The number of bytes that each process sends to another. In total, each process sends \( (p-1)\mathit{size} \) bytes, and receives the same amount. |
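The transposition can be checked against a small sequential model. The hypothetical function `model_alltoall` below applies the index formula given for this function: at PID s, segment k of dst receives segment s of the src held by PID k, for all k other than s. It performs no LPF calls, and the diagonal segments are left untouched because the formula only covers k not equal to s.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequential model of the all-to-all result: src[k] and dst[k] stand for
   the p * size byte source and destination areas of process k. */
static void model_alltoall(size_t p, size_t size,
                           const unsigned char *const *src,
                           unsigned char *const *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t s = 0; s < p; ++s) {      /* receiving PID */
        for (size_t k = 0; k < p; ++k) {  /* sending PID */
            if (k == s) {
                continue; /* no transfer from a process to itself */
            }
            memcpy(dst[s] + k * size, src[k] + s * size, size);
        }
    }
}
```

Viewing the p source areas as the rows of a p-by-p matrix of size-byte elements, the destination areas hold the transpose of that matrix, apart from the diagonal.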
| lpf_err_t lpf_reduce | ( | lpf_coll_t | coll, |
| void *restrict | element, | ||
| lpf_memslot_t | element_slot, | ||
| size_t | size, | ||
| lpf_reducer_t | reducer, | ||
| lpf_pid_t | root | ||
| ) |
Schedules a reduce operation of one array per process. The reduce shall be completed by the end of a next call to lpf_sync(). This is a collective operation, meaning that if one LPF process calls this function, all LPF processes in the same SPMD section must make a matching call to this function.
At the end of the next lpf_sync, the memory area element points to shall equal the reduced value of all the element memory areas passed at function entry. This output value shall only be set at the root process. The reduction operator is user-defined through a lpf_reducer_t. Even if the same reducer function is used, this may result in different pointers being passed across the various processes involved in the collective call; hence the reducer argument cannot be enforced to be the same everywhere.
| [in,out] | coll | The LPF collectives state. This argument must match across processes in the same collective call to this function. |
| [in,out] | element | At PID root, a pointer to a memory area of at least size bytes. At function entry, the memory area will contain the element to be reduced. After a next call to lpf_sync, the value of the memory area at PID root will equal the globally reduced value. At PID not equal to root, this memory area at entry will not point to the to be reduced value. At exit the memory area will not have changed. |
| [in] | element_slot | The lpf_memslot_t corresponding to element. The memory slot must be global, and must have registered size bytes starting from element. |
| [in] | size | The size of a single element of the type to be reduced, in bytes. This argument must match across processes in the same collective call to this function. |
| [in] | reducer | A function that defines the reduction operator. This argument may differ across processes in the same collective call to this function. |
| [in] | root | The process ID of the root process in this collective. This argument must match across processes in the same collective call to this function. |
| lpf_err_t lpf_allreduce | ( | lpf_coll_t | coll, |
| void *restrict | element, | ||
| lpf_memslot_t | element_slot, | ||
| size_t | size, | ||
| lpf_reducer_t | reducer | ||
| ) |
Schedules an allreduce operation of a single object per process. The allreduce shall be complete by the end of a next call to lpf_sync. This is a collective operation, meaning that if one LPF process calls this function, all LPF processes in the same SPMD section must make a matching call to this function.
At the end of the next lpf_sync, the memory area element points to shall equal the reduced value of all elements at all processes. The reduction operator is user-defined through a lpf_reducer_t. Even if the same reducer function is used, this may result in different pointers being passed across the various processes involved in the collective call; hence the reducer argument cannot be enforced to be the same everywhere.
| [in,out] | coll | The LPF collectives state. This argument must match across processes in the same collective call to this function. |
| [in,out] | element | A pointer to a memory area of at least size bytes. At function entry, this equals the local to be reduced value. After a next call to lpf_sync, the value of the memory area will equal the globally reduced value. |
| [in] | element_slot | The lpf_memslot_t corresponding to element. The memory slot must be global, and must have registered size bytes starting from element. |
| [in] | size | The size of a single element of the type to be reduced, in bytes. This argument must match across processes in the same collective call to this function. |
| [in] | reducer | A function that defines the reduction operator. This argument may differ across processes in the same collective call to this function. |
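The end state of an allreduce can be illustrated with a sequential model restricted to int elements. In the sketch below, `min_ints` is a reducer with the documented lpf_reducer_t signature, and `model_allreduce` is a hypothetical stand-in that holds one element per process in a plain array. It makes no claim about the algorithm or the order of application that the LPF collectives library actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Reducer computing a minimum: *value becomes the smallest of its initial
   content and the n ints in array. */
static void min_ints(size_t n, const void *array, void *value)
{
    assert(value != NULL);
    const int *in = (const int *)array;
    int *acc = (int *)value;
    for (size_t i = 0; i < n; ++i) {
        if (in[i] < *acc) {
            *acc = in[i];
        }
    }
}

/* Sequential model: elements[s] is the local element of process s. After
   the call, every entry holds the reduction of all p entries. */
static void model_allreduce(size_t p, int *elements,
                            void (*reducer)(size_t, const void *, void *))
{
    assert(p > 0);
    int result = elements[0];               /* initialised reduction value */
    reducer(p - 1, elements + 1, &result);  /* fold in the other p - 1 */
    for (size_t s = 0; s < p; ++s) {
        elements[s] = result;
    }
}
```

The same model describes lpf_reduce if only the entry belonging to the root is overwritten.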
| lpf_err_t lpf_combine | ( | lpf_coll_t | coll, |
| void *restrict | array, | ||
| lpf_memslot_t | slot, | ||
| size_t | num, | ||
| size_t | size, | ||
| lpf_combiner_t | combiner, | ||
| lpf_pid_t | root | ||
| ) |
Combines an array at all non-root processes into that of the root process.
The operation is guaranteed to be complete after a next call to lpf_sync.
On input, all processes must supply a valid data array. On output, the root process will have its output array equal to \( array^{(\mathit{root})} = \oplus_{k=0}^{p-1} array^{(k)}, \) where \( \oplus \) is prescribed by combiner. The order in which the operator \( \oplus \) is applied is undefined. The operator \( \oplus \) may furthermore be applied in stages, when applied to a single set of input and output arrays. See lpf_combiner_t for more details.
All parameters must match across all processes involved with the same collective call to this function.
This implementation synchronises once before exiting.
| [in,out] | coll | The LPF collectives state. |
| [in,out] | array | The array to be combined. The array must point to a valid memory area of size \( \mathit{num}\mathit{size} \). The num elements in this array must be initialised and will be taken as input for the combiner. After a call to this function, this array's contents will be undefined at non-root processes. On the root process, this array's contents will be the combination of all arrays, as prescribed by the combiner. |
| [in] | slot | The memory slot corresponding to array. This must be a globally registered slot. |
| [in] | num | The number of elements in the array. |
| [in] | size | The size, in bytes, of a single element of the array. |
| [in] | combiner | A function which may combine one or more elements of the appropriate array types. The combining happens element-by-element. |
| [in] | root | Which process is the root of this collective operation. |
| lpf_err_t lpf_allcombine | ( | lpf_coll_t | coll, |
| void *restrict | array, | ||
| lpf_memslot_t | slot, | ||
| size_t | num, | ||
| size_t | size, | ||
| lpf_combiner_t | combiner | ||
| ) |
Combines an array at all processes into one array that is broadcasted over all processes.
The operation is guaranteed to be complete after a next call to lpf_sync.
On input, all processes must supply a valid data array. On output, all processes will have their output array equal to \( \oplus_{k=0}^{p-1} array^{(k)}, \) where \( \oplus \) is prescribed by combiner. The order in which the operator \( \oplus \) is applied is undefined. The operator \( \oplus \) may furthermore be applied in stages, when applied to a single set of input and output arrays. See lpf_combiner_t for more details.
All parameters must match across all processes involved with the same collective call to this function.
| [in,out] | coll | The LPF collectives state. |
| [in] | size | The size, in bytes, of a single element of the array. |
| [in] | num | The number of elements in the array. |
| [in] | combiner | A function which may combine one or more elements of the appropriate array types. The combining happens element-by-element. |
| [in,out] | array | The array to be combined. The array must point to a valid memory area of size \( \mathit{num}\mathit{size} \). The num elements in this array must be initialised and will be taken as input for the combiner. After a call to this function, this array's contents will be undefined until the operation has completed. On completion, this array's contents on all processes will be the combination of all arrays, as prescribed by the combiner. |
| [in] | slot | The memory slot corresponding to array. This must be a globally registered slot. |
| extern const lpf_coll_t LPF_INVALID_COLL |
An invalid lpf_coll_t. Can be used as a static initialiser, but can never be used as input to any LPF collective as its contents are invalid.