Backend

To specify a backend, simply pass -b cuda or --backend openmp, together with the other parameters described below.

Feel free to play with fpie using other arguments!

GridSolver

GridSolver keeps most of the 2D structure of the image instead of relabeling pixels as EquSolver does. To use GridSolver with some of the following backends, you need to specify --grid-x and --grid-y to determine the access pattern of the large 2D array.

Here is a Python example that shows how it works:

import numpy as np

N, M = 128, 128
grid_x, grid_y = 8, 8
arr = np.random.random(size=[N, M])

def func(value):
    pass  # placeholder for the per-pixel computation

# here is a sequential scan:
for i in range(N):
    for j in range(M):
        func(arr[i, j])

# however, we can use a block-level access pattern to improve the cache hit rate:
for i in range(N // grid_x):
    for j in range(M // grid_y):
        # the grid size is (grid_x, grid_y)
        for x in range(grid_x):
            for y in range(grid_y):
                func(arr[i * grid_x + x, j * grid_y + y])

NumPy

This backend uses NumPy vectorized operations for parallel computation.
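
For intuition, here is a minimal sketch (an illustration, not fpie's actual implementation) of how one Jacobi iteration of the discrete Poisson equation can be vectorized with NumPy slicing, so that all pixels are updated by whole-array operations instead of Python loops:

import numpy as np

def jacobi_step(x, b):
    # x: current solution, b: right-hand side term; both are (N, M) arrays.
    # Each interior pixel becomes the average of its four neighbors plus b.
    out = x.copy()
    out[1:-1, 1:-1] = (x[:-2, 1:-1] + x[2:, 1:-1] +
                       x[1:-1, :-2] + x[1:-1, 2:] + b[1:-1, 1:-1]) / 4.0
    return out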

There’s no extra parameter for NumPy EquSolver:

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b numpy --method equ
Successfully initialize PIE equ solver with numpy backend
# of vars: 12559
Iter 5000, abs error [450.09415 445.24747 636.1397 ]
Time elapsed: 3.26s
Successfully write image to result.jpg

There’s no extra parameter for NumPy GridSolver:

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b numpy --method grid
Successfully initialize PIE grid solver with numpy backend
# of vars: 17227
Iter 5000, abs error [450.07922 445.27014 636.1374 ]
Time elapsed: 3.09s
Successfully write image to result.jpg

Numba

This backend uses NumPy vectorized operations together with Numba JIT-compiled functions for parallel computation.
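
As a minimal sketch (an assumption for illustration, not fpie's actual code), a per-pixel update loop can be JIT-compiled and parallelized with Numba roughly like this:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def jacobi_step(x, b):
    # the outer loop over rows is distributed across threads by prange
    out = x.copy()
    for i in prange(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            out[i, j] = (x[i - 1, j] + x[i + 1, j] +
                         x[i, j - 1] + x[i, j + 1] + b[i, j]) / 4.0
    return out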

There’s no extra parameter for Numba EquSolver:

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b numba --method equ
Successfully initialize PIE equ solver with numba backend
# of vars: 12559
Iter 5000, abs error [449.83978128 445.02560616 635.9542823 ]
Time elapsed: 1.5883s
Successfully write image to result.jpg

There’s no extra parameter for Numba GridSolver:

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b numba --method grid
Successfully initialize PIE grid solver with numba backend
# of vars: 17227
Iter 5000, abs error [449.89603 445.08475 635.89545]
Time elapsed: 5.6462s
Successfully write image to result.jpg

GCC

This backend uses a single-threaded C++ program to perform the computation.

There’s no extra parameter for GCC EquSolver:

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b gcc --method equ
Successfully initialize PIE equ solver with gcc backend
# of vars: 12559
Iter 5000, abs error [ 5.179281   6.6939087 11.006622 ]
Time elapsed: 0.29s
Successfully write image to result.jpg

For GCC GridSolver, you need to specify --grid-x and --grid-y as described in the first section:

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b gcc --method grid --grid-x 8 --grid-y 8
Successfully initialize PIE grid solver with gcc backend
# of vars: 17227
Iter 5000, abs error [ 5.1776047  6.69458   11.001862 ]
Time elapsed: 0.36s
Successfully write image to result.jpg

Taichi

Taichi is an open-source, imperative, parallel programming language for high-performance numerical computation. We provide two choices: taichi-cpu for CPU-level parallelization and taichi-gpu for GPU-level parallelization. You can install Taichi via pip install taichi.

  • For taichi-cpu: use -c or --cpu to determine how many CPUs it will use;

  • For taichi-gpu: use -z or --block-size to determine the number of threads used in a block.

The parallelization strategy for the Taichi backend is determined by Taichi itself.
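
For intuition, a Taichi kernel for this kind of per-pixel update might look like the following minimal sketch (field names and sizes are assumptions, not fpie's actual kernel); Taichi automatically parallelizes the outermost loop of a kernel on the chosen backend:

import taichi as ti

ti.init(arch=ti.cpu)  # ti.gpu for the taichi-gpu backend

N, M = 130, 130
x = ti.field(dtype=ti.f32, shape=(N, M))
x_new = ti.field(dtype=ti.f32, shape=(N, M))
b = ti.field(dtype=ti.f32, shape=(N, M))

@ti.kernel
def jacobi_step():
    # the outermost loop is parallelized by Taichi on CPU or GPU
    for i, j in ti.ndrange((1, N - 1), (1, M - 1)):
        x_new[i, j] = (x[i - 1, j] + x[i + 1, j] +
                       x[i, j - 1] + x[i, j + 1] + b[i, j]) / 4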

There are no other parameters for Taichi EquSolver:

# taichi-cpu
$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b taichi-cpu --method equ -c 6
[Taichi] version 0.9.2, llvm 10.0.0, commit 7a4d73cd, linux, python 3.8.10
[Taichi] Starting on arch=x64
Successfully initialize PIE equ solver with taichi-cpu backend
# of vars: 12559
Iter 5000, abs error [ 5.1899223  6.708023  11.034821 ]
Time elapsed: 0.57s
Successfully write image to result.jpg
# taichi-gpu
$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b taichi-gpu --method equ -z 1024
[Taichi] version 0.9.2, llvm 10.0.0, commit 7a4d73cd, linux, python 3.8.10
[Taichi] Starting on arch=cuda
Successfully initialize PIE equ solver with taichi-gpu backend
# of vars: 12559
Iter 5000, abs error [37.35366  46.433205 76.09506 ]
Time elapsed: 0.60s
Successfully write image to result.jpg

For Taichi GridSolver, you also need to specify --grid-x and --grid-y as described in the first section:

# taichi-cpu
$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b taichi-cpu --method grid --grid-x 16 --grid-y 16 -c 12
[Taichi] version 0.9.2, llvm 10.0.0, commit 7a4d73cd, linux, python 3.8.10
[Taichi] Starting on arch=x64
Successfully initialize PIE grid solver with taichi-cpu backend
# of vars: 17227
Iter 5000, abs error [ 5.310623   6.8661118 11.2751465]
Time elapsed: 0.73s
Successfully write image to result.jpg
# taichi-gpu
$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b taichi-gpu --method grid --grid-x 8 --grid-y 8 -z 64
[Taichi] version 0.9.2, llvm 10.0.0, commit 7a4d73cd, linux, python 3.8.10
[Taichi] Starting on arch=cuda
Successfully initialize PIE grid solver with taichi-gpu backend
# of vars: 17227
Iter 5000, abs error [37.74704  46.853233 74.741455]
Time elapsed: 0.63s
Successfully write image to result.jpg

OpenMP

The OpenMP backend needs the number of CPU cores it can use, specified with the -c or --cpu option (the default is to use all CPU cores).

There are no other parameters for OpenMP EquSolver:

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b openmp --method equ -c 6
Successfully initialize PIE equ solver with openmp backend
# of vars: 12559
Iter 5000, abs error [ 5.2758713  6.768402  11.11969  ]
Time elapsed: 0.06s
Successfully write image to result.jpg

For OpenMP GridSolver, you also need to specify --grid-x and --grid-y as described in the first section:

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b openmp --method grid --grid-x 8 --grid-y 8 -c 6
Successfully initialize PIE grid solver with openmp backend
# of vars: 17227
Iter 5000, abs error [ 5.187172  6.701462 11.020264]
Time elapsed: 0.10s
Successfully write image to result.jpg

Parallelization Strategy

For EquSolver, it first groups the pixels into two folds by (x + y) % 2, then parallelizes the per-pixel iteration inside each group in each step. This strategy can take advantage of thread-local access.
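
The grouping is the classic red-black (checkerboard) ordering; here is a minimal NumPy sketch (an illustration, not fpie's actual code) of how the two folds can be formed and updated:

import numpy as np

N, M = 130, 130
ii, jj = np.meshgrid(np.arange(N), np.arange(M), indexing="ij")
fold0 = (ii + jj) % 2 == 0  # "red" pixels
fold1 = (ii + jj) % 2 == 1  # "black" pixels

# one step: update all fold0 pixels in parallel (they only read fold1
# neighbors), then update all fold1 pixels in parallel.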

For GridSolver, it parallelizes the per-grid iteration in each step, where the grid size is (grid_x, grid_y). Inside each grid it simply iterates over all pixels.
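
As a rough sketch of this per-grid parallelization (expressed with Numba rather than OpenMP for illustration; function and parameter names are assumptions, not fpie's code), each parallel worker handles one (grid_x, grid_y) block:

from numba import njit, prange

@njit(parallel=True)
def grid_step(x, b, out, grid_x, grid_y):
    nx, ny = x.shape[0] // grid_x, x.shape[1] // grid_y
    for block in prange(nx * ny):       # one block per parallel worker
        bi, bj = block // ny, block % ny
        for dx in range(grid_x):        # sequential scan inside the block
            for dy in range(grid_y):
                i, j = bi * grid_x + dx, bj * grid_y + dy
                if 0 < i < x.shape[0] - 1 and 0 < j < x.shape[1] - 1:
                    out[i, j] = (x[i - 1, j] + x[i + 1, j] +
                                 x[i, j - 1] + x[i, j + 1] + b[i, j]) / 4.0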

MPI

To run with the MPI backend, you need to install both mpicc and mpi4py (pip install mpi4py).

Unlike the other backends, you need to use mpiexec or mpirun to launch the MPI service instead of calling the fpie program directly. The -np option indicates the number of processes to launch.

Apart from that, you need to specify the synchronization interval for the MPI backend with --mpi-sync-interval. If this number is too small, synchronization overhead will dominate the computation; if it is too large, the quality of the solution drops dramatically.

MPI EquSolver and GridSolver don't have any other arguments because of the parallelization strategy we use; see the next section.

$ mpiexec -np 6 fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b mpi --method equ --mpi-sync-interval 100
Successfully initialize PIE equ solver with mpi backend
# of vars: 12559
Iter 5000, abs error [264.6767  269.55304 368.4869 ]
Time elapsed: 0.10s
Successfully write image to result.jpg
$ mpiexec -np 6 fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b mpi --method grid --mpi-sync-interval 100
Successfully initialize PIE grid solver with mpi backend
# of vars: 17227
Iter 5000, abs error [204.41124 215.00548 296.4441 ]
Time elapsed: 0.13s
Successfully write image to result.jpg

Parallelization Strategy

MPI cannot use the shared-memory programming model. We need to reduce the amount of data communicated while maintaining the quality of the solution.

Each MPI process is responsible for only a part of the computation and synchronizes with the other processes every mpi_sync_interval steps, denoted as \(S\) here. When \(S\) is too small, the synchronization overhead dominates the computation; when \(S\) is too large, each process computes the solution independently without global information, so the quality of the solution gradually deteriorates.

For EquSolver, it's hard to say which part of the data should be exchanged with the other processes, since it relabels all pixels at the very beginning. We use MPI_Bcast to force-sync all data every \(S\) iterations.

For GridSolver, we use a line partition: process i exchanges its first and last rows with processes i-1 and i+1, respectively, every \(S\) iterations. The exchanged data is contiguous in memory, so this strategy has less overhead compared with a block-level partition.
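
Here is a minimal mpi4py sketch of such a boundary-row exchange (array shapes and the helper function are assumptions for illustration, not fpie's actual code):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rows_per_proc, M = 32, 130  # hypothetical local block size
# local rows plus one halo row above and one below
local = np.zeros((rows_per_proc + 2, M), dtype=np.float32)

def exchange_boundary_rows():
    # send the first/last real rows to the neighboring processes and
    # receive their boundary rows into the halo rows
    if rank > 0:
        comm.Sendrecv(local[1], dest=rank - 1,
                      recvbuf=local[0], source=rank - 1)
    if rank < size - 1:
        comm.Sendrecv(local[-2], dest=rank + 1,
                      recvbuf=local[-1], source=rank + 1)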

CUDA

CUDA EquSolver needs the number of threads per block, specified with the -z or --block-size option (the default value is 1024):

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b cuda --method equ -z 256
---------------------------------------------------------
Found 1 CUDA devices
Device 0: NVIDIA GeForce GTX 1060
   SMs:        10
   Global mem: 6078 MB
   CUDA Cap:   6.1
---------------------------------------------------------
Successfully initialize PIE equ solver with cuda backend
# of vars: 12559
Iter 5000, abs error [37.63664 48.39614 79.6199 ]
Time elapsed: 0.06s
Successfully write image to result.jpg

For CUDA GridSolver, you need to specify --grid-x and --grid-y as described in the first section, instead of -z:

$ fpie -s test2_src.png -m test2_mask.png -t test2_tgt.png -o result.jpg -h1 130 -w1 130 -n 5000 -g src -b cuda --method grid --grid-x 4 --grid-y 128
---------------------------------------------------------
Found 1 CUDA devices
Device 0: NVIDIA GeForce GTX 1060
   SMs:        10
   Global mem: 6078 MB
   CUDA Cap:   6.1
---------------------------------------------------------
Successfully initialize PIE grid solver with cuda backend
# of vars: 17227
Iter 5000, abs error [37.50096  48.061874 79.06909 ]
Time elapsed: 0.07s
Successfully write image to result.jpg

Parallelization Strategy

The strategy used by the CUDA backend is quite similar to the OpenMP strategy.

For EquSolver, it performs equation-level parallelization.

For GridSolver, each grid of size (grid_x, grid_y) is assigned to one block. Each thread in a block performs the iteration for only a single pixel.