Final Report¶
Summary¶
We have implemented a parallelized Poisson image editor based on the Jacobi method. It supports seven backends: NumPy, Numba, Taichi, single-threaded C++, OpenMP, MPI, and CUDA. In terms of performance, our detailed benchmark shows that the CUDA backend achieves a 31x to 42x speedup on GHC machines over the single-threaded C++ implementation. In terms of user experience, we provide a simple GUI to demonstrate the results interactively, a standard PyPI package, and a website with project documentation.
(Figures: source image, mask image, target image, and result image.)
Background¶
Poisson Image Editing¶
Poisson Image Editing is a technique that fuses two images together without producing artifacts. Given a source image, its corresponding mask, and a target location on the target image, the algorithm produces a seamless composite. The general idea is to keep most of the gradient of the source image unchanged, while matching the pixels on the boundary between the source and target images.
The gradient per pixel is computed by the discrete Laplacian: \(\nabla(x,y) = 4I(x,y) - I(x-1,y) - I(x+1,y) - I(x,y-1) - I(x,y+1)\).
After calculating the gradient of the source image, the algorithm solves the following problem: given the gradient and the boundary values, find an approximate solution that meets the requirement, i.e., one whose gradient inside the mask is as similar as possible to the source image's.
This process can be formulated as \((4I-A)\vec{x}=\vec{b}\), where \(A\in\mathbb{R}^{N\times N}\), \(\vec{x}\in\mathbb{R}^N\), \(\vec{b}\in\mathbb{R}^N\), and \(N\) is the number of pixels in the mask. \(A\) is a giant sparse matrix, because each row of \(A\) contains at most four nonzero values (the neighbors); \(\vec{b}\) is the gradient from the source image, and \(\vec{x}\) is the resulting value.
\(N\) is always a large number, e.g., greater than 50k, so Gauss-Jordan elimination cannot be applied directly here because of its high time complexity \(O(N^3)\). Instead, the Jacobi method is commonly used to solve the problem. Thanks to the sparsity of \(A\), the overall time complexity is \(O(MN)\), where \(M\) is the number of iterations performed by Poisson image editing. The iterative update is \(\vec{x}' \leftarrow (A\vec{x}+\vec{b})/4\).
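As a concrete (toy-sized, illustrative) demonstration of this iteration, the fixed point of \(\vec{x}' \leftarrow (A\vec{x}+\vec{b})/4\) solves \((4I-A)\vec{x}=\vec{b}\), which we can verify with NumPy on a tiny 2x2 masked region (the values of `b` here are made up for illustration):

```python
import numpy as np

# Tiny 2x2 masked region, pixels labeled
#   0 1
#   2 3
# A[i, j] = 1 iff pixels i and j are 4-neighbors; boundary
# contributions are folded into b.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
b = np.array([5.0, 6.0, 7.0, 8.0])

x = np.zeros(4)
for _ in range(200):          # Jacobi iteration from the text
    x = (A @ x + b) / 4.0

x_direct = np.linalg.solve(4 * np.eye(4) - A, b)  # the fixed point
print(np.allclose(x, x_direct))  # True
```

The iteration matrix \(A/4\) has spectral radius 1/2 here, so 200 iterations are far more than enough to converge.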
This project parallelizes the Jacobi method to speed up the computation. To the best of our knowledge, there is no public project on GitHub that implements Poisson image editing with OpenMP, MPI, or CUDA; existing projects can only handle small image workloads (link).
PIE Solver¶
We implemented two different solvers: EquSolver and GridSolver.
EquSolver directly constructs the equations \((4I-A)\vec{x}=\vec{b}\) by relabeling the pixels, and uses the Jacobi method to obtain the solution via \(\vec{x}' \leftarrow (A\vec{x}+\vec{b})/4\).
""" EquSolver pseudocode."""
# preprocess
src, mask, tgt = read_images(src_name, mask_name, tgt_name)
A = build_A(mask) # shape: (N, 4), dtype: int
X = build_X(tgt, mask) # shape: (N, 3), dtype: float
B = build_B(src, mask, tgt) # shape: (N, 3), dtype: float
# major computation, can be parallelized
for _ in range(n_iter):
X = (X[A[:, 0]] + X[A[:, 1]] + X[A[:, 2]] + X[A[:, 3]] + B) / 4.0
# postprocess
out = merge(tgt, X, mask)
write_image(out_name, out)
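To make the pseudocode concrete, here is a hedged, minimal single-channel EquSolver in NumPy, with `build_A`/`build_X`/`build_B` inlined (the real implementation handles RGB and arbitrary masks; the helper layout below is our own illustrative choice, not the project's exact code):

```python
import numpy as np

def equ_solve(src, mask, tgt, n_iter=300):
    """Jacobi-based EquSolver sketch on a single-channel image.

    Masked pixels are relabeled 1..N; label 0 is a dummy whose value
    stays 0, so unmasked neighbors can safely point at it. The mask
    must not touch the image border.
    """
    src = src.astype(float)
    tgt = tgt.astype(float)
    ys, xs = np.nonzero(mask)
    N = len(ys)
    idx = np.zeros(mask.shape, dtype=int)
    idx[ys, xs] = np.arange(1, N + 1)
    X = np.zeros(N + 1)
    X[1:] = tgt[ys, xs]                   # initial guess from the target
    A = np.zeros((N + 1, 4), dtype=int)   # neighbor labels, like build_A
    B = np.zeros(N + 1)                   # gradient + boundary, like build_B
    for k, (dy, dx) in enumerate([(-1, 0), (1, 0), (0, -1), (0, 1)]):
        ny, nx = ys + dy, xs + dx
        A[1:, k] = idx[ny, nx]            # 0 if that neighbor is unmasked
        boundary = idx[ny, nx] == 0
        B[1:] += src[ys, xs] - src[ny, nx]             # source gradient
        B[1:] += np.where(boundary, tgt[ny, nx], 0.0)  # fixed boundary value
    for _ in range(n_iter):  # Jacobi: X' <- (sum of neighbors + B) / 4
        X = (X[A[:, 0]] + X[A[:, 1]] + X[A[:, 2]] + X[A[:, 3]] + B) / 4.0
    out = tgt.copy()
    out[ys, xs] = X[1:]
    return out
```

With `src == tgt`, the target is already the exact fixed point, which is a convenient sanity check.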
GridSolver uses the same Jacobi iteration; however, it keeps the 2D structure of the original image instead of relabeling all pixels in the mask. It can have an advantage when the mask region covers the whole image, because in that case GridSolver saves four read instructions by computing the coordinates of the neighbors directly. Also, if the access pattern is set appropriately (discussed in Section Access Pattern), it has better locality when fetching the required data in each iteration.
""" GridSolver pseudocode."""
# preprocess
src, mask, tgt = read_images(src_name, mask_name, tgt_name)
# mask: shape: (N, M), dtype: uint8
grad = calc_grad(src, mask, tgt) # shape: (N, M, 3), dtype: float
x, y = np.nonzero(mask) # find elementwise pixel index of mask array
# major computation, can be parallelized
for _ in range(n_iter):
tgt[x, y] = (tgt[x  1, y] + tgt[x, y  1] + tgt[x, y + 1] + tgt[x + 1, y] + grad[x, y]) / 4.0
# postprocess
write_image(out_name, tgt)
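Again, a hedged, runnable single-channel version of this pseudocode (with `calc_grad` inlined as a Laplacian of the source; the real implementation handles RGB and more general boundary handling):

```python
import numpy as np

def grid_solve(src, mask, tgt, n_iter=300):
    """Jacobi-based GridSolver sketch on a single-channel image.

    Keeps the 2D layout; `grad` plays the role of calc_grad in the
    pseudocode. The mask must not touch the image border.
    """
    src = src.astype(float)
    tgt = tgt.astype(float).copy()
    # Laplacian-style gradient of the source image (wrap-around values
    # from np.roll are never read, because the mask stays off the border)
    grad = 4.0 * src - (np.roll(src, 1, 0) + np.roll(src, -1, 0)
                        + np.roll(src, 1, 1) + np.roll(src, -1, 1))
    x, y = np.nonzero(mask)
    for _ in range(n_iter):
        # the right-hand side is fully evaluated before the assignment,
        # so this is a Jacobi (not Gauss-Seidel) sweep
        tgt[x, y] = (tgt[x - 1, y] + tgt[x, y - 1] + tgt[x, y + 1]
                     + tgt[x + 1, y] + grad[x, y]) / 4.0
    return tgt
```

Unlike EquSolver, no relabeling pass is needed; neighbor addresses come directly from the 2D coordinates.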
The bottleneck of both solvers is the for-loop, which can be easily parallelized. The implementation details and parallelization strategies are discussed in Section Parallelization Strategy.
Method¶
Language and Hardware Setup¶
We built PIE with the help of pybind11, as we aim to benchmark multiple parallelization methods, including handwritten CUDA code and third-party libraries such as NumPy.
One of our project goals is to let the algorithm run on any *nix machine with real-time, interactive result demonstration. For this reason, we did not choose a supercomputing cluster as our hardware platform. Instead, we develop and measure performance on the GHC machines, which have an 8-core Intel i7-9700 CPU and an Nvidia RTX 2080 GPU.
Access Pattern¶
For EquSolver, we can reorganize the order of pixels to obtain better locality when performing parallel operations. Specifically, we can divide all pixels into two folds by `(x + y) % 2`. Here is a small example:
# before
x1 x2 x3 x4 x5
x6 x7 x8 x9 x10
x11 x12 x13 x14 x15
...
# reorder
x1 x10 x2 x11 x3
x12 x4 x13 x5 x14
x6 x15 x7 x16 x8
...
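This reordering can be computed directly in NumPy. The snippet below labels a self-contained 3x5 grid (the example above continues past three rows, so its exact labels differ slightly): all even-parity pixels get the low IDs in scan order, then all odd-parity ones.

```python
import numpy as np

# Label a 3x5 grid: even-parity pixels ((x + y) % 2 == 0) first, then
# odd-parity ones. Every pixel's 4 neighbors land in the other fold.
h, w = 3, 5
x, y = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
parity = ((x + y) % 2).ravel()
order = np.argsort(parity, kind="stable")   # even ids first, in scan order
labels = np.empty(h * w, dtype=int)
labels[order] = np.arange(h * w)
print(labels.reshape(h, w) + 1)
```

Because any pixel and its four neighbors always have opposite parity, each half-sweep only reads values written in the previous half-sweep.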
This results in a tighter relationship among the four neighbors of each pixel. The ideal access pattern is to iterate over these two groups separately, i.e.:
for _ in range(n_iter):
    parallel for i in range(1, p):
        # i < p, neighbors >= p
        x_[i] = calc(b[i], neighbor(x, i))
    parallel for i in range(p, N):
        # i >= p, neighbors < p
        x[i] = calc(b[i], neighbor(x_, i))
Unfortunately, we only observe a clear advantage with the OpenMP EquSolver. For the other backends, sequential ID assignment is much better than reordering. See Section Parallelization Strategy - OpenMP for a related discussion.
For GridSolver, since it retains most of the image's 2D structure, we can use a block-level access pattern instead of a sequential one to improve the cache hit rate. Here is Python pseudocode showing how it works:
N, M = tgt.shape[:2]

# here is a sequential scan:
parallel for i in range(N):
    parallel for j in range(M):
        if mask[i, j]:
            tgt_[i, j] = calc(grad[i, j], neighbor(tgt, i, j))

# however, a block-level access pattern improves the cache hit rate:
parallel for i in range(N // grid_x):
    parallel for j in range(M // grid_y):
        # the block size is (grid_x, grid_y)
        for x in range(i * grid_x, (i + 1) * grid_x):
            for y in range(j * grid_y, (j + 1) * grid_y):
                if mask[x, y]:
                    tgt_[x, y] = calc(grad[x, y], neighbor(tgt, x, y))
Synchronization vs Convergence Speed¶
Since the Jacobi method is an iterative method for solving matrix equations, there is a trade-off between the quality of the solution and the frequency of synchronization.
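This trade-off can be simulated without any parallel machinery. The sketch below (all names and sizes are hypothetical, chosen only for illustration) splits a 1D Laplace problem between two simulated "processes" that refresh their shared boundary (ghost) values only every `sync_every` Jacobi iterations:

```python
import numpy as np

def jacobi_split(n=32, n_iter=500, sync_every=1):
    """Error of split Jacobi on x'' = 0 with x(0)=0, x(1)=1."""
    x = np.zeros(n + 2)   # x[0] = 0 and x[n + 1] = 1 are fixed boundaries
    x[-1] = 1.0
    m = n // 2            # left half owns 1..m, right half owns m+1..n
    ghost_l = x[m + 1]    # left half's stale copy of its right neighbor
    ghost_r = x[m]        # right half's stale copy of its left neighbor
    for it in range(n_iter):
        if it % sync_every == 0:            # "communication" step
            ghost_l, ghost_r = x[m + 1], x[m]
        new = x.copy()
        new[1:m] = (x[0:m - 1] + x[2:m + 1]) / 2
        new[m] = (x[m - 1] + ghost_l) / 2
        new[m + 1] = (ghost_r + x[m + 2]) / 2
        new[m + 2:n + 1] = (x[m + 1:n] + x[m + 3:n + 2]) / 2
        x = new
    return np.abs(x - np.linspace(0, 1, n + 2)).max()  # error vs truth
```

With `sync_every=1` this is exact Jacobi; larger intervals cost less communication but leave a larger error after the same number of iterations, which is precisely the trade-off described above.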
Parallelization Strategy¶
This section covers the implementation details of three handwritten backends (OpenMP/MPI/CUDA) with the two solvers (EquSolver/GridSolver).
OpenMP¶
As mentioned before, the OpenMP EquSolver first divides the pixels into two folds by `(x + y) % 2`, then parallelizes the per-pixel iteration inside each group in each step. This strategy improves thread-local access because the four neighbors of a pixel are located closer together. However, it requires processing the entire array twice because of the division. In some backends, such as CUDA, this approach introduces an overhead that exceeds the original computational cost; in OpenMP, however, it yields a significant runtime improvement.
The OpenMP GridSolver assigns an equal number of blocks to each thread, with size `(grid_x, grid_y)` per block. Each thread simply iterates over all pixels in its blocks independently.
We use static assignment for both solvers to minimize the overhead of runtime task allocation, since the workload is uniform per pixel/block.
MPI¶
MPI cannot use the shared-memory programming model, so we need to reduce the amount of data communicated while maintaining the quality of the solution.
Each MPI process is responsible for only a portion of the computation and synchronizes with the other processes every `mpi_sync_interval` steps, denoted as \(S\) in this section. When \(S\) is too small, the synchronization overhead dominates the computation; when \(S\) is too large, each process computes its solution without global information, so the quality of the solution gradually deteriorates.
For the MPI EquSolver, it is hard to say which part of the data should be exchanged with the other processes, since all pixels are relabeled in the preprocessing stage. We therefore assign an equal number of equations to each process and use `MPI_Bcast` to force synchronization of all data every \(S\) iterations.
The MPI GridSolver uses line partitioning: process `i` exchanges its first and last lines with processes `i-1` and `i+1`, respectively, every \(S\) iterations. This strategy has a contiguous memory layout and thus less overhead than block-level partitioning.
The workload per pixel is small and fixed; in fact, this type of workload is not well suited to MPI.
CUDA¶
The strategy used on the CUDA backend is quite similar to OpenMP.
The CUDA EquSolver performs equation-level parallelization. It uses sequential labels per pixel instead of dividing them into two folds as OpenMP does. Each block is assigned an equal number of equations and executes the Jacobi method independently; each thread in a block handles only a single equation. We also tested a shared-memory kernel, but it was much slower than the non-shared-memory version.
For the CUDA GridSolver, each grid of size `(grid_x, grid_y)` is mapped to one block, and each thread in a block iterates over only a single pixel.
There are no barriers during the iterations of either solver. The reason has been discussed in Section Shared Memory Programming Model.
Experiments¶
Experiment Setting¶
Hardware and Software¶
We use the GHC83 machine to run all of the following experiments. Here is the hardware and software configuration:
OS: Red Hat Enterprise Linux Workstation 7.9 (Maipo)
CPU: 8x Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
GPU: Nvidia GeForce RTX 2080 (8 GB)
Python: 3.6.8
Python package version:
numpy==1.19.5
opencv-python==4.5.5.64
mpi4py==3.1.3
numba==0.53.1
taichi==1.0.0
Data¶
We generated 10 images for benchmarking, 5 square and 5 circle. The generation script is tests/data.py. Detailed information is in the following table:
| ID | Size | # pixels | # unmasked pixels |
|----------|-----------|---------|---------|
| square6 | 66x66 | 4356 | 4356 |
| square7 | 130x130 | 16900 | 16900 |
| square8 | 258x258 | 66564 | 66564 |
| square9 | 514x514 | 264196 | 264196 |
| square10 | 1026x1026 | 1052676 | 1052676 |
| circle6 | 74x74 | 5476 | 4291 |
| circle7 | 146x146 | 21316 | 16727 |
| circle8 | 290x290 | 84100 | 66043 |
| circle9 | 579x579 | 335241 | 262341 |
| circle10 | 1157x1157 | 1338649 | 1049489 |
We tried to keep the number of unmasked pixels of circleX and squareX at the same level. For EquSolver this makes no difference, but for GridSolver it cannot be ignored, since GridSolver processes all pixels regardless of whether they are masked.
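The circle sizes follow from simple geometry: a circle covers roughly \(\pi/4\) of its bounding square, so the circle canvases must be larger for a comparable masked-pixel count. The snippet below is a hypothetical reconstruction of a circle mask (the real generation script is tests/data.py; its exact pixel counts may differ slightly):

```python
import numpy as np

# Hypothetical circle-mask reconstruction for the circle6 canvas size.
n = 74
yy, xx = np.mgrid[:n, :n]
c = (n - 1) / 2.0
mask = ((xx - c) ** 2 + (yy - c) ** 2 <= (n / 2.0) ** 2).astype(np.uint8)
# ~pi/4 of the 74x74 canvas is masked, which is roughly comparable to
# the 66x66 square image's 4356 masked pixels
print(int(mask.sum()))
```

This is why, e.g., circle6 at 74x74 has about as many masked pixels as square6 at 66x66, while its total pixel count is noticeably larger.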
Metric¶
We measure performance by "Time per Operation" (TpO for short) and "Cache Miss per Operation" (CMpO for short). TpO is defined as `total time / number of iterations / number of pixels`; the smaller the TpO, the more efficient the parallel algorithm. CMpO is defined as `total cache misses / number of iterations / number of pixels`.
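As a worked example of this metric, the first OpenMP EquSolver entry in the benchmark table below (0.1912 s over 100000 pixels) corresponds to a TpO of 0.3824 ns if the run used 5000 iterations, an iteration count inferred from the table rather than stated explicitly in this section:

```python
# TpO = total time / number of iterations / number of pixels
# (5000 iterations is our inference from the table, not a stated fact)
total_time_s, n_iter, n_pixels = 0.1912, 5000, 100000
tpo_ns = total_time_s / n_iter / n_pixels * 1e9
print(round(tpo_ns, 4))  # 0.3824
```

Normalizing by both iterations and pixels makes runs of different sizes directly comparable, which is why the tables below report TpO rather than raw time alone.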
Result and Analysis¶
We ran benchmark experiments with all seven backends; GCC (the single-threaded C++ implementation) is the baseline. Details of the experiments (commands and tables) can be found on the Benchmark page. For simplicity, we only show plots in the following sections; most are on a logarithmic scale.
EquSolver vs GridSolver¶
If GridSolver's parameters `grid_x` and `grid_y` are carefully tuned, in most cases it performs better than EquSolver with the handwritten backends (OpenMP/MPI/CUDA); the analysis appears in the following sections. However, it is difficult to say which solver is better with the third-party backends, which may be due to the internal design of those libraries.
Analysis for Third-party Backends¶
NumPy¶
NumPy is 10~11x slower than GCC with EquSolver, and 8~9x slower with GridSolver. This shows that the overhead of the NumPy solver is non-negligible: each iteration requires repeated data transfers between C and Python and creates temporary arrays to hold intermediate results. Even though we use vectorized operations for all computations, NumPy cannot take advantage of the memory layout.
Numba¶
Numba is a just-in-time compiler for numerical functions in Python. For EquSolver, Numba is about twice as fast as NumPy; however, for GridSolver, it is about twice as slow. This suggests that Numba does not provide a general speedup for arbitrary NumPy operations, not to mention that it is still slower than GCC.
Taichi¶
Taichi is an open-source, imperative, parallel programming language for high-performance numerical computation. With small input images, Taichi does not bring much benefit; however, as the problem size grows very large, its advantage becomes much clearer. We think this is because the preprocessing step in Taichi is a non-negligible overhead.
On the CPU backend, EquSolver is faster than GCC, while GridSolver performs almost as well as GCC. This shows that the access pattern largely affects the actual performance.
On the GPU backend, although the TpO is twice that of CUDA for extremely large-scale input, Taichi is still faster than the other backends. We are interested in the performance of other third-party GPU solutions and leave that as future work.
Analysis for Non-third-party Backends¶
OpenMP and MPI achieve almost the same speed, but MPI converges more slowly because of the synchronization trade-off. CUDA is the fastest under all conditions.
OpenMP¶
EquSolver is 8~9x faster than GCC; GridSolver is 6~7x faster than GCC. However, both solvers suffer a huge performance drop as the problem size approaches 1M pixels: the threshold is 300k ~ 400k for EquSolver and 500k ~ 600k for GridSolver. We suspected cache misses, which is confirmed by the following numerical results:
| OpenMP | # of pixels | 100000 | 200000 | 300000 | 400000 | 500000 | 600000 | 700000 | 800000 | 900000 | 1000000 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EquSolver | Time (s) | 0.1912 | 0.3728 | 0.6033 | 1.073 | 2.0081 | 3.4242 | 4.1646 | 5.6254 | 6.2875 | 7.6159 |
| EquSolver | TpO (ns) | 0.3824 | 0.3728 | 0.4022 | 0.5365 | 0.8032 | 1.1414 | 1.1899 | 1.4063 | 1.3972 | 1.5232 |
| EquSolver | CMpO | 0.0341 | 0.0201 | 0.1104 | 0.3713 | 0.5799 | 0.6757 | 0.7356 | 0.8083 | 0.8639 | 0.9232 |
| GridSolver | Time (s) | 0.2870 | 0.5722 | 0.8356 | 1.1321 | 1.4391 | 2.2886 | 3.0738 | 4.1967 | 5.5097 | 6.0635 |
| GridSolver | TpO (ns) | 0.5740 | 0.5722 | 0.5571 | 0.5661 | 0.5756 | 0.7629 | 0.8782 | 1.0492 | 1.2244 | 1.2127 |
| GridSolver | CMpO | 0.0330 | 0.0174 | 0.0148 | 0.0522 | 0.1739 | 0.3346 | 0.3952 | 0.4495 | 0.5132 | 0.5394 |
We also investigated the impact of the number of threads on the OpenMP backend's performance. There is a linear speedup when the aforementioned cache-miss problem does not occur; once it does, performance quickly saturates, especially for EquSolver. We believe this is because GridSolver exploits locality better than EquSolver: it has no pixel-relabeling step and keeps all of the 2D information.
MPI¶
Both EquSolver and GridSolver are 6~7x faster than GCC. Like OpenMP, there is a huge performance drop: the threshold is 300k ~ 400k for EquSolver and 400k ~ 500k for GridSolver. The following table and plot confirm our cache-miss hypothesis:
| MPI | # of pixels | 100000 | 200000 | 300000 | 400000 | 500000 | 600000 | 700000 | 800000 | 900000 | 1000000 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EquSolver | Time (s) | 0.2696 | 0.6584 | 0.9549 | 1.6435 | 2.6920 | 3.6933 | 4.7265 | 5.7762 | 6.8305 | 7.7894 |
| EquSolver | TpO (ns) | 0.5392 | 0.6584 | 0.6366 | 0.8218 | 1.0768 | 1.2311 | 1.3504 | 1.4441 | 1.5179 | 1.5579 |
| EquSolver | CMpO | 0.5090 | 0.2743 | 0.2998 | 0.4646 | 0.5995 | 0.7006 | 0.7525 | 0.7951 | 0.8204 | 0.8391 |
| GridSolver | Time (s) | 0.2994 | 0.5948 | 0.9088 | 1.3075 | 1.6024 | 2.1239 | 2.8969 | 3.7388 | 4.4776 | 5.3026 |
| GridSolver | TpO (ns) | 0.5988 | 0.5948 | 0.6059 | 0.6538 | 0.6410 | 0.7080 | 0.8277 | 0.9347 | 0.9950 | 1.0605 |
| GridSolver | CMpO | 0.5054 | 0.2570 | 0.1876 | 0.2008 | 0.2991 | 0.3783 | 0.4415 | 0.4866 | 0.5131 | 0.5459 |
A similar phenomenon occurs on the MPI backend when the number of processes changes:
CUDA¶
EquSolver is 27~44x faster than GCC; GridSolver is 38~42x faster than GCC. The performance is consistent across different input sizes.
We investigated the impact of block size on the CUDA EquSolver. We did not use GridSolver for this study because it requires tuning two parameters, `grid_x` and `grid_y`. As the block size increases, performance first improves, reaches a peak, and finally drops; the best configuration is a block size of 256.
When the block size is too small, more grids are used for the computation, so the overhead of communication across grids increases. When the block size is too large, cache invalidation dominates even though fewer grids are used: since this CUDA kernel does not use shared memory and is launched without barriers, we suspect the program often reads values that cannot be cached and often writes values that invalidate the cache.
Contribution¶
Each group member’s contributions are on GitHub.
Reference¶
[1] Pérez, Patrick, Michel Gangnet, and Andrew Blake. "Poisson image editing." ACM SIGGRAPH 2003 Papers. 2003. 313-318.
[2] Jacobi Method, https://en.wikipedia.org/wiki/Jacobi_method
[3] Harris, Charles R., et al. "Array programming with NumPy." Nature 585.7825 (2020): 357-362.
[4] Lam, Siu Kwan, Antoine Pitrou, and Stanley Seibert. "Numba: a LLVM-based Python JIT compiler." Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. 2015.
[5] Hu, Yuanming, et al. "Taichi: a language for high-performance computation on spatially sparse data structures." ACM Transactions on Graphics (TOG) 38.6 (2019): 1-16.