Benchmark
=========

Environment configuration
-------------------------

OS: Red Hat Enterprise Linux Workstation 7.9 (Maipo)

CPU: 8x Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz

GPU: GeForce RTX 2080 8G

Python: 3.6.8

Python package version:

-  numpy==1.19.5
-  opencv-python==4.5.5.64
-  mpi4py==3.1.3
-  numba==0.53.1
-  taichi==1.0.0

Problem size vs backend
-----------------------

To run a test and get the time spent:

.. code:: bash

   $ fpie -s $NAME.png -t $NAME.png -m $NAME.png -o result.png -n 5000 -b $BACKEND --method $METHOD ...

The following tables show the best performance of each backend. Instead of using the default values, the other parameters were tuned on square10/circle10 and then applied to the remaining tests.

|image0|

The plots above show the per-pixel operation time cost.

EquSolver
~~~~~~~~~

The benchmark commands for squareX and circleX:

.. code:: bash

   # numpy
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b numpy --method equ
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b numpy --method equ
   # numba
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b numba --method equ
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b numba --method equ
   # gcc
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b gcc --method equ
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b gcc --method equ
   # openmp
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b openmp --method equ -c 8
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b openmp --method equ -c 8
   # mpi
   mpiexec -np 8 fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b mpi --method equ --mpi-sync-interval 100
   mpiexec -np 8 fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b mpi --method equ --mpi-sync-interval 100
   # cuda
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b cuda --method equ -z 256
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b cuda --method equ -z 256
   # taichi-cpu
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b taichi-cpu --method equ -c 8
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b taichi-cpu --method equ -c 8
   # taichi-gpu
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b taichi-gpu --method equ -z 1024
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b taichi-gpu --method equ -z 1024

========== ======= ======= ======== ======== =========
EquSolver  square6 square7 square8  square9  square10
========== ======= ======= ======== ======== =========
# of vars  4097    16385   65537    262145   1048577
NumPy      0.8367s 3.2142s 12.1836s 52.4939s 230.5375s
Numba      0.7257s 1.9472s 7.0761s  32.4084s 149.3390s
GCC        0.0740s 0.3013s 1.2061s  5.0351s  22.0276s
OpenMP     0.0176s 0.0423s 0.1447s  0.5835s  8.6203s
MPI        0.0127s 0.0488s 0.1757s  0.8253s  8.3310s
CUDA       0.0112s 0.0141s 0.0272s  0.1835s  0.6967s
Taichi-CPU 0.4437s 0.5178s 0.7667s  1.9061s  13.2009s
Taichi-GPU 0.5730s 0.5727s 0.6022s  0.8101s  1.4430s
========== ======= ======= ======== ======== =========

========== ======= ======= ======== ======== =========
EquSolver  circle6 circle7 circle8  circle9  circle10
========== ======= ======= ======== ======== =========
# of vars  4256    16676   65972    262338   1049486
NumPy      0.8618s 3.2280s 12.5615s 52.7161s 226.5578s
Numba      0.7430s 1.9789s 7.1499s  32.1932s 132.7537s
GCC        0.0764s 0.3062s 1.2115s  4.9785s  22.1516s
OpenMP     0.0179s 0.0391s 0.1301s  0.5177s  8.2778s
MPI        0.0131s 0.0494s 0.1767s  0.8155s  8.3823s
CUDA       0.0113s 0.0139s 0.0274s  0.1831s  0.6966s
Taichi-CPU 0.4461s 0.5148s 0.7687s  1.8646s  12.9343s
Taichi-GPU 0.5735s 0.5679s 0.5971s  0.7987s  1.4379s
========== ======= ======= ======== ======== =========

GridSolver
~~~~~~~~~~

The benchmark commands for squareX and circleX:

.. code:: bash

   # numpy
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b numpy --method grid
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b numpy --method grid
   # numba
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b numba --method grid
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b numba --method grid
   # gcc
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b gcc --method grid --grid-x 8 --grid-y 8
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b gcc --method grid --grid-x 8 --grid-y 8
   # openmp
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b openmp --method grid -c 8 --grid-x 2 --grid-y 16
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b openmp --method grid -c 8 --grid-x 2 --grid-y 16
   # mpi
   mpiexec -np 8 fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b mpi --method grid --mpi-sync-interval 100
   mpiexec -np 8 fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b mpi --method grid --mpi-sync-interval 100
   # cuda
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b cuda --method grid --grid-x 2 --grid-y 128
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b cuda --method grid --grid-x 2 --grid-y 128
   # taichi-cpu
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b taichi-cpu --method grid -c 8 --grid-x 8 --grid-y 128
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b taichi-cpu --method grid -c 8 --grid-x 8 --grid-y 128
   # taichi-gpu
   fpie -s square10.png -t square10.png -m square10.png -o result.png -n 5000 -b taichi-gpu --method grid -z 1024 --grid-x 16 --grid-y 64
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b taichi-gpu --method grid -z 1024 --grid-x 16 --grid-y 64

========== ======= ======= ======== ======== =========
GridSolver square6 square7 square8  square9  square10
========== ======= ======= ======== ======== =========
# of vars  4356    16900   66564    264196   1052676
NumPy      0.7809s 2.8823s 12.3242s 51.7496s 209.5504s
Numba      1.5838s 6.0720s 24.0901s 99.5048s 410.6119s
GCC        0.0884s 0.3504s 1.3832s  5.5402s  24.6482s
OpenMP     0.0177s 0.0547s 0.2011s  0.7805s  5.4012s
MPI        0.0136s 0.0516s 0.1999s  0.7956s  5.4109s
CUDA       0.0116s 0.0152s 0.0330s  0.1458s  0.5738s
Taichi-CPU 0.5308s 0.8638s 1.6196s  4.8147s  20.2245s
Taichi-GPU 0.6538s 0.6505s 0.6638s  0.8298s  1.3439s
========== ======= ======= ======== ======== =========

========== ======= ======= ======== ========= =========
GridSolver circle6 circle7 circle8  circle9   circle10
========== ======= ======= ======== ========= =========
# of vars  5476    21316   84100    335241    1338649
NumPy      0.8554s 3.0602s 13.1915s 55.3018s  224.0399s
Numba      1.8680s 7.1174s 28.1826s 117.5155s 481.5718s
GCC        0.0997s 0.3768s 1.4753s  5.8558s   25.1236s
OpenMP     0.0219s 0.0670s 0.2498s  0.9838s   6.0868s
MPI        0.0155s 0.0614s 0.2446s  0.9810s   5.8527s
CUDA       0.0113s 0.0150s 0.0334s  0.1507s   0.5954s
Taichi-CPU 0.5558s 0.8727s 1.6317s  4.8740s   20.2178s
Taichi-GPU 0.6447s 0.6418s 0.6521s  0.8309s   1.3578s
========== ======= ======= ======== ========= =========

Per backend performance
-----------------------

In this section, we perform ablation studies on the OpenMP, MPI, and CUDA backends. We use circle9 and circle10 with 5000 iterations as the experiment setting.

OpenMP
~~~~~~

|image1|

Command to run:

.. code:: bash

   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b openmp --method equ -c 8
   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b openmp --method grid -c 8 --grid-x 2 --grid-y 16

In the tables below, the column headers (1 to 8) are the number of CPU cores, i.e., the value passed to ``-c``.

========= ======= ======= ======= ======= =======
circle9   1       2       4       6       8
========= ======= ======= ======= ======= =======
# of vars 262338  262338  262338  262338  262338
EquSolver 3.5689s 1.7679s 0.8987s 0.6344s 0.4982s
========= ======= ======= ======= ======= =======

========== ======= ======= ======= ======= =======
circle9    1       2       4       6       8
========== ======= ======= ======= ======= =======
# of vars  335241  335241  335241  335241  335241
GridSolver 6.2717s 3.1530s 1.8758s 1.2955s 0.9897s
========== ======= ======= ======= ======= =======

========= ======== ======= ======= ======= =======
circle10  1        2       4       6       8
========= ======== ======= ======= ======= =======
# of vars 1049486  1049486 1049486 1049486 1049486
EquSolver 16.9218s 9.2764s 7.8828s 8.2016s 8.0285s
========= ======== ======= ======= ======= =======

========== ======== ======== ======= ======= =======
circle10   1        2        4       6       8
========== ======== ======== ======= ======= =======
# of vars  1338649  1338649  1338649 1338649 1338649
GridSolver 26.7571s 13.5669s 8.2486s 6.4654s 6.2539s
========== ======== ======== ======= ======= =======

MPI
~~~

|image2|

Command to run:

.. code:: bash

   mpiexec -np 8 fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b mpi --method equ --mpi-sync-interval 100
   mpiexec -np 8 fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b mpi --method grid --mpi-sync-interval 100

In the tables below, the column headers (1 to 8) are the number of MPI processes, i.e., the value passed to ``mpiexec -np``.

========= ======= ======= ======= ======= =======
circle9   1       2       4       6       8
========= ======= ======= ======= ======= =======
# of vars 262338  262338  262338  262338  262338
EquSolver 4.9217s 2.4655s 1.3378s 0.9310s 0.7996s
========= ======= ======= ======= ======= =======

========== ======= ======= ======= ======= =======
circle9    1       2       4       6       8
========== ======= ======= ======= ======= =======
# of vars  335241  335241  335241  335241  335241
GridSolver 6.2136s 3.1381s 1.8817s 1.3124s 0.9822s
========== ======= ======= ======= ======= =======

========= ======== ======== ======= ======= =======
circle10  1        2        4       6       8
========= ======== ======== ======= ======= =======
# of vars 1049486  1049486  1049486 1049486 1049486
EquSolver 22.1275s 11.5566s 8.2541s 8.2208s 8.3238s
========= ======== ======== ======= ======= =======

========== ======== ======== ======= ======= =======
circle10   1        2        4       6       8
========== ======== ======== ======= ======= =======
# of vars  1338649  1338649  1338649 1338649 1338649
GridSolver 26.8360s 13.6866s 8.3945s 6.6107s 5.8929s
========== ======== ======== ======= ======= =======

CUDA
~~~~

|image3|

Command to run:

.. code:: bash

   fpie -s circle10.png -t circle10.png -m circle10.png -o result.png -n 5000 -b cuda --method equ -z 1024

In the tables below, the column headers (16 to 1024) are the CUDA block size, i.e., the value passed to ``-z``.

========= ======= ======= ======= ======= ======= ======= =======
circle9   16      32      64      128     256     512     1024
========= ======= ======= ======= ======= ======= ======= =======
# of vars 262338  262338  262338  262338  262338  262338  262338
EquSolver 0.1885s 0.1844s 0.1841s 0.1831s 0.1823s 0.1861s 0.1893s
========= ======= ======= ======= ======= ======= ======= =======

========= ======= ======= ======= ======= ======= ======= =======
circle10  16      32      64      128     256     512     1024
========= ======= ======= ======= ======= ======= ======= =======
# of vars 1049486 1049486 1049486 1049486 1049486 1049486 1049486
EquSolver 0.7220s 0.7038s 0.7012s 0.6976s 0.6973s 0.6983s 0.7037s
========= ======= ======= ======= ======= ======= ======= =======

.. |image0| image:: /_static/images/benchmark.png
.. |image1| image:: /_static/images/openmp.png
.. |image2| image:: /_static/images/mpi.png
.. |image3| image:: /_static/images/cuda.png
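
The OpenMP ablation timings are easier to compare as speedup and parallel efficiency relative to the single-core run. The sketch below is not part of ``fpie``: the helper name ``scaling`` and the two dictionaries are hypothetical, and it assumes the column headers 1 to 8 in the OpenMP EquSolver tables are core counts. The timing values are copied from those tables.

```python
from typing import Dict, List, Tuple


def scaling(times: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (cores, speedup, efficiency) rows relative to the 1-core time.

    speedup    = T(1) / T(n)
    efficiency = speedup / n
    """
    base = times[1]
    return [(n, base / t, base / t / n) for n, t in sorted(times.items())]


# OpenMP EquSolver timings in seconds, keyed by assumed core count.
CIRCLE9_EQU = {1: 3.5689, 2: 1.7679, 4: 0.8987, 6: 0.6344, 8: 0.4982}
CIRCLE10_EQU = {1: 16.9218, 2: 9.2764, 4: 7.8828, 6: 8.2016, 8: 8.0285}

if __name__ == "__main__":
    for name, times in (("circle9", CIRCLE9_EQU), ("circle10", CIRCLE10_EQU)):
        print(name)
        for cores, speedup, efficiency in scaling(times):
            print(f"  {cores} cores: speedup {speedup:.2f}x, "
                  f"efficiency {efficiency:.0%}")
```

With these numbers, circle9 reaches roughly a 7.2x speedup on 8 cores, while circle10 stays near 2.1x from 4 cores onward. The same helper can be applied to the GridSolver and MPI rows.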