SPOOLES vs. PARDISO performance

Hello all,

I have compiled ccx 2.17 once with spooles and once with pardiso and compared the runtimes.
For pardiso I installed lapack, and OMP_NUM_THREADS=8 was set for each test.
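For reference, the per-run setup looked roughly like this (the input file name is a placeholder for my model; as far as I know CalculiX can also take the solver thread count from CCX_NPROC_EQUATION_SOLVER, so treat that line as an assumption):

```shell
# Sketch of the test setup; "model_reduced" is a placeholder input name.
export OMP_NUM_THREADS=8            # thread count used for all runs
# export CCX_NPROC_EQUATION_SOLVER=8  # optional: solver threads only
./ccx_2.17 -i model_reduced         # same input for the spooles and pardiso builds
```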

(If you are interested, you can find the input files here. I created a smaller input file (reduced elsets) for testing purposes; the full element sets would take more than an hour.)

I got the following runtimes for the reduced elset input file (the same observation can be made with the full element set input file, but the calculation takes longer):

  • ccx 2.17 with spooles (spooles not compiled as the multithreaded version): 405 seconds
  • ccx 2.17 with spooles (spooles compiled as the multithreaded version): 207 seconds
  • ccx 2.17 with pardiso: 316 seconds

I was a little surprised, because I thought pardiso should be faster. Does anyone have an idea why pardiso is slower, or what I can do to improve the pardiso runtimes?

If I use the full element set input file, MT spooles takes about 42 minutes, and pardiso more than 90 minutes.

Best regards,
Patrick

PS: The difference between the spooles MT and non-MT versions is the parallelization of the factorization.
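For anyone who wants to reproduce the MT build: a rough sketch of how the two spooles archives are built (the exact make targets may differ in your SPOOLES.2.2 tree; the resulting archive paths match what the ccx makefile links against):

```shell
# Hedged sketch: building the serial and multithreaded spooles archives.
# Target names can vary between SPOOLES.2.2 distributions.
cd SPOOLES.2.2
make lib          # produces spooles.a (serial factorization)
cd MT/src
make              # produces MT/src/spoolesMT.a (parallel factorization)
```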

Hello @patrick,

I cannot check your performance analysis, but I suppose:

  • performance of SPOOLES is less than Intel MKL Pardiso
  • performance of Intel MKL Pardiso is less than https://www.pardiso-project.org/

Are you using multi-threading for https://www.pardiso-project.org/?

Thank you

Thanks for your reply. I would expect the same, but I have not tried Intel MKL Pardiso yet, because at the moment I'm not sure how to do this. I was just surprised because everyone said that spooles is slower, but in my case that only holds if I do not use the fully multithreaded spooles.

I use the https://www.pardiso-project.org/ library libpardiso600-GNU800-X86-64.so
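For completeness, the runtime environment for the Pardiso Project library looks roughly like this on my side (the paths are placeholders for my setup; PARDISO_LIC_PATH is the variable the Pardiso Project uses to locate the pardiso.lic license file, and the thread count comes from OMP_NUM_THREADS):

```shell
# Sketch of the runtime environment for libpardiso600-GNU800-X86-64.so;
# the paths below are placeholders.
export LD_LIBRARY_PATH=../../../pardiso:$LD_LIBRARY_PATH
export PARDISO_LIC_PATH=../../../pardiso   # directory containing pardiso.lic
export OMP_NUM_THREADS=8                   # threads used by the solver
```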

and the makefile for ccx:

CFLAGS = -Wall -g -O2 -fopenmp -I ../../../SPOOLES.2.2 -DARCH="Linux" -DSPOOLES -DPARDISO -DARPACK -DMATRIXSTORAGE -DUSE_MT=8
FFLAGS = -Wall -g -O2 -fopenmp

CC=cc
FC=gfortran

.c.o :
    $(CC) $(CFLAGS) -c $<
.f.o :
    $(FC) $(FFLAGS) -c $<

include Makefile.inc

SCCXMAIN = ccx_2.17.c

OCCXF = $(SCCXF:.f=.o)
OCCXC = $(SCCXC:.c=.o)
OCCXMAIN = $(SCCXMAIN:.c=.o)

DIR=../../../SPOOLES.2.2

LIBS = \
       $(DIR)/MT/src/spoolesMT.a \
       $(DIR)/spooles.a \
       ../../../ARPACK/libarpack_INTEL.a \
       -L../../../pardiso -lpardiso600-GNU800-X86-64 -lpthread -lm -llapack -lc

ccx_2.17_MT: $(OCCXMAIN) ccx_2.17_MT.a
    ./date.pl; $(CC) $(CFLAGS) -c ccx_2.17.c; $(FC) -fopenmp -Wall -O2 -g -o $@ $(OCCXMAIN) ccx_2.17_MT.a $(LIBS)

ccx_2.17_MT.a: $(OCCXF) $(OCCXC)

While Pardiso is solving the equation system, htop shows that all threads are working at 100%, so I'm quite sure I am using the multi-threaded version.

Maybe the bottleneck is the standard lapack package, which the Pardiso Project solver needs?
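If the reference LAPACK really is the bottleneck, one thing to try (an assumption on my part, not something measured here) is linking an optimized BLAS/LAPACK such as OpenBLAS instead of the reference library, i.e. changing only the end of the LIBS link line:

```shell
# Hypothetical change to the LIBS line: swap the reference LAPACK for
# OpenBLAS, which bundles optimized, threaded BLAS/LAPACK routines.
#   before: ... -lpardiso600-GNU800-X86-64 -lpthread -lm -llapack -lc
#   after:
-L../../../pardiso -lpardiso600-GNU800-X86-64 -lpthread -lm -lopenblas -lc
```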

At the moment I have no explanation for this behaviour, and for quite small input files the Pardiso Project solver is actually faster (e.g. 7 seconds compared to 11 with spooles).

I had a more detailed look at the runtimes: (all times in sec)

PARDISO:

total_time;304.65253
readinput;0.66694
fort_allocation;11.36228
fort_calinput;39.87290
init_var;0.00001
descascade;0.00002
det_struct_mat;1.69151
linstatic_total;251.01715
linstatic_stress1;0.08033
linstatic_stiffness;1.01056
linstatic_spooles;0.00000
linstatic_pardiso;247.17648
linstatic_stress2;1.21279
spooles_factoring;0.00000
spooles_solve;0.00000
spooles_cleanup;0.00000
pardiso_factoring;243.78792
pardiso_solve;3.28830
pardiso_cleanup;0.10025

SPOOLES full multithreading:

total_time;194.28114
readinput;0.63942
fort_allocation;11.48573
fort_calinput;40.45663
init_var;0.00001
descascade;0.00000
det_struct_mat;1.80309
linstatic_total;139.84573
linstatic_stress1;0.08460
linstatic_stiffness;1.03885
linstatic_spooles;135.93945 
linstatic_pardiso;0.00000
linstatic_stress2;1.23636
spooles_factoring;134.12712 (8 Cores)
spooles_solve;1.20269 (8 Cores)
spooles_cleanup;0.60964
pardiso_factoring;0.00000
pardiso_solve;0.00000
pardiso_cleanup;0.00000

SPOOLES part multithreading:

total_time;399.39059
readinput;0.62447
fort_allocation;14.16187
fort_calinput;44.83856
init_var;0.00001
descascade;0.00000
det_struct_mat;1.75905
linstatic_total;337.99054
linstatic_stress1;0.08328
linstatic_stiffness;1.03335
linstatic_spooles;334.12296
linstatic_pardiso;0.00000
linstatic_stress2;1.20208
spooles_factoring;331.44077 (1 Core)
spooles_solve;2.15284 (8 Cores)
spooles_cleanup;0.52933
pardiso_factoring;0.00000
pardiso_solve;0.00000
pardiso_cleanup;0.00000

Only the entries with spooles or pardiso in the variable name are interesting; the other times are the same because they belong to the ccx part.
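Since the dumps above are plain name;value pairs, the solver-related entries can be filtered with a one-liner; here is a sketch run on a few sample lines from the PARDISO dump (the printf just stands in for the real timing file):

```shell
# Filter solver-related entries from a "name;seconds" timing dump.
printf 'total_time;304.65253\npardiso_factoring;243.78792\npardiso_solve;3.28830\n' \
  | awk -F';' '/spooles|pardiso/ {printf "%s = %ss\n", $1, $2}'
# prints:
#   pardiso_factoring = 243.78792s
#   pardiso_solve = 3.28830s
```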

Maybe someone is interested in this information…

In the meantime I have made a comparison between MKL Pardiso, the Pardiso Project solver and spooles. In my case I was not able to make the Pardiso Project solver faster than spooles (the full MT version). The MKL Pardiso solver is, in my case, about 20% faster than spooles or the Pardiso Project solver.
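In case someone wants to repeat the MKL comparison: a sketch of how the link flags change for MKL Pardiso with gcc/gfortran and OpenMP threading. These are the standard MKL link-line flags, but MKLROOT and the exact flag set depend on the MKL version, so treat this as an assumption rather than a verified recipe:

```shell
# Hedged sketch: linking ccx against Intel MKL's Pardiso instead of the
# Pardiso Project library (gcc/gfortran, 64-bit LP64, OpenMP threading).
LIBS_MKL="-L${MKLROOT}/lib/intel64 \
  -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core \
  -lgomp -lpthread -lm -ldl"
```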

Hello,
As I understand it, your research shows the dependency between solver and calculation time. In the paper Implementation of the CUDA Cusp and CHOLMOD Solvers in CalculiX I see that it also depends on the model; see Figure 9 of that paper.