CalculiX and PaStiX solver Windows version

Hi all,

I was able to compile the Windows version of ccx with the PaStiX solver.

However, this version does not work yet: ccx crashes immediately with an error.

I have the first working version of ccx on Windows!

My benchmark:
ccx 2.17 PaStiX, metis, no parsec, no starpu: 590s
ccx 2.17 PaStiX, scotch, no parsec, no starpu: 493s
ccx 2.17 Intel PARDISO: 649s
ccx 2.17 SPOOLES: 825s
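
To put the timings above in relative terms, here is a minimal sketch in C that turns them into speedups against SPOOLES. The function names are made up for illustration; only the four timings come from the post.

```c
#include <stdio.h>

/* Speedup of a candidate run relative to a baseline run, both given as
   wall-clock seconds. Values above 1.0 mean the candidate is faster. */
double speedup(double baseline_s, double candidate_s)
{
    return baseline_s / candidate_s;
}

/* Print each solver's time and its speedup relative to SPOOLES,
   using the timings quoted in the post. */
void print_speedups(void)
{
    const char  *names[] = { "PaStiX (Metis)", "PaStiX (Scotch)",
                             "Intel PARDISO", "SPOOLES" };
    const double secs[]  = { 590.0, 493.0, 649.0, 825.0 };
    const double spooles = 825.0;   /* baseline: slowest solver in the list */

    for (int i = 0; i < 4; ++i)
        printf("%-16s %6.0f s  %.2fx vs SPOOLES\n",
               names[i], secs[i], speedup(spooles, secs[i]));
}
```

Calling print_speedups() from main() lists the four solvers; for this model PaStiX with Scotch comes out roughly 1.67 times faster than SPOOLES.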

Test version (i4 build, Metis ordering):
http://s000.tinyupload.com/index.php?file_id=87740980074093979332

password: nCG4W48PgG

New version, based on Scotch and linked against MKL PARDISO (PARDISO libraries required!):
https://gofile.io/d/klndR2

password: yCxFxNAkFW

With this version you can directly compare PaStiX and PARDISO on Windows.

@rafal.brzegowy, thanks for taking the time to compile and share these versions. I have been using your PARDISO versions for months and they work perfectly! I will test this one as soon as possible.
Best regards

Hey @rafal.brzegowy, could you share the makefile that you used to compile PaStiX without PaRSEC? I am encountering an error because spm keeps looking for parsec.h.

Hi,

Did you use the cudaless version of PaStiX from?:

My Makefile (alpha version :grin:)

OPT = -O2 -m64

#Specify where to store the generated .o files
OBJDIR = Multi_v2

CFLAGS = -Wall $(OPT) -fopenmp -posix -fpic -I$(PASTIX_INCLUDE) -I$(HWLOC_INCLUDE) -I$(STARPU_INCLUDE) -DARCH="Linux" -DPARDISO -DMATRIXSTORAGE -DUSE_MT=1 -DNETWORKOUT -DCALCULIX_EXTERNAL_BEHAVIOURS_SUPPORT -DPASTIX -DPASTIX_FP32 -fcommon
FFLAGS = -Wall $(OPT) -fopenmp -posix -fpic -fallow-argument-mismatch 

#-DPASTIX_GPU
#ARPACK
CFLAGS+= -I /usr/local/ARPACK_OpenBLAS -DARPACK

#SPOOLES
CFLAGS+= -I /usr/local/SPOOLES.2.2 -DSPOOLES

CC=gcc
FC=gfortran

#Source files in this folder and in the adapter directory
$(OBJDIR)/%.o : %.c
	$(CC) $(CFLAGS) -c $< -o $@
$(OBJDIR)/%.o : %.f
	$(FC) $(FFLAGS) -c $< -o $@

include Makefile.inc

SCCXMAIN = ccx_2.17.c

OCCXF = $(SCCXF:%.f=$(OBJDIR)/%.o)
OCCXC = $(SCCXC:%.c=$(OBJDIR)/%.o)
OCCXMAIN = $(SCCXMAIN:%.c=$(OBJDIR)/%.o)

DIR1=/usr/local/SPOOLES.2.2
DIR2=/usr/local/ARPACK_OpenBLAS
MKL=/usr/local/MKL2020U2

PASTIX_INCLUDE = /usr/local/PaStiX/pastix_i4/include
HWLOC_INCLUDE = /usr/local/PaStiX/hwloc_i4/include
STARPU_INCLUDE = /usr/local/PaStiX/starpu_i8/include
PASTIX_LIBS = \
       /usr/local/PaStiX/hwloc_i4/lib64/libhwloc-15.dll \
       /usr/local/PaStiX/pastix_i4/lib/libpastix.a \
       /usr/local/PaStiX/pastix_i4/lib/libspm.a \
       /usr/local/PaStiX/pastix_i4/lib/libpastix_kernels.a

LIBS = \
       $(DIR1)/MT/src/spoolesMT.a \
       $(DIR1)/spooles.a \
       $(DIR2)/libarpack_x64.a \
       /mingw64/lib/libopenblas.a \
       $(MKL)/mkl_core.dll \
       $(MKL)/mkl_intel_thread.dll \
       $(MKL)/mkl_intel_lp64_dll.lib \
       $(PASTIX_LIBS)
       
$(OBJDIR)/ccx_PASTIX.exe: $(OBJDIR) $(OCCXMAIN) $(OBJDIR)/ccx_2.17_MT.a $(LIBS)
	 ./date.pl; $(CC) $(CFLAGS) -static-libgcc -static-libgfortran -static-libgcc -static-libstdc++ \
	 -Wl,-Bstatic -lm -lcrypt -lpthread -lwinpthread -lgomp -lquadmath -lstdc++ -ldl -c ccx_2.17.c;
	 $(FC) $(FFLAGS) -static-libgcc -static-libgfortran -static-libgcc -static-libstdc++ \
	 -Wl,-Bstatic -lm -lpthread -lstdc++ \
	 -Wl,-Bstatic,--whole-archive -lwinpthread -lgomp -lquadmath \
	 -Wl,--no-whole-archive -o $@ $(OCCXMAIN) $(OBJDIR)/ccx_2.17_MT.a $(LIBS) \
	 -L/mingw64/x86_64-w64-mingw32/lib -lopenblas -lmetis -lscotch -lscotcherrexit -lstdc++

$(OBJDIR)/ccx_2.17_MT.a: $(OCCXF) $(OCCXC)
	ar vr $@ $?

$(OBJDIR):
	mkdir -p $(OBJDIR)

clean:
	rm -f $(OBJDIR)/*.o $(OBJDIR)/ccx_2.17_MT.a $(OBJDIR)/ccx_PASTIX.exe

Additional information / errors:

Thanks for the resources! I am actually trying to build PaRSEC with CUDA and use that to build PaStiX. Have you been successful in doing that?

I have tried but without success.

Hi,

Are these executables dynamically linked against the MFront libraries?

Thank you,

Hi,

Yes. The build uses -DCALCULIX_EXTERNAL_BEHAVIOURS_SUPPORT, so you need:
libCALCULIXBEHAVIOUR.dll
libCalculiXInterface.dll
libNHU2.dll
libstdc++-6.dll
libTFELException.dll
libTFELMaterial.dll
libTFELMath.dll
libTFELNUMODIS.dll
libTFELUtilities.dll

PS. There is progress with the (original) PaRSEC and mingw64/cygwin:

Hi,
Important: for the best possible performance (in all cases, with or without PaRSEC):

  1. set OPENBLAS_NUM_THREADS=1
  2. use only physical cores (do not count hyper-threaded logical processors)
  3. set PASTIX_MIXED_PRECISION=1

Please try these settings.

Hi,

I ran a simple test using the Mazars material model. The current CCX executable stopped with an error message reporting one fatal error while reading the input deck. Why does this happen, even though the number of constants (8) and depvar (3) are the same as in the example files?

I also tried the previous version of CCX (2.13). It still won’t run; however, it gives a hint about the error in usermaterials: anisotropic definition is not complete.

P.S. Which MFront version has been integrated and compiled? It has no DruckerPragerCap material model.

Thank you,

Hi,
If you have 8 constants, try adding the temperature on a new line (as a 9th value):

*User Material, constants=8
<YoungModulus>, <PoissonRatio>, <Ac>, <At>, <Bc>, <Bt>, <k>, <ed0>
<temp>

Thank you for the guidance. It runs well now, but it seems to take too long to finish. It is significantly slower than the Modified MC material model (from several seconds to minutes), and it was about 95% complete when I stopped the calculation. That is not too surprising, since Mazars/MFront is a brittle damage material model.

There is an update on the official MFront website: the latest version includes the DruckerPragerCap material model. It looks like a big improvement in both accuracy and computation time compared to Modified MC and Mazars.

Could you kindly share an updated version of the MFront/CalculiX integration? Many thanks for your time and effort.

Previous links have expired so I am posting new ones, there are two versions in the archive:

  1. Requires a PARDISO (mkl) library
  2. Does not require PARDISO libraries (PARDISO cannot be used)

For testing, both versions let you choose the ordering and the scheduler through environment variables (these can be added to cmdStartup.bat from bConverged):

set PASTIX_ORDERING=0
0 - Scotch, 1 - Metis

and:

set PASTIX_SCHEDULER=1
0 - Static, 1 - StarPU, 2 - PaRSEC (not working yet), 3 - Sequential

My patch to pastix.c:

    // Set best PaStiX parameters for CalculiX usage.
    // getenv() returns NULL when a variable is not set, so check before atoi().
    const char* pastix_ordering = getenv("PASTIX_ORDERING");
    if (pastix_ordering != NULL && atoi(pastix_ordering) == 1) {
        iparm[IPARM_ORDERING] = PastixOrderMetis;
    }
    else {
        iparm[IPARM_ORDERING] = PastixOrderScotch;      // default: Scotch
    }
    if (mode == AS) {
        iparm[IPARM_SCHEDULER] = PastixSchedStatic;
    }
    else {
        const char* pastix_scheduler = getenv("PASTIX_SCHEDULER");
        int scheduler = (pastix_scheduler != NULL) ? atoi(pastix_scheduler) : 0;
        if (scheduler == 1) {
            iparm[IPARM_SCHEDULER] = PastixSchedStarPU;
        }
        else if (scheduler == 2) {
            iparm[IPARM_SCHEDULER] = PastixSchedParsec;     // not working yet
        }
        else if (scheduler == 3) {
            iparm[IPARM_SCHEDULER] = PastixSchedSequential;
        }
        else {
            iparm[IPARM_SCHEDULER] = PastixSchedStatic;     // default: static
        }
    }
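
A general note on switches like these: getenv() returns NULL when a variable is not set, so the value needs a check before it is parsed. Below is a minimal sketch of a helper that does this with strtol(). The name env_int_or_default is hypothetical; it is not part of PaStiX or CalculiX.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: read an integer from an environment variable.
   Returns `fallback` when the variable is unset, empty, not a plain
   decimal integer, or out of range for int. */
int env_int_or_default(const char *name, int fallback)
{
    const char *text = getenv(name);
    if (text == NULL || *text == '\0')
        return fallback;

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);

    /* Reject overflow, empty parses, and trailing garbage such as "1x". */
    if (errno != 0 || end == text || *end != '\0')
        return fallback;
    if (value < INT_MIN || value > INT_MAX)
        return fallback;

    return (int)value;
}
```

With such a helper, the ordering choice could be written as env_int_or_default("PASTIX_ORDERING", 0) == 1 ? PastixOrderMetis : PastixOrderScotch, which keeps Scotch as the default when the variable is missing or malformed.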

password: 4VERsW9m8h

My very simple benchmark:
[image: benchmark results]

Hey Rafa,

Have you seen any issues with the time it takes to factorize the matrix in the problems that you have run? I am trying to run my own PaStiX build on Linux, and factorizing the matrix seems to take a long time.

+-------------------------------------------------+
+     PaStiX : Parallel Sparse matriX package     +
+-------------------------------------------------+
  Version:                                   6.0.1
  Schedulers:
    sequential:                            Enabled
    thread static:                         Started
    thread dynamic:                       Disabled
    PaRSEC:                               Disabled
    StarPU:                               Disabled
  Number of MPI processes:                       1
  Number of threads per process:                24
  Number of GPUs:                                0
  MPI communication support:              Disabled
  Distribution level:                     2D( 128)
  Blocking size (min/max):             1024 / 2048

  Matrix type:  General
  Arithmetic:   Float
  Format:       CSC
  N:            1021086
  nnz:          42850334

+-------------------------------------------------+
  Ordering step :
    Ordering method is: Scotch
    Time to compute ordering:              6.7668
+-------------------------------------------------+
  Symbolic factorization step:
    Symbol factorization using: Fax Direct
    Number of nonzeroes in L structure:   940155805
    Fill-in of L:                         21.940455
    Time to compute symbol matrix:        0.4936
+-------------------------------------------------+
  Reordering step:
    Split level:                                 0
    Stoping criteria:                           -1
    Time for reordering:                  1.1074
+-------------------------------------------------+
  Analyse step:
    Number of non-zeroes in blocked L:    1880311610
    Fill-in:                              43.880909
    Number of operations in full-rank LU   :     5.52 TFlops
    Prediction:
      Model:                             AMD 6180  MKL
      Time to factorize:                  113.1075
    Time for analyze:                     0.1228
+-------------------------------------------------+
  Factorization step:
    Factorization used: LU
    Time to initialize internal csc:      0.8904
    Time to initialize coeftab:           0.7960
    Time to factorize:                    100.6078  (56.19 GFlop/s)
    Number of operations:                       5.52 TFlops
    Number of static pivots:                     0
    Time to solve:                        10.5152
    - iteration 1 :
         total iteration time                   6.35
         error                                  0.00026925
    - iteration 2 :
         total iteration time                   6.71
         error                                  1.8872e-06
    - iteration 3 :
         total iteration time                   7.77
         error                                  4.9107e-08
    - iteration 4 :
         total iteration time                   7.04
         error                                  3.1283e-10
    - iteration 5 :
         total iteration time                   7.2
         error                                  1.4821e-12
    - iteration 6 :
         total iteration time                   6.77
         error                                  2.6222e-15
    Time for refinement:                  42.4008
________________________________________

CSC Conversion Time: 0.409282
Init Time: 8.907896
Factorize Time: 102.307430
Solve Time: 53.063267
Clean up Time: 0.000001
---------------------------------
Sum: 164.687876

Total PaStiX Time: 164.687876
CCX without PaStiX Time: 19.732016
Share of PaStiX Time: 0.893005
Total Time: 184.419891
Reusability: 0 : 1
________________________________________

Please take a look:

My factorization performance is 130.29 GFlop/s with "set OPENBLAS_NUM_THREADS=1".
My factorization performance is 19.61 GFlop/s with "set OPENBLAS_NUM_THREADS=8".

If “OPENBLAS_NUM_THREADS” is not set at all, the value is taken from “OMP_NUM_THREADS”.
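
The precedence described above can be sketched in C as follows. This only models the rule as stated in this post (OPENBLAS_NUM_THREADS first, then OMP_NUM_THREADS); it is not OpenBLAS's actual code, and both function names are made up for illustration.

```c
#include <stdlib.h>

/* Parse a positive thread count from an environment variable.
   Returns 0 when the variable is unset or does not hold a positive number. */
static int positive_env_int(const char *name)
{
    const char *text = getenv(name);
    if (text == NULL)
        return 0;
    int value = atoi(text);
    return value > 0 ? value : 0;
}

/* Thread count the BLAS layer would pick up under the rule from the post:
   OPENBLAS_NUM_THREADS wins; otherwise OMP_NUM_THREADS is used.
   Returns 0 when neither is set, i.e. the library default applies. */
int effective_blas_threads(void)
{
    int n = positive_env_int("OPENBLAS_NUM_THREADS");
    if (n > 0)
        return n;
    return positive_env_int("OMP_NUM_THREADS");
}
```

A check like this at startup can catch the case where OMP_NUM_THREADS is set to the core count while OPENBLAS_NUM_THREADS was forgotten, which is the situation the two GFlop/s figures illustrate.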

Hello,

Thanks for the help! I was able to improve the factorization performance. Do you mind sharing the INP file for the model that you ran? I want to test it on my Linux version.


Not reusing csc.
+-------------------------------------------------+
+     PaStiX : Parallel Sparse matriX package     +
+-------------------------------------------------+
  Version:                                   6.0.1
  Schedulers:
    sequential:                            Enabled
    thread static:                         Started
    thread dynamic:                       Disabled
    PaRSEC:                               Disabled
    StarPU:                               Disabled
  Number of MPI processes:                       1
  Number of threads per process:                24
  Number of GPUs:                                0
  MPI communication support:              Disabled
  Distribution level:                     2D( 128)
  Blocking size (min/max):             1024 / 2048

  Matrix type:  General
  Arithmetic:   Float
  Format:       CSC
  N:            1021086
  nnz:          42850334

+-------------------------------------------------+
  Ordering step :
    Ordering method is: Scotch
    Time to compute ordering:              6.8583
+-------------------------------------------------+
  Symbolic factorization step:
    Symbol factorization using: Fax Direct
    Number of nonzeroes in L structure:   940155805
    Fill-in of L:                         21.940455
    Time to compute symbol matrix:        0.5008
+-------------------------------------------------+
  Reordering step:
    Split level:                                 0
    Stoping criteria:                           -1
    Time for reordering:                  1.0693
+-------------------------------------------------+
  Analyse step:
    Number of non-zeroes in blocked L:    1880311610
    Fill-in:                              43.880909
    Number of operations in full-rank LU   :     5.52 TFlops
    Prediction:
      Model:                             AMD 6180  MKL
      Time to factorize:                  113.1075
    Time for analyze:                     0.1085
+-------------------------------------------------+
  Factorization step:
    Factorization used: LU
    Time to initialize internal csc:      1.0117
    Time to initialize coeftab:           0.7879
    Time to factorize:                    10.5443  (536.11 GFlop/s)
    Number of operations:                       5.52 TFlops
    Number of static pivots:                     0
    Time to solve:                        0.8367
    - iteration 1 :
         total iteration time                   0.137
         error                                  0.00040009
    - iteration 2 :
         total iteration time                   0.129
         error                                  3.3839e-06
    - iteration 3 :
         total iteration time                   0.13
         error                                  6.304e-08
    - iteration 4 :
         total iteration time                   0.134
         error                                  3.3557e-10
    - iteration 5 :
         total iteration time                   0.162
         error                                  1.4103e-12
    - iteration 6 :
         total iteration time                   0.141
         error                                  3.5867e-15
    Time for refinement:                  1.1423
________________________________________

CSC Conversion Time: 0.398161
Init Time: 8.956861
Factorize Time: 12.361112
Solve Time: 2.058843
Clean up Time: 0.000000
---------------------------------
Sum: 23.774978

Total PaStiX Time: 23.774978
CCX without PaStiX Time: 19.864284
Share of PaStiX Time: 0.544807
Total Time: 43.639262
Reusability: 0 : 1
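
For comparing the two logs, the summary figures can be reproduced with a few lines of arithmetic. The sketch below assumes that "Share of PaStiX Time" is the solver time divided by the total time, which matches both logs; the function names are made up for illustration.

```c
#include <stdio.h>

/* Fraction of the total run time spent in the solver. */
double solver_share(double solver_s, double other_s)
{
    return solver_s / (solver_s + other_s);
}

/* How many times shorter the second duration is than the first. */
double time_ratio(double before_s, double after_s)
{
    return before_s / after_s;
}

/* Compare the slow and the fast run using the numbers from the two logs. */
void print_comparison(void)
{
    printf("PaStiX share, slow run: %.6f\n", solver_share(164.687876, 19.732016));
    printf("PaStiX share, fast run: %.6f\n", solver_share(23.774978, 19.864284));
    printf("Factorization speedup:  %.2fx\n", time_ratio(100.6078, 10.5443));
    printf("Total time speedup:     %.2fx\n", time_ratio(184.419891, 43.639262));
}
```

With the values from the logs, the factorization step became roughly 9.5 times faster and the total run roughly 4.2 times faster, while the part of CCX outside PaStiX stayed at about 20 seconds.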