Building CalculiX with PaStiX on FreeBSD without CUDA

Over the last couple of weeks I’ve been trying to build a working CalculiX 2.21 solver with PaStiX on FreeBSD 14 (x86-64) without CUDA. I’m using GCC 13 (and gfortran13) for the build.

Libraries built (in order); everything is staged into a common prefix, referred to as ${PREFIX} below (see the sketch after this list):

  1. spooles 2.2 + FreeBSD patches + my patches to fix warnings
  2. OpenBLAS 0.3.26
  3. arpack-ng 3.9.1
  4. hwloc 2.10.0
  5. mfaverge-parsec-b580d208094e
  6. scotch 6.0.8
  7. PaStiX4CalculiX (cudaless branch from https://github.com/Kabbone/PaStiX4CalculiX)
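
A minimal sketch of that environment, assuming a layout like mine (the exact path is just an example):

# Staging prefix used by all of the builds below (example path).
PREFIX=$HOME/tmp/src/calculix-build
export PREFIX
# Headers end up in ${PREFIX}/include, libraries in ${PREFIX}/lib.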

Spooles is built like this:

make global -f makefile
cd MT/src
make -f makeGlobalLib

OpenBlas is built with:

env CC=gcc13 FC=gfortran13 AR=gcc-ar13 \
    NO_SHARED=1 INTERFACE64=1 BINARY=64 USE_THREAD=0 \
    gmake

(This was the problem; see the end of this post.)
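
The install step is not shown above; presumably it is something along these lines, reusing the same settings (a sketch):

env CC=gcc13 FC=gfortran13 AR=gcc-ar13 \
    NO_SHARED=1 INTERFACE64=1 BINARY=64 USE_THREAD=0 \
    gmake PREFIX=${PREFIX} install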

Arpack:

autoreconf -vif
env INTERFACE64=1 CC=gcc13 F77=gfortran13 FC=gfortran13 \
    LDFLAGS=-L${PREFIX} \
./configure --with-blas=-lopenblas --with-lapack=-lopenblas --enable-icb --enable-static --disable-shared --prefix=${PREFIX}
gmake

hwloc:

env CC=gcc13 CXX=g++13 LIBS='-lexecinfo -lpciaccess' \
    ./configure \
    --prefix=${PREFIX} \
    --disable-shared --enable-static \
    --disable-readme --disable-picky --disable-cairo \
    --disable-libxml2 --disable-levelzero
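
The usual autotools steps follow, as for arpack (the install step is assumed; a sketch):

gmake
gmake install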

parsec:

cmake \
    -Wno-dev \
    -DEXTRA_LIBS='-lexecinfo -lpciaccess' \
    -DNO_CMAKE_SYSTEM_PATH=YES \
    -DCMAKE_INSTALL_LOCAL_ONLY=YES \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_CXX_COMPILER=g++13 \
    -DCMAKE_C_COMPILER=gcc13 \
    -DCMAKE_Fortran_COMPILER=gfortran13 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=${PREFIX} \
    -DPARSEC_GPU_WITH_CUDA=OFF \
    -DHWLOC_DIR=${PREFIX} \
    ..
gmake -j4

scotch was built using the following Makefile.inc:

EXE             =
LIB             = .a
OBJ             = .o

MAKE            = gmake
AR              = ar
ARFLAGS         = -ruv
CAT             = cat
CCS             = gcc13
CCP             = mpicc
CCD             = gcc13
CFLAGS          += -std=c99 -fPIC -DCOMMON_FILE_COMPRESS_GZ -DCOMMON_PTHREAD
CFLAGS          += -DCOMMON_RANDOM_FIXED_SEED -DSCOTCH_RENAME -DSCOTCH_RENAME_PARSER
CFLAGS          += -DSCOTCH_PTHREAD -Drestrict=__restrict -DIDXSIZE64
CFLAGS          += -DINTSIZE64 -DSCOTCH_PTHREAD_NUMBER=4 -DCOMMON_PTHREAD_FILE
CLIBFLAGS       =
LDFLAGS         += -lz -lm -lpthread
CP              = cp
LEX             = flex -Pscotchyy -olex.yy.c
LN              = ln
MKDIR           = mkdir -p
MV              = mv
RANLIB          = ranlib
YACC            = bison -pscotchyy -y -b y

If you have a CPU with more or less than four cores, you might want to adapt -DSCOTCH_PTHREAD_NUMBER=4 accordingly.
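
For reference, the Scotch build itself then runs from the src directory roughly like this (a sketch from memory; directory names may differ in your tree):

cd scotch_6.0.8/src
cp /path/to/Makefile.inc .      # the file shown above
gmake scotch                    # libscotch plus the command-line tools
gmake prefix=${PREFIX} install  # stage into the common prefix
# To pick a sensible -DSCOTCH_PTHREAD_NUMBER, query the core count:
sysctl -n hw.ncpu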

The original PaStiX4CalculiX does not build without CUDA; even when configured without CUDA it still references CUDA functions and types and fails to compile.
Luckily I found Kabbone’s version before I was halfway through fixing all the errors myself.

With Kabbone’s PaStiX4CalculiX I can build PaStiX, but only if Python 2 is used to generate the different variants of the code; the Python scripts in PaStiX4CalculiX fail miserably with Python 3.
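
Make sure a Python 2 interpreter is actually available; the cmake invocation below points at it explicitly (just a sanity check):

/usr/local/bin/python2.7 --version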

Building PaStiX is done as follows.
(The patch for CMakeLists.txt is used to tell cmake to use gcc-ar13).

patch < ../../patches/pastix/kabbone-CMakeLists.txt.patch
mkdir build
cd build
env PKG_CONFIG_PATH=/zstorage/home/rsmith/tmp/src/calculix-build/lib/pkgconfig \
cmake   -Wno-dev \
        -DPYTHON_EXECUTABLE=/usr/local/bin/python2.7 \
        -DPASTIX_WITH_CUDA=OFF \
        -DCMAKE_PREFIX_PATH=${PREFIX} \
        -DCMAKE_INSTALL_PREFIX=${PREFIX} \
        -DCMAKE_BUILD_TYPE=Release \
        -DPASTIX_WITH_PARSEC=ON \
        -DSCOTCH_DIR=${PREFIX} \
        -DPASTIX_ORDERING_SCOTCH=ON \
        -DCMAKE_C_COMPILER=gcc13 \
        -DCMAKE_CXX_COMPILER=g++13 \
        -DCMAKE_Fortran_COMPILER=gfortran13 \
        -DCMAKE_C_FLAGS='-fopenmp -lpciaccess -lm -Wno-unused-parameter' \
        ..
gmake -j4
gmake install

(On FreeBSD, GNU make is called gmake.)
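
A quick sanity check that the install is visible (a sketch; this assumes PaStiX installed its pkg-config files under ${PREFIX}):

env PKG_CONFIG_PATH=${PREFIX}/lib/pkgconfig \
    pkg-config --cflags --libs pastix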

The following Makefile was used to build ccx:

PASTIX_INCLUDE = ../../include/
DIR=../spooles

WFLAGS = -Wno-unused-variable -Wno-unused-argument -Wno-maybe-uninitialized
WFLAGS += -Wno-unused-label -Wno-conversion
CFLAGS = -Wall -O2 -fopenmp -fpic -I$(DIR) -I$(PASTIX_INCLUDE) -DARCH="Linux" -DSPOOLES -DARPACK -DMATRIXSTORAGE -DINTSIZE64 -DPASTIX -DPASTIX_FP32 $(WFLAGS)
FFLAGS = -Wall -O2 -fopenmp -fpic -fdefault-integer-8 $(WFLAGS) -Wno-unused-dummy-argument

CC=gcc13
FC=gfortran13

.c.o :
        $(CC) $(CFLAGS) -c $<
.f.o :
        $(FC) $(FFLAGS) -c $<

include Makefile.inc

SCCXMAIN = ccx_2.21.c

OCCXF = $(SCCXF:.f=.o)
OCCXC = $(SCCXC:.c=.o)
OCCXMAIN = $(SCCXMAIN:.c=.o)

PASTIX_LIBS = -lpastix -lpastix_kernels -lpastix_parsec -lparsec -lhwloc -lspm
PASTIX_LIBS += -lscotch -lscotcherrexit -lopenblas
PASTIX_LIBS += -lgomp -lstdc++ -lpciaccess -latomic -lexecinfo

LIBS = -L../../lib \
     $(DIR)/spooles.a \
     -larpack \
     $(PASTIX_LIBS) \
     -lpthread -lm -lz -lc

ccx_2.21_i8: $(OCCXMAIN) ccx_2.21.a
        ./date.pl; $(CC) $(CFLAGS) -c ccx_2.21.c; $(FC) -Wall -O2 -o $@ \
        $(OCCXMAIN) ccx_2.21.a $(LIBS)

ccx_2.21.a: $(OCCXF) $(OCCXC)
        ar vr $@ $?

With that I could build and link ccx.
An example problem seems to run OK; it finishes without errors.
However, the results generated by that executable are invalid.
Note that building with the same libraries but without PaStiX (just spooles, arpack and OpenBLAS) works fine:

[image: stress plot from the build without PaStiX]

When I use the executable built with PaStiX:

[image: stress plot from the build with PaStiX]

So I strongly suspect PaStiX is the problem. (Actually it was OpenBLAS not using locking; see the solution below.)
What I’m not sure of is what exactly the problem is. On the one hand, it looks like some kind of memory corruption issue. On the other hand, I’m not sure whether the scripts that generate the different versions of the code work as they should.

I also found a set of patches that supposedly enabled a stock PaStiX 6.2.0 to work with CalculiX. And although that compiled and ran, it failed with NaN results on the example problem.

If anyone could give me a pointer as to what is going wrong and how to fix it, I’d appreciate it.

PROBLEM SOLVED

So, after a lot of experimentation, I found the solution.
Basically, even though OpenBLAS is built as a single-threaded library, it needs to be built with locking enabled because CalculiX calls it from multiple threads. As usual with these things, pretty obvious in hindsight.
:person_facepalming:
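
Something like the following sketch captures the failure mode (a hypothetical test program of my own, not part of the build): several threads calling a BLAS routine at the same time, which is what happens inside CalculiX.

cat > locktest.c <<'EOF'
/* Four pthreads each call dgemm on their own data; only the
 * OpenBLAS-internal buffers are shared.  A single-threaded OpenBLAS
 * built without USE_LOCKING=1 is not safe to use like this; built
 * with USE_LOCKING=1 it is. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

#define N 200

static void *worker(void *arg)
{
    (void)arg;
    double *a = malloc(N * N * sizeof *a);
    double *b = malloc(N * N * sizeof *b);
    double *c = malloc(N * N * sizeof *c);
    for (int i = 0; i < N * N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, a, N, b, N, 0.0, c, N);
    /* Every element of the product should be 2*N. */
    printf("c[0] = %g (expected %g)\n", c[0], 2.0 * N);
    free(a); free(b); free(c);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}
EOF
# The static libopenblas.a contains Fortran objects, hence -lgfortran.
gcc13 -I${PREFIX}/include locktest.c -o locktest \
    -L${PREFIX}/lib -lopenblas -lgfortran -lpthread -lm
./locktest

(Whether this toy case actually misbehaves depends on timing, but it is exactly the usage pattern that USE_LOCKING=1 is meant to make safe.)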

So, the correct invocation to compile OpenBLAS is:

# Do *not* use USE_OPENMP=1. PaStiX needs a single-threaded build, hence USE_THREAD=0.
# Enable locking in case BLAS routines are called in a multithreaded program.
env CC=gcc13 FC=gfortran13 AR=gcc-ar13 \
    PREFIX=${PREFIX} \
    NO_SHARED=1 INTERFACE64=1 BINARY=64 USE_THREAD=0 \
    BUFFERSIZE=25 USE_LOCKING=1 DYNAMIC_ARCH=1 \
    gmake

I used DYNAMIC_ARCH=1 so I can hopefully also use the same binary on a machine with another CPU type. If you don’t need that you can leave it out and speed up the build significantly.
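
To double-check what the installed library was actually built with, openblas_get_config() can be queried; among other things it reports the 64-bit integer interface and DYNAMIC_ARCH (again just a sketch):

cat > blascfg.c <<'EOF'
/* Print the build configuration string of the linked OpenBLAS. */
#include <stdio.h>
#include <cblas.h>
int main(void) { puts(openblas_get_config()); return 0; }
EOF
gcc13 -I${PREFIX}/include blascfg.c -o blascfg \
    -L${PREFIX}/lib -lopenblas -lgfortran -lpthread -lm
./blascfg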

Thanks to all for the assistance.

The scripts and patches I’ve used are available on github in case anybody wants to use them.

Yeah, it sounds like a PaStiX problem. Kudos for getting this far along, comparing and sharing the scripts and workarounds.

You may have tried this, but did you change the settings for mixed precision? That’s the only thing I can think of besides what you already pointed out.
IMO, we should try to implement solvers that are less painful to compile (PETSc, MUMPS, SuperLU, SuiteSparse, or even Trilinos).

That was something I had not tried yet.
So I ran env PASTIX_MIXED_PRECISION=1 ccx_i8 -i job.
It did not make much of a difference, although it generated a slightly different image.

But it turns out that every time I run the analysis with PaStiX with the same input, I get a different image. :frowning:

This is what PaStiX reports:

Not reusing csc.
+-------------------------------------------------+
+     PaStiX : Parallel Sparse matriX package     +
+-------------------------------------------------+
  Version:                                   6.0.1
  Schedulers:
    sequential:                            Enabled
    thread static:                         Started
    thread dynamic:                       Disabled
    PaRSEC:                                Started
    StarPU:                               Disabled
  Number of MPI processes:                       1
  Number of threads per process:                 4
  Number of GPUs:                                0
  MPI communication support:              Disabled
  Distribution level:                     2D( 128)
  Blocking size (min/max):             1024 / 2048

  Matrix type:  General
  Arithmetic:   Float
  Format:       CSC
  N:            62898
  nnz:          4545542

+-------------------------------------------------+
  Ordering step :
    Ordering method is: Scotch
    Time to compute ordering:              0.9941 
+-------------------------------------------------+
  Symbolic factorization step:
    Symbol factorization using: Fax Direct
    Number of nonzeroes in L structure:   18351942
    Fill-in of L:                         4.037350
    Time to compute symbol matrix:        0.0206 
+-------------------------------------------------+
  Reordering step:
    Split level:                                 0
    Stoping criteria:                           -1
    Time for reordering:                  0.0342 
+-------------------------------------------------+
  Analyse step:
    Number of non-zeroes in blocked L:    36703884
    Fill-in:                              8.074699
    Number of operations in full-rank LU   :    16.44 GFlops
    Prediction:
      Model:                             AMD 6180  MKL
      Time to factorize:                  0.9542 
    Time for analyze:                     0.0029 
+-------------------------------------------------+
  Factorization step:
    Factorization used: LU
    Time to initialize internal csc:      0.0994 
    Time to initialize coeftab:           0.0192 
    Time to factorize:                    0.1036  (158.70 GFlop/s)
    Number of operations:                      16.44 GFlops
    Number of static pivots:                     0
    Time to solve:                        0.0098 
    - iteration 1 :
         total iteration time                   0.0122 
         error                                  17.29
   [skipping iterations...]
    - iteration 70 :
         total iteration time                   0.0148 
         error                                  16.902
    Time for refinement:                  0.9685 
    - iteration 1 :
         total iteration time                   0.0121 
         error                                  16.863
[skipping iterations...]
    - iteration 50 :
         total iteration time                   0.0141 
         error                                  16.712
    Time for refinement:                  0.6800 
________________________________________

CSC Conversion Time: 0.033823
Init Time: 1.089587
Factorize Time: 0.222386
Solve Time: 1.660087
Clean up Time: 0.000000
---------------------------------
Sum: 3.005883

Total PaStiX Time: 3.005883
CCX without PaStiX Time: 0.573124
Share of PaStiX Time: 0.839865
Total Time: 3.579008
Reusability: 0 : 1 

Not sure if this contains anything out of the ordinary.

I’ve been using spooles exclusively up to now, and in general I’m not unhappy with it. It’s just that PaStiX is supposed to be faster.

That’s fair. I generally use Pardiso through Intel’s oneAPI, and it works like a charm 99% of the time and is faster than spooles. I did have an academic license for the latest Pardiso before they started their new business (Panua Technologies), and it was even faster than the one from Intel MKL, as they advertise on their website.

Have you tried to compile (and test) the regular version, i.e. single precision (for example NO -DINTSIZE64)?

If I build ccx with just spooles, arpack and openblas it works fine. It produces normal output and yields the same result as the version from the FreeBSD ports tree.

I’ll try rebuilding without INT64 and report back.

So, I rebuilt the whole stack, ripping out INT64 in all shapes and forms.

Unfortunately, it didn’t help. :frowning:

Do the calculations run on one or several processor threads?

If you don’t use CUDA, parsec is probably unnecessary (maybe it causes the error).

On several real processor threads; hyperthreading is disabled.

Rebuilt PaStiX and ccx without parsec; no improvement.

You may try compiling PaStiX without Scotch but with Metis (to check Scotch):

PASTIX_MIXED_PRECISION=1
  Arithmetic:   Float

That means mixed precision is on, which is the default and an unreliable mode. The safer one is PASTIX_MIXED_PRECISION=0, with which the output says Double instead.

Also, these errors seem too large. I normally see 1e-10 or so when it’s working right. High error is a symptom of mixed precision being wrong for the problem.

     error                                  16.712

I think something’s not checking for convergence properly and it should be failing when error remains steady and high like that.

When I ran
env PASTIX_MIXED_PRECISION=0 ccx_i8 -i job | tee job.log
the output now indeed says Double:

Not reusing csc.
+-------------------------------------------------+
+     PaStiX : Parallel Sparse matriX package     +
+-------------------------------------------------+
  Version:                                   6.0.1
  Schedulers:
    sequential:                            Enabled
    thread static:                         Started
    thread dynamic:                       Disabled
    PaRSEC:                                Started
    StarPU:                               Disabled
  Number of MPI processes:                       1
  Number of threads per process:                 4
  Number of GPUs:                                0
  MPI communication support:              Disabled
  Distribution level:                     2D( 256)
  Blocking size (min/max):             1024 / 2048

  Matrix type:  General
  Arithmetic:   Double
  Format:       CSC
  N:            62898
  nnz:          4545542

+-------------------------------------------------+
  Ordering step :
    Ordering method is: Scotch
    Time to compute ordering:              0.7787 
+-------------------------------------------------+
  Symbolic factorization step:
    Symbol factorization using: Fax Direct
    Number of nonzeroes in L structure:   18321342
    Fill-in of L:                         4.030618
    Time to compute symbol matrix:        0.0212 
+-------------------------------------------------+
  Reordering step:
    Split level:                                 0
    Stoping criteria:                           -1
    Time for reordering:                  0.0369 
+-------------------------------------------------+
  Analyse step:
    Number of non-zeroes in blocked L:    36642684
    Fill-in:                              8.061235
    Number of operations in full-rank LU   :    16.21 GFlops
    Prediction:
      Model:                             AMD 6180  MKL
      Time to factorize:                  0.9569 
    Time for analyze:                     0.0028 
+-------------------------------------------------+
  Factorization step:
    Factorization used: LU
    ||A||_2  =                            3.187785e+02
    Time to initialize internal csc:      0.1068 
    Time to initialize coeftab:           0.0425 
    Time to factorize:                    0.3194  (50.77 GFlop/s)
    Number of operations:                      16.21 GFlops
    Number of static pivots:                     0
    Time to solve:                        0.0147 
    Time to solve:                        0.0158 
    - iteration 1 :
         total iteration time                   0.0188 
         error                                  2.5489
    Time to solve:                        0.0166 
...[skipped]
    - iteration 70 :
         total iteration time                   0.0234 
         error                                  2.5222
    Time for refinement:                  1.5468 
    Time to solve:                        0.0163 
...[skipped]
    - iteration 50 :
         total iteration time                   0.0222 
         error                                  2.2486
    Time for refinement:                  1.0767 
________________________________________

CSC Conversion Time: 0.033528
Init Time: 0.877496
Factorize Time: 0.494876
Solve Time: 2.639968
Clean up Time: 0.000000
---------------------------------
Sum: 4.045869

Total PaStiX Time: 4.045869
CCX without PaStiX Time: 0.631549
Share of PaStiX Time: 0.864979
Total Time: 4.677418
Reusability: 0 : 1 
________________________________________

 Using up to 4 cpu(s) for the stress calculation.

 Estimating the stress errors


 Job finished

________________________________________

Total CalculiX Time: 6.773766
________________________________________

The stress output still looks the same as shown before. :-/

When I run it repeatedly, the resulting stress values are always different, and so are the errors in the iterations, which vary between 2.x and 6.x from run to run.

This makes me wonder whether there is a memory corruption bug when converting the stiffness matrix to/from CSC format, or an alignment error or something.

Update

After compiling the whole stack on another machine with an AMD Ryzen CPU (but with the same OS and toolchain), the problem remains: the output looks random and changes after every run.
So it is not an error caused by a specific machine.

Hmm, I’ve built metis 5.1.0.
But when I then try to build PaStiX4CalculiX, it complains:

-- A cache variable, namely METIS_DIR, has been set to specify the install directory of METIS
-- Looking for METIS_NodeND
-- Looking for METIS_NodeND - not found
-- Looking for METIS : test of METIS_NodeND with METIS library fails
-- CMAKE_REQUIRED_LIBRARIES: /home/rsmith/tmp/src/calculix-build/lib/libmetis.a;/usr/lib/libm.so
-- CMAKE_REQUIRED_INCLUDES: /home/rsmith/tmp/src/calculix-build/include
-- CMAKE_REQUIRED_FLAGS: 
-- Check in CMakeFiles/CMakeError.log to figure out why it fails
-- Performing Test METIS_Idx_4
-- Performing Test METIS_Idx_4 - Success
-- Performing Test METIS_Idx_8
-- Performing Test METIS_Idx_8 - Failed
-- Could NOT find METIS (missing: METIS_WORKS) 
CMake Error at CMakeLists.txt:461 (message):
  Metis is required but was not found.  Please provide a Metis library in
  your environment or configure with -DPASTIX_ORDERING_METIS=OFF

If I patch metis.h to set #define IDXTYPEWIDTH 64 instead of the default #define IDXTYPEWIDTH 32, the tests fail the other way around. :scream:

At the moment I don’t see a way around this.

Update

Also tried metis-5.2.1 from github; same problem.

Metis uses a Makefile to invoke cmake, which in turn generates other Makefiles, because why not.
And it contains its own BLAS-like routines, written (of course) as C preprocessor macros.
For now, METIS is going on my “stuff to avoid” list. :frowning:

Also, I used to think that autotools was the most user-unfriendly build system out there. Slowly I’m starting to think that cmake deserves that title.

If anybody has managed to build PaStiX with METIS, please enlighten me. :sob:

I don’t envy you, and I’m just guessing here even though my previous guess didn’t help anything :stuck_out_tongue: Is IDXTYPEWIDTH the integer size for array indices? Maybe only one test is supposed to pass because you can only choose one option? Do you particularly want either 32 or 64?

Yes.

I’m building CalculiX with PaStiX and i8, so I’ve built the whole stack with 64 bit integers (integer-8 in Fortran parlance).

Maybe that’s OK that the “METIS_Idx_4” test fails since you don’t expect 4-byte indexes anyway? Perhaps the “Metis is required but was not found.” error is unrelated to that.

Hmm, it could be.
There is also the error “Looking for METIS_NodeND - not found”.
Which is weird, because that symbol is present in the library, and I’m explicitly specifying the directory where libmetis.a can be found.
(I’m really beginning to dislike cmake at this point.)
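
For what it’s worth, this is how I’d double-check both things from the command line (a sketch; paths assumed to match the cmake output above):

# Is the ordering entry point really in the static library?
nm ${PREFIX}/lib/libmetis.a | grep ' T METIS_NodeND'

# Which integer width was metis.h configured with?
cat > idxcheck.c <<'EOF'
#include <stdio.h>
#include <metis.h>
int main(void) { printf("sizeof(idx_t) = %zu\n", sizeof(idx_t)); return 0; }
EOF
gcc13 -I${PREFIX}/include idxcheck.c -o idxcheck
./idxcheck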

I’ve been reading Peter Wauligmann’s thesis, and according to it Scotch is a lot faster than Metis.
So I’m thinking of patching the failed tests out of CMakeLists.txt and seeing what happens. If that doesn’t work, I’ll drop the combination of PaStiX and Metis: too much hassle.

@rafal.brzegowy When compiling pastix 6.2 (with your patches) or PaStiX4CalculiX, I get the same warnings:

Warning on cgetrf_sp1dplus.jdf:70: Function GETRF runs on a node depending on data descA(0, k, 0), but refers directly (as IN) to data descA(1, k, 0), if (browk0 == browk1) is true.
  This is a potential direct remote memory reference.
  To remove this warning, descA(1, k, 0) should be syntaxically equal to descA(0, k, 0), or marked as aligned to descA(0, k, 0)
  If this is not possible, and data are located on different nodes at runtime, this will result in a fault.
Warning on cgetrf_sp1dplus.jdf:72: Function GETRF runs on a node depending on data descA(0, k, 0), but refers directly (as OUT) to data descA(1, k, 0).
  This is a potential direct remote memory reference.
  To remove this warning, descA(1, k, 0) should be syntaxically equal to descA(0, k, 0), or marked as aligned to descA(0, k, 0)
  If this is not possible, and data are located on different nodes at runtime, this will result in a fault.
[ 56%] Generating cgetrf_sp2d.h, cgetrf_sp2d.c
Warning on cgetrf_sp2d.jdf:362: Function GETRF2D runs on a node depending on data descA(0, k, 1), but refers directly (as OUT) to data descA(1, k, 1), if isTwoD is true.
  This is a potential direct remote memory reference.
  To remove this warning, descA(1, k, 1) should be syntaxically equal to descA(0, k, 1), or marked as aligned to descA(0, k, 1)
  If this is not possible, and data are located on different nodes at runtime, this will result in a fault.
Warning on cgetrf_sp2d.jdf:78: Function GETRF runs on a node depending on data descA(0, k, 0), but refers directly (as IN) to data descA(1, k, 0), if (browk0 == browk1) is true.
  This is a potential direct remote memory reference.
  To remove this warning, descA(1, k, 0) should be syntaxically equal to descA(0, k, 0), or marked as aligned to descA(0, k, 0)
  If this is not possible, and data are located on different nodes at runtime, this will result in a fault.
Warning on cgetrf_sp2d.jdf:80: Function GETRF runs on a node depending on data descA(0, k, 0), but refers directly (as OUT) to data descA(1, k, 0).
  This is a potential direct remote memory reference.
  To remove this warning, descA(1, k, 0) should be syntaxically equal to descA(0, k, 0), or marked as aligned to descA(0, k, 0)
  If this is not possible, and data are located on different nodes at runtime, this will result in a fault.

These warnings appear for all the variants (c, d, s, z), of course.

From Peter Wauligmann’s thesis, I understand that these routines are important.
The warning seems to be about potentially overlapping input and output data.
If so, that could cause issues.

Are you seeing the same warning?

I don’t remember (it was some time ago); this version was for testing capabilities, not for use (I was very lucky that anything worked there at all).

It is possible that version 6.4.0 will enable native support for CalculiX:

I saw issue #67, but the code that generates the warning is the same for PaStiX4CalculiX and pastix 6.2, and it generates the same warning in both cases.

I’ve rebuilt OpenBLAS with locking enabled (even though it is single threaded), just in case. And that seems to have fixed the issue!
I’ll edit the original question with the fix after I’ve done a full-stack rebuild.
