Trouble compiling and running CalculiX with Pastix on Ubuntu 24.04

You should never set OMP_NUM_THREADS to a number higher than the actual number of physical cores.

In my experience hyperthreading decreases performance, at least on the CPUs I’ve tested (Intel Core i7-7700 and AMD Ryzen 5).
This is why I always disable hyperthreading in the BIOS and set the global environment variable OMP_NUM_THREADS to the number of physical cores.
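
As a rough sketch of that advice on Linux, the snippet below counts unique (core, socket) pairs reported by `lscpu` and uses that count for OMP_NUM_THREADS. The helper name `count_physical_cores` is made up for this example, and it assumes `lscpu` from util-linux is installed.

```shell
# Hypothetical helper: reads the output of `lscpu -p=Core,Socket` on stdin
# and prints the number of unique (core, socket) pairs, i.e. physical cores.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Only touch the environment when lscpu is actually available.
if command -v lscpu >/dev/null 2>&1; then
    OMP_NUM_THREADS=$(lscpu -p=Core,Socket | count_physical_cores)
    export OMP_NUM_THREADS
    echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
fi
```

With hyperthreading enabled, two logical CPUs share one core ID, so they collapse into a single line after `sort -u`.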

Using the same beamp input, I get much higher factorization performance on my Intel Core i7-7700:

  Factorization step:
    Factorization used: LU
    Time to initialize internal csc:      0.0012 
    Time to initialize coeftab:           0.0001 
    Time to factorize:                    0.0006  (21.48 GFlop/s)
    Number of operations:                      12.72 MFlops
    Number of static pivots:                     0
    Time to solve:                        0.0001 
    - iteration 1 :
         total iteration time                   0.0002 
         error                                  2.1222e-06
    - iteration 2 :
         total iteration time                   0.000216 
         error                                  6.8298e-10
    - iteration 3 :
         total iteration time                   0.00022 
         error                                  8.3425e-13
    Time for refinement:                  0.0008 
    - iteration 1 :
         total iteration time                   0.000184 
         error                                  1.1033e-14
    Time for refinement:                  0.0003 

On real-world problems factorization performance is between 20 and 270 GFlop/s on my i7-7700.
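
To compare runs, the GFlop/s figure can be pulled out of the solver log. This is a minimal sketch that assumes the "Time to factorize" line looks like the one in the output above; the function name `factorization_gflops` is invented for this example.

```shell
# Hypothetical helper: reads ccx/PaStiX output on stdin and prints the
# GFlop/s value from each "Time to factorize: ... (NN.NN GFlop/s)" line.
# Lines without a GFlop/s figure (e.g. the prediction) are skipped.
factorization_gflops() {
    sed -n 's/.*Time to factorize:.*( *\([0-9.][0-9.]*\) GFlop\/s).*/\1/p'
}
```

Usage would be something like `ccx -i beamp | factorization_gflops`.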

Hint: if you put blocks of output or scripts between lines containing only ``` (three “grave accent”, ASCII 96) it will be rendered as preformatted text.

You can also use the “</>” button at the top of the edit box to do this.

thank you, very useful :slight_smile:

I don’t have a powerful laptop:

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz

but in my tests the factorization reaches 450 GFlop/s.
Could anyone explain why I cannot use the GPU?

I do have 8 physical cpu cores. Last time I checked I had turned off the hyperthreading for the same reasons you had mentioned. Hence I am still confused about the oversubscription error. It seems to happen for a small model like beamp.

@teofil75, I set

export PASTIX_GPU=1

and I can see this in my run, which tells me GPU is helping with GEMMS:

Also, here is my nvidia-smi:

Sun Jan  5 09:40:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050        Off |   00000000:3B:00.0 Off |                  N/A |
| N/A   42C    P8             N/A / ERR!  |       5MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1656      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

Maybe all you are missing is setting:

-DPARSEC_GPU_WITH_CUDA=ON \

during the PaRSEC build and

-DPASTIX_WITH_CUDA=ON \

for the Pastix build.
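
A rough sketch of where those two flags go, assuming out-of-tree CMake builds. The function name, the directory arguments and the optional `echo` dry-run argument are placeholders for this example; every other option a real build needs is left out.

```shell
# Hypothetical helper: configure PaRSEC and PaStiX with the CUDA flags
# mentioned above. $1 = PaRSEC source dir, $2 = PaStiX source dir,
# $3 = optional command prefix (pass "echo" to only print the commands).
configure_with_cuda() {
    parsec_src=$1
    pastix_src=$2
    run=${3:-}
    $run cmake -S "$parsec_src" -B "$parsec_src/build" -DPARSEC_GPU_WITH_CUDA=ON &&
    $run cmake -S "$pastix_src" -B "$pastix_src/build" -DPASTIX_WITH_CUDA=ON
}
```

Calling `configure_with_cuda ~/src/parsec ~/src/pastix echo` prints the two cmake commands without running them.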

The only way I have been able to trigger the “oversubscription” warning is to set OMP_NUM_THREADS higher than the number of actual cores:

elysium:~> uname -v
FreeBSD 14.2-RELEASE releng/14.2-n269506-c8918d6c7412 GENERIC
elysium:~...src/beamp> sysctl hw.ncpu
hw.ncpu: 4
elysium:~...src/beamp> env OMP_NUM_THREADS=4 ccx -i beamp | grep Over
elysium:~...src/beamp> env OMP_NUM_THREADS=6 ccx -i beamp | grep Over
W@-0001 Oversubscription on core 0 detected
W@-0001 Oversubscription on core 1 detected
elysium:~...src/beamp> env OMP_NUM_THREADS=8 ccx -i beamp | grep Over
W@-0001 Oversubscription on core 0 detected
W@-0001 Oversubscription on core 1 detected
W@-0001 Oversubscription on core 2 detected
W@-0001 Oversubscription on core 3 detected

Since you get warnings about oversubscription on two cores with OMP_NUM_THREADS=6, it looks to me that your CPU only has 4 cores.
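
The experiment above can be repeated on any machine with a small loop. This is only a sketch: `count_oversubscribed_cores` is a name made up here, and the loop runs only if `ccx` is on the PATH and a `beamp.inp` file is in the current directory.

```shell
# Hypothetical helper: counts "Oversubscription on core N" warnings on stdin.
# grep -c exits non-zero when nothing matches, so the status is forced to 0.
count_oversubscribed_cores() {
    grep -c 'Oversubscription on core' || true
}

# Try a few thread counts and report how many cores were oversubscribed.
if command -v ccx >/dev/null 2>&1 && [ -f beamp.inp ]; then
    for n in 2 4 6 8; do
        printf '%s threads: ' "$n"
        env OMP_NUM_THREADS="$n" ccx -i beamp 2>&1 | count_oversubscribed_cores
    done
fi
```

The smallest thread count that produces a non-zero result is one more than PaStiX is willing to place on the machine.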

You are correct! I forgot that I had turned hyperthreading back on at some point. I have only 4 real CPU cores. That explains the oversubscription error.

In the instructions I sent you there are both versions: the first one without CUDA, using the Kabbone trunk for PaStiX with CPU only, and the second one with PaRSEC with CUDA and Guido Dhondt’s PaStiX4CalculiX. The second one is correctly detected by nvidia-smi, but the usage is 0% GPU and 100% CPU.
Could you please share your complete CalculiX runtime output?

Sure thing. This is for running ccx on beamp:

************************************************************

CalculiX Version 2.22 i8, Copyright(C) 1998-2024 Guido Dhondt
CalculiX comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
certain conditions, see gpl.htm

************************************************************

You are using an executable made on Sat Jan  4 02:50:15 PM MST 2025

  The numbers below are estimated upper bounds

  number of:

   nodes:                   261
   elements:                    32
   one-dimensional elements:                     0
   two-dimensional elements:                     0
   integration points per element:                     8
   degrees of freedom per node:                     3
   layers per element:                     1

   distributed facial loads:                     0
   distributed volumetric loads:                     0
   concentrated loads:                     9
   single point constraints:                    63
   multiple point constraints:                     1
   terms in all multiple point constraints:                     1
   tie constraints:                     0
   dependent nodes tied by cyclic constraints:                     0
   dependent nodes in pre-tension constraints:                     0

   sets:                     6
   terms in all sets:                   105

   materials:                     1
   constants per material and temperature:                     2
   temperature points per material:                     1
   plastic data points per material:                     0

   orientations:                     0
   amplitudes:                     4
   data points in all amplitudes:                     4
   print requests:                     4
   transformations:                     0
   property cards:                     0


 STEP                     1

 Static analysis was selected

 Decascading the MPC's

 Determining the structure of the matrix:
 Using up to 4 cpu(s) for setting up the structure of the matrix.
 number of equations
 720
 number of nonzero lower triangular matrix elements
 37458

 Using up to 4 cpu(s) for the stress calculation.

 Using up to 4 cpu(s) for the symmetric stiffness/mass contributions.

Not reusing csc.
+-------------------------------------------------+
+     PaStiX : Parallel Sparse matriX package     +
+-------------------------------------------------+
  Version:                                   6.0.1
  Schedulers:
    sequential:                            Enabled
    thread static:                         Started
    thread dynamic:                       Disabled
    PaRSEC:                                Started
    StarPU:                               Disabled
  Number of MPI processes:                       1
  Number of threads per process:                 4
  Number of GPUs:                                1
  MPI communication support:              Disabled
  Distribution level:                     2D( 128)
  Blocking size (min/max):             1024 / 2048

  Matrix type:  General
  Arithmetic:   Float
  Format:       CSC
  N:            720
  nnz:          75636

+-------------------------------------------------+
  Ordering step :
    Ordering method is: Scotch
    Time to compute ordering:              0.0026 
+-------------------------------------------------+
  Symbolic factorization step:
    Symbol factorization using: Fax Direct
    Number of nonzeroes in L structure:      64548
    Fill-in of L:                         0.853403
    Time to compute symbol matrix:        0.0017 
+-------------------------------------------------+
  Reordering step:
    Split level:                                 0
    Stoping criteria:                           -1
    Time for reordering:                  0.0027 
+-------------------------------------------------+
  Analyse step:
    Number of non-zeroes in blocked L:      129096
    Fill-in:                              1.706806
    Number of operations in full-rank LU   :    12.72 MFlops
    Prediction:
      Model:                             AMD 6180  MKL
      Time to factorize:                  0.0036 
    Time for analyze:                     0.0001 
+-------------------------------------------------+
  Factorization step:
    Factorization used: LU
    Time to initialize internal csc:      0.0078 
    Time to initialize coeftab:           0.0007 
    Time to factorize:                    0.0082  ( 1.52 GFlop/s)
    Number of operations:                      12.72 MFlops
    Number of static pivots:                     0
CPU vs GPU CBLK GEMMS -> 37 vs 106
CPU vs GPU BLK GEMMS -> 0 vs 0
CPU vs GPU TRSM -> 0 vs 0
    Time to solve:                        0.0261 
    - iteration 1 :
         total iteration time                   0.044 
         error                                  1.4032e-06
    - iteration 2 :
         total iteration time                   0.0534 
         error                                  1.5343e-09
    - iteration 3 :
         total iteration time                   0.00317 
         error                                  8.7133e-13
    Time for refinement:                  0.1021 
    - iteration 1 :
         total iteration time                   0.00422 
         error                                  5.746e-15
    Time for refinement:                  0.0047 
________________________________________

CSC Conversion Time: 0.083536
Init Time: 0.023291
Factorize Time: 0.016837
Solve Time: 0.144072
Clean up Time: 0.000000
---------------------------------
Sum: 0.267736

Total PaStiX Time: 0.267736
CCX without PaStiX Time: 0.035720
Share of PaStiX Time: 0.882288
Total Time: 0.303457
Reusability: 0 : 1 
________________________________________

 Using up to 4 cpu(s) for the stress calculation.


 Job finished

________________________________________

Total CalculiX Time: 0.305798
________________________________________

:frowning: Unfortunately, in all my tests the GPU didn’t work at all! Could you share your CUDA-enabled ccx executable with me? I would like to test it on my Kubuntu 24.04 system. Alternatively, I could share my executable with you to verify whether CUDA is used for factorization on your system.
My issue is that although the compilation went smoothly, all variables were set correctly, and nvidia-smi correctly detected the ccx CUDA task, the GPU is not being used at all :frowning:
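
One way to check a run is to look for the GPU-related lines in the ccx output itself. The sketch below assumes the log contains the "Number of GPUs" and "CPU vs GPU CBLK GEMMS" lines in the form shown in the output posted earlier in this thread; other builds may print them differently. The function name `gpu_gemm_share` is invented for this example.

```shell
# Hypothetical helper: reads ccx/PaStiX output on stdin and summarizes
# how many GPUs PaStiX reported and how the CBLK GEMMs were split.
gpu_gemm_share() {
    awk '
        /Number of GPUs:/          { gpus = $NF }
        /CPU vs GPU CBLK GEMMS ->/ { cpu = $(NF-2); gpu = $NF; seen = 1 }
        END {
            printf "GPUs reported: %s\n", (gpus == "" ? "not found" : gpus)
            if (seen)
                printf "CBLK GEMMs on CPU: %s, on GPU: %s\n", cpu, gpu
            else
                print "no CPU vs GPU GEMM line found"
        }
    '
}
```

Usage would be `ccx -i beamp | gpu_gemm_share`. A GPU count of 0, or a missing GEMM line, would suggest the binary was built without CUDA support or that PASTIX_GPU was not set.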

To solve the Threads::Threads problem, try this.
In CMakeLists.txt, search for:

    if(PASTIX_WITH_CUDA)
        list(APPEND PARSEC_COMPONENT_LIST "CUDA")
    endif()

and simply put a space before the “CUDA” word:

    if(PASTIX_WITH_CUDA)
        list(APPEND PARSEC_COMPONENT_LIST " CUDA")
    endif()

But I don’t understand why this problem didn’t arise on your Ubuntu system.
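
The CMakeLists.txt edit described above could also be scripted. This is only a sketch of the same workaround: the function name is made up, and the path of the CMakeLists.txt to patch has to be supplied by you. The unmodified file is kept next to it with an `.orig` suffix.

```shell
# Hypothetical helper: apply the "space before CUDA" workaround to the
# CMakeLists.txt given as $1, keeping a backup in "$1.orig".
patch_parsec_component() {
    cmake_file=$1
    cp "$cmake_file" "$cmake_file.orig" &&
    sed 's/list(APPEND PARSEC_COMPONENT_LIST "CUDA")/list(APPEND PARSEC_COMPONENT_LIST " CUDA")/' \
        "$cmake_file.orig" > "$cmake_file"
}
```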

Hi all,

Sorry for resurrecting this topic, but I am also trying to compile CalculiX on Ubuntu 24.04.4 LTS
with the latest PaStiX, SPOOLES, etc., but without CUDA or GPU support. As a reference I used the foamBuilder guide, this forum, and AI. Unfortunately, the README is outdated :sob:.

For PaStiX I used Tags · Kabbone/PaStiX4CalculiX · GitHub (no dependence on CUDA), PaRSEC (newest from Bitbucket), Scotch (newest), and OpenBLAS from the Ubuntu repo. SPOOLES was modified as in foamBuilder. After a lot of guessing and trial and error, I managed to compile CalculiX. Great, and now I am testing correctness. To ensure correct multithreaded execution I created this script:

export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export PASTIX_THREAD_NUMBER=1

if [ -z "$1" ]; then
    echo "Usage: $0 jobname [extra ccx args]"
    exit 1
fi

JOB=$1
shift

/home (…) CalculiX/ccx_2.22/src/ccx_2.23 "$JOB" "$@"

I downloaded the test cases from the official website. They are not compatible with my default solver, PaStiX, so I scripted the insertion of Solver=Spooles into the test input files, but the results are mediocre at best. For example:

Part of the output:

aircolumn.dat and aircolumn.dat.ref do not have the same size !!!
axrad2.dat and axrad2.dat.ref do not have the same size !!!
beam10psmooth.rfn.dat.ref does not exist
beamhtfc2.dat and beamhtfc2.dat.ref do not have the same size !!!
deviation in file beampisof.dat line: 64 reference value: 9.169356e-01 value: 9.226554e-01 absolute error: 5.719800e-03 largest value within same block: 9.973145e-01 relative error w.r.t. largest value within same block: 0.573520 %
deviation in file beamprand.frd line: 1249 reference value: 9.321370e-02 value: -9.321370e-02 absolute error: 1.864274e-01 largest value within same block: 9.321370e-02 relative error w.r.t. largest value within same block: 200.000000 %
deviation in file beamptied4.dat line: 129 reference value: 8.075472e-01 value: 8.116879e-01 absolute error: 4.140700e-03 largest value within same block: 8.827490e-01 relative error w.r.t. largest value within same block: 0.469069 %
deviation in file beamptied5.dat line: 16 reference value: 2.004731e+06 value: 1.890823e+06 absolute error: 1.139080e+05 largest value within same block: 2.968844e+06 relative error w.r.t. largest value within same block: 3.836780 %
deviation in file beamptied6.dat line: 17 reference value: 2.090934e+06 value: 2.005467e+06 absolute error: 8.546700e+04 largest value within same block: 3.071561e+06 relative error w.r.t. largest value within same block: 2.782527 %
deviation in file beamptied7.dat line: 64 reference value: 8.286043e-01 value: 8.397084e-01 absolute error: 1.110410e-02 largest value within same block: 8.944509e-01 relative error w.r.t. largest value within same block: 1.241443 %
beamread.dat and beamread.dat.ref do not have the same size !!!
beamread2.dat and beamread2.dat.ref do not have the same size !!!
beamread3.frd does not exist
beamread4.dat and beamread4.dat.ref do not have the same size !!!
circ10dload.rfn.dat.ref does not exist
circ10pnl.rfn.dat.ref does not exist
circ11p.frd does not exist
coucyl.dat and coucyl.dat.ref do not have the same size !!!
crackIIinta.frd and crackIIinta.frd.ref do not have the same size !!!
green2.dat and green2.dat.ref do not have the same size !!!
greencyc1.dat and greencyc1.dat.ref do not have the same size !!!
greencyc2.dat and greencyc2.dat.ref do not have the same size !!!
deviation in file induction2.frd line: 20803 reference value: -9.582270e+00 value: -1.372320e+01 absolute error: 4.140930e+00 largest value within same block: 2.257380e+01 relative error w.r.t. largest value within same block: 18.343965 %
deviation in file potied.dat line: 64 reference value: 9.323342e-01 value: 9.068721e-01 absolute error: 2.546210e-02 largest value within same block: 9.975599e-01 relative error w.r.t. largest value within same block: 2.552438 %
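
To see at a glance which cases deviate the most, the comparison output above can be filtered by relative error. This is a small sketch that assumes the "deviation in file ... %" wording shown here; the function name `large_deviations` and the default 1 % threshold are choices made for this example. It scans word by word, so it also works when several messages end up on one line.

```shell
# Hypothetical helper: reads the test comparison output on stdin and prints
# "file relative_error" for every deviation above the threshold in percent
# ($1, default 1).
large_deviations() {
    threshold=${1:-1}
    awk -v limit="$threshold" '
        {
            for (i = 1; i <= NF; i++) {
                # "deviation in file NAME" introduces a new entry
                if ($i == "file" && i > 2 && $(i-1) == "in" && $(i-2) == "deviation")
                    file = $(i+1)
                # the relative error is the number right before the "%" sign
                else if ($i == "%" && i > 1 && ($(i-1) + 0) > (limit + 0))
                    print file, $(i-1)
            }
        }
    '
}
```

For the output above, `large_deviations 10` would list only beamprand.frd and induction2.frd.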

So for some files, even after inserting Solver=Spooles, the simulation is still run by PaStiX (the “do not have the same size !!!” errors); the keyword is silently ignored, which is new to me.
Some test cases show, for example, a 200% error, and some are even correct.
Honestly, I guess I need to retry the compilation process, but in the end I would probably get the same problems.
Does anyone have any idea what to do? These symptoms are really strange.
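
For reference, here is a minimal sketch of what such a solver-insertion filter might look like. It is not the script used above. It assumes the solver is chosen with a `SOLVER=` parameter on the procedure card and handles only `*STATIC` cards; other procedure cards (for example `*FREQUENCY`) are left untouched and would need the same treatment. The function name is made up for this example.

```shell
# Hypothetical helper: reads a CalculiX input deck on stdin and appends
# ", SOLVER=SPOOLES" to every *STATIC card that has no SOLVER parameter yet.
# All other lines are passed through unchanged.
add_spooles_to_static() {
    awk '
        {
            upper = toupper($0)
            if (upper ~ /^\*STATIC/ && upper !~ /SOLVER[ ]*=/) {
                sub(/[ \t\r]+$/, "")          # drop trailing blanks / CR
                print $0 ", SOLVER=SPOOLES"
            } else {
                print
            }
        }
    '
}
```

Usage would be `add_spooles_to_static < beamp.inp > beamp_spooles.inp`. If a deck uses a procedure card this filter does not cover, the default solver is still used for that step.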

Thank you in advance

You probably want to use the cudaless branch of Kabbone’s repo.
Some other libraries also needed patches (in my experience).

I have published my build scripts and patches for building a CalculiX binary with all required special libraries statically linked. See e.g. my earlier comment. Maybe you will find them useful.

Thank you. Well, I did use Kabbone’s repo…
So the best bet would be to redo the compilation process? I thought it would end up like this.
Have you ever run into the problem that CalculiX does not assign the simulation to the specified solver? Have you also come across a test simulation giving the opposite sign? Are those symptoms of something, or just a random outcome?
In your repo there are patches. Would those work for the newest PaStiX? It would be nice not to have to dig into old libraries and link against Python 2.7, etc.

No. If I’m running an eigenfrequency calculation I manually select the spooles solver; it is better than PaStiX in that case.

No, I have not seen that either.

The main problem I ran into is that PaStiX needs a single-threaded build of OpenBLAS (you need to define USE_THREAD=0 when building OpenBLAS). However, since PaStiX does use multiple cores, you do need to build OpenBLAS with locking enabled (USE_LOCKING=1). If you don’t do this correctly, calculations will produce weird results.
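
A minimal sketch of such an OpenBLAS build, using the two make variables named above. The function name, the source directory, the install prefix and the optional `echo` dry-run argument are placeholders for this example; adjust them to your setup.

```shell
# Hypothetical helper: build a single-threaded OpenBLAS with locking enabled.
# $1 = OpenBLAS source dir, $2 = install prefix,
# $3 = optional command prefix (pass "echo" to only print the commands).
build_openblas() {
    src_dir=$1
    prefix=$2
    run=${3:-}
    $run make -C "$src_dir" USE_THREAD=0 USE_LOCKING=1 &&
    $run make -C "$src_dir" PREFIX="$prefix" install
}
```

Calling `build_openblas ~/src/OpenBLAS ~/opt/openblas echo` prints the two make commands without running them.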

No, unfortunately not. As I understand it, some changes to PaStiX are needed to make it work with CalculiX out-of-the-box.
Looking at the publicly available issues, it is clear that Kabbone has been working on that with the authors of PaStiX, but I’m not sure it is finished.

AFAICT, the modified version of PaStiX called PaStiX4CalculiX is the only one that works with the published sources of CalculiX.

Update: see the brand new patch for CalculiX with PaStiX 6.4.0; thanks to rafal.brzegowy for providing the link.

The latest Windows binaries of CalculiX 2.23 seem to be built with a newer version of PaStiX, IIRC, but I could not find out how that was done.

For the record; Python 2.7 is only used in scripts to derive code for other precisions from a template file. Python 2 is not linked in and not required for using the library itself, just for building it. Those Python scripts are not compatible with Python 3.
