Trouble compiling and running CalculiX with Pastix on Ubuntu 24.04

You should never set OMP_NUM_THREADS to a number higher than the actual number of physical cores.

In my experience, hyperthreading decreases performance, at least on the CPUs I’ve tested (Intel Core i7-7700 and AMD Ryzen 5).
This is why I always disable hyperthreading in the BIOS and set the global environment variable OMP_NUM_THREADS to the number of physical cores.
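
A minimal sketch of that setup on Linux, assuming `lscpu` from util-linux is installed. The helper name `count_physical_cores` is made up for illustration; it counts unique core/socket pairs, so hyperthread siblings are counted only once.

```shell
#!/bin/sh
# Hypothetical helper: count physical cores from `lscpu --parse=CORE,SOCKET`
# style input on stdin. Header lines start with '#'; each remaining line is
# "core,socket" for one logical CPU, so unique lines are physical cores.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Only touch OMP_NUM_THREADS when lscpu is actually available.
if command -v lscpu >/dev/null 2>&1; then
    OMP_NUM_THREADS=$(lscpu --parse=CORE,SOCKET | count_physical_cores)
    export OMP_NUM_THREADS
    echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
fi
```

Putting this in a shell startup file would make the setting global for that user; whether that is appropriate depends on what else runs on the machine.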

Using the same beamp input, I get much higher factorization performance on my Intel Core i7-7700:

  Factorization step:
    Factorization used: LU
    Time to initialize internal csc:      0.0012 
    Time to initialize coeftab:           0.0001 
    Time to factorize:                    0.0006  (21.48 GFlop/s)
    Number of operations:                      12.72 MFlops
    Number of static pivots:                     0
    Time to solve:                        0.0001 
    - iteration 1 :
         total iteration time                   0.0002 
         error                                  2.1222e-06
    - iteration 2 :
         total iteration time                   0.000216 
         error                                  6.8298e-10
    - iteration 3 :
         total iteration time                   0.00022 
         error                                  8.3425e-13
    Time for refinement:                  0.0008 
    - iteration 1 :
         total iteration time                   0.000184 
         error                                  1.1033e-14
    Time for refinement:                  0.0003 

On real-world problems factorization performance is between 20 and 270 GFlop/s on my i7-7700.

Hint: if you put blocks of output or scripts between lines containing only ``` (three “grave accent”, ASCII 96) it will be rendered as preformatted text.

You can also use the “</>” button at the top of the edit box to do this.

thank you, very useful :slight_smile:

I don’t have a powerful laptop:

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz

but in my tests the factorization reaches 450 GFlop/s.
Could anyone explain why I cannot use the GPU?

I do have 8 physical CPU cores. Last time I checked, I had turned off hyperthreading for the same reasons you mentioned. Hence I am still confused about the oversubscription error. It seems to happen for a small model like beamp.

@teofil75, I set

export PASTIX_GPU=1

and I can see this in my run, which tells me the GPU is helping with the GEMMs:

Also, here is my nvidia-smi:

Sun Jan  5 09:40:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050        Off |   00000000:3B:00.0 Off |                  N/A |
| N/A   42C    P8             N/A / ERR!  |       5MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1656      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

Maybe all you are missing is setting:

-DPARSEC_GPU_WITH_CUDA=ON \

during the PaRSEC build and

-DPASTIX_WITH_CUDA=ON \

for the PaStiX build.

The only way I have been able to trigger the “oversubscription” warning is to set OMP_NUM_THREADS higher than the number of actual cores:

elysium:~> uname -v
FreeBSD 14.2-RELEASE releng/14.2-n269506-c8918d6c7412 GENERIC
elysium:~...src/beamp> sysctl hw.ncpu
hw.ncpu: 4
elysium:~...src/beamp> env OMP_NUM_THREADS=4 ccx -i beamp | grep Over
elysium:~...src/beamp> env OMP_NUM_THREADS=6 ccx -i beamp | grep Over
W@-0001 Oversubscription on core 0 detected
W@-0001 Oversubscription on core 1 detected
elysium:~...src/beamp> env OMP_NUM_THREADS=8 ccx -i beamp | grep Over
W@-0001 Oversubscription on core 0 detected
W@-0001 Oversubscription on core 1 detected
W@-0001 Oversubscription on core 2 detected
W@-0001 Oversubscription on core 3 detected

Since you get warnings about oversubscription on two cores with OMP_NUM_THREADS=6, it looks to me like your CPU has only 4 cores.
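
One way to avoid tripping the warning by accident is to cap the requested thread count at the core count before launching ccx. This is only a sketch: `clamp_threads` is a hypothetical helper, and the core count has to come from you. As far as I know, `sysctl hw.ncpu` reports logical CPUs, so with hyperthreading enabled it would overstate the number of physical cores.

```shell
#!/bin/sh
# Hypothetical helper: print the smaller of REQUESTED and CORES.
# usage: clamp_threads REQUESTED CORES
clamp_threads() {
    if [ "$1" -gt "$2" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

# Example: asking for 6 threads on a 4-core machine yields 4.
# env OMP_NUM_THREADS="$(clamp_threads 6 4)" ccx -i beamp
```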

You are correct! I forgot that I had turned hyperthreading back on at some point. I do have only 4 real CPU cores. That explains the oversubscription error.

In the instructions I sent you there are both versions: the first one without CUDA, using the Kabbone trunk for CPU-only PaStiX, and the second one with PaRSEC, CUDA and Guido Dhondt's PaStiX4CalculiX. The second one is correctly detected by nvidia-smi, but the usage is 0% GPU and 100% CPU.
Please, could you share your complete CalculiX runtime output?

Sure thing. This is for running ccx on beamp:

************************************************************

CalculiX Version 2.22 i8, Copyright(C) 1998-2024 Guido Dhondt
CalculiX comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
certain conditions, see gpl.htm

************************************************************

You are using an executable made on Sat Jan  4 02:50:15 PM MST 2025

  The numbers below are estimated upper bounds

  number of:

   nodes:                   261
   elements:                    32
   one-dimensional elements:                     0
   two-dimensional elements:                     0
   integration points per element:                     8
   degrees of freedom per node:                     3
   layers per element:                     1

   distributed facial loads:                     0
   distributed volumetric loads:                     0
   concentrated loads:                     9
   single point constraints:                    63
   multiple point constraints:                     1
   terms in all multiple point constraints:                     1
   tie constraints:                     0
   dependent nodes tied by cyclic constraints:                     0
   dependent nodes in pre-tension constraints:                     0

   sets:                     6
   terms in all sets:                   105

   materials:                     1
   constants per material and temperature:                     2
   temperature points per material:                     1
   plastic data points per material:                     0

   orientations:                     0
   amplitudes:                     4
   data points in all amplitudes:                     4
   print requests:                     4
   transformations:                     0
   property cards:                     0


 STEP                     1

 Static analysis was selected

 Decascading the MPC's

 Determining the structure of the matrix:
 Using up to 4 cpu(s) for setting up the structure of the matrix.
 number of equations
 720
 number of nonzero lower triangular matrix elements
 37458

 Using up to 4 cpu(s) for the stress calculation.

 Using up to 4 cpu(s) for the symmetric stiffness/mass contributions.

Not reusing csc.
+-------------------------------------------------+
+     PaStiX : Parallel Sparse matriX package     +
+-------------------------------------------------+
  Version:                                   6.0.1
  Schedulers:
    sequential:                            Enabled
    thread static:                         Started
    thread dynamic:                       Disabled
    PaRSEC:                                Started
    StarPU:                               Disabled
  Number of MPI processes:                       1
  Number of threads per process:                 4
  Number of GPUs:                                1
  MPI communication support:              Disabled
  Distribution level:                     2D( 128)
  Blocking size (min/max):             1024 / 2048

  Matrix type:  General
  Arithmetic:   Float
  Format:       CSC
  N:            720
  nnz:          75636

+-------------------------------------------------+
  Ordering step :
    Ordering method is: Scotch
    Time to compute ordering:              0.0026 
+-------------------------------------------------+
  Symbolic factorization step:
    Symbol factorization using: Fax Direct
    Number of nonzeroes in L structure:      64548
    Fill-in of L:                         0.853403
    Time to compute symbol matrix:        0.0017 
+-------------------------------------------------+
  Reordering step:
    Split level:                                 0
    Stoping criteria:                           -1
    Time for reordering:                  0.0027 
+-------------------------------------------------+
  Analyse step:
    Number of non-zeroes in blocked L:      129096
    Fill-in:                              1.706806
    Number of operations in full-rank LU   :    12.72 MFlops
    Prediction:
      Model:                             AMD 6180  MKL
      Time to factorize:                  0.0036 
    Time for analyze:                     0.0001 
+-------------------------------------------------+
  Factorization step:
    Factorization used: LU
    Time to initialize internal csc:      0.0078 
    Time to initialize coeftab:           0.0007 
    Time to factorize:                    0.0082  ( 1.52 GFlop/s)
    Number of operations:                      12.72 MFlops
    Number of static pivots:                     0
CPU vs GPU CBLK GEMMS -> 37 vs 106
CPU vs GPU BLK GEMMS -> 0 vs 0
CPU vs GPU TRSM -> 0 vs 0
    Time to solve:                        0.0261 
    - iteration 1 :
         total iteration time                   0.044 
         error                                  1.4032e-06
    - iteration 2 :
         total iteration time                   0.0534 
         error                                  1.5343e-09
    - iteration 3 :
         total iteration time                   0.00317 
         error                                  8.7133e-13
    Time for refinement:                  0.1021 
    - iteration 1 :
         total iteration time                   0.00422 
         error                                  5.746e-15
    Time for refinement:                  0.0047 
________________________________________

CSC Conversion Time: 0.083536
Init Time: 0.023291
Factorize Time: 0.016837
Solve Time: 0.144072
Clean up Time: 0.000000
---------------------------------
Sum: 0.267736

Total PaStiX Time: 0.267736
CCX without PaStiX Time: 0.035720
Share of PaStiX Time: 0.882288
Total Time: 0.303457
Reusability: 0 : 1 
________________________________________

 Using up to 4 cpu(s) for the stress calculation.


 Job finished

________________________________________

Total CalculiX Time: 0.305798
________________________________________
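
To check quickly whether the GPU took part in the factorization, the `CPU vs GPU CBLK GEMMS` line from the output above can be pulled out with awk. This is only a sketch: `gemm_counts` is a made-up helper name, and it assumes the line format is exactly as printed in this run.

```shell
#!/bin/sh
# Hypothetical helper: read ccx output on stdin and print the CPU and GPU
# CBLK GEMM counts from lines such as
#   CPU vs GPU CBLK GEMMS -> 37 vs 106
gemm_counts() {
    awk '/^CPU vs GPU CBLK GEMMS ->/ { print "cpu=" $7 " gpu=" $9 }'
}

# Usage: ccx -i beamp | gemm_counts
```

A GPU count of 0 would suggest that all the CBLK GEMMs ran on the CPU.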

:frowning: Unfortunately, in all my tests the GPU didn’t work at all! Could you share your CUDA-enabled ccx executable with me? I would like to test it on my Kubuntu 24.04 system. Alternatively, I could share my executable with you so you can verify whether CUDA is used for factorization on your system.
My issue is that while the compilation went smoothly, all variables were set correctly, and nvidia-smi correctly detected the ccx CUDA task, the GPU is not being used at all :frowning:
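
One thing worth checking in this situation is whether the ccx binary is dynamically linked against the CUDA runtime at all. A rough sketch, assuming a dynamically linked build; `links_cuda` is a hypothetical helper, and a statically linked CUDA runtime would not show up this way.

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and succeed (exit 0) if any
# line mentions a CUDA library such as libcudart or libcublas.
links_cuda() {
    grep -q -i -E 'libcudart|libcublas|libcuda\.so'
}

# Usage (the path to ccx is whatever your build produced):
# ldd "$(command -v ccx)" | links_cuda && echo "CUDA libraries linked"
```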

To solve the Threads::Threads problem, try this:
In CMakeLists.txt, search for:

    if(PASTIX_WITH_CUDA)
        list(APPEND PARSEC_COMPONENT_LIST "CUDA")
    endif()

and simply put a space before the word “CUDA”:

    if(PASTIX_WITH_CUDA)
        list(APPEND PARSEC_COMPONENT_LIST " CUDA")
    endif()

But I don’t understand why this problem didn’t arise on your Ubuntu system.
