You should never set OMP_NUM_THREADS to a number higher than the actual number of physical cores.
In my experience hyperthreading decreases performance, at least on the CPUs I have tested it on (Intel Core i7-7700 and AMD Ryzen 5).
This is why I always disable hyperthreading in the BIOS and set the global environment variable OMP_NUM_THREADS to the number of physical cores.
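As a sketch of how to automate this on Linux (assuming lscpu is available; beamp is just the example deck from this thread, and the executable name may differ on your system):

```
# Count unique physical cores: lscpu -p=CORE,SOCKET prints one "core,socket"
# pair per logical CPU, so hyperthreads show up as duplicate pairs.
export OMP_NUM_THREADS=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
ccx -i beamp   # run CalculiX on the example deck
```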
Using the same beamp input, I get much higher factorization performance on my Intel Core i7-7700:
```
Factorization step:
Factorization used: LU
Time to initialize internal csc: 0.0012
Time to initialize coeftab: 0.0001
Time to factorize: 0.0006 (21.48 GFlop/s)
Number of operations: 12.72 MFlops
Number of static pivots: 0
Time to solve: 0.0001
- iteration 1 :
total iteration time 0.0002
error 2.1222e-06
- iteration 2 :
total iteration time 0.000216
error 6.8298e-10
- iteration 3 :
total iteration time 0.00022
error 8.3425e-13
Time for refinement: 0.0008
- iteration 1 :
total iteration time 0.000184
error 1.1033e-14
Time for refinement: 0.0003
```
On real-world problems, factorization performance is between 20 and 270 GFlop/s on my i7-7700.
Hint: if you put blocks of output or scripts between lines containing only ``` (three grave accents, ASCII 96), they will be rendered as preformatted text, as in the example below.
You can also use the "</>" button at the top of the edit box to do this.
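For example, a pasted line of solver output wrapped in fences:

````
```
Time to factorize: 0.0006 (21.48 GFlop/s)
```
````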
I do have 8 physical CPU cores. Last time I checked, I had turned off hyperthreading for the same reasons you mentioned. Hence I am still confused about the oversubscription error. It seems to happen even for a small model like beamp.
You are correct! I forgot that I had turned hyperthreading back on at some point. I do have only 4 physical CPU cores. That explains the oversubscription error.
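For anyone else hitting this, hyperthreading status can be checked on Linux without rebooting into the BIOS (a minimal sketch, assuming lscpu and nproc are installed):

```
# "Thread(s) per core" > 1 means hyperthreading/SMT is currently enabled
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
nproc   # logical CPUs seen by the OS, including hyperthreads
```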
In the instructions I sent you there are both versions: the first one without CUDA, using the Kabbone trunk with PaStiX on CPU only, and the second one with PaRSEC and CUDA, using Guido Dhondt's PaStiX4CalculiX. The second one is correctly detected by nvidia-smi, but the usage is 0% GPU and 100% CPU.
Please, could you share your complete runtime CalculiX output?
```
************************************************************
CalculiX Version 2.22 i8, Copyright(C) 1998-2024 Guido Dhondt
CalculiX comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
certain conditions, see gpl.htm
************************************************************
You are using an executable made on Sat Jan 4 02:50:15 PM MST 2025
The numbers below are estimated upper bounds
number of:
nodes: 261
elements: 32
one-dimensional elements: 0
two-dimensional elements: 0
integration points per element: 8
degrees of freedom per node: 3
layers per element: 1
distributed facial loads: 0
distributed volumetric loads: 0
concentrated loads: 9
single point constraints: 63
multiple point constraints: 1
terms in all multiple point constraints: 1
tie constraints: 0
dependent nodes tied by cyclic constraints: 0
dependent nodes in pre-tension constraints: 0
sets: 6
terms in all sets: 105
materials: 1
constants per material and temperature: 2
temperature points per material: 1
plastic data points per material: 0
orientations: 0
amplitudes: 4
data points in all amplitudes: 4
print requests: 4
transformations: 0
property cards: 0
STEP 1
Static analysis was selected
Decascading the MPC's
Determining the structure of the matrix:
Using up to 4 cpu(s) for setting up the structure of the matrix.
number of equations
720
number of nonzero lower triangular matrix elements
37458
Using up to 4 cpu(s) for the stress calculation.
Using up to 4 cpu(s) for the symmetric stiffness/mass contributions.
Not reusing csc.
+-------------------------------------------------+
+ PaStiX : Parallel Sparse matriX package +
+-------------------------------------------------+
Version: 6.0.1
Schedulers:
sequential: Enabled
thread static: Started
thread dynamic: Disabled
PaRSEC: Started
StarPU: Disabled
Number of MPI processes: 1
Number of threads per process: 4
Number of GPUs: 1
MPI communication support: Disabled
Distribution level: 2D( 128)
Blocking size (min/max): 1024 / 2048
Matrix type: General
Arithmetic: Float
Format: CSC
N: 720
nnz: 75636
+-------------------------------------------------+
Ordering step :
Ordering method is: Scotch
Time to compute ordering: 0.0026
+-------------------------------------------------+
Symbolic factorization step:
Symbol factorization using: Fax Direct
Number of nonzeroes in L structure: 64548
Fill-in of L: 0.853403
Time to compute symbol matrix: 0.0017
+-------------------------------------------------+
Reordering step:
Split level: 0
Stoping criteria: -1
Time for reordering: 0.0027
+-------------------------------------------------+
Analyse step:
Number of non-zeroes in blocked L: 129096
Fill-in: 1.706806
Number of operations in full-rank LU : 12.72 MFlops
Prediction:
Model: AMD 6180 MKL
Time to factorize: 0.0036
Time for analyze: 0.0001
+-------------------------------------------------+
Factorization step:
Factorization used: LU
Time to initialize internal csc: 0.0078
Time to initialize coeftab: 0.0007
Time to factorize: 0.0082 ( 1.52 GFlop/s)
Number of operations: 12.72 MFlops
Number of static pivots: 0
CPU vs GPU CBLK GEMMS -> 37 vs 106
CPU vs GPU BLK GEMMS -> 0 vs 0
CPU vs GPU TRSM -> 0 vs 0
Time to solve: 0.0261
- iteration 1 :
total iteration time 0.044
error 1.4032e-06
- iteration 2 :
total iteration time 0.0534
error 1.5343e-09
- iteration 3 :
total iteration time 0.00317
error 8.7133e-13
Time for refinement: 0.1021
- iteration 1 :
total iteration time 0.00422
error 5.746e-15
Time for refinement: 0.0047
________________________________________
CSC Conversion Time: 0.083536
Init Time: 0.023291
Factorize Time: 0.016837
Solve Time: 0.144072
Clean up Time: 0.000000
---------------------------------
Sum: 0.267736
Total PaStiX Time: 0.267736
CCX without PaStiX Time: 0.035720
Share of PaStiX Time: 0.882288
Total Time: 0.303457
Reusability: 0 : 1
________________________________________
Using up to 4 cpu(s) for the stress calculation.
Job finished
________________________________________
Total CalculiX Time: 0.305798
________________________________________
```
Unfortunately, in all my tests the GPU didn't work at all! Can I ask if you can share your ccx executable built with CUDA? I would like to test it on my Kubuntu 24.04 OS. Alternatively, I could share my executable with you to verify whether CUDA is being used for factorization on your system.
My issue is that while the compilation went smoothly and all variables were set correctly, and nvidia-smi correctly detects the ccx CUDA task, the GPU is not being used at all.
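In case it helps the comparison, this is roughly how I watch GPU activity during the solve (a sketch; the deck and log names are placeholders, and nvidia-smi must be on the PATH):

```
# Start the solve in the background, then sample GPU utilization once per second.
# Utilization should rise above 0% during the factorization if CUDA is used.
ccx -i beamp > ccx.log 2>&1 &
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,power.draw --format=csv -l 1
```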