You should never set OMP_NUM_THREADS to a number higher than the actual number of physical cores.
In my experience hyperthreading decreases performance, at least on the CPUs I’ve tested it on (Intel Core i7-7700 and AMD Ryzen 5).
This is why I always disable hyperthreading in the BIOS, and set the global environment variable OMP_NUM_THREADS to the number of physical cores.
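As a rough illustration of the advice above, here is a minimal Python sketch that counts physical cores from `/proc/cpuinfo`-style text and puts that number into `OMP_NUM_THREADS`. It assumes a Linux x86 system where each processor record has `physical id` and `core id` fields. The helper names `count_physical_cores` and `omp_environment` are made up for this example.

```python
import os


def count_physical_cores(cpuinfo_text: str) -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = core_id = None
    # The extra empty string flushes the last record if the text
    # does not end with a blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical-processor record.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)


def omp_environment(cpuinfo_text: str) -> dict:
    """Copy the current environment and set OMP_NUM_THREADS to the physical core count."""
    env = dict(os.environ)
    n = count_physical_cores(cpuinfo_text)
    if n > 0:  # leave the variable alone if the fields were not found
        env["OMP_NUM_THREADS"] = str(n)
    return env
```

One possible use is to read `/proc/cpuinfo`, build the environment, and pass it to `subprocess.run(["ccx", "beamp"], env=env)`. With hyperthreading enabled, two logical processors share one `(physical id, core id)` pair, so the count stays at the number of real cores.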
Using the same beamp input, I get a much higher factorization performance on my Intel Core i7-7700:
Factorization step:
Factorization used: LU
Time to initialize internal csc: 0.0012
Time to initialize coeftab: 0.0001
Time to factorize: 0.0006 (21.48 GFlop/s)
Number of operations: 12.72 MFlops
Number of static pivots: 0
Time to solve: 0.0001
- iteration 1 :
total iteration time 0.0002
error 2.1222e-06
- iteration 2 :
total iteration time 0.000216
error 6.8298e-10
- iteration 3 :
total iteration time 0.00022
error 8.3425e-13
Time for refinement: 0.0008
- iteration 1 :
total iteration time 0.000184
error 1.1033e-14
Time for refinement: 0.0003
On real-world problems factorization performance is between 20 and 270 GFlop/s on my i7-7700.
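For reference, the GFlop/s figure in the log can be reproduced from the two numbers PaStiX prints next to it. A small sketch (the function name `gflops` is just for illustration):

```python
def gflops(operations_mflop: float, seconds: float) -> float:
    """Convert an operation count in MFlop and a wall time in seconds to GFlop/s."""
    return operations_mflop * 1e6 / seconds / 1e9


# Values from the factorization step above: 12.72 MFlops in 0.0006 s.
rate = gflops(12.72, 0.0006)
print(f"{rate:.1f} GFlop/s")
```

This gives about 21.2 GFlop/s, close to the reported 21.48. The small difference presumably comes from the printed time being rounded to four decimals. On a problem this small the timing is very coarse, so the rate is only a rough indication.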
Hint: if you put blocks of output or scripts between lines containing only ``` (three “grave accent”, ASCII 96) it will be rendered as preformatted text.
You can also use the “</>” button at the top of the edit box to do this.
I do have 8 physical cpu cores. Last time I checked I had turned off the hyperthreading for the same reasons you had mentioned. Hence I am still confused about the oversubscription error. It seems to happen for a small model like beamp.
You are correct! I forgot that I had turned hyperthreading back on at some point. I only have 4 real CPU cores. That explains the oversubscription error.
In the instructions I sent you there are both versions: the first one without CUDA, using the Kabbone trunk for PaStiX on the CPU only, and the second one with PaRSEC and CUDA, using Guido Dhondt’s PaStiX4CalculiX. The second one is correctly detected by nvidia-smi, but the usage is 0% GPU and 100% CPU.
Please, could you share your complete runtime CalculiX output?
************************************************************
CalculiX Version 2.22 i8, Copyright(C) 1998-2024 Guido Dhondt
CalculiX comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
certain conditions, see gpl.htm
************************************************************
You are using an executable made on Sat Jan 4 02:50:15 PM MST 2025
The numbers below are estimated upper bounds
number of:
nodes: 261
elements: 32
one-dimensional elements: 0
two-dimensional elements: 0
integration points per element: 8
degrees of freedom per node: 3
layers per element: 1
distributed facial loads: 0
distributed volumetric loads: 0
concentrated loads: 9
single point constraints: 63
multiple point constraints: 1
terms in all multiple point constraints: 1
tie constraints: 0
dependent nodes tied by cyclic constraints: 0
dependent nodes in pre-tension constraints: 0
sets: 6
terms in all sets: 105
materials: 1
constants per material and temperature: 2
temperature points per material: 1
plastic data points per material: 0
orientations: 0
amplitudes: 4
data points in all amplitudes: 4
print requests: 4
transformations: 0
property cards: 0
STEP 1
Static analysis was selected
Decascading the MPC's
Determining the structure of the matrix:
Using up to 4 cpu(s) for setting up the structure of the matrix.
number of equations
720
number of nonzero lower triangular matrix elements
37458
Using up to 4 cpu(s) for the stress calculation.
Using up to 4 cpu(s) for the symmetric stiffness/mass contributions.
Not reusing csc.
+-------------------------------------------------+
+ PaStiX : Parallel Sparse matriX package +
+-------------------------------------------------+
Version: 6.0.1
Schedulers:
sequential: Enabled
thread static: Started
thread dynamic: Disabled
PaRSEC: Started
StarPU: Disabled
Number of MPI processes: 1
Number of threads per process: 4
Number of GPUs: 1
MPI communication support: Disabled
Distribution level: 2D( 128)
Blocking size (min/max): 1024 / 2048
Matrix type: General
Arithmetic: Float
Format: CSC
N: 720
nnz: 75636
+-------------------------------------------------+
Ordering step :
Ordering method is: Scotch
Time to compute ordering: 0.0026
+-------------------------------------------------+
Symbolic factorization step:
Symbol factorization using: Fax Direct
Number of nonzeroes in L structure: 64548
Fill-in of L: 0.853403
Time to compute symbol matrix: 0.0017
+-------------------------------------------------+
Reordering step:
Split level: 0
Stoping criteria: -1
Time for reordering: 0.0027
+-------------------------------------------------+
Analyse step:
Number of non-zeroes in blocked L: 129096
Fill-in: 1.706806
Number of operations in full-rank LU : 12.72 MFlops
Prediction:
Model: AMD 6180 MKL
Time to factorize: 0.0036
Time for analyze: 0.0001
+-------------------------------------------------+
Factorization step:
Factorization used: LU
Time to initialize internal csc: 0.0078
Time to initialize coeftab: 0.0007
Time to factorize: 0.0082 ( 1.52 GFlop/s)
Number of operations: 12.72 MFlops
Number of static pivots: 0
CPU vs GPU CBLK GEMMS -> 37 vs 106
CPU vs GPU BLK GEMMS -> 0 vs 0
CPU vs GPU TRSM -> 0 vs 0
Time to solve: 0.0261
- iteration 1 :
total iteration time 0.044
error 1.4032e-06
- iteration 2 :
total iteration time 0.0534
error 1.5343e-09
- iteration 3 :
total iteration time 0.00317
error 8.7133e-13
Time for refinement: 0.1021
- iteration 1 :
total iteration time 0.00422
error 5.746e-15
Time for refinement: 0.0047
________________________________________
CSC Conversion Time: 0.083536
Init Time: 0.023291
Factorize Time: 0.016837
Solve Time: 0.144072
Clean up Time: 0.000000
---------------------------------
Sum: 0.267736
Total PaStiX Time: 0.267736
CCX without PaStiX Time: 0.035720
Share of PaStiX Time: 0.882288
Total Time: 0.303457
Reusability: 0 : 1
________________________________________
Using up to 4 cpu(s) for the stress calculation.
Job finished
________________________________________
Total CalculiX Time: 0.305798
________________________________________
Unfortunately, in all my tests the GPU didn’t work at all! Could you share your ccx executable built with CUDA? I would like to test it on my Kubuntu 24.04 system. Alternatively, I could share my executable with you to verify whether CUDA is being used for factorization on your system.
My issue is that while the compilation went smoothly, all variables were set correctly, and nvidia-smi correctly detected the ccx CUDA task, the GPU is not being used at all.
Sorry for resurrecting this topic, but I am also trying to compile CalculiX on Ubuntu 24.04.4 LTS
with the latest PaStiX, SPOOLES, etc., but without CUDA or GPU support. As a reference I used the foamBuilder guide, this forum and AI. The README is outdated.
For PaStiX I used Tags · Kabbone/PaStiX4CalculiX · GitHub (no dependence on CUDA); PaRSEC: newest from Bitbucket; Scotch: newest; OpenBLAS: from the Ubuntu repo. SPOOLES was modified as in foamBuilder. After a lot of guessing and trial and error, I managed to compile CalculiX. Great, and now I am testing correctness. To ensure correct multithreaded execution I created a script:
I downloaded the test cases from the official website. They are not compatible with my default solver, PaStiX, so I scripted the insertion of Solver=Spooles into the test input files, but the results are mediocre at best. For example:
Part of the output:
aircolumn.dat and aircolumn.dat.ref do not have the same size !!!
axrad2.dat and axrad2.dat.ref do not have the same size !!!
beam10psmooth.rfn.dat.ref does not exist
beamhtfc2.dat and beamhtfc2.dat.ref do not have the same size !!!
deviation in file beampisof.dat line: 64 reference value: 9.169356e-01 value: 9.226554e-01 absolute error: 5.719800e-03 largest value within same block: 9.973145e-01 relative error w.r.t. largest value within same block: 0.573520 %
deviation in file beamprand.frd line: 1249 reference value: 9.321370e-02 value: -9.321370e-02 absolute error: 1.864274e-01 largest value within same block: 9.321370e-02 relative error w.r.t. largest value within same block: 200.000000 %
deviation in file beamptied4.dat line: 129 reference value: 8.075472e-01 value: 8.116879e-01 absolute error: 4.140700e-03 largest value within same block: 8.827490e-01 relative error w.r.t. largest value within same block: 0.469069 %
deviation in file beamptied5.dat line: 16 reference value: 2.004731e+06 value: 1.890823e+06 absolute error: 1.139080e+05 largest value within same block: 2.968844e+06 relative error w.r.t. largest value within same block: 3.836780 %
deviation in file beamptied6.dat line: 17 reference value: 2.090934e+06 value: 2.005467e+06 absolute error: 8.546700e+04 largest value within same block: 3.071561e+06 relative error w.r.t. largest value within same block: 2.782527 %
deviation in file beamptied7.dat line: 64 reference value: 8.286043e-01 value: 8.397084e-01 absolute error: 1.110410e-02 largest value within same block: 8.944509e-01 relative error w.r.t. largest value within same block: 1.241443 %
beamread.dat and beamread.dat.ref do not have the same size !!!
beamread2.dat and beamread2.dat.ref do not have the same size !!!
beamread3.frd does not exist
beamread4.dat and beamread4.dat.ref do not have the same size !!!
circ10dload.rfn.dat.ref does not exist
circ10pnl.rfn.dat.ref does not exist
circ11p.frd does not exist
coucyl.dat and coucyl.dat.ref do not have the same size !!!
crackIIinta.frd and crackIIinta.frd.ref do not have the same size !!!
green2.dat and green2.dat.ref do not have the same size !!!
greencyc1.dat and greencyc1.dat.ref do not have the same size !!!
greencyc2.dat and greencyc2.dat.ref do not have the same size !!!
deviation in file induction2.frd line: 20803 reference value: -9.582270e+00 value: -1.372320e+01 absolute error: 4.140930e+00 largest value within same block: 2.257380e+01 relative error w.r.t. largest value within same block: 18.343965 %
deviation in file potied.dat line: 64 reference value: 9.323342e-01 value: 9.068721e-01 absolute error: 2.546210e-02 largest value within same block: 9.975599e-01 relative error w.r.t. largest value within same block: 2.552438 %
So basically, for some files the simulation is still done by PaStiX even after inserting Solver=Spooles (the "do not have the same size !!!" errors). The keyword is ignored without any error message, which is new to me.
Some test cases show errors as large as 200%, and some are even correct.
Honestly, I guess I need to retry the compilation process, but in the end I would probably get the same problems.
Does anyone have any idea what to do? These symptoms are really strange.
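To make a comparison log like the one above easier to triage, here is a small Python sketch that extracts the `deviation in file ...` records and flags the ones where the computed value is exactly the negative of the reference. The regular expression is fitted to the message format quoted in this post. The names `Deviation` and `parse_deviations` are made up for this example.

```python
import math
import re
from dataclasses import dataclass
from typing import List

# Matches numbers such as 9.169356e-01, -1.372320e+01 or 200.000000
NUM = r"([-+]?\d+\.\d+(?:[eE][-+]?\d+)?)"
PATTERN = re.compile(
    r"deviation in file (\S+)\s+line:\s*(\d+)\s+"
    r"reference value:\s*" + NUM + r"\s+value:\s*" + NUM + r"\s+"
    r"absolute error:\s*" + NUM + r"\s+"
    r"largest value within same block:\s*" + NUM + r"\s+"
    r"relative error w\.r\.t\. largest value within same block:\s*" + NUM + r"\s*%"
)


@dataclass
class Deviation:
    file: str
    line: int
    reference: float
    value: float
    relative_percent: float

    @property
    def is_sign_flip(self) -> bool:
        """True if the value equals minus the reference (same magnitude, opposite sign)."""
        return self.value != 0.0 and math.isclose(self.value, -self.reference, rel_tol=1e-6)


def parse_deviations(text: str) -> List[Deviation]:
    """Extract all deviation records from the comparison output."""
    return [
        Deviation(m.group(1), int(m.group(2)), float(m.group(3)),
                  float(m.group(4)), float(m.group(7)))
        for m in PATTERN.finditer(text)
    ]
```

Applied to the output above, the only record flagged as a sign flip would be beamprand.frd (9.321370e-02 versus -9.321370e-02), which is where the 200 % comes from. The other deviations are magnitude differences between roughly 0.5 % and 18 %, so the two groups may well have different causes.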
You probably want to use the cudaless branch of Kabbone’s repo.
Some other libraries also needed patches (in my experience).
I have published my build scripts and patches for building a CalculiX binary with all required special libraries statically linked. See e.g. my earlier comment. Maybe you will find them useful.
Thank you. Well, I did use Kabbone’s repo…
So the best bet would be to reproduce that compilation process? I thought it would end up like this.
Have you ever run into the problem that CalculiX does not assign a simulation to the specified solver? Also, have you come across test simulations giving results with the opposite sign? Are those symptoms of something, or just random outcomes?
In your repo… there are patches; would those work for the newest PaStiX? It would be nice not to have to dig into old libraries and link against Python 2.7, etc.
No. If I’m running an eigenfrequency calculation I manually select the spooles solver; it is better than PaStiX in that case.
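As a sketch of what selecting the solver in the input deck could look like when scripted: this assumes the `SOLVER` parameter goes on the procedure card (for example `*STATIC, SOLVER=SPOOLES`), which is how I understand the CalculiX keyword reference. The list of cards handled here is deliberately short and not exhaustive, continuation lines are not handled, and the function name `force_solver` is made up for this example.

```python
# Assumption: only these procedure cards are rewritten; extend as needed.
PROCEDURE_CARDS = ("*STATIC", "*FREQUENCY", "*BUCKLE")


def force_solver(deck_text: str, solver: str = "SPOOLES") -> str:
    """Set SOLVER=<solver> on known procedure cards, replacing any existing SOLVER parameter."""
    out = []
    for line in deck_text.splitlines():
        fields = [f.strip() for f in line.strip().split(",")]
        keyword = fields[0].upper()
        if keyword in PROCEDURE_CARDS:
            # Keep all other parameters, drop an existing SOLVER=... entry.
            params = [
                f for f in fields[1:]
                if f and not f.upper().replace(" ", "").startswith("SOLVER=")
            ]
            line = ", ".join([fields[0]] + params + [f"SOLVER={solver}"])
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if deck_text.endswith("\n") else "")
```

For example, `force_solver("*STEP\n*STATIC\n*END STEP\n")` turns the `*STATIC` line into `*STATIC, SOLVER=SPOOLES` and leaves the other lines alone. This does not explain why the setting appeared to be ignored in the tests above. It only shows one way to make the edit consistently.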
No, I have not seen that either.
The main problem I ran into is that PaStiX needs a single-threaded build of OpenBLAS (you need to define USE_THREAD=0 when building OpenBLAS). However, since PaStiX does use multiple cores, you do need to build OpenBLAS with locking enabled (USE_LOCKING=1). If you don’t do this correctly, calculations will produce weird results.
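For a scripted build, the two OpenBLAS options named above can be passed on the `make` command line. A minimal Python sketch follows. The function name `openblas_make_command` and the `-j` job count are illustrative, and the source directory in the usage note is a placeholder.

```python
def openblas_make_command(jobs: int = 1) -> list:
    """Build the make invocation for a single-threaded OpenBLAS with locking enabled."""
    return [
        "make",
        f"-j{jobs}",        # parallel compile jobs; unrelated to OpenBLAS runtime threading
        "USE_THREAD=0",     # single-threaded library, as PaStiX needs
        "USE_LOCKING=1",    # keep it safe to call from PaStiX's own threads
    ]
```

It could be run with something like `subprocess.run(openblas_make_command(4), cwd="OpenBLAS", check=True)`, where `OpenBLAS` stands for wherever the sources were unpacked. Note that the thread earlier mentions using OpenBLAS from the Ubuntu repository, which may not have been built with these options.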
No, unfortunately not. As I understand it, some changes to PaStiX are needed to make it work with CalculiX out-of-the-box.
Looking at the publicly available issues, it is clear that Kabbone has been working on that with the authors of PaStiX, but I’m not sure it is finished.
AFAICT, the modified version of PaStiX called PaStiX4CalculiX is the only one that works with the published sources of CalculiX.
The latest Windows binaries of CalculiX 2.23 seem to be built with a newer version of PaStiX, IIRC. But I could not find out how that was done.
For the record: Python 2.7 is only used in scripts to derive code for other precisions from a template file. Python 2 is not linked in and not required for using the library itself, just for building it. Those Python scripts are not compatible with Python 3.