CalculiX BEGINNING

No, it shows non-linear scaling (it is only being compared with a linear curve). But most importantly, scaling is highly dependent on the given case.

I agree. In my case it is pretty much manual meshing, 100% hex, then extrusion. From there I can change the hex type or refine the entire mesh, switching back and forth between hex8 and hex20 depending on whether I just need a quick run or I need precision. I did not see a lot of difference in speed or problem size between hex20 and hex8 refined enough to get sufficient accuracy, and interpolation of stresses was better with hex20. For a model without the regularity of mine I would have needed tets and wedges mixed in, in ways that make refinement very labor intensive. There is research on auto-meshing hexes, but I don't think it has worked its way into the freeware and cheapware I use. I intend at some point to overlay rebar cages using shared nodes, so node positioning is fairly critical, which is difficult with auto-meshed tets. I may eventually get into combination steel/concrete rails, which will require different techniques. I am currently using Mecway.

I believe there is a scaling law (Amdahl's law) of diminishing returns, even in the best case, as the number of working components increases. I.e., 100 monkeys do not finish a single task in 1/100th the time of one monkey; even 2 cannot finish in half the time of one. "Linear Equation Solvers on GPU Architectures for Finite Element Methods in Structural Mechanics" by Peter Wauligmann (2020) describes speed-up issues for FEM; in particular, I believe he did much of the work integrating PaStiX into CalculiX. It is a bit dense, but reading through it gives one a sense of what takes time in the FEM process. Some of his test problems may be similar to yours, but they top out at about 4,000,000 nodes. User FeaCluster on this forum runs a cluster running CalculiX and has been helpful to me. I have not used his service, but I might in the future, as I believe he is set up to do larger problems than I can handle. Note that just looking at the results of a big problem requires a very fast GPU with lots of memory.
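The diminishing returns described above can be made concrete with Amdahl's law: if a fraction s of the work is inherently serial, n workers can speed the job up by at most 1/(s + (1−s)/n). A minimal sketch in Python (the 5% serial fraction is just an illustrative assumption):

```python
def amdahl_speedup(n_cores: int, serial_fraction: float) -> float:
    """Amdahl's law: best-case speedup on n cores when a fraction s of
    the work is inherently serial and the rest parallelizes perfectly."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / n_cores)

# Even with only 5% serial work, 100 "monkeys" fall far short of 100x:
for n in (1, 2, 4, 100):
    print(n, round(amdahl_speedup(n, 0.05), 2))
# prints speedups 1.0, 1.9, 3.48 and, for 100 cores, only 16.81
```

This is the best case; real solvers add communication and memory-bandwidth overhead on top of the serial fraction.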

@Cinek_Poland New version of CalculiX (2.21) is available now. Like I said, it features Johnson-Cook plasticity: CalculiX: New features in Version 2.21 of CalculiX


I get different results.
On FreeBSD, your example problem with the “iterative scaling” solver ran in 20.4 seconds on 4 cores (although it seems to only use one). With the “spooles” solver, it took 28 seconds, not 40.
It could be that this is an effect of the Windows I/O subsystem, which is slow compared to Linux and FreeBSD.

Your graph paints a picture of spooles being the slowest solver, and that has not been my experience.
In general, most of the analyses I run are of slender parts, i.e. one dimension being much larger than the others. In that case, the iterative solver is really slow.

On a test problem with 24883 nodes and 5000 he20r elements, the spooles solver ran in 4.3 seconds while the iterative scaling solver took 20.5 seconds. Quite different from the results you are getting.

A real-world problem (78732 nodes, 15184 he20r elements) ran in 73 seconds with spooles, and I killed the iterative solver after 11 minutes. :frowning:

The iterative scaling solver uses less memory; that makes it slower, but it can solve larger models than Spooles. I can confirm this from previous tests, back in the days of my old laptop with a 32-bit Windows OS.

Below is a graph taken from Jörg Hiller's (2008) report.

It also matches the tests that I did.

Personally though, I don’t mind tweaking my models so that they fit in the RAM of my machine. Generally, I model the geometry in cgx so I can get a good hex mesh and use he20r elements as the manual recommends.
My general impression is that you don’t need a fine mesh to get decent results with those elements.

In my experience, UNIX-like platforms are generally faster. They also have more tools available out-of-the-box.

Mesh refinement is done by the solver, with the *REFINE MESH keyword. Note that this only works for tetrahedral meshes (te10 in the preprocessor, C3D10 in the solver). I expect this needs either the tetgen or netgen programs to work.
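For illustration, a deck fragment using this keyword might look as follows (the LIMIT value and the choice of S, the stress field, as refinement criterion are placeholders; check the ccx manual for your version):

```
** inside a *STEP: refine the tet mesh where the chosen
** field exceeds the LIMIT value
*REFINE MESH,LIMIT=50.
S
```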
For mechanical analysis I would advise to stay away from first order tet elements (te4/C3D4); they do not tend to produce good results.

If you use 20-node hex elements with reduced integration (he20r), you will find that you can get decent results with a relatively small number of elements, so they fit in the RAM of your machine when using the Spooles solver. Unfortunately, filling arbitrary geometry with hex elements automatically is difficult. The default CalculiX preprocessor cgx does a decent job of it if you create the geometry in it. So I tend to rebuild the geometry in cgx rather than just auto-meshing a STEP file with gmsh.
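A minimal cgx command-file (.fbd) sketch of this workflow: define points and lines with element divisions, build a surface, extrude it, pick he20r, and mesh. All names, coordinates, and division counts here are illustrative, not taken from an actual profile:

```
# illustrative 10 x 5 rectangle, extruded 50 units
pnt p1 0  0 0
pnt p2 10 0 0
pnt p3 10 5 0
pnt p4 0  5 0
line l1 p1 p2 8    # last number = element divisions along the line
line l2 p2 p3 4
line l3 p3 p4 8
line l4 p4 p1 4
surf s1 l1 l2 l3 l4
swep s1 s2 tra 0 0 50 20   # extrude the surface into a body, 20 divisions
elty all he20r             # 20-node hexes with reduced integration
mesh all
send all abq               # write the mesh in Abaqus/ccx format
```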

For example, here is the geometry and hex mesh for an extruded aluminium profile I have made to use in an analysis of an assembly.


Note how this cross-section has been divided into several surfaces each built up out of 4–5 lines to make it mesh with the built-in mesher of cgx. The resulting mesh looks like this:

Note that this cross-section has been defeatured. That is, small radii and holes in this part have been removed to make modelling easier and to reduce the element count. This might very well result in stress singularities in the corners, but since I'm not interested in the stresses in this part for this analysis, I don't care.

If it’s marketing, it is safe to assume that there is at least some bullshit involved. But this graph concerns Abaqus, not CalculiX. It is safe to assume that Dassault has the resources to make sure their solver runs well on high-end hardware.

In a small test, with CalculiX 2.20 and spooles on an i7-7700 CPU @ 3.60GHz I get:

  • 1 core: 7.1 seconds
  • 2 cores: 5.3 seconds
  • 4 cores: 4.3 seconds

So the first doubling of the number of cores gives a 34% speedup.
The next doubling gives only a 23% speedup. You can see where that is going.

And it makes sense. The more cores you have, the more coordination/communication overhead there will be.
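Plugging the timings above into Amdahl's law lets one estimate the effective serial fraction of this run (a rough sketch; it assumes the non-serial part scales perfectly):

```python
def serial_fraction(t1: float, tn: float, n: int) -> float:
    """Invert Amdahl's law: given runtimes on 1 and n cores, estimate
    the serial fraction s from t_n = t_1 * (s + (1 - s) / n)."""
    return (tn / t1 - 1.0 / n) / (1.0 - 1.0 / n)

# Measured above: 7.1 s on 1 core, 5.3 s on 2 cores, 4.3 s on 4 cores.
print(round(serial_fraction(7.1, 5.3, 2), 2))  # → 0.49
print(round(serial_fraction(7.1, 4.3, 4), 2))  # → 0.47
```

Both measurements point to roughly half the runtime being effectively serial, which is why the 4-core run is nowhere near 4× faster.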

Note that it generally does not make sense to use more threads than your CPU has physical cores. If you try to use more, you will just end up with threads fighting over CPU time.

These processor speed-up results are typical for a CPU with two memory channels. I think you would obtain similar results with Abaqus. With an eight-memory-channel processor, the speed-up should be better.

Note that extra threads beyond the physical cores may improve actual processing times by running program, system, and operating-system tasks in parallel with the base cores while those are busy fetching and saving data. The hyper-threads usually don't have the double-precision throughput of the base cores, and the software may not be written to take full advantage of them. Also, some out-of-core variants store data in text form and do not buffer very efficiently: the text form involves converting binary integers and floating-point values to text, then converting back when the data is used again.

You are missing the point. If you have a system with N physical cores, then only N threads can run at the same time. All other threads will be sleeping waiting to be rescheduled.

If you were to launch 10×N threads on an N-core machine, by definition 9×N of them will be waiting for processor time.

Hi rsmith,

I'm puzzled now. Which would then, in your opinion, be the most efficient setup for a machine with the following?

Number of cores / threads: 8 / 16

I mean regarding the system variables:

CCX_NPROC_STIFFNESS

CCX_NPROC_RESULTS

OMP_NUM_THREADS

NUMBER_OF_PROCESSORS

That is not the way processor scheduling works these days. Threads have wait times while registers are loaded and unloaded, and the hyper-threads can be used in those gaps. This is why my 8-physical-core machine has a multi-thread benchmark of about 10.65. It is true that if set to 8 cores CalculiX will use those 8, but the hyper-threads are available for other work in the gaps in CalculiX's core use: loading another portion of the program or data, saving data to memory, and system-related tasks that are nevertheless necessary for CalculiX to function. If you set your BIOS to disable hyper-threading, it will behave differently and will multiplex these essential tasks onto your cores. This scheduling occurs in the microcode and the processor itself and varies from processor to processor and generation to generation. I think earlier generations of processors, when hyper-threading was new, would take a performance hit, but now not so much. Things may be different with simultaneous multiple users, who must be separated for security reasons. Also, if all 8 cores are running a process that uses all registers and keeps its data in them all the time, you would be correct, as there would be no gaps in the calculations for loading data and instructions or saving data. Maybe I have this wrong, but that is how I remember it. Check with a hardware processor expert, or try scheduling all cores on a large example problem with hyper-threading off, then with hyper-threading on.

For me it’s not just a question of efficiency.
CPUs from both Intel and AMD have security issues with simultaneous
multithreading ("hyper-threading" in Intel's terminology).
In my testing it slowed down CPU-intensive multithreaded programs.
So that is why I switch it off on all my machines.

That is why I set the variables you mention to the number of physical cores that the machine has.
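A minimal sketch of setting those variables before launching the solver, in Python (the value 8 and the `ccx -i job` invocation are assumptions; substitute your own physical core count and job name):

```python
import os

# Assumption: 8 physical cores; ignore SMT/hyper-threads for the thread count.
PHYSICAL_CORES = 8

env = os.environ.copy()
for var in ("OMP_NUM_THREADS", "CCX_NPROC_STIFFNESS",
            "CCX_NPROC_RESULTS", "NUMBER_OF_PROCESSORS"):
    env[var] = str(PHYSICAL_CORES)

# Then launch CalculiX with this environment, e.g.:
#   subprocess.run(["ccx", "-i", "job"], env=env, check=True)
print(env["OMP_NUM_THREADS"])  # → 8
```

The same can of course be done with `export`/`setx` in the shell; the point is simply that all four variables get the physical core count.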

Furthermore I have the system scheduler set up so that the foreground tasks stay nice and responsive even if CalculiX is running full tilt.

Thanks Roland,

I will do some testing to see how my machine responds.

Some deeper investigation reveals that the number of cores that can be used effectively is related to the computational rate of the processor cores and the memory bandwidth. FEM work normally uses double precision, which is slower, though less so when the processing uses AVX-512. The memory bandwidth relates to the number of channels. The computational bandwidth that can be used effectively for most FEM is limited in most Intel processors to between 8 and 12 physical cores per CPU; therefore twin-CPU machines have better throughput. This data is from a document on choosing a workstation: Workstations for simulation (FEA) - DEVELOP3D. Also, DDR5 memory is currently not meaningfully faster than DDR4 due to DDR5's higher latency. The test machine for this ANSYS benchmarking was a Lenovo P910. Used refurbished P920s seem to be going for reasonable prices; I saw one on Amazon last night for between 1700 $US and 3700 $US depending on memory (32 GB to 1 TB).


They tested all the benchmark problems with between 1 and 28 cores.
Moving from 1 to 8 cores one sees significant improvement.
More than 12 cores doesn't seem to help much, though.

It does make me wonder how a modern AMD workstation CPU would perform. In a test at the end of last year, a Ryzen 5500U mobile CPU at 2 GHz was slightly faster than an i7-7700 at 3.6 GHz.

With a single processor, the limited number of memory channels is the issue when problems get bigger; below the 7000 series, the lack of AVX-512 (which none of the lower AMD Ryzens have) also slows them. Apparently AVX-512, with its 512-bit registers that can each be broken into four double-precision sections, can process double precision between 25% and 90% faster than processors with just AVX2 and its 256-bit registers. A two-processor computer can handle twice as much memory, because the memory is split between the processors and their channels. I have not seen boards for multi-processor Ryzens, except for EPYC processors, which are very pricey and rack-mounted.
I run a 3700X with 64 GB, and best performance is with 6 of its 8 cores, with 4 being close behind. My largest problems use just over 100 GB, which I get away with by using a fast SSD for memory paging; it nonetheless slows down a lot as it uses more and more virtual-memory paging. I built this system to learn FEM a few years ago and it has been enough for me, though 128 GB of RAM would have been better; the Ryzens will not handle more. That might then handle static problems up to 5,500,000 nodes, but they would take a lot longer to run.
Yes, the Ryzen 3 is fast per dollar, but for FEM the fewer memory channels, lack of AVX-512, and lower memory capacity make it slower and limit the maximum problem size.
Note that when I built it 3.5 years ago, Intel had yet to drop prices to compete with AMD. Also, the stock cooler and a reasonable power supply work well with the Ryzens. It is easy to upgrade, since except for the 7000 series and up, the AM4 socket is used throughout; Intel has a lot of different sockets and boards for their processors. A tower with an AMD 7700X, 128 GB of DDR5 memory on an AM5 board, and a 2 TB PCIe 4 NVMe drive might get you started at low cost and allow for problems up to 5,500,000 or so nodes (16,000,000 DOF). To get much more, you would need a two-processor Intel board with 512 GB or more memory and 10 or 12 physical cores per CPU. Supermicro is a good place to look at what can be done, and they can be helpful, though probably very expensive compared to the used Lenovo P920 systems. There are possibly paperwork issues with foreign sales of new product from Supermicro, but Lenovo is a Chinese firm.


Note again that I run an AMD Windows system. While trying to run the latest CalculiX Windows binary, 2.21, it seems that I cannot run one that uses Pardiso as the solver. Of the three solvers currently used by CalculiX (Spooles, PaStiX, and Pardiso), Pardiso is proprietary and provided by Intel as part of their MKL libraries. These libraries, as currently provided by Intel, do not work with AMD CPUs. There used to be workarounds, but Intel has closed them off. It looks like I will not be able to run Pardiso on version 2.21 unless someone compiles it using older MKL libraries.
I mention this because it is relevant to the choice of processor for a workstation running CalculiX on larger problems. This is in addition to the limit on usable cores due to the number of memory channels, as the consumer-grade AM4- and AM5-socket AMD processors only have two memory channels, and to the lower memory limit for these processors (128 GB). PaStiX is faster than Pardiso for medium-sized problems if a version compiled with the i8 setting (8-byte integers) is used, but it requires more memory, which slows it down for very large problems.
In summary, a workstation for large in-core CalculiX problems should use an Intel processor with at least 8 cores, as much memory as possible, a fast core clock speed, and potentially multiple processors on a multi-socket board. If I outlive my current workstation or routinely need more capability than the up-to-2,000,000-node nonlinear geometry and material problems I currently run, this is the direction I will go.
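For reference, the best-known of those workarounds was an environment variable that forced MKL's fast AVX2 code path on AMD CPUs; Intel removed it in MKL 2020 Update 1, so on current MKL builds it is a no-op:

```python
import os

# Historical workaround for MKL on AMD CPUs (removed in MKL 2020 Update 1
# and later, so this has no effect on current MKL versions):
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"
```

It must be set before the MKL-linked binary (e.g. a Pardiso-enabled ccx) is launched, which only helps if that binary was built against an old enough MKL.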


Hi,

Regarding v2.21:
Pardiso works well, but I have noticed a huge increase in the time and resources required to solve the same model with the new ccx v2.21 + PaStiX.

ccx v2.21 + Pardiso, and ccx v2.20 + Pardiso or PaStiX, all solve it in under 1 second.
ccx v2.21 + PaStiX takes 21 seconds.

The Windows task manager shows 100% CPU usage, which should mean shorter computation times, but it doesn't. It seems contradictory.

I'm using an AMD CPU and Windows 10 Pro.

The test problem is linear static with just 5,000 nodes.
Has anyone experienced something similar?
