|
BernardIE5317 wrote: Yes I am thinking in terms of "if 23% is good 99.999% is better."
That's describing a 4-core system where one process is pegging a single core at 100%, and whoever wrote that program didn't bother to write it as multi-threaded. It may or may not be possible to do that. Or maybe it was decided it just wasn't worth it, given the overall time expected for the task to complete versus the complexity involved in writing a well-behaved multi-threaded application.
|
|
|
|
|
Software is hard. Good software is harder. Good parallel software is harder than that.
If you want high CPU loads that mostly represent useful work, you have to do a lot of work figuring out how to avoid system calls, memory allocation, and even random accesses into gigantic memory maps, and how to communicate efficiently among the threads. Inefficiency in any one of these areas can slow processing down to the point where the multiple threads don't proceed much faster than a single thread would.
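As a rough illustration of that last point (a hypothetical C++ sketch, not anything from this thread): giving each thread its own contiguous slice of the data and its own accumulator, with no allocation, locking, or shared writes in the hot loop, is the kind of structure that lets the extra cores actually pay off.

// Hypothetical sketch: per-thread slices and per-thread accumulators,
// combined only once at the end. Assumes nthreads >= 1.
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

std::uint64_t parallel_sum(const std::vector<std::uint64_t>& data, unsigned nthreads)
{
    std::vector<std::uint64_t> partial(nthreads, 0);   // one slot per thread, no sharing
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end   = (t + 1 == nthreads) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            std::uint64_t local = 0;                    // thread-local accumulator
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partial[t] = local;                         // single write at the end
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}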
|
|
|
|
|
Indeed. That's why I've been saying all along that making software multi-threaded isn't something you get for free, and many developers will forgo the benefits unless there are significant, measurable gains to be had.
In other words...stop worrying about processes not pinning your CPU at 100%. In fact, that is when you should start worrying about what's going on...
|
|
|
|
|
Many operations on the motherboard do not require direct CPU utilization: reading/writing disk files (HD drives have rotational delays, SSDs have bandwidth limits that differ between reads and writes), pulling data from the Internet, and moving data to and from your video card. Many of these use hardware DMA (Direct Memory Access) to move the data, and the CPU sits idle while the transfers take place. Another thing that can leave the CPU idle is exceeding your physical memory and spilling into virtual memory: the operating system gets involved, swapping data between disk and memory to give you the illusion of more main memory. Some opcode-level instructions also don't like it when the data they reference is beyond a certain physical distance; that can stall the pre-execution decoding the CPU does and flush opcodes that have already been decoded and queued in the execution pipeline. The same goes for generated code with an abundance of branches: branch prediction can falter and stall the CPU while it reloads the pipeline with opcodes from the target location.
Depends on what your computer is processing.
Are you experiencing slow response (stuttering pointer movement, keyboard lag)?
That's all I can think of off the top of my head; I'm sure there are more reasons for low CPU utilization.
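To put a concrete (and purely hypothetical) sketch behind that branch-prediction point: when the condition in a loop is effectively random, the branchy version tends to mispredict constantly and pay for pipeline flushes, while an arithmetically equivalent branchless version avoids them.

// Hypothetical illustration of the branch-prediction point above.
#include <cstdint>
#include <vector>

// Branchy: the predictor struggles when (x >= 128) is effectively random.
std::uint64_t sum_branchy(const std::vector<std::uint8_t>& v)
{
    std::uint64_t sum = 0;
    for (std::uint8_t x : v)
        if (x >= 128) sum += x;
    return sum;
}

// Branchless: the comparison becomes a 0/1 factor, so there is no branch to mispredict.
std::uint64_t sum_branchless(const std::vector<std::uint8_t>& v)
{
    std::uint64_t sum = 0;
    for (std::uint8_t x : v)
        sum += static_cast<std::uint64_t>(x) * (x >= 128);
    return sum;
}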
|
|
|
|
|
honey the codewitch wrote: Add 2 + 2 using two threads to divide the task between two cores. You can't.
What about?
1+1=A
1+1=B
A+B=Answer
(Seems like a joke but I worked with a VP that seemed to think throwing threads at a problem would always solve everything.)
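For what it's worth, here's what that tongue-in-cheek split might look like in C++ (a hypothetical sketch): it does produce 4, but the cost of spinning up two threads dwarfs the addition by orders of magnitude, which is rather the point of the original quip.

// Hypothetical sketch: the two halves really can run on separate threads,
// but launching them costs vastly more than just writing 2 + 2.
#include <future>
#include <iostream>

int main()
{
    auto a = std::async(std::launch::async, [] { return 1 + 1; });
    auto b = std::async(std::launch::async, [] { return 1 + 1; });
    std::cout << "Answer: " << a.get() + b.get() << '\n';   // 4, eventually
}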
|
|
|
|
|
Hehe
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
There's another weird bit where some chips can't run all cores at full speed. I think the point is to have some low-power efficiency cores to dedicate to certain kinds of tasks.
Not really sure how common that is; it's a fairly new thing.
|
|
|
|
|
Actually, it's getting more common. Both Intel and ARM use multiple classes of core in their CPUs.
Intel uses two, and calls them "P-cores" (performance cores) and "E-cores" (efficiency cores).
The reason is heat, die size, and power consumption versus usage habits.
The idea is that people don't use every core the same way. This way you have more powerful cores that kick in when needed, but you can run things off the E-core(s) most of the time.
ARM pioneered it** because phone advancements made it almost necessary. Intel caught on to what ARM was doing and was like "Excellent! I'll take four!"
** They may not technically have been the first - I don't know - but they're the first major modern CPU vendor I've seen do it.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
That all seems to mesh very well with my reality and understandings.
I've just not had the expendable moola to grab new silicon in a bit, and keeping Intel's offerings straight in one's head is an exercise in futility.
Nothing against AMD (though ARM, because of its sordid history with Windows, can kick rocks). I just like Intel because it's literally all I've ever had, and there's a degree of comfort/security there (likely a false sense).
I have built machines for others with AMD Ryzen though and they seem to have worked out just fine.
All this does seem to make the bit of calculating total processor usage a bit more of a complex algorithm though, if not a near-useless metric in context?
|
|
|
|
|
Most of those metrics are useless by themselves. As hardware gets more complicated, so do the numbers, and the circumstances we find those numbers presenting themselves in.
I've found if I want to bench a system, I find what other people are using to bench, and then I bench my own using a baseline. The ones I use right now are:
Running Cyberpunk Bench (DLSS and Raytracing benchmark)
Running TimeSpy (General DirectX 12 and CPU gaming perf bench)
And Cinebench R23 - for CPU performance
That won't tell you everything, and the first two of those benches are very gaming oriented and focus on GPU performance. What running them tells me is that my desktop and laptop are pretty comparable at the resolutions I play at on each, but my lappy slightly beats my desktop in multicore performance.
What I'd like is for other people to compile the same large C++ codebase I do on their machines, which would give me a nice real-world metric for the purpose I built this machine for (C++ compile times).
As it is, I would buy an AMD laptop (power efficiency), but Intel is my go-to for desktops at this point, primarily due to general-purpose single-core performance. My laptop is also an Intel, but if I bought again, I'd wait for the AMD version of it and get better battery life.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
I can tell you that if you aren't already, making sure all the related bits ride on an SSD would be one of the biggest things I can think of that might speed up linking.
|
|
|
|
|
That's why I run two Samsung 990 Pro NVMe drives - fastest on the market.
I also run my RAM at 6000MHz/CL32 - stock spec for Intel on DDR5 is like 4800 or something.
My CPU on this machine is an i5-13600K. I would have gone with the i9-13900K but I built this to be an air cooled system, and 250W was too rich for my blood - at least with this cooler - and this i5 is a sleeper with single core performance comparable to the i9. I have the i9-13900HX in my laptop - which is basically the same as the desktop version but retargeted to 180W instead of 250W.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
What you may want to try regarding the RAM...
It's a real PITA, because you'll lock up/bluescreen your machine, but rather than aiming for just the highest clock rate possible, try to tighten the latency timings, or look for sticks with the best latency timings you can find.
These things tend to be somewhat inversely related (clock speed vs. CAS latency and the other timings).
I won't be so upset that I just got a corptop with an i5 then (coming from an i7).
|
|
|
|
|
I don't play the silicon lottery, because I've lost too much time to intermittent RAM failures.
I use an XMP profile because the stick was tested at those timings, and CL32 is pretty tight.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
Even just different sticks might have better latency timings... and lowering, even underclocking, the RAM to get tight latencies can yield a noticeable framerate improvement in at least some games (it kinda depends what they have going on).
For some of the same reasons that's the case, I suspect it'd be true for linking too.
|
|
|
|
|
Like I said, I don't play the silicon lottery, as losing it is a giant time sink.
I run my RAM at what it was tested for at the factory. The XMP profile has it at 6000/CL32, and it's rock solid. I also know that, since it wasn't the fastest RAM on the market from that vendor when I bought it, it had already failed their faster tests.
So I'm not messing with the timings. Frankly, my time is too valuable to waste running down system errors due to memory corruption.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
I never played that either. But I have played at trying to extract what I can from whatever I get. Maybe we have different definitions of silicon lottery - I call it buying/returning chips till you get a good bin#.
> Frankly, my time is too valuable to waste
Gotta be your judgement call on that one... I looked for you, and it doesn't seem like you'll find much better than the stock XMP profile.
|
|
|
|
|
Yeah. I've been burned before with bad memory, so I guess I'm extra cautious these days. That was a week of hair-pulling (it was a bad stick, not a clocking issue, but the same kind of problem).
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
There have been some bad bugs in XMP/BIOS interactions where using XMP would force a lower voltage than what the RAM wanted. I wonder if you ran into that.
|
|
|
|
|
There are three different major resources in a computer system:
- CPU
- Memory
- Disk
If you have files open on a network, then the network will be a fourth major resource.
Each thread in the system can only be using one of these resources at a time, with multiple threads receiving time slices of that resource. So if a thread is waiting for the disk, it's not using the CPU. In fact, it's this very concept that allows virtualization to work at all. Personally, I used this concept to partition a task that was taking close to six hours to complete as a serialized task and finish it in about two hours.
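A hypothetical modern illustration of that partitioning idea (the original work described later in this thread was done with VAXBasic and DCL, not C++): while one thread is blocked waiting on the disk for the next chunk, another thread can keep the CPU busy processing the previous one.

// Hypothetical sketch: overlap the disk wait with CPU work on the prior chunk.
#include <fstream>
#include <future>
#include <string>
#include <vector>

std::vector<char> read_chunk(std::ifstream& in, std::size_t n)
{
    std::vector<char> buf(n);
    in.read(buf.data(), static_cast<std::streamsize>(n));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

std::size_t process(const std::vector<char>& chunk)   // stand-in for real CPU work
{
    std::size_t count = 0;
    for (char c : chunk) count += (c == '\n');
    return count;
}

std::size_t count_lines(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    constexpr std::size_t kChunk = 1 << 20;            // 1 MiB per read
    std::size_t total = 0;

    std::vector<char> current = read_chunk(in, kChunk);
    while (!current.empty()) {
        // Kick off the next disk read while we crunch the current chunk.
        auto next = std::async(std::launch::async, read_chunk, std::ref(in), kChunk);
        total += process(current);                      // CPU work overlaps the I/O wait
        current = next.get();
    }
    return total;
}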
|
|
|
|
|
obermd wrote: So if a thread is waiting for the disk, it's not using the CPU
Just to be difficult *cough* I/O Completion Ports *cough*
Seriously though, I'm being a bit pedantic, but on modern OSes and hardware so many things are asynchronous, without even using threads, that a single thread can potentially be using multiple resources virtually at the same time, and when it does enter a wait state, it will be woken by whichever of the resources it's waiting on becomes ready first.
That changes the calculus of what you say in terms of how it actually plays out, even if what you say is ... "basically" true. In essence, you're not wrong, but you're simplifying, maybe to a fault.
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
I'm simplifying based on what I had available (VAXBasic and Digital Command Language) to make the adjustments (6+ hours => 2 hours).
This simplification is also a good high-level view of what's available in a system and the realization that a thread can only do one item at a time. Threads that spawn off asynchronous tasks are still only doing one item at a time as the spawned task always executes on another thread. It may be that that thread is a hardware resource for IO, but it is still another thread and another task that attempts to use the same resource will have to wait.
IO Completion Ports use hardware signals that the OS monitors, allowing applications to offload the actual IO details to an OS-level thread. On commodity systems you can only have a small number of IOs occurring concurrently, and depending on the hardware and where the actual resource conflicts lie, that number can be one.
From a higher level, I wrote an SQL Server based application that had to back off on SQL errors of any sort (concurrency, timeout, etc.). The first fault (SQL error) split the task into 10 tasks to attempt concurrently. The second SQL error (double fault) used the .NET framework's ReaderWriterLock class to force a complete back-off, so the erroring SQL statement would be the only insert/update query executing on the server at that time. Yes, it was a huge performance penalty, but it was necessary to ensure data integrity. These threads normally set the lock to Read, but in the double-fault situation the thread that was going to attempt the final insert/update/delete set the lock to Write and waited for all the other readers to complete before attempting it. Of course, any new readers also waited for the Write lock to complete.
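The lock discipline described above was built on the .NET ReaderWriterLock; a rough C++ analogue (a hypothetical sketch, not the original code) using std::shared_mutex could look like this: normal work takes the shared lock so many operations run concurrently, and a double-faulted retry takes the exclusive lock so it runs alone.

// Hypothetical C++ analogue of the reader/writer back-off described above.
#include <mutex>
#include <shared_mutex>

class BackoffGate
{
    std::shared_mutex gate_;
public:
    template <typename Fn>
    void run_concurrent(Fn&& work)          // the common case: many at once
    {
        std::shared_lock lock(gate_);
        work();
    }

    template <typename Fn>
    void run_exclusive(Fn&& retry)          // double fault: wait for all readers, run alone
    {
        std::unique_lock lock(gate_);
        retry();
    }
};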
|
|
|
|
|
Do you know how thread waits are actually implemented in the kernel? Does the thread actually stop processing instructions - IOW, is the wait hardware based? I've heard the term "spinning" used in relation to threads, so I wonder if a waiting thread simply polls a value and, if it's not the value it wants, loops.
The difficult we do right away...
...the impossible takes slightly longer.
|
|
|
|
|
I mean, as far as I know, a kernel can wait and wake on a number of different conditions, some of them directly interrupt related - as in, your drive finishes fetching and triggers a CPU interrupt, which eventually wakes up a thread to dispatch the waiting data.
Another option is that the scheduler puts the waiting thread to sleep and awakens it on a software condition (such as a mutex being released) as opposed to an interrupt.
When a thread waits, the scheduler does not schedule it for execution. It is effectively "asleep", not spinning in a loop or anything. It does not poll. If it did, 3000 threads would quickly overwhelm the kernel.
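As a hypothetical sketch of the difference (standard C++, not tied to any particular kernel): the thread parked in cv.wait() below is descheduled and uses no CPU until it's notified, whereas a spin wait on a flag would stay runnable and burn a core the whole time.

// Hypothetical sketch: a blocking wait versus the spin wait the post rules out.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool ready = false;

int main()
{
    // Waiter: parked inside cv.wait(); it is not scheduled (and uses no CPU)
    // until someone calls notify_one/notify_all and the predicate is true.
    std::thread waiter([] {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return ready; });
        // ...woken here with the lock re-acquired...
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    {   // Signaler: set the condition, then wake the sleeping thread.
        std::lock_guard<std::mutex> lock(m);
        ready = true;
    }
    cv.notify_one();
    waiter.join();

    // The alternative - a spin wait - would look like
    //   while (!ready) { /* keep checking */ }
    // which stays runnable the whole time and pins a core while it polls.
}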
Check out my IoT graphics library here:
https://honeythecodewitch.com/gfx
And my IoT UI/User Experience library here:
https://honeythecodewitch.com/uix
|
|
|
|
|
Greeting Kind Regards
I have no experience with threads. May I please posit an inquiry as to their use for one of my current projects? In particular, I am currently running console-level test code which runs autonomously, needing no user interaction, and which is estimated to run for several days. It is of the form below.
test_0();
test_1();
test_2();
...
The calls do not depend on any that come before them and share no resource other than the CPU. They merely perform integer arithmetic. They are called in sequence just as shown; no call to test_x is performed until the one prior is complete. So my inquiry is: would the overall test complete sooner if each of the calls were run in its own thread?
Kind Regards
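A minimal sketch of what "each call in its own thread" could look like (hypothetical, with stub bodies standing in for the real tests): if the tests are genuinely independent, CPU-bound integer work and there are at least as many physical cores as tests, the wall-clock time approaches that of the slowest single test rather than the sum of all of them.

// Hypothetical sketch: run each independent test on its own thread.
#include <future>
#include <vector>

// Stand-ins for the long-running tests from the post above.
void test_0() { /* ... integer arithmetic ... */ }
void test_1() { /* ... integer arithmetic ... */ }
void test_2() { /* ... integer arithmetic ... */ }

int main()
{
    // std::launch::async forces each task onto its own thread.
    std::vector<std::future<void>> running;
    running.push_back(std::async(std::launch::async, test_0));
    running.push_back(std::async(std::launch::async, test_1));
    running.push_back(std::async(std::launch::async, test_2));

    for (auto& t : running)
        t.get();   // wait for all of them; also rethrows any exception a test threw
}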
|
|
|
|