
I'm running Windows 10 (1607) on an Intel Xeon E3-1231v3 CPU (Haswell, 4 physical cores, 8 logical cores).

When I first had Windows 7 installed on this machine, I could observe that four out of eight logical cores were parked until an application needed more than four threads. One can check with the Windows Resource Monitor whether cores are parked or not (example). As far as I understand, this is an important technique to keep threads balanced across physical cores, as explained on the Microsoft website: "the Core Parking algorithm and infrastructure is also used to balance processor performance between logical processors on Windows 7 client systems with processors that include Intel Hyper-Threading Technology."

However, after upgrading to Windows 10, I noticed that there is no core parking anymore. All logical cores are active all the time, and when you run an application using fewer than four threads, you can see how the scheduler distributes its threads equally across all logical CPU cores. Microsoft employees have confirmed that core parking is disabled in Windows 10.
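
One way to observe this yourself is to sample GetCurrentProcessorNumber() from a few busy threads; here is a minimal sketch (my addition, just a way to reproduce the observation; the thread count of four matches my physical cores):

#include <Windows.h>
#include <stdio.h>
#include <omp.h>

int main() {
  // Spin four busy threads and print which logical processor each one
  // currently runs on. With a Hyper-Threading-aware scheduler one would
  // expect them to stay on distinct physical cores (every other LP index).
  #pragma omp parallel num_threads(4)
  {
    for (int i = 0; i < 5; i++) {
      volatile double x = 0;                        // busy work
      for (int j = 0; j < 10000000; j++) x += j;
      printf("thread %d on logical processor %u\n",
             omp_get_thread_num(), (unsigned)GetCurrentProcessorNumber());
      Sleep(100);
    }
  }
  return 0;
}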

But I wonder why. What was the reason for this? Is there a replacement, and if so, what does it look like? Has Microsoft implemented a new scheduler strategy that made core parking obsolete?


Appendix:

Here is an example of how the core parking introduced in Windows 7 can benefit performance (in comparison to Vista, which didn't have the core parking feature yet). What you can see is that on Vista, HT (Hyper-Threading) harms performance, while on Windows 7 it doesn't:

[Two benchmark charts: performance with and without Hyper-Threading, on Vista and on Windows 7]

(source)

I tried to enable Core Parking as mentioned here, but what I observed was that the Core Parking algorithm isn't Hyper-Threading-aware anymore. It parked cores 4, 5, 6, 7, while it should have parked cores 1, 3, 5, 7 to avoid threads being assigned to the same physical core: Windows enumerates cores in such a way that two successive indices belong to the same physical core. Very strange. It seems Microsoft has messed this up fundamentally. And no one noticed...
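
You can verify this enumeration on your own machine by querying the processor topology; a minimal sketch (my addition, using the documented GetLogicalProcessorInformation API):

#include <Windows.h>
#include <stdio.h>
#include <vector>

int main() {
  // Ask Windows for the topology, then print which logical processor
  // indices share each physical core.
  DWORD len = 0;
  GetLogicalProcessorInformation(NULL, &len);  // query required buffer size
  std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
      len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
  if (!GetLogicalProcessorInformation(info.data(), &len)) return 1;
  for (const auto& e : info) {
    if (e.Relationship != RelationProcessorCore) continue;
    printf("physical core: logical processors");
    for (int lp = 0; lp < 64; lp++)
      if (e.ProcessorMask & (1ULL << lp)) printf(" %d", lp);
    printf("\n");
  }
  return 0;
}

With the enumeration described above, this should print pairs like "0 1", "2 3", "4 5", "6 7".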

Furthermore, I did some CPU benchmarks using exactly 4 threads.

CPU affinity set to all cores (Windows default):

Average running time: 17.094498, standard deviation: 2.472625

CPU affinity set to every other core (so that it runs on different physical cores, best possible scheduling):

Average running time: 15.014045, standard deviation: 1.302473

CPU affinity set to the worst possible scheduling (four logical cores on two physical cores):

Average running time: 20.811493, standard deviation: 1.405621

So there is a performance difference. And you can see that the Windows default scheduling ranks between the best and worst possible scheduling, as we would expect from a non-hyperthreading-aware scheduler. However, as pointed out in the comments, there may be other causes responsible for this, like fewer context switches, interference by monitoring applications, etc. So we still don't have a definitive answer here.
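
For reference, the three configurations can also be set programmatically instead of through Task Manager during the pause; a sketch with hard-coded masks for my 8-LP machine (assuming the enumeration described above, where logical processors 2k and 2k+1 share a physical core):

#include <Windows.h>

int main() {
  // 0xFF: all eight logical processors (Windows default)
  // 0x55: LPs 0,2,4,6 -> one LP per physical core (best case)
  // 0x0F: LPs 0,1,2,3 -> four LPs on two physical cores (worst case)
  SetProcessAffinityMask(GetCurrentProcess(), 0x55);
  // ... run the benchmark below from here ...
  return 0;
}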

Source code for my benchmark:

#include <stdlib.h>
#include <stdio.h>   // for printf
#include <Windows.h>
#include <math.h>

double runBenchmark(int num_cores) {
  int size = 1000;
  double** source = new double*[size];
  for (int x = 0; x < size; x++) {
    source[x] = new double[size];
  }
  double** target = new double*[size * 2];
  for (int x = 0; x < size * 2; x++) {
    target[x] = new double[size * 2];
  }
  // Fill the source grid with random data. (rand() is not thread-safe,
  // but the exact values don't matter for this benchmark.)
  #pragma omp parallel for num_threads(num_cores)
  for (int x = 0; x < size; x++) {
    for (int y = 0; y < size; y++) {
      source[y][x] = rand();
    }
  }
  // Average each 2x2 block of source into the (sparsely filled) target.
  #pragma omp parallel for num_threads(num_cores)
  for (int x = 0; x < size-1; x++) {
    for (int y = 0; y < size-1; y++) {
      target[x * 2][y * 2] = 0.25 * (source[x][y] + source[x + 1][y] + source[x][y + 1] + source[x + 1][y + 1]);
    }
  }
  double result = target[rand() % size][rand() % size];  // keep a value alive so the work isn't optimized away
  for (int x = 0; x < size * 2; x++) delete[] target[x];
  for (int x = 0; x < size; x++) delete[] source[x];
  delete[] target;
  delete[] source;
  return result;
}

int main(int argc, char** argv)
{
  int num_cores = 4;
  system("pause");  // So we can set cpu affinity before the benchmark starts 
  const int iters = 1000;
  double avgElapsedTime = 0.0;
  double elapsedTimes[iters];
  for (int i = 0; i < iters; i++) {
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&t1);
    runBenchmark(num_cores);
    QueryPerformanceCounter(&t2);
    elapsedTimes[i] = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    avgElapsedTime += elapsedTimes[i];
  }
  avgElapsedTime = avgElapsedTime / iters;
  double stddev = 0;
  for (int i = 0; i < iters; i++) {
    stddev += (elapsedTimes[i] - avgElapsedTime) * (elapsedTimes[i] - avgElapsedTime);
  }
  stddev = sqrt(stddev / iters);  // population standard deviation
  printf("Average running time: %f, standard deviation: %f\n", avgElapsedTime, stddev);
  return 0;
}
manuel
  • Hold on, "four out of eight cores"? You just mentioned yourself that your CPU has only 4 cores and 8 _threads_. Does "core parking" actually apply here? – u1686_grawity Mar 15 '17 at 15:02
  • @grawity In theory core parking was meant to allow the CPU to park one half of the hyperthreaded core. My understanding was that the per-core cache was split between the two logical (hyperthreaded) cores, and with one parked it could be released, essentially allowing a single core access to the full cache connected to it. It could provide a small boost for low-thread-count tasks while still having the ability to reinstate hyperthreading and make full use of the core in multithreaded tasks. – Mokubai Mar 15 '17 at 15:33
  • @grawity: Sorry for the confusion. I meant eight "logical cores", i.e. it's only a virtual core. – manuel Mar 15 '17 at 15:40
  • @Mokubai Intel CPUs can run two threads on a single CPU core, resulting in a performance boost of around 30%. However, now assume you have an application with two active threads. You want them to run on two different physical cores, which ideally doubles performance compared to a single thread. But if they run on the same physical core with hyper-threading, they only run 30% faster than a single thread. That's where core parking comes in and avoids the latter situation. – manuel Mar 15 '17 at 15:41
  • AMD Zen CPUs have a similar problem, but AMD recently released a statement that basically indicated that, based on all the investigations they performed, the problem was the utility doing the performance checks, not Windows itself. So before you blame Microsoft for anything, verify that the tools are not to blame. – Ramhound Mar 15 '17 at 16:18
  • @manuel I know. But parking the core also frees up some cache and can be an additional bonus over a hyperthreaded core with a cache that has been split. – Mokubai Mar 15 '17 at 16:27
  • @Ramhound: I just tested my hypothesis with a benchmark. – manuel Mar 15 '17 at 17:25
  • Core parking refers to an entire physical _core_, not to individual logical processors. (With HT enabled you have two LPs per core; with HT disabled you have just one.) Terms such as "virtual core" and "logical core" should be avoided as they are misleading. Core parking is only ever done if _both_ LPs in a core have been idle for a period of time. – Jamie Hanrahan Mar 15 '17 at 17:57
  • @Jamie Hanrahan: Wrong. Read [this](https://msdn.microsoft.com/en-us/library/windows/hardware/dn613899(v=vs.85).aspx) where it says: _the Core Parking algorithm and infrastructure is also used to balance processor performance between logical processors on Windows 7 client systems with processors that include Intel Hyper-Threading Technology._ – manuel Mar 15 '17 at 18:03
  • That's not how it's implemented in the scheduler. See the description in _Windows Internals_ by Solomon, Russinovich, and Ionescu. "Core parking" refers to literally powering down a core. You can't do that at the LP level. – Jamie Hanrahan Mar 15 '17 at 18:04
  • It is still true that the scheduler prefers to use LPs such that only one LP is used per core. However, thread scheduling is extremely dynamic, and your test application is very unlikely to be the only thing that wants to run at any given moment. So the fact that your test app has only _nCores_ threads, but you sometimes see two LPs in a core both running your app's threads, does not mean that this isn't working. Also remember that running Task Manager or nearly any other monitoring program requires running at least one thread! Just looking at the system changes what it's doing. – Jamie Hanrahan Mar 15 '17 at 18:12
  • @JamieHanrahan: Sorry, but your statement about core parking is still not correct. I looked it up (Windows Internals sixth edition, 2nd part, p. 108) where it says: _"Core Parking Policies [...] and, most importantly, always leaves one thread in an SMT package unparked—in other words, it is responsible for essentially disabling the Hyper-Threading feature found on Intel CPUs until load warrants it."_ – manuel Mar 15 '17 at 18:35
  • You're reading too much into it. Yes, the code that does that is intertwined with the "core parking" code, but trying to use only one LP in a core is not "core parking". You flatly cannot "park" one LP out of a core. An LP is not a physically separate set of logic that _can_ be parked; it is 99% an abstraction presented by the microarchitecture. Note that, per the MSDN article _you_ linked, core parking doesn't happen in Windows 7, only Server 2008... but one-LP-at-a-time does happen in Windows 7. Therefore using only one-LP per core at a time is not "core parking". It's just not-using an LP. – Jamie Hanrahan Mar 15 '17 at 19:07
  • Well, that's just a technicality. And it doesn't explain why I could _see_ the core-parking-like feature working in Windows 7 through Task Manager, Resource Monitor and so on, in contrast to Windows 10, where process scheduling has changed and this hyper-threading-friendly, core-parking-like scheduling is _not_ in use any more. Maybe we'll have to wait until the seventh edition of Windows Internals is released, but I am really curious to see what exactly Microsoft has changed _or_ whether it's really the case that the old Vista scheduling is back, hurting performance for <= 4 thread applications. – manuel Mar 15 '17 at 21:20
  • @manuel There have been vast improvements between Windows 7 and Windows 10 – Ramhound Mar 15 '17 at 22:54
  • I don't know what you're seeing, @manuel but I see one-LP-per-core happening all the time under Windows 10. I think your system is just busier than you think it is, so the scheduler has little choice but to use more than one LP in some of the cores. If you really hate that behavior, just turn off hyperthreading in your firmware settings. – Jamie Hanrahan Mar 21 '17 at 11:39
  • @JamieHanrahan But that's not what I see. And it's confirmed by various reports on the internet (see for example [this](https://social.technet.microsoft.com/Forums/en-US/8d6085d8-2c26-426c-87ac-2ba189b77aa5/core-parking-not-working-after-upgrade?forum=win10itprohardware)) Maybe you have an AMD CPU because in this case, Windows 10 still uses Core Parking. – manuel Apr 12 '17 at 23:03
  • No, it's Intel. I repeat: You can't go by what you see in Task Manager. I believe that what has happened in Windows 10 is that the scheduler doesn't hold off for as long a time before running a thread on a core that already has a thread running on it. – Jamie Hanrahan Apr 13 '17 at 04:48
  • I believe that what you are seeing is an optical illusion. I have checked the situation on an 8-core computer with Hyper Threading (16 logical cores), and I find it very hard to say how many cores are parked and which ones. It seems like the Windows 10 scheduler parks and unparks cores at high rate, so what I can see in Resource Monitor is the "Parked" state steady on some cores but flickering frequently on others. Frankly, I cannot manage to say how many parked cores I have at any given time, and I also doubt that the display of Resource Monitor can keep up with it. – harrymc Apr 13 '17 at 10:57
  • @harrymc I don't see the "parked" label in Resource Monitor. Not at all. Even if the CPU is completely idle with <5% utilization. So I don't think your theory holds; otherwise the scheduler must be so stupid that it frequently unparks cores even if my computer is completely idle, which would be flawed behavior, too. – manuel Apr 13 '17 at 14:01
  • @manuel: "flawed behavior" to you, but there might be some logic behind it. For example, dividing the load between the cores so as not to always overwork one core. Windows kernel developers certainly know Intel hardware better than you or me. Verification: In the registry key `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power\PowerSettings\54533251-82be-4824-96c1-47b60b740d00\0cc5b647-c1df-4637-891a-dec35c318583`, check if `ValueMax=100` and `ValueMin=0`. Reboot if you change anything. – harrymc Apr 13 '17 at 15:10
  • This doesn't answer your question, but my bet is that Microsoft doesn't care for something as "legacy" as Haswell technology. Intel's Skylake introduced something called Speed Shift which, among other things, makes OS-controlled core parking redundant. The latest build of Windows 10 supports Speed Shift very well. Neither Intel nor Microsoft gives much technical detail, but I think it's safe to assume Microsoft disabled core parking in Win 10 because the kernel's architecture and optimisations, which favour new CPUs, have made it incompatible with the now-legacy core parking paradigm. – misha256 Apr 19 '17 at 20:28
  • Maybe you can enable it?...http://www.thewindowsclub.com/enable-disable-core-parking-windows – Moab Jun 21 '17 at 01:42

1 Answer


Huh, I could tell you the story but you are going to hate it and I'm going to hate writing it :-)

Short version - Win10 screwed up everything it could and is in a perpetual state of starving cores due to a systemic problem known as CPU oversubscription (way too many threads, no one can ever service them, something is choking at any point, forever). That's why it desperately needs these fake CPUs, shortens the base scheduler timer to 1 ms, and can't let you park anything. It would just scorch the system. Open Process Explorer and add up the number of threads, now do the math :-)

The CPU Sets API was introduced to give at least some fighting chance to those who know about it and have the time to write the code to wrestle the beast. You can de-facto park fake CPUs by putting them in a CPU set that you won't give to anyone, and create a default set to throw to the piranhas. But you can't do it on client SKUs (you could technically, it's just not going to be honored) since the kernel would go into a panic state and either totally ignore CPU sets or other things would start crashing. It has to defend the system's integrity at any cost.
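
For illustration, restricting a process to one logical processor per core with the CPU Sets API might look roughly like this (a sketch only; it assumes the enumeration lists both LPs of a core adjacently, and it is not a verified recipe for client SKUs):

#include <Windows.h>
#include <stdio.h>
#include <vector>

int main() {
  // Enumerate the system's CPU sets (on Win10, one per logical processor).
  ULONG len = 0;
  GetSystemCpuSetInformation(NULL, 0, &len, GetCurrentProcess(), 0);
  std::vector<char> buf(len);
  if (!GetSystemCpuSetInformation((PSYSTEM_CPU_SET_INFORMATION)buf.data(),
                                  len, &len, GetCurrentProcess(), 0))
    return 1;

  // Keep only the first logical processor of each physical core.
  std::vector<ULONG> ids;
  int lastCore = -1;
  for (ULONG off = 0; off < len;) {
    auto* e = (PSYSTEM_CPU_SET_INFORMATION)(buf.data() + off);
    if (e->Type == CpuSetInformation && e->CpuSet.CoreIndex != lastCore) {
      ids.push_back(e->CpuSet.Id);
      lastCore = e->CpuSet.CoreIndex;
    }
    off += e->Size;
  }

  // From now on, threads of this process default to one LP per core; the
  // sibling LPs are de-facto parked as far as this process is concerned.
  if (!SetProcessDefaultCpuSets(GetCurrentProcess(), ids.data(),
                                (ULONG)ids.size()))
    return 1;
  printf("restricted process to %u CPU sets\n", (unsigned)ids.size());
  return 0;
}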

The whole state of affairs is by and large taboo, since it would require major rewrites, everyone culling the number of frivolous threads, and admitting that they messed up. Hyper-threads actually have to be permanently disabled (they heat up cores under real load, degrade performance, and destabilize HTM - the principal reason why it never became mainstream). Big SQL Server shops are doing it as a first setup step, and so is Azure. Bing is not; they run servers with a de-facto client setup, since they'd need many more cores to dare to switch. The problem percolated into Server 2016.

SQL Server is the sole real user of CPU Sets (as usual :-) - 99% of the performance-advanced things in Windows have always been done just for SQL Server, starting with the super-efficient memory-mapped file handling that kills people coming from Linux, since they assume different semantics.

To play with this safely you'd need 16 cores minimum for a client box, 32 for a server (that actually does something real :-). You have to put at least 4 cores in the default set so that the kernel and system services can barely breathe, but that's still just the equivalent of a dual-core laptop (you still have perpetual choking), meaning 6-8 cores to let the system breathe properly.

Win10 needs 4 cores and 16 GB to just barely breathe. Laptops get away with 2 cores and 2 fake "CPUs" if there's nothing demanding to do, since their usual work distribution is such that there are always enough things that have to wait anyway (the long queue on memory allocation "helps" a lot :-).

This is still not going to help you with OpenMP (or any automatic parallelization) unless you have a way of telling it explicitly to use your CPU set (individual threads have to be assigned to the CPU set) and nothing else. You still need the process affinity set as well; it's a precondition for CPU sets.
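
A rough sketch of that per-thread assignment, reusing the ids array from the sketch above (each OpenMP worker pins itself once; OpenMP runtimes typically reuse the same thread pool, so the assignment should hold for later parallel loops):

#include <Windows.h>
#include <omp.h>

// 'ids'/'count' are the CPU set IDs chosen in the previous sketch.
void pinnedRegion(const ULONG* ids, ULONG count) {
  #pragma omp parallel
  {
    // Each worker thread assigns itself to the chosen CPU sets.
    SetThreadSelectedCpuSets(GetCurrentThread(), ids, count);
    #pragma omp for
    for (int i = 0; i < 1000; i++) {
      // ... actual work ...
    }
  }
}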

Server 2k8 was the last good one (yes, that means Win7 as well :-). People were bulk-loading a TB in 10 minutes with it and SQL Server. Now people brag if they can load it in one hour - under Linux :-) So chances are that the state of affairs is not much better "over there" either. Linux had CPU sets way before Windows.

ZXX
  • This "answer" has very little to do with reality. Just for starters, the number of total threads in the system has nothing to do with CPU "oversubscription"; most threads spend the vast majority of their time NOT wanting to run ("waiting" for IO or similar). – Jamie Hanrahan Dec 20 '18 at 01:16
  • Also, the scheduler timer - that is, the interval by which the thread "quantum" (timeslice) is reckoned - is still 15.625 msec (64 Hz). The default interval timer is still this. It may be changed programmatically to as little as half a millisecond on current platforms, and some multimedia apps and other apps that need high-precision timing do this - but thread timeslices are still multiples of 15.625 msec. – Jamie Hanrahan Dec 20 '18 at 01:56