What you say may be completely correct but I believe the problem goes deeper than one of cache coherency.
The issue, I think, has more to do with the language specification and the current state of the art in optimising compilers.
The C++ language specification does not, contrary to common assumption, require that statements be executed in the order they are written. In fact the compiler is free to rearrange statements at will, so long as the result is 'correct' according to the constraints that the specification does impose.
Authoring compiler software is one of the most difficult challenges in computing and authoring an optimising compiler even more so.
Given this, almost all the optimisers I'm aware of work linearly, and most work with a limited scope. In other words they operate within the scope of one function, or the functions in one file, and they optimise under the assumption of a single thread of execution.
In the case you mention this can lead to the two pieces of code being compiled to machine code equivalent to, for example:
x=0
read y
x=1
and
read x
y=0
y=1
Each code section on its own is still guaranteed to be correct and to produce the same result if it is the only piece of code executing. However, if multiple threads of execution are running amok through this code then the result will be highly dependent on the final linearised sequence of instructions that is executed. There are many possible such sequences, such as
x=0
read y
.
read x
y=0
.
x=1
.
y=1
(dots representing context switches)
While I can't immediately see one that would result in both x and y being 0, it would be hard to rule it out, given that there are sequences where at least one of x or y is indeterminate. This gets even worse on truly parallel hardware that can actually do
read y
and
y=0
at the same time. This gives many more possible outcomes, and many more of them are as good as random.
It may sound as though the situation is therefore hopeless and little or nothing can ever be guaranteed, but it is not so. The above code can very simply be made reliable and free of synchronisation trouble by not sharing x or y between threads, making it a design issue rather than a language or compiler problem.
When such sharing is necessary, it can only ever be made safe by using hardware facilities for atomic operations, which include memory fences, pipeline flushes and bus locks to ensure consistency at the hardware level. This is where caching may enter the picture, but it does not present an insoluble problem.
In the past such low-level atomic operations had to be provided by operating-system-specific libraries pre-built in assembly language, because the C++ language specification was independent of the availability of atomic operations at the hardware level and neither required nor relied on them. C++11, however, mandates a new atomic operations section of the standard library, effectively requiring underlying hardware locking support in order to use the full language.