What you say may be completely correct but I believe the problem goes deeper than one of cache coherency.
The issue, I think, has more to do with the language specification and the current state of the art in optimising compilers.
The C++ language specification does not, contrary to common assumption, require that statements be executed in the order they are written. In fact the compiler is free to rearrange statements at will, so long as the result is 'correct' according to the constraints that the specification does impose.
Authoring compiler software is one of the most difficult challenges in computing and authoring an optimising compiler even more so.
Given this, almost all the optimisers I'm aware of work linearly, and most work with a limited scope. In other words they operate within the scope of one function, or the functions in one file, and they optimise under the assumption of a single thread of execution.
In the case you mention this can lead to the two pieces of code being compiled to machine code equivalent to, for example:
x=0
read y
x=1
and
read x
y=0
y=1
Each code section on its own is still guaranteed to be correct and to produce the same result if it is the only piece of code executing. However, if multiple threads of execution are running amok through this code then the result will be highly dependent on the final linearised sequence of instructions that is executed. There are many possible such sequences, such as
x=0
read y
.
read x
y=0
.
x=1
.
y=1
(dots representing context switches)
While I can't immediately see one that would result in both x and y being 0, it would be hard to rule it out, given that there are sequences where at least one of x or y is indeterminate. This gets even worse on truly parallel hardware that can actually do
read y
and
y=0
at the same time. This gives many more possible outcomes, and many more of them are as good as random.
It may sound as though the situation is therefore hopeless and little or nothing can ever be guaranteed, but it is not so. The above code can very simply be made reliable and free of synchronisation trouble by not sharing x or y between threads, making it a design issue rather than a language or compiler problem.
When such sharing is necessary, it can only ever be made safe by using hardware facilities for atomic operations, which include memory fences, pipeline flushes and bus locks to ensure consistency at the hardware level. This is where caching may enter the picture, but it does not present an insoluble problem.
In the past such low-level atomic operations had to be provided by operating-system-specific libraries pre-built in assembly language, because the C++ language specification was independent of the availability of atomic operations at the hardware level and neither required nor relied on them. C++11, however, mandates a new atomic operations section of the standard library, effectively requiring underlying hardware locking support in order to use the full language.