| Doc. No.: | WG21/N3710 |
|---|---|
| Date: | 2013-08-29 |
| Reply to: | Hans-J. Boehm |
| Other contributors: | Mark Batty, Brian Demsky, Olivier Giroux, Paul McKenney, Peter Sewell, Francesco Zappa Nardelli, ... |
| Phone: | +1-650-857-3406 |
| Email: | [email protected] |
The C++11 memory model specified the semantics of shared ordinary and atomic objects. There is one small part of this specification that we always recognized was dubious. The specification in 1.10 of weakly ordered atomics allows an atomic load to essentially see, or not see, the result of any racing store. In the case of explicit very weak memory ordering specifications, this can lead to "causal cycles" and highly counterintuitive results. Consider the following example, where x and y are atomic variables initialized to zero, and ri are local variables:
```cpp
// Thread 1:
r1 = x.load(memory_order_relaxed);
y.store(r1, memory_order_relaxed);

// Thread 2:
r2 = y.load(memory_order_relaxed);
x.store(r2, memory_order_relaxed);
```
Effectively Thread 1 copies x to y, and Thread 2 copies y to x. The specification in 1.10 allows each load to see either the initializing store of zero, or the store in the other thread. This famously allows both r1 and r2 to have final values of 42, or any other "out of thin air" value. This occurs if each load sees the store in the other thread. It effectively models an execution in which the compiler speculates that both atomic variables will have a value of 42, speculatively stores the resulting values, and then performs the loads to confirm that the speculation was correct and nothing needs to be undone.
No known implementations actually produce such results. However, it is extraordinarily hard to write specifications that prevent them without also preventing legitimate compiler and hardware optimizations. As a first indication of the complexity, note that the following variation of the preceding example should ideally allow x = y = 42, and some existing implementations can produce such a result:
```cpp
// Thread 1:
r1 = x.load(memory_order_relaxed);
y.store(r1, memory_order_relaxed);

// Thread 2:
r2 = y.load(memory_order_relaxed);
x.store(42, memory_order_relaxed);
```
In this case, the load in each thread actually can see the store in the other thread, without problems. The two operations in thread 2 are independent and unordered, so either the compiler or hardware can reorder them.
Essentially the same issue has been open in the Java specification for about 10 years. The major advantage that we have in C++ is that the problem is confined to non-synchronizing atomics, i.e. memory_order_relaxed, and some memory_order_consume uses (or read-modify-write operations that effectively weaken the ordering on either the read or write to one of those). Many of us expect those to be rarely used.
The current standard addresses this in 29.3p9-11, using words that none of us ever considered rock-solid. (A 2007 or so discussion leading up to the current wording can be found here.) Unfortunately, we underestimated both the impact of this issue, and the problems with the current wording. In the meantime, a number of people, including Peter Sewell, Mark Batty, and more recently Brian Demsky have pointed out issues with the current wording.
This proposal is an attempt to repair the specification, at non-negligible cost relative to our original intent, though not so when compared to the literal text of the current document. This change would effectively imply stronger semantics for memory_order_relaxed, which would imply additional overhead for memory_order_relaxed on a number of architectures, including ARM, current Nvidia architectures, and possibly POWER. Our proposal effectively conflates the above two examples, forbidding the x = y = 42 outcome in both, in order to give us cleanly definable semantics.
Such restrictions on memory_order_relaxed are however not unprecedented; the current cache coherence requirements already imply that ordinary loads and stores do not suffice for memory_order_relaxed on Itanium. (See Jaroslav Sevcik and Peter Sewell's table of C++11 atomic mappings.) Indeed, there is no a priori reason to believe that load and store instructions designed for regular non-racing accesses should be directly suitable for atomic access.
We try to mitigate the required performance impact on these architectures by weakening the intended semantics from a hard requirement to normative encouragement. The intent is that this should give hardware vendors the time required to better support the revised semantics.
We also propose to provide an alternative to memory_order_relaxed
accesses for non-racing contexts, in which both the existing and proposed
requirements are too strong.
We reiterate that this added overhead is relative to the intent of the C++11 standard, not relative to the actual specification. The actual specification, as currently worded, seems to unintentionally require memory_order_relaxed to be implemented as something close to memory_order_seq_cst, and thus technically imposes a much larger penalty, as pointed out in LWG2265.
This proposal benefitted from a lengthy email discussion involving participants listed at the beginning of this paper, and others. This provided very useful input resulting in improvements to this proposal. It unfortunately resulted in neither a clear consensus in that group, nor an alternate proposal with clearly better consensus behind it. Not all of the listed contributors necessarily support this proposal. Since I don't believe there is support for maintaining the status quo, I nonetheless bring forward this proposal. Mark Batty and Peter Sewell are currently considering submitting additional late papers on this topic, including an alternate proposal.
In a Java context, it is necessary to prohibit out-of-thin-air results to be able to show that untrusted code cannot obtain access to certain information. If untrusted code could somehow provoke the implementation into generating the user's secret password "out of thin air", that would be bad. Since C++ does not attempt to support running untrusted code as part of a trusted executable, that's much less of an issue here.
However, as Brian Demsky and others have pointed out, allowing out-of-thin-air results also makes it very hard to reason formally, or perhaps even informally, about ordinary C++ code.
To see the problem, consider a program P that only ever assigns 0 or 1 to the atomic variables x and y, though it does sometimes copy from one to the other using weakly ordered operations. Assume we have a library function f() whose specification requires that its first argument be 0 or 1, and says that if f() is called correctly it doesn't modify either global variables or its second by-reference argument. Let's say this program calls f(x.load(memory_order_relaxed), y). We would expect this to be sufficient to prove that the first argument to f is always 0 or 1. But without an out-of-thin-air guarantee it is not.
If thread 1 copies y to x using memory_order_relaxed operations, while thread 2 calls f(x.load(memory_order_relaxed), y), we still need to preclude cases like the one in which thread 1 sees y = 42, copies 42 to x, causing thread 2 to pass 42 to f(), which, now operating outside its specification, stores 42 in y, thus justifying thread 1's load of 42.
This kind of argument is clearly impossible in any realistic setting in which f() is a library function for which we have no idea how it behaves if its preconditions are not met. This need to reason about such "impossible" paths is what makes it untenable to allow any kind of out-of-thin-air results.
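To make the scenario concrete, here is a minimal sketch of the situation just described; the declaration of f() and the thread functions are hypothetical and stand in for any separately compiled library code:

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};   // the program only ever stores 0 or 1 here

// Library function: requires that first be 0 or 1; its specification says
// that, when called correctly, it modifies neither globals nor second.
void f(int first, std::atomic<int>& second);

void thread1() {               // copies y to x with relaxed operations
    int r1 = y.load(std::memory_order_relaxed);
    x.store(r1, std::memory_order_relaxed);
}

void thread2() {
    // Without an out-of-thin-air guarantee, we cannot rule out the first
    // argument being 42: a speculative 42 in x could be "justified" by
    // whatever f() happens to do when called outside its specification.
    f(x.load(std::memory_order_relaxed), y);
}
```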
As the above illustrates, out-of-thin-air values occur when there are cyclic "dependency chains" between values. Thus if we could define "dependency chains", we could define out-of-thin-air values, prohibit them, and we would be done.
This is essentially the approach taken by hardware memory models. They prohibit stores from becoming visible to other threads before the loads on which they depend, and this is sufficient to prevent out-of-thin-air values at the hardware level. At the hardware level, it is relatively straightforward to define a notion of one value "depending" on another; the approach is similar in spirit to that taken by our definition of "carries a dependency" in 1.10 [intro.multithread]. Essentially anytime a is an argument to an operation that computes b, b is taken to depend on a, and similarly if b is computed by a chain of such operations. Unlike our "carries a dependency" formulation, hardware models usually also consider b to depend on a if a branch preceding the computation of b depended on a. (Itanium is an exception, but not in a way that matters here.)
Hardware memory models then implicitly outlaw out-of-thin-air values by requiring that dependent operations become visible to other threads in order. Since the hardware notion of dependency covers all cases in which one value can affect another, this suffices to solve the problem.
Our existing "carries a dependency" definition does not similarly suffice,
since it covers only a subset of the ways in which one value can affect
another. For example, an evaluation inside the condition of an if-statement
does not carry a dependency to evaluations in then- or else-statements
even though it affects its values. Even if we ensured ordering whenever
a carries a dependency to b, we could not preclude the
problem case from the last section, since the argument evaluation of
x.load(...)
may not carry a dependency to the store to
y
inside f()
, for example because the assignment
to y
assigns a constant to it, but the code to do so is inside
a conditional whose condition is false if the first argument to
f()
is less than 2.
To address this problem as the hardware does, we would have to generalize "carries a dependency" to include all methods by which one value can affect another, including control dependencies, and then insist that all such dependencies enforce ordering. This appears impractical, since compilers routinely remove such dependencies as part of optimization. Consider the code segment
```cpp
if (x) {
    a[i++] = 1;
} else {
    a[i++] = 2;
}
```
Any reasonable compiler will move the computation of &(a[i++]) out of the conditional, thereby removing the dependency of i on x.
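For instance, a compiler might rewrite the fragment roughly as follows (a sketch of a typical transformation, not the output of any particular compiler):

```cpp
// The address computation and the increment of i no longer depend on x;
// only the constant being stored is selected by the condition.
int* p = &a[i++];
*p = x ? 1 : 2;
```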
Since relevant dependencies may pass through separately compiled functions
that don't mention atomics, we would have to prohibit such optimization
in all parts of the program, at a cost that I would expect to be much higher
than our proposal below. We would also have to require recompilation of
existing code not mentioning atomics to get the right semantics for atomics.
The fact that compilation can remove such syntactic dependencies, thus hiding them from the hardware, makes this problem fundamentally different at the programming language level from the hardware level.
Note that such elimination of dependencies does not actually introduce out-of-thin-air results. It is simply at odds with this specification technique for prohibiting them. The elimination of the syntactic dependency of i on x simply reflects the fact that the value of i doesn't really change as a result of a change of x. Thus we can try to define a more "dynamic" notion of dependency that reflects this fact.
But that seems to be a fundamentally undefinable notion, and also not one that has been attempted by any hardware specifications. To illustrate the difficulty, consider the function definition:
```cpp
int f(int x, int y) { return x - y; }
```
Does the result "dynamically depend on" the input x? If f() is only called in contexts in which the arguments are equal, and the compiler determines that, e.g. by inlining f(), it may not. Even if that's not always the case, the compiler may effectively clone a copy of f(), typically again by inlining, to handle the one call in which the arguments are equal. Mark Batty and Peter Sewell point out that such a notion is fundamentally not definable by looking at just a single execution. Thus, at a minimum, we are again in a position in which we need to understand other hypothetical executions in order to understand program behavior on a given input. Even then, it is not clear that any such definition could correctly deal with clones or specializations of functions.
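As an illustration, consider two hypothetical call sites and one plausible inlining outcome:

```cpp
int f(int x, int y) { return x - y; }

int g(int a) {
    // After inlining, the compiler can fold f(a, a) to the constant 0,
    // so the result no longer depends on a in any sense the hardware
    // could be asked to preserve.
    return f(a, a);     // effectively becomes: return 0;
}

int h(int a, int b) {
    return f(a, b);     // here the result still depends on both inputs
}
```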
Informally it seems clear that defining the correct notion of "dependency" is equivalent to defining "out-of-thin-air", and it appears that neither is possible. Thus we are willing to propose a fairly drastic, of necessity long term, direction in order to sidestep the problem.
Brian Demsky and others point out in LWG2265 that the current wording intended to avoid "out-of-thin-air" values is practically unimplementable on a number of important weakly-ordered architectures. The current wording essentially requires that memory_order_relaxed be implemented as something much too close to memory_order_seq_cst, effectively defeating the point of introducing weakly ordered memory operations. This is an accident arising from our earlier efforts to at least partially address the "out-of-thin-air" problem.
The particular example in LWG2265 seems to be difficult to analyze. But there seems to be agreement that the following (somewhat abbreviated) variant of Brian Demsky's example, when compiled normally on POWER or ARM, can cause the T1 and T2 stores to be seen in different orders by T3 and T4 (as intended), in spite of that being accidentally disallowed by 29.3p9 [atomics.order], which requires a "finite sequence of program evaluations" to produce the value stored at line 32:
```
 1  atomic_int x[2];
 2  atomic_int z;
 3  int c1, c2;
 4  int d1, d2;
 5
 6  void T1() {
 7      x[0].store(1, relaxed);
 8  }
 9
10  void T2() {
11      x[1].store(1, relaxed);
12  }
13
14  void T3() {
15      c1 = x[0].load(relaxed, consume or acquire);
16      c2 = x[c1].load(relaxed);
17  }
18
19  void T4() {
20      d1 = x[1].load(relaxed, consume or acquire);
21      d2 = x[d1^1].load(relaxed);
22  }
23
24  int user_main(int argc, char **argv) {
25      x[0] = 0;
26      x[1] = 0;
27
28      { T1() || T2() || T3() || T4() }  // run all in parallel
29
30      /* Can this store write 200 (i.e., c1 = d1 = 1, c2 = d2 = 0)? */
31      /* If so, what is the evaluation sequence that justifies it? */
32      z.store((c1+d1)*100 + c2+d2);
33  }
```
To produce this result, the required sequence of evaluations must have T1's and T2's assignments before T3's and T4's initial loads respectively, and the order of the loads in each of T3 and T4 must be preserved. If T4's initial load follows T3's, then T4's final load must see a 1, and similarly if T3's initial load is last. Thus the sequence, as required by reasonable interpretations of the current text, can't exist.
On the other hand, preventing this outcome on POWER or ARM requires a heavyweight fence between the two loads in each of T3 and T4, forcing the loads to be treated more or less as sequentially consistent, completely defeating the purpose of memory_order_relaxed. And we do not believe that a typical compiler can usually see enough context to compile memory_order_relaxed as intended, even for more realistic examples.
As we pointed out in the preceding sections, there is no known way to fix this, and it seems increasingly unlikely that we are overlooking a painless solution.
We propose a fairly drastic change of wording, which effectively prevents a memory_order_relaxed (or memory_order_consume) load from being reordered with a subsequent memory_order_relaxed store, and imposes corresponding restrictions on some very weakly ordered atomic read-modify-write operations. This is easily expressed as the restriction that the "write seen by" (inverse "reads-from") relation on atomics, together with the "happens-before" relation, cannot form a cycle or, more formally, that the transitive closure of the union of the two relations is irreflexive. This clearly prohibits any sort of causal cycle; we can no longer follow a cyclic sequence of dependencies, since all such dependencies are clearly reflected in either a write-seen-by relationship or a happens-before relationship (or a data race).
But it unfortunately also prohibits the second example we gave at the beginning, in which one thread stores a constant. Instead of simply enforcing dependency ordering between a load and a later store, as the hardware does, we are treating every relaxed store as though it were dependent on every prior load, thus dodging the issue of defining dependencies.
Although details depend on specific machine memory models, the above restriction can generally be enforced by preventing load-to-store reordering of atomics. To see that, assume we have a path consisting of a sequence of write-seen-by and happens-before relationships. Each happens-before edge itself can be decomposed into synchronizes-with and sequenced-before edges. (Ignore memory_order_consume for now.) The cycle thus consists of intra-thread segments, connected by write-seen-by or synchronizes-with edges. Each intra-thread segment must start with a load or acquire operation, and end with a release or store operation. In either case, ordering is preserved. Thus every intra-thread segment must complete after it started, and the same applies to the synchronization edges. Thus we can't get back to an earlier action along the path, and a cycle is impossible.
(If such a path involves memory_order_consume, the same path will still exist if we ignore the additional properties of memory_order_consume and treat it as memory_order_relaxed. The happens-before relationships induced by memory_order_consume can be replaced by the write-seen-by relationship between the memory_order_release store and the memory_order_consume load, with sequenced-before taking the place of carries-a-dependency-to. Thus the same argument continues to apply.)
Since our proposal does not require any enforcement of dependencies, it affects only code that uses weakly ordered atomic operations. The added guarantee already holds for code that consistently uses at least memory_order_acquire for loads, memory_order_release for stores, and memory_order_acq_rel for read-modify-write operations.
On many architectures, a memory_order_relaxed load followed by a memory_order_relaxed store is already ordered, provided the compiler refrains from rearranging them. The latter is usually a tiny cost, since there is rarely strong motivation for a compiler to reorder in that direction. These architectures include TSO architectures like x86. They include Itanium, because we already imposed restrictions on memory_order_relaxed that prevent Itanium from using ordinary loads and stores. They appear to include current implementations of POWER, though not the POWER architecture as officially specified. The major architectures on which we add appreciable overhead for memory_order_relaxed appear to be ARM (and POWER as officially specified, though apparently not as currently implemented) and current Nvidia GPUs.
(GPU architectures are relevant here, since OpenCL has adopted essentially the C++ memory model, and we would like to preserve that consistency.)
On ARM (and according to the POWER spec), we can enforce the required ordering by adding a never-taken conditional branch that tests the result of the memory_order_relaxed load after every such load, and before the next memory_order_relaxed atomic store to a different location. The branch should either never actually be taken or target the next instruction. (Again, memory_order_consume might require similar treatment if it is not already compiled as memory_order_acquire. Read-modify-write operations with weaker than memory_order_acquire ordering also need to be treated correspondingly.) This is believed to be considerably less expensive than adding a fence instruction, especially since the compiler may be able to postpone the branch significantly, thus hopefully avoiding a stall waiting for the load to complete.
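As a rough source-level illustration of the idea (the actual transformation would be performed by the compiler back end, and the tested constant is arbitrary), the artificial branch sits between the load and the store roughly like this:

```cpp
#include <atomic>

std::atomic<int> x, y;

void copy_relaxed() {
    int r1 = x.load(std::memory_order_relaxed);
    // Conceptual compiler-inserted guard: a conditional branch that tests
    // the loaded value but reaches the same point whether or not it is
    // taken.  At the machine level this makes the following store
    // control-dependent on the load, and ARM and POWER do not allow a
    // store to become visible before a load it is control-dependent on.
    if (r1 == 0x40000000) { /* empty; execution falls through either way */ }
    y.store(r1, std::memory_order_relaxed);
}
```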
On current ARM processors, this technique may have the negative effect of effectively wasting a branch prediction slot. We guess that should be avoidable in future revisions of the architecture.
The overhead seems to be potentially much larger on current Nvidia GPUs, since we have to wait for the load to complete, and that typically involves waiting for an actual memory access. However, we hope and expect that typically this wait can be delayed until the result of the load is needed anyway, thus hopefully avoiding much of the overhead.
Since this does impact performance on some architectures, our goal is to state the restriction as normative encouragement for now, and to revisit it, possibly making it a hard requirement, in a future revision of the standard. Note that even the future direction has less implementation impact than the current unintentional restrictions.
The phrasing explicitly allows vendors to either find an alternate way to address the problem or, at least in the short term, ignore the recommendation completely. But the hope is that, in the absence of a very insightful alternate vendor specification, the recommended direction will eventually become a requirement.
Change 29.3p9-11 [atomics.order] as follows
~~An atomic store shall only store a value that has been computed from constants and program input values by a finite sequence of program evaluations, such that each evaluation observes the values of variables as computed by the last prior assignment in the sequence. The ordering of evaluations in this sequence shall be such that:~~

- ~~if an evaluation B observes a value computed by A in a different thread, then B does not happen before A, and~~
- ~~if an evaluation A is included in the sequence, then every evaluation that assigns to the same variable and happens before A is included.~~

It is implementation defined when, for an atomic access a sequenced before an atomic update b, there is a dependency from a to b. [ Note: This definition should include all cases in which the value read by a affects the value stored at b. Implementations are encouraged to conservatively define there to be a dependency from a to b whenever a is sequenced before b, since alternatives are difficult to define meaningfully and precisely. --end note ] [ Note: This differs from the "carries a dependency" terminology in 1.10 in that "carries a dependency" describes a proper subset of such possible information flows. --end note ]
We say that an evaluation A causally precedes an evaluation B if
- A synchronizes with B, or
- The evaluation B observes the value stored by modification A, or
- There is a dependency from atomic access A to atomic update B, or
- There is an evaluation C such that A causally precedes C, and C causally precedes B.
Implementations should ensure that no evaluation causally precedes itself.
[ Note: ~~The second requirement disallows~~ This recommendation can normally be implemented by ensuring that an atomic load operation is never reordered with a later atomic store operation performed by the same thread. It avoids "out-of-thin-air" or "speculative" stores of atomics when relaxed atomics are used. ~~Since unordered operations are involved, evaluations may appear in this sequence out of thread order.~~ For example, with x and y initially zero,

```cpp
// Thread 1:
r1 = y.load(memory_order_relaxed);
x.store(r1, memory_order_relaxed);

// Thread 2:
r2 = x.load(memory_order_relaxed);
y.store(42, memory_order_relaxed);
```

is allowed to produce r1 = r2 = 42. ~~The sequence of evaluations justifying this consists of: y.store(42, memory_order_relaxed); r1 = y.load(memory_order_relaxed); x.store(r1, memory_order_relaxed); r2 = x.load(memory_order_relaxed);~~ On the other hand,

```cpp
// Thread 1:
1: r1 = y.load(memory_order_relaxed);
2: x.store(r1, memory_order_relaxed);

// Thread 2:
3: r2 = x.load(memory_order_relaxed);
4: y.store(r2, memory_order_relaxed);
```

may not produce r1 = r2 = 42, since that would require (4) to provide the value for (1), which is sequenced before (2), which provides the value for (3), which is sequenced before (4), resulting in cyclic causally-precedes relationships. ~~there is no sequence of evaluations that results in the computation of 42. In the absence of "relaxed" operations and read-modify-write operations with weaker than memory_order_acq_rel ordering, the second requirement has no impact.~~ --end note ]

[ Note:
~~The requirements do allow~~ The recommendation similarly disallows r1 == r2 == 42 in the following example, with x and y initially zero:

```cpp
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);

// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);
```

~~However, implementations should not allow such behavior.~~ --end note ]
As we pointed out, the above removes unintended requirements on memory_order_relaxed, while adding a new requirement, load-to-store ordering, that was not part of the original design. This slows down memory_order_relaxed on some platforms relative to what we initially thought was possible, though not relative to what the standard unintentionally requires.
A large fraction of memory_order_relaxed use cases are those in which the access to the atomic variable is in fact either known not to race with any other accesses, or will be discarded unless we can determine that there was no race. These arguably cover the safest and simplest use cases, e.g. the second "check" in double-checked locking, and reads inside a seqlock reader critical section.
We can regain the performance in these cases, and sometimes gain a bit of additional performance, by supporting these cases with separate kinds of accesses. Since these cases do not really apply to read-modify-write operations, it makes sense to add functions non_atomic_load() and non_atomic_store() analogous to load() and store() instead of new memory_order_ values. Since these operations cannot meaningfully be used in racy contexts, they may be performed in multiple steps, e.g. a byte at a time. In a lock-based implementation, they do not need to acquire the lock. They do not need to enforce cache coherence, and hence are subject to weaker optimization rules. On Itanium, they may, unlike memory_order_relaxed, use ordinary load and store operations.
We introduce a helper class race_or<T> to express the result of a speculative non_atomic_load() operation that might have raced, but whose result will only be examined in the absence of a race. Logically race_or<T> either contains a T value, or a special "race" value. There is no way to test which case applies, and conversion to T yields undefined behavior if it contained a race value.
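For example, the second check in double-checked locking is a natural client: under the lock, the load cannot race. The sketch below uses the race_or and load_non_atomic names proposed in this paper; the Widget type and the surrounding pattern are illustrative only:

```cpp
#include <atomic>
#include <mutex>

struct Widget { /* ... */ };

std::atomic<Widget*> instance{nullptr};
std::mutex init_mutex;

Widget* get_instance() {
    Widget* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        // Second check: all writers hold init_mutex, so this load cannot
        // race, and converting the race_or result is well defined.
        Widget* q = instance.load_non_atomic();
        if (q == nullptr) {
            q = new Widget;
            instance.store(q, std::memory_order_release);
        }
        p = q;
    }
    return p;
}
```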
This allows us to call non_atomic_load() and then check that no race occurred, e.g. by examining a version number to confirm that there was no intervening store, as is done for seqlocks (cf. my recent MSPC paper), and in many transactional memory implementations.
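For instance, a seqlock-style reader might use the proposed interface roughly as follows; this is only a sketch, the fence placement follows the usual seqlock reader pattern, and the data layout is illustrative:

```cpp
#include <atomic>

std::atomic<unsigned> seq;          // even when no writer is active
std::atomic<int> data1, data2;      // protected by the seqlock

int read_pair() {
    for (;;) {
        unsigned s1 = seq.load(std::memory_order_acquire);
        if (s1 & 1) continue;       // writer in progress; retry
        // Possibly racing reads; the results are not examined until we
        // know that no writer intervened.
        std::race_or<int> v1 = data1.load_non_atomic();
        std::race_or<int> v2 = data2.load_non_atomic();
        std::atomic_thread_fence(std::memory_order_acquire);
        if (seq.load(std::memory_order_relaxed) == s1)
            // No intervening store, so the conversions to int are well defined.
            return static_cast<int>(v1) + static_cast<int>(v2);
    }
}
```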
There have been requests for supporting non-atomic operations on atomic values before. And that may be a requirement for using C++11/C11 atomics in the Linux kernel, for example. Previously our counterargument was "memory_order_relaxed may be close enough; we'll reconsider if it isn't". This may be a good time to reconsider.
I don't consider this addition essential at this point, provided we adopt the weak wording suggested above. I do believe a change along these lines is required before we make load-to-store ordering a hard requirement, if and when we decide to do that.
Disclaimer: I am not qualified to write wording for new library classes. This didn't stop me. LWG help is requested.
29.2 [atomics.syn], add to synopsis:
```cpp
template<class T> struct race_or;

void atomic_store_non_atomic(volatile atomic-type*, T) noexcept;
void atomic_store_non_atomic(atomic-type*, T) noexcept;
race_or<T> atomic_load_non_atomic(const volatile atomic-type*) noexcept;
race_or<T> atomic_load_non_atomic(const atomic-type*) noexcept;
```
Add a section after 29.4 [atomics.lockfree]:
29.5 race_or template [atomics.race_or]

```cpp
namespace std {
    template <class T> struct race_or {
        race_or() = default;
        race_or(const race_or&) = default;
        race_or& operator=(const race_or&) = default;
        operator T() const;
    };
}
```

This is a helper class for the load_non_atomic functions. It represents a value that was potentially speculatively loaded and should not be examined before establishing that the load did not participate in a race. The type of the template argument T shall be trivially copyable.

A value of type race_or<T> represents either a value of type T or a race indication, representing the fact that the value is not uniquely determined. [ Note: There is no way to test a race_or<T> value to determine whether it contains a T value or a race indication. Implementations will typically represent race_or<T> as a T. A race_or value can only be generated by copying another one or as the result produced by one of the load_non_atomic functions. In the latter case, it provides the opportunity for the calling code to check whether a data race occurred, for example by checking a version number incremented by writers, after performing the load but before performing the conversion. --end note ]
```cpp
race_or() = default;
```

Effects: Initializes the object to a race indication.
```cpp
operator T() const;
```

Requires: The object does not contain a race indication.
Returns: The contained object of type T.
Add the following member functions to the synopsis at the beginning of 29.5 [atomics.types.generic]:
```cpp
void store_non_atomic(T) volatile noexcept;
void store_non_atomic(T) noexcept;
race_or<T> load_non_atomic() const volatile noexcept;
race_or<T> load_non_atomic() const noexcept;
```
Insert the following after 29.6.5p10 [atomics.types.operations.req]:
```cpp
void atomic_store_non_atomic(volatile A* object, C desired) noexcept;
void atomic_store_non_atomic(A* object, C desired) noexcept;
void A::store_non_atomic(C desired) volatile noexcept;
void A::store_non_atomic(C desired) noexcept;
```

Effects: Replaces the value pointed to by object or by this with the value of desired. This is not an atomic operation (1.10 [intro.multithread]).
Insert after 29.6.5p15 [atomics.types.operations.req]: (Note that some of the preceding effects clauses for load should probably be synchronization clauses, making things a bit inconsistent.)
```cpp
race_or<C> atomic_load_non_atomic(const volatile A* object) noexcept;
race_or<C> atomic_load_non_atomic(const A* object) noexcept;
race_or<C> A::load_non_atomic() const volatile noexcept;
race_or<C> A::load_non_atomic() const noexcept;
```

Synchronization: None.
Effects: For purposes of (1.10 [intro.multithread]) this is an atomic operation, in spite of its name. It cannot, by itself, introduce a data race that results in undefined behavior. [ Note: The indication that such a data race would otherwise have occurred is instead captured in the return value, and deferred until that value is converted to C. --end note ] [ Note: Unlike an ordinary value computation, the implementation must respect the fact that the value being loaded may change while it is being loaded, and this must not result in any visible effects unless the result is subsequently converted to a C. A data-race-checking implementation should explicitly distinguish the two cases in the race_or<C> representation, so that the race can be reported when the conversion occurs. --end note ]
Returns: If the visible sequence of side effects with respect to this value computation of *object or *this (1.10 [intro.multithread]) contains exactly one such side effect, then the resulting race_or object will contain the value stored by this side effect. Otherwise it will contain an indication that the load involved a race.