Yonghong, in our implementation (not open sourced), we don’t do outlining in the Front-End. See my previous reply to Mehdi’s email.
Xinmin
Hi,
I'm going to club together some responses.
I agree that outlining function sub-bodies and passing in the function
pointers to said outlined bodies to OpenMP helpers lets us correctly
implement the semantics we need. However, unless I severely
misunderstood the thread, I thought the key idea was to move *away*
from that representation and towards a representation that _allows_
optimization?
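As a point of reference, the outlining scheme mentioned above can be sketched in plain C. Everything below is hypothetical: the names `outlined_region` and `fake_fork_call` are made up, and a serial loop stands in for the runtime's fork of worker threads. It only illustrates the shape of the lowering, not any real OpenMP runtime API:

```c
#include <assert.h>

/* Hypothetical outlined body: what a front-end might emit for the text
   of a parallel region that fills one slot of 'a' per thread. */
static void outlined_region(int tid, int *a) {
  a[tid] = tid * 10;  /* stands in for per-thread computation */
}

/* Hypothetical stand-in for an OpenMP runtime helper: the real helper
   would fork threads and hand each one the outlined body; here we run
   the bodies serially to keep the sketch self-contained. */
static void fake_fork_call(void (*body)(int, int *), int nthreads, int *a) {
  for (int tid = 0; tid < nthreads; ++tid)
    body(tid, a);
}

int region_result(void) {
  int a[4];
  fake_fork_call(outlined_region, 4, a);
  return a[0] + a[1];  /* live-out values written inside the region */
}
```

Because the region body is only reachable through a function pointer passed to an opaque call, the optimizer must conservatively assume the call may write `a`, which is what keeps this lowering correct by default; the trade-off is that the parallel region becomes invisible to optimization.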
My problem with representing parallel regions with
intrinsic-denoted-regions is that we're lying to the optimizer about
what the code actually does. Calls, even to intrinsics, can "at
worst" do some combination of the following:
- Write to and read from arbitrary memory
- Have UB (but we're allowed to pretend that they don't)
- Throw an exception
- Never return, either by infinite looping or by calling exit(0)
- Have memory synchronization operations, like fences, atomic loads,
stores etc.
- Have side effects like IO, volatile writes
If an intrinsic's behavior can be explained by some subset of the
above, then you should not need to edit any pass to preserve
_correctness_ -- all optimization passes (today) conservatively assume
that calls that they don't understand have all of the behaviors
outlined above.
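A tiny C example of what that conservative assumption looks like in practice (`opaque` here is a placeholder for any call whose body the optimizer cannot see; in real IR it would typically be an external function):

```c
#include <assert.h>

static int g = 0;

/* Placeholder for a call the optimizer cannot analyze. A pass must
   assume it may read or write g, throw, loop forever, etc. */
void opaque(void) { g = 2; }

int observe(void) {
  g = 1;      /* cannot be removed: opaque() may read g */
  opaque();   /* cannot be reordered past the accesses of g */
  return g;   /* cannot be folded to 1: opaque() may have written g */
}
```

Because every pass already assumes this worst case for calls it does not understand, an intrinsic whose behavior fits inside the list above needs no per-pass changes, which is the point being made here.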
However, if to preserve *correctness* you have to edit optimization
passes and teach them that certain intrinsic calls have behavior
*outside* the set mentioned above then the instruction really does not
have "call semantics". `call @llvm.experimental.region_begin()` is
really a fundamentally new instruction masquerading as an intrinsic,
and it is probably better to call a spade a spade and represent it as
a new instruction.
The setting for the examples I gave was not that "here is a case we
need to get right". The setting was that "here is a *symptom* that
shows that we've lied to the optimizer". We can go ahead and fix all
the symptoms by adding bailouts to the respective passes, but that
does not make us immune to passes that we don't know about
e.g. downstream passes, passes that will be added later. It also puts
us in a weird spot around semantics of call instructions.
-- Sanjoy
Ok, but this looks like a “workaround” for your specific use case; I don’t see how it can scale as a model-agnostic and general-purpose region semantic.
The fact that you needed this pre-step in the first place seems to indicate to me that it confirms what multiple people expressed in this thread, for example what Daniel wrote here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
I missed this before answering in my last email; “as is”, it does not sound like something that can scale non-intrusively in LLVM as a model-agnostic and general-purpose region semantic.
Sanjoy just explains it better :), so right now my understanding is the same as what Sanjoy expressed below.
[XT] Back from Biz trips, trying to catch up with the discussion.
I agree that outlining function sub-bodies and passing in the function pointers to said outlined bodies to OpenMP helpers lets us correctly implement the semantics we need. However, unless I severely misunderstood the thread, I thought the key idea was to move *away* from that representation and towards a representation that _allows_ optimization?
[XT]: Your understanding is correct. But the IR-level region annotation RFC is not just for OpenMP; OpenMP is one of the use cases.
My problem with representing parallel regions with intrinsic-denoted-regions is that we're lying to the optimizer about what the code actually does. Calls, even to intrinsics, can "at worst" do some combination of the following:
- Write to and read from arbitrary memory
- Have UB (but we're allowed to pretend that they don't)
- Throw an exception
- Never return, either by infinite looping or by calling exit(0)
- Have memory synchronization operations, like fences, atomic loads,
stores etc.
- Have side effects like IO, volatile writes
[XT] Based on Google and Xilinx's suggestion, the IR-level region annotation can use tokens and tags with intrinsic functions to model regions and memory dependencies (use/def). The items in the list above are handled based on language rules. E.g., the OpenMP rule says that throwing an exception inside a parallel region is allowed, but it must be caught within the region, i.e. no control-flow edge is allowed to cross the region boundary. "exit" is one exception which is allowed, as it terminates the program. Our solution is to have the FE and/or one central place in the ME deal with language specifics.
However, if to preserve *correctness* you have to edit optimization passes and teach them that certain intrinsic calls have behavior
*outside* the set mentioned above then the instruction really does not have "call semantics". `call @llvm.experimental.region_begin()` is really a fundamentally new instruction masquerading as an intrinsic, and it is probably better to call a spade a spade and represent it as a new instruction.
[XT] Yes and no. Yes w.r.t. the region scope annotation; no in that it is more than one new instruction, it is more like a sequence of instructions. Assume we have a "fork" instruction: OMP fork and Cilk fork/spawn semantics differ in terms of stack frame allocation and ABI. When we introduce a new instruction, its exact semantics need to be defined and can't be altered afterwards. Thus we proposed to start with experimental intrinsics, and this is proven to work. We can always convert the intrinsics with token/tag to instructions once we have enough solid cases / justification for the model-agnostic part of the conversion.
The setting for the examples I gave was not that "here is a case we need to get right". The setting was that "here is a *symptom* that shows that we've lied to the optimizer". We can go ahead and fix all the symptoms by adding bailouts to the respective passes, but that does not make us immune to passes that we don't know about e.g. downstream passes, passes that will be added later. It also puts us in a weird spot around semantics of call instructions.
[XT] I would say it is a design trade-off between having a central place to deal with specifics and making drastic changes from day one. Our process is to have a central place to get everything working, then turn off the support for some "symptoms" in this central place one by one to trigger downstream failures and fix them. I think our ultimate goals are more or less the same; we are just taking different approaches to get there. The central place / prepare-phase for getting the IR into a "canonical" form will help address the issue you mentioned: "downstream passes, passes that will be added later. It also puts us in a weird spot around semantics of call instructions."
Thanks for all questions, discussions and feedback.
Xinmin
Ok, but this looks like a “workaround" for your specific use-case, I don’t see how it can scale as a model-agnostic and general-purpose region semantic.
I would say it is a design trade-off. Regardless of whether it is a new instruction or an intrinsic with token/tag, it will consist of a model-agnostic part and a model-specific part. The package comes with a framework for parsing and using these intrinsics. See my reply to Sanjoy's email.
Xinmin
Ok, but this looks like a “workaround" for your specific use-case, I don’t see how it can scale as a model-agnostic and general-purpose region semantic.
I would say it is a design trade-off.
I’m not sure we’re talking about the same thing here: my understanding at this point is that the trade-off you’re making is gaining “simplicity” by sacrificing “correctness”.
Requiring the IR to stay in what you’re calling a “canonical” form in your answer to Sanjoy in order to not miscompile a program is not an approach that seems compatible with how we deal with the IR usually.
Regardless of whether it is a new instruction or an intrinsic with token/tag, it will consist of a model-agnostic part and a model-specific part. The package comes with a framework for parsing and using these intrinsics. See my reply to Sanjoy's email.
The answer to Sanjoy is not really helpful to clarify anything to me. At this point I still don’t understand how this is supposed to be correct in general.
It would be helpful to have a LangRef patch that describes the semantics associated with your region intrinsics. Then we may be able to process some examples through the formalized description.
Let me try this.
You can simply consider that the prepare-phase (e.g. pre-privatization) is done in the FE (actually in a library that can be used by multiple FEs at the LLVM IR level), that the region is run with 1 thread, and that the region annotation (a single-entry single-exit scope), marked with region intrinsic functions, conservatively acts as a memory barrier for now (instead of checking individual memory dependencies and aliasing via tags, which is how the actual implementation works). What optimization will mess up this region annotation?
The first thing that comes to my mind is inlining, which can put the IR in a form that breaks the invariant you tried to enforce with your "prepare-phase” (for example by hoisting allocas).
Remember that the prepare-phase is invoked in the FE or right after the FE; inlining has not happened yet, which is why we don't call it a "pass". Chandler made a good point about this case a long time back.
Hoisting an alloca is totally OK. A new alloca is generated during outlining later on for anything marked as "private" (as long as the "private" information is saved in the tag). I thought we talked about this in an earlier email.
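A serial C sketch of that privatization step may help. All names here are hypothetical; the point is only that a variable marked "private" in the tag gets a fresh per-invocation copy inside the outlined body, so hoisting the original alloca to the entry block cannot introduce sharing:

```c
#include <assert.h>

/* Hypothetical outlined body. 'tmp' was a local (an alloca) inside the
   original region, marked "private" in the tag; outlining re-creates it
   here, so each invocation gets its own fresh copy regardless of where
   the original alloca was hoisted to. */
static void outlined_body(int tid, int *out) {
  int tmp = tid + 1;  /* the private copy */
  out[tid] = tmp;
}

int private_sum(void) {
  int out[4];
  for (int tid = 0; tid < 4; ++tid)  /* serial stand-in for the fork */
    outlined_body(tid, out);
  return out[0] + out[1] + out[2] + out[3];  /* 1 + 2 + 3 + 4 */
}
```

Under this scheme the location of the original alloca is irrelevant after outlining, which is the argument being made here; what matters is that the "private" marking survives in the tag until outlining runs.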
By the way, all the concerns you have are valid; we worked on resolving these issues 10+ years back when we did similar things in our compilers. I wouldn't claim we have perfect solutions, but we do have reasonably good solutions for handling general directives and OpenMP directives.
Xinmin
Remember that the prepare-phase is invoked in the FE or right after the FE; inlining has not happened yet, which is why we don’t call it a “pass”. Chandler made a good point about this case a long time back.
What I was describing is the inlining in the optimizer pipeline.
Hoisting an alloca is totally OK. A new alloca is generated during outlining later on for anything marked as “private” (as long as the “private” information is saved in the tag). I thought we talked about this in an earlier email.
Can you describe how (and at which point) you get the private for “var” added to the tag?
In this case, the inliner is taught to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
In our icc implementation, it is even simpler: since we have a routine-level symbol table, the inliner adds the ”private” attribute to those local variables without checking the enclosing scope, and the parallelizer checks for and uses it.
Xinmin
In this case, the inliner is taught to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
So isn’t it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?
It seems to me that this contradicts the claim that the “tag” specific semantic does not need to be handled by the optimizer and that the intrinsic can integrate seamlessly in LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.
Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.
In our icc implementation, it is even simpler: since we have a routine-level symbol table, the inliner adds the ”private” attribute to those local variables without checking the enclosing scope, and the parallelizer checks for and uses it.
Again, you’re trying to address a specific case, while I’m just trying to identify a generic class of problem that your proposal fails to address explicitly.
Best,
In this case, the inliner is taught to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
So isn’t it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?
[XT] I thought we said a number of times that there are small changes to be made. I quoted a ballpark number, ~2000 LOC vs. ~6000 LOC of changes, in one of the early emails.
It seems to me that this contradicts the claim that the “tag” specific semantic does not need to be handled by the optimizer and that the intrinsic can integrate seamlessly in LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.
Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.
[XT] All we said is that we minimize the cost where possible. The intrinsic functions are a generic way of representing a directive and a region, such as prefetch, unroll, omp, etc. Each instance of them will have its own semantics, which will be covered in follow-up RFCs.
In our icc implementation, it is even simpler: since we have a routine-level symbol table, the inliner adds the ”private” attribute to those local variables without checking the enclosing scope, and the parallelizer checks for and uses it.
Again, you’re trying to address a specific case, while I’m just trying to identify a generic class of problem that your proposal fails to address explicitly.
[XT] It looks like there is a misunderstanding of the proposal. The proposal is to build an experimental framework with the proposed interface, to evaluate the set of changes (and cost) needed for the use cases we know of. Hal and I carefully positioned this RFC; even the intrinsics are all named experimental. So far, the cost of handling the cases identified is within the range of our expectations.
Best,
From: mehdi.amini@apple.com
Sent: Tuesday, January 31, 2017 9:03 PM
To: Tian, Xinmin <xinmin.tian@intel.com>
Cc: Sanjoy Das <sanjoy@playingwithpointers.com>; Adve, Vikram Sadanand <vadve@illinois.edu>; llvm-dev@lists.llvm.org; llvm-dev-request@lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
In this case, the inliner is taught to add all local variables to the tag of the enclosing parallel region, if there is an enclosing parallel region.
So isn’t it a good example that shows that your intrinsic *cannot* be opaque and that IR passes need to be modified to handle not only the IR-region intrinsic but also the specific semantic of the tag?
[XT] I thought we said a number of times that there are small changes to be made. I quoted a ballpark number, ~2000 LOC vs. ~6000 LOC of changes, in one of the early emails.
This didn’t mean that the changes were meant specifically for OpenMP. My understanding was that this proposal is for a generic "IR-level Region Annotations” mechanism, and that’s what the changes were for. Now it ends up being “let’s support OpenMP semantics without adding openmp in the intrinsic names”.
It seems to me that this contradicts the claim that the “tag” specific semantic does not need to be handled by the optimizer and that the intrinsic can integrate seamlessly in LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.
Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.
[XT] All we said is that we minimize the cost where possible. The intrinsic functions are a generic way of representing a directive and a region, such as prefetch, unroll, omp, etc. Each instance of them will have its own semantics, which will be covered in follow-up RFCs.
At this point I don’t see any advantage in having a “generic intrinsic" that has an opaque tag, since all the semantics are in the tag anyway. I’d have to see what is really “generic” in the handling of it...
Reid identified this very early in the thread (he is a lot more perspicacious than I am) here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108914.html
The point here is to abstract the properties about which other passes might need to know by using a set of generic intrinsics. The fact that you can’t hoist allocas past one of these intrinsics is nowhere close to saying that the individual optimization passes need to know anything about OpenMP, parallelism, etc. Regardless of how many LOC are in Intel’s prototype, we’re obviously aiming for minimal impact on the current upstream infrastructure.
This is completely opposite to the point. The semantics relevant to the rest of the optimization pipeline should be in the intrinsics themselves. I’ve yet to see anything to suggest that we can’t do that.
There are multiple levels here:
a) Semantics relevant to the rest of the pipeline
b) Semantics relevant to parallelism-specific optimizations (e.g. redundant barrier removal)
c) Semantics relevant to specific programming model / extension (OpenMP, OpenACC, C++ parallel algorithms, whatever)
We’d like to separate these three levels, and I believe the proposed scheme allows us to do that. Obviously, this assumes that we can indeed have a small set of intrinsics that satisfy the needs of (a). Furthermore, if we’re going to use intrinsics, we need to decide whether all of the relevant semantics are reasonable to encode in intrinsics (e.g. is it reasonable to have an intrinsic past which you can’t hoist an alloca, or would that need to be an instruction, etc.)
-Hal
+1 to Hal’s point.
The point here is to abstract the properties about which other passes might need to know by using a set of generic intrinsics.
The fact is that I’m still not sure what these properties are, which is exactly why I’m asking for a LangRef patch.
If it has been explained in thread, then sorry I missed it, and I’d appreciate a pointer to the right post.
The fact that you can’t hoist allocas past one of these intrinsics, is nowhere close to saying that the individual optimization passes need to know anything about OpenMP, parallelism, etc. Regardless of how many LOC are in Intel’s prototype, we’re obviously aiming for minimal impact on the current upstream infrastructure.
It seems to me that this contradicts the claim that the “tag” specific semantic does not need to be handled by the optimizer and that the intrinsic can integrate seamlessly in LLVM, which invalidates the approach (of a generic intrinsic) entirely IMO.
Maybe you never intended to claim this, but this is a hidden cost in the original RFC, and I suspect this cost has to be carefully evaluated. At this point I’m not sure it is worth discussing anything further without seeing a proper LangRef update.
[XT] All we said is that we minimize the cost where possible. The intrinsic functions are a generic way of representing a directive and a region, such as prefetch, unroll, omp, etc. Each instance of them will have its own semantics, which will be covered in follow-up RFCs.
At this point I don’t see any advantage in having a “generic intrinsic" that has an opaque tag since all the semantic is in the tag anyway. I’d have to see what is really “generic” in the handling of it…
This is completely opposite to the point. The semantics relevant to the rest of the optimization pipeline should be in the intrinsics themselves. I’ve yet to see anything to suggest that we can’t do that.
I’ve yet to see anything that suggests we can :)
How will it be expressed that we’re not allowed to hoist an alloca? How are other constraints going to be expressed?
Reid identified this very early in the thread (he is a lot more perspicacious than I am) here: http://lists.llvm.org/pipermail/llvm-dev/2017-January/108914.html
There are multiple levels here:
a) Semantics relevant to the rest of the pipeline
b) Semantics relevant to parallelism-specific optimizations (e.g. redundant barrier removal)
c) Semantics relevant to specific programming model / extension (OpenMP, OpenACC, C++ parallel algorithms, whatever)
We’d like to separate these three levels, and I believe the proposed scheme allows us to do that. Obviously, this assumes that we can indeed have a small set of intrinsics that satisfy the needs of (a). Furthermore, if we’re going to use intrinsics, we need to decide whether all of the relevant semantics are reasonable to encode in intrinsics (e.g. is it reasonable to have an intrinsic past which you can’t hoist an alloca, or would that need to be an instruction, etc.)
I agree with all that, and for now I’m only worried about a), which is why I asked for a LangRef update because I don’t feel I read a proper explanation there.
—Mehdi
FWIW, we only do this for allocas in the entry block, and it would be trivial to have the code stop scanning the entry block upon hitting some specifically-tagged thing(s). I don’t think this is necessarily the scheme that we’d use for an upstream implementation.
Thanks again,
Hal
Hi Xinmin,
[XT] Back from Biz trips, trying to catch up with the discussion.
I agree that outlining function sub-bodies and passing in the function pointers to said outlined bodies to OpenMP helpers lets us correctly implement the semantics we need. However, unless I severely misunderstood the thread, I thought the key idea was to move *away* from that representation and towards a representation that _allows_ optimization?
[XT]: Your understanding is correct. But the IR-level region annotation RFC is not just for OpenMP; OpenMP is one of the use cases.
My problem with representing parallel regions with intrinsic-denoted-regions is that we're lying to the optimizer about what the code actually does. Calls, even to intrinsics, can "at worst" do some combination of the following:
- Write to and read from arbitrary memory
- Have UB (but we're allowed to pretend that they don't)
- Throw an exception
- Never return, either by infinite looping or by calling exit(0)
- Have memory synchronization operations, like fences, atomic loads,
stores etc.
- Have side effects like IO, volatile writes
[XT] Based on Google and Xilinx's suggestion, the IR-level region annotation can use tokens and tags with intrinsic functions to model regions and memory dependencies (use/def). The items in the list above are handled based on language rules. E.g., the OpenMP rule says that throwing an exception inside a parallel region is allowed, but it must be caught within the region, i.e. no control-flow edge is allowed to cross the region boundary. "exit" is one exception which is allowed, as it terminates the program. Our solution is to have the FE and/or one central place in the ME deal with language specifics.
I should have been clearer: I am not talking about user code here.
I'm trying to say that intrinsic calls are, first and foremost, calls;
and their behavior should be describable within the framework above.
This precludes using intrinsics to represent magic control flow.
For instance, if we go back to one of the problematic transformations
from earlier in the mail:
int main() {
int a[4];
#pragma omp parallel num_threads(4)
{
int i = omp_get_thread_num();
int val = compute_something_into_val(i);
a[i] = val;
}
return a[0] + a[1];
}
to
int main() {
int a[4];
#pragma omp parallel num_threads(4)
{
int i = omp_get_thread_num();
int val = compute_something_into_val(i);
a[i] = val;
}
return undef;
}
or some variant of this. Say the parallel region is demarcated by a
pair of intrinsics. Said intrinsics could have any subset of the
behaviors listed above, but the transform would still be valid; so
to rule out this transform you would need these demarcating (or
region-creating) intrinsics to have some exotic property not in that
list. That's the bit I'm worried about. IOW I'm worried about
changes like this:
if (!isa<TerminatorInst>(I))
V.push_back(I);
else
return false;
to
if (!isa<TerminatorInst>(I) && !llvm::isTerminatorLikeIntrinsic(I))
V.push_back(I);
else
return false;
If you can define the semantics of the intrinsics you want to add as a
subset of the behaviors specified above then I have no problems.
The other option is to define all or a subset of intrinsic calls as
*not* `CallInst` s but as fundamentally different "things" in the LLVM
Instruction hierarchy; at which point we can give them whatever
semantics we want. What I don't want is a `CallInst` doing something
a `CallInst` ought not do.
However, if to preserve *correctness* you have to edit optimization passes and teach them that certain intrinsic calls have behavior
*outside* the set mentioned above then the instruction really does not have "call semantics". `call @llvm.experimental.region_begin()` is really a fundamentally new instruction masquerading as an intrinsic, and it is probably better to call a spade a spade and represent it as a new instruction.
[XT] Yes and no. Yes w.r.t. the region scope annotation; no in that
it is more than one new instruction, it is more like a sequence of
instructions. Assume we have a "fork" instruction: OMP fork and Cilk
fork/spawn semantics differ in terms of stack frame allocation and
ABI. When we introduce a new instruction, its exact semantics need
to be defined and can't be altered afterwards. Thus we proposed to
Can that problem be solved by (this is both a technical and cultural
shift) allowing "experimental" instructions, whose semantics can be
evolved over time with no guarantee for backwards compatibility?
start with experimental intrinsics, and this is proven to work. We
can always convert the intrinsics with token/tag to instructions once
we have enough solid cases / justification for the model-agnostic
part of the conversion.
I don't have any problem with that in principle, and I'd
wholeheartedly agree if the intrinsics had call semantics.
The setting for the examples I gave was not that "here is a case we need to get right". The setting was that "here is a *symptom* that shows that we've lied to the optimizer". We can go ahead and fix all the symptoms by adding bailouts to the respective passes, but that does not make us immune to passes that we don't know about e.g. downstream passes, passes that will be added later. It also puts us in a weird spot around semantics of call instructions.
[XT] I would say, it is a design trade-off between having a central
It *is* a design trade-off but I think you're discounting how much
we're trading-off on one side of the equation.
place to deal with specifics and making drastic changes from day
one. Our process is to have a central place to get everything
working, then turn off the support for some "symptoms" in this
central place one by one to trigger downstream failures and fix them. I
Again, I have no issue with this process; but I'm speculating that the
nature of the fixes will add more technical debt to LLVM than we're
comfortable with absorbing.
(Assuming by "central place" you mean the frontend and by downstream
fixes you meant fixes to LLVM passes?)
think our ultimate goals are more or less the same; we are just taking
different approaches to get there. The central place / prepare-phase
for getting the IR into a "canonical" form will help address the issue
you mentioned: "downstream passes, passes that will be added later. It
also puts us in a weird spot around semantics of call instructions."
Thanks for all the questions, discussions, and feedback.
Thank you too!
-- Sanjoy