Hi Johannes,
> I am especially curious where you get your data from. Tapir [0] (and to
> some degree PIR [1]) have shown that, counterintuitively, only a few changes
> to LLVM passes are needed. Tapir was recently used in an MIT class with a
> lot of students and it seemed to work well with only minimal changes
> to analysis and especially transformation passes.
TAPIR is an elegant, small extension and, in particular, I think the idea of asymmetric parallel tasks and control flow is a clever way to express parallelism with serial semantics, as in Cilk. Encoding the control flow extensions as explicit instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.
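To make the serial-semantics point concrete, the classic Cilk fib sketch below shows it (a minimal example of mine; it needs a Cilk-enabled compiler): erase cilk_spawn and cilk_sync and you are left with an ordinary serial C function with the same meaning, which is exactly the structure TAPIR's detach/reattach/sync instructions mirror at the IR level:

  #include <cilk/cilk.h>

  /* Serial elision: deleting the two Cilk keywords leaves a valid
     serial C program that computes the same value. */
  long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  /* child task, may run in parallel */
    long y = fib(n - 2);             /* continuation runs in the parent */
    cilk_sync;                       /* wait for the spawned child */
    return x + y;
  }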
However, Cilk is a tiny language compared with the full complexity of other languages, like OpenMP. To take just one example, TAPIR cannot express the ORDERED construct of OpenMP. A more serious concern, IMO, is that TAPIR (like Cilk) requires serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction. Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because, without them, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses. For example, automatic array privatization in parallel loops is a very hard problem (automatic scalar privatization is easier, but even that is interprocedural). Reduction recognition is doable for common cases, but there are hard cases here as well. These are all standard features of parallel programs, not specific to OpenMP (e.g., C++17 parallel template operators are likely to produce these as well).
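To make concrete what those clauses state explicitly, here is a minimal OpenMP/C sketch of mine: the reduction and privatization facts are written in the directive rather than inferred, and ORDERED imposes a cross-iteration ordering constraint on the parallel schedule itself:

  #include <math.h>
  #include <stdio.h>

  #define N 1000

  int main(void) {
    double a[N], sum = 0.0, tmp = 0.0;
    for (int i = 0; i < N; ++i) a[i] = i * 0.5;

    /* reduction(+:sum): each thread accumulates a private partial sum
       that is combined at the end; private(tmp): each thread gets its
       own copy of tmp. Neither fact has to be discovered by analysis. */
    #pragma omp parallel for reduction(+:sum) private(tmp)
    for (int i = 0; i < N; ++i) {
      tmp = sqrt(a[i]);
      sum += tmp;
    }

    /* ordered: iterations may run concurrently, but the ordered region
       must execute in loop order -- a constraint on the schedule, not
       just on the data. */
    #pragma omp parallel for ordered
    for (int i = 0; i < N; ++i) {
      double v = a[i] * a[i];
      #pragma omp ordered
      printf("%d %f\n", i, v);
    }

    printf("sum = %f\n", sum);
    return 0;
  }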
If you support all these capabilities in the IR, a *lot* more than 6000 LOC (Tapir’s statistic; I’m sorry I don’t recall the number for PIR) would probably have to be modified in LLVM.
[...]
>> (b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
>> join, barrier, etc.
> [...] For me, fork and spawn serve the same purpose; most of the new schemes suggested three new instructions in total.
A reasonable question is whether to use (b) first-class instructions for some features *in combination with* (d) region markers + metadata, or to use (d) exclusively. There are almost certainly far too many essential features in parallel programs to capture them all as new instructions. I don’t see a need to answer this question on Day 1. Instead, we can begin with regions and metadata annotations, and then “promote” a few features to first-class instructions if the benefit is justified.
Does that make sense?
> Also I think we already have a lot of
> simple constructs in the IR to express high-level information properly,
> e.g. 'atomicrmw' instructions for high-level reduction. While we currently
> lack analysis passes to extract information from such low-level
> representations, these are certainly possible [2,3]. I would argue that such
> analyses are a better way to do things than placing "high-level intrinsics" in the IR to mark things like reductions.
IMO, this is too fragile to ensure performance for realistic applications. While there is a lot of work on reduction recognition in the literature going back to the 1980s, there will always be cases where these algorithms miss a reduction pattern and you get terrible parallel performance. Second, I’m not convinced that you can even find the candidate loop nests that *might* be reductions by looking for specific operations like “atomicrmw”, because the atomic op may in fact be in a library operation linked in as binary code. Third, this greatly increases the complexity of a “first working prototype” for a language like OpenMP because you can’t exclude REDUCTION clauses. Fourth, I believe array privatization is even harder; more generally, you cannot rely on heroic compiler optimizations to take the place of explicit language directives.
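To illustrate the second point with a small C sketch (accumulate_into is a made-up stand-in for a routine linked in as binary code): both parallel loops below perform the same sum-reduction, but only the first exposes an atomic update that would typically lower to an atomicrmw a recognizer could match; in the second the update is hidden inside the library call, whereas the explicit clause in the third version states the fact without any analysis:

  #include <stdatomic.h>

  /* Hypothetical routine from a library linked in as binary code; its
     internal atomic update is invisible in this translation unit. */
  void accumulate_into(_Atomic long *total, long value);

  void sum_visible(_Atomic long *total, const long *a, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
      atomic_fetch_add(total, a[i]);  /* typically an atomicrmw: a recognizer
                                         *might* rewrite this as a reduction */
  }

  void sum_hidden(_Atomic long *total, const long *a, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
      accumulate_into(total, a[i]);   /* same reduction, but no atomicrmw in
                                         sight, so pattern matching fails */
  }

  /* With an explicit clause, the information does not depend on analysis. */
  long sum_declared(const long *a, long n) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i)
      total += a[i];
    return total;
  }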
I may be showing my age, but I co-wrote a full HPF-to-MPI source-to-source compiler in the late 1990s at Rice University with some quite sophisticated optimizations [1,2], and we and others in the HPF compiler community — both product and research teams — encountered many cases where compiler analyses were inadequate. There is a credible view that HPF failed despite very high hopes and significant industry and research investment because it relied too much on heroic compiler technology.
[1] http://vadve.web.engr.illinois.edu/Papers/set_framework.pldi98.ps (PLDI 1998)
[2] http://vadve.web.engr.illinois.edu/Papers/SC98.NASBenchmarksStudy/dhpfsc98.pdf (SC 1998)
--Vikram
// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve@illinois.edu
// http://llvm.org