-
Notifications
You must be signed in to change notification settings - Fork 66
/
Copy pathDOC.txt
340 lines (249 loc) · 11.6 KB
/
DOC.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
OVERVIEW
PLUTO is an automatic parallelization tool that is based on the
polyhedral model. The Polyhedral model is a geometrical representation
for programs that utilizes machinery from Linear Algebra and Linear
Programming for analysis and high-level transformations. Pluto
transforms C programs from source to source for coarse-grained
parallelism and locality simultaneously. The core transformation
framework mainly works by finding affine transformations for efficient
tiling and fusion, but not limited to it. OpenMP parallel code for
multicores can be automatically generated from sequential C program
sections. Pluto currently lacks intra-tile interchange (supposed to be
done post-transformation) for spatial locality, prefetching, etc.. This
will be implemented in future.
This is the chain of the entire source-to-source system that polycc will
run.
C code --> Polyhedral extraction --> Dependence analysis
(clan or PET) (ISL or candl)
--> Pluto transformer
(core Pluto algorithm + post transformation)
--> CLooG --> C (with OpenMP, ivdep pragmas)
(cloog + clast processing
to mark loops parallel, ivdep)
USING PLUTO
- Use '#pragma scop' and '#pragma endscop' around the section of code
you want to parallelize/optimize.
- Then, run
./polycc <C source file> --parallel --tile
The output file will be named <original prefix>.pluto.c unless '-o
<filename>" is supplied. When --debug is used, the .cloog used to
generate code is not deleted and is named similarly.
Please refer to the documentation of Clan or PET for information on the
kind of code around which one can put '#pragma scop' and '#pragma
endscop'. Most of the time, although your program may not satisfy the
constraints, it may be possible to work around them.
COMMAND-LINE OPTIONS
-o output
Output to file 'output'. Without -o, name of the output file is
determined as described earlier (under 'Using PLUTO')
--help
List all available options with one-line summary
--pet Use 'pet' to extract polyhedral representation from the source
program instead of clan.
--tile [--second-level-tile]
Tile code; in addition, --second-level-tile will tile once more. By
default, --second-level-tile is disabled. Tile sizes can be forced
if needed from a file 'tile.sizes' (see below), otherwise, tile
sizes are set automatically using a rough heuristic. Tiling also
allows extraction of coarse-grained pipelined parallelism with the
Pluto model.
--intratileopt [enabled by default]
Optimize a tile's execution order for locality (spatial and temporal
reuse); the right loop permutation for a tile will be chosen, in particular,
the right innermost loop. Spatial locality is not otherwise captured
by Pluto's cost function.
--parallel
Parallelize code with OpenMP (usually useful when used with --tile)
--parallelize
Same as --parallel
--innnerpar
Prever inner parallelism over pipelined/wavefront parallelism obtained
via skewing whenever both exist
--multipar
Will enable extraction of multiple degrees of parallelism (from all
parallel loops). Disabled by default. By default, only one degree
of outer parallelism or coarse-grained pipelined parallelism is
extracted if it exists.
--smartfuse [default]
This is the default fusion heuristic. Will try to fuse between SCCs
of the same dimensionality.
--nofuse
Separate all strongly-connected components (SCCs) in the dependence
graphs to start with, i.e., no fusion across SCCs, and at any level
inside.
--maxfuse
This is geared towards maximal fusion, but maximal fusion is not
guaranteed. Fusion is done across SCCs.
--[no]unrolljam
Automatically identify and unroll-jam loops. Enabled by default.
--ufactor=<n>
Unroll or unroll-jam factor (default is 8). Note that if two loops
are unroll-jammed by factor n, you will get an nxn body.
--[no]prevector
Perform post-transformations to make the code amenable to
vectorization. Enabled by default.
--rar
Consider RAR dependences for optimization (increases running time by
a little). Disabled by default
--debug
Verbose information to give some insights into the algorithm.
Intermediate files are not deleted (like the program-readable
statement domains, dependences, pretty-printed dependences, the
.cloog file, etc.). For the format of these files, refer doc/DOC.txt
for a pointer.
--verbose
Higher level of output. ILP formulation constraints are
pretty-printed out dependence-wise, along with solution hyperplanes
at each level.
--indent
Indent generated code
--islsolve
Use ISL for solving ILPs
--isldep
See usage message (run polycc with no arguments)
--lastwriter
See usage message (run polycc with no arguments)
--readscoplib
Read input from a scoplib file
--context=<value>
An assertion from the user that all program parameters are greater than
or equal to <value>
--version
Print version number and exit
-q | --silent
UNIX-style silence: no output as long as everything goes fine
For more please see the output with '--help'.
Besides these, 'tile.sizes' and '.fst' files allow the user to force
certain things.
Other options will only make sense to power users. See comments in
src/pluto.h for details.
SPECIFYING CUSTOM TILE SIZES THROUGH 'tile.sizes'
A 'tile.sizes' file in the current working directory can be used to manually
specify tile sizes. Specify one tile size on each line and as many tile
sizes are there are hyperplanes in the outermost non-trivial permutable
band. When specifying tile sizes for multiple levels (with
--second-level-tile), first specify first level tile sizes, then
second:first tile size ratios. See examples/matmul/tile.sizes as an
example. If
8x128x8 is the first level tile size, and 128x256x128 for the second
level, the
tile.sizes file will be
# first level tile size 8x128x8
8
128
8
# second level is 16*8 x 2*128 x 16*8
16
2
16
The default tile size in the absence of a tile.sizes file is 32 (along
all dimensions), and the default second/first ratio is 8 (when using
--second-level-tile). The sizes specified correspond to transformed
loops in that order. For eg., for heat-3d, you'll see this output when
you run Pluto
$ ../../polycc 3d7pt.c
[...]
[pluto] Affine transformations [<iter coeff's> <param> <const>]
T(S1): (t, t+i, t+j, t+k)
loop types (loop, loop, loop, loop)
[...]
Hence, the tile sizes specified correspond to t, t+i, t+j, and t+k.
SETTING GOOD TILE SIZES
One would like to maximize locality while making sure there are enough
tiles to keep all cores busy. Rough thumb rules on selecting tile sizes
empirically are below.
(1) data set should fit in L2 roughly (or definitely within L3), (2) the
innermost dimension should have enough iterations, especially if it's
being vectorized so that (a) cleanup code if any doesn't hurt and (b)
prefetching provides benefits,
(3) for non-innermost dimensions from which temporal reuse is typically being
exploited, increasing tile size beyond 32 often provides diminishing
returns (additional reuse); so one may not want to experiment much
beyond 32. Each time tile.sizes is changed, code has to be regenerated
(rm -f *.tiled.c; make tiled; ./tiled).
If one wishes to carefully tune tile sizes, it may be good to first run
Pluto without tiling (without --tile) and check the number of iterations
in each of the loops in the tilable band of the generated code, and then
determine tile sizes.
TILE SIZES FOR DIAMOND TILING OF STENCILS
Wherever diamond tiling is performed, the same tile size should be
chosen along the first two dimensions of the tile band; otherwise, the
tile wavefront will no longer be parallel to the concurrent start face
and tile-wise concurrent start will be lost.
SPECIFYING A CUSTOM FUSION STRUCTURE THROUGH '.fst' or '.precut' file
One can force a particular fusion structure in two ways: either using
the .fst file or the .precut file. In either case, the file has to be in
your `present working directory'. If both files exist in the directory
polycc is run from, '.fst' takes precedence. If --nofuse is specified,
distribution will be done in spite of a .fst/.precut.
Alternative 1 (.fst file)
A component is a subset of statements. Components are disjoint, and the
union of all components is the set of all statements. So the set of all
statements is partitioned into components. Here's the format of the .fst
file:
<number of components>
# description of 1st component
<num of statements in component>
<ids of statements in first component - an id is b/w 0 and num_stmts-1>
<whether to tile this component -- 0 for no, 1 or more for yes>
# description of 2nd component
...
See examples/gemver/.fst as an example; here is the meaning of a sample
.fst for examples/gemver/gemver.c:
3 # number of components
1 # number of statements in first component
0 # statement id's of statements in 1st component (id is from 0 to num_stmts-1): so we have S0 here
0 # don't tile this component {S0}
2 # number of statements in second component
1 2 # stmts in this component are S1 and S2
256 # tile this component {S1,S2}
1 # number of statements in the third component
3 # stmt id 3, i.e., S3
0 # don't tile this {S3}
So, the above asks Pluto to distribute the statements at the outer
level as: {S0}, {S1,S2}, {S3}. It'll try to fuse as much as possible
within each component if you use --maxfuse along with the above .fst,
but will always respect the distribution specified.
Alternative 2
Using the '.precut' file
The '.precut' can be used to specify any partial transformation that Pluto
can then complete. This is a useful way to test or manually specify a
particular transformation, or to check if a particular transformation is a
valid one. The format of .precut is as below.
<number of statements>
<number of levels to tile>
# first statement
<number of rows, and number of columns for the partial transformation>
<rows of the partial transformation in Cloog-like scattering function
format. For eg., if the partial transformation is c1 = i+j+1 for a 2-d
statement with one program parameter, the row would be 0 1 1 0 1, the first
number is always a zero (for an equality), then 1 for i, 1 for j,
0 for the parameter, and 1 for the constant part>
<a number that's not currently used, specify anything>
<whether or not to tile each of the levels, 0 for `don't tile', any
other number for tiling; as many lines as the number of tiling levels
specified earlier>
# second statement
...
Parsing of any other text (like comments) in .fst or .precut is not
handled.
MORE ON POST-PROCESSING
--prevector will cause bounds of the loop to be vectorized replaced by
scalars and insert the ivdep and 'vector always' pragmas to allow
vectorization with ICC. GCC does not support these, and its
vectorization capalibility is currently very limited.
LOOKING AT THE TRANSFORMATION
Transformations are pretty-printed. The transformation is a
multi-dimensional affine function on a per-statement basis.
LOOKING AT THE GENERATED CODE
The best options for taking a good look at the generated code are:
<pluto_dir>/polycc source.c --noprevector --tile --parallel
or
<pluto_dir>/polycc source.c --noprevector
to just look at the transformation without the actual tiling and
parallelization performed; the transformation is just applied, and loop
fusion can be seen.
USING PLUTO AS A LIBRARY
See src/test_plutolib.c
To install libpluto on a system, run 'make install' from Pluto's top-level
directory. libpluto.{so,a} can be found in src/.libs/