doc/DOC.txt

OVERVIEW

PLUTO is an automatic parallelization tool that is based on the 
polyhedral model. The Polyhedral model is a geometrical representation 
for programs that utilizes machinery from Linear Algebra and Linear 
Programming for analysis and high-level transformations. Pluto 
transforms C programs from source to source for coarse-grained 
parallelism and locality simultaneously. The core transformation 
framework mainly works by finding affine transformations for efficient 
tiling and fusion, but not limited to it. OpenMP parallel code for 
multicores can be automatically generated from sequential C program 
sections. Pluto currently lacks intra-tile interchange (supposed to be 
done post-transformation) for spatial locality, prefetching, etc.. This 
will be implemented in future.

This is the chain of the entire source-to-source system that polycc will  
run. 

C code --> Polyhedral extraction -->  Dependence analysis
           (clan or PET)                (ISL or candl)

-->    Pluto transformer
      (core Pluto algorithm + post transformation)

 --> CLooG                     -->       C (with OpenMP, ivdep pragmas)
 (cloog + clast processing
 to mark loops parallel, ivdep)


USING PLUTO

- Use '#pragma scop' and '#pragma endscop' around the section of code 
  you want to parallelize/optimize.

- Then, run

    ./polycc <C source file> --parallel --tile

    The output file will be named <original prefix>.pluto.c unless '-o
    <filename>" is supplied. When --debug is used, the .cloog used to
    generate code is not deleted and is named similarly.

Please refer to the documentation of Clan or PET for information on the
kind of code around which one can put '#pragma scop' and '#pragma
endscop'.  Most of the time, although your program may not satisfy the
constraints, it may be possible to work around them.


COMMAND-LINE OPTIONS
    -o output
    Output to file 'output'. Without -o, name of the output file is
    determined as described earlier (under 'Using PLUTO')

    --help
    List all available options with one-line summary

    --pet Use 'pet' to extract polyhedral representation from the source
    program instead of clan.

    --tile [--second-level-tile]
    Tile code; in addition, --second-level-tile will tile once more. By
    default, --second-level-tile is disabled. Tile sizes can be forced
    if needed from a file 'tile.sizes' (see below), otherwise, tile
      sizes are set automatically using a rough heuristic.  Tiling also
      allows extraction of coarse-grained pipelined parallelism with the
      Pluto model.

    --intratileopt  [enabled by default]
    Optimize a tile's execution order for locality (spatial and temporal
    reuse); the right loop permutation for a tile will be chosen, in particular,
    the right innermost loop. Spatial locality is not otherwise captured
    by Pluto's cost function.

    --parallel
    Parallelize code with OpenMP (usually useful when used with --tile)

    --parallelize
    Same as --parallel

    --innnerpar
    Prever inner parallelism over pipelined/wavefront parallelism obtained
    via skewing whenever both exist

    --multipar
    Will enable extraction of multiple degrees of parallelism (from all
    parallel loops).  Disabled by default. By default, only one degree
    of outer parallelism or coarse-grained pipelined parallelism is
    extracted if it exists.

    --smartfuse [default]
    This is the default fusion heuristic. Will try to fuse between SCCs
    of the same dimensionality.

    --nofuse
    Separate all strongly-connected components (SCCs) in the dependence
    graphs to start with, i.e., no fusion across SCCs, and at any level
    inside.

    --maxfuse
    This is geared towards maximal fusion, but maximal fusion is not
    guaranteed. Fusion is done across SCCs.

    --[no]unrolljam
    Automatically identify and unroll-jam loops. Enabled by default.

    --ufactor=<n>
    Unroll or unroll-jam factor (default is 8). Note that if two loops
    are unroll-jammed by factor n, you will get an nxn body.

    --[no]prevector
    Perform post-transformations to make the code amenable to 
    vectorization. Enabled by default.

    --rar
    Consider RAR dependences for optimization (increases running time by 
    a little). Disabled by default

    --debug
    Verbose information to give some insights into the algorithm.  
    Intermediate files are not deleted (like the program-readable 
    statement domains, dependences, pretty-printed dependences, the 
    .cloog file, etc.). For the format of these files, refer doc/DOC.txt 
    for a pointer.

    --verbose
    Higher level of output. ILP formulation constraints are 
    pretty-printed out dependence-wise, along with solution hyperplanes 
    at each level. 

    --indent
    Indent generated code

    --islsolve
    Use ISL for solving ILPs

    --isldep
    See usage message (run polycc with no arguments)

    --lastwriter
    See usage message (run polycc with no arguments)

    --readscoplib
    Read input from a scoplib file

    --context=<value>
    An assertion from the user that all program parameters are greater than 
    or equal to <value>

    --version
    Print version number and exit

	-q | --silent
	UNIX-style silence: no output as long as everything goes fine

    For more please see the output with '--help'.

    Besides these, 'tile.sizes' and '.fst' files allow the user to force 
    certain things.

    Other options will only make sense to power users. See comments in 
    src/pluto.h for details.


SPECIFYING CUSTOM TILE SIZES THROUGH 'tile.sizes'

A 'tile.sizes' file in the current working directory can be used to manually 
specify tile sizes. Specify one tile size on each line and as many tile 
sizes are there are hyperplanes in the outermost non-trivial permutable 
band. When specifying tile sizes for multiple levels (with 
--second-level-tile), first specify first level tile sizes, then 
second:first tile size ratios.  See examples/matmul/tile.sizes as an 
example.  If
8x128x8 is the first level tile size, and 128x256x128 for the second 
level, the
tile.sizes file will be

# first level tile size 8x128x8
8
128
8
# second level is 16*8 x 2*128 x 16*8
16
2
16

The default tile size in the absence of a tile.sizes file is 32 (along 
all dimensions), and the default second/first ratio is 8 (when using 
--second-level-tile). The sizes specified correspond to transformed 
loops in that order.  For eg., for heat-3d, you'll see this output when 
you run Pluto

$ ../../polycc 3d7pt.c

[...]

[pluto] Affine transformations [<iter coeff's> <param> <const>]

T(S1): (t, t+i, t+j, t+k)
loop types (loop, loop, loop, loop)

[...]

Hence, the tile sizes specified correspond to t, t+i, t+j, and t+k.


SETTING GOOD TILE SIZES

One would like to maximize locality while making sure there are enough
tiles to keep all cores busy. Rough thumb rules on selecting tile sizes
empirically are below.

(1) data set should fit in L2 roughly (or definitely within L3), (2) the
innermost dimension should have enough iterations, especially if it's
being vectorized so that (a) cleanup code if any doesn't hurt and (b)
prefetching provides benefits,
(3) for non-innermost dimensions from which temporal reuse is typically being
exploited, increasing tile size beyond 32 often provides diminishing 
returns (additional reuse); so one may not want to experiment much 
beyond 32. Each time tile.sizes is changed, code has to be regenerated 
(rm -f *.tiled.c; make tiled; ./tiled).

If one wishes to carefully tune tile sizes, it may be good to first run 
Pluto without tiling (without --tile) and check the number of iterations 
in each of the loops in the tilable band of the generated code, and then 
determine tile sizes.

TILE SIZES FOR DIAMOND TILING OF STENCILS 

Wherever diamond tiling is performed, the same tile size should be 
chosen along the first two dimensions of the tile band; otherwise, the 
tile wavefront will no longer be parallel to the concurrent start face 
and tile-wise concurrent start will be lost.


SPECIFYING A CUSTOM FUSION STRUCTURE THROUGH '.fst' or '.precut' file

One can force a particular fusion structure in two ways: either using 
the .fst file or the .precut file. In either case, the file has to be in 
your `present working directory'.  If both files exist in the directory 
polycc is run from, '.fst' takes precedence. If --nofuse is specified, 
distribution will be done in spite of a .fst/.precut.

Alternative 1 (.fst file)

A component is a subset of statements. Components are disjoint, and the
union of all components is the set of all statements. So the set of all
statements is partitioned into components. Here's the format of the .fst 
file:

<number of components>
# description of 1st component
<num of statements in component>
<ids of statements in first component - an id is b/w 0 and num_stmts-1>
<whether to tile this component -- 0 for no, 1 or more for yes>
# description of 2nd component
...

See examples/gemver/.fst as an example; here is the meaning of a sample 
.fst for examples/gemver/gemver.c:

3 # number of components
1 # number of statements in first component
0 # statement id's of statements in 1st component (id is from 0 to   num_stmts-1): so we have S0 here
0 # don't tile this component {S0}
2 # number of statements in second component
1 2  # stmts in this component are S1 and S2
256 # tile this component {S1,S2}
1 # number of statements in the third component
3 # stmt id 3, i.e., S3 
0 # don't tile this {S3}

So, the above asks Pluto to distribute the statements at the outer
level as: {S0}, {S1,S2}, {S3}. It'll try to fuse as much as possible 
within each component if you use --maxfuse along with the above .fst, 
but will always respect the distribution specified.

Alternative 2

Using the '.precut' file

The '.precut' can be used to specify any partial transformation that Pluto 
can then complete. This is a useful way to test or manually specify a 
particular transformation, or to check if a particular transformation is a 
valid one. The format of .precut is as below.

<number of statements>
<number of levels to tile>
# first statement
<number of rows, and number of columns for the partial transformation>
<rows of the partial transformation in Cloog-like scattering function 
format. For eg.,  if the partial transformation is c1 = i+j+1 for a 2-d 
statement with one program parameter, the row would be 0 1 1 0 1, the first 
number is always a zero (for an equality), then 1 for i, 1 for j, 
0 for the parameter, and 1 for the constant part>
<a number that's not currently used, specify anything>
<whether or not to tile each of the levels, 0 for `don't tile', any 
other number for tiling; as many lines as the number of tiling levels
specified earlier>
# second statement
...

Parsing of any other text (like comments) in .fst or .precut is not 
handled.


MORE ON POST-PROCESSING

--prevector will cause bounds of the loop to be vectorized replaced by 
scalars and insert the ivdep and 'vector always' pragmas to allow 
vectorization with ICC. GCC does not support these, and its 
vectorization capalibility is currently very limited.


LOOKING AT THE TRANSFORMATION

Transformations are pretty-printed. The transformation is a 
multi-dimensional affine function on a per-statement basis.


LOOKING AT THE GENERATED CODE

The best options for taking a good look at the generated code are:

<pluto_dir>/polycc source.c --noprevector --tile --parallel

or

<pluto_dir>/polycc source.c --noprevector

to just look at the transformation without the actual tiling and 
parallelization performed; the transformation is just applied, and loop 
fusion can be seen.

USING PLUTO AS A LIBRARY

See src/test_plutolib.c

To install libpluto on a system, run 'make install' from Pluto's top-level 
directory.  libpluto.{so,a} can be found in src/.libs/