"""
```julia
EnsembleCPUArray()
```
An `EnsembleArrayAlgorithm` which utilizes CPU kernels to parallelize each ODE solve,
with a separate ODE integrator on each kernel. This method is meant to be a debugging
counterpart to `EnsembleGPUArray`: it has the same behavior and uses the same
KernelAbstractions.jl process to build the combined ODE, but without the restriction that
`f` be a GPU-compatible kernel function.
It is unlikely that this method is useful beyond library development and debugging, as
almost any case should be faster with `EnsembleThreads` or `EnsembleDistributed`.
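## Example
A minimal debugging sketch (the Lorenz setup is illustrative; it mirrors the
`EnsembleGPUArray` example but runs entirely on the CPU):
```julia
using DiffEqGPU, OrdinaryDiffEq
function lorenz(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end
u0 = Float32[1.0; 0.0; 0.0]
prob = ODEProblem(lorenz, u0, (0.0f0, 100.0f0), [10.0f0, 28.0f0, 8 / 3.0f0])
monteprob = EnsembleProblem(prob, safetycopy = false)
sol = solve(monteprob, Tsit5(), EnsembleCPUArray(), trajectories = 100, saveat = 1.0f0)
```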
"""
struct EnsembleCPUArray <: EnsembleArrayAlgorithm end
"""
```julia
EnsembleGPUArray(backend, cpu_offload = 0.2)
```
An `EnsembleArrayAlgorithm` which utilizes GPU kernels to parallelize each ODE solve,
with a separate ODE integrator on each kernel.
## Positional Arguments
- `backend`: the KernelAbstractions backend for performing the computation.
- `cpu_offload`: the percentage of trajectories to offload to the CPU. Default is 0.2 or
20% of trajectories.
## Limitations
`EnsembleGPUArray` requires being able to generate a kernel for `f` using
KernelAbstractions.jl and solving the resulting ODE defined over `CuArray` input types.
This introduces the following limitations on its usage:
- Not all standard Julia `f` functions are allowed. Only Julia `f` functions which are
capable of being compiled into a GPU kernel are allowed. This notably means that
  certain features of Julia can cause issues inside of a kernel, like:
+ Allocating memory (building arrays)
+ Linear algebra (anything that calls BLAS)
+ Broadcast
- Not all ODE solvers are allowed, only those from OrdinaryDiffEq.jl. The tested feature
set from OrdinaryDiffEq.jl includes:
+ Explicit Runge-Kutta methods
+ Implicit Runge-Kutta methods
+ Rosenbrock methods
+ DiscreteCallbacks and ContinuousCallbacks
- Stiff ODE solvers require analytical implementations of every derivative function they
  use. For example, Rosenbrock methods require the Jacobian and the gradient with respect
  to time, so these two functions must be supplied. Note that they can be generated by the
  [modelingtoolkitize](https://docs.juliadiffeq.org/latest/tutorials/advanced_ode_example/#Automatic-Derivation-of-Jacobian-Functions-1)
  approach.
- To use multiple GPUs over clusters, one must manually set up one process per GPU. See the
multi-GPU tutorial for more details.
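For instance, the Jacobian and time-gradient needed by Rosenbrock methods can be attached
through an `ODEFunction` (a sketch, assuming the `lorenz` function from the example below;
the derivative implementations here are illustrative):
```julia
function lorenz_jac(J, u, p, t)
    J[1, 1] = -p[1]
    J[1, 2] = p[1]
    J[1, 3] = 0
    J[2, 1] = p[2] - u[3]
    J[2, 2] = -1
    J[2, 3] = -u[1]
    J[3, 1] = u[2]
    J[3, 2] = u[1]
    J[3, 3] = -p[3]
end
lorenz_tgrad(dT, u, p, t) = (dT .= 0)  # autonomous system: no explicit time dependence
func = ODEFunction(lorenz, jac = lorenz_jac, tgrad = lorenz_tgrad)
prob_jac = ODEProblem(func, u0, tspan, p)
```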
!!! warning

    Callbacks with `terminate!` do not work well with `EnsembleGPUArray`, as the entire
    integration will halt when any of the trajectories halt. Use with caution.
## Example
```julia
using DiffEqGPU, CUDA, OrdinaryDiffEq
function lorenz(du, u, p, t)
du[1] = p[1] * (u[2] - u[1])
du[2] = u[1] * (p[2] - u[3]) - u[2]
du[3] = u[1] * u[2] - p[3] * u[3]
end
u0 = Float32[1.0; 0.0; 0.0]
tspan = (0.0f0, 100.0f0)
p = [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = rand(Float32, 3) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
@time sol = solve(monteprob, Tsit5(), EnsembleGPUArray(CUDA.CUDABackend()),
trajectories = 10_000, saveat = 1.0f0)
```
"""
struct EnsembleGPUArray{Backend} <: EnsembleArrayAlgorithm
backend::Backend
cpu_offload::Float64
end
"""
```julia
EnsembleGPUKernel(backend, cpu_offload = 0.0)
```
A massively-parallel ensemble algorithm which generates a unique GPU kernel for the entire
ODE which is then executed. This leads to very low-overhead GPU code generation, but
imposes some extra limitations on its use.
## Positional Arguments
- `backend`: the KernelAbstractions backend for performing the computation.
- `cpu_offload`: the percentage of trajectories to offload to the CPU. Default is 0.0 or
0% of trajectories.
## Limitations
- Not all standard Julia `f` functions are allowed. Only Julia `f` functions which are
capable of being compiled into a GPU kernel are allowed. This notably means that
certain features of Julia can cause issues inside a kernel, like:
+ Allocating memory (building arrays)
+ Linear algebra (anything that calls BLAS)
+ Broadcast
- Only out-of-place `f` definitions are allowed. Coupled with the requirement of not
allowing for memory allocations, this means that the ODE must be defined with
`StaticArray` initial conditions.
- Only specific ODE solvers are allowed. This includes:
+ GPUTsit5
+ GPUVern7
+ GPUVern9
- To use multiple GPUs over clusters, one must manually set up one process per GPU. See the
multi-GPU tutorial for more details.
## Example
```julia
using DiffEqGPU, CUDA, OrdinaryDiffEq, StaticArrays
function lorenz(u, p, t)
σ = p[1]
ρ = p[2]
β = p[3]
du1 = σ * (u[2] - u[1])
du2 = u[1] * (ρ - u[3]) - u[2]
du3 = u[1] * u[2] - β * u[3]
return SVector{3}(du1, du2, du3)
end
u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
    trajectories = 10_000, adaptive = false, dt = 0.1f0)
```
"""
struct EnsembleGPUKernel{Dev} <: EnsembleKernelAlgorithm
dev::Dev
cpu_offload::Float64
end
# Maps each GPU kernel solver type to its (fixed-step, adaptive) CPU counterparts, used
# for the trajectories offloaded to the CPU via `cpu_offload`
cpu_alg = Dict(GPUTsit5 => (GPUSimpleTsit5(), GPUSimpleATsit5()),
    GPUVern7 => (GPUSimpleVern7(), GPUSimpleAVern7()),
    GPUVern9 => (GPUSimpleVern9(), GPUSimpleAVern9()))
# Work around the fact that Zygote cannot handle the task system
# Work around the fact that Zygote's `isderiving` fails with constants?
function EnsembleGPUArray(dev)
EnsembleGPUArray(dev, 0.2)
end
function EnsembleGPUKernel(dev)
EnsembleGPUKernel(dev, 0.0)
end
function ChainRulesCore.rrule(::Type{<:EnsembleGPUArray})
EnsembleGPUArray(0.0), _ -> NoTangent()
end
ZygoteRules.@adjoint function EnsembleGPUArray(dev)
EnsembleGPUArray(dev, 0.0), _ -> nothing
end