phptxtfgets技巧_教程｜在Julia编程中实现GPU加速

文章目录 [+]

注册地址：https://nextjournal.com/signup

首先，什么是 GPU？

phptxtfgets技巧_教程｜在Julia编程中实现GPU加速

GPU 是一种大型并行处理器，有几千个并行处理单元。
例如，本文利用的 Tesla k80，能供应 4992 个并行 CUDA 核。
GPU 在频率、延迟和硬件性能方面与 CPU 有很大的不同，但实际上 Tesla k80 有点类似于具有 4992 核的慢速 CPU。

（图片来自网络侵删）

能够启动的并行线程可以大幅提升速率，但也令利用 GPU 变得更困难。
当利用这种未加处理的能量时，会涌现以下缺陷：

GPU 是一种有专属内存空间和不同架构的独立硬件。
因此，从 RAM 到 GPU 内存（VRAM，显存）的传输韶光很长。
乃至在 GPU 上启动内核（调用调度函数）也会带来很大的延迟，对付 GPU 而言是 10us 旁边，而对付 CPU 只有几纳秒。
在没有高等封装的情形下，建立内核会变得繁芜。
低精度是默认值，高精度的打算可以很随意马虎地肃清所有性能增益。
GPU 函数（内核）实质上是并行的，以是编写 GPU 内核不比编写并行 CPU 代码随意马虎，而且硬件上的差异增加了一定的繁芜性。
与上述情形干系的很多算法都不能很好地迁移到 GPU 上。
想要理解更多的细节，请看这篇博文：https://streamhpc.com/blog/2013-06-03/the application-areas-opencl- cuda-can- used/。
内核常日是用 C/ C++措辞编写的，但这并不是写算法的最好措辞。
CUDA 和 OpenCL 之间有差异，OpenCL 是编写底层 GPU 代码的紧张框架。
虽然 CUDA 只支持英伟达硬件，OpenCL 支持所有硬件，但并不风雅。
要看个人需求进行选择。

而 Julia 作为一种高等脚本措辞，许可在个中编写内核和环境代码，同时可在大多数 GPU 硬件上运行！

GPUArrays

大多数高度并行的算法都须要同时处理大量数据，以战胜所有的多线程和延迟损耗。
因此，大多数算法都须要数组来管理所有数据，这就须要一个好的 GPU 数组库作为关键的根本。

GPUArrays.jl 是 Julia 为此供应的根本。
它实现了一个专门用于高度并行硬件的抽象数组。
它包含了设置 GPU、启动 Julia GPU 函数、供应一些基本数组算法等所有必要功能。

抽象意味着它须要以 CuArrays 和 CLArrays 的形式实现。
由于继续了 GPUArrays 的所有功能，它们供应的接口完备相同。
唯一的差异涌如今分配数组时，这会逼迫用户决定这一数组是存在于 CUDA 还是 OpenCL 设备上。
关于这一点的更多信息，请参阅「内存」部分。

GPUArrays 有助于减少代码重复，由于它许可编写独立于硬件的 GPU 内核，这些内核可以通过 CuArrays 或 CLArrays 编译到本地的 GPU 代码。
因此，大多通用内核可以在从 GPUArrays 继续的所有包之间共享。

选择小贴士：CuArrays 只支持 Nvidia GPU，而 CLArrays 支持大多数可用的 GPU。
CuArrays 比 CLArrays 更稳定，可以在 Julia 0.7 上利用。
速率上两者大同小异。
我建议都试一试，看看哪种最有效。

本文中，我将选择 CuArrays，由于本文是在 Julia 0.7 / 1.0 上编写的，CLArrays 暂不支持。

性能

用一个大略的交互式代码示例来快速解释：为了打算 julia 凑集（曼德勃罗凑集），我们必须要将打算转移到 GPU 上。

using CuArrays, FileIO, Colors, GPUArrays, BenchmarkToolsusing CuArrays: CuArray\公众\"大众\"大众The function calculating the Julia set\"大众\"大众\"大众function juliaset(z0, maxiter) c = ComplexF32(-0.5, 0.75) z = z0 for i in 1:maxiter abs2(z) > 4f0 && return (i - 1) % UInt8 z = z z + c end return maxiter % UInt8 # % is used to convert without overflow checkendrange = 100:50:2^12cutimes, jltimes = Float64[], Float64[]function run_bench(in, out) # use dot syntax to apply `juliaset` to each elemt of q_converted # and write the output to result out .= juliaset.(in, 16) # all calls to the GPU are scheduled asynchronous, # so we need to synchronize GPUArrays.synchronize(out)end# store a reference to the last results for plottinglast_jl, last_cu = nothing, nothingfor N in range w, h = N, N q = [ComplexF32(r, i) for i=1:-(2.0/w):-1, r=-1.5:(3.0/h):1.5] for (times, Typ) in ((cutimes, CuArray), (jltimes, Array)) # convert to Array or CuArray - moving the calculation to CPU/GPU q_converted = Typ(q) result = Typ(zeros(UInt8, size(q))) for i in 1:10 # 5 samples per size # benchmarking macro, all variables need to be prefixed with $ t = Base.@elapsed begin run_bench(q_converted, result) end global last_jl, last_cu # we're in local scope if result isa CuArray last_cu = result else last_jl = result end push!(times, t) end endendcu_jl = hcat(Array(last_cu), last_jl)cmap = colormap(\"大众Blues\公众, 16 + 1)color_lookup(val, cmap) = cmap[val + 1]save(\"大众results/juliaset.png\"大众, color_lookup.(cu_jl, (cmap,)))

using Plots; plotly()x = repeat(range, inner = 10)speedup = jltimes ./ cutimesPlots.scatter( log2.(x), [speedup, fill(1.0, length(speedup))], label = [\"大众cuda\"大众 \"大众cpu\"大众], markersize = 2, markerstrokewidth = 0, legend = :right, xlabel = \"大众2^N\公众, ylabel = \"大众speedup\"大众)

对付大型数组，通过将打算转移到 GPU，可以稳定地将速率提高 60-80 倍。
得到此加速和将 Julia 数组转换为 GPUArray 一样大略。

有人可能认为 GPU 性能会受到像 Julia 这样的动态措辞影响，但 Julia 的 GPU 性能该当与 CUDA 或 OpenCL 的原始性能相称。
Tim Besard 在集成 LLVM Nvidia 编译流程方面做得很好，能够实现与纯 CUDA C 措辞代码相同（有时乃至更好）的性能。
他在博客（https://devblogs.nvidia.com/gpu-computing-julia-programming-language/）中作了进一步阐明。
CLArrays 方法有点不同，它直接从 Julia 天生 OpenCL C 代码，代码性能与 OpenCL C 相同！

为了更好地理解性能并与多线程 CPU 代码进行比对，我整理了一些基准：https://github.com/JuliaGPU/GPUBenchmarks.jl/blob/master/results/results.md

内存

GPU 具有自己的存储空间，包括显存（VRAM）、不同的高速缓存和寄存器。
无论做什么，运行前都要先将 Julia 工具转移到 GPU。
并非 Julia 中的所有类型都可以在 GPU 上运行。

首先让我们看一下 Julia 的类型：

struct Test # an immutable struct# that only contains other immutable, which makes # isbitstype(Test) == true x::Float32 end# the isbits property is important, since those types can be used# without constraints on the GPU!@assert isbitstype(Test) == truex = (2, 2)isa(x, Tuple{Int, Int}) # tuples are also immutablemutable struct Test2 #-> mutable, isbits(Test2) == false x::Float32endstruct Test3 # contains a heap allocation/ reference, not isbits x::Vector{Float32} y::Test2 # Test2 is mutable and also heap allocated / a referenceendVector{Test} # <- An Array with isbits elements is contigious in memoryVector{Test2} # <- An Array with mutable elements is basically an array of heap pointers. Since it just contains cpu heap pointers, it won't work on the GPU.

\"大众Array{Test2,1}\"大众

所有这些 Julia 类型在传输到 GPU 或在 GPU 上创建时表现不同。
下表概述了预期结果：

创建位置描述工具是在 CPU 上创建的，然后转移到 GPU 内核上，或者本身就由内核内部的 GPU 创建。
该表显示创建类型的实例是否可行，对付从 CPU 到 GPU 的转移，该表还解释了工具是否能通过参照进行复制或通报。

垃圾网络

当利用 GPU 时，要把稳 GPU 上没有垃圾网络器（GC）。
这不会造成太大影响，由于写入 GPU 的高性能内核不应该创建任何 GC-跟踪的内存作为起始。

在 GPU 上实现 GC 不无可能，但请记住，每个实行内核都是大规模并行的。
在大约 1000 个 gpu 线程中的每一个创建和跟踪大量堆内存就会立时毁坏性能增益，因此实现 GC 是得不偿失落的。

利用 GPUArrays 可以作为在内核等分配数组的替代方法。
GPUArray 布局函数将创建 GPU 缓冲区并将数据转移到 VRAM。
如果调用 Array(gpu_array)，数组将被转移回 RAM，变为普通的 Julia 数组。
这些 gpu 数组的 Julia 操作由 Julia 的 GC 跟踪，如果不再利用，GPU 内存将被开释。

因此，只能在设备上利用堆栈分配，并且只能被其他的预先分配的 GPU 缓冲区利用。
由于转移代价很高，因此在编写 GPU 时，每每要尽可能重用和预分配。

GPUArray 布局函数

using CuArrays, LinearAlgebra# GPU Arrays can be constructed from all Julia arrays containing isbits types!A1D = cu([1, 2, 3]) # cl for CLArraysA1D = fill(CuArray{Int}, 0, (100,)) # CLArray for CLArrays# Float32 array - Float32 is usually preferred and can be up to 30x faster on most GPUs than Float64diagonal_matrix = CuArray{Float32}(I, 100, 100)filled = fill(CuArray, 77f0, (4, 4, 4)) # 3D array filled with Float32 77randy = rand(CuArray, Float32, 42, 42) # random numbers generated on the GPU# The array constructor also accepts isbits iterators with a known size# Note, that since you can also pass isbits types to a gpu kernel directly, in most cases you won't need to materialize them as an gpu arrayfrom_iter = CuArray(1:10)# let's create a point type to further illustrate what can be done:struct Point x::Float32 y::Float32endBase.convert(::Type{Point}, x::NTuple{2, Any}) = Point(x[1], x[2])# because we defined the above convert from a tuple to a point# [Point(2, 2)] can be written as Point[(2,2)] since all array # elements will get converted to Pointcustom_types = cu(Point[(1, 2), (4, 3), (2, 2)])typeof(custom_types)

\"大众CuArray{point,1}\"大众

数组操作

我们已经定义了许多操作。
最主要的是，GPUArrays 支持 Julia 的领悟点广播表示法（fusing dot broadcasting notation）。
此表示法许可你将函数运用于数组的每个元素，并利用 f 的返回值创建新数组。
此功能常日称为映射（map）。
broadcast 指的是形状互异的数组被 broadcast 成相同形状。

事情事理如下：

x = zeros(4, 4) # 4x4 array of zerosy = zeros(4) # 4 element arrayz = 2 # a scalar# y's 1st dimension gets repeated for the 2nd dimension in x# and the scalar z get's repeated for all dimensions# the below is equal to `broadcast(+, broadcast(+, xx, y), z)`x .+ y .+ z

发生「领悟」是由于 Julia 编译器会重写该表达式为一个通报调用树的 lazy broadcast 调用，然后可以在循环遍历数组之前将全体调用树领悟到一个函数中。

如果你想要更详细的理解 broadcast，可以看该指南：julia.guide/broadcasting。

这意味着在不分配堆内存（仅创建 isbits 类型）的情形下运行的任何 Julia 函数，都可以运用于 GPUArray 的每个元素，并且多点调用会领悟到一个内核调用中。
由于内核调用会有很大延迟，以是这种领悟是一个非常主要的优化。

using CuArraysA = cu([1, 2, 3])B = cu([1, 2, 3])C = rand(CuArray, Float32, 3)result = A .+ B .- Ctest(a::T) where T = a convert(T, 2) # convert to same type as `a`# inplace broadcast, writes directly into `result`result .= test.(A) # custom function work# The cool thing is that this composes well with custom types and custom functions.# Let's go back to our Point type and define addition for itBase.:(+)(p1::Point, p2::Point) = Point(p1.x + p2.x, p1.y + p2.y)# now this works:custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])# This particular example also shows the power of broadcasting: # Non array types are broadcasted and repeated for the whole lengthresult = custom_types .+ Ref(Point(2, 2))# So the above is equal to (minus all the allocations):# this allocates a new array on the gpu, which we can avoid with the above broadcastbroadcasted = fill(CuArray, Point(2, 2), (3,))result == custom_types .+ broadcasted

true

GPUArrays 支持更多操作：

实现 GPU 数组转换为 CPU 数组和复制多维索引和切片 (xs[1:2, 5, :])permutedims串联 (vcat(x, y), cat(3, xs, ys, zs))映射，领悟 broadcast(zs .= xs.^2 .+ ys . 2)添补 (CuArray, 0f0, dims)，添补 (gpu_array, 0)减小尺寸 (reduce(+, xs, dims = 3), sum(x -> x^2, xs, dims = 1)缩减为标量 (reduce(, xs), sum(xs), prod(xs))各种 BLAS 操作 (matrixmatrix, matrixvector)FFT，利用与 julia 的 FFT 相同的 API

GPUArrays 实际运用

让我们直接看一些很酷的实例。

GPU 加速烟雾仿照器是由 GPUArrays + CLArrays 创建的，可在 GPU 或 CPU 上运行，GPU 版本速率提升 15 倍：

还有更多的例子，包括求微分方程、FEM 仿照和求解偏微分方程。

演示地址：https://juliagpu.github.io/GPUShowcases.jl/latest/index.html

让我们通过一个大略的机器学习示例，看看如何利用 GPUArrays：

using Flux, Flux.Data.MNIST, Statisticsusing Flux: onehotbatch, onecold, crossentropy, throttleusing Base.Iterators: repeated, partitionusing CuArrays# Classify MNIST digits with a convolutional networkimgs = MNIST.images()labels = onehotbatch(MNIST.labels(), 0:9)# Partition into batches of size 1,000train = [(cat(float.(imgs[i])..., dims = 4), labels[:,i]) for i in partition(1:60_000, 1000)]use_gpu = true # helper to easily switch between gpu/cputodevice(x) = use_gpu ? gpu(x) : xtrain = todevice.(train)# Prepare test set (first 1,000 images)tX = cat(float.(MNIST.images(:test)[1:1000])..., dims = 4) |> todevicetY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> todevicem = Chain( Conv((2,2), 1=>16, relu), x -> maxpool(x, (2,2)), Conv((2,2), 16=>8, relu), x -> maxpool(x, (2,2)), x -> reshape(x, :, size(x, 4)), Dense(288, 10), softmax) |> todevicem(train[1][1])loss(x, y) = crossentropy(m(x), y)accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))evalcb = throttle(() -> @show(accuracy(tX, tY)), 10)opt = ADAM(Flux.params(m));

# trainfor i = 1:10 Flux.train!(loss, train, opt, cb = evalcb)end

using Colors, FileIO, ImageShowN = 22img = tX[:, :, 1:1, N:N]println(\"大众Predicted: \公众, Flux.onecold(m(img)) .- 1)Gray.(collect(tX[:, :, 1, N]))

只需将数组转换为 GPUArrays（利用 gpu(array），就可以将全体打算移动到 GPU 并得到可不雅观的速率提升。
这要归功于 Julia 繁芜的 AbstractArray 根本架构，使 GPUArray 可以无缝集成。
随后，如果省略转换为 GPUArray 这一步，代码会按普通的 Julia 数组处理，但仍在 CPU 上运行。
可以考试测验将 use_gpu = true 改为 use_gpu = false，重新运行初始化和演习单元格。
比拟 GPU 和 CPU，CPU 运行韶光为 975 秒，GPU 运行韶光为 29 秒，速率提升约 33 倍。

另一个上风是为了有效地支持神经网络的反向传播，GPUArrays 无需明确地实现自动微分。
这是由于 Julia 的自动微分库适用于任意函数，并存有可在 GPU 上高效运行的代码。
这样即可利用最少的开拓职员就能在 GPU 上实现 Flux，并使 Flux GPU 能够高效实现用户定义的功能。
这种开箱即用的 GPUArrays + Flux 不须要折衷，这是 Julia 的一大特点，详细阐明如下：为什么 Numba 和 Cython 不能代替 Julia（http://www.stochasticlifestyle.com/why）。

编写 GPU 内核

一样平常情形，只利用 GPUArrays 的通用抽象数组接口即可，而不须要编写任何 GPU 内核。
但是有些时候，可能须要在 GPU 上实现一个无法通过一样平常数组算法组合表示的算法。

好是，GPUArrays 通过分层法肃清了大量事情，可以实现从高等代码开始，编写类似于大多数 OpenCL / CUDA 示例的低级内核。
同时可以在 OpenCL 或 CUDA 设备上实行内核，从而提取出这些框架中的所有差异。

实现上述功能的函数名为 gpu_call。
调用语句为 gpu_call(kernel, A::GPUArray, args)，在 GPU 上利用参数 (state, args...) 调用 kernel。
State 是一个用于实现获取线程索引等功能的后端特定工具。
GPUArray 须要作为第二个参数通报，以分配到精确的后端并供应启动参数的默认值。

让我们利用 gpu_call 来实现一个大略的映射内核：

using GPUArrays, CuArrays# Overloading the Julia Base map! function for GPUArraysfunction Base.map!(f::Function, A::GPUArray, B::GPUArray) # our function that will run on the gpu function kernel(state, f, A, B) # If launch parameters aren't specified, linear_index gets the index # into the Array passed as second argument to gpu_call (`A`) i = linear_index(state) if i <= length(A) @inbounds A[i] = f(B[i]) end return end # call kernel on the gpu gpu_call(kernel, A, (f, A, B))end

大略来说，这将在 GPU 上并行调用 julia 函数 kernel length(A) 次。
kernel 的每个并行调用都有一个线程索引，可以利用它索引到数组 A 和 B。
如果打算索引时没有利用 linear_index，就须要确保没有多个线程读取和写入相同的数组位置。
因此，如果在纯 Julia 中利用线程编写，可等效如下：

using BenchmarkToolsfunction threadded_map!(f::Function, A::Array, B::Array) Threads.@threads for i in 1:length(A) A[i] = f(B[i]) end Aendx, y = rand(10^7), rand(10^7)kernel(y) = (y / 33f0) (732.f0/y)# on the cpu without threads:single_t = @belapsed map!($kernel, $x, $y)# \"大众on the CPU with 4 threads (2 real cores):thread_t = @belapsed threadded_map!($kernel, $x, $y)# on the GPU:xgpu, ygpu = cu(x), cu(y)gpu_t = @belapsed begin map!($kernel, $xgpu, $ygpu) GPUArrays.synchronize($xgpu)endtimes = [single_t, thread_t, gpu_t]speedup = maximum(times) ./ timesprintln(\"大众speedup: $speedup\公众)bar([\"大众1 core\"大众, \"大众2 cores\"大众, \公众gpu\"大众], speedup, legend = false, fillcolor = :grey, ylabel = \"大众speedup\"大众)

由于该函数未实现过多内容，也得不到更多的扩展，但线程化和 GPU 版本仍旧有一个很好的加速。

GPU 与线程示例比较，能显示更繁芜的内容，由于硬件线程因此线程块的形式分布的，gpu_call 是从大略版本中提取出来的，但它也可以用于更繁芜的启动配置：

using CuArraysthreads = (2, 2)blocks = (2, 2)T = fill(CuArray, (0, 0), (4, 4))B = fill(CuArray, (0, 0), (4, 4))gpu_call(T, (B, T), (blocks, threads)) do state, A, B # those names pretty much refer to the cuda names b = (blockidx_x(state), blockidx_y(state)) bdim = (blockdim_x(state), blockdim_y(state)) t = (threadidx_x(state), threadidx_y(state)) idx = (bdim . (b .- 1)) .+ t A[idx...] = b B[idx...] = t returnendprintln(\公众Threads index: \n\公众, T)println(\公众Block index: \n\"大众, B)

上面的示例中启动配置的迭代顺序更繁芜。
确定得当的迭代+启动配置对付实现最优 GPU 性能至关主要。
很多关于 CUDA 和 OpenCL 的 GPU 教程都非常详细地阐明了这一点，在 Julia 中编程 GPU 时这些事理是相通的。

结论

Julia 为高性能的天下带来了可组合的高等编程。
现在是时候为 GPU 做同样的事了。

希望 Julia 能降落人们在 GPU 编程的门槛，我们可以为开源 GPU 打算开拓可扩展的平台。
第一个成功案例是通过 Julia 软件包实现自动微分解决方案，这些软件包乃至都不是为 GPU 编写的，因此可以相信 Julia 在 GPU 打算领域的扩展性和通用设计中一定会大放异彩。