
Chapter 12 THE COMPUTE SHADER


GPUs have been optimized to process a large amount of memory from a single location or sequential locations (so-called “streaming operation”); this is in contrast to a CPU designed for random memory accesses [Boyd10]. Moreover, because vertices and pixels are independently processed, GPUs have been architected to be massively parallel; for example, the NVIDIA “Fermi” architecture supports up to 16 streaming multiprocessors of 32 CUDA cores for a total of 512 CUDA cores [NVIDIA09].


IP属地:云南1楼2014-03-31 08:34回复
Obviously graphics benefit from this GPU architecture, as the architecture was designed for graphics. However, some non-graphical applications benefit from the massive amount of computational power a GPU can provide with its parallel architecture. Using the GPU for non-graphical applications is called general purpose GPU (GPGPU) programming. Not all algorithms are ideal for a GPU implementation; GPUs need data-parallel algorithms to take advantage of the parallel architecture of the GPU. That is, we need a large amount of data elements that will have similar operations performed on them so that the elements can be processed in parallel. Graphical operations like shading pixels are a good example, as each pixel fragment being drawn is operated on by the pixel shader. As another example, if you look at the code for our wave simulation from the previous chapters, you will see that in the update step, we perform a calculation on each grid element. So this, too, is a good candidate for a GPU implementation, as each grid element can be updated in parallel by the GPU. Particle systems provide yet another example, where the physics of each particle can be computed independently provided we take the simplification that the particles do not interact with each other.
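To make "data-parallel" concrete, here is a minimal CPU-side sketch; the function name and the update coefficients are illustrative, not the book's actual wave equation. The key property is that every output element is computed only from read-only input arrays, so each loop iteration could be handed to its own GPU thread:

```cpp
#include <cstddef>
#include <vector>

// Each out[i] depends only on the read-only prev/curr arrays, never on
// other elements of out, so all iterations are independent of each other
// and could each be assigned to one GPU thread. (Simplified 1D update;
// the 0.25 coefficient is illustrative only.)
std::vector<float> UpdateWave(const std::vector<float>& prev,
                              const std::vector<float>& curr)
{
    std::vector<float> out(curr.size(), 0.0f);
    for (std::size_t i = 1; i + 1 < curr.size(); ++i)
    {
        out[i] = 2.0f * curr[i] - prev[i]
               + 0.25f * (curr[i - 1] - 2.0f * curr[i] + curr[i + 1]);
    }
    return out;
}
```

Because no iteration reads from `out`, there are no ordering constraints between iterations, which is exactly what a massively parallel processor needs.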


      The compute shader is a programmable shader Direct3D exposes that is not directly part of the rendering pipeline. Instead, it sits off to the side and can read from GPU resources and write to GPU resources (Figure 12.2). Essentially, the compute shader allows us to access the GPU to implement data-parallel algorithms without drawing anything. As mentioned, this is useful for GPGPU programming, but there are still many graphical effects that can be implemented on the compute shader as well—so it is still very relevant for a graphics programmer. And as already mentioned, because the compute shader is part of Direct3D, it reads from and writes to Direct3D resources, which enables us to bind the output of a compute shader directly to the rendering pipeline.


Objectives:
1. To learn how to program compute shaders.
2. To obtain a basic high-level understanding of how the hardware processes thread groups, and the threads within them.
3. To discover which Direct3D resources can be set as an input to a compute shader and which Direct3D resources can be set as an output to a compute shader.
4. To understand the various thread IDs and their uses.
5. To learn about shared memory and how it can be used for performance optimizations.
6. To find out where to obtain more detailed information about GPGPU programming.




12.1 THREADS AND THREAD GROUPS
            In GPU programming, the number of threads desired for execution is divided up into a grid of thread groups. A thread group is executed on a single multiprocessor. Therefore, if you had a GPU with 16 multiprocessors, you would want to break up your problem into at least 16 thread groups so that each multiprocessor has work to do. For better performance, you would want at least two thread groups per multiprocessor since a multiprocessor can switch to processing the threads in a different group to hide stalls [Fung10] (a stall can occur, for example, if a shader needs to wait for a texture operation result before it can continue to the next instruction).


              Each thread group gets shared memory that all threads in that group can access; a thread cannot access shared memory in a different thread group. Thread synchronization operations can take place amongst the threads in a thread group, but different thread groups cannot be synchronized. In fact, we have no control over the order in which different thread groups are processed. This makes sense as the thread groups can be executed on different multiprocessors.


                A thread group consists of n threads. The hardware actually divides these threads up into warps (32 threads per warp), and a warp is processed by the multiprocessor in SIMD32 (i.e., the same instructions are executed for the 32 threads simultaneously). Each CUDA core processes a thread and recall that a “Fermi” multiprocessor has 32 CUDA cores (so a CUDA core is like an SIMD “lane”). In Direct3D, you can specify a thread group size with dimensions that are not multiples of 32, but for performance reasons, the thread group dimensions should always be multiples of the warp size [Fung10].
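A quick way to see the cost of ignoring the warp size is to count the idle lanes in the last warp of a group. The helper below is hypothetical (not a Direct3D API), and the warp size of 32 is NVIDIA-specific, per the text:

```cpp
// Threads execute in warps of 32. If the group's total thread count is not
// a multiple of 32, the final warp is only partially filled, yet it still
// occupies a full warp on the multiprocessor; the unused lanes do no work.
unsigned WastedLanes(unsigned threadsPerGroup)
{
    const unsigned kWarpSize = 32; // NVIDIA warp size
    unsigned remainder = threadsPerGroup % kWarpSize;
    return (remainder == 0) ? 0 : kWarpSize - remainder;
}
```

For example, a [numthreads(10, 10, 1)] group of 100 threads wastes 28 lanes in its fourth warp, while a 16 × 16 group of 256 threads fills exactly 8 warps and wastes none.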


                  Thread group sizes of 256 seem to be a good starting point that should work well for various hardware. Then experiment with other sizes. Changing the number of threads per group will change the number of groups dispatched.


                    This enables you to launch a 3D grid of thread groups; however, in this book we will only be concerned with 2D grids of thread groups. The following example call launches three groups in the x direction and two groups in the y direction for a total of 3 × 2 = 6 thread groups (see Figure 12.3).
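The example call the paragraph refers to does not appear in this excerpt. Below is a sketch of what it would look like using ID3D11DeviceContext::Dispatch (shown in comments, since it needs a live device context), together with the usual ceil-division used to size a dispatch to an image:

```cpp
// Number of thread groups needed along one dimension so that 'extent'
// items are covered by groups of 'groupSize' threads each
// (integer ceiling division).
unsigned NumGroups(unsigned extent, unsigned groupSize)
{
    return (extent + groupSize - 1) / groupSize;
}

// With a live ID3D11DeviceContext* dc, the 3 x 2 grid of thread groups
// described in the text would be launched as:
//     dc->Dispatch(3, 2, 1); // 3 * 2 * 1 = 6 thread groups
//
// More commonly, the counts are derived from the image size and the
// [numthreads] declaration; e.g., for a 512 x 512 image and 16 x 16 groups:
//     dc->Dispatch(NumGroups(512, 16), NumGroups(512, 16), 1);
```

The ceiling division matters when the image size is not an exact multiple of the group size: a 500-wide image with 16-wide groups still needs 32 groups, with the last group partially outside the image.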


12.2 A SIMPLE COMPUTE SHADER
                      Following is a simple compute shader that sums two textures, assuming all the textures are the same size. This shader is not very interesting, but it illustrates the basic syntax of writing a compute shader.
cbuffer cbSettings
{
    // Compute shader can access values in constant buffers.
};

// Data sources and outputs.
Texture2D gInputA;
Texture2D gInputB;
RWTexture2D<float4> gOutput;

// The number of threads in the thread group. The threads in a group can
// be arranged in a 1D, 2D, or 3D grid layout.
[numthreads(16, 16, 1)]
void CS(int3 dispatchThreadID : SV_DispatchThreadID) // Thread ID
{
    // Sum the xyth texels and store the result in the xyth texel of gOutput.
    gOutput[dispatchThreadID.xy] =
        gInputA[dispatchThreadID.xy] + gInputB[dispatchThreadID.xy];
}

technique11 AddTextures
{
    pass P0
    {
        SetVertexShader(NULL);
        SetPixelShader(NULL);
        SetComputeShader(CompileShader(cs_5_0, CS()));
    }
}


A compute shader consists of the following components:
1. Global variable access via constant buffers.
2. Input and output resources, which are discussed in the next section.
3. The [numthreads(X, Y, Z)] attribute, which specifies the number of threads in the thread group as a 3D grid of threads.
4. The shader body that has the instructions to execute for each thread.
5. Thread identification system value parameters (discussed in §12.4).


                          Observe that we can define different topologies of the thread group; for example, a thread group could be a single line of X threads [numthreads(X, 1, 1)] or a single column of Y threads [numthreads(1, Y, 1)]. 2D thread groups of X × Y threads can be made by setting the z-dimension to 1 like this, [numthreads(X, Y, 1)]. The topology you choose will be dictated by the problem you are working on. As mentioned in the previous section, the total thread count per group should be a multiple of the warp size (32 for NVIDIA cards) or a multiple of the wavefront size (64 for ATI cards). A multiple of the wavefront size is also a multiple of the warp size, so choosing a multiple of the wavefront size works for both types of cards.
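Since a multiple of the wavefront size covers both vendors, a simple validity check for a proposed [numthreads(x, y, z)] declaration might look like this (a hypothetical helper, not part of the Direct3D API):

```cpp
// True if the total thread count of a [numthreads(x, y, z)] declaration is
// a multiple of the ATI wavefront size (64). Because 64 is itself a
// multiple of the NVIDIA warp size (32), any group size passing this check
// is a reasonable choice for both kinds of hardware.
bool IsGoodGroupSize(unsigned x, unsigned y, unsigned z)
{
    const unsigned kWavefrontSize = 64;
    return (x * y * z) % kWavefrontSize == 0;
}
```

For instance, a 16 × 16 × 1 group (256 threads) passes, while a 10 × 10 × 1 group (100 threads) does not.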


12.3 DATA INPUT AND OUTPUT RESOURCES
                            Two types of resources can be bound to a compute shader: buffers and textures. We have worked with buffers already such as vertex and index buffers, and constant buffers. Although we use the effects framework to set constant buffers, they are just ID3D11Buffer instances with the D3D11_BIND_CONSTANT_BUFFER flag. We are also familiar with texture resources from Chapter 8.


12.3.1 Texture Inputs
The compute shader defined in the previous section declared two input texture resources:
Texture2D gInputA;
Texture2D gInputB;
The input textures gInputA and gInputB are bound as inputs to the shader by creating ID3D11ShaderResourceViews (SRVs) to the textures and setting them to the compute shader via ID3DX11EffectShaderResourceVariable variables. This is exactly the same way we bind shader resource views to pixel shaders. Note that SRVs are read-only.

