Using Tiles

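You can use tiling to maximize the acceleration of your app. Tiling divides threads into equal rectangular subsets called tiles. If you use an appropriate tile size and a tiled algorithm, you can get even more acceleration from your C++ AMP code. The basic components of tiling are: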

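tile_static variables. The primary benefit of tiling is the performance gain from tile_static access. Access to data in tile_static memory can be significantly faster than access to data in the global space (array or array_view objects). An instance of a tile_static variable is created for each tile, and all threads in the tile have access to the variable. In a typical tiled algorithm, data is copied into tile_static memory once from global memory and then accessed many times from tile_static memory.
tile_barrier::wait method. A call to tile_barrier::wait suspends execution of the current thread until all of the threads in the same tile reach the call to tile_barrier::wait. You can't guarantee the order in which the threads run, only that no thread in the tile executes past the call to tile_barrier::wait until all of the threads have reached the call. This means that by using the tile_barrier::wait method, you can perform tasks on a tile-by-tile basis rather than a thread-by-thread basis. A typical tiling algorithm has code to initialize the tile_static memory for the whole tile, followed by a call to tile_barrier::wait. The code that follows tile_barrier::wait contains computations that require access to all the tile_static values.
Local and global indexing. You have access to the index of the thread relative to the entire array_view or array object, and to the index relative to the tile. Using the local index can make your code easier to read and debug. Typically, you use local indexing to access tile_static variables, and global indexing to access array and array_view variables.
tiled_extent class and tiled_index class. You use a tiled_extent object instead of an extent object in the parallel_for_each call, and a tiled_index object instead of an index object.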

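To take advantage of tiling, your algorithm must partition the compute domain into tiles and then copy the tile data into tile_static variables for faster access.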
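The partitioning step can be sketched on the CPU with plain integer arithmetic. The helpers below, TilesAlong and TileCount, are hypothetical names used only for illustration and are not part of C++ AMP. The sketch also assumes that each extent divides evenly by the tile size.

```cpp
#include <cassert>

// Hypothetical helper for illustration (not part of C++ AMP): the number of
// tiles along one dimension of the compute domain.
int TilesAlong(int extentSize, int tileSize) {
    // Assumption of this sketch: the extent divides evenly into tiles.
    assert(tileSize > 0 && extentSize % tileSize == 0);
    return extentSize / tileSize;
}

// Total number of tiles in a two-dimensional compute domain.
int TileCount(int rows, int cols, int tileRows, int tileCols) {
    return TilesAlong(rows, tileRows) * TilesAlong(cols, tileCols);
}
```

For example, TileCount(8, 9, 2, 3) evaluates to 12: four rows of tiles times three columns of tiles.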

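Example of Global, Tile, and Local Indices

Note

C++ AMP headers are deprecated starting with Visual Studio 2022 version 17.0. Including any AMP headers will generate build errors. Define _SILENCE_AMP_DEPRECATION_WARNINGS before including any AMP headers to silence the warnings.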
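As a minimal sketch of the ordering the note describes, the macro has to appear before the first AMP header in the translation unit. This fragment only shows placement and assumes a toolset that still ships the AMP headers.

```cpp
// Define the macro first, before any AMP header is included.
#define _SILENCE_AMP_DEPRECATION_WARNINGS
#include <amp.h>
```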

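The following diagram represents an 8x9 matrix of data that's arranged in 2x3 tiles.

Diagram of an 8-by-9 matrix divided into 2-by-3 tiles.

The following example displays the global, tile, and local indices of this tiled matrix. An array_view object is created by using elements of type Description. Each Description holds the global, tile, and local indices of an element in the matrix. The code in the call to parallel_for_each sets the values of the global, tile, and local indices of each element. The output displays the values in the Description structures.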

#include <iostream>
#include <iomanip>
#include <vector>
#include <Windows.h>
#include <amp.h>
using namespace concurrency;

const int ROWS = 8;
const int COLS = 9;

// tileRow and tileColumn specify the tile that each thread is in.
// globalRow and globalColumn specify the location of the thread in the array_view.
// localRow and localColumn specify the location of the thread relative to the tile.
struct Description {
    int value;
    int tileRow;
    int tileColumn;
    int globalRow;
    int globalColumn;
    int localRow;
    int localColumn;
};

// A helper function for formatting the output.
void SetConsoleColor(int color) {
    int colorValue = (color == 0) ? 4 : 2;
    SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE), colorValue);
}

// A helper function for formatting the output.
void SetConsoleSize(int height, int width) {
    COORD coord;

    coord.X = width;
    coord.Y = height;
    SetConsoleScreenBufferSize(GetStdHandle(STD_OUTPUT_HANDLE), coord);

    // Use a stack-allocated rectangle so that nothing is leaked.
    SMALL_RECT rect;
    rect.Left = 0;
    rect.Top = 0;
    rect.Right = width;
    rect.Bottom = height;
    SetConsoleWindowInfo(GetStdHandle(STD_OUTPUT_HANDLE), TRUE, &rect);
}

// This method creates an 8x9 matrix of Description structures.
// In the call to parallel_for_each, the structure is updated
// with tile, global, and local indices.
void TilingDescription() {
    // Create 72 (8x9) Description structures.
    std::vector<Description> descs;
    for (int i = 0; i < ROWS * COLS; i++) {
        Description d = {i, 0, 0, 0, 0, 0, 0};
        descs.push_back(d);
    }

    // Create an array_view from the Description structures.
    extent<2> matrix(ROWS, COLS);
    array_view<Description, 2> descriptions(matrix, descs);

    // Update each Description with the tile, global, and local indices.
    parallel_for_each(descriptions.extent.tile< 2, 3>(),
        [=] (tiled_index< 2, 3> t_idx) restrict(amp)
    {
        descriptions[t_idx].globalRow = t_idx.global[0];
        descriptions[t_idx].globalColumn = t_idx.global[1];
        descriptions[t_idx].tileRow = t_idx.tile[0];
        descriptions[t_idx].tileColumn = t_idx.tile[1];
        descriptions[t_idx].localRow = t_idx.local[0];
        descriptions[t_idx].localColumn = t_idx.local[1];
    });

    // Print out the Description structure for each element in the matrix.
    // Tiles are displayed in red and green to distinguish them from each other.
    SetConsoleSize(100, 150);
    for (int row = 0; row < ROWS; row++) {
        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Value: " << std::setw(2) << descriptions(row, column).value << "      ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Tile:   " << "(" << descriptions(row, column).tileRow << "," << descriptions(row, column).tileColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Global: " << "(" << descriptions(row, column).globalRow << "," << descriptions(row, column).globalColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Local:  " << "(" << descriptions(row, column).localRow << "," << descriptions(row, column).localColumn << ")  ";
        }
        std::cout << "\n";
        std::cout << "\n";
    }
}

int main() {
    TilingDescription();
    char wait;
    std::cin >> wait;
}

The main work of the example is in the definition of the array_view object and the call to parallel_for_each.

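The vector of Description structures is copied into an 8x9 array_view object.
The parallel_for_each method is called with a tiled_extent object as the compute domain. The tiled_extent object is created by calling the extent::tile() method of the descriptions variable. The type parameters of the call to extent::tile(), <2,3>, specify that 2x3 tiles are created. Thus, the 8x9 matrix is divided into 12 tiles, arranged in four rows and three columns.
The parallel_for_each method is called by using a tiled_index<2,3> object (t_idx) as the index. The type parameters of the index (t_idx) must match the type parameters of the compute domain (descriptions.extent.tile< 2, 3>()).
When each thread is executed, the index t_idx returns information about which tile the thread is in (tiled_index::tile property) and the location of the thread within the tile (tiled_index::local property).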
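Along each dimension, the three indices are related by simple arithmetic: the global index equals the tile index times the tile size, plus the local index. The CPU-only sketch below illustrates that relationship. TiledPosition, SplitIndex, and JoinIndex are hypothetical names introduced for illustration; in C++ AMP the same values come from the tiled_index object.

```cpp
#include <cassert>

// Hypothetical type for illustration (not part of C++ AMP): the tile and
// local parts of an index along one dimension.
struct TiledPosition {
    int tile;   // which tile the thread belongs to
    int local;  // position of the thread inside that tile
};

// Split a global index into its tile and local parts for a given tile size.
TiledPosition SplitIndex(int global, int tileSize) {
    assert(tileSize > 0 && global >= 0);
    return { global / tileSize, global % tileSize };
}

// Recombine the tile and local parts into the global index.
int JoinIndex(TiledPosition position, int tileSize) {
    return position.tile * tileSize + position.local;
}
```

With 2x3 tiles, the element at global position (5, 7) splits into tile (2, 2) and local (1, 1): SplitIndex(5, 2) gives tile 2 and local 1 for the row, and SplitIndex(7, 3) gives tile 2 and local 1 for the column.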

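Tile Synchronization: tile_static and tile_barrier::wait

The previous example illustrates the tile layout and indices, but isn't in itself very useful. Tiling becomes useful when the tiles are integral to the algorithm and exploit tile_static variables. Because all threads in a tile have access to tile_static variables, calls to tile_barrier::wait are used to synchronize access to those variables. Although all of the threads in a tile have access to the tile_static variables, there's no guaranteed order of execution of the threads in the tile. The following example shows how to use tile_static variables and the tile_barrier::wait method to calculate the average value of each tile. Here are the keys to understanding the example: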

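The rawData is stored in an 8x8 matrix.
The tile size is 2x2. This creates a 4x4 grid of tiles, and the averages can be stored in a 4x4 matrix by using an array object. Only a limited number of types can be captured by reference in an AMP-restricted function. The array class is one of them.
The matrix size and sample size are defined by using #define statements, because the type parameters to array, array_view, extent, and tiled_index must be constant values. You can also use const int static declarations. As an additional benefit, it's trivial to change the sample size to calculate the average over 4x4 tiles.
A tile_static 2x2 array of float values is declared for each tile. Although the declaration is in the code path for every thread, only one array is created for each tile in the matrix.
There's a line of code that copies the values in each tile to the tile_static array. For each thread, after the value is copied to the array, execution on the thread stops at the call to tile_barrier::wait.
When all of the threads in a tile have reached the barrier, the average can be calculated. Because the code executes for every thread, there's an if statement so that the average is calculated on only one thread. The average is stored in the averages variable. The barrier is essentially the construct that controls calculations by tile, much as you might use a for loop.
The data in the averages variable, because it's an array object, must be copied back to the host. This example uses the vector conversion operator.
In the complete example, you can change SAMPLESIZE to 4 and the code executes correctly without any other changes.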
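To check the accelerator results, it can help to have a sequential reference on the CPU. The sketch below is one way to write it under stated assumptions: TileAverages is a hypothetical helper (not part of C++ AMP), the input is a square row-major matrix, and the matrix size divides evenly by the sample size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-only reference for the per-tile averages. Hypothetical helper, not part
// of C++ AMP. `data` is a row-major matrixSize x matrixSize matrix; the result
// is a row-major matrix with one element per tile.
std::vector<float> TileAverages(const std::vector<float>& data,
                                int matrixSize, int sampleSize) {
    assert(sampleSize > 0 && matrixSize % sampleSize == 0);
    assert(data.size() == static_cast<std::size_t>(matrixSize) * matrixSize);

    const int outputSize = matrixSize / sampleSize;
    std::vector<float> averages(
        static_cast<std::size_t>(outputSize) * outputSize, 0.0f);

    for (int tileRow = 0; tileRow < outputSize; tileRow++) {
        for (int tileCol = 0; tileCol < outputSize; tileCol++) {
            float total = 0.0f;
            // Visit every element of the tile through its local indices.
            for (int localRow = 0; localRow < sampleSize; localRow++) {
                for (int localCol = 0; localCol < sampleSize; localCol++) {
                    int globalRow = tileRow * sampleSize + localRow;
                    int globalCol = tileCol * sampleSize + localCol;
                    total += data[globalRow * matrixSize + globalCol];
                }
            }
            averages[tileRow * outputSize + tileCol] =
                total / static_cast<float>(sampleSize * sampleSize);
        }
    }
    return averages;
}
```

Calling TileAverages with the values 0 through 63, a matrix size of 8, and a sample size of 2 or 4 should reproduce the outputs listed in the comments of the sample.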

#include <iostream>
#include <vector>
#include <amp.h>
using namespace concurrency;

#define SAMPLESIZE 2
#define MATRIXSIZE 8
void SamplingExample() {

    // Create data and array_view for the matrix.
    std::vector<float> rawData;
    for (int i = 0; i < MATRIXSIZE * MATRIXSIZE; i++) {
        rawData.push_back((float)i);
    }
    extent<2> dataExtent(MATRIXSIZE, MATRIXSIZE);
    array_view<float, 2> matrix(dataExtent, rawData);

    // Create the array for the averages.
    // There is one element in the output for each tile in the data.
    std::vector<float> outputData;
    int outputSize = MATRIXSIZE / SAMPLESIZE;
    for (int j = 0; j < outputSize * outputSize; j++) {
        outputData.push_back((float)0);
    }
    extent<2> outputExtent(MATRIXSIZE / SAMPLESIZE, MATRIXSIZE / SAMPLESIZE);
    array<float, 2> averages(outputExtent, outputData.begin(), outputData.end());

    // Use tiles that are SAMPLESIZE x SAMPLESIZE.
    // Find the average of the values in each tile.
    // The only reference-type variable you can pass into the parallel_for_each call
    // is a concurrency::array.
    parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
        [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp)
    {
        // Copy the values of the tile into a tile-sized array.
        tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
        tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

        // Wait for the tile-sized array to load before you calculate the average.
        t_idx.barrier.wait();

        // If you remove the if statement, then the calculation executes for every
        // thread in the tile, and makes the same assignment to averages each time.
        if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
            for (int trow = 0; trow < SAMPLESIZE; trow++) {
                for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                    averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
                }
            }
            averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
        }
    });

    // Print out the results.
    // You cannot access the values in averages directly. You must copy them
    // back to a CPU variable.
    outputData = averages;
    for (int row = 0; row < outputSize; row++) {
        for (int col = 0; col < outputSize; col++) {
            std::cout << outputData[row*outputSize + col] << " ";
        }
        std::cout << "\n";
    }
    // Output for SAMPLESIZE = 2 is:
    //  4.5  6.5  8.5 10.5
    // 20.5 22.5 24.5 26.5
    // 36.5 38.5 40.5 42.5
    // 52.5 54.5 56.5 58.5

    // Output for SAMPLESIZE = 4 is:
    // 13.5 17.5
    // 45.5 49.5
}

int main() {
    SamplingExample();
}

Race Conditions

It might be tempting to create a tile_static variable named total and increment that variable for each thread, like this:

// Do not do this.
tile_static float total;
total += matrix[t_idx];
t_idx.barrier.wait();

averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);

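The first problem with this approach is that tile_static variables can't have initializers. The second problem is that there's a race condition on the assignment to total, because all of the threads in the tile have access to the variable in no particular order. You could program the algorithm so that only one thread accesses total at each barrier, as shown next. However, this solution isn't extensible.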

// Do not do this.
tile_static float total;
if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
    total = matrix[t_idx];
}
t_idx.barrier.wait();

if (t_idx.local[0] == 0 && t_idx.local[1] == 1) {
    total += matrix[t_idx];
}
t_idx.barrier.wait();

// etc.
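The text above doesn't show a scalable alternative. One commonly used pattern, offered here as an assumption and not as this article's method, is a stride-halving reduction: in each phase between barriers, half as many threads each add one partner value into their own slot. The CPU-only sketch below emulates those phases sequentially. TileSum is a hypothetical helper, and the sketch assumes a power-of-two number of elements.

```cpp
#include <cassert>
#include <vector>

// CPU-only emulation of a stride-halving reduction. Hypothetical helper, not
// part of C++ AMP. `values` plays the role of a tile_static array, and each
// iteration of the inner loop plays the role of one thread in the tile.
float TileSum(std::vector<float> values) {
    const int n = static_cast<int>(values.size());
    // Assumption of this sketch: a power-of-two number of elements.
    assert(n > 0 && (n & (n - 1)) == 0);

    for (int stride = n / 2; stride > 0; stride /= 2) {
        // In each phase, "thread" i adds one partner value into its own slot.
        // No two threads write the same slot, and no slot that is read in this
        // phase is written in this phase.
        for (int i = 0; i < stride; i++) {
            values[i] += values[i + stride];
        }
        // On the accelerator, a t_idx.barrier.wait() call would go here so
        // that every thread finishes the phase before the next one starts.
    }
    return values[0];
}
```

The number of barrier phases grows with the logarithm of the tile size instead of with the tile size itself, which is why this shape tends to scale better than one barrier per element.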

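Memory Fences

There are two kinds of memory access that must be synchronized: global memory access and tile_static memory access. A concurrency::array object allocates only global memory. A concurrency::array_view can reference global memory, tile_static memory, or both, depending on how it was constructed. The two kinds of memory that must be synchronized are:

Global memory
tile_static memory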

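A memory fence ensures that memory accesses are visible to the other threads in the thread tile, and that memory accesses are executed in program order. To ensure this, compilers and processors don't reorder reads and writes across the fence. In C++ AMP, a memory fence is created by a call to one of these methods: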

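tile_barrier::wait method: Creates a fence around both global and tile_static memory.
tile_barrier::wait_with_all_memory_fence method: Creates a fence around both global and tile_static memory.
tile_barrier::wait_with_global_memory_fence method: Creates a fence around only global memory.
tile_barrier::wait_with_tile_static_memory_fence method: Creates a fence around only tile_static memory.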

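Calling the specific fence that you require can improve the performance of your app. The barrier type affects how the compiler and the hardware reorder statements. For example, if you use a global memory fence, it applies only to global memory accesses, so the compiler and the hardware might still reorder reads and writes to tile_static variables across the two sides of the fence.

In the next example, the barrier synchronizes the writes to tileValues, a tile_static variable. In this example, tile_barrier::wait_with_tile_static_memory_fence is called instead of tile_barrier::wait.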

// Using a tile_static memory fence.
parallel_for_each(matrix.extent.tile<SAMPLESIZE, SAMPLESIZE>(),
    [=, &averages] (tiled_index<SAMPLESIZE, SAMPLESIZE> t_idx) restrict(amp)
{
    // Copy the values of the tile into a tile-sized array.
    tile_static float tileValues[SAMPLESIZE][SAMPLESIZE];
    tileValues[t_idx.local[0]][t_idx.local[1]] = matrix[t_idx];

    // Wait for the tile-sized array to load before calculating the average.
    t_idx.barrier.wait_with_tile_static_memory_fence();

    // If you remove the if statement, then the calculation executes
    // for every thread in the tile, and makes the same assignment to
    // averages each time.
    if (t_idx.local[0] == 0 && t_idx.local[1] == 0) {
        for (int trow = 0; trow < SAMPLESIZE; trow++) {
            for (int tcol = 0; tcol < SAMPLESIZE; tcol++) {
                averages(t_idx.tile[0],t_idx.tile[1]) += tileValues[trow][tcol];
            }
        }
        averages(t_idx.tile[0],t_idx.tile[1]) /= (float) (SAMPLESIZE * SAMPLESIZE);
    }
});

See also

C++ AMP (C++ Accelerated Massive Parallelism)
tile_static keyword

Last updated on 2018-11-19