
Ashar's Proposal

Ashar edited this page Jun 4, 2019 · 1 revision

Abstract

This proposal aims at improving the efficiency of the expression templates used by boost::numeric::ublas::tensor. Traditional expression templates (ETs) are efficient, but Klaus Iglberger showed in his work that they do not automatically result in faster execution of an expression. Smart expression templates are a tool for capturing an expression and, possibly, transforming and evaluating it. For better expression evaluation, I will use Boost.YAP, a C++14-and-later expression template library. Boost.YAP already offers many functions and algorithms for dealing with expressions, which will ease my task; moreover, most of the issues pointed out in the paper cited above are handled efficiently by YAP. Hence, this proposal aims at the integration of Boost.YAP for convenient expression template transformation and evaluation in boost::numeric::ublas::tensor.

Proposal

The Smart Expression Template Evaluator implemented as a result of this proposal will include the following features/submodules. By Smart Expression Template Evaluator, I mean an evaluator that analyzes an expression and optimizes it using mathematical laws to minimize the number of micro-operations required to evaluate it. In addition to optimizing the expression, the evaluator will be aware of the operands and their properties, so that the same quantity is not computed time and again, and of the device to choose for executing the expression.

  • Smart Dimension Checker
  • Smart Expression Optimizer
  • Smart Expression Evaluator

This proposal provides a brief description of how those submodules will be added to the current tensor extension, taking references and ideas from my existing YAP-based matrix implementation.

Smart Dimension Checker

Every expression in linear algebra has a dimension associated with it. By dimension, I mean the extents, i.e. the rank and shape, of a tensor. An expression template only sees the operands and returns a small proxy object that represents the operation. This proxy is lightweight and can be combined further with other operands. Since it can be used with further operators, we must check that the returned expression has a dimension compatible with the new operation. The dimension of the returned proxy initially depends upon the tensor operands.
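To make the proxy idea concrete, here is a minimal expression-template sketch. The names (tensor3, add_expr) are illustrative only and are not the uBLAS tensor API; the point is that operator+ does no arithmetic and merely returns a lightweight proxy whose extents are derived from its operands.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical rank-3 tensor with fixed extents, filled with one value.
struct tensor3 {
    std::array<std::size_t, 3> extents;
    std::vector<double> data;
    tensor3(std::array<std::size_t, 3> e, double fill)
        : extents(e), data(e[0] * e[1] * e[2], fill) {}
};

// Lightweight proxy representing "lhs + rhs"; it stores only references.
struct add_expr {
    const tensor3& lhs;
    const tensor3& rhs;
    // The proxy's extents depend on its tensor operands.
    std::array<std::size_t, 3> extents() const { return lhs.extents; }
    // Elements are computed lazily, on access.
    double operator[](std::size_t i) const { return lhs.data[i] + rhs.data[i]; }
};

add_expr operator+(const tensor3& a, const tensor3& b) {
    return add_expr{a, b};  // no computation happens here
}
```

Because the proxy can itself appear as an operand of a larger expression, its extents must be queryable before any value is computed, which is exactly what the dimension checker below relies on.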

All tensors (including matrices and vectors) can be subdivided into two classes.

  • Tensors whose dimension can change dynamically after creation. This typically involves calling tensor.reshape() or tensor.resize(). Throughout this proposal, I will refer to such tensors as fully mutable (the dimension as well as the values can change).
  • Tensors whose extents or shape will not change once created. I will refer to them as partially mutable (the dimension cannot change, but the values can).
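The two classes could be encoded as policy tags on the tensor type, so that dimension-changing operations are rejected at compile time for partially mutable tensors. The tag names below mirror the policy::fully_mutable / policy::partially_mutable names used later in this proposal, but the tensor class itself is a hypothetical sketch, not the real uBLAS type.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

namespace policy {
struct fully_mutable {};      // extents may change after creation
struct partially_mutable {};  // extents are frozen; only values may change
}  // namespace policy

template <class Policy>
struct tensor {
    std::vector<std::size_t> extents;

    // reshape() is only well-formed for fully mutable tensors; calling it on
    // a partially mutable tensor fails to compile.
    void reshape(std::vector<std::size_t> e) {
        static_assert(std::is_same<Policy, policy::fully_mutable>::value,
                      "reshape() requires a fully mutable tensor");
        extents = std::move(e);
    }
};
```

With this encoding, the expression builder can inspect the policy of each operand to decide whether an eager dimension check is safe.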

Taking into account that every operand falls into one of these two classes only, we can optimize the operations in the AST that compute the dimension of an expression, since evaluating the dimension of an expression is a prerequisite to evaluating its value.

Consider this Example :

auto a = Tensor<policy::partially_mutable>({2,5,4}, 1); // line 1
auto b = Tensor<policy::partially_mutable>({2,4,5}, 1); // line 2
auto expression = a + b;                                // line 3: extents {2,5,4} vs {2,4,5}
Tensor result = expression;                             // line 4

There are two ways we can detect that the evaluation of an expression is not defined.

  • At the time we actually build the expression (at line 3). This is the correct choice at this moment: the tensors a and b are policy::partially_mutable, so it is guaranteed that their dimensions will not change throughout their lifetime. It also means that we can store the dimension of the expression in some member of the expression at the time it is built. In this case, line 3 will throw an exception, because a and b are partially mutable and + is not defined on them: their dimensions neither match nor can they change.
  • At the time we actually evaluate the expression (at line 4). The only reason to delay the dimension check is that the dimension of an operand may change before the expression is evaluated. This is possible only if at least one operand is policy::fully_mutable; in such cases, we must recursively compute the dimension of the expression and throw if we find a mismatch.
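The two strategies above can be sketched as two small check functions. The operand struct and function names here are illustrative; extents_frozen stands in for the partially mutable policy.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative operand: its extents plus a flag telling whether they are
// frozen (true for partially mutable tensors).
struct operand {
    std::vector<std::size_t> extents;
    bool extents_frozen;
};

// Strategy 1: eager check while building `a + b`. We may throw immediately
// only when both operands' extents can never change.
void check_at_build(const operand& a, const operand& b) {
    if (a.extents_frozen && b.extents_frozen && a.extents != b.extents)
        throw std::invalid_argument(
            "operator+ undefined: extents mismatch and cannot change");
}

// Strategy 2: deferred check, run just before evaluation. Needed as soon as
// at least one operand is fully mutable.
void check_at_eval(const operand& a, const operand& b) {
    if (a.extents != b.extents)
        throw std::invalid_argument("operator+ undefined: extents mismatch");
}
```

For the {2,5,4} + {2,4,5} example above, check_at_build would throw at expression construction time, exactly as described for line 3.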

Smart Expression Optimizer

A tensor variable is a giant collection of usually primitive data types, so it is very difficult to know by intuition whether two given tensors are the same. When end users work with tensors and build expressions, it is highly likely that they will write some trivial expressions. Suppose a user obtains two tensor variables A and B after some evaluation. In a typical scenario, the user may not check whether A == B and simply writes an operation such as C = 2*A - B. If A equals B, the result of this expression is trivial: the right-hand side collapses into C = A. If we can somehow check that A == B and optimize the user's expression, we can save computing time. Current expression templates are not able to handle such cases.
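One cheap, conservative form of this optimization can be sketched directly: detect that both operands of 2*A - B are the same object (pointer identity, a stand-in for the more expensive elementwise A == B) and collapse the expression to a plain copy of A. The tensor struct and function name below are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Minimal stand-in for a tensor: just its flattened data.
struct tensor {
    std::vector<double> data;
};

// Evaluate 2*A - B, short-circuiting when A and B are the same object:
// in that case 2*A - A == A, so no arithmetic is needed at all.
std::vector<double> eval_2a_minus_b(const tensor& a, const tensor& b) {
    if (&a == &b)       // identical operands detected by address comparison
        return a.data;  // expression collapses to A
    std::vector<double> out(a.data.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = 2 * a.data[i] - b.data[i];
    return out;
}
```

A smart optimizer built on Boost.YAP would perform this kind of rewrite as an expression transform over the AST rather than inside a hand-written evaluation function, but the saving is the same: the elementwise pass is skipped entirely.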

Smart Expression Evaluator

Currently, we have no control over the way YAP will evaluate an expression. The evaluation could use multiple CPU cores, or run on a GPU if such a hardware accelerator is available (provided the tensor_core was allocated on the GPU); the current expression evaluator has no notion of this. Some operations, such as convolution, are well suited to and fast on multiple cores, while others, such as finding the mean of all elements in a tensor, may only be done on a single thread.

We must add an evaluation policy for expressions. This policy will determine the device or the hardware resources to use while evaluating an expression. The idea comes from the Eigen library, which already provides such capabilities. The lines below show how Eigen handles such a device policy; afterwards, we show how we will handle it in our code.

// (a, b and dot_product_dims are assumed to be declared elsewhere;
// Eigen's ThreadPoolDevice is constructed from a thread pool.)
Eigen::ThreadPool pool(4);
Eigen::ThreadPoolDevice my_device(&pool, 4 /* number of threads to use */);
// Now just use the device when evaluating expressions.
Eigen::Tensor<float, 2> c(30, 50);
c.device(my_device) = a.contract(b, dot_product_dims);

Our uBLAS tensor will also have this capability, but instead of using a separate Device class for specifying the execution policy, we can use the std::execution policies already available with C++17. However, the standard has no execution policy for GPUs and provides no fine-grained control over how many threads the parallel policy uses. We can derive new policies, extending the standard execution policies to create a GPU policy and other advanced policies. If memory has been allocated on a CUDA device, our execution policy will run the computation on that CUDA device. An example snippet of how this could be achieved is discussed below with concept code.

As of now, the std::execution parallel algorithms are not implemented by GCC and Clang (only by MSVC and Intel's compiler). We will therefore use a device executor for each of them: std::execution::par will be represented by DeviceParallel, and similarly DeviceSequential will represent std::execution::seq. Once std::execution is ready in GCC and Clang, these will be deprecated in its favor.
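A minimal sketch of this dispatch, under the assumptions above: DeviceSequential and DeviceParallel are tag types standing in for std::execution::seq and std::execution::par, and the evaluator overloads on them. A hypothetical DeviceCUDA tag could be added the same way. The parallel overload here deliberately falls back to the sequential loop; a real implementation would split the index range across threads or a thread pool.

```cpp
#include <cstddef>

// Tag types standing in for std::execution::seq / std::execution::par until
// those are available on GCC and Clang. (Illustrative names from this proposal.)
struct DeviceSequential {};
struct DeviceParallel {};

// Sequential evaluation: a plain single-threaded loop over [0, n).
template <class F>
void for_each_index(DeviceSequential, std::size_t n, F f) {
    for (std::size_t i = 0; i < n; ++i) f(i);
}

// Parallel evaluation: placeholder that would normally partition [0, n)
// across threads; here it simply delegates to the sequential overload.
template <class F>
void for_each_index(DeviceParallel, std::size_t n, F f) {
    for_each_index(DeviceSequential{}, n, f);
}
```

The expression evaluator would then take the policy as an extra argument (much like c.device(my_device) in the Eigen snippet above) and select the matching overload at compile time, so switching devices costs nothing at runtime.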
