August 4, 2019
When using OpenCV functions in both Python and C++, I noticed that the work was automatically distributed over all cores. I suspected that OpenCV was using some task-distribution algorithm and threads in the background. That turned out to be true, and it is called Intel Threading Building Blocks, or TBB.
TBB takes a ‘for’ or ‘while’ loop and breaks it down into small chunks of work, distributing them automatically over the available cores. It decides which cores to use, and if one core is busier than the others, it shifts the tasks allocated to the busy core onto another.
This concept is called “Task-Based Parallelism”. With the help of Intel TBB, we no longer have to take care of individual threads and their allocation, because all such boring chores are done automatically.
Here is a sample snippet of code:
#include "tbb/tbb.h"
using namespace tbb;

float Foo(float x) { return 3.14f * x; } // the function we want to apply on arr in parallel

int main()
{
    const int grain_size = 100; // the size of each subrange
    const int size = 20000;
    float arr[size];
    // ApplyFoo is the class defined further below
    parallel_for(blocked_range<size_t>(0, size, grain_size), ApplyFoo(arr));
    return 0;
}
Let’s say that we want to apply a function Foo() to every element of the array arr. Because this kind of task is easily parallelized, we can expect some speed gain from using threads. This is just what parallel_for() does.
The first argument of parallel_for() is a blocked_range, which holds the index range together with the grain size. The grain size specifies how finely the work is divided: a smaller grain size means more, smaller chunks of work.
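Newer versions of TBB (with a C++11 compiler) also accept a lambda as the loop body, so the same call can be written without a separate functor class. The grain size can then be omitted, in which case TBB’s default partitioner chooses the chunk sizes itself. A minimal sketch, assuming the lambda overload of parallel_for and that arr is in scope as in main above:

// same loop as above, written with a lambda instead of a functor class
parallel_for(blocked_range<size_t>(0, size),
             [&](const blocked_range<size_t>& r) {
                 for (size_t i = r.begin(); i != r.end(); ++i)
                     arr[i] = Foo(arr[i]);
             });

The original, pre-C++11 style passes a functor object instead.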
ApplyFoo is a class that defines what operation is done on arr:
class ApplyFoo {
    float* arr_copy;
public:
    void operator()(const blocked_range<size_t>& r) const {
        float* arr = arr_copy;
        for (size_t i = r.begin(); i != r.end(); ++i)
        {
            arr[i] = Foo(arr[i]); // here’s the task to do in parallel
        }
    }
    // constructor: stores a pointer to the array we operate on
    ApplyFoo(float* a) :
        arr_copy(a)
    {}
};
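One practical note: TBB is a library you link against, so with g++ compiling these examples typically needs the -ltbb flag (the exact flags depend on how TBB is installed on your system).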
Here is the official documentation:
https://software.intel.com/sites/default/files/m/d/4/1/d/8/tutorial.pdf