
TensorFlow's Tensor

tensorflow::Tensor represents an n-dimensional array of values, like caffe2::Tensor.

Different from caffe2::Tensor<Context>, which is a class template, tensorflow::Tensor is a class.

caffe2::Tensor<Context>'s constructor doesn't allocate memory; instead, memory allocation is delayed until mutable_data is called. In contrast, tensorflow::Tensor's constructor allocates the memory.

caffe2::Tensor<Context>'s template methods data<T> and mutable_data<T> can return an array of elements of any type -- caffe2::Tensor::meta_ records the most recently returned (and allocated) element type. In contrast, tensorflow::Tensor's constructor accepts a DataType parameter that specifies the element type.
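
The difference can be illustrated with a minimal sketch. The tensorflow::Tensor calls use the public API; the caffe2::Tensor<Context> constructor signature shown here is the older templated API and should be taken as an approximation:

```cpp
#include <vector>

#include "caffe2/core/context.h"
#include "caffe2/core/tensor.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"

void Example() {
  // TensorFlow: the element type is fixed at construction time, and memory
  // is allocated right away.
  tensorflow::Tensor t(tensorflow::DT_FLOAT, tensorflow::TensorShape({2, 3}));
  auto flat = t.flat<float>();  // typed view over the already-allocated buffer

  // Caffe2 (older templated API, approximate): the constructor only records
  // the shape; the element type is fixed and memory is allocated at the first
  // mutable_data<T>() call.
  caffe2::Tensor<caffe2::CPUContext> c(std::vector<caffe2::TIndex>{2, 3});
  float* data = c.mutable_data<float>();  // allocation happens here
}
```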

caffe2::Tensor<Context> supports only numerically typed elements, whereas tensorflow::Tensor also supports string-typed elements.

caffe2::Tensor<Context> doesn't support accessing data in protobuf messages, whereas tensorflow::Tensor does.

caffe2::Tensor<Context>'s destructor doesn't free memory; instead, its data member shared_ptr<T> data_ does. In contrast, tensorflow::Tensor's destructor is responsible for freeing the memory. In addition, tensorflow::Tensor reference-counts the memory itself, whereas caffe2::Tensor<Context> relies on shared_ptr for that.

TensorShape

The shape of a tensor is represented by tensorflow::TensorShape, which can be constructed from a list of int64 values, or from a protobuf message TensorShapeProto.
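
For example, both construction paths look like the following (using TensorFlow's public headers):

```cpp
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/tensor_shape.pb.h"

void ShapeExample() {
  // From a list of int64 dimension sizes.
  tensorflow::TensorShape s({2, 3, 4});
  // s.dims() == 3, s.dim_size(1) == 3, s.num_elements() == 24

  // From a TensorShapeProto protobuf message.
  tensorflow::TensorShapeProto proto;
  proto.add_dim()->set_size(2);
  proto.add_dim()->set_size(3);
  tensorflow::TensorShape from_proto(proto);
}
```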

TensorShape supports various representations of a shape because most tensors are low-dimensional. This brings more complexity than Caffe2's use of vector<int64_t>. Indeed, tensor_shape.h and tensor_shape.cc take 759 lines of C++ code in total -- more than the handier majel::Dim, which takes 498 lines.

Memory Management

The constructor of tensorflow::Tensor accepts a parameter Allocator* a and passes it to a newly created tensorflow::Buffer object tensorflow::Tensor::buf_:

```cpp
Tensor::Tensor(Allocator* a, DataType type, const TensorShape& shape)
    : shape_(shape), buf_(nullptr) {
  set_dtype(type);
  CHECK_NOTNULL(a);
  if (shape_.num_elements() > 0 || a->ShouldAllocateEmptyTensors()) {
    CASES(type, buf_ = new Buffer<T>(a, shape.num_elements()));
  }
}
```
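
This constructor can be exercised directly with the default CPU allocator; a small usage sketch (cpu_allocator() is the default CPU Allocator* declared in tensorflow/core/framework/allocator.h):

```cpp
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"

void ConstructExample() {
  tensorflow::Allocator* a = tensorflow::cpu_allocator();
  // The allocator is forwarded to the newly created Buffer<T> (buf_).
  tensorflow::Tensor t(a, tensorflow::DT_FLOAT, tensorflow::TensorShape({2, 3}));
  // The two-argument form Tensor(DT_FLOAT, shape) is equivalent to passing
  // cpu_allocator() explicitly.
}
```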

tensorflow::Buffer's constructor then saves a into its parent class tensorflow::BufferBase's alloc_ field and calls Allocator::Allocate<T>:

```cpp
template <typename T>
Buffer<T>::Buffer(Allocator* a, int64 n)
    : BufferBase(a), data_(a->Allocate<T>(n)), elem_(n) {}
```

Allocator::Allocate<T> calls Allocator::AllocateRaw and then calls type T's constructors via Allocator::RunCtor<T>:

```cpp
template <typename T>
T* Allocate(size_t num_elements, const AllocationAttributes& allocation_attr) {
  ...
  void* p = AllocateRaw(kAllocatorAlignment, sizeof(T) * num_elements,
                        allocation_attr);
  T* typed_p = reinterpret_cast<T*>(p);
  if (typed_p) RunCtor<T>(typed_p, num_elements);
  return typed_p;
}
```

By default, Allocator::RunCtor<T> is a no-op, so it doesn't construct basic types. A specialization runs the string type's constructor:

```cpp
template <>
inline void Allocator::RunCtor(string* p, size_t n) {
  RunStringCtor(p, n);
}
```

Similarly, there are corresponding Allocator::RunDtor<T> definitions.
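
The idea behind this ctor/dtor pair can be summarized with a simplified, self-contained sketch (not TensorFlow's exact code): construction and destruction are no-ops for trivial element types, but must be run explicitly for types like std::string that live in a raw buffer.

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <type_traits>

// No-op by default: trivial element types need no explicit construction.
template <typename T>
void RunCtor(T* p, size_t n) {
  static_assert(std::is_trivially_default_constructible<T>::value,
                "only trivial types may skip construction");
}

template <typename T>
void RunDtor(T* p, size_t n) {}

// Specializations for std::string: placement-new each element, and run the
// destructor explicitly before the raw buffer is freed.
template <>
void RunCtor<std::string>(std::string* p, size_t n) {
  for (size_t i = 0; i < n; ++i) new (p + i) std::string();
}

template <>
void RunDtor<std::string>(std::string* p, size_t n) {
  for (size_t i = 0; i < n; ++i) p[i].~basic_string();
}
```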

Allocator::AllocateRaw calls port::AlignedMalloc:

```cpp
void* AllocateRaw(size_t alignment, size_t num_bytes) override {
  void* p = port::AlignedMalloc(num_bytes, alignment);
  ...
  return p;
}
```

and Allocator::DeallocateRaw calls port::AlignedFree:

```cpp
void DeallocateRaw(void* ptr) override {
  ...
  port::AlignedFree(ptr);
}
```
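
Putting the two overrides together, a minimal custom allocator could look like the following sketch (the class name is hypothetical; TensorFlow's real CPU allocator also does bookkeeping and statistics):

```cpp
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/platform/mem.h"

// Hypothetical minimal Allocator subclass, shown only to illustrate how
// AllocateRaw/DeallocateRaw delegate to port::AlignedMalloc/AlignedFree.
class MyAlignedAllocator : public tensorflow::Allocator {
 public:
  std::string Name() override { return "my_aligned_allocator"; }

  void* AllocateRaw(size_t alignment, size_t num_bytes) override {
    return tensorflow::port::AlignedMalloc(num_bytes, alignment);
  }

  void DeallocateRaw(void* ptr) override {
    tensorflow::port::AlignedFree(ptr);
  }
};
```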

port::AlignedMalloc, port::AlignedFree, and other platform-independent memory allocation functions are declared in tensorflow/core/platform/mem.h:

```cpp
namespace tensorflow {
namespace port {
void* AlignedMalloc(size_t size, int minimum_alignment);
void AlignedFree(void* aligned_memory);
void* Malloc(size_t size);
void* Realloc(void* ptr, size_t size);
void Free(void* ptr);
}  // namespace port
}  // namespace tensorflow
```

There are two implementations:

  1. The POSIX implementation in tensorflow/core/platform/posix/port.cc just calls POSIX C-runtime functions like malloc (see also the AlignedMalloc sketch after this list). For example:

     ```cpp
     void* Malloc(size_t size) {
     #ifdef TENSORFLOW_USE_JEMALLOC
       return jemalloc_malloc(size);
     #else
       return malloc(size);
     #endif
     }
     ```

  2. The Windows implementation in tensorflow/core/platform/windows/port.cc is almost identical to the POSIX one, because the C-runtime functions are almost the same.
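
For reference, a simplified sketch of how the POSIX AlignedMalloc/AlignedFree pair can be built on posix_memalign (the real port.cc additionally handles jemalloc and a minimum-alignment requirement):

```cpp
#include <stdlib.h>

void* AlignedMalloc(size_t size, int minimum_alignment) {
  void* ptr = nullptr;
  // posix_memalign requires the alignment to be at least sizeof(void*)
  // and a power of two; fall back to plain malloc for smaller alignments.
  if (minimum_alignment < static_cast<int>(sizeof(void*))) return malloc(size);
  if (posix_memalign(&ptr, minimum_alignment, size) != 0) return nullptr;
  return ptr;
}

void AlignedFree(void* aligned_memory) { free(aligned_memory); }
```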

Question: GPU Memory

Both implementations above allocate CPU memory, not GPU memory.

The TensorFlow codebase doesn't call cudaMalloc. Instead, there is one function, perftools::gputools::cuda::CUDADriver::DeviceAllocate, that calls cuMemAlloc:

```cpp
/* static */ void *CUDADriver::DeviceAllocate(CudaContext *context, uint64 bytes) {
  ...
  CUresult res = cuMemAlloc(&result, bytes);
  ...
}
```

The CUDADriver class includes a set of static methods, each of which corresponds to a CUDA driver API call. For example, CUDADriver::DeviceDeallocate calls cuMemFree:

```cpp
/* static */ void CUDADriver::DeviceDeallocate(CudaContext* context, void *location) {
  ...
  CUresult res = cuMemFree(pointer);
  ...
}
```

Only CUDAExecutor::Allocate(uint64 size) calls CUDADriver::DeviceAllocate(context_, size):

```cpp
void *CUDAExecutor::Allocate(uint64 size) {
  return CUDADriver::DeviceAllocate(context_, size);
}
```
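
One plausible way such a call chain could be wired up is an Allocator whose raw allocation delegates to a StreamExecutor. The sketch below is purely hypothetical (the class name and wiring are illustrative, not the actual TensorFlow GPU allocator), but it shows how an AllocateRaw call could eventually reach CUDAExecutor::Allocate and therefore cuMemAlloc:

```cpp
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/stream_executor/stream_executor.h"

namespace se = perftools::gputools;

// Hypothetical sketch: forward raw allocations to a StreamExecutor, which in
// turn calls CUDAExecutor::Allocate and CUDADriver::DeviceAllocate.
class ExecutorBackedAllocator : public tensorflow::Allocator {
 public:
  explicit ExecutorBackedAllocator(se::StreamExecutor* exec) : exec_(exec) {}

  std::string Name() override { return "executor_backed_allocator"; }

  void* AllocateRaw(size_t /*alignment*/, size_t num_bytes) override {
    // AllocateArray<char>(n) returns a DeviceMemory<char> wrapping the device
    // pointer; opaque() exposes it as void*.
    return exec_->AllocateArray<char>(num_bytes).opaque();
  }

  void DeallocateRaw(void* ptr) override {
    se::DeviceMemoryBase mem(ptr);
    exec_->Deallocate(&mem);
  }

 private:
  se::StreamExecutor* exec_;
};
```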

I haven't yet figured out how (or whether) Tensor calls CUDAExecutor::Allocate for GPU memory.
