
TensorFlow's Tensor

tensorflow::Tensor represents an n-dimensional array of values, like caffe2::Tensor.

Different from caffe2::Tensor<Context>, which is a class template, tensorflow::Tensor is a class.

caffe2::Tensor<Context>'s constructor doesn't allocate memory; instead, memory allocation is delayed until mutable_data is called. In contrast, tensorflow::Tensor's constructor allocates the memory.

caffe2::Tensor<Context>'s template methods data<T> and mutable_data<T> can return an array of elements of any type -- caffe2::Tensor::meta_ records the most recently returned (and allocated) element type. In contrast, tensorflow::Tensor's constructor accepts a DataType parameter that specifies the element type.
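
The difference can be illustrated with a minimal sketch. The tensorflow::Tensor calls use the public API; the caffe2::Tensor<Context> constructor signature shown here is the older templated API and should be taken as an approximation:

```cpp
#include <vector>

#include "caffe2/core/context.h"
#include "caffe2/core/tensor.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"

void Example() {
  // TensorFlow: the element type is fixed at construction time, and memory
  // is allocated right away.
  tensorflow::Tensor t(tensorflow::DT_FLOAT, tensorflow::TensorShape({2, 3}));
  auto flat = t.flat<float>();  // typed view over the already-allocated buffer

  // Caffe2 (older templated API, approximate): the constructor only records
  // the shape; the element type is fixed and memory is allocated at the first
  // mutable_data<T>() call.
  caffe2::Tensor<caffe2::CPUContext> c(std::vector<caffe2::TIndex>{2, 3});
  float* data = c.mutable_data<float>();  // allocation happens here
}
```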

caffe2::Tensor<Context> supports only numerically typed elements, whereas tensorflow::Tensor also supports string-typed elements.

caffe2::Tensor<Context> doesn't support accessing data in protobuf messages, whereas tensorflow::Tensor does.

caffe2::Tensor<Context>'s destructor doesn't free memory; instead, its data member shared_ptr<T> data_ does. In contrast, tensorflow::Tensor's destructor is responsible for freeing the memory. In addition, tensorflow::Tensor reference-counts the memory itself, whereas caffe2::Tensor<Context> relies on shared_ptr for that.

TensorShape

The shape of a tensor is represented by tensorflow::TensorShape, which can be constructed from a list of int64 values, or from a protobuf message TensorShapeProto.
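
For example, both construction paths look like the following (using TensorFlow's public headers):

```cpp
#include "tensorflow/core/framework/tensor_shape.h"
#include "tensorflow/core/framework/tensor_shape.pb.h"

void ShapeExample() {
  // From a list of int64 dimension sizes.
  tensorflow::TensorShape s({2, 3, 4});
  // s.dims() == 3, s.dim_size(1) == 3, s.num_elements() == 24

  // From a TensorShapeProto protobuf message.
  tensorflow::TensorShapeProto proto;
  proto.add_dim()->set_size(2);
  proto.add_dim()->set_size(3);
  tensorflow::TensorShape from_proto(proto);
}
```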

TensorShape supports various representations of a shape because most tensors are low-dimensional. This brings more complexity than Caffe2's use of vector<int64_t>. Indeed, tensor_shape.h and tensor_shape.cc take 759 lines of C++ code in total -- more than the handier majel::Dim, which takes 498 lines.

Memory Management

The constructor of tensorflow::Tensor accepts a parameter Allocator* a and passes it to a newly created tensorflow::Buffer object tensorflow::Tensor::buf_:

```cpp
Tensor::Tensor(Allocator* a, DataType type, const TensorShape& shape)
    : shape_(shape), buf_(nullptr) {
  set_dtype(type);
  CHECK_NOTNULL(a);
  if (shape_.num_elements() > 0 || a->ShouldAllocateEmptyTensors()) {
    CASES(type, buf_ = new Buffer<T>(a, shape.num_elements()));
  }
}
```
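
This constructor can be exercised directly with the default CPU allocator; a small usage sketch (cpu_allocator() is the default CPU Allocator* declared in tensorflow/core/framework/allocator.h):

```cpp
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"

void ConstructExample() {
  tensorflow::Allocator* a = tensorflow::cpu_allocator();
  // The allocator is forwarded to the newly created Buffer<T> (buf_).
  tensorflow::Tensor t(a, tensorflow::DT_FLOAT, tensorflow::TensorShape({2, 3}));
  // The two-argument form Tensor(DT_FLOAT, shape) is equivalent to passing
  // cpu_allocator() explicitly.
}
```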

tensorflow::Buffer's constructor then saves a into its parent class tensorflow::BufferBase's alloc_ field and calls Allocator::Allocate<T>:

```cpp
template <typename T>
Buffer<T>::Buffer(Allocator* a, int64 n)
    : BufferBase(a), data_(a->Allocate<T>(n)), elem_(n) {}
```

Allocator::Allocate<T> calls Allocator::AllocateRaw and then calls type T's constructors via Allocator::RunCtor<T>:

```cpp
template <typename T>
T* Allocate(size_t num_elements, const AllocationAttributes& allocation_attr) {
  ...
  void* p = AllocateRaw(kAllocatorAlignment, sizeof(T) * num_elements,
                        allocation_attr);
  T* typed_p = reinterpret_cast<T*>(p);
  if (typed_p) RunCtor<T>(typed_p, num_elements);
  return typed_p;
}
```

By default, Allocator::RunCtor<T> is a no-op, so it doesn't construct basic types. A specialization runs the string type's constructor:

```cpp
template <>
inline void Allocator::RunCtor(string* p, size_t n) {
  RunStringCtor(p, n);
}
```

Similarly, there are corresponding Allocator::RunDtor<T> definitions.
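
The idea behind this ctor/dtor pair can be summarized with a simplified, self-contained sketch (not TensorFlow's exact code): construction and destruction are no-ops for trivial element types, but must be run explicitly for types like std::string that live in a raw buffer.

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <type_traits>

// No-op by default: trivial element types need no explicit construction.
template <typename T>
void RunCtor(T* p, size_t n) {
  static_assert(std::is_trivially_default_constructible<T>::value,
                "only trivial types may skip construction");
}

template <typename T>
void RunDtor(T* p, size_t n) {}

// Specializations for std::string: placement-new each element, and run the
// destructor explicitly before the raw buffer is freed.
template <>
void RunCtor<std::string>(std::string* p, size_t n) {
  for (size_t i = 0; i < n; ++i) new (p + i) std::string();
}

template <>
void RunDtor<std::string>(std::string* p, size_t n) {
  for (size_t i = 0; i < n; ++i) p[i].~basic_string();
}
```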

Allocator::AllocateRaw calls port::AlignedMalloc:

```cpp
void* AllocateRaw(size_t alignment, size_t num_bytes) override {
  void* p = port::AlignedMalloc(num_bytes, alignment);
  ...
  return p;
}
```

and Allocator::DeallocateRaw calls port::AlignedFree:

```cpp
void DeallocateRaw(void* ptr) override {
  ...
  port::AlignedFree(ptr);
}
```
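
Putting the two overrides together, a minimal custom allocator could look like the following sketch (the class name is hypothetical; TensorFlow's real CPU allocator also does bookkeeping and statistics):

```cpp
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/platform/mem.h"

// Hypothetical minimal Allocator subclass, shown only to illustrate how
// AllocateRaw/DeallocateRaw delegate to port::AlignedMalloc/AlignedFree.
class MyAlignedAllocator : public tensorflow::Allocator {
 public:
  std::string Name() override { return "my_aligned_allocator"; }

  void* AllocateRaw(size_t alignment, size_t num_bytes) override {
    return tensorflow::port::AlignedMalloc(num_bytes, alignment);
  }

  void DeallocateRaw(void* ptr) override {
    tensorflow::port::AlignedFree(ptr);
  }
};
```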

port::AlignedMalloc, port::AlignedFree, and other platform-independent memory allocation functions are declared in tensorflow/core/platform/mem.h:

```cpp
namespace tensorflow {
namespace port {
void* AlignedMalloc(size_t size, int minimum_alignment);
void AlignedFree(void* aligned_memory);
void* Malloc(size_t size);
void* Realloc(void* ptr, size_t size);
void Free(void* ptr);
}  // namespace port
}  // namespace tensorflow
```

There are two implementations:

  1. The POSIX implementation in tensorflow/core/platform/posix/port.cc just calls POSIX C-runtime functions like malloc (see also the AlignedMalloc sketch after this list). For example:

     ```cpp
     void* Malloc(size_t size) {
     #ifdef TENSORFLOW_USE_JEMALLOC
       return jemalloc_malloc(size);
     #else
       return malloc(size);
     #endif
     }
     ```

  2. The Windows implementation in tensorflow/core/platform/windows/port.cc is almost identical to the POSIX one, because the C-runtime functions are almost the same.
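
For reference, a simplified sketch of how the POSIX AlignedMalloc/AlignedFree pair can be built on posix_memalign (the real port.cc additionally handles jemalloc and a minimum-alignment requirement):

```cpp
#include <stdlib.h>

void* AlignedMalloc(size_t size, int minimum_alignment) {
  void* ptr = nullptr;
  // posix_memalign requires the alignment to be at least sizeof(void*)
  // and a power of two; fall back to plain malloc for smaller alignments.
  if (minimum_alignment < static_cast<int>(sizeof(void*))) return malloc(size);
  if (posix_memalign(&ptr, minimum_alignment, size) != 0) return nullptr;
  return ptr;
}

void AlignedFree(void* aligned_memory) { free(aligned_memory); }
```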

Question: GPU Memory

Both implementations above allocate CPU memory, not GPU memory.

The TensorFlow codebase doesn't call cudaMalloc. Instead, there is one function, perftools::gputools::cuda::CUDADriver::DeviceAllocate, that calls cuMemAlloc:

```cpp
/* static */ void *CUDADriver::DeviceAllocate(CudaContext *context, uint64 bytes) {
  ...
  CUresult res = cuMemAlloc(&result, bytes);
  ...
}
```

The CUDADriver class includes a set of static methods, each of which corresponds to a CUDA driver API call. For example, CUDADriver::DeviceDeallocate calls cuMemFree:

```cpp
/* static */ void CUDADriver::DeviceDeallocate(CudaContext* context, void *location) {
  ...
  CUresult res = cuMemFree(pointer);
  ...
}
```

Only CUDAExecutor::Allocate(uint64 size) calls CUDADriver::DeviceAllocate(context_, size):

```cpp
void *CUDAExecutor::Allocate(uint64 size) {
  return CUDADriver::DeviceAllocate(context_, size);
}
```
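
One plausible way such a call chain could be wired up is an Allocator whose raw allocation delegates to a StreamExecutor. The sketch below is purely hypothetical (the class name and wiring are illustrative, not the actual TensorFlow GPU allocator), but it shows how an AllocateRaw call could eventually reach CUDAExecutor::Allocate and therefore cuMemAlloc:

```cpp
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/stream_executor/stream_executor.h"

namespace se = perftools::gputools;

// Hypothetical sketch: forward raw allocations to a StreamExecutor, which in
// turn calls CUDAExecutor::Allocate and CUDADriver::DeviceAllocate.
class ExecutorBackedAllocator : public tensorflow::Allocator {
 public:
  explicit ExecutorBackedAllocator(se::StreamExecutor* exec) : exec_(exec) {}

  std::string Name() override { return "executor_backed_allocator"; }

  void* AllocateRaw(size_t /*alignment*/, size_t num_bytes) override {
    // AllocateArray<char>(n) returns a DeviceMemory<char> wrapping the device
    // pointer; opaque() exposes it as void*.
    return exec_->AllocateArray<char>(num_bytes).opaque();
  }

  void DeallocateRaw(void* ptr) override {
    se::DeviceMemoryBase mem(ptr);
    exec_->Deallocate(&mem);
  }

 private:
  se::StreamExecutor* exec_;
};
```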

I haven't yet figured out how (or whether) Tensor calls CUDAExecutor::Allocate for GPU memory.
