TensorFlow内存管理bfc算法实例

作者:齐豪 时间:2023-09-08 21:42:24 

1. 基本介绍

tensorflow设备内存管理模块实现了一个best-fit with coalescing算法(后文简称bfc算法)。

bfc算法是Doung Lea's malloc(dlmalloc)的一个非常简单的版本。

它具有内存分配、释放、碎片管理等基本功能。

2. bfc基本算法思想

1. 数据结构

整个内存空间由一个按基址升序排列的Chunk双向链表来表示,它们的直接前趋和后继必须在地址连续的内存空间。Chunk结构体里含有实际大小、请求大小、是否被占用、基址、直接前趋、直接后继、Bin索引等信息。

TensorFlow内存管理bfc算法实例

2. 申请

用户申请一个内存块(malloc)。根据chunk双链表找到一个合适的内存块,如果该内存块的大小是用户申请的大小的二倍以上,那么就将该内存块切分成两块,这就是split操作。

返回其中一块给用户,并将该内存块标识为占用

Spilt操作会新增一个chunk,所以需要修改chunk双链表以维持前驱和后继关系

如果用户申请512的空间,正好有一块1024的chunk2是空闲的,由于1024/512 =2,所以chunk2 被split为2块:chunk2_1和chunk2_2。返回chunk2_1给用户并将其标志位占用状态。

3. 释放

用户释放一个内存块(free)。先将该块标记为空闲。然后根据chunk数据结构中的信息找到其前驱和后继内存块。如果前驱和后继块中有空闲的块,那么将刚释放的块和空闲的块合并成一个更大的chunk(这就是merge操作,合并当前块和其前后的空闲块)。再修改双链表结构以维持前驱后继关系。这就做到了内存碎片的回收。

如果用户要free chunk3,由于chunk3的前驱chunk2也是空闲的,所以将chunk2和chunk3合并得到一个新的chunk2',大小为chunk2和chunk3之和。

3. bins

1. bins数据结构

bfc算法采取的是被动分块的策略。最开始整个内存是一个chunk,随着用户申请空间的次数增加,最开始的大chunk会被不断的split开来,从而产生越来越多的小chunk。当chunk数量很大时,为了寻找一个合适的内存块而遍历双链表无疑是一笔巨大的开销。为了实现对空闲块的高效管理,bfc算法设计了bin这个抽象数据结构。

每个bin都有一个size属性,一个bin是一个拥有chunk size >= binsize的空闲chunk的集合。集合中的chunk按照chunk size的升序组织成单链表。bfc算法维护了一个bin的集合:bins。它由多个bin以及从属于每个bin的chunks组成。内存中所有的空闲chunk都由bins管理。

TensorFlow内存管理bfc算法实例

图中每一列表示一个bin,列首方格中的数字表示bin的size。bin size的大小都是256的2^n的倍。每个bin下面挂载了一系列的空闲chunk,每个chunk的chunk size都大于等于所属的bin的bin size,按照chunk size的升序挂载成单链表。

2. bins操作

bfc算法针对bins这个集合设计了三个操作:search、insert、delete。

search

给定一个chunk size,从bins中找到大于等于该chunksize的最小的那个空闲chunk。Search操作具体流程如下。如果bin以数组的形式组织,那么可以从index = chunk size /256 >>2的那个bin开始查找。最好的情况是开始查找的那个bin的chunk链表非空,那么直接返回链表头即可。这种情况时间复杂度是常数级的。最坏的情况是遍历bins数组中所有的bin。对于一般大小的内存来说,bins数组元素非常少,比如4G空间只需要23个bin就足够了(256 * 2 ^ 23 > 4G),因此也很快能返回结果。总体来说search操作是非常高效的。对于固定大小内存来说,查找时间是常数量级的。

insert

将一个空闲的chunk插入到一个bin所挂载的chunk链表中,同时需要维持chunk链表的升序关系。具体流程是直接将chunk插入到index = chunk size /256 >>2的那个bin中即可。

delete

将一个空闲的chunk从bins中移除。

4. 总结

将内存分块管理,按块进行空间分配和释放。

通过split操作将大内存块分解成用户需要的小内存块。

通过merge操作合并小的内存块,做到内存碎片回收

通过bin这个抽象数据结构实现对空闲块高效管理。

5. 代码分析

1. 代码地址

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/common_runtime

2. 数据结构

Chunk


static const int kInvalidChunkHandle = -1;
...
struct Chunk {
 size_t size = 0; // Full size of buffer.

// We sometimes give chunks that are larger than needed to reduce
 // fragmentation. requested_size keeps track of what the client
 // actually wanted so we can understand whether our splitting
 // strategy is efficient.
 size_t requested_size = 0;

// allocation_id is set to -1 when the chunk is not in use. It is assigned a
 // value greater than zero before the chunk is returned from
 // AllocateRaw, and this value is unique among values assigned by
 // the parent allocator.
 int64 allocation_id = -1;
 void* ptr = nullptr; // pointer to granted subbuffer.

// If not kInvalidChunkHandle, the memory referred to by 'prev' is directly
 // preceding the memory used by this chunk. E.g., It should start
 // at 'ptr - prev->size'
 ChunkHandle prev = kInvalidChunkHandle;

// If not kInvalidChunkHandle, the memory referred to by 'next' is directly
 // following the memory used by this chunk. E.g., It should be at
 // 'ptr + size'
 ChunkHandle next = kInvalidChunkHandle;

// What bin are we in?
 BinNum bin_num = kInvalidBinNum;

bool in_use() const { return allocation_id != -1; }
};

Bin


// A Bin is a collection of similar-sized free chunks.
struct Bin {
 // All chunks in this bin have >= bin_size memory.
 size_t bin_size = 0;

struct ChunkComparator {
   explicit ChunkComparator(BFCAllocator* allocator)
     : allocator_(allocator) {}
   // Sort first by size and then use pointer address as a tie breaker.
   bool operator()(const ChunkHandle ha,
           const ChunkHandle hb) const NO_THREAD_SAFETY_ANALYSIS {
     const Chunk* a = allocator_->ChunkFromHandle(ha);
     const Chunk* b = allocator_->ChunkFromHandle(hb);
     if (a->size != b->size) {
       return a->size < b->size;
     }
     return a->ptr < b->ptr;
   }

private:
     BFCAllocator* allocator_; // The parent allocator
 };

typedef std::set<ChunkHandle, ChunkComparator> FreeChunkSet;
 // List of free chunks within the bin, sorted by chunk size.
 // Chunk * not owned.
 FreeChunkSet free_chunks;
 Bin(BFCAllocator* allocator, size_t bs)
   : bin_size(bs), free_chunks(ChunkComparator(allocator)) {}
};

AllocationRegion

AllocationRegion给一个连续的内存区域做指针到ChunkHandle的映射。

RegionManager

RegionManager聚集了一个或多个AllocationRegion,并提供一个从指针到基础ChunkHandle的间接层,这个间接层可在多个不连续的内存区域进行分配。

3. 分配大小

将每次分配的内存大小调整为kMinAllocationSize的N倍,这样所有内存地址都是很好地按字节对齐了。


// kMinAllocationSize = 256
static const size_t kMinAllocationBits = 8;
static const size_t kMinAllocationSize = 1 << kMinAllocationBits;
...
size_t BFCAllocator::RoundedBytes(size_t bytes) {
 size_t rounded_bytes =
   (kMinAllocationSize *
   ((bytes + kMinAllocationSize - 1) / kMinAllocationSize));
 DCHECK_EQ(size_t{0}, rounded_bytes % kMinAllocationSize);
 return rounded_bytes;
}

4. 初始化bin


typedef int BinNum;
static const int kInvalidBinNum = -1;
static const int kNumBins = 21;
...
// 二进制2^8往左移0,1,2位
// (static_cast<size_t>(256) << 0) = 256
// (static_cast<size_t>(256) << 1) = 512
// (static_cast<size_t>(256) << 2) = 1024
size_t BinNumToSize(BinNum index) {
 return static_cast<size_t>(256) << index;
}
...
char bins_space_[sizeof(Bin) * kNumBins];
// Map from bin size to Bin
Bin* BinFromIndex(BinNum index) {
 return reinterpret_cast<Bin*>(&(bins_space_[index * sizeof(Bin)]));
}
...
// We create bins to fit all possible ranges that cover the
// memory_limit_ starting from allocations up to 256 bytes to
// allocations up to (and including) the memory limit.
for (BinNum b = 0; b < kNumBins; b++) {
 size_t bin_size = BinNumToSize(b);
 VLOG(1) << "Creating bin of max chunk size "
     << strings::HumanReadableNumBytes(bin_size);
 new (BinFromIndex(b)) Bin(this, bin_size);
 CHECK_EQ(BinForSize(bin_size), BinFromIndex(b));
 CHECK_EQ(BinForSize(bin_size + 255), BinFromIndex(b));
 CHECK_EQ(BinForSize(bin_size * 2 - 1), BinFromIndex(b));
 if (b + 1 < kNumBins) {
   CHECK_NE(BinForSize(bin_size * 2), BinFromIndex(b));
 }
}

5. 查找bin


// 求属于第几个bin
BinNum BinNumForSize(size_t bytes) {
 uint64 v = std::max<size_t>(bytes, 256) >> kMinAllocationBits;
 int b = std::min(kNumBins - 1, Log2FloorNonZero(v));
 return b;
}
// 最高位非零的二进制位数,eg: 0001 0101B 为5
inline int Log2FloorNonZero(uint64 n) {
#if defined(__GNUC__)
 return 63 ^ __builtin_clzll(n);
#elif defined(PLATFORM_WINDOWS)
 unsigned long index;
 _BitScanReverse64(&index, n);
 return index;
#else
 int r = 0;
 while (n > 0) {
   r++;
   n >>= 1;
 }
 return r;
#endif
}

6. 查找Chunk


// 先加锁
mutex_lock l(lock_);
void* ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes);
if (ptr != nullptr) {
 return ptr;
}
// FindChunkPtr函数内部
void* BFCAllocator::FindChunkPtr(BinNum bin_num, size_t rounded_bytes,
                size_t num_bytes) {
 // First identify the first bin that could satisfy rounded_bytes.
 for (; bin_num < kNumBins; bin_num++) {
   // Start searching from the first bin for the smallest chunk that fits
   // rounded_bytes.
   Bin* b = BinFromIndex(bin_num);
   for (auto citer = b->free_chunks.begin(); citer != b->free_chunks.end();
       ++citer) {
     // 从之前得到的Bin索引开始,查找合适的空闲Chunk:
     const BFCAllocator::ChunkHandle h = (*citer);
     BFCAllocator::Chunk* chunk = ChunkFromHandle(h);
     DCHECK(!chunk->in_use());
     if (chunk->size >= rounded_bytes) {
       // We found an existing chunk that fits us that wasn't in use, so remove
       // it from the free bin structure prior to using.
       RemoveFreeChunkIterFromBin(&b->free_chunks, citer);

// If we can break the size of the chunk into two reasonably
       // large pieces, do so.
       //
       // TODO(vrv): What should be the criteria when deciding when
       // to split?
       // 具体实现后面会分析
       if (chunk->size >= rounded_bytes * 2) {
         SplitChunk(h, rounded_bytes);
         chunk = ChunkFromHandle(h); // Update chunk pointer in case it moved
       }

// The requested size of the returned chunk is what the user
       // has allocated.
       chunk->requested_size = num_bytes;
       // Assign a unique id and increment the id counter, marking the
       // chunk as being in use.
       chunk->allocation_id = next_allocation_id_++;

// Update stats.
       ++stats_.num_allocs;
       stats_.bytes_in_use += chunk->size;
       stats_.max_bytes_in_use =
         std::max(stats_.max_bytes_in_use, stats_.bytes_in_use);
       stats_.max_alloc_size =
         std::max<std::size_t>(stats_.max_alloc_size, chunk->size);

VLOG(4) << "Returning: " << chunk->ptr;
       if (VLOG_IS_ON(4)) {
         LOG(INFO) << "A: " << RenderOccupancy();
       }
       return chunk->ptr;
     }
   }
 }
 return nullptr;
}

7. 拆分Chunk

如果Chunk的大小大于等于申请内存大小的2倍,那么将该Chunk拆分成2个:第一个Chunk的大小等于申请内存大小,第二个Chunk作为它的直接后继。


if (chunk->size >= rounded_bytes * 2) {
 SplitChunk(h, rounded_bytes);
 chunk = ChunkFromHandle(h); // Update chunk pointer in case it moved
}

void BFCAllocator::SplitChunk(BFCAllocator::ChunkHandle h, size_t num_bytes) {
 // Allocate the new chunk before we do any ChunkFromHandle
 ChunkHandle h_new_chunk = AllocateChunk();

Chunk* c = ChunkFromHandle(h);
 CHECK(!c->in_use() && (c->bin_num == kInvalidBinNum));

// Create a new chunk starting num_bytes after c
 BFCAllocator::Chunk* new_chunk = ChunkFromHandle(h_new_chunk);
 new_chunk->ptr = static_cast<void*>(static_cast<char*>(c->ptr) + num_bytes);
 region_manager_.set_handle(new_chunk->ptr, h_new_chunk);

// Set the new sizes of the chunks.
 new_chunk->size = c->size - num_bytes;
 c->size = num_bytes;

// The new chunk is not in use.
 new_chunk->allocation_id = -1;

// Maintain the pointers.
 // c <-> c_neighbor becomes
 // c <-> new_chunk <-> c_neighbor
 BFCAllocator::ChunkHandle h_neighbor = c->next;
 new_chunk->prev = h;
 new_chunk->next = h_neighbor;
 c->next = h_new_chunk;
 if (h_neighbor != kInvalidChunkHandle) {
   Chunk* c_neighbor = ChunkFromHandle(h_neighbor);
   c_neighbor->prev = h_new_chunk;
 }

// Add the newly free chunk to the free bin.
 InsertFreeChunkIntoBin(h_new_chunk);
}

8. 回收chunk

加锁,获得ChunkHandle


mutex_lock l(lock_);
BFCAllocator::ChunkHandle h = region_manager_.get_handle(ptr);
FreeAndMaybeCoalesce(h);

FreeAndMaybeCoalesce


void BFCAllocator::FreeAndMaybeCoalesce(BFCAllocator::ChunkHandle h) {
 Chunk* c = ChunkFromHandle(h);
 CHECK(c->in_use() && (c->bin_num == kInvalidBinNum));

// Mark the chunk as no longer in use
 c->allocation_id = -1;

// Updates the stats.
 stats_.bytes_in_use -= c->size;

// This chunk is no longer in-use, consider coalescing the chunk
 // with adjacent chunks.
 ChunkHandle chunk_to_reassign = h;

// If the next chunk is free, coalesce the two
 if (c->next != kInvalidChunkHandle) {
   Chunk* cnext = ChunkFromHandle(c->next);
   if (!cnext->in_use()) {
   //   VLOG(8) << "Chunk at " << cnext->ptr << " merging with c " <<
   //   c->ptr;

chunk_to_reassign = h;

// Deletes c->next
   RemoveFreeChunkFromBin(c->next);
   Merge(h, ChunkFromHandle(h)->next);
   }
 }

// If the previous chunk is free, coalesce the two
 c = ChunkFromHandle(h);
 if (c->prev != kInvalidChunkHandle) {
   Chunk* cprev = ChunkFromHandle(c->prev);
   if (!cprev->in_use()) {
   //   VLOG(8) << "Chunk at " << c->ptr << " merging into c->prev "
   //    << cprev->ptr;

chunk_to_reassign = c->prev;

// Deletes c
   RemoveFreeChunkFromBin(c->prev);
   Merge(ChunkFromHandle(h)->prev, h);
   c = ChunkFromHandle(h);
   }
 }

InsertFreeChunkIntoBin(chunk_to_reassign);
}

Merge


// Merges h1 and h2 when Chunk(h1)->next is h2 and Chunk(h2)->prev is c1.
// We merge Chunk(h2) into Chunk(h1).
void BFCAllocator::Merge(BFCAllocator::ChunkHandle h1,
            BFCAllocator::ChunkHandle h2) {
 Chunk* c1 = ChunkFromHandle(h1);
 Chunk* c2 = ChunkFromHandle(h2);
 // We can only merge chunks that are not in use.
 CHECK(!c1->in_use() && !c2->in_use());

// c1's prev doesn't change, still points to the same ptr, and is
 // still not in use.

// Fix up neighbor pointers
 //
 // c1 <-> c2 <-> c3 should become
 // c1 <-> c3

BFCAllocator::ChunkHandle h3 = c2->next;
 c1->next = h3;
 CHECK(c2->prev == h1);
 if (h3 != kInvalidChunkHandle) {
   BFCAllocator::Chunk* c3 = ChunkFromHandle(h3);
   c3->prev = h1;
 }

// Set the new size
 c1->size += c2->size;

DeleteChunk(h2);
}

来源:https://blog.csdn.net/qq_33096883/article/details/77479647

标签:TensorFlow,内存,bfc
0
投稿

猜你喜欢

  • python 循环while和for in简单实例

    2021-12-11 03:16:48
  • python实现简单井字棋游戏

    2023-08-08 21:38:01
  • Go 语言进阶freecache源码学习教程

    2023-08-06 03:05:20
  • Request.ServerVariables("HTTP_REFERER")的用法

    2008-06-19 13:33:00
  • python读取txt数据的操作步骤

    2022-12-27 10:36:31
  • JS实现二维数组横纵列转置的方法

    2023-08-29 21:54:05
  • 5个有效改进网页UI设计的技巧

    2008-12-19 12:04:00
  • 如何在社区建立一个寻呼台?

    2009-11-08 18:59:00
  • Pygame代码 制作一个贪吃蛇小游戏

    2022-06-29 03:04:27
  • 常用的Git便捷操作合集

    2022-02-19 08:16:47
  • MySQL基于SSL协议进行主从复制的详细操作教程

    2024-01-24 23:10:35
  • 网页设计布局原则

    2010-04-20 17:18:00
  • 个人经验总结:完全卸载MySQL数据库5.0

    2009-01-04 13:07:00
  • 在ubuntu16.04中将python3设置为默认的命令写法

    2022-06-21 10:12:41
  • 20个优秀网站助你征服CSS[译]

    2008-09-21 13:21:00
  • JS的get和set使用示例

    2024-05-13 09:35:45
  • Python使用sorted对字典的key或value排序

    2023-12-12 06:36:53
  • 解决mysql安装时出现error Nr.1045问题的方法

    2024-01-18 11:34:30
  • Django接受前端数据的几种方法总结

    2021-11-26 23:32:53
  • python读取excel数据绘制简单曲线图的完整步骤记录

    2022-04-27 10:52:18
  • asp之家 网络编程 m.aspxhome.com