pytorch中DataLoader()过程中遇到的一些问题

作者:hfw6310 时间:2022-01-17 18:36:11 

如下所示:

RuntimeError: stack expects each tensor to be equal size, but got [3, 60, 32] at entry 0 and [3, 54, 32] at entry 2


train_dataset = datasets.ImageFolder(
   traindir,
   transforms.Compose([
       transforms.Resize((224)) ###

原因是

transforms.Resize() 的参数设置问题,改为如下设置就可以了


train_dataset = datasets.ImageFolder(
   traindir,
   transforms.Compose([
       transforms.Resize((224,224)),

同理,val_dataset中也调整为transforms.Resize((224,224))。

补充:pytorch之dataloader深入剖析

- dataloader本质是一个可迭代对象,使用iter()访问,不能使用next()访问;

- 使用iter(dataloader)返回的是一个迭代器,然后可以使用next访问;

- 也可以使用`for inputs, labels in dataloaders`进行可迭代对象的访问;

- 一般我们实现一个datasets对象,传入到dataloader中;然后内部使用yeild返回每一次batch的数据;

① DataLoader本质上就是一个iterable(跟python的内置类型list等一样),并利用多进程来加速batch data的处理,使用yield来使用有限的内存

② Queue的特点

当队列里面没有数据时: queue.get() 会阻塞, 阻塞的时候,其它进程/线程如果有queue.put() 操作,本线程/进程会被通知,然后就可以 get 成功。

当数据满了: queue.put() 会阻塞

③ DataLoader是一个高效,简洁,直观的网络输入数据结构,便于使用和扩展

输入数据PipeLine

pytorch 的数据加载到模型的操作顺序是这样的:

① 创建一个 Dataset 对象

② 创建一个 DataLoader 对象

③ 循环这个 DataLoader 对象,将img, label加载到模型中进行训练


dataset = MyDataset()
dataloader = DataLoader(dataset)
num_epoches = 100
for epoch in range(num_epoches):
for img, label in dataloader:
....

所以,作为直接对数据进入模型中的关键一步, DataLoader非常重要。

首先简单介绍一下DataLoader,它是PyTorch中数据读取的一个重要接口,该接口定义在dataloader.py中,只要是用PyTorch来训练模型基本都会用到该接口(除非用户重写…),该接口的目的:将自定义的Dataset根据batch size大小、是否shuffle等封装成一个Batch Size大小的Tensor,用于后面的训练。

官方对DataLoader的说明是:“数据加载由数据集和采样器组成,基于python的单、多进程的iterators来处理数据。”关于iterator和iterable的区别和概念请自行查阅,在实现中的差别就是iterators有__iter__和__next__方法,而iterable只有__iter__方法。

1.DataLoader

先介绍一下DataLoader(object)的参数:

dataset(Dataset): 传入的数据集

batch_size(int, optional): 每个batch有多少个样本

shuffle(bool, optional): 在每个epoch开始的时候,对数据进行重新排序

sampler(Sampler, optional): 自定义从数据集中取样本的策略,如果指定这个参数,那么shuffle必须为False

batch_sampler(Sampler, optional): 与sampler类似,但是一次只返回一个batch的indices(索引),需要注意的是,一旦指定了这个参数,那么batch_size,shuffle,sampler,drop_last就不能再制定了(互斥——Mutually exclusive)

num_workers (int, optional): 这个参数决定了有几个进程来处理data loading。0意味着所有的数据都会被load进主进程。(默认为0)

collate_fn (callable, optional): 将一个list的sample组成一个mini-batch的函数

pin_memory (bool, optional): 如果设置为True,那么data loader将会在返回它们之前,将tensors拷贝到CUDA中的固定内存(CUDA pinned memory)中.

drop_last (bool, optional): 如果设置为True:这个是对最后的未完成的batch来说的,比如你的batch_size设置为64,而一个epoch只有100个样本,那么训练的时候后面的36个就被扔掉了…

如果为False(默认),那么会继续正常执行,只是最后的batch_size会小一点。

timeout(numeric, optional): 如果是正数,表明等待从worker进程中收集一个batch等待的时间,若超出设定的时间还没有收集到,那就不收集这个内容了。这个numeric应总是大于等于0。默认为0

worker_init_fn (callable, optional): 每个worker初始化函数 If not None, this will be called on each


worker subprocess with the worker id (an int in [0, num_workers - 1]) as
input, after seeding and before data loading. (default: None)

- 首先dataloader初始化时得到datasets的采样list


class DataLoader(object):
   r"""
   Data loader. Combines a dataset and a sampler, and provides
   single- or multi-process iterators over the dataset.
   Arguments:
       dataset (Dataset): dataset from which to load the data.
       batch_size (int, optional): how many samples per batch to load
           (default: 1).
       shuffle (bool, optional): set to ``True`` to have the data reshuffled
           at every epoch (default: False).
       sampler (Sampler, optional): defines the strategy to draw samples from
           the dataset. If specified, ``shuffle`` must be False.
       batch_sampler (Sampler, optional): like sampler, but returns a batch of
           indices at a time. Mutually exclusive with batch_size, shuffle,
           sampler, and drop_last.
       num_workers (int, optional): how many subprocesses to use for data
           loading. 0 means that the data will be loaded in the main process.
           (default: 0)
       collate_fn (callable, optional): merges a list of samples to form a mini-batch.
       pin_memory (bool, optional): If ``True``, the data loader will copy tensors
           into CUDA pinned memory before returning them.
       drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
           if the dataset size is not divisible by the batch size. If ``False`` and
           the size of dataset is not divisible by the batch size, then the last batch
           will be smaller. (default: False)
       timeout (numeric, optional): if positive, the timeout value for collecting a batch
           from workers. Should always be non-negative. (default: 0)
       worker_init_fn (callable, optional): If not None, this will be called on each
           worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
           input, after seeding and before data loading. (default: None)
   .. note:: By default, each worker will have its PyTorch seed set to
             ``base_seed + worker_id``, where ``base_seed`` is a long generated
             by main process using its RNG. However, seeds for other libraies
             may be duplicated upon initializing workers (w.g., NumPy), causing
             each worker to return identical random numbers. (See
             :ref:`dataloader-workers-random-seed` section in FAQ.) You may
             use ``torch.initial_seed()`` to access the PyTorch seed for each
             worker in :attr:`worker_init_fn`, and use it to set other seeds
             before data loading.
   .. warning:: If ``spawn`` start method is used, :attr:`worker_init_fn` cannot be an
                unpicklable object, e.g., a lambda function.
   """
   __initialized = False
   def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None,
                num_workers=0, collate_fn=default_collate, pin_memory=False, drop_last=False,
                timeout=0, worker_init_fn=None):
       self.dataset = dataset
       self.batch_size = batch_size
       self.num_workers = num_workers
       self.collate_fn = collate_fn
       self.pin_memory = pin_memory
       self.drop_last = drop_last
       self.timeout = timeout
       self.worker_init_fn = worker_init_fn
       if timeout < 0:
           raise ValueError('timeout option should be non-negative')
       if batch_sampler is not None:
           if batch_size > 1 or shuffle or sampler is not None or drop_last:
               raise ValueError('batch_sampler option is mutually exclusive '
                                'with batch_size, shuffle, sampler, and '
                                'drop_last')
           self.batch_size = None
           self.drop_last = None
       if sampler is not None and shuffle:
           raise ValueError('sampler option is mutually exclusive with '
                            'shuffle')
       if self.num_workers < 0:
           raise ValueError('num_workers option cannot be negative; '
                            'use num_workers=0 to disable multiprocessing.')
       if batch_sampler is None:
           if sampler is None:
               if shuffle:
                   sampler = RandomSampler(dataset)  //将list打乱
               else:
                   sampler = SequentialSampler(dataset)
           batch_sampler = BatchSampler(sampler, batch_size, drop_last)
       self.sampler = sampler
       self.batch_sampler = batch_sampler
       self.__initialized = True
   def __setattr__(self, attr, val):
       if self.__initialized and attr in ('batch_size', 'sampler', 'drop_last'):
           raise ValueError('{} attribute should not be set after {} is '
                            'initialized'.format(attr, self.__class__.__name__))
       super(DataLoader, self).__setattr__(attr, val)
   def __iter__(self):
       return _DataLoaderIter(self)
   def __len__(self):
       return len(self.batch_sampler)

其中:RandomSampler,BatchSampler已经得到了采用batch数据的index索引;yield batch机制已经在!!!


class RandomSampler(Sampler):
   r"""Samples elements randomly, without replacement.
   Arguments:
       data_source (Dataset): dataset to sample from
   """
   def __init__(self, data_source):
       self.data_source = data_source
   def __iter__(self):
       return iter(torch.randperm(len(self.data_source)).tolist())
   def __len__(self):
       return len(self.data_source)

class BatchSampler(Sampler):
   r"""Wraps another sampler to yield a mini-batch of indices.
   Args:
       sampler (Sampler): Base sampler.
       batch_size (int): Size of mini-batch.
       drop_last (bool): If ``True``, the sampler will drop the last batch if
           its size would be less than ``batch_size``
   Example:
       >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
       [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
       >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
       [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
   """
   def __init__(self, sampler, batch_size, drop_last):
       if not isinstance(sampler, Sampler):
           raise ValueError("sampler should be an instance of "
                            "torch.utils.data.Sampler, but got sampler={}"
                            .format(sampler))
       if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
               batch_size <= 0:
           raise ValueError("batch_size should be a positive integeral value, "
                            "but got batch_size={}".format(batch_size))
       if not isinstance(drop_last, bool):
           raise ValueError("drop_last should be a boolean value, but got "
                            "drop_last={}".format(drop_last))
       self.sampler = sampler
       self.batch_size = batch_size
       self.drop_last = drop_last
   def __iter__(self):
       batch = []
       for idx in self.sampler:
           batch.append(idx)
           if len(batch) == self.batch_size:
               yield batch
               batch = []
       if len(batch) > 0 and not self.drop_last:
           yield batch
   def __len__(self):
       if self.drop_last:
           return len(self.sampler) // self.batch_size
       else:
           return (len(self.sampler) + self.batch_size - 1) // self.batch_size

- 其中 _DataLoaderIter(self)输入为一个dataloader对象;如果num_workers=0很好理解,num_workers!=0引入多线程机制,加速数据加载过程;

- 没有多线程时:batch = self.collate_fn([self.dataset[i] for i in indices])进行将index转化为data数据,返回(image,label);self.dataset[i]会调用datasets对象的

__getitem__()方法

- 多线程下,会为每个线程创建一个索引队列index_queues;共享一个worker_result_queue数据队列!在_worker_loop方法中加载数据;


class _DataLoaderIter(object):
   r"""Iterates once over the DataLoader's dataset, as specified by the sampler"""
   def __init__(self, loader):
       self.dataset = loader.dataset
       self.collate_fn = loader.collate_fn
       self.batch_sampler = loader.batch_sampler
       self.num_workers = loader.num_workers
       self.pin_memory = loader.pin_memory and torch.cuda.is_available()
       self.timeout = loader.timeout
       self.done_event = threading.Event()
       self.sample_iter = iter(self.batch_sampler)
       base_seed = torch.LongTensor(1).random_().item()
       if self.num_workers > 0:
           self.worker_init_fn = loader.worker_init_fn
           self.index_queues = [multiprocessing.Queue() for _ in range(self.num_workers)]
           self.worker_queue_idx = 0
           self.worker_result_queue = multiprocessing.SimpleQueue()
           self.batches_outstanding = 0
           self.worker_pids_set = False
           self.shutdown = False
           self.send_idx = 0
           self.rcvd_idx = 0
           self.reorder_dict = {}
           self.workers = [
               multiprocessing.Process(
                   target=_worker_loop,
                   args=(self.dataset, self.index_queues[i],
                         self.worker_result_queue, self.collate_fn, base_seed + i,
                         self.worker_init_fn, i))
               for i in range(self.num_workers)]
           if self.pin_memory or self.timeout > 0:
               self.data_queue = queue.Queue()
               if self.pin_memory:
                   maybe_device_id = torch.cuda.current_device()
               else:
                   # do not initialize cuda context if not necessary
                   maybe_device_id = None
               self.worker_manager_thread = threading.Thread(
                   target=_worker_manager_loop,
                   args=(self.worker_result_queue, self.data_queue, self.done_event, self.pin_memory,
                         maybe_device_id))
               self.worker_manager_thread.daemon = True
               self.worker_manager_thread.start()
           else:
               self.data_queue = self.worker_result_queue
           for w in self.workers:
               w.daemon = True  # ensure that the worker exits on process exit
               w.start()
           _update_worker_pids(id(self), tuple(w.pid for w in self.workers))
           _set_SIGCHLD_handler()
           self.worker_pids_set = True
           # prime the prefetch loop
           for _ in range(2 * self.num_workers):
               self._put_indices()
   def __len__(self):
       return len(self.batch_sampler)
   def _get_batch(self):
       if self.timeout > 0:
           try:
               return self.data_queue.get(timeout=self.timeout)
           except queue.Empty:
               raise RuntimeError('DataLoader timed out after {} seconds'.format(self.timeout))
       else:
           return self.data_queue.get()
   def __next__(self):
       if self.num_workers == 0:  # same-process loading
           indices = next(self.sample_iter)  # may raise StopIteration
           batch = self.collate_fn([self.dataset[i] for i in indices])
           if self.pin_memory:
               batch = pin_memory_batch(batch)
           return batch
       # check if the next sample has already been generated
       if self.rcvd_idx in self.reorder_dict:
           batch = self.reorder_dict.pop(self.rcvd_idx)
           return self._process_next_batch(batch)
       if self.batches_outstanding == 0:
           self._shutdown_workers()
           raise StopIteration
       while True:
           assert (not self.shutdown and self.batches_outstanding > 0)
           idx, batch = self._get_batch()
           self.batches_outstanding -= 1
           if idx != self.rcvd_idx:
               # store out-of-order samples
               self.reorder_dict[idx] = batch
               continue
           return self._process_next_batch(batch)
   next = __next__  # Python 2 compatibility
   def __iter__(self):
       return self
   def _put_indices(self):
       assert self.batches_outstanding < 2 * self.num_workers
       indices = next(self.sample_iter, None)
       if indices is None:
           return
       self.index_queues[self.worker_queue_idx].put((self.send_idx, indices))
       self.worker_queue_idx = (self.worker_queue_idx + 1) % self.num_workers
       self.batches_outstanding += 1
       self.send_idx += 1
   def _process_next_batch(self, batch):
       self.rcvd_idx += 1
       self._put_indices()
       if isinstance(batch, ExceptionWrapper):
           raise batch.exc_type(batch.exc_msg)
       return batch

def _worker_loop(dataset, index_queue, data_queue, collate_fn, seed, init_fn, worker_id):
   global _use_shared_memory
   _use_shared_memory = True
   # Intialize C side signal handlers for SIGBUS and SIGSEGV. Python signal
   # module's handlers are executed after Python returns from C low-level
   # handlers, likely when the same fatal signal happened again already.
   # https://docs.python.org/3/library/signal.html Sec. 18.8.1.1
   _set_worker_signal_handlers()
   torch.set_num_threads(1)
   random.seed(seed)
   torch.manual_seed(seed)
   if init_fn is not None:
       init_fn(worker_id)
   watchdog = ManagerWatchdog()
   while True:
       try:
           r = index_queue.get(timeout=MANAGER_STATUS_CHECK_INTERVAL)
       except queue.Empty:
           if watchdog.is_alive():
               continue
           else:
               break
       if r is None:
           break
       idx, batch_indices = r
       try:
           samples = collate_fn([dataset[i] for i in batch_indices])
       except Exception:
           data_queue.put((idx, ExceptionWrapper(sys.exc_info())))
       else:
           data_queue.put((idx, samples))
           del samples

- 需要对队列操作,缓存数据,使得加载提速!

来源:https://blog.csdn.net/hfw6310/article/details/106992968

标签:pytorch,DataLoader
0
投稿

猜你喜欢

  • Django中URL的参数传递的实现

    2022-12-24 13:53:25
  • Python实现去除列表中重复元素的方法小结【4种方法】

    2022-10-17 12:24:09
  • Python实现CAN报文转换工具教程

    2022-06-13 02:34:06
  • PHP PDOStatement::closeCursor讲解

    2023-06-07 18:23:31
  • Python实现端口检测的方法

    2022-02-01 21:21:02
  • python+opencv实现论文插图局部放大并拼接效果

    2023-12-07 17:29:12
  • Python3多线程处理爬虫的实战

    2023-08-16 02:16:21
  • JavaScript判断前缀、后缀是否是空格的方法

    2024-04-22 22:38:04
  • 快速解决SQL server 2005孤立用户问题

    2009-01-04 14:02:00
  • MacBook m1芯片采用miniforge安装python3.9的方法示例

    2022-03-03 21:18:26
  • python数据可视化使用pyfinance分析证券收益示例详解

    2022-05-24 06:59:13
  • layui实现下拉复选功能的例子(包括数据的回显与上传)

    2024-02-24 17:37:10
  • pytest测试框架+allure超详细教程

    2023-03-18 21:38:00
  • 浅谈Python响应式类库RxPy

    2021-12-24 12:44:26
  • 分析Silverlight Button控件布局

    2009-02-17 13:13:00
  • web程序员的思考

    2009-08-04 13:10:00
  • caffe binaryproto 与 npy相互转换的实例讲解

    2021-10-22 15:38:48
  • git版本库介绍及本地创建的三种场景方式

    2023-07-11 11:22:18
  • Windows下Anaconda的安装和简单使用方法

    2022-04-21 16:46:50
  • Python3 执行系统命令并获取实时回显功能

    2023-12-06 13:23:10
  • asp之家 网络编程 m.aspxhome.com