PyTorch out of GPU memory

When working with PyTorch and large deep learning models on the GPU (CUDA), running into the dreaded "CUDA out of memory" error is common. The message usually looks like this:

RuntimeError: CUDA out of memory. Tried to allocate ... MiB (GPU 0; ... GiB total capacity; ... GiB already allocated; ... MiB free; ... GiB reserved in total by PyTorch)

and people hit it in every imaginable setting: training an autoencoder whose input and output are 256 x 256 RGB images, training on a dataset of 1M images with BATCH_SIZE=512, fine-tuning a 3D U-Net on 512 x 512 x N medical volumes on a 16 GB Colab GPU, running Detectron2 on a 4 GB card, loading a handful of trained .pt models (270 MB in total) for inference, or simply resuming training from a checkpoint. Very often the cure is as unglamorous as the usual forum replies suggest: reduce the image size or the batch size until the job fits ("thanks guys, reducing the size of the image helped").

Why it happens. PyTorch uses a caching memory allocator to speed up allocations: memory that is no longer used (for example because a tensor went out of scope) is kept reserved for future allocations instead of being returned to the driver. This means two processes sharing the same GPU can both hit out-of-memory errors even if, at any given moment, the sum of the memory they actually use stays below the card's capacity, and it means the numbers reported by nvidia-smi do not reflect the true usage of your tensors.

First things to try. Lower the batch size (a practical way to find the ceiling is to increase it incrementally until you go out of memory, a trick even libraries implement, see biggest_batch_first in AllenNLP's BucketIterator), lower the input resolution, and drop data augmentation transforms that are especially memory-hungry. If you need the larger effective batch size for optimization, use gradient accumulation instead, as sketched below. torch.cuda.empty_cache() releases unused cached memory back to the driver, which can help other processes, but it does not increase the amount of memory available to PyTorch itself.

Minimize gradient retention. Do not accumulate batch_loss into total_loss directly for printing or monitoring: batch_loss is still attached to the computation graph, so the graph of every iteration is kept alive. Accumulate loss.item() (or a detached copy) instead.

Fragmentation. When the error ends with "If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation", fragmentation inside the caching allocator is the likely culprit. max_split_size_mb can be set through the PYTORCH_CUDA_ALLOC_CONF environment variable; the exact syntax is documented, but in short the format is PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>.
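Here is a minimal sketch of gradient accumulation combined with detached loss accumulation. It assumes model, optimizer, criterion, train_loader and device already exist; accumulation_steps is a name chosen here for illustration, tuned so that the loader batch size times accumulation_steps gives the effective batch size you want.

import torch

accumulation_steps = 4          # effective batch size = loader batch size * accumulation_steps
total_loss = 0.0                # a plain float, never attached to the graph

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(train_loader):
    inputs, targets = inputs.to(device), targets.to(device)
    outputs = model(inputs)
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()                      # gradients keep accumulating in .grad
    total_loss += loss.item()            # .item() detaches, so this iteration's graph can be freed

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)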
Check who else is using the GPU. Before blaming your code, run nvidia-smi. If the card shows memory in use even at 0% GPU-Util, another process is holding it: another notebook kernel, another user on a shared workstation, or a Docker container that finished its script but never exited (one report showed about 36 GB already in use by containers that were not running anything). Because of the caching allocator, each process keeps its reserved memory until it exits. If GPU memory is not freed even after Python quits, it is very likely that some Python subprocesses are still alive; kill them. A kernel-log message such as "Out of memory: Killed process 25315 (python3) total-vm:53312244kB, anon-rss:31451456kB, ..." is the Linux OOM killer complaining about host RAM, not GPU memory, which is a different problem with different fixes (fewer DataLoader workers, smaller in-memory caches).

Worker processes and pools. DataLoader workers should only ever touch CPU tensors. If the Dataset itself moves data to the GPU, raising num_workers from 1 to 2 or more multiplies the GPU-side allocations and can push you over the edge; keep the Dataset on the CPU, set pin_memory=True, and call .to(device) on the batch inside the training loop. The same logic applies to multiprocessing pools: a pool of 40 processes that each hold a copy of the model on the GPU will run out of memory even if only two inferences run at a time, and after pool.map completes each worker still retains its CUDA context (a few hundred MB) until the pool is closed. A typical setup that avoids all of this is an A100 with pin_memory=True, a moderate num_workers, and batches of 32 transferred one at a time.

Resuming training. A surprisingly common OOM happens the moment training is resumed: torch.load() restores every tensor onto the device it was saved from, so the checkpoint briefly coexists on the GPU with the freshly built model and optimizer, and the run dies with "RuntimeError: CUDA out of memory" before the first batch. Load the checkpoint onto the CPU first and move the model afterwards, as in the sketch below. Dropping the batch size from 64 to 32 lowers training memory, but it will not help if the spike happens while loading or saving the checkpoint.

If none of this applies, torch.cuda.empty_cache() after a training phase, or PYTORCH_NO_CUDA_MEMORY_CACHING=1 to disable caching entirely, may reduce fragmentation in certain cases, at a real performance cost, so treat it as a diagnostic rather than a fix.
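A hedged sketch of the usual fix for out-of-memory errors when resuming: load the checkpoint onto the CPU with map_location and only then move the model to the GPU. The checkpoint keys ("model", "optimizer") are assumptions; adapt them to however the checkpoint was saved.

import torch

checkpoint = torch.load("checkpoint.pt", map_location="cpu")  # keep every tensor on the CPU
model.load_state_dict(checkpoint["model"])
model.to(device)                                              # move the parameters to the GPU once

optimizer.load_state_dict(checkpoint["optimizer"])            # optimizer state follows the parameters' device
del checkpoint                                                # drop the CPU copy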
PyTorch CUDA out of memory despite plenty of memory left. The parameters of a model are only a fraction of its training footprint. The embedding part of DINOv2 with pre-trained weights has roughly 427M parameters, which in float32 is only around 1.7 GB, yet it can exhaust a 16 GB A4000; a pretrained AlexNet with a few extra layers occupies about 1 GB and leaves 4+ GB free, and still crashes at the very first batch when the weights are updated. The missing memory goes to everything training needs besides the weights: the forward pass stores the activations required for the backward pass (to compute df/dx you have to keep x in memory), backpropagation materializes a gradient for every parameter, and optimizers such as Adam keep additional state per parameter. A model that runs comfortably for inference can therefore be far too large to train at the same batch size.

If you only need predictions, run the model under the torch.no_grad() context manager and call model.eval(); PyTorch then skips saving those intermediate values, which frees a large share of the memory (see the sketch below). If memory climbs steadily in nvidia-smi from one iteration to the next, for example during a simulation loop, until it hits the 4 GB of a GTX 970, something is keeping references to GPU tensors alive across iterations; the sections on leaked references below cover the usual suspects.

Uneven usage across GPUs is also common: nvidia-smi showing GPU 0 at 6 GB while GPU 1 sits at 32 GB usually means either that another process occupies GPU 1 or that a data-parallel wrapper is gathering everything onto one device. Check which processes own which memory before concluding that your model really needs that much.
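A minimal inference sketch; model, test_loader and device are assumed to exist, and on recent PyTorch versions torch.inference_mode() can be used in place of torch.no_grad() for a further small saving.

import torch

model.eval()                                  # disable dropout, use running batch-norm statistics
predictions = []

with torch.no_grad():                         # no activations are kept for a backward pass
    for inputs in test_loader:
        inputs = inputs.to(device)
        outputs = model(inputs)
        predictions.append(outputs.cpu())     # move results off the GPU right away

predictions = torch.cat(predictions)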
There is no single "memory-free" function in PyTorch/CUDA that wipes all gradient information accumulated over training epochs; the realistic goal is to avoid retaining it in the first place. You can tell the GPU not to keep gradients for a tensor by detaching it from the graph: detach any output you store for later, and if you feed one step's output back into the next step (the hidden state of an nn.LSTM during truncated backpropagation through time, for instance), either detach it between steps or call loss.backward(retain_graph=True) deliberately, accepting that the retained graph costs memory, and call optimizer.step() as soon as you can so the graph can be released.

The symptom tells you a lot. A slightly modified network, say a resnet18 with one extra convolution and batch-norm layer at the beginning, that only runs out of memory after a couple of batches is almost never "too big"; something is being kept alive between iterations. Likewise, a Python for loop that modifies values of a tensor element by element can allocate a surprising amount of GPU memory, because each indexing and assignment creates intermediate tensors that stay attached to the autograd graph; vectorize the loop, or wrap it in torch.no_grad() if it does not need gradients. Dropping the batch size to 2 only postpones that kind of failure.

Mixed precision is one of the cheapest real savings: running the forward pass under torch.cuda.amp.autocast() and scaling the loss with torch.cuda.amp.GradScaler() keeps many activations in half precision, which reduces the memory footprint and usually speeds training up as well; a sketch follows below. Finally, on a remote or shared workstation remember that other people's processes occupy part of the card, so the memory available to you can change from run to run.
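A hedged sketch of automatic mixed precision using the torch.cuda.amp API (newer releases expose the same functionality under torch.amp with a device_type argument); model, optimizer, criterion, train_loader and device are assumed.

import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in train_loader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad(set_to_none=True)

    with torch.cuda.amp.autocast():           # forward pass and loss run in mixed precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    scaler.scale(loss).backward()             # scale the loss to avoid underflow in fp16 gradients
    scaler.step(optimizer)
    scaler.update()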
Do not keep the whole dataset on the GPU. The single most common root cause is trying to load all the data (or copies of it) into GPU memory up front. Keep the dataset on the CPU or on disk, transfer one batch at a time with .to(device) inside the loop, and move anything you want to keep back with .cpu(). The same applies when each sample has to pass through a sequence of trained models: keep only the model you are currently using on the GPU and park the rest on the CPU, or load them on demand.

Do not call torch.cuda.empty_cache() after every batch. PyTorch reserves memory on purpose (it does not give it back to the OS) precisely so it does not have to re-allocate for every batch; clearing the cache each iteration slows the code down and rarely prevents an OOM. If PyTorch itself runs into an OOM, it automatically clears the cache and retries the allocation before raising. Calling empty_cache() once between phases, after training and before evaluation, is fine. Be careful with the pattern of catching the OOM exception and carrying on: it can work for thousands of steps and then stop recovering, because the exception's traceback holds references to the tensors of the failed iteration and fragmentation accumulates.

If even batch size 1 does not fit on a 12-16 GB card, the activations at your input resolution are simply too large for the architecture: lower the resolution, use activation checkpointing (torch.utils.checkpoint) to trade compute for memory, or use a smaller model.

When the cause is not obvious, measure instead of guessing. To debug CUDA memory use, PyTorch can generate memory snapshots that record the state of allocated CUDA memory at a point in time and, optionally, the history of allocation events that led up to it; the snapshot viewer's Active Memory Timeline then shows every live tensor over time on a particular GPU, which turns "what keeps growing?" into a question you can answer by looking. A sketch of the recording calls follows.
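A sketch of the snapshot workflow described in PyTorch's memory-debugging documentation. The underscore-prefixed functions are the documented but semi-private API of recent releases, so treat the exact names and signatures as an assumption and check the docs for your version; run_a_few_training_steps is a placeholder for your own code.

import torch

torch.cuda.memory._record_memory_history(max_entries=100000)   # start recording allocation events

run_a_few_training_steps()        # placeholder: the code whose memory behaviour you want to inspect

# dump what was recorded; open the file in the viewer at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("cuda_memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)         # stop recording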
Monitoring memory usage. nvidia-smi tells you whether the drivers are installed and how loaded the GPUs are; if it fails or does not show your GPU, fix the driver installation first, and if a GPU already shows memory in use, someone else's process owns it. For your own process, PyTorch provides torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() to track current and peak tensor memory, torch.cuda.memory_reserved() and max_memory_reserved() for what the caching allocator holds (the older max_memory_cached() name is deprecated), and torch.cuda.memory_summary() for a readable summary of the allocator's state; the device argument of all of these is optional. torch.cuda.empty_cache() releases all unoccupied cached memory currently held by the caching allocator: useful between phases, pointless inside the batch loop, and never a way to reclaim memory that live tensors still occupy.

Mind the size of what you ask for. The classic example: x = torch.randn(70000, 16) and y = torch.randn(16, 70000) are tiny, but z = torch.matmul(x, y) is a 70,000 x 70,000 float32 matrix, roughly 18 GiB, so the line runs on a laptop (where the operating system can page to disk) and fails instantly on the GPU. Loading four large BERT-class models at once, or serving a SentenceTransformers model from FastAPI where every worker process holds its own copy, fails for the same unglamorous reason: the requested weights and tensors genuinely do not fit. Load models once per process, share them across requests, and move tensors to the CPU (tensor.cpu()) before saving or caching them.

Recovering from out-of-memory errors. An OOM raises an exception you can catch, so long-running jobs and services often wrap the risky part in try/except and back off (reduce the batch, clear the cache, retry) instead of dying; a sketch follows. Inside a notebook, though, the simplest recovery is still to restart the kernel, because the stored exception and the interpreter's surviving variables keep the failed run's tensors alive.
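A sketch combining the monitoring calls with a simple recovery pattern. model and device are assumed, torch.cuda.OutOfMemoryError is a RuntimeError subclass on recent versions, and halving the batch is only an illustrative back-off strategy.

import torch

def report(device=None):
    # all three calls accept an optional device argument
    print(f"allocated {torch.cuda.memory_allocated(device) / 1e9:.2f} GB, "
          f"peak {torch.cuda.max_memory_allocated(device) / 1e9:.2f} GB, "
          f"reserved {torch.cuda.memory_reserved(device) / 1e9:.2f} GB")

def forward_with_backoff(batch):
    try:
        return model(batch.to(device))
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()                  # release cached blocks before retrying
        half = batch.shape[0] // 2
        if half == 0:
            raise                                 # a single sample does not fit: nothing left to split
        return torch.cat([forward_with_backoff(batch[:half]),
                          forward_with_backoff(batch[half:])])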
Since we often deal with large amounts of data in PyTorch, small mistakes can rapidly eat the whole card, and the most frequent one is a single line that saves references to tensors living in GPU memory. If you append the training loss to a list (train_loss[i+1] = loss, say) or keep outputs around for later, those tensors are still attached to their computation graphs, so the CUDA memory behind them, and behind every activation needed to backpropagate into them, cannot be released when the loop moves to the next iteration, and the run eventually goes out of memory. This is why jobs die only after 1.5 epochs, or 30 epochs, or thousands of steps, and why a PyTorch Lightning port of a working script can OOM midway through the first fold while the native script finishes all five: the leak grows a little with every iteration. The fix is always the same: store loss.item() or loss.detach().cpu(), move kept outputs to the CPU, and call optimizer.zero_grad(set_to_none=True) so stale gradients are dropped rather than kept as zero-filled tensors. The contrast between the leaking and the fixed logging pattern is sketched below.

Two related observations. Enabling gradient computation for an auxiliary head or loss ("aux") genuinely multiplies the activations that must be kept for backward, so a drastic jump in memory when it is turned on is expected rather than a bug; scale the batch size down or checkpoint that branch. And since forward-pass memory scales roughly linearly with batch size, having to shrink the per-GPU batch to an unreasonably small number like 11 just to fit is usually a hint that one of the leaks above, or an oversized input resolution, is the real problem.

Profiling tools. The PyTorch Profiler (torch.profiler, with memory profiling enabled) and the memory snapshot's Active Memory Timeline, which shows all live tensors over time on a particular GPU and lets you pan and zoom over the plot to see which allocations never go away, make these leaks straightforward to spot, whether you are training a language classifier on an RTX 2060 with 6 GB of VRAM or a large model on an A100.
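A contrast between the leaking and the fixed pattern; train_losses is a hypothetical logging list, and model, criterion, optimizer, train_loader and device are assumed.

train_losses = []

for inputs, targets in train_loader:
    outputs = model(inputs.to(device))
    loss = criterion(outputs, targets.to(device))

    optimizer.zero_grad(set_to_none=True)   # drop old gradients instead of keeping zeroed tensors
    loss.backward()
    optimizer.step()

    # train_losses.append(loss)             # leaks: keeps this iteration's whole graph alive
    train_losses.append(loss.item())        # fine: a detached Python float, nothing stays on the GPU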
This is particularly useful when evaluating or testing your model, i.e. when no backpropagation is performed, which is exactly where many otherwise healthy training scripts blow up. Training runs happily in 8 GB and the job then goes out of memory during validation on a 16 GB GPU because the evaluation loop builds computation graphs it never frees and stacks predictions on the GPU: wrap it in torch.no_grad(), keep model.eval(), and move outputs to the CPU as you collect them. Calling torch.cuda.empty_cache() between training and validation can release the training cache for the next phase, but it will not free memory that live tensors still occupy, and inside the loop it only makes your code slow.

A related puzzle: several models (ResNet, VGG, AlexNet) work fine on the GPU, then one OOM with Inception_v3 seems to poison the session and afterwards every model runs out of memory even though nvidia-smi shows little usage. The stored exception and the interpreter's surviving variables are still holding the tensors of the failed run; restart the kernel or process and the smaller models will fit again.

Multiple GPUs. Increasing num_workers in the DataLoader uses multiprocessing to load data from the Dataset and does not avoid an out-of-memory on the GPU. nn.DataParallel replicates the model onto every device each iteration (nn.parallel.replicate copies it GPU to GPU) and gathers the outputs on device 0, so it adds memory overhead of its own; hitting the same OOM with 4 GPUs and batch size 64 as with one GPU is therefore not surprising, and DistributedDataParallel is generally the better-behaved option. People following the FSDP tutorial who see more memory per device than in single-GPU runs usually need to check their custom auto_wrap policy: if the large sub-modules are not actually wrapped, their parameters are never sharded and the expected savings vanish. When the model itself is too large for one card, model parallelism, i.e. placing different sub-modules such as a huge nn.Embedding on different GPUs and moving activations between them, lets a single model span devices; a minimal sketch follows below. And on shared machines, remember that other students' or colleagues' code may allocate a fixed amount for itself, so your headroom is not constant over time.
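A minimal, hedged sketch of manual model parallelism across two GPUs. It assumes both cuda:0 and cuda:1 exist, the layer sizes are arbitrary, and real setups would likely prefer pipeline parallelism or FSDP instead.

import torch
import torch.nn as nn

class TwoGpuNet(nn.Module):
    def __init__(self, vocab_size, hidden, num_classes):
        super().__init__()
        # the memory-hungry embedding lives on the first GPU ...
        self.embed = nn.Embedding(vocab_size, hidden).to("cuda:0")
        # ... and the rest of the network on the second
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        ).to("cuda:1")

    def forward(self, token_ids):
        x = self.embed(token_ids.to("cuda:0"))
        x = x.mean(dim=1)                        # simple pooling over the sequence dimension
        return self.head(x.to("cuda:1"))         # move the activations, not the whole model

model = TwoGpuNet(vocab_size=50000, hidden=512, num_classes=10)
logits = model(torch.randint(0, 50000, (8, 128)))   # a batch of 8 sequences of length 128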
The same diagnosis covers most of the remaining stories. A sequence-to-sequence NMT model wrapped in DataParallel and trained with a self-defined truncated loss (criterion_T, from "Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels") that goes out of memory at one particular line is usually holding per-sample statistics or losses for the whole dataset on the GPU, rather than suffering from the loss function itself. GPU memory "leakage" during evaluation, where usage keeps increasing while evaluating and is not fully cleared even after all variables have been deleted and torch.cuda.empty_cache() has been called, points the same way: either some reference (a list, a class attribute, the stored exception) is still alive, or the remaining memory is simply held by the caching allocator and will be reused, in which case nothing is actually wrong. Saving only the state_dict is the right habit; just move any tensors you keep for later analysis to the CPU first.

Big datasets do not require heroics either. A 1M-image dataset does not have to be split into chunks that are trained on one after another with the model saved and reloaded in between: let the Dataset load samples lazily from disk and the DataLoader stream batches, and GPU memory only ever needs to hold the current batch, the model, its gradients and the optimizer state. A sketch of such a dataset follows.

Out-of-memory errors are some of the most common errors in PyTorch, but they almost always come down to the same handful of causes: a batch size or resolution that is genuinely too large, data or model copies that never needed to be on the GPU, references that keep computation graphs alive, allocator fragmentation, or another process occupying the card. Work through those in order and the error message stops being a mystery.
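A lazy Dataset sketch for the 1M-image case. It assumes torchvision and PIL are available, that image paths and labels are known up front (the data/images layout and the zero labels are placeholders), and that device exists; only the current batch ever reaches the GPU.

from pathlib import Path
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

class LazyImageDataset(Dataset):
    def __init__(self, image_paths, labels, size=256):
        self.image_paths = image_paths      # only file paths are stored; nothing is loaded yet
        self.labels = labels
        self.tf = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("RGB")   # loaded on demand, on the CPU
        return self.tf(img), self.labels[idx]

paths = sorted(Path("data/images").glob("*.jpg"))                # hypothetical directory layout
labels = torch.zeros(len(paths), dtype=torch.long)               # placeholder labels
loader = DataLoader(LazyImageDataset(paths, labels), batch_size=32,
                    num_workers=4, pin_memory=True, shuffle=True)

for images, targets in loader:
    images, targets = images.to(device), targets.to(device)      # one batch at a time reaches the GPU
    ...                                                           # forward / backward / step go here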