Sunday, November 2, 2025

5 tips for multi-GPU training with Keras


Deep Learning (the favourite buzzword of the late 2010s, along with blockchain/bitcoin and Data Science/Machine Learning) has enabled us to do some really cool stuff the last few years. Besides the advances in algorithms (which admittedly are based on ideas already known since the 1990s, aka the "Data Mining era"), the main reasons for its success can be attributed to the availability of large free datasets, the introduction of open-source libraries and the use of GPUs. In this blog post I will focus on the last two and I'll share with you some tips that I learned the hard way.

Why TensorFlow & Keras?

TensorFlow is a very popular Deep Learning library developed by Google which lets you prototype complex networks quickly. It comes with lots of interesting features such as auto-differentiation (which saves you from estimating/coding the gradients of the cost functions yourself) and GPU support (which lets you easily get a 200x speed improvement using decent hardware). Moreover it offers a Python interface, which means that you can prototype quickly without having to write C or CUDA code. Admittedly there are plenty of other frameworks one can use instead of TensorFlow, such as Torch, MXNet, Theano, Caffe, Deeplearning4j, CNTK, etc, but it all boils down to your use-case and your personal preference.

But why Keras? For me, using TF directly is like doing Machine Learning with Numpy. Yes it's feasible and from time to time you have to do it (especially if you write custom layers/loss-functions), but do you really want to write code that describes complex networks as a series of vector operations (yes, I know there are higher-level methods in TF but they are not as cool as Keras)? Also what if you want to move to a different library? Well then you would probably have to rewrite the code, which sucks. Ta ta taaa, Keras to the rescue! Keras allows you to describe your networks using high-level concepts and to write code that is backend agnostic, meaning that you can run the networks across different deep learning libraries. A few things I love about Keras are that it is well-written, it has an object-oriented architecture, it is easy to contribute to and it has a friendly community. If you like it, say thanks to François Chollet for developing it and open-sourcing it.

Tips and Gotchas for Multi-GPU training

Without further ado, let's jump to a few tips on how to make the most of GPU training on Keras and a couple of gotchas that you should have in mind:

1. Multi-GPU training is not automatic

Training models on GPU with Keras & TensorFlow is seamless. If you have an NVIDIA card and you have installed CUDA, the libraries will automatically detect it and use it for training. So cool! But what if you are a spoilt brat and you have multiple GPUs? Well unfortunately you will have to work a bit to achieve multi-GPU training.

There are multiple ways to parallelise a network depending on what you want to achieve, but the two main approaches are model and data parallelization. The first can help you if your model is too complex to fit in a single GPU, while the latter helps when you want to speed up execution. Typically when people talk about multi-GPU training they mean the latter. This used to be harder to achieve, but thankfully Keras has recently included a utility method called multi_gpu_model which makes parallel training/predictions easier (currently only available with the TF backend). The main idea is that you pass your model through the method and it is copied across the different GPUs. The original input is split into chunks which are fed to the various GPUs, and then their outputs are aggregated into a single output. This method can be used for achieving both parallel training and predictions; nevertheless keep in mind that for training it does not scale linearly with the number of GPUs due to the required synchronization.
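To make the split-compute-aggregate idea concrete, here is a toy NumPy sketch (not actual Keras code; the real API is `keras.utils.multi_gpu_model(model, gpus=N)`). The chunks here are processed sequentially, whereas Keras dispatches each one to a different GPU:

```python
import numpy as np

def data_parallel_predict(model_fn, batch, n_gpus):
    """Toy illustration of what multi_gpu_model does for prediction:
    split the batch into chunks, run a copy of the model on each chunk
    (sequentially here; Keras runs them on separate GPUs), then
    concatenate the sub-batch outputs into one output."""
    chunks = np.array_split(batch, n_gpus)
    outputs = [model_fn(chunk) for chunk in chunks]
    return np.concatenate(outputs)

# A stand-in "model": simply doubles its input.
def double(x):
    return 2 * x

batch = np.arange(8).reshape(8, 1)
result = data_parallel_predict(double, batch, n_gpus=2)
```

The result is identical to running the model on the whole batch at once, which is exactly why the aggregation step is transparent to the user.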

2. Pay attention to the Batch Size

When you do multi-GPU training, pay attention to the batch size, as it has multiple effects on speed/memory and on the convergence of your model, and if you are not careful you might even corrupt your model weights!

Speed/memory: Obviously the larger the batch, the faster the training/prediction. This is because there is an overhead on putting data into and taking data out of the GPUs, so small batches have more overhead. On the flip-side, the larger the batch, the more memory you need on the GPU. Especially during training, the inputs of each layer are kept in memory as they are required in the back-propagation step, so increasing your batch size too much can lead to out-of-memory errors.

Convergence: If you use Stochastic Gradient Descent (SGD) or one of its variants to train your model, you should have in mind that the batch size can affect the ability of your network to converge and generalize. Typical batch sizes in many computer vision problems are between 32 and 512 examples. As Keskar et al put it, "It has been observed in practice that when using a larger batch (than 512) there is a degradation in the quality of the model, as measured by its ability to generalize." Note that different optimizers have different properties, and specialized distributed optimization techniques can help with the problem. If you are interested in the mathematical details, I recommend reading Joeri Hermans' thesis "On Scalable Deep Learning and Parallelizing Gradient Descent".

Corrupting the weights: This is a nasty technical detail which can have devastating results. When you do multi-GPU training, it is important to feed all the GPUs with data. It can happen that the last batch of your epoch has fewer records than defined (because the size of your dataset cannot be divided exactly by the size of your batch). This might cause some GPUs to receive no data during the last step. Unfortunately some Keras layers, most notably the Batch Normalization layer, can't cope with that, leading to nan values appearing in the weights (the running mean and variance of the BN layer). To make things even nastier, one will not observe the problem during training (while the learning phase is 1), because the specific layer uses the batch's mean/variance in its estimations. Nevertheless during predictions (learning phase set to 0), the running mean/variance is used, which in our case can become nan, leading to poor results. So do yourself a favour and always make sure that your batch size is fixed when you do multi-GPU training. Two simple ways to achieve this are either to reject batches that don't match the predefined size or to repeat the records within the batch until it reaches the predefined size. Last but not least, keep in mind that in a multi-GPU setup the batch size should be a multiple of the number of available GPUs on your system.

3. GPU data starvation aka the CPUs can't keep up with the GPUs

Typically the most expensive part of training/predicting with Deep networks is the estimation that happens on the GPUs. The data are preprocessed on the CPUs in the background and are fed to the GPUs periodically. Nevertheless one should not underestimate how fast the GPUs are; it can happen that if your network is too shallow or the preprocessing step is too complex, your CPUs can't keep up with your GPUs, or in other words they don't feed them with data quickly enough. This can lead to low GPU utilization, which translates to wasted money/resources.

Keras typically performs the estimations of the batches in parallel; nevertheless, due to Python's GIL (Global Interpreter Lock) you can't really achieve true multi-threading in Python. There are two solutions for that: either use multiple processes (note that there are lots of gotchas in this one that I'm not going to cover here) or keep your preprocessing step simple. In the past I've sent a Pull-Request to Keras to alleviate some of the unnecessary strain that we were putting on the CPUs during image preprocessing, so most users should not be affected if they use the standard generators. If you have custom generators, try to push as much logic as possible to C libraries such as Numpy, because some of their methods actually release the GIL, which means that you can increase the degree of parallelization. A good way to detect whether you are facing GPU data starvation is to monitor the GPU utilization; nevertheless be warned that this is not the only reason for observing low values (the synchronization that happens across the multiple GPUs during training is also to blame). Typically GPU data starvation can be detected by observing GPU bursts followed by long pauses with no utilization. In the past I've open-sourced an extension for Dstat that can help you measure your GPU utilization, so have a look at the original blog post.
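As an illustration of "push logic to Numpy", compare the two hypothetical preprocessing functions below. Both produce identical results, but the pure-Python loop holds the GIL for its entire run, while many NumPy array operations release the GIL inside their C loops, allowing other preprocessing threads to make progress:

```python
import numpy as np

def normalize_loop(images):
    """Per-element pure-Python loop: holds the GIL the whole time,
    so concurrent preprocessing threads cannot overlap with it."""
    out = np.empty_like(images, dtype=np.float64)
    for i in range(images.shape[0]):
        for j in range(images.shape[1]):
            out[i, j] = images[i, j] / 255.0 - 0.5
    return out

def normalize_vectorized(images):
    """Same computation expressed as NumPy array operations, which
    run in C and can release the GIL while they execute."""
    return images / 255.0 - 0.5

images = np.arange(12, dtype=np.float64).reshape(3, 4)
# normalize_loop(images) and normalize_vectorized(images) agree
```

The vectorized version is also dramatically faster on realistically-sized images, which directly reduces the chance of starving the GPUs.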

4. Saving your parallel models

Say you used the multi_gpu_model method to parallelize your model, the training finished and now you want to persist its weights. The bad news is that you can't just call save() on it. Currently Keras has a limitation that does not allow you to save a parallel model. There are 2 ways around this: either call save() on the reference of the original model (the weights will be updated automatically) or serialize the model by chopping down the parallelized version and cleaning up all the unnecessary connections. The first option is way easier, but in the long run I plan to open-source a serialize() method that performs the latter.
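The reason the first workaround works is that the parallel wrapper shares its weight tensors with the template model, so updates made while training the wrapper are visible through the original reference. Here is a toy sketch of that sharing in plain Python (ToyModel and toy_multi_gpu_model are stand-ins, not Keras APIs):

```python
class ToyModel:
    """Stand-in for a Keras model: just a dict of weight lists."""
    def __init__(self):
        self.weights = {"w": [0.0]}

    def save(self):
        # A real model writes to disk; here we return a snapshot.
        return {k: list(v) for k, v in self.weights.items()}

def toy_multi_gpu_model(model):
    """Stand-in for multi_gpu_model: the returned wrapper shares the
    SAME weight objects as the template model (shared, not copied)."""
    wrapper = ToyModel()
    wrapper.weights = model.weights
    return wrapper

original = ToyModel()
parallel = toy_multi_gpu_model(original)
parallel.weights["w"][0] = 42.0    # "training" updates the wrapper
snapshot = original.save()         # saving the original sees the update
```

In real Keras code the pattern is the same: keep a reference to the model you passed into multi_gpu_model, fit the parallel one, and call save() on the original.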

5. Counting the available GPUs has a nasty side-effect

Unfortunately at the moment, there is a nasty side-effect in the tensorflow.python.client.device_lib.list_local_devices() method which causes a new TensorFlow Session to be created and all the available GPUs of the system to be initialized. This can lead to unexpected results such as viewing more GPUs than specified or prematurely initializing new sessions (you can read all the details in this pull-request). To avoid similar surprises you are advised to use Keras' K.get_session().list_devices() method instead, which will return all the GPUs currently registered on the session. Last but not least, have in mind that calling the list_devices() method is somewhat expensive, so if you are just interested in the number of available GPUs, call the method once and store their number in a local variable.


That's it! I hope you found this list useful. If you have found other gotchas/tips for GPU training on Keras, share them below in the comments. 🙂


