How to control thermal and performance settings for multiple Nvidia GPUs on Ubuntu Linux

·@thescinder·
<p style="text-align: center;">Such compute. </p>
<p style="text-align: right;">very co$t</p>
<p style="text-align: left;">Wow</p>

The price of a pair of GTX 1060 GPUs has gone up about 50% since I <a href="https://steemit.com/deeplearning/@thescinder/i-too-built-a-rather-decent-deep-learning-rig-for-900-quid">built my deep-learning rig</a> a few weeks ago, and that's if you can even find them in stock. There's been a *wee bit* of a gold rush surrounding cryptocurrencies lately as many new miners set up systems. Ultimately, I think this will benefit more than just crypto: demand for fast cards pushes graphics card makers to build more efficient and powerful hardware, just as high-performance computing for scientific purposes has traditionally piggy-backed on demand for better gaming. It also means increased awareness and adoption of cryptography and cryptocurrencies, which I consider a good thing as it should help stabilize the ecosystem.

In any case, whether your aim is mining ether or back-propagation, you may want to get as much performance as possible out of the GPUs you do have during the current shortage. This means tuning each card for performance, efficiency, or both. For Nvidia cards on Ubuntu there is a slight difficulty: normally you can only tune a GPU that is running a display. With a few tricks, though, it's possible to overclock multiple GPUs without hooking up a monitor to each one. This took me a while to figure out, so I thought it might be helpful to others.

Note: changing the <code>cool-bits</code> flag lets you bypass thermal safeguards, may affect your warranty, etc., so be conservative in your changes and monitor GPU temperatures and errors. I typically run the fans at a higher speed than they would normally operate and keep the temperature well below 70 °C.
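
A rough sketch of how that monitoring could be scripted. The function name <code>warn_hot_gpus</code> is my own invention, and the 70 default just mirrors the ceiling I mentioned above; it is not an Nvidia specification. It expects one CSV line per GPU on stdin, in the shape that <code>nvidia-smi</code> produces for the query shown below the block.

```shell
# Read "index, temperature, fan %, power W" CSV lines (one per GPU) on stdin
# and print a warning for any GPU at or above a temperature limit (deg C).
# The limit defaults to 70, which is a personal comfort level, not a spec.
warn_hot_gpus() {
    local limit="${1:-70}" idx temp fan power
    while IFS=', ' read -r idx temp fan power; do
        [ -n "$idx" ] || continue          # skip empty lines
        if [ "$temp" -ge "$limit" ]; then
            echo "GPU $idx: ${temp}C is at or above ${limit}C (fan ${fan}%, ${power}W)"
        fi
    done
}
```

To feed it live data, something like <code>nvidia-smi --query-gpu=index,temperature.gpu,fan.speed,power.draw --format=csv,noheader,nounits | warn_hot_gpus 70</code> should work. Wrapping that in <code>watch</code> or a cron job is left to taste.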

In short, it was the order that mattered. Setting <code>cool-bits</code> before editing the config file, with or without a flag to allow empty configurations, always left me with control over just one GPU after rebooting :-/ Instead I had to modify the config file first, and only then allow empty configurations and change the <code>cool-bits</code> flag. I'll assume you've got your drivers set up and your cards are working, so all that's left is to gain control through <code>nvidia-settings</code>. Here's what worked for me:

First edit your Xorg config file. Duplicate the monitor/device/screen declarations while incrementing the names, <em>e.g.</em> "Device0" becomes "Device1", and give each Device entry the BusID of its own card. Do this once for each additional GPU. I have two cards, so I ended up with two monitor/device/screen entries. The text of my config file is at the end of this post.
<code>
sudo nano /etc/X11/xorg.conf
</code>
<em>nano --> your text editor of choice</em>
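
If you have more than a couple of cards, typing the duplicated sections by hand gets tedious. Here is a minimal sketch of a helper that prints the three sections for one GPU so you can paste them into the file. The function name <code>gpu_sections</code> is hypothetical and the layout simply follows my own config. It deliberately leaves out the Coolbits option, since that gets added in the next step.

```shell
# Print the Monitor/Device/Screen sections for one GPU.
#   $1 = index used in the section names (0, 1, 2, ...)
#   $2 = BusID string for that card, e.g. "PCI:1:0:0"
# Coolbits is intentionally not written here; nvidia-xconfig adds it later.
gpu_sections() {
    local n="$1" busid="$2"
    cat <<EOF
Section "Monitor"
    Identifier     "Monitor$n"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device$n"
    Driver         "nvidia"
    BusID          "$busid"
EndSection

Section "Screen"
    Identifier     "Screen$n"
    Device         "Device$n"
    Monitor        "Monitor$n"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

EOF
}
```

For example, <code>gpu_sections 1 "PCI:2:0:0"</code> prints the entries for a second card. You still need to add a matching <code>Screen</code> line to the ServerLayout section yourself, and it's worth copying the existing xorg.conf somewhere safe first. <code>nvidia-xconfig --query-gpu-info</code> should list the BusID of each card.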

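Then set the <code>--cool-bits</code> flag and allow empty configurations. Setting <code>cool-bits</code> to 28 also <a href="https://hashcat.net/forum/thread-5785-post-31020.html">allows</a> you to change GPU voltages, which I don't currently use or recommend. 12 should also work for our needs, since it enables fan control and clock offsets without the voltage bit; 5 enables fan control but, as far as I can tell from the driver README, its lowest bit only covers clock controls on much older cards.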

<code>
sudo nvidia-xconfig -a --cool-bits=28 --allow-empty-initial-configuration
</code>
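
The cool-bits value is a bitmask, which is why numbers like 28 and 12 look so arbitrary. Below is a small sketch that spells out which bits a value switches on. The function name <code>decode_coolbits</code> is made up, and the descriptions are my paraphrase of the Coolbits entry in the Nvidia driver README, so treat them as my reading, not gospel.

```shell
# Print the features enabled by a Coolbits value, one line per set bit.
# Descriptions paraphrase the Nvidia driver README (my interpretation).
decode_coolbits() {
    local value="$1"
    if [ $(( value & 1 )) -ne 0 ]; then
        echo "1: clock controls for older (pre-Fermi) GPUs"
    fi
    if [ $(( value & 2 )) -ne 0 ]; then
        echo "2: SLI across GPUs with different amounts of memory"
    fi
    if [ $(( value & 4 )) -ne 0 ]; then
        echo "4: manual fan speed control"
    fi
    if [ $(( value & 8 )) -ne 0 ]; then
        echo "8: clock offsets on the PowerMizer page"
    fi
    if [ $(( value & 16 )) -ne 0 ]; then
        echo "16: overvoltage through the nvidia-settings command line"
    fi
    return 0
}
```

Running <code>decode_coolbits 28</code> lists the 4, 8 and 16 entries, while <code>decode_coolbits 12</code> drops the overvoltage line.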

And that's it. You should be able to reboot and start over- or underclocking your GPUs. Check in on the temperature and power usage with <code>nvidia-smi</code> on the command line. You can modify the thermal and performance ("PowerMizer") settings in the GUI by running <code>nvidia-settings</code>, or you can use commands like these:

<code>
nvidia-settings -a '[gpu:0]/GPUMemoryTransferRateOffset[3]=400'
nvidia-settings -a '[gpu:0]/GPUGraphicsClockOffset[3]=40'
nvidia-settings -a '[gpu:0]/GPUFanControlState=1'
nvidia-settings -a '[fan:0]/GPUTargetFanSpeed=65'
</code>
<em>or back to normal</em>
<code>
nvidia-settings -a '[gpu:0]/GPUMemoryTransferRateOffset[3]=0'
nvidia-settings -a '[gpu:0]/GPUGraphicsClockOffset[3]=0'
nvidia-settings -a '[gpu:0]/GPUFanControlState=0'
</code>
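
The commands above only touch the first card. Since the whole point is controlling several GPUs, here is a sketch that prints the same commands for every card so you can review them before running anything. The function names <code>tune_commands</code> and <code>reset_commands</code> are hypothetical, and the sketch assumes one fan per card, so that <code>fan:N</code> belongs to <code>gpu:N</code>. That holds on my rig but may not on yours; <code>nvidia-settings -q fans</code> should tell you.

```shell
# Print (not run) the tuning commands for GPUs 0..count-1.
#   $1 = number of GPUs   $2 = memory offset   $3 = core offset   $4 = fan %
# Assumes fan:N is the fan of gpu:N (one fan per card).
tune_commands() {
    local count="$1" mem="$2" core="$3" fan="$4" i
    for (( i = 0; i < count; i++ )); do
        echo "nvidia-settings -a '[gpu:$i]/GPUMemoryTransferRateOffset[3]=$mem'"
        echo "nvidia-settings -a '[gpu:$i]/GPUGraphicsClockOffset[3]=$core'"
        echo "nvidia-settings -a '[gpu:$i]/GPUFanControlState=1'"
        echo "nvidia-settings -a '[fan:$i]/GPUTargetFanSpeed=$fan'"
    done
}

# Print the commands that put GPUs 0..count-1 back to stock settings.
reset_commands() {
    local count="$1" i
    for (( i = 0; i < count; i++ )); do
        echo "nvidia-settings -a '[gpu:$i]/GPUMemoryTransferRateOffset[3]=0'"
        echo "nvidia-settings -a '[gpu:$i]/GPUGraphicsClockOffset[3]=0'"
        echo "nvidia-settings -a '[gpu:$i]/GPUFanControlState=0'"
    done
}
```

For my two cards, <code>tune_commands 2 400 40 65</code> prints eight commands; once they look right, piping the output to <code>sh</code> applies them. If you are logged in over SSH you may also need to point nvidia-settings at the running X server (for instance by setting <code>DISPLAY=:0</code>), though that depends on your setup.
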
Tune the settings in small increments and change just one setting at a time until you get closer to optimizing your chosen metric, then adjust the next setting and repeat as necessary ("walking the settings"). There are many more, much better guides out there for the actual overclocking for performance or underclocking for efficiency, and I suggest you check them out. Hopefully one or two other people had the same problem I did with the order of setting cool-bits and modifying the xorg.conf file, and this short post will be useful to some fellow human, somewhere, sometime. Thanks!
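
For anyone who wants to script the "walking the settings" idea, here is a bare-bones sketch for a single setting. It is a simple greedy walk, not a proper tuning method. Everything in it is hypothetical: <code>walk_offset</code> is my name for the loop, and <code>apply_offset</code> and <code>measure_score</code> are placeholders you would have to write yourself (for example, a wrapper around nvidia-settings and a wrapper around your benchmark or hash-rate readout).

```shell
# Raise one offset in fixed steps and stop as soon as the score stops
# improving or the measurement fails; then re-apply and print the best offset.
# You must define these two placeholder functions yourself:
#   apply_offset N  -> applies offset N to the card
#   measure_score   -> prints an integer score (higher is better) and
#                      returns non-zero if the run crashed or showed errors
#   $1 = step size   $2 = largest offset to try
walk_offset() {
    local step="$1" max="$2" offset best=0 best_score score
    apply_offset 0 >/dev/null
    best_score="$(measure_score)" || return 1     # baseline at stock clocks
    for (( offset = step; offset <= max; offset += step )); do
        apply_offset "$offset" >/dev/null
        if ! score="$(measure_score)" || [ "$score" -le "$best_score" ]; then
            break                                  # worse, equal, or failed
        fi
        best="$offset"
        best_score="$score"
    done
    apply_offset "$best" >/dev/null                # settle on the best value
    echo "$best"
}
```

Called as <code>walk_offset 20 200</code>, it would try offsets 20, 40, 60 and so on up to 200. Keep the step small and the ceiling modest, and keep an eye on temperatures while it runs.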

I am using Ubuntu 16.04 with version 375.66 of the Nvidia drivers.

xorg.conf example:
```
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 375.66  (buildmeister@swio-display-x86-rhel47-06)  Mon May  1 15:45:32 PDT 2017

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    Screen      1  "Screen1" RightOf "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Monitor"
    Identifier     "Monitor1"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1060 6GB"
    BusID          "PCI:1:0:0"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1060 6GB"
    BusID          "PCI:2:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "28"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor1"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "28"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

```
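
Since my original problem was ending up with control over only one GPU, a quick sanity check on the finished file can save a reboot. The sketch below just compares the number of Device sections with the number of Coolbits options. The function name <code>check_coolbits</code> is made up, and a matching count is only a hint that every card got the option, not proof.

```shell
# Compare the number of Device sections with the number of Coolbits options
# in an xorg.conf. Succeeds only if there is at least one Device section and
# the two counts match. Commented-out lines are not counted.
#   $1 = path to the config file (defaults to /etc/X11/xorg.conf)
check_coolbits() {
    local conf="${1:-/etc/X11/xorg.conf}" devices coolbits
    devices="$(grep -ciE '^[[:space:]]*Section[[:space:]]+"Device"' "$conf")"
    coolbits="$(grep -ciE '^[[:space:]]*Option[[:space:]]+"Coolbits"' "$conf")"
    echo "Device sections: $devices, Coolbits options: $coolbits"
    [ "$devices" -gt 0 ] && [ "$devices" -eq "$coolbits" ]
}
```

On the config above, <code>check_coolbits</code> should report two Device sections and two Coolbits options. After rebooting, <code>nvidia-settings -q gpus</code> is another way to see which cards the tool can actually reach.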