I will constantly update this page with tips and tricks for Cumulus Linux. If you have any suggestions, please let me know.
Mod Mellanox FRU EEPROMs under Cumulus Linux
Check “How to mod Mellanox FRU EEPROMs under Cumulus Linux”
Fixing Mellanox DMI / SMBIOS Information
Check “Fixing Mellanox DMI / SMBIOS Information”
Troubleshoot thermal daemon
sudo journalctl -u hw-management-tc.service
could contain hints such as:
Mar 31 07:28:25 cumulus hw-management-tc[5489]: NOTICE - Preinit thermal control ver 2.1.0
Mar 31 07:28:25 cumulus hw-management-tc[5489]: NOTICE - Platform Board:'"SA000874"', SKU:'"MSN2700-CS2FC"' is not supported.
Mar 31 07:29:20 cumulus systemd[1]: Stopping hw-management-tc.service - Thermal control service (ver 2.0) of Mellanox systems...
Mar 31 07:29:20 cumulus hw-management-tc[5489]: NOTICE - Thermal control stopped
(Hint: notice the double quotes in '"SA000874"'? The inner " shouldn’t be there. Something got messed up in a script, which I had to fix.)
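To spot this quickly, a small filter can pull out any Board or SKU value that still carries literal inner double quotes. This is only a sketch: find_quoted_ids is a name I made up, and the pattern is modelled on the single log line shown above, so other hw-management versions may word the message differently.

```shell
# Hypothetical helper (not part of hw-management): reads log lines on stdin
# and prints every Board/SKU field whose value is wrapped in '"..."',
# i.e. still carries the stray inner double quotes.
find_quoted_ids() {
    grep -oE "(Board|SKU):'\"[^\"']*\"'"
}

# Usage on a switch (needs permission to read the journal):
#   sudo journalctl -u hw-management-tc.service | find_quoted_ids
```

No output (and a non-zero exit status from grep) means no such quoted values were found in the input.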
cat /var/log/tc_log
Contains some interesting stuff, like:
2025-07-12 23:31:22,462 - INFO - ================================
2025-07-12 23:31:22,462 - INFO - "asic1" temp: 46, tmin: 70.0, tmax: 105.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,462 - INFO - "cpu_pack" temp: 44, tmin: 70.0, tmax: 100.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,463 - INFO - "drwr1:[1, 2]" rpm:[11904, 10211], dir:P2C faults:[] pwm 20 RUNNING
2025-07-12 23:31:22,463 - INFO - "drwr2:[3, 4]" rpm:[12009, 10445], dir:P2C faults:[] pwm 20 RUNNING
2025-07-12 23:31:22,463 - INFO - "drwr3:[5, 6]" rpm:[12224, 10288], dir:P2C faults:[] pwm 20 RUNNING
2025-07-12 23:31:22,463 - INFO - "drwr4:[7, 8]" rpm:[12009, 10288], dir:P2C faults:[] pwm 20 RUNNING
2025-07-12 23:31:22,463 - INFO - "module1" temp: 0, tmin: 0.0, tmax: 0.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,466 - INFO - "module28" temp: 41, tmin: 33.0, tmax: 53.0, faults:[], pwm: 58, RUNNING
2025-07-12 23:31:22,468 - INFO - "psu1_fan" rpm:10192, dir:P2C faults:[] pwm: 20, RUNNING
2025-07-12 23:31:22,468 - INFO - "psu2_fan" rpm:10320, dir:P2C faults:[] pwm: 20, RUNNING
2025-07-12 23:31:22,468 - INFO - "sensor_amb" port_amb:37 fan_amb:35 (35), dir:P2C, faults:[] pwm:30, RUNNING
2025-07-12 23:31:22,468 - INFO - "sodimm1_temp" temp: 34, tmin: 70.0, tmax: 85.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,468 - INFO - "voltmon1_temp" temp: 30, tmin: 85.0, tmax: 125.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,468 - INFO - "voltmon2_temp" temp: 31, tmin: 85.0, tmax: 125.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,469 - INFO - "voltmon6_temp" temp: 30, tmin: 85.0, tmax: 125.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,469 - INFO - ================================
2025-07-12 23:32:22,462 - INFO - Thermal periodic report
2025-07-12 23:32:22,462 - INFO - ================================
2025-07-12 23:32:22,462 - INFO - Temperature(C): asic1 46, amb 35
2025-07-12 23:32:22,462 - INFO - Cooling(%) 58 (max pwm source:module28)
2025-07-12 23:32:22,462 - INFO - dir:P2C
2025-07-12 23:32:22,463 - INFO - ================================
sudo smonctl
can also help:
sudo smonctl | grep "Asic Temp Sensor"
Temp4 (Asic Temp Sensor ): BAD
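When the fans spin faster than expected, it helps to see which sensor asks for the most cooling. The sketch below ranks the per-sensor lines of /var/log/tc_log by their pwm value. top_pwm_sources is a hypothetical name, the regular expression is derived only from the log excerpt above, and I'm assuming the pwm field is the duty cycle each sensor requests.

```shell
# Hypothetical helper: reads tc_log lines on stdin and prints "<pwm> <sensor>"
# pairs, highest pwm first. Optional argument: how many rows to show (default 5).
# Handles the three spellings seen in the log: "pwm: 30", "pwm:30" and "pwm 20".
top_pwm_sources() {
    sed -nE 's/.* - INFO - "([^"]+)".*pwm:? *([0-9]+).*/\2 \1/p' |
        sort -rn |
        head -n "${1:-5}"
}

# Usage on a switch, looking only at the most recent lines of the log:
#   tail -n 40 /var/log/tc_log | top_pwm_sources 3
```

For the excerpt above, the top row would be module28 with a pwm of 58, which matches the "max pwm source" named in the periodic report.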
Override Fan Direction
Don’t want to mod the EEPROM for some reason? You can also just override the detected fan direction using
echo "0" | sudo tee /var/run/hw-management/thermal/fan4_dir
0 is F2B (Front to Back) / C2P (Connector to PSU) / Red Handles
1 is B2F (Back to Front) / P2C (PSU to Connector) / Blue Handles
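If every fan needs the same override, a loop saves some typing. The following is a rough sketch under a few assumptions: set_fan_dirs is a made-up name, and I'm assuming the other fans expose files named like fan4_dir (fan1_dir, fan2_dir, and so on) in the same directory. Since /var/run is normally a tmpfs, I would not expect the override to survive a reboot.

```shell
# Hypothetical helper: write the same direction value to every fan*_dir file.
# Usage: set_fan_dirs VALUE [DIR]
#   VALUE  0 or 1, with the meaning described above
#   DIR    directory holding the files (default: /var/run/hw-management/thermal)
set_fan_dirs() {
    local value=$1
    local dir=${2:-/var/run/hw-management/thermal}
    local f
    case $value in
        0|1) ;;
        *) echo "usage: set_fan_dirs 0|1 [DIR]" >&2; return 1 ;;
    esac
    for f in "$dir"/fan*_dir; do
        [ -e "$f" ] || continue   # the glob matched nothing
        printf '%s\n' "$value" > "$f"
    done
}

# On a switch, run it from a root shell (for example after `sudo -i`):
#   set_fan_dirs 0
```

The optional directory argument is only there so the function can be tried against a scratch directory before touching the real files.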
Archived download links for Cumulus VX
Because Nvidia has removed the download links for Cumulus VX from their website (the files are still available for download though), I have archived them here. The links lead to Nvidia’s S3 bucket.
Note that Nvidia has silently discontinued Cumulus VX: they do not release any versions newer than 5.12.1 to the public (although I’ve heard through the grapevine that some customers have negotiated access to newer versions…). This was “announced” (not really announced, more like hidden in there) in the Cumulus Linux 5.13.0 release notes.
Not a single day goes by that Nvidia doesn’t manage to disappoint me even more. Linus Torvalds was right when he said that Nvidia is the worst company to work with in the Linux community.