I will keep updating this page with tips and tricks for Cumulus Linux. If you have any suggestions, please let me know.

Mod Mellanox FRU EEPROMs under Cumulus Linux

Check “How to mod Mellanox FRU EEPROMs under Cumulus Linux”

Fixing Mellanox DMI / SMBIOS Information

Check “Fixing Mellanox DMI / SMBIOS Information”

Troubleshoot thermal daemon

sudo journalctl -u hw-management-tc.service

could contain hints such as:

Mar 31 07:28:25 cumulus hw-management-tc[5489]: NOTICE - Preinit thermal control ver 2.1.0
Mar 31 07:28:25 cumulus hw-management-tc[5489]: NOTICE - Platform Board:'"SA000874"', SKU:'"MSN2700-CS2FC"' is not supported.
Mar 31 07:29:20 cumulus systemd[1]: Stopping hw-management-tc.service - Thermal control service (ver 2.0) of Mellanox systems...
Mar 31 07:29:20 cumulus hw-management-tc[5489]: NOTICE - Thermal control stopped

(Hint: notice the nested quotes in '"SA000874"'? The inner " shouldn’t be there. Some script had mangled the value, and I had to fix it.)
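
To check whether a box is affected, the journal can be filtered for that nested-quote pattern. This is a minimal sketch that assumes only the log format shown above; find_nested_quotes is a hypothetical helper name, not part of hw-management:

```shell
# Hypothetical helper: keep only the lines on stdin that contain a single
# quote immediately followed by a double quote, e.g. Board:'"SA000874"'.
find_nested_quotes() {
    grep -F "'\""
}
```

Usage: sudo journalctl -u hw-management-tc.service | find_nested_quotes
If nothing is printed, no journal line contains the pattern.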

cat /var/log/tc_log

Contains some interesting stuff, like:

2025-07-12 23:31:22,462 - INFO - ================================
2025-07-12 23:31:22,462 - INFO - "asic1" temp: 46, tmin: 70.0, tmax: 105.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,462 - INFO - "cpu_pack" temp: 44, tmin: 70.0, tmax: 100.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,463 - INFO - "drwr1:[1, 2]" rpm:[11904, 10211], dir:P2C faults:[] pwm 20 RUNNING
2025-07-12 23:31:22,463 - INFO - "drwr2:[3, 4]" rpm:[12009, 10445], dir:P2C faults:[] pwm 20 RUNNING
2025-07-12 23:31:22,463 - INFO - "drwr3:[5, 6]" rpm:[12224, 10288], dir:P2C faults:[] pwm 20 RUNNING
2025-07-12 23:31:22,463 - INFO - "drwr4:[7, 8]" rpm:[12009, 10288], dir:P2C faults:[] pwm 20 RUNNING
2025-07-12 23:31:22,463 - INFO - "module1" temp: 0, tmin: 0.0, tmax: 0.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,466 - INFO - "module28" temp: 41, tmin: 33.0, tmax: 53.0, faults:[], pwm: 58, RUNNING
2025-07-12 23:31:22,468 - INFO - "psu1_fan" rpm:10192, dir:P2C faults:[] pwm: 20, RUNNING
2025-07-12 23:31:22,468 - INFO - "psu2_fan" rpm:10320, dir:P2C faults:[] pwm: 20, RUNNING
2025-07-12 23:31:22,468 - INFO - "sensor_amb" port_amb:37 fan_amb:35 (35), dir:P2C, faults:[] pwm:30, RUNNING
2025-07-12 23:31:22,468 - INFO - "sodimm1_temp" temp: 34, tmin: 70.0, tmax: 85.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,468 - INFO - "voltmon1_temp" temp: 30, tmin: 85.0, tmax: 125.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,468 - INFO - "voltmon2_temp" temp: 31, tmin: 85.0, tmax: 125.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,469 - INFO - "voltmon6_temp" temp: 30, tmin: 85.0, tmax: 125.0, faults:[], pwm: 30, RUNNING
2025-07-12 23:31:22,469 - INFO - ================================
2025-07-12 23:32:22,462 - INFO - Thermal periodic report
2025-07-12 23:32:22,462 - INFO - ================================
2025-07-12 23:32:22,462 - INFO - Temperature(C): asic1 46, amb 35
2025-07-12 23:32:22,462 - INFO - Cooling(%) 58 (max pwm source:module28)
2025-07-12 23:32:22,462 - INFO - dir:P2C
2025-07-12 23:32:22,463 - INFO - ================================
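
To see which sensor is pushing the fans, the per-sensor lines can be reduced to a pwm ranking. This is a rough sketch that assumes the line format shown above (a quoted sensor name followed somewhere by "pwm", an optional colon and a number); pwm_by_sensor is a hypothetical helper name:

```shell
# Hypothetical helper: read tc_log lines on stdin and print "<pwm> <sensor>",
# highest pwm request first. Lines without a quoted sensor name (separators,
# periodic report) are skipped.
pwm_by_sensor() {
    sed -n 's/.* - INFO - "\([^"]*\)".*pwm:\{0,1\} *\([0-9][0-9]*\).*/\2 \1/p' |
        sort -rn |
        uniq    # collapse identical entries repeated across reports
}
```

Usage: tail -n 40 /var/log/tc_log | pwm_by_sensor | head -n 3
For the excerpt above, module28 (pwm 58) comes out on top, which matches the "max pwm source:module28" line in the periodic report.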

sudo smonctl

can also help:

sudo smonctl | grep "Asic Temp Sensor"
Temp4     (Asic Temp Sensor                      ):  BAD
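
On a switch with many sensors it can be quicker to show only the entries that are not healthy. This is a small sketch that assumes healthy sensors are reported with the status OK and that every line follows the layout shown above; not_ok is a hypothetical helper name:

```shell
# Hypothetical helper: drop every line on stdin whose status column
# (the text after the last colon) is exactly OK, keep everything else.
not_ok() {
    grep -v ':[[:space:]]*OK[[:space:]]*$'
}
```

Usage: sudo smonctl | not_ok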

Override Fan Direction

Don’t want to mod the EEPROM for some reason? You can also just override the detected fan direction using:

echo "0" | sudo tee /var/run/hw-management/thermal/fan4_dir

0 is F2B (Front to Back) / C2P (Connector to PSU) / Red Handles
1 is B2F (Back to Front) / P2C (PSU to Connector) / Blue Handles
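
If several fans need the same override, the command can be wrapped in a loop. This is only a sketch under an assumption the text above does not confirm: that every fan has a fanN_dir file next to fan4_dir in the same directory. set_all_fan_dirs is a hypothetical helper name:

```shell
# Hypothetical helper: write the same direction value (0 or 1) to every
# fan<N>_dir file in a directory.
# Assumption: all fans expose a fan<N>_dir file like the fan4_dir shown above.
set_all_fan_dirs() {
    value="$1"
    dir="${2:-/var/run/hw-management/thermal}"
    case "$value" in
        0|1) ;;
        *) echo "direction must be 0 or 1" >&2; return 1 ;;
    esac
    for f in "$dir"/fan[0-9]*_dir; do
        [ -e "$f" ] || continue    # glob matched nothing
        printf '%s\n' "$value" > "$f"
    done
}
```

Usage: set_all_fan_dirs 0
The function writes the files directly, so it presumably needs a root shell (for example after sudo -i), since the single-fan command above goes through sudo tee.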

Because Nvidia has removed the download links for Cumulus VX from their website (the files are still available for download though), I have archived them here. The links lead to Nvidia’s S3 bucket.

Note that Nvidia has silently discontinued Cumulus VX: no version newer than 5.12.1 has been released to the public (although I’ve heard through the grapevine that some customers have negotiated access to newer versions). This was “announced” (not really announced, more like buried) in the Cumulus Linux 5.13.0 release notes.

Not a single day goes by that Nvidia doesn’t manage to disappoint me even more. Linus Torvalds was right when he said that Nvidia is the worst company to work with in the Linux community.