MSFS: RTX 4090 Issues Resolved
It took 8 months of research involving several companies to determine why my £1699 Gainward RTX 4090 was crashing at regular intervals.
Overview
The Installation
In January 2023 I installed a new Gainward RTX 4090 which cost £1699. I painstakingly verified each step of the installation and the final system worked without a hitch for the next few weeks.
The Search for the Cause of the Random Crashes
For the next 8 months I experienced random computer resets which appeared to be related to the GPU. The frequency of the crashes varied from month to month as new updates came along. Sometimes the computer was idling when a crash occurred; it was all quite random. Most event log reports pointed to the GPU as the cause. In the end I isolated two reasons why my machine was crashing.
Crash Reason 1 – ECC State
At some point I had enabled the ECC state in the Nvidia control panel when it should be left disabled (unchecked). This option should not be displayed to the end user as the RTX 4090 doesn’t support it.
ECC is a hardware memory check state reserved for much higher end graphics cards that might be operating in an environment such as a server farm or enterprise environment. By enabling this option, the response of the driver to the hardware is undefined. There is some sparse information about it within the Nvidia web site but it amounts to a handful of words.
Crash Reason 2 – Power Cables
The supplied power adapter and power cable combination was inadequate.
I first detected this when, after having several resets in a row, I set about disconnecting and reconnecting all the power connectors to make sure that the connection was good and to remove oxidation build up. After that was complete, the PC worked perfectly.
I ordered a 600 W power cable supplied by Corsair for just less than £20 and eventually replaced the original mess of power cables and the Gainward RTX 4090 adapter. Since then I have had no problems.
How Good Were the Companies?
No Support From Gainward
I tried contacting Gainward by email using a contact address they had listed on their website. I wrote twice, with a separation of 4 weeks and they did not acknowledge either email.
Little Support from Overclockers
I thought it prudent to check in with Overclockers who supplied the Gainward GPU so I would have an early reference point if I needed a refund. Since Overclockers appeared to be more technically savvy than other box shifters like Scan or Box I thought I might get a potentially useful response.
They asked me what driver I was using with an email that was marked with a ‘Pro Gamer’ tag. I supplied the requested information and a complete dossier on everything else I had tried to date. I then received an automated email notification that the support ticket was ‘resolved’ and closed.
I would have liked a discussion about what my options might be for a potential refund or replacement as necessary. Even a link to their returns policy would have been acceptable. However, as I now had a reference point for future use I didn’t pursue the issue.
Reluctant Support from MSI
I thought that maybe my motherboard was not supplying enough power or was otherwise incompatible and might need a firmware update, so I contacted MSI. At first I thought they were being difficult, but eventually it seemed to me there was:
- A contractually defensive attitude
- A language barrier
- A cultural barrier
I deduced that my contact might be Chinese because of the way the language was used in the emails, and when I embraced that premise I started to make progress. MSI tried to stick to the script of saying the firmware was non-standard because Novatech had added their own branding. They redirected me to Novatech for further support several times. I eventually broke through that barrier with repetition and the pedantic logic that only MSI had the source code and that branding can only be supported by MSI for it to work. I knew this from having worked as a firmware engineer for many years.
After that they gave me politically correct official replies based on expectation rather than actual facts. For example, I could interact with my BIOS enough to know that its reflash tool was working with 100% functionality whereas they were saying it wouldn’t for various reasons.
I had to be creative in my interpretation of their replies. I then summarised what I thought they were saying and they confirmed it. They seemed to relax a bit more with this route because I could say things that they didn’t seem to feel they were allowed to, but I kept it uncritical. Using this method I extracted a lot more information than I otherwise would have.
The net result was that I could see no reason at this time to risk upgrading the MSI firmware and I could rule out the motherboard as a source of problems.
Good Support from Novatech
Novatech provided the original PC but had nothing to do with the Gainward RTX 4090 upgrade, so they had no reason to help me. I contacted them in order to decipher some of the information I was receiving from MSI. Novatech told me a few things about the MSI firmware and what I could reasonably expect, and in this way I could understand what MSI were telling me and what they were leaving out. This was very helpful in understanding and extracting information from MSI.
Good Support from Nvidia
When I contacted Nvidia, I didn’t anticipate getting much help, but this was not the case. From March to August I exchanged emails with two support personnel at Nvidia. I didn’t always agree with what they were saying, but they didn’t give up on me. They left the ticket open for several months while I kept them informed of everything I was doing whether they wanted it or not, long after their suggested fixes had failed to work.
Sometimes they did not reply, since there was nothing they could add, but I kept going. However, they did suggest that whatever it was, they felt it was something to do with the power supply – and this proved to be absolutely correct, although I had to find a test to prove that for myself.
Things I Tried to Fix the Resets
I tried everything in the list below to stop my computer crashing. Many of these things made the system run better, but none of them stopped the crashes. The actual solution was to replace the power adapter with a 600 W power cable. Apparently the power transients from the RTX 4090 last significantly longer than normal, and this can cause problems for the power supply.
Things that were Helpful
- Use WinDbg and/or WhoCrashed to analyse crash files
- Examine the system event logs
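As a rough sketch of how the event log check could be scripted, the Python below builds a query for the Windows `wevtutil` command-line tool. The event IDs are my assumption about what is worth looking for (41 is the Kernel-Power "rebooted without cleanly shutting down" event, 4101 is the "display driver stopped responding" event); they are not taken from my own logs, and the function names are purely illustrative.

```python
import subprocess

# Event IDs assumed relevant here (an assumption, not from the article):
#   41   - Kernel-Power: the system rebooted without shutting down cleanly
#   4101 - Display: the display driver stopped responding and recovered
EVENT_IDS = (41, 4101)


def build_query(event_ids=EVENT_IDS, max_events=20):
    """Build a wevtutil command that lists the newest matching System events."""
    condition = " or ".join(f"EventID={eid}" for eid in event_ids)
    xpath = f"*[System[({condition})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",
        f"/c:{max_events}",   # limit the number of events returned
        "/rd:true",           # newest events first
        "/f:text",            # plain-text output
    ]


def recent_reset_events(max_events=20):
    """Run the query (Windows only) and return wevtutil's text output."""
    result = subprocess.run(
        build_query(max_events=max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a Windows machine, `print(recent_reset_events())` should list the most recent matching events, which makes it easier to line up resets against whatever was changed that week.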
Things that Didn’t Work
- With respect to the Nvidia Drivers:
  - Roll back the Nvidia Game Ready drivers to earlier versions. This helped sometimes but in the end none of them solved the problem
  - Install selected Nvidia driver using the custom option for a ‘clean install’
  - Use Display Driver Uninstaller (DDU) to make sure all traces of old software were removed.
  - Use the Nvidia clean-up tool. This is supposed to erase all existing traces of graphics services prior to reloading a selected driver.
  - Update the Nvidia control panel to the latest version in the MS Store
  - Turn off ECC checking within the Nvidia Control Panel that I had enabled months before. Note: Nvidia knows what GPU is installed and thus should not be showing this option for the RTX 4090. It is not applicable and should always be set to unchecked / disabled.
- Run the Softpedia Video Card Stability Test for about an hour. The test passed, but unfortunately it didn’t stop the RTX 4090 resetting my PC, which was confusing.
- Repair Windows:
  - Download a fresh Windows 10 ISO file
  - Move the ISO file to my secondary drive
  - Double-click the ISO file to mount it as a new drive
  - Run setup
  - Ask for repair, keeping apps and data
  - After an hour or so, Windows is repaired and the original copy is moved to a folder called Windows.old
  - Remove Windows.old using the ‘storage’ option in the control panel.
- Run a complete memory test using the built in Windows tool
- Set the power performance for the system to maximum
- I was using an extension socket lead to power the PC. Replace the extension lead containing a surge protector with one that does not have one.
- Disable Multi-Plane Overlay (MPO)
- Run system checks in a CMD box raised to Admin level:
  - sfc /scannow
  - chkdsk
  - All the DISM options
- Use CPU-Z Tools to check for driver updates. I found GPU, Motherboard, Network, USB and Bluetooth drivers that could be updated.
- Replace the Nvidia Game Ready driver with the Studio driver. The difference is just a few FPS, which is worth giving up for the stability and for only needing to update the driver a few times per year. I don’t want to be using a driver that has been hacked to support the latest irrelevant game.
- Change from DirectX 12 (beta) to DirectX 11
- Give nvlddmkm.sys full control
- Turn off hardware scheduling
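For anyone wanting to repeat the MPO and hardware scheduling changes, here is a minimal sketch that generates the text of a `.reg` file. The registry keys and values are assumptions based on commonly published settings, not something recorded in my notes, and the `TWEAKS` table and function name are made up for the example. Back up the registry before importing anything like this.

```python
# Registry values below are assumptions based on commonly published settings,
# not taken from the article. Back up the registry before importing anything.
TWEAKS = {
    "disable_mpo": (
        r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Dwm",
        "OverlayTestMode",
        5,   # 5 = Multi-Plane Overlay disabled
    ),
    "disable_hw_scheduling": (
        r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers",
        "HwSchMode",
        1,   # 1 = hardware-accelerated GPU scheduling off, 2 = on
    ),
}


def reg_file_text(names):
    """Return .reg file text that sets the DWORD value for each named tweak."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for name in names:
        key, value_name, data = TWEAKS[name]
        lines.append(f"[{key}]")
        lines.append(f'"{value_name}"=dword:{data:08x}')
        lines.append("")
    return "\r\n".join(lines)
```

Generating the file rather than editing the registry by hand keeps a record of exactly what was changed, which matters when a change has to be undone after it fails to fix anything.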
Gamers Nexus Explains All
This Gamers Nexus video explains the weird random shutdowns: they are caused by power transients that last significantly longer than those of previous GPUs. I have a 1000 watt Platinum Corsair power supply, but the power adapter supplied by Gainward, in combination with the cables that came with the PC, was not good enough for the job. In addition, the cables weighed nearly 0.5 kg and were fairly inflexible. If you have a 4090, install a decent high power cable now to ensure you get the best performance from your system.
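To give a feel for why the cable matters, here is a small back-of-envelope calculation of the steady-state current a 600 W cable has to carry. It assumes a 12 V supply rail and six current-carrying 12 V pins in the GPU connector; both figures are my assumptions for illustration, and the calculation says nothing about the size of the transients themselves.

```python
def cable_current(power_watts, rail_volts=12.0, power_pins=6):
    """Return (total amps, amps per pin) for a given steady power draw.

    rail_volts and power_pins are assumed values for illustration only.
    """
    total_amps = power_watts / rail_volts
    return total_amps, total_amps / power_pins


total, per_pin = cable_current(600)
print(f"{total:.1f} A total, {per_pin:.2f} A per pin")
# prints: 50.0 A total, 8.33 A per pin
```

With that much current flowing through a handful of small pins, a marginal adapter or a poorly seated connector has little room for error.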
Conclusion
- The Nvidia control panel should not show an ECC setting to the domestic market at all. It is not supported by an RTX 4090 and will cause problems if it is enabled. Ensure ECC is unchecked.
- RTX 4090 GPUs should be supplied with a proper high power cable, not an adapter. The price difference between the high power cable and the adapter is negligible compared to the price of the GPU. The lack of one cost me hundreds of hours of debugging over an 8-month period and needlessly used up Nvidia support’s time. If you don’t have a high power cable already, consider buying one now.
- When an idea failed to solve the problem I returned the change to its original state, because I had already invested a lot of time in creating settings to support MSFS. However, there were some things I did that were worth doing and keeping:
  - Repair Windows – it just seemed happier afterwards.
  - Use the NVIDIA Studio drivers – I’m willing to wait for stable code before getting slightly better FPS
  - Update the NVIDIA control panel – I didn’t know it was out of date in the first place
  - Use CPU-Z to update drivers – system level drivers are worth taking care of
  - Run sfc, chkdsk and DISM system checks – to flush out errors you are unaware of
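The system checks can be run one after another from a script rather than typed by hand. The sketch below is one possible way to do it from an Admin-level prompt; the particular switches (the `C:` drive, the online `chkdsk /scan`, and the three DISM health options) are illustrative choices on my part rather than a record of exactly what I ran, and `run_checks` is just a name made up for the example.

```python
import subprocess

# Commands assumed from the tools named in the article; the exact switches
# (drive letter, DISM options) are illustrative choices, not the author's.
CHECKS = [
    ["sfc", "/scannow"],
    ["chkdsk", "C:", "/scan"],
    ["DISM", "/Online", "/Cleanup-Image", "/CheckHealth"],
    ["DISM", "/Online", "/Cleanup-Image", "/ScanHealth"],
    ["DISM", "/Online", "/Cleanup-Image", "/RestoreHealth"],
]


def run_checks(checks=CHECKS, runner=subprocess.run):
    """Run each check in order and return a list of (command, exit code)."""
    results = []
    for cmd in checks:
        completed = runner(cmd)          # output goes straight to the console
        results.append((" ".join(cmd), completed.returncode))
    return results
```

Calling `run_checks()` on Windows runs the real tools and returns their exit codes; the `runner` parameter exists only so the function can be tried out without actually launching anything.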