Bug #19486

Updated by Richard Neswold over 2 years ago

In the past few months, quite a number of front-ends have required a reboot several times a day. In my case, @MSEPTA@, @LEBIT1@, and @MEBIT1@ have been very unreliable. Attempts to fix drivers or use older libraries failed to improve their uptime. *Dennis* felt there was a correlation with when "Big Saves" occurred. This was confirmed[1] when *Mike Kuplic* found that putting a MOOC version device on a parameter page would instantly "crash" the front-end[2]. I put my three front-ends' version devices into "Document" state so that Big Save wouldn't read them, and all three front-ends stayed up all weekend without a reboot.

The built-in MOOC version device parses CVS-generated strings to determine which modules are loaded and at what version. Many projects have moved to Redmine and, hence, to *@git@*, so the version strings have probably changed format. The source code for the version device does no error checking, so it's easy to see how changing the input to this algorithm would break it.
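To make "error checking" concrete, here is a minimal sketch of a bounds-checked extractor. It assumes the input looks like the CVS keyword @$Revision: 1.12 $@; the function name @parseCvsRevision@ is hypothetical and is not taken from the MOOC source. The point is that an unrecognized string (a @git@ hash, say) is rejected with an error code instead of being copied past the end of a buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the MOOC code): extracts the revision from a
   CVS-style "$Revision: 1.12 $" keyword into 'out' (capacity 'outLen').
   Returns 0 on success, -1 if the input is not in that form or the
   revision would not fit. 'out' is never written past its capacity. */
int parseCvsRevision(const char *in, char *out, size_t outLen)
{
    static const char prefix[] = "$Revision: ";
    const size_t prefixLen = sizeof(prefix) - 1;
    const char *start;
    const char *end;
    size_t len;

    if (in == NULL || out == NULL || outLen == 0)
        return -1;
    if (strncmp(in, prefix, prefixLen) != 0)
        return -1;              /* not a CVS keyword (e.g. a git hash) */

    start = in + prefixLen;
    end = strchr(start, ' ');   /* the revision ends at the next space... */
    if (end == NULL || end[1] != '$')
        return -1;              /* ...which must be followed by '$' */

    len = (size_t)(end - start);
    if (len == 0 || len >= outLen)
        return -1;              /* empty, or too long for the buffer */

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

With a 16-byte buffer, @"$Revision: 1.12 $"@ yields @"1.12"@, while a bare @git@ hash returns -1 and leaves the buffer untouched.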

There are several ways we could fix this:

# Write a more robust parser which identifies CVS- and git-style strings and reports an error if it's neither (instead of corrupting memory).
# Develop a simple framework for modules to register their version so there's no need for any parsers; the version device simply returns information registered by the modules.
# Rewrite the version driver to use loaded module information to get the version info (whatever @moduleShow()@ uses to get its info.) I haven't looked to see if this option is possible.
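Option 2 could be as small as a fixed-size table that each module fills in from its init routine. The sketch below is only an illustration: the names @versionRegister@ and @versionLookup@ and the table sizes are hypothetical. It also assumes registration happens at startup, before anything reads the version device, so it uses no locking.

```c
#include <stddef.h>
#include <string.h>

#define VERSION_MAX_MODULES 32  /* hypothetical limits */
#define VERSION_NAME_LEN    32
#define VERSION_STR_LEN     48

/* One entry per module that has registered itself. */
typedef struct {
    char name[VERSION_NAME_LEN];
    char version[VERSION_STR_LEN];
} VersionEntry;

static VersionEntry versionTable[VERSION_MAX_MODULES];
static size_t versionCount = 0;

/* Copy src into a fixed-size buffer, truncating if necessary. */
static void copyBounded(char *dst, size_t dstLen, const char *src)
{
    strncpy(dst, src, dstLen - 1);
    dst[dstLen - 1] = '\0';
}

/* Called by a module during initialization (assumed to run at startup,
   single-threaded). Returns 0 on success, -1 on bad arguments or a
   full table. */
int versionRegister(const char *name, const char *version)
{
    if (name == NULL || version == NULL || versionCount >= VERSION_MAX_MODULES)
        return -1;
    copyBounded(versionTable[versionCount].name, VERSION_NAME_LEN, name);
    copyBounded(versionTable[versionCount].version, VERSION_STR_LEN, version);
    versionCount++;
    return 0;
}

/* What the version device would return: the string a module registered,
   or NULL if it never registered. No parsing is involved. */
const char *versionLookup(const char *name)
{
    size_t i;

    for (i = 0; i < versionCount; i++)
        if (strcmp(versionTable[i].name, name) == 0)
            return versionTable[i].version;
    return NULL;
}
```

Because the version device only copies out strings the modules supplied, a change in version-control tooling can't corrupt the heap. The worst case is a module that never registered, which shows up as "unknown."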

Option 3 has the benefit that, if a front-end developer uses @-latest@ symbolic links, we could report the version as unknown, since the version may change between reboots. This would encourage developers to specify versions in their startup scripts. If we decide to do option 2, then the framework should be able to recognize when the front-end is running BETA code (i.e. using the @-latest@ symbolic link) and report it as such. I don't think we should pursue option 1.
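Either option needs a way to spot @-latest@ usage. A minimal sketch, assuming the path a module was loaded from is available as a string and that the symbolic links follow a @name-latest@ directory convention (both are assumptions). The function name @isBetaPath@ is hypothetical.

```c
#include <string.h>

/* Hypothetical check: returns 1 if some component of 'path' ends in
   "-latest" (e.g. ".../mooc-latest/mooc.out"), 0 otherwise. */
int isBetaPath(const char *path)
{
    static const char marker[] = "-latest";
    const size_t markerLen = sizeof(marker) - 1;
    const char *p = path;

    if (path == NULL)
        return 0;
    while ((p = strstr(p, marker)) != NULL) {
        char next = p[markerLen];
        if (next == '/' || next == '\0')
            return 1;       /* "-latest" ends a path component */
        p += markerLen;     /* keep searching past this occurrence */
    }
    return 0;
}
```

Under option 3 a nonzero result would mean "report the version as unknown"; under option 2 the registry could store it as a BETA flag alongside the version string.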


fn1. No one uses the version device except as a diagnostic. Big Saves, however, always read them, and that's when the front-ends would go down.

fn2. The "crash" is that the @RETRPY@ task is suspended, so no @RETDAT@ data will ever be returned. The task gets suspended because the heap is detected as corrupted.