Curiosity computer swap, troubleshooting continue

By WILLIAM HARWOOD
CBS News

Work to carry out what amounts to an electronic brain transplant aboard the Curiosity Mars rover -- a complex sequence of steps to switch operations to a backup flight computer -- is continuing this week amid ongoing analysis to figure out how to resolve memory corruption discovered last week in the rover's active computer.

The memory glitch interrupted science operations, forcing flight controllers to put the craft in a low-activity "safe mode" while the computer switch was implemented.

Richard Cook, the Mars Science Laboratory project manager at the Jet Propulsion Laboratory in Pasadena, Calif., told CBS News Monday the computer swap was going well and that limited science operations should resume shortly.

A self-portrait mosaic of the Curiosity Mars rover, assembled from images collected late last year. (Credit: NASA)

"We spent the weekend kind of getting back, not totally to regular operations, but at least out of the immediate safe mode kind of a thing," he said. "We got it out of safe mode, got back to using the high-gain antenna, so we're well along the way to restoring things."

The problem cropped up last Wednesday when Curiosity failed to send back science data as expected and then failed to put itself to sleep during scheduled downtime. Reviewing telemetry, engineers discovered data corruption in the solid-state memory used by the rover's active flight computer.

Curiosity is equipped with two redundant computer systems, known as "side A" and "side B." Either one is capable of carrying out the rover's mission and only one operates at a time with the other on standby as a backup. The B-side computer was checked out during the cruise from Earth to Mars while the A-side computer has been running operations since before landing last August.

Cook said the switchover to side B is a complex procedure and that engineers are taking their time to make absolutely sure the process is carried out correctly.

"We have some more work to do to upload configuration files and parameters, things like that, so it's going to be another few days or so to kind of get things totally recovered," he said. "But basically, it's going well."

Once the B-side computer is fully up and running, limited science operations should resume. But Cook said the engineering team wants to have a better idea of what went wrong with the A-side memory before going "full throttle" on the B-side computer.

Engineers suspect the memory glitch might have been caused by space radiation, a "single-event upset" in which an energetic particle made it through radiation-hardened components and changed the state of one or more memory cells. As luck would have it, the corruption was found in the memory's directory, which tracks where data is stored.

If that theory is correct, booting the A-side computer and its software would be expected to rewrite the memory blocks, presumably flushing the corrupted data. In that case, assuming no other problems, the A-side computer would be deemed healthy and cleared to serve as backup to the B-side computer.

But before attempting a full reboot, Cook said, engineers plan to power up the A-side machine Wednesday, without loading software, to check the status of the non-volatile memory.

"The first thing you can do is just turn it on without software running and just treat it like it's an extended memory bank," he said. "That's actually what we're going to do first, we're just going to read the memory. If it comes back saying it's got a bit error, then that means it's still corrupted."

Because the memory retains data when it is powered down, engineers expect the corruption will still be present when they power the system back up. The real question is whether data can be successfully stored in the affected locations.

"If you then turned around and wrote to it, and it said, hey, I still can't write to this memory cell without getting an error, then it would tell you there's something more systemic going on, or more permanent," Cook said.

It's a bit of a "catch-22" for the computer experts at JPL, he added. Letting the computer's software boot up and write data to the suspect memory locations would destroy evidence that might help pin down what went wrong in the first place.

"So the first thing we're going to do is just bring it up, read the memory, dump memory from the areas where we think we had a problem and take a look at that and then decide what to do next, whether or not to write it," Cook said. "If it looks like it's all better, we may just bring software up and then software will essentially do the same thing, but for all the memory at once."

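The read-first, write-later sequence Cook describes can be illustrated with a short Python sketch. This is a toy model, not JPL's flight software or tooling: the SimulatedMemory class, the known-good reference contents and the 0x55/0xAA test patterns are all assumptions made for illustration. It shows why the read-only dump comes first: the write test overwrites the very cells that hold the evidence.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedMemory:
    """Toy non-volatile memory (hypothetical; not the rover's hardware)."""
    cells: list
    # address -> (mask, forced): bits under `mask` always read back as `forced`
    stuck_bits: dict = field(default_factory=dict)

    def read(self, addr):
        return self.cells[addr]

    def write(self, addr, value):
        mask, forced = self.stuck_bits.get(addr, (0, 0))
        # Stuck bits ignore the written value and keep their forced state.
        self.cells[addr] = (value & ~mask & 0xFF) | (forced & mask)


def dump_suspects(mem, expected):
    """Read-only pass: list addresses whose contents differ from the reference."""
    return [a for a, want in enumerate(expected) if mem.read(a) != want]


def write_test(mem, addr, patterns=(0x55, 0xAA)):
    """Destructive pass: write alternating bit patterns and read them back."""
    for pattern in patterns:
        mem.write(addr, pattern)
        if mem.read(addr) != pattern:
            return False
    return True


def diagnose(mem, expected):
    """Dump the evidence first, then decide whether each fault is permanent."""
    suspects = dump_suspects(mem, expected)
    snapshot = {a: mem.read(a) for a in suspects}  # saved before any write
    verdicts = {
        a: "transient" if write_test(mem, a) else "permanent"
        for a in suspects
    }
    return snapshot, verdicts
```

In this sketch, a cell that accepts both test patterns is labeled "transient," consistent with a one-time upset, while a cell that fails the read-back is labeled "permanent," the case Cook describes as "something more systemic going on."
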
If the memory problem cannot be corrected, programmers could attempt to bypass the corrupted locations with a software patch.

"There are multiple banks of memory, it's not a single monolithic thing," Cook said. "So if you had an uncorrectable error in one place, then you could effectively map it out, you would tell software when it's booting up don't try to use this area of memory. That's an example of something you could do."

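A minimal Python sketch shows what "mapping out" a bad region could look like in principle. The bank count, the blocks-per-bank figure and the function names are hypothetical and are not drawn from Curiosity's actual software. The idea is simply that the allocator is handed a list of banks to avoid and places data only in the remaining ones.

```python
def usable_banks(total_banks, bad_banks):
    """Return the bank indices that have not been mapped out."""
    return [b for b in range(total_banks) if b not in bad_banks]


def place_blocks(n_blocks, total_banks, blocks_per_bank, bad_banks):
    """Assign logical blocks to (bank, offset) slots, skipping bad banks."""
    good = usable_banks(total_banks, bad_banks)
    capacity = len(good) * blocks_per_bank
    if n_blocks > capacity:
        raise ValueError("not enough healthy memory for the requested blocks")
    return [
        (good[i // blocks_per_bank], i % blocks_per_bank)
        for i in range(n_blocks)
    ]
```

With four banks of two blocks each and bank 1 mapped out, place_blocks(5, 4, 2, {1}) fills banks 0, 2 and 3 and never touches bank 1. The cost is reduced capacity: the mapped-out bank's storage is simply unavailable.
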
Curiosity landed in Gale Crater on Aug. 6. The $2.5 billion mission is devoted to searching for signs of past or present habitability and for evidence of organic compounds like those necessary for life as it is known on Earth.