Engineers assess Curiosity computer glitch (UPDATED)

Editor's note...
  • Posted at 09:45 PM EST, 02/28/13: Engineers assess Curiosity computer glitch
  • Updated at 11:45 PM EST, 02/28/13: Adding details; interview with project manager
CBS News

Space radiation may be to blame for memory corruption in the Curiosity Mars rover's flight computer that caused software glitches, interrupted the flow of science data Wednesday and prompted ground controllers to switch to a redundant computer Thursday, NASA officials said.

Engineers are reviewing telemetry and diagnostic tests using ground systems to figure out what went wrong and how to restore the original computer system to normal operation.

"We were in a state where the software was partially working and partially not, and we wanted to switch from that state to a pristine version of the software running on a pristine set of hardware," Curiosity Project Manager Richard Cook told CBS News late Thursday. "The easiest way to do that is to essentially swap sides ... and start up with the redundant (computer)."

The Curiosity rover, seen in this computer graphic, is designed to operate for at least two years, searching for carbon compounds and signs of past or present habitability. (Credit: NASA)

Curiosity is equipped with twin flight computers, known as A and B, and either one is capable of carrying out the rover's mission. The B-side computer was used during the cruise from Earth to Mars while the A-side computer has been running the show since before landing last August.

The switch from the A-side computer to the B-side processor took place around 5:30 p.m. EST (GMT-5) Thursday, putting the nuclear-powered rover into a low-activity state known as "safe mode." Over the next few days, engineers will tie the B-side computer into the rover's myriad systems and systematically restore normal operation.

The computer problem is the first glitch of any significance since Curiosity's landing last August in Gale Crater.

"I think we'll get back to routine operations," Cook said. "It is a good humbling experience, however, in the sense that this thing is a very complicated vehicle. ... It does not take very much for things to not go as well as you want. And we've got to be conscious of that all the time."

The problem came to light Wednesday morning on Mars when flight controllers at the Jet Propulsion Laboratory in Pasadena, Calif., noticed what appeared to be memory corruption in the computer's solid-state memory system. The flight software was not recording new data or playing back data already recorded. Instead, it was only sending back real-time telemetry.

Later in the day, during a communications session using NASA's Mars Reconnaissance Orbiter, telemetry from Curiosity indicated the corrupted memory was still present. In addition, Cook said, flight controllers saw the computer had not completed several pre-planned activities.

At that point, the computer was expected to put itself to sleep for an hour or so and then to wake up for a communications session with NASA's Odyssey orbiter.

"It was after that second overflight that we got some more information saying hey, the memory is still corrupted and oh by the way, I didn't go to sleep when I was supposed to, I stayed awake," Cook said.

The next communications session came late Wednesday night Earth time, between 10:30 p.m. and midnight at the Jet Propulsion Laboratory. The rover's computer was still awake and engineers decided to switch over to the B-side system.

Cook said the memory in question is "hardened" to resist upsets caused by cosmic rays or high-energy particles from the sun. But it is possible an energetic particle hit in a particularly sensitive area -- the directory that tells the computer where data is stored.

"In general, there are lots of layers of protection, the memory is self correcting and the software is supposed to be tolerant to it," Cook said. "But what we are theorizing happened is that we got what's called a double bit error, where you get an uncorrectable memory error in a particularly sensitive place, which is where the directory for the whole memory was sitting.

"So you essentially lost knowledge of where everything was. Again, software is supposed to be tolerant of that. ... But it looks like there was potentially a problem where software kind of got into a confused state where parts of the software were working fine but other parts of software were kind of waiting on the memory to do something ... and the hardware was confused as to where things were."

Cook said the odds of a cosmic ray or solar particle causing a problem like that were remote, but similar events have happened before.

"It's a little bit like a phone book full of addresses and instead of the cosmic ray hitting some random person's phone number it hits the table of contents," he said. "It's a very low probability event but it certainly can happen."
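For readers curious why a "double bit error" is harder to handle than a single flipped bit, the sketch below shows a textbook single-error-correcting, double-error-detecting (SECDED) Hamming code on a 4-bit value. It is an illustration only: the article does not say which error-correcting scheme Curiosity's memory uses, and the function names `encode` and `decode` are hypothetical.

```python
# Textbook SECDED sketch: Hamming(7,4) plus one overall parity bit.
# Assumption: this is NOT the rover's actual scheme, only an illustration
# of why one flipped bit can be repaired while two can only be detected.

def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of 0/1 values)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Data bits sit at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit so the whole word has even parity.
    code[0] = sum(code[1:]) % 2
    return code

def decode(code):
    """Return (status, nibble); nibble is None if the word is uncorrectable."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits.
    # It is 0 for a valid word and equals the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped, so repair it.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1          # a single upset is repaired transparently

two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1         # a double upset is detected but cannot be repaired

print(decode(word))
print(decode(one_flip))
print(decode(two_flips))
```

In this sketch the single flip is silently corrected, while the double flip can only be flagged, leaving the software above it to cope with data it can no longer trust.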

If that theory is correct, powering the A-side computer back up should clear the problem. But engineers plan to take their time, carrying out a detailed analysis before attempting a reset.

"We can operate on the B side just as well as we can operate on the A side," Cook said. "So for the next week or so, we're going to spend time getting things going again running on the B side.

"We're going to eventually want to turn on the A side. If it is a memory corruption thing, it should clear it up with a power cycle. At that point, the software rebuilds the table of contents and starts from scratch."

Curiosity landed in Gale Crater on Aug. 6. The $2.5 billion mission is devoted to searching for signs of past or present habitability and for evidence of organic compounds like those necessary for life as it is known on Earth.

The mission is intended to last for at least two years and possibly longer, depending on funding and the health of the spacecraft.