Apple's A4 Chip: Patent Papers Suggest Secrets in Silicon
In technical circles, one part of the iPad buzz has been its microprocessor, the A4, with a lack of detail from Apple (AAPL) fueling speculation about what it can or can't do. Unlike most of its other products, Apple went with its own custom semiconductor design. The prevailing opinion says that the chip is nothing special. However, it looks as though Apple has released some details -- through a number of patent applications -- and that there is something interesting going on in the silicon.
Jon Stokes at Ars Technica sums up the smart-money view that there "just isn't anything to write home about" because the chip "is a 1GHz custom SoC with a single Cortex A8 core and a PowerVR SGX GPU" -- in other words, Apple's design is based on commercially available semiconductor intellectual property:
While it's fun to speculate about what Apple didn't include in the A4, the ultimate point is this: with one 30-pin connector on the bottom and no integrated camera of any kind, the A4 needs a lot less in the way of I/O support than comparable chips that are intended for smartphones or smartbooks. This means that the A4 is just a GPU, a CPU, memory interface block (NAND and DDR), possibly security hardware, system hardware, and a few I/O controllers. It's lean and mean to a degree that isn't possible with an off-the-shelf SoC [system-on-a-chip].

In other words, the A4 is a relatively bare-bones microprocessor. Lean and mean is important when a company like Apple wants a device that runs fast and yet doesn't hog power. And yet, as Stokes notes, Apple bought semiconductor design firm P.A. Semi in April 2008. I agree that there wasn't nearly enough time to do a new semiconductor core design from scratch. However, there does seem to have been enough time for the work to turn into nine patent applications from Apple:
- 20100042900 -- Write Failure Handling of MLC NAND
- 20100042817 -- Shift-in-Right Instructions for Processing Vectors
- 20100049951 -- Running-AND, Running-OR, Running-XOR, and Running-Multiply Instructions for Processing Vectors
- 20100049950 -- Running-Sum Instructions for Processing Vectors
- 20100042818 -- Copy-Propagate, Propagate-Post, and Propagate-Prior Instructions for Processing Vectors
- 20100042807 -- Increment-Propagate and Decrement-Propagate Instructions for Processing Vectors
- 20100042815 -- Method and Apparatus for Executing Program Code
- 20100042816 -- Break, Pre-Break, and Remaining Instructions for Processing Vectors
- 20100042789 -- Check-Hazard Instructions for Processing Vectors
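Several of those titles name "running" operations. As a rough sketch of what a running-sum instruction would compute, going only by the title (the function and its interface are my own illustration, not taken from the filings):

```python
def running_sum(v):
    """Scalar model of a hypothetical 'running-sum' vector operation:
    each output element is the sum of all inputs up to and including
    that position (a prefix sum)."""
    out, acc = [], 0
    for x in v:
        acc += x
        out.append(acc)
    return out

print(running_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

Each output element folds in everything before it -- exactly the kind of iteration-to-iteration pattern ordinary vector hardware handles poorly, which would explain a dedicated instruction for it.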
From the background section of the applications:

Recent advances in processor design have led to the development of a number of different processor architectures. For example, processor designers have created superscalar processors that exploit instruction-level parallelism (ILP), multi-core processors that exploit thread-level parallelism (TLP), and vector processors that exploit data-level parallelism (DLP). Each of these processor architectures has unique advantages and disadvantages which have either encouraged or hampered the widespread adoption of the architecture. For example, because ILP processors can often operate on existing program code that has undergone only minor modifications, these processors have achieved widespread adoption. However, TLP and DLP processors typically require applications to be manually re-coded to gain the benefit of the parallelism that they offer, a process that requires extensive effort. Consequently, TLP and DLP processors have not gained widespread adoption for general-purpose applications.

One significant issue affecting the adoption of DLP processors is the vectorization of loops in program code. In a typical program, a large portion of execution time is spent in loops. Unfortunately, many of these loops have characteristics that render them unvectorizable in existing DLP processors. Thus, the performance benefits gained from attempting to vectorize program code can be limited.

Here's the rough translation: Parallel processing is generally good in chips because it lets them work on multiple steps in a program at the same time. It's like the difference between a two-lane road and a six-lane highway. Typical approaches to parallelism have involved having more than one processing core. But more cores mean more power consumption and more expense.
As its marketing emphasis makes clear, Apple wanted to keep costs down and get as much battery life as possible. The background text goes on to explain why compilers struggle here:

One significant obstacle to vectorizing loops in program code in existing systems is dependencies between iterations of the loop. For example, loop-carried data dependencies and memory-address aliasing are two such dependencies. These dependencies can be identified by a compiler during the compiler's static analysis of program code, but they cannot be completely resolved until runtime data is available. Thus, because the compiler cannot conclusively determine that runtime dependencies will not be encountered, the compiler cannot vectorize the loop. Hence, because existing systems require that the compiler determine the extent of available parallelism during compilation, relatively little code can be vectorized.
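Memory-address aliasing, mentioned in that passage, is easy to show with a toy copy loop (my own example, in Python standing in for compiled code): whether the loop is safe to vectorize depends on whether the source and destination ranges overlap, and a compiler often can't know that until runtime.

```python
def shift_copy(buf, src_start, dst_start, n):
    # Copies n elements within one buffer, one at a time. If the ranges
    # overlap (here, destination starts inside the source range), later
    # iterations read values that earlier iterations already overwrote,
    # so the loop can't be blindly replaced with one bulk vector copy.
    for i in range(n):
        buf[dst_start + i] = buf[src_start + i]
    return buf

# Overlapping ranges: the scalar loop "ripples" early values forward.
# A naive bulk copy of the original values would give [1, 2, 1, 2, 3, 4].
print(shift_copy([1, 2, 3, 4, 0, 0], 0, 2, 4))  # [1, 2, 1, 2, 1, 2]
```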
So Apple had the P.A. Semi engineers work on DLP -- having the chip perform an operation on multiple pieces of data at the same time, otherwise known as vectorizing. Graphics, audio, and video processing -- the types of media the iPad targets -- are natural candidates for DLP. The more efficiently the A4 can process its main data, the faster it will run and the less power it will use.
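For a concrete (and purely illustrative) case of data-level parallelism, consider scaling a buffer of audio samples. Every iteration is independent of every other, so a vector unit can apply the multiply to many samples per instruction:

```python
def scale_samples(samples, gain):
    # No iteration reads anything another iteration writes, so this
    # loop vectorizes trivially: a DLP machine can process a whole
    # block of samples with one wide multiply.
    return [s * gain for s in samples]

print(scale_samples([1, 2, 3], 4))  # [4, 8, 12]
```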
There are problems with DLP, especially in a program's loops -- sections of code that repeat until the program reaches some particular condition or result. Values of variables can change from one cycle of the loop to the next, creating so-called data dependencies, because the result of one repetition of a loop depends on the results of a previous one. Unless you can take that into account, you can't use DLP in the loops, because you could get an incorrect result. Programmers have typically hand-optimized code to use DLP techniques.
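A toy illustration of the difference (my own example, not from the patents): the first loop below carries a dependency from one iteration to the next, so its iterations must run in order; the second has no such dependency and is safe to vectorize.

```python
def loop_with_dependency(v):
    # out[i] uses out[i - 1]: a loop-carried data dependency.
    # Naively running iterations side by side would read stale values.
    out = [v[0]]
    for i in range(1, len(v)):
        out.append(out[-1] * v[i])
    return out

def loop_without_dependency(v):
    # Each iteration stands alone: trivially vectorizable.
    return [x * x for x in v]

print(loop_with_dependency([2, 3, 4]))     # [2, 6, 24]
print(loop_without_dependency([2, 3, 4]))  # [4, 9, 16]
```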
What a number of these patents seem to address is a set of techniques that let the chip track data dependencies and automatically recognize when it can process data in parallel without hand optimization. If I'm reading this correctly, that would let existing apps run faster without programmers rewriting them, which, in turn, would improve the iPad user's perception of speed as well as put less strain on the battery.
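If that reading is right, the runtime behavior might look something like the sketch below: test for a hazard while the program runs, take a wide "vector" path when the data permits, and fall back to careful scalar execution when it doesn't. This is purely my guess at the spirit of the check-hazard filing, not its actual semantics.

```python
def copy_range(buf, src, dst, n):
    # Toy model of a runtime hazard check for a copy loop. If the source
    # and destination ranges can't conflict, the copy is dependency-free
    # and could run as one wide vector operation (modeled by a slice
    # assignment). Otherwise, fall back to in-order scalar execution,
    # which preserves the original loop's semantics.
    hazard = src < dst < src + n or dst < src < dst + n
    if not hazard:
        buf[dst:dst + n] = buf[src:src + n]  # "vector" fast path
    else:
        for i in range(n):                   # scalar fallback
            buf[dst + i] = buf[src + i]
    return buf

print(copy_range([1, 2, 3, 4, 0, 0], 0, 4, 2))  # no hazard: [1, 2, 3, 4, 1, 2]
print(copy_range([1, 2, 3, 4, 0, 0], 0, 2, 4))  # hazard:    [1, 2, 1, 2, 1, 2]
```

The check is deliberately conservative: whenever the ranges overlap it takes the slow path, so the fast path only runs when it is guaranteed to match what the scalar loop would have produced.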