Taking liberties and cutting corners:
Zernike polynomials are a set of mathematical functions that, when weighted and added up, give the shape of the wavefront coming from a lens.
Functions of this sort, the best known being Fourier and Laplace, allow transforming an arbitrarily complex formula into a set of simpler formulas. Very simplistically: the Fourier components of a sound are what the bouncing-bar equalizer display of an MP3 player shows. Finding how something (say a loudspeaker) will respond to a complex signal (say a snare drum) is very hard to calculate. But how a loudspeaker will respond to a pure tone is easy to calculate. So if one takes the sound of a snare drum and finds the set of pure tones that make up its sound, then the response of the speaker to the snare drum will be the same as adding together the loudspeaker's responses to each of the pure tones (as long as the speaker behaves linearly). It is a lot easier to calculate the response of a thousand types of loudspeaker than to build and measure a thousand types.
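The adding-up idea can be tried out in a few lines of Python. This is only a sketch: the `speaker` function below is a made-up stand-in (a 3-sample moving average), chosen because it is linear, and the three tones and their loudness are arbitrary. It checks that sending the whole sound through the "speaker" gives the same result as sending each pure tone through separately and adding the results.

```python
import math

def tone(freq, n=64):
    """One pure tone: `freq` full cycles across n samples."""
    return [math.sin(2 * math.pi * freq * k / n) for k in range(n)]

def speaker(signal):
    """Hypothetical stand-in for a loudspeaker: a circular 3-sample
    moving average. Any linear system would do for this demonstration."""
    n = len(signal)
    return [(signal[k - 1] + signal[k] + signal[(k + 1) % n]) / 3
            for k in range(n)]

# A "complex" sound built from three pure tones (amplitude, frequency).
parts = [(1.0, 2), (0.5, 5), (0.25, 11)]
tones = [[a * s for s in tone(f)] for a, f in parts]
drum = [sum(samples) for samples in zip(*tones)]

# Route A: feed the whole sound through the speaker.
direct = speaker(drum)

# Route B: feed each tone through separately, then add the responses.
responses = [speaker(t) for t in tones]
summed = [sum(samples) for samples in zip(*responses)]

# The two routes should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(direct, summed))
print("largest difference between the two routes:", max_diff)
```

A real loudspeaker is only approximately linear, so in practice the two routes agree only as far as that approximation holds.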
Ditto with lenses. A lens, after all, just applies a mathematical formula to the path of a ray of light.
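The MTF response of a lens is the Fourier transform of the lens's performance (strictly, the magnitude of the Fourier transform of how the lens blurs a single point of light): the MTF graph is like the lens's equalizer graph.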
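As a rough illustration of the equalizer analogy, the sketch below computes an MTF-like curve from a blur profile using a plain discrete Fourier transform. The Gaussian blur profiles and their widths are invented for the example, not measurements of any real lens; the curve is scaled so that it equals 1 at zero frequency, which is the usual convention.

```python
import cmath
import math

def mtf(line_spread):
    """Magnitude of the discrete Fourier transform of a blur profile,
    scaled so that the value at zero frequency is 1."""
    n = len(line_spread)
    spectrum = [
        abs(sum(v * cmath.exp(-2j * math.pi * f * k / n)
                for k, v in enumerate(line_spread)))
        for f in range(n // 2 + 1)
    ]
    return [s / spectrum[0] for s in spectrum]

def gaussian_blur_profile(width, n=64):
    """Hypothetical blur profile: how a lens smears a thin line,
    modelled here as a Gaussian `width` samples wide."""
    return [math.exp(-0.5 * ((k - n // 2) / width) ** 2) for k in range(n)]

sharp = mtf(gaussian_blur_profile(width=1.0))  # tight blur: "sharp" lens
soft = mtf(gaussian_blur_profile(width=3.0))   # wide blur: "soft" lens

# The softer lens loses contrast faster as the detail gets finer.
for f in (0, 2, 5, 10):
    print(f, round(sharp[f], 3), round(soft[f], 3))
```

Each entry of the list plays the role of one bar on the equalizer: how much contrast survives at that level of fine detail.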
There are many 'transforms', and some are better at representing some physical processes than others. Zernike functions are good at representing the effect of lenses on light, in part because they are defined on a circular disk, which matches the round aperture of a lens.
The functions are called 'orthogonal' because they don't interact. Like measuring a box along the three orthogonal axes of geometry class: changing the height does nothing to the depth or width.
The orthogonality is what lets one take the response to the simpler signals and add them up at the end to find the response to a complex signal. What the lens does to one Zernike polynomial has nothing to do with what it does to any other Zernike polynomial.
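The "don't interact" property can be checked numerically. The sketch below uses three low-order Zernike terms (tilt, defocus and astigmatism, in the common Noll normalization) and averages their products over the unit disk with a simple midpoint sum. The function names, grid size and the example wavefront weights (0.5 and 0.2) are choices made for this illustration only.

```python
import math

# Three low-order Zernike terms on the unit disk (Noll normalization).
def tilt(r, t):
    return 2 * r * math.cos(t)

def defocus(r, t):
    return math.sqrt(3) * (2 * r * r - 1)

def astigmatism(r, t):
    return math.sqrt(6) * r * r * math.cos(2 * t)

def overlap(f, g, nr=200, nt=200):
    """Average of f*g over the unit disk, by a midpoint-rule sum
    in polar coordinates (r = radius, t = angle)."""
    total = 0.0
    for i in range(nr):
        r = (i + 0.5) / nr
        for j in range(nt):
            t = 2 * math.pi * (j + 0.5) / nt
            total += f(r, t) * g(r, t) * r
    return total * (1 / nr) * (2 * math.pi / nt) / math.pi

# Different terms: overlap is (numerically) zero. Same term: overlap is 1.
print("tilt . defocus       ", overlap(tilt, defocus))
print("defocus . astigmatism", overlap(defocus, astigmatism))
print("defocus . defocus    ", overlap(defocus, defocus))

# An invented wavefront: 0.5 parts defocus plus 0.2 parts astigmatism.
def wavefront(r, t):
    return 0.5 * defocus(r, t) + 0.2 * astigmatism(r, t)

# Because the terms don't interact, each weight can be read off on its own.
found_defocus = overlap(wavefront, defocus)
found_astig = overlap(wavefront, astigmatism)
found_tilt = overlap(wavefront, tilt)
print(found_defocus, found_astig, found_tilt)
```

Reading off each weight separately is the lens equivalent of finding the height of the box without having to know its width or depth.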
In sum -- the whole thing is a mathematical 'trick' to make calculations easier, similar to the 'casting out nines' trick of accountants.