This seems to be a popular introduction to Capsule Networks. In Part I, on the “intuition” behind them, the author (not Geoffrey Hinton, though the passage is formatted identically to the quote from him immediately above it) says:
Internal data representation of a convolutional neural network does not take into account important spatial hierarchies between simple and complex objects.
This is simply not true. In fact, Sabour, Frosst, and Hinton address this very issue in their paper Dynamic Routing Between Capsules [pdf]:
Now that convolutional neural networks have become the dominant approach to object recognition, it makes sense to ask whether there are any exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints. The ability to deal with translation is built in, but for the other dimensions of an affine transformation we have to chose [sic] between replicating feature detectors on a grid that grows exponentially with the number of dimensions, or increasing the size of the labelled training set in a similarly exponential way. Capsules (Hinton et al. [2011]) avoid these exponential inefficiencies…
This is fundamental, and I hope folks avoid the error of thinking that ConvNets can’t “take into account important spatial hierarchies between simple and complex objects”. That is exactly what they do. But as models of how brains track these hierarchies under transformations, they are badly inefficient at it.
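To make the contrast concrete, here is a minimal NumPy/SciPy sketch (my own illustration, not code from the post or the paper): convolution is translation-equivariant by construction, so a translated pattern simply produces a translated feature map, but nothing analogous holds for other affine transformations such as rotation. The kernel, pattern, and offsets below are arbitrary choices made for the demonstration.

```python
import numpy as np
from scipy.signal import correlate2d  # a conv layer computes cross-correlation

rng = np.random.default_rng(0)

# An arbitrary, asymmetric 3x3 filter (a stand-in for a learned feature detector).
kernel = rng.standard_normal((3, 3))

# A small pattern placed on a zero canvas, and the same pattern
# translated by (dy, dx) = (10, 7).
pattern = rng.standard_normal((8, 8))
img = np.zeros((32, 32))
img[4:12, 4:12] = pattern
shifted = np.zeros((32, 32))
shifted[14:22, 11:19] = pattern

out = correlate2d(img, kernel, mode="valid")
out_shifted = correlate2d(shifted, kernel, mode="valid")

# Translation equivariance, built in: the response to the shifted pattern
# is the original response shifted by the same (10, 7).
assert np.allclose(out[2:12, 2:12], out_shifted[12:22, 9:19])

# No such property for rotation: rotating the input does not rotate the
# feature map, because the (unrotated) filter no longer matches the pattern.
out_rotated = correlate2d(np.rot90(img), kernel, mode="valid")
print(np.allclose(out_rotated, np.rot90(out)))  # False for a generic kernel
```

That failure is exactly the inefficiency the quote describes: to respond consistently at every orientation (or scale, or shear), a plain ConvNet must either replicate the filter across a grid of transformed copies, which grows exponentially with the number of transformation dimensions, or be trained on a similarly exponential amount of labelled data.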