Data Aborts. Lovely little exceptions. You might have guessed correctly from the term Data Abort that I was running on a bare-metal embedded system. No Linux with a nice oops splat and lovely debugging tools readily available. Not to say that there aren't good debugging tools in the bare-metal embedded world, but they are different. Luckily I had access to JTAG and a Lauterbach, so I could inspect the state of the system quite nicely.
The data abort was a moving target: it manifested itself in different places in the code, shifting slightly with every report. So there was some weird interaction triggering it. At first it was happening only sparsely in the nightly tests. But luckily one of my colleagues found an easier way to reproduce the issue - having a reproducer is almost half the way to solving any debugging problem!
This abort had another lovely characteristic: it faulted on accessing what seemed to be a legal address! At first glance at least, when inspecting the disassembly and the contents of the registers. Then I realized that the contents of the data fault registers were sometimes garbage and that the real cause of the data abort was External. Hmmm… Is it really external? How?!
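For the curious, on ARMv7-R the fault cause is reported in the Data Fault Status Register (DFSR) and the faulting address in the Data Fault Address Register (DFAR), both in CP15. Here is a rough sketch of the kind of decode involved (reconstructed for this post, not the actual abort handler; the helper names are mine). Note that for an asynchronous external abort the DFAR is architecturally UNKNOWN, which is exactly the "garbage" I was staring at:

```c
#include <stdint.h>

/* Read the Data Fault Status Register (DFSR). */
static inline uint32_t read_dfsr(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c5, c0, 0" : "=r"(v));
    return v;
}

/* Read the Data Fault Address Register (DFAR). */
static inline uint32_t read_dfar(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c6, c0, 0" : "=r"(v));
    return v;
}

/* Called from the data abort handler to classify the fault (sketch only). */
void report_data_abort(void)
{
    uint32_t dfsr = read_dfsr();
    uint32_t dfar = read_dfar();

    /* Fault status FS[4:0] = {DFSR[10], DFSR[3:0]} in the short-descriptor format. */
    uint32_t fs = (((dfsr >> 10) & 1u) << 4) | (dfsr & 0xFu);

    if (fs == 0x08) {
        /* Synchronous external abort: DFAR holds the faulting address. */
    } else if (fs == 0x16) {
        /* Asynchronous external abort: DFAR is UNKNOWN, i.e. "garbage". */
    }
    (void)dfar;
}
```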
Using this reproducer, the code was faulting predictably in a piece of code that was shuffling and processing a lot of data. So moving data around was my first clue. But still, how could that cause an external abort?
As a first experiment, I disabled the data cache - and this made the problem disappear. But disabling caches messes greatly with timing, and if there was any sort of race condition involved, disabling the caches might have changed the timing enough to mask the real problem instead of fixing it.
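The experiment itself is essentially a one-liner on ARMv7-A/R: clear the C bit in the SCTLR. A minimal sketch, assuming the data cache has already been cleaned and invalidated (which a real implementation has to do around this):

```c
#include <stdint.h>

/* Disable the data cache by clearing SCTLR.C (bit 2). Assumes the cache was
 * cleaned and invalidated beforehand; this is a sketch of the experiment only. */
static inline void dcache_disable(void)
{
    uint32_t sctlr;

    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr)); /* read SCTLR */
    sctlr &= ~(1u << 2);                                         /* clear the C bit */
    __asm__ volatile("dsb");
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr)); /* write SCTLR back */
    __asm__ volatile("isb");
}
```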
ARM allows you to read the contents of the caches - so using JTAG I wrote a little script to dump them. Inspecting the contents of both the D and I caches didn't show anything weird. All the instructions we were executing just before faulting and the contents of the data cache looked correct. So no weird memory corruption, in either the D$ or I$.
At this point I started suspecting some hardware-related issue at the fabric where the address gets corrupted. But before going there, there were a few things I could still try. I disabled branch prediction - no luck. I verified that the software sets up the registers that report the fault address correctly - and yes, the reported address/cause was correct. I started digging deeper into the characteristics of the R7 processor to find any coherency or interconnect-related setup that we might have missed - but no luck either. It seemed everything, from the CPU point of view at least, was set up and working correctly as expected.
Then the revelation came while trying to find more info about the R7:
The Cortex-R7 processor features an upgraded 11-stage, superscalar, out-of-order pipeline with advanced dynamic and static branch prediction, dynamic register re-naming and non-blocking Load-Store Unit. [1]
It all fell into place in my head after that. In the tight loop that shuffles a lot of data, the CPU was doing a speculative access to an invalid memory region that I couldn't see in the debugger since, well, it is speculative and wrong and should never get committed in the pipeline. But it had the side effect of triggering a cache refill from a memory region that causes an external error on the bus. Tadaa. Now it all makes sense and ties together all the observations so far.
Of course, the issue had started happening after exposing a new region in the MPU (Memory Protection Unit) to be used by a new piece of hardware we were working on. Since the driver wasn't committed yet, we thought no one should be accessing this region, hence it should be safe to expose it, to prepare for the driver landing shortly after and to enable some development and debug work in the meantime. But no one had thought of speculative accesses at the time - especially since the R7 was an upgrade from the R4, which doesn't do speculative accesses - not in the sophisticated way the modern R7 can, at least.
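To illustrate the shape of the problem (a contrived reconstruction with made-up addresses, not our actual code): put the data being shuffled right below the newly exposed region, and an out-of-order core is free to issue the load for the next iteration - and the cache line fill behind it - before it knows the loop is done:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the staging buffer ends exactly where the newly
 * exposed (but not yet backed) MPU region begins. All addresses are made up. */
#define STAGING_BASE  0x20100000u
#define STAGING_SIZE  0x00010000u
#define NEW_HW_REGION (STAGING_BASE + STAGING_SIZE)  /* exposed, nothing behind it yet */

void shuffle(uint32_t *dst, size_t words)
{
    const uint32_t *src = (const uint32_t *)STAGING_BASE;

    for (size_t i = 0; i < words; i++) {
        /* Architecturally this never reads past the buffer. But the core may
         * speculatively issue the load for iteration i+1 - and the cache line
         * fill behind it - before the bound check resolves. If that fill lands
         * in NEW_HW_REGION, the bus answers with an error and we get an
         * external abort on an address we "never accessed". */
        dst[i] = src[i];
    }
}
```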
So in the end we reverted exposing this memory until the driver that sets everything up is committed, and we considered keeping this area uncached/disabled in general, if possible, while the hardware isn't active.
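On the Cortex-R family the backout is just a matter of turning the PMSAv7 MPU region off again. A sketch (the region number is hypothetical; the CP15 encodings are the architectural RGNR/DRSR ones):

```c
#include <stdint.h>

/* Disable a PMSAv7 MPU data region by clearing the enable bit in its
 * Region Size and Enable Register (DRSR). */
static inline void mpu_region_disable(uint32_t region)
{
    uint32_t drsr;

    __asm__ volatile("mcr p15, 0, %0, c6, c2, 0" :: "r"(region)); /* RGNR: select region */
    __asm__ volatile("mrc p15, 0, %0, c6, c1, 2" : "=r"(drsr));   /* read DRSR */
    drsr &= ~1u;                                                   /* clear the EN bit */
    __asm__ volatile("mcr p15, 0, %0, c6, c1, 2" :: "r"(drsr));   /* write DRSR back */
    __asm__ volatile("dsb");
    __asm__ volatile("isb");
}
```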
This happened, by the way, before Spectre and Meltdown hit the news. I was really proud of this finding and of the very interesting and satisfying debug session!
[1] https://developer.arm.com/products/processors/cortex-r/cortex-r7