**Updates December 18 2015 – See end of post**
For the last few weeks I’ve been chasing some frustrating issues with Bluetooth Low Energy on Android. The Mooshimeter’s over-the-air firmware update feature has been failing frequently for Android users, and it turns out the issue goes very deep. On 2 phones I’ve tested so far, Android has been intermittently pasting the payload from an incoming notification in to an outgoing write. For many protocols running on top of the BLE GATT layer, including the Mooshimeter firmware update protocol, this causes all kinds of malfunctions.
Phones tested so far:
- Nexus 5 running Android 6.0.1
- Samsung SGH-I537 (S4 Active) running Android 5.0.1
- Samsung SM-G318ML running Android 4.4.4
This is obviously quite a limited sample, I’ll try to test more phones soon.
Description of the bug:
When these conditions are both met:
- Android device is receiving multiple notifications on a GATT characteristic
- Android device is writing a value to the same GATT characteristic
It appears that the Android device will occasionally swap the data payload of the incoming notification with the outgoing write.
Users have been reporting woes with BLE on Android since the Mooshimeter first started shipping in February. In particular, uploading firmware updates to the meter has been an extremely hit-or-miss process. BLE on Android is a complete mess, but I won’t go in to that here (the guys at iDevicesInc put together a good partial summary here). Suffice it to say I’ve worked out most of the Android BLE bugs in the Mooshimeter app, but even as I improved the app and worked around bug after bug the firmware update process remained stubbornly unreliable.
When a firmware update fails, not much feedback is provided by the peripheral. In the case of the Mooshimeter, it runs a CRC over the downloaded firmware image, compares it to a pre-computed value sent by the Android device, and if the values match it assumes the newly downloaded firmware is pristine and jumps in. When the CRC check fails, the peripheral simply refuses to jump to the new firmware.
The Mooshimeter’s firmware download protocol is just TI’s with a few extensions. It’s extremely simple, and a firmware update goes like this:
- The Android device sends a firmware image identification, with information like the firmware’s CRC, total size and version number.
- If the peripheral thinks the identification block is acceptable, it sends a request for image block 0 as a GATT notification
- The host and peripheral then alternate sending blocks (as GATT writes) and block requests (as GATT notifications) until every block has been sent
- A block contains a block number followed by 16 bytes of firmware data
- A block request contains only a block number
- When every block has been sent, the peripheral runs a CRC of the complete downloaded image and compares it against the original CRC it received in the image identification.
The keen reader will notice that the block addresses in the exchange above are slightly out of sync – the peripheral is requesting blocks 0x966-0x969 (2406-2409), but the Android device is sending blocks 0x969-0x96C (2409-2412). This is not an error – because messages in BLE are only exchanged once every 30ms or so, it makes firmware upload much faster if the Android device just sends the next blocks that will be requested before the request actually arrives, so in practice the blocks being sent to the peripheral in a given exchange are always a few indices ahead of the blocks being requested.
What was so strange about the upload failures is that they were CRC failures of the final image, which indicates the firmware image in peripheral memory was corrupted. But examining the most obvious possible sources of corruption leads to head scratching.
- BLE is checksummed up the wazoo so the chance of over-the-air corruption is minimal
- The chance writing a corrupt payload to flash are astronomically low
- Every single block is requested and confirmed by the peripheral and the master so it’s impossible to skip or overwrite a block
- The firmware uploads flawlessly from iOS
Eventually I got down to examining the peripheral’s flash memory byte by byte. I took a Mooshimeter, uploaded a firmware image from iOS, then pulled the contents of the flash memory out of the debug interface. I did this twice and confirmed the flash images were identical (which is what we’d expect uploading the same firmware twice in a row).
Then I performed a firmware upload from Android, pulled the flash memory out of the debug interface and compared it to the known good images.
The differences were not the occasional single bit difference (indicating some kind of message corruption), or the occasional block reading 0xFFFF (indicating failure to write), or some sort of off-by-one error… it was just that firmware blocks were occasionally completely wrong.
I cataloged the blocks that were incorrect and looked for patterns. In the first image I looked at, I found 14 incorrect blocks. 5 of them has strong similarities to each other. In hex, they were:
The other 9 bore no resemblance to each other, but by this point I was beginning to suspect that blocks were being swapped on some layer, so I searched for repeats. Sure enough, every one of those 9 erroneous blocks was a copy of a correct block, 7-8 spaces away. Now I was starting to put it together, but it was still totally mystifying how these blocks could be ending up in the wrong locations.
In many scenarios with a hairy embedded bug, you can set a breakpoint and step through your code instruction by instruction to see exactly what’s going wrong. But the Mooshimeter uses a CC2540, which is an all-in-one system on a chip from TI with only one core, so setting a breakpoint for debugging breaks the bluetooth connection and means you have to reconnect and restart the firmware upload. So I broke out the packet sniffer and started recording the over-the-air traffic, then cross-referenced the captured data against the firmware images.
Let me walk through an example: First I found a difference in the final flash image of the device and determined what block of the application image it corresponded to (The application image is only a subset of the complete flash image, with the rest of the complete flash image being occupied by the bootloader code).
Then I combed the captured traffic to see what happened at that point in the transaction.
It looks like the Android device sent an ATT_Write_Command with the completely wrong data payload and skipped block 0x01E5! Because the Mooshimeter needs the blocks to arrive sequentially, it’s going to repeatedly request block 0x1E5 in response until it receives it. And because the Android device is sends blocks out before they are requested to accelerate the transfer, it’s going to keep sending firmware blocks for a few exchanges until it notices that block 0x01E5 is being repeatedly requested. But that’s where things go really wrong. If we go down a few exchanges in the packet capture…
The payload swap strikes again, but now the write from the Android device to the Mooshimeter contains only the index of the firmware block the meter is expecting next. This is where a bug in the Mooshimeter firmware bites us. This bug is present in the example code from TI, so if you work for TI please fix this so others aren’t bitten.
I don’t want to pollute this blog post with huge chunks of C code, but I think I can get the point across in 2 lines. When the Mooshimeter receives an attribute write, it filters through a few layers of callbacks in TI’s closed source libraries until it arrives at our own code. Its first stop in our code is the function:
static bStatus_t oadWriteAttrCB(uint16 connHandle, gattAttribute_t *pAttr, uint8 *pValue, uint8 len, uint16 offset);
This function takes a connection handle (basically ignored), a structure describing the attribute being written to (pAttr), but most importantly, the data payload (pValue) and length of the payload (len). The function examines the attribute and sees that the write command is for a firmware image block, so then it calls:
static bStatus_t oadImgBlockWrite( uint16 connHandle, uint8 *pValue );
What’s the problem with this function? It doesn’t take a length parameter! It’s written with the assumption that you will only ever call it with pValue pointing at a firmware block payload, which has 2 bytes of address and 16 bytes of actual firmware.
So when Android swaps the payload of the block request notification with the payload of the block write, it filters down to oadImgBlockWrite, which ignores the fact that the payload was only 2 bytes long and reads 18 bytes of pValue. TI’s BLE stack internally is recycling the same buffers over and over again, so the final 16 bytes being fed to oadImgBlockWrite is just whatever happened to last be in that recycled buffer. If the last thing to be in that buffer was another image block write, what is delivered to oadImgBlockWrite is a real firmware block with the wrong address attached. This is the cause of the copied blocks in the final flash image.
Fixes and Workarounds:
Regarding the payload swaps I’ve seen coming out of the Android device, I don’t know what layer the problem resides in. I’m 95% sure based on the structure of the Android BLE API that there’s a race condition when a notification arrives while a packet is queued for sending. I’ve only tested a few phones so far. But I think for me, the hunt ends here, because I’ve got too many other features pending for the Mooshimeter and trying to fix Android itself is beyond my ken for now.
Regarding the Mooshimeter Android app, I am going to slow down the firmware update and add an alert dialog informing the user that the update can be performed faster from an iOS device. On future production runs of the Mooshimeter I’ll make changes to the bootloader so it’s not vulnerable to the Android payload swap, but for all the units already in the field the bootloader cannot be updated. It’s really a shame… from iOS an update takes only a minute, from Android it takes anywhere from 3-10 depending on other Android BLE bugs/features that vary from device to device and version to version.
Thanks for reading, I hope this helps someone banging their heads on similar issues.
Update: December 18 2015
I’ve narrowed down the race condition and found a way to reduce the frequency of the payload swaps. In the course of debugging a different bug I peppered the code with print statements to trace the execution. One of the print statements ended up here:
“c” is a BluetoothGattCharacteristic, “value” is a byte of payload data. I believe the payload swap is occurring when a GATT notification arrives between the setValue and the writeCharacteristic, because the way the Android BLE API is designed an arriving notification simply overwrites whatever value was previously set with setValue. So when the Log statement is in between the setValue and the call to writeCharacteristic, it leaves a larger window of opportunity for the race condition to occur.
Moving the Log.d statement before the c.setValue statement reduces the error rate very substantially. This makes perfect sense, but it seems to expose a huge error in the design of the library itself. There needs to be some way to set the characteristic’s value and send the GATT write without chance if interruption. Java provides the synchronized keyword to prevent this kind of race condition, but it requires all threads to play fair and synchronize on the same objects, and based on some quick experiments it seems the BLE thread in Android does not synchronize on the GATT characteristics. If you’re a Java guru and have a recommendation for how to avoid this race condition, please leave a comment. Thank you.