Android Swapping BLE Packet Payloads
Hi Reader,
**Updates December 18 2015 – See end of post**
For the last few weeks I’ve been chasing some frustrating issues with Bluetooth Low Energy on Android. The Mooshimeter’s over-the-air firmware update feature has been failing frequently for Android users, and it turns out the issue goes very deep. On 2 phones I’ve tested so far, Android has been intermittently pasting the payload from an incoming notification in to an outgoing write. For many protocols running on top of the BLE GATT layer, including the Mooshimeter firmware update protocol, this causes all kinds of malfunctions.
Phones tested so far:
- Nexus 5 running Android 6.0.1
- Samsung SGH-I537 (S4 Active) running Android 5.0.1
- Samsung SM-G318ML running Android 4.4.4
This is obviously quite a limited sample, I’ll try to test more phones soon.
Description of the bug:
When these conditions are both met:
- Android device is receiving multiple notifications on a GATT characteristic
- Android device is writing a value to the same GATT characteristic
It appears that the Android device will occasionally swap the data payload of the incoming notification with the outgoing write.
Background:
Users have been reporting woes with BLE on Android since the Mooshimeter first started shipping in February. In particular, uploading firmware updates to the meter has been an extremely hit-or-miss process. BLE on Android is a complete mess, but I won’t go in to that here (the guys at iDevicesInc put together a good partial summary here). Suffice it to say I’ve worked out most of the Android BLE bugs in the Mooshimeter app, but even as I improved the app and worked around bug after bug the firmware update process remained stubbornly unreliable.
When a firmware update fails, not much feedback is provided by the peripheral. In the case of the Mooshimeter, it runs a CRC over the downloaded firmware image, compares it to a pre-computed value sent by the Android device, and if the values match it assumes the newly downloaded firmware is pristine and jumps in. When the CRC check fails, the peripheral simply refuses to jump to the new firmware.
The Mooshimeter’s firmware download protocol is just TI’s with a few extensions. It’s extremely simple, and a firmware update goes like this:
- The Android device sends a firmware image identification, with information like the firmware’s CRC, total size and version number.
- If the peripheral thinks the identification block is acceptable, it sends a request for image block 0 as a GATT notification
- The host and peripheral then alternate sending blocks (as GATT writes) and block requests (as GATT notifications) until every block has been sent
- A block contains a block number followed by 16 bytes of firmware data
- A block request contains only a block number
- When every block has been sent, the peripheral runs a CRC of the complete downloaded image and compares it against the original CRC it received in the image identification.
The keen reader will notice that the block addresses in the exchange above are slightly out of sync – the peripheral is requesting blocks 0x966-0x969 (2406-2409), but the Android device is sending blocks 0x969-0x96C (2409-2412). This is not an error – because messages in BLE are only exchanged once every 30ms or so, it makes firmware upload much faster if the Android device just sends the next blocks that will be requested before the request actually arrives, so in practice the blocks being sent to the peripheral in a given exchange are always a few indices ahead of the blocks being requested.
What was so strange about the upload failures is that they were CRC failures of the final image, which indicates the firmware image in peripheral memory was corrupted. But examining the most obvious possible sources of corruption leads to head scratching.
- BLE is checksummed up the wazoo so the chance of over-the-air corruption is minimal
- The chance writing a corrupt payload to flash are astronomically low
- Every single block is requested and confirmed by the peripheral and the master so it’s impossible to skip or overwrite a block
- The firmware uploads flawlessly from iOS
Eventually I got down to examining the peripheral’s flash memory byte by byte. I took a Mooshimeter, uploaded a firmware image from iOS, then pulled the contents of the flash memory out of the debug interface. I did this twice and confirmed the flash images were identical (which is what we’d expect uploading the same firmware twice in a row).
Then I performed a firmware upload from Android, pulled the flash memory out of the debug interface and compared it to the known good images.
The differences were not the occasional single bit difference (indicating some kind of message corruption), or the occasional block reading 0xFFFF (indicating failure to write), or some sort of off-by-one error… it was just that firmware blocks were occasionally completely wrong.
I cataloged the blocks that were incorrect and looked for patterns. In the first image I looked at, I found 14 incorrect blocks. 5 of them has strong similarities to each other. In hex, they were:
11020A00E80BA000B80D06025EA007A9
10020A00FA0BA000240E06025EA007A9
12020A00E70BA000B80D06025EA007A9
10020A001A0CA0005C0E06025EA007A9
11020A00DA0BA000B80D06025EA007A9
The other 9 bore no resemblance to each other, but by this point I was beginning to suspect that blocks were being swapped on some layer, so I searched for repeats. Sure enough, every one of those 9 erroneous blocks was a copy of a correct block, 7-8 spaces away. Now I was starting to put it together, but it was still totally mystifying how these blocks could be ending up in the wrong locations.
In many scenarios with a hairy embedded bug, you can set a breakpoint and step through your code instruction by instruction to see exactly what’s going wrong. But the Mooshimeter uses a CC2540, which is an all-in-one system on a chip from TI with only one core, so setting a breakpoint for debugging breaks the bluetooth connection and means you have to reconnect and restart the firmware upload. So I broke out the packet sniffer and started recording the over-the-air traffic, then cross-referenced the captured data against the firmware images.
Let me walk through an example: First I found a difference in the final flash image of the device and determined what block of the application image it corresponded to (The application image is only a subset of the complete flash image, with the rest of the complete flash image being occupied by the bootloader code).
Then I combed the captured traffic to see what happened at that point in the transaction.
It looks like the Android device sent an ATT_Write_Command with the completely wrong data payload and skipped block 0x01E5! Because the Mooshimeter needs the blocks to arrive sequentially, it’s going to repeatedly request block 0x1E5 in response until it receives it. And because the Android device is sends blocks out before they are requested to accelerate the transfer, it’s going to keep sending firmware blocks for a few exchanges until it notices that block 0x01E5 is being repeatedly requested. But that’s where things go really wrong. If we go down a few exchanges in the packet capture…
The payload swap strikes again, but now the write from the Android device to the Mooshimeter contains only the index of the firmware block the meter is expecting next. This is where a bug in the Mooshimeter firmware bites us. This bug is present in the example code from TI, so if you work for TI please fix this so others aren’t bitten.
I don’t want to pollute this blog post with huge chunks of C code, but I think I can get the point across in 2 lines. When the Mooshimeter receives an attribute write, it filters through a few layers of callbacks in TI’s closed source libraries until it arrives at our own code. Its first stop in our code is the function:
static bStatus_t oadWriteAttrCB(uint16 connHandle, gattAttribute_t *pAttr, uint8 *pValue, uint8 len, uint16 offset);
This function takes a connection handle (basically ignored), a structure describing the attribute being written to (pAttr), but most importantly, the data payload (pValue) and length of the payload (len). The function examines the attribute and sees that the write command is for a firmware image block, so then it calls:
static bStatus_t oadImgBlockWrite( uint16 connHandle, uint8 *pValue );
What’s the problem with this function? It doesn’t take a length parameter! It’s written with the assumption that you will only ever call it with pValue pointing at a firmware block payload, which has 2 bytes of address and 16 bytes of actual firmware.
So when Android swaps the payload of the block request notification with the payload of the block write, it filters down to oadImgBlockWrite, which ignores the fact that the payload was only 2 bytes long and reads 18 bytes of pValue. TI’s BLE stack internally is recycling the same buffers over and over again, so the final 16 bytes being fed to oadImgBlockWrite is just whatever happened to last be in that recycled buffer. If the last thing to be in that buffer was another image block write, what is delivered to oadImgBlockWrite is a real firmware block with the wrong address attached. This is the cause of the copied blocks in the final flash image.
Fixes and Workarounds:
Regarding the payload swaps I’ve seen coming out of the Android device, I don’t know what layer the problem resides in. I’m 95% sure based on the structure of the Android BLE API that there’s a race condition when a notification arrives while a packet is queued for sending. I’ve only tested a few phones so far. But I think for me, the hunt ends here, because I’ve got too many other features pending for the Mooshimeter and trying to fix Android itself is beyond my ken for now.
Regarding the Mooshimeter Android app, I am going to slow down the firmware update and add an alert dialog informing the user that the update can be performed faster from an iOS device. On future production runs of the Mooshimeter I’ll make changes to the bootloader so it’s not vulnerable to the Android payload swap, but for all the units already in the field the bootloader cannot be updated. It’s really a shame… from iOS an update takes only a minute, from Android it takes anywhere from 3-10 depending on other Android BLE bugs/features that vary from device to device and version to version.
Thanks for reading, I hope this helps someone banging their heads on similar issues.
~James
Update: December 18 2015
I’ve narrowed down the race condition and found a way to reduce the frequency of the payload swaps. In the course of debugging a different bug I peppered the code with print statements to trace the execution. One of the print statements ended up here:
c.setValue(value);
Log.d(TAG, "WRITE");
mBluetoothGatt.writeCharacteristic(c);
“c” is a BluetoothGattCharacteristic, “value” is a byte[] of payload data. I believe the payload swap is occurring when a GATT notification arrives between the setValue and the writeCharacteristic, because the way the Android BLE API is designed an arriving notification simply overwrites whatever value was previously set with setValue. So when the Log statement is in between the setValue and the call to writeCharacteristic, it leaves a larger window of opportunity for the race condition to occur.
Moving the Log.d statement before the c.setValue statement reduces the error rate very substantially. This makes perfect sense, but it seems to expose a huge error in the design of the library itself. There needs to be some way to set the characteristic’s value and send the GATT write without chance if interruption. Java provides the synchronized keyword to prevent this kind of race condition, but it requires all threads to play fair and synchronize on the same objects, and based on some quick experiments it seems the BLE thread in Android does not synchronize on the GATT characteristics. If you’re a Java guru and have a recommendation for how to avoid this race condition, please leave a comment. Thank you.
Hi James,
very good explanation of the issue. For me the longer time for firmware uploading from an android device is no problem. But today I found out another strange behaviour. I installed the newest Mooshimeter app from play store, and started firmware update in legacy mode. After 5 tries without success (after firmware update mooshimeter seems to remain in bootloader mode) I remembered the bug with bluetooth cache (workaround for bluetooth cache reported by Wayne within the other thread). I stopped the mooshimeter app task (still showing mooshimeter in bootloader mode) and disabled bluetooth. After that I restarted the mooshimeter app. I allowed the bluetooth enable attempt, and scanned for devices. Now my mooshimeter was showed in normal mode (without doing any changes at the mooshimeter in the meantime)! If anyone has problems with firmware updating, he/she should give this workaround a try. Maybe helpful.
Best regards
Thomas
Hi, I have a new Samsung Galaxy S6. I’ve just installed the Android APP and connected to the meter. It says I have Bootloader running, build 1447458797 , and there is a new image Ver 1 Build 1447458797 (size 147456). I’ve tried a few times now to “Start programming” and everytime it seems to pause/freeze in the last few bytes, but then reports programming successful . However, when I go to use the meter, I’m back at the connect page and I see the same Bootloader build version screen . I’ve tried it in legacy mode too, seemed to work but then still the same Bootloader build screen. I’ve just pressed the meter’s reset, turned off Bluetooth, and restarted my phone: same problem.
Luckily my wife has an IOS device. Any ideas to try out here?
thanks,
Hadyn
seems to be fixed by new batteries.
I’m glad you found the issue!
>>On future production runs of the Mooshimeter I’ll make changes to the bootloader so it’s not vulnerable to the Android payload swap, but for all the units already in the field the bootloader cannot be updated.<<
How do you change / submit the bootloader code to the mooshimeter? I think you use the contacts called "GND", "DT", "RST", "CK" and "+3V". What kind of connection is used at that contacts? Doesn't exist a possibility to change the bootloader by placing a correlating file on the SD card?
I'm just curious for that. As I told before, I don't really have the necessity for that ('cause firmware update time doesn't really matter).
Hi Katho80,
Yes, you can reprogram CC2540 at the heart of the Mooshimeter through the programming pads (GND, DT, RST, CK, +3V), with a CCDebugger and the appropriate software, but the bootloader flash pages are locked so you must erase the whole chip before reprogramming. This means losing the calibration data. A clever user could back up the calibration data, erase the whole chip, rewrite the bootloader, rewrite the calibration data, then rewrite the application firmware… but this is an onerous process and very few users would be able to do it.
Regarding SD card firmware update: The bootloader actually doesn’t have the capability to read SD cards… the bootloader and application compete for space, and to leave more room for the application and new features I tried to make the bootloader as small as possible.
I hope this answers your questions! Let me know if you have others. Best
~James
Amazing work James, well done digging through this mess.
If I wasn’t before, I am now scared to death of BLE. These chips look so small and harmless, but they clearly take years to master.
Hi James,
I don’t envy your troubles with Android :), but kudos for pushing on.
My Mooshiemeter gave me some trouble a while back and was stuck in an update loop. I did not use it for a while and it seems the batteries drained (for no apparent reason, I was sure it was in hibernation) and I replaced them tonight – this however solved the update loop!
I am now running the newst firmware.
I assume this was the same issue Hayden had.
Not sure why low batteries (40%) would cause this problem, except if the SoC suffered a brownout due to traffic spike during firmware upload?
Thanks Chris.
Glad you got past the update loop! I believe the low-battery update failure is a brown-out. The bootloader code doesn’t check the battery level before trying to write to internal flash. I’ve written a new version but it won’t affect meters already in the wild, since there’s no provision for updating the bootloader code itself.
Best
~James
I’ve seen this too. I have an Nexus 6P sending WriteNoResponse packets to a GATT Characteristic and also receiving Notifications on same characteristic from the peripheral. I observed with a sniffer that the notification (a UINT32 value) value read at Master was formed from 4 bytes from same packet location of the write data. On the sniffer the notification packet contained correct data but at the UINT32 extraction in OnCharacteristicChanged() it had somehow got the data bytes from the write packet. You gave a good clue for a workaround above, I can notify on a different GATT Characteristic and see if that alleviates my problem.
Thanks Patrick. Yeah, I separated out the characteristics where I could and just sidestepped the issue. Unfortunately the over-the-air bootloader that’s on all the devices in the field can’t be overwritten and uses a protocol that is susceptible to this issue. Such is life.