
Deobfuscating HPROF memory dumps


First posted on the Badoo tech blog

According to Crittercism [1], the second most common crash reported in Android apps is java.lang.OutOfMemoryError, so it stands to reason that analyzing these crashes should be one of the top priorities for any Android developer. If you are analyzing memory dumps from a debug build, or if you are not using obfuscation, this process is fairly straightforward. However, if your heap dump comes from an app built with obfuscation (ProGuard or DexGuard) you are in for quite a challenge (or at least you were, until now).

In the image below you can see a typical obfuscated instance dump in Eclipse Memory Analyzer (MAT), where most of the field names have been replaced with indecipherable one-character names.

Figure 1: Before deobfuscation

Can we do anything about this? If you have the mapping file you could look up each symbol to figure out the name of the field and its value, but that would be an extremely time-consuming process. This article outlines a much more efficient, automated process for deobfuscating an HPROF heap dump. The end result is shown in the image below. Compared to the first image, it is much clearer which fields and values we are analyzing.

Figure 2: After deobfuscation

HPROF File Format

An HPROF file contains a Java heap dump taken at a given time. It is a VM-independent format (dumps can be taken from most JVMs) which means that the content of the file is not a byte-by-byte copy of the actual Java heap. The content includes (but is not limited to):
  • List of all classes loaded by the class loader
  • All strings
  • Class definitions (including constant values, static field values and instance field declarations, but no information about methods)
  • Instance dumps (containing values of all instance fields associated with the object)
  • Heap roots, sticky objects, stack frames and stack traces
As mentioned, an HPROF file does not contain an exact copy of the heap. One interesting piece of information that is omitted is the physical location of heap objects in memory. This means that we cannot accurately calculate how fragmented the heap is, a condition that on Android can lead to OutOfMemoryErrors even when there is memory available. The most likely reason for the omission is that Sun’s JVM has supported compacting garbage collection [2] since a very early version, while Android is only planning to include this support in the upcoming Android L release.

HPROF files from Android (Dalvik) also contain several non-standard records. These records must either be converted to standard records or discarded before the file is read by any standard HPROF memory analyzer. These extensions are not documented and to be fully understood would require some digging into the Dalvik source code (comments are welcome on this topic!).

ProGuard/DexGuard Obfuscation

ProGuard and DexGuard can perform several types of obfuscation and optimization on your app, but two in particular affect memory dumps:
  • Renaming of classes and fields
  • Reuse of strings for field names
The first type of obfuscation is fairly straightforward. Class names and field names are simply replaced with a (shorter) unreadable string. The second type, though, requires a bit of background on how strings are handled in HPROF files in order to be explained clearly.

In an HPROF class definition you’ll not find the actual strings of the class or field names. Instead, the definition contains string identifiers (usually 4-byte IDs that each uniquely identify a string). If two names have the same value, they also have the same string ID.

When fields are obfuscated, the obfuscated names are reused across classes. This means that two classes (A and B) which before obfuscation had fields with different names (say A.x and B.y) can end up with fields of the same name (A.q and B.q). As mentioned previously, the fields in both classes will then share the same string ID for their names. As can be seen in the next part, this complicates things when trying to deobfuscate the file.

Deobfuscating an HPROF File

The deobfuscation performed by deobfuscator can be broken down into four steps:
  1. Read mapping file (generated by ProGuard or DexGuard during the build).
  2. Read HPROF file to find all strings used as class and field names.
  3. Use mapping to look up the deobfuscated names for classes and fields.
  4. Write an updated HPROF file.
The first step is done using ProGuard’s proguard-base library which reads and processes the mapping file.
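As a rough illustration of the first step, the sketch below parses the text format of a ProGuard mapping file by hand rather than through proguard-base. The class MappingSketch, its two maps and the "#" key separator are invented for this example and are not part of the deobfuscator. The sketch skips method lines and assumes that no two fields in one class share an obfuscated name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, minimal reader for ProGuard mapping lines (fields and classes only). */
class MappingSketch {
    /** Obfuscated class name -> original class name. */
    final Map<String, String> classes = new HashMap<>();
    /** "originalClass#obfuscatedField" -> original field name. */
    final Map<String, String> fields = new HashMap<>();

    static MappingSketch parse(List<String> lines) {
        MappingSketch mapping = new MappingSketch();
        String currentOriginalClass = null;
        for (String line : lines) {
            int arrow = line.indexOf(" -> ");
            if (arrow < 0 || line.trim().startsWith("#")) {
                continue; // blank line, comment, or something this sketch does not understand
            }
            String left = line.substring(0, arrow).trim();
            String right = line.substring(arrow + 4).trim();
            if (!Character.isWhitespace(line.charAt(0))) {
                // Class line: "original.Name -> obfuscated.name:"
                String obfuscated = right.endsWith(":")
                        ? right.substring(0, right.length() - 1) : right;
                currentOriginalClass = left;
                mapping.classes.put(obfuscated, left);
            } else if (currentOriginalClass != null && !left.contains("(")) {
                // Field line: "    type originalName -> obfuscatedName"
                String originalField = left.substring(left.lastIndexOf(' ') + 1);
                mapping.fields.put(currentOriginalClass + "#" + right, originalField);
            }
            // Indented lines containing "(" are methods; HPROF files carry no method info.
        }
        return mapping;
    }

    /** Resolves the class first, then the field, falling back to the obfuscated name. */
    String originalField(String obfuscatedClass, String obfuscatedField) {
        String originalClass = classes.getOrDefault(obfuscatedClass, obfuscatedClass);
        return fields.getOrDefault(originalClass + "#" + obfuscatedField, obfuscatedField);
    }
}
```

Field entries are keyed by the original class name, so the class has to be resolved before any of its fields can be looked up.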
In the second step we are using the hprof-lib library (part of the source) to read the input HPROF file. Of all the data contained in the file we are only concerned with these records:
  • STRING: contains the ID and string value of one string
  • LOAD_CLASS: records that a class has been loaded by the VM
  • CLASS_DUMP: contains the definition of a class, including the name of the class, lists of constants, named static fields and named instance fields
When reading the field declarations of the class definitions an additional step is completed: deduplication of strings. As mentioned in the previous section about ProGuard/DexGuard obfuscation, fields that previously had unique names are made to share the same name (and therefore the same string ID) after obfuscation. In order to deobfuscate each field correctly we need to create a separate copy of the string for each field and then deobfuscate each copy separately. The table below attempts to explain this.

Deduplication and deobfuscation

The output from the second step is a list of all strings and class definitions for all loaded classes, with any field affected by the string deduplication updated.
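The string copying described above could look roughly like the following sketch. It is not the code from hprof-lib: ClassDef is a made-up, heavily simplified stand-in for a CLASS_DUMP record, string IDs are plain ints, and new IDs are simply allocated above the highest existing one, all of which are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class StringSplitSketch {
    /** Hypothetical stand-in for a CLASS_DUMP record: a class and the string IDs of its field names. */
    static class ClassDef {
        final String name;
        final List<Integer> fieldNameIds = new ArrayList<>();

        ClassDef(String name, int... ids) {
            this.name = name;
            for (int id : ids) {
                fieldNameIds.add(id);
            }
        }
    }

    /** Gives every field its own string ID by copying name strings that are shared. */
    static void splitSharedFieldNames(Map<Integer, String> strings, List<ClassDef> classes) {
        Set<Integer> claimed = new HashSet<>();
        int nextId = strings.isEmpty() ? 1 : Collections.max(strings.keySet()) + 1;
        for (ClassDef classDef : classes) {
            for (int i = 0; i < classDef.fieldNameIds.size(); i++) {
                int id = classDef.fieldNameIds.get(i);
                if (!claimed.add(id)) {
                    // Another field already owns this string: copy it under a fresh ID
                    // and point this field at the copy.
                    strings.put(nextId, strings.get(id));
                    classDef.fieldNameIds.set(i, nextId);
                    claimed.add(nextId);
                    nextId++;
                }
            }
        }
    }
}
```

After this pass, renaming the string behind A.q no longer affects B.q, because each field refers to its own entry in the string table.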
In the third step we first process all class names to see if they have a corresponding entry in the mapping read in the first step. If they have, the entry in the list of strings is updated to reflect the new name.

After this we proceed to process the fields of each class (the class names must be done first since the field mapping is based on the original class names). Using the same lookup as for the class names we then update the field name string entries.

In the fourth and last step we write the HPROF output file. This is done by reading the input HPROF file record by record, and either copying the record (if it is unchanged) or replacing it (for STRING and CLASS_DUMP records that need to be updated).
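A minimal sketch of such a copy-or-replace pass is shown below. It relies on the general layout of the HPROF binary format (a NUL-terminated version string, a 4-byte ID size and an 8-byte timestamp, followed by records that each consist of a 1-byte tag, a 4-byte time offset, a 4-byte body length and the body). The class HprofCopySketch and the renamed map are invented for the example. Only STRING records are rewritten here; updating CLASS_DUMP records, emitting the newly created string copies and handling the Dalvik-specific records are all left out.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;

class HprofCopySketch {
    static final int TAG_STRING = 0x01;

    /** Copies an HPROF stream, replacing the text of STRING records whose ID is in renamed. */
    static void copy(InputStream rawIn, OutputStream rawOut, Map<Long, String> renamed)
            throws IOException {
        DataInputStream in = new DataInputStream(rawIn);
        DataOutputStream out = new DataOutputStream(rawOut);

        // File header: NUL-terminated version text, u4 ID size, u8 timestamp.
        int b;
        do {
            b = in.readUnsignedByte();
            out.writeByte(b);
        } while (b != 0);
        int idSize = in.readInt();
        out.writeInt(idSize);
        out.writeLong(in.readLong());

        int tag;
        while ((tag = in.read()) != -1) {
            int time = in.readInt();
            long length = in.readInt() & 0xFFFFFFFFL; // the length field is unsigned
            out.writeByte(tag);
            out.writeInt(time);
            if (tag == TAG_STRING) {
                // Body: string ID followed by the characters (not NUL-terminated).
                byte[] body = new byte[(int) length];
                in.readFully(body);
                String replacement = renamed.get(readId(body, idSize));
                byte[] text = replacement == null
                        ? Arrays.copyOfRange(body, idSize, body.length)
                        : replacement.getBytes(StandardCharsets.UTF_8);
                out.writeInt(idSize + text.length);
                out.write(body, 0, idSize);
                out.write(text);
            } else {
                // Every other record is copied through unchanged, in chunks.
                out.writeInt((int) length);
                copyBytes(in, out, length);
            }
        }
        out.flush();
    }

    static long readId(byte[] body, int idSize) {
        long id = 0;
        for (int i = 0; i < idSize; i++) {
            id = (id << 8) | (body[i] & 0xFF); // IDs are stored big-endian
        }
        return id;
    }

    static void copyBytes(InputStream in, OutputStream out, long count) throws IOException {
        byte[] buffer = new byte[8192];
        while (count > 0) {
            int n = in.read(buffer, 0, (int) Math.min(buffer.length, count));
            if (n < 0) {
                throw new EOFException("truncated record");
            }
            out.write(buffer, 0, n);
            count -= n;
        }
    }
}
```

Streaming the untouched records in chunks keeps memory use low, which matters because the heap dump records can be far larger than the string table.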

Due to the increased number of strings (and increased length of them) the output file is slightly larger than the input file.

Using the deobfuscator application

Source code and builds for the deobfuscator application are available here: https://github.com/badoo/hprof-deobfuscator
First, make sure that you have downloaded the most recent release of deobfuscator from our GitHub page, then run the following from the command line:

java -jar deobfuscator-all-x.y.jar {mapping file} {obfuscated hprof file} {output hprof file}

References

  1. Crittercism presentation at Droidcon Berlin 2012 (http://www.slideshare.net/crittercism/crittercism-droidcon-berlin-2012)
  2. Mark-compact algorithm, Wikipedia (http://en.wikipedia.org/wiki/Mark-compact_algorithm)
