Deobfuscating HPROF memory dumps

First posted on the Badoo tech blog

According to Crittercism [1], the second most common crash reported in Android apps is java.lang.OutOfMemoryError, so it stands to reason that analyzing these crashes should be a top priority for any Android developer. If you are analyzing memory dumps from a debug build, or if you are not using obfuscation, this process is fairly straightforward. However, if your heap dump comes from an app built with obfuscation (ProGuard or DexGuard), you are in for quite a challenge (or at least you were, until now).

In the image below you can see a typical obfuscated instance dump in Eclipse Memory Analyzer (MAT), where most of the field names have been replaced with indecipherable one-character names.

Figure 1: Before deobfuscation

Can we do anything about this? If you have the mapping files you could look up each symbol by hand to find the name of the field and its value, but that would be extremely time-consuming. This article outlines a much more efficient, automated process for deobfuscating an HPROF heap dump. The end result is shown in the image below. Compared to the first image, it is much clearer which fields and values we are analyzing.

Figure 2: After deobfuscation

HPROF File Format

An HPROF file contains a Java heap dump taken at a given time. It is a VM-independent format (dumps can be taken from most JVMs) which means that the content of the file is not a byte-by-byte copy of the actual Java heap. The content includes (but is not limited to):

List of all classes loaded by the class loader
All strings
Class definitions (including constant values, static field values and instance field declarations, but no information about methods)
Instance dumps (containing values of all instance fields associated with the object)
Heap roots, sticky objects, stack frames and stack traces
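
To make the layout a little more concrete, here is a minimal Python sketch of how the top level of an HPROF file can be walked. It assumes the standard layout: a null-terminated version string, a 4-byte identifier size, an 8-byte timestamp, and then a sequence of records that each start with a 1-byte tag, a 4-byte time offset and a 4-byte body length. The function names and the tiny in-memory sample are made up for illustration; this is not the code used by hprof-lib.

```python
import io
import struct


def read_header(stream):
    """Read the HPROF header: version string, identifier size, timestamp."""
    version = bytearray()
    while (byte := stream.read(1)) not in (b"\x00", b""):
        version += byte
    id_size, millis = struct.unpack(">IQ", stream.read(12))
    return version.decode("ascii"), id_size, millis


def iter_records(stream):
    """Yield (tag, time_offset, body) for each top-level record."""
    while len(head := stream.read(9)) == 9:
        tag, time_offset, length = struct.unpack(">BII", head)
        yield tag, time_offset, stream.read(length)


# Hypothetical sample data: a header followed by one STRING record (tag 0x01)
# whose body is a 4-byte string ID (7) and the single character "q".
STRING = 0x01
demo = (b"JAVA PROFILE 1.0.3\x00" + struct.pack(">IQ", 4, 0)
        + struct.pack(">BII", STRING, 0, 5) + struct.pack(">I", 7) + b"q")

stream = io.BytesIO(demo)
header = read_header(stream)
records = list(iter_records(stream))
```

Everything after the header is just tag, offset, length and body repeated, which is why a tool can skip over record types it does not care about.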

As mentioned, HPROF files do not contain an exact copy of the heap. One interesting piece of information that is omitted is the physical location of heap objects in memory. This means that we cannot accurately calculate how fragmented the heap is, a condition that on Android can lead to OutOfMemoryErrors even when there is memory available. The reason for the omission is most likely that Sun’s JVM has supported compacting garbage collection [2] since a very early version, while Android is only planning to add this support in the upcoming Android L release.

HPROF files from Android (Dalvik) also contain several non-standard records. These records must either be converted to standard records or discarded before the file is read by any standard HPROF memory analyzer. These extensions are not documented, and fully understanding them would require some digging into the Dalvik source code (comments are welcome on this topic!).

ProGuard/DexGuard Obfuscation

ProGuard and DexGuard can perform several types of obfuscation and optimization on your app, but two in particular affect memory dumps:

Renaming of classes and fields
Reuse of strings for field names

The first type of obfuscation is fairly straightforward. Class names and field names are simply replaced with a (shorter) unreadable string. The second type, though, requires a bit of background on how strings are handled in HPROF files in order to be explained clearly.

In an HPROF class definition you will not find the actual strings of the class or field names. Instead, the definition contains string identifiers (usually 4-byte IDs, each of which uniquely identifies a string). If two names have the same value, they also have the same string ID.
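
As a rough illustration of this indirection, the Python sketch below decodes the body of a STRING record and of a LOAD_CLASS record and then resolves the class name through its string ID. It assumes 4-byte identifiers and the standard record layouts (STRING: ID followed by UTF-8 bytes; LOAD_CLASS: serial number, class object ID, stack trace serial number, class name string ID). The function names and sample values are hypothetical.

```python
import struct


def parse_string(body, id_size=4):
    """STRING body: an ID followed by the UTF-8 bytes of the string."""
    string_id = int.from_bytes(body[:id_size], "big")
    return string_id, body[id_size:].decode("utf-8", errors="replace")


def parse_load_class(body, id_size=4):
    """LOAD_CLASS body: serial, class object ID, stack trace serial, name string ID."""
    class_id = int.from_bytes(body[4:4 + id_size], "big")
    name_id = int.from_bytes(body[8 + id_size:8 + 2 * id_size], "big")
    return class_id, name_id


# Hypothetical sample: string 7 holds an obfuscated class name, and the class
# with object ID 0x1000 refers to that string by ID only.
strings = dict([parse_string(struct.pack(">I", 7) + b"com.example.a")])
class_id, name_id = parse_load_class(struct.pack(">IIII", 1, 0x1000, 0, 7))
class_name = strings[name_id]
```

The class record never carries the text "com.example.a" itself, only the ID 7, so changing the string stored under that ID changes the name everywhere it is referenced.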

When fields are obfuscated, the obfuscated names are reused across classes. This means that two classes (A and B) which before obfuscation had fields with different names (say A.x and B.y) can end up with a field of the same name (A.q and B.q). As mentioned previously, this means that the fields in both classes will have the same string ID for their names.

As can be seen in the next part, this will complicate things when trying to deobfuscate the file.

Deobfuscating an HPROF File

The deobfuscation performed by the deobfuscator can be broken down into four steps:

Read mapping file (generated by ProGuard or DexGuard during the build).
Read HPROF file to find all strings used as class and field names.
Use mapping to look up the deobfuscated names for classes and fields.
Write an updated HPROF file.

The first step is done using ProGuard’s proguard-base library which reads and processes the mapping file.
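
The deobfuscator relies on proguard-base for this, but the mapping format itself is simple enough to illustrate with a hand-rolled Python sketch. A class line has the form "original -> obfuscated:" and is followed by indented member lines; lines containing parentheses describe methods and are skipped here because HPROF class definitions carry no method information. The function name, the returned dictionaries and the sample mapping are all made up for this example.

```python
def parse_mapping(text):
    """Return ({obfuscated class: original class},
               {(original class, obfuscated field): original field})."""
    classes, fields = {}, {}
    current = None
    for line in text.splitlines():
        # Skip blank lines, comments and anything that is not a mapping entry.
        if not line.strip() or line.lstrip().startswith("#") or " -> " not in line:
            continue
        left, _, right = line.partition(" -> ")
        if not line[0].isspace():      # class line: "original -> obfuscated:"
            current = left.strip()
            classes[right.strip().rstrip(":")] = current
        elif "(" not in left:          # indented line without "(" is a field
            _type, name = left.split()
            fields[(current, right.strip())] = name
    return classes, fields


# Hypothetical mapping file content.
SAMPLE = """\
com.example.User -> com.example.a:
    java.lang.String userName -> q
    int age -> b
    void setAge(int) -> a
com.example.Chat -> com.example.b:
    java.lang.String title -> q
"""

classes, fields = parse_mapping(SAMPLE)
```

Note that both classes have a field obfuscated to "q", which is why the field table is keyed by the owning class as well as the obfuscated name.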

In the second step we are using the hprof-lib library (part of the source) to read the input HPROF file. Of all the data contained in the file we are only concerned with these records:

STRING: contains the ID and string value of one string
LOAD_CLASS: records that a class has been loaded by the VM
CLASS_DUMP: contains the definition of a class, including the name of the class, lists of constants, named static fields and named instance fields

When reading the field declarations of the class definitions an additional step is completed: deduplication of strings. As mentioned in the previous section about ProGuard/DexGuard obfuscation, fields that previously had unique names are made to share the same name after obfuscation. This means that in order to deobfuscate each field correctly we need to create copies of the strings and then deobfuscate each one separately. The table below attempts to explain this.

The output from the second step is a list of all strings and class definitions for all loaded classes, with any field affected by the string deduplication updated.
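
The deduplication part of this step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: strings and classes are held in plain dictionaries (a made-up representation, not the one used by hprof-lib), and new string IDs are allocated just above the largest existing ID, which we assume to be unused.

```python
def deduplicate_field_names(strings, class_fields):
    """Give every field its own string ID.

    strings:      {string ID: value}, extended in place
    class_fields: {class ID: [field-name string IDs]}, modified in place
    """
    next_id = max(strings) + 1      # assumption: IDs above the maximum are free
    seen = set()
    for name_ids in class_fields.values():
        for index, name_id in enumerate(name_ids):
            if name_id in seen:     # ID already used by an earlier field: copy it
                strings[next_id] = strings[name_id]
                name_ids[index] = next_id
                next_id += 1
            else:
                seen.add(name_id)
    return strings


# Hypothetical input: two classes whose fields share the string "q" (ID 7).
strings = {7: "q", 8: "b"}
class_fields = {0x1000: [7, 8], 0x2000: [7]}
deduplicate_field_names(strings, class_fields)
```

After the call, the second class points at its own copy of "q", so the two fields can later be renamed independently.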

In the third step we first process all class names to see if they have a corresponding entry in the mapping read in the first step. If they do, the entry in the list of strings is updated to reflect the original name.

After this we proceed to process the fields of each class (the class names must be done first since the field mapping is based on the original class names). Using the same lookup as for the class names we then update the field name string entries.
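
The ordering matters, so here is a small Python sketch of the third step. It assumes the strings have already been deduplicated and that the mapping is available as two dictionaries, one for classes and one for fields keyed by (original class name, obfuscated field name). All names and data structures are hypothetical and only serve to show the lookup order.

```python
def deobfuscate(strings, class_names, class_fields, class_map, field_map):
    """Rename class and field strings in place.

    strings:      {string ID: value}
    class_names:  {class ID: string ID of the class name}
    class_fields: {class ID: [string IDs of field names]}
    class_map:    {obfuscated class name: original class name}
    field_map:    {(original class name, obfuscated field name): original field name}
    """
    # Classes first: field lookups are keyed by the original class name.
    for name_id in set(class_names.values()):
        strings[name_id] = class_map.get(strings[name_id], strings[name_id])
    for class_id, field_ids in class_fields.items():
        owner = strings[class_names[class_id]]
        for field_id in field_ids:
            key = (owner, strings[field_id])
            strings[field_id] = field_map.get(key, strings[field_id])
    return strings


# Hypothetical data: two classes that both have a field obfuscated to "q".
strings = {1: "com.example.a", 2: "com.example.b", 7: "q", 9: "q"}
class_names = {0x1000: 1, 0x2000: 2}
class_fields = {0x1000: [7], 0x2000: [9]}
class_map = {"com.example.a": "com.example.User", "com.example.b": "com.example.Chat"}
field_map = {("com.example.User", "q"): "userName", ("com.example.Chat", "q"): "title"}
deobfuscate(strings, class_names, class_fields, class_map, field_map)
```

Names without a mapping entry (framework classes, for example) are simply left as they are.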

In the fourth and last step we write the HPROF output file. This is done by reading the input HPROF file record by record, and either copying records that are unchanged or replacing records that need to be updated (STRING and CLASS_DUMP).

Due to the increased number of strings (and increased length of them) the output file is slightly larger than the input file.
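
A simplified Python sketch of the copy-or-replace loop is shown below. It only handles STRING records and assumes the file header has already been copied and that identifiers are 4 bytes. Writing the extra STRING records created by deduplication and replacing CLASS_DUMP records are left out to keep the example short. The function name and sample data are made up.

```python
import io
import struct

STRING = 0x01  # tag of a top-level STRING record


def rewrite_strings(src, dst, new_strings, id_size=4):
    """Copy records from src to dst, replacing STRING bodies found in new_strings."""
    while len(head := src.read(9)) == 9:
        tag, time_offset, length = struct.unpack(">BII", head)
        body = src.read(length)
        if tag == STRING:
            string_id = int.from_bytes(body[:id_size], "big")
            if string_id in new_strings:
                body = body[:id_size] + new_strings[string_id].encode("utf-8")
        # The length field is recomputed because the body may have grown.
        dst.write(struct.pack(">BII", tag, time_offset, len(body)) + body)


# Hypothetical input: one STRING record ("q", ID 7) and one empty record
# with tag 0x2C that must be copied through untouched.
src = io.BytesIO(struct.pack(">BII", STRING, 0, 5) + struct.pack(">I", 7) + b"q"
                 + struct.pack(">BII", 0x2C, 0, 0))
dst = io.BytesIO()
rewrite_strings(src, dst, {7: "userName"})
output = dst.getvalue()
```

Even in this tiny example the output is larger than the input, since "userName" takes more bytes than "q".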

Using the deobfuscator application

Source code and builds for the deobfuscator application are available here: https://github.com/badoo/hprof-deobfuscator

Usage: java -jar deobfuscator-all-x.y.jar {mapping file} {obfuscated hprof file} {output hprof file}

References

[1] Crittercism presentation at Droidcon Berlin 2012 (http://www.slideshare.net/crittercism/crittercism-droidcon-berlin-2012)
[2] Mark-compact algorithm, Wikipedia (http://en.wikipedia.org/wiki/Mark-compact_algorithm)
