Wednesday, November 6, 2024

Visualizing binary files with ImHex's DSL, the "pattern language"

Viewing my binary file with ImHex. The pattern language pane, on the right, provides the highlighting and data on the left.

I've got a binary file in a custom binary format, and a spec for that format. How do I get into the binary quickly and see the data I want?

This is a problem I ran into while writing some software that operates on this data. Before, my approach would probably have been to write some Python code to parse the format, following the spec carefully and seeing where the file diverges from it.

Recently though, I heard about ImHex, a hex editor with advanced features, and decided to give it a go.

The feature I used the most in ImHex was the integrated DSL, the "pattern language". It lets you define structures that ImHex matches against the file and uses to decode the data. The syntax is a mix of C++ and Rust, but with unique semantics and a lot of nice affordances.

Many of the examples that follow are my real code for parsing the SWF file format.

An overview of the pattern language

Here's a pattern that parses a char, followed by the string "WS", an unsigned byte, and then a four-byte number:

import type.magic;

struct Header { 
    char compressionSignature;
    type::Magic<"WS"> signature; 
    u8 swfVersion; 
    u32 bytesSize; 
}; 

Header h @ 0x0;

In ImHex, there are two steps to patterns:

  • first, you define the variable that you want to place, with a type. This can be a primitive type like u8 or char, an array, or a compound type like a struct or a bitfield.
  • then, you can place that variable somewhere in memory, using the @ address syntax. 
Placing the pattern at 0x0 in memory. ImHex fills out the data in the Pattern Data pane, and shows the value in the types you chose.
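
For comparison, a hand-written Python equivalent of the Header pattern could look roughly like this. It is only a sketch: parse_header is a made-up name, and the little-endian byte order is an assumption taken from the SWF format rather than something the pattern spells out.

```python
import struct


def parse_header(data: bytes) -> dict:
    """Decode the 8-byte header laid out like the Header pattern."""
    # "<" = little-endian, no padding: char, 2-byte string, u8, u32
    # (little-endian is an assumption, see the text above)
    compression, signature, version, size = struct.unpack_from("<c2sBI", data, 0)
    if signature != b"WS":
        raise ValueError(f"bad magic: {signature!r}")
    return {
        "compressionSignature": compression.decode("ascii"),
        "swfVersion": version,
        "bytesSize": size,
    }


# A hand-built header: 'F', "WS", version 10, size 1234
header = parse_header(b"FWS" + bytes([10]) + (1234).to_bytes(4, "little"))
print(header)
```

The pattern gets the same result declaratively, plus highlighting in the hex view, which is what makes it faster to iterate on than a script.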

Patterns can get more complex, for example by nesting structs. And if you don't have a spec for your data and are reverse engineering it, you can place things progressively until the data starts making sense.

Logic in patterns

A value may or may not exist depending on a previous value, or its size may be given by an earlier field. When parsing SWFs, I encountered this with a bitfield whose first field decides the size of the remaining fields.

These cases are easy to write, since the pattern language lets you reuse previous values, even ones declared in the same structure:

bitfield Rect {
    // First field has a size of 5 bits
    nSize : 5;
    // The remaining fields have a size of nSize bits,
    // based on the value of nSize.
    signed xMin : nSize;
    signed xMax : nSize;
    signed yMin : nSize;
    signed yMax : nSize;
};
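
To show what the bitfield saves you from writing, here is a rough Python sketch of the same decoding. BitReader and parse_rect are hypothetical names, and the sketch assumes bits are read most significant bit first with two's complement signed fields, as the SWF spec describes.

```python
class BitReader:
    """Reads bit fields from bytes, most significant bit first (assumed order)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, in bits from the start of data

    def read_unsigned(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

    def read_signed(self, nbits: int) -> int:
        value = self.read_unsigned(nbits)
        # Two's complement: a set top bit means the value is negative
        if nbits and value & (1 << (nbits - 1)):
            value -= 1 << nbits
        return value


def parse_rect(data: bytes) -> dict:
    bits = BitReader(data)
    n_size = bits.read_unsigned(5)  # first field: width of the other four
    return {
        "nSize": n_size,
        "xMin": bits.read_signed(n_size),
        "xMax": bits.read_signed(n_size),
        "yMin": bits.read_signed(n_size),
        "yMax": bits.read_signed(n_size),
    }


# 15-bit fields: xMin=0, xMax=11000, yMin=0, yMax=8000
rect = parse_rect(bytes.fromhex("78 00 05 5f 00 00 0f a0 00"))
print(rect)
```

All of the bit bookkeeping in BitReader is what the five field declarations in the bitfield replace.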

You can even put conditionals and other logic constructs inside a struct. Here, two fields of different sizes can hold the length of a record, and the second one exists only if the first holds a magic value:

bitfield RecordHeaderShort {
    Length: 6;
    Tag Tag: 10;
};

struct RecordHeader {
    // Including a structure inside another one
    RecordHeaderShort short;
    // Defines a variable, that does not show up in the view,
    // but can be accessed like a field
    Tag Tag = short.Tag;
    u32 len = short.Length;
    
    if (short.Length == 0x3f) {
        u32 LongLength;
        len = LongLength;
    }
};

I defined a variable len that's present on this structure but doesn't show up in the ImHex view. It is accessible, however, when RecordHeader is nested in another struct, so it's useful for getting the real length without duplicating the logic each time.
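
The same short/long length logic, written out by hand in Python, might look like the sketch below. parse_record_header is a hypothetical helper, and it assumes the two-byte header is a little-endian integer with the length in the low 6 bits and the tag in the upper 10, which is how I read the SWF spec.

```python
import struct


def parse_record_header(data: bytes, offset: int = 0) -> tuple[int, int, int]:
    """Return (tag, length, header_size) for the record header at offset."""
    (short,) = struct.unpack_from("<H", data, offset)
    tag = short >> 6       # upper 10 bits (assumed layout)
    length = short & 0x3F  # lower 6 bits
    header_size = 2
    if length == 0x3F:
        # Magic value: the real length follows as a 32-bit integer
        (length,) = struct.unpack_from("<I", data, offset + 2)
        header_size = 6
    return tag, length, header_size


# Short form: tag 9, length 3
print(parse_record_header(bytes.fromhex("4302")))
# Long form: tag 12, length 100 stored in the extra four bytes
print(parse_record_header(bytes.fromhex("3f0364000000")))
```

Returning the resolved length alongside the tag plays the same role as the hidden len variable: callers never need to know which form was used.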

Match statements

This feature is neat. First, let's define an enum:

enum Tag: u8 {
    End = 0,
    ShowFrame = 1,
    SetBackgroundColor = 9,
    DoAction = 12,
    FileAttributes = 69,
    Metadata = 77, 
};

Now I can use a Rust-like match statement to branch on the tag value, declaring different fields in the struct depending on which value matched:

struct TagRecord {
    RecordHeader RecordHeader; // see above
    
    match (RecordHeader.Tag) {
        // To define a field, you can also just declare its type,
        // there is no need to name it. This declares a field
        // of type `SetBackgroundColor`.
        (Tag::SetBackgroundColor): SetBackgroundColor;
        (Tag::DoAction): DoAction;
        (Tag::FileAttributes): FileAttributes;
        (Tag::Metadata): Metadata;
        // The `padding` keyword is treated specially by the language:
        // Creates a padding, with its length set to
        // a field of the RecordHeader structure. This will not show up in 
        // the pattern view, because it's declared as padding.
        (_): padding[RecordHeader.len];
    } 
// An attribute. In the data pane, this structure's name will be
// the enum's name.
} [[name(RecordHeader.Tag)]];

In the data view, each tag shows up with the enum's name.
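
A script doing the same dispatch needs a lookup table and a fallback. The sketch below is illustrative only: parse_background_color and HANDLERS are hypothetical, and the three-byte RGB body is an assumption about the SetBackgroundColor tag.

```python
from enum import IntEnum


class Tag(IntEnum):
    End = 0
    ShowFrame = 1
    SetBackgroundColor = 9
    DoAction = 12
    FileAttributes = 69
    Metadata = 77


def parse_background_color(body: bytes) -> dict:
    # Hypothetical handler: assumes the body is three bytes, R, G, B
    red, green, blue = body[:3]
    return {"red": red, "green": green, "blue": blue}


# One entry per match arm that decodes something
HANDLERS = {
    Tag.SetBackgroundColor: parse_background_color,
}


def parse_tag_body(tag_code: int, body: bytes) -> tuple[str, object]:
    """Return (display name, decoded body); unhandled tags are left undecoded."""
    try:
        name = Tag(tag_code).name
    except ValueError:
        name = f"Unknown({tag_code})"
    handler = HANDLERS.get(tag_code)
    # A missing handler plays the role of the `(_)` arm
    decoded = handler(body) if handler else None
    return name, decoded


print(parse_tag_body(9, bytes([255, 128, 0])))
print(parse_tag_body(1, b""))
```

The name lookup mirrors what the name attribute does in the pattern, and the None result mirrors the padding arm.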

Array stop conditions

An issue I encountered was that my file format is just an array of TagRecords that runs until the end of the file. This means there is no length field, and a null-terminated array doesn't work either, since some tags contain null values.

Thankfully, the pattern language has support for custom stop conditions, using a "loop sized array":

struct File {
    Header;
    // This function returns true when
    // the end of the file has been reached
    TagRecord Tags[while(!std::mem::eof())];
};

ImHex also has a special $ (dollar) operator, which always points to the current offset in the file. You can even assign to it inside your pattern to change the position. One good use for it is a loop-sized array like this one, which stops on a 0xFF value:

u8 string[while(std::mem::read_unsigned($, 1) != 0xFF)];
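
Both stop conditions correspond to ordinary while loops over an offset. Here is a minimal Python sketch of each; the function names are made up, and iter_records uses fixed-size records purely to keep the example short.

```python
def read_until_sentinel(data: bytes, offset: int, sentinel: int = 0xFF) -> bytes:
    """Collect bytes from offset up to, but not including, the sentinel."""
    end = offset
    # The extra bounds check stops cleanly if the sentinel never appears
    while end < len(data) and data[end] != sentinel:
        end += 1
    return data[offset:end]


def iter_records(data: bytes, offset: int, record_size: int):
    """Yield records until the end of the data (simplified: fixed size)."""
    while offset < len(data):
        yield data[offset:offset + record_size]
        offset += record_size


print(read_until_sentinel(b"abc\xffdef", 0))
print(list(iter_records(b"aabbcc", 0, 2)))
```

In both loops the offset variable does the job of $ in the pattern language.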

Decompressing files inside ImHex directly 

Most SWF files in the wild come compressed, with a non-trivial decompression procedure, which means my pattern can't be applied directly: I need a decompression step first. In the past, I would have reached for a Python script to decompress the file with the custom logic needed.

But the pattern language also supports this case, and has built-in decompression functions, as well as a virtual file system to store the output file:

struct Compressed {
    // Only the contents starting from offset 0x8 are compressed;
    // the header before that is stored as-is.
    UncompressedHeader h;
    // Create a section in memory for the decompressed contents.
    std::mem::Section decompressed = std::mem::create_section("Zlib decompressed");
    u8 compressedContents[std::mem::size() - 8] @ 0x8;
    // Do the actual decompression.
    hex::dec::zlib_decompress(compressedContents, decompressed, 15);
    
    // Combine both the uncompressed header and the decompressed part to recreate
    // the full decompressed file.
    std::mem::Section full = std::mem::create_section("Full decompressed SWF");
    std::mem::copy_value_to_section(h, full, 0x0);
    std::mem::copy_section_to_section(decompressed, 0x0, full, 0x8, h.decompressedSize - 8);
    // Write the uncompressed signature, so the pattern knows the file is
    // decompressed the next time it is opened. The SWF spec uses 'F' for
    // uncompressed files; 'Z' would mark the file as LZMA-compressed.
    std::mem::copy_value_to_section("F", full, 0x0);

    // Write this file to the virtual file system.
    u8 d[h.bytesSize] @ 0x00 in full;
    builtin::hex::core::add_virtual_file(std::format("dec-{}", hex::prv::get_information("file_name")), d);
    std::warning("This SWF is ZLib-compressed, grab the decompressed save from\nthe Virtual Filesystem tab and use this pattern on it.");
};

struct Main {
    char compressed @ 0x0;
    match (compressed) {
        ('C'): Compressed;
        ('F'): File;
    }
};

After running this pattern once, the decompressed file shows up in a new tab, ImHex's "Virtual Filesystem".

The decompressed file in the virtual file system
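
For reference, the Python script this replaces would be something like the sketch below. decompress_swf is a hypothetical name, and it assumes the signatures from the SWF spec ("CWS" for zlib-compressed, "FWS" for uncompressed) and that the size field in the header counts the whole uncompressed file, header included.

```python
import zlib


def decompress_swf(data: bytes) -> bytes:
    """Return an uncompressed copy of a zlib-compressed SWF file."""
    if data[:3] == b"FWS":
        return data  # already uncompressed
    if data[:3] != b"CWS":
        raise ValueError("not a zlib-compressed SWF")
    # The first 8 bytes are stored as-is; everything after is one zlib stream
    header, body = data[:8], data[8:]
    declared_size = int.from_bytes(header[4:8], "little")
    full = b"F" + header[1:] + zlib.decompress(body)
    if len(full) != declared_size:
        raise ValueError("decompressed size does not match the header")
    return full


# Round trip on a hand-built file: 8-byte header plus 100 bytes of payload
payload = b"x" * 100
original = b"FWS" + bytes([10]) + (108).to_bytes(4, "little") + payload
compressed = b"CWS" + original[3:8] + zlib.compress(payload)
print(decompress_swf(compressed) == original)
```

zlib.decompress defaults to a window size of 15 bits, the same value passed as the last argument to hex::dec::zlib_decompress in the pattern.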

Conclusion

Overall, I was very impressed by the pattern language. Not only does it make my task much simpler, it is also well thought out for the tasks it sets out to do, both in its design and in its day-to-day conveniences.

However, the language currently doesn't have a lot of documentation or tutorials. Thankfully, it was easy to get started with some help: the developers on Discord are very nice (and thanks again to paxcut for helping debug some issues!). Hopefully this post was interesting even if you've never had this problem before :)

You can find my full code here, if you're curious what it looks like all put together.

Discussion on reddit, hn, lobsters
