Accepted Papers
ideco: A Framework for Improving Non-C Decompilation
We introduce the ideco framework to aid in the decompilation of non-C programming languages. ideco lets users create rules that rewrite parts of the decompilation.
We show that with a small set of rules, the number of lines of decompiled code for binaries written in C++, Swift, Go, and Rust can be decreased by 5% to 10%. In addition, using GPT-4o and GPT-4.1-mini as test subjects, we show that a reverse engineering task becomes easier to solve when its decompilation is processed by ideco.
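As a rough illustration of the rule-based rewriting described above, the sketch below applies pattern rules to decompiler output. The rule syntax and the specific patterns are invented for illustration; they are not ideco's actual rule language.

```python
import re

# Hypothetical pattern-based rewrite rules in the spirit of ideco.
# Both rules below are illustrative, not taken from the paper.
RULES = [
    # Collapse a verbose null check commonly emitted by decompilers.
    (re.compile(r"if \((\w+) != \(void \*\)0x0\)"), r"if (\1 != NULL)"),
    # Fold a mangled field access into a readable name.
    (re.compile(r"(\w+)\._M_len"), r"\1.len"),
]

def apply_rules(decompiled: str) -> str:
    """Apply each rewrite rule in order to the decompiled text."""
    for pattern, replacement in RULES:
        decompiled = pattern.sub(replacement, decompiled)
    return decompiled

print(apply_rules("if (s != (void *)0x0) { n = s._M_len; }"))
# prints: if (s != NULL) { n = s.len; }
```

Each rule shrinks or clarifies a recurring decompiler idiom, which is how a small rule set can reduce line counts across several source languages.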
Systematic Testing of C++ Abstraction Recovery Systems by Iterative Compiler Model Refinement
C++ source code abstractions such as classes and methods greatly assist human analysts and automated algorithms alike when analyzing C++ programs. These abstractions are lost during the compilation process, but researchers have been developing tools to recover them using program analysis. Despite promising advances, this difficult problem remains unsolved, with state-of-the-art solutions self-reporting accuracies of 78% [32] and 77.5% [10] for different types of abstractions.
In this paper, we address this problem by proposing a new model-based approach for systematically testing C++ abstraction recovery systems. Our high-level approach is to both jointly and iteratively refine the abstraction recovery system and a compiler model that introspects the compilation process. We built EmCee, a model of Microsoft’s Visual C++ compiler, to apply our technique to the popular C++ abstraction recovery systems VirtAnalyzer [10] and OOAnalyzer [32]. EmCee “parses” input files by interpreting them as answers to a series of multiple choice questions (inspired by the game “twenty questions”), which makes it very amenable to fuzzing. We use an off-the-shelf grey-box fuzzer to automatically generate test cases for EmCee that represent a variety of program structures and optimizations. We then use these test cases to evaluate the reasoning in VirtAnalyzer and OOAnalyzer for soundness problems, and correct any violations. Using our approach, we identified 27 soundness problems in OOAnalyzer and three in VirtAnalyzer.
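The "twenty questions" parsing idea above can be sketched as follows: raw fuzzer bytes are consumed one at a time as answers to multiple-choice questions, so every byte string decodes to some well-formed program structure. The questions and the generated C++ skeleton here are invented for illustration, not EmCee's actual model.

```python
def choose(data: bytes, options):
    """Consume one byte as the answer to a multiple-choice question.
    Exhausted input defaults to the first option, so decoding never fails."""
    if not data:
        return options[0], data
    return options[data[0] % len(options)], data[1:]

def generate_class(data: bytes) -> str:
    """Decode a byte string into a tiny C++ class declaration (illustrative)."""
    name, data = choose(data, ["A", "B"])
    n_methods, data = choose(data, [0, 1, 2])
    methods = []
    for i in range(n_methods):
        virt, data = choose(data, ["", "virtual "])
        methods.append(f"  {virt}void m{i}();")
    return "class %s {\n%s\n};" % (name, "\n".join(methods))

print(generate_class(b"\x01\x02\x01"))
```

Because any byte sequence maps to a valid structure, an off-the-shelf grey-box fuzzer can explore program shapes simply by mutating bytes, which is what makes this encoding so amenable to fuzzing.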
LibIHT: A Hardware-Based Approach to Efficient and Evasion-Resistant Dynamic Binary Analysis
Dynamic program analysis is invaluable for malware detection, debugging, and performance profiling. However, software-based instrumentation incurs high overhead and can be evaded by anti-analysis techniques. In this paper, we propose LibIHT, a hardware-assisted tracing framework that leverages on-CPU branch tracing features (Intel Last Branch Record and Branch Trace Store) to efficiently capture program control flow with minimal performance impact. Our approach reconstructs control-flow graphs (CFGs) by collecting hardware-generated branch execution data in the kernel, preserving program behavior against evasive malware. We implement LibIHT as an OS kernel module and user-space library, and evaluate it on both benign benchmark programs and adversarial anti-instrumentation samples. Our results indicate that LibIHT reduces runtime overhead by over 150× compared to Intel Pin (7× vs 1,053× slowdowns), while achieving high fidelity in CFG reconstruction (capturing over 99% of executed basic blocks and edges). Although this hardware-assisted approach sacrifices the richer semantic detail available from full software instrumentation by capturing only branch addresses, this trade-off is acceptable for many applications where performance and low detectability are paramount. Our findings show that hardware-based tracing captures control-flow information significantly faster, reduces detection risk, and performs dynamic analysis with minimal interference.
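The CFG reconstruction step described above can be sketched in a few lines: each hardware branch record is a (source, target) address pair, and aggregating the pairs yields the edges of a control-flow graph. The addresses below are made up for illustration; they are not from the paper's evaluation.

```python
from collections import defaultdict

def build_cfg(branch_records):
    """Aggregate (source_addr, target_addr) pairs, as captured from LBR/BTS
    data in the kernel, into weighted CFG edges (edge -> times taken)."""
    edges = defaultdict(int)
    for src, dst in branch_records:
        edges[(src, dst)] += 1
    return dict(edges)

# Illustrative branch records: a loop back-edge causes one pair to repeat.
records = [(0x401010, 0x401050), (0x401060, 0x401010), (0x401010, 0x401050)]
cfg = build_cfg(records)
print(cfg[(0x401010, 0x401050)])  # this edge was taken twice
```

Because only branch endpoints are recorded, straight-line code between branches must be recovered separately (e.g., by disassembling from each target to the next branch), which is the semantic detail the abstract notes is traded away.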
Measuring While Playing Fair: An Empirical Analysis of Language and Framework Usage in the iOS App Store
Reverse engineering research has mainly focused on binaries compiled from C and C++; in the iOS ecosystem, however, neither of these languages is the focus of application developers. Apple provides its own languages, Objective-C and Swift, as the official choices, while third-party cross-platform frameworks like Microsoft's .NET MAUI, Jetpack Compose, Flutter, or even React Native promise unified development across iOS and Android. To investigate the relevance of languages for R&D efforts in software understanding, we conduct a historical analysis spanning 84,432 distinct iOS applications over the past five years.
Unlike previous approaches, we sidestep the technical and legal challenges of the FairPlay DRM system used to encrypt iOS apps and demonstrate that FairPlay does not cover various useful metadata, some of which can be used to detect the presence of programming languages in individual binaries and applications. Our key findings show that, as expected, Swift is now included in almost every popular application, yet without phasing out Objective-C. Additionally, newer cross-platform technologies like Flutter and Kotlin have seen a steady increase in use, while .NET has stagnated since 2020. All of these applications still include and interact with Objective-C, demonstrating that cross-language analysis is now an unavoidable challenge in the modern iOS analysis landscape.
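The metadata-based detection described above can be illustrated by matching characteristic file names in an app bundle's unencrypted file listing against known runtime markers. The marker table and file paths below are illustrative assumptions, not the paper's actual detection rules.

```python
# Hypothetical markers: a framework's runtime library appearing in the app
# bundle signals that the language/framework is present.
MARKERS = {
    "Swift": "Frameworks/libswiftCore.dylib",
    "Flutter": "Frameworks/Flutter.framework/Flutter",
    ".NET": "monobundle",
}

def detect(file_list):
    """Return the sorted set of technologies whose marker path appears
    anywhere in the app's (unencrypted) file listing."""
    return sorted(lang for lang, marker in MARKERS.items()
                  if any(marker in path for path in file_list))

app_files = ["Payload/App.app/App",
             "Payload/App.app/Frameworks/libswiftCore.dylib",
             "Payload/App.app/Frameworks/Flutter.framework/Flutter"]
print(detect(app_files))  # ['Flutter', 'Swift']
```

Because such file listings sit outside the FairPlay-encrypted executable, this kind of check works without decrypting the app, which is the key to the legal and technical sidestep the abstract describes.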
Towards Scalable Evaluation of Software Understanding: A Methodology Proposal
In reverse engineering, our goal is to build systems that help people understand software. However, the field has not converged on a way to measure software understanding. In this paper, we make the case that understanding should be measured via performance on understanding-questions. We propose a method for constructing understanding-questions and evaluating answers at scale. We conduct a case study in which we apply our method and compare Ghidra's default auto-analysis with an analysis that supports binary constructs specific to Objective-C.
Toward Inferring Structural Semantics from Binary Code Using Graph Neural Networks
Recovering semantic information from binary code is a fundamental challenge in reverse engineering, especially when source-level information is unavailable. We aim to analyze the types and roles of structural elements observed in the compiled program, focusing on their contextual usage patterns and associations with other members. Throughout this paper, we refer to such semantic aspects as structural semantics: co-occurring patterns of jointly updated structure members reveal functional roles that can be inferred from their coupling. Recent approaches have applied graph neural networks (GNNs) to data-flow graphs (DFGs) for variable type inference, but most rely on a single model architecture, such as the relational graph convolutional network (R-GCN). While effective, such models may overlook alternative patterns of structure member behavior. In this paper, we investigate the effectiveness of three alternative GNN architectures, gated graph neural networks (GGNN), graph attention networks (GAT), and standard graph convolutional networks (GCN), in capturing structural semantics from binary-level data-flow graphs. We evaluate these models on real-world binaries compiled at multiple optimization levels, measuring their ability to infer semantic properties of structure members. Our results show that these architectures capture complementary aspects of structural semantics: GGNN is effective at modeling long-range dependencies, GAT suppresses irrelevant connections, and GCN offers computational simplicity. Each architecture thus emphasizes distinct, complementary patterns of how structure members are accessed together in memory, demonstrating that architectural diversity provides richer perspectives for semantic inference in binary analysis.
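To make the comparison concrete, the sketch below implements one layer of the simplest of the three architectures, a standard graph convolution (normalized adjacency times features times weights, followed by ReLU). The two-node graph, features, and weights are toy values, not the paper's DFG encoding.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN message-passing step: symmetric-normalized adjacency with
    self-loops, multiplied by node features and a weight matrix, then ReLU."""
    A_hat = A + np.eye(A.shape[0])                         # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))  # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

A = np.array([[0.0, 1.0], [1.0, 0.0]])  # two structure members linked by a data-flow edge
X = np.eye(2)                           # one-hot node features
W = np.ones((2, 2))                     # toy weights
H = gcn_layer(A, X, W)
print(H.shape)  # (2, 2)
```

GGNN and GAT replace the fixed normalized adjacency with gated recurrent updates and learned attention weights respectively, which is why they can model longer-range dependencies and down-weight irrelevant edges.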
DEBRA: A Real-World Benchmark For Evaluating Deobfuscation Methods
Software obfuscation is a broadly adopted protection method that hides representative information by transforming it into a highly opaque but semantically equivalent form. To date, a variety of deobfuscation methods have been developed to peel off the obfuscation and expose the original program semantics. However, nearly all deobfuscation tools are tested and evaluated only on small toy programs with ad-hoc configurations, leading to a fundamental gap between deobfuscation research and real-world practice. We find that the key obstacle is the absence of a real-world, large-scale testing benchmark that can systematically evaluate deobfuscation methods.
To fill this gap, we propose DEBRA, a comprehensive, large-scale obfuscation benchmark crafted with a diverse range of real-world programs for evaluating deobfuscation methods. First, we collect a set of real-world open-source programs representing diverse obfuscation scenarios. Second, we design a metric-driven approach to determine the crucial or sensitive functions to be obfuscated, because in real-world practice, only the critical parts of a program are obfuscated to balance security and execution overhead. Instead of blindly and arbitrarily obfuscating a program, this design makes our obfuscation benchmark closely mirror real-world practice. Next, we obfuscate the selected areas in these programs with state-of-the-art obfuscators and obfuscation techniques, resulting in DEBRA. During our evaluation, the samples from DEBRA crashed the target deobfuscators and exposed limitations that were not shown during their original evaluation. With the hope of driving advancements in deobfuscation research, DEBRA serves as a pioneering standard benchmark for evaluating and comparing different deobfuscation methods.
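The metric-driven function selection described above can be sketched as ranking functions by a sensitivity score and obfuscating only the top candidates within a budget. The metric (branch count plus call count), the budget, and the function data below are illustrative assumptions, not DEBRA's actual criteria.

```python
def select_functions(functions, budget=2):
    """Rank functions by a toy sensitivity score (branches + calls) and
    return the names of the top `budget` candidates to obfuscate."""
    scored = sorted(functions, key=lambda f: f["branches"] + f["calls"],
                    reverse=True)
    return [f["name"] for f in scored[:budget]]

# Illustrative per-function metrics, as might come from static analysis.
functions = [
    {"name": "check_license", "branches": 12, "calls": 5},
    {"name": "log_debug",     "branches": 1,  "calls": 1},
    {"name": "aes_encrypt",   "branches": 8,  "calls": 7},
]
print(select_functions(functions))  # ['check_license', 'aes_encrypt']
```

Selecting only high-scoring functions mirrors the real-world trade-off the abstract describes: protecting critical logic while keeping the execution overhead of obfuscation bounded.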
Benchmarking Binary Type Inference Techniques in Decompilers
Decompilation is the process of translating low-level, machine-executable code back into a high-level representation. Decompilers, tools that perform this translation, are essential for reverse engineers and security professionals, supporting critical tasks within their workflows. However, because information is inherently lost during compilation as a result of optimizations, inlining, and other compiler-specific transformations, decompiled output is often incomplete or inaccurate.
A central challenge in decompilation is accurate type inference: the reconstruction of high-level type information for variables based on low-level code patterns and memory access behaviors. Despite ongoing advancements in decompilation research, there is a notable lack of comprehensive comparative studies evaluating the type inference capabilities of existing decompilers.
This paper presents a benchmark study of five decompilers, focusing on their ability to infer types at both the function and variable levels. We conduct the evaluation on a dataset of binaries compiled from the Nixpkgs collection at both -O0 and -O2 optimization levels, allowing us to assess decompiler performance across unoptimized and optimized executables. The results highlight the relative strengths and weaknesses of each decompiler and identify recurring scenarios in which incorrect type information is produced.
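A benchmark of this kind ultimately reduces to comparing each decompiler's predicted types against ground truth per variable. The sketch below shows such a per-variable comparison; the type names and the exact-match scoring are illustrative assumptions, since real benchmarks typically also normalize types and score partial matches.

```python
def type_accuracy(predicted, ground_truth):
    """Fraction of ground-truth variables whose predicted type matches
    exactly. `predicted` and `ground_truth` map variable name -> type."""
    matches = sum(1 for var, t in ground_truth.items()
                  if predicted.get(var) == t)
    return matches / len(ground_truth)

# Illustrative ground truth (from debug info) vs a decompiler's output.
truth = {"buf": "char *", "n": "size_t", "cb": "void (*)(int)"}
pred  = {"buf": "char *", "n": "unsigned long", "cb": "void (*)(int)"}
print(type_accuracy(pred, truth))  # 2 of 3 exact matches
```

Note that the `size_t` vs `unsigned long` mismatch here is the kind of recurring near-miss such a study surfaces: the representation is bit-identical on many platforms, yet an exact-match metric counts it as wrong, so the choice of type-equivalence rules materially affects reported accuracy.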
On the Learnability, Robustness, and Adaptability of Deep Learning Models for Obfuscation-applied Code
Code obfuscation is a common technique that deliberately impedes software analysis by concealing a program's structure, logic, or behavior, and it is used for both benign and malicious purposes. Its widespread adoption poses a significant challenge to security analysts. While deep learning has shown promise across various binary analysis tasks, the learnability of obfuscation-applied code and its applicability have received less attention to date. Prior work often overlooks code obfuscation or defers it to future research, as the complexity of obfuscation may differ significantly depending on design and implementation differences across obfuscation tools. In this paper, we investigate how well obfuscation-applied code can be learned by a state-of-the-art model for the binary code similarity detection task. Training the model with obfuscated code from a source-based and an IR-based obfuscation tool (Tigress and Obfuscator-LLVM, respectively), we evaluate: i) learnability on obfuscated code, ii) generalizability to both obfuscated and non-obfuscated code, iii) robustness to known obfuscation techniques, and iv) adaptability to unknown obfuscation techniques. Our findings show that learning a task directly from obfuscated code is feasible, outperforming models trained on large volumes of non-obfuscated code even with a comparatively small dataset. However, achieving generalizability across obfuscated and non-obfuscated code remains challenging. Furthermore, we find that the model's robustness and adaptability to previously known and unknown obfuscations is closely tied to the inherent complexity of an obfuscation technique.
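The binary code similarity detection objective underlying this evaluation can be sketched as follows: functions are embedded as vectors by a learned model, and two functions are judged similar when their embeddings have high cosine similarity. The embedding vectors below are toy values, not the output of the paper's model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two function embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative embeddings: the same function before and after obfuscation
# should stay closer to each other than to an unrelated function.
plain      = np.array([1.0, 0.0, 1.0])
obfuscated = np.array([0.9, 0.1, 1.1])
unrelated  = np.array([0.0, 1.0, 0.0])

print(cosine(plain, obfuscated) > cosine(plain, unrelated))  # True
```

Training on obfuscated code, as the paper does, amounts to teaching the embedding model to keep such pairs close even when the obfuscator has heavily rewritten the underlying instructions.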