Embarking on a journey through the world of Java Bytecode? This article covers everything you need to know to get started.
Back in 1995, Sun Microsystems, the creator of the Java programming language, made a bold claim: Java would let you “write once, run anywhere.” That meant compiled binaries would be able to run on any system architecture, something that C could not offer, and it remains a core tenet of writing Java to this day.
To achieve this cross-platform capability, Java employs a unique approach when compiling. Instead of going from source code directly into machine code (which would be specific to each system architecture), Java compiles its programs into an intermediate form known as bytecode. Bytecode is a set of instructions that is neither tied to a particular machine language nor dependent on any specific hardware architecture. This abstraction is the key to Java's portability.
The program that executes Java bytecode is called a Java Virtual Machine (JVM). The JVM can interpret bytecode instructions one at a time, and it can also translate frequently executed code into the machine code native to the system architecture it is running on. That second step, known as "just-in-time" (JIT) compilation, lets Java bytecode run efficiently on any given platform.
Bytecode isn’t just useful for the JVM, though. Because the bytecode of a Java class is helpful for reverse engineering, performance optimization, security research, and other static analysis functions, the JDK ships with utilities to help you and me inspect it.
To glimpse an example of bytecode, consider the following two methods from `java.lang.Boolean`, `booleanValue` and `valueOf(boolean)`, which respectively unbox and box the `boolean` primitive type:
    public boolean booleanValue() {
        return value;
    }

    public static Boolean valueOf(boolean b) {
        return (b ? TRUE : FALSE);
    }
Using the `javap` command, which ships with the JDK, we can see the bytecode for each. You can do this by running `javap` with the `-c` option and the fully qualified name of the class, like so:
    javap -c java.lang.Boolean
The result is the bytecode for all the non-private methods in `java.lang.Boolean` (by default, `javap` skips private members). Here I’ve copied just the bytecode for `booleanValue` and `valueOf(boolean)`:
    public boolean booleanValue();
      Code:
         0: aload_0
         1: getfield      #7                  // Field value:Z
         4: ireturn

    public static java.lang.Boolean valueOf(boolean);
      Code:
         0: iload_0
         1: ifeq          10
         4: getstatic     #27                 // Field TRUE:Ljava/lang/Boolean;
         7: goto          13
        10: getstatic     #31                 // Field FALSE:Ljava/lang/Boolean;
        13: areturn
At first glance, it’s an entirely new language to learn. However, it quickly becomes straightforward as you learn what each instruction does and that the JVM is a stack-based machine: instructions push values onto an operand stack and pop them off again.
Take the three bytecode instructions for `booleanValue`, for example:
`aload_n` means to push the reference stored in local variable `n` onto the stack. In an instance method, local variable 0 holds `this`, so `aload_0` pushes `this`.
`getfield` means to pop the object reference off the top of the stack (here, `this`), read the named field from it, and push that field’s value onto the stack.
`#7` is the index of the field reference in the class’s constant pool.
`// Field value:Z` tells us what `#7` refers to: a field named `value` of type `boolean` (`Z`).
`ireturn` means to pop an `int` off of the stack and return it. The JVM treats `boolean` as an `int` here; other primitive types have their own return instructions, such as `lreturn` for `long`.
Long story short, these three instructions look up the instance’s `value` field and return it.
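To make the stack traffic concrete, here is a minimal sketch that mimics those three instructions in plain Java. It is only an illustration of the idea, not how a real JVM is implemented. `BooleanValueSketch`, the nested `Box` class (a stand-in for `Boolean`), and the `locals` and `stack` variables are all names I made up for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BooleanValueSketch {
    // Hypothetical stand-in for java.lang.Boolean: a single instance field named "value".
    static final class Box {
        final boolean value;
        Box(boolean value) { this.value = value; }
    }

    // Mimics the sequence: aload_0, getfield value:Z, ireturn
    static int booleanValue(Box self) {
        Object[] locals = { self };                // slot 0 holds `this` in an instance method
        Deque<Object> stack = new ArrayDeque<>();  // the operand stack

        // 0: aload_0 -> push the reference stored in local variable 0
        stack.push(locals[0]);

        // 1: getfield -> pop the object reference, then push the field's value
        Box ref = (Box) stack.pop();
        stack.push(ref.value ? 1 : 0);             // boolean is handled as an int: 1 or 0

        // 4: ireturn -> pop the int and hand it back to the caller
        return (Integer) stack.pop();
    }

    public static void main(String[] args) {
        System.out.println(booleanValue(new Box(true)));   // prints 1
        System.out.println(booleanValue(new Box(false)));  // prints 0
    }
}
```

Notice that the stack is empty again by the time the method returns: every value an instruction pushes is consumed by a later one.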
As a second example, take a look at the next method, `valueOf(boolean)`:
`iload_n` means to push the `int` stored in local variable `n` onto the stack (a `boolean` is handled as an `int`). Because `valueOf` is static, there is no `this`, so local variable 0 holds the first method parameter.
`ifeq n` means pop the value off of the stack and compare it with zero (`false`); if it is zero, jump to offset `n`, otherwise proceed to the next instruction. The numbers in the left column are byte offsets into the method’s code, not line numbers.
`getstatic #n` means read a static field and push its value onto the stack.
`#27` is the index of the static field reference in the class’s constant pool.
`// Field TRUE:Ljava/lang/Boolean;` tells us what `#27` refers to: a static field named `TRUE` of type `Boolean`.
`goto n` means jump unconditionally to offset `n` in the bytecode.
`areturn` means pop a reference off of the stack and return it.
In other words, these instructions say: take the first method parameter; if it’s true, return `Boolean.TRUE`; otherwise, return `Boolean.FALSE`.
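The branching is easier to follow if you trace it with a program counter. The sketch below replays the six instructions of `valueOf(boolean)` in plain Java, using the same byte offsets that `javap` printed as `case` labels. Again, this is a toy model under my own assumptions (`ValueOfSketch`, `pc`, `locals`, and `stack` are hypothetical names), not a real interpreter.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ValueOfSketch {
    // Replays the bytecode of Boolean.valueOf(boolean), one offset at a time.
    static Boolean valueOf(boolean b) {
        int[] locals = { b ? 1 : 0 };              // static method: slot 0 is the first parameter
        Deque<Object> stack = new ArrayDeque<>();  // the operand stack
        int pc = 0;                                // offset of the next instruction

        while (true) {
            switch (pc) {
                case 0:   // iload_0: push the int in local variable 0
                    stack.push(locals[0]);
                    pc = 1;
                    break;
                case 1:   // ifeq 10: pop; jump to 10 if the value is zero, else fall through to 4
                    pc = (((Integer) stack.pop()) == 0) ? 10 : 4;
                    break;
                case 4:   // getstatic TRUE: push the static field's value
                    stack.push(Boolean.TRUE);
                    pc = 7;
                    break;
                case 7:   // goto 13: unconditional jump
                    pc = 13;
                    break;
                case 10:  // getstatic FALSE
                    stack.push(Boolean.FALSE);
                    pc = 13;
                    break;
                case 13:  // areturn: pop the reference and return it
                    return (Boolean) stack.pop();
                default:
                    throw new IllegalStateException("no instruction at offset " + pc);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(valueOf(true));   // prints true
        System.out.println(valueOf(false));  // prints false
    }
}
```

The offsets skip from 1 to 4 and from 7 to 10 because `ifeq`, `getstatic`, and `goto` each take three bytes: one for the opcode and two for the operand.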
I mentioned earlier that this can be helpful for reverse engineering, performance optimization, and security research. Let’s expand on those now.
When working with third-party libraries or closed-source components, bytecode analysis becomes a powerful tool. Decompiling bytecode can provide a glimpse into the inner workings of these libraries, aiding in integration, troubleshooting, and ensuring compatibility.
In situations where you encounter proprietary or closed-source Java code, reading bytecode can be the only feasible way to understand its functionality. Bytecode analysis allows you to reverse engineer and comprehend the behavior of closed-source applications, facilitating interoperability or customization.
By way of a real-life example, I was recently trying to integrate a third-party package tangle analysis tool into our CI system. Unfortunately, the tool was closed-source, and the vendor only documented how to access the library through their proprietary UI. By analyzing the bytecode, I was able to reverse engineer the expected inputs and outputs of the underlying analytics engine.
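When there are many classes to inspect, it can be handy to drive `javap` from Java code rather than from the shell. The sketch below does that through the standard `java.util.spi.ToolProvider` API (Java 9 and later; `Optional.isEmpty` needs Java 11). It assumes a full JDK, where the `javap` tool is registered as a provider. `JavapSketch` and `disassemble` are names I chose for illustration, and this is not the tooling I used in the story above.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Optional;
import java.util.spi.ToolProvider;

class JavapSketch {
    // Returns the `javap -c` listing for the given class name, or an empty string
    // when no javap tool is available (for example, on a trimmed-down runtime image)
    // or when javap reports an error.
    static String disassemble(String className) {
        Optional<ToolProvider> javap = ToolProvider.findFirst("javap");
        if (javap.isEmpty()) {
            return "";
        }
        StringWriter outBuffer = new StringWriter();
        StringWriter errBuffer = new StringWriter();
        PrintWriter out = new PrintWriter(outBuffer);
        PrintWriter err = new PrintWriter(errBuffer);

        // Equivalent to running: javap -c <className>
        int exitCode = javap.get().run(out, err, "-c", className);
        out.flush();
        err.flush();
        return exitCode == 0 ? outBuffer.toString() : "";
    }

    public static void main(String[] args) {
        String target = args.length > 0 ? args[0] : "java.lang.Boolean";
        System.out.println(disassemble(target));
    }
}
```

For classes that live in a third-party jar, you would also pass `javap` its `-classpath` option pointing at that jar, just as you would on the command line.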