This page describes the internal workings of the GXemul dynamic translation
system (usually referred to as just "dyntrans").
In order to support guest operating systems, which can overwrite old code pages in memory with new code, it is necessary to translate code dynamically. It is not possible to do a static (one-pass) translation. Self-modifying code and Just-in-Time compilers running inside the emulator are other things that would not work with a static translator. GXemul is a dynamic translator. However, it does not necessarily translate into native code, like some other emulators.
Dynamic translators usually translate from the emulated architecture (e.g. MIPS) into a kind of intermediate representation (IR), and then to native code (e.g. AMD64 or x86 code) which can be executed by the host. Since one of the main goals for GXemul is to keep everything as portable as possible, the IR is something which can be executed regardless of whether the final step (translation from IR to native code) has been implemented or not.
The IR in GXemul consists of arrays of pointers to functions, and a few arguments which are passed along to those functions. This is usually referred to as threaded code. The functions are implemented in either manually hand-coded C/C++, or generated C/C++.
("Generated code" in the old C dyntrans framework means generated by special programs at compile time, which output temporary .c files; in the C++ dyntrans framework, this is implemented using templates instead. In any case, this is all statically linked into the GXemul binary at link time.)
Here is a simplified diagram of how these arrays work.

There is one instruction call slot for every possible program counter location. For example, in the traditional MIPS case, instruction words are 32 bits long, and pages are 4 KB large, resulting in 1024 instructions per page. In GXemul, this is represented using 1025 or 1026 instruction call slots.
After the last of the regular instruction call slots, there is one or two additional slots, which are special "end of page" slots. These do not count as executed instructions. Instead, they jump to the first instruction on the next virtual page (which might cause exceptions, etc).
The core dyntrans loop, which executes the instructions in the instruction call slots, looks something like this: (Note: This is from pre-0.6.x.)
	/*  The normal instruction execution core:  */
	#define I	ic = cpu->cd.DYNTRANS_ARCH.next_ic ++; ic->f(cpu, ic);
	...
	n_instrs = 0;
	for (;;) {
		struct DYNTRANS_IC *ic;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		I; I; I; I; I;   I; I; I; I; I;
		cpu->n_translated_instrs += 120;
		if (cpu->n_translated_instrs >= N_SAFE_DYNTRANS_LIMIT)
			break;
	}
Why the number 120? The answer is: this number is evenly divisible by 1, 2, 3, 4, 5, 6, 8, 10, 12 (and some other numbers). Long emulated instruction loops of those lenghts will thus result in the same host branch being taken from the same place multiple times, which seems to fit with some modern CPUs' branch prediction mechanisms.
The complexity of individual instructions vary. A simple example of what an instruction can look like is the MIPS addu instruction. First, as implemented in the old C dyntrans framework (pre-0.6.x):
	/*
	 *  3-register arithmetic instructions:
	 *
	 *  arg[0] = ptr to rs
	 *  arg[1] = ptr to rt
	 *  arg[2] = ptr to rd
	 */
	X(addu)
	{
		reg(ic->arg[2]) = (int32_t)(reg(ic->arg[0]) + reg(ic->arg[1]));
	}
It stores the result of a 32-bit addition of the register at arg[0] with the register at arg[1] into the register at arg[2]. If the emulated CPU is a 64-bit CPU, then this will store a correctly sign-extended value. If it is a 32-bit CPU, then only the lowest 32 bits will be stored, and the high part ignored. X(addu) is expanded to mips_instr_addu in the 64-bit case, and mips32_instr_addu in the 32-bit case. Both are compiled into the GXemul executable; no code is created during run-time.
The C++ dyntrans implementation (in GXemul 0.6.x) is similar, except that a fixed REG32 or REG64 is used instead of reg (so it is only compiled once instead of twice). Also, instructions that may be reused between emulated architectures, such as this 32-bit signed add instruction, are in CPUDyntransComopnent instead of MIPS_CPUComponent.
	/*
	 * arg 0: 64-bit register
	 * arg 1: 64-bit register
	 * arg 2: 64-bit register
	 *
	 * Adds the the registers in arg 1 and arg 2, and stores the result in arg 0
	 * (truncated to a signed 32-bit value).
	 */
	DYNTRANS_INSTR(CPUDyntransComponent,add_u64_u64_u64_truncS32)
	{
		REG64(ic->arg[0]) = (int32_t) (REG64(ic->arg[1]) + REG64(ic->arg[2]));
	}
Another reasonably simple example is the M88K bsr (branch to subroutine) instruction. This is a branch which also sets the return address register to the instruction following the branch. Here is the implementation for "samepage" offsets, i.e. branch offsets which are inside the same dyntrans page:
	DYNTRANS_INSTR(M88K_CPUComponent,bsr_samepage)
	{
		//  This declares a cpu variable of the correct type which we can use.
		DYNTRANS_INSTR_HEAD(M88K_CPUComponent)
		//  Set the return address.
		cpu->m_r[M88K_RETURN_REG] = (cpu->m_pc & ~((M88K_IC_ENTRIES_PER_PAGE-1)
		    << M88K_INSTR_ALIGNMENT_SHIFT)) + ic->arg[2].u32;
		//  Branch.
		cpu->m_nextIC = (struct DyntransIC *) ic->arg[0].p;
	}
Here, the M88K r1 register (the return register) is set to the return address (the current program counter plus an offset in arg[2] pointing to the next instruction) and then the next instruction call to execute is set to arg[0].
Although it looks like the bsr_samepage code would result in quite a lot of host instructions, objdump reveals that the compiler was able to optimize it to a reasonably short sequence:
On Alpha (compiled with GNU C++ 3.4.6): <_ZN17M88K_CPUComponent18instr_bsr_samepageEP20CPUDyntransComponentP10DyntransIC>: 760: b0 00 90 a4 ldq t3,176(a0) 764: 1f 04 ff 47 nop 768: 03 f0 bf 20 lda t4,-4093 76c: 18 00 71 a0 ldl t2,24(a1) 770: 00 00 85 44 and t3,t4,v0 774: 01 04 03 40 addq v0,t2,t0 778: d0 01 30 b0 stl t0,464(a0) ; Set the return address 77c: 08 00 51 a4 ldq t1,8(a1) 780: 38 01 50 b4 stq t1,312(a0) ; ... and update nextIC. 784: 01 80 fa 6b ret On AMD64 (compiled with GNU C++ 4.2.1): <_ZN17M88K_CPUComponent18instr_bsr_samepageEP20CPUDyntransComponentP10DyntransIC>: 419310: 8b 87 b0 00 00 00 mov 0xb0(%rdi),%eax 419316: 25 03 f0 ff ff and $0xfffff003,%eax 41931b: 03 46 18 add 0x18(%rsi),%eax 41931e: 89 87 d0 01 00 00 mov %eax,0x1d0(%rdi) ; Set the return address 419324: 48 8b 46 08 mov 0x8(%rsi),%rax 419328: 48 89 87 38 01 00 00 mov %rax,0x138(%rdi) ; ... and update nextIC. 41932f: c3 retq
The emulation of memory access instructions differ slightly between the various emulated architectures, but most of the principles in this chapter apply. The emulation of loads and stores when emulating a MIPS processor consist of a "happy case" shorter code path, and a fallback to a more general case if the preconditions for the happy case are not met.
Pseudo code for the short path:
| Step: | Overhead: | 
| 1. Calculate the final virtual address. | 3 host memory accesses: read the Base and Offset arguments. | 
| 2. Preconditions check: a) find a corresponding host RAM page, if there is one, and b) check address alignment. | 1 host memory access for 32 bit, 3 for 64 bit. | 
| 3. Perform native load or store. | 1 host memory access (or multiple if doing byte-order-swap). | 
| 4. "Write-back" to emulated CPU's destination register (for loads). | 2 host memory accesses. | 
Step 1 calculates the actual virtual address (usually a base register plus an immediate offset).
Step 2 looks up whether there is a corresponding RAM page in the host. If there is, we can potentially load or store directly from that page in the host's RAM.
However, before going ahead with the load or store, we need to check the alignment of the address as well. Different host architectures have different properties, for example when running on (some?) ARM hosts, the lower bits of an address may be ignored leading to the wrong address being used if we were to use the address as-is; when running on an x86-based host, unaligned loads and stores are silently accepted even though they shouldn't be on the hardware being emulated (in this case MIPS). Checking the lowest bits of the final address is thus important for correct emulation, so that exceptions can be generated etc.
For emulation of 32-bit address spaces, the lookup from emulated virtual address to host page is a single fetch from an array. For emulation of 64-bit address spaces, it is a three-level lookup. In either case, there are separate lookups for loads and stores. A page which is write-protected has just a mapping for load, no mapping for store. A page which is readable and writable has both a mapping for load and one for store.
If the alignment was ok, and there was a direct mapping to host page, we can go ahead (step 3) and perform a native load or store, possibly followed by byte order swap (which may be implemented by using multiple memory accesses) if the emulated CPU is of different endianness than the host.
(This mechanism is currently hardcoded for 4 KB base page size. The 1 KB page size of the VR41xx CPUs for example does not work, and neither would e.g. VAX' 512 byte pages. As long as the guest OS doesn't make use of smaller page sizes than 4 KB, it should work even when emulating such CPUs.)
If any of the checks fail in step 2, the code bails out and calls a more general memory access routine, which looks up addresses via the emulated MIPS TLB and depending on the outcome, may add host pages to the virtual-to-host arrays, or cause exceptions on the emulated cpu, or call out to memory mapped device routines.
This implementation is obviously not intended to result in any world speed records, but it strikes a reasonable balance between emulation correctness and host portability.
The actual code in src/cpus/cpu_mips_instr_loadstore.cc looks horrible due to being originally written in C with the intention of mimicking C++ templates. If it were to be written in C++ templates from scratch, it would likely be more readable.
When the TLB in the emulated CPU gets instructed to change its mapping, for example if virtual address 0x1000 pointed to physical address 0x1234000, but is changed to point to 0x2345000, then the virtual-to-physical-to-host mappings in the emulator must be updated to reflect this. This is known as invalidating the translation lookup tables.
Also, when performing a store to a page for which there are code translations (the IR described above), all such translated instruction slots are reset.
The performance of using this kind of executable IR is obviously lower than what can be achieved by emulators using native code generation, but can be significantly higher than using a naive fetch-decode-execute interpretation loop. Using threaded code as the IR is an interesting compromise.
The overhead per emulated instruction is usually around or below approximately 10 host instructions. This is very much dependent on your host architecture and what compiler and compiler switches you are using. Added to this instruction count is (of course) also the C/C++ code used to implement each specific instruction.
Short, common instruction sequences can sometimes be replaced by a "combined" instruction. An example could be a compare instruction followed by a conditional branch instruction. The advantages of instruction combinations are that
The special cases where instruction combinations give the most gain are in the cores of string/memory manipulation functions such as memset() or strlen(). The core loop can then (at least to some extent) be replaced by a native call to the equivalent function.
The implementations of compound instructions still keep track of the number of executed instructions, etc. When single-stepping, these translations are invalidated, and replaced by normal instruction calls (one per emulated instruction).
An example of what such an instruction combination can look like is a 3-instruction memset loop on MIPS. The pattern to detect is a sequence of addiu, bne, and sw with certain registers. In this case, an attempt to identify the sequence is only necessary for certain sw instructions; otherwise, the attempt to find the pattern is not done. (Note: The sw instruction is in the delay slot of the bne instruction.)
If the sequence is detected, the instruction call slot with addiu is changed to point to the instruction combination.
Original sequence: Changed to: slot 4: ... slot 4: ... slot 5: addiu r5, r3, 24 slot 5: addiu r5, r3, 24 slot 6: addiu r9, r9, 4 slot 6: sw_loop r9, r9, 4 slot 7: bne r12, r9, slot 6 slot 7: bne r12, r9, slot 6 slot 8: sw r6, -4(r9) slot 8: sw r6, -4(r9) slot 9: subu r2, r3, 32 slot 9: subu r2, r3, 32 <--- (*) slot 10: ... slot 10: ...
(*) If the sw_loop code is executed, then execution continues here (&ic[3]).
The sw_loop is then implemented as follows:
	/*
	 *  sw_loop:
	 *
	 *  s:	addiu	rX,rX,4			rX = arg[0] and arg[1]
	 *	bne	rY,rX,s  (or rX,rY,s)	rt=arg[1], rs=arg[0]
	 *	sw	rZ,-4(rX)		rt=arg[0], rs=arg[1]
	 */
	X(sw_loop)
	{
		MODE_uint_t rX = reg(ic->arg[0]), rZ = reg(ic[2].arg[0]);
		uint64_t *rYp = (uint64_t *) ic[1].arg[0];
		MODE_uint_t rY, bytes_to_write;
		unsigned char *page;
		int partial = 0;
		page = cpu->cd.mips.host_store[rX >> 12];
		/*  Fallback:  */
		if (cpu->delay_slot || page == NULL || (rX & 3) != 0 || rZ != 0) {
			instr(addiu)(cpu, ic);
			return;
		}
		if (rYp == (uint64_t *) ic->arg[0])
			rYp = (uint64_t *) ic[1].arg[1];
		rY = reg(rYp);
		bytes_to_write = rY - rX;
		if ((rX & 0xfff) + bytes_to_write > 0x1000) {
			bytes_to_write = 0x1000 - (rX & 0xfff);
			partial = 1;
		}
		memset(page + (rX & 0xfff), 0, bytes_to_write);
		reg(ic->arg[0]) = rX + bytes_to_write;
		cpu->n_translated_instrs += bytes_to_write / 4 * 3 - 1;
		cpu->cd.mips.next_ic = partial?
		    (struct mips_instr_call *) &ic[0] :
		    (struct mips_instr_call *) &ic[3];
	}
(Note: This is from the old C dyntrans system, not the 0.6.x dyntrans implementation.)
For large memsets, calling the native/system memset implementation will most likely be a lot faster than emulating the MIPS memset one instruction at a time.
The fallback check makes sure that the rest of the code in sw_loop will be able to run. The implementation only supports zero-fill memsets, so if rZ is non-zero (or some other dangerous condition is present), then we do fall-back to addiu. This is done by explicitly calling instr(addiu)(cpu, ic); and then returning.
(The example above is 32-bit specific, and only works if a direct page address can be obtained via cpu->cd.mips.host_store.)
To find out which such instruction combinations to implement, real-world code (such as a complete guest operating system with suitable user applications) should be executed with special instruction statistics gathering, and then the most common instructions can be retrieved from that statistics.
Native code generation was included in GXemul before 0.4.x, but only for i386 and Alpha hosts, but that did not feel clean and portable enough to work as a long-term solution. Time has proved this right; i386 has been superseded by amd64, and the Alpha architecture has unfortunately been killed. The current implementation instead focuses on an having an intermediate representation that can be executed regardless of what the native architecture is.
In theory, it would be possible to add native code generation again, as long as that generated code abides to the C or C++ ABI on the host.