|
|
 | | From: | ed_davis2 | | Subject: | virtual machine efficiency | | Date: | 29 Dec 2004 01:49:17 -0500 |
|
|
 | I have developed a simple stack based virtual machine. I would like it to be as efficient as possible, in terms of both speed and object code size. Size, as in size of the object code the VM processes (byte-code), and not necessarily the size of the VM.
But I've run into one of those situations where I can't figure out how to have both.
All of the opcodes are one byte in size. Given the instruction:
push address
'push' takes one byte, and the address takes 4 bytes. If I pack the code into a byte array, this takes 5 bytes. However, now that means that address isn't on a word boundary. If I load 'address' using:
unsigned char *code;
operand = *(int *)code;
I incur a speed hit on many processors, and it may even fail on others, since code probably isn't suitably aligned.
But if I use:
memcpy(&operand, code, sizeof(operand));
or
operand = (code[pc]) | (code[pc + 1] << 8) | (code[pc + 2] << 16) | (code[pc + 3] << 24);
I don't incur the miss-aligned speed hit, but it is nowhere as fast as "operand = *(int *)code;" could be, if code were suitably aligned.
I could make all opcodes a word in size, and make the code array an array of integers. But that would dramatically increase the size of the byte-code required by the VM.
Is there a way I can have both fast loading of operands, and small byte-code size? [Well, I suppose you could do halfword alignment, or you could split the code into two separate sections, a list of bytes for the opcodes, and a list of words for the operands. -John]
|
|
 | | From: | Chris Dollin | | Subject: | Re: virtual machine efficiency | | Date: | 12 Jan 2005 22:59:08 -0500 |
|
|
 | ed_davis2 wrote:
> I have developed a simple stack based virtual machine. I would like > it to be as efficient as possible, in terms of both speed and object > code size. Size, as in size of the object code the VM processes > (byte-code), and not necessarily the size of the VM.
(snips)
The last VM I built didn't allow you to embed large values into the opcode stream. Instead each procedure had an associated constant table, and you referred to large values by their index in the table.
Normally for loading big constants there was an eight-bit index field (all instructions were 16 bits wide, with a 5+3 opcode, where the 3 bits might be an additional immediate value, or a sub-opcode), but there was provision for handling the contextually-rare cases where there were more than 256 big constants.
16-bit instructions turned out to be an interesting and occasionally frustrating choice, as was the decision to try and combine stack-based and register-aka-local-variable instructions. That the machine had *two* stacks [1] - one for values, one for locals/return addresses - was an automatic consequence of umpteen years writing Pop11 (and similar) code.
[1] Well, two per thread. -- Chris "electric hedgehog" Dollin
|
|
 | | From: | cr88192 | | Subject: | Re: virtual machine efficiency | | Date: | 14 Jan 2005 00:40:06 -0500 |
|
|
 | "Chris Dollin" wrote in message > ed_davis2 wrote: >> I have developed a simple stack based virtual machine. I would like >> it to be as efficient as possible, in terms of both speed and object >> code size. Size, as in size of the object code the VM processes >> (byte-code), and not necessarily the size of the VM. > > (snips) > > The last VM I built didn't allow you to embed large values into the > opcode stream. Instead each procedure had an associated constant > table, and you referred to large values by their index in the table.
similar to mine.
> Normally for loading big constants there was an eight-bit index field > (all instructions were 16 bits wide, with a 5+3 opcode, where the 3 > bits might be an additional immediate value, or a sub-opcode), but > there was provision for handling the contextually-rare cases where > there were more than 256 big constants.
my opcodes are based on words, though a little different: 6.10: length, opcode. followed by any indices or addresses.
though in most cases optional, sometimes the length field has relevance...
> 16-bit instructions turned out to be an interesting and occasionally > frustrating choice, as was the decision to try and combine stack-based > and register-aka-local-variable instructions. That the machine had > *two* stacks [1] - one for values, one for locals/return addresses - was > an automatic consequence of umpteen years writing Pop11 (and similar) > code.
mine also has two stacks: the value stack; and a kind of opaque meta-stack used for control.
locals were stored in their own space which is indirectly referenced by the control stack (variables and similar are accessed with their own opcodes).
I like my approach personally (if albeit it uses a little extra space sometimes).
|
|
 | | From: | Chris Dollin | | Subject: | Re: virtual machine efficiency | | Date: | 15 Jan 2005 20:52:10 -0500 |
|
|
 | cr88192 wrote:
> "Chris Dollin" wrote in message
>> 16-bit instructions turned out to be an interesting and occasionally >> frustrating choice, as was the decision to try and combine stack-based >> and register-aka-local-variable instructions. That the machine had >> *two* stacks [1] - one for values, one for locals/return addresses - was >> an automatic consequence of umpteen years writing Pop11 (and similar) >> code.
> mine also has two stacks: the value stack; and a kind of opaque > meta-stack used for control.
> locals were stored in their own space which is indirectly referenced > by the control stack (variables and similar are accessed with their > own opcodes).
It's important to note that the two stacks in my VM are decoupled and not just a convenience - the value stack doesn't get cut back just because you return from a procedure, for example, and the value stack can grow arbitrarily large during a procedure's execution.
It's not quite as free-and-easy as Pop11's open stack, partly because one of the points of designing Spice [no, not that one] was to get most of the benefits of the open-stack model without the gotchas - although most of the constraints are imposed by code generation rather than by the instruction set. (Although one of the procedure-entry instructions explicitly transfers values off the value stack into the control stack.)
-- Chris "electric hedgehog" Dollin
|
|
 | | From: | Lars Duening | | Subject: | Re: virtual machine efficiency | | Date: | 30 Dec 2004 01:06:13 -0500 |
|
|
 | ed_davis2 wrote:
> [push address] takes one byte, and the address takes 4 bytes. > If I pack the code into a byte array, this takes 5 bytes. However, > now that means that address isn't on a word boundary. If I load > 'address' using: > > unsigned char *code; > > operand = *(int *)code; > > I incur a speed hit on many processors, and it may even fail on > others, since code probably isn't suitably aligned.
As a first step I would measure how much of a speed penalty you actually incur by having to extract the adress data from out of the bytecode (and yes, you will have to extract it for proper alignment).
Especially if your compiler recognizes memcpy() and compiles it into optimized code, you might find that for example crossing the line boundaries in your processor's L1 cache has greater impact than the lack of alignment of single operands.
-- Lars Duening; lars@bearnip dot com
|
|
 | | From: | John R. Strohm | | Subject: | Re: virtual machine efficiency | | Date: | 1 Jan 2005 18:02:44 -0500 |
|
|
 | "ed_davis2" said: > I have developed a simple stack based virtual machine. I would like > it to be as efficient as possible, in terms of both speed and object > code size. Size, as in size of the object code the VM processes > (byte-code), and not necessarily the size of the VM.
This thread has been running for a while, and a key point seems not to have arisen.
The efficiency of the interpreter IN PRACTICE will depend HEAVILY on the actual instruction stream fed to it.
If almost every instruction is a LOAD or STORE that specifies an absolute address, then efficiency of the byte-code engine will be dominated by efficiency of the absolute load and store primitives. However, if the application that is running on the system is, say, a pixel-cruncher, or a radar signal processor doing massive FFTs, the absolute load and store primitives are not going to contribute very much at all to the efficiency of the engine, AND IT IS A WASTE OF PROGRAMMER BRAIN CYCLES TO OPTIMIZE THE COLD SPOTS.
In general, it is impossible to optimize everything absolutely.
What I would suggest you consider is adding some "index registers" or "address registers" to your interpreter, and add "load address and load" and "load address and store" primitives. These instructions would load the target absolute address into one of the address registers, and then load or store the top-of-stack through the address register. Maybe you'd want to provide for an offset, absolute, or even indexed.
Then whatever generates code to run on the virtual machine can optimize e.g. constant and temporary placement, to get leverage off the address registers and the indexed indirect load/store operations. [Unless you plan to use a JIT compiler to translate your VM code into machine code, I'd think the higher level the operators, the better, to minimize the number of ops to decode. -John]
|
|
 | | From: | Rafael 'Dido' Sevilla | | Subject: | Re: virtual machine efficiency | | Date: | 30 Dec 2004 00:55:59 -0500 |
|
|
 | One thing you may want to consider is altering your architecture from a stack-based one to a memory transfer architecture. The Dis virtual machine used on the Inferno operating system is one example. This paper by Rob Pike and Phil Winterbottom gives some of the basic ideas:
http://www.vitanuova.com/inferno/papers/hotchips.html
It's something worth exploring I think.
|
|
 | | From: | Calum Grant | | Subject: | Re: virtual machine efficiency | | Date: | 30 Dec 2004 22:48:02 -0500 |
|
|
 | ed_davis2 wrote: > I have developed a simple stack based virtual machine. I would like > it to be as efficient as possible, in terms of both speed and object > code size. Size, as in size of the object code the VM processes > (byte-code), and not necessarily the size of the VM.
> Is there a way I can have both fast loading of operands, and small > byte-code size?
1) You could inserts nops before the op-code to ensure that the value was always word-aligned. However this would be too slow since your interpreter would need to handle the nops.
2) You could still align your data on a word-boundary. The interpreter would notice the alignment of the op-code, and implicitly skip 0-3 bytes. The compiler would need to insert appropriate padding to ensure the correct alignment.
case OP_LD: if(IP&3) IP = (IP&~3) + 4; operand = *(int*)IP;
3) You could have variable-width op-codes. Have 4 load instructions with widths 1,2,3 and 4 that align the data following the op-code. But I don't see much advantage over 2).
case OP_LD0: operand = *(int*)IP; IP += 4; ... break; case OP_LD1: IP ++; operand = *(int*)IP; IP += 4; ... break; case OP_LD2: IP += 2; operand = *(int*)IP; IP += 4; ... break; case OP_LD3: IP += 3; operand = *(int*)IP; IP += 4; ... break;
4) You could store values out-of-stream. But this is slightly less efficient.
5) (copied from Anton Ertl's post) reorder op-codes and move word-aligned data away from its op-code.
I think in the cases of 2) and 3), although there is occasional padding, the interpreter has much less to do than in the more complex schemes.
Calum
|
|
 | | From: | cr88192 | | Subject: | Re: virtual machine efficiency | | Date: | 31 Dec 2004 13:06:27 -0500 |
|
|
 | "Calum Grant" wrote in message > ed_davis2 wrote: >> I have developed a simple stack based virtual machine. I would like >> it to be as efficient as possible, in terms of both speed and object >> code size. Size, as in size of the object code the VM processes >> (byte-code), and not necessarily the size of the VM. > >> Is there a way I can have both fast loading of operands, and small >> byte-code size? > > 1) You could inserts nops before the op-code to ensure that the value > was always word-aligned. However this would be too slow since your > interpreter would need to handle the nops.
yes. maybe useful for hw, I doubt so for interpreters...
> 2) You could still align your data on a word-boundary. The interpreter > would notice the alignment of the op-code, and implicitly skip 0-3 > bytes. The compiler would need to insert appropriate padding to ensure > the correct alignment. > > case OP_LD: > if(IP&3) IP = (IP&~3) + 4; > operand = *(int*)IP;
yes.
I typically realign slightly differently, eg: IP=(IP+3)&(~3); the check seems like a slight timewaster, it is imo better just to realign and not worry about it than check and realign, as presumably the realign would be needed in 75% of the calls anyways.
> 3) You could have variable-width op-codes. Have 4 load instructions > with widths 1,2,3 and 4 that align the data following the op-code. But > I don't see much advantage over 2).
yes, I don't see an advantage either.
> case OP_LD0: > operand = *(int*)IP; > IP += 4; > ... > break; > case OP_LD1: > IP ++; > operand = *(int*)IP; > IP += 4; > ... > break; > case OP_LD2: > IP += 2; > operand = *(int*)IP; > IP += 4; > ... > break; > case OP_LD3: > IP += 3; > operand = *(int*)IP; > IP += 4; > ... > break; > > 4) You could store values out-of-stream. But this is slightly less > efficient.
not sure why really. I would guess this depends on the implementation of the interpreter, and mine typically being unoptimized enough to where it shouldn't matter (of all things the interpreter looks up binding dynamically).
to me though this still seems like a "splitting hairs case": p=table[ip[1]]; vs, eg: i=*(int *)(ip+1);
or: ip1=(byte *)(((long)ip+4)&(~3)); i=*(int *)ip;
I personally see other strong reasons for keeping values out of stream, with the only main detractor I can think of being that one has to keep tract of the out of band values table...
> 5) (copied from Anton Ertl's post) reorder op-codes and move > word-aligned data away from its op-code.
imo this idea is cool, but I can imagine it being difficult to work with. it would likely make a lot more sense, eg, if one could "packet" opcodes, but imo this doesn't make much sense in an interpreter (it being serial and all).
> I think in the cases of 2) and 3), although there is occasional > padding, the interpreter has much less to do than in the more complex > schemes.
personally I chose something like 4, imo it makes the most sense, eg, in the dynamicly typed and dynamically compiled case, and the case where opcodes don't really have associated types.
I think an important point is whether one views it as more sensible to have a generic "load" operator, or "load.i2", "load.i4", "load.f8", ...
but whatever, I may just be stupid.
|
|
 | | From: | Anton Ertl | | Subject: | Re: virtual machine efficiency | | Date: | 30 Dec 2004 00:57:29 -0500 |
|
|
 | "ed_davis2" writes: >I have developed a simple stack based virtual machine. I would like >it to be as efficient as possible, in terms of both speed and object >code size. Size, as in size of the object code the VM processes >(byte-code), and not necessarily the size of the VM. > >But I've run into one of those situations where I can't figure out how >to have both.
You can't. The fastest techniques (read some of my papers) use more memory than some other techniques. However, there are some techniques that are good compromises between speed and space consumption; read:
@InProceedings{proebsting95, author = "Todd A. Proebsting", title = "Optimizing an {ANSI~C} Interpreter with Superoperators", crossref = "popl95", pages = "322--332", annote = "Interpreter performance is optimized by combining operators during code generation, when they are still organized as trees. So a different, optimized interpreter is used for each program. Speedups of 1.8--3.1 are achieved, but this is probably strongly dependent on the original set of operators. The paper uses lccs intermediate code operators \cite{fraser&hanson91a}." }
@Proceedings{popl95, booktitle = "Principles of Programming Languages (POPL '95)", title = "Principles of Programming Languages (POPL '95)", year = "1995", key = "POPL '95" }
@string{spe="Software---Practice and Experience"}
@Article{hoogerbrugge+99, author = "Jan Hoogerbrugge and Lex Augusteijn and Jeroen Trum and Rik van de Wiel", title = "A code compression system based on pipelined interpreters", journal = spe, volume = "29", number = "11", pages = "1005--1023", month = sep, year = "1999", OPTannote= "" }
@InProceedings{lattendresse&feeley03, author = {Mario Latendresse and Marc Feeley}, title = {Generation of Fast Interpreters for {Huffman} Compressed Bytecode}, booktitle = {Interpreters, Virtual Machines and Emulators (IVME~'03)}, pages = {32--40}, year = {2003}, url1 = {http://www.complang.tuwien.ac.at/anton/ivme03/proceedings/latendresse.ps.gz}, url2 = {http://portal.acm.org/ft_gateway.cfm?id=858574&type=pdf}, abstract = {Embedded systems often have severe memory constraints requiring careful encoding of programs. For example, smart cards have on the order of 1K of RAM, 16K of non-volatile memory, and 24K of ROM. A virtual machine can be an effective approach to obtain compact programs but instructions are commonly encoded using one byte for the opcode and multiple bytes for the operands, which can be wasteful and thus limit the size of programs runnable on embedded systems. Our approach uses canonical Huffman codes to generate compact opcodes with custom-sized operand fields and with a virtual machine that directly executes this compact code. We present techniques to automatically generate the new instruction formats and the decoder. In effect, this automatically creates both an instruction set for a customized virtual machine and an implementation of that machine. We demonstrate that, without prior decompression, fast decoding of these virtual compressed instructions is feasible. Through experiments on Scheme and Java, we demonstrate the speed of these decoders. Java benchmarks show an average execution slowdown of 9%. Compression factors highly depend on the original bytecode and the training sample, but typically vary from 30% to 60%. } }
>All of the opcodes are one byte in size. Given the instruction: > >push address > >'push' takes one byte, and the address takes 4 bytes. If I pack >the code into a byte array, this takes 5 bytes. However, now >that means that address isn't on a word boundary. If I load >'address' using: > >unsigned char *code; > >operand = *(int *)code; > >I incur a speed hit on many processors, and it may even fail on >others, since code probably isn't suitably aligned. > >But if I use: > >memcpy(&operand, code, sizeof(operand)); > >or > >operand = >(code[pc]) | >(code[pc + 1] << 8) | >(code[pc + 2] << 16) | >(code[pc + 3] << 24); > >I don't incur the miss-aligned speed hit, but it is nowhere as >fast as "operand = *(int *)code;" could be, if code were suitably >aligned. > >I could make all opcodes a word in size, and make the code array an >array of integers. But that would dramatically increase the size of >the byte-code required by the VM.
Not necessarily. If you use superinstructions, you may have relatively few VM instructions compared to operands; with dynamic superinstructions you can get it down to one VM instruction per VM branch target (but you need space for the dynamic superinstruction code; I don't know if you consider that part of the VM, which is not as sensitive for you).
>Is there a way I can have both fast loading of operands, and >small byte-code size?
This particular problem has been addressed by Ogata et al. [ogata+02], but I am not sure that I would use it if I could define my own bytecode (they wanted to do JVM interpretation without rewriting the JVM code).
Other techniques that come to mind:
- Use byte-sized operands where possible.
- Collect VM instructions and byte-sized operands within words, but let word-sized operands start at the next word; this is somewhat like the two-stream (sections) idea suggested by our esteemed moderator, but reduces the overhead of dealing with two instruction pointers on VM control flow. Example of such VM code (for a machine with 4-byte words):
inst 1 | byte-operand 1 of inst 1 | inst 2 | inst 3 word-operand 1 of inst 1 word-operand 2 of inst 1 word-operand 1 of inst 2 word-operand 1 of inst 3 byte-operand 1 of inst 3 | inst 4 ... ....
Various methods for implementing such a scheme efficiently come to mind, but I won't go into that now.
@InProceedings{ogata+02, author = {Kazunori Ogata and Hideaki Komatsu and Toshio Nakatani}, title = {Bytecode Fetch Optimization for a {Java} Interpreter}, crossref = {asplos02}, pages = {58--67}, annote = {The paper presents a Java Bytecode interpreter for the PowerPC architecture and some optimizations and evaluates them: stack caching (a few variations), position-based handler customization and position-based speculative decoding (software pipelining of the interpreter). Position-based handler customization deals with different alignments of bytecodes by having four states in the interpreter for the different alignments, each state with its own specialized copy of the interpreter. For stack caching they evaluated a fixed one-TOS-register organization with write-through caching (5.6\% speedup over base), and dynamic stack caching with two registers (3 states, 7/9% speedup over base), and used the write-through organization for further experiments; write-through is not compared empirically to write-back. Position-based handler customization buys another 19\%, and software pipelining an additional 3.4\%. The paper also presents results on memory traffic (both I and D).} }
@Proceedings{asplos02, title = "Architectural Support for Programming Languages and Operating Systems (ASPLOS-X)", booktitle = "Architectural Support for Programming Languages and Operating Systems (ASPLOS-X)", year = "2002", key = "ASPLOS-X" }
Followups set to comp.compilers.
- anton -- M. Anton Ertl anton@mips.complang.tuwien.ac.at http://www.complang.tuwien.ac.at/anton/home.html
|
|
 | | From: | VBDis | | Subject: | Re: virtual machine efficiency | | Date: | 30 Dec 2004 01:02:08 -0500 |
|
|
 | "ed_davis2" schreibt: >Is there a way I can have both fast loading of operands, and small >byte-code size?
Some ideas:
1) Align arguments to word boundaries.
You can align the arguments to the next word boundary, e.g. by inserting NOP instruction bytes, or by SKIP1 until SKIPn instructions that "jump" forward immediately to the next instruction byte, residing just before the word boundary.
It also will be possible to have multiple opcodes for the same instructions, which take arguments, with each of these instructions including an specific increment of the PC to the next word boundary before fetching the argument from that address. This will move the calculation of the increment from runtime (interpretation) into "compile" time, when the code is stored in memory.
But I assume that processing these instructions, or calculating the address of the next word boundary, or even constructing the value from multiple byte fetches, will take much more time than reading a misaligned argument from memory. The excess data from the added fetch typically will still be available (in the cache) when the next instruction is read, so that IMO the penalty of a misaligned read is neglectable, in detail in an almost sequential access to the data.
2) Separate arrays for byte code and word arguments.
I also thought of separate code and argument areas, as John suggests, but then all jumps and calls must specify the addresses or indices in both areas, what again increases the memory usage and the time to handle the duplicate position counters. In addition more delays may occur for fetching data from different (nonsequential) addresses.
Even if I know that on some machines misaligned reads are expensive, does anybody know of a concrete machine where reading a misaligned "word" will definitely take longer than reading the according number of individual bytes???
DoDi
|
|
 | | From: | Chris F Clark | | Subject: | Re: virtual machine efficiency | | Date: | 30 Dec 2004 01:05:45 -0500 |
|
|
 | ed_davis2 asked: > I have developed a simple stack based virtual machine. I would like > it to be as efficient as possible, in terms of both speed and object > code size. Size, as in size of the object code the VM processes > (byte-code), and not necessarily the size of the VM.
You'll always have to make some tradeoffs between size and space. However, you probably can do quite well, if you note that a lot of operands are repeated. So, are a lot of opcodes and code sequences, but we'll concentrate on the operands first.
For example, there are probably on the order of 255 common operands in a code sequence, so use something akin to sliding window data compression to capture that fact. Here is one possibility:
struct operation { char opcode; char operand; };
Use the operand values 0-254 to refer to the most recent 255 operands (as appearing in the code. Use operand value 255 to indicate a "new" operand being takens from a list of word aligned locations. When you fetch a new operand, retire the oldest operand from the most-recently used list. You also increment the pointer into the operand list, so that the next "new" operand fetches a different element from the list.
address operand_mru[255]; // 0-254 address new_operand_list[]; // 255 copies 1 from here into the mrs
There are lots of heuristics you can build into the 255 most recent operand memory. You can think of it as a cross between a register file and a cache.
For example, you can preload the cache with the operand list. This saves the 1st 255 uses of the "new" operand. You can also load/store the cache on a call/return by call return basis. Note, you probably don't want to actually copy the cache on the call/return, but instead keep a cac he pointer. This is a place where profiling will pay-off. Measure a couple of different strategies and keep measuring as one tunes.
Next, since most routines do not have 255 local variables (unless you've done extensive inlining), you can consider making the cache smaller, and using some of the operand values to represent function parameters (carving out 16 or even 32 function parameters out of the cache won't significant impact the cache, and will handle almost any calls).
You can even compress more, say your profiling suggests that you can get away with 24 cache entries, 7 paramete locations, and 1 "new" operand, that fits in 5 bits. Then, if can find 7 common opcodes (3 more bits(, you can pack the entire opcode and operand into 1 byte.
You can also compress the operation spcae the same way by reserving, some number of opcodes as "compresses" opcodes, that do a table lookup into an instruction cache. If the instrcution cache can hold sequences of instructions, you are close to Ziv-Lempel eocnding.
Back to operand encoding, it is worth noting that the entries in the "address" mrs et. al. don't have to be just simple pointers. You can re-invent the concepts of indirect and indexed addressing (and even auto-increments if it suits you) in your virtual machine. RISC hardware architectures tended to drop these concepts. However, the tradeoffs in a hardware implementation are different than a software one. Again, measure and tune.
Of course, the farther you push this. the farther you get from a simple VM.
Going the ooposite direction. One can borrow a concept that hardware designers have used, aligning instructions with no-ops. If you want your essential instructions to be half-word (or byte) aligned, but you want your insructions with 32 bit operands to have the operands algined on 32 bit boundaries. First, determine what opcode length/position would put the operand on a 32 bit boundary. For example, if you decide to have byte boundary instruction with a 1 byte opcode field, putting an instruction on the 3rd byte should then align the following opcode field. If you wnat to generate such an instruction and you aren't on the appropriate boundary, you can either insert no-op instructions to get to ehte deisred alginment, or pull instructions from further in the instruction list to try to fill the space. Note, that doing so, is potentially difficult, and whole phases are written to "schedule" instructions to try to minimize various penalties. Now many of those penlties don't apply, since your VM is likely to execute each instruction in serial, rather than doing multi-issue instructions that hardware can do.
However, as more hardare gets to be multi-issue and out-of-order, the things which are cheap and parallel in hardware become more and more applicable to software. Not long ago, I would not have recommended using an 8 bit opcode and a 24 bit address in a VM, because the mask instruction to get a 32 bit pointer out of the 24 bit address, might have been expensive. These days I would recommend measuring the effect before making the decision. I suspect these days on any P 4 or better class machine, the mask is free. Of course, on the same class of machine I bet the mis-alignment penalty is nil also. The real measure is the amount of work one needs to do, especially in terms of memory traffic and most importantly non-cache memory traffic.
Hope this helps, -Chris
***************************************************************************** Chris Clark Internet : compres@world.std.com Compiler Resources, Inc. Web Site : http://world.std.com/~compres 23 Bailey Rd voice : (508) 435-5016 Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
|
|
 | | From: | cr88192 | | Subject: | Re: virtual machine efficiency | | Date: | 31 Dec 2004 13:07:09 -0500 |
|
|
 | "Chris F Clark" wrote in message > ed_davis2 asked: >> I have developed a simple stack based virtual machine. I would like >> it to be as efficient as possible, in terms of both speed and object >> code size. Size, as in size of the object code the VM processes >> (byte-code), and not necessarily the size of the VM. > > You'll always have to make some tradeoffs between size and space. > However, you probably can do quite well, if you note that a lot of > operands are repeated. So, are a lot of opcodes and code sequences, > but we'll concentrate on the operands first. > > For example, there are probably on the order of 255 common operands in > a code sequence, so use something akin to sliding window data > compression to capture that fact. Here is one possibility: > > struct operation { > char opcode; > char operand; > }; > > Use the operand values 0-254 to refer to the most recent 255 operands > (as appearing in the code. Use operand value 255 to indicate a "new" > operand being takens from a list of word aligned locations. When you > fetch a new operand, retire the oldest operand from the most-recently > used list. You also increment the pointer into the operand list, so > that the next "new" operand fetches a different element from the list. > > address operand_mru[255]; // 0-254 > address new_operand_list[]; // 255 copies 1 from here into the mrs > > There are lots of heuristics you can build into the 255 most recent > operand memory. You can think of it as a cross between a register > file and a cache.
yes, this sounds pretty cool really, especially if one's oprands may be complex or allocated types.
> For example, you can preload the cache with the operand list. This > saves the 1st 255 uses of the "new" operand. You can also load/store > the cache on a call/return by call return basis. Note, you probably > don't want to actually copy the cache on the call/return, but instead > keep a cac he pointer. This is a place where profiling will > pay-off. Measure a couple of different strategies and keep measuring > as one tunes. > > Next, since most routines do not have 255 local variables (unless > you've done extensive inlining), you can consider making the cache > smaller, and using some of the operand values to represent function > parameters (carving out 16 or even 32 function parameters out of the > cache won't significant impact the cache, and will handle almost any > calls). > > You can even compress more, say your profiling suggests that you can > get away with 24 cache entries, 7 paramete locations, and 1 "new" > operand, that fits in 5 bits. Then, if can find 7 common opcodes (3 > more bits(, you can pack the entire opcode and operand into 1 byte. > > You can also compress the operation spcae the same way by reserving, > some number of opcodes as "compresses" opcodes, that do a table lookup > into an instruction cache. If the instrcution cache can hold > sequences of instructions, you are close to Ziv-Lempel eocnding.
also cool imo.
could make sense for saving space, just I feel a little unsure of the overhead or usefulness in many cases.
> Back to operand encoding, it is worth noting that the entries in the > "address" mrs et. al. don't have to be just simple pointers. You can > re-invent the concepts of indirect and indexed addressing (and even > auto-increments if it suits you) in your virtual machine. RISC > hardware architectures tended to drop these concepts. However, the > tradeoffs in a hardware implementation are different than a software > one. Again, measure and tune.
well, I tend to use a large number of compound instructions myself...
> Of course, the farther you push this. the farther you get from a > simple VM.
my vm is still fairly simple, and specialized for a very specific task. at least it is a little more general than my previous one, but my previous one didn't even really use bytecode...
> Going the ooposite direction. One can borrow a concept that hardware > designers have used, aligning instructions with no-ops. If you want > your essential instructions to be half-word (or byte) aligned, but you > want your insructions with 32 bit operands to have the operands > algined on 32 bit boundaries. First, determine what opcode > length/position would put the operand on a 32 bit boundary. For > example, if you decide to have byte boundary instruction with a 1 byte > opcode field, putting an instruction on the 3rd byte should then align > the following opcode field. If you wnat to generate such an > instruction and you aren't on the appropriate boundary, you can either > insert no-op instructions to get to ehte deisred alginment, or pull > instructions from further in the instruction list to try to fill the > space. Note, that doing so, is potentially difficult, and whole > phases are written to "schedule" instructions to try to minimize > various penalties. Now many of those penlties don't apply, since your > VM is likely to execute each instruction in serial, rather than doing > multi-issue instructions that hardware can do.
yes...
> However, as more hardare gets to be multi-issue and out-of-order, the > things which are cheap and parallel in hardware become more and more > applicable to software. Not long ago, I would not have recommended > using an 8 bit opcode and a 24 bit address in a VM, because the mask > instruction to get a 32 bit pointer out of the 24 bit address, might > have been expensive. These days I would recommend measuring the > effect before making the decision. I suspect these days on any P 4 or > better class machine, the mask is free. Of course, on the same class > of machine I bet the mis-alignment penalty is nil also. The real > measure is the amount of work one needs to do, especially in terms of > memory traffic and most importantly non-cache memory traffic.
err, I sort of doubt masking and misalignment are "free" per-se, but they are not that expensive. afaik, even on older processors masking has been pretty cheap as well.
yeah, keeping the working set small is an issue, especially with a dynamically typed language that tends to tear through a lot of heap, but then again, at this point I don't think cache misses are the problem...
at least in some very specific cases my interpreter can run in constant memory, which is something at least...
|
|
 | | From: | cr88192 | | Subject: | Re: virtual machine efficiency | | Date: | 30 Dec 2004 01:03:23 -0500 |
|
|
 | "ed_davis2" wrote >I have developed a simple stack based virtual machine. I would like > it to be as efficient as possible, in terms of both speed and object > code size. Size, as in size of the object code the VM processes > (byte-code), and not necessarily the size of the VM. > > But I've run into one of those situations where I can't figure out how > to have both.
one is limited really.
description of problems of mixing addresses/literals with opcodes
I once started running into this issue, along with the occurance that mixing addresses in with bytecode is just imo bad. also, there was the issue that 256 just seems a bit small for an opcode space, and typical chaining approaches are imo ugly.
So, how I did it was to define the bytecode as an array of unsigned shorts. each opcode has a layout, namely the lower 12 bits give the opcode number and the upper 4 give the length. this is followed by any oprands or such (up to about 30 bytes).
Now, I did not embed values directly into most opcodes, instead, I defined another structure as a kind of out of band index table.
Keeping the table out of band has a few uses: I can put whatever in it and not worry about type; I am a lot more free of the compiletime/runtime issue wrt immediates; it avoids unnesessary duplication of literals; it makes operations that can recieve different literal types a lot more straightforward; it makes gc or storing/transmitting code more straightforward; ...
So, pretty much all the literals in bytecode are shorts (though some are signed, but this is a seperate issue). I divide up my bytecode code at the function level, and more or less assumes that no single function will exceed about 64kB (yes, jumps are short aligned, but are limited to +-32k shorts).
similarly, my opcodes are often fairly high-level and abstract as well, but this depends a lot on the opcode.
or such...
|
|
|