Topic: I do not understand Intel instruction encoding. But maybe this is where to look?

Offline dogman

  • Jr. Member
  • *
  • Posts: 51
I have the "2B" manual opened to Appendix B.

Sgt Hulka: "Am I to understand you men completed your army training on your own?"

Is there a formulaic way of encoding or must people who write assemblers and other tools for Intel do encoding and decoding based on these tables and using the other data in the appendix (opcode maps etc.)?

Offline Rob Neff

  • Forum Moderator
  • Full Member
  • *****
  • Posts: 429
  • Country: us
The MOV, ADD, JE, etc. instructions are simply mnemonics for the associated opcodes, which the assembler outputs at the assembly stage.  It is the binary opcodes that are executed on the CPU architecture in question.  You could write an assembler that used the word FISH to represent the MOV instruction but I doubt it would do you, or anyone else for that matter, any good.

The opcodes are exposed by the CPU OEM and you must use them exactly the way they've stated if you want your programs to run on that architecture.  Accordingly, Intel has a large set of opcodes that assemblers like nasm, masm, tasm, gas, etc., generate based on the mnemonics used in the source file.  Similarly, ARM CPUs have opcodes, as do IBM mainframes.  Unfortunately for us programmers the opcodes are not the same across all architectures.  This is why assembly language is considered non-portable: the source code you write for one CPU architecture will not work on a different CPU architecture.

Very few people know the rhyme or reason of why the opcode is the binary number that it is but the folks at Intel sure do.  But, if you are writing an assembler then you need to adhere to their way of doing things.  This is also why you don't see very many cross-assemblers - attempting to manage syntactical differences in source along with the requisite differences in binary output is somewhat insane.  For portability it's much better to use a higher level language ( C for example ) where you can abstract away as many of those differences as you can.

I hope I didn't confuse you further!

Offline dogman

Hi Rob. Thank you. My question is specifically on how Intel instructions are encoded and whether there is an algorithmic way to do that or whether you have to use Intel's tables like the one in the 2B manual.

As a contrast, IBM encoding is dramatically simpler because of the limited number of instruction formats, all fixed lengths with standard operand encoding.

Joe

Offline Rob Neff

If you are looking to write an assembler for the Intel CPU chip-sets I see no better strategy than studying the nasm source code to provide info on how to do just that.  There are multiple lookup tables used that have to be considered.  Unfortunately, due to the variable length opcodes, there is no simple or direct relationship of mnemonic to opcode for many cases ( unlike NOP for example ).

The best thing for you to start with is to understand Intel's ModR/M byte (the Mod/Reg/RM format).  You'll go a long way in Intel assembly if you do.
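
To make that concrete, here is a minimal sketch in Python of how the byte splits into its three fields: two bits of mod, three of reg, three of r/m. The function name `split_modrm` and the `REGS32` list are made up for illustration and are not taken from the nasm source.

```python
# 32-bit general-purpose registers in encoding order (0..7).
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    mod = (byte >> 6) & 0b11    # addressing mode; 0b11 means register-direct
    reg = (byte >> 3) & 0b111   # register operand, or an opcode extension
    rm = byte & 0b111           # register or memory operand, depending on mod
    return mod, reg, rm

mod, reg, rm = split_modrm(0xC1)
print(mod, REGS32[reg], REGS32[rm])  # prints: 3 eax ecx
```

The register names only apply when mod is 0b11; for the other mod values the r/m field selects a memory addressing form, which this sketch does not try to decode.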

Intel takes the view that much programming functionality can be embedded in the opcodes ( AES and SSE are examples ).  Their CPUs are considered CISC ( Complex Instruction Set Computing ). Contrast that ideal with RISC ( Reduced Instruction Set Computing ) where there is a much smaller ( reduced ) instruction set ( opcodes ) which requires you to program much of that functionality yourself.  There are a number of papers online ( google for "cisc vs risc" ) that attempt to weigh the pros and cons of each if you'd like a more detailed comparison.  Having spent most of my professional life in Intel's world I'm obviously rather biased.

Offline dogman

Thank you. I was asking about Intel encoding generally. If I understand your post, you are saying no, there is no algorithmic way, and yes to encode you have to use the tables and opcode maps in the manual, etc.

Offline Rob Neff

Yes, you obviously have to use the opcodes defined by Intel to create binaries that will execute on Intel CPUs.  I'm not quite sure what you mean by "algorithmic way" though.  Most programs larger than "Hello, World" involve implementing algorithms to solve problems.

Offline dogman

Quote from: Rob Neff
Yes, you obviously have to use the opcodes defined by Intel to create binaries that will execute on Intel CPUs.

Yes, that is obvious. I didn't ask that at all. I don't mean to be argumentative but I don't know how I could have asked this more simply than I did in the opening post of the thread.

Quote from: Rob Neff
I'm not quite sure what you mean by "algorithmic way" though.  Most programs larger than "Hello, World" involve implementing algorithms to solve problems.

Correct again! But I wasn't asking about writing a program. I was asking if there was an algorithmic way to encode Intel instructions or whether you have to use the tables in Appendix B of the 2B manual. I found some info on the web that seems to confirm that everything is based on the tables and can't be described algorithmically.
« Last Edit: July 17, 2013, 06:02:26 PM by dogman »

Offline Rob Neff

Quote from: dogman
Thank you. I was asking about Intel encoding generally.

To which I have been patiently responding.  Granted, there has been some side discussion within this thread that may have distracted from the original point.

Quote from: dogman
I don't mean to be argumentative but I don't know how I could have asked this more simply than I did in the opening post of the thread.

And I responded simply.  As you are new here I have no idea of your current knowledge or skill set.  A lot of times we have to get a feel for someone in order to extract out what it is they're really trying to understand and how we can best help them achieve that.  Perhaps I failed in this regard.

Quote from: dogman
Correct again! But I wasn't asking about writing a program.

That's not how I understood it when you wrote the following:

Quote from: dogman
Is there a formulaic way of encoding or must people who write assemblers and other tools...


Offline dogman

Quote from: Rob Neff
And I responded simply.  As you are new here I have no idea of your current knowledge or skill set.  A lot of times we have to get a feel for someone in order to extract out what it is they're really trying to understand and how we can best help them achieve that.

I realize that and the background info you put in threads helps everybody who reads the forums. I learned a lot from many of your other posts.

But sometimes a banana is just a banana :P

Offline Frank Kotler

  • NASM Developer
  • Hero Member
  • *****
  • Posts: 2667
  • Country: us
Let me try it. Joe is "new" here (at least in his present incarnation), "new" to Nasm, and "new" to x86. But he is well familiar with other CPUs and other assembly languages.

I know only x86, so it seems "normal" to me, but I understand that people coming from other architectures are appalled by how "scattered" x86 is. As I understand it, there is a "scheme" of sorts that determines x86 opcodes, but there are so many exceptions and special cases that the  "table" method works out better. Keep in mind that I don't know the x86 encoding - I just trust Nasm to do the right thing.
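
A minimal sketch of what that mix of "table" and "scheme" can look like, in Python. The opcode byte is looked up, while the ModR/M byte that names the two registers is computed. The table holds only four entries for the "r/m32, r32" form, 32-bit mode is assumed, and the names `OPCODE_RM32_R32` and `encode_reg_reg` are hypothetical; this is not how Nasm organizes its tables.

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

# Table part: opcode byte for the "r/m32, r32" form of each mnemonic.
OPCODE_RM32_R32 = {"add": 0x01, "sub": 0x29, "xor": 0x31, "mov": 0x89}

def encode_reg_reg(mnemonic, dst, src):
    """Encode `mnemonic dst, src` for two 32-bit registers (32-bit mode assumed)."""
    opcode = OPCODE_RM32_R32[mnemonic]                    # looked up in the table
    modrm = (0b11 << 6) | (REG32[src] << 3) | REG32[dst]  # computed from the operands
    return bytes([opcode, modrm])

print(encode_reg_reg("mov", "ebx", "eax").hex())  # prints: 89c3
```

Every other operand form (immediates, memory operands, 8-bit registers, prefixes) would need more table entries and more special cases, which is where the exceptions start to pile up.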

Maybe I should also mention that the folks at news:comp.arch distinguish between the "architecture" and the "hardware" - the former being the ISA that is exposed to us, and the latter being the microcode that actually does the work these days. Essentially, the CPU itself is acting as a "compiler", taking our meticulously hand-crafted assembly language and executing the micro-ops that will do what we asked for.

If this makes it seem like assembly language is pointless unless you enjoy it... there may be something to that...

Best,
Frank


Offline dogman

I really only know one family of CPUs and assembly language well. I'm familiar at the 100,000 foot level with a few others. I wanted to understand the basis of Intel's encoding. I'm surprised to see you say there is a scheme to opcode encoding. Maybe this only applies to the opcodes. My question was about total instruction encoding rather than just the opcode part of it.

When I look at a dump or assembly listing I can read the object code nearly as well (in some cases actually better) as I can read the unassembled source and I trust it a lot more than source. It's critical to us to know that what the guy is running matches the code we have. When I look at an Intel assembly listing I don't understand anything so I was wondering how to possibly go about doing that. I'm sure you guys recognize object code for most of the instructions you use but then I started wondering if that was really possible given the table-driven encoding that seems to be required. This kind of stuff interests me more than the practical how to do X on platform Y. From what I can tell the workflow is totally different but there are a lot of other factors in this too.

To your other comment. The comp.arch guys are not to be trusted from what I can tell. Once you step outside the Intel world or M68K the correctness and reliability of their comments fall off a cliff. As you certainly know, Intel is heavily microcoded with a RISC-like engine underneath, so yes, there is a big difference between the ISA and the hardware in the Intel world. OTOH designs like IBM's were somewhat microcoded in the past, to the point where today's implementation is documented to be more than 75% directly hardwired. The vast majority (possibly all) of commonly-used instructions are implemented directly in hardware. Other popular RISC designs like SPARC and MIPS have very little to no microcode as far as I can tell from looking over Sweetman, Hennessy, Heinrich, and processor manuals put out by Sun, MIPS Inc., etc. and from other design documents, whitepapers, and discussions. So really most of the world (if not most of the volume) doesn't have any practical distinction between the ISA and the implementation. None of this was part of my question or what I wanted to discuss but since you both brought it up I figured I would mention what little I've been able to find out in the past when I was looking into this.

If there are any guys on the forum who know ARM from an engineering level it would be interesting to know if ARM's microarchitecture uses microcode etc.
« Last Edit: July 18, 2013, 09:36:42 AM by dogman »

Offline s_dubrovich

  • Jr. Member
  • *
  • Posts: 8
Quoting Frank:

 As I understand it, there is a "scheme" of sorts that determines x86 opcodes, but there are so many exceptions and special cases that the  "table" method works out better. Keep in mind that I don't know the x86 encoding - I just trust Nasm to do the right thing.


This falls into the area of 'urban legend', but I recall that if you recode the opcode tables for the 8086 in octal, splitting each byte into fields of 2 bits, 3 bits, and 3 bits, then an encoding 'scheme' is much more apparent.  But since then, with so many extensions to thwart that, it doesn't make sense to do so, if it ever did.  Also, there are some instructions that have more than one opcode to do the same thing.  It's been over a decade since I last looked at these issues, so I can't recite examples for you.
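
As a rough illustration of the octal view, here is a minimal Python sketch covering only the eight basic 8086 ALU instructions. In that block the middle octal digit of the opcode byte picks the operation and the low digit picks the operand form. The names `ALU_OPS`, `FORMS` and `describe_alu_opcode` are invented for this example, and low digits 6 and 7 are treated as outside the pattern because those slots hold unrelated instructions.

```python
ALU_OPS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]
FORMS = ["r/m8, r8", "r/m16, r16", "r8, r/m8", "r16, r/m16", "al, imm8", "ax, imm16"]

def describe_alu_opcode(opcode):
    """Name the 8086 ALU instruction for an opcode byte 0XY (octal) with Y <= 5."""
    top, middle, low = opcode >> 6, (opcode >> 3) & 7, opcode & 7
    if top != 0 or low > 5:
        return None  # not part of the regular ALU block
    return f"{ALU_OPS[middle]} {FORMS[low]}"

for opcode in (0x01, 0x29, 0x31, 0x3C):
    print(f"hex {opcode:02X} = octal {opcode:03o} -> {describe_alu_opcode(opcode)}")
```

In hex, 01, 29 and 31 look unrelated; in octal they are 001, 051 and 061, the same operand form with a different operation digit.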

Steve

Offline iVision

  • Jr. Member
  • *
  • Posts: 22
I'm not sure if this helps, but this is what I thought of when reading "Intel instruction encoding":
Quote from: SoulByte

The machine instructions are not completely random, so you don't need to remember every possible combination.
A machine instruction consists of an opcode, direction bit, byte/word bit, some bits determining whether to use RAM or register, and then of course some bits specifying the addresses or registers to use.

If we want to add the two registers EAX and ECX, we'll need the opcode for add, which is '00 00 00'.
Instruction: 00 00 00

The next bit determines the direction. It doesn't change the result of the calculation, since we are just performing an addition, but it changes which register will get the result. We'll put a '1', so that the result is stored in the register we specify first.
Instruction: 00 00 00 1

The next bit determines whether we'll add 8-bit values (al, bl, cl, dl) or 32-bit values (eax, ebx, ecx, edx).
We'll use 32-bit values, so we'll append '1' to our instruction.
Instruction: 00 00 00 1 1

The next two bits decide the addressing mode. When adding two registers, it should be '11'.
Instruction: 00 00 00 1 1 11

There's now only the registers left. You need three bits to specify a register, and we need two of those.
EAX has the value of 000 and ECX has 001.
Instruction: 00 00 00 1 1 11 000 001

The equivalent assembly instruction is: add eax, ecx
The hexadecimal representation is 03C1.
Source: hackforums.net -> Coding -> Assembly Language (dead section)
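
The quoted walk-through can be checked with a few lines of Python. This is only a sketch of the two-register case it describes; the helper name `encode_alu_reg_reg` and its parameter names are made up for illustration. Flipping the direction bit and swapping the two register fields gives a second, equally valid encoding of the same instruction.

```python
def encode_alu_reg_reg(op6, d, w, reg, rm):
    """Pack the fields from the quoted walk-through into two bytes."""
    opcode = (op6 << 2) | (d << 1) | w     # 6-bit opcode, direction bit, width bit
    modrm = (0b11 << 6) | (reg << 3) | rm  # mod=11: both operands are registers
    return bytes([opcode, modrm])

EAX, ECX = 0b000, 0b001

# d=1: the reg field is the destination, so reg=EAX and rm=ECX.
print(encode_alu_reg_reg(0b000000, 1, 1, EAX, ECX).hex())  # prints: 03c1
# d=0: the r/m field is the destination, so rm=EAX and reg=ECX.
print(encode_alu_reg_reg(0b000000, 0, 1, ECX, EAX).hex())  # prints: 01c8
```

Both byte sequences mean add eax, ecx; which one an assembler emits is its own choice.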

I don't know if you meant this?

Offline dogman

Thanks, that's a good example, but because of the many instruction formats and possibilities there is apparently no algorithmic way in general, even though this specific example makes it look like there is. There are a bunch of tables in the Intel manuals and it seems you have to use those to be able to encode or decode instructions. On each platform the encoding is different. I'm used to very simple encoding and not many instruction formats, but the RISC guys have even simpler encoding and fewer instruction formats than I'm used to.