Since all the registers in DOS mode are 16 bits, how come there's no 16-bit SP register (stack pointer)?
You've got a 16-bit stack pointer all right; what doesn't work is "[sp]". With 16-bit instructions/registers, addressing modes are very limited: an (optional) offset, plus an (optional) base register, plus an (optional) index register. Base registers are bx and bp, index registers are si and di. That's it! We can't even do "mov al, [si + di]"! "bp" is "special" in that it defaults to [ss:bp] - the rest of 'em use ds (by default).
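A few lines to make those rules concrete. This is only a sketch, assuming Nasm and a .com file ("org 100h"); the offsets are arbitrary, and the two commented-out lines are the ones Nasm will refuse to assemble.

```nasm
bits 16
org 100h                        ; .com file layout (an assumption for this sketch)

        mov al, [bx]            ; base only - uses ds
        mov al, [si + 2]        ; index plus offset - uses ds
        mov al, [bx + di + 4]   ; base plus index plus offset - uses ds
        mov al, [bp + 4]        ; bp as base - defaults to ss
        mov al, [ds:bp + 4]     ; explicit override if we really want ds
;       mov al, [si + di]       ; invalid: two index registers
;       mov al, [sp]            ; invalid: sp is neither a base nor an index

        mov bx, sp              ; workaround: copy sp to a register we CAN use...
        mov al, [ss:bx]         ; ...and override bx's default ds with ss
        ret                     ; back to dos
```

In a .com file ds and ss start out equal, so the overrides make no practical difference there, but they matter as soon as the segments differ.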
We can use 32-bit instructions/registers, even in 16-bit code (on a 386 or later). There's an "override prefix" involved (66h for operand size and 67h for address size), but Nasm takes care of generating them for us. 32-bit addressing modes are much more flexible! Any (general purpose) register can be used as "base", and any but esp can be used as "index", and we can add a "scale" factor to be multiplied by the "index" register - "* 1" is implied if we don't specify one, or "* 2", "* 4", or "* 8" can be used. Very handy for addressing arrays of words, dwords (or single precision floats), or double precision floats (or other qword values). 16-bit addressing is a real PITA, and we're generally glad to forget it once we move to 32-bit code.
There's a limitation to using 32-bit instructions/registers/addressing in 16-bit code - the "total offset" can't exceed 64k. Doing so causes a "segment overrun exception" (interrupt 0Dh), which DOS doesn't handle, so it usually hangs the machine. There's a workaround for this - "Flat Real Mode" (among other names), but I won't get into it here - you have to be in "real real mode" to do it, and we're often not.
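Something like this shows 32-bit addressing in 16-bit code. It's a sketch, assuming Nasm, a .com file, and a 386 or better; the labels "words" and "dwords" are made up for the example. Note that the upper halves of the 32-bit registers are cleared first, so the total offset stays under 64k.

```nasm
bits 16
org 100h                            ; .com file layout (an assumption for this sketch)

        xor ebx, ebx                ; clear all 32 bits, so the high half is zero
        mov bx, words               ; ebx = offset of the word array
        mov ecx, 2                  ; element number (high half is zero too)

        mov ax, [ebx + ecx * 2]     ; third word in "words" - Nasm emits a 67h prefix
        mov eax, [dwords + ecx * 4] ; third dword in "dwords" - 66h and 67h prefixes

        movzx esp, sp               ; make sure the high half of esp is zero...
        mov ax, [esp]               ; ...and now "[esp]" works where "[sp]" doesn't
        ret                         ; back to dos

words:  dw 10, 20, 30, 40           ; hypothetical data
dwords: dd 100, 200, 300, 400       ; hypothetical data
```

Like bp, esp (and ebp) used as a base defaults to ss rather than ds.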
As Rob explains, you're only pushing two bytes in this code, so add (not sub) 2 to "clean up the stack". It probably works anyway, since you don't really "need" the stack to be "clean", but it isn't "right" (as you suspected). As I recall, when you first started using [esp + ?], you were pushing eax, which is four bytes - Nasm generated that 66h prefix to make it so. I probably should have explained more then... but it was working...
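The bookkeeping looks roughly like this. It's a sketch, not your actual code: "show" is a hypothetical do-nothing routine standing in for whatever you call, and a .com file is assumed.

```nasm
bits 16
org 100h                        ; .com file layout (an assumption for this sketch)

        mov ax, 1234h
        push ax                 ; 2 bytes pushed - sp goes DOWN by 2
        call show
        add sp, 2               ; so we ADD 2 to put sp back where it was

        mov eax, 12345678h
        push eax                ; 66h prefix - 4 bytes pushed this time
        call show
        add sp, 4               ; so 4 bytes to remove

        ret                     ; back to dos

show:                           ; hypothetical routine - ignores its parameter
        ret
```

The amount added after the call has to match the number of bytes pushed before it; the stack grows downward, which is why cleanup is "add" and not "sub".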
To do a subroutine with parameters on the stack, in "all 16-bit" code...
myfunction:
    push bp             ; save caller's reg
    mov bp, sp          ; create a "stack frame"
    ; sub sp, ?         ; if you want "local variables"
    mov dx, [bp + 4]    ; get the first parameter ([bp] = saved bp, [bp + 2] = return address)
    ; do whatever
    mov sp, bp          ; not necessary if no local variables
    pop bp              ; restore caller's reg
    ; now the next thing on the stack is our return address, so we're ready to...
    ret
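For the caller's side, here's one way a complete .com file using myfunction might look. It's a sketch with some assumptions filled in: the parameter is taken to be the offset of a "$"-terminated string, and "do whatever" is taken to be dos's print-string service (int 21h, ah = 9), which is a guess based on the parameter landing in dx. The "msg" label is made up.

```nasm
bits 16
org 100h                        ; .com file layout (an assumption for this sketch)

        mov ax, msg
        push ax                 ; the parameter - 2 bytes
        call myfunction         ; pushes the return address - 2 more bytes
        add sp, 2               ; caller removes the parameter
        ret                     ; back to dos

myfunction:
        push bp                 ; [bp] will be the saved bp
        mov bp, sp              ; [bp + 2] = return address, [bp + 4] = parameter
        mov dx, [bp + 4]        ; get the first parameter
        mov ah, 9               ; assumed "do whatever": dos print string
        int 21h
        mov sp, bp              ; not needed here - no local variables
        pop bp
        ret

msg:    db "Hello from the stack!", 13, 10, "$"   ; hypothetical string
```

A second parameter, pushed before this one, would be at [bp + 6], and the cleanup would become "add sp, 4".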
There's a further complication, "call far". If the subroutine is in a different segment than the calling code, both cs and ip are used, and both segment and offset are part of the return address, so the first parameter is at [bp + 6] (and you end with "retf"). You probably don't need this in a .com file, and it's rarely used in 32-bit code, so you probably don't need to worry about it, but be aware that it's a possibility.
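To see where the [bp + 6] comes from without needing a second segment, a far call can be imitated inside a single .com file by pushing cs ourselves before a near call. This is a stand-in for a real "call far", used here only to keep the sketch small; "myfarfunction" and "msg" are made-up names, and the print-string service is again an assumption.

```nasm
bits 16
org 100h                        ; .com file layout (an assumption for this sketch)

        mov ax, msg
        push ax                 ; the parameter
        push cs                 ; segment half of the return address
        call myfarfunction      ; near call pushes ip - together they look like "call far"
        add sp, 2               ; caller removes the parameter
        ret                     ; back to dos

myfarfunction:
        push bp
        mov bp, sp              ; [bp] = saved bp, [bp + 2] = ip, [bp + 4] = cs
        mov dx, [bp + 6]        ; so the first parameter is 2 bytes farther up
        mov ah, 9               ; assumed "do whatever": dos print string
        int 21h
        pop bp
        retf                    ; pops both ip and cs

msg:    db "Hello via retf!", 13, 10, "$"   ; hypothetical string
```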
Assembly language is really very simple, but you need to know a lot of "simple things" all at once in order to do anything, so it can seem more complicated at first than it really is (IMO).
Best,
Frank