Jump to content


Check out our Community Blogs

Register and join over 40,000 other developers!


Recent Status Updates

View All Updates

Photo
- - - - -

Assembly, Intro To Algorithms With String Functions (Win32, NASM)

string assembly

  • Please log in to reply
3 replies to this topic

#1 RhetoricalRuvim

RhetoricalRuvim

    JavaScript Programmer

  • Expert Member
  • PipPipPipPipPipPipPip
  • 1307 posts
  • Location:C:\Countries\US
  • Programming Language:C, Java, C++, PHP, Python, JavaScript

Posted 17 August 2011 - 01:05 PM

Algorithms are an important part for any programming language. Last time we made a window with text in it, but what's the use of putting text into a window, if we can't even process text?

In this tutorial, we'll take a look at some string algorithms and make functions that use those algorithms.

What Is An Algorithm?
An algorithm is a rule or set of rules specifying how to do something or solve a problem.

Making An Algorithm:
Here's what we need to do:
  • Think of a task.
  • Think about how you would do the task.
  • Write pseudo-code.
  • Write the assembly language function.


Think of a task
As with just about anything else, before you start something you have to think of what you're going to do.

Think about how you would do the task
Then you need to think how you would have done the task, if you were given the job.

Write pseudo-code
Generally, pseudo-code is code that does not really mean anything to any compiler, but is useful for writing what the computer will have to do.

This is where you go into more details about what, exactly, needs to be done.

The syntax of the pseudo-code depends on how you want to write it.

Write the assembly language function
After you have exactly what needs to be done, you can then start "compiling" your writing (pseudo-code) to assembly language code.



The Algorithms We'll Focus On
The algorithms we'll be working on here are string-handling algorithms.

We would make these four algorithms, but you can make more on your own:
  • strlen - Getting the length of a string.
  • strcpy - Copying a string.
  • strcat - String concatenation; this is where you append one string to another.
  • strcmp - Compare strings.



New Instructions
There are four new instructions that we'll use that we didn't go over yet. They're PUSHA, POPA, CBW, (CWD,) and CWDE.

PUSHA - Push All
Same as pushing the registers EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI, in that order. Does that in one instruction.

POPA - Pop All
Same as popping the registers EDI, ESI, and EBP, then adding 4 to ESP, then popping EBX, EDX, ECX, and EAX, all in that order.

CBW - Convert Byte To Word
Takes bit 7 of AL and sets all the bits in AH to that value.

CWD - Convert Word To Double-Word
Takes bit 15 of AX and sets all the bits in register DX to that value.

CWDE - Convert Word To Double-Word
Same as CWD, but changes the higher-order word of EAX, rather than changing DX.



A technique that we'll use
In assembly language, we usually use [ebp-4], etc., for local variables. But this time, we'll use [ebp-4] for our return value.

The address is [ebp-(the sum of the sizes of all the local variables)-4], if we use the PUSHA instruction right after ENTER. The sum of the sizes of all the local variables is usually equal to the number used for the first operand of the ENTER instruction. For example, if we have 10 bytes of local variables, we would use 'ENTER 10, 0' , and that number would be 10. We would then use [ebp-10-4], which is [ebp-14], to access the return value.

This works because we reserve 10 bytes on the stack with the ENTER instruction, so [ebp-10] is the first byte of our stack frame. Then we push the registers using the PUSHA instruction, which pushes EAX first, meaning that the double-word right below [ebp-10] is used for EAX. But since our return value is in EAX, and since the POPA instruction would use that double-word (the one below [ebp-10]) for restoring the EAX register, it would make sense that if we save the return value to [ebp-(number of bytes reserved on the stack)-4] then it will be copied to the EAX register upon our return.



strlen
We know that every string has a NULL (0) terminator byte at the end of it. That way, we can scan the string and stop at the NULL terminator. While we are scanning the string, we're also counting the number of characters; this does not include the NULL terminator.



The Pseudo-Code

strlen: 
integer c 
pointer b 
c= 0 
b= address of the string 
loop 1: 
  if byte [b] is 0 then go to loop 1 stop 
  c= c + 1 
  b= b + 1 
  go to loop 1 
loop 1 stop: 
return c

Converting The Pseudo-Code To Assembly Language
First of all, we don't have to use memory for variables. If we can, we could use registers; the program runs faster, that way.

Since EBX is the address/base register, let's use it for our 'b' variable. And since ECX is the count register, let's use it for our 'c' variable.
strlen: 
enter 0, 0      ;; The stack frame. 
pusha           ;; We push EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI (in that order). 

mov eax, dword [ebp+8]          ;; Get the address of the string; it's the first argument. 
mov ebx, eax                         ;; b= address of the string 

xor ecx, ecx                          ;; c= 0 
.lp1:                                    ;; local label loop 1 
  ;; if byte [b] is 0 then go to loop 1 stop 
  cmp byte [ebx], 0 
  jz .lp1s 
  
  inc ecx                               ;; c= c + 1 
  inc ebx                               ;; b= b + 1 
  
  jmp .lp1                              ;; go to loop 1 
.lp1s: 

;; return c 
mov eax, ecx 
mov dword [ebp-4], eax 

popa 
leave 
ret 4





strcpy
Here, we want to copy each byte from the string at argument 2 to the corresponding byte of the string at argument 1, until after we copy the NULL terminator byte.



The Pseudo-Code

strcpy: 
character a 
pointer b 
pointer d 
b= argument 1 (destination) 
d= argument 2 (source) 
(That's how strcpy() takes arguments; strcpy(copy to, copy from)) 
loop 1: 
  a= byte [d] 
  byte [b]= a 
  
  b= b + 1 
  d= d + 1 
  
  if a then go to loop 1 
('if a' means "if a is not 0" ; when a is 0 (the NULL terminator), the loop would end.) 
return

Converting Our Pseudo-Code To Assembly Language
strcpy: 
enter 0, 0             ;; We won't need any local variables. 
pusha                  ;; Save all the register values. 

mov eax, dword [ebp+08] 
mov ebx, eax                   ;; b= argument 1 

mov eax, dword [ebp+12] 
mov edx, eax                   ;; d= argument 2 

.lp1: 
  mov al, byte [edx]          ;; a= byte [d]      (note:  EDX (on 386 or later machines) can also be used for addressing) 
  mov byte [ebx], al          ;; byte [b]= a 
  
  inc ebx                         ;; b= b + 1 
  inc edx                         ;; d= d + 1 
  
  cmp al, 0                      ;; Subtract 0 from AL (don't store the result). 
  jnz .lp1                         ;; If the result is not supposed to be 0, go to loop 1. 

popa                             ;; Restore all the register values. 
leave                            ;; Switch back to the previous stack frame. 
ret 8     ;; Free 8 bytes after return.





strcat
No, this algorithm does not have anything to do with cats :). This is the string concatenation algorithm; it appends string 2 to string 1. However, you do need to make sure that the destination string is big enough to fit what it's currently fitting plus the source string. That means if your destination string cat fit 512 bytes, currently has 410 bytes in use, and the source string is 200 bytes in length, then you shouldn't use this function - because you don't have enough space inside the source string - or you might get unpredictable results.

The same applies to strcpy().

The idea, now, is to add the length of string 1 to the address of string 1, and pass that as the first argument to the strcpy() function, while using string 2 as the second argument.



The Pseudo-Code

strcat: 
dword a 
a= argument1 + strlen argument1 
strcpy a, argument2 
return

The Assembly Language Routine
strcat: 
enter 0, 0                        ;; No local memory variables. 
pusha                             ;; Save the values of all the registers. 

;; a= argument1 + strlen argument1 = strlen argument1 + argument1, so... 
push dword [ebp+8]                      ;; strlen argument1 
call strlen 

;; ... + argument1 
add eax, dword [ebp+8] 

;; strcpy(a, argument2) 
push dword [ebp+12]                 ;; argument2 
push eax                                  ;; a 
call strcpy 

popa                              ;; Restore the values of all the registers. 
leave                             ;; Switch back to the previous stack frame. 
ret 8                              ;; Free 8 bytes after returning.





strcmp
Remember the CMP instruction? Well, the CMP instruction only works on small, static, values; it's time to make an algorithm that would do the same job but for bigger string values.

This function should return string 1 - string 2.



string 1 - string 2 - What?
string 1 - string 2 is 0 if the two strings are equal. Otherwise, it's the first unequal character of string 1 - the first unequal character of string 2. That means that if we're comparing "abcde" and "abcfe" then "abcde" - "abcfe" will be -2, because the first unequal character ('d') of string 1 - the first unequal character ('f') of string 2 is -2. 'd' is ASCII 100 and 'f' is ASCII 102, so that 100 - 102 is -2.

The Pseudo-Code

strcmp: 
character a 
integer c 
pointer b 
pointer d 
b= argument 1 
d= argument 2 
c= 0 
loop 1: 
  a= byte [b] 
  c= a - byte [d] 
  
  if not a then go to loop 1 stop 
  if c then go to loop 1 stop 
  
  b= b + 1 
  d= d + 1 
  
  go to loop 1 
loop 1 stop: 
return c

The Assembly Language Code
strcmp: 
enter 0, 0                    ;; We don't really need any memory local variables, right now. 
pusha                         ;; Save the values of all the general-purpose registers. 

;; b= argument 1 
mov eax, dword [ebp+08] 
mov ebx, eax 

;; d= argument 2 
mov eax, dword [ebp+12] 
mov edx, eax 

;; We'll use EAX for both integer c and character a. 
xor eax, eax 
.lp1: 
  mov al, byte [ebx]                 ;; Get byte [b] 
  mov ah, byte [edx]                ;; Get byte [d] 
  sub al, ah                            ;; See if they're the same. 
  jnz .lp1s                              ;; If not, the strings are not equal for sure. 
  
  cmp ah, 0                            ;; They're the same; but are they 0? 
  ;; Note that we can't use AL, here, because it'll always be 0 if the processor 
  ;; makes it to this instruction, because of the SUB instruction, above. 
  jz .lp1s                                ;; If so, the strings ended already, and we need to return. 
  
  inc ebx                                ;; b= b + 1 
  inc edx                                ;; d= d + 1 
  
  jmp .lp1 
.lp1s: 

cbw                                     ;; This instruction converts a byte value to a word value. 
                                          ;; This helps in situations where AL is a signed integer, because 
                                          ;; if you just clear AH then AX will be 256 - |the integer| for negative numbers. 
                                          ;; 00000010 is binary for 2 
                                          ;; 11111110 is binary for -2 
                                          ;; 11111101 is binary for -3 
                                          ;; 11111100 is binary for -4 
                                          ;; ... and so on ...  
                                          ;; But if you include AH to make AX, then it would be (if AH is 0): 
                                          ;; 00000000 00000010 is binary for 2 
                                          ;; 00000000 11111110 is binary for 254 
                                          ;; 00000000 11111101 is binary for 253 
                                          ;; ... and so on ...  
                                          ;; So what CBW does is it takes the left-most bit of AL and 
                                          ;; it sets every bit in AH to that value. 
                                          ;; 00000000 00000010 is binary for 2 
                                          ;; 11111111 11111110 is binary for -2 
                                          ;; 11111111 11111101 is binary for -3 
                                          ;; ... and so on ...  
cwde                                   ;; Then we have to convert the word (in AX) to a double-word; same idea as before. 

mov dword [ebp-4], eax            ;; return c 

popa                                      ;; Restore the register values. 
leave                                     ;; Switch back to the previous stack frame. 
ret 8                                      ;; Free 8 bytes after return.





Example Program
Well, what would be the point of showing these functions if I don't show an example? So let's take a look at how to use the functions...


Example Program - The Plan
- Get the length of the string "hello" and make a message box that says "hello" that many times.
- Make a message box that displays string1 (a blank string of 512 bytes).
- Copy "one" to string1; make a message box with string1.
- Copy "two" to string1; make a message box with string1.
- Append "three" to string1; make a message box with string1.
- Compare string1 to "abcd"; if equal, make a message box that says "string1 is equal to \"abcd\""
- Copy "abc" to string1, then append "123" to string1; make a message box with string1.
- Compare string1 to "abc123" and if equal, make a message box that says "string1 is equal to \"abc123\""
- Exit, returning 0.


Example Program - The Code
;; Define the externs. 
extern MessageBoxA 
extern ExitProcess 

;; Construct our symbol import table. 
import MessageBoxA user32.dll 
import ExitProcess kernel32.dll 

;; The following is the code section for our program. Use 32-bit code. 
section .text use32 
;; ..start is where our program starts running. 
..start: 

call main 

;; Exit, returning whatever main() returned. 
push eax 
call [ExitProcess] 

main: 
	enter 8, 0                     ;; Set up the stack frame with 8 bytes reserved for local variables. 
	
	
	;; MessageBox() (or MessageBoxA(), or MessageBoxW()) takes 4 arguments: 
	;;    hwnd    - The handle to the window that owns the message box. We don't have a window, currently, so this should be 0. 
	;;    msg     - A pointer to the string to output as the text for the message box. 
	;;    ttl     - A pointer to the string that has the title for the message box. 
	;;    flags   - Some flags for the message box. For a plain message box with an OK button, we use 0. 
	
	
	;; Part 1 - Display a message box with "hello" (strlen "hello") times. 
	
	;; Get the length of the string "hello" 
	push dword txt1 
	call strlen 
	mov dword [ebp-8], eax         ;; Save the number of "hello" 's, needed to be displayed, in dword [ebp-8]. 
	
	;; [ebp-4] will be our index variable for the loop for the "hello" message boxes. 
	xor eax, eax 
	mov dword [ebp-4], eax 
	.lp1: 
		mov eax, dword [ebp-4]     ;; Get the current index value (the number of "hello" 's already displayed). 
		cmp eax, dword [ebp-8]     ;; Compare it to the number of "hello" 's needed to be displayed. 
		jnl .lp1s                  ;; If already displayed that many "hello" 's, go to loop 1 stop. 
		
		;; Display a message box that says "hello" 
		push dword 0 
		push dword the_title 
		push dword txt2 
		push dword 0 
		call [MessageBoxA] 
		
		inc dword [ebp-4]          ;; Number of "hello" 's already displayed = number of "hello" 's already displayed + 1 
		jmp .lp1                   ;; Go to .lp1, to continue the loop. 
	.lp1s: 
	
	
	;; Part 2 - Make a message box that displays string1. 
	
	push dword 0 
	push dword the_title 
	push dword string1 
	push dword 0 
	call [MessageBoxA] 
	
	
	;; Part 3 - Copy "one" to string1 and display a message box with string1. 
	
	push dword txt3 
	push dword string1 
	call strcpy 
	
	push dword 0 
	push dword the_title 
	push dword string1 
	push dword 0 
	call [MessageBoxA] 
	
	
	;; Part 4 - Copy "two" to string1 and display a message box with string1. 
	
	push dword txt4 
	push dword string1 
	call strcpy 
	
	push dword 0 
	push dword the_title 
	push dword string1 
	push dword 0 
	call [MessageBoxA] 
	
	
	;; Part 5 - Append "three" to string1 and display a message box with string1. 
	
	push dword txt5 
	push dword string1 
	call strcat 
	
	push dword 0 
	push dword the_title 
	push dword string1 
	push dword 0 
	call [MessageBoxA] 
	
	
	;; Part 6 - Compare string1 to "abcd" and if equal then display message box with "string1 is equal to \"abcd\"" 
	
	push dword txt6 
	push dword string1 
	call strcmp 
	
	cmp eax, 0                     ;; Compare EAX to 0. 
	jz .lbl1                       ;; If it's equal to 0, go to .lbl1 
	jmp .lbl1f                     ;; Otherwise, go to .lbl1f (label 1 finish). 
	
	.lbl1: 
		push dword 0 
		push dword the_title 
		push dword txt7 
		push dword 0 
		call [MessageBoxA] 
		
		jmp .lbl1f 
	.lbl1f: 
	
	
	;; Part 7 - Copy "abc" to string1, then append "123" to string1; display a message box with string1. 
	
	push dword txt8 
	push dword string1 
	call strcpy 
	
	push dword txt9 
	push dword string1 
	call strcat 
	
	push dword 0 
	push dword the_title 
	push dword string1 
	push dword 0 
	call [MessageBoxA] 
	
	
	;; Part 8 - Compare string1 to "abc123" and if equal then display a message box with "string1 is equal to \"abc123\"" 
	
	push dword txt10 
	push dword string1 
	call strcmp 
	
	cmp eax, 0                     ;; If EAX is 0 (meaning string1 is equal to "abc123") 
	jz .lbl2                       ;; then go to .lbl2 
	jmp .lbl2f                     ;; Else, go to .lbl2f 
	
	.lbl2: 
		push dword 0 
		push dword the_title 
		push dword txt11 
		push dword 0 
		call [MessageBoxA] 
		
		jmp .lbl2f 
	.lbl2f: 
	
	
	;; Part 9 - Exit, returning 0. 
	
	xor eax, eax                   ;; Return 0. XOR a, a   clears a. Another way of saying MOV a, 0 
	leave                          ;; Switch back to previous stack frame. 
ret                                ;; Return to caller. 

;; Our data section. 
section .data 
the_title                                                     db "AlgorithmsIntro", 0 

txt1                                                          db "hello", 0 
txt2                                                          db "hello", 0 

txt3                                                          db "one", 0 

txt4                                                          db "two", 0 

txt5                                                          db "three", 0 

txt6                                                          db "abcd", 0 
txt7                                                          db "string1 is equal to ", 34, "abcd", 34, 0          ;; 34 is ASCII for double quote. 

txt8                                                          db "abc", 0 
txt9                                                          db "123", 0 

txt10                                                         db "abc123", 0 
txt11                                                         db "string1 is equal to ", 34, "abc123", 34, 0 

;; The following is part of the bss section. 
section .bss 
;; Reserve 512 bytes for string1. 
string1                                                       resb  512 

;; Tell NASM that we want to put the next things into the code section. 
section .text 

strlen: 
enter 0, 0      ;; The stack frame. 
pusha           ;; We push EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI (in that order). 

mov eax, dword [ebp+8]          ;; Get the address of the string; it's the first argument. 
mov ebx, eax                         ;; b= address of the string 

xor ecx, ecx                          ;; c= 0 
.lp1:                                    ;; local label loop 1 
	;; if byte [b] is 0 then go to loop 1 stop 
	cmp byte [ebx], 0 
	jz .lp1s 
	
	inc ecx                               ;; c= c + 1 
	inc ebx                               ;; b= b + 1 
	
	jmp .lp1                              ;; go to loop 1 
.lp1s: 

;; return c 
mov eax, ecx 
mov dword [ebp-4], eax 

popa 
leave 
ret 4 

strcpy: 
enter 0, 0             ;; We won't need any local variables. 
pusha                  ;; Save all the register values. 

mov eax, dword [ebp+08] 
mov ebx, eax                   ;; b= argument 1 

mov eax, dword [ebp+12] 
mov edx, eax                   ;; d= argument 2 

.lp1: 
	mov al, byte [edx]          ;; a= byte [d]      (note:  EDX (on 386 or later machines) can also be used for addressing) 
	mov byte [ebx], al          ;; byte [b]= a 
	
	inc ebx                         ;; b= b + 1 
	inc edx                         ;; d= d + 1 
	
	cmp al, 0                      ;; Subtract 0 from AL (don't store the result). 
	jnz .lp1                         ;; If the result is not supposed to be 0, go to loop 1. 
;; .....  

popa                             ;; Restore all the register values. 
leave                            ;; Switch back to the previous stack frame. 
ret 8     ;; Free 8 bytes after return. 

strcat: 
enter 0, 0                        ;; No local memory variables. 
pusha                             ;; Save the values of all the registers. 

;; a= argument1 + strlen argument1 = strlen argument1 + argument1, so... 
push dword [ebp+8]                      ;; strlen argument1 
call strlen 

;; ... + argument1 
add eax, dword [ebp+8] 

;; strcpy(a, argument2) 
push dword [ebp+12]                 ;; argument2 
push eax                                  ;; a 
call strcpy 

popa                              ;; Restore the values of all the registers. 
leave                             ;; Switch back to the previous stack frame. 
ret 8                              ;; Free 8 bytes after returning. 

strcmp: 
enter 0, 0                    ;; We don't really need any memory local variables, right now. 
pusha                         ;; Save the values of all the general-purpose registers. 

;; b= argument 1 
mov eax, dword [ebp+08] 
mov ebx, eax 

;; d= argument 2 
mov eax, dword [ebp+12] 
mov edx, eax 

;; We'll use EAX for both integer c and character a. 
xor eax, eax 
.lp1: 
	mov al, byte [ebx]                 ;; Get byte [b] 
	mov ah, byte [edx]                ;; Get byte [d] 
	sub al, ah                            ;; See if they're the same. 
	jnz .lp1s                              ;; If not, the strings are not equal for sure. 
	
	cmp ah, 0                            ;; They're the same; but are they 0? 
	;; Note that we can't use AL, here, because it'll always be 0 if the processor 
	;; makes it to this instruction, because of the SUB instruction, above. 
	jz .lp1s                                ;; If so, the strings ended already, and we need to return. 
	
	inc ebx                                ;; b= b + 1 
	inc edx                                ;; d= d + 1 
	
	jmp .lp1 
.lp1s: 

cbw                                     ;; This instruction converts a byte value to a word value. 
                                          ;; This helps in situations where AL is a signed integer, because 
                                          ;; if you just clear AH then AX will be 256 - |the integer| for negative numbers. 
                                          ;; 00000010 is binary for 2 
                                          ;; 11111110 is binary for -2 
                                          ;; 11111101 is binary for -3 
                                          ;; 11111100 is binary for -4 
                                          ;; ... and so on ...  
                                          ;; But if you include AH to make AX, then it would be (if AH is 0): 
                                          ;; 00000000 00000010 is binary for 2 
                                          ;; 00000000 11111110 is binary for 254 
                                          ;; 00000000 11111101 is binary for 253 
                                          ;; ... and so on ...  
                                          ;; So what CBW does is it takes the left-most bit of AL and 
                                          ;; it sets every bit in AH to that value. 
                                          ;; 00000000 00000010 is binary for 2 
                                          ;; 11111111 11111110 is binary for -2 
                                          ;; 11111111 11111101 is binary for -3 
                                          ;; ... and so on ...  
cwde                                   ;; Then we have to convert the word (in AX) to a double-word; same idea as before. 

mov dword [ebp-4], eax            ;; return c 

popa                                      ;; Restore the register values. 
leave                                     ;; Switch back to the previous stack frame. 
ret 8                                      ;; Free 8 bytes after return. 



Example Program - The Output
You should get some message boxes in a row, that say the following (a message box message per line):
hello
hello
hello
hello
hello

one
two
twothree
abc123
string1 is equal to "abc123"

http://forum.codecal...tachmentid=4138







First Tutorial:
Intro To Win32 Assembly, Using NASM

Previous Tutorial:
Simple Text

Next Tutorial:
String-Integer Algorithms

Attached Thumbnails

  • AlgorithmsIntro_output_cc.PNG

Edited by RhetoricalRuvim, 21 August 2011 - 03:54 PM.

  • 0

#2 dargueta

dargueta

    I chown trolls.

  • Moderator
  • 4854 posts
  • Programming Language:C, Java, C++, PHP, Python, JavaScript, Perl, Assembly, Bash, Others
  • Learning:Objective-C

Posted 20 August 2011 - 08:50 PM

strcmp - near the bottom, you have cbwe. Should be cwde​.
  • 0

sudo rm -rf / && echo $'Sanitize your inputs!'


#3 RhetoricalRuvim

RhetoricalRuvim

    JavaScript Programmer

  • Expert Member
  • PipPipPipPipPipPipPip
  • 1307 posts
  • Location:C:\Countries\US
  • Programming Language:C, Java, C++, PHP, Python, JavaScript

Posted 20 August 2011 - 09:25 PM

Thanks. I fixed it.
  • 0

#4 RhetoricalRuvim

RhetoricalRuvim

    JavaScript Programmer

  • Expert Member
  • PipPipPipPipPipPipPip
  • 1307 posts
  • Location:C:\Countries\US
  • Programming Language:C, Java, C++, PHP, Python, JavaScript

Posted 21 August 2011 - 03:56 PM

Also, I noticed an error right after '.lp1:' in strcmp: AL is supposed to be set to [ebx], not [edx]; otherwise you'll get string2 - string1, instead of string1 - string2. I just fixed it.
  • 0