Click here to Skip to main content
15,881,600 members
Articles / Programming Languages / C

Lexical analyzer: an example

Rate me:
Please Sign up or sign in to vote.
4.71/5 (12 votes)
21 Nov 2014CPOL3 min read 188.7K   5.9K   40   18
A C program to scan source file for tokens

Download source code

→ You might want to have a look at Syntax analysis: an example after reading this.

Introduction

Lexical analyzer (or scanner) is a program to recognize tokens (also called symbols) from an input source file (or source code). Each token is a meaningful character string, such as a number, an operator, or an identifier.

This is the assignment: write a scanner following these lexical rules:

  • Case insensitive
  • All English letters (upper and lower), digits, and extra characters as seen below, plus whitespace
  • Identifiers
    • begin with a letter, and continue with any number of letters
    • assume no identifier is longer than 8 characters
  • Keywords (reserved), include: start finish then if repeat var int float do read print void return dummy program
  • Relational operators, include: == < > =!= => =<
  • Other operators, include: = : + - * / %
  • Delimiters, include: . ( ) , { } ; [ ]
  • Numbers:
    • any sequence of decimal digits, no sign
    • assume no number is longer than 8 digits
  • Other assumption:
    • No whitespace needed to separate tokens except when this changes the tokens (as x y vs xy)
    • Comments start with // and end with new line

Display tokens found, each with 3 fields: token type, token instance, and line number.

Other requirement:

  • Invocation of the program is:
    scanner [filename]
    where input source file is filename.lan (extension .lan is implicit).
    Program will read the same data from keyboard via redirecting from a file if filename is not specified. For example, scanner < filename.lan
  • Perform necessary validations. For example, if found invalid symbols, show error message with line number, and then abort the program

Preparation

Plan

This is the Makefile that also shows the organization of the program files:

CC	= gcc
PROG	= scanner 

$(PROG): driveScanner.o scanner.o
	$(CC) -o $(PROG) driveScanner.o scanner.o

driveScanner.o : driveScanner.c token.h scanner.h
	$(CC) -c driveScanner.c

scanner.o : scanner.c token.h scanner.h symdef.h
	$(CC) -c scanner.c

.PHONY: clean

clean:
	/usr/bin/rm -rf core $(PROG) *.o

token.h defines MAX (= 8, max length of each word string), LIMIT (= 200, max number of word strings in an input source file), and declares token types (show later).

Process command-line arguments and redirection

Start with main() from driveScanner.c:

C
int main(int argc, char *argv[]) {
    FILE *filePtr;

    switch (argc) {
        case 1: // No parameters, use stdin
                // printf("NO argument provided\n");
                filePtr = stdin;
                break;

        case 2: // One parameter, use .lan file supplied	
                if ( (filePtr = fopen(strcat(argv[1], ".lan"), "r")) == NULL ) {
                        printf("Cannot open input file.\n");
                        exit(1);
                }
                break;

        default:
                printf("Syntax: scanner [file] (.lan is implicit)\n");
                exit(0);
    }
}

Next to check if input file is empty:

C
fseek(filePtr, 0, SEEK_END);
if (ftell(filePtr) == 0) {
        printf("File is empty.\n");
        exit(1);
} else {
        rewind(filePtr);
}

Now check for invalid characters and max word length

C
char c;
int numLine = 1;

int charCount = 0;
char tempStr[MAX]; // ! Remember: C doesn't do array out-of-bound checking! 
int i;

int isValid = 1; // true 
while ((c = fgetc(filePtr)) != EOF) {
        if (c == '\n')
                numLine++;

        // Exclude comment line starting with '//' and ending with new line
        if (c == '/') {
                if (fgetc(filePtr) == '/') {
                        while ((c = fgetc(filePtr)) != '\n') {}
                        numLine++;
                }			
        }

        if (isalnum(c)) {
                tempStr[charCount] = c; 
                charCount++;
                if (charCount > MAX) {
                        printf("Word '%s' on line number %d is too long. \n", tempStr, numLine);
                        exit(1);	
                }
        } else if (isspace(c) || isExAcceptableChar(c)) { 
                charCount = 0;
        } else {
                printf("Invalid character '%c' at line %d. \n", c, numLine);
                isValid = 0; // false
        }

}

if (isValid == 0) { 
        printf("Something wrong with the input file. \n");
        exit(1);
}

rewind(filePtr);

At this time, the file is good. Now let scanner.c do the work:

C
TokenType tokenType = UNDEF;

while ((tokenType = getTokenType(filePtr)) != EOT) {}

fclose(filePtr); 
return 0; // done main and program

Let have a look at type declaration in token.h:

C
typedef enum {
	IDENTIFIER,
	KEYWORD,
	NUMBER,
	REL_OP, 	// such as ==  <  >  =!=    =>  =<
	OP,			// such as = :  +  -  *  / %
	DELIM,		// such as . (  ) , { } ; [ ] 

	UNDEF,		// undefined
	EOT 		// end of token

} TokenType;

typedef struct {
	TokenType tokenType;
	char *instance;
	int lineNum;

} Token;

Note that I have not yet utilized the TokeType enum and Token struct as intended. When working on this assignment, first I declared and intended to use Token struct as 'tripple' data type (TokenType enum, instance, and line number). But then these became not needed as showed below.

At this time, the rest of work is the scanner responsibility now. I created another header file, besides token.h, to support the scanner.c do the work.

symdef.h defines the following according to the assignment requirements, and other several local arrays to scanner to store tokens found:

C
char *keywords[] = {"start", "finish", "then", "if", "repeat", "var", 
	"int", "float", "do", "read", "print", "void", "return", "dummy", "program"};	

char *relationalOperators[] = {"==", "<", ">", "=!=", "=>", "=<"};

char otherOperators[] = {':', '+', '-', '*', '/', '%'};

char delimiters[] = {'.', '(', ')', ',', '{', '}', ';', '[', ']'};

char words[LIMIT][MAX]; // include identifiers, and keywords
int wordi = 0, wordj = 0;
int wordLineNums[LIMIT];

char keys[LIMIT][MAX]; // to store keywords
int keyi = 0;
int keyLineNums[LIMIT];
   
char idens[LIMIT][MAX]; // to store identifiers
int ideni = 0;
int idenLineNums[LIMIT];

char nums[LIMIT][MAX];  // to store numbers
int numi = 0, numj = 0;
int numLineNums[LIMIT];

char delims[LIMIT]; // to store delimiters
int delimi = 0;
int delimLineNums[LIMIT];

char otherOps[LIMIT]; // to store other operators
int otherOpi = 0;
int otherOpLineNums[LIMIT];

char relOps[LIMIT][MAX]; // to store keywords
int relOpi = 0, relOpj = 0;
int relOpLineNums[LIMIT];

The scanner.c

Besides English letters, and digits, these are extra acceptable characters:

C
int isExAcceptableChar(char c) {
	if (c == '.' || c == '(' || c == ')' || c == ',' || c =='{' || c == '}' ||
		c == ';' || c == '[' || c == ']' || 
		c == ':' || c == '+' || c == '-' || c == '*' || c == '/' || c == '%' || 
		c == '=' || c == '<' || c == '>' || c == '!') {

		return 1;
	} else 
		return 0;
}

The key function of scanner.c is as follow:

C
TokenType getTokenType(FILE *filePtr) {
	int lineNum = 1;	
	char ch;
	
	while ((ch = fgetc(filePtr)) != EOF) {
		if (ch == '\n') {
			lineNum++;
		}

		// Ignore comment starting with // to the end of line
		if (ch == '/') {
			if (fgetc(filePtr) == '/') {
				while ((ch = fgetc(filePtr)) != '\n') {}
				lineNum++;
			} else
				fseek(filePtr, -1, SEEK_CUR);
		}

		if (isalpha(ch)) {
			words[wordi][wordj++] = ch;	
			while (isalpha(ch = fgetc(filePtr))) {
				words[wordi][wordj++] = ch;	
			}
			words[wordi][wordj] = '\0';	
			wordLineNums[wordi] = lineNum;
			wordi++; wordj = 0;	
			fseek(filePtr, -1, SEEK_CUR);
		} 

		else if (isdigit(ch)) {
			nums[numi][numj++] = ch;
			while (isdigit(ch = fgetc(filePtr))) {
				nums[numi][numj++] = ch;
			}
			nums[numi][numj] = '\0';
			numLineNums[numi] = lineNum;
			numi++; numj = 0;
			fseek(filePtr, -1, SEEK_CUR);
		}

		else if (ispunct(ch)) {
			if (isDelimiter(ch)) {
				delims[delimi] = ch;
				delimLineNums[delimi] = lineNum;
				delimi++;
			} 
			else if (isOtherOperators(ch)) {
				otherOps[otherOpi] = ch;
				otherOpLineNums[otherOpi] = lineNum;
				otherOpi++;
			}
			else if (isStartRelationalOperator(ch)) {
				if (ch == '<' || ch == '>') {
					relOps[relOpi][relOpj++] = ch;
					relOps[relOpi][relOpj] = '\0';
					relOpLineNums[relOpi] = lineNum;
					relOpi++; relOpj = 0;
				}
				else if (ch == '=') {
					if ((ch = fgetc(filePtr)) == '=' || ch == '>' || ch == '<') {
						relOps[relOpi][relOpj++] = '=';	
						relOps[relOpi][relOpj++] = ch;	
						relOps[relOpi][relOpj] = '\0';
						relOpLineNums[relOpi] = lineNum;
						relOpi++; relOpj = 0;
					} else if (ch == '!') {
						if ((ch = fgetc(filePtr)) == '=') {
							relOps[relOpi][relOpj++] = '=';
							relOps[relOpi][relOpj++] = '!';
							relOps[relOpi][relOpj++] = ch;	
							relOps[relOpi][relOpj] = '\0';
							relOpLineNums[relOpi] = lineNum;
							relOpi++; relOpj = 0;
						} else {
							fseek(filePtr, -1, SEEK_CUR);
						}
					} else {
						fseek(filePtr, -1, SEEK_CUR);
					}
			
				}
			}
		}
	} // end while

	splitWords();	

	printSummary();
	
	return EOT; // end of token
}

Note:
- The idea is to have 'one look ahead' to decide what next token is going to be.
- fseek(filePtr, -1, SEEK_CUR) is to go back one character when needed.
- 2D array words is to 'collect' keyword and identifiers token. Then split this 2D array (of characters, or 1D array of strings equivalently) to get keyword and identifier tokens arrays.
- Operators and numbers tokens are founded by going forward and backward 'one character at a time'.
- These are some checking functions helpful while going back and forth the input source file:

C
int isStartRelationalOperator(char c) {
	int i;
	int result = 0; // false
	if (c == '=' || c == '<' || c == '>') {
		result = 1;
	}
	return result;	
}

int isOtherOperators(char c) {
	 int i;
	 int result = 0; // false
	 for (i = 0; i < 6; i++) {
		if (otherOperators[i] == c) 
			result = 1;
	 }
	 return result;
}

int isDelimiter(char c) {
	 int i;
	 int result = 0; // false
	 for (i = 0; i < 9; i++) {
		if (delimiters[i] == c) 
			result = 1;
	 }
	 return result;
}

int isKeyword(char *str) {
	int i;
	int result = 0; // false
	for (i = 0; i < 15; i++) {
		if (!strcasecmp(keywords[i], str))
			result = 1;
	}
	return result;
}

Note: the number 6, 9, 15 (number of pre-specified other operators, delimiters, and keywords) were hard-coded which shouldn't be. They should be defined the way similiar to MAX or LIMIT in token.h instead.

Now look at the words array, and pick tokens for keywords and identifiers token arrays (keys array and idens array):

C
void splitWords() {
	int i;
	for (i = 0; i < wordi; i++) {
		if (isKeyword(words[i])) {
			strcpy(keys[keyi], words[i]);
			keyLineNums[keyi] =  wordLineNums[i];
			keyi++;
		} else {
			strcpy(idens[ideni], words[i]);
			idenLineNums[ideni] = wordLineNums[i];
			ideni++;

		}
	}
}

Last step is to display these tokens arrays (nums, keys, idens, delims, relOps, and otherOps). I include the sample input source file (file.lan) as follow:

// first keywords separated by spaces then by delimiters then by operators
start finish then if repeat var int float do read print void return dummy program
start.finish(then)if,repeat{var}int;float[do]start==finish<then>if=!=repeat=>var<=int=float:do!read+print-void*return/dummy%program

// now some identifiers and numbers separated by operators or delimiters or WS

x.xyz abc123 abc 123 -123
</then>

This is my output:

Image 1

→ You might want to have a look at Syntax analysis: an example after reading this.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
United States United States
while (live) {
try {
learn();
code();
food();
...
} catch (Exception ex) {
recover();
}
}

Comments and Discussions

 
Questionhow to detect tokens in a c program Pin
Member 1460191024-Sep-19 1:14
Member 1460191024-Sep-19 1:14 
QuestionLexical analysis Pin
Member 138955372-Jul-18 9:10
Member 138955372-Jul-18 9:10 
QuestionInput Pin
Member 1297490017-Apr-17 8:17
Member 1297490017-Apr-17 8:17 
AnswerRe: Input Pin
Lộc Nguyễn28-Jun-17 9:43
professionalLộc Nguyễn28-Jun-17 9:43 
Sorry for my late reply. It has been a long time ago, but I'm pretty sure all necessary are there, including the source-code in .zip file.
If you downloaded it, then unzip, there you will see an input file file.lan, along with all other source code files.
GeneralMessage Closed Pin
16-Nov-17 21:10
linensrang16-Nov-17 21:10 
QuestionHow to run on flex win 10 Pin
Member 1226592715-Apr-17 5:36
Member 1226592715-Apr-17 5:36 
Questionsymdef.h pointers. Pin
Jon Christan5-Mar-17 14:40
Jon Christan5-Mar-17 14:40 
QuestionCompile Pin
Lokesh Kondayya Naidu15-Nov-16 15:57
Lokesh Kondayya Naidu15-Nov-16 15:57 
QuestionCant extract file Pin
Member 1278867411-Oct-16 19:37
Member 1278867411-Oct-16 19:37 
Questionexecution Pin
Member 121203836-Nov-15 9:30
Member 121203836-Nov-15 9:30 
QuestionGood Code Pin
murthygrs13-Jul-15 21:23
murthygrs13-Jul-15 21:23 
AnswerRe: Good Code Pin
Lộc Nguyễn14-Jul-15 5:25
professionalLộc Nguyễn14-Jul-15 5:25 
Questioninformation Pin
Member 1135861118-Feb-15 20:09
Member 1135861118-Feb-15 20:09 
AnswerRe: information Pin
Lộc Nguyễn14-Jul-15 5:27
professionalLộc Nguyễn14-Jul-15 5:27 
SuggestionRE: Pin
freddy964-Jan-15 20:36
freddy964-Jan-15 20:36 
GeneralRe: RE: Pin
Lộc Nguyễn14-Jul-15 5:29
professionalLộc Nguyễn14-Jul-15 5:29 
Generala few comments Pin
Member 802077228-Oct-14 7:13
Member 802077228-Oct-14 7:13 
GeneralRe: a few comments Pin
Lộc Nguyễn14-Jul-15 5:29
professionalLộc Nguyễn14-Jul-15 5:29 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.