Expert C++
上QQ阅读APP看书,第一时间看更新

Tokenization

The analysis phase of the compiler aims to split the source code into small units called tokens. A token may be a word or just a single symbol, such as = (the equals sign). A token is the smallest unit of the source code that carries meaningful value for the compiler. For example, the expression int a = 42; will be divided into the tokens int, a, =, 42, and ;. The expression isn't just split by spaces, because the following expression is being split into the same tokens (though it is advisable not to forget the spaces between operands):

int a=42;

The splitting of the source code into tokens is done using sophisticated methods using regular expressions. It is known as lexical analysis, or tokenization (dividing into tokens). For compilers, using a tokenized input presents a better way to construct internal data structures used to analyze the syntax of the code. Let's see how.