Oct 12, 2015 Engineering

A look at Go lexer/scanner packages

The art of tokenizing (photo from https://unsplash.com/szolkin)

I’ve decided to create a lexer (a.k.a scanner) for an upcoming hobby project. Before creating the lexer, I wanted to see how a lexer can be implemented in Go. Also important for me was how to implement it with an idiomatic Go usage. During my research I’ve found that in Go land most of the lexers out there are written in two forms (in terms of API usage).

One is based on Rob Pike’s functional way (go watch it if you haven’t) . For example text/template is based on this approach. So also is the toml parser. The blog post from Adam Presley explains it also with a nice tutorial.
Others are based on iterative approach. Advancing with a reader to fetch the next rune and handling it based on the syntax. Both text/scanner and go/scanner are based on this. So also is influxql and Ben Johnson’s blog post about writing Parser and Scanner.

I’ve also checked many other open sourced packages (thanks godoc!) and they all are influenced from the approaches I’ve described above. Even though the first approach from Rob Pike is cool, he also decided to rewrite some of his own project to a more traditional one:

That talk was about a lexer, but the deeper purpose was to demonstrate how concurrency can make programs nice even without obvious parallelism in the problem. And like many such uses of concurrency, the code is pretty but not necessarily fast.

I think it’s a fine approach to a lexer if you don’t care about performance. It is significantly slower than some other approaches but is very easy to adapt. I used it in ivy, for example, but just so you know, I’m probably going to replace the one in ivy with a more traditional model to avoid some issues with the lexer accessing global state. You don’t care about that for your application, I’m sure.

So: It’s pretty and nice to work on, but you’d probably not choose that approach for a production compiler.

I’m not going to dive into the code and explain how exactly these approaches work since both the video from Rob Pike and the blog posts are well written. Instead I’ll be focusing on the API and the usage.

Let’s start with looking at the traditional scanning styles available in the standard library: text/scanner and go/scanner. The following usages are API reflections, how the lexer works in the underlying code is not important for now.

Tokens, their positions and literal texts

Before we dive in into the API definition of the packages let me introduce how
a Token can be defined. I really like how
go/token handles it. Here is a
simplified version:

// Token is the set of lexical tokens of the Go programming language.
type Token int

// The list of tokens.
const (
    // Special tokens
    ILLEGAL Token = iota
    EOF

    literal_beg
    IDENT  // main
    INT    // 12345
    STRING // “abc”
    literal_end

    keyword_beg
    BREAK
    FUNC
    SELECT
    keyword_end
)

Tokens are just enumerations based on what you think should be tokenized. For example you could tokenize the literals true and false to be a token.BOOL token. However you also can say that it can be just an identifier. Please check out the full source code. What is interesting here are the constants _literal_beg _and _keyword_beg. This allows us to check easily if a given token belongs to a set of tokens:

// IsLiteral returns true for tokens corresponding to identifiers
// and basic type literals; it returns false otherwise.
func (tok Token) IsLiteral() bool { 
    return literal_beg < tok && tok < literal_end 
}

// IsKeyword returns true for tokens corresponding to keywords;
// it returns false otherwise.
func (tok Token) IsKeyword() bool { 
    return keyword_beg < tok && tok < keyword_end 
}

So this is how you define your tokens. The meaning is up to you and the language/syntax you are trying to parse. However only having a token is not useful. You also need the position and the literal string of the token.

d0d25-1v09_pqqqmubtx0ddgps10q

The position is a type that holds the line, column and probably also the offset. Given a file you can directly jump to the token with this information.

The literal string needs some explanation. Certain tokens have also values that needs to be extracted. Keywords for example don’t have values. They are predefined names (such as token.BREAK, token.SELECT,etc..). However say for a token.STRING the value is there and needs to be defined. This process is called by the Evaluator part of the lexer. So the value of certain tokens should be retrieved and provided by the lexer. If we have a token.INT, the value might be “0x123”, “1234” or “45e+10”. This all depends on what you define an int should be of course.

As you see, the lexer doesn’t try to give a meaning. The meaning is defined on what you want certain characters to be recognized as.

Current scanner APIs

Let’s move on and observe how text/scanner and go/scanner provides an API that provides us both the position and the text literal.

Let us start looking go/scanner. This package is used by the official Go tools itself. So it’s battle tested and works very well. And because it’s intended to be used with Go syntax, the constructor is also different. Below is the official example from the doc:

// src is the input that we want to tokenize.
src := []byte(“cos(x) + 1i*sin(x) // Euler”)

// Initialize the scanner.
var s scanner.Scanner
fset := token.NewFileSet()
file := fset.AddFile(“”, fset.Base(), len(src))
s.Init(file, src, nil, scanner.ScanComments)


// Traverse the input source
for {
    pos, tok, lit := s.Scan()
    if tok == token.EOF {
        break
    }
    fmt.Printf(“%st%st%qn”, fset.Position(pos), tok, lit)
}

API of go/scanner:

type Scanner
func (s *Scanner) Init(file *token.File, src []byte, err ErrorHandler, mode Mode)
func (s *Scanner) Scan() (token.Pos, token.Token, string)

Notes about the API:

There is no constructor (even though there is Init(), it’s not required, which means you might forget it)
The Init() method does the heavy lifting because the fields which are exported in text/scanner.(*Scanner) are not exported in go/scanner.(*Scanner) (such as ErrorHandler, Mode, Position, etc..)
Instead of passing a source directly, we have to also pass a token.File. Why? Because internally go/scanner uses it for handling the positions of tokens.
The scanner.Scan() method returns the position, token and literal all at once.

text/scanner is meant to be generic, it’s for Go like source text. It doesn’t return any kind of token, token’s doesn’t have a specific type (expected because it’s meant to be a generic scanner). But it returns the token literal, the first rune and the position. Below is an official example of the text/scanner package:

const src = `
// This is scanned code.
if a > 10 {
someParsable = text
}`


var s scanner.Scanner
s.Init(strings.NewReader(src))
var tok rune
for tok != scanner.EOF {
    tok = s.Scan()
    fmt.Println(“At position”, s.Pos(), “:”, s.TokenText())
}

API of text/scanner:

type Scanner
func (s *Scanner) Init(src io.Reader) *Scanner
func (s *Scanner) Next() rune
func (s *Scanner) Peek() rune
func (s *Scanner) Pos() (pos Position)
func (s *Scanner) Scan() rune
func (s *Scanner) TokenText() string

Several things about the API:

There is no constructor, just like go/scanner.
You have to call different methods to get the position or the literal, whereas in go/scanner it was part of the Scan() signature.
There is scanner.Scan() and scanner.Next() which seems to be identical and might be confusing. Scan() returns a token, **Next() **returns the next unicode character, however both are of type rune. For the following example: select * from table, Scan() would return: s, *, f and t Next() would return: s, e, l, e, c, …
If an error happens, scanner.Pos() should be used to update the position and advance the internal scanner to the error position.
The actual position is always stored in scanner.Position (a field of
*Scanner), and Position is populated with** scanner.Pos()** call.

A package should be feel right. Both packages are good in their own ways. In the end it doesn’t matter how the API’s are defined if you use them correctly. What matters is how efficient it’s in the long term (where efficiency is not only the runtime performance, also developer productivity).

A different approach

221a0-1vhftcv76sgabaacs9ztkpw

Let us experiment a little bit. What if we could make it a little easier to use? As described above, the position and literal text are parts of a token. They should stick together and be handled together. And the Scanner itself has too many methods which are actually not needed for retrieving the tokens. Let’s begin by redefining our token.

First we change the type Token and rename it to TokenType:

type TokenType int

const (
    ILLEGAL TokenType = iota
    EOF
    ...
)

Next we create our own Token type with the appropriate methods:

type Token struct {
    Type  TokenType
    Pos   Position
    Text  string
}

func (t Token) String() string {
    return t.Text
}

Once we have changed it, our scanner will have a very simple usage:

s := scanner.New(src)

var tok Token 
for tok.Type != token.EOF {
    tok = s.Scan()
    fmt.Println("Type: ", tok.Type)    
    fmt.Println("Pos : ", tok.Pos)    
    fmt.Println("Text: ", tok.Text)
}

As you see, all the token specific details are provided by the token itself. It’s very handy and we don’t need additional calls to the scanner anymore.

Even though this is very handy, the scanner usage is still not easy to use. Let’s change it a little bit to make it more usable as an iterator. First of all, the scanner stops when it receives an EOF. So let us create some helper methods to create a better experience around it:

type Scanner struct {
    tok Token // a new field ...
}

// Token returns the most recently parsed token
func (s *Scanner) Token() Token {
    return s.tok
}

// Next iterates over all tokens. Retrieve the most recent token with Token(). It returns false once it reaches token.EOF.
func (s *Scanner) Next() bool {
    s.tok = s.Scan()
    if s.tok.Type == EOF {
        return false
    }
    return true
}

Once we have the above structure, our scanner will be called as following:

s := scanner.New(src)

for s.Next() {
    tok = s.Token()
    fmt.Println("Type: ", tok.Type)    
    fmt.Println("Pos : ", tok.Pos)    
    fmt.Println("Text: ", tok.Text)
}

The for loop stops once we reach an EOF. And we retrieve the most recent token with the scanner.Token() method. You still can use scanner.Scan() method for possible usages.

Iterating with channels (fun approach)

Let’s implement another approach, this time based on channels. This is purely for fun and not recommended for real use (I’ll explain the reasons in the end). Let’s add a new method called scanner.Tokens() which will return a channel of tokens:

func (s *Scanner) Tokens() <-chan Token {
    tokens := make(chan Token, 0)
    go func() {
        for {
            tok := s.Scan()
            tokens <- tok




            if tok.Type == EOF {
                close(tokens)
                return
            }
        }
    }()
    return tokens
}

Once we have this, iterating will be simply:

s := scanner.New(src)

for tok := range s.Tokens() {
    fmt.Println("Type: ", tok.Type)    
    fmt.Println("Pos : ", tok.Pos)    
    fmt.Println("Text: ", tok.Text)
}

So why is this a fun approach and not recommended? Because it has these flaws:

If you call scanner.Scan() inside the for loop there is a possibility of race condition. The tokens will be out of order.
If you break out of the for loop, the goroutine which is filling the tokens channel will be blocking forever and eventually leaving garbage (memory leak).
If you decide to fix the above problems, you’ll end up with a complex and hard to maintain code base.

There are still other approaches we might explore, such as passing the Token to the scanner.Next(t *Token) method (like mgo) or creating a for loop with initial and post statements. These are left to the reader.

I hope you enjoyed the various implementations around how a lexer API for scanning tokens can be implemented. I’ve learned all these while working on my new project and thought that sharing with others would be useful. Note that these are all experimentations, subjective, and can be even further improved in many other ways. My next step will be implementing a parser, so I might change certain things. Please let me know if you see any more room for improvement.

Finally, I want to thank Ben Johnson, Cihangir Savas and Brian Knox for their much valuable feedback and suggestions.