Writing a JSON Tokenizer/Parser
Mar 20th, 2008
This is the first of about three posts in which I will explain how to write a JSON (JavaScript Object Notation) tokenizer and parser. JSON is a rather simple data format that can be used for transferring data between two machines, for configuration files, and so on. Let’s have a look at a small piece of JSON code:
{
"name":"thr",
"age":23,
"skills": [
"php", "javascript", "xml"
],
"free_text": "I love JSON!",
"iq":150.5,
"is_admin":true,
"ip_addr":null
}
If you’re familiar with Python, you can see that it looks very much like Python’s dictionaries and lists. That is essentially what it is, except that what Python calls “dictionaries” are called objects in JSON (hence the name, …Object Notation). Our goal here is to take this string of JSON code, tokenize it, and then parse it into our native data format. For this example I’m going to use PHP, because JSON is used on the web a lot and PHP is a very good fit for that.
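To make the goal concrete, here is a rough sketch of what the sample could look like as native PHP data. It assumes JSON objects map to associative arrays and JSON lists to indexed arrays, which is one reasonable choice and not necessarily the one this series ends up with. The check at the end uses PHP’s built-in json_decode (PHP 5.2+ with the json extension) purely as a reference point; the parser we are writing does not rely on it.

```php
<?php
// Hand-written PHP equivalent of the JSON sample: the object becomes an
// associative array and the list becomes an indexed array (an assumption
// made for this sketch).
$expected = array(
    'name'      => 'thr',
    'age'       => 23,
    'skills'    => array('php', 'javascript', 'xml'),
    'free_text' => 'I love JSON!',
    'iq'        => 150.5,
    'is_admin'  => true,
    'ip_addr'   => null,
);

// The same data as a JSON string (keywords in lowercase, as JSON requires).
$json = '{"name":"thr","age":23,"skills":["php","javascript","xml"],'
      . '"free_text":"I love JSON!","iq":150.5,"is_admin":true,"ip_addr":null}';

// Second argument true: decode objects as associative arrays.
$decoded = json_decode($json, true);

var_dump($decoded === $expected); // bool(true)
```

Note how the types carry over: 23 arrives as an integer, 150.5 as a float, and the two keywords as a real boolean and a real null.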
So, the first thing we have to consider is how to make the computer understand this, and especially how to do it in a way that’s easy for us to think about and implement (no regular expression madness, etc.). What we’re going to use is a technique called tokenizing, which allows us to split the above code up into tokens that we feed to our parser.
If you look at the above snippet of JSON code you can clearly see the structure, and you could probably edit it to add more data without breaking its syntax. The first thing we want to do is to identify the different elements (tokens) of the code, which are:
- Symbols: { } [ ] , :
- Strings: text enclosed in double quotes (")
- Numbers: 23, 150.5
- Keywords: true, false and null
These elements are our tokens; they are how we identify the different parts of the JSON code, its syntax. Some tokens are fixed: the symbols are always { } [ ] , : and the keywords are always true, false and null. Strings and numbers, on the other hand, can be just about anything as long as they follow their rules:
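- Strings: start and end with a double quote ("), and anything except another double quote is allowed between them (unless the quote follows a backslash, a so-called escape sequence)
- Numbers: only digits and the decimal delimiter (.). Full JSON also allows a leading minus sign and an exponent, which this simple rule leaves out.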
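The two rules above can be sketched as small character-level helpers. This is only an illustration of the rules, not the tokenizer from the next post; the function names is_number_char and read_string are made up for this example.

```php
<?php
// Hypothetical helper: is $char allowed inside a number under the
// simplified rule (digits and the decimal point only)?
function is_number_char($char) {
    return strlen($char) === 1 && strpos('0123456789.', $char) !== false;
}

// Hypothetical helper: $pos must point at an opening double quote in $src.
// Returns the raw text between the quotes (escape sequences are kept as
// they are, not decoded), or false if the string is never closed.
function read_string($src, $pos) {
    $len = strlen($src);
    $raw = '';
    for ($i = $pos + 1; $i < $len; $i++) {
        $char = $src[$i];
        if ($char === '\\' && $i + 1 < $len) {
            // A backslash escapes the next character: copy both, so an
            // escaped quote does not end the string.
            $raw .= $char . $src[$i + 1];
            $i++;
        } elseif ($char === '"') {
            return $raw; // unescaped quote: end of the string
        } else {
            $raw .= $char;
        }
    }
    return false; // ran out of input before the closing quote
}

echo read_string('"I love JSON!"', 0), "\n";     // I love JSON!
echo read_string('"say \"hi\"" rest', 0), "\n";  // say \"hi\"
```

Walking the input one character at a time like this is what lets us avoid regular expressions: every decision depends only on the current character and, for the backslash, the one after it.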
So, the first thing we’re going to do in actual code is to write the token-class which will represent the different tokens that our tokenizer creates:
class token {
    // Symbols: { } [ ] , :
    const SYMBOL_OBJECT_START = 'obj_start';
    const SYMBOL_OBJECT_END = 'obj_end';
    const SYMBOL_LIST_START = 'list_start';
    const SYMBOL_LIST_END = 'list_end';
    const SYMBOL_COMMA = 'comma';
    const SYMBOL_COLON = 'colon';

    // Data types
    const TYPE_STRING = 'str';
    const TYPE_INTEGER = 'int';
    const TYPE_DOUBLE = 'dbl';
    const TYPE_BOOLEAN = 'bol';
    const TYPE_NULL = 'nil';

    // One of the constants above
    public $type;
    // The token's value; null for symbols and TYPE_NULL
    public $value;

    function __construct ($type, $value = null) {
        $this->type = $type;
        $this->value = $value;
    }
}
It’s a very simple class: just a __construct method and two fields. The $type field is filled with one of the class constants, representing either a kind of symbol or a data type. The $value field stores the value for the TYPE_STRING, TYPE_INTEGER, TYPE_DOUBLE and TYPE_BOOLEAN tokens; it’s null for TYPE_NULL and for all the symbols.
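As a quick illustration of how the class might be used, here are the tokens a tokenizer could plausibly produce for the tiny JSON text {"age":23}. The tokens are built by hand because the tokenizer itself comes in the next post, and the class is repeated in trimmed-down form only so the snippet runs on its own.

```php
<?php
// Trimmed-down copy of the token class, repeated only so that this
// snippet is self-contained.
class token {
    const SYMBOL_OBJECT_START = 'obj_start';
    const SYMBOL_OBJECT_END   = 'obj_end';
    const SYMBOL_COLON        = 'colon';
    const TYPE_STRING         = 'str';
    const TYPE_INTEGER        = 'int';

    public $type;
    public $value;

    function __construct($type, $value = null) {
        $this->type  = $type;
        $this->value = $value;
    }
}

// Hand-built token list for the JSON text {"age":23}
$tokens = array(
    new token(token::SYMBOL_OBJECT_START),
    new token(token::TYPE_STRING, 'age'),
    new token(token::SYMBOL_COLON),
    new token(token::TYPE_INTEGER, 23),
    new token(token::SYMBOL_OBJECT_END),
);

foreach ($tokens as $t) {
    // Symbols carry no value, so only print one when it is set.
    echo $t->type, $t->value === null ? '' : ' (' . $t->value . ')', "\n";
}
// Prints:
// obj_start
// str (age)
// colon
// int (23)
// obj_end
```

The parser only ever needs to look at $type to decide what to do next, and at $value when the token carries data.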
This will be all for today’s post. In the next post we will write the tokenizer. Hope you enjoyed it!