The purpose of this SQL Parser is to parse the input SQL query then generate the abstract syntax tree (AST). We can write by Javacc, which is an open source parser generator and lexical analyzer generator written in the Java programming language. Till now, the parser can support the simple SQL as follow:
select ...
from ...
where ...
Besides, the boolean relations in "where" part will only be "and". The output will be a abstract syntax tree (AST) as "AST.xml".
The work should be down by the following flow:
program jj file -> program Main.java -> compile .jj file -> compile .java files
Then we can use the Parser. In the following sections, I will explain how to design the .jj file and how to use Main.java to CALL the method in the Parser class. First I will show a demo example of the program, then I will explain how to program the parser.
select e.Name, d.MName
from Emp e, Dept d
where e.Name = "Jonny" and (d.Name = c.MName and c.Salary > 7000) and d.Name = "James" and c.Name = d.Name
the output will be
<dbQuery><dbSFWStatement><dbSelectClause><dbAttr><dbRelVar> <dbRelAliasName Token="e" /> </dbRelVar><dbAttrName Token="Name" /> </dbAttr><dbAttr><dbRelVar> <dbRelAliasName Token="d" /> </dbRelVar><dbAttrName Token="MName" /> </dbAttr></dbSelectClause><dbFromClause><dbRelVar><dbRelName Token="Emp" /><dbRelAliasName Token="e" /> </dbRelVar><dbRelVar><dbRelName Token="Dept" /><dbRelAliasName Token="d" /> </dbRelVar></dbFromClause><dbWhereClause><BooleanExp><BooleanFactor><dbAttr><dbRelVar> <dbRelAliasName Token="e" /> </dbRelVar><dbAttrName Token="Name" /> </dbAttr><comparisonOp Token="=" /><dbConstValue><STRINGLITERAL Token="Jonny"/> </dbConstValue></BooleanFactor><BooleanExp><BooleanExp><BooleanFactor><dbAttr><dbRelVar> <dbRelAliasName Token="d" /> </dbRelVar><dbAttrName Token="Name" /> </dbAttr><comparisonOp Token="=" /><dbAttr><dbRelVar> <dbRelAliasName Token="c" /> </dbRelVar><dbAttrName Token="MName" /> </dbAttr></BooleanFactor><BooleanFactor><dbAttr><dbRelVar> <dbRelAliasName Token="c" /> </dbRelVar><dbAttrName Token="Salary" /> </dbAttr><comparisonOp Token=">" /><dbConstValue> <INTEGERLITERAL Token="7000"/> </dbConstValue></BooleanFactor></BooleanExp><BooleanExp><BooleanFactor><dbAttr><dbRelVar> <dbRelAliasName Token="d" /> </dbRelVar><dbAttrName Token="Name" /> </dbAttr><comparisonOp Token="=" /><dbConstValue><STRINGLITERAL Token="James"/> </dbConstValue></BooleanFactor><BooleanFactor><dbAttr><dbRelVar> <dbRelAliasName Token="c" /> </dbRelVar><dbAttrName Token="Name" /> </dbAttr><comparisonOp Token="=" /><dbAttr><dbRelVar> <dbRelAliasName Token="d" /> </dbRelVar><dbAttrName Token="Name" /> </dbAttr></BooleanFactor></BooleanExp></BooleanExp></BooleanExp></dbWhereClause></dbSFWStatement></dbQuery>
the original output is not formatted but it's fine since it is an XML file. The formatted XML output will be:
<dbQuery>
<dbSFWStatement>
<dbSelectClause>
<dbAttr>
<dbRelVar>
<dbRelAliasName Token="e" />
</dbRelVar>
<dbAttrName Token="Name" />
</dbAttr>
<dbAttr>
<dbRelVar>
<dbRelAliasName Token="d" />
</dbRelVar>
<dbAttrName Token="MName" />
</dbAttr>
</dbSelectClause>
<dbFromClause>
<dbRelVar>
<dbRelName Token="Emp" />
<dbRelAliasName Token="e" />
</dbRelVar>
<dbRelVar>
<dbRelName Token="Dept" />
<dbRelAliasName Token="d" />
</dbRelVar>
</dbFromClause>
<dbWhereClause>
<BooleanExp>
<BooleanFactor>
<dbAttr>
<dbRelVar>
<dbRelAliasName Token="e" />
</dbRelVar>
<dbAttrName Token="Name" />
</dbAttr>
<comparisonOp Token="=" />
<dbConstValue>
<STRINGLITERAL Token="Jonny"/>
</dbConstValue>
</BooleanFactor>
<BooleanExp>
<BooleanExp>
<BooleanFactor>
<dbAttr>
<dbRelVar>
<dbRelAliasName Token="d" />
</dbRelVar>
<dbAttrName Token="Name" />
</dbAttr>
<comparisonOp Token="=" />
<dbAttr>
<dbRelVar>
<dbRelAliasName Token="c" />
</dbRelVar>
<dbAttrName Token="MName" />
</dbAttr>
</BooleanFactor>
<BooleanFactor>
<dbAttr>
<dbRelVar>
<dbRelAliasName Token="c" />
</dbRelVar>
<dbAttrName Token="Salary" />
</dbAttr>
<comparisonOp Token=">" />
<dbConstValue>
<INTEGERLITERAL Token="7000"/>
</dbConstValue>
</BooleanFactor>
</BooleanExp>
<BooleanExp>
<BooleanFactor>
<dbAttr>
<dbRelVar>
<dbRelAliasName Token="d" />
</dbRelVar>
<dbAttrName Token="Name" />
</dbAttr>
<comparisonOp Token="=" />
<dbConstValue>
<STRINGLITERAL Token="James"/>
</dbConstValue>
</BooleanFactor>
<BooleanFactor>
<dbAttr>
<dbRelVar>
<dbRelAliasName Token="c" />
</dbRelVar>
<dbAttrName Token="Name" />
</dbAttr>
<comparisonOp Token="=" />
<dbAttr>
<dbRelVar>
<dbRelAliasName Token="d" />
</dbRelVar>
<dbAttrName Token="Name" />
</dbAttr>
</BooleanFactor>
</BooleanExp>
</BooleanExp>
</BooleanExp>
</dbWhereClause>
</dbSFWStatement>
</dbQuery>
Firstly, this SQL query will not check if the alias name of table in the "where" clause have it's corresponding alias name in "from" clause. Secondly, this SQL query example has the most situation. Especially in the where part, situations are more complicated than "select" and "from" part. In "where" clause, we will have "STRINGLITERAL", "INTEGERLITERAL" and relation.attribute. In the example, the "where" part is
where e.Name = "Jonny" and (d.Name = c.MName and c.Salary > 7000) and d.Name = "James" and c.Name = d.Name
The corresponding abstract syntax tree should corresponding to the same structure of this query, which is shown as below:
<dbWhereClause>
<BooleanExp>
<BooleanFactor>
<BooleanExp>
<BooleanExp>
<BooleanFactor>
<BooleanFactor>
</BooleanExp>
<BooleanExp>
<BooleanFactor>
<BooleanFactor>
</BooleanExp>
</BooleanExp>
</BooleanExp>
</dbWhereClause>
In JavaCC, we have two method to implement this parser. First one is to use jjtree, which is a preprocessor for JavaCC that inserts parse tree building actions at various places in the JavaCC source. The output of JJTree is run through JavaCC to create the parser. However, in this project, I directly use the second option, .jj file to implement it. The grammar are explained as following.
First, we need to set the options as what we need. In this case, ignore the case and make the method static.
options
{
IGNORE_CASE = true;
STATIC = true;
}
Then, initialize the parse method and start to parse when call this method. This method will return the AST based on the query we get.
PARSER_BEGIN(Parser)
public class Parser
{
public static String parse(String args) throws Exception
{
Parser parse = new Parser(new java.io.StringReader(args));
String rst = parse.Query();
return rst;
}
}
PARSER_END(Parser)
Next step is to setup the tokens of this parser. Because there may have some token in the input query that not related with parse, so we SKIP tokens of " ", "\t", "\r" and "\n"
SKIP :
{
" "
| "\t"
| "\r"
| "\n"
}
parse the tokens of SELECT, FROM, WHERE and AND
TOKEN :
{
< SELECT : "SELECT" >
| < FROM : "FROM" >
| < WHERE : "WHERE" >
| < AND : "AND" >
}
In case of the query has "(" and ")", there should have token as OPEN_PAR
and CLOSE_PAR
token
TOKEN :
{
< OPEN_PAR : "(" >
}
TOKEN :
{
< CLOSE_PAR : ")" >
}
dot and comma are also need to be setup as token for the query relation.attribute and multiply boolean factor.
TOKEN :
{
< DOT : "." >
}
TOKEN :
{
< COMMA : "," >
}
In case that quotation mark " will be used in the where part when it's a STRINGLITERAL
TOKEN :
{
< QUO : """ >
}
because operator has no difference in parse query into AST. So we set them all as OPERATOR
TOKEN :
{
< OPERATOR :
">"
| "< "
| "="
| ">="
| "<="
| "<>"
| "!=" >
}
this token represent only the numbers. it will be used in the where clause to recognize the INTEGERLITERAL
TOKEN :
{
< DIGITS : ([ "0"-"9" ])+ >
}
this token will represent all the mix of table name or attribute name
TOKEN :
{
< NAME : ([ "a"-"z", "0"-"9" ])+ >
}
start the query, call SFWStatement() and add "<dbQuery>""</dbQuery>"
out of the SFWStatement.
String Query() :
{
String rst;
}
{
rst = SFWStatement() < EOF >
{
return "<dbQuery>" + rst + "</dbQuery>";
}
}
first check the select clause then check the from clause finally check the where clause
String SFWStatement() :
{
String select = "";
String from = "";
String where = "";
}
{
select = SelectClause() from = FromClause() where = WhereClause()
{
return "<dbSFWStatement>" + select + from + where + "</dbSFWStatement>";
}
}
this is the select clause, we need to find all the attribute so we call Attr method
String SelectClause() :
{
String select;
}
{
< SELECT > select = Attr()
{
return "<dbSelectClause>" + select + "</dbSelectClause>";
}
}
this method will recursively find all the attribute in the select clause and return all the attribute as one String.
String Attr() :
{
Token relation;
Token attr;
String subAttr = "";
}
{
relation = < NAME > < DOT > attr = < NAME >
(
< COMMA > subAttr = Attr()
)*
{
return "<dbAttr>" + "<dbRelVar> <dbRelAliasName Token=\"" + relation.image + "\" /> </dbRelVar>" + "<dbAttrName Token=\"" + attr.image + "\" /> " + "</dbAttr>" + subAttr;
}
}
The regular expression (< COMMA > subAttr = Attr())*
will recursively all it self to find all the attributes. The warning is fine in this command because javacc prefer us to use LOOKAHEAD, but our way is also fine.
Next is the from clause, we will call RelVal() method to find all relations
String FromClause() :
{
String from;
}
{
< FROM > from = RelVal()
{
return "<dbFromClause>" + from + "</dbFromClause>";
}
}
the RelVal() method will find all the relations in the from clause. The regular expression (< COMMA > subVal = RelVal())*
will recursively all this method to find all the reaName and aliasName in the from clause.
String RelVal() :
{
Token realName;
Token aliasName;
String subVal = "";
}
{
realName = < NAME > aliasName = < NAME >
(
< COMMA > subVal = RelVal()
)*
{
return "<dbRelVar>" + "<dbRelName Token=\"" + realName.image + "\" />" + "<dbRelAliasName Token=\"" + aliasName.image + "\" /> </dbRelVar>" + subVal;
}
}
Next is the where clause. It will have several recursion next in this method.
First we call Expression and assign false to represent that this is the first time we call it, So it must add <BooleanExp><\BooleanExp>
out of the where clause abstract syntex tree.
String WhereClause() :
{
String where = "";
}
{
< WHERE > where = Expression(false)
{
return "<dbWhereClause>" + where + "</dbWhereClause>";
}
}
The Expression() method will find all the boolean factors and add BooleanExp in the case we have more than two boolean factors or has parenthesis in some of the boolean factors. In each call of this method, it will firstly check if there is parenthesis in this boolean expression, if true, we take the first boolean factor and check if there are parenthesis in this parenthesis and pass a true value to the next recursion in the case when we have"((...))" in the query. If hasFather == true means we have the <BooleanExp>...</BooleanExp>
out of this time's call. Then we check if exp2 is empty. if exp2 is empty after the parsing. This means this time is the final BooleanFactor in this BooleanExpression and we don't want to have a single BooleanFactor in the BooleanExp, so we will not add <BooleanExp> ... </BooleanExp>
out of it. If it has no father parenthesis out of it or exp2 not empty, we need to add the BooleanExp out of exp1 + exp2. Next, if the first token is not parenthesis, we need to check if this satisfy the boolean factor rules. So we call Factor() method to parse it. If the first factor is good, we need to check the next one.
(< AND > exp2 = Expression(true))*
can be a boolean factor or a parenthesis, so we call Expression() method again.
String Expression(boolean hasFather) :
{
String exp1 = "";
String exp2 = "";
}
{
< OPEN_PAR > exp1 = Expression(true) < CLOSE_PAR >
(
< AND > exp2 = Expression(true)
)*
{
if (hasFather == true && exp2.equals(""))
{
return exp1;
}
return "<BooleanExp>" + exp1 + exp2 + "</BooleanExp>";
}
| exp1 = Factor()
(
< AND > exp2 = Expression(true)
)*
{
/*the next check has the same function as the previous checking*/
if (hasFather == true && exp2.equals(""))
{
return exp1;
}
else
{
return "<BooleanExp>" + exp1 + exp2 + "</BooleanExp>";
}
}
}
The Factor() method will parse the Boolean factor. It can be three situations.
- it is a table.Attribute. such as "Emp.Name = Dep.MName"
- it is a string literal. such as "Emp.Name = "james""
- it is a integer literal. such as "Emp.Salary = 7000" So we set them as left part and and an operation and a right part and parse them one by one.
String Factor() :
{
String left = "";
String right = "";
String operator;
}
{
left = BooleanAttr()
operator = Operator()
right = BooleanAttr()
{
return "<BooleanFactor>" + left + operator + right + "</BooleanFactor>";
}
}
the BooleanAttr() method can have three situations as well.
- name.name such as "Emp.Name"
- String Literal such as "James"
- Integer Literal such as 7000
so this method will check them and return the corresponding attribute
String BooleanAttr() :
{
Token rel;
Token attr;
String attrName = "";
}
{
rel = < DIGITS >
{
return "<dbConstValue> <INTEGERLITERAL Token=\"" + rel.image + "\"/> </dbConstValue>";
}
| rel = < NAME > < DOT > attr = < NAME >
{
return "<dbAttr>" + "<dbRelVar> <dbRelAliasName Token=\"" + rel.image + "\" /> </dbRelVar>" + "<dbAttrName Token=\"" + attr.image + "\" /> </dbAttr>";
}
| < QUO > rel = < NAME > < QUO >
{
return "<dbConstValue><STRINGLITERAL Token=\"" + rel.image + "\"/> </dbConstValue>";
}
}
Finally, the operator() method will check the operator and return it as what it is.
String Operator() :
{
Token operator;
}
{
operator = < OPERATOR >
{
return "<comparisonOp Token=\"" + operator.image + "\" />";
}
}
Go to your jj file's folder. Then type:
franktekimbp:MyProgram frankgao$ javacc SQLParser.jj
Then we will see:
franktekimbp:MyProgram frankgao$ javacc SQLParser.jj
Java Compiler Compiler Version 5.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file SQLParser.jj . . .
File "TokenMgrError.java" does not exist. Will create one.
File "ParseException.java" does not exist. Will create one.
File "Token.java" does not exist. Will create one.
File "SimpleCharStream.java" does not exist. Will create one.
Parser generated with 0 errors and 0 warnings.
You may see several warnings because of the auto-generate files from JavaCC have several unreachable exceptions, which will not affect our program. In my case, I suppressed those warning so got 0 warnings. Then we compile all the .java files.
franktekimbp:MyProgram frankgao$ javac *.java
Then the Parser is compiled.
So next I'm going to explain how to use the parser.
In this demo, I will use Main.java to show how to use the Parser. We can directly use the Parser.parse(input) to parse the input String. The method will return the AST.
private static String parse(String input) {
String rst = "";
try {
rst = Parser.parse(input);
} catch (Exception e) {
System.out.println("Error during Parsing");
e.printStackTrace();
}
return rst;
}
Then we can do something to the AST String. I will print it out and output it into an AST.xml file. That's it.
-
put “input.txt” and “AST.xml” into Project file Path. For example, in eclipse, out of src and bin folder.
-
if you want to import other files as input. change the inputPath String in the Main.java
String inputPath = "input.txt";
-
the AST.xml will be generated automatically in the Java Project file folder. So if want the output into other folders, change the outputPath in the Main.java
String outputPath = "AST.xml";
-
The Main.java will not be generated automatically by javacc. So if you need to use the parser for other purposes, please read the Main.java File and modify the file.
-
If using JavaCC for the first time. The JavaCC Tutorial is a very good reference to learn how to use it. After following all the instructions in the tutorial, I believe this project is quite understandable.
-
Email me if have any other questions regarding this project, email is in the Main.java File.
Finally, thanks to the great help from Professor Shashi Gadia and Qingqin Hou.
Best
Tianxiang
Apr 5 2016