Types of Parsing in Compiler Design: The Full Explanation in Plain English

Have you ever tried to read a foreign language without adequately knowing the vocabulary and grammar? It can be pretty tough, right? You might be able to pick out a few words here and there, but you’re probably not going to understand the whole sentence. 

Parsing is a lot like reading and making sense of a language. In computing, it’s the process of breaking a sentence down into its component parts and figuring out how they fit together.

It’s an important part of compiler design because it allows the compiler to verify that the source code in a given programming language is syntactically correct and, thus, translatable into machine code.

Parsing is a part of the syntax analysis phase of a compiler. If the parser finds a syntax error, it will report an error message and the compiler will not be able to generate an executable program.

There are different types of parsing used in compiler design, and we’ll look at each of them in detail. But before we do, we need to understand two fundamental concepts: syntax analysis and the parse tree.

Syntax Analysis

Syntax analysis is the process of analyzing the structure of a program to determine if it adheres to the rules and requirements of a particular programming language. This requires using a set of rules called context-free grammar.

Grammar is the set of rules that defines the structure of a language. It describes the language’s syntax: its alphabet, its vocabulary, and the ways in which words and symbols combine to form valid sentences or expressions.

A context-free grammar consists of a set of production rules that describe how to generate strings in the language by replacing non-terminal symbols with other symbols. The rules are context-free in the sense that a non-terminal on the left-hand side of a rule can be replaced by the rule’s right-hand side in any context.
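
For example, a toy context-free grammar for strings of balanced parentheses needs only one non-terminal, S, and two production rules: S → ( S ) S and S → ε (the empty string). Repeatedly replacing S according to these rules generates strings such as (), (()), and ()().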

There are two main subclasses of context-free grammars: LL grammars for top-down parsing and LR grammars for bottom-up parsing. An LL grammar can be parsed by a top-down (LL) parser, while an LR grammar can be parsed by a bottom-up LR parser.

Syntax analysis uses a context-free grammar to build a parse tree, a hierarchical representation of the structure of a program.

Parse Tree

In compiler design, parsing breaks down a program’s source code to identify its grammatical structure and create a parse tree, which is then used to generate executable code.

A parse tree is a hierarchical tree data structure representing the code’s grammatical structure. Each node in the tree represents a construct in the program, and each edge represents a relationship between constructs.

The topmost node represents the start symbol of the language’s grammar, and the leaves represent the terminals of the grammar. The internal nodes of the tree represent the nonterminals of the grammar.

Image: The starting symbol of the grammar must be used as the root of the parse tree.

The tree is constructed by recursively applying the rules of the grammar until the leaves of the tree spell out exactly the terminal symbols (tokens) of the code.

A parse tree is an important tool for the compiler because it allows the compiler to understand the structure of the program and generate code accordingly.
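
To make this concrete, here is a small hand-drawn illustration, using a toy rule that says an expression E is either a number or E + E, of a parse tree for the input 3 + 4:

        E
      / | \
     E  +  E
     |     |
     3     4

The root E is the start symbol, the inner E nodes are non-terminals, and the leaves 3, + and 4 are the terminals that, read from left to right, spell out the original input.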

Let’s now dive into the various types of parsing used in compiler design, starting with one of the main ones: top-down parsing.

Top-Down Parsing

Top-down parsing is a parsing technique where the parse tree is constructed from the root down to the leaves. It starts by placing the grammar’s start symbol at the root of the parse tree and then expands non-terminals using the grammar’s production rules, trying to match the input.

If a chosen production does not lead to a match, the parser backtracks and tries another production for that non-terminal. This process continues until the entire parse tree has been fully constructed.

Top-down parsing allows the parser to generate helpful error messages for the programmer, which is one of its advantages. The parser can identify where the error occurred in the input and provide a suggestion for how to fix it. 

There are two main types of top-down parsing: recursive descent parsing and non-recursive descent parsing.

Recursive Descent Parsing

Recursive descent parsing associates a function (procedure) with each non-terminal symbol in the grammar. Each function implements the production rules for its symbol and tries to match the input against them.

The parser calls these functions recursively, starting with the function for the grammar’s start symbol, until the entire input has been matched against the terminal symbols of the grammar.
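
As a rough sketch of how this looks in practice, here is a minimal recursive descent parser written in Python for simple arithmetic expressions. The grammar, token names, and function names are illustrative and simplified; a real compiler’s parser would also build a tree and produce better error messages.

# Grammar (no left recursion): E -> T { + T }, T -> F { * F }, F -> ( E ) | id
tokens = []   # the token stream, e.g. ["id", "+", "id", "*", "id"]
pos = 0       # index of the next unread token

def peek():
    return tokens[pos] if pos < len(tokens) else None

def match(expected):
    global pos
    if peek() == expected:
        pos += 1                       # consume the token
    else:
        raise SyntaxError(f"expected {expected!r}, found {peek()!r}")

def parse_E():                         # one function per non-terminal
    parse_T()
    while peek() == "+":
        match("+")
        parse_T()

def parse_T():
    parse_F()
    while peek() == "*":
        match("*")
        parse_F()

def parse_F():
    if peek() == "(":
        match("(")
        parse_E()
        match(")")
    else:
        match("id")

def parse(input_tokens):
    global tokens, pos
    tokens, pos = input_tokens, 0
    parse_E()                          # start from the grammar's start symbol
    if pos != len(tokens):
        raise SyntaxError("unexpected tokens at end of input")

For example, parse(["id", "+", "id", "*", "id"]) succeeds, while parse(["id", "+", "*"]) raises a SyntaxError.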

The Pascal programming language is a classic example of a language whose compilers use this parsing technique. Recursive descent parsing can be inefficient for large grammars because it can lead to deep recursion.

This can cause stack overflow errors, which can be difficult to debug. In addition, when the parser cannot tell from the next token alone which production to use, recursive descent relies on backtracking.

Backtracking

Backtracking is an important recursive descent parsing technique that allows the parser to retrace its steps and try a different path if it runs into a dead end. This technique is particularly useful when dealing with ambiguous grammar or input that is difficult to parse.

The basic idea behind backtracking is to keep track of the state of the parser at each step of the parsing process. If the parser reaches a point where it cannot continue, it backtracks to the most recent point where it had a choice of which path to take, and tries a different path.

Image: In backtracking, if one derivation of a production fails, the syntax analyzer restarts the process using different rules of the same production.

For example, consider a grammar that describes the syntax of arithmetic expressions, such as “3 * (4 + 2).” At various points, the parser must choose which production to apply, for instance, whether the symbols after the opening parenthesis form a complete sub-expression or belong to some other construct.

If the parser chooses the wrong production, it may eventually reach a dead end. In this case, backtracking allows the parser to retrace its steps and try the production for the parenthesized sub-expression instead, leading to a successful parse of the input.

Backtracking can be computationally expensive, as it requires the parser to maintain a record of its state at each step of the parsing process. However, it can be a powerful tool for handling difficult input and is a key part of many top-down parsing algorithms.
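
A minimal sketch of the idea in Python, assuming a made-up rule S → a b | a c whose two alternatives share a common prefix (the rule, tokens, and function name are purely illustrative):

# Try each alternative of S in order; rewind the input position whenever one fails.
def parse_S(tokens, pos):
    saved = pos                                   # remember where this rule started
    for alternative in (["a", "b"], ["a", "c"]):
        pos = saved                               # backtrack: restore the saved position
        matched = True
        for symbol in alternative:
            if pos < len(tokens) and tokens[pos] == symbol:
                pos += 1
            else:
                matched = False
                break
        if matched:
            return pos                            # success: report how far we got
    raise SyntaxError("no alternative of S matches the input")

Calling parse_S(["a", "c"], 0) first tries “a b”, fails on the second symbol, rewinds, and then matches “a c”.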

Non-Recursive Descent Parsing

Non-recursive descent parsing is a top-down technique that uses an explicit stack, rather than recursive function calls, to keep track of the grammar symbols that still have to be matched against the input. This technique avoids the problem of deep recursion by using a loop to apply the grammar rules to the input.

In each step, the algorithm looks at the symbol on top of the stack. If it is a terminal that matches the next input token, the parser pops it and advances past that token; if it is a non-terminal, the parser pops it and pushes the right-hand side of the production chosen for it.

Non-recursive descent parsing can be more efficient than recursive descent parsing for large grammars. However, it can be more difficult to generate helpful error messages with this technique. Let’s look at a type of non-recursive parsing known as predictive parsing.

Predictive Parsing

Predictive parsing is a type of top-down parsing technique that parses an input stream of tokens by looking ahead at the next few tokens in the input to make decisions on which production rule to apply.

It is a type of LL(k) parsing, where “LL” stands for a left-to-right scan of the input that produces a leftmost derivation, and “k” refers to the number of lookahead tokens. The parser uses a predictive parsing table to decide which production rule to apply when expanding the current non-terminal symbol, based on the next input token.

Predictive parsing is deterministic and more efficient than top-down techniques that rely on backtracking, such as plain recursive descent. Because it never backtracks, it can parse any LL(k) grammar in linear time.

For example, consider the following grammar:

E → E + T | T

T → T * F | F

F → ( E ) | id

This grammar is left-recursive (E appears first on the right-hand side of its own rule), and a predictive parser cannot handle left recursion directly, so we first rewrite the grammar to eliminate it:

E → T E'

E' → + T E' | ε

T → F T'

T' → * F T' | ε

F → ( E ) | id

To create the parsing table, we then compute the FIRST and FOLLOW sets of each non-terminal symbol. For every production A → α and every terminal a in FIRST(α), we add A → α to the table entry M[A, a]; if α can derive the empty string ε, we also add the production to M[A, b] for every terminal b in FOLLOW(A).

Using the parsing table, the parser can predict which production to apply for each non-terminal symbol based on the next input token. For example, if the current non-terminal is E and the next input token is id or (, it applies E → T E'; if the current non-terminal is E' and the next token is +, it applies E' → + T E'.
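
To tie this together with the stack-based, non-recursive technique described earlier, here is a small Python sketch of a table-driven LL(1) parser for the transformed grammar. The dictionary below plays the role of the parsing table M, covers only the entries needed for this example, and uses illustrative names; error handling is kept minimal.

# Each table entry maps (non-terminal, lookahead token) to a production's right-hand side.
TABLE = {
    ("E", "id"): ["T", "E'"],        ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],   ("E'", ")"): [],        ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],        ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],   ("T'", "+"): [],
    ("T'", ")"): [],                 ("T'", "$"): [],
    ("F", "id"): ["id"],             ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]          # "$" marks the end of the input
    stack = ["$", "E"]               # start symbol on top of the stack
    i = 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            rhs = TABLE.get((top, tokens[i]))
            if rhs is None:
                raise SyntaxError(f"no rule for {top} with lookahead {tokens[i]!r}")
            stack.extend(reversed(rhs))   # expand the non-terminal
        elif top == tokens[i]:
            i += 1                        # terminal (or "$") matches the input: consume it
        else:
            raise SyntaxError(f"expected {top!r}, found {tokens[i]!r}")

For instance, ll1_parse(["id", "+", "id", "*", "id"]) runs to completion without raising an error.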

ANTLR (ANother Tool for Language Recognition) is an LL(*) parser generator that uses a top-down approach to parsing, and it can generate parsers in several target languages, including Java and Python.

Bottom-Up Parsing

Bottom-up parsing starts at the leaves of the parse tree and works its way up to the root. It begins with the input tokens, which form the leaves of the tree, and repeatedly groups symbols together and reduces them to the non-terminal on the left-hand side of a production, until only the grammar’s start symbol remains at the root. Below are the two main types of bottom-up parsing.

Shift Reduce Parsing

One common implementation of bottom-up parsing is shift-reduce parsing. The implementation involves a parser that reads the input code from left to right and builds up a stack of symbols that match the input code.

In other words, the parser “shifts” input symbols onto the stack until it can “reduce” them into a higher-level symbol. This process repeats until the entire input has been parsed, and a parse tree is built.
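
As a worked illustration, here is roughly how a shift-reduce parser could process the input id + id * id using the expression grammar E → E + T | T, T → T * F | F, F → ( E ) | id introduced earlier ($ marks the end of the input):

Stack            Input             Action
$                id + id * id $    shift
$ id             + id * id $       reduce by F → id
$ F              + id * id $       reduce by T → F
$ T              + id * id $       reduce by E → T
$ E              + id * id $       shift
$ E +            id * id $         shift
$ E + id         * id $            reduce by F → id
$ E + F          * id $            reduce by T → F
$ E + T          * id $            shift
$ E + T *        id $              shift
$ E + T * id     $                 reduce by F → id
$ E + T * F      $                 reduce by T → T * F
$ E + T          $                 reduce by E → E + T
$ E              $                 accept

The parser shifts tokens until the symbols on top of the stack form the right-hand side of some production, then reduces them to that production’s non-terminal; the input is accepted once only the start symbol E remains.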

Image: Shift-reduce parsing is the process of reducing a string to the start symbol of a grammar.

This type of parsing is useful for detecting errors in code because the parser only ever has two moves. If it cannot yet reduce the symbols on top of the stack, it can “shift” additional input symbols onto the stack until a reduction becomes possible.

Alternatively, once the top of the stack matches the right-hand side of a production, it can “reduce” those symbols to the production’s non-terminal and continue; if neither move is possible, a syntax error is reported. Parser generators such as Yacc and Bison, which have been used to build compilers for languages like C, rely on variants of shift-reduce parsing to parse and compile code.

LR Parser

LR (left-to-right scan, rightmost derivation in reverse) parsing is a more complex bottom-up parsing technique. It uses a finite state machine to recognize the input stream, driven by a parsing table that maps each parser state and lookahead token to an action: shift, reduce, accept, or report an error.

The LR parser works by repeatedly reading tokens from the input stream and moving from state to state according to the table. When the current state and lookahead token call for a reduction, the parser pops the symbols of a production’s right-hand side off the top of the stack, replaces them with the production’s non-terminal, and moves to the state the table specifies.

When the table calls for a shift instead, the parser pushes the lookahead token onto the stack and moves to the next state. This process continues until the parser either accepts the input or reports a syntax error.

A common example of an LR parser is the YACC (Yet Another Compiler Compiler) parser generator, which is used to generate parsers for languages like C and Ruby.

Generally, bottom-up LR parsing has an edge over top-down parsing: every LL(k) grammar is also an LR grammar, and LR parsers additionally handle constructs, such as left-recursive rules, that top-down parsers cannot. Like other shift-reduce techniques, LR parsing runs in linear time and never needs to backtrack.

However, implementing these types of parsers is usually more complicated than implementing top-down parsers and often requires more memory.

Error Reporting and Error Recovery

While parsing a program, the compiler may encounter syntax errors. These occur when the compiler encounters a sequence of tokens that do not match any production rule in the grammar.

These errors can cause the compiler to stop parsing and exit, which can be frustrating for the user. Error recovery is a mechanism that allows the compiler to continue parsing after encountering a syntax error.

Error reporting involves identifying the location of the error and reporting it to the user. The compiler can display a message to the user indicating the line number and the type of error.

Error recovery, on the other hand, is the process of continuing the parsing process after encountering a syntax error. There are several strategies for error recovery in parsing, including panic mode recovery, error production, and local correction.

During panic-mode recovery, the parser discards tokens after an error until it reaches a designated synchronizing token, such as a semicolon or closing brace, from which parsing can safely resume. This strategy is simple and easy to implement, but it can skip over a stretch of input without checking it for further errors.
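
A minimal sketch of panic-mode recovery in Python, assuming a set of synchronizing tokens such as ";" and "}" (the token set and the helper function are illustrative):

# Skip input tokens until a synchronizing token is found, then resume parsing after it.
SYNC_TOKENS = {";", "}"}

def panic_mode_recover(tokens, pos):
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1                                  # discard tokens that cannot be parsed
    return pos + 1 if pos < len(tokens) else pos  # continue just past the synchronizing token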

Error productions are extra rules added to the grammar to describe common syntax mistakes. When the parser encounters such a mistake, it can match it with an error production, recover from the error, and continue parsing.

Local correction involves making small modifications to the input string to correct the error. It’s more complex than panic mode, but it can result in better error recovery.

The Wrap-Up

Parsing is an important part of compiler design. It allows the compiler to verify that the source code is syntactically correct. There are two main types of parsing: top-down parsing and bottom-up parsing.

Top-down parsing is generally easier to implement, but it cannot handle left-recursive grammar rules directly and can become inefficient when backtracking is required. Bottom-up parsing is generally more powerful and efficient, but it can be more difficult to implement.

In practice, most compilers will use a combination of both top-down and bottom-up parsing techniques, as they offer complementary strengths and can handle a broader range of grammar.

Types of Parsing in Compiler Design: The Full Explanation in Plain English FAQs (Frequently Asked Questions) 

What is parsing in compiler design?

Parsing is the process of analyzing source code to determine its structure according to a formal grammar. It is important because it allows the compiler to understand code in a particular programming language and translate it into executable machine code.

What is grammar in parsing?

In parsing, grammar refers to a set of rules that define the structure of a programming language or natural language. It specifies how different elements of the language, such as keywords, operators, and variables, can be combined to form valid statements and expressions.

What is a parse tree?

A parse tree is a graphical representation of the structure of the source code, showing how each component of the code is related to each other. It relates to parsing in compiler design because it is the output of the parsing process and is used to guide subsequent stages of the compilation process.

What’s the difference between top-down and bottom-up parsing?

The main difference lies in the order in which the parser builds the parse tree. Top-down parsing starts from the root of the parse tree and expands the grammar’s production rules downward to match the input. Bottom-up parsing, on the other hand, starts with the input tokens and works up the parse tree toward the root.

What’s the difference between recursive and non-recursive parsing?

Recursive parsing, also known as recursive descent parsing, is a top-down parsing technique where the parser starts at the root of the parse tree and recursively works its way down to the leaves.

Non-recursive parsing, also known as iterative parsing, is a parsing technique where the parser uses a stack to keep track of the parsing state. It iteratively applies parsing rules and pushes or pops elements from the stack as needed to construct the parse tree.

What is context-free grammar, and why is it important in parsing?

Context-free grammar is a formal notation used to describe the structure of programming languages, allowing the parser to understand the language’s syntax. It is important in parsing because it provides a formal basis for analyzing and processing programming languages.
