Welcome to Project LEO! There are two parts to this document. The first part is a simplified way of doing it, written by me. The second part is from the reference site for our dragon book (http://dragonbook.stanford.edu/lecture-notes.html). While it is more complicated, you can do a lot more with the procedure in the second section.

Instructions for writing grammars:

First, in a Word doc or similar, write your start symbol. For example:

@ <- start symbol

After every declaration, put an arrow and then what it expands to. For example:

number <- 0|1|2|3|4|5|6|7|8|9

In the above example, I included the pipe ( | ) or "or" sign. Use this to represent something that could be one thing or another. Here is a complete example:

equation <- start symbol
equation <- number + number | number - number | number * number | number / number
number <- 0|1|2|3|4|5|6|7|8|9

Remember: all instructions MUST end in a terminal. Have fun!

Alan

Part 2: Formal Grammars

Handout written by Maggie Johnson and Julie Zelenski.

What is a grammar?

A grammar is a powerful tool for describing and analyzing languages. It is a set of rules by which valid sentences in a language are constructed. Here's a trivial example of English grammar:

sentence –> <subject> <verb-phrase> <object>
subject –> This | Computers | I
verb-phrase –> <adverb> <verb> | <verb>
adverb –> never
verb –> is | run | am | tell
object –> the <noun> | a <noun> | <noun>
noun –> university | world | cheese | lies

Using the above rules or productions, we can derive simple sentences such as these:

This is a university.
Computers run the world.
I am the cheese.
I never tell lies.

In addition to several reasonable sentences, we can also derive nonsense like "Computers run cheese" and "This am a lies". These sentences don't make semantic sense, but they are syntactically correct because they follow the sequence of subject, verb-phrase, and object. Formal grammars are a tool for syntax, not semantics. We worry about semantics at a later point in the compiling process. In the syntax analysis phase, we verify structure, not meaning.

Vocabulary

We need to review some definitions before we can proceed:

grammar: a set of rules by which valid sentences in a language are constructed.

nonterminal: a grammar symbol that can be replaced/expanded to a sequence of symbols.

terminal: an actual word in a language; these are the symbols in a grammar that cannot be replaced by anything else. "Terminal" is supposed to conjure up the idea that it is a dead end: no further expansion is possible.

production: a grammar rule that describes how to replace/exchange symbols. The general form of a production for a nonterminal is:

X –> Y1Y2Y3...Yn

The nonterminal X is declared equivalent to the concatenation of the symbols Y1Y2Y3...Yn. The production means that anywhere we encounter X, we may replace it by the string Y1Y2Y3...Yn. Eventually we will have a string containing nothing that can be expanded further, i.e., it will consist of only terminals. Such a string is called a sentence. In the context of programming languages, a sentence is a syntactically correct and complete program.

derivation: a sequence of applications of the rules of a grammar that produces a finished string of terminals. A leftmost derivation is one where we always substitute for the leftmost nonterminal as we apply the rules (we can similarly define a rightmost derivation). A derivation is also called a parse.
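To make derivation concrete, here is a minimal Python sketch (the dictionary encoding and the function name are my own choices for illustration, not part of the handout) that performs a random derivation from the English grammar above, expanding each nonterminal until only terminals remain:

```python
import random

# Productions from the English grammar above, as a dict mapping each
# nonterminal to its list of alternatives. Any symbol without an entry
# in the dict is a terminal.
GRAMMAR = {
    "sentence":    [["subject", "verb-phrase", "object"]],
    "subject":     [["This"], ["Computers"], ["I"]],
    "verb-phrase": [["adverb", "verb"], ["verb"]],
    "adverb":      [["never"]],
    "verb":        [["is"], ["run"], ["am"], ["tell"]],
    "object":      [["the", "noun"], ["a", "noun"], ["noun"]],
    "noun":        [["university"], ["world"], ["cheese"], ["lies"]],
}

def derive(symbol):
    """Expand a symbol until only terminals remain (a derivation)."""
    if symbol not in GRAMMAR:                  # terminal: a dead end
        return [symbol]
    production = random.choice(GRAMMAR[symbol])  # pick one alternative
    words = []
    for s in production:                       # expand left to right
        words.extend(derive(s))
    return words

print(" ".join(derive("sentence")))            # e.g. "Computers run cheese"
```

Running it a few times produces both the reasonable sentences and the nonsense ones, which is exactly the point: the grammar checks syntax, not semantics.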
start symbol: a grammar has a single nonterminal (the start symbol) from which all sentences derive:

S –> X1X2X3...Xn

All sentences are derived from S by successive replacement using the productions of the grammar.

null symbol: ε. It is sometimes useful to specify that a symbol can be replaced by nothing at all. To indicate this, we use the null symbol ε, e.g., A –> B | ε.

Productions are often defined in terms of themselves. For example, a list of variables in a programming language grammar could be specified by this production:

variable_list –> variable | variable_list , variable

Such productions are said to be recursive.

Mark: tts, art, OOP, AI & drivers

Sub packages:
drivers
-- art - Morgan, David
-- sound - Mark
-- network - Alan
AI - Alan
Crypto - Nathan
core - all

Go here for the interfacing guide: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-manual-325462-rmver.html

Go here for the Google URL guide: http://cdn.yoast.com/wp-content/uploads/2007/07/google-url-parameters.pdf

Lexer info starts here:

To implement a lexical analyzer by hand, it helps to start with a diagram or other description of the lexemes of each token. We can then write code to identify each occurrence of each lexeme in the input and to return information about the token identified. (More info on all these words later.)

The Role of the Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

Since the lexical analyzer is the part of the compiler that reads the source text, it may perform certain other tasks besides identification of lexemes. One such task is stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input). Another task is correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer may keep track of the number of newline characters seen, so it can associate a line number with each error message. In some compilers, the lexical analyzer makes a copy of the source program with the error messages inserted at the appropriate positions.

Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
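To make that job concrete, here is a minimal hand-written lexer sketch in Python for the little equation grammar from Part 1. The token names, the Token tuple, and the # comment syntax are illustrative assumptions, not a fixed interface; the sketch also groups runs of digits into one lexeme to show lexeme grouping, even though the Part 1 grammar only has single digits:

```python
from collections import namedtuple

Token = namedtuple("Token", ["name", "attribute", "line"])

OPERATORS = {"+": "PLUS", "-": "MINUS", "*": "TIMES", "/": "DIVIDE"}

def lex(source):
    tokens, line, i = [], 1, 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":                     # track line numbers for error messages
            line += 1
            i += 1
        elif ch in " \t":                  # strip whitespace between tokens
            i += 1
        elif ch == "#":                    # strip comments to end of line
            while i < len(source) and source[i] != "\n":
                i += 1
        elif ch.isdigit():                 # group consecutive digits into one lexeme
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(Token("NUMBER", int(source[start:i]), line))
        elif ch in OPERATORS:
            tokens.append(Token(OPERATORS[ch], ch, line))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} on line {line}")
    return tokens

print(lex("12 + 7 * 3  # a comment"))
# [Token(name='NUMBER', attribute=12, line=1), Token(name='PLUS', ...), ...]
```

Note how the comment and the whitespace never reach the output: the parser downstream only ever sees the clean token stream.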
Lexical Analysis Versus Parsing

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases.

1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.

3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Tokens, Patterns, and Lexemes

When discussing lexical analysis, we use three related but distinct terms:

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will often refer to a token by its token name.

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Attributes for Tokens

When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched. For example, the pattern for token number matches both 0 and 1, but it is extremely important for the code generator to know which lexeme was found in the source program. Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse.

We shall assume that tokens have at most one associated attribute, although this attribute may have a structure that combines several pieces of information. The most important example is the token id, where we need to associate with the token a great deal of information. Normally, information about an identifier, e.g., its lexeme, its type, and the location at which it is first found (in case an error message about that identifier must be issued), is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
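As a sketch of that last idea, here is one way an id token could carry a pointer into the symbol table as its attribute. The table layout and the helper name are illustrative assumptions, not the book's interface:

```python
# Symbol table: a list of entries, where the list index serves as the
# "pointer" that becomes the token's attribute value.
symbol_table = []
lookup = {}      # lexeme -> index, so repeated identifiers share one entry

def install_id(lexeme, line):
    """Return the (token name, attribute) pair for an identifier lexeme."""
    if lexeme not in lookup:
        lookup[lexeme] = len(symbol_table)
        symbol_table.append({"lexeme": lexeme,
                             "type": None,            # filled in later phases
                             "first_seen_line": line})
    return ("id", lookup[lexeme])   # attribute = symbol-table pointer

print(install_id("count", 3))   # ('id', 0)
print(install_id("total", 4))   # ('id', 1)
print(install_id("count", 9))   # ('id', 0) -- same entry reused
```

The parser only ever sees the abstract pair ('id', 0); later phases follow the pointer to recover the lexeme, its type, and where it was first seen.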