Python Natural Language Processing

Defining context-free grammar

Now let's focus on NLU, and to understand it, first we need to understand context-free grammar (CFG) and how it is used in NLU.

A context-free grammar is defined by four main components, written symbolically as follows:

  • A set of non-terminal symbols, N
  • A set of terminal symbols, T
  • A start symbol, S, which is a non-terminal symbol
  • A set of rules called production rules P, for generating sentences
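The four components can be written down directly as a small data structure. The following is only a minimal sketch: the class name CFG, its field names, and the toy grammar are illustrative assumptions, not something taken from a library or from this book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    """Hypothetical container for the four components (N, T, S, P)."""
    nonterminals: frozenset  # N
    terminals: frozenset     # T
    start: str               # S
    productions: tuple       # P, as (lhs, rhs) pairs; rhs is a tuple of symbols

    def validate(self):
        """Return True if the components are consistent, else raise ValueError."""
        if self.nonterminals & self.terminals:
            raise ValueError("N and T must not share symbols")
        if self.start not in self.nonterminals:
            raise ValueError("the start symbol must be a non-terminal")
        symbols = self.nonterminals | self.terminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                raise ValueError(f"left-hand side {lhs!r} is not a non-terminal")
            for sym in rhs:
                if sym not in symbols:
                    raise ValueError(f"unknown symbol {sym!r} in rule for {lhs!r}")
        return True


# A toy grammar, made up for illustration: S -> A b, A -> a
toy = CFG(
    nonterminals=frozenset({"S", "A"}),
    terminals=frozenset({"a", "b"}),
    start="S",
    productions=(("S", ("A", "b")), ("A", ("a",))),
)
print(toy.validate())  # prints True
```

The check that every left-hand side is a single non-terminal is what makes the grammar context-free: a rule can be applied to a symbol regardless of what surrounds it.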

Let's take an example to get a better understanding of the context-free grammar terminology:

X -> α

Here, X -> α is called a phrase structure rule or production rule, and it is a member of P. X ∈ N means that X is a non-terminal symbol; α ∈ {N or T} means that α is made up of terminal symbols, non-terminal symbols, or both. X can be rewritten in the form of α. The rule tells you which element can be rewritten to generate a sentence, and in what order the resulting elements appear.

Now I will take a real NLP example and generate a sentence using CFG rules. We will use a simple sentence structure to keep the concepts clear.

Let's think. What are the basic elements required to generate grammatically correct sentences in English? Can you remember them? Think!

I hope you remember that noun phrases and verb phrases are important elements of a sentence. So, start from there. I want to generate the following sentence:

He likes cricket.

In order to generate the preceding sentence, I'm proposing the following production rules:

  • R1: S -> NP VP
  • R2: NP -> N
  • R3: NP -> Det N
  • R4: VP -> V NP
  • R5: VP -> V
  • R6: N -> Person Name | He | She | Boy | Girl | It | cricket | song | book
  • R7: V -> likes | reads | sings
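To see how these rules generate the sentence, here is a minimal sketch of a leftmost derivation in plain Python. The dictionary layout and the function name leftmost_derivation are assumptions made for illustration. The entry Person Name in R6 is treated as a placeholder and left out, and Det has no expansions because the rules above do not list any.

```python
# Production rules R1-R7 as a mapping from a non-terminal to its alternatives.
RULES = {
    "S": [["NP", "VP"]],                      # R1
    "NP": [["N"], ["Det", "N"]],              # R2, R3
    "VP": [["V", "NP"], ["V"]],               # R4, R5
    "N": [["He"], ["She"], ["Boy"], ["Girl"], ["It"],
          ["cricket"], ["song"], ["book"]],   # R6 (without "Person Name")
    "V": [["likes"], ["reads"], ["sings"]],   # R7
    "Det": [],                                # no Det rule is given in the text
}


def leftmost_derivation(choices, start="S"):
    """Rewrite the leftmost non-terminal once per entry in `choices`.

    Returns the list of sentential forms, beginning with [start].
    """
    form = [start]
    steps = [list(form)]
    for rhs in choices:
        idx = next((i for i, sym in enumerate(form) if sym in RULES), None)
        if idx is None:
            raise ValueError("no non-terminal left to rewrite")
        lhs = form[idx]
        if rhs not in RULES[lhs]:
            raise ValueError(f"{lhs} -> {' '.join(rhs)} is not a production rule")
        form[idx:idx + 1] = rhs
        steps.append(list(form))
    return steps


steps = leftmost_derivation([
    ["NP", "VP"],   # R1
    ["N"],          # R2
    ["He"],         # R6
    ["V", "NP"],    # R4
    ["likes"],      # R7
    ["N"],          # R2
    ["cricket"],    # R6
])
for step in steps:
    print(" ".join(step))
# the last line printed is: He likes cricket
```

Each printed line is one sentential form, so the output reads as S, then NP VP, then N VP, and so on down to the finished sentence.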

Figure 3.2 shows the parse tree of the sentence He likes cricket:

Figure 3.2: Parse tree for the sentence using the production rules

Now, let's see how the parse tree is generated:

  • According to the production rules, we can see S can be rewritten as a combination of a noun phrase (NP) and a verb phrase (VP); see rule R1.
  • NP can be further rewritten as either a noun (N) or as a determiner (Det) followed by a noun; see rules R2 and R3.
  • VP can be rewritten as a verb (V) followed by an NP, or as just a V; see rules R4 and R5.
  • N can be rewritten as Person Name, He, She, and so on; see rule R6. Note that N here stands for the noun category and is itself a non-terminal symbol; the words on the right-hand side of R6 are the terminal symbols.
  • V can be rewritten as any of the options on the right-hand side of rule R7. V is also a non-terminal symbol, and the verbs likes, reads, and sings are terminal symbols.

By using all the rules, we have generated the parse tree in Figure 3.2.
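The same steps can be automated. The following is a minimal sketch of a top-down, backtracking parser in plain Python that builds a tree from rules R1 to R7. It is an illustration under stated assumptions, not the implementation used later in the book: the function names parse, expand, parse_sentence, and bracketed are hypothetical, Person Name is omitted from R6, and Det has no expansions because no Det rule is given.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],                      # R1
    "NP": [["N"], ["Det", "N"]],              # R2, R3
    "VP": [["V", "NP"], ["V"]],               # R4, R5
    "N": [["He"], ["She"], ["Boy"], ["Girl"], ["It"],
          ["cricket"], ["song"], ["book"]],   # R6 (without "Person Name")
    "V": [["likes"], ["reads"], ["sings"]],   # R7
    "Det": [],                                # no Det rule is given in the text
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` matches tokens from `pos`."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it must equal the current token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        for children, end in expand(rhs, tokens, pos):
            yield (symbol, *children), end


def expand(rhs, tokens, pos):
    """Yield (subtrees, next_pos) for each way the sequence `rhs` matches."""
    if not rhs:
        yield (), pos
        return
    for tree, mid in parse(rhs[0], tokens, pos):
        for rest, end in expand(rhs[1:], tokens, mid):
            yield (tree, *rest), end


def parse_sentence(sentence):
    """Return every parse tree that covers the whole sentence."""
    tokens = sentence.rstrip(".").split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


def bracketed(tree):
    """Render a nested-tuple tree in bracketed notation."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"


trees = parse_sentence("He likes cricket.")
print(bracketed(trees[0]))
# prints (S (NP (N He)) (VP (V likes) (NP (N cricket))))
```

Plain recursion is enough here because none of the rules is left-recursive; a rule such as NP -> NP PP would make this sketch loop forever and would need a different parsing strategy.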

Don't worry if you cannot generate a parse tree yet. We will see the concept and implementation details in Chapter 5, Feature Engineering and NLP Algorithms.

Here, we have seen a very basic and simple example of CFG. Context-free grammar is also called phrase structure grammar.