Natural languages make infinite use of finite means (for example, grammar and vocabulary). Humans generally have no trouble understanding a language: they acquire its nuances in specific contexts and learn how to express themselves. So how is it possible for a computer program connected to structured databases or unstructured data repositories to interact with someone in their native language?
This 100-page monograph provides answers to this question by investigating two important areas of natural language processing (NLP) research: how to format and manage unstructured linguistic data repositories, and how to design natural language interfaces for structured data in databases. After establishing what natural language data management and natural language interfaces to databases (NLIDB) have in common in terms of linguistic methodologies and techniques, the authors provide a quick run-through of basic linguistic concepts (syntax, morphology, semantics) before delving into the modeling and querying of linguistic data.
The chapter on natural language data management looks at data sources (for example, data quality and language format, that is, spoken versus written) and at data models ranging from template-based to more formal syntactic or semantic representations. Queries over such data models range from Boolean keyword queries and grammar-based searches for general document-level retrieval to text-pattern and tree-pattern queries for word-level extraction. Various types of indexing that support richer query options are discussed, as are meaning representations and tools like WordNet, PropBank, and FrameNet (pages 36 and 37) that allow for inferences and reasoning in multiple contexts.
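To make the document-level end of that spectrum concrete, here is a minimal sketch of a Boolean keyword query over an inverted index. It is not taken from the monograph; the sample documents, the whitespace tokenizer, and the function names are assumptions made purely for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization (an assumption; real systems
        # normalize punctuation, stems, and so on).
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return the ids of documents that contain every term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Return the ids of documents that contain at least one term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.union(*postings) if postings else set()

# Hypothetical three-document collection.
docs = {
    1: "the parser builds a syntax tree",
    2: "keyword search over a document index",
    3: "a tree pattern query over a syntax index",
}
index = build_inverted_index(docs)

both = boolean_and(index, "syntax", "tree")      # documents with both terms
either = boolean_or(index, "keyword", "parser")  # documents with either term
```

Such a query only decides which documents match; the text-pattern and tree-pattern queries mentioned above would additionally need token positions or parse structure in the index.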
The chapter on NLIDB first presents a typical system overview, followed by a discussion of the challenges of natural language understanding in the context of interface design. Ad hoc versus controlled natural language queries, stateful versus stateless queries, and other prevalent issues with query translation are investigated. A major part of this chapter consists of a thorough hands-on analysis of existing NLIDB systems.
The basis for this book was a three-hour tutorial given at SIGMOD 2017, an annual conference sponsored by the ACM Special Interest Group on Management of Data. The intended audience includes researchers and practitioners in the field of data management, and more specifically NLP.
The presentation of the material in the book mirrors the pace and format of the underlying tutorial. It is a tour de force across multiple areas of research in NLP, covering the discipline in a very tight and informative fashion. The authors have successfully reduced a potential 1,000-page exposé to 100 pages.
What I do take issue with is the claim that unstructured natural language data management and structured NLIDB have much in common beyond processing natural language in one form or another. A user querying a database is limited to what the database designer has already defined as the universe of retrieval. So, for all intents and purposes, an NLIDB has to use the formal query language as its meaning representation (page 102). Injecting multipurpose parsing techniques or even machine learning seems to be overkill and is bound to introduce further complications (for example, ambiguities and divergent interpretations), considering that only one structure is targeted: the query language of the database. Directly parsing natural language queries into structured query language (SQL) was first done successfully in a product called Natural Language Query (NLQ), the first commercially available natural language interface for Oracle.
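The point about a single target structure can be illustrated with a deliberately small sketch that maps one controlled sentence pattern straight to a parameterized SQL statement. This is not how NLQ or any system surveyed in the book works; the schema, the sentence pattern, and the function name are hypothetical and chosen only to show the idea.

```python
import re
import sqlite3

# Hypothetical schema: the "universe of retrieval" fixed by the designer.
SCHEMA = {"employees": {"name", "dept", "salary"}}

# One controlled sentence pattern (an assumption for illustration):
#   show <column> of <table> where <column> is <value>
PATTERN = re.compile(
    r"^show (?P<col>\w+) of (?P<table>\w+) where (?P<fcol>\w+) is (?P<val>.+)$",
    re.IGNORECASE,
)

def translate(question):
    """Translate a controlled-language question into (sql, parameters)."""
    match = PATTERN.match(question.strip())
    if match is None:
        raise ValueError("question is outside the controlled language")
    col, table, fcol = (match.group(g).lower() for g in ("col", "table", "fcol"))
    # Identifiers are checked against the schema before they reach the SQL
    # text; the value is passed as a bound parameter.
    if table not in SCHEMA or not {col, fcol} <= SCHEMA[table]:
        raise ValueError("unknown table or column")
    sql = f"SELECT {col} FROM {table} WHERE {fcol} = ? ORDER BY {col}"
    return sql, (match.group("val"),)

# Usage against an in-memory SQLite database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Bob", "sales", 50000), ("Ada", "sales", 60000), ("Eve", "legal", 70000)],
)
sql, params = translate("show name of employees where dept is sales")
rows = conn.execute(sql, params).fetchall()
conn.close()
```

Anything the schema does not name is rejected up front, which is the sense in which the formal query language acts as the meaning representation.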
On a more philosophical note, even though the authors make reference to the bag-of-words model (pages 35 and 38), they never discuss its importance. Zellig Harris’ dictum that “language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use” gave rise to the two major schools of thought in NLP when it comes to how to treat linguistic data. Proponents who agree with Harris favor a top-down approach in line with a more structural analysis of linguistic data: parsing, morphology, and semantics. On the opposing side are the statisticians who claim that there is nothing but “bags of words,” which can be analyzed statistically and without regard for any linguistic structure. In fairness to both sides, it may well come down to what the purpose of NLP in a given domain is, and why the use of natural language is necessary and useful in the first place. Focusing on a particular domain and sublanguage is an issue the authors mention only in passing (page 32); however, the topic is worth further discussion. Sublanguage research demonstrates how domain-specific parsing can naturally constrain ambiguities and simplify knowledge representation and query success.
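What is at stake between the two camps can be shown in a few lines. The sketch below is my own illustration rather than anything from the monograph, and the two example sentences are made up: a bag-of-words representation keeps word counts and discards order, so sentences with opposite meanings can become indistinguishable.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercased, whitespace-separated tokens, ignoring their order."""
    return Counter(sentence.lower().split())

# Two sentences with opposite meanings (hypothetical examples).
first = bag_of_words("the dog bites the man")
second = bag_of_words("the man bites the dog")

# Both reduce to the same multiset of words, so this is True.
same_bag = first == second
```

A structural analysis would separate the two sentences by assigning different subjects and objects, which is exactly the information the bag discards.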
Readers of this monograph are rewarded with solid information about NLP tools and methodologies for natural language data management and NLIDB. As long as they keep in mind that there are significant differences in how NLP applies to structured versus unstructured data repositories, this book enables them to start applying NLP in practice and in theory.