<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://dfwu.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dfwu.github.io/" rel="alternate" type="text/html" /><updated>2026-03-12T13:05:25-07:00</updated><id>https://dfwu.github.io/feed.xml</id><title type="html">Danfeng Wu</title><subtitle>My personal website</subtitle><author><name>Danfeng Wu</name><email>danfeng.wu@unige.ch</email></author><entry><title type="html">The point of a written language is not to record speech</title><link href="https://dfwu.github.io/posts/2020/02/orthography/" rel="alternate" type="text/html" title="The point of a written language is not to record speech" /><published>2020-02-08T00:00:00-08:00</published><updated>2020-02-08T00:00:00-08:00</updated><id>https://dfwu.github.io/posts/2020/02/orthography</id><content type="html" xml:base="https://dfwu.github.io/posts/2020/02/orthography/"><![CDATA[<p>What’s the point of written language (aka orthography)? Assuming it is a human invention, why was it created? A simple assumption is that it records spoken language, so that information can be transmitted in more permanent and replicable ways. A reader who sees the written text can reconstruct the speech in the physical absence of the speaker. This makes sense conceptually because we know that in a spoken language, information is generally transmitted through speech. Then any written form that captures the uttered sounds in speech can transmit the information. This idea also has its empirical support: all written languages in the world postdate their spoken counterparts, and are “glottographic” - they all record uttered sounds with symbols.</p>

<p>In this post I argue based on case studies of the English and Chinese writing systems that this idea is in fact wrong. <!-- more -->Both writing systems leave out critical aspects of speech (I call this <em>“under-documentation” of speech</em>, where some meaningful differences in speech are lost in writing), while at the same time over-specifying other aspects of speech (<em>“over-documentation” of speech</em>, where the same sound may correspond to different symbols). I have a speculation (based on a small number of observations) as to which aspects of speech tend to be under-documented, and which aspects tend to be over-documented. Sentence-level sound information (e.g. intonation and pitch) tends to be under-documented, whereas word-level sound information tends to be over-documented. Furthermore, I argue that over-documentation is a result of an important property of orthography: it should avoid ambiguities in the mapping between form and meaning. </p>

<p>First, I will talk about how English orthography (or any existing writing system that I know of) under-documents certain sentence-level information. Before doing so, here’s what I mean by “sentence-level” and “word-level”. They are the two levels on which humans convey information through uttered sounds: (sentence-level) prosody, and (word-level) phonology. On the word level, different words are pronounced differently, e.g. <em>apple</em> sounds different from <em>dog</em>. In addition, given an identical string of words, prosody (intonation, prominence, and breaks) plays an important role in distinguishing their meanings. Consider (1):</p>

<p>(1) The Vikings won over their enemies.</p>

<p>This sentence can mean two things: “The Vikings defeated their enemies” or “The Vikings persuaded their enemies.” Prosody is the only factor that distinguishes them: there is a greater break between <em>over</em> and <em>their</em> in the former meaning than in the latter, and <em>over</em> is more prominent in the former meaning than in the latter. Clearly English orthography does not transcribe prosody, hence creating ambiguity with texts like (1). </p>

<p>While sentence-level features such as prosody are under-documented, word-level features are often over-documented. In many writing systems, words with identical pronunciation but different meanings (homophones) may have different spellings, such as <em>there</em> and <em>their</em> in English.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> </p>

<p>If the sole purpose of orthography is to record uttered sounds, then orthography should not bother to distinguish words with identical sounds because they are not distinguished in speech anyway. But this move makes sense if orthography is designed to avoid ambiguities, to ensure that a reader is not confused when reading a string of words. While ambiguities may arise in speech due to accidental homophony, they are avoided in writing whenever possible.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<p>When homophony arises, the obvious solution is to create different spellings for the homophones, as we saw in English. However, spoken English has a relatively small number of homophones. What happens in a language with a large number of homophones? Designing the orthography for such a language will be a complex problem, as there are opposing factors to consider. On the one hand, orthography should avoid ambiguities, such that homophones should appear differently in writing. But since there are so many homophones, this will generate a large number of symbols to distinguish them, making them difficult to learn and memorize. Therefore, an efficient writing system should avoid ambiguities, while keeping its symbols easy to learn. </p>

<p>Mandarin Chinese is an ideal case study due to its large number of homophones. Chinese orthography (Simplified Chinese and Traditional Chinese) may appear to be unnecessarily complicated at first glance, but once we view it as a balancing act between opposing factors, it actually makes a lot of sense. </p>

<p>Let’s begin by examining Simplified Chinese. Just like all the writing systems in the world, it encodes the segmental properties of a word - its composition of consonants and vowels. For example, the sound <em>ma</em> is a combination of the consonant m and the vowel a. The smallest unit of Chinese orthography is a character, which often consists of at least two parts (aka radicals), one of which encodes the segmental property. Take 妈 (mā) ‘mother’ as an example. It has two radicals, one on the left 女 and one on the right 马. The radical on the right 马 (mǎ) is identical to 妈 (mā) in its consonant-and-vowel combination, so this radical encodes the segmental properties of 妈 (mā). Let’s call 马 the <em>segmental radical</em> of 妈.</p>

<p>To appreciate the existing Chinese orthography, let’s try designing an alternative system. Imagine we are commissioned by a skeptic of the current writing system to design an alternative. Where do we start? It makes sense for orthography to encode the segmental properties, so suppose we have decided that for the character for ‘mother’, we should at least have the radical 马. What other features should the character include? </p>

<p>An obvious answer is the tone. Like segments, tone is a “lexical” property - it helps the listener determine which lexical item / word the speaker is using. The same combination of consonants and vowels can have different meanings depending on which tone it carries (<em>ma</em> means ‘mother’ with one tone, and ‘hemp’ with another tone). Therefore, just having the segments <em>ma</em> does not tell us exactly which word it is. There are 5 tones in total, so encoding the tones is easy. We just need to add a symbol that can make a 5-way distinction (e.g. with numeral “radicals”: 马1, 马2, 马3, 马4, 马5). </p>

<p>If the sole function of orthography is to record speech, we should be done now — each character does encode the lexical, phonological property of a word, including information about its segments and tone. But as we just saw with English, orthography should avoid ambiguities created by homophony. </p>

<p>Mandarin Chinese has a large number of homophones, even when taking tone into account. The segmental and tonal combination mǎ, for example, is ambiguous between 马 ‘horse’, 码 ‘code’, 蚂 ‘ant’, 玛 ‘agate’, and so on. Just knowing the phonological properties of the word is not enough.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>

<p>To disambiguate all these meanings, let’s use the simple dummy numeral “radical” again. But instead of assigning numbers to tones, let’s assign numbers to the different meanings that the segments <em>ma</em> can have, say 马1 is ‘mother’, 马2 is ‘hemp’, 马3 is ‘horse’, 马4 is ‘code’, 马5 is ‘ant’, and so on. Then for a different segmental combination such as <em>ba</em> (whose segmental radical is 巴), we use the numeral radical again, so that 巴1 is ‘father’, 巴2 is ‘take’, 巴3 is ‘palladium’, and so on. We repeat this numeral assignment process for every possible segmental combination in Chinese (<em>pa</em>, <em>na</em>…).</p>

<p>This alternative system is clearly cumbersome and difficult to learn. Learners of this system would have to memorize exactly which number corresponds to which meaning. It does not help that the number–meaning combination is completely arbitrary. For example, there is no clear relation between the number one and ‘mother’, or between number two and ‘hemp’. Furthermore, the memorized knowledge of number–meaning combinations for <em>ma</em> does not transfer to <em>ba</em>. After memorizing the combinations for <em>ma</em>, the learner has to start over, and memorize the combinations for <em>ba</em> again. </p>

<p>To facilitate learning, rather than arbitrary numeral radicals, we can assign slightly more meaningful radicals to meanings. To ‘mother’, we can assign a radical that reminds the learner that the word has to do with motherhood, but not hemp, horse, code, or any other confusable meaning corresponding to <em>ma</em>. </p>

<p>This system is actually very similar to the existing one. In the existing system, 妈 ‘mother’ has a radical on the left side 女 in addition to the segmental radical on the right 马. This left-side radical has nothing to do with the phonological property of ‘mother’, but vaguely suggests its meaning. 女 means female, which distinguishes 妈 ‘mother’ from all the other meanings with the same segments <em>ma</em> because 妈 ‘mother’ is the only one that is related to femaleness. Let’s call this left-side radical of 妈 its <em>meaning radical</em>. It is essentially an advanced version of the dummy numeral radical I proposed. It is a mnemonic that helps the learner memorize the semi-arbitrary grapheme-to-meaning assignments.</p>

<p>Having a meaning radical also allows the learner to transfer knowledge from one segmental combination to another. For example, meaning radicals assigned to <em>ma</em> can be reused for <em>ba</em>, so that a learner does not have to start over again. The meaning radical 口 (suggesting the character is related to mouths) can be combined with the segmental radical for <em>ma</em>, generating 吗, a sentence-final particle (it is vaguely related to mouths due to its discourse/speech-related function). It can also be combined with the segmental radical for <em>ba</em>, generating 吧, another sentence-final particle. Every character has a segmental radical, and most of them also have a meaning radical.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup><sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> There are over 200 radicals in total.</p>

<p>This raises a further question: we know the existing system is pretty good for its ability to disambiguate homophones while being relatively easy to learn, but exactly how good is it? How efficient is it at assigning meaning radicals to different meanings? Given a segmental radical, a maximally efficient system should make all the necessary meaning distinctions (such that there is no ambiguity), while keeping the total number of meaning radicals as low as possible. In other words, it should have the lowest number of redundant radicals possible.</p>

<p>A good designer of Chinese orthography should achieve this goal by first examining all the meanings that share the same segments (e.g. for <em>ma</em>, it is ‘mother’, ‘hemp’, ‘horse’, ‘code’, ‘ant’, ‘agate’, etc.). Then they should create the meaning radicals, so that every meaning has a distinct meaning radical (for <em>ma</em>, we should create a meaning radical for ‘mother’, a meaning radical for ‘hemp’, a meaning radical for ‘horse’, and so on). Then in order to minimize the redundancy of these meaning radicals, the designer should make sure that the meaning radicals they created for <em>ma</em> are maximally reusable for the meaning space of the other segments, e.g. <em>ba</em>. <a href="http://www.foldl.me/">Jon Gauthier</a> and I are currently working to find the most efficient allocation of meaning radicals given the segments, and to compare this allocation system with the existing Chinese system.</p>
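<p>This allocation problem can be prototyped directly. Below is a minimal sketch with invented toy data and a simple greedy heuristic (not the actual method Jon and I are developing): each meaning comes with a set of plausible semantic tags, radicals must be distinct within a homophone set, and we prefer reusing a tag already in the global inventory before minting a new one.</p>

```python
# Toy sketch: assign one "meaning radical" per homophone so that radicals are
# distinct within each homophone set, while reusing radicals across sets to
# keep the global inventory small. Data and tags are invented for illustration.

# Candidate semantic tags each meaning could plausibly take (hypothetical).
candidates = {
    "ma": {
        "mother":     ["female", "person"],
        "horse":      ["animal"],
        "ant":        ["insect", "animal"],
        "Q-particle": ["mouth"],
    },
    "ba": {
        "father":   ["person", "male"],
        "particle": ["mouth"],
        "scar":     ["body"],
    },
}

inventory = set()   # meaning radicals minted so far (global)
assignment = {}     # (segments, meaning) -> chosen radical

for segments, meanings in candidates.items():
    used_here = set()  # radicals already taken within this homophone set
    for meaning, tags in meanings.items():
        # Prefer a tag already in the global inventory (reuse), then any free tag.
        reusable = [t for t in tags if t in inventory and t not in used_here]
        free = [t for t in tags if t not in used_here]
        choice = (reusable or free)[0]
        inventory.add(choice)
        used_here.add(choice)
        assignment[(segments, meaning)] = choice

print(len(inventory), "radicals cover", len(assignment), "meanings")
```

<p>A greedy pass like this only gives an upper bound on the inventory size; minimizing it exactly resembles a list-coloring problem and can require search.</p>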

<p>So far, we have taken the segmental radical as a given, and asked how to manage the inventory of meaning radicals most efficiently. If we can alter the structure of the segmental radical, the writing system can be improved significantly. Suppose that instead of the segmental radical, we use a “phonological” radical, which encodes <em>both</em> the segmental properties and the tone. For example, we can borrow the four tonal diacritics from the Romanization system <em>pinyin</em> (leaving the neutral tone unmarked), so that the phonological radicals for mā, má, mǎ, mà, and ma are 马̄, 马́, 马̌, 马̀, and 马 respectively. Then we can significantly reduce the number of meaning radicals needed because the confusable meanings given the segments <em>and</em> the tone (e.g. mā) are way fewer than the confusable meanings given just the segments (i.e., where the tone is not specified, e.g. ma). This system requires a combination of segmental radicals like 马, the four tonal diacritics, plus a small number of meaning radicals. The radicals required by this system are far fewer than what’s required by the existing system, which demands a combination of segmental radicals plus a large number of meaning radicals. </p>

<p>Therefore, the existing Chinese orthography provides a pretty good solution, but there is certainly room for improvement. If we believe that languages evolve over time to become more efficient (a common assumption in the study of language evolution), does this assumption apply to writing systems, too? If so, why doesn’t Chinese orthography follow obvious ways to improve itself, e.g. by adopting the system I just outlined? Chinese orthography has changed a lot over its history; did these changes occur as an effort to improve? A speculation is that some changes are more costly to adopt than others. It may be cheaper to add or delete some meaning radicals under the existing framework than to change the framework itself (e.g. replace all segmental radicals with phonological radicals). I leave this question of cost-sensitive, incremental evolution for another post.</p>

<p>I’ve argued in this post that written language doesn’t likely serve the function of recording speech, as it both over-documents and under-documents important features of spoken language. I suggested instead that the function of written language might be to clearly map to word meanings, and drew on evidence from Mandarin Chinese to show that distinguishing lexical meanings already plays a role in the structure of a writing system. It is not clear that written Chinese distinguishes lexical meanings efficiently or perfectly. In future posts, I will further explore alternative writing systems based on the Mandarin example to ask how efficient this system is.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Interestingly, while English over-documents certain aspects of word-level phonology (by spelling out homophones differently), it under-documents other aspects of word-level phonology, specifically lexical stress. English stress is a lexical property, as the same combination of consonants and vowels can mean different things depending on the location of stress, e.g. the nominal use of <em>record</em> vs. the verbal use of <em>record</em>. Clearly English writing does not specify stress placement, hence creating ambiguity with the written word <em>record</em>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
<p>This does not always seem true, as homographs are often tolerated. These are written forms that correspond to different meanings, e.g. <em>lead</em> in English (ambiguous between the act of directing and a metal). Unlike in writing, the different meanings of <em>lead</em> are actually distinguished by sounds ([li:d] vs. [lɛd]), so <em>lead</em> is a case where phonology is better at disambiguation than orthography. There are also cases where homophones are not distinguished in writing, e.g. <em>light</em>, <em>bank</em>, and <em>ball</em>, where orthography does just as poorly as phonology at disambiguation. I speculate that while the goal of orthography should be to clearly map forms to meanings, occasional failures are tolerated, especially when the meanings are different enough, and the context can usually help distinguish them. They also suggest that orthography is not perfect, a point I will briefly discuss at the end of the post. This is again why Mandarin Chinese may be a more instructive case study: it has far more homophones than English, and presents a more challenging problem of distinguishing homophones in writing.  <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
<p>One might hope that the context in which the character is used can help distinguish these homophones, but context is not always helpful. One example where contexts don’t help to disambiguate meanings is names. Chinese human names are typically two to four syllables long. Organizational names are more varied in length, but in general short as well. Each syllable has its own meaning that is represented by a character, and contributes to the meaning of the name as a whole. The contexts in which names are mentioned often do not help to distinguish exactly the meaning of each syllable. However, names are probably the first thing someone learns to write when they learn to write.  <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>There are two types of characters in Chinese orthography. The first consists of radicals that compose with each other in the way I outlined, such that one radical encodes segmental information, and the other encodes meaning. The second type is a pictograph that does not consist of radicals which compose in the way I outlined. For example, 耳 (ĕr) ‘ear’ is derived from the image of an ear. An interesting example of the second type is 取 (qŭ) ‘take’, which can apparently be analyzed as consisting of two radicals, 耳 (ĕr) ‘ear’ on the left and 又 (yòu) ‘right hand’ on the right. They don’t compose in the way I outlined. Instead, together they make the image of a hand grabbing an ear, which alludes to the custom of taking the ears of dead enemy soldiers. I do not consider characters like 取 decomposable into a meaning radical and a segmental radical. For the purpose of this post, I take characters like 取 to consist of just one radical, a segmental radical. This is because 取 can combine with a meaning radical like 女 ‘female’ to generate 娶 (qŭ) ‘to marry’.  <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>The actual picture is more complicated than what I presented here. A segmental combination may be represented by different segmental radicals. For example, 马 and 麻 are both segmental radicals for <em>ma</em>. They may combine with the same meaning radical 口, generating different characters: 吗 and 嘛. Because both characters are sentence-final particles, and indicate the discourse property of the sentence, they share the meaning radical 口, which suggests the character has to do with mouth. Here what serves to disambiguate the two characters is in fact not the meaning radical, but the segmental radical. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>dfwu</name></author><summary type="html"><![CDATA[What’s the point of written language (aka orthography)? Assuming it is a human invention, why was it created? A simple assumption is that it records spoken language, so that information can be transmitted in more permanent and replicable ways. A reader who sees the written text can reconstruct the speech in the physical absence of the speaker. This makes sense conceptually because we know that in a spoken language, information is generally transmitted through speech. Then any written form that captures the uttered sounds in speech can transmit the information. This idea also has its empirical support: all written languages in the world postdate their spoken counterparts, and are “glottographic” - they all record uttered sounds with symbols. In this post I argue based on case studies of the English and Chinese writing systems that this idea is in fact wrong.]]></summary></entry><entry><title type="html">A hunch</title><link href="https://dfwu.github.io/posts/2020/01/coordinators/" rel="alternate" type="text/html" title="A hunch" /><published>2020-01-17T17:53:31-08:00</published><updated>2020-01-17T17:53:31-08:00</updated><id>https://dfwu.github.io/posts/2020/01/coordinators</id><content type="html" xml:base="https://dfwu.github.io/posts/2020/01/coordinators/"><![CDATA[<p>I think a lot about what I call “coordinators” in natural languages. They are non-local elements that nevertheless must appear together (or even appear similar in form). One example of non-local coordinators is the English disjunction coordinators <em>either…or…</em> (as in (1a)).<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> When a disjunction begins with <em>either</em>, <em>or</em> must also appear (1b) (* indicates ungrammaticality).</p>

<p>(1) a. Pat eats either rice or beans.<br />
      b. *Pat eats either rice beans.</p>

<p>Coordinators fascinate me because they relate to each other at a distance (they are usually not adjacent to each other, being separated by other words). Linguists study these long-distance relations because they reveal a key property of human language - it often relies on non-local, non-linear relations. In the case of coordinators, how does the grammar ensure that when one of them appears, the other must also appear? As a metaphor, how does one element signal to the other: “I’m here, and you’d better also appear”? Under what conditions must they cooccur, and why? This post talks about my speculation about what conditions their cooccurrence.</p>

<!-- more -->

<p>You may still wonder why I am interested in such a narrow topic - who cares about small words like <em>either</em> and <em>or</em> anyway? I will tell you why at the end of this post.</p>

<p>To begin, let’s reconsider <em>either…or…</em>. Interestingly, while <em>either</em> cannot appear without <em>or</em> (1b), <em>or</em> can appear without <em>either</em> (2). In other words, in English disjunction, <em>either</em>, the initial coordinator, is optional, whereas <em>or</em>, the second coordinator, is obligatory. </p>

<p>(2) Pat eats rice or beans.</p>

<p>Based on a small number of observations in English and Mandarin Chinese, I have a speculation for all coordinated constructions in all languages: <strong>the initial coordinator is easier to leave out than the second coordinator</strong>.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> What this means is: <em>for a construction, if the second coordinator can be dropped, then the initial coordinator must be droppable as well.</em> Imagine a hypothetical language English’ where (1b) is grammatical; then (2) must also be grammatical. There is no language where (1b) is grammatical, but (2) is not.</p>

<p>Now I will discuss a few coordinated structures in English and Mandarin Chinese, and show that they all comply with this generalization. (If you know of any counterexample in any language, let me know!)</p>

<p>English conjunction coordinators <em>both</em>…<em>and</em>… behave in parallel to <em>either</em>…<em>or</em>…: while <em>both</em> is optional in a conjoined structure, <em>and</em> is not. Mandarin Chinese has dis/conjunction coordinators that behave just like English ones, so I won’t repeat them here.</p>

<p>What is interesting about Mandarin Chinese is that it has a wider range of coordinators to study than English appears to have. For instance, Chinese has an “although…but…” construction, where <em>although</em> and <em>but</em> can cooccur (3a). Interestingly, only <em>although</em> can be dropped (3b), not <em>but</em> (3c):<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>

<p>(3) a. <strong>Suiran</strong>     ta  shule, <strong>danshi</strong>  ta   hen   gaoxing<br />
          Although   he  lost     but         he  very  happy<br />
          ‘Although he lost, he was very happy.’<br />
     b. Ta   shule, <strong>danshi</strong>  ta    hen    gaoxing<br />
         He   lost     but          he   very   happy<br />
     c. <strong>Suiran</strong>    ta   shule,  ta    hen    gaoxing<br />
           Although he  lost       he   very   happy</p>

<p>If we take <em>suiran</em> ‘although’ to be the initial coordinator and <em>danshi</em> ‘but’ to be the second one, this conforms to my generalization: <em>suiran</em> is more easily dropped than <em>danshi</em>. Mandarin Chinese has many more such constructions that behave in parallel, such as <em>yinwei</em>…<em>suoyi</em>… ‘because…therefore…’, where <em>yinwei</em> ‘because’ can be omitted but not <em>suoyi</em> ‘therefore’. For the sake of space, I do not enumerate the examples here.</p>

<p>Having examined Mandarin Chinese, we may now return to English and wonder: does English have a construction parallel to Chinese “although…but…” (or “because…therefore…”), in which the coordinators <em>although</em> and <em>but</em> simply can never cooccur (4a)?</p>

<p>(4) a. *<strong>Although</strong> she lost, <strong>but</strong> she was happy.<br />
      b. <strong>Although</strong> she lost, she was happy.<br />
      c. She lost, <strong>but</strong> she was happy.</p>

<p>Let’s set aside the question of why English does not allow the cooccurrence of <em>although</em> and <em>but</em>, but Chinese does. If we accept that English does have such a parallel construction (i.e., the single occurrence of <em>although</em> means the omission of <em>but</em> (4b), and the single occurrence of <em>but</em> means the omission of <em>although</em> (4c)), then it may appear to pose a challenge to my generalization. The initial coordinator can be dropped, and the second coordinator can as well!</p>

<p>This is why it is important to clarify: my generalization is not about banning the omission of the second coordinator altogether. It only says that the second coordinator is <em>harder</em> to omit than the initial one. In other words, (4) is not a counterexample. But if we find a language where (4b) is grammatical but (4c) is not, then it is a problem.</p>
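<p>Stated this way, the generalization is a one-line implication over omission patterns, which makes it easy to check mechanically against judgment data. Here is a sketch, with the judgments simplified from the examples in this post:</p>

```python
# Sketch: the implicational generalization — within a construction, if the
# second (noninitial) coordinator can be omitted, the initial coordinator
# must be omissible too. Judgments simplified from the post's examples.

def complies(initial_omissible: bool, second_omissible: bool) -> bool:
    # Forbidden pattern: second droppable while the initial one is not.
    return initial_omissible or not second_omissible

judgments = {
    # construction: (initial omissible?, second omissible?)
    "English either...or":      (True, False),
    "English both...and":       (True, False),
    "Mandarin suiran...danshi": (True, False),
    "Mandarin yinwei...suoyi":  (True, False),
    "English although...but":   (True, True),   # (4b) and (4c) both fine
}

assert all(complies(i, s) for i, s in judgments.values())
```

<p>The only pattern the implication rules out is (initial not omissible, second omissible), which the generalization predicts no language exhibits.</p>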

<p>If this generalization is true, why is it so? I offer nothing but another (very vague) speculation that it may have to do with information structure, i.e. how humans organize and think about information. Maybe the later part of a sentence is naturally more prominent than the earlier part. Then in a coordinated structure, we want to make the information in the second part more salient and clear than the information in the first part, possibly by not dropping the second coordinator as much as the initial one.</p>

<p>By the way, this is also why linguists are interested in these seemingly narrow and small questions. They offer insights into how humans think about and organize information. If my generalization is universally true, then it is even more striking: no matter what language a human speaks, they follow the same principle in arranging coordinators.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Although <em>either</em> and <em>or</em> do not look similar, the negated disjunction coordinators <em>neither</em> and <em>nor</em> do, both beginning with <em>n-</em>, which marks their negative property. In many languages, the disjunction coordinators are even identical. Polish is a language that showcases this property nicely. Its counterpart of “either…or…” is <em>albo</em>…<em>albo</em>…; its “neither…nor…” (the negative version of “either…or…”) is <em>ani</em>…<em>ani</em>…; and its “whether…or…” (the <em>wh</em>-version of “either…or…”) is <em>czy</em>…<em>czy</em>…. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I use the phrase <em>second coordinator</em> (for simplicity), though I really mean <em>noninitial coordinator</em> here. This is because sometimes there are more than two coordinated parts, e.g. <em>rice</em>, <em>beans</em>, and <em>bread</em> in “Pat eats either rice or beans or bread.” The <em>or</em>s in this sentence are noninitial coordinators, and they should be harder to omit than <em>either</em>, the initial coordinator. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>(3c) would be ok if we insert <em>que</em> in the second clause, a particle that roughly means ‘but’:<br />
(i) <strong>Suiran</strong>    ta   shule, ta <strong>que</strong>   hen   gaoxing<br />
         Although he  lost     he   but    very  happy<br />
I take <em>que</em> to be a second coordinator in much the same way as <em>danshi</em> ‘but’. Then the obligatory presence of the second coordinator (<em>que</em> or <em>danshi</em>) in this construction is still compliant with my generalization. <br />
We can also use <em>haishi</em> ‘still’ instead of <em>que</em> in (i), so <em>haishi</em> is another second coordinator just like <em>que</em> and <em>danshi</em>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>dfwu</name></author><category term="Syntax" /><summary type="html"><![CDATA[I think a lot about what I call “coordinators” in natural languages. They are non-local elements that nevertheless must appear together (or even appear similar in form). One example of non-local coordinators is the English disjunction coordinators either…or… (as in (1a)).1 When a disjunction begins with either, or must also appear (1b) (* indicates ungrammaticality). (1) a. Pat eats either rice or beans.       b. *Pat eats either rice beans. Coordinators fascinate me because they relate to each other at a distance (they are usually not adjacent to each other, being separated by other words). Linguists study these long-distance relations because they reveal a key property of human language - it often relies on non-local, non-linear relations. In the case of coordinators, how does the grammar ensure that when one of them appears, the other must also appear? As a metaphor, how does one element signal to the other: “I’m here, and you’d better also appear”? Under what conditions must they cooccur, and why? This post talks about my speculation about what conditions their cooccurrence. Although either and or do not look similar, the negated disjunction coordinators neither and nor do, both beginning with n-, which marks their negative property. In many languages, the disjunction coordinators are even identical. Polish is a language that showcases this property nicely. Its counterpart of “either…or…” is albo…albo…; its “neither…nor…” (the negative version of “either…or…”) is ani…ani…; and its “whether…or…” (the wh-version of “either…or…”) is czy…czy…. 
]]></summary></entry><entry><title type="html">LSA 2020 Highlights</title><link href="https://dfwu.github.io/posts/2020/01/lsa2020/" rel="alternate" type="text/html" title="LSA 2020 Highlights" /><published>2020-01-10T21:36:46-08:00</published><updated>2020-01-10T21:36:46-08:00</updated><id>https://dfwu.github.io/posts/2020/01/lsa2020</id><content type="html" xml:base="https://dfwu.github.io/posts/2020/01/lsa2020/"><![CDATA[<p>I just came back from the <a href="https://www.linguisticsociety.org/event/lsa-2020-annual-meeting">LSA 2020 Annual Meeting</a> in New Orleans, and here are some (but not all)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> of the talks / posters I found interesting and why.</p>

<ul>
  <li>Sluicing with complement coercion: An argument for focus-based semantic identity (Hadas Kotek)
    <ul>
      <li>Kotek found that some speakers judge sentences like this one to be grammatical (through a Facebook thread!): “John is an author, and Bill is a reader. John started a novel, and Bill did too.” with the (non-joke) reading that “John started writing a novel, and Bill started reading a novel.”</li>
      <li>She claims that such examples pose a challenge to one account of ellipsis (Q-equivalence accounts), but can be accommodated within another account (focus-based account).</li>
      <li><em>My thought: Nominals can have much richer meanings than just nominals.</em>
        <ul>
          <li>For example, Baker and Grimshaw have observed that “the time” denotes a question in “She asked the time.”</li>
          <li>In the sentences observed by Kotek, maybe the object nominal denotes a richer meaning as well. For example, “the novel” may mean “any event that involves the novel”, which is compatible with both a writing event and a reading event, and creates sufficient semantic identity to license ellipsis.</li>
          <li>Some evidence for the idea that “the novel” can mean “any event that involves the novel”: assuming the subject of “takes five weeks” has to denote an event, then the sentence “The novel takes five weeks” suggests that “the novel” may denote an event rather than just the physical novel. This evidence is not strong, however, because it may be a tough-construction with an omitted infinitival: “The novel takes five weeks (to write/read).”</li>
          <li>Not only can an <em>object</em> denote an event that involves doing something to the object, but an <em>instrumental</em> element can denote an event that involves doing something with that instrumental, e.g. “Don’t fix the car with this wrench. This wrench takes forever.” In this sentence “this wrench” denotes the event of fixing the car with this wrench.</li>
          <li>Another example where a nominal has a broader meaning than just the literal nominal: “Beijing” and “Washington” in “Beijing opposes Washington’s sanctions on Chinese firms.”</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Two kinds of dislocations in Biblical Hebrew (Matthew Hewett)
    <ul>
      <li>There are two kinds of dislocations that we know exist across languages: Hanging Topic and Left Dislocation. They are distinguished by various diagnostics. These diagnostics show that Biblical Hebrew has both kinds of dislocations.</li>
      <li><em>Questions I’ve always had about making generalizations about dead languages, which the author handled well and clearly:</em>
        <ul>
          <li><em>Q: How do we know that Biblical Hebrew was not written by different speakers who spoke different dialects? If it was, the fact that both types of dislocation are attested in Biblical Hebrew cannot be taken as an indication that the language allows both types: the texts could have been written by at least two different speakers, each of whose dialects allows only one type of dislocation.</em>
            <ul>
              <li>A: This is true. However, there is an interesting entailment relation, both synchronically and diachronically, between Hanging Topic and Left Dislocation. A language that has Hanging Topic also has Left Dislocation, but not vice versa. As languages with both types of dislocation develop, they may lose Hanging Topic first, but not Left Dislocation first. The fact that we find evidence for Hanging Topic in Biblical Hebrew therefore suggests that the same speaker also allows Left Dislocation.</li>
            </ul>
          </li>
          <li><em>Q: Can researchers take absence of a certain construction in a text as evidence of its ungrammaticality? Why can’t such absence be accidental?</em>
            <ul>
              <li>A: It may be accidental. Rigorous tests are required to show, for example, that the absence is not a statistical accident.</li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Anti-pied-piping (Kenyon Branan and Michael Yoshitaka Erlewine)
    <ul>
      <li>Anti-pied-piping refers to the phenomenon where a constituent XP is focused, but only a subpart of XP moves or has a focus particle.</li>
      <li>For example, in Miyara Yaeyaman (Ryukyuan), the focus particle may attach to the direct object of a sentence, which is ambiguous between object focus and VP focus.</li>
      <li>In Kikuyu, the direct object may undergo focus movement, which is ambiguous between object focus and VP focus.</li>
      <li>Anti-pied-piping focus particle placement can feed scrambling in Japanese (which suggests that anti-pied-piping focus particle placement is not simply post-syntactic lowering): under VP focus, the direct object, to which the focus particle is attached, can scramble on its own.</li>
      <li>It is often the element on the edge of the larger focused constituent that the focus particle attaches to. This suggests that anti-pied-piping can refer to post-syntactic properties, e.g. “edgemost”.</li>
      <li><em>Why I like it: these facts are just so cool!!</em></li>
    </ul>
  </li>
  <li>Prosodic conditioning of word order in Khoekhoegowab (Leland Kusmer)
    <ul>
      <li>Khoekhoegowab is verb-final, but some tense markers precede the verb.</li>
      <li>After ruling out alternative explanations, Kusmer argues that the tense markers that precede the verb do so because they are monomoraic, and Khoekhoegowab dislikes phrases that (begin or) end in a monomoraic morpheme.</li>
      <li><em>Sam Zukoff’s question, which I found interesting and thoughtful:</em> we know that phonology has ways to fix word-level well-formedness (<em>I thought of: making the vowel longer, adding a coda, etc.</em>). Why do we displace the monomoraic morpheme in this case to fix it? Thinking in Optimality Theory terms, are there prosodically-related constraints? How do they interact with phonological constraints?</li>
      <li><em>Why I like it: the talk is clear and well-organized, and the data is very cool</em>.</li>
    </ul>
  </li>
  <li>Re-constructing massive pied-piping: An argument for non-interrogative CPs (Daniel Amy)
    <ul>
      <li>Massive pied-piping (<em>MPP</em>) refers to optional pied-piping of a large constituent, e.g. movement of “the cover of which” in “John read the book [the cover of which]<sub>i</sub> Mary illustrated t<sub>i</sub>.”</li>
      <li>While previous observations restrict MPP to non-subordinated clauses such as relative clauses, Amy claims that MPP is acceptable in complements of factive predicates like <em>know</em> and <em>tell</em> as well: “I know/*wonder [the poster of which pop star]<sub>i</sub> Mary hung t<sub>i</sub> in her office.” &amp; “John told/*asked Sue [the poster of which pop star]<sub>i</sub> Mary hung t<sub>i</sub> in her office.”</li>
      <li>Amy thus calls for a distinction between factive complements (which allow MPP) and interrogative complements (which disallow MPP). He ties this distinction to Cable’s (2010) QP-analysis and den Dikken’s (2003) two-stage <em>wh</em>-agreement.</li>
      <li><em>Some related issues that I find interesting:</em>
        <ul>
          <li>MPP is not allowed in matrix clause questions, except when it is an echo question: “[The cover of which book]<sub>i</sub> did Mary illustrate t<sub>i</sub>?” Why are echo questions an exception? This intrigues me because I have been interested in the syntax of echo questions since Haoze Li’s NELS <a href="https://nels50.mit.edu/sites/default/files/abstracts/409-Li.pdf">talk</a>.</li>
          <li>While factive complement clauses allow MPP, factive subject clauses don’t seem as good: “It is surprising [the poster of which pop star]<sub>i</sub> Mary hung t<sub>i</sub> in her office.” vs. “??/*[The poster of which pop star]<sub>i</sub> Mary hung t<sub>i</sub> in her office is surprising.”</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>A workspace-based analysis of adjuncts (Daniel Milway)
    <ul>
      <li><em>Warning: this is a high-level talk about frameworks of Merge and has a lot of technical details.</em></li>
      <li>Adjuncts are derived in parallel workspaces alongside the workspaces that derive their “hosts”.</li>
      <li>To derive “Sadie sang the anthem with gusto”, where “with gusto” is an adjunct modifying the VP:
        <ul>
          <li>In Workspace 1, combine the verb and the object, and derive “sang the anthem”</li>
          <li>In Workspace 2, combine the preposition and the noun, and derive “with gusto”</li>
          <li>Then combine the products of both workspaces with tense (T) respectively, and derive “T sang the anthem” in Workspace 1, and “T with gusto” in Workspace 2</li>
          <li>Merge the products with the subject “Sadie”, and derive “Sadie T sang the anthem” in Workspace 1, and “Sadie T with gusto” in Workspace 2</li>
          <li>Delete “Sadie” and “T” under identity in Workspace 2</li>
          <li>The workspaces are inherently ordered and pronounced in this order</li>
        </ul>
      </li>
      <li>Adjunct island falls out from this analysis. Movement must occur within a workspace. If it starts in a workspace, it must end in the same workspace. Adjunct island violations violate this rule because they involve movement that starts in the workspace for the adjunct, and ends in the workspace for the “host”.</li>
      <li>Parasitic gap falls out from this analysis. Parasitic gap refers to constructions where there is a gap in the “host” clause and another gap in the adjunct clause. They are legal because they involve a movement in the workspace for the adjunct, and another movement in the workspace for the “host”.</li>
      <li><em>Issues and why I like the talk nonetheless:</em>
        <ul>
          <li>We may have to abandon the beloved notion of selection / subcategorization (at least in the workspace for the adjunct, as the above derivation shows, where tense directly merges with the PP “with gusto”).</li>
          <li>Adjuncts participate in syntactic and semantic relations with their “hosts”, which is difficult to capture if we build them in separate workspaces.</li>
          <li>I like the talk because adjuncts have always been a mystery and an exception to generalizations about Merge. I appreciate the courage of taking a crack at this difficult problem. While the solution is not perfect, by spelling it out, we can weigh its benefits and costs explicitly.</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
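<p>Milway’s parallel-workspace derivation can be made concrete with a toy sketch. This is my own illustration, not the author’s implementation: structures are nested tuples, and “deletion under identity” is modeled by stripping any head of Workspace 2 that matches the corresponding head of Workspace 1.</p>

```python
# Toy sketch of the parallel-workspace derivation for
# "Sadie sang the anthem with gusto". Structures are nested tuples.

def merge(a, b):
    return (a, b)

# Steps 1-2: build the core of each workspace
ws1 = merge("sang", "the anthem")   # Workspace 1: the VP
ws2 = merge("with", "gusto")        # Workspace 2: the adjunct PP

# Steps 3-4: merge T and then the subject into both workspaces in lockstep
for head in ("T", "Sadie"):
    ws1 = merge(head, ws1)
    ws2 = merge(head, ws2)

def delete_under_identity(ws, other):
    """Strip heads of `ws` that are identical to the corresponding heads of `other`."""
    while isinstance(ws, tuple) and isinstance(other, tuple) and ws[0] == other[0]:
        ws, other = ws[1], other[1]
    return ws

# Step 5: delete "Sadie" and "T" in Workspace 2 under identity
ws2 = delete_under_identity(ws2, ws1)

def spell_out(ws):
    """Pronounce a workspace by flattening it left to right."""
    if isinstance(ws, tuple):
        return " ".join(spell_out(x) for x in ws)
    return ws

# Step 6: the workspaces are pronounced in order
print(spell_out(ws1), spell_out(ws2))  # Sadie T sang the anthem with gusto
```

<p>Movement is not modeled here, but one can see how the adjunct-island facts would follow: any operation defined over a single nested tuple cannot relate material in <code>ws1</code> to material in <code>ws2</code>.</p>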

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The order is arbitrary and should not matter. I wish I could mention more, as there were many more interesting ones, but I am unfortunately constrained by time and memory. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>dfwu</name></author><category term="Prosody" /><category term="Syntax" /><summary type="html"><![CDATA[I just came back from the LSA 2020 Annual Meeting in New Orleans, and here are some (but not all) of the talks / posters I found interesting and why.]]></summary></entry><entry><title type="html">Why we can never be sure that a human or a neural network has a grammar that uses hierarchical structure</title><link href="https://dfwu.github.io/posts/2019/09/why-not-linear/" rel="alternate" type="text/html" title="Why we can never be sure that a human or a neural network has a grammar that uses hierarchical structure" /><published>2019-09-14T19:10:13-07:00</published><updated>2019-09-14T19:10:13-07:00</updated><id>https://dfwu.github.io/posts/2019/09/why-not-linear</id><content type="html" xml:base="https://dfwu.github.io/posts/2019/09/why-not-linear/"><![CDATA[<p>MIT Linguistics is offering a new course this semester called “Linguistics for Researchers in Computer Science, Cognitive Science, and Related Fields”. As the TA, I gave an opening speech at the beginning of the first class. Here is my speech <!-- more -->, and I added at the end additional thoughts that weren’t put into the speech:</p>

<p>Hi everyone! Welcome to the class. Whether your research is in Natural Language Processing, cognitive science or any other language-related topic, you may find some of these topics interesting: what do speakers know about their native language? Can we find a general theory to characterize this linguistic knowledge? How is this knowledge learned? For cognitive scientists who study language, these are the topics of primary interest. For NLPers who want to find an efficient algorithm for processing human languages, knowing about the linguistic features of human languages would be helpful, too. These are precisely the topics of this class, so you are in the right place.</p>

<p>In addition to gaining linguistic knowledge, I hope this class achieves a more ambitious goal: hopefully it helps us become better researchers. </p>

<p>Here is what I think makes a great researcher, and why I think this course will help us in that direction. Great researchers are broad: they read broadly, think broadly and have a broad knowledge. They may work on an apparently narrow topic, but they always bear in mind the bigger questions behind it, and situate their research in a grand picture. Great research requires breadth. In science, as we try to discover the order behind a phenomenon, a deep answer to a question often relates a broad range of phenomena.</p>

<p>But it’s very expensive to be broad: you may want to attend a talk in another department or read a paper in another field, but they are often too specialized and too dense to understand. That is why I co-organize <a href="https://complang.mit.edu/">CompLang</a>, a discussion group that engages members of CSAIL, BCS and linguistics. CompLang bridges between these fields. Instead of assuming a lot of prior knowledge from the audience, presenters are required to make their talk accessible to people who have very little background in the topic, bearing in mind the diversity of the audience. CompLang is in some sense a predecessor to this course. It is through organizing CompLang that I realized how important these inter-disciplinary discussions are to all of us.</p>

<p>From my experience, these discussions allow participants to take a step back and reflect on their own research approaches. I will show you what I mean by sharing one of my own intellectual journeys at CompLang. It is told from my perspective as a linguist, learning from computer scientists and cognitive scientists. But I hope (and am sure) that you will have similar gains in the other direction. I hope that by thinking about linguists’ approach to language, you will get to compare and reflect on your own approach as well.</p>

<p>In cognitive science and linguistics, researchers are interested in the questions of what are the initial state (factory settings) and final state of human language knowledge, and what a child experiences as they get from the initial state to the final state. The traditional approach is to study how humans learn languages, but in recent years some researchers have started to study this topic by looking at how neural networks learn languages. </p>

<p>For example, we know that human language knowledge can be described in terms of hierarchical structures like trees. This raises the question of where this property of languages comes from: are humans born with a bias favoring hierarchical structure? If the hierarchical structure of language cannot be learned based only on the sentences that children hear, then this innate bias is necessary. But if it can be, then no innate bias is required. </p>

<p>Some researchers showed that a neural network language model with no prior bias for any hierarchical structure can nevertheless learn linguistic phenomena that refer to hierarchical structure only based on the input. They take this as evidence against the view that hierarchical structure is innate.</p>

<p>Let’s put aside the question of whether human children and neural networks are on an equal footing in these studies. The question I was most interested in when I heard about research like this was: how can we be sure that a neural network has successfully learned a linguistic phenomenon? </p>

<p>Suppose that a neural network language model can provide us with a “grammaticality judgment” for each sentence presented to it, that is whether the sentence is grammatical or not. Again, let’s set aside the question of what kind of measurements from a neural network can be interpreted as a “grammaticality judgment”. We can then ask whether the judgments made by a neural network align with those that would be made by a human. If they align, then researchers conclude that a neural network has learned the same linguistic knowledge as a human. Assuming that humans’ linguistic knowledge makes reference to hierarchical structure, so must a neural network’s.</p>

<p>Take the example of English verb agreement. Linguists claim that the verb always agrees in number with the subject. If the subject of a sentence is plural, then the verb must be plural, as in (1) and (2) (* indicates that the sentence is judged ungrammatical by humans):</p>

<p>(1) The keys are lost.<br />
(2) *The keys is lost.</p>

<p>If the subject is singular, then the verb must be singular, as in (3) and (4):</p>

<p>(3) The key is lost.<br />
(4) *The key are lost. </p>

<p>Crucially, this verb agreement rule can be described by hierarchical structure, but not by linear order. For example, based on the grammatical sentence “The keys are lost,” one may think that <em>the verb agrees with the first thing linearly to its left</em>, that is “keys”. But this is not true, as is shown by the following sentences:</p>

<p>(5) The keys to the cabinet are lost.<br />
(6) *The keys to the cabinet is lost.</p>

<p>In fact, a rule is doomed if it only refers to the first, second or third thing linearly to the left of the verb. Instead, a rule that refers to agreement with the “subject” is successful. Since the concept of “subject” cannot be described in terms of linear order, but can be described with trees, we say that the verb agreement rule uses hierarchical structure.</p>

<p>Now suppose that we can show that a neural network judges correctly “The keys are lost” to be grammatical, and “The keys is lost” to be ungrammatical, can we then conclude that the neural network has learned the verb agreement rule in English, and therefore has learned a hierarchical rule? </p>

<p>Probably not, because the neural network may have only learned the linear rule I just mentioned, namely that the verb agrees with the first thing linearly to its left. To make sure the neural network is not adopting this hypothesis, we can test it with more complicated sentences like “The keys to the cabinet are lost.”</p>

<p>Suppose it still gets it right. Can we then be sure that the neural network has learned a hierarchical rule? Can we think of a linear rule that can still give the correct judgment to this complicated sentence? I think so! Such a linear rule might be: <em>the verb agrees with the first noun to its left that does not follow a preposition</em>. This rule may sound a little complicated, but it is based only on linear order, and is sufficient for the neural network to “pass” as a human and give the correct judgments.</p>
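<p>To make this concrete, here is a toy sketch (my own, not from the class) of that linear rule. The mini-lexicon is an assumption that covers only the example sentences, and “follows a preposition” is interpreted as “sits inside a prepositional phrase”:</p>

```python
# Toy lexicon: only the words in examples (1)-(6).
NOUNS = {"key": "sg", "keys": "pl", "cabinet": "sg"}
PREPOSITIONS = {"to", "of"}
AGREEING_VERBS = {"is": "sg", "are": "pl"}

def follows_preposition(words, j):
    """True if the noun at index j sits inside a prepositional phrase,
    i.e. a preposition precedes it with no other noun or verb between."""
    for k in range(j - 1, -1, -1):
        if words[k] in PREPOSITIONS:
            return True
        if words[k] in NOUNS or words[k] in AGREEING_VERBS:
            return False
    return False

def linear_rule_ok(sentence):
    """Check each agreeing verb against the first noun to its left
    that does not follow a preposition."""
    words = sentence.lower().split()
    for i, w in enumerate(words):
        if w in AGREEING_VERBS:
            for j in range(i - 1, -1, -1):
                if words[j] in NOUNS and not follows_preposition(words, j):
                    if NOUNS[words[j]] != AGREEING_VERBS[w]:
                        return False  # number mismatch: judged ungrammatical
                    break
    return True

print(linear_rule_ok("The keys are lost"))                 # (1) True
print(linear_rule_ok("The keys is lost"))                  # (2) False
print(linear_rule_ok("The keys to the cabinet are lost"))  # (5) True
print(linear_rule_ok("The keys to the cabinet is lost"))   # (6) False
```

<p>Crucially, nothing in this rule mentions constituency or trees; it covers examples (1)–(6) on linear order alone.</p>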

<p>We can think of an even more complicated sentence to exclude this complicated linear rule, but we can always find an even more sophisticated linear rule to cover it. And we can repeat this exercise infinitely.</p>

<p>We can think of an even more complicated sentence to exclude this complicated linear rule, such as (7) and (8):</p>

<p>(7) The keys that the carpenter made are lost. <br />
(8) *The keys that the carpenter made is lost.</p>

<p>Suppose the neural network still gets it right. Can we be sure that it has learned hierarchical structure? No, because a compatible linear rule still exists, such as: <em>the verb agrees with the first noun to its left that does not follow a preposition or “that”</em>. In fact, this exercise can be repeated infinitely.</p>

<p>What this exercise taught me is that <strong>we can never be really sure that the neural network has learned hierarchical structure</strong>. <strong>No matter how difficult the test sentence is, we can always find a complicated linear rule that covers it successfully.</strong></p>

<p>Then how will we ever know that the neural network is not using an extremely complicated linear rule? How can we ever conclude from grammaticality judgments that the neural network has learned to use hierarchical structure? </p>

<p>As I asked myself this, I started to reflect on <strong>how linguists came to the conclusion in the first place that humans’ linguistic knowledge is hierarchical. Didn’t they draw this conclusion also from grammaticality judgments, or more broadly, from observable human behavior?</strong> Why did no one challenge the linguists the way I challenged the researchers of neural networks, claiming that <strong>humans just have an extremely complicated linear system?</strong></p>

<p>This might be possible, but at some point this linear system becomes so complicated that it is very unlikely that humans use it and that young children learn it. In science, if we find two hypotheses that can explain the data equally well, then researchers tend to favor the simpler one. Suppose that all humans have this bias for simplicity, and that a hierarchical rule referring to the sentential “subject” is simpler than an extremely complicated linear rule;<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> then all humans will converge on the hierarchical rule. </p>

<p><strong>If we treat a neural network like a psychological object (a human), which is what these researchers do, then why can’t we also give up as the linear rule gets too complicated, and say the neural network can’t have chosen that? Then, it must have learned hierarchical structure.</strong></p>

<p>You may wonder <strong>whether we can really treat a neural network like a human</strong>. For example, what is the criterion for “simplicity” for a neural network? Is it the same as researchers’ or human subjects’ criterion for “simplicity”? </p>

<p>Anyway, this is my own intellectual journey at one of the CompLang talks last semester. We started from a specific project. By thinking critically about the project, I started to look back at what I had taken for granted, and reexamined my own implicit assumptions. <strong>In the future, when I hear the claim that human language is hierarchical, I know to put an asterisk on it, and know that we actually can’t be 100% sure of that</strong>. Furthermore, this thought process created deeper questions that lingered: if there is a criterion for simplicity followed by researchers, do human subjects and neural networks follow it too? </p>

<p><strong>These are not the questions you can only get to by attending a talk on neural networks. You can get to the exact same questions from the other direction, by going to a linguistics class like this one</strong>. For instance, a syntax lecture may show you many sentences, and linguists’ hypothesis based on those sentences. You can ask the same question as I did: is this the only hypothesis compatible with these sentences? Can I think of any alternative explanation? Why did linguists come to this particular conclusion as opposed to any other? What is the assumption behind this conclusion? Is it different from the assumptions behind my research? By asking these questions, I hope that you not only gain linguistic knowledge, but get to reflect on the bigger questions that relate our fields, and become a broader scientist.</p>

<p>But don’t feel pressured to only ask these types of questions in this class. Feel free to ask anything you want. We especially welcome clarificational questions. For example, if you hear the professor use the word VP a lot and don’t know what it means, feel free to ask. This is a new course, and we are as new to this course as you are. No question is inappropriate. By asking questions, you also help us know what works and what doesn’t.</p>

<hr />

<p>Here are more thoughts that weren’t put into the speech:</p>

<p>There should also be a meta-level evaluation of the simplicity of the sampling process itself. For verb agreement, based on the sentences a child or a neural network has heard, there are always an infinite number of ways to describe the verb agreement rule that would be compatible with all the data. This infinite space of compatible descriptions includes the rules that this blog post has mentioned, as well as other strictly linear descriptions and hierarchical descriptions, but it also includes many descriptions that no researcher has ever thought of. For this reason, I think this space is massive, and more importantly, very messy, in that we cannot properly divide it into a finite number of segments.</p>

<p>The messiness and infinity of this description space raises the question of how children or neural networks sample and choose descriptions from it.</p>

<p>In the speech I have assumed that a child or a neural network looks at this infinite space of descriptions, and chooses the “simplest” one, whatever that means. But one can think of a “simpler” way of sampling: instead of looking at this infinite description space altogether, a child or a neural network can just randomly take five descriptions from this space and compare them. The simplest one of these five that works is kept. If none of them works, then five more descriptions are sampled until we find one that works. Why don’t we assume that this is what a child or a neural network does?</p>
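<p>A minimal sketch of this sample-and-compare procedure, with hypothetical stand-ins (<code>generate</code>, <code>complexity</code>, <code>fits_data</code>) for whatever a real learner would use. Here the “descriptions” are just random integers, a description “fits the data” if it is even, and smaller numbers count as simpler:</p>

```python
import random

def sample_until_fit(generate, complexity, fits_data, batch_size=5):
    """Repeatedly sample `batch_size` candidate descriptions; return the
    simplest one in the first batch that contains a fitting description."""
    while True:
        batch = [generate() for _ in range(batch_size)]
        fitting = [d for d in batch if fits_data(d)]
        if fitting:  # keep the simplest description that covers the data
            return min(fitting, key=complexity)

# Toy stand-ins: descriptions are integers, "fitting" means even.
rng = random.Random(0)
best = sample_until_fit(
    generate=lambda: rng.randrange(100),
    complexity=lambda d: d,
    fits_data=lambda d: d % 2 == 0,
)
print(best % 2 == 0)  # True: the returned description fits the data
```

<p>Note that this learner never inspects the whole space, so it can settle on a non-optimal description: the outcome depends on which five candidates happened to be sampled first.</p>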

<p>Also, I think it is useful to keep doing the exercise mentioned in the speech seriously until we find a linear description that would cover all the data concerning English verb agreement that can ever be conceived of. In other words, it is worth finding that “extremely complicated linear rule”. Then we can compare this linear rule with the hierarchical rule that refers to agreement with the sentential subject. Suppose that all humans converge on one of these two rules - say, all humans actually follow the hierarchical rule - then presumably the hierarchical rule wins out because it is considered simpler by the “human language system”. We can then reverse-engineer a human language system so as to guarantee the win of the hierarchical rule over the linear rule. If we do the same exercise for all the other grammatical phenomena, then eventually we will get a pretty good idea of the human language system through reverse-engineering.</p>

<p>For instance, to continue the exercise mentioned in the speech a bit further, while the linear hypothesis (<em>the verb agrees with the first noun to its left that does not follow a preposition or “that”</em>) could cover the 8 examples, it fails to cover the following:</p>

<p>(9) John said that the keys are lost.<br />
(10) *John said that the keys is lost.</p>

<p>This calls for a revision to the linear hypothesis. Suppose that the linear rule scans a sentence from left to right. It matches the first verb it sees with a noun following this rule: <em>the verb agrees with the first noun to its left that does not follow a preposition</em>. Then the second verb it sees <em>agrees with the first noun to its left that (1) does not follow a preposition and (2) has not been matched with a verb yet</em>. Therefore, this linear rule can be summarized as <em>the verb agrees with the first noun to its left that (1) does not follow a preposition and (2) has not been matched with a verb yet</em>.</p>
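<p>Here is a toy version (again mine, not the speech’s) of this left-to-right matching rule, with a mini-lexicon that is assumed to cover only sentences (9) and (10):</p>

```python
# Toy lexicon for examples (9) and (10).
NOUNS = {"john": "sg", "keys": "pl", "cabinet": "sg"}
PREPOSITIONS = {"to", "of"}
AGREEING_VERBS = {"is": "sg", "are": "pl"}
OTHER_VERBS = {"said"}  # verbs with no visible number marking here

def in_pp(words, j):
    """True if a preposition precedes the noun at index j with no
    other noun or verb in between."""
    for k in range(j - 1, -1, -1):
        if words[k] in PREPOSITIONS:
            return True
        if words[k] in NOUNS or words[k] in AGREEING_VERBS or words[k] in OTHER_VERBS:
            return False
    return False

def revised_rule_ok(sentence):
    """Scan left to right; match each verb with the first noun to its
    left that does not sit in a PP and has not been matched yet."""
    words = sentence.lower().split()
    matched = set()
    for i, w in enumerate(words):
        if w in AGREEING_VERBS or w in OTHER_VERBS:
            for j in range(i - 1, -1, -1):
                if words[j] in NOUNS and j not in matched and not in_pp(words, j):
                    matched.add(j)
                    if w in AGREEING_VERBS and NOUNS[words[j]] != AGREEING_VERBS[w]:
                        return False  # number mismatch
                    break
    return True

print(revised_rule_ok("John said that the keys are lost"))  # (9)  True
print(revised_rule_ok("John said that the keys is lost"))   # (10) False
```

<p>“Said” consumes “John”, so when the scan reaches “are”/“is” the only unmatched noun left is “keys”, and the rule delivers the right judgments for (9) and (10) while still using only linear order.</p>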

<p>Can you think of an example that this revised linear rule does not cover, and think of a revision to the linear rule so that it can be covered by your revision?</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Can we be sure of this though? We would need a simplicity metric, against which we can compare different rules. Then we can tell which rule is simpler: a rule that only uses linear order; or a rule that relies on hierarchical structure like trees. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>dfwu</name></author><category term="CogSci" /><category term="Syntax" /><category term="cognitive science" /><category term="computer science" /><category term="hierarchical structure" /><category term="linear order" /><category term="neural networks" /><summary type="html"><![CDATA[MIT Linguistics is offering a new course this semester called “Linguistics for Researchers in Computer Science, Cognitive Science, and Related Fields”. As the TA, I gave an opening speech at the beginning of the first class. Here is my speech]]></summary></entry><entry><title type="html">What is Wrap-XP and why I think it is wrong</title><link href="https://dfwu.github.io/posts/2019/04/wrap-xp/" rel="alternate" type="text/html" title="What is Wrap-XP and why I think it is wrong" /><published>2019-04-13T07:15:44-07:00</published><updated>2019-04-13T07:15:44-07:00</updated><id>https://dfwu.github.io/posts/2019/04/wrap-xp</id><content type="html" xml:base="https://dfwu.github.io/posts/2019/04/wrap-xp/"><![CDATA[<p>Wrap-XP is an important and influential notion in prosody (the study of how sounds in a sentence are grouped together). In this blog post I will explain what Wrap-XP is and why I think it is wrong.
<!-- more --></p>

<p>Wrap-XP was first proposed in <a href="http://www.ai.mit.edu/projects/dm/theses/truckenbrodt95.pdf">Truckenbrodt’s dissertation</a> as a constraint in the sense of Optimality Theory. As language creates prosodic structure from syntax, Wrap-XP requires a certain type of syntactic phrases (call them <em>type-A syntactic phrases</em> for now) not to be broken up by any prosodic phrase (ф) boundary. In other words, for every type-A phrase, we must be able to find a ф such that the type-A phrase is not broken up in this ф. Crucially, Truckenbrodt argues that Wrap-XP only cares about type-A phrases, but not other kinds of phrases (let us call the complement of type-A phrases <em>type-B syntactic phrases</em>). Therefore, only type-A phrases may not be broken up by ф, but type-B phrases can be broken up by ф.</p>

<p>To give you an example of how Wrap-XP works, suppose that a type-A phrase AP contains two phrases XP1 and XP2 in syntax: [XP1 XP2]. In mapping from syntax to prosody, imagine we want to create a prosodic phrase ф for each syntactic phrase. We may create a ф for XP1 and a ф for XP2: (XP1)(XP2).<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Wrap-XP does not like this prosodic structure because the AP, which contains XP1 and XP2, is broken up by the prosodic boundary between XP1 and XP2. If we prefer to satisfy Wrap-XP, then we will need to remove the boundary between XP1 and XP2 and generate (XP1 XP2).</p>

<p>But if it is a type-B phrase BP that embeds XP1 and XP2, then Wrap-XP does not care if this BP is broken up and will allow this structure: (XP1)(XP2).</p>
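<p>The evaluation just described can be made concrete in a toy sketch. This is my own illustration, not Truckenbrodt’s formalism: phrases and ф’s are modeled as inclusive word-index spans, and Wrap-XP is satisfied for a phrase whenever some single ф contains it whole.</p>

```python
# Toy sketch of Wrap-XP checking (illustrative representation, not
# Truckenbrodt's formalism). A span is an inclusive (start, end) pair
# of word indices; a prosodic phrasing is a list of ф spans.

def wrap_xp_satisfied(phrase, phis):
    """True if some ф contains the whole syntactic phrase,
    i.e. the phrase is not broken up by any ф boundary."""
    start, end = phrase
    return any(p_start <= start and end <= p_end for p_start, p_end in phis)

# [AP XP1 XP2]: AP spans words 0-3, with XP1 = (0, 1) and XP2 = (2, 3).
ap = (0, 3)

# Phrasing (XP1)(XP2): a ф boundary splits AP, so Wrap-XP is violated.
assert not wrap_xp_satisfied(ap, [(0, 1), (2, 3)])

# Phrasing (XP1 XP2): one ф wraps all of AP, so Wrap-XP is satisfied.
assert wrap_xp_satisfied(ap, [(0, 3)])
```

<p>If AP were instead a type-B phrase, the constraint would simply never be consulted for it, which is exactly the asymmetry the rest of the post takes issue with.</p>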

<p>Here is the flaw of Wrap-XP: every time a type-A phrase contains a type-B phrase, we run into problems. First, consider a sentence with only a type-B phrase that contains XP1 and XP2. Wrap-XP does not care whether the type-B phrase is broken up by a prosodic boundary, and so we create boundaries between XP1 and XP2: (XP1)(XP2).</p>

<p>Next, consider another utterance where this BP is embedded in an AP. Then Wrap-XP cares about this AP and requires no subdivision within it. Thus, all the boundaries we created earlier within the BP have to be removed: (XP1 XP2).</p>

<p>Basically Wrap-XP says: whether a phrase may be subdivided depends on what the highest phrase embedding it is. If a type-A phrase embeds a type-B phrase, we have to erase all the phrasing created earlier inside the type-B phrase. This can’t be right.</p>

<p>Any attempt to revise Wrap-XP won’t work either. As long as Wrap-XP discriminates, i.e. it applies only to some types of syntactic phrases but not others, the same problem will arise every time a phrase it applies to embeds a phrase it doesn’t apply to.</p>

<p>Now I will explain why Wrap-XP must discriminate. If Wrap-XP did not discriminate, then no utterance could be broken up by any prosodic boundary at all. This is not what we see in the languages where Wrap-XP was argued to be relevant.</p>

<p>Truckenbrodt argues that Wrap-XP is relevant in at least two languages, Papago and Chichewa. In Papago, any syntactic phrase to the left of the tense head must have its own corresponding ф, while any syntactic phrase between the tense head and the verb head is grouped with the verb prosodically. The prosodic phrasing looks like this: (XP1)(T XP2 XP3 V). In Chichewa, subjects in Spec,TP always have their own corresponding ф too.</p>

<p>These facts lead Truckenbrodt to propose a discriminating Wrap-XP. In Papago, Wrap-XP should apply to the phrase that immediately dominates T, XP2, XP3 and V, but not to the phrase that immediately dominates XP1. Likewise, in Chichewa Wrap-XP should apply everywhere except to the phrase that immediately dominates the subject.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>One may wonder why ((XP1)(XP2)) isn’t a possible structure for an AP embedding XP1 and XP2, [AP XP1 XP2]. If it were, Wrap-XP would always be satisfied, because AP is not broken up in the largest ф. A prosodic structure where a ф contains another ф is called a recursive structure. For many years researchers believed that recursive structure is not allowed in any language. Since Truckenbrodt’s dissertation, some have changed their view, holding that some languages allow recursive structure while others don’t. Crucially, Truckenbrodt has argued on independent grounds that Papago and Chichewa don’t allow recursive structure, so Wrap-XP would still run into problems in those languages, or in any language that doesn’t allow recursive prosodic structures. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>dfwu</name></author><category term="Prosody" /><summary type="html"><![CDATA[Wrap-XP is an important and influential notion in prosody (the study of how sounds in a sentence are grouped together). In this blog post I will explain what Wrap-XP is and why I think it is wrong.]]></summary></entry></feed>