Matisse is a Team Player

Henri Matisse works well with others in the courtyard of the Kabbalist, and in the morning, Leonardo’s shadow got to give the people what they want.

Two very influential artists, to be sure, but what was Matisse doing collaborating with Kabbalists, and what exactly did the people want from Leonardo’s shadow?

Some Context

In mid-June, Randall Munroe, the genius behind xkcd and the author of quite a few books, announced that he was going on a book tour to promote his latest book, How To: Absurd Scientific Advice for Common Real-World Problems.

As part of the announcement, he introduced an interesting challenge:

Write the best story using nothing but book covers.

The winner of the challenge would be rewarded by a visit from Munroe as part of the tour. Even though Ann Arbor was already in the itinerary, I couldn’t help but think about this challenge… and how I could use some of my linguistics knowledge to my advantage.

The Project

I decided I would try to hack this challenge by acquiring a dataset of book titles, and having a computer generate these stories for me.

I found a list of over 200,000 books listed on Amazon, and a couple of Jupyter notebooks and Python packages later, we were off to the races.

Grammar School: Constituency

The key concept behind this project was something I learned in my introduction to linguistics class last fall: constituency trees. The basic idea is that each sentence can be represented as a tree of nested phrases (constituents), with the individual words at the leaves.

For example, a sentence is, in its base form, a noun phrase and a verb phrase:

S -> NP VP

And a noun phrase might be a noun, with optional adjectives and determiners, and maybe a prepositional phrase (square brackets mark optional constituents):

NP -> [DET] [ADJ] N [PP]

And a prepositional phrase, in turn, is a preposition followed by a noun phrase:

PP -> P NP

And in this way, we can construct a very expressive and productive grammar for the English language. Productivity here means that, with even a limited vocabulary, we can form many distinct sentences and thoughts.
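
To make the productivity claim concrete, here is a minimal sketch that enumerates every sentence the three rules above can produce from a tiny vocabulary. The lexicon, the function names, and the simplification that a verb phrase is a single verb are all assumptions made for illustration; none of this is the code used for the project.

```python
import itertools

# Toy lexicon: every word here is a made-up example, not taken from the book dataset.
LEXICON = {
    "DET": ["the"],
    "ADJ": ["radioactive"],
    "N": ["horse", "shadow"],
    "P": ["in"],
    "V": ["sleeps", "waits"],
}

def noun_phrases(depth):
    """Yield every NP -> [DET] [ADJ] N [PP], nesting PPs at most `depth` levels deep."""
    # None stands for "optional constituent left out".
    pps = [None]
    if depth > 0:
        # PP -> P NP, where the inner NP may nest one level less.
        pps += [f"{p} {np}" for p in LEXICON["P"] for np in noun_phrases(depth - 1)]
    for det, adj, noun, pp in itertools.product(
        [None] + LEXICON["DET"], [None] + LEXICON["ADJ"], LEXICON["N"], pps
    ):
        yield " ".join(part for part in (det, adj, noun, pp) if part)

def sentences(depth):
    """Yield every S -> NP VP, where VP is a bare verb in this toy grammar."""
    for np in noun_phrases(depth):
        for verb in LEXICON["V"]:
            yield f"{np} {verb}"

for depth in range(3):
    print(depth, len(list(sentences(depth))))
```

With only seven words, the number of distinct sentences grows from 16 to 144 to 1168 as the allowed nesting depth goes from 0 to 2, because each prepositional phrase can contain another full noun phrase.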

So the main idea was to label each book title as a noun phrase, verb phrase, adjective phrase, etc., and then use this grammar to compose sentences out of whole titles.

The Nitty Gritty

Although there were over 200,000 books in the dataset, I couldn’t use all of them. In fact, out of 32 categories, I used only 10. Categories such as cookbooks and calendars were excluded because their titles were generally not very useful for building sentences.

The categories I ended up using were:

  • Biographies & Memoirs
  • Children’s Books
  • Engineering & Transportation
  • History
  • Humor & Entertainment
  • Literature & Fiction
  • Mystery, Thriller & Suspense
  • Science Fiction & Fantasy
  • Self-Help
  • Teen & Young Adult

I also did some preprocessing on the titles before parsing the constituents.

# `books` is the DataFrame of book listings; regex=True is required since pandas 2.0,
# where str.replace treats the pattern as a literal string by default.
cleaned_titles = books.title.str.lower()  # Lowercase
cleaned_titles = cleaned_titles.str.replace(r"\(.+?\)", "", regex=True)  # Remove each parenthesized note
cleaned_titles = cleaned_titles.str.replace(r"\[.+?\]", "", regex=True)  # Remove each bracketed note
cleaned_titles = cleaned_titles.str.replace(r"(volume|vol\.) \w+", "", regex=True)  # Remove volume numbers
cleaned_titles = cleaned_titles.str.replace(r"issue \w+", "", regex=True)  # Remove issue numbers
cleaned_titles = cleaned_titles.str.split(":").str[0]  # Only keep first part of title (no subtitles)
cleaned_titles = cleaned_titles.str.strip("-, ")  # Trim leftover dashes, commas, and spaces

The last step, getting rid of subtitles, sometimes helped and sometimes did not.
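
To see what the cleaning steps above do to individual titles, here is a small standalone sketch that applies the same rules with Python's `re` module instead of pandas. The `clean_title` helper and the example titles are made up for illustration, and this version trims stray punctuation last so nothing is left dangling after the subtitle is dropped.

```python
import re

def clean_title(title):
    """Apply the cleaning rules described above to a single title string."""
    title = title.lower()
    title = re.sub(r"\(.*?\)|\[.*?\]", "", title)      # parenthesized or bracketed notes
    title = re.sub(r"(volume|vol\.) \w+", "", title)   # volume numbers
    title = re.sub(r"issue \w+", "", title)            # issue numbers
    title = title.split(":")[0]                        # drop the subtitle
    return title.strip("-, ")                          # trim leftover dashes, commas, spaces

# Hypothetical titles, invented to exercise each rule.
examples = [
    "The Long Road (Illustrated Edition)",
    "Night Watch: A Novel",
    "Star Tales, Vol. 2",
]
for raw in examples:
    print(repr(raw), "->", repr(clean_title(raw)))
```

Dropping the subtitle is the lossy step: it shortens a title to something easier to slot into a sentence, but it can also throw away the more interesting half.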

Parsing

I used benepar, a state-of-the-art constituency parser that fits nicely into NLTK, and found that it provided decent results. One issue that I ran into was handling more granular details such as subject-verb number agreement. In those cases, I needed to know which word was the head noun of a noun phrase, which was the head verb of a verb phrase, and so on.
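
Once each title has a parse, sorting titles by phrase type only requires reading the label of the outermost constituent. The sketch below assumes each parse is available as a Penn Treebank-style bracketed string; the parse strings are hand-written for illustration rather than real parser output, and `top_label` and `pools` are hypothetical names.

```python
import re
from collections import defaultdict

# Matches an opening parenthesis followed by a constituent label.
LABEL = re.compile(r"\(\s*([^\s()]+)")

def top_label(bracketed):
    """Return the outermost constituent label, skipping TOP/ROOT wrapper nodes."""
    for label in LABEL.findall(bracketed):
        if label not in ("TOP", "ROOT"):
            return label
    return None

# Hand-written parses for illustration; real parser output may differ.
parsed_titles = {
    "radioactive horses": "(NP (JJ radioactive) (NNS horses))",
    "works well with others": "(VP (VBZ works) (ADVP (RB well)) (PP (IN with) (NP (NNS others))))",
    "in the courtyard": "(TOP (PP (IN in) (NP (DT the) (NN courtyard))))",
}

# Group titles by phrase type so they can later fill NP, VP, or PP slots.
pools = defaultdict(list)
for title, parse in parsed_titles.items():
    pools[top_label(parse)].append(title)

print(dict(pools))
```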

For this, I used a different grammar model: dependency grammar. In dependency grammars, instead of grouping words into constituents, each word depends on exactly one other word in the sentence (except the root, which depends on nothing). This image from Wikipedia is pretty good at showing the difference between the two.

Figure: Dependency vs Constituency

I use the root of each dependency tree as the “active noun” or “active verb” in the phrase, and its part-of-speech tag includes markers for plurality, properness, verb tense, etc.
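
As a rough illustration of how those markers can be used, the sketch below checks subject-verb number agreement from the tags of the two root words. It assumes Penn Treebank part-of-speech tags (NN, NNS, NNP, NNPS for nouns; VBZ, VBP for present-tense verbs), and the `agrees` helper is a hypothetical name, not the project's actual code.

```python
SINGULAR_NOUNS = {"NN", "NNP"}    # common and proper singular nouns
PLURAL_NOUNS = {"NNS", "NNPS"}    # common and proper plural nouns

def agrees(noun_tag, verb_tag):
    """Check number agreement between a head noun tag and a head verb tag."""
    if verb_tag == "VBZ":    # third-person singular present, e.g. "works"
        return noun_tag in SINGULAR_NOUNS
    if verb_tag == "VBP":    # non-third-person-singular present, e.g. "thank"
        return noun_tag in PLURAL_NOUNS
    # Simplification: past tense, modals, and so on are accepted with any noun,
    # even though "was"/"were" also mark number.
    return True

print(agrees("NNP", "VBZ"))   # "Henri Matisse" + "works ..."
print(agrees("NNS", "VBP"))   # "horses" + "thank ..."
print(agrees("NNS", "VBZ"))   # "horses" + "works ..."
```

The first two pairs agree and the third does not, which is the kind of mismatch this check is meant to filter out before a noun phrase and a verb phrase are joined.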

Creating Sentences

After that, generating a sentence was just a matter of drawing a random title for each constituent in the grammar. I wanted the sentence generation to be completely automated, but I ended up adopting a more hybrid process where I would query for noun phrases, for example, until I found one I thought was interesting, then move on to a verb phrase.
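
Both modes can be sketched in a few lines. The pools below are tiny and hand-made, with fragment boundaries that are guesses rather than the actual book titles, and `sample_sentence` and `candidates` are hypothetical helper names; the real pools would hold thousands of parsed titles per phrase type.

```python
import random

# Tiny hand-made pools keyed by phrase type (illustrative only).
POOLS = {
    "NP": ["henri matisse", "leonardo's shadow", "radioactive horses and ponies"],
    "VP": ["works well with others", "got to give the people what they want"],
    "PP": ["in the courtyard of the kabbalist", "around the world"],
}

def sample_sentence(rng, template=("NP", "VP", "PP")):
    """Fully automated mode: draw one random phrase per slot and join them."""
    return " ".join(rng.choice(POOLS[label]) for label in template)

def candidates(rng, label, k=2):
    """Hybrid mode: show a few random options for one slot so a human can pick."""
    pool = POOLS[label]
    return rng.sample(pool, min(k, len(pool)))

rng = random.Random(0)  # seeded so reruns give the same draws
for _ in range(3):
    print(sample_sentence(rng).capitalize() + ".")
print(candidates(rng, "NP"))
```

Calling `candidates` repeatedly for one slot, keeping a favorite, and then moving to the next slot mirrors the query-until-interesting workflow described above.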

Gallery

Here are a few more sentences I put together with the help of computers. Stay tuned for an online interactive version in the coming days that will let you construct your own book cover stories.

Radioactive horses and ponies thank you for being a good friend, but not the hippopotamus.

19 varieties of gazelle found apple, apples everywhere!

The Russian Kremlin desperately seeking exclusivity around the world with Justin Bieber.