## 2008-05-12

I found the ITA Software Hiring Puzzles almost a year ago and ever since I haven't really been able to stop thinking about one of the puzzles: Sling Blade Runner.

"How long a chain of overlapping movie titles, like Sling Blade Runner, can you find?"

Use the following listing of movie titles: MOVIES.LST. Multi-word overlaps, as in "License to Kill a Mockingbird," are allowed. The same title may not be used more than once in a solution. Heuristic solutions that may not always produce the greatest number of titles will be accepted: seek a reasonable tradeoff of efficiency and optimality.
The file MOVIES.LST contains 6561 movie titles in alphabetical order.

It *sounds* simple enough, but it's actually quite a challenging problem. In fact, if I'm not mistaken, it's one of those problems you really can't find an *optimal* solution to. Just better or worse approximations thereof.

I think I've made something like five or six, more or less serious, attempts at getting a decent solution during the year. But thinking about this problem has made me read several books, web pages and articles on various subjects like graph theory and clustering. I've written code in Lisp, CLIPS and Python to test ideas I've had. All in all; it's been a very rewarding, educational and fun year.

Last week I decided to give it another serious attempt. That resulted in (after two days and 6 hours worth of coding) a small (~200 LOC) Python program that finds a chain that is 225 titles long. I really have no idea if that is good or bad. I know that Eric Burke found a chain that is 245 titles long so I guess I could do better. But if that chain is optimal, or close to optimal, I'm quite pleased anyway.

My program uses a best-first search approach using the size of the successors tree as the heuristic. It's a rather crude approach but works surprisingly well. Since it doesn't backtrack or explore more than one path at a time it doesn't really require that much space. This also means that I can keep using my laptop while the program is running.

The program works as follows:

The first step is to build a dict containing all the successors for a title. I'm actually using the line number (from MOVIES.LST) as the identifier instead of the title itself, but this should give you an idea of what it looks like:
`{ ..., 'SLING BLADE': ['BLADE II', 'BLADE RUNNER', 'BLADE TRINITY',], ... }`
It takes about half a second to create the dict for this data set. It contains 6561 keys and the sum of the lengths of the values is 15157.

The second step is the search. It is done by selecting the title with the most successors (that isn't already in the list), adding it to the list and repeating until there are no more titles that can be added.

The dict is sorted by the length of the list of successors so we don't have to search using all of the 6561 titles as starting points. Since, at least in theory we'll get shorter and shorter chains the further down we go.

The gist of the search function looks like this:
`>>> def best_first_search(xref, lookahead = 5, cutoff = -1):...   [...]...     for index in sorted(xref.keys(), key = lambda k: tree_size(xref, k, 0, [k,]), reverse = True):...       [...]...       while True:...         [...]...         for next in sorted(xref[curr], key = lambda k: tree_size(xref, k, lookahead, chain), reverse = True):...           if next not in chain:...             chain.append(next)...             assigned = True...             break...         if not assigned:...           if len(chain) > len(longest):...             print len(chain), "(#%s)" % (i), "%02.2fs" % (time.time()-start)...             longest = chain...           break...       [...]...   return longest`
The function that calculates the size of the successors tree (see below) does so by traversing it to a certain depth. The deeper you search, the longer it takes, but there seems to be diminishing returns when using a search-depth greater than six or seven.
`>>> def tree_size(xref, curr, depth, chain = None):...   if chain is None:...     chain = []...   result = 0...   if depth > 0:...     for next in xref[curr][1:]:...       if next not in chain:...         result += tree_size(xref, next, depth- 1, chain+ [next,])...   else:...     length = 0...     for title in xref[curr][1:]:...       if title not in chain:...         length += 1...     result = length...   return result`
Using the above we can get a chain that is 203 titles long. It takes about two and a half minutes and is found as the 17th title searched. A chain that is 200 titles long is found when expanding the third title and takes about 20 seconds to get to. The next best chain is 202 titles long and takes about 1 minute 15 seconds to find at the ninth title expanded. The total runtime for the program using a cutoff at 20 takes about three minutes to complete. Here is the output:
`161 (#1) 7.98s177 (#2) 14.72s200 (#3) 20.57s202 (#9) 72.25s203 (#17) 146.46s1 WOMEN IN LOVE2 LOVE WALKED IN3 IN THIS OUR LIFE4 LIFE OR SOMETHING LIKE IT5 IT HAPPENED ONE NIGHT...199 SEX AND THE OTHER MAN200 MAN OF THE YEAR201 YEAR OF THE DRAGON202 DRAGON SEED203 SEED OF CHUCKY203 titles in chain 182.76s`
To get a longer chain we need to do some tweaking. The problem is that by selecting the title with the largest successors tree we might get long chains, but we're missing all of those little detours that make it the "longest" chain. For example, if we have a title A with two successors, B and C, where B has several successors (D, E, F, ...) and C only has B as its successor. Even though A->C->B, clearly, is the better choice my program will *always* choose the direct route A->B.

To compensate for this blind spot, every chain is expanded by trying to find longer paths between each pair of nodes in the chain (see below). Unfortunately, by adding this extra step the program requires more memory and becomes noticeably slower.
`>>> def expand(xref, chain, max_depth, longest):...   [...]...     for curr, next in zip(chain[:-1], chain[1:]):...       expansions = None...       for i in range(2, max_depth+ 1):...         paths = find(xref, curr, next, i, chain)...         if paths:...           expansions = paths...       if expansions is not None:...         expansion = sorted(expansions, key = lambda k: len(k), reverse = True)[0]...         [...]...         chain[chain.index(next):chain.index(next)] = expansion...   return chain`
The find function looks like this:
`>>> def find(xref, curr, end, depth, chain = None):...   if chain is None:...     chain = []...   if depth > 0:...     result = []...     for next in xref[curr][1:]:...       if next not in chain:...         paths = find(xref, next, end, depth- 1, chain+ [next,])...         if paths:...           for path in paths:...             result.append([next,]+ path)...     return result...   elif end in xref[curr][1:]:...     return [[],]...   return None`
I've tried various values for max_depth and 25 seems to be a good-enough value. I have found several expansions that are longer than that but they don't seem to add to the overall result anyway.

If we change the block of code that checks for and assigns the longest chain in the best_first_search function to:
`...         if not assigned:...           chain = expand(xref, chain, max_depth, len(longest))...           if len(chain) > len(longest):...             print len(chain), "(#%s)" % (i), "%02.2fs" % (time.time()-start)...             longest = chain...           break`
and run the program again we get the first result after about a minute and it is 205 titles long.

The longest chain is 225 titles and it is found at #134 but it takes almost 50 minutes to find. Running the whole search (500 "best" titles) takes about 4 hours. Running it with a cutoff of 150 takes about 55 minutes and produces the following output:
`205 (#1) 52.72s206 (#5) 176.38s208 (#6) 199.54s[10]209 (#12) 338.63s211 (#13) 360.17s212 (#18) 471.40s[20]213 (#21) 538.34s[30][40][50][60][70][80]215 (#85) 1904.34s[90][100][110]218 (#116) 2592.82s[120][130]219 (#131) 2858.43s225 (#134) 2900.19s[140][150]1 OFF THE BLACK2 THE BLACK ANGEL3 ANGEL EYES4 EYES OF AN ANGEL5 ANGEL BABY6 BABY SECRET OF THE LOST LEGEND7 LEGEND OF THE LOST8 THE LOST BOYS9 BOYS ON THE SIDE10 SIDE OUT11 OUT COLD12 COLD FEVER13 FEVER PITCH14 PITCH BLACK15 BLACK HAWK DOWN16 DOWN WITH LOVE17 LOVE WALKED IN18 IN THIS OUR LIFE19 LIFE OR SOMETHING LIKE IT20 IT HAPPENED ONE NIGHT21 ONE NIGHT STAND22 STAND IN23 IN THE HEAT OF THE NIGHT24 THE NIGHT OF THE FOLLOWING DAY25 DAY OF THE DEAD26 THE DEAD27 DEAD OF NIGHT28 NIGHT OF THE LIVING DEAD29 DEAD MAN30 DEAD MAN WALKING31 WALKING AND TALKING32 TALKING ABOUT SEX33 SEX AND THE OTHER MAN34 MAN OF THE HOUSE35 HOUSE OF DRACULA36 DRACULA DEAD AND LOVING IT37 IT HAD TO BE YOU38 YOU CAN COUNT ON ME39 ME MYSELF I40 I LOVE YOU TO DEATH41 DEATH WISH42 WISH UPON A STAR43 A STAR IS BORN44 BORN AMERICAN45 AMERICAN ME46 ME WITHOUT YOU47 YOU LIGHT UP MY LIFE48 MY LIFE49 LIFE AS A HOUSE50 HOUSE OF FRANKENSTEIN51 FRANKENSTEIN AND THE MONSTER FROM HELL52 FROM HELL53 HELL UP IN HARLEM54 HARLEM RIVER DRIVE55 DRIVE ME CRAZY56 CRAZY AS HELL57 HELL NIGHT58 NIGHT AND THE CITY59 THE CITY60 CITY OF ANGELS61 ANGELS WITH DIRTY FACES62 FACES OF DEATH63 DEATH SHIP64 SHIP OF FOOLS65 FOOLS RUSH IN66 IN THE WINTER DARK67 DARK STAR68 STAR TREK IV THE VOYAGE HOME69 HOME ALONE70 ALONE IN THE DARK71 THE DARK HALF72 HALF LIGHT73 LIGHT OF DAY74 DAY FOR NIGHT75 NIGHT FALLS ON MANHATTAN76 MANHATTAN MURDER MYSTERY77 MYSTERY ALASKA78 ALASKA SPIRIT OF THE WILD79 THE WILD ONE80 ONE NIGHT WITH THE KING81 THE KING AND I82 I WANT TO LIVE83 LIVE AND LET DIE84 DIE MOMMIE DIE85 DIE HARD86 HARD EIGHT87 EIGHT AND A HALF WOMEN88 WOMEN IN LOVE89 LOVE AND DEATH90 DEATH WISH 391 3 NINJAS KNUCKLE UP92 UP CLOSE AND PERSONAL93 PERSONAL BEST94 BEST OF THE BEST95 BEST MEN96 MEN IN BLACK97 BLACK AND WHITE98 WHITE HUNTER BLACK HEART99 HEART CONDITION100 CONDITION RED101 RED EYE102 EYE FOR AN EYE103 EYE OF GOD104 GOD TOLD ME TO105 TO DIE FOR106 FOR THE BOYS107 THE BOYS108 BOYS AND GIRLS109 GIRLS WILL BE GIRLS110 GIRLS GIRLS GIRLS111 GIRLS OF SUMMER112 SUMMER SCHOOL113 SCHOOL OF ROCK114 ROCK STAR115 STAR TREK THE MOTION PICTURE116 PICTURE BRIDE117 BRIDE OF THE WIND118 THE WIND AND THE LION119 THE LION KING120 KING OF THE JUNGLE121 THE JUNGLE BOOK122 BOOK OF LOVE123 LOVE IS THE DEVIL124 THE DEVIL RIDES OUT125 OUT OF THE BLUE126 BLUE CAR127 CAR 54 WHERE ARE YOU128 YOU ONLY LIVE ONCE129 ONCE AROUND130 AROUND THE BEND131 BEND OF THE RIVER132 THE RIVER133 THE RIVER WILD134 WILD THINGS135 THINGS TO COME136 COME AND GET IT137 IT TAKES TWO138 TWO FRIENDS139 FRIENDS WITH MONEY140 MONEY FOR NOTHING141 NOTHING BUT TROUBLE142 TROUBLE EVERY DAY143 DAY OF THE WOMAN144 THE WOMAN IN RED145 RED RIVER146 RIVER OF NO RETURN147 RETURN TO HORROR HIGH148 HIGH SPIRITS149 SPIRITS OF THE DEAD150 DEAD BANG151 BANG BANG YOURE DEAD152 DEAD MAN ON CAMPUS153 CAMPUS MAN154 MAN TROUBLE155 TROUBLE IN PARADISE156 PARADISE ROAD157 ROAD HOUSE158 HOUSE PARTY159 HOUSE PARTY 3160 3 NINJAS KICK BACK161 BACK STAGE162 STAGE FRIGHT163 FRIGHT NIGHT164 NIGHT MOTHER165 MOTHER NIGHT166 NIGHT ON EARTH167 EARTH GIRLS ARE EASY168 EASY COME EASY GO169 GO NOW170 NOW YOU SEE HIM NOW YOU DONT171 DONT GO IN THE HOUSE172 THE HOUSE OF THE DEAD173 DEAD END174 END OF DAYS175 DAYS OF HEAVEN176 HEAVEN CAN WAIT177 WAIT UNTIL DARK178 DARK CITY179 CITY BY THE SEA180 SEA OF LOVE181 LOVE LIFE182 LIFE IS BEAUTIFUL183 BEAUTIFUL PEOPLE184 PEOPLE I KNOW185 I KNOW WHAT YOU DID LAST SUMMER186 SUMMER CATCH187 CATCH A FIRE188 FIRE ON THE MOUNTAIN189 THE MOUNTAIN MEN190 MEN WITH GUNS191 GUNS OF THE MAGNIFICENT SEVEN192 THE MAGNIFICENT SEVEN193 THE MAGNIFICENT SEVEN RIDE194 RIDE THE HIGH COUNTRY195 COUNTRY LIFE196 LIFE WITH FATHER197 FATHER OF THE BRIDE198 BRIDE OF THE MONSTER199 MONSTER HOUSE200 HOUSE PARTY 2201 2 DAYS IN THE VALLEY202 VALLEY GIRL203 GIRL IN THE CADILLAC204 CADILLAC MAN205 MAN ON FIRE206 FIRE IN THE SKY207 SKY HIGH208 HIGH SCHOOL HIGH209 HIGH CRIMES210 CRIMES OF PASSION211 PASSION IN THE DESERT212 DESERT BLUE213 BLUE STEEL214 STEEL DAWN215 DAWN OF THE DEAD216 DEAD HEAT217 HEAT AND DUST218 DUST TO GLORY219 GLORY ROAD220 ROAD GAMES221 GAMES PEOPLE PLAY NEW YORK222 NEW YORK COP223 COP LAND224 LAND OF THE DEAD225 THE DEAD POOL225 titles in chain 3203.74s`
I've got a few other ideas and strategies (beam search beam-stack search for instance) I'd like to try to see if I can get an even longer chain, but I should probably re-implement the code in Lisp to make it faster first. 50 minutes is a tad on the long side ;-)

[Update 2008-05-12] I just did some more extensive Googling and my results are not that good in comparison to the 312 titles long chain found by icefox. Back to work.

[Correction 2008-07-02]: When I say beam search I mean beam-stack search.