2008-05-12

Sling Blade Runner

I found the ITA Software Hiring Puzzles almost a year ago and ever since I haven't really been able to stop thinking about one of the puzzles: Sling Blade Runner.

"How long a chain of overlapping movie titles, like Sling Blade Runner, can you find?"

Use the following listing of movie titles: MOVIES.LST. Multi-word overlaps, as in "License to Kill a Mockingbird," are allowed. The same title may not be used more than once in a solution. Heuristic solutions that may not always produce the greatest number of titles will be accepted: seek a reasonable tradeoff of efficiency and optimality.
The file MOVIES.LST contains 6561 movie titles in alphabetical order.

It *sounds* simple enough, but it's actually quite a challenging problem. In fact, if I'm not mistaken, it's one of those problems you really can't find an *optimal* solution to. Just better or worse approximations thereof.

I think I've made something like five or six, more or less serious, attempts at getting a decent solution during the year. But thinking about this problem has made me read several books, web pages and articles on various subjects like graph theory and clustering. I've written code in Lisp, CLIPS and Python to test ideas I've had. All in all; it's been a very rewarding, educational and fun year.

Last week I decided to give it another serious attempt. That resulted in (after two days and 6 hours worth of coding) a small (~200 LOC) Python program that finds a chain that is 225 titles long. I really have no idea if that is good or bad. I know that Eric Burke found a chain that is 245 titles long so I guess I could do better. But if that chain is optimal, or close to optimal, I'm quite pleased anyway.

My program uses a best-first search approach using the size of the successors tree as the heuristic. It's a rather crude approach but works surprisingly well. Since it doesn't backtrack or explore more than one path at a time it doesn't really require that much space. This also means that I can keep using my laptop while the program is running.

The program works as follows:

The first step is to build a dict containing all the successors for a title. I'm actually using the line number (from MOVIES.LST) as the identifier instead of the title itself, but this should give you an idea of what it looks like:
{ ..., 'SLING BLADE': ['BLADE II', 'BLADE RUNNER', 'BLADE TRINITY',], ... }
It takes about half a second to create the dict for this data set. It contains 6561 keys and the sum of the lengths of the values is 15157.

The second step is the search. It is done by selecting the title with the most successors (that isn't already in the list), adding it to the list and repeating until there are no more titles that can be added.

The dict is sorted by the length of the list of successors so we don't have to search using all of the 6561 titles as starting points. Since, at least in theory we'll get shorter and shorter chains the further down we go.

The gist of the search function looks like this:
>>> def best_first_search(xref, lookahead = 5, cutoff = -1):
... [...]
... for index in sorted(xref.keys(), key = lambda k: tree_size(xref, k, 0, [k,]), reverse = True):
... [...]
... while True:
... [...]
... for next in sorted(xref[curr], key = lambda k: tree_size(xref, k, lookahead, chain), reverse = True):
... if next not in chain:
... chain.append(next)
... assigned = True
... break
... if not assigned:
... if len(chain) > len(longest):
... print len(chain), "(#%s)" % (i), "%02.2fs" % (time.time()-start)
... longest = chain
... break
... [...]
... return longest
The function that calculates the size of the successors tree (see below) does so by traversing it to a certain depth. The deeper you search, the longer it takes, but there seems to be diminishing returns when using a search-depth greater than six or seven.
>>> def tree_size(xref, curr, depth, chain = None):
... if chain is None:
... chain = []
... result = 0
... if depth > 0:
... for next in xref[curr][1:]:
... if next not in chain:
... result += tree_size(xref, next, depth- 1, chain+ [next,])
... else:
... length = 0
... for title in xref[curr][1:]:
... if title not in chain:
... length += 1
... result = length
... return result
Using the above we can get a chain that is 203 titles long. It takes about two and a half minutes and is found as the 17th title searched. A chain that is 200 titles long is found when expanding the third title and takes about 20 seconds to get to. The next best chain is 202 titles long and takes about 1 minute 15 seconds to find at the ninth title expanded. The total runtime for the program using a cutoff at 20 takes about three minutes to complete. Here is the output:
161 (#1) 7.98s
177 (#2) 14.72s
200 (#3) 20.57s
202 (#9) 72.25s
203 (#17) 146.46s

1 WOMEN IN LOVE
2 LOVE WALKED IN
3 IN THIS OUR LIFE
4 LIFE OR SOMETHING LIKE IT
5 IT HAPPENED ONE NIGHT
...
199 SEX AND THE OTHER MAN
200 MAN OF THE YEAR
201 YEAR OF THE DRAGON
202 DRAGON SEED
203 SEED OF CHUCKY
203 titles in chain 182.76s
To get a longer chain we need to do some tweaking. The problem is that by selecting the title with the largest successors tree we might get long chains, but we're missing all of those little detours that make it the "longest" chain. For example, if we have a title A with two successors, B and C, where B has several successors (D, E, F, ...) and C only has B as its successor. Even though A->C->B, clearly, is the better choice my program will *always* choose the direct route A->B.

To compensate for this blind spot, every chain is expanded by trying to find longer paths between each pair of nodes in the chain (see below). Unfortunately, by adding this extra step the program requires more memory and becomes noticeably slower.
>>> def expand(xref, chain, max_depth, longest):
... [...]
... for curr, next in zip(chain[:-1], chain[1:]):
... expansions = None
... for i in range(2, max_depth+ 1):
... paths = find(xref, curr, next, i, chain)
... if paths:
... expansions = paths
... if expansions is not None:
... expansion = sorted(expansions, key = lambda k: len(k), reverse = True)[0]
... [...]
... chain[chain.index(next):chain.index(next)] = expansion
... return chain
The find function looks like this:
>>> def find(xref, curr, end, depth, chain = None):
... if chain is None:
... chain = []
... if depth > 0:
... result = []
... for next in xref[curr][1:]:
... if next not in chain:
... paths = find(xref, next, end, depth- 1, chain+ [next,])
... if paths:
... for path in paths:
... result.append([next,]+ path)
... return result
... elif end in xref[curr][1:]:
... return [[],]
... return None
I've tried various values for max_depth and 25 seems to be a good-enough value. I have found several expansions that are longer than that but they don't seem to add to the overall result anyway.

If we change the block of code that checks for and assigns the longest chain in the best_first_search function to:
...         if not assigned:
... chain = expand(xref, chain, max_depth, len(longest))
... if len(chain) > len(longest):
... print len(chain), "(#%s)" % (i), "%02.2fs" % (time.time()-start)
... longest = chain
... break
and run the program again we get the first result after about a minute and it is 205 titles long.

The longest chain is 225 titles and it is found at #134 but it takes almost 50 minutes to find. Running the whole search (500 "best" titles) takes about 4 hours. Running it with a cutoff of 150 takes about 55 minutes and produces the following output:
205 (#1) 52.72s
206 (#5) 176.38s
208 (#6) 199.54s
[10]
209 (#12) 338.63s
211 (#13) 360.17s
212 (#18) 471.40s
[20]
213 (#21) 538.34s
[30]
[40]
[50]
[60]
[70]
[80]
215 (#85) 1904.34s
[90]
[100]
[110]
218 (#116) 2592.82s
[120]
[130]
219 (#131) 2858.43s
225 (#134) 2900.19s
[140]
[150]
1 OFF THE BLACK
2 THE BLACK ANGEL
3 ANGEL EYES
4 EYES OF AN ANGEL
5 ANGEL BABY
6 BABY SECRET OF THE LOST LEGEND
7 LEGEND OF THE LOST
8 THE LOST BOYS
9 BOYS ON THE SIDE
10 SIDE OUT
11 OUT COLD
12 COLD FEVER
13 FEVER PITCH
14 PITCH BLACK
15 BLACK HAWK DOWN
16 DOWN WITH LOVE
17 LOVE WALKED IN
18 IN THIS OUR LIFE
19 LIFE OR SOMETHING LIKE IT
20 IT HAPPENED ONE NIGHT
21 ONE NIGHT STAND
22 STAND IN
23 IN THE HEAT OF THE NIGHT
24 THE NIGHT OF THE FOLLOWING DAY
25 DAY OF THE DEAD
26 THE DEAD
27 DEAD OF NIGHT
28 NIGHT OF THE LIVING DEAD
29 DEAD MAN
30 DEAD MAN WALKING
31 WALKING AND TALKING
32 TALKING ABOUT SEX
33 SEX AND THE OTHER MAN
34 MAN OF THE HOUSE
35 HOUSE OF DRACULA
36 DRACULA DEAD AND LOVING IT
37 IT HAD TO BE YOU
38 YOU CAN COUNT ON ME
39 ME MYSELF I
40 I LOVE YOU TO DEATH
41 DEATH WISH
42 WISH UPON A STAR
43 A STAR IS BORN
44 BORN AMERICAN
45 AMERICAN ME
46 ME WITHOUT YOU
47 YOU LIGHT UP MY LIFE
48 MY LIFE
49 LIFE AS A HOUSE
50 HOUSE OF FRANKENSTEIN
51 FRANKENSTEIN AND THE MONSTER FROM HELL
52 FROM HELL
53 HELL UP IN HARLEM
54 HARLEM RIVER DRIVE
55 DRIVE ME CRAZY
56 CRAZY AS HELL
57 HELL NIGHT
58 NIGHT AND THE CITY
59 THE CITY
60 CITY OF ANGELS
61 ANGELS WITH DIRTY FACES
62 FACES OF DEATH
63 DEATH SHIP
64 SHIP OF FOOLS
65 FOOLS RUSH IN
66 IN THE WINTER DARK
67 DARK STAR
68 STAR TREK IV THE VOYAGE HOME
69 HOME ALONE
70 ALONE IN THE DARK
71 THE DARK HALF
72 HALF LIGHT
73 LIGHT OF DAY
74 DAY FOR NIGHT
75 NIGHT FALLS ON MANHATTAN
76 MANHATTAN MURDER MYSTERY
77 MYSTERY ALASKA
78 ALASKA SPIRIT OF THE WILD
79 THE WILD ONE
80 ONE NIGHT WITH THE KING
81 THE KING AND I
82 I WANT TO LIVE
83 LIVE AND LET DIE
84 DIE MOMMIE DIE
85 DIE HARD
86 HARD EIGHT
87 EIGHT AND A HALF WOMEN
88 WOMEN IN LOVE
89 LOVE AND DEATH
90 DEATH WISH 3
91 3 NINJAS KNUCKLE UP
92 UP CLOSE AND PERSONAL
93 PERSONAL BEST
94 BEST OF THE BEST
95 BEST MEN
96 MEN IN BLACK
97 BLACK AND WHITE
98 WHITE HUNTER BLACK HEART
99 HEART CONDITION
100 CONDITION RED
101 RED EYE
102 EYE FOR AN EYE
103 EYE OF GOD
104 GOD TOLD ME TO
105 TO DIE FOR
106 FOR THE BOYS
107 THE BOYS
108 BOYS AND GIRLS
109 GIRLS WILL BE GIRLS
110 GIRLS GIRLS GIRLS
111 GIRLS OF SUMMER
112 SUMMER SCHOOL
113 SCHOOL OF ROCK
114 ROCK STAR
115 STAR TREK THE MOTION PICTURE
116 PICTURE BRIDE
117 BRIDE OF THE WIND
118 THE WIND AND THE LION
119 THE LION KING
120 KING OF THE JUNGLE
121 THE JUNGLE BOOK
122 BOOK OF LOVE
123 LOVE IS THE DEVIL
124 THE DEVIL RIDES OUT
125 OUT OF THE BLUE
126 BLUE CAR
127 CAR 54 WHERE ARE YOU
128 YOU ONLY LIVE ONCE
129 ONCE AROUND
130 AROUND THE BEND
131 BEND OF THE RIVER
132 THE RIVER
133 THE RIVER WILD
134 WILD THINGS
135 THINGS TO COME
136 COME AND GET IT
137 IT TAKES TWO
138 TWO FRIENDS
139 FRIENDS WITH MONEY
140 MONEY FOR NOTHING
141 NOTHING BUT TROUBLE
142 TROUBLE EVERY DAY
143 DAY OF THE WOMAN
144 THE WOMAN IN RED
145 RED RIVER
146 RIVER OF NO RETURN
147 RETURN TO HORROR HIGH
148 HIGH SPIRITS
149 SPIRITS OF THE DEAD
150 DEAD BANG
151 BANG BANG YOURE DEAD
152 DEAD MAN ON CAMPUS
153 CAMPUS MAN
154 MAN TROUBLE
155 TROUBLE IN PARADISE
156 PARADISE ROAD
157 ROAD HOUSE
158 HOUSE PARTY
159 HOUSE PARTY 3
160 3 NINJAS KICK BACK
161 BACK STAGE
162 STAGE FRIGHT
163 FRIGHT NIGHT
164 NIGHT MOTHER
165 MOTHER NIGHT
166 NIGHT ON EARTH
167 EARTH GIRLS ARE EASY
168 EASY COME EASY GO
169 GO NOW
170 NOW YOU SEE HIM NOW YOU DONT
171 DONT GO IN THE HOUSE
172 THE HOUSE OF THE DEAD
173 DEAD END
174 END OF DAYS
175 DAYS OF HEAVEN
176 HEAVEN CAN WAIT
177 WAIT UNTIL DARK
178 DARK CITY
179 CITY BY THE SEA
180 SEA OF LOVE
181 LOVE LIFE
182 LIFE IS BEAUTIFUL
183 BEAUTIFUL PEOPLE
184 PEOPLE I KNOW
185 I KNOW WHAT YOU DID LAST SUMMER
186 SUMMER CATCH
187 CATCH A FIRE
188 FIRE ON THE MOUNTAIN
189 THE MOUNTAIN MEN
190 MEN WITH GUNS
191 GUNS OF THE MAGNIFICENT SEVEN
192 THE MAGNIFICENT SEVEN
193 THE MAGNIFICENT SEVEN RIDE
194 RIDE THE HIGH COUNTRY
195 COUNTRY LIFE
196 LIFE WITH FATHER
197 FATHER OF THE BRIDE
198 BRIDE OF THE MONSTER
199 MONSTER HOUSE
200 HOUSE PARTY 2
201 2 DAYS IN THE VALLEY
202 VALLEY GIRL
203 GIRL IN THE CADILLAC
204 CADILLAC MAN
205 MAN ON FIRE
206 FIRE IN THE SKY
207 SKY HIGH
208 HIGH SCHOOL HIGH
209 HIGH CRIMES
210 CRIMES OF PASSION
211 PASSION IN THE DESERT
212 DESERT BLUE
213 BLUE STEEL
214 STEEL DAWN
215 DAWN OF THE DEAD
216 DEAD HEAT
217 HEAT AND DUST
218 DUST TO GLORY
219 GLORY ROAD
220 ROAD GAMES
221 GAMES PEOPLE PLAY NEW YORK
222 NEW YORK COP
223 COP LAND
224 LAND OF THE DEAD
225 THE DEAD POOL
225 titles in chain 3203.74s
I've got a few other ideas and strategies (beam search beam-stack search for instance) I'd like to try to see if I can get an even longer chain, but I should probably re-implement the code in Lisp to make it faster first. 50 minutes is a tad on the long side ;-)

[Update 2008-05-12] I just did some more extensive Googling and my results are not that good in comparison to the 312 titles long chain found by icefox. Back to work.

[Correction 2008-07-02]: When I say beam search I mean beam-stack search.

Inga kommentarer: