Movies by Math: The Sequel

I got enough comments and questions via Twitter, Facebook, and other avenues about my previous post that I figured I’d elaborate upon my crazy statistical methods.

First of all, to those who wanted a post about movies that actually contain math, here’s a lovely link to a page mainained by Harvard’s math department. How cool is that? VERY.

Secondly, I shall address that I got the most questions about: what’s the actual formula?

Sit back. This will take a while.

DISCLAIMER

I built my movie tracking tool to be tailored to my personal tastes. This thing is NOT meant to scientifically balance all factors to generate a universal “best of” movie list. It was meant to help me discover really awesome films that I’d never seen before.

Also, the way I built this monster is really fidgety. A user interface designer would be horrified.

So, no, I’m not going to publish the tool itself. I’ll just detail what I did and why, and, if you like, you could build your own widget that would serve your needs.

OVERVIEW

Okay, I’ll admit it. I built this thing in Excel. I keep thinking about turning it into an Access database, but I’m lazy.

The Excel file is a wretched hive of VB macros, tabs, conditional formatting, VLOOKUP commands, and nested pivot tables. Specifically, there are 67 pivot tables, pulling together data from 28 sources where the key data doesn’t necessarily match. (For instance, IMDB and Rotten Tomatoes can list the same film under a different title, like “Bicycle Thieves (1948)” and “The Bicycle Thief / Bicycle Thieves (1949)”. Easy to match if you’re human, but not so much if you’re a computer.) I won’t get into how I got this all working, because that would be boring. I’ll just say that I put the raw data in one end, and the thing will spit out fresh numbers in under a second. So it works, even though it’s cobbled together from the digital equvalent of fish hooks, gum wrappers, cat saliva, parsnips, and Roto-Rooter parts.

(In my defense, I’ll just say I started building it in 2005. This is the Winchester Mansion of databases.)

I just refreshed the data this morning. The database currently tracks 2,931 films. I have seen 28% of them.

I call this tool The Megalist.

THE MATH

The main thing that the Megalist does is calculate a “score” for each film. The higher the number, the more likely a movie will be a film of great historical/artistic/entertainment value (…according to my tastes, anyway).

I wanted to balance my list between things I should see (because they’re good for my brain) and things I should see (because I’d probably really enjoy them) and things I should see (because they have some sort of historical value). Thus, I needed to draw from sources that would balance popular opinion, critical opinion, and historical opinion.

There are four major data sources:

      • The Internet Movie Database: The IMDB has been running for a long time and has a huge user base of movie nerds. They are also good with their statistics. In order for a film to appear on one of their Top 50/Top 250 lists, it has to have a certain number of votes as well as a high rating. This prevents films with only 11 reviews (all 10 stars, because it was rated by friends of the cast and crew) from vaulting above The Godfather. I figured this would be a decent source of popular opinion.
      • Rotten Tomatoes: RT has a nice method of blending critical opinion with user base opinion, and their Best Movie rankings often differ significantly from the IMDB. They are shaky when it comes to older films, though, so they definitely need to be balanced out by other data sources.
      • The Academy Awards: Yeah, I know. The Academy is notoriously short-sighted in granting their Oscars. For my purposes, though, this was perfect. The Academy Awards are easily quantifiable, and they give me an idea of what was popular in the industry that year. It’s a decent way to net movies that were once a big deal but are now forgotten.
      • Roger Ebert: I love Roger Ebert as a critic. He’s able to appreciate movies at levels both lofty and base. His Great Movies list is a great way to give the real cream of the crop an extra boost.

And here are the specific sources and the math associated with them. Explanations are below the list:

      • Internet Movie Database:
          • Top 250 Overall: (251 points – [rank])
          • Top 100 Overall: (101 points – [rank])
          • Top 50 of the following genres: (101 points – [rank]
              • Action
              • Adventure
              • Animation
              • Biography
              • Comedy
              • Crime
              • Documentary
              • Film Noir
              • History
              • Horror
              • Musical
              • Mystery
              • Sci-Fi
              • Thriller
              • War
              • Western
      • Rotten Tomatoes
          • Top 100 of each of the following lists: (101 points – [rank])
              • Best Of
              • Action/Adventure
              • Art House
              • Classics
              • Comedy
              • Documentary
              • Horror
              • Musical
              • Suspense
              • Sci-Fi/Fantasy
      • Academy Awards
          • Best Picture: 10 points for a nomination, 15 for a win
          • Best Actor/Actress/Supporting Actor/Supporting Actress: 5 points for nomination or win
          • Best Director: 10 points for a nomination or win
          • Best Original or Adapted Screenplay: 10 points for nomination or win
          • Best Foreign Film: 10 points for a nomination, 15 for a win
          • Best Feature Documentary: 10 points for nomination or win
      • Roger Ebert’s Great Movies list: 50 points

Explanations:

      • Rank Dependency: As you can see on the IMDB and RT calculations, the score is rank-dependent. The movie in the #1 slot of the Top 100 gets 100 points, and #2 gets 99, etc. The IMDB Top 50 lists are also subtracted from 101, so they balance out with the RT Top 100 lists.
      • Duplication of Ranks: Yes, the IMDB Top 100 AND the IMDB Top 250 are both represented separately, even though the Top 100 is part of the Top 250. This is because, long ago, I had a goal to watch the entire Top 100, so I double-loaded the points in that direction. (I’ve currently seen 99 of the Top 100. I need to see A Separation.)
      • Missing Genres: You might notice some genres are missing from the IMDB and RT lists. This is because I don’t care about sports movies or romances as a genre. Also, I feel that if a film is truly great, it transcends its genre… and thus it will show up in one of the other lists anyway.
      • Weighting: You might notice that the Academy Awards points are rather light in comparison to the other data souces. I did this because I intended the Academy points to be a little boost for movies that are also critically/popularly acclaimed.

OTHER STATISTICS

The Megalist also tracks various other stuff:

      • Whether a film is on Netflix Streaming or Hulu Plus.
      • How many categories each film appears in.
      • Whether a film is a “genre authority” (has a rank of 10 or above on any ranked genre list).
      • Films that “fall off” all of the genre lists remain on the Megalist with a point total of 1. (Hey, it’s probably still a good movie if it punctured the IMDB Top Whatever for a few weeks!)
      • How many films I’d seen in each category at the point of every update. This is important, as I can actually slide backward in my progress, thanks to films entering/leaving the ranked lists. This feature helps me make sure I’m always keeping up.

FINALLY, AN EXAMPLE

In yesterday’s post, I noted that the Megalist says Aliens is the awesomest movie in history. Here’s how the math works out:

Aliens is listed in a whopping 11 categories: IMDB Action, IMDB Adventure, IMDB Sci-Fi, IMDB Thriller, IMDB Top 100, IMDB Top 250, RT Best Of, RT Action/Adventure, RT Horror, RT Sci-Fi, and a Best Actress nomination for Sigourney Weaver.

The film is listed at #14 on the IMDB Action list. 101-14=87 points
It is #17 on the IMDB Adventure list. 101-17=83 points
The Best Actress nomination adds 5 points

And so on. The final total for Aliens is a whopping 950 points.

FUN FACTS

It does work. After several years of picking away at this list, I’ve seen the best of the best. Even though I’ve only seen 28% of the films by number, I’ve slain 68% of the points.

Of the films I have not seen yet, these are the ones that have the highest point totals:

1. A Separation (2011) (371 points)
2. A Man Escaped (1956)
3. The Taste of Others (2000)
4. This Is Not a Film (2011)
5. Into the Wild (2007)
6. Everyday Sunshine: The Story of Fishbone (2011)
7. The Best Years of Our Lives (1946)
8. Andrei Rublev (1966)
9. Aguirre: Wrath of God (1972)
10. The Killers (1946)

Two films are sprawled across 14 categories: Lawrence of Arabia and Chinatown.

I’ve seen 58% of Roger Ebert’s Great Movies.

Netflix Streaming has a lot more of the movies I want to see (276) than Hulu Plus (158), but Hulu Plus has better movies (at an average of 45 points per film, as opposed to 30 points per film). This is thanks to the Criterion Collection, which has an exclusive streaming contract with Hulu.

I spend way too much time thinking about stuff like this.

6 Comments:

  1. Wait, you haven’t seen Aguirre or The Killers yet? I’m shocked, shocked I tell you.

  2. “The Excel file is a wretched hive of VB macros, tabs, conditional formatting, VLOOKUP commands, and nested pivot tables”

    I think I’m in love! That’s awesome.

  3. you’re my personal hero. All hail!

  4. There are people into fantasy sports that don’t put that much work into their teams. Heck there may be real sports teams that don’t.

Leave a Reply

Your email address will not be published. Required fields are marked *