Boxofficemojo.com Data Scraping: May 2013

Thursday, 30 May 2013

Monster Roll Releases the Kraken upon Unsuspecting Sushi Chefs

Written and directed by Dan Blank, Monster Roll is a six-minute proof of concept short, channeling the best conceptual elements of classic Japanese monster movies and the fantastic gusto of 80s screen gems like Ghostbusters. But where something like Godzilla worked with 1950s nuclear anxiety, Monster Roll resonates with the very real twenty-first century problem of over-fishing and resource scarcity.

The film’s opening narration frames the central conflict as nature’s reaction to a broken promise on the part of humanity. Therein, a Japanese legend speaks of a bond between man and the sea. For our part, man promised to only kill what he would eat, and eat all that he killed. The movie then opens on a quintessential North American douche bag stuffing himself with sushi, utterly disrespecting the chef’s efforts as well as the traditional accoutrement of the meal.

For that scene alone, Monster Roll seems to be as much a cultural commentary on the worst sort of sushi restaurant patron, which perhaps reflects on the Western appropriation of an Eastern culinary practice, as it is the hook for a monster movie. All of this happens before a giant tentacle emerges from the restaurant’s sink and starts strangling the sushi chef.

During its short run time, Monster Roll establishes a conflict, creates a sense of empathy for the sushi chefs turned last line of defense against sea monsters, and scores some very well-placed laughs amid the chaos. Moreover, the computer-generated monsters are quite convincing in their ability to interact with flesh and blood actors.

Even though the short looks to be set in California, I’ll admit that I’m quite pleased to see Asian actors speaking Japanese within the movie. If Ben Kingsley being cast as The Mandarin in Iron Man 3 teaches us anything, it’s that some elements within mainstream Hollywood do not really understand race. It’s the same phenomenon which brought us an almost entirely white principal cast in The Last Airbender – though race was the least of that movie’s problems. Granted, it’s a sad state of affairs when subtitles threaten to hurt a movie’s appeal, but I’m glad to see this film demonstrating the courage to keep a culturally appropriate cast and language track.

So let’s review:

A classic man versus nature story as told through giant sea monsters.

Unlikely heroes rising from their humble origins to do great things.

Adept injections of comedy as a means of bringing the audience into the story without imposing too much on their suspension of disbelief.

Also, approbation from the likes of Moon and Source Code director Duncan Jones.

Source: http://www.pageofreviews.com/category/the-daily-shaft/movies/page/2/

Monday, 27 May 2013

Using R to scrape HTML tables from a stack of web pages

For this exercise, I based my code on an example from Tony Breyal’s blog. What I wanted to extract was the box office results for all movies released in 2009, 2010 or 2011. Thankfully, it was a relatively simple matter of adding a few more lines to tweak things here and there and do some additional stuff I needed, and hey presto. It takes a couple of minutes to run.

With a little extra effort, the get_table function could be re-written to be a little more generic, taking options such as the table number to grab from the page, whether or not to skip the header lines, whether to clean punctuation, etc.

# load required packages
library(XML)

# local helper functions
get_table <- function(u) {
  # we need the 7th table in the page source
  # BUT, the first row in the table only has 5 columns (for headings)
  # whilst the rest of the rows have 7, so skip that first row
  table <- readHTMLTable(u, skip.rows = 1, as.data.frame = TRUE)[[7]]
  names(table) <- c("Title", "Studio", "Domestic.Gross", "Domestic.Theaters", "Opening.Gross", "Opening.Theaters", "Open.Date")
  # get rid of factors and force strings for cleaning purposes
  table[] <- lapply(table, as.character)
  return(table)
}

# clean a data frame, by cleaning each column
clean_df <- function(df) {
  df <- sapply(df, clean)
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  return(df)
}

# clean a single column by removing punctuation
clean <- function(col) {
  col <- gsub("$", "", col, fixed = TRUE)
  col <- gsub("%", "", col, fixed = TRUE)
  col <- gsub(",", "", col, fixed = TRUE)
  col <- gsub("^", "", col, fixed = TRUE)
  # replace "n/a" and "N/A" entries with missing values
  col[grepl("n/a", col, ignore.case = TRUE)] <- NA
  return(col)
}

# Step 1: define the necessary arrays for URL construction
# Define the index values used in the URL
letter.pages <- c('NUM','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z')
# Define the number of pages each index section has - a total of 135 pages in all
num.pages <- c(1,8,8,7,6,5,6,6,6,5,3,3,5,6,5,5,7,1,6,13,8,2,3,6,1,2,1)

# Step 2: construct URLs
# create an empty character vector with one slot per page (135)
urls <- character(sum(num.pages))

for (m in seq_along(letter.pages)) { # for each of the letter.pages
  # the number of pages taken up by all previous letters
  offset <- sum(num.pages[seq_len(m - 1)])
  for (n in 1:num.pages[m]) { # for each page of the letter
    if (n == 1) { # the first page of a letter does not need the page variable in the URL
      # stick the new URL in the slot that is the sum of all previous pages, plus the current page number of the current letter
      urls[offset + n] <- paste("http://boxofficemojo.com/movies/alphabetical.htm?letter=", letter.pages[m], "&p=.htm", sep = "")
    } else {
      urls[offset + n] <- paste("http://boxofficemojo.com/movies/alphabetical.htm?letter=", letter.pages[m], "&page=", n, "&p=.htm", sep = "")
    }
  }
}

# Step 3: scrape website
# the following code generates a unicode error which I cannot work out how to fix
# so I am disabling warnings for this part only
old.opts <- options(warn = -1)
df <- do.call("rbind", lapply(urls, get_table))
options(old.opts) # restore the previous warning level

# Step 4: clean dataframe
df <- clean_df(df)

# Step 5: set column types
df[, c(3:6)] <- sapply(df[, c(3:6)], as.numeric)
df[, 7] <- as.Date(df[, 7], format = '%m/%d/%Y') # keeps the column as a Date

# Step 6: Remove all titles that are not in 2009, 2010 or 2011
df <- subset(df, df$Open.Date > as.Date("12/31/2008", '%m/%d/%Y'))
df <- subset(df, df$Open.Date < as.Date("01/01/2012", '%m/%d/%Y'))

# Step 7: remove entries with NA box office domestic (3) and opening weekend (5)
df <- df[!is.na(df[,3]),]
df <- df[!is.na(df[,5]),]
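
As an aside, the URL construction in Step 2 could also be written without loops. This is only a sketch of an alternative, not the code that was run for this post; it assumes the same letter.pages and num.pages values as above and relies on base R's rep and sequence functions.

```r
# Index sections and page counts, same values as in Step 1
letter.pages <- c("NUM", LETTERS)
num.pages <- c(1,8,8,7,6,5,6,6,6,5,3,3,5,6,5,5,7,1,6,13,8,2,3,6,1,2,1)

# Repeat each letter once per page, and number the pages within each letter
letters.rep <- rep(letter.pages, times = num.pages)
page.nums <- sequence(num.pages)

# The first page of a letter has no page variable in the URL
base.url <- "http://boxofficemojo.com/movies/alphabetical.htm?letter="
urls <- ifelse(page.nums == 1,
               paste0(base.url, letters.rep, "&p=.htm"),
               paste0(base.url, letters.rep, "&page=", page.nums, "&p=.htm"))
```

The resulting vector has one URL per page in the same order as the loop version, so the rest of the scraper (Steps 3 to 7) would not need to change.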

However, the readHTMLTable approach above only extracts the textual values from the table cells. If you want to extract the underlying links from some of those cells, then you need to take a different route and parse the HTML:

library(XML) # package names are case-sensitive
library(stringr)
# Get the page's source
web_page <- readLines("http://www.the-numbers.com/movies/index1.php")
# Pull out the appropriate lines
movie_link_lines <- web_page[grep("<A HREF=\"/movies/", web_page)] # this finds what I need, now to extract!
movie_names <- web_page[grep("<A HREF=\"/movies/", web_page) + 1] # the title text sits on the following line
# url starts at pos 61 consistently, and we need to drop the last 6 characters
movie_links <- str_sub(movie_link_lines, 61, -6) # this pulls a clean movie link
# some errors thrown from lines with unknown characters

The problem with this code, though, is that the parsing puts the anchor tag into a different slot from the text it is associated with. So it becomes very important to use grep carefully (or the str_locate function from stringr) to look for a particular HREF pattern, if you have one. If not, then things just got a whole lot messier.
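
As a rough illustration of the grep approach, here is a minimal base R sketch run on a few made-up lines instead of the live page. The sample lines, paths and titles are hypothetical, and it assumes the title sits on the line right after the anchor tag, as described above. A capture group pulls the link out by pattern rather than by fixed character positions.

```r
# Hypothetical page source: anchor tag on one line, its text on the next
web_page <- c(
  "<TR><TD>",
  "<A HREF=\"/movies/example-one.php\">",
  "Example Movie One</A></TD></TR>",
  "<TR><TD>",
  "<A HREF=\"/about.php\">",
  "About</A></TD></TR>",
  "<TR><TD>",
  "<A HREF=\"/movies/example-two.php\">",
  "Example Movie Two</A></TD></TR>"
)

# Match only anchors whose HREF starts with /movies/ and capture the path
pattern <- "<A HREF=\"(/movies/[^\"]*)\""
idx <- grep(pattern, web_page)

# regexec returns the full match plus the captured group for each line
matches <- regmatches(web_page[idx], regexec(pattern, web_page[idx]))
movie_links <- vapply(matches, function(x) x[2], character(1))

# The title is on the following line; drop the closing tags after it
movie_names <- sub("</A>.*$", "", web_page[idx + 1])

links <- data.frame(name = movie_names, link = movie_links, stringsAsFactors = FALSE)
```

Because the pattern is anchored on the /movies/ prefix, the unrelated /about.php link is skipped, but the pairing still breaks if the text is not on the very next line.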

If I had time, I would search for a function that would extract all the HTML for a particular table on a web page, and then pass just that to the parser, in order to be able to pull out links from the table (which would work so much better if those links followed no repeatable pattern).
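
One possible route, sketched under assumptions rather than tested against either site: the XML package can parse a page and run an XPath query restricted to a single table, so each link comes back already paired with its text. The HTML snippet, the get_table_links helper and the table index below are all hypothetical.

```r
library(XML)

# Hypothetical HTML with two tables; only the second holds the links we want
html <- '<html><body>
<table><tr><td><a href="/nav/home">Home</a></td></tr></table>
<table>
<tr><td><a href="/x/first-item">First item</a></td><td>10</td></tr>
<tr><td><a href="/y/second-item">Second item</a></td><td>20</td></tr>
</table>
</body></html>'

# Return the text and href of every anchor inside the nth table of a parsed page
get_table_links <- function(doc, table_index) {
  xpath <- sprintf("(//table)[%d]//a[@href]", table_index)
  anchors <- getNodeSet(doc, xpath)
  data.frame(
    text = vapply(anchors, xmlValue, character(1)),
    href = vapply(anchors, xmlGetAttr, character(1), name = "href"),
    stringsAsFactors = FALSE
  )
}

doc <- htmlParse(html, asText = TRUE)
links <- get_table_links(doc, 2)
```

Since the query works on the parsed tree rather than on raw lines, the links do not need to follow any repeatable pattern; the only thing to work out per site is which table index to ask for.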

The last time I did any serious code writing was for the Y2K effort with COBOL/DB2. I’m really enjoying writing R for my UCSC Silicon Valley Extension data mining class (2612).

Source: http://binalytics.wordpress.com/2012/02/15/using-r-to-scrape-html-tables-from-a-stack-of-web-pages/

Friday, 24 May 2013

Do film critics have any impact on box-office sales?

I got my hands on some movie data to see how much of a relationship exists between film reviews and box-office earnings.

Not that much, I discovered.

Click here to see the findings. (I didn’t embed them here because our WordPress install doesn’t play nice with Tableau).

I also wanted to see the relative popularity of movies between Quebec and the rest of the domestic Hollywood market, which includes the U.S. and Canada. Turns out Quebec does have distinct tastes when it comes to major film releases.

I’m curious if a general, non-geeky audience would find this interesting. Do you think all the stat talk makes this inaccessible? Please share your thoughts.

Sources:

Metacritic data is from Needlebase. Someone had already scraped it and offered it publicly.

Box office data comes from Box Office Mojo.

Rotten Tomatoes data from Information is Beautiful, which is holding a visualization contest using Hollywood data.

Quebec box-office data from Cineac.

Source: http://blogs.montrealgazette.com/2012/01/26/do-film-critics-have-any-impact-on-box-office-sales/

Friday, 17 May 2013

IRON MAN 3 Powers Weekend Box Office Past US$4m!

SINGAPORE - IRON MAN 3 opened on 99 screens (including 3D and IMAX 3D) and set a new record for the highest one-day box office gross with US$1.38m (S$1.7m) on its opening day. By the end of its first weekend, the film had grossed a total of US$3,761,476. That's more than what THE AVENGERS managed over its opening weekend (US$3,441,766) on a higher 124 screen count.

A clearer picture between the two films may only be formed next week as THE AVENGERS had already grossed a higher US$5,766,054 by the end of its first weekend with an earlier opening on the Labour Day holiday which fell on a Tuesday. IRON MAN 3 opened on a Friday with evening sneaks on Thursday with the Labour Day holiday falling on an approaching Wednesday.

THE AVENGERS ended up as the all-time box office champion with a US$11,172,253 tally. How IRON MAN 3 will actually fare will be clearer in the coming weeks. Regardless, a finish between US$7.3m-US$8.2m seems almost assured - this will give it a strong shot at becoming 2013's top film and a spot in the Top 5 of all-time.

IRON MAN 3 also opened bigger than both its predecessors IRON MAN (US$1,965,910) and IRON MAN 2 (US$1,920,023). It also set a new per screen average record of US$37,995, besting the all time high recently set by AH BOYS TO MEN last November (US$33,321). Its opening already secured it as this year's third highest grosser to date behind AH BOYS TO MEN 2 (US$6,366,469) and G.I. JOE: RETALIATION (US$4,268,582).

Along with Tony Stark's arrival, a few new releases get thrown to the wolves with none opening on more than 5 screens. Faring best is the Chinese romantic comedy FINDING MR. RIGHT starring Tang Wei, which managed to secure a decent US$7,255 screen average and a total of US$21,766 from 3 screens.

The R21-rated Korean drama AV IDOL grossed US$15,412 from 3 screens for a decent US$5,137 screen average, while the Tina Fey comedy ADMISSION fell flat with US$16,872 from 5 screens and a poor US$3,374 screen average.

Also released but unranked is the Academy and BAFTA award-winning documentary SEARCHING FOR SUGAR MAN on a single screen at FilmGarde.

With IRON MAN 3's arrival, all previous releases drop drastically with last week's no.1 film OBLIVION tumbling 86.7% to the no. 2 spot. The Tom Cruise sci-fi hit has grossed US$2,151,879 to date and a US$2.2m-US$2.4m finish looks likely - lower than the initial projection of US$2.8m.

JUDGMENT DAY - the 9th domestic release of the year - has grossed US$212,430 to date and may now finish with around US$270k instead of the US$380k projected last week.

Overall, IRON MAN 3 powered the total box office past US$4m for only the second time ever - the last time was back during the weekend of 2 June 2011, when the top 3 films X-MEN: FIRST CLASS, KUNG FU PANDA 2 and THE HANGOVER PART II helped push the box office to an all-time high of US$4,236,701 - a record that has yet to be broken.

This weekend, a total of US$4,090,461 was collected, of which 91.96% came from IRON MAN 3. This is a 168.7% jump from last weekend and a 242.65% jump from the same frame a year ago, when BATTLESHIP spent a 3rd week at the no. 1 spot.

Source: http://sgmoviebiz.blogspot.in/2013/04/iron-man-3-powers-weekend-box-office-to.html

Monday, 6 May 2013

Random Facts: Woody Allen at the Box Office

Woody Allen's latest film, Vicky Cristina Barcelona, has earned mostly rave reviews, and it's doing well at the box office -- or, that is, it's doing well for a Woody Allen film. It opened in 10th place for the weekend of Aug. 15-17, the first time an Allen film has cracked the top 10 at all (let alone opened there) since Small Time Crooks, eight years and eight movies ago. And Small Time Crooks was the first one since Husbands and Wives, eight years and eight movies before that.

I wouldn't say there's ever been a time when Allen's films routinely made the top 10 -- he's always managed to release a total flop here and there to break up the streak -- but it certainly used to occur much more frequently than it does now. Manhattan opened at #1 in 1979, possibly the only Allen film ever to do so. (I can't find specific weekend data on Annie Hall, which is the only other likely candidate.) Various others have spent at least a couple weekends in the top 10. Still, no Allen film has ever been what you'd call a "blockbuster." His biggest hit, Hannah and Her Sisters (1986), made $40 million and never got higher than 5th place at the box office. Granted, if you adjust for inflation, Annie Hall's $38 million would be about $120 million today, and that would be considered fantastic for a low-budget indie. But it's still not commensurate with how beloved and acclaimed Allen is.

Consider this: Woody Allen has directed 38 theatrical features. The Dark Knight has made more money than all 38 of them combined. Isn't it strange that one of the most iconic American filmmakers of all time can barely scrape together a crowd to actually watch his movies? Then again, maybe that's only true in the United States. All of these statistics represent the domestic box office. Allen's films routinely do much better overseas, particularly in France, Italy, and Spain. (Or at least the more recent ones do. Box Office Mojo's foreign records only go back about a decade.) For example, Cassandra's Dream was a bomb in the U.S., earning less than a million dollars, but it made $21.1 million overseas. Match Point earned $23.2 million here, and another $62.2 million elsewhere.

We should also remember that Allen's films don't play on 3,000 screens, so it would be unrealistic to expect them to have $20 million opening weekends. The widest release ever for an Allen movie was the 1,033 screens devoted to Anything Else (2003), which opened in 12th place (so close!) and which was -- let us not forget -- a lousy movie.

Which brings us to the element that might make the most difference when it comes to Allen's box-office success: quality. Most people agree that Allen went through a creative slump for a while, and it's just been in the last few years that he's started to come out of it. The box office clearly is not a meritocracy -- terrible movies make shloads of money all the time -- but producing something good will generally improve your chances. Luckily, even if blockbuster success continues to elude him, it would seem Woody Allen is going to keep making movies anyway -- and for those who love his work, that's fine with us.

[Thanks to Box Office Mojo's indispensable records, particularly the page devoted to Woody Allen specifically.]

Source: http://blog.moviefone.com/2008/08/25/random-facts-woody-allen-at-the-box-office/

Thursday, 2 May 2013

Web Data Scraping Proxy Data Scraping an Easy Way

Have you ever heard of “data scraping”? Data scraping technology is not new, and many a successful businessman has made his fortune by making use of the data.

Sometimes website owners are not happy about the automated harvesting of their data. Webmasters have learned to disallow web scrapers by using tools or methods that block certain IP addresses from retrieving the content of their websites, and the scrapers are ultimately left blocked.

There is a modern solution to the problem. Proxy data scraping technology solves it by using proxy IP addresses. Every time your data scraping program performs an extraction from a website, the website thinks that the request comes from a different IP address.

Now you might be asking yourself, “Where can I get proxy data scraping technology for my project?” There is a “do it yourself” solution, but unfortunately it is not a simple one. You could consider renting proxy servers from hosting providers; that option is fairly pricey, but definitely better than the alternative: incredibly dangerous (but free) public proxy servers.

There are literally thousands of free proxy servers located all over the world that are very easy to use. But the trick is finding them. Many sites list hundreds of servers, but identifying one that works, is accessible, and supports the type of protocol you need is a lesson in perseverance and trial and error.

There are many data scraping tools available on the Internet. With these tools, you can download large amounts of data without stress. In the past decade, the Internet has revolutionized the world as an information center. You can get every type of information from the Internet.

Web data extraction tools extract data from the HTML pages of various websites and compare the data. Every day, many new websites are hosted on the Internet, and it is not possible to look at all of them in a single day. With a data mining tool, you are able to handle web pages across the Internet. If you work with a wide range of applications, these scraping tools are very useful for you.

A structured data extraction software tool is used to compare data from the Internet. There are many search engines on the Internet to help you find a website on a particular topic, but the data is displayed in different styles on different sites. A scraping tool records the data from the separate sites in a structured form, which helps when comparing it.

A web crawler software tool is used for indexing web pages on the Internet; it moves data from the Internet to your hard disk. This allows you to browse much faster when connected to the Internet. The tool is particularly useful during off-peak hours, because downloading data from the Internet takes time. With this tool, however, you can download the data quickly. Another tool, aimed at business users, is the email extractor.

Finally, there are more scraping tools available on the Internet, and some reputable websites provide information about them. You pay a nominal fee to download these tools.

Source: http://xformative.org/web-data-scraping-proxy-data-scraping-an-easy-way.html

Note:

Delta Ray is an experienced web scraping consultant and writes articles on data scraping services, web data scraping, web scrapers, website scraping, eBay product scraping, forms data entry, etc.