Boxofficemojo.com Data Scraping: November 2014

Thursday, 27 November 2014

Scraping SSL Labs Server Test Results With R

    NOTE: Qualys allows automated access to their SSL Server Test site in their T&C’s, and the R function/script provided here does its best to adhere to their guidelines. However, if you launch multiple scripts at one time and catch their attention you will, no doubt, be banned.

This post will show you how to do some basic web page data scraping with R. To make it more palatable to those in the security domain, we’ll be scraping the results from Qualys’ SSL Labs SSL Test site by building an R function that will:

    fetch the contents of a URL with RCurl
    process the HTML page tags with R’s XML library
    identify the key elements from the page that need to be scraped
    organize the results into a usable R data structure

You can skip ahead to the code at the end (or in this gist) or read on for some exposition that isn’t in the code’s comments.

Setting up the script and processing flow

We’ll need some assistance from three R packages to perform the scraping, processing and transformation tasks:

library(RCurl) # scraping
library(XML)   # XML (HTML) processing
library(plyr) # data transformation

If you poke at the SSL Test site with a few different URLs, you’ll see there are three primary inputs to the GET request we’ll need to issue:

    d (the domain)
    s (the IP address to test)
    ignoreMismatch (which we’ll leave as ‘on’)
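
As a side illustration in JavaScript (Node.js) rather than R, here is a minimal sketch of assembling that GET request. The helper name `buildAnalyzeUrl` is hypothetical; the endpoint and the three parameter names are the ones listed above. `URLSearchParams` percent-encodes the values, which the plain string formatting used later in this post does not do.

```javascript
// Hypothetical helper: builds the SSL Labs analyze URL from the three inputs.
function buildAnalyzeUrl(domain, ip, ignoreMismatch = 'on') {
  // URLSearchParams takes care of escaping any unusual characters
  const params = new URLSearchParams({ d: domain, s: ip, ignoreMismatch });
  return 'https://www.ssllabs.com/ssltest/analyze.html?' + params.toString();
}

console.log(buildAnalyzeUrl('rud.is', '184.106.97.102'));
```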

You’ll also see that there’s often a delay between issuing a request and getting the results, so we’ll need to build in a GET+check-loop (like the javascript on the page does automagically). Finally, when the results are eventually displayed, the page contains (at least for this example) either "Overall Rating" or "Assessment failed", and we’ll use that status result in our tests for what to return.

We’ll account for the domain and IP address in the function parameters along with the amount of time we should pause between GET+check attempts. It’s also a good idea to provide a way to pass in any extra curl options (e.g. in the event folks are behind a proxy server and need to input that to make the requests work). We’ll define the function with some default parameters:

get_rating <- function(site="rud.is", ip="", pause=5, curl.opts=list()) {

}

This definition says that if we just call get_rating(), it will

    default to using "rud.is" as the domain (you can pick what you want in your implementation)
    not supply an IP address (which the script will then have to look up with nsl)
    pause 5s between GET+check attempts
    pass no extra curl options

Getting into the details

For the IP address logic, we’ll have to test if we passed in an address string and perform a lookup if not:

# try to resolve IP if not specified; if no IP can be found, return
# a "NA" data frame

if (ip == "") {
  tmp <- nsl(site)
  if (is.null(tmp)) {
    return(data.frame(site=site, ip=NA, Certificate=NA,
                      Protocol.Support=NA, Key.Exchange=NA,
                      Cipher.Strength=NA))
  }
  ip <- tmp
}

(don’t worry about the return(...) part yet, we’ll get there in a bit).

Once we have an IP address, we’ll need to make the call to the ssllabs.com test site and perform the check loop:

# get the contents of the URL (will be the raw HTML text)
# build the URL with sprintf

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)

# while we don't find some indication of a completed request,
# pause and try again

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {
  Sys.sleep(pause)
  rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)
}
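
The loop above keeps polling until one of the two marker strings shows up, so it never gives up on a page that stays "in progress". Below is a sketch of the same GET+check idea with an upper bound on attempts, written in JavaScript (Node.js) rather than R. The names `pollUntilDone`, `fetchPage` and `maxTries` are hypothetical; `fetchPage` stands in for whatever performs the HTTP GET and returns the page text.

```javascript
// Matches the same two completion markers the R loop looks for.
const DONE = /(Overall Rating|Assessment failed)/;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchPage: async function returning the raw HTML (assumed, supplied by the caller).
// Resolves to the HTML once a marker appears, or to null after maxTries attempts.
async function pollUntilDone(fetchPage, pauseMs = 5000, maxTries = 60) {
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    const html = await fetchPage();
    if (DONE.test(html)) return html;
    if (attempt < maxTries) await sleep(pauseMs); // no need to wait after the last try
  }
  return null;
}

// Usage with a stand-in fetcher that "finishes" on the third call.
let calls = 0;
const fakeFetch = async () => (++calls < 3 ? 'Please wait...' : '<div>Overall Rating</div>');
pollUntilDone(fakeFetch, 10, 5).then((html) => console.log(calls, html !== null));
```

Returning null instead of looping forever lets the caller decide whether to retry later or to record the site as unrated.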

We can then start making some decisions based on the results:

# if the assessment failed, return a data frame of NA's

if (grepl("Assessment failed", rating.dat)) {
  return(data.frame(site=site, ip=NA, Certificate=NA,
                    Protocol.Support=NA, Key.Exchange=NA,
                    Cipher.Strength=NA))
}

# otherwise, parse the resultant HTML

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

Unfortunately, the results are not “consistent”. While there are plenty of uniquely identifiable <div>s, there are enough differences between runs that we have to be a bit generic in our selection of data elements to extract. I’ll leave the view-source: of a result as an exercise to the reader. For this example, we’ll focus on extracting:

        the overall rating (A-F)
        the “Certificate” score
        the “Protocol Support” score
        the “Key Exchange” score
        the “Cipher Strength” score

There are plenty of additional fields to extract, but you should be able to extrapolate and grab what you want to from the rest of the example.

Extracting the results

We’ll need to delve into XPath to extract the <div> values. We’ll use the xpathSApply function to perform this task. Since there sometimes is a <span> tag within the <div> for the rating and since the rating has a class tag to help identify which color it should be, we use a starts-with selection parameter to just get anything beginning with rating_. If it returns an R list structure, we know we have the one with a <span> element, so we re-issue the call with that extra XPath component.

rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/text()", xmlValue)

if (is.list(rating)) {
  rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/span/text()", xmlValue)
}

For the four attributes (and values) we’ll be extracting, we can use the getNodeSet call, which gives us all of them in a structure we can process with xpathSApply:

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

At this point, labs will be a vector of label names and vals will be the corresponding values. We’ll put them, the original domain and the IP address into a data frame:

# rbind will turn the vector into row elements, with each
# value being in a column

rating.result <- data.frame(site=site, ip=ip,
                            rating=rating, rbind(vals),
                            row.names=NULL)

# we use the labs vector as the column names (in the right spot)

colnames(rating.result) <- c("site", "ip", "rating",
                              gsub(" ", "\\.", labs))

and return the result:

return(rating.result)

Finishing up

If we run the whole function on one domain we’ll get a one-row data frame back as a result. If we use ldply from the plyr package to run the get_rating function repeatedly on a vector of domains, it will combine them all into one whole data frame. For example:

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

##                site              ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength
## 1            rud.is  184.106.97.102      B         100               70           80              90
## 2 stackoverflow.com 198.252.206.140      A         100               90           80              90
## 3        er-ant.com            <NA>   <NA>        <NA>             <NA>         <NA>            <NA>

There are many tweaks you can make to this function to extract more data and perform additional processing. If you make some of your own changes, you’re encouraged to add to the gist (link above & below) and/or drop a note in the comments.

Hopefully you’ve seen how well-suited R is for this type of operation and have been encouraged to use it in your next attempt at some site/data scraping.

library(RCurl)
library(XML)
library(plyr)

#' get the Qualys SSL Labs rating for a domain+cert
#'
#' @param site domain to test SSL configuration of
#' @param ip address of \code{site} (will resolve it and take\cr
#' first response if not specified, but that may not always work as you expect)
#' @param hide.results ["on"|"off"] should the results show up in the SSL Labs history (default "on")
#' @param pause timeout between tries (default 5s)
#' @param curl.opts options to pass to \code{getURL} i.e. proxy setting
#' @return data frame of results
#'
get_rating <- function(site="rud.is", ip="", hide.results="on", pause=5, curl.opts=list()) {

  # try to resolve IP if not specified; if no IP can be found, return
  # a "NA" data frame
  if (ip == "") {
    tmp <- nsl(site)
    if (is.null(tmp)) {
      return(data.frame(site=site, ip=NA, Certificate=NA,
                        Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA))
    }
    ip <- tmp
  }

  # need to let it actually process the certificate if not already cached
  rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

  while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {
    Sys.sleep(pause)
    rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)
  }

  if (grepl("Assessment failed", rating.dat)) {
    return(data.frame(site=site, ip=NA, Certificate=NA,
                      Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA))
  }

  x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

  # sometimes there is a <span ...> tag in the <div>, which will result in an
  # empty list() object being returned. we check for that and handle it
  # appropriately.
  rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/text()"]])
  if (class(rating) == "list") {
    rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/span/text()"]])
  }

  # extract the XML objects for the ratings labels & values
  labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")
  vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

  # convert them to vectors
  labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)
  vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

  # make them into a data frame
  rating.result <- data.frame(site=site, ip=ip, rating=rating, rbind(vals), row.names=NULL)
  colnames(rating.result) <- c("site", "ip", "rating", gsub(" ", "\\.", labs))

  return(rating.result)

}

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

##                site              ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength
## 1            rud.is  184.106.97.102      B         100               70           80              90
## 2 stackoverflow.com 198.252.206.140      A         100               90           80              90
## 3        er-ant.com            <NA>   <NA>        <NA>             <NA>         <NA>            <NA>

Source: http://www.r-bloggers.com/scraping-ssl-labs-server-test-results-with-r/

Wednesday, 26 November 2014

Web Scraping Tools for Non-developers

I recently spoke with a resource-limited organization that is investigating government corruption and wants to access various public datasets to monitor politicians and law firms. They don’t have developers in-house, but feel pretty comfortable analyzing datasets in CSV form. While many public datasources are available in structured form, some sources are hidden in what we data folks call the deep web. Amazon is a nice example of a deep website, where you have to enter text into a search box, click on a few buttons to narrow down your results, and finally access relatively structured data (prices, model numbers, etc.) embedded in HTML. Amazon has a structured database of their products somewhere, but all you get to see is a bunch of webpages trapped behind some forms.

A developer usually isn’t hindered by the deep web. If we want the data on a webpage, we can automate form submissions and key presses, and we can parse some ugly HTML before emitting reasonably structured CSVs or JSON. But what can one accomplish without writing code?

This turns out to be a hard problem. Lots of companies have tried, to varying degrees of success, to build a programmer-free interface for structured web data extraction. I had the pleasure of working on one such project, called Needlebase at ITA before Google acquired it and closed things down. David Huynh, my wonderful colleague from grad school, prototyped a tool called Sifter that did most of what one would need, but like all good research from 2006, the lasting impact is his paper rather than his software artifact.

Below, I’ve compiled a list of some available tools. The list comes from memory, the advice of some friends that have done this before, and, most productively, a question on Twitter that Hilary Mason was nice enough to retweet.

The bad news is that none of the tools I tested would work out of the box for the specific use case I was testing. To understand why, I’ll break down the steps required for a working web scraper, and then use those steps to explain where various solutions broke down.

The anatomy of a web scraper

There are three steps to a structured extraction pipeline:

    Authenticate yourself. This might require logging in to a website or filling out a CAPTCHA to prove you’re not…a web scraper. Because the source I wanted to scrape required filling out a CAPTCHA, all of the automated tools I’ll review below failed step 1. It suggests that as a low bar, good scrapers should facilitate a human in the loop: automate the things machines are good at automating, and fall back to a human to perform authentication tasks the machines can’t do on their own.

    Navigate to the pages with the data. This might require entering some text into a search box (e.g., searching for a product on Amazon), or it might require clicking “next” through all of the pages that results are split over (often called pagination). Some of the tools I looked at allowed entering text into search boxes, but none of them correctly handled pagination across multiple pages of results.

    Extract the data. On any page you’d like to extract content from, the scraper has to help you identify the data you’d like to extract. The cleanest example of this that I’ve seen is captured in a video for one of the tools below: the interface lets you click on some text you want to pluck out of a website, asks you to label it, and then allows you to correct mistakes as it learns how to extract the other examples on the page.

As you’ll see in a moment, the steps at the top of this list are hardest to automate.
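
For readers who do have a developer at hand, the three steps can be sketched as a small pipeline in which a person handles what the machine cannot. This is an illustration only: `authenticate`, `fetchPage` and `extract` are hypothetical callbacks supplied by the caller, not part of any tool reviewed here.

```javascript
// A human-in-the-loop scraping pipeline (sketch).
// authenticate(): resolves to a session once a person has logged in or solved the CAPTCHA.
// fetchPage(session, pageNo): resolves to { html, hasNext } for one page of results.
// extract(html): returns an array of records found on that page.
async function scrape({ authenticate, fetchPage, extract, maxPages = 100 }) {
  const session = await authenticate();            // step 1: authenticate
  const records = [];
  for (let pageNo = 1; pageNo <= maxPages; pageNo++) {
    const page = await fetchPage(session, pageNo); // step 2: navigate / paginate
    records.push(...extract(page.html));           // step 3: extract
    if (!page.hasNext) break;
  }
  return records;
}

// Usage with stand-in callbacks: two "pages" of comma-separated values.
const pages = ['a,b', 'c'];
scrape({
  authenticate: async () => 'session-token',
  fetchPage: async (_session, n) => ({ html: pages[n - 1], hasNext: n < pages.length }),
  extract: (html) => html.split(','),
}).then((rows) => console.log(rows)); // logs [ 'a', 'b', 'c' ]
```

Keeping authentication as a separate, awaitable step is what leaves room for a human: the pipeline simply waits until someone has dealt with the login or CAPTCHA.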

What are the tools?

Here are some of the tools that came highly recommended, and my experience with them. None of them passed the CAPTCHA test, so I’ll focus on their handling of navigation and extraction.

    Web Scraper is a Chrome plugin that allows you to build navigable site maps and extract elements from those site maps. It would have done everything necessary in this scenario, except the source I was trying to scrape captured click events on links (I KNOW!), which tripped things up. You should give it a shot if you’d like to scrape a simpler site, and the youtube video that comes with it helps get around the slightly confusing user interface.

    import.io looks like a clean webpage-to-api story. The service views any webpage as a potential data source to generate an API from. If the page you’re looking at has been scraped before, you can access an API or download some of its data. If the page hasn’t been processed before, import.io walks you through the process of building connectors (for navigation) or extractors (to pull out the data) for the site. Once at the page with the data you want, you can annotate a screenshot of the page with the fields you’d like to extract. After you submit your request, it appears to get queued for extraction. I’m still waiting for the data 24 hours after submitting a request, so I can’t vouch for the quality, but the delay suggests that import.io uses crowd workers to turn your instructions into some sort of semi-automated extraction process, which likely helps improve extraction quality. The site I tried to scrape requires an arcane combination of javascript/POST requests that threw import.io’s connectors for a loop, and ultimately made it impossible to tell import.io how to navigate the site. Despite the complications, import.io seems like one of the more polished website-to-data efforts on this list.

    Kimono was one of the most popular suggestions I got, and is quite polished. After installing the Kimono bookmarklet in your browser, you can select elements of the page you wish to extract, and provide some positive/negative examples to train the extractor. This means that unlike import.io, you don’t have to wait to get access to the extracted data. After labeling the data, you can quickly export it as CSV/JSON/a web endpoint. The tool worked seamlessly to extract a feed from the Hackernews front page, but I’d imagine that failures in the automated approach would make me wish I had access to import.io’s crowd workers. The tool would be high on my list except that navigation/pagination is coming soon, and will ultimately cost money.

    Dapper, which is now owned by Yahoo!, provides about the same level of scraping capabilities as Kimono. You can extract content, but like Kimono it’s unclear how to navigate/paginate.

    Google Docs was an unexpected contender. If the data you’re extracting is in an HTML table/RSS Feed/CSV file/XML document on a single webpage with no navigation/authentication, you can use one of the Import* functions in Google Docs. The IMPORTHTML macro worked as advertised in a quick test.

    iMacros is a tool that I could imagine solves all of the tasks I wanted, but costs more than I was willing to pay to write this blog post. Interestingly, the free version handles the steps that the other tools on this list don’t do as well: navigation. Through your browser, iMacros lets you automate filling out forms, clicking on “next” links, etc. To perform extraction, you have to pay at least $495.

    A friend has used Screen-scraper in the past with good outcomes. It handles navigation as well as extraction, but costs money and requires a small amount of programming/tokenization skills.

    Winautomation seems cool, but it’s only available for Windows, which was a dead end for me.

So that’s it? Nothing works?

Not quite. None of these tools solved the problem I had on a very challenging website: the site clearly didn’t want to be crawled given the CAPTCHA, and the javascript-submitted POST requests threw most of the tools that expected navigation through links for a loop. Still, most of the tools I reviewed have snazzy demos, and I was able to use some of them for extracting content from sites that were less challenging than the one I initially intended to scrape.

All hope is not lost, however. Where pure automation fails, a human can step in. Several proposals suggested paying people on oDesk, Mechanical Turk, or CrowdFlower to extract the content with a human touch. This would certainly get us past the CAPTCHA and hard-to-automate navigation. It might get pretty expensive to have humans copy/paste the data for extraction, however. Given that the tools above are good at extracting content from any single page, I suspect there’s room for a human-in-the-loop scraping tool to steal the show: humans can navigate and train the extraction step, and the machine can perform the extraction. I suspect that’s what import.io is up to, and I’m hopeful they keep the tool available to folks like the ones I initially tried to help.

While we’re on the topic of human-powered solutions, it might make sense to hire a developer on oDesk to just implement the scraper for the site this organization was looking at. While a lot of the developer-free tools I mentioned above look promising, there are clearly cases where paying someone for a few hours of script-building just makes sense.

Source: http://blog.marcua.net/post/74655674340

Sunday, 23 November 2014

4 Data Mining Tips to Scrap Real Estate Data; Innovative Way to Give Realty Business a boost!

The Internet has become a huge source of data; in fact, it has turned into a goldmine from which marketers can easily dig out useful data!

Web scraping has become the norm in today’s competitive era, where whoever holds the most relevant information wins the race!

Real Estate Data Extraction and Scraping Service

It has helped many industries to carve a niche in the market, especially real estate. Scraping real estate data has been of great help for professionals who want to reach out to a large number of people and gather reliable property data. However, there are some people for whom web scraping is still an alien concept, most probably because most of its advantages are not discussed.

Institutions, companies and organizations, entrepreneurs, and ordinary citizens generate an extraordinary amount of information every day. Property information extraction can be used effectively to get an idea of the customer psyche and even to generate valuable leads to further the business.

In addition to this, data mining has the following uses, which make it an indispensable part of marketing.

Gather Properties Details from Different Geographical Locations

You are an estate agent and want to expand your business to the neighboring city or state, but you are short of information. You are completely aware of the properties in the vicinity and in your town; data mining services will help you to get an idea of the properties in the other state as well. You can also approach probable clients and grow your database to offer more extensive services.

Online Offers and Discounts are just a Click Away

It is tough to deal with clients, show them the property of their choice, and act as a mediator between buyer and seller. With all this going on, it becomes difficult to keep an eye on special discounts or offers. With data mining services, you can get an insight into these offers. Thus, you can plan a move or even provide your client with an amazing deal.

What people are talking about – Easy Monitoring of your Online Reputation

The Internet has become a melting pot where different people come together. In fact, it provides a huge platform where people discuss their likes and dislikes. When you dig into such online forums, you can get an idea of the reputation that you or your firm holds. You can find out what people think about you, where you need to buck up, and where you need to slow down.

A Chance to Know your Competitors Better!

Last, but not least, you can keep an eye on the competition. Real estate is getting more competitive, and therefore it is important to have knowledge about your competitors to get the upper hand. It will help you to plan your moves and strategize with more ease. Moreover, you will also know what that “something” is that you have and your competitor does not, which can be subtly highlighted.

Property information extraction can prove to be the most fruitful method to get a cutting edge in the industry.

Source: http://www.hitechbposervices.com/blog/4-data-mining-tips-to-scrap-real-estate-data-innovative-way-to-give-realty-business-a-boost/

Wednesday, 19 November 2014

Web Scraping for Fun & Profit

There’s a number of ways to retrieve data from a backend system within mobile projects. In an ideal world, everything would have a RESTful JSON API – but often, this isn’t the case. Sometimes, SOAP is the language of the backend. Sometimes, it’s some proprietary protocol which might not even be HTTP-based. Then, there’s scraping.

Retrieving information from web sites as a human is easy. The page communicates information using stylistic elements like headings, tables and lists – this is the communication protocol of the web. Machines retrieve information with a focus on structure rather than style, typically using formats like XML or JSON. Web scraping attempts to bridge this human protocol into a machine-readable format like JSON.

As a means of getting to data, it doesn’t get much worse than web scraping. Scrapers were often built with regular expressions to retrieve the data from the page. Difficult to craft and impossible to maintain, this means of retrieval was far from ideal. The risks are many – even the slightest layout change on a web page can upset scraper code, and break the entire integration. It’s a fragile means for building integrations, but sometimes it’s the only way.

Having built a scraper service recently, the most interesting observation for me is how far we’ve come from these “dark days”. Node.js and its massive ecosystem of community-built modules have done much to change how these scraper services are built.

Effectively Scraping Information

Websites are built on the Document Object Model, or DOM. This is a tree structure, which represents the information on a page. By interpreting the source of a website as a DOM, we can retrieve information much more reliably than using methods like regular expression matching. The most popular method of querying the DOM is using jQuery, which enables us to build powerful and maintainable queries for information. The JSDom Node module allows us to use a DOM-like structure in serverside code.

For the purpose of illustration, we’re going to scrape the blog page of FeedHenry’s website. I’ve built a small code snippet that retrieves the contents of the blog, and translates it into a JSON API. To find the queries I need to run, first I need to look at the HTML of the page. To do this, in Chrome, I right-click the element I’m looking to inspect on the page, and click “Inspect Element”.

Articles on the FeedHenry blog are a series of ‘div’ elements with the ‘.itemContainer’ class

Searching for a pattern in the HTML to query all blog post elements, we construct the `div.itemContainer` query. In jQuery, we can iterate over these using the .each method:

var posts = [];

$('div.itemContainer').each(function(index, item){
  // Make JSON objects of every post in here, pushing to the posts[] array
});

From there, we pick off the heading, author and post summary using a child selector on the original post, querying the relevant semantic elements:

    Post Title, using jQuery:

    $(item).find('h3').text().trim() // trim, because titles have white space either side

    Post Author, using jQuery:

    $(item).find('.catItemAuthor a').text()

    Post Body, using jQuery:

    $(item).find('p').text()

Adding some JSDom magic to our snippet, and pulling together the above two concepts (iterating through posts, and picking off info from each post), we get this snippet:

var request = require('request'),
    jsdom = require('jsdom');

jsdom.env(
  "http://www.feedhenry.com/category/blog",
  ["http://code.jquery.com/jquery.js"],
  function (errors, window) {
    var $ = window.$, // Alias jQuery
        posts = [];

    $('div.itemContainer').each(function(index, item){
      item = $(item); // make queryable in JQ
      posts.push({
        heading : item.find('h3').text().trim(),
        author : item.find('.catItemAuthor a').text(),
        teaser : item.find('p').text()
      });
    });

    console.log(posts);
  }
);

A note on building CSS Queries

Building effective CSS queries is just as important for a scraper as it is for styling web sites. It’s important to build queries that are not too specific, or they are likely to break when the structure of the page changes. Equally important is to pick a query that is not too general, or it is likely to select extra data from the page that you don’t want to retrieve.

A neat trick for generating the relevant selector statement is to use Chrome’s “CSS Path” feature in the inspector. After finding the element in the inspector panel, right click, and select “Copy CSS Path”. This method is good for individual items, but it doesn’t work for picking repeating patterns (like blog posts). Often, the path it gives is much too specific, making for a fragile binding. Any changes to the page’s structure will break the query.
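
To make the "too specific" problem concrete, here is a rough sketch of a check that flags selectors likely to break. The heuristic (positional pseudo-classes or a long ancestor chain) and the name `looksFragile` are assumptions made for illustration, not an established rule, and the long selector below is a made-up example of what a copied path might look like.

```javascript
// Rough heuristic (an assumption, not a standard): a selector is "fragile" if it
// depends on element positions or on a long chain of ancestors.
function looksFragile(selector, maxSteps = 3) {
  // split on child combinators (>) and descendant combinators (whitespace)
  const steps = selector.split(/\s*>\s*|\s+/).filter(Boolean);
  return /:nth-(child|of-type)\(/.test(selector) || steps.length > maxSteps;
}

console.log(looksFragile('body > div#main > div:nth-child(2) > div.itemContainer')); // true
console.log(looksFragile('div.itemContainer'));                                      // false
```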

Making a Re-usable Scraping Service

Now that we’ve retrieved information from a web page, and made some JSON, let’s build a reusable API from this. We’re going to make a FeedHenry Blog Scraper service in FeedHenry3. For those of you not familiar with service creation, see this video walkthrough.

We’re going to start by creating a “new mBaaS Service”, rather than selecting one of the off-the-shelf services. To do this, we modify the application.js file of our service to include one route, /blog, which includes our code snippet from earlier:

// just boilerplate scraper setup
var mbaasApi = require('fh-mbaas-api'),
    express = require('express'),
    mbaasExpress = mbaasApi.mbaasExpress(),
    cors = require('cors'),
    request = require('request'),
    jsdom = require('jsdom');

var app = express();
app.use(cors());
app.use('/sys', mbaasExpress.sys([]));
app.use('/mbaas', mbaasExpress.mbaas);
app.use(mbaasExpress.fhmiddleware());

// Our /blog scraper route
app.get('/blog', function(req, res, next){
  jsdom.env(
    "http://www.feedhenry.com/category/blog",
    ["http://code.jquery.com/jquery.js"],
    function (errors, window) {
      var $ = window.$, // Alias jQuery
          posts = [];

      $('div.itemContainer').each(function(index, item){
        item = $(item); // make queryable in JQ
        posts.push({
          heading : item.find('h3').text().trim(),
          author : item.find('.catItemAuthor a').text(),
          teaser : item.find('p').text()
        });
      });

      return res.json(posts);
    }
  );
});

app.use(mbaasExpress.errorHandler());

var port = process.env.FH_PORT || process.env.VCAP_APP_PORT || 8001;
var server = app.listen(port, function() {});
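
The route above ignores the `errors` argument of the jsdom callback, so a failed page load would surface as an exception rather than a clean response. One possible way to handle it is sketched below. `sendPosts` is a hypothetical helper, the 502 status is a choice made here, and `res` is only assumed to offer Express-style `status()` and `json()` methods.

```javascript
// Hypothetical helper: turns the jsdom callback arguments into an HTTP response.
// `res` is an Express-style response object (only status() and json() are used).
function sendPosts(errors, posts, res) {
  // treat null/undefined and an empty error list as success
  var failed = errors && (!Array.isArray(errors) || errors.length > 0);
  if (failed) {
    // 502: the upstream page could not be fetched or parsed
    return res.status(502).json({ error: 'could not load the blog page' });
  }
  return res.json(posts);
}

// Usage with a stand-in response object that records what was sent.
var fakeRes = {
  code: 200,
  body: null,
  status: function (code) { this.code = code; return this; },
  json: function (body) { this.body = body; return this; }
};
sendPosts(new Error('timeout'), [], fakeRes);
console.log(fakeRes.code, fakeRes.body);
```

Inside the route, the callback would check `errors` before touching `window`, and call a helper like this instead of `res.json(posts)` directly.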

We’re also going to write some documentation for our service, so we (and other developers) can interact with it using the FeedHenry discovery console. We’re going to modify the README.md file to document what we’ve just done using API Blueprint documentation format:

# FeedHenry Blog Web Scraper

This is a feedhenry blog scraper service. It uses the `JSDom` and `request` modules to retrieve the contents of the FeedHenry developer blog, and parse the content using jQuery.

# Group Scraper API Group

# blog [/blog]

Blog Endpoint

## blog [GET]

Get blog posts endpoint, returns JSON data.

+ Response 200 (application/json)

    + Body

            [{ blog post}, { blog post}, { blog post}]

We can now try out the scraper service in the studio, and see the response:

Scraping – The Ultimate in API Creation?

Now that I’ve described some modern techniques for effectively scraping data from web sites, it’s time for some major caveats. First, WordPress blogs like ours already have feeds and APIs available to developers - there’s no need to ever scrape any of this content. Web Scraping is not a replacement for an API. It should be used only as a last resort, after every endeavour to discover an API has already been made. Using a web scraper in a commercial setting requires setting aside considerable time to maintain the queries, and an agreement with the owner of the source being scraped to alert developers in the event the page changes structure.
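
One way to notice a structural change early is to sanity-check what the scraper returns before serving it. The sketch below is an assumption about how that might look, not part of the service built above; `checkPosts` is a hypothetical name, and the field names follow the snippet earlier in this post.

```javascript
// Returns a list of problems found in scraped posts; an empty list means the
// result looks plausible.
function checkPosts(posts) {
  const problems = [];
  if (!Array.isArray(posts) || posts.length === 0) {
    problems.push('no posts found: the page layout may have changed');
    return problems;
  }
  posts.forEach((post, i) => {
    for (const field of ['heading', 'author', 'teaser']) {
      if (typeof post[field] !== 'string' || post[field].trim() === '') {
        problems.push(`post ${i}: empty ${field}`);
      }
    }
  });
  return problems;
}

console.log(checkPosts([{ heading: 'Hello', author: '', teaser: 'Intro' }]));
// logs [ 'post 0: empty author' ]
```

A non-empty result is a cue to log a warning or alert a maintainer rather than silently returning incomplete data.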

With all this in mind, it can be a useful tool to iterate quickly on an integration when waiting for an API, or as a fun hack project.

Source: http://www.feedhenry.com/web-scraping-fun-profit/

Saturday, 15 November 2014

Building Java Object Graph with Tour de France results – using screen scraping, java.util.Parser and assorted facilities

Last Saturday, the Tour de France 2011 departed. For people like myself, enjoying sports and working on Data Visualizations on the one hand and far fetched uses of SQL on the other, the Tour de France offers a wealth of data to work with: rankings for each stage in various categories, nationalities and teams to group by, distances and velocity, years to compare with one another and the like. So it has been my intention for some time to get hold of that data in a format I could work with.

Today I finally found some time to get it done: to locate the statistics for the Tour de France editions of the last few years and get them onto my laptop and into my database. This article describes the first part of that journey: how to get the stage results from some source on the internet into my locally running Java program in an appropriate object structure.

My starting point is the official Tour de France website:

Image

This website goes back to 2007 and also has the latest (2011) results. It presents the result in a format pleasing to the human eye – based on an HTML structure that is fairly pleasing to my groping Java code as well.

Analyzing the source of the Tour de France data

I start my explorations in Firefox, using the Firebug plugin. When I select the tab with the results for a particular stage, I inspect the (AJAX) call that is made to retrieve the stage results into the browser:

Image

The URL that was accessed is www.letour.fr/2010/TDF/LIVE/us/700/classement/ITE.html . When I access that URL directly, I see an HTML fragment with the individual ranking for the 7th stage in 2010. It turns out that with ITG instead of ITE in this URL, I get the overall ranking after the 7th stage. Using IME instead of ITE, I get the 7th stage’s climbers’ standing. And so on.
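The URL pattern described above can be captured in a small helper. This is only a sketch: the class and method names are made up for illustration, the three standing codes are the ones mentioned in the text, and the stage numbering (stage 7 becomes 700, the prologue becomes 0) follows the processStage method shown later in this article.

```java
public class StandingUrls {
    // Standing codes mentioned in the text.
    static final String STAGE = "ITE";    // individual ranking for the stage
    static final String OVERALL = "ITG";  // overall ranking after the stage
    static final String CLIMBERS = "IME"; // climbers' standing

    // Builds the URL of one standings fragment; stage 0 is the prologue.
    static String standingUrl(int year, int stageSequence, String code) {
        String stagePart = (stageSequence == 0) ? "0" : stageSequence + "00";
        return "http://www.letour.fr/" + year + "/TDF/LIVE/us/" + stagePart
                + "/classement/" + code + ".html";
    }

    public static void main(String[] args) {
        for (String code : new String[] {STAGE, OVERALL, CLIMBERS}) {
            System.out.println(standingUrl(2010, 7, code));
        }
    }
}
```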

The HTML associated with the stage standing looks like this:

Image

Which is not as user friendly as the corresponding display in the browser:

Image

but still fairly well structured and programmatically interpretable.

Retrieving HTML fragments and parsing in Java

Consuming these HTML fragments with stage standings into my own Java code is very easy. Parsing the data and turning it into sensible Java Objects is slightly more work, but still quite feasible. From the Java Objects I next need to create a persistent storage for the data – that is the subject for another article.

Using the Java URL class and its openStream method to open an InputStream on whatever content can be found at the URL, it is dead easy to start reading the HTML from the Tour de France website into my Java program. I make use of the java.util.Scanner class to work my way through the HTML by Table Row (TR element). When you inspect the HTML fragments, it is clear early on that every individual rider’s entry corresponds with a TR element, so it seems only logical to have the Scanner break up the data by TR.

private static Stage processStage(int year, int stageSequence, Map<Integer, Rider> riders) throws java.io.IOException, java.net.MalformedURLException {

    String typeOfStanding = "ITE";
    URL stageStanding = new URL("http://www.letour.fr/"+year+"/TDF/LIVE/us/"
                               +(stageSequence==0?"0":stageSequence+"00") +
                               "/classement/"+typeOfStanding+".html");
    InputStream stream = stageStanding.openStream();
    Scanner scanner = new Scanner(stream);
    scanner.useDelimiter("</tr>");
    Stage stage = new Stage();
    stage.setSequence(stageSequence);
    boolean first = true;
    boolean firstStanding = true;
    while (scanner.hasNext()) {
        String entry = scanner.next();
        if (first) {
            first = false;
            Matcher regexMatcher = regexDistance.matcher(entry);
            if (regexMatcher.find()) {
                String distanceString = regexMatcher.group();
                stage.setTotalDistance(Float.parseFloat(distanceString.substring(0, distanceString.length() - 3)));
            }
        }
        if (!first) {
            String[] els = entry.split("/td>");
            if (els.length > 1) { // only the standing-entries have more than one td element
                Integer riderNumber = Integer.parseInt(extractValue(els[2]));

                Rider rider=null;
                if (riders.containsKey(riderNumber)) {
                    rider = riders.get(riderNumber);
                }
                else {
                    rider = new Rider(extractValue(els[1]),riderNumber, extractValue(els[3]));
                    riders.put(riderNumber,rider);
                }
                Standing standing =
                    new Standing(firstStanding ? 1 : (Integer.parseInt(extractValue(els[0]).replace(".", ""))),
                                  rider,extractValue(els[4]),
                                  extractValue(els[5]));
                firstStanding = false;
                stage.getStandings().add(standing);
            }
        }
    } //while
    scanner.close();
    return stage;
}
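The regexDistance pattern used in processStage is not listed in the article. As a rough sketch, assuming the first table row contains the stage distance as a number followed by " km" (which would explain why the last three characters are stripped before parsing), it might look something like this. The pattern, the class name and the sample text are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DistanceSketch {
    // Hypothetical stand-in for regexDistance: a number followed by " km".
    static final Pattern REGEX_DISTANCE = Pattern.compile("\\d+(\\.\\d+)? km");

    // Returns the distance in km, or -1 if the text contains no distance.
    static float parseDistance(String text) {
        Matcher matcher = REGEX_DISTANCE.matcher(text);
        if (matcher.find()) {
            String distanceString = matcher.group();
            // Drop the trailing " km" (3 characters) before parsing.
            return Float.parseFloat(distanceString.substring(0, distanceString.length() - 3));
        }
        return -1f;
    }

    public static void main(String[] args) {
        // Made-up header text, only for illustration.
        System.out.println(parseDistance("<th>Stage 7 - 165.5 km</th>"));
    }
}
```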

Subsequently, the TR elements need to be broken up into the TD cell elements that contain the rank, the rider’s name, their number, the team they ride for and the time for the stage, as well as their lag with regard to the winner. I have used a simple split (on /td>) to extract the cells. The final logic for pulling the correct value from the cell is in the method extractValue. Note: this code is not very pretty, and I am not necessarily overly proud of it. On the other hand: it is one-time-use-only code and it is still fairly compact and easy to write and read.

private static String extractValue(String el) {
    String r = el.split("</")[0];
    if (r.lastIndexOf(">") > 0) {
        r = r.substring(r.lastIndexOf(">") + 1);
    }
    return r.split("<")[0];
}
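To see the splitting and extractValue at work without touching the network, the same steps can be run against an in-memory string. The sketch below reuses the extractValue logic unchanged; the class name, the parseRows helper and the sample rows are invented for illustration and are not the real letour.fr markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class RowParsingDemo {
    // Same logic as the extractValue method above.
    static String extractValue(String el) {
        String r = el.split("</")[0];
        if (r.lastIndexOf(">") > 0) {
            r = r.substring(r.lastIndexOf(">") + 1);
        }
        return r.split("<")[0];
    }

    // Splits an HTML fragment into rows on </tr>, then each row into cell values.
    static List<String[]> parseRows(String html) {
        List<String[]> rows = new ArrayList<String[]>();
        Scanner scanner = new Scanner(html);
        scanner.useDelimiter("</tr>");
        while (scanner.hasNext()) {
            String[] els = scanner.next().split("/td>");
            if (els.length > 1) { // skip fragments without table cells
                String[] values = new String[els.length];
                for (int i = 0; i < els.length; i++) {
                    values[i] = extractValue(els[i]);
                }
                rows.add(values);
            }
        }
        scanner.close();
        return rows;
    }

    public static void main(String[] args) {
        // Made-up rows: rank, rider name (inside a link), rider number, team.
        String html = "<tr><td>2.</td><td><a href=\"#\">A. RIDER</a></td><td>12</td><td>TEAM X</td></tr>"
                    + "<tr><td>3.</td><td><a href=\"#\">B. RIDER</a></td><td>34</td><td>TEAM Y</td></tr>";
        for (String[] row : parseRows(html)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Note that the rank cell still carries its trailing dot ("2."), which is why processStage strips the "." before calling Integer.parseInt.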

I have created a few domain classes: Rider, Stage, Standing (as well as Tour) that are a business domain like representation of the Tour de France result data. Objects based on these classes are instantiated in the processStage method that is being invoked from the processTour method.

public static void processTour(Tour tour) throws IOException, MalformedURLException {
    if (tour.isPrologue())
      tour.getStages().add(processStage(tour.getYear(),0, tour.getRiders()));

    for (int i=1;i<= tour.getNumberOfStages();i++) {
        tour.getStages().add(processStage(tour.getYear(),i, tour.getRiders()));
    }
}
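The domain classes themselves are not listed in this article. A minimal sketch of what they might look like, inferred purely from the constructors and getters called in processStage, processTour and TourManager, is shown below. Everything not visible in those calls (field names, the extra getters, storing time and lag as plain strings) is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical domain classes; names are inferred from the calls in the article.
class Rider {
    private final String name;
    private final int number;
    private final String team;

    Rider(String name, int number, String team) {
        this.name = name;
        this.number = number;
        this.team = team;
    }

    String getName() { return name; }
    int getNumber() { return number; }
    String getTeam() { return team; }
}

class Standing {
    private final int rank;
    private final Rider rider;
    private final String time; // kept as the raw text scraped from the page
    private final String lag;  // gap to the stage winner, also raw text

    Standing(int rank, Rider rider, String time, String lag) {
        this.rank = rank;
        this.rider = rider;
        this.time = time;
        this.lag = lag;
    }

    int getRank() { return rank; }
    Rider getRider() { return rider; }
    String getTime() { return time; }
    String getLag() { return lag; }
}

class Stage {
    private int sequence;
    private float totalDistance;
    private final List<Standing> standings = new ArrayList<Standing>();

    int getSequence() { return sequence; }
    void setSequence(int sequence) { this.sequence = sequence; }
    float getTotalDistance() { return totalDistance; }
    void setTotalDistance(float totalDistance) { this.totalDistance = totalDistance; }
    List<Standing> getStandings() { return standings; }
}

class Tour {
    private final int year;
    private final int numberOfStages;
    private final boolean prologue;
    private final Map<Integer, Rider> riders = new HashMap<Integer, Rider>();
    private final List<Stage> stages = new ArrayList<Stage>();

    Tour(int year, int numberOfStages, boolean prologue) {
        this.year = year;
        this.numberOfStages = numberOfStages;
        this.prologue = prologue;
    }

    int getYear() { return year; }
    int getNumberOfStages() { return numberOfStages; }
    boolean isPrologue() { return prologue; }
    Map<Integer, Rider> getRiders() { return riders; }
    List<Stage> getStages() { return stages; }
}

public class DomainSketch {
    public static void main(String[] args) {
        // Build a tiny object graph by hand, with made-up data.
        Tour tour = new Tour(2010, 20, true);
        Stage stage = new Stage();
        stage.setSequence(1);
        Rider rider = new Rider("A. RIDER", 12, "TEAM X");
        tour.getRiders().put(rider.getNumber(), rider);
        stage.getStandings().add(new Standing(1, rider, "4h 10' 00", ""));
        tour.getStages().add(stage);
        System.out.println("Stage " + stage.getSequence() + " winner: "
                + stage.getStandings().get(0).getRider().getName());
    }
}
```

Keeping the riders in a map keyed by rider number on the Tour is what lets processStage reuse the same Rider object across stages.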

When I run the TourManager class – a class that creates a single Tour object for the Tour de France in 2010 –

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;

public class TourManager {
    private final List<Tour> tours = new ArrayList<Tour>();

    public TourManager() {
        // 2010 edition: 20 stages plus a prologue
        tours.add(new Tour(2010, 20, true));
        try {
            ProcessTourStandings.processTour(tours.get(0));
        } catch (MalformedURLException e) {
            System.out.println(e.getMessage());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Accessor used by main below
    public List<Tour> getTours() {
        return tours;
    }

    public static void main(String[] args) {
        TourManager tm = new TourManager();
        for (Tour tour : tm.getTours()) {
            for (Stage stage : tour.getStages()) {
                System.out.println("================ Stage " + stage.getSequence() + "(" + stage.getTotalDistance() +
                                   " km)");
                // only print the top 3 of each stage
                for (Standing standing : stage.getStandings()) {
                    if (standing.getRank() < 4) {
                        System.out.println(standing.getRank() + "." + standing.getRider().getName());
                    }
                }
            }
        }
    }
}

it will print the top 3 in every stage:

Image

Source:http://technology.amis.nl/2011/07/04/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities/

Thursday, 13 November 2014

Interactive Crawls for Scraping AJAX Pages on the Web

Crawling pages on the web has become an everyday affair for most enterprises. We also often come across offline businesses that would like data gathered from the web for internal analyses, all of it ultimately to serve customers faster and better. At times, when the crawl job is both complex and large in scale, businesses also consider DaaS providers to supplement their efforts.

However, the web landscape has evolved too, with newer technologies that provide fancy experiences to web users. AJAX elements are one such common aid, and they leave even DaaS providers perplexed. From a user’s point of view, they come in various forms:

1. Load more results on the same page

2. Filter results based on various selection criteria

3. Submit forms, etc.

When crawling a non-AJAX page, simple GET requests do the job. However, AJAX pages often work with POST requests that are not easy for a normal bot to trace.

GET vs. POST

At PromptCloud, through our experience with a number of AJAX sites on the web, we’ve crossed the tech barrier. Below is a quick review of the challenges that come with AJAX crawling and their indicative solutions:

1. Javascript Emulations- A bot essentially emulates human browsing to fetch pages. When this needs to be done for Javascript components on a page, it gets tricky. A headless browser, which emulates human interaction with a web page without a visible interface, is the current approach. These browsers click on various elements/ dropdown lists that are embedded within Javascript code and capture responses to be transferred to programs. Which headless browser to pick depends on what fits well into your current stack.

2. Fetch Bandwidths- Unlike GET requests, which complete pretty quickly, POST requests take quite a bit of time due to the number of events involved per fetch. Hence a good amount of bandwidth needs to be allocated in order to receive the response. For the same reason, wait times need to be taken care of too, or else you might end up with incomplete responses.

3. .NET Architectures- This is a more complex scenario related to maintaining the View State. Most of the postbacks come with an event and its validation. The bot needs to track the view state and pass validations for the event to occur so that the code can be executed and results captured. This is achieved by adopting a mechanism to restore states if things break midway.

4. Page Encoding- Request and response headers need to be taken care of on AJAX pages. The request needs to be sent in the exact format as expected by the server (Content-type or media type, accept fields, etc.) and similarly responses need to be parsed based on the content-type.
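As a rough illustration of the page-encoding point, here is how a bot might build a POST request with explicit Content-Type and Accept headers using the java.net.http API that ships with Java 11 and later. This is only a sketch: the endpoint, the form fields and the class name are hypothetical, and the request is built and inspected but never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AjaxPostSketch {
    // Builds (but does not send) a form-encoded POST that asks for a JSON response.
    static HttpRequest buildRequest(String endpoint, String formBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                // Tell the server how the request body is encoded.
                .header("Content-Type", "application/x-www-form-urlencoded")
                // Tell the server which response format the bot can parse.
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and form fields, for illustration only.
        HttpRequest request = buildRequest("https://example.com/search", "page=2&category=concerts");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

In a real crawl the request would be sent with an HttpClient, and the response parsed according to the Content-Type header the server returns.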

A Use Case

One of our clients, who sells event tickets at discounted rates, had us crawl one of the ticketing sites on the web weekly; it was one of the most complex AJAX crawls we’ve dealt with so far. For the data that was to be extracted, multiple AJAX fetches were needed depending on the selections made. Requests had to be made for a combination of items from the dropdown box. These came with cookies and session IDs. To add to the challenge, the site was extremely dynamic and changed its structure every week, making it difficult for us to follow what data was where on the page.

We developed an AJAX crawler specific to this site to take care of all the dynamics. Response times were taken care of so that we didn’t miss any relevant information. We included an ML component to improve the crawler which is now pretty stable irrespective of changes on the site.

Overall, AJAX crawling requires more compute power in addition to the tech expertise. And because there’s no uniformity on the web, there’s always a new challenge to overcome in this landscape. It wouldn’t be an overstatement to say we’ve done a good job at that so far and have developed the knack :)

Reach out to us for any kind of web scraping/ crawling- either AJAX or not. We’ll take care of the complexities.

Source: https://www.promptcloud.com/blog/web-scraping-interactive-ajax-crawls/

Wednesday, 12 November 2014

Web scraping services-importance of scraped data

Web scraping services are provided by computer software which extracts the required facts from the website. Web scraping services mainly aim at converting unstructured data collected from websites into structured data which can be stored and analyzed in a centralized databank. Therefore, web scraping services have a direct influence on the outcome of whatever purpose the collected data is needed for.

It is not very easy to scrape data from different websites due to the terms of service in place. There are also some legal rules that have been put in place to protect personal information on different websites from being altered. These ‘rules’ must be followed to the letter and to some extent have limited web scraping services.

Owing to the high demand for web scraping, various firms have been set up to provide efficient and reliable guidelines on web scraping services so that the information acquired is correct and conforms to the security requirements. The firms have also developed different software that makes web scraping services much easier.

Importance of web scraping services

Definitely, web scraping services have gone a long way in provision of very useful information to various organizations. But business companies are the ones that benefit more from web scraping services. Some of the benefits associated with web scraping services are:

    It helps firms to easily send notifications to their customers, including price changes, promotions, the introduction of a new product into the market, etc.
    It enables firms to compare their product prices with those of their competitors
    It helps meteorologists to monitor weather changes and thus forecast weather conditions more efficiently
    It also assists researchers with extensive information about peoples’ habits among many others.
    It has also promoted e-commerce and e-banking services where the rates of stock exchange, banks’ interest rates, etc. are updated automatically on the customer’s catalog.

Advantages of web scraping services

The following are some of the advantages of using web scraping services:

    Automation of the data

    Web scraping can retrieve both static and dynamic web pages

    Page contents of various websites can be transformed

    It allows formulation of vertical aggregation platforms thus even complicated data can still be extracted from different websites.

    Web scraping programs recognize semantic annotation

    All the required data can be retrieved from their websites

    The data collected is accurate and reliable

Web scraping services mainly aim at collecting, storing and analyzing data. The data analysis is facilitated by various web scrapers that can extract any information and transform it into forms that are useful and easy to interpret.

Challenges facing web scraping

    High volume of web scraping can cause regulatory damage to the pages

    Scale of measure; the scales of the web scraper can differ with the units of measure of the source file thus making it somewhat hard for the interpretation of the data

    Level of source complexity; if the information being extracted is very complicated, web scraping will also be paralyzed.

It is clear that although web scraping provides useful data and information, it faces a number of challenges. The good thing is that web scraping service providers are always improving their techniques to ensure that the information gathered is accurate, timely, reliable and treated with the highest levels of confidentiality.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/191-web-scraping-services-importance-of-scraped-data/

Monday, 10 November 2014

Review: import.io’s New Scraping Process and Features

Web scraping data platform import.io announced last week that it has secured $3M in funding from investors that include the founders of Yahoo! and MySQL.

They also released a new beta version of the tool that is essentially a better version of their extraction tool, with some new features and a much cleaner and faster user experience.

First Impression

I’ve used the tool for a week and can say it is an improvement over the old version, which was a bit bulky and awkward. While still not exactly the most intuitive process, the development team at import.io has managed to slim down what was a relatively button-heavy process without sacrificing any of the functionality – they made the new workflow both simpler and more complicated at the same time.

The new version features a simple tool bar across the top as opposed to the space hogging table and wizard from before, which is a large improvement on the pink and white of the previous version.

True, the loss of the wizard means there isn’t as much guidance as before (the pop-up help only appears on the first use), but the undo button means you don’t really need it. You can click around and experiment a bit with the different extraction options before settling down to do some real work.

Data Extraction

Once you’ve figured out how it works, the new version requires far fewer mouse clicks to get from the page to a table of data/API as shown in their homepage video.

All you need to do now is navigate to a website, click a single piece of data on the page – such as price, image, or URL – and their app will find all the other examples of similar data on the website, immediately creating a structured table of data.

This latest version of the extractor also includes an important new feature labeled “Suggest Data”. It’s important because it lets you extract all the data from a page, instantly creating a table of data that can be published as an API. This makes import.io very exciting and quick. I spent a long time playing with this and it worked on the majority of sites.

Advanced Features

Most non-programmer web scrapers struggle with complex sites that use JavaScript or iFrames, but import.io also now deals with this. In the basic mode you can toggle JavaScript and CSS on and off to help you see your data better.

If that doesn’t work, you can switch into an ‘advanced mode’ where import allows you to write your own XPath and RegExp. They’ve also added a source code view, though without the ability to click on the site and inspect element (like in Chrome) this feature isn’t particularly useful.
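For readers who have not written XPath or regular expressions by hand, the combination looks roughly like this when done in plain Java with the JDK’s built-in XML classes. This is not import.io’s API, just a sketch of the same idea: the markup, the class name and the "price" class attribute are all made up, and the input must be well-formed XML for this parser to accept it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathRegexSketch {
    // Selects every price cell with XPath, then keeps only the numeric part with a regex.
    static List<String> extractPrices(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//td[@class='price']", doc, XPathConstants.NODESET);
            Pattern number = Pattern.compile("\\d+(\\.\\d+)?");
            List<String> prices = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Matcher m = number.matcher(nodes.item(i).getTextContent());
                if (m.find()) {
                    prices.add(m.group());
                }
            }
            return prices;
        } catch (Exception e) {
            // Parsing or XPath failure: surface it as an unchecked exception.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up, well-formed table for illustration.
        String xhtml = "<table>"
                + "<tr><td class=\"name\">Widget</td><td class=\"price\">$19.99</td></tr>"
                + "<tr><td class=\"name\">Gadget</td><td class=\"price\">$5</td></tr>"
                + "</table>";
        System.out.println(extractPrices(xhtml));
    }
}
```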

API Integration

Once you’ve created your scraper, there are a number of options for what you can do with it.

If you want, you can just copy and paste the data into a spreadsheet or download it as CSV. You can also push your data directly to Google Sheets with import.io’s self-generated formula.

For the rest of us, they have surfaced both the POST and GET requests for you and given you a JSON view which allows you to see how the data is returned, which is handy.

All this functionality is nice, and it’s clear they’re trying to cater to all technical levels. It has made the API page somewhat messy and potentially confusing for newer or less technical users, but they should still be able to get what they need.

Good with lots of Potential

Their new tool certainly isn’t perfect. There are still a few sites where manual row training is required and you can’t access the authentication feature (though you can still do this in the old version) or pagination.

Even if it’s not quite there yet, if import.io continues like this, it is well on its way to becoming the best data scraping platform on the market. Especially when you consider the “free for life” price tag.

Source:http://scraping.pro/review-import-ios-new-scraping-tools-features/

Saturday, 8 November 2014

Web Scraping Enters Politics

Web scraping is becoming an essential tool for gaining an edge in just about any field. This is shown by international news coverage of US political campaigns, specifically the use of data to identify wealthy donors. As is commonly known, election campaigns must follow rules that limit the amount of money that can be used for each candidate’s expenses. As a result, much of the campaign activity must be paid for by supporters and sponsors.

It is not a surprise then that even politics is lured to make use of the dynamic and ever growing data mining processes. Once again, web mining has proven to be an essential component of almost all levels of human existence, the society, and the world as a whole. It proves its extraordinary capacity to dig precious information to reach the much aspired for goals of every individual.

Mining for personal information

CBC News online very recently disclosed that the US Republican presidential candidate Mitt Romney has used data mining in order to identify rich donors. It is reported that obtaining personal information such as buying history and church attendance was vital to this effort. Through this information, the party was able to identify prospective rich donors and indeed tap them. As a businessman himself, Romney knows exactly how to fish and where the fat fish are. Moreover, what is unique about the identified donors is that they had never donated before.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-enters-politics/

Wednesday, 5 November 2014

Why People Hesitate To Try Data Mining

What is hindering a number of people from venturing into the promising world of data mining? Despite so much encouragement, promotion, testimonials, and evidence of the benefits of online data collection, still only a handful take up the challenge and really gain the payoffs it has to offer.

It may sound unthinkable that such an opportunity for success has been neglected by many. It may also sound absurd why many well-meaning individuals are hindered from enjoying the benefits of the blessings of the 21st century.

The Causes

After considerable observation and analysis of the human psyche, one can understand the underlying reasons behind the hesitance to try the profitable data mining service. The most common reasons why people are afraid to try new technology or why they remain passive and uninvolved are: fear; lack of knowledge; and pride.

Fear. The most paralyzing of human emotions is fear. It can, to some extent, cause a person to be insane, unprofitable, sick, and lost. Although fear is a normal reaction to certain stimuli and a natural feeling experienced by humans, it must always be monitored and controlled. Usually, people share common fears, such as: fear of change; fear of anything new; and fear of the unknown.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/people-hesitate-try-data-mining/