Boxofficemojo.com Data Scraping: 2014

Wednesday, 31 December 2014

Hand Scraped Flooring: Points to Keep in Mind

The demand for hand-scraped flooring is growing. Yet, this type of flooring, in terms of appearance, isn't like any other. If you are one of the many considering it for your home, what points do you need to keep in mind as you look for the right type of hand-scraped hardwood?

First, nearly all species – domestic and exotic – are available as this distressed variety. Species from white oak to Brazilian cherry are all available with this distressed and rustic look. And, any floor of a building can have hand-scraped flooring, as both solid and engineered types are distressed. As you look at different types of hand-scraped flooring, think about where you will be installing it into your home, and plan accordingly with the right type of solid or engineered hardwood.

What's most notable about hand-scraped hardwood is its creation. All planks are distressed by hand, and as a result, no two appear similar. Multiple methods are used for distressing hardwood, including the following techniques for aging, scraping, or finishing.

Aged hardwood goes by one of two names: Time Worn Aged or Antique. Both are similar, but a lower grade is used for Antique flooring. In addition to being aged, the hardwood's distressed appearance is accented further through darker staining, highlighting the grain, or contouring.

Scraping techniques alter the texture of the hardwood, making an otherwise smooth surface rough. Wire Brushed is a term used to indicate hand-scraped flooring with removed sapwood and accented grain. Hand-sculpted, on the other hand, still has texture but is smoother than other varieties. Hardwood that is Hand Hewn and Rough Sawn has the roughest texture for hand-scraped flooring, with even saw marks visible.

Flooring that uses finish to give hardwood an aged texture is usually sold as French Bleed. Such hand-scraped flooring has deeper beveled edges, and the joints of the floor are highlighted with darker stain. Also a somewhat superficial type of hand-scraped flooring is pegged. Considered to be decorative only, pegged flooring must not be fastened directly onto a subfloor.

If you want an even less uniform appearance for your floor, consider having it custom distressed. In this case, after the unfinished hardwood is installed, a professional comes in to alter it through beating with chains, pickeling, fastening with antique nails, or bleaching. After, a finish is applied.

Also as you look at hand-scraped hardwood, think about your flooring long term. Will you want a distressed appearance a decade or more down the line? If not, plan ahead by going with flooring that can be sanded down: solid hardwood or an engineered variety with a thicker wear layer.

If, on the other hand, you plan to keep the hand-scraped flooring, think about how you will refinish it years down the line. Ideally, to keep up the distressed look without diminishing it through sanding, you will need a floor abrader to remove only the finish, or be prepared to have a professional refinish your floors.

Source:http://www.articlesbase.com/home-improvement-articles/hand-scraped-flooring-points-to-keep-in-mind-5435851.html

Tuesday, 30 December 2014

Web Data Scraping Services Have Various Method Of Business

Magnetic or optical data removal or Data Scraping Services is a term that refers to the elimination of digital storage media. Data Scraping Services of the method varies, depending on medium and method used in the process.

Similarly, patents, models, business strategies and other confidential business information, including sensitive data, can be easily accessed by others if the data is not deleted.As I said in the beginning, Data Scraping Services methods vary depending on the storage medium. For each storage medium, there are a variety of Data Scraping Services techniques.

Optical media such as that can be destroyed by the plastic granulating. This method does not extract information, but makes recovery almost impossible. However, removal of thin film that coats the top of the disk, scraping, sanding by hand or destroy physical data. In contrast, using the microwave, a less traditional technologies, stable and disk storage layer of the thin film is very effective for the most common cause sparks to load.

Typical modern magnetic media and hard drives, tape backup units of such media is possible, but in the face of such devices requires considerable financial investment in the plant. Acids, in particular, nitric acid, 50% concentration in the iron oxide layer to react with violence, it will be completely destroyed within a few minute. In some cases it may be a storage alternative for incineration. However, this may inadvertently expose caseinogens operator and may be restricted in certain countries.

Data Scraping Services, on the other hand, is defined by Wikipedia as "an automatic search for large stores of data for patterns of practice." In other words, you already know, and you learn things about it useful analysis.

Data Scraping Services is often accompanied by a lot of complex algorithms based on statistical methods. How do you see the data in the first place - is not. Data Scraping Services analysis, you only care about what is already there in many cases, a single-pass binary wipe (to write random zeroes and ones riding) will permanently deletes all data from the storage device to remove.

use of materials recovery.
It is for this reason that the technology has been left until last.
Data Scraping Services, screen scraping is not.
This is a great simplification, so I will work a bit.

Fast-forwarding to the web world today, screen scraping is the information relates to websites. This means that computer programs "crawl" or can "spider" through web sites, data retrieval. people, We deserved pages, text data Scraping Services, automated data collection, data extraction and web site even bloody website if we have a problem it presents some.

Data Scraping Services, on the other hand, is defined by Wikipedia as "an automatic search for large stores of data for patterns of practice." In other words, you already know, and you learn things about it useful analysis. Data Scraping Services is often accompanied by a lot of complex algorithms based on statistical methods. How do you see the data in the first place - is not. Data Scraping Services analysis, you only care about what is already there.

Source:http://www.articlesbase.com/outsourcing-articles/web-data-scraping-services-have-various-method-of-business-5594515.html

Saturday, 27 December 2014

Scraping By

In his classic 1976 Chesapeake portrait, Beautiful Swimmers, William Warner described the scrape boat as "a workboat unlike any other I had ever seen on the Bay." Seeming half as wide as it was long, he said, it looked like a "a miniature battleship." There's a reason for that, of course. It's a classic case of form following function; the boat evolved for one purpose, to ply the Bay's grassy shallows for shedding blue crabs.

Said to "float on a heavy dew," scrape boats run from 26 to 30 feet long and 9 to 10 feet wide. The hull is a shallow-V deadrise that quickly flattens toward the stern, enabling the boat to pull its twin scrapes—rectangular steel frames, each with a trailing mesh bag—in knee-deep waters. The broad beam might sound ungainly, but the hull tapers toward the stern—betraying its sailboat origins. And it has a graceful sheer, flowing from a bow height of a few feet to little more than a foot above the water amidships.

And you want a low freeboard when you spend the whole day hoisting aboard scrapes, which weigh 50 pounds apiece, not including the load of sea grass and crabs that come in too. Low sides or not, there's a higher than average inci-dence of back problems among scrape boat crabbers. They spend long days bending in precisely the position back doctors say puts undue pressure on the lower back as they sort through rolls of grasses to pluck out the peelers and softies. And that alone may be why crab potting is now the far more common way of catching soft crabs.

Some people think that's good, assuming that dragging a scrape across the Bay's beleaguered grass flats must be destructive. But the smooth bar of the scrape, unlike a toothed dredge, doesn't uproot grasses. In fact, where scraping is traditional, the grass beds seem relatively resilient. I've often thought if Maryland and Virginia had stuck with scraping as the major legal way to soft-crab, overfishing might not have become a problem. Pots can be deployed everywhere and by the thousands, whereas scraping is limited to grass beds and to ground covered at three miles per hour; and even the sturdiest waterman can only pull two of them by hand. But peeler pots seem here to stay, and other soft crabbers have taken to using a single, large scrape operated from larger workboats by hydraulic power.

The bottom line is that these lovely, superbly functional expressions of Chesapeake crabbing culture now number only in the dozens, if you count working, wooden models. There are some fiberglass scrape boat hulls in service, and a Carolina skiff or two has been adapted for the task. They are functional, but have little art to them.

It is probably a sign of how fast scrape boats are going that the Smithsonian Institution recently took the lines off Darlene, a scraper worked by Morris Marsh of Smith Island, for its archives. You can see photos of scrape boats, and learn more about the 140-year old history of scraping, from Paula Johnson's fine book, The Workboats of Smith Island. Mr. Marsh, still going strong in his late 60s, is the scraper who took Warner out nearly 40 years ago when he was researching Beautiful Swimmers.

Indeed, scraping seems to win over those who master it. Marsh's father-in-law, Ed Harrison, scraped for almost 70 years, nearly wearing through the cross-planked bottom of his boat—from the inside—with decades of walking the planks, tending his scrapes. And an islander who scrapes with Marsh today, David Laird, says he is 71—one year younger than Scotty Boy, the scrape boat he took over from his dad in 1958. "I wouldn't even know how to crab in another boat," Laird says.

Soft crabs may well be caught—or farmed—a century from now on the Chesapeake; but no one will devise a way to take them so intimately and beautifully from the shallowest marsh edges and tiniest crevices in the shore as the scrapers do.

Source:http://www.articlesbase.com/culture-articles/scraping-by-1560919.html

Wednesday, 24 December 2014

Choose Mining Wear Parts Wisely

It is important to choose a reputable supplier of mining wear parts; one that has been acknowledged as a leader in mining expertise. You will want to research and seek out a company that specializes in the engineering, manufacturing, procurement and design of mining wear parts and who has access to a multitude of patterns and templates to choose from.

It is vital to find a company that invites you to put them to the test; a company that is committed to selling more than just a product, standing behind the parts that they design and manufacture with an unprecedented industry guarantee. Some companies are so confident in their products that each wear part is stamped with their logo, identifying it as a superior product.

You will also want to find a company that takes pride in establishing strong customer relationships and who employs people who are as equally committed to providing outstanding service with customer satisfaction a priority. Your research will help you find a mining wear parts company that guarantees that if they do not have the part available, that they will find it for you or are capable of custom designing products to your exact specifications.

If you stop to consider the ramifications of an equipment malfunction or breakdown on production quotas, the significance of reliable parts becomes readily apparent. The impact can be far reaching if it halts production while the necessary repairs are completed. The ugly reality is that downtime incurs financial losses.

While the cost of aftermarket replacement mining wear parts is one factor, the installation of the part is equally as important. It is vital that aftermarket parts are built to a rugged standard to endure the rigorous industrial demands placed on them. Mining wear parts are routinely subjected to high stress abrasion and impact. The fabricated parts need to have the structural strength to be wear resistant with extended usage. Hardened manganese is the preferred material of choice to impart added strength and avoid premature breakage and replacement. Using inferior quality parts may result in the necessity of replacing them prematurely if they do not withstand the wear and tear that they are subjected to daily. While a few dollars may be saved initially by purchasing inferior mining wear parts, production costs can dramatically increase if frequent breakdowns occur and manpower hours are wasted in the field. Efficient use of manpower is an important budget consideration. Reliability is an absolute necessity w
hen you have production deadlines to meet and operations can quickly grind to a standstill when production is halted.

Quality assurance management monitors the consistency of the parts, demanding that they are machined within precise measurements. In addition, they focus on striving to improve the quality of parts as new technology becomes available. Using precision made, high quality wear parts can make your business more competitive, giving you an advantage and improving your bottom line.

Source:http://ezinearticles.com/?Choose-Mining-Wear-Parts-Wisely&id=6691631

Monday, 22 December 2014

Scraping table from any web page with R or CloudStat

Scraping table from any web page with R or CloudStat:

You need to use the data from internet, but don’t type, you can just extract or scrape them if you know the web URL.

Thanks to XML package from R. It provides amazing readHTMLtable() function.

For a study case,

I want to scrape data:

    US Airline Customer Score.
    World Top Chess Players (Men).

A. Scraping US Airline Customer Score table from

http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines

Code:

airline = ‘http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines’

airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)

Result:

> library(XML)

Warning message:

package "XML" was built under R version 2.14.1

> airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
> airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)
> airline.table

                     Base-line 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10
1          Southwest        78 76 76 76 74 72 70 70 74 75 73 74 74 76 79 81 79
2         All Others        NM 70 74 70 62 67 63 64 72 74 73 74 74 75 75 77 75
3           Airlines        72 69 69 67 65 63 63 61 66 67 66 66 65 63 62 64 66
4        Continental        67 64 66 64 66 64 62 67 68 68 67 70 67 69 62 68 71
5           American        70 71 71 62 67 64 63 62 63 67 66 64 62 60 62 60 63
6             United        71 67 70 68 65 62 62 59 64 63 64 61 63 56 56 56 60
7         US Airways        72 67 66 68 65 61 62 60 63 64 62 57 62 61 54 59 62
8              Delta        77 72 67 69 65 68 66 61 66 67 67 65 64 59 60 64 62
9 Northwest Airlines        69 71 67 64 63 53 62 56 65 64 64 64 61 61 57 57 61

11 PreviousYear%Change FirstYear%Change

1 81                 2.5              3.8
3 65                -1.5             -9.7
4 64                -9.9             -4.5
5 63                 0.0            -10.0
7 61                -1.6            -15.3
8 56                -9.7            -27.3
9 #                 N/A              N/A

>

B. Scraping World Top Chess players (Men) table from http://ratings.fide.com/top.phtml?list=men

Code:

chess = ‘http://ratings.fide.com/top.phtml?list=men’
chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)

Result:

> chess = "http://ratings.fide.com/top.phtml?list=men"
> chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)
> chess.table

     Rank                       Name Title Country Rating Games B-Year

1      1           Carlsen, Magnus    g    NOR 2835   17 1990
2      2            Aronian, Levon    g    ARM 2805   25 1982
3      3         Kramnik, Vladimir    g    RUS 2801   17 1975
4      4        Anand, Viswanathan    g    IND 2799   17 1969
5      5         Radjabov, Teimour    g    AZE 2773    9 1987
6      6          Topalov, Veselin    g    BUL 2770    9 1975
7      7          Karjakin, Sergey    g    RUS 2769   16 1990
8      8         Ivanchuk, Vassily    g    UKR 2766   16 1969
9      9     Morozevich, Alexander    g    RUS 2763    6 1977
10    10           Gashimov, Vugar    g    AZE 2761    9 1986
11    11       Grischuk, Alexander    g    RUS 2761    8 1983
12    12          Nakamura, Hikaru    g    USA 2759   17 1987
13    13            Svidler, Peter    g    RUS 2749   17 1976
14    14    Mamedyarov, Shakhriyar    g    AZE 2747    9 1985
15    15       Tomashevsky, Evgeny    g    RUS 2740    0 1987
16    16            Gelfand, Boris    g    ISR 2739    9 1968
17    17          Caruana, Fabiano    g    ITA 2736   19 1992
18    18       Nepomniachtchi, Ian    g    RUS 2735   16 1990
19    19                 Wang, Hao    g    CHN 2733    6 1989
20    20              Kamsky, Gata    g    USA 2732    0 1974
21    21 Dominguez Perez, Leinier    g    CUB 2730    6 1983
22    22         Jakovenko, Dmitry    g    RUS 2729    0 1983
23    23        Ponomariov, Ruslan    g    UKR 2727   13 1983
24    24          Vitiugov, Nikita    g    RUS 2726    1 1987
25    25            Adams, Michael    g    ENG 2724   17 1971
26    26               Leko, Peter    g    HUN 2720    9 1979
27    27            Almasi, Zoltan    g    HUN 2717    8 1976
28    28               Giri, Anish    g    NED 2714   15 1994
29    29            Le, Quang Liem    g    VIE 2714    0 1991
30    30             Navara, David    g    CZE 2712    8 1985
31    31            Shirov, Alexei    g    LAT 2710   13 1972
32    32             Polgar, Judit    g    HUN 2710    0 1976
33    33     Riazantsev, Alexander    g    RUS 2710    0 1985
34    34       Wojtaszek, Radoslaw    g    POL 2706    8 1987
35    35      Moiseenko, Alexander    g    UKR 2706    7 1980
36    36   Vallejo Pons, Francisco    g    ESP 2705   15 1982
37    37        Malakhov, Vladimir    g    RUS 2705    0 1980
38    38            Jobava, Baadur    g    GEO 2704   23 1983
39    39           Bacrot, Etienne    g    FRA 2704   14 1983
40    40          Laznicka, Viktor    g    CZE 2704    8 1988
41    41            Sutovsky, Emil    g    ISR 2703    8 1977
42    42        Naiditsch, Arkadij    g    GER 2702   14 1985
43    43         Movsesian, Sergei    g    ARM 2700    9 1978
44    44       Sasikiran, Krishnan    g    IND 2700    9 1981
45    45   Vachier-Lagrave, Maxime    g    FRA 2699   13 1990
46    46            Dreev, Aleksey    g    RUS 2698    6 1969
47    47           Efimenko, Zahar    g    UKR 2695    8 1985
48    48         Volokitin, Andrei    g    UKR 2695    0 1986
49    49                 Wang, Yue    g    CHN 2694    6 1987
50    50        Fressinet, Laurent    g    FRA 2693   17 1981
51    51                Li, Chao b    g    CHN 2693    6 1989
52    52            Grachev, Boris    g    RUS 2693    0 1986
53    53      Nielsen, Peter Heine    g    DEN 2693    0 1973
54    54            Van Wely, Loek    g    NED 2692   13 1972
55    55    Bruzon Batista, Lazaro    g    CUB 2691   19 1982
56    56           McShane, Luke J    g    ENG 2691    8 1984
57    57            Eljanov, Pavel    g    UKR 2690   10 1983
58    58      Kasimdzhanov, Rustam    g    UZB 2689   14 1979
59    59         Inarkiev, Ernesto    g    RUS 2689    6 1985
60    60         Zvjaginsev, Vadim    g    RUS 2688    8 1976
61    61         Andreikin, Dmitry    g    RUS 2688    0 1990
62    62    Areshchenko, Alexander    g    UKR 2688    0 1986
63    63         Rublevsky, Sergei    g    RUS 2686    0 1974
64    64         Akopian, Vladimir    g    ARM 2685    8 1971
65    65          Potkin, Vladimir    g    RUS 2684    0 1982
66    66       Sargissian, Gabriel    g    ARM 2683   15 1983
67    67            Berkes, Ferenc    g    HUN 2682   16 1985
68    68           Bologan, Viktor    g    MDA 2680   15 1971
69    69          Bauer, Christian    g    FRA 2679   24 1977
70    70          Tiviakov, Sergei    g    NED 2677   22 1973
71    71            Short, Nigel D    g    ENG 2677   15 1965
72    72        Motylev, Alexander    g    RUS 2677    6 1979
73    73         Gharamian, Tigran    g    FRA 2676    0 1984
74    74          Kobalia, Mikhail    g    RUS 2673    0 1978
75    75              Meier, Georg    g    GER 2671    9 1987
76    76       Onischuk, Alexander    g    USA 2670   13 1975
77    77              Bu, Xiangzhi    g    CHN 2670    6 1985
78    78          Alekseev, Evgeny    g    RUS 2670    0 1985
79    79            Azarov, Sergei    g    BLR 2667    0 1983
80    80        Kryvoruchko, Yuriy    g    UKR 2666    0 1986
81    81             Balogh, Csaba    g    HUN 2665    8 1987
82    82           Harikrishna, P.    g    IND 2665    6 1986
83    83       Khismatullin, Denis    g    RUS 2664    8 1984
84    84   Nguyen, Ngoc Truong Son    g    VIE 2662    6 1990
85    85           Fridman, Daniel    g    GER 2660   11 1976
86    86              Smirin, Ilia    g    ISR 2660    7 1968
87    87               Ding, Liren    g    CHN 2660    6 1992
88    88         Sadler, Matthew D    g    ENG 2660    3 1974
89    89            Korobov, Anton    g    UKR 2660    0 1985
90    90          Cheparinov, Ivan    g    BUL 2659   18 1986
91    91          Timofeev, Artyom    g    RUS 2659    0 1985
92    92           Georgiev, Kiril    g    BUL 2658   17 1965
93    93           Bartel, Mateusz    g    POL 2658    9 1985
94    94          Zhigalko, Sergei    g    BLR 2658    8 1989
95    95         Feller, Sebastien    g    FRA 2658    0 1991
96    96            Ragger, Markus    g    AUT 2655   17 1988
97    97         Jones, Gawain C B    g    ENG 2653   27 1987
98    98                So, Wesley    g    PHI 2653    5 1993
99    99              Milov, Vadim    g    SUI 2653    0 1972
100 100           Gupta, Abhijeet    g    IND 2652    9 1989
101 101            Postny, Evgeny    g    ISR 2652    8 1981
102 102             Roiz, Michael    g    ISR 2652    6 1983
103 103           Gyimesi, Zoltan    g    HUN 2652    4 1977
104 104          Nikolic, Predrag    g    BIH 2652    2 1960

>

Done. You had successfully scraping data from any web page with R or CloudStat.

Then, you can analyze as usual! Great! No more retype the data. Enjoy!

Source: http://www.r-bloggers.com/scraping-table-from-any-web-page-with-r-or-cloudstat/

Thursday, 18 December 2014

Basic Information About Tooth Extraction Cost

In order to maintain the good health of teeth, one must be devoted and must take proper care of one's teeth. Dentists play a huge role in this regard and their support is important in making people aware of their oral conditions, so that they receive the necessary health services concerning the problems of the mouth.

The flat fee of teeth-extraction varies from place to place. Nonetheless, there are still some average figures that people can refer to. Simple extraction of teeth might cause around 75 pounds, but if people need to remove the wisdom teeth, the extraction cost would be higher owing to the complexity of extraction involved.

There are many ways people can adopt in order to reduce the cost of extraction of tooth. For instance, they can purchase the insurance plans covering medical issues beforehand. When conditions arise that might require extraction, these insurance claims can take care of the costs involved.

Some of the dental clinics in the country are under the network of Medicare system. Therefore, it is possible for patients to make claims for these plans to reduce the amount of money expended in this field. People are not allowed to make insurance claims while they undergo cosmetic dental care like diamond implants, but extraction of teeth is always regarded as a necessity for patients; so most of the claims that are made in this front are settled easily.

It is still possible for them to pay less at the moment of the treatment, even if they have not opted for dental insurance policies. Some of the clinics offer plans which would allow patients to pay the tooth extraction cost in the form of installments. This is one of the better ways that people can consider if they are unable to pay the entire cost of tooth extraction immediately.

In fact, the cost of extracting one tooth is not very high and it is affordable to most people. Of course, if there are many other oral problems that you encounter, the extraction cost would be higher. Dentists would also consider the other problems you have and charge you additional fees accordingly. Not brushing the teeth regularly might aid in the development of plaque and this can make the cost of tooth extraction higher.

Maintaining a good oral health is important and it reflects the overall health of an individual.

To conclude, you need to know the information about cost of extraction so you can get the right service and must also follow certain easy practices to reduce the tooth extraction cost.

Source:http://ezinearticles.com/?Basic-Information-About-Tooth-Extraction-Cost&id=6623204

Monday, 15 December 2014

Autoscraping casts a wider net

We have recently started letting more users into the private beta for our Autoscraping service. We’re receiving a lot of applications following the shutdown of Needlebase and we’re increasing our capacity to accommodate these users.

Natalia made a screencast to help our new users get started:

It’s also a great introduction to what this service can do.

We released slybot as an open source integration of the scrapely extraction library and the scrapy framework. This is the core technology behind the autoscraping service and we will make it easy to export autoscraping spiders from Scrapinghub and run them completely with slybot – allowing our users to have the flexibility and freedom provided by open source.

Source:http://blog.scrapinghub.com/2012/02/27/autoscraping-casts-a-wider-net/

Saturday, 13 December 2014

Handling exceptions in scrapers

When requesting and parsing data from a source with unknown properties and random behavior (in other words, scraping), I expect all kinds of bizarrities to occur. Managing exceptions is particularly helpful in such cases.

Here is some ways that an exception might be raised.
[][0] #The list has no zeroth element, so this raises an IndexError
{}['foo'] #The dictionary has no foo element, so this raises a KeyError

Catching the exception is sometimes cleaner than preventing it from happening in the first place. Here are some examples handling bizarre exceptions in scrapers.

Example 1: Inconsistant date formats

Let’s say we’re parsing dates.
import datetime
This doesn’t raise an error.
datetime.datetime.strptime('2012-04-19', '%Y-%m-%d')
But this does.
datetime.datetime.strptime('April 19, 2012', '%Y-%m-%d')

It raises a ValueError because the date formats don’t match. So what do we do if we’re scraping a data source with multiple date formats?

Ignoring unexpected date formats

A simple thing is to ignore the date formats that we didn’t expect.

import lxml.html
import datetime
def parse_date1(source):
    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text
    try:
         cleandate = datetime.datetime.strptime(rawdate, '%Y-%m-%d')
    except ValueError:
         cleandate = None
    return cleandate

print parse_date1('<div id="date">2012-04-19</div>')

If we make a clean date column in a database and put this in there, we’ll have some rows with dates and some rows with nulls. If there are only a few nulls, we might just parse those by hand.

Trying multiple date formats

Maybe we have determined that this particular data source uses three different date formats. We can try all three.

import lxml.html
import datetime

def parse_date2(source):

    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text

    for date_format in ['%Y-%m-%d', '%B %d, %Y', '%d %B, %Y']:

        try:
             cleandate = datetime.datetime.strptime(rawdate, date_format)
             return cleandate
        except ValueError:
             pass
    return None

print parse_date2('<div id="date">19 April, 2012</div>')

This loops through three different date formats and returns the first one that doesn’t raise the error.

Example 2: Unreliable HTTP connection

If you’re scraping an unreliable website or you are behind an unreliable internet connection, you may sometimes get HTTPErrors or URLErrors for valid URLs. Trying again later might help.

import urllib2
def load(url):
    retries = 3
    for i in range(retries):
        try:
            handle = urllib2.urlopen(url)
            return handle.read()
        except urllib2.URLError:
            if i + 1 == retries:
                raise
            else:
                time.sleep(42)
    # never get here

print load('http://thomaslevine.com')

This function tries to download the page thee times. On the first two fails, it waits 42 seconds and tries again. On the third failure, it raises the error. On a success, it returs the content of the page.

Example 3: Logging errors rather than raising them

For more complicated parses, you might find loads of errors popping up in weird places, so you might want to go through all of the documents before deciding which to fix first or whether to do some of them manually.

import scraperwiki
for document_name in document_names:
    try:
        parse_document(document_name)
    except Exception as e:
        scraperwiki.sqlite.save([], {
            'documentName': document_name,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')

This catches any exception raised by a particular document, stores it in the database and then continues with the next document. Looking at the database afterwards, you might notice some trends in the errors that you can easily fix and some others where you might hard-code the correct parse.

Example 4: Exiting gracefully

When I’m scraping over 9000 pages and my script fails on page 8765, I like to be able to resume where I left off. I can often figure out where I left off based on the previous row that I saved to a database or file, but sometimes I can’t, particularly when I don’t have a unique index.

for bar in bars:
    try:
        foo(bar)
    except:
        print('Failure at bar = "%s"' % bar)
        raise

This will tell me which bar I left off on. It’s fancier if I save the information to the database, so here is how I might do that with ScraperWiki.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
for i, bar in enumerate(bars[resume_index:]):
    try:
        foo(bar)
    except:
        scraperwiki.sqlite.save_var('resume_index', i)
        raise
scraperwiki.sqlite.save_var('resume_index', 0)

ScraperWiki has a limit on CPU time, so an error that often concerns me is the scraperwiki.CPUTimeExceededError. This error is raised after the script has used 80 seconds of CPU time; if you catch the exception, you have two CPU seconds to clean up. You might want to handle this error differently from other errors.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
for i, bar in enumerate(bars[resume_index:]):
    try:
        foo(bar)
    except scraperwiki.CPUTimeExceededError:
        scraperwiki.sqlite.save_var('resume_index', i)
    except Exception as e:
        scraperwiki.sqlite.save_var('resume_index', i)
        scraperwiki.sqlite.save([], {
            'bar': bar,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')
scraperwiki.sqlite.save_var('resume_index', 0)

tl;dr

Expect exceptions to occur when you are scraping a randomly unreliable website with randomly inconsistent content, and consider handling them in ways that allow the script to keep running when one document of interest is bizarrely formatted or not available.

Source: https://blog.scraperwiki.com/2012/05/handling-exceptions-in-scrapers/

Thursday, 11 December 2014

Scraping Webmaster Tools with FMiner

The biggest problem (after the problem with their data quality) I am having with Google Webmaster Tools is that you can’t export all the data for external analysis. Luckily the guys from the FMiner.com web scraping tool contacted me a few weeks ago to test their tool. The problem with Webmaster Tools is that you can’t use web based scrapers and all the other screen scraping software tools were not that good in the steps you need to take to get to the data within Webmaster Tools. The software is available for Windows and Mac OSX users.

FMiner is a classical screen scraping app, installed on your desktop. Since you need to emulate real browser behaviour, you need to install it on your desktop. There is no coding required and their interface is visual based which makes it possible to start scraping within minutes. Another possibility I like is to upload a set of keywords, to scrape internal search engine result pages for example, something that is missing in a lot of other tools. If you need to scrape a lot of accounts, this tool provides multi-browser crawling which decreases the time needed.

This tool can be used for a lot of scraping jobs, including Google SERPs, Facebook Graph search, downloading files & images and collecting e-mail addresses. And for the real heavy scrapers, they also have built in a captcha solving API system so if you want to pass captchas while scraping, no problem.

Below you can find an introduction to the tool, with one of their tutorial video’s about scraping IMDB.com:

More basic and advanced tutorials can be found on their website: Fminer tutorials. Their tutorials show you a range of simple and complex tasks and how to use their software to get the data you need.

Guide for Scraping Webmaster Tools data

The software is capable of dealing with JavaScript and AJAX, one of the main requirements to scrape data from within Google Webmaster Tools.

Step 1: The first challenge is to login into webmaster tools. After opening a new project, first browse to https://www.google.com/webmasters/ and select the Recording button in the upper left corner.

fminer01

After browsing to this page, a goto action appears in the left panel. Click on this button and look for the “Action Options” button at the bottom of that panel. Tick the option Clear cookies before do it to avoid problems if you are already logged in for example.

fminer06

Step 2: Click the “Sign in Webmaster Tools” button. You will notice the Macro designer overview on the left registered a click as the first step.

fminer03

Step 3: Fill in your Google username and password. In the designer panel you will see the two Fill actions emerging.

fminer04

Step 4: After this step you should add some waiting time to be sure everything is fully loaded. Use the second button on the right side above the Macro Designer panel to add an action. 2000 milliseconds (2 seconds :)) will do the job.

fminer07

fminer08

Step 5: Browse to the account of which you want to export the data from

fminer05

Step 6: Browse to the specific pages of which you want the data scraped

fminer09

Step 7:Scrape the data from the tables as shown in the video

Congratulations, now you are able to scrape data from Google Webmaster Tools :)

Step 8: One of the things I use it for is pulling the search query data per keyword, which you normally can’t export. To do that, you have to use a right mouse click on the keyword, which opens a menu with options. Go to open links recursively and select normal. This will loop through all the keywords.

fminer10

Step 9: This video will show you how to make use of the pagination elements to loop through all the pages:

You can also download the following file, which has a predefined set of actions to login in WMT and download the keywords, impressions and clicks: google_webmaster_tools_login.fmpx. Open the file and update the login details by clicking on those action buttons and insert your own Google account details.

Automating and scheduling scrapers

For people that want to automate and regularly download the data, you can setup a Scheduler config and within the project settings you can setup the program to send an e-mail after completion of the crawl:

Source: http://www.notprovided.eu/scraping-webmaster-tools-fminer/

Thursday, 4 December 2014

Web scraping tutorial

There are three ways to access a website data. One is through a browser, the other is using a API (if the site provides one) and the last by parsing the web pages through code. The last one also known as Web Scraping is a technique of extracting information from websites using specially coded programs.

In this post we will take a quick look at writing a simple scraperusing the simplehtmldom library. But before we continue a word of caution:

Writing screen scrapers and spiders that consume large amounts of bandwidth, guess passwords, grab information from a site and use it somewhere else may well be a violation of someone’s rights and will eventually land you in trouble. Before writing a screen scraper first see if the website offers an RSS feed or an API for the data you are looking. If not and you have to use a scraper, first check the websites policies regarding automated tools before proceeding.

Now that we have got all the legalities out of the way, lets start with the examples.

1. Installing simplehtmldom.
Simplehtmldom is a PHP library that facilitates the process of creating web scrapers. It is a HTML DOM parser written in PHP5 that let you manipulate HTML in a quick and easy way. It is a wonderful library that does away with the messy details of regular expressions and uses CSS selector style DOM access like those found in jQuery.

First download the library from sourceforge. Unzip the library in you PHP includes directory or a directory where you will be testing the code.

Writing our first scraper.

Now that we are ready with the tools, lets write our first web scraper. For our initial idea let us see how to grab the sponsored links section from a google search page.

There are three ways to access a website data. One is through a browser, the other is using a API (if the site provides one) and the last by parsing the web pages through code. The last one also known as Web Scraping is a technique of extracting information from websites using specially coded programs.

In this post we will take a quick look at writing a simple scraperusing the simplehtmldom library. But before we continue a word of caution:

Writing screen scrapers and spiders that consume large amounts of bandwidth, guess passwords, grab information from a site and use it somewhere else may well be a violation of someone’s rights and will eventually land you in trouble. Before writing a screen scraper first see if the website offers an RSS feed or an API for the data you are looking. If not and you have to use a scraper, first check the websites policies regarding automated tools before proceeding.

Source: http://www.codediesel.com/php/web-scraping-in-php-tutorial/

Thursday, 27 November 2014

Scraping SSL Labs Server Test Results With R

    NOTE: Qualys allows automated access to their SSL Server Test site in their T&C’s, and the R fucntion/script provided here does its best to adhere to their guidelines. However, if you launch multiple scripts at one time and catch their attention you will, no doubt, be banned.

This post will show you how to do some basic web page data scraping with R. To make it more palatable to those in the security domain, we’ll be scraping the results from Qualys’ SSL Labs SSL Test site by building an R function that will:

    fetch the contents of a URL with RCurl
    process the HTML page tags with R’s XML library
    identify the key elements from the page that need to be scraped
    organize the results into a usable R data structure

You can skip ahead to the code at the end (or in this gist) or read on for some expository that isn’t in the code’s comments.

Setting up the script and processing flow

We’ll need some assistance from three R packages to perform the scraping, processing and transformation tasks:

library(RCurl) # scraping
library(XML)   # XML (HTML) processing
library(plyr) # data transformation

If you poke at the SSL Test site with a few different URLs, you’ll see there are three primary inputs to the GET request we’ll need to issue:

    d (the domain)
    s (the IP address to test)
    ignoreMismatch (which we’ll leave as ‘on‘)

You’ll also see that there’s often a delay between issuing a request and getting the results, so we’ll need to build in a GET+check-loop (like the javascript on the page does automagically). Finally, when the results are eventually displayed they are (at least for this example) usually either "Overall Rating" or "Assessment" and, we’ll use that status result in our tests for what to return.

We’ll account for the domain and IP address in the function parameters along with the amount of time we should pause between GET+check attempts. It’s also a good idea to provide a way to pass in any extra curl options (e.g. in the event folks are behind a proxy server and need to input that to make the requests work). We’ll define the function with some default parameters:

get_rating <- function(site="rud.is", ip="", pause=5, curl.opts=list()) {

}

This definition says that if we just call get_rating(), it will

    default to using "rud.is" as the domain (you can pick what you want in your implementation)
    not supply an IP address (which the script will then have to lookup with nsl)
    will pause 5s between GET+check attempts
    pass no extra curl options

Getting into the details

For the IP address logic, we’ll have to test if we passed in an an address string and perform a lookup if not:

# try to resolve IP if not specified; if no IP can be found, return
# a "NA" data frame

if (ip == "") {

    tmp <- nsl(site)
    if (is.null(tmp)) {
      return(data.frame(site=site, ip=NA, Certificate=NA,
                        Protocol.Support=NA, Key.Exchange=NA,
                        Cipher.Strength=NA)) }
    ip <- tmp
}

(don’t worry about the return(...) part yet, we’ll get there in a bit).

Once we have an IP address, we’ll need to make the call to the ssllabs.com test site and perform the check loop:

# get the contents of the URL (will be the raw HTML text)
# build the URL with sprintf

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)

# while we don't find some indication of a completed request,
# pause and try again

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {
Sys.sleep(pause)
rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)
}

We can then start making some decisions based on the results:

# if the assessment failed, return a data frame of NA's

if (grepl("Assessment failed", rating.dat)) {

return(data.frame(site=site, ip=NA, Certificate=NA,
                    Protocol.Support=NA, Key.Exchange=NA,
                    Cipher.Strength=NA))
}

# otherwise, parse the resultant HTML

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

Unfortunately, the results are not “consistent”. While there are plenty of uniquely identifiable <div>s, there are enough differences between runs that we have to be a bit generic in our selection of data elements to extract. I’ll leave the view-source: of a result as an exercise to the reader. For this example, we’ll focus on extracting:

        the overall rating (A-F)
        the “Certificate” score
        the “Protocol Support” score
        the “Key Exchange” score
        the “Cipher Strength” score

There are plenty of additional fields to extract, but you should be able to extrapolate and grab what you want to from the rest of the example.

Extracting the results

We’ll need to delve into XPath to extract the <div> values. We’ll use the xpathSApply function to perform this task. Since there sometimes is a <span> tag within the <div> for the rating and since the rating has a class tag to help identify which color it should be, we use a starts-with selection parameter to just get anything beginning with rating_. If it returns an R list structure, we know we have the one with a <span> element, so we re-issue the call with that extra XPath component.

rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/text()", xmlValue)

if (class(rating) == "list") {

rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/span/text()", xmlValue)
}

For the four attributes (and values) we’ll be extracting, we can use the getNodeSet call which will give us all of them into a structure we can process with xpathSApply

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

At this point, labs will be a vector of label names and vals will be the corresponding values. We’ll put them, the original domain and the IP address into a data frame:

# rbind will turn the vector into row elements, with each

# value being in a column

rating.result <- data.frame(site=site, ip=ip,

                            rating=rating, rbind(vals),
                            row.names=NULL)

# we use the labs vector as the column names (in the right spot)

colnames(rating.result) <- c("site", "ip", "rating",

                              gsub(" ", "\\.", labs))

and return the result:
return(rating.result)
Finishing up

If we run the whole function on one domain we’ll get a one-row data frame back as a result. If we use ldply from the plyr package to run the get_rating function repeatedly on a vector of domains, it will combine them all into one whole data frame. For example:

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

##                site              ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength

## 1            rud.is 184.106.97.102      B         100               70           80              90

## 2 stackoverflow.com 198.252.206.140      A         100               90           80              90

## 3        er-ant.com            <NA>   <NA>        <NA>             <NA>         <NA>            <NA>

There are many tweaks you can make to this function to extract more data and perform additional processing. If you make some of your own changes, you’re encouraged to add to the gist (link above & below) and/or drop a note in the comments.

Hopefully you’ve seen how well-suited R is for this type of operation and have been encouraged to use it in your next attempt at some site/data scraping.

library(RCurl)
library(XML)
library(plyr)

#' get the Qualys SSL Labs rating for a domain+cert

#'

#' @param site domain to test SSL configuration of

#' @param ip address of \code{site} (will resolve it and take\cr

#' first response if not specified, but that may not always work as you expect)

#' @param hide.results ["on"|"off"] should the results show up in the SSL Labs history (default "on")

#' @param pause timeout between tries (default 5s)

#' @param curl.opts options to pass to \code{getURL} i.e. proxy setting

#' @return data frame of results

#'

get_rating <- function(site="rud.is", ip="", hide.results="on", pause=5, curl.opts=list()) {

# try to resolve IP if not specified; if no IP can be found, return

# a "NA" data frame

if (ip == "") {

tmp <- nsl(site)

if (is.null(tmp)) { return(data.frame(site=site, ip=NA, Certificate=NA,

Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA)) }

ip <- tmp

}

# need to let it actually process the certificate if not already cached

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {

Sys.sleep(pause)

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

}

if (grepl("Assessment failed", rating.dat)) {

return(data.frame(site=site, ip=NA, Certificate=NA,

Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA))

}

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

# sometimes there is a <span ...> tag in the <div>, which will result in an

# empty list() object being returned. we check for that and handle it

# appropriately.

rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/text()"]])

if (class(rating) == "list") {

rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/span/text()"]])

}

# extract the XML objects for the ratings labels & values

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

# make them into a data frame

rating.result <- data.frame(site=site, ip=ip, rating=rating, rbind(vals), row.names=NULL)

colnames(rating.result) <- c("site", "ip", "rating", gsub(" ", "\\.", labs))

return(rating.result)

}

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

## site ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength

## 1 rud.is 184.106.97.102 B 100 70 80 90

## 2 stackoverflow.com 198.252.206.140 A 100 90 80 90

## 3 er-ant.com <NA> <NA> <NA> <NA> <NA> <NA>

Source: http://www.r-bloggers.com/scraping-ssl-labs-server-test-results-with-r/

Wednesday, 26 November 2014

Web Scraping Tools for Non-developers

I recently spoke with a resource-limited organization that is investigating government corruption and wants to access various public datasets to monitor politicians and law firms. They don’t have developers in-house, but feel pretty comfortable analyzing datasets in CSV form. While many public datasources are available in structured form, some sources are hidden in what us data folks call the deep web. Amazon is a nice example of a deep website, where you have to enter text into a search box, click on a few buttons to narrow down your results, and finally access relatively structured data (prices, model numbers, etc.) embedded in HTML. Amazon has a structured database of their products somewhere, but all you get to see is a bunch of webpages trapped behind some forms.

A developer usually isn’t hindered by the deep web. If we want the data on a webpage, we can automate form submissions and key presses, and we can parse some ugly HTML before emitting reasonably structured CSVs or JSON. But what can one accomplish without writing code?

This turns out to be a hard problem. Lots of companies have tried, to varying degrees of success, to build a programmer-free interface for structured web data extraction. I had the pleasure of working on one such project, called Needlebase at ITA before Google acquired it and closed things down. David Huynh, my wonderful colleague from grad school, prototyped a tool called Sifter that did most of what one would need, but like all good research from 2006, the lasting impact is his paper rather than his software artifact.

Below, I’ve compiled a list of some available tools. The list comes from memory, the advice of some friends that have done this before, and, most productively, a question on Twitter that Hilary Mason was nice enough to retweet.

The bad news is that none of the tools I tested would work out of the box for the specific use case I was testing. To understand why, I’ll break down the steps required for a working web scraper, and then use those steps to explain where various solutions broke down.

The anatomy of a web scraper

There are three steps to a structured extraction pipeline:

    Authenticate yourself. This might require logging in to a website or filling out a CAPTCHA to prove you’re not…a web scraper. Because the source I wanted to scrape required filling out a CAPTCHA, all of the automated tools I’ll review below failed step 1. It suggests that as a low bar, good scrapers should facilitate a human in the loop: automate the things machines are good at automating, and fall back to a human to perform authentication tasks the machines can’t do on their own.

    Navigate to the pages with the data. This might require entering some text into a search box (e.g., searching for a product on Amazon), or it might require clicking “next” through all of the pages that results are split over (often called pagination). Some of the tools I looked at allowed entering text into search boxes, but none of them correctly handled pagination across multiple pages of results.

    Extract the data. On any page you’d like to extract content from, the scraper has to help you identify the data you’d like to extract. The cleanest example of this that I’ve seen is captured in a video for one of the tools below: the interface lets you click on some text you want to pluck out of a website, asks you to label it, and then allows you to correct mistakes it learns how to extract the other examples on the page.

As you’ll see in a moment, the steps at the top of this list are hardest to automate.

What are the tools?

Here are some of the tools that came highly recommended, and my experience with them. None of those passed the CAPTCHA test, so I’ll focus on their handling of navigation and extraction.

    Web Scraper is a Chrome plugin that allows you to build navigable site maps and extract elements from those site maps. It would have done everything necessary in this scenario, except the source I was trying to scrape captured click events on links (I KNOW!), which tripped things up. You should give it a shot if you’d like to scrape a simpler site, and the youtube video that comes with it helps get around the slightly confusing user interface.

    import.io looks like a clean webpage-to-api story. The service views any webpage as a potential data source to generate an API from. If the page you’re looking at has been scraped before, you can access an API or download some of its data. If the page hasn’t been processed before, import.io walks you through the process of building connectors (for navigation) or extractors (to pull out the data) for the site. Once at the page with the data you want, you can annotate a screenshot of the page with the fields you’d like to extract. After you submit your request, it appears to get queued for extraction. I’m still waiting for the data 24 hours after submitting a request, so I can’t vouch for the quality, but the delay suggests that import.io uses crowd workers to turn your instructions into some sort of semi-automated extraction process, which likely helps improve extraction quality. The site I tried to scrape requires an arcane combination of javascript/POST requests that threw import.io’s connectors for a lo
op, and ultimately made it impossible to tell import.io how to navigate the site. Despite the complications, import.io seems like one of the more polished website-to-data efforts on this list.

    Kimono was one of the most popular suggestions I got, and is quite polished. After installing the Kimono bookmarklet in your browser, you can select elements of the page you wish to extract, and provide some positive/negative examples to train the extractor. This means that unlike import.io, you don’t have to wait to get access to the extracted data. After labeling the data, you can quickly export it as CSV/JSON/a web endpoint. The tool worked seamlessly to extract a feed from the Hackernews front page, but I’d imagine that failures in the automated approach would make me wish I had access to import.io’s crowd workers. The tool would be high on my list except that navigation/pagination is coming soon, and will ultimately cost money.

    Dapper, which is now owned by Yahoo!, provides about the same level of scraping capabilities as Kimono. You can extract content, but like Kimono it’s unclear how to navigate/paginate.

    Google Docs was an unexpected contender. If the data you’re extracting is in an HTML table/RSS Feed/CSV file/XML document on a single webpage with no navigation/authentication, you can use one of the Import* functions in Google Docs. The IMPORTHTML macro worked as advertised in a quick test.

    iMacros is a tool that I could imagine solves all of the tasks I wanted, but costs more than I was willing to pay to write this blog post. Interestingly, the free version handles the steps that the other tools on this list don’t do as well: navigation. Through your browser, iMacros lets you automate filling out forms, clicking on “next” links, etc. To perform extraction, you have to pay at least $495.

    A friend has used Screen-scraper in the past with good outcomes. It handles navigation as well as extraction, but costs money and requires a small amount of programming/tokenization skills.

    Winautomation seems cool, but it’s only available for Windows, which was a dead end for me.

So that’s it? Nothing works?

Not quite. None of these tools solved the problem I had on a very challenging website: the site clearly didn’t want to be crawled given the CAPTCHA, and the javascript-submitted POST requests threw most of the tools that expected navigation through links for a loop. Still, most of the tools I reviewed have snazzy demos, and I was able to use some of them for extracting content from sites that were less challenging than the one I initially intended to scrape.

All hope is not lost, however. Where pure automation fails, a human can step in. Several proposals suggested paying people on oDesk, Mechanical Turk, or CrowdFlower to extract the content with a human touch. This would certainly get us past the CAPTCHA and hard-to-automate navigation. It might get pretty expensive to have humans copy/paste the data for extraction, however. Given that the tools above are good at extracting content from any single page, I suspect there’s room for a human-in-the-loop scraping tool to steal the show: humans can navigate and train the extraction step, and the machine can perform the extraction. I suspect that’s what import.io is up to, and I’m hopeful they keep the tool available to folks like the ones I initially tried to help.

While we’re on the topic of human-powered solutions, it might make sense to hire a developer on oDesk to just implement the scraper for the site this organization was looking at. While a lot of the developer-free tools I mentioned above look promising, there are clearly cases where paying someone for a few hours of script-building just makes sense.

Source: http://blog.marcua.net/post/74655674340

Sunday, 23 November 2014

4 Data Mining Tips to Scrap Real Estate Data; Innovative Way to Give Realty Business a boost!

Internet has become a huge source of data – in fact; it has turned into a goldmine for the marketers, from where they can easily dig the useful data!

Web scraping has become a norm in today’s competitive era, where one with maximum and relevant information wins the race!

Real Estate Data Extraction and Scraping Service

It has helped many industries to carve a niche in the market; especially real estate – Scraping real estate data has been of great help for professionals to reach out to a large number of people and gather reliable property data. However, there are some people for whom web scraping is still an alien concept; most probably because most of its advantages are not discussed.

There are institutions, companies and organizations, entrepreneurs, as well as just normal citizens generating an extraordinary amount of information every day. Property information extraction can be effectively used to get an idea about the customer psyche and even generate valuable lead to further the business.

In addition to this, data mining has also some of following uses making it an indispensable part of marketing.

Gather Properties Details from Different Geographical Locations

You are an estate agent and want to expand your business to the neighboring city or state. But, then you are short of information. You are completely aware of the properties in the vicinity and in your town; however, with data mining services will help you to get an idea about the properties in the other state. You can also approach probable clients and increase your database to offer extensive services.

Online Offers and Discounts are just a Click Away

Now, it is tough to deal with the clients, show them the property of their choice and again act as a mediator between the buyer and seller. In all this, it becomes almost difficult to take a look at some special discounts or offers. With the data mining services, you can get an insight into these amazing offers. Thus, you can plan a move or even provide your client an amazing deal.

What people are talking about – Easy Monitoring of your Online Reputation

Internet has become a melting pot where different people come together. In fact, it provides a huge platform where people discuss about their likes and dislikes. When you dig into such online forums, you can get an idea of reputation that you or your firm holds. You can know what people think about you and where you require to buck up and where you need to slow down.

A Chance to Know your Competitors Better!

Last, but not the least, you can keep an eye on the competitor. Real Estate is getting more competitive; and therefore, it is important to have knowledge about your competitors to get an upper hand. It will help you to plan your moves and strategize with more ease. Moreover, you also know what is that “something” that your competitor does not have and you have, with can be subtly highlighted.

Property information extraction can prove to be the most fruitful method to get a cutting edge in the industry.

Source: http://www.hitechbposervices.com/blog/4-data-mining-tips-to-scrap-real-estate-data-innovative-way-to-give-realty-business-a-boost/

Wednesday, 19 November 2014

Web Scraping for Fun & Profit

There’s a number of ways to retrieve data from a backend system within mobile projects. In an ideal world, everything would have a RESTful JSON API – but often, this isn’t the case.Sometimes, SOAP is the language of the backend. Sometimes, it’s some proprietary protocol which might not even be HTTP-based. Then, there’s scraping.

Retrieving information from web sites as a human is easy. The page communicates information using stylistic elements like headings, tables and lists – this is the communication protocol of the web. Machines retrieve information with a focus on structure rather than style, typically using communication protocols like XML or JSON. Web scraping attempts to bridge this human protocol into a machine-readable format like JSON. This is what we try to achieve with web scraping.

As a means of getting to data, it don’t get much worse than web scraping. Scrapers were often built with Regular Expressions to retrieve the data from the page. Difficult to craft, impossible to maintain, this means of retrieval was far from ideal. The risks are many – even the slightest layout change on a web page can upset scraper code, and break the entire integration. It’s a fragile means for building integrations, but sometimes it’s the only way.

Having built a scraper service recently, the most interesting observation for me is how far we’ve come from these “dark days”. Node.js, and the massive ecosystem of community built modules has done much to change how these scraper services are built.

Effectively Scraping Information

Websites are built on the Document Object Model, or DOM. This is a tree structure, which represents the information on a page.By interpreting the source of a website as a DOM, we can retrieve information much more reliably than using methods like regular expression matching. The most popular method of querying the DOM is using jQuery, which enables us to build powerful and maintainable queries for information. The JSDom Node module allows us to use a DOM-like structure in serverside code.

For purpose of Illustration, we’re going to scrape the blog page of FeedHenry’s website. I’ve built a small code snippet that retrieves the contents of the blog, and translates it into a JSON API. To find the queries I need to run, first I need to look at the HTML of the page. To do this, in Chrome, I right-click the element I’m looking to inspect on the page, and click “Inspect Element”.

Screen Shot 2014-09-30 at 10.44.38

Articles on the FeedHenry blog are a series of ‘div’ elements with the ‘.itemContainer’ class

Searching for a pattern in the HTML to query all blog post elements, we construct the `div.itemContainer` query. In jQuery, we can iterate over these using the .each method:

var posts = [];

$('div.itemContainer').each(function(index, item){

// Make JSON objects of every post in here, pushing to the posts[] array

});

From there, we pick off the heading, author and post summary using a child selector on the original post, querying the relevant semantic elements:

    Post Title, using jQuery:

    $(item).find('h3').text()trim() // trim, because titles have white space either side

    Post Author, using jQuery:

    $(item).find('.catItemAuthor a').text()

    Post Body, using jQuery:

    $(item).find('p').text()

Adding some JSDom magic to our snippet, and pulling together the above two concept (iterating through posts, and picking off info from each post), we get this snippet:

var request = require('request'),

jsdom = require('jsdom');

jsdom.env(

"http://www.feedhenry.com/category/blog",

["http://code.jquery.com/jquery.js"],

function (errors, window) {

    var $ = window.$, // Alias jQUery

    posts = [];

    $('div.itemContainer').each(function(index, item){

      item = $(item); // make queryable in JQ

      posts.push({

        heading : item.find('h3').text().trim(),

        author : item.find('.catItemAuthor a').text(),

        teaser : item.find('p').text()

      });

    });

    console.log(posts);

}

);

A note on building CSS Queries

As with styling web sites with CSS, building effective CSS queries is equally as important when building a scraper. It’s important to build queries that are not too specific, or likely to break when the structure of the page changes. Equally important is to pick a query that is not too general, and likely to select extra data from the page you don’t want to retrieve.

A neat trick for generating the relevant selector statement is to use Chrome’s “CSS Path” feature in the inspector. After finding the element in the inspector panel, right click, and select “Copy CSS Path”. This method is good for individual items, but for picking repeating patterns (like blog posts), this doesn’t work though. Often, the path it gives is much too specific, making for a fragile binding. Any changes to the page’s structure will break the query.

Making a Re-usable Scraping Service

Now that we’ve retrieved information from a web page, and made some JSON, let’s build a reusable API from this. We’re going to make a FeedHenry Blog Scraper service in FeedHenry3. For those of you not familiar with service creation, see this video walkthrough.

We’re going to start by creating a “new mBaaS Service”, rather than selecting one of the off-the-shelf services. To do this, we modify the application.js file of our service to include one route, /blog, which includes our code snippet from earlier:

// just boilerplate scraper setup

var mbaasApi = require('fh-mbaas-api'),

express = require('express'),

mbaasExpress = mbaasApi.mbaasExpress(),

cors = require('cors'),

request = require('request'),

jsdom = require('jsdom');

var app = express();

app.use(cors());

app.use('/sys', mbaasExpress.sys([]));

app.use('/mbaas', mbaasExpress.mbaas);

app.use(mbaasExpress.fhmiddleware());

// Our /blog scraper route

app.get('/blog', function(req, res, next){

jsdom.env(

    "http://www.feedhenry.com/category/blog",

    ["http://code.jquery.com/jquery.js"],

    function (errors, window) {

      var $ = window.$, // Alias jQUery

      posts = [];

      $('div.itemContainer').each(function(index, item){

        item = $(item); // make queryable in JQ

        posts.push({

          heading : item.find('h3').text().trim(),

          author : item.find('.catItemAuthor a').text(),

          teaser : item.find('p').text()

        });

      });

      return res.json(posts);

    }

);

});

app.use(mbaasExpress.errorHandler());

var port = process.env.FH_PORT || process.env.VCAP_APP_PORT || 8001;

var server = app.listen(port, function() {});

We’re also going to write some documentation for our service, so we (and other developers) can interact with it using the FeedHenry discovery console. We’re going to modify the README.md file to document what we’ve just done using API Blueprint documentation format:

# FeedHenry Blog Web Scraper

This is a feedhenry blog scraper service. It uses the `JSDom` and `request` modules to retrieve the contents of the FeedHenry developer blog, and parse the content using jQuery.

# Group Scraper API Group

# blog [/blog]

Blog Endpoint

## blog [GET]

Get blog posts endpoint, returns JSON data.

+ Response 200 (application/json)

    + Body

            [{ blog post}, { blog post}, { blog post}]

We can now try out the scraper service in the studio, and see the response:

Scraping – The Ultimate in API Creation?

Now that I’ve described some modern techniques for effectively scraping data from web sites, it’s time for some major caveats. First, WordPress blogs like ours already have feeds and APIs available to developers - there’s no need to ever scrape any of this content. Web Scraping is not a replacement for an API. It should be used only as a last resort, after every endeavour to discover an API has already been made. Using a web scraper in a commercial setting requires much time set aside to maintain the queries, and an agreement with the source data is being scraped on to alert developers in the event the page changes structure.

With all this in mind, it can be a useful tool to iterate quickly on an integration when waiting for an API, or as a fun hack project.

Source: http://www.feedhenry.com/web-scraping-fun-profit/

Saturday, 15 November 2014

Building Java Object Graph with Tour de France results – using screen scraping, java.util.Parser and assorted facilities

Last Saturday, the Tour de France 2011 departed. For people like myself, enjoying sports and working on Data Visualizations on the one hand and far fetched uses of SQL on the other, the Tour de France offers a wealth of data to work with: rankings for each stage in various categories, nationalities and teams to group by, distances and velocity, years to compare with one another and the like. So it has been my intention for some time to get hold of that data in a format I could work with.

Today I finally found some time to get it done. To locate the statistics for the Tour de France editions for the last few years and get them onto my laptop and into my database. This article describes the first part of that journey: how to get the stage results from some source on the internet into my locally running Java program in an appropriate object structure.

My starting point is the official Tour de France website:

Image

This website goes back to 2007 and also has the latest (2011) results. It presents the result in a format pleasing to the human eye – based on an HTML structure that is fairly pleasing to my groping Java code as well.

Analyzing the source of the Tour de France data

I start my explorations in Firefox, using the Firebug plugin. When I select the tab with the results for a particular stage, I inspect the (AJAX) call that is made to retrieve the stage results into the browser:

Image

The URL that was accessed is www.letour.fr/2010/TDF/LIVE/us/700/classement/ITE.html . When I access that URL directly, I see an HTML fragment with the individual ranking for the 7th stage in 2010. It turns out that with ITG instead of ITE in this URL, I get the overall ranking after the 7th Stage. Using IME in stead of ITE, I get the 7th stage’s climbers’ standing. And so on.

The HTML associated with the stage standing looks like this:

Image

Which is not as user friendly as the corresponding display in the browser:

Image

but still fairly well structured and programmatically interpretable.

Retrieving HTML fragments and parsing in Java

Consuming these HTML fragments with stage standings into my own Java code is very easy. Parsing the data and turning it into sensible Java Objects is slightly more work, but still quite feasible. From the Java Objects I next need to create a persistent storage for the data – that is the subject for another article.

Using the Java URL class and its openStream method to open an InputStream on whatever content can be found at the URL, it is dead easy to start reading the HTML from the Tour de France website into my Java program. I make use of the java.util.Scanner class to work my way through the HTML by Table Row (TR element). When you inspect the HTML fragments, it is clear early on that every individual rider’s entry corresponds with a TR element, so it seems only logical to have the Scanner break up the data by TR.

private static Stage processStage(int year, int stageSequence, Map<Integer, Rider> riders) throws java.io.IOException, java.net.MalformedURLException {

    String typeOfStanding = "ITE";
     URL stageStanding = new URL("http://www.letour.fr/"+year+"/TDF/LIVE/us/"
                                +(stageSequence==0?"0":stageSequence+"00") +
                                "/classement/"+typeOfStanding+".html");
    InputStream stream = stageStanding.openStream();
    Scanner scanner = new Scanner(stream);
    scanner.useDelimiter("</tr>");
    Stage stage = new Stage();
    stage.setSequence(stageSequence);
    boolean first = true;
    boolean firstStanding = true;
    while (scanner.hasNext()) {
        String entry = scanner.next();
        if (first) {
            first = false;
            Matcher regexMatcher = regexDistance.matcher(entry);
            if (regexMatcher.find()) {
                String distanceString = regexMatcher.group();
                stage.setTotalDistance(Float.parseFloat(distanceString.substring(0, distanceString.length() - 3)));
            }
        }
        if (!first) {
            String[] els = entry.split("/td>");
            if (els.length > 1) { // only the standing-entries have more than one td element
                Integer riderNumber = Integer.parseInt(extractValue(els[2]));

                Rider rider=null;
                if (riders.containsKey(riderNumber)) {
                    rider = riders.get(riderNumber);
                }
                else {
                    rider = new Rider(extractValue(els[1]),riderNumber, extractValue(els[3]));
                    riders.put(riderNumber,rider);
                }
                Standing standing =
                    new Standing(firstStanding ? 1 : (Integer.parseInt(extractValue(els[0]).replace(".", ""))),
                                  rider,extractValue(els[4]),
                                  extractValue(els[5]));
                firstStanding = false;
                stage.getStandings().add(standing);                }
        }
    } //while
    scanner.close();
    return stage;
}

Subsequently, the TR elements need to be broken up in the TD cell elements that contain the rank, rider’s name, their number, the team they ride for and the time for the stage as well as their lag with regard to the winner. I have used a simple split (on /td>) to extract the cells. The final logic for pulling the correct value from the cell is in the method extractValue. Note: this code is not very pretty, and I am not necessarily overly proud of it. On the other hand: it is one-time-use-only code and it is still fairly compact and easy to write and read.

private static String extractValue(String el) {
    String r = el.split("</")[0];
    if (r.lastIndexOf(">") > 0) {
        r = r.substring(r.lastIndexOf(">") + 1);
    }
    return r.split("<")[0];
}

I have created a few domain classes: Rider, Stage, Standing (as well as Tour) that are a business domain like representation of the Tour de France result data. Objects based on these classes are instantiated in the processStage method that is being invoked from the processTour method.

public static void processTour(Tour tour) throws IOException, MalformedURLException {
    if (tour.isPrologue())
      tour.getStages().add(processStage(tour.getYear(),0, tour.getRiders()));

    for (int i=1;i<= tour.getNumberOfStages();i++) {
        tour.getStages().add(processStage(tour.getYear(),i, tour.getRiders()));
    }
}

When I run the TourManager class – a class that create a single Tour object for the Tour de France in 2010 –

public class TourManager {
     List<Tour> tours = new ArrayList<Tour>();
     public TourManager() {
        tours.add(new Tour(2010, 20, true));
        try {
            ProcessTourStandings.processTour(tours.get(0));
        } catch (MalformedURLException e) {
            System.out.println(e.getMessage());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
     public static void main(String[] args) {
        TourManager tm = new TourManager();
        for (Tour tour : tm.getTours()) {
            for (Stage stage : tour.getStages()) {
                System.out.println("================ Stage " + stage.getSequence() + "(" + stage.getTotalDistance() +
                                   " km)");
                for (Standing standing : stage.getStandings()) {
                    if (standing.getRank() < 4) {
                        System.out.println(standing.getRank() + "." + standing.getRider().getName());
                    }
                }
            }
        }
    }

it will print the top 3 in every stage:

Image

Source:http://technology.amis.nl/2011/07/04/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities/