I solved this problem by writing a small program (called extract.js) in Node.js to scrape the text. I used this page to help me: http://blog.miguelgrinberg.com/post/easy-web-scraping-with-nodejs
Each HTML page contains multiple book pages, so if you increment the page parameter in the URL by only 1 you end up scraping duplicate book pages unless you are careful (this was the part I was particularly stuck on). I got around it by using a jQuery-style selector to select only the individual book page specified in the URL and ignore the other book pages present in the HTML. That way I could quickly build a text file of the URLs for every single page, in order, using a spreadsheet program (because the increment is always 1).
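For reference, each line of the input file is just the URL for a single page, something along these lines (the book id here is only a placeholder):

http://books.google.com/books?id=YOUR_BOOK_ID&pg=PA1&output=text
http://books.google.com/books?id=YOUR_BOOK_ID&pg=PA2&output=text

The pg and output=text parameters are the ones the script below relies on.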
So far I have successfully scraped the first two volumes; five more to go! The code is given below; it may serve as a useful starting point for scraping other Google Books.
// Usage: node extract.js input output
// where input (mandatory) is the text file containing your list of urls
// and output (optional) is the directory where the output files will be saved
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
// Read the command line parameters
var input = process.argv[2];
var output = process.argv[3];
if (!input) {
console.log("Missing input parameter");
return;
}
// Read the url input file, each url is on a new line
var urls = fs.readFileSync(input).toString().split('\n');
// Remove anything from the list that is not a URL (e.g. blank lines)
// Iterate backwards so that splicing does not skip the entry after a removed one
for (var i = urls.length - 1; i >= 0; i--) {
if (urls[i].slice(0, 4) != 'http') {
urls.splice(i, 1);
}
}
// Iterate through the urls
for (var i = 0; i < urls.length; i++) {
var url = urls[i];
// request function is asynchronous, hence requirement for self-executing function
// Cannot guarantee the execution order of the callback for each url, therefore save results to separate files
request(url, ( function(url) {
return function(err, resp, body) {
if (err)
throw err;
// Extract the pg parameter (book page) from the url
// We will use this to only extract the text from this book page
// because a retrieved html page contains multiple book pages
var pg = url.slice(url.indexOf('pg=') + 3, url.indexOf('&output=text'));
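// e.g. for a url ending in '...&pg=PA7&output=text', pg would be 'PA7'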
//
// Define the filename
//
var number = pg.slice(2, pg.length);
var zeroes = 4 - number.length;
// Insert leading zeroes
for (var j = 0; j < zeroes; j++) {
number = '0' + number;
}
var filename = pg.slice(0, 2) + number + '.txt';
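// e.g. pg 'PA7' becomes the filename 'PA0007.txt'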
// Add path to filename
if (output) {
if (!fs.existsSync(output))
fs.mkdirSync(output);
filename = output + '/' + filename;
}
// Delete the file if it already exists
if (fs.existsSync(filename))
fs.unlinkSync(filename);
// Load the HTML so it can be queried with cheerio's jQuery-style selectors
var $ = cheerio.load(body);
// Select the book page
// Pages are contained within 'div' elements (where class='flow'),
// each of which contains an 'a' element where id is equal to the page
// Use ^ to match pages because sometimes page ids can have a trailing hyphen and extra characters
var page = $('div.flow:has(a[id=' + pg + ']), div.flow:has(a[id^=' + pg + '-])');
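// (Based on the structure described above, the markup being matched should look roughly like
// <div class="flow"> ... <a id="PA7"></a> ... page text ... </div>)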
//
// Extract and save the text of the book page to the file
//
var hasText = false;
// Text is in 'gtxt_body', 'gtxt_column' and 'gtxt_footnote'
page.find('div.gtxt_body, div.gtxt_column, div.gtxt_footnote').each(function() {
// Wrap the raw element with $() so cheerio's find() and text() can be used on it
$(this).find('p.gtxt_body, p.gtxt_column, p.gtxt_footnote').each(function() {
hasText = true;
fs.appendFileSync(filename, $(this).text());
fs.appendFileSync(filename, '\n\n');
});
});
// Log progress
if (hasText) {
console.log("Retrieved and saved page: " + pg);
}
else {
console.log("Skipping page: " + pg);
}
}
} )(url));
}
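To try it out, install the two dependencies and then point the script at your URL list (urls.txt and volume1 below are just example names):

npm install request cheerio
node extract.js urls.txt volume1

Each retrieved book page is written to its own .txt file in the output directory, named after its pg parameter.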