Scraping Google Trends with Mechanize and Hpricot

Sat, Jan 24, 2009

Automation, Code, Ruby, Scraping, Scripts

This is a small Ruby script that fetches Google's top 100 trends for a specific date. By running it over multiple dates, you can find out how many times a keyword occurred between two dates, or which keywords keep appearing on the top 100 list. Very profitable info! Alas, the script is incomplete, and you must fill in the “implement me” methods to get full functionality. In its current state, it should serve as a good starting point for scraping Google Trends.
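
To make the "how many times between two dates" idea concrete, here is a minimal sketch that is separate from the script in this post. It assumes you have already collected each day's keyword list into a hash. The `SAMPLE_TRENDS` data and the `keyword_counts` helper are made up for illustration; they are not real Google Trends output or part of the script.

```ruby
require 'date'

# Hypothetical sample data: date string => keywords seen on that day's list.
# These values are invented for illustration only.
SAMPLE_TRENDS = {
  '2009-01-20' => ['inauguration', 'weather', 'lost'],
  '2009-01-21' => ['inauguration', 'lost'],
  '2009-01-22' => ['weather', 'inauguration'],
  '2009-01-23' => ['oscars']
}

# Count how many daily lists each keyword appeared on
# between two dates (inclusive on both ends).
def keyword_counts(trends, from, to)
  range = Date.parse(from)..Date.parse(to)
  counts = Hash.new(0)
  trends.each do |day, keywords|
    next unless range.cover?(Date.parse(day))
    # uniq guards against a keyword listed twice on the same day
    keywords.uniq.each { |kw| counts[kw] += 1 }
  end
  counts
end

counts = keyword_counts(SAMPLE_TRENDS, '2009-01-20', '2009-01-22')

# Most frequent keywords first, ties broken alphabetically
counts.sort_by { |kw, n| [-n, kw] }.each { |kw, n| puts "#{kw}: #{n}" }
```

Sorting by count gives the keywords that keep showing up; narrowing the date range answers the "between two dates" question.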

On a technical note, it uses mechanize, hpricot, and tempfile (for the cache). A lot of this is just copy & paste programming from the earlier anime scraper.

To grab the gems (skipping rdoc, which takes 10x as long as the gem itself to fetch and install):

sudo gem install mechanize --no-rdoc
sudo gem install hpricot --no-rdoc

#!/usr/bin/env ruby
# biodegradablegeek.com
# public domain
# 
 
require 'rubygems'
require 'hpricot'
require 'tempfile'
require 'yaml' # the cache and state files are stored as YAML
require 'mechanize'
#require 'highline/import'
#HighLine.track_eof = false
 
$mech = WWW::Mechanize.new
$mech.user_agent_alias = 'Mac Safari'
$master = []
 
 
def puts2(txt=''); puts "*** #{txt}"; end
 
 
class Cache
  def initialize
    # Setup physical cache location 
    @path = 'cache'
    Dir.mkdir @path unless File.exist? @path
 
    # key/val = url/filename (of fetched data)
    @datafile = "#{@path}/cache.data"
    @cache = load @datafile
  end
 
  def put key, val
    tf = Tempfile.new('googletrends', @path)
    path = tf.path
    tf.close! # important!
 
    puts2 "Saving to cache (#{path})"
    open(path, 'w') { |f|
      f.write(val)
      @cache[key] = path
    }
 
    save @datafile
  end
 
  def get key
    return nil unless exists?(key) && File.exist?(@cache[key])
    open(@cache[key], 'r') { |f| f.read }
  end
 
  def files
    @cache.values
  end
 
  def first
    @cache.first
  end
 
  def exists? key
    @cache.has_key? key
  end
 
private
  # Load saved cache 
  def load file
    # An empty data file does not parse to a Hash, so fall back to {}
    return File.exist?(file) ? (YAML.load(File.read(file)) || {}) : {}
  end
 
  # Save cache 
  def save path
    open(path, 'w') { |f|
      f.write @cache.to_yaml
    }
  end
end
 
$cache = Cache.new
 
 
def fetch(url)
  body = $mech.get(url).body()
  $cache.put(url, body)
  body
end
 
 
def getPage(url)
  body = $cache.get(url) 
 
  if body.nil?
    puts "Not cached. Fetching from site..."
    body = fetch url 
  end
  body
end
 
 
def loadState
  mf = 'cache/master.data'
  # $master is an Array of Keyword objects, so default to [] rather than {}
  $master = File.exist?(mf) ? YAML.load(File.read(mf)) : []
  $master = [] unless $master.is_a?(Array)
end
 
 
def saveState
  open('cache/master.data', 'w+') { |f|
    f.write $master.to_yaml
  }
end
 
 
def main
  #loadState
 
  # Grab top 100 Google Trends (today)
  #date = Time.now.strftime '%Y-%m-%d'
  date = '2009-01-21'
 
  puts2 "Getting Google's top 100 search trends for #{date}"
  url = "http://www.google.com/trends/hottrends?sa=X&date=#{date}"
  puts2 url
 
  begin
    body = getPage(url)
  rescue WWW::Mechanize::ResponseCodeError
    puts2 "Couldn't fetch URL. Invalid date..?"
    exit 5
  end
 
  puts2 "Fetched page (#{body.size} bytes)"
 
  if body['There is no data on date']
    puts2 'No data available for this date.'
    puts2 'Date might be too old or too early for report, or just invalid'
    exit 3
  end
 
  doc = Hpricot(body)
 
  (doc/"td[@class='hotColumn']/table[@class='Z2_list']//tr").each do |tr|
    td = (tr/:td)
    num = td[0].inner_text.sub('.','').strip
    kw = td[1].inner_text
    url = (td[1]/:a).first[:href]
    Keyword.find_or_new(kw) << Occurance.new(num, date, url)
  end
  puts "Got info on #{$master.size} keywords for #{date}"
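  # Guard against an empty list (e.g. if the page layout changed and no rows matched)
  puts "keyword '#{$master.first.name}' occurred #{$master.first.occurances} times" unless $master.empty?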
end
 
class Occurance
  attr_accessor :pos, :date, :url
  def initialize(pos, date, url)
    @pos = pos
    @date = date
    @url = url
  end
end
 
class Keyword
  attr_accessor :name, :occurances
  def initialize(name)
    @name = name
    @occurances = []
    @position_average = nil
    @count = nil
    $master << self
  end
 
  def self.find_or_new(name)
    x = $master.find { |m| name==m.name }
    x || Keyword.new(name)
  end
 
  def << occurance
    @occurances << occurance
  end
 
  def occured_on? datetime
    raise 'implement me'
  end
 
  def occured_between? datetime
    raise 'implement me'
  end
 
  def occurances datetime=nil
    raise 'implement me' if datetime
    @occurances.size 
  end
 
  def occurances_between datetime
    raise 'implement me'
  end
 
  def pos_latest
    # Position of the most recent occurrence (not its date)
    @occurances.last.pos
  end
 
  def pos_average
    @position_average
  end
 
  def pos_average_between datetime
    raise 'implement me'
  end
end
 
#   Instance= [num, date, url]
#   Keyword=[Instance, Instance, Instance]
#   Methods for keywords: 
#   KW.occured_on? date 
#   KW.occured_between? d1, d2 
#   KW.occurances
#   KW.occurances_between? d1, d2
#   KW.pos_latest
#   KW.pos_average
#   KW.pos_average_between
 
#   KW has been on the top 100 list KW.occurances.size times
#   The #1 keywords for the month of January: Master.sort_by KW.occurances_between? Jan1,Jan31.pos_average_between Jan1,Jan31 
#
#   Top keywords: sort by KW.occurances.size = N keyword was listed the most.
#   Top keywords for date D: Master.sort_by KW.occured_on (x).num
 
main
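
If you want a head start on the "implement me" methods, the following is one possible sketch, not the finished script. It uses hypothetical stand-ins (`Sighting` and `TrackedKeyword`) rather than the `Occurance` and `Keyword` classes above, so it runs on its own. It assumes dates are 'YYYY-MM-DD' strings and positions are strings, as the scraper stores them. The keyword name and example.com URLs are placeholders.

```ruby
require 'date'

# Hypothetical stand-in for the script's Occurance class.
Sighting = Struct.new(:pos, :date, :url)

# Hypothetical stand-in for the script's Keyword class.
class TrackedKeyword
  attr_reader :name, :sightings

  def initialize(name)
    @name = name
    @sightings = []
  end

  def <<(sighting)
    @sightings << sighting
    self
  end

  # True if the keyword was listed on the given date ('YYYY-MM-DD').
  def occurred_on?(date)
    @sightings.any? { |s| s.date == date }
  end

  # Sightings whose date falls between d1 and d2, inclusive.
  def occurrences_between(d1, d2)
    range = Date.parse(d1)..Date.parse(d2)
    @sightings.select { |s| range.cover?(Date.parse(s.date)) }
  end

  # Mean list position across all sightings, or nil if there are none.
  def pos_average
    return nil if @sightings.empty?
    @sightings.inject(0.0) { |sum, s| sum + s.pos.to_i } / @sightings.size
  end
end

# Placeholder data for illustration only
kw = TrackedKeyword.new('example keyword')
kw << Sighting.new('4', '2009-01-20', 'http://example.com/a')
kw << Sighting.new('10', '2009-01-21', 'http://example.com/b')
kw << Sighting.new('1', '2009-01-25', 'http://example.com/c')

puts kw.occurred_on?('2009-01-21')
puts kw.occurrences_between('2009-01-20', '2009-01-22').size
puts kw.pos_average
```

Comparing parsed `Date` objects instead of raw strings keeps the range check correct even if the date format changes later.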



2 Comments For This Post

  1. sct Says:

    Nice post.

    Is there a way to simulate user clicking a button with onclick method?

    Thanks

  2. Sean Says:

    What is most ethical amount to scrape without getting into trouble with IP/domain?
