Scraping Google Trends with Mechanize and Hpricot
Sat, Jan 24, 2009
This is a small Ruby script that fetches the top 100 Google trends for a specific date. Run it over multiple dates and you can find out how many times a keyword occurred between two dates, or which keywords keep appearing on the top 100 list. Very profitable info! Alas, the script is incomplete: you must fill in the “implement me!” methods to get full functionality. In its current state, it should serve as a good starting point for scraping Google Trends.
On a technical note, it uses mechanize, hpricot, and tempfile (for the cache). A lot of this is just copy & paste programming from the earlier anime scraper.
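As a rough sketch of the multi-date idea, the snippet below builds one Hot Trends URL per day in a range and tallies how many days each keyword showed up on. The URL pattern is the one the script uses; the helper names `hot_trends_urls` and `tally_keywords` are made up for this illustration, and the keyword lists are assumed to have been scraped already.

```ruby
require 'date'

# Build one Hot Trends URL per day between two dates (inclusive).
# The URL pattern is copied from the script; the helper name is hypothetical.
def hot_trends_urls(from, to)
  (Date.parse(from)..Date.parse(to)).map do |day|
    "http://www.google.com/trends/hottrends?sa=X&date=#{day.strftime('%Y-%m-%d')}"
  end
end

# Count on how many days each keyword appeared.
# `lists_by_date` maps a date string to that day's list of keywords.
# Returns [keyword, days] pairs, most frequent first (ties sorted by name).
def tally_keywords(lists_by_date)
  counts = Hash.new(0)
  lists_by_date.each_value do |keywords|
    keywords.uniq.each { |kw| counts[kw] += 1 }
  end
  counts.sort_by { |kw, days| [-days, kw] }
end
```

Feeding each URL through the script's cache-aware page fetcher and collecting the keywords per date would give `tally_keywords` its input.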
To grab the gems (skipping rdoc, which takes about 10x as long as the gem itself to fetch and install):
sudo gem install mechanize --no-rdoc
sudo gem install hpricot --no-rdoc
#!/usr/bin/env ruby
# biodegradablegeek.com
# public domain
#
require 'rubygems'
require 'yaml' # the cache and master list are stored as YAML
require 'hpricot'
require 'tempfile'
require 'mechanize'
#require 'highline/import'
#HighLine.track_eof = false
$mech = WWW::Mechanize.new
$mech.user_agent_alias = 'Mac Safari'
$master = []
def puts2(txt=''); puts "*** #{txt}"; end
class Cache
  def initialize
    # Set up the physical cache location
    @path = 'cache'
    Dir.mkdir @path unless File.exist? @path
    # key/val = url/filename (of fetched data)
    @datafile = "#{@path}/cache.data"
    @cache = load @datafile
  end

  def put key, val
    tf = Tempfile.new('googletrends', @path)
    path = tf.path
    tf.close! # important! closes and deletes the tempfile; only its unique name is reused
    puts2 "Saving to cache (#{path})"
    open(path, 'w') { |f|
      f.write(val)
      @cache[key] = path
    }
    save @datafile
  end

  def get key
    return nil unless exists?(key) && File.exist?(@cache[key])
    open(@cache[key], 'r') { |f| f.read }
  end

  def files
    @cache.values
  end

  def first
    @cache.first
  end

  def exists? key
    @cache.has_key? key
  end

  private

  # Load saved cache
  def load file
    return File.exist?(file) ? YAML.load(File.read(file)) : {}
  end

  # Save cache
  def save path
    open(path, 'w') { |f|
      f.write @cache.to_yaml
    }
  end
end
$cache = Cache.new

# Fetch a URL from the site and store the body in the cache
def fetch(url)
  body = $mech.get(url).body()
  $cache.put(url, body)
  body
end

# Return the page body, hitting the site only on a cache miss
def getPage(url)
  body = $cache.get(url)
  if body.nil?
    puts "Not cached. Fetching from site..."
    body = fetch url
  end
  body
end

def loadState
  mf = 'cache/master.data'
  # $master is a list of Keywords, so fall back to an empty array
  $master = File.exist?(mf) ? YAML.load(File.read(mf)) : []
  $master = [] if $master==false
end

def saveState
  open('cache/master.data', 'w+') { |f|
    f.write $master.to_yaml
  }
end
def main
  #loadState
  # Grab top 100 Google Trends (today)
  #date = Time.now.strftime '%Y-%m-%d'
  date = '2009-01-21'
  puts2 "Getting Google's top 100 search trends for #{date}"
  url = "http://www.google.com/trends/hottrends?sa=X&date=#{date}"
  puts2 url
  begin
    body = getPage(url)
  rescue WWW::Mechanize::ResponseCodeError
    puts2 "Couldn't fetch URL. Invalid date..?"
    exit 5
  end
  puts2 "Fetched page (#{body.size} bytes)"
  if body['There is no data on date']
    puts2 'No data available for this date.'
    puts2 'Date might be too old or too early for report, or just invalid'
    exit 3
  end
  doc = Hpricot(body)
  # Each row of the trends table holds a rank cell and a keyword cell
  (doc/"td[@class='hotColumn']/table[@class='Z2_list']//tr").each do |tr|
    td = (tr/:td)
    num = td[0].inner_text.sub('.','').strip
    kw = td[1].inner_text
    link = (td[1]/:a).first[:href]
    Keyword.find_or_new(kw) << Occurance.new(num, date, link)
  end
  puts "Got info on #{$master.size} keywords for #{date}"
  puts "keyword '#{$master.first.name}' occurred #{$master.first.occurances} times"
end
# One appearance of a keyword on the list: rank, date and link
class Occurance
  attr_accessor :pos, :date, :url

  def initialize(pos, date, url)
    @pos = pos
    @date = date
    @url = url
  end
end

class Keyword
  attr_accessor :name
  attr_writer :occurances # the reader is defined below and returns a count

  def initialize(name)
    @name = name
    @occurances = []
    @position_average = nil
    @count = nil
    $master << self
  end

  def self.find_or_new(name)
    x = $master.find { |m| name==m.name }
    x || Keyword.new(name)
  end

  def << occurance
    @occurances << occurance
  end

  def occured_on? datetime
    raise 'implement me'
  end

  def occured_between? d1, d2
    raise 'implement me'
  end

  def occurances datetime=nil
    raise 'implement me' if datetime
    @occurances.size
  end

  def occurances_between d1, d2
    raise 'implement me'
  end

  # Position (rank) of the most recently recorded occurrence
  def pos_latest
    @occurances.last.pos
  end

  def pos_average
    @position_average
  end

  def pos_average_between d1, d2
    raise 'implement me'
  end
end
# Instance = [num, date, url]
# Keyword = [Instance, Instance, Instance]
# Methods for keywords:
# KW.occured_on? date
# KW.occured_between? d1, d2
# KW.occurances
# KW.occurances_between d1, d2
# KW.pos_latest
# KW.pos_average
# KW.pos_average_between d1, d2
# KW has been on the top 100 list KW.occurances times
# The #1 keywords for the month of January: Master.sort_by KW.occurances_between(Jan1, Jan31), then KW.pos_average_between(Jan1, Jan31)
#
# Top keywords: sort by KW.occurances; the keyword with the highest count was listed the most.
# Top keywords for date D: Master.sort_by KW.occured_on(D).num
main
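For anyone wondering where to start on the "implement me" methods, here is one possible sketch, not the finished script. It assumes dates are stored as 'YYYY-MM-DD' strings and positions as numeric strings, the way `main` records them. `Sighting` and `KeywordSketch` are hypothetical stand-ins for `Occurance` and `Keyword`, and the method names use the corrected spelling, so they would need renaming to drop into the script.

```ruby
require 'date'

# Hypothetical stand-in for the script's Occurance class
Sighting = Struct.new(:pos, :date, :url)

# Hypothetical stand-in for the script's Keyword class
class KeywordSketch
  attr_reader :name

  def initialize(name)
    @name = name
    @sightings = []
  end

  # Record a sighting; returns self so calls can be chained
  def <<(sighting)
    @sightings << sighting
    self
  end

  # Was the keyword on the list on the given day?
  def occurred_on?(date)
    @sightings.any? { |s| to_date(s.date) == to_date(date) }
  end

  # All sightings from `from` to `to`, inclusive
  def occurrences_between(from, to)
    range = to_date(from)..to_date(to)
    @sightings.select { |s| range.cover?(to_date(s.date)) }
  end

  def occurred_between?(from, to)
    !occurrences_between(from, to).empty?
  end

  # Mean list position within the range, or nil if never listed
  def pos_average_between(from, to)
    hits = occurrences_between(from, to)
    return nil if hits.empty?
    hits.sum { |s| s.pos.to_i } / hits.size.to_f
  end

  private

  # Accept either a Date or a 'YYYY-MM-DD' string
  def to_date(value)
    value.is_a?(Date) ? value : Date.parse(value.to_s)
  end
end
```

Usage would look something like `kw << Sighting.new('3', '2009-01-20', url)` followed by `kw.pos_average_between('2009-01-01', '2009-01-31')`. Parsing the dates on every call is wasteful, but it keeps the sketch short.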
Tags: Automation, Code example, making monies, programming, public domain, Ruby, Scraping
February 19th, 2009 at 12:55 pm
Nice post.
Is there a way to simulate user clicking a button with onclick method?
Thanks
April 30th, 2009 at 2:21 am
What is most ethical amount to scrape without getting into trouble with IP/domain?