Scraping Google Trends with Mechanize and Hpricot
Sat, Jan 24, 2009
This is a small Ruby script that fetches the top 100 trends of the day for a specific date. Run it across multiple dates and you can find out how many times a keyword occurred between two dates, or which keywords keep showing up on the top 100 list. Very profitable info! Alas, the script is incomplete: you must implement the “implement me!” methods to get full functionality. In its current state, it should serve as a good starting point for scraping Google Trends.
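Searching multiple dates boils down to building one URL per day and tallying which keywords show up. Here is a rough sketch of that idea. The helper names trend_urls and keyword_counts are made up for illustration (they are not part of the script), and the URL pattern is simply the one the script uses for a single date, so treat it as an assumption that it works for every day in a range.

```ruby
require 'date'

# Hypothetical helper: one Hot Trends URL per day between two dates (inclusive).
# The URL pattern is copied from the script's single-date request.
def trend_urls(from, to)
  (Date.parse(from)..Date.parse(to)).map do |day|
    "http://www.google.com/trends/hottrends?sa=X&date=#{day.strftime('%Y-%m-%d')}"
  end
end

# Hypothetical helper: counts on how many days each keyword was listed.
# keywords_by_date maps a date string to the keywords scraped for that day.
def keyword_counts(keywords_by_date)
  counts = Hash.new(0)
  keywords_by_date.each_value do |keywords|
    keywords.uniq.each { |kw| counts[kw] += 1 }
  end
  counts
end
```

Sorting the result of keyword_counts by value would surface the keywords that keep reappearing on the list.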
On a technical note, it uses Mechanize, Hpricot, and Tempfile (for the cache). A lot of this is just copy & paste programming from the earlier anime scraper.
To grab the gems (rdoc takes 10x as long as the gem to fetch and install):
sudo gem install mechanize --no-rdoc
sudo gem install hpricot --no-rdoc
#!/usr/bin/env ruby
# biodegradablegeek.com
# public domain
#
require 'rubygems'
require 'yaml' # used for the cache index and the saved state
require 'hpricot'
require 'tempfile'
require 'mechanize'
#require 'highline/import'
#HighLine.track_eof = false

$mech = WWW::Mechanize.new
$mech.user_agent_alias = 'Mac Safari'
$master = []

def puts2(txt=''); puts "*** #{txt}"; end

class Cache
  def initialize
    # Setup physical cache location
    @path = 'cache'
    Dir.mkdir @path unless File.exists? @path

    # key/val = url/filename (of fetched data)
    @datafile = "#{@path}/cache.data"
    @cache = load @datafile
  end

  def put key, val
    tf = Tempfile.new('googletrends', @path)
    path = tf.path
    tf.close! # important!
    puts2 "Saving to cache (#{path})"
    open(path, 'w') { |f|
      f.write(val)
      @cache[key] = path
    }
    save @datafile
  end

  def get key
    return nil unless exists?(key) && File.exists?(@cache[key])
    open(@cache[key], 'r') { |f| f.read }
  end

  def files
    @cache.values
  end

  def first
    @cache.first
  end

  def exists? key
    @cache.has_key? key
  end

  private

  # Load saved cache
  def load file
    return File.exists?(file) ? YAML.load(open(file).read) : {}
  end

  # Save cache
  def save path
    open(path, 'w') { |f| f.write @cache.to_yaml }
  end
end

$cache = Cache.new

def fetch(url)
  body = $mech.get(url).body()
  $cache.put(url, body)
  body
end

def getPage(url)
  body = $cache.get(url)
  if body.nil?
    puts "Not cached. Fetching from site..."
    body = fetch url
  end
  body
end

def loadState
  mf = 'cache/master.data'
  $master = File.exists?(mf) ? YAML.load(open(mf).read) : {}
  $master = {} if $master==false
end

def saveState
  open('cache/master.data', 'w+') { |f| f.write $master.to_yaml }
end

def main
  #loadState

  # Grab top 100 Google Trends (today)
  #date = Time.now.strftime '%Y-%m-%d'
  date = '2009-01-21'
  puts2 "Getting Google's top 100 search trends for #{date}"

  url = "http://www.google.com/trends/hottrends?sa=X&date=#{date}"
  puts2 url

  begin
    body = getPage(url)
  rescue WWW::Mechanize::ResponseCodeError
    puts2 "Couldn't fetch URL. Invalid date..?"
    exit 5
  end

  puts2 "Fetched page (#{body.size} bytes)"

  if body['There is no data on date']
    puts2 'No data available for this date.'
    puts2 'Date might be too old or too early for report, or just invalid'
    exit 3
  end

  doc = Hpricot(body)
  (doc/"td[@class='hotColumn']/table[@class='Z2_list']//tr").each do |tr|
    td = (tr/:td)
    num = td[0].inner_text.sub('.','').strip
    kw = td[1].inner_text
    url = (td[1]/:a).first[:href]
    Keyword.find_or_new(kw) << Occurance.new(num, date, url)
  end

  puts "Got info on #{$master.size} keywords for #{date}"
  puts "keyword '#{$master.first.name}' occurred #{$master.first.occurances} times"
end

class Occurance
  attr_accessor :pos, :date, :url

  def initialize(pos, date, url)
    @pos = pos
    @date = date
    @url = url
  end
end

class Keyword
  attr_accessor :name, :occurances

  def initialize(name)
    @name = name
    @occurances = []
    @position_average = nil
    @count = nil
    $master << self
  end

  def self.find_or_new(name)
    x = $master.find { |m| name==m.name }
    x || Keyword.new(name)
  end

  def << occurance
    @occurances << occurance
  end

  def occured_on? datetime
    raise 'implement me'
  end

  def occured_between? datetime
    raise 'implement me'
  end

  def occurances datetime=nil
    raise 'implement me' if datetime
    @occurances.size
  end

  def occurances_between datetime
    raise 'implement me'
  end

  def pos_latest
    @occurances.last.date
  end

  def pos_average
    @position_average
  end

  def pos_average_between datetime
    raise 'implement me'
  end
end

# Instance= [num, date, url]
# Keyword=[Instance, Instance, Instance]
# Methods for keywords:
# KW.occured_on? date
# KW.occured_between? d1, d2
# KW.occurances
# KW.occurances_between? d1, d2
# KW.pos_latest
# KW.pos_average
# KW.pos_average_between
# KW has been on the top 100 list KW.occurances.size times
# The #1 keywords for the month of January:
#   Master.sort_by KW.occurances_between? Jan1,Jan31.pos_average_between Jan1,Jan31
#
# Top keywords: sort by KW.occurances.size = N keyword was listed the most.
# Top keywords for date D: Master.sort_by KW.occured_on (x).num

main
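One way the “implement me” date methods could be filled in is sketched below. This is only a guess at the intended behavior, not the finished script: it assumes dates are stored as 'YYYY-MM-DD' strings (as main does), that positions are the numeric strings scraped from the page, and that the range methods take a start and an end date instead of the single datetime argument in the stubs. The class is a trimmed-down stand-in for Keyword (no $master bookkeeping), and the private to_date helper is made up for the example.

```ruby
require 'date'

# Stand-in for the script's Occurance class (spelling kept to match).
Occurance = Struct.new(:pos, :date, :url)

class Keyword
  attr_reader :name

  def initialize(name)
    @name = name
    @occurances = []
  end

  # Returns self so occurrences can be chained with <<.
  def <<(occurance)
    @occurances << occurance
    self
  end

  # True if the keyword was listed on the given day (string or Date).
  def occured_on?(day)
    !occurances_between(day, day).empty?
  end

  # Assumption: two arguments (start, end) rather than one.
  def occured_between?(from, to)
    !occurances_between(from, to).empty?
  end

  # All occurrences whose date falls inside the inclusive range.
  def occurances_between(from, to)
    range = to_date(from)..to_date(to)
    @occurances.select { |o| range.cover?(to_date(o.date)) }
  end

  # Average list position over the range, or nil when there are no hits.
  def pos_average_between(from, to)
    hits = occurances_between(from, to)
    return nil if hits.empty?
    hits.inject(0) { |sum, o| sum + o.pos.to_i }.to_f / hits.size
  end

  private

  # Hypothetical helper: accepts a Date or anything Date.parse understands.
  def to_date(value)
    value.is_a?(Date) ? value : Date.parse(value.to_s)
  end
end
```

With these in place, something like kw.occurances_between('2009-01-01', '2009-01-31').size gives the number of January days a keyword made the list, which is the figure the sorting ideas in the script's closing comments rely on.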
Tags: Automation, Code example, making monies, programming, public domain, Ruby, Scraping
February 19th, 2009 at 12:55 pm
Nice post.
Is there a way to simulate user clicking a button with onclick method?
Thanks
April 30th, 2009 at 2:21 am
What is the most ethical amount to scrape without getting into trouble with IP/domain?