<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Biodegradable Geek &#187; Scraping</title>
	<atom:link href="http://biodegradablegeek.com/category/scraping/feed/" rel="self" type="application/rss+xml" />
	<link>http://biodegradablegeek.com</link>
	<description></description>
	<lastBuildDate>Tue, 22 Jun 2010 21:52:41 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=abc</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Scraping Google Trends with Mechanize and Hpricot</title>
		<link>http://biodegradablegeek.com/2009/01/scraping-google-trends-with-mechanize-and-hpricot/</link>
		<comments>http://biodegradablegeek.com/2009/01/scraping-google-trends-with-mechanize-and-hpricot/#comments</comments>
		<pubDate>Sat, 24 Jan 2009 06:53:06 +0000</pubDate>
		<dc:creator>Isam</dc:creator>
				<category><![CDATA[Automation]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[Scripts]]></category>
		<category><![CDATA[Code example]]></category>
		<category><![CDATA[making monies]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[public domain]]></category>

		<guid isPermaLink="false">http://biodegradablegeek.com/?p=313</guid>
		<description><![CDATA[This is a small Ruby script that fetches the 100 trends of the day for a specific date. If multiple dates are searched, one can find out how many times a keyword occurred between two dates, or just find out what keywords are constantly appearing on the top 100 list. Very profitable info! but alas, [...]]]></description>
			<content:encoded><![CDATA[<p>This is a small Ruby script that fetches the top 100 Google trends for a specific date. By searching multiple dates, you can find out how many times a keyword occurred between two dates, or which keywords keep appearing on the top 100 list. <strong>Very profitable info!</strong> Alas, the script is incomplete: you must fill in the &#8220;implement me!&#8221; methods to get full functionality. In its current state, it should serve as a good starting point for scraping Google Trends.</p>
<p>On a technical note, it uses Mechanize, Hpricot, and Tempfile (for the cache). A lot of this is just <a href="http://en.wikipedia.org/wiki/Copy_and_paste_programming">copy &amp; paste programming</a> from the <a href="http://biodegradablegeek.com/2009/01/animecrazy-scraper-example-using-hpricot-mechanize/">earlier anime scraper</a>.</p>
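<p>For comparison, here is a minimal sketch of a simpler URL-keyed page cache. It is not what the script does: instead of keeping a YAML index that maps each URL to a tempfile, it derives the filename from a SHA1 digest of the URL. The <code>PageCache</code> class and its method names are hypothetical, and the example URL is a placeholder.</p>

```ruby
require 'digest/sha1'
require 'fileutils'
require 'tmpdir'

# Hypothetical disk cache: one file per URL, named after the URL's SHA1 digest.
class PageCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached body, or nil if the URL has not been stored yet.
  def get(url)
    path = path_for(url)
    File.exist?(path) ? File.read(path) : nil
  end

  # Stores the body under the URL and returns the body.
  def put(url, body)
    File.write(path_for(url), body)
    body
  end

  # Returns the cached body, or runs the block to fetch and store it.
  def fetch(url)
    get(url) || put(url, yield(url))
  end

  private

  def path_for(url)
    File.join(@dir, Digest::SHA1.hexdigest(url))
  end
end

# Usage: the second fetch is served from disk, so its block never runs.
Dir.mktmpdir do |dir|
  cache = PageCache.new(dir)
  page  = cache.fetch('http://example.com/trends') { |url| "body of #{url}" }
  again = cache.fetch('http://example.com/trends') { raise 'should not refetch' }
  puts page == again
end
```

<p>The trade-off is that the digest filenames are opaque, whereas the script's YAML index can be read by hand to see which URL each file holds.</p>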
<p>To grab the gems <em>(rdoc takes 10x as long as the gem to fetch and install)</em>:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #c20cb9; font-weight: bold;">sudo</span> gem <span style="color: #c20cb9; font-weight: bold;">install</span> mechanize <span style="color: #660033;">--no-rdoc</span>
<span style="color: #c20cb9; font-weight: bold;">sudo</span> gem <span style="color: #c20cb9; font-weight: bold;">install</span> hpricot <span style="color: #660033;">--no-rdoc</span></pre></div></div>


<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;">#!/usr/bin/env ruby</span>
<span style="color:#008000; font-style:italic;"># biodegradablegeek.com</span>
<span style="color:#008000; font-style:italic;"># public domain</span>
<span style="color:#008000; font-style:italic;"># </span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'rubygems'</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'hpricot'</span>
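<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'tempfile'</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'yaml'</span>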
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'mechanize'</span>
<span style="color:#008000; font-style:italic;">#require 'highline/import'</span>
<span style="color:#008000; font-style:italic;">#HighLine.track_eof = false</span>
&nbsp;
<span style="color:#ff6633; font-weight:bold;">$mech</span> = <span style="color:#6666ff; font-weight:bold;">WWW::Mechanize</span>.<span style="color:#9900CC;">new</span>
<span style="color:#ff6633; font-weight:bold;">$mech</span>.<span style="color:#9900CC;">user_agent_alias</span> = <span style="color:#996600;">'Mac Safari'</span>
<span style="color:#ff6633; font-weight:bold;">$master</span> = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span>
&nbsp;
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> puts2<span style="color:#006600; font-weight:bold;">&#40;</span>txt=<span style="color:#996600;">''</span><span style="color:#006600; font-weight:bold;">&#41;</span>; <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;*** #{txt}&quot;</span>; <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
&nbsp;
<span style="color:#9966CC; font-weight:bold;">class</span> Cache
  <span style="color:#9966CC; font-weight:bold;">def</span> initialize
    <span style="color:#008000; font-style:italic;"># Setup physical cache location </span>
    <span style="color:#0066ff; font-weight:bold;">@path</span> = <span style="color:#996600;">'cache'</span>
    <span style="color:#CC00FF; font-weight:bold;">Dir</span>.<span style="color:#9900CC;">mkdir</span> <span style="color:#0066ff; font-weight:bold;">@path</span> <span style="color:#9966CC; font-weight:bold;">unless</span> <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">exists</span>? <span style="color:#0066ff; font-weight:bold;">@path</span>
&nbsp;
    <span style="color:#008000; font-style:italic;"># key/val = url/filename (of fetched data)</span>
    <span style="color:#0066ff; font-weight:bold;">@datafile</span> = <span style="color:#996600;">&quot;#{@path}/cache.data&quot;</span>
    <span style="color:#0066ff; font-weight:bold;">@cache</span> = <span style="color:#CC0066; font-weight:bold;">load</span> <span style="color:#0066ff; font-weight:bold;">@datafile</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> put key, val
    tf = <span style="color:#CC00FF; font-weight:bold;">Tempfile</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'googletrends'</span>, <span style="color:#0066ff; font-weight:bold;">@path</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    path = tf.<span style="color:#9900CC;">path</span>
    tf.<span style="color:#9900CC;">close</span>! <span style="color:#008000; font-style:italic;"># important!</span>
&nbsp;
    puts2 <span style="color:#996600;">&quot;Saving to cache (#{path})&quot;</span>
    <span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>path, <span style="color:#996600;">'w'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span>
      f.<span style="color:#9900CC;">write</span><span style="color:#006600; font-weight:bold;">&#40;</span>val<span style="color:#006600; font-weight:bold;">&#41;</span>
      <span style="color:#0066ff; font-weight:bold;">@cache</span><span style="color:#006600; font-weight:bold;">&#91;</span>key<span style="color:#006600; font-weight:bold;">&#93;</span> = path
    <span style="color:#006600; font-weight:bold;">&#125;</span>
&nbsp;
    save <span style="color:#0066ff; font-weight:bold;">@datafile</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> get key
    <span style="color:#0000FF; font-weight:bold;">return</span> <span style="color:#0000FF; font-weight:bold;">nil</span> <span style="color:#9966CC; font-weight:bold;">unless</span> exists?<span style="color:#006600; font-weight:bold;">&#40;</span>key<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&amp;&amp;</span> <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">exists</span>?<span style="color:#006600; font-weight:bold;">&#40;</span>@cache<span style="color:#006600; font-weight:bold;">&#91;</span>key<span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>@cache<span style="color:#006600; font-weight:bold;">&#91;</span>key<span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#996600;">'r'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span> f.<span style="color:#9900CC;">read</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> files
    <span style="color:#0066ff; font-weight:bold;">@cache</span>.<span style="color:#9900CC;">values</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> first
    <span style="color:#0066ff; font-weight:bold;">@cache</span>.<span style="color:#9900CC;">first</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> exists? key
    <span style="color:#0066ff; font-weight:bold;">@cache</span>.<span style="color:#9900CC;">has_key</span>? key
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
private
  <span style="color:#008000; font-style:italic;"># Load saved cache </span>
  <span style="color:#9966CC; font-weight:bold;">def</span> <span style="color:#CC0066; font-weight:bold;">load</span> file
    <span style="color:#0000FF; font-weight:bold;">return</span> <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">exists</span>?<span style="color:#006600; font-weight:bold;">&#40;</span>file<span style="color:#006600; font-weight:bold;">&#41;</span> ? <span style="color:#CC00FF; font-weight:bold;">YAML</span>.<span style="color:#CC0066; font-weight:bold;">load</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>file<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">read</span><span style="color:#006600; font-weight:bold;">&#41;</span> : <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">&#125;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#008000; font-style:italic;"># Save cache </span>
  <span style="color:#9966CC; font-weight:bold;">def</span> save path
    <span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>path, <span style="color:#996600;">'w'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span>
      f.<span style="color:#9900CC;">write</span> <span style="color:#0066ff; font-weight:bold;">@cache</span>.<span style="color:#9900CC;">to_yaml</span>
    <span style="color:#006600; font-weight:bold;">&#125;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#ff6633; font-weight:bold;">$cache</span> = Cache.<span style="color:#9900CC;">new</span>
&nbsp;
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> fetch<span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span>
  body = <span style="color:#ff6633; font-weight:bold;">$mech</span>.<span style="color:#9900CC;">get</span><span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">body</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#ff6633; font-weight:bold;">$cache</span>.<span style="color:#9900CC;">put</span><span style="color:#006600; font-weight:bold;">&#40;</span>url, body<span style="color:#006600; font-weight:bold;">&#41;</span>
  body
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> getPage<span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span>
  body = <span style="color:#ff6633; font-weight:bold;">$cache</span>.<span style="color:#9900CC;">get</span><span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span> 
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">if</span> body.<span style="color:#0000FF; font-weight:bold;">nil</span>?
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Not cached. Fetching from site...&quot;</span>
    body = fetch url 
  <span style="color:#9966CC; font-weight:bold;">end</span>
  body
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> loadState
  mf = <span style="color:#996600;">'cache/master.data'</span>
  <span style="color:#ff6633; font-weight:bold;">$master</span> = <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">exists</span>?<span style="color:#006600; font-weight:bold;">&#40;</span>mf<span style="color:#006600; font-weight:bold;">&#41;</span> ? <span style="color:#CC00FF; font-weight:bold;">YAML</span>.<span style="color:#CC0066; font-weight:bold;">load</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>mf<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">read</span><span style="color:#006600; font-weight:bold;">&#41;</span> : <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  <span style="color:#ff6633; font-weight:bold;">$master</span> = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#9966CC; font-weight:bold;">if</span> <span style="color:#ff6633; font-weight:bold;">$master</span>==<span style="color:#0000FF; font-weight:bold;">false</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> saveState
  <span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'cache/master.data'</span>, <span style="color:#996600;">'w+'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span>
    f.<span style="color:#9900CC;">write</span> <span style="color:#ff6633; font-weight:bold;">$master</span>.<span style="color:#9900CC;">to_yaml</span>
  <span style="color:#006600; font-weight:bold;">&#125;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> main
  <span style="color:#008000; font-style:italic;">#loadState</span>
&nbsp;
  <span style="color:#008000; font-style:italic;"># Grab top 100 Google Trends (today)</span>
  <span style="color:#008000; font-style:italic;">#date = Time.now.strftime '%Y-%m-%d'</span>
  date = <span style="color:#996600;">'2009-01-21'</span>
&nbsp;
  puts2 <span style="color:#996600;">&quot;Getting Google's top 100 search trends for #{date}&quot;</span>
  url = <span style="color:#996600;">&quot;http://www.google.com/trends/hottrends?sa=X&amp;date=#{date}&quot;</span>
  puts2 url
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">begin</span>
    body = getPage<span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#9966CC; font-weight:bold;">rescue</span> <span style="color:#6666ff; font-weight:bold;">WWW::Mechanize::ResponseCodeError</span>
    puts2 <span style="color:#996600;">&quot;Couldn't fetch URL. Invalid date..?&quot;</span>
    <span style="color:#CC0066; font-weight:bold;">exit</span> <span style="color:#006666;">5</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  puts2 <span style="color:#996600;">&quot;Fetched page (#{body.size} bytes)&quot;</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">if</span> body<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'There is no data on date'</span><span style="color:#006600; font-weight:bold;">&#93;</span>
    puts2 <span style="color:#996600;">'No data available for this date.'</span>
    puts2 <span style="color:#996600;">'Date might be too old or too early for report, or just invalid'</span>
    <span style="color:#CC0066; font-weight:bold;">exit</span> <span style="color:#006666;">3</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  doc = Hpricot<span style="color:#006600; font-weight:bold;">&#40;</span>body<span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
  <span style="color:#006600; font-weight:bold;">&#40;</span>doc<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;td[@class='hotColumn']/table[@class='Z2_list']//tr&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>tr<span style="color:#006600; font-weight:bold;">|</span>
    td = <span style="color:#006600; font-weight:bold;">&#40;</span>tr<span style="color:#006600; font-weight:bold;">/</span>:td<span style="color:#006600; font-weight:bold;">&#41;</span>
    num = td<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#93;</span>.<span style="color:#9900CC;">inner_text</span>.<span style="color:#CC0066; font-weight:bold;">sub</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'.'</span>,<span style="color:#996600;">''</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">strip</span>
    kw = td<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#93;</span>.<span style="color:#9900CC;">inner_text</span>
    url = <span style="color:#006600; font-weight:bold;">&#40;</span>td<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">/</span>:a<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">first</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#ff3333; font-weight:bold;">:href</span><span style="color:#006600; font-weight:bold;">&#93;</span>
    Keyword.<span style="color:#9900CC;">find_or_new</span><span style="color:#006600; font-weight:bold;">&#40;</span>kw<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> Occurance.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>num, date, url<span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
  <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Got info on #{$master.size} keywords for #{date}&quot;</span>
  <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;keyword '#{$master.first.name}' occurred #{$master.first.occurances} times&quot;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">class</span> Occurance
  attr_accessor <span style="color:#ff3333; font-weight:bold;">:pos</span>, <span style="color:#ff3333; font-weight:bold;">:date</span>, <span style="color:#ff3333; font-weight:bold;">:url</span>
  <span style="color:#9966CC; font-weight:bold;">def</span> initialize<span style="color:#006600; font-weight:bold;">&#40;</span>pos, date, url<span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#0066ff; font-weight:bold;">@pos</span> = pos
    <span style="color:#0066ff; font-weight:bold;">@date</span> = date
    <span style="color:#0066ff; font-weight:bold;">@url</span> = url
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">class</span> Keyword
  attr_accessor <span style="color:#ff3333; font-weight:bold;">:name</span>, <span style="color:#ff3333; font-weight:bold;">:occurances</span>
  <span style="color:#9966CC; font-weight:bold;">def</span> initialize<span style="color:#006600; font-weight:bold;">&#40;</span>name<span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#0066ff; font-weight:bold;">@name</span> = name
    <span style="color:#0066ff; font-weight:bold;">@occurances</span> = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span>
    <span style="color:#0066ff; font-weight:bold;">@position_average</span> = <span style="color:#0000FF; font-weight:bold;">nil</span>
    <span style="color:#0066ff; font-weight:bold;">@count</span> = <span style="color:#0000FF; font-weight:bold;">nil</span>
    <span style="color:#ff6633; font-weight:bold;">$master</span> <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> <span style="color:#0000FF; font-weight:bold;">self</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> <span style="color:#0000FF; font-weight:bold;">self</span>.<span style="color:#9900CC;">find_or_new</span><span style="color:#006600; font-weight:bold;">&#40;</span>name<span style="color:#006600; font-weight:bold;">&#41;</span>
    x = <span style="color:#ff6633; font-weight:bold;">$master</span>.<span style="color:#9900CC;">find</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>m<span style="color:#006600; font-weight:bold;">|</span> name==m.<span style="color:#9900CC;">name</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
    x <span style="color:#006600; font-weight:bold;">||</span> Keyword.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>name<span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> occurance
    <span style="color:#0066ff; font-weight:bold;">@occurances</span> <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> occurance
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> occured_on? datetime
    <span style="color:#CC0066; font-weight:bold;">raise</span> <span style="color:#996600;">'implement me'</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> occured_between? datetime
    <span style="color:#CC0066; font-weight:bold;">raise</span> <span style="color:#996600;">'implement me'</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> occurances datetime=<span style="color:#0000FF; font-weight:bold;">nil</span>
    <span style="color:#CC0066; font-weight:bold;">raise</span> <span style="color:#996600;">'implement me'</span> <span style="color:#9966CC; font-weight:bold;">if</span> datetime
    <span style="color:#0066ff; font-weight:bold;">@occurances</span>.<span style="color:#9900CC;">size</span> 
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> occurances_between datetime
    <span style="color:#CC0066; font-weight:bold;">raise</span> <span style="color:#996600;">'implement me'</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> pos_latest
    <span style="color:#0066ff; font-weight:bold;">@occurances</span>.<span style="color:#9900CC;">last</span>.<span style="color:#9900CC;">pos</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> pos_average
    <span style="color:#0066ff; font-weight:bold;">@position_average</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> pos_average_between datetime
    <span style="color:#CC0066; font-weight:bold;">raise</span> <span style="color:#996600;">'implement me'</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#008000; font-style:italic;">#   Instance= [num, date, url]</span>
<span style="color:#008000; font-style:italic;">#   Keyword=[Instance, Instance, Instance]</span>
<span style="color:#008000; font-style:italic;">#   Methods for keywords: </span>
<span style="color:#008000; font-style:italic;">#   KW.occured_on? date </span>
<span style="color:#008000; font-style:italic;">#   KW.occured_between? d1, d2 </span>
<span style="color:#008000; font-style:italic;">#   KW.occurances</span>
<span style="color:#008000; font-style:italic;">#   KW.occurances_between d1, d2</span>
<span style="color:#008000; font-style:italic;">#   KW.pos_latest</span>
<span style="color:#008000; font-style:italic;">#   KW.pos_average</span>
<span style="color:#008000; font-style:italic;">#   KW.pos_average_between</span>
&nbsp;
<span style="color:#008000; font-style:italic;">#   KW has been on the top 100 list KW.occurances.size times</span>
<span style="color:#008000; font-style:italic;">#   The #1 keywords for the month of January: Master.sort_by KW.occurances_between? Jan1,Jan31.pos_average_between Jan1,Jan31 </span>
<span style="color:#008000; font-style:italic;">#</span>
<span style="color:#008000; font-style:italic;">#   Top keywords: sort by KW.occurances.size = N keyword was listed the most.</span>
<span style="color:#008000; font-style:italic;">#   Top keywords for date D: Master.sort_by KW.occured_on (x).num</span>
&nbsp;
main</pre></div></div>
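
<p>The &#8220;implement me!&#8221; date queries could be filled in along these lines. This is only a sketch under a few assumptions: <code>Occurrence</code> and <code>TrendKeyword</code> are hypothetical stand-ins for the script&#8217;s <code>Occurance</code> and <code>Keyword</code> classes, dates are stored as <code>YYYY-MM-DD</code> strings and positions as numeric strings (as the scraper produces them), the range methods take a start and an end date, and the sample keyword and URLs are made up.</p>

```ruby
require 'date'

# Hypothetical stand-in for the script's Occurance class.
Occurrence = Struct.new(:pos, :date, :url)

# Hypothetical stand-in for the script's Keyword class, trimmed to the date queries.
class TrendKeyword
  attr_reader :name

  def initialize(name)
    @name = name
    @occurrences = []
  end

  # Records one appearance on the list; returns self so calls can be chained.
  def add(occurrence)
    @occurrences.push(occurrence)
    self
  end

  # True if the keyword was on the list on the given Date.
  def occurred_on?(day)
    @occurrences.any? { |o| Date.parse(o.date) == day }
  end

  # Occurrences whose date falls within first..last (inclusive).
  def occurrences_between(first, last)
    @occurrences.select { |o| (first..last).cover?(Date.parse(o.date)) }
  end

  # Mean list position over the range, or nil when there is no data.
  def pos_average_between(first, last)
    hits = occurrences_between(first, last)
    return nil if hits.empty?
    hits.sum { |o| o.pos.to_i } / hits.size.to_f
  end
end

# Usage with made-up data
kw = TrendKeyword.new('inauguration')
kw.add(Occurrence.new('3', '2009-01-20', 'http://example.com/a'))
kw.add(Occurrence.new('1', '2009-01-21', 'http://example.com/b'))

jan1  = Date.new(2009, 1, 1)
jan31 = Date.new(2009, 1, 31)
puts kw.occurrences_between(jan1, jan31).size
puts kw.pos_average_between(jan1, jan31)
```

<p>Parsing the date strings on every query keeps the sketch short. A fuller version would probably parse each date once, when the occurrence is created.</p>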

]]></content:encoded>
			<wfw:commentRss>http://biodegradablegeek.com/2009/01/scraping-google-trends-with-mechanize-and-hpricot/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>AnimeCrazy Scraper Example Using Hpricot &amp; Mechanize</title>
		<link>http://biodegradablegeek.com/2009/01/animecrazy-scraper-example-using-hpricot-mechanize/</link>
		<comments>http://biodegradablegeek.com/2009/01/animecrazy-scraper-example-using-hpricot-mechanize/#comments</comments>
		<pubDate>Sun, 11 Jan 2009 20:44:40 +0000</pubDate>
		<dc:creator>Isam</dc:creator>
				<category><![CDATA[Automation]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[Scripts]]></category>
		<category><![CDATA[Code example]]></category>
		<category><![CDATA[hpricot]]></category>
		<category><![CDATA[mechanize]]></category>
		<category><![CDATA[tuts]]></category>

		<guid isPermaLink="false">http://biodegradablegeek.com/?p=306</guid>
		<description><![CDATA[This is a little (as of now incomplete) scraper I wrote to grab all the anime video code off of AnimeCrazy (dot) net. This site doesn&#8217;t host any videos on its own server, but just embeds ones that have been uploaded to other sites (Megavideo, YouTube, Vimeo, etc). I don&#8217;t know who the original uploaders [...]]]></description>
			<content:encoded><![CDATA[<p>This is a little <em>(as of now incomplete)</em> scraper I wrote to grab all the anime video code off of AnimeCrazy (dot) net. This site doesn&#8217;t host any videos on its own server, but just embeds ones that have been uploaded to other sites (Megavideo, YouTube, Vimeo, etc). I don&#8217;t know who the original uploaders of the videos are, but I&#8217;ve seen this same collection of anime links being used on some other sites. This site has about 10,000 episodes/parts (1 movie may have 6+ parts). The scraper below was only tested with &#8220;completed anime shows&#8221; and got around 6300 episodes. The remaining content (anime movies and running anime shows) should work as-is, but I personally held off on getting those because I want to examine them closely to try cleaning up the inconsistencies as much as possible.</p>
<p>This scraper needs some initial setup and <strong>won&#8217;t work out of the box</strong>, but I&#8217;m including it here in the hopes that it will serve as a decent example of a small real world scraper, if you&#8217;re looking to learn the basics of scraping with <a href="http://redhanded.hobix.com/inspect/hpricot01.html">Hpricot</a> and Mechanize. Let me know if you find any use for it. I will update the posted code later this week when I have time to complete it and add some more features.</p>
<p>There&#8217;s one major problem with the organization of episodes on AnimeCrazy: some episodes are glued together into one post. Right now the scraper stops and asks you how to proceed when it comes across such a post. You basically need to tell the scraper whether a post (page) contains 1 episode (video) or multiple. If there&#8217;s 1, it proceeds on its own, but if there are 2 or more, it requires that you give it the name and link of each individual episode (part1 and part2 usually). Sometimes 2 episodes are together in 1 video, sorta like those music albums on KaZaA or LimeWire that are ripped as one huge mp3 instead of individual songs.</p>
<p>This only accounts for maybe 30-40 out of 6000 videos, so the manual work needed to keep the scraper going is small. But it IS work, and it&#8217;s a bitch slap to the entire concept of automation. Coding around the issue would be a major hassle, and there would still be a high chance of some inconsistencies slipping through. It would be far less work to just find another anime site that is more consistent, but AnimeCrazy is worth the trouble because it&#8217;s active, and the site IS updated manually these days, as far as I can tell.</p>
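<p>For a rough idea of what coding around the issue could look like, here is a small sketch that proposes one name per episode number found in a glued-together title. The helper and constant names are hypothetical and not part of the scraper. It only guesses names from the title, so it still can&#8217;t tell whether the post holds two separate videos or one combined video, which is exactly the kind of inconsistency that would slip through.</p>

```ruby
# Hypothetical helper, not part of the scraper: proposes one name per
# episode number found in a combined title such as "Basilisk Episode 7+8".
COMBINED_NUMBERS = /(\d+) ?(?:[+&]|and) ?(\d+)/

def propose_names(title)
  match = COMBINED_NUMBERS.match(title)
  return [title] if match.nil? # nothing to split, keep the title as-is

  # Everything before the first number is reused as the prefix.
  # Anything after the second number is dropped.
  prefix = match.pre_match
  [match[1], match[2]].map { |number| "#{prefix}#{number}" }
end

p propose_names('Basilisk Episode 7+8')
p propose_names('Basilisk Episode 2')
```

<p>The first call proposes Basilisk Episode 7 and Basilisk Episode 8, while the second returns the title unchanged. A human would still need to confirm the guess and supply the links.</p>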
<p>BTW, <strong><a href="http://whytheluckystiff.net/">Why The Lucky Stiff rocks</a>, and Hpricot is amazing.</strong> But the serious scrapologist should consider <a href="http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/">scrAPI</a> or <a href="http://scrubyt.org/">scRUBYt!</a> (uses Hpricot) for big projects.</p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;"><span style="color:#008000; font-style:italic;">#!/usr/bin/env ruby</span>
<span style="color:#008000; font-style:italic;"># License: Public domain. Go sell it to newbs on DigitalPoint.</span>
&nbsp;
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'rubygems'</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'hpricot'</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'mechanize'</span>
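<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'tempfile'</span>
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'yaml'</span> <span style="color:#008000; font-style:italic;"># Cache uses YAML.load and to_yaml</span>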
<span style="color:#CC0066; font-weight:bold;">require</span> <span style="color:#996600;">'highline/import'</span>
HighLine.<span style="color:#9900CC;">track_eof</span> = <span style="color:#0000FF; font-weight:bold;">false</span>
&nbsp;
<span style="color:#ff6633; font-weight:bold;">$mech</span> = <span style="color:#6666ff; font-weight:bold;">WWW::Mechanize</span>.<span style="color:#9900CC;">new</span>
<span style="color:#ff6633; font-weight:bold;">$mech</span>.<span style="color:#9900CC;">user_agent_alias</span> = <span style="color:#996600;">'Mac Safari'</span>
&nbsp;
<span style="color:#008000; font-style:italic;">###############################</span>
<span style="color:#ff6633; font-weight:bold;">$skip_until</span> = <span style="color:#0000FF; font-weight:bold;">false</span>
DEBUG=<span style="color:#0000FF; font-weight:bold;">false</span>
<span style="color:#008000; font-style:italic;">###############################</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> debug?
  DEBUG
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> puts2<span style="color:#006600; font-weight:bold;">&#40;</span>txt=<span style="color:#996600;">''</span><span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;*** #{txt}&quot;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#008000; font-style:italic;">#  Anime has: title, type (series, movie), series</span>
<span style="color:#008000; font-style:italic;">#  Episode has name/#, description, parts (video code)</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">class</span> Episode
  attr_accessor <span style="color:#ff3333; font-weight:bold;">:name</span>, <span style="color:#ff3333; font-weight:bold;">:src</span>, <span style="color:#ff3333; font-weight:bold;">:desc</span>, <span style="color:#ff3333; font-weight:bold;">:cover</span>
  <span style="color:#9966CC; font-weight:bold;">def</span> initialize<span style="color:#006600; font-weight:bold;">&#40;</span>title, page<span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#0066ff; font-weight:bold;">@src</span> = page <span style="color:#008000; font-style:italic;"># parts (megavideo, youtube etc)</span>
    <span style="color:#0066ff; font-weight:bold;">@name</span> = title
    <span style="color:#0066ff; font-weight:bold;">@desc</span> = <span style="color:#0000FF; font-weight:bold;">nil</span> <span style="color:#008000; font-style:italic;"># episode description</span>
    <span style="color:#0066ff; font-weight:bold;">@cover</span> = <span style="color:#0000FF; font-weight:bold;">nil</span> <span style="color:#008000; font-style:italic;"># file path</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">class</span> Anime
  attr_accessor <span style="color:#ff3333; font-weight:bold;">:name</span>, <span style="color:#ff3333; font-weight:bold;">:page</span>, <span style="color:#ff3333; font-weight:bold;">:completed</span>, <span style="color:#ff3333; font-weight:bold;">:anime_type</span>, <span style="color:#ff3333; font-weight:bold;">:episodes</span>
  <span style="color:#9966CC; font-weight:bold;">def</span> initialize<span style="color:#006600; font-weight:bold;">&#40;</span>title, page<span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#0066ff; font-weight:bold;">@name</span> = title
    <span style="color:#0066ff; font-weight:bold;">@page</span> = page
    <span style="color:#0066ff; font-weight:bold;">@episodes</span> = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span>
    <span style="color:#0066ff; font-weight:bold;">@anime_type</span> = <span style="color:#996600;">'series'</span>
    <span style="color:#0066ff; font-weight:bold;">@completed</span> = <span style="color:#0000FF; font-weight:bold;">false</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> complete!
    <span style="color:#0066ff; font-weight:bold;">@completed</span> = <span style="color:#0000FF; font-weight:bold;">true</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> episode! episode
    <span style="color:#0066ff; font-weight:bold;">@episodes</span> <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> episode
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">class</span> Cache
  <span style="color:#9966CC; font-weight:bold;">def</span> initialize
    <span style="color:#008000; font-style:italic;"># Setup physical cache location</span>
    <span style="color:#0066ff; font-weight:bold;">@path</span> = <span style="color:#996600;">'cache'</span>
    <span style="color:#CC00FF; font-weight:bold;">Dir</span>.<span style="color:#9900CC;">mkdir</span> <span style="color:#0066ff; font-weight:bold;">@path</span> <span style="color:#9966CC; font-weight:bold;">unless</span> <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">exists</span>? <span style="color:#0066ff; font-weight:bold;">@path</span>
&nbsp;
    <span style="color:#008000; font-style:italic;"># key/val = url/filename (of fetched data)</span>
    <span style="color:#0066ff; font-weight:bold;">@datafile</span> = <span style="color:#996600;">&quot;#{@path}/cache.data&quot;</span>
    <span style="color:#0066ff; font-weight:bold;">@cache</span> = <span style="color:#CC0066; font-weight:bold;">load</span> <span style="color:#0066ff; font-weight:bold;">@datafile</span>
    <span style="color:#008000; font-style:italic;">#puts @cache.inspect</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> put key, val
    tf = <span style="color:#CC00FF; font-weight:bold;">Tempfile</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'animecrazy'</span>, <span style="color:#0066ff; font-weight:bold;">@path</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    path = tf.<span style="color:#9900CC;">path</span>
    tf.<span style="color:#9900CC;">close</span>! <span style="color:#008000; font-style:italic;"># important!</span>
&nbsp;
    puts2 <span style="color:#996600;">&quot;Saving to cache (#{path})&quot;</span>
    <span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>path, <span style="color:#996600;">'w'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span>
      f.<span style="color:#9900CC;">write</span><span style="color:#006600; font-weight:bold;">&#40;</span>val<span style="color:#006600; font-weight:bold;">&#41;</span>
      <span style="color:#0066ff; font-weight:bold;">@cache</span><span style="color:#006600; font-weight:bold;">&#91;</span>key<span style="color:#006600; font-weight:bold;">&#93;</span> = path
    <span style="color:#006600; font-weight:bold;">&#125;</span>
&nbsp;
    save <span style="color:#0066ff; font-weight:bold;">@datafile</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> get key
    <span style="color:#0000FF; font-weight:bold;">return</span> <span style="color:#0000FF; font-weight:bold;">nil</span> <span style="color:#9966CC; font-weight:bold;">unless</span> exists?<span style="color:#006600; font-weight:bold;">&#40;</span>key<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&amp;&amp;</span> <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">exists</span>?<span style="color:#006600; font-weight:bold;">&#40;</span>@cache<span style="color:#006600; font-weight:bold;">&#91;</span>key<span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>@cache<span style="color:#006600; font-weight:bold;">&#91;</span>key<span style="color:#006600; font-weight:bold;">&#93;</span>, <span style="color:#996600;">'r'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span> f.<span style="color:#9900CC;">read</span> <span style="color:#006600; font-weight:bold;">&#125;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">def</span> exists? key
    <span style="color:#0066ff; font-weight:bold;">@cache</span>.<span style="color:#9900CC;">has_key</span>? key
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
private
  <span style="color:#008000; font-style:italic;"># Load saved cache</span>
  <span style="color:#9966CC; font-weight:bold;">def</span> <span style="color:#CC0066; font-weight:bold;">load</span> file
    <span style="color:#0000FF; font-weight:bold;">return</span> <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">exists</span>?<span style="color:#006600; font-weight:bold;">&#40;</span>file<span style="color:#006600; font-weight:bold;">&#41;</span> ? <span style="color:#CC00FF; font-weight:bold;">YAML</span>.<span style="color:#CC0066; font-weight:bold;">load</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>file<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">read</span><span style="color:#006600; font-weight:bold;">&#41;</span> : <span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">&#125;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  <span style="color:#008000; font-style:italic;"># Save cache</span>
  <span style="color:#9966CC; font-weight:bold;">def</span> save path
    <span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>path, <span style="color:#996600;">'w'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span>
      f.<span style="color:#9900CC;">write</span> <span style="color:#0066ff; font-weight:bold;">@cache</span>.<span style="color:#9900CC;">to_yaml</span>
    <span style="color:#006600; font-weight:bold;">&#125;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#ff6633; font-weight:bold;">$cache</span> = Cache.<span style="color:#9900CC;">new</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> fetch<span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span>
  body = <span style="color:#ff6633; font-weight:bold;">$mech</span>.<span style="color:#9900CC;">get</span><span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">body</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#ff6633; font-weight:bold;">$cache</span>.<span style="color:#9900CC;">put</span><span style="color:#006600; font-weight:bold;">&#40;</span>url, body<span style="color:#006600; font-weight:bold;">&#41;</span>
  body
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> getPage<span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#008000; font-style:italic;"># First let's see if this is cached already.</span>
  body = <span style="color:#ff6633; font-weight:bold;">$cache</span>.<span style="color:#9900CC;">get</span><span style="color:#006600; font-weight:bold;">&#40;</span>url<span style="color:#006600; font-weight:bold;">&#41;</span> 
&nbsp;
  <span style="color:#9966CC; font-weight:bold;">if</span> body.<span style="color:#0000FF; font-weight:bold;">nil</span>?
    <span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Not cached. Fetching from site...&quot;</span>
    body = fetch url
  <span style="color:#9966CC; font-weight:bold;">end</span>
  body
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> main
  <span style="color:#008000; font-style:italic;"># Open anime list (anime_list = saved HTML of the &lt;ul&gt;...&lt;/ul&gt; sidebar from animecrazy.net)</span>
  anime_list = Hpricot<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">'anime_list'</span>, <span style="color:#996600;">'r'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#123;</span> <span style="color:#006600; font-weight:bold;">|</span>f<span style="color:#006600; font-weight:bold;">|</span> f.<span style="color:#9900CC;">read</span> <span style="color:#006600; font-weight:bold;">&#125;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
  puts2 <span style="color:#996600;">&quot;Anime list open&quot;</span>
&nbsp;
  <span style="color:#008000; font-style:italic;"># Read in the URL to every series</span>
  masterlist = <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">&#93;</span>
&nbsp;
  <span style="color:#006600; font-weight:bold;">&#40;</span>anime_list<span style="color:#006600; font-weight:bold;">/</span>:li<span style="color:#006600; font-weight:bold;">/</span>:a<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>series<span style="color:#006600; font-weight:bold;">|</span>
    anime = Anime.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>series.<span style="color:#9900CC;">inner_text</span>, series<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#ff3333; font-weight:bold;">:href</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    masterlist <span style="color:#006600; font-weight:bold;">&lt;&lt;</span> anime
    puts2 <span style="color:#996600;">&quot;Built structure for #{anime.name}...&quot;</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  puts2
&nbsp;
  puts2 <span style="color:#996600;">&quot;Fetched #{masterlist.size} animes. Now fetching episodes...&quot;</span>
  masterlist.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>anime<span style="color:#006600; font-weight:bold;">|</span>
    puts2 <span style="color:#996600;">&quot;Fetching body (#{anime.name})&quot;</span>
    body = getPage<span style="color:#006600; font-weight:bold;">&#40;</span>anime.<span style="color:#9900CC;">page</span><span style="color:#006600; font-weight:bold;">&#41;</span>
    puts2 <span style="color:#996600;">&quot;Snatched that bitch (#{body.size} bytes of Goku Goodness)&quot;</span>
    puts2
&nbsp;
    doc = Hpricot<span style="color:#006600; font-weight:bold;">&#40;</span>body<span style="color:#006600; font-weight:bold;">&#41;</span>
    <span style="color:#006600; font-weight:bold;">&#40;</span>doc<span style="color:#006600; font-weight:bold;">/</span><span style="color:#996600;">&quot;h1/a[@rel='bookmark']&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>episode<span style="color:#006600; font-weight:bold;">|</span>
      name = clean<span style="color:#006600; font-weight:bold;">&#40;</span>episode.<span style="color:#9900CC;">inner_text</span><span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
      <span style="color:#9966CC; font-weight:bold;">if</span> <span style="color:#ff6633; font-weight:bold;">$skip_until</span>
        <span style="color:#008000; font-style:italic;">#$skip_until = !inUrl(episode[:href], 'basilisk-episode-2')</span>
        <span style="color:#008000; font-style:italic;">#$skip_until = nil == name['Tsubasa Chronicles']</span>
        puts2 <span style="color:#996600;">&quot;Resuming from #{episode[:href]}&quot;</span> <span style="color:#9966CC; font-weight:bold;">if</span> !$skip_until
        <span style="color:#9966CC; font-weight:bold;">next</span>
      <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
      <span style="color:#008000; font-style:italic;"># Here it gets tricky. This is a major source of inconsistencies in the site.</span>
      <span style="color:#008000; font-style:italic;"># They group episodes into 1 post sometimes, and the only way to find</span>
      <span style="color:#008000; font-style:italic;"># out from the title of the post is by checking for the following patterns</span>
      <span style="color:#008000; font-style:italic;"># (7 and 8 are example episode #s)</span>
      <span style="color:#008000; font-style:italic;"># X = 7+8, 7 + 8, 7 and 8, 7and8, 7 &amp; 8, 7&amp;8</span>
&nbsp;
      <span style="color:#008000; font-style:italic;"># If an episode has no X then it is 1 episode.</span>
      <span style="color:#008000; font-style:italic;"># If it has multiple parts, they are mirrors.</span>
      <span style="color:#9966CC; font-weight:bold;">if</span> single_episode? name
        <span style="color:#9966CC; font-weight:bold;">begin</span>
&nbsp;
          puts2 <span style="color:#996600;">&quot;Adding episode #{name}...&quot;</span>
          ep = Episode.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>name, episode<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#ff3333; font-weight:bold;">:href</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
          ep.<span style="color:#9900CC;">src</span> = getPage<span style="color:#006600; font-weight:bold;">&#40;</span>episode<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#ff3333; font-weight:bold;">:href</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
          anime.<span style="color:#9900CC;">episode</span>! ep
        <span style="color:#9966CC; font-weight:bold;">rescue</span> <span style="color:#6666ff; font-weight:bold;">WWW::Mechanize::ResponseCodeError</span>
          puts2 <span style="color:#996600;">&quot;ERROR: Page not found? Skipping...&quot;</span>
          <span style="color:#CC0066; font-weight:bold;">puts</span> name
          puts2 episode<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#ff3333; font-weight:bold;">:href</span><span style="color:#006600; font-weight:bold;">&#93;</span>
        <span style="color:#9966CC; font-weight:bold;">end</span>
      <span style="color:#9966CC; font-weight:bold;">else</span>
        <span style="color:#008000; font-style:italic;"># If an episode DOES have X, it *may* have 2 episodes (but may have mirrors, going up to 4 parts/vids per page).</span>
        <span style="color:#008000; font-style:italic;"># Multiple parts will be the individual episodes in chronological order.</span>
        puts2 <span style="color:#996600;">&quot;Help me! I'm confused @ '#{name}'&quot;</span>
        puts2 <span style="color:#996600;">&quot;This post might contain multiple episodes...&quot;</span>
&nbsp;
        puts2 <span style="color:#996600;">&quot;Please visit this URL and verify the following:&quot;</span>
        <span style="color:#CC0066; font-weight:bold;">puts</span> episode<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#ff3333; font-weight:bold;">:href</span><span style="color:#006600; font-weight:bold;">&#93;</span>
&nbsp;
        <span style="color:#9966CC; font-weight:bold;">if</span> agree<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Is this 1 episode? yes/no &quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
          <span style="color:#9966CC; font-weight:bold;">begin</span>
            puts2 <span style="color:#996600;">&quot;Adding episode #{name}...&quot;</span>
            ep = Episode.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>name, episode<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#ff3333; font-weight:bold;">:href</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
            ep.<span style="color:#9900CC;">src</span> = getPage<span style="color:#006600; font-weight:bold;">&#40;</span>episode<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#ff3333; font-weight:bold;">:href</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
            anime.<span style="color:#9900CC;">episode</span>! ep
          <span style="color:#9966CC; font-weight:bold;">rescue</span> <span style="color:#6666ff; font-weight:bold;">WWW::Mechanize::ResponseCodeError</span>
            puts2 <span style="color:#996600;">&quot;ERROR: Page not found? Skipping...&quot;</span>
            <span style="color:#CC0066; font-weight:bold;">puts</span> name
            puts2 episode<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#ff3333; font-weight:bold;">:href</span><span style="color:#006600; font-weight:bold;">&#93;</span>
          <span style="color:#9966CC; font-weight:bold;">end</span>
        <span style="color:#9966CC; font-weight:bold;">else</span>
          more = <span style="color:#0000FF; font-weight:bold;">true</span>
          <span style="color:#9966CC; font-weight:bold;">while</span> more
            ename = ask<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Enter the name of an episode: &quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
            eurl =  ask<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Enter the URL of an episode: &quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
&nbsp;
            <span style="color:#9966CC; font-weight:bold;">begin</span>
              puts2 <span style="color:#996600;">&quot;Adding episode #{ename}...&quot;</span>
              <span style="color:#008000; font-style:italic;"># Use the name and URL that were just entered, not the combined post's</span>
              ep = Episode.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span>ename, eurl<span style="color:#006600; font-weight:bold;">&#41;</span>
              ep.<span style="color:#9900CC;">src</span> = getPage<span style="color:#006600; font-weight:bold;">&#40;</span>eurl<span style="color:#006600; font-weight:bold;">&#41;</span>
              anime.<span style="color:#9900CC;">episode</span>! ep
            <span style="color:#9966CC; font-weight:bold;">rescue</span> <span style="color:#6666ff; font-weight:bold;">WWW::Mechanize::ResponseCodeError</span>
              puts2 <span style="color:#996600;">&quot;ERROR: Page not found? Skipping...&quot;</span>
              <span style="color:#CC0066; font-weight:bold;">puts</span> ename
              puts2 eurl
            <span style="color:#9966CC; font-weight:bold;">end</span>
            more = agree<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;Add another episode? Y/N&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
          <span style="color:#9966CC; font-weight:bold;">end</span>
          puts2 <span style="color:#996600;">&quot;Added episodes manually... moving on&quot;</span>
        <span style="color:#9966CC; font-weight:bold;">end</span>
      <span style="color:#9966CC; font-weight:bold;">end</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
    anime.<span style="color:#9900CC;">complete</span>!
    <span style="color:#008000; font-style:italic;"># XXX save the entire anime object, instead of just cache</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> inTitle<span style="color:#006600; font-weight:bold;">&#40;</span>document, title<span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#0000FF; font-weight:bold;">return</span> <span style="color:#006600; font-weight:bold;">&#40;</span>document<span style="color:#006600; font-weight:bold;">/</span>:title<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">inner_text</span><span style="color:#006600; font-weight:bold;">&#91;</span>title<span style="color:#006600; font-weight:bold;">&#93;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> inUrl<span style="color:#006600; font-weight:bold;">&#40;</span>url, part<span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#0000FF; font-weight:bold;">return</span> url<span style="color:#006600; font-weight:bold;">&#91;</span>part<span style="color:#006600; font-weight:bold;">&#93;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> single_episode?<span style="color:#006600; font-weight:bold;">&#40;</span>name<span style="color:#006600; font-weight:bold;">&#41;</span>
  !<span style="color:#006600; font-weight:bold;">&#40;</span>name =~ <span style="color:#006600; font-weight:bold;">/</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">-</span><span style="color:#006666;">9</span><span style="color:#006600; font-weight:bold;">&#93;</span> ?<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006600; font-weight:bold;">+&amp;</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">|</span>and<span style="color:#006600; font-weight:bold;">&#41;</span> ?<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">-</span><span style="color:#006666;">9</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">/</span><span style="color:#006600; font-weight:bold;">&#41;</span>
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
<span style="color:#9966CC; font-weight:bold;">def</span> clean<span style="color:#006600; font-weight:bold;">&#40;</span>txt<span style="color:#006600; font-weight:bold;">&#41;</span>
  <span style="color:#008000; font-style:italic;"># This picks up most of them, but some are missing. Like *Final* and just plain &quot;Final&quot;</span>
  txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">' (Final)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>=<span style="color:#996600;">''</span> <span style="color:#9966CC; font-weight:bold;">if</span> txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">' (Final)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">' (Final Episode)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>=<span style="color:#996600;">''</span> <span style="color:#9966CC; font-weight:bold;">if</span> txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">' (Final Episode)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">' (FINAL)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>=<span style="color:#996600;">''</span> <span style="color:#9966CC; font-weight:bold;">if</span> txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">' (FINAL)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">' (FINAL EPISODE)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>=<span style="color:#996600;">''</span> <span style="color:#9966CC; font-weight:bold;">if</span> txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">' (FINAL EPISODE)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>
&nbsp;
  txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'(Final)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>=<span style="color:#996600;">''</span> <span style="color:#9966CC; font-weight:bold;">if</span> txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'(Final)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'(Final Episode)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>=<span style="color:#996600;">''</span> <span style="color:#9966CC; font-weight:bold;">if</span> txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'(Final Episode)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'(FINAL)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>=<span style="color:#996600;">''</span> <span style="color:#9966CC; font-weight:bold;">if</span> txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'(FINAL)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'(FINAL EPISODE)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>=<span style="color:#996600;">''</span> <span style="color:#9966CC; font-weight:bold;">if</span> txt<span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'(FINAL EPISODE)'</span><span style="color:#006600; font-weight:bold;">&#93;</span>
&nbsp;
  txt
<span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
main</pre></div></div>
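<p>The comment in clean() notes that some variants slip through. One possible alternative, not part of the original script, is a single case-insensitive regular expression. This is only a minimal sketch: the name clean_with_regex and the sample titles are made up for illustration. Unlike clean(), it returns a new string instead of modifying its argument, and it removes every occurrence rather than the first. It still does not handle *Final* or a bare Final.</p>

```ruby
# Hypothetical alternative to clean(): one case-insensitive pattern
# instead of eight literal string checks.
FINAL_TAG = /\s*\(final(?: episode)?\)/i

def clean_with_regex(txt)
  # gsub returns a new string, so the caller's string is left untouched
  txt.gsub(FINAL_TAG, '')
end

# Made-up titles, for illustration only
puts clean_with_regex('Kaiji Episode 26 (Final)')
puts clean_with_regex('Basilisk Episode 24(FINAL EPISODE)')
```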

<p>If you&#8217;re writing your own scraper and would like to reuse the minimal caching functionality in the script above, you can gut main() and put in your own code. Feel free to <a href="/contact">contact me for assistance</a>.</p>
<p>Here is some sample output:<br />
<span id="more-306"></span></p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">*** Adding episode Initial D: Episode 1 (Stage 2)...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090111-12300-mbdpcl-0)
*** Fetching body (Initial D: Third Stage)
*** Snatched that bitch (77695 bytes of Goku Goodness)
***
*** Adding episode Initial D: Third Stage...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090111-12300-ea69nr-0)
*** Fetching body (Kaiji)
*** Snatched that bitch (87553 bytes of Goku Goodness)
***
*** Adding episode Basilisk Episode 4...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-fomoh0-0)
*** Adding episode Basilisk Episode 3...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-1dx9xm-0)
*** Adding episode Basilisk Episode 2...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-5xt774-0)
*** Adding episode Basilisk Episode 1...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-br5fxd-0)
*** Adding episode Tsubasa Chronicles: Tokyo Revelations Episode 3...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-zmuwix-0)
*** Adding episode Tsubasa Chronicles: Tokyo Revelations Episode 2...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-1ah20eg-0)
*** Adding episode Tsubasa Chronicles: Tokyo Revelations Episode 1...
Not cached. Fetching from site...</pre></div></div>

<p>This was written for fun, but primarily for profit, and not for my own viewing pleasure. The only anime I&#8217;ve seen is Akira, a decade or so ago, and only because the cover looked cool, but feel free to recommend your favorites.</p>
]]></content:encoded>
			<wfw:commentRss>http://biodegradablegeek.com/2009/01/animecrazy-scraper-example-using-hpricot-mechanize/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How to POST Form Data Using Ruby</title>
		<link>http://biodegradablegeek.com/2008/04/how-to-post-form-data-using-ruby/</link>
		<comments>http://biodegradablegeek.com/2008/04/how-to-post-form-data-using-ruby/#comments</comments>
		<pubDate>Thu, 24 Apr 2008 15:43:09 +0000</pubDate>
		<dc:creator>Isam</dc:creator>
				<category><![CDATA[Automation]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Scraping]]></category>
		<category><![CDATA[Snippets]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[http]]></category>
		<category><![CDATA[scripting]]></category>
		<category><![CDATA[Scripts]]></category>
		<category><![CDATA[Tips]]></category>

		<guid isPermaLink="false">http://biodegradablegeek.com/2008/04/24/how-to-post-form-data-using-ruby/</guid>
		<description><![CDATA[POSTing data on web forms is essential for writing tools and services that interact with resources already available on the web. You can grab information from your Gmail account, add a new thread to a forum from your own app, etc. 
The following is a brief example on how this can be done in Ruby [...]]]></description>
			<content:encoded><![CDATA[<p><strong>POST</strong>ing data on web forms is essential for writing tools and services that interact with resources already available on the web. You can grab information from your Gmail account, add a new thread to a forum from your own app, etc. </p>
<p>The following is a brief example of how this can be done in Ruby using <a href="http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html" target="_blank">Net::HTTP</a> and <a href="http://www.interlacken.com/webdbdev/ch05/formpost.asp" target="_blank">this POST form example</a>.</p>
<p>Looking at the source (interlacken.com/webdbdev/ch05/formpost.asp):</p>
<pre class="brush: xml;">
&lt;form method=&quot;POST&quot; action=&quot;formpost.asp&quot;&gt;
&lt;p&gt;&lt;input type=&quot;text&quot; name=&quot;box1&quot; size=&quot;20&quot; value=&quot;&quot;&gt;
&lt;input type=&quot;submit&quot; value=&quot;Submit&quot; name=&quot;button1&quot;&gt;&lt;/p&gt;
&lt;/form&gt;
</pre>
<p>We see that two fields are sent to the formpost.asp script when the user hits the submit button: the textbox named <strong>box1</strong> and the submit button named <strong>button1</strong>, whose value is <strong>Submit</strong>. If this form used the GET method, we would just fetch the URL with a query string appended (for example, <strong>?box1=our+text+here</strong>). Fortunately, Ruby&#8217;s Net::HTTP makes posting data just as easy.</p>
<p>The Ruby code: </p>
<pre class="brush: ruby;">
#!/usr/bin/ruby

require &quot;uri&quot;
require &quot;net/http&quot;

# Form fields to send, keyed by the name attribute of each input
params = {
  'box1' =&gt; 'Nothing is less important than which fork you use. Etiquette is the science of living. It embraces everything. It is ethics. It is honor. -Emily Post',
  'button1' =&gt; 'Submit'
}

# post_form URL-encodes params and sends them as the body of a POST request
response = Net::HTTP.post_form(URI.parse('http://www.interlacken.com/webdbdev/ch05/formpost.asp'), params)
puts response.body

# Uncomment this if you want output in a file
# File.open('out.htm', 'w') { |f| f.write response.body }
</pre>
<p>Sending the value of button1 is optional in this case, but sometimes this value is checked in the server-side script. One example is when the coder wants to find out whether the form has been submitted &#8211; as opposed to it being the user&#8217;s first visit to the form &#8211; without creating a hidden field to send along with the other form fields. Besides, there&#8217;s no harm in sending a few more bytes.</p>
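<p>To see what those few extra bytes look like, the form fields can be encoded by hand with URI.encode_www_form from the standard library. This is only a sketch of the URL-encoded body that a form POST carries. The shortened quote is a stand-in for the full text, and nothing is sent over the network.</p>

```ruby
require "uri"

# Shortened stand-in for the quote used in the script (illustrative only)
with_button    = { 'box1' => 'It is honor. -Emily Post', 'button1' => 'Submit' }
without_button = { 'box1' => 'It is honor. -Emily Post' }

body_with    = URI.encode_www_form(with_button)
body_without = URI.encode_www_form(without_button)

puts body_with      # box1=It+is+honor.+-Emily+Post&button1=Submit
puts body_without   # box1=It+is+honor.+-Emily+Post
puts "extra bytes: #{body_with.bytesize - body_without.bytesize}"  # extra bytes: 15
```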
<p>If you&#8217;re curious about URI.parse, it splits the URL into its components (scheme, host, port, path and so on) and returns them as a URI object, so the methods in Net::HTTP can do their own job without having to parse the URL themselves. More info on this in the <a href="http://www.ruby-doc.org/stdlib/libdoc/uri/rdoc/classes/URI.html#M009241" target="_blank">Ruby doc</a>.</p>
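<p>As a quick illustration of what URI.parse returns, the sketch below prints a few components of the example URL. The accessors come from the standard uri library, and nothing here touches the network.</p>

```ruby
require "uri"

uri = URI.parse('http://www.interlacken.com/webdbdev/ch05/formpost.asp')

puts uri.scheme  # http
puts uri.host    # www.interlacken.com
puts uri.port    # 80 (the default for http, since the URL gives no port)
puts uri.path    # /webdbdev/ch05/formpost.asp
```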
<p>Assuming no errors, running this example (<em>ruby postpost.rb</em> or <em>chmod a+x postpost.rb; ./postpost.rb</em>) yields:</p>
<pre class="brush: xml;">
&lt;form method=&quot;POST&quot; action=&quot;formpost.asp&quot;&gt;
&lt;p&gt;&lt;input type=&quot;text&quot; name=&quot;box1&quot; size=&quot;20&quot; value=&quot;NOTHING IS LESS
IMPORTANT THAN WHICH FORK YOU USE. ETIQUETTE IS THE
SCIENCE OF LIVING. IT EMBRACES EVERYTHING. IT IS ETHICS.
IT IS HONOR. -EMILY POST&quot;&gt;
&lt;input type=&quot;submit&quot; value=&quot;Submit&quot; name=&quot;button1&quot;&gt;&lt;/p&gt;
&lt;/form&gt;
</pre>
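<p>The run above assumes no errors. A hedged sketch of checking the response before using its body follows. describe_response is a hypothetical helper name, and the responses are built by hand here so the example runs without a network connection. In a real script, the object returned by Net::HTTP.post_form would be passed in instead.</p>

```ruby
require "net/http"

# Hypothetical helper: summarize a Net::HTTPResponse by its class
def describe_response(response)
  case response
  when Net::HTTPSuccess
    "OK (#{response.code})"
  when Net::HTTPRedirection
    "Redirected (#{response.code}) to #{response['location']}"
  else
    "Failed (#{response.code} #{response.message})"
  end
end

# Hand-built responses standing in for real server replies
ok      = Net::HTTPOK.new('1.1', '200', 'OK')
missing = Net::HTTPNotFound.new('1.1', '404', 'Not Found')

puts describe_response(ok)       # OK (200)
puts describe_response(missing)  # Failed (404 Not Found)
```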
<p>In practice, you might want to use a more specialized library to handle what you&#8217;re doing. Be sure to check out <a href="http://mechanize.rubyforge.org/mechanize/" target="_blank">Mechanize</a> and <a href="http://github.com/adamwiggins/rest-client/tree/master" target="_blank">Rest-client</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://biodegradablegeek.com/2008/04/how-to-post-form-data-using-ruby/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
