AnimeCrazy Scraper Example Using Hpricot & Mechanize

This is a little (as of now incomplete) scraper I wrote to grab all the anime video embed code off of AnimeCrazy (dot) net. The site doesn’t host any videos on its own servers; it just embeds videos that have been uploaded to other sites (Megavideo, YouTube, Vimeo, etc.). I don’t know who the original uploaders of the videos are, but I’ve seen this same collection of anime links used on some other sites. The site has about 10,000 episodes/parts (one movie may have 6+ parts). The scraper below was only tested against “completed anime shows” and got around 6,300 episodes. The remaining content (anime movies and running anime shows) should work as-is, but I personally held off on grabbing those because I want to examine them closely and clean up the inconsistencies as much as possible.

This scraper needs some initial setup and won’t work out of the box, but I’m including it here in the hope that it serves as a decent example of a small real-world scraper if you’re looking to learn the basics of scraping with Hpricot and Mechanize. Let me know if you find any use for it. I’ll update the posted code later this week when I have time to complete it and add some more features.
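If you’re brand new to these two libraries, the core pattern is tiny: Mechanize fetches the page, Hpricot parses it, and you search the parse tree with CSS-ish selectors. Here’s a minimal sketch of that pattern before we get to the real thing; the URL and selector are placeholders, not AnimeCrazy’s actual markup.

#!/usr/bin/env ruby
require 'rubygems'
require 'hpricot'
require 'mechanize'

mech = WWW::Mechanize.new            # WWW::Mechanize is the pre-1.0 class name
mech.user_agent_alias = 'Mac Safari' # look like a normal browser

# Placeholder URL -- substitute whatever page you want to scrape
body = mech.get('http://example.com/').body
doc  = Hpricot(body)

# (doc/'selector') searches the parse tree; symbols like :a work too
(doc/'a').each do |link|
  puts "#{link.inner_text} -> #{link[:href]}"
end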

There’s one major problem with the organization of episodes on AnimeCrazy: some episodes are glued together into one post. Right now the scraper stops and asks you how to proceed when it comes across such a post. You basically need to tell the scraper whether a post (page) contains one episode (video) or multiple. If there’s one, it proceeds on its own, but if there are two, it requires that you give it the name and link of each individual episode (part 1 and part 2, usually). Sometimes two episodes are together in one video, sorta like those music albums on KaZaA or LimeWire that were ripped as one huge mp3 instead of individual songs. The detection itself is just a title check, as in the sketch below.
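The check for glued posts is the single_episode? helper you’ll see in the full listing: it looks for a “7+8” / “7 and 8” style pattern in the post title. The sample titles here are made up for illustration.

def single_episode?(name)
  # Two episode numbers joined by +, & or "and" means a glued post
  !(name =~ /[0-9] ?([+&]|and) ?[0-9]/)
end

single_episode? 'Naruto Episode 7'        # => true  (one episode)
single_episode? 'Naruto Episode 7 and 8'  # => false (glued post)
single_episode? 'Naruto Episode 7+8'      # => false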

This only accounts for maybe 30-40 out of 6,000+ videos, and it’s not that big of a deal because the amount of work needed to proceed with the scraping is small. But it IS work, and it’s a bitch slap to the entire concept of automation. Coding around the issue would be a major hassle, though, and there would still be a high chance of some inconsistencies slipping through. It would be far less work to just find another, far more consistent anime site, but the reason AnimeCrazy is good is that it’s active, and the site IS updated manually these days, as far as I can tell.

BTW, Why The Lucky Stiff rocks, and Hpricot is amazing. But the serious scrapologist should consider scrAPI or sCRUBYt (which uses Hpricot) for big projects.


#!/usr/bin/env ruby
# License: Public domain. Go sell it to newbs on DigitalPoint.

require 'rubygems'
require 'hpricot'
require 'mechanize'
require 'tempfile'
require 'yaml' # needed for the Cache class's YAML (de)serialization
require 'highline/import'
HighLine.track_eof = false

$mech = WWW::Mechanize.new
$mech.user_agent_alias = 'Mac Safari'

###############################
# Set to true (and pick a resume condition in main) to fast-forward
# past episodes that have already been scraped.
$skip_until = false
DEBUG = false
###############################

def debug?
  DEBUG
end

def puts2(txt='')
  puts "*** #{txt}"
end

#  Anime has: title, type (series, movie), series
#  Episode has name/#, description, parts (video code)

class Episode
  attr_accessor :name, :src, :desc, :cover
  def initialize(title, page)
    @src = page # parts (megavideo, youtube etc)
    @name = title
    @desc = nil # episode description
    @cover = nil # file path
  end
end

class Anime
  attr_accessor :name, :page, :completed, :anime_type, :episodes
  def initialize(title, page)
    @name = title
    @page = page
    @episodes = []
    @anime_type = 'series'
    @completed = false
  end

  def complete!
    @completed = true
  end

  def episode! episode
    @episodes << episode
  end
end

class Cache
  def initialize
    # Setup physical cache location
    @path = 'cache'
    Dir.mkdir @path unless File.exists? @path

    # key/val = url/filename (of fetched data)
    @datafile = "#{@path}/cache.data"
    @cache = load @datafile
    #puts @cache.inspect
  end

  def put key, val
    tf = Tempfile.new('animecrazy', @path)
    path = tf.path
    tf.close! # important!

    puts2 "Saving to cache (#{path})"
    open(path, 'w') { |f|
      f.write(val)
      @cache[key] = path
    }

    save @datafile
  end

  def get key
    return nil unless exists?(key) && File.exists?(@cache[key])
    open(@cache[key], 'r') { |f| f.read }
  end

  def exists? key
    @cache.has_key? key
  end

private
  # Load saved cache
  def load file
    return File.exists?(file) ? YAML.load(open(file).read) : {}
  end

  # Save cache
  def save path
    open(path, 'w') { |f|
      f.write @cache.to_yaml
    }
  end
end

$cache = Cache.new

def fetch(url)
  body = $mech.get(url).body
  $cache.put(url, body)
  body
end

def getPage(url)
  # First let's see if this is cached already.
  body = $cache.get(url) 

  if body.nil?
    puts "Not cached. Fetching from site..."
    body = fetch url
  end
  body
end

def main
  # Open anime list (anime_list = saved HTML of sidebar from animecrazy.net)
  anime_list = Hpricot(open('anime_list', 'r') { |f| f.read })
  puts2 "Anime list open"

  # Read in the URL to every series
  masterlist = []

  (anime_list/:li/:a).each do |series|
    anime = Anime.new(series.inner_text, series[:href])
    masterlist << anime
    puts2 "Built structure for #{anime.name}..."
  end

  puts2

  puts2 "Fetched #{masterlist.size} animes. Now fetching episodes..."
  masterlist.each do |anime|
    puts2 "Fetching body (#{anime.name})"
    body = getPage(anime.page)
    puts2 "Snatched that bitch (#{body.size} bytes of Goku Goodness)"
    puts2

    doc = Hpricot(body)
    (doc/"h1/a[@rel='bookmark']").each do |episode|
      name = clean(episode.inner_text)

      if $skip_until
        # Manual resume point: uncomment one of these (or add your own
        # condition) to fast-forward to a known episode after a crash.
        #$skip_until = !inUrl(episode[:href], 'basilisk-episode-2')
        #$skip_until = nil == name['Tsubasa Chronicles']
        puts2 "Resuming from #{episode[:href]}" if !$skip_until
        next if $skip_until # don't skip the episode we just resumed at
      end

      # Here it gets tricky. This is a major source of inconsistencies in the site.
      # They group episodes into 1 post sometimes, and the only way to find
      # out from the title of the post is by checking for the following patterns
      # (7 and 8 are example episode #s)
      # X = 7+8, 7 + 8, 7 and 8, 7and8, 7 & 8, 7&8

      # If an episode has no X then it is 1 episode.
      # If it has multiple parts, they are mirrors.
      if single_episode? name
        begin

          puts2 "Adding episode #{name}..."
          ep = Episode.new(name, episode[:href])
          ep.src = getPage(episode[:href])
          anime.episode! ep
        rescue WWW::Mechanize::ResponseCodeError
          puts2 "ERROR: Page not found? Skipping..."
          puts name
          puts2 episode[:href]
        end
      else
        # If an episode DOES have X, it *may* have 2 episodes (but may have mirrors, going up to 4 parts/vids per page).
        # Multiple parts will be the individual episodes in chronological order.
        puts2 "Help me! I'm confused @ '#{name}'"
        puts2 "This post might contain multiple episodes..."

        puts2 "Please visit this URL and verify the following:"
        puts episode[:href]

        if agree("Is this 1 episode? yes/no ")
          begin
            puts2 "Adding episode #{name}..."
            ep = Episode.new(name, episode[:href])
            ep.src = getPage(episode[:href])
            anime.episode! ep
          rescue WWW::Mechanize::ResponseCodeError
            puts2 "ERROR: Page not found? Skipping..."
            puts name
            puts2 episode[:href]
          end
        else
          more = true
          while more
            ename = ask("Enter the name of an episode: ")
            eurl =  ask("Enter the URL of an episode: ")

            begin
              puts2 "Adding episode #{ename}..."
              ep = Episode.new(ename, eurl) # use the manually entered name/URL
              ep.src = getPage(eurl)
              anime.episode! ep
            rescue WWW::Mechanize::ResponseCodeError
              puts2 "ERROR: Page not found? Skipping..."
              puts ename
              puts2 eurl
            end
            more = agree("Add another episode? Y/N")
          end
          puts2 "Added episodes manually... moving on"
        end
      end
    end
    anime.complete!
    # XXX save the entire anime object, instead of just cache
  end
end

def inTitle(document, title)
  return (document/:title).inner_text[title]
end

def inUrl(url, part)
  return url[part]
end

def single_episode?(name)
  !(name =~ /[0-9] ?([+&]|and) ?[0-9]/)
end

def clean(txt)
  # This picks up most of them, but some are missing, like *Final* and just plain "Final"
  ['(Final)', '(Final Episode)', '(FINAL)', '(FINAL EPISODE)'].each do |tag|
    txt[" #{tag}"] = '' if txt[" #{tag}"] # with a leading space
    txt[tag] = '' if txt[tag]             # without one
  end
  txt
end

main

If you’re writing your own scraper and would like to use the minimal caching functionality present above, you can gut everything out of main() and drop in your own code, as in the sketch below. Feel free to contact me for assistance.
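For example, here’s a minimal sketch of a gutted main() that keeps the Cache/getPage plumbing from the listing above; the URL and selector are placeholders for whatever site you’re scraping.

# Keep the requires, $mech, Cache, $cache, fetch and getPage from above,
# then replace main() with your own logic:
def main
  # First call hits the network; later runs are served from the cache/ dir
  body = getPage('http://example.com/some-page')
  doc  = Hpricot(body)

  (doc/'h2.title').each do |heading|
    puts2 heading.inner_text
  end
end

main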

Here is some sample output:

*** Adding episode Initial D: Episode 1 (Stage 2)...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090111-12300-mbdpcl-0)
*** Fetching body (Initial D: Third Stage)
*** Snatched that bitch (77695 bytes of Goku Goodness)
***
*** Adding episode Initial D: Third Stage...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090111-12300-ea69nr-0)
*** Fetching body (Kaiji)
*** Snatched that bitch (87553 bytes of Goku Goodness)
***
*** Adding episode Basilisk Episode 4...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-fomoh0-0)
*** Adding episode Basilisk Episode 3...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-1dx9xm-0)
*** Adding episode Basilisk Episode 2...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-5xt774-0)
*** Adding episode Basilisk Episode 1...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-br5fxd-0)
*** Adding episode Tsubasa Chronicles: Tokyo Revelations Episode 3...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-zmuwix-0)
*** Adding episode Tsubasa Chronicles: Tokyo Revelations Episode 2...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-1ah20eg-0)
*** Adding episode Tsubasa Chronicles: Tokyo Revelations Episode 1...
Not cached. Fetching from site...

This was written for fun, but primarily for profit, and not for my own viewing pleasure. The only anime I’ve seen was Akira, a decade or so ago, and only because the cover looked cool, but feel free to recommend your favorites.
