AnimeCrazy Scraper Example Using Hpricot & Mechanize
Sun, Jan 11, 2009
This is a little (as of now incomplete) scraper I wrote to grab all the anime video code off of AnimeCrazy (dot) net. This site doesn’t host any videos on its own server; it just embeds ones that have been uploaded to other sites (Megavideo, YouTube, Vimeo, etc.). I don’t know who the original uploaders of the videos are, but I’ve seen this same collection of anime links being used on some other sites. The site has about 10,000 episodes/parts (a single movie may have 6+ parts). The scraper below was only tested with “completed anime shows” and retrieved around 6,300 episodes. The remaining content (anime movies and running anime shows) should work as-is, but I personally held off on getting those because I want to examine them closely and clean up the inconsistencies as much as possible.
This scraper needs some initial setup and won’t work out of the box, but I’m including it here in the hope that it will serve as a decent example of a small real-world scraper if you’re looking to learn the basics of scraping with Hpricot and Mechanize. Let me know if you find any use for it. I will update the posted code later this week when I have time to complete it and add some more features.
There’s one major problem with the organization of episodes on AnimeCrazy: some episodes are glued together into one post. Right now the scraper stops and asks you how to proceed when it comes across such a post. You basically need to tell the scraper whether a post (page) contains one episode (video) or several. If there’s one, it proceeds on its own, but if there are two or more, it requires that you give it the name and link of each individual episode (usually part1 and part2). Sometimes two episodes are together in one video, sorta like those music albums on KaZaA or LimeWire that are ripped as one huge mp3 instead of individual songs.
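The prompt flow described above can be sketched roughly like this. It is a minimal sketch and not the scraper’s actual code: the `ask_episodes` helper, its parameters and the prompt wording are all hypothetical, and it reads from plain IO objects instead of HighLine so that it runs without any gems.

```ruby
# Hypothetical helper (not part of the original scraper): ask the operator
# how many episodes a post contains and return an array of [name, link] pairs.
# `input` and `output` default to the terminal but can be any IO-like object.
def ask_episodes(post_title, post_url, input = $stdin, output = $stdout)
  output.puts "Post: #{post_title} (#{post_url})"
  output.print 'How many episodes are in this post? [1] '
  answer = input.gets.to_s.strip
  count = answer.empty? ? 1 : answer.to_i
  count = 1 if count < 1

  # A single episode needs no extra input: reuse the post's own title and URL.
  return [[post_title, post_url]] if count == 1

  # Several episodes: collect a name and a link for each one by hand.
  (1..count).map do |i|
    output.print "Name of episode #{i}: "
    name = input.gets.to_s.strip
    output.print "Link for episode #{i}: "
    link = input.gets.to_s.strip
    [name, link]
  end
end
```

Pressing Enter at the first prompt accepts the default of one episode, which keeps the common case down to a single keystroke. Passing a `StringIO` as `input` lets you replay canned answers when testing.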
This only accounts for maybe 30-40 out of 6000 videos. It’s not that big of a deal, because the amount of work needed to proceed with the scraping is small, but it IS work, and it’s a bitch slap to the entire concept of automation. Coding around the issue would be a major hassle, and there would still be a high chance of some inconsistencies coming through. It would be far less work to just find another anime site that is more consistent, though the reason animecrazy is good is that it’s active, and the site IS updated manually these days, as far as I can tell.
BTW, Why The Lucky Stiff rocks, and Hpricot is amazing. But the serious scrapologist should consider scrAPI or scRUBYt! (uses Hpricot) for big projects.
#!/usr/bin/env ruby
# License: Public domain. Go sell it to newbs on DigitalPoint.
require 'rubygems'
require 'hpricot'
require 'mechanize'
require 'tempfile'
require 'yaml' # used by Cache to persist its index
require 'highline/import'

HighLine.track_eof = false
$mech = WWW::Mechanize.new # newer Mechanize releases drop the WWW:: prefix
$mech.user_agent_alias = 'Mac Safari'

###############################
$skip_until = false
DEBUG = false
###############################

def debug?
  DEBUG
end

# Print a line prefixed with "*** " so scraper messages stand out.
def puts2(txt = '')
  puts "*** #{txt}"
end
# Anime has: title, type (series, movie), series
# Episode has name/#, description, parts (video code)
class Episode
  attr_accessor :name, :src, :desc, :cover

  def initialize(title, page)
    @src = page   # parts (megavideo, youtube etc)
    @name = title
    @desc = nil   # episode description
    @cover = nil  # file path
  end
end

class Anime
  attr_accessor :name, :page, :completed, :anime_type, :episodes

  def initialize(title, page)
    @name = title
    @page = page
    @episodes = []
    @anime_type = 'series'
    @completed = false
  end

  def complete!
    @completed = true
  end

  def episode!(episode)
    @episodes << episode
  end
end
class Cache
  def initialize
    # Setup physical cache location
    @path = 'cache'
    Dir.mkdir @path unless File.exist? @path
    # key/val = url/filename (of fetched data)
    @datafile = "#{@path}/cache.data"
    @cache = load @datafile
    #puts @cache.inspect
  end

  # Write val to a uniquely named file and record key => path in the index.
  def put(key, val)
    # Tempfile is only used to generate a unique path inside the cache dir.
    tf = Tempfile.new('animecrazy', @path)
    path = tf.path
    tf.close! # important! deletes the tempfile so the path can be reused
    puts2 "Saving to cache (#{path})"
    File.open(path, 'w') { |f|
      f.write(val)
      @cache[key] = path
    }
    save @datafile
  end

  # Return the cached data for key, or nil if it is missing.
  def get(key)
    return nil unless exists?(key) && File.exist?(@cache[key])
    File.open(@cache[key], 'r') { |f| f.read }
  end

  def exists?(key)
    @cache.has_key? key
  end

  private

  # Load saved cache
  def load(file)
    File.exist?(file) ? YAML.load(File.read(file)) : {}
  end

  # Save cache
  def save(path)
    File.open(path, 'w') { |f|
      f.write @cache.to_yaml
    }
  end
end
$cache = Cache.new

# Download url and store the response body in the cache.
def fetch(url)
  body = $mech.get(url).body
  $cache.put(url, body)
  body
end

def getPage(url)
  # First let's see if this is cached already.
  body = $cache.get(url)
  if body.nil?
    puts "Not cached. Fetching from site..."
    body = fetch url
  end
  body
end
def main
# Open anime list (anime_list = saved HTML of
- ...
If you’re writing your own scraper and would like to use the minimal caching functionality in the code above, you can gut everything in main() and put in your own code. Feel free to contact me for assistance.
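If you only want the caching idea for your own scraper, a stripped-down version might look like the following. This is a sketch under assumptions, not the code from this post: `PageCache` is a hypothetical class, and it swaps the Tempfile-plus-YAML index used in this scraper for file names derived from an MD5 hash of the URL, so there is no separate index file to keep in sync. The block you pass in does the actual download.

```ruby
require 'digest/md5'
require 'fileutils'

# Hypothetical stand-in for the Cache/getPage pair: one file per URL,
# named after the MD5 hash of the URL.
class PageCache
  def initialize(dir = 'cache')
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Return the cached body for url, or run the block to fetch it,
  # store the result on disk, and return it.
  def fetch(url)
    path = File.join(@dir, Digest::MD5.hexdigest(url))
    return File.read(path) if File.exist?(path)

    body = yield(url)
    File.write(path, body)
    body
  end
end
```

With Mechanize set up as in this post, usage would be along the lines of `PageCache.new.fetch(url) { |u| $mech.get(u).body }`. The second request for the same URL is served from disk and never touches the site.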
Here is some sample output:
*** Adding episode Initial D: Episode 1 (Stage 2)...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090111-12300-mbdpcl-0)
*** Fetching body (Initial D: Third Stage)
*** Snatched that bitch (77695 bytes of Goku Goodness)
***
*** Adding episode Initial D: Third Stage...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090111-12300-ea69nr-0)
*** Fetching body (Kaiji)
*** Snatched that bitch (87553 bytes of Goku Goodness)
***
*** Adding episode Basilisk Episode 4...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-fomoh0-0)
*** Adding episode Basilisk Episode 3...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-1dx9xm-0)
*** Adding episode Basilisk Episode 2...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-5xt774-0)
*** Adding episode Basilisk Episode 1...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-br5fxd-0)
*** Adding episode Tsubasa Chronicles: Tokyo Revelations Episode 3...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-zmuwix-0)
*** Adding episode Tsubasa Chronicles: Tokyo Revelations Episode 2...
Not cached. Fetching from site...
*** Saving to cache (cache/animecrazy20090110-14992-1ah20eg-0)
*** Adding episode Tsubasa Chronicles: Tokyo Revelations Episode 1...
Not cached. Fetching from site...
This was written for fun, but primarily profit, and not for my own viewing pleasure. The only anime I’ve seen was Akira a decade or so ago, and only because the cover looked cool, but feel free to recommend your favorites.
Tags: Automation, Code example, hpricot, mechanize, Ruby, Scraping, tuts