Category Archives: Automation

Bulk Upload Images as Simple Products in Magento

The client has thousands of images, each of which is a (simple) product they want in Magento. Each image is named after the associated product’s SKU, with a JPG extension. I began by using a recorded batch Action in Photoshop to resize all the images, to reduce file size and make them more web-friendly.
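If you'd rather script that resizing step, here is a minimal sketch using ImageMagick instead of Photoshop (this is an alternative, not what was used above; it assumes the mogrify command is installed, and the 800px bound and 85% quality are arbitrary choices):

#!/usr/bin/env ruby
# Sketch: shrink every JPG in the current directory for the web.
# Assumes ImageMagick's mogrify is available on the PATH.
Dir.glob("*.JPG").each do |img|
  # Only shrink images larger than 800px on their longest side, then recompress.
  system("mogrify", "-resize", "800x800>", "-quality", "85", img)
end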

The following Ruby script generates a Magento Product CSV file from JPG images in a directory. It is pasted as-is.

#!/usr/bin/env ruby

require 'rubygems'
require 'csv'
require 'htmlentities'
require 'uri'


def main
    #Dir.glob("media/products/#{catname}/*").each_with_index do |imgpath, i|

    # Magento 1.7 CSV product headers
    csv_headers = '"store"	"websites"	"attribute_set"	"type"	"category_ids"	"sku"	"has_options"	"name"	"meta_title"	"meta_description"	"image"	"small_image"	"thumbnail"	"url_key"	"url_path"	"custom_design"	"page_layout"	"options_container"	"image_label"	"small_image_label"	"thumbnail_label"	"country_of_manufacture"	"msrp_enabled"	"msrp_display_actual_price_type"	"gift_message_available"	"price"	"special_price"	"weight"	"msrp"	"status"	"visibility"	"Featured"	"Deal"	"Hot"	"enable_googlecheckout"	"tax_class_id"	"is_recurring"	"description"	"short_description"	"meta_keyword"	"custom_layout_update"	"news_from_date"	"news_to_date"	"special_from_date"	"special_to_date"	"custom_design_from"	"custom_design_to"	"qty"	"min_qty"	"use_config_min_qty"	"is_qty_decimal"	"backorders"	"use_config_backorders"	"min_sale_qty"	"use_config_min_sale_qty"	"max_sale_qty"	"use_config_max_sale_qty"	"is_in_stock"	"low_stock_date"	"notify_stock_qty"	"use_config_notify_stock_qty"	"manage_stock"	"use_config_manage_stock"	"stock_status_changed_auto"	"use_config_qty_increments"	"qty_increments"	"use_config_enable_qty_inc"	"enable_qty_increments"	"is_decimal_divided"	"stock_status_changed_automatically"	"use_config_enable_qty_increments"	"product_name"	"store_id"	"product_type_id"	"product_status_changed"	"product_changed_websites"'

    path = "output.csv"
    open(path, 'w') { |f| f.puts csv_headers } 

    Dir.glob("*.JPG").each_with_index do |imgpath, i|
        image_name = File.basename(imgpath)
        image_url = "/#{image_name}"
        sku = image_name[0,image_name.size-4]
        name = sku
        desc_short = sku
        desc = sku
        csv = '"default"	"base"	"Default"	"simple"	"2"	"SKU"	"0"	"NAME"	"META_TITLE"	"META_DESC"	"IMAGE"	"IMAGE"	"IMAGE"	"url-key-SKU"	"url-key-SKU"	""	"No layout updates"	"Block after Info Column"	""	""	""	" "	"Use config"	"Use config"	"No"	"1.0000"	"China"	"1.0000"	""	"Enabled"	"Catalog, Search"	"0"	"0"	"0"	"Yes"	"None"	"No"	"LONG_DESC"	"SHORT_DESC"	"META_KEYWORDS"	""	""	""	""	""	""	""	"10000.0000"	"0.0000"	"1"	"0"	"0"	"1"	"1.0000"	"1"	"0.0000"	"1"	"1"	""	""	"1"	"0"	"1"	"0"	"1"	"0.0000"	"1"	"0"	"0"	"0"	"1"	"NAME"	"0"	"simple"	""	""'
        csv.gsub!('SKU', sku)
        csv.gsub!('NAME', name)
        csv.gsub!('LONG_DESC', desc)
        csv.gsub!('SHORT_DESC', desc_short)
        csv.gsub!('META_TITLE', sku)
        csv.gsub!('META_DESC', sku)
        csv.gsub!('META_KEYWORDS', sku)
        csv.gsub!('IMAGE', "#{image_url}")
        open(path, 'a') { |f| f.puts csv } 
    end

    p "Finished."
end

main


=begin

#require 'yaml'
#require 'hpricot'
#require 'tempfile'
#require 'mechanize'
#require 'open-uri'
#require 'highline/import'
#HighLine.track_eof = false

class Importer
  def initialize
    CSV.foreach('/Users/bluish/Desktop/export_ish.csv') do |row|
      p row.inspect
    end
    return
    
    CSV.open('/Users/bluish/Desktop/export_ish.csv', 'wb') do |csv|
      p 'injecting csv'
      csv << ['123', '', '456'] 
    end
  end
end

class String
  def decode
    HTMLEntities.new(:expanded).decode(self)
  end

  def encode
    #HTMLEntities.new(:expanded).encode(self)
    URI::escape(self)
  end
end
 
def saveImage(remote_url, thumb_url, dir='./media/import')
  require 'open-uri'
    begin
        base = File.basename(remote_url)
        img = "#{dir}/#{base}"
        
        unless File.exist? img
          p img
          open(img, 'wb') do |file|
            file << open(remote_url).read
          end
        end 
        
        thumb =  "#{dir}/thumb-#{base}"
        unless File.exist? thumb
          p thumb
          open(thumb, 'wb') do |file|
            file << open(thumb_url).read
         end
        end
        
        p "GOT MEDIA"

    rescue
       p "FAILED to grab #{remote_url}"
       p "or #{thumb_url}"
       p 'Possible 404? continuing...'
    end
end           

=end



Add Items to Category Programmatically in Magento

The code at the bottom takes the following array:

$categories = array(
#   "nickname" => category_id,
    "Eyeliner" => 107,
    "Lipstick" => 108,
    "General" => 18,
#   etc.
);

Each entry in the $lineItems array is a line of category + SKU items in the format “CategoryNick SKU1 SKU2 SKU3”. CategoryNick is matched against the $categories array above to figure out the Magento category ID. If the product is already in the category, nothing happens. Products retain the categories they are already associated with rather than having their existing category associations dropped.

$lineItems = Array();
$lineItems[] =     "Eyeliner 7897";
$lineItems[] =     "Eyeliner 7898 7772 7771 7770";
$lineItems[] =     "Lipstick 7909-15 7909-16 7939 7941 7984";
$lineItems[] =     "General 7940"; 

This is the full code. Save it as /filename.php in your Magento root and visit /filename.php in your browser to run it. Make a backup of your database before using this (and daily!). Don’t keep executable PHP files hanging around after use; if you must, make sure the permissions are 600 or 700 and NOT world-writable, or rename them to .phps so the web server won’t execute them.

<?php
define('MAGENTO', realpath(dirname(__FILE__)));
require_once MAGENTO . '/app/Mage.php';
Mage::app();

# assign duct tape to all categories
function FindBySku($sku) {
    return Mage::getModel('catalog/product')->loadByAttribute('sku', $sku);
}

function AddToCategory($product, $newcat, $save=false) {
    $cats = $product->getCategoryIds();

    # Don't double categories
    if (in_array($newcat, $cats)) return;

    $cats[] = $newcat;

    $product->setCategoryIds($cats);
    if ($save) return $product->Save();
}

# CHANGE ME
# 3 categories nicknamed and linked to respective Magento IDs
$categories = array(
    "Eyeliner" => 107,
    "Lipstick" => 108,
    "General" => 18,
);

# CHANGE ME
# Add 5 items to the Eyeliner category (107), 5 to Lipstick and 1 to General.
$lineItems = Array();
$lineItems[] =     "Eyeliner 7897";
$lineItems[] =     "Eyeliner 7898 7772 7771 7770";
$lineItems[] =     "Lipstick 7909-15 7909-16 7939 7941 7984";
$lineItems[] =     "General 7940"; 

foreach($lineItems as $line) {
    $pieces = explode(' ', $line);
    $cat = $pieces[0];

    #$skus = implode(' ', $pieces);
    #$skus = substr($skus, strpos($skus, ' ')+1, strlen($skus) - strlen($cat));
    # $cat is the category NickName and $skus is a space-delimited list of SKU #s

    $skus = array_slice($pieces, 1);
    # $cat is the category NickName and $skus is an array of SKU numbers

    if (array_key_exists($cat, $categories) && $categories[$cat] != 0) {
        $cat_id = $categories[$cat];
        echo "Adding to category $cat (id = $cat_id)\n";
        echo "\tAdding " . sizeof($skus) . " SKUs\n";
        #$item_skus = explode(" ", $skus);
        $i = 0;

        foreach($skus as $sku) {
            $product = FindBySku($sku);
            if (!$product) {
                echo "** ERROR: NO PRODUCT FOUND SKU=$sku\n";
                continue;
            }
            AddToCategory($product, $cat_id, true);
            ++$i;
            echo "\tadded $sku to $cat ";
        }

        echo "\nTOTAL $i/" . sizeof($skus) . " items to $cat\n\n";
    } else {
        echo "** ERROR: CANNOT FIND CATEGORY '$cat'\n\n";
    }
}

?>                            

Installing Alan Storm’s LayoutViewer in Magento (works with 1.7!)

I use the debug toolbar, but nothing beats Alan Storm’s configviewer and layoutviewer. Unfortunately, the only archive of this on Alan’s site is missing the module config file needed to activate it. Below is a simple shell script to download, install, and activate the LayoutViewer in any version of Magento; I can confirm it works flawlessly in Magento 1.7. If you’d like to install the module manually, read the “Manual Installation” section below.

Script

#!/bin/sh

#VIEWER_HTTP_DOWNLOAD="http://alanstorm.com/2005/projects/MagentoLayoutViewer.tar.gz"
VIEWER_HTTP_DOWNLOAD="http://biodegradablegeek.com/MagentoLayoutViewer.tar.gz"

# If in root, go to app/code/local
if [ -f "index.php" ]; then
    cd app/code/local;
fi

echo "Using curl to download Magento LayoutViewer"
curl -so - $VIEWER_HTTP_DOWNLOAD | tar xvzf -

echo "Writing app/etc/modules/ config file"
(
cat <<'ConfigFile'
<?xml version="1.0"?>
<config>
<modules>
 <Alanstormdotcom_Layoutviewer>
   <active>true</active>
   <codePool>local</codePool>
 </Alanstormdotcom_Layoutviewer>
</modules>
</config>
ConfigFile
) > ../../etc/modules/Alanstormdotcom_Layoutviewer.xml

echo "Done. Visit any page with ?showLayout=page"

Save this to a file and run it with sh from your Magento root.

Manual Installation

Download the MagentoLayoutViewer and extract the Alanstormdotcom folder to [magento-root]/app/code/local/

Create a new config file in app/etc/modules/ named Alanstormdotcom_Layoutviewer.xml and paste the following into it:

<?xml version="1.0"?>
<config>
<modules>
 <Alanstormdotcom_Layoutviewer>
   <active>true</active>
   <codePool>local</codePool>
 </Alanstormdotcom_Layoutviewer>
</modules>
</config>

Done. See usage below.

Module Usage

Visit any URL with ?showLayout=page (or handle, or package) to retrieve the layout XML.

The module also accepts a showLayoutFormat=text argument if you’d like plain text instead of XML.

Example: http://my-store.cxm/product/123?showLayout=page&showLayoutFormat=text

I Can’t Live Without My vim Config

I have updated the vim page with my vimrc/gvimrc configs. Instead of repeating myself, I will quote some parts of the page.

More details and the vim config itself here

I recommend turning backups on if you have them off. I personally hate having the ~ files all over my OS, so I keep them, along with the .swp files, in one backup directory under ~/.vim/.

The programming-language skeleton setup detects what kind of file you are editing and changes options in vim by inheriting from the specified files, which I put in ~/.vim/skeletons and ~/.vim/inherit.

The skeletons are automatically inserted in new files that vim is aware of. For example, in my own config, I have ~/.vim/inherit/c which has all the usual includes and int main() code. When I make a new C file (“gvim hello.c”), the new file begins with the skeleton code already present. Neat huh?

The inherit files can be used to set specific options for each language. This can mean different bindings, whitespace options, themes, etc., applied automatically depending on what language you’re working with.

See the vim page

What options have helped you the most?

Bash Script to Force an Empty Git Push

Sometimes, like when you’re testing hooks or trying to create synced remote and local repos, you’ll find yourself touching empty files just to get a git push going. This script automates this task by creating a unique temporary file, committing it, pushing, and then removing the file.

#!/bin/sh
TMP=tmp-`date +'%m%s'`
touch $TMP
git add $TMP
git commit $TMP -m '(forced push)'
git push
git rm $TMP
Usage, assuming you named it git-force and made it executable (chmod +x):

cd git-repo/
./git-force

I keep this in ~/bin/, which is in my $PATH. You might want to do the same if you use this a lot.

How to Maintain Static Sites with Git & Jekyll

Static sites in this context just means non-database-driven sites. Your static site can be an elaborate PHP script or just a few markup and image files. For this I am using Jekyll, a neat Ruby gem that makes working with static sites feel dynamic. It lets you create layouts and embed custom variables in your HTML (this is the “prototype” of the site).

Jekyll tackles all the nuisances involved in creating static pages (I used to add just enough PHP to make a layout). It works by running your prototype through some parsers and outputting plain static HTML, XML (RSS feeds), and so on. It’s perfect for lightweight sites that would be impractical on WordPress: a few static pages of information, landing pages, portfolio/resume pages, and parked domains.

Git takes care of keeping your development (local) and production (remote) environments synced. Git might be a little confusing if you’re learning it with the mindset that it works like Subversion.

I’ll update this post when the guide is done. For now, the following will assume you’re familiar with Jekyll (or at least have an empty file in the prototype directory) and git. This Bash script simplifies creating the remote git repository:

** Please read through the code and make sure you know what it does and what you’re doing. As of now, it is biased towards my own Apache/vhost setup; it’s trivial to edit for your specific needs. You’re using this at your own risk.

(direct link – repogen.sh)

#!/bin/sh
# 
# 04/01/2009 | http://biodegradablegeek.com | GPL 
# 
# You should be in site (NOT public) root (be in same dir as public/ log/ etc)
# proto/ is created and will house the jekyll prototype
# public/ will be the generated static site
# the public/ folder will be REMOVED and regenerated on every push
# 

if [ -z "$1" ]; then
  echo "Usage: ./repogen.sh domain.comn"
  exit
fi

# optional. will make it easier to copy/paste cmd to clone repo 
SSHURL="ssh.domain.com"
URL="$1"

echo "** creating tmp repo"
mkdir proto
cd proto
git init 
touch INITIAL
git add INITIAL
git commit -a -m "Initial Commit"

echo "** creating bare repo"
cd ..
git clone --bare proto proto.git
mv proto proto.old
git clone proto.git
rm -rf proto.old

echo "** generating hook"
HOOK=proto.git/hooks/post-update

mv $HOOK /tmp
echo '#!/bin/sh' >> $HOOK
echo '# To enable this hook, make this file executable by "chmod +x post-update".' >> $HOOK
echo '#exec git-update-server-info' >> $HOOK
echo '' >> $HOOK
echo '' >> $HOOK
echo 'URL='"$URL" >> $HOOK
echo 'PROTO="/home/$USER/www/$URL/proto"' >> $HOOK
echo 'PUBLIC="/home/$USER/www/$URL/public"' >> $HOOK
echo '' >> $HOOK
echo 'export GIT_DIR="$PROTO/.git"' >> $HOOK
echo 'pushd $PROTO > /dev/null' >> $HOOK
echo 'git pull' >> $HOOK
echo 'popd > /dev/null' >> $HOOK
echo '' >> $HOOK
echo "echo -----------------------------" >> $HOOK
echo "echo '** Pushing changes to '$URL" >> $HOOK
echo "echo '** Moving current public to /tmp'" >> $HOOK
echo 'mv "$PUBLIC" "/tmp/'$URL'public-`date '+%m%d%Y'`"' >> $HOOK
echo 'echo "** Generating new public"' >> $HOOK
echo 'jekyll "$PROTO" "$PUBLIC"' >> $HOOK

echo "** enabling hook"
chmod a+x $HOOK 

echo "** clone repo on local machina. example:"
echo "git clone ssh://$USER@$SSHURL/~$USER/www/$SSHURL/proto.git"

Usage

Your site structure might be different. Create repogen.sh by pasting the above code into a new file, then chmod a+x it to make it executable. This should be done on the remote server.

cd www/domain.com/
ls
public/ private/ log/ cgi-bin/

./repogen.sh domain.com

Now on your local machine, clone the new repo, move your files in, and push:

git clone ssh://[username]@ssh.domain.com/~[username]/www/domain.com/proto.git
cd proto/
cat "hello, world" &gt; index.htm
git add index.htm
git commit -a -m 'first local commit'
git push

After you push your changes, the post-update hook will delete the public/ directory (the root of the site). This dir and its contents are automatically generated and will get wiped out on EVERY push. Keep this in mind. All your changes and content should reside in proto/.

The proto/ repo will pull in the new changes, and then Jekyll will be invoked to generate the updated site in public/ from the prototype.

Should you need to edit it, the post-update hook is in the bare git repo (proto.git/hooks/).
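For reference, with domain.com as the argument, the hook that repogen.sh writes out ends up looking roughly like this:

#!/bin/sh
# To enable this hook, make this file executable by "chmod +x post-update".
#exec git-update-server-info

URL=domain.com
PROTO="/home/$USER/www/$URL/proto"
PUBLIC="/home/$USER/www/$URL/public"

export GIT_DIR="$PROTO/.git"
pushd $PROTO > /dev/null
git pull
popd > /dev/null

echo -----------------------------
echo '** Pushing changes to 'domain.com
echo '** Moving current public to /tmp'
mv "$PUBLIC" "/tmp/domain.compublic-`date '+%m%d%Y'`"
echo "** Generating new public"
jekyll "$PROTO" "$PUBLIC"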

Thanks to the authors in the posts below for sharing ideas. I first read this git method on dmiessler’s site.

Resources:
dmiessler.com – using git to maintain static pages
toroid.org – using git to manage a web site
Jekyll @ GitHub
git info
more git info

Scraping Google Trends with Mechanize and Hpricot

This is a small Ruby script that fetches the top 100 Google Trends for a specific date. If multiple dates are searched, one can find out how many times a keyword occurred between two dates, or just find out which keywords constantly appear on the top 100 list. The script is incomplete: you must fill in the “implement me!” methods to get full functionality (a sketch of two of them follows the script). In its current state, it should serve as a good starting point for scraping Google Trends.

On a technical note, it uses mechanize, hpricot, and tempfile (for the cache). A lot of this is just copy-and-paste programming from the earlier anime scraper.

To grab the gems (rdoc takes 10x as long as the gem to fetch and install):

sudo gem install mechanize --no-rdoc
sudo gem install hpricot --no-rdoc
#!/usr/bin/env ruby
# biodegradablegeek.com
# public domain
#

require 'rubygems'
require 'hpricot'
require 'tempfile'
require 'mechanize'
require 'yaml'
#require 'highline/import'
#HighLine.track_eof = false

$mech = WWW::Mechanize.new
$mech.user_agent_alias = 'Mac Safari'
$master = []

def puts2(txt=''); puts "*** #{txt}"; end

class Cache
  def initialize
    # Setup physical cache location
    @path = 'cache'
    Dir.mkdir @path unless File.exists? @path

    # key/val = url/filename (of fetched data)
    @datafile = "#{@path}/cache.data"
    @cache = load @datafile
  end

  def put key, val
    tf = Tempfile.new('googletrends', @path)
    path = tf.path
    tf.close! # important!

    puts2 "Saving to cache (#{path})"
    open(path, 'w') { |f|
      f.write(val)
      @cache[key] = path
    }

    save @datafile
  end

  def get key
    return nil unless exists?(key) && File.exists?(@cache[key])
    open(@cache[key], 'r') { |f| f.read }
  end

  def files
    @cache.values
  end

  def first
    @cache.first
  end

  def exists? key
    @cache.has_key? key
  end

private
  # Load saved cache
  def load file
    return File.exists?(file) ? YAML.load(open(file).read) : {}
  end

  # Save cache
  def save path
    open(path, 'w') { |f|
      f.write @cache.to_yaml
    }
  end
end

$cache = Cache.new

def fetch(url)
  body = $mech.get(url).body()
  $cache.put(url, body)
  body
end

def getPage(url)
  body = $cache.get(url)

  if body.nil?
    puts "Not cached. Fetching from site..."
    body = fetch url
  end
  body
end

def loadState
  mf = 'cache/master.data'
  $master = File.exists?(mf) ? YAML.load(open(mf).read) : {}
  $master = {} if $master==false
end

def saveState
  open('cache/master.data', 'w+') { |f|
    f.write $master.to_yaml
  }
end

def main
  #loadState

  # Grab top 100 Google Trends (today)
  #date = Time.now.strftime '%Y-%m-%d'
  date = '2009-01-21'

  puts2 "Getting Google's top 100 search trends for #{date}"
  url = "http://www.google.com/trends/hottrends?sa=X&date=#{date}"
  puts2 url

  begin
    body = getPage(url)
  rescue WWW::Mechanize::ResponseCodeError
    puts2 "Couldn't fetch URL. Invalid date..?"
    exit 5
  end

  puts2 "Fetched page (#{body.size} bytes)"

  if body['There is no data on date']
    puts2 'No data available for this date.'
    puts2 'Date might be too old or too early for report, or just invalid'
    exit 3
  end

  doc = Hpricot(body)

  (doc/"td[@class='hotColumn']/table[@class='Z2_list']//tr").each do |tr|
    td = (tr/:td)
    num = td[0].inner_text.sub('.','').strip
    kw = td[1].inner_text
    url = (td[1]/:a).first[:href]
    Keyword.find_or_new(kw) << Occurance.new(num, date, url)
  end
  puts "Got info on #{$master.size} keywords for #{date}"
  puts "keyword '#{$master.first.name}' occured #{$master.first.occurances} times"
end

class Occurance
  attr_accessor :pos, :date, :url
  def initialize(pos, date, url)
    @pos = pos
    @date = date
    @url = url
  end
end

class Keyword
  attr_accessor :name, :occurances
  def initialize(name)
    @name = name
    @occurances = []
    @position_average = nil
    @count = nil
    $master << self
  end

  def self.find_or_new(name)
    x = $master.find { |m| name==m.name }
    x || Keyword.new(name)
  end

  def << occurance
    @occurances << occurance
  end

  def occured_on? datetime
    raise 'implement me'
  end

  def occured_between? datetime
    raise 'implement me'
  end

  def occurances datetime=nil
    raise 'implement me' if datetime
    @occurances.size
  end

  def occurances_between datetime
    raise 'implement me'
  end

  def pos_latest
    @occurances.last.date
  end

  def pos_average
    @position_average
  end

  def pos_average_between datetime
    raise 'implement me'
  end
end

#   Instance= [num, date, url]
#   Keyword=[Instance, Instance, Instance]
#   Methods for keywords:
#   KW.occured_on? date
#   KW.occured_between? d1, d2
#   KW.occurances
#   KW.occurances_between? d1, d2
#   KW.pos_latest
#   KW.pos_average
#   KW.pos_average_between

#   KW has been on the top 100 list KW.occurances.size times
#   The #1 keywords for the month of January: Master.sort_by KW.occurances_between? Jan1,Jan31.pos_average_between Jan1,Jan31
#
#   Top keywords: sort by KW.occurances.size = N keyword was listed the most.
#   Top keywords for date D: Master.sort_by KW.occured_on (x).num

main
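For example, the occured_on? and occured_between? stubs could be filled in along these lines. This is only a sketch: it assumes the dates stay in the 'YYYY-MM-DD' string form used above, and it gives occured_between? two arguments, as the notes at the bottom of the script suggest.

require 'date'

class Keyword
  # Was this keyword on the top-100 list on the given day?
  def occured_on? datetime
    day = Date.parse(datetime.to_s)
    @occurances.any? { |o| Date.parse(o.date) == day }
  end

  # Was this keyword on the list at any point between two days (inclusive)?
  def occured_between? from, to
    from = Date.parse(from.to_s)
    to = Date.parse(to.to_s)
    @occurances.any? { |o| d = Date.parse(o.date); (d >= from) && (d <= to) }
  end
end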

AnimeCrazy Scraper Example Using Hpricot & Mechanize

This is a little (as of now incomplete) scraper I wrote to grab all the anime video code off of AnimeCrazy (dot) net. This site doesn’t host any videos on its own server, but just embeds ones that have been uploaded to other sites (Megavideo, YouTube, Vimeo, etc). I don’t know who the original uploaders of the videos are, but I’ve seen this same collection of anime links being used on some other sites. This site has about 10,000 episodes/parts (1 movie may have 6+ parts). The scraper below was only tested with “completed anime shows” and got around 6300 episodes. The remaining content (anime movies and running anime shows) should work as-is, but I personally held off on getting those because I want to examine them closely to try cleaning up the inconsistencies as much as possible.

This scraper needs some initial setup and won’t work out of the box, but I’m including it here in the hopes that it will serve as a decent example of a small real world scraper, if you’re looking to learn the basics of scraping with Hpricot and Mechanize. Let me know if you find any use for it. I will update the posted code later this week when I have time to complete it and add some more features.

There’s one major problem with the organization of episodes on AnimeCrazy, and it’s the fact that some episodes are glued together into one post. Right now the scraper stops and asks you how to proceed when it comes across such a post. You basically need to tell the scraper if a post (page) contains 1 episode (video) or multiple. If there’s 1, it proceeds on its own, but if there’s two, it requires that you give it the names and links of each individual episode (part1 and part2 usually). Sometimes 2 episodes are together in 1 video. Sorta like those music albums on KaZaA or LimeWire that are basically ripped as one huge mp3 instead of individual songs.

This only accounts for maybe 30-40 out of 6,000 videos, and it’s not that big of a deal because the amount of work needed to proceed with the scraping is small. But it IS work, and a bitch slap to the entire concept of automation. Coding around the issue is a major hassle, though, and there would still be a high chance that some inconsistencies would slip through. It would be far less work to just find another anime site that is more consistent, but the reason AnimeCrazy is good is that it’s active, and the site IS updated manually these days, as far as I can tell.

BTW, Why The Lucky Stiff rocks, and Hpricot is amazing. But the serious scrapologist should consider scrAPI or sCRUBYt (uses Hpricot) for big projects.


#!/usr/bin/env ruby
# License: Public domain. Go sell it to newbs on DigitalPoint.

require 'rubygems'
require 'hpricot'
require 'mechanize'
require 'tempfile'
require 'yaml'
require 'highline/import'
HighLine.track_eof = false

$mech = WWW::Mechanize.new
$mech.user_agent_alias = 'Mac Safari'

###############################
$skip_until = false
DEBUG=false
###############################

def debug?
  DEBUG
end

def puts2(txt='')
  puts "*** #{txt}"
end

#  Anime has: title, type (series, movie), series
#  Episode has name/#, description, parts (video code)

class Episode
  attr_accessor :name, :src, :desc, :cover
  def initialize(title, page)
    @src = page # parts (megavideo, youtube etc)
    @name = title
    @desc = nil # episode description
    @cover = nil # file path
  end
end

class Anime
  attr_accessor :name, :page, :completed, :anime_type, :episodes
  def initialize(title, page)
    @name = title
    @page = page
    @episodes = []
    @anime_type = 'series'
    @completed = false
  end

  def complete!
    @completed = true
  end

  def episode! episode
    @episodes << episode
  end
end

class Cache
  def initialize
    # Setup physical cache location
    @path = 'cache'
    Dir.mkdir @path unless File.exists? @path

    # key/val = url/filename (of fetched data)
    @datafile = "#{@path}/cache.data"
    @cache = load @datafile
    #puts @cache.inspect
  end

  def put key, val
    tf = Tempfile.new('animecrazy', @path)
    path = tf.path
    tf.close! # important!

    puts2 "Saving to cache (#{path})"
    open(path, 'w') { |f|
      f.write(val)
      @cache[key] = path
    }

    save @datafile
  end

  def get key
    return nil unless exists?(key) && File.exists?(@cache[key])
    open(@cache[key], 'r') { |f| f.read }
  end

  def exists? key
    @cache.has_key? key
  end

private
  # Load saved cache
  def load file
    return File.exists?(file) ? YAML.load(open(file).read) : {}
  end

  # Save cache
  def save path
    open(path, 'w') { |f|
      f.write @cache.to_yaml
    }
  end
end

$cache = Cache.new

def fetch(url)
  body = $mech.get(url).body()
  $cache.put(url, body)
  body
end

def getPage(url)
  # First let's see if this is cached already.
  body = $cache.get(url) 

  if body.nil?
    puts "Not cached. Fetching from site..."
    body = fetch url
  end
  body
end

def main
  # Open anime list (anime_list = saved HTML of sidebar from animecrazy.net)
  anime_list = Hpricot(open('anime_list', 'r') { |f| f.read })
  puts2 "Anime list open"

  # Read in the URL to every series
  masterlist = []

  (anime_list/:li/:a).each do |series|
    anime = Anime.new(series.inner_text, series[:href])
    masterlist << anime
    puts2 "Built structure for #{anime.name}..."
  end

  puts2

  puts2 "Fetched #{masterlist.size} animes. Now fetching episodes..."
  masterlist.each do |anime|
    puts2 "Fetching body (#{anime.name})"
    body = getPage(anime.page)
    puts2 "Snatched that bitch (#{body.size} bytes of Goku Goodness)"
    puts2

    doc = Hpricot(body)
    (doc/"h1/a[@rel='bookmark']").each do |episode|
      name = clean(episode.inner_text)

      if $skip_until
        #$skip_until = !inUrl(episode[:href], 'basilisk-episode-2')
        #$skip_until = nil == name['Tsubasa Chronicles']
        puts2 "Resuming from #{episode[:href]}" if !$skip_until
        next
      end

      # Here it gets tricky. This is a major source of inconsistencies in the site.
      # They group episodes into 1 post sometimes, and the only way to find
      # out from the title of the post is by checking for the following patterns
      # (7 and 8 are example episode #s)
      # X = 7+8, 7 + 8, 7 and 8, 7and8, 7 & 8, 7&8

      # If an episode has no X then it is 1 episode.
      # If it has multiple parts, they are mirrors.
      if single_episode? name
        begin

          puts2 "Adding episode #{name}..."
          ep = Episode.new(name, episode[:href])
          ep.src = getPage(episode[:href])
          anime.episode! ep
        rescue WWW::Mechanize::ResponseCodeError
          puts2 "ERROR: Page not found? Skipping..."
          puts name
          puts2 episode[:href]
        end
      else
        # If an episode DOES have X, it *may* have 2 episodes (but may have mirrors, going up to 4 parts/vids per page).
        # Multiple parts will be the individual episodes in chronological order.
        puts2 "Help me! I'm confused @ '#{name}'"
        puts2 "This post might contain multiple episodes..."

        puts2 "Please visit this URL and verify the following:"
        puts episode[:href]

        if agree("Is this 1 episode? yes/no ")
          begin
            puts2 "Adding episode #{name}..."
            ep = Episode.new(name, episode[:href])
            ep.src = getPage(episode[:href])
            anime.episode! ep
          rescue WWW::Mechanize::ResponseCodeError
            puts2 "ERROR: Page not found? Skipping..."
            puts name
            puts2 episode[:href]
          end
        else
          more = true
          while more
            ename = ask("Enter the name of an episode: ")
            eurl =  ask("Enter the URL of an episode: ")

            begin
              puts2 "Adding episode #{ename}..."
              ep = Episode.new(ename, eurl)
              ep.src = getPage(eurl)
              anime.episode! ep
            rescue WWW::Mechanize::ResponseCodeError
              puts2 "ERROR: Page not found? Skipping..."
              puts ename
              puts2 eurl
            end
            more = agree("Add another episode? Y/N")
          end
          puts2 "Added episodes manually... moving on"
        end
      end
    end
    anime.complete!
    # XXX save the entire anime object, instead of just cache
  end
end

def inTitle(document, title)
  return (document/:title).inner_text[title]
end

def inUrl(url, part)
  return url[part]
end

def single_episode?(name)
  !(name =~ /[0-9] ?([+&]|and) ?[0-9]/)
end

def clean(txt)
  # This picks up most of them, but some are missing. Like *Final* and just plain "Final"
  txt[' (Final)']='' if txt[' (Final)']
  txt[' (Final Episode)']='' if txt[' (Final Episode)']
  txt[' (FINAL)']='' if txt[' (FINAL)']
  txt[' (FINAL EPISODE)']='' if txt[' (FINAL EPISODE)']

  txt['(Final)']='' if txt['(Final)']
  txt['(Final Episode)']='' if txt['(Final Episode)']
  txt['(FINAL)']='' if txt['(FINAL)']
  txt['(FINAL EPISODE)']='' if txt['(FINAL EPISODE)']

  txt
end

main

If you’re writing your own scraper and would like to reuse the minimal caching functionality shown above, you can gut everything in main() and drop in your own code. Feel free to contact me for assistance.
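For instance, a gutted main() might look something like this; the URL and the Hpricot selector are placeholders you would replace with your own:

def main
  # getPage() goes through the Cache class, so repeated runs reuse
  # the saved copy instead of hitting the site again.
  body = getPage('http://example.com/archive')
  doc = Hpricot(body)

  (doc/'h2 a').each do |link|
    puts2 "#{link.inner_text} -> #{link[:href]}"
  end
end

main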


Quick BASH Script to Dump & Compress a MySQL Database

A quick script I whipped up to dump my MySQL database.
Usage: sh backthatsqlup.sh

(be warned that it dumps ALL databases. This can get huge uncompressed)


#!/bin/sh
# Isam (Biodegradablegeek.com) public domain 12/28/2008
# Basic BASH script to dump and compress a MySQL dump

out=sequel_`date +'%m%d%Y_%M%S'`.sql
dest=/bx/

function e {
  echo -e "n** $1"
}

e "Dumping SQL file ($out). May take awhile..."
#echo "oh snap" &gt; $out
sudo mysqldump -u root -p --all-databases &gt; $out
if [ $? -ne 0 ]; then
  e "MySQL dump failed. Check that server is up and your username/pass"
  exit 7
fi

e "Uncompressed SQL file size"
du -hs $out

e "Compressing SQL file"
gz=$out.tar.gz
tar -zvvcf $gz $out
rt=$?

if [ $rt -ne 0 ]; then
  e "tar failed (error=$rt). Will NOT remove uncompressed SQL file"
else
  e "Removing uncompressed SQL file"
  rm -f $out
  out=$gz

  e "Compressed SQL file size"
  du -hs $out
fi

e "Moving shit to '$dest'"
sudo mv $out $dest

download BackThatSqlUp.sh

Using Javascript to Populate Forms During Development

During development, working with forms quickly gets annoying because you have to constantly fill in each field, sometimes with unique info. One way around this is to write a little Javascript that just populates the fields. I use something like this at the bottom of the form. I had jQuery’s no-conflict mode on in this case; in your app you might be able to get away with replacing _j() with $():

<% if ENV['RAILS_ENV']=='development' -%>
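<script type="text/javascript">
// Hypothetical example only: the field IDs below are placeholders, and _j is
// assumed to be the jQuery no-conflict alias mentioned above. Swap in $() if
// you aren't running in no-conflict mode.
_j(function() {
  _j('#user_name').val('Test User');
  _j('#user_email').val('test+' + new Date().getTime() + '@example.com');
  _j('#user_password').val('secret123');
  _j('#user_password_confirmation').val('secret123');
});
</script>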


<% end -%>