Craigslist feed aggregator

Posted by Joel Jensen Mon, 21 Aug 2006 01:14:29 GMT

Ok I am done with the scooterbot thing. Here is the last iteration, I was also honing up on ruby/ xml/ databases …

Its a feed aggregator that searches all craigslist cities for a term, stores the results in a mysql database, creates an rss feed and uploads it to a server. Though while writing this I realize I could get rid of the mysql database and use the xml as a datasource. But no, I’m done. Heres the code if its useful

I wrote an automator script to launch this. Run shell script ::

cd to_your_directory
./the_scripts_filename

When it completes launch safari and visit your rss feed.

#!/usr/bin/ruby
require 'mysql'
require 'rexml/document'
require 'time'
require 'open-uri'
require 'URI'

host = <host>
password = <password>
user = <login>
cd_to = <your www directory>
saved_file_name = <filename>
searchterm = 'aprilia'


# CREATE DATABASE CRAIGSLIST;
# CREATE TABLE `items` (
#   `id` int(11) NOT NULL default '0',
#   `title` varchar(255) default NULL,
#   `url` varchar(255) default NULL,
#   `date` datetime default NULL,
#   `xml` text,
#   `viewed` int(11) default '0'
# ) ENGINE=MyISAM DEFAULT CHARSET=utf8


header = <<EOF
<rss version='0.91'
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:ev="http://purl.org/rss/1.0/modules/event/"
  xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:admin="http://webns.net/mvcb/"
  >
 <channel>
  <title>Where the hell is my scooter? bot</title>
  <link>www.nervetree.com/scooter.xml</link>
  <description>Where the hell is my scooter? bot</description>
  <language>us-en</language>
  <copyright>None</copyright>
  <managingEditor></managingEditor>
  <webMaster></webMaster>
  <image>
   <title></title>
   <url></url>
   <width></width>
   <height></height>
   <description>Where the hell is my scooter? bot</description>
  </image>
</channel>
</rss>
EOF


here = Dir.pwd
area_ids = [1,127,231,200,207,51,244,18,57,293,100,63,187,43,189,104,7,285,96,102,103,209,188,12,8,1,191,62,97,208,210,13,287,315,288,44,168,281,193,10,238,236,125,219,80,20,39,203,237,186,37,124,258,14,256,257,205,28,52,190,11,224,223,225,229,227,226,45,228,98,307,280,99,133,58,199,283,284,31,206,169,34,4,239,173,240,172,22,259,129,261,212,309,260,262,255,19,316,230,134,222,30,221,29,192,282,55,26,92,198,170,286,50,218,59,248,40,249,201,250,3,126,130,247,171,41,273,61,36,274,272,196,251,35,27,42,131,204,252,54,70,233,94,216,9,232,275,166,279,167,277,17,33,278,276,180,38,128,101,253,254,195,220,202,46,32,269,15,264,266,265,21,132,23,271,267,263,268,53,308,270,292,56,93,291,290,48,60,289,217,2,95,246,194,243,242,241,165,47,197]

feeds = area_ids.collect{|aid| 
    "http://www.craigslist.org/cgi-bin/search?areaID=#{aid}&subAreaID=&query=#{URI.escape(searchterm)}&catAbbreviation=sss&format=rss"
}

$my = Mysql.connect('localhost', 'root')
$my.select_db('craigslist')

feeds.each {
    |feed|
    ids = []
    results = $my.query('select id from items')
    results.each{|result| ids << result[0]}

    rss = open(feed).read()
    xml = REXML::Document.new(rss)
    xml.elements.each("//item") { |item|
      id = /(\d+).html/.match(item.attributes['about'])[1]
      url = item.attributes['about']

      if ids.include?(id) == false
        st = $my.prepare("insert into items (id,title,url,date,xml) values (?,?,?,?,?)")
        st.execute(id,item.elements['title'].text,item.elements['link'].text,item.elements['dc:date'].text,item.to_s)
        st.close
      end
    }
}

xmlstuff = []
results = $my.query('select xml from items order by date desc')
results.each{|result| xmlstuff << result[0]}
$my.close

# header = File.new here+'/rss_header.xml'
maindoc = REXML::Document.new( header )
rt = maindoc.elements['rss/channel']
xmlstuff.each{|e| 
  elementdoc = REXML::Document.new(e)
  rt.add_element elementdoc 
}

scooter = StringIO.new ""
maindoc.write scooter
scooter.rewind

require 'net/ftp'
ftp = Net::FTP.new(host)
ftp.login(user,password)
ftp.passive = true
files = ftp.chdir(cd_to)
ftp.storlines("STOR " + saved_file_name , scooter)
ftp.close
p 'done'

CSS3 Multi-column printing of a webpage

Posted by Joel Jensen Sat, 19 Aug 2006 16:43:00 GMT

I have written some reports for a real estate brokerage I am building a site for, they are printouts of all the agents 300 or so. The office uses these as phone lists. They wanted it to print out mulitcolum like MS-Word multicolumn. Trouble is if I made it one huge table spit in 2 columns you would get a-m in column one and n-z in column 2, not flowing on each printed page. Enter CSS3.

I heard about CSS3 support in some browsers, and though I would give it a try in firefox.

<style type="text/css">
  #wrap {
    -moz-column-width:20em;
    -moz-column-count: 3;
    -moz-column-gap: 20px;
  }
</style>

link to docs

Here is the CSS code for a <div id='wrap'>list goes here</div>.

Works great.

Backup your MySQL databases and your user folder on your mac with Automator

Posted by Joel Jensen Sat, 19 Aug 2006 02:15:00 GMT

I have been backing my stuff up weekly using an applescript. But ran into a wall trying to automate the mysql backup. The issue is I have ALOT of development databases and I do not want a single dumpfile, else I would have to blow away and rebuild everything if I needed a refresh of one table. I wrote this nifty automator script.

  • 1 Basicly the steps are drag Automator::Run Shell Script.
  • 2 Fill in the script.
  • 3 Drag Mail::New Message and send confirmation to yourself. I would have preferred to add an event to iCal but there is no Time.now() feature in the iCal Automator actions, so my system would always be backed up freshly August 18th 2006 at 12:50 pm, This is a problem.
  • 4 Drag the Mail::Send Outgoing Messages
  • 5 Save as a workflow Automator Here is the script from step one. It asssumes there is no root password. I never open my laptop up to the world so a password is a hassle.

I created a directory in /Users/joel/backup_mysql/ to hold the archived mysql I created a directory in /Volumes/DROPSHIP/backup to archive to ( This is an external firewire drive )

for x in `/usr/local/mysql/bin/mysql -u root -Bse "show databases"`; do
/usr/local/mysql/bin/mysqldump -u root $x | /sw/bin/gzip -9 > /Users/joel/backup_mysql/$x.gz
done;
/usr/bin/rsync -aE --delete ~ /Volumes/DROPSHIP/backup 2>>~/rsyncErr.txt || echo -n;

To execute, rightclick :: Automator :: Backup Laptop … run Automator

Craigslist search all cities script 2

Posted by Joel Jensen Wed, 16 Aug 2006 15:12:00 GMT

My scooter got stolen a while ago, and I am tired of looking at craigslist to find it. So I wrote this script to do it for me. I wrote a bot that looked at all the cities, and found the citynumber, took the results and plugged it in here. It looks at all the craigslist cities [cityname, citynumber] goes out to there, says its Internet explorer and does a search. The search results are an html file that has the name of the search as the title. Takes a few minutes to run, but can be automated.

Usage for me looking for my scooter is this => search_all_craigslist.py aprilia

#!/usr/bin/env python
# encoding: utf-8
"""
search_all_craigslist.py

Copyright © 2006 Joel Jensen. All Rights Reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY JOEL JENSEN "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

"""

import sys
import getopt
import urllib2, urllib, re, smtplib, time

help_message = '''
Usage search_all_craigslist.py searchterm [searchterm]
Example:
search_all_craigslist.py red wagon
'''


class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg


def main(argv=None):
    if argv is None:
        argv = sys.argv
    try:
        try:
            opts, args = getopt.getopt(argv[1:], "ho:v", ["help", "output="])
        except getopt.error, msg:
            raise Usage(msg)

        # option processing
        for option, value in opts:
            if option == "-v":
                verbose = True
            if option in ("-h", "--help"):
                raise Usage(help_message)
            if option in ("-o", "--output"):
                output = value

        cl_search(args)

    except Usage, err:
        print >> sys.stderr, sys.argv[0].split("/")[-1] + ": " + str(err.msg)
        print >> sys.stderr, "\t for help use --help"
        return 2


def cl_search(search):
  print 'started on',time.strftime('%X %x %Z')

  searchterm = urllib.quote(" ".join(search))
  cities = """http://www.craigslist.org,1
  http://bham.craigslist.org,127
  http://huntsville.craigslist.org,231
  http://mobile.craigslist.org,200
  http://montgomery.craigslist.org,207
  http://geo.craigslist.org/iso/us/ak,51
  http://flagstaff.craigslist.org,244
  http://phoenix.craigslist.org,18
  http://tucson.craigslist.org,57
  http://fayar.craigslist.org,293
  http://littlerock.craigslist.org,100
  http://bakersfield.craigslist.org,63
  http://chico.craigslist.org,187
  http://fresno.craigslist.org,43
  http://humboldt.craigslist.org,189
  http://inlandempire.craigslist.org,104
  http://losangeles.craigslist.org,7
  http://merced.craigslist.org,285
  http://modesto.craigslist.org,96
  http://monterey.craigslist.org,102
  http://orangecounty.craigslist.org,103
  http://palmsprings.craigslist.org,209
  http://redding.craigslist.org,188
  http://sacramento.craigslist.org,12
  http://sandiego.craigslist.org,8
  http://sfbay.craigslist.org,1
  http://slo.craigslist.org,191
  http://santabarbara.craigslist.org,62
  http://stockton.craigslist.org,97
  http://ventura.craigslist.org,208
  http://cosprings.craigslist.org,210
  http://denver.craigslist.org,13
  http://fortcollins.craigslist.org,287
  http://pueblo.craigslist.org,315
  http://rockies.craigslist.org,288
  http://hartford.craigslist.org,44
  http://newhaven.craigslist.org,168
  http://newlondon.craigslist.org,281
  http://geo.craigslist.org/iso/us/de,193
  http://geo.craigslist.org/iso/us/dc,10
  http://daytona.craigslist.org,238
  http://fortlauderdale.craigslist.org,236
  http://fortmyers.craigslist.org,125
  http://gainesville.craigslist.org,219
  http://jacksonville.craigslist.org,80
  http://miami.craigslist.org,20
  http://orlando.craigslist.org,39
  http://pensacola.craigslist.org,203
  http://sarasota.craigslist.org,237
  http://tallahassee.craigslist.org,186
  http://tampa.craigslist.org,37
  http://westpalmbeach.craigslist.org,124
  http://athensga.craigslist.org,258
  http://atlanta.craigslist.org,14
  http://augusta.craigslist.org,256
  http://macon.craigslist.org,257
  http://savannah.craigslist.org,205
  http://geo.craigslist.org/iso/us/hi,28
  http://geo.craigslist.org/iso/us/id,52
  http://chambana.craigslist.org,190
  http://chicago.craigslist.org,11
  http://peoria.craigslist.org,224
  http://rockford.craigslist.org,223
  http://springfieldil.craigslist.org,225
  http://bloomington.craigslist.org,229
  http://evansville.craigslist.org,227
  http://fortwayne.craigslist.org,226
  http://indianapolis.craigslist.org,45
  http://southbend.craigslist.org,228
  http://desmoines.craigslist.org,98
  http://quadcities.craigslist.org,307
  http://topeka.craigslist.org,280
  http://wichita.craigslist.org,99
  http://lexington.craigslist.org,133
  http://louisville.craigslist.org,58
  http://batonrouge.craigslist.org,199
  http://lafayette.craigslist.org,283
  http://lakecharles.craigslist.org,284
  http://neworleans.craigslist.org,31
  http://shreveport.craigslist.org,206
  http://geo.craigslist.org/iso/us/me,169
  http://geo.craigslist.org/iso/us/md,34
  http://boston.craigslist.org,4
  http://capecod.craigslist.org,239
  http://westernmass.craigslist.org,173
  http://worcester.craigslist.org,240
  http://annarbor.craigslist.org,172
  http://detroit.craigslist.org,22
  http://flint.craigslist.org,259
  http://grandrapids.craigslist.org,129
  http://kalamazoo.craigslist.org,261
  http://lansing.craigslist.org,212
  http://nmi.craigslist.org,309
  http://saginaw.craigslist.org,260
  http://up.craigslist.org,262
  http://duluth.craigslist.org,255
  http://minneapolis.craigslist.org,19
  http://rmn.craigslist.org,316
  http://gulfport.craigslist.org,230
  http://jackson.craigslist.org,134
  http://columbiamo.craigslist.org,222
  http://kansascity.craigslist.org,30
  http://springfield.craigslist.org,221
  http://stlouis.craigslist.org,29
  http://geo.craigslist.org/iso/us/mt,192
  http://lincoln.craigslist.org,282
  http://omaha.craigslist.org,55
  http://lasvegas.craigslist.org,26
  http://reno.craigslist.org,92
  http://geo.craigslist.org/iso/us/nh,198
  http://newjersey.craigslist.org,170
  http://southjersey.craigslist.org,286
  http://albuquerque.craigslist.org,50
  http://santafe.craigslist.org,218
  http://albany.craigslist.org,59
  http://binghamton.craigslist.org,248
  http://buffalo.craigslist.org,40
  http://hudsonvalley.craigslist.org,249
  http://ithaca.craigslist.org,201
  http://longisland.craigslist.org,250
  http://newyork.craigslist.org,3
  http://rochester.craigslist.org,126
  http://syracuse.craigslist.org,130
  http://utica.craigslist.org,247
  http://asheville.craigslist.org,171
  http://charlotte.craigslist.org,41
  http://fayetteville.craigslist.org,273
  http://greensboro.craigslist.org,61
  http://raleigh.craigslist.org,36
  http://wilmington.craigslist.org,274
  http://winstonsalem.craigslist.org,272
  http://geo.craigslist.org/iso/us/nd,196
  http://akroncanton.craigslist.org,251
  http://cincinnati.craigslist.org,35
  http://cleveland.craigslist.org,27
  http://columbus.craigslist.org,42
  http://dayton.craigslist.org,131
  http://toledo.craigslist.org,204
  http://youngstown.craigslist.org,252
  http://oklahomacity.craigslist.org,54
  http://tulsa.craigslist.org,70
  http://bend.craigslist.org,233
  http://eugene.craigslist.org,94
  http://medford.craigslist.org,216
  http://portland.craigslist.org,9
  http://salem.craigslist.org,232
  http://erie.craigslist.org,275
  http://harrisburg.craigslist.org,166
  http://lancaster.craigslist.org,279
  http://allentown.craigslist.org,167
  http://pennstate.craigslist.org,277
  http://philadelphia.craigslist.org,17
  http://pittsburgh.craigslist.org,33
  http://reading.craigslist.org,278
  http://scranton.craigslist.org,276
  http://geo.craigslist.org/iso/us/pr,180
  http://geo.craigslist.org/iso/us/ri,38
  http://charleston.craigslist.org,128
  http://columbia.craigslist.org,101
  http://greenville.craigslist.org,253
  http://myrtlebeach.craigslist.org,254
  http://geo.craigslist.org/iso/us/sd,195
  http://chattanooga.craigslist.org,220
  http://knoxville.craigslist.org,202
  http://memphis.craigslist.org,46
  http://nashville.craigslist.org,32
  http://amarillo.craigslist.org,269
  http://austin.craigslist.org,15
  http://beaumont.craigslist.org,264
  http://brownsville.craigslist.org,266
  http://corpuschristi.craigslist.org,265
  http://dallas.craigslist.org,21
  http://elpaso.craigslist.org,132
  http://houston.craigslist.org,23
  http://laredo.craigslist.org,271
  http://lubbock.craigslist.org,267
  http://mcallen.craigslist.org,263
  http://odessa.craigslist.org,268
  http://sanantonio.craigslist.org,53
  http://easttexas.craigslist.org,308
  http://waco.craigslist.org,270
  http://provo.craigslist.org,292
  http://saltlakecity.craigslist.org,56
  http://geo.craigslist.org/iso/us/vt,93
  http://blacksburg.craigslist.org,291
  http://charlottesville.craigslist.org,290
  http://norfolk.craigslist.org,48
  http://richmond.craigslist.org,60
  http://roanoke.craigslist.org,289
  http://bellingham.craigslist.org,217
  http://seattle.craigslist.org,2
  http://spokane.craigslist.org,95
  http://yakima.craigslist.org,246
  http://geo.craigslist.org/iso/us/wv,194
  http://appleton.craigslist.org,243
  http://eauclaire.craigslist.org,242
  http://greenbay.craigslist.org,241
  http://madison.craigslist.org,165
  http://milwaukee.craigslist.org,47
  http://geo.craigslist.org/iso/us/wy,197"""

  outfile = searchterm +'_search.html'
  searchfor =  re.compile(r'^<p>&nbsp;\w.*$',re.M)
  urls = cities.splitlines()
  txdata = None
  txheaders = {
      'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
      'Accept-Language': 'en-us',
      'Keep-Alive': '300',
      'Connection': 'keep-alive',
      'Cache-Control': 'max-age=0',
  }
  outarr = []
  for url in urls:
    url = url.strip()
    loc,area_id = url.split(',')
    section = '/cgi-bin/search?areaID='+area_id+'&subAreaID=&query='+searchterm+'&catAbbreviation=sss'
    url = loc.strip() + section.strip() 
    req = urllib2.Request(url, txdata, txheaders)
    u = urllib2.urlopen(req)
    headers = u.info()
    data = u.read()
    results = searchfor.findall(data)
    outarr += results
    print url
  outstuff = ('\n').join(outarr)
  outfile = open(outfile,'w')
  outfile.write(outstuff)
  outfile.close
  print 'ended on ', time.strftime('%X %x %Z')



if __name__ == "__main__":
    sys.exit(main())

Python script to strip out bounced emails from a mailing list

Posted by Joel Jensen Mon, 24 Jul 2006 18:54:51 GMT

The title says it all. Here it is

import re, MySQLdb,time

def grab_email(files = []):
    # if passed a list of text files, will return a list of
    # email addresses found in the files, matched according to
    # basic address conventions. Note: supports most possible
    # names, but not all valid ones.

    found = []
    if files != None:
        mailsrch = re.compile(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,5}')

        for file in files:            
            for line in open(file,'r'):                
                found.extend(mailsrch.findall(line))    

    # remove duplicate elements
    # borrowed from Tim Peters' algorithm on ASPN Cookbook
    u = {}
    for item in found:
        u[item] = 1

    # return list of unique email addresses
    return u.keys()

def chunk_array(inlist,size):
    '''this is to chunk the email list into bite size pieces.'''
    outlist = []
    while inlist:
        outlist.append(inlist[0:size])
        del inlist[0:size]
    return outlist

results = grab_email(['bouncefile.txt'])

results = [(x.lower(),) for x in results]
results =  chunk_array(results,5)


db = MySQLdb.connect(host="yourhost",user="root",passwd="yourpass",db="yourdb",port=3306)
c  = db.cursor()
start = time.time()
print 'connected'
i = 0
for result in results:
    c.executemany('''update orders set emailBillingSpam  = 0 where lower(emailBilling)  = %s''', result)
    c.executemany('''update orders set emailShippingSpam = 0 where lower(emailShipping) = %s''', result)
    db.commit()

    i = i + 5
    now = time.time()
    print i ,i/(now - start),'requests per second'