watching web pages and tracking changes
Nick Demou
ndemou at gmail.com
Tue Oct 10 16:50:07 EEST 2006
Based on the advice from the list I put together a bash script which I call every day (via cron) and which sends me a simple, clean email with whatever changes have occurred on specific web pages I want to keep an eye on.
I'm posting it here for anyone who is interested.
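A crontab entry like the following would do the daily run (a sketch only; the script name check_pages.sh is just an example, use whatever name you saved it under):
# run every morning at 07:00
0 7 * * * /home/ndemou/cron/fsa/check_pages.sh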
CAUTION:
1) I'm almost clueless about bash scripting, but the script below has been working correctly for a few days now on four different sites.
2) If you just want to try it, all you have to do is put the URLs you want and a random but unique alias for each of them at the end of the script (the URL= and FILE= lines) *AND* change the line "cd /home/ndemou/cron/fsa" so that it points to an existing directory, which will fill up with files (I told you I'm almost clueless :-). Oh! it requires the wget, sed, w3m, iconv and nail (or alternatively mail) utilities.
3) If all goes well and you like it, it's a good idea to read the code and the comments once (e.g. like the following:
# use grep -v "^[ ].*" if you want to see both deletions and additions
# use grep "^+.*" if you only want to see additions (new content -OR-
# moved content)
)
4) and of course: if your system breaks you get to keep the pieces :-)
#!/bin/bash
#
############################
#
# this does all the job
# once for every url
#
############################
function checkURL {
(
echo "=============================================S"
echo "checking $URL"
echo "--------------------------------------------- "
TMPFILE=$FILE.`date +"%j"`.html
# TODO: maybe put temp html and txt files in a subfolder
# after done testing it, the html files can be discarded after being
# converted to text
# download the page
wget --timeout=20 --tries=3 -q -O $TMPFILE $URL
# TODO: What if nothing could be downloaded
# maybe I should check if the resulting file $TMPFILE is more than 0 bytes
# NOTE: I currently use a 20secs timeout and 3 retries max
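# (not part of the original script: a possible sketch for the TODO above,
#  checking whether wget produced a non-empty file; left commented out)
# if [ ! -s "$TMPFILE" ]; then
#     echo "WARNING: could not download $URL (empty or missing file)"
#     exit 1   # exits only this subshell, so the remaining URLs are still checked
# fi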
# flatten the page (replaces <td> <tr> <th> with <p>, thus serializing
# the contents of the cells of tables)
sed -e 's#<tr[^>]*>#<p>#g' -e 's#<td[^>]*>#<p>#g' -e 's#<th[^>]*>#<p>#g' \
-e 's#</tr>#</p>#g' -e 's#</td>#</p>#g' -e 's#</th>#</p>#g' \
$TMPFILE > $TMPFILE.flaten
# TODO: what about placement caused by css styles?
# maybe w3m is rendering without styles but I must check it
# convert to text and clean it up:
# the s/^\_\ *$// is to clean lines that contain just one _
# the "s/\ \ */\ /g" and "s/^\ *$//" are to suppress multiple spaces to
# one space and turn lines with just one space into empty lines
# the uniq is to eliminate successive empty lines (it will also
# eliminate successive lines with the same content but I don't care)
w3m -dump -T text/html $TMPFILE.flaten \
| sed -e "s/\[spacer\]/ /g" \
-e "s/^\_\ *$//" -e "s/\ \ */\ /g" -e "s/^\ *$//" \
| uniq > $FILE.txt
# get the diff of the current vs the previous text
# use grep -v "^[ ].*" if you want to see both deletions and additions
# use grep "^+.*" if you only want to see additions (new content -OR-
# moved content)
diff -uiEbwB $FILE.old.txt $FILE.txt | grep -v "^[ ].*"
# TODO: I could colorify the diff output by converting it to html
# (after I learn how to send html mail maybe)
# cycle txt's (current txt becomes old.txt)
mv $FILE.old.txt $FILE.old2.txt
mv $FILE.txt $FILE.old.txt
# TODO: I MUST RE-ENABLE THIS COMMAND AFTER DONE TESTING
# (and cleanup the temp STRING.NUMBER.html files that would have been
# collected till then)
#rm $TMPFILE
echo "---------------------------------------------E"
echo " "
) >> $LOG 2>&1
}
############################
#
# Init
#
############################
# TODO: This is an ugly hack - I bet there is a better way!
# If I rename/move anything this script will fail
cd /home/ndemou/cron/fsa
LOG="fsa.report"
rm -f $LOG
# TODO: I should read the urls below from a config file
# but what about the related $FILE ???:
# I could use the domain with any / replaced by _ up to 80 chars maximum
# If it's more than 80 chars I could append the HEX md5sum of the url at the end
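# (not part of the original script: one possible sketch for the TODO above,
#  assuming a urls.conf file with one URL per line; left commented out)
# while read URL; do
#     FILE=$(echo -n "$URL" | tr -c 'A-Za-z0-9' '_' | cut -c1-80)
#     if [ ${#URL} -gt 80 ]; then
#         FILE="$FILE.$(echo -n "$URL" | md5sum | cut -d' ' -f1)"
#     fi
#     checkURL
# done < urls.conf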
############################
#
# Main
# (change the lines below to select which URLs to watch)
#
############################
# for every page enter the URL and a unique alias used for creating the needed files
URL="http://www.example1.com/"
FILE="page1"
checkURL
URL="http://www.example2.com/foo.html"
FILE="page2"
checkURL
############################
#
# Done
#
############################
# send the report by email
# (to encode it as win1253 first, pipe it through: iconv -f UTF-8 -t WINDOWS-1253)
cat $LOG | nail -s "Updates to monitored sites" ndemou@enlogic.gr
# TODO: maybe I should only mail when at least one difference exists
# but OTOH usually there is always some small difference
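# (not part of the original script: a possible sketch for the TODO above,
#  mailing only when the log contains at least one diff hunk; left commented out)
# if grep -q "^@@" "$LOG"; then
#     cat "$LOG" | nail -s "Updates to monitored sites" ndemou@enlogic.gr
# fi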