I maintain our departmental mail servers, and spend a fair amount of effort trying to reduce unsolicited email. One of the ways I do that is by collecting my spam and training our spam filter with it so our users won't see it more than once.
Today I got a hilarious email that must have been written in another language and then translated into English. Its first paragraph mentions a “most exciting lose flesh product available.” Sounds good! Better yet is this supposed testimonial from a guy in New York: “And you see me, the bed became cool also!” To quote Temperance Brennan: “I don't know what that means.”
I also don't know how I manage to “decline the preposition” and resist this exciting new product that “attacks unnecessary kilos.”
Every so often I get curious about nutrition and whether my diet is actually a healthy one. Over the years I've used a program called NUT, which is a really great console program that uses all the data from the USDA National Nutrient Database for Standard Reference. A couple days ago I downloaded the latest version and compiled it on my MacBook Pro. Thanks to the genius of writing simple, portable C code that builds with gcc, it compiled perfectly (not even a single warning) and I was off and running.
Unfortunately I was having a little trouble deleting the 26,642 gram (58+ pound) apple I accidentally entered for lunch today, and because I had the source code available, I discovered a buffer overflow error in the menu entry code. (A buffer overflow is sort of like when a form asks for your first name but only has room for six letters, and instead of stopping at C-h-r-i-s-t you continue to write the rest of your name into the following boxes not designed for your first name.) So I wrote to the author. An hour later, he wrote me back to thank me for finding the bug. Along the way he found a couple more, fixed them, and released a new version.
Timeline: Find a bug before dinner. Contact author. By the time I'm having my first beer, the program has already been fixed.
Try getting that kind of support from your commercial vendor.
In my job as a systems administrator, spam is one of those things I accept as fact, but have to deal with as best I can so my users can actually get work done. I came across this article on Slashdot today, and even though there's absolutely nothing revelatory in this article, I think people fail to appreciate where spam comes from. It's not evil spammers sending you junk mail; spam comes from computers running Microsoft Windows that have been infected with something. If you don't like spam, stop sending Microsoft money for their software. Every time you buy a Microsoft product, you're supporting all the network effects of their software. The same network effects that make sharing a Word document with other Microsoft Office users easy, also result in more infections, more spam, more wasted time and money.
Update Thu Jan 10 09:38:42 AKST 2008: Unless you really need a complete mirror like this, a much faster way to achieve something similar is to use Thanassis Tsiodras's Wikipedia Offline method. Templates and other niceties don't work quite as well with his method, but the setup is much, much faster and easier.
I've come to depend on the Wikipedia. Despite potential problems with vandalism, pages without citations, and uneven writing, it's so much better than anything else I have available. And it's a click away.
Except when flooding on the Richardson Highway and a mistake by an Alaska railroad crew cut off Fairbanks from the world. So I've been exploring mirroring the Wikipedia on a laptop. Without images and fulltext searching of article text, it weighs in at 7.5 GiB (20061130 dump). If you add the fulltext article search, it's 23 GiB on your hard drive. That's a bit much for a laptop (at least mine), but a desktop could handle it easily. The image dumps aren't being made anymore since many of the images aren't free from copyright, but even the last dump in November 2005 was 79 GiB. It took about two weeks to download, and I haven't been able to figure out how to integrate it into my existing mirror.
In any case, here's the procedure I used:
Install apache, PHP5, and MySQL. I'm not going to go into detail here, as there are plenty of good tutorials and documentation pages for installing these three things on virtually any platform. I've successfully installed Wikipedia mirrors on OS X and Linux, but there's no reason why this wouldn't work on Windows, since apache, PHP and MySQL are all available for that platform. The only potential problem is that the text table is 6.5 GiB, and some Windows file systems may not be able to handle files larger than 4 GiB (NTFS should be able to handle it, but earlier filesystems like FAT32 probably can't).
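To see why FAT32 is a problem, here's a minimal Python sketch of the arithmetic. It assumes the text table is stored as a single file on disk, and it uses the FAT32 limit of 2^32 - 1 bytes per file. The function name is made up for illustration.

```python
# FAT32 records a file's size in a 32-bit field, so one file can hold
# at most 2**32 - 1 bytes (one byte short of 4 GiB).
GIB = 2**30
FAT32_MAX_FILE_BYTES = 2**32 - 1


def fits_on_fat32(size_bytes):
    """Return True if a single file of size_bytes can exist on FAT32."""
    return size_bytes <= FAT32_MAX_FILE_BYTES


# Assumption: the 6.5 GiB text table is one file on disk.
text_table_bytes = int(6.5 * GIB)
print(fits_on_fat32(text_table_bytes))
```

NTFS doesn't have this 4 GiB ceiling, which is why it should cope with the table.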
Download the latest version of the mediawiki software from http://www.mediawiki.org/wiki/Download (the software links are on the right side of the page).
Create the mediawiki database:
$ mysql -p
mysql> create database wikidb;
mysql> grant create,select,insert,update,delete,lock tables on wikidb.* to user@localhost identified by 'userpasswd';
mysql> grant all on wikidb.* to admin@localhost identified by 'adminpasswd';
mysql> flush privileges;
Untar the mediawiki software to your web server directory:
$ cd /var/www
$ tar xzf ~/mediawiki-1.9.2.tar.gz
Point a web browser to the configuration page, probably something like http://localhost/config/index.php, and fill in the database section with the database name (wikidb), users, and passwords from the SQL you typed in earlier. Click the 'install' button. Once that finishes:
$ cd /var/www/
$ mv config/LocalSettings.php .
$ rm -rf config/
More detailed instructions for getting MediaWiki running are at: http://meta.wikimedia.org/wiki/Help:Installation
Now, get the Wikipedia XML dump from http://download.wikimedia.org/enwiki/. Find the most recent directory that contains a valid pages-articles.xml.bz2 file.
Also download the mwdumper.jar program from http://download.wikimedia.org/tools/. You'll need Java installed to run this program.
Configure your MySQL server to handle the load by editing /etc/mysql/my.cnf, changing the following settings:
[mysqld]
max_allowed_packet = 128M
innodb_log_file_size = 100M

[mysql]
max_allowed_packet = 128M
Restart the server, empty some tables and disable binary logging:
$ sudo /etc/init.d/mysql restart
$ mysql -p wikidb
mysql> set sql_log_bin=0;
mysql> delete from page;
mysql> delete from revision;
mysql> delete from text;
Now you're ready to load in the Wikipedia dump file. This will take several hours to more than a day, depending on how fast your computer is (a dual 1.8 GHz Opteron system with 4 GiB of RAM took a little under 17 hours with an average load around 3.0 on the 20061103 dump file). The command is (all on one line):
$ java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 enwiki-20060925-pages-articles.xml.bz2 | mysql -u admin -p wikidb
You'll use the administrator password you chose earlier. You can also use your own MySQL account: since you created the database, you have all the needed rights.
After this finishes, it's a good idea to make sure there are no errors in the MySQL tables. I normally get a few errors in the pagelinks, templatelinks and page tables. To check the tables for errors:
$ mysqlcheck -p wikidb
If there are tables with errors, you can repair them in two different ways. The first is done inside MySQL and doesn't require shutting down the MySQL server. It's slower, though:
$ mysql -p wikidb
mysql> repair table pagelinks extended;
The faster way requires shutting down the MySQL server:
$ sudo /etc/init.d/mysql stop (or however you stop it)
$ sudo myisamchk -r -q /var/lib/mysql/wikidb/pagelinks.MYI
$ sudo /etc/init.d/mysql start
There are several important extensions to mediawiki that Wikipedia depends on. You can view all of them by going to http://en.wikipedia.org/wiki/Special:Version, which shows everything Wikipedia is currently using. You can get the latest versions of all the extensions with:
$ svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions extensions
svn is the client command for Subversion (http://subversion.tigris.org/), a revision control system that eliminates most of the issues people had with CVS (and RCS before that). The command above will check out all the extensions code into a new directory on your system named extensions.
The important extensions are the parser functions, citation functions, CategoryTree and wikihiero. Here's how you install these from the extensions directory that svn created.
$ cd extensions/ParserFunctions
$ mkdir /var/www/extensions/ParserFunctions
$ cp Expr.php ParserFunctions.php SprintfDateCompat.php /var/www/extensions/ParserFunctions
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/ParserFunctions/ParserFunctions.php");
$wgUseTidy = true;
^d
(The last four lines just add those PHP commands to the LocalSettings.php file; ^d means pressing Control-D to end the input. It's probably easier to just use a text editor.)
$ cd ../Cite
$ mkdir /var/www/extensions/Cite
$ cp Cite.php Cite.i18n.php /var/www/extensions/Cite/
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/Cite/Cite.php");
^d

$ cd ..
$ tar cf - CategoryTree/ | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
$wgUseAjax = true;
require_once("$IP/extensions/CategoryTree/CategoryTree.php");
^d

$ tar cf - wikihiero | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/wikihiero/wikihiero.php");
^d
If you want the math to show up properly, you'll need to have LaTeX, dvips, convert (from the ImageMagick suite), GhostScript, and an OCaml setup to build the code. Here's how to do it:
$ cd /var/www/math
$ make
$ mkdir ../images/tmp
$ mkdir ../images/math
$ sudo chown -R www-data ../images/
My web server runs as user www-data. If yours uses a different account, that's what you'd change the images directories to be owned by. Alternatively, you could use chmod -R 777 ../images to make them writeable by anyone.
Change the $wgUseTeX variable in LocalSettings.php to true. If your Wikimirror is at the root of your web server (as it is in the examples above), you need to make sure that your apache configuration doesn't have an Alias section for images. If any of the programs mentioned aren't in the system PATH (for example, if you installed them in /usr/local/bin or /sw/bin on a Mac), you'll need to put them in /usr/bin or someplace the script can find them.
MediaWiki comes with a variety of maintenance scripts in the maintenance directory. To allow these to function, you need to put the admin user's username and password into AdminSettings.php:
$ mv /var/www/AdminSettings.sample /var/www/AdminSettings.php
and change the values of $wgDBadminuser to admin (or what you really set it to when you created the database and initialized your mediawiki) and $wgDBadminpassword to adminpasswd.
Now, if you want the Search box to search anything besides the titles of articles, you'll need to rebuild the search tables. As I mentioned earlier, these tables make the database grow from 7 GiB to 23 GiB (as of the September 25, 2006 dump), so make sure you've got plenty of space before starting this process. I've found a Wikimirror is pretty useful even without full searching so don't abandon the effort if you don't have 20+ GiB to devote to a mirror.
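If you'd rather check the free space with a script than by eye, here's a minimal Python sketch. The 7 GiB and 23 GiB figures are the ones quoted above; the function names are made up for illustration, and the check only covers the final size of the tables, not any temporary space MySQL might want during the rebuild.

```python
import shutil

GIB = 2**30


def has_room(free_bytes, current_gib=7, final_gib=23):
    """True if free_bytes covers the growth from current_gib to final_gib."""
    return free_bytes >= (final_gib - current_gib) * GIB


def free_bytes_at(path):
    """Free space, in bytes, on the filesystem that holds path."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # "." is a stand-in; point this at your MySQL data directory instead.
    free = free_bytes_at(".")
    print("free: %.1f GiB, enough: %s" % (free / GIB, has_room(free)))
```

On the Linux setup used in these examples the directory to check would be /var/lib/mysql.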
To rebuild everything:
$ php /var/www/maintenance/rebuildall.php
This script builds the search tables first (which takes several hours), and then moves on to rebuilding the link tables. Rebuilding the link tables takes a very, very long time, but there's no problem breaking out of this process once it starts. I've found that this has a tendency to damage some of the link tables, requiring a repair before you can continue. If that does happen, note the table that was damaged and the index number where the rebuildall.php script failed. Then:
$ mysql -p wikidb
mysql> repair table pagelinks extended;
(Replace pagelinks with whatever table was damaged.) I've had repairs take anywhere from a few minutes to 12 hours, so keep this in mind.
After the table is repaired, edit the /var/www/maintenance/rebuildall.php script and comment out these lines:
# dropTextIndex( $database );
# rebuildTextIndex( $database );
# createTextIndex( $database );
# rebuildRecentChangesTablePass1();
# rebuildRecentChangesTablePass2();
and insert the index number where the previous run crashed into this line:
refreshLinks( 1 );
Then run it again.
One final note: Running all of these processes on a laptop can be very taxing, since a laptop might not be well equipped to handle a full load for days at a time. If you have a desktop computer, you can do the dumping and rebuilding on that computer, and after everything is finished, simply copy the database files from the desktop to your laptop. I just tried this with the 20061130 dump, copying all the MySQL files from /var/lib/mysql/wikidb on a Linux machine to /sw/lib/mysql/wikidb on my MacBook Pro. After the copying was finished, I restarted the MySQL daemon, and the Wikipedia mirror is now live on my laptop. The desktop had MySQL version 5.0.24 and the laptop has 5.0.16. I'm not sure how different the versions can be for a direct copy to work, but it does work between different platforms (Linux and OS X) and architectures (AMD64 and Intel Core Duo).
Last week I wrote a Python script to import my Unix calendar event files into Google calendar. Today I wanted to put the 2006 Alaska Goldpanners schedule into my Google calendar. I suppose I could have entered all the games in manually, but instead I came up with an event file format, and a script to translate these files into iCal files that can be imported into Google calendar.
The format looks like this:
2006-Jun-14 1900 2200 Goldpanners vs. Fairbanks Adult All-Stars
with one event per line. The start and end times are in military time, and events have to start and finish on the same day.
To convert a file of these events to iCal format, download mycal_to_ics.py, and run it like this:
cat mycal | ./mycal_to_ics.py > mycal.ics
Then you can import it into your Google calendar using the Manage Calendars | Import Calendar tab. I'd recommend creating a new temporary calendar and importing into that so that if there are any errors, you won't have disturbed your existing calendars.
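For a sense of what the conversion involves, here's a minimal Python sketch of parsing the event format described above and writing iCal text. It is not the mycal_to_ics.py script itself, just an illustration: the function names, the PRODID string, and the example.invalid UIDs are all made up, and times are written without a time zone, so a calendar program will treat them as local time.

```python
from datetime import datetime, timezone

ICS_TIME = "%Y%m%dT%H%M%S"


def parse_event(line):
    """Split one 'YYYY-Mon-DD HHMM HHMM description' line into its parts."""
    date_s, start_s, end_s, summary = line.strip().split(None, 3)
    # %b matches English month abbreviations such as "Jun" in the default C locale.
    start = datetime.strptime(date_s + " " + start_s, "%Y-%b-%d %H%M")
    end = datetime.strptime(date_s + " " + end_s, "%Y-%b-%d %H%M")
    return start, end, summary


def escape_text(value):
    """Escape the characters that are special inside iCalendar text values."""
    return value.replace("\\", "\\\\").replace(";", "\\;").replace(",", "\\,")


def to_ics(lines):
    """Build one VCALENDAR string from an iterable of event lines."""
    stamp = datetime.now(timezone.utc).strftime(ICS_TIME) + "Z"
    out = ["BEGIN:VCALENDAR", "VERSION:2.0",
           "PRODID:-//example//event file sketch//EN"]  # hypothetical PRODID
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines
        start, end, summary = parse_event(line)
        out += [
            "BEGIN:VEVENT",
            # Hypothetical UID scheme: start time plus line number.
            "UID:%s-%d@example.invalid" % (start.strftime(ICS_TIME), number),
            "DTSTAMP:" + stamp,
            "DTSTART:" + start.strftime(ICS_TIME),
            "DTEND:" + end.strftime(ICS_TIME),
            "SUMMARY:" + escape_text(summary),
            "END:VEVENT",
        ]
    out.append("END:VCALENDAR")
    return "\r\n".join(out) + "\r\n"


print(to_ics(["2006-Jun-14 1900 2200 Goldpanners vs. Fairbanks Adult All-Stars"]))
```

Because the start and end share one date field, an event can't run past midnight, which matches the same-day restriction in the format.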