wed, 08-nov-2006, 17:11

books closeup

I have a friend whose house1 burned down a few years ago. He and I are both baseball fans and I'd lent him my copy of Robert K. Adair's The Physics of Baseball before the fire. I never saw the book again, and wasn't going to bother him with such an inconsequential item when he was trying to replace all the really important things he lost. I've since replaced it with a newer edition and he's now living in his new house.

Since that happened, I've worried about what I'd be able to replace if we had a disaster, mostly because I wouldn't be able to remember everything. Normally, I suppose, you'd make a long list of the stuff you own and file it with your insurance company or put it someplace safe. But there's a much faster way: just take a digital picture of everything, burn it all to a CD or DVD and file that away. We've got a reasonably inexpensive 3.1-megapixel camera, and I just took a couple of photos of one of my bookshelves. The book titles were too hard to resolve when I fit the entire bookshelf into one shot, but you can easily read everything at full resolution when only a couple of shelves fill the frame (the image above is a sample of one of those shots).

A few minutes with a camera and I'll have a nice record of all of it.


1 Google as an English language expert: I couldn't remember whether this phrase was "who's house" or "whose house". A Google search for the first phrase yielded 76.7 thousand hits. "whose house" is almost ten times more popular (737 thousand), so I figure that must be correct. To confirm this, I repeated the search, adding 'site:http://www.nytimes.com' to the search string. Two hundred and ninety-five hits for "whose house" at the New York Times, zero for "who's house". Case closed.

Also: the new version of Firefox will spell check text entries like the big textarea I'm typing this post into. Goes a long way toward eliminating spelling mistakes in blog posts. Check it out!

tags: books  house 
sun, 05-nov-2006, 14:35

the drift

When I was in high school there was a downhill ski "club" that rented a bus and drove all of us down to Bristol Mountain, eight or ten times each winter. We'd leave right after school on Wednesdays, ski all afternoon and evening, and be back in Webster late that night. Portable cassette tape players were the iPod of that time, and I spent a lot of those bus rides getting sick from diesel fumes, listening to Peter Gabriel and reading Stephen King novels. To this day, whenever I hear certain tracks off Melting Face or Security, I can't help but think of Jack Torrance, Randall Flagg, and the other characters from King's early books.

Yesterday I finished Cormac McCarthy's No Country for Old Men, and for most of it I was listening to Calexico. They're an independent rock group from Arizona whose music just sounds like the Southwest. Perfect soundtrack for a book about the border country in Texas. Because the book made a strong impression on me, I have a feeling that whenever I hear Calexico in the future, I'll probably think of Moss, Bell and Chigurh.

I've got Scott Walker's The Drift playing right now. I can't imagine what book I should be reading with this as its soundtrack. Yikes!

tags: books  music 
sat, 04-nov-2006, 09:34

wikipedia

Update Thu Jan 10 09:38:42 AKST 2008: Unless you really need a complete mirror like this, a much faster way to achieve something similar is to use Thanassis Tsiodras's Wikipedia Offline method. Templates and other niceties don't work quite as well with his method, but the setup is much, much faster and easier.


I've come to depend on the Wikipedia. Despite potential problems with vandalism, pages without citations, and uneven writing, it's so much better than anything else I have available. And it's a click away.

Except when flooding on the Richardson Highway and a mistake by an Alaska Railroad crew cut Fairbanks off from the world. So I've been exploring mirroring the Wikipedia on a laptop. Without images and fulltext searching of article text, it weighs in at 7.5 GiB (20061130 dump). If you add the fulltext article search, it's 23 GiB on your hard drive. That's a bit much for a laptop (at least mine), but a desktop could handle it easily. The image dumps aren't being made anymore since many of the images aren't free of copyright restrictions, but even the last dump in November 2005 was 79 GiB. It took about two weeks to download, and I haven't been able to figure out how to integrate it into my existing mirror.

In any case, here's the procedure I used:

Install apache, PHP5, and MySQL. I'm not going to go into detail here, as there are plenty of good tutorials and documentation pages for installing these three things on virtually any platform. I've successfully installed Wikipedia mirrors on OS X and Linux, but there's no reason why this wouldn't work on Windows, since apache, PHP and MySQL are all available for that platform. The only potential problem is that the text table is 6.5 GiB, and some Windows file systems may not be able to handle files larger than 4 GiB (NTFS should be able to handle it, but earlier filesystems like FAT32 probably can't).
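
For reference, on a Debian-style Linux system you can usually pull in the whole stack through the package manager. This is just a sketch; the package names below are what Debian and Ubuntu were using at the time, so adjust them for your distribution:

$ sudo apt-get install apache2 php5 libapache2-mod-php5 php5-mysql mysql-server
$ php -v
$ mysql --version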

Download the latest version of the mediawiki software from http://www.mediawiki.org/wiki/Download (the software links are on the right side of the page).

Create the mediawiki database:

$ mysql -p
mysql> create database wikidb;
mysql> grant create,select,insert,update,delete,lock tables on wikidb.* to user@localhost identified by 'userpasswd';
mysql> grant all on wikidb.* to admin@localhost identified by 'adminpasswd';
mysql> flush privileges;

Untar the mediawiki software to your web server directory:

$ cd /var/www
$ tar xzf ~/mediawiki-1.9.2.tar.gz

Point a web browser to the configuration page, probably something like http://localhost/config/index.php, and fill in the database section with the database name (wikidb) and the users and passwords from the SQL you typed in earlier. Click the 'install' button. Once that finishes:

$ cd /var/www/
$ mv config/LocalSettings.php .
$ rm -rf config/

More detailed instructions for getting mediawiki running are at: http://meta.wikimedia.org/wiki/Help:Installation

Now, get the Wikipedia XML dump from http://download.wikimedia.org/enwiki/. Find the most recent directory that contains a valid pages-articles.xml.bz2 file.
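
The dump is a multi-gigabyte download, so use something that can resume an interrupted transfer. As a sketch, assuming the 20060925 dump directory (substitute whichever recent directory you found above):

$ wget -c http://download.wikimedia.org/enwiki/20060925/enwiki-20060925-pages-articles.xml.bz2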

Also download the mwdumper.jar program from http://download.wikimedia.org/tools/. You'll need Java installed to run this program.
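
For example, fetching the dumper and confirming that a Java runtime is installed:

$ wget http://download.wikimedia.org/tools/mwdumper.jar
$ java -version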

Configure your MySQL server to handle the load by editing /etc/mysql/my.cnf, changing the following settings:

[mysqld]
max_allowed_packet = 128M
innodb_log_file_size = 100M
[mysql]
max_allowed_packet = 128M

Restart the server, empty some tables and disable binary logging:

$ sudo /etc/init.d/mysql restart
$ mysql -p wikidb
mysql> set sql_log_bin=0;
mysql> delete from page;
mysql> delete from revision;
mysql> delete from text;

Now you're ready to load in the Wikipedia dump file. This will take several hours to more than a day, depending on how fast your computer is (a dual 1.8 GHz Opteron system with 4 GiB of RAM took a little under 17 hours with an average load around 3.0 on the 20061103 dump file). The command is (all on one line):

$ java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 enwiki-20060925-pages-articles.xml.bz2 | mysql -u admin -p wikidb

You'll use the administrator password you chose earlier. You can also use your own MySQL account; since you created the database, you have all the needed rights.

After this finishes, it's a good idea to make sure there are no errors in the MySQL tables. I normally get a few errors in the pagelinks, templatelinks and page tables. To check the tables for errors:

$ mysqlcheck -p wikidb

If there are tables with errors, you can repair them in two different ways. The first is done inside MySQL and doesn't require shutting down the MySQL server. It's slower, though:

$ mysql -p wikidb
mysql> repair table pagelinks extended;

The faster way requires shutting down the MySQL server:

$ sudo /etc/init.d/mysql stop (or however you stop it)
$ sudo myisamchk -r -q /var/lib/mysql/wikidb/pagelinks.MYI
$ sudo /etc/init.d/mysql start

There are several important extensions to mediawiki that Wikipedia depends on. You can view all of them by going to http://en.wikipedia.org/wiki/Special:Version, which shows everything Wikipedia is currently using. You can get the latest versions of all the extensions with:

 $ svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions extensions

svn is the client command for Subversion (http://subversion.tigris.org/), a revision control system that eliminates most of the issues people had with CVS (and RCS before that). The command above will check out all the extensions code into a new directory on your system named extensions.

The important extensions are the parser functions, citation functions, CategoryTree and WikiHiero. Here's how you install these from the extensions directory that svn created.

Parser functions:

$ cd extensions/ParserFunctions
$ mkdir /var/www/extensions/ParserFunctions
$ cp Expr.php ParserFunctions.php SprintfDateCompat.php /var/www/extensions/ParserFunctions
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/ParserFunctions/ParserFunctions.php");
$wgUseTidy = true;
^d

(The last four lines just append those PHP commands to the LocalSettings.php file; it's probably easier to use a text editor.)

Citation functions:

$ cd ../Cite
$ mkdir /var/www/extensions/Cite
$ cp Cite.php Cite.i18n.php /var/www/extensions/Cite/
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/Cite/Cite.php");
^d

CategoryTree:

$ cd ..
$ tar cf - CategoryTree/ | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
$wgUseAjax = true;
require_once("$IP/extensions/CategoryTree/CategoryTree.php");
^d

WikiHiero:

$ tar cf - wikihiero | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/wikihiero/wikihiero.php");
^d

If you want the math to show up properly, you'll need to have LaTeX, dvips, convert (from the ImageMagick suite), GhostScript, and an OCaml setup to build the code. Here's how to do it:

$ cd /var/www/math
$ make
$ mkdir ../images/tmp
$ mkdir ../images/math
$ sudo chown -R www-data ../images/

My web server runs as user www-data. If yours runs under a different account, change the ownership of the images directories to that user instead. Alternatively, you could use chmod -R 777 ../images to make them writable by anyone.

Change the $wgUseTeX variable in LocalSettings.php to true. If your Wikimirror is at the root of your web server (as it is in the examples above), you need to make sure that your apache configuration doesn't have an Alias section for images. If any of the programs mentioned aren't in the system PATH (for example, if you installed them in /usr/local/bin, or /sw/bin on a Mac), you'll need to put them in /usr/bin or someplace the script can find them.
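
A quick sanity check, assuming the standard binary names, is to confirm that the rendering programs are on the PATH and that the switch really is set in LocalSettings.php:

$ which latex dvips convert gs
$ grep wgUseTeX /var/www/LocalSettings.php
$wgUseTeX = true;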

MediaWiki comes with a variety of maintenance scripts in the maintenance directory. To allow these to function, you need to put the admin user's username and password into AdminSettings.php:

$ mv /var/www/AdminSettings.sample /var/www/AdminSettings.php

and change the value of $wgDBadminuser to admin (or whatever you actually set it to when you created the database and initialized your mediawiki) and $wgDBadminpassword to adminpasswd.
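
When you're done, the relevant lines in AdminSettings.php should look something like this (using the example admin account and password from the database setup above):

$ grep wgDBadmin /var/www/AdminSettings.php
$wgDBadminuser     = 'admin';
$wgDBadminpassword = 'adminpasswd';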

Now, if you want the Search box to search anything besides the titles of articles, you'll need to rebuild the search tables. As I mentioned earlier, these tables make the database grow from 7 GiB to 23 GiB (as of the September 25, 2006 dump), so make sure you've got plenty of space before starting this process. I've found a Wikimirror is pretty useful even without full searching, so don't abandon the effort if you don't have 20+ GiB to devote to a mirror.

To rebuild everything:

$ php /var/www/maintenance/rebuildall.php

This script builds the search tables first (which takes several hours), and then moves on to rebuilding the link tables. Rebuilding the link tables takes a very, very long time, but you can break out of the process once it starts and pick it up again later. I've found that breaking out has a tendency to damage some of the link tables, though, requiring a repair before you can continue. If that does happen, note the table that was damaged and the index number where the rebuildall.php script failed. Then:

$ mysql -p wikidb
mysql> repair table pagelinks extended;

(Replace pagelinks with whatever table was damaged.) I've had repairs take anywhere from a few minutes to 12 hours, so keep this in mind.

After the table is repaired, edit the /var/www/maintenance/rebuildall.php script and comment out these lines:

# dropTextIndex( $database );
# rebuildTextIndex( $database );
# createTextIndex( $database );
# rebuildRecentChangesTablePass1();
# rebuildRecentChangesTablePass2();

and replace the 1 in this line with the index number where the previous run crashed:

refreshLinks( 1 );

Then run it again.

One final note: Doing all of this on a laptop can be very taxing for a computer that might not be well equipped to handle a full load for days at a time. If you have a desktop computer, you can do the dumping and rebuilding on that computer, and after everything is finished, simply copy the database files from the desktop to your laptop. I just tried this with the 20061130 dump, copying all the MySQL files from /var/lib/mysql/wikidb on a Linux machine to /sw/lib/mysql/wikidb on my MacBook Pro. After the copying was finished, I restarted the MySQL daemon, and the Wikipedia mirror is now live on my laptop. The desktop had MySQL version 5.0.24 and the laptop has 5.0.16. I'm not sure how different these versions can be for a direct copy to work, but it does work between different platforms (Linux and OS X) and architectures (AMD64 and Intel Core Duo).
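
As a rough sketch of that copy, assuming the laptop is reachable over ssh (the hostname laptop and the account user are just placeholders) and that MySQL has been stopped on both machines first:

$ sudo /etc/init.d/mysql stop (on both machines, however each one stops it)
$ sudo rsync -av /var/lib/mysql/wikidb/ user@laptop:/sw/lib/mysql/wikidb/

Afterwards you may need to chown the copied files to the MySQL user on the laptop before restarting its MySQL daemon.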

tags: books  linux  sysadmin 
sat, 12-aug-2006, 17:04

Today when I went out to take care of the neighbor's cats, I noticed that I'd worn through the leather soles of my cowboy boots. I went upstairs and got my older pair, and while I was slipping them on I remembered buying them in California to go with my new motorcycle. That was in 1991, so those boots are more than 15 years old.

More and more I find myself coming across something that I've owned for more than a decade, or a memory occurs to me from longer ago than seems possible. I'm no longer in my 20s, and even though I know this, it still feels like I'm a young kid who has just graduated from college, just old enough to drink beer. Maybe not yet wise enough to know not to drink too many.

But I'm not that young kid anymore. I've been a lot of places and done a lot of things since then. A lot of it has been chronicled in my journals, or in the other pieces of paper and electronic records that are part of every person's life. I'm thinking it might be fun to make a journal book that would be a sort of lifetime accounting for the places I've been and the things I've done. My journal books are 192 pages, and if I put four months on a page, I can cram around 63 years into the book. With four months per page, that's around six lines per month, which sounds like just the right amount of space. I'll put the details of the book itself on my bookbinding pages.

June 1991: bought motorcycle and cowboy boots.

Check.

tags: books  make  writing 
sun, 26-feb-2006, 09:57

A couple of days ago, in an article about prospect analysis in baseball (subscription required), Nate Silver produced a cool table showing the year-to-year correlations of the six major batting events. This morning, while waiting for my dough to rise, I decided to replicate this analysis with my newfound baseball hack ability.

You can download the R program code for the analysis by clicking on the link.

Here's the result, showing the 2004 to 2005 correlations for rate-adjusted batting statistics for all players with more than 250 at-bats in both seasons:

Hits / PA           0.422
Singles / PA        0.663
Doubles / PA        0.369
Triples / PA        0.501
Home Runs / PA      0.702
Walks / PA          0.718
Strikeouts / PA     0.813
Plate appearances   0.405

What Silver was trying to show by presenting his table (which included all year-to-year correlations since World War II) is that "there's really no such thing as a doubles hitter."

You can see from the table that there's very little relationship between how many doubles a hitter hit in 2004 and how many they got in 2005. But a home run hitter in one season is likely to hit them at the same rate in the next season. Also note that strikeout and walk rates are both highly correlated from year to year. So, 2004 and 2005 strikeout leader Adam Dunn is likely to strike out more than 150 times in 2006. Thankfully for Reds fans, he'll probably also hit more than 40 home runs.

The last number is also interesting. There isn't a strong correlation in plate appearances from one season to the next. This is probably a combination of older players breaking down between 2004 and 2005, and younger players stepping in to take their place at the plate.

tags: baseball  books 
