19 July 2010

Thoughts about the Wikipedia XML dump

It is the BIGGEST single plain-text file I've ever seen in my life! And I'm talking about the 27 GB English version.
So why did they choose bzip2? OK, they recently moved to 7z anyway (but I still got the bz2 one)... 7-Zip decompression is far faster than bzip2's.
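Whichever container they pick, the dump never has to be fully extracted to disk: Python can decompress the bz2 stream on the fly and keep memory use small. A minimal sketch (the dump filename below is just an example, not a real path on my machine):

```python
import bz2

def iter_dump_lines(path):
    """Stream lines out of a bzip2-compressed dump without
    extracting the whole file; only a small buffer is in memory."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line

# Usage (filename is hypothetical):
# for line in iter_dump_lines("enwiki-pages-articles.xml.bz2"):
#     if "<title>" in line:
#         print(line.strip())
```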

They provide a Python library called mwlib to work with the dump. This is where I got interested. For my country, Indonesia, I think this is a great asset for education. People here (even on Java island) are mostly still alien to the internet. Worse, when the recent videotape scandal of a local artist was blasted all over the media, most people formed a negative feeling about the internet. Even our minister is outspoken about his plan to censor the internet! LOL, talk about China. Anyway, it only takes a little mindset change. For example: instead of being pushed to buy a second motorbike, which is mostly for show-off, people would be better off investing in a computer, even the cheapest one they can get. At least it doesn't eat gas. With the idea of a portable wiki on a flash drive, like WikiTaxi, people get access to one of the best knowledge sources legally and for free. Youngsters could be pushed to "read" more instead of foolishly chatting on Facebook and playing mindless games.

English itself is relatively a problem, let alone the Indonesian language. Here we usually know at least two languages: native and national. The native language is commonly taught by parents, but today's parents are starting to ignore their own legacy (sad). However, I never understood any language other than Javanese (jowo), simply because they aren't even similar :( Sundanese, Acehnese, Padang, and the list goes on. My English is also very limited (I'm better at reading than writing, which is why I force myself to keep this blog even though I have another Indonesian one). With a $100 computer, people could afford $10 for an 8 GB flash drive, or students could just buy the flash drive and access the wiki somewhere.

Some months ago someone called Steven Haryanto hijacked KBBI Daring and transformed it into the StarDict IFO format. Great job, dude! Our government should learn from its hypocrisy of claiming to bring literacy while at the same time only selling pricey textbooks and a mostly inaccessible online dictionary. I myself must admit that I'm not very good at Indonesian and not excellent at the multilevel Javanese language. I never loved schooling; I'm even disgusted at (not exactly smart) people who seek higher degrees in our sucky educational system. In short, reading Wikipedia is a lot better than most school lessons or listening to a lazy teacher.

Back on track: there is PyLZMA for (on-the-fly?) decompression (not sure if it's fully compatible with 7z); mwlib plus some HTML renderer like PyQt, PyWebKitGTK, or even PyGtkHtml should be able to handle simple CSS. Then add embedded translation and a thesaurus or WordNet lookup for Indonesian<->English. Add a favorites or history system too, and we'd get a decent open-source encyclopedia system. Even if it's just text (lots of text).
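The on-the-fly decompression part can be sketched without PyLZMA at all, using Python's standard-library lzma module as a stand-in (it handles raw LZMA/XZ streams, not the full .7z container). The `ArticleStore` name and its methods are my own illustration, not mwlib API: each article is compressed independently and only decompressed when the reader asks for it.

```python
import lzma

class ArticleStore:
    """Tiny in-memory store: title -> LZMA-compressed article text.
    Illustrative only; a real system would keep blobs on disk."""

    def __init__(self):
        self._blobs = {}

    def put(self, title, text):
        self._blobs[title] = lzma.compress(text.encode("utf-8"))

    def get(self, title):
        # Decompression happens on the fly, only for the requested article.
        return lzma.decompress(self._blobs[title]).decode("utf-8")

store = ArticleStore()
store.put("Jakarta", "Jakarta is the capital of Indonesia.")
print(store.get("Jakarta"))
```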

A 7-Zip version could reduce the size by up to 20% while bringing a reading speedup of multiple orders of magnitude.

Umm... how come I can't find (by googling) any attempt at this in Indonesia? I'm not an IT guy, so where are our famed hackers who are supposed to bring this into reality and help our own nation, instead of hijacking websites or writing viruses all the time? Shit! I think I just bumped into yet another hypocrisy :)

I've taken a closer look at mwlib for a while and built the binary too, but it seems mwlib drags in many (though small) additional packages, notably for its server functionality. What I basically need is the archive dumper; however, it (mw-buildcdb) produces an archive twice as large, but seekable. So the 6 GB en.wiki would become a 12 GB archive, and thus won't fit on an 8 GB DL-DVD. The problem is that a seekable archive, a.k.a. non-solid compression (like zip), is such a waste for plain text. Better compression such as LZMA (7-Zip) with a small solid-block threshold may give balanced performance. Let's say DVD drives have an average read speed of 3 MB/s (moderate); then if we limit the solid-block size to 4 MB, it would take at most around a second to retrieve a piece of content. That is also a tolerable maximum size for a single compressed article. Need to dig more...
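The solid-block idea above can be sketched in a few lines: pack articles into blocks of at most ~4 MB of raw text, compress each block solidly with LZMA, and keep an index from title to block number. A lookup then reads and decompresses exactly one block, so the worst-case read cost from DVD is roughly 4 MB / 3 MB/s ≈ 1.3 s (plus decompression time). All names here are my own illustration; this is not mw-buildcdb's format, and the NUL-byte separator assumes articles contain no raw NUL bytes.

```python
import lzma

BLOCK_LIMIT = 4 * 1024 * 1024  # ~4 MB of raw text per solid block

def build_blocks(articles):
    """articles: iterable of (title, text). Returns (compressed_blocks, index)
    where index maps title -> (block_no, position_within_block)."""
    blocks, index = [], {}
    current, size = [], 0
    for title, text in articles:
        data = text.encode("utf-8")
        if current and size + len(data) > BLOCK_LIMIT:
            blocks.append(current)       # close the current solid block
            current, size = [], 0
        index[title] = (len(blocks), len(current))
        current.append(data)
        size += len(data)
    if current:
        blocks.append(current)
    # Solid compression: one LZMA stream over each block's concatenated articles.
    compressed = [lzma.compress(b"\x00".join(block)) for block in blocks]
    return compressed, index

def lookup(compressed, index, title):
    block_no, pos = index[title]
    # Only this one block is read and decompressed, never the whole archive.
    articles = lzma.decompress(compressed[block_no]).split(b"\x00")
    return articles[pos].decode("utf-8")
```

Compared to a zip-style seekable archive, the solid blocks let LZMA exploit redundancy across neighboring articles, which is exactly where plain wiki text compresses well.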
