New version of ds

I made some incremental improvements to my trusty synchronization tool. I have added ssh master connection sharing support, which should give a nice speed-up in many cases. In addition, there are cleanups in the auto-completion code and more flexible command line parsing. There is a release on GitHub.

I do certain things the old-fashioned way – synchronizing application config and data across different hosts is one of those areas. I don’t use Dropbox or other internet services for file sync purposes, except for the built-in Google stuff on my Android phone (which is mostly app config backup). I like to keep it peer-to-peer, simple and manual. And I like to keep my data private. Ds is just the right tool to make syncing that much easier, which I guess is the reason I wrote it.

Cloud services are not something I need to manage my data. I have my own “cloud services”: a synchronization scheme that works well enough, automated backups with regular encrypted offsite storage (a future blog post), personal web and file services, and a high-speed fibre-optic internet connection – all at home. I will admit to a certain level of NIH syndrome here, but my own solutions work well enough that I won’t bother looking into anything else yet.

Performance of stored field compression in Lucene 4.1+

Starting with version 4.1 of the Apache Lucene search library, compression of stored fields is enabled by default. The basic idea is that with some refinements to how compression is handled, disk I/O will often cost more time than the extra CPU cycles required to decompress data when retrieving documents from search results. This suits the most common Lucene usage scenarios well, but it is not always a win. Depending on factors such as server hardware and index update frequency, it may hurt performance if you have many small stored fields and/or your on-disk index fits nicely in available OS cache memory in uncompressed form.

If you just need to know how to disable stored field compression, you can skip directly to the solution.

Background

We use Lucene heavily in our Java-based CMS solution. It’s mostly about indexing metadata, which is used for lookups and all kinds of listing queries. We gain flexibility and great speed by using Lucene instead of querying the database through a SQL interface (the authoritative source of basically the same set of metadata that we index). Up to this point, things look pretty standard.

Our use of Lucene may be considered atypical, though. We do not index full-text content (dedicated Apache Solr instances are used for that purpose instead), and we do not use scoring. Our metadata queries use pure boolean algebra and filters on typed data fields, and most use some defined sorting order (but never score). These queries are usually not based on human input, but rather pre-coded, configured or built dynamically by the system. Some searches will fetch a significant number of documents from the result set, and also a significant number of fields from each of those docs. In addition, we have lower-level Lucene customizations such as filters, a callback-based API for walking the index with no limit on the number of retrievable matches, and conversion of queries to filters for speed.

Migrating from Lucene 3 to 4 resulted in a performance reduction

Last year, we did a major upgrade from Lucene 3.6.2 to Lucene 4.9. Testing had not revealed much difference in performance, but when it all went into production, we saw a noticeable increase in CPU usage on our application servers. Logs showed that searches took more time than with the previous Lucene 3 based release. (You may now argue that our performance testing was not thorough enough, and you would be right, but that is beside the point.)

After investigating thread dumps of busy JVMs, we discovered that many threads were often in Lucene code, and then in a decompression stage during stored field retrieval. This was consistently the case, and we figured it had to be the cause of the increased CPU usage.

Solution: disable stored field compression

With many small stored fields that are often retrieved in search results, and indices which fit in OS cache memory, compression yielded worse performance for us. Unfortunately, there is no simple config option to adjust this feature in Lucene 4.X*; disabling it requires swapping out a codec format class.
*I have not yet investigated if this is the case with Lucene 5.X.

You need to provide a custom codec class to your IndexWriter in order to disable stored field compression. There is a class called FilterCodec in the Lucene API which wraps a base codec class instance and delegates by default. It can be extended to allow overriding methods that provide specific parts or formats of a codec. In this case we need only change the stored fields format.

Create a custom codec class:

package org.foo.search;

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
import org.apache.lucene.codecs.lucene410.Lucene410Codec;

/**
 * A Lucene410Codec variant which stores fields without compression by
 * swapping in the older, uncompressed Lucene40StoredFieldsFormat.
 */
public final class Lucene410CodecWithNoFieldCompression extends FilterCodec {

    private final StoredFieldsFormat storedFieldsFormat;

    public Lucene410CodecWithNoFieldCompression() {
        // The name passed here is written into index segments and is later
        // used to look the codec up again through SPI at read time.
        super("Lucene410CodecWithNoFieldCompression", new Lucene410Codec());
        storedFieldsFormat = new Lucene40StoredFieldsFormat();
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return storedFieldsFormat;
    }
}

This custom org.apache.lucene.codecs.Codec subclass uses Lucene410Codec as its basis, but swaps out the stored fields format with Lucene40StoredFieldsFormat, which does not employ compression.

Set this codec class in your IndexWriterConfig instance:

IndexWriterConfig cfg = new IndexWriterConfig(Version.LATEST, someAnalyzer);
cfg.setCodec(new Lucene410CodecWithNoFieldCompression());

IndexWriter myWriter = new IndexWriter(someDirectory, cfg);

This takes care of the index writing part, and segments written by myWriter will not use compression for stored fields. However, Lucene also needs to be able to open and read index segments created by this writer, which in principle uses an entirely new custom codec. Lucene locates codec classes on the classpath through the Java service provider interface (SPI) mechanism, where codecs are identified by name. Our custom codec uses the name "Lucene410CodecWithNoFieldCompression" (passed to the FilterCodec superclass constructor in the example class above).

Make the class available through SPI by creating a file:
META-INF/services/org.apache.lucene.codecs.Codec
with the following contents:

org.foo.search.Lucene410CodecWithNoFieldCompression

This file itself should be available on the classpath, e.g. in your application jar or war file. Lucene will now be able to open and read index segments created by myWriter.
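As an aside, Lucene's codec discovery is ordinary Java SPI, and the lookup mechanism can be illustrated with JDK classes alone. The following is a toy sketch, not Lucene code: the names SpiLookupDemo, the nested Codec interface and MyCodec are hypothetical stand-ins. It writes a provider-configuration file at runtime (named after the service interface, just like META-INF/services/org.apache.lucene.codecs.Codec), puts it on a class loader's classpath, and lets ServiceLoader discover the implementation listed in it:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class SpiLookupDemo {

    // Toy service interface standing in for org.apache.lucene.codecs.Codec.
    public interface Codec {
        String name();
    }

    // Toy implementation standing in for the custom codec class.
    public static class MyCodec implements Codec {
        public String name() { return "MyCodec"; }
    }

    // Writes a provider-configuration file named after the service interface
    // and asks ServiceLoader to instantiate every implementation listed in it.
    public static List<String> discover() {
        try {
            Path dir = Files.createTempDirectory("spi-demo");
            Path services = Files.createDirectories(dir.resolve("META-INF/services"));
            Files.write(services.resolve(Codec.class.getName()),
                        MyCodec.class.getName().getBytes());

            List<String> names = new ArrayList<>();
            try (URLClassLoader cl = new URLClassLoader(
                    new URL[] { dir.toUri().toURL() },
                    SpiLookupDemo.class.getClassLoader())) {
                for (Codec codec : ServiceLoader.load(Codec.class, cl)) {
                    names.add(codec.name());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Discovered: " + discover());
    }
}
```

Lucene performs the equivalent lookup by name (via Codec.forName) when it opens a segment, which is why the SPI metadata file must be on the classpath of every process that reads the index, not just the one that writes it.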

You should be aware that generic Lucene tools such as Luke will not be able to open indices created with your custom codec, since it is unknown to them. You can, however, provide the codec class by adding it to their classpath, along with the SPI metadata file (assuming the tool is Java-based).

Difference in performance and size

We observed a clear reduction in load average on our busiest site’s servers, and most queries have had their total execution time reduced. At the extreme end, some heavy data report queries that iterate over many index documents had their total execution time reduced by up to 75%. We suspect this is because they access many small stored fields across many docs, a pattern that compression hit pretty badly. As for size, one of our larger indices grew by only 80% without compression, which is no problem for us.

Links

  1. Blog posts “Efficient compressed stored fields with Lucene” and “Stored fields compression in Lucene 4.1” by Adrien Grand.
  2. https://issues.apache.org/jira/browse/LUCENE-4226
  3. https://issues.apache.org/jira/browse/LUCENE-4509
  4. https://issues.apache.org/jira/browse/LUCENE-5914

Distrowatching with Bittorrent

Updated ranking as of 25th of April:

  1. ubuntu-14.04.2-desktop-amd64.iso, ratio: 139
  2. ubuntu-14.04.2-desktop-i386.iso, ratio: 138
  3. ubuntu-14.10-desktop-i386.iso, ratio: 93.9
  4. ubuntu-14.10-desktop-amd64.iso, ratio: 87.0
  5. linuxmint-17.1-cinnamon-64bit.iso, ratio: 81.5
  6. debian-7.8.0-amd64-DVD-1.iso, ratio: 31.5
  7. Fedora-Live-Workstation-x86_64-21, ratio: 25.6
  8. debian-7.8.0-amd64-DVD-2.iso, ratio: 16.0
  9. debian-7.8.0-amd64-DVD-3.iso, ratio: 15.7
  10. Fedora-Live-Workstation-i686-21, ratio: 13.3
  11. debian-update-7.8.0-amd64-DVD-1.iso, ratio: 10.5
  12. debian-update-7.8.0-amd64-DVD-2.iso, ratio: 8.90

Total running time: 21 days, total uploaded: 1.04 terabytes.

Originally posted:

I have plenty of spare bandwidth at home, so I’ve been seeding a small selection of popular Linux ISOs via Bittorrent continuously for about 12 days now. The upload cap was set to 1024 KB/s, divided equally amongst all torrents (this is only about an eighth of my total uplink capacity). Here are the results of the popularity contest as of now:

  1. ubuntu-14.04.2-desktop-amd64.iso, ratio: 93.3
  2. ubuntu-14.04.2-desktop-i386.iso, ratio: 83.5
  3. ubuntu-14.10-desktop-i386.iso, ratio: 57.1
  4. ubuntu-14.10-desktop-amd64.iso, ratio: 53.0
  5. linuxmint-17.1-cinnamon-64bit.iso, ratio: 46.1
  6. debian-7.8.0-amd64-DVD-1.iso, ratio: 18.3
  7. Fedora-Live-Workstation-x86_64-21, ratio: 15.6
  8. debian-7.8.0-amd64-DVD-3.iso, ratio: 10.1
  9. debian-7.8.0-amd64-DVD-2.iso, ratio: 9.48
  10. Fedora-Live-Workstation-i686-21, ratio: 7.82
  11. debian-update-7.8.0-amd64-DVD-2.iso, ratio: 6.13
  12. debian-update-7.8.0-amd64-DVD-1.iso, ratio: 6.06

A total of 636.0 GB has been uploaded. These are Transmission stats obtained at the time of writing. Though not statistically significant by any means, it is still interesting to note that Ubuntu seems more popular than Linux Mint on Bittorrent (contrary to what distrowatch.org has to say about it). Also, the LTS version of Ubuntu is more popular than the current 14.10 stable release. (I should add that the ratios of the Debian DVD ISOs cannot be directly compared, since those images are significantly larger. And the Linux Mint MATE edition is not present at all.)

The list happens to be in accordance with my recommendation to anyone wanting to try Linux, specifically Ubuntu, for the first time: go for the LTS version. (Recent Linux Mint is also based on Ubuntu LTS.) Many years of experience have taught me that the interim releases have more bugs, more annoyances and less polish. Sure, you learn a lot by fixing problems, but it’s perhaps not the best first-time experience.