Lucene is a great tool for retrieving fragmented information, and it even covers the fragmenting process itself (the Analyzer). So using the Lucene engine to retrieve molecules from a cheminformatics database seemed an intuitive approach. ChemSink's about page (http://www.chemsink.com/about/) also mentions that "The chemical search uses Open Babel, Lucene, and MySQL". I recently worked this out and deployed it on production services.
How it works
The concepts map onto each other as listed below, and the implementation is not hard. My cheminformatics platforms are based on .NET, so I use Lucene.NET.
- A molecule is a Document in Lucene.
- A fingerprint is a Field. That means other information, or even another set of fingerprints, can be stored in further fields and queried.
- A bit in the fingerprint is a Term.
- A substructure search asks Lucene for the molecules in which every bit set in the query molecule's fingerprint is also set.
- A similarity search finds the most relevant molecules.
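The mapping above can be sketched without Lucene at all. The following is a minimal illustration in plain Python, not the Lucene.NET code: the term naming (`b<position>`), the dictionary standing in for the index, and the function names are all assumptions made for the example. In Lucene itself, the "every query bit is required" rule would most likely be a boolean query in which each bit's term is a mandatory clause.

```python
def fingerprint_terms(bits):
    """Turn the positions of the set bits into term strings, one term per bit."""
    return {f"b{i}" for i in bits}


def substructure_candidates(index, query_bits):
    """Return the IDs of molecules whose fingerprint contains every query bit."""
    required = fingerprint_terms(query_bits)
    # set <= set is a subset test: all required terms must be present
    return sorted(mol_id for mol_id, terms in index.items() if required <= terms)


# A toy "index": molecule ID -> terms of its fingerprint field (hypothetical data)
index = {
    "mol-1": fingerprint_terms([1, 4, 7, 9]),
    "mol-2": fingerprint_terms([1, 4]),
    "mol-3": fingerprint_terms([2, 4, 7]),
}

hits = substructure_candidates(index, [1, 4])
print(hits)
```

A fingerprint match of this kind only screens for candidates; two molecules can share all query bits without one containing the other as a substructure, so the hits usually still need an exact structure match afterwards.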
Similarity and Lucene
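The most popular similarity measure in use is the Tanimoto coefficient, written as

$$T(A, B) = \frac{c}{a + b - c}$$

where a and b are the numbers of bits set in the fingerprints of molecules A and B, and c is the number of bits set in both.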

As I understand it, this coefficient is so widely used mostly because it is simple and fast to compute, which matters in time-critical cases such as online searching.
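To show how little work the Tanimoto coefficient needs, here is a small sketch in Python. It treats a fingerprint as the collection of its set bit positions and divides the number of shared bits by the number of bits set in either fingerprint. The function name and the convention of returning 0.0 for two empty fingerprints are choices made for this example.

```python
def tanimoto(a_bits, b_bits):
    """Tanimoto coefficient of two fingerprints given as set bit positions."""
    a, b = set(a_bits), set(b_bits)
    if not a and not b:
        return 0.0  # convention chosen here; the ratio is undefined for two empty sets
    common = len(a & b)
    return common / (len(a) + len(b) - common)


print(tanimoto([1, 2, 3, 4], [3, 4, 5]))
```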
Lucene employs cosine similarity under the Vector Space Model (VSM) of information retrieval:
http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/search/Similarity.html


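Lucene's practical scoring function, as given in the documentation linked above, is

$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \right)$$

The norm(t,d) factor is itself a product of three parts: the document boost, the field boosts, and lengthNorm. The lengthNorm part accounts for the length of the document, which here is the total number of fingerprint bits set for a molecule.
The default Similarity implementation in Lucene can be used as-is to retrieve similar molecules.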
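To see why a cosine-style score works for fingerprints, here is a simplified sketch in Python. It is not Lucene's scoring code: it keeps only the overlap count and a 1/sqrt(length) normalization for both the query and the molecule, and it leaves out idf, coord, boosts and everything else Lucene adds. All function names and the sample data are made up for the example.

```python
import math


def length_norm(bits_set):
    """1 / sqrt(number of bits set), the length normalization discussed in this post."""
    return 1.0 / math.sqrt(bits_set)


def cosine_score(query_bits, mol_bits):
    """Cosine similarity of two binary fingerprints given as set bit positions."""
    q, d = set(query_bits), set(mol_bits)
    if not q or not d:
        return 0.0
    # shared bits, scaled down by the "length" of the query and of the molecule
    return len(q & d) * length_norm(len(q)) * length_norm(len(d))


def rank(query_bits, molecules):
    """Order molecule IDs from most to least similar to the query."""
    return sorted(molecules,
                  key=lambda m: cosine_score(query_bits, molecules[m]),
                  reverse=True)


molecules = {"a": [1, 2, 3, 4], "b": [1, 2, 9], "c": [7, 8]}
print(rank([1, 2, 3, 4], molecules))
```

The molecule-side factor is the part that Lucene pre-computes per document, which is why its stored precision matters for the ranking.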
More about the lengthNorm
During my research phase I found that several different target molecules received exactly the same score. It turned out that the lengthNorm values were identical for these molecules, even though they have different numbers of fingerprint bits set. The lengthNorm is not calculated during the search phase; it is pre-calculated and stored during the indexing phase. Finally I found this sentence in the Lucene documentation:
"However the resulted norm value is encoded as a single byte before being stored… comes with the price of precision loss"
So my molecules are treated as documents of the same length when they are queried.
| Molecule ID | # of bits set | lengthNorm calculated in indexing phase | lengthNorm stored |
| --- | --- | --- | --- |
| 11096 | 23 | 0.2085 | 0.1875 |
| 11578 | 27 | 0.19245 | 0.1875 |
| 201736 | 28 | 0.18898 | 0.1875 |
The formula used to calculate lengthNorm is 1.0 / Math.Sqrt(bits_set), where bits_set is the number of fingerprint bits set for the molecule.
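The precision loss can be reproduced outside Lucene. The sketch below is a Python port, written from my reading of the byte encoding Lucene uses for norms (the "315" variant in its SmallFloat helper); treat the bit-twiddling details and the Python function names as assumptions rather than as the library's API. The encoder simply truncates the 32-bit float pattern, so nearby lengthNorm values collapse onto the same byte.

```python
import math
import struct

FZERO = (63 - 15) << 3  # offset used by the "315" variant of the encoding


def float_to_byte315(f):
    """Squeeze a float into one byte by truncating its 32-bit pattern."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> 21  # drop the low 21 bits of the float
    if small <= FZERO:
        return 0 if bits <= 0 else 1  # zero or negative -> 0, underflow -> smallest code
    if small >= FZERO + 0x100:
        return 255  # overflow -> largest code
    return small - FZERO


def byte315_to_float(b):
    """Expand a stored byte back into the float it stands for."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_length_norm(bits_set):
    """lengthNorm after one round trip through the single-byte encoding."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(bits_set)))


for bits_set in (23, 27, 28):
    print(bits_set, round(1.0 / math.sqrt(bits_set), 5), stored_length_norm(bits_set))
```

Under these assumptions every bit count from 21 to 28 ends up with a stored value of 0.1875, which matches the three molecules in the table.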
Fortunately, the DefaultSimilarity class can be overridden, including its lengthNorm function. Given the distribution of the number of fingerprint bits set shown in Duan Lian's diagram, one can define a function that maps this distribution onto a flatter one and use it to calculate lengthNorm, taking full advantage of the precision-limited stored value.
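One possible reading of that idea is histogram equalization: use the cumulative distribution of bit counts in the database to pick which of the available byte codes a molecule gets, so that densely populated bit counts are spread over more codes. The Python sketch below is only an illustration of that reading. The function names, the sample distribution and the chosen code range (60 to 124) are all hypothetical, and the decoder assumes the same single-byte norm scheme as Lucene's SmallFloat "315" variant. In Lucene.NET the returned function would correspond to an overridden lengthNorm in a DefaultSimilarity subclass.

```python
import struct
from bisect import bisect_right


def byte315_to_float(b):
    """Decode a norm byte (assumed scheme: Lucene's SmallFloat '315' variant)."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def make_length_norm(sample_bit_counts, lo_code=60, hi_code=124):
    """Build a lengthNorm(bits_set) function by equalizing over byte codes.

    sample_bit_counts: bit counts of (a sample of) the indexed molecules.
    lo_code, hi_code:  hypothetical range of byte codes to spread the norms over.
    """
    ordered = sorted(sample_bit_counts)
    total = len(ordered)

    def length_norm(bits_set):
        # fraction of sample molecules with at most this many bits set
        cdf = bisect_right(ordered, bits_set) / total
        # darker fingerprints (higher cdf) get a lower code, i.e. a smaller norm
        code = hi_code - round(cdf * (hi_code - lo_code))
        # return a value that the byte encoding can store without loss
        return byte315_to_float(code)

    return length_norm


# Hypothetical distribution: one molecule for each bit count from 10 to 59
length_norm = make_length_norm(range(10, 60))
for bits_set in (23, 27, 28):
    print(bits_set, length_norm(bits_set))
```

With this sample the three bit counts that collapsed before now get three different norms. The price is that the norms no longer follow 1 / sqrt(bits_set), so the width of the code range is a trade-off between telling molecules apart and distorting the scores.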
Here is the distribution of fingerprint darkness in my database of 80,000 commercial compounds.
