Skip to content

Latest commit

 

History

History
224 lines (185 loc) · 8.65 KB

README.adoc

File metadata and controls

224 lines (185 loc) · 8.65 KB

Javadoc Search

A fast server-side search engine for the Javadoc API of selected JDK releases and other libraries.

Can also be added to a bowser as custom search engine:

Warning

The above is a fairly weak server and not suitable for benchmarking.

You can also replace jdk-latest which always points to the latest full release with one of the following:

  • jdk-11 for the JDK 11 API

  • jdk-17 for the JDK 17 API

  • junit5-latest for the current JUnit 5 API

The same works for the website in the browser as well.

Details

Realized as Quarkus web application mainly with Qute, RESTEasy and Microstream. The client-side uses Bootstrap and htmx.

The search engine is implemented as a sparse radix trie generated from the xxx-search-index.js files of a Javadoc site. It is also mixed with a suffix tree for name segments of the Javadoc elements.

The following example shows the trie generated for java.lang.Long and java.util.OptionalLong, both in the java.base module. The trie will be compressed such that the equally colored nodes point to the same object.

example trie
Figure 1. Example trie

To search in that trie like one can in the official Javadoc implementation, the nodes in the trie are being traversed recursively based on a custom regular expression matcher implementation.

For example, j.u.OpL gets transformed into a regular expression equivalent to

j\p{javaLowerCase}*+\s*+*+\.\s*+u\p{javaLowerCase}*+\s*+*+\.\s*+Op\p{javaLowerCase}*+L[^]*+

and gets compiled into:

CompiledRegex [
  0x0000006a,  // matches 'j'
  0xc0000000,  // star referencing char class #0 -> "\p{javaLowerCase}"
  0xc0000001,  // star referencing char class #1 -> "\s"
  0x80000378,  // star referencing literal '•'
  0x0000002e,  // matches '.'
  0xc0000001,  // star referencing char class #1 -> "\s"
  0x00000075,  // matches 'u'
  0xc0000000,  // star referencing char class #0 -> "\p{javaLowerCase}"
  0xc0000001,  // star referencing char class #1 -> "\s"
  0x80000378,  // star referencing literal '•'
  0x0000002e,  // matches '.'
  0xc0000001,  // star referencing char class #1 -> "\s"
  0x0000004f,  // matches 'O'
  0x00000070,  // matches 'p'
  0xc0000000,  // star referencing char class #0 -> "\p{javaLowerCase}"
  0x0000004c,  // matches 'L'
  0xc0000002   // star referencing char class #2 -> "[^•]"
]

These instructions are an array of int and they are accompanied by an array of CharPredicates, which are functions of type char → boolean and represent the character classes in the custom regex implementation.

The instructions are encoded as follows:

Bit 31 Bit 30 Bits 29 …​ 0

Wrapped by a star?

Represents a char class?

char value or CharPredicate index

This compiled custom regex implementation has the disadvantage that it does not support nested groups or repetitions of more than a single character. This is however not an issue for our purpose, as we only need to star single characters.

The advantage, however, is that all its state can be stored in a single long, which can easily be passed around methods and avoids the generation of objects.

This long value is composed as follows:

Bit 63 Bits 62 …​ 32 Bit 31 Bits 30 …​ 0

Did the match fail?

Number of stars that matched at least one character

Did the current star already match one character?

Next index in the instruction array

As the state is only a long that is stored by the user and this possessive matcher implementation can advance the state step by step consuming a single character, collecting all matches by traversing the trie and branching at the transitions is fairly simple.

All of that allows for the following search syntax:

  • Query ja.l.o will find:

    • java.base/java.lang.Object

    • java.base/java.lang.Override

    • java.base/java.lang.OutOfMemoryError

  • Query LDT will find:

    • java.base/java.time.LocalDateTime

    • java.base/java.time.chrono.ChronoLocalDateTime

    • …​

  • Query math max will find:

    • java.base/java.lang.Math.max(double, double)

    • java.base/java.lang.Math.max(long, long)

    • …​

  • Query j.u.s.Col.jo will find:

    • java.base/java.util.stream.Collectors.joining()

    • java.base/java.util.stream.Collectors.joining(CharSequence, CharSequence, CharSequence)

    • java.base/java.util.stream.Collectors.joining(CharSequence)

  • Query .C~rs will find:

    • java.base/java.util.stream.Collectors

    • …​

The search engine implementation is mostly compliant with the Javadoc Search Specification but since major changes got introduced with JDK 19 (search terms can be separated by spaces), I felt free to include custom behavior.

Note

One example for that is ~, which allows skipping lower case characters until the following part of the search term matches. Like the custom regular expression implementation, ~ is possessive and the engine will not backtrack to find a match.

As a result of the trie and simplified regex matching combination, the search is very fast and rarely exceeds 20 ms. Most of the time, search completes in about 1 ms. The longer the search term, the faster the results are available.

The matching results are currently ranked by:

  • Where the match of the query starts.
    A match starting at the beginning is the best case, a match starting at or after a separator comes next and matches that start within identifiers (upper-case letters, underscore) have the lowest rank. java.desktop/javax.print.attribute.SetOfIntegerSyntax is therefore a better match for Set than java.base/java.util.AbstractSet.

  • How good they match the query.
    java.base/java.io.File.isHidden() is a better match for File.isH than java.base/java.nio.file.Files.isHidden(Path) because it does not require further lower case characters after File.

  • Natural order of the entries.
    Results with smaller char values at the same position come first. The implementation is very much like String.compareTo on the last segment of the qualified name.

Case insensitive matching is only performed if case-sensitive did not yield any results.

Building

This project requires at least JDK 17.

Building should be straightforward since this is a regular Maven project.

  • mvn clean to clean up generated artifacts (does not remove the database folder)

  • mvn test runs all tests

  • mvn quarkus:dev starts the Quarkus dev mode (live reload)

  • mvn package bundles the web application using the prod profile into target/quarkus-app

It might be helpful to add the missing search-index files in src/main/resources/net/maisikoleni/javadoc/service/jdk-latest (see jdk-index-files-go-here.txt there for details), but this is not required as they are then fetched from the official site and cached in database/javadoc-indexes.

To run outside Maven, --add-exports java.base/jdk.internal.misc=ALL-UNNAMED is required as MicroStream requires Unsafe at the moment.

License

Javadoc Search was created by Christian Femers in 2022 and the program and the accompanying materials are made available under the terms of the European Union Public License 1.2 (see the "LICENSE" file, SPDX-License-Identifier: EUPL-1.2-or-later) with the following exceptions:

  • src/test/java/net/maisikoleni/javadoc/parser/JdkUrlGeneration.java (GPL-2.0-with-classpath-exception)