A fast, local search engine for your text files. Query your notes and get a ranked list of the best matches.
Those tools are all great, and I use them all the time! But they search for literal text matches, usually line-by-line. That’s often what I want, but sometimes I want to know, “what notes are most similar to this query, or to this other note?”
If I search for “chunky bacon,” I still want to see documents that talk about “chunks of bacon.” And, below those, I probably want to see notes that discuss regular “bacon,” even if it’s not chunky.
docsim
uses a few different information retrieval algorithms to provide a ranked
list of text documents. It takes term frequency into account, weighs rare words
more heavily, uses stoplists to ignore common terms, and (optionally) stems
English words (so “spinning” can match “spinner”).
It’s also slower and more memory-intensive than those other tools, of course, since it does more work. But performance is a goal, and on a mid-range machine it’ll process a few thousand documents without notable lag.
docsim.el
relies on the docsim command-line tool to do most of the work, so
you’ll need to install that on your $PATH
first. It’s a small Go utility with
precompiled binaries available for most popular platforms, or you can install it
from source.
Once that’s done, just install docsim.el
from MELPA like any other package. With
use-package
:
(use-package docsim
:ensure t
:bind (("C-c n s" . docsim-search)
("C-c n d" . docsim-search-buffer))
:custom
(docsim-search-paths '("~/directory/where/you/keep/your/notes")))
Or, if you’d rather install manually, add docsim.el
to your load path and:
(require 'docsim)
(setq docsim-search-paths '("~/directory/where/you/keep/your/notes"))
(global-set-key (kbd "C-c n s") 'docsim-search)
(global-set-key (kbd "C-c n d") 'docsim-search-buffer)
Those keybindings are just made up, of course. Use whatever you’d like!
Once you’ve got docsim.el
installed and configured, you’ll probably just use:
docsim-search
- Search your notes for a query.
docsim-search-buffer
- Parse the current buffer and list similar notes.
If you link your notes together into a zettelkasten (maybe you use org-roam
or
denote
?), docsim-search-buffer
can help you surface similar notes that “should”
be linked together. That can help you find connections you’d otherwise miss and
makes it easier to add notes into your personal knowledge graph.
In particular, the maintainer of this package 👋 happens to use denote
to
manage their notes, so docsim.el
includes an option to exclude notes that you’ve
already linked from your search results. Configure that like so:
(setq docsim-search-paths (list denote-directory))
(setq docsim-omit-denote-links t)
WARNING: docsim
doesn’t respect .ignore
or .gitignore
files yet, so
it’ll try to search through .git
directories, node_modules
, and so on. That
should be fixed in the near future.
Because its author is a monoglot (and because it was easy for them to find a
good Go stemming library for English) docsim
defaults to using English stoplists
and stemming to improve accuracy. If you mostly write in a language other than
English you might need to tweak things a bit to get comparable results.
First, disable those English-specific features:
(setq docsim-assume-english nil)
Next, optionally, provide a custom stoplist of common words in your language.
This is a text file of words that don’t carry much semantic meaning (like “as”
or “because” in English) separated by newlines. Words like this would probably
already be mostly ignored by docsim
, because words are weighted by how rarely
they show up, but if you add a stoplist they’ll be skipped completely.
(setq docsim-stoplist-file "~/path/to/my/custom/stoplist.txt")
There’s no universal authority on what constitutes “a good stoplist,” but here’s a repo of stoplists for 29 languages that might be helpful!
By default docsim.el
displays the “title” of each file in a results buffer,
instead of just a filename. It tries to get a decent title: it looks for the
#+TITLE:
in Org files, checks for a title:
key in YAML frontmatter in
everything else, but otherwise just shows a relative path. Parsing titles from
every possible file type is an enormous task. It’s not docsim
’s job.
If the default titles aren’t working for you, you can write your own and point
docsim-get-title-function
to it. Your function should take an absolute
file path as an argument and return a string that represents the title of that
file.
For example, to show the name of each search result file without its directory and in ALL CAPS:
(defun my-title-parser (path)
(upcase (file-name-nondirectory path)))
(setq docsim-get-title-function #'my-title-parser)