Thank you for the great write-up, especially the reasons for going with trigrams. Personally, I would have created a domain-specific stemmer to keep the tokens 1:1 with the input and preserve positions that way as well. It would have to pass special characters through untouched, as you note. From there, roaring bitmaps would offer efficient boolean operations like AND, OR, and NOT in queries.
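To make that a bit more concrete, here is a rough sketch of the idea, not anything taken from the post. Everything in it is an assumption for illustration: `tokenize` is a placeholder for the stemmer (it only lowercases and splits on whitespace, so special characters pass through), `BitmapIndex` and `doc_ids` are hypothetical names, and plain Python integers stand in for roaring bitmaps. A real build would swap in a roaring bitmap library for the compression.

```python
from collections import defaultdict


def tokenize(text):
    # Placeholder for a domain-specific stemmer (assumption): one output
    # token per whitespace-separated input token, special characters kept.
    return [tok.lower() for tok in text.split()]


class BitmapIndex:
    """Toy inverted index; Python ints stand in for roaring bitmaps."""

    def __init__(self):
        self.postings = defaultdict(int)    # token -> bitmap of doc ids
        self.positions = defaultdict(dict)  # token -> {doc_id: [positions]}
        self.doc_count = 0

    def add(self, text):
        doc_id = self.doc_count
        self.doc_count += 1
        for pos, tok in enumerate(tokenize(text)):
            self.postings[tok] |= 1 << doc_id
            self.positions[tok].setdefault(doc_id, []).append(pos)
        return doc_id

    def term(self, tok):
        # Unknown tokens map to the empty bitmap.
        return self.postings.get(tok.lower(), 0)

    def negate(self, bitmap):
        # NOT is taken relative to the set of all indexed documents.
        all_docs = (1 << self.doc_count) - 1
        return all_docs & ~bitmap


def doc_ids(bitmap):
    # Expand a bitmap into a sorted list of document ids.
    ids, i = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(i)
        bitmap >>= 1
        i += 1
    return ids


idx = BitmapIndex()
idx.add("for (i++; i < n; i++)")  # doc 0
idx.add("while i < n")            # doc 1
idx.add("return n;")              # doc 2

both = doc_ids(idx.term("i") & idx.term("<"))                    # AND
either = doc_ids(idx.term("for") | idx.term("return"))           # OR
without = doc_ids(idx.term("<") & idx.negate(idx.term("for")))   # AND NOT
```

Because each token keeps its original position, `idx.positions["<"][0]` still points at the fourth token of the first document, which is what would make phrase or proximity checks possible after the bitmap step has narrowed the candidates.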
It’s great to see someone else getting into indexing/IR! In my experience, once organizations find out you have done it, they throw a bunch more at you, like telling an acquaintance you’re good with computers and suddenly you’re their 24/7 tech support. I hope your next steps with it go well!
Thank you. If you do end up exploring different approaches, please let me know. I'm always happy to try out new things.
If that’s the case, I shall sit back and wait for the huge signing offers from other places to roll in.