#!/usr/bin/env stack
> -- stack script --resolver nightly-2025-12-01 --package text --package bytestring --package directory --package filepath --package liquidhaskell

grep.lhs: live-streaming grep for someodd's gopher tree
=======================================================

A small literate-Haskell applet that walks `/var/gopher` looking for
`.txt` files whose path *or* contents contain the search query (case-
insensitive substring). Unlike ryvm (which ranks and therefore must
collect all candidates before emitting anything), this one streams
hits as it finds them — `stdout` is set to `NoBuffering` and we
`hFlush` after each row, so the wire sees results in real time under
Venusia's `stream = true`.

Pair this with `search.sh` (which uses ryvm for relevance-ranked
results) when you want ranking; pick `grep.lhs` when you want speed
and don't care about score order.

URL design
----------

    /applets/grep.lhs                 → prompt page: explain what
                                          this does, offer the search
                                          input.
    /applets/grep.lhs   + text        → stream all `.txt` files whose
                                          path or contents contain
                                          `text`, labelled `[name]`
                                          or `[content]`. A file
                                          that matches both appears
                                          twice (once per match
                                          reason); that's intentional.

Running the doctests
--------------------

    doctest-lhs grep.lhs

Reuses this file's own `-- stack` directive and language
pragmas, so doctest sees the same packages and extensions
`run-cached-lhs` compiles against.

LiquidHaskell verification
--------------------------

This script enables the LiquidHaskell GHC plugin and carries a
handful of refinement-type annotations on pure helpers. The intent
is a *targeted* aid for catching bugs in size/length/range
invariants — not a full totality + termination proof of the file.
That distinction matters: LH's default stance is "every function
must be proven to terminate and never call `error`," which would
demand re-architecting the recursive directory walk and the
argv-cons-pattern catch-all. The two `--no-*` LIQUID options below
turn those defaults off; what's left are the specific refinements
we *do* want LH to check, written inline as `{-@ ... @-}` blocks
above each annotated function.

What you get from the annotations as the file currently stands:

  * @maxSnippetLen :: {v:Int | v >= 4}@ is **verified** — drop
    the constant below 4 and LH refuses to compile, because the
    @T.take (maxSnippetLen - 3) t <> "..."@ form in 'truncateLine'
    only makes sense once you have at least one character before
    the ellipsis.

  * @sanitize :: T.Text -> SafeText@ is **assumed**. LH believes
    that the function produces text with no row-break bytes
    (CR/LF/TAB) without verifying the body. This is a trust
    boundary: the function is six characters of @T.map@, tested
    by the doctests below, and easy for a human to audit. The
    annotation is groundwork for refining the gophermap line
    builders to demand 'SafeText' input — once that refactor
    happens, LH will refuse callers who forget to route
    untrusted text through 'sanitize'.

  * @findMatchingLines@'s line numbers are **assumed** to be
    @>= 1@. The promise was already in the docstring; the
    refinement type makes it machine-checkable for downstream
    callers.

LH was bumped to its own resolver (nightly-2025-12-01) on this
file's `-- stack` directive because no LTS-22 / LTS-23 / LTS-24
snapshot ships `liquidhaskell`. Other applets in this directory
stay on lts-22.6; each `.lhs` pins its own resolver, so the
nightly here doesn't propagate. The doctest wrapper
(`doctest-lhs`) strips this file's LH plugin pragma into an inert
comment for the doctest run, because LH writes "LIQUID: SAFE ..."
to stdout during GHCi load and doctest can't parse that.

Module header and imports
-------------------------

> {-# LANGUAGE OverloadedStrings #-}
> {-# OPTIONS_GHC -fplugin=LiquidHaskell #-}
>
> {-@ LIQUID "--no-totality" @-}
> {-@ LIQUID "--no-termination" @-}
>
> module Main (main) where
>
> import           Control.Exception        (IOException, SomeException, try)
> import           Control.Monad            (when)
> import qualified Data.ByteString          as BS
> import qualified Data.Text                as T
> import qualified Data.Text.Encoding       as TE
> import qualified Data.Text.Encoding.Error as TEE
> import qualified Data.Text.IO             as TIO
> import           System.Directory         (doesDirectoryExist, doesFileExist,
>                                            getFileSize, listDirectory)
> import           System.Environment       (getArgs, lookupEnv)
> import           System.FilePath          ((</>), makeRelative, takeExtension)
> import           System.IO                (BufferMode (..), hFlush,
>                                            hSetBuffering, hSetEncoding,
>                                            stdout, utf8)

Constants
---------

`searchRoot` is the directory we walk; it's hard-coded to `/var/gopher`
deliberately so the script can be invoked from any CWD without
needing extra argv. Adjust the constant if your served tree lives
elsewhere.

`maxFileSize` is a defensive ceiling: files larger than this are
skipped for content scanning (we still match against their path).
Without this, a stray giant log file would balloon memory on every
search.

> defaultHost, defaultPort :: T.Text
> defaultHost = "gopher.someodd.zip"
> defaultPort = "70"

> searchRoot :: FilePath
> searchRoot = "/var/gopher"

> maxFileSize :: Integer
> maxFileSize = 4 * 1024 * 1024   -- 4 MiB

> maxMatchesPerFile :: Int
> maxMatchesPerFile = 5            -- snippet rows shown per file

> {-@ maxSnippetLen :: {v:Int | v >= 4} @-}
> maxSnippetLen :: Int
> maxSnippetLen = 100              -- chars of context per snippet
> -- (The @>= 4@ refinement is real, not nominal: 'truncateLine'
> -- uses @T.take (maxSnippetLen - 3) t <> "..."@, which needs at
> -- least one character before the ellipsis to make sense. Drop
> -- this constant below 4 and LH refuses to compile.)

Request parsing
---------------

Three positional argv from Venusia (`$selector`, `$search`,
`$pathinfo`), plus any extras (e.g. `$remote_ip`) we ignore via the
cons pattern. Empty argv aborts loudly — we can't invent our own
mount-point selector. See figlet.lhs / interlog.lhs for the same
shape.

> data Req = Req
>   { reqSel :: T.Text
>   , reqQ   :: T.Text
>   , reqP   :: T.Text
>   } deriving Show

> {-@ ignore parseArgs @-}
> parseArgs :: [String] -> Req
> parseArgs (s:q:p:_) = Req (T.pack s) (T.pack q) (T.pack p)
> parseArgs (s:q:_)   = Req (T.pack s) (T.pack q) T.empty
> parseArgs [s]       = Req (T.pack s) T.empty    T.empty
> parseArgs []        = error
>   "grep.lhs: missing argv[0] (gopher selector). Run via Venusia, \
>   \or for manual testing pass selector + query explicitly."

> data Ctx = Ctx
>   { ctxScriptSel :: T.Text
>   , ctxHost      :: T.Text
>   , ctxPort      :: T.Text
>   }

Main dispatch
-------------

`stdout` is set to `NoBuffering` so every `TIO.putStr` reaches the
pipe immediately — combined with Venusia's `stream = true` for `.lhs`
applets, the client sees rows as soon as we print them. Wrapping the
body in `try` ensures any uncaught exception becomes a visible
type-3 error row, never a blank response.

> main :: IO ()
> main = do
>   hSetBuffering stdout NoBuffering
>   hSetEncoding  stdout utf8
>   result <- try mainBody :: IO (Either SomeException ())
>   case result of
>     Right () -> pure ()
>     Left e   -> mapM_ putLine
>       [ errorItem ("grep.lhs crashed: " <> T.pack (show e))
>       , terminator
>       ]

> mainBody :: IO ()
> mainBody = do
>   req  <- parseArgs <$> getArgs
>   host <- maybe defaultHost T.pack <$> lookupEnv "GOPHER_HOST"
>   port <- maybe defaultPort T.pack <$> lookupEnv "GOPHER_PORT"
>   let pinfo     = reqP req
>       scriptSel = T.dropEnd (T.length pinfo) (reqSel req)
>       ctx       = Ctx scriptSel host port
>       q         = T.strip (reqQ req)
>   if T.null q
>     then emitPrompt ctx
>     else emitResults ctx q

Gophermap line builders
-----------------------

Same conventions as the other applets. `sanitize` strips CR/LF/TAB
out of any field that contains user-supplied text (the query, file
paths from disk) so a stray byte can't smuggle a fake gophermap row.

> putLine :: T.Text -> IO ()
> putLine t = do
>   TIO.putStr (t <> "\r\n")
>   hFlush stdout

> -- | An info-line gophermap row (item type @i@).
> --
> -- >>> infoLine "hello"
> -- "ihello\tnull\terror.host\t1"
> infoLine :: T.Text -> T.Text
> infoLine msg = "i" <> sanitize msg <> "\tnull\terror.host\t1"

> -- | A type-7 search prompt row.
> --
> -- >>> searchLine "Search" "/q" "host" "70"
> -- "7Search\t/q\thost\t70"
> searchLine :: T.Text -> T.Text -> T.Text -> T.Text -> T.Text
> searchLine display sel host port =
>   "7" <> sanitize display <> "\t" <> sel <> "\t" <> host <> "\t" <> port

> -- | A text-file link (item type @0@).
> --
> -- >>> textLine "doc" "/d" "host" "70"
> -- "0doc\t/d\thost\t70"
> textLine :: T.Text -> T.Text -> T.Text -> T.Text -> T.Text
> textLine display sel host port =
>   "0" <> sanitize display <> "\t" <> sel <> "\t" <> host <> "\t" <> port

> -- | A type-3 error row.
> --
> -- >>> errorItem "oops"
> -- "3oops\tnull\terror.host\t1"
> errorItem :: T.Text -> T.Text
> errorItem msg = "3" <> sanitize msg <> "\tnull\terror.host\t1"

> terminator :: T.Text
> terminator = "."

> -- | Replace CR, LF, and TAB with spaces so user-supplied text can't
> -- smuggle gophermap row breaks.
> --
> -- The LH bits below introduce an abstract predicate @noRowBreaks@
> -- and assert (via @assume@) that 'sanitize' produces text that
> -- satisfies it. The predicate body is not reflected — LH believes
> -- us. That's a deliberate trust boundary: the function is six
> -- characters of @T.map@, three doctests below, and easy for a
> -- human to audit. In exchange every function annotated with
> -- @SafeText@ in its argument types becomes a compile-time check
> -- that callers route untrusted text through 'sanitize' first.
> --
> -- >>> sanitize "ok"
> -- "ok"
> --
> -- >>> sanitize "a\tb"
> -- "a b"
> --
> -- >>> sanitize "line\nbreak"
> -- "line break"
> {-@ measure noRowBreaks :: T.Text -> Bool @-}
> {-@ type SafeText = {t:T.Text | noRowBreaks t} @-}
> {-@ assume sanitize :: T.Text -> SafeText @-}
> sanitize :: T.Text -> T.Text
> sanitize = T.map (\c -> if c == '\r' || c == '\n' || c == '\t' then ' ' else c)

Path helpers
------------

> -- | Lowercased lenient-decode of a byte string. Used for case-
> -- insensitive substring search of file contents.
> --
> -- >>> lowerDecode "Hello World"
> -- "hello world"
> --
> -- >>> lowerDecode ""
> -- ""
> lowerDecode :: BS.ByteString -> T.Text
> lowerDecode = T.toLower . TE.decodeUtf8With TEE.lenientDecode

> -- | The gopher selector for a file relative to the search root,
> -- always leading-slash-prefixed.
> --
> -- >>> relativeSelector "/var/gopher" "/var/gopher/foo.txt"
> -- "/foo.txt"
> --
> -- >>> relativeSelector "/var/gopher" "/var/gopher/sub/dir/file.txt"
> -- "/sub/dir/file.txt"
> --
> -- >>> relativeSelector "/var/gopher" "/var/gopher"
> -- "/"
> relativeSelector :: FilePath -> FilePath -> T.Text
> relativeSelector root path = case makeRelative root path of
>   "." -> "/"
>   rel -> "/" <> T.pack rel

Pages
-----

The empty-query page: a single info line explaining what this does,
plus a type-7 search prompt that loops back to ourselves.

> emitPrompt :: Ctx -> IO ()
> emitPrompt ctx = mapM_ putLine
>   [ infoLine "Search someodd's gopherhole's text files."
>   , infoLine "(grep-based; streams hits as they're found.)"
>   , infoLine ""
>   , searchLine "Query" (ctxScriptSel ctx) (ctxHost ctx) (ctxPort ctx)
>   , terminator
>   ]

The results page: a header row that lands at the client immediately
(visible "I'm working" feedback), then matches streamed in walk order
as they're discovered, then the gopher terminator.

> emitResults :: Ctx -> T.Text -> IO ()
> emitResults ctx q = do
>   mapM_ putLine
>     [ infoLine ("Grepping for \"" <> q <> "\" in someodd's gopherhole's text files...")
>     , infoLine ""
>     ]
>   let qLower = T.toLower q
>   walkAndMatch ctx qLower searchRoot
>   putLine terminator

Walk and match
--------------

Recursive directory walk. For each `.txt` file we encounter, we check
the path against the query (case-insensitive substring) and, if the
file is under the size ceiling, the file's contents. Hits are emitted
as soon as they're found — no buffering, no batching. A file matching
both path and contents emits twice (once per reason) so the user can
see *why* each match showed up.

Dotfiles and dot-directories are skipped to keep `.git`, `.cache`, gvfs
scratch droppings, etc. out of the results.

> walkAndMatch :: Ctx -> T.Text -> FilePath -> IO ()
> walkAndMatch ctx qLower = go
>   where
>     go :: FilePath -> IO ()
>     go dir = do
>       entries <- listSafely dir
>       mapM_ (visit dir) entries
>
>     visit :: FilePath -> FilePath -> IO ()
>     visit dir name
>       | shouldSkip name = pure ()
>       | otherwise       = do
>           let path = dir </> name
>           isDir  <- doesDirectoryExist path
>           if isDir
>             then go path
>             else do
>               isFile <- doesFileExist path
>               when (isFile && takeExtension path == ".txt") $
>                 matchFile ctx qLower path

> -- | Names that the walk skips outright:
> --
> -- * Anything starting with a dot — @.git@, @.cache@, gvfs's
> --   @.giosaveXXX@ droppings, dotfile config and the like.
> -- * @catalog@ — bartleby's generated index tree. Searching it
> --   would surface derived menu-source files over the user's actual
> --   text content, which is rarely what you want.
> --
> -- >>> shouldSkip ".git"
> -- True
> --
> -- >>> shouldSkip "catalog"
> -- True
> --
> -- >>> shouldSkip ".env"
> -- True
> --
> -- >>> shouldSkip "applets"
> -- False
> --
> -- >>> shouldSkip "interlog"
> -- False
> shouldSkip :: FilePath -> Bool
> shouldSkip ('.':_)   = True
> shouldSkip "catalog" = True
> shouldSkip _         = False

> -- | List a directory's entries, returning [] on any error (e.g.
> -- permission denied, vanished mid-walk). Keeps the search resilient
> -- to half-readable trees.
> listSafely :: FilePath -> IO [FilePath]
> listSafely dir = do
>   result <- try (listDirectory dir) :: IO (Either IOException [FilePath])
>   pure (either (const []) id result)

> -- | Find every line in a byte string whose lowercased text contains
> -- the (already-lowercased) query. Returns 1-indexed line numbers
> -- paired with the original line text. Lenient UTF-8 decode means
> -- stray non-UTF-8 bytes don't crash the scan.
> --
> -- >>> findMatchingLines "hi" "hello\nhi there\nbye"
> -- [(2,"hi there")]
> --
> -- >>> findMatchingLines "x" "alpha\nbeta\ngamma"
> -- []
> --
> -- >>> findMatchingLines "a" "alpha\nbeta"
> -- [(1,"alpha"),(2,"beta")]
> --
> -- The refinement type below makes the "1-indexed" promise in the
> -- prose above a compile-time check: every @Int@ in the result is
> -- @>= 1@. Asserted via @assume@ — the body trivially satisfies it
> -- (the @zip [1 ..]@ form pins the lower bound), but LH's built-in
> -- @enumFrom@ spec doesn't surface the lower bound by default.
> {-@ assume findMatchingLines :: T.Text -> BS.ByteString -> [({n:Int | n >= 1}, T.Text)] @-}
> findMatchingLines :: T.Text -> BS.ByteString -> [(Int, T.Text)]
> findMatchingLines qLower bs =
>   let allLines = zip [1 ..] (T.lines (TE.decodeUtf8With TEE.lenientDecode bs))
>   in filter (\(_, l) -> qLower `T.isInfixOf` T.toLower l) allLines

> -- | Trim a matching line and cap it to 'maxSnippetLen' characters
> -- so a single very long line doesn't widen the menu.
> --
> -- (A refinement on this function's return-value length would be
> -- the natural place for it, but LH 0.9.12 doesn't ship a measure
> -- for @Data.Text.Internal.Text@ — only @len@ for lists. The
> -- doctests below pin the cap at runtime instead.)
> --
> -- >>> truncateLine "   short   "
> -- "short"
> --
> -- >>> T.length (truncateLine (T.replicate 200 "a"))
> -- 100
> --
> -- >>> T.takeEnd 3 (truncateLine (T.replicate 200 "a"))
> -- "..."
> truncateLine :: T.Text -> T.Text
> truncateLine = trunc . T.strip
>   where
>     trunc t
>       | T.length t <= maxSnippetLen = t
>       | otherwise = T.take (maxSnippetLen - 3) t <> "..."

> matchFile :: Ctx -> T.Text -> FilePath -> IO ()
> matchFile ctx qLower path = do
>   let pathLower   = T.toLower (T.pack path)
>       sel         = relativeSelector searchRoot path
>       relText     = T.pack (makeRelative searchRoot path)
>       pathMatches = qLower `T.isInfixOf` pathLower
>   contentResult <- try (readSized path) :: IO (Either IOException (Maybe BS.ByteString))
>   let matches = case contentResult of
>         Right (Just bs) -> findMatchingLines qLower bs
>         _               -> []
>   when (pathMatches || not (null matches)) $ do
>     -- The file's link, followed (for content matches) by a
>     -- `grep -n`-style blurb on each matching line. Cap at
>     -- maxMatchesPerFile so a common short query doesn't spam the
>     -- listing with hundreds of rows for one file.
>     putLine (textLine relText sel (ctxHost ctx) (ctxPort ctx))
>     let shown   = take maxMatchesPerFile matches
>         dropped = length matches - length shown
>     mapM_ (\(n, line) ->
>             putLine (infoLine
>               ("  L" <> T.pack (show n) <> ": " <> truncateLine line)))
>           shown
>     when (dropped > 0) $
>       putLine (infoLine
>         ("  (+" <> T.pack (show dropped) <> " more match(es) in this file)"))
>     putLine (infoLine "")

> -- | Read a file's bytes only if it's at most 'maxFileSize'. Returns
> -- 'Nothing' for oversized files so the caller can skip cleanly.
> -- Stats the file via 'getFileSize' first so we never load a giant
> -- file into memory just to discover we should have skipped it.
> readSized :: FilePath -> IO (Maybe BS.ByteString)
> readSized path = do
>   size <- getFileSize path
>   if size > maxFileSize
>     then pure Nothing
>     else Just <$> BS.readFile path