#!/usr/bin/env stack > -- stack script --resolver lts-22.6 --package text --package process --package tagsoup --package network-uri webhole.lhs: browse any website as a gopherhole ================================================ Feed it the URL of a web page; it hands the page back as a *gopher menu*. The page's prose becomes info lines, its headings become titled sections, and its links become menu rows you can follow --- pages stay inside gopherspace (each link is a selector back into this same applet), while images, audio, downloads and off-protocol links are proxied through as gopher binaries or handed to your client. The effect is a text-mode browser whose "render target" is a gophermap instead of a screen. curl gopher://gopher.someodd.zip/1/applets/webhole.lhs/example.com Same family as `txtify.lhs` (page -> plaintext) and `atomize.lhs` (page -> Atom feed): an SSRF-guarded, size-capped `curl`, a type-7 search front door, and the target URL ridden in path-info as one opaque percent-encoded segment. Where those two each emit *one* artifact, webhole emits a *navigable* one and recurses: following a link fetches and re-renders the next page, so the whole web is walkable as a tree of menus. (Markdown-flavoured literate Haskell. Headings use setext underlines rather than ATX `#`, because GHC's literate parser reads a `#` at column 1 of a non-code line as a pragma.) The protocol point that makes this work ---------------------------------------- A gopher item-type byte (the `1` in `/1/applets/...`) is a *client* rendering hint; it is never sent to the server. The server only ever sees the selector. So a page can be served as a type-1 menu whose every in-content link is *itself* a type-1 selector pointing back at this script with the next URL encoded in path-info. Click it and the client follows without prompting; we fetch that URL and render it the same way. Navigation never leaves gopherspace. What "browse" means here, intelligently --------------------------------------- A dump of every `` on a modern page is link soup --- nav bars, cookie banners, tag clouds, share buttons. webhole instead reads the page the way a person does and lays it out for gopher: 0. SSRF guard --- http/https only; refuse localhost, RFC-1918, link-local, CGNAT and the cloud-metadata address. curl is capped on time, redirects and response size. 1. Fetch the page (browser-ish UA, gzip allowed). 2. One linear pass over the tags classifies every region: * hard chrome (script/style/svg/form/button/...) is dropped whole, contents and all; * nav chrome (nav/header/footer/aside) keeps only its *links*, dropped text --- this is the site menu; * everything else is main content: headings, paragraphs, list items, with the links that sat inside them. A separate pass finds `
`s with a text field --- see "Forms and search" below. 3. Render main content in document order: a heading titles a section, prose wraps to info lines, and each in-content link follows the prose it appeared in (so the link keeps its sentence). Tiny link-free fragments (bylines, captions) are filtered by length. 4. Collect the nav-chrome links, dedupe against what the body already showed, and lay them out as a "Site navigation" section at the foot --- so you can move around the site without the menu cluttering the article. (lynx, essentially.) 5. Every page also carries: a type-0 "read as plain text" link (chrome- and link-free, for saving), an `h URL:` handoff to open the real page in a browser, "up one level" / "site root" rows for spatial navigation, and a type-7 box to jump to any other URL. Link item-types are chosen by sniffing the target: an HTML page recurses as a type-1 menu; an image/audio/text/file is proxied as raw bytes through the `r` endpoint with the matching gopher type (`I`, `g`, `s`, `0`, `9`); an off-protocol link (`mailto:`, `tel:`) becomes an `h URL:` handoff. What it can't do: it runs no JavaScript (it is curl, not a browser), so a JS-only or paywalled page comes back thin --- the same honest limit as its siblings. URL / endpoint design --------------------- `$SCRIPT` is wherever `routes.toml` mounts this file; the script computes its own mount point from `$selector - $pathinfo`. $SCRIPT front door (explainer + search box) $SCRIPT + type-7 query browse that URL as a menu $SCRIPT/ browse that URL as a menu (bare form) $SCRIPT/m/ browse that URL as a menu (internal) $SCRIPT/t/ the page as plain text (type 0) $SCRIPT/r/ the resource's raw bytes (proxy) $SCRIPT/g/ + q submit a GET form with the typed query $SCRIPT/o//
 + q   submit a POST form with the typed query

The page URL rides in path-info as one opaque, percent-encoded
segment so its own slashes survive. A leading `m/`, `t/`, `r/`,
`g/` or `o/` segment selects the mode; anything else is treated as
a bare `host/path` URL (defaulting to https) and browsed.

Forms and search
----------------

Gopher's type-7 item *is* a one-line text input: the client
prompts, then resends the selector with the typed string after a
tab. That maps cleanly onto an HTML form with a text field. A
second pass over the page finds each `` that has a text
input (or textarea), and renders it as a type-7 "Search & forms"
row. The selector carries everything needed to submit *except*
the value, pre-split so the typed query is simply appended: a GET
form is `/g/` where the prefix is the action plus the
hidden pairs plus `field=`; a POST form is `/o//`. Each dynamic part is one ordinary percent-encoded path
segment (just like the URL the sibling applets ride in path-info)
with real `/` only between segments --- so the round-trip never
depends on a tab or other gopher-significant byte surviving.

Submitting lands back here with the typed value as the gopher
query; we percent-encode it, append it to the prefix --- a GET
becomes the full URL, a POST becomes the body --- fetch it, and
render the response as a browse menu. So a result page is
itself walkable, and its own forms are submittable in turn. A
DuckDuckGo or Google search box on the fetched HTML becomes a
working gopher search; non-text controls (checkboxes, file
uploads, JS-driven widgets) are out of scope.

Running the doctests
--------------------

    doctest-lhs webhole.lhs

Module header and imports
-------------------------

> {-# LANGUAGE OverloadedStrings #-}
> module Main (main) where
>
> import           Control.Exception      (SomeException, try)
> import           Data.Char              (isDigit)
> import           Data.Maybe             (mapMaybe)
> import qualified Data.Text              as T
> import qualified Data.Text.IO           as TIO
> import           Network.URI            (URI (..), escapeURIString,
>                                          isUnreserved, parseURI,
>                                          parseURIReference, relativeTo,
>                                          unEscapeString, uriAuthority,
>                                          uriRegName, uriScheme, uriToString)
> import           System.Environment     (getArgs, lookupEnv)
> import           System.Exit            (ExitCode (..))
> import           System.IO              (BufferMode (..), hFlush,
>                                          hSetBuffering, hSetEncoding, stdout,
>                                          utf8)
> import           System.Process         (StdStream (Inherit), createProcess,
>                                          proc, readProcessWithExitCode,
>                                          std_out, waitForProcess)
> import           Text.HTML.TagSoup      (Tag (..), parseTags)
> import           Text.HTML.TagSoup.Entity (lookupEntity)

Defaults and tunables
---------------------

Host/port overridable via `GOPHER_HOST` / `GOPHER_PORT` so the
same script serves staging and production unedited.

> defaultHost, defaultPort :: T.Text
> defaultHost = "gopher.someodd.zip"
> defaultPort = "70"
>
> -- | Wrap column for rendered prose.
> wrapCol :: Int
> wrapCol = 72
>
> -- | A non-link content block needs at least this many characters of
> -- prose to be shown (headings and link-bearing blocks are exempt).
> -- Filters bylines, captions and single-word remnants.
> minBlockChars :: Int
> minBlockChars = 40
>
> -- | Cap on rows in the "Site navigation" section, so a megamenu
> -- can't run away.
> maxNav :: Int
> maxNav = 80
>
> -- | Truncate a menu row's display text to this many characters.
> maxLabel :: Int
> maxLabel = 68

Request parsing
---------------

Four positional argv from Venusia (`$selector`, `$search`,
`$pathinfo`, `$remote_ip`); we ignore the IP via the cons pattern.

> data Req = Req
>   { reqSel :: T.Text
>   , reqQ   :: T.Text
>   , reqP   :: T.Text
>   } deriving Show
>
> parseArgs :: [String] -> Req
> parseArgs (s:q:p:_) = Req (T.pack s) (T.pack q) (T.pack p)
> parseArgs (s:q:_)   = Req (T.pack s) (T.pack q) T.empty
> parseArgs [s]       = Req (T.pack s) T.empty    T.empty
> parseArgs []        = error
>   "webhole.lhs: missing argv[0] (gopher selector). Run via Venusia, \
>   \or for manual testing pass selector + query/pathinfo explicitly."
>
> data Ctx = Ctx
>   { ctxScriptSel :: T.Text
>   , ctxHost      :: T.Text
>   , ctxPort      :: T.Text
>   }

Main dispatch
-------------

`stdout` is `NoBuffering` per house convention; the menu/text
bodies are wrapped in `try` so an uncaught exception becomes a
visible type-3 row rather than a blank response. The raw byte
proxy guards itself and writes through curl, so its failures are
handled there.

> main :: IO ()
> main = do
>   hSetBuffering stdout NoBuffering
>   hSetEncoding  stdout utf8
>   result <- try mainBody :: IO (Either SomeException ())
>   case result of
>     Right () -> pure ()
>     Left e   -> mapM_ putLine
>       [ errorItem ("webhole.lhs crashed: " <> T.pack (show e))
>       , terminator
>       ]
>
> mainBody :: IO ()
> mainBody = do
>   req  <- parseArgs <$> getArgs
>   host <- maybe defaultHost T.pack <$> lookupEnv "GOPHER_HOST"
>   port <- maybe defaultPort T.pack <$> lookupEnv "GOPHER_PORT"
>   let pinfo     = reqP req
>       scriptSel = T.dropEnd (T.length pinfo) (reqSel req)
>       ctx       = Ctx scriptSel host port
>       q         = T.strip (reqQ req)
>       p         = T.dropWhile (== '/') pinfo
>   if not (T.null p)
>     then case routeOf p of
>            (RGet,  enc) -> emitGet  ctx enc q
>            (RPost, enc) -> emitPost ctx enc q
>            (RText, enc) -> emitText ctx (normalizeTargetUrl (decodeSeg enc))
>            (RRaw,  enc) -> emitRaw      (normalizeTargetUrl (decodeSeg enc))
>            (RMenu, enc) -> emitMenu ctx (normalizeTargetUrl (decodeSeg enc))
>     else if T.null q
>       then emitFront ctx
>       else emitMenu  ctx (normalizeTargetUrl q)

Path-info routing
-----------------

A leading `m/`, `t/` or `r/` segment selects the mode and the
rest is the (percent-encoded) URL; anything else is the whole
path-info treated as a bare URL to browse. The mode markers are
single letters with a following slash, so they never collide
with a percent-encoded URL (which starts `https%3A...`) or a
bare `host/path` (whose first segment is a dotted hostname).

> data Mode = RMenu | RText | RRaw | RGet | RPost deriving (Eq, Show)
>
> -- | >>> routeOf "t/https%3A%2F%2Fa.com"
> -- (RText,"https%3A%2F%2Fa.com")
> --
> -- >>> routeOf "r/https%3A%2F%2Fa.com%2Fx.png"
> -- (RRaw,"https%3A%2F%2Fa.com%2Fx.png")
> --
> -- >>> routeOf "g/https%3A%2F%2Fa%2Fs%3Fq%3D"
> -- (RGet,"https%3A%2F%2Fa%2Fs%3Fq%3D")
> --
> -- >>> routeOf "https%3A%2F%2Fa.com"
> -- (RMenu,"https%3A%2F%2Fa.com")
> --
> -- >>> routeOf "example.com/page"
> -- (RMenu,"example.com/page")
> routeOf :: T.Text -> (Mode, T.Text)
> routeOf p =
>   let (seg0, rest) = T.break (== '/') p
>   in case seg0 of
>        "t" | not (T.null rest) -> (RText, T.drop 1 rest)
>        "r" | not (T.null rest) -> (RRaw,  T.drop 1 rest)
>        "g" | not (T.null rest) -> (RGet,  T.drop 1 rest)
>        "o" | not (T.null rest) -> (RPost, T.drop 1 rest)
>        "m" | not (T.null rest) -> (RMenu, T.drop 1 rest)
>        _                       -> (RMenu, p)
>
> -- | Decode one percent-encoded path-info segment back to text.
> decodeSeg :: T.Text -> T.Text
> decodeSeg = T.pack . unEscapeString . T.unpack

URL normalisation (shared with the sibling applets)
---------------------------------------------------

> -- | Trim, repair a single-slashed scheme, and default a scheme-less
> -- input to https.
> --
> -- >>> normalizeTargetUrl "  example.com/blog "
> -- "https://example.com/blog"
> --
> -- >>> normalizeTargetUrl "https:/example.com"
> -- "https://example.com"
> normalizeTargetUrl :: T.Text -> T.Text
> normalizeTargetUrl raw =
>   let s = repairScheme (T.strip raw)
>   in if hasScheme s then s else "https://" <> s
>
> -- | >>> hasScheme "http://x"
> -- True
> --
> -- >>> hasScheme "x.com"
> -- False
> hasScheme :: T.Text -> Bool
> hasScheme s = let l = T.toLower s
>               in "http://" `T.isPrefixOf` l || "https://" `T.isPrefixOf` l
>
> -- | Repair a scheme whose `//` collapsed to `/` (happens when a URL
> -- is typed unencoded into path-info and empty segments drop).
> --
> -- >>> repairScheme "http:/example.com"
> -- "http://example.com"
> repairScheme :: T.Text -> T.Text
> repairScheme s
>   | "https://" `T.isPrefixOf` l = s
>   | "http://"  `T.isPrefixOf` l = s
>   | "https:/"  `T.isPrefixOf` l = "https://" <> T.drop 7 s
>   | "http:/"   `T.isPrefixOf` l = "http://"  <> T.drop 6 s
>   | otherwise                   = s
>   where l = T.toLower s

SSRF guard (shared with the sibling applets)
--------------------------------------------

> -- | The host component of a (normalised) URL, lower-cased.
> --
> -- >>> hostOf "https://Example.com/x"
> -- Just "example.com"
> --
> -- >>> hostOf "not a url"
> -- Nothing
> hostOf :: T.Text -> Maybe T.Text
> hostOf u = do
>   uri  <- parseURI (T.unpack (normalizeTargetUrl u))
>   auth <- uriAuthority uri
>   let h = T.toLower (T.pack (uriRegName auth))
>   if T.null h then Nothing else Just h
>
> -- | True for hosts we refuse to fetch: localhost, RFC-1918, CGNAT,
> -- link-local, the cloud-metadata IP, and `.local` mDNS names.
> --
> -- >>> isBlockedHost "localhost"
> -- True
> --
> -- >>> isBlockedHost "192.168.1.10"
> -- True
> --
> -- >>> isBlockedHost "169.254.169.254"
> -- True
> --
> -- >>> isBlockedHost "example.com"
> -- False
> isBlockedHost :: T.Text -> Bool
> isBlockedHost h0 =
>   let h = T.toLower h0
>   in  h `elem` ["", "localhost", "0.0.0.0", "::1", "[::1]", "metadata"]
>    || ".local" `T.isSuffixOf` h
>    || "127."     `T.isPrefixOf` h
>    || "10."      `T.isPrefixOf` h
>    || "192.168." `T.isPrefixOf` h
>    || "169.254." `T.isPrefixOf` h
>    || "100.64."  `T.isPrefixOf` h
>    || in172 h
>   where
>     -- 172.16.0.0 .. 172.31.255.255
>     in172 h = case T.splitOn "." h of
>       (a:b:_) | a == "172"
>               , all isDigit (T.unpack b)
>               , not (T.null b)
>               , let n = read (T.unpack b) :: Int
>               -> n >= 16 && n <= 31
>       _       -> False

Fetching
--------

A tight envelope: http/https only, capped redirects, time and
response size, gzip allowed, browser-ish User-Agent. Returns the
body, or a human note on failure.

> fetchPage :: T.Text -> IO (Either T.Text T.Text)
> fetchPage url = case hostOf url of
>   Nothing -> pure (Left "could not parse that as an http(s) URL")
>   Just h | isBlockedHost h ->
>     pure (Left ("refused: \"" <> h <> "\" is a local/private address"))
>   Just _ -> do
>     (ec, out, _) <- readProcessWithExitCode "curl"
>       [ "-sSL", "--compressed", "--proto", "=http,https", "--max-redirs", "5"
>       , "--connect-timeout", "5", "--max-time", "20"
>       , "--max-filesize", "5000000"
>       , "-A", "Mozilla/5.0 (compatible; webhole.lhs; gopher web browser)"
>       , "--", T.unpack url ] ""
>     pure $ case ec of
>       ExitSuccess | not (null out) -> Right (T.pack out)
>       _ -> Left "could not fetch the page (timeout, too big, or error)"
>
> -- | As 'fetchPage', but a POST with a urlencoded body --- for forms
> -- whose method is post.
> fetchPagePost :: T.Text -> T.Text -> IO (Either T.Text T.Text)
> fetchPagePost url body = case hostOf url of
>   Nothing -> pure (Left "could not parse that as an http(s) URL")
>   Just h | isBlockedHost h ->
>     pure (Left ("refused: \"" <> h <> "\" is a local/private address"))
>   Just _ -> do
>     (ec, out, _) <- readProcessWithExitCode "curl"
>       [ "-sSL", "--compressed", "--proto", "=http,https", "--max-redirs", "5"
>       , "--connect-timeout", "5", "--max-time", "20"
>       , "--max-filesize", "5000000"
>       , "-A", "Mozilla/5.0 (compatible; webhole.lhs; gopher web browser)"
>       , "--data", T.unpack body
>       , "--", T.unpack url ] ""
>     pure $ case ec of
>       ExitSuccess | not (null out) -> Right (T.pack out)
>       _ -> Left "could not submit the form (timeout, too big, or error)"

Tag-soup helpers
----------------

> lc :: T.Text -> T.Text
> lc = T.toLower
>
> isOpen, isClose :: T.Text -> Tag T.Text -> Bool
> isOpen  n (TagOpen t _) = lc t == n
> isOpen  _ _             = False
> isClose n (TagClose t)  = lc t == n
> isClose _ _             = False
>
> -- | Attribute value by (case-insensitive) name; "" if absent.
> attrOf :: T.Text -> [(T.Text, T.Text)] -> T.Text
> attrOf k attrs = maybe "" id (lookup (lc k) [ (lc a, b) | (a, b) <- attrs ])
>
> -- | Concatenated text directly inside the first `...`.
> innerText :: T.Text -> [Tag T.Text] -> T.Text
> innerText name tags =
>   case dropWhile (not . isOpen name) tags of
>     (_:rest) -> T.strip . T.concat $
>                   [ s | TagText s <- takeWhile (not . isClose name) rest ]
>     []       -> ""

Region and block vocabulary
---------------------------

> -- | Elements whose entire subtree is chrome we never want.
> junkTags :: [T.Text]
> junkTags =
>   [ "script", "style", "noscript", "template", "svg", "head"
>   , "button", "select", "option", "iframe", "object", "embed" ]
>
> -- | Elements whose text is chrome but whose *links* are the site
> -- menu: kept and shown in a separate navigation section.
> navTags :: [T.Text]
> navTags = [ "nav", "header", "footer", "aside" ]
>
> -- | Block-level elements we treat as content chunks.
> blockTags :: [T.Text]
> blockTags = [ "p", "li", "blockquote", "pre", "dd", "dt", "figcaption" ]
>
> -- | >>> headLevel "h3"
> -- 3
> headLevel :: T.Text -> Int
> headLevel t = case T.unpack t of
>   ['h', d] | isDigit d -> read [d]
>   _                    -> 0
>
> -- | >>> isHeading "h2"
> -- True
> --
> -- >>> isHeading "p"
> -- False
> isHeading :: T.Text -> Bool
> isHeading t = t `elem` ["h1", "h2", "h3", "h4", "h5", "h6"]

The page model
--------------

The parser flattens a page into chunks in document order. A
content block carries its prose *and* the links that sat inside
it (so the prose reads naturally and the links follow it); a
standalone content link is its own chunk; nav-chrome links are
tagged separately for the navigation section.

> -- | A resolved, absolute link: display label + target URL.
> data Link = Link
>   { lLabel :: T.Text
>   , lUrl   :: T.Text
>   } deriving (Eq, Show)
>
> data Chunk
>   = CHead Int T.Text       -- ^ heading: level + text
>   | CBlock T.Text [Link]   -- ^ block prose + links found inside it
>   | CLink Link             -- ^ a content link with no enclosing block
>   | CNav Link              -- ^ a link from nav/header/footer/aside chrome
>   deriving (Eq, Show)

The parse walk
--------------

A single left-to-right pass. State: the stack of open hard-junk
tags (skip everything while non-empty), the nav-chrome depth
(drop text, keep links), the block currently being captured, and
the anchor currently being captured (its href + accumulating
text). Nested block opens fold into the enclosing block; this
keeps the walk linear and is fine for a heuristic.

> data Anchor = Anchor T.Text T.Text          -- href, text-so-far
> data Cur    = Cur T.Text (Maybe Int) T.Text [Link]
>                                              -- tag, heading-level?, text, links (rev)
>
> -- | >>> pageChunks "https://x/a" "Home
Hi
Read more here, this line is plainly long enough."
> -- [CNav (Link {lLabel = "Home", lUrl = "https://x/home"}),CHead 1 "Hi",CBlock "Read more here, this line is plainly long enough." [Link {lLabel = "more", lUrl = "https://x/more"}]]
> pageChunks :: T.Text -> T.Text -> [Chunk]
> pageChunks base html = extractChunks base (parseTags html)
>
> extractChunks :: T.Text -> [Tag T.Text] -> [Chunk]
> extractChunks base = go [] 0 Nothing Nothing
>   where
>     go :: [T.Text] -> Int -> Maybe Cur -> Maybe Anchor
>        -> [Tag T.Text] -> [Chunk]
>     go _    _    cur _   []       = flush cur []
>     go junk navd cur anc (tg:ts)  = case tg of
>       TagOpen rawN attrs
>         | nm `elem` junkTags        -> go (nm : junk) navd cur anc ts
>         | not (null junk)           -> go junk navd cur anc ts
>         | nm `elem` navTags         -> go junk (navd + 1) cur anc ts
>         | nm == "a"                 -> case anc of
>             Just _  -> go junk navd cur anc ts          -- nested : ignore
>             Nothing -> go junk navd cur (Just (Anchor (attrOf "href" attrs) "")) ts
>         | nm == "br"                -> case cur of
>             Just (Cur t h x ls) -> go junk navd (Just (Cur t h (x <> " ") ls)) anc ts
>             Nothing             -> go junk navd cur anc ts
>         | isHeading nm              -> startBlock (Just (headLevel nm))
>         | nm `elem` blockTags       -> startBlock Nothing
>         | otherwise                 -> go junk navd cur anc ts
>         where
>           nm = lc rawN
>           startBlock mh = case (navd > 0, cur) of
>             (True, _)        -> go junk navd cur anc ts   -- no blocks in nav
>             (False, Nothing) -> go junk navd (Just (Cur nm mh "" [])) anc ts
>             (False, Just _)  -> go junk navd cur anc ts   -- fold into enclosing
>       TagClose rawN
>         | nm `elem` junkTags  -> go (dropFirst nm junk) navd cur anc ts
>         | not (null junk)     -> go junk navd cur anc ts
>         | nm `elem` navTags   -> go junk (max 0 (navd - 1)) cur anc ts
>         | nm == "a"           -> closeAnchor
>         | otherwise           -> case cur of
>             Just (Cur t _ _ _) | nm == t -> flush cur (go junk navd Nothing anc ts)
>             _                            -> go junk navd cur anc ts
>         where
>           nm = lc rawN
>           closeAnchor = case anc of
>             Nothing -> go junk navd cur Nothing ts
>             Just a  -> case mkLink base a of
>               Nothing  -> go junk navd cur Nothing ts
>               Just lnk
>                 | navd > 0  -> CNav lnk : go junk navd cur Nothing ts
>                 | otherwise -> case cur of
>                     Just (Cur t h x ls) ->
>                       go junk navd (Just (Cur t h x (lnk : ls))) Nothing ts
>                     Nothing             -> CLink lnk : go junk navd cur Nothing ts
>       TagText s
>         | not (null junk) -> go junk navd cur anc ts
>         | otherwise       ->
>             let anc' = fmap (\(Anchor h x) -> Anchor h (x <> s)) anc
>                 cur' = fmap (\(Cur t h x ls) -> Cur t h (x <> s) ls) cur
>             in go junk navd cur' anc' ts
>       _ -> go junk navd cur anc ts
>
>     flush Nothing  rest = rest
>     flush (Just (Cur _ (Just lvl) x _)) rest =
>       let t = clean x in if T.null t then rest else CHead lvl t : rest
>     flush (Just (Cur _ Nothing x ls)) rest =
>       let t = clean x; links = reverse ls
>       in if T.null t && null links then rest else CBlock t links : rest
>
> -- | Resolve an anchor to an absolute link, or drop it (empty href,
> -- fragment-only, or a javascript:/data: pseudo-URL).
> mkLink :: T.Text -> Anchor -> Maybe Link
> mkLink base (Anchor href0 txt) =
>   let href = T.strip href0
>       l    = lc href
>       lbl0 = collapseWs (decodeEntities txt)
>   in if T.null href || "#" `T.isPrefixOf` href
>         || "javascript:" `T.isPrefixOf` l || "data:" `T.isPrefixOf` l
>      then Nothing
>      else let abs' = resolveUrl base href
>               lbl  = if T.null lbl0 then deriveLabel abs' else lbl0
>           in Just (Link lbl abs')
>
> -- | Resolve a possibly-relative href against a base URL.
> --
> -- >>> resolveUrl "https://a.com/blog/" "post"
> -- "https://a.com/blog/post"
> --
> -- >>> resolveUrl "https://a.com/blog/" "/about"
> -- "https://a.com/about"
> resolveUrl :: T.Text -> T.Text -> T.Text
> resolveUrl base href =
>   case (parseURIReference (T.unpack href), parseURI (T.unpack base)) of
>     (Just r, Just b) -> T.pack (uriToString id (r `relativeTo` b) "")
>     _                -> href
>
> -- | First occurrence of an element removed from the open-junk stack.
> --
> -- >>> dropFirst "svg" ["svg","div"]
> -- ["div"]
> dropFirst :: Eq a => a -> [a] -> [a]
> dropFirst _ []     = []
> dropFirst x (y:ys)
>   | x == y    = ys
>   | otherwise = y : dropFirst x ys

Link classification
-------------------

The target's scheme and file extension decide the gopher item
type, and so whether following the link recurses (a page) or
fetches a resource through the byte proxy (everything else).

> data Kind = KPage | KImage | KGif | KSound | KText | KFile | KExt
>   deriving (Eq, Show)
>
> -- | >>> map classifyKind ["https://a/p","https://a/x.PNG","https://a/x.gif","https://a/f.pdf","https://a/n.txt","mailto:x@y"]
> -- [KPage,KImage,KGif,KFile,KText,KExt]
> classifyKind :: T.Text -> Kind
> classifyKind u
>   | not (isHttp u)   = KExt
>   | e == "gif"       = KGif
>   | e `elem` images  = KImage
>   | e `elem` sounds  = KSound
>   | e `elem` texts   = KText
>   | e `elem` files   = KFile
>   | otherwise        = KPage
>   where
>     e      = urlExt u
>     images = ["jpg","jpeg","png","webp","bmp","ico","apng","avif","svg","tif","tiff"]
>     sounds = ["mp3","ogg","oga","wav","flac","m4a","opus","aac"]
>     texts  = ["txt","md","markdown","csv","log","json","xml","rss","atom","css","js"]
>     files  = [ "pdf","zip","gz","tgz","tar","bz2","xz","7z","rar","doc","docx"
>              , "xls","xlsx","ppt","pptx","epub","mobi","mp4","webm","mkv","mov"
>              , "avi","wmv","flv","exe","dmg","iso","bin","apk","deb","rpm" ]
>
> -- | >>> isHttp "https://x"
> -- True
> --
> -- >>> isHttp "mailto:a@b"
> -- False
> isHttp :: T.Text -> Bool
> isHttp u = let l = lc u in "http://" `T.isPrefixOf` l || "https://" `T.isPrefixOf` l
>
> -- | Lower-cased file extension of a URL's last path segment, "" if
> -- none.
> --
> -- >>> urlExt "https://a.com/dir/file.HTML?x=1"
> -- "html"
> --
> -- >>> urlExt "https://a.com/dir/"
> -- ""
> urlExt :: T.Text -> T.Text
> urlExt u =
>   let core = T.takeWhile (\c -> c /= '?' && c /= '#') u
>       seg  = case reverse (filter (not . T.null) (T.splitOn "/" core)) of
>                (x:_) -> x
>                []    -> ""
>   in if T.any (== '.') seg then lc (T.takeWhileEnd (/= '.') seg) else ""
>
> -- | The gopher item-type byte for a link kind. `KExt` is rendered
> -- specially (an `h URL:` row), so it shares the page byte here.
> --
> -- >>> map kindType [KPage,KImage,KGif,KSound,KText,KFile]
> -- "1Igs09"
> kindType :: Kind -> Char
> kindType k = case k of
>   KPage  -> '1'
>   KImage -> 'I'
>   KGif   -> 'g'
>   KSound -> 's'
>   KText  -> '0'
>   KFile  -> '9'
>   KExt   -> 'h'

Deriving labels and ancestor URLs
---------------------------------

> -- | A readable label for a link that had no anchor text: its last
> -- path segment, or the host.
> --
> -- >>> deriveLabel "https://a.com/blog/my-post/"
> -- "my-post"
> --
> -- >>> deriveLabel "https://a.com/"
> -- "a.com"
> deriveLabel :: T.Text -> T.Text
> deriveLabel u =
>   let noScheme = case T.breakOn "://" u of
>                    (_, r) | not (T.null r) -> T.drop 3 r
>                    _                       -> u
>       core = T.takeWhile (\c -> c /= '?' && c /= '#') noScheme
>       segs = filter (not . T.null) (T.splitOn "/" core)
>   in case segs of
>        []  -> u
>        [h] -> h
>        _   -> T.pack (unEscapeString (T.unpack (last segs)))
>
> -- | scheme://host of a URL (no path), for the "site root" row.
> --
> -- >>> rootOf "https://a.com/blog/post?x=1"
> -- Just "https://a.com"
> rootOf :: T.Text -> Maybe T.Text
> rootOf url = do
>   uri  <- parseURI (T.unpack (normalizeTargetUrl url))
>   auth <- uriAuthority uri
>   pure (T.pack (uriScheme uri) <> "//" <> T.pack (uriRegName auth))
>
> -- | The parent URL (one path segment up), or Nothing at the root.
> --
> -- >>> parentUrl "https://a.com/blog/post"
> -- Just "https://a.com/blog/"
> --
> -- >>> parentUrl "https://a.com/"
> -- Nothing
> parentUrl :: T.Text -> Maybe T.Text
> parentUrl url = do
>   uri <- parseURI (T.unpack (normalizeTargetUrl url))
>   let segs = filter (not . null) (splitSlash (uriPath uri))
>   if null segs
>     then Nothing
>     else let parentPath = '/' : concatMap (++ "/") (init segs)
>              uri'       = uri { uriPath = parentPath, uriQuery = ""
>                               , uriFragment = "" }
>              s          = T.pack (uriToString id uri' "")
>          in if s == normalizeTargetUrl url then Nothing else Just s
>   where
>     splitSlash str = case break (== '/') str of
>       (a, '/':rest) -> a : splitSlash rest
>       (a, _)        -> [a]

Browse view (the hero: a page rendered as a menu)
-------------------------------------------------

> emitMenu :: Ctx -> T.Text -> IO ()
> emitMenu ctx url = do
>   res <- fetchPage url
>   either (menuError ctx url) (renderMenu ctx url) res
>
> -- | Render a fetched HTML body as a browse menu: header affordances,
> -- the page's forms as type-7 rows, the main content interleaved, and
> -- a segregated site-navigation section.
> renderMenu :: Ctx -> T.Text -> T.Text -> IO ()
> renderMenu ctx url body = do
>   let tags     = parseTags body
>       title0   = collapseWs (decodeEntities (innerText "title" tags))
>       title    = if T.null title0 then url else title0
>       chunks   = extractChunks url tags
>       forms    = extractForms url tags
>       content  = keepContent [ c | c <- chunks, notNav c ]
>       bodyUrls = concatMap chunkUrls content
>       navLinks = take maxNav (dedupeLinks bodyUrls [ l | CNav l <- chunks ])
>   mapM_ putLine (headerRows ctx url title)
>   mapM_ putLine (formRows ctx forms)
>   if null content
>     then putLine (infoLine "(no readable text found -- the page may be \
>                             \JS-rendered, paywalled, or not HTML)")
>     else mapM_ putLine
>            (concatMap (\c -> renderChunk ctx c ++ [infoLine ""]) content)
>   if null navLinks
>     then pure ()
>     else do
>       putLine (infoLine "")
>       putLine (infoLine "----- Site navigation -----")
>       mapM_ (putLine . linkRow ctx) navLinks
>   mapM_ putLine (footerRows ctx)
>   putLine terminator
>   where
>     notNav (CNav _) = False
>     notNav _        = True
>
> chunkUrls :: Chunk -> [T.Text]
> chunkUrls (CBlock _ ls) = map lUrl ls
> chunkUrls (CLink l)     = [lUrl l]
> chunkUrls _             = []
>
> -- | Keep headings and content links always; keep a prose block only
> -- when it carries enough text or at least one link.
> keepContent :: [Chunk] -> [Chunk]
> keepContent = filter keep
>   where
>     keep (CHead _ _)    = True
>     keep (CLink _)      = True
>     keep (CBlock t ls)  = not (null ls) || T.length t >= minBlockChars
>     keep (CNav _)       = False
>
> -- | A content chunk as gophermap rows: prose as info lines, then its
> -- links as follow-able rows. The prose line is suppressed when a
> -- block is just a single link (its label would only repeat it).
> renderChunk :: Ctx -> Chunk -> [T.Text]
> renderChunk _   (CHead lvl t) = map infoLine (headingDisplays lvl t)
> renderChunk ctx (CBlock t ls) = prose ++ map (linkRow ctx) ls
>   where
>     prose | not show'  = []
>           | otherwise  = map infoLine (T.lines (wrapText wrapCol t))
>     show' = not (T.null t)
>          && not (length ls == 1 && collapseWs (lLabel (head ls)) == t)
> renderChunk ctx (CLink l)     = [linkRow ctx l]
> renderChunk _   (CNav _)      = []
>
> -- | A heading's display lines: setext-underlined for h1/h2, marked
> -- with a guillemet for deeper levels.
> --
> -- >>> headingDisplays 2 "Hi"
> -- ["","Hi","--"]
> headingDisplays :: Int -> T.Text -> [T.Text]
> headingDisplays 1 t = ["", t, rule '=' t, ""]
> headingDisplays 2 t = ["", t, rule '-' t]
> headingDisplays _ t = ["", "\187 " <> t]
>
> -- | A repeated-character rule as wide as the text (clamped).
> rule :: Char -> T.Text -> T.Text
> rule c s = T.replicate (max 2 (min wrapCol (T.length s))) (T.singleton c)
>
> -- | One link as a gophermap row, item-type chosen by classification:
> -- a page recurses through `/m/`, a resource is proxied through `/r/`,
> -- an off-protocol link is an `h URL:` handoff.
> linkRow :: Ctx -> Link -> T.Text
> linkRow ctx (Link lbl0 url) =
>   let lbl  = truncLabel (if T.null lbl0 then deriveLabel url else lbl0)
>       h    = ctxHost ctx
>       p    = ctxPort ctx
>       kind = classifyKind url
>   in case kind of
>        KExt  -> hRow ctx lbl url
>        KPage -> menuRow (kindType kind) lbl (modeSel ctx "m" url) h p
>        _     -> menuRow (kindType kind) lbl (modeSel ctx "r" url) h p
>
> -- | The selector for a mode endpoint of this script.
> modeSel :: Ctx -> T.Text -> T.Text -> T.Text
> modeSel ctx m url = ctxScriptSel ctx <> "/" <> m <> "/" <> percentEncodeUrl url
>
> -- | Header block on every browsed page: title, source, the read /
> -- open / up / root affordances.
> headerRows :: Ctx -> T.Text -> T.Text -> [T.Text]
> headerRows ctx url title =
>   [ infoLine title
>   , infoLine (rule '=' title)
>   , infoLine ""
>   , infoLine ("Source: " <> url)
>   , menuRow '0' "Read this page as plain text (no links)"
>       (modeSel ctx "t" url) (ctxHost ctx) (ctxPort ctx)
>   , hRow ctx "Open this page in a web browser" url
>   ] ++ upRows ctx url ++ [ infoLine "" ]
>
> -- | "Up one level" and "site root" rows, when they differ from the
> -- current URL (and from each other).
> upRows :: Ctx -> T.Text -> [T.Text]
> upRows ctx url =
>   let cur  = normalizeTargetUrl url
>       mpar = parentUrl url
>       mrt  = rootOf url
>       row u = menuRow '1' ("Up: " <> truncLabel u)
>                 (modeSel ctx "m" u) (ctxHost ctx) (ctxPort ctx)
>       par = case mpar of Just p | p /= cur -> [row p]; _ -> []
>       rt  = case mrt of
>               Just r | r /= cur && mpar /= Just r -> [row r]
>               _                                   -> []
>   in par ++ rt
>
> -- | Footer search box and banner, on every page.
> footerRows :: Ctx -> [T.Text]
> footerRows ctx =
>   [ infoLine ""
>   , searchLine "Burrow to another web URL" (ctxScriptSel ctx)
>       (ctxHost ctx) (ctxPort ctx)
>   , infoLine ""
>   , menuRow '1' "-- served over gopher * no js * no ads * no tracking --"
>       "" (ctxHost ctx) (ctxPort ctx) ]
>
> -- | The menu shown when a page can't be fetched.
> menuError :: Ctx -> T.Text -> T.Text -> IO ()
> menuError ctx url note = do
>   mapM_ putLine
>     [ infoLine "webhole -- browse the web as a gopherhole"
>     , infoLine ("Target: " <> url)
>     , infoLine ""
>     , errorItem ("Could not open: " <> note)
>     , infoLine ""
>     , infoLine "Try a different page, or paste a full http(s) URL." ]
>   mapM_ putLine (footerRows ctx)
>   putLine terminator

Forms and search
----------------

A form with a text field, distilled to what a gopher type-7
search needs: method, resolved action, the field's name, and the
hidden inputs (as a ready-made urlencoded query fragment).

> data FormSpec = FormSpec
>   { fsMethod :: T.Text   -- ^ "get" or "post"
>   , fsAction :: T.Text   -- ^ absolute action URL
>   , fsField  :: T.Text   -- ^ name of the text field the query fills
>   , fsHidden :: T.Text   -- ^ hidden inputs, urlencoded "a=1&b=2" (or "")
>   } deriving (Eq, Show)
>
> -- | Cap on forms surfaced per page.
> maxForms :: Int
> maxForms = 12

Each `` region is its open-tag attributes paired with the
tags inside it (forms don't nest in practice; a stray inner
`` open is treated as content and the outer closes first).

> formRegions :: [Tag T.Text] -> [([(T.Text, T.Text)], [Tag T.Text])]
> formRegions [] = []
> formRegions (TagOpen n attrs : rest)
>   | lc n == "form" = let (inner, after) = breakClose "form" rest
>                      in (attrs, inner) : formRegions after
> formRegions (_ : rest) = formRegions rest
>
> -- | Tags up to the first matching close, and the remainder after it.
> breakClose :: T.Text -> [Tag T.Text] -> ([Tag T.Text], [Tag T.Text])
> breakClose name = go []
>   where
>     go acc []       = (reverse acc, [])
>     go acc (t:ts)
>       | isClose name t = (reverse acc, ts)
>       | otherwise      = go (t : acc) ts
>
> -- | What we scrape out of a form's inner tags.
> data FormAcc = FormAcc
>   { faField  :: Maybe (T.Text, T.Text)  -- ^ (name, placeholder) of the text field
>   , faHidden :: [(T.Text, T.Text)]      -- ^ hidden (name, value), reversed
>   , faSubmit :: T.Text                  -- ^ a submit button's label, if any
>   }
>
> collectForm :: [Tag T.Text] -> FormAcc
> collectForm = foldl step (FormAcc Nothing [] "")
>   where
>     textTypes = [ "", "text", "search", "email", "url", "tel", "number", "password" ]
>     step acc (TagOpen n a)
>       | lc n == "input" =
>           let ty = lc (attrOf "type" a); nm = attrOf "name" a in
>           if ty == "hidden"
>             then if T.null nm then acc
>                  else acc { faHidden = (nm, attrOf "value" a) : faHidden acc }
>           else if ty `elem` textTypes && not (T.null nm)
>             then case faField acc of
>                    Just _  -> acc
>                    Nothing -> acc { faField = Just (nm, attrOf "placeholder" a) }
>           else if ty `elem` ["submit", "image"]
>             then let v = attrOf "value" a in if T.null v then acc else acc { faSubmit = v }
>           else acc
>       | lc n == "textarea" =
>           let nm = attrOf "name" a in
>           case (T.null nm, faField acc) of
>             (False, Nothing) -> acc { faField = Just (nm, attrOf "placeholder" a) }
>             _                -> acc
>       | otherwise = acc
>     step acc _ = acc
>
> -- | A form region to a (display-label, FormSpec), or Nothing when it
> -- has no text field we can drive from a type-7 query.
> processForm :: T.Text -> [(T.Text, T.Text)] -> [Tag T.Text] -> Maybe (T.Text, FormSpec)
> processForm base attrs inner =
>   let acc = collectForm inner
>   in case faField acc of
>        Nothing            -> Nothing
>        Just (fname, ph)   ->
>          let rawAct = attrOf "action" attrs
>              action = resolveUrl base (if T.null rawAct then base else rawAct)
>              method = if lc (attrOf "method" attrs) == "post" then "post" else "get"
>              hidden = T.intercalate "&"
>                [ percentEncodeUrl n <> "=" <> percentEncodeUrl v
>                | (n, v) <- reverse (faHidden acc) ]
>              label  = firstNonEmpty [ faSubmit acc, ph, niceName fname ]
>          in Just (label, FormSpec method action fname hidden)
>
> -- | Friendly label for a bare field name; the usual search names
> -- become "Search".
> --
> -- >>> niceName "q"
> -- "Search"
> --
> -- >>> niceName "email"
> -- "email"
> niceName :: T.Text -> T.Text
> niceName fname
>   | lc fname `elem` ["q","query","s","search","term","keyword","kw","p","wd"] = "Search"
>   | otherwise = fname
>
> -- | All submittable forms on the page, capped.
> extractForms :: T.Text -> [Tag T.Text] -> [(T.Text, FormSpec)]
> extractForms base =
>   take maxForms . mapMaybe (\(a, inner) -> processForm base a inner) . formRegions
>
> -- | Forms as a "Search & forms" section of type-7 rows. A GET form
> -- becomes a `/g/` row (the typed query is appended to the
> -- prefix); a POST form becomes a `/o//` row.
> formRows :: Ctx -> [(T.Text, FormSpec)] -> [T.Text]
> formRows _   []    = []
> formRows ctx forms =
>   [ infoLine "", infoLine "----- Search & forms -----" ]
>   ++ [ searchLine (truncLabel label) (formSel ctx fs) (ctxHost ctx) (ctxPort ctx)
>      | (label, fs) <- forms ]
>
> -- | The type-7 selector for a form. Every dynamic part is a single
> -- percent-encoded path segment (exactly like the URL the sibling
> -- applets ride in path-info), with real `/` only between segments ---
> -- so nothing here depends on a tab or other gopher-significant byte
> -- surviving the round-trip.
> formSel :: Ctx -> FormSpec -> T.Text
> formSel ctx fs = case fsMethod fs of
>   "post" -> ctxScriptSel ctx <> "/o/" <> percentEncodeUrl (fsAction fs)
>                              <> "/"   <> percentEncodeUrl (bodyPrefix fs)
>   _      -> ctxScriptSel ctx <> "/g/" <> percentEncodeUrl (getPrefix fs)

A submission is split into a "prefix" (everything up to and
including `field=`) and the typed query value, which the submit
endpoint percent-encodes and appends. For GET the prefix is a URL
ready for the value; for POST it is the request body, sent
alongside the (separately encoded) action.

> -- | A GET form's URL prefix: action, the right separator, the hidden
> -- pairs, then `field=` ready for the value to be appended.
> --
> -- >>> getPrefix (FormSpec "get" "https://a/s" "q" "hl=en")
> -- "https://a/s?hl=en&q="
> --
> -- >>> getPrefix (FormSpec "get" "https://a/s?p=1" "q" "")
> -- "https://a/s?p=1&q="
> getPrefix :: FormSpec -> T.Text
> getPrefix fs =
>   let sep = if T.any (== '?') (fsAction fs) then "&" else "?"
>   in fsAction fs <> sep <> bodyPrefix fs
>
> -- | A form's body prefix: hidden pairs then `field=`, ready for the
> -- value. Used directly as the POST body prefix and inside 'getPrefix'.
> --
> -- >>> bodyPrefix (FormSpec "post" "https://a/s" "q" "hl=en")
> -- "hl=en&q="
> bodyPrefix :: FormSpec -> T.Text
> bodyPrefix fs =
>   T.intercalate "&" (filter (not . T.null) [fsHidden fs, fsField fs <> "="])
>
> -- | Handle a GET submission: append the encoded query to the decoded
> -- URL prefix and browse the result.
> emitGet :: Ctx -> T.Text -> T.Text -> IO ()
> emitGet ctx enc q = emitMenu ctx (decodeSeg enc <> percentEncodeUrl q)
>
> -- | Handle a POST submission: the remainder is `/ -- prefix>`; rebuild the body with the encoded query and render the
> -- response as a browse menu.
> emitPost :: Ctx -> T.Text -> T.Text -> IO ()
> emitPost ctx enc q = case T.break (== '/') enc of
>   (ea, slashEb) | not (T.null slashEb) -> do
>     let action = decodeSeg ea
>         body   = decodeSeg (T.drop 1 slashEb) <> percentEncodeUrl q
>     res <- fetchPagePost action body
>     either (menuError ctx action) (renderMenu ctx action) res
>   _ -> menuError ctx "(form)" "malformed form submission"

Plain-text view (type 0)
------------------------

The same parse, rendered as chrome-free, link-free wrapped text
for reading or saving. Reuses the chunk model: keep headings and
prose blocks, drop links and the navigation section.

> emitText :: Ctx -> T.Text -> IO ()
> emitText _ url = do
>   res <- fetchPage url
>   case res of
>     Left note  -> TIO.putStr $ T.intercalate "\n"
>       [ "webhole: could not read " <> url, "", note, "" ]
>     Right body -> do
>       let tags   = parseTags body
>           title0 = collapseWs (decodeEntities (innerText "title" tags))
>           title  = if T.null title0 then url else title0
>           txt    = renderPlain title url (extractChunks url tags)
>       if T.null txt
>         then TIO.putStr $ T.intercalate "\n"
>           [ title, rule '=' title, "", "Source: " <> url, ""
>           , "(no readable text -- likely JS-rendered, paywalled, or not \
>             \HTML; webhole runs no JavaScript)", "" ]
>         else TIO.putStr txt
>
> -- | Title, rule, source, then kept headings and prose blocks as
> -- wrapped plaintext. "" when nothing substantial survives.
> --
> -- >>> renderPlain "Hi" "https://x/a" [CHead 1 "Hi", CBlock "This is the body and it is plainly long enough to keep." [], CNav (Link "skip" "https://x/n")]
> -- "Hi\n==\n\nSource: https://x/a\n\nHi\n\nThis is the body and it is plainly long enough to keep.\n"
> renderPlain :: T.Text -> T.Text -> [Chunk] -> T.Text
> renderPlain title url chunks =
>   let kept = [ render c | c <- chunks, keepText c ]
>       hdr  = T.intercalate "\n" [ title, rule '=' title, "", "Source: " <> url ]
>   in if null kept then "" else hdr <> "\n\n" <> T.intercalate "\n\n" kept <> "\n"
>   where
>     keepText (CHead _ _)   = True
>     keepText (CBlock t ls) = T.length t >= minBlockChars && linkDensity t ls < 0.5
>     keepText _             = False
>     render (CHead _ t)   = t
>     render (CBlock t _)  = wrapText wrapCol t
>     render _             = ""
>
> -- | Fraction of a block's text that sat inside links (0 when empty).
> --
> -- >>> linkDensity "abcdef" [Link "abc" "u"]
> -- 0.5
> linkDensity :: T.Text -> [Link] -> Double
> linkDensity t ls =
>   let tot = fromIntegral (T.length (collapseWs t))
>       lnk = fromIntegral (sum [ T.length (collapseWs (lLabel l)) | l <- ls ])
>   in if tot <= 0 then 1 else lnk / tot

Resource proxy (raw bytes)
--------------------------

The byte path for images, audio, text files and downloads. It
guards the host itself and streams curl's output straight to our
stdout (the gopher socket) via an inherited fd, so binary bytes
pass through untouched by Haskell's text encoding, and there is
no gopher terminator. On a guard failure it emits a short text
note instead.

> emitRaw :: T.Text -> IO ()
> emitRaw url = case hostOf url of
>   Nothing -> TIO.putStr "webhole: could not parse that as an http(s) URL\n"
>   Just h | isBlockedHost h ->
>     TIO.putStr ("webhole: refused \"" <> h <> "\" (local/private address)\n")
>   Just _ -> do
>     hFlush stdout
>     (_, _, _, ph) <- createProcess (proc "curl"
>       [ "-sSL", "--compressed", "--proto", "=http,https", "--max-redirs", "5"
>       , "--connect-timeout", "5", "--max-time", "30"
>       , "--max-filesize", "20000000"
>       , "-A", "Mozilla/5.0 (compatible; webhole.lhs; gopher web browser)"
>       , "--", T.unpack url ]) { std_out = Inherit }
>     _ <- waitForProcess ph
>     pure ()

Front door
----------

> emitFront :: Ctx -> IO ()
> emitFront ctx = do
>   let host     = ctxHost ctx
>       port     = ctxPort ctx
>       portPart = if port == "70" then "" else ":" <> port
>       base     = "gopher://" <> host <> portPart <> "/1" <> ctxScriptSel ctx
>   mapM_ putLine
>     [ infoLine "webhole -- browse any website as a gopherhole."
>     , infoLine ""
>     , infoLine "Paste a URL and walk the page as a gopher menu: its text"
>     , infoLine "becomes info lines, its links become rows you can follow."
>     , infoLine "Pages stay in gopherspace; images, files and off-site"
>     , infoLine "resources are proxied through or handed to your client."
>     , infoLine "No JavaScript, so JS-only pages come back thin."
>     , infoLine ""
>     , searchLine "Paste a web-page URL to start browsing"
>         (ctxScriptSel ctx) host port
>     , infoLine ""
>     , infoLine "Or jump straight in from a shell:"
>     , infoLine ("  curl " <> base <> "/example.com")
>     , infoLine "(the URL is the selector; bare host/path defaults to https)"
>     ]
>   mapM_ putLine (footerRows ctx)
>   putLine terminator

Selector encoding
-----------------

The page URL becomes one opaque path-info segment: escape all but
the URI-unreserved set, so its slashes and colon survive intact.

> -- | >>> percentEncodeUrl "https://a.com/b?x=1"
> -- "https%3A%2F%2Fa.com%2Fb%3Fx%3D1"
> percentEncodeUrl :: T.Text -> T.Text
> percentEncodeUrl = T.pack . escapeURIString isUnreserved . T.unpack

Dedupe + label utilities
------------------------

> -- | Drop links whose URL is in the seen set or already emitted, order
> -- stable. Keeps the navigation section free of links the body showed.
> --
> -- >>> map lUrl (dedupeLinks ["https://x/a"] [Link "A" "https://x/a", Link "B" "https://x/b", Link "B2" "https://x/b"])
> -- ["https://x/b"]
> dedupeLinks :: [T.Text] -> [Link] -> [Link]
> dedupeLinks seen = go seen
>   where
>     go _    []       = []
>     go acc  (l:ls)
>       | lUrl l `elem` acc = go acc ls
>       | otherwise         = l : go (lUrl l : acc) ls
>
> -- | First entry that is non-empty after trimming, or "".
> --
> -- >>> firstNonEmpty ["", "  ", "x", "y"]
> -- "x"
> firstNonEmpty :: [T.Text] -> T.Text
> firstNonEmpty xs = case filter (not . T.null . T.strip) xs of
>   (x:_) -> T.strip x
>   []    -> ""
>
> -- | Truncate a label to the menu width with an ellipsis.
> --
> -- >>> truncLabel "short"
> -- "short"
> truncLabel :: T.Text -> T.Text
> truncLabel s
>   | T.length s <= maxLabel = s
>   | otherwise              = T.take (maxLabel - 1) s <> "\8230"

Text utilities (shared with the sibling applets)
------------------------------------------------

> -- | Collapse runs of whitespace (incl. NBSP) to single spaces, trim.
> --
> -- >>> collapseWs "  a\t b\n c "
> -- "a b c"
> collapseWs :: T.Text -> T.Text
> collapseWs = T.unwords . T.words . T.map deNbsp
>   where deNbsp c = if c == '\160' then ' ' else c
>
> -- | Clean a captured block: decode entities, then collapse.
> clean :: T.Text -> T.Text
> clean = collapseWs . decodeEntities
>
> -- | Greedy word-wrap to a column width.
> --
> -- >>> wrapText 10 "the quick brown fox"
> -- "the quick\nbrown fox"
> wrapText :: Int -> T.Text -> T.Text
> wrapText w = T.intercalate "\n" . wrapWords . T.words
>   where
>     wrapWords []     = []
>     wrapWords (x:xs) = go x xs
>     go line []       = [line]
>     go line (y:ys)
>       | T.length line + 1 + T.length y <= w = go (line <> " " <> y) ys
>       | otherwise                           = line : go y ys
>
> -- | Decode HTML entities (named, decimal, hex). Idempotent on text
> -- with no `&name;` left to touch.
> --
> -- >>> decodeEntities "Tom & Jerry 'hi'"
> -- "Tom & Jerry 'hi'"
> decodeEntities :: T.Text -> T.Text
> decodeEntities = T.pack . decode . T.unpack
>   where
>     decode [] = []
>     decode ('&':cs) = case break (== ';') cs of
>       (ent, ';':rest)
>         | not (null ent), Just dec <- lookupEntity ent -> dec ++ decode rest
>       _ -> '&' : decode cs
>     decode (c:cs) = c : decode cs

Gophermap line builders + sanitisation
--------------------------------------

> putLine :: T.Text -> IO ()
> putLine t = TIO.putStr (t <> "\r\n") >> hFlush stdout
>
> -- | >>> infoLine "hello"
> -- "ihello\t\t\t0"
> infoLine :: T.Text -> T.Text
> infoLine msg = "i" <> sanitize msg <> "\t\t\t0"
>
> -- | >>> menuRow '1' "home" "/" "host" "70"
> -- "1home\t/\thost\t70"
> menuRow :: Char -> T.Text -> T.Text -> T.Text -> T.Text -> T.Text
> menuRow t display selector host port =
>   T.singleton t <> sanitize display <> "\t" <> selector
>     <> "\t" <> host <> "\t" <> port
>
> -- | An `h` row whose selector is a `URL:` handoff to a web client.
> --
> -- >>> hRow (Ctx "/s" "host" "70") "site" "https://a.com"
> -- "hsite\tURL:https://a.com\thost\t70"
> hRow :: Ctx -> T.Text -> T.Text -> T.Text
> hRow ctx display url =
>   "h" <> sanitize display <> "\tURL:" <> url
>     <> "\t" <> ctxHost ctx <> "\t" <> ctxPort ctx
>
> -- | >>> searchLine "Search" "/q" "host" "70"
> -- "7Search\t/q\thost\t70"
> searchLine :: T.Text -> T.Text -> T.Text -> T.Text -> T.Text
> searchLine display sel host port =
>   "7" <> sanitize display <> "\t" <> sel <> "\t" <> host <> "\t" <> port
>
> -- | >>> errorItem "nope"
> -- "3nope\t\t\t0"
> errorItem :: T.Text -> T.Text
> errorItem msg = "3" <> sanitize msg <> "\t\t\t0"
>
> terminator :: T.Text
> terminator = "."
>
> -- | >>> sanitize "a\tb\nc"
> -- "a b c"
> sanitize :: T.Text -> T.Text
> sanitize = T.map (\c -> if c == '\r' || c == '\n' || c == '\t' then ' ' else c)