
        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   Show HN: Kage - Shadow any website to a single binary for offline viewing
       
       
        ralferoo wrote 6 hours 8 min ago:
        It sounds like a nice idea, but "It drives a real browser, lets the
        page finish doing whatever it does, grabs the finished result, and then
        rips every script out of it." sounds like it'd fail for a lot of
        websites that do things like display some kind of modal banner inviting
        you to click something to close it, and that prevents scrolling while
        that is open, whilst also delaying loading the rest of the article
        until a certain scroll point is hit.
       
        Departed7405 wrote 8 hours 26 min ago:
        What's the advantage compared to mhtml?
       
        kjmh wrote 9 hours 28 min ago:
        I was floored by the idea of browsing docs offline but disappointed
        that recreating the demo of archiving Paul Graham's essays gave me a
        ZIM with broken images and broken Unicode symbols when viewed in Kiwix.
       
        ekianjo wrote 12 hours 13 min ago:
        Curious about the "keep it for a decade" claim. Can something possibly
        break down the road?
       
        snowflaxxx wrote 13 hours 51 min ago:
        Meet Teleport Pro
       
        aa-jv wrote 14 hours 34 min ago:
        I've been using "Print to PDF" as my principal bookmark management
        tool since 1998, and I have over 90,000 such PDFs sitting on my
        system, easily re-read and discovered.
        
        So I don't quite get what the point of kage is. What does it do that
        print-to-PDF won't already do? The resulting PDFs contain all the
        content, and also include the original URL and creation date, etc. How
        is kage an improvement?
       
        Sathwickp wrote 14 hours 48 min ago:
        I'm still trying to cope with your github profile, 68k commits a year
        is crazyy
       
        c7b wrote 15 hours 20 min ago:
        Probably a stupid question, but could this archive embedded videos as
        well?
       
          tamnd wrote 8 hours 22 min ago:
          Possible, but currently I disable all large files, including videos.
          
          For video downloading, I suggest wrapping around yt-dlp. It's an
          awesome tool.
       
        xlii wrote 18 hours 18 min ago:
        > No tracking, no network calls, no surprises.
        
        Won't comment on the project (though the idea seems interesting), but
        this in the README is a tell for me ;)
       
          praveer13 wrote 7 hours 43 min ago:
          Somehow 'Kage' is the first name Claude suggests to me for any new
          project as well
       
            timerol wrote 3 hours 19 min ago:
            The Japanese word for shadow?
       
          xd1936 wrote 11 hours 40 min ago:
          It's not just no tracking -- It's no surprises.
       
        endorphine wrote 19 hours 47 min ago:
        Anyone remember Teleport Pro?
       
        sails wrote 20 hours 4 min ago:
        What is the best way to give a coding agent a full website so that it
        can see what I see? With animation and design I'm never sure what it
        gets when I save the website in the browser. Maybe this is suitable?
       
        jokethrowaway wrote 20 hours 13 min ago:
        Amazing stuff!
        
        I would recommend an add-on or new feature to detect and remove cookie
        banners / annoying popups that open on load (eg. sign up to my mailing
        list).
        
        listing a few examples form fastText could help you.
        
        You might also have the opposite problem though: some websites have
        content in the base html (so it's searchable by Google and they get
        views) and remove it on load (so you have to pay).
        
        Capturing the initial html and comparing it to the final version could
        give you some hints and allow you to repair the removed content.
        
        Best of luck with the project!
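        
        For what it's worth, the initial-vs-final comparison could be sketched
        roughly like this in Python, assuming you already have the raw HTML
        response and the rendered DOM as strings. The names text_blocks and
        removed_after_render are made up for illustration; nothing here is
        taken from Kage's code:

```python
import difflib
from html.parser import HTMLParser


class _TextBlocks(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalise whitespace so formatting changes do not count as edits.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the list of visible text chunks found in an HTML string."""
    parser = _TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def removed_after_render(initial_html, final_html):
    """Return text present in the initial HTML but missing after rendering."""
    before = text_blocks(initial_html)
    after = text_blocks(final_html)
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    removed = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            removed.extend(before[i1:i2])
    return removed
```

        Whatever it returns is only a hint: placeholders that get legitimately
        replaced on load (a "Loading..." line, say) would show up as well.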
       
        carsonye wrote 22 hours 27 min ago:
        This is interesting. Is the intended use case mostly read-only websites
        like blogs/docs/essays? How well does it handle sites where navigation,
        search, dropdowns, or other UI interactions depend on JavaScript?
       
          tamnd wrote 8 hours 24 min ago:
          If there's more demand for that, maybe I will implement a more
          relaxed version.
       
          tamnd wrote 8 hours 25 min ago:
          Currently, all of that is broken. At one point, I had a traumatic
          experience where an archived HTML file kept redirecting to the live
          site, even though I already had all the content rendered, so I ended
          up disabling all JavaScript entirely.
       
        rickylin wrote 22 hours 34 min ago:
        It seems like [1] is better.
        
  HTML  [1]: https://github.com/tw93/pake
       
          italiancheese wrote 22 hours 23 min ago:
          Both of these projects have completely different purposes and use
          cases.
          
          Have you even read the first line of the readme of the project you're
          commenting on?
       
        smusamashah wrote 22 hours 49 min ago:
        What if I wanted to download all Confluence docs at work?
       
        G_o_D wrote 23 hours 17 min ago:
        How is it different from MHTML?
       
        amatecha wrote 1 day ago:
        Suddenly remembering the days of dialup and your browser serving a
        fully-functional cached copy of a webpage when you try to access it and
        you're not online...
       
        godot wrote 1 day ago:
        the readme uses paulgraham.com as an example (which is text articles
        mostly) and I never use "Save As" for a web page (for the reasons the
        author states), I always just print as PDF and save the PDF file.
        
        for an entire website though of many pages I can see this can be
        useful.
       
        jyscao wrote 1 day ago:
        I tried to clone an HTTP (not HTTPS) site, and it's giving me
        `navigation failed: net::ERR_NAME_NOT_RESOLVED`. Even when I explicitly
        included the protocol with `http://`.
       
        kadhirvelm wrote 1 day ago:
        This is awesome, we wanted an offline copy of someone's prototype (as
        built on Lovable, etc) so we could do version control and sharing in an
        easier format. Wrote our approach here: [1] But will look into this
        now, see if we can swap some stuff out. We've really liked the idea
        of an offline mirror, makes a lot of collaboration use cases simpler
        
  HTML  [1]: https://productnow.ai/blogs/extracting-html-from-ai-prototypin...
       
        chfritz wrote 1 day ago:
        how is this different from using puppeteer to load the page and save
        the DOM as HTML?
       
        sneak wrote 1 day ago:
        The README is LLM slop. This makes me assume the code is the same.
       
        calrizien wrote 1 day ago:
        Does this work for the Apple Docs website? Really tricky to get those
        offline.
       
          tamnd wrote 9 hours 50 min ago:
          Good news for you: here is the command to clone Apple Docs:
          
          ```bash
          bin/kage clone [1] \
            --scope-prefix /documentation/ \
            --out /Users/apple/data/apple-docs \
            --chrome "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
            --max-pages 0 --max-depth 0 \
            --workers 3 --browser-pages 3 --asset-workers 6 \
            --render-timeout 60s --settle 2s --timeout 30s \
            2>&1 | tee -a /Users/apple/apple-docs.log
          ```
          
          Adjust it to your needs :)
          
          I smoke-tested it, and all the content and CSS work, but I stripped
          all the JS, so the sidebar won't work.
          
          If you run into any problems, feel free to create new issues in the
          repo. It helps me prioritize and know what should be fixed.
          
  HTML    [1]: https://developer.apple.com/documentation/
       
          tamnd wrote 1 day ago:
          Making docs available offline was one of my main motivations for
          building this tool. I will try Apple Docs too.
          
          I previously downloaded the Snowflake docs, and it was something like
          tens or even hundreds of thousands of pages, I do not remember
          exactly. The output ended up being very large.
          
          By the way, I forgot to add zstd compression support to my ZIM
          reader/writer. I will implement that in the next version.
       
        cynicalsecurity wrote 1 day ago:
        Binary app is a really bad way of storing data. No one would ever want
        to run a binary shared with them or found online.
       
          tamnd wrote 1 day ago:
          For sharing, it's better to use the HTML folder or the ZIM format; Kage
          supports both.
       
        nitotm wrote 1 day ago:
        I was looking for something like this the other day, it can be very
        helpful.
       
        coffeecoders wrote 1 day ago:
        I've accumulated a bunch of old website archives over the years. The
        funny thing is the ugly HTML dumps have been more useful than the
        "perfect" archive.
        
        It's one of the reasons I've become a bigger fan of RSS over time. A
        feed from 10-ish years ago is often more usable today than a carefully
        preserved (application) website.
       
          couscouspie wrote 13 hours 27 min ago:
          Maybe it is just me, but by far most of the time, when I want to
          archive something from the internet, it is information and
          information is best served in an absolutely minimal text format like
          html or md.
       
          tamnd wrote 1 day ago:
          I have a project for creating and archiving RSS feeds, keeping the
          full history from the time the crawler starts. I need to clean up a
          bit, then will open source it soon.
       
        Onavo wrote 1 day ago:
        How does it handle websites with client side paywalls? Can you run it
        with extensions like bypass paywalls and ublock origin?
       
        KellyCriterion wrote 1 day ago:
        Sounds like .MCH-files re-invented? (-:
       
        soulofmischief wrote 1 day ago:
        Cool project! I know it's written in go, but it would be cool to see
        something like this which uses Cosmopolitan Libc + redbean or something
        similar to create a binary which runs anywhere. Would be fun to be able
        to pass around self-executable website archives. [1] [2] [3]
        (Certificates just expired for justine's website, just ignore the
        warning.)
        
  HTML  [1]: https://github.com/jart/cosmopolitan
  HTML  [2]: https://justine.lol/cosmopolitan/index.html
  HTML  [3]: https://redbean.dev
       
          jokethrowaway wrote 20 hours 11 min ago:
          I never understood the appeal for cosmopolitan.
          
          I'd rather have platform specific minimal binaries than a single
          binary with hacks.
          
          Installing packages is a solved problem
       
            soulofmischief wrote 16 hours 1 min ago:
            Installing packages is a completely different activity than passing
            around self-executable archives among friends. Not everything needs
            to go through a CI pipeline and distribution platform before you
            can share it with others. On top of that, I really enjoy being able
            to write quick little utilities and then pass them around without
            worrying about what operating system anyone who stumbles upon it
            has.
            
            It's fine if you don't personally find it useful for your workflow,
            but I think it's mad cool, especially since you can zip together
            multiple binaries into one, along with data.
       
          tamnd wrote 1 day ago:
          This could be a nice code golf project. It only needs a webview, a
          ZIM reader, and a way to append data to an existing binary and read
          it back.
          
          I did something like that a very long time ago (Of course, I have
          forgotten)
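          
          The "append data to an existing binary and read it back" part could
          look roughly like the sketch below. The footer layout (payload
          length plus an 8-byte marker) and the KAGEDEMO marker are
          assumptions for illustration, not Kage's actual packing format:

```python
import os
import struct

MAGIC = b"KAGEDEMO"             # hypothetical marker, not Kage's real format
FOOTER = struct.Struct("<Q8s")  # payload length (uint64, little-endian) + marker


def append_payload(binary_path, payload):
    """Append payload bytes plus a fixed-size footer to an existing file."""
    with open(binary_path, "ab") as f:
        f.write(payload)
        f.write(FOOTER.pack(len(payload), MAGIC))


def read_payload(binary_path):
    """Read the payload back by locating it from the footer at end of file."""
    with open(binary_path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size < FOOTER.size:
            raise ValueError("file too small to contain a footer")
        f.seek(size - FOOTER.size)
        length, magic = FOOTER.unpack(f.read(FOOTER.size))
        if magic != MAGIC or length > size - FOOTER.size:
            raise ValueError("no appended payload found")
        # The payload sits directly in front of the footer.
        f.seek(size - FOOTER.size - length)
        return f.read(length)
```

          Putting the length at the very end means the reader never has to
          parse the executable itself; it only seeks backwards from EOF.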
       
        shinryuu wrote 1 day ago:
        Reminds me of this. [1] Compared to that is there anything kage does
        better?
        
  HTML  [1]: https://gwern.net/gwtar
       
        chinnyys wrote 1 day ago:
        The readme is AI slop, and incredibly grating to read. The disgust I
        felt while reading it almost put me off trying the project.
        
        Is the code also AI slop?
       
        simonw wrote 1 day ago:
        I was intrigued to see how the demo GIF in the README was generated:
        [1] Turns out it's using another project by the same author: [2] The
        script used for the demo is at [3] and has a comment showing how to run
        it:
        
          ascii-gif render docs/demo/kage.tape -o docs/static/demo.gif
        
        Looks like it's an opinionated wrapper around [4]
        
  HTML  [1]: https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c6306...
  HTML  [2]: https://github.com/tamnd/ascii-gif
  HTML  [3]: https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c6306...
  HTML  [4]: https://github.com/charmbracelet/vhs
       
          tamnd wrote 1 day ago:
          I have a bunch of opinionated/personal-use binaries like this in my
          $HOME/bin/, like delete-all-npm, clean-rust-cache,
          download-youtube-playlist, and get-markdown. It feels good, and I
          don't need to remember any commands. Sometimes my coding agent can
          figure out how to call some of those tools too ;))
       
          vqtska wrote 1 day ago:
          You can also do an animated svg which is way smaller than a gif
          because it's just text keyframes ( [1] )
          
  HTML    [1]: https://github.com/vytskalt/pseudoc/blob/main/assets/factori...
       
            Noumenon72 wrote 22 hours 19 min ago:
            How can you do it? I don't see an SVG output from ascii-gif.
       
              vqtska wrote 16 hours 35 min ago:
              I used a different project: [1]
              
  HTML        [1]: https://github.com/marionebl/svg-term-cli
       
              LocoPadre wrote 18 hours 45 min ago:
              it might be this:
              
  HTML        [1]: https://github.com/mrmarble/termsvg
       
            embedding-shape wrote 1 day ago:
            Very cool, never thought of that! "way smaller" is almost an
            understatement, when it's 50kb :P Neat that it loads in GitHub
            READMEs as well, which is probably a large reason people use .gif
            today.
       
          stavros wrote 1 day ago:
          VHS is fantastic for scripting cli video generation.
       
          jubilanti wrote 1 day ago:
          Have you heard the good news about the terminal savior asciinema? [1]
          
  HTML    [1]: https://asciinema.org/
       
            embedding-shape wrote 1 day ago:
            It's a cool tool/platform, but very different. Asciinema tries to
            make the "multimedia" itself better by making it actual text
            instead of being video/images, while the CLI command above turns
            actual text into multimedia supported by platforms already. Both
            are useful, both have their use cases :)
       
          alterom wrote 1 day ago:
          FYI, on other platforms (Windows/MacOS), LiceCAP is a fantastic tool
          to record screen into compact GIFs by the author of Winamp and Reaper
          DAW:
          
  HTML    [1]: https://www.cockos.com/licecap/
       
        ninalanyon wrote 1 day ago:
        > kage serve $HOME/data/kage/paulgraham.com
        
        If the result is static why does it need a server?  Isn't it possible
        to make it so that it can simply be opened by the browser? Like:
        
        $ firefox $HOME/data/kage/paulgraham.com
        
        Then the result would be usable on machines without kage installed.
       
          tamnd wrote 1 day ago:
          You could use python -m http.server instead. I haven't tried it yet,
          but it should work.
          
          Actually, Kage has two parts: a crawler that crawls pages and
          converts them to clean HTML by capturing the DOM after rendering in
          Chrome/Chromium, and a pack/serve component that packages the result
          as either a ZIM file for Kiwix or an executable file.
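          
          For anyone who wants the python -m http.server route from a script,
          a minimal sketch using only the standard library is below. The
          make_static_server name and the default port are assumptions for
          illustration, and like the suggestion above it is untested against
          real Kage output:

```python
import functools
import http.server


def make_static_server(directory, host="127.0.0.1", port=8000):
    """Build an HTTP server that serves files from the given directory."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Binding to 127.0.0.1 keeps the archive off the local network.
    return http.server.ThreadingHTTPServer((host, port), handler)
```

          Call serve_forever() on the returned server to start it. The
          one-liner equivalent is python -m http.server --directory PATH.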
       
          afavour wrote 1 day ago:
          You'll likely run into a ton of CORS issues doing that.
       
            embedding-shape wrote 1 day ago:
            I don't think so. There are no HTTP requests being made from JS, as
            it's stripped away, and all the other resources are pulled down
            (and I assume their references are made relative), so there really
            shouldn't be any issues because of CORS at all.
       
          doctoboggan wrote 1 day ago:
          Usually JavaScript is blocked when you load pages that way.
       
            recursive wrote 1 day ago:
            I am quite familiar with this and it is factually false
       
              danielheath wrote 1 day ago:
              Js modules don't work on file urls (classic js does).
       
                recursive wrote 23 hours 20 min ago:
                They can be made to work with blob urls.  I have done this.
       
                  danielheath wrote 3 hours 27 min ago:
                  Okay that's super interesting and I would love to see an
                  example or writeup - I have a project which would benefit
                  from being able to do that.
       
                    recursive wrote 2 hours 40 min ago:
                    It's a technique I created (someone else must have done it
                    first??) for a sandbox demonstrating a web UI framework I
                    made. [1] To see it work, click "Download self contained
                    .html" from the menu.
                    
                    Here's the source file that handles this part: [2] The idea
                    is to use  to define modules.  That's something I just made
                    up.  For each such script, provision a blob URL.  The main
                    blocker is usually the same origin policy.  Crucially,
                    these blob URLs count as the same origin.  So then you need
                    to rewrite the imports from the named modules to the blob
                    URLs.  I used some regex rather than a proper parser, but
                    it was more than good enough for me.
                    
                    It seems quite doable to make some proper bundling tools
                    around this concept.
                    
  HTML              [1]: https://mutraction.dev/sandbox
  HTML              [2]: https://github.com/tomtheisen/mutraction/blob/mast...
       
            dmazzoni wrote 1 day ago:
            Not all JavaScript, but a lot of APIs are restricted
       
            embedding-shape wrote 1 day ago:
            Since when? You won't be able to make HTTP requests to localhost,
            as it'd be a different Origin, but I don't think any mainstream
            browser blocks JS outright when you use file:// to load and view
            HTML files.
       
              rzzzt wrote 1 day ago:
              Somewhere around 2019, each document loaded from file:// became
              its own origin in Firefox: [1] (I didn't check when this happened
              in Chromium)
              
              Related WHATWG discussion: [2]
              
  HTML        [1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1500453
  HTML        [2]: https://github.com/whatwg/html/issues/3099
       
                embedding-shape wrote 1 day ago:
                Yeah, but that's fine, the document is .html, and it can load
                ./app.js or ./style.css just fine even if loaded by file:// (as
                long as it isn't initiated by JS itself, then Origin starts to
                matter a lot more), otherwise basically every single local HTML
                file would suddenly be broken, I don't think anyone would have
                accepted that even with the origin changes.
       
                  rzzzt wrote 7 hours 39 min ago:
                  I tried this on a small example and it works indeed. In my
                  head this would have been something like a restrictive CSP
                  script-source directive, even if not exposed in response
                  headers or anything.
       
                    embedding-shape wrote 2 hours 2 min ago:
                    > I tried this on a small example and it works indeed.
                    
                    I was thinking "of course it works, how else would people
                    get started creating websites otherwise?" then I remember
                    what's the most common approaches in the frontend ecosystem
                    nowadays.
                    
                    Back in the days of yore, every tutorial/book started with
                    "First we create a index.html file which you open in your
                    browser ...", even a JavaScript resource would start with
                    this of course :)
       
                  dncornholio wrote 19 hours 35 min ago:
                  React and Angular are completely broken through file://
       
                    embedding-shape wrote 16 hours 51 min ago:
                    I don't know about Angular, but React works perfectly fine
                    through file://. I'd think the bundler/packager matters more
                    than what JS libraries you use. Are you sure you're not
                    actually thinking of something else not handling file://
                    properly?
       
            pixelatedindex wrote 1 day ago:
            I thought all the JS was stripped?
       
        latexr wrote 1 day ago:
        For those with an eReader, one thing that works really well is using
        pandoc to download and convert a webpage to EPUB that you can then load
        to your reader.
        
          pandoc --from html --to epub --output /PATH/TO/FILE.epub https://example.com
       
          arikrahman wrote 1 day ago:
          Thanks, will try this out on the Kobo later.
       
        telesilla wrote 1 day ago:
        I've been using httrack ( [1] ) to download wikis to read on flights,
        which isn't perfect but better than I'd found previously. I'll try this
        out, I'd be delighted to have good results. Thanks for the post.
        
  HTML  [1]: https://www.httrack.com
       
          tamnd wrote 1 day ago:
          This brings back memories. Around twenty years ago, internet was
          still expensive dial-up, so I used to go to an internet cafe, run
          HTTrack to download websites and manga, copy everything onto my tiny
          128MB USB stick (felt very large at that time), then bring it home
          and read offline ;))
       
          nikisweeting wrote 1 day ago:
           [1] or browsertrix may be easier to use for some, it's what was used
          to save a lot of the data.gov stuff before it got taken down.
          
  HTML    [1]: https://github.com/archiveteam/grab-site
       
          throwaway219450 wrote 1 day ago:
          Specifically for wikis, is there a reason you wouldn't use Kiwix? For
          non "official" releases it's more complicated, but there are some
          services to generate the ZIM files. The desktop reader app is pretty
          good in my experience. [1] EDIT: [2]
          
  HTML    [1]: https://wiki.openzim.org/wiki/Build_your_ZIM_file
  HTML    [2]: https://get.kiwix.org/en/solutions/applications/kiwix-reader...
       
            tamnd wrote 1 day ago:
            Kiwix has readers for almost every platform: Android, desktop,
            iPhone. That's why I made Kage produce ZIM files.
            
            The executable file is mostly for people who don't have Kiwix
            installed yet, or just want to run the archive directly.
       
            telesilla wrote 1 day ago:
            Thanks, never knew about this and great to hear about it.
       
        Igor_Wiwi wrote 1 day ago:
        This is quite a useful tool, especially for cases where internet
        access is limited (flights, for example). I implemented it as a
        separate feature in mdview.io: for example, you can export a document as
        an HTML file for offline usage, with all the presentation features like
        rich tables, mermaid, etc. built in. Example: [1] then try Export
        - Export HTML
        
  HTML  [1]: https://mdview.io/s/why-markdown-became-default-format-for-ai
       
        delduca wrote 1 day ago:
        curl can do this
       
        daviding wrote 1 day ago:
        Nice idea!
        fwiw, false positives and all, but the Windows 11 default Windows
        Security doesn't like it:
        `leakless.exe: Operation did not complete successfully because the file
        contains a virus or potentially unwanted software.`
       
        lolpython wrote 1 day ago:
        This is cool. I could see myself downloading the articles behind the
        first couple pages of hacker news with this, for viewing on a flight or
        long distance train ride with spotty internet
       
        dimiprasakis wrote 1 day ago:
        Neat project, I like the idea.
        One thing from a quick read: you launch Chrome with --no-sandbox. Is
        there a good reason for that? Security wise it's probably not a good
        idea. If there is no reason, I'd suggest leaving the sandbox on!
        
        In any case, cool stuff :)
       
          nikisweeting wrote 1 day ago:
          --no-sandbox is needed in docker, maybe they assume it will mostly
          run in docker?
       
            tamnd wrote 1 day ago:
            Exactly. For downloading, Kage requires Chrome or Chromium. Running
            it inside Docker makes setup easier and keeps cleanup simple: [1]
            Btw, let me think of a way to enable this only when running inside
            Docker.
            
  HTML      [1]: https://github.com/tamnd/kage/blob/main/Dockerfile
       
              nikisweeting wrote 1 day ago:
              Docker is designed to be undetectable by default, the best way I
              have found is to set env IN_DOCKER=True manually in your
              Dockerfile + check that there is no $DISPLAY configured + that
              you're on linux. Usually if all/most of those are true you can
              safely add --no-sandbox --disable-setuid-sandbox
              --disable-dev-shm-usage etc. all the docker-specific flags. That's
              what we do in [1]
              
  HTML        [1]: https://github.com/ArchiveBox/ArchiveBox/blob/dev/Docker...
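              
              As a rough sketch of that heuristic (requiring all three
              signals rather than "all/most"), something like the following
              could work. IN_DOCKER is the manually set variable described
              above; the function names are hypothetical and this is not
              ArchiveBox's or Kage's actual code:

```python
import os
import platform

# Flags named in the comment above; only added when we think we are in Docker.
DOCKER_ONLY_FLAGS = [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",
]


def looks_like_docker(env=None, system=None):
    """Heuristic: IN_DOCKER=True is set, no $DISPLAY, and the OS is Linux."""
    env = os.environ if env is None else env
    system = platform.system() if system is None else system
    return (
        env.get("IN_DOCKER", "").lower() == "true"
        and not env.get("DISPLAY")
        and system == "Linux"
    )


def chrome_flags(base_flags, env=None, system=None):
    """Return base_flags, plus the sandbox-disabling flags only in Docker."""
    flags = list(base_flags)
    if looks_like_docker(env, system):
        flags.extend(DOCKER_ONLY_FLAGS)
    return flags
```

              Outside the container the sandbox stays on, since none of the
              extra flags get appended.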
       
                dimiprasakis wrote 14 hours 42 min ago:
                Cool approach.
                
                But a compromise still lands on the host's kernel; Docker doesn't
                provide kernel isolation (well, it does on macOS because it
                runs in the Docker machine, but that's a side effect).
                
                I wonder if a better solution would be to play with seccomp or
                Linux capabilities so that Chrome is sandboxed even in Docker.
                Not sure how this would work tbh.
                
                Answering here to get ideas, I saw your fix on Git and request
                for feedback (will try to review and give it some thought once
                I find some time)
       
                  nikisweeting wrote 3 hours 22 min ago:
                  I have never seen anyone pull off seccomp nested sandboxing
                  of Chrome in Docker before, if you manage to figure it out
                  please let me know!
       
                tamnd wrote 19 hours 37 min ago:
                It should be fixed by [1] Thanks for the nice trick.
                
  HTML          [1]: https://github.com/tamnd/kage/pull/12
       
        grahamstanes17 wrote 1 day ago:
        nice
       
        wolttam wrote 1 day ago:
        One use I'd have for this is company wikis that you want to give folks
        easy offline access to (maybe the wiki has documentation that's useful
        at sites that don't have cellular coverage).
        
        Cool!
        
        It would be especially cool to have a version that didn't require the
        separate serving process - even though it's nifty you can package up a
        whole site as a single binary.
        
        Maybe a single HTML entrypoint shim with a bit of javascript that could
        index into an archive (potentially embedded) of the site's content?
       
          everforward wrote 10 hours 39 min ago:
          This is a nice way to do it if you're already stuck with a solution
          (print to PDF would probably also work, if you can script it).
          
          In a green field world, I have a personal requirement that technical
          documentation systems are capable of bulk exporting to a
          human-readable format on disk. I'm pretty flexible on what that is,
          though. Markdown is preferred, but I'm also fine with static,
          dependency-free HTML and I could accept PDFs if the rest of it is
          super nice.
          
          It's an integral part of DR (disaster recovery), and most places want
          their docs on-premise, so DR effectively requires offline
          documentation. Everywhere I've worked either a) writes documentation
          in something that works offline (eg git repo with tarballs
          somewhere), or b) has invested a bunch of time in trying to scrape
          their own wiki into something legible during DR.
          
          I guess it's a long-winded way of saying "that's using a tool
          to fix a self-inflicted problem that shouldn't exist".
       
          gwern wrote 1 day ago:
          > Maybe a single HTML entrypoint shim with a bit of javascript that
          could index into an archive (potentially embedded) of the site's
          content?
          
          So something like SingleFileZ [1] or Gwtar [2] ?
          
  HTML    [1]: https://github.com/gildas-lormeau/SingleFileZ
  HTML    [2]: https://gwern.net/gwtar
       
          tamnd wrote 1 day ago:
          Submitting this to Hacker News is the right place! Thanks for your
          idea. I will consider implementing that :)
          
          Also, in my mind, I already have a script/program to convert HTML to
          Markdown, so it could actually store everything on disk as a folder
          of Markdown files, and then commit them to a Git repo.
       
            d3Xt3r wrote 19 hours 18 min ago:
            I'd like to request something between what GP suggested and what
            your program is doing currently - basically I still want a single
            binary, but instead of embedding a full browser in it, I would like
            the binary to be just a self-extracting archive that calls the
            user's default browser, maybe in a new window/frame.
            
            Basically I'm looking for something like the old-school .chm files
            on Windows, where you could pack a bunch of HTML documents into a
            single archive and open it without needing to embed a full browser
            engine.
            
            This would have the advantage of keeping the file sizes really
            small. And you don't have to worry about the browser engine
            becoming outdated and potentially becoming an attack vector.
       
              Bad_CRC wrote 14 hours 21 min ago:
              I instantly searched for chm on the comments and yours was the
              only one :o
       
                d3Xt3r wrote 5 hours 14 min ago:
                I really miss the old .chm files, they used to be quite
                thorough, like the ones that came with MSDN / VS 6.0. In modern
                times, AutoHotkey continued that tradition, absolutely love
                their comprehensive .chm. But in the Linux/modern world,
                outside of the man pages, you need to go to the web every time
                to look up stuff and I hate that.
       
                samat wrote 13 hours 30 min ago:
                You are not alone
                
                For the younger generation
                
  HTML          [1]: https://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_...
       
            mcdonje wrote 23 hours 16 min ago:
            Not to load you up with too many ideas, but a markdown folder
            sounds a lot like obsidian, which has a plugin system now.
            
            Epub would also be a great target.
       
            smeej wrote 1 day ago:
            I would use the shit out of this. I'm a heavy user of Logseq (OG,
            the md file-based version). Would LOVE to save my favorite web
            resources this way.
       
            mgiampapa wrote 1 day ago:
            I think the zim flow was perfect for offline use. I know I will be
            making use of it as soon as I can figure out how to pass chrome the
            cookies so I can be signed into the site. Didn't see it in the
            page, but I didn't look closely yet.
       
              tamnd wrote 1 day ago:
              Not yet supporting cookies, since I created this tool for
              shadowing public websites first. I will add options to pass
              cookies later. It will pass them to the underlying
              Chrome/Chromium process, so it should not be hard to do.
       
        rahimnathwani wrote 1 day ago:
        So this is like using wget --mirror except that it works on pages that
        require javascript, right?
       
          tamnd wrote 1 day ago:
          Yeah, it is. For example, openai.com is rendered with Next.js, so I
          will try to mirror it tomorrow.
       
        sanqui wrote 1 day ago:
        Cool concept.  I would like to see this combined with mitmproxy for
        archive grade fidelity.  You could be saving exactly the data served
        and at the same time a representation by a modern (contemporary)
        browser, with all JS having run.  This combination would be my perfect
        replacement for the WARC format.
       
          Dhavidh wrote 1 day ago:
          Sounds interesting.
       
          tamnd wrote 1 day ago:
          I'm working on WARC too, using the format from Common Crawl!
          
          By converting it to Markdown, we save a lot of space, but it is for a
          different purpose and a different project:
          
  HTML    [1]: https://github.com/tamnd/ccrawl-cli
       
            sanqui wrote 1 day ago:
            That's neat!  In my opinion, the WARC format is quite tricky and
            underspecified especially since HTTP2 introduced new semantics.  It
            encodes too much in-band and requires rewriting of the server data.
             A mitmproxy capture is higher fidelity and supports capturing
            modern features such as WebSockets.  I think if we could wrap
            Kage's crawler interactions with it and store its capture (the
            intercepted traffic), we could make a potentially nice new archival
            format.
       
              tamnd wrote 1 day ago:
              I tried to follow well-known formats first, such as WARC and ZIM
              from Kiwix, so we could benefit from existing tooling support.
              
              For my own custom data format, I have a lot of private code that
              I plan to release soon. It is optimized for compression, fast
              lookups, and more. I have been working on it for two years. 
              This is part of a larger, ambitious umbrella project: I am
              building Google from scratch (all open source), something that
              anyone can host, including the crawler, indexer, storage, and
              serving layers. Stay tuned!
       
                threecheese wrote 1 day ago:
                OK, sounds fascinating; following! (your GH)
       
                  tamnd wrote 1 day ago:
                  Thanks ;)
       
                Prime_Axiom wrote 1 day ago:
                Looking forward to the next project! I love these kinds of
                archiving tools.
       
                sanqui wrote 1 day ago:
                I'm a fan of compatibility with established formats!
                
                Sounds awesome.  There is a lot of untapped potential with
                respect to efficiently archiving and indexing websites.  I saw
                the impressive things Marginalia Search is doing in this area
                (the blog is great when it gets technical).  There are also a
                lot of very complete archives of websites out there which are
                not being indexed at all, and I would love to make them
                available for researchers.  In any case, I'm interested in your
                project!
       
        gregwebs wrote 1 day ago:
        This seems like it has the potential to create a lot of load on a site.
        Are there settings to control how fast it clones or to skip
        images/videos?
        Is there a way to only get a subset of a website?
       
          ares623 wrote 1 day ago:
          Just pretend you're an AI crawler, problem solved.
       
          tamnd wrote 1 day ago:
          Could you help create a new issue for that? I will do it later. It is
          already 1:00 AM my time, but I am happy that anyone is interested in
          it. : )
       
        maxloh wrote 1 day ago:
        I find SingleFile [1] to be a much more robust version of this.
        
        It strips out all the JavaScript too, but also packs everything into a
        single HTML file that is easy to transfer. Binary assets (like web
        fonts and images) are packed as base64 strings.
        
        They also offer a CLI powered by Puppeteer. [2]
        
  HTML  [1]: https://github.com/gildas-lormeau/singlefile
  HTML  [2]: https://github.com/gildas-lormeau/single-file-cli
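        
        To illustrate the base64 packing mentioned above, here is a minimal
        sketch. It is not SingleFile's actual code: toDataUri and inlineAsset
        are hypothetical helper names, and it assumes Node.js (it relies on
        Buffer, so a browser version would need a different encoder).

```javascript
// Hypothetical helpers for illustration only, not taken from SingleFile.

// Encode raw bytes as a data: URI so the asset can live inside the HTML file.
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString("base64") // Node.js-only API
  return `data:${mimeType};base64,${base64}`
}

// Swap every occurrence of an asset URL in the HTML string for its inlined form.
function inlineAsset(html, assetUrl, bytes, mimeType) {
  return html.split(assetUrl).join(toDataUri(bytes, mimeType))
}

const page = '<img src="logo.png">'
const inlined = inlineAsset(page, "logo.png", new Uint8Array([1, 2, 3]), "image/png")
console.log(inlined) // <img src="data:image/png;base64,AQID">
```

        Base64 makes each asset roughly a third larger, which is the price of
        getting one portable file.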
       
          arikrahman wrote 1 day ago:
          This is what I first thought and it's a very elegant solution, and
          not needlessly overcomplicated.
       
          wamatt wrote 1 day ago:
          Love love love SingleFile too. The FF extension works pretty well for
          a clean save.
          
          That said, Kage looks promising if OP can combine SingleFile
          reproduction quality with the HTTrack spidering approach. SPAs are
          kinda tricky to archive and I do wonder how well Kage would handle
          that.
       
            initramfs wrote 1 day ago:
            I've seen the option in IE- .mhtml.
            
            For some reason it displays better in IE, but I don't recall seeing
            this option in Chrome or Firefox recently.
       
          HelloUsername wrote 1 day ago:
          What's the difference from File -> Save As in any web browser on a
          computer?
       
            dmazzoni wrote 1 day ago:
            Save As works fine for simple websites with static content.
            
            Let's say you have a site that fetches content from a database. If
            you Save As, then at best you'll get a local copy of an HTML page
            with JS that loads the content from the same remote database. It
            might not work (since the local copy has a different origin), or if
            it does, it requires you to be online, which defeats half of the
            purpose.
            
            What this project, and SingleFile, both do is save a snapshot of
            what the rendered page actually looks like at that moment in time.
            The scripts are stripped out so it runs locally and has no external
            dependencies.
       
            nmstoker wrote 1 day ago:
            That's for a single page, this handles the whole site. Also the
            browser Save As options often work poorly.
       
          tamnd wrote 1 day ago:
          And thanks for the link. Let me implement this single HTML feature,
          it looks nice to have!
       
            maxloh wrote 1 day ago:
            Yeah. An idea on top of that is to bundle an entire website into a
            single HTML page, with vendored JavaScript to enable client-side
            routing (all of the original pages' JS is still stripped out).
            
            That way, the page is self-contained as it is, but requires no
            bundled binary code to serve the site. It is actually safer
            security-wise.
            
            The vendored script can be as simple as this:
            
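              // Map of link href -> saved HTML for that page
              const site = {
                "path-1": " ... ",
                "path-2": " ... ",
                // More paths
              }
            
              function attachListeners() {
                for (const [path, html] of Object.entries(site)) {
                  // A path may be linked several times, or not at all, on this page
                  for (const link of document.querySelectorAll(`a[href="${path}"]`)) {
                    link.onclick = (event) => {
                      event.preventDefault() // stay in this file instead of navigating
                      // outerHTML cannot be set on the root element, so use innerHTML
                      document.documentElement.innerHTML = html
                      attachListeners() // the new page's links need handlers too
                    }
                  }
                }
              }
            
              document.addEventListener("DOMContentLoaded", attachListeners)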
       
          tamnd wrote 1 day ago:
          It seems this repo only saves one web page?
          
          What I'm implementing here is mirroring a whole website, with all its
          subpages, so you can browse it all offline. For example, all essays
          from paulgraham.com.
       
            nikisweeting wrote 1 day ago:
            Singlefile supports scoped recursive crawls too: [1] I highly
            recommend reading the singlefile source or [2] to see how they
            handle closed shadow DOMs, cross-origin iframes, websockets, media
            urls, deduping large assets, etc.
            
  HTML      [1]: https://github.com/gildas-lormeau/single-file-cli#:~:text=...
  HTML      [2]: https://archiveweb.page/
       
            sillysaurusx wrote 1 day ago:
            > For example, all essays from paulgraham.com
            
            Not the same thing, but I made a clone of pg's website which can
            be used for exactly that: [1] [2] If you want to read all essays,
            just clone the repo and open any of the .html files. Or any of the
            .page files which generated them.
            
  HTML      [1]: https://github.com/shawwn/pg
  HTML      [2]: https://shawwn.github.io/pg/
       
            maxloh wrote 1 day ago:
            Oh, I see. In that case, feature-wise, it is actually a modern
            alternative to HTTrack.
            
            I think the misunderstanding stems from the browser's "Save As"
            reference in the description. It is misleading. You use "Save As"
            to save a single page, not an entire website.
            
            Also, the description lacks a clear explanation of the project's
            purpose. It would be helpful to include a sentence explaining that
            the program downloads an entire website, not just a single page.
       
       