_______ __ _______
| | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----.
| || _ || __|| < | -__|| _| | || -__|| | | ||__ --|
|___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____|
on Gopher (unofficial)
HTML Visit Hacker News on the Web
COMMENT PAGE FOR:
HTML Two Bits Are Better Than One: making bloom filters 2x more accurate
IshKebab wrote 9 hours 57 min ago:
Is this worth reading? The text is LLM slop.
aziis98 wrote 9 hours 47 min ago:
> The win? Massive.
But there are benchmark numbers at least, so maybe they only used it
for the prose
geon wrote 10 hours 29 min ago:
Nice.
The ligature « for << was a bit confusing.
And a nitpick: finding a match in a bloom filter isn't a false
positive, it is inconclusive.
Since bloom filters are only designed to give negatives, never
positives, the concept of a false positive is nonsensical. Yeah, I get
what you mean, but language is important.
eesmith wrote 10 hours 3 min ago:
Calling it a false positive is entirely in line with the historical
use.
Back in the 1980s or earlier it was called a "false drop".
Knuth, for example, talks about it in "The Art of Computer
Programming", v3, section 6.5, "Retrieval on Secondary Keys", using
cookie ingredients. (See [1] for Tharp using the same example.)
Bloom filters are a type of superimposed coding.
HTML [1]: https://archive.org/details/fileorganisation0000thar/mode/2u...
londons_explore wrote 14 hours 41 min ago:
It's true that a fixed-size bloom filter lets the compiler generate
better code...
But another approach is to use C++ templating so you can have say 10
different 'fixed size' implementations with no additional overhead, and
at runtime select the most suitable size.
For the couple of kilobytes of extra code size, this optimisation has
to be worth it assuming table size is variable and there are some stats
to give cardinality estimates...
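A minimal sketch of the size-selection idea (in Python rather than the C++ templates the comment proposes; the size table and bits-per-key budget are invented for illustration):

```python
# Illustrative sketch only: pre-build a handful of fixed-size variants
# (in C++ these would be template instantiations) and pick the smallest
# one that meets a bits-per-key budget from a cardinality estimate.
FIXED_SIZES_BITS = [1 << p for p in range(10, 21)]  # 1 Kbit .. 1 Mbit

def pick_filter_size(estimated_keys, bits_per_key=10):
    """Pick the smallest fixed-size variant meeting the budget."""
    needed = estimated_keys * bits_per_key
    for size in FIXED_SIZES_BITS:
        if size >= needed:
            return size
    return FIXED_SIZES_BITS[-1]  # estimate too big: use the largest variant
```

With fixed power-of-two sizes, the index computation in each variant can be a compile-time mask instead of a runtime modulo, which is where the speedup comes from.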
krackers wrote 14 hours 47 min ago:
Isn't this idea similar to the classic "power of 2 random choices"?
HTML [1]: https://www.eecs.harvard.edu/~michaelm/postscripts/handbook200...
londons_explore wrote 14 hours 47 min ago:
What are they running this code on?
I doubt their hardware is any faster shuffling bits in a uint32 than a
uint64, and using uint64 should have a decent benefit to the false
positive rate...
FreakLegion wrote 13 hours 59 min ago:
That struck me as an odd choice, too. On average there's no
difference in false positives, but the smaller the blocks, the more
likely they'll be saturated. Since there are 6 leftover bits in the
hash anyway, there's no cost to increase the two 5-bit values to 6
bits and the block size to 64. You'll have a lot fewer hot blocks
that way.
With blocks this small there's also no reason not to optimize the
number of hash functions (albeit this brings back the specter of
saturation). There are no cache misses to worry about; all positions
can be checked with a single mask.
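A hedged sketch of the 64-bit-block variant described above (a Python toy, not the article's code; blake2b stands in for whatever hash the real implementation uses; two 6-bit slices of one hash pick the bits, the leftover bits pick the block):

```python
import hashlib

class BlockedBloom64:
    """Toy blocked bloom filter with 64-bit blocks and 2 bits per key."""

    def __init__(self, num_blocks):
        self.blocks = [0] * num_blocks  # each int models one 64-bit block

    def _slot(self, key):
        h = int.from_bytes(
            hashlib.blake2b(key.encode(), digest_size=8).digest(), "little")
        # two 6-bit slices -> two bit positions within the 64-bit block
        mask = (1 << (h & 63)) | (1 << ((h >> 6) & 63))
        block = (h >> 12) % len(self.blocks)  # remaining bits pick the block
        return block, mask

    def add(self, key):
        block, mask = self._slot(key)
        self.blocks[block] |= mask

    def might_contain(self, key):
        block, mask = self._slot(key)
        return self.blocks[block] & mask == mask
```

Both probed bits live in one machine word, so a lookup is a single load plus a mask compare.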
ozgrakkurt wrote 15 hours 12 min ago:
You can actually make those two bits more independent afaik. [1] [2]
First one is useful for grasping the idea, the second one is more
comprehensive. Both do multiple bit loads but try to make them as
independent as possible, as far as I can understand.
Also hash function has huge effect on bloom filter performance. I was
getting 2x perf when using xxhash3 instead of wyhash even though wyhash
is a faster hash function afaik.
HTML [1]: https://github.com/apache/parquet-format/blob/master/BloomFilt...
HTML [2]: https://github.com/facebook/rocksdb/blob/main/util/bloom_impl....
h33t-l4x0r wrote 16 hours 43 min ago:
Hmm, Bloom filters seem important. I'm wondering why my CS education
never even touched on them and it's tbh triggering my imposter
syndrome.
jb3689 wrote 8 hours 17 min ago:
Distributed systems and probabilistic data structures really should
be in every undergrad CS curriculum, even if just in passing for the
latter.
on_the_train wrote 11 hours 45 min ago:
Unpopular opinion: They're one of these things that are popular
because of their cool name. For most purposes, they're not a good fit
vyr wrote 14 hours 33 min ago:
they're common in databases and performance instrumentation of
various kinds (as are other forms of data structure "sketch" like
count sketches) but not as common outside those realms.
i've gotten interview questions best solved with them a few times; a
Microsoft version involved spell-checking in extremely limited
memory, and the interviewer told me that they'd actually been used
for that back in the PDP era.
dheera wrote 14 hours 56 min ago:
My education didn't touch upon it but I've been grilled on it
multiple times in interviews.
I learned about them after the first time I got grilled and rejected.
Sucks to be the first company that grilled me about it, thanks for
the tip though, you just didn't stick around long enough to see how
fast I learn
benmanns wrote 16 hours 8 min ago:
They were only touched on (and just barely) in my CS education, so
don't feel too left out. Spend an evening or two on the Wiki for
Probabilistic data structures[0]. With a CS education you should have
the baseline knowledge to find them really fascinating. Enjoy!
Oh, and I don't find myself actually implementing any of these very
often
or knowing that they are in use. I occasionally use things like
APPROX_COUNT_DISTINCT in Snowflake[1], which is a HyperLogLog (linked
in the Wiki).
HTML [1]: https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...
HTML [2]: https://docs.snowflake.com/en/sql-reference/functions/approx...
vlmutolo wrote 17 hours 19 min ago:
This article is a little confusing. I think this is a roundabout way to
invent the blocked bloom filter with k=2 bits inserted per element.
It seems like the authors wanted to use a single hash for performance
(?). Maybe they correctly determined that naive Bloom filters have poor
cache locality and reinvented block bloom filters from there.
Overall, I think block bloom filters should be the default most people
reach for. They completely solve the cache locality issues (single
cache miss per element lookup), and they sacrifice only like a 10-15%
space increase to do it. I had a simple implementation running at
something like 20ns per query with maybe k=9. It would be about 9x that
for naive Bloom filters.
There's some discussion in the article about using a single hash to
come up with various indexing locations, but it's simpler to just
think of block bloom filters as:
1. Hash-0 gets you the block index
2. Hash-1 through hash-k get you the bits inside the block
If your implementation slices up a single hash to divide it into
multiple smaller hashes, that's fine.
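The two-step scheme above can be sketched as follows (a Python toy, not the article's code; 512-bit blocks model one cache line, and one wide hash is sliced into a block index plus k bit positions):

```python
import hashlib

BLOCK_BITS = 512  # one 64-byte cache line per block

class BlockBloomFilter:
    """Toy block bloom filter: hash-0 picks the block, hash-1..k the bits."""

    def __init__(self, num_blocks, k=9):
        self.num_blocks = num_blocks
        self.k = k
        self.blocks = [0] * num_blocks  # each int models one 512-bit block

    def _slot(self, key):
        # One wide hash, sliced into sub-hashes.
        h = int.from_bytes(
            hashlib.blake2b(key.encode(), digest_size=16).digest(), "big")
        block = h % self.num_blocks   # "hash-0": the block index
        h //= self.num_blocks
        mask = 0
        for _ in range(self.k):       # "hash-1" .. "hash-k": bit positions
            mask |= 1 << (h % BLOCK_BITS)
            h >>= 9                   # consume 9 bits (2**9 == BLOCK_BITS)
        return block, mask

    def add(self, key):
        block, mask = self._slot(key)
        self.blocks[block] |= mask

    def might_contain(self, key):
        block, mask = self._slot(key)
        return self.blocks[block] & mask == mask
```

All k probes land in one block, so a lookup costs at most one cache miss regardless of k.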
Sesse__ wrote 10 hours 56 min ago:
> Overall, I think block bloom filters should be the default most
people reach for.
I think this depends on how big your filters are. Most people think
of Bloom filters as having to have hundreds of thousands of elements,
but I frequently find them useful all the way down to 32 bits (!).
(E.g., there are papers showing chained hash tables where each bucket
has a co-sited tiny Bloom filter to check if it's worth probing the
chain.) In the "no man's land" in between, with a few tens of
thousands of buckets, the blocking seems to be mostly negative; it only
makes sense as long as you actually keep missing the cache.
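The per-bucket tiny-filter idea can be sketched like this (an illustrative Python toy, not from any of the papers mentioned; one 32-bit filter word per bucket, one bit per stored entry, used to skip walking the chain on misses):

```python
class TinyFilterTable:
    """Toy chained hash table with a 32-bit Bloom filter per bucket."""

    def __init__(self, num_buckets=64):
        self.filters = [0] * num_buckets              # one 32-bit word each
        self.chains = [[] for _ in range(num_buckets)]

    def _place(self, key):
        h = hash(key) & 0xFFFFFFFF
        # bucket from the low bits, filter bit from 5 other hash bits
        return h % len(self.chains), 1 << ((h >> 8) & 31)

    def insert(self, key, value):
        b, bit = self._place(key)
        self.filters[b] |= bit        # remember this key's fingerprint bit
        self.chains[b].append((key, value))

    def lookup(self, key):
        b, bit = self._place(key)
        if not self.filters[b] & bit:  # filter says "definitely not here"
            return None                # -> skip probing the chain entirely
        for k, v in self.chains[b]:
            if k == key:
                return v
        return None
```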
sakras wrote 16 hours 37 min ago:
Yeah, I kind of think the authors didn't conduct a thorough-enough
literature review here. There are well-known relations between the
number of hash functions you use and the FPR, cache-blocking and
register-blocking are classic techniques ("Cache-, Hash-, and
Space-Efficient Bloom Filters" by Putze et al.), and there are even
ways of generating patterns from only a single hash function that
work well (shamelessly shilling my own blogpost on the topic: [1] )
I also find the use of atomics to build the filter confusing here. If
you're doing a join, you're presumably doing a batch of hashes, so
it'd be much more efficient to partition your Bloom filter, lock the
partitions, and do a bulk insertion.
HTML [1]: https://save-buffer.github.io/bloom_filter.html
thomasmg wrote 15 hours 20 min ago:
Your blogpost is great! Except for one detail: you have used modulo
n. If n is not known at compile time, multiply+shift is much faster
[1]. Division and modulo (remainder) are slow, except on Apple
silicon (I don't know what they did there). BTW for blocked Bloom
filters, there are some SIMD variants that seem to be simpler than
yours [2] (maybe I'm wrong, I didn't look at the details, it just
seems yours uses more code). I implemented a register-based one in
Java here [3].
Bulk insertion: yes, if there are many keys, bulk insertion is
faster. For xor filters, I used radix sort before insertion [4] (I
should have documented the code better), but for fuse filters and
blocked Bloom filters it might not be worth it, unless if the
filter is huge.
HTML [1]: https://lemire.me/blog/2016/06/27/a-fast-alternative-to-th...
HTML [2]: https://github.com/FastFilter/fastfilter_cpp/blob/master/s...
HTML [3]: https://github.com/FastFilter/fastfilter_java/blob/master/...
HTML [4]: https://github.com/FastFilter/fastfilter_cpp/blob/master/s...
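The multiply+shift range reduction mentioned above looks like this (written in Python for clarity; in C++ it's a single widening multiply and shift on a uint32 hash):

```python
def fast_range(h32: int, n: int) -> int:
    # Lemire's multiply+shift: map a 32-bit hash to [0, n) with one
    # multiply and one shift instead of a slow division/modulo.
    # C++ equivalent: (uint32_t)(((uint64_t)h32 * n) >> 32)
    return (h32 * n) >> 32
```

This assumes the hash is uniform over the full 32 bits; it maps proportionally rather than by remainder, which is fine for filter indexing.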
pkoird wrote 17 hours 45 min ago:
Clever. My first impression was that surely this saturates the filter
too fast as we're setting more bits at once but looks like the maths
checks out. It's one of those non-intuitive things that I am glad I
learned today.
FreakLegion wrote 13 hours 38 min ago:
It works because the original filter has suboptimal settings. An
optimal filter of that size and number of items would set 5 bits per
item and have about a quarter of the false positive rate. The 2 bits
per item in the blocked filter is still suboptimal, but it's also
saving them from saturating a bunch of 32-bit blocks, at the cost of
a much higher overall false positive rate.
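A back-of-the-envelope check of these numbers using the standard idealized (non-blocked) Bloom filter formula, with the article's setup of n = 256K keys in an m = 2M-bit filter:

```python
import math

N, M = 256_000, 2_000_000  # the article's setup, per the comments above

def fpr(k, n=N, m=M):
    # idealized Bloom filter false positive rate: (1 - e^(-k*n/m))^k
    return (1.0 - math.exp(-k * n / m)) ** k

k_opt = round(M / N * math.log(2))  # optimal bits per key ~= (m/n) * ln 2
```

At roughly 8 bits per key this gives an optimum around k = 5, with a markedly lower false positive rate than either 1 or 2 bits per item, matching the comment's point.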
lemagedurage wrote 17 hours 29 min ago:
True, I had the same feeling. The article's numbers are based on 256K
elements in a bloom filter of 2M bits. After 1M elements, using 2 bits
actually increases the false positive rate, but at that point the false
positive rate is higher than 50% already.
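The crossover described above can be checked with the idealized formula (assuming the filter is m = 2M bits; just past the crossover, at around 1.25M elements, the 2-bit rate has already climbed above 50%):

```python
import math

M = 2_000_000  # filter size in bits, per the comment above

def fpr(k, n, m=M):
    # idealized false positive rate with k bits set per element
    return (1.0 - math.exp(-k * n / m)) ** k
```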
DIR <- back to front page