URI:
        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (unofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   Read Locks Are Not Your Friends
       
       
        alecco wrote 2 hours 30 min ago:
        It always amazes me how much work and struggle these folks will put
        in just to avoid reading basic database literature.
        
        Also, Fedor Pikus has some nice CppCon talks from years ago on all
        this. Very low level.
       
        sevensor wrote 3 hours 55 min ago:
        Does this apply also to std::shared_mutex in C++? This is a timely
        article if so; I’m in the middle of doing some C++ multithreading
        that relies on a shared_mutex. I have some measuring to do.
       
          gpderetta wrote 3 hours 40 min ago:
          mostly yes.
       
            sevensor wrote 2 hours 59 min ago:
            Thanks, that’s what I was afraid of. The ping ponging described
            in the article seems hard to avoid regardless of what language
            you’re using.
       
        api wrote 4 hours 9 min ago:
        Take a look at crates like arc_swap if you have a read often write
        rarely lock case. You can easily implement the RCU pattern. Just be
        sure to read about how to use RCU properly.
        
        Done well, this pattern gives you nearly free reads and cheap
        writes, sometimes cheaper than a lock.
        
        For frequent writes, a good RWLock is often better, since RCU can
        degrade rapidly and badly under write contention.
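        
        A minimal sketch of that read-mostly pattern with arc_swap; the
        Config type and the values here are made up for illustration:
        
            use arc_swap::ArcSwap;
            use std::sync::Arc;
        
            // Illustrative read-mostly value; stands in for whatever the
            // readers keep hitting (a config, a cached map, ...).
            struct Config {
                timeout_ms: u64,
            }
        
            fn main() {
                // Shared via an Arc or a static in real code; a local here
                // keeps the sketch short.
                let current = ArcSwap::from_pointee(Config { timeout_ms: 100 });
        
                // Reader: no lock taken. load() avoids touching shared
                // state on the fast path, so readers don't bounce a cache
                // line the way an RwLock's reader count does.
                let snapshot = current.load();
                assert_eq!(snapshot.timeout_ms, 100);
        
                // Rare writer: build a new value and publish it atomically.
                // Readers still holding the old Arc keep using it until
                // they drop it (the RCU-style grace period).
                current.store(Arc::new(Config { timeout_ms: 250 }));
                assert_eq!(current.load().timeout_ms, 250);
            }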
       
        ot wrote 4 hours 20 min ago:
        This is drawing broad conclusions from a specific RW mutex
        implementation. Other implementations adopt techniques to make the
        readers scale linearly in the read-mostly case by using per-core state
        (the drawback is that write locks need to scan it).
        
        One example is folly::SharedMutex, which is very battle-tested: [1]
        There are more sophisticated techniques such as RCU or hazard pointers
        that make synchronization overhead almost negligible for readers, but
        they generally require designing the algorithms around them and are not
        drop-in replacements for a simple mutex, so a good RW mutex
        implementation is a reasonable default.
        
  HTML  [1]: https://uvdn7.github.io/shared-mutex/
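        
        The per-core idea, very roughly sketched in Rust (a simplified
        illustration of the technique, not folly's implementation; the
        shard count, the SeqCst orderings and the spin loops are
        assumptions): readers only touch their own padded counter, and the
        writer pays by announcing itself and then scanning every shard.
        
            use std::sync::atomic::{AtomicU32, Ordering::SeqCst};
        
            const SHARDS: usize = 64; // assumed upper bound on cores
        
            // One reader count per shard, each on its own cache line, so
            // read_lock on different cores touches different lines.
            #[repr(align(64))]
            struct Shard(AtomicU32);
        
            struct PerCoreRwLock {
                shards: [Shard; SHARDS],
                writer: AtomicU32, // non-zero while a writer is active
            }
        
            impl PerCoreRwLock {
                fn new() -> Self {
                    Self {
                        shards: std::array::from_fn(|_| Shard(AtomicU32::new(0))),
                        writer: AtomicU32::new(0),
                    }
                }
        
                // `shard` would come from the current CPU id or a
                // thread-id hash in real code.
                fn read_lock(&self, shard: usize) {
                    let s = &self.shards[shard % SHARDS].0;
                    loop {
                        s.fetch_add(1, SeqCst);
                        if self.writer.load(SeqCst) == 0 {
                            return; // fast path: only this core's line touched
                        }
                        // A writer is active: undo, wait it out, retry.
                        s.fetch_sub(1, SeqCst);
                        while self.writer.load(SeqCst) != 0 {
                            std::hint::spin_loop();
                        }
                    }
                }
        
                fn read_unlock(&self, shard: usize) {
                    self.shards[shard % SHARDS].0.fetch_sub(1, SeqCst);
                }
        
                // The cost lands on the writer: announce, then wait for
                // every shard's reader count to drain.
                fn write_lock(&self) {
                    while self.writer.swap(1, SeqCst) != 0 {
                        std::hint::spin_loop();
                    }
                    for s in &self.shards {
                        while s.0.load(SeqCst) != 0 {
                            std::hint::spin_loop();
                        }
                    }
                }
        
                fn write_unlock(&self) {
                    self.writer.store(0, SeqCst);
                }
            }
        
        The trade-off is the one noted above: uncontended reads stay
        core-local, while write_lock does O(cores) work and must wait for
        every in-flight reader.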
       
          amluto wrote 54 min ago:
          Wow, folly::SharedMutex is quite an example of design tradeoffs. I
          wonder what application the authors wanted it for where using a
          global array was better than a per-mutex array.
       
          Jyaif wrote 3 hours 54 min ago:
          And a Rust equivalent of folly::SharedMutex:
          
  HTML    [1]: https://docs.rs/crossbeam-utils/latest/crossbeam_utils/sync/...
       
          mike_hearn wrote 4 hours 3 min ago:
          Right, and if you're on the JVM you have access to things like
          ConcurrentHashMap, which has lock-free reads.
       
          PaulHoule wrote 4 hours 7 min ago:
          I think it’s not unusual for reader-writer locks, even if well
          implemented, to end up in situations where so many readers are
          stacked up that writers never get a turn, or where one writer winds
          up holding up N readers, which does not scale well as you increase N.
       
        Retr0id wrote 4 hours 28 min ago:
        claudes love to talk about The Hardware Reality
       
          stuaxo wrote 4 hours 22 min ago:
          "The performance Death Spiral" was the point I realised I was being
          LLMd and bailed out.
       
        amluto wrote 4 hours 30 min ago:
        The code examples are confusing. They show the code that takes the
        locks, but they don’t show any of the data structures involved.  The
        rwlock variant clones the Arc (makes sense), but the mutex variant does
        not (is it hidden inside inner.get)?
        
        In any case, optimizing this well would require a lot more knowledge of
        what’s going on under the hood. What are the keys?  Can the entire
        map be split into several maps? Can a reader hold the rwlock across
        multiple lookups? Is a data structure using something like RCU an
        option?
       
        _dky wrote 4 hours 41 min ago:
        If the implementation is task based and each task always runs on the
        same virtual CPU (with the number of slots equal to the number of
        CPUs, i.e. the parallelism), I wonder if something like the below
        might help.
        
        The RW lock could be implemented as an array of length equal to the
        number of slots, with enough padding that each slot sits in its own
        cache line (so reading or writing one slot does not invalidate the
        cache lines used by the others).
        
        For read lock: Each task acquires the lock for their slot.
        
        For write lock: Acquire the locks from the leftmost slot to the
        right. Writes can starve readers: a writer that blocks on an
        in-flight reader at one slot is already holding every slot to its
        left while it waits.
        
        I do not know how Rust RW locks are implemented.
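        
        Something like this sketch of that layout, assuming each task
        passes in its own slot index (the slot count, the 64-byte alignment
        and the Mutex-per-slot choice are illustrative guesses):
        
            use std::sync::{Mutex, MutexGuard};
        
            const SLOTS: usize = 8; // assume 8 virtual CPUs / task slots
        
            // Pad each slot's lock onto its own cache line so tasks on
            // different CPUs never bounce the same line.
            #[repr(align(64))]
            struct PaddedLock(Mutex<()>);
        
            struct SlottedRwLock {
                slots: [PaddedLock; SLOTS],
            }
        
            impl SlottedRwLock {
                fn new() -> Self {
                    Self {
                        slots: std::array::from_fn(|_| PaddedLock(Mutex::new(()))),
                    }
                }
        
                // Read lock: a task only takes the lock for its own slot,
                // so readers on different slots never contend.
                fn read(&self, slot: usize) -> MutexGuard<'_, ()> {
                    self.slots[slot % SLOTS].0.lock().unwrap()
                }
        
                // Write lock: take every slot, leftmost to rightmost, and
                // hold all the guards until the write is done.
                fn write(&self) -> Vec<MutexGuard<'_, ()>> {
                    self.slots.iter().map(|s| s.0.lock().unwrap()).collect()
                }
            }
        
        One caveat: if two tasks ever share a slot they exclude each other
        even for reads, so this leans entirely on the one-task-per-slot
        assumption.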
       
        whizzter wrote 4 hours 50 min ago:
        I'd be super interested in how this compares between CPU
        architectures. Is there an optimization in Apple silicon that makes
        this bad while it'd fly on Intel/AMD CPUs?
       
          PunchyHamster wrote 4 hours 2 min ago:
          A read lock requires communication between cores. It just can't
          scale with CPU count.
       
          gpderetta wrote 4 hours 11 min ago:
          the behaviour is quite typical for any MESI style cache coherence
          system (i.e. most if not all of them).
          
          A specific microarchitecture might alleviate this a bit with lower
          latency cross-core communication, but the solution (using a single
          naive RW lock to protect the cache) is inherently non-scalable.
       
          hansvm wrote 4 hours 40 min ago:
          I've observed the same behavior on AMD and Intel at $WORK. Our
          solution (ideal for us, reads happening roughly 1B times more often
          than writes) was to pessimize writes in favour of reads and add some
          per-thread state to prevent cache line sharing.
          
          We also tossed in an A/B system, so reads aren't delayed even while
          writes are happening; they just get stale data (also fine for our
          purposes).
       
            the_duke wrote 3 hours 54 min ago:
            Rust has an interesting crate for this, arc-swap [1].
            
            It's essentially just an atomic pointer that can be swapped out.
            
  HTML      [1]: https://docs.rs/arc-swap/latest/arc_swap/
       
       
   DIR <- back to front page